HIVE Data Migration
NHN Comico Data Laboratory
Bopyo Hong
2015-11-05
Overview
1. Concept
2. Export Data
from Old Hadoop Cluster
3. Data Copy
from Old Hadoop Cluster to New One
4. Import Data
to New Hadoop Cluster
5. Reference sites
Concept
Migration of Hive data from an old Hadoop cluster to a new one
Hive
(old version)
Hive
(new version)
Export Data from Old Hadoop Cluster - 1
1. Make a table list of the database you want to export
Hive does not support export/import at the database level, so tables must be listed and handled one by one
$ hive --database test -e 'SHOW TABLES' | sed -e '1d' > table_list.txt
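The `sed -e '1d'` at the end simply drops the first line of the piped output (e.g. a header row). A minimal illustration, using made-up sample input rather than real Hive output:

```shell
# sed -e '1d' deletes only the first line of its input; every following
# line passes through untouched.
# (The input lines here are made-up sample output, not real Hive output.)
printf 'tab_name\nusers\norders\n' | sed -e '1d'
```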
2. Make export statements that run partition by partition
If you export a whole table (without using partitions) and the table is very large, the export
will take too long, so I recommend exporting partition by partition.
In this example, the 2015-10-01 ~ 2015-10-31 partitions are exported
e.g. the partition column is dt
/user/hive/warehouse/test.db/tablename/dt=2015-10-01
/user/hive/warehouse/test.db/tablename/dt=2015-10-02
/user/hive/warehouse/test.db/tablename/dt=2015-10-03
……………………………………..
/user/hive/warehouse/test.db/tablename/dt=2015-10-31
$ source make_cmd_export_by_partition.sh 20151001 20151101 > export_201510.q
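The generated export_201510.q contains one EXPORT statement per table per day. A sketch of its contents, assuming a hypothetical table named `logs` in table_list.txt:

```sql
-- Sketch of the generated file for a hypothetical table "logs";
-- real table names come from table_list.txt.
export table logs PARTITION (dt='2015-10-01') to '/user/hadoop/temp_export_dir/logs/dt=2015-10-01';

export table logs PARTITION (dt='2015-10-02') to '/user/hadoop/temp_export_dir/logs/dt=2015-10-02';
```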
Export Data from Old Hadoop Cluster - 2
$ vi make_cmd_export_by_partition.sh
#!/bin/bash
startdate=$1
enddate=$2
rundate="$startdate"
# Read the table list once, outside the date loop
TBLS=(`cat table_list.txt`)
until [ "$rundate" == "$enddate" ]
do
    YEAR=${rundate:0:4}
    MONTH=${rundate:4:2}
    DAY=${rundate:6:2}
    DATE2=${YEAR}-${MONTH}-${DAY}
    for TBL in ${TBLS[@]}
    do
        echo "export table ${TBL} PARTITION (dt='$DATE2') to '/user/hadoop/temp_export_dir/${TBL}/dt=$DATE2';"
        echo
    done
    rundate=`date --date="$rundate +1 day" +%Y%m%d`
done
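The date arithmetic in the generator can be exercised on its own. A minimal sketch of the loop (GNU date assumed, as in the script) that prints the dashed dates for a short sample range:

```shell
# Walk YYYYMMDD dates with GNU date and print the dashed YYYY-MM-DD form
# used in the partition spec. Sample range: 2015-10-01 through 2015-10-03
# (the end date is exclusive, exactly as in the script above).
rundate=20151001
enddate=20151004
until [ "$rundate" == "$enddate" ]
do
    echo "${rundate:0:4}-${rundate:4:2}-${rundate:6:2}"
    rundate=`date --date="$rundate +1 day" +%Y%m%d`
done
```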
Export Data from Old Hadoop Cluster - 3
3. Export the test DB
$ source export_by_hive.sh export_201510.q
$ vi export_by_hive.sh
#!/bin/bash
echo "export start `date +'%Y-%m-%d %H:%M'`"
# Not all tables have the same partitions, so the option (-hiveconf hive.cli.errors.ignore=true) is needed to skip over missing ones
hive -hiveconf hive.cli.errors.ignore=true --database test -f $1
echo "export end `date +'%Y-%m-%d %H:%M'`"
Data Copy from Old Hadoop Cluster to New One
Copy the data using the distcp command.
You should run distcp on the new Hadoop cluster, reading from the old cluster over HFTP:
old cluster URL -> hftp://namenode_hostname:50070
new cluster URL -> hdfs://namenode_hostname:8020
$ source distcp_script.sh
$ vi distcp_script.sh
#!/bin/bash
echo "`date +'%Y-%m-%d %H:%M'` distcp start"
# The -pb option preserves block size; without it, errors can occur when the old and new clusters use different block sizes
hadoop distcp -pb hftp://old_host:50070/user/hadoop/temp_export_dir hdfs://new_host:8020/user/bopyo.hong/temp_import_dir
echo "`date +'%Y-%m-%d %H:%M'` distcp end"
Import Data to New Hadoop Cluster - 1
1. Make import statements that run partition by partition
$ source make_cmd_import_by_partition.sh 20151001 20151101 > import_201510.q
$ vi make_cmd_import_by_partition.sh
#!/bin/bash
TBLS=(`cat table_list.txt`)
startdate=$1
enddate=$2
rundate="$startdate"
until [ "$rundate" == "$enddate" ]
do
    YEAR=${rundate:0:4}
    MONTH=${rundate:4:2}
    DAY=${rundate:6:2}
    DATE2=${YEAR}-${MONTH}-${DAY}
    for TBL in ${TBLS[@]}
    do
        echo "import table ${TBL} PARTITION (dt='$DATE2') from '/user/bopyo.hong/temp_import_dir/${TBL}/dt=$DATE2';"
        echo
    done
    rundate=`date --date="$rundate +1 day" +%Y%m%d`
done
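The generated import_201510.q mirrors the export file; per the Hive ImportExport manual, IMPORT creates the table in the target database if it does not already exist. A sketch of the generated statements, again assuming a hypothetical table named `logs`:

```sql
-- Sketch of the generated file for a hypothetical table "logs";
-- real table names come from table_list.txt.
import table logs PARTITION (dt='2015-10-01') from '/user/bopyo.hong/temp_import_dir/logs/dt=2015-10-01';

import table logs PARTITION (dt='2015-10-02') from '/user/bopyo.hong/temp_import_dir/logs/dt=2015-10-02';
```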
Import Data to New Hadoop Cluster - 2
2. Import the tables of the test DB
$ source import_by_hiveql.sh import_201510.q
$ vi import_by_hiveql.sh
#!/bin/bash
echo "import start `date +'%Y-%m-%d %H:%M'`"
# Not all tables have the same partitions, so the option (-hiveconf hive.cli.errors.ignore=true) is needed to skip over missing ones
hive -hiveconf hive.cli.errors.ignore=true --database test -f $1
echo "import end `date +'%Y-%m-%d %H:%M'`"
Reference sites
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
http://kickstarthadoop.blogspot.jp/2012/08/how-to-migrate-hive-table-from-one-hive.html
https://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html#Command_Line_Options