Sqoop – Advanced Options
2015
Contents
1 What is Sqoop?
2 Import and Export data using Sqoop
3 Import and Export command in Sqoop
4 Saved Jobs in Sqoop
5 Option File
6 Important Sqoop Options
What is Sqoop?
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and
structured data stores such as relational databases.
Import and Export using Sqoop
The import command in Sqoop transfers the data from RDBMS to HDFS/Hive/HBase.
The export command in Sqoop transfers the data from HDFS/Hive/HBase back to
RDBMS.
Import command in Sqoop
The command to import data into Hive :
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name --hive-import -m 1
The command to import data into HDFS :
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name -m 1
The command to import data into HBase :
sqoop import --connect <connect-string>/dbname --username root -P
--table table_name --hbase-table table_name
--column-family col_fam_name --hbase-row-key row_key_name --hbase-create-table -m 1
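For a concrete illustration, here is a minimal sketch of an HDFS import; the host, database, table, and directory names are hypothetical placeholders rather than values from the slides:
# Import the (hypothetical) "orders" table from MySQL into HDFS, prompting for
# the password (-P) and writing comma-delimited files under /user/demo/orders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --target-dir /user/demo/orders \
  --fields-terminated-by ',' \
  -m 1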
Export command in Sqoop
The command to export data from Hive to RDBMS (--export-dir points at the table's warehouse directory) :
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_hive_table_dir>
The command to export data from HDFS to RDBMS :
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_export_dir>
Limitations of the Import and Export commands:
- Import and Export commands are convenient when data has to be transferred between an RDBMS and
HDFS/Hive/HBase only a limited number of times.
So what if the import and export commands have to be executed several times a day?
In such situations a Saved Sqoop Job can save your time.
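As a concrete (hypothetical) example, the sketch below pushes comma-separated records from an HDFS directory back into a MySQL table; all names are illustrative, and the target table must already exist in the database:
# Export comma-separated rows from HDFS into the (hypothetical) MySQL table
# "orders_summary"; --input-fields-terminated-by describes the HDFS file format.
sqoop export \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders_summary \
  --export-dir /user/demo/orders_summary \
  --input-fields-terminated-by ',' \
  -m 1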
Saved Jobs in Sqoop
A saved Sqoop job remembers the parameters used to define it, so the same transfer can be
re-executed simply by invoking the job, as many times as needed.
The following command creates a saved job:
The command above only defines a job with the name you specify; nothing is imported yet.
The job is added to your saved-job list and can be executed later.
The following command executes a saved job:
sqoop job --create job_name -- import --connect <connect-string>/dbname --table table_name
sqoop job --exec job_name -- --username uname -P
Sample Saved Job
sqoop job --create JOB1
-- import --connect jdbc:mysql://192.168.56.1:3306/adventureworks
--username XXX
--password XXX
--table transactionhistory
--target-dir /user/cloudera/datasets/trans
-m 1
--columns "TransactionID,ProductId,TransactionDate"
--check-column TransactionDate
--incremental lastmodified
--last-value "2004-09-01 00:00:00"
Important Options in Saved Jobs in Sqoop
Sqoop option            Usage
--connect               Connection string for the source database
--table                 Source table name
--columns               Columns to be extracted
--username              User name for accessing the source table
--password              Password for accessing the source table
--check-column          Column examined when determining which rows to import
--incremental           How Sqoop determines which rows are new (append or lastmodified mode)
--last-value            Maximum value of the check column from the previous import. For the
                        first execution of the job the value is supplied by the user and rows
                        with a check-column value greater than it are imported; afterwards the
                        saved job updates it automatically.
--target-dir            Target HDFS directory
-m / --num-mappers      Number of mapper tasks
--compress              Apply compression while loading data into the target
--fields-terminated-by  Field separator used in the output files
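To show how these options combine, here is a hedged sketch of an incremental, compressed import; the database, column, and path names are hypothetical:
# Append-mode incremental import: only rows whose id exceeds the stored
# last-value are imported; output is compressed and comma-delimited.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --columns "id,amount,created_at" \
  --check-column id \
  --incremental append \
  --last-value 0 \
  --target-dir /user/demo/orders_incr \
  --fields-terminated-by ',' \
  --compress \
  -m 4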
Sqoop Metastore
• A Sqoop metastore keeps track of all jobs.
• By default, the metastore is contained in your home directory under .sqoop and is
only used for your own jobs. If you want to share jobs, you would need to install a
JDBC-compliant database and use the --meta-connect argument to specify its
location when issuing job commands.
• Important Sqoop job commands:
• sqoop job --list : lists all jobs available in the metastore
• sqoop job --exec JOB1 : executes JOB1
• sqoop job --show JOB1 : displays the saved parameters of JOB1
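For the shared-metastore case, a hedged sketch follows; the host name is a placeholder and 16000 is the default metastore port, so adjust both to your installation:
# On one node, start the shared metastore service (listens on port 16000 by default):
sqoop metastore
# From any client, create and list jobs against the shared metastore:
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop \
  --create shared_job -- import --connect jdbc:mysql://dbhost:3306/salesdb --table orders
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list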
Option File
Certain arguments to the import and export commands and to saved jobs have to be written
every time you execute them.
What would be an alternative to this repetitive work?
For instance, the following arguments are used repeatedly in import and export
commands as well as in saved jobs (option.txt) :
import
--connect
jdbc:mysql://localhost/db_name
--username
uname
-P
• These arguments can be saved in a single text file, say option.txt, as shown above.
• While executing a command, just include this file with the --options-file argument.
• The following command shows the use of the --options-file argument:
sqoop --options-file <path_to_option_file> --table table_name
Option File
1. Each option and each value in the option file must be on its own line.
2. Options keep their exact command-line form inside the file: --connect is still written as --connect.
3. The same holds for all other arguments.
4. An option file is generally used when a large number of Sqoop jobs share a common set
of parameters (see the sketch after this list), such as:
1. Source RDBMS user ID and password
2. Source database URL
3. Field separator
4. Compression type
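A hedged sketch of such a shared options file and two jobs that reuse it; the file name, credentials, and paths are hypothetical:
# common-conn.txt : one option or value per line; lines starting with # are comments
import
--connect
jdbc:mysql://dbhost:3306/salesdb
--username
demo
-P
--compress
Reusing the shared arguments for two different tables:
sqoop --options-file /home/demo/common-conn.txt --table orders --target-dir /user/demo/orders
sqoop --options-file /home/demo/common-conn.txt --table customers --target-dir /user/demo/customers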
Sqoop Design Guidelines for Performance
1. Sqoop imports data in parallel from database sources. You can specify the number
of map tasks (parallel processes) used to perform the import with the -m argument.
Some databases may see improved performance by increasing this value to 8 or 16.
Do not increase the degree of parallelism beyond what is available in your
MapReduce cluster.
2. By default, the import process uses JDBC. Some databases can perform imports in a
more high-performance fashion by using database-specific data movement tools. For
example, MySQL provides the mysqldump tool, which can export data from MySQL to
other systems very quickly. By supplying the --direct argument, you specify that Sqoop
should attempt the direct import channel (both options are combined in the sketch
after this list).
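As a hedged illustration of both guidelines, assuming a MySQL source and that mysqldump is available on the worker nodes (connection details are placeholders):
# Parallel, direct-mode import from MySQL using 8 map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --target-dir /user/demo/orders_direct \
  --direct \
  -m 8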
Thank You