Sqoop – Advanced Options
2015
Contents
1 What is Sqoop?
2 Import and Export data using Sqoop
3 Import and Export command in Sqoop
4 Saved Jobs in Sqoop
5 Option File
6 Important Sqoop Options
What is Sqoop?
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and
structured data stores such as relational databases.
Import and Export using Sqoop
The import command in Sqoop transfers the data from RDBMS to HDFS/Hive/HBase.
The export command in Sqoop transfers the data from HDFS/Hive/HBase back to
RDBMS.
Import command in Sqoop
The command to import data into Hive :
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name --hive-import -m 1
The command to import data into HDFS :
sqoop import --connect <connect-string>/dbname --username uname -P
--table table_name -m 1
The command to import data into HBase :
sqoop import --connect <connect-string>/dbname --username root -P
--table table_name --hbase-table table_name
--column-family col_fam_name --hbase-row-key row_key_name --hbase-create-table -m 1
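For a concrete illustration, here is a minimal sketch of an HDFS import; the host, database, table, and directory names are hypothetical placeholders rather than values from the slides:
# Import the (hypothetical) "orders" table from MySQL into HDFS, prompting for
# the password (-P) and writing comma-delimited files under /user/demo/orders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --target-dir /user/demo/orders \
  --fields-terminated-by ',' \
  -m 1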
Export command in Sqoop
The command to export data from Hive to RDBMS (--export-dir points at the table's warehouse directory) :
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_hive_table_dir>
The command to export data from HDFS to RDBMS :
sqoop export --connect <connect-string>/db_name --table table_name -m 1
--export-dir <path_to_export_dir>
Limitations of the Import and Export commands:
- Import and Export commands are convenient when data has to be transferred between an RDBMS and
HDFS/Hive/HBase only a limited number of times.
So what if the import and export commands have to be executed several times a day?
In such situations a Saved Sqoop Job can save your time.
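As a concrete (hypothetical) example, the sketch below pushes comma-separated records from an HDFS directory back into a MySQL table; all names are illustrative, and the target table must already exist in the database:
# Export comma-separated rows from HDFS into the (hypothetical) MySQL table
# "orders_summary"; --input-fields-terminated-by describes the HDFS file format.
sqoop export \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders_summary \
  --export-dir /user/demo/orders_summary \
  --input-fields-terminated-by ',' \
  -m 1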
Saved Jobs in Sqoop
A saved Sqoop job remembers the parameters used to define it, so the same transfer can be
re-executed simply by invoking the job, as many times as needed.
The following command creates a saved job:
The command above only defines a job with the name you specify; nothing is imported yet.
The job is added to your saved-job list and can be executed later.
The following command executes a saved job:
sqoop job --create job_name -- import --connect <connect-string>/dbname --table table_name
sqoop job --exec job_name -- --username uname -P
Sample Saved Job
sqoop job --create JOB1
-- import --connect jdbc:mysql://192.168.56.1:3306/adventureworks
--username XXX
--password XXX
--table transactionhistory
--target-dir /user/cloudera/datasets/trans
-m 1
--columns "TransactionID,ProductId,TransactionDate"
--check-column TransactionDate
--incremental lastmodified
--last-value "2004-09-01 00:00:00"
Important Options in Saved Jobs in Sqoop
Sqoop option            Usage
--connect               Connection string for the source database
--table                 Source table name
--columns               Columns to be extracted
--username              User name for accessing the source table
--password              Password for accessing the source table
--check-column          Column examined when determining which rows to import
--incremental           How Sqoop determines which rows are new (append or lastmodified mode)
--last-value            Maximum value of the check column from the previous import. For the
                        first execution of the job the value is supplied by the user and rows
                        with a check-column value greater than it are imported; afterwards the
                        saved job updates it automatically.
--target-dir            Target HDFS directory
-m / --num-mappers      Number of mapper tasks
--compress              Apply compression while loading data into the target
--fields-terminated-by  Field separator used in the output files
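To show how these options combine, here is a hedged sketch of an incremental, compressed import; the database, column, and path names are hypothetical:
# Append-mode incremental import: only rows whose id exceeds the stored
# last-value are imported; output is compressed and comma-delimited.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --columns "id,amount,created_at" \
  --check-column id \
  --incremental append \
  --last-value 0 \
  --target-dir /user/demo/orders_incr \
  --fields-terminated-by ',' \
  --compress \
  -m 4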
Sqoop Metastore
• A Sqoop metastore keeps track of all jobs.
• By default, the metastore is contained in your home directory under .sqoop and is
only used for your own jobs. If you want to share jobs, you would need to install a
JDBC-compliant database and use the --meta-connect argument to specify its
location when issuing job commands.
• Important Sqoop job commands:
• sqoop job --list : lists all jobs available in the metastore
• sqoop job --exec JOB1 : executes JOB1
• sqoop job --show JOB1 : displays the saved parameters of JOB1
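For the shared-metastore case, a hedged sketch follows; the host name is a placeholder and 16000 is the default metastore port, so adjust both to your installation:
# On one node, start the shared metastore service (listens on port 16000 by default):
sqoop metastore
# From any client, create and list jobs against the shared metastore:
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop \
  --create shared_job -- import --connect jdbc:mysql://dbhost:3306/salesdb --table orders
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list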
Option File
Certain arguments to the import and export commands and to saved jobs have to be written
every time you execute them.
What would be an alternative to this repetitive work?
For instance, the following arguments are used repeatedly in import and export
commands as well as in saved jobs (option.txt) :
import
--connect
jdbc:mysql://localhost/db_name
--username
uname
-P
• These arguments can be saved in a single text file, say option.txt, as shown above.
• While executing a command, just include this file with the --options-file argument.
• The following command shows the use of the --options-file argument:
sqoop --options-file <path_to_option_file> --table table_name
Option File
1. Each option and each value in the option file must be on its own line.
2. Options keep their exact command-line form inside the file: --connect is still written as --connect.
3. The same holds for all other arguments.
4. An option file is generally used when a large number of Sqoop jobs share a common set
of parameters (see the sketch after this list), such as:
1. Source RDBMS user ID and password
2. Source database URL
3. Field separator
4. Compression type
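A hedged sketch of such a shared options file and two jobs that reuse it; the file name, credentials, and paths are hypothetical:
# common-conn.txt : one option or value per line; lines starting with # are comments
import
--connect
jdbc:mysql://dbhost:3306/salesdb
--username
demo
-P
--compress
Reusing the shared arguments for two different tables:
sqoop --options-file /home/demo/common-conn.txt --table orders --target-dir /user/demo/orders
sqoop --options-file /home/demo/common-conn.txt --table customers --target-dir /user/demo/customers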
Sqoop Design Guidelines for Performance
1. Sqoop imports data in parallel from database sources. You can specify the number
of map tasks (parallel processes) used to perform the import with the -m argument.
Some databases may see improved performance by increasing this value to 8 or 16.
Do not increase the degree of parallelism beyond what is available in your
MapReduce cluster.
2. By default, the import process uses JDBC. Some databases can perform imports in a
more high-performance fashion by using database-specific data movement tools. For
example, MySQL provides the mysqldump tool, which can export data from MySQL to
other systems very quickly. By supplying the --direct argument, you specify that Sqoop
should attempt the direct import channel (both options are combined in the sketch
after this list).
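As a hedged illustration of both guidelines, assuming a MySQL source and that mysqldump is available on the worker nodes (connection details are placeholders):
# Parallel, direct-mode import from MySQL using 8 map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username demo -P \
  --table orders \
  --target-dir /user/demo/orders_direct \
  --direct \
  -m 8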
Thank You