DSCI 5350 - Lecture 3
Kashif Saeed
Lecture Outline
What is Sqoop?
How does Sqoop work?
Sqoop – under the hood
Sqoop Syntax
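The syntax example on this slide did not survive the PDF export. In general, every invocation follows the pattern sqoop TOOL [tool-arguments]; a minimal import sketch (the hostname, the training/training credentials, and the table name are assumed placeholders, not from the slide):

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts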
• The Cloudera instance comes configured with a MySQL instance
• The MySQL instance has a database called 'Loudacre', which we will use for our activities
• The following command will list all tables in the Loudacre database
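The command itself did not survive the PDF export. As a minimal sketch (connection details assumed as above):

sqoop list-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training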
Import an Entire Database
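The example for this slide was lost in the export. A minimal sketch using Sqoop's import-all-tables tool, which imports every table in the database (connection details assumed as before):

sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training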
• Use the --warehouse-dir option to specify a different base directory
• Using the --warehouse-dir option will create one sub-directory for each table under the directory provided, as sketched below
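A sketch of the full-database import with a base directory specified (the /loudacre path is an assumed example):

sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--warehouse-dir /loudacre

Each table then lands in its own sub-directory, e.g. /loudacre/accounts.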
• Note that the tab sequence (\t) is quoted using double quotes, because double quotes work correctly both on the command line and in Oozie, whereas single quotes work only on the command line (see the sketch below)
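A sketch of an import that overrides the field delimiter as described above (table and connection details are assumed placeholders):

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--fields-terminated-by "\t"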
Incremental Imports
• Sqoop's --incremental append mode imports only new records
• New records are identified based on the value of the last record in a specified check column, as sketched below
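A sketch of an append-mode incremental import; the check column (id) and last imported value (100) are assumptions for illustration:

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--incremental append \
--check-column id \
--last-value 100

Only rows with id greater than 100 are imported; Sqoop reports the new high-water mark at the end of the run for use in the next import.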
Importing Partial Tables
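The examples for this slide were lost in the export. Partial imports are typically done with --columns and/or --where; a sketch with assumed column names and filter:

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--columns "id,first_name,city" \
--where "state = 'CA'"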
Importing Based on a Free-Form Query
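The example for this slide was lost in the export. A free-form query replaces --table entirely; Sqoop requires the literal $CONDITIONS token in the WHERE clause (each mapper substitutes its own split predicate), plus an explicit --target-dir and a split column. A sketch with assumed names:

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query 'SELECT * FROM accounts WHERE $CONDITIONS' \
--split-by id \
--target-dir /loudacre/accounts

The single quotes keep the shell from expanding $CONDITIONS before Sqoop sees it.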
Free-Form Query with WHERE Criteria
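Extending the same sketch with user criteria ANDed to $CONDITIONS; with double quotes (as recommended earlier for Oozie compatibility) the dollar sign must be escaped from the shell:

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query "SELECT * FROM accounts WHERE state = 'CA' AND \$CONDITIONS" \
--split-by id \
--target-dir /loudacre/accounts_ca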
Database Connectivity Options
• JDBC (generic)
  - Compatible with all databases
  - Tends to be slower than the other options and can cause performance issues
• Direct mode
  - Uses database-specific utilities and results in better performance
  - Enabled with the --direct option (see the sketch below)
  - Currently supports MySQL and Postgres databases
  - Not all Sqoop features are available in direct mode
• High-performance Sqoop connectors
  - Cloudera and partners offer connectors for Teradata, Oracle, and Netezza
  - Available for download from Cloudera's website
  - Not open source, but free of cost
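As a sketch, direct mode is just an extra flag on an otherwise normal import (MySQL shown, since direct mode supports it; connection details assumed as before):

sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--direct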
Controlling Parallelism in Sqoop
• Sqoop assumes that each table has an evenly-distributed numeric primary key and uses that key to divide the work among the map tasks
  - You can use a different column via the --split-by option
  - It is important to choose a split column that has an index, so that each mapper does not have to scan the entire table
  - If the split column is not indexed, or if the table has no primary key, it is better to specify only one mapper so that the table is scanned once rather than once per mapper (four times by default), as sketched below
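A sketch of both options described above (acct_num is an assumed indexed column):

# indexed, evenly-distributed split column: spread the work across mappers
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--split-by acct_num \
--num-mappers 8

# no usable split column: scan the table once with a single mapper
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--num-mappers 1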
Limitations of Sqoop
Sqoop Resources
• Sqoop 2
www.tiny.cloudera.com/adcc05c
Exporting Data to an RDBMS
Problem:
Sqoop export creates one INSERT statement for each row, which can be extremely time-consuming for big tables.
Possible Solution:
• Use the --batch parameter in your sqoop command
  - Uses the JDBC batching feature
  - Works with almost all databases
Batch Export - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--batch
All or Nothing Export
Problem:
What if you want to ensure that all of the data gets updated in the target database? In case of a failure, you do not want partial data.
Solution:
• Use the --staging-table parameter in your sqoop command
• Sqoop will load the staging table first and then load the table specified in the --table parameter
• The load into the --table destination happens only after the staging table is completely loaded
• Both the staging and final tables should have the same metadata
All or Nothing Export - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--staging-table staging_cities
Updating an Existing Table
Problem:
You previously exported data from Hadoop to a table. You now have newer changes in the data that you'd like to apply as updates instead of an overwrite.
Solution:
• Use the --update-key parameter in your sqoop command, followed by the name of the column that can identify the changed rows
• Sqoop will issue an UPDATE query instead of an INSERT query to the RDBMS table
• The challenge with only updating is that it does not capture new rows added on the source (Hadoop) side
Update - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id
Updating and Inserting at the Same Time
Solution:
• Use the --update-mode allowinsert parameter in your sqoop command
• This works in conjunction with the --update-key parameter
• This is also called the upsert feature
• The upsert method does not delete rows, hence you cannot use it to sync the two systems
  - If your goal is to sync the table with the Hadoop data, you should use truncate and reload instead
Upsert - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id \
--update-mode allowinsert
Using Database Stored Procs to Insert
Problem:
Databases typically use stored procedures for bulk inserts instead of individual INSERT statements.
Solution:
• Use the --call parameter in your sqoop command, followed by the name of the stored procedure
• The stored procedure and the table structure must already exist in the database for this to work
• It is recommended to use a simple stored procedure (without dozens of transformations) with Hadoop
Stored Proc - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--call populate_cities
Exporting into a Subset of Columns
Problem:
The corresponding table in your database has more columns than the HDFS data, and you only want to export a subset of columns.
Solution:
• Use the --columns parameter in your sqoop command to specify which columns are present in your Hadoop data, and in what order
• By default, Sqoop assumes that your HDFS data contains the same number and ordering of columns as the table you're exporting into
Subset Export - code
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--columns country,city