
DSCI 5350 – Big Data Analytics

Lecture 3 – Sqoop: your interface to RDBMS

Kashif Saeed
1
Lecture Outline

• How to import tables from an RDBMS
• Controlling which columns and rows are imported
• Improving Sqoop performance
• Next gen Sqoop features
• Exporting data with Sqoop

2
What is Sqoop?

• Sqoop is an open-source Apache project that was originally developed by Cloudera
• The name ‘Sqoop’ comes from ‘SQL to Hadoop’
• Sqoop allows data to be exchanged between RDBMS
and HDFS
Can import all tables, single table, or partial tables into HDFS
Data can be imported into variety of formats
Data can also be exported from HDFS to RDBMS

3
How does Sqoop work?

• Sqoop is a client-side application that imports data using MapReduce
• An import involves three steps:
1. The client gets table metadata from the RDBMS
2. The client creates and submits a job to the cluster
3. The job runs on the Hadoop cluster and pulls the data

4
Sqoop – under the hood

• Sqoop begins by examining the tables to be imported
Determines the primary key
Runs a boundary query to determine how many records are to be imported
Divides the results of the boundary query by the number of tasks (mappers)
• Sqoop generates a Java source file for each table to be imported
It compiles and uses this file during the import process
The file can be deleted after the import completes

5
Sqoop Syntax

• Sqoop is a command-line utility with several subcommands, called tools
There are tools for import, export, listing content, etc.
Run sqoop help to see a list of tools
Run sqoop help tool-name for help on a specific tool
• Uses JDBC to connect to databases
• Basic syntax is:

6
• The Cloudera instance comes configured with a MySQL instance
• The MySQL instance has a database called ‘Loudacre’ which we will use for our activities
• The following command will list all tables in the Loudacre database
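A sketch of that command; the hostname and credentials are placeholders for whatever the VM actually uses:

# list all tables in the loudacre database
sqoop list-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training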

7
Import an Entire Database

• The import-all-tables tool imports an entire database (see the example below)
Each table is stored as a comma-delimited file by default
The default base location is your HDFS home directory
Data will be in subdirectories corresponding to the name of each table
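For example, something along these lines (connection details are placeholders):

# import every table in the loudacre database into the HDFS home directory
sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training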

8
• Use the --warehouse-dir option to specify a different base directory
• Using the --warehouse-dir option will create one subdirectory for each table under the directory provided in the tool options (see the sketch below)
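A sketch of the --warehouse-dir variant; the HDFS path is an arbitrary example:

# each table lands in its own subdirectory under /loudacre
sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--warehouse-dir /loudacre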

• Note that there is also a --target-dir option, which places the data directly in the target directory without creating subdirectories
9
Importing a Single Table

• The import tool imports a single table
• Stored as a comma-delimited file (default)
• The following example imports the accounts table
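For example (connection details are placeholders):

# import the accounts table as comma-delimited files
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts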

• The following command creates a tab-delimited file
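A sketch using the --fields-terminated-by option:

# import the accounts table using tab as the field delimiter
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--fields-terminated-by "\t"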

10
• Note that the tab sequence (\t) is quoted using double quotes because double quotes work correctly both on the command line and in Oozie, whereas single quotes only work on the command line

11
Incremental Imports

• Sqoop’s --incremental lastmodified mode imports new and modified rows (see the sketch below)
• Based on a timestamp in a specified table column
• The database table must have a column to track additions or changes to records
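A sketch, assuming a hypothetical mod_dt timestamp column and a last-run value:

# import rows added or modified since the given timestamp
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--incremental lastmodified \
--check-column mod_dt \
--last-value "2017-01-01 00:00:00"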

12
• Sqoop’s --incremental append mode imports only new records (sketch below)
• Based on the value of the last record in a specified column
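A sketch, assuming a hypothetical numeric acct_num column whose highest previously imported value was 10000:

# import only rows whose acct_num is greater than 10000
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--incremental append \
--check-column acct_num \
--last-value 10000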

13
Importing Partial Tables

• The following command imports a subset of columns from the accounts table
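For example, restricting the import to a few columns (the column names are assumptions, not taken from the actual schema):

# import only selected columns from the accounts table
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--columns "acct_num,first_name,last_name"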

• The following command imports only matching rows from the accounts table
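For example, using a --where filter (the column and value are assumptions):

# import only the rows that satisfy the WHERE condition
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--where "state = 'CA'"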

14
Importing Based on a Free-Form Query

• Using the --query option, you can import the results of a query instead of a table (see the sketch below)
• When using the --query option:
You must include the literal WHERE $CONDITIONS
o It is only for Sqoop’s internal use
o Sqoop uses it to insert its range conditions for each task, e.g. task 1: rows 1-1000, task 2: rows 1001-2000, etc.
o WHERE $CONDITIONS does not include any of your own query conditions
You must provide the --target-dir option
o This is required because by default the target directory is named based on the table name. Since many queries can use the same table and one query can include multiple tables, the target directory must be explicitly specified to avoid ambiguity
15
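A sketch of a free-form query import; the query, split column, and target directory are illustrative only:

# $CONDITIONS is replaced by Sqoop with each task's range condition
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query 'SELECT acct_num, first_name, last_name FROM accounts WHERE $CONDITIONS' \
--split-by acct_num \
--target-dir /loudacre/accounts_query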
• In the example above, we use single quotes (' ') around the entire --query argument to prevent the UNIX shell from interpreting $CONDITIONS as a variable; we want Sqoop, rather than the UNIX shell, to interpret it
16
• The --split-by option explicitly provides Sqoop a column that will be used to split the results of the boundary query
By default, Sqoop will use the primary key column
More on the --split-by option later in this lecture

17
Free-Form Query with WHERE Criteria

• The --where option is ignored in a free-form query
• You must add your filter criteria using an AND following the WHERE $CONDITIONS clause
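A sketch showing filter criteria appended with AND after WHERE $CONDITIONS (the query and paths are illustrative):

# the filter is added with AND after the mandatory WHERE $CONDITIONS
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query 'SELECT acct_num, first_name, last_name FROM accounts WHERE $CONDITIONS AND acct_num > 10000' \
--split-by acct_num \
--target-dir /loudacre/accounts_filtered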

18
Database Connectivity Options

• JDBC (generic)
Compatible with all databases
JDBC tends to be slower than other options and can cause performance
issues
• Direct Mode
Uses database specific utilities and results in better performance
Used with --direct option
Currently supports MySQL and Postgres databases
Not all Sqoop features are available in direct mode
• High Performance Sqoop Connectors
Cloudera and partners offer connectors for Teradata, Oracle, and
Netezza
Available for download from Cloudera’s website
Not open source, but free of cost
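A sketch of a direct-mode import against MySQL; connection details are placeholders:

# --direct uses MySQL's own bulk tooling instead of generic JDBC for the transfer
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--direct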
19
Controlling Parallelism in Sqoop

• By default, Sqoop imports data using four parallel tasks (mappers)
Increasing the number of tasks might improve import speed
However, keep in mind that each task creates its own connection to your database server
• You can influence the number of tasks by using the -m option (see the example below)
Sqoop treats this only as a hint and might not honor it
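For example, requesting eight mappers instead of the default four (connection details are placeholders):

# hint to Sqoop that eight parallel tasks should be used
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
-m 8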

20
• Sqoop assumes that each table has an evenly-distributed numeric primary key and uses the primary key to divide up the work among the tasks
You can use a different column with the --split-by option (see the sketch below)
It is important to choose a split column that has an index, to avoid having each mapper scan the entire table
If the split column is not indexed, or if the table has no primary key, it is better to specify only one mapper so that the table is scanned once instead of four times
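A sketch using --split-by on an assumed indexed column; acct_num is a hypothetical column name:

# split the work on an indexed column instead of the primary key;
# if no suitable split column exists, use -m 1 so the table is scanned only once
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--split-by acct_num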

21
Limitations of Sqoop

• Sqoop is a stable tool and has been used in production for years
• However, the client-side architecture imposes some limitations:
Requires JDBC connectivity from the client to the RDBMS, which means JDBC installation and configuration on the client
Requires connectivity to the Hadoop cluster from the client
Requires users to specify the RDBMS username and password
It is difficult to integrate the command-line interface with other applications
• Sqoop is tightly coupled with JDBC and hence does not work well with NoSQL databases
22
Sqoop 2 Architecture

• The client-server design addresses the limitations described on the previous slide
• The client requires connectivity only to the Sqoop server
Database and Hadoop connections are established at the server level
The end user no longer needs database credentials
Centralized audit trail
The Sqoop server is accessible via CLI, Web UI, and REST API
All heavy lifting is done by the server
23
All data flow happens directly between RDBMS and the Hadoop Cluster

24
Sqoop Resources

• Sqoop User Guide
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html

• Apache Sqoop Cookbook
http://shop.oreilly.com/product/0636920029519.do

• Sqoop 2
www.tiny.cloudera.com/adcc05c

25
Exporting data to RDBMS

What’s the purpose of exporting data?
It is often necessary to export your Hadoop data to an RDBMS for easy querying
How does it work?
Since the data in the Hadoop cluster is stored as files, you need to create a table in the RDBMS and then export the data from Hadoop into that table
The Sqoop export tool is used for exporting data
By default, Sqoop transfers data to the relational database using INSERT statements
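A minimal export sketch, mirroring the code examples later in this lecture (database, table, and HDFS directory names are placeholders):

# export the HDFS files under the cities directory into the cities table
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities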
26
Sqoop Export – under the hood

Internal mechanics of export – step by step

1. Connect to the database and get the table metadata
2. Using the metadata, generate a Java class file to be used by the MapReduce job
3. As with import, the data transfer does not happen through the client
Target table:
Is identified by the --table parameter
Must be created before the sqoop export runs
Must not have any primary key constraints (so that you can export the same values multiple times if needed)
o A primary key constraint will also slow down the export process
Must be created with proper data types and column lengths to handle the data, or you’ll get an error from Sqoop
27
Batch Export

Problem:
Sqoop export creates one insert statement for each row.
This can be extremely time consuming for big tables.

Possible Solutions:
• Use the --batch parameter in your sqoop command
Uses the JDBC batching feature
Works with almost all databases

28
Batch Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--batch

29
All or Nothing Export

Problem:
What if you want to ensure that all of the data gets updated in the target database? In case of a failure, you do not want partial data.

Solution:
• Use the --staging-table parameter in your sqoop command
• Sqoop will load the staging table first and then load the table specified in the --table parameter
• The load to the --table destination will only happen once the staging table is completely loaded
• Both staging and final tables should have the same metadata
30
All or Nothing Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--staging-table staging_cities

31
Updating an existing Table

Problem:
You previously exported data from Hadoop to a table. You now have newer changes in the data that you’d like to apply as updates instead of an overwrite.

Solution:
• Use the --update-key parameter in your sqoop command, followed by the name of the column that identifies the changed rows
• Sqoop will issue UPDATE statements instead of INSERT statements against the RDBMS table
• The challenge with updating only is that it does not capture new rows added on the source (Hadoop) side
32
Update - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id

33
Updating and Inserting at the same time

Solution:
• Use the --update-mode allowinsert parameter in your sqoop command
• This works in conjunction with the --update-key parameter
• This is also called the upsert feature
• The upsert method does not delete rows, so you cannot use it to sync the two systems
If your goal is to sync the table with the Hadoop data, you should use truncate and reload instead

34
Upsert - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id \
--update-mode allowinsert

35
Using Database Stored Procs to Insert

Problem:
Databases typically use stored procedures for bulk inserts instead of individual INSERT statements.

Solution:
• Use the --call parameter in your sqoop command, followed by the name of the stored procedure
• The stored procedure and the table structure must exist in the database for this to work
• It is recommended to use a simple stored procedure (without dozens of transformations) with Hadoop
36
Stored Proc - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--call populate_cities

37
Exporting into a Subset of Columns

Problem:
The corresponding table in your database has more columns than the HDFS data, and you only want to export a subset of columns.

Solution:
• Use the --columns parameter in your sqoop command to specify which columns are present in your Hadoop data, and in what order
• By default, Sqoop assumes that your HDFS data contains the same number and ordering of columns as the table you’re exporting into
38
Subset Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--columns country,city

39
