
DSCI 5350 – Big Data Analytics

Lecture 3 – Sqoop: your interface to RDBMS

Kashif Saeed
1
Lecture Outline

• How to import tables from an RDBMS
• Controlling which columns and rows are imported
• Improving Sqoop performance
• Next gen Sqoop features
• Exporting data with Sqoop

2
What is Sqoop?

• Sqoop is an open-source Apache project that was originally developed by Cloudera
• The name ‘Sqoop’ comes from ‘SQL to Hadoop’
• Sqoop allows data to be exchanged between RDBMS
and HDFS
Can import all tables, single table, or partial tables into HDFS
Data can be imported into variety of formats
Data can also be exported from HDFS to RDBMS

3
How does Sqoop work?

• Sqoop is a client-side application that imports data using MapReduce
• An import involves three steps:
1. The client gets table metadata from the RDBMS
2. The client creates and submits a job to the cluster
3. The job runs on the Hadoop cluster and pulls the data

4
Sqoop – under the hood

• Sqoop begins by examining the tables to be imported
Determines the primary key
Runs a boundary query to determine how many records are to be imported
Divides the results of the boundary query by the number of tasks (mappers)
• Sqoop generates a Java source file for each table to be imported
It compiles and uses this file during the import process
The file can be deleted after the import completes

5
Sqoop Syntax

• Sqoop is a command-line utility with several subcommands, called tools
There are tools for import, export, listing content, etc.
Run sqoop help to see a list of tools
Run sqoop help tool-name for help on a specific tool
• Uses JDBC to connect to databases
• Basic syntax is:

6
• The Cloudera instance comes configured with a MySQL instance
• The MySQL instance has a database called ‘Loudacre’ which we will use for our activities
• The following command will list all tables in the Loudacre database
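A sketch of that command; the hostname and credentials are placeholders for whatever the VM actually uses:

# list all tables in the loudacre database
sqoop list-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training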

7
Import an Entire Database

• The import-all-tables tool imports an entire database (see the example below)
Each table is stored as a comma-delimited file by default
The default base location is your HDFS home directory
Data will be in subdirectories corresponding to the name of each table
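For example, something along these lines (connection details are placeholders):

# import every table in the loudacre database into the HDFS home directory
sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training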

8
• Use the --warehouse-dir option to specify a different base directory
• Using the --warehouse-dir option will create one subdirectory for each table under the directory provided in the tool options (see the sketch below)
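A sketch of the --warehouse-dir variant; the HDFS path is an arbitrary example:

# each table lands in its own subdirectory under /loudacre
sqoop import-all-tables \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--warehouse-dir /loudacre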

• Note that there is also a --target-dir option, which places the data directly in the target directory without creating subdirectories
9
Importing a Single Table

• The import tool imports a single table
• Stored as a comma-delimited file (default)
• The following example imports the accounts table
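For example (connection details are placeholders):

# import the accounts table as comma-delimited files
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts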

• The following command creates a tab-delimited file
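A sketch using the --fields-terminated-by option:

# import the accounts table using tab as the field delimiter
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--fields-terminated-by "\t"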

10
• Note that the tab sequence (\t) is quoted using double quotes because double quotes work correctly both on the command line and in Oozie, whereas single quotes only work on the command line

11
Incremental Imports

• Sqoop’s --incremental lastmodified mode imports new and modified rows (see the sketch below)
• Based on a timestamp in a specified table column
• The database table must have a column to track additions or changes to records
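A sketch, assuming a hypothetical mod_dt timestamp column and a last-run value:

# import rows added or modified since the given timestamp
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--incremental lastmodified \
--check-column mod_dt \
--last-value "2017-01-01 00:00:00"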

12
• Sqoop’s --incremental append mode imports only new records (sketch below)
• Based on the value of the last record in a specified column
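A sketch, assuming a hypothetical numeric acct_num column whose highest previously imported value was 10000:

# import only rows whose acct_num is greater than 10000
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--incremental append \
--check-column acct_num \
--last-value 10000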

13
Importing Partial Tables

• The following command imports a subset of columns from the accounts table
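For example, restricting the import to a few columns (the column names are assumptions, not taken from the actual schema):

# import only selected columns from the accounts table
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--columns "acct_num,first_name,last_name"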

• The following command imports only matching rows from the accounts table
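For example, using a --where filter (the column and value are assumptions):

# import only the rows that satisfy the WHERE condition
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--where "state = 'CA'"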

14
Importing Based on a Free-Form Query

• Using the --query option, you can import the results of a query instead of a table (see the sketch below)
• When using the --query option:
You must include the literal WHERE $CONDITIONS
o It is only for Sqoop’s internal use
o Sqoop uses it to insert its range conditions for each task, e.g. task 1: rows 1-1000, task 2: rows 1001-2000, etc.
o WHERE $CONDITIONS does not include any of your own query conditions
You must provide the --target-dir option
o This is required because by default the target directory is named based on the table name. Since many queries can use the same table and one query can include multiple tables, the target directory must be explicitly specified to avoid ambiguity
15
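A sketch of a free-form query import; the query, split column, and target directory are illustrative only:

# $CONDITIONS is replaced by Sqoop with each task's range condition
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query 'SELECT acct_num, first_name, last_name FROM accounts WHERE $CONDITIONS' \
--split-by acct_num \
--target-dir /loudacre/accounts_query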
• In the example above, we use single quotes (' ') around the entire --query argument to prevent the UNIX shell from interpreting $CONDITIONS as a variable; we want Sqoop, rather than the UNIX shell, to interpret it
16
• The --split-by option explicitly provides Sqoop a column that will be used to split the results of the boundary query
By default, Sqoop will use the primary key column
More on the --split-by option later in this lecture

17
Free-Form Query with WHERE Criteria

• The --where option is ignored in a free-form query
• You must add your filter criteria using an AND following the WHERE $CONDITIONS clause
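A sketch showing filter criteria appended with AND after WHERE $CONDITIONS (the query and paths are illustrative):

# the filter is added with AND after the mandatory WHERE $CONDITIONS
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--query 'SELECT acct_num, first_name, last_name FROM accounts WHERE $CONDITIONS AND acct_num > 10000' \
--split-by acct_num \
--target-dir /loudacre/accounts_filtered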

18
Database Connectivity Options

• JDBC (generic)
Compatible with all databases
JDBC tends to be slower than other options and can cause performance
issues
• Direct Mode
Uses database specific utilities and results in better performance
Used with --direct option
Currently supports MySQL and Postgres databases
Not all Sqoop features are available in direct mode
• High Performance Sqoop Connectors
Cloudera and partners offer connectors for Teradata, Oracle, and
Netezza
Available for download from Cloudera’s website
Not open source, but free of cost
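A sketch of a direct-mode import against MySQL; connection details are placeholders:

# --direct uses MySQL's own bulk tooling instead of generic JDBC for the transfer
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--direct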
19
Controlling Parallelism in Sqoop

• By default, Sqoop imports data using four parallel tasks (mappers)
Increasing the number of tasks might improve import speed
However, keep in mind that each task creates its own connection to your database server
• You can influence the number of tasks by using the -m option (see the example below)
Sqoop treats this only as a hint and might not honor it
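For example, requesting eight mappers instead of the default four (connection details are placeholders):

# hint to Sqoop that eight parallel tasks should be used
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
-m 8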

20
• Sqoop assumes that each table has an evenly-distributed numeric primary key and uses the primary key to divide up the work among the tasks
You can use a different column with the --split-by option (see the sketch below)
It is important to choose a split column that has an index, to avoid having each mapper scan the entire table
If the split column is not indexed, or if the table has no primary key, it is better to specify only one mapper so that the table is scanned once instead of four times
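A sketch using --split-by on an assumed indexed column; acct_num is a hypothetical column name:

# split the work on an indexed column instead of the primary key;
# if no suitable split column exists, use -m 1 so the table is scanned only once
sqoop import \
--connect jdbc:mysql://localhost/loudacre \
--username training \
--password training \
--table accounts \
--split-by acct_num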

21
Limitations of Sqoop

• Sqoop is a stable tool and has been used in production for years
• However, the client-side architecture imposes some limitations:
Requires JDBC connectivity from the client to the RDBMS, which means JDBC installation and configuration on the client
Requires connectivity to the Hadoop cluster from the client
Requires users to specify the RDBMS username and password
It is difficult to integrate the command-line interface with other applications
• Sqoop is tightly coupled with JDBC and hence does not work well with NoSQL databases
22
Sqoop 2 Architecture

• The client-server design addresses the limitations described on the previous slide
• The client requires connectivity only to the Sqoop server
Database and Hadoop connections are established at the server level
The end user no longer needs database credentials
Centralized audit trail
The Sqoop server is accessible via CLI, Web UI, and REST API
All heavy lifting is done by the server
23
All data flow happens directly between RDBMS and the Hadoop Cluster

24
Sqoop Resources

• Sqoop User Guide
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html

• Apache Sqoop Cookbook
http://shop.oreilly.com/product/0636920029519.do

• Sqoop 2
www.tiny.cloudera.com/adcc05c

25
Exporting data to RDBMS

What’s the purpose of exporting data?
It is often necessary to export your Hadoop data to an RDBMS for easy querying
How does it work?
Since the data in the Hadoop cluster is stored as files, you need to create a table in the RDBMS and then export the data from Hadoop into that table
The Sqoop export tool is used for exporting data
By default, Sqoop transfers data to the relational database using INSERT statements
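A minimal export sketch, mirroring the code examples later in this lecture (database, table, and HDFS directory names are placeholders):

# export the HDFS files under the cities directory into the cities table
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities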
26
Sqoop Export – under the hood

Internal mechanics of export – step by step

1. Connect to the database and get the table metadata
2. Using the metadata, generate a Java class file to be used by the MapReduce job
3. As with import, the data transfer does not happen through the client
Target table:
Is identified by the --table parameter
Must be created before the sqoop export runs
Must not have any primary key constraints (so that you can export the same values multiple times if needed)
o A primary key constraint will also slow down the export process
Must be created with proper data types and column lengths to handle the data, or you’ll get an error from Sqoop
27
Batch Export

Problem:
Sqoop export creates one insert statement for each row.
This can be extremely time consuming for big tables.

Possible Solutions:
• Use the --batch parameter in your sqoop command
Uses the JDBC batching feature
Works with almost all databases

28
Batch Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--batch

29
All or Nothing Export

Problem:
What if you want to ensure that all of the data gets updated in the target database? In case of a failure, you do not want partial data.

Solution:
• Use the --staging-table parameter in your sqoop command
• Sqoop will load the staging table first and then load the table specified in the --table parameter
• The load to the --table destination will only happen once the staging table is completely loaded
• Both staging and final tables should have the same metadata
30
All or Nothing Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--staging-table staging_cities

31
Updating an existing Table

Problem:
You previously exported data from Hadoop to a table. You now have newer changes in the data that you’d like to apply as updates instead of an overwrite.

Solution:
• Use the --update-key parameter in your sqoop command, followed by the name of the column that identifies the changed rows
• Sqoop will issue UPDATE statements instead of INSERT statements against the RDBMS table
• The challenge with updating only is that it does not capture new rows added on the source (Hadoop) side
32
Update - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id

33
Updating and Inserting at the same time

Solution:
• Use the --update-mode allowinsert parameter in your sqoop command
• This works in conjunction with the --update-key parameter
• This is also called the upsert feature
• The upsert method does not delete rows, so you cannot use it to sync the two systems
If your goal is to sync the table with the Hadoop data, you should use truncate and reload instead

34
Upsert - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--export-dir cities \
--update-key id \
--update-mode allowinsert

35
Using Database Stored Procs to Insert

Problem:
Databases typically use stored procedures for bulk inserts instead of individual INSERT statements.

Solution:
• Use the --call parameter in your sqoop command, followed by the name of the stored procedure
• The stored procedure and the table structure must exist in the database for this to work
• It is recommended to use a simple stored procedure (without dozens of transformations) with Hadoop
36
Stored Proc - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--call populate_cities

37
Exporting into a Subset of Columns

Problem:
The corresponding table in your database has more columns than the HDFS data, and you only want to export a subset of columns.

Solution:
• Use the --columns parameter in your sqoop command to specify which columns are present in your Hadoop data, and in what order
• By default, Sqoop assumes that your HDFS data contains the same number and ordering of columns as the table you’re exporting into
38
Subset Export - code

sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities \
--columns country,city

39
