Apache Sqoop

A Data Transfer Tool for Hadoop




         Arvind Prabhakar, Cloudera Inc. Sept 21, 2011
What is Sqoop?

● Allows easy import and export of data from structured
  data stores:
   ○ Relational Database
   ○ Enterprise Data Warehouse
   ○ NoSQL Datastore

● Allows easy integration with Hadoop based systems:
   ○ Hive
   ○ HBase
   ○ Oozie
Agenda

● Motivation

● Importing and exporting data using Sqoop

● Provisioning Hive Metastore

● Populating HBase tables

● Sqoop Connectors

● Current Status and Road Map
Motivation

● Structured data stored in Databases and EDWs is not easily
  accessible for analysis in Hadoop

● Access to Databases and EDW from Hadoop Clusters is
  problematic.

● Forcing MapReduce to access data from Databases/EDWs is
  repetitive, error-prone and non-trivial.

● Data preparation often required for efficient consumption
  by Hadoop based data pipelines. 

● Current methods of transferring data are inefficient and ad hoc.
Enter: Sqoop

    A tool to automate data transfer between structured     
    datastores and Hadoop.

Highlights

 ● Uses datastore metadata to infer structure definitions
 ● Uses MapReduce framework to transfer data in parallel
 ● Allows structure definitions to be provisioned in Hive
   metastore
 ● Provides an extension mechanism to incorporate high
   performance connectors for external systems. 
Importing Data

mysql> describe ORDERS;
+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| ORDER_NUMBER    | int(11)     | NO   | PRI | NULL    |       |
| ORDER_DATE      | datetime    | NO   |     | NULL    |       |
| REQUIRED_DATE   | datetime    | NO   |     | NULL    |       |
| SHIP_DATE       | datetime    | YES  |     | NULL    |       |
| STATUS          | varchar(15) | NO   |     | NULL    |       |
| COMMENTS        | text        | YES  |     | NULL    |       |
| CUSTOMER_NUMBER | int(11)     | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+
7 rows in set (0.00 sec)
Importing Data
$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password ****
 ...

INFO mapred.JobClient: Counters: 12
INFO mapred.JobClient:   Job Counters 
INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12873
...
INFO mapred.JobClient:     Launched map tasks=4
INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
INFO mapred.JobClient:   FileSystemCounters
INFO mapred.JobClient:     HDFS_BYTES_READ=505
INFO mapred.JobClient:     FILE_BYTES_WRITTEN=222848
INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=35098
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=326
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     Map output records=326
INFO mapred.JobClient:     SPLIT_RAW_BYTES=505
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.2754 seconds (3.0398 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
Importing Data

$ hadoop fs -ls
Found 32 items
....
drwxr-xr-x - arvind staff 0 2011-09-13 19:12 /user/arvind/ORDERS
....

$ hadoop fs -ls /user/arvind/ORDERS
Found 6 items
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_SUCCESS
... 0 2011-09-13 19:12 /user/arvind/ORDERS/_logs
... 8826 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00000
... 8760 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00001
... 8841 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00002
... 8671 2011-09-13 19:12 /user/arvind/ORDERS/part-m-00003
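
The four part-m-* files correspond to the four map tasks launched by the import. As a hedged sketch (the values shown are illustrative; --num-mappers and --split-by are standard Sqoop options, and by default Sqoop splits on the table's primary key), the degree of parallelism and the split column can be set explicitly:

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** 
  --num-mappers 8 --split-by ORDER_NUMBER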
Exporting Data

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS
...
INFO mapreduce.ExportJobBase: Transferred 34.7178 KB in 6.7482 seconds (5.1447 KB/sec)
INFO mapreduce.ExportJobBase: Exported 326 records.
$



  ● Default delimiters: ',' for fields, newlines for records
  ● An escape sequence can optionally be specified
  ● Delimiters can be specified for both import and export (see the
    sketch below)
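
A minimal sketch of overriding the defaults, assuming the same export as above and arbitrary delimiter values: on export the --input-* options describe how the HDFS files are parsed, while imports use --fields-terminated-by / --escaped-by to write them.

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS 
  --input-fields-terminated-by '\t' --input-escaped-by '\\'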
Exporting Data

Exports can optionally use Staging Tables

 ● Map tasks populate the staging table

 ● Each map write is broken down into many transactions

 ● The staging table is then used to populate the target table in a
   single transaction

 ● In case of failure, the staging table insulates the target table
   from data corruption (see the sketch below).
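
A minimal sketch of routing the earlier export through a staging table; --staging-table and --clear-staging-table are the relevant options, and ORDERS_STAGE is a hypothetical table that must already exist with the same schema as the target:

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS 
  --staging-table ORDERS_STAGE --clear-staging-table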
Importing Data into Hive

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** --hive-import
 ...

INFO mapred.JobClient: Counters: 12
INFO mapreduce.ImportJobBase: Transferred 34.2754 KB in 11.3995 seconds (3.0068 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 326 records.
INFO hive.HiveImport: Removing temporary files from import process: ORDERS/_logs
INFO hive.HiveImport: Loading uploaded data into Hive
...
WARN hive.TableDefWriter: Column ORDER_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column REQUIRED_DATE had to be cast to a less precise type in Hive
WARN hive.TableDefWriter: Column SHIP_DATE had to be cast to a less precise type in Hive
...
$
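
These warnings indicate that the datetime columns were mapped to Hive strings (visible in the describe output on the next slide). As a hedged sketch, the --map-column-hive option can override the default mapping per column; the target type below is an assumption and requires a Hive version that supports it:

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** --hive-import 
  --map-column-hive ORDER_DATE=timestamp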
Importing Data into Hive

$ hive
hive> show tables;
OK
...
orders
...
hive> describe orders;
OK
order_number int
order_date string
required_date string
ship_date string
status string
comments string
customer_number int
Time taken: 0.236 seconds
hive>
Importing Data into HBase

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** 
  --hbase-create-table --hbase-table ORDERS --column-family mysql
...
INFO mapreduce.HBaseImportJob: Creating missing HBase table ORDERS
...
INFO mapreduce.ImportJobBase: Retrieved 326 records.
$


  ● Sqoop creates the missing table if instructed
  ● If no row key is specified, the primary key column is used
    (the sketch below shows an explicit row key)
  ● Each output column is placed in the same column family
  ● Every record read results in an HBase put operation
  ● All values are converted to their string representation and
    inserted as UTF-8 bytes.
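
For illustration, a sketch of the same import with an explicit row key; --hbase-row-key is the relevant option, and ORDER_NUMBER simply matches the default choice since it is the source table's primary key:

$ bin/sqoop import --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS --username test --password **** 
  --hbase-create-table --hbase-table ORDERS 
  --column-family mysql --hbase-row-key ORDER_NUMBER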
Importing Data into HBase

hbase(main):001:0> list
TABLE 
ORDERS 
1 row(s) in 0.3650 seconds

hbase(main):002:0>  describe 'ORDERS'
DESCRIPTION                             ENABLED
{NAME => 'ORDERS', FAMILIES => [                true
 {NAME => 'mysql', BLOOMFILTER => 'NONE',
  REPLICATION_SCOPE => '0', COMPRESSION => 'NONE',
  VERSIONS => '3', TTL => '2147483647',
  BLOCKSIZE => '65536', IN_MEMORY => 'false',
  BLOCKCACHE => 'true'}]}
1 row(s) in 0.0310 seconds

hbase(main):003:0>
Importing Data into HBase

hbase(main):001:0> scan 'ORDERS', { LIMIT => 1 }
ROW COLUMN+CELL
10100 column=mysql:CUSTOMER_NUMBER,timestamp=1316036948264,
    value=363
10100 column=mysql:ORDER_DATE, timestamp=1316036948264,
    value=2003-01-06 00:00:00.0
10100 column=mysql:REQUIRED_DATE, timestamp=1316036948264,
    value=2003-01-13 00:00:00.0
10100 column=mysql:SHIP_DATE, timestamp=1316036948264,
    value=2003-01-10 00:00:00.0
10100 column=mysql:STATUS, timestamp=1316036948264,
    value=Shipped
1 row(s) in 0.0130 seconds

hbase(main):012:0>
Sqoop Connectors

● Connector Mechanism allows creation of new connectors
  that improve/augment Sqoop functionality.

● Bundled connectors include:
   ○ MySQL, PostgreSQL, Oracle, SQLServer, JDBC
   ○ Direct MySQL, Direct PostgreSQL

● Regular connectors are JDBC based.

● Direct Connectors use native tools for high-performance
  data transfer implementation.
Import using Direct MySQL Connector

$ sqoop import --connect jdbc:mysql://localhost/acmedb 
   --table ORDERS --username test --password **** --direct
...
manager.DirectMySQLManager: Beginning mysqldump fast path import
...

Direct import works as follows:
 ● Data is partitioned into splits using JDBC
 ● Map tasks use mysqldump to do the import with a conditional
   selection clause (-w 'ORDER_NUMBER > ...')
 ● Header and footer information is stripped out

Direct export similarly uses the mysqlimport utility.
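
For completeness, a hedged sketch of a direct-mode export based on the earlier export example; --direct is the only addition, and it switches the MySQL connector to the mysqlimport fast path:

$ sqoop export --connect jdbc:mysql://localhost/acmedb 
  --table ORDERS_CLEAN --username test --password **** 
  --export-dir /user/arvind/ORDERS --direct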
Third Party Connectors

● Oracle - Developed by Quest Software

● Couchbase - Developed by Couchbase

● Netezza - Developed by Cloudera

● Teradata - Developed by Cloudera

● Microsoft SQL Server - Developed by Microsoft

● Microsoft PDW - Developed by Microsoft

● Volt DB - Developed by VoltDB
Current Status

Sqoop is currently in Apache Incubator

  ● Status Page
     http://incubator.apache.org/projects/sqoop.html

  ● Mailing Lists
     sqoop-user@incubator.apache.org
     sqoop-dev@incubator.apache.org

  ● Release
     Current shipping version is 1.3.0
Hadoop World 2011


A gathering of Hadoop practitioners, developers,
business executives, industry luminaries and
innovative companies in the Hadoop ecosystem.

November 8-9, Sheraton New York Hotel & Towers, NYC
Learn more and register at www.hadoopworld.com

    ● Network: 1400 attendees, 25+ sponsors
    ● Learn: 60 sessions across 5 tracks for
         ○ Developers
         ○ IT Operations
         ○ Enterprise Architects
         ○ Data Scientists
         ○ Business Decision Makers
    ● Train: Cloudera training and certification (November 7, 10, 11)
Sqoop Meetup



      Monday, November 7, 2011, 8pm - 9pm

                       at

     Sheraton New York Hotel & Towers, NYC
Thank you!

   Q&A
