2.3 - Best Practices - Native Hadoop Tool Sqoop
Version: 2.00
Date: XXX
Revision History

Date            Version   Description                                         Author
March 3, 2013   1.0       Initial draft                                       Sarang Patil
March 12, 2013  1.5       Removed other tools; added Sqoop for Teradata       Sarang Patil
April 2, 2013   2.0       Added interface to Teradata Aster                   Sarang Patil
April 5, 2013   2.2       For internal DI CoE review                          Sarang Patil
April 9, 2013   2.3       Review comments included                            Sarang Patil

Table 1: Revision History
Issues Documentation
The following Issues were defined during Design/Preparation.
Approvals
The undersigned acknowledge they have reviewed the high-level architecture design and agree with its contents.
Alex Tuabman    Data Integration Consultant    Version 2.2
Executive summary
Currently there are three proven, standard methods to interface Hadoop with Teradata, as well as with
Teradata Aster.
In a big data environment it is not recommended to write and read large volumes of data multiple times;
the flat file interface should be used only when none of the other options is available. The best practices
for these interfaces are documented in Best Practices for Teradata Tools and Utilities.
Detailed best practices for SQL-H are documented in the separate document Best Practices for Aster Data
Integration. Currently (Q2 2013) the SQL-H interface is available for the Teradata Aster platform; the
SQL-H interface for Teradata will be available in Q3 2013.
The scope of this document is to detail best practices for the native Hadoop tool Sqoop. The current
version of Sqoop is Sqoop 2.
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in
conjunction with other data that often resides in production databases across the enterprise. Loading
bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on
large clusters, can be a challenging task. Users must consider details such as ensuring consistency of the
data, the consumption of production system resources, and data preparation for provisioning downstream
pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data
residing on external systems from within MapReduce applications complicates those applications and
exposes the production system to the risk of excessive load originating from cluster nodes.
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache
Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases,
enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external
system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to
schedule and automate import and export tasks. Sqoop uses a connector-based architecture which
supports plugins that provide connectivity to new external systems.
What happens under the covers when you run Sqoop is very straightforward. The dataset being
transferred is sliced up into different partitions, and a map-only job is launched with individual mappers
responsible for transferring a slice of this dataset. Each record of the data is handled in a type-safe
manner since Sqoop uses the database metadata to infer the data types.
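For illustration, the number of map tasks and the column used to slice the dataset can be controlled explicitly on the import command. The connection values and column below are placeholders; --split-by and --num-mappers are standard Sqoop import options:
# --split-by chooses the column used to slice the data into partitions;
# --num-mappers sets how many parallel map tasks transfer those slices
$ sqoop import --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS \
  --split-by ORDER_ID --num-mappers 4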
In the rest of this document we will walk through examples that show the various ways you can use
Sqoop. The goal is to give an overview of Sqoop operation without going into much detail or
advanced functionality.
Ease of Use
Whereas Sqoop requires client-side installation and configuration, Sqoop 2 will be installed and
configured server-side. This means that connectors will be configured in one place, managed by the
Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database
connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a
Command Line Interface (CLI) and browser, and back-ended by a metadata repository. Moreover, Sqoop
2's service-level integration with Hive and HBase will be on the server side. Oozie will manage Sqoop
tasks through the REST API. This decouples Sqoop internals from Oozie, i.e., if you install a new Sqoop
connector you will not need to install it in Oozie as well.
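As a sketch of what this server-side model looks like from the client, a Sqoop 2 shell session typically starts by pointing the client at the server. The host name below is a placeholder, and the default port and exact commands may vary between Sqoop 2 releases:
bin/sqoop.sh client
# inside the Sqoop 2 shell: point the client at the server and check the server version
sqoop:000> set server --host sqoop2.example.com --port 12000 --webapp sqoop
sqoop:000> show version --all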
Ease of Extension
In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own
vocabulary; e.g., Couchbase no longer needs to specify a table name only to overload it as a backfill or
dump operation.
Common functionality will be abstracted out of connectors, holding them responsible only for data
transport. The reduce phase will implement common functionality, ensuring that connectors benefit
from future development of functionality.
Sqoop 2's interactive web-based UI will walk users through import/export setup, eliminating redundant
steps and omitting incorrect options. Connectors will be added in one place, with the connectors
exposing necessary options to the Sqoop framework. Thus, users will only need to provide information
relevant to their use-case.
With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more
predictable. In the same way, the user will not need to be aware of the functionality of all connectors.
As a result, connectors no longer need to provide downstream functionality, transformations, and
integration with other systems. Hence, the connector developer no longer has the burden of
understanding all the features that Sqoop supports.
Security
Currently, Sqoop operates as the user that runs the 'sqoop' command. The security principal used by a
Sqoop job is determined by the credentials the user has when launching Sqoop. Going forward,
Sqoop 2 will operate as a server-based application with support for securing access to external systems
via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code
generation, will not require direct access to Hive and HBase, and will not open up access for all clients to
execute jobs.
Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass
credentials, will be created once and then used many times for various import/export jobs. Connections
will be created by the Admin and used by the Operator, thus preventing credential abuse by the end
user. Furthermore, Connections can be restricted based on operation (import/export). By limiting the
total number of physical Connections open at one time, and with an option to disable Connections,
resources can be managed.
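A minimal sketch of this model in the Sqoop 2 shell is shown below; the connector id is a placeholder, and the exact command names may differ between Sqoop 2 releases:
# an Admin creates a reusable Connection against a registered connector (connector id 1 is a placeholder)
sqoop:000> create connection --cid 1
# Operators can then list and reference the stored Connections when defining import/export jobs
sqoop:000> show connection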
Sqoop ships with a help tool. To display a list of all available tools,
type the following command:
$ sqoop help
Available commands:
Sqoop ships as one binary package; however, it is composed of two separate parts - client and server.
You need to install the server on a single node in your cluster. This node will then serve as an entry point
for all connecting Sqoop clients. The server acts as a MapReduce client and therefore Hadoop must be
installed and configured on the machine hosting the Sqoop server. Clients can be installed on any number
of machines. The client does not act as a MapReduce client and thus you do not need to install Hadoop
on nodes that will act only as Sqoop clients.
Server installation
Copy the Sqoop artifact to the machine where you want to run the Sqoop server. This machine must have
Hadoop installed and configured. You do not need to run any Hadoop-related services there; however,
the machine must be able to act as a Hadoop client. You should be able to list the contents of HDFS, for example:
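A command such as the following should succeed (any HDFS path will do; this assumes the hadoop client is on the PATH):
hadoop fs -ls /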
The Sqoop server supports multiple Hadoop versions. However, as Hadoop major versions are not
compatible with each other, Sqoop has multiple binary artifacts - one for each supported major version
of Hadoop. You need to make sure that you are using the appropriate binary artifact for your specific
Hadoop version. To install the Sqoop server, decompress the appropriate distribution artifact in a location
of your convenience and change your working directory to that folder.
cd /usr/lib/sqoop
Installing Dependencies
You need to install the Hadoop libraries into the Sqoop server war file. Sqoop provides a convenience
script, addtowar.sh, to do so. If you have installed Hadoop in the usual location in /usr/lib and the
hadoop executable is in your path, you can use the automatic Hadoop installation procedure:
./bin/addtowar.sh -hadoop-auto
In case you have Hadoop installed in a different location, you will need to manually specify the Hadoop
version and the path to the Hadoop libraries. You can use the parameter -hadoop-version to specify the
Hadoop major version; versions 1.x and 2.x are currently supported. The path to the Hadoop libraries can be
specified using the -hadoop-path parameter. In case your Hadoop libraries are in multiple different
folders, you can specify all of them separated by :.
Lastly, you might need to install JDBC drivers that are not bundled with Sqoop because of incompatible
licenses. You can add any arbitrary Java jar file to the Sqoop server using the addtowar.sh script with the -jars
parameter. Similarly to the Hadoop path, you can enter multiple jars separated by :.
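For example (the version string, library locations, and driver jar names are placeholders; -hadoop-version, -hadoop-path, and -jars are the parameters described above):
# specify the Hadoop major version and library folders, separated by :
./bin/addtowar.sh -hadoop-version 2.0 -hadoop-path /opt/hadoop/lib:/opt/hadoop/client-lib
# add JDBC driver jars that are not bundled with Sqoop, for example the Teradata JDBC driver
./bin/addtowar.sh -jars /opt/jdbc/terajdbc4.jar:/opt/jdbc/tdgssconfig.jar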
Configuring Server
Before starting the server you should revise the configuration to match your specific environment. Server
configuration files are stored in the server/config directory of the distributed artifact, alongside other
configuration files of Tomcat.
The configuration file sqoop.properties contains the configuration properties that can affect the Sqoop
server. The file is very well documented, so check whether all configuration properties fit your
environment. The defaults, or very little tweaking, should be sufficient for most common cases.
After installation and configuration you can start the Sqoop server with the following command:
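The start command below follows the same sqoop.sh wrapper used for the client later in this document; the exact invocation may vary slightly between Sqoop 2 releases:
bin/sqoop.sh server start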
Client installation
The client does not need extra installation and configuration steps. Just copy the Sqoop distribution artifact
to the target machine and unzip it in the desired location. You can start the client with the following command:
bin/sqoop.sh client
Debugging information
The logs of the Tomcat server are located under the server/logs directory in the Sqoop 2 distribution
directory.
The logs of the Sqoop 2 server and the Derby repository are written to sqoop.log and derbyrepo.log,
respectively (by default, unless changed in the configuration described above), under the (LOGS) directory
in the Sqoop 2 distribution directory.
The following section describes the options to import data from an RDBMS into Hadoop HDFS, as well as
into higher-level constructs like Hive and HBase.
The following command is used to import all data from a table called ORDERS from a Teradata database:
$ sqoop import --connect jdbc:teradata://12.13.24.54/DATABASE=<<DATABASE NAME>> \
  --table <<TABLE NAME>> --username <<USERNAME>> --password <<PASSWORD>>
• --connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.
• --table <table name>: This parameter specifies the table which will be imported.
The import is done in two steps, as depicted in Figure 1 below. In the first step, Sqoop introspects the
database to gather the necessary metadata for the data being imported.
The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the
actual data transfer using the metadata captured in the previous step.
The imported data is saved in a directory on HDFS based on the table being imported. As is the case with
most aspects of Sqoop operation, the user can specify any alternative directory where the files should
be populated.
By default these files contain comma-delimited fields, with new lines separating different records. You
can easily override the format in which data is copied over by explicitly specifying the field separator and
record terminator characters.
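A sketch of these overrides (the directory and delimiter characters are arbitrary choices; --target-dir, --fields-terminated-by, and --lines-terminated-by are standard Sqoop import options):
# write to an explicit HDFS directory and override the default field and record delimiters
$ sqoop import --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS \
  --target-dir /data/orders \
  --fields-terminated-by '\t' --lines-terminated-by '\n'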
Sqoop also supports different data formats for importing data. For example, you can easily import data
in Avro data format by simply specifying the option --as-avrodatafile with the import command.
There are many other options that Sqoop provides which can be used to further tune the import
operation to suit your specific requirements.
Sqoop takes care of populating the Hive meta-store with the appropriate metadata for the table and
also invokes the necessary commands to load the table or partition as the case may be. All of this is
done by simply specifying the option --hive-import with the import command.
When you run a Hive import, Sqoop converts the data from the native data types of the external
datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported contains
new-line or other Hive delimiter characters, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.
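A sketch combining these options (connection values are placeholders; --hive-import loads the data into Hive, and --hive-drop-import-delims strips new-line and other Hive delimiter characters from string fields):
$ sqoop import --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS \
  --hive-import --hive-drop-import-delims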
Once the import is complete, you can see and operate on the table just like any other table in Hive.
Export is done in two steps as depicted in Figure 2. The first step is to introspect the database for
metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into
splits and then uses individual map tasks to push the splits to the database. Each map task performs this
transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Some connectors support staging tables that help isolate production tables from possible corruption in
case of job failures for any reason. Staging tables are first populated by the map tasks and then
merged into the target table once all of the data has been delivered.
In some cases data processed by Hadoop pipelines may be needed in production systems to help run
additional critical business functions. Sqoop can be used to export such data into external data stores as
necessary.
Continuing our example from above - if data generated by the pipeline on Hadoop corresponded to the
ORDERS table in a database somewhere, you could populate it using the following command:
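A sketch of such an export command is shown below (the HDFS directory is a placeholder; --export-dir points at the data produced on HDFS, and a --staging-table option can be added where the connector supports staging tables as described above):
$ sqoop export --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS \
  --export-dir /data/orders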
Using specialized connectors, Sqoop can connect with external systems that have optimized import and
export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s
extension framework and can be added to any existing Sqoop installation. Once a connector is installed,
Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the
connector.
By default Sqoop includes connectors for various popular databases such as Teradata, Teradata Aster,
MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and
PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch
tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be
used to connect to any database that is accessible via JDBC.
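Where no specialized connector is available, the generic JDBC connector can be used by supplying the JDBC driver class explicitly; the class name below is the standard Teradata JDBC driver, and the other values are placeholders:
# --driver forces the generic JDBC connector with the named driver class
$ sqoop import --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --driver com.teradata.jdbc.TeraDriver \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS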
Apart from the built-in connectors, many companies have developed their own connectors that can be
plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to
NoSQL datastores.
Sqoop 2 can transfer large datasets between Hadoop and external datastores such as relational
databases. Beyond this, Sqoop offers many advanced features such as different data formats,
compression, and working with queries instead of tables.
Operational Dos
• If you need to move big data, make it small first, and then move small data.
• Prepare the data model in advance to ensure that queries touch the least amount of data.
• Always create an empty export table.
• Do use the --escaped-by option during import and --input-escaped-by during export.
• Do use --fields-terminated-by during import and --input-fields-terminated-by during export.
• Do specify the direct mode option (--direct) if you use the direct connector.
• Develop some kind of incremental import when sqoop-ing in large tables (see the sketch after this list).
o If you do not, your Sqoop jobs will take longer and longer as the data grows.
• Compress data in HDFS (see the sketch after this list).
o You will save space on HDFS, as your replication factor makes multiple copies of your data.
o You will benefit in processing, as your MapReduce jobs have less data to read and Hadoop becomes less I/O bound.
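A sketch of the incremental import and compression options referenced in the list above (the check column, last value, and codec are illustrative choices; --incremental, --check-column, --last-value, --compress, and --compression-codec are standard Sqoop import options):
# append-mode incremental import: only fetch rows whose ORDER_ID exceeds the last recorded value,
# and compress the resulting files in HDFS
$ sqoop import --connect jdbc:teradata://<<HOST>>/DATABASE=<<DATABASE NAME>> \
  --username <<USERNAME>> --password <<PASSWORD>> --table ORDERS \
  --incremental append --check-column ORDER_ID --last-value 1000 \
  --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec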
Operational Don’ts
• Don’t use the same table for both import and export
• Don’t specify the query, if you use the direct connector
• Don’t have too many partitions same file that will be stored in HDFS
o This translates into time consuming map tasks, use partitioning if possible
o 1000 Partitions will perform better than 10,000 partitions
The following section describes how the data is transferred using the JDBC connection, including the
technical implementation of data pipes into and out of Teradata as well as HDFS.
The following section describes the use cases and examples of how to transfer the data. To be updated
once we have this information.
Summary
Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having
a web-application run Sqoop, which allows Sqoop to be installed once and used from anywhere.
In addition, having a REST API for operation and management will help Sqoop integrate better with
external systems such as Oozie.
Also, introducing a reduce phase allows connectors to be focused only on connectivity and ensures that
Sqoop functionality is uniformly available for all connectors. This facilitates ease of development of
connectors.