
Best Practices

Native Hadoop tool Sqoop

Version: 2.00

Date: XXX

Copyright © 2013 by Teradata. All Rights Reserved.


History / Issues / Approvals

Revision History
Date | Version | Description | Author
March 3, 2013 | 1.0 | Initial Draft | Sarang Patil
March 12, 2013 | 1.5 | Removed other tools and added Sqoop for Teradata | Sarang Patil
April 2, 2013 | 2.0 | Added interface to Teradata Aster | Sarang Patil
April 5, 2013 | 2.2 | For internal DI CoE review | Sarang Patil
April 9, 2013 | 2.3 | Review comments included | Sarang Patil
Table 1: Revision History

Issues Documentation
The following Issues were defined during Design/Preparation.

Raised By | Issue | Date Needed | Resolution/Answer | Date Completed | Resolved By
Sada Shiro / Deepak Manganil | We need to include some examples | | Section added. | |
Chris Ward | Needs to understand the internals of Sqoop | | Section added; no UDA connector available at this point | |
Table 2: Issues Documentation

Approvals
The undersigned acknowledge they have reviewed the high-level architecture design and agree with its contents.

Name | Role | Email Approval | Approved Version
Steve Fox | Sr. Architect – DI CoE | RE List of potential BP reviewers.msg | Version 2.2
Alex Tuabman | Data Integration Consultant | Approve Best Practice - Native Hadoop Tool Sqoop.msg | Version 2.2

Table 3: Document Signoff



Table of Contents

Executive summary
Apache Sqoop - Overview
    Ease of Use
    Ease of Extension
    Security
Apache Sqoop – help tool
Best practices for Sqoop Installation
    Server installation
    Installing Dependencies
    Configuring Server
    Server Life Cycle
    Client installation
    Debugging information
Best practices for importing data to Hadoop
    Importing data to HDFS
    Importing Data into Hive
    Importing Data into HBase
Best practices for exporting data from Hadoop
Best practices for NoSQL databases
Operational best practices
    Operational Dos
    Operational Don'ts
Sqoop Examples
    hdfs to Teradata/Teradata Aster
        Moving entire table to hdfs
        Moving entire table to hive
        Moving entire table to HBase
    Teradata/Teradata Aster to hdfs
        Moving entire table from hdfs to Teradata
        Moving entire table from hdfs to Teradata ASTER
Sqoop informational links

Summary



Table of Figures

Figure 1—Sqoop Architecture
Figure 2—Sqoop Import Job
Figure 3—Sqoop Export Job


Executive summary

Currently there are three proven standard methods for interfacing Hadoop with Teradata, as well as with Teradata Aster:

1. Using flat file interfaces
   a. Available for Teradata
   b. Available for Teradata Aster
2. Using the SQL-H interface
   a. Available for Teradata in Q3 of 2013
   b. Available for Teradata Aster
3. Using the Apache tool Sqoop
   a. Available for Teradata
   b. Available for Teradata Aster

In a big data environment it is not recommended to write and read big data multiple times; the flat file interface should be used only when none of the other options is available. The best practices for these interfaces are documented in Best Practices for Teradata Tools and Utilities.

Detailed best practices for SQL-H are documented in the separate document Best Practices for Aster Data Integration. Currently (Q2 2013) the SQL-H interface is available for the Teradata Aster platform; the SQL-H interface for Teradata will be available in Q3 of 2013.

The scope of this document is to detail best practices for the native Hadoop tool Sqoop. The current version of Sqoop is Sqoop 2.


Apache Sqoop - Overview

Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be a challenging task. Users must consider details such as ensuring consistency of the data, the consumption of production system resources, and the preparation of data for provisioning downstream pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at Apache Software
Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from an external system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems.

What happens underneath the covers when you run Sqoop is very straightforward. The dataset being
transferred is sliced up into different partitions and a map-only job is launched with individual mappers
responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe
manner since Sqoop uses the database metadata to infer the data types.

In the rest of this document we will walk through examples that show the various ways you can use Sqoop. The goal is to give an overview of Sqoop operation without going into much detail or advanced functionality.


Figure 1—Sqoop Architecture

Ease of Use
Whereas Sqoop requires client-side installation and configuration, Sqoop 2 will be installed and
configured server-side. This means that connectors will be configured in one place, managed by the
Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database
connectivity will only be needed on the server. Sqoop 2 will be a web-based service: front-ended by a
Command Line Interface (CLI) and browser and back-ended by a metadata repository. Moreover, Sqoop
2's service level integration with Hive and HBase will be on the server-side. Oozie will manage Sqoop
tasks through the REST API. This decouples Sqoop internals from Oozie, i.e. if you install a new Sqoop
connector then you won't need to install it in Oozie also.

Ease of Extension

In Sqoop 2, connectors will no longer be restricted to the JDBC model, but can rather define their own
vocabulary, e.g. Couchbase no longer needs to specify a table name, only to overload it as a backfill or
dump operation.


Common functionality will be abstracted out of connectors, holding them responsible only for data
transport. The reduce phase will implement common functionality, ensuring that connectors benefit
from future development of functionality.

Sqoop 2's interactive web-based UI will walk users through import/export setup, eliminating redundant
steps and omitting incorrect options. Connectors will be added in one place, with the connectors
exposing necessary options to the Sqoop framework. Thus, users will only need to provide information
relevant to their use-case.

With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more
predictable. In the same way, the user will not need to be aware of the functionality of all connectors.
As a result, connectors no longer need to provide downstream functionality, transformations, and
integration with other systems. Hence, the connector developer no longer has the burden of
understanding all the features that Sqoop supports.

Security

Currently, Sqoop operates as the user that runs the 'sqoop' command. The security principal used by a
Sqoop job is determined by what credentials the users have when they launch Sqoop. Going forward,
Sqoop 2 will operate as a server based application with support for securing access to external systems
via role-based access to Connection objects. For additional security, Sqoop 2 will no longer allow code
generation, require direct access to Hive and HBase, nor open up access to all clients to execute jobs.

Sqoop 2 will introduce Connections as First-Class Objects. Connections, which will encompass
credentials, will be created once and then used many times for various import/export jobs. Connections
will be created by the Admin and used by the Operator, thus preventing credential abuse by the end
user. Furthermore, Connections can be restricted based on operation (import/export). By limiting the
total number of physical Connections open at one time, and with an option to disable Connections, resources can be managed.


Apache Sqoop – help tool

Sqoop ships with a help tool. To display a list of all available tools,
type the following command:

$ sqoop help

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
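For example, to display the arguments accepted by a specific tool, pass its name to the help command:

$ sqoop help import

The same pattern applies to export, eval, and the other commands listed above.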


Best practices for Sqoop Installation

Sqoop ships as one binary package; however, it is composed of two separate parts: client and server. You need to install the server on a single node in your cluster; this node will then serve as an entry point for all connecting Sqoop clients. The server acts as a MapReduce client, and therefore Hadoop must be installed and configured on the machine hosting the Sqoop server. Clients can be installed on any number of machines. The client does not act as a MapReduce client, so you do not need to install Hadoop on nodes that will act only as a Sqoop client.

Server installation

Copy the Sqoop artifact to the machine where you want to run the Sqoop server. This machine must have Hadoop installed and configured. You do not need to run any Hadoop-related services there; however, the machine must be able to act as a Hadoop client. For example, you should be able to list the contents of HDFS:

hadoop dfs -ls

The Sqoop server supports multiple Hadoop versions. However, as Hadoop major versions are not compatible with each other, Sqoop has multiple binary artifacts - one for each supported major version of Hadoop. You need to make sure that you are using the appropriate binary artifact for your specific Hadoop version. To install the Sqoop server, decompress the appropriate distribution artifact in a location of your convenience and change your working directory to that folder.

# Decompress Sqoop distribution tarball
tar -xvf sqoop-<version>-bin-hadoop<hadoop-version>.tar.gz

# Move the decompressed directory to any location
mv sqoop-<version>-bin-hadoop<hadoop-version> /usr/lib/sqoop

# Change working directory
cd /usr/lib/sqoop

Installing Dependencies

You need to install the Hadoop libraries into the Sqoop server war file. Sqoop provides a convenience script, addtowar.sh, to do so. If you have Hadoop installed in the usual location /usr/lib and the hadoop executable is on your path, you can use the automatic Hadoop installation procedure:


./bin/addtowar.sh -hadoop-auto

If you have Hadoop installed in a different location, you will need to manually specify the Hadoop version and the path to the Hadoop libraries. Use the -hadoop-version parameter to specify the Hadoop major version; versions 1.x and 2.x are currently supported. The path to the Hadoop libraries can be specified using the -hadoop-path parameter. If your Hadoop libraries are in multiple folders, you can specify all of them separated by :.

Example of manual installation:

./bin/addtowar.sh -hadoop-version 2.0 -hadoop-path /usr/lib/hadoop-common:/usr/lib/hadoop-hdfs:/usr/lib/hadoop-yarn

Lastly, you might need to install JDBC drivers that are not bundled with Sqoop because of incompatible licenses. You can add any arbitrary Java jar file to the Sqoop server using the addtowar.sh script with the -jars parameter. As with the Hadoop path, you can specify multiple jars separated by :.

Example of installing MySQL JDBC driver to Sqoop server:

./bin/addtowar.sh -jars /path/to/jar/mysql-connector-java-*-bin.jar
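As a further illustration, the Teradata JDBC driver ships as two jar files, terajdbc4.jar and tdgssconfig.jar; assuming they have been downloaded to /path/to/jar, they could be added in the same way (the paths are placeholders):

./bin/addtowar.sh -jars /path/to/jar/terajdbc4.jar:/path/to/jar/tdgssconfig.jar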

Configuring Server

Before starting the server, you should review the configuration to match your specific environment. Server configuration files are stored in the server/config directory of the distribution artifact, alongside the other Tomcat configuration files.

The file sqoop_bootstrap.properties specifies which configuration provider should be used for loading the configuration for the rest of the Sqoop server. The default value, PropertiesConfigurationProvider, should be sufficient.

The second configuration file, sqoop.properties, contains the remaining configuration properties that affect the Sqoop server. The file is very well documented, so check whether all configuration properties fit your environment. The defaults, or very little tweaking, should be sufficient for most common cases.
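As a minimal sketch, the entries most commonly reviewed are the repository settings and the log locations. The keys below are illustrative of the default Derby-backed repository and are assumptions that may differ between Sqoop 2 releases; verify them against the sqoop.properties file shipped with your distribution:

# Illustrative sqoop.properties entries (verify key names against your distribution)
org.apache.sqoop.repository.provider=org.apache.sqoop.repository.JdbcRepositoryProvider
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.derby.DerbyRepositoryHandler
org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/db;create=true
org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log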

Server Life Cycle

After installation and configuration, you can start the Sqoop server with the following command:


./bin/sqoop.sh server start

Similarly, you can stop the server using the following command:

./bin/sqoop.sh server stop

Client installation
The client does not need extra installation and configuration steps. Just copy the Sqoop distribution artifact to the target machine and unpack it in the desired location. You can start the client with the following command:

bin/sqoop.sh client
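Once the interactive shell starts, the client has to be pointed at the Sqoop server before connections and jobs can be defined. A minimal sketch, assuming the server runs on sqoop-server.example.com with the default port 12000 (the host name is a placeholder, and the exact shell syntax may vary between Sqoop 2 releases, so verify it against the command line client documentation linked at the end of this document):

sqoop:000> set server --host sqoop-server.example.com --port 12000 --webapp sqoop
sqoop:000> show version --all

The show version command is a quick way to confirm that the client can reach the server.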

Debugging information
The logs of the Tomcat server are located under the server/logs directory in the Sqoop 2 distribution directory. The logs of the Sqoop 2 server and the Derby repository are written to sqoop.log and derbyrepo.log respectively (by default, unless changed by the configuration described above), under the configured logs directory in the Sqoop 2 distribution directory.
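To follow these logs while debugging a failing job, they can be tailed directly from the distribution directory; the Tomcat log file name below (catalina.out) and the log locations are assumptions based on the defaults described above:

# Tomcat container log (assumed default file name)
tail -f server/logs/catalina.out

# Sqoop 2 server and Derby repository logs (default names, in the configured logs directory)
tail -f sqoop.log derbyrepo.log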


Best practices for importing data to Hadoop

The following section describes the options for importing data from an RDBMS into Hadoop HDFS, as well as into higher-level constructs such as Hive and HBase.

Importing data to HDFS

The following command is used to import all data from a table in a Teradata database:

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>>

• import: This is the sub-command that instructs Sqoop to initiate an import.

• --connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.

• --table <table name>: This parameter specifies the table which will be imported.

The import is done in two steps, as depicted in Figure 2 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported.

The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the actual data transfer using the metadata captured in the previous step.

The imported data is saved in a directory on HDFS based on the table being imported. As is the case with
most aspects of Sqoop operation, the user can specify any alternative directory where the files should
be populated.

By default these files contain comma delimited fields, with new lines separating different records. You
can easily override the format in which data is copied over by explicitly specifying the field separator and
record terminator characters.

Sqoop also supports different data formats for importing data. For example, you can easily import data
in Avro data format by simply specifying the option --as-avrodatafile with the import command.

There are many other options that Sqoop provides which can be used to further tune the import
operation to suit your specific requirements.
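As a hedged sketch, the following combines several of these options: an explicit target directory, a custom field separator, and a reduced number of mappers to limit the load on the source system (the directory, delimiter, and connection details are placeholders based on the example above):

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --target-dir /user/data/<<TABLE NAME>> \
    --fields-terminated-by '\t' --num-mappers 4

Replacing the delimiter option with --as-avrodatafile would produce Avro output instead of delimited text, as mentioned above.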


Figure 2—Sqoop Import Job

Importing Data into Hive


In most cases, importing data into Hive is the same as running the import task and then using Hive to
create and load a certain table or partition. Doing this manually requires that you know the correct type
mapping between the data and other details like the serialization format and delimiters.

Sqoop takes care of populating the Hive meta-store with the appropriate metadata for the table and
also invokes the necessary commands to load the table or partition as the case may be. All of this is
done by simply specifying the option --hive-import with the import command.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --hive-import


When you run a Hive import, Sqoop converts the data from the native datatypes within the external
datastore into the corresponding types within Hive.

Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new
line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.

Once the import is complete, you can see and operate on the table just like any other table in Hive.
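A hedged sketch of a Hive import that names the target Hive table explicitly and strips the Hive delimiter characters from incoming fields, as described above (the Hive table name and connection details are placeholders):

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --hive-import --hive-table <<HIVE TABLE NAME>> --hive-drop-import-delims

The --hive-drop-import-delims option removes \n, \r and \01 characters from string fields so that Hive rows are not broken by embedded delimiters.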

Importing Data into HBase


You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this can be done by specifying additional options that relate to the HBase table and the column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --hbase-create-table --hbase-table MYTABLE --column-family Teradata

In this command the various options specified are as follows:

• --hbase-create-table: This option instructs Sqoop to create the HBase table.
• --hbase-table: This option specifies the HBase table name to use.
• --column-family: This option specifies the column family name to use.

Export is done in two steps, as depicted in Figure 3 below. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Some connectors support staging tables that help isolate production tables from possible corruption in case of job failures due to any reason. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.


Figure 3—Sqoop Export Job


Best practices for exporting data from Hadoop

In some cases data processed by Hadoop pipelines may be needed in production systems to help run
additional critical business functions. Sqoop can be used to export such data into external data stores as
necessary.

Continuing our example from above - if data generated by the pipeline on Hadoop corresponded to the
ORDERS table in a database somewhere, you could populate it using the following command:

$ sqoop export --connect jdbc:teradata://12.13.24.54/ \
    --table ORDERS --username test --password **** \
    --export-dir /user/stagedata/20130201/ORDERS

• export: This is the sub-command that instructs Sqoop to initiate an export.


• --connect <connect string>, --username <user name>, --password <password>: These are
connection parameters that are used to connect with the database. This is no different from the
connection parameters that you use when connecting to the database via a JDBC connection.
• --table <table name>: This parameter specifies the table which will be populated.
• --export-dir <directory path>: This is the directory from which data will be exported.

As with import, the export is performed in two steps: Sqoop first introspects the database for metadata and then transfers the data using a set of map tasks, as described in the previous section and depicted in Figure 3 above. Where supported by the connector, staging tables can be used to protect the target table from partial loads.
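Where the connector in use supports staging tables, a hedged sketch of such an export looks as follows (the staging table name is a placeholder, and support for these options by a given Teradata connector should be confirmed in its documentation):

$ sqoop export --connect jdbc:teradata://12.13.24.54/ \
    --table ORDERS --username test --password **** \
    --export-dir /user/stagedata/20130201/ORDERS \
    --staging-table ORDERS_STAGE --clear-staging-table

The data is first loaded into ORDERS_STAGE and only merged into ORDERS once every map task has completed, so a failed job leaves the production table untouched.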


Best practices for NoSQL databases

Using specialized connectors, Sqoop can connect with external systems that have optimized import and
export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s
extension framework and can be added to any existing Sqoop installation. Once a connector is installed,
Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the
connector.

By default Sqoop includes connectors for various popular databases such as Teradata, Teradata Aster,
MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and
PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch
tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be
used to connect to any database that is accessible via JDBC.
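If no specialized connector is available for a data store, the generic JDBC connector can be selected by naming the driver class explicitly. A minimal sketch, assuming the Teradata JDBC driver class com.teradata.jdbc.TeraDriver has already been added to the Sqoop installation (connection details and table name are placeholders):

$ sqoop import --driver com.teradata.jdbc.TeraDriver \
    --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>>

Specifying --driver forces Sqoop to fall back to the generic JDBC code path instead of a database-specific connector.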

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.

Sqoop 2 can transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables.


Operational best practices

Operational Dos
• If you need to move big data, make it small first, and then move the small data.
• Prepare the data model in advance to ensure that queries touch the least amount of data.
• Always create an empty export table.
• Do use the --escaped-by option during import and --input-escaped-by during export.
• Do use --fields-terminated-by during import and --input-fields-terminated-by during export.
• Do specify the direct mode option (--direct) if you use the direct connector.
• Develop some kind of incremental import when sqoop-ing in large tables (a sketch follows this list).
o If you do not, your Sqoop jobs will take longer and longer as the source data grows.
• Compress data in HDFS.
o You will save space on HDFS, as your replication factor makes multiple copies of your data.
o You will benefit in processing, as your MapReduce jobs have less data to read and Hadoop becomes less I/O bound.
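A hedged sketch of an incremental, compressed import along the lines of the last two points (the check column, the last-value bookmark, and the connection details are illustrative placeholders; the bookmark would normally come from the previous run or from a saved Sqoop job):

$ sqoop import --connect jdbc:teradata://12.13.24.54/ \
    --table <<TABLE NAME>> --username <<USERNAME>> --password <<Password>> \
    --incremental append --check-column ORDER_ID --last-value 1000000 \
    --compress

With --incremental append, only rows whose ORDER_ID is greater than the supplied --last-value are transferred, and --compress writes the resulting HDFS files compressed.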

Operational Don’ts
• Don’t use the same table for both import and export.
• Don’t specify a query if you use the direct connector.
• Don’t split the same table into too many small files (partitions) in HDFS.
o Too many small files translate into time-consuming map tasks; use sensible partitioning where possible.
o 1,000 partitions will perform better than 10,000 partitions.


Technical implementation of Sqoop JDBC

The following section describes how data is transferred using the JDBC connection, including the technical implementation of the data pipes into and out of Teradata as well as HDFS.

To be updated once we have this information


Sqoop sample use case

The following section describes use cases and examples of how to transfer the data. To be updated once we have this information.

Exporting data to hdfs


Exporting entire table to hdfs
Exporting table to hive using SQL statement
Exporting table to hive using SQL join statement
Exporting entire table to HBase

Importing data from hdfs


Importing entire table from hdfs to Teradata
Importing entire table from hdfs to Teradata ASTER
Importing table to hive using SQL statement to Teradata
Importing table to hive using SQL join statement Teradata ASTER


Sqoop informational links

Subject Area | Links to Sqoop Project
Sqoop2 Download | Download
Sqoop2 Documentation | Documentation
API Documentation | Sqoop2 API documentation
Sqoop troubleshooting guide | Sqoop Troubleshooting Tips
Teradata Sqoop connector | Teradata Sqoop Connector
Teradata Aster Sqoop connector | Teradata Aster Sqoop Connector
Frequently asked questions | FAQ
Sqoop2 Project Status | Sqoop2 Project Status
Sqoop2 command line interface details | Command Line Client
Issues related to Sqoop | Issue Tracker (JIRA)


Summary

Sqoop 2 will enable users to use Sqoop effectively with a minimal understanding of its details by having
a web-application run Sqoop, which allows Sqoop to be installed once and used from anywhere.

In addition, having a REST API for operation and management will help Sqoop integrate better with
external systems such as Oozie.

Also, introducing a reduce phase allows connectors to be focused only on connectivity and ensures that Sqoop functionality is uniformly available for all connectors. This facilitates ease of development of connectors.
