Teradata Connector for Hadoop Tutorial v1.5, 1.6, 1.7 and 1.8
December 2020
1 Introduction
   1.1 Overview
   1.2 Audience
   1.3 Architecture
   1.4 TDCH Plugins and Features (Note: Please See Appendix A for Detailed Property Information)
   1.5 Teradata Plugin Space Requirements
   1.6 Teradata Plugin Privilege Requirements
2 Installing Teradata Connector for Hadoop
   2.1 Prerequisites
   2.2 Software Download
   2.3 RPM Installation
   2.4 ConfigureOozie Installation
3 Launching TDCH Jobs
   3.1 TDCH’s Command Line Interface
   3.2 Runtime Dependencies
   3.3 Launching TDCH with Oozie Workflows
   3.4 TDCH’s Java API
4 Use Case Examples
   4.1 Environment Variables for Runtime Dependencies
   4.2 Use Case: Import to HDFS File from Teradata Table
   4.3 Use Case: Export from HDFS File to Teradata Table
   4.4 Use Case: Import to Existing Hive Table from Teradata Table
   4.5 Use Case: Import to New Hive Table from Teradata Table
   4.6 Use Case: Export from Hive Table to Teradata Table
   4.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table
   4.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table
   4.9 Use Case: Import to Teradata Table from HCatalog Table
   4.10 Use Case: Export from HCatalog Table to Teradata Table
   4.11 Use Case: Import to Teradata Table from ORC File Hive Table
   4.12 Use Case: Export from ORC File HCatalog Table to Teradata Table
   4.13 Use Case: Import to Teradata Table from Avro File in HDFS
   4.14 Use Case: Export from Avro to Teradata Table
5 Performance Tuning
   5.1 Selecting the Number of Mappers
   5.2 Selecting a Teradata Target Plugin
   5.3 Selecting a Teradata Source Plugin
   5.4 Increasing the Batchsize Value
   5.5 Configuring the JDBC Driver
6 Troubleshooting
   6.1 Troubleshooting Requirements
   6.2 Troubleshooting Overview
   6.3 Functional: Understand Exceptions
   6.4 Functional: Data Issues
   6.5 Performance: Back of the Envelope Guide
   6.6 Console Output Structure
   6.7 Troubleshooting Examples
7 FAQ
   7.1 Do I need to install the Teradata JDBC driver manually?
   7.2 What authorization is necessary for running TDCH?
   7.3 How do I enable encryption for data transfers?
   7.4 How do I use User Customized Text Format Parameters?
   7.5 How do I use a Unicode character as the separator?
   7.6 Why is the actual number of mappers less than the value of -nummappers?
   7.7 Why don’t decimal values in Hadoop exactly match the value in Teradata?
   7.8 When should charset be specified in the JDBC URL?
   7.9 How do I configure the Capacity Scheduler to prevent task skew?
   7.10 How can I build my own ConnectorDataTypeConverter?
8 Limitations & Known Issues
   8.1 Teradata Connector for Hadoop
   8.2 Teradata JDBC Driver
   8.3 Teradata Database
   8.4 Hadoop Map/Reduce
   8.5 Hive
   8.6 Avro Data Type Conversion and Encoding
   8.7 Parquet
   8.8 Data Compression
   8.9 Teradata Wallet (TD Wallet)
9 Recommendations & Requirements
   9.1 Hadoop Port Requirements
   9.2 AWS Recommendations
   9.3 Google Cloud Dataproc Recommendations
   9.4 Teradata Connectivity Recommendations
10 Data Compression
Appendix A Supported Plugin Properties
   10.1 Source Plugin Definition Properties
   10.2 Target Plugin Definition Properties
   10.3 Common Properties
   10.4 Teradata Source Plugin Properties
   10.5 Teradata Target Plugin Properties
   10.6 HDFS Source Plugin Properties
   10.7 HDFS Target Properties
   10.8 Hive Source Properties
   10.9 Hive Target Properties
   10.10 HCatalog Source Properties
   10.11 HCatalog Target Properties
1 Introduction
1.1 Overview
The Teradata Connector for Hadoop (TDCH) is a MapReduce application that supports high-
performance parallel bi-directional data movement between Teradata systems and various Hadoop
ecosystem components.
TDCH can function as an end user tool with its own command-line interface, can be included in and
launched with custom Oozie workflows, and can also be integrated with other end user tools via its
Java API.
1.2 Audience
TDCH is designed and implemented for the Hadoop user audience. Users in this audience are
familiar with the Hadoop Distributed File System (HDFS) and MapReduce. They are also familiar
with other widely used Hadoop ecosystem components such as Hive, Pig and Sqoop. They should be
comfortable with the command line style of interfaces many of these tools support. Basic knowledge
about the Teradata database system is also assumed.
[Figure: TDCH connects the Teradata environment (Teradata Tools, Teradata SQL, Teradata DB) with the Hadoop environment (MapReduce, Pig, Sqoop, HDFS)]
1.3 Architecture
TDCH is a bi-directional data movement utility which runs as a MapReduce application inside the
Hadoop cluster. TDCH employs an abstracted ‘plugin’ architecture which allows users to easily
configure, extend and debug their data movement jobs.
1.3.1 MapReduce
TDCH utilizes MapReduce as its execution engine. MapReduce is a framework designed for
processing parallelizable problems across huge datasets using a large number of computers (nodes).
When run against files in HDFS, MapReduce can take advantage of locality of data, processing data
on or near the storage assets to decrease transmission of data. MapReduce supports other distributed
filesystems such as Amazon S3. MapReduce is capable of recovering from partial failure of servers
or storage at runtime. TDCH jobs get submitted to the MapReduce framework, and the distributed
processes launched by the MapReduce framework make JDBC connections to the Teradata database;
the scalability and fault tolerance properties of the framework are key features of TDCH data
movement jobs.
[Figure: TDCH mappers running under the MapReduce framework (Namenode, JobTracker/ResourceManager), each TDCH mapper connecting to the Teradata database]
1.3.2 Controlling the Degree of Parallelism
Both Teradata and Hadoop systems employ extremely scalable architectures, and thus it is very
important to be able to control the degree of parallelism when moving data between the two systems.
Because TDCH utilizes the MapReduce framework as its execution engine, the degree of parallelism
for TDCH jobs is defined by the number of mappers used by the MapReduce job. The number of
mappers used by the MapReduce framework can be configured via the ‘-nummappers’ command line parameter, or via the ‘tdch.num.mappers’ configuration property. General TDCH command line parameters and their underlying properties are discussed in Section 3.1 and listed in Appendix A.
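As a minimal sketch (the connection values, table name, and HDFS path are illustrative placeholders rather than values from this tutorial, and $USERLIBTDCH is assumed to point at the TDCH JAR as set up in Section 4.1), the degree of parallelism can be set as follows:

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -url jdbc:teradata://testsystem/database=testdb \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -sourcetable example_td \
    -targetpaths /user/mapred/example_hdfs \
    -nummappers 8

The same job could instead set the underlying property form, e.g. -Dtdch.num.mappers=8, placed before the plugin-specific arguments.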
[Figure: TDCH job flow: preprocessing stage (ConnectorImportTool/ConnectorExportTool, Input/OutputPlugInConfiguration, ConnectorJobRunner, JobContext); ConnectorIF/ConnectorOF, ConnectorRR/ConnectorRW, InputPlugin, OutputPlugin, SerDe, Converter; postprocessing stage (ConnectorJobRunner)]
1.4 TDCH Plugins and Features
Note: Please See Appendix A for Detailed Property Information
❖ RCFile
RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based
data warehouse systems, such as Hive. RCFile applies the concept of “first horizontally-partition,
then vertically-partition”. It combines the advantages of both row-store and column-store.
RCFile guarantees that data in the same row are located on the same node, and it can exploit column-wise data compression and skip unnecessary column reads.
❖ internal.fastexport
The Teradata “internal.fastexport” source plugin associates a FastExport JDBC session with each
mapper in the TDCH job to retrieve data from the source table in Teradata. The
“internal.fastexport” method utilizes a FastExport “slot” on Teradata and implements
coordination between the TDCH mappers and a TDCH coordinator process (running on the
Hadoop node where the job was submitted) as is defined by the Teradata FastExport protocol.
The “internal.fastexport” plugin offers better export performance than the other source plugins when dealing with larger amounts of data, but it cannot recover from mapper failure.
1.5 Teradata Plugin Space Requirements
❖ split.by.partition
When the source table is not partitioned, the Teradata “split.by.partition” source plugin creates a
temporary partitioned staging table and executes an INSERT-SELECT to move data from the
source table into the stage table. To support the use of a temporary partitioned staging table, the
source database must have enough permanent space to accommodate the source data set in the
stage table as well as in the source table. In addition to the permanent space required by the
temporary stage table, the Teradata “split.by.partition” source plugin requires spool space
equivalent to the size of the source data to support the INSERT-SELECT operation between the
source table and the temporary partitioned stage table.
Once a partitioned source table is available, the Teradata “split.by.partition” source plugin
associates partitions from the source table with distinct mappers from the TDCH job. Each
mapper retrieves the associated data via a SELECT statement, and thus the Teradata
“split.by.partition” source plugin requires that the source database have enough spool space to
support N SELECT statements, where N is the number of mappers in use by the TDCH job.
❖ split.by.amp
The Teradata “split.by.amp” source plugin does not require any space on the source database due
to its use of the “tdampcopy” table operator.
❖ internal.fastexport
The Teradata “internal.fastexport” source plugin does not require any space on the source
database due to its use of the JDBC FastExport protocol.
1.6 Teradata Plugin Privilege Requirements
The following list defines the privileges required by the database user associated with the TDCH job when the Teradata source or target plugins are in use. Each plugin is listed with its Create Table and Create View privilege requirements and the system views/tables on which the Select privilege is required (the X view applies when the usexviews argument is enabled, the corresponding non-X view when it is disabled):
• split.by.hash: Create Table No, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS and DBC.INDICESX/DBC.INDICES
• split.by.value: Create Table No, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS and DBC.INDICESX/DBC.INDICES
• split.by.partition: Create Table Yes, Create View Yes; Select on DBC.COLUMNSX/DBC.COLUMNS and DBC.INDICESX/DBC.INDICES
• split.by.amp: Create Table No, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS and DBC.TABLESX/DBC.TABLES
• internal.fastexport: Create Table No, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS and DBC.TABLESX/DBC.TABLES
• batch.insert: Create Table Yes, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS, DBC.INDICESX/DBC.INDICES and DBC.TABLESX/DBC.TABLES
• internal.fastload: Create Table Yes, Create View No; Select on DBC.COLUMNSX/DBC.COLUMNS, DBC.INDICESX/DBC.INDICES, DBC.TABLESX/DBC.TABLES, DBC.DATABASESX/DBC.DATABASES, DBC.TABLE_LEVELCONSTRAINTSX/DBC.TABLE_LEVELCONSTRAINTS and DBC.TRIGGERSX/DBC.TRIGGERS
NOTE: Create table privileges are only required by the “batch.insert” and “internal.fastload”
Teradata plugins when staging tables are required.
2 Installing Teradata Connector for Hadoop
2.1 Prerequisites
The latest software release of the Teradata Connector for Hadoop is available at the following
location:
https://downloads.teradata.com/download/connectivity/teradata-connector-for-hadoop-command-line-edition
TDCH is also available for download via the Teradata Software Server (TSS).
Some Hadoop vendors will distribute Teradata-specific versions of Sqoop that are “Powered by
Teradata”; in most cases these Sqoop packages contain one of the latest versions of TDCH. These
Sqoop implementations will forward Sqoop command line arguments to TDCH via the Java API and
then rely on TDCH for data movement between Hadoop and Teradata. In this way, Hadoop users
can utilize a common Sqoop interface to launch data movement jobs using specialized vendor-
specific tools.
2.3 RPM Installation
TDCH can be installed on any node in the Hadoop cluster, though typically it is installed on a
Hadoop edge node.
TDCH is distributed in RPM format, and can be installed in a single step:
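A sketch of the step, assuming an RPM file named after the example package version shown below (the exact file name, including the “.noarch.rpm” suffix, varies by TDCH release and Hadoop version):

rpm -ivh teradata-connector-1.7.0-hadoop3.x.noarch.rpm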
After RPM installation, the following directory structure should be created (teradata-connector-
1.7.0-hadoop3.x used as example):
/usr/lib/tdch/1.7/:
/usr/lib/tdch/1.7/conf:
teradata-export-properties.xml.template
teradata-import-properties.xml.template
/usr/lib/tdch/1.7/lib:
/usr/lib/tdch/1.7/scripts:
The README and SUPPORTLIST files contain information about the features and fixes included
in a given TDCH release, as well as information about what versions of relevant systems (Teradata,
Hadoop, etc.) are supported by a given TDCH release.
The “conf” directory contains a set of XML files that can be used to define default values for
common TDCH properties. To use these files, specify default values for the desired properties in
Hadoop configuration format, remove the “.template” extension and copy them into the Hadoop
“conf” directory.
The “lib” directory contains the TDCH JAR, as well as the Teradata GSS and JDBC JARs. Only the
TDCH JAR is required when launching TDCH jobs via the command line interface, while all three
JARs are required when launching TDCH jobs via Oozie Java actions.
The “scripts” directory contains the configureOozie.sh script which can be used to install TDCH into
HDFS such that TDCH jobs can be launched by other Teradata products via custom Oozie Java
actions; see the following section for more information.
2.4 ConfigureOozie Installation
Once TDCH has been installed into the Linux filesystem, the configureOozie.sh script can be used to
install TDCH, its dependencies, and a set of custom Oozie workflow files into HDFS. By installing
TDCH into HDFS in this way, TDCH jobs can be launched by users and applications outside of the
Hadoop cluster via Oozie. Currently, both Teradata Studio and Teradata Data Mover products
support launching TDCH jobs from nodes outside of the cluster when TDCH is installed into HDFS
using the configureOozie.sh script.
The configureOozie.sh script supports the following arguments in the form ‘<argument>=<value>’; an example invocation follows the list:
• nn - The Name Node host name (required)
• nnHA - If the name node is HA, specify the fs.defaultFS value found in ‘core-site.xml’
• rm - The Resource Manager host name (uses nn parameter value if omitted)
• rmHA - If Resource Manager HA is enabled, specify the yarn.resourcemanager.cluster-id
found in ‘yarn-site.xml’
• oozie - The Oozie host name (uses nn parameter value if omitted)
• webhcat - The WebHCatalog host name (uses nn parameter if omitted)
• webhdfs - The WebHDFS host name (uses nn parameter if omitted)
• nnPort - The Name node port number (8020 if omitted)
• rmPort - The Resource Manager port number (8050 if omitted)
• ooziePort - The Oozie port number (11000 if omitted)
• webhcatPort - The WebHCatalog port number (50111 if omitted)
• webhdfsPort - The WebHDFS port number (50070 if omitted)
• hiveClientMetastorePort - The URI port for hive client to connect to metastore server (9083
if omitted)
• kerberosRealm - name of the Kerberos realm
• hiveMetaStore - The Hive Metastore host name (uses nn parameter value if omitted)
• hiveMetaStoreKerberosPrincipal - The service principal for the metastore thrift server
(hive/_HOST if omitted)
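For example, a minimal invocation on a non-HA cluster that accepts the default port values could look like the following sketch; the host names are placeholders, not values from this tutorial:

cd /usr/lib/tdch/1.7/scripts
./configureOozie.sh nn=namenode.example.com oozie=oozie.example.com webhcat=webhcat.example.com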
Once the configureOozie.sh script has been run, the following directory structure should exist in
HDFS:
/teradata/hadoop/lib/<all dependent hadoop jars>
/teradata/tdch/<version>/lib/teradata-connector-<version>.jar
/teradata/tdch/<version>/lib/tdgssconfig.jar
/teradata/tdch/<version>/lib/terajdbc4.jar
3 Launching TDCH Jobs
3.1 TDCH’s Command Line Interface
To launch a TDCH job via the command line interface, utilize the following syntax:
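A general sketch of the syntax, based on the examples in Section 4 (with $USERLIBTDCH pointing at the TDCH JAR as described in Section 4.1):

hadoop jar $USERLIBTDCH \
    <tool class> \
    -libjars <comma-separated runtime dependency JARs> \
    [-D<property>=<value> ...] \
    [plugin-specific arguments such as -url, -username, -jobtype, -nummappers]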
The tool class to be used will depend on whether the TDCH job is exporting data from the Hadoop cluster to Teradata or importing data into the Hadoop cluster from Teradata.
• For exports from Hadoop, reference the “ConnectorExportTool” main class via the path
‘com.teradata.connector.common.tool.ConnectorExportTool’
• For imports to Hadoop, reference the “ConnectorImportTool” main class via the path
‘com.teradata.connector.common.tool.ConnectorImportTool’
When running TDCH jobs which utilize the Hive or HCatalog source or target plugins, a set of
dependent JARs must be distributed with the TDCH JAR to the nodes on which the TDCH job will
be run. These runtime dependencies should be defined in comma-separated format using the ‘-libjars’
command line option; see the following section for more information about runtime dependencies.
Job and plugin-specific properties can be defined via the ‘-D<property>=<value>’ format, or via their associated command line interface arguments. See Appendix A for a full list of the properties and arguments supported by the plugins and tool classes, and see Section 4 for examples which utilize the “ConnectorExportTool” and “ConnectorImportTool” classes to launch TDCH jobs via the command line interface.
3.2 Runtime Dependencies
In some cases, TDCH supports functionality which depends on libraries that are not encapsulated in
the TDCH JAR. When utilizing TDCH via the command line interface, the absolute path to the
runtime dependencies associated with the given TDCH functionality should be included in the
HADOOP_CLASSPATH environment variable as well as specified in comma-separated format for
the ‘-libjars’ argument. Most often, these runtime dependencies can be found in the lib directories of
the Hadoop components installed on the cluster. As an example, the JARs associated with Hive are
used below. The location and version of the dependent JARs will change based on the version of
Hive installed on the local cluster, and thus the version numbers associated with the JARs should be
updated accordingly.
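A generic sketch of the pattern (the paths and table names are placeholders; the concrete Hive, HCatalog, and Avro JAR lists follow below):

# Dependencies are colon-separated on the classpath...
export HADOOP_CLASSPATH=/path/to/dep1.jar:/path/to/dep2.jar:$HADOOP_CLASSPATH
# ...and comma-separated for the -libjars argument.
hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -libjars /path/to/dep1.jar,/path/to/dep2.jar \
    -url jdbc:teradata://testsystem/database=testdb \
    -username testuser \
    -password testpassword \
    -jobtype hive \
    -sourcetable example_td \
    -nummappers 1 \
    -targettable example_hive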
TDCH jobs which utilize the HDFS Avro plugin as a source or target are dependent on the following
Avro JAR files:
For Hadoop 1.x:
• avro-1.7.1.jar or later versions
• avro-mapred-1.7.1.jar or later versions
• paranamer-2.3.jar
For Hadoop 2.x:
• avro-1.7.3.jar or later versions
• avro-mapred-1.7.3-hadoop2.jar or later versions
• paranamer-2.3.jar
TDCH jobs which utilize the Hive plugins as sources or targets are dependent on the following Hive
JAR files:
• antlr-runtime-3.4.jar
• commons-dbcp-1.4.jar
• commons-pool-1.5.4.jar
• datanucleus-api-jdo-3.2.6.jar
• datanucleus-core-3.2.10.jar
• datanucleus-rdbms-3.2.9.jar
• hive-cli-1.2.1.jar
• hive-exec-1.2.1.jar
• hive-jdbc-1.2.1.jar
• hive-metastore-1.2.1.jar
• jdo-api-3.0.1.jar
• libfb303-0.9.3.jar
• libthrift-0.9.3.jar
TDCH jobs which utilize the HCatalog plugins as sources or targets are dependent on all the JARs
associated with the Hive plugins (defined above), as well as the following HCatalog JAR files:
• hive-hcatalog-core-1.2.1.jar
3.3 Launching TDCH with Oozie Workflows
TDCH can be launched from nodes outside of the Hadoop cluster via the Oozie web application.
Oozie executes user-defined workflows - an Oozie workflow is one or more actions arranged in a
graph, defined in an XML document in HDFS. Oozie actions are Hadoop jobs (Oozie supports
MapReduce, Hive, Pig and Sqoop jobs) which get conditionally executed on the Hadoop cluster
based on the workflow definition. TDCH’s “ConnectorImportTool” and “ConnectorExportTool”
classes can be referenced directly in Oozie workflows via Oozie Java actions.
The configureOozie.sh script discussed in section 2.4 creates a set of Oozie workflows in HDFS
which can be used directly or can be used as examples during custom workflow development.
3.4 TDCH’s Java API
For advanced TDCH users who would like to build a Java application around TDCH, launching
TDCH jobs directly via the Java API is possible, though Teradata support for custom applications
which utilize the TDCH Java API is not provided. The TDCH Java API is composed of the
following sets of classes:
• Utility classes
o These classes can be used to fetch information and modify the state of a given data
source or target.
Below is an overview of the TDCH package structure including the locations of useful utility,
configuration, and job execution classes:
• com.teradata.connector.common.utils
o ConnectorJobRunner
• com.teradata.connector.common.utils
o ConnectorConfiguration
o ConnectorMapredUtils
o ConnectorPlugInUtils
• com.teradata.connector.hcat.utils
o HCatPlugInConfiguration
• com.teradata.connector.hdfs.utils
o HdfsPlugInConfiguration
• com.teradata.connector.hive.utils
o HivePlugInConfiguration
o HiveUtils
• com.teradata.connector.teradata.utils
o TeradataPlugInConfiguration
o TeradataUtils
IMPORTANT
More information about TDCH’s Java API and Javadocs detailing the above classes are
available upon request, but please note that support is not provided for building, debugging,
troubleshooting etc. of custom applications which utilize the TDCH Java API.
4 Use Case Examples
4.1 Environment Variables for Runtime Dependencies
Before launching a TDCH job via the command line interface, set up the following environment variables on an edge node in the Hadoop cluster where the TDCH job will be run. As an example, the following environment variables reference TDCH 1.5.7 and a specific set of Hive libraries. Ensure that the HIVE_HOME and HCAT_HOME environment variables are set, and that the versions of the referenced Hive libraries are updated for the given local cluster.
export LIB_JARS=
$HIVE_HOME/lib/hive-builtins-0.9.0.jar,
$HIVE_HOME/lib/hive-cli-0.9.0.jar,
$HIVE_HOME/lib/hive-exec-0.9.0.jar,
$HIVE_HOME/lib/hive-metastore-0.9.0.jar,
$HIVE_HOME/lib/libfb303-0.7.0.jar,
$HIVE_HOME/lib/libthrift-0.7.0.jar,
$HIVE_HOME/lib/jdo2-api-2.3-ec.jar,
$HIVE_HOME/lib/slf4j-api-1.6.1.jar
$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar,
$HIVE_HOME/lib/hive-builtins-0.9.0.jar,
$HIVE_HOME/lib/hive-cli-0.9.0.jar,
$HIVE_HOME/lib/hive-exec-0.9.0.jar,
$HIVE_HOME/lib/hive-metastore-0.9.0.jar,
$HIVE_HOME/lib/libfb303-0.7.0.jar,
$HIVE_HOME/lib/libthrift-0.7.0.jar,
$HIVE_HOME/lib/jdo2-api-2.3-ec.jar,
$HIVE_HOME/lib/slf4j-api-1.6.1.jar
export HADOOP_CLASSPATH=
$HIVE_HOME/conf:
$HIVE_HOME/lib/antlr-runtime-3.0.1.jar:
$HIVE_HOME/lib/commons-dbcp-1.4.jar:
$HIVE_HOME/lib/commons-pool-1.5.4.jar:
$HIVE_HOME/lib/datanucleus-connectionpool-2.0
$HIVE_HOME/lib/datanucleus-core-2.0.3.jar:
$HIVE_HOME/lib/datanucleus-rdbms-2.0.3.jar:
$HIVE_HOME/lib/hive-builtins-0.9.0.jar:
$HIVE_HOME/lib/hive-cli-0.9.0.jar:
$HIVE_HOME/lib/hive-exec-0.9.0.jar:
$HIVE_HOME/lib/hive-metastore-0.9.0.jar:
$HIVE_HOME/lib/jdo2-api-2.3-ec.jar:
$HIVE_HOME/lib/libfb303-0.7.0.jar:
$HIVE_HOME/lib/libthrift-0.7.0.jar:
$HIVE_HOME/lib/mysql-connector-java-5.1.17-bin.jar:
$HIVE_HOME/lib/slf4j-api-1.6.1.jar
$HIVE_HOME/lib/avro-1.7.1.jar:
$HIVE_HOME/lib/avro-mapred-1.7.1.jar:
$HIVE_HOME/lib/paranamer-2.3.jar
export USERLIBTDCH=/usr/lib/tdch/1.5/lib/teradata-connector-1.5.7.jar
4.2 Use Case: Import to HDFS File from Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
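A complete command for this use case might look like the following sketch; the job type, file format, separator, source table name, and target path shown here are illustrative assumptions rather than values taken from this tutorial:

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -libjars $LIB_JARS \
    -url jdbc:teradata://testsystem/database=testdb \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -fileformat textfile \
    -separator "," \
    -sourcetable example1_td \
    -nummappers 1 \
    -targetpaths /user/mapred/example1_hdfs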
4.3 Use Case: Export from HDFS File to Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
rm /tmp/example2_hdfs_data
4.3.3 Run: ConnectorExportTool command
Execute the following on the Hadoop edge node.
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
# Set job type as ‘hdfs’
-jobtype hdfs
# Set source HDFS path
-sourcepaths /user/mapred/example2_hdfs
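A completed form of this command might look like the following sketch; the target table name and the text-format options are assumptions for illustration:

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -libjars $LIB_JARS \
    -url jdbc:teradata://testsystem/database=testdb \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -fileformat textfile \
    -separator "," \
    -sourcepaths /user/mapred/example2_hdfs \
    -nummappers 1 \
    -targettable example2_td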
4.4 Use Case: Import to Existing Hive Table from Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
) STORED AS RCFILE;
4.4.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
# Set job type as ‘hive’
-jobtype hive
4.5 Use Case: Import to New Hive Table from Teradata Table
DATABASE testdb;
,c3 FLOAT
);
.LOGOFF
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
4.6 Use Case: Export from Hive Table to Teradata Table
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
h1 INT
, h2 STRING
) stored as textfile;
echo "4,acme">/tmp/example5_hive_data
rm /tmp/example5_hive_data
4.6.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
# Set job type as ‘hive’
-jobtype hive
4.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 DATE
.LOGOFF
4.7.2 Setup: Create a Hive Partitioned Table
Execute the following through the Hive command line interface on the Hadoop edge node.
h1 INT
STORED AS RCFILE;
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example6_td
# Specify both source and target field names so TDCH knows how to
# map Teradata columns to Hive partition columns.
-sourcefieldnames "c1,c2"
-nummappers 1
-targettable example6_hive
-targetfieldnames "h1,h2"
4.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table
DATABASE testdb;
c1 INT
, c2 DATE
.LOGOFF
Execute the following through the Hive command line interface on the Hadoop edge node.
h1 INT
STORED AS RCFILE;
4.8.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example7_hive
# Specify both source and target field names so TDCH knows how to
# map the Hive partition column to the Teradata column.
-sourcefieldnames "h1,h2"
-nummappers 1
-targettable example7_td
-targetfieldnames "c1,c2"
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
4.9.2 Setup: Create a Hive Table
Execute the following on the Hive command line.
h1 INT
, h2 STRING
) STORED AS RCFILE;
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-sourcetable example8_td
-nummappers 1
-targettable example8_hive
4.10 Use Case: Export from HCatalog Table to Teradata Table
.LOGON testsystem/testuser
DATABASE testdb;
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
h1 INT
, h2 STRING
echo "8,acme">/tmp/example9_hive_data
rm /tmp/example9_hive_data
4.10.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
# Set job type as ‘hcat’
-jobtype hcat
-sourcetable example9_hive
-nummappers 1
-targettable example9_td
4.11 Use Case: Import to Teradata Table from ORC File Hive Table
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-jobtype hive
-targettable import_hive_fun22
-sourcetable import_hive_fun2
4.12 Use Case: Export from ORC File HCatalog Table to Teradata Table
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-sourcedatabase default
-sourcetable export_hcat_fun1
-nummappers 2
-separator ','
-targettable export_hcat_fun1
4.13 Use Case: Import to Teradata Table from Avro File in HDFS
insert into tdtbl(null, null, null, null, null, null, null, null,
null, null);
4.13.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node.
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-jobtype hdfs
-targetpaths /user/hduser/avro_import
-nummappers 2
-sourcetable tdtbl
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-targetfieldnames "col2,col3"
-sourcefieldnames "i,s"
4.14 Use Case: Export from Avro to Teradata Table
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
# Set file format to ‘avrofile’
-fileformat avrofile
-jobtype hdfs
-sourcepaths /user/hduser/avro_export
-nummappers 2
-targettable tdtbl_export
-usexviews false
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-sourcefieldnames "col2,col3"
-targetfieldnames "i,s"
5 Performance Tuning
TDCH is a highly scalable application which runs atop the MapReduce framework, and thus its
performance is directly related to the number of mappers associated with a given TDCH job. In most
cases, TDCH jobs should run with as many mappers as the given Hadoop and Teradata systems, and
their administrators, will allow. The number of mappers will depend on whether the Hadoop and
Teradata systems are used for mixed workloads or are dedicated to data movement tasks for a certain
period of time and will also depend on the mechanisms TDCH utilizes to interface with both
systems. This section attempts to describe the factors that come into play when defining the number
of mappers associated with a given TDCH job.
• The scheduler’s queue definition for the queue associated with the TDCH job; the queue
definition will include information about the minimum and maximum number of containers
offered by the queue, as well as whether the scheduler supports preemption.
• Whether the given TDCH job supports preemption if the associated YARN scheduler and
queue have enabled preemption.
To determine the maximum number of mappers that can be run on a given scheduler-enabled YARN
cluster, see the Hadoop documentation for the scheduler that has been implemented in YARN on the
given cluster. See the following section for more information on which TDCH jobs support
preemption.
5.1.3 TDCH Support for Preemption
In some cases, the queues associated with a given YARN scheduler will be configured to support
elastic scheduling. This means that a given queue can grow in size to utilize the resources associated
with other queues when those resources are not in use; if these inactive queues become active while
the original job is running, containers associated with the original job will be preempted, and these
containers will be restarted when resources associated with the elastic queue become available. All
of TDCH’s source plugins, and all of TDCH’s target plugins except the TDCH “internal.fastload”
Teradata target plugin, support preemption. This means that all TDCH jobs, with the exception of
jobs which utilize the TDCH “internal.fastload” target plugin, can be run with more mappers than
are defined by maximum amount of containers associated with the given queue on scheduler-
enabled, preemption-enabled YARN clusters. Jobs which utilize the TDCH “internal.fastload” target
plugin can also be run in this environment, but may not utilize elastically-available resources. Please
see the Hadoop documentation for the given scheduler to determine the maximum number of
mappers supported by a given queue.
5.2 Selecting a Teradata Target Plugin
This section provides suggestions on how to select a Teradata target plugin and provides some
information about the performance of the various plugins.
❖ batch.insert
The Teradata “batch.insert” target plugin utilizes uncoordinated SQL sessions when connecting
with Teradata. This plugin should be used when loading a small amount of data, or when there
are complex data types in the target table which are not supported by the Teradata
“internal.fastload” target plugin. This plugin should also be used for long running jobs on YARN
clusters where preemptive scheduling is enabled or regular failures are expected.
❖ internal.fastload
The Teradata “internal.fastload” target plugin utilizes coordinated FastLoad sessions when
connecting with Teradata, and thus this plugin is more performant than the Teradata
“batch.insert” target plugin. This plugin should be used when transferring large amounts of data
from large Hadoop systems to large Teradata systems. This plugin should not be used for long
running jobs on YARN clusters where preemptive scheduling could cause mappers to be
restarted or where regular failures are expected, as the FastLoad protocol does not support
restarted sessions and the job will fail in this scenario.
5.3 Selecting a Teradata Source Plugin
This section provides suggestions on how to select a Teradata source plugin and provides some information about the performance of the various plugins.
❖ split.by.value
The Teradata “split.by.value” source plugin performs the best when the split-by column has
more distinct values than the TDCH job has mappers, and when the distinct values in the split-by
column evenly partition the source dataset. The plugin has each mapper submit a range-based
SELECT query to Teradata, fetching the subset of data associated with the mapper’s designated
range. Thus, when the source data set is not evenly partitioned by the values in the split-by
column, the work associated with the data transfer will be skewed between the mappers, and the
job will take longer to complete.
❖ split.by.hash
The Teradata “split.by.hash” source plugin performs the best when the split-by column has more
distinct hash values than the TDCH job has mappers, and when the distinct hash values in the
split-by column evenly partition the source dataset. The plugin has each mapper submit a range-
based SELECT query to Teradata, fetching the subset of data associated with the mapper’s
designated range. Thus, when the source data set is not evenly partitioned by the hash values in
the split-by column, the work associated with the data transfer will be skewed between the
mappers, and the job will take longer to complete.
❖ split.by.partition
The Teradata “split.by.partition” source plugin performs the best when the source table is evenly
partitioned, the partition column(s) are also indexed, and the number of partitions in the
source table is equal to the number of mappers used by the TDCH job. The plugin has each
mapper submit a range-based SELECT query to Teradata, fetching the subset of data associated
with one or more partitions. The plugin is the only Teradata source plugin to support defining the
source data set via an arbitrarily complex select query; in this scenario a staging table is used.
The number of partitions associated with the staging table created by the Teradata
“split.by.partition” source plugin can be explicitly defined by the user, so the plugin is the most
tunable of the four Teradata source plugins.
❖ split.by.amp
The Teradata “split.by.amp” source plugin performs the best when the source data set is evenly
distributed on the amps in the Teradata system, and when the number of mappers used by the
TDCH job is equivalent to the number of amps in the Teradata system. The plugin has each
mapper submit a table operator-based SELECT query to Teradata, fetching the subset of data
associated with the mapper’s designated amps. The plugin’s use of the table operator makes it
the most performant of the four Teradata source plugins, but the plugin can only be used against
Teradata systems which have the table operator available (14.10+).
❖ internal.fastexport
The Teradata “internal.fastexport” source plugin utilizes coordinated FastExport sessions when
connecting with Teradata, and thus this plugin performs better than the Teradata split.by
source plugins when dealing with larger data sets. This plugin should be used when transferring
large amounts of data from large Teradata systems to large Hadoop systems. This plugin should
not be used for long running jobs on YARN clusters where preemptive scheduling could cause
mappers to be restarted or where regular failures are expected, as the FastExport protocol does
not support restarted sessions and the job will fail in this scenario.
5.5 Configuring the JDBC Driver
Each mapper uses a JDBC connection to interact with the Teradata database. The TDCH jar includes
the latest version of the Teradata JDBC driver and provides users with the ability to interface directly
with the driver via the ‘tdch.input.teradata.jdbc.url’ and ‘tdch.output.teradata.jdbc.url’ properties.
Using these properties, TDCH users can fine tune the low-level data transfer characteristics of the
driver by appending JDBC options to the JDBC URL properties. Internal testing has shown that
disabling the JDBC CHATTER option increases throughput when the Teradata and Hadoop systems
are localized. More information about the JDBC driver and its supported options can be found at the
online version of the Teradata JDBC Driver Reference.
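As a sketch, a driver option can be appended to the JDBC URL through the underlying property; the exact CHATTER syntax shown here is an assumption and should be verified against the Teradata JDBC Driver Reference, and the connection values and table names are placeholders:

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -Dtdch.input.teradata.jdbc.url="jdbc:teradata://testsystem/database=testdb,CHATTER=OFF" \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -sourcetable example_td \
    -nummappers 8 \
    -targetpaths /user/mapred/example_hdfs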
6 Troubleshooting
6.1 Troubleshooting Requirements
To troubleshoot failing TDCH jobs or performance issues, the following information should be available:
• To enable DEBUG messages in the mapper logs, add the following property definition to the TDCH command: ‘-Dmapred.map.log.level=DEBUG’. An example invocation is shown below.
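A sketch of where the definition goes, re-using the command style from Section 4 (the table name and target path are placeholders):

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -Dmapred.map.log.level=DEBUG \
    -url jdbc:teradata://testsystem/database=testdb \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -sourcetable example_td \
    -nummappers 2 \
    -targetpaths /user/mapred/example_hdfs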
6.2 Troubleshooting Overview
TDCH issues fall into two broad types: functional issues (look for exceptions) and performance issues (go through the checklist). The problem area is either an import or an export job.
6.3 Functional: Understand Exceptions
NOTE: The example console output contained in this section has not been updated to reflect the latest
version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though not
identical.
Look in the console output for:
Examples:
com.teradata.hadoop.exception.TeradataHadoopSQLException:
(omitted)……
(omitted)……
6.4 Functional: Data Issues
This category of issues occurs at runtime (most often with the internal.fastload method), and usually the root cause is not obvious. We suggest checking the following:
6.5 Performance: Back of the Envelope Guide
No throughput can be higher than the total I/O or network transfer capacity of the least powerful component in the overall solution. Therefore, our methodology is to understand the maximum for the configuration and work backwards: the achievable throughput is bounded by the smallest of the aggregate capacities, e.g. min(∑(Ttd-io), ∑(Ttd-transfer), …) across the Teradata and Hadoop systems.
Therefore, we should:
• Watch out for node-level CPU saturation (including core saturation), because “no CPU = no
work can be done”.
• If all nodes are saturated on either Hadoop or Teradata, consider expanding the system footprint and/or lowering concurrency.
• If one node is much busier than the other nodes within either Hadoop or Teradata, try to balance the workload skew.
• If both Hadoop and Teradata are mostly idle, look for obvious user mistakes or configuration
issues, and if possible, increase concurrency.
Here is the checklist we could go through in case of slow performance:
❖ User Settings
• Using the optimal number of mappers? (too few mapper sessions can significantly impact performance)
❖ Database
• Is there any TDWM setting limiting # of concurrent sessions or user’s query priority?
o tdwmcmd -a
• DBSControl settings
o AWT tasks: maxawttask, maxloadtask, maxloadawt
o Compression settings
❖ Network
❖ Hadoop
6.6 Console Output Structure
NOTE: The example console output contained in this section has not been updated to reflect the latest
version of TDCH; the error messages and stack traces for TDCH 1.5+ will look similar, though not
identical.
Verify parameter settings
20/08/06 03:23:18 INFO processor.TeradataOutputProcessor: the teradata connector for hadoop version is: 1.65
20/08/06 03:34:39 INFO processor.TeradataBatchInsertProcessor: insert from staget table to target table
20/08/06 03:34:39 INFO processor.TeradataBatchInsertProcessor: the insert select sql starts at: 1565076879496
20/08/06 03:48:54 INFO processor.TeradataBatchInsertProcessor: the insert select sql ends at: 1565077734646
20/08/06 03:48:54 INFO processor.TeradataBatchInsertProcessor: the total elapsed time of the insert select sql is: 855s
20/08/06 03:49:01 INFO processor.TeradataOutputProcessor: the total elapsed time of output postprocessor
com.teradata.connector.teradata.processor.TeradataBatchInsertProcessor is: 862s
6.7 Troubleshooting Examples
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 3802] [SQLState 42S02] Database ‘testdb' does not
exist. at
com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(Error
Factory.java:307) at
com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveI
nitSubState.java:102) at
com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachi
ne(StatementReceiveState.java:302) at
com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(Statem
entReceiveState.java:183) at
com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(Stateme
ntController.java:121) at
com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementCo
ntroller.java:112) at
com.teradata.jdbc.jdbc_4.TDSession.executeSessionRequest(TDSession.java:6
24) at
com.teradata.jdbc.jdbc_4.TDSession.<init>(TDSession.java:288) at
com.teradata.jdbc.jdk6.JDK6_SQL_Connection.<init>(JDK6_SQL_Connection.jav
a:30) at
com.teradata.jdbc.jdk6.JDK6ConnectionFactory.constructConnection(JDK6Conn
ectionFactory.java:22) at
com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFacto
ry.java:130) at
com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFacto
ry.java:120) at
com.teradata.jdbc.TeraDriver.doConnect(TeraDriver.java:228) at
com.teradata.jdbc.TeraDriver.connect(TeraDriver.java:154) at
java.sql.DriverManager.getConnection(DriverManager.java:582) at
java.sql.DriverManager.getConnection(DriverManager.java:185) at
com.teradata.hadoop.db.TeradataConnection.connect(TeradataConnection.java
:274)
6.7.2 Internal FastLoad Server Socket Time-out
When running export job using the "internal.fastload" method, the following error may occur:
This error occurs because the number of currently available map tasks is less than the number of map tasks specified on the command line via the ‘-nummappers’ parameter. This error can occur under the following conditions:
• Other map/reduce jobs are running concurrently in the Hadoop cluster, so there are not enough resources to allocate the specified number of map tasks to the export job.
• The maximum number of map tasks in the Hadoop cluster is smaller than the number of existing map tasks plus the expected number of map tasks for the export job.
When the above error occurs, please try to increase the maximum number of map tasks of the
Hadoop cluster, or decrease the number of map tasks for the export job.
When this error occurs, please double check the input parameters and their values.
6.7.4 Hive Partition Column Cannot Appear in the Hive Table Schema
When running import job with 'hive' job type, the columns defined in the target partition schema
cannot appear in the target table schema. Otherwise, the following exception will be thrown:
In this case, please check the provided schemas for Hive table and Hive partition.
6.7.5 String will be Truncated if its Length Exceeds the Teradata String Length
(VARCHAR or CHAR) when running Export Job
When running an export job, if the length of the source string exceeds the maximum length of
Teradata’s String type (CHAR or VARCHAR), the source string will be truncated and will result in
data inconsistency.
To prevent that from happening, please carefully set the data schema for source data and target data.
6.7.6 Scaling Number of Timestamp Data Type Should be Specified Correctly in
JDBC URL in “internal.fastload” Method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
When loading data into Teradata using the internal.fastload method, the following error may occur:
com.teradata.hadoop.exception.TeradataHadoopException: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
at java.io.DataInputStream.readUTF(DataInputStream.java:572)
at
java.io.DataInputStream.readUTF(DataInputStream.java:547) at
com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.beginL
oading(TeradataInternalFastloadOutputProcessor.java:889) at
com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.run
(TeradataInternalFastloadOutputProcessor.java:173) at
com.teradata.hadoop.job.TeradataExportJob.runJob(TeradataExportJob.java:7
5) at
com.teradata.hadoop.tool.TeradataJobRunner.runExportJob(TeradataJobRunner
.java:192) at
com.teradata.hadoop.tool.TeradataExportTool.run(TeradataExportTool.java:3
9) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at
com.teradata.hadoop.tool.TeradataExportTool.main(TeradataExportTool.java:
395) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
This error is usually caused by setting the wrong ‘tsnano’ value in the JDBC URL. In Teradata DDL, the default timestamp length is 6, which is also the maximum allowed value, but the user can specify a lower value.
When ‘tsnano’ is set to:
• the same value as the timestamp length specified in the Teradata table: no problem;
• no value at all (‘tsnano’ is not set): no problem; the length specified in the Teradata table is used;
• less than the specified length: an error table is created in Teradata, but no exception is shown;
• greater than the specified length: the quoted error message is received.
An example of how ‘tsnano’ appears in the JDBC URL is shown below.
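As a sketch, ‘tsnano’ is supplied as a connection parameter in the JDBC URL passed to TDCH; the value 6 matches the Teradata default and is only an example, the table and path names are placeholders, and the exact parameter syntax should be verified against the Teradata JDBC Driver Reference:

hadoop jar $USERLIBTDCH \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -url "jdbc:teradata://testsystem/database=testdb,tsnano=6" \
    -username testuser \
    -password testpassword \
    -jobtype hdfs \
    -sourcepaths /user/mapred/example_hdfs \
    -nummappers 2 \
    -targettable example_td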
6.7.7 Existing Error Table Error Received when Exporting to Teradata in
“internal.fastload” Method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the following error occurs when exporting to Teradata using the “internal.fastload” method:
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 2634] [SQLState HY000] Existing ERROR table(s) or
Incorrect use of export_hdfs_fun1_054815 in Fast Load operation.
This is caused by the existence of the Error table. If an export task is interrupted or aborted while
running, an error table will be generated and stay in the Teradata database. When you then try to run
another export job, the above error will occur.
In this case, the user needs to drop the existing error table manually and then rerun the export job.
6.7.8 No More Room in Database Error Received when Exporting to Teradata
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
com.teradata.hadoop.exception.TeradataHadoopSQLException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.01] [Error 2644] [SQLState HY000] No more room in database testdb.
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
    at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:298)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:179)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:120)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:111)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:372)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:314)
    at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecute(TDStatement.java:277)
    at com.teradata.jdbc.jdbc_4.TDStatement.execute(TDStatement.java:1087)
    at com.teradata.hadoop.db.TeradataConnection.executeDDL(TeradataConnection.java:364)
    at com.teradata.hadoop.mapreduce.TeradataMultipleFastloadOutputProcessor.getRecordWriter(TeradataMultipleFastloadOutputProcessor.java:315)
This is caused by the permanent (perm) space of the database in Teradata being set too low. Please
increase the database's perm space to resolve the issue.
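As a sketch, a DBA can raise the perm space with a statement like the following (the database name
and size are placeholders):
MODIFY DATABASE testdb AS PERM = 100000000000;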
6.7.9 “No more spool space” Error Received when Exporting to Teradata
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException:
[Teradata Database] [TeraJDBC 14.00.00.21] [Error 2646] [SQLState HY000]
No more spool space in example_db.
This is caused by the spool space of the database (or user) in Teradata being set too low. Please
increase the spool space allocation to resolve the issue.
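As a sketch, a DBA can raise the spool allocation with a statement like the following (the name and
size are placeholders; use MODIFY USER instead if example_db is a user rather than a database):
MODIFY DATABASE example_db AS SPOOL = 200000000000;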
Please make sure the separator parameter’s name and value are specified correctly.
6.7.11 Date / Time / Timestamp Format Related Errors
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:138)
java.lang.IllegalArgumentException
at java.sql.Time.valueOf(Time.java:89)
1) When exporting data with time, date or timestamp type from HDFS text files to Teradata:
a) Values of date type in text files should follow the format ‘yyyy-mm-dd’.
b) Values of time type in text files should follow the format ‘hh:mm:ss’.
c) Values of timestamp type in text files should follow the format ‘yyyy-mm-dd
hh:mm:ss[.f...]’, with fewer than 9 fractional-second (nanosecond) digits.
2) When importing data with time, date or timestamp type from Teradata to HDFS text files:
a) Values of date type in text files will follow the format ‘yyyy-mm-dd’.
b) Values of time type in text files will follow the format ‘hh:mm:ss’.
c) Values of timestamp type in text files will follow the format ‘yyyy-mm-dd hh:mm:ss.fffffffff’,
with 9 fractional-second digits.
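For example, a hypothetical pipe-delimited text-file row containing a date, a time and a timestamp
column in the accepted formats would look like:
2015-06-30|14:35:07|2015-06-30 14:35:07.123456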
6.7.12 Japanese Language Problem
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
This error is reported by the Teradata database. One cause is the use of a database with Japanese
language support. When the connector retrieves the table schema from the database, it uses the
following statement:
SELECT TRIM (TRAILING FROM COLUMNNAME) AS COLUMNNAME, CHARTYPE
FROM DBC.COLUMNS WHERE DATABASENAME = (SELECT DATABASE) AND
TABLENAME = $TABLENAME;
The internal database process encounters an invalid character during processing, which may be a
problem starting in Teradata 14.00. The workaround is to set the DBS Control flag
“AcceptReplacementCharacters” to true.
7 FAQ
You do not need to find and install the Teradata JDBC driver, as the latest Teradata JDBC driver
(17.00) is packaged in the TDCH jar file (Note: TDCH 1.6.4+, 1.7.4+ and 1.8.0+ versions are
updated to Teradata JDBC 17.00, whereas TDCH 1.5.10+ still includes Teradata JDBC 16.20). If you
have installed other versions of the Teradata JDBC driver, please ensure they are not included in the
HADOOP_CLASSPATH or LIB_JARS environment variables, so that TDCH uses the version of
the driver packaged with TDCH. If you would like TDCH to use a different JDBC driver, see the
“tdch.input.teradata.jdbc.driver.class” and “tdch.output.teradata.jdbc.driver.class” properties.
• Add “ENCRYPTDATA=ON” parameter to the JDBC connection string URL like below:
o -url jdbc:teradata://$DBSipaddress/database=$DBSdatabase,charset=ASCII,TCP=SEND1024000,ENCRYPTDATA=ON
TDCH provides two parameters, ‘enclosedby’ and ‘escapeby’, for dealing with data containing
separator characters and quote characters in the ‘textfile’ format of an ‘hdfs’ job. The default value
for enclosedby is " (double quote) and for escapeby is \ (backslash). If the file format is not
‘textfile’ or the job type is not ‘hdfs’, these two parameters have no effect.
When neither parameter is specified, TDCH does not enclose or escape any characters in the data
during import, nor does it scan for enclose-by or escape-by characters during export. If either or both
parameters are provided, TDCH processes enclose-by and escape-by values as appropriate, as shown
in the sketch below.
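A minimal sketch of the relevant arguments on an ‘hdfs’ job with ‘textfile’ format (the $TDCH_JAR
variable and the omitted connection arguments are placeholders):
hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
... \
-jobtype hdfs \
-fileformat textfile \
-enclosedby '"' \
-escapeby '\'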
7.5 How do I use a Unicode character as the separator?
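One way to do this, assuming the TDCH release in use accepts Java-style Unicode escapes for the
‘separator’ argument (check the README for your release), is to pass the code point as an escape
sequence:
-separator "\u0001"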
7.6 Why is the actual number of mappers less than the value of -nummappers?
If you specify the number of mappers using the ‘nummappers’ parameter but observe fewer mappers
at execution time, this is expected behavior. TDCH uses the getSplits() method of Hadoop’s
CombineFileInputFormat class to determine the number of splits, and the number of mappers used to
run the job equals the number of splits.
7.7 Why don’t decimal values in Hadoop exactly match the value in Teradata?
When exporting data to Teradata, if the precision of the decimal value is greater than that of the
target Teradata column type, the decimal value will be rounded when stored in Teradata. On the other
hand, if the precision of the decimal value is less than the definition of the column in the Teradata
table, zeros will be appended to the fractional digits.
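As an illustrative example (the column definitions are hypothetical): exporting the value 12.346 into
a DECIMAL(18,2) column stores 12.35, while exporting 12.3 into a DECIMAL(18,4) column stores
12.3000.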
If the column of the Teradata table is defined as Unicode, then you should specify the same character
set in the JDBC URL. Failure to do so will result in incorrect encoding of the transmitted data, and
no exception will be thrown. Furthermore, if you want to display Unicode data correctly in a shell or
other client, remember to configure the client to display UTF-8 as well.
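For example, a sketch of a JDBC URL that requests a Unicode session character set (the host and
database are placeholders):
-url jdbc:teradata://$DBSipaddress/database=$DBSdatabase,CHARSET=UTF8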
Users also have the ability to define their own conversion routines and reference these routines in the
value supplied to the ‘sourcerecordschema’ command line argument. In this scenario, the user would
also need to supply a value for the ‘targetrecordschema’ command line argument, providing TDCH
with information about the record generated by the user-defined converter.
As an example, here's a user-defined converter which replaces occurrences of the term 'foo' in a
source string with the term 'bar':
if (object == null)
return null;
return ((String)object).replaceAll("foo","bar");
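Wrapped in a complete class, the converter might look like the following sketch; the class name,
the package of the ConnectorDataTypeConverter base class, and the unused constructor argument are
assumptions to be checked against the TDCH release in use:

import com.teradata.connector.common.converter.ConnectorDataTypeConverter;

public class FooBarConverter extends ConnectorDataTypeConverter {

    // Single-argument constructor; the argument is not used.
    public FooBarConverter(String unused) {
    }

    // Replace every occurrence of 'foo' with 'bar' in the incoming string value.
    public Object convert(Object object) {
        if (object == null)
            return null;
        return ((String) object).replaceAll("foo", "bar");
    }
}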
This user-defined converter extends the ConnectorDataTypeConverter class, and thus requires an
implementation of the convert(Object) method. At the time of the 1.4 release, user-defined
converters with no-arg constructors were not supported (this was fixed in TDCH 1.4.1); thus this
user-defined converter defines a single-arg constructor, where the input argument is not used. To
compile this user-defined converter, use the following syntax:
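A sketch of the compile command; the TDCH jar path is a placeholder for the jar installed on your
system:
javac -classpath /path/to/teradata-connector.jar FooBarConverter.java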
To run using this user-defined converter, first create a new jar which contains the user-defined
converter's class files:
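A sketch of the packaging step; the class name matches the example above and the jar name matches
the path referenced in the export commands below:
jar cf user-defined-converter.jar FooBarConverter*.class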
Then add the new jar onto the HADOOP_CLASSPATH and LIB_JARS environment variables:
export HADOOP_CLASSPATH=/path/to/user-defined-converter.jar:$HADOOP_CLASSPATH
export LIB_JARS=/path/to/user-defined-converter.jar,$LIB_JARS
Finally, reference the user-defined converter in your TDCH command. As an example, this TDCH
job would export 2 columns from an HDFS file into a Teradata table with one int column and one
string column. The second column in the HDFS file will have the “FooBarConverter” applied to it
before the record is sent to the Teradata table:
com.teradata.connector.common.tool.ConnectorExportTool
-libjars=$LIB_JARS
The ‘-sourcetableschema’ command-line argument is required when reading from HDFS; it is not
needed when reading from Hive, where the table schema is already available.
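For reference, a hypothetical complete invocation might look like the sketch below. The connection
details, paths and schema values are illustrative assumptions, and the exact syntax for referencing the
converter class inside ‘-sourcerecordschema’ should be verified against the README for your TDCH
release:
hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorExportTool \
-libjars $LIB_JARS \
-url jdbc:teradata://$DBSipaddress/database=$DBSdatabase \
-username $DBSuser \
-password $DBSpassword \
-jobtype hdfs \
-fileformat textfile \
-sourcepaths /user/hduser/export_data \
-targettable example_target_table \
-sourcerecordschema "int,FooBarConverter(string)" \
-targetrecordschema "int,string"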
8 Limitations & Known Issues
Please always refer to the README file for a given release of TDCH for the most accurate
limitation information pertaining to that release. The following is a list of general limitations for
awareness.
General Limitations
a) When using the “split.by.amp” method, all empty (zero length) strings in Teradata database will
be converted to NULL when imported into Hadoop.
b) When using the “split.by.value” method, the “split-by column” cannot contain NULL values.
c) If the “split-by column” being used for the “split.by.value” method is VARCHAR or CHAR, it is
limited to characters contained in the ASCII character set only.
d) Source tables must not be empty for TDCH import jobs or else the following error will be
displayed (Note: this check can be overridden in more recent versions of TDCH):
ERROR tool.ImportTool: Import failed:
com.teradata.connector.common.exception.ConnectorException: Input source table is empty
    at com.teradata.connector.common.tool.ConnectorJobRunner.runJob(ConnectorJobRunner.java:142)
e) In the Hive configuration settings, "hive.load.data.owner" must be set to the Linux user name
under which the TDCH job is run; otherwise the TDCH job will fail with an error similar to
"... is not owned by hive and load data is also not ran as hive".
Alternatively, for a non-Kerberos environment, you can set the environment variable:
HADOOP_USER_NAME=hive
h) When using TDCH 1.6.x with HDP 3.1.5, the following property must be added in the
"hive-site.xml" file to avoid Hive ACID related exceptions:
<property>
<name>hive.metastore.client.capabilities</name>
<value>EXTWRITE,EXTREAD,HIVEBUCKET2,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGEDINSERTWRITE,HIVEMANAGEDINSERTREAD,HIVESQL</value>
</property>
i) TDCH 1.7.x supports CDH 6.0 and above and MapR 6.0 and above.
j) TDCH 1.8.x supports CDP Private Cloud Base (CDP Datacenter) 7.1.1 and above.
k) When using TDCH 1.8.x, use the CLI option "hivecapabilities" with the required value to avoid
Hive ACID related exceptions (see the sketch after this item).
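A sketch of the argument; the capability list shown for HDP 3.1.5 in item h) is reused here as an
assumption and should be adjusted to your Hive environment:
-hivecapabilities "EXTWRITE,EXTREAD,HIVEBUCKET2,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGEDINSERTWRITE,HIVEMANAGEDINSERTREAD,HIVESQL"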
l) When using TDCH 1.8.x, with 'split.by.partition' method containing special characters in
partition value, job fails with "Illegal character in path" error.
Bug in CDP 7.1 - https://jira.cloudera.com/browse/CDPD-13277
Any row containing unsupported Unicode characters will be moved to the error tables during a
FastLoad export, or an exception may be thrown during a batch.insert export. For a complete list of
the unsupported Unicode characters in the Teradata database, see Teradata knowledge article KAP1A269E.
less than or equal to the maximum concurrently runnable mapper or reducer tasks allowed by the
Hadoop cluster setting.
e) When “OutputFormat” method is set to “internal.fastload”, the total number of tasks (either
mappers or reducers) must be less than or equal to the total number of AMPs on the target
Teradata system.
f) The “internal.fastload” method will proceed with data transfer only after all tasks are launched;
when a user requests more mappers than the cluster can run concurrently, the job will hang and
time out after 8 minutes. TDCH attempts to check the number of mappers requested by the job
submitter against the value returned from “ClusterStatus.getMaxMapTasks” and generates a
warning message when the requested value is greater than the cluster's maximum map tasks. The
“ClusterStatus.getMaxMapTasks” method returns incorrect results when TDCH is run on
YARN-enabled clusters, and thus the TDCH warning may not always be generated in this
situation.
8.5 Hive
a) The "-hiveconf" option is used to specify the path of a Hive configuration file (see Section 3.1).
It is required for a “hive” or “hcat” job.
With version 1.0.7, the file can be located in HDFS (hdfs://) or in a local file system (file://); see the
sketch after this item. If no URI scheme (hdfs:// or file://) is specified, the default scheme is "hdfs".
If the "-hiveconf" parameter is not specified, the "hive-site.xml" file should be located on
$HADOOP_CLASSPATH, as a local path, before running the TDCH job. For example, if the file
"hive-site.xml" is in "/home/hduser/", a user should export the path using the following
command before running the TDCH job:
export HADOOP_CLASSPATH=/home/hduser/conf:$HADOOP_CLASSPATH
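A sketch of the ‘-hiveconf’ argument with each URI scheme (the paths are placeholders):
-hiveconf file:///home/hduser/conf/hive-site.xml
-hiveconf hdfs:///user/hduser/hive-site.xml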
b) For Hive complex type support, only data type conversion between Hive complex types and
string data types (CLOB, VARCHAR) in Teradata is supported. During export, all Hive
complex-type values are converted to the corresponding JSON strings. During import, a
VARCHAR or CLOB value from Teradata is interpreted as the JSON string of the
corresponding Hive complex data type.
c) To use the ORC file format, please use Hive 0.11 or a higher version.
d) Hive MAP, ARRAY, and STRUCT types are supported for export and import, and are converted
to and from VARCHAR in JSON format on the Teradata system.
e) Hive UNIONTYPE is not yet supported.
f) Partition values cannot be null or empty when importing to a Hive partitioned table.
g) “split.by.partition” is the required method for importing into Hive partitioned tables.
h) '/' and '=' are not supported in the string value of a partition column of a Hive table.
8.6 Avro Data Type Conversion and Encoding
a) For Avro complex types (except UNION), only data type conversion between complex types
and string data types (CLOB, VARCHAR) in Teradata is supported.
b) When importing data from Teradata to an Avro file, if a field's data type in Avro is a UNION
with null and the corresponding source column in the Teradata table is nullable:
i) A NULL value is converted to a null value within a UNION value in the corresponding target
Avro field.
ii) A non-NULL value is converted to a value of the corresponding type within a UNION value in
the corresponding target Avro field.
c) When exporting data from Avro to Teradata, if a field's data type in Avro is a UNION with null and:
i) The target column is nullable, then a NULL value within the UNION is converted to a NULL value
in the target Teradata table column.
ii) The target column is not nullable, then a NULL value within the UNION is converted to a
connector-defined default value of the specified data type.
d) TDCH currently only supports Avro binary encoding.
8.7 Parquet
a) TDCH does not currently support the COMPLEX data type for Parquet.
b) The BINARY and DOUBLE data types are not supported for MapR.
c) For HDP 2.5.x, HDP 2.6.x, Google Cloud Dataproc and Amazon EMR, Parquet is not supported
for Hive import (only export from Hive is supported).
d) The Hive TIMESTAMP data type is not supported, but the Teradata Database TIMESTAMP data
type is supported if the corresponding column in Hive is either the VARCHAR or STRING data
type.
a) TDCH versions 1.5.5 – 1.5.9, 1.6.0 – 1.6.3 and 1.7.0 – 1.7.3 support TD Wallet 16.10.
TD Wallet 16.10 can be installed on the same system as other versions of TD Wallet that are
needed.
b) TDCH versions 1.5.10+, 1.6.4+, 1.7.4+ and 1.8.0+ support TD Wallet 16.20.
TD Wallet 16.20 can be installed on the same system as other versions of TD Wallet that are
needed.
c) TD Wallet creates temporary files in “/tmp” directory path by default. To change the “/tmp”
directory path, set the following environment variable:
export _JAVA_OPTIONS=-Djava.io.tmpdir=/<yourowntmpdir>
9 Recommendations & Requirements
TDCH requires that all nodes in the Hadoop cluster have port 1025 open for both inbound and
outbound traffic.
• If using “internal.fastload” and/or “internal.fastexport”, all ports between 8678 and 65535
should be open for inbound and outbound traffic between all nodes of the Hadoop cluster.
• The protocols start at port 8678 and look for an open port on the coordinator node (the node
where the TDCH job is started from) and go up to port 65535 until they find an open port to
communicate with each mapper.
• If a specific port is defined, then only this port needs to be open for inbound and outbound
traffic between all nodes of the Hadoop cluster.
TDCH 1.5.2+ has been tested and certified to work in AWS with the HDP 2.x and CDH 5.x
distributions as well as Amazon EMR.
TDCH functions normally when the following recommendations regarding security groups and
ports are followed when running TDCH jobs.
a) Update the mapred-site.xml file under the path "/usr/lib/hadoop/etc/hadoop" with the following
property:
<property>
<name>mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
b) The Security Group that the EC2 instances belong to must have port 1025 (the Teradata
database port) open to both inbound and outbound traffic to the Teradata database.
If the Teradata database is also running in AWS and in the same Security Group, port 1025
should be open to all EC2 instances in the group.
c) If using “internal.fastexport” and/or “internal.fastload”, it is recommended that the port for these
protocols be defined using the ‘-fastexportsocketport’ command-line argument for
“internal.fastexport” and ‘-fastloadsocketport’ for “internal.fastload” (see the sketch after this item).
Make sure that this port is open for both inbound and outbound traffic for all EC2 instances in
the security group.
The default port for “internal.fastload” and “internal.fastexport” can be between 8678 and
65535 (the protocol starts looking at port 8678 and increments until an open port is found) so if
all these ports can be open to inbound and outbound traffic within the security group, the port
for “internal.fastload” and/or “internal.fastexport” does not have to be defined.
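A sketch of the arguments; the port number is a placeholder chosen from the allowed range:
-fastloadsocketport 9678
-fastexportsocketport 9678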
d) For Amazon EMR, the jar file “$HIVE_HOME/lib/hive-llap-tez.jar” must be included in both
the “LIB_JARS” and “HADOOP_CLASSPATH”.
a) The firewall rules that the VM instances belong to must have port 1025 (the Teradata database
port) open to both ingress and egress traffic to the Teradata database.
b) If the internal.fastexport and/or internal.fastload protocols are used, ports in their default range
of 8678 through 65535 need to be open for ingress and egress traffic in the firewall rules.
c) All VMs of a Google Cloud Dataproc cluster (both Debian & Ubuntu flavors)
are Debian-based platforms, where rpm installation is not natively supported.
The following commands can be used to install the TDCH rpm file on Debian platforms:
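A sketch using the ‘alien’ package converter; the rpm file name below is illustrative and should
match the TDCH rpm downloaded for your release:
sudo apt-get install alien
sudo alien -i teradata-connector-1.8.x.noarch.rpm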
d) In Google Cloud Dataproc, Hive internally uses Tez as the execution engine instead of
MapReduce, which requires the following change on the job submitter node (master or worker)
for TDCH to function:
Set the following file paths in the “tez.lib.uris” property of the Tez configuration file
"/etc/tez/conf/tez-site.xml":
<property>
<name>tez.lib.uris</name>
<value>file:/usr/lib/tez,file:/usr/lib/tez/lib,
file:/usr/local/share/google/dataproc/lib,
file:/usr/lib/tdch/1.5/lib</value>
</property>
e) The TDCH rpm needs to be installed on all nodes of a Google Cloud Dataproc cluster (this
requirement is unique to Google Cloud Dataproc). When a MapReduce job runs on nodes
other than the job submitter node, Tez accesses the TDCH & TDJDBC jars from the path
"/usr/lib/tdch/1.5/lib" provided in “tez.lib.uris”.
f) The following environment variables are required on job submitter node to run a TDCH job on
Google Cloud Dataproc 1.4.x & 1.5.x (Debian & Ubuntu flavor):
export HIVE_HOME=/usr/lib/hive
export HCAT_HOME=/usr/lib/hive-hcatalog
export HADOOP_USER_NAME=hive
export LIB_JARS=/usr/lib/spark/jars/avro-1.8.2.jar,
/usr/lib/spark/jars/avro-mapred-1.8.2-hadoop2.jar,
/usr/lib/spark/jars/paranamer-2.8.jar,
$HIVE_HOME/conf,$HIVE_HOME/lib/antlr-runtime-3.5.2.jar,
$HIVE_HOME/lib/commons-dbcp-1.4.jar,
$HIVE_HOME/lib/commons-pool-1.5.4.jar,
$HIVE_HOME/lib/datanucleus-api-jdo-4.2.4.jar,
$HIVE_HOME/lib/datanucleus-core-4.1.17.jar,
$HIVE_HOME/lib/datanucleus-rdbms-4.1.19.jar,
$HIVE_HOME/lib/hive-cli-2.3.7.jar,
$HIVE_HOME/lib/hive-exec-2.3.7.jar,
$HIVE_HOME/lib/hive-jdbc-2.3.7.jar,
$HIVE_HOME/lib/hive-metastore-2.3.7.jar,
$HIVE_HOME/lib/jdo-api-3.0.1.jar,
$HIVE_HOME/lib/libfb303-0.9.3.jar,
$HIVE_HOME/lib/libthrift-0.9.3.jar,
$HCAT_HOME/share/hcatalog/hive-hcatalog-core-2.3.7.jar,
/usr/lib/tez/tez-api-0.9.2.jar,
/usr/lib/hive/lib/hive-llap-tez-2.3.7.jar,
/usr/lib/tez/*,
/usr/lib/tez/lib/*,
/etc/tez/conf,
/usr/lib/tdch/1.5/lib/tdgssconfig.jar,
/usr/lib/tdch/1.5/lib/terajdbc4.jar,
/usr/lib/hadoop/lib/hadoop-lzo-0.4.20.jar,
/usr/lib/hive/lib/lz4-1.3.0.jar
export HADOOP_CLASSPATH=/usr/lib/spark/jars/avro-1.8.2.jar:
/usr/lib/spark/jars/avro-mapred-1.8.2-hadoop2.jar:
/usr/lib/spark/jars/paranamer-2.8.jar:$HIVE_HOME/conf:
$HIVE_HOME/lib/antlr-runtime-3.5.2.jar:
$HIVE_HOME/lib/commons-dbcp-1.4.jar:
$HIVE_HOME/lib/commons-pool-1.5.4.jar:
$HIVE_HOME/lib/datanucleus-api-jdo-4.2.4.jar:
$HIVE_HOME/lib/datanucleus-core-4.1.17.jar:
$HIVE_HOME/lib/datanucleus-rdbms-4.1.19.jar:
$HIVE_HOME/lib/hive-cli-2.3.7.jar:
$HIVE_HOME/lib/hive-exec-2.3.7.jar:
$HIVE_HOME/lib/hive-jdbc-2.3.7.jar:
$HIVE_HOME/lib/hive-metastore-2.3.7.jar:
$HIVE_HOME/lib/jdo-api-3.0.1.jar:
$HIVE_HOME/lib/libfb303-0.9.3.jar:
$HIVE_HOME/lib/libthrift-0.9.3.jar:
$HCAT_HOME/share/hcatalog/hive-hcatalog-core-2.3.7.jar:
/usr/lib/tez/tez-api-0.9.2.jar:
/usr/lib/hive/lib/hive-llap-tez-2.3.7.jar:
/usr/lib/tez/*:
/usr/lib/tez/lib/*:
/etc/tez/conf:
/usr/lib/tdch/1.5/lib/tdgssconfig.jar:
/usr/lib/tdch/1.5/lib/terajdbc4.jar:
/usr/lib/hadoop/lib/hadoop-lzo-0.4.20.jar:
/usr/lib/hive/lib/lz4-1.3.0.jar
g) For Google Cloud Dataproc 1.3.x (Debian & Ubuntu flavor), set the two Avro jars below in
LIB_JARS & HADOOP_CLASSPATH:
/usr/lib/hadoop/lib/avro-1.7.7.jar
/usr/lib/spark/jars/avro-mapred-1.7.7-hadoop2.jar
When connecting to Teradata systems with more than one TPA node, always use the COP entry vs.
the IP address or hostname of a single Teradata node when possible. This utilizes the built-in load
balancing mechanism of Teradata and prevents bottlenecks on the Teradata system.
10 Data Compression
TDCH 1.5.2 added the ability to compress output data when importing into Hadoop, as well as
intermediate compression to compress data during the MapReduce job.
For TDCH Export (Hadoop → Teradata), only intermediate compression is available, and the
intermediate compression will always use the Snappy codec.
For TDCH Import (Teradata → Hadoop), the following codecs are supported for output:
The Snappy codec is always used for intermediate compression (if enabled).
For more information on compression in Hadoop, please see the following link:
http://comphadoop.weebly.com/
Appendix A Supported Plugin Properties
TDCH jobs are configured by associating a set of properties and values with a Hadoop configuration
object. The TDCH job’s source and target plugins should be defined in the Hadoop configuration
object using TDCH’s ConnectorImportTool and ConnectorExportTool command line utilities, while
other common and plugin-centric attributes can be defined either by command line arguments or
directly via their java property names. The table below provides some metadata about the
configuration property definitions in this section.
Java Property tdch.plugin.output.processor
tdch.plugin.output.format
tdch.plugin.output.serde
tdch.plugin.data.converter
CLI Argument method
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The three ‘tdch.plugin.output’ properties and the
‘tdch.plugin.data.converter’ property define the target plugin. When
using the ConnectorExportTool, the target plugin will always be one of
the two Teradata target plugins. Submitting a valid value to the
ConnectorExportTool’s ‘method’ command line argument will cause the
three ‘tdch.plugin.output’ properties and the ‘tdch.plugin.data.converter’
property to be assigned values associated with the selected Teradata
target plugin. Users should not define the ‘tdch.plugin.output’ properties
directly.
Required no
Supported Values The following values are supported by the ConnectorExportTool’s
‘method’ argument: batch.insert, internal.fastload
Default Value batch.insert
Case Sensitive yes
Java Property tdch.throttle.num.mappers
CLI Argument throttlemappers
Description Forces the TDCH job to use only as many mappers as the queue
associated with the job can handle concurrently, overriding the user-defined
nummappers value.
Required no
Supported Values true | false
Default Value false
Java Property tdch.num.reducers
CLI Argument
Description The maximum number of output reducer tasks if export is done in reduce
phase.
Required No
Supported Values integers > 0
Default Value 0
Java Property tdch.input.timezone.id
CLI Argument sourcetimezoneid
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The source timezone used during conversions to or from date and time
types.
Required no
Supported Values string
Default Value hadoop cluster’s default timezone
Case Sensitive no
Java Property tdch.output.timestamp.format
CLI Argument targettimestampformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The format of all output string columns, when the input column type is
determined to be a timestamp column.
Required no
Supported Values string
Default Value yyyy-MM-dd HH:mm:ss.SSS
Case Sensitive yes
Java Property tdch.string.truncate
CLI Argument stringtruncate
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description If set to 'true', strings will be silently truncated based on the length of the
target char or varchar column. If set to 'false', when a string is larger than
the target column an exception will be thrown, and the mapper will fail.
Required no
Supported Values true | false
Default Value true
10.4 Teradata Source Plugin Properties
Java Property tdch.input.teradata.jdbc.password
CLI Argument password
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The authentication password used by the source Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use password information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.conditions
CLI Argument sourceconditions
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The SQL WHERE clause (with the WHERE removed) that the source
Teradata plugins will use in conjunction with the
‘tdch.input.teradata.table’ value when reading data from the Teradata
system.
Required no
Supported Values Teradata database supported conditional SQL.
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.data.dictionary.use.xview
CLI Argument usexviews
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description If set to true, the source Teradata plugins will use XViews to get
Teradata system information. This option allows users who have limited access
privileges to run TDCH jobs, though performance may be degraded.
Required no
Supported Values true | false
Default Value false
Java Property tdch.input.teradata.num.partitions
CLI Argument numpartitions
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The number of partitions in the staging table created by the
split.by.partition Teradata source plugin. If the number of mappers is
larger than the number of partitions in the staging table, the value of
‘tdch.num.mappers’ will be overridden with the
‘tdch.input.teradata.num.partitions’ value.
Required no
Supported Values integer greater than 0
Default Value If undefined, ‘tdch.input.teradata.num.partitions’ is set to
‘tdch.num.mappers’.
Java Property tdch.input.teradata.stage.database.for.view
CLI Argument stagedatabaseforview
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The database in the Teradata system in which Teradata Connector for
Hadoop creates the temporary staging view. The stagedatabasefortable and
stagedatabaseforview parameters should not be used along with the
stagedatabase parameter; the stagedatabase parameter is used to specify the
target database in which both the stage table and stage view should be
created.
Required no
Supported Values the name of a database in the Teradata system
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.input.teradata.split.by.column
CLI Argument splitbycolumn
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of a column in the source table which the Teradata
split.by.hash and split.by.value plugins use to split the source data set. If
this parameter is not specified, the first column of the table’s primary
key or primary index will be used.
Required no
Supported Values a valid table column name
Default Value The first column of the table’s primary index
Case Sensitive no
Java Property tdch.input.hive.capabilities
CLI Argument hivecapabilities
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When the target Hive table does not exist,
TDCH will try to create one with the information provided through
targetTableSchema & schemafields.
For this, Hive may require some additional capabilities for that session;
this option is used to pass those Hive capabilities to the processor.
This option is only supported on TDCH 1.8.x.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.teradata.fastexport.coordinator.socket.port
CLI Argument fastexportsocketport
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The port that the Teradata internal.fastexport plugin coordinator will
listen on. The TDCH mappers will communicate with the coordinator on
this port.
Required no
Supported Values integer > 0
Default Value The Teradata internal.fastexport plugin will automatically select an
available port starting from 8678.
10.5 Teradata Target Plugin Properties
Java Property tdch.output.teradata.jdbc.password
CLI Argument password
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The authentication password used by the target Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use password information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.field.count
CLI Argument
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The number of fields to export to the target table in the Teradata system.
Either specify this property or the 'tdch.output.teradata.field.names' property.
Required no
Supported Values integer > 0
Default Value 0
Java Property tdch.output.teradata.query.band
CLI Argument queryband
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, this string is used to set the session-level query band
for the Teradata target plugins.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.stage.table.kept
CLI Argument keepstagetable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description If set to true, the staging table is not dropped by the Teradata target
plugins when a failure occurs during the insert-select operation between
the staging and target tables.
Required no
Supported Values true | false
Default Value false
Java Property tdch.output.teradata.fastload.coordinator.socket.timeout
CLI Argument fastloadsockettimeout
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The number of milliseconds the Teradata internal.fastload coordinator
will wait for connections from TDCH mappers before timing out.
Required no
Supported Values integer > 0
Default Value 480000
Java Property tdch.output.teradata.error.table.database
CLI Argument errortabledatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the database where the error tables will be created by the
internal.fastload Teradata target plugin.
Required no
Supported Values string
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.output.hdfs.logging
CLI Argument hdfslogenable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description Turns on HDFS logging of exceptions for transfers from
HDFS/Hive to the Teradata database.
Required no
Supported Values true | false
Default Value false
10.6 HDFS Source Plugin Properties
Java Property tdch.input.hdfs.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The field separator that the HDFS textfile source plugin uses when
parsing files from HDFS.
Required no
Supported Values string
Default Value \t (tab character)
Case Sensitive yes
Java Property tdch.input.hdfs.null.non.string
CLI Argument nullnonstring
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the HDFS textfile source plugin compares the columns
from the source HDFS files with this value, and when the column value
matches the user-defined value the column is then treated as a null. This
logic is only applied to non-string columns.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.avro.schema
CLI Argument avroschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description A string representing an inline Avro schema. This schema is applied to
the input Avro file in HDFS by the HDFS Avro source plugin. This
value takes precedence over the value supplied for
‘tdch.input.hdfs.avro.schema.file’.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target HDFS plugins will write to the target
HDFS files, in comma separated format. The order of the target field
names needs to match the order of the source field names for schema
mapping.
Required no
Supported Values String
Default Value
Case Sensitive no
Java Property tdch.output.hdfs.line.separator
CLI Argument
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The line separator that the HDFS textfile target plugin uses when writing
files to HDFS.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
Java Property tdch.output.hdfs.enclosed.by
CLI Argument enclosedby
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the HDFS textfile target plugin encloses each column in
the source record with the user-defined characters before writing the
records to files in HDFS.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.avro.schema.file
CLI Argument avroschemafile
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The path to an Avro schema file in HDFS. This schema is used when
generating the output Avro file in HDFS by the HDFS Avro target
plugin.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hive.database
CLI Argument sourcedatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the database in Hive from which the source Hive plugins
will read data.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.table.schema
CLI Argument sourcetableschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description A comma separated schema specification. If defined, the source Hive
plugins will override the schema associated with the
‘tdch.input.hive.table’ table and use the ‘tdch.input.hive.table.schema’
value instead. The ‘tdch.input.hive.table.schema’ value should not
include the partition schema associated with the table.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.fields.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The field separator that the Hive textfile source plugin uses when
reading from Hive delimited tables.
Required no
Supported Values string
Default Value \u0001
Case Sensitive yes
Java Property tdch.output.hive.paths
CLI Argument targetpaths
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The directory in HDFS where the target Hive plugins will write files.
Either specify this or the 'tdch.output.hive.table' parameter, but not
both.
Required no
Supported Values string
Default Value The directory in HDFS associated with the target hive table.
Case Sensitive yes
Java Property tdch.output.hive.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target Hive plugins will write to the table in
Hive. If this property is specified via the 'targetfieldnames' command
line argument, the value should be in comma separated format. If this
property is specified directly via the '-D' option, or any equivalent
mechanism, the value should be in JSON format. The order of the target
field names must match the order of the source field names for schema
mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.null.string
CLI Argument nullstring
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the Hive textfile target plugin replaces null columns in
records generated by the source plugin with this value.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hive.overwrite
CLI Argument
Tool Class
Description If this parameter is set to true and the Hive table is non-partitioned it will
overwrite the entire table. If the table is partitioned the data in each of
the affected partitions will be overwritten and the partition data that is
not affected will remain intact.
Required no
Supported Values true | false
Default Value false
Case Sensitive No
Java Property tdch.input.hcat.field.names
CLI Argument sourcefieldnames
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The names of fields that the source HCat plugin will read from the HCat
table, in comma separated format. The order of the source field names
needs to match the order of the target field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hcat.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target HCat plugin will write to the HCat
table, in comma separated format. The order of the target field names
needs to match the order of the source field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no