Informatica Big Data Management Installation and Configuration Guide
Version 10.1
June 2016
© Copyright Informatica LLC 2014, 2016
This software and documentation contain proprietary information of Informatica LLC and are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any
form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. This Software may be protected by U.S. and/or
international Patents and other Patents Pending.
Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as
provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14
(ALT III), as applicable.
The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us
in writing.
Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange,
PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica
On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging,
Informatica Master Data Management, and Live Data Map are trademarks or registered trademarks of Informatica LLC in the United States and in jurisdictions
throughout the world. All other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights
reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights
reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta
Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems
Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All
rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights
reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights
reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved.
Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-
technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights
reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved.
Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All
rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All
rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright
© EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved. Copyright © Jinfonet Software. All rights reserved. Copyright © Apple Inc. All
rights reserved. Copyright © Telerik Inc. All rights reserved. Copyright © BEA Systems. All rights reserved. Copyright © PDFlib GmbH. All rights reserved. Copyright ©
Orientation in Objects GmbH. All rights reserved. Copyright © Tanuki Software, Ltd. All rights reserved. Copyright © Ricebridge. All rights reserved. Copyright © Sencha,
Inc. All rights reserved. Copyright © Scalable Systems, Inc. All rights reserved. Copyright © jQWidgets. All rights reserved. Copyright © Tableau Software, Inc. All rights
reserved. Copyright© MaxMind, Inc. All Rights Reserved. Copyright © TMate Software s.r.o. All rights reserved. Copyright © MapR Technologies Inc. All rights reserved.
Copyright © Amazon Corporate LLC. All rights reserved. Copyright © Highsoft. All rights reserved. Copyright © Python Software Foundation. All rights reserved.
Copyright © BeOpen.com. All rights reserved. Copyright © CNRI. All rights reserved.
This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions
of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in
writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the Licenses for the specific language governing permissions and limitations under the Licenses.
This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software
copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License
Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any
kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.
The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California,
Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.
This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and
redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.
This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg, <[email protected]>. All Rights Reserved. Permissions and limitations regarding this
software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or
without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.dom4j.org/license.html.
The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to
terms available at http://dojotoolkit.org/license.
This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations
regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.
This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at
http://www.gnu.org/software/kawa/Software-License.html.
This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project Copyright © 2002 Cable & Wireless
Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.
This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are
subject to terms available at http://www.boost.org/LICENSE_1_0.txt.
This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at
http://www.pcre.org/license.txt.
This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.
This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://
www.stlport.org/doc/ license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://
httpunit.sourceforge.net/doc/ license.html, http://jung.sourceforge.net/license.txt , http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/
license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-
agreements/fuse-message-broker-v-5-3- license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html;
http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; . http://www.w3.org/Consortium/Legal/
2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://
forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://
www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://
www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/
license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://
www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js;
http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://
protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5-
current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/
blob/master/LICENSE; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?
page=documents&file=license; https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/
blueprints/blob/master/LICENSE.txt; http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/
twbs/bootstrap/blob/master/LICENSE; https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/
master/LICENSE, and https://github.com/apache/hbase/blob/master/LICENSE.txt.
This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution
License (http://www.opensource.org/licenses/cddl1.php) the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License
Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/
licenses/BSD-3-Clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artistic-
license-1.0) and the Initial Developer’s Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).
This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this
software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab.
For further information please visit http://www.extreme.indiana.edu/.
This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject
to terms of the MIT license.
DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied
warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The
information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is
subject to change at any time without notice.
NOTICES
This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software
Corporation ("DataDirect") which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT
INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT
LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.
Table of Contents
Installing the Address Reference Data Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Update Hadoop Cluster Configuration Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Enable Developer Tool Communication with the Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . 25
Enable Support for Lookup Transformations with Teradata Data Objects. . . . . . . . . . . . . . . . . . 25
Big Data Management Configuration Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Use Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Use Apache Ambari. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Use SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Download the JDBC Driver JAR Files for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . 33
Add Hadoop Environment Variable Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Configure Run-time Engines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Blaze Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Spark Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Hive Engine Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Create a Staging Directory on HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Configure the HiveServer2 Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Configure HiveServer2 for DB2 Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Disable SQL Standard Based Authorization for HiveServer2. . . . . . . . . . . . . . . . . . . . . . . 80
Configuring Big Data Management in the Hortonworks HDP Environment. . . . . . . . . . . . . . . . . 80
Configure Hadoop Cluster Properties for the Data Integration Service. . . . . . . . . . . . . . . . . 81
Configure the Mapping Logic Pushdown Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
HiveServer 2 Configuration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Update Cluster Configuration Settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Configuring Big Data Management in the IBM BigInsights Environment. . . . . . . . . . . . . . . . . . . 93
User Account for the JDBC and Hive Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Enable Support for Data Quality Capabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Create the HiveServer2 Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Enable Support for HBase with HiveServer2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Configuring Big Data Management in the MapR Environment. . . . . . . . . . . . . . . . . . . . . . . . . 97
Verify the Cluster Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Install the EBF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Configure the Informatica Domain to Communicate with a Kerberos-Enabled MapR 5.1
Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Configure Run-time Engines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Configure the Informatica Domain to Communicate with a Cluster that Uses MapR Ticket
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Configure Hive and HDFS Metadata Fetch for MapR Ticket or Kerberos. . . . . . . . . . . . . . . 106
Running Mappings Using the Teradata Connector for Hadoop on a Hive or Blaze Engine. . . 107
Configure Environment Variables for MapR 5.1 in the Hadoop Environment Properties File. . 107
Configure Hadoop Cluster Properties on the Data Integration Service Machine for
MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Configure yarn-site.xml for MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Edit warden.conf to Configure Heap Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Configuring Big Data Management for a Highly Available Hortonworks HDP Cluster. . . . . . . 118
Configuring Big Data Management for a Highly Available IBM BigInsights Cluster. . . . . . . . . . . 119
Configuring Informatica for Highly Available MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Preface
The Informatica Big Data Management Installation and Configuration Guide is written for the system
administrator who is responsible for installing Informatica Big Data Management. This guide assumes you
have knowledge of operating systems, relational database concepts, and the database engines, flat files, or
mainframe systems in your environment. This guide also assumes you are familiar with the interface
requirements for the Hadoop environment.
Informatica Resources
Informatica Network
Informatica Network hosts Informatica Global Customer Support, the Informatica Knowledge Base, and other
product resources. To access Informatica Network, visit https://network.informatica.com.
To access the Knowledge Base, visit https://kb.informatica.com. If you have questions, comments, or ideas
about the Knowledge Base, contact the Informatica Knowledge Base team at
[email protected].
Informatica Documentation
To get the latest documentation for your product, browse the Informatica Knowledge Base at
https://kb.informatica.com/_layouts/ProductDocumentation/Page/ProductDocumentSearch.aspx.
If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation
team through email at [email protected].
Informatica Product Availability Matrixes
Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types
of data sources and targets that a product release supports. If you are an Informatica Network member, you
can access PAMs at
https://network.informatica.com/community/informatica-network/product-availability-matrices.
Informatica Velocity
Informatica Velocity is a collection of tips and best practices developed by Informatica Professional Services.
Developed from the real-world experience of hundreds of data management projects, Informatica Velocity
represents the collective knowledge of our consultants who have worked with organizations from around the
world to plan, develop, deploy, and maintain successful data management solutions.
If you are an Informatica Network member, you can access Informatica Velocity resources at
http://velocity.informatica.com.
If you have questions, comments, or ideas about Informatica Velocity, contact Informatica Professional
Services at [email protected].
Informatica Marketplace
The Informatica Marketplace is a forum where you can find solutions that augment, extend, or enhance your
Informatica implementations. By leveraging any of the hundreds of solutions from Informatica developers and
partners, you can improve your productivity and speed up time to implementation on your projects. You can
access Informatica Marketplace at https://marketplace.informatica.com.
To find your local Informatica Global Customer Support telephone number, visit the Informatica website at the
following link: http://www.informatica.com/us/services-and-training/support-services/global-support-centers.
If you are an Informatica Network member, you can use Online Support at http://network.informatica.com.
CHAPTER 1
Big Data Management Installation
This chapter includes the following topics:
• Installation Overview, 10
• Before You Begin, 11
• Big Data Management Installation from an RPM Package, 14
• Big Data Management Installation from a Debian Package, 17
• Big Data Management Installation from a Cloudera Parcel Package, 19
• Informatica Big Data Management Uninstallation, 20
Installation Overview
The Informatica Big Data Management installation package includes the Data Integration Service, the Blaze
run-time engine, and adapter components. Depending on your Hadoop implementation, Informatica
distributes the package to the Hadoop cluster as one of the following package types:
Debian package
To install Big Data Management on Ubuntu Hadoop distributions on Azure HDInsight, use the tar.gz file, which includes a Debian package and the binary files that you need to run the Big Data Management installation.
After you complete the installation, you must configure the Informatica domain and the Hadoop cluster to
enable Informatica mappings to run on a Hadoop cluster.
Installing in a Single Node Environment
You can install Big Data Management in a single node environment.
1. Extract the Big Data Management tar.gz file to a machine on the cluster.
2. Install Big Data Management by running the installation shell script in a Linux environment. You can
install Big Data Management from the primary name node or from any machine using the
HadoopDataNodes file.
In the HadoopDataNodes file, add the IP address or machine host name of each node in the Hadoop cluster, one per line. During the Big Data Management installation, the installation shell
script picks up all of the nodes from the HadoopDataNodes file and copies the Big Data Management
binary files to the /<BigDataManagementInstallationDirectory>/Informatica directory on each of the
nodes.
Run the Informatica services installation to configure the Informatica domain and create the Informatica
services. Run the Informatica client installation to install the Informatica client tools.
To run Informatica mappings in a Hadoop environment, you must install and configure Informatica adapters.
You can use the following Informatica adapters as part of Big Data Management:
• Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce. The Hadoop installation
should include a Hive data warehouse that is configured to use a non-embedded database as the
MetaStore. For more information, see the Apache website here: http://hadoop.apache.org.
• To perform both read and write operations in native mode, install the required third-party client software.
For example, install the Oracle client to connect to the Oracle database.
• Verify that the Big Data Management administrator user can run sudo commands or has root user privileges.
• Verify that the temporary folder on the local node has at least 700 MB of disk space.
• Download the following file to the temporary folder: InformaticaHadoop-
<InformaticaForHadoopVersion>.tar.gz
• Extract the following file to the local node where you want to run the Big Data Management installation:
InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz
To verify that you can distribute Big Data Management to the Hadoop cluster with one of the protocols,
perform the following tasks:
Note: If you use Cloudera Manager to distribute Big Data Management to the Hadoop cluster, skip these
tasks.
1. Ensure that the server or service for your distribution method is running.
2. In the config file on the machine where you want to run the Big Data Management installation, set the
DISTRIBUTOR_NODE parameter to the following setting:
• FTP: Set DISTRIBUTOR_NODE=ftp://<Distributor Node IP Address>/pub
• HTTP: Set DISTRIBUTOR_NODE=http://<Distributor Node IP Address>
• NFS: Set DISTRIBUTOR_NODE=<Shared file location on the node.>
The file location must be accessible to all nodes in the cluster.
• The Big Data Management administrator can run sudo commands or has root user privileges.
• The temporary folder in each of the nodes on which Big Data Management will be installed has at least
700 MB of disk space.
Big Data Management requires a Secure Shell (SSH) connection without a password between the machine
where you want to run the Big Data Management installation and all the nodes in the Hadoop cluster.
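For example, one common way to establish a password-less SSH connection is to generate an SSH key pair and copy the public key to each cluster node. The user name and host name below are placeholders, and your security policies might require a different approach:
ssh-keygen -t rsa
ssh-copy-id <user>@<hadoop_node_host_name>
Repeat the ssh-copy-id command for each node listed in the HadoopDataNodes file.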
You can install Big Data Management in a single node environment. You can also install Big Data
Management in a cluster environment from the primary name node or from any machine.
1. Verify that the Big Data Management administrator has user root privileges on the node that will be
running the Big Data Management installation.
2. Log in to the machine as the root user.
3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop
cluster on which you want to install Big Data Management. The HadoopDataNodes file is located on the
node from where you want to launch the Big Data Management installation. Add one IP address or machine host name per line in the file.
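For example, a HadoopDataNodes file for a three-node cluster might contain entries similar to the following. The host names are illustrative:
hadoop-node01.example.com
hadoop-node02.example.com
hadoop-node03.example.com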
To enable Big Data Management in an Ubuntu Hadoop cluster environment, download, decompress, and run
the product installer.
Note: The default installation location of Informatica Hadoop binaries is /opt/Informatica. This location
cannot be changed.
You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more
information about the tasks performed by the installer.
1. Verify that the Big Data Management administrator has user root privileges on the node that will be
running the Big Data Management installation.
2. Log in to the machine as the root user.
3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop
cluster on which you want to install Big Data Management.
Add one IP address or machine host name per line in the file.
Note: The HadoopDataNodes file is located on the node from where you want to launch the Big Data
Management installation.
4. Run the following command to start the Big Data Management installation in console mode:
sudo bash InformaticaHadoopInstall.sh
5. Press y to accept the Big Data Management terms of agreement.
6. Press Enter.
7. Press 2 to install Big Data Management in a cluster environment.
8. Press Enter.
To enable Big Data Management in a Cloudera Hadoop cluster environment, download, decompress, and run
the product installer.
Note: The default installation location of Informatica Hadoop binaries is /opt/Informatica. This location
cannot be changed.
To uninstall Big Data Management on Cloudera, see “Uninstalling Big Data Management on Cloudera” on
page 21.
1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine where you want to run the Big Data Management
installation and all of the nodes on which Big Data Management will be uninstalled.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file,
verify that the HadoopDataNodes file contains the IP addresses or machine host names of each of the
nodes in the Hadoop cluster from which you want to uninstall Big Data Management. The
HadoopDataNodes file is located on the node from where you want to launch the Big Data Management
installation. Add one IP address or machine host name per line in the file.
Post-Installation Tasks
This chapter includes the following topics:
• Post-Installation Overview, 22
• Reference Data Requirements, 23
• Update Hadoop Cluster Configuration Parameters, 24
• Enable Developer Tool Communication with the Hadoop Cluster, 25
• Enable Support for Lookup Transformations with Teradata Data Objects, 25
• Big Data Management Configuration Utility, 26
• Download the JDBC Driver JAR Files for Sqoop Connectivity, 33
• Add Hadoop Environment Variable Properties, 33
• Configure Run-time Engines, 34
Post-Installation Overview
After you install Big Data Management, perform the post-installation tasks to ensure that Big Data
Management runs properly.
Reference Data Requirements
If you have a Data Quality product license, you can push a mapping that contains data quality
transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data
values are accurate and correctly formatted.
When you apply a pushdown operation to a mapping that contains data quality transformations, the operation
can copy the reference data that the mapping uses. The pushdown operation copies reference table data,
content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster
deletes the reference data that the pushdown operation copied with the mapping.
Note: The pushdown operation does not copy address validation reference data. If you push a mapping that
performs address validation, you must install the address validation reference data files on each DataNode
that runs the mapping. The cluster does not delete the address validation reference data files after the
address validation mapping runs.
Address validation mappings validate and enhance the accuracy of postal address records. You can buy
address reference data files from Informatica on a subscription basis. You can download the current address
reference data files from Informatica at any time during the subscription period.
1. Browse to the address reference data files that you downloaded from Informatica.
You download the files in a compressed format.
2. Extract the data files.
3. Copy the files to the name node machine or to another machine that can write to the DataNodes.
4. Create an automation script to copy the files to each DataNode.
• If you copied the files to the name node, use the slaves file for the Hadoop cluster to identify the
DataNodes. If you copied the files to another machine, use the Hadoop_Nodes.txt file to identify the
DataNodes.
Find the Hadoop_Nodes.txt file in the Big Data Management installation package.
• The default directory for the address reference data files in the Hadoop environment
is /reference_data. If you install the files to a non-default directory, create the following custom
property on the Data Integration Service to identify the directory:
AV_HADOOP_DATA_LOCATION
Create the custom property on the Data Integration Service that performs the pushdown operation in
the native environment.
5. Run the automation script.
The script copies the address reference data files to the DataNodes.
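For example, if you copied the extracted data files to /tmp/reference_data on the name node, a minimal automation script might look similar to the following. The slaves file location, the staging path, and the target directory are assumptions that you should adjust for your cluster:
#!/bin/bash
# Copy the address reference data files to every DataNode listed in the slaves file.
while read -r datanode; do
    ssh "$datanode" "mkdir -p /reference_data"
    scp /tmp/reference_data/* "$datanode:/reference_data"
done < /etc/hadoop/conf/slaves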
The following cluster configuration parameters in mapred-site.xml can override the Java library path set in
hadoopEnv.properties:
• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts
If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, mappings
can fail to run in a Hadoop environment.
• Update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path
from the property configuration.
• Edit hadoopEnv.properties to include the user Hadoop libraries in the Java Library path.
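For example, after you remove -Djava.library.path from the map task configuration, the property in mapred-site.xml might look similar to the following. The remaining option values are illustrative only:
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-server -Djava.net.preferIPv4Stack=true</value>
</property>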
• -DINFA_HADOOP_DIST_DIR=hadoop\<Hadoop_distribution_name>_<version_number>
For example, the distribution name for a Hadoop cluster that runs MapR version 5.1 is mapr_5.1.0.
If you use the MapR distribution you must also set the MAPR_HOME environment variable to run MapR
mappings in a Hadoop environment. Perform the following additional tasks:
- -Dmapr.library.flatclass
• Edit run.bat to set the MAPR_HOME environment variable and the -clean settings.
For example, include the following lines:
MAPR_HOME=<InformaticaClientInstallationDirectory>/<version>/clients/DeveloperClient
\hadoop\mapr_<version>
developerCore.exe -clean
• Copy mapr-cluster.conf to the following directory on the machine where the Developer tool runs:
<Informatica installation directory>\<version>\clients\DeveloperClient\hadoop
\mapr_<version>\conf.
You can find mapr-cluster.conf in the following directory on any node in the Hadoop cluster: <MapR
installation directory>/conf
You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the
following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.
The software available for download at the referenced links belongs to a third party or third parties, not
Informatica LLC. The download links are subject to the possibility of errors, omissions or change. Informatica
assumes no responsibility for such links and/or such software, disclaims all warranties, either express or
implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title
and non-infringement, and disclaims all liability relating thereto.
Copy the tdgssconfig.jar and terajdbc4.jar files from the Teradata JDBC drivers to the following
directory on the machine where the Data Integration Service runs and on every node in the Hadoop cluster:
<Informatica installation directory>/externaljdbcjars
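For example, if you extracted the Teradata JDBC driver files to /tmp/teradata-jdbc, you might copy them with commands similar to the following. The paths and the use of the HadoopDataNodes file to list the cluster nodes are assumptions:
cp /tmp/teradata-jdbc/tdgssconfig.jar /tmp/teradata-jdbc/terajdbc4.jar "<Informatica installation directory>/externaljdbcjars"
while read -r node; do
scp /tmp/teradata-jdbc/tdgssconfig.jar /tmp/teradata-jdbc/terajdbc4.jar "$node:<Informatica installation directory>/externaljdbcjars"
done < HadoopDataNodes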
The Big Data Management Configuration Utility assists with the following tasks:
• Creates configuration files on the machine where the Data Integration Service runs.
• Creates connections between the cluster and the Data Integration Service.
• Updates Data Integration Service properties in preparation for running mappings on the cluster.
After you run the utility, complete the configuration process for Big Data Management.
Note: The utility does not support Big Data Management for the following distributions:
1. On the machine where the Data Integration Service runs, open the command line.
2. Go to the following directory: <Informatica installation directory>/tools/BDMUtil.
3. Run BDMConfig.sh.
4. Press Enter.
5. Choose the Hadoop distribution that you want to use to configure Big Data Management:
Option Description
1 Cloudera CDH
2 Hortonworks HDP
3 MapR
4 IBM BigInsights
Note: Select only 1 for Cloudera or 2 for Hortonworks. At this time, the utility does not support
configuration for MapR or BigInsights.
6. Based on the option you selected in step 5, see the corresponding topic to continue with the
configuration process:
• “Use Cloudera Manager” on page 27
• “Use Apache Ambari” on page 29
• “Use SSH” on page 31
Option Description
1 Cloudera Manager. Select this option to use the Cloudera Manager API to access files on the Hadoop cluster.
2 Secure Shell (SSH). Select this option to use SSH to access files on the Hadoop cluster. This option requires
SSH connections to the machines that host the name node, JobTracker, and Hive client. If you select this
option, Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed.
Option Description
Option Description
Option Description
1 No. Select this option to update Data Integration Service properties later.
2 Yes. Select this option to update Data Integration Service properties now.
Option Description
Note: Edit the connection name, domain username and password to use the generated commands.
HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.
Option Description
1 Apache Ambari. Select this option to use the Ambari REST API to access files on the Hadoop cluster.
2 Secure Shell (SSH). Select this option to use SSH to access files on the Hadoop cluster. This option requires
SSH connections to the machines that host the name node, JobTracker, and Hive client. If you select this
option, Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed.
Option Description
Option Description
Option Description
1 No. Select this option to update Data Integration Service properties later.
2 Yes. Select this option to update Data Integration Service properties now.
Option Description
Note: Edit the connection name, domain username and password to use the generated commands.
HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.
Use SSH
If you choose SSH, you must provide host names and Hadoop configuration file locations.
Note: Informatica recommends that you use an SSH connection without a password or have sshpass or
Expect installed. If you do not use one of these methods, you must enter the password each time the utility
downloads a file from the Hadoop cluster.
Verify the following host names: name node, JobTracker, and Hive client. Additionally, verify the locations for
the following files on the Hadoop cluster:
• hdfs-site.xml
• core-site.xml
• mapred-site.xml
• yarn-site.xml
• hive-site.xml
Perform the following steps to configure Big Data Management:
Option Description
Note: Edit the connection name, domain username and password to use the generated commands.
HiveServer2_EnvInfa.txt
Contains the list of environment variables and values that need to be copied to the HiveServer2
environment on the Hadoop cluster. This file is created only if you choose HiveServer2.
You can use any Type 4 JDBC driver that the database vendor recommends for Sqoop connectivity.
Note: The DataDirect JDBC drivers that Informatica ships are not licensed for Sqoop connectivity.
1. Download the JDBC driver jar files for the database that you want to connect to.
2. On the node where the Data Integration Service runs, copy the JDBC driver jar files to the following
directory:
<Informatica installation directory>/externaljdbcjars
If the Data Integration Service runs on a grid, repeat this step on all nodes in the grid.
Perform this task manually if you do not use the Big Data Management configuration utility. For more
information about the utility, see “Big Data Management Configuration Utility” on page 26.
When you choose the native run-time engine, Big Data Management uses the Data Integration Service to run
mappings on the Informatica domain. You can also choose a run-time engine to run mappings in the Hadoop
environment. This pushes mapping run processing to the cluster.
When you want to run mappings on the cluster, you choose from the following run-time engines:
Blaze engine
The Blaze engine is an Informatica software component that can run mappings on the Hadoop cluster.
Spark engine
Spark is an Apache project that provides a run-time engine that can run mappings on the Hadoop
cluster.
Hive engine
When you run mappings on the Hive run-time engine, you choose Hive Command Line Interface or
HiveServer 2.
• Azure HDInsight
• Cloudera CDH
• Hortonworks HDP Hadoop
• MapR
• IBM BigInsights
Skip the tasks for the Blaze engine if you run Big Data Management on another Hadoop distribution.
Perform the following configuration tasks in the Big Data Management installation:
Depending on the Hadoop environment, you perform additional steps in the Hadoop cluster to allow Big Data
Management to use the Blaze engine to run mappings. See Chapter 3, “Configuring Big Data Management to Run Mappings in Hadoop Environments” on page 41.
Grant the user account that starts the Blaze engine write permission on the log directories specified in the following properties:
• infagrid.node.local.root.log.dir
• infacal.hadoop.logs.directory
For more information about user accounts for the Blaze engine, see the Informatica Big Data Management
Security Guide.
To get a list of the operating system settings, including the file descriptor limit, run the following command:
C Shell
limit
Bash Shell
ulimit -a
Informatica service processes can use a large number of files. Set the file descriptor limit per process to
16,000 or higher. The recommended limit is 32,000 file descriptors per process.
To change system settings, run the limit or ulimit command with the pertinent flag and value. For example, to
set the file descriptor limit, run the following command:
C Shell
limit descriptors <value>
Bash Shell
ulimit -n <value>
Note: Skip this task if the Blaze engine does not support the distribution that the Hadoop cluster runs.
When you create the Hadoop connection, specify the port range that the Blaze engine can use with the
minimum port and maximum port fields.
Allocate the following types of resource for each container on the cluster:
Memory
Random Access Memory (RAM) available for each container. This setting is also known as the container
size. You can set the minimum and maximum memory per container.
• Set the minimum container memory to allow the VM to spawn sufficient containers.
• Set maximum memory on the cluster to increase resource memory available to Blaze services.
Vcore
A vcore is a virtual core. The number of virtual cores per container may correspond to the number of
physical cores on the cluster, but you can increase the number to allow for more processing. You can set
the minimum and maximum number of vcores per container.
Runtime node (runs mappings only)
- Minimum memory: Set to no less than 4 GB less than the maximum memory.
- Maximum memory: At least 10 GB.
- Vcores: 6
Management node (a single node that runs mappings and management services)
- Minimum memory: Set to no less than 4 GB less than the maximum memory.
- Maximum memory: At least 13 GB.
- Vcores: 9
Set the resources in the configuration console for the cluster, or edit the file yarn-site.xml.
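For example, to set the container memory and vcore limits for a management node in yarn-site.xml, you might use properties similar to the following. The property names are standard YARN scheduler settings, and the values are illustrative only:
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>9216</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>13312</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>9</value>
</property>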
To get a list of the operating system settings, including the file descriptor limit, run the following command:
C Shell
limit
Bash Shell
ulimit -a
Informatica service processes can use a large number of files. Set the file descriptor limit per process to
16,000 or higher. The recommended limit is 32,000 file descriptors per process.
To change system settings, run the limit or ulimit command with the pertinent flag and value. For example, to
set the file descriptor limit, run the following command:
C Shell
limit descriptors <value>
Bash Shell
ulimit -n <value>
1. Locate the Spark shuffle .jar file and note the location.
• For HortonWorks implementations, the file is located in the following path: /opt/Informatica/
services/shared/hadoop/hortonworks_<version_number>/spark/lib/spark-<version_number>-
yarn-shuffle.jar
• For Cloudera implementations, the file is located in the following path: /<Informatica installation
directory>/services/shared/hadoop/cloudera_<version_number>/spark/lib/spark-
<version_number>-yarn-shuffle.jar
2. Add the Spark shuffle .jar file location to the classpath of each cluster node manager.
3. Edit the yarn-site.xml file in each cluster node manager.
The file is located in the following location:
• For HortonWorks implementations, the file is located in the following path: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
• For Cloudera implementations, the file is located in the following path: <Informatica installation
directory>/services/shared/hadoop/cloudera_cdh<version>/conf
a. Change the value of the yarn.nodemanager.aux-services property as follows:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
b. Add the following property-value pair:
yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService
3. Locate the spark.executor.instances property and place a # character at the beginning of the line to
comment it out.
Note: If you enable dynamic allocation for the Spark engine, Informatica recommends that you comment
out this property.
After editing, the line appears as follows:
#spark.executor.instances=100
The default tool is Hive CLI. When you choose the Hive Command Line Interface (CLI) to run mappings, no
configuration is required.
Alternatively, you can edit the hadoopEnv.properties file to choose Hive CLI or HiveServer2. You can find
the hadoopEnv.properties file in the following directory: <Informatica installation directory>/
services/shared/hadoop/<Hadoop_distribution_name>/infaConf.
1. Assign the required permissions on the cluster to the user account specified in the Hive connection.
For example, the user account testuser1 belongs to the "Other" user group. To use this account, verify
that the "Other" user group has permissions on the Hive Warehouse Directory.
Additionally, testuser1 must have the following permissions:
• Full permission on the staging directory
• Full permission on the /tmp/hive-<username> directory
• Read and write permission on the /tmp directory
2. Edit the Hadoop environment properties file to set HiveServer2 as the tool to run mappings.
Note: Skip this step if you used the Big Data Management Utility to configure Hadoop properties for Big
Data Management.
a. Browse to the hadoopEnv.properties file in the following directory: <Informatica installation
directory>/services/shared/hadoop/hortonworks_<version_number>/infaConf.
The hadoopEnv.properties file contains two entries for the infapdo.aux.jars.path property.
The default value is Hive CLI, and the entry for HiveServer2 is commented out.
b. To use HiveServer2, comment out the Hive CLI entry, and uncomment the HiveServer2 entry.
3. Use the Administrator tool in the Informatica domain to configure the Data Integration Service for
HiveServer2.
Note: Skip this step if you used the Big Data Management Utility to configure Hadoop properties for Big
Data Management.
a. Log in to the Administrator tool.
b. In the Domain Navigator, select the Data Integration Service.
c. In the Processes tab, create the following custom property:
ExecutionContextOptions.hive.executor.
d. Set the value to hiveserver2.
e. Recycle the Data Integration Service.
4. Disable SQL-based authorization for HiveServer2.
5. Optionally, enable storage-based authorization.
For example, the user account testuser1 belongs to the "Other" user group. Verify that the "Other" user group
has permissions on the Hive Warehouse Directory.
Additionally, testuser1 must have the following permissions on the HDFS directories:
Troubleshooting HiveServer2
Consider the following troubleshooting tips when you configure HiveServer2.
A mapping fails with the following error: java.lang.OutOfMemoryError: Java heap space
Increase the heap size that MapReduce can use with HiveServer2 to run mappings.
To configure the heap size, you must edit hadoopEnv.properties. You can find hadoopEnv.properties in
the following directory: <Informatica installation directory>/services/shared/hadoop/
hortonworks_<version>/infaConf.
The following sample text shows the infapdo.java.opts property with a modified heap size:
infapdo.java.opts=-Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/native:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native -Djava.security.egd=file:/dev/./urandom -Xms3150m -Xmx6553m -XX:MaxPermSize=512m
After you enable Informatica mappings to run on a Hadoop cluster, you must configure the Big Data
Management Client files to communicate with a Hadoop cluster on a particular Hadoop distribution. You can
use the Big Data Management Configuration Utility to automatically configure some of the properties. After
you run the utility, you must complete the configuration for your Hadoop distribution.
Alternatively, you can manually configure Big Data Management without the utility.
The following table describes the Hadoop distributions and schedulers that you can use with Big Data
Management:
Hadoop Distribution Scheduler
You might also have to perform additional steps, depending on your Hadoop environment.
The default Hadoop RPM installation sets hive.optimize.ppd to FALSE. Retain this value.
exec.dynamic.partition.mode
Set this property to nonstrict. This allows all partitions to be dynamic.
You can optionally add third-party environment variables or extend the existing PATH environment variable in
hadoopEnv.properties.
You configure some environment variables for all Hadoop distributions. Other environment variables that you
configure depend on the Hadoop distribution.
Configure the following library path and path environment variables for all Hadoop distributions:
• When you run mappings in a Hadoop environment, configure the ODBC library path before the Teradata
library path. For example, infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=
$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/ODBC7.0/lib/:/opt/
teradata/client/13.10/tbuild/lib64:/opt/teradata/client/13.10/odbc_64/lib:/databases/
oracle11.2.0_64BIT/lib:/databases/db2v9.5_64BIT/lib64/:$HADOOP_NODE_INFA_HOME/
DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-
amd64-64:$LD_LIBRARY_PATH .
Before you can configure Big Data Management 10.1 to enable mappings to run on an Amazon EMR cluster,
you must download and install EBF 17557 on top of Big Data Management 10.1.
After you install the EBF, you complete the following configuration tasks:
1. Make a note of the master host node name from the cluster at the following location:
/etc/hadoop/conf/yarn-site.xml
2. Open the following file for editing:
<Informatica_installation_directory>/conf/yarn-site.xml
3. Replace all instances of HOSTNAME with the master host node name.
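For example, if the master host node name is ip-10-0-0-12.ec2.internal, you might make the replacement with a command similar to the following. The host name is illustrative:
sed -i 's/HOSTNAME/ip-10-0-0-12.ec2.internal/g' <Informatica_installation_directory>/conf/yarn-site.xml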
The following sample text shows the properties you configure in the hive-site.xml file:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value><your-s3-access-key-id></value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value><your-s3-accesskey></value>
</property>
Configure the Hadoop Pushdown Properties for the Data Integration Service
Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hadoop
environment.
You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.
Informatica Home Directory on Hadoop
The Big Data Management home directory on every data node created by the Hadoop RPM install. Type /opt/Informatica.
Hadoop Distribution Directory
The directory containing a collection of Hive and Hadoop JARS on the cluster from the RPM install locations. The directory contains the minimum set of JARS required to process Informatica mappings in a Hadoop environment. Type /opt/Informatica/services/shared/hadoop/amazon_emr<version_number>.
Data Integration Service Hadoop Distribution Directory
The Hadoop distribution directory on the Data Integration Service node. Type ../../services/shared/hadoop/amazon_emr<version_number>. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.
• 8020
• 8032
• 8080
• 9083
• 9080 -- for the Blaze monitoring console
• 12300 to 12600 -- for the Blaze engine.
Optionally, you can also open the following ports for debugging: 8088, 19888, and 50070.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>
To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
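For example, after step 2 the edited line in hadoopEnv.properties might look like the following:
infagrid.blaze.console.enabled=true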
For example, if the warehouse directory is /user/hive/warehouse, run the following command to grant the
user permissions for the directory:
hadoop fs -chmod -R 777 /user/hive/warehouse
To enable Informatica mappings to run on an HDInsight cluster, complete the following steps:
1. Verify prerequisites.
2. Perform post-distribution tasks.
3. Populate the HDFS File System.
4. Update cluster configuration settings.
• You have an instance of HDInsight in a supported Linux cluster up and running on the Azure environment.
Refer to the Product Availability Matrix on the Informatica Network for all platform compatibility details.
• You have permission to access and administer the HDInsight instance, and to get the names and
addresses of cluster resources and other information from cluster configuration pages.
• If HBase is not already installed, install it.
Informatica supports read/write from the local HDFS location, but not the wasb location. In an HDInsight
cluster, the default environment has a local HDFS location that is empty, and a wasb location populated with
files. Perform the following steps to copy files from the wasb location to the local HDFS location:
1. Use the Ambari configuration tool to identify the wasb location and the HDFS location.
You can find these locations as follows:
wasb location
The wasb location is a resource locator like:
wasb://<cluster_name>@<domain_or_IP-address>/
HDFS location
The HDFS location is a resource locator like:
hdfs://<headnode_IP_address>:<port_number>/
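One way to copy the files after you identify both locations (a sketch, not necessarily the only supported method) is to run hadoop distcp from a cluster node, for example:
hadoop distcp wasb://<cluster_name>@<domain_or_IP-address>/<source_path> hdfs://<headnode_IP_address>:<port_number>/<target_path>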
You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.
The following table describes the Hadoop pushdown properties for the Data Integration Service:
Informatica Home Directory on Hadoop
The Big Data Management home directory on every data node created by the Hadoop Debian install. Type /opt/Informatica.
Hadoop Distribution Directory
The directory containing a collection of Hive and Hadoop JARS on the cluster from the Debian install locations. The directory contains the minimum set of JARS required to process Informatica mappings in a Hadoop environment. Type /opt/Informatica/services/shared/hadoop/hortonworks_2.3.
Data Integration Service Hadoop Distribution Directory
The Hadoop distribution directory on the Data Integration Service node. Type ../../services/shared/hadoop/hortonworks_2.3. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.
When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop
JARS, and the Snappy libraries required to process Informatica mappings in a Hadoop environment from
your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop
distribution and version.
The Hadoop Debian distribution installs the Hadoop distribution directories in the following path:
<BigDataManagementInstallationDirectory>/Informatica/services/shared/hadoop.
You can optionally add third-party environment variables or extend the existing PATH environment variable in
hadoopEnv.properties.
1. Open the hive-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_2.3/conf/
To run a mapping in HiveServer2, configure the following properties in the hive-site.xml file:
hive.metastore.uris
URI for the metastore host.
For example:
<property>
<name>hive.metastore.uris</name>
<value>thrift://<HOSTNAME>:9083</value>
</property>
yarn.app.mapreduce.am.staging-dir
The directory where submitted jobs that use MapReduce are staged.
Open the yarn-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
For example:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default value is 19888.
For example:
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
yarn.resourcemanager.scheduler.address
Scheduler interface address.
yarn.resourcemanager.webapp.address
Web application address for the Resource Manager.
Open the mapred-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory server manages history files.
The following sample text shows the properties you must set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
<description>Directory where MapReduce jobs write history files.</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
<description>Directory where the MapReduce JobHistory server manages history
files.</description>
</property>
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.
The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
Replace ${hdp.version} with the version number of the Hortonworks HDInsight cluster.
mapreduce.application.framework.path
Path for the MapReduce framework archive.
The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
Subscription
Charges for this instance of Big Data Management will go to this subscription.
Location
Location of the resource group.
Machine prefix
Type an alphanumeric string that will be a prefix on the name of each virtual machine in the
Informatica domain.
For example, if you use the prefix "infa" then Azure will identify virtual machines in the domain with
this string at the beginning of the name.
Username
Username that you use to log in to the virtual machine that hosts the Informatica domain.
Authentication type
Authentication protocol you use to communicate with the Informatica domain.
Password
Password to use to log in to the virtual machine that hosts the Informatica domain.
Machine size
Select from among the available preconfigured VMs.
4. Supply information in the Domain Settings panel, and then click OK.
This tab allows you to configure additional details of the Informatica domain.
Informatica Domain Name
Type a name for the Informatica domain. This becomes the name of the Informatica domain on the
cluster.
Password
Password for the Informatica administrator.
Username
Username for the administrator of the virtual machine host of the database.
These credentials are used to log in to the virtual machine where the database is hosted.
Password
Password for the database machine administrator.
The Informatica domain uses this account to communicate with the Model repository database.
Password
Password for the HDInsight cluster user.
Password
Password to access the cluster SSH host.
The panel requires you to enter values for the following additional addresses. Get these addresses from
the Ambari cluster management tool:
• mapreduce.jobhistory.address
• mapreduce.jobhistory.webapp.address
• yarn.resourcemanager.scheduler.address
• yarn.resourcemanager.webapp.address
When you select an existing storage resource, verify that it belongs to the resource group you want.
It is not essential to select the same resource group as the group that the Big Data Management
implementation belongs to.
Virtual network
Virtual network for the Big Data Management implementation to belong to. Select the same network
as the one that you used to create the HDInsight cluster.
Subnets
The subnet that the virtual network contains.
Informatica supports HDInsight clusters that are deployed on Microsoft Azure.
Note: If you do not use HiveServer2 to run mappings, skip the steps related to HiveServer2.
To enable Informatica mappings to run on a Hortonworks HDInsight cluster, complete the following steps:
1. Enable the Data Integration Service to use Hive CLI to run mappings.
2. Configure the mapping logic pushdown method.
3. Enable HBase support.
4. Create the HiveServer2 environment variables and configure the HiveServer2 environment.
5. Configure the Hadoop cluster for the Blaze engine.
6. Disable SQL standard based authorization to run mappings with HiveServer2.
7. Enable storage based authorization with HiveServer2.
8. Enable support for HBase with HiveServer2.
Enable the Data Integration Service to Use Hive CLI to Run Mappings
Perform the following tasks to enable the Data Integration Service to use Hive CLI to run mappings:
1. Copy the following files from the Hadoop cluster to the following location on the machine that hosts the
Data Integration Service: <Informatica_installation_directory>/hortonworks_2.3/lib
• /usr/hdp/<CurrentVersion>/hadoop/hadoop-azure-2.7.1.2.3.3.1-7.jar
When you enable MapReduce or Tez for the Data Integration Service, that execution engine becomes the
default execution engine to push mapping logic to the Hadoop cluster. When you enable MapReduce or Tez
for a connection, that engine takes precedence over the execution engine set for the Data Integration
Service.
Choose MapReduce or Tez as the Execution Engine for the Data Integration Service
To use MapReduce or Tez as the default execution engine to push mapping logic to the Hadoop cluster,
perform the following steps:
1. Open hive-site.xml in the following directory on the node on which the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf/
2. Edit the hive.execution.engine property.
The following sample text shows the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses execution engine. Options are: mr (MapReduce, default) or tez
(Hadoop 2 only)</description>
</property>
Set the value of the property as follows:
• mr -- Sets MapReduce as the execution engine.
• tez -- Sets Tez as the execution engine.
If you enable Tez for the Data Integration Service but want to use MapReduce, you can use the following
value for the Environment SQL property: set hive.execution.engine=mr;.
Configure Tez
If you use Tez as the execution engine, you must configure properties in tez-site.xml.
You can find tez-site.xml in the following directory on the machine where the Data Integration Service
runs: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf.
Use the value specified in tez-site.xml on the cluster. You can find tez-site.xml in the following
directory on any node in the cluster: /etc/tez/conf.
tez.am.launch.env
Specifies the location of Hadoop libraries.
The following example shows the properties if tez.tar.gz is in the /apps/tez/lib directory on HDFS:
<property>
<name>tez.lib.uris</name>
<value>hdfs://<Active_Name_Node>:8020/hdp/apps/<version>/tez/tez.tar.gz</value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=/usr/hdp/<hadoop_version>/hadoop/lib/native</value>
<description>The location of Hadoop libraries.</description>
</property>
• tez.am.launch.cmd-opts
• tez.task.launch.env
• tez.am.launch.env
1. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/infaConf.
2. Edit hadoopEnv.properties.
3. Verify the HBase version specified in infapdo.env.entry.mapred_classpath uses the correct HBase
version for the Hadoop cluster.
The following sample text shows infapdo.env.entry.mapred_classpath for a Hadoop cluster that uses
HBase version 1.1.1.2.3.0.0-2504:
infapdo.env.entry.mapred_classpath=INFA_MAPRED_CLASSPATH=
$HADOOP_NODE_HADOOP_DIST/lib/hbase-server-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/htrace-core.jar:$HADOOP_NODE_HADOOP_DIST/lib/htrace-
core-2.04.jar:$HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-2.5.0.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-client-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-common-1.1.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hive-hbase-handler-1.2.1.2.3.0.0-2504.jar:
$HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol-1.1.1.2.3.0.0-2504.jar
4. Add the following entry to the infapdo.aux.jars.path variable: file://$DIS_HADOOP_DIST/conf/
hbase-site.xml.
The following sample text shows infapdo.aux.jars.path with the variable added:
infapdo.aux.jars.path=file://$DIS_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$DIS_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://
$DIS_HADOOP_DIST/infaLib/profiling-hive0.14.0-udf.jar,file://$DIS_HADOOP_DIST/
infaLib/hadoop2.2.0-avro_complex_file.jar,file://$DIS_HADOOP_DIST/conf/hbase-site.xml
5. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/conf.
6. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
7. On the machine where the Developer tool runs, go to the following directory: <Informatica installation
directory>\clients\DeveloperClient\hadoop\hortonworks_<version>/conf.
8. In hbase-site.xml and hive-site.xml, verify that the zookeeper.znode.parent property exists and
matches the property set in hbase-site.xml on the cluster.
By default, the ZooKeeper directory on the cluster is /usr/hdp/current/hbase-client/conf.
9. Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file.
Then, restart the Node Manager for each node in the Hadoop cluster.
hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more
information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
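For example, one way to extend the classpath (a sketch; the actual location of hbase-protocol.jar depends on your HBase installation) is to add a line such as the following to hadoop-env.sh on each node:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar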
You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.
export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hbase1.1.2-infa-plugins.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-
boot.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hive0.14.0-native-impl.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-
avro_complex_file.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:$INFA_HADOOP_DIST_DIR/infaLib/
profiling-hive0.14.0-udf.jar:/opt/Informatica/infa_jars.jar:$INFA_HADOOP_DIST_DIR/lib/
parquet-avro-1.6.0rc3.jar
export JAVA_LIBRARY_PATH=/opt/Informatica/services/shared/bin
export INFA_RESOURCES=/opt/Informatica/services/shared/bin
export INFA_HOME=/opt/Informatica
export IMF_CPP_RESOURCE_PATH=/opt/Informatica/services/shared/bin
export INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.storage.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'
• Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the HDInsight 3.3
cluster.
• Replace <HADOOP_DISTRIBUTION> with the Informatica Hadoop installation directory on the HDInsight
3.3 cluster.
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.
You can use Ambari to configure the required properties in the yarn-site.xml file. Alternatively, configure
the yarn-site.xml file on every node in the Hadoop cluster.
You can find the yarn-site.xml file in the following directory on every node in the Hadoop cluster: /etc/
hadoop/conf.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.resource-memory-mb
Amount of physical memory that can be allotted for containers.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</
description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>
To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
1. Log in to Ambari.
2. Select Hive > Configs.
3. In the Security section, set Hive Security Authorization to None.
4. Navigate to the Advanced tab for hiveserver2-site.
5. Set Enable Authorization to false.
6. Restart Hive Services.
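If you prefer to edit the configuration files directly, the Ambari setting in steps 4 and 5 generally corresponds to the hive.security.authorization.enabled property in hiveserver2-site.xml (an assumption based on the standard Hive property; verify the mapping in your Ambari version), for example:
<property>
<name>hive.security.authorization.enabled</name>
<value>false</value>
</property>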
1. Log in to Ambari.
2. Click Hive > Configs.
3. In the Security section, set the Hive Security Authorization to SQLStdAuth.
4. Navigate to Advanced Configs.
5. In the General section, verify that the Hive Authorization Manager property is set to the following value:
Hive Authorization Manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmb
edOnly
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuth
orizerFactory
Enable Authorization
Set this property to True.
6. In the Advanced hiveserver2-site section, configure the following properties:
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvide
r
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthoriz
erFactory
7. Restart all Hive services.
1. Verify that the value for the zookeeper.znode.parent property in the hbase-site.xml file on the
machine where the Data Integration Service runs matches the value on the Hadoop cluster.
The default value is /hbase-unsecure. A sample hbase-site.xml entry appears after these steps.
You can find the hbase-site.xml file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
hortonworks_<version>/conf.
You can find the hbase-site.xml file in the following directory on the Hadoop cluster: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>.
2. Verify that the infapdo.aux.jars.path property contains the path to the hbase-site.xml file.
The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-
boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-
interface.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.14.0-
udf.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hadoop2.2.0-
avro_complex_file.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml,file://
$HADOOP_NODE_HADOOP_DIST/infaLib/infa_jars.jar
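For the zookeeper.znode.parent check in step 1, a minimal hbase-site.xml entry might look like the following (shown with the default value):
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase-unsecure</value>
</property>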
The Administrator tool is a browser-enabled utility that allows you to create, configure and run different
services on the Informatica domain.
To see more about Informatica services, see the Informatica Application Service Guide. You can download
this and all other documentation from the Informatica Network portal.
1. Get the host name, IP address, and port of the virtual machine where Azure deployed the Informatica
domain.
2. Add an entry for this domain host to the hosts file.
6. Use the Administrator tool to create, configure and run application services on the Informatica domain.
To change the value of the fs.defaultFS property to the wasb location, edit the following file:
<Informatica_installation_directory>/services/shared/<hadoop_distribution>/conf/hive-
site.xml
You can get the wasb location from the hdfs-site.xml file on the Hadoop cluster, or through the Ambari
cluster management tool.
Connections
Define the connections that you want to use to access data in HBase, HDFS, Hive, or relational databases, or
run a mapping on a Hadoop cluster. You can create the connections using the Developer tool, Administrator
tool, and infacmd.
Create a Hadoop connection to run mappings on the Hadoop cluster. Select the Hadoop connection if
you select the Hadoop run-time environment. You must also select the Hadoop connection to validate a
mapping to run on the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the
information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.
HDFS connection
Create an HDFS connection to read data from or write data to the HDFS file system on the Hadoop
cluster.
HBase connection
Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.
JDBC connection
Create a JDBC connection and configure Sqoop properties in the connection to import and export
relational data through Sqoop. You must also create a Hadoop connection to run the mapping on the
Hadoop cluster.
Note: For information about creating connections to other sources or targets such as social media web sites
or Teradata, see the respective PowerExchange adapter user guide for information.
Note: The order of the connection properties might vary depending on the tool where you view them.
Name
Name of the connection. The name is not case sensitive and must be unique within the domain. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
The description of the connection. The description cannot exceed 765 characters.
Location
The domain where you want to create the connection. Not valid for the Analyst tool.
Name
The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
The description of the connection. The description cannot exceed 4,000 characters.
ZooKeeper Host(s)
Name of the machine that hosts the ZooKeeper server.
ZooKeeper Port
Port number of the machine that hosts the ZooKeeper server. Use the value specified for hbase.zookeeper.property.clientPort in hbase-site.xml. You can find hbase-site.xml on the Namenode machine in the following directory: /opt/HDinsight/hbase/hbase-0.98.7/conf
Enable Kerberos Connection
Enables the Informatica domain to communicate with the HBase master server or region server that uses Kerberos authentication.
HBase Master Principal
Service Principal Name (SPN) of the HBase master server. Enables the ZooKeeper server to communicate with an HBase master server that uses Kerberos authentication. Enter a string in the following format:
hbase/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.
HBase Region Server Principal
Service Principal Name (SPN) of the HBase region server. Enables the ZooKeeper server to communicate with an HBase region server that uses Kerberos authentication. Enter a string in the following format:
hbase_rs/<domain.name>@<YOUR-REALM>
Where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.
Note: The order of the connection properties might vary depending on the tool where you view them.
Name
The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
The description of the connection. The description cannot exceed 4000 characters.
Location
The domain where you want to create the connection. Not valid for the Analyst tool.
Connection Modes
Hive connection mode. Select at least one of the following options:
- Access Hive as a source or target. Select this option if you want to use the connection to access the Hive data warehouse. If you want to use Hive as a target, you must enable the same connection or another Hive connection to run mappings in the Hadoop cluster.
- Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the connection to run mappings in the Hadoop cluster.
You can select both the options. Default is Access Hive as a source or target.
User Name
User name of the user that the Data Integration Service impersonates to run mappings on a Hadoop cluster. Use the user name of an operating system user that is present on all nodes on the Hadoop cluster.
Common Attributes to Both the Modes: Environment SQL
SQL commands to set the Hadoop environment. In native environment type, the Data Integration Service executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes the environment SQL at the beginning of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both
connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions and
then use environment SQL or PreSQL to specify the Hive user-defined functions. You
cannot use PreSQL in the data object properties to specify the classpath. The path
must be the fully qualified path to the JAR files used for user-defined functions. Set
the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the
path to the JAR files for user-defined functions.
- You can use environment SQL to define Hadoop or Hive parameters that you want to
use in the PreSQL commands or in custom queries.
If you use the Hive connection to run mappings in the Hadoop cluster, the Data
Integration service executes only the environment SQL of the Hive connection. If the
Hive sources and targets are on different clusters, the Data Integration Service does not
execute the different environment SQL commands for the connections of the Hive source
or target.
Metadata Connection String
The JDBC connection URI used to access the metadata from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data
Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the Ambari tool.
Bypass Hive JDBC Server
JDBC driver mode. Select the check box to use the embedded JDBC driver mode.
To use the JDBC embedded mode, perform the following tasks:
- Verify that Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings in the Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String. Informatica recommends that you use the JDBC embedded mode.
Data Access Connection String
The JDBC connection URI used to access data from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data
Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the Ambari tool.
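For example, a Data Access Connection String for a HiveServer2 instance might look like the following (host name, port, database, and transport mode are illustrative values):
jdbc:hive2://hiveserver2.example.com:10000/default;transportMode=binary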
Database Name
Namespace for tables. Use the name default for tables that do not have a specified database name.
Default FS URI
The URI to access the default HDInsight File System.
Use the connection URI that matches the storage type. The storage type is configured for
the cluster in the fs.defaultFS property.
If the cluster uses HDFS storage, use the following string to specify the URI:
hdfs://<cluster_name>
Example:
hdfs://my-cluster
If the cluster uses wasb storage, use the following string to specify the URI:
wasb://<container_name>@<account_name>.blob.core.windows.net/
<path>
where:
- <container_name> identifies a specific Azure Blob storage container.
Note: <container_name> is optional.
- <account_name> identifies the Azure storage object.
Example:
wasb://infabdmoffering1storage.blob.core.windows.net/
infabdmoffering1cluster/mr-history
Yarn Resource Manager URI
The service within Hadoop that submits the MapReduce tasks to specific nodes in the cluster.
For HDInsight 3.3 with YARN, use the following format:
<hostname>:<port>
Where
- <hostname> is the host name or IP address of the JobTracker or Yarn resource
manager.
- <port> is the port on which the JobTracker or Yarn resource manager listens for
remote procedure calls (RPC).
Use the value specified by yarn.resourcemanager.address in yarn-site.xml.
You can find yarn-site.xml in the following directory on the NameNode: /etc/
hive/<version>/0/.
For HDInsight 3.3 with MapReduce 2, use the following URI:
hdfs://host:port
Hive Warehouse Directory on HDFS
The absolute HDFS file path of the default database for the warehouse that is local to the cluster. For example, the following file path specifies a local warehouse:
/user/hive/warehouse
If the Metastore Execution Mode is remote, then the file path must match the file path
specified by the Hive Metastore Service on the Hadoop cluster.
Use the value specified for the hive.metastore.warehouse.dir property in hive-site.xml. You can find hive-site.xml in the following directory on the node that runs HiveServer2: /etc/hive/<version>/0/.
Advanced Hive/Hadoop Properties
Configures or overrides Hive or Hadoop cluster properties in hive-site.xml on the machine on which the Data Integration Service runs. You can specify multiple properties.
Use the following format:
<property1>=<value>
Where
- <property1> is a Hive or Hadoop property in hive-site.xml.
- <value> is the value of the Hive or Hadoop property.
To specify multiple properties use &: as the property separator.
The maximum length for the format is 1 MB.
If you enter a required property for a Hive connection, it overrides the property that you
configure in the Advanced Hive/Hadoop Properties.
The Data Integration Service adds or sets these properties for each map-reduce job. You
can verify these properties in the JobConf of each mapper and reducer job. Access the
JobConf of each job from the Jobtracker URL under each map-reduce job.
The Data Integration Service writes messages for these properties to the Data Integration
Service logs. The Data Integration Service must have the log tracing level set to log each
row or have the log tracing level set to verbose initialization tracing.
For example, specify the following properties to control and limit the number of reducers
to run a mapping job:
mapred.reduce.tasks=2&:hive.exec.reducers.max=10
Temporary Table Compression Codec
Hadoop compression library for a compression codec class name.
Codec Class Name
Codec class name that enables data compression and improves performance on temporary staging tables.
Metastore Execution Mode
Controls whether to connect to a remote metastore or a local metastore. By default, local is selected. For a local metastore, you must specify the Metastore Database URI, Driver,
Username, and Password. For a remote metastore, you must specify only the Remote
Metastore URI.
Metastore Database URI
The JDBC connection URI used to access the data store in a local metastore setup. Use
the following connection URI:
jdbc:<datastore type>://<node name>:<port>/<database name>
where
- <node name> is the host name or IP address of the data store.
- <data store type> is the type of the data store.
- <port> is the port on which the data store listens for remote procedure calls (RPC).
- <database name> is the name of the database.
For example, the following URI specifies a local metastore that uses MySQL as a data
store:
jdbc:mysql://hostname23:3306/metastore
Use the value specified for the javax.jdo.option.ConnectionURL property in
hive-site.xml. You can find hive-site.xml in the following directory on the node
that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.
Metastore Database Driver
Driver class name for the JDBC data store. For example, the following class name specifies a MySQL driver:
com.mysql.jdbc.Driver
Use the value specified for the javax.jdo.option.ConnectionDriverName
property in hive-site.xml. You can find hive-site.xml in the following directory
on the node that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.
Metastore Database Password
Required if the Metastore Execution Mode is set to local. The password for the metastore user name.
Use the value specified for the javax.jdo.option.ConnectionPassword
property in hive-site.xml. You can find hive-site.xml in the following directory
on the node that runs HiveServer2: /etc/hive/<version>/0/hive-site.xml.
Remote Metastore URI
The metastore URI used to access metadata in a remote metastore setup. For a remote metastore, you must specify the Thrift server details.
Use the following connection URI:
thrift://<hostname>:<port>
Where
- <hostname> is name or IP address of the Thrift metastore server.
- <port> is the port on which the Thrift server is listening.
Use the value specified for the hive.metastore.uris property in hive-site.xml.
You can find hive-site.xml in the following directory on the node that runs
HiveServer2: /etc/hive/<version>/0/hive-site.xml.
Hive Connection String
The JDBC connection URI used to access the metadata from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer2, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>;transportMode=<mode>
Where
- <hostname> is name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database
name, the Data Integration Service uses the default database details.
- <mode> is the value of the hive.server2.transport.mode property in the Hive tab of the
Ambari tool.
Informatica supports Cloudera CDH clusters that are deployed on-premise, on Amazon EC2, or on Microsoft
Azure.
To enable Informatica mappings to run on a Cloudera CDH cluster, complete the following steps:
Note: If you do not use HiveServer2 to run mappings, skip the HiveServer2 related steps.
1. Configure Hadoop cluster properties on the machine on which the Data Integration Service runs.
2. Create a staging directory on HDFS.
The following sample code describes the properties you can set in hive-site.xml:
<property>
<name>hive.optimize.constant.propagation</name>
<value>false</value>
</property>
yarn.application.classpath
Required if you used the Big Data Management Configuration Utility. A comma-separated list of
CLASSPATH entries for YARN applications.
Alternatively, you can use the value for this property from yarn-site.xml on the Hadoop cluster.
The Big Data Management Configuration utility automatically configures the following properties in the yarn-
site.xml file. You can also manually configure the properties.
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.
yarn.resourcemanager.scheduler.address
Scheduler interface address.
By default, a staging directory already exists on HDFS. You must grant the anonymous user the Execute
permission on the staging directory. If you cannot grant the anonymous user the Execute permission on this
directory, you must enter a valid user name for the user in the Hive connection. If you use the default staging
directory on HDFS, you do not have to configure mapred-site.xml or hive-site.xml.
If you want to create another staging directory to store mapreduce jobs, you must create a directory on
HDFS. After you create the staging directory, you must add it to mapred-site.xml and hive-site.xml.
To create another staging directory on HDFS, run the following commands from the command line of the
machine that runs the Hadoop cluster:
hadoop fs -mkdir /staging
hadoop fs -chmod -R 0777 /staging
Add the staging directory to hive-site.xml on the machine where the Data Integration Service runs.
hive-site.xml is located in the following directory on the machine where the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/cloudera_<version>/conf.
In hive-site.xml, add the yarn.app.mapreduce.am.staging-dir property. Use the value that you specified
in mapred-site.xml.
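For example, if you created the /staging directory shown above, the entry that you add to hive-site.xml (matching the value you specified in mapred-site.xml) might look like the following:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/staging</value>
</property>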
yarn-site.xml is located in the following directory on every node in the Hadoop cluster:
/etc/hadoop/conf/yarn-site.xml
The following example describes the property you can configure in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Enforces virtual memory limits for containers.</description>
</property>
Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file.
Then, restart the Node Manager for each node in the Hadoop cluster.
hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more
information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
You can use Cloudera Manager to configure the HiveServer2 environment. Alternatively, you can copy the contents of HiveServer2_EnvInfa.txt to the end of the hive-env.sh file.
You can find hive-env.sh in the following directory on the Hadoop cluster: /etc/hive/conf/hive-env.sh.
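For example, if you generated HiveServer2_EnvInfa.txt with the Big Data Management Configuration Utility, one way to append its contents (a sketch; adjust the paths for your installation) is:
cat <Informatica installation directory>/tools/BDMUtil/HiveServer2_EnvInfa.txt >> /etc/hive/conf/hive-env.sh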
To configure a Cloudera CDH cluster for the Blaze engine, complete the following tasks:
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.resource-memory-mb
Amount of physical memory that can be allotted for containers.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
Note: If the Hadoop cluster uses RPMs, you must manually edit the hive-env.sh file to add the
<DB2_HOME>/lib64 directory to LD_LIBRARY_PATH. You can find hive-env.sh in the following
directory: /etc/hive/conf
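A minimal sketch of the line to add to hive-env.sh, assuming <DB2_HOME> is the DB2 client installation directory on the node:
export LD_LIBRARY_PATH=<DB2_HOME>/lib64:$LD_LIBRARY_PATH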
Informatica supports Hortonworks HDP clusters that are deployed on-premise, on Amazon EC2, or on
Microsoft Azure.
To enable Informatica mappings to run on a Hortonworks HDP cluster, complete the following steps:
Note: Skip the HiveServer2 related steps if you do not use HiveServer2 to run mappings.
You need to configure the Hortonworks cluster properties in the hive-site.xml file that the Data Integration
Service uses when it runs mappings in a Hadoop cluster. If you use the Big Data Management Configuration
Utility to configure Big Data Management, the hive-site.xml file is automatically configured.
Open the hive-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
The following sample text shows the property you can configure in the hive-site.xml file:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hostname:port</value>
</property>
Open the yarn-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
yarn.resourcemanager.webapp.address
Web application address for the Resource Manager.
If the HDInsight cluster uses MapReduce 2, configure the following properties in the yarn-site.xml file:
mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default value is 10020.
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default value is 19888.
yarn.resourcemanager.scheduler.address
Scheduler interface address. The default value is 8030.
yarn.resourcemanager.webapp.address
Resource Manager web application address.
The following sample text shows the properties you can set in the yarn-site.xml file:
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>
Open the mapred-site.xml file in the following directory on the node on which the Data Integration Service
runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf/
mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory server manages history files.
The following sample text shows the properties you must set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
<description>Directory where MapReduce jobs write history files.</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
<description>Directory where the MapReduce JobHistory server manages history
files.</description>
</property>
If you use the Big Data Management Configuration Utility to configure Big Data Management, the following
properties are automatically configured in mapred-site.xml. If you do not use the utility, configure the
following properties in mapred-site.xml:
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server.
The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
Replace <hadoop_version> with your Hortonworks HDP version. For example, use 2.2.0.0-2041 for a
Hortonworks HDP 2.2 cluster.
mapreduce.application.framework.path
Path for the MapReduce framework archive.
Replace <hadoop_version> with your Hortonworks HDP version. For example, use 2.2.0.0-2041 for a
Hortonworks HDP 2.2 cluster.
The following sample text shows the properties you can set in the mapred-site.xml file:
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/
hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/
hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-
framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:
$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/<hadoop_version>/hadoop/lib/
hadoop-lzo-0.6.0.2.2.0.0-2041.jar:/etc/hadoop/conf/secure
</value>
<description>Classpaths for MapReduce applications. Replace <hadoop_version> with your
Hortonworks HDP version. For example, use 2.2.0.0-2041 for a Hortonworks HDP 2.2
cluster.</description>
</property>
When you enable MapReduce or Tez for the Data Integration Service, that execution engine becomes the
default execution engine to push mapping logic to the Hadoop cluster. When you enable MapReduce or Tez
for a connection, that engine takes precedence over the execution engine set for the Data Integration
Service.
Choose MapReduce or Tez as the Execution Engine for the Data Integration Service
To use MapReduce or Tez as the default execution engine to push mapping logic to the Hadoop cluster,
perform the following steps:
1. Open hive-site.xml in the following directory on the node on which the Data Integration Service runs:
<Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf/
2. Edit the hive.execution.engine property.
The following sample text shows the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses execution engine. Options are: mr (MapReduce, default) or tez
(Hadoop 2 only)</description>
</property>
Set the value of the property as follows:
• mr -- Sets MapReduce as the execution engine.
• tez -- Sets Tez as the execution engine.
If you enable Tez for the Data Integration Service but want to use MapReduce, you can use the following
value for the Environment SQL property: set hive.execution.engine=mr;.
You can find tez-site.xml in the following directory on the machine where the Data Integration Service
runs: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/conf.
Use the value specified in tez-site.xml on the cluster. You can find tez-site.xml in the following
directory on any node in the cluster: /etc/tez/conf.
tez.am.launch.env
Specifies the location of Hadoop libraries.
The following example shows the properties if tez.tar.gz is in the /apps/tez/lib directory on HDFS:
<property>
<name>tez.lib.uris</name>
<value>hdfs://<Active_Name_Node>:8020/hdp/apps/<version>/tez/tez.tar.gz</value>
<description>The location of tez.tar.gz. Set tez.lib.uris to point to the tar.gz
uploaded to HDFS.</description>
</property>
<property>
<name>tez.am.launch.env</name>
<value>LD_LIBRARY_PATH=/usr/hdp/<hadoop_version>/hadoop/lib/native</value>
<description>The location of Hadoop libraries.</description>
</property>
• tez.am.launch.cmd-opts
• tez.task.launch.env
• tez.am.launch.env
Edit the Hadoop classpath on every node on the Hadoop cluster to point to the hbase-protocol.jar file.
Then, restart the Node Manager for each node in the Hadoop cluster.
You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.
export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hbase1.1.2-infa-plugins.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-
boot.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:$INFA_HADOOP_DIST_DIR/
infaLib/hive0.14.0-native-impl.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-
avro_complex_file.jar:$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:$INFA_HADOOP_DIST_DIR/infaLib/
profiling-hive0.14.0-udf.jar:/opt/Informatica/infa_jars.jar
export JAVA_LIBRARY_PATH=<HADOOP_NODE_INFA_HOME>/services/shared/bin
export INFA_RESOURCES=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_HOME=<HADOOP_NODE_INFA_HOME>
export IMF_CPP_RESOURCE_PATH=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.storage.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'
Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the Hadoop cluster.
Replace <HADOOP_DISTRIBUTION> with the Informatica Hadoop installation directory on the Hadoop
cluster. Based on your Hadoop distribution, use one of the following phrases to replace
<HADOOP_DISTRIBUTION>:
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.
If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.
1. Log in to Ambari.
2. Select Hive > Configs.
3. In the Security section, set Hive Security Authorization to None.
4. Navigate to the Advanced tab for hiveserver2-site.
5. Set Enable Authorization to false.
6. Restart Hive Services.
1. Log in to Ambari.
2. Click Hive > Configs.
3. In the Security section, set the Hive Security Authorization to SQLStdAuth.
4. Navigate to Advanced Configs.
5. In the General section, configure the following properties:
Hive Authorization Manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider,org.apache.hadoop.hive.ql.security.authorization.MetaStoreAuthzAPIAuthorizerEmbedOnly
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory
Enable Authorization
Set this property to True.
6. In the Advanced hiveserver2-site section, configure the following properties:
Enable Authorization
Set this value to True.
hive.security.authorization.manager
Set this property to the following value:
org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider
By default, this property is set to the following value:
org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory
7. Restart all Hive services.
1. On the machine where the Data Integration Service runs, go to the following directory: <Informatica
installation directory>/services/shared/hadoop/hortonworks_<version>/infaConf.
2. Edit hadoopEnv.properties.
3. Verify that the HBase version specified in infapdo.env.entry.mapred_classpath matches the HBase
version on the Hadoop cluster.
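For example, a quick way to check the HBase version on a cluster node (a sketch; it assumes the hbase client command is on the PATH):
hbase version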
1. Verify that the value for the zookeeper.znode.parent property in the hbase-site.xml file on the
machine where the Data Integration Service runs matches the value on the Hadoop cluster.
The default value is /hbase-unsecure.
You can find the hbase-site.xml file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/<hadoop
distribution>/conf.
You can find the hbase-site.xml file in the following directory on the Hadoop cluster: <Informatica
installation directory>/services/shared/hadoop/<hadoop distribution>.
2. Verify that the infapdo.aux.jars.path property contains the path to the hbase-site.xml file.
The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.13.0.hw21-udf.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hadoop2.2.0-avro_complex_file.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml,file://$HADOOP_NODE_HADOOP_DIST/infaLib/infa_jars.jar
Note: If the Hadoop cluster uses RPMs, you must manually edit the hive-env.sh file to add the
<DB2_HOME>/lib64 directory to LD_LIBRARY_PATH. You can find hive-env.sh in the following
directory: /etc/hive/conf
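For example, a minimal sketch of the line you might add to hive-env.sh; <DB2_HOME> is a placeholder for the DB2 client installation directory:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<DB2_HOME>/lib64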
1. Open Ambari.
2. Click Hive > Configs > Advanced.
3. Search for the hive-env template property.
4. Add the following directory to the LD_LIBRARY_PATH property: <DB2_HOME>/lib64.
5. Restart the Hive services.
yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>6144</value>
<description>Amount of physical memory that can be allotted for containers.</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>
To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
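For reference, after step 2 the hadoopEnv.properties file contains an entry like the following (other entries in the file are not shown):
infagrid.blaze.console.enabled=true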
Note: You can use the Ambari cluster configuration tool to view and edit cluster properties. After you change
property values, the Ambari tool displays the affected cluster components. Restart the affected components
for the changes to take effect.
You can use the following runtime engines to run mappings on BigInsights:
To enable Informatica mappings to run on an IBM BigInsights cluster, complete the following steps:
Provide an operating system user account that is present on all nodes when you configure the JDBC and
Hive connections in the Developer Tool.
To use an anonymous user with Hive sources in the native environment or Hive data preview, create an
operating system user account named "anonymous" that is present on all nodes. Use this user account when
you set the JDBC and Hive connection properties.
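For example, a minimal sketch that creates the account on one Linux node; run an equivalent command on every node and adjust the options to your user management policy:
sudo useradd anonymous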
You must add the following path to the infapdo.env.entry.mapred_classpath property in the
hadoopEnv.properties file: $HADOOP_NODE_INFA_HOME/services/shared/jars/shapp/*
The following sample text shows the infapdo.env.entry.mapred_classpath property with the
$HADOOP_NODE_INFA_HOME/services/shared/jars/shapp/* path:
infapdo.env.entry.mapred_classpath=INFA_MAPRED_CLASSPATH=$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/protobuf-java-2.5.0.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-client.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-common.jar:$HADOOP_NODE_HADOOP_DIST/lib/hive-hbase-handler.jar:$HADOOP_NODE_HADOOP_DIST/lib/hbase-protocol.jar:$HADOOP_NODE_HADOOP_DIST/infaLib/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/platform/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/platform/dtm/*:$HADOOP_NODE_INFA_HOME/services/shared/jars/thirdparty/*:$HADOOP_NODE_HADOOP_DIST/infaLib/*:$HADOOP_NODE_INFA_HOME/plugins/infa/*:$HADOOP_NODE_INFA_HOME/plugins/dynamic/*:$HADOOP_NODE_INFA_HOME/plugins/osgi/*:$HADOOP_NODE_HADOOP_DIST/lib/htrace-core.jar:$HADOOP_NODE_INFA_HOME/services/shared/jars/shapp/*:$HADOOP_NODE_HADOOP_DIST/lib/htrace-core-3.1.0-incubating.jar:$HADOOP_CONF_DIR
You can run the Big Data Management Configuration Utility and select HiveServer2 to generate the
HiveServer2_EnvInfa.txt file. Alternatively, you can modify a template to create the required environment
variables.
export TMP_INFA_AUX_JARS=$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.4.0-hdfs-native-impl.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hadoop2.7.1.hw23-native-impl.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hbase1.1.2-infa-plugins.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-boot.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-plugins.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-infa-storagehandler.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive0.14.0-native-impl.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive1.1.0-avro_complex_file.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/hive-infa-plugins-interface.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/infa-hadoop-hdfs.jar:\
$INFA_HADOOP_DIST_DIR/infaLib/profiling-hive0.14.0-udf.jar:\
/opt/Informatica/infa_jars.jar
export JAVA_LIBRARY_PATH=<HADOOP_NODE_INFA_HOME>/services/shared/bin
export INFA_RESOURCES=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_HOME=<HADOOP_NODE_INFA_HOME>
export IMF_CPP_RESOURCE_PATH=<HADOOP_NODE_INFA_HOME>/Informatica/services/shared/bin
export INFA_MAPRED_OSGI_CONFIG='osgi.framework.activeThreadType:false&:org.osgi.framework.storage.clean:none&:eclipse.jobs.daemon:true&:infa.osgi.enable.workdir.reuse:true&:infa.osgi.parent.workdir::/tmp/infa&:infa.osgi.workdir.poolsize:4'
Replace <HADOOP_NODE_INFA_HOME> with the Informatica installation directory on the Hadoop cluster.
Replace <HADOOP_DISTRIBUTION> with the Informatica Hadoop installation directory on the Hadoop
cluster. Based on your Hadoop distribution, use one of the following phrases to replace
<HADOOP_DISTRIBUTION>:
Note: If you use Ambari with CSH as the default shell, you must change the export command to set.
After you create the environment variables, configure the HiveServer2 environment with Ambari or the hive-
env.sh file.
If you use the utility to select HiveServer2, you can find HiveServer2_EnvInfa.txt in the following directory
on the machine where the Data Integration Service runs: <Informatica installation directory>/tools/
BDMUtil.
You must add the path for the hbase-site.xml file to the infapdo.aux.jars.path property in the
hadoopEnv.properties file.
The following sample text shows the infapdo.aux.jars.path property with the path for hbase-site.xml:
infapdo.aux.jars.path=file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive0.14.0-infa-boot.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/profiling-hive0.13.0-udf.jar,file://$HADOOP_NODE_HADOOP_DIST/infaLib/hive-infa-plugins-interface.jar,file://$HADOOP_NODE_INFA_HOME/infa_jars.jar,file://$HADOOP_NODE_HADOOP_DIST/conf/hbase-site.xml
You can find the hadoopEnv.properties file in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
<Hadoop_distribution_name>/infaConf.
yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.resource.memory-mb
Amount of physical memory that can be allotted for containers.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
Before you configure the Informatica domain and the MapR cluster to run mappings, download and install
EBF 17588. This EBF release supports MapR ticket and Kerberos-enabled MapR clusters.
MapReduce Version
Verify that the cluster is configured for the correct version of MapReduce. You can use the MapR Control
System (MCS) to change the MapReduce version. Then, restart the cluster.
• User ID (uid)
• Group ID (gid)
• Groups
For example, MapR User details might be set to the following values:
• uid=2000(mapr)
• gid=2000(mapr)
• groups=2000(mapr)
For example, a Data Integration Service user named testuser might have the following properties:
• uid=30103(testuser)
• gid=2000(mapr)
• groups=2000(mapr)
After you verify the Data Integration Service user details, perform the following steps on every node in
the cluster:
1. Use a tool such as PuTTY to connect to the node with SSH.
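For example, you can compare the user details with the id command; the output shown corresponds to the earlier testuser example:
id testuser
uid=30103(testuser) gid=2000(mapr) groups=2000(mapr)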
Verify Prerequisites
Before you download and install the EBF, verify that you have the following environment:
• You can access a running Informatica domain that includes a Model Repository Service and a Data
Integration Service.
• You have the Developer client installed on a machine in your cluster.
• Informatica Big Data Management 10.1 RPM packages or Cloudera parcels are installed on your Hadoop
cluster.
EBF Installation
Install EBF 17588 on top of Informatica Big Data Management 10.1.
EBF17588_Server_Installer_linux_em64t.tar
This archive contains Linux updates for servers and the Big Data Management Configuration Utility.
EBF17588_Client_Installer_win_em64t.tar
This archive contains updates to clients, including the Developer tool.
INFORMATICA-10.1.0.informatica10.1.0.p1.364.parcel.tar
This archive contains updates to Big Data Management support for Cloudera clusters.
Contact Informatica Global Customer Support for the link to download and install EBF 17588. Then perform
the following tasks:
Generate a MapR Ticket
To enable mappings to run on a Kerberos-enabled MapR cluster, generate a MapR ticket for the Data
Integration Service user.
1. Run the MapR kinit utility on the CLDB node of the cluster to create a Kerberos ticket for the Data
Integration Service user.
For information about how to generate MapR Tickets, refer to MapR documentation.
2. Run the maprlogin kerberos utility. Type:
maprlogin kerberos
The utility generates a MapR ticket in the /tmp directory using the following naming convention:
maprticket_<userid>
where <userid> corresponds to the Data Integration Service user.
3. Copy the ticket file from the cluster node to the following directory on the VM that runs the Data
Integration Service:
/tmp
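For example, a minimal sketch that copies the ticket with scp; the host name and user are placeholders:
scp /tmp/maprticket_<userid> <DIS_user>@<DIS_host>:/tmp/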
1. In the Administrator tool, browse to the Data Integration Service Process Properties tab.
2. In the Advanced Properties area, add the following line to the JVM Command Line Options:
-Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the MAPR_ECOSYSTEM_LOGIN_OPTS
property in the file /opt/mapr/conf/env.sh.
3. Restart the Data Integration Service for the change to take effect.
When you choose the native run-time engine, Big Data Management uses the Data Integration Service to run
mappings on the Informatica domain. You can also choose a run-time engine that runs mappings in the Hadoop
environment, which pushes mapping processing to the cluster.
When you want to run mappings on the cluster, you choose from the following run-time engines:
Blaze engine
The Blaze engine is an Informatica software component that can run mappings on the Hadoop cluster.
Hive engine
When you run mappings on the Hive run-time engine, you choose Hive Command Line Interface or
HiveServer 2.
yarn.timeline-service.webapp.address
The HTTP address for the Application Timeline service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.enabled
Indicates whether the Timeline service is enabled.
yarn.timeline-service.address
Address for the Application Timeline Server to start the RPC server.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.hostname
The host name for the Application Timeline Service web application.
Use the host name of the machine that starts the Application Timeline Server for the host name.
yarn.timeline-service.ttl-ms
The time-to-live in milliseconds for data in the timeline store.
Use 3600000.
yarn.nodemanager.local-dirs
List of directories to store localized files in.
The following sample text shows the properties you configure in the yarn-site.xml file:
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><ATSHostname>:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value><ATSHostname>:10200</value>
</property>
<property>
<name>yarn.timeline-service.hostname</name>
<value><ATSHostname></value>
</property>
<property>
<name>yarn.timeline-service.ttl-ms</name>
<value>3600000</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/<local_directory>,/<local_directory></value>
</property>
1. In the Administrator tool, browse to the Data Integration Service Process tab.
2. In the environment variables area, define the Kerberos authentication protocol:
Property Value
To start the Hadoop Application Timeline Server, run the following command on any node in the Hadoop
cluster:
sudo yarn timelineserver &
1. On the machine where the Data Integration Service runs, edit the hadoopEnv.properties file.
You can find hadoopEnv.properties in the following directory: <Informatica installation
directory>/services/shared/hadoop/<hadoop_distribution><version_number>/infaConf.
2. Set the infagrid.blaze.console.enabled property to true.
3. Save and close the hadoopEnv.properties file.
1. In the Administrator tool, browse to the Data Integration Service Process tab.
2. In the Custom Properties area, define the following properties and values:
Property Value
ExecutionContextOptions.JVMOption2 -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/env.sh.
ExecutionContextOptions.JVMOption7 -Dhttps.protocols=TLSv1.2
1. In the Administrator tool, browse to the Connections tab and browse to the HiveServer2 Connection
Properties area.
2. Configure the following connection properties:
Property Value
3. In the environment variables area, configure the following property to define the Kerberos authentication
protocol:
Property Value
Edit hive-site.xml to Enable a Mapping to Run with the Hive Run-Time Engine
To run mappings using Hive, open the file <Informatica installation directory>/services/shared/
hadoop/mapr_<version>/conf/hive-site.xml for editing and make the following changes:
To configure the Informatica domain to enable mappings to run on a MapR 5.1 cluster that uses MapR ticket
for authentication, perform the following steps:
To configure users for MapR ticket or Kerberos-enabled MapR clusters, establish Linux accounts and
configure user permissions.
1. Create a Linux user on the node where the HiveServer2 service runs. Use the same username as the
Windows user account that runs the Developer tool client. We will refer to this user as the client user.
2. If the cluster is Kerberos-enabled, you can perform the following steps to generate a MapR ticket.
Alternatively, follow steps 3 and 4.
a. Install maprclient on the Windows machine.
b. Generate a Kerberos ticket on the Windows machine.
c. Use maprlogin to generate a maprticket at %TEMP%.
Skip to step 5.
3. On the same node, log in as the client user and generate a MapR ticket.
Refer to MapR documentation for more information.
If the cluster is not Kerberos-enabled, follow these steps:
a. Type the following command:
maprlogin password
b. When prompted, provide the password for the client user.
If the cluster is Kerberos-enabled, follow these steps:
a. Generate a Kerberos ticket using kinit.
b. Type the following command to generate a maprticket:
maprlogin kerberos
The cluster generates a MapR ticket associated with the client user. By default, tickets on Linux systems
are generated in the /tmp directory and have a name like maprticket_<username>.
4. Copy the MapR ticket file and paste it to the %TEMP% directory on the Windows machine.
5. Rename the file like this:
maprticket_<username>
where <username> is the username of the client user.
6. On the MapR Control System browser, get the value of the property hive.server2.authentication.
7. Open the file <Informatica_client_installation>\clients\DeveloperClient\hadoop
\mapr_<version_number>\conf\hive-site.xml for editing.
8. Change the value of the property hive.server2.authentication from NONE to the value that you got in step 6.
Note: If Kerberos is enabled on the cluster, comment out the hive.server2.authentication property in
hive-site.xml.
9. Add the following lines to the hive-site.xml file:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
10. Save and close the hive-site.xml file.
To test the Hive connection, or perform a metadata fetch task, use the following format for the connection
string if the cluster is Kerberos-enabled:
jdbc:hive2://<hostname>:10000/default;principal=<SPN>
Example:
jdbc:hive2://myServer2:10000/default;principal=mapr/myServer2@clustername
If custom authentication is enabled, specify the user name and password in the Database Connection tab of
the Hive connection.
Note: When the mapping performs a metadata fetch of a complex file object, the user whose maprticket is
present at %TEMP% on the Windows machine must have read permission on the HDFS directory to list the
files inside it and perform the import action. The metadata fetch operation ignores privileges of the user who
is listed in the HDFS connection definition.
• Add MAPR_HOME to the environment variables in the Data Integration Service Process properties. Set
MAPR_HOME to the following path: <Informatica installation directory>/services/shared/
hadoop/mapr_<version_number>/.
• Add -Dmapr.library.flatclass to the custom properties in the Data Integration Service Process properties.
For example, add
JVMOption1=-Dmapr.library.flatclass
• When you use the MapR distribution on the Linux operating system, change the environment variable
LD_LIBRARY_PATH to include the following path: <Informatica Installation Directory>/services/
shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64:.:<Informatica Installation
Directory>/services/shared/bin.
• Add -Dmapr.library.flatclass to the Data Integration Service advanced property JVM Command Line
Options.
hive-site.xml and yarn-site.xml are located in the following directory on the machine where the Data
Integration Service runs: <Informatica installation directory>/services/shared/hadoop/
mapr_<version_number>_yarn/conf/.
The following sample code describes the property you can set in hive-site.xml:
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>maprfs:<staging directory path></value>
</property>
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default port is 19888.
yarn.resourcemanager.scheduler.address
Scheduler interface address. The default port is 8030.
yarn.resourcemanager.webapp.address
Resource Manager web application address.
The following sample code describes the properties you can set in yarn-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:port</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hostname:port</value>
<description>The address of the scheduler interface</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hostname:port</value>
<description>The address for the Resource Manager web application.</description>
</property>
yarn-site.xml is located in the following directory on the Hadoop cluster nodes: /opt/mapr/hadoop/
hadoop-<version>/etc/hadoop.
yarn.nodemanager.resource.memory-mb
Amount of physical memory, in megabytes, that can be allocated for containers.
yarn.scheduler.minimum-allocation-mb
The minimum allocation for every container request at the Resource Manager, in megabytes. Memory requests
lower than this value do not take effect, and this minimum value is allocated instead.
yarn.scheduler.maximum-allocation-mb
The maximum allocation for every container request at the Resource Manager, in megabytes. Memory requests
higher than this value do not take effect and are capped at this value.
yarn.app.mapreduce.am.resource.mb
The amount of memory that the MR AppMaster needs.
yarn.nodemanager.resource.cpu-vcores
Number of CPU cores that can be allocated for containers.
To use the Blaze engine, you must also configure additional properties in the yarn-site.xml file to enable
the Application Timeline Server.
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<description>The minimum allocation for every container request at the RM, in MBs.
Memory requests lower than this won't take effect, and the specified value will get
allocated at minimum.</description>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<description>The maximum allocation for every container request at the RM, in MBs.
Memory requests higher than this won't take effect, and will get capped to this value.</description>
<value>24000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<description> The amount of memory the MR AppMaster needs.</description>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<description> Number of CPU cores that can be allocated for containers. </description>
<value>8</value>
</property>
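The sample above does not show yarn.nodemanager.resource.memory-mb, which is described earlier in this list of properties. A minimal sketch of that entry follows; the 6144 MB value is only an example and must match the memory available on your nodes:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
<value>6144</value>
</property>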
1. Go to the following directory on any node in the Hadoop cluster: <MapR installation directory>/
conf .
2. Find the mapr-cluster.conf file.
3. Copy the file to the following directory on the machine on which the Developer tool runs: <Informatica
installation directory>\clients\DeveloperClient\hadoop\mapr_<version_number>\conf
4. Go to the following directory on the machine on which the Developer tool runs: <Informatica
installation directory>\<version_number>\clients\DeveloperClient
5. Edit run.bat to include the MAPR_HOME environment variable and the -clean settings:
For example, include the following lines:
set MAPR_HOME=<Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_510
developerCore.exe -clean
6. Save and close the file.
7. Add the following values to the developerCore.ini file:
-Dmapr.library.flatclass
-Djava.library.path=hadoop\mapr_<version_number>\lib\native\Win32;bin;..\DT\bin
You can find developerCore.ini in the following directory: <Informatica installation directory>
\clients\DeveloperClient
8. Save and close the file.
9. Use run.bat to start the Developer tool.
High Availability
This chapter includes the following topics:
A highly available Hadoop cluster can provide uninterrupted access to the JobTracker, name node, and
ResourceManager in the cluster. The JobTracker is the service within Hadoop that assigns MapReduce jobs
on the cluster. The name node tracks file data across the cluster. The ResourceManager tracks resources
and schedules applications in the cluster.
You can configure Big Data Management to communicate with a highly available Hadoop cluster on the
following Hadoop distributions:
• Cloudera CDH
• Hortonworks HDP
• IBM BigInsights
• MapR
Configuring Big Data Management for a Highly
Available Cloudera CDH Cluster
You can configure the Data Integration Service and the Developer tool to read from and write to a highly
available Cloudera CDH cluster. The Cloudera CDH cluster provides a highly available name node and
ResourceManager.
14. Edit the Hive connection and configure the following properties in the Properties to Run Mappings in
Hadoop Cluster tab:
Default FS URI
Use the value from the dfs.nameservices property in hdfs-site.xml.
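For example, if dfs.nameservices is set to cluster01, the Default FS URI might look like the following; the name service is illustrative:
hdfs://cluster01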
To enable support for a highly available Hortonworks HDP cluster, perform the following tasks:
On the machine where the Data Integration Service runs, you can find hive-site.xml in the following
directory: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf.
dfs.ha.namenodes.<ClusterName>
The ClusterName is specified in the dfs.nameservices property. The following sample text shows the
property for a cluster named cluster01: dfs.ha.namenodes.cluster01.
Specify the name node IDs as a comma-separated list. For example, you can use the following values:
nn1,nn2.
dfs.namenode.https-address
The HTTPS server that the name node listens on.
The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.https-address.cluster01.nn1
dfs.namenode.http-address
The HTTP server that the name node listens on.
dfs.namenode.http-address.<ClusterName>.<Name_NodeID>
The HTTP server that a highly available name node specified in dfs.ha.namenodes.<ClusterName>
listens on. Each name node requires a separate entry. For example, if you have two highly available
name nodes, you must have two corresponding dfs.namenode.http-
address.<ClusterName>.<Name_NodeID> properties.
The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.http-address.cluster01.nn1
dfs.namenode.rpc-address
The fully-qualified RPC address for the name node to listen on.
dfs.namenode.rpc-address.<ClusterName>.<Name_NodeID>
The fully-qualified RPC address that a highly available name node specified in
dfs.ha.namenodes.<ClusterName> listens on. Each name node requires a separate entry. For example,
if you have two highly available name nodes, you must have two corresponding dfs.namenode.rpc-
address.<ClusterName>.<Name_NodeID> properties.
The following sample text shows a name node with the ID nn1 on a cluster named cluster01:
dfs.namenode.rpc-address.cluster01.nn1.
The following sample text shows the properties for two highly available name nodes with the IDs nn1 and nn2
on a cluster named cluster01:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.namenodes.cluster01</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.https-address</name>
<value>node01.domain01.com:50470</value>
</property>
<property>
<name>dfs.namenode.https-address.cluster01.nn1</name>
<value>node01.domain01.com:50470</value>
</property>
<property>
<name>dfs.namenode.https-address.cluster01.nn2</name>
<value>node02.domain01.com:50470</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>node01.domain01.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster01.nn1</name>
<value>node01.domain01.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster01.nn2</name>
<value>node02.domain01.com:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>node01.domain01.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster01.nn1</name>
<value>node01.domain01.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster01.nn2</name>
<value>node02.domain01.com:8020</value>
</property>
On the machine where the Data Integration Service runs, you can find yarn-site.xml in the following
directory: <Informatica installation directory>/services/shared/hadoop/hortonworks_<version>/
conf.
yarn.resourcemanager.ha.rm-ids
List of highly available Resource Manager IDs.
yarn.resourcemanager.hostname
The host name for the Resource Manager.
yarn.resourcemanager.hostname.<ResourceManagerID>
Host name for one of the highly available Resource Managers specified in
yarn.resourcemanager.ha.rm-ids.
Each Resource Manager requires a separate entry. For example, if you have two Resource Managers,
you must have two corresponding yarn.resourcemanager.hostname.<ResourceManagerID> properties.
The following sample text shows a Resource Manager with the ID rm1:
yarn.resourcemanager.hostname.rm1.
yarn.resourcemanager.scheduler.address
The address of the scheduler interface.
yarn.resourcemanager.scheduler.address.<ResourceManagerID>
The address of the scheduler interface for one of the highly available Resource Managers.
Each resource manager requires a separate entry.
The following sample text shows the properties for two highly available Resource Managers with the IDs rm1
and rm2:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01.domain01.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node01.domain01.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node02.domain01.com</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node01.domain01.com:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node01.domain01.com:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node02.domain01.com:8088</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node01.domain01.com:8030</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>node01.domain01.com:8030</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>node02.domain01.com:8030</value>
</property>
2. Locate the mapr-cluster.conf file.
3. Copy the file to the machine on which the Data Integration Service runs and the machine on which the
Developer tool client runs:
On the machine on which the Data Integration Service runs, copy the file to the following directory:
<Informatica installation directory>/services/shared/hadoop/mapr_<version>/conf
On the machine on which the Developer tool runs, copy the file to the following directory:
<Informatica installation directory>/clients/DeveloperClient/hadoop/mapr_<version>/conf
4. Open the Developer tool.
5. Click Window > Preferences.
6. Select Informatica > Connections.
7. Expand the domain.
8. Expand File Systems and select the HDFS connection.
9. Edit the HDFS connection and configure the following property in the Details tab:
NameNode URI
Use the value of the dfs.nameservices property.
You can get the value of the dfs.nameservices property from hdfs-site.xml from the following
location on the NameNode of the cluster: /etc/hadoop/conf
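For example, a hypothetical hdfs-site.xml entry for a name service called cluster01; the value on your cluster will differ:
<property>
<name>dfs.nameservices</name>
<value>cluster01</value>
</property>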
When you upgrade Big Data Management, you uninstall the previous Big Data Management RPMs and install
the new version.
1. Verify that the Informatica domain and client tools are upgraded.
2. Uninstall the Big Data Management RPM package.
For more information about how to uninstall Big Data Management, see “Informatica Big Data
Management Uninstallation” on page 20
Note: If you used Cloudera Manager parcels to install Big Data Management, skip this step.
3. Install Big Data Management.
For more information about how to install Big Data Management, see “Installation Overview” on page 10
4. Configure Big Data Management.
Complete the tasks in “Post-Installation Overview” on page 22 and Chapter 3, “Configuring Big Data
Management to Run Mappings in Hadoop Environments” on page 41 for your Hadoop distribution.
5. Configure the Developer tool.
For more information, see “Enable Developer Tool Communication with the Hadoop Cluster ” on page 25
6. Optionally, configure Big Data Management to connect to a highly available Hadoop cluster.
For more information, see “Configure High Availability” on page 112
APPENDIX B
For more information about application services, see the Informatica 10.1 Application Service Guide.
Informatica Domain
The following table lists the default port associated with the Informatica domain:
Domain configuration
Default is 6005. You can change the default port during installation. You can modify the port after
installation with the infasetup updateGatewayNode command.
Type Default Port
Analyst Service
The following table lists the default port associated with the Analyst Service:
Analyst Service (HTTPS) No default port. Enter the required port number when you create the service.
Analyst Service (Staging database) No default port. Enter the database port number.
Content Management Service (HTTPS) No default port. Enter the required port number when you create the service.
Data Director Service (HTTP) No default port. Enter the required port number when you create the service.
Data Director Service (HTTPS) No default port. Enter the required port number when you create the service.
Data Integration Service (HTTPS) No default port. Enter the required port number when you create the service.
Profiling Warehouse database No default port. Enter the database port number.
Human Task database No default port. Enter the database port number.
Metadata Manager Service (HTTPS) No default port. Enter the required port number when you create the service.
Use the same port number that you specify in the SVCNODE statement of the DBMOVER file.
If you define more than one Listener Service to run on a node, you must define a unique SVCNODE port
number for each service.
Use the same port number that you specify in the SVCNODE statement of the DBMOVER file.
If you define more than one Listener Service to run on a node, you must define a unique SVCNODE port
number for each service.
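For example, a hypothetical SVCNODE statement in the DBMOVER configuration file; the service name and port number are placeholders:
SVCNODE=(<service_name>,<port_number>)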
Cloudera 5.x
The following table lists the Cloudera Hadoop components and default port numbers:
HDFS read/write 50010, 50020 Open this port for all data nodes.
HiveServer 10000
JobTracker 8021
NameNode 8020
ZooKeeper 2181
HDFS read/write 50010, 50020 Open this port for all data nodes.
HiveServer 10000
JobTracker 8021
NameNode 8020
ZooKeeper 2181
HDFS read/write 50010, 50020 Open this port for all data nodes.
HiveServer 10000
JobTracker 9001
NameNode 9000
ZooKeeper 2181
MapR 5.x
The following table lists the MapR Hadoop components and default port numbers:
CLDB 7222
HiveServer 10000
JobTracker 9001
NameNode 8020
ZooKeeper 5181
HTTP 9080
JSF 9090
Blaze Services
Blaze services include Grid Manager, Orchestrator, the DEF Client, the DEF Daemon, the OOP Container
manager, and the OOP Container.
The Blaze Grid Manager looks for the configured minimum and maximum ports in the Hadoop connection, and then
starts services on available ports within the specified range. The default port range is 12300 to 12600. An
administrator can configure a different range.
The following table lists the ports that the Developer tool installer opens:
Index
B
Big Data Management
  Blaze
    configuration 36
  cluster installation 11, 15, 18
  cluster pre-installation tasks 12
  Data Quality 23
  HiveServer 2
    configuration 86
  single node installation 11, 14, 18
  single node pre-installation tasks 12
C
Cloudera
  creating a staging directory on HDFS 75
  Hadoop cluster properties 74
  mapping configuration 73
cluster installation
  any machine 16, 19
  primary NameNode 15, 18
connections
  HBase 65
  HDFS 65
  Hive 65
  JDBC 65
D
Data Quality
  address reference data files 23
  reference data 23
Data Replication
  installation and configuration 12
H
Hadoop 65
Hadoop distributions
  Amazon EMR 43
  Cloudera 73
  configuration tasks 41
  configuring virtual memory limits 76
  Developer tool file 25, 45, 50
  HDInsight 47
  HortonWorks 56, 80
properties 66
HDInsight
  configuring mappings 47
high availability
  NameNode 113
  ResourceManager 113
Hive connections
  properties 68
Hortonworks
  configuring mappings 56, 80
I
Informatica adapters
  installation and configuration 11
Informatica clients
  installation and configuration 11
Informatica services
  installation and configuration 11
M
mappings in a Hadoop environment
  Hive variables 42
mappings in a Hive environment
  library path 43
  path environment variables 43
MapR
  configuring mappings 97
N
NameNode
  high availability 113
P
primary NameNode
  FTP protocol 16, 18
  HTTP protocol 16, 18
  NFS protocol 16, 18
  SCP protocol 15, 18
R
ResourceManager
  high availability 113
S
Sqoop configuration
  copying JDBC driver jar files 33
V
vcore 36