PowerCenter On AWS Architecture
Informatica, the Informatica logo [and any other trademarks appearing in the document] are trademarks or
registered trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A
current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html.
Other company and product names may be trade names or trademarks of their respective owners.
The information in this documentation is subject to change without notice. If you find any problems in this
documentation, please report them to us in writing at Informatica LLC 2100 Seaport Blvd. Redwood City, CA
94063.
INFORMATICA LLC PROVIDES THE INFORMATION IN THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.
Contents
Introduction
PowerCenter for all Your Data Integration Needs
AWS Overview
PowerCenter Deployment Options
Deployment Architecture
PowerCenter to Redshift Connectivity
Amazon Redshift Overview
Configuring PowerExchange for Redshift
PowerCenter Session with a Redshift Connection
Integration Best Practices
Regions and Availability Zones
Networking, Connectivity and Security
Introduction
Organizations are looking for opportunities to reduce their on-premises datacenter footprint by
offloading or extending on-premises applications and data warehouses to the cloud. Cloud
deployments increase agility as they allow organizations to rapidly add new capabilities and scale up
and down as their needs change. Cloud solutions free up IT resources from supporting commoditized
infrastructure and allow them to focus on building differentiating technical capabilities.
A robust data integration solution will greatly increase the success of your organization’s journey to
the cloud, helping you implement hybrid cloud use cases such as hybrid data warehousing or
application migration to the cloud. A successful data integration solution should let your
organization address its data management needs in both the current state and the planned
future state.
Informatica PowerCenter is a proven data integration solution that transforms fragmented, raw data
from any source, any technology, at any latency into complete, high-quality, actionable information.
Customers of Amazon Web Services (AWS) and Informatica can now deploy PowerCenter in the AWS
public cloud, leveraging the data management power of PowerCenter and the flexibility of the AWS
cloud. Customers who have invested in Informatica PowerCenter and are migrating their environment
to Amazon Web Services (AWS) can fully leverage this investment by deploying PowerCenter on
AWS. Customers who are interested in a fully managed, multi-tenant iPaaS (Integration Platform as a
Service) option can also explore Informatica Cloud as an alternative.
This document provides technical guidance on how you can seamlessly expand the data
management experience of PowerCenter to Amazon Web Services by migrating PowerCenter to
AWS.
PowerCenter for all Your Data Integration Needs
PowerCenter is certified to run in the AWS environment, which offers a great option for current
PowerCenter customers considering moving or offloading their applications and/or data warehouses
to AWS. This allows them to realize cloud benefits while leveraging their existing data management
investment in PowerCenter.
Enterprise class data integration. PowerCenter is an enterprise proven data integration solution
that can process billions of records in the AWS cloud.
Connects to existing data sources and quickly onboards new data sources and data types.
PowerCenter offers a vast array of connectors, whether you want to connect to on-premises data
sources or AWS services such as Amazon Redshift, Amazon RDS, or Amazon S3.
Accelerates data architecture modernization. If you are planning to modernize your data
warehousing initiatives on AWS, PowerCenter’s rich functionalities such as metadata driven data
integration, dynamic mappings, SQL conversion mapping, and automatic data validation will help
you to shorten development cycles and reduce time to market.
Delivers clean, complete and trustworthy data. Whether you are offloading or extending on-premises
applications to the cloud or fully embracing the cloud, delivering complete, high-quality
data is critical. PowerCenter has a long history of helping organizations empower their users with
complete, high-quality, actionable data.
AWS Overview
Amazon Web Services offers the basic building blocks of storage, networking and compute, as well
as services such as managed database, big data, and messaging services. PowerCenter can help
you get the most out of the following:
You can easily customize the network configuration for your Amazon Virtual Private Cloud. For
example, you can create a public-facing subnet for your webservers that has access to the Internet,
and place your backend systems such as databases or application servers in a private-facing subnet
with no Internet access. You can leverage multiple layers of security, including security groups and
network access control lists, to help control access to Amazon EC2 instances in each subnet.
PowerCenter Deployment Options
Starting with Informatica version 9.6.1 HotFix 3, PowerCenter customers can choose to seamlessly
extend the data integration and data management experience to AWS. As a PowerCenter customer,
you can execute the full Informatica data pipeline on AWS and take advantage of multiple AWS
services. Informatica products can be deployed on AWS with new or existing on-premises licenses.
PowerCenter installed on an EC2 instance can leverage all standard PowerCenter connection types
and PowerExchange© adapter connections to talk to on-premises applications. When PowerCenter is
installed on EC2, you can take advantage of the economies of scale of AWS, while reducing your total
cost of ownership. You can leverage lower latency to services like Amazon Redshift and capitalize on
the security and availability features built-in to the AWS platform.
Back-up strategy. Traditionally, an often-costly effort that involves multiple vendors and media,
with different management planes. On AWS, back up your data to S3 for a durable, low-cost
approach and use the built-in data lifecycle policies to get the right storage at the right price.
Informatica fully supports its products running on Amazon EC2. Informatica does not provide general
support for cloud computing specific issues. For general cloud computing support, we recommend
that you maintain a support relationship with your cloud computing vendor. Please refer to the Informatica
Support Statement related to Usage of Informatica Products on a Cloud Computing Platform for
supported editions of PowerCenter on AWS.
Deployment Architecture
To install PowerCenter on the AWS Cloud Infrastructure, use one of the following installation
methods: Marketplace deployment (recommended) or conventional manual installation.
PowerCenter is available on AWS Marketplace. You can subscribe to PowerCenter listings and
deploy PowerCenter in AWS Cloud Infrastructure with simple and quick configurations. Use the
PowerCenter listing on AWS Marketplace for an optimal configuration of the Informatica domain,
domain database and AWS infrastructure settings. You can also install PowerCenter on AWS Cloud
Infrastructure manually, in which case you must configure the AWS infrastructure settings yourself,
such as the Amazon EC2 instance configuration, networking settings (VPC, security group, inbound
and outbound rules, and so on) and the domain database configuration.
When you run PowerCenter on AWS Cloud Infrastructure, you experience the same product features
as when you run PowerCenter on-premises. Whether you are installing PowerCenter for the first time or
planning to migrate from an on-premises platform to AWS, the steps involved in running
Informatica services inside AWS follow a similar deployment lifecycle.
Figure 1: PowerCenter Services Running on AWS
Make sure to enable database snapshots. Amazon RDS stores these full database backups until you
explicitly delete them, which preserves your repository contents in case they need to be
restored.
Note: Allocated storage in RDS cannot be scaled down. RDS backups cannot be used for database
restore and recovery outside of AWS. Unlike EC2 instances, database users in RDS cannot be
managed through the AWS Management Console.
Security Group Settings
Configure a security group to allow traffic on the following ports:
Storage Segregation
PowerCenter stores a variety of types of data when installed on EC2: workflow and session logs,
repository backups, and both persistent and non-persistent cache files generated by transformations.
For maximum performance on EC2, use EBS volumes for persistent data and ephemeral instance
storage for temporary cache data. Examples of data suitable for EBS include:
$INFA_HOME, repository backup, and the session log directory.
Figure 2: PowerCenter Services in an On-Premises Setup and Redshift Cluster
The PowerCenter Integration Service uses the Redshift driver to communicate with Amazon
Redshift. The PowerCenter Integration Service writes data to Amazon Redshift based on the
workflow and Amazon Redshift connection configuration.
The PowerCenter Integration Service first writes data to Amazon S3, and then initiates a copy of data
into Redshift. This leverages the Amazon Redshift massively parallel processing (MPP) architecture
to read and load data in parallel from files in an Amazon S3 bucket.
Make sure that the S3 bucket that is specified in the session properties has the correct permissions
and is in the same region as Redshift so that Informatica can successfully upload the source data.
Figure 3: PowerCenter Services on AWS and with Redshift Connectivity
An Amazon Redshift deployment consists of a group of machines called nodes. The group is called a
Redshift cluster. A cluster can consist of a single node, which is a single machine. However, a
cluster generally consists of more than one node.
In “distributed” mode, a dedicated machine called the leader node coordinates the incoming
traffic for the cluster. In “pseudo-distributed” mode, the same machine acts as both the leader node
and the compute node.
Compute nodes are where the data actually resides. The leader node is the gateway for all client
requests: it parses each request, creates an execution plan for the query, compiles the code, and
dispatches the compiled code to the compute nodes.
Each compute node executes its portion of the code sent by the leader node and responds with an
intermediate result. Control then passes back to the leader node, which takes the individual results
from the compute nodes, aggregates them, and sends the final result back to the requesting client.
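The request flow described above can be illustrated with a toy simulation in Python. This is only a sketch of the dispatch-and-aggregate pattern, not a description of how Redshift is implemented. The function names `compute_node_count` and `leader_node_count` and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def compute_node_count(rows: List[Row], column: str) -> Dict[str, int]:
    """Compute node: return an intermediate count per distinct value of `column`."""
    partial: Dict[str, int] = {}
    for row in rows:
        key = row[column]
        partial[key] = partial.get(key, 0) + 1
    return partial


def leader_node_count(nodes: List[List[Row]], column: str) -> Dict[str, int]:
    """Leader node: send the work to every compute node, then merge the results."""
    final: Dict[str, int] = {}
    for node_rows in nodes:
        for key, count in compute_node_count(node_rows, column).items():
            final[key] = final.get(key, 0) + count
    return final


# Hypothetical cluster: each inner list holds the rows stored on one compute node.
cluster = [
    [{"city": "Boston"}, {"city": "Austin"}],
    [{"city": "Boston"}],
]
result = leader_node_count(cluster, "city")
print(result)
```

In this sketch the leader never scans rows itself. It only merges the partial counts that each compute node returns, which mirrors the division of labor described in the text.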
Each compute node is an individual unit of the cluster with its own CPU, memory, and disk storage.
A compute node is partitioned into node slices. A node slice controls a portion of the compute
node’s memory and disk space. Node slices enable parallel execution of queries inside a compute
node, with each slice handling a subset of the data flowing through that node.
Redshift has several salient features that make it suitable for a petabyte scale data warehouse:
Query Engine
Redshift makes use of a database engine (query optimizer) that’s MPP aware and exploits the
parallel processing capabilities to the fullest.
Columnar storage
In OLTP systems you typically query entire rows, but in a data warehouse architecture, data access
patterns often return many rows but only a few columns. Columnar storage stores the values of each
column contiguously on disk, so a query reads only the columns it needs, which lets the database
process billions of rows quickly.
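The difference between row-oriented and column-oriented layouts can be sketched in a few lines of Python. The example below is an in-memory illustration only, not Redshift's on-disk format. The column names come from the TICKIT VENUE table used later in this document, while the values and the helper name `to_columns` are made up for illustration.

```python
from typing import Any, Dict, List

# Row-oriented layout: every record carries all of its fields together.
rows: List[Dict[str, Any]] = [
    {"venueid": 1, "venuename": "Hall A", "venueseats": 1200},
    {"venueid": 2, "venuename": "Hall B", "venueseats": 800},
    {"venueid": 3, "venuename": "Hall C", "venueseats": 500},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row records into one contiguous list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_seats = sum(columns["venueseats"])
print(total_seats)
```

With the columnar layout, the aggregate reads a single list and never touches `venueid` or `venuename`, which is the access pattern that the paragraph above describes.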
To access Redshift using PowerCenter, PowerCenter does not need to be inside the AWS ecosystem.
PowerCenter can access the Redshift cluster using PowerExchange for Redshift installed on the
Informatica on-premises server.
Informatica guidance: Use ODBC only for pushdown optimization when moving data within Redshift.
PowerExchange for Redshift is much faster than ODBC when bringing data from outside AWS.
Figure 4: PowerCenter Clients and Amazon Redshift
The matrix below compares the capabilities that PowerExchange for Redshift and ODBC
provide when used inside PowerCenter sessions:
ODBC facilitates pushdown optimization in PowerCenter. PowerExchange for Redshift, on the other
hand, optimizes bulk loads by first writing data to an S3 bucket and then using a COPY command
to load the data into Redshift. The COPY command uses Redshift’s MPP architecture to read
and load data in parallel from multiple data sources and is faster and more efficient than INSERT
commands.
To access the S3 bucket, you need an AWS access key ID and a secret access key. The access key ID is an
alphanumeric text string. It uniquely identifies the user who owns the account that has privileges on
the S3 bucket. The secret access key serves as the password to validate the credentials when the
PowerCenter session tries to connect to the S3 bucket. Never share your secret access key with anyone.
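For helper scripts that run alongside PowerCenter, one common way to avoid sharing or hard-coding the key pair is to read it from the environment. The sketch below assumes the keys are exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, the variable names used by the AWS CLI and SDKs. The helper names `load_s3_credentials` and `mask` are hypothetical, and this does not describe how PowerCenter itself stores connection credentials.

```python
import os
from typing import Mapping, Tuple


def load_s3_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (access key ID, secret access key) from the given environment."""
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None


def mask(secret: str) -> str:
    """Return a fixed-width mask so that logs never reveal the secret or its length."""
    return "*" * 8


# Example with placeholder values (not real credentials).
key_id, secret = load_s3_credentials(
    {"AWS_ACCESS_KEY_ID": "AKIDEXAMPLE", "AWS_SECRET_ACCESS_KEY": "not-a-real-secret"}
)
print(key_id, mask(secret))
```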
Number of Nodes in the Cluster
Define the right number of nodes when you create your Amazon Redshift cluster. You can view this
property in the Redshift console under Cluster Properties.
JDBC URL
The JDBC URL is your connection URL. Click on the cluster name on Amazon Redshift console. A
window appears on the Configuration tab. Under Cluster Database Properties, there are two URLs: one
each for JDBC and ODBC. Use the JDBC URL for your configuration in PowerCenter.
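A quick sanity check on the copied URL can catch paste errors before you configure the connection. The sketch below assumes the URL has the form `jdbc:redshift://<endpoint>:<port>/<database>` and falls back to port 5439, the Redshift default, when no port is given. The function name `parse_redshift_jdbc_url` and the sample endpoint are hypothetical.

```python
from typing import Dict, Union
from urllib.parse import urlparse


def parse_redshift_jdbc_url(jdbc_url: str) -> Dict[str, Union[str, int]]:
    """Split a Redshift JDBC URL into host, port, and database name."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("URL must start with 'jdbc:'")
    parsed = urlparse(jdbc_url[len(prefix):])
    if parsed.scheme != "redshift" or not parsed.hostname:
        raise ValueError("Expected jdbc:redshift://<endpoint>:<port>/<database>")
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5439,  # assumed default when the port is omitted
        "database": parsed.path.lstrip("/"),
    }


# Hypothetical endpoint, for illustration only.
info = parse_redshift_jdbc_url(
    "jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev"
)
print(info)
```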
PowerCenter Session with a Redshift Connection
To read and write data with Amazon Redshift as a source or target:
Create a mapping with any source and a relational target to write data to an Amazon Redshift
target. Starting in PowerCenter 9.6.1 HotFix 3, you can import a Redshift source and target.
To write data to an Amazon Redshift table, you must configure an Amazon Redshift connection
in the Workflow Manager. Create a session and associate it with the mapping that you created to
move data to an Amazon Redshift table. Define the session properties to write data to Amazon
Redshift.
The PowerCenter Integration Service writes the data to a staging directory and then to an Amazon
S3 bucket before it writes the data to Amazon Redshift. You must specify the location of the
staging directory in the session properties. You must also specify an Amazon S3 bucket name in
the session properties. You must have write access to the Amazon S3 bucket.
Session Configuration
S3 Bucket Name
Use the bucket created for Redshift sessions. Create a bucket in the same region as the Redshift
cluster.
Enable Compression
Improves session performance. The Enable Compression property is enabled by default. It
compresses the staged files before the files are written to Amazon Redshift.
Batch Size
A critical component in overall performance of the system. Use a batch size high enough to limit the
number of batches created to 4 or 5.
Recommended Batch Size = Total number of rows on input source / 5
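The formula above can be expressed as a small helper. This is a minimal sketch: the function name `recommended_batch_size` and the choice to round up are assumptions. Rounding up keeps the number of batches at or below the target instead of producing a small extra batch.

```python
def recommended_batch_size(total_rows: int, target_batches: int = 5) -> int:
    """Return a batch size that splits `total_rows` into at most `target_batches` batches."""
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if target_batches < 1:
        raise ValueError("target_batches must be at least 1")
    # Integer ceiling division; never return a batch size below 1.
    return max(1, -(-total_rows // target_batches))


print(recommended_batch_size(1_000_000))
print(recommended_batch_size(1_000_003))
```

For an input source of 1,000,000 rows, this gives a batch size of 200,000, which yields five batches.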
Success File Directory
A directory on the node where the PowerCenter Integration Service is running. The directory serves as
the location for the success file in the session properties. By default, the PowerCenter Integration
Service generates the success file with the following naming convention:
<sessionName>_<timestamp>_success.csv.
The PowerCenter Integration Service generates a success file after each session execution. The file has
an entry for each record that is successfully written to Amazon Redshift. Each entry describes the
values written for all the fields of the record.
By default, the PowerCenter Integration Service writes a blank file to $PMBadFileDir and
generates an error file with the name
<sessionName>_<timestamp>_error.csv.
S3 Encryption properties
Turn on S3 Server Side Encryption. Use this option if server-side encryption is already enabled on the S3
buckets. When the option is turned on, PowerCenter sessions maintain the encryption. This is
recommended unless you want to use your own encryption keys.
Turn on S3 Client Side Encryption. Use this option if a private encryption key must be used. Provide the
master symmetric key in the Redshift application connection and turn on S3 client-side
encryption.
For more information about S3 encryption, see server side encryption and client side encryption.
Use pushdown optimization to take advantage of Redshift’s MPP architecture and run SQL
operations quickly inside the database. It’s easy to turn on pushdown optimization in a mapping without any major
design changes to the mapping.
In this example, we will create a mapping, m_agg_event, using sample tables from a TICKIT
database. The mapping uses an EVENT table as lookup to populate an aggregate table AGG_VENUE.
The mapping uses a VENUE table as the source.
Use the following link to access a TICKIT database, table DDLs, and sample data:
http://docs.aws.amazon.com/redshift/latest/dg/t_creating_database.html.
Figure 5: The mapping uses venueid as the join column between the venue and event tables.
The Aggregator transformation performs a count of venueseats, using eventname and venuename as the group by
columns. The full pushdown option is turned on in the session.
The session pushes the following insert DML to the Redshift cluster:
Compression
Database level. At the table level inside Redshift, column compression is used as a space
reduction strategy. Compression saves disk space by compressing column values. Space
reduction also minimizes I/O as compressed data is loaded into server memory before being
uncompressed. Redshift supports multiple compression encodings and will automatically
choose the most efficient based upon your data.
See the Redshift documentation for a full list of compression encodings allowed in Redshift:
http://docs.aws.amazon.com/redshift/latest/dg/c_Compression_encodings.html.
Session level. Use compression at the PowerCenter session level to further reduce the space
occupied at the intermediate stage of ETL and improve performance. PowerCenter supports
loading compressed data to Amazon Redshift.
When loading large data sets, enable compression in the session properties of
the Redshift target. This allows compression of staged files before writing the files to Amazon S3
bucket. The PowerCenter Integration Service issues a COPY command that copies compressed data
from Amazon S3 to the Amazon Redshift target table using the GZIP option.
Here is a sample copy command that gets fired from the PowerCenter Integration Service:
copy sample_tbl (a)
from 's3://sampleredshiftbucket/0b0ad503-1c2c-4514-95ac-85a5adb71b3b1441213218371/INSERT_sample_tbl.batch_0.csv.'
credentials 'aws_access_key_id=********;aws_secret_access_key=********'
MAXERROR 1 DELIMITER ',' QUOTE '"' GZIP NULL '' IGNOREHEADER 1 CSV ROUNDEC ;
During and after the load execution, a COPY command similar to the above example is visible in the Redshift
console at https://console.aws.amazon.com/redshift/.
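To make the structure of such a COPY statement easier to see, the sketch below assembles a simplified one from its parts. It is an illustration only and does not reproduce the exact statement that the PowerCenter Integration Service generates. The function name `build_copy_statement`, the chosen subset of options, and the S3 path are hypothetical, and the credentials are left as placeholders.

```python
from typing import Sequence


def build_copy_statement(table: str, columns: Sequence[str], s3_path: str,
                         gzip: bool = True) -> str:
    """Assemble a simplified Redshift COPY statement for a staged CSV file."""
    options = ["CSV", "IGNOREHEADER 1", "MAXERROR 1"]
    if gzip:
        # Tells Redshift that the staged file is gzip-compressed.
        options.append("GZIP")
    column_list = ", ".join(columns)
    # Placeholders only: never embed real keys in source code or logs.
    credentials = ("aws_access_key_id=<access-key-id>;"
                   "aws_secret_access_key=<secret-access-key>")
    return (f"copy {table} ({column_list}) from '{s3_path}' "
            f"credentials '{credentials}' {' '.join(options)};")


statement = build_copy_statement(
    "sample_tbl", ["a"], "s3://sampleredshiftbucket/staging/part_0.csv.gz"
)
print(statement)
```

When session compression is disabled, the same helper can be called with `gzip=False` so that the GZIP option is left out.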
Vacuum
Amazon Redshift does not reclaim and reuse space that is freed when you delete rows and update
rows, unless specifically instructed by the vacuum command.
Vacuum matters for both space and performance. Redshift does a soft delete
during a delete operation: the rows are marked for deletion but not physically removed. Any query
running on the table will still scan the deleted records because they have not been physically
removed from the database blocks.
Vacuum performs housekeeping on the database by reclaiming the empty space left by deleted
records in a table. Then, it performs a re-sort of the remaining records. A PowerCenter session
provides three VACUUM options based on application needs:
Full. When full vacuum is turned on, Amazon Redshift reclaims the space left void by any
previous update or delete operation and then re-sorts the remaining rows. This is also the
default vacuum in Redshift.
Sort only. Sorts the new rows after an update or delete, but does not reclaim the disk space left
open by deletes. This option is less resource intensive than a full vacuum and allows
the optimizer to take advantage of the re-sorted order in query plans.
Delete only. Reclaims any disk space left open by a previous delete or update
operation, without re-sorting. Use this option when disk space optimization is the primary goal. This option does not
assist in any query optimization.
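Each of the three options corresponds to a form of the Redshift VACUUM command: `VACUUM FULL`, `VACUUM SORT ONLY`, and `VACUUM DELETE ONLY`. The sketch below maps an option name to the matching statement. The function name `build_vacuum_statement` and the option labels used as dictionary keys are hypothetical, and the code does not describe how the PowerCenter session issues the command internally.

```python
from typing import Optional

# Option label (as used in this document) -> Redshift VACUUM keyword.
VACUUM_MODES = {
    "full": "FULL",
    "sort only": "SORT ONLY",
    "delete only": "DELETE ONLY",
}


def build_vacuum_statement(mode: str, table: Optional[str] = None) -> str:
    """Return the VACUUM statement for one table, or for all tables if none is given."""
    try:
        keyword = VACUUM_MODES[mode.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown vacuum mode: {mode!r}") from None
    target = f" {table}" if table else ""
    return f"VACUUM {keyword}{target};"


print(build_vacuum_statement("Delete only", "agg_venue"))
print(build_vacuum_statement("full"))
```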
Analyze
The analyze command refreshes the table statistics to help the query optimizer create an
up-to-date plan the next time it runs the query. If analyze is not run after a considerable number of records
are added to or deleted from a table, the optimizer generates a query plan based on outdated table
statistics.
When the analyze option is turned on while executing the load using PowerExchange for Redshift, the
session issues a COPY ANALYZE <Target table Name> statement. COPY ANALYZE works on the
input data and automatically applies optimal compression encodings to the target table.
PowerExchange for Redshift can perform a vacuum and analyze on the whole database or a
particular table based on need.
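For ad hoc maintenance outside a session, the same choice between a single table and the whole database applies to the Redshift ANALYZE command: `ANALYZE <table>;` refreshes one table, and `ANALYZE;` with no table name covers all tables in the current database. The sketch below pairs that with a simple rule for deciding when a refresh is worthwhile. The function names and the 10 percent threshold are assumptions chosen for illustration, not Informatica or AWS recommendations.

```python
from typing import Optional


def build_analyze_statement(table: Optional[str] = None) -> str:
    """Return ANALYZE for one table, or for the whole database if no table is given."""
    return f"ANALYZE {table};" if table else "ANALYZE;"


def needs_analyze(changed_rows: int, total_rows: int, threshold: float = 0.10) -> bool:
    """Flag a table once the share of added or deleted rows reaches the threshold."""
    if total_rows <= 0:
        # An empty table has no statistics to reuse; analyze if anything changed.
        return changed_rows > 0
    return changed_rows / total_rows >= threshold


if needs_analyze(changed_rows=150_000, total_rows=1_000_000):
    print(build_analyze_statement("agg_venue"))
```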
Connectivity to the Internet and Other AWS Services
Deploying the instances in a public subnet allows them to have access to the Internet for outgoing
traffic as well as to other AWS services, such as S3 and RDS.
Security Groups
Security Groups are analogous to firewalls. You can define rules for EC2 instances and define
allowable traffic, IP addresses, and port ranges. Instances can belong to multiple security groups.
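The way security group rules combine can be modelled in a few lines of Python. Security group rules only allow traffic, so a connection is permitted if any rule in any group attached to the instance matches it. The sketch below is a simplified model of inbound rules only. The class `Rule`, the function `is_allowed`, the CIDR ranges, and the port numbers are hypothetical examples.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Iterable, List


@dataclass(frozen=True)
class Rule:
    """One inbound allow rule: a source CIDR block and an inclusive port range."""
    cidr: str
    from_port: int
    to_port: int


def is_allowed(groups: Iterable[List[Rule]], source_ip: str, port: int) -> bool:
    """Allow the connection if any rule in any attached group matches it."""
    source = ip_address(source_ip)
    return any(
        source in ip_network(rule.cidr) and rule.from_port <= port <= rule.to_port
        for group in groups
        for rule in group
    )


# Hypothetical groups attached to one EC2 instance.
admin_group = [Rule("203.0.113.0/24", 22, 22)]     # SSH from an office network
domain_group = [Rule("10.0.0.0/16", 6005, 6010)]   # service ports inside the VPC

print(is_allowed([admin_group, domain_group], "10.0.1.15", 6005))
print(is_allowed([admin_group, domain_group], "198.51.100.7", 22))
```

Because an instance can belong to multiple security groups, the effective rule set in this model is the union of the rules in all attached groups.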
Worldwide Headquarters, 2100 Seaport Blvd, Redwood City, CA 94063, USA Phone: 650.385.5000 Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871 informatica.com linkedin.com/company/informatica twitter.com/Informatica
© 2017 Informatica LLC. All rights reserved. Informatica® and Put potential to work™ are trademarks or registered trademarks of Informatica in the
United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks.
© Amazon Web Services, Inc. or its affiliates. All rights reserved.