Informatica PowerCenter (8.6.1) Performance Tuning
Jishnu Pramanik
Informatica is an ETL tool with high performance capability, and we need to make the most of its features to increase performance. With ever increasing user requirements and exploding data volumes, we need to achieve more in less time. The goal of performance tuning is to optimize session performance. This document lists the techniques available to tune Informatica performance.
2.Identifying Bottlenecks
2.1 Overview
The performance of Informatica depends on the performance of several components: database, network, transformations, mappings, sessions, and so on. To tune Informatica performance, we first have to identify the bottleneck. A bottleneck may be present in the source, target, transformations, mapping, session, database, or network. It is best to look for performance issues in the order source, target, transformations, mapping, and session. After identifying the bottleneck, apply the tuning mechanisms in whichever way they are applicable to the project.
To check for a source bottleneck, add a Filter transformation with a condition that always evaluates to FALSE immediately after the source qualifier, run the session, and compare the run time with that of the original session. If the source is fine, the filtered run should take noticeably less time; if the session still takes nearly as long as the original run, there is a source bottleneck.
The same idea helps with transformations. Removing a transformation for testing can be a pain for the developer, since that might require further changes for the session to get back into a ‘working mode’. Instead, put a filter with a FALSE condition just after the transformation and run the session. If the session takes roughly the same time with and without this test filter, the transformation is the bottleneck.
2.5 Identify bottleneck in sessions
We can use the session log to identify whether the source, target, or transformations are the performance bottleneck. Session logs contain thread summary records for the reader, transformation, and writer threads, including the run time and busy percentage of each thread. Basically we have to rely on these thread statistics to identify the cause of performance issues. Once the ‘Collect Performance Data’ option (on the session ‘Properties’ tab) is enabled, performance counters for the transformations also become available for the session run.
♦Increase checkpoint intervals : The Integration Service performance slows each time it waits for the database to
perform a checkpoint. To increase performance, consider increasing the database checkpoint interval. When you
increase the database checkpoint interval, you increase the likelihood that the database performs checkpoints as
necessary, when the size of the database log file reaches its limit.
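For example, on Oracle one way to lengthen the interval between checkpoints is to raise the redo-block count between them; the parameter and value below are purely illustrative, and the appropriate setting should come from the DBA:
ALTER SYSTEM SET LOG_CHECKPOINT_INTERVAL = 100000;
Other databases expose the same idea through their own settings (for example, a recovery-interval option); consult the database documentation.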
♦Use bulk loading : You can use bulk loading to improve the performance of a session that inserts a large amount
of data into a DB2, Sybase ASE, Oracle, or Microsoft SQL Server database. Configure bulk loading in the session
properties.
When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing
to the database log, however, the target database cannot perform rollback. As a result, you may not be able to
perform recovery. When you use bulk loading, weigh the importance of improved session
performance against the ability to recover an incomplete session.
When bulk loading to Microsoft SQL Server or Oracle targets, define a large commit interval to increase
performance. Microsoft SQL Server and Oracle start a new bulk load transaction after each commit. Increasing the
commit interval reduces the number of bulk load transactions, which increases performance.
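To see the effect, take an illustrative load of 1,000,000 rows: a commit interval of 10,000 starts roughly 100 bulk load transactions, while a commit interval of 100,000 starts only about 10, with correspondingly less transaction overhead.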
♦Use external loading : You can use an external loader to increase session performance. If you have a DB2 EE or
DB2 EEE target database, you can use the DB2 EE or DB2 EEE external loaders to bulk load target files. The DB2
EE external loader uses the Integration Service db2load utility to load data. The DB2 EEE external loader uses the
DB2 Autoloader utility.
If you have a Teradata target database, you can use the Teradata external loader utility to bulk load target files. To
use the Teradata external loader utility, set up the attributes, such as Error Limit, Tenacity, MaxSessions, and Sleep,
to optimize performance.
If the target database runs on Oracle, you can use the Oracle SQL*Loader utility to bulk load target files. When you
load data to an Oracle database using a pipeline with multiple partitions, you can increase performance if you create
the Oracle target table with the same number of partitions you use for the pipeline.
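For example, if a session uses four pipeline partitions, the Oracle target could be created with four hash partitions to match; the table and column names below are hypothetical:
CREATE TABLE SALES_TGT (
  SALE_ID     NUMBER,
  CUSTOMER_ID NUMBER,
  AMOUNT      NUMBER(12,2)
)
PARTITION BY HASH (SALE_ID)
PARTITIONS 4;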
If the target database runs on Sybase IQ, you can use the Sybase IQ external loader utility to bulk load target files. If
the Sybase IQ database is local to the Integration Service process on the UNIX system, you can increase
performance by loading data to target tables directly from named pipes. If you run the Integration Service on a grid,
configure the Integration Service to check resources, make Sybase IQ a resource, make the resource available on all
nodes of the grid, and then, in the Workflow Manager, assign the Sybase IQ resource to the applicable sessions.
♦Minimize deadlocks : If the Integration Service encounters a deadlock when it tries to write to a target, the
deadlock only affects targets in the same target connection group. The Integration Service still writes to targets in
other target connection groups.
Encountering deadlocks can slow session performance. To improve session performance, you can increase the
number of target connection groups the Integration Service uses to write to the targets in a session. To use a different
target connection group for each target in a session, use a different database connection name for each target
instance. You can specify the same connection information for each connection name.
♦Increase database network packet size : If you write to Oracle, Sybase ASE, or Microsoft SQL Server targets, you can improve performance by increasing the network packet size. Increase the network packet size to allow
larger packets of data to cross the network at one time. Increase the network packet size based on the database you
write to:
-Oracle. You can increase the database server network packet size in listener.ora and tnsnames.ora. Consult your
database documentation for additional information about increasing the packet size, if necessary.
-Sybase ASE and Microsoft SQL. Consult your database documentation for information about how to increase the
packet size.
For Sybase ASE or Microsoft SQL Server, you must also change the packet size in the relational connection object
in the Workflow Manager to reflect the database server packet size.
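For Oracle, the packet (session data unit) size can be raised by adding an SDU value to the relevant net service entries; the entry below is a hedged sketch with placeholder host and service names, and the exact value should be agreed with the DBA:
ORDERS_DB =
  (DESCRIPTION =
    (SDU = 32767)
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = orders))
  )
A matching SDU value is normally configured on the listener side in listener.ora as well.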
♦Optimize Oracle target databases : If the target database is Oracle, you can optimize the target database by
checking the storage clause, space allocation, and rollback or undo segments.
When you write to an Oracle database, check the storage clause for database objects. Make sure that tables are using
large initial and next values. The database should also store table and index data in separate tablespaces, preferably
on different disks.
When you write to Oracle databases, the database uses rollback or undo segments during loads. Ask the Oracle
database administrator to ensure that the database stores rollback or undo segments in appropriate tablespaces,
preferably on different disks. The rollback or undo segments should also have appropriate storage clauses.
You can optimize the Oracle database by tuning the Oracle redo log. The Oracle database uses the redo log to log
loading operations. Make sure the redo log size and buffer size are optimal. You can view redo log properties in the
init.ora file.
If the Integration Service runs on a single node and the Oracle instance is local to the Integration Service process
node, you can optimize performance by using IPC protocol to connect to the Oracle database. You can set up Oracle
database connection in listener.ora and tnsnames.ora.
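A hedged sketch of such an IPC entry in tnsnames.ora (the key and service names are placeholders) might look like this:
ORCL_IPC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = IPC)(KEY = ORCL))
    (CONNECT_DATA = (SERVICE_NAME = orcl))
  )
The listener on the database host must expose a matching IPC address for this connect string to work.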
• If the source is a flat file, reduce the number of bytes Informatica reads per line (1,024 bytes per line by default) by decreasing the Line Sequential Buffer Length setting in the session properties.
• If possible, give a conditional query in the source qualifier so that records are filtered out as early as possible in the process.
• If the source qualifier query has an ORDER BY or GROUP BY, create an index on the source table and order by the indexed column of the source table (see the example below).
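A minimal sketch of both ideas, assuming a hypothetical ORDERS source table with an ORDER_DATE column:
CREATE INDEX IDX_ORDERS_ORDER_DATE ON ORDERS (ORDER_DATE);
SELECT ORDER_ID, CUSTOMER_ID, ORDER_DATE
FROM ORDERS
WHERE ORDER_DATE >= TO_DATE('2009-01-01', 'YYYY-MM-DD')
ORDER BY ORDER_DATE;
The WHERE clause filters rows at the database, and the ORDER BY uses the indexed column.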
5.Optimizing Mappings
5.1 Overview
Mapping-level optimization may take time to implement, but it can significantly boost session performance. Focus
on mapping-level optimization after you optimize the targets and sources.
Generally, you reduce the number of transformations in the mapping and delete unnecessary links between
transformations to optimize the mapping. Configure the mapping with the least number of transformations and
expressions to do the most amount of work possible. Delete unnecessary links between transformations to minimize
the amount of data moved.
You can also perform the following tasks to optimize the mapping:
♦Optimize the flat file sources.
♦Configure single-pass reading.
♦Optimize Simple Pass Through mappings.
♦Optimize filters.
♦Optimize datatype conversions.
♦Optimize expressions.
♦Optimize external procedures.
Factor out aggregate function calls where possible. For example, in the expression SUM(COLUMN_A) + SUM(COLUMN_B), the Integration Service reads COLUMN_A and finds its sum, reads COLUMN_B and finds its sum, and then adds the two sums. If you factor out the aggregate function call, as below, the Integration Service adds COLUMN_A to COLUMN_B, then finds the sum of both.
SUM(COLUMN_A + COLUMN_B)
Similarly, operators are faster than the equivalent functions. For example, rewrite CONCAT(CONCAT(CUSTOMERS.FIRST_NAME, ‘ ’), CUSTOMERS.LAST_NAME) using the || operator:
CUSTOMERS.FIRST_NAME || ‘ ’ || CUSTOMERS.LAST_NAME
IIF expressions can return a value and an action, which allows for more compact expressions. For example, you have a source with three Y/N flags: FLG_A, FLG_B, FLG_C. You want to return values based on the values of each flag. A naive expression nests one IIF per flag combination:
IIF( FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'Y', VAL_A + VAL_B + VAL_C,
IIF( FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'N', VAL_A + VAL_B,
IIF( FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'Y', VAL_A + VAL_C,
IIF( FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'N', VAL_A,
IIF( FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'Y', VAL_B + VAL_C,
IIF( FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'N', VAL_B,
IIF( FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'Y', VAL_C,
IIF( FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'N', 0.0,
))))))))
This expression requires 8 IIFs, 16 ANDs, and at least 24 comparisons. If you take advantage of the IIF function, you can rewrite that expression as:
IIF(FLG_A = 'Y', VAL_A, 0.0) + IIF(FLG_B = 'Y', VAL_B, 0.0) + IIF(FLG_C = 'Y', VAL_C, 0.0)
This results in three IIFs, two comparisons, two additions, and a faster session.
Evaluating Expressions
If you are not sure which expressions slow performance, evaluate the expression performance to isolate the problem. Complete the following steps:
1. Time the session with the original expressions.
2. Copy the mapping and replace half of the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.
When an external procedure has multiple input groups, consider using blocking. For example, you need to create an external procedure with two input groups. The external procedure reads a row from the first input group and then reads a row from the second input group. If you use blocking, you can write the
external procedure code to block the flow of data from one input group while it processes the data from the other
input group. When you write the external procedure code to block data, you increase performance because the
procedure does not need to copy the source data to a buffer. However, you could write the external procedure to
allocate a buffer and copy the data from one input group to the buffer until it is ready to process the data. Copying
source data to a buffer decreases performance.
5.9 Tips and Tricks
• Avoid executing major SQL queries from mapplets or mappings.
• Where SQL overrides are used, make sure the queries are optimized.
• Reduce the number of transformations in the mapping. Use active transformations such as Rank, Joiner, Filter, and Aggregator as sparingly as possible.
• Remove all the unnecessary links between the transformations from mapping.
• If a single mapping contains many targets, then dividing them into separate mappings can improve
performance.
• If we need to use a single source more than once in a mapping, then keep only one source and source
qualifier in the mapping. Then create different data flows as required into different targets or same target.
• If a session joins many source tables in one source qualifier, optimizing the query will improve performance.
• The SQL query that Informatica generates contains an ORDER BY clause. Remove the ORDER BY clause if it is not needed, or at least reduce the number of columns in the list. For best performance, order by the indexed column of the table.
• Combine the mappings that use same set of source data.
• In a mapping, fields that carry the same information should be given the same type and length throughout the mapping; otherwise time is spent on field conversions.
• Instead of doing complex calculations in the query, use an Expression transformation and do the calculation in the mapping.
• If data is passing through multiple staging areas, removing a staging area will increase performance.
• Stored procedures reduce performance. Try to keep the stored procedures simple in the mappings.
• Unnecessary data type conversions should be avoided since the data type conversions impact performance.
• Transformation errors result in performance degradation. Try running the mapping after removing all transformations; if it takes significantly less time than with the transformations, fine-tune the transformations.
• Keep database interactions to a minimum.
6. Optimizing Transformations
6.1 Overview
You can further optimize mappings by optimizing the transformations contained in the mappings.
You can optimize the following transformations in a mapping:
♦Aggregator transformations.
♦Custom transformations.
♦Joiner transformations.
♦Lookup transformations.
♦Sequence Generator transformations.
♦Sorter transformations.
♦Source Qualifier transformations.
♦SQL transformations
♦Update transformation
♦Filter transformation
♦Expression transformation
When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The Integration Service updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.
You can increase the index and data cache sizes to hold all data in memory without paging to disk.
4. In a Joiner transformation, the source with the smaller number of records should be the master source.
6. Use the source qualifier to perform joins instead of a Joiner transformation wherever possible.
Types of Caches
Use the following types of caches to increase performance:
♦Shared cache. You can share the lookup cache between multiple transformations. You can share an unnamed
cache between transformations in the same mapping. You can share a named cache between transformations in the
same or different mappings.
♦Persistent cache. If you want to save and reuse the cache files, you can configure the transformation to use a
persistent cache. Use this feature when you know the lookup table does not change between session runs. Using a
persistent cache can improve performance because the Integration Service builds the memory cache from the cache
files instead of from the database.
The lookup index cache holds data for the columns used in the lookup condition. For best session performance,
specify the maximum lookup index cache size. Use the following information to calculate the minimum and
maximum lookup index cache for both connected and unconnected Lookup transformations: -
To calculate the minimum lookup index cache size, use the formula:
Minimum lookup index cache = 200 * [<column size> + 16]
To calculate the maximum lookup index cache size, use the formula:
Maximum lookup index cache = <number of rows in lookup table> * [<column size> + 16] * 2
Example:-
Suppose the lookup table has lookup values based in the field ITEM_ID. It uses the lookup condition, ITEM_ID =
IN_ITEM_ID1.
Therefore the total column size is 16. The table contains 60,000 rows.
Minimum lookup index cache = 200 * (16 + 16) = 6,400 bytes
Maximum lookup index cache = 60,000 * (16 + 16) * 2 = 3,840,000 bytes
So this Lookup transformation needs an index cache size between 6,400 and 3,840,000 bytes. For best session performance, specify an index cache size of 3,840,000 bytes.
In a connected transformation, the data cache contains data for the connected output ports, not including ports used
in the lookup condition. In an unconnected transformation, the data cache contains data from the return port.
To calculate the minimum lookup data cache size, use the formula:
Minimum lookup data cache = <number of rows in lookup table> * [<column size of connected output ports not in lookup condition> + 8]
Example:-
Suppose the lookup table has the columns PROMOTION_ID and DISCOUNT as connected output ports that are not in the lookup condition. The column size of each is 16, so the total column size is 32. The table contains 60,000 rows.
Minimum lookup data cache = 60,000 * (32 + 8) = 2,400,000 bytes
The Lookup transformation includes three lookup ports used in the mapping, ITEM_ID, ITEM_NAME, and PRICE.
When you enter the ORDER BY statement, enter the columns in the same order as the ports in the lookup condition.
You must also enclose all database reserved words in quotes.
Enter the following lookup query in the lookup SQL override:
SELECT ITEMS_DIM.ITEM_NAME, ITEMS_DIM.PRICE, ITEMS_DIM.ITEM_ID FROM
ITEMS_DIM ORDER BY ITEMS_DIM.ITEM_ID, ITEMS_DIM.PRICE --
The trailing comment notation ‘--’ suppresses the ORDER BY clause that the Integration Service would otherwise generate.
♦Cached lookups. To improve performance, index the columns in the lookup ORDER BY statement. The session
log contains the ORDER BY statement.
♦Uncached lookups. To improve performance, index the columns in the lookup condition. The Integration Service issues a SELECT statement for each row that passes into the Lookup transformation. (See the index sketch below.)
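A minimal sketch for the ITEMS_DIM example above (the index name is hypothetical); a composite index on the ORDER BY columns also covers the lookup condition column:
CREATE INDEX IDX_ITEMS_DIM_ITEM_PRICE ON ITEMS_DIM (ITEM_ID, PRICE);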
1. To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables; this makes operations run very fast. (The meaning of cache is given in point 2 of this section and the procedure for determining the optimum cache size is given earlier in this document.)
2. Even after caching, performance can be further improved by minimizing the size of the lookup cache. Reduce the number of cached rows by using a SQL override with a restriction.
Cache: A cache stores data in memory so that Informatica does not have to read the table each time it is referenced, which reduces processing time to a large extent. The cache is generated automatically by Informatica from the marked lookup ports or from a user-defined SQL query.
For example, ‘employee_id’ is from the lookup table, EMPLOYEE_TABLE, and ‘eno’ is the input that comes from the source table, SUPPORT_TABLE. If there are 50,000 employee_id values, the lookup cache will hold 50,000 rows; if the cache is restricted to the 1,000 eno values coming from the source, it will hold only 1,000 rows. The performance gain holds only if the number of records in SUPPORT_TABLE is not huge; the aim is to keep the cache as small as possible (see the sketch after this list).
3. In lookup tables, delete all unused columns and keep only the fields that are used in the mapping.
4. If possible, replace lookups with a Joiner transformation or a single source qualifier; a Joiner transformation takes more time than a source qualifier join.
5. If a Lookup transformation specifies several conditions, place the conditions that use the equality operator (=) first on the Condition tab.
6. The SQL override query of the lookup table may contain an ORDER BY clause. Remove it if it is not needed, or put fewer columns in the ORDER BY list.
7. Do not use caching in the following cases: -
-Source is small and lookup table is large.
9. If lookup data is static, use a persistent cache. Persistent caches let you save and reuse cache files. If several sessions in the same job use the same lookup table, a persistent cache lets those sessions reuse the cache files. For static lookups, the memory cache is then built from the saved cache files instead of from the database, which improves performance.
10. If source is huge and lookup table is also huge, then also use persistent cache.
11. If target table is the lookup table, then use dynamic cache. The Informatica server updates the lookup cache
as it passes rows to the target.
12. Use only the lookups you want in the mapping. Too many lookups inside a mapping will slow down the
session.
13. If the lookup table has a lot of data, it will take too long to cache or may not fit in memory. In that case, bring those fields into the source qualifier and join with the main table there.
14. If there are several lookups with the same data set, then share the caches.
15. If we are going to return only one row, use an unconnected lookup.
16. All data is read into the cache in the order the fields are listed in the lookup ports. If there is an index that is even partially in this order, loading the lookup cache can be sped up.
17. If the table used for the lookup has an index (or if we have the privilege to add an index to the table in the database, do so), performance improves for both cached and uncached lookups.
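As a sketch of point 2 above (restricting the cache with a SQL override), the override below caches only the employee rows whose keys actually occur in the source. EMPLOYEE_TABLE, SUPPORT_TABLE, and eno come from that example, while EMPLOYEE_NAME is a hypothetical output port; the selected columns must match the lookup ports:
SELECT EMPLOYEE_ID, EMPLOYEE_NAME
FROM EMPLOYEE_TABLE
WHERE EMPLOYEE_ID IN (SELECT ENO FROM SUPPORT_TABLE)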
1. If we need the sequence generator more than once in a job, make it reusable and use it multiple times in the folder.
2. To generate primary keys, use Sequence generator transformation instead of using a stored procedure for
generating sequence numbers.
3. We can also opt for sequencing in the source qualifier by adding a dummy field in the source definition and
source qualifier, and then giving a sql query like
‘select seq_name.nextval, <other column names>... from <source table name> where <condition if
any>’.
Here seq_name is the Oracle sequence that generates primary keys for the source table; <sequence name>.nextval returns the next value from that sequence object. This method of primary key generation is faster than using a Sequence Generator transformation (a sketch for creating such a sequence follows this list).
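Point 3 above assumes the Oracle sequence already exists; a minimal sketch for creating it (the name comes from the example, the sizing is illustrative) is:
CREATE SEQUENCE seq_name START WITH 1 INCREMENT BY 1 CACHE 1000;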
1. While using the Sorter transformation, configure the sorter cache size to be larger than the input data size.
2. At the Sorter transformation, use hash auto-keys partitioning or hash user-keys partitioning.
6.8 Optimizing Source Qualifier Transformations
Use the Select Distinct option for the Source Qualifier transformation if you want the Integration Service to select
unique values from a source. Use Select Distinct option to filter unnecessary data earlier in the data flow. This can
improve performance.
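For example, with Select Distinct enabled the query the Integration Service issues takes a form like the following (CUSTOMERS and its columns are a hypothetical source):
SELECT DISTINCT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.FIRST_NAME, CUSTOMERS.LAST_NAME
FROM CUSTOMERS
so duplicate rows never leave the database.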
1. Use filter transformation as close to source as possible so that unwanted data gets eliminated sooner.
2. If elimination of unwanted data can be done by the source qualifier instead of a filter, eliminate it in the source qualifier.
3. Use conditional filters and keep the filter condition simple, involving TRUE/FALSE or 1/0.
6.12 Optimizing Expression Transformation
The Expression transformation is used to perform simple calculations; it can also be used to call unconnected lookups.
7.Optimizing Sessions
7.1 Overview
Once you optimize the source database, target database, and mapping, you can focus on optimizing the session. You
can perform the following tasks to improve overall performance:
♦Use a grid. You can increase performance by using a grid to balance the Integration Service workload.
♦Use pushdown optimization. You can increase session performance by pushing transformation logic to the source
or target database.
♦Run sessions and workflows concurrently. You can run independent sessions and workflows concurrently to
improve session and workflow performance.
♦Allocate buffer memory. You can increase the buffer memory allocation for sources and targets that require
additional memory blocks. If the Integration Service cannot allocate enough memory blocks to hold the data, it fails
the session.
♦Optimize caches. You can improve session performance by setting the optimal location and size for the caches.
♦Increase the commit interval. Each time the Integration Service commits changes to the target, performance
slows. You can increase session performance by increasing the interval at which the Integration Service commits
changes.
♦Disable high precision. Performance slows when the Integration Service reads and manipulates data with the high
precision datatype. You can disable high precision to improve session performance.
♦Reduce error tracing. To improve performance, you can reduce the error tracing level, which reduces the number of log events generated by the Integration Service.
♦Remove staging areas. When you use a staging area, the Integration Service performs multiple passes on the data.
You can eliminate staging areas to improve session performance.
You can increase the number of available memory blocks by adjusting the following session parameters:
♦DTM Buffer Size. Increase the DTM buffer size on the Properties tab in the session properties.
♦Default Buffer Block Size. Decrease the buffer block size on the Config Object tab in the session properties.
To configure these settings, first determine the number of memory blocks the Integration Service requires to
initialize the session. Then, based on default settings, calculate the buffer size and/or the buffer block size to create
the required number of session blocks.
If you have XML sources or targets in a mapping, use the number of groups in the XML source or target in the
calculation for the total number of sources and targets.
For example, you create a session that contains a single partition using a mapping that contains 50 sources and 50
targets. Then you make the following calculations:
1. You determine that the session requires a minimum of 200 memory blocks:
[(total number of sources + total number of targets) * 2] = (50 + 50) * 2 = 200
2. Based on default settings, you determine that you can change the DTM Buffer Size to 15,000,000, or you can change the Default Buffer Block Size to 54,000:
(session buffer blocks) = (.9) * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions)
200 = .9 * 14222222 / 64000 * 1
or
200 = .9 * 12000000 / 54000 * 1
Note: For a session that contains n partitions, set the DTM Buffer Size to at least n times the value for the session
with one partition. The Log Manager writes a warning message in the session log if the number of memory blocks is
so small that it causes performance degradation. The Log Manager writes this warning message even if the number
of memory blocks is enough for the session to run successfully. The warning message also gives a suggestion for the
proper value.
If you modify the DTM Buffer Size, increase the property by multiples of the buffer block size.
Note: You may encounter performance degradation when you cache large quantities of data on a mapped or
mounted drive.
1. Partition the session: This creates many connections to the source and target, and loads data in parallel
pipelines. Each pipeline will be independent of the others. But session performance will not improve if the number of records is small, nor if the session mainly performs updates and deletes. So session partitioning should be used only if the volume of data is huge and the job mainly inserts data.
2. Run the sessions in parallel rather than serial to gain time, if they are independent of each other.
3. Drop constraints and indexes before running the session and rebuild them after the session completes. Dropping can be done in a pre-session script and rebuilding in a post-session script (a sketch follows this list). But if the data volume is very large, dropping and rebuilding indexes may not be feasible; in such cases, stage all the data, pre-create the index, use a transportable tablespace, and then load into the database.
4. Use bulk loading, external loading etc. Bulk loading can be used only if the table does not have an index.
5. In a session we have the option to treat rows as ‘Data Driven’, ‘Insert’, ‘Update’, or ‘Delete’. If update strategies are used, it has to be kept as ‘Data Driven’. But when the session only inserts rows into the target table, set it to ‘Insert’ to improve performance.
6. Increase the database commit interval (the point at which the Informatica server commits data to the target table; for example, the commit interval can be set at every 50,000 records).
7. Avoiding built-in functions as much as possible improves performance. For example, for concatenation the operator ‘||’ is faster than the function CONCAT(), so use operators instead of functions where possible. Functions like IS_SPACES(), IS_NUMBER(), IIF(), and DECODE() reduce performance to a large extent, roughly in that order, so prefer them in the opposite order.
8. String functions like SUBSTR, LTRIM, and RTRIM reduce performance. Use delimited strings if the sources are flat files, or use the varchar data type.
9. Manipulating high precision data types slows down the Informatica server, so disable ‘high precision’ when it is not needed.
10. Localize all source and target tables, stored procedures, views, sequences etc. Try not to connect across
synonyms. Synonyms and aliases slow down the performance.
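A minimal sketch of point 3 above (dropping an index before the load and rebuilding it afterwards); the index, table, and column names are hypothetical:
-- Pre-session SQL:
DROP INDEX IDX_SALES_TGT_CUSTOMER_ID;
-- Post-session SQL:
CREATE INDEX IDX_SALES_TGT_CUSTOMER_ID ON SALES_TGT (CUSTOMER_ID);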
8.Optimizing the System
8.1 Overview
Often performance slows because the session relies on inefficient connections or an overloaded Integration Service
process system. System delays can also be caused by routers, switches, network protocols, and usage by many users.
Slow disk access on source and target databases, source and target file systems, and nodes in the domain can slow
session performance. Have the system administrator evaluate the hard disks on the machines.
After you determine from the system monitoring tools that you have a system bottleneck, make the following global
changes to improve the performance of all sessions:
♦Improve network speed. Slow network connections can slow session performance. Have the system administrator
determine if the network runs at an optimal speed. Decrease the number of network hops between the Integration
Service process and databases.
♦Use multiple CPUs. You can use multiple CPUs to run multiple sessions in parallel and run multiple pipeline
partitions in parallel.
♦Reduce paging. When an operating system runs out of physical memory, it starts paging to disk to free physical
memory. Configure the physical memory for the Integration Service process machine to minimize paging to disk.
♦Use processor binding. In a multi-processor UNIX environment, the Integration Service may use a large amount
of system resources. Use processor binding to control processor usage by the Integration Service process. Also, if
the source and target database are on the same machine, use processor binding to limit the resources used by the
database.
9.Optimizing Database
9.1 Tips and Tricks
To gain the best Informatica performance, the database tables, stored procedures and queries used in Informatica
should be tuned well.
1. If the source and target are flat files, they should be present on the system on which the Informatica server runs.
Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Network connections therefore often affect session performance, so avoid unnecessary network transfers.