Design Approach To Handle Late Arriving Dimensions and Late Arriving Facts
In a typical data warehouse load, dimensions are processed first and facts are loaded later, on the assumption that all required dimension data is already in place. This is not always true, because of the nature of the business process or the behavior of the source application. Fact data, too, can reach the warehouse long after the actual transaction was created. In this article let's discuss several options for handling late arriving dimensions and facts.
Design Approaches
Depending on the business scenario and the type of dimension in use, we can take different design approaches.
One approach is simply to hold back the affected fact records and load them once the missing dimension records arrive. This works when late arriving dimensions are rare in the business process and a delay in loading fact records until the dimension records are available is acceptable for the business.
In the insurance claim example explained in the beginning, it is almost certain that the "patient id" will be part of the claim fact, and it is the natural key of the patient dimension. So we can create a new placeholder dimension record for the patient with a new surrogate key and the natural key "patient id".
Note : When you get all the other attributes for the patient dimension record at a later point, you will have to do an SCD Type 1 update the first time and SCD Type 2 going forward.
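A minimal SQL sketch of this placeholder pattern is shown below. The table, sequence and column names (PATIENT_DIM, PATIENT_SK_SEQ, INFERRED_FLAG and so on) are illustrative assumptions, not the actual schema from the article.

-- Create an inferred (placeholder) dimension record when only the natural key is known.
INSERT INTO PATIENT_DIM (PATIENT_SK, PATIENT_ID, PATIENT_NAME, INFERRED_FLAG, EFF_START_DATE, EFF_END_DATE)
VALUES (PATIENT_SK_SEQ.NEXTVAL, :claim_patient_id, 'UNKNOWN', 'Y', SYSDATE, TO_DATE('31-DEC-9999','DD-MON-YYYY'));

-- When the full patient record arrives later, apply an SCD Type 1 update to the placeholder row.
UPDATE PATIENT_DIM
   SET PATIENT_NAME  = :src_patient_name,
       INFERRED_FLAG = 'N'
 WHERE PATIENT_ID    = :claim_patient_id
   AND INFERRED_FLAG = 'Y';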
As described above, we can handle a late arriving dimension by keeping an "Unknown" or "Inferred" dimension record, which acts as a placeholder. Even before we get the full dimension record details from the source system, there may be multiple SCD Type 2 changes to the placeholder dimension record. Each of these leads to the creation of a new dimension record with a new surrogate key, and any subsequent fact records must be modified so their surrogate keys point to the new surrogate key.

Late Arriving Dimension with retro effective changes

You can get dimension records from the source system with retro effective dates. For example, you might update your marital status in your HR system long after your marriage date, and this update comes to the data warehouse with a retro effective date. This leads to a new dimension record with a new surrogate key and changes in the effective dates of the affected dimension rows. You will have to scan forward in the dimension to see if there are any subsequent Type 2 rows for this dimension, and then modify any subsequent fact records so their surrogate keys point to the new surrogate key.
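The fact re-pointing step can be sketched in SQL as below. EMPLOYEE_DIM, SALES_FACT and the column and bind-variable names are assumptions made for illustration only.

-- After a retro-effective Type 2 change creates a new surrogate key (:new_sk) effective
-- from :retro_eff_date, re-point fact rows that now fall into the new effective window.
UPDATE SALES_FACT f
   SET f.EMPLOYEE_SK = :new_sk
 WHERE f.EMPLOYEE_SK = :old_sk
   AND f.TRANSACTION_DATE >= :retro_eff_date;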
Design Approaches
Unlike late arriving dimensions, late arriving fact records can be handled relatively easily. When loading the fact record, the associated dimension table history has to be searched to find the surrogate key that was effective at the time the transaction occurred. The data flow below describes the late arriving fact design approach.
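In SQL terms, the dimension lookup at fact load time amounts to something like the following sketch; CUSTOMER_DIM and the column names are assumptions for illustration.

-- For a late arriving fact, pick the dimension row whose effective window covers the transaction date.
SELECT d.CUSTOMER_SK
  FROM CUSTOMER_DIM d
 WHERE d.CUSTOMER_ID     = :fact_customer_id      -- natural key carried by the fact record
   AND :transaction_date >= d.EFF_START_DATE
   AND :transaction_date <  d.EFF_END_DATE;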
Hope you enjoyed this article and that it gave you some new insights into late arriving dimension and fact scenarios in a data warehouse. Leave us your questions and comments. We would also like to hear how you have handled late arriving dimensions and facts in your data warehouse.
SOFT and HARD Deleted Records and Change Data Capture in Data Warehouse
Johnson Cyriac Dec 8, 2013 DW Design | ETL Design
In a couple of prior articles we spoke about change data capture, different techniques to capture change data, and a change data capture framework. In this article we will dive deeper into different aspects of change data in a data warehouse, including soft and hard deletions in source systems.
NEW transactions happened at source.
CORRECTIONS happened on old transactional values or measured values.
INVALID transactions removed from source.
Usually our ETL takes care of the first and second cases (insert/update logic); the third change is not captured in the DW unless it is specifically called out in the requirement specification. But when it is explicitly required, we need to devise convenient ways to track the transactions that were removed, i.e., to track the deleted records at source and accordingly DELETE those records in the DW. One thing to make clear: purging might be enabled at your OLTP, i.e. the OLTP keeps data only for a fixed historical period of time, but that is a different scenario. Here we are more interested in what was DELETED at source because the transaction was NOT valid.
1. When the DW table load nature is 'Truncate & Load' or 'Delete & Reload', there is no impact, since the requirement is to keep an exact snapshot of the source table at any point of time.
2. When the DW table does not track history on data changes and deletes are allowed against the source table: if a record is deleted in the source table, it is also deleted in the DW.
3. When the DW table tracks history on data changes and deletes are allowed against the source table: the DW table will retain the record that has been deleted in the source system, but this record will either be expired in the DW based on the change capture date or a 'Soft Delete' will be applied against it.
Logical Delete :- In this case, we have a specific flag in the source table, such as STATUS, which holds the values ACTIVE or INACTIVE. Some OLTPs keep the flag with the values I, U or D, where D means that the record is deleted, i.e. the record is INACTIVE. This approach is quite safe and is also known as Soft DELETE.
Physical Delete :- In this case the records related to invalid transactions are fully deleted from the source table by issuing DML statements. This is usually done after thorough discussion with the business users, and the related business rules are strictly followed. This is also known as Hard DELETE.
If only ACTIVE records are supposed to be used in ETL processing, we need to add specific filters while fetching source data. Sometimes INACTIVE records are pulled into the DW and carried through to the ETL Data Warehouse level; while pushing the data into the Exploration Data Warehouse, only the ACTIVE records are sent for reporting purposes. For Hard DELETEs, if an audit table is maintained at the source system recording which transactions were deleted, we can source it, i.e. join the audit table and the source table on the NK and logically delete those records in the DW too. But it becomes quite cumbersome and costly when no record is kept of what was deleted at all. In those cases, we need to use different ways to track the deletions and update the corresponding records in the DW.
If we have DELETION enabled for Dimensions in the DW, it is always safe to keep a copy of the OLD record in some AUDIT table, as it helps to track any defects in the future. A simple DELETE trigger should work fine; since DELETION hardly happens, this trigger would not degrade performance much. Let's take the ORDERS table into consideration. Along with it, we can have a history table for ORDERS, e.g. ORDERS_Hist, which would store the DELETED records from ORDERS.
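A minimal Oracle trigger sketch for this audit pattern is shown below; the ORDERS column list and the audit fields (DELETED_DATE, DELETED_BY) are assumed for illustration.

-- Archive every deleted ORDERS row into ORDERS_Hist along with audit fields.
CREATE OR REPLACE TRIGGER TRG_ORDERS_DELETE
BEFORE DELETE ON ORDERS
FOR EACH ROW
BEGIN
  INSERT INTO ORDERS_Hist (ORDER_ID, CUSTOMER_ID, ORDER_AMOUNT, DELETED_DATE, DELETED_BY)
  VALUES (:OLD.ORDER_ID, :OLD.CUSTOMER_ID, :OLD.ORDER_AMOUNT, SYSDATE, USER);
END;
/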
The AUDIT fields will convey when a particular record was deleted and by which user. But such a table needs to be created for each and every DW table where we want to keep an audit of what was DELETED. If the entire record is not needed and only the fields involved in the Natural Key (NK) are sufficient, we can have a single consolidated table for all the Dimensions.
Here the Record_IDENTIFIER field contains the values of all the columns involved in the Natural Key (NK), separated by '#', of the table mentioned in the OBJECT_NAME field. Sometimes we face a situation in the DW where a FACT table record contains a Surrogate Key (SK) from a Dimension but the Dimension table no longer owns it. In those cases the FACT table record becomes an orphan, and it will hardly appear in any report, since we always use an INNER JOIN between Dimensions and the Fact while retrieving data in the reporting layer, and the orphan row breaks Referential Integrity (RI). Suppose we want to track the orphan records of the SALES Fact table with respect to the Product Dimension. We can use a query like the one below.
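The original post showed this query as a screenshot; a hedged reconstruction, with assumed table and column names, could look like this.

-- Fact rows whose PRODUCT_SK no longer exists in the dimension (orphan records).
SELECT f.*
  FROM SALES_FACT f
 WHERE NOT EXISTS (SELECT 1
                     FROM PRODUCT_Dimension d
                    WHERE d.PRODUCT_SK = f.PRODUCT_SK);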
The above query will give only the orphan records, but it certainly cannot give you the records DELETED from the PRODUCT_Dimension. One feasible solution is to populate an EVENT table with the SKs from the PRODUCT_Dimension that are being DELETED, provided we do not reuse our Surrogate Keys. When we have both the SKs and the NKs of the DELETED PRODUCT_Dimension entries in the EVENT table, we can achieve better control over the data warehouse data. Another useful but rarely used approach is enabling auditing of DELETEs on a table in an Oracle DB using a statement like the following.
Audit DELETE on SCHEMA.TABLE;

The table DBA_AUDIT_STATEMENT will contain the details related to the deletion, for example the user who issued it, the exact DML statement and so on, but it cannot give you the record that was deleted. Since this approach cannot directly tell you which record was deleted, it is not so useful in our current discussion, so I will leave the topic here.
1. Records are DELETED from SOURCE for a known Time Period, no Audit Trail was kept.
In this case, the ideal solution is to DELETE the entire record set in the DW target table for that time period and pull the source records once again. This brings the DW in sync with the source, and the DELETED records will no longer be available in the DW.
Usually the time period is expressed in terms of Ship_DATE, Invoice_DATE or Event_DATE, i.e. a DATE type field from the actual dataset of the source table, so the same WHERE clause we use to filter the extraction from the source table can also be applied to the DW table. Obviously, in this case we are NOT able to capture the 'Hard DELETE' from the source, i.e. we cannot track the history of the data, but we are at least able to bring the source and the DW in sync. Again, this approach is recommended only when the situation occurs once in a while and not on a regular basis.
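A sketch of this delete-and-reload pattern, using assumed table and column names, is shown below.

-- Re-sync the DW table with source for a known time window.
DELETE FROM DW_SALES_FACT
 WHERE INVOICE_DATE BETWEEN :window_start AND :window_end;

INSERT INTO DW_SALES_FACT (INVOICE_ID, INVOICE_DATE, SALES_AMOUNT)
SELECT s.INVOICE_ID, s.INVOICE_DATE, s.SALES_AMOUNT
  FROM SRC_SALES s
 WHERE s.INVOICE_DATE BETWEEN :window_start AND :window_end;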
2. Records are DELETED from SOURCE on a regular basis with NO timeframe, no Audit Trail was kept.
The possible solution in this case is to implement a FULL OUTER JOIN between the source and the target table, joined on the fields involved in the Natural Key (NK). This approach helps us track all three kinds of changes to source data in one shot. The logic can be better explained with the help of a Venn diagram.
Records that have values for the NK fields only from the source and not from the target go to the INSERT flow; these are new records coming from the source. Records that have values for the NK fields from both the source and the target go to the UPDATE flow; these are already existing source records. Records that have values for the NK fields only from the target go to the DELETE flow; these are the records that were DELETED from the source table.
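A SQL sketch of this classification is given below; SRC_ORDERS, DW_ORDERS and ORDER_ID (the NK) are assumed names for illustration.

-- Classify rows as INSERT / UPDATE / DELETE using a full outer join on the natural key.
SELECT CASE
         WHEN t.ORDER_ID IS NULL THEN 'INSERT'   -- present only in source
         WHEN s.ORDER_ID IS NULL THEN 'DELETE'   -- present only in target
         ELSE 'UPDATE'                           -- present in both
       END              AS CHANGE_FLAG,
       s.ORDER_ID       AS SRC_ORDER_ID,
       t.ORDER_ID       AS TGT_ORDER_ID
  FROM SRC_ORDERS s
  FULL OUTER JOIN DW_ORDERS t
    ON s.ORDER_ID = t.ORDER_ID;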
What we do with those DELETED source records, i.e. apply a 'Soft DELETE' or a 'Hard DELETE' in the DW, depends on our requirement specification and business scenarios. But this approach has a severe disadvantage in terms of ETL performance: a FULL OUTER JOIN between source and target uses the entire data set from both ends, which will obviously hamper the ETL as data volume increases.
When some old transactions become invalid, the source team sends the related records to the DW again, but with inverted measures, i.e. the sales figures are the same as the old ones but negative. The DW then contains both the old set of records and the newly arrived records, and the aggregated measures net to zero in the aggregated FACT table, nullifying the impact of those invalid transactions in the DW. The only disadvantage of this approach is that, while the aggregated FACT contains the correct data at the summarized level, the transactional FACT carries a dual set of records which together represent the real scenario, i.e. at first the transaction happened (the older record) and then it became invalid (the newer record).
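A small SQL illustration of how the reversal records net out at the aggregate level; table and column names are assumed.

-- The original record (+100) and its reversal (-100) both stay in the transactional fact,
-- but they cancel out when the aggregate fact is built.
SELECT ORDER_ID, SUM(SALES_AMOUNT) AS NET_SALES
  FROM SALES_FACT
 GROUP BY ORDER_ID;      -- an invalidated order nets to 0 in the aggregate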
Hope you enjoyed this article and that it gave you some new insights into change data capture in a data warehouse. Leave us your questions and comments. We would like to hear how you have handled change data capture in your data warehouse.
A Surrogate Key is a sequentially generated unique number attached to each and every record in a Dimension table in a data warehouse. We discussed Surrogate Keys in detail in our previous article. In this article we will concentrate on different approaches to generating Surrogate Keys for different types of ETL processes.
The NEXTVAL port from the Sequence Generator can be mapped to the surrogate key in the target table. Below shown is the Sequence Generator transformation.
Note : Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings which load the same dimension table.
The schema name (DW) and sequence name (Customer_SK) can be passed in as input values for the transformation, and the output can be mapped to the target SK column. Below shown is the SQL transformation image.
Create a database function as below. Here we are creating an Oracle function.

CREATE OR REPLACE FUNCTION DW.Customer_SK_Func
RETURN NUMBER
IS
  Out_SK NUMBER;
BEGIN
  SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;
  RETURN Out_SK;
EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20001, 'An error was encountered - ' || SQLCODE || ' -ERROR- ' || SQLERRM);
END;

You can import the database function as a stored procedure transformation as shown in the below image.
Now, just before the target instance of the Insert flow, we add an Expression transformation and create an output port there with the formula shown in the image. This output port, GET_SK, can be connected to the target surrogate key column.
Note : The database function can be parameterized and the stored procedure can also be made reusable to make this approach more effective.
For a dynamic LookUp on the target, we have the option of associating any LookUp port with an input port, output port, or Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique integer value for each inserted row in the lookup cache. This is applicable only to ports with Bigint, Integer or Small Integer data types; since an SK is usually of Integer type, we can exploit this.
The Integration Service uses the following process to generate Sequence IDs.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.
When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
Above shown is the dynamic lookup configuration to generate the SK for CUST_SK.
The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any record that is already present in the target, it gets the SK value from the target dynamic LookUp cache, based on the associated port matching. So, if we take this port and connect it to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the dynamic LookUp.
The disadvantage of this technique lies in the fact that we don't have any separate SK generating area and the source of the SK is totally embedded in the code.
CUSTOMER_ID = IN_CUSTOMER_ID
Next in the mapping, after the SQ, use an Expression transformation. Here we will be generating the SKs for the Dimension based on the previous value generated. We will create the following ports in the EXP to compute the SK value.
VAR_COUNTER = IIF(ISNULL(VAR_INC), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1)
VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER
When the mapping starts, for the first row we look up the Dimension table to fetch the maximum available SK in the table. Then we keep incrementing the SK value stored in the variable port by 1 for each incoming row. Here OUT_COUNTER will give the SKs to be populated in CUSTOMER_KEY.
In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK, which will be set to the value of MAX(SK) in the Customer Dimension table.
We will have CUSTOMER_DIM as our source table, and the target can be a simple flat file which will not be used anywhere. We pull this MAX(SK) from the SQ and then, in an EXP, assign this value to the mapping variable using the SETVARIABLE function. So, we will have the following ports in the EXP:
INP_CUSTOMER_KEY - Input port holding the MAX of SK coming from the Customer Dimension table.
OUT_MAX_SK = SETVARIABLE($$MAX_CUST_SK, INP_CUSTOMER_KEY) - Output port.
This output port will be connected to the flat file port, but the value assigned to the variable will persist in the repository. In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the value from one session to the other? Here Workflow Variables come into the picture. We define a workflow variable $$MAX_SK, and in the Post-session on success variable assignment section of s_Dummy we assign the value of $$MAX_CUST_SK to $$MAX_SK. Now the variable $$MAX_SK contains the maximum available SK value from the CUSTOMER_DIM table. Next we define another mapping variable in the session s_New_Customer as $$START_VALUE, and it is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer. So, the sequence is:
Post-session on success variable assignment of the first session: $$MAX_SK = $$MAX_CUST_SK
Pre-session variable assignment of the second session: $$START_VALUE = $$MAX_SK
Now in the actual mapping, we add an EXP with the following ports to compute the SKs one by one for each record being loaded into the target. The VAR_COUNTER port (presumably defined along the lines of IIF(ISNULL(VAR_INC), $$START_VALUE + 1, VAR_INC + 1), so that it starts from the value carried over by the workflow variable) drives the other two ports:

VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER

OUT_COUNTER will be connected to the SK port of the target.
Hope you enjoyed this article and learned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any; we are happy to hear from you.
Surrogate Key in Data Warehouse, What, When, Why and Why Not
Johnson Cyriac Nov 13, 2013 DW Design | ETL Design
Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each and every record in a Dimension table. It joins the fact and dimension tables and is necessary to handle changes in dimension table attributes.
It is UNIQUE since it is a sequentially generated integer for each record being inserted in the table. It is MEANINGLESS since it does not carry any business meaning regarding the record it is attached to. It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with one and going up to the highest number that is needed.
The below diagram shows how the FACT table is loaded from the source.
The below image shows a typical Star Schema, joining different Dimensions with the Fact using SKs.
Ralph Kimball emphasizes the abstraction of the NK. According to him, Surrogate Keys should NOT be:
Smart, where you can tell something about the record just by looking at the key.
Composed of natural keys glued together.
Implemented as multiple parallel joins between the dimension table and the fact table; so-called double- or triple-barreled joins.
As per Thomas Kejser, a good key is a column that has the following properties:
It is forced to be unique.
It is small.
It is an integer.
Once assigned to a row, it never changes.
Even if deleted, it is never re-used to refer to a new row.
It is a single column.
It is stupid.
It is not intended to be remembered by users.
If the above mentioned features are taken into account, SK would be a great candidate for a Good Key in a DW.
Apart from these, few more reasons for choosing this SK approach are:
If we replace the NK with a single integer, we can save a substantial amount of storage space. The SKs of different Dimensions are stored as Foreign Keys (FK) in the Fact tables to maintain Referential Integrity (RI), and storing concise SKs instead of big NKs needs less space.
The UNIQUE indexes built on the SK take less space than UNIQUE indexes built on an NK, which may be alphanumeric.
Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns works faster. So it provides an extra edge in ETL performance by speeding up data retrieval and lookups.
A four-byte integer key can represent more than 2 billion different values, which is enough for any dimension; the SK will not run out of values, not even for a big or monster Dimension.
The SK is usually independent of the data contained in the record, so we cannot understand anything about the data in a record simply by looking at the SK. Hence it provides data abstraction.
So, apart from the abstraction of critical business data involved in the NK, we have the advantage of reduced storage space when we implement SKs in our DW. It has become a standard practice to associate an SK with a table in the DW irrespective of whether it is a Dimension, Fact, Bridge or Aggregate table.
The values of SKs have no relationship with the real-world meaning of the data held in a row. Therefore over-usage of SKs leads to the problem of disassociation.
The generation and attachment of SKs creates an extra ETL burden. Sometimes the actual piece of code is short and simple, but generating the SK and carrying it forward to the target adds overhead to the code.
During horizontal Data Integration (DI), where multiple source systems load data into a single Dimension, we have to maintain a single SK generating area to enforce the uniqueness of the SK. This can be an extra overhead on the ETL.
Query optimization also becomes difficult: since the SK takes the place of the PK, the unique index is applied on that column, and any query based on the NK leads to a Full Table Scan (FTS) because it cannot take advantage of the unique index on the SK.
Replication of data from one environment to another, i.e. data migration, becomes difficult: since SKs from different Dimension tables are used as FKs in the Fact table and SKs are DW specific, any mismatch in the SK for a particular Dimension results in no data or erroneous data when we join them in a Star Schema.
If duplicate records come from the source, there is a potential risk of duplicates being loaded into the target, since the unique constraint is defined on the SK and not on the NK.
An SK should not be implemented just in the name of standardizing your code. An SK is required when we cannot use an NK to uniquely identify a record, or when an SK seems more suitable because the NK is not a good fit for the PK.
Reference : Ralph Kimball, Thomas Kejser
Informatica has developed a solution that leverages the power of grid computing for greater data integration scalability and performance. The grid option delivers load balancing, dynamic partitioning, parallel processing and high availability to ensure optimal scalability, performance and reliability. In this article let's discuss how to set up an Informatica workflow to run on a grid.
Domain : A PowerCenter domain consists of one or more nodes in the grid environment. PowerCenter services run on the nodes. A domain is the foundation for PowerCenter service administration.
Node : A node is a logical representation of a physical machine that runs a PowerCenter service.
You can set up the workflow to run on a grid as shown in the below image. You can assign the Integration Service, which is configured on the grid, to run the workflow on the grid.
You can set up the session to run on a grid as shown in the below image.
Load Balancing : While facing spikes in data processing, load balancing guarantees smooth operations by switching data processing between nodes on the grid. The node is chosen dynamically based on process size, CPU utilization, memory requirements etc.
High Availability : The grid complements the High Availability feature of PowerCenter by switching the master node in case of a node failure. This ensures monitoring and shortens the time needed for recovery processes.
Dynamic Partitioning : Dynamic Partitioning helps make the best use of the currently available nodes on the grid. By adapting to available resources, it also helps increase the performance of the whole ETL process.
Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.
When your data warehouse sources data from data sources in multiple time zones, it is recommended to capture a universal standard time as well as the local times. The same goes for transactions involving multiple currencies. This design enables analysis on the local time along with the universal standard time. The time standardization is done as part of the ETL which loads the warehouse. In this article let's discuss the implementation using Informatica PowerCenter.
We will concentrate only on the ETL part of time zone conversion and standardization, but not the data modeling part. You can learn more about the dimensional modeling aspect from Ralph Kimball.
In the expression transformation, you can create below ports and the corresponding expressions. Be sure to have the ports created in the same order, data type and precision in the transformation.
LOC_TIME_WITH_TZ : STRING(36) (Input)
DATE_TIME : DATE/TIME (Variable)
TZ_DIFF : INTEGER (Variable)
TZ_DIFF_HR : INTEGER (Variable)
TZ_DIFF_MI : INTEGER (Variable)
GMT_TIME_HH : DATE/TIME (Variable)
GMT_TIME_MI : DATE/TIME (Variable)
GMT_TIME_WITH_TZ : STRING(36) (Output)

DATE_TIME : TO_DATE(SUBSTR(LOC_TIME_WITH_TZ,0,29),'DD-MON-YY HH:MI:SS.US AM')
TZ_DIFF : IIF(SUBSTR(LOC_TIME_WITH_TZ,30,1)='+',-1,1)
TZ_DIFF_HR : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,31,2))
TZ_DIFF_MI : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,34,2))
GMT_TIME_HH : ADD_TO_DATE(DATE_TIME,'HH',TZ_DIFF_HR*TZ_DIFF)
GMT_TIME_MI : ADD_TO_DATE(GMT_TIME_HH,'MI',TZ_DIFF_MI*TZ_DIFF)
GMT_TIME_WITH_TZ : TO_CHAR(GMT_TIME_MI,'DD-MON-YYYY HH:MI:SS.US AM') || ' +00:00'
Note : The expression is based on the timestamp format 'DD-MON-YYYY HH:MI:SS.FF AM TZH:TZM'. If you are using a different Oracle timestamp format, this expression might not work. Below is the expression transformation with the expressions added.
The reusable transformation can be used in any Mapping, which needs the time zone conversion. Below shown is the completed expression transformation.
You can see a sample output data generated by expression as shown in below image.
Expression Usage
This reusable transformation takes one input port and gives one output port. The input port should be a date timestamp with time zone information. Below shown is a mapping using this reusable transformation.
Note : Timestamp with time zone is processed as STRING(36) data type in the mapping. All the transformations should use STRING(36) data type. Source and target should use VARCHAR2(36) data type.
Download
You can download the reusable expression we discussed in this article. Click here for the download link. Hope this tutorial was helpful and useful for your project. Please leave your questions and comments; we will be more than happy to help you.
In our performance tuning article series, so far we have covered the performance tuning basics, identification of bottlenecks and resolving different bottlenecks. In this article we will cover different performance enhancement features available in Informatica PowerCenter. In addition to the features provided by PowerCenter, we will go over design tips and tricks for ETL load performance improvement.
1. Pushdown Optimization
Pushdown Optimization Option enables data transformation processing, to be pushed down into any relational database to make the best use of database processing power. It converts the transformation logic into SQL statements, which can directly execute on database. This minimizes the need of moving data between servers and utilizes the power of database engine.
2. Session Partitioning
The Informatica PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing. Partitioning option will let you split the large data set into smaller subsets which can be processed in parallel to get a better session performance.
4. Concurrent Workflows
A concurrent workflow is a workflow that can run as multiple instances concurrently. A workflow instance is a representation of a workflow. We can configure two types of concurrent workflows: concurrent workflows with the same instance name, or unique workflow instances that run concurrently.
5. Grid Deployments
When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability. When you run a session on a grid, the Integration Service distributes session threads to multiple DTM processes on nodes in the grid to increase performance and scalability.
Hope you enjoyed these tips and tricks and that they are helpful for your project needs. Leave us your questions and comments. We would like to hear any other performance tips you might have used in your projects.
When an Informatica PowerCenter workflow runs on a grid, the Integration Service distributes workflow tasks across nodes in the grid. It also distributes Session, Command, and predefined Event-Wait tasks within workflows across the nodes in the grid. PowerCenter uses the Load Balancer to distribute workflows and session tasks to different nodes. This article describes how to use the Load Balancer to set up workflow priorities and how to allocate resources.
Assign service levels : You assign service levels to workflows. Service levels establish priority among workflow tasks that are waiting to be dispatched.
Assign resources : You assign resources to tasks. Session, Command, and predefined Event-Wait tasks require PowerCenter resources to succeed. If the Integration Service is configured to check resources, the Load Balancer dispatches these tasks to nodes where the resources are available.
The Load Balancer dispatches high priority tasks before low priority tasks. You create service levels and configure the dispatch priorities in the Administrator tool. You give a higher service level to the workflows that need to be dispatched first when multiple workflows are running in parallel. Service levels are set up in the Admin console, and you assign them to workflows on the General tab of the workflow properties as shown below.
The below configuration shows that the source qualifier needs a source file from the file directory NDMSource, which is accessible only from one node. Available resources on the different nodes are configured from the Admin console.
Hope you enjoyed this article and that it will help you prioritize your workflows to meet your data refresh timelines. Please leave us a comment or feedback if you have any; we are happy to hear from you.
Quite often we deal with ETL logic that is very dynamic in nature, such as a discount calculation which changes every month, or special weekend-only logic. There is a lot of practical difficulty in moving such frequent ETL changes into the production environment. The best option to deal with this dynamic scenario is parameterization. In this article let's discuss how we can make ETL calculations dynamic.
The sales department wants to build a monthly sales fact table. The fact table needs to be refreshed after the month end closure. Sales commission is one of the fact table data elements, and its calculation is dynamic in nature: it is a factor of sales, sales revenue or net sales. The Sales Commission calculation can be:
1. Sales Commission = Sales * 18 / 100
2. Sales Commission = Sales Revenue * 20 / 100
3. Sales Commission = Net Sales * 20 / 100
Note : The expression can be as complex as the business requirement demands.
The calculation to be used by the month end ETL will be decided by the Sales Manager before the month end ETL load.
Mapping Configuration
Now that we understand the use case, let's build the mapping logic.
Here we will be building the dynamic sales commission calculation logic with the help of a mapping variable. The changing expression for the calculation will be passed into the mapping using a session parameter file.
Step 1 : As the first step, Create a mapping variable $$EXP_SALES_COMM and set the isExpVar property TRUE as shown in below image.
Note : Precision for the mapping variable should be big enough to hold the whole expression. Step 2 : In an expression transformation, create an output port and provide the mapping variable as the expression. Below shown is the screenshot of expression transformation.
Note : All the ports used in the expression $$EXP_SALES_COMM should be available as an input or input/output port in the expression transformation.
Workflow Configuration
In the workflow configuration, we will create the parameter file with the expression for Sales Commission and set up in the session.
Step 1 : Create the session parameter file with the expression for the Sales Commission calculation, with the below details.

[s_m_LOAD_SALES_FACT]
$$EXP_SALES_COMM=SALES_REVENUE*20/100

Step 2 : Set the parameter in the session properties as shown below.
With that we are done with the configuration. You can update the expression in the parameter file whenever a change is required in the sales commission calculation. This clearly eliminates the need for an ETL code change. Hope you enjoyed this article; please leave us a comment or feedback if you have any, we are happy to hear from you.
Informatica HTTP Transformation, The Interface Between ETL and Web Services
Johnson Cyriac Sep 30, 2013 Transformations
In a mature data warehouse environment, you will see all sorts of data sources, like Mainframe, ERP, Web Services, Machine Logs, Message Queues, Hadoop etc. Informatica has provided a variety of connectors to get data extracted from such data sources. Using the Informatica HTTP transformation, you can make Web Service calls and get data from web servers. We will explain this transformation in this article with a use case.
1. Read data from an HTTP server :- It retrieves data from the HTTP server and passes the data to a downstream transformation in the mapping.
2. Update data on the HTTP server :- It posts data to the HTTP server and passes HTTP server responses to a downstream transformation in the mapping.
Output. Contains data from the HTTP response. Passes responses from the HTTP server to downstream transformations.
Input. Used to construct the final URL for the GET method or the data for the POST request.
Header. Contains header data for the request and response.
In the above shown image, we have two input ports for the GET method and the response from the server as the output port
Configuring a URL
The web service will be accessed using a URL and the base URL of the web service need to be provided in the transformation. The Designer constructs the final URL for the GET method based on the base URL and port names in the input group. In the above shown image, you can see the base url and the constructed URL, which includes the query parameters. This web service call is to get the currency conversion and we are passing two parameters to the base url, "from" and "to" currency.
Solution : In the ETL process, let us use a web service call to get the real-time currency conversion rate and convert the foreign currency to USD. We will use the HTTP Transformation to call the web service.
For the demo, we will concentrate only on the HTTP transformation. We will be using the web service from http://rate-exchange.appspot.com/ for the demonstration. This web service takes two parameters, "from currency" and "to currency", and returns a JSON document with the exchange rate information.
http://rate-exchange.appspot.com/currency?from=USD&to=EUR

Step 1 :- Create the HTTP Transformation like any other transformation in the mapping designer. We need to configure the transformation for the GET HTTP method to access currency conversion data. Below shown is the configuration.
Step 2 :- Create two input ports as shown in the below image. The ports need to be of string data type and the port names should match the URL parameter names.
Step 3 :- Now you can provide the base URL for the web service and the designer will construct the complete URL with the parameters included.
Step 4 :- The output from the HTTP transformation will look similar to what is given below.
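The original screenshot is not reproduced here; the raw response from this web service is a small JSON string along the lines of the sample below, where the rate value is purely illustrative.

{"to": "EUR", "rate": 0.7395, "from": "USD"}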
Finally, you can plug in the transformation into the mapping as shown in below image. Parse the output from HTTP Transformation in an expression transformation and do the calculation to convert the currency to USD.
Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out the HTTP transformation, or share with us any different use cases you want to implement using the HTTP transformation.
Informatica SQL Transformation, SQLs Beyond Pre & Post Session Commands
Johnson Cyriac Sep 24, 2013 Transformations
SQL statements can be used as part of pre- or post-SQL commands in a PowerCenter workflow. These are static SQLs and can run only once, before or after the mapping pipeline is run. With the help of the SQL transformation, we can use SQL statements much more effectively to build ETL logic. In this tutorial let's learn more about the transformation and its usage with a real-time use case.
Script mode :- Runs SQL scripts from text files that are externally located. You pass a script name to the transformation with each input row. It outputs the script execution status and any script error.
Query mode :- Executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries. You can output multiple rows when the query has a SELECT statement.
Script Mode
An SQL transformation running in script mode runs SQL scripts from text files. It creates an SQL procedure and sends it to the database to process. The database validates the SQL and executes the query. You cannot use scripting languages such as Oracle PL/SQL or Microsoft/Sybase T-SQL in the script.
In script mode, you pass the script file name with the complete path from the source to the SQL transformation ScriptName port. The ScriptResult port gives the status of the script execution; it will be either PASSED or FAILED. ScriptError returns errors that occur when a script fails for a row.
Above shown is an SQL transformation in Script Mode, which has a ScriptName input and ScriptResult, ScriptError as outputs.
Query Mode
When SQL transformation runs in query mode, it executes an SQL query defined in the transformation. You can pass strings or parameters to the query from the transformation input ports to change the SQL query statement or the query data. The SQL query can be static or dynamic.
Static SQL query :- The query statement does not change, but you can use query parameters to change the data, which is passed in through the input ports of the transformation. Dynamic SQL query :- You can change the query statements and the data, which is passed in through the input ports of the transformation.
With static query, the Integration Service prepares the SQL statement once and executes it for each row. With a dynamic query, the Integration Service prepares the SQL for each input row.
Above shown SQL transformation, which runs in query mode has two input parameters and returns one output.
Lets consider the ETL for loading Dimension tables into a data warehouse. The surrogate key for each of the dimension tables are populated using an Oracle Sequence. The ETL architect needs to create an Informatica reusable component, which can be reused in different dimension table loads to populate the surrogate key.
Solution : Let's create a reusable SQL transformation in Query mode, which can take the name of the Oracle sequence generator and pass the sequence number as the output.

Step 1 :- Once you have the Transformation Developer open, you can start creating the SQL transformation like any other transformation. It opens up a window as shown in the below image.
This screen will let you choose the mode, database type, database connection type and you can make the transformation active or passive. If the database connection type is dynamic, you can dynamically pass in the connection details into the transformation. If the SQL query returns more than one record, you need to make the transformation active.
Step 2 :- Now create the input and output ports as shown in the below image. We are passing in the database schema name and the sequence name. It returns the sequence number as an output port.
Step 3 :- Using the SQL query editor, we can build the query that calls the sequence generator. Using the 'String Substitution' ports we can make the SQL dynamic; here the schema name and sequence name are passed in dynamically as input ports.
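The query built in the editor would look something like the sketch below, assuming the two string-substitution ports are named SCHEMA_NAME and SEQ_NAME (the actual port names come from the Step 2 screenshot).

-- ~port~ is the SQL transformation's string-substitution syntax; port names are assumed.
SELECT ~SCHEMA_NAME~.~SEQ_NAME~.NEXTVAL FROM DUAL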
That is all we need for the reusable SQL transformation. Below shown is the completed SQL transformation, which can take two input values (schema name, sequence name) and returns one output value (sequence number).
Step 4 :- We can use this transformation just like any other reusable transformations, Need to pass in the schema name, sequence name as input ports and returns sequence number, which can be used to populate the surrogate key of the dimension table as shown below.
As per the above example, the Integration Service will convert the SQL as follows at session runtime.

SELECT DW.S_CUST_DIM.NEXTVAL FROM DUAL;
Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying it out, or share with us any different use cases you want to implement using the SQL transformation.
Java is one of the most popular programming languages in use, particularly for client-server web applications. With the introduction of the PowerCenter Java Transformation, ETL developers can get their feet wet with Java programming and leverage the power of Java. In this article let's learn more about the Java Transformation, its components and its usage with the help of a use case.
With the Java transformation you can define transformation logic in Java without advanced knowledge of the Java programming language or an external Java development environment.
The PowerCenter Client uses the Java Development Kit (JDK) to compile the Java code and generate byte code for the transformation. The PowerCenter Client stores the byte code in the PowerCenter repository. When the Integration Service runs a session with a Java transformation, the Integration Service uses the Java Runtime Environment (JRE) to execute the byte code and process input rows and generate output rows.
Below image shows different code entry tabs under 'Java Code'.
Import Packages :- Import third-party Java packages, built-in Java packages, or custom Java packages.
Helper Code :- Define variables and methods available to all tabs except Import Packages. After you declare variables and methods on the Helper Code tab, you can use the variables and methods on any code entry tab except the Import Packages tab.
On Input Row :- Define transformation behavior when it receives an input row. The Java code in this tab executes one time for each input row.
On End of Data :- Use this tab to define transformation logic after it has processed all input data.
On Receiving Transaction :- Define transformation behavior when it receives a transaction notification. You can use this only with active Java transformations.
Java Expressions :- Define Java expressions to call PowerCenter expressions. You can use this in multiple code entry tabs.
Step 1 :- Pull the source and source qualifier into the mapping, add the Java Transformation, and create input and output ports as shown in the below image. Just like any other transformation, you can drag and drop ports from other transformations to create new ports.
Step 2 :- Now move to the 'Java Code' tab, and from the 'Import Packages' tab import the external Java classes required by the Java code. This tab can be used to import any third-party or built-in Java classes.
As shown in the above image, here is the import code used.

import java.util.Map;
import java.util.HashMap;

Step 3 :- In the 'Helper Code' tab, define the variables, objects and functions required by the Java code, which will be written in 'On Input Row'. Here we have declared four fields.
private static Map<Integer, String> empMap = new HashMap<Integer, String>();
private static Object lock = new Object();
private boolean generateRow;
private boolean isRoot;

Step 4 :- In the 'On Input Row' tab, define the ETL logic, which will be executed for every input record.
Below is the complete code we need to place in the 'On Input Row' tab.

generateRow = true;
isRoot = false;

if (isNull("EMP_ID_INP") || isNull("EMP_NAME_INP"))
{
    incrementErrorCount(1);
    generateRow = false;
}
else
{
    EMP_ID_OUT = EMP_ID_INP;
    EMP_NAME_OUT = EMP_NAME_INP;
}

if (isNull("EMP_DESC_INP"))
{
    setNull("EMP_DESC_OUT");
}
else
{
    EMP_DESC_OUT = EMP_DESC_INP;
}

boolean isParentEmpIdNull = isNull("EMP_PARENT_EMPID");

if (isParentEmpIdNull)
{
    isRoot = true;
    logInfo("This is the root for this hierarchy.");
    setNull("EMP_PARENT_EMPNAME");
}

synchronized (lock)
{
    if (!isParentEmpIdNull)
        EMP_PARENT_EMPNAME = (String) (empMap.get(new Integer(EMP_PARENT_EMPID)));
    empMap.put(new Integer(EMP_ID_INP), EMP_NAME_INP);
}

if (generateRow)
    generateRow();

With this we are done with the coding required in the Java Transformation; only code compilation is left. The remaining tabs in this Java transformation do not need any code for our use case.
Completed Mapping
The remaining tabs do not need any code for our use case, and the Java transformation ports can be connected from the source qualifier on the input side and to the target on the output side. Below shown is the completed structure of the mapping.
Hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out this Java code and the Java transformation, or share with us any different use cases you want to implement using the Java transformation.
In our previous article in the performance tuning series, we covered the basics of the Informatica performance tuning process and the session anatomy. In this article we will cover the methods to identify different performance bottlenecks. Here we will use session thread statistics, session performance counters and Workflow Monitor properties to help us understand the bottlenecks.
Run Time : Amount of time the thread runs.
Idle Time : Amount of time the thread is idle. Includes the time the thread waits for other thread processing.
Busy Time : Percentage of the run time. It is (run time - idle time) / run time x 100.
Thread Work Time : The percentage of time taken to process each transformation in a thread.
Note : Session Log file with normal tracing level is required to get the thread statistics.
If you read it closely, you will see the reader, transformation and writer threads, how much time is spent on each thread and how busy each thread is. In addition to that, the transformation thread shows how busy each transformation in the mapping is.
The total run time for the transformation thread is 506 seconds and the busy percentage is 99.7%. This means the transformation thread was never idle for the 506 seconds. The reader and writer busy percentages were significantly smaller, about 9.6% and 24%. In this session, the transformation thread is the bottleneck in the mapping.
To determine which transformation in the transformation thread is the bottleneck, view the busy percentage of each transformation in the thread work time breakdown. The transformation RTR_ZIP_CODE had a busy percentage of 53%. Hint : Thread with the highest busy percentage is the bottleneck.
All transformations have counters to help measure and improve performance of the transformations. Analyzing these performance details can help you identify session bottlenecks. The Integration Service tracks the number of input rows, output rows, and error rows for each transformation.
A non-zero count for Errorrows indicates you should eliminate the transformation errors to improve performance.
Errorrows : Transformation errors impact session performance. If a transformation has large numbers of error rows in any of the Transformation_errorrows counters, you should eliminate the errors to improve performance.
Readfromdisk and Writetodisk : If these counters display any number other than zero, you can increase the cache sizes to improve session performance.
Readfromcache and Writetocache : Use these counters to analyze how the Integration Service reads from or writes to cache.
Rowsinlookupcache : Gives the number of rows in the lookup cache. To improve session performance, tune the lookup expressions for the larger lookup tables.
Message: WARNING: Insufficient number of data blocks for adequate performance. Increase DTM buffer size of the session. The recommended value is xxxx.
CPU% : The percentage of CPU usage includes other external tasks running on the system. A high CPU usage indicates that the server needs additional processing power.
Memory Usage : The percentage of memory usage includes other external tasks running on the system. If the memory usage is close to 95%, check whether the tasks running on the system are using the amount indicated in the Workflow Monitor or whether there is a memory leak. To troubleshoot, use system tools to check the memory usage before and after running the session and then compare the results to the memory usage while running the session.
Swap Usage : Swap usage is a result of paging due to possible memory leaks or a high number of concurrent tasks.
Informatica PowerCenter session partitioning can be used effectively for parallel data processing to achieve faster data delivery. Parallel data processing performance depends heavily on the additional hardware power available. In addition to that, it is important to choose the appropriate partitioning algorithm or partition type. In this article let's discuss the optimal session partition settings.
Pass-through Partition
A pass-through partition at the source qualifier transformation is used to split the source data into three different parallel processing data sets. Below image shows how to setup pass through partition for three different sales regions.
Once the partition is set up at the source qualifier, you get an additional Source Filter option to restrict the data that corresponds to each partition. Be sure to provide the filter conditions such that the same data is not processed through more than one partition and data is not duplicated. The below image shows three additional Source Filters, one per partition.
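For example, with three partitions split by sales region, the three Source Filter conditions might look like the sketch below; SALES_REGION and the region values are assumed names, not the ones in the screenshot.

-- Partition #1
SALES_REGION = 'NORTH'
-- Partition #2
SALES_REGION = 'SOUTH'
-- Partition #3
SALES_REGION = 'WEST'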
Here the target table is range partitioned on product line. Create a range partition on target definition on PRODUCT_LINE_ID port to get the best write throughput.
Below images shows the steps involved in setting up the key range partition. Click on Edit Keys to define the ports on which the key range partition is defined.
A pop up window shows the list of ports in the transformation, Choose the ports on which the key range partition is required.
Now give the value start and end range for each partition as shown below.
We did not have to use Hash User Key Partition and Database Partition algorithm in the use case discussed here.
The Hash User Key partition algorithm lets you choose the ports used to group rows among partitions. This algorithm can be used in most places where the hash auto key algorithm is appropriate. The Database partition algorithm queries the database system for table partition information and reads partitioned data from the corresponding nodes in the database. This algorithm can be applied either on the source or the target definition.
Hope you enjoyed this article. Please leave your comments and feedback.
In our previous article in the performance tuning series, we covered different approaches to identify performance bottlenecks. In this article we will cover the methods to resolve different performance bottlenecks. We will talk about session memory, cache memory, source, target and mapping performance tuning techniques in detail.
Performance Tuning series: Part I : Performance tuning basics; Part II : Identify bottlenecks; Part III : Remove bottlenecks; Part IV : Performance Enhancements.
Not having enough buffer memory for the DTM process can slow down reading, transforming or writing and cause large fluctuations in performance. Adding extra memory blocks can keep the threads busy and improve session performance. You can do this by adjusting the buffer block size and DTM buffer size. Note : You can identify a DTM buffer bottleneck from the session log file; check here for details.
To identify the optimal buffer block size, sum up the precision of the individual source and target columns. The largest precision among all the sources and targets should be the buffer block size for one row. Ideally, a buffer block should accommodate at least 100 rows at a time.
You can change the buffer block size in the session configuration as shown in below image.
Session Buffer Blocks = (total number of sources + total number of targets) * 2
DTM Buffer Size = Session Buffer Blocks * Buffer Block Size / 0.9
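As a worked example with purely illustrative numbers: a mapping with 1 source and 2 targets needs (1 + 2) * 2 = 6 session buffer blocks; with a buffer block size of 64,000 bytes, the DTM Buffer Size works out to 6 * 64,000 / 0.9, roughly 426,667 bytes.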
You can change the DTM Buffer Size in the session configuration as shown in below image.
Note : You can examine the performance counters to determine which transformations require cache memory tuning; check here for details.
You can calculate the memory requirements for a transformation using the Cache Calculator. Below shown is the Cache Calculator for Lookup transformation.
You can update the cache size in the session property of the transformation as shown below.
Note : Target bottleneck can be determined with the help of Session Log File, check here for details.
When bulk loading, the target database does not write to its database log, which speeds up the load; without the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery.
4. Minimizing Deadlocks
Encountering deadlocks can slow session performance. You can increase the number of target connection groups in a session to avoid deadlocks. To use a different target connection group for each target in a session, use a different database connection name for each target instance.
You can also create optimizer hints to tell the database how to execute the query for a particular set of source tables.
Generally, you optimize the mapping by configuring it with the least number of transformations and expressions needed to do the required work, and by deleting unnecessary links between transformations to minimize the amount of data moved. Note : You can identify a mapping bottleneck from the session log file; check here for details.
2. Optimizing Expressions
You can also optimize the expressions used in the transformations. When possible, isolate slow expressions and simplify them.
Factoring Out Common Logic : If the mapping performs the same task in multiple places, reduce the number of times the mapping performs the task by moving the task earlier in the mapping.
Minimizing Aggregate Function Calls : When writing expressions, factor out as many aggregate function calls as possible. Each time you use an aggregate function call, the Integration Service must search and group the data. For example, SUM(COL_A + COL_B) performs better than SUM(COL_A) + SUM(COL_B) because it needs only one aggregate call (see the sketch after this list).
Replacing Common Expressions with Local Variables : If you use the same expression multiple times in one transformation, you can make that expression a local variable.
Choosing Numeric Versus String Operations : The Integration Service processes numeric operations faster than string operations. For example, if you look up large amounts of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Using Operators Instead of Functions : The Integration Service reads expressions written with operators faster than expressions with functions. Where possible, use operators to write expressions.
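The sketch below (plain Python, purely illustrative rather than Informatica expression language) shows why the single aggregate call is cheaper: for the non-null sample rows, both forms return the same total, but SUM(COL_A + COL_B) needs one pass over the data while SUM(COL_A) + SUM(COL_B) needs two.

```python
# Illustrative only: for non-null values the expressions SUM(COL_A + COL_B) and
# SUM(COL_A) + SUM(COL_B) return the same total, but the first needs one
# aggregate pass over the rows while the second needs two.
rows = [
    {"COL_A": 10, "COL_B": 5},
    {"COL_A": 20, "COL_B": 15},
    {"COL_A": 30, "COL_B": 25},
]

# One aggregate call: SUM(COL_A + COL_B) -> one pass over the rows
one_call = sum(r["COL_A"] + r["COL_B"] for r in rows)

# Two aggregate calls: SUM(COL_A) + SUM(COL_B) -> two passes over the rows
two_calls = sum(r["COL_A"] for r in rows) + sum(r["COL_B"] for r in rows)

assert one_call == two_calls == 105   # same result, different amount of work
```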
3. Optimizing Transformations
Ads not by this site
Each transformation is different, and so is the tuning it requires. In general, though, you optimize transformations by reducing their number in the mapping and deleting unnecessary links between them. Note : Tuning techniques for individual transformations will be covered in a separate article.
The Informatica Pushdown Optimization Option increases performance by providing the flexibility to push transformation processing to the most appropriate processing resource. Using Pushdown Optimization, data transformation logic can be pushed to the source database, to the target database, or processed by the PowerCenter server. This gives the ETL architect the option to choose the best of the available resources for data processing.
Performance improvement features in PowerCenter include:
Pushdown Optimization
Pipeline Partitions
Dynamic Partitions
Concurrent Workflows
Grid Deployments
Workflow Load Balancing
The Pushdown Optimization Option enables data transformation processing to be pushed down into any relational database to make the best use of database processing power. It converts the transformation logic into SQL statements, which execute directly on the database. This minimizes the need to move data between servers and utilizes the power of the database engine.
When you run a session configured for source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the database.
The Integration Service generates a SELECT statement based on the transformation logic for each transformation it can push to the database. When you run the session, the Integration Service pushes all transformation logic that is valid to push to the database by executing the generated SQL statement. Then, it reads the results of this SQL statement and continues to run the session.
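As a concrete but hypothetical illustration, suppose a pipeline reads a customer staging table, filters out inactive rows, and derives a display name in an Expression transformation. Source-side pushdown would collapse that logic into a single SELECT along the lines of the sketch below; the table and column names are invented, and the actual SQL generated by the Integration Service will differ.

```python
# Hypothetical illustration of source-side pushdown: a Filter and an Expression
# collapsed into one SELECT against the source database. Table and column names
# are invented for this example; the real SQL is generated by the Integration Service.
pushed_down_sql = """
SELECT
    CUST_ID,
    UPPER(FIRST_NAME) || ' ' || UPPER(LAST_NAME) AS CUST_NAME   -- Expression logic
FROM CUST_STAGE
WHERE ACTIVE_FLAG = 'Y'                                         -- Filter logic
"""
print(pushed_down_sql)
```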
If you run a session that contains an SQL override or lookup override, the Integration Service generates a view based on the override. It then generates a SELECT statement and runs the SELECT statement against this view. When the session completes, the Integration Service drops the view from the database.
For target-side pushdown optimization, the Integration Service generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push. It processes the transformation logic up to the point from which it can push the logic to the target database, and then executes the generated SQL.
With full pushdown optimization, the Integration Service pushes as much transformation logic as possible to both the source and target databases. If you configure a session for full pushdown optimization and the Integration Service cannot push all the transformation logic to the database, it performs partial pushdown optimization instead.
To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. It generates SQL statements that are executed against the source and target database based on the transformation logic it can push to the database. If the session contains an SQL override or lookup override, the Integration Service generates a view and runs a SELECT statement against this view.
You can additionally choose a few options to control how the Integration Service pushes transformation logic into SQL statements. The screenshot below shows the available options.
Allow Temporary View for Pushdown : Allows the Integration Service to create temporary view objects in the database when it pushes the session to the database.
Allow Temporary Sequence for Pushdown : Allows the Integration Service to create temporary sequence objects in the database.
Allow Pushdown for User Incompatible Connections : Indicates that the database user of the active database has read permission on the idle databases.
You can invoke the Pushdown Optimization Viewer from the highlighted 'Pushdown Optimization' attribute, as shown in the image below.
The Pushdown Optimization Viewer pops up in a new window and shows how the Integration Service converts the transformation logic into SQL statements for a particular mapping. Selecting a pushdown option or pushdown group in the viewer does not change the pushdown configuration; to change the configuration, you must update the pushdown option in the session properties.
When a large amount of transformation logic is pushed down, the database may have to run a long transaction. A long transaction uses more database resources, locks the database for longer periods of time (which reduces database concurrency and increases the likelihood of deadlock), and increases the likelihood that an unexpected event may occur.
Hope you enjoyed this article and found it informative. Please leave us your comments and feedback.
The Debugger is an integral part of the Informatica PowerCenter Mapping Designer, which helps you troubleshoot logical errors or data error conditions in a mapping. The Debugger user interface shows the step-by-step execution path of a mapping and how the source data is transformed along the way. Features like breakpoints and expression evaluation make the debugging process easy.
The image above shows a mapping with one breakpoint set on the Expression transformation. The Target Instance window shows the first two records set for update, and the Instance window shows how the third record from the source is transformed in the expression EXP_INSERT_UPDATE.
Creating Breakpoints
When you run a debugger session, you may not be interested in seeing the data in every transformation instance, but only in the specific transformations where you expect a logical or data error.
For example, you might want to see what is going wrong in the expression transformation EXP_INSERT_UPDATE for a specific customer record, say CUST_ID = 1001.
By setting a breakpoint, you can pause the Debugger on a specific transformation or when a specific condition is satisfied. You can set two types of breakpoints.
Error Breakpoints : When you create an error breakpoint, the Debugger pauses when the Integration Service encounters an error condition such as a transformation error. You can also set the number of errors to skip for each breakpoint before the Debugger pauses.
Data Breakpoints : When you create a data breakpoint, the Debugger pauses when the data breakpoint condition evaluates to true. You can set the number of rows to skip, a data condition, or both.
You can open the Edit Breakpoints window from Mapping -> Debugger -> Edit Breakpoints (Alt+F9), as shown in the image below.
Shown below is a data breakpoint created on EXP_INSERT_UPDATE with the condition CUST_ID = 1001. With this setting, the Debugger pauses on the transformation EXP_INSERT_UPDATE when it processes the record with CUST_ID = 1001.
In the same way, we can create error breakpoints on any transformation. Setting up breakpoints is optional for running the Debugger, but it helps narrow down the issue faster, especially when the mapping is big and complex.
You can start the Debugger Wizard from Mapping -> Debugger -> Start Debugger (F9), as shown in the image below.
From the window shown below, you choose the Integration Service, and you choose an existing non-reusable session, an existing reusable session, or create a debug session instance.
The next window gives the option to choose the session attached to the mapping being debugged.
You can choose to load or discard target data when you run the Debugger. If you discard target data, the Integration Service does not connect to the target. You can select the target instances you want to display in the Target window while you run a debug session.
When the Debugger is in a paused state, you can see the transformation data in the Instance window. After you review or modify the data, you can continue the Debugger in the following ways. The commands to control Debugger execution are shown in the image below; this menu is available under Mapping -> Debugger.
Continue to the next break : To continue to the next break, click Continue (F5). The Debugger continues running until it encounters the next break.
Continue to the next instance : To continue to the next instance, click Next Instance (F10). The Debugger continues running until it reaches the next transformation or until it encounters a break. If the current instance has output going to more than one transformation instance, the Debugger stops at the first instance it processes.
Step to a specified instance : To continue to a specified instance, select the transformation instance in the mapping, then click Step to Instance (Ctrl+F10). The Debugger continues running until it reaches the selected transformation in the mapping or until it encounters a break.
Evaluating Expression
When the Debugger pauses, you can use the Expression Editor to evaluate expressions using mapping variables and ports in a selected transformation. This option is helpful for evaluating and rewriting an expression, in case you find the expression result is erroneous.
You can access Evaluate Expression window from Mapping -> Debugger -> Evaluate Expression.
Modifying Data
When the Debugger pauses on a data breakpoint, the current instance displays in the Instance window, and you can modify its data directly from there. This option is helpful if you want to check what the result would be if the input were different from the current value.
Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out the mapping Debugger, and subscribe to the mailing list to get the latest tutorials in your mailbox.
The performance tuning process identifies bottlenecks and eliminates them to achieve an acceptable ETL load time. Tuning starts with identifying bottlenecks in the source, target, and mapping, and continues with session tuning. It might also require tuning the system resources on which the Informatica PowerCenter services run.
This performance tuning series is split into multiple articles, each covering a specific area of tuning. In this article we discuss session anatomy and the different types of bottlenecks.
Determining the best way to improve performance can be complex. An iterative method is more effective: identify one bottleneck at a time and eliminate it, then identify and eliminate the next bottleneck, until an acceptable throughput is achieved.
The first step in performance tuning is to identify performance bottlenecks. Performance bottlenecks can occur in the source and target, the mapping, the session, and the system. Before we look at the different bottlenecks, let's see the components of an Informatica PowerCenter session and how a bottleneck arises.
Within a session, the reader thread connects to the source and extracts the data, the transformation thread processes the data according to the transformation logic in the mapping, and the writer thread connects to the target and loads the data. Any data processing delay in these threads leads to a performance issue.
Shown above is the pictorial representation of a session: data is read by the reader thread, transformed by the transformation thread, and finally loaded into the target by the writer thread.
Source Bottlenecks
Performance bottlenecks can occur when the Integration Service reads from a source database. Slowness in reading data from the source delays filling the DTM buffer, so the transformation and writer threads wait for data and the entire session runs slower. An inefficient source query or small database network packet sizes can cause source bottlenecks; one quick check, sketched below, is to run the source query directly against the database and time it.
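A minimal sketch of that check follows, assuming a hypothetical ODBC DSN, credentials, and staging table; the idea is simply to time the session's read query outside PowerCenter and compare it with the session's reader-thread time.

```python
# Minimal sketch (hypothetical DSN, credentials, and table): time the session's
# source query directly against the database. If the query itself is slow,
# the bottleneck is on the source side rather than in the mapping or session.
import time
import pyodbc  # assumes an ODBC driver and DSN are configured

SOURCE_QUERY = "SELECT * FROM CUST_STAGE"  # replace with the session's read query

conn = pyodbc.connect("DSN=SRC_DB;UID=etl_user;PWD=secret")  # hypothetical connection
start = time.time()
rows = conn.cursor().execute(SOURCE_QUERY).fetchall()
print(f"Read {len(rows)} rows in {time.time() - start:.1f} seconds")
conn.close()
```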
Target Bottlenecks
When a target bottleneck occurs, the writer thread cannot free up buffer blocks for the reader and transformation threads until the data is written to the target. So the reader and transformation threads wait for free blocks, which causes the entire session to run slower.
Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks.
Mapping Bottlenecks
Complex or poorly written mapping logic can lead to a mapping bottleneck. With a mapping bottleneck, the transformation thread runs slower, causing the reader thread to wait for free blocks and the writer thread to wait for blocks to be filled for writing to the target.
Session Bottlenecks
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. A session bottleneck normally occurs when the session memory configuration is not tuned correctly, which in turn leads to a bottleneck on the reader, transformation, or writer thread. Small cache sizes, low buffer memory, and small commit intervals can cause session bottlenecks.
System Bottlenecks
After you tune the source, target, mapping, and session, consider tuning the system to prevent system bottlenecks. The Integration Service uses system resources to process transformations, run sessions, and read and write data. The Integration Service also uses system memory to create cache files for transformations, such as Aggregator, Joiner, Lookup, Sorter, XML, and Rank.
The Debugger is a great tool to troubleshoot your mapping logic, but there are instances where we need a different troubleshooting approach. The session log file with verbose data gives much more detail than the Debugger tool, such as what data is stored in the cache files and how variable ports are evaluated. Such information helps in complex, tricky troubleshooting.
For our discussion, let's consider a simple mapping with one Lookup transformation and one Expression transformation. Shown below is the structure of the mapping. We will set up the session to debug these two transformations.
As mentioned, we set the Tracing Level to Verbose Data for the Expression transformation as well, as shown below.
Note : We can override the tracing level for all the individual transformations at once from Configuration Object -> Override Tracing property.
Once you open the session log file with verbose data, you will notice a lot more information than we normally see in a log file.
Since we are interested in the data transformation details, we can scroll down through the session log and look for the transformation thread.
The part of the log file shown below details what data is stored in the lookup cache file. The highlighted section shows that the data is read from the lookup source LKP_T_DIM_CUST{{DSQ}} and built into the LKP_T_DIM_CUST{{BLD}} cache. Further, you can see the values stored in the cache file.
Further down in the transformation thread, you can see that three records are passed on to LKP_T_DIM_CUST from the source qualifier SQ_CUST_STAGE; the Rowid is shown in the log file. You can also see what data is received by EXP_INSERT_UPDATE from the Lookup transformation. The Rowid is helpful for tracking rows between transformations.
Since we have enabled verbose data for the Expression transformation as well, in addition to the above details you will see how data is passed into and out of the Expression transformation; this is skipped in this demo.
Pros
Faster : Once you get the hang of verbose data, debugging with the session log file is faster than using the Debugger tool. You do not have to wait patiently for info from each transformation as you do with the Debugger.
Detailed info : Verbose data gives much more detail than the Debugger, such as what data is stored in the cache files and how variable ports are evaluated, which is useful for detailed debugging.
All in one place : You get all the detailed debugging info in one place, which helps you follow how rows are transformed from source to target. In the Debugger tool you can see only one row at a time.
Cons
Difficult to understand : Unlike the Debugger tool, it requires an extra bit of effort to understand the verbose data in the session log file.
No user interface : All the debugging info is provided in text format, which might not be the preferred way for some.
Lot of info : The session log file with verbose data gives much more detail than the Debugger tool, some of which may be irrelevant to your troubleshooting.
Hope you enjoyed this tutorial. Please let us know if you have any difficulties trying out this debugging approach, or share with us if you use any different methods for debugging.