Datastage Imp
Center of Excellence
This document is Confidential, Proprietary and Trade Secret Information (“Confidential Information”) of IBM, Inc. and is provided solely for the purpose
of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a “need to know.” The
attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without
limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United
States.
Copyrights
Document Goals
Intended Use      This document presents a set of standard practices, methodologies, and examples for IBM WebSphere® DataStage Enterprise Edition™ ("DS/EE") on UNIX, Windows, and USS. Except where noted, this document is intended to supplement, not replace, the installation documentation.
Target Audience   The primary audience for this document is DataStage developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects, System Administrators, and Developers.
Product Version   This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.1 (UNIX, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)
Document Conventions
This document uses the following conventions:
Convention    Usage
Bold          In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.
Italic        In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.
Plain         In text, plain indicates Windows NT commands and options, file names, and pathnames.
Bold Italic   Indicates important information.
Lucida Blue       In examples, Lucida Blue is used to illustrate the operating system command line prompt.
→ (right arrow)   A right arrow between menu commands indicates you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.
This line
  continues       The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.
Interaction with our example system will usually include the system prompt (in blue) and the
command, most often on 2 or more lines.
If appropriate, the system prompt will include the user name and directory for context. For example:
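An illustrative sketch (the user, host, directory, and command shown are hypothetical):

    dsadm@etlserver:/home/dsadm/Ascential/DataStage> du -sk \
        Projects/Acct_Eng_NAB_Dev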
Table of Contents
1 Data Integration Overview
  1.1 Job Sequences
  1.2 Job Types
2 Standards
  2.1 Directory Structures
  2.2 Naming Conventions
  2.3 Documentation and Annotation
  2.4 Working with Source Code Control Systems
  2.5 Understanding a Job's Environment
3 Development Guidelines
  3.1 Modular Development
  3.2 Establishing Job Boundaries
  3.3 Job Design Templates
  3.4 Default Job Design
  3.5 Job Parameters
  3.6 Parallel Shared Containers
  3.7 Error and Reject Record Handling
  3.8 Component Usage
4 DataStage Data Types
  4.2 Null Handling
  4.3 Runtime Column Propagation
5 Partitioning and Collecting
  5.1 Partition Types
  5.2 Monitoring Partitions
  5.3 Partition Methodology
  5.4 Partitioning Examples
  5.5 Collector Types
  5.6 Collecting Methodology
6 Sorting
  6.1 Partition and Sort Keys
  6.2 Complete (Total) Sort
  6.3 Link Sort and Sort Stage
  6.4 Stable Sort
  6.5 Sub-Sorts
  6.6 Automatically-Inserted Sorts
  6.7 Sort Methodology
  6.8 Tuning Sort
7 File Stage Usage
  7.1 Which File Stage to Use
  7.2 Data Set Usage
  7.3 Sequential File Stages (Import and Export)
  7.4 Complex Flat File Stage
8 Transformation Languages
[Flowchart not reproduced. It illustrates the flow of a Job Sequence: a Before-Job Subroutine, reading input data, checks for errors and warnings (with No branches), creation of (limited) reject files, and an After-Job Subroutine.]
These Job Sequences control the interaction and error handling between individual DataStage jobs, and together form a single end-to-end module within a DataStage application.
Job Sequences also provide the recommended level of integration with external schedulers (such as AutoSys, Cron, CA7, etc.). This provides a level of granularity and control that is easy to manage and maintain, and makes appropriate use of the respective technologies.
In most production deployments, Job Sequences require a level of integration with various production
automation technologies (scheduling, auditing/capture, error logging, etc). These topics are discussed
in Parallel Framework Standard Practices: Administration, Management, and Production Automation.
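In practice, an external scheduler typically invokes a controlling Job Sequence through the dsjob command-line interface. The following wrapper is an illustrative sketch only: the project name, sequence name, and parameter are hypothetical, and the dsjob options used should be confirmed against the DataStage documentation.

    #!/bin/sh
    # Illustrative scheduler wrapper -- project, sequence, and parameter names
    # are hypothetical; confirm dsjob options in the DataStage documentation.
    PROJECT=Acct_Eng_NAB_Prod
    SEQUENCE=AccountLoadSeq      # hypothetical Job Sequence name

    dsjob -run -jobstatus -param jpENVIRON=prod $PROJECT $SEQUENCE
    STATUS=$?

    # -jobstatus makes dsjob wait for completion and return an exit status that
    # reflects the finishing state of the sequence; map it to the scheduler's
    # success/failure convention here.
    exit $STATUS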
The following example transformation job demonstrates the use of DS/EE Data Sets as write-through cache:
A column generator inserts the key column for a join and generates a single value guaranteed to never
appear in the other input(s) to the join. By specifying a full-outer join we produce a Cartesian product
dataset. In this case, we replicated the Oracle structure (lower input) for each country found in the
write-through cache country dataset (upper input).
The key column for a Referential Integrity check is validated by a Transformer stage. If the key column is NULL, the record is rejected by the Transformer to a reject port and the validation is not performed for those records. The non-validated records, the validated records, and the write-through cache records from the last load of the target database are merged.
The merged records are grouped and ordered before being de-duplicated to remove obsolete records.
The de-duplicated records are re-grouped and ordered before calculation of the terminating keys,
producing an ordered and linked associative table.
This job also loads the target database table and creates write-through cache. In this case, if the load
fails, the cache is deleted, forcing other jobs that might depend on this data to access the existing (not
updated) target database table. This enforces a coherent view of the subject area from either cache
(current state if all jobs complete successfully) or target tables (previous state if any job fails).
2 Standards
Establishing consistent development standards helps to improve developer productivity and reduce
ongoing maintenance costs. Development standards can also make it easier to integrate external
processes such as automated auditing and reporting, and to build technical and support documentation.
Figure 1: Recommended DataStage Install, Scratch, and Data Directories
[Diagram not reproduced. It depicts the /Ascential/DataStage installation tree (patches, DSEngine, Configurations, and Projects containing Project_A through Project_Z) alongside separate /Scratch0 through /ScratchN and data file systems, each with per-project subdirectories; file systems are highlighted in blue.]
NOTE: On some operating systems, it is possible to create separate file systems at non-root
levels. This is illustrated in the above diagram, as a separate file system for the Projects sub
directory within the DataStage installation.
The DataStage Administrator should ensure that these default directories are never used by any parallel configuration files. Scratch is used by the EE framework for temporary files such as buffer overflow and sort memory overflow. It is a bad practice to share the DataStage project file system and conductor file system with volatile files like scratch files and parallel Data Set part files, because doing so increases the risk of filling the DataStage project file systems.
To scale I/O performance within DataStage, the administrator should consider creating separate file systems for each Scratch and Data (resource) partition. As a standard practice, name these file systems in accordance with the partition numbers in the DataStage EE configuration file, and create subdirectories for each project within each scratch and disk partition.
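For example, a two-node parallel configuration file following this convention might look like the following sketch; the host name, file system names, and project subdirectory are illustrative only:

    {
      node "node1"
      {
        fastname "etlserver"
        pools ""
        resource disk "/Data0/Project_A" {pools ""}
        resource scratchdisk "/Scratch0/Project_A" {pools ""}
      }
      node "node2"
      {
        fastname "etlserver"
        pools ""
        resource disk "/Data1/Project_A" {pools ""}
        resource scratchdisk "/Scratch1/Project_A" {pools ""}
      }
    }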
Figure 2: DataStage Staging Directories
[Diagram not reproduced. It depicts a /Staging file system with a directory for each deployment phase, each containing per-project subdirectories (Project_A through Project_Z) with /archive directories beneath them.]
Within the separate Staging file system, data directories are implemented for each deployment phase of a job (development, system integration, qa, and production) as appropriate. If the file system is not shared across multiple servers, not all of these deployment phases may be present on a local file system. Within each deployment directory, files are separated by Project name as shown below.
/dev       development data tree; location of source data files, target data files, and error and reject files
/archive   location of compressed archives created by the archive process for previously processed files
To completely integrate all aspects of a DataStage application, the directory structure that is used for integration with external entities should be defined in a way that provides a complete and separate structure, in the same spirit as a DataStage project. A directory structure should be created that organizes external entities and is directly associated with one and only one DataStage project. This provides a convenient vehicle to group and manage resources used by a project. The directory structure is made transparent to the DataStage application through the use of environment variables. Environment variables are a critical portability tool that enables DataStage applications to move through the life cycle without any code changes.
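For example, project-level environment variables might expose the structure to jobs and scripts so that no absolute path is ever hard-coded. The variable names and paths below are illustrative assumptions, not prescribed values:

    # Illustrative Project_Plus variables, e.g. set as project-level environment
    # variables in Administrator or exported by a wrapper script; the names and
    # directory layout shown here are assumptions.
    PROJECT_PLUS_HOME=/Project_Plus/dev/Project_A; export PROJECT_PLUS_HOME
    PROJECT_PLUS_BIN=$PROJECT_PLUS_HOME/bin;       export PROJECT_PLUS_BIN
    PROJECT_PLUS_PARAMS=$PROJECT_PLUS_HOME/params; export PROJECT_PLUS_PARAMS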
/bin       location of custom programs, DataStage routines, BuildOps, utilities, and shells
/doc       location of documentation for programs found in the /bin subdirectory
/src       location of source code and makefiles for items found in the /bin subdirectory (Note: depending on change management policies, this directory may only be present in the /dev development code tree)
/params    location of parameter files for automated program control, a copy of dsenv, and copies of the DSParams.$ProjectName project files
In some implementations, there may be external entities that are shared with other DataStage projects; for example, all jobs are invoked with the same script. A similar directory structure to the Project_Plus structure could be configured and referred to as DataStage_Plus.
Any set of standards needs to take on the culture of an organization and be tuned according to its needs, so it is envisaged that these standards will develop and adapt over time to suit both the organization and the purpose.
There are a number of benefits from using a graphical development tool like DataStage, and many of
these benefits were used to establish this naming standard:
• With rapid development, more effort can be put into analysis and design, enabling a greater
understanding of the requirements and greater control over how they are delivered.
• There can be a much tighter link between design and development.
• Since much of the development work is done using a click, drag, and drop paradigm, there is less typing involved and hence the opportunity to use longer, more meaningful, more readable names while maintaining quality.
Throughout this section, the term “Standard” refers to those principles that are required, while the term
“Guideline” refers to recommended, but not required, principles.
In the context of DataStage, the class word is used to identify either a type of object or the function that
a particular type of object will perform. In some cases where appropriate, objects can be sub-typed (for
example, a Left Outer Join). In these cases the class word represents the subtype.
For example, in the case of a link object, the class word refers to the functions of Reading, Reference
(Lookup), Moving or Writing data (or within a Sequence Job, the moving of a message).
In the case of a data store the class word will refer to the type of data store, for example: Data Set,
Sequential File, Table, View, and so forth.
Where no sub-classification is required, the class word will simply refer to the object. As an example, a transformer might be named: Data_Block_Split_Tfm
As a guideline, the Class Word is represented as a two-, three-, or four-letter abbreviation. Where it is a three- or four-letter abbreviation, it should be word-capitalized. Where it is a two-letter abbreviation, both letters should be capitalized.
One benefit of using the Subject, Subject Modifier, Class Word approach, over using the Prefix approach, is that it enables two levels of sorting or grouping. In WebSphere MetaStage, the object type is defined in a separate field: there is a field that denotes whether the object is a column, a derivation, a link, a stage, a job design, and so forth. This is the same or similar information that would be carried in a prefix approach. Carrying this information as a separate attribute enables the first word of the name to be used as the subject matter, allowing sorting either by subject matter or by object type. Secondly, the class word approach enables sub-classification by object type to provide additional information.
For the purposes of documentation, all word abbreviations should be referenced by their long form, so that readers become used to saying the name in full even when reading the abbreviation. As with a logical name, however, when creating the object the abbreviated form is used. This will help reinforce wider understanding of the subjects.
The key issue is readability. Though DataStage imposes some limitations on the type of characters and length of various object names, the standard, where possible, is to separate words by an underscore, which allows clear identification of each word in a name. This should be enhanced by also using word capitalization; that is, the first letter of each word should be capitalized.
When development is more or less complete, attention should be given to the layout to enhance
readability before it is handed over to versioning.
Where possible, consideration should be made to provide DataStage developers with higher resolution
screens as this provides them with more screen display real-estate. This can help make them more
productive and makes their work more easily read.
DataStage provides the ability to document during development with the use of meaningful naming
standards (as outlined in this section). Establishing standards also eases use of external tools and
2.2.4.1 Projects
Each DataStage Project is a standalone repository. It may or may not have a one-to-one relationship with an organization's project of work. This can often cause terminology issues, especially in teams where both business and development staff are involved. The suffix of a Project name should be used to identify Development ("Dev"), Test ("Test"), and Production ("Prod").
The name of a DataStage Project may be at most 18 characters in length; it can contain alphanumeric characters and underscores. With this 18-character limit, the name is most often composed of abbreviations.
Examples of Project naming where the project is single application focused are:
• “Accounting Engine NAB Development” would be named: Acct_Eng_NAB_Dev
• “Accounting Engine NAB Production” would be named: Acct_Eng_NAB_Prod
DataStage enforces the top level Directory Structure for different types of Objects (for example, Jobs,
Routines, Shared Containers, Table definitions…). Below this level, developers have the flexibility to
define their own Directory or Category hierarchy.
The main reason for having Categories is to group related objects. Where possible, a Category level
should only contain objects that are directly related. For example, a job category might contain a Job
Sequence and all the jobs and only those jobs that are contained in that sequence.
Categorization by Developer
In development projects, categories will be created for each developer as their personal sandbox and the place where they perform unit test activities on the jobs they are developing. It is the responsibility of each developer to delete unused or obsolete code, and the responsibility of the development manager assigned the DataStage Manager role to ensure that projects are not bloated with unused jobs, categories, and metadata.
Remember that Job names must be unique within a given project: two developers cannot save a copy of the same job with the same name within their individual "sandbox" categories; a unique Job name must be given.
In the previous illustration of the project in DataStage Manager, two developers have private categories for sandbox and development activities, and there are two additional high-level categories, ECRP and Templates.
Although the default table definition categories are useful from a functional perspective, establishing a
Table Definition categorization that matches project development organization is recommended.
New Table Definition categories can be created within the repository by right-clicking within the Table
Definitions area of the DataStage project repository and choosing the “New Category” command.
When implementing a customized Table Definition categorization, care must be taken to override the default choices for category names during Table Definition import. On import, the first-level Table Definition category is identified as the "Data Source Type" and the second-level categorization is referred to as the "Data Source Name", as shown in the example below. The placement of these fields varies with the method of metadata import.
The following is one of the TableDefs from this project showing how to correctly specify the category
and sub-category.
Jobs and Job Sequences are all held under the Category Directory Structure of which the top level is
the category “Jobs”.
A Job will be suffixed with the class word “Job” and a Job Sequence will be suffixed with the class
word “Seq”.
Jobs should be organized under Category Directories to provide grouping, such that a Directory contains a Sequence Job and all the Jobs that are contained within that sequence. This will be discussed further in Section 2.2.4.2 Category Hierarchy.
To differentiate between Parallel Shared Containers and Server Shared Containers, the following Class
Word naming is recommended:
• Psc = Parallel (Enterprise Edition) Shared Container
• Ssc = Server Edition Shared Container
IMPORTANT: Use of Server Edition Shared Containers is discouraged within a parallel job.
2.2.4.7 Parameters
A Parameter can have a long name consisting of alphanumeric characters and underscores. The parameter name should therefore be made readable by using capitalized words separated by underscores. The class word suffix is "Parm".
2.2.4.8 Links
Within a DataStage Job, links are objects that represent the flow of data from one stage to the next.
Within a Job Sequence, links represent the flow of a message from one activity / step to the next.
It is particularly important to establish a consistent naming convention for link names, instead of using
the default “DSLink#” (where “#” is an assigned number). Within the graphical Designer environment,
stage editors identify links by name; having a descriptive link name reduces the chance for errors (for
example, during Link Ordering). Furthermore, when sharing data with external applications (for
Instead of using the full object name, a 2, 3, or 4 character abbreviation should be used for the Class
Word suffix, after the subject name and subject modifier. A list of frequently-used stages and their
corresponding Class Word abbreviation may be found in 12.4.2 DataStage Naming Reference.
The concept of source and target can be applied in a couple of ways. Every job in a series of jobs could consider the data it reads to be a source and the data it writes out to be a target. However, for the sake of this naming convention, a Source is only data that is extracted from an original system, and a Target is the data structure that is produced or loaded as the final result of a particular series of jobs. This is based on the purpose of the project: to move some data from a source to a target.
Data Stores used as temporary structures to land data between jobs, supporting restart and modularity,
should use the same names in the originating job and any downstream jobs reading the structure.
A Transformer Stage Variable can have a long name consisting of alphanumeric characters but not underscores. The Stage Variable name must therefore be made readable by using capitalized words only. The Class Word suffix is Stage Variable or "SV". Stage Variables should be named according to their purpose.
A How-To document describing the appropriate use of the routine must be provided by the author of
the routine, and placed in a documentation repository.
DataStage Custom Transformer routine names will indicate their function and they will be grouped in
sub-categories by function under a main category of Custom, for example:
Routines/Custom/DetectTeradataUnicode.
Source code, a makefile, and the resulting object for each Custom Transformer routine must be placed
in the project phase source directory, e.g.: /home/dsadm/dev/bin/source.
Intermediate datasets are created between modules. Their names will include either the name of the module that created the dataset or the contents of the dataset, since more than one module may use the dataset after it is written. For example:
BUSN_RCR_CUST.ds
Target output files will include the name of the target database or system, the target table name or
copybook name. The goal is the same as with source files – to connect the name of the file with the
name of the file on the target system. Target flat files will have a unique serial number composed of
the date, “_ETL_” and time, for example:
Client_Relationship_File1_Out_20060104_ETL_184325.psv
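A script producing such a file might build the serial number as follows (a sketch only, reusing the subject name from the example above):

    # Build the date, "_ETL_", and time serial number for a target flat file
    SERIAL=`date '+%Y%m%d_ETL_%H%M%S'`                 # e.g. 20060104_ETL_184325
    TARGET_FILE="Client_Relationship_File1_Out_${SERIAL}.psv"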
Files and datasets will have suffixes that allow easy identification of the content and type. DataStage
proprietary format files have required suffixes and are identified in italics in the table below which
defines the types of files and their suffixes.
The Short Description field is also displayed on summary lines within the Director and Manager
clients. At a minimum, description annotations must be provided in the Job Properties Short
Description field for each job and job sequence, as shown below:
Within a job, the Annotation tool should be used to highlight steps in a given job flow. Note that by
changing the vertical alignment properties (for example, Bottom) the annotation can be drawn around
the referenced stage(s), as shown in the following example.
DataStage also allows descriptions to be attached to each stage within the General tab of the stage
properties.
Each stage should have a short description of its function specified within the stage properties. These
descriptions will appear in the job documentation automatically generated from jobs and sequencers
adhering to the standards in this document. More complex operators or operations should have
correspondingly longer and more complex explanations on this tab.
Examples of stage short descriptions:
Lookup stage:
• This stage validates the input and writes rejects.
• This stage validates the input and continues.
• This stage identifies changes and drops records not matched (not updated).
Copy stage:
• This stage sends data to the TDMLoadPX stage for loading into Teradata, and to a dataset for use as write-through cache.
• This stage renames and/or drops columns and is NOT optimized out.
• This stage is cosmetic and is optimized out.
Transformer stage:
• This stage generates sequence numbers that have a less-than-file scope.
• This stage converts null dates.
Modify stage:
• This stage performs data conversions not requiring a transformer.
DataStage does not directly integrate with source code control systems, but it does offer the ability to
exchange information with these systems. It is the responsibility of the DataStage developer to
maintain DataStage objects within the source code system.
The Manager client is the primary interface to the DataStage object repository. Using Manager, you
can export objects (job designs, table definitions, custom stage types, user-defined routines, etc.) from
the repository as clear-text format files. These files can then be checked into the external source code
control system.
The export file format for DataStage objects can be either .DSX (DataStage eXport format) or .XML.
Both formats contain the same information, although the XML file is generally much larger. Unless
there is a need to parse information in the export file, .DSX is the recommended export format.
For these reasons, it is important that an identified individual maintains backup copies of the
important job designs using .DSX file exports to a local or (preferably) shared file system.
The DataStage client includes Windows command-line utilities for automating the export process.
These utilities (dsexport and dscmdexport) are documented in the DataStage Manager Guide.
All exports from the DataStage repository are performed on the Windows workstation. There is
no server-side project export facility.
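For example, a scheduled export of a project could be scripted on the client workstation along these lines. This is a sketch only: the host, user, project, and path are hypothetical, and the exact dscmdexport option syntax should be confirmed against the DataStage Manager Guide.

    REM Illustrative only -- confirm the dscmdexport option syntax in the Manager Guide
    dscmdexport /H etlserver /U dsadm /P secret Acct_Eng_NAB_Dev C:\exports\Acct_Eng_NAB_Dev.dsx
    REM Check the resulting .DSX file into the source code control system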
NOTE: Objects cannot be exported from DataStage if they are open in Designer.
Make sure all objects are saved and closed before exporting.
To export a group of objects to a single export file, the option “Selection: By category”
should be specified in the “Options” tab.
The filename for export is specified in the “Export to file:” field at the top of the Export
dialog.
If you wish to include compiled Transformer objects for a selected job, make sure the
“Job Executables” category is checked.
• Using your source code control utilities, check-in the exported .DSX file
For test and production environments, it is possible to import the job executables from the DataStage
server host using the dsjob command-line, as documented in the DataStage Development Kit chapter of
the Parallel Job Advanced Developer’s Guide. Note that using dsjob will only import job executables -
job designs can only be imported using the Manager client or the dsimport or dscmdimport client tools.
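For example (a sketch only; the option names shown are assumptions to be verified in the Parallel Job Advanced Developer's Guide):

    # Illustrative: import previously exported job executables on the server host.
    # Option names are assumptions -- verify them in the Advanced Developer's Guide.
    dsjob -import Acct_Eng_NAB_Prod /home/dsadm/deploy/Acct_Eng_NAB.dsx -OVERWRITE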
• Use the source code control system to check-out (or export) the .DSX file to your client
workstation.
• Import objects in the .DSX file using Manager. Choose “Import DataStage Components” from
the “Import” menu. Select the file you checked out of your source code control system by
clicking on the ellipsis (“…”) next to the filename field in the import dialog. After selecting
your file, click OK to import.
• The import of the .DSX file will place the object in the same DataStage category it originated from. This means that, if necessary, the import will create the Job Category if it doesn't already exist.
• If the objects were not exported with the “Job Executables”, then compile the imported objects
from Designer, or using the Multi-Job Compile tool.
Although operating system environment variables can be set in multiple places, there is a defined order
of precedence that is evaluated when a job’s actual environment is established at runtime:
1) The daemon for managing client connections to the DataStage server engine is called dsrpcd. By default (in a root installation), dsrpcd is started when the server is installed, and should start whenever the machine is rebooted.
By default, DataStage jobs inherit the dsrpcd environment, which, on UNIX platforms, is set in the /etc/profile and $DSHOME/dsenv scripts. On Windows, the default DataStage environment is defined in the registry. Note that client connections DO NOT pick up per-user environment settings from their $HOME/.profile script.
On USS environments, the dsrpcd environment is not inherited, since DataStage jobs do not execute on the conductor node.
2) Environment variable settings for particular projects can be set in the DataStage Administrator
client. Any project-level settings for a specific environment variable will override any settings
inherited from dsrpcd.
Any project-level environment variables must be set for new projects using the Administrator
client, or by carefully editing the DSPARAMS file within the project. Refer to the DataStage
Administration, Management, and Production Automation Best Practice for additional details.
3) Within Designer, environment variables may be defined for a particular job using the Job
Properties dialog box. Any job-level settings for a specific environment variable will override
any settings inherited from dsrpcd or from project-level defaults.
To avoid hard-coding default values for job parameters, there are three special values that can be used
for environment variables within job parameters:
• $ENV causes the value of the named environment variable to be retrieved from the operating system of the job environment. Typically this is used to pick up values set in the operating system outside of DataStage (see the sketch following this list).
NOTE: $ENV should not be used for specifying the default $APT_CONFIG_FILE value because,
during job development, the Designer parses the corresponding parallel configuration file to
obtain a list of node maps and constraints (advanced stage properties).
• $PROJDEF causes the project default value for the environment variable (as shown on the
Administrator client) to be picked up and used to set the environment variable and job
parameter for the job.
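For example, a value exported in the $DSHOME/dsenv script is inherited by the dsrpcd environment and can then be picked up by a job parameter whose default value is $ENV (the variable name and path below are illustrative only):

    # Illustrative addition to $DSHOME/dsenv (inherited through dsrpcd).
    # A job parameter whose default value is $ENV, for example one named
    # SOURCE_FILE_DIR, would pick up this operating-system value at runtime.
    SOURCE_FILE_DIR=/Staging/dev/Project_A; export SOURCE_FILE_DIR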
3 Development Guidelines
3.1 Modular Development
Modular development techniques should be used to maximize re-use of DataStage jobs and
components:
• Job parameterization allows a single job design to process similar logic instead of creating
multiple copies of the same job. The Multiple-Instance job property allows multiple invocations
of the same job to run simultaneously.
• A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored (see the sketch following this list). To ease re-use, these standard parameters and settings should be made part of a Designer Job Template.
• Create a standard directory structure outside of the DataStage project directory for source and
target files, intermediate work files, and so forth.
• Where possible, create re-usable components such as parallel shared containers to encapsulate
frequently-used logic.
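As an illustration, a standard parameter set captured in a Designer Job Template (and mirrored in a parameter file under the project's /params directory) might include entries such as the following; the parameter names and default values are assumptions, not prescribed names:

    # Illustrative standard job parameters (names and defaults are assumptions)
    jpSRC_DSN=SRCDB01        # source database DSN
    jpSRC_USER=etl_src       # source database user
    jpSRC_PASSWORD=          # supplied at runtime, never hard-coded
    jpTGT_DSN=TGTDB01        # target database DSN
    jpSTAGING=/Staging       # root of the staging file system
    jpENVIRON=dev            # deployment phase: dev, si, qa, prod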
While it may be possible to construct a large, complex job that satisfies given functional requirements,
this may not be appropriate. Factors to consider when establishing job boundaries include:
• Establishing job boundaries through intermediate Data Sets creates "checkpoints" that can be used in the event of a failure when processing must be restarted. Without these checkpoints, processing must be restarted from the beginning of the job flow. It is for these reasons that long-running tasks are often segmented into separate jobs in an overall sequence.
o For example, if the extract of source data takes a long time (such as an FTP transfer over
a wide area network) it would be good to land the extracted source data to a parallel data
set before processing.
o As another example, it is generally a good idea to land data to a parallel Data Set before
loading to a target database unless the data volume is small or the overall time to
process the data is minimal.
• Section 12.3: Minimizing Runtime Processes and Resource Requirements provides some recommendations for minimizing the resource requirements of a given job design, especially when the volume of data does not dictate parallel processing.
• Breaking large job flows into smaller jobs may further facilitate modular development and re-use if business requirements for more than one process depend on intermediate data created by an earlier job.
• The size of a job directly impacts the speed of development tasks such as opening, saving, and compiling. These factors may be amplified when developing across a wide-area or high-latency network connection. In extreme circumstances this can significantly impact developer productivity and ongoing maintenance costs.
• The startup time of a given job is directly related to the number of stages and links in the job flow. Larger, more complex jobs require more time to start up before actual data processing can begin. Job startup time is further impacted by the degree of parallelism specified by the parallel configuration file.
• Remember that the number of stages in a parallel job includes the number of stages within each shared container used in a particular job flow.
As a rule of thumb, keeping job designs to less than 50 stages may be a good starting point. But this is
not a hard-and-fast rule. The proper job boundaries are ultimately dictated by functional / restart /
performance requirements, expected throughput and data volumes, degree of parallelism, number of
simultaneous jobs and their corresponding complexity, and the capacity and capabilities of the target
hardware environment.
Combining or splitting jobs is relatively easy, so don't be afraid to experiment and see what
works best for your jobs in your environment.
In addition, template jobs may contain any number of stages and pre-built logic, allowing multiple
templates to be created for different types of “standardized” processing.
By default, the Designer client stores all job templates in the local “Templates” directory within the
DataStage client install directory, for example, C:\Program Files\Ascential\DataStage751\Templates
To facilitate greater re-use of job templates, especially in a team-based development, the template
directory can be changed using the Windows Registry Editor.
This change must be made on each client workstation, by altering the following registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Ascential Software\DataStage Client\CurrentVersion\Intelligent Assistant\Templates
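For example, the following command run on a client workstation points the templates location at a shared network folder. This is a sketch only: the network path is hypothetical, and it assumes the location is stored as the default value of the key shown above; verify the actual value name on an existing installation (and back up the key) before making the change.

    REM Illustrative only -- verify the value name and back up the key first
    reg add "HKEY_LOCAL_MACHINE\SOFTWARE\Ascential Software\DataStage Client\CurrentVersion\Intelligent Assistant\Templates" /ve /d "N:\DataStage\Templates" /f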
The default job design specifically supports the creation of write-through cache, in which data in load-ready format is stored in DS/EE Data Sets for use in the load process or in the event that the target table becomes unavailable.
The default job design incorporates several features and components of DataStage that are used
together to support tactical and strategic job deployment. These features include:
1. Restartable job sequencers, which manage one or more jobs, detect and report failure conditions, provide monitoring and alert capabilities, and support checkpoint restart functionality.
2. Custom routines written in DataStage BASIC (DS Basic) that detect external events, manage and manipulate external resources, provide enhanced notification and alert capabilities, and interface with the UNIX operating system.
3. DataStage Enterprise Edition (DS/EE) ETL jobs that exploit job parameterization, runtime
UNIX environment variables, and conditional execution.
Other sections will discuss in detail each of the components and give examples of their use in a
working example job sequencer.
Job parameters are passed from a job sequencer to the jobs under its control as if a user were answering the runtime questions displayed in the DataStage Director job-run dialog. Default environment variables cannot be reset through this dialog unless they are explicitly specified in the job.
Job parameters are required for the following DataStage programming elements:
1. File name entries in stages that use files or Data Sets must NEVER use a hard-coded operating
system pathname.
a. Staging area files must ALWAYS have pathnames as follows:
/jpSTAGING/jpENVIRON/jpSUBJECT_AREA[filename.suffix]
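As an illustration only (the parameter names here follow the convention above, the #...# delimiters are
DataStage's standard job-parameter substitution syntax in stage properties, and the file name is purely
hypothetical), a staging file name property might be entered as:

    #jpSTAGING#/#jpENVIRON#/#jpSUBJECT_AREA#/customer_extract.dat

At run time, the job sequencer supplies values for jpSTAGING, jpENVIRON, and jpSUBJECT_AREA, and the
stage resolves the full pathname.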
Use and management of job parameters, as well as standardized routines for use in Job Sequencers are
discussed further in Parallel Framework Standard Practices: Administration, Management, and
Production Automation.
Because Parallel Shared Containers are inserted when a job is compiled, all jobs that use a shared
container must be recompiled when the container is changed. The Usage Analysis and Multi-Job
Compile tools can be used to recompile jobs that use a shared container.
The exact policy for each reject is specified in the job design document; whether the job or the overall
ETL processing is to continue is likewise specified on a per-job, per-sequence, and/or per-script basis,
based on business requirements.
Reject files will include those records rejected from the ETL stream due to Referential Integrity
failures, data rule violations or other reasons that would disqualify a row from processing. The
presence of rejects may indicate that a job has failed and prevent further processing. Specification of
this action is the responsibility of the Business Analyst and will be published in the design document.
Error files will include those records from sources that fail quality tests. The presence of errors may not
prevent further processing. Specification of this action is the responsibility of the Business Analyst and
will be published in the design document.
Both rejects and errors will be archived and placed in a special directory for evaluation or other action
by support staff. The presence of rejects and errors will be detected and notification sent by email to
selected staff. These activities are the responsibility of job sequencers used to group jobs by some
reasonable grain or by a federated scheduler.
The default action is to push back reject and error rows to a Data Steward.
Option     Description
Continue   Drop read failures from the input stream. Pass successful reads to the output stream. (No reject link exists.)
Fail       Abort the job on a read format failure. (No reject link exists.)
Output     Reject read failures to the reject stream. Pass successful reads to the output stream. (A reject link exists.)
The reject option should be used in all cases where active management of the rejects is required.
If a file is created by this option, it must have a *.rej file extension. Alternatively, a shared container
error handler can be used.
Rejects are categorized in the ETL job design document using the following ranking:
Option     Description
Continue   Ignore lookup failures and pass lookup fields as nulls to the output stream. Pass successful lookups to the output stream.
Drop       Drop lookup failures from the input stream. Pass successful lookups to the output stream.
Fail       Abort job on lookup failure.
Reject     Reject lookup failures to the reject stream. Pass successful lookups to the output stream.
The reject option should be used in all cases where active management of the rejects is required.
Furthermore, to enforce error management ONLY ONE REFERENCE LINK is allowed on a Lookup
stage. If there are multiple validations to perform, each must be done in its own Lookup.
If a file is created by this option, it must have a *.rej or *.err file extension. The *.rej extension is used
when rejects require investigation after a job run, the *.err extension when rejects can be ignored but
need to be recorded. Alternatively, a local error handler based on a shared container can be used.
Rejects are categorized in the ETL job design document using the following ranking:
If a file is created from the reject stream, it must have a *.rej or *.err file extension. The *.rej extension
is used when rejects require investigation after a job run, the *.err extension when rejects can be
ignored but need to be recorded. Alternatively, a shared container error handler can be used.
Rejects are categorized in the ETL job design document using the following ranking:
Option Description
No reject link exists Do not capture rows that fail to be written.
Reject link exists Pass rows that fail to be written to the reject stream.
The reject option should be used in all cases where active management of the rejects is required.
If a file is created by this option, it must have a *.rej file extension. Alternatively, a shared container
error handler can be used.
Rows will be converted to the common file record format with 9 columns (below) using Column
Export and Transformer stages for each reject port, and gathered using a Funnel stage that feeds a
Sequential File stage. The Column Export and Transformer stages may be kept in a template Shared
Container that the developer makes local in each job.
In this example, the following stages process the only errors produced by a job:
And the downstream Transformer stage builds the standard output record by creating the required
keys:
3.7.5.2 Processing Errors and Rejects and Merging with an Output Stream
There may be processing requirements that specify that rejected or error rows be tagged as having
failed a validation and merged back into the output stream. This is done by processing the rows from
the reject ports and setting the value of a specific column with a value specified by the design
document. The following table identifies the tagging method to be used for the previously cited
operators.
In this example, rows rejected by the Lookup stage are processed by a corrective Transformer stage,
where the failed references are set to a specific value, and are then merged with the output of the
Lookup stage:
The ability to use a Server Edition component within a parallel job is intended only as a migration
option for existing Server Edition applications that might benefit by leveraging some parallel
capabilities on SMP platforms.
Note that BASIC Routines are still appropriate, and necessary, for the job control components of a
DataStage Job Sequence and Before/After Job Subroutines for parallel jobs. This is discussed in more
detail in Parallel Framework Standard Practices: Administration, Management, and Production
Automation.
o For simple jobs with only two stages, the Copy stage should be used as a placeholder so
that new stages can be inserted easily should future requirements change.
o Unless the Force property is set to “True”, a Copy stage with a single input link and a
single output link will be optimized out of the final job flow at runtime.
Data Sets offer parallel I/O on read and write operations, without overhead for format or data type
conversions.
NOTE: Because parallel Data Sets are platform and configuration-specific, they should not be
used for long-term archive of source data.
• The Copy stage should be used instead of a Transformer for simple operations including:
- Job Design placeholder between stages (unless the Force option is set to “True”, Enterprise Edition
will optimize this out at runtime)
- Renaming Columns
- Dropping Columns
Note that rename, drop (if Runtime Column Propagation is disabled), and default type
conversion can also be performed by the output mapping tab of any stage.
• NEVER use the “BASIC Transformer” stage in large-volume job flows. Instead, user-defined
functions and routines can expand parallel Transformer capabilities. The BASIC Transformer
is intended as a “stop-gap” migration choice for existing Server Edition jobs containing
complex routines. Even then its use should be restricted and the routines should be converted as
soon as possible.
could also be implemented with a lookup table containing values for column A and
corresponding values of column B.
• Optimize the overall job flow design to combine derivations from multiple Transformers
into a single Transformer stage when possible.
• Because the parallel Transformer is compiled, it is faster than the interpreted Filter and
Switch stages. The only time that Filter or Switch should be used is when the selection clauses
need to be parameterized at runtime.
• The Modify stage can be used for non-default type conversions, null handling, and
character string trimming (a brief sketch follows below). See Section 8.2: Modify Stage.
As always, performance should be tested in isolation to identify the specific cause of bottlenecks.
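As a hedged illustration of the Modify stage bullet above (the column names are hypothetical, and the
exact specification syntax should be confirmed against Section 8.2 and the Orchestrate Operators
Reference), entries in a Modify stage Specification property take forms such as:

    CustName = handle_null(CustName, 'UNKNOWN')
    CustName:string = string_trim[' ', end](CustName)

The first replaces an out-of-band null with an in-band value; the second trims trailing pad characters
from a string column.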
The following table summarizes the underlying data types of DataStage Enterprise Edition:
SQL Type                               Internal Type             Size                           Description
Date                                   date                      4 bytes                        Date with month, day, and year
Decimal, Numeric                       decimal                   (roundup(p)+1)/2 bytes         Packed decimal, compatible with IBM packed decimal format
Float, Real                            sfloat                    4 bytes                        IEEE single-precision (32-bit) floating point value
Double                                 dfloat                    8 bytes                        IEEE double-precision (64-bit) floating point value
TinyInt                                int8, uint8               1 byte                         Signed or unsigned integer of 8 bits (specify the unsigned Extended option for unsigned)
SmallInt                               int16, uint16             2 bytes                        Signed or unsigned integer of 16 bits (specify the unsigned Extended option for unsigned)
Integer                                int32, uint32             4 bytes                        Signed or unsigned integer of 32 bits (specify the unsigned Extended option for unsigned)
BigInt*                                int64, uint64             8 bytes                        Signed or unsigned integer of 64 bits (specify the unsigned Extended option for unsigned)
Binary, Bit, LongVarBinary, VarBinary  raw                       1 byte per character           Untyped collection, consisting of a fixed or variable number of contiguous bytes and an optional alignment value
Unknown, Char, LongVarChar, VarChar    string                    1 byte per character           ASCII character string of fixed or variable length (Unicode Extended option NOT selected)
NChar, NVarChar, LongNVarChar          ustring                   multiple bytes per character   Character string of fixed or variable length
Char, LongVarChar, VarChar             ustring                   multiple bytes per character   Character string of fixed or variable length (Unicode Extended option IS selected)
Time                                   time                      5 bytes                        Time of day, with resolution to seconds
Time                                   time(microseconds)        5 bytes                        Time of day, with resolution to microseconds (specify the microseconds Extended option)
Timestamp                              timestamp                 9 bytes                        Single field containing both date and time value, with resolution to seconds
Timestamp                              timestamp(microseconds)   9 bytes                        Single field containing both date and time value, with resolution to microseconds (specify the microseconds Extended option)

* BigInt values map to long long integers on all supported platforms except Tru64, where they map to long integer values.
The Char, VarChar, and LongVarChar SQL types relate to underlying string types where each
character is 8 bits and does not require mapping because it represents an ASCII character. You can,
however, specify that these data types are extended, in which case they are taken as ustrings and do
require mapping. (They are specified as such by selecting the Extended check box for the column in
the Edit Meta Data dialog box.) An Extended field appears in the columns grid, and extended Char,
VarChar, or LongVarChar columns have ‘Unicode’ in this field. The NChar, NVarChar, and
LongNVarChar types relate to underlying ustring types so do not need to be explicitly extended.
[Type-conversion matrix not reproduced: the original table shows the default and explicit conversions
between the underlying parallel types (date, time, timestamp, int8/uint8, int16/uint16, int32/uint32,
int64/uint64, sfloat, dfloat, decimal, string, ustring, and raw), where “d” marks a conversion performed
by default and “e” marks a conversion that must be requested explicitly (for example, with the Modify
stage).]
The conversion of numeric data types may result in a loss of precision and cause incorrect results,
depending on the source and result data types. In these instances, Enterprise Edition displays a warning
message in the job log.
When converting from variable-length to fixed-length strings, Enterprise Edition pads the remaining
length with NULL (ASCII zero) characters by default.
• The environment variable APT_STRING_PADCHAR can be used to change the default pad
character from an ASCII NULL (0x0) to another character; for example, an ASCII space
(0x20) or a Unicode space (U+0020). When entering a space for the value of
APT_STRING_PADCHAR, do not enclose the space character in quotes.
• As an alternate solution, the PadString Transformer function can be used to pad a variable-
length (Varchar) string to a specified length using a specified pad character (a sketch follows
this list). Note that PadString does not work with fixed-length (CHAR) string types. You must
first convert a Char string type to a Varchar type before using PadString.
• Some stages (for example, Sequential File and DB2/UDB Enterprise targets) allow the pad
character to be specified in their stage or column definition properties. When used in these
stages, the specified pad character will override the default for that stage only.
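As a sketch of the PadString note above (the link and column names are hypothetical, and the
interpretation of the length argument should be confirmed in the Parallel Job Developer's Guide), a
Transformer output derivation padding a Varchar code with spaces to a pad length of 10 might be
written as:

    PadString(lnk_In.ProductCode, ' ', 10)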
The Transformer and Modify stages can change a null representation from an out-of-band null to an in-
band null and from an in-band null to an out-of-band null.
When reading from Data Set and database sources with nullable columns, Enterprise Edition uses the
internal, out-of-band null representation for NULL values.
When reading from or writing to Sequential Files or File Sets, the in-band null representation (a specific
value) must be explicitly defined in the extended column attributes for each Nullable column, as shown in Figure 17:
The Table Definition of a stage’s input or output data set can contain columns defined to support out-
of-band nulls (Nullable attribute is checked). The next table lists the rules for handling nullable fields
when a stage takes a Data Set as input or writes to a Data Set as output.
If the source
value is null, a fatal error occurs.
Using RCP judiciously in a job design facilitates re-usable job designs based on input metadata, rather
than using a large number of jobs with hard-coded table definitions to perform the same tasks. Some
stages, for example the Sequential File stage, allow their runtime schema to be parameterized further
extending re-use through RCP.
Furthermore, RCP facilitates re-use through parallel shared containers. Using RCP, only the columns
explicitly referenced within the shared container logic need to be defined; the remaining columns pass
through at runtime, as long as each stage in the shared container has RCP enabled in its stage Output
properties.
[Figure: a stage running sequentially linked to a stage running in parallel; the partitioner is indicated by
a “fan-out” icon on the link.]
Collectors combine parallel partitions of a single link for sequential processing. Collectors only exist
before stages running sequentially and when the previous stage is running in parallel, and are indicated
by a “fan-in” icon as shown in this example:
[Figure: a stage running in parallel linked to a stage running sequentially; the collector is indicated by
the “fan-in” icon on the link.]
This section provides an overview of partitioning and collecting methods, and provides guidelines for
appropriate use in job designs. It also provides tips for monitoring jobs running in parallel.
- Keyed partitioning examines the data values in one or more key columns, ensuring that records
with the same values in those key column(s) are assigned to the same partition. Keyed partitioning
is used when business rules (for example, Remove Duplicates) or stage requirements (for example,
Join) require processing on groups of related records.
Within the Designer canvas, links with Auto partitioning are drawn with the following link icon:
Auto partitioning is designed to allow the beginning DataStage developer to construct simple data
flows without having to understand the details of parallel design principles. However, the partitioning
method may not necessarily be the most efficient from an overall job perspective.
Furthermore, the ability for the Enterprise Edition engine to determine the appropriate partitioning
method depends on the information available to it. In general, Auto partitioning will ensure correct
results when using built-in stages. However, since the Enterprise Edition engine has no visibility into
user-specified logic (such as Transformer or BuildOp stages) it may be necessary to explicitly specify a
partitioning method for some stages. For example, if the logic defined in a Transformer stage is based
on a group of related records, then a keyed partitioning method must be specified to achieve correct
results.
The “Preserve Partitioning” flag is an internal “hint” that Auto partitioning uses to attempt to preserve
carefully ordered data (for example, on the output of a parallel Sort). This flag is set automatically by
The Preserve Partitioning flag is part of the Data Set structure, and its state is stored in persistent Data
Sets.
There are some cases when the input stage requirements prevent partitioning from being preserved. In
these instances, if the Preserve Partitioning flag was set, a warning will be placed in the Director log
indicating that Enterprise Edition was unable to preserve partitioning for a specified stage.
Same partitioning doesn’t move data between partitions (or, in the case of a cluster or Grid, between
servers), and is appropriate when trying to preserve the grouping of a previous operation (for example,
a parallel Sort).
Within the Designer canvas, links that have been specified with Same partitioning are drawn with a
“horizontal line” partitioning icon:
It is important to understand the impact of Same partitioning in a given data flow. Because Same
does not redistribute existing partitions, the degree of parallelism remains unchanged:
If you read a parallel Data Set with Same partitioning, the downstream stage runs with the degree of
parallelism used to create the Data Set, regardless of the current $APT_CONFIG_FILE setting.
Since the random partition number must be calculated, Random partitioning has a slightly higher
overhead than Round Robin partitioning.
While in theory Random partitioning is not subject to regular data patterns that might exist in the
source data, it is rarely used in real-world data flows.
If the source data values are evenly distributed within these key column(s), and there are a large
number of unique values, then the resulting partitions will be of relatively equal size.
Partition 0:
ID  LName  FName   Address
5   Dodge  Horace  17840 Jefferson
6   Dodge  John    75 Boston Boulevard

Partition 1:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore
Also note that in this example the number of unique values will limit the degree of parallelism,
regardless of the actual number of nodes in the parallel configuration file.
Using the same source Data Set, hash partitioning on the key columns LName and FName yields the
following distribution with a 4-node configuration file:
Partition 0:
ID  LName  FName  Address
2   Ford   Clara  66 Edison Avenue
8   Ford   Clara  4901 Evergreen

Partition 1:
ID  LName  FName   Address
3   Ford   Edsel   7900 Jefferson
5   Dodge  Horace  17840 Jefferson
9   Ford   Edsel   1100 Lakeshore

Partition 2:
ID  LName  FName    Address
4   Ford   Eleanor  7900 Jefferson
6   Dodge  John     75 Boston Boulevard
10  Ford   Eleanor  1100 Lakeshore

Partition 3:
ID  LName  FName  Address
1   Ford   Henry  66 Edison Avenue
7   Ford   Henry  4901 Evergreen
In this example, the key column combination of LName and FName yields improved data distribution
and a greater degree of parallelism.
Also note that only the unique combination of key column values appear in the same partition when
used for hash partitioning. When using hash partitioning on a composite key (more than one key
column), individual key column values have no significance for partition assignment.
Like hash, the partition size of modulus partitioning will be equally distributed as long as the data
values in the key column are equally distributed.
The “read twice” penalty of Range partitioning limits its use to specific scenarios, typically where
the incoming data values and distribution are consistent over time. In these instances, the Range
Map file can be re-used.
It is important to note that if the data distribution changes without recreating the Range Map, partition
balance will be skewed, defeating the intention of Range partitioning. Also, if new data values are
processed outside of the range of a given Range Map, these rows will be assigned to either the first or
the last partition, depending on the value.
In another scenario to avoid, if the incoming Data Set is sequential and ordered on the key column(s),
Range partitioning will result in sequential processing.
DB2 partitioning can only be specified for target DB2/UDB Enterprise stages. To maintain partitioning
on data read from a DB2/UDB Enterprise stage, use Same partitioning on the input to downstream
stages.
This information is detailed in the parallel job score, which is output to the Director job log when the
environment variable APT_DUMP_SCORE is set to True. Specific details on interpreting the parallel
job score can be found in Section 12.4.2: Understanding the Parallel Job Score.
Partitions are assigned numbers, starting at zero. The partition number is appended to the stage name
for messages written to the Director log, as shown in the example log below where the stage named
“Peek” is running with four degrees of parallelism (partition numbers 0 through 3):
To display row counts per partition in the Director Job Monitor window, right-click anywhere in the
window, and select the “Show Instances” option, as illustrated below. This is very useful in
determining the distribution across parallel partitions (skew). In this instance, the stage named “Sort_3”
is running across four partitions (“x 4” next to the stage name), and each stage is processing an equal
number (12,500) of rows for an optimal balanced workload.
Setting the environment variable APT_RECORD_COUNTS will output the row count per link per
partition to the Director log as each stage/node completes processing, as illustrated below:
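As a sketch of one way to enable these diagnostics globally in a development environment (appending
to $DSHOME/dsenv is standard practice, but most projects instead add these as job-level environment
variable parameters in Designer or Administrator so they affect only selected runs):

    # appended to $DSHOME/dsenv -- affects every job run by this engine instance
    APT_DUMP_SCORE=True;     export APT_DUMP_SCORE
    APT_RECORD_COUNTS=True;  export APT_RECORD_COUNTS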
Finally, the “Data Set Management” tool (available in the Tools menu of Designer, Director, or
Manager) can be used to identify the degree of parallelism and number of rows per partition for an
existing persistent Data Set, as shown below:
From the command line, the orchadmin utility on the DataStage server can also be used to examine a
given parallel Data Set, as sketched below.
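For example (a sketch only; the configuration file and Data Set pathnames are hypothetical, and
orchadmin provides additional sub-commands for dumping and deleting Data Sets, documented in the
Orchestrate manuals):

    $ export APT_CONFIG_FILE=/opt/ds/config/4node.apt
    $ orchadmin describe /data/staging/customer_daily.ds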
Objective 1: Choose a partitioning method that gives close to an equal number of rows in
each partition, while minimizing overhead.
This ensures that the processing workload is evenly balanced, minimizing overall run time.
Any stage that processes groups of related records (generally using one or more key columns)
must be partitioned using a keyed partition method.
This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge,
Remove Duplicates, and Sort stages. It may also be necessary for Transformers and BuildOps
that process groups of related records.
Note that in satisfying the requirements of this second objective, it may not be possible to
choose a partitioning method that gives close to an equal number of rows in each partition.
Using the above objectives as a guide, the following methodology can be applied:
a) Start with Auto partitioning (the default)
b) Specify Hash partitioning for stages that require groups of related records
o Specify only the key column(s) that are necessary for correct grouping as long as the
number of unique values is sufficient
o Use Modulus partitioning if the grouping is on a single integer key column
o Use Range partitioning if the data is highly skewed and the key column values and
distribution do not change significantly over time (Range Map can be reused)
c) If grouping is not required, use Round Robin partitioning to redistribute data equally across
all partitions
o Especially useful if the input Data Set is highly skewed or sequential
d) Use Same partitioning to optimize end-to-end partitioning and to minimize repartitioning
o Being mindful that Same partitioning retains the degree of parallelism of the
upstream stage
o Within a flow, examine up-stream partitioning and sort order and attempt to preserve
for down-stream processing. This may require re-examining key column usage
within stages and re-ordering stages within a flow (if business requirements permit).
o Across jobs, persistent Data Sets can be used to retain the partitioning and sort order.
This is particularly useful if downstream jobs are run with the same degree of
parallelism (configuration file) and require the same partition and sort order.
However, on closer inspection, the partitioning and sorting of this scenario can be optimized. Because
the Join and Aggregator use the same partition keys and sort order, we can move the Hash partition and
Sort before the Copy stage, and apply Same partitioning to the downstream links, as shown below:
This example will be revisited in the Sorting discussion because there is one final step necessary to
optimize the sorting in this example.
Although Hash partitioning guarantees correct results for stages that require groupings of related
records, it is not always the most efficient solution, depending on the business requirements. Although
functionally correct, the above solution has one serious limitation. Remembering that the degree of
parallel operation is limited by the number of distinct values, the single value join column will assign
all rows to a single partition, resulting in sequential processing.
To optimize partitioning, consider that the single header row is really a form of reference data. An
optimized solution would be to alter the partitioning for the input links to the Join stage:
- Use Round Robin partitioning on the detail input to evenly distribute rows across all partitions
- Use Entire partitioning on the header input to copy the single header row to all partitions
Because we are joining on a single value, there is no need to pre-sort the input to the Join, so we will
revisit this in the Sorting discussion.
In order to process a large number of detail records, the link order of the Inner Join is significant.
The Join stage operates by reading a single row from the Left input and reading all rows from the Right
input that match the key value(s). For this reason, the link order in this example should be set so that
the single header row is assigned to the Right input, and the detail rows are assigned to the Left input as
shown in the following illustration:
For advanced users, there is one further detail in this example. Because the Join will wait until it
receives an End of Group (new key value) or End of Data (no more rows on the input Data Set) from
the Right input, the detail rows in the Left input will buffer to disk to prevent a deadlock. (See Section
12.3: Minimizing Runtime Processes and Resource Requirements). Changing the output derivation on
the header row to a series of numbers instead of a constant value will establish the End of Group and
prevent buffering to disk.
However, there is a specialized example where the Round Robin collector may be appropriate.
Consider an example where data is read sequentially and passed to a Round Robin partitioner:
[Figure: a sequential input, read into a stage running in parallel through a Round Robin partitioner, and
then collected back to a sequential output.]
Assuming the data is not repartitioned within the job flow and that the number of rows is not reduced
(for example, through aggregation), then a Round Robin collector can be used before the final
Sequential output to reconstruct a sequential output stream in the same order as the input data stream.
This is because the Round Robin collector reads from partitions in the same order in which a
Round Robin partitioner assigns rows to parallel partitions.
Ordered collectors are generally only useful if the input Data Set has been Sorted and Range
partitioned on the same key column(s). In this scenario, an Ordered collector will generate a sequential
stream in sort order.
6 Sorting
Traditionally, the process of sorting data uses one primary key column and, optionally, one or more
secondary key column(s) to generate a sequential, ordered result set. The order of key columns
determines the sequence and groupings in the result set. Each column is specified with an ascending or
descending sort order. This is the method SQL databases use for an ORDER BY clause, as
illustrated in the following example, sorting on primary key LName (ascending), secondary key
FName (descending):
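As a representative sketch only, against a hypothetical Customers table holding the ID, LName,
FName, and Address columns used in the partitioning examples, the equivalent SQL would be:

    SELECT ID, LName, FName, Address
    FROM Customers
    ORDER BY LName ASC, FName DESC;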
However, in most cases there is no need to globally sort data to produce a single sequence of rows.
Instead, sorting is most often needed to establish order within specified groups of data. This sort can be
done in parallel.
For example, the Remove Duplicates stage selects either the first or last row from each group of an
input Data Set sorted by one or more key columns. Other stages (for example, Sort Aggregator, Change
Capture, Change Apply, Join, Merge) require pre-sorted groups of related records.
NOTE: By definition, when data is re-partitioned, sort order is not maintained. To restore row
order and groupings, a sort is required after repartitioning.
In the following example, the previous input Data Set is partitioned on LName and FName columns.
Given a 4-node configuration file, we would see the following results:
Partition 0:
ID  LName  FName  Address
2   Ford   Clara  66 Edison Avenue
8   Ford   Clara  4901 Evergreen

Partition 1:
ID  LName  FName   Address
3   Ford   Edsel   7900 Jefferson
5   Dodge  Horace  17840 Jefferson
9   Ford   Edsel   1100 Lakeshore

Partition 2:
ID  LName  FName    Address
4   Ford   Eleanor  7900 Jefferson
6   Dodge  John     75 Boston Boulevard
10  Ford   Eleanor  1100 Lakeshore

Partition 3:
ID  LName  FName  Address
1   Ford   Henry  66 Edison Avenue
7   Ford   Henry  4901 Evergreen
Applying a parallel sort to this partitioned input Data Set, using the primary key column LName
(ascending) and secondary key column FName (descending) would generate the resulting Data Set:
Partition 0:
ID  LName  FName  Address
2   Ford   Clara  66 Edison Avenue
8   Ford   Clara  4901 Evergreen

Partition 1:
ID  LName  FName   Address
5   Dodge  Horace  17840 Jefferson
3   Ford   Edsel   7900 Jefferson
9   Ford   Edsel   1100 Lakeshore

Partition 2:
ID  LName  FName    Address
6   Dodge  John     75 Boston Boulevard
4   Ford   Eleanor  7900 Jefferson
10  Ford   Eleanor  1100 Lakeshore

Partition 3:
ID  LName  FName  Address
1   Ford   Henry  66 Edison Avenue
7   Ford   Henry  4901 Evergreen
Note that the partition and sort keys do not always have to match.
For example, secondary sort keys can be used to establish order within a group for selection with the
Remove Duplicates stage (which can specify First or Last duplicate to retain). Let’s say that an input
Data Set consists of order history based on CustID and Order Date. Using Remove Duplicates, we want
to select the most recent order for a given customer.
12.4.2 Sorting and Hashing Advanced Example provides a more detailed discussion and example of
partitioning and sorting.
The Link sort offers fewer options, but is easier to maintain in a DataStage job, as there are fewer
stages on the design canvas. The stand-alone sort offers more options, but as a separate stage makes job
maintenance slightly more complicated.
In general, use the Link sort unless a specific option is needed on the stand-alone Stage. Most often,
the standalone Sort stage is used to specify the Sort Key mode for partial sorts.
Additional properties can be specified by right-clicking on the key column as shown in the following
illustration:
Of the options only available in the standalone Sort stage, the Sort Key Mode is most frequently used.
NOTE: The Sort Utility option is an artifact of previous releases. Always specify “DataStage”
Sort Utility, which is significantly faster than a “UNIX” sort.
It is important to note that, by default, the Stable sort option is disabled for sorts on a link and enabled
in the standalone Sort stage.
6.5 Sub-Sorts
Within the standalone Sort stage, the key column property “Sort Key Mode” is a particularly powerful
feature and a significant performance optimization. It is used when resorting a sub-grouping of a
previously sorted input Data Set, instead of performing a complete Sort. This “subsort” uses
significantly less disk space and CPU resource, and can often be performed in memory (depending on
the size of the new subsort groups).
To successfully perform a subsort, keys with “Don’t Sort (Previously Sorted)” property must be at the
top of the list, without gaps between them. And, the key column order for these keys must match the
key columns and order defined in the previously-sorted input Data Set.
If the input data does not match the key column definition for a subsort, the job will abort.
Typically, Enterprise Edition inserts sorts before any stage that requires matched key values or ordered
groupings (Join, Merge, Remove Duplicates, Sort Aggregator). Sorts are only inserted automatically
when the flow developer has not explicitly defined an input sort.
Revisiting the partitioning examples in Section 5.4: Partitioning Examples, the environment
variable $APT_SORT_INSERTION_CHECK_ONLY should be set to prevent Enterprise
Edition from inserting unnecessary sorts before the Join stage.
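For example (a sketch; like the other APT_ variables in this document, this is normally added as a
job-level environment variable parameter, and the value True is shown only by convention):

    APT_SORT_INSERTION_CHECK_ONLY=True; export APT_SORT_INSERTION_CHECK_ONLY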
To perform a sort, rows in the input Data Set are read into a memory buffer on each partition. If the sort
operation can be performed in memory (as is often the case with a subsort) then no disk I/O is
performed.
By default, each sort uses 20MB of memory per partition for its memory buffer. This value can be
changed for each standalone Sort stage using the “Restrict Memory Usage” option (the minimum is
1MB/partition). On a global basis, the environment variable APT_TSORT_STRESS_BLOCKSIZE can be
used to specify the size of the memory buffer, in MB, for all
sort operators (link and standalone), overriding any per-sort
specifications.
If the input Data Set cannot fit into the sort memory buffer, then results are temporarily spooled to disk
in the following order:
- scratch disks defined in the current configuration file (APT_CONFIG_FILE) in the “sort” named disk
pool
- scratch disks defined in the current configuration file default disk pool
- the default directory specified by the environment variable TMPDIR
- the directory “/tmp” (on UNIX) or “C:/TMP” (on Windows) if available
The file system configuration and number of scratch disks defined in parallel configuration file can
greatly impact the I/O performance of a parallel sort. Having a greater number of scratch disks for each
node allows the sort to spread I/O across multiple file systems.
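For illustration, a minimal sketch of one node entry in a parallel configuration file that defines both a
scratch disk in the “sort” pool and a scratch disk in the default pool (the host name and pathnames are
hypothetical):

    {
      node "node1"
      {
        fastname "etl_dev_host"
        pools ""
        resource disk "/ibm/ds/data1" {pools ""}
        resource scratchdisk "/ibm/ds/scratch_sort1" {pools "sort"}
        resource scratchdisk "/ibm/ds/scratch1" {pools ""}
      }
    }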
No DS/EE file stage supports “update” of existing records. Some stages (parallel Data Set) support
“Append” to add new records to an existing file, but this is not recommended as it imposes risks for
failure recovery.
However, Data Sets can only be read from and written to using a DataStage parallel job. If data is to be
shared with applications outside of DataStage, another format (for example, a Sequential File) must be
used.
Note that when reading in parallel, input row order is not maintained across readers.
A better option for writing to a set of Sequential Files in parallel is to use the FileSet stage. This will
create a single header file (in text format) and corresponding data files, in parallel, using the format
options specified in the FileSet stage. The FileSet stage will write in parallel.
Note that this method is also useful for External Source and FTP sequential source stages.
• Don’t read from Sequential File using SAME partitioning in the downstream stage! Unless
more than one source file is specified, SAME will read the entire file into a single partition,
making the entire downstream flow run sequentially (unless it is later repartitioned).
• When multiple files are read by a single Sequential File stage (using multiple files, or by using
a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-
ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly
distribute the data in the flow.
Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small
performance penalty associated with increased I/O. It is also important to remember that this setting
will apply to all Sequential File stages in the data flow.
The format of the Schema File, including Sequential File import / export format properties is
documented in the Orchestrate Record Schema manual. Note that this document is required, since the
Import / Export properties used by the Sequential File and Column Import stages are not documented
in the DataStage Parallel Job Developer’s Guide.
• If a field is nullable, you must define the null field value and length in the Nullable section of
the column property. Double-click on the column number in the grid dialog or right mouse click
on the column and select edit column to set these properties.
• When writing fixed-length files from variable-length fields (e.g., Integer, Decimal, Varchar), the
field width and pad string column properties must be set to match the fixed-width of the output
column. Double-click on the column number in the grid dialog to set this column property.
• To display each field value, use the print_field import property. Use caution when specifying
this option as it can generate an enormous amount of detail in the job log. All import and
export properties are listed in the Import/Export Properties chapter of the Orchestrate
Operators Reference.
NOTE: The Complex Flat File stage cannot read from sources with OCCURS DEPENDING
ON clauses. (This is an error in the DataStage documentation.)
When used as a target, the stage allows you to write data to one or more complex flat files. It does not
write to MVS datasets.
8 Transformation Languages
8.1 Transformer Stage
The DataStage Enterprise Edition parallel Transformer stage generates “C” code which is then
compiled into a parallel component. For this reason, it is important to minimize the number of
transformers, and to use other stages (such as Copy) when derivations are not needed.
See Section 3.8.4: Parallel Transformer stages for guidelines on Transformer stage usage.
The parallel Transformer rejects NULL derivation results (including output link constraints) because
the rules for arithmetic and string handling of NULL values are, by definition, undefined. Even if the
target column in an output derivation allows nullable results, the Transformer will reject the row
instead of sending it to the output link(s).
For this reason, if you intend to use a nullable column within a Transformer derivation or output link
constraint, it should be converted from its out-of-band (internal) null representation to an in-band
(specific value) null representation using stage variables or the Modify stage.
For example, the following stage variable expression would convert a null value to a specific empty
string:
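If IsNull(DSLink1.col1) Then "" Else DSLink1.col1
(Here DSLink1.col1 stands for any nullable input column; the expression substitutes an empty string
whenever the incoming value is null, so downstream derivations and constraints never see an
out-of-band null.)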
Note that if an incoming column is only used in an output column mapping, the Transformer will allow
this row to be sent to the output link without being rejected.
For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the
incoming column must be type VarChar before it is evaluated in the Transformer.
Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not
have been flushed from Sequential File (export) buffers, or committed to database tables. It is
important to set the database commit parameters or adjust the Sequential File buffer settings (see
Section 7.3.5: Sequential File (Export) Buffering).
The following rounding modes are available for decimal results:
ceil
Rounds towards positive infinity. Examples: 1.4 -> 2, -1.6 -> -1
floor
Rounds towards negative infinity. Examples: 1.6 -> 1, -1.4 -> -2
round_inf
Rounds or truncates towards the nearest representable value, breaking ties by rounding positive values toward positive infinity and
negative values toward negative infinity.
Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero
Discards any fractional digits to the right of the rightmost supported fractional digit, regardless of sign. For example, if
$APT_DECIMAL_INTERM_SCALE is smaller than the scale of the internal calculation result, round or truncate to the scale size.
Examples: 1.56 -> 1.5, -1.56 -> -1.5
The stage variables and the columns within a link are evaluated in the order in which they are displayed
in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are
displayed.
From this sequence, it can be seen that there are certain constructs that would be inefficient to include
in output column derivations, as they would be evaluated once for every output column that uses them.
Such constructs are:
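• Where the same part of an expression is used in multiple output column derivations. For example,
several output column derivations might each include a test of the same substring, along the lines of:
IF (DSLink1.col[1,3] = "001") THEN ... ELSE ...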
In this case, the evaluation of the substring of DSLINK1.col[1,3] is evaluated for each column
that uses it.
This can be made more efficient by moving the substring calculation into a stage variable. By
doing this, the substring is evaluated just once for every input row. In this case, the stage
variable definition would be:
DSLINK1.col1[1,3]
In fact, this example could be improved further by also moving the string comparison into the
stage variable. The stage variable would be:
IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0
This reduces both the number of substring functions evaluated and string comparisons made in
the Transformer.
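• Where an expression includes a function call that returns a calculated constant value, for example:
Str(" ",20)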
This returns a string of 20 spaces. In this case, the function would be evaluated every time the
column derivation is evaluated. It would be more efficient to calculate the constant value just
once for the whole Transformer.
This can be achieved using stage variables. This function could be moved into a stage variable
derivation; but in this case, the function would still be evaluated once for every input row. The
solution here is to move the function evaluation into the initial value of a stage variable.
A stage variable can be assigned an initial value from the Stage Properties dialog/Variables tab
in the Transformer stage editor. In this case, the variable would have its initial value set to:
Str(“ “,20)
You would then leave the derivation of the stage variable on the main Transformer page empty.
Any expression that previously used this function would be changed to use the stage variable
instead.
The initial value of the stage variable is evaluated just once, before any input rows are
processed. Then, because the derivation expression of the stage variable is empty, it is not re-
evaluated for each input row. Therefore, its value for the whole Transformer processing is
unchanged from the initial value.
In addition to a function value returning a constant value, another example would be part of an
expression such as:
"abc" : "def"
As with the function-call example, this concatenation is evaluated every time the column
derivation is evaluated. Since the subpart of the expression is actually constant, this constant
part of the expression could again be moved into a stage variable, using the initial value setting
to perform the concatenation just once.
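Another common case is an expression that mixes data types, such as:
DSLink1.col1+"1"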
In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it
must be converted from a string to an integer each time the expression is evaluated. The
solution in this case is just to change the constant from a string to an integer:
DSLink1.col1+1
In this example, if DSLINK1.col1 were a string field, then, again, a conversion would be
required every time the expression is evaluated. If this just appeared once in one output column
expression, this would be fine. However, if an input column is used in more than one
expression, where it requires the same type conversion in each expression, then it would be
more efficient to use a stage variable to perform the conversion once. In this case, you would
create, for example, an integer stage variable, specify its derivation to be DSLINK1.col1, and
then use the stage variable in place of DSLink1.col1, where that conversion would have been
required.
It should be noted that when using stage variables to evaluate parts of expressions, the data type
of the stage variable should be set correctly for that context. Otherwise, needless conversions
are required wherever that variable is used.
8.2 Modify Stage
As noted in the previous section, the Output Mapping properties for any parallel stage will generate an
underlying modify for default data type conversions, dropping and renaming columns.
The standalone Modify stage can be used for non-default type conversions (nearly all date and time
conversions are non-default), null conversion, and string trim. The Modify stage uses the syntax of the
underlying modify operator, documented in the Parallel Job Developers Guide as well as the
Orchestrate Operators Reference.
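As an illustration of a non-default conversion (the column names here are hypothetical), a Modify stage
specification that converts a timestamp column to a date column could look like:
orderDate:date = date_from_timestamp(orderTs)
The same destField = conversion(sourceField) pattern applies to the null-handling and string-trim
specifications described below.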
NOTE: The DataStage Parallel Job Developers Guide gives incorrect syntax for converting an
out-of-band null to an in-band null (value) representation.
To convert from an out-of-band null to an in-band null (value) representation within Modify, the syntax
is: destField[:dataType] = make_null(sourceField,value)
where:
- destField is the destination field’s name.
- dataType is its optional data type; use it if you are also converting types.
- sourceField is the source field’s name.
- value is the value of the source field when it is null.
To convert from an in-band null (value) to an out-of-band null within Modify, the syntax is:
destField[:dataType] = handle_null(sourceField,value)
where:
- destField is the destination field’s name.
- dataType is its optional data type; use it if you are also converting types.
- sourceField is the source field’s name
- value is the value you wish to represent a null in the output.
For the out-of-band to in-band conversion, destField is converted from an Orchestrate out-of-band null
to a value of the field’s data type. For a numeric field, value can be a numeric value; for decimal, string,
time, date, and timestamp fields, value can be a string.
You can use this function to remove the characters used to pad variable-length strings when they are
converted to fixed-length strings of greater length. By default, these characters are retained when the
fixed-length string is then converted back to a variable-length string.
The character argument is the character to remove. By default, this is NULL. The value of the direction
and justify arguments can be either begin or end; direction defaults to end, and justify defaults to begin.
Justify has no effect when the target string has variable length.
The following example removes all leading ASCII NULL characters from the beginning of name and
places the remaining characters in an output variable-length string with the same name:
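name:string = string_trim[NULL, begin](name)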
The following example removes all trailing Z characters from color, and left-justifies the resulting hue
fixed-length string:
hue:string[10] = string_trim['Z', end, begin](color)
9 Combining Data
9.1 Lookup vs. Join vs. Merge
The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small
enough to fit into available physical memory. Each lookup reference requires a contiguous block of
shared memory. If the Data Sets are larger than available memory resources, the JOIN or MERGE
stage should be used.
Limit the use of database Sparse Lookups (available in the DB2 Enterprise, Oracle Enterprise, and
ODBC Enterprise stages) to scenarios where the number of input rows is significantly smaller (for
example, 1:100 or more) than the number of reference rows. (see Section 10.1.7: Database Sparse
Lookup vs. Join).
Sparse Lookups may also be appropriate for exception-based processing when the number of
exceptions is a small fraction of the main input data. It is best to test both the Sparse and Normal to see
which actually performs best, and to retest if the relative volumes of data change dramatically.
During an Outer Join, when a match does not occur, the Join stage inserts values into the unmatched
non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links) then Enterprise Edition
will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts “default”
values based on the data type. For example, the default value for an Integer is zero, the default
value for a Varchar is an empty string (“”), and the default value for a Char is a string of
padchar characters equal to the length of the Char column.
For this reason, care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Transformer stage can be used to test for NULL values in unmatched columns.
In most cases, it is best to use a Column Generator to add an ‘indicator’ column, with a constant
value, to each of the inner links and test that column for the constant after you have performed the join.
This isolates your match/no-match logic from any changes in the metadata. This is also handy with
Lookups that have multiple reference links.
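For example (link and column names here are illustrative), if a Column Generator adds a column named
indicator with the constant value 1 to the inner link, the output link constraints after the Join could be:
lnkJoin.indicator = 1      (matched rows)
lnkJoin.indicator <> 1     (unmatched rows)
Because the test is against a generated constant rather than a business column, it continues to work even
if the metadata of the joined tables changes.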
The Sort Aggregation Method should be used when the number of key values is unknown or very
large. Unlike the Hash Aggregator, the Sort Aggregator requires presorted data, but only maintains the
calculations for the current group in memory.
You can also specify that the result of an individual calculation or recalculation is decimal by using the
optional “Decimal Output” sub-property.
Note that performance is typically better if you let calculations occur in floating point (Double) data
type and convert the results to decimal downstream in the flow. An exception to this is financial
calculations which should be done in decimal to preserve appropriate precision.
Note that in this example two Aggregators are used to prevent the sequential aggregation from
disrupting upstream processing.
Because there are exceptions to this rule (especially with Teradata), specific guidelines of when to
use various stage types are provided in the database-specific topics in this section.
Because of their tight integration with database technologies, the native parallel stages often have more
stringent connectivity requirements than plug-in stages. For example, the DB2/UDB Enterprise stage is
only compatible with DB2 Enterprise Server Edition with DPF on the same UNIX platform as the
DataStage server.
Native parallel stages always pre-query the database for actual runtime metadata (column names, types,
attributes). This allows Enterprise Edition to match return columns by name, not position in the stage
Table Definitions. However, care must be taken to assign the correct data types in the job design.
The benefit of the ODBC Enterprise stage comes from the large number of included and third-party ODBC
drivers to enable connectivity to all major database platforms. ODBC also provides an increased level
of “data virtualization” which can be useful when sources and targets (or deployment platforms) can
change.
DataStage Enterprise Edition bundles OEM versions of ODBC drivers from DataDirect. On UNIX, the
DataDirect ODBC Driver Manager is also included. “Wire Protocol” ODBC Drivers generally do not
require database client software to be installed on the server platform.
From a design perspective, plug-in database stages match columns by order, not name, so Table
Definitions must match the order of columns in a query.
The actual metadata used by a DS/EE native parallel database stage is always determined at runtime,
regardless of the table definitions assigned by the DataStage developer. This allows the database stages
to match return values by column name instead of position. However, care must be taken that the
column data types defined by the DataStage developer match the data types generated by the database
stage at runtime. Database-specific data type mapping tables are included in the following sections.
One disadvantage to the graphical orchdbutil metadata import is that the user interface requires each
table to be imported individually.
When importing a large number of tables, it will be easier to use the corresponding orchdbutil
command-line utility from the DataStage server machine. As a command, orchdbutil can be scripted to
automate the process of importing a large number of tables.
For example, SQL along the following lines (the table and column names are illustrative) assigns the
alias Total to a calculated column:
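SELECT customer_id, SUM(order_amount) AS Total
FROM orders
GROUP BY customer_id
Without the alias, the calculated column has no name for the stage to match against the DataStage
column definitions.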
Note that in many cases it may be more appropriate to aggregate using the Enterprise Edition
Aggregator stage. However, there may be cases where user-defined functions or logic need to be
executed on the database server.
The only exception to this rule is when building dynamic database jobs that use runtime column
propagation to process all columns in a source table.
If the connection fails, an error message may appear, and you will be prompted to view additional
detail. Clicking YES will display a detailed dialog box with the specific error messages generated by
the database stage, which can be very useful in debugging a database connection failure.
As a further optimization, a Lookup stage (or Join stage, depending on data volume) can be used to
identify existing rows before they are inserted into the target table.
For example, the OPEN command could be used to create a temporary table, and the CLOSE command
could be used to select all rows from the temporary table and insert into a final target table.
As another example, the OPEN command can be used to create a target table, including database-
specific options (tablespace, logging, constraints, etc) not possible with the “Create” option. In general,
it is not a good idea to let DataStage generate target tables unless they are used for temporary storage.
There are limited capabilities to specify Create table options in the stage, and doing so may violate
data-management (DBA) policies.
Further details are outlined in the respective database sections of the Orchestrate Operators Reference
which is part of the Orchestrate OEM documentation.
When directly connected as the reference link to a Lookup stage, the DB2/UDB Enterprise, ODBC
Enterprise, and Oracle Enterprise stages allow the lookup type to be changed to “Sparse”, sending
individual SQL statements to the reference database for each incoming Lookup row. Sparse Lookup is
only available when the database stage is directly connected to the reference link, with no intermediate
stages.
IMPORTANT: The individual SQL statements required by a “Sparse” Lookup are an expensive
operation from a performance perspective. In most cases, it is faster to use a DataStage JOIN
stage between the input and DB2 reference data than it is to perform a “Sparse” Lookup.
For scenarios where the number of input rows is significantly smaller (for example, 1:100 or more)
than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate.
While there are extreme scenarios when the appropriate technology choice is clearly understood, there
may be “gray areas” where the decision should be made based on factors such as developer
productivity, metadata capture and re-use, and ongoing application maintenance costs.
The following guidelines can assist with the appropriate use of SQL and DataStage technologies in a
given job flow:
• When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to
the DataStage job. This minimizes impact on network and memory resources, and
leverages the database capabilities.
• When combining data from very large tables, or when the source includes a large
number of database tables, the efficiency of the Enterprise Edition Sort and Join stages
can be significantly faster than an equivalent SQL query. In this scenario, it can still be
beneficial to use database filters (WHERE clause) if appropriate.
• Avoid the use of database stored procedures (for example, Oracle PL/SQL) on a per-row
basis within a high-volume data flow. For maximum scalability and parallel
performance, it is best to implement business rules using native parallel DataStage
components.
For specific details on the stage capabilities, consult the DataStage documentation (DataStage Parallel
Job Developers Guide, and DataStage Plug-In guides)
2
It is possible to connect the DB2 UDB stage to a remote database by simply cataloging the
remote database in the local instance and then using it as if it were a local database. This will
only work when the authentication mode of the database on the remote instance is set to “client
authentication”. If you use the stage in this way, you may experience data duplication when
working in partitioned instances since the node configuration of the local instance may not be the
same as the remote instance. For this reason, the “client authentication” configuration of a
remote instance is not recommended.
3
A patched version of the ODBC Enterprise stage allowing parallel read is available from IBM IIS
Support for some platforms. Check with IBM IIS Support for availability.
While DataStage plug-in stages facilitate flexible connectivity to multiple types of remote DB2
database servers, their use will limit overall performance and scalability. Furthermore, when used as
data sources, plug-in stages cannot read from DB2 in parallel.
Using the DB2/UDB API stage or the Dynamic RDBMS stage, it may be possible to write to a DB2
target in parallel, since the DS/EE framework will instantiate multiple copies of these stages to handle
the data that has already been partitioned in the parallel framework. Because each plug-in invocation
will open a separate connection to the same target DB2 database table, the ability to write in parallel
may be limited by the table and index configuration set by the DB2 database administrator.
The DB2/API (plug-in) stage should only be used to read from and write to DB2 databases on non-
UNIX platforms (such as mainframe editions through DB2-Connect). Sparse Lookup is not supported
through the DB2/API stage.
In order to get this configuration to work correctly, you must adhere to all of the directions specified
for connecting to a remote instance AND the following:
• You must not set the APT_DB2INSTANCE_HOME environment variable. Once this variable
is set, it will try to use it for each of the connections in the job. Since a db2nodes.cfg file can
describe only one DB2 instance, a single APT_DB2INSTANCE_HOME setting cannot satisfy
connections to more than one instance.
• In order for DataStage to locate the db2nodes.cfg, you must build a user on the DataStage
server with the same name as the instance you are trying to connect to (the default logic for the
DB2/UDB Enterprise stage is to use the instance’s home directory as defined for the UNIX user
with the same name as the DB2 instance). In the user's UNIX home directory, create a sqllib
subdirectory and place the remote instance’s db2nodes.cfg there. Since the
APT_DB2INSTANCE_HOME is not set, DS will default to this directory to find the
configuration file for the remote instance.
To connect to multiple DB2 instances, we recommend using separate jobs with their respective DB2
environment variable settings, landing intermediate results to a parallel Data Set. Depending on
platform configuration and I/O subsystem performance, separate jobs can communicate through
named pipes, although this incurs the overhead of Sequential File stage (corresponding export/import
operators) which does not run in parallel.
Or, if the data volumes are sufficiently small, DB2 plug-in stages (DB2 API, DB2 Load, Dynamic
RDBMS) may be used to access data in other instances.
When a DB2 column name is not compatible with Enterprise Edition column naming conventions,
Enterprise Edition converts the DB2 column name as follows:
- if the DB2 column name does not begin with a letter or underscore, the string
“APT__column#” (two underscores) is added to the beginning of the column name, where
column# is the number of the column. For example, if the third DB2 column is named 7dig,
the Enterprise Edition column will be named “APT__37dig”
- if the DB2 column name contains a character that is not alphanumeric or an underscore, the
character is replaced by two underscore characters
Table Definitions should be imported into DataStage using orchdbutil to ensure accurate Table
Definitions. The DB2/UDB Enterprise stage converts DB2 data types to Enterprise Edition data types,
as shown in the following table.
If the DATETIME starts with a year component, the result is a timestamp field.
IMPORTANT: DB2 data types that are not listed in the above table cannot be used in the
DB2/UDB Enterprise stage, and will generate an error at runtime
• DB2/UDB Enterprise stage is tightly integrated with the DB2 RDBMS, communicating directly
with each database node, reading from and writing to DB2 in parallel (where appropriate), and
using the same data partitioning as the referenced DB2 tables.
When writing to a DB2 database in parallel, the DB2/UDB Enterprise stage offers the choice of SQL
(insert / update / upsert / delete) or fast DB2 loader methods. The choice between these methods
depends on required performance, database log usage, and recoverability.
a) The Write Method (and corresponding insert / update / upsert / delete) communicates
directly with the DB2 database nodes to execute instructions in parallel. All operations are
logged to the DB2 database log, and the target table(s) may be accessed by other users.
Time and row-based commit intervals determine the transaction size, and the availability of
new rows to other applications.
b) The DB2 Load method requires that the DataStage user running the job have DBADM
privilege on the target DB2 database. During the load operation, the DB2 Load method
places an exclusive lock on the entire DB2 tablespace into which it loads the data and no
other tables in that tablespace can be accessed by other applications until the load
completes. The DB2 load operator performs a non-recoverable load. That is, if the load
operation is terminated before it is completed, the contents of the table are unusable and the
tablespace is left in a load pending state. In this scenario, the DB2 Load DataStage job must
be re-run in Truncate mode to clear the load pending state.
For example, Table T is in tablespace TS and TS is partitioned into 3 partitions on Col1 (limits: F, P, T)
and Col2 (10, 30, 40).
The WHERE clauses that are created to read this table are:
Where Col1 < ‘F’ or (Col1 = ‘F’ and (Col2 < 10 or Col2 = 10))
Where (Col1 > ‘F’ and Col1 < ‘P’) or (Col1 = ‘F’ and Col2 > 10) or (Col1 = ‘P’ and (Col2 < 20 or Col2 = 20))
Where Col1 > ‘T’ or (Col1 = ‘T’ and Col2 > 40)
The method that DataStage/USS Edition uses to write to DB2 UDB on z/OS works differently than the
read process, and is controlled by the number of nodes in the configuration file. Since all write
operations need to go through the DB2 coordinator node on z/OS (this is different than on non-z/OS
platforms), the number of operators does not have to match the number of partitions. This is illustrated
in Figure 47.
On DataStage/USS Edition, Lookups work differently depending on whether the lookup is done
normally (in memory) or using a sparse technique where each lookup is effectively a query to the
database. An example of an in-memory Normal Lookup is shown in Figure 48.
Contrast the Normal Lookup with the way a Sparse Lookup is done as shown in Figure 49, where each
lookup operator is issuing an SQL statement to DB2 for every row it processes. Since each of these queries must
go through the DB2 coordinator node we can effectively ignore the level of parallelism specified for
the table.
Finally, using the DB2 load utility in USS is different from non-z/OS environments. The DB2 LOAD
utility is designed to run from JCL only. In order to invoke it from a DataStage/USS job, we call a
DB2 stored procedure called DSNUTILS. The LOAD utility has a second limitation in that data
cannot be piped into it, nor can it be read in from a USS HFS file. This requires DataStage/USS to
create an MVS flat file to pass to the loader – note that this is the only non-HFS file that DS/USS can
write to. Since there is no sequential file stage associated with this MVS load file, we need to add a
special resource statement in our configuration file to specify the MVS dataset name to use. Figure 50
illustrates the DB2 LOAD process on USS and also shows the format of the special resource statement
used to define the MVS dataset used during the load operation.
If the DATETIME starts with a year component and ends with a month, the
result is a date field.
If the DATETIME starts with a year component, the result is a timestamp field.
IMPORTANT: Informix data types that are not listed in the above table cannot be used in the
Informix Enterprise stage, and will generate an error at runtime
Note that the maximum size of a DataStage record is limited to 32K. If you attempt to read a record
larger than 32K, Enterprise Edition will return an error and abort your job.
4
On some platforms, a patch may be available through IBM IIS Support to support parallel reads
through ODBC. Parallel reads through ODBC match the degree of parallelism in the
$APT_CONFIG_FILE.
Note that the maximum size of a DataStage record is limited to 32K. If you attempt to read a record
larger than 32K, Enterprise Edition will return an error and abort your job.
IMPORTANT: Oracle data types that are not listed in the above table cannot be used in the
Oracle Enterprise stage, and will generate an error at runtime
It is important to note that certain types of queries cannot run in parallel. Examples include:
- queries containing a GROUP BY clause that are also hash partitioned on the same field
The Upsert Write Method can be used to insert rows into a target Oracle table without bypassing
indexes or constraints. In order to automatically generate the SQL required by the Upsert method, the
key column(s) must be identified using the check boxes in the column grid.
IMPORTANT: Sybase data types that are not listed in the above table cannot be used in the
Sybase Enterprise stage, and will generate an error at runtime
For maximum performance of high-volume data flows, the native parallel Teradata Enterprise stage
should be used. Teradata Enterprise uses the programming interface of the Teradata utilities
FastExport (reads) and FastLoad (writes), and is subject to all these utilities’ restrictions.
NOTE: Unlike the FastLoad utility, the Teradata Enterprise stage supports Append mode,
inserting rows into an existing target table. This is done through a shadow “terasync” table.
Teradata has a system-wide limit to the number of concurrent database utilities. Each use of the
Teradata Enterprise stages counts toward this limit.
vargraphic(n) raw[max=n]
IMPORTANT: Teradata data types that are not listed in the above table cannot be used in the
Teradata Enterprise stage, and will generate an error at runtime.
Aggregates and most arithmetic operators are not allowed in the SELECT clause
of a Teradata Enterprise stage.
user=username,password=password[,SessionsPerPlayer=nn][,RequestedSessions=nn]
where SessionsPerPlayer and RequestedSessions are optional connection parameters that are required
when accessing large Teradata databases.
By default, RequestedSessions equals the maximum number of available sessions on the Teradata
instance, but this can be set to a value between 1 and the number of database vprocs.
The SessionsPerPlayer option determines the number of connections each DataStage EE player opens to
Teradata. Indirectly, this determines the number of DataStage players, and hence the number of UNIX
processes and overall system resource requirements of the DataStage job. SessionsPerPlayer should be
set such that:
RequestedSessions = (sessions per player * the number of nodes * players per node)
Setting the SessionsPerPlayer too low on a large system can result in so many players that the job fails
due to insufficient resources. In that case SessionsPerPlayer should be increased, and/or
RequestedSessions should be decreased.
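For example (the numbers are purely illustrative), with RequestedSessions = 120, a 4-node
configuration file, and SessionsPerPlayer = 5, the formula implies 120 / (5 * 4) = 6 players per node,
or 24 player processes in total; doubling SessionsPerPlayer to 10 halves that to 12 players.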
To connect to a Teradata server, you must supply the client with the Teradata Director Program (TDP)
identifier, also known as the tdpid. On a network-attached system, the tdpid is the host name of the
Teradata server. On MVS, the tdpid must be in the form TDPx, where x is 0-9, A-Z (case insensitive),
$, #, or @. The first three characters must be TDP. That leaves 39 possible TDP names and is
different than the convention used for non-channel attached systems.
If the job results are correct with a single-node configuration file, and incorrect with a multi-node
configuration file, the job’s partitioning logic and parallel design concepts (especially within
Transformer stages) should be examined.
$APT_BUFFERING_POLICY=FORCE (in combination with $APT_BUFFER_FREE_RUN): effectively
isolates each operator from slowing upstream production. Using the job monitor performance statistics,
this can identify which part of a job flow is impacting overall performance. Note that setting
$APT_BUFFERING_POLICY=FORCE is not recommended for production job runs.
$DS_PX_DEBUG 1 Set this environment variable to capture copies of the
job score, generated osh, and internal Enterprise
Edition log messages in a directory corresponding to
the job name. This directory will be created in the
“Debugging” sub-directory of the Project home
directory on the DataStage server.
$APT_PM_STARTUP_CONCURRENCY 5 This environment variable should not normally need to
be set. When trying to start very large jobs on heavily-
loaded servers, lowering this number will limit the
number of processes that are simultaneously created
when a job is started.
Check the Director job log for warnings. These may indicate an underlying logic problem or
unexpected data type conversion. When a fatal error occurs, the log entry is sometimes
preceded by a warning condition.
All fatal and warning messages should be addressed before attempting to debug, tune, or
promote a job from development into test or production. In some instances, it may not be
possible to remove all warning messages generated by the EE engine. But all warnings should
be examined and understood.
Enable the Job Monitoring Environment Variables detailed in Section 2.5.1: Environment
Variable Settings and the DataStage Parallel Job Advanced Developers Guide.
Use the Data Set Management tool (available in the Tools menu of DataStage Designer or
DataStage Manager) to examine the schema, look at row counts, and to manage source or target
Parallel Data Sets.
Set the environment variable $DS_PX_DEBUG to capture copies of the job score, generated
osh, and internal Enterprise Edition log messages in a directory corresponding to the job name.
This directory will be created in the “Debugging” sub-directory of the Project home directory
on the DataStage server.
Use $OSH_PRINT_SCHEMAS to verify that the job's runtime schemas match what the job
developer expected in the design-time column definitions. This will place entries in the Director
log with the actual runtime schema for every link using Enterprise Edition internal data types.
NOTE: For large jobs, it is possible for $OSH_PRINT_SCHEMAS to generate a log entry
that is too large for DataStage Director to store or display. To capture the full schema output in
these cases, enable both $OSH_PRINT_SCHEMAS and $DS_PX_DEBUG.
Examine the score dump (placed in the Director log when $APT_DUMP_SCORE is enabled).
o To display the contents of a file in hexadecimal and ASCII, use the UNIX command
od -xc -Ax [filename]
o To display the number of lines and characters in a specified ASCII text file, use the UNIX
command
wc -lc [filename]
Dividing the total number of characters by the number of lines provides an audit to ensure all rows
are the same length.
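For example (the figures are illustrative), if wc -lc reports 1,000 lines and 81,000 characters, each row
works out to 81 characters (80 data characters plus the newline); a quotient that differs from the
expected record length indicates that at least one row has a different length.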
NOTE: The wc command counts UNIX line delimiters, so if the file has any binary columns,
this count may be incorrect. It is also not useful for files of non-delimited fixed-length
record format.
To enable viewing of generated OSH, it must be enabled for a given project within the Administrator
client.
Once this option has been enabled for a given project, the generated OSH tab will appear in the Job
Properties dialog box.
When attempting to understand an Enterprise Edition flow, the first task is to examine the score dump
which is generated when you set APT_DUMP_SCORE=TRUE in your environment. A score dump
includes a variety of information about a flow, including how composite operators and shared
containers break down; where data is repartitioned and how it is repartitioned; which operators, if any,
have been inserted by EE; what degree of parallelism each operator runs with; and exactly which nodes
each operator runs on. Also available is some information about where data may be buffered.
The following score dump shows a flow with a single Data Set, which has a hash partitioner that
partitions on key field a. It shows three stages: Generator, Sort (tsort) and Peek. The Peek and Sort
stages are combined; that is, they have been optimized into the same process. All stages in this flow are
running on one node. The job runs 3 processes on 2 nodes.
lemond.torrent.com[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(tsort)
(peek)
)on nodes (
lemond.torrent.com[op1,p0]
lemond.torrent.com[op1,p1]
)}
More details on interpreting the parallel job score can be found in 12.4.2 Understanding the Parallel
Job Score.
This section provides tips for designing a job for optimal performance, and for optimizing the
performance of a given data flow using various settings and features within DataStage Enterprise
Edition.
Overall job design can be the most significant factor in data flow performance. This section outlines
performance-related tips that can be followed when building a parallel data flow using DataStage
Enterprise Edition.
a) Use Parallel Data Sets to land intermediate results between parallel jobs.
• Parallel Data Sets retain data partitioning and sort order, in the DS/EE native internal
format, facilitating end-to-end parallelism across job boundaries.
• Data Sets can only be read by other DS/EE parallel jobs (or the orchadmin command
line utility). If you need to share information with external applications, File Sets
facilitate parallel I/O at the expense of exporting to a specified file format.
• Lookup File Sets can be used to store reference data used in subsequent jobs. They
maintain reference data in DS/EE internal format, pre-indexed. However, Lookup File
Sets can only be used on reference links to a Lookup stage. There are no utilities for
examining data within a Lookup File Set.
• Avoid using the BASIC Transformer, especially in large-volume data flows. External
user-defined functions can expand the capabilities of the parallel Transformer.
At runtime, DataStage Enterprise Edition analyzes a given job design and uses the parallel
configuration file to build a job score which defines the processes and connection topology (Data Sets)
between them used to execute the job logic.
When composing the score, Enterprise Edition attempts to reduce the number of processes by
combining the logic from 2 or more stages (operators) into a single process (per partition). Combined
operators are generally adjacent to each other in a data flow.
The purpose behind operator combination is to reduce the overhead associated with an increased
process count. If two processes are interdependent (one processes the other’s output) and they are both
CPU-bound or I/O-bound, there is nothing to be gained from pipeline partitioning5.
5
One exception to this guideline is when operator combination generates too few processes to
keep the processors busy. In these configurations, disabling operator combination allows CPU
activity to be spread across multiple processors instead of being constrained to a single processor.
As with any example, your results should be tested in your environment.
When deciding which operators to include in a particular combined operator (a.k.a. Combined Operator
Controller), Enterprise Edition is ‘greedy’ - it will include all operators that meet the following rules:
o Must be contiguous
o Must run with the same degree of parallelism
o Must be ‘Combinable’; the following is a partial list of non-combinable operators:
Join
Aggregator
Remove Duplicates
Merge
BufferOp
Funnel
DB2 Enterprise Stage
Oracle Enterprise Stage
ODBC Enterprise Stage
BuildOps
In general, it is best to let DSEE decide what to combine and what to leave uncombined. However,
when other performance tuning measures have been applied and still greater performance is needed,
tuning combination might yield additional performance benefits.
The job score identifies what components are combined. (For information on interpreting a job score
dump, see 12.4.2: Understanding the Parallel Job Score in this document.) In addition, if the “%CPU”
column is displayed in a Job Monitor window in Director, combined stages are indicated by parenthesis
surrounding the % CPU, as shown in the following illustration:
In fact, in the above job design, the I/O-intensive FileSet is combined with a CPU-intensive
Transformer. Disabling combination with the Transformer enables pipeline partitioning, and improves
performance, as shown in this subsequent Job Monitor for the same job:
The architecture of Enterprise Edition is well suited for processing massive volumes of data in parallel
across available resources. Toward that end, DS/EE executes a given job across the resources defined
in the specified configuration file.
There are times, however, when it is appropriate to minimize the resource requirements for a given
scenario, for example:
Batch jobs that process a small volume of data
Real-time jobs that process data in small message units
Environments running a large number of jobs simultaneously on the same server(s)
In these instances, a single-node configuration file is often appropriate to minimize job startup time and
resource requirements without significantly impacting overall performance (a minimal single-node configuration file is sketched after the list below).
There are many factors that can reduce the number of processes generated at runtime:
Use a single-node configuration file
Remove ALL partitioners and collectors (especially when using a single-node configuration
file)
Enable runtime column propagation on Copy stages with only one input and one output
Minimize join structures (any stage with more than one input, such as Join, Lookup, Merge,
Funnel)
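As a sketch only, a minimal single-node configuration file might look like the following; the node name, fastname, and resource/scratchdisk paths are placeholders that must match your own server and file systems:
    {
        node "node1"
        {
            fastname "etl_server_name"
            pools ""
            resource disk "/ds/work/data" {pools ""}
            resource scratchdisk "/ds/work/scratch" {pools ""}
        }
    }
A job run with this file creates one partition per operator, which keeps the process count and startup time low for small-volume batch or real-time jobs.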
There are two types of buffering in Enterprise Edition: 'inter-operator transport' and 'deadlock
prevention'.
Inter-operator transport buffering moves records between adjacent operators using a pair of transport
blocks per partition of each link. The first block is used by the upstream/producer stage to write records
it has finished processing. The second block is used by the downstream/consumer stage to obtain data
that is ready for the next processing step. Once the upstream block is full and the downstream block is
empty, the blocks are swapped and the process begins again.
This type of buffering (or 'record blocking') rarely needs tuning. It usually comes into play only when the
size of a single record exceeds the default size of the transport block; in that case, setting
APT_DEFAULT_TRANSPORT_BLOCK_SIZE to a multiple of (or equal to) the record size resolves
the problem (see the example following the variable descriptions below). Remember that there are two
of these transport blocks for each partition of each link, so setting this value too high results in a large
amount of memory consumption.
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
o Specifies the default block size for transferring data between players. The default value
is 8192, with a valid range of 8192 to 1048576. If necessary, the value provided by a
user for this variable is rounded up to the operating system's nearest page size.
APT_MIN_TRANSPORT_BLOCK_SIZE
o Specifies the minimum allowable block size for transferring data between players. The
default is 8192; the value cannot be less than 8192 or greater than 1048576. This
variable is only meaningful when used in combination with
APT_LATENCY_COEFFICIENT, APT_AUTO_TRANSPORT_BLOCK_SIZE, and
APT_MAX_TRANSPORT_BLOCK_SIZE.
APT_MAX_TRANSPORT_BLOCK_SIZE
o Specifies the maximum allowable block size for transferring data between players; it
cannot be greater than 1048576 and, like APT_MIN_TRANSPORT_BLOCK_SIZE, is
only meaningful when used in combination with APT_LATENCY_COEFFICIENT and
APT_AUTO_TRANSPORT_BLOCK_SIZE.
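For example (a sketch only; the value shown is illustrative and assumes a record of roughly 100 KB):
    # Illustrative only: raise the transport block size so that one very wide record
    # fits in a single block; the value is rounded up to the nearest OS page size.
    export APT_DEFAULT_TRANSPORT_BLOCK_SIZE=131072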
The second type of buffering, deadlock prevention, protects fork-join flows, in which a stage splits its
output onto two or more parallel paths that are later brought back together by a stage with multiple
inputs (not necessarily a Join stage).
[Figure: a fork-join flow in which a Transformer feeds both Aggregator1 and Aggregator2, and both
Aggregators feed a Join. Each stage has a queued read or write request: the Transformer is waiting to
write to Aggregator1, Aggregator1 is waiting to write to Join, Join is waiting to read from Aggregator2,
and Aggregator2 is waiting to read from the Transformer.]
This scenario creates a circular dependency: Transformer is waiting on Aggregator1, Aggregator1 is
waiting on Join, Join is waiting on Aggregator2, and Aggregator2 is waiting on Transformer. Without
deadlock buffering, the job would deadlock, bringing processing to a halt (though the job does not stop
running, it would eventually time out).
To guarantee that this problem never happens in Enterprise Edition, there is a specialized operator
called BufferOp. BufferOp is always ready to read or write and will not allow a read/write request to be
queued. It is placed on all inputs to a join structure (again, not necessarily a Join stage) by Enterprise
Edition during job startup. So the above job structure would be altered by the DS/EE engine to look
like this:
[Figure: the same fork-join flow with BufferOp1 inserted between Aggregator1 and Join, and
BufferOp2 inserted between Aggregator2 and Join.]
Since BufferOp is always ready to read or write, Join cannot be ‘stuck’ waiting to read from either of
its inputs, thus breaking the circular dependency and guaranteeing no deadlock will occur.
BufferOps will also be placed on the input partitions to any sequential stage that is fed by a parallel
stage, as these same types of circular dependencies can result from partition-wise Fork-Joins.
By default, BufferOps will allocate 3MB of memory each (remember that this is per operator, per
partition). When that is full (because the upstream operator is still writing but the downstream operator
isn’t ready to accept that data yet) it will begin to flush data to the scratchdisk resources specified in the
configuration file (detailed in Chapter 11, “The Parallel Engine Configuration File” of the DataStage
Manager guide).
TIP: For very wide rows, it may be necessary to increase the default buffer size
(APT_BUFFER_MAXIMUM_MEMORY) to hold more rows in memory.
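As a sketch only, the per-buffer memory limit can be raised for a run that handles very wide rows; the value below simply reuses the example figure from the performance-tuning variable table later in this document and is not a recommendation:
    # Illustrative only: allow each deadlock-prevention buffer (per operator, per
    # partition) to hold roughly 40 MB in memory before spilling to scratch disk.
    export APT_BUFFER_MAXIMUM_MEMORY=41903040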
The behavior of deadlock-prevention BufferOps can be tuned through these environment variables:
APT_BUFFER_DISK_WRITE_INCREMENT
o Controls the “blocksize” written to disk as the memory buffer fills. Default 1 MB. May
not exceed 2/3 of APT_BUFFER_MAXIMUM_MEMORY.
APT_BUFFER_FREE_RUN
o Maximum capacity of the buffer operator before it starts to offer resistance to incoming
flow, expressed as a nonnegative (proper or improper) fraction of
APT_BUFFER_MAXIMUM_MEMORY. Values greater than 1 indicate that the buffer
operator will keep accepting data (spilling to disk) beyond its in-memory capacity
before exerting pushback on the upstream stage.
APT_BUFFER_MAXIMUM_MEMORY
o Maximum memory consumed by each buffer operator for data storage. Default is 3 MB.
APT_BUFFERING_POLICY
o Specifies the buffering policy for the entire job. When it is not defined or defined to be
the null string, the default buffering policy is AUTOMATIC_BUFFERING. Valid
settings are:
AUTOMATIC_BUFFERING: buffer as necessary to prevent dataflow
deadlocks
FORCE_BUFFERING: buffer all virtual Data Sets
NO_BUFFERING: inhibit buffering on all virtual Data Sets
WARNING: Inappropriately specifying NO_BUFFERING can cause dataflow deadlock during job
execution; use of this setting is recommended only for advanced users!
Additionally, the buffer mode, buffer size, buffer free run, queue bound, and write increment can be set
on a per-stage basis from the Input/Advanced tab of the stage properties.
Aside from ensuring that no deadlock occurs, BufferOps also have the effect of "smoothing out"
production/consumption spikes. This allows the job to run at the highest rate possible even when a
downstream stage is momentarily slower to consume data than its upstream stage is to produce it.
Choosing which stages to tune buffering for and which to leave alone is as much art as science,
and buffer tuning should be considered among the last resorts for performance tuning. Stages
upstream/downstream from high-latency stages (such as remote databases or NFS mount points used for
data storage) are a good place to start. If that doesn't yield enough of a performance boost (remember
to test iteratively, changing only one thing at a time), then setting the buffering policy to
FORCE_BUFFERING will cause buffering to occur everywhere.
By using the performance statistics in conjunction with this buffering, you may be able to identify points
in the data flow where a downstream stage is waiting on an upstream stage to produce data. Each such
point may offer an opportunity for buffer tuning.
As implied above, when a buffer has consumed its RAM, it will ask the upstream stage to “slow down”
- this is called “pushback”. Because of this, if you do not have force buffering set and
APT_BUFFER_FREE_RUN set to at least ~1000, you cannot determine that any one stage is waiting
on any other stage, as some other stage far downstream could be responsible for cascading pushback all
the way upstream to the place you are seeing the bottleneck.
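A sketch of a diagnostic (non-production) run combining the settings discussed above to locate a bottleneck, using the FORCE_BUFFERING setting and the example free-run value from the table later in this document:
    # Diagnostic run only (not for production): buffer every link and let buffers
    # grow far beyond their in-memory limit so pushback does not cascade upstream.
    export APT_BUFFERING_POLICY=FORCE_BUFFERING
    export APT_BUFFER_FREE_RUN=1000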
1. Standards
It is important to establish and follow consistent standards in:
Directory structures for install and application support directories. An example directory
naming structure is given in Section 2.1: Directory Structures.
Naming conventions, especially for DataStage Project categories, stage names, and links.
An example DataStage naming structure is given in Section 2.2: Naming Conventions.
All DataStage jobs should be documented with the Short Description field, as well as Annotation fields.
See Section 2.3: Documentation and Annotation.
It is the DataStage developer’s responsibility to make personal backups of their work on their local
workstation, using the Manager DSX export capability. This can also be used for integration with
source code control systems. See Section 2.4: Working with Source Code Control Systems.
2. Development Guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and
components, as outlined in Section 3: Development Guidelines:
Job parameterization allows a single job design to process similar logic instead of creating
multiple copies of the same job. The Multiple-Instance job property allows multiple invocations
of the same job to run simultaneously.
A set of standard job parameters should be used in DataStage jobs for source and target
database parameters (DSN, user, password, etc.) and for directories where files are stored. To ease
re-use, these standard parameters and settings should be made part of a Designer Job Template.
Create a standard directory structure outside of the DataStage project directory for source and
target files, intermediate work files, and so forth.
Where possible, create re-usable components such as parallel shared containers to encapsulate
frequently-used logic.
Job Parameters should always be used for file paths, file names, and database login settings. The scope of
a parameter is discussed further in Section 3.5: Job Parameters.
Standardized Error Handling routines should be followed to capture errors and rejects. Further details
are provided in Section 3.7: Error and Reject Record Handling.
3. Component Usage
As discussed in Section 3.8: Component Usage, the following guidelines should be followed when
constructing parallel jobs in DS/EE:
Never use Server Edition components (BASIC Transformer, Server Shared Containers) within
a parallel job. BASIC Routines are appropriate only for job control sequences.
Always use parallel Data Sets for intermediate storage between jobs.
Use the Copy stage as a placeholder for iterative design, and to facilitate default type
conversions.
Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch
stages.
Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.
5. Partitioning Data
Given the numerous options for keyless and keyed partitioning, the following objectives help to form a
methodology for assigning partitioning:
Objective 1: Choose a partitioning method that gives close to an equal number of rows in
each partition, while minimizing overhead.
This ensures that the processing workload is evenly balanced, minimizing overall run time.
Objective 2: The partition method must match the business requirements and stage
functional requirements, assigning related records to the same partition if required
Any stage that processes groups of related records (generally using one or more key columns)
must be partitioned using a keyed partition method.
Note that in satisfying the requirements of this second objective, it may not be possible to
choose a partitioning method that gives close to an equal number of rows in each partition.
Using the above objectives as a guide, the following methodology can be applied:
a) Start with Auto partitioning (the default)
b) Specify Hash partitioning for stages that require groups of related records
o Specify only the key column(s) that are necessary for correct grouping as long as the
number of unique values is sufficient
o Use Modulus partitioning if the grouping is on a single integer key column
o Use Range partitioning if the data is highly skewed and the key column values and
distribution do not change significantly over time (Range Map can be reused)
c) If grouping is not required, use Round Robin partitioning to redistribute data equally across
all partitions
o Especially useful if the input Data Set is highly skewed or sequential
d) Use Same partitioning to optimize end-to-end partitioning and to minimize repartitioning
o Being mindful that Same partitioning retains the degree of parallelism of the
upstream stage
o Within a flow, examine up-stream partitioning and sort order and attempt to preserve
for down-stream processing. This may require re-examining key column usage
within stages and re-ordering stages within a flow (if business requirements permit).
Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is
particularly useful if downstream jobs are run with the same degree of parallelism (configuration file)
and require the same partition and sort order.
Further details on Partitioning methods can be found in Section 5: Partitioning and Collecting.
6. Collecting Data
Given the options for collecting data into a sequential stream, the following guidelines form a
methodology for choosing the appropriate collector type:
a) When output order does not matter, use the Auto collector (the default)
Further details on Collecting methods can be found in Section 5: Partitioning and Collecting.
7. Sorting
Using the rules and behavior outlined in Section 6: Sorting, the following methodology should be
applied when sorting in a DataStage Enterprise Edition data flow:
8. Stage-Specific Guidelines
As discussed in Section 8.1.1: Transformer NULL Handling and Reject Link, precautions
must be taken when using expressions or derivations on nullable columns within the parallel
Transformer:
o Always convert nullable columns to in-band values before using them in an expression
or derivation.
o Always place a reject link on a parallel Transformer to capture / audit possible rejects.
The Lookup stage is most appropriate when reference data is small enough to fit into
available memory. If the Data Sets are larger than available memory resources, use the Join
or Merge stage. See Section 9.1: Lookup vs. Join vs. Merge.
The ODBC Enterprise stage should only be used when a native parallel stage is not available
for the given source or target database.
When using Oracle, DB2, or Informix databases, use orchdbutil to properly import design
metadata.
Care must be taken to observe the data type mappings documented in Section 10: Database
Stage Guidelines when designing a parallel job with DS/EE.
If possible, use a SQL where clause to limit the number of rows sent to a DataStage job.
Avoid the use of database stored procedures on a per-row basis within a high-volume data flow.
For maximum scalability and parallel performance, it is best to implement business rules
natively using DataStage parallel components.
Restructure Stages
Column Export CExp
Column Import CImp
At runtime, Enterprise Edition uses the given job design and configuration file to compose a job score,
which details the processes created, the degree of parallelism and node (server) assignments, and the
interconnects (Data Sets) between them. Similar to the way a parallel database optimizer builds a query
plan, the DS/EE job score also determines how operators map to operating system processes: where
possible, multiple operators are combined within a single process to improve performance and optimize
resource requirements.
As shown in the illustration below, job score entries start with the phrase “main_program: This step has
n datasets…” Two separate scores are written to the log for each job run. The first score is from the
license operator, not the actual job, and can be ignored. The second score entry is the actual job score.
Note that the number of virtual Data Sets and the degree of
parallelism determine the amount of memory used by the inter-operator transport buffers. The memory
used by deadlock-prevention BufferOps can be calculated based on the number of inserted BufferOps.
The degree of parallelism is identified in brackets after the operator name. For example, in the example
on the right, operator zero (op0) is running sequentially, with 1 degree of parallelism [1p]. Operator 1
(op1) is running in parallel with 4 degrees of parallelism [4p].
Within the Data Set definition, the upstream producer is identified first, followed by a notation to
indicate the type of partitioning or collecting (if any), followed by the downstream consumer.
Producer  ->  [partitioner] [collector]  ->  Consumer
The notation between producer and consumer reports the type of partitioning or collecting (if any) that
is applied: the partition type is the first term, the collector type the second. The symbol between the
partition name and the collector name indicates how the two stages are connected. For example, in the
score dumps later in this Appendix, 'eSame=>eCollectAny' indicates that Same partitioning is preserved
between parallel producer and consumer, while 'eOther(APT_HashPartitioner { ... })#>eCollectAny'
indicates that the data is repartitioned by hash before reaching the consumer.
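For illustration, here is one Data Set entry taken from the solution 1 score dump later in this Appendix, annotated to show the producer, the partitioner/collector notation, and the consumer:
    ds1: {op1[3p] (parallel DistributeCopiesOfSourceDta)        <- producer (3 partitions)
          eOther(APT_HashPartitioner { key={ value=ItemID },
          key={ value=TransactionDate }
          })#>eCollectAny                                       <- partitioner '#' collector
          op2[3p] (parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt)}   <- consumer (3 partitions)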
Some stages are composite operators – to the DataStage developer, a composite operator appears to be
a single stage on the design canvas. But internally, a composite operator includes more than one
function.
For example, Lookup is a composite operator. It is composed of the following internal operators:
- APT_LUTCreateImpl: reads the reference data into memory
- APT_LUTProcessImpl: performs the actual lookup processing once the reference data has been loaded
At runtime, each individual component of a composite operator is represented as an individual operator
in the job score, as shown in the following score fragment:
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
on nodes (
ecc3671[op2,p0]
)}
op3[4p] {(parallel buffer(0))
on nodes (
ecc3671[op3,p0]
ecc3672[op3,p1]
ecc3673[op3,p2]
ecc3674[op3,p3]
)}
op4[4p] {(parallel APT_CombinedOperatorController:
(APT_LUTProcessImpl in Lookup_3)
(APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
(PeekNull)
) on nodes (
ecc3671[op4,p0]
ecc3672[op4,p1]
ecc3673[op4,p2]
ecc3674[op4,p3]
)}
In a similar way, a persistent Data Set defined to "Overwrite" an existing Data Set of the same name
will have multiple entries in the job score to:
- Delete Data Files
- Delete Descriptor File
as shown in this score fragment:
main_program: This step has 2 datasets:
ds0: {op1[1p] (parallel delete data files in delete temp.ds)
->eCollectAny
op2[1p] (sequential delete descriptor file in delete temp.ds)}
ds1: {op0[1p] (sequential Row_Generator_0)
->
temp.ds}
It has 3 operators:
op0[1p] {(sequential Row_Generator_0)
on nodes (
node1[op0,p0]
)}
op1[1p] {(parallel delete data files in delete temp.ds)
on nodes (
node1[op1,p0]
)}
op2[1p] {(sequential delete descriptor file in delete temp.ds)
on nodes (
node1[op2,p0]
)}
Using the internal DataStage Enterprise Edition C++ libraries, the method
APT_Record::estimateFinalOutputSize() can give you an estimate for a given record schema, as can
APT_Transfer::getTransferBufferSize() if you have a transfer that moves all fields from input to
output.
NOTE: The environment variable settings in this Appendix are only examples. Set values that are
optimal for your environment.
$APT_DELIMITED_READ_SIZE [bytes] Specifies the number of bytes the Sequential
File (import) stage reads ahead to find the
next delimiter. The default is 500 bytes, but
this can be set as low as 2 bytes.
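For example (the value is illustrative only, not a recommendation):
    # Illustrative only: shrink the Sequential File read-ahead used to locate the
    # next record delimiter when records are known to be short.
    export APT_DELIMITED_READ_SIZE=100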
4. Informix
Environment Variables
Environment Variable Setting Description
$INFORMIXDIR [path] Specifies the Informix install directory.
$INFORMIXSQLHOSTS [filepath] Specifies the path to the Informix sqlhosts file.
$INFORMIXSERVER [name] Specifies the name of the Informix server
matching an entry in the sqlhosts file.
$APT_COMMIT_INTERVAL [rows] Specifies the commit interval in rows for
Informix HPL Loads. The default is 10000 per
partition.
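A command-line sketch of these settings; the directory, file, and server names below are placeholders, not recommendations:
    export INFORMIXDIR=/opt/informix
    export INFORMIXSQLHOSTS=$INFORMIXDIR/etc/sqlhosts
    export INFORMIXSERVER=ifx_server1
    export APT_COMMIT_INTERVAL=10000    # rows per commit, per partition, for HPL loads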
$DS_ENABLE_RESERVED_CHAR_CONVERT 1
Allows DataStage plug-in stages to handle
Oracle databases which use the special
characters # and $ in column names.
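For example:
    # Enable handling of # and $ in column names for plug-in stages.
    export DS_ENABLE_RESERVED_CHAR_CONVERT=1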
8. Performance-Tuning Environment Variables
Environment Variable Setting Description
$APT_BUFFER_MAXIMUM_MEMORY 41903040 (example) Specifies the maximum amount of virtual memory, in
bytes, used per buffer per partition. If not set, the default is 3 MB (3145728). Setting this value higher
will use more memory, depending on the job flow, but may improve performance.
$APT_BUFFER_FREE_RUN 1000 (example) Specifies how much of the available in-memory buffer
to consume before the buffer offers resistance to any new data being written to it. If not set, the default
is 0.5 (50% of $APT_BUFFER_MAXIMUM_MEMORY).
$APT_BUFFERING_POLICY FORCE Using $APT_BUFFERING_POLICY=FORCE in combination
with $APT_BUFFER_FREE_RUN effectively isolates each operator from slowing upstream production.
Using the job monitor performance statistics, this can identify which part of a job flow is impacting
overall performance. (Setting $APT_BUFFERING_POLICY=FORCE is not recommended for
production job runs.)
$DS_PX_DEBUG 1 Set this environment variable to capture copies of the
job score, generated osh, and internal Enterprise
Edition log messages in a directory corresponding to
the job name. This directory will be created in the
“Debugging” sub-directory of the Project home
directory on the DataStage server.
$APT_PM_STARTUP_CONCURRENCY 5 This environment variable should not normally need to
be set. When trying to start very large jobs on heavily-
loaded servers, lowering this number will limit the
number of processes that are simultaneously created
when a job is started.
$APT_PM_NODE_TIMEOUT [seconds] For heavily loaded MPP or clustered environments, this
variable determines the number of seconds the
conductor node will wait for a successful startup from
each section leader. The default is 30 seconds.
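A sketch for starting very large jobs in a heavily loaded MPP or clustered environment; the timeout value is illustrative only (the default is 30 seconds):
    # Illustrative only: limit simultaneous process creation at job start and give
    # section leaders more time to report back to the conductor node.
    export APT_PM_STARTUP_CONCURRENCY=5
    export APT_PM_NODE_TIMEOUT=120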
Hash and Sort/Join/Merge on exactly the same keys, in the same order.
This approach is guaranteed to work, but it is frequently inefficient because records are 'over-hashed' and
'over-partitioned'. There is also an "advanced" rule: you may hash-partition on any subset of the
aggregation/join/sort keys, which is explored in the scenario that follows.
This Appendix contains descriptions of what happens “behind the scenes”. It will be followed by a
detailed example that discusses these ideas in much greater depth. If you have a lot of experience with
hashing and sorting, this may be review for you. The second portion of this Appendix assumes you
have read and thoroughly understand these concepts.
b) Hash gathers
Hash gathers into the same partition rows, from all partitions, that share the same value in the key
columns. This creates partition-wise concurrency (a.k.a. partition-wise co-location): related rows
are in the same partition at the same time, but other rows may separate them within that partition.
Remove Duplicates requires only (i): when it completes processing all the rows in a key cluster, it
does not care about the key value of the next cluster with respect to the current key value, in part
because this stage takes only one input. (An illustrative piece of information, but there is little you
can do with it directly.)
Join and Merge, on the other hand, require both (i) and (ii). This is due, in part, to the fact that these
stages take multiple input links: they only see 2 records at a timei, one from each input, so row order
between the two inputs is obviously critical. When such a stage completes processing all the rows in a
key cluster, it DOES care about the key value of the next cluster. If the values on both inputs aren't
ordered (versus merely grouped/clustered, as for Remove Duplicates), Join/Merge can't effectively choose which input
stream to advance to find subsequent key matches.
Partitioners (and most other stages) do not gratuitously alter existing intra-partition row order. In other
words, Enterprise Edition will not, as a rule, allow a row in a partition to jump ahead of another row in
the same partitionii. Nonetheless, any existing sort order is usually destroyed (see the example below).
To restore row-adjacency, a sort operation is needed even on previously sorted columns following any
partitioner.
There is a component that allows you to partition sorted data and achieve a sorted result:
parallelsortmerge (PSM). Enterprise Edition itself normally manages the use of this component, but it
can be invoked via the generic stage. Whenever you re-partition your sorted data, follow the partitioner
with a PSM, and your data will retain its previous sort order. See the usage notes in the footnotesiii.
1. Inside a partitioner
In Enterprise Edition, partitioners, like stages, work in parallel, for example:
Repartitioning:
[Figure: rows 1, 2, 3, 101, 102, 103 shown in partitions P0 and P1 before and after repartitioning; note
that '1' and '101' have switched partitions.]
There is more than one way to correctly hash-partition any Data Set6.
Here is another possible outcome:
Also: Consider the result of running the same job with the same data, but a different number of
partitions.
[Figure: example rows (Orlando Jones, Rose Jones, Boris Smith, Adam Smith, John Zorn, Eve Smith)
distributed across Partition 0 and Partition 1, illustrating another valid hash-partitioning outcome for
the same data.]
6
There is an exception to this rule: If your hash key has only one value.
Scenario Description:
Our customer is a national retail business with several
hundred outlets nation-wide. They wish to determine the weighted average transaction amount per-
item nation-wide, as well as the average transaction amount per-item, per store for all stores in the
nation, and append these values to the original data. This would make it possible to determine how well
each store is doing in relation to the national averages and track these performance trends over time.
There are many common extensions to this kind of sales-metric gathering that take the following
ideas and increase the scale of the problem at hand, thereby increasing the value of this exercise.
The screen capture below shows how to implement the business logic in an efficient manner, taking
advantage of Enterprise Edition’s ability to analyze a jobflow and insert sorts and partitioners in
appropriate places (notice that there is only one sort and one re-partition in the diagram, both on the
output link of JoinSourceToAggregator_1):
The Aggregator stage NationalAverageItemTransactionAmt will aggregate the data on 'Item ID' and
'Transaction Date', calculating the average of the 'TransactionAmt' column, and place the results in a
column named 'National Average Item Transaction Amt'. This is the nation-wide transaction average
per item (weighted by transaction, not store). To do this, Enterprise Edition will hash-partition and sort
on 'Item ID' and 'Transaction Date'.
The Aggregator StoreAverageItemTransactionAmt will aggregate the data on ‘Store ID’, ‘Item ID’, and
‘Transaction Date’, and calculate the average of the ‘Transaction Amt’ column and place the results in
a column named ‘Store Average Item Transaction Amt’. This is the per-store transaction average per
item. Here, Enterprise Edition will hash-partition and sort on ‘Store ID’, ‘Item ID’, and ‘Transaction
Date’.
Since the aggregator reduces the row count (to the number of groups), we will need to join each aggregator's
output back to the original data in order to restore the original row count. This is done with
JoinSourceToAggregator_1 and JoinSourceToAggregator_2, which produce the original data with the
averages appended.
7
To enable automatic partition insertion, ensure that the environment variable APT_NO_PART_INSERTION
is NOT defined in your environment, with ANY value. If you allow this environment variable to exist
with any value, you will disable this facility.
To enable automatic sort insertion, ensure that the environment variable APT_NO_SORT_INSERTION is NOT
defined in your environment, with ANY value. If you allow this environment variable to exist with any
value, you will disable this facility.
Here you want to let DS/EE choose where to insert sorts and partitioners for you, so you want to leave
them enabled (the default).
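A quick command-line sketch for verifying that neither facility has been disabled:
    # Should print nothing; any value at all disables the corresponding insertion.
    env | grep -E 'APT_NO_PART_INSERTION|APT_NO_SORT_INSERTION'
    # If either variable is set, remove it from the environment before running the job.
    unset APT_NO_PART_INSERTION
    unset APT_NO_SORT_INSERTION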
(Output continued; columns are Store Location, Item ID, Transaction Date, Transaction Amt, National
Average Item Transaction Amt, Store Average Item Transaction Amt)
2 1 2004/01/04 8 16.25 26.5
2 1 2004/01/04 45 16.25 26.5
1 2 2004/01/01 45 44.25 27
1 2 2004/01/01 9 44.25 27
2 2 2004/01/01 78 44.25 61.5
2 2 2004/01/01 45 44.25 61.5
1 2 2004/01/04 2 38.75 33.5
1 2 2004/01/04 65 38.75 33.5
2 2 2004/01/04 65 38.75 44
2 2 2004/01/04 23 38.75 44
Since both the Aggregator and Join expect the data to arrive hashed and sorted on the grouping key(s),
and both operations consume large amounts of CPU, a few questions arise with respect to
efficiency:
What is the minimum number of hash-partitioners needed to implement this solution correctly?
What is the minimum number of sorts needed to implement this solution?
What is the minimum number of times that sort will need to buffer the entire Data Set to disk to
implement this solution?
Though running the job sequentially eliminates the questions related to partitioners, even sequential job
execution does not alter the answers to the sort-related questions, as only partition concurrency is
affected by sequential execution; that is, record adjacency assumes partition concurrency8.
An examination of the job above would suggest: 6, 6, and 6. A deeper examination (of the score dump,
appended to the end of this document for masochistsiv) might suggest: 4, 3, and 3. This is certainly an
improvement on the previous answer.
A much better answer is: 1, 3, and 1. Here's a screen shot of this more efficientv solution (score dump
also attached belowvi):
8
Records cannot be adjacent if they are not in the same partition.
In our initial copy stage (DistributeCopiesOfSourceData), we hash and sort on ItemID and
TransactionDate only. Hashing on these fields will gather all unique combinations of ItemID and
TransactionDate into the same partition. This combination of hash and sort adequately prepares the
data for NationalAverageItemTransactionAmt, just as in the previous example.
However, the data is not properly prepared for StoreAverageItemTransactionAmt. What is wrong with
the data? The sort order does not include the StoreLocation. This is a problem for
StoreAverageItemTransactionAmt, as it expects all of the records for a particular
StoreLocation/TransactionDate/ItemId combination to arrive on the same partition, in order.
You may be wondering why the partitioning wasn't mentioned as part of the problem. This is because
the data is already partitioned in a manner compatible with this aggregator. The 'advanced' rule for hash
partitioning is: you may partition on any sub-set of the aggregation/join/sort keysviii (that footnote
contains key concepts that this document addresses, but it is a lengthy parenthetical that would
interrupt the flow of the scenario discussion).
However, we still need to fix the sort order. One would expect that we would need to sort on
StoreLocation, TransactionDate, and ItemID, but we know that the data is already sorted on
ItemID and TransactionDate. Sort offers an efficiency mode for pre-sorted data, but you must use the
standalone Sort stage to access it, as it isn't available on the link sort.
As you can see, we have instructed the sort stage that the data is already sorted on ItemID and
TransactionDate (as always with sorting records, key order is very important), and we only want to
‘sub-sort’ the data on StoreLocation (this option is only viable for situations where you need to
maintain the sort order on the initial keys). This lets sort know that it only needs to gather all records
with a unique combination of ItemID and TransactionDate in order to sort a batch of records, instead of
buffering the entire Data Set. If the group size was only several hundred records, but the entire Data Set
was 100 million records, this would save a tremendous amount of very expensive disk I/O as sort can
hold a few hundred records in memory in most cases (disk I/O is typically several orders of magnitude
more costly than memory I/O, even for ‘fast’ disks).
Also worth noting here: because we already hashed the data on ItemID and TransactionDate, all rows
sharing a given combination of those keys (whatever the values of the remaining columns) are already in
the same partition, which is what makes this sub-sort possible without re-partitioning (re-partitioning is
also quite expensive, especially in MPP environments where it implies network I/O).
The previous two paragraphs contain two key concepts in Enterprise Edition (pun fully intended,
however dreadful).
Getting back to the aggregators: since the aggregator does not need to disturb row order (for pre-sorted
data), the rows will come out in the same order they went in (different rows, granted, but the group
keys will force the proper order). This means that the outputs of DistributeCopiesOfSourceData and
NationalAverageItemTransactionAmt are already hashed and sorted on the keys needed to perform
JoinSourceToAggregator_1. This accomplishes the first goal: to append a column representing the
national (weighted) average item transaction amount.
The output of StoreAverageItemTransactionAmt contains the other column we need to append to our
source rows. However, since we sub-sorted the data before this aggregator (unlike
NationalAverageItemTransactionAmt), we will have to prep the output from the first join to account for
the difference in sort order by sub-sorting it on StoreLocation before JoinSourceToAggregator_2.
Remember to disable the 'Stable Sort' option if you do not need it. Stable sort tries to maintain row order
except as needed to perform the sort (useful for preserving a previous sort ordering), but it is much more
expensive than a non-stable sort, and it is enabled by default.
Output from above solution:
Data Set 3:
PeekFinalOutput, Partition 1: 16 Rows
Columns: Store Location, Item ID, Transaction Date, Transaction Amt, National Average Item
Transaction Amt, Store Average Item Transaction Amt
1 1 2004/01/01 3 3.5 2
1 1 2004/01/01 1 3.5 2
1 1 2004/01/04 5 16.25 6
1 1 2004/01/04 7 16.25 6
2 1 2004/01/01 3 3.5 5
2 1 2004/01/01 7 3.5 5
2 1 2004/01/04 45 16.25 26.5
2 1 2004/01/04 8 16.25 26.5
1 2 2004/01/01 9 44.25 27
1 2 2004/01/01 45 44.25 27
2 2 2004/01/01 45 44.25 61.5
2 2 2004/01/01 78 44.25 61.5
1 2 2004/01/04 2 38.75 33.5
1 2 2004/01/04 65 38.75 33.5
2 2 2004/01/04 23 38.75 44
2 2 2004/01/04 65 38.75 44
This solution produces the same result but is achieved with only one complete sort, a single partitioner,
and two sub-sorts—a much more efficient solution for large data volumes.
Imagine a job with 100 million records as the input. With the initial solution, we had to sort (on disk)
300,000,000 records in addition to hashing 300,000,000 records. The second solution only sorts (on
disk) 100,000,000 records and only hashes 100,000,000 records: a savings of 400,000,000
record movements (half of them involving disk I/O) for a 100-million-record input volume. That is a
LOT of saved processing power.
There is an even more efficient solution. It looks very similar to the first solution, but there is a critical
difference.
This looks a lot like solution 1, except without the sort on the output of JoinSourceToAggregator_1.
The difference is on DistributeCopiesOfSourceData:
Here, we have chosen to use the StoreLocation column as a part of our sorting key, but NOT to use it
for hashing. This is functionally equivalent to doing a sub-sort right before the
StoreAverageItemTransactionAmt aggregator. However it will not create additional processes to
handle the records and re-order them. This is a potentially huge savings on large data volumes
(remember the previous example). Also, the second sort on the output of
JoinSourceToAggregator_1 is no longer needed, for the same reasons. Here is the output from this version of
the job.
Comparing the efficiency of this solution with that of number two, we saved a sub-sort on 100 million
records - a significant savings.
Data Set 4:
PeekFinalOutput, Partition 1: 16 Rows
Columns: Store Location, Item ID, Transaction Date, Transaction Amt, National Average Item
Transaction Amt, Store Average Item Transaction Amt
1 1 2004/01/01 1 3.5 2
1 1 2004/01/01 3 3.5 2
2 1 2004/01/01 7 3.5 5
2 1 2004/01/01 3 3.5 5
1 1 2004/01/04 5 16.25 6
1 1 2004/01/04 7 16.25 6
2 1 2004/01/04 8 16.25 26.5
2 1 2004/01/04 45 16.25 26.5
1 2 2004/01/01 45 44.25 27
1 2 2004/01/01 9 44.25 27
2 2 2004/01/01 78 44.25 61.5
2 2 2004/01/01 45 44.25 61.5
1 2 2004/01/04 2 38.75 33.5
1 2 2004/01/04 65 38.75 33.5
2 2 2004/01/04 23 38.75 44
2 2 2004/01/04 65 38.75 44
Finally, in addition to the heavy penalty paid in disk I/O for using a full sort, sort, by definition,
inhibits pipelining (by buffering large amounts of data to disk, since it needs to see all data before it
can determine the resulting sorted sequence)ix.
As you can see, although ~5 million records have entered the sort, no rows have left yet. This is
because a standard sort requires all rows to be present before it can release the first row, requiring a large
amount of scratch diskx. This situation is analogous to all of the sorts in solution 1, the link sort in
solution 2, and the link sort in solution 3.
Here, you can clearly see that a sub-sort does not inhibit pipelining: very nearly the same number of
rows have entered and left the sort stage (and NO buffering is required to perform the sub-sort). This
allows downstream stages to process data during the sorting process instead of waiting until all
40 million records have been sorted (in this instance, we are sub-sorting the data we sorted in the
previous diagram).
i
This is an over-simplification; it’s only true for cases where the key is unique. In other cases, Join
needs to see all of the rows in the current cluster on at least one of the input links; otherwise a
Cartesian product is impossible.
ii
A common problem: suppose you have two (or more) presorted datasets with differing partition counts
and you wish to join/merge them. At least one of these datasets must be re-hashed, which would result in
having to completely re-sort that dataset despite already having a sorted version. This 'problem' is
addressed by the parallelsortmerge componentiii.
Another common problem: You need to Hash and Sort on columns A and C to implement the business logic in
one section of the flow, but in another section you need to hash and sort on columns A and B. You could
hash only on A, but suppose that A has too small a number of unique values (country codes, gender
codes, race/gender/ethnicity codes are typical). This would allow you to combine other columns into
your hash key to reduce data-skew, but not introduce superfluous sorts.
A third, less common, problem: you created a fileset with 8 nodes, but the job that reads it has only 4
nodes. Normally EE would re-partition the data into 4 nodes and destroy your sort order. However, you can
use the ParallelSortMerge stage to ensure that, no matter the degree of parallelism of the writer and
reader, the sort order will be preserved.
There are other situations where this is valuable but they are much less common.
iii
iv
Dump score for solution 1
main_program: This step has 16 datasets:
ds0: {op0[3p] (parallel SourceData.DSLink2)
eAny=>eCollectAny
op1[3p] (parallel DistributeCopiesOfSourceDta)}
ds1: {op1[3p] (parallel DistributeCopiesOfSourceDta)
eOther(APT_HashPartitioner { key={ value=ItemID },
key={ value=TransactionDate }
})#>eCollectAny
op2[3p] (parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt)}
ds2: {op1[3p] (parallel DistributeCopiesOfSourceDta)
eOther(APT_HashPartitioner { key={ value=ItemID },
key={ value=TransactionDate },
key={ value=StoreLocation }
})#>eCollectAny
op3[3p] (parallel APT_HashedGroup2Operator in StoreAverageItemTransactionAmt)}
ds3: {op1[3p] (parallel DistributeCopiesOfSourceDta)
eOther(APT_HashPartitioner { key={ value=ItemID },
key={ value=TransactionDate }
})#>eCollectAny
op4[3p] (parallel APT_TSortOperator(0))}
ds4: {op2[3p] (parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt)
[pp] eSame=>eCollectAny
op5[3p] (parallel APT_TSortOperator(1))}
ds5: {op3[3p] (parallel APT_HashedGroup2Operator in StoreAverageItemTransactionAmt)
[pp] eSame=>eCollectAny
op8[3p] (parallel APT_TSortOperator(2))}
ds6: {op4[3p] (parallel APT_TSortOperator(0))
[pp] eSame=>eCollectAny
op6[3p] (parallel buffer(0))}
ds7: {op5[3p] (parallel APT_TSortOperator(1))
[pp] eSame=>eCollectAny
op7[3p] (parallel buffer(1))}
ds8: {op6[3p] (parallel buffer(0))
[pp] eSame=>eCollectAny
op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)}
ds9: {op7[3p] (parallel buffer(1))
[pp] eSame=>eCollectAny
op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)}
ds10: {op8[3p] (parallel APT_TSortOperator(2))
[pp] eSame=>eCollectAny
op12[3p] (parallel buffer(3))}
ds11: {op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)
[pp] eOther(APT_HashPartitioner { key={ value=ItemID,
subArgs={ cs }
},
key={ value=TransactionDate },
key={ value=StoreLocation,
subArgs={ cs }
}
})#>eCollectAny
op10[3p] (parallel JoinSourceToAggregator_2.DSLink18_Sort)}
ds12: {op10[3p] (parallel JoinSourceToAggregator_2.DSLink18_Sort)
[pp] eSame=>eCollectAny
op11[3p] (parallel buffer(2))}
ds13: {op11[3p] (parallel buffer(2))
[pp] eSame=>eCollectAny
op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)}
ds14: {op12[3p] (parallel buffer(3))
[pp] eSame=>eCollectAny
op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)}
ds15: {op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)
[pp] eSame=>eCollectAny
op14[3p] (parallel PeekFinalOutput)}
It has 15 operators:
op0[3p] {(parallel SourceData.DSLink2)
on nodes (
node1[op0,p0]
node1[op0,p1]
node1[op0,p2]
)}
op1[3p] {(parallel DistributeCopiesOfSourceDta)
on nodes (
node1[op1,p0]
node1[op1,p1]
node1[op1,p2]
)}
op2[3p] {(parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt)
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
)}
op3[3p] {(parallel APT_HashedGroup2Operator in StoreAverageItemTransactionAmt)
on nodes (
node1[op3,p0]
node2[op3,p1]
node3[op3,p2]
)}
op4[3p] {(parallel APT_TSortOperator(0))
on nodes (
node1[op4,p0]
node2[op4,p1]
node3[op4,p2]
)}
op5[3p] {(parallel APT_TSortOperator(1))
on nodes (
node1[op5,p0]
node2[op5,p1]
node3[op5,p2]
)}
op6[3p] {(parallel buffer(0))
on nodes (
node1[op6,p0]
node2[op6,p1]
node3[op6,p2]
)}
op7[3p] {(parallel buffer(1))
on nodes (
node1[op7,p0]
node2[op7,p1]
node3[op7,p2]
)}
op8[3p] {(parallel APT_TSortOperator(2))
on nodes (
node1[op8,p0]
node2[op8,p1]
node3[op8,p2]
)}
op9[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_1)
on nodes (
node1[op9,p0]
node2[op9,p1]
node3[op9,p2]
)}
op10[3p] {(parallel JoinSourceToAggregator_2.DSLink18_Sort)
on nodes (
node1[op10,p0]
node2[op10,p1]
node3[op10,p2]
)}
op11[3p] {(parallel buffer(2))
on nodes (
node1[op11,p0]
node2[op11,p1]
node3[op11,p2]
)}
op12[3p] {(parallel buffer(3))
on nodes (
node1[op12,p0]
node2[op12,p1]
node3[op12,p2]
)}
op13[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_2)
on nodes (
node1[op13,p0]
node2[op13,p1]
node3[op13,p2]
)}
op14[3p] {(parallel PeekFinalOutput)
on nodes (
node1[op14,p0]
node2[op14,p1]
node3[op14,p2]
)}
It runs 45 processes on 3 nodes.
v
Throughout this document, the general meaning of the phrase ‘more efficient’ is fewer record
movements; that is, a record changes partition, or order, fewer times. Because moving records around costs
CPU time and extra system calls, moving records unnecessarily adversely affects run time.
vi
Dump Score for Solution 2
main_program: This step has 15 datasets:
ds0: {op0[3p] (parallel SourceData.DSLink2)
eOther(APT_HashPartitioner { key={ value=ItemID,
subArgs={ cs }
},
key={ value=TransactionDate }
})#>eCollectAny
op1[3p] (parallel DistributeCopiesOfSourceDta.DSLink2_Sort)}
ds1: {op1[3p] (parallel DistributeCopiesOfSourceDta.DSLink2_Sort)
[pp] eSame=>eCollectAny
op2[3p] (parallel DistributeCopiesOfSourceDta)}
ds2: {op2[3p] (parallel DistributeCopiesOfSourceDta)
[pp] eSame=>eCollectAny
op3[3p] (parallel SubSortOnStoreLocation)}
ds3: {op2[3p] (parallel DistributeCopiesOfSourceDta)
[pp] eSame=>eCollectAny
op4[3p] (parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt)}
ds4: {op2[3p] (parallel DistributeCopiesOfSourceDta)
[pp] eSame=>eCollectAny
op6[3p] (parallel buffer(0))}
ds5: {op3[3p] (parallel SubSortOnStoreLocation)
[pp] eSame=>eCollectAny
op5[3p] (parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt)}
ds6: {op4[3p] (parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt)
[pp] eSame=>eCollectAny
op7[3p] (parallel buffer(1))}
ds7: {op5[3p] (parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt)
[pp] eSame=>eCollectAny
op8[3p] (parallel buffer(2))}
ds8: {op6[3p] (parallel buffer(0))
[pp] eSame=>eCollectAny
op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)}
ds9: {op7[3p] (parallel buffer(1))
[pp] eSame=>eCollectAny
op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)}
ds10: {op8[3p] (parallel buffer(2))
[pp] eSame=>eCollectAny
op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)}
ds11: {op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)
[pp] eSame=>eCollectAny
op10[3p] (parallel SubSortOnStoreLocation2)}
ds12: {op10[3p] (parallel SubSortOnStoreLocation2)
[pp] eSame=>eCollectAny
op11[3p] (parallel buffer(3))}
ds13: {op11[3p] (parallel buffer(3))
[pp] eSame=>eCollectAny
op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)}
ds14: {op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)
[pp] eSame=>eCollectAny
op13[3p] (parallel PeekFinalOutput)}
It has 14 operators:
op0[3p] {(parallel SourceData.DSLink2)
on nodes (
node1[op0,p0]
node1[op0,p1]
node1[op0,p2]
)}
op1[3p] {(parallel DistributeCopiesOfSourceDta.DSLink2_Sort)
on nodes (
node1[op1,p0]
node2[op1,p1]
node3[op1,p2]
)}
op2[3p] {(parallel DistributeCopiesOfSourceDta)
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
)}
op3[3p] {(parallel SubSortOnStoreLocation)
on nodes (
node1[op3,p0]
node2[op3,p1]
node3[op3,p2]
)}
op4[3p] {(parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt)
on nodes (
node1[op4,p0]
node2[op4,p1]
node3[op4,p2]
)}
op5[3p] {(parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt)
on nodes (
node1[op5,p0]
node2[op5,p1]
node3[op5,p2]
)}
op6[3p] {(parallel buffer(0))
on nodes (
node1[op6,p0]
node2[op6,p1]
node3[op6,p2]
)}
op7[3p] {(parallel buffer(1))
on nodes (
node1[op7,p0]
node2[op7,p1]
node3[op7,p2]
)}
op8[3p] {(parallel buffer(2))
on nodes (
node1[op8,p0]
node2[op8,p1]
node3[op8,p2]
)}
op9[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_1)
on nodes (
node1[op9,p0]
node2[op9,p1]
node3[op9,p2]
)}
op10[3p] {(parallel SubSortOnStoreLocation2)
on nodes (
node1[op10,p0]
node2[op10,p1]
node3[op10,p2]
)}
op11[3p] {(parallel buffer(3))
on nodes (
node1[op11,p0]
node2[op11,p1]
node3[op11,p2]
)}
op12[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_2)
on nodes (
node1[op12,p0]
node2[op12,p1]
node3[op12,p2]
)}
op13[3p] {(parallel PeekFinalOutput)
on nodes (
node1[op13,p0]
node2[op13,p1]
node3[op13,p2]
)}
It runs 42 processes on 3 nodes.
vii
In this instance, you want automatic insertion turned off because EE will see that you are ‘missing’ a
sort or partitioner and insert one for you, thus introducing the inefficiencies we are trying to avoid.
viii
To understand why this is true, look at this example:
Any combination of these groups can be in any partition, regardless of the number of partitions. If you
are running a job with 6 partitions, you could have all 5 groups sent to the same partition; this is
unlikely, and the likelihood decreases as the number of groups grows. In fact, for large numbers of
groups, the distribution of groups across partitions is nearly even.
One effect is that if we wanted to aggregate on ColumnA and ColumnB, summing ColumnC, this grouping is
still correct: if all rows with the same value of ColumnA are together, then all rows with the same
combination of ColumnA and ColumnB are together, as are all rows with the same combination of ColumnA,
ColumnB, and ColumnC.
In the scenario discussed in the main document, we want to reduce the number of times that we hash,
because partitioning costs CPU time. We can do this by identifying the intersection of the keys needed
by all of the hash partitioners and hashing only on those keys:
TransactionDate and ItemID
NOTE: if you take this to an extreme, you will end up with a very small number of groups, which
effectively reduces the parallelism of the job. In the above example, we would have only two groups, so
even if we ran the job 12-ways, we would see no performance improvement over a 2-way run.
You need to understand your data and make educated decisions about your hashing strategy.
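The following Python sketch is illustrative only (the record layout and the hash_partition helper are invented for this note; it is not DataStage code). It shows why hashing on the intersection of the key sets, here ItemID and TransactionDate, still keeps every finer-grained group, such as (ItemID, TransactionDate, StoreLocation), wholly within one partition, so downstream stages that need the wider key do not have to hash again.

from collections import defaultdict

def hash_partition(rows, keys, num_partitions):
    # Send each row to a partition chosen by hashing only `keys`.
    partitions = defaultdict(list)
    for row in rows:
        p = hash(tuple(row[k] for k in keys)) % num_partitions
        partitions[p].append(row)
    return partitions

rows = [
    {"ItemID": 1, "TransactionDate": "2006-01-01", "StoreLocation": "NY"},
    {"ItemID": 1, "TransactionDate": "2006-01-01", "StoreLocation": "MA"},
    {"ItemID": 2, "TransactionDate": "2006-01-02", "StoreLocation": "NY"},
    {"ItemID": 2, "TransactionDate": "2006-01-02", "StoreLocation": "MA"},
]

# Hash only on the intersection of the key sets.
parts = hash_partition(rows, ("ItemID", "TransactionDate"), num_partitions=3)

# Every (ItemID, TransactionDate, StoreLocation) group is confined to a
# single partition, so an aggregation on the wider key set is still correct.
location = {}
for p, subset in parts.items():
    for row in subset:
        group = (row["ItemID"], row["TransactionDate"], row["StoreLocation"])
        assert location.setdefault(group, p) == p
print("all finer-grained groups stayed within one partition")

Note that with only two distinct (ItemID, TransactionDate) values in this toy data, at most two partitions receive any rows, which mirrors the reduced-parallelism caveat above.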
ix
This means that downstream processes will sit idle until the sort is completed, consuming RAM
and process space while offering nothing in return.
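As a purely illustrative Python sketch (the stage names are invented; this is not DataStage code), the pipeline below contrasts a streaming operator, which passes records downstream as soon as they arrive, with a sort, which must consume its entire input before it can emit its first record; anything downstream of the sort simply waits.

def source():
    for i in (3, 1, 2):
        print(f"source emits {i}")
        yield i

def streaming_filter(records):
    # Streaming: each record flows downstream as soon as it arrives.
    for r in records:
        yield r

def blocking_sort(records):
    # Blocking: the whole input must be consumed before the first
    # record can be emitted, so downstream stages sit idle until then.
    yield from sorted(records)

for r in blocking_sort(streaming_filter(source())):
    print(f"downstream receives {r}")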
x
This is a slight oversimplification. It is only true on a per-partition basis, not for the entire
dataset.