InfoSphere DataStage
for Enterprise XML
Data Integration
Addresses the complexities of
hierarchical data types
Chuck Ballard
Vinay Bhat
Shruti Choudhary
Ravi Ravindranath
Enrique Amavizca Ruiz
Aaron Titus
ibm.com/redbooks
International Technical Support Organization
May 2012
SG24-7987-00
Note: Before using this information and the product it supports, read the information in
“Notices” on page ix.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . xiv
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
7.2.3 Mapping constants to a target column . . . . . . . . . . . . . . . . . . . . . . 178
7.2.4 Selecting mapping candidates by using the More option . . . . . . . . 181
7.3 Mapping suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.3.1 Mapping child and parent items to same output list . . . . . . . . . . . . 183
7.3.2 Mapping when no lists are available in the Parser step . . . . . . . . . 185
7.4 Parsing large schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.4.1 Output schema location of the second XML Parser step . . . . . . . . 190
7.4.2 Configuring the auto-chunked elements in the assembly . . . . . . . . 193
7.4.3 Designing the assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.5 Parsing only a section of the schema . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.6 Incorporating an XSLT style sheet during parsing . . . . . . . . . . . . . . . . . 198
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Abbreviations and acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area.
Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product, program, or service that
does not infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer
of express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm
the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on
the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the
sample programs are written. These examples have not been thoroughly tested under all conditions. IBM,
therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
Netezza, and N logo are trademarks or registered trademarks of IBM International Group B.V., an IBM
Company.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its
affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel
SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
XML is one of the most common standards for the exchange of information.
However, organizations find challenges in how to address the complexities of
dealing with hierarchical data types, particularly as they scale to gigabytes and
beyond. In this IBM® Redbooks® publication, we discuss and describe the new
capabilities in IBM InfoSphere® DataStage® 8.5. These capabilities enable
developers to more easily manage the design and processing requirements
presented by the most challenging XML sources. Developers can use these
capabilities to create powerful hierarchical transformations and to parse and
compose XML data with high performance and scalability. Spanning both batch
and real-time run times, these capabilities can be used to solve a broad range of
business requirements.
Also, XML Stage can read single, very large documents by using a new streaming
methodology that avoids loading the entire document into memory.
XML Stage has support for any type of XSD, or a collection of XSDs, to define
your XML metadata. The most important capability is a whole new hierarchical
editing mode called an assembly, which provides support for the creation of
complex multi-node hierarchical structures. Much more function exists, such as
the explicit control of XML validation, a built-in test facility to ease transformation
development, and support for both enterprise edition (EE) and server jobs.
Other contributors
In this section, we thank others who contributed to this book, in the form of
written content, subject expertise, and support.
From IBM locations worldwide
Ernie Ostic - Client Technical Specialist, IBM Software Group, Worldwide Sales
Enablement, Piscataway, NJ
Tony Curcio - InfoSphere Product Management, Software Product Manager,
Charlotte, NC
Find out more about the residency program, browse the residency index, and
apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
Enterprise data integration initiatives present many challenges. You might be
deploying enterprise services by using a service-oriented architecture (SOA),
implementing a cloud initiative, or building the infrastructure for Dynamic
Data Warehousing. For
many of these tasks, XML can be an important building block for your integration
architecture.
In December 2000, the United Nations Centre for Trade Facilitation and
Electronic Business (UN/CEFACT) and the Organization for the Advancement of
Structured Information Standards (OASIS) came together to initiate a project to
standardize XML specifications for business. This initiative, which is called
Electronic Business XML (ebXML), developed a technical framework that enables
XML to be used for the exchange of all electronic business data. The main aim
of ebXML was to lower the cost and difficulty of international trade, with a
particular focus on small and medium-sized businesses and developing nations.
Customer relationship management (CRM) and enterprise application integration
(EAI) involve bringing together a multitude of applications and systems
from multiple vendors to behave as a coordinated whole. The common thread
among all these applications is XML. Imagine trying to integrate information from
billing histories, customer personal information and credit histories, distribution
systems, and workforce management systems; and then displaying them using
browsers, databases, processes, and workflow systems on a variety of technical
platforms, such as mainframes, PCs, and medium-sized servers. All this
information can be shared by using XML, because it is a versatile, powerful, and
flexible language. It has the ability to describe complex data. It is also extensible,
allowing applications to grow and develop without architectural re-engineering.
IBM uses XML in all its new tools. The IBM applications WebSphere® Studio,
DB2®, and WebSphere Application Server are based on XML with extensibility
as a major advantage.
Voice XML
In October, 2001, the World Wide Web Consortium (W3C) announced the first
release of a working draft for the Voice Extensible Markup Language
(VoiceXML). VoiceXML has these purposes among others:
Shields designers and developers from low-level, platform-specific details.
Promotes portability across implementation platforms.
Offers a common language for platform providers, development tool
providers, and content providers.
The benefits of this new XML derivative are not difficult to comprehend. Its uses
are many. For example, insurance companies can estimate and forecast natural
disaster claims. Scientists can study environmental impacts. And, local and
federal governments can perform city and town planning.
XML is not a total solution for every problem in e-business, but it is making
significant inroads in communications between old computer programs.
Therefore, these older programs last longer, saving money and time, both of
which are precious to the bottom line.
Imagine a navigation system that consumers use to move from one place
to another. This system has street maps, address information, local attractions,
and other information. If this information must be displayed on a web browser,
a personal digital assistant (PDA), and a mobile phone, it is a major development
cost to build three data access systems and three data presentation systems.
However, if we develop one data access system that delivers XML, we need to
create only a presentation transformation for each device. We transform the XML
by using the Extensible Stylesheet Language Transformation (XSLT) feature.
Extensibility
HTML has a major problem: it is not extensible. It has been enhanced by software
vendors, but these enhancements are not coordinated and are, therefore,
non-standard. HTML was never designed to access data from databases. To
overcome this deficiency, Microsoft built Active Server Pages (ASP) and Sun
produced JavaServer Pages (JSP).
Industry acceptance
XML is accepted widely by the information and computing industry, and it is
based on common concepts. A large number of XML tools are emerging from both
existing software vendors and XML start-up companies. XML is readable on every
operating system because it is plain text, and it can be viewed with any
text editor or word processor.
The tree-based structure of XML is much more powerful than fixed-length data
formats. Because objects are tree structures as well, XML is ideally suited to
working with object-oriented programming.
DataStage can manage data arriving in real time, as well as data received on a
periodic or scheduled basis, which enables companies to solve large-scale
business problems through high-performance processing of massive data
volumes. By making use of the parallel processing capabilities of multiprocessor
hardware platforms, IBM InfoSphere DataStage Enterprise Edition can scale to
satisfy the demands of ever-growing data volumes, stringent real-time
requirements, and ever-shrinking batch windows.
Frequently, these obvious choices for managing and sharing XML data do not
meet performance requirements. File systems are fine for simple tasks, but they
do not scale well when you have hundreds or thousands of documents.
Concurrency, recovery, security, and usability issues become unmanageable.
With the release of DB2 9, IBM is leading the way to a new era in data
management. DB2 9 embodies technology that provides pure XML services.
This pureXML technology is not only for data server external interfaces; rather,
pureXML extends to the core of the DB2 engine. The XML and relational
services in DB2 9 are tightly integrated. They offer the industry’s first pureXML
and relational hybrid data server. Figure 1-4 illustrates the hybrid database.
Two capabilities were added to DataStage over time to address the needs of real
time:
Message Queue/Distributed Transaction Stage (MQ/DTS): MQ/DTS
addresses the need for guaranteed delivery of source messages to target
databases, with the once-and-only-once semantics. This type of delivery
mechanism was originally made available in DataStage 7.5 in the form of the
Unit-of-Work (UOW) stage. The original target in DS 7.5 was Oracle. In
InfoSphere DataStage 8.x, this solution is substantially upgraded
(incorporating the new database connector technology for various types of
databases) and rebranded as the Distributed Transaction Stage.
Information Services Director (ISD): ISD enables DataStage to expose DS
jobs as services for service-oriented applications (SOA). ISD supports, as
examples, the following types of bindings:
– SOAP
– Enterprise JavaBeans (EJB)
– JMS
The term near-real time applies to the following types of scenarios:
Message delivery:
– The data is delivered and expected to be processed immediately.
– Users can accept a short lag time (ranging from seconds to a few
minutes).
– No person waits for a response, for example:
• Reporting systems
• Active warehouses
– The following DS solutions can be applied:
• MQ DTS/UOW
• ISD with text over JMS binding
The high cost of starting up and shutting down jobs meant that DataStage had to
be enhanced with additional capabilities to support these types of scenarios;
implementing them with batch applications is not feasible.
IBM InfoSphere MDM Server is a master repository that delivers a single version
of an organization’s data entities, such as customer, product, and supplier. Its
SOA library of prepackaged business services allows organizations to define
how they want users to access master data and seamlessly integrate into current
architectures and business processes.
1. World Wide Web Consortium: http://www.w3.org/TR/wsdl
The source and target databases can potentially be databases from separate
vendors running on separate platforms. Propagated data can involve user data
that is stored both in relational format, as well as XML, and can include audit data
that is hidden in the database logs.
The flow of data through Change Data Capture can be divided into three parts:
capture, transformation, and apply. The capture process might read the log files
directly, or it might use an application programming interface (API) to read
data from the log. InfoSphere Change Data Capture has
native log-based capture capabilities for the following databases:
IBM DB2 z/OS®
DB2 for IBM i
DB2 Linux, UNIX, and Microsoft Windows
Oracle
SQL Server
Sybase
Upon reading data from the database recovery logs, InfoSphere Change Data
Capture filters data based on the table where the change occurred. Only data
pertaining to tables of interest is retained for further processing. InfoSphere
Change Data Capture then stages the changes in a holding area until a commit
of these changes occurs in the source database. If a rollback occurs instead,
InfoSphere Change Data Capture discards the associated changes.
Transformations
In many cases, the data models of the source and target are not the same.
InfoSphere Change Data Capture can apply transformations to the data while it is
in flight between the source and target. The following transformations are
commonly used:
Adding other information that can be obtained from the database recovery
logs
Concatenation and other string functions
Data type conversion
Arithmetic
Joining to look-up data in the secondary tables
If/then/else logic
Apply
After transformations, the changes are applied to the target database. Changes
are applied by executing SQL statements against the appropriate target tables.
(Figure: Q Replication architecture — source and target databases with their database logs, connected by WebSphere MQ transport.)
Propagated data can involve user data that is stored as relational data or as XML
and can include audit data from the database logs. Captured changes are also
transformed to meet the target database characteristics. Then, they are
propagated to target DB2 tables, directly integrated into DataStage, or trickle-fed
to drive other processes via messaging, as depicted in Figure 1-9 on page 22.
InfoSphere Data Event Publisher consists of the Capture component that is used
by InfoSphere Replication Server and WebSphere MQ transport. However, it
does not use the Apply component. Instead, the changed data is converted into
either XML messages or CSV data and written to the WebSphere MQ queue.
Consuming clients can retrieve the message at the other end of the queue and
do not provide any feedback to the Capture component. This type of architecture
eliminates the need for handshakes between a sending (Capture) process and a
receiving (Apply) process. Therefore, this architecture provides flexible
change-data solutions that enable many types of business integration scenarios,
as shown in Figure 1-10 on page 23.
InfoSphere Classic Data Event Publisher for z/OS V9.5 provides a new
publication interface for use by InfoSphere DataStage.
1.4.9 Summary
In this chapter, we discuss the benefits of using XML technologies that are widely
accepted by various industrial organizations throughout the world. We also
describe a number of IBM solutions that facilitate data integration and how to use
current and future investments. The focus of this book is the XML integration
capabilities that are built into DataStage. It is imperative to have a basic
understanding of XML and a working knowledge of DataStage to fully appreciate
the concepts and solutions that are presented in this book.
XML was developed in 1998 and is now widely used. It is one of the most flexible
ways to automate web transactions. XML is derived as a subset from Standard
Generalized Markup Language (SGML) and is designed to be simple, concise,
human readable, and relatively easy to use in programs on multiple platforms.
For more information about the XML standard, see the following web page:
http://www.w3.org/XML
As with other markup languages, XML is built using tags. Basic XML consists of
start tags, end tags, and a data value between the two. In XML, you create your
own tags, with a few restrictions. Example 2-1 shows a simple XML document.
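For illustration, a minimal document of the kind that Example 2-1 describes might look like this sketch (the element names here are invented):

<?xml version="1.0"?>
<Customer>
   <Name>John Smith</Name>
   <City>Armonk</City>
</Customer>

The <Name> and <City> start tags and end tags enclose the data values, and the entire document is wrapped in a single root element.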
The XML syntax is simple, but it is difficult to parse and transform an XML
document into a form that is usable by programming languages. Therefore, it is
essential to have access to efficient parsing and transformation tools.
XML contains document type and schema definitions. These document type and
schema definitions are used to specify semantics (allowable grammar) for an
XML document.
XML
Service Service
Requester Provider
XML documents can be well formed, or they can be both well formed and valid.
These important rules do not exist for HTML documents; they contrast with the
freestyle nature of many of the concepts in HTML. Briefly, documents can be
classified as invalid, well formed, or well formed and valid.
Validation
The process of checking to see whether an XML document conforms to a
schema or DTD is called validation. Validation is in addition to checking a
document for compliance to the XML core concept of syntactic well-formedness.
All XML documents must be well formed, but it is not required that a document is
valid unless the XML parser is validating. When validating, the document is also
checked for conformance with its associated schema.
Documents are only considered valid if they satisfy the requirements of the DTD
or schema with which they are associated. These requirements typically include
the following types of constraints:
Elements and attributes that must or might be included, and their permitted
structure.
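As a brief sketch (element names invented for illustration), the following document is well formed, because every start tag has a matching end tag and the elements nest properly. It is also valid only if the DTD or schema with which it is associated permits an <address> element that contains exactly one <name> and one <city>:

<address>
   <name>John Smith</name>
   <city>Armonk</city>
</address>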
XML declarations
Most XML documents start with an XML declaration that provides basic
information about the document to the parser. An XML declaration is
recommended, but not required. If there is an XML declaration, it must be the first
item in the document.
The declaration can contain up to three name-value pairs (many people call them
attributes, although technically they are not). The version is the version of XML
used; currently this value must be 1.0. The encoding is the character set used in
this document. The ISO-8859-1 character set referenced in this declaration
includes all of the characters used by most Western European languages. If no
encoding is specified, the XML parser assumes that the characters are in the
UTF-8 set, a Unicode standard that supports virtually every character and
ideograph in all languages. The following declaration is an example:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
Finally, standalone, which can be either yes or no, defines whether this
document can be processed without reading any other files. For example, if the
XML document does not reference any other files, you specify standalone="yes".
If the XML document references other files that describe what the document can
contain (more about those files later), you specify standalone="no". Because
standalone="no" is the default, you rarely see standalone in XML declarations.
Namespaces
The power of XML comes from its flexibility. You can define your own tags to
describe your data. Consider an XML document that uses a <title> element for a
courtesy title, a <title> element for the title of a book, and a <title> element
for the title to a piece of property. All of these are reasonable choices,
but all of them create elements with the same name. How do you tell whether a
specific <title> element refers to a person, a book, or a piece of property?
These situations are handled with namespaces.
In Example 2-2 on page 30, the three namespace prefixes are addr, books, and
mortgage. Defining a namespace for a particular element means that all of its
child elements belong to the same namespace. The first <title> element belongs
to the addr namespace because its parent element, <addr:Name>, belongs to
the addr namespace.
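A hedged sketch of this approach, using the addr, books, and mortgage prefixes with invented namespace URIs, might look like the following fragment:

<record xmlns:addr="http://www.example.com/addresses"
        xmlns:books="http://www.example.com/books"
        xmlns:mortgage="http://www.example.com/mortgage">
   <addr:Name>
      <addr:title>Mrs.</addr:title>
   </addr:Name>
   <books:book>
      <books:title>A Tale of Two Cities</books:title>
   </books:book>
   <mortgage:property>
      <mortgage:title>Deed of trust</mortgage:title>
   </mortgage:property>
</record>

Because each prefix is bound to a different namespace URI, the three <title> elements no longer collide.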
Additional components
Comments, processing instructions, and entities are additional components of
XML. The following sections provide additional details about these components.
Comments
Comments can appear anywhere in the document; they can even appear before
or after the root element. A comment begins with <!-- and ends with -->. A
comment cannot contain a double hyphen (--) except at the end; with that
exception, a comment can contain anything. Most importantly, any markup inside
a comment is ignored; if you want to remove a large section of an XML
document, simply wrap that section in a comment. (To restore the
commented-out section, simply remove the comment tags.) This example
markup contains a comment:
<!-- This is the comment section of the CustomerName XML document: -->
<CustomerName>
   <FirstName>John</FirstName>
   <LastName>Smith</LastName>
</CustomerName>
Processing instructions
A processing instruction is markup intended for a particular piece of code. In the
following example, a processing instruction (PI) exists for Cocoon, which is an
XML processing framework from the Apache Software Foundation. When
Cocoon processes an XML document, it looks for processing instructions that
begin with cocoon-process, then processes the XML document. In this example,
the type="sql" attribute tells Cocoon that the XML document contains an SQL
statement:
<!-- Here is a PI for Cocoon: -->
<?cocoon-process type="sql"?>
Entities
The following example defines an entity for the document. Anywhere the XML
processor finds the string &RB;, it replaces the entity with the string ‘RedBooks’:
<!-- Here is an entity: -->
<!ENTITY RB "RedBooks">
The XML spec also defines five entities that can be used in place of various
special characters:
&lt; for the less-than sign
&gt; for the greater-than sign
&quot; for a double quote
&apos; for a single quote (or apostrophe)
&amp; for an ampersand
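As a small usage sketch, assuming the &RB; entity is declared as shown above:

<note>This &RB; example escapes the markup characters: &lt;, &gt;, &quot;, &apos;, and &amp;.</note>

When the processor expands the entities, the text reads: This RedBooks example escapes the markup characters: <, >, ", ', and &.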
A DTD defines your own language for a specific application. The DTD can be
either stored in a separate file or embedded within the same XML file. If it is
stored in a separate file, it can be shared with other documents. XML documents
referencing a DTD contain a <!DOCTYPE> declaration. This <!DOCTYPE>
declaration either contains the entire DTD declaration in an internal DTD or
specifies the location of an external DTD. Example 2-3 on page 33 shows an
external DTD in a file named DTD-Agenda.dtd.
We can also define an internal DTD in an XML document so that both of them
are in the same file, as shown in Example 2-5. In either case, internal or external
DTD, the <!DOCTYPE> declaration indicates the root element.
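As a sketch of both forms (the agenda and item element names are invented for illustration), the first document below references the external DTD file DTD-Agenda.dtd, and the second embeds an equivalent internal DTD:

<?xml version="1.0"?>
<!DOCTYPE agenda SYSTEM "DTD-Agenda.dtd">
<agenda>
   <item>Review the schema design</item>
</agenda>

<?xml version="1.0"?>
<!DOCTYPE agenda [
   <!ELEMENT agenda (item*)>
   <!ELEMENT item (#PCDATA)>
]>
<agenda>
   <item>Review the schema design</item>
</agenda>

In both cases, the <!DOCTYPE> declaration names <agenda> as the root element.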
Symbols in DTDs
A few symbols are used in DTDs to indicate how often (or whether) something
can appear in an XML document. We show several examples, along with their
meanings:
<!ELEMENT address (name, city, state)>
The <address> element must contain a <name>, a <city>, and a <state>
element, in that order. All of the elements are required. The comma indicates
a list of items.
<!ELEMENT name (title?, first-name, last-name)>
This means that the <name> element contains an optional <title> element,
followed by a mandatory <first-name> and a <last-name> element. The
question mark indicates that an item is optional; it can appear one time or not
at all.
<!ELEMENT addressbook (address+)>
An <addressbook> element contains one or more <address> elements. You
can have as many <address> elements as you need, but there must be at
least one. The plus sign (+) indicates that an item must appear at least one
time, but it can appear any number of times.
<!ELEMENT private-addresses (address*)>
A <private-addresses> element contains zero or more <address> elements.
The asterisk indicates that an item can appear any number of times, including
zero.
<!ELEMENT name (title?, first-name, (middle-initial | middle-name)?,
last-name)>
A <name> element contains an optional <title> element, followed by a
<first-name> element, possibly followed by either a <middle-initial> or a
<middle-name> element, followed by a <last-name> element. Both
<middle-initial> and <middle-name> are optional, and you can have only one
of the two elements. Vertical bars indicate a list of choices; you can choose
only one item from the list.
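As a sketch, both of the following <name> elements conform to that last declaration. The first omits the optional items; the second includes a <title> and a <middle-initial>:

<name>
   <first-name>John</first-name>
   <last-name>Smith</last-name>
</name>

<name>
   <title>Mr.</title>
   <first-name>John</first-name>
   <middle-initial>Q</middle-initial>
   <last-name>Smith</last-name>
</name>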
Because the XML schema is a language, several choices exist for building a
schema that covers the XML document. Example 2-7 shows one possible, simple
design.
The first option has a real disadvantage: the schema might be difficult to read
and maintain when documents are complex. W3C XML schema allows us to
define data types and use these types to define our attributes and elements. It
also allows the definition of groups of elements and attributes. In addition, there
are several ways to arrange relationships between the elements.
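As a minimal sketch (the element and type names are invented for illustration), a W3C XML schema that defines a named type and reuses it for an element might look like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:complexType name="AddressType">
      <xs:sequence>
         <xs:element name="name" type="xs:string"/>
         <xs:element name="city" type="xs:string"/>
         <xs:element name="state" type="xs:string"/>
      </xs:sequence>
   </xs:complexType>
   <xs:element name="address" type="AddressType"/>
</xs:schema>

Defining AddressType once and referring to it by name keeps the schema readable even as the documents that it describes grow more complex.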
2.4.3 Namespaces
Namespaces are used when you need separate elements, possibly with separate
attributes, but with the same name. Depending upon the context, a tag is related
to one specific element or to another specific element. Without namespaces, it
is difficult to tell these elements apart.
Clearly, a problem exists with the element <title>: it appears in two separate
contexts. This situation complicates processing and might cause ambiguities. We
need a mechanism to distinguish between the two contexts and apply the correct
semantic description to each tag. The cause of this problem is that the document
uses only one common namespace, and the solution is namespaces. Namespaces are a
simple and straightforward way to distinguish names that are used in XML
documents. By providing the related namespace when an element is validated, the
problem is solved, as shown in Example 2-9.
XSLT offers a powerful means of transforming XML documents into other forms,
producing XML, HTML, and other formats. It can sort, select, and number. And, it
offers many other features for transforming XML. It operates by reading a style
sheet, which consists of one or more templates, and then matching the templates
as it visits the nodes of the XML document. The templates can be based on
names and patterns. In the context of information integration, XSLT is
increasingly used to transform XML data into another form: sometimes into other
XML (for example, filtering out certain data), and sometimes into SQL
statements, plain text, or any other format. Thus, any XML document can be
rendered in separate formats, such as HTML, PDF, RTF, VRML, and PostScript.
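As a brief sketch (the library, book, and title element names are invented for this illustration), a style sheet with two templates that selects book titles and emits HTML might look like this:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:template match="/library">
      <html>
         <body>
            <xsl:apply-templates select="book"/>
         </body>
      </html>
   </xsl:template>
   <xsl:template match="book">
      <p><xsl:value-of select="title"/></p>
   </xsl:template>
</xsl:stylesheet>

As the processor visits the nodes of the source document, it matches each node against these templates and writes the corresponding output.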
XPath patterns
Table 2-2 shows several XPath patterns. These examples show you a few objects
that you can select. In these patterns, the @ symbol refers to an attribute.
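As illustrative sketches (the element and attribute names are invented), typical XPath patterns include the following:

/library/book selects the <book> children of the root <library> element
book/title selects the <title> children of <book> elements
//title selects every <title> element anywhere in the document
book/@isbn selects the isbn attribute of <book> elements
* selects all child elements of the current node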
2.5.2 XSLT
XSLT helps you access and display the content in the XML file. XSLT is referred
to as the stylesheet language of XML. The relationship of Cascading Style
Sheets (CSS) and HTML is comparable to the relationship of XSLT and XML.
However, XML and XSLT are far more sophisticated technologies than HTML
and CSS.
Figure 2-2 shows the source tree from the XML document that was presented in
Example 2-10 on page 41.
The most important aspect of XSLT is that it allows you to perform complex
manipulations on the selected tree nodes by affecting both content and
appearance. The final output might not resemble the source document. XSLT far
surpasses CSS in this capability to manipulate the nodes.
Web services help to bridge the gap between business users and technologists
in an organization. Web services make it easier for business users to understand
web operations. Business users can then describe events and activities, and
technologists can associate them with appropriate services.
The versatility of XML is what makes Web services different from previous
generation component technologies. XML allows the separation of grammatical
structure (syntax) and the grammatical meaning (semantics), and how that is
processed and understood by each service and the environment in which it
exists. Now, objects can be defined as services, communicating with other
services in XML-defined grammar, whereby each service then translates and
analyzes the message according to its local implementation and environment.
Thus, a networked application can be composed of multiple entities of various
makes and designs as long as they conform to the rules defined by their SOA.
InfoSphere Information Server helps you derive more value from complex,
heterogeneous information. It helps business and IT personnel collaborate to
understand the meaning, structure, and content of information across a wide
variety of sources. IBM InfoSphere Information Server helps you access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.
Most critical business initiatives cannot succeed without the effective integration
of information. Initiatives, such as a single view of the customer, business
intelligence, supply chain management, and Basel II and Sarbanes-Oxley
compliance, require consistent, complete, and trustworthy information.
3.1.1 Capabilities
InfoSphere Information Server features a unified set of separately orderable
product modules, or suite components, that solve multiple types of business
problems. Information validation, access, and processing rules can be reused
across projects, leading to a higher degree of consistency, stronger control over
data, and improved efficiency in IT projects. This single unified platform enables
companies to understand, cleanse, transform, and deliver trustworthy and
context-rich information, as shown in Figure 3-1 on page 48.
Common connectivity
InfoSphere Information Server connects to information sources whether they are
structured, unstructured, on the mainframe, or in applications. Metadata-driven
connectivity is shared across the suite components.
Unified metadata
InfoSphere Information Server is built on a unified metadata infrastructure that
enables shared understanding between business and technical domains. This
infrastructure reduces development time and provides a persistent record that
can improve confidence in information. All functions of InfoSphere Information
Server share the metamodel, making it easier for different roles and functions to
collaborate.
Common services
InfoSphere Information Server is built entirely on a set of shared services that
centralizes core tasks across the platform. These tasks include administrative
tasks, such as security, user administration, logging, and reporting. Shared
services allow these tasks to be managed and controlled in one place,
regardless of which suite component is used. The common services also include
the metadata services, which provide standard service-oriented access and
analysis of metadata across the platform.
Web Console
Administration tasks are performed in the IBM Information Server Web Console, a
browser-based interface for administrative activities, such as managing security
and creating views of scheduled tasks.
IBM Information Server products can access three general categories of service:
Design: Design services help developers create function-specific services
that can also be shared.
Execution: Execution services include logging, scheduling, monitoring,
reporting, security, and web framework.
Metadata: Using metadata services, metadata is shared “live” across tools so
that changes made in one IBM Information Server component are instantly
visible across all of the suite components. Metadata services are tightly
integrated with the common repository. You can also exchange metadata with
external tools by using metadata services.
Repository tier
The shared repository is used to store all IBM Information Server product module
objects1 (including IBM InfoSphere DataStage objects) and is shared with other
applications in the suite. Clients can access metadata and the results of data
analysis from the service layers.
Engine tier
This tier is the parallel runtime engine that executes the IBM Information Server
tasks. It consists of the Information Server Engine, Service Agents, Connectors,
and Packaged Application Connectivity Kits (PACKs2):
1. IBM Information Server product module objects include jobs and table definitions, as well as
operational metadata, such as job start and stop times. The repository is also used to store
Information Server configuration settings, such as user group assignments and roles.
2. PACKs provide an application-specific view of data and use the packaged application vendor’s
application programming interfaces (APIs) for connectivity and business metadata.
For all topologies, you can add clients and engines (for scalability) on additional
computers.
To select a topology, you must consider your performance needs by reviewing the
capacity requirements for the topology elements: the server, disk, network, data
sources, targets, data volumes, processing requirements, and any service-level
agreements (SLAs).
Tip: We suggest that you use the same topology for your test and production
environments to minimize issues when a job is deployed into production.
Two tier
The engine, application server, and metadata repository are all on the same
computer system, while the clients are on a different machine, as shown in
Figure 3-4 on page 61.
High availability and failover are simpler to manage with two computers, because
all the servers fail over at the same time.
Three tier
The engine is on one machine, and the application server and metadata
repository are co-located on another machine. The clients are on a third
machine, as shown in Figure 3-5 on page 62.
Cluster
This cluster topology is a slight variation of the three-tier topology with the engine
duplicated over multiple computers, as shown in Figure 3-6 on page 63.
Grid
With hardware computing power a commodity, grid computing provides more
processing power to a task than was previously possible. Grid computing uses all
of the low-cost computing resources, processors, and memory that are available
on the network to create a single system image.
Grid topology is similar to that of a cluster (Figure 3-6) with engines distributed
over multiple machines. As in the case of a cluster environment, a single parallel
job execution can span multiple computers, each with its own engine.
The key difference with cluster computing is that in a grid environment, the
machines over which a job executes are dynamically determined (through the
generation of a dynamic configuration file) with an integrated resource manager,
such as IBM Tivoli® Workload Scheduler LoadLeveler®.
IBM InfoSphere DataStage integrates data across multiple and high volume data
sources and target applications. It integrates data on demand with a high
performance parallel framework, extended metadata management, and
enterprise connectivity. DataStage supports the collection, integration, and
transformation of large volumes of data, with data structures that range from
simple to highly complex.
DataStage can manage data that arrives in real time, as well as data received on
a periodic or scheduled basis. This capability enables companies to solve
large-scale business problems through high-performance processing of massive
data volumes. By using the parallel processing capabilities of multiprocessor
hardware platforms, IBM InfoSphere DataStage Enterprise Edition can scale to
satisfy the demands of ever-growing data volumes, stringent real-time
requirements, and ever-shrinking batch windows.
With these components and a great set of standard practices, you are on your
way to a highly successful data integration effort.
DataStage manages data that arrives in real time and data that is received on a
periodic or scheduled basis. It enables companies to solve large-scale business
problems through high-performance processing of massive data volumes.
Using the combined suite of IBM Information Server offerings, DataStage can
simplify the development of authoritative master data by showing where and how
information is stored across source systems. DataStage can also consolidate
disparate data into a single, reliable record, cleanse and standardize information,
remove duplicates, and link records across systems. This master record can be
loaded into operational data stores, data warehouses, or master data
applications, such as IBM Master Data Management (MDM), by using IBM
InfoSphere DataStage.
3.5.2 Jobs
An IBM InfoSphere DataStage job consists of individual, linked stages, which
describe the flow of data from a data source to a data target.
A stage usually has at least one data input and one data output. However, certain
stages can accept more than one data input, and output to more than one stage.
Each stage has a set of predefined and editable properties that tell it how to
perform or process data. Properties might include the file name for the
Sequential File stage, the columns to sort, the transformations to perform, and
the database table name for the DB2 stage. These properties are viewed or
edited by using stage editors. Stages are added to a job and linked by using the
Designer. Figure 3-7 on page 69 shows several of the stages and their iconic
representations.
Complex Flat File stage: Extracts data from a flat file that contains complex
data structures, such as arrays or groups.
Stages and links can be grouped in a shared container. Instances of the shared
container can then be reused in different parallel jobs. You can also define a local
container within a job, which groups stages and links into a single unit. The local
container must be used within the job in which it is defined.
Separate types of jobs have separate stage types. The stages that are available
in the Designer depend on the type of job that is currently open in the Designer.
Parallel Job stages are organized into groups on the Designer palette:
General includes stages, such as Container and Link.
Data Quality includes stages, such as Investigate, Standardize, Reference
Match, and Survive.
Stages: For details about all the available stages, refer to IBM InfoSphere
DataStage and QualityStage Parallel Job Developer’s Guide, SC18-9891-03,
and relevant connectivity guides for the stages that relate to connecting to
external data sources and data targets.
In a parallel job, each stage normally (but not always) corresponds to a process.
You can have multiple instances of each process to run on the available
processors in your system.
To the DataStage developer, this job appears the same way on your Designer
canvas, but you can optimize it through advanced properties.
Pipeline parallelism
In the Figure 3-8 example, all stages run concurrently, even in a single-node
configuration. As data is read from the Oracle source, it is passed to the
Transformer stage, and from there to the stage that writes to the target database.
If you ran the example job on a system with multiple processors, the stage
reading starts on one processor and starts filling a pipeline with the data that it
read. The Transformer stage starts running as soon as data exists in the pipeline,
processes it, and starts filling another pipeline. The stage that writes the
transformed data to the target database similarly starts writing as soon as data is
available. Thus, all three stages operate simultaneously.
Partition parallelism
When large volumes of data are involved, you can use the power of parallel
processing to your best advantage by partitioning the data into a number of
separate sets, with each partition handled by a separate instance of the job
stages. Partition parallelism is accomplished at run time, instead of a manual
process that is required by traditional systems.
The DataStage developer needs to specify only the algorithm to partition the
data, not the degree of parallelism or where the job executes. By using partition
parallelism, the same job effectively is run simultaneously by several processors,
and each processor handles a separate subset of the total data. At the end of the
job, the data partitions can be collected again and written to a single data source,
as shown in Figure 3-9.
In certain circumstances, you might want to repartition your data between stages.
This repartition might happen, for example, where you want to group data
differently. Suppose that you initially processed data based on customer last
name, but now you want to process data grouped by zip code. You must
repartition to ensure that all customers that share a zip code are in the same
group. With DataStage, you can repartition between stages as and when
necessary. With the Information Server Engine, repartitioning happens in
memory between stages, instead of writing to disk.
Multiple discrete services give IBM InfoSphere DataStage the flexibility to match
increasingly varied customer environments and tiered architectures. Figure 3-2
on page 51 shows how IBM InfoSphere DataStage Designer (labeled “User
Common services
As part of the IBM Information Server Suite, DataStage uses the common
services, as well as DataStage-specific services.
Common repository
The common repository holds three types of metadata that are required to
support IBM InfoSphere DataStage:
Project metadata: All the project-level metadata components, including IBM
InfoSphere DataStage jobs, table definitions, built-in stages, reusable
subcomponents, and routines, are organized into folders.
Operational metadata: The repository holds metadata that describes the
operational history of integration process runs, the success or failure of jobs,
the parameters that were used, and the time and date of these events.
Design metadata: The repository holds design-time metadata that is created
by the IBM InfoSphere DataStage and QualityStage Designer and IBM
InfoSphere Information Analyzer.
The common parallel processing engine runs executable jobs that extract,
transform, and load data in a wide variety of settings. The engine uses
parallelism and pipelining to handle high volumes of work more quickly and to
scale a single job across the boundaries of a single server in cluster or grid
topologies.
OSH script
The IBM InfoSphere DataStage and QualityStage Designer client creates IBM
InfoSphere DataStage jobs that are compiled into parallel job flows, and reusable
components that execute on the parallel Information Server Engine.
The Designer generates all the code: the OSH, plus the C++ code for any
Transformer stages that are used.
The following list shows the equivalency between framework and DataStage
terms:
Execution flow
When you execute a job, the generated OSH and contents of the configuration
file ($APT_CONFIG_FILE) are used to compose a “score,” which is similar to an
SQL query optimization plan.
At run time, IBM InfoSphere DataStage identifies the degree of parallelism and
node assignments for each operator, and inserts sorts and partitioners as
needed to ensure correct results. It also defines the connection topology (virtual
datasets/links) between adjacent operators/stages, and inserts buffer operators
to prevent deadlocks (for example, in fork-joins). It also defines the number of
actual operating system (OS) processes. Multiple operators/stages are
combined within a single OS process as appropriate, to improve performance
and optimize resource requirements.
The job score is used to fork processes with communication interconnects for
data, message, and control. Processing begins after the job score and processes
are created. Job processing ends when either the last row of data is processed
by the final operator, a fatal error is encountered by any operator, or the job is
halted by DataStage Job Control or human intervention, such as DataStage
Director STOP.
You can set $APT_STARTUP_STATUS to show each step of the job start-up.
You can set $APT_PM_SHOW_PIDS to show process IDs in the DataStage
log.
Job scores are divided into two sections: datasets (partitioning and collecting)
and operators (node/operator mapping). Both sections identify sequential or
parallel processing.
The execution (orchestra) manages control and message flow across processes
and consists of the conductor node and one or more processing nodes.
In terms of binary components, XML Pack provides client, services, and engine
components. The Client tier has one executable file and a set of dynamic link
library (DLL) files.
The Service tier has one WAR file for client interaction and one EAR file to
provide REST service. The Engine tier has one large JAR file, which is expanded
as a list of JAR files in the engine installation directory.
In real-time applications, this model is inefficient, because the job must remain
running even though there might not be any data for it to process. Under those
circumstances, the stages remain idle, waiting for data to arrive.
3.7.2 End-of-wave
The ability for jobs to remain always-on introduces the need to flush records at
the end of each request.
An end-of-wave (EOW) marker is a hidden record type that forces the flushing of
records. An EOW marker is generated by the source real-time stage, depending
on its nature and certain conditions set as stage properties.
Figure 3-13 on page 81 shows an example of how EOW markers are propagated
through a parallel job. The source stage can be an MQConnector; in this case,
the target stage is a Distributed Transaction stage (DTS).
The ISD and MQConnector stages generate EOW markers in the following way:
ISD Input:
– It issues an EOW for each incoming request (SOAP, EJB, or JMS).
– ISD converts an incoming request to one or more records. A request might
consist of an array of records, or a single record, such as one large XML
payload. The ISD input stage passes on downstream all records for a
single incoming request. After passing on the last record for a request, it
sends out an EOW marker.
– The mapping from incoming requests to records is determined when the
operation is defined by the Information Services Console.
EOWs modify the behavior of regular stages. Upon receiving EOW markers, the
internal state of the stage must be reset, so a new execution context begins. For
most stages (Parallel Transformers, Modify, or Lookups), there is no practical
impact from a job design perspective.
For database sparse lookups, record flow for the Lookup stage is the same. But
the stage needs to keep its connections to the database across waves, instead of
reconnecting after each wave.
However, there are a few stages whose results are directly affected by EOWs:
Sorts
Aggregations
For these stages, the corresponding logic is restricted to the records that belong
to a certain wave. Instead of consuming all records during the entire execution of
the job, the stage produces a partial result, for the records that belong to a
certain wave. A Sort stage, for instance, writes out sorted results that are sorted
only in the wave, and not across waves. The stage continues with the records for
the next wave, until a new EOW marker arrives.
The transaction context for database stages prior to Information Server 8.5
always involved a single target table. Those stages support a single input link,
which maps to one or a couple of SQL statements against a single target table. If
there are multiple database stages on a job, each stage works on its own
connection and transaction context. One stage is oblivious to what is going on in
other database stages, as depicted in Figure 3-14 on page 83.
Beginning with IS 8.5, Connectors support multiple input links, instead of the
single input link that was the limit in versions prior to IS 8.5 for Connectors
and Enterprise database stage types. With multiple input links, a Connector
stage executes all SQL statements for all rows from all input links as part of a
single unit of work. Figure 3-15 shows this action.
Figure 3-15 Multistatement transactional unit with a connector for a single database type
With database Connectors, ISD jobs no longer have to cope with potential
database inconsistencies in the event of failure. ISD requests might still need
to be retried after a failure.
For transactions spanning multiple database types and those transactions that
include guaranteed delivery of MQ messages, you must use the Distributed
Transaction stage, which is described next.
This stage was originally introduced in DataStage 7.5, with the name
UnitOfWork. It was re-engineered for Information Server 8, and it is now called
DTS. Figure 3-16 illustrates DTS.
Also, you can use the DTS without a source MQConnector, such as in
Information Services Director jobs. You still need to install and configure MQ,
because it is required as the underlying transaction manager.
The first solution is for interoperability with JMS. The second solution is for MQ.
You can use ISD with JMS to process MQSeries messages and MQ/DTS to
process JMS messages by setting up a bridging between MQ and JMS with
WebSphere enterprise service bus (ESB) capabilities. However, the bridging
setup between JMS and MQ is relatively complex.
Both ISD and MQ/DTS jobs are parallel jobs, which are composed of processes
that implement a pipeline, possibly with multiple partitions. The parent-child
relationships between OS processes are represented by the dotted green lines.
The path that is followed by the actual data is represented by solid (red) lines.
ISD jobs deployed with JMS bindings have the active participation of the
WebSphere Application Server and the ASB agent. In an MQ/DTS job, the data
flow is restricted to the parallel framework processes.
However, one additional transaction context exists on the JMS side, which is
managed by EJBs in the WebSphere Application Server J2EE container as Java
Transaction API (JTA) transactions.
JTA transactions make sure that no messages are lost. If any components along
the way (WebSphere Application Server, ASB Agent, or the parallel job) fail
during the processing of an incoming message before a response is placed on
the response queue, the message is reprocessed. The JTA transaction that
selected the incoming message from the source queue is rolled back, and the
message remains on the source queue. This message is then reprocessed on
job restart.
Therefore, database transactions in ISD jobs that are exposed through JMS bindings must be
idempotent. For the same input data, they must yield the same result on the
target database.
Consider these aspects when taking advantage of ISD with JMS bindings:
The retry attempt when JTA transactions are enabled largely depends on the
JMS provider. In WebSphere 6, with its embedded JMS support, the default is
five attempts (this number is configurable). After five attempts, the message
is considered a poison message and goes into the dead letter queue.
Subtle differences from provider to provider can cause problems. You must
understand exactly how things operate when MQ is the provider or when the
embedded JMS in WebSphere Application Server is the provider.
ISD is notable for the way that it simplifies the exposure of DS jobs as SOA
services, letting users bypass the underlying complexities of creating J2EE
services for the various binding types.
ISD controls the invocation of those services, supporting request queuing and
load balancing across multiple service providers (a DataStage Engine is one of
the supported provider types).
A single DataStage job can be deployed as different service types, and it can
retain a single data flow design. DS jobs that are exposed as ISD services are
referred to as “ISD Jobs” throughout this section.
Figure 3-18 on page 90 is reproduced from the ISD manual and depicts its major
components. The top half of the diagram shows the components that execute
inside the WebSphere Application Server on which the Information Server
Domain layer runs. Information Server and ISD are types of J2EE applications
that can be executed on top of J2EE containers.
The bottom half of Figure 3-18 on page 90 presents components that belong to
the Engine Layer, which can reside either on the same host as the Domain, or on
separate hosts.
The endpoints forward incoming requests to the ASB adapter, which provides
load balancing and interfacing with multiple service providers. Load balancing is
another important concept in SOA applications. In this context, it means
spreading incoming requests across multiple DataStage engines.
An ASB agent implements the queuing and load balancing of incoming requests.
Inside a specific DS engine, multiple pre-started, always-on job instances of the
same job can service incoming requests concurrently.
For ISD applications, a direct mapping exists between a service request and a
wave. For each service request, an end-of-wave marker is generated. See 3.7.2,
“End-of-wave” on page 80.
After an ISD job is compiled and ready, the ISD developer creates an operation
for that job by using the Information Server Console. That operation is created as
part of a service, which is an element of an ISD application.
When the ISD operations, services, and application are ready, you can use the
Information Server Console to deploy that application. The result is the
installation of a J2EE application on the WebSphere Application Server instance.
The deployment results in the activation of one or more job instances in the
corresponding service provider, which are the DS engines that participate in this
deployment.
The DS engine, in turn, creates one or more parallel job instances. A parallel job
is started with an OSH process (the conductor). This process parses the OSH
script that the ISD job flow was compiled into and launches the multiple section
leaders and players that implement the runtime version of the job. This process is
represented by the dashed (green) arrows.
Incoming requests follow the path that is depicted with the red arrows. They
originate from remote or local applications, such as SOAP or EJB clients, and
even messages posted onto JMS queues. One endpoint exists for each type of
binding for each operation.
All endpoints forward requests to the local ASB adapter. The ASB adapter
forwards a request to one of the participating engines according to a load
balancing algorithm. The request reaches the remote ASB agent, which puts the
request in the pipeline for the specific job instance.
The ASB agent sends the request to the WebSphere Information Services
Director (WISD) Input stage for the job instance. The ISD job instance processes
the request (an EOW marker is generated by the WISD Input for each request).
The job response is sent back to the ASB agent through the WISD Output stage.
3.8.2 Scalability
ISD supports two layers of scalability:
Multiple job instances
Deployment across multiple service providers, that is, multiple DS engines
The first task is to ensure that the logic is efficient, which includes ensuring that
database transactions, transformations, and the entire job flow are optimally
designed.
After the job design is tuned, assess the maximum number of requests that a
single job instance can handle. This maximum is a function of the job complexity
and the number of requests (in other words, EOW markers) that can flow through
the job simultaneously in a pipeline. If that number of requests is not enough to
service the amount of incoming requests, more than one job instance can be
instantiated. In most cases, increasing the number of job instances helps meet
the service-level agreement (SLA) requirements.
However, cases might occur when you reach the maximum number of requests
that a single ASB agent can handle. At that point, the limit of a single DS engine is
reached, and the deployment must be spread across additional DS engines.
Remember that throughout this tuning exercise, we assume that you have
enough hardware resources. Increasing the number of job instances and DS
engines does not help if the CPUs, disks, and network are already saturated.
In an always-on job, the database stages must complete the SQL statements
before the response is returned to the caller (that is, before the result is sent to
the WISD Output stage).
The only exception to this rule in releases prior to 8.5 was the Teradata
Connector, which in Version 8.0.1 already supported an output link similar to
what is described in 3.8.5, “Information Services Director with connectors” on
page 98. However, prior to 8.5, the Teradata Connector did not allow more than a
single input link.
The Join stage guaranteed that the response was sent only after the database
statements for the wave completed (either successfully or with errors).
The logic that is implemented by the Column Generator, Aggregator, and Join
stages is repeated for each standard database stage that is present in the flow.
Synchronized results ensure that no other database activity occurs for a request
that is already answered.
Again, those steps were necessary in pre-8.5 releases. In IS 8.5, the Database
Connectors are substantially enhanced to support multiple input links and output
links that can forward not only rejected rows, but also processed rows.
Two alternatives for always-on 8.5 jobs are available for database operations:
DTS
Database Connector stages
Figure 3-22 on page 97 illustrates a DTS stage that is used in an ISD job. It
significantly reduces the clutter that is associated with the synchronization of the
standard database stages. We provide this illustration to give you an overall
perspective.
When used in ISD jobs, the Use MQ Messaging DTS property must be set to NO.
Although source and work queues are not present, MQSeries must still be
installed and available locally, because it acts as the XA transaction manager.
DTS must be used in ISD jobs only when there are multiple target database
types or multiple target database instances. If all SQL statements are to be
executed on the same target database instance, a database connector must be
used instead.
For Connector stages, an output link carries the input link data, plus an optional
error code and error message.
If configured to output successful rows to the reject link, each output record
represents one incoming row to the stage. Output links are already supported by
the Teradata Connector in IS 8.0.1, although that connector is still restricted to a
single input link.
Figure 3-23 shows an example of multiple input links to a DB2 Connector (units
of work with Connector stage in an Information Services Director job). All SQL
statements for all input links are executed and committed as part of a single
transaction, for each wave. An EOW marker triggers the commit.
All SQL statements must be executed against the same target database
instance. If more than one target database instance is involved (of the same or
different types), the DTS stage must be used instead.
This issue only occurs for jobs that are deployed as ISD providers and that meet
all of the criteria in the following list:
The job has an ISD Input stage.
The configuration file that is used by the job (APT_CONFIG_FILE) contains
multiple nodes.
Multiple requests can be processed by the job at the same time.
To avoid this problem, ensure that your ISD jobs use a single-node
configuration file. To do so, add the environment variable $APT_CONFIG_FILE to the
job properties at the job level and set it to reference a configuration file that
contains only a single defined node. After this change, you must disable the job
within Information Services Director, recompile it in DataStage Designer, and
re-enable it in the ISD Console.
Alternatively, if all jobs in the project are ISD jobs, you can change the
$APT_CONFIG_FILE environment variable at the DataStage project level to
point to a single node configuration file as the default.
Important: Even when making a project-level change, you must still disable
and then re-enable the jobs in the ISD Console. However, recompilation of the
job is not necessary.
When setting the variable at the project level, be careful that the individual jobs
do not override the project-level default by defining a value locally.
Consider the performance implications. For complex jobs that process larger
data volumes, limiting the number of nodes in the configuration file might result in
reduced performance on a request-by-request basis. To ensure that overall
performance remains acceptable, you can increase the number of minimum job
instances that are available to service requests. See 3.8.2, “Scalability” on
page 93 for more information about multiple job instances servicing requests, as
determined at deployment time through the IS Console.
The following scenarios justify the publication of DS server jobs as ISD services:
Publishing existing DS server jobs
Invoking DS sequences:
– Sequences are in the Batch category and are outside of the scope of this
chapter.
– Sequences cannot be exposed as ISD services.
DataStage was conceived as an extract, transform, and load (ETL) product in
1996, the same year that work on XML began at the World Wide Web Consortium
(W3C); XML 1.0 became a W3C Recommendation in 1998. Since then, DataStage has
constantly evolved, adding new features and capabilities and remaining a leading
ETL tool.
In this chapter, we review the available XML functionality in DataStage and how it
evolved over time. In the next several chapters, we explain the XML Stage, its
components, and how to work with it in detail. In the last two chapters, we
describe performance issues when handling large, complex XML documents and
designing XML-based solutions with DataStage.
Although the schema was used for validation at run time, and for metadata
import of the XPath and namespace information, the job developer needed to
manually design and configure the stage to ensure that the XML documents
were created to match the schema that was used. Additionally, when validation
was used, the schema had to be accessible at run time, either on disk or online
through a URL. The next generation of XML technology improves performance
and eliminates these restrictions.
The XML Stage uses a unique custom GUI, called the Assembly Editor, which is
designed to make the complex task of defining hierarchical relationships and
transformations as convenient and manageable as possible. Chapter 6,
“Introduction to XML Stage” on page 131 describes the Assembly Editor in great
detail. The XML Stage can also accept data from multiple input links and then
establish a hierarchical relationship between them within the assembly. For small
data volumes, this capability can simplify job
design requirements and make the jobs easier to maintain. Chapter 10,
“DataStage performance and XML” on page 291 describes performance tuning
for the XML Stage, including how to manage large volumes of relational data and
large XML document sizes.
The most important feature of the XML Pack Version 3.0 is its tight integration
with XSD schemas. After an XSD schema is imported by using the new Schema
Library Manager, the metadata from within the schema is saved in the metadata
repository. This decoupling means that the schema files do not have to be
accessible at run time.
This feature is important when deploying into restricted production environments.
The transformation and mapping functions directly link to the schema, so that the
job developer can focus on the business logic. The stage ensures that the
document conforms to the schema.
In the remaining chapters in this book, we fully explain the new XML Pack 3.0
functionality and performance tuning. We explain how to implement XML-based
solutions and migration strategies from XML Pack 2.0.
With these stages, you can supply input data to call a web service operation, and
retrieve the response as rows and columns. SOAP is an XML-based standard.
SOAP messages are XML documents that adhere to the SOAP schema.
The WSClient and WSTransformer stages can directly convert from relational to
SOAP XML, for simple structures. For more complex structures, with repeating
groups, the XML Stage is used in combination with the Web Service Pack stages
to compose or parse the SOAP message body or header, as needed. See 11.1,
“Integration with Web services” on page 330 for a complete solution to implement
a web service scenario with a complex structure in the SOAP message.
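To make the structure of a SOAP message concrete, the following is a minimal
sketch of a SOAP request envelope. The operation name, service namespace, and
field are hypothetical and are not taken from the Web Service Pack or from any
generated WSDL.

<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:svc="http://example.com/claims">
  <soapenv:Header/>
  <soapenv:Body>
    <!-- Hypothetical operation that looks up a claim by its number -->
    <svc:getClaim>
      <svc:claim_number>A100</svc:claim_number>
    </svc:getClaim>
  </soapenv:Body>
</soapenv:Envelope>

A request whose Body contains repeating groups is the case in which the XML
Stage is combined with the Web Service Pack stages, as described above.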
InfoSphere Metadata Workbench supports both Data Lineage and Job Lineage
reports. Job lineage provides information about the flow of data through jobs.
Data lineage represents the flow of data, in either direction, from a selected
metadata asset. The data passes through a series of DataStage stages, including
an XML Stage, and a sequence of DataStage jobs, and ends in databases or other
targets, such as business intelligence (BI) reports.
The DataStage client includes Windows command-line utilities for automating the
export/import processes, when needed. Using the utilities in conjunction with the
XML export format was one method of integrating with external source control
systems. The built-in source control integration that is now available with
Information Server largely replaces this manual XML-based method.
The Information Server Manager tool and its command-line version istool
provide the source control integration features and robust packaging options to
give the developers greater control of how objects are backed up and deployed.
The .isx file that is created by Information Server Manager is actually a
compressed file in JAR format. The files inside it, which represent the contents,
are XML files. Of particular interest is the manifest file, which is located
within the export file at ../META-INF/IS-MANIFEST.MF. This file is an XML file, and
it contains an index of all the repository objects that are contained within the
export file. Figure 4-2 on page 111 contains an excerpt from an IS-MANIFEST.MF
file.
Each object that is contained in the export file includes an <entries> element.
The developer can extract the IS-MANIFEST.MF file and use this index to identify
the contents of the file as needed without running the file through the tool. For
information about the Information Server Manager, see the links to the online
product documentation in “Related publications” on page 375.
You can access the information in the metadata repository from any of the suite
components, which means that metadata can be shared by multiple suite
components. For example, after importing metadata into the repository, an
authorized user can then use the same metadata in projects in any of the suite
components, such as IBM InfoSphere FastTrack, InfoSphere DataStage, or
InfoSphere QualityStage.
The XML Stage can be used to read, write, and transform XML data. To read,
write, and transform XML files, the XML schema is required. Without a valid
schema, you cannot use the XML Stage. The XML schema defines the structure
of the XML file. That is, it defines the metadata of the XML file. The XML schema
is analogous to the columns that are defined in the Sequential File stage. The
columns in the Sequential File stage describe the data that is read by that stage.
You must store the XML schema in the metadata repository to use it in the XML
Stage while designing the DataStage job. You use the Schema Library Manager
to import the XML schema into the metadata repository. You access the Schema
Library Manager from the DataStage Designer Client by selecting Import >
Schema Library Manager, as shown in Figure 5-1.
Use the following steps to create a library in the Schema Library Manager:
1. In the Schema Library Manager window, click New Library. The New
Contract Library window opens, as shown in Figure 5-3.
2. In the New Contract Library window, in the Name text box, enter the name of
the library, which can be any valid name that begins with an underscore,
letter, or colon. Enter the library name as Test, as shown in Figure 5-4 on
page 116.
3. In the Description text box, enter a short optional description of the library.
Enter This is a test library in the Description field, as shown in
Figure 5-5. Click OK.
Figure 5-5 Description text box in the New Contract Library window
A library with the name Test is created, as shown in Figure 5-6 on page 117.
Multiple libraries can be created in the Schema Library Manager, but each library
name must be unique; two libraries cannot have the same name. The names are
case-sensitive, so the names Test and test are treated as distinct.
A library can be accessed from all DataStage projects; therefore, the library
Test is available in every DataStage project.
2. Click Import New Resource. You can browse through the folders in the client
machine. Select the schema employee.xsd and click Open. The
employee.xsd file is imported. You are notified through the window, as shown
in Figure 5-8. Click Close.
Figure 5-8 The notification window while importing a schema into the library
The employee.xsd is imported into the library Test, as shown in Figure 5-9 on
page 119.
Two schemas with the same name cannot be imported into the same library. But,
two libraries can hold the same schema. Therefore, you can have the same set of
schemas in two separate libraries.
A library can be added to a category in two ways: while creating the library and
after creating the library. Follow these steps:
1. When you create a library, you can enter a name in the Category text box, as
shown in Figure 5-10 on page 120.
2. If the library is created without a category, the library can still be categorized.
To categorize an existing library, select the library. The details of the library
are listed, as shown in Figure 5-11. You can enter a category name in the
Category text box.
When you add the library Test to the Category Cat1, you create a folder called
Cat1. The Test library is in the folder, as shown in Figure 5-12 on page 121.
You cannot have two categories with the same name; each category must have a
unique name. The names are case-sensitive, so the names Cat1 and cat1 are
treated as distinct. Two categories cannot contain the same library. For example,
the library Test cannot exist in two separate categories.
The global elements of the schema are shown on the Types page. In employee.xsd,
there are two global elements, as shown in Figure 5-13. The elements do not
have a namespace; therefore, no namespace is shown.
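For orientation, the following is a minimal sketch of what a schema such as
employee.xsd might contain: two global elements and no target namespace. The
element names and types shown here are illustrative assumptions, not the actual
contents of the file.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- First global element -->
  <xs:element name="employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:int"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- Second global element -->
  <xs:element name="department" type="xs:string"/>
</xs:schema>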
When you click the global element employee, you can see its structure on the
right side of the window, as shown in Figure 5-14 on page 123.
In these cases, you must modify the file location attribute of the referenced file
in the Schema Library Manager, as shown in Figure 5-15, so that it matches the
location that is used by the INCLUDE statements (see the sketch after these steps).
Figure 5-16 Changing file location to match the URL in the INCLUDE statement
c. When you change the file location name, the library turns green, which
indicates that the library is valid.
2. The Schema Library Manager can import an entire compressed file that
contains many XSD files. When a compressed file is imported, the Schema
Library Manager imports all the files in the compressed file and sets the
location attribute to the relative directory from the root of the compressed file.
This feature can save the tedious work of importing all the files one at a time
and updating their locations.
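The following hedged sketch illustrates the relationship between an INCLUDE
statement and the file location attribute that is described in these steps. The
file names and the relative path are assumptions, not actual schemas in the
product.

<!-- main.xsd (hypothetical): includes a shared definitions file by a relative path -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="common/types.xsd"/>
  <!-- employeeType is assumed to be defined in the included types.xsd file -->
  <xs:element name="employee" type="employeeType"/>
</xs:schema>

In this sketch, the file location attribute of types.xsd in the Schema Library
Manager must be set to common/types.xsd for the library to turn green and validate.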
To view the list of errors, click Validate. Whenever you modify a schema, delete a
schema from the library, or add a schema to the library, the library is
automatically revalidated.
Figure 5-19 Accessing the Schema Library Manager through the Assembly Editor
In this chapter, we described how to use the Schema Library Manager to import
XML schemas into the metadata repository so that they can be used in the XML
Stage while designing the DataStage job. We also described how these schemas
are grouped into libraries that help in the management of many schemas inside a
single project. We explained the benefits of using XML schemas inside a
common repository. Think of putting the schemas in the repository as a first step
in configuring the XML Stage. In the subsequent chapters, we describe the XML
Stage in detail.
Now, InfoSphere DataStage supports two XML solutions: the XML Pack and the
XML Stage. The XML Pack, which includes the XML Input, XML Output, and
XML Transformer stages, is useful to perform simple transformations that do not
involve a large amount of data. The XML Stage is the best choice for an XML
solution, because it can perform complex transformations on a large amount of
data.
The XML Stage differs from other known XML tools because of its powerful
execution, which can process any file size, with limited memory and in parallel.
The XML Stage has unique features, such as the ability to control and configure
the level of validation that is performed on the input XML data. Language skills,
such as Extensible Stylesheet Language Transformation (XSLT) or XQuery, are
not required to use the XML Stage.
The XML Stage consists of the Stage Editor and the Assembly Editor. The Stage
Editor is the first window that opens when you click the stage, as shown in
Figure 6-1. The runtime properties of the stage are defined in the Stage Editor.
Click Edit Assembly on the Stage Editor to open the Assembly Editor. You
design and create an assembly in the Assembly Editor. An assembly consists of a
series of steps that enriches and transforms the XML data, as depicted in
Figure 6-2 on page 133.
When you open the Assembly Editor, as shown in Figure 6-3, by default, it
includes the Input and the Output steps.
This pictorial representation of the schema in the assembly helps you to see the
structure of the incoming XML and clearly establish the relationship between
various elements within the schema.
<xs:element name="pet">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
</xs:sequence>
<xs:attribute name="type" type="xs:string"/>
<xs:attribute name="age" type="xs:int"/>
<xs:attribute name="breed" type="xs:string"/>
</xs:complexType>
</xs:element>
NMTOKENS-type representation
The XML schema can have elements of type NMTOKENS. For these elements,
maxOccurs might not be set to a value greater than 1 in the XML schema. Even so,
these elements are represented by a list in the assembly, because the NMTOKENS
data type represents an array of values separated by spaces. Because multiple
data values are held by a single element, the assembly represents the element
as a list. As shown in Figure 6-8, the element "codes" is represented by a list
with the same name. The actual value of the element is held by the item text().
<xs:element name="appliance">
<xs:complexType>
<xs:sequence>
<xs:element name="codes" type="xs:NMTOKENS"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Figure 6-9 NMTOKENS with maxOccurs greater than 1 representation in the Assembly
xs:List representation
An element of type xs:list in the schema is also represented by a list in the
assembly, as shown in Figure 6-10.
<xs:element name="appliance">
<xs:simpleType>
<xs:list itemType="xs:int"/>
</xs:simpleType>
</xs:element>
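In both cases, an instance document carries the values as a single
space-separated string, which is why the assembly represents the element as a
list. The following fragments are hedged examples for the appliance definitions
above; the data values are made up.

<!-- NMTOKENS: the codes element holds several space-separated tokens -->
<appliance>
  <codes>E10 E20 E30</codes>
</appliance>

<!-- xs:list of xs:int: the appliance element itself holds space-separated values -->
<appliance>12 34 56</appliance>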
<xs:element name="colony">
<xs:complexType>
<xs:sequence>
<xs:element name="house" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="number" type="xs:integer"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="pet">
<xs:complexType>
<xs:choice>
<xs:element name="dog" type="xs:string"/>
<xs:element name="cat" type="xs:string"/>
<xs:element name="parrot" type="xs:string"/>
<xs:element name="cow" type="xs:string"/>
</xs:choice>
</xs:complexType>
</xs:element>
For an element with mixed content, two new items are added in the assembly:
@@mixedContent and @@leadingMixedContent:
– @@leadingMixedContent captures the mixed content at the start of the
element.
– @@mixedContent captures the data that follows the element. As shown in
Figure 6-14, the complex element "location" is of mixed type.
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="age" type="xs:int"/>
<xs:element name="dob" type="xs:date"/>
<xs:element name="location">
<xs:complexType mixed="true">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
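The following hedged instance fragment shows how the two items are populated for
the mixed-type location element defined above; the text values are made up.

<location>Currently at <address>742 Evergreen Terrace</address> until further notice</location>

In the assembly, @@leadingMixedContent holds the text "Currently at ", and
@@mixedContent holds the text " until further notice" that follows the address
element.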
When an XML schema is imported into the library, it is translated into a simplified
model. All the Transformation steps use these simplified concepts. For example,
in the Aggregate step, aggregating an attribute of type double is the same as
aggregating an element of type double. The only steps that are sensitive to the
original XML schema concepts are the XML Parser and Composer steps. Because
these steps preserve the information in the XML schema, they can recreate the
XML schema concepts and adhere to them, as needed.
Figure 6-18 Comparison between DataStage jobs and Assembly computational model
The input and output steps of the assembly have a distinguished role. The Input
step transforms relational data into hierarchical data, and the Output step
transforms hierarchical data to relational data. Each input/output link to the XML
Stage turns into a list in the assembly. Each column turns into an item of the list,
as shown in Figure 6-11 on page 139.
In the assembly, the entire input tree is rooted in a list named top. Therefore, top
is the parent of all lists and items in the assembly. Consider that the XML Stage
has two input links: DSLink1 and DSLink2. As shown in Figure 6-19 on page 144,
these links turn into a list in the assembly, and both links are present under the
list top.
Similarly, when multiple steps are used, the input and output of each step are
present under the list top.
When input links exist to the XML Stage, all the input lists in the assembly are
categorized under the group InputLinks. This group is the child of the list top, as
shown in Figure 6-19. Even when no input links exist to the XML Stage, the
group InputLinks is still present under top, but it does not contain any child lists
or items.
The steps in the assembly pass the entire input to the output along with an
enrichment branch that contains the result of the step. The output of a step is
always contained within the top-level list through which the current step iterates.
The Assembly Editor presents the input and output trees for each step. The
Assembly Editor highlights the enrichment branch that the current step
computed, so that the enrichment branch can be distinguished from the original
input of the step. As shown in Figure 6-20 on page 145, the output of the Sort
step is highlighted by graying out the area.
This powerful assembly data model simplifies the transformation description into
a series of steps rather than a directed graph.
As depicted in the scenario overview figure, a web application captures the
customer claim, and the claim data is converted to XML files.
First, the company parses the XML file with either the XML Input stage or the
XML Stage within the DataStage Designer. The company selects the XML Stage,
because it can process large XML files easily. Within the XML Stage, the Parser
step is used, and it is added to the Assembly Editor. Follow these steps:
1. Open Assembly Editor and click Palette in the Assembly Outline bar. The
steps are shown in the Palette, as shown in Figure 7-2.
The company stores all the XML files in two central repositories: the database
and one of its internal servers. The Parser step must read the XML files from
these locations to parse the XML data. In the XML Parser step, the method of
reading the XML data must be specified in the XML Source tab. A single Parser
step cannot read the XML data from multiple sources at the same time.
The Input step in the Assembly Editor describes the metadata of a relational
structure. The step becomes active when an input link is provided to the XML
Stage. Here, you can view the columns that are defined in the stage that precedes
the XML Stage. You can modify, remove, or add columns. When the input is
provided from the DB2 Connector to the XML Stage, the Input Step becomes
active and the DB2 column, claim_XML, is displayed, as shown in Figure 7-5.
To read the data from the database column, you must configure the XML Source
tab of the Parser Step. You must use the String set option in the Parser Step to
read the XML data from the database column claim_XML. Use the String set
option when the entire XML file must be read from a previous stage in the
designer canvas. Another stage reads the XML data and passes it on to the XML
Stage.
After following these steps, you can see that the XML Source tab is configured,
as shown in Figure 7-6 on page 151.
Figure 7-6 The XML Source tab in the XML Parser Step
You can use the File set option in the XML Source tab of the Parser Step to read
multiple files that need to be parsed by the stage. You need to pass the absolute
location of the input XML files to the File set option. The File set option reads the
files from the folder based on the absolute paths of the files.
You can use one of the following approaches to feed the absolute path of the
XML files to the XML Stage:
A single flat file can be created to store the absolute path of each input XML
file. This flat file can be read by using the Sequential File stage in the designer
canvas. The input of the sequential file must be given to the XML Stage, as
shown in Figure 7-7. With this approach, you need to make an entry in the flat
file for every new XML file.
The External Source stage in the parallel canvas can be used to find the
absolute paths of the XML files. The output of the External Source stage must
be given to the XML Stage, as shown in Figure 7-8. The advantage of using
the External Source stage is that you do not need to enter the paths manually
in a flat file. The stage can be configured to run a command and fetch all XML file
locations at run time from the folder. For example, a command that reads the XML
file names from the folder C:\files is ls C:\files\*.xml (ls is the command to list
files).
Figure 7-8 External Source stage reads multiple XML documents job design
Figure 7-9 File_paths column in the Input Step in the Assembly Editor
Use the following steps to configure the File set option in the XML Parser Step:
1. Select File set option in the XML Source tab. The drop-down list under the
File set option becomes active.
2. Click the drop-down list. The input columns to the XML Stage appear.
All input columns appear beginning with top/InputLinks followed by the
name of the input link to the XML Stage. Therefore, if the input link is DSLink2
and the input column name is file_paths, the column name is
top/InputLinks/DSLink2/file_paths.
3. Select the column file_paths from the drop-down list. The column file_paths
holds the absolute location of the XML files.
Figure 7-10 File set option configuration in the XML Parser Step
In this job design, the XML Stage is the source stage in the designer canvas and
no input is provided to it, as shown in Figure 7-12.
Figure 7-12 Job design when the XML file is read from the XML Stage
The fictitious healthcare company based its input files on an XML schema. This
schema needs to be defined in the Document Root tab of the XML Parser Step,
as shown in Figure 7-13. The schema defined in the Document Root tab
describes the common structure of all input XML files to be parsed by the XML
Parser step.
Use the following steps to define the schema in the Document Root tab:
1. Click Browse in the Document Root tab of the Parser Step, and the Select
Document Root window opens. This window shows all the available libraries
in the Schema Library Manager.
2. In the Select Document Root window, select the top-level element on which
the XML file is based. The top-level element in the claim files of the company
is Claim, as shown in Example 7-1. Therefore, you select the item Claim in
the Select Document Root window.
3. After you select the element Claim in the Select Document Root window, you
can see the structure of the element on the right side of the window, as shown
in Figure 7-15 on page 158.
4. Click OK to close the Select Document Root window. The schema structure is
defined in the Document Root tab, as shown in Figure 7-16 on page 159.
The degree of validation can be selected through the Validation tab in the XML
Parser Step, as shown in Figure 7-17. The Validation tab presents a set of
validation rules. Each rule consists of a condition and an action. The condition is
the validation check. If the check fails, the action is performed.
To ensure that the XML file strictly conforms to the schema, Strict Validation
mode can be selected. In Strict Validation mode, all the validation rule actions are
set to Fatal to ensure that the job cancels on the first occurrence of invalid data.
Each of the validation rules can be configured to perform a specific action. When
you click the action for a particular rule, the drop-down list shows the available
actions that can be selected for that rule, as shown in Figure 7-18 on page 162.
While entering the claim information in the web application, the customers of the
healthcare company might make errors, such as the following errors:
Not provide any data for the Phone Number field
Add leading or trailing spaces in the name fields
Provide a claim code that is invalid
The default Phone Number can be specified in the XML schema or in the
Administration tab in the Assembly Editor, as shown in Figure 7-19. The Phone
Number field is a string field, and therefore, the default value must be entered in
the text box next to String in the Default Data Type Values tab. To add this default
value to the element during parsing, you must set the action for the validation rule
Use global default values for missing values to Yes.
Figure 7-19 The Default Data Type Values window in the Administration tab
To trim the leading and trailing spaces in each of the fields, you must set the
action for the validation rule Trim Values to Yes.
<Claim>
<Claim_Information>
<claim_line_number>2</claim_line_number>
<claim_number>A100</claim_number>
<member_id>12Z00</member_id>
<provider_id>00X0</provider_id>
<claim_code>ZZCVTBJU</claim_code>
<phone_number></phone_number>
<firstname> James </firstname>
</Claim_Information>
</Claim>
The valid claim_code values are specified in the XML schema. To ensure that the
value in the XML file matches the value in the XML schema, the validation rule
Value fails facet constraint is set to Log once per document. The validation rule
Value fails facet constraint ensures that the incoming data value is in the range
specified in the schema.
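A facet constraint of this kind is typically declared as an enumeration in the
schema. The following is a hedged sketch; the specific claim codes are
assumptions and are not the company's actual list.

<xs:element name="claim_code">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <!-- Only the enumerated codes are valid; the value ZZCVTBJU in Example 7-2 fails this facet -->
      <xs:enumeration value="AB100"/>
      <xs:enumeration value="CD200"/>
      <xs:enumeration value="EF300"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>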
For the XML document shown in Example 7-2, if these validation rules are
applied, the following actions occur while parsing:
For the Phone Number fields, the default value "1-333-333-0000" is added.
This value needs to be defined in the Administration tab.
The final configuration of the Validation tab can be seen in Figure 7-20.
Figure 7-20 Validation in the Configuration tab in the XML Parser Step
When the Reject action is selected for any of the validation rules, two new items,
named success and message, are added to the output schema of the XML Parser
step, as shown in Figure 7-21. Success holds the value true if the validation
passes and false if the validation fails. Message holds the error message if the
validation fails. Based on the value of the success item, the XML documents can
be divided into valid and invalid files.
Figure 7-21 The output of the XML Parser step after using the Reject action
For example, assume that the rule Data type contains an illegal value. This rule
checks whether the value specified for each of the elements/attributes in the XML
file is valid and within the range for the specified data type. When the reject
action is selected for this rule and if any of the elements or attributes have invalid
data, the item success holds the value false and message holds the reason for
the failure. Based on the value of success, the files can be filtered into valid and
invalid files. For more detail, see Chapter 9, “Transforming XML documents” on
page 265.
Even if only one of the data values in the XML file is invalid, the value of the
success item is false. Therefore, the entire XML file is treated as invalid.
In Figure 7-22 on page 168, we show the actual schema representation in the
Document Root tab of the XML Parser Step. The output of the XML Parser Step
is seen in the Output tab on the rightmost section of the Assembly Editor. The
output of the XML Parser Step shows all elements converted to string elements,
because the action for the validation rule Data type has an illegal value is set to
ignore.
When the String set or File set option is selected in the XML Source tab of the
XML Parser Step, the XML_Parser:result is under the input list. As shown in
Figure 7-23, the XML_Parser:result is under the list DSLink2. If the Single file
option is selected, the XML_Parser:result is under the list top.
The healthcare company wants to parse the XML claim files and store the data in
a database. The Parser step, which we described in 6.1, “The XML Stage” on
page 132, parses the claim files. To write the parsed data into the different
columns of the database, the Output step is required.
When an output link is provided from the XML Stage, the link is represented as a
list within the assembly. The output columns are represented as items in the
assembly. The columns for each of the output links are visible in the Output tab of
the Output Step. As shown in Figure 7-25 on page 171, the output link DSLink4
has five columns in it. These columns can be modified or removed, or new
columns can be added in the Output tab.
To map the output of the Parser step to the columns of the DB2 Connector, the
Mappings tab is used. The Mappings table is a common construct to map one
hierarchical structure to another. The Mappings table is used as the last step in
adapting the input to the output structure. You cannot perform a join, union, or
any other set-operation and map to a target list. If a set-operation is needed, it
needs to be performed in a Transformation step before the Output step. As
shown in Figure 7-26 on page 172, the Mappings table contains three columns:
Source, Result, and Target. The Target column displays the output link and the
columns under it. The Source column displays the columns that need to be
mapped to the target items. The Result column displays the status of the current
mapping.
The top list is always the first list in the Mappings table and is always mapped to
the source item top. This mapping cannot be configured. Top contains the group
OutputLinks. The entire output structure is contained in this group.
Following these rules, the output of the Parser step must be mapped to the target
columns in the Output step, as shown in Figure 7-27.
Figure 7-27 Mapping Parser step output and the target columns in the Output step
2. Because the source columns that need to be mapped to the target columns
are under the source list Claim_Information, this column is selected from the
suggestion list. When you select Claim_Information, the mapping occurs for
this row and a white check mark in a green circle shows between the source
and target lists, as shown in Figure 7-29 on page 175.
3. After mapping the target list DSLink4, the columns under it can be mapped.
To map each of the target columns, select the corresponding row in the
Source column. From the suggestion list, select the appropriate item to be
mapped. The item recommendations in the suggestion list are based on the
name and data type of the Target column.
After mapping all the columns, the Mappings table is complete. No errors are
seen in the Output Step, as shown in Figure 7-30 on page 176.
The mapping occurs based on the similarity between the name and data type of
the source and target columns. The automatic mapping suggestion might not
always be correct. The Auto Map feature also cannot map columns of different
data types. Therefore, if the Target column is an integer and all source columns
are strings, the Auto Map feature does not map that column automatically.
2. Select the row corresponding to the target list DSLink4, and click Propagate.
2. Select the row in the source corresponding to the target item ProviderID. The
suggestion list is shown in Figure 7-34 on page 180.
3. The last option in the suggestion list is Constant. Select the Constant option
to open the Constant window. In the Constant window, as shown in
Figure 7-35, you can enter the value that needs to be assigned to the target
item. To assign the value '00X0', enter '00X0' in the Constant window and
click OK.
3. The Target window opens, as shown in Figure 7-38 on page 183. In this
window, you can see the entire schema tree and select the item to map.
In certain instances, you might want to map the parent and child fields to
columns under the same output link in the Output step. This way, you relate the
parent and child columns in the output file or database. For example, in
Figure 7-40, the source items claim_type and claim_line_number can be
mapped to the columns under the same output link. These source items are not
under the same list. For example, the claim_type item is the child of
Claim_details, and Claim_details is the parent of Claim_Information.
To map the source columns to the target columns, first the source list and target
list need to be mapped. But as you can see in Figure 7-41, no lists show in the
output of the XML Parser step. To map them, use the following options.
2. Select the Group To List projection. The group changes into a list, as shown
in Figure 7-43 on page 187.
Figure 7-44 The complete mapping between the source and target structures
When the Single File option is selected in the XML Source tab of the XML Parser
Step, the output of the XML Parser Step is directly under the list top. Therefore,
in the Output Step, the list top can be mapped to the output link to map the items
under the XML Parser Step, as shown in Figure 7-45 on page 188. The warnings
in the mapping are the result of the difference in data types between the Source
column and the Target column.
The XML Stage is a schema-driven stage. To parse any XML document, the
schema needs to be defined in the Document Root tab of the XML Parser Step.
This XML schema is converted into an internal model by the XML Stage. When
large schemas are used, a large amount of memory is required to store the
schema in the internal XML Stage model, because every aspect of the schema
needs to be saved. Therefore, for large schemas, the representation of the XML
schema in the assembly is changed.
For large schemas, several elements are automatically chunked. Chunking refers
to converting all child items under an element to a single item.
The leftmost diagram in Figure 7-46 on page 189 shows the actual schema
representation. When this schema is part of another large schema, the element
Claim_Information might be automatically chunked, as also shown in Figure 7-46.
When an element is automatically chunked, its child items are not available for
mapping in the Output step. For example, you cannot map the element
claim_line_number to the Target column in the Output step, because the Output
step sees only the single element e2res:text() under Claim_Information.
After configuring the second XML Parser step, the child items of the
auto-chunked element are available for mapping in the Output step.
Assume that the auto-chunked element, book, is under the group library. The
parent list of the element book is top. Therefore, the output of the second
XML Parser step is placed under the list top, as shown in Figure 7-49 on
page 192.
Two fields are defined in the Administration tab, as shown in Figure 7-51. You can
use these fields, Schema Trimming starts at and Maximal Schema Tree size, to
specify thresholds on the expected size of schema trees that need to be
imported. When the number of imported elements reaches these thresholds,
XML Stage starts auto-chunking the schema elements. These thresholds are
soft limits.
The XML Stage starts to chunk the top-level XML elements when the tree size
reaches the value for Schema Trimming starts at. When the tree size reaches the
Maximal Schema Tree size, the XML Stage auto-chunks all the qualified elements.
If you do not use a large schema inside the assembly, there is no user behavior
change. The field Schema trimming starts at defaults to 500, and the field
Maximal Schema Tree size defaults to 2000. For schemas whose size is smaller
than 500, XML Stage does not auto-chunk any schema elements. When you
need to import a large schema into the assembly, configure the two fields in the
Administration tab and use multiple parsers to extract the auto-chunked
elements.
Because the values are defined inside the Assembly Editor, they apply to the
particular assembly in use. When the schema is defined in the Document Root
tab of the XML Parser step, the representation of the schema depends on the
values for the Schema Tree Configuration fields at that instance. After defining
the schema in the Document Root of the XML Parser step, the values can be
changed. The changed values do not affect the schema representation in the
XML Parser steps in which the Document Root is already defined. Therefore, in a
single assembly, each XML Parser step can be configured to use different
configuration values.
For example, suppose that you configure the first XML Parser step, save the
Assembly Editor by closing it, and then reopen it. When you add the second XML
Parser step to parse the auto-chunked element, the Document Root is not
automatically defined. You must go to the first parser and set the Document Root
again. You do not need to delete or reconfigure the first parser. You merely need
to reset the Document Root.
This step also applies to a scenario with several parsers in the assembly. The
first XML Parser is used to import the whole schema. The rest of the Parsers are
used to parse several auto-chunked elements from the output of the first parser.
If you reopen the job to add another parser to parse another auto-chunked
schema element, you need to reset the Document Root in the first parser.
Use the XML Parser step to parse the XML file, extract a single element from it,
and write the XML file along with the column back into the database.
Use the chunk operation in the XML Parser step to perform these operations.
This operation is manual as opposed to the automatic chunk described in 7.4,
“Parsing large schemas” on page 188.
The chunk feature is available in the Document Root tab of the XML Parser step.
You can select any element in the Document Root window and right-click that
element to activate the chunk option, as shown in Figure 7-52 on page 196.
Chunking refers to converting all child items under an element to a single item. In
Figure 7-53 on page 197, we show the actual schema representation. When the
element Claim is manually chunked, it is represented with the symbol <>. All the
child items of the element Claim are contained in this single element, as shown in
Figure 7-53 on page 197.
To extract the item claim_number, the chunk Claim needs to be passed through a
second XML Parser step. Use the following steps to parse the element Claim:
1. Add a second XML Parser step to the assembly.
2. In the second XML Parser step, use the String Set option to read the chunked
element. Select the String Set option in the XML Source tab. From the
drop-down list, select Claim. The String Set option reads the XML data from
the first XML Parser step.
3. As soon as the String Set option is configured, the Document Root is
automatically set by the XML Stage. Therefore, you do not need to configure
the Document Root tab for the second XML Parser step.
After configuring the second Parser step, the chunked element and its child items
are available for mapping in the Output step, as shown in Figure 7-54. You can
map the items Claim and claim_number to two columns in the Output step.
However, if necessary, the assembly can first perform the XSLT transformation,
and it reads the result of the XSLT as though it is the content of the original file.
Therefore, the Document Root element reflects the result of the XSLT and not the
original XML input document, as shown in Figure 7-55.
Figure 7-55 Processing XML documents that incorporate XSLT style sheets
You must paste the XSLT style sheet in the Enable Filtering section of the XML
Source window in the XML Parser step. To enable the Enable Filtering option,
first the XML Source needs to be defined, and then, you must select Enable
Filtering. A default XSLT style sheet shows in the area under the Enable
Filtering option, as shown in Figure 7-56 on page 199. Without selecting an XML
Source, the Enable Filtering option is not enabled. You can replace the default
style sheet with a style sheet of your choice.
The healthcare company uses several XML files that do not conform to the XML
schema. That is, the top level of the XML file and the schema do not match. But
the company still wants to parse these files by using the XML Stage. To ensure
that the XML documents conform to the XML schema, the company wants to
apply an XSLT style sheet to the XML documents. The XSLT must be pasted in
the Enable Filtering section of the XML Source tab of the XML Parser step, as
described earlier.
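The following is a minimal sketch of such a style sheet, assuming that the schema
expects Claim as the document root while the incoming files use a different
top-level element name. It renames the root element and copies everything else
unchanged, so the Document Root that is defined in the Parser step then matches
the result of the transformation.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Rename the nonconforming top-level element to the name that the schema expects -->
  <xsl:template match="/*">
    <Claim>
      <xsl:apply-templates select="@*|node()"/>
    </Claim>
  </xsl:template>
  <!-- Identity template: copy all other attributes and nodes unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>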
As shown in the scenario overview figure, customer information that is stored in
the database is used to create XML files, which a customer application then uses
to create reports.
The bank uses IBM DB2 as the database storage. The customer information is
stored in multiple columns in a single database table. Therefore, in the
DataStage job, the source of input data is DB2. In the DataStage Designer
canvas, the DB2_Connector_Stage can be used to read the relational columns
from the database. The output of the DB2_Connector_Stage must be fed to the
XML Stage. The overall job design is shown in Figure 8-4.
Figure 8-4 The DataStage job to read data from DB2 and create an XML file
The bank wants to store the output XML files in two kinds of columns in the
database: in a pureXML column and in a large object column. The Composer
step writes the XML files in these locations. In the XML Composer step, the
method of writing the XML file must be specified in the XML Target tab. A single
Composer step cannot write the XML data to multiple destinations at a time.
In the DataStage Designer canvas, the DB2 Connector stage can be used to
write the XML files into the database table. The XML files are stored in a single
column xml_data in the database. The DB2 Connector can access the correct
database table to write into this particular column. Because the XML Stage
needs to pass the created XML files to the DB2 Connector stage, the output of
the XML Stage must be given to the DB2 Connector stage, as shown in
Figure 8-6.
Figure 8-6 DataStage job design: XML file written into pureXML column of the database
To pass the XML files to a downstream stage within DataStage, you must use the
Pass as String option in the XML Target tab of the Composer step. When the
Pass as String option is used, the Composer step creates the XML files, and
another stage within DataStage actually writes these files to the target.
Select Pass as String in the XML Target tab, as shown in Figure 8-7 on
page 207. The XML Target tab is then configured to pass the created XML files to
a downstream stage.
To pass the XML files to a LOB column in the DB2 Connector stage, the Pass as
Large Object option must be used in the XML Target tab of the Composer Step.
The Pass as Large Object option is used to pass the XML files to a LOB-aware
stage. Therefore, the file is created in the Composer step and passed to a
LOB-aware stage in DataStage, which actually writes the file to the target. The
job design is shown in Figure 8-8.
Figure 8-8 DataStage job design: XML file is written to LOB column of the database
Figure 8-9 The XML Target tab LOB option in the Composer Step
In this job design, the XML Stage is the target stage in the designer canvas. No
output is provided from the XML Stage, as shown in Figure 8-11.
Figure 8-11 DataStage Job Design: XML file is written from the XML Stage
To define the schema in the Document Root tab, follow these steps:
1. Click Browse. The Select Document Root window opens. This window shows
all the available libraries in the Schema Library Manager. Each library shows
the elements of the schema that are imported into it.
The bank created the library Account_Info in the Schema Library Manager and
imported the xsd file Account.xsd into it. When you click Browse in the
Document Root tab of the Composer Step, you can see the library in the Select
Document Root window, as shown in Figure 8-13 on page 211.
2. In the Select Document Root window, select the top-level element of the XML
file. The output XML files of the bank must have Account as their top-level
element, as shown in Example 8-1. Therefore, you must select the item
Account in the Select Document Root window.
3. When you select the element Account in the Select Document Root window,
you can see the structure of the element on the right half of the window, as
shown in Figure 8-14 on page 212.
4. Click OK to close the Select Document Root window. The schema structure is
defined in the Document Root tab, as shown in Figure 8-15 on page 213.
The XML Stage offers unique validation capabilities for you to control the amount
of schema validation when composing XML documents. When the XML
Composer step produces the XML document, it always creates a well-formed
document. It also validates the incoming data from the previous steps against the
document root element structure. The spectrum of validation, which ranges from
minimal validation to strict schema validation, allows you to control the trade-off
between performance and strictness of validation.
The degree of validation can be selected through the Validation tab in the XML
Composer Step, as shown in Figure 8-16. The Validation tab presents a set of
validation rules. Each rule consists of a condition and an action. The condition is
the validation check, and when the check fails, the action is performed.
Minimal validation mode sets all the validation rule actions to Ignore. When
Minimal validation is selected in the Composer step, no validation occurs. The
Composer step, by default, always produces a well-formed XML document. Even
when Minimal Validation is selected, the Composer step creates a well-formed
XML document.
Each of the validation rules can be configured to perform a specific action. When
you click the action for a particular rule, the drop-down list shows the available
actions that can be selected for that rule, as shown in Figure 8-17.
Figure 8-17 The actions that are available for the validation rules
The bank wants to take the following action for each of these errors:
Log an error when the date formats are incorrect.
Log an error when the Account Type field is an incorrect value.
To log error messages, the actions Log per Occurrence and Log once per
Document are available. These messages are logged in the DataStage Director
Log. Log per Occurrence logs the message for data that fails the validation rule.
Log once per Document logs the first error message only when the validation
rule fails. For example, in the XML document shown in Example 8-2, the Account
Type fields contain invalid data. Log per Occurrence logs messages for each of
the account types, and therefore, two error messages show in the Director log.
Log once per Document logs the message only for the first invalid account type,
for example, Private.
The valid Account Type values are specified in the XML schema. The data that
comes from the database must match the value in the XML schema. If the data
does not match, an error message needs to be logged. To log an error message
for every invalid Account Type, the validation rule Value fails facet constraint is
set to Log per Occurrence.
For the XML document that is shown in Example 8-2 on page 216, if these validation rules are applied, the following results occur when the job runs:
An error message is logged for the date of birth field, because the value
02-07-1980 is not in the correct format.
Two error messages are logged for the Account Type field for the values
Private and non-savings.
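For reference, a hedged sketch of the kind of document fragment that triggers these messages follows; the element names are illustrative, and only the data values (02-07-1980, Private, and non-savings) come from the scenario. The actual document is shown in Example 8-2.

<Account>
  <Customer>
    <DateOfBirth>02-07-1980</DateOfBirth>
    <Account_Type>Private</Account_Type>
  </Customer>
  <Customer>
    <Account_Type>non-savings</Account_Type>
  </Customer>
</Account>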
You can see the final configuration of the Validation tab in Figure 8-18 on
page 218.
The Mappings table, which is shown in Figure 8-19, contains three columns:
Source, Result, and Target. The Target column displays the items of the target
structure. The target structure is the XML schema that is defined in the
Document Root tab of the Composer Step. The Source column displays the
columns that need to be mapped to the target items. The Result column displays
the status of the current mapping.
The document_collection list is the first target list in the Mappings table. Based
on the mapping to this list, one or multiple XML documents are created (see 8.3,
“Multiple XML documents: Single Composer step” on page 253).
In order to map the items from the Input step to the Composer step, follow these
mapping rules:
A list in the Target column can be mapped to a list in the source column only.
For example, in Figure 8-20, only a source list can be mapped to Customer.
All lists are represented with blue icons. An item with a blue icon must be
mapped to another item with a blue icon.
A group is never mapped. Groups create structures and do not need to be
mapped. As shown in Figure 8-20, the group Account does not need to be
mapped.
To map a source item to a target column, first the parent list of the item needs
to be mapped. Therefore, in Figure 8-20, to map a source item to
Account_Number, first the Customer list needs to be mapped.
Following these rules, the output of the Input step must be mapped to the target
columns in the Composer step, as shown in Figure 8-20.
Figure 8-20 Mapping the Input Step output and the Composer Step target structure
Figure 8-21 The suggestion list for the target list Customer
2. Select the source list DSLink2 from the suggestion list. The mapping occurs
and a white check mark in a green circle shows in the Result column, as
shown in Figure 8-22 on page 222.
3. After you map the Customer target list, you can map the columns under it. To
map each of the target columns, select the corresponding row in the Source
column. From the suggestion list, select the appropriate item to map. The item
recommendations in the suggestion list are based on the name and data type
of the Target column.
After mapping all the columns, the Mappings table is complete, as shown in
Figure 8-23 on page 223.
The mapping is based on the similarity between the name and data type of the source and target columns. The automatic mapping suggestion might not always be correct. The Auto Map feature also cannot map columns of different data types. Therefore, if the target column is an integer and all source columns are of a different type, such as string, you must map that column manually.
Figure 8-24 The suggestion list that shows the More option
3. The last option in the suggestion list is Constant. Select the Constant option
to open the Constant window. In the Constant window shown in Figure 8-27,
you can enter the value to assign to the target item. To assign the value
Savings, enter Savings in the Constant window and click OK.
Figure 8-28 The Constant value Savings assigned to the target item Account_Type
You do not need to enter comments in the <!-- --> format. The XML Composer step adds these tags automatically when you add comments.
You can also add Processing Instructions to the output XML file by choosing the
Include Processing Instructions option. The Processing Instruction must be
added between the tags <? ?>, as shown in Figure 8-30 on page 229.
The Include XML Declaration is selected by default in the Header tab. Use this
option to add the XML declaration to the output XML file:
When Include XML Declaration is not selected, the output XML file does not
contain the XML declaration.
Use the Generate XML Fragment option when you do not want to include the
XML declaration, comments, and processing instructions in the output XML
document.
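As an illustrative sketch of these header options (the comment text and the processing instruction here are hypothetical), the prolog of a composed document with the XML declaration, a comment, and a processing instruction might begin like this:

<?xml version="1.0" encoding="UTF-8"?>
<!--Generated by the nightly account extract job-->
<?print-vendor instruction="duplex"?>
<Account>
  <!-- account content -->
</Account>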
The Indentation Length option specifies the degree of indentation to apply. The New Line Style option specifies the new line character to use within the file. Note that formatting affects performance.
When the Write to File option is selected in the XML Target tab, the output XML
file is created by the Composer step. Therefore, a single item called
output-filename shows under the group XML_Composer:result, as shown in
Figure 8-32. This item contains the name of the output XML file in the Filename
Prefix text box in the XML Target tab.
Figure 8-32 Composer step output for the Write to File option
When the Pass as String or Pass as Large Object option is selected, the XML
Composer step does not write the file directly to the server. The XML file is
passed to a stage after XML Stage. Therefore, the item result-string is under the
group XML_Composer:result, as shown in Figure 8-33. This single item holds the
entire XML file.
Figure 8-33 Composer step output for the Pass as String or Pass as Large Object option
To create an XML file based on the structure shown in Figure 8-34, the same
hierarchy must be created so that the mapping is possible. The hierarchy can be
created by using the Regroup step or the HJoin step in the assembly. We explain
these steps in 8.2.1, “The Regroup step” on page 232, and 8.2.2, “The HJoin
step” on page 240.
The bank has a schema that contains multiple lists, as shown in Figure 8-34. To compose an XML file with this structure, use the Composer step. But, to be able to map the relational data to this schema, the hierarchy must first be created by using the Regroup step.
To configure the Regroup step, it needs to be added to the assembly. To add the
Regroup step, use the following steps:
1. Open Assembly Editor and click Palette in the Assembly Outline bar. You
can see the various steps in the Palette, as shown in Figure 8-35.
The Regroup Step is configured, as shown in Figure 8-37 and Figure 8-38 on
page 236.
The input and output of the Regroup step is shown in Figure 8-39 on page 237.
The output of the Regroup step is under the list Regroup:result. Directly under
this list are the parent items. A child list, DSLink2, is also under Regroup:result.
The name of the child list is the same as the List to Regroup that we configured in the Regroup step. As shown in Figure 8-39 on page 237, the child
list DSLink2 contains all the child items.
The Output of the Regroup step can now be mapped to the XML schema in the
Mappings tab of the Composer step, as shown in Figure 8-40.
The “Input records of regroup list are clustered by key - optimize execution” option in the Keys tab instructs the Regroup step to regroup the records in a streaming fashion, because the input data is already sorted on the regroup key.
Figure 8-41 The Input records of regroup lists are clustered by key - optimize execution option
If the data is not sorted in the database, the data can be sorted before it is fed to the XML Stage. We advise that you sort in DataStage rather than in the XML Stage, because the DataStage Sort operates in parallel and can easily sort
the relational data that comes from the database. The Sort step within the XML
Stage is appropriate for sorting hierarchical data. Figure 8-42 shows the overall
DataStage job design. The option “Input records of regroup lists are clustered by
key - optimize execution” is used within the Regroup step.
To create this hierarchy, two Regroup steps can be used. In the first Regroup
step, create the first level of hierarchy, which is between Customer and
address_info. The address_info link also contains the phoneNumber columns, as
shown in Figure 8-44.
The second Regroup step creates the second hierarchy level, which is between address_info and phoneNumber. Remember to set the Scope appropriately in the second Regroup step.
The XML Stage needs to combine the data from the two input links into a single
XML structure, as shown in Figure 8-47.
Figure 8-47 Combining two input files to create a single XML file
The HJoin step is used to join data that comes from two input links. This join is
called a Hierarchical Join, because the step creates a hierarchical structure.
Therefore, the HJoin step joins the data and creates the hierarchy. To be able to
map the data from the input database to the Composer step, the HJoin step must
be used.
Before you can configure the HJoin step, you must add it to the assembly.
Follow these steps to add the HJoin step to the assembly:
1. Open Assembly Editor and click Palette in the Assembly Outline bar. The
steps show in the Palette, as shown in Figure 8-48 on page 242.
2. Drag the HJoin step from the Palette onto the Assembly Outline bar after the
Input Step. The HJoin step is added between the Input Step and the
Composer Step, as shown in Figure 8-49 on page 243.
Figure 8-51 on page 245 shows the output of the HJoin step. The output contains
a nested structure. Under the input parent list DSLink2, the child list is added
along with the prefix HJoin, for example, HJoin:DSLink4.
In several rows of the child list, the value of the column Account_Type might not
match a value in the parent list. All of these rows are placed under a single group
called HJoin:orphans, as shown in Figure 8-52.
The output of the HJoin step can now be mapped to the XML schema in the
Mappings tab of the Composer step, as shown in Figure 8-53 on page 246.
Use the Regroup step when a single input file is provided to the XML Stage, and
the input relational data has duplicate rows. For example, consider the data that
is shown in Table 8-2.
To create a hierarchical structure of each department and its employees, you can
use the Regroup step. Several of the Department IDs are repeated. By using the
Regroup step, the redundancy in the Department IDs can be removed.
When an HJoin is performed between these tables, the Department IDs A100
and C300 repeat, because the HJoin step does not remove any redundancy
while performing a join. Therefore, the Regroup step must be used. Because the
Regroup step requires a single input list, both tables can be combined by using
the Join_Stage that is available in DataStage. Figure 8-54 shows the DataStage
job design.
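As a hedged sketch of the result (the element names and employee values here are hypothetical; only the Department IDs A100 and C300 come from the scenario), regrouping the joined rows on Department ID yields one department element that contains all of its employees, instead of one repeated row per employee:

<Departments>
  <Department id="A100">
    <Employee>Employee 1</Employee>
    <Employee>Employee 2</Employee>
  </Department>
  <Department id="C300">
    <Employee>Employee 3</Employee>
  </Department>
</Departments>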
The bank uses a schema that contains the customer account information, as
shown in Figure 8-56.
The Composer step is used to compose XML files. To map the data from the
Input step to the Composer step, the hierarchy needs to be created. The input
columns are shown in Figure 8-57 on page 249. Therefore, the two input columns, phoneNumber1 and phoneNumber2, must be combined into a single list by using the H-Pivot step.
You need to add the H-Pivot step to the assembly before you configure it. To add
the H-Pivot step, use the following steps:
1. Open Assembly Editor and click Palette in the Assembly Outline bar, which
shows the various steps in the Palette, as shown in Figure 8-58 on page 250.
2. Drag the H-Pivot step from the Palette onto the Assembly Outline bar after
the Input Step. The H-Pivot Step is added between the Input Step and the
Composer Step, as shown in Figure 8-59 on page 251.
The output of the H-Pivot step is contained under the scope DSLink2. The output
of the H-Pivot step is under the group H-Pivot:result, as shown in Figure 8-61 on
page 253. It contains the rows list, which contains the child items: name and
value. Name contains the name of the fields selected as columns (in this
scenario, Name stores phoneNumber1 and phoneNumber2). Value contains the
data that is stored in the phoneNumber1 and phoneNumber2 columns.
The output of the H-Pivot step can now be mapped to the phoneNumbers list, as
shown in Figure 8-62.
Figure 8-62 Mapping the H-Pivot step output and Composer step schema structure
The XML Composer result is contained in the top list when the top list is mapped
to the document_collection list. When the input list DSLink2 is mapped to the
document_collection list, the XML Composer result is contained in the DSLink2
list, as shown in Figure 8-64.
Figure 8-64 The XML Composer result that is contained in the DSLink2 list
When the Write to File option is selected in the XML Target tab of the Composer
step, the Composer step directly writes the XML file onto the server. It creates
one XML document for every input row. For example, if three input rows come
into the XML Stage and the Filename Prefix value is file, the Composer step
creates three files with the names file.xml, file_1.xml, and file_2.xml.
When the Pass as String or Pass as Large Object option is selected in the XML
Target tab of the Composer step, the Composer step passes the created XML file
to the stage after XML Stage. The Composer step does not directly write the file
onto the server. By choosing this option, the Composer step creates a single item
that contains both files. For example, if there are two input rows, the result of the
Composer step is a single item, result_string, which contains the data that is
shown in Example 8-3.
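As a sketch of the idea only (the actual content is shown in Example 8-3, and the element names here are illustrative), the single result_string item would hold one complete document per input row, one after the other:

<Account>
  <!-- data composed from the first input row -->
</Account>
<Account>
  <!-- data composed from the second input row -->
</Account>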
When an output link is provided from the XML Stage, the link is a list within the
assembly. The output columns are items in the assembly. The columns for each
of the output links are visible in the Output tab of the Output Step. The Mappings
tab is used to map the output of the Composer Step to the columns of the
database. As shown in Figure 8-65 on page 257, the output link DSLink3 has a
single column result to which the Composer Result must be mapped.
When the top list is mapped to the document_collection list, the output of the
Composer step is under the list top. The Composer result is contained in a single
item called result_string. To map result_string to the output column, you must first
map the parent list DSLink3. Because result_string is under the list top, top is
mapped to the target list DSLink3. Then, the source item result_string is mapped
to the target item result. Figure 8-66 on page 258 shows the complete mapping.
These mapping rules apply when you map the @@choiceDiscriminator item:
If the input data is present only for the element self_employed, the constant
value self_employed must be mapped to the target item
@@choiceDiscriminator, as shown in Figure 8-69 on page 261. The constant
value must be the same as the name of the choice item.
If the input data is present only for the element hired, the constant value hired
must be mapped to the target item @@choiceDiscriminator, as shown in
Figure 8-70.
Sometimes the incoming data can have both elements, that is, one row is
self_employed and the other row is hired. In this case, a constant value cannot be used. Instead, map an input column that holds the element name for each row (in this scenario, the column choice) to the @@choiceDiscriminator item.
At run time, if the input row contains data for the self_employed element,
choice contains the value self_employed. The Composer step knows which
element must be present in the output XML file.
XML schemas might qualify choice elements with a namespace prefix such as tns. This prefix means that the self_employed element is in the namespace that tns represents. When mapping the value to the
@@choiceDiscriminator, you need to take care of the namespace, too.
Therefore, if the input has the data for the self_employed element and the
namespace is http://check123, you map the constant value
“{http://check123}self_employed” to the @@choiceDiscriminator item.
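As a hedged sketch (the namespace URI http://check123 is taken from the example above; the surrounding element names are illustrative), the choice in such a schema might be declared like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:tns="http://check123"
           targetNamespace="http://check123"
           elementFormDefault="qualified">
  <xs:element name="employment">
    <xs:complexType>
      <xs:choice>
        <xs:element name="self_employed" type="xs:string"/>
        <xs:element name="hired" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

Mapping the constant "{http://check123}self_employed" to @@choiceDiscriminator then selects the first branch of this choice.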
The following mapping rules apply when you create an XML file for a schema
with @@type:
If nothing is mapped to the @@type item, the output XML file contains only
the elements of home_address. As you can see in Figure 8-73 on page 264,
the mapping occurred for home and office address fields, but because
nothing is mapped to @@type, the output contains only home_address.
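For context, when a schema defines a base type and a derived type, selecting the derived type is expressed in the instance document with an xsi:type attribute; the @@type item is how the assembly supplies that value for this kind of mapping. A hedged sketch follows, in which the element name address and the derived type name office_address are assumptions:

<address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:type="office_address">
  <!-- office address elements -->
</address>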
We use the same XML structure to explain each step. The XML file contains the
details of a bank account. Figure 9-1 on page 266 shows the XML structure in
the assembly.
For the XML schema structure that is shown in Figure 9-1 on page 266, the
Account information needs to be sorted in ascending order based on the
FirstName field. To sort the XML file, first the XML file needs to be read by using
the Parser step. The Sort step must be added after the Parser step to sort the
content of the XML file. Follow these steps to configure the Sort step that is
depicted in Figure 9-2 on page 268:
1. The List to Sort specifies the list that needs to be sorted in the XML file. In the
List to Sort drop-down list, select the list Account.
2. The Scope in the Sort step defines the scope of the sort function. The
elements are sorted within the list that is selected as the Scope. The Scope
also determines the location of the output of the Sort step. The Scope must
always be the parent node of the list that is selected in the List to Sort field. In
this scenario, because the Account is selected as the List to Sort, only top
can be selected as the Scope. Select top. The output of the Sort step is now
under the list top. All the FirstName items that are under the top scope are
sorted.
3. The XML file must be sorted on the FirstName item. Therefore, for Keys,
select FirstName from the drop-down list.
4. You can select the sort order in the drop-down list under Order. The sorting
options are to sort in Ascending or Descending order. Select Ascending.
Because we selected top as the Scope, the output of the Sort step is under top,
as shown in Figure 9-3 on page 269. The output of the Sort step is in a group
called Sort:result. The output shows the list Account sorted in ascending order
based on the FirstName field. This output can then be mapped to the Composer
step to create the sorted XML file.
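As an illustrative sketch with hypothetical first names, the composed output then lists the Account entries in ascending FirstName order:

<top>
  <Account><FirstName>Alice</FirstName></Account>
  <Account><FirstName>Bob</FirstName></Account>
  <Account><FirstName>Carol</FirstName></Account>
</top>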
Sorting has special considerations. See 10.5.1, “Sort step” on page 318 for the
details.
A customer at our scenario bank has either a loan account or a deposit account.
You can use the Aggregate step to count the number of customers based on the
type of account. To count the number of accounts of a particular account type in
the XML file, first the XML file needs to be read by using the Parser step. The
Aggregate step must be added after the Parser step to aggregate the
hierarchical data.
Figure 9-5 The output of the Aggregate step when a key value is specified
The configuration of the Aggregate Step is shown in Figure 9-6 on page 272.
You can map the output of the Aggregate step to various columns of a Sequential
file by using the Output step.
A customer at our scenario bank has either a loan account or a deposit account.
You can use the Switch step to write the account information for each type into a
separate output file. The XML file must be read by using the Parser step. You
must add the Switch step after the Parser step to filter the items in the XML file.
Follow these steps to configure the Switch step, as depicted in Figure 9-9 on
page 275:
b. Type loan for the Target Name. The Target Name is the name of the
constraint, which also is the name of the output list.
c. The Filter Field specifies the item on which the constraint is specified.
Select account_type from the drop-down list.
d. The Switch step provides multiple functions based on the type of filtering
that you want to perform. Select Compare in the drop-down list for
Function.
e. The account_type can be either loan or deposit. Therefore, either one of
these constant values can be specified as the constraint. For Parameter 1, enter the value loan.
Figure 9-9 on page 275 shows the configuration of the Switch Step.
Each of the lists under Switch:filtered can be mapped to a separate output link by
using the Output step. Therefore, all the account information for the loan account
type is in one output file. All the account information for the deposit account type
is in a separate output file (the output file to which the default list is mapped).
See Appendix B, “Transformer functions” on page 369 for a list of functions that
are available in the Switch step.
The XML files can be read by using two Parser steps. The Union step must be
used after the Parser steps to combine the two files.
2. The Mappings tab is similar to the Mappings tab in the Composer step. You
can use the Mappings table to map the output of the Parser step to the target
structure in the Union step.
This Mappings table, as shown in Figure 9-12 on page 278, contains three
columns: Source, Result, and Target. The Source column shows the columns
that need to be mapped to the target items. The Result column shows the
status of the current mapping. The Target column contains two lists: the left list and the right list.
3. The left list is mapped to the output of the first Parser step, and the right list is
mapped to the output of the second Parser step, as shown in Figure 9-13 on
page 279.
As shown in Figure 9-13, a source list needs to be mapped to the left and right
lists in the target. The Account lists under the first and second Parser steps must
be mapped to the Account fields under the left and right lists. To create lists on
the source side, you can use the Group to List projection. Follow these steps to
create the lists on the source side:
1. Go to the Output tab of the First Parser step. Right-click the
XML_Parser:result group, as shown in Figure 9-14 on page 280.
3. Go to the Output tab of the Second Parser step. Right-click the group
XML_Parser_1:result and select Group To List. The group is converted to a
list, as shown in Figure 9-16 on page 282.
Follow these steps to map the data from the Parser steps to the Union step:
1. Double-click the Source column for the target list left. You see a drop-down
list, which is the suggestion list. The suggestion list shows all available source
lists that can be mapped to the target list. The absolute path,
top/XML_Parser:result, is shown in the suggestion list.
2. Select the source list top/XML_Parser:result from the suggestion list. When
you select top/XML_Parser:result, the mapping occurs for this row. A green
circle with a white check mark shows in the Result column, as seen in
Figure 9-17 on page 283.
Figure 9-18 on page 284 shows the Mappings table after mapping all the
columns.
The Union step is configured. The output of the Union step is in a list called Union:result, as shown in Figure 9-19. The structure of the Union:result list is the
same structure that is defined in the Union Type. The output of the Union step
can be mapped to the Composer step to recreate the XML file. Therefore, two
XML files are combined into a single file.
To demonstrate the use of the V-Pivot step, the Account schema is slightly
modified. The new schema representation is shown in Figure 9-20. The address
is a list that can contain the account holder home address and the account holder
office address. The address_type field is used to distinguish the addresses. It
contains office for office address and home for home address. When this
structure is parsed, it creates two output rows: one row for home address and
one row for office address. To map all the account holder information and the
address information in a single output row, you must use the V-Pivot step after
the Parser step.
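As a hedged sketch (the street values are hypothetical; the address list, address_type field, and home/office values come from the scenario), the parsed input for one account might contain two address entries, which the V-Pivot step collapses into a single output row:

<Account>
  <FirstName>Alice</FirstName>
  <address>
    <address_type>home</address_type>
    <street>1 Hypothetical Lane</street>
  </address>
  <address>
    <address_type>office</address_type>
    <street>2 Example Plaza</street>
  </address>
</Account>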
Follow these steps to configure the V-Pivot step that is depicted in Figure 9-21 on
page 287:
1. The Source of Rows field specifies the list that contains the field on which the
pivot is performed. In this scenario, the address_type field is unique for each
address list. Therefore, the pivot is performed on the address_type field.
Because the address_type field is under the list address, select address in
the drop-down list for the Source of Rows field.
The V-Pivot Step is now configured, as shown in Figure 9-21 on page 287.
The output of the V-Pivot step is in a group called V-Pivot:result, which contains
two nodes: office and home. These names are based on the values that are
specified for the Column Names. The home node holds the house address, and
the office node holds the office address. The output of the V-Pivot step is under
the Account list, because the Scope is set to Account. See Figure 9-22 on
page 288.
You can map the output of the V-Pivot step to a sequential file by using the
Output step.
A hospital stores its patient chart information in two XML files. One XML file
stores information about the doctor who treated the patients. The other XML file
stores the medications that are recommended by the doctor. We must merge
both sets of details into one XML file, as shown in Figure 9-23 on page 289. We
are not required to combine the exact medication that is prescribed with the
doctor who prescribed it. The information merely needs to be merged so that the
doctor information is followed by all the prescribed medicines. The OrderJoin
step can be used to create this XML file.
In this scenario, two Parser steps are required to read each of the XML files. The
XML schema structure of each of the XML files is shown in Figure 9-24. The
OrderJoin step is added after the Parser steps.
To configure the OrderJoin step, you need to specify the Left list and the Right
list:
1. In the Left list, select the list whose content you need to list first. Select
Provider from the drop-down list, because you need to list the doctor
information first.
2. In the Right list, select Medication.
The output of the OrderJoin step is a list called OrderJoin:result. This list
contains two groups: the group that you defined as the Left list and the other
group that you defined as the Right list. Therefore, as seen in Figure 9-26,
OrderJoin:result includes Provider, which contains the doctor information, and
Medication, which contains the prescribed medications. The output of the
OrderJoin step can now be mapped to a Composer step to recreate the single
XML file.
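As a hedged sketch of the merged result (the root element and child elements are hypothetical; only Provider and Medication come from the scenario), the doctor information is listed first, followed by all the prescribed medications:

<PatientChart>
  <Provider>
    <Name>Dr. Example</Name>
  </Provider>
  <Medication>
    <Name>Medication A</Name>
  </Medication>
  <Medication>
    <Name>Medication B</Name>
  </Medication>
</PatientChart>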
In this chapter, you configured the steps that are available in the XML Stage. We
described the Parser step in detail in Chapter 7, “Consuming XML documents”
on page 147. We described the Composer step, Regroup step, HJoin step, and
H-Pivot step in detail in Chapter 8, “Creating and composing XML documents” on
page 201.
Round-robin partitioning
Round-robin partitioning evenly distributes rows across partitions in a
round-robin assignment, similar to dealing cards one at a time. Round-robin
partitioning has a fairly low overhead. Because optimal parallel processing
occurs when all partitions have the same workload, round-robin partitioning is
useful for redistributing data that is highly skewed (an unequal number of rows
in each partition).
Random partitioning
Random partitioning evenly distributes rows across partitions. It uses a
random assignment. As a result, the order that rows are assigned to a
particular partition differs between job runs. Because the random partition
number must be calculated, random partitioning has a slightly higher
overhead than round-robin partitioning. Although in theory random
partitioning is not subject to regular data patterns that might exist in the
source data, it is rarely used in functional data flows because it has a slightly
larger overhead than round-robin partitioning.
Entire partitioning
Entire partitioning distributes a complete copy of the entire dataset to each
partition. Entire partitioning is useful for distributing the reference data of a lookup task, which might or might not involve the lookup stage. On clustered or grid configurations, distributing a complete copy to every node increases memory use and network traffic, so entire partitioning is best suited to reasonably small data sets.
10.1.3 Collectors
Collectors combine parallel partitions of an input data set (single link) into a
single input stream to a stage that is running sequentially. The collector method
is defined in the stage input/partitioning properties for any stage that is running
sequentially, when the previous stage is running in parallel. The following list
shows the collector types:
Auto collector
The auto collector first checks the dataset to see whether it is sorted. If so,
the framework automatically inserts a sort merge collector instead. If the data
is not sorted, the auto collector reads rows from all partitions in the input data
set without blocking for rows to become available. For this reason, the order of
rows in an auto collector is undefined, and the order might vary between job
runs on the same data set. Auto is the default collector method.
However, in most cases, you do not need to globally sort data to produce a single
sequence of rows. Instead, sorting is most often needed to establish order in
specified groups of data. This sort can be performed in parallel. The business
logic, such as join, can be performed accurately on each sorted group, as long as
key-based partitioning is used to ensure that all members of the group are
distributed to the same partition. Figure 10-5 on page 299 illustrates key-based
partitioning and a parallel sort.
Notice that the partitioning happens first. The data is distributed into each
partition as it arrives. The sort is performed independently and in parallel on all
partitions. Parallel sort yields a significant performance improvement on large
data sets. Parallel sort is a high-performance solution. Perform sorting in parallel
whenever possible.
When you need to compose a single XML document, the XML Stage needs to be
running in sequential mode. The partitioned and sorted relational data needs to
be collected to a single partition where the XML Stage processes it sequentially.
To collect the data to a single partition, a sort merge collector must be used to
generate a sorted sequential data set. Figure 10-6 on page 300 illustrates a job
that performs a high-performance relational sort and composes the XML
document from the result.
With the composeInvoiceXML Stage in the default configuration using the auto
collector, the parallel engine framework automatically inserts the sort merge
collector when it identifies that the upstream data is sorted. In Figure 10-6, the
sort merge collector is inserted at the point where you see the collector link
marking icon on the sortedOrders link. By choice, the user can also manually
specify a sort merge collector, as shown in Figure 10-7 on page 301.
To sort the relational data, you need to specify the sort and partitioning criteria to
match the columns that we use as keys within the XML assembly. Figure 10-8 on
page 302 illustrates the configuration of the hash partitioner, which is a
key-based partitioning method. Order number is selected as the hash key, which
ensures that all the records that belong to each order are hashed to the same
partition.
In the XML assembly, we need to regroup the lines within each order, and so we
sort by using order number and line number as keys. The sort keys and the
partitioning keys do not need to match exactly. The sort keys can be a super-set
of the partitioning keys; however, the order must be preserved. Therefore, the
order number must be the first sort key. Figure 10-9 illustrates the configuration
of the sort keys.
Figure 10-10 Configuring the list and parent-child items for the Regroup Step
For the List to Regroup option, we select the sortedOrders list, which is the
incoming link that holds the relational data. We set the Scope option to top,
which means that the regrouped output structure shows under the top level at the
output. This choice is the only available choice for the scope property, because
the sortedOrders input list is a flat structure. The Regroup step forms a
hierarchical structure with a parent and a child list. The line items are part of the
child list, so the orderNumber is configured to be part of the parent list.
The key specification defines when an item is added to the output regroup list. In
this example, we need the regroup list to contain one item per order, so we use
the orderNumber as the key. Figure 10-11 on page 304 illustrates the
configuration of keys.
The “Input records of regroup list are clustered by key - optimize execution”
option instructs the Regroup step that the data is already sorted on the regroup
key. This sort used the parallel sort in the DataStage job prior to the XML Stage.
Important: You must select this optimize execution option to achieve the
performance benefit. If not, the XML Stage assumes that the incoming data is
unsorted and starts a less efficient, non-parallel sort.
To design a solution, we start with the XML files. Each file is stamped with a date
and time code, plus the store number from where it originated. This information
presents an opportunity, because many of the business calculations are driven
by store number. The DataStage job is shown in Figure 10-12.
Figure 10-12 DataStage job to process many small XML files in parallel
By using an External Source stage, we obtain a data stream of file paths to the
XML files, one path per record. In the transformer, we extract the store number
into its own column and pass it along with the full XML file path to the XML Stage.
A hash partitioner that uses the store number as the key distributes the file paths
to multiple instances of the XML Stage that run in parallel. Each instance of the
XML Stage on each partition parses an XML file for the data for one store. The
data is loaded into an IBM DB2 database table that is partitioned by store
number. Next, we describe each stage in this job.
First, the External Source stage is used, in sequential mode, to obtain a list of the
XML files in the directory to be processed, as shown in Figure 10-13 on
page 306.
The Transformer stage also runs sequentially and extracts the store number, as
shown in Figure 10-14 on page 307.
The transformer assembles the complete file path from the file name and the
path from the job parameter. It extracts the pure store number from the XML file
name and returns that number as a separate output column. The hash partitioner
is defined on the input link on the XML Stage by using the store number field as
the key, as shown in Figure 10-15 on page 308.
In the XML assembly, the Parser step must be configured to accept the incoming
path from the input link and, then, open each file for parsing. This configuration is
shown in Figure 10-17 on page 310. The rest of the assembly is configured
normally, exactly as you parse a single file.
This solution is effective and benefits from the increased I/O throughput of
reading multiple files simultaneously. Additionally, because the XML files store
the data that is already broken down by store number, it does not need to be
repartitioned. Even in other situations where repartitioning might be required, it is
most often the case that the performance boost gained from increased
throughput more than offsets any cost for repartitioning. You can use a similar
strategy to compose XML documents, as long as each partition produces a
document or stream of multiple documents.
If the XML Stage is given an input link, and the records supply the document
paths, one additional configuration step is required. The partitioner on the input
link to the XML Stage must be configured to use the Entire partitioning mode.
Entire partitioning mode is necessary so that the file path is distributed to each
instance of the XML Stage that runs on each partition. Figure 10-17 on page 310
illustrates how to configure the XML Stage to read a file path from an input link.
To configure the XML Stage to use entire partitioning, edit the stage properties
and click the input link in the image map. Click the Partitioning tab, and choose
Entire from the Partition type choices, as shown in Figure 10-21 on page 314.
Figure 10-22 on page 315 shows that the /invoice/order list is mapped to the
output link. If you add additional output links, or map a more deeply nested list,
such as /invoice/order/lines, parallel parsing is not enabled. Therefore, we see a second XML Stage in this job design to parse the more deeply nested lines.
It is important to recognize that the second XML Stage cannot parse the lines in
parallel. The actual chunk of XML containing the order must be passed as a
string set from the first XML Stage to the second XML Stage. Parallel parsing is
only enabled when reading an XML file directly; it is not possible to parallel parse
from a string set in memory.
This design lets us get to the necessary data within the lines that we need. At the
same time, this design gives us a substantial increase in performance when
processing a large file. Right-click the order element on the Document Root tab,
and choose Chunk from the context menu. If you then click the Output tab in the
right pane, you can see that the order list element now shows that its only child
node is a text() node. The child node is a chunk of XML that is not parsed
further within this step. It is sent byte-for-byte to the next step in the assembly,
which is illustrated in Figure 10-24 on page 317.
The second XML Stage is set up in a typical parsing fashion. The Parser step
reads from the input link as a string set. The document root is configured to parse
the order element, as shown in Figure 10-25 on page 318.
If this option is not enabled, the Sort step uses disk-based files for sorting and does not sort in parallel. Avoid disk-based sorts in the XML assembly if at all possible. The Sort step is most useful when the data is small enough to fit completely into memory, when the sort must be performed in the middle of a complex hierarchical transformation, and when the sort cannot be performed by using the DataStage parallel sort before or after the XML Stage.
Important: The option to use memory is absolute. After you select the
in-memory sort, you cannot fall back to disk-based sort in the event that the
data does not fit in memory. If it cannot allocate enough memory to perform
the sort, the job fails with a runtime error.
Important: As with the Sort step, the in-memory H-Join option is absolute.
You cannot fall back to disk-based sort in the event that the data does not fit in
memory. If it cannot allocate enough memory to perform the join, the job fails
with a runtime error.
There are performance implications with these job designs. The DataStage
parallel framework is primarily designed to pass large volumes of relatively small
records. Buffering is created based on a relatively small record size. To move
data from one stage to another, the entire record must fit inside a transport block.
The default block size is 128 KB.
If the individual XML documents that are processed are all consistently close to
the maximum size, no further tuning is necessary. However, if the XML
documents vary significantly in size, with most of them less than half of the
maximum size, you might try further tuning. In this situation, set
APT_OLD_BOUNDED_LENGTH=True. This value forces both in-memory and
physical data set storage to manage all string values by their actual length and
not their maximum length. This value uses more CPU resources, but it reduces
memory and I/O.
If you attempt to access a LOB column with a non-Connector stage, it is only able
to see the reference string and not the actual content to which it refers. For
example, you cannot use the Transformer stage to modify the XML that is passed
by reference as a LOB column, because the transformer does not have LOB
column support. It sees the small reference ID string only. Appendix A,
“Supported type mappings” on page 363 contains a list of Connector stages,
which have LOB column pass by reference support, that are available at the time
of writing this book.
For this reason, it is important to determine in advance the size of the XML
documents that are passed through the job, and design it to use LOB columns or
pass data in-line as appropriate:
LOB columns with the Composer step
The Composer step supports writing data into a LOB column. In the XML
Stage assembly, open the Composer step. As shown in Figure 10-28 on
page 323, select Pass as Large Object on the XML Target properties page.
At the time of writing this book, the XML Connector stage uses physical disk
files stored in the scratch disk location to store XML data that is passed by
LOB reference. The downstream Connector stage, such as the DB2
Connector, retrieves the data from the file at access time.
LOB columns with the Parser step
The Parser step supports reading data from a LOB reference column. No special option is required; you use the String set option for XML Source in the
XML Parser Step Configuration tab, as shown in Figure 10-29 on page 324.
The XML Stage automatically detects if the column contains an in-line or
pass-by-reference value. The XML Stage takes the appropriate action to
retrieve the LOB data, if necessary.
Important: The Parser step cannot write a LOB reference column. All parser
output is passed by value, even when the Chunk feature is used.
If it is necessary to extract a large chunk of XML data with the Parser step, and
pass it externally to DataStage by reference, you need to add a Composer step
to manage this chunk by writing the LOB reference.
The <xs:any> syntax allows the schema to be extensible beyond its defined
elements, by allowing any XML element to be contained within it.
The Composer step can then write its output as a LOB reference column, which
is accessible by any downstream Connector stage.
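A minimal sketch of such a pass-through schema follows; the wrapper element name chunk is an assumption. It uses <xs:any> so that whatever XML the Parser step extracted can be placed inside it without further parsing:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="chunk">
    <xs:complexType>
      <xs:sequence>
        <!-- accept any well-formed XML content extracted by the Parser step -->
        <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>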
Many existing production deployments of DataStage use XML Pack 2.0 for
DataStage for file-based and real-time data integration. This chapter describes
several key differences of which developers familiar with XML Pack 2.0 need to
be aware when building new solutions or migrating existing solutions to the XML Transformation stage.
Figure 11-1 shows the section where the types are defined in the WSDL. We
highlighted the <xsd:import> tag, and it must appear as the first element in the
<xsd:schema> tag. The schemaLocation attribute of the <xsd:import> tag
identifies the location of the schema file. Normally, this location is a URL location,
and it can be downloaded with a web browser. Otherwise, it needs to be
distributed with the WSDL file.
Extract the highlighted <schema> element and paste it into an empty file, and
give it the .xsd extension. You must change a few lines. Add the standard XML
declaration as the first line in the file:
<?xml version="1.0" encoding="UTF-8"?>
Identify all the namespace prefixes (xsd: and impl: in this example) that are used
in the schema content. Look at the <wsdl:definitions> element at the top of the
original WSDL file to see the attributes for xmlns:xsd= and xmlns:impl=. Copy
these attributes and their assigned values into the <schema> element into the
new file. Figure 11-3 on page 333 shows the highlighted namespace prefixes
and attributes.
After you identify all the namespaces and copy everything into a new file, save
the new file with the .xsd extension. This file can now be imported into the
Schema Manager. If you receive any errors during import, the likely cause is a
missed namespace definition. Go back through these steps to verify that you
located all of the namespace prefixes and defined them in the new document.
Figure 11-4 on page 334 contains the complete XSD schema file after the
manual extraction process.
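As a hedged sketch of the result (the namespace URIs here are placeholders, not the actual service namespaces), the manually extracted file might begin like this:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:impl="http://example.com/service"
        targetNamespace="http://example.com/service">
  <!-- element and complexType definitions copied from the <wsdl:types> section -->
</schema>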
Important: You might see multiple entries for the web service in the Web
Services Explorer pane. Browse and import only the operations that do not end in the suffix 12. Any operations with the 12 suffix are for SOAP 1.2, which is not
supported by the Web services stages at the time of writing this book.
Step 2: Create the DataStage job with Web Service Client and
XML stages
In the DataStage Designer, create a job and add the Web Service Client stage
(WS_Client) and XML stages. Map the output of the WS_Client stage directly to
the XML Stage. Figure 11-7 illustrates the required mapping between the
WS_Client and the XML stages.
Figure 11-7 Job design fragment that shows the WS_Client and XML Stage mapping
The output link from the XML Stage can be directed to whatever downstream
stage is needed, such as a Sequential File stage or a Peek stage for testing purposes.
Now, click the Output Message tab. At the bottom, select User-defined
Message, and select the SOAPbody column to receive the user message, as
shown in Figure 11-9 on page 338. With this configuration, the WS_Client stage
does not attempt to parse the SOAP message body. Instead, it loads the entire
message body in-line into the SOAPbody column. Finally, complete the
configuration of the WS_Client stage as required for the web service operation.
For example, use the Input Arguments tab to define input parameter values for
the operation that is called.
Important: The Web Service Pack stages do not have large object (LOB)
column (pass by reference) support. The data in the SOAPbody column is
passed in-line as part of the record, which means that there are limits and
performance implications. For more information about passing XML data
in-line, refer to Chapter 10, “DataStage performance and XML” on page 291.
In the Output step, configure the mapping to obtain the list and element data that
is needed, as shown in Figure 11-11.
Figure 11-11 Map output data from the SOAP message body
When working with XML documents inside of a job that is deployed as a service,
the end-of-wave construct is critical. If a group of rows is consumed to compose
an XML document, it must stop when the end-of-wave marker is reached. When
parsing incoming XML documents, it must output the end-of-wave marker at the
end of each input document that is consumed.
At the time of writing this book, the DataStage parallel engine framework cannot
automatically force stages to run on a single partition when the job is deployed as
a service provider. It is the responsibility of the developer to ensure that the job is
configured to use a single-node configuration file at all times. Failure to configure
the job to use a single-node configuration file at all times results in potentially
incorrect output from the service operation.
Figure 11-12 Non-Information Services Director DataStage Wave Generator testing job
You do not need to perform any specific configuration in the XML Connector stage to handle the end-of-wave; it is handled automatically. Insert Peek stages
at various points in the job design so that you can observe the end-of-wave
marker and verify that the correct output is generated.
Important: If your XML Stage has multiple output links, the end-of-wave
marker is included in the output. The marker follows the records from each
document on each link.
You can use XML as the payload for MQ messages as an architecture for
event-driven applications. The MQ Connector stage supports LOB reference
columns when used both as a source and a target. This support is useful for handling large XML message payloads.
The Message options parameter is set to Yes. Set this parameter first so that it
enables other important parameters:
Message truncation must always be set to Yes with XML documents. If this
parameter is set to No, and the message on the queue is larger than the
maximum length defined for the column, it creates multiple output records
each with part of the document. These records can cause undesirable
behavior in the rest of the job. It is better to truncate the message. Then, allow
the validation capabilities of the XML Stage to handle an incomplete
document scenario.
Message padding must always be set to No with XML documents. This
parameter is important when LOB references are not used, and the data is
passed in-line. White space in an XML document is ignored, but these bytes
still must be stored when moving the document. Because not every message
is likely to be close to the maximum size, you waste a substantial amount of
resources by transferring irrelevant white space characters if this option is set
to Yes.
Treat EOL as row terminator must always be set to No with XML documents.
If this parameter is incorrectly set to Yes, and the XML contains any line-end characters, the document is split across multiple output records.
In Figure 11-15 on page 343, the Enable payload reference parameter is set to Yes. This value is appropriate for large XML documents, because it results in a LOB reference in the output column instead of the data itself. If you want to pass the data in-line for small documents, set Enable payload reference to No.
The DB2 Connector can read and write to DB2 XML data type columns without
any special handling requirements. DB2 automatically converts between strings
and its internal binary storage type. If a CLOB column is used, no conversion is
required, because the XML is stored byte for byte in string format. DataStage
adds the capability to validate the XML before loading it into the database. This
capability eliminates the overhead of configuring an XML schema in the
database for validation, which impacts load performance. It also allows reject
handling to be managed in the ETL process.
With XML query extensions, you can return a specific piece of XML from
within the column or reference a value of a specific element in a where clause
predicate. Example 11-2 shows this type of query.
It is also possible to use the XQuery language instead of SQL in the DB2
Connector. XQuery restricts the result to be retrieved as an in-line value; the DB2
Connector is unable to use a LOB reference column in this case.
If the data is small enough to fit in the standard Oracle string buffer, you can bring
it back as varchar. Otherwise, you need to bring it back by converting it to a
CLOB or binary LOB (BLOB). Check your Oracle database documentation to see
the size of the Oracle string buffer for your specific release.
Example 11-4 Query with table alias but missing column alias
select t.docid, t.document.getClobVal() from order_xml t;
The query must use both alias types, as shown in Figure 11-16 on page 346, or a
runtime error message occurs.
Example 11-5 Query that extracts the first order from an invoice XML document
select t.docid, t.document.extract('/invoice/order[1]') as document
from order_xml t;
The full content of the After SQL statement property is shown in Figure 11-18.
This statement inserts the staged data into the final table that contains the
XMLType column.
We explain how to move from XML Pack 2.0 stages, provide tips, and identify
pitfalls. At the time of writing this book, the Connector Migration Tool does not
provide any automated conversion of XML stages in DataStage jobs. This
conversion is a manual process. We describe the areas in which you can see the
greatest return on your investment for refactoring existing jobs to use the new
technology.
Important: The Parser step continues to write data ahead to the next step,
regardless of whether a failure occurred.
This behavior differs from the XML Input stage, which produced no output
(except for the reject link) when a validation failure occurred. In most situations,
the job developer does not want the partially processed, and usually incorrect,
data written to the output link. In this section, we explain how to use the Switch
step in conjunction with the Parser step to implement reject functionality that is
equivalent to the XML Input stage reject function.
To prevent output data when a validation failure occurs, we add a Switch step after the Parser step. In the Switch step, we add a single target, as shown in
Figure 11-21 on page 353.
The Switch step splits the output into two lists based on the specified target
condition. The isvalid list contains the output when the document passes
successful validation. The default list contains the output when the document
fails validation. The input file name is included in each list, so that you can
identify the processed document. Figure 11-22 on page 354 illustrates the
mapping for the output of the Switch step. The output link named rej gets the
validation failure message and the file name, which is mapped from the default
list. The data output link receives valid parsed data from the isvalid list.
This design pattern achieves the same result as the reject function in the XML
Pack 2.0. At the same time, it provides additional control over the reject
conditions during validation.
First, we quickly review XML Pack 2.0. The XMLOutput stage required document
boundaries (when to finalize a document and begin a new one) to be managed by a combination of the XPath expressions defined in the input columns and a single control setting in the stage.
The equivalent for these behaviors in the new XML Transformation stage is
based on the use of specific steps in the assembly and the XSD. The structure
defined in the XML schema must match the instance document that you are
trying to create exactly. In the following scenarios, we demonstrate how to
configure the correct steps in the assembly, map the correct list to the top of the
document output structure, and select the correct scope on which to apply the
step.
Aggregation step in the new XML Transformation stage: XML Pack 2.0
has an output mode named Aggregate all rows. Job developers must be
careful not to confuse this mode with the similarly named Aggregation step in
the new XML Transformation stage assembly. The Aggregation step in the
assembly performs true aggregation functions, such as sum and max, on the
data. It is not used to control the creation of output documents.
The new XML Transformation stage continues to support this design pattern. As
described in Chapter 10, “DataStage performance and XML” on page 291, high
performance can be obtained with relational joins taking place in the parallel job
and sending sorted, normalized data into the XML Stage. In the assembly, the
normalized input data is converted to a hierarchical relationship, and it is mapped
to the target document. Figure 11-25 demonstrates how to achieve the same
invoice example with the XML Stage.
In the assembly before the Composer step, multiple Regroup steps are used to
create the hierarchical relationship from the flat, normalized input link. You can
see the results of these Regroup steps in the input tree view of Figure 11-25 on
page 357, labeled Regroup:Invoice and Regroup:Lines. The Regroup steps take
multiple input rows and group them into multiple element lists. The Composer
step can then easily map these lists into the output document.
Important: The configuration of the input link remains the same for both
stages. Therefore, existing jobs can be modified easily without needing to
change the job design. Merely replace the XML Output stage with the XML
Stage on the job canvas. The rest is performed in the XML Stage assembly.
The XPATH looks similar, because it represents only the nesting structure of the
target XML document. It does not represent any of the context, such as
minimum/maximum occurrence. The XML Output stage used the single row
output mode to enforce a single occurrence of the XML structure in each
document and the creation of multiple output documents. The key to
implementing the equivalent functionality in the XML Transformation stage is to
use the proper XSD schema. The target schema must define its hierarchy by
using groups rather than lists. In the XSD schema, set the maxOccurs attribute to
1 for the element to define the hierarchy. Figure 11-27 on page 360 illustrates the
configuration of the Composer step in the XML assembly, for the equivalent of
single row mode.
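A hedged sketch of such a schema fragment follows; the element names are borrowed from the invoice example, and the exact schema is an assumption. Setting maxOccurs to 1 keeps order as a group, rather than a list, in the assembly, so one document is composed per input row:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="invoice">
    <xs:complexType>
      <xs:sequence>
        <!-- maxOccurs="1" keeps order a group, not a list, in the assembly -->
        <xs:element name="order" maxOccurs="1">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="orderNumber" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>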
The only list in the target structure is the document_collection list, which
identifies when a new document is created. The DSLink4 input link list is mapped
to the document_collection; thus, one document is created for each input row.
The rest of the elements are then mapped. There is no Regroup step. Even
though the output target defines a hierarchical relationship, it can be managed
through mapping because of the one-to-one relationships of all the fields.
The configuration of the input link remains the same. Jobs that use the XML
Output stage with single row mode can be modified easily to use the XML
Transformation stage without having to change the rest of the job design.
Figure 11-28 XML Output stage configured to create a document for each order
Figure 11-29 Schema fragment that shows one order in each invoice document
A Regroup step is used to group the lines into a list in each order. The regrouped
Orders are then mapped to the document_collection, which results in one output
document that is created for each order.
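As a hedged sketch of one such output document (the line-level element names and values are illustrative; invoice, order, lines, and orderNumber come from the earlier examples), each document contains a single order with its regrouped lines:

<invoice>
  <order>
    <orderNumber>1001</orderNumber>
    <lines>
      <line><lineNumber>1</lineNumber></line>
      <line><lineNumber>2</lineNumber></line>
    </lines>
  </order>
</invoice>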
DataStage data type      XML schema data type
BigInt                   long
Binary                   binary
Bit                      boolean
Char                     string
Date                     date
Decimal                  decimal
Double                   double
Float                    double
Integer                  int
LongNVarChar             string
LongVarBinary            binary
LongVarChar              string
Nchar                    string
Numeric                  decimal
NvarChar                 string
Real                     float
SmallInt                 short
Time                     time
Timestamp                dateTime
TinyInt                  byte
VarBinary                binary
VarChar                  string
XML schema data type     DataStage data type
duration                 VarChar
dateTime                 TimeStamp
time                     Time
date                     Date
gYear                    Date
gYearMonth               Date
gMonth                   Date
gMonthDay                Date
gDay                     Date
anyURI                   VarChar
ENTITY                   VarChar
ENTITIES                 VarChar
ID                       VarChar
IDREF                    VarChar
IDREFS                   VarChar
QName                    VarChar
token                    VarChar
language                 VarChar
Name                     VarChar
NCName                   VarChar
NMTOKEN                  VarChar
NMTOKENS                 VarChar
NOTATION                 VarChar
normalizedString         VarChar
string                   VarChar
float                    Real
double                   Double
decimal                  Decimal
integer                  Decimal
long                     BigInt
int                      Integer
short                    SmallInt
byte                     TinyInt
positiveInteger          Decimal
nonPositiveInteger       Decimal
negativeInteger          Decimal
nonNegativeInteger       Decimal
hexBinary                VarChar
base64Binary             VarChar
boolean                  Bit
Function        Description
Concatenate     Concatenates all strings in the list, starting with the first element.
Maximum         Selects the maximum value in the list. If the list contains Boolean strings, True is greater than False.
Minimum         Selects the minimum value in the list. If the list contains Boolean strings, False is less than True.
Variance        Calculates the variance value of all of the values in the list.
Function                  Description
Greater than              Returns true if the value is greater than the value of the parameter.
Greater than or equal     Returns true if the value is greater than or equal to the value of the parameter.
Less than                 Returns true if the value is less than the value of the parameter.
Less than or equal        Returns true if the value is less than or equal to the value of the parameter.
Compare                   Returns true if the string value is the same as the string value of the parameter.
CompareNoCase             Returns true if the string value is the same as the string value of the parameter. Ignores the case.
Contains                  Returns true if the string value contains the string value of the parameter.
ContainsCaseInsensitive   Returns true if the string value contains the string value of the parameter. Ignores the case.
IsNotBlank                Returns true if the string is not empty and not null.
Glossary

Access mode. The access mode of an IBM solidDB parameter defines whether the parameter can be changed dynamically via an ADMIN COMMAND, and when the change takes effect. The possible access modes are RO, RW, RW/Startup, and RW/Create.

Application Programming Interface. An interface provided by a software product that enables programs to request services.

BLOB. Binary large object. A block of bytes of data (for example, the body of a message) that has no discernible meaning, but is treated as one entity that cannot be interpreted.

DDL (Data Definition Language). An SQL statement that creates or modifies the structure of a table or database, for example, CREATE TABLE, DROP TABLE, ALTER TABLE, and CREATE DATABASE.

Disk-based table (D-table). A table whose contents are stored primarily on disk so that the server copies only small amounts of data at a time into memory.

DML (Data Manipulation Language). An INSERT, UPDATE, DELETE, or SELECT SQL statement.

Distributed Application. A set of application programs that collectively constitute a single application.

Dynamic SQL. SQL that is interpreted during execution of the statement.

Metadata. Typically called data (or information) about data. It describes or defines data elements.

Multi-Threading. Capability that enables multiple concurrent operations to use the same process.

Open Database Connectivity (ODBC). A standard application programming interface for accessing data in both relational and non-relational database management systems. Using this API, database applications can access data stored in database management systems on a variety of computers even if each database management system uses a different data storage format and programming interface. ODBC is based on the call level interface (CLI) specification of the X/Open SQL Access Group.

Optimization. The capability to enable a process to execute and perform in such a way as to maximize performance, minimize resource utilization, and minimize the process execution response time delivered to the user.

Partition. Part of a database that consists of its own data, indexes, configuration files, and transaction logs.

Primary Key. Field in a table that is uniquely different for each record in the table.

Process. An instance of a program running in a computer.

Server. A computer program that provides services to other computer programs (and their users) in the same or other computers. However, the computer in which a server program runs is also frequently referred to as a server.

Shared nothing. A data management architecture where nothing is shared between processes. Each process has its own processor, memory, and disk space.

SQL passthrough. The act of passing SQL statements to the back end, instead of executing statements in the front end.
Related publications

The publications listed in this section are considered particularly suitable for a
more detailed discussion of the topics covered in this book.
IBM Redbooks
The following IBM Redbooks publications provide additional information about
the topic in this document. Several publications referenced in this list might be
available in softcopy only.
InfoSphere DataStage Parallel Framework Standard Practices, SG24-7830
IBM InfoSphere DataStage Data Flow and Job Design, SG24-7576
The XML Files: Development of XML/XSL Applications Using WebSphere
Studio Version 5, SG24-6586
Implementing IBM InfoSphere Change Data Capture for DB2 z/OS V6.5,
REDP-4726
DB2 9 pureXML Guide, SG24-7315
Cloud Computing and the Value of zEnterprise, REDP-4763
Master Data Management IBM InfoSphere Rapid Deployment Package,
SG24-7704
You can search for, view, download or order these documents and other
Redbooks, Redpapers, Web Docs, draft and additional materials, at the following
website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
Information Server, Version 8.7 Information Center: Oracle connector and
LOB/XML types:
http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r7/topic/com.ibm.swg.im.iis.conn.oracon.usage.doc/topics/suppt_lob_oracc.html
InfoSphere DataStage for Enterprise XML Data Integration

XML is one of the most common standards for the exchange of information. However, organizations find challenges in how to address the complexities of dealing with hierarchical data types, particularly as they scale to gigabytes and beyond. In this IBM Redbooks publication, we discuss and describe the new capabilities in IBM InfoSphere DataStage 8.5. These capabilities enable developers to more easily manage the design and processing requirements presented by the most challenging XML sources. Developers can use these capabilities to create powerful hierarchical transformations and to parse and compose XML data with high performance and scalability. Spanning both batch and real-time run times, these capabilities can be used to solve a broad range of business requirements.

As part of the IBM InfoSphere Information Server 8.5 release, InfoSphere DataStage was enhanced with new hierarchical transformation capabilities called XML Stage. XML Stage provides native XML schema support and powerful XML transformation functionality. These capabilities are based on a unique state-of-the-art technology that allows you to parse and compose any complex XML structure from and to a relational form, as well as to a separate hierarchical form.

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers, and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.