Standardize Your Data Using InfoSphere QualityStage

Data standardization is a process that ensures that data conforms to quality rules. This tutorial introduces data
standardization concepts and demonstrates how you can achieve standardized data using IBM InfoSphere
QualityStage. A reader who is new to QualityStage standardization will get a basic understanding of the process. Readers
should have basic knowledge of InfoSphere DataStage job development. This tutorial covers standardization using country
identifier, domain pre-processor, domain-specific and validation types of rule sets.

Dhanunjaya Lokireddy is a Senior QA Engineer working for the InfoSphere QualityStage team at IBM India Software Lab, Hyderabad. He has
six years of experience in IBM working for different QA teams in the Information Server product area.

11 August 2011
Also available in Chinese and Portuguese

Before you start

Editor's note: All personal data appearing in this tutorial is fictitious and was
created for sample purposes only.

InfoSphere QualityStage overview

Enterprises often face issues with data arising out of lack of standards. Data may be entered in
inconsistent ways across different systems, causing records to appear different even though they are
actually the same. For example, the following two records describe the same person at the same address,
even though the name and address appear to be quite different:
Bob Christiansan    614 Columbus Ave #3, Boston, Massachusetts 02116
R.J. Christensen    614 Columbus Suite #3, Suffolk County 02116
Another common error leading to "data surprises" is that data can be misplaced. Here is an example where
several of the fields contain the wrong type of information. The name field contains address information, the
tax ID field contains telephone numbers, and the telephone field contains city name information. This
misplacement of data often leads to application errors.
Name                     Tax ID         Telephone
Becker & Co. C/O Bill    025-37-1998    415-392-2770
B Smith DBA Lime Cons.   228-02-1695    6173380220
1st Natl Provident       34-2671854     3309321
HP 15 State St.          508-466-1550   Orlando

A third common data standardization problem is the lack of consistent identifiers. The following
example has three records containing a product description. They look different, but they actually
describe the same product.
91-84-301 RS232 Cable 5' M-F CandS

CS-89641 5 ft. Cable Male-F, RS232 #87951

C&SUCH6 Male/Female 25 PIN 5 Foot Cable

InfoSphere QualityStage (hereafter called QualityStage), a component product of InfoSphere Information
Server, helps identify and resolve the issues described above and provides a way to maintain an accurate
view of master data entities. QualityStage has the following capabilities:
Investigation: Helps you understand the nature and scope of data anomalies
Standardization: Parses individual fields and makes them uniform according to business standards
Matching: Identifies duplicate records within and across data sources
Survivorship: Helps eliminate duplicate records and create the best-of-breed record of data

Understanding the standardization process

Standardization parses or separates free-form fields into single component fields or assigns data to its
appropriate metadata fields in a standard format.
Data is frequently captured with variations resulting from:
Data entry errors
Different conventions for representing the same data value
Semantic differences across systems
Multiple sources for the same data element
Lack of data quality standards
But the target systems require cleansed data for reporting and decision-making. Standardization helps
improve the addressability of data stored in free-form columns and ensures that each data element has
relevant content and format. It normalizes data values to standard forms and prepares data elements for
more effective matching. It also helps in identifying and removing invalid data values. Standardization is
important because it prepares the data for further processing.
Standardization works based on special instructions called rule sets. Some rule sets are:
Country identifier, such as COUNTRY
Domain pre-processor, such as USPREP
Domain-specific, such as USNAME
Validation, such as VDATE
Most of the packaged rule sets are country-specific. For example, there are different name standardization
rule sets for the United States and Japan. As of InfoSphere Information Server V8.5, these rule sets are
packaged with QualityStage. Advanced users can create rule sets based on their business requirements.
Rule sets have three required components:
Classification table: Contains the keywords, standard values, and user-defined classes
Dictionary file: Defines the layout of the output columns
Pattern-action file: Contains the logic to populate output columns and the parsing parameters
Figure 1. Standardization process overview

Figure 1 shows an overview of the standardization process:
1. Parses input data using the pattern-action file parameters (SEPLIST/STRIPLIST)
2. Assigns user-defined classes from the classification table and applies default classes to the
remaining tokens
3. Forms output fields using the dictionary file
4. Populates the output fields using the pattern-action file
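
To make the four steps concrete, here is a minimal Python sketch of the same flow. It does not show how
QualityStage is implemented. The separator and strip characters, the tiny classification table, the
three dictionary fields, the class symbols (loosely modeled on default classes such as ^ for numeric
tokens), and the single hard-coded pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical parsing parameters, standing in for SEPLIST/STRIPLIST.
SEPLIST = " ,"    # characters that separate tokens
STRIPLIST = " ,"  # separator characters that are dropped after splitting

# Hypothetical classification table: keyword -> (standard value, user-defined class).
CLASSIFICATION = {
    "AVE": ("AVE", "T"),
    "AVENUE": ("AVE", "T"),
    "ST": ("ST", "T"),
    "STREET": ("ST", "T"),
}

# Hypothetical dictionary: the layout (names and order) of the output columns.
DICTIONARY = ["HouseNumber", "StreetName", "StreetSuffixType"]


def tokenize(text):
    """Step 1: split on SEPLIST characters, then drop STRIPLIST characters."""
    splitter = "([" + re.escape(SEPLIST) + "])"
    pieces = re.split(splitter, text.upper())
    strip = set(STRIPLIST)
    return [p for p in pieces if p and p not in strip]


def classify(token):
    """Step 2: use the classification table, else fall back to a default class."""
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if token.isdigit():
        return (token, "^")   # all digits
    if token.isalpha():
        return (token, "?")   # unclassified alphabetic word
    return (token, "@")       # anything else (mixed characters)


def standardize(text):
    """Steps 3 and 4: build the output layout and fill it when the pattern matches."""
    classified = [classify(t) for t in tokenize(text)]
    pattern = [cls for _, cls in classified]
    output = {field: "" for field in DICTIONARY}
    if pattern == ["^", "?", "T"]:  # the only "pattern-action" rule in this sketch
        values = [value for value, _ in classified]
        output.update(zip(DICTIONARY, values))
    return output
```

Calling standardize("614 Columbus Avenue") fills all three fields and replaces AVENUE with its standard
value AVE. Input that does not match the single pattern leaves every field empty, which is roughly
where a real rule set would try its other patterns.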

The remaining sections of the tutorial contain detailed steps, with examples, for creating
standardization jobs that use the different types of rule sets.

Implementing the country identifier rule set

The country identifier rule set helps to identify the country using the given data. For example, take the
following data:
Listing 1. Data records for country identifier example
Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42

The data contains records that belong to various countries. The steps below show how to use
QualityStage to identify the country for each record.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 2. Configure the input sequential file stage to read the input file,
which contains the example records listed above.
Figure 2. Parallel job with sequential and standardize stages

Figure 3 shows the designer palette where the standardize stage is selected.
Figure 3. Designer palette showing standardize stage

Figure 4 shows the input sequential file with the data from the listing above.
Figure 4. Input sequential file view data

Step 2: Configure the standardize stage

1. Create a new process. Use the New Process button in the toolbar.
Figure 5. Standardize stage properties

The next screen is the standardize new rule process window, with the available columns listed.
Figure 6. Standardize new rule process window

2. For the listed data column, which is the input sequential file metadata, select Rule Sets > Other >
COUNTRY.
Figure 7. Rule set selection

3. Click the > button to move the Data column to the Selected column area.
Figure 8. Standardize rule process window with selected rule set and columns

4. Add a metadata delimiter. The metadata delimiter plays an important role in this type of rule set:
it sets the default country code. If the country rule set can't determine the country based on the
information provided, it defaults to the delimiter value. The format of the metadata delimiter is
ZQ<Country Code>ZQ. In this example, US is the default country, so enter ZQUSZQ in the
Literal field.
Figure 9. Standardize rule process window with metadata delimiter entered

5. Click the > button beside the Literal field.

Figure 10. Using literal to set the country code

6. Use the Move Up and Move Down buttons to arrange the metadata delimiter in the following way:

ZQUSZQ
Data
Click OK to add the process.
Figure 11. Standardize rule process window with all metadata delimiter arranged in order

Figure 12. Standardize stage properties window with created rule process

7. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. The following columns were
selected in this example: ISOCountryCode_COUNTRY, IdentifierFlag_COUNTRY, along with "Data" input
field.
Drag and drop the columns listed above to the output.
Figure 13. Standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required file name and adjust other settings,
such as the format, as needed. Run the job and verify the output. Here is the output produced:
Figure 14. Output sequential file view data

Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000
The country code for this record is identified as AU (ISOCountryCode_COUNTRY).
The country code is identified based on the data only (IdentifierFlag_COUNTRY).
Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ
The country code for this record is identified as GB (ISOCountryCode_COUNTRY).
The country code is identified based on the data only (IdentifierFlag_COUNTRY).
Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1
The country code for this record is identified as CA (ISOCountryCode_COUNTRY).
The country code is identified based on the data only (IdentifierFlag_COUNTRY).
Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42
The country code for this record is identified as US (ISOCountryCode_COUNTRY).
The country could not be identified from the data, so the default country code from the metadata
delimiter (US) was used (IdentifierFlag_COUNTRY).
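
The fallback behavior of the metadata delimiter can be sketched in a few lines of Python. This is only
an illustration of the idea, not the logic of the COUNTRY rule set: the three regular expressions are
toy postal-code patterns invented for this sketch, and the "Y"/"N" flag values are an assumed encoding
of "identified from the data" versus "default used".

```python
import re

# Metadata delimiter format described in the text: ZQ<Country Code>ZQ.
DELIMITER_RE = re.compile(r"^ZQ([A-Z]{2})ZQ$")

# Toy patterns (assumptions for illustration, not the COUNTRY rule set).
PATTERNS = [
    ("AU", re.compile(r"\b(?:NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+\d{4}\b")),
    ("CA", re.compile(r"\b[A-Z]\d[A-Z]\s*\d[A-Z]\d\b")),
    ("GB", re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")),
]


def default_country(literal):
    """Extract the default country code from a literal such as ZQUSZQ."""
    match = DELIMITER_RE.match(literal)
    if match is None:
        raise ValueError("expected a literal of the form ZQ<Country Code>ZQ")
    return match.group(1)


def identify_country(literal, data):
    """Return (country code, flag); the flag is "N" when the default was used."""
    for code, pattern in PATTERNS:
        if pattern.search(data):
            return code, "Y"
    return default_country(literal), "N"


records = [
    "Andrew Conacher Level 10, 135 Exhibition St Melbourne VIC 3000",
    "Ian Williams 167-170 Washway Road Sale Manchester M33 6RJ",
    "Eric Ferm 17 Wellington Street W. 4th Floor Toronto, Ontario, M5K 1B1",
    "Dr Jeffery David Thomson Jnr PHD 52280A NC 42 72 HWY # 42",
]
for record in records:
    print(identify_country("ZQUSZQ", record), record)
```

With these toy patterns, the first three records resolve to AU, GB, and CA from the data, and the
fourth falls back to the US default carried by ZQUSZQ, mirroring the results described above.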

Implementing the domain pre-processor

The domain pre-processor will identify different domains (like name, address and area) from the given data
and populate them to the correct fields. Let's take the following data:
"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"

It has three fields: Field1, Field2, and Field3 (see Figure 16), but the data is scattered across all
three. For example, the name is in Field3 in the first record but in Field1 in the second and third
records. We will create a standardization job that uses a pre-processor rule set to identify the
different domains.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 15. Configure the input sequential file stage to read the input file,
which contains the example records listed above.
Figure 15. Parallel job with sequential and standardize stages

Figure 16. Input sequential file view data

Step 2: Configure the standardize stage

1. Create a new process.
Figure 17. Standardize stage properties

Figure 18. Standardize new rule process window

2. Select the USPREP rule set (Standardization Rules > USA > USPREP > USPREP) for the available
columns Field1, Field2, and Field3, which is the input sequential file metadata.
Figure 19. Rule set selection

3. Click the > button for the three fields to move them to the selected column area.
Figure 20. Standardize rule process window with selected rule set and columns

4. Add metadata delimiters. Metadata delimiters convey what kind of information we expect in each of
the input fields. If the pre-processor cannot determine the domain of a token, the token defaults to
the domain specified through the metadata delimiter. The format of the metadata delimiter is
ZQ<Domain>ZQ. In this example, we anticipate that Field1 contains Name data, Field2 contains
Address data, and Field3 contains Area data, so add three delimiters: ZQNAMEZQ, ZQADDRZQ, and
ZQAREAZQ. Start by entering ZQNAMEZQ in the Literal field.

Figure 21. Standardize rule process window with metadata delimiter entered

5. Click the > button.

Figure 22. Standardize rule process window with metadata delimiter selected

6. Repeat steps 4 and 5 to add delimiters ZQADDRZQ and ZQAREAZQ.

Figure 23. Standardize rule process window with all metadata delimiters selected

7. Use the Move Up and Move Down buttons to arrange the metadata delimiters in the following way:
ZQNAMEZQ
Field1
ZQADDRZQ
Field2
ZQAREAZQ
Field3
Click OK to add the process.
Figure 24. Standardize rule process window with all metadata delimiters arranged in order

Figure 25. Standardize stage properties window with created rule process

8. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. The following columns were
selected in this example: NameDomain_USPREP, AddressDomain_USPREP, and AreaDomain_USPREP.
Drag and drop the columns listed above to the output.
Figure 26. Standardize stage output column mapping
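
The role of the delimiters configured in this step can be sketched in Python. The sketch assumes the
delimiters and fields are joined in the order arranged above, and it uses three toy whole-field rules
(a trailing ZIP-like number, a leading digit, and a two-word name keyword list) that are invented for
illustration. The USPREP rule set itself works on individual tokens with its own classification table
and patterns, so treat this only as a picture of how a default domain is applied when nothing else
decides.

```python
import re

# Delimiters in the format described in the text: ZQ<Domain>ZQ.
DELIM_RE = re.compile(r"ZQ(NAME|ADDR|AREA)ZQ")

# Toy rules (assumptions for illustration only).
ZIP_RE = re.compile(r"\b\d{5}(?:\d{4})?$")   # ends with a 5- or 9-digit number
NAME_WORDS = {"DR", "CORP"}                  # tiny stand-in for a keyword table


def build_input(fields, domains=("NAME", "ADDR", "AREA")):
    """Interleave delimiters and fields: ZQNAMEZQ Field1 ZQADDRZQ Field2 ..."""
    return " ".join(f"ZQ{d}ZQ {f}" for d, f in zip(domains, fields))


def split_segments(text):
    """Return (default domain, text) pairs, one per delimiter."""
    parts = DELIM_RE.split(text)  # ['', 'NAME', ' ... ', 'ADDR', ' ... ', ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]


def guess_domain(text, default):
    """Apply the toy rules; fall back to the delimiter's domain if none fires."""
    if ZIP_RE.search(text):
        return "AREA"
    if text[:1].isdigit():
        return "ADDR"
    if NAME_WORDS & set(text.upper().split()):
        return "NAME"
    return default


def preprocess(fields):
    """Route each field to a domain, mimicking the three output columns."""
    out = {"NAME": "", "ADDR": "", "AREA": ""}
    for default, text in split_segments(build_input(fields)):
        domain = guess_domain(text, default)
        out[domain] = (out[domain] + " " + text).strip()
    return out


print(preprocess(["Peter heines",
                  "ASHVILLE NEW YORK 147109762",
                  "930 SOUTH BROAD ST EAST APT H"]))
```

In the record shown, no toy rule recognizes "Peter heines", so it lands in the name domain only because
it arrived after ZQNAMEZQ. The other two fields are moved away from their defaults by the rules.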

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required file name and adjust other settings,
such as the format, as needed. Run the job and verify the output. Figure 27 shows the output produced.
Figure 27. Output sequential file view data

"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"International Business Machines Corp" is identified as the name domain (NameDomain).
"1480 CARRIAGE LN APT 301" is identified as the address domain (AddressDomain).
"AUBURN IN 467069555" is identified as the area domain (AreaDomain).
"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"Dr Jeffery David Thomson Jnr PHD" is identified as the name domain (NameDomain).
"52280A NC 42 72 HWY # 42" is identified as the address domain (AddressDomain).
"KNOXVILLE TN 37920" is identified as the area domain (AreaDomain).
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
"Peter heines" is identified as the name domain (NameDomain).
"930 SOUTH BROAD ST EAST APT H" is identified as the address domain (AddressDomain).
"ASHVILLE NEW YORK 147109762" is identified as the area domain (AreaDomain).

Implementing name standardization

This is the domain-specific type of standardization. Let's take the following name examples.
Dr Jeffery David Thomson Jnr PHD
International Business Machines Corp
Peter heines

These examples contain individual and organization names; assume that they all belong to the US. Our
intention here is to identify different parts of the name like the primary name, first name, and last
name.

Step 1: Create a parallel job

Create a parallel job as shown in Figure 28. Configure the input sequential file stage to read the
input file, which contains the example records listed above.
Figure 28. Parallel job with sequential and standardize stages

Figure 29. Input sequential file view data

Step 2: Configure the standardize stage

1. Create a new process.
Figure 30. Standardize stage properties

Figure 31. Standardize new rule process window

2. Select the USNAME rule set (Standardization Rules > USA > USNAME > USNAME) for the column
"name," which is the input sequential file metadata.
Figure 32. Rule set selection

3. Click the > button.

Figure 33. Standardize rule process window with rule set selected

4. Do not add the "Optional NAMES Handling" option. The Optional NAMES Handling field has the
following options:
Process All as Individual: All columns are standardized as individual names.
Process All as Organization: All columns are standardized as organization names.
Process Undefined as Individual: All unhandled columns are standardized as individual names.
Process Undefined as Organization: All unhandled columns are standardized as organization names.
This option is useful if we know the types of names in the input file. For example, if the file mainly
contains organization names, specifying Process All as Organization enhances performance by
eliminating the processing steps of determining the name's type.
5. Click OK.
Figure 34. Standardize rule process window with selected rule set and columns

Figure 35. Standardize stage properties window with created rule process

6. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. In this example, the following
columns were selected: NameType_USNAME, GenderCode_USNAME, NamePrefix_USNAME,
FirstName_USNAME, MiddleName_USNAME, PrimaryName_USNAME, NameGeneration_USNAME, and
NameSuffix_USNAME
Drag and drop the above columns to the output.
Figure 36. Standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required file name and adjust other settings,
such as the format, as needed. Run the job and verify the output. Figure 37 shows the output produced.
Figure 37. Output sequential file view data

Dr Jeffery David Thomson Jnr PHD
The data is identified as an individual name (NameType).
Gender is male (GenderCode).
Dr is the name prefix (NamePrefix).
Jeffery is the first name (FirstName).
David is the middle name (MiddleName).
Thomson is the primary name (PrimaryName).
Jr is the generation (NameGeneration). The input contains Jnr, but the standardize stage outputs the
commonly used standard form.
PHD is the name suffix (NameSuffix).
International Business Machines Corp
The data is identified as an organization name (NameType).
International Business Machines is the primary name (PrimaryName).
Corp is the name suffix (NameSuffix).
Peter heines
The data is identified as an individual name (NameType).
Gender is male (GenderCode).
Peter is the first name (FirstName).
Heines is the primary name (PrimaryName).
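
The way the three names above break down into output columns can be imitated with a small Python
sketch. Everything in it is an assumption made for illustration: the lookup tables hold only the
keywords that appear in these three examples, the "I"/"O" name-type codes and the uppercase output are
choices of this sketch, and gender detection is left out because it would need a first-name table. The
USNAME rule set itself relies on a much larger classification table and many patterns.

```python
# Hypothetical lookup tables: keyword -> standard value.
PREFIXES = {"DR": "DR"}
GENERATIONS = {"JNR": "JR", "JR": "JR"}   # Jnr is mapped to the standard form Jr
SUFFIXES = {"PHD": "PHD"}
ORG_SUFFIXES = {"CORP": "CORP"}

# Output layout, named after the USNAME columns used in this section.
COLUMNS = ["NameType", "NamePrefix", "FirstName", "MiddleName",
           "PrimaryName", "NameGeneration", "NameSuffix"]


def standardize_name(text):
    """Split a free-form name into the columns above using the toy tables."""
    tokens = text.upper().split()
    out = {column: "" for column in COLUMNS}

    # A trailing organization keyword marks the whole value as an organization.
    if tokens and tokens[-1] in ORG_SUFFIXES:
        out["NameType"] = "O"
        out["NameSuffix"] = ORG_SUFFIXES[tokens[-1]]
        out["PrimaryName"] = " ".join(tokens[:-1])
        return out

    out["NameType"] = "I"
    if tokens and tokens[0] in PREFIXES:
        out["NamePrefix"] = PREFIXES[tokens.pop(0)]
    if tokens and tokens[-1] in SUFFIXES:
        out["NameSuffix"] = SUFFIXES[tokens.pop()]
    if tokens and tokens[-1] in GENERATIONS:
        out["NameGeneration"] = GENERATIONS[tokens.pop()]
    if tokens:
        out["PrimaryName"] = tokens.pop()      # last remaining word
    if tokens:
        out["FirstName"] = tokens.pop(0)       # first remaining word
    out["MiddleName"] = " ".join(tokens)       # whatever is left in between
    return out


for name in ["Dr Jeffery David Thomson Jnr PHD",
             "International Business Machines Corp",
             "Peter heines"]:
    print(standardize_name(name))
```

The point to notice is the GENERATIONS table: the input value Jnr is replaced by its standard form,
which is the same kind of substitution the standardize stage performed in the first record.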

Implementing validation
This type of rule set is mainly used to validate the data (VDATE, VEMAIL, for example). Let's take the
following date examples:
OCT021983
09211991
02/29/2011

These are some of the acceptable input formats. The standardization job verifies whether each date is
valid. If it is, the job sets the valid flag and produces the output in the standard format CCYYMMDD;
otherwise, it sets an invalid reason code.

Step 1: Create the parallel job

Create a parallel job as shown in Figure 38. Configure the input sequential file stage to read the input file,
which contains the above example records.
Figure 38. Parallel job with sequential and standardize stages

Figure 39. Input sequential file view data

Step 2: Configure the standardize stage

1. Create a new process.
Figure 40. Standardize stage properties

Figure 41. Standardize new rule process window

2. Select the VDATE rule set (Standardization Rules > Other > VDATE) for the column "Date," which is
the input sequential file metadata.
Figure 42. Rule set selection

3. Click the > button.

Figure 43. Standardize rule process window with rule set selected

4. Click OK.
Figure 44. Standardize rule process window with selected rule set and columns

Figure 45. Standardize stage properties window with created rule process

5. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. In this example, the following
columns were selected: ValidFlag_VDATE, DateCCYYMMDD_VDATE, and InvalidReason_VDATE, along
with the input column "Date."
Drag and drop the above columns to the output.
Figure 46. Standardize stage output column mapping

Step 3: Configure the output file and run the job

Configure the output sequential file stage: supply the required file name and adjust other settings,
such as the format, as needed. Run the job and verify the output. Here is the output produced:
Figure 47. Output sequential file view data

OCT021983
Valid date (ValidFlag_VDATE)
19831002 is the standard format (DateCCYYMMDD_VDATE)
09211991
Valid date (ValidFlag_VDATE)
19910921 is the standard format (DateCCYYMMDD_VDATE)
02/29/2011
Invalid date (ValidFlag_VDATE)
The reason is that 2011 is not a leap year, so February 29 does not exist (InvalidReason_VDATE)
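
The validation logic for these three inputs can be approximated in Python. This is a sketch under
stated assumptions, not the VDATE rule set: it accepts only the three input layouts shown in this
section, the dictionary keys borrow the output column names without the _VDATE suffix, the flag is a
plain Boolean, and the reason strings are descriptive text rather than the rule set's reason codes.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# The three input layouts used in this section (an assumed, incomplete list).
FORMATS = [
    re.compile(r"^(?P<mon>[A-Za-z]{3})(?P<day>\d{2})(?P<year>\d{4})$"),  # OCT021983
    re.compile(r"^(?P<mon>\d{2})(?P<day>\d{2})(?P<year>\d{4})$"),        # 09211991
    re.compile(r"^(?P<mon>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})$"),      # 02/29/2011
]


def result(valid, ccyymmdd="", reason=""):
    return {"ValidFlag": valid, "DateCCYYMMDD": ccyymmdd, "InvalidReason": reason}


def validate_date(text):
    """Validate one date string and return it as CCYYMMDD when it is valid."""
    for fmt in FORMATS:
        match = fmt.match(text)
        if match:
            break
    else:
        return result(False, reason="unrecognized input format")

    mon = match.group("mon")
    month = MONTHS.get(mon.upper()) if mon.isalpha() else int(mon)
    day, year = int(match.group("day")), int(match.group("year"))
    if month is None or not 1 <= month <= 12:
        return result(False, reason="invalid month")

    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        if month == 2 and day == 29:
            return result(False, reason="February 29 in a non-leap year")
        return result(False, reason="invalid day for month")
    return result(True, ccyymmdd=f"{year:04d}{month:02d}{day:02d}")


for value in ["OCT021983", "09211991", "02/29/2011"]:
    print(value, validate_date(value))
```

Using datetime.date to do the calendar check keeps the leap-year rule out of the sketch, so the
02/29/2011 case fails for the same reason the job reports.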

Conclusion
In this tutorial, you have learned what the standardization process is and how it can be achieved by using
InfoSphere QualityStage. You have also learned about standardization using different types of rule sets like
country identifier, domain pre-processor, domain-specific, and validation.

Download
Description            Name                   Size
Sample jobs and data   SampleJobDesigns.zip   10KB
