Standardize Your Data Using InfoSphere QualityStage
Dhanunjaya Lokireddy is a Senior QA Engineer working for the InfoSphere QualityStage team at IBM India Software Lab, Hyderabad. He has
six years of experience at IBM working for different QA teams in the Information Server product area.
11 August 2011
Enterprises often face issues with data arising out of lack of standards. Data
may be entered in inconsistent ways across different systems, causing
records to appear different even though they are actually the same. For
example, the following two records describe the same person at the same address, even though the name
and address appear to be quite different:
Bob Christiansan
R.J. Christensen
Another common error leading to "data surprises" is that data can be misplaced. Here is an example where
several of the fields contain the wrong type of information. The name field contains address information, the
tax ID field contains telephone numbers, and the telephone field contains city name information. This
misplacement of data often leads to application errors.
Name               Tax ID          Telephone
                   025-37-1998     415-392-2770
                   228-02-1695     6173380220
                   34-2671854      3309321
HP 15 State St.    508-466-1550    Orlando
A third common data standardization problem is the lack of consistent identifiers. The
following example has three records containing a product description. They look different, but they
actually describe the same product, because no consistent identifier is used.
91-84-301 RS232 Cable 5' M-F CandS
The remaining sections of the tutorial contain detailed steps to create standardization jobs using different types
of rule sets, with examples.
The data contains records that belong to various countries. The steps below show how to use QualityStage to
identify the country for each record.
Figure 3 shows the designer palette where the standardize stage is selected.
Figure 3. Designer palette showing standardize stage
Figure 4 shows the input sequential file with the data from the listing above.
Figure 4. Input sequential file view data
The next screen is the standardize new rule process window, with the available columns listed.
Figure 6. Standardize new rule process window
2. For the listed data column, which is the input sequential file metadata, select Rule Sets > Other >
COUNTRY.
Figure 7. Rule set selection
3. Click the > button to move the Data column to the Selected column area.
Figure 8. Standardize rule process window with selected rule set and columns
4. Add a metadata delimiter. The metadata delimiter plays an important role in this type of rule set. The delimiter
sets the default country code. If the country rule set can't determine the country based on the
information provided, it defaults to the delimiter value. The format of the metadata delimiter is
ZQ<Country Code>ZQ. In this example, we are setting US as the default country. Enter ZQUSZQ in the
Literal field.
Figure 9. Standardize rule process window with metadata delimiter entered
6. Use the Move Up and Move Down buttons to arrange the metadata delimiter in the following way:
ZQUSZQ
Data
Click OK to add the process.
Figure 11. Standardize rule process window with all metadata delimiter arranged in order
Figure 12. Standardize stage properties window with created rule process
7. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. The following columns were
selected in this example: ISOCountryCode_COUNTRY and IdentifierFlag_COUNTRY, along with the "Data" input
field.
Drag and drop the columns listed above to the output.
Figure 13. Standardize stage output column mapping
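The default-country behavior of the metadata delimiter described in step 4 can be illustrated with a minimal Python sketch. This is not how the COUNTRY rule set is implemented. The keyword table, the function names, and the "Y"/"N" flag values are assumptions made for illustration only; the sketch simply shows a ZQ<Country Code>ZQ literal being parsed and used as the fallback when no country can be recognized in the data.

```python
import re

# Hypothetical keyword table for illustration only; the real COUNTRY rule set
# relies on its own classification data, not on this dictionary.
COUNTRY_HINTS = {
    "UNITED STATES": "US",
    "USA": "US",
    "CANADA": "CA",
    "INDIA": "IN",
    "FRANCE": "FR",
}


def parse_default_country(literal):
    """Extract the country code from a literal of the form ZQ<Country Code>ZQ."""
    match = re.fullmatch(r"ZQ([A-Z]{2})ZQ", literal)
    if match is None:
        raise ValueError("expected a literal such as ZQUSZQ, got %r" % literal)
    return match.group(1)


def identify_country(data, literal):
    """Return the output columns for one record (toy logic, assumed flag values)."""
    default_code = parse_default_country(literal)
    text = data.upper()
    for hint, code in COUNTRY_HINTS.items():
        # Match whole words only, so that "IN" inside an address is not a hit.
        if re.search(r"\b" + re.escape(hint) + r"\b", text):
            return {
                "Data": data,
                "ISOCountryCode_COUNTRY": code,
                "IdentifierFlag_COUNTRY": "Y",  # assumed: country was identified
            }
    # Nothing recognized: fall back to the default carried by the delimiter.
    return {
        "Data": data,
        "ISOCountryCode_COUNTRY": default_code,
        "IdentifierFlag_COUNTRY": "N",  # assumed: default was used
    }
```

Calling identify_country("1480 CARRIAGE LN APT 301", "ZQUSZQ") finds no country keyword, so the record receives the default code US from the delimiter.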
The domain pre-processor will identify different domains (like name, address and area) from the given data
and populate them to the correct fields. Let's take the following data:
"52280A NC 42 72 HWY # 42","KNOXVILLE TN 37920","Dr Jeffery David Thomson Jnr PHD"
"International Business Machines Corp","1480 CARRIAGE LN APT 301","AUBURN IN 467069555"
"Peter heines","ASHVILLE NEW YORK 147109762","930 SOUTH BROAD ST EAST APT H"
It has three fields: Field1, Field2, and Field3 (see Figure 16), but the data is scattered across all three fields.
For example, the name is in Field3 in the first record, and in Field1 in the second and
third records. We will create a standardize job that uses a pre-processor rule set to identify the different domains.
2. Select the USPREP rule set (Standardization Rules > USA > USPREP > USPREP) for the available
columns Field1, Field2, and Field3, which is the input sequential file metadata.
Figure 19. Rule set selection
3. Click the > button for the three fields to move them to the selected column area.
Figure 20. Standardize rule process window with selected rule set and columns
4. Add metadata delimiters. Metadata delimiters convey what kind of information we
expect in each of the input fields. If the pre-processor cannot determine the domain of a token, the token
defaults to the domain that is specified through the metadata delimiter. The format of the metadata delimiter
is ZQ<Domain>ZQ. In this example, we anticipate that Field1 contains Name data, Field2 contains
Address data, and Field3 contains Area data. Add three delimiters: ZQNAMEZQ, ZQADDRZQ, and
ZQAREAZQ.
7. Use the Move Up and Move Down buttons to arrange the metadata delimiters in the following way:
ZQNAMEZQ
Field1
ZQADDRZQ
Field2
ZQAREAZQ
Field3
Click OK to add the process.
Figure 24. Standardize rule process window with all metadata delimiters arranged in order
Figure 25. Standardize stage properties window with created rule process
8. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. The following columns were
selected in this example: NameDomain_USPREP, AddressDomain_USPREP, and AreaDomain_USPREP.
Drag and drop the columns listed above to the output.
Figure 26. Standardize stage output column mapping
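To make the role of the metadata delimiters concrete, here is a rough Python sketch of the idea behind a domain pre-processor. It is a toy under stated assumptions, not the USPREP rule set: the real rule set works on individual tokens with its own classification tables, whereas this sketch classifies whole fields using three hypothetical heuristics (a trailing ZIP code means area, a leading digit means address, a title or organization word means name). When no heuristic applies, the field falls back to the domain named by its delimiter, which is the behavior described in step 4.

```python
import re

# Hypothetical word list for illustration only.
NAME_WORDS = {"DR", "MR", "MRS", "JR", "JNR", "PHD", "CORP", "INC"}
# A 5-digit or 9-digit ZIP code at the end of the field suggests area data.
ZIP_AT_END = re.compile(r"\b\d{5}(?:\d{4})?$")

OUTPUT_COLUMN = {
    "NAME": "NameDomain_USPREP",
    "ADDR": "AddressDomain_USPREP",
    "AREA": "AreaDomain_USPREP",
}


def classify(value, default_domain):
    """Guess the domain of one field; fall back to the delimiter's domain."""
    text = value.strip().upper()
    if ZIP_AT_END.search(text):
        return "AREA"
    if text[:1].isdigit():
        return "ADDR"
    if NAME_WORDS & set(text.split()):
        return "NAME"
    return default_domain  # undetermined: use the ZQ<Domain>ZQ default


def preprocess(field1, field2, field3):
    """Mimic the ordering ZQNAMEZQ Field1, ZQADDRZQ Field2, ZQAREAZQ Field3."""
    tagged = [("NAME", field1), ("ADDR", field2), ("AREA", field3)]
    parts = {column: [] for column in OUTPUT_COLUMN.values()}
    for default_domain, value in tagged:
        domain = classify(value, default_domain)
        parts[OUTPUT_COLUMN[domain]].append(value)
    return {column: " ".join(values) for column, values in parts.items()}
```

For the third sample record, "Peter heines" matches none of the heuristics, so it stays in the name domain only because Field1 is preceded by ZQNAMEZQ.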
These examples contain individual and organization names, which are assumed to belong to the US. Our
intention here is to identify the different parts of each name, such as the primary name, first name, and last name.
2. Select the USNAME rule set (Standardization Rules > USA > USNAME > USNAME) for the column
"name," which is the input sequential file metadata.
Figure 32. Rule set selection
4. Do not add the "Optional NAMES Handling" option. The Optional NAMES Handling field has the
following options:
Process All as Individual: All columns are standardized as individual names.
Process All as Organization: All columns are standardized as organization names.
Process Undefined as Individual: All unhandled columns are standardized as individual names.
Process Undefined as Organization: All unhandled columns are standardized as organization names.
This option is useful if we know the types of names in the input file. For example, if the file mainly
contains organization names, specifying Process All as Organization enhances performance by
eliminating the processing steps that determine the name's type.
5. Click OK.
Figure 34. Standardize rule process window with selected rule set and columns
Figure 35. Standardize stage properties window with created rule process
6. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. In this example, the following
columns were selected: NameType_USNAME, GenderCode_USNAME, NamePrefix_USNAME,
FirstName_USNAME, MiddleName_USNAME, PrimaryName_USNAME, NameGeneration_USNAME, and
NameSuffix_USNAME
Drag and drop the above columns to the output.
Figure 36. Standardize stage output column mapping
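The effect of the Optional NAMES Handling options from step 4 can be pictured with a short Python sketch. It rests on several assumptions: the enum, the function names, the word lists, and the "I"/"O" type codes are all hypothetical, and "undefined" is read here as "the type could not be determined". The point is only to show why the Process All options can skip type detection entirely, while the Process Undefined options act as a fallback after detection.

```python
from enum import Enum


class NamesHandling(Enum):
    PROCESS_ALL_AS_INDIVIDUAL = 1
    PROCESS_ALL_AS_ORGANIZATION = 2
    PROCESS_UNDEFINED_AS_INDIVIDUAL = 3
    PROCESS_UNDEFINED_AS_ORGANIZATION = 4


# Hypothetical word lists for illustration only.
ORG_WORDS = {"CORP", "INC", "LLC", "LTD", "COMPANY"}
PERSON_WORDS = {"MR", "MRS", "DR", "JR", "PHD"}


def detect_type(name):
    """Toy type detection: 'O', 'I', or None when the type is undetermined."""
    tokens = set(name.upper().replace(".", "").split())
    if tokens & ORG_WORDS:
        return "O"
    if tokens & PERSON_WORDS:
        return "I"
    return None


def name_type(name, handling=None):
    """Decide how a name is processed under the chosen handling option."""
    # "Process All" options bypass detection, which is where the saving comes from.
    if handling is NamesHandling.PROCESS_ALL_AS_INDIVIDUAL:
        return "I"
    if handling is NamesHandling.PROCESS_ALL_AS_ORGANIZATION:
        return "O"
    detected = detect_type(name)
    if detected is not None:
        return detected
    # "Process Undefined" options only apply when detection gave no answer.
    if handling is NamesHandling.PROCESS_UNDEFINED_AS_INDIVIDUAL:
        return "I"
    if handling is NamesHandling.PROCESS_UNDEFINED_AS_ORGANIZATION:
        return "O"
    return None
```

With no option set, as in this tutorial's job, every name goes through detection, and a name that matches neither word list is left undetermined in this sketch.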
Implementing validation
This type of rule set is mainly used to validate the data (VDATE, VEMAIL, for example). Let's take the
following date examples:
OCT021983
09211991
02/29/2011
These are some of the acceptable input formats. The standardization job verifies whether each date is valid.
If the date is valid, the job sets the valid flag and produces the output in the standard format CCYYMMDD;
otherwise, it sets an invalid reason code.
2. Select the VDATE rule set (Standardization Rules > Other > VDATE) for the column "Date," which is
the input sequential file metadata.
Figure 42. Rule set selection
4. Click OK.
Figure 44. Standardize rule process window with selected rule set and columns
Figure 45. Standardize stage properties window with created rule process
5. Map the output columns (Stage Properties > Output > Mapping)
The standardize stage produces columns based on the rule set selected. In this example, the following
columns were selected: ValidFlag_VDATE, DateCCYYMMDD_VDATE, and InvalidReason_VDATE, along
with the input column "Date."
Drag and drop the above columns to the output.
Figure 46. Standardize stage output column mapping
OCT021983
Valid date (ValidFlag_VDATE)
19831002 is the standard format (DateCCYYMMDD_VDATE)
09211991
Valid date (ValidFlag_VDATE)
19910921 is the standard format (DateCCYYMMDD_VDATE)
02/29/2011
Invalid date (ValidFlag_VDATE)
The reason is that it is an invalid leap-year date (InvalidReason_VDATE)
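The validation results above can be reproduced with a small Python sketch. It is an approximation for illustration, not the VDATE rule set itself: only the three input formats shown in this tutorial are handled, the valid flag is a plain boolean, and the reason strings are hypothetical stand-ins for the rule set's actual reason codes.

```python
import calendar
import re

MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()

# Only the formats shown in this tutorial: OCT021983, 09211991, 02/29/2011.
PATTERNS = (
    re.compile(r"(?P<mon>[A-Za-z]{3})(?P<day>\d{2})(?P<year>\d{4})"),
    re.compile(r"(?P<mon>\d{2})(?P<day>\d{2})(?P<year>\d{4})"),
    re.compile(r"(?P<mon>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"),
)


def validate_date(raw):
    """Return the output columns for one input date (hypothetical reason texts)."""
    result = {
        "Date": raw,
        "ValidFlag_VDATE": False,
        "DateCCYYMMDD_VDATE": "",
        "InvalidReason_VDATE": "",
    }
    for pattern in PATTERNS:
        match = pattern.fullmatch(raw.strip())
        if match:
            break
    else:
        result["InvalidReason_VDATE"] = "unrecognized format"
        return result

    mon = match["mon"].upper()
    if mon.isdigit():
        month = int(mon)
    elif mon in MONTHS:
        month = MONTHS.index(mon) + 1
    else:
        month = 0
    day, year = int(match["day"]), int(match["year"])

    if year < 1 or not 1 <= month <= 12:
        result["InvalidReason_VDATE"] = "invalid month or year"
        return result
    days_in_month = calendar.monthrange(year, month)[1]
    if not 1 <= day <= days_in_month:
        if month == 2 and day == 29:
            result["InvalidReason_VDATE"] = "invalid leap-year date"
        else:
            result["InvalidReason_VDATE"] = "invalid day"
        return result

    result["ValidFlag_VDATE"] = True
    result["DateCCYYMMDD_VDATE"] = "%04d%02d%02d" % (year, month, day)
    return result
```

Because 2011 is not a leap year, February has 28 days, so 02/29/2011 is rejected while the other two inputs are converted to CCYYMMDD.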
Conclusion
In this tutorial, you have learned what the standardization process is and how it can be achieved by using
InfoSphere QualityStage. You have also learned about standardization using different types of rule sets like
country identifier, domain pre-processor, domain-specific, and validation.
Download
Description    Name                    Size
               SampleJobDesigns.zip    10KB
Resources
Learn
Get more information about InfoSphere Information Server from the Information
Center.
Learn more about Information Management at the developerWorks Information
Management zone. Find technical documentation, how-to articles, education,
downloads, product information, and more.