Data Quality Challenges

- Different or inconsistent standards in structure, format, or values
- Missing data, default values
- Spelling errors, data in wrong fields
- Buried information
- Data myopia
- Data anomalies

Quality Stage

Quality Stage is a tool intended to deliver high quality data required for success in a
range of enterprise initiatives including business intelligence, legacy consolidation and
master data management. It does this primarily by identifying components of data that
may be in columns or free format, standardizing the values and formats of those data,
using the standardized results and other generated values to determine likely duplicate
records, and building a “best of breed” record out of these sets of potential duplicates.
Through its intuitive user interface, Quality Stage substantially reduces the time and cost of implementing Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring data quality.

With Quality Stage it is possible, for example, to construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention, and to improve customer support and service, for example by identifying a company's most profitable customers. The cleansed data provided by Quality Stage allows creation of business intelligence on individuals and organizations for research, fraud detection, and planning.

Out of the box, Quality Stage provides for cleansing of name and address data and some related types of data, such as email addresses, tax IDs, and so on. However, Quality Stage is fully customizable and can be set up to cleanse any kind of classifiable data, such as infrastructure, inventory, or health data.

Quality Stage Heritage

The product now called Quality Stage has its origins in a product called INTEGRITY from a company called Vality. Vality was acquired by Ascential Software in 2003, and the product was renamed Quality Stage. This first version of Quality Stage reflected its heritage (for example, it had only batch-mode operation) and, indeed, its mainframe antecedents (for example, file name components limited to eight characters).
Ascential did not do much with the inner workings of Quality Stage, which was, after all, already a mature product. Ascential's emphasis was on providing two new modes of operation for Quality Stage. One was a "plug-in" for Data Stage that allowed data cleansing/standardization to be performed (by Quality Stage jobs) as part of an ETL data flow. The other was to enable Quality Stage to use the parallel execution technology (Orchestrate) that Ascential had acquired with Torrent Systems in 2001.
IBM acquired Ascential Software at the end of 2005. Since then the main direction has
been to put together a suite of products that share metadata transparently and share a
common set of services for such things as security, metadata delivery, reporting, and so
on. In the particular case of Quality Stage, it now shares a common Designer client with
Data Stage: from version 8.0 onwards Quality Stage jobs run as, or as part of, Data Stage
jobs, at least in the parallel execution environment.

QualityStage Functionality

QualityStage performs four tasks: investigation, standardization, matching, and survivorship. We will look at each of these in turn. Under the covers, QualityStage incorporates a set of probabilistic matching algorithms that can find potential duplicates in data despite variations in spelling, numeric or date values, use of non-standard forms, and various other obstacles that defeat deterministic methods. For example, if you have what appears to be the same employee record where the name is the same but the date of hire differs by a day or two, a deterministic algorithm would report two different employees, whereas a probabilistic algorithm would flag the potential duplicate.

(Deterministic means "absolute" in this sense: either something is equal or it is not. Probabilistic leaves room for some degree of uncertainty: a value may be close enough to be considered equal. Needless to say, the degree of uncertainty used within QualityStage is configurable by the designer.)

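To make the distinction concrete, here is a minimal sketch in Python of a deterministic comparison versus a probabilistic one. It uses standard-library string similarity rather than QualityStage's actual matching algorithms, and the field names and the 0.85 threshold are assumptions made up for the illustration.

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    # Absolute comparison: every field must be exactly equal.
    return a["name"] == b["name"] and a["hire_date"] == b["hire_date"]

def probabilistic_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Score each field on similarity (0.0 to 1.0) and accept "close enough".
    name_score = SequenceMatcher(None, a["name"], b["name"]).ratio()
    date_score = SequenceMatcher(None, a["hire_date"], b["hire_date"]).ratio()
    return (name_score + date_score) / 2 >= threshold

rec1 = {"name": "WILLIAM GAINES", "hire_date": "2001-03-15"}
rec2 = {"name": "WILLIAM GAINES", "hire_date": "2001-03-16"}  # hire date off by one day

print(deterministic_match(rec1, rec2))   # False - treated as two different employees
print(probabilistic_match(rec1, rec2))   # True  - flagged as a potential duplicate
```
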
Investigation

By investigation we mean inspection of the data to reveal certain types of information about those data. There is some overlap between Quality Stage investigation and the kinds of profiling results that are available using Information Analyzer, but not so much overlap as to suggest the removal of functionality from either tool. Quality Stage can undertake three different kinds of investigation.

Features

- Data investigation is done using the Investigate stage.
- This stage analyzes each record, field by field, for its content and structure.
- Free-form fields are broken up into individual tokens and then analyzed.
- It provides frequency distributions of distinct values and patterns.
- Each investigation phase produces pattern reports, word frequency reports, and word classification reports. The reports are located in the data directory of the server.

Investigate methods
Character Investigation

Single-domain fields

- Entity identifiers, e.g. ZIP codes, SSNs, Canadian postal codes
- Entity clarifiers, e.g. name prefix, gender, and marital status

Multiple-domain fields

- Large free-form fields such as multiple address fields

Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only the codes A through E. A character discrete investigation of that field will report the number of occurrences of every value in the field (and therefore any out-of-range values, empty or null values, and so on). "Pattern" in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example, a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the one field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns – the developer determines which.
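
The idea of a pattern frequency distribution can be sketched outside the tool. The following Python snippet (an illustration only, not QualityStage itself) maps each character to a class code – alphabetic to "a", numeric to "n", blank to "b", with punctuation kept as-is – and counts how often each resulting pattern occurs in a telephone-number field; the class codes and sample values are assumptions for the example.

```python
from collections import Counter

def char_pattern(value: str) -> str:
    # Classify each character: alphabetic -> 'a', numeric -> 'n', blank -> 'b',
    # anything else (punctuation, delimiters) is kept as-is.
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

phones = ["(303) 555-1234", "303-555-1234", "3035551234", "(303)555-1234"]

# Frequency distribution of patterns, like a character discrete investigation report.
for pattern, count in Counter(char_pattern(p) for p in phones).items():
    print(f"{count:3d}  {pattern}")
```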

Character concatenate investigation is exactly the same as character discrete investigation except that the contents of more than one field can be examined as if they were in a single field – the fields are, in some sense, concatenated prior to the investigation taking place. The results of a character concatenate investigation can be useful in revealing whether particular sets of patterns or values occur together.
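
Continuing the sketch above, a concatenate-style investigation simply joins the field contents before deriving the pattern; the fields and sample data here are hypothetical.

```python
# Reuses char_pattern and Counter from the previous sketch:
# examine city, state, and ZIP together as one concatenated value.
records = [("Denver", "CO", "80202"), ("Boulder", "CO", "80301-1234")]

patterns = Counter(char_pattern(" ".join(fields)) for fields in records)
for pattern, count in patterns.items():
    print(f"{count:3d}  {pattern}")
```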

Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected "rule set"; the other is to report how those words are classified, again in terms of the currently selected rule set. Word investigation has no overlap with Information Analyzer (the data profiling tool).

Rule Set

A rule set includes a set of tables that list the “known” words or tokens. For example,
the GBNAME rule set contains a list of names that are known to be first names in Great
Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the
GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that
can not only be recognized as name prefixes (titles, if you prefer) but can in some cases
reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a
pattern. This shows how each known word in the data record is classified, and the order
in which each occurs. For example, under the USNAME rule set the name WILLIAM F.
GAINES III would report the pattern FI?G – the F indicates that “William” is a known first
name, the I indicates the “F” is an initial, the ? indicates that “Gaines” is not a known
word in context, and the G indicates that “III” is a “generation” – as would be “Senior”,
“IV” and “fils”. Punctuation may be included or ignored.
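
As a rough sketch of how such a pattern might be derived, the snippet below classifies each token of a name against a tiny, hypothetical classification table (a much-reduced stand-in for a rule set such as USNAME) and emits the class codes in order. The table contents and the single-letter-means-initial rule are assumptions for the example.

```python
# Hypothetical, much-reduced classification table: token -> class code.
CLASSIFICATION = {
    "WILLIAM": "F",   # known first name
    "SENIOR": "G",    # generation words
    "III": "G",
    "IV": "G",
}

def classify(token: str) -> str:
    token = token.rstrip(".")           # "F." is treated as "F"
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if len(token) == 1 and token.isalpha():
        return "I"                      # a single letter is taken as an initial
    return "?"                          # not a known word in this context

def name_pattern(name: str) -> str:
    return "".join(classify(tok) for tok in name.upper().split())

print(name_pattern("WILLIAM F. GAINES III"))   # FI?G
```
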
Rule sets also come into play when performing standardization (discussed below).
Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, "William" might be recorded as the standard form for "Bill"), and they may contain an uncertainty threshold (for example, "Felliciity" might still be recognizable as "Felicity" even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
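
The combination of standard forms and an uncertainty threshold can be sketched as follows, again with plain Python string similarity rather than QualityStage's own comparison logic; the table entries and thresholds are invented for the illustration.

```python
from difflib import SequenceMatcher

# Hypothetical classification entries: standard form plus a per-entry threshold.
STANDARD_FORMS = {
    "BILL": ("WILLIAM", 1.0),       # exact synonym, no fuzziness needed
    "FELICITY": ("FELICITY", 0.8),  # allow misspellings within 80% similarity
}

def standardize(token: str):
    token = token.upper()
    best, best_score = None, 0.0
    for known, (standard, threshold) in STANDARD_FORMS.items():
        score = SequenceMatcher(None, token, known).ratio()
        if score >= threshold and score > best_score:
            best, best_score = standard, score
    return best

print(standardize("Bill"))        # WILLIAM
print(standardize("Felliciity"))  # FELICITY - close enough despite the misspelling
print(standardize("Zorro"))       # None - no classification within threshold
```
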
Investigation might also be performed to review the results of standardization,
particularly to see whether there are any unhandled patterns or text that could be
better handled if the rule set itself were tweaked, either with improved classification
tables or through a mechanism called rule set overrides.
