Data Quality Challenges
Buried information
Data myopia
Data anomalies
QualityStage is a tool intended to deliver the high-quality data required for success in a range of enterprise initiatives, including business intelligence, legacy consolidation, and master data management. It does this primarily by identifying components of data that may be in fixed columns or free format, standardizing the values and formats of those data, using the standardized results and other generated values to determine likely duplicate records, and building a “best of breed” record out of each set of potential duplicates (sketched below). Through its intuitive user interface, QualityStage substantially reduces the time and cost to implement Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring data quality.
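To make that flow concrete, here is a minimal Python sketch of the standardize/match/survive idea. Everything in it (NICKNAMES, match_key, the survivorship rule) is invented for illustration; it is a conceptual toy, not the QualityStage API.

    # Illustrative only: a toy standardize -> match -> survive flow.
    NICKNAMES = {"BILL": "WILLIAM", "LIZ": "ELIZABETH"}  # toy standard forms

    def standardize(record):
        """Upper-case, split the free-format name, map nicknames to standard forms."""
        tokens = record["name"].upper().replace(".", "").split()
        tokens = [NICKNAMES.get(t, t) for t in tokens]
        return {"name": " ".join(tokens), "zip": record["zip"].strip()}

    def match_key(record):
        """A crude matching key: first name token plus ZIP code."""
        return (record["name"].split()[0], record["zip"])

    def survive(duplicates):
        """'Best of breed': keep the most complete value for each field."""
        return {field: max((d[field] for d in duplicates), key=len)
                for field in duplicates[0]}

    records = [{"name": "Bill Gaines", "zip": "02139"},
               {"name": "William F. Gaines", "zip": "02139 "}]

    groups = {}
    for rec in records:
        std = standardize(rec)
        groups.setdefault(match_key(std), []).append(std)

    for dupes in groups.values():
        print(survive(dupes))   # {'name': 'WILLIAM F GAINES', 'zip': '02139'}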
With QualityStage it is possible, for example, to construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention, and to help improve customer support and service, for example by identifying a company's most profitable customers. The cleansed data provided by QualityStage also supports the creation of business intelligence on individuals and organizations for research, fraud detection, and planning.
Out of the box, QualityStage provides for cleansing of name and address data and some related types of data, such as email addresses, tax IDs, and so on. However, QualityStage is fully customizable and can be made to cleanse any kind of classifiable data, such as infrastructure, inventory, or health data.
The product now called QualityStage has its origins in a product called INTEGRITY from a company called Vality. Vality was acquired by Ascential Software in 2002, and the product was renamed QualityStage. This first version of QualityStage reflected its heritage (for example, it only had batch mode operation) and, indeed, its mainframe antecedents (for example, file name components were limited to eight characters).
Ascential did not do much with the inner workings of QualityStage, which was, after all, already a mature product. Ascential's emphasis was on providing two new modes of operation for QualityStage. One was a “plug-in” for DataStage that allowed data cleansing/standardization to be performed (by QualityStage jobs) as part of an ETL data flow. The other was to enable QualityStage to use the parallel execution technology (Orchestrate) that Ascential had as a result of its acquisition of Torrent Systems in 2001.
IBM acquired Ascential Software in 2005. Since then the main direction has been to assemble a suite of products that share metadata transparently and share a common set of services for such things as security, metadata delivery, reporting, and so on. In the particular case of QualityStage, it now shares a common Designer client with DataStage: from version 8.0 onwards, QualityStage jobs run as, or as part of, DataStage jobs, at least in the parallel execution environment.
QualityStage Functionality

Features

Investigate methods

Character investigation
Character investigation examines single-domain fields, which fall into two broad groups:
  Entity identifiers: e.g. ZIP codes, SSNs, Canadian postal codes
  Entity clarifiers: e.g. name prefix, gender, and marital status
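In essence, a character investigation reports the frequency of character patterns found in a single-domain field. Below is a minimal sketch of that idea, using a made-up mask alphabet of “a” for alphabetic and “n” for numeric; QualityStage's actual mask notation differs.

    from collections import Counter

    def char_mask(value):
        """Map each character to a type code: 'a' alpha, 'n' numeric, others kept."""
        return "".join("a" if c.isalpha() else "n" if c.isdigit() else c
                       for c in value)

    samples = ["02139", "02139-4307", "A1B 2C3", "2139"]
    print(Counter(char_mask(s) for s in samples))
    # Counter({'nnnnn': 1, 'nnnnn-nnnn': 1, 'ana nan': 1, 'nnnn': 1})

A report like this quickly surfaces malformed values, such as the four-digit ZIP code in the last sample.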
Word investigation
Word investigation examines multiple-domain (free-format) fields and is probably the most important investigation method for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected “rule set”; the other is to report how those words are classified, again in terms of the currently selected “rule set” (both tasks are illustrated in the sketches below). Word investigation does not overlap with Information Analyzer, the data profiling tool.
Rule sets
A rule set includes a set of tables that list the “known” words or tokens. For example, the GBNAME rule set contains a list of names that are known to be first names in Great Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs, and so on, which can not only be recognized as name prefixes (titles, if you prefer) but can in some cases reveal additional information, such as gender.
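Conceptually, a classification table behaves like a lookup from token to a class code and a standard form. The toy table below is an invented stand-in for a GBNAME-style table, not its real file format; the class codes mirror those used in the pattern example that follows.

    # Toy stand-in for a classification table: token -> (class code, standard form).
    # Class codes: F = first name, P = prefix, G = generation. Illustrative only.
    CLASSIFICATIONS = {
        "MARGARET": ("F", "MARGARET"),
        "CHARLES":  ("F", "CHARLES"),
        "BILL":     ("F", "WILLIAM"),   # standard form differs from the raw token
        "MR":       ("P", "MR"),
        "MRS":      ("P", "MRS"),       # a prefix that also implies gender
        "III":      ("G", "III"),
        "SENIOR":   ("G", "SR"),
    }

    cls, standard = CLASSIFICATIONS["BILL"]
    print(cls, standard)   # F WILLIAM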
When a word investigation reports on classification, it does so by producing a pattern. This shows how each known word in the data record is classified, and the order in which each occurs. For example, under the USNAME rule set the name WILLIAM F. GAINES III would report the pattern FI?G: the F indicates that “William” is a known first name, the I indicates that “F” is an initial, the ? indicates that “Gaines” is not a known word in context, and the G indicates that “III” is a “generation”, as would be “Senior”, “IV”, and “fils”. Punctuation may be included or ignored.
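Deriving the pattern can be pictured as a token-by-token classification: look each token up in the table, fall back to I for a single letter (an initial) and ? for anything unknown. The sketch below, with its invented table, is a drastic simplification of what a real rule set does.

    CLASSIFICATIONS = {"WILLIAM": "F", "III": "G", "SENIOR": "G", "IV": "G"}

    def pattern(name):
        """Build a classification pattern, one code per token."""
        out = []
        for token in name.upper().replace(".", "").split():
            if token in CLASSIFICATIONS:
                out.append(CLASSIFICATIONS[token])   # class code from the table
            elif len(token) == 1:
                out.append("I")                      # a single letter: an initial
            else:
                out.append("?")                      # unknown to the rule set
        return "".join(out)

    print(pattern("WILLIAM F. GAINES III"))   # FI?G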
Rule sets also come into play when performing standardization (discussed below). Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, “William” might be recorded as the standard form of “Bill”), and they may contain an uncertainty threshold (for example, “Felliciity” might still be recognized as “Felicity” even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
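The uncertainty threshold can be pictured as a similarity cut-off: a token close enough to a classified word is treated as that word. Here is a sketch using the Python standard library's SequenceMatcher; QualityStage's actual scoring algorithm differs, and the 0.8 threshold is an arbitrary illustration.

    from difflib import SequenceMatcher

    KNOWN_FIRST_NAMES = {"FELICITY", "WILLIAM", "MARGARET"}

    def classify_fuzzy(token, threshold=0.8):
        """Return the best-matching known word if its similarity clears the threshold."""
        best = max(KNOWN_FIRST_NAMES,
                   key=lambda known: SequenceMatcher(None, token, known).ratio())
        if SequenceMatcher(None, token, best).ratio() >= threshold:
            return best
        return None

    print(classify_fuzzy("FELLICIITY"))   # FELICITY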
Investigation might also be performed to review the results of standardization,
particularly to see whether there are any unhandled patterns or text that could be
better handled if the rule set itself were tweaked, either with improved classification
tables or through a mechanism called rule set overrides.