Data Quality Challenges

- Different or inconsistent standards in structure, format, or values
- Missing data, default values
- Spelling errors, data in wrong fields
- Buried information
- Data myopia
- Data anomalies

Quality Stage

Quality Stage is a tool intended to deliver high quality data required for success in a
range of enterprise initiatives including business intelligence, legacy consolidation and
master data management. It does this primarily by identifying components of data that
may be in columns or free format, standardizing the values and formats of those data,
using the standardized results and other generated values to determine likely duplicate
records, and building a “best of breed” record out of these sets of potential duplicates.
Through its intuitive user interface, Quality Stage substantially reduces the time and cost of implementing Customer Relationship Management (CRM), data warehouse/business intelligence (BI), data governance, and other strategic IT initiatives, and maximizes their return on investment by ensuring data quality.

With Quality Stage it is possible, for example, to construct consolidated customer and household views, enabling more effective cross-selling, up-selling, and customer retention, and to improve customer support and service, for example by identifying a company's most profitable customers. The cleansed data provided by Quality Stage allows creation of business intelligence on individuals and organizations for research, fraud detection, and planning.

Out of the box, Quality Stage provides for cleansing of name and address data and some related types of data, such as email addresses, tax IDs, and so on. However, Quality Stage is fully customizable and can be set up to cleanse any kind of classifiable data, such as infrastructure, inventory, or health data.

Quality Stage Heritage

The product now called Quality Stage has its origins in a product called INTEGRITY from a company called Vality. Vality was acquired by Ascential Software in 2003, and the product was renamed Quality Stage. This first version of Quality Stage reflected its heritage (for example, it had only batch-mode operation) and, indeed, its mainframe antecedents (for example, file name components limited to eight characters).
Ascential did not do much with the inner workings of Quality Stage, which was, after all, already a mature product. Ascential's emphasis was on providing two new modes of operation for Quality Stage. One was a "plug-in" for Data Stage that allowed data cleansing/standardization to be performed (by Quality Stage jobs) as part of an ETL data flow. The other was to enable Quality Stage to use the parallel execution technology (Orchestrate) that Ascential had acquired with Torrent Systems in 2001.
IBM acquired Ascential Software at the end of 2005. Since then the main direction has
been to put together a suite of products that share metadata transparently and share a
common set of services for such things as security, metadata delivery, reporting, and so
on. In the particular case of Quality Stage, it now shares a common Designer client with
Data Stage: from version 8.0 onwards Quality Stage jobs run as, or as part of, Data Stage
jobs, at least in the parallel execution environment.

QualityStage Functionality

QualityStage performs four tasks: investigation, standardization, matching, and survivorship. We will look at each of these in turn. Under the covers, QualityStage incorporates a set of probabilistic matching algorithms that can find potential duplicates in data despite variations in spelling, numeric or date values, use of non-standard forms, and various other obstacles that defeat deterministic methods. For example, if you have what appears to be the same employee record where the name is the same but the date of hire differs by a day or two, a deterministic algorithm would report two different employees, whereas a probabilistic algorithm would flag the potential duplicate.

(Deterministic means "absolute" in this sense: either something is equal or it is not. Probabilistic leaves room for some degree of uncertainty: a value may be close enough to be considered equal. Needless to say, the degree of uncertainty used within QualityStage is configurable by the designer.)

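To make the distinction concrete, here is a minimal sketch in Python of a deterministic comparison versus a probabilistic one. It uses standard-library string similarity rather than QualityStage's actual matching algorithms, and the field names and the 0.85 threshold are assumptions made up for the illustration.

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    # Absolute comparison: every field must be exactly equal.
    return a["name"] == b["name"] and a["hire_date"] == b["hire_date"]

def probabilistic_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    # Score each field on similarity (0.0 to 1.0) and accept "close enough".
    name_score = SequenceMatcher(None, a["name"], b["name"]).ratio()
    date_score = SequenceMatcher(None, a["hire_date"], b["hire_date"]).ratio()
    return (name_score + date_score) / 2 >= threshold

rec1 = {"name": "WILLIAM GAINES", "hire_date": "2001-03-15"}
rec2 = {"name": "WILLIAM GAINES", "hire_date": "2001-03-16"}  # hire date off by one day

print(deterministic_match(rec1, rec2))   # False - treated as two different employees
print(probabilistic_match(rec1, rec2))   # True  - flagged as a potential duplicate
```
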
Investigation

By investigation we mean inspection of the data to reveal certain types of information about those data. There is some overlap between Quality Stage investigation and the kinds of profiling results that are available using Information Analyzer, but not so much overlap as to suggest the removal of functionality from either tool. Quality Stage can undertake three different kinds of investigation.

Features

- Data investigation is done using the Investigate stage.
- This stage analyzes each record, field by field, for its content and structure.
- Free-form fields are broken up into individual tokens and then analyzed.
- It provides frequency distributions of distinct values and patterns.
- Each investigation phase produces pattern reports, word frequency reports, and word classification reports. The reports are located in the data directory of the server.

Investigate methods
Character Investigation

Single-domain fields

- Entity identifiers, e.g. ZIP codes, SSNs, Canadian postal codes
- Entity clarifiers, e.g. name prefix, gender, and marital status

Multiple-domain fields

- Large free-form fields such as multiple address fields

Character discrete investigation looks at the characters in a single field (domain) to report what values or patterns exist in that field. For example, a field might be expected to contain only the codes A through E. A character discrete investigation of that field will report the number of occurrences of every value in the field (and therefore any out-of-range values, empty or null values, and so on). "Pattern" in this context means whether each character is alphabetic, numeric, blank or something else. This is useful in planning cleansing rules; for example, a telephone number may be represented with or without delimiters and with or without parentheses surrounding the area code, all in the one field. To come up with a standard format, you need to be aware of what formats actually exist in the data. The result of a character discrete investigation (which can also examine just part of a field, for example the first three characters) is a frequency distribution of values or patterns – the developer determines which.
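
The idea of a pattern frequency distribution can be sketched outside the tool. The following Python snippet (an illustration only, not QualityStage itself) maps each character to a class code – alphabetic to "a", numeric to "n", blank to "b", with punctuation kept as-is – and counts how often each resulting pattern occurs in a telephone-number field; the class codes and sample values are assumptions for the example.

```python
from collections import Counter

def char_pattern(value: str) -> str:
    # Classify each character: alphabetic -> 'a', numeric -> 'n', blank -> 'b',
    # anything else (punctuation, delimiters) is kept as-is.
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("n")
        elif ch == " ":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

phones = ["(303) 555-1234", "303-555-1234", "3035551234", "(303)555-1234"]

# Frequency distribution of patterns, like a character discrete investigation report.
for pattern, count in Counter(char_pattern(p) for p in phones).items():
    print(f"{count:3d}  {pattern}")
```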

Character concatenate investigation is exactly the same as character discrete investigation except that the contents of more than one field can be examined as if they were in a single field – the fields are, in some sense, concatenated prior to the investigation taking place. The results of a character concatenate investigation can be useful in revealing whether particular sets of patterns or values occur together.
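
Continuing the sketch above, a concatenate-style investigation simply joins the field contents before deriving the pattern; the fields and sample data here are hypothetical.

```python
# Reuses char_pattern and Counter from the previous sketch:
# examine city, state, and ZIP together as one concatenated value.
records = [("Denver", "CO", "80202"), ("Boulder", "CO", "80301-1234")]

patterns = Counter(char_pattern(" ".join(fields)) for fields in records)
for pattern, count in patterns.items():
    print(f"{count:3d}  {pattern}")
```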

Word investigation is probably the most important of the three for the entire QualityStage suite, performing a free-format analysis of the data records. It performs two different kinds of task: one is to report which words/tokens are already known, in terms of the currently selected "rule set"; the other is to report how those words are classified, again in terms of the currently selected rule set. Word investigation has no overlap with Information Analyzer (the data profiling tool).

Rule Set

A rule set includes a set of tables that list the “known” words or tokens. For example,
the GBNAME rule set contains a list of names that are known to be first names in Great
Britain, such as Margaret, Charles, John, Elizabeth, and so on. Another table in the
GBNAME rule set contains a list of name prefixes, such as Mr, Ms, Mrs and so on, that
can not only be recognized as name prefixes (titles, if you prefer) but can in some cases
reveal additional information, such as gender.
When a word investigation reports about classification, it does so by producing a
pattern. This shows how each known word in the data record is classified, and the order
in which each occurs. For example, under the USNAME rule set the name WILLIAM F.
GAINES III would report the pattern FI?G – the F indicates that “William” is a known first
name, the I indicates the “F” is an initial, the ? indicates that “Gaines” is not a known
word in context, and the G indicates that “III” is a “generation” – as would be “Senior”,
“IV” and “fils”. Punctuation may be included or ignored.
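
As a rough sketch of how such a pattern might be derived, the snippet below classifies each token of a name against a tiny, hypothetical classification table (a much-reduced stand-in for a rule set such as USNAME) and emits the class codes in order. The table contents and the single-letter-means-initial rule are assumptions for the example.

```python
# Hypothetical, much-reduced classification table: token -> class code.
CLASSIFICATION = {
    "WILLIAM": "F",   # known first name
    "SENIOR": "G",    # generation words
    "III": "G",
    "IV": "G",
}

def classify(token: str) -> str:
    token = token.rstrip(".")           # "F." is treated as "F"
    if token in CLASSIFICATION:
        return CLASSIFICATION[token]
    if len(token) == 1 and token.isalpha():
        return "I"                      # a single letter is taken as an initial
    return "?"                          # not a known word in this context

def name_pattern(name: str) -> str:
    return "".join(classify(tok) for tok in name.upper().split())

print(name_pattern("WILLIAM F. GAINES III"))   # FI?G
```
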
Rule sets also come into play when performing standardization (discussed below).
Classification tables contain not only the words/tokens that are known and classified, but also the standard form of each (for example, "William" might be recorded as the standard form for "Bill"), and they may contain an uncertainty threshold (for example, "Felliciity" might still be recognizable as "Felicity" even though it is misspelled in the original data record). Probabilistic matching is one of the significant strengths of QualityStage.
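
The combination of standard forms and an uncertainty threshold can be sketched as follows, again with plain Python string similarity rather than QualityStage's own comparison logic; the table entries and thresholds are invented for the illustration.

```python
from difflib import SequenceMatcher

# Hypothetical classification entries: standard form plus a per-entry threshold.
STANDARD_FORMS = {
    "BILL": ("WILLIAM", 1.0),       # exact synonym, no fuzziness needed
    "FELICITY": ("FELICITY", 0.8),  # allow misspellings within 80% similarity
}

def standardize(token: str):
    token = token.upper()
    best, best_score = None, 0.0
    for known, (standard, threshold) in STANDARD_FORMS.items():
        score = SequenceMatcher(None, token, known).ratio()
        if score >= threshold and score > best_score:
            best, best_score = standard, score
    return best

print(standardize("Bill"))        # WILLIAM
print(standardize("Felliciity"))  # FELICITY - close enough despite the misspelling
print(standardize("Zorro"))       # None - no classification within threshold
```
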
Investigation might also be performed to review the results of standardization,
particularly to see whether there are any unhandled patterns or text that could be
better handled if the rule set itself were tweaked, either with improved classification
tables or through a mechanism called rule set overrides.
