Data Quality Management Methods and Tools
A Seminar Report
Bachelor of Technology
in
COMPUTER ENGINEERING
by
Tage Nobin (20070653)
April, 2010
Certificate
The seminar report entitled Data Quality Management: Methods and Tools submitted by Mr. Tage Nobin (20070653) is approved for the partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Engineering.
External Examiner(s)
1. (Name: )
2. (Name: )
Date: 15/05/2010
Acknowledgements
This seminar report is the result of the intense effort of many people, whom I must thank for making it a reality. I thus express my deep regards to all those who have offered their assistance and suggestions.
I am grateful to my seminar guide Prof. L.D.Netak, for making this work possible.
No word of thanks is enough for his mentorship, guidance, support and patience. I must
acknowledge the freedom he gave me in pursuing topics that I found interesting. His
resourcefulness, influence and keen scientific intuition were also vital to the progress of
this work and for these I am deeply thankful.
Finally, I would like to thank all whose direct and indirect support helped me in com-
pleting the seminar in time.
Abstract
Defective data is one of the most serious problems in the data world. Business success is becoming ever more dependent on the accuracy and integrity of mission-critical data resources. As data volume increases, the question of internal consistency within data becomes paramount, regardless of fitness for use for any external purpose. Different methods and tools are used to maintain data quality according to the conditions and the situation.
This report describes major data quality problems, requirements and common strategies for managing data quality in data-related systems. It also explains the importance of data quality management, with particular attention to the management issues of data quality and the various methods and tools that can be used and implemented comprehensively in the management of data quality.
Contents
1 Introduction
7 Basic Tools of Data Quality
Bibliography

List of Figures
3.1 Data Quality Attributes
4.1 Data Flow
5.1 Radial Cycle of Data Quality Process
6.1 Data Quality Methods
8.1 Monitor System
Chapter 1
Introduction
SIX HUNDRED BILLION DOLLARS ANNUALLY! That is what poor data quality costs American businesses, according to the Data Warehousing Institute. What about the whole world? Ensuring a high level of data quality is one of the most expensive and time-consuming tasks in data warehousing projects. Data quality management is the field in which all kinds of data, raw or processed, are managed. It is one of the major fields yet to be taken up in a serious way.
A data quality management effort typically proceeds through the following phases.
1. Data Quality Assessment
This is a key first step, as understanding the up-front level of data quality forms the foundation of the domain rules and processes that will be put in place. Without an upfront assessment, the ability to effectively implement a data quality strategy is negatively impacted. From an ongoing perspective, assessment also allows an organization to see how far the data quality procedures put in place have improved the quality of the data.
2. Data Quality Definition (Rules and Targets)
Once the initial data quality assessment is complete, the second part of the process involves defining what exactly we mean by data quality. This phase involves describing the attributes of quality data, which makes the whole process much easier, and defining the rules and targets against which the data will be checked. Trend analyses are then performed on the data and the rules in place to ensure that the data rules are adhered to and that the target is kept in focus.
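As a small illustration of such rules and targets (not taken from the report; the table, column names and thresholds are assumptions made for the example), the following Python sketch declares two data quality rules together with their target pass rates and evaluates them against a small pandas table.

from dataclasses import dataclass
from typing import Callable, Dict, List

import pandas as pd


@dataclass
class QualityRule:
    name: str                                # human-readable rule name
    check: Callable[[pd.DataFrame], float]   # returns the fraction of rows that pass
    target: float                            # minimum acceptable pass rate


# Hypothetical rules for an illustrative customer table.
RULES: List[QualityRule] = [
    QualityRule("email present",
                lambda df: float(df["email"].notna().mean()), target=0.98),
    QualityRule("age in valid range",
                lambda df: float(df["age"].between(0, 120).mean()), target=0.99),
]


def evaluate(df: pd.DataFrame, rules: List[QualityRule]) -> Dict[str, dict]:
    """Score every rule and flag the ones that miss their target."""
    results = {}
    for rule in rules:
        rate = rule.check(df)
        results[rule.name] = {"pass_rate": rate, "meets_target": rate >= rule.target}
    return results


customers = pd.DataFrame({"email": ["a@x.com", None, "c@x.com"], "age": [34, 29, 140]})
print(evaluate(customers, RULES))

The scorecards and trend analyses mentioned above would then track these pass rates over time.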
New rules or procedures can be put in place if needed. Conversely, the scorecards and trend-analysis results can also confirm that data quality is being effectively addressed within the organization.
Chapter 2
First of all we will deal with what exactly 'Data', 'Quality' and 'Management' are. The Oxford dictionary defines:
• Quality - "the standard of how good something is, measured against other similar things". In the context of data, quality may be defined as "fitness for use".
In a broad spectrum, data quality management entails the establishment and deployment of roles, responsibilities, policies, and procedures concerning the acquisition, maintenance, dissemination, and disposition of data. In practice, the precise definition depends upon the organization. For successful data quality management, the solution must include techniques, processes, methods and tools. The data quality management life-cycle must be clearly defined using continuous as well as iterative frameworks.
Data quality must be designed into systems using proven engineering principles. Data quality is too often left to chance or given only superficial attention in the design of information systems. While good engineering principles are sometimes applied to software development, data quality is usually left up to the end user. Applying engineering principles to data quality involves understanding the factors that affect the creation and maintenance of quality data. It is helpful to look at data as the output of a data manufacturing process.
Chapter 3
Data quality is evidenced by valid and reliable data; therefore, planning in the early stages around a clear concept of the need for and definition of quality is well worth the investment of time and resources. Data in a database has no actual value (or even quality); it only has potential value. Data has realized value only when someone uses it to do something useful. As mentioned earlier, there is no fixed global definition of high-quality data. Data quality does not restrict itself to a particular concept; rather, it differs as the domain and application change. Whatever the domain, there are certain parameters which must be satisfied in order to make data truly quality data. Contrary to popular belief, quality is not necessarily zero defects. Quality is conformance to valid requirements.
1) Utility - refers to the usefulness of the information for its intended users.
These attributes may be further defined in terms of seven dimensions of data quality:
Relevance - refers to the degree to which our data products provide information that meets the customer's needs.
Accuracy - refers to the difference between an estimate of a parameter and its true value.
Timeliness - refers to the length of time between the reference periods of the information and when we deliver the data product to customers.
Accessibility - refers to the ease with which customers can identify, obtain, and use the information in data products.
Completeness - refers to the degree to which values are present in the attributes that require them.
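As a small illustration (not part of the original report; the table and column names are assumptions), the sketch below shows how two of these dimensions, completeness and timeliness, could be quantified for a tabular dataset in Python.

from datetime import date

import pandas as pd


def completeness(df: pd.DataFrame, required_columns) -> float:
    """Fraction of required cells that are actually populated."""
    cells = df[list(required_columns)]
    return float(cells.notna().to_numpy().mean())


def timeliness_days(delivery: date, reference_period_end: date) -> int:
    """Days between the end of the reference period and delivery of the product."""
    return (delivery - reference_period_end).days


df = pd.DataFrame({"id": [1, 2, 3], "email": ["a@x.com", None, "c@x.com"]})
print(completeness(df, ["id", "email"]))                      # 5 of 6 required cells -> ~0.83
print(timeliness_days(date(2010, 4, 30), date(2010, 3, 31)))  # 30 days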
Figure 3.1: Data Quality Attributes
Chapter 4
Data is impacted by numerous processes, most of which affect its quality to some degree. It is imperative that the issue of data quality be addressed if the data warehouse is to prove beneficial to an organization. The information in the data warehouse can be used by an organization to generate a greater understanding of its customers, processes, and the organization itself. The potential to increase the usefulness of data by combining it with other data sources is great. But if the underlying data is not accurate, any relationships found in the data warehouse will be misleading. As Wyatt Earp reputedly said, "Fast is fine, but accuracy is everything."
Resolving data quality problems is often the biggest effort in a data mining study; an estimated 50-80 percent of the time in data mining projects is spent on data quality work. Just because data is in the computer does not mean it is right.
Figure 4.1: Data Flow
Data and information are not static; they flow between the data collection and usage processes. The main problem is that quality problems can and do arise at all of these stages, which creates the need for end-to-end continuous monitoring. This can indeed become a herculean task. The factors that matter in this flow include:
1. Data Control
2. Data Age
3. Data Types
4. Device Availability
5. Data Structure
6. Read/Write Management
7. Communication Timing
These factors matter greatly in the development of quality data. An error in even 1 percent of the data may impact findings and results.
There are numerous problems which arise in the wake of poor data quality. Some of them are small in nature and some of them big; whatever their nature, ignoring any such matter may prove to be a costly affair.
Some of them are listed below:
- Much of the raw data is of poor quality, because of incorrect data gathering and data operations. This leads to inaccurate assessment of the data.
- As a result, the data is costly to diagnose and assess.
- Many of the costs involved are hidden and hard to quantify, which makes the assessment a tough task.
- Data is inconsistent between different systems. Since data flows between different systems, any obstacle to a smooth transition may lead to a total data failure.
- Most of the attributes of quality data are extremely difficult, sometimes impossible, to measure.
- The attributes are vague in nature, and the conventional definitions provide no guidance towards practical improvement of the data.
- The priority given to metadata is undermined, even though setting standards for metadata is very important.
- There are various other systematic errors which can be attributed to a lack of resources and skills.
Chapter 5
The quality of any statistics disseminated by an agency depends on two aspects: the quality of the statistics received, and the quality of the internal processes for the collection, processing, analysis and dissemination of data and metadata.
Figure 5.1: Radial Cycle of Data Quality Process
Herein we develop a basic strategy by combining the above steps: preliminary problem definition, followed by analysis, improvement and monitoring steps for each problem.
Chapter 6
Designing the quality management process is clearly not the end of the data quality
effort. Just identifying issues does nothing to improve things. The issues need to drive
changes that will improve the quality of the data for the eventual users. We see process
improvement fundamentally as a way of solving problems. If there is not an apparent or
latent problem, process improvement is not needed. If there is any problem, however intangible, one or more processes need to be improved to deal with it. Once
you sense a problem, good problem solving technique involves alternating between the
levels of thought and experience.
6.1 Methods
1. Profiling
2. Cleansing
3. Data Integration/Consolidation
4. Data Augmentation
Figure 6.1: Data Quality Methods
6.1.1 Data Profiling
Data profiling can be defined as the use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality. It is a process of developing information about data rather than information from data. Data profiling is typically used to (a small profiling sketch follows this list):
1. Find out whether existing data can easily be used for other purposes.
2. Improve the ability to search the data by tagging it with keywords and descriptions, or assigning it to a category.
3. Give metrics on data quality, including whether the data conforms to particular standards or patterns.
4. Assess whether metadata accurately describes the actual values in the source database.
5. Develop a master data management process for data governance, improving data quality.
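The following short Python sketch (an illustration only; the customer table and its columns are invented) profiles each column of a dataset, developing information about the data, such as inferred types, null rates and distinct counts, in the spirit of the purposes listed above.

import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize content, structure and quality indicators column by column."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),                 # structural information
            "null_rate": float(s.isna().mean()),   # completeness indicator
            "distinct": int(s.nunique()),          # cardinality
            "example": s.dropna().iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)


customers = pd.DataFrame({"id": [1, 2, 2], "zip": ["02139", None, "4154x"]})
print(profile(customers))   # shows dtypes, null rates and distinct counts for each column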
6.1.2 Data Cleansing
Data cleansing is the act of detecting, removing and/or correcting defective data. Ideally, defective data is removed from the system at entry, and cleansing is performed at entry time rather than on batches of data.
The data cleansing process comprises the following stages (a compressed sketch of the cycle follows the list):
• Data Auditing: The data is audited with the use of statistical methods to detect
anomalies and contradictions. This eventually gives an indication of the character-
istics of the anomalies and their locations.
• Workflow execution: In this stage, the workflow is executed after its specification is
complete and its correctness is verified. The implementation of the workflow should
be efficient even on large sets of data which inevitably poses a trade-off because the
execution of a data cleansing operation can be computationally expensive.
• Post-Processing and Controlling: After executing the cleansing workflow, the re-
sults are inspected to verify correctness. Data that could not be corrected during
execution of the workflow are manually corrected if possible. The result is a new
cycle in the data cleansing process where the data is audited again to allow the
specification of an additional workflow to further cleanse the data by automatic
processing.
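A compressed Python sketch of this audit / execute / post-process cycle is given below; it is only an illustration, and the single rule used (prices must not be negative) and the correction applied are assumptions made for the example.

import pandas as pd


def audit(df: pd.DataFrame) -> list:
    """Data auditing: detect anomalous rows (here, negative prices)."""
    return df.index[df["price"] < 0].tolist()


def execute_workflow(df: pd.DataFrame, anomalies: list) -> pd.DataFrame:
    """Workflow execution: apply the agreed correction (clip offending prices to zero)."""
    cleaned = df.copy()
    cleaned.loc[anomalies, "price"] = 0.0
    return cleaned


def post_process(df: pd.DataFrame) -> bool:
    """Post-processing and controlling: re-audit to confirm the fix."""
    return len(audit(df)) == 0


orders = pd.DataFrame({"price": [10.0, -3.0, 25.0]})
anomalies = audit(orders)
orders = execute_workflow(orders, anomalies)
print(post_process(orders))   # True; a new cleansing cycle would start if this were False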
6.2 Tools
It is commonly accepted that data quality tools can be grouped according to the part of a data quality process they cover. Data profiling and analysis assist in detecting data problems. Data transformation, data cleaning, duplicate elimination and data enhancement aim to solve discovered or previously known data quality problems.
Data quality tools generally fall into one of three categories:
1. Auditing
2. Cleansing
3. Migration
Auditing tools typically check data against defined business rules. Lexical analysis may be used to discover the business sense of words within the data; data that does not adhere to the business rules can then be modified as necessary.
Data Analysis:
Data analysis encompasses the statistical evaluation and logical study of data values, together with the application of data mining algorithms, in order to define data patterns and rules and to ensure that data does not violate the application domain constraints. The commercial and research tools that provide data analysis techniques are as follows:
Commercial: dfPower, ETLQ, Migration Architect, Trillium, WizWhy
Data Profiling:
Data profiling is the process of analyzing data sources with respect to the data quality domain, in order to identify and prioritize data quality problems. Data profiling reports on the completeness of datasets and data records, organizes data problems by importance, outputs the distribution of data quality problems in a dataset, and lists missing values in existing records. Identifying data quality problems before starting a data cleaning project is crucial to ensure the delivery of accurate information. The following commercial and research tools implement data profiling techniques:
Commercial: dfPower, ETLQ, Migration Architect, Trillium, WizWhy
Data cleansing tools typically perform the following steps (a short parsing and standardization sketch follows this list):
• Data parsing (elementizing) - breaks a record into atomic units that can be used in subsequent steps. Parsing includes placing elements of a record into the correct fields.
• Data standardization - converts the data elements to forms that are standard
throughout the data warehouse.
• Record matching- determines whether two records represent data on the same sub-
ject.
• Data transformation- ensures consistent mapping between source systems and data
warehouse.
• House-holding - combining individual records that have the same address.
• Documenting - documenting the results of the data cleansing steps in the meta
data.
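The sketch below (illustrative only; the 'name, phone' record layout and the target phone format are assumptions chosen for the example) shows the first two of these steps in Python: parsing a free-form record into atomic fields and then standardizing one of them.

import re


def parse_record(raw: str) -> dict:
    """Data parsing: break a 'name, phone' string into separate fields."""
    name, phone = [part.strip() for part in raw.split(",", 1)]
    return {"name": name, "phone": phone}


def standardize_phone(phone: str) -> str:
    """Data standardization: keep digits only and render them as XXX-XXX-XXXX."""
    digits = re.sub(r"\D", "", phone)
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:10]}" if len(digits) >= 10 else digits


record = parse_record("Jane Doe , (555) 123 4567")
record["phone"] = standardize_phone(record["phone"])
print(record)   # {'name': 'Jane Doe', 'phone': '555-123-4567'}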
Data cleaning:
The act of detecting, removing and/or correcting dirty data. Data cleaning aims not only to clean up the data but also to bring consistency to different sets of data that have been merged from separate databases. Sophisticated software applications are available to clean data using specific functions, rules and look-up tables. In the past, this task was done manually and was therefore subject to human error. The following commercial and research tools implement data cleaning techniques:
Commercial: DataBlade, dfPower, ETLQ, ETI*DataCleanser, Firstlogic, NaDIS, QuickAddress Batch, Sagent, Trillium, WizRule
Research: Ajax, Arktos, FraQL
Duplicate elimination:
The process that identifies duplicate records (referring to the same real entity) and merges
them into a single record. Duplicate elimination processes are costly and very time
consuming.
They usually require several steps. The commercial and research tools that provide duplicate elimination techniques are presented below (a naive matching sketch follows the list):
Research: Ajax, Flamingo Project, FraQL
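As a naive illustration of the matching-and-merging idea (real duplicate elimination tools use far more sophisticated, often fuzzy, matching), the Python sketch below reduces each record to a normalized key and merges records that share a key; the fields used to build the key are assumptions.

def normalize(record: dict) -> tuple:
    """Build a matching key from the name and postal code fields."""
    return (record["name"].strip().lower(), record["zip"].strip())


def deduplicate(records: list) -> list:
    """Merge records that share the same normalized key into a single record."""
    merged = {}
    for rec in records:
        key = normalize(rec)
        if key not in merged:
            merged[key] = dict(rec)           # keep the first record seen
        else:
            for field, value in rec.items():  # fill in fields missing so far
                merged[key].setdefault(field, value)
    return list(merged.values())


people = [
    {"name": "Ann Lee ", "zip": "560001"},
    {"name": "ann lee", "zip": "560001", "phone": "99999"},
]
print(deduplicate(people))   # one merged record, with the phone number retained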
Data enrichment:
The process of using additional information from internal or external data sources to improve the quality of input data that was incomplete, unspecific or outdated. Postal address enrichment, geocoding and demographic data additions are typical data enrichment procedures. The commercial and research data enrichment tools are listed below:
Commercial: DataStage, dfPower, ETLQ, Firstlogic, NaDIS, QuickAddress Batch, Sagent, Trillium
Research: Ajax
Data transformation:
The set of operations (schema/data translation and integration, filtering and aggregation) that source data must undergo to appropriately fit a target schema. Data transformations require metadata, such as data schemas, instance-level data characteristics, and data mappings.
The set of commercial and research tools that can be classified as data transformation tools is the following (a small schema-mapping sketch follows the list):
Research: Ajax, Arktos, Clio, FraQL, Potter's Wheel, TranScm
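A small Python sketch of such a transformation is given below; the source and target column names, the filter and the aggregation are invented for the example, and the COLUMN_MAP dictionary plays the role of the mapping metadata mentioned above.

import pandas as pd

# Source-to-target column mapping: the metadata that drives the transformation.
COLUMN_MAP = {"cust_nm": "customer_name", "amt": "amount"}


def transform(source: pd.DataFrame) -> pd.DataFrame:
    """Translate, filter and aggregate source data so that it fits the target schema."""
    target = source.rename(columns=COLUMN_MAP)                               # schema translation
    target = target[target["amount"] > 0]                                    # filtering
    return target.groupby("customer_name", as_index=False)["amount"].sum()   # aggregation


sales = pd.DataFrame({"cust_nm": ["A", "A", "B"], "amt": [5, 7, -1]})
print(transform(sales))   # customer A totals 12; B's negative row is filtered out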
Chapter 7
Basic Tools of Data Quality
1. Fishbone Diagram
Fishbone diagrams are diagrams that show the causes of a certain event. Common
uses of the Fishbone diagram are product design and quality defect prevention, to
identify potential factors causing an overall effect. Each cause or reason for imper-
fection is a source of variation. Causes are usually grouped into major categories
to identify these sources of variation.
2. Flow Chart
A flowchart identifies the sequence of activities or the flow of materials and in-
formation in a process. There is no precise format, and the diagram can be drawn
simply with boxes, lines, and arrows. Flowcharts help the people involved in the
process understand it much better and more objectively by providing a picture of
the steps needed to accomplish a task.
3. Histogram
Histograms provide clues about the characteristics of the parent population from which a sample is taken. Patterns that would be difficult to see in an ordinary table of numbers become apparent. A bar chart is a series of bars representing frequency, e.g. the number of yes/no responses.
• Displays large amounts of data that are difficult to interpret in tabular form.
4. Scatter diagram
• Supplies the data to confirm a hypothesis that two variables are related.
• Provides both a visual and statistical means to test the strength of a rela-
tionship.
5. Run Chart
Run charts show the performance and the variation of a process or some quality or
productivity indicator over time in a graphical fashion that is easy to understand
and interpret. They also identify process changes and trends over time and show
the effects of corrective actions.
• Monitors performance of one or more processes over time to detect trends,
shifts, or cycles.
6. Control Chart (a simplified control-limit calculation is sketched after this list)
7. Process chart
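To make the control chart (tool 6 above) concrete, the sketch below computes the centre line and control limits for a simplified individuals chart. The defect counts are invented, and sigma is estimated from the average moving range (the common I-MR convention, with d2 = 1.128 for ranges of two) rather than anything prescribed by this report.

import statistics

daily_defects = [4, 6, 5, 7, 5, 4, 6, 18, 5, 6]   # made-up daily defect counts

centre = statistics.mean(daily_defects)
moving_ranges = [abs(b - a) for a, b in zip(daily_defects, daily_defects[1:])]
sigma = statistics.mean(moving_ranges) / 1.128     # sigma estimated from moving ranges

ucl = centre + 3 * sigma                           # upper control limit
lcl = max(centre - 3 * sigma, 0)                   # defect counts cannot fall below zero

signals = [(day, x) for day, x in enumerate(daily_defects, start=1)
           if x > ucl or x < lcl]
print(f"centre={centre:.1f}  UCL={ucl:.1f}  LCL={lcl:.1f}  signals={signals}")
# Day 8 (18 defects) falls above the upper control limit and would be investigated.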
Chapter 8
Monitoring data quality is an important sub-aspect of the Data Quality Life Cycle.
It is based on the specified goals and rules and therefore on the current quality level
obtained after the initial analysis carried out on the basis of data profiling and the initial
cleansing of your data. Monitoring is not an end in itself but more or less serves as a sensor
for data quality weaknesses, before they make themselves felt in the destination system.
The monitoring function is oriented towards the defined data quality initiatives and general instructions, as well as any changes that may be required.
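A minimal sketch of such a monitoring sensor is given below. It assumes that rules and targets like those defined in the earlier phases are re-evaluated on every incoming batch before it reaches the destination system; the rule names, thresholds and column names are illustrative assumptions.

import pandas as pd

# Rule name -> (check returning the pass rate, target pass rate). Illustrative only.
RULES = {
    "email present": (lambda df: float(df["email"].notna().mean()), 0.98),
    "amount >= 0":   (lambda df: float((df["amount"] >= 0).mean()), 1.00),
}


def monitor(batch: pd.DataFrame) -> list:
    """Return the names of rules whose pass rate falls below target for this batch."""
    alerts = []
    for name, (check, target) in RULES.items():
        if check(batch) < target:
            alerts.append(name)
    return alerts


batch = pd.DataFrame({"email": ["a@x.com", None], "amount": [10, -2]})
print(monitor(batch))   # both rules breached -> the load would be flagged or blocked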
Figure 8.1: Monitor System
Conclusion
This report on data quality management describes the basic tasks of management in the area of techniques, tools and the improvement of data quality. Organizations
seeking relief from data problems often turn to technology for help. This is not the most
effective solution. Data quality is a behavioral problem, not a technology problem. To
solve the data quality problem organizations need to change user behavior. A comprehen-
sive program based on prevention, detection, correction and accountability is required.
Deploying a data quality management program is not an easy task, but the rewards are
enormous. Deploying a disciplined approach to managing data as an important asset will
better position an organization to improve the productivity of its information workers
and to better serve its customers.
Strong frameworks and processes are imperative for controlling data quality and for managing data, the most important asset of an organization. Additional validation procedures
such as exception analysis and data reconciliation ensure high success rates in migration-
related initiatives. The challenges associated with data quality control initiatives can be
effectively handled by implementing the recommended framework and process to control
data quality. Maintaining data quality is no longer an option, particularly in today’s
competitive and regulatory climate. With this in place, the six-phase program can be
effectively pursued for the management of data quality.
Bibliography
[5] http://www.sap.com/management/data_quality_management/index.epx
[6] http://en.wikipedia.org/wiki/data_quality.html
[7] http://www.tricare.mil/imp/GuidelinesOnDataQualityManagement.html