IBM - Data Quality Architecture Options and Approach
December 3, 2013
Agenda
DQ Solution
DQ Architecture Options
Standardization process
Benefits
Why is DQ important?
A Simple Pizza & Beer Order Receipt
Pain Areas of DQ Solution
- Identification of business areas
- Providing high performance for ad-hoc queries
- Managing large volumes of data
- Resolving data quality issues & survival policy decisions
- Number and types of technology involved
- Handling complex relationships in data
- Data availability & cleansing
- Large number of unmanaged data sources
DQ Solution – DQ Management Approach
DMAIC is a 5-step iterative approach to data quality improvement. It comprises
continuous analysis, observation, and improvement of the underlying data, leading to
an overall improvement in the quality of information across the organization.
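The iterative DMAIC cycle can be sketched as a measure-improve loop over a quality metric. This is a minimal illustration only; the rule, remediation, and thresholds are assumptions, not part of the deck:

```python
# Illustrative DMAIC-style loop for data quality. Each cycle measures a quality
# metric, applies an improvement, and a control check decides whether another
# cycle is needed. The quality rule, fix, and target threshold are assumptions.

def measure(records, rule):
    """Measure: fraction of records passing a quality rule."""
    passed = sum(1 for r in records if rule(r))
    return passed / len(records) if records else 1.0

def improve(records, fix):
    """Improve: apply a remediation to every record."""
    return [fix(dict(r)) for r in records]

def dmaic(records, rule, fix, target=0.99, max_cycles=5):
    # Define: the rule and target threshold together define the quality goal.
    for _cycle in range(max_cycles):
        score = measure(records, rule)      # Measure
        if score >= target:                 # Control: goal met, stop iterating
            return records, score
        # Analyze would inspect the failing records here; this sketch goes
        # straight to the remediation step.
        records = improve(records, fix)     # Improve
    return records, measure(records, rule)
```

For example, a rule "country must be populated" with a fix that substitutes a default value reaches the target after one improvement cycle.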
DQ Solution – A typical DQ Lifecycle
The ultimate goal of DQ management should be to move from a reactive mode of data
quality management to proactively controlling and managing data quality, so that
data imperfections in the systems are limited.
DQ Solution – Factors Influencing DQ
All of the factors below (LUCAS) need to be addressed in order to ensure that
quality data is available to end users for analysis.
DQ Solution – DQ Reference Architecture
DQ Architecture Options
DQ Architecture Options – Pros & Cons
Data cleansing effort and cost
- Option 1: Ensures quality data is available at the place where it is captured, and hence minimal data quality impact on downstream applications.
- Option 2: Cost and effort of the DQ exercise increase as we move away from the source system.
- Option 3: More expensive.

Data load
- Option 1: The data load may be delayed, as the DQ checks need to be applied in the source system before the data is ready to be loaded into the target layers.
- Option 2: The more DQ checks, the greater the impact on the data load; but if designed optimally, there might not be much impact.
- Option 3: The data load is very quick, as the DQ checks are applied after the loads into the DWH.

Impact on source system
- Option 1: DQ processes may become an overhead to the operational system.
- Option 2: Less impact on the source operational system compared to Option 1.
- Option 3: Minimal impact on source operational system performance.

Heterogeneous source systems
- Option 1: An additional overhead of implementing data quality processes and procedures on multiple platforms.
- Option 2: Less impact on the source operational system compared to Option 1.
- Option 3: Minimal impact on source operational system performance.
Near Real-time/Inline DQ management Solution
Option 1: Managing the DQ solution using Big Data technology
Option 2: Managing the DQ solution using database resources

Volume
- Option 1: Best suited for very high volume data.
- Option 2: Works well with low to medium data volumes.

Update frequency
- Option 1: Efficiently handles frequently changed records.
- Option 2: Capable of handling frequently changed records.

Source data quality / No. of DQ rules
- Option 1: Very efficiently handles multiple data quality checks and controls during the data load process.
- Option 2: Performance bottleneck possible with volume growth and a larger number of DQ checks.

Data load SLA
- Option 1: Capable of loading high volume data in the stipulated time.
- Option 2: Capable of loading high volume data in the stipulated time, but can crumble with data growth.

Expert availability
- Option 1: As big data technologies are still emerging, there could be difficulty in finding big-data-skilled associates.
- Option 2: ETL experts are easily available.
DQ Solution – Data Governance Council
Standardization process
Cleansing and standardization of data are achieved by a set of transformations;
an organization's data passes through each of these stages for better data
cleansing and standardization.
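A staged standardization pipeline of this kind can be sketched as a sequence of small transformations applied in order. The field names and rules below (whitespace trimming, case standardization, phone normalization) are illustrative assumptions, not taken from the deck:

```python
# Illustrative standardization pipeline: each stage is one small transformation,
# and a record passes through every stage in sequence. Field names and rules
# here are assumptions for the sketch.

def trim_whitespace(record):
    """Strip leading/trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def standardize_case(record):
    """Upper-case name fields so matching and de-duplication are consistent."""
    for field in ("first_name", "last_name", "city"):
        if isinstance(record.get(field), str):
            record[field] = record[field].upper()
    return record

def standardize_phone(record):
    """Keep digits only, e.g. '(555) 123-4567' -> '5551234567'."""
    if isinstance(record.get("phone"), str):
        record["phone"] = "".join(ch for ch in record["phone"] if ch.isdigit())
    return record

STAGES = [trim_whitespace, standardize_case, standardize_phone]

def standardize(record):
    for stage in STAGES:
        record = stage(record)
    return record

raw = {"first_name": "  jane ", "last_name": "doe", "phone": "(555) 123-4567"}
print(standardize(raw))
# -> {'first_name': 'JANE', 'last_name': 'DOE', 'phone': '5551234567'}
```

Keeping each stage independent makes it easy to add, reorder, or remove transformations as the standardization rules evolve.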
Standardization process – Technical Integration
Other Important DQ Processes – Error Handling
This is one way of implementing an error-handling solution in any ETL architecture:
- One common error table can be created to capture and store exceptions
while loading data into downstream systems.
- When records are rejected due to data quality issues (validation errors),
they will be logged in the exception database.
- In case there is an agreed default value provided by the Business for source
columns not holding valid data, that value will be loaded into the target
table.
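The flow above can be sketched as follows. The table layout, column names, and default-value map are assumptions for illustration, not part of the deck:

```python
# Illustrative error-handling flow: validate each incoming record; valid rows go
# to the target table, rows failing validation are logged to one common error
# table, and columns with an agreed business default are repaired and loaded.
# Column names and the defaults map are assumptions for this sketch.

BUSINESS_DEFAULTS = {"country": "UNKNOWN"}  # agreed default value per column

def validate(record):
    """Return a list of (column, reason) validation errors."""
    errors = []
    if not record.get("customer_id"):
        errors.append(("customer_id", "missing mandatory key"))
    if not record.get("country"):
        errors.append(("country", "missing value"))
    return errors

def load(records):
    target, error_table = [], []
    for rec in records:
        for column, reason in validate(rec):
            if column in BUSINESS_DEFAULTS:
                # Repairable: substitute the agreed default and keep loading.
                rec[column] = BUSINESS_DEFAULTS[column]
            else:
                # Not repairable: log the exception and reject the record.
                error_table.append({"record": rec, "column": column, "reason": reason})
                break
        else:
            # No unrepairable errors: the record reaches the target table.
            target.append(rec)
    return target, error_table
```

A record missing only a defaulted column is repaired and loaded; a record missing a mandatory key with no agreed default ends up in the error table instead.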
Sample Error Handling Dashboard
Reconciliation
Data reconciliation is performed to verify the integrity of the data loaded into the
warehouse.
One of the major causes of information loss is loading failures or errors during
loading, which can occur for several reasons.
The reconciliation process will only indicate whether or not the data is correct;
it will not indicate why the data is not correct. Reconciliation answers the
"what" part of the question, not the "why" part.
Types Of Reconciliation
Transactional Reconciliation
Matching the number of records in the source and in the target. If these counts
are equal, it can be safely assumed that records were not left out due to an error
during the ETL or simple load process. This can be further verified by the lack
of errors (not necessarily warnings) in the exception reporting by the ETL tool.
Financial Reconciliation
This checks the data content in the source and the target, e.g. computing the sum
of the Amount column across all records at the source and the target and matching
the two. Financial reconciliation will be performed before the data load into the
EDW starts.
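Both checks can be sketched in a few lines. The record layout and the `amount` field name are assumptions for the sketch:

```python
# Illustrative reconciliation checks: transactional (row counts match) and
# financial (the sum of the Amount column matches). Record layout and the
# "amount" field name are assumptions for this sketch.

def transactional_reconcile(source_rows, target_rows):
    """Counts equal -> no records were dropped during the load."""
    return len(source_rows) == len(target_rows)

def financial_reconcile(source_rows, target_rows, amount_field="amount"):
    """Totals equal -> the data content survived the load intact."""
    src_total = sum(r[amount_field] for r in source_rows)
    tgt_total = sum(r[amount_field] for r in target_rows)
    return src_total == tgt_total
```

Note that the two checks are complementary: a load that silently corrupts an amount passes the transactional check (counts still match) but fails the financial one, which is why both are run.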
Benefits
- A central repository of enterprise data and a single version of the truth across
the enterprise, providing a unified information delivery platform.
- Data is complete, accurate, and consistent in the target system, enabling
better confidence in decision making.
- Provides business users and data stewards a clear picture of their data
quality, and the ability to monitor, track, and govern information over time.
THANK YOU