7 - ETL Concept Data Quality
Business Intelligence
Rokhmatul Insani
Outline
• ETL Concept
• Data Quality
• ETL Process
ETL Overview
• ETL is part of Data Acquisition and Data Storage
• It is a backend process containing the functions and procedures that transform source data into a format suitable for storage in the DW
• Which data gets extracted depends heavily on user requirements
• To obtain strategic information, data reconciliation is required, consisting of three steps: extract, transform, and load
• Challenges of the ETL process include the potential diversity of source data, the hardware platforms and operating systems in use, data quality, and more
ETL Process
• The ETL process has four major components (a minimal sketch follows the list):
• Extracting
Gathering raw data from the source systems and usually writing it to disk in the ETL environment
• Cleansing & Conforming
Sending source data through a series of processing steps in the ETL system to improve the quality of the
data received from the source, and merging data from two or more sources to create and enforce
conformed dimensions and conformed metrics
• Delivering
Physically structuring and loading the data into the presentation server's target dimensional models
• Managing
Managing the related systems and processes of the ETL environment in a coherent manner
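The first three components can be pictured as stages in a pipeline. Below is a minimal, hypothetical Python sketch of that flow; the function names and the in-memory "warehouse" are illustrative only, not a real ETL tool's API (Managing, the fourth component, wraps around this flow rather than sitting inside it):

```python
# Hypothetical sketch: extract, cleanse/conform, and deliver chained as a pipeline.
def extract(source):
    """Gather raw rows from a source system (here: already-parsed dicts)."""
    return list(source)

def cleanse_and_conform(rows):
    """Improve data quality (trim strings) before delivery."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def deliver(rows, target):
    """Load rows into the presentation server's target model (here: a list)."""
    target.extend(rows)

warehouse = []
deliver(cleanse_and_conform(extract([{"name": " Ada "}, {"name": "Grace"}])), warehouse)
print(warehouse)  # [{'name': 'Ada'}, {'name': 'Grace'}]
```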
Extracting Data
Subsystems in this process:
• Data Profiling
• Data profiling is the technical analysis of data to describe its content, consistency, and
structure.
• Change Data Capture
• Isolating the latest source data is called change data capture (CDC).
• The idea behind change data capture is simple: just transfer the data that has been
changed since the last load (see the sketch after this list).
• Extract System
• Extracting data from the source systems is a fundamental component of the ETL architecture.
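A minimal sketch of timestamp-based CDC, as described above. The `updated_at` column and the `capture_changes` helper are assumptions for illustration; real CDC may instead use log scraping, triggers, or sequence numbers:

```python
from datetime import datetime

def capture_changes(source_rows, last_load_time):
    """Return only the rows modified since the previous successful load."""
    return [r for r in source_rows if r["updated_at"] > last_load_time]

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 1)},
]
delta = capture_changes(rows, last_load_time=datetime(2024, 2, 1))
print([r["id"] for r in delta])  # [2] -- only the row changed since the last load
```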
ETL Process Based On SCD
CRC
CRC (Cyclic Redundancy Check) is an algorithm used to improve ETL performance when performing
incremental updates to a dimension table: only the delta (the difference between the source data and the
target dimension data) is inserted or updated into the dimension table.
The steps are (a sketch follows the list):
• Concatenate all columns of the source data.
• Apply the hash8 function to the concatenated source data, turning it into a numeric value.
• Concatenate all columns of the dimension data.
• Apply the hash8 function to the concatenated dimension data, turning it into a numeric value.
• Finally, compare the hash8 results of the source data and the dimension data.
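A minimal Python sketch of this hash-compare approach. `zlib.crc32` stands in for the hash8 function named above, and the column list is an assumption for illustration:

```python
import zlib

def row_hash(row, columns):
    """Concatenate all columns of a row, then hash them to a numeric value."""
    concatenated = "|".join(str(row[c]) for c in columns)
    return zlib.crc32(concatenated.encode("utf-8"))

cols = ["cust_id", "name", "city"]
source_row = {"cust_id": 7, "name": "Ada", "city": "Bandung"}
dim_row    = {"cust_id": 7, "name": "Ada", "city": "Jakarta"}

# Only touch the dimension table when the hashes differ (a delta exists).
if row_hash(source_row, cols) != row_hash(dim_row, cols):
    print("row changed -> apply incremental insert/update")
```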
Cleaning & Conforming
• Cleaning and conforming data are critical ETL system tasks. These are the
steps where the ETL system adds value to the data.
• In this process, the system also improves your data quality culture and
processes.
• Subsystems in this process:
A. Data Cleansing (supports data quality)
B. Error Event Schema
C. Audit Dimension Assembler
D. Deduplication System
E. Conforming System
A. Data Cleansing
• The goals of the cleansing system are to cleanse data, capture data quality
events, and measure and ultimately control data quality in the data warehouse.
• Data quality in the DW is not just the quality of individual data items, but the
quality of the full, integrated system as a whole.
Data Quality Dimensions
1. Accuracy: the right value
2. Domain integrity: within the range of allowable values
3. Consistency: form and content the same across multiple sources
4. Completeness: no missing values
5. Structural definiteness: enforces standards
6. Clarity: well understood
7. Timeliness: data is current
8. Usefulness: satisfies requirements
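Some of these dimensions can be screened mechanically in the ETL pipeline. A hypothetical sketch checking completeness and domain integrity on one row (the rules shown are illustrative):

```python
def screen(row):
    """Return the data quality dimensions a row violates (illustrative rules)."""
    errors = []
    # Completeness: no missing values allowed.
    if any(v is None or v == "" for v in row.values()):
        errors.append("completeness")
    # Domain integrity: age must fall within the allowable range.
    if not (0 <= row.get("age", -1) <= 120):
        errors.append("domain integrity")
    return errors

print(screen({"name": "Ada", "age": 36}))  # []
print(screen({"name": "", "age": 200}))    # ['completeness', 'domain integrity']
```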
Data Warehouse Challenges
Data Quality Problems & Challenges
• Sources of data pollution that must be checked: missing values, dummy values, cryptic values, inconsistent values, business rule violations, etc.
• Validation of names and addresses
• Cost of poor data quality
B. Error Event Schema
• The error event schema is a
centralized dimensional schema
whose purpose is to record every
error event thrown by a quality
screen anywhere in the ETL pipeline.
• This approach can be used in generic
data integration (DI) applications
where data is being transferred
between legacy applications
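A hypothetical sketch of the idea: every quality screen, wherever it runs in the pipeline, appends one row per violation to a single shared error event table. The column names are assumptions for illustration:

```python
from datetime import datetime

error_events = []  # stand-in for the centralized error event fact table

def record_error(screen_name, source_table, row_id, severity):
    """Record one error event thrown by a quality screen in the ETL pipeline."""
    error_events.append({
        "event_time": datetime.now(),
        "screen": screen_name,
        "source_table": source_table,
        "row_id": row_id,
        "severity": severity,
    })

record_error("null_check", "customer", row_id=42, severity="warning")
print(len(error_events))  # 1 -- every screen writes to the same schema
```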
C. Audit Dimension Assembler
• The audit dimension is a special
dimension that is assembled in the
back room by the ETL system for
each fact table. The audit
dimension contains the metadata
context at the moment when a
specific fact table record is created
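A minimal sketch of assembling an audit row for a fact record; the metadata fields shown (batch id, load time, quality flag) are illustrative, not a fixed schema:

```python
from datetime import datetime

def build_audit_row(batch_id, quality_flag):
    """Capture the metadata context at the moment a fact row is created."""
    return {
        "batch_id": batch_id,          # which ETL run produced the fact row
        "load_time": datetime.now(),   # when the row was loaded
        "quality_flag": quality_flag,  # e.g. 'clean' or 'screened'
    }

fact_row = {"sales": 100.0, "audit": build_audit_row(batch_id=17, quality_flag="clean")}
print(fact_row["audit"]["batch_id"])  # 17
```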
D. Deduplication System
• It is common for an organization to have multiple source systems that create and manage master
tables separately (e.g., a Customer table).
• Survivorship is the process of combining a set of matched records into a unified image that combines
the highest quality columns from the matched records into a conformed row. Survivorship involves
establishing clear business rules that define the priority sequence for column values from all possible
source systems to enable the creation of a single row with the best-survived attributes
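A hypothetical survivorship sketch: business rules rank the source systems per column, and the merged row keeps the first non-empty value in priority order. The priority table and source names are assumptions:

```python
# Hypothetical business rules: priority order of source systems per column.
PRIORITY = {"email": ["crm", "billing"], "phone": ["billing", "crm"]}

def survive(matched_records):
    """Combine matched records into one row with the best-survived attributes."""
    merged = {}
    for column, sources in PRIORITY.items():
        for src in sources:                        # walk sources in priority order
            value = matched_records.get(src, {}).get(column)
            if value:                              # first non-empty value survives
                merged[column] = value
                break
    return merged

matched = {
    "crm":     {"email": "ada@example.com", "phone": ""},
    "billing": {"email": "",                "phone": "+62-21-555-0100"},
}
print(survive(matched))  # {'email': 'ada@example.com', 'phone': '+62-21-555-0100'}
```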
E. Conforming System
• Conforming consists of all the
steps required to align the content
of some or all of the columns in a
dimension with columns in similar
or identical dimensions in other
parts of the data warehouse.
• The conforming process flow
combines the deduplication and
survivorship processing.
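A small sketch of conforming column content across sources: each source's local codes are mapped into one shared, conformed domain. The mappings below are assumptions for illustration:

```python
# Hypothetical mapping from each source's local codes to one conformed domain.
CONFORMED_GENDER = {
    "src_a": {"M": "Male", "F": "Female"},
    "src_b": {"1": "Male", "2": "Female"},
}

def conform_gender(source, row):
    """Align a source row's gender column with the conformed dimension value."""
    row["gender"] = CONFORMED_GENDER[source].get(row["gender"], "Unknown")
    return row

print(conform_gender("src_a", {"gender": "M"}))  # {'gender': 'Male'}
print(conform_gender("src_b", {"gender": "2"}))  # {'gender': 'Female'}
```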
Managing the ETL Environment
• The ETL management subsystems help achieve the goals of reliability,
availability, and manageability:
• Reliability. The ETL processes must run consistently to completion to provide data on a timely
basis that is trustworthy at any level of detail
• Availability. The data warehouse must meet its service level agreements. The warehouse
should be up and available as promised
• Manageability. A successful data warehouse is never done. It constantly grows and changes
along with the business. In order to do this, the ETL processes need to evolve gracefully as well
Managing the ETL Environment
Subsystems in this process:
• Job Scheduler (job definition, job scheduling, metadata capture, logging, notification)
• Backup System
• Recovery and Restart System
• Version Control System
• Version Migration System
• Workflow Monitor
• Sorting System
• Lineage and Dependency Analyzer
• Problem Escalation System
• Parallelizing/Pipelining System
• Security System
• Compliance Manager
• Metadata Repository Manager
ETL Process
Develop the One-Time Historic Load Process:
• Step 1: Draw the High-Level Plan
• Step 2: Choose ETL Tools
• Step 3: Develop Default Strategies
• Step 4: Drill Down by Target Table
• Step 5: Populate Dimension Tables with Historic Data
• Step 6: Perform the Fact Table Historic Load
Incremental Processing:
• Step 7: Dimension Table Incremental Processing
• Step 8: Fact Table Incremental Processing
• Step 9: Aggregate Table and OLAP Loads
• Step 10: ETL System Operation and Automation
[Diagram labels: Historical, Normalized, Comprehensive, Timely, Quality Controlled]