02 - ETL Design Strategy
• Goal oriented
– Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
– Source Driven – the necessary activities for getting data into the warehousing environment from various sources
[Diagram: Data Intake – getting data from operational and external sources into the warehousing environment]
[Diagram: scoping the source data.
What kinds of data? Entities – Organization, Customer, Product, …
Business Events – Receive order, Ship order, Cancel order, …
Enterprise Events – Merger with…, Acquisition of…, Termination of…
How much history? e.g. 2003, 2004, 2005]
• Completeness – Does the scope of data correspond to the scope of the data warehouse? Is any data missing?
• Granularity – Is the source the lowest available grain (most detailed level) for this data?
[Diagram: Point of Origin? Tracing data elements such as CUSTOMER-NUMBER, CUSTOMER-NAME, GENDER, and DRIVER across CLAIM, CUSTOMER, and POLICY records back to their origin]
December 14, 2010
Evaluation of Sources – Origin of data
• Original Point of Entry – This practice has many benefits
– Data timeliness and accuracy are improved
– Simplifies the set of extracts from the source system
• Data Stewardship
– In organizations that have a data stewardship program, involve the data stewards
[Diagram: the Source Data Store Matrix and the Source Data Element Matrix, listing fields such as CLAIM-NUMBER]
Module 1 – Source Data Analysis & Modeling
• Warehousing subjects • Business Questions • Source composition • Facts and Qualifiers • Source subjects • Targets
[Diagram: source modeling framework.
Conceptual (analyze) – Which modeling approach? Top-down: develop a Source Subject model; bottom-up: start from an existing data model.
Logical (design) – integrate into the Source Logical Model (ERM).
Structural (specify) – structure of data store (matrix).
Physical (optimize) – existing file descriptions.
Functional (implement) – locate and extract from the existing data store]
Module 1 – Source Data Analysis & Modeling
[Diagram: (scope) What kinds of warehousing data? Which source data stores feed the target? For each source: does a source model exist? If yes, validate it; if no, develop one through conceptual modeling – top-down Subject model or bottom-up from existing data]
•Source composition model uses set notation to develop a subject area model
•Classifies each source by the business subjects that it supports
•Helps to understand
•which subjects have a robust set of sources
•which sources address a broad range of business subjects
•Helpful to plan, size, sequence and schedule development of the DW increments
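The set-notation idea above can be sketched in a few lines. This is a minimal illustration, not a real tool: the source names and subject sets are hypothetical, loosely following the insurance example later in the deck.

```python
# Hypothetical source composition matrix: each source is classified by the
# business subjects it supports, expressed as a set per source.
composition = {
    "CPS claim master":  {"CLAIM", "INCIDENT"},
    "CPS claim detail":  {"CLAIM", "EXPENSE"},
    "CPS party file":    {"PARTY", "ORGANIZATION"},
    "Marketplace table": {"MARKETPLACE", "PARTY"},
}

def subjects_with_robust_sources(matrix, min_sources=2):
    """Subjects covered by at least `min_sources` distinct sources."""
    counts = {}
    for subjects in matrix.values():
        for s in subjects:
            counts[s] = counts.get(s, 0) + 1
    return {s for s, n in counts.items() if n >= min_sources}

def broad_sources(matrix, min_subjects=2):
    """Sources that address at least `min_subjects` business subjects."""
    return {src for src, subs in matrix.items() if len(subs) >= min_subjects}

print(subjects_with_robust_sources(composition))  # {'CLAIM', 'PARTY'}
print(sorted(broad_sources(composition)))
```

Counts like these are exactly what help plan and sequence DW increments: a subject with several sources needs more integration work, while a broad source is a good candidate for an early increment.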
Composition Subject Matrix Example
[Example matrix: business subjects (CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, MARKETPLACE) mapped against sources (MIS product table, CPS claim master, CPS claim action file, CPS claim detail file, LIS claim file, CPS party file, MIS auto, MIS residential, Marketplace table)]
Module 1 – Source Data Analysis & Modeling
[Diagram: Conceptual (analyze) – does a source model exist? Which modeling approach: top-down (Subject model) or bottom-up (from the existing data model)? Logical (design) – integrate into the Source Logical Model (ERM)]
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of
documentation techniques.
Model States: discover patterns → normalize → verify model
Source/Target Mapping
• Data Capture Analysis – What to extract?
– Performed to understand requirements for data capture
• Which data entities and elements are required by target data stores?
• Data Capture Design – When to extract? How to extract?
– Performed to understand and specify methods of data capture
• Timing of extracts from each data source
• Kinds of data to be extracted
• Occurrences of data (all or changes only) to be captured
• Change detection methods
• Extract technique (snapshot or audit trail)
• Data capture method (push from source or pull from warehouse)
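One common change detection method, when the source offers no change log, is comparing today's full-file snapshot against yesterday's. The sketch below assumes keyed records (business key → attribute tuple); the keys and layout are invented for illustration.

```python
# Changes-only capture via snapshot comparison: diff today's extract against
# yesterday's snapshot, keyed on the business key.
def detect_changes(previous, current):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: (previous[k], v)            # before and after values
               for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

yesterday = {"C001": ("Smith", "family"), "C002": ("Jones", "single")}
today     = {"C001": ("Smith", "group"),  "C003": ("Doe", "family")}
ins, upd, dels = detect_changes(yesterday, today)
print(ins)   # {'C003': ('Doe', 'family')}
print(upd)   # {'C001': (('Smith', 'family'), ('Smith', 'group'))}
print(dels)  # {'C002': ('Jones', 'single')}
```

Snapshot comparison trades extract simplicity (a full file unload) for compute cost at the warehouse side; trigger- or replication-based capture pushes that cost onto the source instead.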
Source/Target Mapping
• Data Elements
[Diagram: source & target data models; ETL moves data for subjects such as customer, product, and service into the data warehouse through DATA STORE MAPPING and DATA ELEMENT MAPPING]
Source/target Mapping: Full set of Data elements
[Diagram: the full set of mapped data elements accumulates from several inputs – business questions, logical dimensional models, the data element matrix, file/table descriptions, and physical design – plus elements added by triage and elements added by transform logic during transform design]
Source/Target Mapping
What is Triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion
Why Triage?
• Ensure that a complete set of attributes is captured in the warehousing environment.
• Rework is minimized
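A triage pass can be sketched as a simple classification of source attributes. Everything here is hypothetical: the attribute names, the "already mapped" set, and the FILLER convention are illustrative, not from any real system.

```python
# Hypothetical triage: classify every source attribute so the warehouse
# captures a complete attribute set up front and later rework is minimized.
source_attributes = ["CLAIM-NUMBER", "CLAIM-DATE", "ADJUSTER-ID",
                     "FILLER-1", "CLAIM-STATUS"]
mapped_to_target = {"CLAIM-NUMBER", "CLAIM-DATE", "CLAIM-STATUS"}

def triage(attributes, mapped):
    """Split attributes into include / exclude / defer decisions."""
    decisions = {}
    for attr in attributes:
        if attr in mapped:
            decisions[attr] = "include"   # required by a target element
        elif attr.startswith("FILLER"):
            decisions[attr] = "exclude"   # no business content
        else:
            decisions[attr] = "defer"     # capture now, map later
    return decisions

print(triage(source_attributes, mapped_to_target))
```

The "defer" bucket is the point of triage: attributes with possible future value are captured even though no target element needs them yet.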
Source/Target Mapping
[Diagram: mapping metadata – customer/member numbers, reference data, membership data, source system identifiers, and source system keys]
Data Capture Methods

                      ALL DATA             CHANGED DATA
PUSH TO WAREHOUSE     Replicate source     Replicate changes
                      files/tables         or transactions
Source/Target Mapping
[Diagram: OLTP Sources → Data Extraction → Work Tables → Data Transformation → Warehouse Loading → Warehouse (Intake layer) → Data Marts. Timing considerations: frequency of acquisition, latency of load, and periodicity of data marts]
• When is the data ready in each source system?
• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for
change, before and after data values, etc.
– Acquisition techniques
• DBMS triggers
• DBMS replication
• Incremental selection
• Full file unload/copy
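The audit-trail record described above can be sketched as a small writer function. The field names and the in-memory list are assumptions for illustration; a real implementation would write to a durable store, typically populated by one of the acquisition techniques listed.

```python
# Minimal audit-trail writer: each change to data of interest is recorded
# with a timestamp, how it was detected, a reason, and before/after values.
from datetime import datetime, timezone

audit_trail = []

def record_change(key, before, after, detected_by, reason):
    audit_trail.append({
        "key": key,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "detected_by": detected_by,   # e.g. trigger, replication, comparison
        "reason": reason,
        "before": before,
        "after": after,
    })

record_change("C001", {"status": "open"}, {"status": "closed"},
              detected_by="incremental selection", reason="claim settled")
print(len(audit_trail), audit_trail[0]["before"], audit_trail[0]["after"])
```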
Transformation Analysis
Transformation Design
• Transformation Analysis
• Integrate disparate data
• Change granularity of data
• Assure data quality
• Transformation Design
• Specifies the processing needed to meet the requirements that are determined by
transformation analysis
• Determining kinds of transformations
– Selection
– Filtering
– Conversion
– Translation
– Derivation
– Summarization
– Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools
and technology
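The transformation kinds above can be illustrated in one toy record pipeline. The rules, field names, and translation table are invented; only the categories (selection, filtering, conversion, translation, derivation) come from the slide.

```python
# Toy pipeline illustrating several kinds of transformations.
GENDER_CODES = {"M": "Male", "F": "Female"}   # translation table (assumed)

def transform(rec_src1, rec_src2):
    rec = rec_src1 if rec_src1 is not None else rec_src2      # selection
    if rec.get("status") == "void":                           # filtering
        return None
    rec["amount"] = float(rec["amount"])                      # conversion: format in != format out
    code = rec["gender"]
    rec["gender_desc"] = GENDER_CODES.get(code, "Unknown")    # translation: encoded and decoded value out
    rec["net"] = rec["amount"] - rec["amount"] * 0.10         # derivation: new value, more values out than in
    return rec

out = transform({"status": "ok", "amount": "100.0", "gender": "F"}, None)
print(out["gender_desc"], out["net"])   # Female 90.0
```

Summarization, the remaining kind, is shown separately below with the store/product-line/day example.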
Transformation Analysis
Transformation Design
[Process: transformation requirements → determine transformation sequences → specify transformation process → transformation specifications]
Transformation Analysis
Transformation Design
• Select – target data comes sometimes from source 1, sometimes from source 2
• Filter
• Convert – value/format in is different than value/format out
• Translate – both encoded and decoded value out
• Derive – new data values created; more values out than in
• Summarize – summary data out, e.g.:
'for each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))'
'for each week (sum daily transaction count, sum daily dollar total)'
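The nested summarization pseudocode can be implemented directly. The transaction data is made up; the two aggregation levels follow the pseudocode exactly.

```python
# Daily summary per (store, product line, day), then weekly rollup.
from collections import defaultdict

transactions = [  # (store, product_line, day, dollars) – illustrative data
    ("S1", "auto", 1, 100.0), ("S1", "auto", 1, 50.0),
    ("S1", "auto", 2, 25.0),  ("S2", "home", 1, 75.0),
]

daily = defaultdict(lambda: [0, 0.0])   # key -> [count, total dollars]
for store, line, day, dollars in transactions:
    cell = daily[(store, line, day)]
    cell[0] += 1                        # count the transactions
    cell[1] += dollars                  # accumulate total dollar value

weekly = defaultdict(lambda: [0, 0.0])  # (store, line) -> [count, total]
for (store, line, day), (count, total) in daily.items():
    weekly[(store, line)][0] += count   # sum daily transaction count
    weekly[(store, line)][1] += total   # sum daily dollar total

print(daily[("S1", "auto", 1)])   # [2, 150.0]
print(weekly[("S1", "auto")])     # [3, 175.0]
```

Note the weekly rollup reads the daily summary, not the raw transactions: summarizing from the next-lower grain is cheaper and keeps the two levels consistent.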
Identifying Transformation Rules
Transformation Analysis
Transformation Design
• Rule Dependency – when execution of a transformation rule is based upon the result of another rule
– example: different translations occur depending on the source chosen by a selection rule
[Diagram: specify selection → specify filtering → specify derivation → specify summarization]
1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan
Modules and Programs
DTR027 (Default Membership Type)
If membership-type is null or invalid, assume "family" membership
DTR008 (Derive Name)
If membership-type is "family"
separate name using comma
insert characters prior to comma in customer-last-name
insert characters after comma in customer-first-name
else move name to customer-biz-name
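The two rules can be sketched as functions; the `valid` membership types and record layout are assumptions. The sketch makes the rule dependency concrete: DTR027 must execute before DTR008, because DTR008 branches on membership-type.

```python
# Transformation rules DTR027 and DTR008 from the slide, as functions.
def dtr027_default_membership_type(rec):
    """If membership-type is null or invalid, assume "family"."""
    if rec.get("membership-type") not in {"family", "single", "group"}:
        rec["membership-type"] = "family"
    return rec

def dtr008_derive_name(rec):
    """Split "Last, First" for family memberships; else business name."""
    if rec["membership-type"] == "family":
        last, _, first = rec["name"].partition(",")
        rec["customer-last-name"] = last.strip()
        rec["customer-first-name"] = first.strip()
    else:
        rec["customer-biz-name"] = rec["name"]
    return rec

# Rule dependency: DTR027 runs first, so a null membership-type still
# takes the "family" branch in DTR008.
rec = dtr008_derive_name(dtr027_default_membership_type(
    {"name": "Smith, Pat", "membership-type": None}))
print(rec["customer-last-name"], rec["customer-first-name"])  # Smith Pat
```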
Transformation Rules
[Diagram: dependencies among rules determine the structures of modules, programs, scripts, etc.; integration with extract & load covers scheduling, dependencies, execution, restarts, automated & manual verification procedures, and communication]
Module 4 – Data Transportation & Loading Design
[Diagram: Extract → Transform → data transport → database load → Target Data; where does each step run?]
Load design questions:
• which DBMS?
• relational vs dimensional?
• tables & indices?
• load frequency?
• load timing?
• data volumes?
• exception handling?
• restart & recovery?
• load methods?
• referential integrity?
Populating Tables
• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows
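The three population strategies can be sketched against SQLite; a real warehouse load would typically use the target DBMS's bulk loader, and the table and data here are invented.

```python
# The three table-population strategies, illustrated with sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO member VALUES (?, ?)",
                [(1, "Smith"), (2, "Jones")])

# Strategy 2: insert (only) rows into a table
con.execute("INSERT INTO member VALUES (?, ?)", (3, "Doe"))

# Strategy 3: delete old rows and insert changed rows
con.execute("DELETE FROM member WHERE id = ?", (2,))
con.execute("INSERT INTO member VALUES (?, ?)", (2, "Jones-Lee"))

# Strategy 1: drop and rebuild the table
con.execute("DROP TABLE member")
con.execute("CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT)")

print(con.execute("SELECT COUNT(*) FROM member").fetchone()[0])  # 0
```

Insert-only processing keeps history non-volatile; delete-and-insert maintains a current snapshot; drop-and-rebuild suits small, fully refreshed tables.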
Load
[Diagram: Load → Tables and Indices. Design questions: update indices at load time, or drop & rebuild them? index segmentation? allow updating of rows in tables?]
Transform → Load
[Diagram: rows that load ok flow to the target data; exception rows are suspended, then logged, reported, or discarded]
December 14, 2010
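The suspend-exceptions pattern in the diagram can be sketched as a load loop: rows that fail validation are set aside for reporting instead of aborting the whole load. The validation rule and record layout are assumptions.

```python
# Load loop with exception suspension: ok rows go to the target,
# failing rows are suspended with the reason, for logging and reports.
def load(rows, validate):
    target, suspended = [], []
    for row in rows:
        try:
            validate(row)                         # e.g. referential check
            target.append(row)                    # ok -> target data
        except ValueError as exc:
            suspended.append((row, str(exc)))     # exception -> suspend/log
    return target, suspended

def check(row):
    if row.get("member_id") is None:
        raise ValueError("missing member_id")

ok, bad = load([{"member_id": 1}, {"member_id": None}], check)
print(len(ok), len(bad))   # 1 1
```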
Integrating with ETL processes
[Diagram: the EXTRACT and TRANSFORM-FOR-LOAD processes are integrated through scheduling, dependencies, restart/recovery, execution, verification, communication, and process metadata]
• Loading as a part of a single transform
ETL Summary
• Data Mapping
• Data Transformation
• Data Conversion
• Data Cleansing
• Data Access
• Source Systems
• Database Management
• Data Movement
• Storage Management
• Metadata Management
ETL - Critical Success Factors
[Matrix: each factor below is checked (v) against data store roles and data transformation roles – Intake, Integration, Granularity, Cleansing, Information Delivery, Distribution]
1. Design for the Future, Not for the Present v v v v v v
2. Capture and store only changed data v
3. Fully understand source systems and data v v v v v
4. Allow enough time to do the job right v v v v v v
5. Use the right sources, not the easy ones v v v
6. Pay attention to data quality v v v v v v
7. Capture comprehensive ETL metadata v v v v v v
8. Test thoroughly and according to a test plan v v v v v v
9. Distinguish between one-time and ongoing loads v v v
10. Use the right technology for the right reasons v v v v v v
11. Triage source attributes v v v
12. Capture atomic level detail v v
13. Strive for subject orientation and integration v v
14. Capture history of changes in audit trail form v
15. Modularize ETL processing v v v v v v
16. Ensure that business data is non-volatile v v v
17. Use bulk loads and/or insert-only processing v
18. Complete subject orientation and integration v v
19. Use the right data structures (relational vs. dimensional) v v v v v
20. Use shared transformation rules and logic v v v v v
21. Design for distribution first, then for access v
22. Fully understand each unique access need v v v
23. Use DBMS update capabilities v v v
24. Design for access before other purposes v v
25. Design for access tool capabilities v v
26. Capture quality metadata and report data quality v v v