04 - ETL Process
[Figure: ETL architecture diagram. Labels: Source System 1, Extracts, Extraction, Transformation (Transformation Logic, Analytical Data Filter), Load, Data marts, Cubes, User View, Metadata]
ETL Overview
• The Cardinal Merch data warehouse gets its data from many sources.
• Not every source is updated at the same frequency.
• The data for a single dimension may come from more than one source.
Cardinal Merch Sources
• POS System
• Advertising Support System
• Purchasing System
• Product Design System
• Real Estate System
• HR System
• Accounting System
• KELP Extracts
• Spreadsheets
Cardinal Merch Sources
• POS System (Daily updates)
  • Transaction Level Data
    • Date of Purchase
    • Time of Purchase
    • SKU
    • Sale Price
    • Promotion ID
    • Units Sold
    • Demographic Data
  • Some Store Related Information
    • Store Number
    • Number of Registers
  • Some Product Related Information
    • SKU
    • Sale Price
Cardinal Merch Sources
• Purchasing System (Weekly updates)
  • SKU
  • List Price
  • Currently Sold
  • Size
  • Manufacturer
Cardinal Merch Sources
• Product Design System (Monthly updates)
  • Color(s)
  • Patterns
  • Length
  • Season
  • Fashion Year
Cardinal Merch Sources
• Accounting System (Daily updates)
  • Currency Conversion Rates
• Spreadsheets
  • Date information (Yearly update)
  • Time information (One-time load)
Cardinal Merch Sources
• Real Estate System (Monthly updates)
  • Store Address
  • Store Size
  • Date Opened / Closed
  • Store Type
Cardinal Merch Sources
• HR System (Weekly updates)
  • Store Manager Information
  • Region Manager Information
  • District Manager Information
  • Number of Store Employees
ETL Requirements
• Need to handle different extract schedules
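One way to picture the scheduling requirement is a small lookup that decides which sources are due for extraction on a given day. This is only a sketch: the update frequencies come from the source slides above, while the day counts per frequency, the function name `sources_due`, and the sample dates are assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed mapping from frequency label to a number of days
# ("monthly" is approximated as 30 days for this sketch).
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}

# Update frequencies as listed on the source slides.
SOURCE_SCHEDULE = {
    "POS System": "daily",
    "Accounting System": "daily",
    "Purchasing System": "weekly",
    "HR System": "weekly",
    "Product Design System": "monthly",
    "Real Estate System": "monthly",
}


def sources_due(last_run, today):
    """Return the sources whose extract interval has elapsed since last_run."""
    due = []
    for source, frequency in SOURCE_SCHEDULE.items():
        interval = timedelta(days=FREQUENCY_DAYS[frequency])
        previous = last_run.get(source)
        # A source that has never been extracted is always due.
        if previous is None or today - previous >= interval:
            due.append(source)
    return due


# Hypothetical dates: every source was last extracted one week ago.
today = date(2024, 1, 8)
last_run = {name: date(2024, 1, 1) for name in SOURCE_SCHEDULE}
print(sources_due(last_run, today))
```

With these sample dates the daily and weekly sources are due, while the monthly sources wait for a later run.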
Extraction
• Reading and understanding the source data and copying the needed data into the data staging area for further processing
• Implemented using gateways, standard interfaces, third-party products, or custom development
Extraction
• Two logical extraction approaches:
  • Full:
    • Entire data source extracted – no need to track changes to data
  • Incremental:
    • Only data that has changed during a specified time interval is extracted
    • More efficient, but requires identifying changes to data in the operational source systems
• Determining the most appropriate technique to use depends largely on the capabilities of the source systems
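The difference between the two approaches can be sketched with an in-memory SQLite database standing in for an operational source. The `sales` table, its `last_modified` column, and the sample rows are assumptions for illustration. A real source system may expose changes in a different way.

```python
import sqlite3


def make_source():
    """Build a tiny stand-in source table with a last-modified timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (sku TEXT, units INTEGER, last_modified TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("A100", 3, "2024-01-01 09:00:00"),
            ("B200", 1, "2024-01-02 10:30:00"),
            ("C300", 5, "2024-01-03 08:15:00"),
        ],
    )
    return conn


def extract_full(conn):
    """Full extraction: copy every row, no change tracking needed."""
    return conn.execute("SELECT sku, units FROM sales ORDER BY sku").fetchall()


def extract_incremental(conn, since):
    """Incremental extraction: only rows modified after the last extract."""
    return conn.execute(
        "SELECT sku, units FROM sales WHERE last_modified > ? ORDER BY sku",
        (since,),
    ).fetchall()


source = make_source()
full_rows = extract_full(source)
changed_rows = extract_incremental(source, "2024-01-01 23:59:59")
print(len(full_rows), changed_rows)
```

The incremental version reads fewer rows, but it only works because the source keeps a reliable modification timestamp.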
Extraction in Databases
• Oracle provides several techniques to track changes for incremental extraction:
  • Change Data Capture
    • Captures changes made to tables either synchronously (using triggers) or asynchronously (using redo log files)
  • Timestamps
    • Can be used if the operational system has timestamps indicating the date/time a row was last modified
  • Partitioning
    • Can be used if source systems use range partitioning with a date key
  • Triggers
    • Typically used in conjunction with timestamp columns – the trigger updates the timestamp column upon each change to the source table
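The trigger-plus-timestamp technique can be illustrated outside Oracle as well. The sketch below uses SQLite through Python purely as a stand-in, so the trigger syntax is SQLite's rather than Oracle's. The `product` table, the trigger name, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (
        sku TEXT PRIMARY KEY,
        list_price REAL,
        last_modified TEXT
    );

    -- The trigger stamps a row whenever its price changes.
    CREATE TRIGGER product_touch
    AFTER UPDATE OF list_price ON product
    FOR EACH ROW
    BEGIN
        UPDATE product SET last_modified = datetime('now')
        WHERE sku = NEW.sku;
    END;

    INSERT INTO product VALUES ('A100', 19.99, '2000-01-01 00:00:00');
    INSERT INTO product VALUES ('B200', 34.50, '2000-01-01 00:00:00');
    """
)

# A change in the operational system fires the trigger.
conn.execute("UPDATE product SET list_price = 17.99 WHERE sku = 'A100'")

# The extract then selects only rows stamped after the previous extract.
watermark = "2000-01-02 00:00:00"
changed = [
    row[0]
    for row in conn.execute(
        "SELECT sku FROM product WHERE last_modified > ? ORDER BY sku",
        (watermark,),
    )
]
print(changed)
```

The extraction query itself stays simple. The cost is an extra write on the source system for every change.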
Extraction in SQL Server
• Physical Extraction Methods
  • Online
    • Extraction process connects directly to the source system to extract data
    • Can be done via a reference to another database or schema
  • Offline
    • Extraction process accesses data in structures outside the source system
    • Flat files (e.g. via the bcp utility)
    • SSIS Packages
    • Export data utility (SQL Server Import and Export Wizard)
    • Backup files (via SQL Server Backup Utility)
Extraction in Oracle
• Physical Extraction Methods
  • Online
    • Extraction process connects directly to the source system to extract data
    • Can be done via a database link or direct schema reference
  • Offline
    • Extraction process accesses data in structures outside the source system
    • Flat files (via SQL*Plus, OCI or Pro*C programs)
    • Dump files (via Oracle’s Export Utility)
    • Redo and archive logs
    • Transportable Tablespaces
Extraction
• Third-Party Tools
  • Applications such as Informatica and DataStage can connect directly to Oracle using native Oracle drivers.
  • Queries are designed in a GUI, then run by a server engine when submitted.
Transformation
• Business rules are applied to the extracted data
• Common types of business rules:
  • Filtering – only keeping records with certain data values or ranges
  • Summarizing – totaling records into a summary record
  • Merging – combining two or more source system records into one output or target record
  • Transposing – converting codes to English descriptions or data warehouse standard descriptions
  • Codifying – converting codes from different sources to data warehouse standard values
  • Derivations – applying mathematical formulas to produce data to be stored in the data warehouse
  • Cleansing – ensuring the quality of the data: that it is consistent, is of a known, recognized value, and conforms to the metadata definition for it
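Several of these rule types can be shown on a handful of records. The sketch below covers filtering, codifying, derivation, and summarizing. The sample rows, the `GENDER_CODES` lookup, and the function names are all made up for illustration and are not taken from the Cardinal Merch systems.

```python
from collections import defaultdict

# Hypothetical code table: two source systems encode the same attribute
# differently, and the warehouse standardizes on one set of values.
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}

# Hypothetical extracted rows.
rows = [
    {"sku": "A100", "units": 2, "dollar_sales": 20.0, "dollar_cost": 12.0, "gender": "M"},
    {"sku": "A100", "units": 1, "dollar_sales": 10.0, "dollar_cost": 6.0, "gender": "2"},
    {"sku": "B200", "units": 0, "dollar_sales": 0.0, "dollar_cost": 0.0, "gender": "F"},
    {"sku": "B200", "units": 3, "dollar_sales": 45.0, "dollar_cost": 30.0, "gender": "1"},
]


def keep_sold(records):
    """Filtering: keep only records with a positive unit count."""
    return [r for r in records if r["units"] > 0]


def codify(records):
    """Codifying: map source-specific codes to the warehouse standard."""
    return [dict(r, gender=GENDER_CODES[r["gender"]]) for r in records]


def derive(records):
    """Derivation: compute gross profit from sales and cost."""
    return [
        dict(r, gross_profit=r["dollar_sales"] - r["dollar_cost"])
        for r in records
    ]


def summarize(records):
    """Summarizing: total units and gross profit per SKU."""
    totals = defaultdict(lambda: {"units": 0, "gross_profit": 0.0})
    for r in records:
        totals[r["sku"]]["units"] += r["units"]
        totals[r["sku"]]["gross_profit"] += r["gross_profit"]
    return dict(totals)


cleaned = derive(codify(keep_sold(rows)))
summary = summarize(cleaned)
print(summary)
```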
Transformation
• Transformations can be just about any type of data conversion.
• Example – Calculating Gross Profit
  • Gross Profit = Dollar Sales – Dollar Cost
  • Can be done when the fact data is inserted or as an update after the fact record has been loaded.
• Transforming during the load is usually the most efficient, since it avoids a second pass over the loaded rows.
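The two timing options for the gross profit calculation can be compared side by side. This sketch uses an in-memory SQLite table as a stand-in for the fact table. The `fact_sales` layout and the staged values are assumptions.

```python
import sqlite3

# Hypothetical staged rows: (sku, dollar_sales, dollar_cost).
staged = [("A100", 20.0, 12.0), ("B200", 45.0, 30.0)]


def new_fact_table():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_sales ("
        "sku TEXT, dollar_sales REAL, dollar_cost REAL, gross_profit REAL)"
    )
    return conn


def load_with_derivation(rows):
    """Option 1: compute gross profit while the fact rows are inserted."""
    conn = new_fact_table()
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(sku, sales, cost, sales - cost) for sku, sales, cost in rows],
    )
    return conn


def load_then_update(rows):
    """Option 2: load first, then fill gross profit in a second pass."""
    conn = new_fact_table()
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, NULL)", rows)
    conn.execute(
        "UPDATE fact_sales SET gross_profit = dollar_sales - dollar_cost "
        "WHERE gross_profit IS NULL"
    )
    return conn


def profits(conn):
    return conn.execute(
        "SELECT sku, gross_profit FROM fact_sales ORDER BY sku"
    ).fetchall()


at_insert = profits(load_with_derivation(staged))
after_load = profits(load_then_update(staged))
print(at_insert == after_load)
```

Both options produce the same fact rows. The second one touches every loaded row twice.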
Transformation
• Don’t ignore or limit transformations because they are
difficult!
• The more difficult a transformation is, the more important
it is for the data warehouse designer to calculate it if
possible.
Transformation
• Transformations can be applied using either a multistage or a pipelined approach:
  • Multistage: each transformation implemented as a separate SQL operation, creating a temporary staging table
  • Pipelined: “transform-while-load” – transformations performed during the loading process, eliminating the need to create temporary staging tables
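The contrast between the two approaches can be sketched in plain Python, with lists playing the role of temporary staging tables and a generator playing the role of the pipeline. The record layout and the three transformation steps are assumptions chosen to keep the example short.

```python
# Hypothetical extracted records.
records = [
    {"sku": " a100 ", "units": 2, "price": 5.0},
    {"sku": "b200", "units": 0, "price": 9.0},
    {"sku": "c300", "units": 1, "price": 7.5},
]


def multistage(rows):
    """Each step materializes its full result, like a staging table."""
    stage1 = [r for r in rows if r["units"] > 0]
    stage2 = [dict(r, sku=r["sku"].strip().upper()) for r in stage1]
    stage3 = [dict(r, revenue=r["units"] * r["price"]) for r in stage2]
    return stage3


def pipelined(rows):
    """All steps are applied to one row at a time, with no intermediate copy."""
    for r in rows:
        if r["units"] <= 0:
            continue
        yield dict(
            r,
            sku=r["sku"].strip().upper(),
            revenue=r["units"] * r["price"],
        )


print(multistage(records) == list(pipelined(records)))
```

The multistage version is easier to inspect between steps. The pipelined version never holds more than one row in flight.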
Transformation
• Oracle Transformation Mechanisms:
  • SQL
  • PL/SQL
  • Table Functions
  • DML error logging (LOG ERRORS clause)
Transformation SQL – Merge
MERGE INTO DIM_PROMOTION DP USING (
    SELECT PROMOTION_NAME,
           PRICE_REDUCTION_TYPE,
           AD_TYPE,
           DISPLAY_TYPE,
           COUPON_TYPE,
           AD_MEDIA_TYPE,
           PROMO_COST
    FROM DIM_PROMOTION_UPDATE
) SRC
ON (
    DP.PROMOTION_NAME = SRC.PROMOTION_NAME
)
WHEN MATCHED THEN UPDATE SET
    DP.PRICE_REDUCTION_TYPE = SRC.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE = SRC.AD_TYPE,
    DP.DISPLAY_TYPE = SRC.DISPLAY_TYPE,
    DP.COUPON_TYPE = SRC.COUPON_TYPE,
    DP.AD_MEDIA_TYPE = SRC.AD_MEDIA_TYPE,
    DP.PROMO_COST = SRC.PROMO_COST
WHEN NOT MATCHED THEN INSERT (
    DP.PROMOTION_KEY,
    DP.PROMOTION_NAME,
    DP.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE,
    DP.DISPLAY_TYPE,
    DP.COUPON_TYPE,
    DP.AD_MEDIA_TYPE,
    DP.PROMO_COST,
    DP.LAST_FLAG
) VALUES (
    PROMOTION_KEY.NEXTVAL,
    SRC.PROMOTION_NAME,
    SRC.PRICE_REDUCTION_TYPE,
    SRC.AD_TYPE,
    SRC.DISPLAY_TYPE,
    SRC.COUPON_TYPE,
    SRC.AD_MEDIA_TYPE,
    SRC.PROMO_COST,
    'Y'
);
Transformation SQL – Merge
• Select the candidate data that may need to change (update or insert)
MERGE INTO DIM_PROMOTION DP USING (
    SELECT PROMOTION_NAME,
           PRICE_REDUCTION_TYPE,
           AD_TYPE,
           DISPLAY_TYPE,
           COUPON_TYPE,
           AD_MEDIA_TYPE,
           PROMO_COST
    FROM DIM_PROMOTION_UPDATE
) SRC
Transformation SQL – Merge
• Identify the filter / join criteria (candidate data to destination)
ON (
    DP.PROMOTION_NAME = SRC.PROMOTION_NAME
)
Transformation SQL – Merge
• Define the fields to update if the row already exists
WHEN MATCHED THEN UPDATE SET
    DP.PRICE_REDUCTION_TYPE = SRC.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE = SRC.AD_TYPE,
    DP.DISPLAY_TYPE = SRC.DISPLAY_TYPE,
    DP.COUPON_TYPE = SRC.COUPON_TYPE,
    DP.AD_MEDIA_TYPE = SRC.AD_MEDIA_TYPE,
    DP.PROMO_COST = SRC.PROMO_COST
Transformation SQL – Merge
• Define the fields to insert if the row does not exist
WHEN NOT MATCHED THEN INSERT (
    DP.PROMOTION_KEY,
    DP.PROMOTION_NAME,
    DP.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE,
    DP.DISPLAY_TYPE,
    DP.COUPON_TYPE,
    DP.AD_MEDIA_TYPE,
    DP.PROMO_COST,
    DP.LAST_FLAG
) VALUES (
    PROMOTION_KEY.NEXTVAL,
    SRC.PROMOTION_NAME,
    SRC.PRICE_REDUCTION_TYPE,
    SRC.AD_TYPE,
    SRC.DISPLAY_TYPE,
    SRC.COUPON_TYPE,
    SRC.AD_MEDIA_TYPE,
    SRC.PROMO_COST,
    'Y'
);
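The same update-or-insert logic can be tried without an Oracle instance. The sketch below imitates the MERGE steps in Python with SQLite: it reads the candidate rows, matches on the promotion name, updates when a match exists, and inserts otherwise. The tables are trimmed to a few columns, an auto-increment key stands in for the `PROMOTION_KEY` sequence, and the sample promotions are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_promotion (
        promotion_key INTEGER PRIMARY KEY AUTOINCREMENT,
        promotion_name TEXT UNIQUE,
        ad_type TEXT,
        promo_cost REAL,
        last_flag TEXT
    );
    CREATE TABLE dim_promotion_update (
        promotion_name TEXT,
        ad_type TEXT,
        promo_cost REAL
    );
    INSERT INTO dim_promotion (promotion_name, ad_type, promo_cost, last_flag)
        VALUES ('Spring Sale', 'Print', 100.0, 'Y');
    INSERT INTO dim_promotion_update VALUES ('Spring Sale', 'Online', 150.0);
    INSERT INTO dim_promotion_update VALUES ('Summer Sale', 'TV', 500.0);
    """
)


def merge_promotions(db):
    """Update matching promotions, insert the ones that do not exist yet."""
    candidates = db.execute(
        "SELECT promotion_name, ad_type, promo_cost FROM dim_promotion_update"
    ).fetchall()
    for name, ad_type, cost in candidates:
        # WHEN MATCHED: try the update first, keyed on the join column.
        cursor = db.execute(
            "UPDATE dim_promotion SET ad_type = ?, promo_cost = ? "
            "WHERE promotion_name = ?",
            (ad_type, cost, name),
        )
        # WHEN NOT MATCHED: no row was touched, so insert a new one.
        if cursor.rowcount == 0:
            db.execute(
                "INSERT INTO dim_promotion "
                "(promotion_name, ad_type, promo_cost, last_flag) "
                "VALUES (?, ?, ?, 'Y')",
                (name, ad_type, cost),
            )
    db.commit()


merge_promotions(conn)
merged = conn.execute(
    "SELECT promotion_key, promotion_name, ad_type, promo_cost, last_flag "
    "FROM dim_promotion ORDER BY promotion_key"
).fetchall()
print(merged)
```

A single MERGE statement does this work set-based inside the database. The row-by-row loop here is only meant to make the matched and not-matched branches visible.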
Load
• Process of loading transformed records into the warehouse
• Additional preprocessing may be done during load:
  • Sort
  • Aggregate
  • Compute views
  • Build indexes
  • Partition
• Huge volumes of data must be loaded in a small time window (usually at night) when the warehouse can be taken offline
• Typically handled by batch load utilities
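A batch load with some of this preprocessing can be sketched as follows, again with SQLite standing in for the warehouse. The rows are sorted, inserted in fixed-size batches, and the index is built once after all rows are in. The table layout, batch size, index name, and sample rows are assumptions.

```python
import sqlite3


def load_batches(conn, rows, batch_size=2):
    """Sort the rows, insert them in batches, then build the index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sku TEXT, store_number INTEGER, units_sold INTEGER)"
    )
    ordered = sorted(rows)  # Sort step before loading
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", batch)
    # Building the index after the load avoids maintaining it per insert.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_fact_sales_sku ON fact_sales (sku)"
    )
    conn.commit()
    return len(ordered)


conn = sqlite3.connect(":memory:")
# Hypothetical transformed rows: (sku, store_number, units_sold).
transformed = [
    ("C300", 2, 1),
    ("A100", 1, 4),
    ("B200", 1, 2),
    ("A100", 2, 3),
    ("B200", 3, 6),
]
loaded = load_batches(conn, transformed)
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [
    r[0]
    for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(loaded, row_count, index_names)
```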
Loading in Oracle
• Loading Mechanisms:
  • SQL*Loader
  • External Tables
  • OCI and Direct Path APIs
  • Export/Import
• SQL*Loader
  • Loads data from flat files into Oracle
  • Can perform basic data transformations
Loading in Oracle
• External Tables:
  • You define a table in Oracle that is a mapping to a file
  • Once defined, the file is accessible directly through a query

CREATE TABLE EXT_STORE (
    STORE_NAME VARCHAR2(100),
    STORE_NUMBER VARCHAR2(100),
    CITY VARCHAR2(100),
    STATE VARCHAR2(100),
    SALES_REGION VARCHAR2(100))
ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY ext_table
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        SKIP 1
        FIELDS TERMINATED BY ','
        REJECT ROWS WITH ALL NULL FIELDS
    )
    LOCATION ('store.txt')
);
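To get a feel for what the access parameters do, the sketch below mimics them in Python over an in-memory copy of a comma-delimited file: it skips the header line, splits fields on commas, and rejects rows where every field is empty. The column names follow the EXT_STORE definition. The file contents and the function name `read_external` are made up for illustration, and this is not how Oracle implements external tables.

```python
import csv
import io

COLUMNS = ["STORE_NAME", "STORE_NUMBER", "CITY", "STATE", "SALES_REGION"]

# Hypothetical contents of store.txt: header, two stores, one all-empty row.
STORE_FILE = (
    "STORE_NAME,STORE_NUMBER,CITY,STATE,SALES_REGION\n"
    "Downtown,001,Springfield,IL,Midwest\n"
    ",,,,\n"
    "Harbor,002,Portland,OR,West\n"
)


def read_external(file_obj):
    """Yield one dict per usable record in the delimited file."""
    reader = csv.reader(file_obj)  # FIELDS TERMINATED BY ','
    next(reader, None)             # SKIP 1 (header line)
    for fields in reader:
        # REJECT ROWS WITH ALL NULL FIELDS
        if all(field == "" for field in fields):
            continue
        yield dict(zip(COLUMNS, fields))


stores = list(read_external(io.StringIO(STORE_FILE)))
# The file can now be filtered as if it were a table.
oregon = [s["STORE_NAME"] for s in stores if s["STATE"] == "OR"]
print(oregon)
```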
Loading in Oracle
• Third-Party Tools
  • Native Oracle loading tools are fairly weak – the best option is External Tables
  • Other tools provide much more functionality
    • Best – Informatica PowerCenter and IBM DataStage
    • Bearable – Golden Import/Export
    • Frustrating – SQL*Loader
Loading in SQL Server
• SQL Server Integration Services (SSIS)
• Import Data Wizard to generate SSIS package
Questions?