CS131-8:

Data Warehousing and Data Mining


04 – ETL Process
ETL Overview
• Extract, Transform, and Load (ETL), also called data
staging, takes the raw data from operational systems and
prepares it for the data warehouse
• We will look at each of the 3 phases generally and in the
specific context of our database
What is ETL?
[Figure: ETL architecture diagram. Labeled elements: Source System 1 and Source System 2, Extracts, Operational Data Store (ODS), Transformation Logic, Data Warehouse, Analytical Data Filter, Cubes, Data marts, Standard Reports, User View. Phases: Extraction, Transformation, Load. Metadata.]
ETL Overview
• Our Cardinal Merch data warehouse gets information
from lots of sources.
• Not every source has the same frequency of update
• The data for a single dimension may not come from a
single source
Cardinal Merch Sources
• POS System
• Advertising Support System
• Purchasing System
• Product Design System
• Real Estate System
• HR System
• Accounting System
• KELP Extracts
• Spreadsheets
Cardinal Merch Sources
• POS System (Daily updates)
  • Transaction Level Data
    • Date of Purchase
    • Time of Purchase
    • SKU
    • Sale Price
    • Promotion ID
    • Units Sold
    • Demographic Data
  • Some Store Related Information
    • Store Number
    • Number of Registers
  • Some Product Related Information
    • SKU
    • Sale Price
Cardinal Merch Sources
• Purchasing System (Weekly updates)
• SKU
• List Price
• Currently Sold
• Size
• Manufacturer
Cardinal Merch Sources
• Product Design System (Monthly update)
• Color(s)
• Patterns
• Length
• Season
• Fashion Year
Cardinal Merch Sources
• Accounting System (Daily Update)
• Currency conversion Rates

• KELP Extracts (Daily Update)
• Product Level Ratings
Cardinal Merch Sources
• Advertising Support System (Monthly update)
• Promotion Information

• Spreadsheets
• Date information (Yearly Update)
• Time information (One time load)
Cardinal Merch Sources
• Real Estate System (Monthly updates)
• Store Address
• Store Size
• Date Opened / Closed
• Store Type
Cardinal Merch Sources
• HR System (Weekly updates)
• Store Manager Information
• Region Manager Information
• District Manager Information
• Number of Store employees
ETL Requirements
• Need to handle different extract schedules
• Need to join data from multiple sources for a single dimension
• Need to determine the correct source to use when multiple
versions of a field are available from different sources
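The last two requirements can be sketched in a few lines of Python. The sketch builds one product dimension row per SKU from three extracts and, where more than one source supplies the same field, keeps the value from the source that comes first in a precedence list. The precedence order, the field names, the sample values, and the idea that both Purchasing and Product Design supply a size are assumptions made for illustration, not the actual Cardinal Merch rules.

```python
from typing import Any

# Hypothetical extracts keyed by SKU; field names and values are illustrative only.
purchasing = {"A100": {"list_price": 25.0, "size": "M", "manufacturer": "Acme"}}
product_design = {"A100": {"color": "Red", "season": "Fall", "size": "Medium"}}
pos = {"A100": {"sale_price": 19.99}, "B200": {"sale_price": 5.00}}

# Assumed precedence: earlier sources win when several supply the same field.
SOURCE_PRECEDENCE = [
    ("purchasing", purchasing),
    ("product_design", product_design),
    ("pos", pos),
]


def build_product_dimension() -> dict[str, dict[str, Any]]:
    """Join all sources on SKU, resolving field conflicts by precedence."""
    dimension: dict[str, dict[str, Any]] = {}
    for _source_name, extract in SOURCE_PRECEDENCE:
        for sku, fields in extract.items():
            row = dimension.setdefault(sku, {"sku": sku})
            for field, value in fields.items():
                # setdefault keeps a value already set by a higher-precedence source.
                row.setdefault(field, value)
    return dimension


product_dim = build_product_dimension()
```

Here `product_dim["A100"]["size"]` comes from Purchasing because it is listed first, while the color and season still come from Product Design. A SKU that only appears in the POS extract still gets a (sparser) dimension row.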
Extraction
• Reading and understanding source data and copying
needed data into data staging area for further processing
• Implemented using gateways, standard interfaces, third-party
products, or custom development.
Extraction
• Two logical extraction approaches:
• Full:
• Entire data source extracted – no need to track changes to data
• Incremental:
• Only data that has changed during specified time interval is extracted
• More efficient, but requires identifying changes to data in operational
source systems
• Determining the most appropriate technique to use
depends largely on the capabilities of the source systems
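A minimal sketch of the two approaches, using Python's built-in `sqlite3` module as a stand-in for an operational source. The `sales` table, its `last_modified` column, and the sample rows are hypothetical; a real source system may not expose such a column at all, which is why the choice of technique depends on the source.

```python
import sqlite3

# In-memory stand-in for an operational source; names and rows are hypothetical.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sku TEXT, "
    "units INTEGER, last_modified TEXT)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "A100", 2, "2024-01-01 09:00:00"),
        (2, "B200", 1, "2024-01-02 10:30:00"),
        (3, "A100", 5, "2024-01-03 08:15:00"),
    ],
)


def extract_full(conn):
    """Full extraction: copy every row; no change tracking is needed."""
    return conn.execute(
        "SELECT sale_id, sku, units FROM sales ORDER BY sale_id"
    ).fetchall()


def extract_incremental(conn, since):
    """Incremental extraction: only rows modified after the previous extract."""
    return conn.execute(
        "SELECT sale_id, sku, units FROM sales "
        "WHERE last_modified > ? ORDER BY sale_id",
        (since,),
    ).fetchall()


full_rows = extract_full(source)
changed_rows = extract_incremental(source, "2024-01-02 00:00:00")
```

The incremental version reads fewer rows, but it only works because the source records when each row last changed. The ISO-style timestamp strings compare correctly as text, which keeps the sketch simple.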
Extraction in Databases
• Oracle provides several techniques to track changes for
incremental extraction:
• Change Data Capture
• Captures changes made to tables either synchronously (using triggers) or
asynchronously (using redo log files)
• Timestamps
• Can be used if operational system has timestamps indicating date/time a row
was last modified
• Partitioning
• Can be used if source systems use range partitioning with a date key
• Triggers
• Typically used in conjunction with timestamp columns – trigger updates
timestamp column upon each change on the source table
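The trigger-based idea can be illustrated with SQLite through Python's `sqlite3` module. This is only an analogy for synchronous, trigger-driven change capture, not Oracle's Change Data Capture feature itself; the `store` and `store_changes` tables and the trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store (store_number INTEGER PRIMARY KEY, store_type TEXT);

    -- Change log filled in by the triggers below (hypothetical layout).
    CREATE TABLE store_changes (
        change_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        store_number INTEGER,
        operation    TEXT
    );

    CREATE TRIGGER trg_store_insert AFTER INSERT ON store
    BEGIN
        INSERT INTO store_changes (store_number, operation)
        VALUES (NEW.store_number, 'I');
    END;

    CREATE TRIGGER trg_store_update AFTER UPDATE ON store
    BEGIN
        INSERT INTO store_changes (store_number, operation)
        VALUES (NEW.store_number, 'U');
    END;
    """
)

# Ordinary operational activity; the triggers record each change as it happens.
conn.execute("INSERT INTO store VALUES (1, 'Outlet')")
conn.execute("INSERT INTO store VALUES (2, 'Mall')")
conn.execute("UPDATE store SET store_type = 'Flagship' WHERE store_number = 2")


def changes_since(connection, last_change_id):
    """Return the changes captured after the last change id already extracted."""
    return connection.execute(
        "SELECT change_id, store_number, operation FROM store_changes "
        "WHERE change_id > ? ORDER BY change_id",
        (last_change_id,),
    ).fetchall()


captured = changes_since(conn, 0)
```

The extract job only has to remember the highest `change_id` it has processed. The cost is that every write to the source table now also performs a write to the change log.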
Extraction in SQL Server
• Physical Extraction Methods
• Online
• Can be done via a reference to another database or schema.
• Offline
• Extraction process accesses data in structures outside source system
• Flat files (via bcp or sqlcmd)
• SSIS Packages
• Export data utility (show example)
• Backup files (via SQL Server Backup Utility)
Extraction in Oracle
• Physical Extraction Methods
• Online
• Extraction process connects directly to source system to extract data
• Can be done via a database link or direct schema reference
• Offline
• Extraction process accesses data in structures outside source system
• Flat files (via SQL*Plus, OCI, or Pro*C programs)
• Dump files (via Oracle’s Export Utility)
• Redo and archive logs
• Transportable Tablespaces
Extraction
• Third-Party Tools
• Applications such as Informatica and DataStage can connect
directly to Oracle using native Oracle drivers.
• Queries are designed in a GUI, then run by a server
engine when submitted.
Transformation
• Business rules are applied to the extracted data
• Common types of business rules:
• Filtering – only keeping records with certain data values or ranges
• Summarizing – totaling records into a summary record
• Merging – combining two or more source system records into one output
or target record
• Transposing – converting codes to English description or data warehouse
standard descriptions
• Codifying – converting codes from different sources to data warehouse
standard values
• Derivations – applying mathematical formulas to produce data to be
stored in the data warehouse
• Cleansing – ensuring the quality of the data – that it is consistent, is of a
known, recognized value, and conforms to the metadata definition for it
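Several of these rule types can be shown together in a short Python sketch. The record layout, the gender codes, and the rule that zero-unit records are dropped are all assumptions chosen for illustration; real business rules come from the warehouse requirements.

```python
from collections import defaultdict

# Hypothetical extracted sales records; codes and values are illustrative only.
extracted = [
    {"sku": "A100", "store": 1, "units": 2, "price": 10.0, "gender": "F"},
    {"sku": "A100", "store": 1, "units": 1, "price": 10.0, "gender": "female"},
    {"sku": "B200", "store": 2, "units": 0, "price": 4.0, "gender": "M"},
    {"sku": "B200", "store": 2, "units": 3, "price": 4.0, "gender": "?"},
]

# Codifying: map each source's codes onto one warehouse standard value.
GENDER_STANDARD = {"F": "Female", "female": "Female", "M": "Male", "male": "Male"}


def transform(records):
    """Apply filtering, cleansing, codifying, derivation, and summarizing."""
    summary = defaultdict(lambda: {"units": 0, "dollar_sales": 0.0})
    for rec in records:
        # Filtering: keep only records that actually sold something.
        if rec["units"] <= 0:
            continue
        # Cleansing and codifying: unrecognized codes become a known 'Unknown' value.
        gender = GENDER_STANDARD.get(rec["gender"], "Unknown")
        # Derivation: dollar sales = units * price.
        dollars = rec["units"] * rec["price"]
        # Summarizing: total the detail records into one row per key.
        key = (rec["sku"], rec["store"], gender)
        summary[key]["units"] += rec["units"]
        summary[key]["dollar_sales"] += dollars
    return dict(summary)


result = transform(extracted)
```

The two A100 records carry different source codes for the same value, so codifying them first is what allows them to be summarized into a single row.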
Transformation
• Transformations can be just about any type of data
conversion.
• Example – Calculation of Gross Profit
• Gross Profit = Dollar Sales – Dollar Cost
• Can be done when fact data is inserted or as an update
after the fact record has been loaded.
• Transformations during load are most efficient
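Both options for the gross profit derivation can be sketched with Python's `sqlite3` module. The `fact_sales` table and the sample figures are hypothetical; the point is only the difference between deriving the value while inserting and filling it in with a later update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (sku TEXT, dollar_sales REAL, "
    "dollar_cost REAL, gross_profit REAL)"
)

# Hypothetical staged rows: (sku, dollar_sales, dollar_cost).
staged = [("A100", 30.0, 18.0), ("B200", 12.0, 7.5)]

# Option 1: derive gross profit while the fact rows are inserted.
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(sku, sales, cost, sales - cost) for sku, sales, cost in staged],
)

# Option 2: load the row first, then derive gross profit with an update.
conn.execute(
    "INSERT INTO fact_sales (sku, dollar_sales, dollar_cost) "
    "VALUES ('C300', 20.0, 5.0)"
)
conn.execute(
    "UPDATE fact_sales SET gross_profit = dollar_sales - dollar_cost "
    "WHERE gross_profit IS NULL"
)

profits = dict(conn.execute("SELECT sku, gross_profit FROM fact_sales"))
```

Option 2 touches the loaded rows a second time, which is the extra work that deriving the value during the load avoids.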
Transformation
• Don’t ignore or limit transformations because they are
difficult!
• The more difficult a transformation is, the more important
it is for the data warehouse designer to calculate it if
possible.
Transformation
• Transformations can be applied using either multistage or
pipelined approach:
• Multistage: each transformation implemented as a separate
SQL operation, creating a temporary staging table
• Pipelined: “transform-while-load” – transformations performed
during loading process, eliminating need to create temporary
staging tables
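The two approaches can be compared side by side in a small sketch, again using `sqlite3` as a stand-in database. The `src_sales` and `fact_sales` tables, the staging table names, and the three transformations (filter, standardize and derive, summarize) are assumptions chosen to keep the example short.

```python
import sqlite3


def make_source():
    """Create a fresh in-memory database with hypothetical source rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src_sales (sku TEXT, units INTEGER, price REAL, cost REAL)"
    )
    conn.executemany(
        "INSERT INTO src_sales VALUES (?, ?, ?, ?)",
        [("a100", 2, 10.0, 6.0), ("b200", 0, 4.0, 2.5), ("a100", 1, 10.0, 6.0)],
    )
    conn.execute(
        "CREATE TABLE fact_sales (sku TEXT, units INTEGER, gross_profit REAL)"
    )
    return conn


def load_multistage(conn):
    """Each transformation is its own SQL step with a temporary staging table."""
    # Stage 1: filter out rows with no units sold.
    conn.execute(
        "CREATE TEMP TABLE stage_filtered AS "
        "SELECT * FROM src_sales WHERE units > 0"
    )
    # Stage 2: standardize the SKU and derive gross profit per row.
    conn.execute(
        "CREATE TEMP TABLE stage_derived AS "
        "SELECT UPPER(sku) AS sku, units, units * (price - cost) AS gross_profit "
        "FROM stage_filtered"
    )
    # Stage 3: summarize and load.
    conn.execute(
        "INSERT INTO fact_sales "
        "SELECT sku, SUM(units), SUM(gross_profit) FROM stage_derived GROUP BY sku"
    )


def load_pipelined(conn):
    """Transform while loading: one statement, no staging tables."""
    conn.execute(
        "INSERT INTO fact_sales "
        "SELECT UPPER(sku), SUM(units), SUM(units * (price - cost)) "
        "FROM src_sales WHERE units > 0 GROUP BY UPPER(sku)"
    )


def fact_rows(conn):
    return conn.execute(
        "SELECT sku, units, gross_profit FROM fact_sales ORDER BY sku"
    ).fetchall()


multi = make_source()
load_multistage(multi)
piped = make_source()
load_pipelined(piped)
```

Both loads produce the same fact rows. The multistage version leaves intermediate tables that can be inspected when a step goes wrong, while the pipelined version avoids writing and rereading them.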
Transformation
• Oracle Transformation Mechanisms:
• SQL
• PL/SQL
• Table Functions
• Log Error functionality

• SQL Server Transformation Mechanisms:
• SQL
• T-SQL
• Integration Services (SSIS)
Transformation SQL – Merge
• Update or insert a table row conditionally in a single SQL
statement
• This methodology is also common in most ETL tools.
• Available in Oracle in version 9i and beyond
• Available in SQL Server 2008 and beyond
Transformation SQL – Merge
MERGE INTO DIM_PROMOTION DP USING (
    -- SELECT CANDIDATE DATA --
    SELECT PROMOTION_NAME,
           PRICE_REDUCTION_TYPE,
           AD_TYPE,
           DISPLAY_TYPE,
           COUPON_TYPE,
           AD_MEDIA_TYPE,
           PROMO_COST
    FROM DIM_PROMOTION_UPDATE
) SRC
-- FILTER CRITERIA --
ON (
    DP.PROMOTION_NAME = SRC.PROMOTION_NAME
)
-- UPDATE CLAUSE --
WHEN MATCHED THEN UPDATE SET
    DP.PRICE_REDUCTION_TYPE = SRC.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE = SRC.AD_TYPE,
    DP.DISPLAY_TYPE = SRC.DISPLAY_TYPE,
    DP.COUPON_TYPE = SRC.COUPON_TYPE,
    DP.AD_MEDIA_TYPE = SRC.AD_MEDIA_TYPE,
    DP.PROMO_COST = SRC.PROMO_COST
-- INSERT CLAUSE --
WHEN NOT MATCHED THEN INSERT (
    DP.PROMOTION_KEY,
    DP.PROMOTION_NAME,
    DP.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE,
    DP.DISPLAY_TYPE,
    DP.COUPON_TYPE,
    DP.AD_MEDIA_TYPE,
    DP.PROMO_COST,
    DP.LAST_FLAG
) VALUES (
    PROMOTION_KEY.NEXTVAL,
    SRC.PROMOTION_NAME,
    SRC.PRICE_REDUCTION_TYPE,
    SRC.AD_TYPE,
    SRC.DISPLAY_TYPE,
    SRC.COUPON_TYPE,
    SRC.AD_MEDIA_TYPE,
    SRC.PROMO_COST,
    'Y'
);
Transformation SQL – Merge
• Identify the table to update
MERGE INTO DIM_PROMOTION DP
Transformation SQL – Merge
• SELECT possible new data to change (update or insert)
USING (
    SELECT PROMOTION_NAME,
           PRICE_REDUCTION_TYPE,
           AD_TYPE,
           DISPLAY_TYPE,
           COUPON_TYPE,
           AD_MEDIA_TYPE,
           PROMO_COST
    FROM DIM_PROMOTION_UPDATE
) SRC
Transformation SQL – Merge
• Identify the filter / join criteria (candidate data to destination)
ON (
    DP.PROMOTION_NAME = SRC.PROMOTION_NAME
)
Transformation SQL – Merge
• Define the fields to update if the row already exists
WHEN MATCHED THEN UPDATE SET
    DP.PRICE_REDUCTION_TYPE = SRC.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE = SRC.AD_TYPE,
    DP.DISPLAY_TYPE = SRC.DISPLAY_TYPE,
    DP.COUPON_TYPE = SRC.COUPON_TYPE,
    DP.AD_MEDIA_TYPE = SRC.AD_MEDIA_TYPE,
    DP.PROMO_COST = SRC.PROMO_COST
Transformation SQL – Merge
• Define the fields to insert if the row does not exist
WHEN NOT MATCHED THEN INSERT (
    DP.PROMOTION_KEY,
    DP.PROMOTION_NAME,
    DP.PRICE_REDUCTION_TYPE,
    DP.AD_TYPE,
    DP.DISPLAY_TYPE,
    DP.COUPON_TYPE,
    DP.AD_MEDIA_TYPE,
    DP.PROMO_COST,
    DP.LAST_FLAG
) VALUES (
    PROMOTION_KEY.NEXTVAL,
    SRC.PROMOTION_NAME,
    SRC.PRICE_REDUCTION_TYPE,
    SRC.AD_TYPE,
    SRC.DISPLAY_TYPE,
    SRC.COUPON_TYPE,
    SRC.AD_MEDIA_TYPE,
    SRC.PROMO_COST,
    'Y'
)
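The same update-or-insert behavior can be tried out locally with Python's `sqlite3` module. SQLite has no MERGE statement, so this sketch swaps in its `INSERT ... ON CONFLICT ... DO UPDATE` upsert (SQLite 3.24 or later) as an approximation. The tables carry only a subset of the promotion columns, an auto-assigned integer key stands in for `PROMOTION_KEY.NEXTVAL`, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_promotion (
        promotion_key  INTEGER PRIMARY KEY,  -- stands in for PROMOTION_KEY.NEXTVAL
        promotion_name TEXT UNIQUE,          -- the match column from the ON clause
        ad_type        TEXT,
        promo_cost     REAL,
        last_flag      TEXT
    );
    CREATE TABLE dim_promotion_update (
        promotion_name TEXT, ad_type TEXT, promo_cost REAL
    );

    INSERT INTO dim_promotion (promotion_name, ad_type, promo_cost, last_flag)
    VALUES ('Spring Sale', 'Print', 100.0, 'Y');

    INSERT INTO dim_promotion_update VALUES
        ('Spring Sale', 'Online', 150.0),
        ('Back to School', 'TV', 500.0);
    """
)


def merge_promotions(connection):
    """Update promotions that already exist by name; insert the ones that do not."""
    # Select the candidate data.
    candidates = connection.execute(
        "SELECT promotion_name, ad_type, promo_cost FROM dim_promotion_update"
    ).fetchall()
    connection.executemany(
        """
        INSERT INTO dim_promotion (promotion_name, ad_type, promo_cost, last_flag)
        VALUES (?, ?, ?, 'Y')                    -- the 'not matched' branch
        ON CONFLICT (promotion_name) DO UPDATE SET
            ad_type = excluded.ad_type,          -- the 'matched' branch
            promo_cost = excluded.promo_cost
        """,
        candidates,
    )
    connection.commit()


merge_promotions(conn)
rows = conn.execute(
    "SELECT promotion_name, ad_type, promo_cost, last_flag "
    "FROM dim_promotion ORDER BY promotion_key"
).fetchall()
```

After the call, the existing 'Spring Sale' row has been updated in place and 'Back to School' has been inserted as a new row with `last_flag` set to 'Y', which mirrors the two branches of the MERGE statement.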
Load
• Process of loading transformed records into the warehouse
• Additional preprocessing may be done during load:
• Sort
• Aggregate
• Compute views
• Build indexes
• Partition
• Huge volumes of data to be loaded, yet small time window
(usually at night) when the warehouse can be taken off-line
• Typically handled by batch load utilities
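A minimal sketch of a batch load with some of the preprocessing steps listed above, using `sqlite3` in place of a real warehouse and its bulk load utility. The table names, index name, and sample records are hypothetical.

```python
import sqlite3

# Hypothetical transformed records: (sku, store, units, dollar_sales).
transformed = [
    ("B200", 2, 3, 12.0),
    ("A100", 1, 2, 20.0),
    ("A100", 1, 1, 10.0),
]


def batch_load(conn, records):
    """Sort, bulk insert in one transaction, then index and aggregate."""
    conn.execute(
        "CREATE TABLE fact_sales (sku TEXT, store INTEGER, "
        "units INTEGER, dollar_sales REAL)"
    )
    # Sort: order the batch before inserting.
    ordered = sorted(records)
    # Load: one transaction for the whole batch rather than one per row.
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", ordered)
    # Build indexes after the load so each insert does not pay for index upkeep.
    conn.execute("CREATE INDEX idx_fact_sales_sku ON fact_sales (sku)")
    # Aggregate: precompute a summary table for reporting.
    conn.execute(
        "CREATE TABLE agg_sales_by_sku AS "
        "SELECT sku, SUM(units) AS units, SUM(dollar_sales) AS dollar_sales "
        "FROM fact_sales GROUP BY sku"
    )


warehouse = sqlite3.connect(":memory:")
batch_load(warehouse, transformed)
aggregate = warehouse.execute(
    "SELECT sku, units, dollar_sales FROM agg_sales_by_sku ORDER BY sku"
).fetchall()
```

Deferring the index build and the aggregate until after the bulk insert is one common way to fit a large load into a short window, though the best ordering depends on the database and the data volumes.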
Loading in Oracle
• Loading Mechanisms:
• SQL*Loader
• External Tables
• OCI and Direct Path APIs
• Export/Import

• SQL*Loader
• Loads data from flat files into Oracle
• Can perform basic data transformations
Loading in Oracle
• External Tables:
• You define a table in Oracle that is a mapping to a file
• Once defined the file is accessible directly through a query
CREATE TABLE EXT_STORE (
    STORE_NAME VARCHAR2(100),
    STORE_NUMBER VARCHAR2(100),
    CITY VARCHAR2(100),
    STATE VARCHAR2(100),
    SALES_REGION VARCHAR2(100))
ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY ext_table
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        SKIP 1
        FIELDS TERMINATED BY ','
        REJECT ROWS WITH ALL NULL FIELDS
    )
    LOCATION ('store.txt')
);
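For readers without an Oracle instance, the behavior of the access parameters can be imitated in plain Python. This is only an analogy: the function below reads a comma-delimited file on every call, skips the header line, and rejects rows whose fields are all empty, which is roughly what the external table definition asks Oracle to do. The sample file contents and the `query_external` helper are hypothetical.

```python
import csv
import os
import tempfile

# Column names taken from the EXT_STORE definition.
COLUMNS = ["STORE_NAME", "STORE_NUMBER", "CITY", "STATE", "SALES_REGION"]


def query_external(path, predicate=lambda row: True):
    """Read the file on each call and yield matching rows as dictionaries."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)  # FIELDS TERMINATED BY ','
        next(reader, None)           # SKIP 1 (header line)
        for fields in reader:
            # REJECT ROWS WITH ALL NULL FIELDS
            if not any(field.strip() for field in fields):
                continue
            row = dict(zip(COLUMNS, fields))
            if predicate(row):
                yield row


# Write a small illustrative store.txt into a temporary directory and query it.
with tempfile.TemporaryDirectory() as ext_dir:
    store_file = os.path.join(ext_dir, "store.txt")
    with open(store_file, "w", newline="") as out:
        out.write(
            "name,number,city,state,region\n"
            "Downtown,101,Springfield,IL,North\n"
            ",,,,\n"
            "Harbor,102,Riverton,UT,South\n"
        )
    all_stores = list(query_external(store_file))
    north_stores = list(
        query_external(store_file, lambda r: r["SALES_REGION"] == "North")
    )
```

As with an external table, nothing is copied into a database here: the file stays where it is and is parsed each time it is queried.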
Loading in Oracle
• Third-Party Tools
• Native Oracle loading tools are fairly weak; the best option is
External Tables
• Other tools provide much more functionality
• Best – Informatica PowerCenter and IBM DataStage
• Bearable – Golden Import/Export
• Frustrating – SQL*Loader…
Loading in SQL Server
• SQL Server Integration Services (SSIS)
• Import Data Wizard to generate SSIS package
Questions?
