DW 2

The document discusses Extract, Transform and Load (ETL) processes. It defines ETL as extracting data from heterogeneous sources, transforming it to fit operational needs including cleaning and standardizing data, and loading it into a data warehouse. It provides examples of full versus incremental data extraction and using SQL commands, SQL*Loader, and PL/SQL for loading data. It also introduces the open source Kettle ETL tool for visually designing transformations with capabilities like filtering and aggregation.

Extract, Transform and Load (ETL)
Eduardo Almeida
Master Alma, Université de Nantes
{[email protected]}

Goal
To present the general concepts of the Extract, Transform and Load (ETL) process

To present an open source ETL tool

To practice ETL

DW Overall architecture

Extract, Transform and Load (ETL)

DW Overall architecture (staging area)

ETL

Extract

Extract

Production data
Heterogeneous data sources
Heterogeneous representations
Incremental vs. full loading

Extract

Extraction
Logical (Full, Incremental)
Physical

Full Extraction
Export of one table or a set of tables from the source (e.g., )
Extract using programs (e.g., PL/SQL, Java, etc.)

Advantages
No need to keep track of changes
No additional information needed on the source

Drawbacks
Large amount of data
Performance impact on the data sources and the ETL process

Incremental Extraction
Requires a mechanism to identify modified data
A DATE attribute
Triggers
Comparison of original and current values (e.g., MINUS operator)
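
The DATE attribute and the original/current comparison listed above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the production source, the clients table, its last_modified column and the clients_snapshot copy are hypothetical names, and SQLite's EXCEPT plays the role of Oracle's MINUS.

```python
import sqlite3

# Hypothetical source table; table and column names are assumptions.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)"
)
source.executemany(
    "INSERT INTO clients VALUES (?, ?, ?)",
    [(1, "Alice", "2024-01-01"), (2, "Bob", "2024-01-05"), (3, "Carol", "2024-01-09")],
)

# Snapshot taken at the previous extraction (the "original" values).
source.execute("CREATE TABLE clients_snapshot AS SELECT id, name FROM clients")

# Changes made on the source after the previous extraction.
source.execute(
    "UPDATE clients SET name = 'Robert', last_modified = '2024-01-10' WHERE id = 2"
)
source.execute("INSERT INTO clients VALUES (4, 'Dave', '2024-01-11')")


def extract_since(conn, last_extraction):
    """DATE attribute: rows modified after the previous extraction date."""
    return conn.execute(
        "SELECT id, name FROM clients WHERE last_modified > ? ORDER BY id",
        (last_extraction,),
    ).fetchall()


def extract_by_difference(conn):
    """Original/current comparison: current rows minus the previous snapshot."""
    return conn.execute(
        "SELECT id, name FROM clients "
        "EXCEPT SELECT id, name FROM clients_snapshot ORDER BY id"
    ).fetchall()


changed_by_date = extract_since(source, "2024-01-09")
changed_by_diff = extract_by_difference(source)
```

Both approaches select the updated row and the new row here. The date-based query relies on the source keeping last_modified up to date, while the difference-based query needs a copy of the previous state.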

Physical Extraction
Requires a mechanism to identify modified data
Log files
Dump files
Flat files
Partitioning (source tables are partitioned along a date key)

Transform

Transform

Integration
Cleansing
Standardizing
Enrichment
Sort
Filter
...

Transform
Data integration

Transform
Data Cleansing
Data cleansing is the act of detecting and correcting (or removing) corrupt or inaccurate records.
São Paulo
S. Paulo
SP

DW
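
One possible way to bring the city variants above to a single value before they reach the DW is a lookup of known spellings. The sketch below is an assumption for illustration: the CITY_VARIANTS table and the cleanse_city helper are hypothetical, and a real project would more likely feed the lookup from reference data.

```python
# Hypothetical lookup of known variants, keyed by a normalized spelling.
CITY_VARIANTS = {
    "são paulo": "São Paulo",
    "sao paulo": "São Paulo",
    "s. paulo": "São Paulo",
    "sp": "São Paulo",
}


def cleanse_city(raw):
    """Return the canonical city name, or None when the value is not recognized."""
    # Lower-case and collapse whitespace so that ' SP ' and 'sp' share one key.
    key = " ".join(raw.strip().lower().split())
    return CITY_VARIANTS.get(key)


records = ["São Paulo", "S. Paulo", " SP ", "Sao  Paulo", "???"]
cleaned = [cleanse_city(value) for value in records]
```

Unrecognized values come back as None, so the caller can decide whether to correct them by hand or remove the record.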

Transform
Standardizing
Address
number, street, city, country, zip
street, number, neighborhood, city, country, zip
Phone
+33 (0) 2 40 55 66 77
330240556677
Name
Johnny Hallyday
Hallyday, Johnny
JOHNNY HALLYDAY
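
The phone and name examples above could be standardized along these lines. This is a minimal sketch: the two helper functions are hypothetical, the digits-only phone format follows the second phone example on the slide, and "First Last" in title case is an assumed target format for names.

```python
import re


def standardize_phone(raw):
    """Keep the digits only, dropping '+', spaces and parentheses."""
    return re.sub(r"\D", "", raw)


def standardize_name(raw):
    """Return 'First Last' in title case, also accepting 'Last, First' input."""
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        raw = f"{first} {last}"
    return " ".join(word.capitalize() for word in raw.split())


phone = standardize_phone("+33 (0) 2 40 55 66 77")
names = [
    standardize_name(value)
    for value in ["Johnny Hallyday", "Hallyday, Johnny", "JOHNNY HALLYDAY"]
]
```

Addresses are harder because the field order itself differs between sources, so they usually need to be parsed into separate attributes rather than rewritten as one string.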

Load

Load
Large amount of data
Significant processing load
Run during periods of low system use
Verify referential integrity after the load, from the fact table to the dimensions
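
The referential integrity check after the load can be sketched as a search for fact rows that point to a missing dimension row. The sales_fact and clients_dim tables below are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def find_orphan_facts(conn):
    """Return fact rows whose client_id has no matching row in the dimension."""
    return conn.execute(
        """
        SELECT f.sale_id, f.client_id
        FROM sales_fact AS f
        LEFT JOIN clients_dim AS d ON d.client_id = f.client_id
        WHERE d.client_id IS NULL
        ORDER BY f.sale_id
        """
    ).fetchall()


# Hypothetical warehouse content: sale 12 references a client that was never loaded.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE clients_dim (client_id INTEGER PRIMARY KEY, name TEXT)")
dw.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL)"
)
dw.executemany("INSERT INTO clients_dim VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
dw.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(10, 1, 99.0), (11, 2, 15.5), (12, 7, 42.0)],
)

orphans = find_orphan_facts(dw)
```

An empty result means every fact row resolves to a dimension row; anything else lists the rows to repair or reject.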

Command line tools

Extract
Oracle 'exp' command
exp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
exp scott/tiger file=emp.dmp tables=(emp,dept)
exp scott/tiger tables=emp query="where deptno=10"
exp scott/tiger file=abc.dmp tables=abc query=\"where sex=\'f\'\" rows=yes

Extract
Extracting into Flat Files Using SQL*Plus
SET echo off
SET pagesize 0
SPOOL country_city.dat
SELECT DISTINCT t1.country_name ||'|'|| t2.cust_city
FROM countries t1, customers t2
WHERE t1.country_id = t2.country_id
AND t1.country_name = 'United States of America';
SPOOL off
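
The same flat-file extraction can also be written as a program, as mentioned for full extraction. The sketch below assumes an in-memory SQLite database with hypothetical sample rows in place of the Oracle source, and writes to an in-memory buffer where a real job would open country_city.dat.

```python
import io
import sqlite3


def extract_to_flat_file(conn, out):
    """Write one 'country_name|cust_city' line per distinct pair and return the count."""
    rows = conn.execute(
        """
        SELECT DISTINCT t1.country_name, t2.cust_city
        FROM countries t1
        JOIN customers t2 ON t1.country_id = t2.country_id
        WHERE t1.country_name = ?
        ORDER BY t2.cust_city
        """,
        ("United States of America",),
    )
    count = 0
    for country_name, cust_city in rows:
        out.write(f"{country_name}|{cust_city}\n")
        count += 1
    return count


# Hypothetical sample data standing in for the source tables.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT)")
src.execute(
    "CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_city TEXT, country_id INTEGER)"
)
src.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [(1, "United States of America"), (2, "France")],
)
src.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Boston", 1), (2, "Austin", 1), (3, "Boston", 1), (4, "Nantes", 2)],
)

buffer = io.StringIO()
written = extract_to_flat_file(src, buffer)
```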

Load
Oracle 'imp' command
imp scott/tiger file=emp.dmp log=emp.log tables=emp rows=yes indexes=no
imp scott/tiger file=emp.dmp tables=(emp,dept)
'imp' has no query parameter: the rows are loaded as they were exported, so any row filtering is done at export time

Load
Scenario
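The system has both a clients table and a clients_dim table
Goal: load clients_dim from an export of clients

How to load using 'imp'?
'imp' loads rows into the table that has the same name as the exported one, so the tables are swapped by renaming around the import; IGNORE=Y lets the import proceed although the target table already exists

rename clients to clients_temp;
rename clients_dim to clients;
imp alma1 fromuser=almax touser=alma1 tables=clients file=almax.clients.dmp log=almax.clients.log IGNORE=Y
rename clients to clients_dim;
rename clients_temp to clients;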

Load
Using SQL*Loader
sqlldr user control=control.ctl
The control.ctl file has the load information:
load data
infile 'country_city.dat'
into table country_city
fields terminated by "|" optionally enclosed by '"'
( country_name, cust_city )
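
To make the meaning of the control file concrete, the sketch below reads the same record layout with Python's csv module: fields separated by "|" and optionally enclosed in double quotes. It only mimics the parsing rules and is not a replacement for SQL*Loader; the sample lines, including the contrived quoted value that contains a "|", are assumptions.

```python
import csv
import io


def parse_flat_file(lines):
    """Parse 'country_name|cust_city' records; fields may be wrapped in double quotes."""
    reader = csv.reader(lines, delimiter="|", quotechar='"')
    return [(country_name, cust_city) for country_name, cust_city in reader]


# Hypothetical file content; the second record uses the optional enclosure.
sample = io.StringIO(
    'United States of America|Boston\n'
    '"United States of America"|"Winston|Salem"\n'
)
records = parse_flat_file(sample)
```

The enclosure matters only when a value contains the delimiter, as in the second record.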

Load
Using PL/SQL
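DECLARE
  nom_cat VARCHAR2(25);   -- category name looked up for each product
  descr   VARCHAR2(100);  -- category description
  -- code_categorie is selected as well, since the loop needs it for the lookup
  CURSOR cur IS
    SELECT ref_produit, nom_produit, code_categorie
    FROM produits;
BEGIN
  FOR crec IN cur LOOP
    -- fetch the category attributes of the current product
    SELECT nom_categorie, description
    INTO nom_cat, descr
    FROM categories
    WHERE code_categorie = crec.code_categorie;
    -- store the product together with its category in the dimension table
    INSERT INTO products_dim (ref_produit, nom_produit, nom_categorie, description)
    VALUES (crec.ref_produit, crec.nom_produit, nom_cat, descr);
  END LOOP;
  COMMIT;
END;
/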

Cursor
PL/SQL Variables
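
As a point of comparison, the same load can be expressed as a single set-based statement instead of a cursor loop. This is a different technique from the PL/SQL shown on the slides, sketched here on an in-memory SQLite database; the sample rows are hypothetical, and the table and column names follow the PL/SQL example.

```python
import sqlite3


def load_products_dim(conn):
    """Fill products_dim with one INSERT ... SELECT and return its row count."""
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(
            """
            INSERT INTO products_dim (ref_produit, nom_produit, nom_categorie, description)
            SELECT p.ref_produit, p.nom_produit, c.nom_categorie, c.description
            FROM produits AS p
            JOIN categories AS c ON c.code_categorie = p.code_categorie
            """
        )
    return conn.execute("SELECT COUNT(*) FROM products_dim").fetchone()[0]


# Hypothetical sample data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE categories (code_categorie INTEGER, nom_categorie TEXT, description TEXT)")
db.execute("CREATE TABLE produits (ref_produit INTEGER, nom_produit TEXT, code_categorie INTEGER)")
db.execute(
    "CREATE TABLE products_dim "
    "(ref_produit INTEGER, nom_produit TEXT, nom_categorie TEXT, description TEXT)"
)
db.executemany(
    "INSERT INTO categories VALUES (?, ?, ?)",
    [(1, "Boissons", "Drinks"), (2, "Condiments", "Sauces and spreads")],
)
db.executemany(
    "INSERT INTO produits VALUES (?, ?, ?)",
    [(10, "Chai", 1), (11, "Aniseed Syrup", 2)],
)

loaded = load_products_dim(db)
dim_rows = db.execute("SELECT * FROM products_dim ORDER BY ref_produit").fetchall()
```

One behavioral difference is worth keeping in mind: the inner join silently skips a product whose category is missing, whereas SELECT ... INTO in the PL/SQL loop raises NO_DATA_FOUND in that case.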

Kettle
Open source ETL tool
http://kettle.pentaho.org/

Kettle
Kettle is designed to help you with your ETTL needs, which include the Extraction, Transformation, Transportation and Loading of data.

Runs on Java

Has a graphical user interface called Spoon

Kettle Tutorial
Open a terminal
$ spoon.sh

Transformation

Kettle Tutorial
1 - Explorer

2 - Connections

3 - Configuration

4 - Test the connection

Kettle Tutorial
1 - Design (design palette)

2 - Drag and drop

Kettle Tutorial
1 - Step name

2 - Write the SQL

Kettle Tutorial

1 - Insert into table

2 - Link

Kettle Tutorial

Kettle Tutorial
1 - Run

2 - Check the results

Kettle Tutorial

1 - Filter

Kettle Tutorial
1 - Run

2 - Check the results

Kettle Tutorial

1 - Aggregation

Kettle Tutorial
1 - Step name

2 - Group field

3 - Aggregate field
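
What the filter and aggregation steps of the tutorial do to the rows flowing through a transformation can be sketched in a few lines. This is not Kettle's API: the two functions, the sample rows and the field names are hypothetical, and a sum is assumed as the aggregate.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Filter step: keep only the rows for which the condition holds."""
    return [row for row in rows if predicate(row)]


def group_and_sum(rows, group_field, agg_field):
    """Aggregation step: sum agg_field for each distinct value of group_field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[agg_field]
    return dict(totals)


# Hypothetical input rows, as they would come out of the SQL input step.
sales = [
    {"city": "Nantes", "amount": 120.0},
    {"city": "Paris", "amount": 80.0},
    {"city": "Nantes", "amount": 30.0},
    {"city": "Paris", "amount": 5.0},
]

large_sales = filter_rows(sales, lambda row: row["amount"] >= 10.0)
totals_by_city = group_and_sum(large_sales, "city", "amount")
```

The group field and the aggregate field correspond to the two settings numbered 2 and 3 on the aggregation step.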
