ETL Process in Data
Warehouse
Data Acquisition
 It is the process of extracting the relevant
business information from the different source
systems, transforming the data from one
format into another, integrating the data into a
homogeneous format, and loading the data into
a warehouse database.
 Data Extraction (E)
 Data Transformation (T)
 Data Loading (L)
Sample ETL Process Flow
ETL Process
The ETL process has the following basic steps
 Mapping the data between source systems and the
target database
 Cleansing the source data in the staging area
 Transforming the cleansed source data and then
loading it into the target system
 Source System
A database, application, file, or other storage facility from
which the data in a data warehouse is derived.
 Mapping
The definition of the relationship and data flow between
source and target objects.
 Staging Area
A place where data is processed before entering the
warehouse.
 Cleansing
The process of resolving inconsistencies and fixing the
anomalies in source data, typically as part of the ETL
process.
 Transformation
The process of manipulating data. Any manipulation beyond
copying is a transformation. Examples include cleansing,
aggregating, and integrating data from multiple sources.
 Transportation
The process of moving copied or transformed data from a
source to a data warehouse.
 Target System
A database, application, file, or other storage facility to which
the "transformed source data" is loaded in a data warehouse.
ETL Overview
 Extraction Transformation Loading – ETL
 To get data out of the source and load it into the data
warehouse – simply a process of copying data from one
database to another
 Data is extracted from an OLTP database, transformed
to match the data warehouse schema and loaded into
the data warehouse database
 Many data warehouses also incorporate data from non-
OLTP systems such as text files, legacy systems, and
spreadsheets; such data also requires extraction,
transformation, and loading
 When defining ETL for a data warehouse, it is important
to think of ETL as a process, not a physical
implementation
ETL Overview
 ETL is often a complex combination of process and
technology that consumes a significant portion of the data
warehouse development efforts and requires the skills of
business analysts, database designers, and application
developers
 It is not a one-time event, as new data is added to the
data warehouse periodically – monthly, daily, or hourly
 Because ETL is an integral, ongoing, and recurring part of
a data warehouse, it should be:
 Automated
 Well documented
 Easily changeable
ETL Staging Database
 ETL operations should be performed on a
relational database server separate from the
source databases and the data warehouse
database
 Creates a logical and physical separation
between the source systems and the data
warehouse
 Minimizes the impact of the intense periodic ETL
activity on source and data warehouse databases
 ETL (Extract-Transform-Load)
ETL comes from data warehousing and
stands for Extract-Transform-Load. ETL
covers the process of how data are loaded
from the source system into the data
warehouse. Often, ETL treats cleaning as
a separate step; the sequence is then
Extract-Clean-Transform-Load. Let us briefly
describe each step of the ETL process.
Extraction
 Extract
The Extract step covers the data extraction
from the source system and makes it
accessible for further processing. The main
objective of the extract step is to retrieve all
the required data from the source system with
as few resources as possible. The extract
step should be designed so that it does
not negatively affect the source system in
terms of performance, response time, or any
kind of locking.
 There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification that
a record has been changed and describe the change, this is the easiest way
to get the data.
 Incremental extract - some systems may not be able to provide notification
that an update has occurred, but they are able to identify which records have
been modified and provide an extract of such records. During further ETL
steps, the system needs to identify changes and propagate them down. Note
that when using a daily extract, we may not be able to handle deleted records
properly.
 Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of
the system. The full extract requires keeping a copy of the last extract in the
same format in order to be able to identify changes. Full extract handles
deletions as well.
 When using incremental or full extracts, the extract frequency is extremely
important; particularly for full extracts, the data volumes can be in the tens of
gigabytes.
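As an illustration, an incremental extract is often driven by an audit timestamp on the source table. The sketch below assumes a hypothetical orders table with a last_modified column and a stored timestamp of the previous extract run; the names, bind variable, and syntax are illustrative only.

    -- Incremental extract: pull only records changed since the last run
    -- (assumes orders.last_modified is reliably maintained by the source system)
    SELECT order_id, customer_id, order_date, amount, last_modified
    FROM   orders
    WHERE  last_modified > :last_extract_timestamp;

    -- Full extract: pull everything; changed records are identified later
    -- by comparing against the retained copy of the previous extract
    SELECT order_id, customer_id, order_date, amount
    FROM   orders;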
Clean
The cleaning step is one of the most important as it
ensures the quality of the data in the data warehouse.
Cleaning should perform basic data unification rules, such
as:
 Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to a standard
Male/Female/Unknown)
 Converting null values into a standardized Not Available/Not Provided value
 Converting phone numbers and ZIP codes to a standardized form
 Validating address fields and converting them into proper naming, e.g.
Street/St/St./Str./Str
 Validating address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).
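A minimal SQL sketch of such unification rules, assuming a hypothetical customer_src staging table; the column names and code values are assumptions for illustration.

    SELECT customer_id,
           -- unify sex codes to a standard Male/Female/Unknown
           CASE
             WHEN UPPER(sex) IN ('M', 'MALE', 'MAN')     THEN 'Male'
             WHEN UPPER(sex) IN ('F', 'FEMALE', 'WOMAN') THEN 'Female'
             ELSE 'Unknown'
           END AS sex,
           -- convert NULLs to a standardized value
           COALESCE(phone, 'Not Available')              AS phone,
           -- standardize ZIP codes by trimming stray whitespace
           TRIM(zip_code)                                AS zip_code
    FROM   customer_src;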
Extraction
 The integration of all of the disparate systems across the
enterprise is the real challenge to getting the data
warehouse to a state where it is usable
 Data is extracted from heterogeneous data sources
 Each data source has its distinct set of characteristics
that need to be managed and integrated into the ETL
system in order to effectively extract data.
 ETL process needs to effectively integrate systems that have
different:
 DBMS
 Operating Systems
 Hardware
 Communication protocols
 Need to have a logical data map before the physical data can
be transformed
 The logical data map describes the relationship between the
extreme starting points and the extreme ending points of your
ETL system, usually presented in a table or spreadsheet
Extraction
 The content of the logical data mapping document has been proven to be the critical
element required to efficiently plan ETL processes
 The table type gives us our cue for the ordinal position of our data load processes
—first dimensions, then facts.
 The primary purpose of this document is to provide the ETL developer with a clear-
cut blueprint of exactly what is expected from the ETL process. This table must
depict, without question, the course of action involved in the transformation process
 The transformation can contain anything from the absolute solution to nothing at all.
Most often, the transformation can be expressed in SQL. The SQL may or may not be
the complete statement
A typical logical data mapping document has the following columns:
Target (Table Name, Column Name, Data Type) | Source (Table Name, Column Name, Data Type) | Transformation
 The analysis of the source system is
usually broken into two major phases:
The data discovery phase
The anomaly detection phase
Extraction - Data Discovery Phase
 Data Discovery Phase
A key criterion for the success of the data
warehouse is the cleanliness and
cohesiveness of the data within it
 Once you understand what the target
needs to look like, you need to identify and
examine the data sources
Data Discovery Phase
 It is up to the ETL team to drill down further into the data
requirements to determine each and every source system, table,
and attribute required to load the data warehouse
 Collecting and Documenting Source Systems
 Keeping track of source systems
 Determining the System of Record - the point of origin of the data
 The definition of the system-of-record is important because in most
enterprises data is stored redundantly across many different systems.
 Enterprises do this to make nonintegrated systems share data. It is very
common that the same piece of data is copied, moved, manipulated,
transformed, altered, cleansed, or made corrupt throughout the
enterprise, resulting in varying versions of the same data
Data Content Analysis - Extraction
 Understanding the content of the data is crucial for determining the best
approach for retrieval
- NULL values. An unhandled NULL value can destroy any ETL process.
NULL values pose the biggest risk when they are in foreign key columns.
Joining two or more tables based on a column that contains NULL values
will cause data loss! Remember, in a relational database NULL is not equal
to NULL. That is why those joins fail. Check for NULL values in every
foreign key in the source database. When NULL values are present, you
must outer join the tables
- Dates in nondate fields. Dates are very peculiar elements because they
are the only logical elements that can come in various formats, literally
containing different values and having the exact same meaning.
Fortunately, most database systems support most of the various formats for
display purposes but store them in a single standard format
 During the initial load, capturing changes to data content
in the source data is unimportant because you are most
likely extracting the entire data source or a portion of it
from a predetermined point in time.
 Later, the ability to capture data changes in the source
system instantly becomes a priority
 The ETL team is responsible for capturing data-content
changes during the incremental load.
Determining Changed Data
 Audit Columns - used by the DBMS and updated by triggers
 Audit columns are appended to the end of each table to
store the date and time a record was added or modified
 You must analyze and test each of the columns to
ensure that it is a reliable source to indicate changed
data. If you find any NULL values, you must find an
alternative approach for detecting change – for example,
using outer joins
Process of Elimination
 Process of elimination preserves exactly one copy of
each previous extraction in the staging area for future
use.
 During the next run, the process takes the entire source
table(s) into the staging area and makes a comparison
against the retained data from the last process.
 Only differences (deltas) are sent to the data warehouse.
 Not the most efficient technique, but most reliable for
capturing changed data
Determining Changed Data
Initial and Incremental Loads
 Create two tables: previous load and current load.
 The initial process bulk loads into the current load table. Since
change detection is irrelevant during the initial load, the data
continues on to be transformed and loaded into the ultimate target
fact table.
 When the process is complete, it drops the previous load table,
renames the current load table to previous load, and creates an
empty current load table. Since none of these tasks involve
database logging, they are very fast!
 The next time the load process is run, the current load table is
populated.
 Select the current load table MINUS the previous load table.
Transform and load the result set into the data warehouse.
Determining Changed Data
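The delta step described above (current load MINUS previous load) can be written directly in SQL. The sketch below uses the MINUS set operator (EXCEPT in some databases) and assumes hypothetical current_load and previous_load staging tables with identical structure.

    -- Rows that are new or changed since the previous extract
    SELECT * FROM current_load
    MINUS
    SELECT * FROM previous_load;

    -- Rows deleted from the source since the previous extract
    SELECT * FROM previous_load
    MINUS
    SELECT * FROM current_load;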
Transformation
Transformation
 Main step where the ETL adds value
 Actually changes data and provides
guidance on whether data can be used for its
intended purposes
 Performed in staging area
 Transform
The transform step applies a set of rules to
transform the data from the source to the
target. This includes converting any measured
data to the same dimension (i.e. conformed
dimension) using the same units so that they
can later be joined. The transformation step
also requires joining data from several
sources, generating aggregates, generating
surrogate keys, sorting, deriving new
calculated values, and applying advanced
validation rules.
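For example, a transformation that joins cleansed sources, converts units, derives a calculated value, and aggregates might look like the following sketch; the tables (sales_stage, product_stage), columns, and the conversion rule are assumptions for illustration.

    -- Join cleansed staging data, convert units, derive and aggregate values
    SELECT s.sale_date,
           p.product_code,
           SUM(s.quantity)                       AS total_quantity,
           SUM(s.amount_local * s.exchange_rate) AS total_amount_usd,  -- unit conversion
           SUM(s.amount_local * s.exchange_rate)
             / NULLIF(SUM(s.quantity), 0)        AS avg_unit_price_usd -- derived value
    FROM   sales_stage   s
    JOIN   product_stage p ON p.product_id = s.product_id
    GROUP  BY s.sale_date, p.product_code;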
Data Quality paradigm
 Correct
 Unambiguous
 Consistent
 Complete
 Data quality checks are run at two places - after
extraction, and after cleaning and confirming, where
additional checks are run
Transformation
Transformation - Cleaning Data
 Anomaly Detection
 Data sampling – count(*) of the rows for a department
column
 Column Property Enforcement
 Null values in required columns
 Numeric values that fall outside of expected highs and
lows
 Columns whose lengths are exceptionally short/long
 Columns with certain values outside of discrete valid value
sets
 Adherence to a required pattern / membership in a set of
patterns
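Several of these checks can be run as simple queries against the staging area. The sketch below assumes a hypothetical employee_stage table; the column names and thresholds are illustrative only.

    -- Data sampling: row counts per department (unexpectedly small or large
    -- counts can indicate a problem with the extract)
    SELECT department, COUNT(*) AS row_count
    FROM   employee_stage
    GROUP  BY department;

    -- Column property enforcement: NULLs in required columns
    SELECT COUNT(*) AS missing_hire_dates
    FROM   employee_stage
    WHERE  hire_date IS NULL;

    -- Numeric values outside expected highs and lows
    SELECT *
    FROM   employee_stage
    WHERE  salary NOT BETWEEN 10000 AND 500000;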
Transformation - Confirming
 Structure Enforcement
Tables have proper primary and foreign keys
Obey referential integrity
 Data and Rule value enforcement
Simple business rules
Logical data checks
[Flow diagram: staged data passes through cleaning and confirming; if fatal errors are found the process stops, otherwise it continues to loading.]
Loading
Loading Dimensions
Loading Facts
 Load
During the load step, it is necessary to ensure
that the load is performed correctly and with
as few resources as possible. The target of
the load process is often a database. In order
to make the load process efficient, it is helpful
to disable any constraints and indexes before
the load and re-enable them only after the
load completes. Referential integrity then
needs to be maintained by the ETL tool to ensure
consistency.
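As an illustration, in Oracle-style syntax this pattern looks roughly as follows; the exact statements vary by DBMS, and the table, index, and constraint names are assumptions.

    -- Before the load: disable constraints and make indexes unusable
    ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_fact_date;
    ALTER INDEX idx_sales_fact_date UNUSABLE;

    -- Bulk load the data here (e.g. a direct-path insert or a bulk loader)

    -- After the load: rebuild indexes and re-enable constraints
    ALTER INDEX idx_sales_fact_date REBUILD;
    ALTER TABLE sales_fact ENABLE CONSTRAINT fk_sales_fact_date;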
Loading Dimensions
 Physically built to have the minimal set of components
 The primary key is a single field containing a meaningless
unique integer – a surrogate key
 The DW owns these keys and never allows any other
entity to assign them
 De-normalized flat tables – all attributes in a dimension
must take on a single value in the presence of a
dimension primary key.
 Should possess one or more other fields that compose
the natural key of the dimension
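A minimal sketch of such a dimension table, with a surrogate primary key and the natural key carried alongside the descriptive attributes; the table and column names are assumptions for illustration.

    CREATE TABLE customer_dim (
        customer_key    INTEGER      NOT NULL,  -- surrogate key, assigned by the DW
        customer_id     VARCHAR(20)  NOT NULL,  -- natural key from the source system
        customer_name   VARCHAR(100),
        city            VARCHAR(50),
        customer_type   VARCHAR(30),
        CONSTRAINT pk_customer_dim PRIMARY KEY (customer_key)
    );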
 The data loading module consists of all the steps
required to administer slowly changing dimensions
(SCD) and write the dimension to disk as a physical
table in the proper dimensional format with correct
primary keys, correct natural keys, and final descriptive
attributes.
 Creating and assigning the surrogate keys occur in this
module.
 The table is definitely staged, since it is the object to be
loaded into the presentation system of the data
warehouse.
Loading dimensions
 When the DW receives notification that an
existing row in a dimension has changed, it
responds in one of three ways:
Type 1
Type 2
Type 3
Type 1 Dimension
Type 2 Dimension
Type 3 Dimensions
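As a brief illustration of a Type 2 response (expiring the current row and adding a new one with a new surrogate key), assuming the customer_dim sketch above has been extended with effective_date, expiry_date, and current_flag columns; this is a hedged sketch, not the deck's prescribed implementation.

    -- Expire the existing current row for the changed customer
    UPDATE customer_dim
    SET    expiry_date  = DATE '2024-01-31',
           current_flag = 'N'
    WHERE  customer_id  = 'C1001'
    AND    current_flag = 'Y';

    -- Insert a new row with a new surrogate key and the changed attributes
    INSERT INTO customer_dim
        (customer_key, customer_id, customer_name, city, customer_type,
         effective_date, expiry_date, current_flag)
    VALUES
        (50231, 'C1001', 'Acme Ltd', 'Chicago', 'Retail',
         DATE '2024-02-01', DATE '9999-12-31', 'Y');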
Loading facts
 Facts
Fact tables hold the measurements of an
enterprise. The relationship between fact
tables and measurements is extremely
simple. If a measurement exists, it can be
modeled as a fact table row. If a fact table
row exists, it is a measurement
Key Building Process - Facts
 When building a fact table, the final ETL step is
converting the natural keys in the new input records into
the correct, contemporary surrogate keys
 ETL maintains a special surrogate key lookup table for
each dimension. This table is updated whenever a new
dimension entity is created and whenever a Type 2
change occurs on an existing dimension entity
 All of the required lookup tables should be pinned in
memory so that they can be randomly accessed as each
incoming fact record presents its natural keys. This is
one of the reasons for making the lookup tables separate
from the original data warehouse dimension tables.
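Expressed in SQL, the key substitution amounts to joining the incoming fact records on their natural keys to the dimension key lookup tables (in practice an ETL tool often caches these lookups in memory). The table and column names below are assumptions for illustration.

    INSERT INTO sales_fact (date_key, customer_key, product_key, quantity, amount)
    SELECT d.date_key,
           c.customer_key,
           p.product_key,
           s.quantity,
           s.amount
    FROM   sales_stage s
    JOIN   date_lookup     d ON d.calendar_date = s.sale_date
    JOIN   customer_lookup c ON c.customer_id   = s.customer_id
    JOIN   product_lookup  p ON p.product_id    = s.product_id;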
Key Building Process
Loading Fact Tables
 Managing Indexes
Indexes are performance killers at load time:
Drop all indexes in the pre-load phase
Segregate updates from inserts
Load the updates
Rebuild the indexes
 Managing Partitions
 Partitions allow a table (and its indexes) to be physically divided
into minitables for administrative purposes and to improve query
performance
 The most common partitioning strategy on fact tables is to
partition the table by the date key. Because the date dimension
is preloaded and static, you know exactly what the surrogate
keys are
 Need to partition the fact table on the key that joins to the date
dimension for the optimizer to recognize the constraint.
 The ETL team must be advised of any table partitions that need
to be maintained.
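For example, using PostgreSQL-style declarative partitioning (syntax differs by DBMS; the table definition and key values are assumptions), a fact table partitioned on the date key might be declared as:

    CREATE TABLE sales_fact (
        date_key     INTEGER NOT NULL,
        customer_key INTEGER NOT NULL,
        product_key  INTEGER NOT NULL,
        quantity     INTEGER,
        amount       NUMERIC(12,2)
    ) PARTITION BY RANGE (date_key);

    -- One partition per year of date keys (the date dimension is preloaded,
    -- so the key ranges are known in advance)
    CREATE TABLE sales_fact_2024 PARTITION OF sales_fact
        FOR VALUES FROM (20240101) TO (20250101);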
Outwitting the Rollback Log
 The rollback log, also known as the redo log, is
invaluable in transaction (OLTP) systems. But in a data
warehouse environment where all transactions are
managed by the ETL process, the rollback log is a
superfluous feature that must be dealt with to achieve
optimal load performance. Reasons why the data
warehouse does not need rollback logging are:
 All data is entered by a managed process—the ETL system.
 Data is loaded in bulk.
 Data can easily be reloaded if a load process fails.
 Each database management system has different logging
features and manages its rollback log differently
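In Oracle, for example, logging can be minimized with NOLOGGING and a direct-path insert; the behavior and syntax vary by DBMS, and the table names here are assumptions.

    -- Minimize redo generation for the warehouse fact table
    ALTER TABLE sales_fact NOLOGGING;

    -- Direct-path insert bypasses the conventional logging path
    INSERT /*+ APPEND */ INTO sales_fact
    SELECT date_key, customer_key, product_key, quantity, amount
    FROM   sales_fact_stage;

    COMMIT;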
 Managing ETL Process
The ETL process seems quite
straightforward. As with every application,
there is a possibility that the ETL process
will fail. This can
be caused by missing extracts from one of the
systems, missing values in one of the
reference tables, or simply a connection or
power outage. Therefore, it is necessary to
design the ETL process keeping fail-recovery
in mind.
ETL Tool Implementation
 When you are about to use an ETL tool, there is a
fundamental decision to be made: will the company build
its own data transformation tool or will it use an existing
tool?
 Building your own data transformation tool (usually a set
of shell scripts) is the preferred approach for a small
number of data sources that reside in storage of the
same type. The reason is that the effort to implement
the necessary transformations is small, due to similar data
structures and a common system architecture. Also, this
approach saves licensing costs and there is no need to
train the staff in a new tool.
 There are many ready-to-use ETL tools on the market. The main
benefit of using off-the-shelf ETL tools is the fact that they are
optimized for the ETL process by providing connectors to common
data sources like databases, flat files, mainframe systems, XML, etc.
They provide a means to implement data transformations easily and
consistently across various data sources. This includes filtering,
reformatting, sorting, joining, merging, aggregation and other
operations ready to use. The tools also support transformation
scheduling, version control, monitoring and unified metadata
management. Some of the ETL tools are even integrated with BI
tools.
 Some of the Well Known ETL Tools
 The most well known commercial tools are Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data Integrator and SAP Data Integrator.
 There are several open source ETL tools, among others Apatar,
CloverETL, Pentaho and Talend.