UNIT 2
Mr.T.Somasundaram
Assistant Professor
Department of Management
Kristu Jayanti College (Autonomous), Bengaluru
UNIT 2
ETL Process and Maintenance of the Data Warehouse
Data Extraction, Data Transformation, Data Loading, Data Quality, Data Warehouse Design Reviews, Testing and Monitoring the Data Warehouse.
ETL Process
• ETL is a process in data warehousing and stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse system.
• It is a data integration process that combines data from multiple data sources into a single, consistent data store in a data warehouse.
• The tool may be customized to suit the needs of the enterprise.
(E.g.) ETL tool sets are used for long-term analysis and usage of data in banking, insurance claims, retail sales history, etc.
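To make the three steps concrete, here is a minimal sketch of an ETL run in Python, assuming a hypothetical SQLite source table `orders` and warehouse table `dw_orders`; all table and column names are illustrative only.

```python
import sqlite3
import pandas as pd

def extract(source_conn):
    # Extract: read raw records from the operational source system.
    return pd.read_sql_query(
        "SELECT order_id, customer, amount, order_date FROM orders", source_conn)

def transform(df):
    # Transform: standardize formats and derive new values in the staging area.
    out = df.copy()
    out["customer"] = out["customer"].str.strip().str.upper()
    out["order_date"] = pd.to_datetime(out["order_date"]).dt.date
    out["amount"] = out["amount"].round(2)
    return out

def load(df, dw_conn):
    # Load: append the transformed rows into the warehouse table.
    df.to_sql("dw_orders", dw_conn, if_exists="append", index=False)

if __name__ == "__main__":
    src = sqlite3.connect("operational.db")   # hypothetical source database
    dw = sqlite3.connect("warehouse.db")      # hypothetical warehouse database
    load(transform(extract(src)), dw)
```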
ETL Tools
• Google BigQuery
• Amazon Redshift
• Informatica – PowerCenter
• IBM – InfoSphere Information Server
• Oracle Data Integrator
• SQL Server Integration Services
Benefits of ETL Tools
1) Scalability – virtually unlimited scalability is available at the click of a button, i.e. capacity can be changed in size as needed.
2) Simplicity – it saves time and resources and avoids a lot of complexity.
3) Out of the box – open-source ETL requires customization and cloud-based ETL requires integration.
4) Compliance – it provides an easy way to avoid complicated and risky compliance setups.
5) Long-term costs – open-source ETL tools are cheaper up front but may cost more in the long run.
Phases (Steps) of the ETL Process
Extraction (E):
The first step is the extraction of data: the source system’s data is accessed and prepared for further processing, and the required values are extracted.
Data is extracted from various formats such as relational databases, NoSQL stores, XML and flat files, etc.
It is important to store extracted data in a staging area, not directly in the data warehouse, as loading it directly may cause damage and rollback will be much more difficult.
Extraction has three approaches –
a) Update Notification – when the data is changed or altered in the source system, the source notifies about the change.
b) Incremental Extract – many systems are incapable of providing notifications but are efficient enough to track the changes made to the source data.
c) Full Extract – the whole data set is extracted when the system can neither notify nor track changes. An old copy of the data is maintained to identify the changes.
(E.g.) conversion of phone numbers and email addresses to a standard form, validation of address fields, etc.
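A minimal sketch of the incremental-extract approach, assuming the source table carries a `last_updated` timestamp and that the ETL job remembers when it last ran; names and values are illustrative.

```python
import sqlite3
import pandas as pd

def incremental_extract(source_conn, last_extract_time):
    # Pull only the rows that changed since the previous extract run.
    query = """
        SELECT customer_id, name, phone, email, last_updated
        FROM customers
        WHERE last_updated > ?
    """
    return pd.read_sql_query(query, source_conn, params=(last_extract_time,))

conn = sqlite3.connect("operational.db")                    # hypothetical source database
changed = incremental_extract(conn, "2024-01-31 23:59:59")  # timestamp of previous run
```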
The data extraction issues are -
Source Identification – identify source applications and source structures.
Method of extraction – for each data source, define whether the extraction
process is manual or tool-based.
Extraction frequency – for each data source, establish how frequently the
data extraction must be done (daily, weekly, quarterly, etc.)
Time Window – for each data source, denote the time window for the
extraction process.
Job sequencing – determine whether the beginning of one job in an
extraction job stream has to wait until the previous job has finished
successfully.
Exception handling – determine how to handle input records that can’t be
extracted.
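The checklist above can be recorded as a simple per-source configuration; the following sketch uses purely illustrative values.

```python
# Illustrative extraction configuration for one data source,
# covering the planning items listed above.
extraction_job = {
    "source": "billing_system",          # source identification
    "method": "tool-based",              # method of extraction: manual or tool-based
    "frequency": "daily",                # extraction frequency: daily, weekly, quarterly, ...
    "time_window": ("01:00", "03:00"),   # time window allowed for the extraction
    "depends_on": ["crm_extract"],       # job sequencing: wait for this job to finish
    "on_error": "reject_and_log",        # exception handling for records that cannot be extracted
}
```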
The following are the guidelines adopted for extracts as best practices -
The extract processing should identify changes since the last extract.
Interface record types should be defined for the extraction based on
entities in data warehouse model.
(E.g.) Client information extracted from source may be categorized into
person attributes, contact point information, etc.
When changes are sent to the data warehouse, all current attributes for the changed
entity should also be sent.
Any interface record should be categorized as –
Records which have been added to operational database since the last
extract.
Records which have been deleted from operational database since the last
extract.
Transformation (T):
The second step of ETL process is transformation.
A set of rules or functions is applied to the extracted
data to convert it into a single standard format.
It includes dimension conversion, aggregation,
joining, derivation and calculations of new values.
Transformation involves the following processes or tasks –
a) Filtering – loading only certain attributes into the
data warehouse.
b) Cleaning – filling up the NULL values with some default
values, mapping U.S.A, United States and America to USA,
etc.
c) Joining – joining multiple attributes into one.
d) Splitting – splitting a single attribute into multiple attributes.
e) Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
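A small sketch of these transformation tasks using pandas; the customer and order data frames and their columns are illustrative only.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "full_name": ["Asha Rao", "John Smith", None],
    "country": ["India", "U.S.A", "United States"],
})
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [120.0, 75.5, 30.0]})

# Filtering: load only the attributes needed in the warehouse.
customers = customers[["cust_id", "full_name", "country"]]

# Cleaning: fill NULLs with defaults and map variant values to a standard form.
customers["full_name"] = customers["full_name"].fillna("UNKNOWN")
customers["country"] = customers["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Joining: combine attributes from multiple sources into one record.
merged = orders.merge(customers, on="cust_id", how="left")

# Splitting: break a single attribute into multiple attributes.
merged[["first_name", "last_name"]] = merged["full_name"].str.split(" ", n=1, expand=True)

# Sorting: order the tuples on a key attribute.
merged = merged.sort_values("cust_id")
```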
Major Data Transformation Types:
a) Format Revisions – these include changes to the data types and lengths of
individual fields.
(E.g.) Product package type may be indicated by codes and names, with the fields
being numeric and text data types respectively.
b) Decoding of fields – this is common when dealing with multiple source
systems in which the same data items are described by different field values.
(E.g.) The coding for Male and Female may be 1 and 2 in one source system and
M and F in another source system.
c) Splitting of single fields – earlier systems often stored name, address, city and state
data together in a single field, but the individual components need
to be stored in separate fields in the data warehouse.
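A brief sketch of decoding fields and splitting a single field, with invented source systems and values for illustration.

```python
import pandas as pd

# Decoding of fields: map system-specific codes to one warehouse standard.
source_a = pd.DataFrame({"cust_id": [1, 2], "gender": [1, 2]})        # 1 = Male, 2 = Female
source_b = pd.DataFrame({"cust_id": [3, 4], "gender": ["M", "F"]})
gender_map = {1: "Male", 2: "Female", "M": "Male", "F": "Female"}
decoded = pd.concat([source_a, source_b], ignore_index=True)
decoded["gender"] = decoded["gender"].map(gender_map)

# Splitting of single fields: break a combined legacy field into separate components.
legacy = pd.DataFrame({"address": ["12 MG Road, Bengaluru, Karnataka"]})
legacy[["street", "city", "state"]] = legacy["address"].str.split(", ", expand=True)
```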
d) Merging of information – this does not mean the merging of several fields
to create a single field of data.
(E.g.) information about a product may come from different data sources –
product code and description from one data source, package type from another.
Merging of information denotes the combination of product code, description and
package type into a single entity.
e) Character set conversion – this relates to the conversion of character sets
to an agreed standard character set for text data in the data warehouse.
f) Conversion of units of measurement – metrics based on overseas operations
must be converted so that the numbers are all in one standard
unit of measurement.
g) Date/Time conversion – this is the representation of date and time in
standard formats.
h) Summarization – this type of transformation creates summaries to
be loaded into the data warehouse instead of loading the most granular level of data.
(E.g.) It is not necessary to store every single credit card transaction in the data
warehouse; instead, summarize the daily transactions for each
credit card and store the summary data.
i) Key restructuring – while extracting data from the input sources, look at the
primary keys of the extracted records and derive keys for the fact and
dimension tables based on the keys in the extracted records.
j) Deduplication – in most companies, customer files have several records for the same
customer, often created by mistake. The aim is
to maintain one record per customer and link all duplicates in the
source systems to that single record.
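A short sketch of summarization and deduplication; the transaction and customer data are invented for illustration.

```python
import pandas as pd

transactions = pd.DataFrame({
    "card_no":  ["1111", "1111", "2222", "2222"],
    "txn_date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "amount":   [50.0, 20.0, 10.0, 40.0],
})

# Summarization: load daily totals per card instead of every single transaction.
daily_summary = (transactions
                 .groupby(["card_no", "txn_date"], as_index=False)["amount"]
                 .sum())

# Deduplication: keep one record per customer and cross-reference the duplicates.
customers = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "name":    ["A. Kumar", "A Kumar", "B. Singh"],
    "phone":   ["98450", "98450", "91234"],
})
survivors = customers.drop_duplicates(subset=["phone"], keep="first")
cross_ref = customers.merge(
    survivors[["cust_id", "phone"]].rename(columns={"cust_id": "master_id"}), on="phone")
```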
Loading (L):
 The third and final step of the ETL process is loading.
 Transformed data is finally loaded into the data warehouse.
 Data is loaded into the data warehouse frequently, but at regular intervals.
 Indexes and constraints previously applied to the data need to be disabled before loading
commences.
 The rate and period of loading depend on requirements and vary from system to
system.
 During the loads, the data warehouse has to be offline.
 A time period should be identified when loads may be scheduled without affecting data
warehouse users.
 Consider dividing the whole load process into smaller chunks and
populating a few files at a time.
Modes of Loading (L):
a) Load – if the target table already exists and contains data, the load
process wipes out the existing data and applies the data from the
incoming file. If the table is empty before loading, the load process applies the
data from the incoming file.
b) Append – an extension of the load. If data already exists in the table, the
append process unconditionally adds the incoming data, preserving the
existing data in the target table. Incoming duplicate records may be rejected
during the append process.
c) Destructive Merge – in this mode, the incoming data is applied to the target
data. If an incoming record matches the key of an existing record, the
matching target record is updated; if not, the incoming record is added to the target
table.
d) Constructive Merge – this mode is the opposite of the destructive merge, i.e. if an
incoming record matches the key of an existing record, leave the existing record,
add the incoming record and mark the added record as superseding the old record.
e) Initial Load – loading the whole data warehouse in a single run. The load may be split
into separate subloads and run as single loads. If more than one run is needed to
create a single table, it may be scheduled to run over several days.
f) Incremental Loads – these are applications of ongoing changes from the source
systems. A method is needed to preserve the periodic nature of the changes in the data warehouse.
Constructive merge mode is an appropriate method for incremental loads.
g) Full Refresh – this application of data involves periodically rewriting the entire
data warehouse. Partial refreshes that rewrite only specific tables are possible, but
they are rare, because the dimension tables are intricately tied to the fact table.
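A minimal sketch contrasting destructive and constructive merge, using plain dictionaries keyed by the record key; the records are illustrative.

```python
# Destructive merge: an incoming record with a matching key overwrites the
# existing target record; unmatched incoming records are simply added.
def destructive_merge(target, incoming):
    merged = dict(target)
    merged.update(incoming)        # matching keys are replaced, new keys are added
    return merged

# Constructive merge: the existing record is kept, the incoming record is added
# and marked as superseding the old one.
def constructive_merge(target, incoming):
    merged = {key: [(row, "current")] for key, row in target.items()}
    for key, row in incoming.items():
        if key in merged:
            merged[key] = [(old, "superseded") for old, _ in merged[key]]
            merged[key].append((row, "current"))
        else:
            merged[key] = [(row, "current")]
    return merged

target   = {1: {"name": "Asha", "city": "Mysuru"}}
incoming = {1: {"name": "Asha", "city": "Bengaluru"}, 2: {"name": "Ravi", "city": "Hubli"}}
print(destructive_merge(target, incoming))
print(constructive_merge(target, incoming))
```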
Data Quality (DQ) in Data Warehouse
What is Data Quality?
To an IT professional, data quality is quite often associated with the accuracy
of individual data elements.
(E.g.) Consider Customer as an entity with attributes such as Customer Name,
Customer Address, Customer State, Customer Mobile No, etc.
The data is accurate when the attributes of the customer entity correctly describe the
particular customer.
(i.e.) if data is fit for the purpose for which it is intended, then such data has
quality.
Data quality is related to the usage of the data item as defined by the users.
Data quality in operational systems requires that database records conform to
field validation edits; that is data quality, but single-field edits alone do not
constitute data quality.
Definition:
Data quality refers to the overall utility of a dataset as a function of
its ability to be easily processed and analyzed for other uses, usually by a
database, data warehouse, or data analytics system.
 Data quality in a data warehouse is not just the quality of individual data
items but the quality of the full, integrated system as a whole.
(E.g.) In an online ordering system, while entering data about customers in the
order entry application, we may collect demographics of each customer.
 Sometimes these demographic factors may not be needed or may not receive much
attention at entry time.
 When that data is later accessed as part of the integrated whole, it lacks data
quality.
(E.g.) Some customer information may be important and some unimportant
when filling in an application form (especially in banking processes).
Data Accuracy Vs Data Quality
Difference between Data Accuracy and Data Quality:
• Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
• Data Accuracy: the data element is defined in terms of database technology. Data Quality: a wider concept grounded in the specific business of the company.
• Data Accuracy: the data element conforms to validation constraints. Data Quality: relates not just to single data elements but to the system as a whole.
• Data Accuracy: individual data items have the correct data types. Data Quality: the form and content of data elements are consistent across the whole system.
• Data Accuracy: traditionally relates to operational systems. Data Quality: essentially needed in a corporate-wide data warehouse for business users.
Characteristics (Dimensions) of Data Quality
The data quality dimensions are -
1. Accuracy:
The value stored in the system for a data element is the right value for that
occurrence of the data element. (E.g.) getting the correct customer address.
2. Domain Integrity:
The data value of an attribute falls in the range of allowable, defined values.
(E.g.) Male and Female for gender data element.
3. Data Type:
Value for a data attribute is actually stored as the data type defined for the
attribute. (E.g.) Name field is defined as ‘text’.
4. Consistency:
The form and content of a data field are the same across multiple source
systems. (E.g.) the product code for Product A is 1234 in every source system.
5. Redundancy:
The same data must not be stored in more than one place in a system.
6. Completeness:
There are no missing values for a given attribute in the system.
7. Duplication:
Duplication of records in a system is completely resolved. (E.g.) duplicate
records are identified and a cross-reference is created.
8. Conformance to Business Rules:
The values of each data item adhere to prescribed business rules. (E.g.) in
auction system, sale price can’t be less than the reserve price.
9. Structural Definiteness:
Wherever a data item can naturally be structured into individual components,
the item must contain this well-defined structure. (E.g.) names are divided
into first name, middle name and last name, which reduces missing values.
10. Data Anomaly:
 A field must be used only for the purpose for which it is defined. (E.g.) in the third
address line column meant for long addresses, only the third line of the
address should be entered, not phone or fax numbers.
11. Clarity:
 A data element may possess all the other characteristics of quality data, but if the
users do not understand its meaning clearly, the data is of little value to them.
12. Timely:
 The users determine the timeliness of the data. (E.g.) updating the customer
database on a daily basis.
13. Usefulness:
 Every data element in the data warehouse must satisfy some requirements of the
collection of users.
14. Adherence to Data Integrity Rules:
 The data stored in the relational databases of the source system must adhere to
entity integrity and referential integrity rules.
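A few of the dimensions above (domain integrity, completeness and duplication) expressed as simple pandas checks; the customer data is invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "gender":  ["Male", "Female", "X", "Male"],
    "name":    ["Asha", None, "Ravi", "Meena"],
})

# Domain integrity: values must fall within the set of allowable values.
bad_domain = customers[~customers["gender"].isin(["Male", "Female"])]

# Completeness: no missing values for a mandatory attribute.
missing_names = customers[customers["name"].isna()]

# Duplication: duplicate keys must be identified and resolved.
duplicate_keys = customers[customers["cust_id"].duplicated(keep=False)]

print(len(bad_domain), len(missing_names), len(duplicate_keys))
```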
Data Quality Challenges (Problems) in DW
[Chart: Data Warehouse Challenges – percentages reported for database performance, management expectations, business rules, data transformation, user expectations, data modeling and data quality.]
Data Quality Framework
The framework involves both IT professionals and user representatives, and covers the following steps –
• Establish a Data Quality Steering Committee.
• Agree on a suitable data quality framework.
• Identify the business functions affected most by bad data.
• Institute data quality policy and standards.
• Select high-impact data elements and determine priorities.
• Define quality measurement parameters and benchmarks.
• Plan and execute data cleansing for high-impact data elements (initial data cleansing effort).
• Plan and execute data cleansing for other, less severe elements (ongoing data cleansing effort).
Data Quality – Participants and Roles
Data quality initiatives involve the following participants –
• Data Consumer (User Dept.)
• Data Producer (User Dept.)
• Data Expert (User Dept.)
• Data Integrity Specialist (User Dept.)
• Data Correction Authority (IT Dept.)
• Data Consistency Expert (IT Dept.)
• Data Policy Administrator (IT Dept.)
The responsibilities for the roles are -
1. Data Consumers:
Uses the data warehouse for queries, reports and analysis. Establishes the
acceptable levels of data quality.
2. Data Producer:
Responsible for the quality of data input into the source systems.
3. Data Expert:
Expert in the subject matter and the data itself of the source systems.
Responsible for identifying pollution in the source systems.
4. Data Policy Administrator:
Ultimately responsible for resolving data corruption as data is
transformed and moved into the data warehouse.
The responsibilities for the roles are -
5. Data Integrity Specialist:
Responsible for ensuring that the data in the source systems conforms to
the business rules.
6. Data Correction Authority:
Responsible for actually applying the data cleansing techniques through
the use of tools or in-house programs.
7. Data Consistency Expert:
Responsible for ensuring that all data within the data warehouse (various
data marts) are fully synchronized.
Data Quality Tools
The useful data quality tools are -
1. Categories of Data Cleansing Tools:
They assist in two ways –
 Data error discovery tools work on the source data to identify inaccuracies and
inconsistencies.
 Data correction tools help fix the corrupt data, using a series of algorithms
to parse, transform, match, consolidate and correct the data.
2. Error Discovery Features:
The following are error discovery functions that data cleansing tools are
capable of performing (a small sketch of one such check follows the list) –
 Quickly and easily identify duplicate records.
 Identify data items whose values are outside the range of legal domain values.
 Find inconsistent data.
 Check for range of allowable values.
 Detect inconsistencies among data items from different sources.
 Allow users to identify and quantify data quality problems.
 Monitor trends in data quality over time.
 Report to users on the quality of data used for analysis.
 Reconcile problems of RDBMS referential integrity.
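A small sketch of one error discovery check from the list above: detecting inconsistent values for the same data item across two source systems. The data frames and columns are illustrative.

```python
import pandas as pd

crm     = pd.DataFrame({"cust_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
billing = pd.DataFrame({"cust_id": [1, 2], "email": ["a@x.com", "b@y.com"]})

# Detect inconsistencies among data items coming from different sources.
compared = crm.merge(billing, on="cust_id", suffixes=("_crm", "_billing"))
inconsistent = compared[compared["email_crm"] != compared["email_billing"]]
print(inconsistent)   # rows whose email differs between the two source systems
```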
3. Data Correction features:
The following list describes the typical error correction functions that data cleansing
tools are capable of performing –
 Normalize inconsistent data.
 Improve merging of data from dissimilar data sources.
 Group and relate customer records belonging to the same household.
 Provide measurements of data quality.
 Standardize data elements to common formats.
 Validate for allowable values.
4. DBMS for Quality Control:
The database management system can be used as a tool for data quality control in many
ways; in particular, an RDBMS can prevent several types of errors from creeping into the data
warehouse (a small sketch follows the list) –
 Domain integrity – provide domain value edits. Prevent entry of data if entered
data value is outside the defined limits of value.
 Update security – prevent unauthorized updates to the databases, which stops
unauthorized users from updating data in an incorrect way.
 Entity Integrity Checking – ensure that duplicate records with same primary key
value are not entered.
 Minimize missing values – ensure that nulls are not allowed in mandatory fields.
 Referential Integrity Checking – ensure that relationships based on foreign keys
are preserved.
 Conformance to Business rules – use trigger programs and stored procedures to
enforce business rules.
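A minimal sketch of how these RDBMS features enforce quality, using SQLite from Python; the product and sale tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # turn on referential integrity checking in SQLite

conn.execute("""
    CREATE TABLE product (
        product_code TEXT PRIMARY KEY                   -- entity integrity: no duplicate keys
    )
""")
conn.execute("""
    CREATE TABLE sale (
        sale_id      INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL                      -- minimize missing values
                     REFERENCES product(product_code),  -- referential integrity checking
        sale_price   REAL CHECK (sale_price >= 0)       -- domain integrity: value edits
    )
""")

conn.execute("INSERT INTO product VALUES ('P100')")
conn.execute("INSERT INTO sale VALUES (1, 'P100', 49.99)")   # passes all of the checks above
```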
Benefits of Data Quality
Some specific areas where data quality provides definite benefits are -
 Analysis with timely information.
 Better Customer Service.
 Newer opportunities.
 Reduced costs and Risks.
 Improved Productivity.
 Reliable Strategic Decision Making.
Data Warehouse Design Reviews
One of the most effective techniques for ensuring quality in the
operational environment is the design review.
Errors can be detected and resolved prior to coding through a design
review.
The cost benefit of identifying errors early in the development life cycle
is enormous.
Design review is usually done on completion of the physical design of an
application.
Some of the issues around operational design review are follows –
Transaction performance
System availability
Project readiness
 Batch window adequacy
 Capacity
 User requirements satisfaction
Views of Data Warehouse Design
The four views regarding a data warehouse design must be considered –
1. Top-Down View:
This allows the selection of relevant information necessary for data
warehouse.
This information matches current and future business needs.
2. Data Source View:
It exposes the information being captured, stored and managed by
operational systems.
This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source
tables.
Data sources are often modeled by traditional data modeling techniques,
such as entity – relationship model.
Views of Data Warehouse Design
3. Data Warehouse View:
This view includes fact tables and dimension tables.
It represents the information that is stored inside the data
warehouse, including pre-calculated totals and counts.
Information regarding the source, date and time of origin, added
to provide historical context.
4. Business Query View:
This view is the data perspective in the data warehouse from the
end-user’s view point.
Data Warehouse Design Approaches
A data warehouse can be built using three approaches -
a) The top-down approach:
It starts with the overall design and planning.
It is useful in cases where the technology is mature and well
known, and where the business problems that must be solved
are clear and well understood.
The process begins with an ETL process working from external
data sources.
In the top-down model, integration between the data warehouse
and the data marts is automatic, as long as the data marts are
maintained as subsets of the data warehouse.
Data Warehouse Design Approaches
b) The bottom-up approach:
The bottom-up approach starts with experiments and prototypes.
This is useful in the early stage of business modeling and
technology development.
It allows an organization to move forward at considerably less
expense and to evaluate the benefits of the technology before
making significant commitments.
This approach is to construct the data warehouse incrementally
over time from independently developed data marts.
In this approach, data flows from sources into data marts, then into
the data warehouse.
Data Warehouse Design Approaches
c) The Combined approach:
In this approach, both the top-down approach and bottom-up
approaches are exploited.
In combined approach, an organization can exploit the planned
and strategic nature of top-down approach while retaining the
rapid implementation and opportunistic application of the
bottom-up approach.
Data Warehouse Design Process
The general data warehouse design process involves the following steps -
Step 1: Choosing the appropriate business process:
 Based on needs and requirements, there are two types of models: the data
warehouse model and the data mart model.
 A data warehouse model is chosen if the business process is organizational
and spans many complex object collections.
 A data mart model is chosen if the business process is departmental and
focuses on the analysis of one particular process.
Step 2: Choosing the grain of the business process:
 The grain is the fundamental, atomic level of data represented in the fact table
for the chosen business process.
(E.g.) individual snapshots, individual transactions, etc.
Data Warehouse Design Process
Step 3: Choosing the dimensions:
It includes selecting the various dimensions such as time, item,
status, etc., which need to be applied to each fact table record.
Step 4: Choosing the measures:
It includes selecting the various measures such as items_sold,
euros_sold, etc., which fill up each fact table
record.
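The four steps can be sketched for a simple sales process as SQLite table definitions; the grain here is one row per item per day, and all table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 3 – dimensions chosen for the business process: time and item.
conn.execute("CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
conn.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT)")

# Step 2 – fact table at the chosen grain (one row per item per day), with
# Step 4 – the chosen measures items_sold and euros_sold.
conn.execute("""
    CREATE TABLE fact_sales (
        time_key   INTEGER REFERENCES dim_time(time_key),
        item_key   INTEGER REFERENCES dim_item(item_key),
        items_sold INTEGER,
        euros_sold REAL
    )
""")
```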
Testing & Monitoring the Data Warehouse
Definition:
Data Warehouse testing is the process of building and
executing comprehensive test cases to ensure that data in a
warehouse has integrity and is reliable, accurate and consistent
with the organization’s data framework.
Testing is very important for data warehouse systems for data
validation and to make them work correctly and efficiently.
Data Warehouse Testing is a series of Verification and
Validation activities performed to check for the quality and
accuracy of the Data Warehouse and its contents.
There are five basic levels of testing performed on a data warehouse –
1. Unit Testing:
This type of testing is performed at the developer’s end.
In unit testing, each unit / component of the modules is tested separately.
Each module of the data warehouse (i.e. program, SQL
script, procedure, Unix shell script) is validated and tested.
2. Integration Testing:
In this type of testing, the various individual units / modules of the
application are brought together or combined and then tested against
a number of inputs.
It is performed to detect the fault in integrated modules and to test
whether the various components are performing well after integration.
3. System Testing:
 System testing is the form of testing that validates and tests the whole data
warehouse application.
 This type of testing is performed by the technical testing team.
 This test is conducted after the developer’s team performs unit testing, and its
main purpose is to check whether the entire system is working
together or not.
4. Acceptance Testing:
 To verify that the entire solution meets the business requirements and
successfully supports the business processes from a user’s perspective.
5. System Assurance Testing:
 To ensure and verify the operational readiness of the system in a production
environment.
 This is also referred to as the warranty period coverage.
Challenges of data warehouse testing are -
 Data selection from multiple sources, and the analysis that follows,
pose a great challenge.
 Because of the volume and complexity of the data, certain testing strategies are
time consuming.
 ETL testing requires strong SQL (including Hive SQL) skills, so it poses challenges for
testers who have limited SQL skills.
 Redundant data in a data warehouse & inconsistent and
inaccurate reports.
Data Warehouse Testing Process
Testing a data warehouse is a multi-step process that involves
activities like identifying business requirements, designing test
cases, setting up a test framework, executing the test cases and
validating data.
The steps for testing process are –
Step 1: Identify the various entry points:
As loading data into a warehouse involves multiple stages, it’s
essential to find out the various entry points to test data at each of
those stages.
If testing is done only at the destination, it can be confusing when
errors are found as it becomes more difficult to determine the root
cause.
Step 2: Prepare the required collaterals:
Two fundamental collaterals required for the testing process are database
schema representation and a mapping document.
The mapping document is usually a spreadsheet which maps each
column in the source database to the destination database.
A data integration solution can help generate the mapping document,
which is then used as an input to design test cases.
Step 3: Design an elastic, automated and integrated testing framework:
ETL is not a one-time activity. While some data is loaded all at once and
some through batches, new updates may trickle in
through streaming queues.
A testing framework design has to be generic and architecturally flexible
to accommodate new and diverse data sources and types, higher volumes,
and the ability to work seamlessly with cloud and on-premises systems.
Integrating the test framework with an automated data solution
(that contains features as discussed in the previous section)
increases the efficiency of the testing process.
Step 4: Adopt a comprehensive testing approach:
The testing framework needs to aim for 100% coverage of the
data warehousing process.
It is important to design multiple testing approaches such as unit,
integration, functional, and performance testing.
The data itself has to be scrutinized for many checks that
includes looking for duplicates, matching record counts,
completeness, accuracy, loss of data, and correctness of
transformation.
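A minimal sketch of a few such checks, comparing a staging extract against the loaded warehouse table; the data frames are invented for illustration.

```python
import pandas as pd

staged = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
loaded = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Matching record counts: no rows lost or duplicated during the load.
assert len(staged) == len(loaded), "row count mismatch between staging and warehouse"

# Duplicate check: the business key must be unique in the warehouse table.
assert not loaded["order_id"].duplicated().any(), "duplicate keys loaded"

# Correctness of transformation: totals must reconcile between the two layers.
assert staged["amount"].sum() == loaded["amount"].sum(), "amount totals do not reconcile"

print("all warehouse validation checks passed")
```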
Testing the Operational Environment
There are a number of aspects that need to be tested, as below –
1. Security:
 A separate security document is required for security testing. This document
contains a list of disallowed operations, and tests are devised for each.
2. Scheduler:
 Scheduling software is required to control the daily operations of a data
warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, which will need the
scheduler to control overnight processing and the management of
aggregations.
3. Disk Configuration:
 Disk configuration also needs to be tested to identify I/O bottlenecks. The test
should be performed multiple times with different settings.
4. Management Tools:
It is required to test all the management tools during system
testing. Here is the list of tools that need to be tested.
• Event manager
• System manager
• Database manager
• Configuration manager
• Backup recovery manager
Testing the Database
The database is tested in following three ways –
1. Testing the database manager and monitoring tools:
 To test the database manager and the monitoring tools, they should be used in
the creation, running, and management of test database.
2. Testing database features:
 Here is the list of features that we have to test −
– Querying in parallel
– Create index in parallel
– Data load in parallel
3. Testing database performance:
 Query execution plays a very important role in data warehouse performance
measures. There are sets of fixed queries that need to be run regularly and
they should be tested.
Data Warehouse Monitoring
Data warehouse monitoring helps to understand how
the data warehouse is performing.
Some of the several reasons for monitoring are –
It ensures top performance.
It ensures excellent usability.
It ensures the business can run efficiently.
It prevents security issues.
It ensures governance and compliance.
ETL Process & Data Warehouse Fundamentals
Ad

More Related Content

What's hot (20)

Data Marts.pptx
Data Marts.pptxData Marts.pptx
Data Marts.pptx
DimpyJindal4
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
jeshocarme
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
idnats
 
Warehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemasWarehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemas
Eric Matthews
 
Dimensional Modelling
Dimensional ModellingDimensional Modelling
Dimensional Modelling
Prithwis Mukerjee
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Etl techniques
Etl techniquesEtl techniques
Etl techniques
mahezabeenIlkal
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 
Star schema
Star schemaStar schema
Star schema
Chandanapriya Sathavalli
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
Sunita Sahu
 
Data warehousing
Data warehousingData warehousing
Data warehousing
Juhi Mahajan
 
Data warehouse
Data warehouse Data warehouse
Data warehouse
Yogendra Uikey
 
DATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing PlanDATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing Plan
Madhu Nepal
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
Dr-Dipali Meher
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
Kiran kumar
 
Data Warehouse 101
Data Warehouse 101Data Warehouse 101
Data Warehouse 101
PanaEk Warawit
 
Data warehouse
Data warehouseData warehouse
Data warehouse
Sonali Chawla
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse
Lesa Cote
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
jeshocarme
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
idnats
 
Warehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemasWarehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemas
Eric Matthews
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
Sunita Sahu
 
DATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing PlanDATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing Plan
Madhu Nepal
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
Dr-Dipali Meher
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
Kiran kumar
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse
Lesa Cote
 

Similar to ETL Process & Data Warehouse Fundamentals (20)

definign etl process extract transform load.ppt
definign etl process extract transform load.pptdefinign etl process extract transform load.ppt
definign etl process extract transform load.ppt
smritiibansal
 
Building the DW - ETL
Building the DW - ETLBuilding the DW - ETL
Building the DW - ETL
ganblues
 
Lecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptxLecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptx
RehmahAtugonza
 
An Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLAn Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETL
idescitation
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATANEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
csandit
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
cscpconf
 
Database migration
Database migrationDatabase migration
Database migration
Sankar Patnaik
 
GROPSIKS.pptx
GROPSIKS.pptxGROPSIKS.pptx
GROPSIKS.pptx
avanceregine312
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
Samir Sabry
 
Get started with data migration
Get started with data migrationGet started with data migration
Get started with data migration
Thinqloud
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
Nagendra K
 
ETL_Methodology.pptx
ETL_Methodology.pptxETL_Methodology.pptx
ETL_Methodology.pptx
yogeshsuryawanshi47
 
ETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdfETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdf
oswahernan2203
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
Massimo Cenci
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
ashokvirtual
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
ijcatr04081001
ijcatr04081001ijcatr04081001
ijcatr04081001
reagan muriithi
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
definign etl process extract transform load.ppt
definign etl process extract transform load.pptdefinign etl process extract transform load.ppt
definign etl process extract transform load.ppt
smritiibansal
 
Building the DW - ETL
Building the DW - ETLBuilding the DW - ETL
Building the DW - ETL
ganblues
 
Lecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptxLecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptx
RehmahAtugonza
 
An Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLAn Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETL
idescitation
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATANEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
csandit
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
cscpconf
 
Get started with data migration
Get started with data migrationGet started with data migration
Get started with data migration
Thinqloud
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
Nagendra K
 
ETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdfETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdf
oswahernan2203
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
Massimo Cenci
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
ashokvirtual
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Ad

More from SOMASUNDARAM T (20)

MSM - UNIT 5.pdf
MSM - UNIT 5.pdfMSM - UNIT 5.pdf
MSM - UNIT 5.pdf
SOMASUNDARAM T
 
MSM - UNIT 4.pdf
MSM - UNIT 4.pdfMSM - UNIT 4.pdf
MSM - UNIT 4.pdf
SOMASUNDARAM T
 
MSM - UNIT 3.pdf
MSM - UNIT 3.pdfMSM - UNIT 3.pdf
MSM - UNIT 3.pdf
SOMASUNDARAM T
 
MSM - UNIT 2.pdf
MSM - UNIT 2.pdfMSM - UNIT 2.pdf
MSM - UNIT 2.pdf
SOMASUNDARAM T
 
MSM - UNIT 1.pdf
MSM - UNIT 1.pdfMSM - UNIT 1.pdf
MSM - UNIT 1.pdf
SOMASUNDARAM T
 
ITB - UNIT 5.pdf
ITB - UNIT 5.pdfITB - UNIT 5.pdf
ITB - UNIT 5.pdf
SOMASUNDARAM T
 
ITB - UNIT 4.pdf
ITB - UNIT 4.pdfITB - UNIT 4.pdf
ITB - UNIT 4.pdf
SOMASUNDARAM T
 
ITB - UNIT 3.pdf
ITB - UNIT 3.pdfITB - UNIT 3.pdf
ITB - UNIT 3.pdf
SOMASUNDARAM T
 
ITB - UNIT 2.pdf
ITB - UNIT 2.pdfITB - UNIT 2.pdf
ITB - UNIT 2.pdf
SOMASUNDARAM T
 
ITB - UNIT 1.pdf
ITB - UNIT 1.pdfITB - UNIT 1.pdf
ITB - UNIT 1.pdf
SOMASUNDARAM T
 
Data Mining
Data MiningData Mining
Data Mining
SOMASUNDARAM T
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
SOMASUNDARAM T
 
Organizing and Staffing
Organizing and StaffingOrganizing and Staffing
Organizing and Staffing
SOMASUNDARAM T
 
Directing and Controlling
Directing and ControllingDirecting and Controlling
Directing and Controlling
SOMASUNDARAM T
 
Data Analysis & Interpretation and Report Writing
Data Analysis & Interpretation and Report WritingData Analysis & Interpretation and Report Writing
Data Analysis & Interpretation and Report Writing
SOMASUNDARAM T
 
Computer Organization
Computer OrganizationComputer Organization
Computer Organization
SOMASUNDARAM T
 
Digital Fluency
Digital FluencyDigital Fluency
Digital Fluency
SOMASUNDARAM T
 
Planning and Objectives
Planning and ObjectivesPlanning and Objectives
Planning and Objectives
SOMASUNDARAM T
 
Management
ManagementManagement
Management
SOMASUNDARAM T
 
Business Management and Practices
Business Management and PracticesBusiness Management and Practices
Business Management and Practices
SOMASUNDARAM T
 
Ad

Recently uploaded (20)

apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
Political History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptxPolitical History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptx
Arya Mahila P. G. College, Banaras Hindu University, Varanasi, India.
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Sinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_NameSinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_Name
keshanf79
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 
How to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POSHow to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POS
Celine George
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar RabbiPresentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Md Shaifullar Rabbi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Presentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem KayaPresentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem Kaya
MIPLM
 
Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025
Mebane Rash
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 
apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Sinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_NameSinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_Name
keshanf79
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 
How to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POSHow to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POS
Celine George
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar RabbiPresentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Md Shaifullar Rabbi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Presentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem KayaPresentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem Kaya
MIPLM
 
Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025
Mebane Rash
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 

ETL Process & Data Warehouse Fundamentals

  • 1. UNIT 2 Mr.T.Somasundaram Assistant Professor Department Of Management Kristu Jayanti College (Autonomous), Bengaluru
  • 2. UNIT 2 ETL Process and Maintenance of Data Warehouse Data Extraction, Data Transformation, Data loading, Data Quality, Data Warehouse design reviews, Testing and Monitoring the data warehouse.
  • 3. • ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. • It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, finally loads it into the Data warehouse system. • It is data integration process that combines data from multiple data sources into a single, consistent data store into a data warehouse. • This may contains customize the tool to suit the need of the enterprises. (E.g.) ETL tool sets for long-term analysis & usage of data in banking, insurance claims, retail sales history, etc. ETL Process
  • 6. • Google BigQuery. • Amazon Redshift. • Informatica – PowerCenter. • IBM – Infosphere Information Server. • Oracle Data Integrator. • SQL Server Integration Services. ETL Tools
  • 7. 1) Scalability – unlimited scalability are available at the click of a button. (i.e.) capacity to be changed in size. 2) Simplicity – it saves time, resources and avoids lot of complexity. 3) Out of the box – open source ETL require customization and cloud-based ETL requires integration. 4) Compliance – it finds easy way to skip complicated and risky compliance setups. 5) Long-term costs – it is cheaper with open source ETL tools but will cost make it for long run. Benefits of ETL Tools
  • 8. Extraction (E): The first step is extraction of data, source system’s data is accessed first and prepared further for processing and extracting required values. It is extracted in various formats like relational databases, No SQL, XML and flat files, etc. It is important to store extract data in staging area, not directly into data warehouse as it may cause damage and rollback will be much difficult. Phase (Steps) of ETL Process
  • 9. Extraction has three approach - a) Update Notification – the data is changed or altered in the source system, it will notify users about the change. b) Incremental Extract – many systems are incapable of providing notification but are efficient enough to track down the changes that made to source data. c) Full extract – the whole data is extracted, when system is neither able to notify nor able to track down the changes. Old copy of data is maintained to identify the change. (E.g.) phone numbers, Email conversion to standard form, validation of address fields, etc. Phase (Steps) of ETL Process
  • 10. The data extraction issues are - Source Identification – identify source applications and source structures. Method of extraction – for each data source, define whether the extraction process is manual or tool-based. Extraction frequency – for each data source, establish how frequently the data extraction must be done (daily, weekly, quartely, etc.) Time Window – for each data source, denote the time window for the extraction process. Job sequencing – determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully. Exception handling – determine how to handle input records that can’t be extracted. Phase (Steps) of ETL Process
  • 11. The following are the guidelines adopted for extracts as best practices - The extract processing should identify changes since the last extract. Interface record types should be defined for the extraction based on entities in data warehouse model. (E.g.) Client information extracted from source may be categorized into person attributes, contact point information, etc. When changes sent to data warehouse, all current attributes for changed entity should be also sent. Any interface record should be categorized as – Records which have been added to operational database since the last extract. Records which have been deleted from operational database since the last extract. Phase (Steps) of ETL Process
  • 12. Transformation (T): The second step of ETL process is transformation. A set of rules or functions are applied on extracted data to convert it into a single standard format. It includes dimension conversion, aggregation, joining, derivation and calculations of new values. Phase (Steps) of ETL Process
  • 13. Transformation involves the following processes or tasks – a) Filtering – loading only certain attributes into the data warehouse. b) Cleaning – filling up NULL values with some default values, mapping U.S.A, United States, and America into USA, etc. c) Joining – joining multiple attributes into one. d) Splitting – splitting a single attribute into multiple attributes. e) Sorting – sorting tuples on the basis of some attribute (generally the key attribute). Phase (Steps) of ETL Process
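A minimal sketch of several of these tasks (filtering, cleaning, splitting and sorting), assuming customer records arrive as Python dicts with hypothetical id, name, country and revenue fields; joining would combine attributes in the same style:

```python
def transform(records):
    """Apply basic transformation tasks to a list of customer dicts."""
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    out = []
    for rec in records:
        # Filtering: keep only the attributes the warehouse needs.
        row = {k: rec.get(k) for k in ("id", "name", "country", "revenue")}
        # Cleaning: fill NULLs with defaults and standardise country values.
        row["revenue"] = row["revenue"] if row["revenue"] is not None else 0.0
        row["country"] = country_map.get(row["country"], row["country"])
        # Splitting: break the single "name" attribute into first/last name.
        first, _, last = (row.pop("name") or "").partition(" ")
        row["first_name"], row["last_name"] = first, last
        out.append(row)
    # Sorting: order the tuples on the key attribute.
    return sorted(out, key=lambda r: r["id"])
```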
  • 14. Major Data Transformation Types: a) Format Revisions – these include changes to the data types and lengths of individual fields. (E.g.) product package type indicated by codes and names, where the fields are numeric and text data types. b) Decoding of Fields – this is common when dealing with multiple source systems in which the same data items are described by different field values. (E.g.) the coding for Male and Female may be 1 and 2 in one source system and M and F in another source system. c) Splitting of Single Fields – earlier systems stored name, address, city and state data together in a single field, but the individual components need to be stored in separate fields in the data warehouse. Phase (Steps) of ETL Process
  • 15. d) Merging of Information – this does not mean the merging of several fields to create a single field of data. (E.g.) information about a product may come from different data sources, with product code and package type coming from another data source. Merging of information denotes the combination of product code, description and package type into a single entity. e) Character Set Conversion – this is the conversion of character sets to an agreed standard character set for text data in the data warehouse. f) Conversion of Units of Measurement – metrics based on overseas operations must be converted so that the numbers are all in one standard unit of measurement. g) Date / Time Conversion – this is the representation of date and time in standard formats. Phase (Steps) of ETL Process
  • 16. h) Summarization – this type of transformation creates summaries to be loaded into the data warehouse instead of loading the most granular level of data. (E.g.) it is not necessary to store every single credit card transaction in the data warehouse; instead, summarize the daily transactions for each credit card and store the summary data. i) Key Restructuring – while extracting data from input sources, look at the primary keys of the extracted records and come up with keys for the fact and dimension tables based on the keys in the extracted records. j) Deduplication – in most companies, customer files have several records for the same customer, often created by mistake. The aim is to maintain one record per customer and link all duplicates in the source systems to that single record. Phase (Steps) of ETL Process
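The following sketch illustrates three of the transformation types above: decoding of fields, summarization and deduplication. The field names (card_no, source_id, postcode, etc.) are hypothetical, and real deduplication would use far more sophisticated matching rules:

```python
from collections import defaultdict

GENDER_CODES = {"1": "M", "2": "F", "M": "M", "F": "F"}  # decoding of fields

def decode_gender(value):
    """Decode source-specific gender codes to a single warehouse standard."""
    return GENDER_CODES.get(str(value).strip().upper(), "U")

def summarize_daily(transactions):
    """Summarization: roll card transactions up to one row per card per day."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["card_no"], t["date"])] += t["amount"]
    return [
        {"card_no": c, "date": d, "daily_total": amt}
        for (c, d), amt in totals.items()
    ]

def deduplicate(customers):
    """Deduplication: keep one record per customer, cross-referencing duplicates."""
    survivors, xref = {}, {}
    for c in customers:
        key = (c["name"].lower(), c["postcode"])
        if key not in survivors:
            survivors[key] = c
        xref[c["source_id"]] = survivors[key]["source_id"]
    return list(survivors.values()), xref
```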
  • 17. Loading (L): • The third and final step of the ETL process is loading. • Transformed data is finally loaded into the data warehouse. • Data is loaded into the data warehouse frequently, but at regular intervals. • Indexes and constraints previously applied to the data need to be disabled before loading commences. • The rate and period of loading depend on requirements and vary from system to system. • During the loads, the data warehouse has to be offline. • A time period should be identified when loads may be scheduled without affecting data warehouse users. • Consider dividing the whole load process into smaller chunks and populating a few files at a time. Phase (Steps) of ETL Process
  • 18. Mode of Loading (L): a) Load – if the target table already exists and contains data, the load process wipes out the existing data and applies the data from the incoming file. If the table is empty before loading, the load process applies the data from the incoming file. b) Append – this is an extension of the load. If data already exists in the table, the append process unconditionally adds the incoming data, preserving the existing data in the target table. Incoming duplicate records may be rejected during the append process. c) Destructive Merge – in this mode, apply the incoming data to the target data. If an incoming record matches the key of an existing record, update the matching target record; if not, add the incoming record to the target table. Phase (Steps) of ETL Process
  • 19. Mode of Loading (L): d) Constructive Merge – this mode is the opposite of the destructive merge. (i.e.) if an incoming record matches the key of an existing record, leave the existing record, add the incoming record and mark the added record as superseding the old record. e) Initial Load – loads the whole data warehouse in a single run. It is possible to split the load into separate subloads and run them as single loads. If more than one run is needed to create a single table, it is scheduled to run over several days. f) Incremental Loads – these are the applications of ongoing changes from the source systems. They need a method to preserve the periodic nature of the changes in the data warehouse. Constructive merge mode is an appropriate method for incremental loads. g) Full Refresh – this application of data involves periodically rewriting the entire data warehouse. Sometimes partial refreshes rewrite only specific tables, but partial refreshes are rare because the dimension tables are intricately tied to the fact table. Phase (Steps) of ETL Process
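A minimal sketch contrasting destructive and constructive merge, assuming the target is held as an in-memory dict keyed on the record key; a real warehouse load would typically express the same logic with SQL MERGE/UPSERT statements or bulk-load utilities:

```python
def destructive_merge(target, incoming):
    """Destructive merge: update matching keys in place, insert new keys."""
    for key, record in incoming.items():
        target[key] = record          # update if present, add if not
    return target

def constructive_merge(target, incoming):
    """Constructive merge: keep the old record and add the new one,
    marking the new record as superseding the old."""
    for key, record in incoming.items():
        versions = target.setdefault(key, [])
        if versions:
            versions[-1]["current"] = False   # old record is superseded
        versions.append({**record, "current": True})
    return target
```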
  • 20. Data Quality (DQ) in Data Warehouse What is Data Quality? To an IT professional, data quality is quite often associated with the accuracy of a data element. (E.g.) Consider Customer as an entity with attributes such as Customer Name, Customer Address, Customer State, Customer Mobile No., etc. Data accuracy means the attribute values of the customer entity correctly describe that particular customer. (i.e.) if data is fit for the purpose for which it is intended, then such data has quality. Data quality is related to the usage of the data item as defined by the users. Data quality in operational systems requires that database records conform to field validation edits; this contributes to data quality, but single-field edits alone do not constitute data quality.
  • 21. Definition: Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system. • Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole. (E.g.) In an online ordering system, while entering data about customers in the order entry application, we may collect demographics for each customer. • Sometimes these demographic fields are not needed for order entry and receive little attention. • When such data is later accessed as part of the integrated whole, it lacks data quality. (E.g.) Some of a customer's information may or may not be important when filling in an application form (especially in banking processes).
  • 22. Data Accuracy Vs Data Quality – Difference between Data Accuracy and Data Quality:
Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. / Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
Data Accuracy: the data element is defined in terms of database technology. / Data Quality: a wider concept grounded in the specific business of the company.
Data Accuracy: the data element conforms to validation constraints. / Data Quality: relates not just to single data elements but to the system as a whole.
Data Accuracy: individual data items have the correct data types. / Data Quality: the form and content of data elements are consistent across the whole system.
Data Accuracy: traditionally relates to operational systems. / Data Quality: essentially needed in a corporate-wide data warehouse for business users.
  • 23. Characteristics (Dimensions) of Data Quality The data quality dimensions are – 1. Accuracy: the value stored in the system for a data element is the right value for that occurrence of the data element. (E.g.) getting the correct customer address. 2. Domain Integrity: the data value of an attribute falls within the range of allowable, defined values. (E.g.) Male and Female for the gender data element. 3. Data Type: the value for a data attribute is actually stored as the data type defined for the attribute. (E.g.) the Name field is defined as 'text'. 4. Consistency: the form and content of a data field are the same across multiple source systems. (E.g.) the product code for Product A is 1234.
  • 24. 5. Redundancy: the same data must not be stored in more than one place in a system. 6. Completeness: there are no missing values for a given attribute in the system. 7. Duplication: duplication of records in a system is completely resolved. (E.g.) duplicate records are identified and a cross-reference is created. 8. Conformance to Business Rules: the values of each data item adhere to prescribed business rules. (E.g.) in an auction system, the sale price can't be less than the reserve price. 9. Structural Definiteness: wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. (E.g.) names are divided into first name, middle name and last name, which reduces missing values.
  • 25. 10. Data Anomaly: a field must be used only for the purpose for which it is defined. (E.g.) the third line of the address column should contain the third line of a long address, not phone or fax numbers. 11. Clarity: a data element may possess all the other characteristics of quality data, but it is of little use if users do not understand its meaning clearly. 12. Timeliness: the users determine the timeliness of the data. (E.g.) updating the data in the customer database on a daily basis. 13. Usefulness: every data element in the data warehouse must satisfy some requirement of the collection of users. 14. Adherence to Data Integrity Rules: the data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules.
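To make the dimensions concrete, here is a minimal sketch of a few of these checks (completeness, domain integrity, data type and duplication) over customer records held as Python dicts; the field names are illustrative assumptions:

```python
def check_quality(rows):
    """Count data quality issues across a list of customer dicts."""
    issues = {"missing_values": 0, "bad_domain": 0, "bad_type": 0, "duplicates": 0}
    seen_keys = set()
    for r in rows:
        # Completeness: no missing values for mandatory attributes.
        if not r.get("name") or not r.get("gender"):
            issues["missing_values"] += 1
        # Domain integrity: gender must fall in the allowed set of values.
        if r.get("gender") not in ("M", "F"):
            issues["bad_domain"] += 1
        # Data type: the name must actually be stored as text.
        if not isinstance(r.get("name"), str):
            issues["bad_type"] += 1
        # Duplication: the same customer key must not appear twice.
        if r.get("customer_id") in seen_keys:
            issues["duplicates"] += 1
        seen_keys.add(r.get("customer_id"))
    return issues
```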
  • 26. Data Quality Challenges (Problems) in DW – [Bar chart: Data Warehouse Challenges, showing percentages (0%–50%) for data quality, data modeling, user expectations, data transformation, business rules, management expectations and database performance.]
  • 27. Data Quality Framework – [Diagram: a data quality framework involving IT professionals and user representatives – establish a Data Quality Steering Committee; agree on a suitable data quality framework; identify the business functions affected most by bad data; institute data quality policy and standards; select high-impact data elements and determine priorities; define quality measurement parameters and benchmarks; plan and execute data cleansing for high-impact data elements (initial data cleansing efforts); plan and execute data cleansing for other, less severe elements (ongoing data cleansing efforts).]
  • 28. Data Quality – Participants and Roles – [Diagram: participants in data quality initiatives – Data Consumer (User Dept.), Data Producer (User Dept.), Data Expert (User Dept.), Data Integrity Specialist (User Dept.), Data Correction Authority (IT Dept.), Data Consistency Expert (IT Dept.), Data Policy Administrator (IT Dept.).]
  • 29. The responsibilities for the roles are - 1. Data Consumers: Uses the data warehouse for queries, reports and analysis. Establishes the acceptable levels of data quality. 2. Data Producer: Responsible for the quality of data input into the source systems. 3. Data Expert: Expert in the subject matter and the data itself of the source systems. Responsible for identifying pollution in the source systems. 4. Data Policy Administrator: Ultimately responsible for resolving data corruption as data is transformed and moved into the data warehouse.
  • 30. The responsibilities for the roles are - 5. Data Integrity Specialist: Responsible for ensuring that the data in the source systems conforms to the business rules. 6. Data Correction Authority: Responsible for actually applying the data cleansing techniques through the use of tools or in-house programs. 7. Data Consistency Expert: Responsible for ensuring that all data within the data warehouse (various data marts) are fully synchronized.
  • 31. Data Quality Tools The useful data quality tools are – 1. Categories of Data Cleansing Tools: they assist in two ways – • Data error discovery tools work on the source data to identify inaccuracies and inconsistencies. • Data correction tools help fix the corrupt data, using a series of algorithms to parse, transform, match, consolidate and correct the data. 2. Error Discovery Features: the following is a list of error discovery functions that data cleansing tools are capable of performing – • Quickly and easily identify duplicate records. • Identify data items whose values are outside the range of legal domain values. • Find inconsistent data.
  • 32. • Check for ranges of allowable values. • Detect inconsistencies among data items from different sources. • Allow users to identify and quantify data quality problems. • Monitor trends in data quality over time. • Report to users on the quality of data used for analysis. • Reconcile problems of RDBMS referential integrity. 3. Data Correction Features: the following list describes the typical error correction functions that data cleansing tools are capable of performing – • Normalize inconsistent data. • Improve merging of data from dissimilar data sources. • Group and relate customer records belonging to the same household. • Provide measurements of data quality. • Standardize data elements to common formats. • Validate for allowable values.
  • 33. 4. DBMS for Quality Control: the database management system can be used as a tool for data quality control in many ways; in particular, an RDBMS prevents several types of errors from creeping into the data warehouse – • Domain integrity – provide domain value edits; prevent entry of data if the entered value is outside the defined limits. • Update security – prevent unauthorized updates to the databases, which stops unauthorized users from updating data in an incorrect way. • Entity integrity checking – ensure that duplicate records with the same primary key value are not entered. • Minimize missing values – ensure that nulls are not allowed in mandatory fields. • Referential integrity checking – ensure that relationships based on foreign keys are preserved. • Conformance to business rules – use trigger programs and stored procedures to enforce business rules.
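A minimal illustration of these DBMS-level controls using Python's built-in sqlite3 module; the product/sale tables are hypothetical, and other RDBMSs express the same constraints with slightly different syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY               -- entity integrity: no duplicate keys
);
CREATE TABLE sale (
    sale_id      INTEGER PRIMARY KEY,
    product_code INTEGER NOT NULL                  -- minimize missing values
                 REFERENCES product(product_code), -- referential integrity
    sale_price   REAL NOT NULL
                 CHECK (sale_price >= 0)           -- domain integrity: value edits
);
""")
conn.execute("PRAGMA foreign_keys = ON")

# This insert is rejected: the price is negative and product 999 does not exist.
try:
    conn.execute("INSERT INTO sale VALUES (1, 999, -5.0)")
except sqlite3.IntegrityError as err:
    print("rejected by the DBMS:", err)
```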
  • 34. Benefits of Data Quality Some specific areas where data quality brings definite benefits – • Analysis with timely information. • Better customer service. • Newer opportunities. • Reduced costs and risks. • Improved productivity. • Reliable strategic decision making.
  • 35. Data Warehouse Design Reviews One of the most effective techniques for ensuring quality in the operational environment is the design review. Errors can be detected and resolved prior to coding through a design review. The cost benefit of identifying errors early in the development life cycle is enormous. A design review is usually done on completion of the physical design of an application. Some of the issues around operational design review are as follows – • Transaction performance • System availability • Project readiness • Batch window adequacy • Capacity • User requirements satisfaction
  • 36. Views of Data Warehouse Design The four views regarding a data warehouse design that must be considered – 1. Top-Down View: this allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs. 2. Data Source View: it exposes the information being captured, stored and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship model.
  • 37. Views of Data Warehouse Design 3. Data Warehouse View: this view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including pre-calculated totals and counts. Information regarding the source, date and time of origin is added to provide historical context. 4. Business Query View: this view is the data perspective in the data warehouse from the end-user's viewpoint.
  • 38. Data Warehouse Design Approaches A data warehouse can be built using three approaches – a) The top-down approach: it starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The process begins with an ETL process working from external data sources. In the top-down model, integration between the data warehouse and the data marts is automatic as long as the data marts are maintained as subsets of the data warehouse.
  • 39. Data Warehouse Design Approaches b) The bottom-up approach: The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. This approach is to construct the data warehouse incrementally over time from independently developed data marts. In this approach, data flows from sources into data marts, then into the data warehouse.
  • 40. Data Warehouse Design Approaches c) The combined approach: in this approach, both the top-down and bottom-up approaches are exploited. An organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.
  • 41. Data Warehouse Design Process The general data warehouse design process involves the following steps – Step 1: Choosing the appropriate business process: • Based on needs and requirements, there are two types of models: the data warehouse model and the data mart model. • The data warehouse model is chosen if the business process is organizational and involves many complex object collections. • A data mart model is chosen if the business process is departmental and focuses on the analysis of a particular process. Step 2: Choosing the grain of the business process: • The grain is the fundamental level of data represented in the fact table for the chosen business process. (E.g.) individual snapshots, individual transactions, etc.
  • 42. Data Warehouse Design Process Step 3: Choosing the dimensions: this includes selecting the various dimensions, such as time, item, status, etc., that need to be applied to each fact table record. Step 4: Choosing the measures: this includes selecting the various measures, such as items_sold, euros_sold, etc., which fill up each fact table record.
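As a rough illustration of Steps 2–4, the sketch below declares a tiny star schema in sqlite3 via Python: dimension tables for time and item, and a fact table whose assumed grain is one row per item per day with items_sold and euros_sold as measures. The table and column names are assumptions, not taken from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions chosen in Step 3: time and item.
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);

-- Fact table at the grain chosen in Step 2 (one row per item per day),
-- carrying the measures chosen in Step 4.
CREATE TABLE fact_sales (
    time_key   INTEGER REFERENCES dim_time(time_key),
    item_key   INTEGER REFERENCES dim_item(item_key),
    items_sold INTEGER,
    euros_sold REAL
);
""")
```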
  • 43. Testing & Monitoring the Data Warehouse Definition: Data Warehouse testing is the process of building and executing comprehensive test cases to ensure that data in a warehouse has integrity and is reliable, accurate and consistent with the organization’s data framework. Testing is very important for data warehouse systems for data validation and to make them work correctly and efficiently. Data Warehouse Testing is a series of Verification and Validation activities performed to check for the quality and accuracy of the Data Warehouse and its contents.
  • 44. There are five basic levels of testing performed on a data warehouse – 1. Unit Testing: this type of testing is performed at the developer's end. In unit testing, each unit / component of the modules is tested separately. Each module of the whole data warehouse (i.e. program, SQL script, procedure, Unix shell script) is validated and tested. 2. Integration Testing: in this type of testing, the various individual units / modules of the application are brought together or combined and then tested against a number of inputs. It is performed to detect faults in the integrated modules and to test whether the various components are performing well after integration.
  • 45. 3. System Testing: • System testing is the form of testing that validates and tests the whole data warehouse application. • This type of testing is performed by the technical testing team. • This test is conducted after the developer's team performs unit testing, and its main purpose is to check whether the entire system works altogether or not. 4. Acceptance Testing: • To verify that the entire solution meets the business requirements and successfully supports the business processes from a user's perspective. 5. System Assurance Testing: • To ensure and verify the operational readiness of the system in a production environment. • This is also referred to as the warranty period coverage.
  • 46. Challenges of data warehouse testing are – • Data selection from multiple sources, and the analysis that follows, pose a great challenge. • Because of the volume and complexity of the data, certain testing strategies are time-consuming. • ETL testing requires Hive / SQL skills, which poses challenges for testers who have limited SQL skills. • Redundant data in the data warehouse, and inconsistent and inaccurate reports.
  • 47. Data Warehouse Testing Process Testing a data warehouse is a multi-step process that involves activities like identifying business requirements, designing test cases, setting up a test framework, executing the test cases and validating data. The steps for the testing process are – Step 1: Identify the various entry points: as loading data into a warehouse involves multiple stages, it is essential to find the various entry points so that data can be tested at each of those stages. If testing is done only at the destination, it can be confusing when errors are found, as it becomes more difficult to determine the root cause.
  • 48. Step 2: Prepare the required collaterals: two fundamental collaterals required for the testing process are a database schema representation and a mapping document. The mapping document is usually a spreadsheet which maps each column in the source database to the destination database. A data integration solution can help generate the mapping document, which is then used as an input to design test cases. Step 3: Design an elastic, automated and integrated testing framework: ETL is not a one-time activity. While some data is loaded all at once and some through batches, new updates may trickle in through streaming queues. The testing framework design has to be generic and architecturally flexible to accommodate new and diverse data sources and types, higher volumes, and the ability to work seamlessly with cloud and on-premises systems.
  • 49. Integrating the test framework with an automated data solution (that contains features as discussed in the previous section) increases the efficiency of the testing process. Step 4: Adopt a comprehensive testing approach: the testing framework needs to aim for 100% coverage of the data warehousing process. It is important to design multiple testing approaches such as unit, integration, functional, and performance testing. The data itself has to be scrutinized with many checks, including looking for duplicates, matching record counts, completeness, accuracy, loss of data, and correctness of transformation.
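A minimal sketch of one such data check, assuming source and warehouse rows are available as lists of dicts sharing a hypothetical order_id key; it compares record counts and looks for duplicates and lost records:

```python
def reconcile(source_rows, warehouse_rows, key="order_id"):
    """Compare record counts and detect duplicates and lost records."""
    src_keys = [r[key] for r in source_rows]
    dwh_keys = [r[key] for r in warehouse_rows]
    return {
        "count_match": len(source_rows) == len(warehouse_rows),
        "duplicates_in_target": len(dwh_keys) - len(set(dwh_keys)),
        "missing_in_target": sorted(set(src_keys) - set(dwh_keys)),
    }

# Example usage with tiny in-memory samples:
src = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
dwh = [{"order_id": 1}, {"order_id": 2}]
assert reconcile(src, dwh)["missing_in_target"] == [3]
```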
  • 50. Testing Operational Environment There are a number of aspects that need to be tested, as below – 1. Security: • A separate security document is required for security testing. This document contains a list of disallowed operations, and tests are devised for each. 2. Scheduler: • Scheduling software is required to control the daily operations of a data warehouse, and it needs to be tested during system testing. The scheduling software requires an interface with the data warehouse; the scheduler is needed to control overnight processing and the management of aggregations. 3. Disk Configuration: • Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
  • 51. 4. Management Tools: It is required to test all the management tools during system testing. Here is the list of tools that need to be tested. • Event manager • System manager • Database manager • Configuration manager • Backup recovery manager
  • 52. Testing the Database The database is tested in the following three ways – 1. Testing the database manager and monitoring tools: • To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database. 2. Testing database features: • Here is the list of features that we have to test – querying in parallel, creating an index in parallel, loading data in parallel. 3. Testing database performance: • Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly and they should be tested.
  • 53. Data Warehouse Monitoring Data warehouse monitoring helps to understand how the data warehouse is performing. Some of the several reasons for monitoring are – It ensures top performance. It ensures excellent usability. It ensures the business can run efficiently. It prevents security issues. It ensures governance and compliance.