UNIT 2
Mr.T.Somasundaram
Assistant Professor
Department of Management
Kristu Jayanti College (Autonomous), Bengaluru
UNIT 2
ETL Process and Maintenance of the Data Warehouse
Data Extraction, Data Transformation, Data Loading, Data Quality, Data Warehouse Design Reviews, Testing and Monitoring the Data Warehouse.
ETL Process
• ETL is a process in data warehousing and stands for Extract, Transform and Load.
• It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and finally loads it into the data warehouse system.
• It is a data integration process that combines data from multiple data sources into a single, consistent data store in a data warehouse.
• The tool may be customized to suit the needs of the enterprise.
(E.g.) ETL tool sets are used for long-term analysis and usage of data in banking, insurance claims, retail sales history, etc.
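To make the three steps concrete, here is a minimal sketch of an ETL run in Python, assuming a hypothetical SQLite source table `orders` and warehouse table `dw_orders`; all table and column names are illustrative only.

```python
import sqlite3
import pandas as pd

def extract(source_conn):
    # Extract: read raw records from the operational source system.
    return pd.read_sql_query(
        "SELECT order_id, customer, amount, order_date FROM orders", source_conn)

def transform(df):
    # Transform: standardize formats and derive new values in the staging area.
    out = df.copy()
    out["customer"] = out["customer"].str.strip().str.upper()
    out["order_date"] = pd.to_datetime(out["order_date"]).dt.date
    out["amount"] = out["amount"].round(2)
    return out

def load(df, dw_conn):
    # Load: append the transformed rows into the warehouse table.
    df.to_sql("dw_orders", dw_conn, if_exists="append", index=False)

if __name__ == "__main__":
    src = sqlite3.connect("operational.db")   # hypothetical source database
    dw = sqlite3.connect("warehouse.db")      # hypothetical warehouse database
    load(transform(extract(src)), dw)
```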
ETL Tools
• Google BigQuery
• Amazon Redshift
• Informatica – PowerCenter
• IBM – InfoSphere Information Server
• Oracle Data Integrator
• SQL Server Integration Services
Benefits of ETL Tools
1) Scalability – virtually unlimited scalability is available at the click of a button, i.e. capacity can be changed in size as needed.
2) Simplicity – it saves time and resources and avoids a lot of complexity.
3) Out of the box – open-source ETL requires customization and cloud-based ETL requires integration.
4) Compliance – it provides an easy way to avoid complicated and risky compliance setups.
5) Long-term costs – open-source ETL tools are cheaper up front but may cost more in the long run.
Phases (Steps) of the ETL Process
Extraction (E):
The first step is the extraction of data: the source system’s data is accessed and prepared for further processing, and the required values are extracted.
Data is extracted from various formats such as relational databases, NoSQL stores, XML and flat files, etc.
It is important to store extracted data in a staging area, not directly in the data warehouse, as loading it directly may cause damage and rollback will be much more difficult.
Extraction has three approaches –
a) Update Notification – when the data is changed or altered in the source system, the source notifies about the change.
b) Incremental Extract – many systems are incapable of providing notifications but are efficient enough to track the changes made to the source data.
c) Full Extract – the whole data set is extracted when the system can neither notify nor track changes. An old copy of the data is maintained to identify the changes.
(E.g.) conversion of phone numbers and email addresses to a standard form, validation of address fields, etc.
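A minimal sketch of the incremental-extract approach, assuming the source table carries a `last_updated` timestamp and that the ETL job remembers when it last ran; names and values are illustrative.

```python
import sqlite3
import pandas as pd

def incremental_extract(source_conn, last_extract_time):
    # Pull only the rows that changed since the previous extract run.
    query = """
        SELECT customer_id, name, phone, email, last_updated
        FROM customers
        WHERE last_updated > ?
    """
    return pd.read_sql_query(query, source_conn, params=(last_extract_time,))

conn = sqlite3.connect("operational.db")                    # hypothetical source database
changed = incremental_extract(conn, "2024-01-31 23:59:59")  # timestamp of previous run
```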
The data extraction issues are -
Source Identification – identify source applications and source structures.
Method of extraction – for each data source, define whether the extraction
process is manual or tool-based.
Extraction frequency – for each data source, establish how frequently the
data extraction must be done (daily, weekly, quarterly, etc.)
Time Window – for each data source, denote the time window for the
extraction process.
Job sequencing – determine whether the beginning of one job in an
extraction job stream has to wait until the previous job has finished
successfully.
Exception handling – determine how to handle input records that can’t be
extracted.
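The checklist above can be recorded as a simple per-source configuration; the following sketch uses purely illustrative values.

```python
# Illustrative extraction configuration for one data source,
# covering the planning items listed above.
extraction_job = {
    "source": "billing_system",          # source identification
    "method": "tool-based",              # method of extraction: manual or tool-based
    "frequency": "daily",                # extraction frequency: daily, weekly, quarterly, ...
    "time_window": ("01:00", "03:00"),   # time window allowed for the extraction
    "depends_on": ["crm_extract"],       # job sequencing: wait for this job to finish
    "on_error": "reject_and_log",        # exception handling for records that cannot be extracted
}
```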
The following are the guidelines adopted for extracts as best practices -
The extract processing should identify changes since the last extract.
Interface record types should be defined for the extraction based on
entities in data warehouse model.
(E.g.) Client information extracted from source may be categorized into
person attributes, contact point information, etc.
When changes are sent to the data warehouse, all current attributes for the changed
entity should also be sent.
Any interface record should be categorized as –
Records which have been added to operational database since the last
extract.
Records which have been deleted from operational database since the last
extract.
Transformation (T):
The second step of ETL process is transformation.
A set of rules or functions is applied to the extracted
data to convert it into a single standard format.
It includes dimension conversion, aggregation,
joining, derivation and calculations of new values.
Transformation involves the following processes or tasks –
a) Filtering – loading only certain attributes into the
data warehouse.
b) Cleaning – filling up the NULL values with some default
values, mapping U.S.A, United States and America to USA,
etc.
c) Joining – joining multiple attributes into one.
d) Splitting – splitting a single attribute into multiple attributes.
e) Sorting – sorting tuples on the basis of some attribute
(generally key-attribute).
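A small sketch of these transformation tasks using pandas; the customer and order data frames and their columns are illustrative only.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "full_name": ["Asha Rao", "John Smith", None],
    "country": ["India", "U.S.A", "United States"],
})
orders = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [120.0, 75.5, 30.0]})

# Filtering: load only the attributes needed in the warehouse.
customers = customers[["cust_id", "full_name", "country"]]

# Cleaning: fill NULLs with defaults and map variant values to a standard form.
customers["full_name"] = customers["full_name"].fillna("UNKNOWN")
customers["country"] = customers["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Joining: combine attributes from multiple sources into one record.
merged = orders.merge(customers, on="cust_id", how="left")

# Splitting: break a single attribute into multiple attributes.
merged[["first_name", "last_name"]] = merged["full_name"].str.split(" ", n=1, expand=True)

# Sorting: order the tuples on a key attribute.
merged = merged.sort_values("cust_id")
```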
Major Data Transformation Types:
a) Format Revisions – these include changes to the data types and lengths of
individual fields.
(E.g.) Product package type may be indicated by codes and names, with the fields
being numeric and text data types respectively.
b) Decoding of fields – this is common when dealing with multiple source
systems in which the same data items are described by different field values.
(E.g.) The coding for Male and Female may be 1 and 2 in one source system and
M and F in another source system.
c) Splitting of single fields – earlier systems often stored name, address, city and state
data together in a single field, but the individual components need
to be stored in separate fields in the data warehouse.
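A brief sketch of decoding fields and splitting a single field, with invented source systems and values for illustration.

```python
import pandas as pd

# Decoding of fields: map system-specific codes to one warehouse standard.
source_a = pd.DataFrame({"cust_id": [1, 2], "gender": [1, 2]})        # 1 = Male, 2 = Female
source_b = pd.DataFrame({"cust_id": [3, 4], "gender": ["M", "F"]})
gender_map = {1: "Male", 2: "Female", "M": "Male", "F": "Female"}
decoded = pd.concat([source_a, source_b], ignore_index=True)
decoded["gender"] = decoded["gender"].map(gender_map)

# Splitting of single fields: break a combined legacy field into separate components.
legacy = pd.DataFrame({"address": ["12 MG Road, Bengaluru, Karnataka"]})
legacy[["street", "city", "state"]] = legacy["address"].str.split(", ", expand=True)
```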
d) Merging of information – this does not mean the merging of several fields
to create a single field of data.
(E.g.) information about a product may come from different data sources –
product code and description from one data source, package type from another.
Merging of information denotes the combination of product code, description and
package type into a single entity.
e) Character set conversion – this relates to the conversion of character sets
to an agreed standard character set for text data in the data warehouse.
f) Conversion of units of measurement – metrics based on overseas operations
must be converted so that the numbers are all in one standard
unit of measurement.
g) Date/Time conversion – this is the representation of date and time in
standard formats.
h) Summarization – this type of transformation creates summaries to
be loaded into the data warehouse instead of loading the most granular level of data.
(E.g.) It is not necessary to store every single credit card transaction in the data
warehouse; instead, summarize the daily transactions for each
credit card and store the summary data.
i) Key restructuring – while extracting data from the input sources, look at the
primary keys of the extracted records and derive keys for the fact and
dimension tables based on the keys in the extracted records.
j) Deduplication – in most companies, customer files have several records for the same
customer, often created by mistake. The aim is
to maintain one record per customer and link all duplicates in the
source systems to that single record.
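A short sketch of summarization and deduplication; the transaction and customer data are invented for illustration.

```python
import pandas as pd

transactions = pd.DataFrame({
    "card_no":  ["1111", "1111", "2222", "2222"],
    "txn_date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "amount":   [50.0, 20.0, 10.0, 40.0],
})

# Summarization: load daily totals per card instead of every single transaction.
daily_summary = (transactions
                 .groupby(["card_no", "txn_date"], as_index=False)["amount"]
                 .sum())

# Deduplication: keep one record per customer and cross-reference the duplicates.
customers = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "name":    ["A. Kumar", "A Kumar", "B. Singh"],
    "phone":   ["98450", "98450", "91234"],
})
survivors = customers.drop_duplicates(subset=["phone"], keep="first")
cross_ref = customers.merge(
    survivors[["cust_id", "phone"]].rename(columns={"cust_id": "master_id"}), on="phone")
```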
Loading (L):
 The third and final step of the ETL process is loading.
 Transformed data is finally loaded into the data warehouse.
 Data is loaded into the data warehouse frequently, but at regular intervals.
 Indexes and constraints previously applied to the data need to be disabled before loading
commences.
 The rate and period of loading depend on requirements and vary from system to
system.
 During the loads, the data warehouse has to be offline.
 A time period should be identified when loads may be scheduled without affecting data
warehouse users.
 Consider dividing the whole load process into smaller chunks and
populating a few files at a time.
Modes of Loading (L):
a) Load – if the target table already exists and contains data, the load
process wipes out the existing data and applies the data from the
incoming file. If the table is empty before loading, the load process applies the
data from the incoming file.
b) Append – an extension of the load. If data already exists in the table, the
append process unconditionally adds the incoming data, preserving the
existing data in the target table. Incoming duplicate records may be rejected
during the append process.
c) Destructive Merge – in this mode, the incoming data is applied to the target
data. If an incoming record matches the key of an existing record, the
matching target record is updated; if not, the incoming record is added to the target
table.
d) Constructive Merge – this mode is the opposite of the destructive merge, i.e. if an
incoming record matches the key of an existing record, leave the existing record,
add the incoming record and mark the added record as superseding the old record.
e) Initial Load – loading the whole data warehouse in a single run. The load may be split
into separate subloads and run as single loads. If more than one run is needed to
create a single table, it may be scheduled to run over several days.
f) Incremental Loads – these are applications of ongoing changes from the source
systems. A method is needed to preserve the periodic nature of the changes in the data warehouse.
Constructive merge mode is an appropriate method for incremental loads.
g) Full Refresh – this application of data involves periodically rewriting the entire
data warehouse. Partial refreshes that rewrite only specific tables are possible, but
they are rare, because the dimension tables are intricately tied to the fact table.
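A minimal sketch contrasting destructive and constructive merge, using plain dictionaries keyed by the record key; the records are illustrative.

```python
# Destructive merge: an incoming record with a matching key overwrites the
# existing target record; unmatched incoming records are simply added.
def destructive_merge(target, incoming):
    merged = dict(target)
    merged.update(incoming)        # matching keys are replaced, new keys are added
    return merged

# Constructive merge: the existing record is kept, the incoming record is added
# and marked as superseding the old one.
def constructive_merge(target, incoming):
    merged = {key: [(row, "current")] for key, row in target.items()}
    for key, row in incoming.items():
        if key in merged:
            merged[key] = [(old, "superseded") for old, _ in merged[key]]
            merged[key].append((row, "current"))
        else:
            merged[key] = [(row, "current")]
    return merged

target   = {1: {"name": "Asha", "city": "Mysuru"}}
incoming = {1: {"name": "Asha", "city": "Bengaluru"}, 2: {"name": "Ravi", "city": "Hubli"}}
print(destructive_merge(target, incoming))
print(constructive_merge(target, incoming))
```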
Data Quality (DQ) in Data Warehouse
What is Data Quality?
To an IT professional, data quality is quite often associated with the accuracy
of individual data elements.
(E.g.) Consider Customer as an entity with attributes such as Customer Name,
Customer Address, Customer State, Customer Mobile No, etc.
The data is accurate when the attributes of the customer entity correctly describe the
particular customer.
(i.e.) if data is fit for the purpose for which it is intended, then such data has
quality.
Data quality is related to the usage of the data item as defined by the users.
Data quality in operational systems requires that database records conform to
field validation edits; that is data quality, but single-field edits alone do not
constitute data quality.
Definition:
Data quality refers to the overall utility of a dataset as a function of
its ability to be easily processed and analyzed for other uses, usually by a
database, data warehouse, or data analytics system.
 Data quality in a data warehouse is not just the quality of individual data
items but the quality of the full, integrated system as a whole.
(E.g.) In an online ordering system, while entering data about customers in the
order entry application, we may collect demographics of each customer.
 Sometimes these demographic factors may not be needed or may not receive much
attention at entry time.
 When that data is later accessed as part of the integrated whole, it lacks data
quality.
(E.g.) Some customer information may be important and some unimportant
when filling in an application form (especially in banking processes).
Data Accuracy Vs Data Quality
Difference between Data Accuracy and Data Quality:
• Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
• Data Accuracy: the data element is defined in terms of database technology. Data Quality: a wider concept grounded in the specific business of the company.
• Data Accuracy: the data element conforms to validation constraints. Data Quality: relates not just to single data elements but to the system as a whole.
• Data Accuracy: individual data items have the correct data types. Data Quality: the form and content of data elements are consistent across the whole system.
• Data Accuracy: traditionally relates to operational systems. Data Quality: essentially needed in a corporate-wide data warehouse for business users.
Characteristics (Dimensions) of Data Quality
The data quality dimensions are -
1. Accuracy:
The value stored in the system for a data element is the right value for that
occurrence of the data element. (E.g.) getting the correct customer address.
2. Domain Integrity:
The data value of an attribute falls in the range of allowable, defined values.
(E.g.) Male and Female for gender data element.
3. Data Type:
Value for a data attribute is actually stored as the data type defined for the
attribute. (E.g.) Name field is defined as ‘text’.
4. Consistency:
The form and content of a data field are the same across multiple source
systems. (E.g.) the product code for Product A is 1234 in every source system.
5. Redundancy:
The same data must not be stored in more than one place in a system.
6. Completeness:
There are no missing values for a given attribute in the system.
7. Duplication:
Duplication of records in a system is completely resolved. (E.g.) duplicate
records are identified and a cross-reference is created.
8. Conformance to Business Rules:
The values of each data item adhere to prescribed business rules. (E.g.) in
auction system, sale price can’t be less than the reserve price.
9. Structural Definiteness:
Wherever a data item can naturally be structured into individual components,
the item must contain this well-defined structure. (E.g.) names are divided
into first name, middle name and last name, which reduces missing values.
10. Data Anomaly:
 A field must be used only for the purpose for which it is defined. (E.g.) in the third
address line column meant for long addresses, only the third line of the
address should be entered, not phone or fax numbers.
11. Clarity:
 A data element may possess all the other characteristics of quality data, but if the
users do not understand its meaning clearly, the data is of little value to them.
12. Timely:
 The users determine the timeliness of the data. (E.g.) updating the customer
database on a daily basis.
13. Usefulness:
 Every data element in the data warehouse must satisfy some requirements of the
collection of users.
14. Adherence to Data Integrity Rules:
 The data stored in the relational databases of the source system must adhere to
entity integrity and referential integrity rules.
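A few of the dimensions above (domain integrity, completeness and duplication) expressed as simple pandas checks; the customer data is invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [1, 2, 2, 4],
    "gender":  ["Male", "Female", "X", "Male"],
    "name":    ["Asha", None, "Ravi", "Meena"],
})

# Domain integrity: values must fall within the set of allowable values.
bad_domain = customers[~customers["gender"].isin(["Male", "Female"])]

# Completeness: no missing values for a mandatory attribute.
missing_names = customers[customers["name"].isna()]

# Duplication: duplicate keys must be identified and resolved.
duplicate_keys = customers[customers["cust_id"].duplicated(keep=False)]

print(len(bad_domain), len(missing_names), len(duplicate_keys))
```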
Data Quality Challenges (Problems) in DW
[Chart: Data Warehouse Challenges – percentages reported for database performance, management expectations, business rules, data transformation, user expectations, data modeling and data quality.]
Data Quality Framework
The framework involves both IT professionals and user representatives, and covers the following steps –
• Establish a Data Quality Steering Committee.
• Agree on a suitable data quality framework.
• Identify the business functions affected most by bad data.
• Institute data quality policy and standards.
• Select high-impact data elements and determine priorities.
• Define quality measurement parameters and benchmarks.
• Plan and execute data cleansing for high-impact data elements (initial data cleansing effort).
• Plan and execute data cleansing for other, less severe elements (ongoing data cleansing effort).
Data Quality – Participants and Roles
Data quality initiatives involve the following participants –
• Data Consumer (User Dept.)
• Data Producer (User Dept.)
• Data Expert (User Dept.)
• Data Integrity Specialist (User Dept.)
• Data Correction Authority (IT Dept.)
• Data Consistency Expert (IT Dept.)
• Data Policy Administrator (IT Dept.)
The responsibilities for the roles are -
1. Data Consumers:
Uses the data warehouse for queries, reports and analysis. Establishes the
acceptable levels of data quality.
2. Data Producer:
Responsible for the quality of data input into the source systems.
3. Data Expert:
Expert in the subject matter and the data itself of the source systems.
Responsible for identifying pollution in the source systems.
4. Data Policy Administrator:
Ultimately responsible for resolving data corruption as data is
transformed and moved into the data warehouse.
The responsibilities for the roles are -
5. Data Integrity Specialist:
Responsible for ensuring that the data in the source systems conforms to
the business rules.
6. Data Correction Authority:
Responsible for actually applying the data cleansing techniques through
the use of tools or in-house programs.
7. Data Consistency Expert:
Responsible for ensuring that all data within the data warehouse (various
data marts) are fully synchronized.
Data Quality Tools
The useful data quality tools are -
1. Categories of Data Cleansing Tools:
They assist in two ways –
 Data error discovery tools work on the source data to identify inaccuracies and
inconsistencies.
 Data correction tools help fix the corrupt data, using a series of algorithms
to parse, transform, match, consolidate and correct the data.
2. Error Discovery Features:
The following are error discovery functions that data cleansing tools are
capable of performing (a small sketch of one such check follows the list) –
 Quickly and easily identify duplicate records.
 Identify data items whose values are outside the range of legal domain values.
 Find inconsistent data.
 Check for range of allowable values.
 Detect inconsistencies among data items from different sources.
 Allow users to identify and quantify data quality problems.
 Monitor trends in data quality over time.
 Report to users on the quality of data used for analysis.
 Reconcile problems of RDBMS referential integrity.
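A small sketch of one error discovery check from the list above: detecting inconsistent values for the same data item across two source systems. The data frames and columns are illustrative.

```python
import pandas as pd

crm     = pd.DataFrame({"cust_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
billing = pd.DataFrame({"cust_id": [1, 2], "email": ["a@x.com", "b@y.com"]})

# Detect inconsistencies among data items coming from different sources.
compared = crm.merge(billing, on="cust_id", suffixes=("_crm", "_billing"))
inconsistent = compared[compared["email_crm"] != compared["email_billing"]]
print(inconsistent)   # rows whose email differs between the two source systems
```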
3. Data Correction features:
The following list describes the typical error correction functions that data cleansing
tools are capable of performing –
 Normalize inconsistent data.
 Improve merging of data from dissimilar data sources.
 Group and relate customer records belonging to the same household.
 Provide measurements of data quality.
 Standardize data elements to common formats.
 Validate for allowable values.
4. DBMS for Quality Control:
The database management system can be used as a tool for data quality control in many
ways; in particular, an RDBMS can prevent several types of errors from creeping into the data
warehouse (a small sketch follows the list) –
 Domain integrity – provide domain value edits. Prevent entry of data if entered
data value is outside the defined limits of value.
 Update security – prevent unauthorized updates to the databases, which stops
unauthorized users from updating data in an incorrect way.
 Entity Integrity Checking – ensure that duplicate records with same primary key
value are not entered.
 Minimize missing values – ensure that nulls are not allowed in mandatory fields.
 Referential Integrity Checking – ensure that relationships based on foreign keys
are preserved.
 Conformance to Business rules – use trigger programs and stored procedures to
enforce business rules.
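A minimal sketch of how these RDBMS features enforce quality, using SQLite from Python; the product and sale tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # turn on referential integrity checking in SQLite

conn.execute("""
    CREATE TABLE product (
        product_code TEXT PRIMARY KEY                   -- entity integrity: no duplicate keys
    )
""")
conn.execute("""
    CREATE TABLE sale (
        sale_id      INTEGER PRIMARY KEY,
        product_code TEXT NOT NULL                      -- minimize missing values
                     REFERENCES product(product_code),  -- referential integrity checking
        sale_price   REAL CHECK (sale_price >= 0)       -- domain integrity: value edits
    )
""")

conn.execute("INSERT INTO product VALUES ('P100')")
conn.execute("INSERT INTO sale VALUES (1, 'P100', 49.99)")   # passes all of the checks above
```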
Benefits of Data Quality
Some specific areas where data quality provides definite benefits are -
 Analysis with timely information.
 Better Customer Service.
 Newer opportunities.
 Reduced costs and Risks.
 Improved Productivity.
 Reliable Strategic Decision Making.
Data Warehouse Design Reviews
One of the most effective techniques for ensuring quality in the
operational environment is the design review.
Errors can be detected and resolved prior to coding through a design
review.
The cost benefit of identifying errors early in the development life cycle
is enormous.
Design review is usually done on completion of the physical design of an
application.
Some of the issues around operational design review are follows –
Transaction performance
System availability
Project readiness
 Batch window adequacy
 Capacity
 User requirements satisfaction
Views of Data Warehouse Design
The four views regarding a data warehouse design must be considered –
1. Top-Down View:
This allows the selection of relevant information necessary for data
warehouse.
This information matches current and future business needs.
2. Data Source View:
It exposes the information being captured, stored and managed by
operational systems.
This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source
tables.
Data sources are often modeled by traditional data modeling techniques,
such as entity – relationship model.
Views of Data Warehouse Design
3. Data Warehouse View:
This view includes fact tables and dimension tables.
It represents the information that is stored inside the data
warehouse, including pre-calculated totals and counts.
Information regarding the source, date and time of origin, added
to provide historical context.
4. Business Query View:
This view is the data perspective in the data warehouse from the
end-user’s view point.
Data Warehouse Design Approaches
A data warehouse can be built using three approaches -
a) The top-down approach:
It starts with the overall design and planning.
It is useful in cases where the technology is mature and well
known, and where the business problems that must be solved
are clear and well understood.
The process begins with an ETL process working from external
data sources.
In the top-down model, integration between the data warehouse
and the data marts is automatic, as long as the data marts are
maintained as subsets of the data warehouse.
Data Warehouse Design Approaches
b) The bottom-up approach:
The bottom-up approach starts with experiments and prototypes.
This is useful in the early stage of business modeling and
technology development.
It allows an organization to move forward at considerably less
expense and to evaluate the benefits of the technology before
making significant commitments.
This approach is to construct the data warehouse incrementally
over time from independently developed data marts.
In this approach, data flows from sources into data marts, then into
the data warehouse.
Data Warehouse Design Approaches
c) The Combined approach:
In this approach, both the top-down approach and bottom-up
approaches are exploited.
In combined approach, an organization can exploit the planned
and strategic nature of top-down approach while retaining the
rapid implementation and opportunistic application of the
bottom-up approach.
Data Warehouse Design Process
The general data warehouse design process involves the following steps -
Step 1: Choosing the appropriate business process:
 Based on needs and requirements, there are two types of models: the data
warehouse model and the data mart model.
 A data warehouse model is chosen if the business process is organizational
and spans many complex object collections.
 A data mart model is chosen if the business process is departmental and
focuses on the analysis of one particular process.
Step 2: Choosing the grain of the business process:
 The grain is the fundamental, atomic level of data represented in the fact table
for the chosen business process.
(E.g.) individual snapshots, individual transactions, etc.
Data Warehouse Design Process
Step 3: Choosing the dimensions:
It includes selecting the various dimensions such as time, item,
status, etc., which need to be applied to each fact table record.
Step 4: Choosing the measures:
It includes selecting the various measures such as items_sold,
euros_sold, etc., which fill up each fact table
record.
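The four steps can be sketched for a simple sales process as SQLite table definitions; the grain here is one row per item per day, and all table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 3 – dimensions chosen for the business process: time and item.
conn.execute("CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
conn.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT)")

# Step 2 – fact table at the chosen grain (one row per item per day), with
# Step 4 – the chosen measures items_sold and euros_sold.
conn.execute("""
    CREATE TABLE fact_sales (
        time_key   INTEGER REFERENCES dim_time(time_key),
        item_key   INTEGER REFERENCES dim_item(item_key),
        items_sold INTEGER,
        euros_sold REAL
    )
""")
```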
Testing & Monitoring the Data Warehouse
Definition:
Data Warehouse testing is the process of building and
executing comprehensive test cases to ensure that data in a
warehouse has integrity and is reliable, accurate and consistent
with the organization’s data framework.
Testing is very important for data warehouse systems for data
validation and to make them work correctly and efficiently.
Data Warehouse Testing is a series of Verification and
Validation activities performed to check for the quality and
accuracy of the Data Warehouse and its contents.
There are five basic levels of testing performed on a data warehouse –
1. Unit Testing:
This type of testing is performed at the developer’s end.
In unit testing, each unit / component of the modules is tested separately.
Each module of the data warehouse (i.e. program, SQL
script, procedure, Unix shell script) is validated and tested.
2. Integration Testing:
In this type of testing, the various individual units / modules of the
application are brought together or combined and then tested against
a number of inputs.
It is performed to detect the fault in integrated modules and to test
whether the various components are performing well after integration.
3. System Testing:
 System testing is the form of testing that validates and tests the whole data
warehouse application.
 This type of testing is performed by the technical testing team.
 This test is conducted after the developer’s team performs unit testing, and its
main purpose is to check whether the entire system is working
together or not.
4. Acceptance Testing:
 To verify that the entire solution meets the business requirements and
successfully supports the business processes from a user’s perspective.
5. System Assurance Testing:
 To ensure and verify the operational readiness of the system in a production
environment.
 This is also referred to as the warranty period coverage.
Challenges of data warehouse testing are -
 Data selection from multiple sources, and the analysis that follows,
pose a great challenge.
 Because of the volume and complexity of the data, certain testing strategies are
time consuming.
 ETL testing requires strong SQL (including Hive SQL) skills, so it poses challenges for
testers who have limited SQL skills.
 Redundant data in a data warehouse & inconsistent and
inaccurate reports.
Data Warehouse Testing Process
Testing a data warehouse is a multi-step process that involves
activities like identifying business requirements, designing test
cases, setting up a test framework, executing the test cases and
validating data.
The steps for testing process are –
Step 1: Identify the various entry points:
As loading data into a warehouse involves multiple stages, it’s
essential to find out the various entry points to test data at each of
those stages.
If testing is done only at the destination, it can be confusing when
errors are found as it becomes more difficult to determine the root
cause.
Step 2: Prepare the required collaterals:
Two fundamental collaterals required for the testing process are database
schema representation and a mapping document.
The mapping document is usually a spreadsheet which maps each
column in the source database to the destination database.
A data integration solution can help generate the mapping document,
which is then used as an input to design test cases.
Step 3: Design an elastic, automated and integrated testing framework:
ETL is not a one-time activity. While some data is loaded all at once and
some through batches, new updates may trickle in
through streaming queues.
A testing framework design has to be generic and architecturally flexible
to accommodate new and diverse data sources and types, higher volumes,
and the ability to work seamlessly with cloud and on-premises systems.
Integrating the test framework with an automated data solution
(that contains features as discussed in the previous section)
increases the efficiency of the testing process.
Step 4: Adopt a comprehensive testing approach:
The testing framework needs to aim for 100% coverage of the
data warehousing process.
It is important to design multiple testing approaches such as unit,
integration, functional, and performance testing.
The data itself has to be scrutinized for many checks that
includes looking for duplicates, matching record counts,
completeness, accuracy, loss of data, and correctness of
transformation.
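A minimal sketch of a few such checks, comparing a staging extract against the loaded warehouse table; the data frames are invented for illustration.

```python
import pandas as pd

staged = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
loaded = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Matching record counts: no rows lost or duplicated during the load.
assert len(staged) == len(loaded), "row count mismatch between staging and warehouse"

# Duplicate check: the business key must be unique in the warehouse table.
assert not loaded["order_id"].duplicated().any(), "duplicate keys loaded"

# Correctness of transformation: totals must reconcile between the two layers.
assert staged["amount"].sum() == loaded["amount"].sum(), "amount totals do not reconcile"

print("all warehouse validation checks passed")
```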
Testing the Operational Environment
There are a number of aspects that need to be tested, as below –
1. Security:
 A separate security document is required for security testing. This document
contains a list of disallowed operations, and tests are devised for each.
2. Scheduler:
 Scheduling software is required to control the daily operations of a data
warehouse. It needs to be tested during system testing. The scheduling
software requires an interface with the data warehouse, which will need the
scheduler to control overnight processing and the management of
aggregations.
3. Disk Configuration:
 Disk configuration also needs to be tested to identify I/O bottlenecks. The test
should be performed multiple times with different settings.
4. Management Tools:
It is required to test all the management tools during system
testing. Here is the list of tools that need to be tested.
• Event manager
• System manager
• Database manager
• Configuration manager
• Backup recovery manager
Testing the Database
The database is tested in following three ways –
1. Testing the database manager and monitoring tools:
 To test the database manager and the monitoring tools, they should be used in
the creation, running, and management of test database.
2. Testing database features:
 Here is the list of features that we have to test −
– Querying in parallel
– Create index in parallel
– Data load in parallel
3. Testing database performance:
 Query execution plays a very important role in data warehouse performance
measures. There are sets of fixed queries that need to be run regularly and
they should be tested.
Data Warehouse Monitoring
Data warehouse monitoring helps to understand how
the data warehouse is performing.
Some of the several reasons for monitoring are –
It ensures top performance.
It ensures excellent usability.
It ensures the business can run efficiently.
It prevents security issues.
It ensures governance and compliance.
ETL Process & Data Warehouse Fundamentals
Ad

More Related Content

What's hot (20)

Data Marts.pptx
Data Marts.pptxData Marts.pptx
Data Marts.pptx
DimpyJindal4
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
jeshocarme
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
idnats
 
Warehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemasWarehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemas
Eric Matthews
 
Dimensional Modelling
Dimensional ModellingDimensional Modelling
Dimensional Modelling
Prithwis Mukerjee
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Etl techniques
Etl techniquesEtl techniques
Etl techniques
mahezabeenIlkal
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 
Star schema
Star schemaStar schema
Star schema
Chandanapriya Sathavalli
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
Sunita Sahu
 
Data warehousing
Data warehousingData warehousing
Data warehousing
Juhi Mahajan
 
Data warehouse
Data warehouse Data warehouse
Data warehouse
Yogendra Uikey
 
DATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing PlanDATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing Plan
Madhu Nepal
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
Dr-Dipali Meher
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
Kiran kumar
 
Data Warehouse 101
Data Warehouse 101Data Warehouse 101
Data Warehouse 101
PanaEk Warawit
 
Data warehouse
Data warehouseData warehouse
Data warehouse
Sonali Chawla
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse
Lesa Cote
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
jeshocarme
 
Extract, Transform and Load.pptx
Extract, Transform and Load.pptxExtract, Transform and Load.pptx
Extract, Transform and Load.pptx
JesusaEspeleta
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
idnats
 
Warehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemasWarehousing dimension star-snowflake_schemas
Warehousing dimension star-snowflake_schemas
Eric Matthews
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
Theju Paul
 
Date warehousing concepts
Date warehousing conceptsDate warehousing concepts
Date warehousing concepts
pcherukumalla
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
Sunita Sahu
 
DATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing PlanDATA WAREHOUSE -- ETL testing Plan
DATA WAREHOUSE -- ETL testing Plan
Madhu Nepal
 
Data mining an introduction
Data mining an introductionData mining an introduction
Data mining an introduction
Dr-Dipali Meher
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
Kiran kumar
 
Introduction to data warehousing
Introduction to data warehousing   Introduction to data warehousing
Introduction to data warehousing
Girish Dhareshwar
 
Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse Role of Database Management System in A Data Warehouse
Role of Database Management System in A Data Warehouse
Lesa Cote
 

Similar to ETL Process & Data Warehouse Fundamentals (20)

definign etl process extract transform load.ppt
definign etl process extract transform load.pptdefinign etl process extract transform load.ppt
definign etl process extract transform load.ppt
smritiibansal
 
Building the DW - ETL
Building the DW - ETLBuilding the DW - ETL
Building the DW - ETL
ganblues
 
Lecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptxLecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptx
RehmahAtugonza
 
An Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLAn Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETL
idescitation
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATANEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
csandit
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
cscpconf
 
Database migration
Database migrationDatabase migration
Database migration
Sankar Patnaik
 
GROPSIKS.pptx
GROPSIKS.pptxGROPSIKS.pptx
GROPSIKS.pptx
avanceregine312
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
Samir Sabry
 
Get started with data migration
Get started with data migrationGet started with data migration
Get started with data migration
Thinqloud
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
Nagendra K
 
ETL_Methodology.pptx
ETL_Methodology.pptxETL_Methodology.pptx
ETL_Methodology.pptx
yogeshsuryawanshi47
 
ETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdfETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdf
oswahernan2203
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
Massimo Cenci
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
ashokvirtual
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
ijcatr04081001
ijcatr04081001ijcatr04081001
ijcatr04081001
reagan muriithi
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
definign etl process extract transform load.ppt
definign etl process extract transform load.pptdefinign etl process extract transform load.ppt
definign etl process extract transform load.ppt
smritiibansal
 
Building the DW - ETL
Building the DW - ETLBuilding the DW - ETL
Building the DW - ETL
ganblues
 
Lecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptxLecture 5: Extraction Transformation and loading.pptx
Lecture 5: Extraction Transformation and loading.pptx
RehmahAtugonza
 
An Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETLAn Overview on Data Quality Issues at Data Staging ETL
An Overview on Data Quality Issues at Data Staging ETL
idescitation
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATANEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
csandit
 
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
NEAR-REAL-TIME PARALLEL ETL+Q FOR AUTOMATIC SCALABILITY IN BIGDATA
cscpconf
 
Get started with data migration
Get started with data migrationGet started with data migration
Get started with data migration
Thinqloud
 
Datastage to ODI
Datastage to ODIDatastage to ODI
Datastage to ODI
Nagendra K
 
ETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdfETL-Advance IA to improve your skills-pdf
ETL-Advance IA to improve your skills-pdf
oswahernan2203
 
Data Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrongData Warehouse - What you know about etl process is wrong
Data Warehouse - What you know about etl process is wrong
Massimo Cenci
 
Etl interview questions
Etl interview questionsEtl interview questions
Etl interview questions
ashokvirtual
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Enhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data AccessEnhancing Data Staging as a Mechanism for Fast Data Access
Enhancing Data Staging as a Mechanism for Fast Data Access
Editor IJCATR
 
Ad

More from SOMASUNDARAM T (20)

MSM - UNIT 5.pdf
MSM - UNIT 5.pdfMSM - UNIT 5.pdf
MSM - UNIT 5.pdf
SOMASUNDARAM T
 
MSM - UNIT 4.pdf
MSM - UNIT 4.pdfMSM - UNIT 4.pdf
MSM - UNIT 4.pdf
SOMASUNDARAM T
 
MSM - UNIT 3.pdf
MSM - UNIT 3.pdfMSM - UNIT 3.pdf
MSM - UNIT 3.pdf
SOMASUNDARAM T
 
MSM - UNIT 2.pdf
MSM - UNIT 2.pdfMSM - UNIT 2.pdf
MSM - UNIT 2.pdf
SOMASUNDARAM T
 
MSM - UNIT 1.pdf
MSM - UNIT 1.pdfMSM - UNIT 1.pdf
MSM - UNIT 1.pdf
SOMASUNDARAM T
 
ITB - UNIT 5.pdf
ITB - UNIT 5.pdfITB - UNIT 5.pdf
ITB - UNIT 5.pdf
SOMASUNDARAM T
 
ITB - UNIT 4.pdf
ITB - UNIT 4.pdfITB - UNIT 4.pdf
ITB - UNIT 4.pdf
SOMASUNDARAM T
 
ITB - UNIT 3.pdf
ITB - UNIT 3.pdfITB - UNIT 3.pdf
ITB - UNIT 3.pdf
SOMASUNDARAM T
 
ITB - UNIT 2.pdf
ITB - UNIT 2.pdfITB - UNIT 2.pdf
ITB - UNIT 2.pdf
SOMASUNDARAM T
 
ITB - UNIT 1.pdf
ITB - UNIT 1.pdfITB - UNIT 1.pdf
ITB - UNIT 1.pdf
SOMASUNDARAM T
 
Data Mining
Data MiningData Mining
Data Mining
SOMASUNDARAM T
 
Introduction to Data Warehouse
Introduction to Data WarehouseIntroduction to Data Warehouse
Introduction to Data Warehouse
SOMASUNDARAM T
 
Organizing and Staffing
Organizing and StaffingOrganizing and Staffing
Organizing and Staffing
SOMASUNDARAM T
 
Directing and Controlling
Directing and ControllingDirecting and Controlling
Directing and Controlling
SOMASUNDARAM T
 
Data Analysis & Interpretation and Report Writing
Data Analysis & Interpretation and Report WritingData Analysis & Interpretation and Report Writing
Data Analysis & Interpretation and Report Writing
SOMASUNDARAM T
 
Computer Organization
Computer OrganizationComputer Organization
Computer Organization
SOMASUNDARAM T
 
Digital Fluency
Digital FluencyDigital Fluency
Digital Fluency
SOMASUNDARAM T
 
Planning and Objectives
Planning and ObjectivesPlanning and Objectives
Planning and Objectives
SOMASUNDARAM T
 
Management
ManagementManagement
Management
SOMASUNDARAM T
 
Business Management and Practices
Business Management and PracticesBusiness Management and Practices
Business Management and Practices
SOMASUNDARAM T
 
Ad

Recently uploaded (20)

apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
Political History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptxPolitical History of Pala dynasty Pala Rulers NEP.pptx
Political History of Pala dynasty Pala Rulers NEP.pptx
Arya Mahila P. G. College, Banaras Hindu University, Varanasi, India.
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Sinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_NameSinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_Name
keshanf79
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 
How to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POSHow to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POS
Celine George
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar RabbiPresentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Md Shaifullar Rabbi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Presentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem KayaPresentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem Kaya
MIPLM
 
Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025
Mebane Rash
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 
apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
P-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 finalP-glycoprotein pamphlet: iteration 4 of 4 final
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
How to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odooHow to Set warnings for invoicing specific customers in odoo
How to Set warnings for invoicing specific customers in odoo
Celine George
 
The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...The ever evoilving world of science /7th class science curiosity /samyans aca...
The ever evoilving world of science /7th class science curiosity /samyans aca...
Sandeep Swamy
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
Sinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_NameSinhala_Male_Names.pdf Sinhala_Male_Name
Sinhala_Male_Names.pdf Sinhala_Male_Name
keshanf79
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 
How to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POSHow to Manage Opening & Closing Controls in Odoo 17 POS
How to Manage Opening & Closing Controls in Odoo 17 POS
Celine George
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar RabbiPresentation on Tourism Product Development By Md Shaifullar Rabbi
Presentation on Tourism Product Development By Md Shaifullar Rabbi
Md Shaifullar Rabbi
 
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Michelle Rumley & Mairéad Mooney, Boole Library, University College Cork. Tra...
Library Association of Ireland
 
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxSCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptx
Ronisha Das
 
Handling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptxHandling Multiple Choice Responses: Fortune Effiong.pptx
Handling Multiple Choice Responses: Fortune Effiong.pptx
AuthorAIDNationalRes
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
Presentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem KayaPresentation of the MIPLM subject matter expert Erdem Kaya
Presentation of the MIPLM subject matter expert Erdem Kaya
MIPLM
 
Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025
Mebane Rash
 
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...Multi-currency in odoo accounting and Update exchange rates automatically in ...
Multi-currency in odoo accounting and Update exchange rates automatically in ...
Celine George
 
Unit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdfUnit 6_Introduction_Phishing_Password Cracking.pdf
Unit 6_Introduction_Phishing_Password Cracking.pdf
KanchanPatil34
 

ETL Process & Data Warehouse Fundamentals

  • 1. UNIT 2 Mr.T.Somasundaram Assistant Professor Department Of Management Kristu Jayanti College (Autonomous), Bengaluru
  • 2. UNIT 2 ETL Process and Maintenance of Data Warehouse Data Extraction, Data Transformation, Data loading, Data Quality, Data Warehouse design reviews, Testing and Monitoring the data warehouse.
  • 3. • ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. • It is a process in which an ETL tool extracts the data from various data source systems, transforms it in the staging area, finally loads it into the Data warehouse system. • It is data integration process that combines data from multiple data sources into a single, consistent data store into a data warehouse. • This may contains customize the tool to suit the need of the enterprises. (E.g.) ETL tool sets for long-term analysis & usage of data in banking, insurance claims, retail sales history, etc. ETL Process
  • 6. • Google BigQuery. • Amazon Redshift. • Informatica – PowerCenter. • IBM – Infosphere Information Server. • Oracle Data Integrator. • SQL Server Integration Services. ETL Tools
  • 7. 1) Scalability – unlimited scalability are available at the click of a button. (i.e.) capacity to be changed in size. 2) Simplicity – it saves time, resources and avoids lot of complexity. 3) Out of the box – open source ETL require customization and cloud-based ETL requires integration. 4) Compliance – it finds easy way to skip complicated and risky compliance setups. 5) Long-term costs – it is cheaper with open source ETL tools but will cost make it for long run. Benefits of ETL Tools
  • 8. Extraction (E): The first step is extraction of data, source system’s data is accessed first and prepared further for processing and extracting required values. It is extracted in various formats like relational databases, No SQL, XML and flat files, etc. It is important to store extract data in staging area, not directly into data warehouse as it may cause damage and rollback will be much difficult. Phase (Steps) of ETL Process
  • 9. Extraction has three approach - a) Update Notification – the data is changed or altered in the source system, it will notify users about the change. b) Incremental Extract – many systems are incapable of providing notification but are efficient enough to track down the changes that made to source data. c) Full extract – the whole data is extracted, when system is neither able to notify nor able to track down the changes. Old copy of data is maintained to identify the change. (E.g.) phone numbers, Email conversion to standard form, validation of address fields, etc. Phase (Steps) of ETL Process
  • 10. The data extraction issues are - Source Identification – identify source applications and source structures. Method of extraction – for each data source, define whether the extraction process is manual or tool-based. Extraction frequency – for each data source, establish how frequently the data extraction must be done (daily, weekly, quartely, etc.) Time Window – for each data source, denote the time window for the extraction process. Job sequencing – determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully. Exception handling – determine how to handle input records that can’t be extracted. Phase (Steps) of ETL Process
  • 11. The following are the guidelines adopted for extracts as best practices - The extract processing should identify changes since the last extract. Interface record types should be defined for the extraction based on entities in data warehouse model. (E.g.) Client information extracted from source may be categorized into person attributes, contact point information, etc. When changes sent to data warehouse, all current attributes for changed entity should be also sent. Any interface record should be categorized as – Records which have been added to operational database since the last extract. Records which have been deleted from operational database since the last extract. Phase (Steps) of ETL Process
  • 12. Transformation (T): The second step of ETL process is transformation. A set of rules or functions are applied on extracted data to convert it into a single standard format. It includes dimension conversion, aggregation, joining, derivation and calculations of new values. Phase (Steps) of ETL Process
  • 13. Transformation involves the following processes or tasks – a) Filtering – loading only certain attributes into the data warehouse. b) Cleaning – filling up NULL values with some default values, mapping U.S.A, United States, and America into USA, etc. c) Joining – joining multiple attributes into one. d) Splitting – splitting a single attribute into multiple attributes. e) Sorting – sorting tuples on the basis of some attribute (generally the key attribute). Phase (Steps) of ETL Process
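A minimal sketch of several of these tasks (filtering, cleaning, splitting and sorting), assuming customer records arrive as Python dicts with hypothetical id, name, country and revenue fields; joining would combine attributes in the same style:

```python
def transform(records):
    """Apply basic transformation tasks to a list of customer dicts."""
    country_map = {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    out = []
    for rec in records:
        # Filtering: keep only the attributes the warehouse needs.
        row = {k: rec.get(k) for k in ("id", "name", "country", "revenue")}
        # Cleaning: fill NULLs with defaults and standardise country values.
        row["revenue"] = row["revenue"] if row["revenue"] is not None else 0.0
        row["country"] = country_map.get(row["country"], row["country"])
        # Splitting: break the single "name" attribute into first/last name.
        first, _, last = (row.pop("name") or "").partition(" ")
        row["first_name"], row["last_name"] = first, last
        out.append(row)
    # Sorting: order the tuples on the key attribute.
    return sorted(out, key=lambda r: r["id"])
```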
  • 14. Major Data Transformation Types: a) Format Revisions – these include changes to the data types and lengths of individual fields. (E.g.) product package type indicated by codes and names, where the fields are numeric and text data types. b) Decoding of Fields – this is common when dealing with multiple source systems in which the same data items are described by different field values. (E.g.) the coding for Male and Female may be 1 and 2 in one source system and M and F in another source system. c) Splitting of Single Fields – earlier systems stored name, address, city and state data together in a single field, but the individual components need to be stored in separate fields in the data warehouse. Phase (Steps) of ETL Process
  • 15. d) Merging of Information – this does not mean the merging of several fields to create a single field of data. (E.g.) information about a product may come from different data sources, with product code and package type coming from another data source. Merging of information denotes the combination of product code, description and package type into a single entity. e) Character Set Conversion – this is the conversion of character sets to an agreed standard character set for text data in the data warehouse. f) Conversion of Units of Measurement – metrics based on overseas operations must be converted so that the numbers are all in one standard unit of measurement. g) Date / Time Conversion – this is the representation of date and time in standard formats. Phase (Steps) of ETL Process
  • 16. h) Summarization – this type of transformation creates summaries to be loaded into the data warehouse instead of loading the most granular level of data. (E.g.) it is not necessary to store every single credit card transaction in the data warehouse; instead, summarize the daily transactions for each credit card and store the summary data. i) Key Restructuring – while extracting data from input sources, look at the primary keys of the extracted records and come up with keys for the fact and dimension tables based on the keys in the extracted records. j) Deduplication – in most companies, customer files have several records for the same customer, often created by mistake. The aim is to maintain one record per customer and link all duplicates in the source systems to that single record. Phase (Steps) of ETL Process
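The following sketch illustrates three of the transformation types above: decoding of fields, summarization and deduplication. The field names (card_no, source_id, postcode, etc.) are hypothetical, and real deduplication would use far more sophisticated matching rules:

```python
from collections import defaultdict

GENDER_CODES = {"1": "M", "2": "F", "M": "M", "F": "F"}  # decoding of fields

def decode_gender(value):
    """Decode source-specific gender codes to a single warehouse standard."""
    return GENDER_CODES.get(str(value).strip().upper(), "U")

def summarize_daily(transactions):
    """Summarization: roll card transactions up to one row per card per day."""
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["card_no"], t["date"])] += t["amount"]
    return [
        {"card_no": c, "date": d, "daily_total": amt}
        for (c, d), amt in totals.items()
    ]

def deduplicate(customers):
    """Deduplication: keep one record per customer, cross-referencing duplicates."""
    survivors, xref = {}, {}
    for c in customers:
        key = (c["name"].lower(), c["postcode"])
        if key not in survivors:
            survivors[key] = c
        xref[c["source_id"]] = survivors[key]["source_id"]
    return list(survivors.values()), xref
```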
  • 17. Loading (L): • The third and final step of the ETL process is loading. • Transformed data is finally loaded into the data warehouse. • Data is loaded into the data warehouse frequently, but at regular intervals. • Indexes and constraints previously applied to the data need to be disabled before loading commences. • The rate and period of loading depend on requirements and vary from system to system. • During the loads, the data warehouse has to be offline. • A time period should be identified when loads may be scheduled without affecting data warehouse users. • Consider dividing the whole load process into smaller chunks and populating a few files at a time. Phase (Steps) of ETL Process
  • 18. Mode of Loading (L): a) Load – if the target table already exists and contains data, the load process wipes out the existing data and applies the data from the incoming file. If the table is empty before loading, the load process applies the data from the incoming file. b) Append – this is an extension of the load. If data already exists in the table, the append process unconditionally adds the incoming data, preserving the existing data in the target table. Incoming duplicate records may be rejected during the append process. c) Destructive Merge – in this mode, apply the incoming data to the target data. If an incoming record matches the key of an existing record, update the matching target record; if not, add the incoming record to the target table. Phase (Steps) of ETL Process
  • 19. Mode of Loading (L): d) Constructive Merge – this mode is the opposite of the destructive merge. (i.e.) if an incoming record matches the key of an existing record, leave the existing record, add the incoming record and mark the added record as superseding the old record. e) Initial Load – loads the whole data warehouse in a single run. It is possible to split the load into separate subloads and run them as single loads. If more than one run is needed to create a single table, it is scheduled to run over several days. f) Incremental Loads – these are the applications of ongoing changes from the source systems. They need a method to preserve the periodic nature of the changes in the data warehouse. Constructive merge mode is an appropriate method for incremental loads. g) Full Refresh – this application of data involves periodically rewriting the entire data warehouse. Sometimes partial refreshes rewrite only specific tables, but partial refreshes are rare because the dimension tables are intricately tied to the fact table. Phase (Steps) of ETL Process
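A minimal sketch contrasting destructive and constructive merge, assuming the target is held as an in-memory dict keyed on the record key; a real warehouse load would typically express the same logic with SQL MERGE/UPSERT statements or bulk-load utilities:

```python
def destructive_merge(target, incoming):
    """Destructive merge: update matching keys in place, insert new keys."""
    for key, record in incoming.items():
        target[key] = record          # update if present, add if not
    return target

def constructive_merge(target, incoming):
    """Constructive merge: keep the old record and add the new one,
    marking the new record as superseding the old."""
    for key, record in incoming.items():
        versions = target.setdefault(key, [])
        if versions:
            versions[-1]["current"] = False   # old record is superseded
        versions.append({**record, "current": True})
    return target
```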
  • 20. Data Quality (DQ) in Data Warehouse What is Data Quality? To an IT professional, data quality is quite often associated with the accuracy of a data element. (E.g.) Consider Customer as an entity with attributes such as Customer Name, Customer Address, Customer State, Customer Mobile No., etc. Data accuracy means the attribute values of the customer entity correctly describe that particular customer. (i.e.) if data is fit for the purpose for which it is intended, then such data has quality. Data quality is related to the usage of the data item as defined by the users. Data quality in operational systems requires that database records conform to field validation edits; this contributes to data quality, but single-field edits alone do not constitute data quality.
  • 21. Definition: Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system. • Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole. (E.g.) In an online ordering system, while entering data about customers in the order entry application, we may collect demographics for each customer. • Sometimes these demographic fields are not needed for order entry and receive little attention. • When such data is later accessed as part of the integrated whole, it lacks data quality. (E.g.) Some of a customer's information may or may not be important when filling in an application form (especially in banking processes).
  • 22. Data Accuracy Vs Data Quality – Difference between Data Accuracy and Data Quality:
Data Accuracy: a specific instance of an entity accurately represents that occurrence of the entity. / Data Quality: the data item is exactly fit for the purpose for which the business users have defined it.
Data Accuracy: the data element is defined in terms of database technology. / Data Quality: a wider concept grounded in the specific business of the company.
Data Accuracy: the data element conforms to validation constraints. / Data Quality: relates not just to single data elements but to the system as a whole.
Data Accuracy: individual data items have the correct data types. / Data Quality: the form and content of data elements are consistent across the whole system.
Data Accuracy: traditionally relates to operational systems. / Data Quality: essentially needed in a corporate-wide data warehouse for business users.
  • 23. Characteristics (Dimensions) of Data Quality The data quality dimensions are – 1. Accuracy: the value stored in the system for a data element is the right value for that occurrence of the data element. (E.g.) getting the correct customer address. 2. Domain Integrity: the data value of an attribute falls within the range of allowable, defined values. (E.g.) Male and Female for the gender data element. 3. Data Type: the value for a data attribute is actually stored as the data type defined for the attribute. (E.g.) the Name field is defined as 'text'. 4. Consistency: the form and content of a data field are the same across multiple source systems. (E.g.) the product code for Product A is 1234.
  • 24. 5. Redundancy: the same data must not be stored in more than one place in a system. 6. Completeness: there are no missing values for a given attribute in the system. 7. Duplication: duplication of records in a system is completely resolved. (E.g.) duplicate records are identified and a cross-reference is created. 8. Conformance to Business Rules: the values of each data item adhere to prescribed business rules. (E.g.) in an auction system, the sale price can't be less than the reserve price. 9. Structural Definiteness: wherever a data item can naturally be structured into individual components, the item must contain this well-defined structure. (E.g.) names are divided into first name, middle name and last name, which reduces missing values.
  • 25. 10. Data Anomaly: a field must be used only for the purpose for which it is defined. (E.g.) the third line of the address column should contain the third line of a long address, not phone or fax numbers. 11. Clarity: a data element may possess all the other characteristics of quality data, but it is of little use if users do not understand its meaning clearly. 12. Timeliness: the users determine the timeliness of the data. (E.g.) updating the data in the customer database on a daily basis. 13. Usefulness: every data element in the data warehouse must satisfy some requirement of the collection of users. 14. Adherence to Data Integrity Rules: the data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules.
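To make the dimensions concrete, here is a minimal sketch of a few of these checks (completeness, domain integrity, data type and duplication) over customer records held as Python dicts; the field names are illustrative assumptions:

```python
def check_quality(rows):
    """Count data quality issues across a list of customer dicts."""
    issues = {"missing_values": 0, "bad_domain": 0, "bad_type": 0, "duplicates": 0}
    seen_keys = set()
    for r in rows:
        # Completeness: no missing values for mandatory attributes.
        if not r.get("name") or not r.get("gender"):
            issues["missing_values"] += 1
        # Domain integrity: gender must fall in the allowed set of values.
        if r.get("gender") not in ("M", "F"):
            issues["bad_domain"] += 1
        # Data type: the name must actually be stored as text.
        if not isinstance(r.get("name"), str):
            issues["bad_type"] += 1
        # Duplication: the same customer key must not appear twice.
        if r.get("customer_id") in seen_keys:
            issues["duplicates"] += 1
        seen_keys.add(r.get("customer_id"))
    return issues
```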
  • 26. Data Quality Challenges (Problems) in DW – [Bar chart: Data Warehouse Challenges, showing percentages (0%–50%) for data quality, data modeling, user expectations, data transformation, business rules, management expectations and database performance.]
  • 27. Data Quality Framework – [Diagram: a data quality framework involving IT professionals and user representatives – establish a Data Quality Steering Committee; agree on a suitable data quality framework; identify the business functions affected most by bad data; institute data quality policy and standards; select high-impact data elements and determine priorities; define quality measurement parameters and benchmarks; plan and execute data cleansing for high-impact data elements (initial data cleansing efforts); plan and execute data cleansing for other, less severe elements (ongoing data cleansing efforts).]
  • 28. Data Quality – Participants and Roles – [Diagram: participants in data quality initiatives – Data Consumer (User Dept.), Data Producer (User Dept.), Data Expert (User Dept.), Data Integrity Specialist (User Dept.), Data Correction Authority (IT Dept.), Data Consistency Expert (IT Dept.), Data Policy Administrator (IT Dept.).]
  • 29. The responsibilities for the roles are - 1. Data Consumers: Uses the data warehouse for queries, reports and analysis. Establishes the acceptable levels of data quality. 2. Data Producer: Responsible for the quality of data input into the source systems. 3. Data Expert: Expert in the subject matter and the data itself of the source systems. Responsible for identifying pollution in the source systems. 4. Data Policy Administrator: Ultimately responsible for resolving data corruption as data is transformed and moved into the data warehouse.
  • 30. The responsibilities for the roles are - 5. Data Integrity Specialist: Responsible for ensuring that the data in the source systems conforms to the business rules. 6. Data Correction Authority: Responsible for actually applying the data cleansing techniques through the use of tools or in-house programs. 7. Data Consistency Expert: Responsible for ensuring that all data within the data warehouse (various data marts) are fully synchronized.
  • 31. Data Quality Tools The useful data quality tools are – 1. Categories of Data Cleansing Tools: they assist in two ways – • Data error discovery tools work on the source data to identify inaccuracies and inconsistencies. • Data correction tools help fix the corrupt data, using a series of algorithms to parse, transform, match, consolidate and correct the data. 2. Error Discovery Features: the following is a list of error discovery functions that data cleansing tools are capable of performing – • Quickly and easily identify duplicate records. • Identify data items whose values are outside the range of legal domain values. • Find inconsistent data.
  • 32. • Check for ranges of allowable values. • Detect inconsistencies among data items from different sources. • Allow users to identify and quantify data quality problems. • Monitor trends in data quality over time. • Report to users on the quality of data used for analysis. • Reconcile problems of RDBMS referential integrity. 3. Data Correction Features: the following list describes the typical error correction functions that data cleansing tools are capable of performing – • Normalize inconsistent data. • Improve merging of data from dissimilar data sources. • Group and relate customer records belonging to the same household. • Provide measurements of data quality. • Standardize data elements to common formats. • Validate for allowable values.
  • 33. 4. DBMS for Quality Control: the database management system can be used as a tool for data quality control in many ways; in particular, an RDBMS prevents several types of errors from creeping into the data warehouse – • Domain integrity – provide domain value edits; prevent entry of data if the entered value is outside the defined limits. • Update security – prevent unauthorized updates to the databases, which stops unauthorized users from updating data in an incorrect way. • Entity integrity checking – ensure that duplicate records with the same primary key value are not entered. • Minimize missing values – ensure that nulls are not allowed in mandatory fields. • Referential integrity checking – ensure that relationships based on foreign keys are preserved. • Conformance to business rules – use trigger programs and stored procedures to enforce business rules.
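A minimal illustration of these DBMS-level controls using Python's built-in sqlite3 module; the product/sale tables are hypothetical, and other RDBMSs express the same constraints with slightly different syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY               -- entity integrity: no duplicate keys
);
CREATE TABLE sale (
    sale_id      INTEGER PRIMARY KEY,
    product_code INTEGER NOT NULL                  -- minimize missing values
                 REFERENCES product(product_code), -- referential integrity
    sale_price   REAL NOT NULL
                 CHECK (sale_price >= 0)           -- domain integrity: value edits
);
""")
conn.execute("PRAGMA foreign_keys = ON")

# This insert is rejected: the price is negative and product 999 does not exist.
try:
    conn.execute("INSERT INTO sale VALUES (1, 999, -5.0)")
except sqlite3.IntegrityError as err:
    print("rejected by the DBMS:", err)
```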
  • 34. Benefits of Data Quality Some specific areas where data quality brings definite benefits – • Analysis with timely information. • Better customer service. • Newer opportunities. • Reduced costs and risks. • Improved productivity. • Reliable strategic decision making.
  • 35. Data Warehouse Design Reviews One of the most effective techniques for ensuring quality in the operational environment is the design review. Errors can be detected and resolved prior to coding through a design review. The cost benefit of identifying errors early in the development life cycle is enormous. A design review is usually done on completion of the physical design of an application. Some of the issues around operational design review are as follows – • Transaction performance • System availability • Project readiness • Batch window adequacy • Capacity • User requirements satisfaction
  • 36. Views of Data Warehouse Design The four views regarding a data warehouse design that must be considered – 1. Top-Down View: this allows the selection of the relevant information necessary for the data warehouse. This information matches current and future business needs. 2. Data Source View: it exposes the information being captured, stored and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship model.
  • 37. Views of Data Warehouse Design 3. Data Warehouse View: this view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including pre-calculated totals and counts. Information regarding the source, date and time of origin is added to provide historical context. 4. Business Query View: this view is the data perspective in the data warehouse from the end-user's viewpoint.
  • 38. Data Warehouse Design Approaches A data warehouse can be built using three approaches – a) The top-down approach: it starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The process begins with an ETL process working from external data sources. In the top-down model, integration between the data warehouse and the data marts is automatic as long as the data marts are maintained as subsets of the data warehouse.
  • 39. Data Warehouse Design Approaches b) The bottom-up approach: The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. This approach is to construct the data warehouse incrementally over time from independently developed data marts. In this approach, data flows from sources into data marts, then into the data warehouse.
  • 40. Data Warehouse Design Approaches c) The combined approach: in this approach, both the top-down and bottom-up approaches are exploited. An organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.
  • 41. Data Warehouse Design Process The general data warehouse design process involves the following steps – Step 1: Choosing the appropriate business process: • Based on needs and requirements, there are two types of models: the data warehouse model and the data mart model. • The data warehouse model is chosen if the business process is organizational and involves many complex object collections. • A data mart model is chosen if the business process is departmental and focuses on the analysis of a particular process. Step 2: Choosing the grain of the business process: • The grain is the fundamental level of data represented in the fact table for the chosen business process. (E.g.) individual snapshots, individual transactions, etc.
  • 42. Data Warehouse Design Process Step 3: Choosing the dimensions: this includes selecting the various dimensions, such as time, item, status, etc., that need to be applied to each fact table record. Step 4: Choosing the measures: this includes selecting the various measures, such as items_sold, euros_sold, etc., which fill up each fact table record.
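As a rough illustration of Steps 2–4, the sketch below declares a tiny star schema in sqlite3 via Python: dimension tables for time and item, and a fact table whose assumed grain is one row per item per day with items_sold and euros_sold as measures. The table and column names are assumptions, not taken from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions chosen in Step 3: time and item.
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);

-- Fact table at the grain chosen in Step 2 (one row per item per day),
-- carrying the measures chosen in Step 4.
CREATE TABLE fact_sales (
    time_key   INTEGER REFERENCES dim_time(time_key),
    item_key   INTEGER REFERENCES dim_item(item_key),
    items_sold INTEGER,
    euros_sold REAL
);
""")
```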
  • 43. Testing & Monitoring the Data Warehouse Definition: Data Warehouse testing is the process of building and executing comprehensive test cases to ensure that data in a warehouse has integrity and is reliable, accurate and consistent with the organization’s data framework. Testing is very important for data warehouse systems for data validation and to make them work correctly and efficiently. Data Warehouse Testing is a series of Verification and Validation activities performed to check for the quality and accuracy of the Data Warehouse and its contents.
  • 44. There are five basic levels of testing performed on a data warehouse – 1. Unit Testing: this type of testing is performed at the developer's end. In unit testing, each unit / component of the modules is tested separately. Each module of the whole data warehouse (i.e. program, SQL script, procedure, Unix shell script) is validated and tested. 2. Integration Testing: in this type of testing, the various individual units / modules of the application are brought together or combined and then tested against a number of inputs. It is performed to detect faults in the integrated modules and to test whether the various components are performing well after integration.
  • 45. 3. System Testing: • System testing is the form of testing that validates and tests the whole data warehouse application. • This type of testing is performed by the technical testing team. • This test is conducted after the developer's team performs unit testing, and its main purpose is to check whether the entire system works altogether or not. 4. Acceptance Testing: • To verify that the entire solution meets the business requirements and successfully supports the business processes from a user's perspective. 5. System Assurance Testing: • To ensure and verify the operational readiness of the system in a production environment. • This is also referred to as the warranty period coverage.
  • 46. Challenges of data warehouse testing are – • Data selection from multiple sources, and the analysis that follows, pose a great challenge. • Because of the volume and complexity of the data, certain testing strategies are time-consuming. • ETL testing requires Hive / SQL skills, which poses challenges for testers who have limited SQL skills. • Redundant data in the data warehouse, and inconsistent and inaccurate reports.
  • 47. Data Warehouse Testing Process Testing a data warehouse is a multi-step process that involves activities like identifying business requirements, designing test cases, setting up a test framework, executing the test cases and validating data. The steps for the testing process are – Step 1: Identify the various entry points: as loading data into a warehouse involves multiple stages, it is essential to find the various entry points so that data can be tested at each of those stages. If testing is done only at the destination, it can be confusing when errors are found, as it becomes more difficult to determine the root cause.
  • 48. Step 2: Prepare the required collaterals: two fundamental collaterals required for the testing process are a database schema representation and a mapping document. The mapping document is usually a spreadsheet which maps each column in the source database to the destination database. A data integration solution can help generate the mapping document, which is then used as an input to design test cases. Step 3: Design an elastic, automated and integrated testing framework: ETL is not a one-time activity. While some data is loaded all at once and some through batches, new updates may trickle in through streaming queues. The testing framework design has to be generic and architecturally flexible to accommodate new and diverse data sources and types, higher volumes, and the ability to work seamlessly with cloud and on-premises systems.
  • 49. Integrating the test framework with an automated data solution (that contains features as discussed in the previous section) increases the efficiency of the testing process. Step 4: Adopt a comprehensive testing approach: the testing framework needs to aim for 100% coverage of the data warehousing process. It is important to design multiple testing approaches such as unit, integration, functional, and performance testing. The data itself has to be scrutinized with many checks, including looking for duplicates, matching record counts, completeness, accuracy, loss of data, and correctness of transformation.
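A minimal sketch of one such data check, assuming source and warehouse rows are available as lists of dicts sharing a hypothetical order_id key; it compares record counts and looks for duplicates and lost records:

```python
def reconcile(source_rows, warehouse_rows, key="order_id"):
    """Compare record counts and detect duplicates and lost records."""
    src_keys = [r[key] for r in source_rows]
    dwh_keys = [r[key] for r in warehouse_rows]
    return {
        "count_match": len(source_rows) == len(warehouse_rows),
        "duplicates_in_target": len(dwh_keys) - len(set(dwh_keys)),
        "missing_in_target": sorted(set(src_keys) - set(dwh_keys)),
    }

# Example usage with tiny in-memory samples:
src = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
dwh = [{"order_id": 1}, {"order_id": 2}]
assert reconcile(src, dwh)["missing_in_target"] == [3]
```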
  • 50. Testing Operational Environment There are a number of aspects that need to be tested, as below – 1. Security: • A separate security document is required for security testing. This document contains a list of disallowed operations, and tests are devised for each. 2. Scheduler: • Scheduling software is required to control the daily operations of a data warehouse, and it needs to be tested during system testing. The scheduling software requires an interface with the data warehouse; the scheduler is needed to control overnight processing and the management of aggregations. 3. Disk Configuration: • Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
  • 51. 4. Management Tools: It is required to test all the management tools during system testing. Here is the list of tools that need to be tested. • Event manager • System manager • Database manager • Configuration manager • Backup recovery manager
  • 52. Testing the Database The database is tested in the following three ways – 1. Testing the database manager and monitoring tools: • To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database. 2. Testing database features: • Here is the list of features that we have to test – querying in parallel, creating an index in parallel, loading data in parallel. 3. Testing database performance: • Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly and they should be tested.
  • 53. Data Warehouse Monitoring Data warehouse monitoring helps to understand how the data warehouse is performing. Some of the several reasons for monitoring are – It ensures top performance. It ensures excellent usability. It ensures the business can run efficiently. It prevents security issues. It ensures governance and compliance.