ETL Interview Questions

What is ETL?

Extract – Extracting the data from the source system
Transform – Transforming or modifying the data into the format the business requires
Load – Loading the data into the target database

What are the transformation types?

Active Transformation
The output record count of the transformation may or may not be equal to the input record count.
For example, when we apply a filter transformation on the age column with the condition age
between 25 and 40, only the rows that satisfy this condition come out, so the output count cannot
be predicted.
Passive Transformation
The output record count of the transformation is always equal to the input record count.
For example, when we apply an expression transformation to concatenate the first name and last
name columns, every row comes out, even rows where those columns have no values. (A SQL
sketch of both cases appears after this list.)
Connected Transformation
A transformation which is linked to another transformation or to the target component is called
connected.
Unconnected Transformation
A transformation which is not linked to any other transformation or to the target component is
called unconnected.
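As a rough illustration (using a hypothetical EMPLOYEE table, not one from the document), the same two transformations can be expressed in SQL: the filter may drop rows, so it is active, while the expression returns exactly one output row per input row, so it is passive.

-- Active: filter transformation; the output row count may be lower than the input count
SELECT *
FROM   employee
WHERE  age BETWEEN 25 AND 40;

-- Passive: expression transformation; one output row per input row,
-- even when first_name or last_name is NULL
SELECT employee_id,
       first_name || ' ' || last_name AS full_name
FROM   employee;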
What are the types of load?
Full Load (Initial Load, Bulk Load, or Fresh Load) –
The data loading process performed the very first time. It can also be referred to as a bulk load or
fresh load.
The job extracts the entire volume of data from the source table or file and loads it into the
truncated target table after applying the transformation logic.
Incremental Load (Refresh Load, Daily Load, or Change Data Capture) –
Only the modified data is updated in the target, following the full load. The changes are captured
by comparing the created or modified date against the last run date of the job.
Only the modified data is extracted from the source: the job looks for changes in the source table
against the job run table; if changes exist, that data alone is extracted and updated in the target
without impacting the existing data. (See the sketch after this answer.)
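A minimal SQL sketch of the change-capture idea, assuming hypothetical SOURCE_CUSTOMER and JOB_RUN_CONTROL tables with a modified_date column and a last_run_date column:

-- Pick up only the rows created or modified since the last successful run
SELECT s.*
FROM   source_customer s
WHERE  s.modified_date > (SELECT MAX(last_run_date)
                          FROM   job_run_control
                          WHERE  job_name = 'CUSTOMER_INCR_LOAD');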
Name some ETL tools
Informatica PowerCenter
Talend Open Studio
IBM DataStage
SQL Server Integration Services (SSIS)
Ab Initio
Oracle Data Integrator
SAS Data Integration Studio
SAP BusinessObjects Data Integrator
CloverETL
Pentaho Data Integration

Explain the scenarios for testing source to a staging table.

- Verify the table structure of the staging table (columns, data type, length, constraints, index)
- Verify the successful workflow (ETL job) run
- Verify the data count between the source and staging table
- Verify the data comparison between the source table and staging table
- Verify the duplicate data check; duplicate data should not be loaded into the staging table
- Verify that excess trailing spaces are trimmed for all VARCHAR columns
- Verify the job consistency by performing a subsequent run
- Verify the job behavior on failure runs
- Verify the job re-run success scenario after failure correction
- Verify the job run with bad data (NULL values, exceeded precision, missing lookup or
reference data)
- Verify the job performance timing
(Some of these checks are sketched in SQL below.)
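A few of these checks can be scripted directly in SQL. The sketch below assumes hypothetical SOURCE_CUSTOMER and STG_CUSTOMER tables on an Oracle database:

-- Count comparison between source and staging
SELECT (SELECT COUNT(*) FROM source_customer) AS src_count,
       (SELECT COUNT(*) FROM stg_customer)    AS stg_count
FROM   dual;

-- Duplicate check: no key should appear more than once in staging
SELECT customer_id, COUNT(*)
FROM   stg_customer
GROUP  BY customer_id
HAVING COUNT(*) > 1;

-- Trailing-space check on a VARCHAR column
SELECT customer_id
FROM   stg_customer
WHERE  customer_name <> RTRIM(customer_name);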
How do you ensure that all source table data is loaded into the target table?

- Using the SET operator MINUS – if both source and target tables are on the same database server
(see the sketch below)
- Using an Excel macro – both source table and target table data are copied into Excel and compared
with a macro
- Using automation tools – the tool fetches the data and compares it internally with its own algorithm
- Using utility tools – develop an automation utility tool using Java or any scripting language
along with database drivers
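For example, a MINUS comparison, assuming both tables are on the same Oracle server and share the same column list (table and column names here are placeholders):

-- Rows present in source but missing in target
SELECT customer_id, customer_name, state FROM source_customer
MINUS
SELECT customer_id, customer_name, state FROM target_customer;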
Give an example of a low severity and high priority defect.
- There is a requirement that an email notification be triggered in case of job failure
- A deviation is found during testing: the email notification is received, but the record count in
the content does not match
- Low severity – since it does not affect any functionality
- High priority – since the wrong data count gives the wrong picture to the management team
What are the components of Informatica?
Informatica PowerCenter is one of the most widely used tools worldwide, mainly used for ETL, data
masking, and data quality.
It has four major components:
1. Repository Manager – to add repositories and manage folders
2. Designer – creating mappings
3. Workflow Manager – creating workflows with tasks and mappings
4. Workflow Monitor – workflow run status tracker

What are the tasks available in Informatica?

The major tasks available in the Informatica PowerCenter tool are:
1. Session
2. Email
3. Command
4. Control
5. Decision
6. Timer
Database testing vs ETL testing
ETL Testing – Verifying whether the data is loaded properly from source to target, along with the
business transformation rules.
Database Testing – Testing whether the data is stored properly in the database when operations are
performed from the front end or back end, along with testing of procedures, functions, and
triggers; also testing whether the data is retrieved properly in the UI.
What are the responsibilities of an ETL tester?
- Understanding requirements
- Estimating
- Planning
- Test case preparation
- Test execution
- Giving sign off

What does a mapping document contain?

A mapping document contains:
1. Column mapping between source and target
2. Data type and length for all columns of source and target
3. Transformation logic for each column
4. ETL job or workflow information
5. Input parameter file information

What kind of defects can you expect?

- Table structure issue
- Index unable to be dropped
- Index not created after job run
- Data issue in source table
- Data count mismatch between source and target
- Data not matching between source and target
- Duplicate data loaded
- Trim and NULL issues
- Data precision issue
- Date format issue
- Business transformation rules issue
- Subsequent job run not working properly
- Running the job with bad data does not reject the bad data properly
- Rollback not happening in case of job failure
- Performance issue
- Log file and content issue
- Mail notification and content issue

1000 records are in the source table, but only 900 records are loaded into the target table.
How do you find the missing 100 records?

- Using the SET operator MINUS – if both source and target tables are on the same database server
- Using an Excel macro – both source table and target table data are copied into Excel and
compared with a macro
- Using an automation tool – the tool fetches the data and compares it internally with its own algorithm
- Using a utility tool – develop an automation utility tool using Java or any scripting language
along with database drivers
Can you give a few test cases to test an incremental load table?
- Insert a few records and validate the data after the job run
- Update non-primary column values and validate the data after the job run
- Update primary column values and validate the data after the job run
- Delete a few records and validate the data after the job run
- Insert/update a few records to create duplicate entries and validate the data after the job run
- Update with bad data – NULL values, blank spaces, missing lookup data
How do you compare a flat file and a database table?

- Manual sampling method – manually compared on a sampling basis
- Using an Excel macro – flat file data and target table data are copied into Excel and
compared with a macro
- Using an automation tool – the tool fetches the data and compares it internally with its own algorithm
- Using a utility tool – develop an automation utility tool using Java or any scripting language
along with database drivers
(An Oracle external-table approach is also sketched below.)
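One SQL-only approach, sketched here for Oracle, is to expose the flat file as an external table and then reuse the MINUS comparison. The directory, file name, and columns below are assumptions, not values from the document:

-- Expose the flat file as a read-only external table
CREATE TABLE customer_file_ext (
  customer_id   NUMBER,
  customer_name VARCHAR2(100),
  state         VARCHAR2(50)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('customer.csv')
);

-- Rows in the file that are missing from the table (swap the operands for the reverse check)
SELECT customer_id, customer_name, state FROM customer_file_ext
MINUS
SELECT customer_id, customer_name, state FROM target_customer;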

Slowly Changing Dimensions (SCD)


The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a
nutshell, this applies to cases where the attribute for a record varies over time. We give an example
below:

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the
customer lookup table has the following record:

Customer Key | Name      | State
1001         | Christina | Illinois

At a later date, in January 2003, she moved to Los Angeles, California. How should ABC Inc. now
modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.
There are in general three ways to solve this type of problem, and they are categorized as follows:

Slowly Changing Dimension Type 1: The new record replaces the original record. No trace
of the old record exists.

          In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.

In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois

After Christina moved from Illinois to California, the new information replaces the original record,
and we have the following table:

Customer Key | Name      | State
1001         | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to
keep track of the old information.

Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:
About 50% of the time.

When to use Type 1:


Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to
keep track of historical changes.
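In SQL, a Type 1 change is a plain overwrite of the existing row. The sketch below assumes a CUSTOMER_DIM table shaped like the example above:

-- Type 1: overwrite in place, no history is kept
UPDATE customer_dim
SET    state = 'California'
WHERE  customer_key = 1001;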

Slowly Changing Dimension Type 2: A new record is added into the
customer dimension table. Therefore, the customer is treated essentially as two people.

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record gets its
own primary key.

In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the
table:

Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California
Advantages:
- This allows us to accurately keep all historical information.

Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is
very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to
track historical changes.
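A Type 2 change can be sketched as two statements: expire the current row and insert a new one with its own surrogate key. The effective/expiry/active columns are assumptions, since the example table above does not show them:

-- Expire the existing version of the row
UPDATE customer_dim
SET    expiry_date = DATE '2003-01-15',
       is_active   = 'N'
WHERE  customer_key = 1001
AND    is_active    = 'Y';

-- Insert the new version with a new surrogate key
INSERT INTO customer_dim (customer_key, name, state, effective_date, expiry_date, is_active)
VALUES (1005, 'Christina', 'California', DATE '2003-01-15', NULL, 'Y');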

Slowly Changing Dimension Type 3: The original record is modified to reflect the change.

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute
of interest, one indicating the original value, and one indicating the current value. There will also be
a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:

Customer Key | Name      | State
1001         | Christina | Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
o Customer Key
o Name
o Original State
o Current State
o Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have
the following table (assuming the effective date of change is January 15, 2003):

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.

Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will be
lost.

Usage:
Type 3 is rarely used in actual practice.

When to use Type 3:

Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse
to track historical changes, and when such changes will only occur a finite number of times.
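A Type 3 change updates the same row: the Original State column keeps the first value, and only the current-value columns are overwritten, which is why intermediate values are lost. A sketch against the example table:

-- Type 3: Original_State keeps 'Illinois'; only the current value and its date change
UPDATE customer_dim
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'
WHERE  customer_key   = 1001;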

1.      What is a Data warehouse?

Ans: A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data
in support of management's decision making process.
Subject oriented: means that the data addresses a specific subject such as sales, inventory, etc.
Integrated: means that the data is obtained from a variety of sources.
Time variant: implies that the data is stored with a time element, so that changes to the data over
time can be tracked.
Non volatile: implies that data is never removed, i.e., historical data is also kept.

2.      What is the difference between a database and a data warehouse?

Ans: A database is a collection of related data, whereas a data warehouse stores historical data;
business users take their decisions based on that historical data.

3.      What is the difference between a dimension table and a fact table?

Ans: A dimension table consists of tuples of attributes of the dimension. A fact table can be thought
of as having tuples, one per recorded fact. Each fact contains some measured or observed variables
and identifies them with pointers to dimension tables.

4.      What is the difference between Data Mining and Data Warehousing?

Ans: Data mining – analyzing data from different perspectives and summarizing it into useful
decision-making information. It can be used to increase revenue, cut costs, increase productivity, or
improve any business process. There are a lot of tools available in the market for various industries
to do data mining. Basically, it is all about finding correlations or patterns in large relational
databases.

Data warehousing comes before data mining. It is the process of compiling and organizing data into
one database from various source systems, whereas data mining is the process of extracting
meaningful data from that database (data warehouse).
5.      What is Data Mart
Ans : A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a
single department within an organization. Given their single-subject focus, data marts usually draw
data from only a few sources. The sources could be internal operational systems, a central data
warehouse, or external data.

6.      Difference between OLTP and OLAP


Ans: Online transactional processing (OLTP) is designed to efficiently process high volumes of
transactions, instantly recording business events (such as a sales invoice payment) and reflecting
changes as they occur.
Online analytical processing (OLAP) is designed for analysis and decision support, allowing
exploration of often hidden relationships in large amounts of data by providing unlimited views of
multiple relationships at any cross-section of defined business dimensions.

7.      What is ETL?
Ans: ETL – extract, transform, and load.
- Extracting data from outside source systems.
- Transforming raw data to make it fit for use by different departments.
- Loading transformed data into target systems like a data mart or data warehouse.

8.      Why is ETL testing required?

Ans: To verify the correctness of data transformation against the signed-off business requirements
and rules.

To verify that the expected data is loaded into the data mart or data warehouse without loss of any data.

To validate the accuracy of reconciliation reports (if any, e.g. comparison of transactions made via a
bank ATM – ATM report vs. bank account report).

To make sure the complete process meets performance and scalability requirements.

Data security is also sometimes part of ETL testing.

To evaluate the reporting efficiency.

9.      What are ETL tester responsibilities


Ans: An ETL tester is responsible for writing SQL queries for various scenarios. They run a
number of tests including primary key, duplicate, default, and attribute tests of the process. In
addition, they are in charge of running record count checks as well as reconciling records with
source data. They also confirm the quality of the data and the loading process overall.
10.   What are the key benefits of ETL testing?
Ans: Minimise the risk of data loss
Data security
Data accuracy
Reporting efficiency

11.   To get the list of tables and views in a database

Ans: SELECT * FROM information_schema.tables (displays both tables and views)
SELECT * FROM information_schema.views (displays only views)

12.   List the details about “SMITH”


Ans: Select * from employee where last_name='SMITH';

13.   List out the employees who are working in department 20


Ans: Select * from employee where department_id=20

14.   List out the employees who are earning salary between 3000 and 4500
Ans: Select * from employee where salary between 3000 and 4500

15.   List out the employees who are working in department 10 or 20


Ans: Select * from employee where department_id in (10,20)

16.   Find out the employees who are not working in department 10 or 30


Ans :Select last_name, salary, commission, department_id from employee where department_id not
in (10,30)

17.   List out the employees whose name starts with “S”


Ans:  Select * from employee where last_name like 'S%'

18.   List out the employees whose name start with “S” and end with “H”
Ans:  Select * from employee where last_name like 'S%H'

19.   List out the employees whose name length is 4 and start with “S”
Ans:  Select * from employee where last_name like 'S___'
20.   List out the employees who are working in department 10 and draw salaries of more than
3500
Ans:  Select * from employee where department_id=10 and salary>3500

21.   List out the employees who are not receiving commission.


Ans:  Select * from employee where commission is Null

22.   List out the employee id, last name in ascending order based on the employee id.
Ans:  Select employee_id, last_name from employee order by employee_id
23.   List out the employee id, name in descending order based on salary column
Ans : Select employee_id, last_name, salary from employee order by salary desc
24.   list out the employee details according to their last_name in ascending order and salaries in
descending order
Ans:  Select employee_id, last_name, salary from employee order by last_name, salary desc

25.   list out the employee details according to their last_name in ascending order and then on
department_id in descending order.
Ans:  Select employee_id, last_name, salary from employee order by last_name, department_id desc

26.   How many employees are working in each department of the organization?
Ans:  Select department_id, count(*) from employee group by department_id

27.   List out the department wise maximum salary, minimum salary, average salary of the
employees
Ans:  Select department_id, count(*), max(salary), min(salary), avg(salary) from employee group by
department_id

28.   List out the job wise maximum salary, minimum salary, average salaries of the employees.
Ans:  Select job_id, count(*), max(salary), min(salary), avg(salary) from employee group by
job_id

29.   List out the no. of employees who joined in each month, in ascending order.
Ans:  Select to_char(hire_date,'month') month, count(*) from employee group by
to_char(hire_date,'month') order by month

30.   List out the no. of employees for each month and year, in ascending order based on the
year and month.
Ans:  Select to_char(hire_date,'yyyy') Year, to_char(hire_date,'mon') Month, count(*) "No. of
employees" from employee group by to_char(hire_date,'yyyy'), to_char(hire_date,'mon')
What is a data warehouse?
A data warehouse is a database which:
1. Maintains history of data
2. Contains integrated data (data from multiple business lines)
3. Contains heterogeneous data (data from different source formats)
4. Contains aggregated data
5. Allows only SELECT, to restrict data manipulation
6. Stores data in de-normalized format

Definition of a data warehouse:


1. Subject-oriented
2. Integrated
3. Non-volatile
4. Time-Variant
Main usage of a data warehouse:
1. Data analysis
2. Decision making
3. Planning or forecasting
What is a dimension?
A dimension table is a table that contains only non-quantifying data and categories of
information which are key for analysis. A dimension table contains a primary key and non-
quantifying columns. If the primary key does not exist in the source table, then a surrogate key is
used.
What are the types of dimension?
Based on the type of data it stores, there are two major types of dimension table:
1. Conformed dimension
2. Junk dimension
Based on where it is derived from, there is one more dimension category:
3. Degenerate dimension
Based on how frequently the data changes, dimensions can be divided into 2 types:
4. Rapidly Changing Dimension (RCD)
5. Slowly Changing Dimension (SCD)
What is a fact and what are the types of fact?
A fact is a column or attribute which is quantifiable or measurable and is used as a key
analysis factor. We can also call it a measure.
Types of fact:
1. Additive
2. Semi-additive
3. Non-additive
What does a fact table contain?
A table which contains facts is called a fact table. Typically a fact table has facts and the foreign
keys of dimension tables.
Fact table structure (a DDL sketch follows below):
Foreign_key1
Foreign_keyN
Fact1
FactN
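A minimal DDL sketch of that structure, with hypothetical dimension references and measures (a sales fact used purely for illustration):

CREATE TABLE sales_fact (
  date_key      NUMBER REFERENCES date_dim (date_key),         -- foreign keys to dimensions
  product_key   NUMBER REFERENCES product_dim (product_key),
  customer_key  NUMBER REFERENCES customer_dim (customer_key),
  quantity_sold NUMBER,                                         -- facts / measures
  sales_amount  NUMBER(12,2)
);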
What are the types of a fact table?
Transactional
The fact table contains data at a very detailed level without any rollup/aggregation, the way a
transactional database stores it.
Accumulating
An accumulating snapshot stores multiple entries for a single record to track the changes throughout
the workflow.
Periodic snapshot
The data is extracted and loaded for a particular period of time. It describes the state of the
record in that specific period.
Factless fact table
A fact table that does not have any facts is called a factless fact table. It has only the foreign keys of
dimension tables.
Why is a staging table required?
1. To reduce the complexity of the job (it would be more complex to move data directly from source
to target)
2. To avoid updating the source database.
3. To perform any calculations.
4. To perform the data cleansing process as per business need.
5. When data has been corrupted in the target after a load, we can delete the corrupted data in the
target database and then reload only the missing/deleted data into the target from the
staging database.
What is a surrogate key?
In most tables, the primary key is loaded from the source schema, but some source tables
might not have a primary key; in such cases the primary key is created using a sequence
generator, and such keys are called surrogate keys.
In terms of usage, there is no difference between these two types of keys. They differ only in the way
they are loaded: a primary key is loaded from the source table, whereas a surrogate key is generated
by the sequence generator. (See the sketch below.)
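A sketch of how a surrogate key is typically generated on the database side using a sequence (all names here are hypothetical):

CREATE SEQUENCE customer_dim_seq START WITH 1 INCREMENT BY 1;

-- The surrogate key comes from the sequence, not from the source system
INSERT INTO customer_dim (customer_key, name, state)
VALUES (customer_dim_seq.NEXTVAL, 'Christina', 'Illinois');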
OLTP vs DW database

OLTP: Dedicated database available for a specific subject area or business application
DW:   Integrated from different business applications

OLTP: Does not keep history
DW:   Keeps history of data for analyzing past performance

OLTP: Allows end users to perform DML operations (Select, Insert, Update, Delete)
DW:   Allows only Select for end users

OLTP: The main purpose is for day-to-day transactions
DW:   The purpose is for analysis and reporting

OLTP: Data volume will be less
DW:   Data volume is huge

OLTP: Data stored in normalized format
DW:   Data stored in de-normalized format

Operational Data Store (ODS) vs Staging database

ODS:     It will have a limited period of data (30 to 90 days)
Staging: Based on the type of load, it stores incremental data or the full volume of data

ODS:     Used for operational processing
Staging: Temporary data storage, used for data cleansing and other calculations

ODS:     Integrated from different business lines
Staging: Based on business need, normally each business line would have a dedicated staging area

Explain about star schema
This type of schema contains the fact table in the center position. As we know, the fact table contains
references to dimension tables, so the fact table is surrounded by dimension tables with
foreign key references. The dimension tables do not have references to any other dimension.
(A sample star-join query is sketched below.)
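A typical star-join query, sketched against the hypothetical sales fact and dimension tables used earlier (the dimension columns below are also assumed), joins the central fact table to each dimension through its foreign key:

SELECT d.calendar_year,
       p.product_name,
       SUM(f.sales_amount) AS total_sales
FROM   sales_fact  f
JOIN   date_dim    d ON d.date_key    = f.date_key
JOIN   product_dim p ON p.product_key = f.product_key
GROUP  BY d.calendar_year, p.product_name;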
Explain about snowflake schema
This type also contains a fact table in the center position. The fact table has references to dimension
tables. A dimension table may have references to other dimension tables. The data is stored
in a more normalized form.
What is the difference between star and snowflake?

Star:      As there are no relationships between dimensions, the performance will be high.
Snowflake: Due to multiple links between dimensions, the performance will be low.

Star:      The number of joins will be less, which keeps query complexity low.
Snowflake: The number of joins will be more, which makes query complexity high.

Star:      Consider a Project dimension with a Role column; the role name is stored against
           each project, so the size of the table will be high.
Snowflake: The role information is stored separately in its own table and referenced from the
           Project dimension, which reduces the table size.

Star:      Data is stored in de-normalized format in the dimension table.
Snowflake: Data is stored in a more normalized format in the dimension tables.
What is data cleansing?
Data cleansing is the process of removing irrelevant and redundant data, and correcting
incorrect and incomplete data. It is also called data cleaning or data scrubbing. Organizations
are growing rapidly amid heavy competition, and they take business decisions based on their past
performance data and future projections.
What is data masking?
Organizations never want to disclose highly confidential information to all users. Access to
sensitive data is restricted in all environments other than production. The process of
masking/hiding/encrypting sensitive data is called data masking.
Why Data mart?
1. The data warehouse database contains integrated data for all business lines; for example, a
banking data warehouse contains data for all savings, credit, and loan account databases.
2. Reporting access is given to a person who has the authority or the need to see the
comparison of data across all three types of accounts.
3. Meanwhile, a loan account branch manager does not need to see the savings and credit card
details; he wants to see only the past performance of the loan accounts.
4. In that case, for his analysis we need to apply data-level security to protect the savings and credit
information in the data warehouse.
5. At the same time, end users across all three accounts would access the same data
warehouse, which would end up in poor performance.
6. To avoid these issues, a separate database is built on top of the data warehouse, named the
data mart. Access is given to the respective business line resources, not to everyone.
What is data purging and archiving?
Data purging means deleting data from a database once it crosses the defined retention time.
Archiving means moving data that crosses the defined retention time to another database
(an archival database), as sketched below.
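As a sketch (retention period, table names, and the load_date column are assumptions), archiving moves the old rows to the archival table before the same rows are purged from the main table:

-- Archive rows older than the retention period, then purge them
INSERT INTO sales_fact_archive
SELECT * FROM sales_fact
WHERE  load_date < ADD_MONTHS(SYSDATE, -84);   -- e.g. a 7-year retention period

DELETE FROM sales_fact
WHERE  load_date < ADD_MONTHS(SYSDATE, -84);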
What are the types of SCD?
SCD Type 1
- Modifications are done on the same record
- No history of changes is maintained
SCD Type 2
- The existing record is marked as expired with an is_active flag or an Expired_date column, and a
new record is inserted
- This type allows tracking the full history of changes
SCD Type 3
- The new value is tracked in an additional column
- Only a limited history of changes (the original/previous value) is maintained
What type of schema and SCD type used in your project?
In my current project, we are using type2 to keep the history of changes.
There are two major types of data load available based on the load process.

1. Full Load (Bulk Load)

The data loading process performed the very first time. It can also be referred to as a bulk load or
fresh load.

The job extracts the entire volume of data from a source table or file and loads it into the truncated
target table after applying the transformation logic.
In most cases, it is a one-time job run; after that, only the changes are captured as part of the
incremental load. But again, based on business need, it may be scheduled to run.

2. Incremental load (Refresh load)

Only the modified data is updated in the target, following the full load. The changes are captured
by comparing the created or modified date against the last run date of the job.

Only the modified data is extracted from the source: the job looks for changes in the source table
against the job run table; if changes exist, that data alone is extracted and updated
in the target without impacting the existing data.

If no changes are available, the ETL job sends a notification with a "no changes available
between source and stage/target" message.

There are multiple tools available in the market for the ETL process. Tools are developed with
different technologies and offer rich features for smooth end-to-end data integration. Here are a
few ETL tools.

ETL tools in data warehouse

1. Informatica PowerCenter

One of the most widely used tools worldwide, mainly used for ETL, data masking, and data
quality. It has four major components:

1. Repository Manager – to add repositories and manage folders
2. Designer – creating mappings
3. Workflow Manager – creating workflows with tasks and mappings
4. Workflow Monitor – workflow run status tracker

2. Talend Open Studio

An open source tool for data integration (ETL) which has been developed in Java.
This tool is widely used for ETL, data migration, and big data.

3. IBM DataStage
It has four major components:

- Manager – to manage the repository
- Designer – developing jobs
- Director – job scheduling, running, and monitoring
- Administrator – creating users and managing projects/folders

4. SQL Server Integration Services (SSIS)

SQL Server offers this tool for data integration, with a wide range of extract, transform, and
load features.

5. Ab Initio
6. Oracle Data Integrator
7. SAS Data Integration Studio
8. SAP BusinessObjects Data Integrator
9. CloverETL
10. Pentaho Data Integration
