ETL Interview Questions
How do you compare source and target table data?
-Using SET operator MINUS – if both source and target tables are in the same database server.
-Using Excel macro – both source table and target table data will be copied into an Excel sheet
and compared with a macro
-Using automation tools – the tool will fetch the data and compare it internally with its own algorithm
-Using utility tools – develop an automation utility tool using Java or any scripting language
along with database drivers
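As a minimal sketch of the MINUS approach (SRC_CUSTOMER and TGT_CUSTOMER are hypothetical table names), the check is run in both directions:
-- rows present in source but missing in target
SELECT * FROM src_customer
MINUS
SELECT * FROM tgt_customer;
-- rows present in target but missing in source
SELECT * FROM tgt_customer
MINUS
SELECT * FROM src_customer;
Both queries returning zero rows means the two tables match.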
Give an example of a low severity and high priority defect.
-There is a requirement where an email notification needs to be triggered in case of job failure
-A deviation is found during testing: the email notification has been received, but the record
count in the content does not match
–Low severity – since it does not affect any functionality
–High priority – since the wrong data count shows the wrong picture to the management team
What are the components of Informatica?
Informatica is one of the major ETL tools used worldwide. It is mainly used for ETL, data
masking, and data quality.
It has four major components,
1. Repository manager – to add repositories and manage folders
2. Designer – to create mappings
3. Workflow manager – to create workflows with tasks and mappings
4. Workflow monitor – to track workflow run status
Can you give a few test cases to test an incremental load table?
–Insert a few records and validate the data after the job run
–Update non-primary column values and validate the data after the job run
–Update primary column values and validate the data after the job run
–Delete a few records and validate the data after the job run
–Insert/update a few records to create duplicate entries and validate the data after the job run
–Update with bad data – NULL values, blank spaces, missing lookup data
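As a sketch of one validation step (the ORDERS tables, the LAST_MODIFIED audit column, and the :last_run_time bind variable are assumptions for illustration), the changed rows can be compared after the job run:
-- changed source rows that did not land correctly in the target
SELECT order_id, order_amount FROM src_orders
WHERE last_modified > :last_run_time
MINUS
SELECT order_id, order_amount FROM tgt_orders;
A zero-row result means every inserted/updated record was loaded as expected.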
How do you compare a flat file and a database table?
-Using Excel macro – the flat file data and the target table data will be copied into an Excel sheet
and compared with a macro
-Using automation tools – the tool will fetch the data and compare it internally with its own algorithm
-Using utility tools – develop an automation utility tool using Java or any scripting language
along with database drivers
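A further database-side option, sketched under assumptions (an Oracle database, a DATA_DIR directory object, and a customers.csv file are all hypothetical), is to expose the flat file as an external table and reuse the MINUS comparison:
-- map the flat file to a read-only external table
CREATE TABLE file_stage (
  customer_id   NUMBER,
  customer_name VARCHAR2(50)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('customers.csv')
);
-- file rows missing from the database table
SELECT customer_id, customer_name FROM file_stage
MINUS
SELECT customer_id, customer_name FROM tgt_customer;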
What are Slowly Changing Dimensions (SCD)? Explain the types with an example.
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois, so the original entry in the
customer lookup table has the following record:
Customer Key: 1001, Name: Christina, State: Illinois
At a later date, she moves to Los Angeles, California. There are three common ways to reflect this
change in the dimension table.
Slowly Changing Dimension Type 1: The new record replaces the original record. No trace
of the old record exists.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
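A minimal Type 1 sketch in SQL, using the table and columns assumed from the example above:
-- overwrite the old value in place; no history is kept
UPDATE customer
SET state = 'California'
WHERE customer_key = 1001;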
Slowly Changing Dimension Type 2: A new record is added into the
customer dimension table. Therefore, the customer is treated essentially as two people.
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record gets its
own primary key.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is
very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to
track historical changes.
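A minimal Type 2 sketch (the is_active/expired_date housekeeping columns and the new surrogate key 1005 are assumptions for illustration):
-- expire the current version of the row
UPDATE customer
SET is_active = 'N', expired_date = SYSDATE
WHERE customer_key = 1001 AND is_active = 'Y';
-- insert the new version with its own primary key
INSERT INTO customer (customer_key, name, state, effective_date, is_active)
VALUES (1005, 'Christina', 'California', SYSDATE, 'Y');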
Slowly Changing Dimension Type 3: The original record is modified to reflect the change.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute
of interest, one indicating the original value, and one indicating the current value. There will also be
a column that indicates when the current value becomes active.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will be
lost.
Usage:
Type 3 is rarely used in actual practice.
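A minimal Type 3 sketch (the original_state and effective_date columns are assumptions for illustration):
-- one-time schema change: columns for the prior value and the change date
ALTER TABLE customer ADD (original_state VARCHAR2(30), effective_date DATE);
-- preserve the old value, then overwrite the current one
UPDATE customer
SET original_state = state,
    state = 'California',
    effective_date = SYSDATE
WHERE customer_key = 1001;
Within a single UPDATE, the right-hand sides read the pre-update row, so original_state correctly captures 'Illinois'.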
4. What is Data Mining?
Ans: Data mining is analyzing data from different perspectives and summarizing it into useful
information for decision making. It can be used to increase revenue, cut costs, increase productivity,
or improve any business process. There are a lot of tools available in the market for various
industries to do data mining. Basically, it is all about finding correlations or patterns in large
relational databases.
Data warehousing comes before data mining. It is the process of compiling and organizing data into
one database from various source systems, whereas data mining is the process of extracting
meaningful data from that database (the data warehouse).
5. What is a Data Mart?
Ans: A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a
single department within an organization. Given their single-subject focus, data marts usually draw
data from only a few sources. The sources could be internal operational systems, a central data
warehouse, or external data.
7. What is ETL?
Ans: ETL - extract, transform, and load.
Extracting data from outside source systems.
Transforming raw data to make it fit for use by different departments.
Loading transformed data into target systems like data mart or data warehouse.
What is the need for ETL testing?
To verify that expected data is loaded into the data mart or data warehouse without loss of any data.
To validate the accuracy of reconciliation reports (if any, e.g. comparing reports of transactions
made via bank ATM – ATM report vs. bank account report).
To make sure the complete process meets performance and scalability requirements.
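As a small sketch of the completeness check (SRC_TXN and TGT_TXN are hypothetical names), a side-by-side record count comparison is usually the first validation:
SELECT (SELECT COUNT(*) FROM src_txn) AS src_count,
       (SELECT COUNT(*) FROM tgt_txn) AS tgt_count
FROM dual;
Matching counts do not prove the data is identical, so a column-level MINUS comparison should follow.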
14. List out the employees who are earning salary between 3000 and 4500
Ans: Select * from employee where salary between 3000 and 4500
18. List out the employees whose name starts with “S” and ends with “H”
Ans: Select * from employee where last_name like 'S%H'
19. List out the employees whose name length is 4 and starts with “S”
Ans: Select * from employee where last_name like 'S___'
20. List out the employees who are working in department 10 and draw the salaries more than
3500
Ans: Select * from employee where department_id=10 and salary>3500
22. List out the employee id, last name in ascending order based on the employee id.
Ans: Select employee_id, last_name from employee order by employee_id
23. List out the employee id, name in descending order based on salary column
Ans: Select employee_id, last_name, salary from employee order by salary desc
24. List out the employee details according to their last_name in ascending order and salaries in
descending order
Ans: Select employee_id, last_name, salary from employee order by last_name, salary desc
25. List out the employee details according to their last_name in ascending order and then on
department_id in descending order.
Ans: Select employee_id, last_name, salary from employee order by last_name, department_id desc
26. How many employees are working in each department of the organization?
Ans: Select department_id, count(*) from employee group by department_id
27. List out the department wise maximum salary, minimum salary, average salary of the
employees
Ans: Select department_id, count(*), max(salary), min(salary), avg(salary) from employee group by
department_id
28. List out the job wise maximum salary, minimum salary, average salaries of the employees.
Ans: Select job_id, count(*), max(salary), min(salary), avg(salary) from employee group by
job_id
29. List out the no. of employees joined in every month in ascending order.
Ans: Select to_char(hire_date,'month') Month, count(*) from employee group by
to_char(hire_date,'month') order by Month
30. List out the no. of employees for each month and year, in ascending order based on the
year, month.
Ans: Select to_char(hire_date,'yyyy') Year, to_char(hire_date,'mon') Month, count(*) "No. of
employees" from employee group by to_char(hire_date,'yyyy'), to_char(hire_date,'mon')
order by Year, Month
What is a data warehouse?
A data warehouse is a database which,
1. Maintains history of data
2. Contains integrated data (data from multiple business lines)
3. Contains heterogeneous data (data from different source formats)
4. Contains aggregated data
5. Allows only SELECT, to restrict data manipulation
6. Stores data in a de-normalized format
What is the difference between OLTP and DW databases?
OLTP – a dedicated database available for a specific subject area or business application.
DW – integrated data from different business applications.
What is the difference between Star and Snowflake schemas?
Star – the number of joins will be less, which makes query complexity low; data is stored in a
de-normalized format in the dimension tables.
Snowflake – the number of joins will be more, which makes query complexity high; data is stored
in a more normalized format in the dimension tables.
What is data cleansing?
Data cleansing is the process of removing irrelevant and redundant data, and correcting incorrect
and incomplete data. It is also called data cleaning or data scrubbing. Organizations are growing
fast amid heavy competition, and they take business decisions based on their past performance
data and future projections, so the underlying data must be clean.
What is data masking?
Organizations never want to disclose highly confidential information to all users, so access to
sensitive data is restricted in all environments other than production. The process of
masking/hiding/encrypting this sensitive data is called data masking.
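As an illustrative sketch only (the CUSTOMER table and SSN column are hypothetical), a simple static masking rule for a non-production copy could keep just the last four digits:
-- mask SSNs before refreshing a test environment
UPDATE customer
SET ssn = 'XXX-XX-' || SUBSTR(ssn, -4);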
Why Data mart?
1. The data warehouse database contains integrated data for all business lines; for example, a
banking data warehouse contains data for all savings, credit and loan account databases.
2. Reporting access will be given to a person who has the authority or the need to see the
comparison of data across all three types of accounts.
3. Meanwhile, a loan account branch manager does not need to see the savings and credit card
details; he wants to see only the past performance of the loan accounts.
4. In that case, for his analysis, we need to apply data-level security in the data warehouse to
protect the savings and credit information.
5. At the same time, if the end users of all three accounts access the same data warehouse, it will
end up in poor performance.
6. To avoid these issues, a separate database is built on top of the data warehouse, named the
data mart. Access is given to the respective business line resources, not to everyone.
What is data purging and archiving?
Data purging means deleting data from a database which crosses the defined retention time.
Archiving means moving the data which crosses the defined retention time to another database
(archival database).
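A small sketch under assumptions (the TXN and TXN_ARCHIVE tables and the 7-year retention period are hypothetical):
-- archive rows past retention, then purge them from the active table
INSERT INTO txn_archive
SELECT * FROM txn WHERE txn_date < ADD_MONTHS(SYSDATE, -84);
DELETE FROM txn
WHERE txn_date < ADD_MONTHS(SYSDATE, -84);
COMMIT;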
What are the types of SCD?
SCD Type 1
-Modifications will be done on the same record
-Here no history of changes will be maintained
SCD Type 2
-An existing record will be marked as expired with an is_active flag or expired_date column, and a
new record will be inserted
-This type allows tracking the history of changes
SCD Type 3
-A new value will be tracked as an additional column (previous value and current value)
-Here only a limited history of changes will be maintained (the prior value alone)
What type of schema and SCD type are used in your project?
In my current project, we are using SCD Type 2 to keep the history of changes.
What are the types of data load?
There are two major types of data load, based on the load process.
Full load:
The job extracts the entire volume of data from a source table or file and loads it into a truncated
target table after applying the transformation logic.
In most cases it is a one-time job run; after that, the changes alone are captured as part of an
incremental load. But again, based on business need, it can be scheduled to run.
Incremental load:
Only the modified data is extracted from the source. The job looks for changes in the source table
against the job run table; if changes exist, that data alone is extracted and updated in the target
without impacting the existing data.
If no changes are available, the ETL job sends a notification with a "no change available between
source and stage/target" message.
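A sketch of the change-detection query (the SRC_ORDERS table, its LAST_MODIFIED column, and the JOB_RUN audit table are assumptions for illustration):
-- pick up only the rows changed since the last successful run
SELECT *
FROM src_orders
WHERE last_modified > (SELECT MAX(last_run_time)
                       FROM job_run
                       WHERE job_name = 'ORDERS_LOAD'
                         AND status = 'SUCCESS');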
What are the ETL tools available in the market?
There are multiple tools available in the market for the ETL process. The tools are developed with
different technologies and offer rich features for smooth end-to-end data integration. Here are a
few ETL tools:
3. IBM Datastage
5. Ab-initio
6. Oracle Data Integrator
7. SAS – Data Integration Studio
8. Business Object Data Integrator
9. Clover ETL
10. Pentaho Data Integration