Data Stage Interview Questions

DataStage is an ETL tool used to integrate data from multiple sources and process large volumes of data. It provides a graphical user interface to design jobs that extract, transform, and load data. Common interview questions about DataStage include explaining what it is, how sources are populated, commands to import and export jobs, differences between versions, and explaining stages like Merge, Join, and Lookup.

Uploaded by

Manoj Sharma
© All Rights Reserved

DATA STAGE INTERVIEW QUESTIONS & ANSWERS

DataStage is an ETL tool. It uses a graphical notation to build solutions for data
integration, and it is available in several editions, such as SE, EE, and MVS. The
DataStage role requires knowledge of data warehousing, ETL, DataStage configuration,
job design, and the various stages and modules in DataStage. The tool is used to
integrate multiple systems and to process high volumes of data, and it provides a
user-friendly graphical front end for designing jobs. The following DataStage interview
questions and answers are useful for preparing for job interviews and getting
shortlisted for a position.
1. Question 1. Explain Data Stage?
Answer :
DataStage is a tool used to design, develop, and execute applications that
populate the tables in a data warehouse or data mart.
2. Question 2. Tell How A Source File Is Populated?
Answer :
We can generate a source file in various ways, such as by writing a SQL query
in Oracle or by using the Row Generator stage.
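The SQL-query route can be sketched outside DataStage as well. The following Python snippet is a minimal, hypothetical illustration (the table name, columns, and file name are all made up): it runs a query against a small in-memory database and writes the result set out as a flat source file.

```python
import csv
import sqlite3

# Hypothetical example: build a tiny table, run a SQL query against it,
# and write the result set out as a flat source file (CSV).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])

with open("source_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])                          # header row
    for row in conn.execute("SELECT id, name FROM customers ORDER BY id"):
        writer.writerow(row)                                 # one data row per record
conn.close()
```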
3. Question 3. Write The Command Line Functions To Import And Export
The Ds Jobs?
Answer :
To import the DS jobs, dsimport.exe is used, and to export the DS jobs,
dsexport.exe is used.
4. Question 4. Differentiate Between Datastage 7.5 And 7.0?
Answer :
In Datastage 7.5, several new stages were added for more robustness and
smoother performance, such as the Procedure stage and the Command stage.
5. Question 5. Explain Merge?
Answer :
Merge means to join two or more tables. The tables are joined on the basis of
the primary key columns in both tables.
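As a rough illustration of joining two tables on a primary key column, here is a small Python sketch. The tables and columns are invented for the example; this is not DataStage's Merge stage itself, only the idea behind it.

```python
# Two illustrative "tables": orders rows, and customers keyed by the primary key.
orders = [{"cust_id": 1, "amount": 250}, {"cust_id": 2, "amount": 90}]
customers = {1: "Asha", 2: "Ravi"}

# Join each order to its customer on the primary key column.
merged = [
    {"cust_id": o["cust_id"], "name": customers[o["cust_id"]], "amount": o["amount"]}
    for o in orders
    if o["cust_id"] in customers  # keep only rows with a matching key
]
```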
6. Question 6. Differentiate Between Data File And Descriptor File?
Answer :
As the name implies, data files contain the data, while the descriptor file
contains the description of the data in the data files.
7. Question 7. Differentiate Between Data Stage And Informatica?
Answer :
In Datastage, there is a concept of partitioning and parallelism for node
configuration, while Informatica has no such concept of partitioning and
parallelism for node configuration. Also, Informatica is more scalable than
Datastage, and Datastage is easier to use than Informatica.

8. Question 8. Explain Routines And Their Types?


Answer :
Routines are collections of functions defined in the DS Manager. They can be
called from the Transformer stage. Routines are of three types: parallel
routines, server routines, and mainframe routines.
9. Question 9. How Can We Write Parallel Routines In Data Stage Px?
Answer :
We can write parallel routines in C or C++. Such routines are also created in
the DS Manager and can be called from the Transformer stage.
10. Question 10. What Is The Procedure Of Removing Duplicates, Without
The Remove Duplicate Stage?
Answer :
Duplicates can be removed by using the Sort stage with the option
allow duplicate = false.
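The Sort-stage approach can be sketched in plain Python: sort the rows on the key, then keep only the first row for each key value, which is what allow duplicate = false amounts to. The data below is purely illustrative.

```python
rows = [("A", 3), ("B", 1), ("A", 2), ("C", 5), ("B", 4)]

rows.sort(key=lambda r: r[0])        # sort on the key column (stable sort)
deduped = []
for row in rows:
    if not deduped or deduped[-1][0] != row[0]:  # key changed, so keep this row
        deduped.append(row)
```

Because Python's sort is stable, the first occurrence of each key in the input is the one that survives.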
11. Question 11. What Steps Should Be Taken To Improve The Performance Of Datastage Jobs?
Answer :
In order to improve the performance of Datastage jobs, we have to first establish
baselines. Secondly, we should not use only one flow for performance testing.
Thirdly, we should work in increments. Then, we should evaluate data skews. Then
we should isolate and solve the problems, one by one. After that, we should
distribute the file systems to remove bottlenecks, if any. Also, we should not
include the RDBMS at the start of the testing phase. Last but not least, we should
understand and assess the available tuning knobs.
12. Question 12. Compare And Contrast Between Join, Merge And Lookup
Stage?
Answer :
All three differ from each other in the way they use memory, in their input
requirements, and in how they treat various records. Join and Merge need less
memory than the Lookup stage.
13. Question 13. Describe Quality Stage?
Answer :
The Quality stage, also known as the Integrity stage, assists in integrating
various types of data from different sources.
14. Question 14. Describe Job Control?
Answer :
Job control can be best performed by using Job Control Language (JCL). This
tool is used to execute various jobs concurrently, without using any kind of
loop.
15. Question 15. Contrast Between Symmetric Multiprocessing And
Massive Parallel Processing?
Answer :
In Symmetric Multiprocessing, the hardware resources are shared by the
processors. There is one operating system, and the processors communicate
through shared memory. In Massively Parallel Processing, each processor
accesses the hardware resources exclusively. This type of processing is also
known as Shared Nothing, since nothing is shared, and it is faster than
Symmetric Multiprocessing.
16. Question 16. Write The Steps Required To Kill The Job In Datastage?
Answer :
To kill a job in Datastage, we have to kill the respective processing ID.
17. Question 17. Contrast Between Validated And Compiled In The
Datastage?
Answer :
In Datastage, validating a job means executing it: while validating, the
Datastage engine checks whether all the required properties are provided.
In contrast, while compiling a job, the Datastage engine checks whether all
the given properties are valid.
18. Question 18. How Can We Perform Date Conversion In Datastage?
Answer :
We can use the date conversion function for this purpose, i.e. Oconv
(Iconv(FieldName, "Existing Date Format"), "Another Date Format").
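The Oconv(Iconv(...)) pattern parses a date from its existing format and then re-emits it in the target format. A Python analogue of the same two-step idea, with made-up field values and format strings:

```python
from datetime import datetime

def convert_date(value, existing_fmt, target_fmt):
    parsed = datetime.strptime(value, existing_fmt)  # like Iconv: parse to internal form
    return parsed.strftime(target_fmt)               # like Oconv: format for output

print(convert_date("2024-01-31", "%Y-%m-%d", "%d/%m/%Y"))  # prints 31/01/2024
```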
19. Question 19. What Is The Need Of Exception Activity In Datastage?
Answer :
All the stages after the Exception activity in Datastage are executed if any
unknown error occurs while running the job sequencer.
20. Question 20. Explain Apt_config In Datastage?
Answer :
It is the environment variable used to identify the *.apt configuration file in
Datastage. This file stores the node information, scratch information, and
disk storage information.
21. Question 21. Write The Different Types Of Lookups In Datastage?
Answer :
There are two types of Lookups in Datastage i.e. Normal lookup and Sparse
lookup.
22. Question 22. How Can We Convert A Server Job To A Parallel Job?
Answer :
We can convert a server job into a parallel job by using the IPC stage and the
Link Collector.
23. Question 23. Explain Repository Tables In Datastage?
Answer :
In Datastage, the Repository is another name for a data warehouse. It can be
centralized as well as distributed.
24. Question 24. Describe Oconv () And Iconv () Functions In Datastage?
Answer :
In Datastage, the OConv() and IConv() functions are used to convert data from
one format to another, i.e. conversions of time, Roman numerals, radix, dates,
numeric ASCII, etc. IConv() is mostly used to convert formats for the system to
understand, while OConv() is used to convert formats for users to understand.
25. Question 25. Define Usage Analysis In Datastage?
Answer :
In Datastage, Usage Analysis takes only a few clicks: launch the Datastage
Manager, right-click the job, and then select Usage Analysis.
26. Question 26. How We Can Find The Number Of Rows In A Sequential
File?
Answer :
To find the number of rows in a sequential file, we can use the system variable @INROWNUM.
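@INROWNUM is Datastage's own row counter inside a job. Outside a job, counting the rows of a sequential (flat) file can be sketched as below; the file name and the header convention are assumptions made for the example.

```python
def count_rows(path, has_header=True):
    with open(path) as f:
        total = sum(1 for _ in f)       # one line per row in a sequential file
    return total - 1 if has_header else total

# Create a small sample file: a header plus two data rows.
with open("seq_file.txt", "w") as f:
    f.write("id,name\n1,Asha\n2,Ravi\n")

print(count_rows("seq_file.txt"))       # prints 2
```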
27. Question 27. Contrast Between Hash File And Sequential File?
Answer :
The only difference between the hash file and the sequential file is that the
hash file stores data using a hash algorithm and a hash key value, while the
sequential file has no key value for saving data. Because of this hash key,
searching in a hash file is faster than in a sequential file.
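The speed difference can be illustrated in Python: a hash (keyed) structure goes straight to the key, while a sequential file has to be scanned row by row. The key format and record count below are invented.

```python
# 1000 key/value records, keys like "K0042".
records = [("K%04d" % i, i) for i in range(1000)]

hash_file = dict(records)        # hash-file style: keyed storage, direct access

def sequential_search(key):
    for k, v in records:         # sequential-file style: scan row by row
        if k == key:
            return v
    return None
```

Both return the same value; the dict lookup is constant time, the scan is linear in the file size.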
28. Question 28. How We Can Clean The Datastage Repository?
Answer :
We can clean the Datastage repository via the Clean Up Resources
functionality in the Datastage Manager.
29. Question 29. How Can A Routine Be Called In A Datastage Job?
Answer :
We can call a routine from the transformer stage in Datastage job.
30. Question 30. Differentiate Between Operational Datastage (ods) And
Data Warehouse?
Answer :
We can say that an ODS is a small data warehouse. An ODS doesn't hold
information for more than one year, while a data warehouse has detailed
information about the entire business.
31. Question 31. What Does NLS Stand For In Datastage?
Answer :
NLS stands for National Language Support. It can be used to incorporate
various languages, such as French, German, and Spanish, into the data, as
required for processing by the data warehouse.
32. Question 32. Can You Explain How One Could Drop The Index
Before Loading The Data In Target In Datastage?
Answer :
In Datastage, we can drop the index before loading the data into the target by
using the Direct Load functionality of the SQL Loader utility.
33. Question 33. Does Datastage Support Slowly Changing Dimensions?
Answer :
Yes, version 8.5 and later supports this feature in Datastage.
34. Question 34. How Are Complex Jobs Implemented In Datastage To
Improve Performance?
Answer :
In order to improve performance in Datastage, it is suggested not to use more
than 20 stages in a job. If you need to use more than 20 stages, then it is
advisable to move the extra stages into a separate job.
35. Question 35. Name The Third Party Tools That Can Be Used In
Datastage?
Answer :
The third party tools that can be used in Datastage, are Autosys, TNG and
Event Co-ordinator.
36. Question 36. Describe Project In Datastage?
Answer :
Whenever we launch the Datastage client, we are asked to connect to a Datastage
project. A Datastage project contains Datastage jobs, built-in components,
and Datastage Designer or user-defined components.
37. Question 37. What Types Of Hash Files Are There?
Answer :
There are two types of hash files: the static hash file and the dynamic
hash file.
38. Question 38. Describe Meta Stage?
Answer :
In Datastage, MetaStage is used to store metadata that is beneficial for data
lineage and data analysis.
39. Question 39. Why Unix Environment Is Useful In Datastage?
Answer :
It is useful in Datastage because sometimes one has to write UNIX programs,
such as batch programs, to invoke batch processing.
40. Question 40. Contrast Between Datastage And Datastage Tx?
Answer :
Datastage is an ETL (Extract, Transform and Load) tool, while Datastage TX is
an EAI (Enterprise Application Integration) tool.
41. Question 41. What Is Size Of A Transaction And An Array Means In A
Datastage?
Answer :
Transaction size means the number of rows written before committing the
records to a table. Array size means the number of rows written to or read
from the table at a time.
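The idea of a transaction size (rows per commit) can be sketched with Python's sqlite3 module; the table, batch size, and row counts here are illustrative, not defaults from any tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (n INTEGER)")

rows = [(i,) for i in range(10)]
transaction_size = 4             # commit after every 4 rows written

for start in range(0, len(rows), transaction_size):
    batch = rows[start:start + transaction_size]   # an "array size" worth of rows
    conn.executemany("INSERT INTO target VALUES (?)", batch)
    conn.commit()                # one commit per transaction

count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(count)                     # prints 10
```

Committing per batch rather than per row is the usual trade-off between throughput and how much work is redone after a failure.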
42. Question 42. Name The Various Types Of Views In A Datastage Director?
Answer :
There are three types of views in a Datastage Director i.e. Log View, Job View
and Status View.
43. Question 43. What Is The Use Of Surrogate Key?
Answer :
A surrogate key is mostly used to retrieve data faster. It uses an index to
perform the retrieval operation.
44. Question 44. How Are Rejected Rows Processed In Datastage?
Answer :
In Datastage, rejected rows are managed by constraints in the transformer.
We can either place the rejected rows in the properties of a transformer or
create temporary storage for rejected rows with the help of the REJECTED
command.
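A transformer constraint that routes failing rows to a reject store can be sketched as a simple filter. The constraint used here (amount must be non-negative) is invented for the example.

```python
rows = [{"id": 1, "amount": 50}, {"id": 2, "amount": -7}, {"id": 3, "amount": 120}]

output, rejected = [], []
for row in rows:
    if row["amount"] >= 0:       # the constraint
        output.append(row)
    else:
        rejected.append(row)     # temporary storage for rejected rows
```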
45. Question 45. Contrast Between Odbc And Drs Stage?
Answer :
The DRS stage is faster than the ODBC stage because it uses native database
drivers for connectivity.
46. Question 46. Describe Orabulk And Bcp Stages?
Answer :
The Orabulk stage is used to load large amounts of data into one target table
of an Oracle database, while the BCP stage is used to load large amounts of
data into one target table of Microsoft SQL Server.
47. Question 47. Describe Ds Designer?
Answer :
The DS Designer is used to design the work area and add various links to it.
48. Question 48. What Is The Need Of Link Partitioner And Link Collector In
Datastage?
Answer :
In Datastage, the Link Partitioner is used to split data into various parts by
certain partitioning methods, and the Link Collector is used to gather data
from the various partitions into a single stream and save it to the target table.
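Link Partitioner and Link Collector can be sketched as splitting rows across partitions (round-robin here) and gathering them back into one stream. The per-partition work shown is an arbitrary placeholder.

```python
rows = list(range(8))
num_partitions = 3

# Link Partitioner: split the stream across partitions (round-robin method).
partitions = [rows[i::num_partitions] for i in range(num_partitions)]

# Placeholder per-partition work, done independently on each partition.
processed = [[r * 10 for r in part] for part in partitions]

# Link Collector: gather the partitions back into a single stream.
collected = [r for part in processed for r in part]
```

Note that collecting does not restore the original order, which is why the collected stream is usually re-sorted or written keyed.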
DATASTAGE:-

1) Define Data Stage?

A data stage is basically a tool that is used to design, develop and execute
various applications to fill multiple tables in data warehouse or data marts. It is a
program for Windows servers that extracts data from databases and change
them into data warehouses. It has become an essential part of IBM WebSphere
Data Integration suite.

2) Explain how a source file is populated?

We can populate a source file in many ways such as by creating a SQL query
in Oracle, or  by using row generator extract tool etc.

3) Name the command line functions to import and export the DS jobs?

To import the DS jobs, dsimport.exe is used and to export the DS jobs,
dsexport.exe is used.

4) What is the difference between Datastage 7.5 and 7.0?

In Datastage 7.5 many new stages are added for more robustness and smooth
performance, such as Procedure Stage, Command Stage, Generate Report etc.

5) In Datastage, how can you fix the truncated data error?

The truncated data error can be fixed by using the environment variable
IMPORT_REJECT_STRING_FIELD_OVERRUN.

6) Define Merge?

Merge means to join two or more tables. The two tables are joined on the basis
of Primary key columns in both the tables.

7) Differentiate between data file and descriptor file?

As the name implies, data files contain the data and the descriptor file contains
the description/information about the data in the data files.

8) Differentiate between datastage and informatica?

In Datastage, there is a concept of partition and parallelism for node
configuration, while there is no concept of partition and parallelism in
Informatica for node configuration. Also, Informatica is more scalable than
Datastage, and Datastage is more user-friendly as compared to Informatica.

9) Define Routines and their types?

Routines are basically collections of functions defined by the DS Manager.
They can be called via the transformer stage. There are three types of routines:
parallel routines, mainframe routines and server routines.

10) How can you write parallel routines in datastage PX?

We can write parallel routines in C or C++ compiler. Such routines are also
created in DS manager and can be called from transformer stage.

11) What is the method of removing duplicates, without the remove duplicate stage?

Duplicates can be removed by using the Sort stage. We can use the option allow
duplicate = false.

12) What steps should be taken to improve Datastage jobs?

In order to improve performance of Datastage jobs, we have to first establish the
baselines. Secondly, we should not use only one flow for performance testing.
Thirdly, we should work in increments. Then, we should evaluate data skews. Then
we should isolate and solve the problems, one by one. After that, we should
distribute the file systems to remove bottlenecks, if any. Also, we should not
include RDBMS in the start of the testing phase. Last but not the least, we should
understand and assess the available tuning knobs.

13) Differentiate between Join, Merge and Lookup stage?

All the three concepts are different from each other in the way they use the
memory storage, compare input requirements and how they treat various
records. Join and Merge needs less memory as compared to the Lookup stage.

14) Explain Quality stage?

Quality stage is also known as Integrity stage. It assists in integrating different
types of data from various sources.

15) Define Job control?

Job control can be best performed by using Job Control Language (JCL). This tool
is used to execute multiple jobs simultaneously, without using any kind of loop.

16) Differentiate between Symmetric Multiprocessing and Massive Parallel Processing?

In Symmetric Multiprocessing, the hardware resources are shared by the
processors. The processors run one operating system and communicate through
shared memory. In Massive Parallel Processing, each processor accesses the
hardware resources exclusively. This type of processing is also known as Shared
Nothing, since nothing is shared in it. It is faster than Symmetric
Multiprocessing.

17) What are the steps required to kill the job in Datastage?

To kill the job in Datastage, we have to kill the respective processing ID.

18) Differentiate between validated and compiled in the Datastage?

In Datastage, validating a job means executing it. While validating, the
Datastage engine verifies whether all the required properties are provided or not.
In contrast, while compiling a job, the Datastage engine verifies whether all the
given properties are valid or not.
19) How to manage date conversion in Datastage?

We can use the date conversion function for this purpose, i.e.
Oconv(Iconv(FieldName,"Existing Date Format"),"Another Date Format").

20) Why do we use exception activity in Datastage?

All the stages after the exception activity in Datastage are executed if any
unknown error occurs while executing the job sequencer.

21) Define APT_CONFIG in Datastage?

It is the environment variable that is used to identify the *.apt file in Datastage. It
is also used to store the node information, disk storage information and scratch
information.

22) Name the different types of Lookups in Datastage?

There are two types of Lookups in Datastage: the Normal lookup and the Sparse
lookup. In a Normal lookup, the data is saved in memory first and then the lookup
is performed. In a Sparse lookup, the data is looked up directly in the database.
Therefore, the Sparse lookup is faster than the Normal lookup.
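The two lookup styles can be sketched with sqlite3: the normal lookup caches the reference table in memory once, while the sparse lookup issues one query per driving row. The table and column names are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ref (k INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO ref VALUES (?, ?)", [(1, "a"), (2, "b")])
stream = [2, 1, 2]               # the driving input rows

# Normal lookup: cache the whole reference table in memory first.
cache = dict(conn.execute("SELECT k, v FROM ref"))
normal = [cache.get(k) for k in stream]

# Sparse lookup: one database query per driving row.
def sparse(k):
    row = conn.execute("SELECT v FROM ref WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None

sparse_result = [sparse(k) for k in stream]
```

Both styles produce the same values; they differ in where the work happens, in memory or in the database.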

23) How a server job can be converted to a parallel job?

We can convert a server job in to a parallel job by using IPC stage and Link
Collector.

24) Define Repository tables in Datastage?

In Datastage, the Repository is another name for a data warehouse. It can be
centralized as well as distributed.

25) Define OConv () and IConv () functions in Datastage?

In Datastage, OConv () and IConv() functions are used to convert formats from
one format to another i.e. conversions of roman numbers, time, date, radix,
numeral ASCII etc. IConv () is basically used to convert formats for system to
understand. While, OConv () is used to convert formats for users to understand.
26) Explain Usage Analysis in Datastage?

In Datastage, Usage Analysis is performed within a few clicks: launch the
Datastage Manager, right-click the job, then select Usage Analysis and that's it.

27) How do you find the number of rows in a sequential file?

To find rows in sequential file, we can use the System variable @INROWNUM.

28) Differentiate between Hash file and Sequential file?

The only difference between the Hash file and the Sequential file is that the
Hash file saves data based on a hash algorithm and a hash key value, while the
sequential file doesn't have any key value to save the data. Based on this hash
key feature, searching in a Hash file is faster than in a sequential file.

29) How to clean the Datastage repository?

We can clean the Datastage repository by using the Clean Up Resources
functionality in the Datastage Manager.

30) How a routine is called in Datastage job?

In Datastage, routines are of two types i.e. Before Sub Routines and After Sub
Routines. We can call a routine from the transformer stage in Datastage.

31) Differentiate between Operational Datastage (ODS) and Data warehouse?

We can say, ODS is a mini data warehouse. An ODS doesn't contain information
for more than 1 year, while a data warehouse contains detailed information
regarding the entire business.

32) NLS stands for what in Datastage?

NLS means National Language Support. It can be used to incorporate other
languages such as French, German, and Spanish, in the data required for
processing by the data warehouse. These languages have the same script as the
English language.
33) Can you explain how could anyone drop the index before loading the
data in target in Datastage?

In Datastage, we can drop the index before loading the data in target by using
the Direct Load functionality of SQL Loaded Utility.

34) Does Datastage support slowly changing dimensions?

Yes. Version 8.5 and later supports this feature.

35) How can one find bugs in job sequence?

We can find bugs in job sequence by using DataStage Director.

36) How complex jobs are implemented in Datastage to improve performance?

In order to improve performance in Datastage, it is recommended not to use more
than 20 stages in every job. If you need to use more than 20 stages, then it is
better to use another job for those stages.

37) Name the third party tools that can be used in Datastage?

The third party tools that can be used in Datastage are Autosys, TNG and Event
Co-ordinator.

38) Define Project in Datastage?

Whenever we launch the Datastage client, we are asked to connect to a Datastage
project. A Datastage project contains Datastage jobs, built-in components and
Datastage Designer or User-Defined components.

39) How many types of hash files are there?

There are two types of hash files in DataStage: the Static Hash File and the
Dynamic Hash File. The static hash file is used when a limited amount of data is
to be loaded into the target database; the dynamic hash file is used when we
don't know the amount of data from the source file.
40) Define Meta Stage?

In Datastage, MetaStage is used to save metadata that is helpful for data lineage
and data analysis.

41) Have you ever worked in a UNIX environment, and why is it useful in
Datastage?

Yes, I have worked in a UNIX environment. This knowledge is useful in Datastage
because sometimes one has to write UNIX programs, such as batch programs, to
invoke batch processing.

42) Differentiate between Datastage and Datastage TX?

Datastage is a tool from ETL (Extract, Transform and Load) and Datastage TX is a
tool from EAI (Enterprise Application Integration).

43) What do the size of a transaction and an array mean in Datastage?

Transaction size means the number of rows written before committing the records
in a table. An array size means the number of rows written to or read from the
table at a time.

44) How many types of views are there in a Datastage Director?

There are three types of views in a Datastage Director i.e. Job View, Log View and
Status View.

45) Why we use surrogate key?

In Datastage, we use a Surrogate Key instead of a unique key. A surrogate key is
mostly used for retrieving data faster. It uses an index to perform the retrieval
operation.

46) How rejected rows are managed in Datastage?

In Datastage, the rejected rows are managed through constraints in the
transformer. We can either place the rejected rows in the properties of a
transformer or we can create a temporary storage for rejected rows with the help
of the REJECTED command.

47) Differentiate between ODBC and DRS stage?

DRS stage is faster than the ODBC stage because it uses native databases for
connectivity.

48) Define Orabulk and BCP stages?

Orabulk stage is used to load large amount of data in one target table of Oracle
database. The BCP stage is used to load large amount of data in one target table
of Microsoft SQL Server.

49) Define DS Designer?

The DS Designer is used to design work area and add various links to it.

50) Why do we use Link Partitioner and Link Collector in Datastage?

In Datastage, Link Partitioner is used to divide data into different parts through
certain partitioning methods. Link Collector is used to gather data from various
partitions/segments to a single data and save it in the target table.
