Ab Initio Interview Questions - 1
2. What is Maxcore?
A. Maxcore is the maximum amount of memory a component may use at run time.
Default values: Sort 10 MB, Join 64 MB.
2A. Explain what max-core settings are used for and what you need to look out for when
setting them.
A. Max-core is the maximum amount of memory, in bytes, that a component will use.
Too low: the component spills to disk, which slows down the application.
Too high: the component consumes too many resources, which also slows things down.
13. Which one do you use first? Either a Sort or Partition component?
A. Partition first. Sort is used after the partitioning is done.
15. Explain the concept of phases in Ab Initio and what they can be used for?
A. The first phase must complete before the second runs. Phases can reduce concurrent
resource usage, and the graph saves its status after each phase if the phase is a checkpoint.
20. What is the difference between Partition by Round-robin and Partition with
Load Balance?
A. Partition by Round-robin: distributes data records evenly to each output flow in
round-robin fashion.
Partition with Load Balance: distributes data records to the output flow partitions,
writing more records to the flow partitions that consume records faster.
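The round-robin behaviour described above can be pictured with a small Python sketch. It is an illustration only, not Ab Initio code: the function name partition_round_robin is hypothetical, and it assumes records are dealt one at a time rather than in blocks.

```python
from itertools import cycle

def partition_round_robin(records, n_partitions):
    """Deal records to partitions in turn: 0, 1, ..., n-1, 0, 1, ..."""
    partitions = [[] for _ in range(n_partitions)]
    # cycle() yields the target partition index for each successive record.
    for target, record in zip(cycle(range(n_partitions)), records):
        partitions[target].append(record)
    return partitions

parts = partition_round_robin(list(range(10)), 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]
```

The partition sizes differ by at most one record, whatever the record contents are. A load-balancing partitioner would instead pick the target based on how quickly each downstream partition is consuming records.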
22. If you want to do a join in parallel, which partition component would you use?
A. Partition by Key.
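Why key partitioning is the right choice for a parallel join can be shown with a minimal Python sketch (not Ab Initio code). All names here are hypothetical, and CRC32 is used only as a stand-in hash. Because both inputs are partitioned on the join key with the same function, matching records always land in the same partition, so each partition can be joined independently.

```python
import zlib

def partition_by_key(records, key, n_partitions):
    """Send every record with the same key value to the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        h = zlib.crc32(str(rec[key]).encode("utf-8"))  # stand-in hash function
        partitions[h % n_partitions].append(rec)
    return partitions

def join_partition(left, right, key):
    """Inner join of two record lists that live in one partition."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def parallel_join(left, right, key, n_partitions):
    """Partition both inputs on the join key, then join partition by partition."""
    left_parts = partition_by_key(left, key, n_partitions)
    right_parts = partition_by_key(right, key, n_partitions)
    joined = []
    for l_part, r_part in zip(left_parts, right_parts):
        joined.extend(join_partition(l_part, r_part, key))
    return joined

orders = [{"cust": "a", "amount": 10}, {"cust": "b", "amount": 20},
          {"cust": "a", "amount": 30}]
customers = [{"cust": "a", "name": "Ann"}, {"cust": "b", "name": "Bob"}]
print(len(parallel_join(orders, customers, "cust", 4)))  # 3
```

With round-robin partitioning, an order and its customer record could end up in different partitions and the per-partition joins would miss matches.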
24. If you had an ad-hoc multifile of 100 files, and you wanted to run only 4 ways
parallel, what would you do?
A. Use Concatenate or a custom component.
25. Within a graph, how would you take a 4-way parallel stream to 8 ways? What
component would you use?
A. Repartition using Partition -> Gather, or a Partition -> fan-in component.
25A. If we are creating a multifile, what files are created?
A. 1. Control files.
2. Data Files (Serial files) (Partition files).
31. If a job fails, how do you roll back to the last successful checkpoint manually?
A. Use m_rollback.
48. What problems did you encounter while creating graphs?
A.
51. Suppose you have 50,000 records in an input table and you want to test with 10 records.
How can you do it?
A. Use Filter by Expression with the condition next_in_sequence() <= 10.
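The effect of filtering on a running sequence number can be checked with a short Python sketch. This is only an analogy: it assumes the sequence function starts at 1 and increases by 1 per record, and the name first_n is hypothetical.

```python
from itertools import count

def first_n(records, n, strict=False):
    """Keep records while a 1-based sequence number passes the test.

    strict=False applies 'seq <= n'; strict=True applies 'seq < n'.
    """
    seq = count(1)  # stands in for a sequence function assumed to start at 1
    kept = []
    for rec in records:
        number = next(seq)
        if (number < n) if strict else (number <= n):
            kept.append(rec)
    return kept

print(len(first_n(range(50000), 10)))               # 10
print(len(first_n(range(50000), 10, strict=True)))  # 9
```

Under that assumption, a strict "less than 10" test keeps nine records, so the comparison has to be "less than or equal to 10" (or "less than 11") to keep ten.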
55. You are handed a process written in Ab Initio, and users are complaining that it runs
slowly. Outline a strategy for improving the performance.
A. Look at:
1. Parallelism.
2. Sorting in memory.
3. Spilling to disk.
4. Unnecessary columns being carried around.
56B. What is the difference between a sandbox and the graph parameters?
A. Sandbox parameters are global and can be accessed from any graph for a particular
user.
Graph parameters are local to the graph and cannot be accessed from other graphs.
58. How can a Rollup replace a Sort and Dedup, and when can it do so?
A. Rollup implicitly does a unique sort. If you care which of the duplicates is kept,
you probably cannot use a Rollup to replace a Sort and Dedup.
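The idea can be sketched in Python (an illustration, not Ab Initio code; both function names are hypothetical). Grouping on the key and emitting one record per group yields the same set of keys as sorting and then dropping adjacent duplicates. Which record survives in each group is a property of the grouping logic, which is why the choice of kept duplicate matters.

```python
def rollup_dedup(records, key):
    """One output record per key value; here the first record seen is kept."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], rec)
    return list(groups.values())

def sort_dedup(records, key):
    """Sort on the key, then keep the first record of each run of equal keys."""
    out = []
    for rec in sorted(records, key=lambda r: r[key]):  # sorted() is stable
        if not out or out[-1][key] != rec[key]:
            out.append(rec)
    return out

rows = [{"id": 2, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(sorted(r["id"] for r in rollup_dedup(rows, "id")))  # [1, 2]
print([r["id"] for r in sort_dedup(rows, "id")])          # [1, 2]
```

In this sketch both functions happen to keep the first occurrence. A grouping step that kept the last record, or an aggregate of the group, would give a different surviving record for id 2.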
59. When you run an Ab Initio graph, when does the .rec file get deleted?
A. It is deleted after the graph runs but before the end script runs.
60. Does a join of two sorted data streams preserve their respective sort order?
A. Sometimes.
If the flows are already sorted, and are sorted on the same key, the join retains the
sort order automatically.
If the flows are not pre-sorted, you have the choice of whether or not to maintain the
sort order.
61. What is the difference between the Sort and Sort Within Groups components? Is
the output the same? Is the performance the same?
A. Sort: simply sorts the records on the given key (ascending by default).
Sort Within Groups: takes data that is already sorted on the major key and sorts the
records within each group on the minor key.
The output is not the same as that of a Sort on the minor key alone.
It is the same as that of a Sort on the major and minor keys together, but the
performance of the latter component is better because it only sub-sorts already sorted data.
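A small Python sketch (not Ab Initio code; the function name and field names are hypothetical) shows the sub-sorting idea. The input is assumed to be already sorted on the major key, so only the records inside each group have to be ordered on the minor key.

```python
from itertools import groupby

def sort_within_groups(records, major, minor):
    """Sort each run of equal major-key records on the minor key.

    Assumes the input is already sorted (grouped) on the major key.
    """
    out = []
    for _, group in groupby(records, key=lambda r: r[major]):
        out.extend(sorted(group, key=lambda r: r[minor]))
    return out

rows = [
    {"dept": "A", "sal": 3}, {"dept": "A", "sal": 1},
    {"dept": "B", "sal": 2}, {"dept": "B", "sal": 0},
    {"dept": "C", "sal": 5},
]
full_sort = sorted(rows, key=lambda r: (r["dept"], r["sal"]))
print(sort_within_groups(rows, "dept", "sal") == full_sort)  # True
```

Each small group is sorted on its own, which is cheaper than one sort over the whole data set on both keys.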
62. When deciding upon a partitioning key what reflects a wise choice?
A. Even or nearly even data distribution among partitions denotes a good partitioning
key.
67. We can run graphs from the GDE. How can I run one without the GDE?
A. By deploying the graph as a Korn shell script.
68. Can you execute the graph more than once at the same time? How?
A. Yes. By setting .rec file.
72. Did you ever use multiway processing? What are the advantages of a parallel MFS over serial?
A. Parallel processing: the data is divided into partitions.
74. Co-Op is installed on two servers A and B, graph is running on A, How can I
rename the graph on server B?
A. By using Run program. Issue mv command.
77. After running the graph in GDE, what file is created in sand box?
A. .rec file.
78. In a Reformat component, how do you set the parameters if I have 1 input and
2 output files?
A.
79. What are the databases you used most of the time?
A. Oracle, Teradata.
80. What component do you use to load data into Oracle database?
A. Output table component.
84. Which do you prefer, Join or Lookup File, if I have two inputs, one
with 100,000 records and the other with 5,000?
A. When one input has few records, go for a Lookup File.
89. Do you understand multifiles? Can you have them in a Windows environment?
A.
90. There are two datasets, A with 100 million records and B with 50,000 records, and the
data is not sorted. You have to join them. What component would you use? How can
you modify it?
A.
91. In a data stream there is a field with values from 1 to 9. What component will you use?
A.
97. What is the difference between hash Partitioning and Time series partitioning?
A.
101. What types of data files have you loaded, and how did you load them?
A.
103. How do you schedule the job load, how much time will it take to load a 10 GB
data file, and what type of parallelism is used?
A.
107. Can you join two tables using the JOIN keyword in SQL?
A.
127. What parameter specifies the memory size for the sort component?
A. Maxcore.
129. What is the method to create user-defined functions that the validate
component can use to verify data?
A. is_valid function prefix syntax.
130. Within an include statement, what does the ~ (tilde) character do? Does it
indicate that the given include file is relative to the local sandbox xfr directory?
A. Yes.
131. In a component MPC file, what does the image line specify?
a) The location of the script or program to execute.
b) The label of the component when displayed in a GDE graph.
c) The icon used when displayed in the GDE component library.
d) The argument list passed to the unitool launcher.
e) None of the above.
A.
132. How do you describe the characteristic of the driving input for the join
component?
A. The driving input is the largest input. Its records are streamed through the join
transform, while the records of the other inputs are stored in memory beforehand.
133. Which action will cause the current partitioning keys to become invalid?
a) Multiplying partition keys by a constant
b) Joining using non grouped input with fewer keys than Partition by.
c) Using rollup with grouped input with fewer keys than Partition by.
d) Gathering(2) 8way multi-files into a single 8way multi-file.
e) All the above.
A.
134. What environment variable can be modified to alter the format of monitoring
reports?
a) XX_REPORT
b) XX_DEBUG
c) AB_CONFIG
d) XX_MONITOR
e) None of the above.
A. None of the above
135. What is the component that does not force a phase break?
A. Intermediate file.
2. How can you bring a job running in the background to the foreground?
A. By typing “fg”.
4.
Gather: collects records from many sources; reads data from flow partitions.
It reduces data parallelism and pipeline parallelism,
and it does not support default record assignment.
LocalMerge: reads data from many sorted sources and maintains the sort order.
Concat:
takes multiple streams of data and appends them one after another; it maintains the
order.
Interleave:
collects the records from many sources in round-robin fashion. It reads a block of
records (block size) from the first partition.
Partition by Key and Sort: all records with the same key are in the same partition (local
lookup).
Partition by Round-robin: distributes data evenly across the output
partitions; reads records in chunks.
Transformer
Merge:
correlates data from different sources. It reads records from multiple input
ports and operates on the records with merging keys.
Merge-Join: performs inner, outer and semi joins, as in a relational database.
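The difference between merging sorted sources and concatenating them, as noted above, can be seen in a short Python sketch. It uses the standard library heapq.merge and itertools.chain as stand-ins and is not Ab Initio code; the function names are hypothetical.

```python
import heapq
from itertools import chain

def merge_sorted(partitions):
    """Combine already sorted partitions into one sorted stream."""
    return list(heapq.merge(*partitions))

def concatenate(partitions):
    """Append the partitions one after another, keeping each one's own order."""
    return list(chain.from_iterable(partitions))

parts = [[1, 4, 9], [2, 3, 10], [5, 6]]
print(merge_sorted(parts))  # [1, 2, 3, 4, 5, 6, 9, 10]
print(concatenate(parts))   # [1, 4, 9, 2, 3, 10, 5, 6]
```

The merge keeps the overall sort order only because every input is already sorted. The concatenation keeps the order within each input but not across inputs.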
m_ls, m_cp, m_dump and m_rollback are Co>Operating System shell-level utilities (for
managing parallel files, managing metadata, and recovering a checkpointed process).
Skew and monitoring: at the user's request, the Co>Operating System monitors Ab Initio
jobs and issues periodic reports. Monitoring is controlled in either of two ways:
Shell: set the configuration variable xx_report before running the job.
Within the script: supply arguments to the report option of the mp run command.
The two interfaces accept the same set of keyword arguments. If both interfaces are
used, the effect is additive. In summary, the keywords are: verbose error, expanded
graph, flows, times, skew, skew = n, scroll = mode, file = filename,
interval = n, table flows.
Create a multifile system: a place where parallel files are stored.
Recovery
Log files: start/end, host system, variable/ab-initio/host/unique-id,
sequence character.
Investigating, recapsulation.
m_rollback (-d, -I, -h): manual rollback.
XX_NICE, XX_TIMEOUT, XX_INTERVAL, AB_CONNECTION/_SCRIPT,
AB_HOME, AB_PASSWORD, AB_NODES
When running applications, please note the environment variables are passed downward
only.
/usr/local/abinitio
export AB_HOME=/usr/local/abinitio
export PATH=$AB_HOME/bin:$PATH
The above settings enable your shell to locate the installed Ab Initio Software.
m_ls -1 mfs
The concept of skew refers to an unbalanced load among the partitions of a multifile or
among the partitions of a dataflow.
E.g.: for a particular flow or file, there are k partitions,
holding a given total number of bytes across all partitions.
Then average = total/k.
For example, with total = 1000 bytes and k = 20: average = 1000/20 = 50.
Then the skew for a partition with n bytes is (n - average)/max, where max is the size of
the largest partition. It ranges from -100% through 0% to 100%.
By the way, the sum of all the skews is 0%.
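The skew formula above can be worked through with a small Python sketch. The function name partition_skews is hypothetical, and max is taken to be the byte count of the largest partition, which is an assumption about the formula rather than something the notes spell out.

```python
def partition_skews(sizes):
    """Return (n - average) / max for each partition size n, as a fraction."""
    average = sum(sizes) / len(sizes)
    largest = max(sizes)
    if largest == 0:
        # All partitions are empty, so there is no imbalance to report.
        return [0.0 for _ in sizes]
    return [(n - average) / largest for n in sizes]

skews = partition_skews([50, 50, 100, 0])
print(skews)       # [0.0, 0.0, 0.5, -0.5]
print(sum(skews))  # 0.0
```

Here the average is 50 bytes, the fullest partition is 50% over it relative to the maximum, the empty one is 50% under, and the skews cancel out to zero.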
m_dump produces a human-readable report that shows how input data is interpreted by
Ab Initio metadata.
For example: m_dump foo.dml foo.dat (prints the data in foo.dat as interpreted by the
metadata in foo.dml).
In the event of failure, the application can restart from the most recent checkpoint instead of
from the beginning.
In case of a software error or a user Control-C command, the Co>Operating System takes care
of automatic rollback, restoring all files, flows, and processes to their initial state or to
their state at the most recent checkpoint.
When a job does not complete normally, it leaves a file in the working directory on the
host system with the name jobname.rec, located at /var/abinitio/vnode/unique-id.
Once a prototype configuration file is created for the database, each table must be
analyzed with db_config (which analyzes a table to determine the column names and types, the
applicable nodes, and the best scheme for loading or unloading it).
Database Components
DB Unload
DB Load
DB Truncate
SQLrun – run miscellaneous SQL against the database
Unload.dml
record
  string(",") name;
  decimal(",") age;
end;
Reformat.dml
record
  string(10) name = "";
  decimal(3) age = "";
end;