SSIS Interview Questions and Answers For Experienced and Freshers
Here we are publishing a series of posts on SSIS interview questions and answers for experienced candidates and freshers. Below is Part 4 of the series.
Even though both are ETL tools, we can differentiate Data Transformation Services (DTS) from SQL Server Integration Services (SSIS) based on a number of observations.
a) Control Flow
b) Data Flow
c) Event Handlers
d) Package Explorer
A Foreach Loop will execute once for each item in the collection of items it is looking at. A good example would be users putting Excel files into a directory for import into the database. You cannot tell ahead of time how many files will be in the directory, because a user might be late, or there might be more than one file from a given user. When you define the Foreach Loop container, you tell it to execute for each *.xls file in the directory, and it will then loop through, importing each one individually, regardless of how many files are actually there.
The OLE DB Destination can use the Fast Load option and hence perform bulk uploads.
Copy Column can add new columns only by copying existing columns, whereas Derived Column can add new columns without any help from existing columns.
Derived Column can apply different expressions and data types to the new columns, whereas Copy Column cannot.
The Merge transformation is similar to the Union All transformation. Use the Union All transformation instead of the Merge transformation in the following situations:
The transformation inputs are not sorted.
The combined output does not need to be sorted.
The transformation has more than two inputs.
Q. What is the difference between for loop and for each loop container?
Ans:
The For Loop Container executes a specified number of times, for example 10 or 20 times, repeating until its evaluation condition is no longer met.
The Foreach Loop Container runs over an iterator. This iterator can be files in a folder, records from an ADO recordset, data from a variable, etc.
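For example (assuming a package variable named Counter), the For Loop Container's three expression properties could be configured as follows to run exactly ten iterations:
InitExpression: @Counter = 0
EvalExpression: @Counter < 10
AssignExpression: @Counter = @Counter + 1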
Q. How do you pass a property value at run time? How do you implement Package Configurations?
Ans:
A property value, such as the connection string of a Connection Manager, can be passed to the package at run time using package configurations. Package Configurations provide different options like XML file, environment variable, SQL Server table, registry value or parent package variable.
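As a rough illustration of the SQL Server table option, the configuration wizard stores settings in a table like the one below. The table and column names mirror the defaults the wizard generates, while the server, database and connection manager names in the sample row are made-up placeholders:

CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL, -- groups related settings
    ConfiguredValue     NVARCHAR(255) NULL,     -- the value to apply at run time
    PackagePath         NVARCHAR(255) NOT NULL, -- property path inside the package
    ConfiguredValueType NVARCHAR(20)  NOT NULL  -- data type of the property
);

-- Sample row: override a connection string at run time (placeholder names)
INSERT INTO [dbo].[SSIS Configurations]
VALUES ('ProdConnections',
        'Data Source=PRODSQL01;Initial Catalog=DW;Integrated Security=SSPI;',
        '\Package.Connections[OLEDB_DW].Properties[ConnectionString]',
        'String');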
Asynchronous Transformations:
The output buffer or output rows are not in sync with the input buffer; output rows use a new buffer. In these situations it is not possible to reuse the input buffer, because an asynchronous component can have more, the same, or fewer output records than input records.
The component has to acquire multiple buffers of data before it can perform its
processing. An example is the Sort transformation, where the component has to
process the complete set of rows in a single operation.
The component has to combine rows from multiple inputs. An example is the
Merge transformation, where the component has to examine multiple rows from
each input and then merge them in sorted order.
There is no one-to-one correspondence between input rows and output rows. An
example is the Aggregate transformation, where the component has to add a row
to the output to hold the computed aggregate values.
Asynchronous components can further be divided into the two types described below:
Partially Blocking Transformation: the output set may differ in quantity from the input set, so new buffers need to be created to accommodate the newly created set.
Blocking Transformation: a transformation that must hold one or more buffers while it waits on one or more buffers before it can pass any buffer down the pipeline. All input records must be read and processed before any output records are created. For example, a Sort transformation must see all rows before sorting, and it blocks any data buffers from being passed down the pipeline until the output is generated.
Note:
Synchronous components reuse buffers and therefore are generally faster than asynchronous components.
Execution trees are enormously valuable in understanding buffer usage. They can be displayed for a package by turning on logging for the Data Flow task (specifically the PipelineExecutionTrees log event).
The sysdtspackages90 table (sysssispackages in SQL Server 2008) stores the actual package content, and the following msdb tables play supporting roles.
2005:
sysdtscategories
sysdtslog90
sysdtspackagefolders90
sysdtspackagelog
sysdtssteplog
sysdtstasklog
2008:
sysssispackagefolders
sysssislog
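As a quick illustration, a query along these lines lists the packages stored in msdb on SQL Server 2008 together with their folders (a sketch against the standard msdb tables; verify the column names on your instance):

SELECT f.foldername,
       p.[name] AS package_name,
       p.[description]
FROM msdb.dbo.sysssispackages AS p
JOIN msdb.dbo.sysssispackagefolders AS f
     ON p.folderid = f.folderid
ORDER BY f.foldername, p.[name];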
Q. How to achieve parallelism in SSIS?
Ans:
Parallelism is achieved using the MaxConcurrentExecutables property of the package. Its default value is -1, which resolves to the number of processors plus 2; for example, on a machine with four processors, up to six executables can run concurrently.
Q. Differences between dtexec.exe and dtexecui.exe
Ans:
Both dtexec.exe and dtexecui.exe execute SSIS packages in the same manner. The
difference is that dtexecui provides a graphical user interface for constructing the command line arguments for dtexec. The command string that dtexecui generates can be used as command line arguments to dtexec.
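For example, a package stored on the file system can be run (and its configuration file supplied) from the command line like this; the paths are hypothetical:

dtexec /File "C:\ETL\LoadSales.dtsx" /ConfigFile "C:\ETL\LoadSales.dtsConfig"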
If you have a single config file that stores all your connection managers, then all your packages must contain the connection managers that are stored in that config file. This means you may have to put connection managers in your package that you don't even need.
There are also third-party tools that can accomplish this for you (Pragmatic Works BI xPress).
Other typical quality issues are nulls (missing values), outliers (dates like 2999, or typos like 50000 instead of 5000, which are especially important if someone is adjusting a value to get a bigger bonus) and incorrect addresses. These are either corrected during ETL, ignored, redirected for further manual updates, or they fail the package, which for big processes is usually not practiced.
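A simple profiling query along these lines can flag such rows before the load; the staging table and column names are made up purely for illustration:

-- Flag rows with missing keys or obvious outliers in a hypothetical staging table
SELECT *
FROM dbo.stg_Sales
WHERE CustomerId IS NULL                  -- missing value
   OR OrderDate > '2100-01-01'            -- date outlier, e.g. year 2999
   OR SalesAmount > 10 * (SELECT AVG(SalesAmount) FROM dbo.stg_Sales); -- suspiciously large value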
Let's start with when you typically use stored procedures. This is for preparing tables (truncate), audit tasks (usually part of an SSIS framework), getting configuration values for loops, and a few other general tasks.
During the ETL extract you usually write simple SQL, because the data comes from other sources and over-complication (making it dynamic) is usually not a good choice: any changes affect the package, which then has to be updated as well.
During the transformation phase (business rules, cleaning, core work) you should use transformation tasks, not stored procedures! There are loads of tasks that make the package much easier to develop, but another very important reason is readability, which matters a great deal to other people who need to change the package, and it obviously reduces the risk of making errors. Performance is usually very good with SSIS, as it is a memory/flow-based approach. So when should you use stored procedures for transformations? If you don't have strong SSIS developers, or if you have performance reasons to do so. In some cases stored procedures can be much faster (usually this only applies to very, very large datasets). Most important is to have reasons why one approach is better for the situation.
Q. What is your approach for ETL with data warehouses (how many packages do you develop during a typical load, etc.)?
Ans:
This is a rather generic question. A typical approach (for me) when building ETL is to have one package that extracts data per source, applies extract-specific transformations (lookups, business rules, cleaning) and loads the data into a staging table. Then another package either does a simple merge from staging into the data warehouse (via a stored procedure), or takes data from staging and performs extra work before loading it into the data warehouse. I prefer the first option, and with this approach I occasionally consider having an extract stage (as well as a stage phase), which gives me more flexibility with transformation (per source) and makes it simpler to follow (not everything in one go). So to summarize, you usually have one package per source and one package per data warehouse table destination. There might be other valid approaches as well, so ask for the reasons behind them.
These individual words are called tokens; the tokens in the index are delimited using special characters and matched against the reference table.
The Fuzzy Lookup transformation creates temporary objects, such as tables and indexes, in the SQL Server tempdb. So make sure that the SSIS user account has sufficient access to the database engine to create and maintain these temporary objects. The Fuzzy Lookup transformation has 3 features.
Q. How do you accomplish incremental loads? (Load the destination table with new records and update the existing records from the source, if any updated records are available.)
Ans:
There are a few methods available:
You can use a Lookup transformation to compare source and destination data based on some id/code and get the new and updated records, and then use a Conditional Split to separate the new and updated rows before loading the table. However, I don't recommend this approach, especially when the destination table is very large and the volume of delta is very high.
Use an Execute SQL Task with a staging table:
Find the maximum ID and last ModifiedDate in the destination and store them in package variables. (Control Flow)
Pull the new and updated records from the source and load them into a staging table (a data-load table created in the destination database) using the above variables. (Data Flow)
Insert and update the records using an Execute SQL Task, as shown in the sketch after this list. (Control Flow)
Use the CDC (Change Data Capture) feature from SQL Server 2008:
Use a Conditional Split to split the data into inserts, updates and deletes.
For inserts, redirect the rows to an OLE DB Destination.
For updates and deletes, redirect the rows to an OLE DB Command transformation.
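As a rough sketch of the Execute SQL Task step in the staging approach above, the update-then-insert statements could look like the following; the destination and staging table names and columns are illustrative assumptions only:

-- Update existing rows that have changed (hypothetical table and column names)
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.ModifiedDate = s.ModifiedDate
FROM   dbo.DimCustomer AS d
JOIN   dbo.stg_Customer AS s
       ON s.CustomerId = d.CustomerId
WHERE  s.ModifiedDate > d.ModifiedDate;

-- Insert rows that do not yet exist in the destination
INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, ModifiedDate)
SELECT s.CustomerId, s.CustomerName, s.ModifiedDate
FROM   dbo.stg_Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.DimCustomer AS d
                   WHERE d.CustomerId = s.CustomerId);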
Q. How can you enable the CDC for a table?
Ans:
To enable CDC on a table, the feature must first be enabled on the corresponding database. Both can be done using the procedures below.
exec sys.sp_cdc_enable_db
exec sys.sp_cdc_enable_table
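A minimal example, assuming a hypothetical database MyDatabase and a dbo.Customer table to be tracked (the gating role is left as NULL here):

USE MyDatabase;
GO

-- Step 1: enable CDC at the database level
EXEC sys.sp_cdc_enable_db;
GO

-- Step 2: enable CDC for the table to be tracked
EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'Customer',
     @role_name            = NULL,  -- no gating role
     @supports_net_changes = 1;     -- requires a primary key on the table
GO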