ADE Azure Data Engineer Interview
INTERVIEW QUESTIONS
PART-1
DEVIKRISHNA R
LinkedIn: @Devikrishna R Email: [email protected]
1. Reading and Processing 10 Million CSV Files in ADLS Gen2 using
Azure Data Factory
To handle a large number of CSV files in ADLS Gen2 using Azure Data
Factory (ADF):
• Step 1: Use a Wildcard File Path
In the source settings of ADF's Copy Data or Data Flow activity, define a wildcard pattern (for example, *.csv) that matches the files in the ADLS Gen2 container. This avoids having to list each file explicitly; see the first sketch after this list.
• Step 2: Batch Processing
Partition the work by file path so that files are processed in parallel, and raise the Copy Data activity's degree of copy parallelism (the parallelCopies property) to increase throughput. This is also shown in the first sketch after this list.
• Step 3: Optimize Data Flow
Use Mapping Data Flow to transform and aggregate the data as needed, and leverage its partitioning options for parallel processing; a compute-sizing sketch follows the list.
• Step 4: Iterative Processing with the ForEach Activity
If each file requires specific processing, use the Get Metadata activity to retrieve the file names, then iterate over them with the ForEach activity (sketched after this list).
• Step 5: Performance and Cost Optimization
o Enable compression on the CSVs to reduce their size.
o Use PolyBase or external tables if the destination is Azure Synapse Analytics (the last sketch below shows a PolyBase load).
o Scale the Integration Runtime (IR) compute environment to match the workload.
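Below is a minimal sketch of Steps 1 and 2 using the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are placeholders, and the raw/* folder layout is an assumption; adjust them to your environment.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSReadSettings,
    CopyActivity,
    DatasetReference,
    DelimitedTextSource,
    ParquetSink,
    PipelineResource,
)

# Placeholder subscription; resource group and factory names below are also placeholders.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Step 1: wildcard source -- matches every CSV without listing files one by one.
source = DelimitedTextSource(
    store_settings=AzureBlobFSReadSettings(
        recursive=True,
        wildcard_folder_path="raw/*",  # assumed folder layout
        wildcard_file_name="*.csv",
    )
)

# Step 2: raise the degree of copy parallelism on the Copy activity.
copy_csvs = CopyActivity(
    name="CopyAllCsvFiles",
    inputs=[DatasetReference(reference_name="SourceCsvDataset")],     # hypothetical dataset
    outputs=[DatasetReference(reference_name="SinkParquetDataset")],  # hypothetical dataset
    source=source,
    sink=ParquetSink(),
    parallel_copies=16,
)

client.pipelines.create_or_update(
    "my-rg", "my-adf", "BulkCsvCopy", PipelineResource(activities=[copy_csvs])
)
```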
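For Step 3, partitioning itself is configured inside the data flow's transformation settings, but one way to give those partitions more parallel capacity is to size up the Spark compute on the Execute Data Flow activity. A sketch with the same SDK; the data flow name is hypothetical:

```python
from azure.mgmt.datafactory.models import (
    DataFlowReference,
    ExecuteDataFlowActivity,
    ExecuteDataFlowActivityTypePropertiesCompute,
)

# Run a pre-authored (hypothetical) data flow on a larger Spark cluster so its
# partitioned transformations execute with more parallelism.
run_flow = ExecuteDataFlowActivity(
    name="TransformCsvFiles",
    data_flow=DataFlowReference(reference_name="AggregateCsvFlow"),  # hypothetical flow
    compute=ExecuteDataFlowActivityTypePropertiesCompute(
        compute_type="MemoryOptimized",
        core_count=16,
    ),
)
```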
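Step 4 can be wired up as a Get Metadata activity feeding a ForEach. The child pipeline ProcessSingleCsv and its fileName parameter are hypothetical, and a batchCount of 20 is just an example value:

```python
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    DatasetReference,
    ExecutePipelineActivity,
    Expression,
    ForEachActivity,
    GetMetadataActivity,
    PipelineReference,
)

# Retrieve one childItems entry per file in the source folder.
get_files = GetMetadataActivity(
    name="GetFileList",
    dataset=DatasetReference(reference_name="SourceFolderDataset"),  # hypothetical dataset
    field_list=["childItems"],
)

# Fan out over the file list, running up to 20 iterations at a time.
process_files = ForEachActivity(
    name="ProcessEachFile",
    items=Expression(value="@activity('GetFileList').output.childItems"),
    is_sequential=False,
    batch_count=20,
    activities=[
        ExecutePipelineActivity(
            name="ProcessOneFile",
            pipeline=PipelineReference(reference_name="ProcessSingleCsv"),  # hypothetical child pipeline
            parameters={"fileName": "@item().name"},
        )
    ],
    depends_on=[
        ActivityDependency(activity="GetFileList", dependency_conditions=["Succeeded"])
    ],
)
```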
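And for the PolyBase option in Step 5, the Copy activity sink can bulk-load Azure Synapse Analytics through a staging store. A sketch under the same assumptions; the staging linked service and datasets are placeholders:

```python
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    DelimitedTextSource,
    LinkedServiceReference,
    PolybaseSettings,
    SqlDWSink,
    StagingSettings,
)

# Load Synapse via PolyBase, staging the data in Blob storage first.
load_synapse = CopyActivity(
    name="LoadToSynapse",
    inputs=[DatasetReference(reference_name="SourceCsvDataset")],      # hypothetical dataset
    outputs=[DatasetReference(reference_name="SynapseTableDataset")],  # hypothetical dataset
    source=DelimitedTextSource(),
    sink=SqlDWSink(
        allow_poly_base=True,
        poly_base_settings=PolybaseSettings(reject_type="value", reject_value=0),
    ),
    enable_staging=True,
    staging_settings=StagingSettings(
        linked_service_name=LinkedServiceReference(reference_name="StagingBlob"),  # hypothetical
        path="staging",
    ),
)
```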
2. What is Integration Runtime?
Integration Runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities such as data movement, activity dispatch, and SSIS package execution. Types of IR:
• Azure IR: For cloud-based activities like copying data between
cloud sources.
• Self-Hosted IR: For accessing on-premises or private network
data securely.
• Azure-SSIS IR: For executing SQL Server Integration Services
(SSIS) packages.
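As an illustration, a Self-Hosted IR can be registered in the factory and then referenced from an on-premises linked service via connectVia. A minimal sketch with the azure-mgmt-datafactory SDK; the IR, server, and credential details are placeholders, and the IR node software must still be installed on an on-premises machine:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeReference,
    IntegrationRuntimeResource,
    LinkedServiceResource,
    SelfHostedIntegrationRuntime,
    SqlServerLinkedService,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Register a Self-Hosted IR in the factory (the runtime software is then
# installed on an on-premises node and joined with an authentication key).
client.integration_runtimes.create_or_update(
    "my-rg", "my-adf", "OnPremIR",
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="IR for on-prem data")
    ),
)

# Point an on-premises linked service at that IR via connect_via.
on_prem_sql = SqlServerLinkedService(
    connection_string="Server=onprem-sql;Database=sales;Integrated Security=True;",  # placeholder
    connect_via=IntegrationRuntimeReference(reference_name="OnPremIR"),
)
client.linked_services.create_or_update(
    "my-rg", "my-adf", "OnPremSqlServer", LinkedServiceResource(properties=on_prem_sql)
)
```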