
🟢 HR Round 1 – Behavioral & Fitment Questions

Set 1 - General Questions
1️⃣ Tell me about yourself.
2️⃣ Why are you interested in this role?
3️⃣ What do you know about our company?
4️⃣ What are your strengths and weaknesses?
5️⃣ Where do you see yourself in the next 3-5 years?

Set 2 - Work Experience & Role-Specific Questions
6️⃣ Can you walk me through your experience with Snowflake & AWS Glue?
7️⃣ What’s the most challenging ETL pipeline migration you’ve worked on?
8️⃣ How do you handle failures in AWS Glue & Snowflake pipelines?
9️⃣ Can you explain a situation where you improved pipeline performance?
🔟 How do you ensure data accuracy and consistency during migration?

💡 Pro Tip: Use the STAR Method (Situation, Task, Action, Result) to answer behavioral questions effectively.

🟠 Technical Round 2 – Data Engineering Concepts (ETL, Snowflake, SQL, PySpark, AWS Glue)

Set 1 - ETL & Pipeline Migration Questions
1️⃣ How do you approach migrating ETL pipelines from Oracle/MS SQL to Snowflake?
2️⃣ Explain the best practices for designing ETL workflows in AWS Glue.
3️⃣ What are the differences between Snowflake vs Redshift vs BigQuery for warehousing?
4️⃣ How do you handle incremental vs full data loads in Snowflake?
5️⃣ Explain how you optimize ETL jobs in AWS Glue & PySpark.

Set 2 - Snowflake Performance & Optimization
6️⃣ How do you optimize Snowflake queries for faster performance?
7️⃣ What’s the role of clustering & partitioning in Snowflake?
8️⃣ How do you handle large-scale data ingestion into Snowflake?
9️⃣ What are Transient vs Permanent Tables in Snowflake, and when to use them?
🔟 Explain Time Travel and Zero-Copy Cloning in Snowflake.

💡 Pro Tip: Be ready to write SQL queries & PySpark transformations on a whiteboard or shared screen.

🔵 Technical Round 3 – System Design & Cloud Infrastructure (AWS, APIs, Data Modeling)

Set 1 - Data Modeling & Warehouse Design
1️⃣ How do you design a Snowflake schema for an analytics use case?
2️⃣ What’s the difference between Star Schema & Snowflake Schema?
3️⃣ How do you model Salesforce & NetSuite data for analytics?
4️⃣ Explain fact vs dimension tables in a Snowflake data warehouse.
5️⃣ How do you handle slowly changing dimensions (SCD) in Snowflake?

Set 2 - Cloud Infrastructure & API Integration
6️⃣ How does AWS Glue integrate with Snowflake for ETL?
7️⃣ How do you replicate Salesforce/Workday data into Snowflake?
8️⃣ Explain AWS Lambda vs AWS Glue vs Airflow for data orchestration.
9️⃣ How do you handle API rate limits & failures in data ingestion?
🔟 What security best practices do you follow for AWS & Snowflake?

💡 Pro Tip: Expect system design whiteboarding & architecture discussions.

🟣 HR Round 4 – Final Discussion & Salary Negotiation

Set 1 - Company & Culture Fit Questions
1️⃣ What motivates you as a Data Engineer?
2️⃣ How do you handle tight deadlines & production failures?
3️⃣ Have you worked in cross-functional teams?
4️⃣ How do you keep up with new data engineering trends?
5️⃣ Why should we hire you?

Set 2 - Salary & Expectations
6️⃣ What are your salary expectations?
7️⃣ Are you open to contract vs full-time roles?
8️⃣ Are you comfortable working in different time zones?
9️⃣ What’s your preferred work model (Remote/Hybrid/Onsite)?
🔟 Do you have any questions for us?

📌 Bonus: Hands-On Coding & Whiteboarding Practice

✅ SQL Questions
🔹 Write a query to find the top 3 highest-selling products per month.
🔹 How do you implement window functions in Snowflake?
🔹 Write a query to merge new records into an existing Snowflake table.

✅ PySpark Questions
🔹 Convert a JSON file to Parquet using PySpark.
🔹 Write a PySpark script to remove duplicate records from a DataFrame.
🔹 Explain how broadcast joins improve performance in PySpark.

✅ AWS Glue Questions
🔹 How do you create a Glue job for processing S3 data?
🔹 How do you handle schema evolution in AWS Glue?
🔹 What’s the difference between Glue DynamicFrame & DataFrame?

🟢 HR Round 1 – Behavioral & Fitment Questions

Set 1 - General Questions

1️⃣ Tell me about yourself.


"I'm a Data Engineer with over two years of experience designing and optimizing data
pipelines. I specialize in ETL development, big data processing with PySpark, and cloud-
based solutions using AWS and Snowflake. Currently, I work at a news agency, where I build
scalable data pipelines for analytics and reporting. I enjoy working on performance
optimization, automating workflows, and ensuring data integrity. Beyond work, I stay
updated with the latest data engineering trends and enjoy contributing to cross-functional
projects."

2️⃣ Why are you interested in this role?


"I'm excited about this role because it aligns with my experience in Snowflake, AWS Glue,
and ETL workflows. I see it as an opportunity to work on large-scale data challenges,
optimize complex pipelines, and collaborate with teams that prioritize innovation in data
engineering."

3️⃣ What do you know about our company?


"I've researched your company and found that you focus on [mention specific domain, e.g.,
e-commerce, finance, healthcare] and handle large volumes of data. Your emphasis on
cloud-based data solutions and analytics-driven decision-making aligns with my expertise,
and I believe I can contribute effectively."

4️⃣ What are your strengths and weaknesses?

 Strengths:

o Proficiency in building scalable ETL pipelines

o Strong SQL and PySpark skills for data transformation & optimization

o Experience in cloud platforms like AWS & Snowflake

 Weakness:

o "I tend to focus too much on details when optimizing queries. However, I’m
learning to balance performance with project timelines by prioritizing
optimizations that have the most impact."

5️⃣ Where do you see yourself in the next 3-5 years?


"I see myself evolving into a Senior Data Engineer or a Data Architect, leading the design of
efficient data systems, mentoring junior engineers, and working on cutting-edge
technologies in big data and AI-driven analytics."

Set 2 - Work Experience & Role-Specific Questions

6️⃣ Can you walk me through your experience with Snowflake & AWS Glue?
"I've designed and optimized ETL pipelines using AWS Glue for transforming and loading
data into Snowflake. I use PySpark within Glue for data transformations, schema evolution,
and incremental loading. In Snowflake, I optimize queries using clustering, partitioning, and
caching techniques."

7️⃣ What’s the most challenging ETL pipeline migration you’ve worked on?
(STAR Method Example)

 Situation: Migrating an on-premise SQL Server ETL pipeline to Snowflake.

 Task: Improve performance and reduce maintenance overhead.

 Action: Used AWS Glue for ETL processing, optimized data partitioning, and
implemented incremental loading using Snowflake Streams & Tasks.

 Result: Reduced processing time by 60% and improved query performance for
analytics.

8️⃣ How do you handle failures in AWS Glue & Snowflake pipelines?

 AWS Glue: Implement checkpointing & retry logic in PySpark jobs.


 Snowflake: Use error-handling SQL scripts, monitoring with Snowflake Query
History, and retry failed tasks using Streams & Tasks.

9️⃣ Can you explain a situation where you improved pipeline performance?
"I optimized an ETL pipeline by using PySpark broadcast joins to speed up small-to-large
table joins, reducing runtime from 3 hours to 45 minutes."

🔟 How do you ensure data accuracy and consistency during migration?

 Schema validation before migration

 Row count & checksum validation

 Using Snowflake Streams & Tasks for CDC (Change Data Capture)
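
A minimal validation sketch (hypothetical table and column names) comparing row counts and checksums between the staged source data and the migrated target:

-- Row count comparison
SELECT
    (SELECT COUNT(*) FROM source_stage_orders) AS source_rows,
    (SELECT COUNT(*) FROM target_orders)       AS target_rows;

-- Checksum comparison over key columns using Snowflake's HASH_AGG
SELECT
    (SELECT HASH_AGG(order_id, customer_id, amount) FROM source_stage_orders) AS source_checksum,
    (SELECT HASH_AGG(order_id, customer_id, amount) FROM target_orders)       AS target_checksum;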

🟠 Technical Round 2 – Data Engineering Concepts

Set 1 - ETL & Pipeline Migration

1️⃣ How do you approach migrating ETL pipelines from Oracle/MS SQL to Snowflake?

 Assess source schema & ETL logic

 Extract data using AWS DMS

 Use AWS Glue/Snowflake Staging Tables for transformations

 Optimize Snowflake warehouse sizing & clustering (Snowflake has no traditional indexes)

2️⃣ Best practices for designing ETL workflows in AWS Glue?

 Use DynamicFrames for schema flexibility

 Optimize memory with Spark partitions

 Use S3 as an intermediate storage layer

3️⃣ Snowflake vs Redshift vs BigQuery?

 Snowflake: Best for on-demand compute scaling & semi-structured data.

 Redshift: Good for batch processing, but less flexible.

 BigQuery: Serverless with automatic scaling, ideal for Google Cloud users.

4️⃣ Handling incremental vs full data loads in Snowflake?

 Full Load: Truncate and reload entire data.

 Incremental Load: Use Streams & Tasks to track changes.
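
A minimal Streams & Tasks sketch for the incremental path (hypothetical object, warehouse, and column names):

-- Stream captures changes (CDC) on the staging table
CREATE OR REPLACE STREAM orders_stream ON TABLE staging_orders;

-- Task periodically merges the captured changes into the target table
CREATE OR REPLACE TASK load_orders_task
    WAREHOUSE = etl_wh
    SCHEDULE  = '15 MINUTE'
AS
MERGE INTO orders t
USING orders_stream s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

ALTER TASK load_orders_task RESUME;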

5️⃣ Optimizing ETL jobs in AWS Glue & PySpark?


 Use Glue Job Bookmarks for incremental loads

 Optimize partitions & avoid shuffling in PySpark

Set 2 - Snowflake Performance & Optimization

6️⃣ How do you optimize Snowflake queries?

 Use clustering keys, result caching, and materialized views

 Minimize SELECT * queries and optimize joins
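
Illustrative examples of these techniques (hypothetical table/view names; materialized views require Snowflake Enterprise edition or higher):

-- Clustering key on a frequently filtered column
ALTER TABLE sales_data CLUSTER BY (order_date);

-- Materialized view for a commonly repeated aggregation
CREATE OR REPLACE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, SUM(sales) AS total_sales
FROM sales_data
GROUP BY order_date;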

7️⃣ Role of clustering & partitioning in Snowflake?

 Clustering improves query pruning

 Partitioning (via file structure) reduces unnecessary scans
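
To see how well a table is clustered on a candidate key (hypothetical table name):

-- Returns clustering depth and overlap statistics as JSON
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_data', '(order_date)');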

8️⃣ Handling large-scale data ingestion into Snowflake?

 Parallel COPY commands from S3

 Auto-ingest using Snowpipe
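
A minimal ingestion sketch (hypothetical stage, pipe, and table names; raw_events has a single VARIANT column for the JSON payload):

-- Bulk load in parallel from an external S3 stage
COPY INTO raw_events
FROM @s3_raw_stage/events/
FILE_FORMAT = (TYPE = 'JSON');

-- Continuous auto-ingest via Snowpipe, triggered by S3 event notifications
CREATE OR REPLACE PIPE raw_events_pipe AUTO_INGEST = TRUE AS
COPY INTO raw_events
FROM @s3_raw_stage/events/
FILE_FORMAT = (TYPE = 'JSON');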

9️⃣ Transient vs Permanent Tables in Snowflake?

 Transient: No Fail-safe, used for staging.

 Permanent: Retains history for compliance.
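
For example (hypothetical names; 90-day retention requires Enterprise edition or higher):

-- Transient staging table: no Fail-safe, lower storage cost
CREATE OR REPLACE TRANSIENT TABLE stg_orders (order_id INT, amount NUMBER);

-- Permanent table with extended Time Travel retention for compliance
CREATE OR REPLACE TABLE orders (order_id INT, amount NUMBER)
    DATA_RETENTION_TIME_IN_DAYS = 90;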

🔟 Time Travel & Zero-Copy Cloning?

 Time Travel: Restore data from past states.

 Zero-Copy Cloning: Clone tables instantly without duplication.
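
Illustrative syntax (hypothetical table names and offsets):

-- Time Travel: query the table as it was one hour ago
SELECT * FROM orders AT(OFFSET => -3600);

-- Recover a dropped table within the retention window
UNDROP TABLE orders;

-- Zero-Copy Cloning: instant copy that shares underlying micro-partitions
CREATE TABLE orders_dev CLONE orders;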


🔵 Technical Round 3 – System Design & Cloud Infrastructure

Set 1 - Data Modeling & Warehouse Design

1️⃣ How do you design a Snowflake schema for an analytics use case?

 Understand business requirements (KPIs, dimensions, fact tables).

 Choose schema type (Star Schema or Snowflake Schema).

 Optimize data storage with clustering and partitioning.

 Use materialized views for frequently used aggregations.

 Leverage Snowflake features like micro-partitioning and result caching.


2️⃣ What’s the difference between Star Schema & Snowflake Schema?

Feature     | Star Schema            | Snowflake Schema
Structure   | Denormalized           | Normalized
Performance | Faster queries         | Slower joins
Storage     | More redundant data    | Less redundancy
Joins       | Fewer joins needed     | Multiple joins required
Use case    | Fast query performance | Optimized storage

3️⃣ How do you model Salesforce & NetSuite data for analytics?

 Extract data using AWS DMS, Fivetran, or Stitch.

 Stage raw data in Snowflake using a schema similar to Salesforce/NetSuite.

 Transform data to match analytical needs (flatten JSON structures, join related
tables).

 Create fact and dimension tables (e.g., Sales as fact, Customers as dimension).

 Optimize with clustering on frequently queried fields (e.g., Date, Customer ID).

4️⃣ Explain fact vs dimension tables in a Snowflake data warehouse.

 Fact tables store transactional data (e.g., Sales, Orders).

 Dimension tables provide context (e.g., Customers, Products).

 Fact tables have high cardinality and numeric values.

 Dimension tables contain descriptive attributes for slicing and dicing data.

Example:

 Fact Table: sales (sale_id, customer_id, product_id, amount, date_id)

 Dimension Table: customers (customer_id, name, region, created_at)
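
A minimal DDL sketch for this example (data types are assumptions):

CREATE TABLE customers (
    customer_id INT,
    name        STRING,
    region      STRING,
    created_at  TIMESTAMP_NTZ
);

CREATE TABLE sales (
    sale_id     INT,
    customer_id INT,          -- joins to customers (dimension)
    product_id  INT,          -- joins to a products dimension
    amount      NUMBER(12,2),
    date_id     INT           -- joins to a date dimension
);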

5️⃣ How do you handle slowly changing dimensions (SCD) in Snowflake?

 SCD Type 1 (Overwrite): Update records directly.


 SCD Type 2 (Versioned History): Maintain historical records with valid_from and
valid_to timestamps.

 SCD Type 3 (Limited History): Store only the previous value in a separate column.

 Use Streams & Tasks to track changes efficiently.

Example SQL for SCD Type 2:

INSERT INTO customers_scd2 (customer_id, name, region, valid_from, valid_to)
SELECT customer_id, name, region, CURRENT_TIMESTAMP, NULL
FROM staging_customers
WHERE NOT EXISTS (
    SELECT 1
    FROM customers_scd2
    WHERE customers_scd2.customer_id = staging_customers.customer_id
);
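
The insert above only adds brand-new customers; when an existing customer's attributes change, the current version also needs to be closed out before the new row is inserted. A minimal sketch of that step (the change-detection condition is an assumption):

-- Close the open version for customers whose attributes changed
UPDATE customers_scd2
SET valid_to = CURRENT_TIMESTAMP
FROM staging_customers
WHERE customers_scd2.customer_id = staging_customers.customer_id
  AND customers_scd2.valid_to IS NULL
  AND (customers_scd2.name <> staging_customers.name
       OR customers_scd2.region <> staging_customers.region);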

Set 2 - Cloud Infrastructure & API Integration

6️⃣ How does AWS Glue integrate with Snowflake for ETL?

 AWS Glue extracts raw data from sources (S3, RDS).

 Processes data using PySpark or Glue DynamicFrames.

 Writes transformed data to Snowflake using a JDBC connection (or the Snowflake Spark connector).

 AWS Glue Catalog can be used for metadata management.

7️⃣ How do you replicate Salesforce/Workday data into Snowflake?

 Use Fivetran/Stitch/AWS DMS for real-time or batch replication.

 Store data in a Snowflake staging area before transformations.

 Use Streams & Tasks to track changes and implement incremental loads.

 Partition & cluster data for optimal query performance.

8️⃣ Explain AWS Lambda vs AWS Glue vs Airflow for data orchestration.

Service        | Use Case                      | Pros                            | Cons
AWS Lambda     | Event-driven ETL (small data) | Serverless, cost-effective      | Limited memory & execution time
AWS Glue       | Serverless ETL for big data   | Scalable, supports PySpark      | Higher cost for frequent jobs
Apache Airflow | Workflow orchestration        | Flexible, DAG-based scheduling  | Requires infrastructure management

9️⃣ How do you handle API rate limits & failures in data ingestion?

 Implement exponential backoff for retries.

 Use caching mechanisms for frequently requested data.

 Batch API requests instead of making multiple small ones.

 Use AWS Step Functions for handling failures in a workflow.

Example:

import time

import requests

def call_api_with_retry(url, max_retries=5):
    retries = 0
    while retries < max_retries:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:   # Too many requests: back off and retry
            time.sleep(2 ** retries)        # Exponential backoff
            retries += 1
        else:
            retries += 1                    # Count other failures too, to avoid an infinite loop
    return None

🔟 What security best practices do you follow for AWS & Snowflake?

 AWS:

o Use IAM roles & least privilege principle.

o Enable VPC, private endpoints & encryption (KMS, SSE-S3).


o Monitor access with CloudTrail & GuardDuty.

 Snowflake:

o Enable role-based access control (RBAC).

o Use network policies to restrict access.

o Implement column-level security & masking for PII data.
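
A minimal masking-policy sketch for Snowflake column-level security (hypothetical names; dynamic data masking requires Enterprise edition or higher):

-- Mask email addresses for all but privileged roles
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
        ELSE '***MASKED***'
    END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;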

🟣 HR Round 4 – Final Discussion & Salary Negotiation

Set 1 - Company & Culture Fit

1️⃣ What motivates you as a Data Engineer?


"I enjoy working with large-scale data and optimizing high-performance pipelines. Building
scalable and efficient data solutions excites me."

2️⃣ How do you handle tight deadlines & production failures?

 Prioritize critical fixes while keeping stakeholders informed.

 Use monitoring & alerting to detect issues early.

 Follow post-mortem analysis to prevent future failures.

3️⃣ Have you worked in cross-functional teams?


"Yes, I collaborate with data analysts, scientists, and product managers to ensure data
solutions meet business needs."

4️⃣ How do you keep up with new data engineering trends?

 Follow AWS & Snowflake blogs.

 Participate in LinkedIn & Medium communities.

 Take online courses and attend tech meetups.

5️⃣ Why should we hire you?


"I bring strong ETL, cloud, and big data expertise and can optimize large-scale data
pipelines while ensuring data quality & reliability."
Set 2 - Salary & Expectations

6️⃣ What are your salary expectations?


"I’m looking for a competitive offer based on my experience, but I’m open to discussion."

7️⃣ Are you open to contract vs full-time roles?


"I prefer a full-time role but open to contract if it aligns with my career goals."

8️⃣ Are you comfortable working in different time zones?


"Yes, as long as there's some flexibility in overlapping working hours."

9️⃣ What’s your preferred work model (Remote/Hybrid/Onsite)?


"I prefer hybrid but open to onsite depending on the role."

🔟 Do you have any questions for us?

 "Can you tell me about the team structure?"

 "What are the key challenges you’re currently facing?"

📌 Bonus: Hands-On Coding & Whiteboarding Practice

✅ SQL - Top 3 highest-selling products per month

SELECT month, product_id, total_sales
FROM (
    SELECT month, product_id, SUM(sales) AS total_sales,
           RANK() OVER (PARTITION BY month ORDER BY SUM(sales) DESC) AS rnk
    FROM sales_data
    GROUP BY month, product_id
)
WHERE rnk <= 3;

✅ PySpark - Convert JSON to Parquet

df = spark.read.json("s3://input-bucket/data.json")

df.write.parquet("s3://output-bucket/data.parquet")
✅ AWS Glue - Schema Evolution

 Enable schema detection in AWS Glue DynamicFrames.

 Use AWS Lake Formation for schema validation.


Here’s the full set of SQL, PySpark, and AWS Glue code examples for these questions:

✅ SQL Questions

1️⃣ Query to Find the Top 3 Highest-Selling Products Per Month


SELECT month, product_id, total_sales
FROM (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        product_id,
        SUM(sales) AS total_sales,
        RANK() OVER (PARTITION BY DATE_TRUNC('month', order_date) ORDER BY SUM(sales) DESC) AS rnk
    FROM sales_data
    GROUP BY month, product_id
)
WHERE rnk <= 3;

2️⃣ Implementing Window Functions in Snowflake


SELECT
    customer_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
    LAG(total_amount, 1, 0) OVER (PARTITION BY customer_id ORDER BY order_date) AS previous_order_amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rank
FROM orders;

 SUM() → Calculates running totals.

 LAG() → Fetches the previous order amount.

 ROW_NUMBER() → Assigns a sequential rank per customer.

3️⃣ Merge New Records into an Existing Snowflake Table


MERGE INTO customers target

USING new_customers source

ON target.customer_id = source.customer_id

WHEN MATCHED THEN

UPDATE SET target.name = source.name, target.city = source.city

WHEN NOT MATCHED THEN

INSERT (customer_id, name, city) VALUES (source.customer_id, source.name, source.city);

 Updates existing records.

 Inserts new records if no match is found.

✅ PySpark Questions

4️⃣ Convert a JSON File to Parquet using PySpark


from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("JSONtoParquet").getOrCreate()

df = spark.read.json("s3://input-bucket/data.json") # Load JSON

df.write.parquet("s3://output-bucket/data.parquet") # Save as Parquet

spark.stop()

5️⃣ Remove Duplicate Records from a PySpark DataFrame


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()

df = spark.read.csv("s3://input-bucket/data.csv", header=True)

df_no_duplicates = df.dropDuplicates() # Removes all duplicate rows

df_no_duplicates.show()

spark.stop()

6️⃣ How Broadcast Joins Improve Performance in PySpark


from pyspark.sql import SparkSession

from pyspark.sql.functions import broadcast


spark = SparkSession.builder.appName("BroadcastJoinExample").getOrCreate()

large_df = spark.read.parquet("s3://large-dataset.parquet")

small_df = spark.read.parquet("s3://small-dataset.parquet")

optimized_df = large_df.join(broadcast(small_df), "common_key")

optimized_df.show()

spark.stop()

 Broadcasting smaller tables avoids costly shuffles, improving performance significantly.

✅ AWS Glue Questions

7️⃣ Create an AWS Glue Job for Processing S3 Data


import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read JSON files from S3 into a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://input-bucket/data/"]},
    format="json"
)

# Transformation: keep only the required columns
transformed_df = datasource0.toDF().select("id", "name", "age")

# Convert back to a DynamicFrame before writing with Glue
transformed_dyf = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_dyf")

# Write the result as Parquet to the output bucket
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dyf,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/processed-data/"},
    format="parquet"
)

job.commit()

 Reads JSON files from S3.

 Transforms data.

 Writes output as Parquet to another S3 bucket.

8️⃣ Handle Schema Evolution in AWS Glue


from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read Parquet data from S3 into a DynamicFrame
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-bucket/"]},
    format="parquet"
)

# Resolve a newly added / ambiguous column by casting it to int
dynamic_frame = dynamic_frame.resolveChoice(
    specs=[("new_column", "cast:int")]
)

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/processed-data/"},
    format="parquet"
)

 Uses resolveChoice() to handle schema evolution.

 Casts columns to appropriate types dynamically.

9️⃣ Difference Between Glue DynamicFrame & DataFrame

Feature         | DynamicFrame                                | DataFrame
Schema Handling | Flexible schema (supports evolving schemas) | Fixed schema (Spark-based)
Transformations | Glue-specific transformations available     | Standard Spark transformations
Performance     | Slightly slower due to metadata handling    | Faster for large-scale operations
Usage           | Preferred for AWS Glue ETL jobs             | Preferred for optimized Spark workloads

Example: Converting a DynamicFrame to a DataFrame (and back)


from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read JSON data from S3 into a DynamicFrame
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-bucket/"]},
    format="json"
)

# Convert to a Spark DataFrame for standard Spark transformations
dataframe = dynamic_frame.toDF()

# Convert back to a DynamicFrame for Glue writers/transforms
dynamic_frame_new = DynamicFrame.fromDF(dataframe, glueContext, "dynamic_frame_new")
