Snowflake & AWS
2️⃣ Why are you interested in this role?
3️⃣ What do you know about our company?
4️⃣ What are your strengths and weaknesses?
5️⃣ Where do you see yourself in the next 3-5 years?

Set 2 - Work Experience & Role-Specific Questions
6️⃣ Can you walk me through your experience with Snowflake & AWS Glue?
7️⃣ What’s the most challenging ETL pipeline migration you’ve worked on?
8️⃣ How do you handle failures in AWS Glue & Snowflake pipelines?
9️⃣ Can you explain a situation where you improved pipeline performance?
🔟 How do you ensure data accuracy and consistency during migration?
💡 Pro Tip: Use the STAR Method (Situation, Task, Action, Result) to answer behavioral questions effectively.

🟠 Technical Round 2 – Data Engineering Concepts (ETL, Snowflake, SQL, PySpark, AWS Glue)
Set 1 - ETL & Pipeline Migration Questions
1️⃣ How do you approach migrating ETL pipelines from Oracle/MsSQL to Snowflake?
2️⃣ Explain the best practices for designing ETL workflows in AWS Glue.
3️⃣ What are the differences between Snowflake vs Redshift vs BigQuery for warehousing?
4️⃣ How do you handle incremental vs full data loads in Snowflake?
5️⃣ Explain how you optimize ETL jobs in AWS Glue & PySpark.
Set 2 - Snowflake Performance & Optimization
6️⃣ How do you optimize Snowflake queries for faster performance?
7️⃣ What’s the role of clustering & partitioning in Snowflake?
8️⃣ How do you handle large-scale data ingestion into Snowflake?
9️⃣ What are Transient vs Permanent Tables in Snowflake, and when to use them?
🔟 Explain Time Travel and Zero-Copy Cloning in Snowflake.
💡 Pro Tip: Be ready to write SQL queries & PySpark transformations on a whiteboard or shared screen.

🔵 Technical Round 3 – System Design & Cloud Infrastructure (AWS, APIs, Data Modeling)
Set 1 - Data Modeling & Warehouse Design
1️⃣ How do you design a Snowflake schema for an analytics use case?
2️⃣ What’s the difference between Star Schema & Snowflake Schema?
3️⃣ How do you model Salesforce & NetSuite data for analytics?
4️⃣ Explain fact vs dimension tables in a Snowflake data warehouse.
5️⃣ How do you handle slowly changing dimensions (SCD) in Snowflake?
Set 2 - Cloud Infrastructure & API Integration
6️⃣ How does AWS Glue integrate with Snowflake for ETL?
7️⃣ How do you replicate Salesforce/Workday data into Snowflake?
8️⃣ Explain AWS Lambda vs AWS Glue vs Airflow for data orchestration.
9️⃣ How do you handle API rate limits & failures in data ingestion?
🔟 What security best practices do you follow for AWS & Snowflake?
💡 Pro Tip: Expect system design whiteboarding & architecture discussions.

🟣 HR Round 4 – Final Discussion & Salary Negotiation
Set 1 - Company & Culture Fit Questions
1️⃣ What motivates you as a Data Engineer?
2️⃣ How do you handle tight deadlines & production failures?
3️⃣ Have you worked in cross-functional teams?
4️⃣ How do you keep up with new data engineering trends?
5️⃣ Why should we hire you?
Set 2 - Salary & Expectations
6️⃣ What are your salary expectations?
7️⃣ Are you open to contract vs full-time roles?
8️⃣ Are you comfortable working in different time zones?
9️⃣ What’s your preferred work model (Remote/Hybrid/Onsite)?
🔟 Do you have any questions for us?

📌 Bonus: Hands-On Coding & Whiteboarding Practice
✅ SQL Questions
🔹 Write a query to find the top 3 highest-selling products per month.
🔹 How do you implement window functions in Snowflake?
🔹 Write a query to merge new records into an existing Snowflake table.
✅ PySpark Questions
🔹 Convert a JSON file to Parquet using PySpark.
🔹 Write a PySpark script to remove duplicate records from a DataFrame.
🔹 Explain how broadcast joins improve performance in PySpark.
✅ AWS Glue Questions
🔹 How do you create a Glue job for processing S3 data?
🔹 How do you handle schema evolution in AWS Glue?
🔹 What’s the difference between Glue DynamicFrame & DataFrame?
"
📌 Sample Answers

4️⃣ What are your strengths and weaknesses?
Strengths:
🔹 Strong SQL and PySpark skills for data transformation & optimization
Weakness:
🔹 "I tend to focus too much on details when optimizing queries. However, I’m learning to balance performance with project timelines by prioritizing optimizations that have the most impact."
6️⃣ Can you walk me through your experience with Snowflake & AWS Glue?
"I've designed and optimized ETL pipelines using AWS Glue for transforming and loading
data into Snowflake. I use PySpark within Glue for data transformations, schema evolution,
and incremental loading. In Snowflake, I optimize queries using clustering, partitioning, and
caching techniques."
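For the clustering point above, a minimal Snowflake sketch (the sales table and its columns are illustrative, not from a specific project):

-- Define a clustering key on the columns large queries filter by most often
ALTER TABLE sales CLUSTER BY (order_date, region);

-- Check how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales', '(order_date, region)');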
7️⃣ What’s the most challenging ETL pipeline migration you’ve worked on?
(STAR Method Example)
Action: Used AWS Glue for ETL processing, optimized data partitioning, and
implemented incremental loading using Snowflake Streams & Tasks.
Result: Reduced processing time by 60% and improved query performance for
analytics.
8️⃣ How do you handle failures in AWS Glue & Snowflake pipelines?
9️⃣ Can you explain a situation where you improved pipeline performance?
"I optimized an ETL pipeline by using PySpark broadcast joins to speed up small-to-large
table joins, reducing runtime from 3 hours to 45 minutes."
🔹 Using Snowflake Streams & Tasks for CDC (Change Data Capture)
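A minimal sketch of that Streams & Tasks pattern, assuming a hypothetical raw_orders landing table feeding an orders target:

-- Capture changes landing in the staging table
CREATE OR REPLACE STREAM raw_orders_stream ON TABLE raw_orders;

-- Periodically merge only the changed rows into the target table
CREATE OR REPLACE TASK load_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
AS
  MERGE INTO orders t
  USING raw_orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);

ALTER TASK load_orders RESUME;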
1️⃣ How do you approach migrating ETL pipelines from Oracle/MsSQL to Snowflake?
3️⃣ What are the differences between Snowflake vs Redshift vs BigQuery for warehousing?
Snowflake: Multi-cloud, separates storage from compute, and scales virtual warehouses with minimal maintenance.
Redshift: AWS-native and tightly integrated with the AWS ecosystem, but clusters need more capacity planning and tuning.
BigQuery: Serverless with automatic scaling, ideal for Google Cloud users.
1️⃣ How do you design a Snowflake schema for an analytics use case?
3️⃣ How do you model Salesforce & NetSuite data for analytics?
🔹 Transform data to match analytical needs (flatten JSON structures, join related tables).
🔹 Create fact and dimension tables (e.g., Sales as fact, Customers as dimension).
🔹 Optimize with clustering on frequently queried fields (e.g., Date, Customer ID).
4️⃣ Explain fact vs dimension tables in a Snowflake data warehouse.
🔹 Fact tables store measurable business events (e.g., sales amounts and quantities).
🔹 Dimension tables contain descriptive attributes for slicing and dicing data.
Example: a Sales fact table joined to Customer and Date dimensions, as in the sketch below.
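A minimal Snowflake DDL sketch of that star-schema layout (table and column names are illustrative):

-- Dimension tables hold descriptive attributes
CREATE TABLE dim_customer (
    customer_id   NUMBER,
    customer_name STRING,
    region        STRING
);

CREATE TABLE dim_date (
    date_id DATE,
    year    NUMBER,
    month   NUMBER
);

-- Fact table holds measurable events plus keys to the dimensions
CREATE TABLE fact_sales (
    sale_id     NUMBER,
    customer_id NUMBER,
    date_id     DATE,
    amount      NUMBER(12,2)
)
CLUSTER BY (date_id, customer_id);  -- clustering on frequently queried fields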
5️⃣ How do you handle slowly changing dimensions (SCD) in Snowflake?
🔹 SCD Type 1 (Overwrite): Update the attribute in place; no history is kept.
🔹 SCD Type 2 (Full History): Insert a new row per change, tracked with effective dates or a current-record flag.
🔹 SCD Type 3 (Limited History): Store only the previous value in a separate column, updated from a staging table as sketched below.
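A minimal SCD Type 3 sketch in Snowflake SQL, assuming a customers dimension and a staging_customers load table (the address columns are hypothetical):

-- Keep only the prior value in a dedicated column, then overwrite the current one
UPDATE customers
SET previous_address = customers.current_address,
    current_address  = s.address
FROM staging_customers s
WHERE customers.customer_id = s.customer_id
  AND customers.current_address <> s.address;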
6️⃣ How does AWS Glue integrate with Snowflake for ETL?
🔹 Glue jobs read from source systems, transform data with PySpark, and load it into Snowflake via the Snowflake Spark/JDBC connector.
🔹 Use Streams & Tasks to track changes and implement incremental loads.
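Another common integration pattern: the Glue job writes Parquet to S3 and Snowflake ingests it through an external stage. A minimal sketch, reusing the document's S3 output path (the stage name, target table, and storage integration s3_int are hypothetical):

-- External stage pointing at the Glue job's S3 output path
CREATE OR REPLACE STAGE processed_stage
  URL = 's3://output-bucket/processed-data/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = (TYPE = PARQUET);

-- Bulk-load the Parquet files into a target table
COPY INTO orders
  FROM @processed_stage
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;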
8️⃣ Explain AWS Lambda vs AWS Glue vs Airflow for data orchestration.
Service | Use Case | Pros | Cons
AWS Lambda | Lightweight, event-driven tasks | Serverless, scales instantly | 15-minute runtime limit, not suited to heavy ETL
AWS Glue | Serverless ETL for big data | Scalable, supports PySpark | Higher cost for frequent jobs
Airflow | Workflow orchestration & scheduling | Flexible DAGs, rich integrations | Infrastructure to manage (unless using MWAA)
9️⃣ How do you handle API rate limits & failures in data ingestion?
Example (retry with exponential backoff):

import time
import requests

def fetch_with_retry(url, max_retries=5):
    retries = 0
    while retries < max_retries:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
        # Back off before retrying (handles HTTP 429 rate limiting and transient failures)
        time.sleep(2 ** retries)
        retries += 1
    return None
🔟 What security best practices do you follow for AWS & Snowflake?
AWS: least-privilege IAM roles, encryption at rest (KMS) and in transit, credentials stored in AWS Secrets Manager, and private networking (VPC endpoints) for data services.
Snowflake: role-based access control (RBAC), network policies to restrict client IPs, MFA/SSO for users, and masking policies for sensitive columns.
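A minimal Snowflake sketch of the RBAC and network-policy side (role, warehouse, database, and IP range are hypothetical):

-- Read-only analyst role granted only what it needs
CREATE ROLE IF NOT EXISTS analyst_ro;
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_ro;
GRANT USAGE ON DATABASE analytics_db TO ROLE analyst_ro;
GRANT USAGE ON SCHEMA analytics_db.reporting TO ROLE analyst_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.reporting TO ROLE analyst_ro;

-- Restrict which client IP ranges can connect
CREATE NETWORK POLICY office_only ALLOWED_IP_LIST = ('203.0.113.0/24');
ALTER ACCOUNT SET NETWORK_POLICY = office_only;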
Here’s the full set of SQL, PySpark, and AWS Glue code samples for the bonus questions above:
✅ SQL Questions
🔹 Top 3 highest-selling products per month
SELECT month, product_id, total_sales
FROM (
    SELECT
        DATE_TRUNC('month', order_date) AS month,
        product_id,
        SUM(sales) AS total_sales,
        RANK() OVER (PARTITION BY DATE_TRUNC('month', order_date)
                     ORDER BY SUM(sales) DESC) AS sales_rank
    FROM sales_data
    GROUP BY 1, 2
) ranked
WHERE sales_rank <= 3;
🔹 Window functions in Snowflake (running total per customer)
SELECT
    customer_id,
    order_date,
    total_amount,
    SUM(total_amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS running_total
FROM orders;
🔹 Merge new records into an existing Snowflake table
-- customer_id comes from the original snippet; email is an example attribute column
MERGE INTO customers AS target
USING staging_customers AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET target.email = source.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (source.customer_id, source.email);
✅ PySpark Questions
🔹 Convert a JSON file to Parquet using PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JsonToParquet").getOrCreate()
df = spark.read.json("s3://input-bucket/data.json")
df.write.parquet("s3://output-bucket/data.parquet")
spark.stop()
🔹 Remove duplicate records from a DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RemoveDuplicates").getOrCreate()
df = spark.read.csv("s3://input-bucket/data.csv", header=True)
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()
spark.stop()
🔹 Broadcast join to speed up small-to-large table joins
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("BroadcastJoin").getOrCreate()
large_df = spark.read.parquet("s3://large-dataset.parquet")
small_df = spark.read.parquet("s3://small-dataset.parquet")
# Ship the small table to every executor so the join avoids a shuffle ("id" is an example join key)
optimized_df = large_df.join(broadcast(small_df), on="id", how="inner")
optimized_df.show()
spark.stop()
✅ AWS Glue Questions
🔹 Glue job for processing S3 data (read JSON, transform, write Parquet)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read raw JSON data from S3 as a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://input-bucket/data/"]},
    format="json"
)

# Transform the data (example transform: drop null fields)
transformed_data = DropNullFields.apply(frame=datasource0)

# Write the processed data back to S3 in Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=transformed_data,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/processed-data/"},
    format="parquet"
)

job.commit()
🔹 Handling schema evolution in AWS Glue (resolveChoice)
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read Parquet data whose schema may drift over time
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-bucket/"]},
    format="parquet"
)

# Resolve new/ambiguous columns by casting them to a known type
dynamic_frame = dynamic_frame.resolveChoice(
    specs=[("new_column", "cast:int")]
)

# Write the schema-aligned data back to S3
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://output-bucket/processed-data/"},
    format="parquet"
)
🔹 Glue DynamicFrame vs DataFrame
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# DynamicFrame: Glue-native structure with schema flexibility and built-in Glue transforms
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data-bucket/"]},
    format="json"
)

# Convert to a Spark DataFrame to use standard Spark SQL operations
dataframe = dynamic_frame.toDF()