First round
1. Introduction
2. Product table:
1. It has columns Product_id, Department, Date and Sales_Amount. Find yesterday's sales + today's sales for each product in each department (department-wise sales amount, today's + yesterday's sales, per product).
2. Rank the products within each Department ordered by Sales_Amount, and write the same query in PySpark. (A sketch for both parts follows this item.)
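A minimal PySpark sketch for both parts, assuming the data is registered as a table named product (hypothetical name):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    product_df = spark.table("product")  # hypothetical table name

    # Part 1: today's + yesterday's sales per product per department.
    # lag() pulls the previous day's Sales_Amount within each
    # (Department, Product_id) partition ordered by Date.
    day_win = Window.partitionBy("Department", "Product_id").orderBy("Date")
    two_day = product_df.withColumn(
        "two_day_sales",
        F.col("Sales_Amount") + F.coalesce(F.lag("Sales_Amount").over(day_win), F.lit(0)),
    )

    # Part 2: rank products within each department by Sales_Amount (highest first).
    rank_win = Window.partitionBy("Department").orderBy(F.col("Sales_Amount").desc())
    ranked = product_df.withColumn("sales_rank", F.rank().over(rank_win))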
3. Lakehouse architecture
4. Unity Catalog (released five months back)
5. Consider you have one CSV file at an ADLS location. You are reading that CSV file, but it contains some bad records; how do you handle the bad records? (A sketch follows this item.)
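One common approach, sketched with a hypothetical path and schema: read in PERMISSIVE mode and route unparseable rows into a corrupt-record column (on Databricks, the badRecordsPath option is an alternative).

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
        StructField("_corrupt_record", StringType()),  # bad rows land here
    ])

    df = (spark.read
          .schema(schema)
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt_record")
          .csv("abfss://container@account.dfs.core.windows.net/path/file.csv"))

    df.cache()  # avoids the restriction on querying only the corrupt-record column
    bad_rows = df.filter(F.col("_corrupt_record").isNotNull())
    good_rows = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")

    # Databricks-only alternative: dump bad rows to a side location instead.
    # spark.read.option("badRecordsPath", "abfss://.../badRecords").csv(...)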
6. You are reading a file with inferSchema enabled, then filtering, then grouping, and finally calling collect. Can you tell how many jobs and how many stages are created?
spark.read.csv("path of file", inferSchema=True).filter(some_condition).groupBy("column").count().collect()
(Note: groupBy returns GroupedData, so an aggregation such as count() is required before collect().)
7. Wide transformation and narrow transformation
8. Optimization techniques to reduce data shuffling in a single DataFrame
9. Difference between cache and persist
10. Storage levels available with persist
11. Difference between serialized and deserialized in-memory storage
12. What does memory serialization mean in the persist command? (A sketch covering questions 9-12 follows.)
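A minimal sketch of cache vs. persist and storage levels, using a toy DataFrame and only levels available in PySpark:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # cache() is shorthand for persist() with the default level
    # (MEMORY_AND_DISK for DataFrames).
    df.cache()
    df.unpersist()  # the level cannot be changed while one is assigned

    # persist() lets you choose the level explicitly:
    # MEMORY_ONLY     - keep partitions in memory, recompute what does not fit
    # MEMORY_AND_DISK - spill partitions that do not fit in memory to disk
    # DISK_ONLY       - keep partitions on disk only
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # On serialization: in Scala/Java, the *_SER levels store partitions as
    # serialized bytes (more compact, but CPU-heavier to read back), while
    # plain levels store deserialized objects. In PySpark, data is always
    # serialized via pickle, so the serialized/deserialized distinction
    # matters less.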
13. Why is coalesce more efficient than repartition? (See the sketch below.)
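A small illustration of the difference:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    # repartition(n) performs a full shuffle and can scale the partition
    # count up or down; every row may move between executors.
    wide = df.repartition(200)

    # coalesce(n) merges existing partitions without a full shuffle, so it
    # is the cheaper choice when only reducing the partition count.
    narrow = wide.coalesce(10)

    print(wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())  # 200 10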
14. Optimization techniques on a Delta table? (See the sketch below.)
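Two standard Delta maintenance commands, sketched against a hypothetical table name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # OPTIMIZE compacts small files; ZORDER BY co-locates rows on a
    # frequently filtered column to reduce the files scanned.
    spark.sql("OPTIMIZE sales_delta ZORDER BY (Department)")

    # VACUUM removes files no longer referenced by the table
    # (default retention: 7 days).
    spark.sql("VACUUM sales_delta")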
15. Suppose you have a table that is updated many times in a week. Can I get Monday's copy of this data, i.e. what my data looked like on Monday? Write a simple command for the above scenario. (A sketch follows.)
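Delta time travel covers this; the table name and timestamp below are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table as it was at a point in time...
    monday_df = spark.sql(
        "SELECT * FROM sales_delta TIMESTAMP AS OF '2024-01-01T00:00:00'"
    )
    # ...or as of a specific version from DESCRIBE HISTORY:
    # spark.sql("SELECT * FROM sales_delta VERSION AS OF 5")
    # Path-based equivalent:
    # spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/mnt/delta/sales")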
16. When we create a Delta table, a folder named _delta_log is created in ADLS. Inside the delta log there are files; what do those files refer to?
17. Inside the delta log there are two kinds of files, one .crc and one .json. What does each file specify? Which one is the log file, and what is the .crc file?
18. What is Medallion architecture?
19. Let's assume an SAP database resides on your virtual machine. You want to connect it to ADLS; which type of Integration Runtime (IR) is needed?
20. Suppose you are fetching data from the SAP database, you get a memory error, and your pipeline fails. How can you handle this error?
21. Can you give a brief introduction about yourself?
22. What are the challenges you faced in your project?
23. Did you face any metadata issues?
24. Consider a company table having different columns such as company, sales, etc. Write a query to get company-wise total sales.
25. The same query in PySpark? (A sketch of both follows.)
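A minimal sketch of both versions, assuming the table is registered as company (hypothetical name):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # SQL version (question 24):
    totals_sql = spark.sql(
        "SELECT company, SUM(sales) AS total_sales FROM company GROUP BY company"
    )

    # DataFrame version (question 25):
    totals_df = (spark.table("company")
                 .groupBy("company")
                 .agg(F.sum("sales").alias("total_sales")))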
26. What is the difference between TRUNCATE, DELETE and DROP?
27. How can you handle triggers in SQL?
28. Consider 10 tables of data. How can you move those 10 tables of data from on-premises to the cloud?
29. Integration Runtime
30. How can you handle email notifications for any success or failure of a pipeline?
31. Explain the flow for handling the email notification, and which type of activity did you use to handle it?
32. How can you handle an incremental load of data from the source? (A watermark-style sketch follows.)
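One common pattern is a high-watermark column; the table and column names below are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Highest last_modified value already loaded into the target.
    last_wm = (spark.table("target_orders")
               .agg(F.max("last_modified").alias("wm"))
               .first()["wm"])

    incremental = spark.table("source_orders")
    if last_wm is not None:
        # Keep only rows newer than the watermark.
        incremental = incremental.filter(F.col("last_modified") > F.lit(last_wm))

    # Append just the new/changed rows to the target Delta table.
    incremental.write.format("delta").mode("append").saveAsTable("target_orders")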
33. How can you create a mount to ADLS? (A sketch follows.)
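A Databricks notebook sketch using a service principal; every ID, secret scope and name below is a hypothetical placeholder:

    # dbutils is available inside Databricks notebooks.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="my-scope", key="sp-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://container@storageaccount.dfs.core.windows.net/",
        mount_point="/mnt/adls",
        extra_configs=configs,
    )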
34. Normal query vs. stored procedure
35. What types of distribution methods / stored procedures have you used in Synapse?
36. What are facts and dimensions?
37. Have you handled fact and dimension tables?
38. Can we join 2 fact tables?
39. Star schema vs. snowflake schema
40. How do you handle data validation?
41. Which data validation techniques did you use in your project?
42. Any questions from your end?

Second round
1. We have 50 tables in an on-premises DB and want to copy them into Azure Blob Storage. How many pipelines do we need for that scenario?
2. Suppose that out of the 50 tables, 10 have finished copying. How can you copy the remaining tables after those 10 complete?
3. Delta table vs. Parquet table: similarities and differences
4. How do you call a child notebook from a parent notebook, and how can you pass parameters to the child notebook? (A sketch follows.)
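A Databricks sketch; the notebook path and parameter names are hypothetical:

    # Parent notebook: dbutils.notebook.run executes the child and blocks
    # until it returns the value passed to dbutils.notebook.exit.
    result = dbutils.notebook.run(
        "/Workspace/child_notebook",  # path to the child notebook
        600,                          # timeout in seconds
        {"load_date": "2024-01-01"},  # parameters passed to the child
    )

    # Child notebook side:
    # dbutils.widgets.text("load_date", "")          # declare the widget
    # load_date = dbutils.widgets.get("load_date")   # read the passed value
    # dbutils.notebook.exit("done " + load_date)     # return a value to the parent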
5. Optimization techniques you used in Databricks
6. Driver node and executor node details; the purpose of each node
7. Does a worker node contain multiple executors?
8. Explain SCD (Slowly Changing Dimensions)
9. How do you handle incremental data?
10. Deep dive into Databricks
