
Project Documentation: Azure-Based Data Engineering Pipeline

1. Project Overview

Summary:
This project demonstrates an end-to-end data engineering pipeline built on Azure services that ingests, cleans, processes, stores, and visualizes IPL data, taking it from raw CSVs to insightful Power BI dashboards.

Objective:
To build a scalable, automated, and efficient data pipeline that:

 Ingests raw CSV files into Azure Blob Storage.

 Transforms and cleans data using Azure Databricks (PySpark).

 Stores data at multiple stages (Bronze, Silver, Gold) in Azure Data Lake Storage Gen2.

 Loads data into Azure SQL Database for querying.

 Creates a final Power BI dashboard for analytics and KPIs.

Technologies Used:

 Azure Blob Storage

 Azure Data Lake Storage Gen2 (ADLS)

 Azure Databricks (PySpark)

 Azure SQL Database

 Azure Data Factory (ADF)

 Power BI

2. Architecture Diagram

Summary:
The pipeline consists of multiple stages connected via Azure services, each performing a specific task from raw ingestion to advanced analytics. In outline: raw CSVs land in Blob Storage, Databricks notebooks build the Bronze, Silver, and Gold layers in ADLS Gen2, curated tables are loaded into Azure SQL Database, and Power BI sits on top, with Azure Data Factory orchestrating the notebook runs.

3. Data Ingestion & Storage

Summary:
Set up cloud infrastructure to store raw and processed data in an organized manner.

 Resource Group created in Azure.

 Blob Storage:

o Container: raw

o Stores original CSV files.

 Azure Data Lake Gen2 (ADLS):

o Containers: bronze, silver, gold

 CSV Files Ingested (an upload sketch follows this list):

o player.csv

o match.csv

o stadium.csv

o player_match.csv

o team.csv

o player_team.csv
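
A minimal sketch of pushing these files into the raw container with the azure-storage-blob Python SDK; the connection string is a placeholder, and the local file names are assumed to match the blob names:

from azure.storage.blob import BlobServiceClient

# Connect to the storage account (connection string is a placeholder)
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
raw = service.get_container_client("raw")

# Upload each source CSV into the raw container
for name in ["player.csv", "match.csv", "stadium.csv",
             "player_match.csv", "team.csv", "player_team.csv"]:
    with open(name, "rb") as f:
        raw.upload_blob(name=name, data=f, overwrite=True)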

4. Data Processing with Databricks

Summary:
Used three Databricks notebooks to transform and process data through different layers
(Bronze, Silver, Gold).

Notebook 1: Raw to Bronze

 Mounted Blob storage to Databricks.

 Read all CSVs using Spark.

 Added audit columns: ingestion_time, source_file.

 Converted files to Parquet format.

 Wrote the converted Parquet files to the bronze container (see the sketch below).
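
A minimal PySpark sketch of this notebook, assuming the spark and dbutils objects provided by the Databricks runtime, placeholder storage credentials, and a /mnt/bronze mount to the ADLS container that already exists:

# Mount the raw Blob container (storage account and key are placeholders)
dbutils.fs.mount(
    source="wasbs://raw@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={"fs.azure.account.key.<storage-account>.blob.core.windows.net": "<account-key>"})

from pyspark.sql import functions as F

# Read each CSV, stamp the audit columns, and write Parquet to the bronze layer
for name in ["player", "match", "stadium", "player_match", "team", "player_team"]:
    df = (spark.read.option("header", True).option("inferSchema", True)
          .csv(f"/mnt/raw/{name}.csv")
          .withColumn("ingestion_time", F.current_timestamp())
          .withColumn("source_file", F.lit(f"{name}.csv")))
    df.write.mode("overwrite").parquet(f"/mnt/bronze/{name}")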


Notebook 2: Bronze to Silver

 Mounted and read Parquet files from bronze.

 Data cleaning operations:

o Drop nulls.

o Rename columns.

 Performed joins to combine datasets into a unified master table.

 Wrote the cleaned data to the silver container (see the sketch below).
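
A sketch of the cleaning and join logic; the join keys and column names (player_id, match_id, player_name) are assumptions, not confirmed schema:

# Read bronze Parquet, drop nulls, and standardize column names
player = (spark.read.parquet("/mnt/bronze/player")
          .dropna()
          .withColumnRenamed("player_name", "name"))  # rename example; column is assumed
match = spark.read.parquet("/mnt/bronze/match").dropna()
player_match = spark.read.parquet("/mnt/bronze/player_match").dropna()

# Join the datasets into a unified master table
master = (player_match
          .join(player, "player_id", "inner")
          .join(match, "match_id", "inner"))

# Persist cleaned tables and the master table to the silver layer
player.write.mode("overwrite").parquet("/mnt/silver/player_cleaned")
match.write.mode("overwrite").parquet("/mnt/silver/match_cleaned")
master.write.mode("overwrite").parquet("/mnt/silver/master")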

Notebook 3: Silver to Gold

 Read cleaned data from silver.

 Created Temp Views in Spark.

 Performed SQL queries to generate insights:

o Total Wins

o Player Stats

o Venue Analysis

 Stored the analytical tables in the gold container (see the sketch below).
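
A sketch of the Temp View and SQL approach, using the Total Wins insight as the example; the match_winner column is an assumption:

# Read cleaned match data from the silver layer and expose it to Spark SQL
match = spark.read.parquet("/mnt/silver/match_cleaned")
match.createOrReplaceTempView("match")

# Example insight: total wins per team
team_wins = spark.sql("""
    SELECT match_winner AS team, COUNT(*) AS total_wins
    FROM match
    WHERE match_winner IS NOT NULL
    GROUP BY match_winner
    ORDER BY total_wins DESC
""")

team_wins.write.mode("overwrite").parquet("/mnt/gold/team_performance_metrics")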

5. Automation with Azure Data Factory (ADF)

Summary:
Orchestrated the workflow with an ADF pipeline that triggers the three Databricks notebooks sequentially.

 Created one ADF pipeline with 3 notebook activities:

1. Raw → Bronze (Notebook 1)

2. Bronze → Silver (Notebook 2)

3. Silver → Gold (Notebook 3)

 Connected the Databricks workspace to ADF via a linked service and chained the notebook activities.

 Achieved full end-to-end automation (a sketch of the pipeline definition follows).
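
Roughly how the pipeline looks in ADF's JSON authoring view; the activity, notebook, and linked-service names here are placeholders. The dependsOn entries with the Succeeded condition are what enforce the sequential Raw → Bronze → Silver → Gold order:

{
  "name": "ipl_etl_pipeline",
  "properties": {
    "activities": [
      { "name": "RawToBronze", "type": "DatabricksNotebook",
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/raw_to_bronze" } },
      { "name": "BronzeToSilver", "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "RawToBronze", "dependencyConditions": ["Succeeded"] } ],
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/bronze_to_silver" } },
      { "name": "SilverToGold", "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "BronzeToSilver", "dependencyConditions": ["Succeeded"] } ],
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/silver_to_gold" } }
    ]
  }
}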


6. Azure SQL Database Integration

Summary:
Used JDBC connections to transfer data from Databricks into Azure SQL Database for centralized storage and Power BI access (a write sketch follows the table lists below).

 Created two schemas:

o silver_db – Stores cleaned tables

o gold_db – Stores analytical/aggregated tables

Total Tables:

Silver DB:

 player_cleaned

 match_cleaned

 player_match_cleaned

 team_cleaned

 stadium_cleaned

 player_team_cleaned

Gold DB (Analytical Tables):

 team_performance_metrics

 player_contribution

 venue_analysis

 player_efficiency_metrics

 match_summary_insights
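
A minimal write from Databricks to Azure SQL using Spark's built-in JDBC support; the server, database, and credential values are placeholders, and the same pattern applies to each silver_db and gold_db table:

# JDBC connection details (all values are placeholders)
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
props = {
    "user": "<sql-user>",
    "password": "<sql-password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Push a cleaned table into the silver_db schema
cleaned = spark.read.parquet("/mnt/silver/player_cleaned")
cleaned.write.jdbc(url=jdbc_url, table="silver_db.player_cleaned",
                   mode="overwrite", properties=props)
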
7. Power BI Dashboard

Summary:
Connected Power BI to Azure SQL Database to visualize insights, performance, and key metrics
of the IPL dataset.

 Connection: Azure SQL (Gold DB tables)

 KPIs Created:

o Orange Cap (Most Runs)

o Purple Cap (Most Wickets)

 Reports & Visuals:

o Team-wise Performance Metrics

o Top Players by Runs & Wickets

o Home vs Away Analysis

o Average Strike Rate by Player

o Match Results Summary


8. Challenges & Learnings

Summary:
Real-world implementation involved handling multiple datasets, formats, and orchestrations.

Challenges Faced:

 Small Dataset
The IPL dataset was small, so it may not fully capture the complexities of large-scale, real-world sports analytics projects.

 Local Environment Setup
Setting up Power BI and SQL Server locally required careful attention to compatibility, especially with JDBC connections and port configurations.

 Data Quality Issues
The raw IPL files had missing or inconsistent entries, especially in player statistics like runs and wickets, which needed thorough data cleansing to ensure reliable analysis.

 Inconsistent File Schemas
Different CSV files had varying schema definitions, which made it necessary to perform schema alignment and column standardization during the ingestion and transformation stages.

Key Learnings:

 Real-time ingestion and transformation

 PySpark optimizations and SQL querying

 Use of layered storage for scalability

 Establishing JDBC connections between Azure Databricks and Azure SQL Database for reading and writing data

 Automating multi-step ETL processes with ADF pipelines

 Power BI basics: connecting to Azure SQL and building KPI visuals

9. Conclusion

Summary:
The project successfully showcases how cloud-native tools can be combined to create a
powerful, scalable, and automated data pipeline with meaningful analytics.

 All stages of the data engineering lifecycle were completed.

 Automation achieved using ADF + Databricks.
