
Project Documentation: Azure-Based Data Engineering Pipeline

1. Project Overview

Summary:
This project demonstrates an end-to-end data engineering pipeline built on Azure services that ingests, cleans, processes, stores, and visualizes IPL data, taking it from raw CSVs to insightful Power BI dashboards.

Objective:
To build a scalable, automated, and efficient data pipeline that:

 Ingests raw CSV files into Azure Blob Storage.

 Transforms and cleans data using Azure Databricks (PySpark).

 Stores data at multiple stages (Bronze, Silver, Gold) in Azure Data Lake Storage Gen2.

 Loads data into Azure SQL Database for querying.

 Creates a final Power BI dashboard for analytics and KPIs.

Technologies Used:

 Azure Blob Storage

 Azure Data Lake Storage Gen2 (ADLS)

 Azure Databricks (PySpark)

 Azure SQL Database

 Azure Data Factory (ADF)

 Power BI

2. Architecture Diagram

Summary:
The pipeline consists of multiple stages connected via Azure services, each performing a specific task from raw ingestion to advanced analytics. In outline: raw CSVs land in Blob Storage, Databricks notebooks build the Bronze, Silver, and Gold layers in ADLS Gen2, curated tables are loaded into Azure SQL Database, and Power BI sits on top, with Azure Data Factory orchestrating the notebook runs.

3. Data Ingestion & Storage

Summary:
Set up cloud infrastructure to store raw and processed data in an organized manner.

 Resource Group created in Azure.

 Blob Storage:

o Container: raw

o Stores original CSV files.

 Azure Data Lake Gen2 (ADLS):

o Containers: bronze, silver, gold

 CSV Files Ingested (an upload sketch follows this list):

o player.csv

o match.csv

o stadium.csv

o player_match.csv

o team.csv

o player_team.csv
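
A minimal sketch of pushing these files into the raw container with the azure-storage-blob Python SDK; the connection string is a placeholder, and the local file names are assumed to match the blob names:

from azure.storage.blob import BlobServiceClient

# Connect to the storage account (connection string is a placeholder)
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
raw = service.get_container_client("raw")

# Upload each source CSV into the raw container
for name in ["player.csv", "match.csv", "stadium.csv",
             "player_match.csv", "team.csv", "player_team.csv"]:
    with open(name, "rb") as f:
        raw.upload_blob(name=name, data=f, overwrite=True)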

4. Data Processing with Databricks

Summary:
Used three Databricks notebooks to transform and process data through different layers
(Bronze, Silver, Gold).

Notebook 1: Raw to Bronze

 Mounted Blob storage to Databricks.

 Read all CSVs using Spark.

 Added audit columns: ingestion_time, source_file.

 Converted files to Parquet format.

 Wrote the converted Parquet files to the bronze container (see the sketch below).
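
A minimal PySpark sketch of this notebook, assuming the spark and dbutils objects provided by the Databricks runtime, placeholder storage credentials, and a /mnt/bronze mount to the ADLS container that already exists:

# Mount the raw Blob container (storage account and key are placeholders)
dbutils.fs.mount(
    source="wasbs://raw@<storage-account>.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={"fs.azure.account.key.<storage-account>.blob.core.windows.net": "<account-key>"})

from pyspark.sql import functions as F

# Read each CSV, stamp the audit columns, and write Parquet to the bronze layer
for name in ["player", "match", "stadium", "player_match", "team", "player_team"]:
    df = (spark.read.option("header", True).option("inferSchema", True)
          .csv(f"/mnt/raw/{name}.csv")
          .withColumn("ingestion_time", F.current_timestamp())
          .withColumn("source_file", F.lit(f"{name}.csv")))
    df.write.mode("overwrite").parquet(f"/mnt/bronze/{name}")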


Notebook 2: Bronze to Silver

 Mounted and read Parquet files from bronze.

 Data cleaning operations:

o Drop nulls.

o Rename columns.

 Performed joins to combine datasets into a unified master table.

 Wrote the cleaned data to the silver container (see the sketch below).
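
A sketch of the cleaning and join logic; the join keys and column names (player_id, match_id, player_name) are assumptions, not confirmed schema:

# Read bronze Parquet, drop nulls, and standardize column names
player = (spark.read.parquet("/mnt/bronze/player")
          .dropna()
          .withColumnRenamed("player_name", "name"))  # rename example; column is assumed
match = spark.read.parquet("/mnt/bronze/match").dropna()
player_match = spark.read.parquet("/mnt/bronze/player_match").dropna()

# Join the datasets into a unified master table
master = (player_match
          .join(player, "player_id", "inner")
          .join(match, "match_id", "inner"))

# Persist cleaned tables and the master table to the silver layer
player.write.mode("overwrite").parquet("/mnt/silver/player_cleaned")
match.write.mode("overwrite").parquet("/mnt/silver/match_cleaned")
master.write.mode("overwrite").parquet("/mnt/silver/master")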

Notebook 3: Silver to Gold

 Read cleaned data from silver.

 Created Temp Views in Spark.

 Performed SQL queries to generate insights:

o Total Wins

o Player Stats

o Venue Analysis

 Stored the analytical tables in the gold container (see the sketch below).
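
A sketch of the Temp View and SQL approach, using the Total Wins insight as the example; the match_winner column is an assumption:

# Read cleaned match data from the silver layer and expose it to Spark SQL
match = spark.read.parquet("/mnt/silver/match_cleaned")
match.createOrReplaceTempView("match")

# Example insight: total wins per team
team_wins = spark.sql("""
    SELECT match_winner AS team, COUNT(*) AS total_wins
    FROM match
    WHERE match_winner IS NOT NULL
    GROUP BY match_winner
    ORDER BY total_wins DESC
""")

team_wins.write.mode("overwrite").parquet("/mnt/gold/team_performance_metrics")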

5. Automation with Azure Data Factory (ADF)

Summary:
Orchestrated the workflow with an ADF pipeline that triggers the three Databricks notebooks sequentially.

 Created one ADF pipeline with 3 notebook activities:

1. Raw → Bronze (Notebook 1)

2. Bronze → Silver (Notebook 2)

3. Silver → Gold (Notebook 3)

 Connected the Databricks workspace to ADF via a linked service and chained the notebook activities.

 Achieved full end-to-end automation (a sketch of the pipeline definition follows).
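
Roughly how the pipeline looks in ADF's JSON authoring view; the activity, notebook, and linked-service names here are placeholders. The dependsOn entries with the Succeeded condition are what enforce the sequential Raw → Bronze → Silver → Gold order:

{
  "name": "ipl_etl_pipeline",
  "properties": {
    "activities": [
      { "name": "RawToBronze", "type": "DatabricksNotebook",
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/raw_to_bronze" } },
      { "name": "BronzeToSilver", "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "RawToBronze", "dependencyConditions": ["Succeeded"] } ],
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/bronze_to_silver" } },
      { "name": "SilverToGold", "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "BronzeToSilver", "dependencyConditions": ["Succeeded"] } ],
        "linkedServiceName": { "referenceName": "DatabricksLS", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/notebooks/silver_to_gold" } }
    ]
  }
}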


6. Azure SQL Database Integration

Summary:
Used JDBC connections to transfer data from Databricks into Azure SQL Database for centralized storage and Power BI access (a write sketch follows the table lists below).

 Created two schemas:

o silver_db – Stores cleaned tables

o gold_db – Stores analytical/aggregated tables

Total Tables:

Silver DB:

 player_cleaned

 match_cleaned

 player_match_cleaned

 team_cleaned

 stadium_cleaned

 player_team_cleaned

Gold DB (Analytical Tables):

 team_performance_metrics

 player_contribution

 venue_analysis

 player_efficiency_metrics

 match_summary_insights
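
A minimal write from Databricks to Azure SQL using Spark's built-in JDBC support; the server, database, and credential values are placeholders, and the same pattern applies to each silver_db and gold_db table:

# JDBC connection details (all values are placeholders)
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
props = {
    "user": "<sql-user>",
    "password": "<sql-password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# Push a cleaned table into the silver_db schema
cleaned = spark.read.parquet("/mnt/silver/player_cleaned")
cleaned.write.jdbc(url=jdbc_url, table="silver_db.player_cleaned",
                   mode="overwrite", properties=props)
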
7. Power BI Dashboard

Summary:
Connected Power BI to Azure SQL Database to visualize insights, performance, and key metrics
of the IPL dataset.

 Connection: Azure SQL (Gold DB tables)

 KPIs Created:

o Orange Cap (Most Runs)

o Purple Cap (Most Wickets)

 Reports & Visuals:

o Team-wise Performance Metrics

o Top Players by Runs & Wickets

o Home vs Away Analysis

o Average Strike Rate by Player

o Match Results Summary


8. Challenges & Learnings

Summary:
Real-world implementation involved handling multiple datasets, formats, and orchestrations.

Challenges Faced:

 Small Dataset
The IPL dataset was small, so it may not fully capture the complexities of large-scale, real-world sports analytics projects.

 Local Environment Setup
Setting up Power BI and SQL Server locally required careful attention to compatibility, especially with JDBC connections and port configurations.

 Data Quality Issues
The raw IPL files had missing or inconsistent entries, especially in player statistics like runs and wickets, which needed thorough data cleansing to ensure reliable analysis.

 Inconsistent File Schemas
Different CSV files had varying schema definitions, which made it necessary to perform schema alignment and column standardization during the ingestion and transformation stages.

Key Learnings:

 Real-time ingestion and transformation

 PySpark optimizations and SQL querying

 Use of layered storage for scalability

 Establishing JDBC connections between Azure Databricks and Azure SQL Database for reading and writing data

 Automating multi-step ETL processes with ADF pipelines

 Power BI basics: connecting to Azure SQL and building KPI visuals

9. Conclusion

Summary:
The project successfully showcases how cloud-native tools can be combined to create a
powerful, scalable, and automated data pipeline with meaningful analytics.

 All stages of the data engineering lifecycle were completed.

 Automation achieved using ADF + Databricks.
