
AC52010 / AC53016 - MSc Project (SEM 2 23/24)

Real-Time Data Warehousing with Apache Airflow for Weather Data

Project Proposal
Introduction

As technology advances, data processing has become an integral part of our lives. Nearly every aspect of daily life is shaped by processed data, whether it is a recommendation for a favourite TV show or a next vacation destination. Real-time data processing in particular has become critical to organisations and governments, underpinning applications such as weather prediction, air traffic control, and health monitoring.

Given the importance of real-time data processing, we propose to develop a Real-Time Data Warehousing pipeline using Apache Airflow for weather data visualisation. This system aims to efficiently ingest, process, and store real-time weather data, enabling timely analysis and insights for various applications. By leveraging Apache Airflow's powerful scheduling and workflow management capabilities, we aim to seamlessly integrate and process large volumes of weather data from public weather APIs for continuous analysis and visualisation.

Features to be implemented
Data Ingestion:

API Integration: Connect to various weather data providers (OpenWeatherMap, WeatherAPI) to fetch real-time weather data.
Automated Scheduling: Use Airflow to schedule data-fetch tasks at regular intervals.
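As an illustration, the fetch step could be a plain Python callable that Airflow later wraps in a PythonOperator. This is a minimal sketch: the city name, API key, and `units` parameter below are placeholders, and the endpoint shown is OpenWeatherMap's public current-weather URL.

```python
import json
import urllib.parse
import urllib.request

OPENWEATHER_URL = "https://api.openweathermap.org/data/2.5/weather"

def build_request_url(city: str, api_key: str) -> str:
    """Build the OpenWeatherMap current-weather URL for one city."""
    params = urllib.parse.urlencode(
        {"q": city, "appid": api_key, "units": "metric"}
    )
    return f"{OPENWEATHER_URL}?{params}"

def fetch_current_weather(city: str, api_key: str) -> dict:
    """Fetch and decode one observation; intended as an Airflow task callable."""
    with urllib.request.urlopen(build_request_url(city, api_key), timeout=10) as resp:
        return json.load(resp)
```

In the DAG, this callable would be registered as a task and the DAG's schedule set to the desired fetch interval (for example, every 15 minutes).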

Data Storage:

Raw Data Storage: Store raw weather data in a data lake (AWS S3).
Structured Database: Store cleaned and transformed data in a relational database
(MySQL).
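For the raw layer, one common convention is to partition S3 object keys by date so downstream scans stay cheap. The sketch below derives such a key; the bucket name and path layout are assumptions, not a fixed design.

```python
from datetime import datetime, timezone

def raw_s3_key(city: str, observed_at: datetime) -> str:
    """Derive a date-partitioned S3 key for one raw weather observation."""
    d = observed_at.astimezone(timezone.utc)
    return f"raw/weather/{d:%Y/%m/%d}/{city.lower()}_{d:%H%M%S}.json"

# Upload sketch with boto3 (requires AWS credentials; bucket name is a placeholder):
# import boto3, json
# boto3.client("s3").put_object(
#     Bucket="weather-raw-lake",
#     Key=raw_s3_key(city, observed_at),
#     Body=json.dumps(payload),
# )
```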

Data Transformation:

ETL Process: Implement ETL (Extract, Transform, and Load) processes using
Airflow to clean and transform the raw data.
Data Quality Checks: Ensure data integrity and accuracy through validation and
quality checks.
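The transform and quality-check steps could also be plain callables run by Airflow. A minimal sketch, assuming an OpenWeatherMap-style payload (the field names and range thresholds are illustrative):

```python
def validate_row(row: dict) -> None:
    """Basic quality checks; if these raise, Airflow marks the task failed."""
    if not (-90.0 <= row["temp_c"] <= 60.0):
        raise ValueError(f"temperature out of range: {row['temp_c']}")
    if not (0 <= row["humidity_pct"] <= 100):
        raise ValueError(f"humidity out of range: {row['humidity_pct']}")

def transform_record(raw: dict) -> dict:
    """Flatten one raw payload into a relational row for the MySQL table."""
    row = {
        "city": raw["name"],
        "observed_at": raw["dt"],                      # unix epoch seconds
        "temp_c": round(float(raw["main"]["temp"]), 1),
        "humidity_pct": int(raw["main"]["humidity"]),
    }
    validate_row(row)
    return row
```

Keeping validation separate from transformation means each check can also be reused as a standalone data-quality task in the DAG.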

Real-Time Processing:

Stream Processing: Integrate Apache Kafka for real-time data streaming and
processing.
Airflow Integration: Use Airflow's Kafka sensors to handle real-time data ingestion
and processing.
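In the streaming path, each observation would arrive as one JSON-encoded Kafka message. Keeping the decoding as a pure function lets the consumer loop (whether kafka-python or the Kafka sensor from the Airflow Kafka provider package) stay thin. The topic name below is hypothetical.

```python
import json

WEATHER_TOPIC = "weather.observations"   # hypothetical topic name

def decode_message(value: bytes) -> dict:
    """Decode one Kafka message into the same dict shape the batch path uses."""
    obs = json.loads(value.decode("utf-8"))
    if "name" not in obs or "main" not in obs:
        raise ValueError("malformed weather message")
    return obs

# Consumer loop sketch (requires kafka-python and a running broker):
# from kafka import KafkaConsumer
# for msg in KafkaConsumer(WEATHER_TOPIC, bootstrap_servers="localhost:9092"):
#     obs = decode_message(msg.value)
```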

Monitoring and Alerts:

Pipeline Monitoring: Utilize Airflow's built-in monitoring tools to track pipeline performance.
Alert System: Implement alert mechanisms (e.g., Slack, email notifications) for error handling and issue resolution.
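Airflow lets each task register an `on_failure_callback` that receives the task context. A minimal Slack-notification sketch follows; the webhook URL is a placeholder, and the message format is an assumption.

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/PLACEHOLDER"  # placeholder URL

def format_failure_message(context: dict) -> dict:
    """Build a Slack payload from Airflow's task-failure context."""
    ti = context["task_instance"]
    return {
        "text": f"Airflow task {ti.dag_id}.{ti.task_id} failed "
                f"for run {context.get('run_id')}"
    }

def notify_slack_on_failure(context: dict) -> None:
    """Pass as on_failure_callback=notify_slack_on_failure on the operator."""
    body = json.dumps(format_failure_message(context)).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```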

Data Visualization:

Power BI Integration: Connect the processed data stored in the database to Power BI
for analysis.
Interactive Dashboards: Develop interactive dashboards and visualizations in Power
BI to provide insights into weather data trends and patterns.

Infrastructure Automation:

AWS CloudFormation/Terraform: Use AWS CloudFormation or Terraform to automate the deployment and management of AWS resources.

Project Timeline

We propose a four-week timeline for this project.

Week 1: Project Planning and Requirements Gathering

Define project scope, objectives, and requirements.

Identify data sources and integration points.

Design the architecture for data ingestion, storage, and processing.

Create a detailed design for the user interface and dashboard in Power BI.

Design the database schema for raw and transformed data.


Week 2-3: Development

Develop API integration and data ingestion pipelines using Airflow.

Implement the ETL processes and data transformation logic in Airflow.

Set up AWS infrastructure using CloudFormation/Terraform.

Integrate Apache Kafka for real-time processing and Airflow sensors.

Develop Power BI dashboards and visualizations for data analysis.

Testing and Quality Assurance

Conduct unit tests, integration tests, and user acceptance testing.

Identify and address any bugs or issues.

Perform load testing to ensure the system can handle large volumes of data.

Week 3-4: Deployment and Documentation

Deploy the system to AWS cloud infrastructure.

Provide comprehensive documentation for users and administrators.

Conduct training sessions for end-users and administrators.

Conclusion
The proposed Real-Time Data Warehousing pipeline with Apache Airflow for weather data on AWS will significantly enhance our ability to manage and analyse weather data in real time. By integrating Power BI for data visualization, the system will provide valuable insights and enable timely decision-making.
