jose_MINI2nd

Uploaded by

josemonjohn10

Rain Prediction using Random Forest

CHAPTER 1
INTRODUCTION
1.1 Overview

Accurate rainfall forecasts are important for sectors such as agriculture, disaster
management, and weather-dependent industries. Traditional weather forecasting
often struggles to produce accurate short-term forecasts because of the complexity of the
atmospheric system.
This project focuses on developing a rainfall forecasting system using the random forest
algorithm. With an accuracy of 85%, the model classifies rain events based on historical
weather data. This makes it valuable for government agencies, farmers, city planners,
and other stakeholders. Accurate forecasts support better decisions about resource
management, disaster preparedness, and agricultural practices.

The dataset used for this project comes from Kaggle and contains 145,460 data points
and 23 features, including temperature, humidity, wind direction, and barometric
pressure, with RainTomorrow as the target variable. Data pre-processing involves
classifying features as categorical or numerical, imputing missing values with the
median, removing outliers using the inter-quartile range (IQR) method, and converting
categorical variables to numeric format through label encoding.

The model is trained with the random forest algorithm, which builds multiple decision
trees to increase accuracy and reduce overfitting. Performance is evaluated using metrics
such as precision, recall, and AUC. This project demonstrates the effectiveness of the
random forest algorithm in rainfall forecasting, which provides practical benefits for the
various sectors that rely on accurate weather forecasts.
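
As a rough illustration of this workflow, the sketch below trains a random forest classifier
and reports precision, recall, and AUC. It assumes scikit-learn is installed and uses
synthetic data from make_classification in place of the weather dataset. The settings shown
(100 trees, an 80:20 split, the random seed) are illustrative assumptions, not the project's
actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data (assumption): 1,000 rows, 10 features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# An ensemble of 100 decision trees (illustrative setting).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_prob),
}
print(metrics)
```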

1.2 Motivation

Increased variability in weather patterns and unpredictable weather conditions increase the
need for accurate rainfall forecasts. Wrong predictions can have dire consequences,
especially in the agricultural sector, where poor forecasts can lead to crop failure or
inefficient water use.

Dept. of Computer Applications 1



The motivation behind this rainfall prediction system is to address these challenges by
creating a reliable tool that helps farmers optimize their irrigation strategies. It also helps
urban planners monitor water resources and helps governments prepare for possible floods
or droughts. By leveraging machine learning techniques, we can analyze historical weather
data efficiently to increase the accuracy of rainfall forecasts. This will ultimately improve
the overall performance of the forecasting system.

1.3 Objectives

Rainfall forecasts play an important role in protecting communities and ensuring
effective resource management in different areas. Because unpredictable weather
patterns are occurring more frequently due to climate change, the ability to make
accurate rainfall predictions is more necessary than ever. Traditional weather models
often lack reliable short-term forecasts, especially in complex weather conditions.
This project leverages machine learning techniques, specifically the random forest
algorithm, to fill that gap by providing a more accurate, data-driven approach to
estimating rainfall. The objectives are:

• Build a forecast model: Develop a machine learning model using
historical weather data to accurately predict the probability of rain.
• Use advanced algorithms: Use the random forest algorithm, which processes
data efficiently using ensemble methods. This reduces the need for
extensive pre-processing.
• Improve decision making: Provide valuable insights for stakeholders
such as farmers, disaster management teams, and water resource
managers to help them make informed decisions based on rainfall
forecasts.
• Support sustainable practices: Promote sustainable agricultural practices
by providing accurate forecasts that account for the risks associated with
sudden changes in climate.

In addition to these goals, the rainfall forecast system aims to provide useful
guidance for the real-world challenges posed by unpredictable rainfall patterns,
which in the end contributes to more sustainable and resilient practices.


1.4 Existing System

Currently, no widely accessible tool can reliably predict the weather.
Traditional forecasts rely on statistical models and meteorological expertise, which
can lead to inaccuracies because of the complexity of the climate system. Additionally,
existing models often fail to fully utilize historical climate data. As a result,
forecasts are less accurate, especially at the local level, and weather patterns are
becoming more unpredictable. There is therefore a clear opportunity to improve
forecasting techniques. By increasing the use of data and integrating more
sophisticated technologies, we can improve the accuracy of weather forecasts
and ultimately help communities better prepare for different weather conditions.

1.5 Proposed System

To address the limitations of current weather forecast models, we have developed a
rainfall forecasting system that leverages machine learning techniques. Our system
uses historical weather data from weather stations across Australia to predict
whether it will rain the next day.

⚫ Machine Learning Approach: We use machine learning algorithms, with special
emphasis on the random forest model. This ensemble method is known for its
robustness and accuracy in handling complex datasets, which makes it suitable
for weather forecasting.

⚫ Historical climate data: Our models are trained on comprehensive historical
climate data, including important parameters such as temperature, humidity,
wind speed, sky visibility, and barometric pressure. By analyzing these
variables, we can capture complex patterns that affect rainfall.

⚫ Target Variable: The main target variable in our system is "RainTomorrow",
which indicates whether it will rain the next day.


1.6 Methodology

In this project, we adopted the Agile methodology[3], mainly using the Scrum
framework to guide our development process. Agile is a flexible, iterative approach to
project management and software development that encourages incremental
development through continuous feedback and collaboration. It focuses on delivering
small, manageable portions of a project over short periods, allowing for adaptability to
change and improving overall project outcomes.
Within Agile, Scrum is a widely used method in which work is divided into cycles known
as sprints, each aimed at delivering a potentially releasable product increment.
Scrum emphasizes teamwork, accountability, and iterative progress towards a
well-defined goal.

For this project, we identified 9 product backlog items that describe the tasks
required to complete the rain prediction system. These tasks were broken down and
organized into four sprints. Each sprint focused on specific aspects of the project,
from data collection and preprocessing to model training, testing, and evaluation.
By following this iterative process, we ensured continuous progress, delivering each
phase of the project efficiently and on time.


CHAPTER 2
LITERATURE REVIEWS

A literature review is an essential part of any study because it offers a deeper
understanding of existing knowledge and research relevant to the topic. By reviewing
past studies and methodologies, we can identify gaps, challenges, and improvements that
inform the development of our system. In the context of our Rain Prediction System,
conducting a literature review helped us understand current practices in rain
forecasting and ML applications, guiding the choice of appropriate algorithms and
techniques. The insights gained from the literature review are outlined in the following
sections for reference.

2.1 Prediction Of Rainfall Using Machine Learning Techniques

Introduction

This paper, published in 2020 by Moulana Mohammed, Roshitha Kolapalli, Niharika Golla,
and Siva Sai Maturi, argues that accurate rainfall prediction supports better agricultural
planning and disaster management. Because rainfall is so important for India and other
regions that depend heavily on rain-fed seasons for agriculture, the authors apply machine
learning techniques to traditional rain forecasting in search of predictions that prevent
failures at critical times. They further suggest these methods as ways of developing better
forecast models that can help farmers and other stakeholders make the best decisions in
optimizing water resources.

Study Area and Data

The study uses historical rainfall data from 1901 to 2015, covering monthly, seasonal
(three consecutive months), and yearly rainfall over several subdivisions of India.
To strengthen model performance and concentrate on important features, principal
component analysis (PCA) is used for dimensionality reduction. It reduces the number of
dimensions while retaining the most relevant information, which helps improve model
accuracy and efficiency.
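
The idea behind PCA can be sketched in a few lines. The example below is a minimal
illustration, assuming NumPy is available; the function name pca_reduce, the random
stand-in data, and the choice of 3 components are all hypothetical and are not taken
from the reviewed paper.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained_variance[:n_components]

rng = np.random.default_rng(0)
# Hypothetical data: 120 samples of 12 monthly rainfall values (assumption).
monthly_rainfall = rng.gamma(shape=2.0, scale=50.0, size=(120, 12))

reduced, variance = pca_reduce(monthly_rainfall, n_components=3)
print(reduced.shape)  # 12 columns reduced to 3
```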


2.1.1 Machine Learning Models

Three machine learning models are used for rainfall prediction:

Multiple Linear Regression (MLR): The model analyses the relationship between one
dependent variable and multiple independent variables, capturing linear correlations. It may
not handle the nonlinear relations that often exist in rainfall data.

Support Vector Regression (SVR): It is known for its robustness in handling nonlinear data.
SVR projects data into higher-dimensional spaces using kernel functions and fits a hyperplane
that minimizes error within a defined margin. The epsilon and other hyperparameters are
fine-tuned to optimize the performance of SVR.

Random Forest Regression: It builds several decision trees during training and aggregates
their outputs, which gives more precise and more general predictions than a single decision
tree that overfits its training data. As a result, random forest captures nonlinear
relationships and interactions among the numerous features of complex datasets very well.

2.1.2 Result

Model performance is measured using mean absolute error (MAE) and R-squared (R²).

Support Vector Regression had the best accuracy, with an R² score of 0.9959 and a low
MAE of 4.35, which makes it the most accurate model at capturing the nonlinear
rainfall patterns.

Random Forest Regression also performed very well, with an R² score of 0.9952 and an MAE of
5.02, which shows that it can deal with complex data interactions, though a little less
accurately than SVR.

MLR was less accurate than the other models because it failed to capture the nonlinearity
in the data: its MAE was 10.95, the highest of the three, with a reported R² score of 0.9958.
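
For reference, the two metrics can be computed as in the following sketch. The observed
and predicted values are made-up numbers chosen for easy arithmetic, not data from the
reviewed paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

observed = [3.0, 5.0, 7.0, 9.0]   # illustrative values (assumption)
predicted = [2.0, 5.0, 8.0, 9.0]

print(mean_absolute_error(observed, predicted))  # (1 + 0 + 1 + 0) / 4 = 0.5
print(r_squared(observed, predicted))            # 1 - 2/20, approximately 0.9
```

A lower MAE and an R² closer to 1 both indicate a better fit, which is why the two can
occasionally rank models differently.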


Conclusion

The study concludes that SVR is the best model for rainfall prediction because it can handle
nonlinear relationships and give high accuracy. Random forest regression also performed very
well and could therefore be a good alternative for complex datasets. SVR and random forest
are therefore recommended for high-precision meteorological prediction applications,
as they can give the robust, accurate forecasts that agricultural and disaster management
planning depend on.

2.2 Rainfall Prediction Using Machine Learning

Introduction

The paper "Rainfall Prediction Using Machine Learning", published by Arnav Garg and
Himanshu Pandey in 2019, highlights that regions experiencing drastic climate change
due to global warming have a greater need for precise rainfall prediction.
Forecasting is crucial in rural areas and developing regions, where agriculture is highly
dependent on seasonal rainfall. In such areas, traditional predictive techniques are
usually less than satisfactory; hence the use of machine learning techniques, which can
predict patterns more reliably from historical rainfall data. The aim of this study is
to develop an inexpensive, accessible model for predicting rainfall that is applicable in
communities with limited technological infrastructure.

Study Area and Data

Monthly district-wise rainfall data for India from 1951 to 2015 are used in the study.
The data is sourced from the Ministry of Earth Sciences, India, and falls under the National
Data Sharing and Accessibility Policy. Pre-processing is done in Python in a Jupyter
Notebook and includes cleaning the data and handling missing values. The data was divided
into a training set (1951-2014) and a testing set (2015) for model training and evaluation. This
structured approach allowed the authors to test the model's accuracy on recent data, simulating
real-world forecasting challenges.
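
A chronological split of this kind can be sketched as follows. The record layout (a list
of dictionaries with year, month, and rainfall_mm keys) and the helper name split_by_year
are assumptions made for illustration; the paper's actual data structures are not
described here.

```python
def split_by_year(records, last_train_year):
    """Split records chronologically: train up to last_train_year, test after it."""
    train = [r for r in records if r["year"] <= last_train_year]
    test = [r for r in records if r["year"] > last_train_year]
    return train, test

# Hypothetical monthly records (assumption): one placeholder value per month, 1951-2015.
records = [
    {"year": year, "month": month, "rainfall_mm": 0.0}
    for year in range(1951, 2016)
    for month in range(1, 13)
]

train, test = split_by_year(records, last_train_year=2014)
print(len(train), len(test))  # 64 years of training months, 1 year of test months
```

Splitting by time rather than at random keeps future observations out of the training
set, which is closer to how a forecast is used in practice.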


2.2.1 Machine Learning Models

The study evaluates three machine learning models for rainfall prediction:

Support Vector Machine (SVM): SVM is a supervised learning algorithm used for both
classification and regression. In this paper, SVM constructs a hyperplane that separates the
rainfall pattern classes. Its kernel function transforms the data into a higher-dimensional
space so that both linear and nonlinear relationships can be found. This makes SVM a robust
approach for complex datasets.

Random Forest: This is an ensemble learning technique in which multiple decision trees are
built during training, and each tree makes its own prediction. The model aggregates those
predictions, by averaging for regression or by voting for classification, which yields greater
accuracy with a lower risk of overfitting. Random Forest captures complex, nonlinear patterns
and tolerates noisy data, and therefore performs very well on time-series data with seasonal
and spatial variations such as rainfall.

K-Nearest Neighbors (KNN): This is a non-parametric, instance-based learning algorithm.
KNN bases its predictions on the closest data points in the feature space, with each
neighboring point's contribution weighted according to its distance. It is computationally
simple but not very effective at handling complex relationships, so it is best suited to
straightforward datasets.
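
To make the KNN description concrete, the sketch below classifies a query point by a
distance-weighted vote of its nearest neighbours. Inverse-distance weighting, k=3, and
the toy points are assumptions chosen for illustration; the reviewed paper does not
specify its weighting scheme.

```python
import math
from collections import defaultdict

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by an inverse-distance-weighted vote of its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        votes[label] += 1.0 / (distance + 1e-9)  # closer neighbours weigh more
    return max(votes, key=votes.get)

# Toy two-feature data (assumption): two "No" points near the origin, three "Yes" points.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["No", "No", "Yes", "Yes", "Yes"]

print(knn_predict(points, labels, query=(0.2, 0.1)))  # No
print(knn_predict(points, labels, query=(1.0, 0.9)))  # Yes
```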

The models were assessed on the test data for the year 2015.
Random Forest achieved the highest accuracy by capturing the highly nonlinear and complex
relationships in rainfall data through its ensemble approach. It used decision trees to model
rainfall variation with great precision; it would therefore be preferred where both
accuracy and interpretability are desired. SVM was also robust to slight biases and
inconsistencies and made accurate predictions even in noise-limited scenarios.
KNN was less precise but good enough for some simpler predictive tasks, although
it lacked the refinement required for complex data such as rainfall.


2.2.2 Result

Support Vector Regression (SVR):

Accuracy: The best of all the models tested, at around 90%.

Applicability: Best suited for time series data, offering very high accuracy in predicting values,
especially for continuous variables such as rainfall.

Random Forest:

Accuracy: Close to SVR, at around 85%.

Applicability: Best suited for handling large datasets with numerous features. It performs very well in
classification tasks related to rainfall levels and can provide feature importance, which helps in
understanding which factors most affect the prediction of rainfall.

k-Nearest Neighbors (KNN):

Accuracy: Lower than both SVR and RF, at approximately 75%.

Applicability: Not well suited to time series data because it does not capture the general
patterns in weather well.

Conclusion

The study concludes that Random Forest is the best model for rainfall prediction because it can
handle nonlinear relationships and complexity in the data while remaining very accurate. Its
ensemble nature helps it generalize, making it excellent at identifying subtle patterns in
rainfall data. However, where the data contains minor inconsistencies, SVM stands out as a
practical alternative that is more robust and flexible. Such variability is common in
real-life applications, which makes SVM and Random Forest reliable tools for
rainfall forecasting. Such predictive modeling can make a considerable difference to
planning and decision-making in rainfall-dependent regions, ultimately improving sustainability
and resilience in climate-sensitive areas.


CHAPTER 3
DATA COLLECTION
3.1 Data Source

The dataset used to create this rainfall forecasting system was sourced from
Kaggle and is named "weatherAUS.csv". It contains about 10 years of daily weather
observations from Australian weather stations, providing rich material for forecasting
models. The dataset contains 145,460 rows and 23 features, with the target variable
"RainTomorrow", a classification of whether or not it rained the next day. If the next
day's rainfall is more than 1mm the column is marked "Yes", otherwise "No".
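
The labelling rule can be illustrated with a short sketch. The dataset already ships with
this column, so the code below is only meant to show the rule; the function name
label_rain_tomorrow and the sample rainfall values are hypothetical.

```python
RAIN_THRESHOLD_MM = 1.0  # threshold stated in the dataset description

def label_rain_tomorrow(daily_rainfall_mm):
    """Return 'Yes'/'No' per day based on the NEXT day's rainfall; the last day is None."""
    labels = []
    for today in range(len(daily_rainfall_mm)):
        if today + 1 < len(daily_rainfall_mm):
            tomorrow_mm = daily_rainfall_mm[today + 1]
            labels.append("Yes" if tomorrow_mm > RAIN_THRESHOLD_MM else "No")
        else:
            labels.append(None)  # no following observation to label from
    return labels

rainfall = [0.0, 3.2, 1.0, 0.4, 12.6]  # made-up daily totals in mm
print(label_rain_tomorrow(rainfall))   # ['Yes', 'No', 'No', 'Yes', None]
```

Note that exactly 1.0 mm is labelled "No", because the rule requires more than 1mm.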

3.2 Dataset Description

Attributes of the Dataset: The dataset includes a wide variety of meteorological features
that describe weather conditions for each day:

• Date: The date of the observation.
• Location: The name of the weather station where the observation was made.
• MinTemp: The minimum temperature recorded for the day in degrees Celsius.
• MaxTemp: The maximum temperature recorded for the day in degrees Celsius.
• Rainfall: The amount of rainfall recorded for the day (in mm).
• Evaporation: The Class A pan evaporation (in mm) recorded in the 24 hours
to 9am.
• Sunshine: The number of hours of bright sunshine recorded during the
day.
• WindGustDir: The direction of the strongest wind gust recorded.
• WindGustSpeed: The speed of the strongest wind gust recorded (in km/h).
• WindDir9am: The wind direction recorded at 9am.
• WindDir3pm: The wind direction recorded at 3pm.
• WindSpeed9am: The wind speed (in km/h) averaged over the 10 minutes before
9am.
• WindSpeed3pm: The wind speed (in km/h) averaged over the 10 minutes before
3pm.
• Humidity9am: The percent humidity recorded at 9am.
• Humidity3pm: The percent humidity recorded at 3pm.
• Pressure9am: The atmospheric pressure (in hPa) reduced to mean sea level,
recorded at 9am.
• Pressure3pm: The atmospheric pressure (in hPa) reduced to mean sea level,
recorded at 3pm.
• Cloud9am: The fraction of the sky obscured by cloud at 9am, measured in
oktas (eighths of the sky).
• Cloud3pm: The fraction of the sky obscured by cloud at 3pm,
measured in oktas.
• Temp9am: The temperature (in degrees Celsius) recorded at 9am.
• Temp3pm: The temperature (in degrees Celsius) recorded at 3pm.
• RainToday: A binary indicator (Yes/No) of whether rainfall for the day exceeded
1mm.
• RainTomorrow: The target variable indicating whether or not rainfall
tomorrow exceeded 1mm (Yes/No).

During the data preprocessing stage, the dataset exhibited missing values, outliers, and
other anomalies, which were handled with suitable measures. For instance, missing values
were treated using median imputation, while outliers were identified and
removed using the IQR (Interquartile Range) method.
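
Both steps can be sketched with the Python standard library. The sample values, the helper
names, and the conventional fence factor of 1.5 are assumptions for illustration; the
report does not state the exact factor or quartile method used.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def remove_iqr_outliers(values, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if lower <= v <= upper]

# Made-up wind gust speeds in km/h with two missing readings and one extreme value.
wind_gust = [35.0, None, 41.0, 39.0, 44.0, 37.0, 135.0, None, 40.0]

filled = impute_median(wind_gust)      # missing readings become the median
cleaned = remove_iqr_outliers(filled)  # values outside the IQR fences are dropped
print(filled)
print(cleaned)
```

Imputing before removing outliers, as done here, pulls the quartiles toward the median,
so the order of the two steps affects which values are flagged.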

Data Types: The dataset includes attributes of distinct data types: numerical (discrete
and continuous) and categorical. Below is the breakdown of the data types and the features
corresponding to each type:

Numerical Features Count: 16
Discrete Features Count: 2 (Cloud9am, Cloud3pm)
Continuous Features Count: 14 (MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine,
WindGustSpeed, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm,
Pressure9am, Pressure3pm, Temp9am, Temp3pm)
Categorical Features Count: 7 (Date, Location, WindGustDir, WindDir9am, WindDir3pm,
RainToday, RainTomorrow)

This diverse set of attributes provides a comprehensive view of weather conditions, helping
the machine learning model predict rainfall occurrences accurately.
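
A breakdown like this can be produced programmatically. The sketch below assumes pandas
is available and uses a tiny hypothetical sample with a few of the dataset's column names.
The rule that a numerical column holding only whole numbers counts as discrete is an
assumption made for illustration, not necessarily the rule used in the project.

```python
import pandas as pd

# Tiny hypothetical sample using a few of the dataset's column names.
df = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Perth", "Albury"],
    "MinTemp": [13.4, 17.9, 9.2, 11.0],
    "Rainfall": [0.6, 0.0, 2.4, 0.0],
    "Cloud9am": [8.0, 1.0, 7.0, 8.0],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

# Assumption: a numerical column holding only whole numbers is treated as discrete.
discrete = [c for c in numerical if (df[c].dropna() % 1 == 0).all()]
continuous = [c for c in numerical if c not in discrete]

print("Numerical:", len(numerical), numerical)
print("Discrete:", len(discrete), discrete)
print("Continuous:", len(continuous), continuous)
print("Categorical:", len(categorical), categorical)
```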


CHAPTER 4
MATERIALS AND METHODS

4.1 Agile Methodology


In this project, we followed the Agile methodology to ensure a flexible, iterative, and
adaptive approach to developing the Rain Prediction System. Agile allowed us to
break the project down into manageable phases, called sprints, which helped streamline
the development process and continuously improve the system through regular feedback
and collaboration.
Each sprint focused on a specific set of tasks, allowing us to monitor progress
effectively and make necessary adjustments along the way. The key advantages of Agile
include rapid iteration, timely feedback, and the ability to adapt to changing
requirements, all of which played a crucial role in the successful completion
of our project.

4.2 Product backlog


The product backlog served as the foundation for all development tasks. It is
a list of features, improvements, and tasks necessary to achieve the project's goal. It is
an important part of the Agile Scrum framework, ensuring that all required elements
are identified and systematically addressed. Each item in the backlog was carefully
evaluated and ranked based on its importance and relevance to the project.
Below is a detailed breakdown of the 9 product backlog items identified for this project.

Backlog ID | User story | Tasks
101 | As a data analyst, I want to import the dataset and perform an initial view. | 1. Literature review. 2. Write code to import the dataset. 3. Display the dataset.
102 | As a data scientist, I want to understand the features and their data types. | 1. Review feature names and descriptions. 2. Print each feature and its data type.
103 | As a data analyst, I want to classify features as numerical or categorical. | 1. Review the data to classify features as numerical or categorical. 2. Count and display the numerical features. 3. Count and display the categorical features.
104 | As a data scientist, I want to identify and impute missing values in the dataset. | 1. Identify the missing values. 2. Choose appropriate imputation methods. 3. Apply the methods to categorical and numerical values.
105 | As a data scientist, I want to visualize features and detect outliers. | 1. Use visualization libraries to create graphs. 2. Create a box plot to find the outliers. 3. Apply outlier detection methods and record findings.
106 | As a data engineer, I want to handle detected outliers in the dataset. | 1. Select an outlier handling strategy. 2. Implement the chosen strategy. 3. Check that all outliers are removed.
107 | As a data scientist, I want to convert categorical values into numerical values and do feature selection. | 1. Label encoding: convert each category to a unique integer. 2. Drop irrelevant features.
108 | As a data scientist, I want to split the dataset. | 1. Define the split strategy. 2. Define the split ratio (80:20). 3. Perform the dataset split.
109 | As a data scientist, I want to train and evaluate machine learning models. | 1. Select and configure different machine learning algorithms. 2. Monitor performance metrics during training. 3. Analyse the accuracy and take steps to improve it if needed.

Table 4.1 Product backlog


4.2.1 Sprint & Burndown chart

To efficiently manage the project timeline and workload, we divided the tasks from
the product backlog into four sprints, each focusing on a particular set of deliverables.
This iterative method allowed for continuous feedback, improvement, and adaptation
throughout the project. For each sprint, we tracked progress using a burndown chart,
which visually represented the remaining work and helped ensure the timely
completion of tasks. Below is a breakdown of the four sprints and their respective
burndown charts.

Sprint 1
In Sprint 1, we completed two product backlog items, laying the foundation for the project.
The progress made during this sprint is reflected in the burndown chart below, which
shows a steady reduction in the remaining workload as the tasks were finished
on schedule. This helped us maintain a clear trajectory for the subsequent sprints.

SPRINT BURN DOWN CHART

Backlog ID | User story | Initial estimate | Effort entries logged
101.1 | Literature review | 2 | 2
101.2 | Import dataset | 1 | 1
101.3 | Initial view | 1 | 1
102.1 | Review features | 2 | 1, 1
102.2 | List datatypes | 2 | 1, 1

Day | Day-0 | Day-1 | Day-2 | Day-3 | Day-4 | Day-5 | Day-6 | Day-7 | Day-8
Date | Initial estimate | Aug-01 | Aug-02 | Aug-05 | Aug-06 | Aug-07 | Aug-12 | Aug-14 | Aug-15
Remaining effort | 8 | 6 | 5 | 5 | 4 | 4 | 2 | 1 | 0
Ideal trend | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0


Table 4.2 Sprint 1 burn down chart

Fig 4.1 Graph of remaining effort and ideal trend wrt sprint 1
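
The two series plotted in a burndown chart can be derived as in the sketch below. The
ideal trend is assumed to be a straight line from the initial estimate down to zero, and
the effort burned per day is read off as the day-to-day differences in the remaining-effort
row of Table 4.2. The function names are hypothetical.

```python
def ideal_trend(total_effort, sprint_days):
    """Linear burn-down from total_effort on day 0 to 0 on the last day."""
    return [total_effort - total_effort * day / sprint_days
            for day in range(sprint_days + 1)]

def remaining_effort(total_effort, effort_burned_per_day):
    """Subtract each day's completed effort from the running total."""
    remaining = [total_effort]
    for burned in effort_burned_per_day:
        remaining.append(remaining[-1] - burned)
    return remaining

# Sprint 1 figures from Table 4.2: 8 units of effort over 8 working days.
print(ideal_trend(8, 8))
print(remaining_effort(8, [2, 1, 0, 1, 0, 2, 1, 1]))
```

Days on which the remaining effort sits above the ideal trend indicate that the sprint is
running behind the even pace.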


Sprint 2
In Sprint 2, we continued the momentum by completing two more backlog items. This sprint focused
on refining the core components and ensuring smooth integration of the features. The burndown
chart for Sprint 2 below illustrates the consistent progress made, with a gradual reduction in
pending tasks, keeping us aligned with the project goals.

SPRINT BURN DOWN CHART

Backlog ID | User story | Initial estimate | Effort entries logged
103.1 | Classify Features | 3 | 1, 1, 1
103.2 | Rationale for Classifications | 2 | 1, 1
104.1 | Identify Missing Values | 3 | 1, 2
104.2 | Choose Imputation Methods | 3 | 1, 2
104.3 | Apply Imputation | 3 | 1, 2

Day | Day-0 | Day-1 | Day-2 | Day-3 | Day-4 | Day-5 | Day-6 | Day-7
Date | Initial estimate | Aug-16 | Aug-19 | Aug-21 | Aug-25 | Aug-27 | Aug-29 | Aug-30
Remaining effort | 14 | 13 | 12 | 10 | 8 | 5 | 2 | 0
Ideal trend | 14 | 12 | 10 | 8 | 6 | 4 | 2 | 0
Table 4.3 Sprint 2 burn down chart

Fig 4.2 Graph of remaining effort and ideal trend wrt sprint 2


Sprint 3

In Sprint 3, we made significant progress by completing three backlog items. This sprint
emphasised improving the model's performance and refining data processing techniques.
The corresponding burndown chart for Sprint 3 shows a steady decline in remaining
tasks, reflecting the team's commitment to meeting the sprint objectives efficiently.

SPRINT BURN DOWN CHART

Backlog ID | User story | Initial estimate | Effort entries logged
105.1 | Data visualization | 2 | 2
105.2 | Outlier detection | 2 | 1, 1
105.3 | Record finding | 2 | 1, 1
106.1 | Outlier handling | 2 | 2
107.1 | Data encoding | 2 | 1, 1
107.2 | Document result | 2 | 1, 1

Day | Day-0 | Day-1 | Day-2 | Day-3 | Day-4 | Day-5 | Day-6
Date | Initial estimate | Sep-02 | Sep-04 | Sep-06 | Sep-09 | Sep-13 | Sep-16
Remaining effort | 12 | 9 | 7 | 4 | 3 | 1 | 0
Ideal trend | 12 | 10 | 8 | 6 | 4 | 2 | 0
Table 4.4 Sprint 3 burn down chart

Fig 4.3 Graph of remaining effort and ideal trend wrt sprint 3


Sprint 4
In Sprint 4, the final sprint of our project, we successfully finished the last two backlog
items. This sprint focused on final testing and refining the model to ensure optimal
performance. The burndown chart for Sprint 4 reflects the completion of all
remaining tasks, marking the successful closure of the project.

SPRINT BURN DOWN CHART

Backlog ID | User story | Initial estimate | Effort entries logged
108.1 | Define split | 1 | 1
108.2 | Perform split | 2 | 2
108.3 | Select algorithm | 3 | 1, 2
109.1 | Monitor performance | 3 | 1, 2
109.2 | Analyze accuracy | 1 | 1

Day | Day-0 | Day-1 | Day-2 | Day-3 | Day-4 | Day-5
Date | Initial estimate | Sep-17 | Sep-19 | Sep-20 | Sep-23 | Sep-30
Remaining effort | 10 | 9 | 6 | 4 | 3 | 0
Ideal trend | 10 | 8 | 6 | 4 | 2 | 0
Table 4.5 Sprint 4 burn down chart

Fig 4.4 Graph of remaining effort and ideal trend wrt sprint 4


4.3 Data Preprocessing

Data preprocessing is an essential step in any machine learning problem because
real datasets are frequently messy and contain anomalies that can negatively impact
the model's performance. These anomalies might include missing values, outliers,
inconsistencies, and irrelevant records. Without addressing these issues, machine
learning algorithms may struggle to discover meaningful patterns, leading to
lower accuracy and unreliable predictions. In our dataset, we applied several
preprocessing steps to ensure the data was clean and suitable for model
training. These preprocessing techniques helped us improve the overall quality of the
dataset, ultimately improving the accuracy of our model. We discuss each of these
steps in detail in the following sections.

4.2 Data Exploration

The first step in our data preprocessing procedure was data exploration,
which allowed us to build a thorough understanding of the dataset and its structure. This
stage is vital in any machine learning project because it provides insights into the
attributes, the rows, and the data types, which helps in determining how to handle each
feature during the subsequent stages of preprocessing.

We started by examining the basic structure of the dataset. Using the
df.head() function, we displayed the first few rows to look at the content of every
attribute. This gave us a preliminary impression of how the data was organized
and the kind of values present in each column. By visually examining the first few rows,
we could identify key features such as temperature, wind speed, and humidity,
together with our target variable, RainTomorrow, which we aimed to predict.

Next, we used the df.info() function to obtain more detailed information about the dataset,
including the total number of rows and attributes and the data type of each
attribute. This step was crucial as it helped us identify missing data
points, inconsistencies, and data quality issues. Knowing whether an attribute was
numerical or categorical helps when preprocessing the data later. We observed that the dataset
contained 145,460 rows and 23 attributes, with a mix of numerical and categorical
features.
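
These exploration calls can be sketched as follows, assuming pandas is available. The
three-row CSV excerpt is hypothetical and stands in for weatherAUS.csv so that the
example is self-contained.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for weatherAUS.csv (assumption).
csv_text = """Date,Location,MinTemp,Humidity3pm,RainTomorrow
2008-12-01,Albury,13.4,22,No
2008-12-02,Albury,7.4,,No
2008-12-03,Albury,12.9,30,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows of every attribute
df.info()                 # row count, dtypes, and non-null counts per column
print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
```

With the real file, the same calls would be applied to the DataFrame returned by reading
the CSV from disk.
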
To enhance our understanding of the data, we constructed various graphs and
visualizations, which helped us identify patterns, trends, and anomalies. Visual exploration
through histograms, box plots, and bar charts allowed us to observe the distribution of
numerical features and detect potential outliers or skewed data. For example, using box
plots helped us detect unusual values in features like Rainfall and WindGustSpeed. We also
employed heatmaps to examine correlations between different numerical attributes, which
provided insights into relationships that could influence the performance of our predictive
model.
By conducting this comprehensive data exploration, we laid a strong foundation for the
next stages of the preprocessing process. It allowed us to understand the dataset's overall
structure and highlighted the areas that needed further cleaning and refinement, ensuring
that we had a solid grasp of the data before moving on to more technical transformations.
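
The exploration calls described above can be sketched as follows. This is a minimal illustration on a tiny hand-made DataFrame; the values are made up for the example and are not taken from the real dataset, and only a few of the 23 columns are shown.

```python
import io

import pandas as pd

# Tiny illustrative sample; the values are invented for this sketch.
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, None, 9.2],
    "Humidity3pm": [22.0, 25.0, 30.0, None],
    "WindGustDir": ["W", "WNW", None, "NE"],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# First rows, to inspect the content of each column.
print(df.head())

# Row count, data types, and non-null counts per column.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Summary statistics for the numerical columns.
print(df.describe())

# (number of rows, number of columns)
shape = df.shape
```

On the full dataset, the same calls would report the 145,460 rows and 23 attributes mentioned above.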

4.3 Feature Classification

Feature classification is an essential part of the data preprocessing process, because it
helps in organizing and understanding the features of the dataset. By classifying features
into categories such as numerical, discrete, continuous, and categorical, we can apply
appropriate preprocessing techniques that improve the overall performance and
interpretability of the model.
In our implementation, the dataset contained 23 features that had to be classified based
on their data type and characteristics. Understanding this classification is important
because different machine learning algorithms handle different feature types in different
ways. For example, numerical features can be used directly in most algorithms, while
categorical features usually have to be encoded before being included in the model.
We divided the features into the following categories.

• Numerical Features: Numerical features represent measurable quantities and can take
on a range of values. These features can be used directly in most machine learning
models. Our dataset contains 16 numerical variables, including both continuous and
discrete variables. These features contain key information about the weather and are
a key factor in predicting the target variable RainTomorrow.

• Discrete Features: Discrete features are numerical attributes that take on a limited
number of distinct values. In our dataset, we identified 2 discrete features:
o Cloud9am
o Cloud3pm
These attributes represent the fraction of the sky obscured by clouds at 9 AM and
3 PM, respectively. They are measured in "oktas" (eighths) and can only take on
values from 0 (completely clear sky) to 8 (completely overcast). Since these values
are countable and not continuous, they fall into the discrete category.

• Continuous Features: Continuous features are numerical variables that can take on
any value within a range and are usually measured quantities. Our dataset contained
14 continuous features, which are essential for understanding the variation in
weather conditions. These features include:
o MinTemp: Minimum temperature in degrees Celsius.
o MaxTemp: Maximum temperature in degrees Celsius.
o Rainfall: The amount of rainfall in millimeters.
o Evaporation: Class A pan evaporation in millimeters.
o Sunshine: Hours of bright sunshine during the day.
o WindGustSpeed: Speed of the strongest wind gust in the 24 hours
before midnight.
o WindSpeed9am and WindSpeed3pm: Wind speeds averaged over the
10 minutes before 9 AM and 3 PM, respectively.
o Humidity9am and Humidity3pm: Humidity percentages at 9 AM
and 3 PM.
o Pressure9am and Pressure3pm: Atmospheric pressure in hPa at 9
AM and 3 PM.
o Temp9am and Temp3pm: Temperature at 9 AM and 3 PM.
These continuous features give us a comprehensive picture of the daily weather
conditions across different time points and play a crucial role in predicting rainfall
events.

• Categorical Features: Categorical features are attributes that represent discrete
categories or labels. These values cannot be used directly in machine learning
algorithms that require numerical input, so they often need to be transformed using
techniques such as label encoding. In our dataset, we identified 7 categorical
features, which include:
o Date: The date of observation.
o Location: The location of the weather station.
o WindGustDir: The direction of the strongest wind gust.
o WindDir9am and WindDir3pm: The wind direction at 9 AM and 3
PM, respectively.
o RainToday: A boolean feature indicating whether it rained today.
o RainTomorrow: The target variable indicating whether it will rain
the next day.
By classifying the features in this way, we were able to better understand how to
process each type of attribute. For instance, numerical features were treated for
outliers and missing values, while categorical features were transformed into
numerical representations using encoding techniques. The distinction between
continuous and discrete features also guided us in choosing appropriate
transformations and visualizations during the exploratory data analysis phase.

In summary, feature classification was essential to ensure that the attributes were treated
correctly in the preprocessing steps, thus improving the overall quality of the dataset and
enabling us to train a more accurate machine learning model.
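
A classification of this kind can be sketched in code as follows. This is only an illustration under stated assumptions: the helper name classify_features and the threshold parameter discrete_max_unique are hypothetical, the rule "few distinct values means discrete" is a common heuristic rather than the project's confirmed method, and the sample values are made up.

```python
import pandas as pd


def classify_features(df, discrete_max_unique=25):
    """Split columns into categorical, discrete, and continuous groups.

    A numerical column counts as discrete when it has fewer than
    `discrete_max_unique` distinct values (a heuristic, not a fixed rule).
    """
    numerical = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in df.columns if c not in numerical]
    discrete = [c for c in numerical if df[c].nunique() < discrete_max_unique]
    continuous = [c for c in numerical if c not in discrete]
    return categorical, discrete, continuous


# Small made-up sample with one column of each kind.
sample = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5, 14.6],
    "Cloud9am": [0, 8, 8, 1, 0, 1],
    "WindGustDir": ["W", "WNW", "WSW", "NE", "W", "N"],
    "RainTomorrow": ["No", "No", "No", "Yes", "No", "Yes"],
})

# A low threshold is used here only because the sample has six rows.
categorical, discrete, continuous = classify_features(sample, discrete_max_unique=4)
print(categorical, discrete, continuous)
```

On the full dataset, a larger threshold would be needed so that the oktas columns (values 0 to 8) fall into the discrete group while the measured quantities stay continuous.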

4.4 Handling missing values

Handling missing values is a critical aspect of data preprocessing, as it ensures the
integrity and completeness of the dataset. In real-world data, missing values are common
and can arise for various reasons, such as errors in data collection or unrecorded
observations. If not addressed properly, missing values can significantly affect the
performance of machine learning models, leading to inaccurate predictions, biases, or
even model failure.
In our dataset, we encountered missing values in several attributes. To address this
issue, we began by identifying the number of missing values for each attribute using the
following function:
• df.isnull().sum()
Once the missing values were identified, we applied appropriate imputation techniques to
handle them effectively:
• Numerical features, such as temperature, rainfall, wind speed, and pressure, are
continuous variables, and missing values in these features can distort the overall
statistical properties of the dataset. To handle this, we imputed the missing values
with the mean of the respective feature. Imputing the mean ensures that the central
tendency of the data remains unchanged, allowing the model to perform accurately
without any major shifts in the feature distributions.
• Categorical features, such as the direction of the wind or location, cannot be filled
using statistical measures like the mean. For these features, we opted to impute
random values based on the most frequent or plausible category within the dataset.
This method helps retain the diversity and variability in the categorical data, without
introducing bias or favoring a particular category excessively.
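
The two imputation strategies above can be sketched as follows. The helper name impute_missing and the seed parameter are hypothetical, and drawing random replacements from the observed values of the same column is one plausible reading of "random imputation", not a confirmed detail of the project.

```python
import numpy as np
import pandas as pd


def impute_missing(df, seed=0):
    """Fill numerical gaps with the column mean and categorical gaps
    with values sampled at random from the observed categories."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isnull()
        if not missing.any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Mean imputation keeps the column mean unchanged.
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Sample replacements from the non-missing values of this column.
            observed = out.loc[~missing, col].to_numpy()
            out.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return out


# Made-up sample with one gap per column.
sample = pd.DataFrame({
    "MinTemp": [10.0, None, 14.0],
    "WindGustDir": ["N", None, "SE"],
})

missing_counts = sample.isnull().sum()   # missing values per column
filled = impute_missing(sample)
print(missing_counts)
print(filled)
```

Sampling from the observed values preserves the relative frequency of each category on average, which is why frequent categories are chosen more often.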

Handling missing values is vital for several reasons:


• Model Performance: Missing values can cause incorrect computations during
model training. ML algorithms normally assume a complete dataset, and
missing data can cause these algorithms to fail or produce biased results.

• Data Integrity: Missing values distort the dataset's basic structure, decreasing
its reliability. Without addressing them, the dataset may no longer
accurately represent the underlying phenomena being studied, leading to
wrong insights.

• Statistical Bias: Ignoring missing values or leaving them untreated can
introduce bias into the model. For example, attributes with a large proportion
of missing values might inadvertently influence the algorithm, leading to
skewed predictions.

Failure to address missing values can cause several problems, including:

• Decreased Accuracy: If missing values are not handled, the model may be unable
to capture the actual relationships between features, leading to lower prediction
accuracy.
• Model Failure: Many machine learning algorithms do not accept missing values and
raise errors during training. This can prevent the model from functioning
altogether.
• Bias in Predictions: If missing values are ignored, the model may end up relying
on incomplete or inaccurate data, skewing its predictions and making it less
reliable in the real world.
By properly addressing missing values through imputation, we ensured that our dataset
remained complete and consistent, allowing our machine learning models to learn
effectively from the data without interruptions caused by missing entries.

4.5 Outliers Removal

Outliers can negatively affect the performance of the model, as it relies on accurate data
distributions to make predictions. Outliers skew these distributions, leading to potential
misclassification or errors in prediction accuracy. Removing or adjusting outliers ensures
a more robust and reliable model by maintaining the integrity of the data.
In this step, we plotted a box plot to visualize the distribution of data and detect the
presence of outliers. Outliers are values that differ significantly from the rest of the
dataset, lying far outside the interquartile range (IQR).
To remove the outliers, we applied the Interquartile Range (IQR) method. This method
identifies outliers as values lying below the lower bridge (lower bound) or above the upper
bridge (upper bound) of the data.
The mathematical process for outlier detection and removal follows these steps:

• IQR = Q3−Q1
• Lower Bridge = Q1−1.5×IQR
• Upper Bridge = Q3+1.5×IQR
Any data points that fall below the lower bridge or above the upper bridge are considered
outliers. These values are removed to improve the performance of the model.
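
The IQR rule above can be sketched as follows. The helper name remove_outliers_iqr is hypothetical and the sample values are made up; the quartiles are computed with pandas' default linear interpolation, which is an assumption about how the project computed them.

```python
import pandas as pd


def remove_outliers_iqr(df, column):
    """Drop rows whose value in `column` lies outside the IQR bridges."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bridge = q1 - 1.5 * iqr
    upper_bridge = q3 + 1.5 * iqr
    # between() is inclusive on both ends; rows with NaN in `column`
    # evaluate to False and are dropped as well.
    mask = df[column].between(lower_bridge, upper_bridge)
    return df[mask]


# Made-up sample with one obvious outlier (60.0).
sample = pd.DataFrame({"MinTemp": [10.0, 11.0, 12.0, 13.0, 14.0, 60.0]})
cleaned = remove_outliers_iqr(sample, "MinTemp")
print(cleaned)
```

For this sample, Q1 = 11.25 and Q3 = 13.75, so IQR = 2.5 and the bridges are 7.5 and 17.5; only the value 60.0 falls outside them.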

Fig 4.1 (Attribute MinTemp with outliers)

Fig 4.2 (Attribute MinTemp after removing outliers)


4.6 Converting Categorical variables to Numerical

In the preprocessing phase of our rain prediction project, we converted categorical
values into numerical values using the LabelEncoder().fit_transform() function. While
some algorithms, such as CatBoost, can handle categorical data natively, we decided to
manually convert the categorical values to numerical as part of our comprehensive
preprocessing strategy. This approach allows for more control and consistency across
different machine learning models, as not all algorithms can natively process categorical
data. It also helps in understanding how the model treats categorical variables by
transforming them into a numerical form.
The LabelEncoder from the sklearn.preprocessing module is used to convert categorical
data into numerical labels, where each unique category is assigned a distinct integer. We
used the fit_transform() function, which combines two processes:

• fit(): When fit() is applied, the encoder learns the unique categories present in the
column and assigns each category a numerical value. For example, if a column such
as WindGustDir contains the values ['N', 'E', 'W'], fit() identifies these distinct
labels, sorts them, and maps them to consecutive integers starting from 0:

'E' → 0
'N' → 1
'W' → 2
This step builds the internal mapping from the unique categories to numerical
values.

• transform(): After the encoder has learned the mapping during fit(), transform()
applies this mapping to the dataset, converting the categorical values into their
corresponding numerical labels. With the above example, if we apply transform()
to the attribute, it will replace each 'E', 'N', and 'W' with 0, 1, and 2
respectively.

• fit_transform(): The fit_transform() method combines both steps. It first fits the
encoder by identifying the unique categories and then transforms the data. This
saves effort by performing both processes in a single call.
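
The behaviour of fit_transform() can be checked with a short sketch. The wind direction values below are illustrative only; in the project the encoder would be applied to each categorical column of the dataset in turn.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values for a single column.
wind_dirs = ["N", "E", "W", "E", "N"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(wind_dirs)   # fit() and transform() in one call

# classes_ holds the learned categories in sorted order; the position of
# each category in this array is its integer label.
classes = list(encoder.classes_)

# inverse_transform() maps the integers back to the original labels.
decoded = list(encoder.inverse_transform(encoded))

print(classes)         # ['E', 'N', 'W']
print(list(encoded))   # [1, 0, 2, 0, 1]
```

Because the integers follow alphabetical order, they carry no meteorological meaning; tree-based models can still split on them, but the ordering should not be interpreted.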


Fig 4.3 (Before converting to Numerical)

Fig 4.4 (After converting to Numerical)

By converting categorical values manually with LabelEncoder, we maintain consistency
across various models and allow for faster, more efficient data processing, while still
taking advantage of the strengths of the Random Forest algorithm.

4.7 Feature Selection

In the Feature Selection step of our rain prediction project, we focused on choosing the
most relevant features for training the model. This process helps improve the model's
overall performance and reduces computational complexity.

To identify the relationships among the different attributes in our dataset, we built a
correlation heatmap. The heatmap visually represented how strongly each feature was
correlated with the target variable as well as with the other features.

In this analysis:
• Positive correlation means that when one attribute increases, the other also tends
to increase.
• Negative correlation means that when one attribute increases, the other tends to
decrease.
We analysed both positive and negative correlations to determine which features had a
meaningful impact on predicting rainfall. Through this analysis, we found that the attribute
'Date' was the least correlated with the target variable. Since 'Date' did not provide any
significant contribution to the model's predictive power, we decided to drop this column
from the dataset.

All other attributes, which showed stronger correlations, were retained and used for training
the model. This feature selection process helped streamline our data, ensuring that only the
most relevant information was passed into the model, enhancing its accuracy and
efficiency.
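
The correlation-based selection can be sketched as follows. The numbers are made up so that the encoded Date column is only weakly related to the target; they are not results from the real dataset, and dropping the single weakest feature is a simplification of the analysis described above.

```python
import pandas as pd

# Made-up, already-encoded sample data.
df = pd.DataFrame({
    "Date": [3, 0, 5, 1, 4, 2],
    "Humidity3pm": [20, 35, 50, 65, 80, 95],
    "Sunshine": [11, 9, 8, 4, 2, 1],
    "RainTomorrow": [0, 0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations; this matrix is what a heatmap
# (for example seaborn.heatmap(corr)) would display.
corr = df.corr()

# Correlation of every feature with the target, excluding the target itself.
target_corr = corr["RainTomorrow"].drop("RainTomorrow")

# Feature with the smallest absolute correlation to the target.
weakest = target_corr.abs().idxmin()
selected = df.drop(columns=[weakest])

print(target_corr)
print("Dropped:", weakest)
```

Using the absolute value matters here: a strongly negative correlation, such as Sunshine in this sample, is as informative as a strongly positive one.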

4.8 Splitting the Dataset


In the final step of our data preprocessing, we divided the dataset into two sections:
one for training the model and another for testing it. We followed an 80:20 ratio, in
which 80% of the data was dedicated to training the model, while the other 20% was
reserved for testing how well the model works.
The training data helps the model learn patterns within the dataset, while the testing
data is used to evaluate its accuracy on new, unseen data. This helps confirm that the
model can make accurate predictions on real-world data.

We used a method that splits the dataset efficiently and makes the division identical on
every run by setting a fixed random seed. This consistency ensures that the results are
reliable and reproducible during evaluation.
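
An 80:20 split with a fixed seed can be sketched with scikit-learn's train_test_split. The synthetic arrays and the seed value 42 are assumptions made for this illustration; the report does not state which seed was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (100 rows).
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.2 gives the 80:20 ratio; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same seed reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

For an imbalanced target such as RainTomorrow, passing stratify=y would additionally keep the class proportions equal in both parts; whether the project did this is not stated.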


CHAPTER 5
MODEL SELECTION

Random Forest is a powerful ensemble learning algorithm widely used for classification
and regression problems. It is well suited to complicated datasets such as those in
weather forecasting because it builds a multitude of decision trees that together improve
the precision of predictions of rain and other weather patterns. The following are some
advantages of using Random Forest for predicting weather patterns such as rain:

Handling Complex and Nonlinear Data Relationships: Random Forest is a good
algorithm that captures intricate patterns and interactions among the features. It fits quite
well with the weather data where variables such as temperature, humidity, and wind speed
have very intricate interdependencies that would affect predicting the rain.

Robustness against Overfitting and Irrelevant Features: Because it is structured as an
ensemble, Random Forest has less tendency to overfit than a single decision tree and
performs better even if a few features are irrelevant. This comes very handy with weather
datasets where not all variables have the same effect on predicting rain.

High Accuracy with Little Tuning: Random Forest delivers high accuracy out of the box
and needs relatively little hyperparameter tuning compared with many complex models.
This makes it beneficial for weather forecasting, where datasets can be large and diverse.

Feature Importance and Interpretability: Random Forest natively provides feature
importance scores, which make it possible to understand which features most affect the
likelihood of rain. These features could be pressure or wind speed, for example. This
interpretability is very valuable for meteorologists who are trying to understand and
explain the drivers behind the predictions.

Efficiency and Scalability: Random Forests are computationally intensive but very
scalable and can be parallelized. Therefore, they can easily process large weather datasets
even in real-time applications.


The basic idea of this algorithm, in the context of rainfall prediction, is to generate a
forest of decision trees. Each decision tree is trained on a bootstrap sample of the data
using a random subset of the features (e.g., temperature, humidity, and pressure) together
with the target variable (rain or no rain). At prediction time, each decision tree
independently votes on whether it will rain, and the final decision is the majority vote
across all trees.
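
The voting step can be illustrated with a few lines of plain Python. This is a conceptual sketch only: the function name majority_vote and the list of votes are hypothetical, and scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one day.
votes = ["rain", "no rain", "rain", "rain", "no rain"]
final = majority_vote(votes)
print(final)   # rain
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, a tie-breaking rule would have to be chosen.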

5.1 Algorithm

Steps of the Random Forest Algorithm

Step 1: Data Preparation

Collect historical weather data with relevant features and a label indicating whether it
rained or not. Make sure to preprocess for missing values and features that are irrelevant to
the study.

Step 2: Training Phase

Train multiple decision trees, each on a randomly selected subset of the data. Feature
bagging means selecting a random subset of features for each tree; data bagging means
bootstrapping (sampling with replacement) the training samples for each tree.

Step 3: Predictive Phase

For unseen data, each decision tree predicts (rain or not). The Random Forest calculates
the final prediction by tallying up the predictions for each tree through majority vote.

Step 4: Model Evaluation

The performance of the model is evaluated on test data using criteria such as accuracy,
precision, recall, and a confusion matrix to determine how well it discriminates between
the presence of rainfall and no rainfall.
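
The four steps above can be sketched end to end with scikit-learn. Synthetic data from make_classification stands in for the preprocessed weather data, and the hyperparameter values (50 trees, square-root feature sampling, seed 0) are assumptions for the example, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: data preparation (synthetic stand-in for the weather features).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: training. bootstrap=True resamples the rows for each tree;
# max_features="sqrt" considers a random subset of features at each split.
model = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", bootstrap=True, random_state=0
)
model.fit(X_train, y_train)

# Step 3: prediction on unseen data.
y_pred = model.predict(X_test)

# Step 4: evaluation.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the synthetic test set: {accuracy:.3f}")
```

Note that scikit-learn draws the random feature subset at every split rather than once per tree, which is a common variant of the feature bagging described in Step 2.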


In summary, Random Forest is a powerful predictor of rain that can model nonlinear
relationships in weather data while providing insight into feature importance. Its ensemble
approach reduces overfitting and makes use of many predictors to increase stability,
making it a good candidate for meteorological forecasting applications.


CHAPTER 6
MODEL EVALUATION

6.1 Model Prediction

A trained model is applied to new data to estimate outcomes based on relationships between
variables learned from historical data. For rainfall prediction, we used a Random Forest
model trained on a historical weather dataset with 23 attributes and 145,460 rows.

Before training and evaluating the model, the dataset was split into two sets using the
80:20 approach. This resulted in 116,368 rows being used for training and 29,092 rows for
testing. We used the larger portion of the data for training so that the model could learn
the underlying patterns in the data. The testing dataset consisted of the remaining 20% of
the data, 29,092 rows that were not used during training.

During the prediction stage, the trained Random Forest model applies the learned
relationships to the testing dataset and generates, for each record, a prediction of
whether rain is likely. It takes features such as temperature, humidity, and atmospheric
pressure as input and predicts whether it will rain on the following day.

The predictions are then compared with the actual rainfall observations to evaluate the
accuracy and reliability of the model. Metrics such as accuracy, precision, recall, and AUC
are used to gauge performance and give a complete picture of how well the model predicts
rainfall.

6.2 Model Evaluation and Performance


Model evaluation helps us determine how well our ML model performs its task on unseen
data. In this project, we evaluated the Random Forest model using three techniques:
Confusion Matrix, Classification Report, and AUC (Area Under the Curve). The details of
each technique, as applied to our Random Forest model, are given below:

6.2.1 Confusion Matrix

The confusion matrix is a performance measure of the classification problem, primarily the
binary classification. The matrix shows actual target values compared to predicted ones and
can show any of the four possible outcomes that follow:

True Positives (TP): True predictions of positive cases.
True Negatives (TN): True predictions of negative cases.
False Positives (FP): Type 1 errors, when the model has incorrectly predicted a positive
case.
False Negatives (FN): Type 2 errors, when the model has incorrectly predicted a negative
case.
For our Random Forest model, the confusion matrix is:

[[20685 1947]
[ 2455 4005]]
True Positives (TP): 4005 — Predicted correctly with rain
True Negatives (TN): 20685 — Predicted correctly without rain
False Positives (FP): 1947 — Predicted with rain incorrectly
False Negatives (FN): 2455 — Missed predicting rain
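
The layout of the matrix follows scikit-learn's convention, in which rows are actual classes and columns are predicted classes, so a binary matrix reads [[TN, FP], [FN, TP]]. The short sketch below shows this on a handful of made-up labels (1 = rain, 0 = no rain).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = rain, 0 = no rain.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tn, fp, fn, tp)   # 3 1 1 3
```

Reading the project's matrix the same way gives TN = 20685, FP = 1947, FN = 2455, and TP = 4005, which matches the counts listed above.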

6.2.2 Classification Report

The classification report provides significant metrics that can be used in order to evaluate
the quality of the model's prediction, especially for each class. Some of the most critical
metrics included in the report are as follows:
• Precision
Precision measures the accuracy of the positive predictions. It indicates how many
of the predicted positive cases were actually positive.
Precision = TP / (TP + FP)


• Recall
Recall (or Sensitivity) measures the ability of the model to identify all relevant
instances. It indicates how many of the actual positive cases were predicted
correctly.
Recall (Sensitivity) = TP / (TP + FN)

• F1-Score
The F1-score is the harmonic mean of precision and recall. It provides a single score
that balances both precision and recall, especially useful when dealing with
imbalanced datasets.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

• Accuracy
Accuracy measures the overall correctness of the model, indicating the proportion
of total correct predictions (both positive and negative) out of all predictions made.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

The classification report provides critical information about the performance of the
model as measured by metrics including precision, recall, F1-score, and accuracy. All of
these metrics are important for properly assessing and selecting a model for the task of
rain prediction. For this task, there are two classes of interest: positive (rain) and
negative (no rain).
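
As a worked example, the formulas can be applied to the confusion matrix counts reported for the Random Forest model. The values printed below are derived here by arithmetic from those counts, for the rain (positive) class; they are not figures quoted from the project's own classification report.

```python
# Counts taken from the reported confusion matrix.
tn, fp, fn, tp = 20685, 1947, 2455, 4005

precision = tp / (tp + fp)                            # 4005 / 5952
recall = tp / (tp + fn)                               # 4005 / 6460
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)            # 24690 / 29092

print(f"Precision: {precision:.3f}")   # 0.673
print(f"Recall:    {recall:.3f}")      # 0.620
print(f"F1-score:  {f1:.3f}")          # 0.645
print(f"Accuracy:  {accuracy:.3f}")    # 0.849
```

The accuracy of about 0.849 agrees with the 85% figure quoted for the model, while the lower recall shows that roughly 38% of actual rain days are missed, which accuracy alone does not reveal.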


CHAPTER 7
CONCLUSION

This project successfully developed an ML-based Rain Prediction System using the Random
Forest algorithm, achieving an accuracy of 85%. The model predicts occurrences of
rainfall from diverse meteorological features learned from historical weather data. This
system may benefit key stakeholders such as governments, farmers, and urban planners by
improving agricultural planning and resource management.

A set of data preprocessing steps was applied to ensure good data quality and model
accuracy. Missing values were addressed through mean imputation for numerical features
and random imputation for categorical variables, while outliers were managed with the IQR
method. Label Encoding converted the categorical data into numbers so that the model
could work with diverse kinds of data.

Critical libraries used in the project are:

⚫ Pandas for data manipulation and analysis,
⚫ NumPy for numerical operations,
⚫ Scikit-learn for model evaluation metrics such as accuracy score and classification
report,
⚫ Imbalanced-learn for managing class imbalance with techniques like SMOTE.
The model's 85% accuracy suggests that it may work as well as traditional methods for
forecasting, but further optimization is possible. For instance, hyperparameter tuning
along with advanced feature engineering could improve it further.


7.1 Future steps and Improving performance

Improving accuracy matters because accurate forecasts can critically affect agricultural
planning and water resource management, helping people make better decisions. Higher
accuracy builds the trust needed for wider use in real-world projects, thereby reducing
economic damage from unexpected weather conditions, so that the community is ultimately
better prepared for and more resilient to such conditions.

7.1.1 Hyperparameter Optimization

Using hyperparameter tuning via GridSearchCV or RandomizedSearchCV, the following
hyperparameters should be identified and optimized.

Trees: More trees may result in better models, but this will cost more in computation.
Max Depth: Max depth is used to adjust the model's complexity and interpretability.
Min Samples Split and Min Samples Leaf: Minimum sample counts for split and leaf nodes
can be used to prevent overfitting.
Max Features: Varying the number of features considered at each split might improve
accuracy and prevent overfitting.

7.1.2 Cross-Validation

Cross-validation is used to evaluate how well the model performs, so that the best
parameters are chosen and consistent results are obtained irrespective of which subset of
the data is used.
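
Hyperparameter search and cross-validation can be combined in one step, sketched below with RandomizedSearchCV. The data is synthetic, and the candidate values, the number of sampled combinations (n_iter=4), and the 3-fold setting are small assumptions chosen to keep the example fast, not recommended settings for the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate values for the hyperparameters discussed above.
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,             # number of random combinations to try
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_   # mean cross-validated accuracy
print(best_params, round(best_score, 3))
```

GridSearchCV accepts the same estimator and a param_grid dictionary but tries every combination, which is more thorough and considerably slower on a dataset of this size.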

7.1.3 Feature Importance Analysis

Feature importance analysis after training can be used to identify the most important
features, guiding future feature engineering steps.
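
Reading the importance scores from a trained model can be sketched as follows. The data is synthetic and the column names are attached only for illustration, so the resulting ranking says nothing about the real weather features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative names attached to synthetic columns.
feature_names = ["Humidity3pm", "Pressure9am", "WindGustSpeed", "Sunshine"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per feature;
# the scores sum to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

Impurity-based scores can favour features with many distinct values, so a low-ranked feature is a candidate for closer inspection rather than automatic removal.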

7.1.4 Increased Tree Depth or Number of Estimators

While monitoring for overfitting, it can sometimes help to increase the number of
estimators or adjust the depth of the trees to make the model more sensitive to
complicated relationships in the data.


7.1.5 Early Stopping and Pruning

Pruning or even custom stopping criteria might help avoid overfitting, although they are
not inherent to Random Forest.

In conclusion, Random Forest is a powerful and interpretable option for rainfall prediction,
and with a focus on hyperparameter tuning and further feature refinement, there is potential
to further improve accuracy and model performance.


