Rainfall Prediction - Report
amount of daily rainfall using those factors. Experts, for example, confirmed that a sequential retrospective learning algorithm better predicts rainfall under climate change based on temperature, humidity, and wind speed, and the study showed that rain prediction performance improved when deep learning models were used. According to Sarker, a comparison of performance between deep learning and other machine learning algorithms is shown in Fig. 1 below, where the performance of the deep learning model increases as the size of the data increases. Due to the large amount of data used in this study, machine learning methods are appropriate. Scientists studied a deep learning algorithm for predicting rainfall using different climate-dependent variables. To provide an accurate forecast of rainfall, prediction models are developed and tested using machine learning techniques. However, many researchers did not produce daily rainfall predictions; instead they used environmental data to predict whether or not it would rain, or to predict the annual average rainfall, because predicting daily rainfall is a challenging task. Not all of the environmental factors that are important in rainfall forecasting were used. This paper examined machine learning algorithms using data collected from a single meteorological station, selected the environmental features that correlate positively or negatively with rainfall, and assessed the daily rainfall performance of the machine learning algorithms using MAE and RMSE.
A classifier's accuracy and performance are determined not just by the model we apply, but also by how we preprocess the data and the type of data we give it to learn from. Because many machine learning methods, such as linear regression, logistic regression, and k-nearest neighbors, can only handle numerical data, categorical data must be converted to numeric form. Check the cardinality of each categorical feature before you start encoding.
Cardinality
The number of unique values in a categorical feature is known as its cardinality. A feature with a large number of distinct values is a high-cardinality feature; a categorical feature containing hundreds of zip codes is a typical example. When a high-cardinality feature is encoded, it increases the number of dimensions of the data, which is a serious concern and can hurt model performance. There are a couple of approaches to dealing with high cardinality: one is to use feature engineering, and the other is to simply remove the feature if it adds no value to the model.
Handling Missing Values
Missing values are a challenge for machine learning algorithms because most of them cannot handle missing entries directly, so they must be addressed first. Missing values can be identified and imputed in a variety of ways. When a dataset with missing values is loaded using pandas, the missing values are represented as NaN (Not a Number). These NaN values can be detected with the isna() or isnull() methods and imputed with fillna(). This technique is known as missing data imputation.
Feature Hashing
Another effective feature engineering approach for dealing with high-cardinality categorical features is feature hashing. In this technique a hash function is applied to the feature values, and the number of encoded features is fixed in advance (as a vector of pre-defined length). The hashed values are used as indices into this vector, and the entries at those indices are updated as needed.
Feature Scaling
The features in our dataset span a wide variety of magnitudes and ranges. This is a concern because many machine learning algorithms compute the Euclidean distance between data points, and features with large magnitudes weigh far more heavily in distance computations than features with small magnitudes. To counteract this effect, we must bring all features to comparable magnitudes, which can be accomplished by scaling.
A negative correlation means that an increase in one feature's value reduces the value of the target variable. We used the seaborn library to create a heatmap of the correlated features, which makes it easier to see which attributes are most strongly related to the target variable.
• Handling Class Imbalance
Our dataset is significantly imbalanced, as we discovered during the EDA process. Because the algorithm learns little about the minority class, imbalanced data leads to skewed results. We ran two tests, one with oversampled data and the other with under-sampled data.
Under-sampling: To remove instances of the majority class, we used imblearn's random under-sampler. The instances to remove are chosen based on distance so that as little data as possible is lost.
Oversampling: To produce synthetic instances for the minority class, we used imblearn's SMOTE approach. A subset of data from the minority class is taken, and new synthetic instances are constructed from it.
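
To make the under- and over-sampling experiments more concrete, here is a minimal sketch of the idea using only pandas. It is not the imblearn code used in the study: it swaps in plain random resampling (dropping or duplicating rows) instead of SMOTE, and the helper names, the toy data, and the `RainTomorrow` column name are assumptions made for illustration.

```python
import pandas as pd


def random_undersample(df, target, random_state=0):
    """Randomly drop rows so every class is as small as the minority class."""
    n_min = df[target].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=random_state)
             for _, grp in df.groupby(target)]
    return pd.concat(parts).reset_index(drop=True)


def random_oversample(df, target, random_state=0):
    """Randomly duplicate rows so every class is as large as the majority class."""
    n_max = df[target].value_counts().max()
    parts = []
    for _, grp in df.groupby(target):
        extra = n_max - len(grp)
        if extra > 0:
            # Sample with replacement to top up the smaller class.
            dup = grp.sample(n=extra, replace=True, random_state=random_state)
            grp = pd.concat([grp, dup])
        parts.append(grp)
    return pd.concat(parts).reset_index(drop=True)


# Hypothetical toy data: 8 dry days (0) and 2 rainy days (1).
toy = pd.DataFrame({
    "Humidity3pm": [30, 35, 40, 45, 50, 55, 60, 65, 90, 95],
    "RainTomorrow": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

under = random_undersample(toy, "RainTomorrow")
over = random_oversample(toy, "RainTomorrow")
print(under["RainTomorrow"].value_counts().to_dict())
print(over["RainTomorrow"].value_counts().to_dict())
```

Unlike this sketch, SMOTE creates new synthetic minority rows by interpolating between neighbouring minority samples rather than copying existing ones, so the two approaches will generally not give identical models.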
Choosing Features
The process of selecting, either automatically or manually, the features that contribute the most to our prediction variable or output is known as feature selection. Irrelevant features in the data can lower model accuracy because the model ends up learning from them. Feature selection helps to prevent overfitting, improve accuracy, and shorten training time. This task was completed using two methods, both of which yielded similar results. Statistical tests can be used to pick the features that have the strongest relationship with the output variable. The SelectKBest class in the scikit-learn package can be used with a variety of statistical tests to choose a fixed number of features. We chose the top 5 features from our dataset using the chi-squared statistical test, which requires non-negative features.
Outlier Detection and Treatment
An outlier is a value that deviates abnormally from the rest of the data in a sample. Outliers can be found using visualization (such as boxplots and scatter plots), the Z-score, statistical and probabilistic algorithms, and other methods.
Feature Importance
The features used to train a machine learning model affect its performance. The importance of a feature describes how much it contributes to building the model. Feature importance is the practice of assigning a score to each input feature based on how useful it is for predicting the target variable, and these scores can help you decide which features to use. The ExtraTreesRegressor class is used to determine the importance of features. This class implements a meta-estimator that fits many randomized decision trees on different sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting.
Feature Scaling
Feature scaling is an approach for scaling, normalizing, or standardizing data to a common range such as (0, 1). When each column of a dataset contains values in its own range, scaling brings the columns to a common level. StandardScaler is a class that implements feature scaling; it standardizes each feature to zero mean and unit variance.
Model Building
In this project, I'll use logistic regression to build a prediction model that forecasts whether it will rain in Australia tomorrow.
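
The two feature-ranking approaches described in this part of the report (chi-squared SelectKBest and tree-based feature importance) can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the feature names `f0` to `f7` are hypothetical, and the hyperparameters are not taken from the original experiments.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical non-negative features (chi2 requires values >= 0).
feature_names = [f"f{i}" for i in range(8)]
X = rng.uniform(0, 100, size=(300, 8))
# Synthetic binary target that depends mainly on f0 and f3.
y = ((X[:, 0] + X[:, 3]) > 100).astype(int)

# Method 1: keep the 5 features with the highest chi-squared scores.
selector = SelectKBest(score_func=chi2, k=5)
X_top5 = selector.fit_transform(X, y)
chi2_selected = [feature_names[i] for i in selector.get_support(indices=True)]

# Method 2: rank features by impurity-based importance from extra trees.
forest = ExtraTreesRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
order = np.argsort(forest.feature_importances_)[::-1]
tree_ranking = [feature_names[i] for i in order]

print("chi2 top 5:", chi2_selected)
print("extra-trees ranking:", tree_ranking)
```

Comparing the two lists is one simple way to check whether both methods agree on the most useful features, as the report says they did.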
Encoding of Categorical Features
Most machine learning algorithms, such as logistic regression, support vector machines, and k-nearest neighbors, are unable to deal with categorical data. As a result, categorical data must be transformed into numerical data before it can be used in modeling, a process known as feature encoding.
One-hot encoding and label encoding are two examples of feature encoding approaches. However, here I'll use the replace() function to convert categorical data to numerical data.
Correlation
Correlation is a statistic that measures the strength of the relationship between two features and is used in bivariate analysis. In pandas, the corr() method can be used to calculate it.
• Heatmap and Correlation Matrix
Correlation describes the relationship between the features and the target variable. A positive correlation means that an increase in a feature's value increases the value of the target variable, whereas a negative correlation means the opposite.
Logistic Regression
This approach is based on statistics and is used to solve classification problems. Its core is the logistic (sigmoid) function, which allows us to estimate the probability that an input belongs to a specific category. According to the data science community, logistic regression can address 60% of present classification problems.
The logistic regression model's accuracy score is 0.84, so the model does a very good job of predicting. The model shows no sign of underfitting or overfitting, which means it generalizes well to unseen data. The mean cross-validation accuracy is almost the same as the original model accuracy, so the accuracy of the model may not be improved using cross-validation.
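
The checks described for the logistic regression model (comparing training and test accuracy to look for under- or overfitting, then comparing against the mean cross-validation accuracy) can be sketched as below. This is a minimal example on synthetic data generated with scikit-learn, not the Australian rainfall dataset, so the class balance, the number of features, and the resulting scores are assumptions and will differ from the 0.84 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for "rain tomorrow: yes/no".
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scale the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Similar train and test accuracy suggests no strong over- or underfitting.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# 5-fold cross-validation on the training data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
```

If the mean cross-validation accuracy is close to the test accuracy, as the report observes, the single train/test split is already giving a stable estimate.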
Experimental Section
1. Importing Libraries.
import numpy as np
import pandas as pd
import tensorflow as tf
import warnings
WindGustDir 6.515815
WindDir9am 6.982612
WindDir3pm 2.632874
RainToday 0.988976
dtype: float64
MinTemp 0.443061
MaxTemp 0.215377
Rainfall 0.988976
Evaporation 42.788825
Sunshine 47.626457
WindGustSpeed 6.474498
WindSpeed9am 0.941505
Humidity9am 1.266769
Humidity3pm 2.530900
Pressure9am 9.853719
Pressure3pm 9.830863
Cloud9am 37.734058
Cloud3pm 40.174411
Temp9am 0.638219
Temp3pm 1.910262
month 0.000000
month_cos 0.000000
day 0.000000
day_sin 0.000000
day_cos 0.000000
dtype: float64
3. The datatype of Date is an object so I will change it to datetime.
df['Date'] = pd.to_datetime(df['Date'])
4. Checking for the missing values in the target variable
missing value= 3267
5. Dropping the missing values.
6. Checking Target Variable.
7. Transformed features.
14. Training and validation accuracy plotting.
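
The preprocessing steps of this experimental section (date conversion, counting and dropping rows with a missing target, percentile-based outlier removal, and Yes/No encoding) could look roughly like the sketch below. The miniature DataFrame is made up for illustration, `RainTomorrow` is an assumed name for the target column, and `map` is used for the Yes/No conversion where the report uses `replace()`.

```python
import pandas as pd

# Hypothetical miniature dataset; the real data has many more columns and rows.
df = pd.DataFrame({
    "Date": ["2012-01-01", "2012-01-02", "2012-01-03",
             "2012-01-04", "2012-01-05", "2012-01-06"],
    "Rainfall": [0.0, 1.2, 0.4, 250.0, 0.0, 3.1],
    "RainToday": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "RainTomorrow": ["Yes", "No", None, "No", "Yes", "No"],
})

# Step 3: convert the Date column from object to datetime.
df["Date"] = pd.to_datetime(df["Date"])

# Step 4: count missing values in the target variable.
n_missing_target = int(df["RainTomorrow"].isna().sum())
print("missing value=", n_missing_target)

# Step 5: drop the rows whose target is missing.
df = df.dropna(subset=["RainTomorrow"]).reset_index(drop=True)

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Outlier removal: keep only rows between the 5th and 95th percentile.
low, high = df["Rainfall"].quantile([0.05, 0.95])
df = df[df["Rainfall"].between(low, high)].reset_index(drop=True)

# Step 13: convert 'Yes' and 'No' to 1 and 0.
yes_no = {"Yes": 1, "No": 0}
for col in ["RainToday", "RainTomorrow"]:
    df[col] = df[col].map(yes_no)

print(df)
```

In this toy example the 250.0 mm reading is the only value above the 95th percentile, so it is the only row removed by the outlier filter.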
Many numeric features have data points beyond the IQR. I am using the 5th percentile as the threshold for outlier removal, i.e., any point above the 95th percentile or below the 5th percentile is considered an outlier and will be removed. The 5th-percentile threshold is an arbitrary choice; other threshold values could also be considered.
13. Converting 'Yes' and 'No' to '1' and '0' respectively
Yes = 1
No = 0
Result Analysis
The accuracy of the logistic regression model is 0.84, so the model does an excellent job of predicting. The model does not show underfitting or overfitting, which means that it performs well on unseen data. The mean cross-validation accuracy is almost the same as the accuracy of the original model, so model accuracy may not be improved using cross-validation. We investigate the combined effect of applying several meteorological factors in conjunction with PWV in rainfall prediction. Instead of using distinct seasonal models, we merged the seasonal and diurnal elements into a single model. The difference between the training and testing RMSE is used to judge whether the results on our testing dataset are reliable. The outcome is negative if the testing RMSE is larger than the training RMSE, which indicates that the model fits the training data better than the testing data. The R2 score tells us how accurate each model is. The most accurate method is the random forest regression algorithm; the difference between its predicted and actual values is likewise negligible. Compared with the other algorithms, the random forest algorithm learns very effectively. This rainfall forecast model is built using the approaches mentioned in Section III. The assessment metrics are presented after the model has been trained and evaluated using data from the NTUS station from 2012 to 2015 and the SNUS station from 2016. After training, SVM has an accuracy of 80%. Because of the categorical values included in the dataset, the accuracy is good but not as good as that of other techniques. Because classification algorithms are best suited to numerical data, this has resulted in a modest reduction in SVM accuracy.
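
The error metrics used in the result analysis (MAE, RMSE, the R2 score, and the comparison of training and testing RMSE) can be written out in a few lines of NumPy. This is a sketch for illustration only: the function names and the small arrays of daily rainfall values are assumptions, not results from the study.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical daily rainfall values (mm) and model predictions.
y_train_true, y_train_pred = [0.0, 2.0, 5.0, 10.0], [1.0, 2.0, 4.0, 9.0]
y_test_true, y_test_pred = [0.0, 4.0, 8.0, 12.0], [2.0, 4.0, 6.0, 16.0]

train_rmse = rmse(y_train_true, y_train_pred)
test_rmse = rmse(y_test_true, y_test_pred)
# A negative gap means the model fits the training data better than the test data.
rmse_gap = train_rmse - test_rmse

print(f"test MAE:  {mae(y_test_true, y_test_pred):.2f}")
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}, gap: {rmse_gap:.3f}")
print(f"test R2:   {r2(y_test_true, y_test_pred):.2f}")
```

Here the gap is computed as training RMSE minus testing RMSE, which is one reading of the "difference" mentioned in the text; the sign convention is an assumption.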
Discussion of findings
The environmental factors used in this study, which were collected by measuring instruments at a weather station, were analyzed for their correlation with rainfall, and appropriate features for daily rainfall prediction were selected based on the Pearson correlation results, as shown in Table 2. This study looked at rainfall forecasts using environmental features with a correlation coefficient larger than 0.2. Similarly, Manandhar et al. use the degree of interaction between each element to identify five significant environmental parameters: temperature, relative humidity, dew point, solar radiation, and precipitable water vapor. Temperature and relative humidity have a strong correlation, with a coefficient of -0.9, according to that study's experimental data. Gnanasankaran and Ramaraj did not illustrate the influence of environmental elements on rainfall, instead using monthly and yearly rainfall, while Prabakaran et al. used the year, temperature, cloud cover, and annual rainfall attributes to predict yearly rainfall without examining the interaction between environmental features. This study used appropriate environmental features to train and test three machine learning models (RF, MLR, and XGBoost) to get an estimate of daily rainfall. The performance of these machine learning models is measured using MAE and RMSE. The MAE for RF, MLR, and XGBoost is 4.49, 4.97, and 3.58, and the RMSE is 8.82, 8.61, and 7.85, respectively. Similarly, Manandhar et al. used machine learning algorithms to predict annual rainfall using appropriate environmental features and recorded a total accuracy of 79.6%. The researcher predicted the annual rainfall value by taking into account the average temperature, cloud cover, and annual rainfall as inputs. Correlations between the attributes were not examined. The average percentage error of the annual rainfall forecast using linear regression was 7%. Gnanasankaran and Ramaraj [14] did not show any effect of environmental factors on rainfall. Their study used monthly and annual rainfall to predict rainfall and, using multiple linear regression, measured an RMSE of 0.1069 and an MAE of 0.0833. Therefore, this study examined the impact of environmental factors on daily rainfall intensity using Pearson's correlation and selected appropriate environmental variables. The selected features are used as input to the daily rainfall prediction models, and the performance of the models is measured using MAE and RMSE.
Conclusion
Rainfall prediction is a data science and machine learning subject that uses algorithms to forecast the weather. Predicting rainfall is essential for making better use of water resources and increasing crop yields, as well as lowering mortality from flooding and rain-related illnesses. This study examines a number of machine learning algorithms for forecasting rain. Three machine learning algorithms, MLR, RF, and XGBoost, were presented and evaluated using data from an Australian weather station. The Pearson correlation coefficient was used to find appropriate environmental features for rainfall. The selected features were used as the input to the machine learning models in this work. The results of a comparison of the three algorithms (MLR, RF, and XGBoost) revealed that XGBoost was the superior machine learning algorithm for forecasting daily rainfall using the chosen environmental features. If sensor data is incorporated in the study, the accuracy of rainfall estimations may improve. However, sensor data were not taken into account in this investigation. Sensor and weather databases with additional environmental factors can help enhance rainfall prediction accuracy. If sensor and meteorological data are utilized for daily rainfall, large-scale data analysis can be used to predict rainfall in the future.
References
1. World Health Organization: Climate Change and Human Health: Risks and Responses. World Health Organization, January 2003
2. Alcántara-Ayala, I.: Geomorphology, natural hazards, vulnerability and prevention of natural disasters in developing countries. Geomorphology 47(2-4), 107-124 (2002)
3. Nicholls, N.: Atmospheric and climatic hazards: Improved monitoring and prediction for disaster mitigation. Natural Hazards 23(2-3), 137-155 (2001)
4. [Online] InDataLabs, Exploratory Data Analysis: the Best way to Start a Data Science Project. Available: https://medium.com/@InDataLabs/why-start-a-data-science-project-with-exploratory-data-analysis-f90c0efcbe49
5. [Online] Pandas Documentation. Available: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
6. [Online] Scikit-Learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
7. [Online] Scikit-Learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
8. [Online] Scikit-Learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
9. [Online] Raheel Shaikh, Feature Selection Techniques in Machine Learning with Python. Available: https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
10. [Online] Imbalanced-learn Documentation. Available: https://imbalanced-learn.readthedocs.io/en/stable/introduction.html
11. V. Veeralakshmi and D. Ramyachitra, Ripple Down Rule learner (RIDOR) Classifier for IRIS Dataset. Issues, vol 1, p. 79-85.
12. [Online] Aditya Mishra, Metrics to Evaluate your Machine Learning Algorithm. Available: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
12. Nikhil Sethi, Dr. Kanwal Garg, “Exploring Data Mining Technique for Rainfall Prediction”, Vol. 5(3), 2014, ISSN: 0975-9646.
13. Bushra Praveen, Swapan Talukdar, Shahfahad, Susanta Mahato, Jayanta Mondal, Pritee Sharma, Abu Reza Md. Towfiqul Islam, Atiqur Rahman, “Analyzing Trend and Forecasting of rainfall changes in India using nonparametrical and machine learning approaches”, Scientific Reports, 2020.
14. Aakash Parmar, Kinjal Mistree, Mithila Sompura, “Machine Learning Techniques for Rainfall Prediction: A Review”, International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS), March 2017.
15. Shreekanth Parashar, Tanveer Hurra, “A Study of Rainfall Using different Data Mining Techniques”, Research Gate, Article-May 2020.
16. Deepak Ranjan Nayak, Amitav Mahapatra, Pranati Mishra, “A Survey on Rainfall Prediction using Artificial Neural Networks”, International Journal of Computer Applications, volume 72- No.16, June 2013.