Rainfall Mini Project Report
BACHELOR OF TECHNOLOGY
In
CIVIL ENGINEERING
By
ENGINEERING
SURATHKAL, MANGALORE-575025
NOVEMBER 2022
DECLARATION
We declare that the Report of the Mini project-I entitled “RAINFALL PREDICTION USING
MACHINE LEARNING”, which is being submitted to National Institute of Technology
Karnataka, Surathkal in partial fulfilment of requirements of the Degree of Bachelor of
Technology in Civil Engineering is a bona fide report of the project work carried out by us.
The material contained in this report has not been submitted to any university or Institution for
the award of any degree.
UJWAL B (201CV258)
This is to certify that this report entitled RAINFALL PREDICTION USING MACHINE
LEARNING being submitted by ALLEN VARGHESE PAUL (201CV103), RAKSHITH
SAJJAN (201CV142), UJWAL B (201CV258) and SHANNON BRITNEY CARLO
(201CV249) is accepted as the record of work carried out by them as the part of a Mini project-
I in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology
in Civil Engineering of the Department of Civil Engineering, National Institute of Technology
Karnataka, Surathkal, Mangaluru.
The study of precipitation and rainfall trends is critically important for a country like India,
whose food security and economy depend on the timely availability of water. In this
work, rainfall trends are studied from a daily data series using different types of
machine learning models. The accuracies of the models differ, ranging from 50% to 85%.
Previously, rainfall prediction was particularly difficult because it relied on historical data and
empirical formulas. With the advent of machine learning, however, rainfall can be predicted
much more accurately. Factors like humidity, wind speed, sunshine hours, wind direction, and
minimum and maximum temperatures during the day are considered while predicting the
rainfall of a certain area. Models like Linear Regression, KNN, SVM, etc., have been used to
predict the output from the given data. Artificial Intelligence tools can replace simulation
models by using input and output data sets without considering some of the complex relations
of the system to be modelled. The aim of the study is to predict the rainfall in Surathkal,
Mangalore, India.
The data used are from the Department of Water Resources and Ocean Engineering.
Keywords: Rainfall, Parameters, Artificial Intelligence, Machine Learning
TABLE OF CONTENTS
Abstract
Content
List of figures
List of tables
List of Abbreviations
1. Introduction
1.1. General
2. Literature Review
2.1. General
4. Study Area
5. Methodology
6. Result
7. Observation
8. Conclusion
9. References
LIST OF TABLES
1.2 Hydrometer
5.2 Heatmap
6.1 Trial 1
6.2 Trial 2
6.3 Trial 3
ABBREVIATIONS
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
RE Relative Error
1.1. General
Rainfall is the major product of the condensation of atmospheric water vapor that falls under
gravitational pull from clouds. It occurs when a portion of the atmosphere becomes saturated
with water vapor (reaching 100% relative humidity), so that the water condenses and
"precipitates" or falls. Rainfall (including drizzle and rain) is usually measured using a rain
gauge and expressed in units of millimeters (mm) of height or depth. Rainfall is the
predominant form of precipitation causing streamflow, especially the flood flow in a majority
of rivers in India.
The magnitude of rainfall varies with time and space. Differences in the magnitude of rainfall
in various parts of the country at a given time and variation of rainfall at a place in the
various seasons of the year are obvious and need no elaboration. Rainfall can be classified
based on the rate of precipitation as follows: –
Based on the seasons, the different amounts of rainfall are given below:
1. South West monsoon (June - September) – The south-west monsoon is the principal
rainy season of India (75% of annual rainfall). Precipitation is about 100-200 mm per day.
2. Transition-1, post-monsoon (October - November) – This air mass strikes the east
coast of the southern peninsula and causes rainfall. About twice as many cyclones form in
the Bay of Bengal as in the Arabian Sea.
4. Transition-2, post-summer (March - May) – There is very little rainfall in India in this
season.
India has 4% of the world’s freshwater, which must cater to 17% of the world’s population.
As per the NITI Aayog report released in June 2019, India is facing the worst-ever water crisis in
its history. Approximately 600 million people, or roughly 45% of the population in India,
are facing high to severe water stress. As per the report, 21 Indian cities will run out of their
main source of water i.e., groundwater by 2020. The report goes on to say that nearly 40% of
the population will have absolutely no access to drinking water by 2030 and 6% of India’s
GDP will be lost by 2050 due to the water crisis. The water footprint network has developed
an interactive tool to calculate and map the water footprint by different users, assess its
sustainability, and identify strategic interventions for improving water use. Hence, to develop
these efficient systems, we can use Artificial Intelligence and Machine Learning.
The standard way of measuring rainfall or snowfall is the standard rain gauge, which can be
found in 100 mm plastic and 200 mm metal varieties. The inner cylinder is filled by 25 mm
of rain, with overflow flowing into the outer cylinder. Other types of gauges include the
popular wedge gauge, the tipping bucket rain gauge, and the weighing rain gauge.
1. Temperature
Temperature affects how much water evaporates off the surface of the ground. If the
temperature is high, then more moisture will be lost to evaporation. However, if the
temperature is low, then less moisture will be lost.
2. Humidity
When humidity is high, the air approaches saturation and cannot hold many additional water
molecules. In contrast, when humidity is low, the air can hold more water molecules than are
already present.
3. Wind
Wind speed and direction affect the movement of air across the earth’s surface. Strong winds
cause the air to move rapidly over the land, causing evaporation to occur at a faster rate. As a
result, the air becomes drier. Conversely, calm wind conditions allow the air to remain still,
which causes the air to become wetter.
4. Evaporation
Evaporation is the transition of the liquid particles into the gaseous phase. Rainfall is affected
by the rate of evaporation as it is the amount of water entering the atmosphere from the
surface of the Earth.
5. Cloud Cover
Cloud cover is the percentage of sky covered by clouds. Clouds reflect sunlight back into
space, thereby cooling the planet. On average, cloud cover increases precipitation.
AI refers to the development of computer systems able to perform tasks that normally require
human intelligence, such as visual perception, speech recognition, decision-making, and
translation between languages. Artificial intelligence was founded as an academic discipline
in 1956, and in the years since has experienced several waves of optimism. Some AI
applications include advanced web search engines, recommendation systems (used
by YouTube and Amazon), understanding human speech (such as Siri and Alexa), and
self-driving cars (e.g., Tesla). AI researchers have adapted and integrated a wide range of
problem-solving techniques, including search and mathematical optimization, formal
logic, artificial neural networks, and methods based on statistics, probability and economics.
1. Supervised Learning
Supervised learning algorithms build a mathematical model of a set of data that contains both
the inputs and the desired outputs. The data is known as training data, and consists of a set of
training examples. Each training example has one or more inputs and the desired output, also
known as a supervisory signal. In the mathematical model, each training example is
represented by an array or vector, sometimes called a feature vector, and the training data is
represented by a matrix. Types of supervised-learning algorithms include some major
algorithms like active learning, classification and regression.
2. Unsupervised learning
Unsupervised learning algorithms take a set of data that contains only inputs, and find
structure in the data, like grouping or clustering of data points. The algorithms, therefore,
learn from test data that has not been labelled, classified, or categorized. Instead of
responding to feedback, unsupervised learning algorithms identify commonalities in the data
and react based on the presence or absence of such commonalities in each new piece of data.
3. Reinforcement learning
Machine learning refers to a subset of artificial intelligence that allows machines to learn and
improve automatically based on past data without the need for explicit programming.
Artificial intelligence aims at producing smart computer systems that can solve complex
human problems faster than humans can. In the case of ML, we teach machines with data so
that they produce accurate results by performing a task on their own, whereas in AI we try to
develop a system which can perform the task as a human being would.
Artificial intelligence is preferred over numerical modelling for many problems due to errors in
the simulation process. The complexity of simulations is much higher than that of artificial
intelligence. Artificial Intelligence tools can replace simulation models and decrease
computational efforts by using input and output data sets without considering complex
relations of the system to be modelled.
Four basic types of rainfall models have been the focus of most of the recent research
on rainfall modelling:
3. Nonparametric models
4. “Mechanistic” models.
No matter what type of model is fitted, a common goal is to simulate rainfall from the fitted
model. There are two sources of variation: the variation built into the model, and the
variation associated with the uncertainty with which the parameters of the model are
estimated during the training phase of the data analysis. This second source of variation is
often overlooked.
In the case of rainfall models, parameters such as wind speed vary and cannot be simulated
with high accuracy. Hence, numerical modelling is not well suited to the rainfall prediction
problem. Instead, it is recommended to feed the data collected through experimentation to a
model which can learn from the data and predict accurately.
2. LITERATURE REVIEW
2.1 General
The review of literature consists of the following sections: (a) Rainfall studies and models,
(b) Models used for rainfall prediction, and (c) Concepts of Machine Learning.
Researchers are working to detect patterns in climate change, as it affects the economy from
production to infrastructure. Predicting rainfall with a good accuracy rate is likewise a
challenging task. Because rainfall cannot be predicted well by traditional methods, scientists are
using machine learning and deep learning to find patterns for rainfall prediction. Different
techniques are used for the prediction of rainfall, such as regression analysis, clustering, and
Artificial Neural Networks (ANN). Fundamentally, two approaches are used for predicting
rainfall: the empirical approach and the dynamical approach. The empirical approach is based on
an analysis of historical data of the rainfall and its relationship to a variety of atmospheric and
oceanic variables over different parts of the world. The most widely used empirical approaches
for climate prediction are regression, artificial neural networks, fuzzy logic, and the
group method of data handling. In a dynamical approach, on the other hand, predictions are
generated by physical models based on systems of equations that predict the evolution of the
global climate system in response to initial atmospheric conditions.
Ozlem Terzi developed different rainfall estimation models using the monthly rainfall data of
the Isparta, Senirkent, Uluborlu, Egirdir, and Yalvac stations in Turkey. The models were built
using Decision Table, KNN, Multilinear Regression, M5Rules, Multilayer Perceptron, RBF
Network, Random Subspace, and Simple Linear Regression algorithms, and their quality was
tested using the coefficient of determination (R2) and root-mean-squared error (RMSE), which
are the most well-known and commonly used performance criteria. Using different combinations
of inputs to these models, the study found that the MLR model gives the best results for
estimating rainfall over the Isparta region.
J.M. Spate et al. prepared a model to estimate streamflow from measured and
estimated/interpolated rainfall. The K-medoid clustering algorithm is discussed for clustering
shapes/peaks, along with various classification and association rule extraction methods. They
selected all those catchments in their region of interest where high-intensity rainfall data exists
for at least some temporal interval. They then applied some simple criteria to the high-intensity
data; for example, so much rain must fall in such a small time interval on a given day for that
fall to be flagged as an intense event. Having generated a Boolean series with 1s on every day
with an intense event and 0s elsewhere, they used data mining to automatically extract those
combinations of daily data characteristics that tend to occur on a day with a 1 in the Boolean
series.
Pratap Singh Solanki et al. reviewed studies on the use of data mining techniques in
the water resource sector for water management. Water resource management has been one
of the most challenging and fascinating domains around the world for many years. Scientists
have tried to predict rainfall, flood warnings, water inflow, water availability and
requirements, etc., from the large amount of available data using various methods. In their
article, they surveyed the use of data mining techniques for predicting inflow, drought
possibility, weather, rainfall, evaporation, temperature, wind speed, etc. The paper provides a
survey of literature and work done by researchers using various algorithms and modelling
methods, viz. association rules, classification, clustering, decision trees, and artificial neural
networks.
Pinky Saikia Dutta, in her project, implemented rainfall prediction using an empirical
statistical technique. She used 6 years (2007-2012) of data, such as minimum
temperature, maximum temperature, pressure, wind direction, relative humidity, etc., and
predicted rainfall using Multiple Linear Regression (MLR). The model forecasts the monthly
rainfall amount (in mm) in the summer monsoon season. Regression is a statistical empirical
technique that uses the relation between two or more quantitative variables in an
observational database so that the outcome variable can be predicted from the others. One of
the purposes of a regression model is to find out to what extent the outcome (dependent
variable) can be predicted by the independent variables. The predictors selected for the
model are minimum temperature, maximum temperature, mean sea level pressure, wind speed,
and rainfall.
Jyothis Joseph described the empirical method technique belonging to the clustering and
classification approach. ANNs are used to implement these techniques. He used Relative
Humidity, Pressure, Temperature, Precipitable Water, Wind Speed. In this paper subtractive
clustering is used. Subtractive clustering is a fast, one-pass algorithm for estimating the number
of clusters and the cluster centres in a set of data. Applying subtractive clustering, the optimum
numbers of clusters are obtained. The rainfall values are categorized as low, medium & heavy.
The classifier model has been evaluated against a confusion matrix and the results have been
obtained. This paper applies a neural network for rainfall prediction. In this paper, two methods
such as classification and clustering are implemented. The neural network Bayesian
regularization has been applied in the implementation.
K. Poorani and K. Brindha used the Principal Component Analysis (PCA) method for forecasting
rainfall. The PCA method is used when there is significant inter-correlation between the
predictors. The PCA model avoids the inter-correlation and helps to reduce the degrees of
freedom by limiting the number of predictors. Their experimental studies suggest
that PCA has some benefits over ANN in analysing climatic time series such as rainfall,
particularly with regard to the interpretability of the extracted signals.
1. Linear regression
Linear regression is the simplest machine learning model in which we try to predict one output
variable using one or more input variables. The representation of linear regression is a linear
equation, which combines a set of input values (x) and predicted output(y) for the set of those
input values. It is represented in the form of a line.
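As an illustration of the linear equation described above, the following is a minimal sketch of a least-squares fit for a single input variable. The function name `fit_line` and the humidity and rainfall values are hypothetical and are not taken from the project data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: relative humidity (%) and daily rainfall (mm)
humidity = [60.0, 70.0, 80.0, 90.0]
rainfall = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(humidity, rainfall)
predicted = slope * 85.0 + intercept  # predicted rainfall at 85% humidity
print(slope, intercept, predicted)
```

With several input variables the same idea applies, but the fit is usually obtained with a linear algebra routine or a library rather than by hand.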
2. Decision Tree:
Decision trees are the popular machine learning models that can be used for both regression
and classification problems.
A decision tree uses a tree-like structure of decisions along with their possible consequences
and outcomes. Each internal node represents a test on an attribute, and each branch
represents an outcome of the test. A tree with more nodes can fit the training data more
closely, although a very deep tree may overfit. The advantage of decision trees is that they are
intuitive and easy to implement, but a single tree often lacks accuracy.
3. Random Forest:
Random Forest is the ensemble learning method, which consists of many decision trees. Each
decision tree in a random forest predicts an outcome, and the prediction with most votes is
considered as the outcome.
A random forest model can be used for both regression and classification problems.
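The voting step described above can be sketched as follows. This shows only how the outputs of already-trained trees are combined, not how the trees are built. The helper names and the sample predictions are hypothetical; averaging is the usual way of combining tree outputs for regression.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: return the mean of the individual tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)


# Hypothetical outputs of five classification trees for one day
votes = ["rain", "no rain", "rain", "rain", "no rain"]
label = majority_vote(votes)

# Hypothetical outputs (mm of rainfall) of three regression trees
mean_mm = average_prediction([4.0, 6.0, 5.0])
print(label, mean_mm)
```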
4. SVM:
Support Vector Machine (SVM) is a relatively simple Supervised Machine Learning Algorithm
used for classification and/or regression. It is more preferred for classification but is sometimes
very useful for regression as well. An SVM outputs a map of the sorted data with the margins
between the two as far apart as possible. SVMs are used in text categorization, image
classification, handwriting recognition and in the sciences.
5. KNN:
6. Gradient Boosting:
Gradient boosting is a technique used in creating models for prediction. The technique is mostly
used in regression. Gradient boosting presents model building in stages, just like other boosting
methods, while allowing the generalization and optimization of differentiable loss functions.
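The stage-wise model building mentioned above can be sketched for the squared-error loss, where each stage fits a small model to the residuals left by the previous stages. This is a simplified illustration with one input variable and single-split "stumps" as weak learners. The function names, the learning rate and the data are assumptions for this sketch, not the configuration used in the project.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split of a 1-D input that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Build an additive model in stages; each stage fits the current residuals."""
    base = sum(ys) / len(ys)           # stage 0: constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_stages):
        # For squared loss, the negative gradient is proportional to the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical 1-D data with a step between x = 2 and x = 3
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print(model(1.0), model(4.0))
```

Other differentiable loss functions change only the residual line, which is what makes the method general.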
7. ADA Boosting:
Artificial Intelligence:
Several definitions of artificial intelligence (AI) have surfaced over the last few decades. John
McCarthy offers the following definition in his 2004 paper: "It is the science and
engineering of making intelligent machines, especially intelligent computer programs. It is
related to the similar task of using computers to understand human intelligence, but AI does
not have to confine itself to methods that are biologically observable."
However, decades before this definition, the artificial intelligence conversation began with
Alan Turing's 1950 work "Computing Machinery and Intelligence". In this paper, Turing,
often referred to as the "father of computer science", asks the following question: "Can
machines think?" From there, he offers a test, now famously known as the "Turing Test",
where a human interrogator would try to distinguish between a computer and human text
response. While this test has undergone much scrutiny since its publication, it remains an
important part of the history of AI.
Human approach:
Ideal approach:
Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy. Machine learning is an important component of the growing field of
data science.
Using statistical methods, algorithms are trained to make classifications or predictions, and to
uncover key insights in data mining projects. These insights subsequently drive decision
making within applications and businesses, ideally impacting key growth metrics. As big data
continues to expand and grow, the market demand for data scientists will increase. They will
be required to help identify the most relevant business questions and the data to answer them.
Machine learning algorithms are typically created using frameworks that accelerate solution
development, such as TensorFlow and PyTorch.
The learning system of a machine learning algorithm can be broken into three main parts:
2. An Error Function: An error function evaluates the prediction of the model. If there
are known examples, an error function can make a comparison to assess the
accuracy of the model.
3. A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm will repeat this “evaluate and
optimize” process, updating weights autonomously until a threshold of accuracy has
been met.
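The "evaluate and optimize" loop described above can be sketched with gradient descent on a model that has a single weight. The error function here is the mean squared error; the learning rate, tolerance and data are assumptions chosen for illustration only.

```python
def mean_squared_error(w, xs, ys):
    """Error function: average squared gap between w * x and the known y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, tolerance=1e-8, max_steps=10_000):
    """Repeat evaluate -> adjust weight until the error falls below tolerance."""
    w = 0.0
    for _ in range(max_steps):
        error = mean_squared_error(w, xs, ys)      # evaluate
        if error < tolerance:                      # accuracy threshold met
            break
        # Derivative of the mean squared error with respect to w
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient              # optimize
    return w, error


# Hypothetical data generated by y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, final_error = train(xs, ys)
print(w, final_error)
```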
While a lot of public perception of artificial intelligence centres around job losses, this
concern should probably be reframed. With every disruptive, new technology, we see that the
market demand for specific job roles shifts. For example, when we look at the automotive
industry, many manufacturers, like GM, are shifting to focus on electric vehicle production to
align with green initiatives. The energy industry isn’t going away, but the source of energy is
shifting from a fuel economy to an electric one.
In a similar way, artificial intelligence will shift the demand for jobs to other areas. There will
need to be individuals to help manage AI systems. The biggest challenge with artificial
intelligence and its effect on the job market will be helping people to transition to new roles
that are in demand.
3. DATA
The data used consist of the various parameters which affect rainfall. Humidity affects
rainfall because saturated air cannot hold any additional water molecules.
Wind direction and speed change with time and drastically affect the pattern of rainfall.
High wind speeds cause the air to move rapidly, which results in evaporation occurring at a
faster rate. The rate of evaporation is the amount of water being converted to vapour, which
will later condense to water at high altitudes. The amount of sunshine is another factor to
consider. The data are daily records for the years 2001 – 2008. The various climate
changes in Surathkal are noted and taken into consideration while recording the data. The
table below (Table 3.1) shows the data as well as the source of the data collected.
Data                   Source
Rainfall parameters    NITK Surathkal, Water Resources and Ocean Engineering Dept. data reports
4. STUDY AREA
Surathkal is one of the major localities in the northern part of Mangalore city, located on
NH-66 in the Dakshina Kannada district, Karnataka. Surathkal is located at 12°58'60 N 74° 46'
60E. The maximum and minimum temperatures in a year vary between 37 °C and 25 °C, but
the ambient temperature has occasionally touched 40 °C during the summer season (usually
March, April, May) in the 21st century.
Mangalore is located on the western coast of India at 12.87°N 74.88°E in Dakshina Kannada
district, Karnataka state. It has an average elevation of 22 m (72 ft) above mean sea level.
Mangalore has a tropical monsoon climate and is under the direct influence of the Arabian
Sea branch of the southwest monsoon. It receives about 95 percent of its total annual rainfall
between May and September but remains extremely dry from December to March. Humidity is
approximately 75 percent on average and peaks during June, July and August. During this
time of year temperatures during the day stay below 34 °C (93 °F) and drop to about 19 °C
(66 °F) at night.
Fig 3.1. Location map of study area
5. METHODOLOGY
The methodology for developing the model is the classical approach and consists of data
cleaning, building the model and testing it. This is to increase the efficiency of the process
and reduce the time consumption as well as other resources spent.
The overview of the proposed model is shown in the figure below based on Machine
Learning for predicting rainfall using the rainfall parameters gathered in the dataset.
Data pre-processing is an integral step in Machine Learning, as the quality of the data and the
useful information that can be derived from it directly affect the ability of our model to learn;
therefore, it is extremely important that we pre-process our data before feeding it into our
model. Exploratory data analysis (EDA) is an approach to analysing datasets to summarize
their main characteristics. Some of the EDA techniques are multivariate analysis, outlier
detection, and feature scaling. The steps in data pre-processing are:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Feature scaling
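Steps 4 to 6 above can be sketched in plain Python as follows. The choices made here (filling gaps with the column mean, integer codes for categories, min-max scaling) are common options assumed for illustration; the report does not state which variants were used. The column names and values are hypothetical.

```python
def fill_missing(values):
    """Step 4: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def encode_categories(labels):
    """Step 5: map each distinct label to an integer code (sorted for repeatability)."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


def min_max_scale(values):
    """Step 6: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical columns: relative humidity (%) with one gap, and wind direction
humidity = [80.0, None, 90.0, 70.0]
wind_direction = ["SW", "W", "SW", "NE"]

humidity_filled = fill_missing(humidity)
direction_codes, mapping = encode_categories(wind_direction)
humidity_scaled = min_max_scale(humidity_filled)
print(humidity_filled, direction_codes, humidity_scaled)
```

Integer codes impose an artificial ordering on wind directions; one-hot encoding is a frequent alternative when that ordering is unwanted.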
Below are some of the diagrams taken from the code showing the multivariate analysis as well
as scatterplots and outlier detection.
Model training is at the heart of the data science development lifecycle where the data
science team works to fit the best weights and biases to an algorithm to minimize the
loss function over prediction range. When a supervised learning technique is used,
model training creates a mathematical representation of the relationship between the
data features and a target label. In unsupervised learning, it creates a mathematical
representation among the data features themselves. Model training is the primary step
in machine learning, resulting in a working model that can then be validated, tested,
and deployed. The steps in model training are:
The training model is used to run the input data through the algorithm to correlate the
processed output against the sample output. Later model validation is carried out.
In Machine Learning, models are only as useful as their quality of predictions; hence,
fundamentally our goal is not to create models but to create high-quality models with
promising predictive power. The performance is measured by Accuracy, Root Mean Squared
Error, the Relative Error and the Coefficient of Correlation.
1. Accuracy
Accuracy is, simply put, the total proportion of observations that have been correctly
predicted.
• TP represents the number of True Positives. This refers to the total number of
observations that belong to the positive class and have been predicted correctly.
• TN represents the number of True Negatives. This is the total number of observations
that belong to the negative class and have been predicted correctly.
• FP is the number of False Positives. It is also known as a Type 1 Error. This is the
total number of observations that have been predicted to belong to the positive class,
but instead belong to the negative class.
• FN is the number of False Negatives. It may be referred to as a Type 2 Error. This is the
total number of observations that have been predicted to be a part of the negative class
but instead belong to the positive class.
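In terms of the four counts defined above, accuracy is the standard ratio (TP + TN) / (TP + TN + FP + FN). The sketch below counts the four cases for a binary rain (1) / no rain (0) label and applies this ratio. The label encoding and the sample values are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary label."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1          # predicted negative, actually negative
        elif p == positive:
            fp += 1          # predicted positive, actually negative (Type 1)
        else:
            fn += 1          # predicted negative, actually positive (Type 2)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Proportion of all observations that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 1 = rain, 0 = no rain
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
score = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, score)
```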
Relative Error can be defined as the average value of the relative differences between the
observed and predicted values of concentration with respect to observed concentrations. Using
this method, we can determine the magnitude of the absolute error in terms of the actual size of
the measurement. If the true measurement of the object is not known, then the relative error can
be found using the measured value. The relative error gives an indication of how good a
measurement is relative to the size of the object being measured.
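The Root Mean Squared Error and the Relative Error named in this section can be computed as in the sketch below. Skipping observations equal to zero in the relative error is an assumption made here, because the ratio is undefined on dry days; the report does not say how such days were handled. The sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_relative_error(observed, predicted):
    """Average of |observed - predicted| / |observed|, skipping zero observations."""
    terms = [abs(o - p) / abs(o) for o, p in zip(observed, predicted) if o != 0]
    return sum(terms) / len(terms)


# Hypothetical daily rainfall values in mm
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

error_rmse = rmse(observed, predicted)
error_relative = mean_relative_error(observed, predicted)
print(error_rmse, error_relative)
```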
4. Coefficient of correlation
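The report does not give the formula used for this measure. A common choice is Pearson's correlation coefficient between the observed and predicted values, which is assumed in the sketch below. The sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical observed and predicted rainfall (mm); predictions are exactly
# twice the observations, so the linear correlation is perfect
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(observed, predicted)
print(r)
```

A value near 1 only indicates a strong linear relationship; as the example shows, it does not by itself mean the predictions are close to the observations.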
Model selection is the process of choosing one among many candidate models for a
predictive modelling problem. There may be many competing concerns when performing
model selection beyond model performance, such as complexity, maintainability, and
available resources. Model selection is a process that can be applied both across different
types of models (e.g., logistic regression, SVM, KNN, etc.) and across models of the same
type configured with different model hyperparameters.
All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Using the
best-trained model with the selected hyperparameters and important variables, we can predict
the rainfall with the given parameters in the dataset.
6. Results
After testing various models such as KNN, Linear Regression, SVM, etc., together with their
respective K-fold cross-validations, the accuracy and the other error measures of each
model have been calculated. Three trials have been carried out, and the differences in
the accuracies of each of the models have been noted. Different K-values have been
used, and the contribution of the K-value has been taken into consideration.
Below are the tables after running the code successfully.
Fig.6.1 Trial 1
Fig.6.2 Trial 2
Fig.6.3 Trial 3
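The combination of KNN with K-fold cross-validation described in this section can be sketched as follows. This is not the project code: it is a minimal classifier on hypothetical (humidity, temperature) pairs, with folds formed by taking every n-th sample instead of shuffling. Note that "k" in KNN is the number of neighbours, while the "K" in K-fold is the number of folds (called `n_folds` here).

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def k_fold_accuracy(xs, ys, n_folds, k):
    """Average accuracy over n_folds held-out folds (no shuffling, for clarity)."""
    n = len(xs)
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))   # every n_folds-th sample
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        correct = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / n_folds


# Hypothetical samples: (relative humidity %, max temperature °C); 1 = rain
xs = [(90, 24), (60, 32), (88, 25), (62, 31),
      (92, 23), (58, 33), (85, 26), (65, 30)]
ys = [1, 0, 1, 0, 1, 0, 1, 0]

cv_accuracy = k_fold_accuracy(xs, ys, n_folds=4, k=3)
print(cv_accuracy)
```

Repeating the call for several values of `k` and keeping the best cross-validated score is one simple form of the hyperparameter tuning discussed in the report.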
7. Observation
It is observed that the KNN K-fold and the AdaBoost Regressor K-fold models have the
highest accuracies. The KNN K-fold has an accuracy of around 84.9% in every trial. The
AdaBoost regressor has an accuracy of around 74%. These can be improved by
hyperparameter tuning. The other models have low accuracy, which is attributed to
overfitting, so only a few models achieve higher accuracies. With hyperparameter tuning,
the accuracies of all models may be brought above 80%.
8. Conclusion
The KNN K-fold model has the highest accuracy and can be adopted for rainfall prediction
with some hyperparameter tuning. The parameters chosen for predicting the rainfall are
appropriate, but the models need to be checked for overfitting.
9. References