Real Time Monitoring of Water Quality For Rural Areas A Machine Learning and Internet of Things Approach
Department of Computer Science and Engineering, Jaypee University of Information Technology, Solan, Himachal Pradesh, India, [email protected]
Department of Electronics and Communication Engineering, Jaypee University of Information Technology, Solan, Himachal Pradesh, India, [email protected]
Abstract: A water quality monitoring system can aid in preserving the environment, ensuring the
security of nearby water sources, and fostering economic growth in rural areas. This motivates the
development of a system that employs the Internet of Things (IoT) and Machine Learning to monitor
the quality of water. This paper discusses the characteristics of water that determine whether it
is fit for human consumption. pH and turbidity sensors are dipped in water samples acquired from
wells, lakes, rivers, ponds, and other places, and the readings are used to inform the development
of an effective model. The data are delivered from the sensors to the IDE, from where they are sent
to the cloud server. The model output is tabulated for the test data, where 1 indicates the water
is fit for drinking and 0 indicates the water is unfit. The model is validated by employing Support
Vector Machine, Random Forest and XGBoost methods. The maximum accuracy of 95.12% was observed
using XGBoost. We have successfully implemented real-time data acquisition for the water quality
monitoring system for rural areas using Machine Learning and IoT, which lets us know whether the
water is fit for drinking or not.

Keywords: Turbidity, pH, Machine Learning Models, Internet of Things, Microcontroller.

I. INTRODUCTION

According to the World Health Organization (WHO), 368 million people rely on unprotected wells and
springs, and 122 million collect untreated surface water from ponds, lakes, streams, and rivers
[1]. In 2020, approximately 2 billion people were without access to safe water. Cholera, dysentery,
diarrhoea, typhoid, hepatitis and polio are just a few of the diseases that can spread as a result
of inadequate sanitation and contaminated water. People are exposed to avoidable health risks when
water and sanitation infrastructure is inadequate, poorly maintained, or managed improperly. Water
contamination from agricultural chemicals, technological advancements, internal community
management, industries, and waste disposal all contribute to rural villagers having limited access
to clean drinking water.

In rural areas, the water quality of wells and ponds is assessed using two crucial parameters,
namely pH and turbidity. The effectiveness of water treatment procedures, the taste and odour of
drinking water, the corrosion of infrastructure, and the health and survival of aquatic species can
all be affected by the pH of water. Turbidity is a measure of how cloudy or hazy a liquid is, which
results from suspended particles; it tells us about the transparency of the water. Turbidity can
change the chemical, physical and biological properties of water. The presence of suspended
particles like clay, silt and organic matter can cause a variety of issues, including high levels
of turbidity in water [2]. A solution's acidity or alkalinity can be determined by its pH value,
which ranges from 0 to 14: readings below 7 signify acidity, readings above 7 signify alkalinity,
and a pH equal to 7 is considered neutral. To guarantee that the water is safe and fit for use in
rural regions where wells and ponds are frequently used as sources of drinking water, it is crucial
to routinely test the pH and turbidity of the water. High levels of turbidity, or a pH that is too
low or too high, indicate the presence of pollutants or other contaminants in the water, which may
have detrimental effects on both human health and the ecosystem. Frequent monitoring of these
characteristics can aid in spotting possible issues before they become serious and enable proper
measures to be taken to safeguard the water supplies in rural regions [3].

Authorized licensed use limited to: Mukesh Patel School of Technology & Engineering. Downloaded on August 05,2023 at 07:34:11 UTC from IEEE Xplore. Restrictions apply.

The algorithm proposed by Yingyi et al. [2] employed feedforward, recurrent, and hybrid
architectures in Artificial Neural Networks (ANN) for the prediction of water quality. They
summarised five output strategies and found that the ANN models could handle various modelling
issues in rivers, lakes, reservoirs, and Waste Water Treatment Plants (WWTPs). Amir et al. [3] used
AI techniques, i.e., ANN, Group Method of Data Handling (GMDH) and Support Vector Machines (SVM),
for predicting the water quality of a river in Iran. They found that the lowest DDR index value is
obtained using the SVM model.
Ahmed et al. [4] proposed a supervised machine learning model that calculates water quality using
the Water Quality Index (WQI) and Water Quality Class (WQC). To forecast the WQI and WQC, they used
polynomial regression, gradient boosting, and a variety of parameters. Najah et al. [5] proposed
Linear Regression Models (LRM), Multilayer Perceptron Neural Networks (MLP) and Radial Basis
Function Neural Networks (RBF-NN) for comparison of water quality. According to the obtained
results, RBF-NN models are more accurate than LRM and MLP. Lu & Ma [6] used two novel hybrid
decision-tree-based machine learning models that took advantage of a data denoising technique,
Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN). According to the
outcome, CEEMDAN-RF provides the best predictions for temperature, dissolved oxygen, and specific
conductance, whereas CEEMDAN-XGBoost provides the best predictions for pH, turbidity, and
fluorescent dissolved organic matter. Chen et al. [7] applied 10 learning models (7 classical and 3
ensemble models) to large-scale Chinese river data. The outcome shows that learning models might
perform better in the prediction of water quality with larger data sets. Furthermore, Decision
Tree, Random Forest, and Deep Cascade Forest identified and verified two important water parameter
sets as having high specificity for the prediction of water quality.

This study explores the idea that not all of the water in wells, ponds, and lakes in rural places
is suitable for drinking. The authors therefore propose a model that uses different machine
learning models to check the pH and turbidity values. Extreme Gradient Boosting, an ensemble
method, is used to combine the two techniques. The authors have also carried out a real-time
implementation using the Internet of Things (IoT): the pH and turbidity values were measured
experimentally by sensors and validated using the proposed Machine Learning (ML) model.

The rest of this paper is organised as follows. Section 2 explains the proposed methodology for the
water quality monitoring system using IoT and ML, and Section 3 presents the results obtained with
IoT and ML, followed by concluding remarks.

II. METHODOLOGY

In this paper, the complete Water Quality Monitoring System for rural areas using ML and IoT is
illustrated. The proposed methodology is a combination of hardware using IoT and software using ML.
The flowchart of the entire proposed model is illustrated in Fig. 1.

Fig. 1: Proposed Methodology for Water Quality Monitoring System

For the real-time implementation, various samples from the wells and ponds of Richhana village,
Solan were collected. With the help of a microcontroller (ESP32) and sensors, pH and turbidity
values were measured and then validated using a model built on a Kaggle dataset [8]. ML models are
used for the design of the proposed models. The classification decides whether the water is fit for
human consumption, using the pH and turbidity parameters as attributes.

Most embedded systems are built using microcontrollers. ARM microcontrollers are currently used in
most embedded applications. The ARM Cortex M0, M3 and M4 are popular, and the Cortex M7 is gaining
popularity, so time spent understanding the ARM Cortex M series processors/microcontrollers is well
invested. The ESP32 is a system-on-chip (SoC) that combines Wi-Fi and Bluetooth capabilities with a
dual-core processor, memory, and peripherals. It is suitable for IoT and other embedded
applications due to its low power consumption, small size, and rich feature set.

A dataset with pH and turbidity measurements, along with the accompanying water quality labels, can
be used to train an Extreme Gradient Boost model that ensembles SVM and Random Forest (RF). Such
models can capture non-linear correlations between the input characteristics and the output labels
and are less prone to overfitting than other tree-based models. A comparable dataset can also be
used to train an SVM model, which uses pH and turbidity as input features and outputs water quality
labels. SVMs are especially helpful when there is a distinct boundary dividing different water
quality classes, since they seek out the best separating hyperplane in a high-dimensional space
between two classes [9, 10]. SVMs are typically more resistant to data noise than some other
algorithms. To increase overall accuracy,
the ensemble combines the overall loss of the various machine learning models.

The SVM maps the data into a high-dimensional space where the model draws a straight line, called a
hyperplane, to divide the data into several classes, resulting in support vectors that aid in the
prediction of the target labels [11]. The SVM classifier is expressed by Eq. (1) and the dual
formulation of SVM classification is expressed by Eq. (2).

$$\min_{f,\,\xi_i} \; \|f\|_K^2 + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i f(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \text{for all } i \tag{1}$$

$$\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \text{for all } i; \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \tag{2}$$

In Eq. (1) and Eq. (2), $\xi_i$ are slack variables that measure the error produced at point
$(x_i, y_i)$, $\alpha_i$ are the Lagrange multipliers, $C$ is the regularisation parameter and $K$
is the kernel function. In order to forecast the target labels, RF builds a decision tree on each
random subset of the training data and then aggregates the trees, by averaging their outputs or by
majority vote, to obtain the prediction outcome [12]. The RF classifier is expressed by Eq. (3).

$$r(X) = E_{\theta}\left[ r_n(X, \theta) \right] = E_{\theta}\left[ \frac{\sum_{i=1}^{n} Y_i \, \mathbf{1}_{[X_i \in A_n(X, \theta)]}}{\sum_{i=1}^{n} \mathbf{1}_{[X_i \in A_n(X, \theta)]}} \, \mathbf{1}_{E_n(X, \theta)} \right] \tag{3}$$

In Eq. (3), $r_n(X, \theta)$ is the randomized tree estimate, $A_n(X, \theta)$ is the rectangular
cell of the random partition containing $X$, and $E_n(X, \theta)$ is the event that this cell
contains at least one training point. For validation, kNN and logistic regression models are
employed.

In the kNN algorithm, k is the number of nearest neighbours consulted for each prediction. The
Elbow approach is used to estimate k's value. For example, if k is set to 7, the model finds the 7
training points closest to a query sample and uses their labels to predict the value of the target
label [13]. kNN is expressed by Eq. (4).

$$\min_{W} \; \left\| X^{T} W - Y \right\|_F^2 + \rho_1 R_1(W) + \rho_2 R_2(W) + \rho_3 R_3(W) \tag{4}$$

In Eq. (4), W is the matrix of reconstruction weights, $\rho$ is the normalization vector, X and Y
represent the sets of training and testing data points respectively, and R represents each
attribute of dimensionality d. The logistic regression model takes data factors as its foundation:
it uses one attribute to determine its dependence on another attribute and then computes their
values in a finite number of iterations to forecast the values of the target labels [14]. Logistic
regression is expressed by Eq. (5).

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{5}$$

In Eq. (5), x is the independent variable, $\pi(x)$ is the probability of the positive outcome for
attribute x, and $\beta_0$ and $\beta_1$ are the regression coefficients. In the Decision Tree, the
Gini index is determined. Once the Gini index has been evaluated, the decision tree uses it to
choose the root node, and then each subsequent split, as the entire tree is built. The created tree
provides the projected label values based on the output [15]. The Decision Tree criterion is
expressed by Eq. (6).

$$Gini(D) = 1 - \sum_{j=1}^{n} p_j^2 \tag{6}$$

In Eq. (6), $p_j$ is the relative frequency of class j in D. The XGBoost ensemble model combines
the loss functions of one or more models and then utilises the pooled training data to forecast the
values of the target labels. This technique is known as gradient boosting [16]. The XGBoost
classifier is given by Eq. (7).

$$L^{(t)} = \sum_{i=1}^{n} l\left( y_i, \; \hat{y}_i^{(t-1)} + f_t(x_i) \right) + \Omega(f_t) \tag{7}$$

In Eq. (7), $\hat{y}_i^{(t-1)}$ is the prediction for the $i$-th instance after iteration $t-1$,
$l$ is the loss function and $\Omega$ is the regularisation term; $f_t$ is the function added at
the $t$-th iteration to minimize the objective.

According to the graphs and evaluation tables of our study given ahead, it can be observed that
there are several commonly used evaluation methods for the potability check.

All the models are evaluated using different performance parameters. The type of data and the goals
of the model both influence the choice of assessment technique. Accuracy, Precision, Recall and F1
Score are evaluated in this paper. Accuracy is calculated by Eq. (8),

$$\alpha = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \tag{8}$$

where $T_P$ and $T_N$ represent the number of instances correctly predicted to be positive or
negative respectively, while $F_P$ and $F_N$ represent the number of instances with incorrectly
predicted labels. Accuracy measures the proportion of instances that were accurately predicted.
However, it is not the best option for imbalanced data sets because researchers are typically more
interested in the minority class than the majority class.

Precision ($\beta$) measures the proportion of instances predicted as positive that are actually
positive. Recall ($\gamma$) is the proportion of all relevant labels that the model successfully
identified. $\beta$ and $\gamma$ can be calculated by Eq. (9) and Eq. (10) respectively.

$$\beta = \frac{T_P}{T_P + F_P} \tag{9}$$

$$\gamma = \frac{T_P}{T_P + F_N} \tag{10}$$

F1 is evaluated using $\beta$ and $\gamma$ and is expressed by Eq. (11).

$$F1 = \frac{2 \times \beta \times \gamma}{\beta + \gamma} \tag{11}$$
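As a minimal sketch of Eqs. (8) to (11), assuming binary labels with 1 for fit and 0 for unfit, the four measures can be computed in Python as follows. The function name and the sample labels are illustrative assumptions and are not taken from this study.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = fit, 0 = unfit)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0      # Eq. (8)
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # Eq. (9)
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # Eq. (10)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # Eq. (11)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only (not data from this study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

The guards against zero denominators are a design choice of this sketch: they return 0.0 when a measure is undefined, for example when the model predicts no positive labels at all.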
For the validation of the proposed model, the authors have used logistic regression, kNN and
decision tree models. The evaluation parameters are tabulated in Table 2.
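The core computations behind these three validation models can be sketched in Python as follows: the logistic function of Eq. (5), the Gini index of Eq. (6), and a nearest-neighbour majority vote for kNN. This is a minimal illustration under stated assumptions, not the implementation used in this study; the function names, the coefficients, and the sample (pH, turbidity) points are hypothetical.

```python
import math
from collections import Counter


def logistic_probability(x, beta0, beta1):
    """Eq. (5): pi(x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))  # algebraically identical to Eq. (5)


def gini_index(labels):
    """Eq. (6): one minus the sum of squared relative class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points nearest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (pH, turbidity) samples with labels 1 = fit, 0 = unfit.
train_points = [(7.0, 2.0), (7.2, 3.0), (6.8, 2.5), (4.5, 9.0), (9.8, 8.0), (5.0, 10.0)]
train_labels = [1, 1, 1, 0, 0, 0]

print(logistic_probability(0.0, 0.0, 1.0))                    # probability at z = 0
print(gini_index(train_labels))                               # impurity of a balanced node
print(knn_predict(train_points, train_labels, (7.1, 2.8)))    # vote of the 3 nearest points
```

In practice the two features would usually be scaled to comparable ranges before computing distances, since kNN is sensitive to the units of each attribute.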
V. REFERENCES
[1] https://www.who.int/news-room/fact-sheets/detail/drinking-water#:~:text=Contaminated%20water%20and%20poor%20sanitation,individuals%20to%20preventable%20health%20risks
[2] Y. Chen, L. Song, L. Yeqi, L. Yang and D. Li, A review of the artificial neural network models for water quality prediction, Applied Sciences, in MDPI, vol. 10, pp. 57-76, 2020.
[3] H. Amir, H. Nasrolahi and A. Parsaie, Water quality prediction using machine learning methods, Water Quality Research Journal, in IWA Publishing, vol. 53, pp. 3-13, 2018.
[4] U. Ahmed, R. Mumtaz, H. Anwar, A. Shah, R. Irfan and Garc, Efficient water quality prediction using supervised machine learning, Water, in MDPI, vol. 11, pp. 10-22, 2019.
[5] A. Najah, A. El-Shafie, O. Karim, H. Amr and E. El-Shafie, Application of artificial neural networks for water quality prediction, Neural Computing and Applications, in Springer, vol. 22, pp. 187-201, 2013.
[6] H. Lu and X. Ma, Hybrid decision tree-based machine learning models for short-term water quality prediction, Chemosphere, in Elsevier, vol. 171, pp. 126-169, 2020.
[7] K. Chen, H. Chen, C. Zhou, Y. Huang, X. Qi, R. Shen, F. Liu, M. Zuo, X. Zou and J. Wing, Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data, Water Research, in Elsevier, vol. 171, pp. 114-454, 2020.
[8] https://www.kaggle.com/datasets/adityakadiwal/water-potability
[9] S. Jain, Computer Aided Detection system for the Classification of Non Small Cell Lung Lesions using SVM, Current Computer-Aided Drug Design, 16(6), 2021, pp. 833-840.
[10] S. Jain and M. Sood, SVM Classification of Cell Survival/Apoptotic Death for Color Texture Images of Survival Receptor Proteins, International Journal on Emerging Technologies, 10(2), pp. 23-28, 2019.
[11] V. Akkula, Tutorial on support vector machine (SVM), School of EECS, Washington State University, vol. 37, pp. 3, 2006.
[12] G. Biau, Analysis of a random forests model, The Journal of Machine Learning Research, in JMLR.org, vol. 13, pp. 1063-1095, 2012.
[13] S. Zhang, X. Li, M. Zong, X. Zhu and D. Cheng, Learning k for KNN classification, ACM Transactions on Intelligent Systems and Technology (TIST), ACM New York, NY, USA, vol. 8, pp. 1-19, 2017.
[14] S. Lemeshow and D. Hosmer, Logistic regression analysis: applications to ophthalmic research, American Journal of Ophthalmology, in Elsevier, vol. 147, pp. 766-767, 2009.
[15] S. Kotsiantis, Decision trees: a recent overview, Artificial Intelligence Review, in Springer, vol. 39, pp. 261-283, 2013.
[16] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, H. Cho, K. Chen, R. Mitchell, I. Cano, T. Zhou and others, Xgboost: extreme gradient boosting, in Microsoft, vol. 1, pp. 1-4, 2015.
[17] T. Aldhyani, M. Al-Yaari, H. Alkahtani, M. Mashael and others, Water quality prediction using artificial intelligence algorithms, in Hindawi, vol. 2020, 2020.
[18] M. Azrour, J. Mabrouki, G. Fattah, A. Guezzaz and F. Aziz, Machine learning algorithms for efficient water quality prediction, in Springer, vol. 8, pp. 2793-2801, 2022.
[19] V. Sharma and T. Manocha, Comparative Analysis of Online Fashion Retailers Using Customer Sentiment Analysis on Twitter, 2023. Available at SSRN 4361107.