
Hindawi

Journal of Environmental and Public Health


Volume 2023, Article ID 4916267, 26 pages
https://doi.org/10.1155/2023/4916267

Research Article
Prediction of Air Quality Index Using Machine Learning
Techniques: A Comparative Analysis

N. Srinivasa Gupta,1 Yashvi Mohta,2 Khyati Heda,2 Raahil Armaan,2 B. Valarmathi,2 and G. Arulkumaran3

1School of Mechanical Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
2School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
3Department of Electrical and Computer Engineering, Bule Hora University, Bule Hora, Ethiopia

Correspondence should be addressed to G. Arulkumaran; [email protected]

Received 7 July 2022; Revised 22 September 2022; Accepted 18 October 2022; Published 30 January 2023

Academic Editor: Rahil Changotra

Copyright © 2023 N. Srinivasa Gupta et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An index for reporting air quality is called the air quality index (AQI). It measures the impact of air pollution on a person's health over a short period of time. The purpose of the AQI is to educate the public on the negative health effects of local air pollution. The amount of air pollution in Indian cities has significantly increased. There are several ways to create a mathematical formula to determine the air quality index. Numerous studies have found a link between air pollution exposure and adverse health impacts in the population. Data mining techniques are one of the most interesting approaches to forecast the AQI and analyze it. The aim of this paper is to find the most effective way to predict the AQI and thereby assist in climate control. The most effective method can then be improved upon to find the most optimal solution. Hence, the work in this paper involves intensive research and the addition of novel techniques such as SMOTE to make sure that the best possible solution to the air quality problem is obtained. Another important goal is to demonstrate and display the exact metrics involved in our work in a way that is educational and insightful, provides proper comparisons, and assists future researchers. In the proposed work, three distinct methods have been utilized to determine the AQI of New Delhi, Bangalore, Kolkata, and Hyderabad: support vector regression (SVR), random forest regression (RFR), and CatBoost regression (CR). After comparing the results on the imbalanced datasets, it was found that random forest regression provides the lowest root mean square error (RMSE) values in Bangalore (0.5674), Kolkata (0.1403), and Hyderabad (0.3826), as well as higher accuracy compared to SVR and CatBoost regression for Kolkata (90.9700%) and Hyderabad (78.3672%), while CatBoost regression provides the lowest RMSE value in New Delhi (0.2792) and the highest accuracy for New Delhi (79.8622%) and Bangalore (68.6860%). Regarding the dataset that was subjected to the synthetic minority oversampling technique (SMOTE) algorithm, it is noted that random forest regression provides the lowest RMSE values in Kolkata (0.0988) and Hyderabad (0.0628) and higher accuracies for Kolkata (93.7438%) and Hyderabad (97.6080%) in comparison to SVR and CatBoost regression, whereas CatBoost regression provides the highest accuracies for New Delhi (85.0847%) and Bangalore (90.3071%). This demonstrates definitively that the datasets to which the SMOTE algorithm was applied produced higher accuracy. The novelty of this paper lies in the fact that the best regression models have been picked through thorough research by analyzing their accuracies. Moreover, unlike most related papers, dataset balancing is carried out through SMOTE. Furthermore, all of the implementations have been documented via graphs and metrics, which clearly show the contrast in results and help show what actually caused the improvement in accuracy.

1. Introduction

Humans can only survive because of air. Its quality must be monitored and understood for our wellbeing. Due to air pollution, millions of people around the world suffer from physiological disorders and respiratory death. According to scientific evidence, air pollution poses the single greatest environmental risk. Due to the toxic gas emissions caused by rapid industrialization, pollution levels have dramatically increased. Our health is suffering greatly as a result of the air being contaminated by hazardous substances. Due to this unchecked pollution, air quality has significantly declined. AQI is a numerical index used to measure and convey air pollution levels. The parameters (air pollutants) used to calculate the AQI include NO2 (nitrogen dioxide), SO2 (sulfur dioxide), CO (carbon monoxide), O3 (ozone), PM10 (particulate matter having a diameter of 10 microns or less), PM2.5 (particulate matter having a diameter of 2.5 microns or less), NH3 (ammonia), and benzene. In other applications, the six pollutants PM10, PM2.5, SO2, NO2, CO, and O3 are used to calculate the air quality index (AQI). However, the precise selection of contaminants relies on the particular aim and numerous variables, including data accessibility, measurement techniques, and monitoring frequency. A high AQI number indicates severely contaminated air, which can have a serious negative impact on health. Real-time air quality can be monitored using the AQI. Numerous weather stations have also captured daily and hourly AQI data in our own backyard. These data will be mined and harvested with the intention of using them in the suggested work.

As a result, the dataset used contains records of the AQIs in various Indian cities. Three distinct regression analysis techniques will be put into practice, and the best accuracy will be determined through comparison.

The proposed work compares a dataset's effectiveness before and after using the SMOTE algorithm. The major novelty is the usage of SMOTE. Unlike other papers, the impact of an imbalanced dataset has been studied, and hence, SMOTE has been applied to balance it. Furthermore, the whole process has been documented with graphs and metrics which showcase each algorithm and every performance metric under every dataset, in both its balanced and imbalanced forms. The effectiveness of the suggested methods will aid in predicting future AQI levels, which can serve as a warning and emphasize the need for reducing air pollution levels.

2. Literature Survey

They initially looked at the relationship between several air indicators, such as the AQI, PM2.5 concentrations, total NOx (nitrogen oxides) concentrations, and so on, in this study [1]. Second, they built prediction models using random forest regression (RFR) and support vector regression (SVR), and finally, they assessed the regression models' performance using RMSE, the coefficient of determination (R-SQUARE), and the correlation coefficient r. A widely used machine learning method (SVR) is used to quantify pollutant and particle levels and predict the air quality index [2]. According to the findings, hourly concentrations of pollutants such as carbon monoxide, sulfur dioxide, nitrogen dioxide, ground-level ozone, and particulate matter 2.5, as well as the hourly AQI for the state of California, may be consistently predicted using SVR with the RBF kernel. The classification of unseen validation data into six AQI categories provided by the United States Environmental Protection Agency (dataset) was completed with 94.1 percent accuracy.

The AQI was also predicted using ML techniques such as time series analysis and LR. To predict the AQI, MLR and a supervised machine learning technique were used, and various quantitative indices were used to assess the performance. Second, to forecast the AQI in the future, the ARIMA time series model was used. Both models were found to be highly accurate and efficient in forecasting the AQI [3]. An integrated model used artificial neural networks and the Kriging method to estimate the quantity of air pollutants at several places in Mumbai and Navi Mumbai. The high R values meant that the necessary level of fit between anticipated and observed values had been achieved. In terms of R value and forecast, ANN outperformed simple regression models [4]. Another study predicted the AQI based on pollutant concentrations such as PM2.5, PM10, SO2, and NO2. In conclusion, of the algorithms linear regression, decision tree regression, SVR, and RFR, the random forest regression algorithm yielded the best accuracy of 0.99985 on the test data with the least mean square error of 0.00013 and a mean absolute error of 0.00373 [5].

To forecast the AQI using the previous year's data and project over a specified future year, gradient descent boosting was applied to the multivariable regression problem. They outperformed ordinary regression models by improving the model's efficiency by employing cost estimates for the forecasting problem. They also utilized the AHP MCDM technique to assess the order of preference based on how closely the alternatives resembled the ideal solution [6]. Logistic regression [7] was used to determine if the presented data sample of daily weather/environmental conditions in a specific city was polluted or not. Based on previous PM2.5 readings, this system attempted to predict PM2.5 levels and detect air quality. The results demonstrated that logistic regression and autoregression could be used effectively to detect air quality and predict PM2.5 levels in the future. Using 6 years of meteorological and pollutant data, this research [8] offered an ML approach for predicting PM2.5 concentrations from wind (speed and direction) and precipitation levels. The findings of the classification model showed good reliability in classifying low (<10 µg/m3) against high (>25 µg/m3) PM2.5 concentrations, as well as low (<10 µg/m3) versus moderate (10–25 µg/m3) PM2.5 concentrations. An integrated model used the ANN and the Kriging method to predict the level of air pollutants in Mumbai and Navi Mumbai based on historical data from the meteorological department and the Pollution Control Board [9]. The proposed model was then implemented and tested using the MATLAB application for ANN and the R application for the Kriging method. The system helped with analyzing the extensive pollution data and projecting future pollution. The identification of future data points to forecast air pollution was also done using time series analysis. An effective strategy to predict Delhi's AQI using a deep RNN based on LSTM to predict hourly pollutant concentrations was explored. Even in hourly predictions, results were accurate. According to the findings [10], deep learning-based strategies performed better than traditional statistical methods [11].

To predict daily AQI, prediction models included those that used ARIMA as a time series model, PCR as a hybrid regression model, ARIMA and PCR as the first ensemble model, and ARIMA and gene expression programming (GEP) as the second ensemble model. By utilizing the correlation between urban nature (such as street greenness and street buildings), urban traffic (such as vehicle volume), and air pollution, a set of periodic-frequent patterns and a PM2.5 estimating model were created. They established a link between urban nature, traveling automobiles, and air pollution, and using this information, people can work toward developing an outstanding strategy to address all of them [12]. Linear regression was used as a machine learning algorithm to predict air quality for the next day using sensor data from three specific locations in the capital city of India, Delhi, and the National Capital Region (NCR). The model's performance was assessed using four performance measures: MAE, MSE, RMSE, and MAPE. This paper looked at AQI prediction using data generated by IoT arrangements [13]. The ANN algorithm predicted hourly criteria pollutant concentration levels, AQI, and AQHI for Ahvaz, Iran, over a span of 12 months (Aug 2009–Aug 2010). This study demonstrated that the ANN can be used to forecast air quality in cities such as Ahvaz in order to prevent health effects. They came to the conclusion that urban air quality authorities might evaluate the spatial-temporal profile of pollutants and air quality metrics using an artificial neural network [14].

Using air quality and meteorological records, tree-based ensemble learning models were developed to study the urban air quality of the city of Lucknow in India over a five-year period. PCA was used to identify the sources of air pollution. Due to the incorporation of boosting and bagging techniques, the DTF and DTB models performed better in classification and regression than the SVM. The suggested ensemble models for managing urban ambient air quality were successful in predicting it [15]. They focused on air quality index measures and predictions based on past data for the Central Jakarta area. PM2.5, one of the most often utilized components in AQI assessment, was used in this investigation. Based on testing data, Brown's weighted exponential moving average accurately predicted future Central Jakarta AQI levels. In terms of precision, it outperformed the WMA, EMA, and BDES approaches [16].

The dataset was collected to predict the AQI [17] in Chennai, Tamil Nadu. After that, it underwent preprocessing to eliminate redundant data and replace missing values. A deep learning model based on SVR and LSTM was used to classify the AQI values. This proposed deep learning method improved prediction accuracy, which would warn the public to reduce air pollution to a justifiable level. They used five regression models for AQI prediction [18]: principal component, partial least squares, principal component with leave-one-out CV, and multiple regression, applied to AQI data from numerous Indian cities. They created three classification models to predict the AQI bucket: multinomial logistic regression, KNN, and a KNN model with repeated-CV classification. In terms of accuracy and AUC, the KNN model with repeated CV and tune length 10 performed the best. Health problems are predicted by the decision tree and Naive Bayes algorithms. Good, moderate, unhealthy (unhealthy for sensitive groups), and very unhealthy were the AQI categories. Compared to the Naive Bayes method's accuracy of 86.663 percent, the decision tree algorithm achieved 91.9978 percent [19].

A nonmonitoring region's AQI was anticipated. With results that were 92 percent acceptable for one-hour prediction, the temporal dimension model was initially presented based on the improved KNN algorithm to forecast AQI values across monitoring stations. The algorithm was utilized in conjunction with a backpropagation neural network (BPN), where it additionally considered geographic distance, to predict the outcome of air quality in the spatial dimension [20]. They used ML models to forecast Dhaka's air quality levels, including deep learning methods such as LSTM and various other techniques. The novel aspect of this approach was that they used a unique parameter (i.e., daily temperature) for predicting air pollution [21]. An ML-based technique was used for correctly predicting the AQI based on data acquired from weather stations and environment monitoring. The prediction method uses a neural network system improved using a new nonlinear autoregressive neural network (ARNN) with an exogenous input model, which is specifically created for time-series prediction. The framework was used in a study involving various weather monitoring sites in the London area [22].

To predict the air quality index of significant pollutants such as PM2.5, PM10, CO, NO2, SO2, and O3, they employed a variety of classification and regression approaches, including linear regression, SGD regression, and random forest regression. Evaluations were carried out using MSE, MAE, and R-SQUARE, which showed that ANN and SVM worked best for AQI prediction in New Delhi [23]. They read [24] several papers and gained an understanding of how the ANN could be used to predict the AQI. They used Jaccard similarity and deep learning methods in their proposal. The datasets were collected from UC Irvine. They came to the conclusion that deep learning approaches improve prediction accuracy.

To predict AQI data for smart cities, algorithms such as supervised learning, SVM, and neural networks were utilized in this paper. Databases were procured from the CPCB of the Ministry of Environment, Forests, and Climate Change of the GoI. The model performed well in terms of predicting the air quality of Delhi [25]. The K-means method [26] was proposed to analyze air pollution. Using real-time records for pollutants, the correlation coefficient was calculated. The possibilistic fuzzy c-means (PFCM) algorithm was contrasted with the K-means algorithm. The findings demonstrated that the enhanced k-means clustering technique delivered AQI values with higher accuracy and lower execution time. Supervised learning was used to create prediction models. Experiments have shown that decision trees (classification), SVR, and stacking ensembles work much better than the other methods in their category. Mathematical models, learning, and regression techniques were recommended for developed areas and cities [27].

Models for anticipating average air quality levels were also developed using computational intelligence techniques. The models were developed using data from three monitoring stations in the Czech Republic (Dukla, Rosice, and Brnenska) in order to predict the average air quality and to forecast air quality records for each air pollutant separately. For evaluation, they utilized RMSE [28]. For AQI predictions, they used [29] IoT-based device data. They performed pollution predictions involving four advanced regression methods in this paper and introduced a comparative study to decide the best model for precisely anticipating air quality with respect to data size and processing time. For the comparison of these regression models, the mean MAE and RMSE were used as evaluation measures. High-frequency detail sequences WD(D) and low-frequency approximate sequences WD(A) are produced using wavelet decomposition, and a long short-term memory neural network and an autoregressive moving average model are used for the WD(D) and WD(A) sequences, respectively, for the forecast. As performance measurements, they utilized RMSE, MAE, and R-SQUARE.

In this study [30], a unique machine-learning technique was developed to predict the condensate viscosity in the areas near the wellbore using 5 input variables: pressure, temperature, initial gas to condensate ratio, gas-specific gravity, and condensate gravity. The novel multiple extreme learning machine (MELM), least squares support vector machine (LSSVM), and multilayer perceptron, each of which has been hybridized with a particle swarm optimizer (PSO) and a genetic algorithm (GA), were among the nine machine learning and hybrid machine learning algorithms that were evaluated. In this study [31], a unique machine-learning technique was created based on feature selection to anticipate FVDC from a 12-input-variable well log. The fracture density was previously predicted using a hybrid method that incorporates two networks of multiple extreme learning machines (MELMs), multilayer perceptrons (MLPs), genetic algorithms (GAs), and particle swarm optimizers (PSOs). They used an innovative MELM-PSO/GA mixture that has never been used before. The best-performing models were the MLP-PSO predictions, as the performance accuracy investigation found.

In this work [32], they created a novel deep machine learning model called a convolutional neural network (CNN) to predict the oil flow rate (Qo) through an orifice plate using seven input variables, including fluid temperature, upstream pressure, root differential pressure, the ratio of base sediment to water, oil specific gravity, kinematic viscosity, and beta ratio. Because there were no consistent and accurate methods to determine Qo, deep learning may be a useful replacement for traditional machine learning techniques. The study's findings demonstrated that the CNN model had the highest Qo prediction accuracy of any of the four developed models when used on the dataset of 3303 data records collected from oil fields throughout Iran.

565 data points from different parts of the world were used in this investigation. In this study [33], the multilayer perceptron method (MLP), an artificial intelligence network, and innovative combination approaches for oil formation volume factor (OFVF), namely the artificial bee colony (ABC) and firefly (FF) optimization methods, had been used. In terms of RMSE and R-SQUARE, the prediction accuracy of the MLP-ABC models was evaluated for this test dataset.

In this study [34], unique methods for pore pressure prediction were created based on the most significant collection of input features. Accuracy, R-SQUARE, and RMSE were utilized as performance metrics in this work.

For pore pressure (PP) prediction utilizing well log data, this paper [35] combined empirical equations with machine learning methods such as the random forest algorithm, support vector regression algorithm, artificial neural network algorithm, and decision tree algorithm. For this, 2827 data records from three oil field wells (Wells A, B, and C) in the Middle East were employed. The results showed that the DT method outperformed the other three predictive models in terms of performance prediction accuracy.

In this work [36], predicting dispersed fracture densities in reservoir rocks may be possible using hybrid machine-learning-optimizer models applied to a collection of petrophysical logs confirmed using image log data. The diverse characteristics of fractures were addressed by various well logs in various and sophisticated ways.

Three Marun oil field wells (MN#163, MN#225, and MN#179) provided access to the Asmari reservoir section on Iranian soil, and well-log data records were collected for these wells in order to anticipate shear wave velocity (VS) [37]. Two hybrid machine learning prediction models (MELM-PSO and MELM-GA), one deep learning model (CNN), and regularly used empirical methodologies to anticipate VS were evaluated using the compiled dataset. Deep learning successfully predicts VS for the supervised validation subset.

During the overbalance drilling technique, the safe mud weight window (SMWW) was determined in this paper [38] by projecting the permitted upper and lower limits of the bottom hole pressure window. The novel machine learning approach MELM-PSO was developed to anticipate SMWW using ten well-log input variables and feature selection. RMSE, R-SQUARE, and other performance indicators were applied in this study.

In this study [39], a trustworthy machine-learning forecasting model was used to predict the permeability (K) for heterogeneous carbonate gas condensate reservoirs. They used machine learning models to predict permeability: decision trees (DTs), support vector machines (SVMs), and group method of data handling (GMDH). In addition, the GMDH model outperformed the other models.

In this study [40], the rheological performance of three low-solid drilling fluids (based on bentonite, natural polymers, and nanoclay) was developed using a hybrid nanocomposite as an additive. As the polymer/nanoclay-hybrid-nanoparticle concentration increases, the fluids' filtration abilities get better. The rheological behavior of low-solids polymer-based drilling fluid was most positively impacted by the additive made of the clay-based nanocomposite. The ideal nanoclay content in the hybrid-polymer nanocomposite was thought to be around 5 wt%, according to the analysis of the rheological characteristics and filtration loss of the drilling fluids.

In this article [41], the effectiveness of each drilling fluid type was evaluated in terms of its ability to reduce fluid loss and mud cake thickness, hence avoiding differential pipe sticking. In that instance, drilling fluid filtering qualities were evaluated as a potential predictor of well diameter reduction caused by mud cake close to permeable formations, and mud cake thickness was modified. The novel results showed that the rheological and filtration properties of drilling fluids were significantly enhanced by nanoparticles.

In this study [42], they created reliable models to forecast the liquid critical-flow rates for operating oil wells. Performance metrics were applied, such as the coefficient of determination, root mean square error (RMSE), average relative error (ARE), and average absolute relative error (AARE).

In this work [43], they improved the forecast of the gas flow rate through wellhead chokes for a gas-condensate field by using the Firefly algorithm.

In this study [44], they developed a cutting-edge hybrid machine learning method that successfully predicts the gas flow rate through wellhead chokes in gas condensate reservoirs.

This work is innovative since it thoroughly analyzed other papers that have made the same attempt. Working with both balanced and imbalanced versions of the dataset, the regression models that had the highest accuracy were selected and subsequently used. The use of SMOTE is another noteworthy innovation. Unlike other articles, this one has explored the effects of an imbalanced dataset and used SMOTE to balance it. Additionally, graphs and metrics that demonstrate each method, each performance parameter, and each dataset in both balanced and imbalanced forms have been used to document the entire process.

It will be possible to anticipate future AQI levels with the help of the offered techniques, which can serve as a warning and highlight the necessity of lowering air pollution.

The gaps identified from the literature survey are given below:

(i) In India, AQI measurement stations were set up in 2014. The National Air Monitoring Program has been used to measure AQI data in 240 cities across India. No proper system is in place which regularly provides predicted data for the future.
(ii) All the papers usually focus on one city or area, giving a biased outlook.
(iii) The performance of the existing systems should be increased.

These gaps are addressed in the proposed method. The proposed method uses different regression models along with the SMOTE algorithm for multiple cities in order to increase the accuracy of the various models. Moreover, in the papers studied, the following outcomes (i.e., accuracy) were found for the existing algorithms such as Naive Bayes, support vector machine, artificial neural network, gradient boost, decision tree, and k-nearest neighbor. Table 1 shows the various ML techniques/algorithms used in the existing systems and also states the accuracy achieved by each ML technique, such as Naïve Bayes (NB), support vector machine (SVM), artificial neural network (ANN), gradient boost (GB), decision tree (DT), and enhanced k-means.

3. Dataset Description and Sample Data

The link to the dataset used for this work is given below: https://www.kaggle.com/rohanrao/air-quality-data-in-india.

The dataset includes hourly and daily air quality and AQI (air quality index) data from numerous stations in several Indian cities. The data are for the years 2015 through 2020. The original dataset included 29532 rows and 16 columns, which covered all of the cities listed below: Ahmedabad, Aizawl, Amaravati, Amritsar, Bangalore, Bhopal, Brajrajnagar, Chandigarh, Chennai, Coimbatore, Delhi, Ernakulam, Gurugram, Guwahati, Hyderabad, Jaipur, Jorapokhar, Kochi, Kolkata, Lucknow, Mumbai, Patna, Shillong, Talcher, Thiruvananthapuram, and Visakhapatnam.

The attribute information is given below.

3.1. Date (YYYY-MM-DD), City, PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, Benzene, Toluene, AQI, and AQI_Bucket. AQI_Bucket has six values: good, satisfactory, moderate, poor, very poor, and severe. The dataset was cleaned, and the four city datasets of New Delhi, Bangalore, Kolkata, and Hyderabad were selected from the original dataset. The attribute Xylene was removed from the dataset using Microsoft Excel software because the column values were empty for all four chosen cities. The dataset includes hourly and daily air quality and AQI (air quality index) data from numerous stations in 26 Indian cities. From the original dataset, the data of four cities, New Delhi, Bangalore, Kolkata, and Hyderabad, were extracted. Because these are major cities of India, it is important to analyze the pollution levels in different urban cities of India, as they are major contributors to the pollution. These particular cities have a higher population density and give a good estimate of the pollution.

After cleaning the dataset and dividing it into four parts, one per city, the New Delhi dataset had 176 rows and 15 columns, the Bangalore dataset had 1362 rows and 15 columns, the Kolkata dataset had 747 rows and 15 columns, and the Hyderabad dataset had 1615 rows and 15 columns, respectively. The sample datasets for New Delhi, Bangalore, Kolkata, and Hyderabad are shown in Tables 2–5, respectively.

The initial dataset has an imbalanced composition. Using the synthetic minority oversampling technique (SMOTE) algorithm, the imbalanced dataset is transformed into a balanced dataset. Oversampling is employed in this algorithm. Any classes with inadequate rows are supplemented with additional rows to ensure that each class label has an equal, or nearly equal, number of rows in the dataset. Asymmetry exists in an imbalanced dataset. An imbalanced dataset produces a skewed class distribution, which affects the model's accuracy in several ways.
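As an illustration of this preparation step, the following is a minimal Python sketch (not the authors' code) that filters the four cities, drops the empty Xylene column, and removes null rows. The file name city_day.csv, the column spellings, and the cleaning choices are assumptions based on the linked Kaggle dataset.

```python
import pandas as pd

# Load the daily air-quality file from the Kaggle dataset (assumed file name).
df = pd.read_csv("city_day.csv")

# Keep only the four cities analyzed in this work
# (spellings must match the "City" column of the CSV).
cities = ["Delhi", "Bangalore", "Kolkata", "Hyderabad"]
df = df[df["City"].isin(cities)]

# Drop the Xylene attribute (empty for all four cities) and rows with nulls.
df = df.drop(columns=["Xylene"]).dropna()

# Split into one cleaned frame per city, mirroring Tables 2-5.
city_frames = {city: g.reset_index(drop=True) for city, g in df.groupby("City")}
for city, g in city_frames.items():
    print(city, g.shape)
```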

Table 1: Some of the existing algorithms' accuracy in percentage from the literature survey.

Naïve Bayes (NB): 86.663% accuracy.
Support vector machine (SVM): 92.40% accuracy.
Artificial neural network (ANN): 84–93% accuracy (after simulating a large number of models, ANN falls within this range).
Gradient boost (GB): 96% accuracy.
Decision tree (DT): 91.9978% accuracy (predicting PM2.5 with a near 89% accuracy rate).
Enhanced k-means: 71.28% accuracy (the k-means clustering method is 40% more efficient than the PFCM algorithm based on the speed of execution and accuracy).
Support vector regression (SVR): 99.4% accuracy.
Random forest regression (RFR): 99.985% accuracy (least MSE of 0.00013 and MAE of 0.00373).
CatBoost regression (CR): 99.88% accuracy (predicting PM2.5 readings with an inaccuracy of just 0.0006).

Table 2: Sample dataset for New Delhi city.


City Date PM2.5 PM10 NO NO2 NOx NH3 CO SO2 O3 Benzene Toluene AQI AQI_bucket
Delhi 02/01/2015 186.18 269.55 62.09 32.87 88.14 31.83 9.54 6.65 29.97 10.55 20.09 454 Severe
Delhi 03/01/2015 87.18 131.9 25.73 30.31 47.95 69.55 10.61 2.65 19.71 3.91 10.23 143 Moderate
Delhi 04/01/2015 151.84 241.84 25.01 36.91 48.62 130.36 11.54 4.63 25.36 4.26 9.71 319 Very poor
Delhi 05/01/2015 146.6 219.13 14.01 34.92 38.25 122.88 9.2 3.33 23.2 2.8 6.21 325 Very poor
Delhi 06/01/2015 149.58 252.1 17.21 37.84 42.46 134.97 9.44 3.66 26.83 3.63 7.35 318 Very poor
Delhi 07/01/2015 217.87 376.51 26.99 40.15 52.41 134.82 9.78 5.82 28.96 4.93 9.42 353 Very poor
Delhi 08/01/2015 229.9 360.95 23.34 43.16 51.21 138.13 11.01 3.31 30.51 5.8 11.4 383 Very poor
Delhi 09/01/2015 201.66 397.43 19.18 38.56 45.6 140.6 11.09 3.48 32.94 5.25 11.12 375 Very poor
Delhi 10/01/2015 221.02 361.74 24.79 46.39 55.19 134.06 9.7 5.91 34.12 4.87 9.44 376 Very poor
Delhi 11/01/2015 205.41 393.2 28.46 47.29 57.88 131.1 10.98 5.54 50.37 5.93 10.59 379 Very poor

As a result, it is necessary to balance the data. It is possible to improve the accuracy of the results by oversampling the positive class label. SMOTE is used in this paper to conduct oversampling. The SMOTE technique, which builds its model on nearest neighbors, increases the frequency of the minority class or minority class group in the given dataset. The given dataset has 6 positive classes and 12 negative classes, and they are shown in Figure 1. This dataset is given as the input of the SMOTE algorithm. After that, it increases the number of occurrences of the minority class (positive) from six to twelve. It aids in dataset balancing, which improves algorithm performance and prevents overfitting problems. SMOTE typically involves finding a feature vector and its closest neighbor, taking the difference between the two, multiplying it by a random number between 0 and 1, finding a new point on the line segment by adding this scaled difference to the feature vector, and repeating the process for all located feature vectors. SMOTE has the advantage of producing synthetic data points, as opposed to copies, that differ slightly from the original data points.

Table 6 logs the count of the attribute (AQI_Bucket) labels with 6 distinct values: moderate, satisfactory, good, poor, very poor, and severe. After the multiple iterations used in the SMOTE algorithm, the values are much closer to each other. Delhi city did not have any "good" label values in the AQI_Bucket column in the dataset, and hence, it is marked as 0. Similarly, in Bangalore, there are no "severe" label values in the AQI_Bucket column, and it is marked as 0. The SMOTE algorithm is being utilized in this paper to improve the accuracy of each model being run on the dataset by balancing the datasets. An imbalanced dataset leads to a skewed class distribution that causes discrepancies in the accuracy of the models. More accurate models, higher balanced accuracy, and a higher balanced detection rate are produced by balanced datasets. Therefore, SMOTE is employed to accomplish this purpose and improve accuracy.

SMOTE has the benefit of not producing duplicate data points but rather artificial data points that are marginally different from the actual data points. By producing examples that are similar to the minority points already in existence, this algorithm aids in overcoming the overfitting issue caused by random oversampling. SMOTE also creates larger and less specific decision boundaries that increase the generalization capabilities of classifiers, thereby improving their performance.
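A minimal sketch of this oversampling step is given below, using the SMOTE implementation from the imbalanced-learn package (the paper does not name the library, so this is an assumption). In the actual work, the pollutant columns would be the feature vectors and AQI_Bucket the class label; here a small synthetic, imbalanced dataset stands in for a city's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Stand-in for one city's pollutant features and AQI_Bucket labels
# (three imbalanced classes instead of the six real buckets).
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=42)

print("Class counts before SMOTE:", pd.Series(y).value_counts().to_dict())

# SMOTE synthesizes new minority-class points along the line segments
# joining each minority sample to its nearest minority neighbours.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

print("Class counts after SMOTE: ", pd.Series(y_bal).value_counts().to_dict())
```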

Table 3: Sample dataset for Bangalore city.


City Date PM2.5 PM10 NO NO2 NOx NH3 CO SO2 O3 Benzene Toluene AQI AQI_bucket
Bangalore 14/11/2015 42.42 156.84 7.25 29.94 31.78 21.94 1.56 2.23 31.35 1.82 4.65 130 Moderate
Bangalore 19/11/2015 21.99 39.86 7.08 16.44 19.51 41.96 1.73 2.95 9.98 1.52 2.38 103 Moderate
Bangalore 20/11/2015 13.89 31.44 6.84 12.14 15.35 23.93 1.72 2.5 4.56 0.74 1.48 74 Satisfactory
Bangalore 23/11/2015 19.66 36.84 6.47 16.37 20.87 24.04 1.35 2.83 4.09 1.18 2.17 75 Satisfactory
Bangalore 24/11/2015 20.35 33.97 7.76 20.64 24.75 26.98 1.36 2.59 7.77 1.02 1.9 85 Satisfactory
Bangalore 25/11/2015 34.39 36.29 8.38 28.8 32.28 32.75 2.48 3.76 14.63 1.32 3.17 141 Moderate
Bangalore 26/11/2015 43.91 43.65 11.74 29.33 32.78 55.4 1.52 3.44 14.8 1.53 3.59 90 Satisfactory
Bangalore 27/11/2015 44.14 112.78 7.05 26.64 27.06 32.33 2.18 4.3 25.57 1.69 3.36 126 Moderate
Bangalore 28/11/2015 44.94 114.34 8.47 28.1 29.37 32.75 2.3 4.7 29.1 1.56 2.38 147 Moderate
Bangalore 29/11/2015 29.35 75.79 5.72 21.21 21.4 19.08 1.55 4.55 29.03 1.01 1.15 87 Satisfactory

Table 4: Sample dataset for Kolkata city.


City Date PM2.5 PM10 NO NO2 NOx NH3 CO SO2 O3 Benzene Toluene AQI AQI_bucket
Kolkata 16/06/2018 47.55 128.66 6.01 24.89 24.51 7.4 0.72 7.3 27.24 2.14 0.81 119 Moderate
Kolkata 18/06/2018 50.1 105.68 3.23 33.28 36.5 8.55 1.47 3.02 72.28 1.97 2.62 107 Moderate
Kolkata 19/06/2018 39.25 87.24 2.6 30.86 33.45 12.06 1.35 1.93 81.12 1.59 2.47 148 Moderate
Kolkata 20/06/2018 24.44 53.19 5.77 38.03 43.79 9.14 1.7 6.88 49.58 2.02 3.13 94 Satisfactory
Kolkata 21/06/2018 31.68 60.16 4.46 38.39 43.04 6.52 1.42 1.31 13.47 3.76 5.52 100 Satisfactory
Kolkata 22/06/2018 25.22 48.96 0.99 28.1 29.07 6.53 0.39 2.31 30.32 1.62 2.65 60 Satisfactory
Kolkata 23/06/2018 22.95 44.58 1.14 25.76 26.85 5.38 0.38 1.06 22.84 1.67 2.63 47 Good
Kolkata 24/06/2018 24.61 46.54 0.86 25.49 26.32 3.96 0.4 1.1 23.13 1.51 2.28 48 Good
Kolkata 25/06/2018 28.6 45.36 1.95 43.45 45.37 3.62 0.41 1.11 13.56 2.58 4.17 50 Good
Kolkata 26/06/2018 30.5 46.08 1.27 37.12 38.33 3.19 0.38 2.29 34.84 2.05 4.41 61 Satisfactory

Table 5: Sample dataset for Hyderabad city.


City Date PM2.5 PM10 NO NO2 NOx NH3 CO SO2 O3 Benzene Toluene AQI AQI_bucket
Hyderabad 08/09/2015 91.82 32.94 5.41 28.93 23.37 24.94 0.48 7.98 27.04 1.01 5.74 179 Moderate
Hyderabad 09/09/2015 35.56 40.81 4.02 31.15 24.31 24.81 0.57 4.93 22.48 1.41 7.61 162 Moderate
Hyderabad 10/09/2015 45.64 44.89 7.06 28.96 25.58 24.8 0.73 5.29 24.69 1.25 7.84 76 Satisfactory
Hyderabad 11/09/2015 60.88 51.27 5.15 30.64 24.22 25.86 0.53 5.16 24.11 1.09 5.42 140 Moderate
Hyderabad 12/09/2015 65.61 41.31 3.4 26.03 20.37 24.78 0.57 5.44 25.47 0.83 4.39 128 Moderate
Hyderabad 13/09/2015 60.02 36.67 2.35 19.82 14.51 21.68 0.49 4.02 37.7 0.79 4.07 164 Moderate
Hyderabad 14/09/2015 73.21 35.28 2.82 19.94 15.4 21.4 0.57 5.96 34.11 0.52 2.44 169 Moderate
Hyderabad 01/10/2015 120.75 92.29 1.92 21.65 15.87 27.65 0.64 2.67 15.85 1.21 5.95 340 Very poor
Hyderabad 02/10/2015 29.66 76 2 25.94 16.02 20.45 0.6 3.81 17.4 1.2 5.62 125 Moderate
Hyderabad 03/10/2015 36.56 63.06 3.06 20.11 15.07 18.05 0.64 7.58 19.16 1.2 6.4 75 Satisfactory

The comparison of the balanced and imbalanced datasets for the New Delhi, Bangalore, Kolkata, and Hyderabad cities is shown in Figures 2–5, respectively.

Figure 1: New minority class instances added (after SMOTE).

4. Methodology

In this paper, the proposed methods use three different algorithms to draw a comparative analysis of the AQI values of New Delhi, Bangalore, Kolkata, and Hyderabad by using parameters such as PM2.5, PM10, NO, NO2, NOx, NH3, CO, SO2, O3, benzene, and toluene levels; the three algorithms are then compared to find the most accurate and efficient one. The aim is to analyze and present the results in an efficient way that helps discover interesting and insightful information. These particular cities have a higher population density and give a good estimate of the pollution in a major South Asian city. More cities have not been added because that would make the research paper far too lengthy. Hence, the major cities of India have been chosen to analyze the pollution levels in different urban cities of India, as they are the major contributors to pollution.

Some of the existing algorithms used are Naive Bayes (a Bayes theorem-based classifier), support vector machine (a supervised learning model for classification and regression), artificial neural network (a learning methodology inspired by actual neurons of the brain), gradient boost (techniques utilizing an ensemble of weak prediction models), decision tree (which works by making predictive models using data), and k-nearest neighbor (a lazy learning nonparametric supervised method).

Figure 2: Balanced and imbalanced data values for New Delhi city. AQI_Bucket label counts (imbalanced dataset → balanced dataset): Severe 239 → 478, Moderate 485 → 485, Very Poor 514 → 514, Poor 534 → 534, Satisfactory 108 → 432.

Figure 3: Balanced and imbalanced data values for Bangalore city. AQI_Bucket label counts (imbalanced dataset → balanced dataset): Moderate 479 → 958, Satisfactory 810 → 810, Poor 12 → 768, Good 59 → 944, Very Poor 1 → 1.

The proposed algorithms used and compared are given below.

4.1. Synthetic Minority Oversampling Technique (SMOTE) Algorithm. Synthetic samples are created for the minority class using this oversampling technique. It aids in making an imbalanced dataset balanced. This approach helps to overcome the issue of overfitting brought about by random oversampling.

4.2. Support Vector Regression. It is a discrete value prediction technique that uses supervised learning. SVMs and support vector regression are used for comparable purposes. Finding the most appropriate line is the main tenet of SVR. In SVR, the hyperplane with the most points is the line that fits the data the best.

Figure 4: Balanced and imbalanced data values for Kolkata city. AQI_Bucket label counts (imbalanced dataset → balanced dataset): Moderate 151 → 302, Satisfactory 278 → 278, Good 119 → 238, Poor 119 → 238, Very Poor 66 → 264, Severe 13 → 208.

Figure 5: Balanced and imbalanced data values for Hyderabad city. AQI_Bucket label counts (imbalanced dataset → balanced dataset): Moderate 806 → 806, Satisfactory 645 → 645, Very Poor 3 → 768, Poor 30 → 960, Severe 4 → 1024, Good 126 → 1008.

4.3. Random Forest Regression (RFR) Algorithm. It is a frequently used supervised machine-learning technique for classification and regression problems. It creates decision trees based on a variety of samples, utilizing the average for regression and the majority vote for classification.

4.4. CatBoost Regression (CR) Algorithm. Yandex has developed it as an open-source software library. It offers a framework for gradient boosting which, unlike the standard technique, aims at resolving categorical features using an alternative based on permutation.

All three algorithms showed promising results in other works studied through the literature survey. These three algorithms were chosen due to their high accuracy in previous works (Table 1), and with the proposed work, the aim is to draw a comparative analysis and find the one with the best accuracy on balanced and imbalanced datasets. The aim is to use them and apply them to the Bangalore, Kolkata, Hyderabad, and New Delhi datasets and compare their accuracies to figure out what best fits our use case.
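To make Sections 4.2–4.4 concrete, one plausible way to instantiate the three regressors with scikit-learn and the catboost package is sketched below. The hyperparameters are illustrative defaults, not the settings used in this work, and the catboost package is assumed to be installed.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

models = {
    # Section 4.2: epsilon-SVR with an RBF kernel fits a hyperplane within
    # an epsilon-tolerance tube around the targets.
    "SVR": SVR(kernel="rbf", C=1.0, epsilon=0.1),
    # Section 4.3: an ensemble of decision trees whose outputs are averaged.
    "RFR": RandomForestRegressor(n_estimators=100, random_state=42),
    # Section 4.4: gradient boosting on symmetric (oblivious) decision trees.
    "CR": CatBoostRegressor(iterations=500, verbose=0, random_seed=42),
}
```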

Figure 6: Flowchart for the proposed methodology. (Start → Choosing Dataset → Data Preprocessing → with SMOTE / without SMOTE → Balanced Dataset / Imbalanced Dataset → Split into Train and Test → Feature Scaling → ML Algorithms (Random Forest Regression, Support Vector Regression, CatBoost Regression) → AQI Prediction → Calculation of Evaluation Metric for Each ML Technique → Tabulation and Comparison → Declare the ML Technique with the Most Accuracy → End.)

The picked algorithms have the highest accuracy for AQI prediction based on our extensive literature survey, as logged in Table 1. The algorithms being used for prediction are support vector regression (SVR), random forest regression (RFR), and CatBoost regression (CR). These algorithms will be provided with a suitably large dataset of cities, namely New Delhi, Bangalore, Kolkata, and Hyderabad, which will provide a practical environment.

The dataset used will be cleaned, reduced, and prepared according to our requirements, and the data will be split into training and testing sets. The plan is to use the simplest, most straightforward implementation so that the algorithms can be applied easily in a real-life use case.

Figure 7: Accuracy comparison of algorithms for four cities.

Figure 8: The comparison between R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the New Delhi city imbalanced dataset.

Then, different parameters will be taken to finalize and draw up a comparison between these three algorithms, and a conclusion will be reached showing which is the most accurate. The comparison can bring out important information about AQI prediction methods and even help us choose the most suitable one. A comparison of the accuracy levels obtained with an imbalanced dataset and with a balanced dataset (obtained with the help of the SMOTE algorithm) will also be done.

Hence, the methodology is a step-by-step process in which the first step is to find a suitable dataset and clean it. After this, further data preprocessing is applied, which makes use of SMOTE in order to balance the dataset. Both balanced and imbalanced datasets will be preserved and used in order to bring to light any differences in performance that may arise due to balancing. Following this, in a standard machine learning procedure, the dataset is split into train and test sets to train the models and test their accuracies against real data. Feature scaling and normalization are carried out.

Now, each regression model which has been picked, namely, random forest, support vector regression, and CatBoost, is used for prediction, and its accuracy is gauged for each balanced and imbalanced dataset as mentioned previously. They are compared using metrics such as RMSE and R-SQUARE. Finally, all the data and results have been displayed using clear figures, graphs, and charts which easily make one understand what exactly has led to the increase in accuracy and hence help future research.

Figure 9: The comparison between R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the Bangalore city imbalanced dataset.

Figure 6 shows the various steps which will be performed during the implementation of this work to achieve the determined result. The flowchart is a process-based flowchart that shows the steps of the process in a detailed manner. It has been derived from the actual work of running these models and extracting results. The process flowchart is drawn in Western ANSI standards.

Step 1. Choosing a dataset.
An extensive dataset was chosen from Kaggle according to our requirements, and its CSV file was downloaded.

Step 2. Data preprocessing.
In data preprocessing, the original dataset was cleaned and the New Delhi, Bangalore, Kolkata, and Hyderabad city data were extracted. Because these are major cities in India, it is important to analyze the pollution levels in different urban cities in India, as they are the major contributors to the pollution. These particular cities have a higher population density and give a good estimate of the pollution. Each of these datasets was cleaned by removing all null-value rows, and the attribute Xylene was removed from the dataset because the column values were empty for all four cities chosen, hence making it a redundant attribute. Microsoft Excel software is used to remove unnecessary, irrelevant, and erroneous data.

Step 3. Applying the SMOTE algorithm.
After the cleaning of the dataset, the synthetic minority oversampling technique (SMOTE) is used to correct the class imbalances in the AQI_Bucket values. Delhi, Bangalore, Kolkata, and Hyderabad required 3, 11, 9, and 24 manual iterations, respectively, to achieve a suitable level of balance. This is carried out to create a balanced version of the dataset.

Step 4. Not applying the SMOTE algorithm.
Here, the synthetic minority oversampling technique (SMOTE) is not applied to the dataset; it is used directly just after removing unnecessary, irrelevant, and erroneous data and hence remains in its imbalanced form.

Step 5. Splitting of the dataset.
The datasets are split into training and test data at an 80:20 ratio. These are used to train the model and then test it against the original data. The values predicted by the machine learning algorithms are corroborated with the original data to estimate accuracy.

Step 6. Training the dataset.
Empirical studies show that the best results are obtained if 80% of the data is used for training. Random sampling is used as a way to divide the data into train and test sections. It is widely accepted and very popular.

Step 7. Testing the dataset.
Empirical studies show that the best results are obtained if the remaining 20% of the data is used for testing. Random sampling is used as a way to divide the data into train and test sections. It is widely accepted and very popular.

Step 8. Feature scaling.
The data have been normalized in order to make the dataset flexible and consistent. StandardScaler from the Scikit-Learn library has been used to do so. It standardizes the features by removing the mean and scaling to unit variance.
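Steps 5–8 can be expressed with scikit-learn's train_test_split and StandardScaler, as in the minimal sketch below. The arrays here are placeholders standing in for one city's pollutant features and AQI values; the split ratio and scaler follow the steps above, while the random seed is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix and AQI target standing in for one city's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))                # 11 pollutant features
y = rng.normal(loc=150, scale=60, size=500)   # AQI values

# Steps 5-7: 80:20 random split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 8: StandardScaler removes the mean and scales to unit variance,
# fitted on the training data only and then applied to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```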

Figure 10: The comparison between R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the Kolkata city imbalanced dataset.

Step 9. Applying machine learning (ML) techniques.
After normalizing the range of features in the datasets, various algorithms, namely, CatBoost regression, random forest regression, and support vector regression, are used to forecast the air quality index, and then they are compared to show which algorithm gives the best accuracy level for each city, respectively.

Step 10. Applying ML technique: random forest regression.
Random forest is a supervised machine learning algorithm that is used for classification and regression problems. It creates decision trees from several samples, using the majority vote for classification and the average in the case of regression. A random forest produces precise predictions that are easy to understand. Effective handling of large datasets is possible.

Step 11. Applying ML technique: support vector regression.
Support vector regression is a supervised machine learning algorithm that is used for regression problems. Discrete values can be predicted using it. The core idea of SVR is locating the best-fit line. The SVR best-fitting line is the hyperplane with the most points. The flexibility of SVR allows us to decide how much error in the model is acceptable.

Step 12. Applying ML technique: CatBoost regression.
A supervised machine learning approach called CatBoost regression is based on gradient-boosted decision trees. During training, a number of decision trees are constructed progressively. To generate a powerful, competitive predictive model through greedy search, the main objective of boosting is to successively integrate a large number of weak models, or models that only marginally outperform chance. It has a quick inference process since it uses symmetric trees, and its boosting techniques aid in lowering overfitting and enhancing model quality.
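The following is a minimal, self-contained sketch of Steps 10–12 (and the prediction in Step 13): each of the three regressors is fitted and used to predict on a held-out test set. Synthetic regression data stand in for a city's scaled features and AQI values, and the catboost package is assumed to be installed; hyperparameters are illustrative, not the values used in this work.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

# Synthetic data standing in for one city's scaled features and AQI targets.
X, y = make_regression(n_samples=600, n_features=11, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "SVR": SVR(kernel="rbf"),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=42),
    "CR": CatBoostRegressor(iterations=300, verbose=0, random_seed=42),
}

# Fit each regressor and generate AQI predictions on the test set.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "first five predictions:", np.round(predictions[name][:5], 2))
```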
Step 13. AQI prediction.
Machine learning techniques are used to aid in this process, and the accuracy level of the AQI for each city is estimated. The values are tabulated, and graphs depicting the accuracy levels of all four cities are plotted.

Step 14. Calculation of the evaluation metrics for each ML technique.
The metrics used for the proposed work are R-SQUARE, MSE, RMSE, MAE, and the accuracy (1 - MAE) of CatBoost regression, random forest regression, and support vector regression.

Step 15. Tabulation and comparison.
All the metric values obtained after running the machine learning techniques (i.e., R-SQUARE, MSE, RMSE, MAE, and the accuracy of the algorithms) are taken. For comparison, the predicted and actual values for each city and model are tabulated, and multiple graphs such as line graphs, density plots, and scatter plots are analyzed. All metric and accuracy values of each city and model are further tabulated, bar graphs are plotted to compare the accuracy of each model city-wise, and bar graphs are also plotted to compare the R-SQUARE, MSE, RMSE, and MAE values of each model city-wise. The accuracy is likewise calculated using the various city datasets with SMOTE applied to them, repeating the same steps from Step 10 to Step 15 on the dataset after the SMOTE algorithm has been applied.

Step 16. Final comparative results (declare the ML technique with the highest accuracy).
Once all the values are tabulated, the next step is to compare the metric values of all the used algorithms and see what best fits the scenario.

Figure 11: The comparison between R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the Hyderabad imbalanced dataset.

Figure 12: Comparison between the accuracy of SVR on the balanced and imbalanced dataset (with and without using the SMOTE algorithm).

In the proposed work, random forest and CatBoost regression gave the best performance overall. RFR got the best RMSE values in Bangalore, Kolkata, and Hyderabad, whereas CatBoost regression performed best in Delhi. The highest accuracy was obtained by random forest regression for the cities of Kolkata and Hyderabad, while for New Delhi and Bangalore, CatBoost regression gave the highest accuracy. The tabulated metric values before and after applying SMOTE to the dataset are compared to find what gives better accuracy.

5. Discussion on Metrics Used

The metrics used in the proposed work are R-SQUARE, mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and accuracy.
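These metrics can be computed with scikit-learn, as in the illustrative sketch below; the accuracy definition (1 - MAE) follows Step 14 and is meaningful only when the targets are scaled, as they are in this work. The arrays here are placeholder values, not results from the paper.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Illustrative (scaled) actual and predicted AQI values for one city/model.
y_true = np.array([0.42, 1.10, -0.35, 0.88, 1.95, -0.60])
y_pred = np.array([0.50, 1.02, -0.20, 0.75, 1.80, -0.55])

r2 = r2_score(y_true, y_pred)              # R-SQUARE
mse = mean_squared_error(y_true, y_pred)   # MSE
rmse = np.sqrt(mse)                        # RMSE
mae = mean_absolute_error(y_true, y_pred)  # MAE
accuracy = 1 - mae                         # accuracy as defined in Step 14

print(f"R2={r2:.4f} MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f} Acc={accuracy:.4f}")
```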

Figure 13: Comparison between the accuracy of RFR on a dataset with and without the SMOTE algorithm.

Figure 14: Comparison between the accuracy of CR on a dataset with and without the SMOTE algorithm.

\[ \text{R-SQUARE} = \frac{SS_{regr}}{SS_{tt}}. \quad (1) \]

The sum of squares due to regression (explained sum of squares) is denoted by SSregr, while the total sum of squares is denoted by SStt. The sum of squares due to regression shows the degree to which the regression model fits the data well. The total sum of squares is used to determine how much the observed data have changed (the data utilized in regression modeling).
(ii) MSE is a parameter that measures how closely a fitted line resembles a set of data points. The lower the value, the closer it is to the line, and hence the better. If the MSE value is 0, the model is perfect. It is shown in equation (2):

\[ MSE = \sum_{i=1}^{n} \frac{(X_i - \hat{X}_i)^2}{n}, \quad (2) \]

where
    (a) \(X_i\) = the ith observed value
    (b) \(\hat{X}_i\) = the corresponding predicted value
    (c) \(n\) = the number of observations
(iii) RMSE indicates how densely the data are distributed along the line of best fit.
Figure 15: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for New Delhi-SVR.

Figure 16: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Bangalore-SVR.

Figure 17: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Kolkata-SVR.

Figure 18: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Hyderabad-SVR.

Figure 19: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for New Delhi-RFR.

Figure 20: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Bangalore-RFR.

Figure 21: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Kolkata-RFR.

Figure 22: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Hyderabad-RFR.

RMSE values in the range of 0.2–0.5 demonstrate that the model can reasonably predict the data. It is shown in equation (3):

\[ RMSE = \sqrt{\sum_{i=1}^{n} \frac{(X_i - \hat{X}_i)^2}{n}}, \quad (3) \]

where
    (a) \(X_i\) = the ith observed value
    (b) \(\hat{X}_i\) = the corresponding predicted value
    (c) \(n\) = the number of observations
(iv) MAE evaluates the absolute distance of the observations to the predictions on the regression line. It is shown in equation (4):

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| X_i - \hat{X}_i \right|, \quad (4) \]

where
    (a) \(n\) is the number of errors
    (b) \(\sum\) is the summation symbol (which means "add them all up")
    (c) \(|X_i - \hat{X}_i|\) is the absolute error
(v) Accuracy is used as a measurement to calculate how well a model is finding patterns and identifying relations in the dataset, and it is shown in equation (5):

\[ Accuracy = (1 - MAE) \times 100. \quad (5) \]

This gives the accuracy as a percentage.

6. Results and Discussion

In the proposed work, the dataset mentioned above has been cleaned such that it only has the values for the cities of New Delhi, Bangalore, Kolkata, and Hyderabad. The dataset was used in two ways, once in an imbalanced version and then in a balanced version using SMOTE. Graphs were plotted, and it was seen that there was an increase in the accuracies of the models that used the balanced dataset.
Figure 23: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for New Delhi-CR.

Figure 24: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Bangalore-CR.

Figure 25: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Kolkata-CR.

For prediction purposes, three algorithms were run on it, namely, support vector regression, random forest regression, and CatBoost regression. Plots comparing the test data and the predicted data were produced as well. The metrics calculated for each algorithm are R-SQUARE, mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). Comparative tables, graphs, and scatter plots were drawn for the balanced and imbalanced dataset results to show how using a balanced dataset provides higher accuracy for each algorithm.

According to the research in this paper, the choice to use statistical metrics such as RMSE and R-SQUARE, as well as how to implement them effectively, was informed by papers [30–33]. Metrics are used to track and gauge a model's performance (during training and testing). These metrics provide information on the precision of the forecasts and the amount of departure from the actual values, since all of the algorithms utilized are based on regression models.
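The actual-versus-predicted scatter plots referenced here (Figures 15–26) can be reproduced with a short matplotlib routine; the sketch below is an assumed illustration of that plotting step, not the authors' original code.

```python
# Sketch of an actual-vs-predicted plot in the style of Figures 15-26.
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_test, y_pred, title="Actual and predicted values"):
    plt.figure(figsize=(8, 4))
    plt.scatter(range(len(y_test)), y_test, s=10, label="Actual")
    plt.scatter(range(len(y_pred)), y_pred, s=10, label="Predicted")
    plt.xlabel("Observations")
    plt.ylabel("Targets")
    plt.title(title)
    plt.legend()
    plt.show()
```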
Figure 26: Scatter plots showing actual and predicted values for the imbalanced dataset (without using SMOTE) and balanced dataset (with using SMOTE) for Hyderabad-CR.

The accuracy results' comparison of the imbalanced dataset, without using the SMOTE algorithm, for all 4 cities (Delhi, Bangalore, Kolkata, and Hyderabad), obtained by the machine learning techniques support vector regression, random forest regression, and CatBoost regression, is shown in Table 7. Among the four cities, the Kolkata city dataset gives the maximum accuracy for these three algorithms, whereas the Bangalore city dataset gives the minimum accuracy. The dataset used was imbalanced.
Figure 7 depicts, using a bar graph, the accuracy achieved by the various ML techniques (SVR, RFR, and CR) in estimating the AQI in the four different cities.
Table 8 logs the performance metrics (R-SQUARE, MSE, RMSE, and MAE) for the New Delhi city imbalanced dataset (i.e., without using the SMOTE algorithm) for all 3 algorithms: support vector regression, random forest regression, and CatBoost regression. The CatBoost regression algorithm gives the best result in comparison to support vector regression and random forest regression.
In Figure 8, the comparison between the R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the New Delhi city imbalanced dataset (i.e., without using the SMOTE algorithm) is shown through a graphical representation. It depicts that CatBoost regression has the highest R-SQUARE and the lowest RMSE, MSE, and MAE values.
Table 9 logs the performance metrics (R-SQUARE, MSE, RMSE, and MAE) for the Bangalore city imbalanced dataset (i.e., without using the SMOTE algorithm) for all 3 algorithms. The random forest regression gives the best result when compared to support vector regression and CatBoost regression, except for the fact that CatBoost regression gives a lower MAE than random forest regression.
In Figure 9, the comparison between the R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression is shown. It depicts that random forest regression has the highest R-SQUARE and the lowest RMSE and MSE values, and that CatBoost regression has the lowest MAE value.
Table 10 logs the performance metrics (R-SQUARE, MSE, RMSE, and MAE) for the Kolkata city imbalanced dataset (i.e., without using the SMOTE algorithm) for all 3 algorithms. The random forest regression gives the best result in comparison to the support vector regression and CatBoost regression algorithms.
In Figure 10, the comparison between the R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the Kolkata city imbalanced dataset (i.e., without using the SMOTE algorithm) is shown. It depicts that random forest regression has the highest R-SQUARE and the lowest RMSE, MSE, and MAE values.
Table 11 logs the performance metrics (R-SQUARE, MSE, RMSE, and MAE) for the Hyderabad city imbalanced dataset (i.e., without using the SMOTE algorithm) for all 3 algorithms. The random forest regression gives the best result in comparison to the support vector regression and CatBoost regression.
In Figure 11, the comparison between the R-SQUARE, MSE, RMSE, and MAE of support vector regression, random forest regression, and CatBoost regression for the Hyderabad imbalanced dataset (i.e., without using the SMOTE algorithm) is shown. It depicts that random forest regression has the highest R-SQUARE and the lowest RMSE, MSE, and MAE values.
The accuracy results' comparison of the balanced dataset, using the SMOTE algorithm, for all 4 cities (Delhi, Bangalore, Kolkata, and Hyderabad), obtained by support vector regression, random forest regression, and CatBoost regression, is shown in Table 12. Among the four cities, the Hyderabad city dataset gives the maximum accuracy for these three algorithms, whereas the New Delhi city dataset gives the minimum accuracy.
Table 6: Comparison of dataset size with and without the SMOTE algorithm.

                 Imbalanced dataset size (not using SMOTE)        Balanced dataset size (using SMOTE)
AQI_bucket       Delhi   Bangalore   Kolkata   Hyderabad          Delhi   Bangalore   Kolkata   Hyderabad
Moderate         485     479         151       806                485     958         302       806
Satisfactory     108     810         278       645                432     810         278       645
Good             0       59          119       126                0       944         238       1008
Poor             534     12          119       30                 534     768         238       960
Very poor        514     1           66        3                  514     1           264       768
Severe           239     0           13        4                  478     0           208       1024
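A minimal sketch of the class balancing summarized in Table 6, using the SMOTE implementation from the imbalanced-learn package, is given below. The file name, feature columns, and parameters are illustrative assumptions rather than the authors' code; the key point is that oversampling is driven by the AQI bucket label so that minority buckets receive synthetic samples.

```python
# Sketch: balance the AQI_Bucket classes with SMOTE from imbalanced-learn.
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("city_day.csv")                        # hypothetical cleaned AQI file
features = ["PM2.5", "PM10", "NO2", "SO2", "CO", "O3"]  # illustrative pollutant columns
X = df[features]
y_bucket = df["AQI_Bucket"]                             # class label driving the oversampling

print(y_bucket.value_counts())                          # class sizes before balancing (cf. Table 6)

# Note: buckets with very few rows (see Table 6) need a smaller k_neighbors
# or must be excluded before SMOTE can generate synthetic samples for them.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y_bucket)

print(pd.Series(y_res).value_counts())                  # class sizes after balancing
```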

Table 7: Accuracy results comparison of the imbalanced dataset for four cities and methods used.

Method                       New Delhi (%)   Bangalore (%)   Kolkata (%)   Hyderabad (%)
Support vector regression    78.4867         66.4564         89.1656       76.6786
Random forest regression     79.4764         67.7038         90.9700       78.3672
CatBoost regression          79.8622         68.6860         89.9766       77.8991

Table 8: The result of performance metrics used for the New Delhi city imbalanced dataset, without using the SMOTE algorithm.

Algorithm name               R-square   MSE      RMSE     MAE
Support vector regression    0.9177     0.0908   0.3013   0.2151
Random forest regression     0.9265     0.0810   0.2846   0.2052
CatBoost regression          0.9293     0.0779   0.2792   0.2013

Table 9: The result of performance metrics used for the Bangalore city imbalanced dataset, without using the SMOTE algorithm.

Algorithm name               R-square   MSE      RMSE     MAE
Support vector regression    0.6525     0.3772   0.6142   0.3354
Random forest regression     0.7035     0.3219   0.5674   0.3229
CatBoost regression          0.6877     0.3391   0.5823   0.3131

Table 10: The result of performance metrics used for the Kolkata city imbalanced dataset, without using the SMOTE algorithm.

Algorithm name               R-square   MSE      RMSE     MAE
Support vector regression    0.9714     0.2942   0.1715   0.1083
Random forest regression     0.9808     0.0197   0.1403   0.0902
CatBoost regression          0.9752     0.0255   0.1597   0.1002

Table 11: The result of performance metrics used for the Hyderabad city imbalanced dataset, without using the SMOTE algorithm.

Algorithm name               R-square   MSE      RMSE     MAE
Support vector regression    0.7599     0.2512   0.5012   0.2332
Random forest regression     0.8600     0.1464   0.3826   0.2163
CatBoost regression          0.8474     0.1596   0.3995   0.2210

In the proposed work, the original dataset is used, SMOTE is applied to it as mentioned above, and it is cleaned to only have the values for the cities of New Delhi, Bangalore, Kolkata, and Hyderabad. Three algorithms have been implemented on it for prediction purposes, namely support vector regression, random forest regression, and CatBoost regression, and graphs between the test data and the predicted data were plotted as well.
Table 13 shows a comparison of SVR accuracy with and without the SMOTE algorithm for the four cities. Bangalore city has the lowest accuracy of 66.46% and Kolkata city has the highest accuracy of 89.17% for the dataset without the SMOTE algorithm. Hyderabad city has the highest accuracy of 93.57% and New Delhi city has the lowest accuracy of 84.83% for the dataset with the SMOTE algorithm. It is clearly observed that the dataset with the SMOTE algorithm applied gives higher accuracies. This is shown in Figure 12.
The accuracy comparison of SVR, RFR, and CR on the balanced and imbalanced datasets (i.e., with and without using the SMOTE algorithm) is shown in Figures 12–14. The accuracies for the balanced datasets for the four cities are increased when compared to the accuracies for the imbalanced datasets. The scatter plots show the actual and predicted values for the imbalanced dataset (without using SMOTE) and the balanced dataset (with using SMOTE) using SVR. The scatter plots for the four cities (New Delhi, Bangalore, Kolkata, and Hyderabad) for SVR are shown in Figures 15–18.
Table 14 shows a comparison of RFR accuracy with and without the SMOTE algorithm for the four cities. Bangalore city has the lowest accuracy of 67.70% and Kolkata city has the highest accuracy of 90.97% for the dataset without the SMOTE algorithm. Hyderabad city has the highest accuracy of 97.61% and New Delhi city has the lowest accuracy of 84.73% for the dataset with the SMOTE algorithm. It is clearly observed that the dataset with the SMOTE algorithm applied gives higher accuracies. This is shown in Figure 13.
Table 12: Accuracy results comparison of the balanced dataset using the SMOTE algorithm for four cities and methods used.

Method                             New Delhi (%)   Bangalore (%)   Kolkata (%)   Hyderabad (%)
Support vector regression (SVR)    84.8332         87.1756         91.5624       93.5658
Random forest regression (RFR)     84.7284         90.3071         93.7438       97.6080
CatBoost regression (CR)           85.0847         90.3343         93.1656       96.7529

Table 13: Comparison of SVR accuracy with and without the SMOTE algorithm of four cities.

Cities       SVR accuracy, not using SMOTE (imbalanced dataset) (%)   SVR accuracy, using SMOTE (balanced dataset) (%)
New Delhi    78.4867                                                  84.8332
Bangalore    66.4564                                                  87.1756
Kolkata      89.1656                                                  91.5624
Hyderabad    76.6786                                                  93.5658

Table 14: Comparison of RFR accuracy with and without the SMOTE algorithm of four cities.

Cities       RFR accuracy, not using SMOTE (imbalanced dataset) (%)   RFR accuracy, using SMOTE (balanced dataset) (%)
New Delhi    79.4764                                                  84.7284
Bangalore    67.7038                                                  90.3071
Kolkata      90.9700                                                  93.7438
Hyderabad    78.3672                                                  97.6080

Table 15: Comparison of CR accuracy with and without the SMOTE algorithm of four cities.

Cities       CR accuracy, not using SMOTE (imbalanced dataset) (%)    CR accuracy, using SMOTE (balanced dataset) (%)
New Delhi    79.8622                                                  85.0847
Bangalore    68.6860                                                  90.3343
Kolkata      89.9766                                                  93.1656
Hyderabad    77.8991                                                  96.7529

Table 16: Overall comparison between accuracy values of the dataset with and without the SMOTE algorithm of four cities.

            Accuracy of the imbalanced dataset (without SMOTE) (%)    Accuracy of the balanced dataset (with SMOTE) (%)
Method      Delhi     Bangalore   Kolkata   Hyderabad                 Delhi     Bangalore   Kolkata   Hyderabad
SVR         78.4867   66.4564     89.1656   76.6786                   84.8332   87.1756     91.5624   93.5658
RFR         79.4764   67.7038     90.9700   78.3672                   84.7284   90.3071     93.7438   97.6080
CatBoost    79.8622   68.6860     89.9766   77.8991                   85.0847   90.3343     93.1656   96.7529
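Comparisons such as Table 16 and Figures 12–14 can be re-plotted from the tabulated accuracies; the sketch below is an illustrative example that uses the RFR values from Table 14 and is not the authors' plotting code.

```python
# Sketch: grouped bar chart of RFR accuracy with and without SMOTE (values from Table 14).
import pandas as pd
import matplotlib.pyplot as plt

acc = pd.DataFrame(
    {
        "Dataset without SMOTE": [79.4764, 67.7038, 90.9700, 78.3672],
        "Dataset with SMOTE": [84.7284, 90.3071, 93.7438, 97.6080],
    },
    index=["Delhi", "Bangalore", "Kolkata", "Hyderabad"],
)

ax = acc.plot(kind="bar", rot=0, figsize=(7, 4))
ax.set_xlabel("Cities")
ax.set_ylabel("Accuracy (%)")
ax.set_ylim(60, 100)
ax.set_title("Accuracies of RFR with and without the SMOTE algorithm")
plt.tight_layout()
plt.show()
```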

The accuracy comparison of RFR on the balanced and imbalanced datasets (i.e., with and without using the SMOTE algorithm) is shown in Figure 13. The accuracies for the balanced datasets for the four cities are increased when compared to the accuracies for the imbalanced datasets. The scatter plots show the actual and predicted values for the imbalanced dataset (without using SMOTE) and the balanced dataset (with using SMOTE) using RFR. The scatter plots for the four cities (New Delhi, Bangalore, Kolkata, and Hyderabad) are shown in Figures 19–22.
Table 15 shows a comparison of CR accuracy with and without the SMOTE algorithm for the four cities. Bangalore city has the lowest accuracy of 68.69% and Kolkata city has the highest accuracy of 89.98% for the dataset without the SMOTE algorithm. Hyderabad city has the highest accuracy of 96.75% and New Delhi city has the lowest accuracy of 85.08% for the dataset with the SMOTE algorithm. It is clearly observed that the dataset with the SMOTE algorithm applied gives higher accuracies. This is shown in Figure 14.
The accuracy comparison of CR on the balanced and imbalanced datasets (i.e., with and without using the SMOTE algorithm) is shown in Figure 14. The accuracies for the balanced datasets for the four cities are increased when compared to the accuracies for the imbalanced datasets.
The scatter plots show the actual and predicted values for the imbalanced dataset (without using SMOTE) and the balanced dataset (with using SMOTE) using CR. The scatter plots for the four cities (New Delhi, Bangalore, Kolkata, and Hyderabad) are shown in Figures 23–26, respectively.
Table 16 shows the overall comparison between the accuracy values of the dataset with and without the SMOTE algorithm for the four cities. It can be seen that, for the dataset without the SMOTE algorithm, the Kolkata city dataset gives the maximum accuracy for these three algorithms, whereas the Bangalore city dataset gives the minimum accuracy. For the dataset with the SMOTE algorithm, the Hyderabad city dataset gives the maximum accuracy for these three algorithms, whereas the New Delhi city dataset gives the minimum accuracy. The dataset with the SMOTE algorithm clearly shows an increase in accuracy levels. It can also be clearly seen how each city's accuracy has changed drastically.
The results from the imbalanced dataset show that random forest regression produces the lowest RMSE values in Bangalore (0.5674), Kolkata (0.1403), and Hyderabad (0.3826), as well as higher accuracy, compared to SVR and CatBoost regression, for Kolkata (90.9700%) and Hyderabad (78.3672%), whereas CatBoost regression produces the lowest RMSE value in New Delhi (0.2792) and the highest accuracy for New Delhi (79.8622%) and Bangalore (68.6860%). For the balanced dataset, which is the dataset with the synthetic minority oversampling technique (SMOTE) algorithm applied to it, random forest regression yields, in contrast to SVR and CatBoost regression, the lowest RMSE values in Kolkata (0.0988) and Hyderabad (0.0628) and higher accuracies for Kolkata (93.7438%) and Hyderabad (97.6080%). CatBoost regression yields higher accuracies for New Delhi (85.0847%) and Bangalore (90.3343%) and the lowest RMSE values for Bangalore (0.2148) and New Delhi (0.1895). Therefore, it was evident from this that datasets that had the SMOTE algorithm applied to them produced higher accuracy.
It is observed that when SMOTE is applied, the accuracy for New Delhi with SVR goes from 78.4867% to 84.8332%, with RFR it goes from 79.4764% to 84.7284%, and with CatBoost regression it goes from 79.8622% to 85.0847%. For the Bangalore dataset again, it is noticed that once the SMOTE algorithm is applied, the accuracies achieved by the models are considerably higher than those obtained with the imbalanced dataset (without SMOTE). When SMOTE is applied, the accuracy for Bangalore with SVR goes from 66.4564% to 87.1756%, with RFR from 67.7038% to 90.3071%, and with CatBoost regression from 68.6860% to 90.3343%. It is noticed that when SMOTE is applied, the accuracy for Kolkata with SVR jumps from 89.1656% to 91.5624%, with RFR from 90.9700% to 93.7438%, and with CatBoost regression from 89.9766% to 93.1656%. To establish the trend further, even Hyderabad shows increased accuracies from the models when SMOTE is applied: with SVR, the accuracy goes from 76.6786% to 93.5658%, with RFR from 78.3672% to 97.6080%, and with CatBoost regression from 77.8991% to 96.7529%.
So, this gives quite a clear picture of the importance of balanced datasets. Having a properly balanced dataset can give more equal importance to each class. If there is too much of a gap between the number of values present for each class, it does not give an accurate portrayal of the actual scenario, and hence the model fails. SMOTE creates multiple synthetic examples for the minority class and brings about a balance to the dataset. This makes the models work to the best of their ability, hence bringing better accuracy. This paper hence makes clear the importance of using SMOTE-applied datasets. Furthermore, these metrics also help show the best regression models for the particular use case and help in further research.

7. Conclusion and Future Work

Air pollution is a global problem; researchers from all around the world are working to discover a solution. To accurately forecast the AQI, machine learning techniques were investigated. The present study assessed the performance of the three best data mining models (SVR, RFR, and CR) for predicting accurate AQI data in some of India's most populous and polluted cities. The synthetic minority oversampling technique (SMOTE) was used to equalize the class data to get better and more consistent results. This approach of balancing the datasets, then using them, carefully comparing the results of both the imbalanced and balanced versions, and using statistical methods such as RMSE, MAE, MSE, and R-SQUARE to confirm the better results was very clearly successful in achieving higher accuracy. The fresh research on balanced versus imbalanced datasets used in such an application is well tabulated and can be used as a reference for further research.
The algorithms were run using both datasets (with and without the SMOTE algorithm), and an increase of 6 to 24% was found. Our maximum accuracy in any city also went from 90.97% for Kolkata using RFR to 97.6% for Hyderabad using the same algorithm. Our lowest accuracy went from 66.45% in Bangalore using SVR to 84.7% in Delhi for RFR. Overall, there was a major increase in accuracy. In the proposed work, through extensive testing of all three algorithms in New Delhi, Bangalore, Kolkata, and Hyderabad, it came to our notice that, consistently, random forest regression and CatBoost regression provided promising results. In both cases, before using the SMOTE algorithm and after applying SMOTE, they outperformed SVR. The other metric comparisons with and without the SMOTE algorithm are given below.
(i) Regarding R-SQUARE for unbalanced data,
    (a) In New Delhi, CatBoost gave the highest R-SQUARE.
    (b) In Bangalore, Kolkata, and Hyderabad, random forest got the highest R-SQUARE.
(ii) Regarding MSE for unbalanced data,
    (a) In New Delhi, CatBoost gave the lowest MSE value.
    (b) In Bangalore, Kolkata, and Hyderabad, random forest got the lowest MSE value.
(iii) Regarding MAE for unbalanced data, in terms of accuracy, which was calculated using MAE, it was concluded as follows:
    (a) In New Delhi and Bangalore, CatBoost gave the highest accuracy.
    (b) In Kolkata and Hyderabad, random forest gave the highest accuracy.
(iv) Regarding RMSE for unbalanced data,
    (a) In New Delhi, CatBoost got the least RMSE value, albeit by a slight margin.
    (b) In Bangalore, Kolkata, and Hyderabad, random forest got the least RMSE value.
(v) Regarding R-SQUARE for balanced data,
    (a) In New Delhi and Bangalore, CatBoost gave the highest R-SQUARE.
    (b) In Kolkata and Hyderabad, random forest gave the highest R-SQUARE.
(vi) Regarding MSE for balanced data,
    (a) In New Delhi and Bangalore, CatBoost got the lowest MSE value.
    (b) In Kolkata and Hyderabad, random forest got the lowest MSE value.
(vii) Regarding MAE for balanced data, in terms of accuracy, which was calculated using MAE, it was concluded as follows:
    (a) In New Delhi and Bangalore, CatBoost gave the highest accuracy.
    (b) In Kolkata and Hyderabad, random forest regression gave the highest accuracy.
(viii) Regarding RMSE for balanced data,
    (a) In New Delhi and Bangalore, CatBoost got the lowest RMSE value.
    (b) In Kolkata and Hyderabad, random forest got the least RMSE value.

So, it seems that in the use case of AQI in India, the CatBoost and random forest algorithms, coupled with SMOTE-applied datasets, can provide great results to estimate air quality, which can prompt local and national governments, as well as other civic bodies, to act and regulate the air quality. As is very evident from the abovementioned metrics, the application of these regression models to the 2015 to 2020 AQI data has been successful in demonstrating that our innovation of using the SMOTE algorithm has paid off well and increased the accuracy values of these regression models. This innovative approach can be applied to future research and its benefits reaped.

For future work, there are plans to use satellite imagery and more extensive data to provide estimations for individual areas of a city as well. Another avenue to explore would be artificial intelligence (AI) to make the models more effective and innovative. This would help in figuring out which industrial areas contribute the most to pollution. Extending the study and trying new algorithms would also make our work more detailed. The aim is to find patterns and provide solutions on how to improve the air quality index of a city. The factors that contribute the most, and ways to minimize them in an efficient way, are an area worth exploring. In addition, further analyzing our dataset to see if there are any intriguing patterns, such as the AQI's increase or reduction during holidays or particular months and seasons, will be fruitful for our cause [45].

Data Availability

The data used to support the findings of this study are available on request to the corresponding author.

Additional Points

(i) Three regression algorithms were used for predicting the air quality index (AQI), (ii) dataset balancing was carried out through the synthetic minority oversampling technique (SMOTE) algorithm, (iii) the air quality index was predicted using 15 attributes, and (iv) the dataset included AQI data from four significant Indian cities.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] H. Liu, Q. Li, D. Yu, and Y. Gu, “Air quality index and air pollutant concentration prediction based on machine learning algorithms,” Applied Sciences, vol. 9, p. 4069, 2019.
[2] M. Castelli, F. M. Clemente, A. Popovic, S. Silva, and L. Vanneschi, “A machine learning approach to predict air quality in California,” Complexity, vol. 2020, Article ID 8049504, 23 pages, 2020.
[3] G. Mani, J. K. Viswanadhapalli, and A. A. Stonie, “Prediction and forecasting of air quality index in Chennai using regression and ARIMA time series models,” Journal of Engineering Research, vol. 9, 2021.
[4] S. V. Kottur and S. S. Mantha, “An integrated model using Artificial Neural Network (ANN) and Kriging for forecasting air pollutants using meteorological data,” Int. J. Adv. Res. Comput. Commun. Eng., vol. 4, pp. 146–152, 2015.
[5] S. Halsana, “Air quality prediction model using supervised machine learning algorithms,” International Journal of Scientific Research in Computer Science, Engineering and Information Technology, vol. 8, pp. 190–201, 2020.
[6] A. G. Soundari, J. Gnana, and A. C. Akshaya, “Indian air quality prediction and analysis using machine learning,” International Journal of Applied Engineering Research, vol. 14, p. 11, 2019.
[7] C. R. Aditya, C. R. Deshmukh, N. D. K., P. Gandhi, and V. astu, “Detection and prediction of air pollution using machine learning models,” International Journal of Engineering Trends and Technology, vol. 59, no. 4, pp. 204–207, 2018.
[8] J. Kleine Deters, R. Zalakeviciute, M. Gonzalez, and Y. Rybarczyk, “Modeling PM2.5 urban pollution using machine learning and selected meteorological parameters,” Journal of Electrical and Computer Engineering, vol. 2017, Article ID 5106045, 14 pages, 2017.
[9] P. Bhalgat, S. Pitale, and S. Bhoite, “Air quality prediction using machine learning algorithms,” International Journal of Computer Applications Technology and Research, vol. 8, pp. 367–370, 2019.
[10] M. Bansal, “Air quality index prediction of Delhi using LSTM,” Int. J. Emerg. Trends Technol. Comput. Sci., vol. 8, pp. 59–68, 2019.
[11] A. Shishegaran, M. Saeedi, A. Kumar, and H. Ghiasinejad, “Prediction of air quality in Tehran by developing the nonlinear ensemble model,” Journal of Cleaner Production, vol. 259, Article ID 120825, 2020.
[12] L. Tuan-Vinh, “Improving the awareness of sustainable smart cities by analyzing lifelog images and IoT air pollution data,” in Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), IEEE, Orlando, FL, USA, September 2021.
[13] R. Kumar, P. Kumar, and Y. Kumar, “Time series data prediction using IoT and machine learning technique,” Procedia Computer Science, vol. 167, no. 2020, pp. 373–381, 2020.
[14] H. Maleki, A. Sorooshian, G. Goudarzi, Z. Baboli, Y. Tahmasebi Birgani, and M. Rahmati, “Air pollution prediction by using an artificial neural network model,” Clean Technologies and Environmental Policy, vol. 21, no. 6, pp. 1341–1352, 2019.
[15] K. P. Singh, S. Gupta, and P. Rai, “Identifying pollution sources and predicting urban air quality using ensemble learning methods,” Atmospheric Environment, vol. 80, pp. 426–437, 2013.
[16] S. Hansun and M. Bonar Kristanda, “AQI measurement and prediction using B-WEMA method,” International Journal of Engineering Research and Technology, vol. 12, pp. 1621–1625, 2019.
[17] R. Janarthanan, P. Partheeban, K. Somasundaram, and P. Navin Elamparithi, “A deep learning approach for prediction of air quality index in a metropolitan city,” Sustainable Cities and Society, vol. 67, no. 2021, Article ID 102720, 2021.
[18] M. Londhe, “Data mining and machine learning approach for air quality index prediction,” International Journal of Engineering and Applied Physics, vol. 1, no. 2, pp. 136–153, May 2021.
[19] R. W. Gore and D. S. Deshpande, “An approach for classification of health risks based on air quality levels,” in Proceedings of the 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), pp. 58–61, Aurangabad, India, October 2017.
[20] X. Zhao, M. Song, A. Liu, Y. Wang, T. Wang, and J. Cao, “Data-driven temporal-spatial model for the prediction of AQI in Nanjing,” Journal of Artificial Intelligence and Soft Computing Research, vol. 10, no. 4, pp. 255–270, 2020.
[21] A.-S. Chowdhury, M. S. Uddin, M. R. Tanjim, F. Noor, and R. M. Rahman, “Application of data mining techniques on air pollution of Dhaka city,” in Proceedings of the 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 562–567, Varna, Bulgaria, August 2020.
[22] Y. Zhou, S. De, G. Ewa, C. Perera, and K. Moessner, “Data-driven air quality characterization for urban environments: a case study,” IEEE Access, vol. 6, Article ID 77996, 2018.
[23] C. Srivastava, S. Singh, and A. P. Singh, “Estimation of air pollution in Delhi using machine learning techniques,” in Proceedings of the 2018 International Conference on Computing, Power and Communication Technologies (GUCON), pp. 304–309, Greater Noida, India, September 2018.
[24] R. Raturi and J. R. Prasad, “Recognition of future air quality index using artificial neural network,” International Research Journal of Engineering and Technology (IRJET), vol. 5, pp. 2395–0056, 2018.
[25] U. Mahalingam, K. Elangovan, H. Dobhal, C. Valliappa, S. Shrestha, and G. Kedam, “A machine learning model for air quality prediction for smart cities,” in Proceedings of the 2019 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), pp. 452–457, Chennai, India, March 2019.
[26] V. Sivakumar, G. R. Kanagachidambaresan, V. Dhilip Kumar, M. Arif, C. Jackson, and G. Arulkumaran, “Energy-efficient Markov-based lifetime enhancement approach for underwater acoustic sensor network,” Journal of Sensors, vol. 2022, Article ID 3578002, 10 pages, 2022.
[27] J. Sethi and M. Mittal, “Ambient air quality estimation using supervised learning techniques,” ICST Transactions on Scalable Information Systems, vol. 6, Article ID 159628, 2019.
[28] P. Hajek and V. Olej, “Predicting common air quality index - the case of Czech microregions,” Aerosol and Air Quality Research, vol. 15, no. 2, pp. 544–555, 2015.
[29] S. Ameer, M. A. Shah, A. Khan et al., “Comparative analysis of machine learning techniques for predicting air quality in smart cities,” IEEE Access, vol. 7, Article ID 128325, 2019.
[30] A. R. Behesht Abad, S. Mousavi, N. Mohamadian et al., “Hybrid machine learning algorithms to predict condensate viscosity in the near wellbore regions of gas condensate reservoirs,” Journal of Natural Gas Science and Engineering, vol. 95, Article ID 104210, 2021.
[31] M. Rajabi, S. Beheshtian, S. Davoodi et al., “Novel hybrid machine learning optimizer algorithms to prediction of fracture density by petrophysical data,” Journal of Petroleum Exploration and Production Technology, vol. 11, no. 12, pp. 4375–4397, 2021.
[32] A. R. Behesht Abad, P. S. Tehrani, M. Naveshki et al., “Predicting oil flow rate through orifice plate with robust machine learning algorithms,” Flow Measurement and Instrumentation, vol. 81, Article ID 102047, 2021.
[33] O. Hasbeh, M. Ahmadi Alvar, K. Y. Aghdam, H. Ghorbani, N. Mohamadian, and J. Moghadasi, “Hybrid computing models to predict oil formation volume factor using multilayer perceptron algorithm,” Journal of Petroleum and Mining Engineering, vol. 23, no. 1, pp. 17–30, 2021.
[34] F. Jafarizadeh, M. Rajabi, S. Tabasi et al., “Data driven models to predict pore pressure using drilling and petrophysical data,” Energy Reports, vol. 8, pp. 6551–6562, 2022.
[35] G. Zhang, S. Davoodi, S. S. Band, H. Ghorbani, A. Mosavi, and M. Moslehpour, “A robust approach to pore pressure prediction applying petrophysical log data aided by machine learning techniques,” Energy Reports, vol. 8, pp. 2233–2247, 2022.

[36] S. Tabasi, P. Soltani Tehrani, M. Rajabi et al., “Optimized machine learning models for natural fractures prediction using conventional well logs,” Fuel, vol. 326, Article ID 124952, 2022.
[37] M. Rajabi, O. Hazbeh, S. Davoodi et al., “Predicting shear wave velocity from conventional well logs with deep and hybrid machine learning algorithms,” Journal of Petroleum Exploration and Production Technology, 2022.
[38] S. Beheshtian, M. Rajabi, S. Davoodi et al., “Robust computational approach to determine the safe mud weight window using well-log data from a large gas reservoir,” Marine and Petroleum Geology, vol. 142, Article ID 105772, 2022.
[39] Z. K. Masoud, S. Davoodi, H. Ghorbani et al., “Permeability prediction of heterogeneous carbonate gas condensate reservoirs applying group method of data handling,” Marine and Petroleum Geology, vol. 139, 2022.
[40] N. Mohamadian, H. Ghorbani, D. A. Wood, and M. A. Khoshmardan, “A hybrid nanocomposite of poly(styrene-methyl methacrylate-acrylic acid)/clay as a novel rheology-improvement additive for drilling fluids,” Journal of Polymer Research, vol. 26, no. 2, p. 33, 2019.
[41] N. Mohamadian, H. Ghorbani, D. A. Wood, and H. K. Hormozi, “Rheological and filtration characteristics of drilling fluids enhanced by nanoparticles with selected additives: an experimental study,” Advances in Geo-Energy Research, vol. 2, no. 3, pp. 228–236, 2018.
[42] A. Choubineh, H. Ghorbani, D. A. Wood, S. Robab Moosavi, E. Khalaf, and E. Sadatshojaei, “Improved predictions of wellhead choke liquid critical-flow rates,” Fuel, vol. 207, pp. 547–560, 2017.
[43] H. Ghorbani, J. Moghadasi, and D. A. Wood, “Prediction of gas flow rates from gas condensate reservoirs through wellhead chokes using a firefly optimization algorithm,” Journal of Natural Gas Science and Engineering, vol. 45, pp. 256–271, 2017.
[44] A. R. B. Abad, H. Ghorbani, N. Mohamadian et al., “Robust hybrid machine learning algorithms for gas flow rates prediction through wellhead chokes in gas condensate fields,” Fuel, vol. 308, Article ID 121872, 2022.
[45] S. Fan, D. Hao, Y. Feng, K. Xia, and W. A. Yang, “A hybrid model for air quality prediction based on data decomposition,” Information, vol. 12, no. 5, p. 210, 2021.