
Comparative Analysis Study For Air Quality Prediction in Smart Cities Using Regression Techniques

This document summarizes a research study that compared three regression techniques (random forest, linear, and decision tree regression) for predicting air quality using an air quality index. The study aimed to determine the most effective model based on evaluation metrics like mean absolute error and R2 score. It found that decision tree regression had the best performance with high R2 scores and minimal error rates. The study also showed that integrating cloud computing improved the execution time of the models, making real-time air quality forecasting feasible.


Received 14 September 2023, accepted 29 September 2023, date of publication 10 October 2023, date of current version 23 October 2023.


Digital Object Identifier 10.1109/ACCESS.2023.3323447

Comparative Analysis Study for Air Quality Prediction in Smart Cities Using Regression Techniques
SHOROUQ AL-EIDI 1 , FATHI AMSAAD 2 , OMAR DARWISH 3 , (Senior Member, IEEE),
YAHYA TASHTOUSH 4 , ALI ALQAHTANI 5 ,
AND NIVESHITHA NIVESHITHA2 , (Graduate Student Member, IEEE)
1 Computer Science Department, Tafila Technical University, Tafila 66110, Jordan
2 Computer Science and Engineering Department, Wright State University, Colonel, OH 45435, USA
3 Information Security and Applied Computing Department, Eastern Michigan University, Ypsilanti, MI 48197, USA
4 Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan
5 Department of Networks and Communication Engineering, Najran University, Najran 61441, Saudi Arabia

Corresponding author: Shorouq Al-Eidi ([email protected])


This work was supported by the Deanship of Scientific Research, Najran University, under the Research Groups Funding Program, under
Grant NU/RG/SERC/12/9.

ABSTRACT In smart cities, air pollution has detrimental impacts on human physical health and the quality of the living environment. Therefore, correctly predicting air quality plays an important role in effective action plans to mitigate air pollution and create healthier and more sustainable environments. Monitoring and predicting air pollution is crucial to empower individuals to make informed decisions that protect their health. This research presents a comprehensive comparative analysis focused on air quality prediction using three distinct regression techniques: Random Forest regression, Linear regression, and Decision Tree regression. The main goal of this study is to discern the most effective model by considering a range of evaluation criteria, including Mean Absolute Error and R2 measures. Moreover, it considers the crucial aspects of minimizing prediction errors and enhancing computational efficiency by evaluating the regression models within two frameworks. The findings of this study underscore the superiority of the Decision Tree regression approach over the other models, demonstrating its accuracy with a high R2 score and a minimal error rate. Moreover, integrating cloud computing technology has resulted in substantial improvements in the execution time of these approaches, which significantly affects the overall efficiency of the air quality prediction process. By leveraging distributed computing resources, real-time air quality forecasting becomes feasible, enabling timely decision-making and proactive measures to address air pollution episodes effectively.

INDEX TERMS Air pollution, machine learning, IoT, smart city, air quality index.

The associate editor coordinating the review of this manuscript and approving it for publication was Sajid Ali.

2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
115140 VOLUME 11, 2023
S. Al-Eidi et al.: Comparative Analysis Study for Air Quality Prediction in Smart Cities

I. INTRODUCTION
Recently, the detrimental effects of air pollution have garnered significant global attention, as underscored by World Health Organization studies that illuminate their impact on human health and the environment. It is alarming that air pollution was identified as a primary cause of various allergies, illnesses, and premature death, accounting for a staggering 12% of global deaths in 2019 [6]. Moreover, air pollution introduces dangerous substances into the atmosphere, including greenhouse gases and biological compounds [9], further exacerbating human-environmental challenges.

Specifically, the issue of air pollution in smart cities has gained significant attention in urban sustainability and

enhanced quality of life. While smart technologies have heralded remarkable efficiency and convenience, they have inadvertently become a source of air pollution. This is attributable to the concentration of industries and transportation networks within smart cities, which escalates the air pollutants and harmful gases released into the atmosphere. Consequently, urban planners within these smart cities recognize the need for innovative solutions to address this escalating problem. They leverage real-time monitoring, data analytics, and advanced approaches to accurately predict and proactively mitigate pollution levels, thereby safeguarding the well-being of their residents and ensuring the future sustainability of urban communities.

The Air Quality Index (AQI) has recently assumed a vital role in predicting air quality. The AQI indicates poorer air quality and the presence of harmful gases based on predefined ranges of air pollutant concentrations [2]. Early prediction of AQI levels is instrumental in effective environmental management and in preventing the potential dangers of air pollution.

Given the urgency of the situation, adopting sustainable solutions that effectively mitigate air pollution has become imperative, particularly when considering the well-being of future generations. Over recent years, various forecasting models have been proposed to predict pollution levels, with machine learning emerging as a noteworthy approach due to its ability to handle the intricate interplay of air quality parameters. Machine learning-based prediction systems are increasingly attractive for their precision in air quality management [8], [14], offering promising avenues for designing cleaner and healthier smart cities.

The primary objective of this study is to address the challenges of time and cost constraints in air quality prediction. It does so by leveraging the efficiency of machine learning techniques in conjunction with the AQI. To achieve this, the study compares three distinct regression approaches to provide the most accurate air quality prediction. To assess their effectiveness, well-established evaluation measures such as Root Mean Square Error (RMSE), R2 score, and Mean Absolute Error (MAE) are employed. The ultimate goal is to identify the most efficient and suitable regression model for predicting air quality. Beyond accuracy, this study recognizes the importance of real-time processing capabilities in smart cities and therefore also considers the processing time associated with each regression technique. To reduce the execution time without compromising prediction accuracy, this work incorporates distributed computing techniques into its methodology. This means that optimization considerations encompass factors such as data size and processing time.

The implications of the study's findings hold significant practical relevance for formulating effective air pollution control strategies and contribute to advancements in air quality prediction methodologies. Particularly in urban environments where the monitoring of the AQI is crucial for public health and environmental management, the insights gleaned from the study can serve as invaluable inputs for decision-making processes. They can potentially guide the development of proactive measures that effectively address challenges posed by air pollution.

This paper is organized as follows. Section II provides a review of the air quality and pollution prediction literature. Section III details the air quality prediction approach, covering the experimental setup, the pre-processing techniques, and the regression techniques used to predict air pollution levels. Section IV presents the experimental results. Section V offers a conclusion and potential future work.

II. LITERATURE REVIEWS
The field of air pollution prediction has experienced a notable rise in machine learning techniques to address the challenges associated with forecasting air quality levels. These techniques have demonstrated their effectiveness in predicting air pollution, thus contributing significantly to developing air quality management strategies. This section explores the most notable models utilized for calculating and predicting the Air Quality Index (AQI) and the concentration levels of various air pollutants through different machine learning algorithms, such as regression techniques. These models hold considerable relevance and find practical utility in other application domains such as cloud computing.

Patil et al. [18] extensively reviewed different methodologies and techniques to analyze the concentration level of air pollution and the prediction of AQI. This study highlighted the performance of these analytical methods and presented the importance of calculating AQI as a significant measure for assessing pollution levels and how it dramatically influences human health and the environment. Similarly, Oliveri et al. [15] reviewed air quality models while discussing the effect of air pollution concentration on human health.

A noteworthy study by Ameer et al. [1] scrutinized the efficiency of four regression methods, namely Decision Tree, Gradient Boosting, Multilayer Perceptron, and Artificial Neural Network (ANN), in predicting air quality levels. These methods were evaluated based on tracking PM2.5 levels in the air and calculating the AQI. The study concluded that the Random Forest regression method outperformed the others, achieving an adjusted MAE of 16% for Beijing City. This method also reduced the running time compared to Gradient Boosting and Multilayer Perceptron. Similarly, Maleki et al. [12] utilized the ANN approach to predict the concentration levels of different air pollutants such as NO2 and SO2. This study was applied in several monitoring areas including Naderi, Havashenasi, Behdasht, MohiteZist, and Iran. In this study, the authors considered the effect of parameters such as time, date, and meteorological data to offer a robust air quality predictive model.

Moreover, Zhang et al. [22] utilized long short-term memory (LSTM) to propose a deep learning approach for


air pollution detection. This study conducted a series of experiments using Detrended Cross-Correlation Analysis (DCCA) to explore the relationship between the predicted levels of several air pollutants and meteorological data such as temperature and humidity. The results showed a negative correlation between AQI and meteorological data (temperature, humidity, and wind speed), and a strong positive correlation between pressure and AQI. Furthermore, Bougoudis [3] developed a hybrid computational method to identify the correlation between air pollutants and weather conditions to determine the actual cause of pollution. The study employed ANN and Random Forest as ensemble learning methods, claiming increased accuracy. However, the feedforward neural network faced challenges predicting continuous values due to insufficient data.

Using classification machine learning algorithms, Gore et al. [5] proposed a classification approach to study how air pollutant levels affect human health. In their process, they employed Naive Bayes and Decision Tree algorithms and achieved a high accuracy using the Decision Tree model. Moreover, Simu et al. [21] presented a comparative study of the performance of several machine learning algorithms, such as Random Forest and Multi-linear Regression, in analyzing air pollutants and predicting air pollution levels. The study concluded that the Multilayer Perceptron algorithm outperformed the others.

Moreover, Peng et al. [19] utilized Multilayer Perceptron to enhance the air quality prediction accuracy. However, they noted limitations in data extension and the high computational cost because of the seasonal update of the model. Mahalingam et al. [10] proposed using ANN and SVM algorithms to predict the AQI in the smart city of Delhi with impressive accuracies, mainly with the Medium Gaussian SVM function. To predict the AQI and air pollution levels, Sharma et al. [20] applied various algorithms, including Linear regression, ANNs, Lasso regression, and XGBoost regression. The study focused on tracking the values of several pollutants, including NO2, SO2, PM2.5, PM10, CO, and O3. The research findings indicated that the Random Forest algorithm outperformed the other algorithms, demonstrating its high performance in predicting the AQI and air pollution levels.

Nandini et al. [13] used Decision Trees and Multinomial Logistic Regression to forecast and analyze air quality pollutant levels, achieving better accuracy with Multinomial Logistic Regression than with Decision Tree. Similarly, Mahanta et al. [11] conducted a comprehensive comparison of several algorithms, including Linear regression, Decision Forest, XGBoost, ElasticNet, Boosted Decision Tree, KNN, Lasso regression, and Ridge regression, to predict air pollutant levels. Among these algorithms, Extra Trees exhibited superior performance due to its technique of ranking the essential features to improve the accuracy of the predictions. Moreover, Pasupuleti et al. [17] conducted a study comparing Random Forest, Decision Tree, and Linear regression models for predicting air pollutants and meteorological conditions on the Arduino platform. The study found that the Random Forest model provided better performance by reducing errors caused by overfitting. However, it was noted that the Random Forest model required more memory and incurred higher costs.

Using a clustering approach, Kingsy et al. [7] enhanced the K-Means algorithm to analyze and identify the air pollution level. Their method calculates the correlation coefficient between pollutant data to determine the AQI value and find the air pollution level in a specific location. To validate their findings and evaluate the effectiveness of their approach, the authors compared their proposed algorithm with the Fuzzy C-Means algorithm. Their results demonstrated that the proposed K-Means clustering algorithm achieved higher accuracy and less execution time than the Fuzzy C-Means algorithm. Ganeshkumar et al. [4] presented an efficient and cost-effective classification model for environmental monitoring and air pollution prediction. In their study, the authors used several artificial intelligence methods with a cloud platform for data processing, leading to significant time savings, reduced labor effort, and high-quality outcomes. This research highlights the importance of integrating cloud platform solutions to enhance the efficiency and accuracy of monitoring and air quality prediction models, which is beneficial for addressing environmental monitoring challenges. Similarly, Park et al. [16] used their own cloud computing technique to reduce the time needed to process and visualize urban air pollution data.

The literature review underscores the widespread use of machine learning algorithms for predicting air quality and air pollution, highlighting their potential to achieve accurate results, efficient computation, and effective prediction of air quality levels. However, certain limitations need to be addressed. These include the need for larger and more comprehensive datasets, challenges in accurately predicting continuous values, and the high computational cost associated with model updates. Additionally, the review identifies a research gap: much work predicts the AQI based solely on PM2.5 measurements, neglecting other important air pollutants. Incorporating data on multiple pollutants such as O3, NO2, SO2, and PM2.5 can significantly enhance the accuracy of air pollution prediction models. These insights provide valuable guidance for future research endeavors and for developing effective air quality management strategies, particularly in smart cities.

III. AIR QUALITY PREDICTION APPROACH
This section presents our air quality prediction approach and the stages of how to predict air pollution using regression techniques.

Our approach contains six main components: Dataset preprocessing, AQI calculation, Feature selection from


FIGURE 1. Air quality prediction model.

data, Splitting and Balancing data, and Regression model construction for air quality prediction, as shown in Figure 1. Air quality datasets were collected and loaded in the first stage for analysis. Next, preprocessing steps were applied to ensure data quality, including handling missing values and reducing outliers. Then, the Air Quality Index (AQI) was calculated for the air pollutants in the dataset. After processing the data, our feature extraction module extracts the most relevant and essential features. This step helps reduce the dimensionality of the air dataset and focuses only on the significant variables. The dataset was then balanced to ensure equal representation of different classes, followed by splitting it into training and testing sets. Finally, our regression module takes the sets of essential features as input and constructs regression models to predict air quality. Performance metrics were computed to identify a suitable and efficient model for predicting air quality. We describe the details of each module next.

A. DATASET DESCRIPTION
The dataset used in this study encompasses a comprehensive collection of 103,205 records, featuring data from monitoring stations situated across ten diverse locations within Pune City.¹ These areas include Bopadi Square 65, Karve Statue Square 5, Lullanagar Square 14, Hadapsar Gadital 01, PMPML Bus Depot Deccan 15, Goodluck Square Cafe 23, Chitale Bandhu Corner 41, Pune Railway Station 28, Rajashri Shahu Bus Stand 19, and Dr. Baba Saheb Ambedkar Sethu Junction 60. The dataset, compiled in 2019, resulted from a collaborative effort between the Pune smart city and the Indian Institute of Science, Bangalore.

¹ https://www.kaggle.com/datasets/akshman/pune-smartcity-test-dataset

Within the dataset, we focus on 28 distinct features related to air pollution, including NO2 (Nitrogen dioxide), O3 (Ozone), PM10 (particulates with a diameter of less than 10 microns), PM2.5 (particulates with a diameter of less than 2.5 microns), SO2 (Sulphur dioxide), CO (Carbon monoxide), and AQI. This study considers the pollutant concentration values as crucial features of the dataset, enabling a comprehensive understanding of pollution patterns in Pune's smart city.

B. DATA PRE-PROCESSING
Data pre-processing is an important step in data analysis to improve the quality and reliability of the dataset by reducing noise and inconsistencies.

The first stage of data pre-processing is handling missing values in the raw data. The dataset used in this study comprised 103,205 entries containing several data types, such as objects, integers, and floats. Some of these entries had null or missing values, which must be addressed. To handle this issue, missing values were replaced with the mean values of the corresponding pollutant parameters. This approach helped maintain the dataset by ensuring no crucial information was lost due to missing values. Moreover, the interquartile range (IQR) method was used to address duplicate observations and outliers. The IQR method utilizes three quartiles, Q1 (25th percentile), Q2 (50th percentile), and Q3 (75th percentile), with IQR = Q3 − Q1, and treats as an outlier any value outside the range between (Q1 − 1.5 ∗ IQR) and (Q3 + 1.5 ∗ IQR). Instead of removing the outlier values, we replaced them with the lower and upper boundary values, which retains important information while reducing the impact of outliers on the data analysis process.

Exploratory Data Analysis (EDA) was used to gain insights into the dataset and understand its characteristics for cleaning and preparing the raw data for training purposes. The EDA process produced descriptive statistics of the dataset, such as the standard deviation, mean, minimum, and maximum values for each air pollutant. By calculating these statistics, we obtained a comprehensive dataset overview, enabling us to identify potential anomalies that could affect the analysis.
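The mean imputation and IQR capping described in the pre-processing subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, missing readings are represented as None, the sample readings are made up, and quartiles come from Python's statistics.quantiles with the "inclusive" method, which is only one of several quartile conventions.

```python
import statistics
from typing import List, Optional


def impute_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def clip_outliers_iqr(values: List[float], k: float = 1.5) -> List[float]:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest boundary."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Outliers are replaced by the boundary value rather than dropped.
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    # Hypothetical pollutant readings with one gap and one extreme value.
    readings = [10.0, 12.0, None, 11.0, 13.0, 500.0]
    filled = impute_with_mean(readings)
    print(clip_outliers_iqr(filled))
```

Capping at the boundaries keeps the record count unchanged, which matches the stated goal of retaining information while limiting the influence of extreme readings. Note that imputing before capping lets extreme values inflate the imputed mean; the order shown simply follows the order in which the text describes the two steps.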


TABLE 1. Basic characteristic of dataset.

C. AQI CALCULATION
As mentioned before, the AQI is one of the most crucial parameters used for monitoring air quality in particular cities. It provides a standard measure that quantifies air pollution and helps understand its effects on human health and the environment. The AQI is a numerical value within a defined range, typically from 0 to 500. A higher AQI value indicates poorer air quality and the presence of harmful air pollutants. Each pollutant has specific constraints and averaging periods to ensure accurate assessment: an 8-hour maximum for O3 and 24-hour average concentrations for SO2, PM10, CO, NO2, and PM2.5.

To calculate the AQI, the concentrations of these air pollutants are converted into sub-indices. These sub-indices are defined based on predefined ranges that indicate the level of air quality, ranging from "good" to "hazardous." The highest sub-index among the air pollutants represents the overall air quality index for a certain location. The computation of the AQI is based on Equation 1, which gives the sub-index of each pollutant [11] and considers the weightage assigned to each pollutant based on its potential health impacts. By incorporating multiple pollutants and their respective sub-indices, the AQI helps to assess the air quality [1].

\[ I = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \, (C - C_{low}) + I_{low} \qquad (1) \]

where I is the Air Quality Index, C is the pollutant concentration, C_low and C_high are the concentration breakpoints that bracket C, and I_low and I_high are the index breakpoints corresponding to C_low and C_high.

D. FEATURE SELECTION
Feature selection becomes crucial in our research following the data preprocessing and exploratory data analysis step. This process involves identifying and selecting the most relevant features related to the AQI, which represents the overall air quality. The features in this study, based on the preprocessed dataset, contain pollutant information such as CO, SO2, O3, OZONE, NO2, PM10, and PM2.5, along with their corresponding AQI values.

We used correlation analysis to determine the relationship between the features and the AQI. Correlation analysis can be used to find the linear relationship between two variables. By calculating the correlation coefficients between each feature and the AQI, we can assess their predictive value in understanding and predicting variations in air pollutant levels. The correlation values are compiled into a correlation matrix, which provides a view of the relationships between all dataset variables and identifies features with strong positive or negative correlations with the AQI, as shown in Figure 2. The results highlighted that most air pollutants demonstrated a positive correlation with the AQI values, which indicates that higher concentrations of air pollutants are associated with higher AQI values, reflecting poorer air quality. This shows that selecting as features the air pollutants with significant correlations with the AQI is essential in analyzing and predicting air quality variations in the study area.

FIGURE 2. Correlation of AQI air pollutants.

E. SPLITTING DATA
In this stage, the train-test split() method was utilized to split the data into two parts with a ratio of 70:30 for training and testing sets. This means 70% of the total dataset was chosen for training, while the remaining 30% was assigned for testing. With this splitting ratio, the model is trained on a sufficiently large portion of the data and evaluated on the test data to assess its performance.

F. BALANCING DATA
In machine learning tasks, addressing the issue of imbalanced data is crucial to ensure reliable and accurate prediction results. In this study, the distribution of AQI values exhibits an imbalance in the given dataset, where certain values occur more frequently than others. This can be observed by categorizing the AQI values into predefined ranges, as shown in Figure 3.

Using imbalanced data can significantly affect the regression approaches. Biases can occur as approaches favor the majority class and overlook minority classes with fewer instances. To overcome this issue, SMOTER (Synthetic Minority Over-sampling technique for Regression with Gaussian Noise) is one of the most common techniques used to improve the model's performance.
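As a brief aside on the AQI calculation in Section III-C, Equation 1 and the rule that the largest sub-index gives the overall AQI can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the usual linear-interpolation form in which the concentration is offset by the lower concentration breakpoint, and the breakpoint table, pollutant keys, and function names are hypothetical placeholders rather than official values.

```python
from typing import Dict, List, Tuple

# Each row is (C_low, C_high, I_low, I_high).
# ILLUSTRATIVE placeholder breakpoints only; not the official table used
# by the authors or by any regulatory agency.
Breakpoints = List[Tuple[float, float, float, float]]

EXAMPLE_BREAKPOINTS: Dict[str, Breakpoints] = {
    "PM2.5": [(0, 30, 0, 50), (30, 60, 50, 100), (60, 90, 100, 200)],
    "NO2": [(0, 40, 0, 50), (40, 80, 50, 100), (80, 180, 100, 200)],
}


def sub_index(concentration: float, table: Breakpoints) -> float:
    """Apply Equation 1 inside the breakpoint row that brackets the concentration."""
    for c_low, c_high, i_low, i_high in table:
        if c_low <= concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return slope * (concentration - c_low) + i_low
    raise ValueError(f"concentration {concentration} is outside the table")


def overall_aqi(concentrations: Dict[str, float]) -> float:
    """The overall AQI is the largest sub-index among the pollutants."""
    return max(
        sub_index(value, EXAMPLE_BREAKPOINTS[name])
        for name, value in concentrations.items()
    )


if __name__ == "__main__":
    # Hypothetical concentrations for one location.
    print(overall_aqi({"PM2.5": 45.0, "NO2": 20.0}))
```

Taking the maximum rather than an average means a single badly elevated pollutant determines the reported AQI, which is consistent with the description of the highest sub-index representing the overall index.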


FIGURE 3. AQI classified categories.

The SMOTER technique generates synthetic samples for the minority values and under-samples the majority values, which helps to obtain a balanced dataset and ensures a more equitable representation of different AQI values. Generating more synthetic samples for the minority class creates a more balanced distribution of data points. Gaussian noise was also added to these synthetic samples to introduce variation and prevent overfitting. By balancing the dataset with the SMOTER technique, the regression models are trained on more representative and diverse sets of data points. At this stage, the model's ability to capture patterns and relationships across different AQI values of air pollutants is enhanced, leading to improved model performance and more accurate predictions.

G. REGRESSION MODELS CONSTRUCTION
The final step in the air quality prediction approach is constructing a regression model to predict air quality. For this task, we train models using the following regression techniques.
1) Decision Tree regression: a supervised machine learning algorithm commonly used to model non-linear relationships between output variables and input features. In this regression approach, the algorithm partitions the data into subsets based on specific rules or criteria. These rules are selected to minimize the difference between the predicted and the actual values. By considering several input factors and training the model using historical air pollution and AQI data, Decision Tree regression can be applied to predict the air quality. The model analyzes the relationships between the input factors and the AQI to accurately predict upcoming periods.
2) Linear regression: a statistical method commonly employed for predicting and forecasting air pollution [20]. It is used for examining the relations between pollutant concentrations and the AQI. Linear regression can make reliable predictions about future air pollution levels by analyzing historical data and discerning trends and patterns. Furthermore, Linear regression aids in identifying the primary factors contributing to air pollution. By assessing the regression coefficients, it becomes possible to determine how much each variable influences the AQI. This information can be crucial in formulating effective control measures to mitigate pollution and enhance air quality.
3) Random Forest regression: a supervised learning technique that combines multiple Decision Trees and can be used for regression problems. The input data goes through multiple Decision Trees, and the average of the individual trees' predictions is used as the model's output [1].

H. EVALUATION MEASURES
In our evaluation stage, we aim to offer a comprehensive analysis of different regression approaches and performance metrics to provide the flexibility to select the model whose accuracy specifications are most relevant to users. Therefore, this part applies the most popular error rate metrics used in the machine learning and information retrieval domains. We list these measures and explain each one next. In the formulas, y denotes an actual value, y′ the corresponding predicted value, ȳ the mean of the actual values, and n the number of samples.
• Mean Absolute Error (MAE): a metric that calculates the mean of the absolute differences between the actual and predicted values observed from the model. It indicates the average magnitude of the model errors, as shown in the equation below:
\[ \mathrm{MAE} = \frac{1}{n} \sum_{j=1}^{n} \left| y_j - y'_j \right| \qquad (2) \]
• Root Mean Square Error (RMSE): a widely used metric for evaluating regression models. It calculates the average deviation between the predicted and actual values. A lower RMSE value indicates that the model achieved better performance. It can be calculated using the following formula:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( y_j - y'_j \right)^2} \qquad (3) \]
• R2 Score: measures the proportion of the variance of the target variable that is explained by the model. It typically ranges from 0 to 1, with a higher value indicating that the proposed model fits the dataset well. It is calculated using the following formula:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - y'_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \qquad (4) \]
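The three evaluation measures in Equations 2 to 4 can be computed directly, as the following minimal sketch shows. It assumes the standard definitions (absolute differences in MAE, and the mean of the actual values in the R2 denominator); the function names and the sample AQI values are hypothetical and are not results from the study.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: average of |y_j - y'_j| (Equation 2)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error: square root of the mean squared error (Equation 3)."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """R2: one minus residual sum of squares over total sum of squares (Equation 4)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical actual and predicted AQI values, for illustration only.
    y_true = [100.0, 150.0, 200.0]
    y_pred = [110.0, 140.0, 200.0]
    print(mae(y_true, y_pred), rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

In practice the same quantities are available in scikit-learn as mean_absolute_error, mean_squared_error, and r2_score in sklearn.metrics; the hand-written versions are shown only to make the formulas explicit.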


FIGURE 4. Actual vs predicted for linear regression.

FIGURE 5. Actual vs predicted for decision tree regression.

IV. EXPERIMENTAL RESULTS
To validate the reliability and effectiveness of the air quality prediction methodologies, we present all experimental findings and compare them from different perspectives. Our initial evaluation compares the actual and predicted values of each approach to provide a reliable indicator of its accuracy.
In addition, we compare the effectiveness of the regression approaches in predicting air quality across two execution configurations: a personal laptop and a cloud-based platform. Moreover, we measure the execution time of each regression technique on both platforms. This analysis provides insight into the computational efficiency and speed of the models. In the following sections, we discuss the detailed results, providing a comprehensive understanding of the performance of the regression techniques.

A. COMPARISON OF ACTUAL AND PREDICTED DATA
Our initial set of evaluation findings shows the performance of our approaches in predicting air quality by comparing the actual values with the values predicted by the models. By visually comparing these two sets of values, we can quickly assess how close they are, which offers insight into the accuracy of each model. Figure 4 presents the actual values plotted against the predicted values for linear regression. The blue line represents the ideal regression line, and the model’s accuracy depends on how closely the data points align with this line. Upon examining the linear regression results, it becomes evident that the data points are clustered at the bottom of the graph and are not closely aligned with the regression line. This observation suggests that linear regression may not be the most suitable model for air quality prediction in this study.
Continuing with the evaluation of regression models, Figure 5 compares the actual and predicted values of the Decision Tree regression model. Analyzing the results, we observe that the data points are more evenly distributed throughout the graph and closer to the regression line than in the Linear regression case. This indicates that our study’s Decision Tree regression model performs better in predicting air quality.
The improved distribution and proximity of data points to the regression line in the case of Decision Tree regression signify a higher level of accuracy and reliability in predicting air quality compared to linear regression. This suggests that the Decision Tree regression model may provide more precise predictions based on the dataset.
Concluding the evaluation of regression models, Figure 6 illustrates the actual and predicted values of the Random Forest regression model. Upon analysis, we observe that the data points lie close to the regression line, and the graph looks similar to that of the Decision Tree model. While the Random Forest model may offer advantages in handling complex relationships and reducing overfitting, the Decision Tree model’s simplicity and interpretability make it a compelling option for understanding the factors influencing air quality. The Decision Tree model can provide valuable insight into the variables with the most significant impact on air quality, aiding decision-making processes.

B. PERFORMANCE EVALUATION USING DIFFERENT CONFIGURATIONS
This section presents the second set of evaluation results, showing our approach’s performance when the regression models are applied in two configurations: a personal laptop and a cloud platform. Assessing the models’ performance on different platforms is crucial to ensure the reliability and suitability of the models for real-world applications. Additionally, it helps assess the effect of computational resources on the models’ performance. The following sub-sections provide the evaluation results for each configuration.
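As a small illustration of how MAE and RMSE can separate a linear model from a tree-based one on non-linear data, the toy sketch below fits a least-squares line and a one-split regression tree (a stump) to the same points. Everything here is an assumption made for illustration: the step-shaped data, the single input feature, and the helper names. It is not the study's pipeline, and both models are scored on the points they were fitted to, so it says nothing about generalisation to a testing dataset.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_stump(xs, ys):
    """One-split regression tree: choose the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def mae(actual, predicted):
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


# Hypothetical step-shaped relationship between one input and the AQI.
xs = list(range(20))
ys = [20.0 if x < 10 else 120.0 for x in xs]

slope, intercept = fit_line(xs, ys)
line_pred = [slope * x + intercept for x in xs]

threshold, left_mean, right_mean = fit_stump(xs, ys)
stump_pred = [left_mean if x < threshold else right_mean for x in xs]

print("line  MAE/RMSE:", mae(ys, line_pred), rmse(ys, line_pred))
print("stump MAE/RMSE:", mae(ys, stump_pred), rmse(ys, stump_pred))
```

On this deliberately step-shaped data the stump recovers the jump exactly, while the straight line cannot, so its errors stay well above zero.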


FIGURE 6. Actual vs predicted for random forest regression.

TABLE 2. Evaluation results of training dataset using laptop configuration.

TABLE 3. Evaluation results of testing dataset using laptop configuration.

1) PERFORMANCE EVALUATION IN FIRST CONFIGURATION
In the first configuration, Tables 2 and 3 present the error evaluation metrics, specifically RMSE and MAE, for the regression models when executed on a personal laptop. The findings show that the Decision Tree model outperforms the other models. It achieved an MAE of 2.02% and an RMSE of 10.14%, indicating its ability to make predictions with minimal average error and variability. On the other hand, Linear regression performed poorly, with a relatively high MAE of 32.19% and RMSE of 42.70%, suggesting that Linear regression may not be a suitable model for accurately predicting air quality.

2) PERFORMANCE EVALUATION IN SECOND CONFIGURATION
In the second configuration, the performance of the regression models was evaluated using a cloud platform. The evaluation outcomes for the training and testing datasets are detailed in Tables 4 and 5, respectively.

TABLE 4. Evaluation results of training dataset using cloud configuration.

TABLE 5. Evaluation results of testing dataset using cloud configuration.

Upon analyzing the evaluation metrics, we observe that the performance of the linear regression model remains relatively stable compared to the first configuration. The MAE and RMSE values exhibit minimal variation, implying that the change in configuration does not significantly affect the model’s performance. On the other hand, there is a slight improvement in the performance of the Decision Tree regression model when running on the cloud platform. The MAE and RMSE values for the training dataset show a marginal decrease, with an MAE of 1.97% and an RMSE of 9.94%. This indicates a slightly improved ability to predict air quality compared to the first configuration. Table 5 shows that the Random Forest performance is comparable to that of the Decision Tree model. Both models show similar MAE and RMSE values and hence similar predictive capabilities. However, it is worth noting that the Random Forest model tends to have a longer execution time, which may limit its suitability and efficiency for certain real-world applications where time is crucial.

C. EXECUTION TIME COMPARISON
This study compared the execution time of the three regression models with the SMOTER technique on two different platforms: a personal laptop and a cloud. The goal was to evaluate the impact of cloud computing technology on the efficiency and speed-up of these models. The results presented in Table 6 demonstrate a significant reduction in execution time when the models run on the cloud platform compared to the personal laptop. This reduction highlights the advantages of utilizing cloud computing technology for machine learning tasks. For example, the execution time for SMOTER was reduced from 1292.89 seconds on the personal laptop to 464.22 seconds on the cloud, a reduction of approximately 64%. Similarly, the execution time of Decision Tree decreased from 0.46 seconds on the personal laptop to 0.28 seconds on the cloud, a reduction of approximately 39%.
Moreover, for the Random Forest model the execution time was reduced from 39.40 seconds on the personal laptop to 17.27 seconds on the cloud platform, a reduction of approximately 56%. On the other hand, the execution time for the Linear Regression model was already relatively low on


the personal laptop, with only 0.07 seconds, and it further decreased to 0.02 seconds on the cloud.

TABLE 6. Model execution times in seconds.

These findings demonstrate the benefits of utilizing cloud computing technology in reducing the execution time of regression models. Reducing the execution time of models helps to achieve more efficient machine learning models. Particularly for larger and more complex datasets, cloud computing frameworks enable data processing and model training to be distributed, providing a way to avoid computational bottlenecks and expedite the machine learning workflow.
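The percentage reductions quoted in this section follow directly from the reported timings. A minimal sketch of that arithmetic, using only the laptop and cloud times stated in the text (the dictionary layout and function names are illustrative, not from the study):

```python
# Execution times in seconds as stated in the text: (personal laptop, cloud).
timings = {
    "SMOTER": (1292.89, 464.22),
    "Decision Tree": (0.46, 0.28),
    "Random Forest": (39.40, 17.27),
    "Linear Regression": (0.07, 0.02),
}


def reduction_percent(laptop_s, cloud_s):
    """Relative reduction in execution time when moving from laptop to cloud."""
    return 100.0 * (laptop_s - cloud_s) / laptop_s


def speedup(laptop_s, cloud_s):
    """How many times faster the cloud run is than the laptop run."""
    return laptop_s / cloud_s


for name, (laptop_s, cloud_s) in timings.items():
    print(
        f"{name}: {reduction_percent(laptop_s, cloud_s):.1f}% shorter, "
        f"{speedup(laptop_s, cloud_s):.2f}x speed-up"
    )
```

With these values the computation gives roughly 64% for SMOTER and 56% for Random Forest, consistent with the text, and roughly 39% for Decision Tree and 71% for Linear Regression.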
V. CONCLUSION
This study provides a comprehensive comparative analysis of different regression models for predicting air quality in smart cities. Notably, the Decision Tree regression model demonstrated higher performance than the other regression models. Incorporating Exploratory Data Analysis and the SMOTER technique played a pivotal role in enhancing model accuracy by addressing data imbalances and optimizing feature selection. Moreover, the study emphasized the advantages of utilizing cloud computing in regression modeling. Utilizing cloud resources reduced model execution time, resulting in enhanced efficiency and scalability. This accelerated experimentation, training, and deployment of the models, enhancing their practical applicability in real-world applications.
For future work, we recommend exploring diverse machine-learning approaches for predicting air quality and air pollution in smart cities. Additionally, investigating the effect of meteorological data, including temperature, pressure, humidity, and wind speed, could further enhance AQI and air pollution prediction accuracy. Such work would provide valuable insight into identifying air quality levels and contribute to more effective air quality management approaches.

SHOROUQ AL-EIDI received the M.S. degree in computer science from the Jordan University of Science and Technology, Jordan, and the Ph.D. degree in computer science from the Memorial University of Newfoundland, Canada. She is an Assistant Professor with the Computer Science Department, Tafila Technical University. Her research interests include cyber security, machine learning, networks, and big data analysis.


FATHI AMSAAD received the bachelor’s degree in computer science from the University of Benghazi, Libya, in 2002, and the dual master’s degrees in computer science and computer engineering from the University of Bridgeport, CT, USA, in 2011 and 2012, respectively, and the Ph.D. degree in engineering with an emphasis on computer science and engineering from the University of Toledo, OH, USA, in 2017. He is an Assistant Professor of Computer Science and Engineering at Wright State University, Dayton, OH, USA. He has supervised over ten graduate students, including Niveshitha Niveshitha. He has established the Semiconductor Microelectronics Assurance, Resilience, and Trust (SMART) Cybersecurity Research Lab at 490 Joshi Research Center, Computer Science and Engineering Department, Wright State University. At the SMART Cybersecurity Research Laboratory, he leads a research team comprising several graduate students (master’s and Ph.D.), a Post-Doctoral Researcher, and a Research Assistant Professor. His research interests include Assured and Trusted Digital Microelectronics, Secure Heterogeneous Integration and Advanced Packaging, Blockchain-enabled Federated Learning, IoT Hardware Security, Machine/Deep Learning for Cybersecurity, AI Distributed Cloud Computing, Secure AI Hardware Accelerators, and Resilient Circuit Design (Memory/Microprocessor/ASICs/FPGAs). Both government and industry fund his research including AFRL, AFOSR, Intel, NSA, and the Ohio Department of Education. He has participated in several collaborative research proposals that have led to a cumulative sum of about $33 Million, including all partners along with Wright State University. He has served as an Organizer, Program Chair, Technical Program Committee member, Guest Editor, and on the Reviewer Board for several international conferences and journals. In addition to his research activities, he has established teaching experience in hardware security, IoT and embedded systems security, distributed computing, digital systems, and network administration and security curriculum.

OMAR DARWISH (Senior Member, IEEE) received the M.S. degree from the Jordan University of Science and Technology, Jordan, and the Ph.D. degree in computer science from Western Michigan University, USA. He is an Assistant Professor with the Information Security and Applied Computing Department, Game Above College of Engineering and Technology, Eastern Michigan University. He was an Assistant Professor, a Program Coordinator of computer information systems, and the Director of the IoT and Cybersecurity Laboratory, Ferrum College; a Visiting Assistant Professor with the Institute of Technology, West Virginia University; a Software Engineer with MathWorks; and a Programmer with Nuqul Group. His research interests include cyber security, the IoT, machine learning, networks, big data analysis, cloud computing, artificial intelligence, data mining, and information retrieval.

YAHYA TASHTOUSH received the B.Sc. and M.Sc. degrees in electrical engineering from the Jordan University of Science and Technology (JUST), Irbid, Jordan, in 1995 and 1999, respectively, and the Ph.D. degree (joint degree) in computer engineering from The University of Alabama in Huntsville, AL, USA, and the University of Alabama at Birmingham, AL, in 2006. He is a Full Professor with the College of Computer and Information Technology, JUST. His current research interests are the IoT, deep/machine learning, wireless networks, robotics, and fuzzy systems.

ALI ALQAHTANI received the Ph.D. degree in computer engineering from Oakland University, Rochester Hills, MI, USA, in 2020. He is currently an Assistant Professor with Najran University (NU). His research interests include machine learning in general and deep learning in image and signal processing, wireless vehicular networks (VANETs), wireless sensor networks, and cyber-physical systems.

NIVESHITHA NIVESHITHA (Graduate Student Member, IEEE) is a Graduate Student with Wright State University. His research interests include artificial intelligence, machine learning, and cloud computing.
