Atmosphere 15 01163
Article
Development of Machine Learning and Deep Learning
Prediction Models for PM2.5 in Ho Chi Minh City, Vietnam
Phuc Hieu Nguyen 1,2, *, Nguyen Khoi Dao 1,2 and Ly Sy Phu Nguyen 1,2
Abstract: The application of machine learning and deep learning in air pollution management is
becoming increasingly crucial, as these technologies enhance the accuracy of pollution prediction
models, facilitating timely interventions and policy adjustments. They also facilitate the analysis
of large datasets to identify pollution sources and trends, ultimately contributing to more effective
and targeted environmental protection strategies. Ho Chi Minh City (HCMC), a major metropolitan
area in southern Vietnam, has experienced a significant rise in air pollution levels, particularly PM2.5 ,
in recent years, creating substantial risks to both public health and the environment. Given the
challenges posed by air quality issues, it is essential to develop robust methodologies for predicting
PM2.5 concentrations in HCMC. This study seeks to develop and evaluate multiple machine learning
and deep learning models for predicting PM2.5 concentrations in HCMC, Vietnam, utilizing PM2.5 and
meteorological data over 911 days, from 1 January 2021 to 30 June 2023. Six algorithms were applied:
random forest (RF), extreme gradient boosting (XGB), support vector regression (SVR), artificial
neural network (ANN), generalized regression neural network (GRNN), and convolutional neural
network (CNN). The results indicated that the ANN is the most effective algorithm for predicting
PM2.5 concentrations, with an index of agreement (IOA) value of 0.736 and the lowest prediction
errors during the testing phase. These findings imply that the ANN algorithm could serve as an
effective tool for predicting PM2.5 concentrations in urban environments, particularly in HCMC.
This study provides valuable insights into the factors that affect PM2.5 concentrations in HCMC and
emphasizes the capacity of AI methodologies in reducing atmospheric pollution. Additionally, it
offers valuable insights for policymakers and health officials to implement targeted interventions
aimed at reducing air pollution and improving public health.
Keywords: PM2.5; prediction; machine learning; deep learning; Ho Chi Minh City
Citation: Nguyen, P.H.; Dao, N.K.; Nguyen, L.S.P. Development of Machine Learning and Deep Learning
Prediction Models for PM2.5 in Ho Chi Minh City, Vietnam. Atmosphere 2024, 15, 1163.
https://doi.org/10.3390/atmos15101163
primarily driven by metropolitan expansion, industrial growth, and a surge in road traf-
fic [4]. As per a report from GreenID, in the three years 2016, 2017, and 2018, the air quality
index (AQI) results reveal a troubling trend, as the percentage of AQI readings categorized
as unhealthy escalated from 31.1% in 2016 to 41.9% in 2017, and subsequently to 44.2% in
2018 [5]. The declining air quality in HCMC has emerged as a critical issue for politicians,
health officials, and the general populace. This issue has led to increased healthcare costs
and economic burdens, necessitating urgent actions such as stricter emissions regulations,
public health campaigns, and community engagement to mitigate the adverse effects and
improve overall air quality. Therefore, developing effective strategies to predict PM2.5 in
HCMC is crucial.
Air pollutant concentration prediction models play a vital role in both assessing
and managing air quality, offering critical insights for policymakers and environmental
managers. These models are indispensable in optimizing air quality monitoring systems
by providing detailed information on pollution levels, sources of pollutants, and the
overall status of air quality in different regions [6]. By predicting future pollution levels,
decisionmakers can take proactive measures such as issuing early warnings, implementing
public health campaigns, or enforcing stricter emission regulations to mitigate adverse
environmental and health impacts. The ability to forecast air quality at specific future points
allows for better preparedness and more informed decision making regarding pollution
control strategies [7].
There are two primary types of PM2.5 concentration prediction models: knowledge-
driven models and data-driven models. Knowledge-driven models, such as chemical
transport models, are based on atmospheric science and require a thorough understanding
of pollutant emission sources, transport mechanisms, and chemical transformations in
the atmosphere. These models simulate the diffusion, transmission, and cross-regional
transport of pollutants, making them highly valuable for detailed atmospheric analysis.
However, they are often computationally intensive, requiring precise input data and exten-
sive computational resources. Additionally, the complexity of atmospheric interactions and
uncertainties in emission inventories can limit the accuracy of these models, particularly
when dealing with real-world data that may not be as well structured as experimental con-
ditions. In contrast, data-driven models are more practical and have become increasingly
popular due to their ability to handle large datasets and generate predictions based on
data characteristics. These models, which rely on statistical methods or machine learning
techniques, are often more flexible and less dependent on detailed atmospheric knowledge.
Data-driven approaches can be classified into three groups: statistical models, artificial
intelligence (AI) models, and hybrid models that combine elements of both. AI models,
in particular, have gained significant attention in recent years, driven by advancements in
computational power and the Fourth Industrial Revolution. These models are known for
their high accuracy and reliability in air quality prediction, as they excel at modeling com-
plex, nonlinear phenomena and relationships between numerous exogenous variables [8].
One of the key strengths of AI-based models is their ability to process vast amounts of data
rapidly, offering significant advantages in terms of processing speed, scalability, and cost-
effectiveness. AI models, such as machine learning and deep learning techniques, have the
capability to impute missing data, identify hidden patterns in large datasets, and generate
accurate predictions of pollution levels at specific future points. Moreover, AI models can
provide spatial and temporal predictions, allowing researchers to estimate pollution levels
for particular areas and times with a high degree of precision [9]. In addition to their ability
to simulate air quality, AI techniques also allow for real-time monitoring and adaptive
learning, further enhancing their effectiveness in dynamic environments.
Numerous studies have proven the effectiveness of AI models in predicting air quality,
specifically PM2.5 levels. For example, Bingyue Pan (2018) employed the XGBoost algorithm
to predict PM2.5 levels in Tianjin City [10], whereas Zamani et al. (2019) applied random
forest, XGBoost, and deep learning methodologies utilizing multiplatform remote sensing
information to forecast PM2.5 levels in Tehran [11]. Goulier et al. (2020) provided an hourly
Atmosphere 2024, 15, 1163 3 of 19
forecast of ten atmospheric pollutant levels in Münster utilizing an artificial neural network
(ANN) methodology [12], whereas Castelli et al. (2020) employed support vector regression
(SVR) to anticipate pollutant and particle levels in California [13]. Likewise, Gou et al.
(2020) utilized statistical correlation evaluation and ANNs to discern relationships among
the air pollution index and weather variables in Xi’an and Lanzhou [14]. Doreswamy
et al. (2020) created machine learning models to forecast PM levels of the atmospheric
conditions of Taiwan [15]. In the U.S., Zhou et al. (2020) examined several machine
learning methodologies employed for air pollution prediction, with applications that span
multiple regions, including high-pollution urban areas [16]. Similarly, Chen et al. (2020)
investigated how climate change influences PM2.5 levels, using multimodel projections
to assess the effectiveness of predictive models in the U.S. [17]. In Europe, Ordóñez et al.
(2020) utilized multimodel simulations combined with machine learning techniques to
improve air quality predictions, particularly for PM2.5 levels, while Petetin et al. (2020)
developed high-resolution forecasting models to enhance predictive accuracy across the
continent [18,19]. In China, Zheng et al. (2021) employed deep learning models to enhance
the accuracy of PM2.5 concentration predictions, showcasing the latest advancements in
air quality modeling [20]. These recent studies underscore the global applicability of
machine learning in tackling air pollution and provide a strong foundation for the methods
employed in this paper for HCMC.
In HCMC, PM2.5 data are scarce, and few studies have attempted to predict PM2.5 levels. Vo
et al. (2021) applied the WRF model to predict PM2.5 levels in HCMC [21]. Their study aimed
to evaluate the prediction of PM2.5 concentration by predicting meteorological variables
using the WRF model. In addition to utilizing a limited number of meteorological factors
(four variables), their study did not thoroughly address the optimization of input scenarios.
Another study by Rajnish et al. (2023) built a multivariate model for predicting air quality,
taking into account meteorological conditions, air quality metrics, urban spatial data, and
time factors to forecast hourly NO2, SO2, O3, and CO concentrations [22]. This research
made significant progress in forecasting using spatially
scattered data; however, the duration of data collection was relatively short, spanning only
from February to December 2021. Additionally, data utilized in this investigation were
gathered from monitoring stations associated with a specific research project, rather than
from official, reliable, and publicly accessible government sources.
The primary objective of this research is to develop and evaluate various machine
learning and deep learning algorithms for predicting PM2.5 concentrations in HCMC, using
meteorological and PM2.5 data. The results from this study are expected to enhance the
comprehension of the determinants affecting PM2.5 levels in this metropolis and underscore
the potential of AI methodologies in alleviating air pollution and promoting public health.
2. Methodology
This section delineates the methods utilized for prediction of PM2.5 levels in HCMC
using various machine learning and deep learning algorithms. The establishment of a
PM2.5 prediction model has five key steps (Figure 1): (1) processing the data, (2) analyzing
the impact of parameters on PM2.5, (3) designing scenarios of input datasets, (4) modeling
machine learning and deep learning algorithms to predict PM2.5, and (5) selecting the best
prediction model for PM2.5 among the developed models.
Firstly, daily data over 911 days, from 1 January 2021 to 30 June 2023, including
meteorological and PM2.5 parameters in HCMC, were collected for the development of
a prediction model. The meteorological data included ambient temperature, relative
humidity, wind speed, rainfall, sunshine hours, and evaporation that were collected from
Tan Son Hoa weather station (10.79723◦ N, 106.6667◦ E) at 236B Le Van Sy, Tan Binh District,
while the PM2.5 data were obtained from the monitoring station at the U.S. Consulate
(10.7831◦ N, 106.7001◦ E) at 4 Le Duan street in District 1 in HCMC. These two stations are
about 4 km apart as the crow flies and about 5 km apart by road. The collected data were
then processed by removing unavailable data points and outliers.
Secondly, the processed data were analyzed to determine feature importance and
identify the impact of the examined parameters on the objective function.
Thirdly, different sets of input data were generated to develop machine learning-based
prediction models.
Fourthly, various machine learning and deep learning algorithms, including RF, XGB,
SVR, ANN, GRNN, and CNN, were employed to formulate predictive models for PM2.5
in HCMC. Each algorithm employed in this work presents a distinct methodology for
addressing the intricacies of PM2.5 prediction, providing varied advantages in feature
selection, model training, and predictive accuracy. To evaluate the performance of these
models, the dataset was split into training and testing sets. Specifically, 80% of the data
were allocated for training, while the remaining 20% were designated for testing to evaluate
model performance. For deep learning algorithms such as ANN and CNN, which require a
validation set to monitor model training and prevent overfitting, the training data were
further divided. In this case, 80% of the training data were used as sub-training data,
and 20% of the training data were used for validation. This approach ensured that model
training could be halted when validation performance began to decline, reducing the risk
of overfitting. The remaining 20% of the dataset was consistently used as the testing set
across all models to provide a final evaluation of prediction accuracy.
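The nested 80/20 split described above can be sketched as follows. This is a minimal illustration that assumes the records are kept in their original order; the paper does not state whether the split was chronological or shuffled, and `split_dataset` and the placeholder `days` list are hypothetical names rather than the authors' code.

```python
def split_dataset(records, train_frac=0.8, val_frac=0.2):
    """Split records into sub-training, validation, and testing parts.

    Assumption: order is preserved (no shuffling); the source text does not
    say how the samples were assigned to each part.
    """
    n_train = int(len(records) * train_frac)
    train, test = records[:n_train], records[n_train:]
    # Carve the validation part out of the training part only.
    n_sub = int(len(train) * (1 - val_frac))
    return train[:n_sub], train[n_sub:], test


days = list(range(911))  # placeholder for the 911 daily records
sub_train, val, test = split_dataset(days)
print(len(sub_train), len(val), len(test))
```

With 911 records this gives 582 sub-training, 146 validation, and 183 testing samples. The counts in the study would differ somewhat, because unavailable data points and outliers were removed first.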
Figure 1. Workflow for developing a PM2.5 prediction model.
Random forest, often known as RF, is a type of ensemble learning technique that
works by creating a vast ensemble of decision trees through the training process and then
returning the average forecast of the individually constructed trees [23]. Its ensemble
structure renders it highly resilient to overfitting. The program initiates the process by
generating a randomized dataset derived from the primary data. For every bootstrapped
sample, a decision tree is built by selecting the best split from a randomly chosen subset of
features at every node. The bootstrap aggregation technique generates multiple bootstrap
samples through sampling with replacement, from which decision trees are constructed.
The ultimate prediction is the mean of the forecasts from all individual trees [24]. This study
selects RF to predict PM2.5 levels due to its superior performance across several domains,
resilience to overfitting, and efficacy in situations characterized by highly nonlinear and
complex relationships between features and target variables. The performance of the RF
model can be regulated by tuning hyperparameters including the quantity of trees in the
forest, the depth of the trees, the minimum samples required for a split, the minimum
samples required for a leaf node, and the maximum possible leaf nodes [25].
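The bootstrap aggregation idea described above can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: each tree is reduced to a single split (a depth-1 stump), so the per-node feature subset coincides with a per-tree subset, and all names and the toy numbers are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree using only the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        m = mean(y)
        return lambda row: m
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm


def fit_forest(X, y, n_trees=50, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feats = rng.sample(range(d), n_features)    # random subset of features
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    """The forest prediction is the mean of the individual tree predictions."""
    return mean(tree(row) for tree in trees)


# Made-up toy values for illustration only (two input features, one target).
X = [[60.0, 30.1], [65.0, 29.5], [70.0, 28.7], [75.0, 28.0], [80.0, 27.2], [85.0, 26.9]]
y = [35.0, 31.0, 27.0, 22.0, 18.0, 15.0]
trees = fit_forest(X, y)
print(predict_forest(trees, [72.0, 28.5]))
```

A full random forest grows deeper trees and redraws the feature subset at every node; the hyperparameters listed in the text (number of trees, depth, minimum samples per split and leaf, maximum leaf nodes) control exactly those aspects.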
XGBoost is a powerful and scalable ensemble learning method widely used for regres-
sion and classification problems. It improves on traditional gradient boosting by optimizing
the handling of regularization and model optimization [26]. This work employs XGB to
forecast PM2.5 values by using its capacity to simulate intricate linkages and interactions in
the dataset. The algorithm processes historical air quality and meteorological data, which
enables it to discern patterns that affect PM2.5 concentrations. XGB possesses numerous
hyperparameters that could be adjusted to enhance performance, such as the learning rate,
the ensemble size, the tree depth, the sample size for each tree, and the feature count for
each tree [27].
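The roles of the learning rate and the ensemble size can be seen in a plain gradient boosting loop. The sketch below is an assumption-laden simplification: it uses squared-error loss and single-split stumps on one feature, and it omits the regularization terms and second-order information that distinguish XGBoost from traditional gradient boosting. The function names and toy values are hypothetical.

```python
from statistics import mean


def fit_stump(x, r):
    """Best single-threshold split of a 1-D feature x against residuals r."""
    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(x))[:-1]:  # splitting at the maximum would empty the right side
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = mean(left), mean(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x value: constant prediction
        m = mean(r)
        return lambda xi: m
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_rounds=200, learning_rate=0.1):
    base = mean(y)  # initial prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Made-up toy values for illustration only.
x = [60.0, 65.0, 70.0, 75.0, 80.0, 85.0]
y = [35.0, 31.0, 27.0, 22.0, 18.0, 15.0]
model = boost(x, y)
print(model(72.0))
```

A small learning rate shrinks each tree's contribution, so more rounds are needed to reach the same fit, which is why the two hyperparameters are usually tuned together.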
Support vector regression (SVR) is a type of machine learning algorithm used for
regression tasks, which is derived from the support vector machine (SVM) framework [28].
SVR is recognized for its capacity to manage high-dimensional
data and to represent nonlinear relationships via kernel functions [29]. It develops a model
by transforming input data into a higher-dimensional space to facilitate linear regression
analysis. SVR seeks to identify a function that deviates from the observed targets
by no more than a defined margin ϵ, while simultaneously remaining as flat as possible.
This study utilizes SVR to forecast PM2.5 concentrations in HCMC by training the model
using atmospheric quality and meteorological data. The hyperparameters in SVR comprise
the regularization parameter C, the epsilon ϵ that delineates the margin of tolerance, and
the parameters linked to the selected kernel function, such as the kernel coefficient γ for
the radial basis function kernel.
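The two ingredients named above, the ϵ margin and the radial basis function kernel with coefficient γ, can be written down directly. This is a minimal sketch of the standard definitions, not a full SVR solver; the function names and default values are illustrative assumptions.

```python
import math


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


losses = epsilon_insensitive_loss([20.0, 25.0], [20.05, 27.0], epsilon=0.1)
print(losses)
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
```

The regularization parameter C then weights the sum of these losses against the flatness of the fitted function: a larger C tolerates fewer deviations beyond ϵ.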
Artificial neural networks (ANNs) are a category of machine learning techniques
designed to emulate the architecture and functionality of the human brain [23,30]. An
artificial neural network consists of several interconnected processing nodes, or neurons,
that collaboratively execute intricate computations. The method processes a collection
of input data via multiple hidden layers to obtain an output. Each neuron within the
network takes input from neurons in the preceding layer and uses an activation function
to generate an output [31]. The output from each neuron is subsequently transmitted to
the neurons in the subsequent layer, and this process is reiterated until the output layer
is reached. The design of the network, comprising the quantity of layers and the number
of neurons per layer, can be tailored to improve efficiency for a certain task. Training
ANNs entails modifying weights and biases of neurons to reduce a loss function, which
quantifies the disparity between the expected output and observed output [30]. This
process is generally executed by backpropagation that entails calculating the gradient of
the loss function concerning the weights and biases, subsequently employing it to adjust
the network’s parameters. Utilizing the adaptability and efficacy of learning intricate and
nonlinear relationships among variables, ANNs are employed to predict PM2.5 levels by
training the network on air quality data, meteorological data, and
other pertinent factors. The model acquires the ability to discern intricate patterns and
relationships in the data that affect PM2.5 values. An optimum artificial neural network
design consists of a configuration of hyperparameters, such as the quantity of hidden
layers, the quantity of neurons in each hidden layer, activation function, learning rate,
weight constraints, and dropout rate, which produce the most accurate predictions on the
validation data.
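The layer-by-layer computation described above amounts to a weighted sum followed by an activation function. The forward pass below is a minimal sketch with one hidden layer; the ReLU activation, the layer sizes, and the weight values are assumptions chosen for illustration and are not taken from the study.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each neuron computes w . x + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, W1, b1, W2, b2):
    hidden = relu(dense(x, W1, b1))   # hidden layer with activation
    return dense(hidden, W2, b2)[0]   # single linear output neuron


# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
W1 = [[1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, 3.0]]
b2 = [1.0]
print(forward([1.0, 2.0], W1, b1, W2, b2))  # prints 3.0
```

Training by backpropagation would adjust `W1`, `b1`, `W2`, and `b2` to reduce the loss between this output and the observed value.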
Generalized regression neural networks (GRNNs) [32] are a category of artificial neural
networks grounded in nonparametric predictive modeling. GRNNs are recognized for
their rapid training capabilities and proficiency in modeling intricate correlations between
input and target variables [33]. GRNNs comprise four layers: the input layer, pattern layer,
summation layer, and output layer. Each neuron in the pattern layer denotes a training
example and computes a distance metric to the input. These distances are consolidated
by the summation layer, which produces weighted outputs. The output layer delivers
a predicted value derived from these consolidated data. The fundamental principle of
GRNNs is the application of kernel regression to approximate the conditional expectation
of the output variable based on the input parameter. It utilizes a radial basis function
to evaluate the probability density of data points and generates predictions based on
the weighted aggregation of these functions. GRNNs are adept at managing noisy and
intricate datasets, making them well suited for predicting air quality, particularly PM2.5
concentrations, which are influenced by numerous factors. The smoothing parameter (σ)
is a hyperparameter in GRNNs. The performance of the model is highly sensitive to the
value of σ, with smaller values potentially resulting in overfitting, while larger values may
lead to underfitting.
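The pattern, summation, and output layers reduce to a kernel-weighted average of the training targets. The sketch below assumes a Gaussian radial basis function and Euclidean distance, which is the usual formulation; `grnn_predict` and the toy data are hypothetical.

```python
import math


def grnn_predict(x, train_X, train_y, sigma=1.0):
    """Kernel regression estimate of the target at input x."""
    weights = []
    for xi in train_X:  # pattern layer: one unit per training example
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    # Summation and output layers: weighted sum divided by the sum of weights.
    # Note: for a very small sigma and a distant x, all weights can underflow to 0.
    return sum(w * yi for w, yi in zip(weights, train_y)) / sum(weights)


train_X = [[0.0], [1.0], [2.0]]
train_y = [10.0, 20.0, 30.0]
print(grnn_predict([0.0], train_X, train_y, sigma=0.05))    # close to 10: follows the nearest point
print(grnn_predict([0.0], train_X, train_y, sigma=1000.0))  # close to 20: the overall mean
```

The two calls show the sensitivity to σ: a very small value reproduces the nearest training target (overfitting), while a very large value flattens every prediction toward the mean (underfitting).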
A convolutional neural network (CNN) is a form of deep learning model that integrates
several layers including convolutional, pooling, and fully connected layers. Convolutional
layers utilize filters on input data to identify characteristics, such as edges or textures, via
convolution processes. The dimensionality of data is reduced by pooling layers, which
utilize maximum pooling to preserve significant features while reducing computational
demands. The fully connected layers at the end of the network integrate these features to
generate final predictions. CNNs excel in managing large-scale and high-dimensional
data, making them suitable for predicting PM2.5 levels. The convolutional layers assist in
recognizing critical characteristics and trends influencing PM2.5 values. To improve the
efficacy of the CNN model, essential hyperparameters such as the number and size of
convolutional filters, the depth of layers, the learning rate, and the batch size are
systematically optimized.
Each model is tuned to its optimal hyperparameters
using a grid search method. The range for each hyperparameter is detailed in the results
section for each machine learning algorithm. Optimal hyperparameters are determined
using the validation dataset. This optimization process aims to find the hyperparameters
that yield the minimal root mean square error (RMSE) on the validation set.
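A grid search that selects the combination with the lowest validation RMSE can be sketched as follows. The search loop is generic; the kernel smoother used as the model, its two hyperparameters (`sigma`, `power`), and the toy data are hypothetical stand-ins, since the paper does not show its tuning code.

```python
import itertools
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def grid_search(build_predictor, grid, train, val):
    """Try every combination in the grid; keep the one with minimal validation RMSE."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        predict = build_predictor(train, **params)
        score = rmse(val[1], [predict(x) for x in val[0]])
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def build_smoother(train, sigma, power):
    """Hypothetical stand-in model: kernel-weighted average of training targets."""
    X, y = train
    def predict(x):
        w = [math.exp(-(abs(x - xi) ** power) / (2 * sigma ** power)) for xi in X]
        return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return predict


# Made-up toy values for illustration only.
train = ([0.0, 1.0, 2.0, 3.0, 4.0], [10.0, 12.0, 15.0, 19.0, 24.0])
val = ([0.5, 1.5, 2.5, 3.5], [11.0, 13.5, 17.0, 21.5])
grid = {"sigma": [0.1, 0.5, 1.0, 5.0], "power": [1, 2]}
best_params, best_score = grid_search(build_smoother, grid, train, val)
print(best_params, round(best_score, 3))
```

The cost grows with the product of the grid sizes, so grids as large as those reported in the results section imply a substantial number of model fits per scenario.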
Finally, after evaluating the predictive outcomes of the constructed models, the best-
performing model is selected for PM2.5 prediction in HCMC. In this study, the prediction
models are evaluated using various metrics, including root mean square error (RMSE),
mean absolute percentage error (MAPE), index of agreement (IOA), and normalized mean
bias (NMB). These evaluation metrics are expressed as follows:
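$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{2}$$
$$\mathrm{IOA} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(\left|\hat{y}_i - \bar{y}\right| + \left|y_i - \bar{y}\right|\right)^2} \tag{3}$$
$$\mathrm{NMB} = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)}{\sum_{i=1}^{n} y_i} \tag{4}$$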
where n is the total number of data points; ȳ is the mean of the actual observed values; and yi
and ŷi are the actual observed value and predicted value for the ith data point, respectively.
RMSE measures the square root of the average squared difference between predicted and
observed values, and MAPE evaluates the mean absolute percentage error between forecasted
and actual values. Furthermore, the IOA quantifies the degree of agreement between predicted
and observed values on a scale from 0 to 1, with 1 signifying perfect concordance and 0 denoting
complete discordance. IOA is formulated to address certain constraints of the coefficient
of determination by offering a normalized metric of model prediction error. It considers
the disparities between the anticipated and observed values, providing a more balanced
measure of model performance, especially in cases with nonlinear relationships or when
dealing with outliers. On the other hand, NMB measures the average discrepancy between
the anticipated and observed values, normalized by the mean of the observed values. It
indicates the bias of the model’s predictions. Values approaching 0 signify little bias, whereas
positive values denote overestimation and negative values signify underestimation.
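The four metrics in Equations (1)-(4) translate directly into code. The sketch below follows the definitions given in the text; the function names and the three-point toy example are illustrative assumptions, not values from the study.

```python
import math


def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def mape(y, yhat):
    # Undefined when an observed value is zero.
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))


def ioa(y, yhat):
    ybar = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, yhat))
    den = sum((abs(b - ybar) + abs(a - ybar)) ** 2 for a, b in zip(y, yhat))
    return 1.0 - num / den


def nmb(y, yhat):
    return sum(b - a for a, b in zip(y, yhat)) / sum(y)


# Made-up observed and predicted values for illustration only.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(observed, predicted), mape(observed, predicted),
      ioa(observed, predicted), nmb(observed, predicted))
```

For this toy example the NMB is 0.05, a slight overall overestimation, even though one of the three predictions is too low; this is the sense in which NMB measures bias rather than error magnitude.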
Figure 2. Meteorological and PM2.5 data in HCMC from 1 January 2021 to 30 June 2023.
Table 1. Summary of meteorological and PM2.5 data in HCMC from 1 January 2021 to 30 June 2023.
Figure 3. Distribution of meteorological and PM2.5 data in HCMC: (a) temperature, (b) humidity,
(c) evaporation, (d) wind speed, (e) sunshine hours, (f) rainfall, and (g) PM2.5 concentration.
Each panel plots frequency and cumulative percentage.
The results showed that humidity, temperature, and wind speed have a strong cor-
relation with PM2.5 , while rainfall, evaporation, and sunshine hours have a moderate
correlation with PM2.5 .
Different scenarios of input parameters were generated to develop prediction models.
This allows us to evaluate the prediction performance under various sets of input parame-
ters. These scenarios were designed based on the Pearson correlation coefficient obtained
from the previous step. In this approach, scenarios were constructed by prioritizing the
most correlated parameters down to the less correlated ones [36,37]. Starting with the high-
est correlation coefficient, each scenario incrementally incorporated additional parameters
in descending order of their correlation. This stepwise approach shows how the predictive
power of the models evolves as features of decreasing correlation are sequentially added to
the input data. It provides insight into the cumulative effect of features on predictive
performance and aids the optimization and selection of input parameters. The input feature
scenarios are detailed in Table 3.
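The scenario construction described above can be sketched as a ranking followed by cumulative prefixes. The sketch assumes that features are ranked by the absolute value of the Pearson coefficient, which the text does not state explicitly; the feature names `x1` to `x3` and all values are synthetic.

```python
import math


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:  # constant series: correlation is undefined
        return 0.0
    return cov / (sx * sy)


def build_scenarios(features, target):
    """Scenario k contains the k features most correlated with the target."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


# Synthetic values for illustration only.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "x1": [3.0, 1.0, 4.0, 1.0, 5.0],   # weak correlation
    "x2": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfect negative correlation
    "x3": [1.0, 2.0, 2.0, 4.0, 5.0],   # strong positive correlation
}
for scenario in build_scenarios(features, target):
    print(scenario)
```

With six candidate inputs, as in the study, this procedure yields six nested scenarios, the last of which uses every feature.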
Hyperparameter Value
n_estimators 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000
max_depth 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 25, 30, 40, 50
min_samples_split 2, 4, 6, 8, 10, 20, 30, 40, 50, 70, 100
min_samples_leaf 1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 50
max_leaf_nodes 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
The random forest model demonstrated strong predictive capabilities, achieving IOA
values between 0.361 and 0.789 on the training set and 0.396 to 0.670 on the testing set
under all scenarios (Table 5). Generally, the inclusion of more input features improved the
accuracy in predicting PM2.5 concentration. In addition, the model achieved relatively low
RMSE, MAPE, and NMB values, indicating its effectiveness in accurately predicting PM2.5
concentrations under each scenario.
Table 5. Training and testing results of random forest predictive models for PM2.5 .
            Training                          Testing
Scenario    RMSE    MAPE     IOA     NMB      RMSE    MAPE     IOA     NMB
5           7.307   28.282   0.709   0.000    8.631   37.148   0.654   0.076
6           6.464   24.577   0.789   0.001    8.510   36.721   0.670   0.079
The highest prediction accuracy was achieved with input scenario 6, which resulted
in high IOA values and relatively low RMSE, MAPE, and NMB values for training and
testing evaluation. This indicates that the model is well fitted and capable of reliably
predicting PM2.5 concentrations based on the given input variables. The optimal model
was configured with an n_estimators of 1500, a max_depth of 12, a min_samples_split
of 4, a min_samples_leaf of 5, and max_leaf_nodes set to 90. Figure 4 illustrates the
training and testing results of this optimized random forest model, showing moderate
agreement between predicted and actual PM2.5 values, which suggests a reasonably strong
predictive performance.
Figure 4. Training and testing results from the optimal random forest model: (a) training result and
(b) testing result.
Hyperparameter Value
n_estimators 1000, 1500, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000
max_depth 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50
learning rate 0.0005, 0.0007, 0.0009, 0.001, 0.0011, 0.0013, 0.0015, 0.003, 0.005, 0.01, 0.1, 0.2, 0.3
subsample 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
colsample_bytree 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
min_child_weight 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The training and testing results across different scenarios are summarized in Table 7,
demonstrating moderate model performance. IOA values ranged from 0.302 to 0.740 during
training and from 0.320 to 0.687 during testing. The model achieved the best predictive
accuracy using input scenario 5, which comprised humidity, temperature,
wind speed, rainfall, and evaporation. However, adding more input features did not
significantly enhance prediction accuracy.
The optimal XGB model was obtained with hyperparameters including an n_estimators
of 10,000, a max_depth of 3, a learning rate of 0.0005, a subsample of 0.5, a colsample_bytree
of 0.8, and a min_child_weight of 3. The training and testing results of this optimal model
are depicted in Figure 5, which illustrates moderate alignment between predicted and
observed PM2.5 values, confirming the model’s satisfactory predictive ability.
Table 7. Training and testing results of the XGB predictive models for PM2.5 .
Figure 5. Training and testing results from the optimal XGB model: (a) training result and
(b) testing result.
3.3.3. Support Vector Regression Model
This section details the development and evaluation of an SVR algorithm for predicting
PM2.5 concentrations, using the input scenarios provided in Table 3. The models were
fine-tuned by adjusting the hyperparameters shown in Table 8.
Table 8. Range of hyperparameters for training SVR predictive models.
Hyperparameter Value
Kernel linear, poly, rbf, sigmoid
gamma scale, auto
epsilon 0.01, 0.1, 0.2, 0.5, 1.0
degree 2, 3, 4, 5
C 0.1, 1, 10, 100, 1000
The performance results across various scenarios are summarized in Table 9, where
the model showed moderate effectiveness. IOA values ranged from 0.322 to 0.720 during
training and from 0.361 to 0.709 during testing. The highest accuracy was obtained using
input scenario 5, which included factors like humidity, temperature, wind speed, rainfall,
and evaporation.
The optimal SVR model was configured with the following hyperparameters: a radial
basis function (rbf) kernel, C set to 1.0, epsilon at 0.1, and gamma set to scale. Figure 6
presents training and testing results of this optimal model, which showed strong agreement
between predicted and actual PM2.5 values, indicating a solid predictive capability.
Table 9. Training and testing results of SVR predictive models for PM2.5.
Figure 6. Training and testing results from the optimal SVR model: (a) training result and
(b) testing result.
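The SVR hyperparameter names used in this section match those of scikit-learn's `SVR`, so the sketch below assumes scikit-learn semantics to make the optimal settings concrete: `gamma="scale"` is computed as 1 / (n_features × variance of the inputs), and an epsilon of 0.1 defines a tube within which errors are not penalized. The helper names and the toy inputs are hypothetical and are not taken from the study's data.

```python
import math


def gamma_scale(X):
    """gamma = 1 / (n_features * Var(X)), with the variance taken over all entries."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Toy inputs with two features (hypothetical, for illustration only).
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
gamma = gamma_scale(X)
print("gamma (scale):", gamma)
print("kernel between the first two samples:", rbf_kernel(X[0], X[1], gamma))
print("loss inside the tube:", epsilon_insensitive_loss(20.0, 20.05))
print("loss outside the tube:", epsilon_insensitive_loss(20.0, 20.5))
```

Because gamma depends on the variance of the inputs under this setting, the effective kernel width changes with how the meteorological features are scaled before training.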
Table 11. Training and testing results of ANN predictive models for PM2.5.
Scenario 6 exhibited the best performance, with the highest IOA and the lowest
RMSE, MAPE, and NMB values for both training and testing datasets. Consequently, the
model developed using input scenario 6, and hyperparameters specified in Table 12, was
identified as the optimal ANN model. Figure 7 illustrates the training and testing outcomes
of this model, showing a strong alignment between predicted and measured PM2.5 values,
confirming its reliable predictive capability.
Table 12. Hyperparameters of the optimal ANN predictive model.
Hyperparameter Value
Number of hidden layers 4
Number of hidden neurons 60, 20, 30, 20
Activation function relu, tanh, relu, relu
Learning rate 0.0015
Dropout rate 0.4
Weight constraint 3
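To make the architecture in Table 12 concrete, the following sketch lays out the layer stack and counts its trainable parameters in plain Python. It assumes six input features (one per meteorological variable, which is an assumption about input scenario 6) and a single linear output neuron; neither is stated in the table. Dropout and the weight constraint affect training only and do not change the parameter count, so they appear here only as comments.

```python
HIDDEN_UNITS = [60, 20, 30, 20]               # from Table 12
ACTIVATIONS = ["relu", "tanh", "relu", "relu"]  # from Table 12
N_INPUTS = 6    # assumption: six meteorological inputs
N_OUTPUTS = 1   # assumption: one linear output (PM2.5 concentration)
# Training-only settings from Table 12, not used in the count below:
# learning rate 0.0015, dropout rate 0.4, weight constraint 3.


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def describe_network(n_inputs, hidden_units, activations, n_outputs):
    """Return (units, activation, parameter count) for every layer."""
    layers = []
    previous = n_inputs
    for units, activation in zip(hidden_units, activations):
        layers.append((units, activation, dense_params(previous, units)))
        previous = units
    layers.append((n_outputs, "linear", dense_params(previous, n_outputs)))
    return layers


layers = describe_network(N_INPUTS, HIDDEN_UNITS, ACTIVATIONS, N_OUTPUTS)
total_params = sum(params for _, _, params in layers)
for units, activation, params in layers:
    print(f"{units:>3} units  {activation:<6}  {params:>5} parameters")
print("total trainable parameters:", total_params)
```

Under these assumptions the network has 2,911 trainable parameters, a small model relative to the number of daily observations, which may partly explain why a comparatively high dropout rate was still found useful.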
Figure 7. Training and testing results from the optimal ANN model: (a) training result and
(b) testing result.
The optimal GRNN model was configured with rbf kernel and a sigma of 0.111.
Figure 8 illustrates the training and testing results of this optimized model, showing
moderate agreement between predicted and actual PM2.5 values, confirming its moderate
predictive capability.
Figure 8. Training and testing results from the optimal GRNN model: (a) training result and
(b) testing result.
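For readers unfamiliar with the GRNN, its prediction is a kernel-weighted average of the training targets, with the smoothing parameter sigma controlling how quickly a training sample's influence decays with distance. The sketch below is a minimal pure-Python version of that estimator with a Gaussian (rbf) kernel and sigma = 0.111. The toy data are hypothetical, and the inputs are assumed to have been scaled to a comparable range beforehand.

```python
import math


def grnn_predict(x, X_train, y_train, sigma=0.111):
    """Predict one target as a Gaussian-kernel-weighted average of y_train."""
    weights = []
    for xi in X_train:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed (x is far from all samples): use the mean.
        return sum(y_train) / len(y_train)
    return sum(w * y for w, y in zip(weights, y_train)) / total


# Hypothetical one-feature training set.
X_train = [[0.0], [1.0]]
y_train = [10.0, 20.0]

print(grnn_predict([0.0], X_train, y_train))  # dominated by the first sample
print(grnn_predict([0.5], X_train, y_train))  # equidistant from both samples
```

With a sigma this small, each prediction is dominated by the nearest training samples, so the model behaves almost like a nearest-neighbour lookup on scaled inputs; larger sigma values would smooth the predictions toward the overall mean.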
The training and testing results across different scenarios, summarized in the
corresponding results table, showed that the model’s performance ranged from moderate to
high. IOA values were between 0.396 and 0.581 for training, and between 0.437 and 0.607
for testing. The model achieved the highest accuracy using scenario 6, which incorporated
Figure 9. Training and testing results from the optimal CNN model: (a) training result and
(b) testing result.
metrics for testing sets suggest that the ANN outperforms the other assessed models.
Consequently, the trained ANN model was selected for predicting PM2.5 concentrations in
HCMC, Vietnam.
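The four evaluation metrics used throughout this comparison can be computed as in the sketch below. The formulas are the commonly used definitions (Willmott's index of agreement for IOA, and NMB as the summed bias divided by the summed observations) and are assumed, not confirmed, to match those defined in the Methods section. MAPE is expressed in percent here, and the sample values are made up for illustration.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error, in percent (observations must be non-zero)."""
    return 100.0 * sum(abs((p - o) / o) for o, p in zip(obs, pred)) / len(obs)


def nmb(obs, pred):
    """Normalized mean bias: positive values indicate over-prediction."""
    return sum(p - o for o, p in zip(obs, pred)) / sum(obs)


def ioa(obs, pred):
    """Willmott's index of agreement; 1.0 means perfect agreement."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((p - o) ** 2 for o, p in zip(obs, pred))
    denominator = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in zip(obs, pred)
    )
    return 1.0 - numerator / denominator


# Hypothetical daily PM2.5 values.
observed = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 33.0, 31.0]

print(f"IOA  = {ioa(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"MAPE = {mape(observed, predicted):.2f}%")
print(f"NMB  = {nmb(observed, predicted):.3f}")
```

In this toy example the positive and negative errors cancel, so NMB is zero even though RMSE is not, which is why the bias metric is read alongside the error and agreement metrics and not on its own.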
4. Discussion
This study provides a comprehensive comparison of the performance of six different
machine learning and deep learning algorithms, random forest, XGB, SVR, ANN, GRNN,
and CNN, in predicting PM2.5 concentrations. Additionally, meteorological variables
including temperature, humidity, wind speed, sunshine hours, rainfall, and evaporation
were included to enhance the prediction accuracy. Among the models, the ANN model
outperformed the others, achieving an IOA of 0.736, an RMSE of 7.978, and an NMB
of 0.032 during the testing phase. These findings highlight the effectiveness of machine
learning techniques in air quality prediction and underscore the importance of selecting an
appropriate algorithm for predicting air pollution. For health officials and policymakers,
the results demonstrate that machine learning models, especially the ANN model, can
predict PM2.5 concentrations with reasonable accuracy, which can inform the implementation
of effective strategies to mitigate health risks associated with PM2.5 exposure. For
instance, our model could enable
authorities to issue air quality alerts when PM2.5 levels are expected to rise above safe
thresholds. This allows citizens to take precautionary measures, such as staying indoors or
using masks on high-risk days. In addition, public health campaigns can be timed based
on pollution predictions, informing residents of exposure risks and protective actions like
using air filters or limiting outdoor activities.
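As a hedged illustration of how such alerts might be derived from model output, the sketch below flags forecast days whose predicted PM2.5 concentration exceeds a configurable threshold. The function name, the threshold value, and the forecast values are all hypothetical; an operational system would substitute the applicable national or WHO guideline value and account for forecast uncertainty.

```python
from datetime import date, timedelta


def days_to_alert(forecast, threshold):
    """Return the forecast dates whose predicted PM2.5 exceeds the threshold."""
    return [day for day, concentration in forecast.items() if concentration > threshold]


# Hypothetical five-day forecast of daily mean PM2.5.
start = date(2023, 7, 1)
predicted = [18.4, 27.9, 41.2, 36.5, 22.0]
forecast = {start + timedelta(days=i): value for i, value in enumerate(predicted)}

ALERT_THRESHOLD = 35.0  # illustrative placeholder, not a regulatory value

alerts = days_to_alert(forecast, ALERT_THRESHOLD)
for day in alerts:
    print("alert for", day.isoformat())
```

Keeping the threshold as a parameter lets the same routine serve different alert levels, for example a lower value for sensitive groups and a higher one for the general population.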
Despite the promising results, this study has several limitations that should be
addressed in future research. First, this study concentrates exclusively on PM2.5 levels in
HCMC. Broadening the scope to include additional localities in Vietnam would give a more
complete understanding of air quality throughout the nation.
Additionally, while machine learning and deep learning methods were applied to simulate and
predict PM2.5 concentrations, the study was limited by the availability of data from a single
automatic monitoring station—the U.S. Consulate station in HCMC. Consequently, the
results primarily reflect PM2.5 concentration levels within the vicinity of the consulate. A
larger number of standard automatic monitoring stations would enable a more generalized
and representative analysis of the entire study area.
Furthermore, this study focused on predicting PM2.5 concentrations based on
meteorological factors, but PM2.5 concentrations are also influenced by various other factors, such
as emission sources and the presence of other air pollutants. Emission sources, including
industrial zones, construction sites, and high-traffic areas, are closely related to PM2.5
concentrations. Factors such as the relative location and proximity of these sources to
monitoring stations significantly impact measured particulate concentrations. Additionally,
the concentrations of other air pollutants, such as NOx, SOx, CO2, and H2S, may interact with PM2.5
concentrations. Due to data limitations, these parameters were not included in this study.
Future research should examine the effect of these pollutants on PM2.5 concentrations and
consider integrating them into prediction models.
This study establishes a robust basis for subsequent research on PM2.5 predictions for
HCMC, and its findings can contribute to the development of effective air pollution control
and management strategies.
5. Conclusions
This study investigated the prediction of PM2.5 concentrations in HCMC utilizing
six distinct machine learning and deep learning algorithms. The models were trained
and validated on a dataset including temperature, humidity, wind speed, sunshine hours,
rainfall, and evaporation. Among the algorithms assessed, the ANN showed superior
performance in predicting PM2.5 levels, achieving an IOA of 0.736 and the lowest RMSE,
MAPE, and NMB values during testing. These results highlight the potential of machine
learning algorithms, particularly ANNs, in accurately predicting PM2.5 concentrations
based on meteorological data. The implications of this research are significant for HCMC,
where air pollution poses a critical public health concern. By utilizing these predictive
models, policymakers and health officials can implement more targeted and effective
interventions to mitigate air pollution, ultimately improving public health outcomes. This study
advocates for the integration of advanced machine learning techniques into environmental
monitoring systems, offering a framework for proactive urban air quality management.
References
1. Usmani, R.S.A.; Saeed, A.; Abdullahi, A.M.; Pillai, T.R.; Jhanjhi, N.Z.; Hashem, I.A.T. Air Pollution and Its Health Impacts in
Malaysia: A Review. Air Qual. Atmos. Health 2020, 13, 1093–1118. [CrossRef]
2. Health and Environmental Effects of Particulate Matter (PM). Available online: https://www.epa.gov/pm-pollution/health-and-
environmental-effects-particulate-matter-pm (accessed on 1 May 2024).
3. WHO. Air Pollution in Viet Nam. Available online: https://www.who.int/vietnam/health-topics/air-pollution#:~:text=New%
20estimates%20in%202018%20reveal,million%20people%20die%20each%20year (accessed on 1 May 2024).
4. Bang, H.Q.; Khue, V.H.N. Air Emission Inventory. In Air Pollution—Monitoring, Quantification and Removal of Gases and Particles;
IntechOpen: London, UK, 2019; pp. 1–18. [CrossRef]
5. Green Innovation and Development Center. Air Quality Report 2018 in Vietnam; Green Innovation and Development Center:
Hanoi, Vietnam, 2019.
6. Singh, D.; Dahiya, M.; Kumar, R.; Nanda, C. Sensors and Systems for Air Quality Assessment Monitoring and Management: A
Review. J. Environ. Manag. 2021, 289, 112510. [CrossRef] [PubMed]
7. Hung, M.D. Application of Machine Learning to Fill in the Missing Monitoring Data of Air Quality. Vietnam J. Sci. Technol. 2018,
56, 104–110. [CrossRef]
8. López, M. Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022.
9. Oyebode, O.; Stretch, D. Neural Network Modeling of Hydrological Systems: A Review of Implementation Techniques. Nat.
Resour. Model. 2019, 32, e12189. [CrossRef]
Atmosphere 2024, 15, 1163 19 of 19
10. Pan, B. Application of XGBoost Algorithm in Hourly PM2.5 Concentration Prediction. IOP Conf. Ser. Earth Environ. Sci. 2018,
113, 012127. [CrossRef]
11. Joharestani, M.Z.; Cao, C.; Ni, X.; Bashir, B.; Talebiesfandarani, S. PM2.5 Prediction Based on Random Forest, XGBoost, and Deep
Learning Using Multisource Remote Sensing Data. Atmosphere 2019, 10, 373. [CrossRef]
12. Goulier, L.; Paas, B.; Ehrnsperger, L.; Klemm, O. Modelling of Urban Air Pollutant Concentrations with Artificial Neural Networks
Using Novel Input Variables. Int. J. Environ. Res. Public Health 2020, 17, 2025. [CrossRef]
13. Castelli, M.; Clemente, F.M.; Popovič, A.; Silva, S.; Vanneschi, L. A Machine Learning Approach to Predict Air Quality in
California. Complexity 2020, 2020, 049504. [CrossRef]
14. Guo, Q.; He, Z.; Li, S.; Li, X.; Meng, J.; Hou, Z.; Liu, J.; Chen, Y. Air Pollution Forecasting Using Artificial and Wavelet Neural
Networks with Meteorological Conditions. Aerosol Air Qual. Res. 2020, 20, 1429–1439. [CrossRef]
15. Doreswamy; Harishkumar, K.S.; Km, Y.; Gad, I. Forecasting Air Pollution Particulate Matter (PM2.5) Using Machine Learning
Regression Models. In Procedia Computer Science; Elsevier: Amsterdam, The Netherlands, 2020; Volume 171, pp. 2057–2066.
16. Zhou, X.; Liu, J.; Zhang, X. Air Pollution Prediction Using Machine Learning Approaches: A Review. J. Clean. Prod. 2020.
17. Chen, K.; Fiore, A.; Westervelt, D.M. The Influence of Climate Change on PM2.5 and Ozone in the United States: A Review of
Multi-Model Projections. J. Air Waste Manag. Assoc. 2020, 70, 583.
18. Ordóñez, C.; Mathis, H.; Friese, E.; Mues, A. Multi-Model Simulations and Machine Learning Techniques for Improving Air
Quality Predictions. Atmospheric Chemistry and Physics. Atmos. Chem. Phys. 2020, 20, 84.
19. Petetin, H.; Bowdalo, D.; Granell, C. Machine Learning Model for High Resolution PM2.5 Forecasting in Europe. Environ. Pollut.
2020, 266, 11518.
20. Zheng, Y.; Wang, J.; Zhang, J. Deep Learning Models for Air Pollution Prediction and PM2.5 Analysis in China. Environ. Sci.
Technol. 2021, 55, 422.
21. Vo, T.T.M.; Tran, T.T.; To, T.H. PM2.5 Forecast System by Using Machine Learning and WRF Model, A Case Study: Ho Chi Minh
City, Vietnam. Aerosol Air Qual. Res. 2021, 21, 210108. [CrossRef]
22. Rakholia, R.; Le, Q.; Quoc Ho, B.; Vu, K.; Simon Carbajo, R. Multi-Output Machine Learning Model for Regional Air Pollution
Forecasting in Ho Chi Minh City, Vietnam. Environ. Int. 2023, 173, 107848. [CrossRef]
23. Müller, A.; Guido, S. Introduction to Machine Learning with Python: A Guide for Data Scientists, 1st ed.; O’Reilly Media: Sebastopol,
CA, USA, 2016; ISBN 978-1449369415.
24. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
25. Scikit-Learn Random Forest Regressor. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.
RandomForestRegressor.html (accessed on 1 April 2024).
26. Chen, T.; Guestrin, C. XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
27. XGBoost XGBoost Parameters. Available online: https://xgboost.readthedocs.io/en/stable/parameter.html (accessed on
1 May 2024).
28. Platt, J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. 1999. Available
online: https://home.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf (accessed on 1 May 2024).
29. Piri, J.; Abdolahipour, M.; Keshtegar, B. Advanced Machine Learning Model for Prediction of Drought Indices Using Hybrid
SVR-RSM. Water Resour Manag. 2023, 37, 683–712. [CrossRef]
30. Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and Tensor Flow: Concepts, Tools, and Techniques to Build Intelligent
Systems, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2019; ISBN 978-1492032649.
31. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [CrossRef]
32. Specht, D.F. A General Regression Neural Network. IEEE Trans. Neural Netw. 1991, 2, 568–576. [CrossRef]
33. Liu, K.; Lin, T.; Zhong, T.; Ge, X.; Jiang, F.; Zhang, X. New Methods Based on a Genetic Algorithm Back Propagation (GABP)
Neural Network and General Regression Neural Network (GRNN) for Predicting the Occurrence of Trihalomethanes in Tap
Water. Sci. Total Environ. 2023, 870, 161976. [CrossRef] [PubMed]
34. Nguyen, T.N.T.; Du, N.X.; Hoa, N.T. Emission Source Areas of Fine Particulate Matter (PM2.5 ) in Ho Chi Minh City, Vietnam.
Atmosphere 2023, 14, 579. [CrossRef]
35. Hien, T.T.; Nguyen, L.S.P.; Truong, M.T.; Pham, T.D.H.; Ngan, T.A.; Minh, T.H.; Hau, L.Q.; Trung, H.T.; Nhon, N.T.T.; Nguyen, N.T.
Spatiotemporal Variations of Atmospheric Mercury at Urban and Suburban Areas in Southern Vietnam Megacity: A Preliminary
Year-Round Measurement Study. Atmos. Environ. 2024, 333, 120664. [CrossRef]
36. Zhang, C.; Luo, Z.; Rezgui, Y.; Zhao, T. Enhancing Multi-Scenario Data-Driven Energy Consumption Prediction in Campus
Buildings by Selecting Appropriate Inputs and Improving Algorithms with Attention Mechanisms. Energy Build. 2024, 311, 114133.
[CrossRef]
37. Nguyen-Le, V.; Shin, H.; Chen, Z. Deep Neural Network Model for Estimating Montney Shale Gas Production Using Reservoir,
Geomechanics, and Hydraulic Fracture Treatment Parameters. Gas Sci. Eng. 2023, 120, 205161. [CrossRef]