A Hybrid Model for the Prediction of Air Pollutants Concentration, Based on Statistical and Machine Learning Techniques
Abstract. In large cities, the health of the inhabitants and the
concentrations of particles smaller than 10 and 2.5 µm (PM10, PM2.5) as well
as ozone (O3) are related, making their prediction useful for the
government and citizens. Mexico City has an air quality forecast system, which
presents a forecast by pollutant at hourly and geographic zone level, but
is only valid for the next 24 h.
To generate predictions for a longer time period, sophisticated meth-
ods need to be used, but highly automated techniques, such as deep
learning, require a large amount of data, which are not available for this
problem. Therefore, a set of predictor variables is created to feed and
test different Machine Learning (ML) methods, and determine which
features of these methods are essential for the prediction of different pol-
lutant concentrations, to develop a hybrid ad-hoc model that includes
ML features, but allowing a level of explainability, unlike what would
occur with methods such as neural networks.
In this work we present a hybrid prediction model using different sta-
tistical methods and ML techniques, which allow estimating the concen-
tration of the three main pollutants in the air of Mexico City two weeks
ahead. The results of the different models are presented and compared,
with the hybrid model being the one that best predicts the extreme cases.
Keywords: [...] matter · Urban ambient air pollution
1 Introduction
In large cities, particles smaller than 10 and 2.5 µm (PM10, PM2.5), as well as
ozone (O3), are the most dangerous air pollutants. Particularly in
© Springer Nature Switzerland AG 2021
I. Batyrshin et al. (Eds.): MICAI 2021, LNAI 13068, pp. 252-264, 2021.
https://doi.org/10.1007/978-3-030-89820-5_21
Mexico City (CDMX) these are also the pollutants that have most commonly
given rise to the activation of environmental contingencies.
Coarse particles (PM10) can penetrate into the deepest part of the lungs,
such as the bronchioles or alveoli, fine particles (PM2.5) tend to
penetrate the gas exchange regions of the lung, and ultrafine particles can pass
through the lungs to affect other organs [10]. Ozone is formed mainly from
photochemical reactions between organic compounds and nitrogen oxides. Ozone has
been shown to affect the respiratory, cardiovascular and central nervous systems.
A link has also been found between premature death and reproductive health
problems associated with ozone exposure [5].
According to the current version of the air quality index, concentrations
greater than 96 parts per billion (ppb) of O3, 76 micrograms per cubic meter (µg/m³) of
PM10 or 45.1 µg/m³ of PM2.5 fall into air quality that is considered poor.
In recent years, Artificial Intelligence (AI) methods have been used to address
environmental problems, including air quality prediction. Carbajal et al. [2] use
data from Mexico City and fuzzy logic to classify the concentrations of different
pollutants into an air quality category and an autoregressive model to predict
and classify the next day's air quality. Zhao and Hasan [19] use tree-based
classification algorithms to predict whether a critical PM2.5 value will be exceeded on
the next day in Hong Kong, using historical pollutant data. Di et al. [8] combine
estimates from neural network, random forest, and gradient boosting models to
predict daily PM2.5 values in the United States using different data sources,
including meteorological data. Yoo et al. [18] use Random Forest (RF),
Gradient Boosting (GBM) and regression, together with meteorological and pollutant data,
to predict next-day PM2.5 and PM10 values in Seoul, with GBM being the best
model, followed by RF. Iskandaryan et al. [11] review multiple papers on air
pollutant forecasting and conclude that the most predicted pollutant is PM2.5, usually
for next-day predictions, with neural networks being the most used method. Ali
Shah et al. [17] use Empirical Mode Decomposition (EMD) to decompose time
series of PM2.5 and PM10 and test different ML algorithms with data from
different locations. Different algorithms generate the best results at each
location, but the best are always those whose input data were pre-processed with
EMD.
These studies show a benefit from using ML methods together with meteorological
data sources, but the models are generally not built for predictions more
than one day ahead, nor tuned to be sensitive to extreme cases,
which are the main reason for making these predictions.
Mexico City is located within the Valley of Mexico basin, a flat area
surrounded by mountains with an average height between 600 and 800 m above the
valley floor. The Sierra de Guadalupe borders the
city to the north, the Sierra de las Cruces to the west, the Sierra del Ajusco to
the south and the Sierra Nevada to the east, the latter including the volcanoes
Iztaccíhuatl (5200 m above sea level) and Popocatépetl (5400 m above sea level).
The number and distribution of mountains make Mexico City and its
metropolitan area a highly complex terrain, influencing the meteorology and
254 C. Minutti-Martinez et al.
how pollutants behave in the atmosphere. Local winds influence air quality and
the distribution of pollutants.
Studies of local winds in Mexico City have identified
three winds throughout the day (Fig. 1), which
are affected when there are weather fronts in the Pacific Ocean and the Gulf of
Mexico [6,7,9,12]; other authors have identified up to 9 wind patterns throughout
the year [3].
[Figure 1: schematic cross-section of the basin; labels: "Gradient wind (westerlies, Nov-Apr)", "Upper limit of inversion".]
Fig. 1. Schematic illustration of drainage airflow over the CDMX basin [12].
2 Methodology
The process by which the different ML models were arrived at, as well as the deci-
sions taken for their development, are presented below. The coding and analysis
were performed using the R language/environment [14].
2.1 Data
For the pollutants studied (O3, PM10 and PM2.5) we used data from
Mexico City's Automatic Air Quality Monitoring Network (RAMA [15]), which
has data since 1995 for O3, 2011 for PM10 and 2003 for PM2.5, available
at hourly level for different monitoring stations.
The meteorological data come from stations located in the Mexico City area
of the Automatic Weather Stations operated by the National Water Commission
(CONAGUA [4]). These stations are shown in Fig. 2, and include the following
variables: wind speed and direction, air temperature, atmospheric pressure,
relative humidity, barometric pressure, precipitation and solar radiation.
Measurements are taken every 10 min, and data are available for the period from
January 1, 2010 to March 31, 2019.
Although there are several monitoring stations, some of them disappeared
over time, new ones appeared, and many have missing data. Therefore, in order
to characterize the levels of pollutants in the area of the CDMX, the 5 stations
with the highest density of data were determined and the maximum value (as
well as the mean, median and minimum) observed for each day was computed.
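The daily aggregation described above can be sketched as follows. This is a minimal illustration in Python (the study itself was coded in R); the function name `daily_summary` and the input layout of (timestamp, value) pairs are assumptions made for the example, not part of the original work.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def daily_summary(readings):
    """Collapse hourly readings into daily max, min, mean and median.

    `readings` is an iterable of (datetime, value) pairs; a value of None
    marks a missing measurement and is skipped (hypothetical input layout).
    """
    by_day = defaultdict(list)
    for timestamp, value in readings:
        if value is not None:
            by_day[timestamp.date()].append(value)
    return {
        day: {
            "max": max(values),
            "min": min(values),
            "mean": mean(values),
            "median": median(values),
        }
        for day, values in sorted(by_day.items())
    }


# Illustrative values only, not taken from the RAMA data.
hourly = [
    (datetime(2019, 3, 30, 8), 40.0),
    (datetime(2019, 3, 30, 14), 95.0),
    (datetime(2019, 3, 30, 20), None),  # missing hour
    (datetime(2019, 3, 31, 9), 60.0),
]
summary = daily_summary(hourly)
```

Days with no valid reading simply do not appear in the result, which leaves the decision on how to impute them to a later step.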
[Figure 2: map of Mexico City and neighboring municipalities showing the meteorological stations, including TEZONTLE.]
Fig. 2. Geographic location of the meteorological stations used for the predictor
variables.
As for the meteorological data, the daily maximum, minimum, mean and
median values for the different observed variables are used to generate the pre-
dictor variables, for each of the 5 stations in Fig. 2. However, it was observed
that using information from 2 or more stations did not increase the accuracy of
the methods. The station with the highest density of data and that produced the
highest accuracy is the Tezontle station (Tezon), which is located in the central
part of the CDMX; therefore, all the reported results use the Tezon station
data.
When looking for other predictor variables, it was observed that pollutant
concentrations have well-defined trends over different time scales. When the
data were analyzed for different years, at the level of the day of the year and
the week, it was also observed that for all pollutants there is a trend defined by
the time of the year. Thus, the following variables are added to the predictor
variables: the year, week, month, four-month period, day of the year and day of
the week to be predicted.
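As an illustration of the calendar predictors listed above, the following Python sketch derives them from a date. The function name, the dictionary keys, the use of ISO week numbers, and the split of the year into three four-month periods are assumptions for the example; the paper does not state which conventions were used.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def calendar_features(day):
    """Calendar predictors for the day to be predicted (hypothetical names)."""
    return {
        "year": day.year,
        "month": day.month,
        "week": day.isocalendar()[1],  # ISO week number (assumed convention)
        "four_month_period": (day.month - 1) // 4 + 1,  # 1, 2 or 3
        "day_of_year": day.timetuple().tm_yday,
        "day_of_week": DAY_NAMES[day.weekday()],  # categorical variable
    }


features = calendar_features(date(2019, 3, 31))
```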
Taking into consideration the temporal relationship between the years, days
of the week and the trend within each year, the variables corresponding to the
minimum, maximum, average and daily median of each pollutant and meteoro-
logical variable for the previous 52 weeks are also added as predictors, always
using the same day of the week as the one to be estimated.
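The lagged predictors can be sketched in the same spirit. The code below assumes daily statistics stored in a dictionary keyed by date, as a hypothetical layout; the feature names are illustrative and do not reproduce the variable names of the published database.

```python
from datetime import date, timedelta

STATS = ("min", "max", "mean", "median")


def weekly_lag_features(daily, target_day, name, n_weeks=52):
    """Daily statistics of one variable for the same weekday in previous weeks.

    `daily` maps a date to a dict with the keys in STATS (assumed layout).
    Missing days yield None so that they can be imputed later.
    """
    features = {}
    for k in range(1, n_weeks + 1):
        lag_day = target_day - timedelta(weeks=k)  # same weekday, k weeks back
        stats = daily.get(lag_day)
        for stat in STATS:
            key = f"{stat}_{name}_week{k}"
            features[key] = None if stats is None else stats[stat]
    return features


# Illustrative values only.
daily = {date(2019, 3, 24): {"min": 20.0, "max": 80.0, "mean": 45.0, "median": 42.0}}
lags = weekly_lag_features(daily, date(2019, 3, 31), "PM10")
```

With 4 statistics and 52 weeks, each pollutant or meteorological variable contributes 208 lagged predictors under this scheme.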
This process results in a database with 2,979 records and 2,563 variables, of
which 2,550 are predictor variables. If there are missing data, the median (for
numerical variables) and mode (for categorical variables) are used as imputation
methods. This database is available in [13].
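A minimal sketch of the imputation rule, assuming each variable is held as a list in which None marks a missing value (a hypothetical representation):

```python
from statistics import median, mode


def impute_column(values):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn the fill value from
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in present)
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


numeric_filled = impute_column([1.0, None, 3.0, 10.0])
categorical_filled = impute_column(["Mon", None, "Tue", "Mon"])
```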
The relationships between the different variables are studied with
statistical methods to determine the best way to incorporate the meteorological
and temporal variables into the ML models. This is because there is not enough
data to let the method itself determine all this information (new non-linear
variables, interactions, etc.) in a completely automated way, so a careful statistical
analysis is also required.
After testing the models, it was observed that it is possible to predict up to
2 weeks into the future, without excessive loss in accuracy, so the results reported
are for this time range.
The use of a regression tree model was also tested, which can give hints of
variables of importance, but with reduced predictive power, so a Random forest
model was also tested, using 5,000 training models.
After analyzing the results of the different models, it was observed that many
of the variables have valuable information for the prediction of pollutants, but
their use is limited due to the number of records, so a hybrid ad-hoc model was
developed for the problem. This model combines the characteristic of Random
Forest of fitting multiple models with that of a Neural Network of including
new non-linear variables and their interactions.
The new model is named CM-MLPred. It first generates new
nonlinear variables (polynomials of degree 2 and 3 of the original variables) and
interactions of multiple variables, in this case interactions of 2-6 variables.
Multiple regression models are then fitted on random selections
of the base variables and the new variables; the number of variables in each
selection can follow a probability distribution (the highest number of
variables is prioritized). A random selection of records
is also performed for each fit, in order to (1) not over-fit the final model, (2) give a
different selection probability to each record to simulate adjustment weights, if it
is desired to prioritize the error reduction for a certain type of value, and (3)
be able to use the data not used during the training/adjustment of the model
to obtain an estimate of the prediction error (cross-validation).
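The procedure just described can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation (which was written in R and is not shown in the paper): the function names, the restriction to pairwise interactions, the uniform choice of the number of variables, the sampling of records with replacement, and the plain average used to combine sub-models are all simplifying assumptions.

```python
import random


def solve_least_squares(X, y, ridge=1e-8):
    """Least-squares coefficients via the normal equations.

    A tiny ridge term keeps the system solvable when columns are collinear.
    """
    p = len(X[0])
    # Augmented matrix [X'X + ridge*I | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(p)
    ]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def expand(row):
    """Base variables plus degree-2/3 polynomials and pairwise interactions."""
    squares = [v ** 2 for v in row]
    cubes = [v ** 3 for v in row]
    pairs = [row[i] * row[j] for i in range(len(row)) for j in range(i + 1, len(row))]
    return list(row) + squares + cubes + pairs


def predict_one(model, row):
    """Prediction of a single sub-model for one record."""
    z = expand(row)
    beta = model["beta"]
    return beta[0] + sum(b * z[c] for b, c in zip(beta[1:], model["cols"]))


def fit_submodels(X, y, n_models=50, max_vars=4, weights=None, seed=0):
    """Fit regressions on random variable subsets and random record samples."""
    rng = random.Random(seed)
    Z = [expand(row) for row in X]
    n, p = len(Z), len(Z[0])
    models = []
    for _ in range(n_models):
        cols = rng.sample(range(p), rng.randint(1, min(max_vars, p)))
        # Weighted sampling with replacement (assumption); `weights` lets
        # some records be drawn more often than others.
        train = rng.choices(range(n), weights=weights, k=n)
        held_out = sorted(set(range(n)) - set(train))
        design = [[1.0] + [Z[i][c] for c in cols] for i in train]
        model = {"cols": cols,
                 "beta": solve_least_squares(design, [y[i] for i in train])}
        # Records left out of the sample give an estimate of prediction error.
        errors = [(y[i] - predict_one(model, X[i])) ** 2 for i in held_out]
        model["held_out_mse"] = sum(errors) / len(errors) if errors else None
        models.append(model)
    return models


def predict(models, row):
    """Plain average of the sub-model predictions (simplification)."""
    return sum(predict_one(m, row) for m in models) / len(models)


# Synthetic data for illustration only.
X = [[i / 10, (i * 7 % 10) / 10] for i in range(30)]
y = [1.0 + 2.0 * a + a * b for a, b in X]
models = fit_submodels(X, y, n_models=20, seed=1)
estimate = predict(models, [1.5, 0.5])
```

Passing larger `weights` for records with high concentrations is one way to mimic item (2), since those records then enter more of the sub-model fits.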
An additional feature of the model is that, since there are multiple
sub-models with different variables and numbers of variables, it is not necessary to impute
missing data: only those sub-models for which the values of all their variables are
available are used for the final estimate. In addition, having a multiplicity
of models prevents an outlier or erroneous value of a predictor variable from
considerably affecting the estimation.
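The handling of missing predictors can be sketched as follows, assuming each sub-model is stored as an intercept plus a mapping from variable name to coefficient. This layout, the variable names and the unweighted average are hypothetical choices for the example.

```python
def usable_submodels(submodels, record):
    """Keep only sub-models whose variables are all present in the record."""
    return [m for m in submodels
            if all(record.get(v) is not None for v in m["coef"])]


def predict_available(submodels, record):
    """Average the predictions of the sub-models that can be evaluated."""
    usable = usable_submodels(submodels, record)
    if not usable:
        return None  # no sub-model can be evaluated for this record
    predictions = [
        m["intercept"] + sum(c * record[v] for v, c in m["coef"].items())
        for m in usable
    ]
    return sum(predictions) / len(predictions)


# Hypothetical sub-models and record; solar radiation is missing.
submodels = [
    {"intercept": 10.0, "coef": {"temp_max": 2.0}},
    {"intercept": 5.0, "coef": {"temp_max": 1.0, "rad_solar_median": 0.1}},
]
record = {"temp_max": 25.0, "rad_solar_median": None}
prediction = predict_available(submodels, record)
```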
Finally, the model can give a prediction interval, which can use the different
estimates of each sub-model to obtain a probability distribution of the estimate.
For the calculation of the probability distribution of the estimate, the weight
of each sub-model is used, which is given by a score that is calculated by the
product of the coefficient of determination of each model (adjusted R²), the AIC¹
(Akaike information criterion) and a coefficient of determination for prediction.
Thus, a sub-model with a high score generally implies an adequate performance
in the 3 different measures. The final estimate is given by the estimate value
that maximizes the likelihood of the probability distribution of the estimate.
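A possible reading of this scoring and combination step is sketched below. The AIC rescaling follows the paper's footnote; everything else is an assumption made for illustration: negative coefficients of determination are clipped to zero, the probability distribution is approximated by a score-weighted Gaussian kernel density, and its maximum is located on a grid.

```python
import math


def normalize_aic(aic_values):
    """Rescale AIC to [0, 1] so that higher values correspond to better models."""
    lo, hi = min(aic_values), max(aic_values)
    if hi == lo:
        return [1.0] * len(aic_values)
    return [1 - (a - lo) / (hi - lo) for a in aic_values]


def submodel_scores(adj_r2, aic_values, pred_r2):
    """Score = adjusted R2 x rescaled AIC x prediction R2 (clipped at zero)."""
    aic_scores = normalize_aic(aic_values)
    return [max(r, 0.0) * a * max(q, 0.0)
            for r, a, q in zip(adj_r2, aic_scores, pred_r2)]


def most_likely_estimate(estimates, scores, bandwidth, grid_size=201):
    """Value maximizing a score-weighted Gaussian kernel density (assumption)."""
    lo, hi = min(estimates), max(estimates)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]

    def density(x):
        return sum(s * math.exp(-0.5 * ((x - e) / bandwidth) ** 2)
                   for e, s in zip(estimates, scores))

    return max(grid, key=density)


# Illustrative numbers only.
scores = submodel_scores([0.8, 0.7, 0.2], [100.0, 150.0, 200.0], [0.6, 0.5, 0.1])
final = most_likely_estimate([50.0, 52.0, 90.0], scores, bandwidth=2.0)
```

Because the score is a product, the sub-model with the worst AIC receives a weight of zero under this rescaling, so a distant estimate such as 90.0 above does not influence the result.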
3 Results
The results for the 3 different pollutants are presented below, starting with
PM10, for which a greater level of detail of the process and results obtained is
given; for PM2.5 and O3, the results obtained by performing the same
process described for PM10 are presented. When observing the distribution of the
¹ Since models with lower AIC are better, 1 - (AIC - min(AIC))/(max(AIC) - min(AIC))
is used so that higher values correspond to better models.
different pollutants across the data set, it was observed (Fig. 3) that the
distribution is very asymmetric, with very large values being infrequent. However,
the very high values are of special interest for prediction, which is why the first
decision taken was to predict the logarithm of the pollutant concentration, which
has a more symmetric distribution (and more similar to a Gaussian), which will
make all methods more efficient in the prediction of high concentrations.
[Figure 3: distributions by pollutant (O3, PM10, PM2.5). Left x-axis: µg/m³ [PM10, PM2.5], ppb [O3]; right x-axis: LOG(µg/m³) [PM10, PM2.5], LOG(ppb) [O3].]
Fig. 3. Distribution of daily maximum values, observed for each pollutant (left) and
for its logarithm (right).
To calculate the prediction error for each of the models, a random set of
data was selected for cross-validation, which corresponds to 13% of the total
records (384 records). This validation data set was selected by means of a random
sample where the probability of selection of each record is inversely proportional
to its frequency, in order to determine how good each model is at predicting
extreme cases, which are the ones of main interest. This validation set was never
used in the training of any of the models, and its error is measured by the
formula $err_i = \left( \frac{C_i^{obs} - C_i^{est}}{C_i^{obs}} \right)^2$, where $C_i^{obs}$ is the i-th observation of the pollutant
($C$ = PM10, PM2.5, O3) and $C_i^{est}$ is the model prediction corresponding to the
i-th observation. That is, it is a quadratic error relative to the measured value.
err results in a vector for which its different quartiles and mean are estimated.
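The validation scheme can be sketched in Python as follows. The binning used to estimate how frequent each value is, the weighted sampling scheme (Efraimidis-Spirakis keys), and all function names are assumptions for the example; the paper only states that the selection probability is inversely proportional to the frequency of the record.

```python
import random
from collections import Counter
from statistics import mean, quantiles


def inverse_frequency_sample(values, k, n_bins=10, seed=0):
    """Draw k distinct indices, favoring values that fall in sparse bins."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    counts = Counter(bins)
    # Weighted sampling without replacement: key = u**(1/w) with w = 1/count,
    # i.e. u**count; the k largest keys are selected.
    keys = [(rng.random() ** counts[b], i) for i, b in enumerate(bins)]
    return sorted(i for _, i in sorted(keys, reverse=True)[:k])


def relative_squared_error(observed, estimated):
    """Squared error relative to the observed value, one entry per record."""
    return [((o - e) / o) ** 2 for o, e in zip(observed, estimated)]


def error_summary(errors):
    """Quartiles and mean of the error vector."""
    q1, q2, q3 = quantiles(errors, n=4)
    return {"q1": q1, "median": q2, "q3": q3, "mean": mean(errors)}


# Illustrative data: 95 common values and 5 rare, high values.
values = [1.0] * 95 + [10.0] * 5
validation_idx = inverse_frequency_sample(values, k=10)
errors = relative_squared_error([100.0, 50.0], [90.0, 60.0])
summary = error_summary(errors)
```

Under this weighting the rare high values are far more likely to enter the validation set than their share of the data would suggest, which matches the stated aim of judging the models on extreme cases.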
All models (Regression [REG], Regularized regression [RIDGE], Neural
Networks [NN], Random Forest [RF], Regression Trees [TREE]) were calibrated to
maximize their prediction power while avoiding overfitting. For regression,
stepwise selection was used. For regularized regression, Ridge, LASSO and Elastic Net
were tested with an optimal value of the regularization parameter;
Ridge regression always gave the best results and is the one reported. In the case of
Neural Networks, the number of hidden layers and the threshold parameter were
optimized, and different initial values for the weights were tested. This process
was done for each pollutant, with the following results.
3.1 PM10
The hybrid model (CM-MLPred) aims to combine the advantages of new nonlin-
ear variables and interactions that a Neural Network provides, as well as the
multiplicity of predictors that a model such as Random Forest has. For the
training of this model, 5,000 sub-models were used, and they add 900 new vari-
ables, consisting of 300 non-linear transformations (polynomials of degree 2-3
and logarithms) as well as 600 variable interactions ranging from 2-6 variables,
where each sub-model contains 2-350 variables. A sample of the base variables
can be seen in Fig. 4, where the importance of the variables for the RF model is
presented.
[Figure 4: variable importance (IncNodePurity) for the RF model. The variables shown are lagged solar radiation statistics at the Tezon station (e.g. RadSolarmediantezonweek33, RadSolarmaxtezonweek1), lagged PM10 and O3 statistics (e.g. meanvalPM10week50, maxvalPM10week52, medianvalO3week15) and calendar variables (year, dayn, daySunday, quarterQ3).]
[Figure 5: panels comparing the models CM-MLPred, NN, REG, RF, RIDGE and TREE.]
Fig. 5. Comparison of the different models for the prediction error of PM10, in the
validation set
Figure 5 shows the distribution of the prediction squared error as well as the
box plot. From both graphs it can be seen that the hybrid method, CM-MLPred
presents a higher number of cases with low errors, in relation to the other models,
and has a lower variability in its errors.
Table 1 shows the different quartiles for the error obtained with each model,
as well as their mean.
Table 1. Errors for the validation data set using the different models tested, for PM10
Taking these results into account, the best performing model was CM-MLPred
with the difference in the third quartile being more relevant, indicating a lower
probability of making large errors. Neural Networks (NN) is the one that could
be considered the second best model.
Figure 5 shows the observed vs estimated values of PM10 for each model,
so the model that is closer to a straight line with slope 1 (dotted line) is the
best one. In this graph it can be seen that CM-MLPred is the closest to that
line and that the other models tend to have problems when predicting extreme
values, especially for very high PM10 values, which are precisely the values we
are interested in predicting.
3.2 PM2.5
[Figure 6: panels comparing the models CM-MLPred, NN, REG, RF, RIDGE and TREE; axes: Error, and Predicted vs Observed value.]
Fig. 6. Comparison of the different models for the prediction error of PM2.5, in the
validation set
Figure 6 shows the distribution of the prediction errors in a box plot. It can be
seen that the hybrid model, CM-MLPred, has a higher number of cases with low
errors, in relation to the other models, and has a lower variability in its errors.
Figure 6 also shows the observed vs predicted values of PM2.5 for each model.
Table 2 shows the different quartiles for the error obtained with each model,
as well as their mean.
Table 2. Errors for the validation data set using the different models tested for PM2.5
The best performing model was CM-MLPred with the difference in the third
quartile being more relevant, indicating a lower probability of making large
errors. Neural Networks (NN) is the one that could be considered the second
best model.
Although CM-MLPred performed better than the other models, it is also
observed that for PM2.5 it is more difficult to make adequate predictions, and
the difference between the methods is smaller (compared to PM10), which may
be expected given the low correlations of the predictor variables.
3.3 O3
[Figure 7: panels comparing the models CM-MLPred, NN, REG, RF, RIDGE and TREE; one panel shows Predicted values.]
Fig. 7. Comparison of the different models for the prediction error of O3, in the vali-
dation data set
Table 3 shows the different quartiles for the error obtained with each model, as
well as their mean.
Table 3. Errors for the validation dataset using the different models tested for O3
• PM10 is the pollutant that was most correlated with the meteorological variables
and the one for which the best predictions are achieved.
• PM2.5 is the pollutant that presented the greatest prediction problems.
• O3 is mainly dominated by solar radiation and the time of the year.
• For all pollutants, the hybrid model (CM-MLPred) was the one that obtained
the best predictions and the one that performed better in predicting extreme
values.
• The developed model can be adapted to other pollutants and cities.
• Because these pollutants show a pattern related to the day of the week, the
effect of human activity is evident; therefore, the inclusion of related variables,
such as human mobility, could significantly improve the results.
References
1. Arellano-Vázquez, M., Minutti-Martinez, C., Zamora-Machado, M.: Automated
characterization and prediction of wind conditions using gaussian mixtures. In:
Martínez-Villaseñor, L., Herrera-Alcántara, O., Ponce, H., Castro-Espinoza, F.A.
(eds.) Advances in Soft Computing, vol. 12468, pp. 158-168. Springer, Cham
(2020). https://doi.org/10.1007/978-3-030-60884-2_12
2. Carbajal-Hernández, J.J., Sánchez-Fernández, L.P., Carrasco-Ochoa, J.A.,
Martínez-Trinidad, J.F.: Assessment and prediction of air quality using fuzzy logic
and autoregressive models. Atmos. Environ. 60, 37-50 (2012).
https://doi.org/10.1016/j.atmosenv.2012.06.004
3. Carreón-Sierra, S., Salcido, A., Castro, T., Celada-Murillo, A.T.: Cluster analysis
of the wind events and seasonal wind circulation patterns in the Mexico city region.
Atmosphere 6(8), 1006-1031 (2015)
4. CONAGUA: Automated weather stations, August 2021.
https://www.conagua.gob.mx/tools/GUI/EMAS.php
5. Council, N.R.: Estimating Mortality Risk Reduction and Economic Benefits from
Controlling Ozone Air Pollution. The National Academies Press, Washington, DC
(2008). https://doi.org/10.17226/12198
6. De Foy, B., et al.: Mexico city basin wind circulation during the MCMA-2003
field campaign. Atmos. Chem. Phys. Discuss. 5(3), 2503-2558 (2005).
https://hal.archives-ouvertes.fr/hal-00303903
7. De Foy, B., Clappier, A., Molina, L.T., Molina, M.J.: Distinct wind convergence
patterns in the Mexico city basin due to the interaction of the gap winds with
the synoptic flow. Atmos. Chem. Phys. 6(5), 1249-1265 (2006).
https://doi.org/10.5194/acp-6-1249-2006
8. Di, Q., et al.: An ensemble-based model of PM2.5 concentration across the
contiguous united states with high spatiotemporal resolution. Environ. Int. 130, 104909
(2019). https://doi.org/10.1016/j.envint.2019.104909
9. de Foy, B., et al.: Basin-scale wind transport during the MILAGRO field campaign
and comparison to climatology using cluster analysis. Atmos. Chem. Phys. 8(5),
1209-1224 (2008). https://doi.org/10.5194/acp-8-1209-2008
10. Heinzerling, A., Hsu, J., Yip, F.: Respiratory health effects of ultrafine particles
in children: a literature review. Water Air Soil Pollut. 227(1), 32 (2015).
https://doi.org/10.1007/s11270-015-2726-6
11. Iskandaryan, D., Ramos, F., Trilles, S.: Air quality prediction in smart cities using
machine learning technologies based on sensor data: a review. Appl. Sci. 10(7)
(2020). https://doi.org/10.3390/app10072401
12. Jauregui, E.: Local wind and air pollution interaction in the Mexico basin.
Atmósfera 1(3) (2011).
https://www.revistascca.unam.mx/atm/index.php/atm/article/view/25944
13. Minutti, C.: Pollutant and meteorological data for the prediction of air pollutants
in Mexico city, September 2021. https://doi.org/10.6084/m9.figshare.16589822.v1
14. R Core Team: R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria (2021).
https://www.R-project.org/
15. RAMA: Automatic air quality monitoring network (2021).
http://www.aire.cdmx.gob.mx/default.php?ope %27aKBh%27
16. Sánchez-Pérez, P.A., Robles, M., Jaramillo, O.A.: Real time Markov chains: wind
states in anemometric data. J. Renew. Sustain. Energy 8(2), 023304 (2016).
https://doi.org/10.1063/1.4943120
17. Shah, S.A.A., Almaraashi, W.A.M., Nadeem, M.S.A., Habib, N., Shim, S.O.: A
hybrid model for forecasting of particulate matter concentrations based on
multiscale characterization and machine learning techniques. Math. Biosci. Eng. 18(3),
1992 (2021). https://doi.org/10.3934/mbe.2021104
18. Yoo, J., Shin, D., Shin, D.: Prediction system for fine particulate matter
concentration index by meteorological and air pollution material factors based on machine
learning. In: Proceedings of the Tenth International Symposium on Information
and Communication Technology, SoICT 2019, pp. 479-485. Association for
Computing Machinery, New York (2019). https://doi.org/10.1145/3368926.3369684
19. Zhao, Y., Hasan, Y.A.: Fine particulate matter concentration level prediction by
using tree-based ensemble classification algorithms. Int. J. Adv. Comput. Sci. Appl.
4(5) (2013). https://doi.org/10.14569/IJACSA.2013.040503