A Hybrid Model for the Prediction of Air Pollutants Concentration, Based on Statistical and Machine Learning Techniques

Carlos Minutti-Martinez¹, Magali Arellano-Vazquez², and Marlene Zamora-Machado³

¹ Artificial Intelligence Consortium, CONACyT-CIMAT, Guanajuato, Mexico
[email protected]
² INFOTEC Center for Research and Innovation in Information and Communication Technologies, Aguascalientes, Mexico
[email protected]
³ Autonomous University of Baja California, Mexicali, Mexico
[email protected]

Abstract. In large cities, the health of the inhabitants and the concentrations of particles smaller than 10 and 2.5 µm (PM10, PM2.5) as well as ozone (O3) are related, making their prediction useful for the government and citizens. Mexico City has an air quality forecast system, which presents a forecast by pollutant at hourly and geographic zone level, but is only valid for the next 24 h.
To generate predictions for a longer time period, sophisticated methods need to be used, but highly automated techniques, such as deep learning, require a large amount of data, which are not available for this problem. Therefore, a set of predictor variables is created to feed and test different Machine Learning (ML) methods, and determine which features of these methods are essential for the prediction of different pollutant concentrations, to develop a hybrid ad-hoc model that includes ML features, but allowing a level of explainability, unlike what would occur with methods such as neural networks.
In this work we present a hybrid prediction model using different statistical methods and ML techniques, which allow estimating the concentration of the three main pollutants in the air of Mexico City two weeks ahead. The results of the different models are presented and compared, with the hybrid model being the one that best predicts the extreme cases.

Keywords: Pollutant forecasting · Machine learning · Particulate matter · Urban ambient air pollution

© Springer Nature Switzerland AG 2021
I. Batyrshin et al. (Eds.): MICAI 2021, LNAI 13068, pp. 252-264, 2021.
https://doi.org/10.1007/978-3-030-89820-5_21

1 Introduction
In large cities, particles smaller than 10 and 2.5 µm (PM10, PM2.5), as well as ozone (O3), are among the most dangerous air pollutants. Particularly in Mexico City (CDMX), these are also the pollutants that have most commonly given rise to the activation of environmental contingencies.
Coarse particles (PM10) can penetrate into the deepest part of the lungs, such as the bronchioles or alveoli; fine particles (PM2.5) tend to penetrate the gas exchange regions of the lung; and ultrafine particles can pass through the lungs to affect other organs [10]. Ozone is formed mainly from photochemical reactions between organic compounds and nitrogen oxides. Ozone has been shown to affect the respiratory, cardiovascular and central nervous systems. A link has also been found between premature death and reproductive health problems associated with ozone exposure [5].
According to the current version of the air quality index, concentrations greater than 96 parts per billion (ppb) of O3, 76 micrograms per cubic meter (µg/m³) of PM10, or 45.1 µg/m³ of PM2.5 correspond to air quality that is considered poor.
In recent years, Artificial Intelligence (AI) methods have been used to address environmental problems, including air quality prediction. Carbajal et al. [2] use data from Mexico City and fuzzy logic to classify the concentrations of different pollutants into an air quality category, and an autoregressive model to predict and classify the next day's air quality. Zhao and Hasan [19] use tree-based classification algorithms to predict whether a critical PM2.5 value will be exceeded on the next day in Hong Kong, using historical pollutant data. Di et al. [8] combine estimates from neural network, random forest, and gradient boosting models to predict daily PM2.5 values in the United States using different data sources, including meteorological data. Yoo et al. [18] use Random Forest (RF), Gradient Boosting (GBM) and regression, together with meteorological and pollutant data, to predict next-day PM2.5 and PM10 values in Seoul, with GBM being the best model, followed by RF. Iskandaryan et al. [11] review multiple papers on air pollutant forecasting and conclude that the most frequently predicted pollutant is PM2.5, usually for next-day predictions, with neural networks being the most used method. Shah et al. [17] use Empirical Mode Decomposition (EMD) to decompose time series of PM2.5 and PM10 and test different ML algorithms with data from different locations. Different algorithms generate better results at each location, but the best are those whose input data were pre-processed with EMD.
In these studies a benefit of using ML methods as well as meteorological
data sources can be observed, but generally these models are not generated for
predictions longer than one day, nor to be more sensitive to detect extreme cases,
which are the main interest of making these predictions.
Mexico City is located within the Valley of Mexico basin, a flat area surrounded by mountains with an average height between 600 and 800 m above the valley floor. The Sierra de Guadalupe surrounds the city to the north, the Sierra de las Cruces to the west, the Sierra del Ajusco to the south and the Sierra Nevada to the east, the latter including the volcanoes Iztaccihuatl (5200 m above sea level) and Popocatepetl (5400 m above sea level).
The number and distribution of mountains make Mexico City and its
metropolitan area a highly complex terrain, influencing the meteorology and
how pollutants behave in the atmosphere. Local winds influence air quality and
the distribution of pollutants.
Studies of local winds in Mexico City have identified the presence of three winds throughout the day (Fig. 1), which are affected when there are weather fronts in the Pacific Ocean and the Gulf of Mexico [6,7,9,12]. Other authors have identified up to 9 wind patterns throughout the year [3].

[Figure 1: schematic cross-section of the basin. Legible labels: gradient wind (westerlies, Nov-Apr); upper limit of inversion.]

Fig. 1. Schematic illustration of drainage airflow over the CDMX basin [12].

Predicting the concentrations of different air pollutants is very useful for citizens and governments, who can then take actions to reduce the impact of these pollutants and, if possible, reduce the magnitude and occurrence of the phenomenon itself.
A wind state is defined as a region in velocity phase space that contains available wind speeds that have a standard probability distribution function that characterizes them as a group [16]. In previous research [1] it was possible to characterize the different wind states in an automated way, replicating the classifications made by experts without having trained the classification method to reproduce them, which validates both classifications.
Historical analysis of these data also showed a correlation between the absence of some of these wind states and the subsequent occurrence of pollutants in sufficient concentration to result in a contingency. We intend to analyze this information and its variables in order to develop AI/ML methods for predicting the different states of poor air quality within Mexico City.

2 Methodology
The process by which the different ML models were arrived at, as well as the deci-
sions taken for their development, are presented below. The coding and analysis
were performed using the R language/environment [14].

2.1 Data
For the pollutants to be studied (O3, PM10 and PM2.5) we used Mexico City's Automatic Air Quality Monitoring Network (RAMA [15]), which has data since 1995 for O3, 2011 for PM10 and 2003 for PM2.5. These data can be obtained at hourly level for different monitoring stations.
The meteorological data come from the Automatic Weather Stations located in the Mexico City area and operated by the National Water Commission (CONAGUA [4]). These stations are shown in Fig. 2, and include the following variables: wind speed and direction, air temperature, atmospheric pressure, relative humidity, barometric pressure, precipitation and solar radiation. Measurements are taken every 10 min, and data are available for the period from January 1, 2010 to March 31, 2019.
Although there are several monitoring stations, some of them disappeared over time and new ones appeared, and many have missing data. Therefore, to characterize the levels of pollutants in the CDMX area, the 5 stations with the highest density of data were determined, and the maximum value (as well as the mean, median and minimum) observed for each day was computed.
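The daily aggregation described above can be sketched as follows. This is a minimal illustration in Python (the authors worked in R); the function name `daily_summary`, the input layout and the sample readings are assumptions made for the example, not part of the original pipeline.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

def daily_summary(hourly):
    """Collapse (ISO timestamp, value) pairs into per-day statistics.

    Missing readings are given as None and are skipped (an assumption;
    the paper does not say how gaps are treated at this stage).
    """
    by_day = defaultdict(list)
    for stamp, value in hourly:
        if value is None:
            continue
        by_day[datetime.fromisoformat(stamp).date()].append(value)
    return {
        day: {"max": max(v), "min": min(v), "mean": mean(v), "median": median(v)}
        for day, v in sorted(by_day.items())
    }

# Hypothetical hourly PM10 readings (µg/m³) from one station.
readings = [
    ("2019-03-01T00:00", 40.0), ("2019-03-01T12:00", 95.0),
    ("2019-03-01T18:00", None), ("2019-03-02T06:00", 30.0),
    ("2019-03-02T15:00", 70.0),
]
summary = daily_summary(readings)
```

The same function could be applied to each pollutant and each meteorological variable, giving the four daily statistics that the later sections use as building blocks.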

Fig. 2. Geographic location of the meteorological stations used for the predictor
variables.

As for the meteorological data, the daily maximum, minimum, mean and median values of the different observed variables are used to generate the predictor variables, for each of the 5 stations in Fig. 2. However, it was observed that using information from 2 or more stations did not increase the accuracy of the methods. The station with the highest density of data, and the one that produced the highest accuracy, is the Tezontle station (Tezon), located in the central part of the CDMX. Therefore, all the reported results use the Tezon station data.
When looking for other predictor variables, it was observed that pollutant concentrations have well-defined trends over different time scales. When the data were analyzed for different years, at the level of the day of the year and the week, it was also observed that for all pollutants there is a trend defined by the time of the year. Thus, the following variables are added to the predictor variables: the year, week, month, four-month period, day of the year and day of the week to be predicted.
Taking into consideration the temporal relationship between the years, days
of the week and the trend within each year, the variables corresponding to the
minimum, maximum, average and daily median of each pollutant and meteoro-
logical variable for the previous 52 weeks are also added as predictors, always
using the same day of the week as the one to be estimated.
This process results in a database with 2,979 records and 2,563 variables, of
which 2,550 are predictor variables. If there are missing data, the median (for
numerical variables) and mode (for categorical variables) are used as imputation
methods. This database is available in [13].
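A minimal Python sketch of how one record of such a database could be assembled is shown below. It covers the calendar variables, the same-weekday values of the previous 52 weeks, and median/mode imputation. All function names, the variable naming pattern (loosely modeled on the labels visible in Fig. 4) and the sample values are assumptions for illustration only.

```python
from datetime import date, timedelta
from statistics import median, mode

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def calendar_features(day):
    """Temporal predictors for the day to be predicted."""
    return {
        "year": day.year,
        "month": day.month,
        "four_month_period": (day.month - 1) // 4 + 1,
        "week": day.isocalendar()[1],
        "day_of_year": day.timetuple().tm_yday,
        "day_of_week": WEEKDAYS[day.weekday()],
    }

def weekly_lags(daily, target_day, name, n_weeks=52):
    """Values of one daily series 1..n_weeks weeks earlier (same weekday)."""
    return {
        f"{name}_week{k}": daily.get(target_day - timedelta(weeks=k))
        for k in range(1, n_weeks + 1)
    }

def impute(values, categorical=False):
    """Fill None with the mode (categorical) or the median (numerical)."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if categorical else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical daily maxima of PM10; most days are missing in this toy input.
target = date(2019, 3, 31)
daily_max_pm10 = {date(2019, 3, 24): 80.0, date(2019, 3, 10): 65.0}
record = {**calendar_features(target),
          **weekly_lags(daily_max_pm10, target, "maxvalPM10")}
```

Repeating `weekly_lags` for each of the four daily statistics of every pollutant and meteorological variable is what makes the number of predictors grow into the thousands.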
The relationships among the different variables are studied with statistical methods to determine the best way to incorporate the meteorological and temporal variables into the ML models. This is because there is not enough data to let the method itself determine all this information (new non-linear variables, interactions, etc.) in a completely automated way, so a careful statistical analysis is also required.
After testing the models, it was observed that it is possible to predict up to
2 weeks into the future, without excessive loss in accuracy, so the results reported
are for this time range.

2.2 AI/ML Methods

As an exploratory measure of variable importance and of the base prediction capacity, the Spearman correlation of each variable with the pollutant to be predicted was measured. The 1,000 most correlated variables were chosen to form the basis of each of the models, independently for each pollutant. Even though there are several methods for variable selection, Spearman was used to account for nonlinear relationships and, due to the large number of variables and the redundancy of information (collinearity), there was no appreciable difference when using other variable selection methods. Although some ML methods can work with a high number of variables without overfitting, poorly related variables only introduce noise, reduce robustness and complicate the training process.
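The selection step can be illustrated with a small, dependency-free Python sketch. Ranking by the absolute value of the correlation is an assumption (the text only says "most correlated"), and the function names and toy columns are hypothetical.

```python
def ranks(values):
    """Average ranks, so that tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def top_k_by_spearman(columns, target, k):
    """Names of the k columns with the largest |Spearman| against target."""
    return sorted(columns,
                  key=lambda name: abs(spearman(columns[name], target)),
                  reverse=True)[:k]

target = [1, 2, 3, 4, 5]
columns = {
    "cubic": [1, 8, 27, 64, 125],   # monotone but nonlinear in the target
    "noise": [3, 1, 4, 1, 5],
    "inverse": [5, 4, 3, 2, 1],     # perfectly anti-correlated
}
selected = top_k_by_spearman(columns, target, k=2)
```

Because the correlation is computed on ranks, the monotone but nonlinear column scores as highly as a linear one, which is the property the text appeals to when preferring Spearman.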
As an initial model, a regression model analysis is performed. Despite the
initial selection of variables, cross-validation suggests that the regression model
is over-fitted, so the stepwise method is used. Although the regression model
allows us to determine the most important variables, due to the over-fitting of
the model by using all the variables, it is not possible to determine the base
prediction level, so regularized regression (ridge regression, LASSO and Elastic
Net) is used to explore this base prediction capability.
Subsequently, a Neural Network was tested. However, due to the scarcity of records, the depth of the network could only go up to 3 layers, even though the threshold parameter was controlled to avoid over-fitting.
The use of a regression tree model was also tested, which can give hints of
variables of importance, but with reduced predictive power, so a Random forest
model was also tested, using 5,000 training models.
After analyzing the results of the different models, it was observed that many of the variables have valuable information for the prediction of pollutants, but their use is limited due to the number of records, so a hybrid ad-hoc model was developed for the problem. This model combines the characteristics of Random forest, which fits multiple models, and those of a Neural Network, which can include new non-linear variables and their interactions.
The new model is named CM-MLPred and consists of the generation of new
nonlinear variables (polynomials of degree 2 and 3 of the original variables) and
interaction of multiple variables, in this case, interaction of 2-6 variables. Sub-
sequently, multiple regression models are fitted by means of a random selection
of the base variables and the new variables, this selection of variables can follow
a probability distribution of how many variables to use (the highest number of
variables is prioritized). Subsequently, a random selection of records to predict
is also performed, this in order to (1) not over-fit the final model, (2) give dif-
ferent selection probability to each record to simulate adjustment weights, if it
is desired to prioritize the error reduction for a certain type of value and (3) to
be able to use the data not used during the training/adjustment of the model,
to obtain an estimate of the prediction error (cross-validation).
Another additional feature of the model is that since there are multiple mod-
els with different variables and number of variables, it is not necessary to impute
missing data, since only those models for which the value of the variables is
available can be used for the final estimate. In addition, having a multiplicity
of models prevents an outlier or erroneous value of a predictor variable from
considerably affecting the estimation.
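The general idea described above can be sketched in a few lines of Python. This is not the authors' CM-MLPred implementation: it is a simplified illustration that assumes only squares and pairwise interactions as new variables, a fixed number of variables per sub-model, uniform record sampling, and a plain average as the combined estimate. Every name in it is hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Ordinary least squares with intercept, via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)

def mul(a, b):
    """Product that propagates missing values (None)."""
    return None if a is None or b is None else a * b

def expand(row):
    """Base variables, their squares, and all pairwise interactions."""
    n = len(row)
    squares = [mul(v, v) for v in row]
    pairs = [mul(row[i], row[j]) for i in range(n) for j in range(i + 1, n)]
    return list(row) + squares + pairs

def fit_ensemble(rows, y, n_models=50, n_vars=2, sample_frac=0.7, seed=0):
    """Fit many small regressions on random variable and record subsets."""
    rng = random.Random(seed)
    feats = [expand(r) for r in rows]
    models = []
    for _ in range(n_models):
        cols = rng.sample(range(len(feats[0])), n_vars)
        idx = rng.sample(range(len(rows)), int(sample_frac * len(rows)))
        beta = fit_ols([[feats[i][c] for c in cols] for i in idx],
                       [y[i] for i in idx])
        models.append((cols, beta))
    return models

def predict_ensemble(models, row):
    """Mean estimate over the sub-models whose inputs are all available."""
    f = expand(row)
    estimates = [
        beta[0] + sum(b * f[c] for b, c in zip(beta[1:], cols))
        for cols, beta in models
        if all(f[c] is not None for c in cols)
    ]
    return sum(estimates) / len(estimates) if estimates else None

# Synthetic data: y depends on x0 and on the interaction x0 * x1.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(60)]
y = [1 + 2 * a + 3 * a * b for a, b, _ in rows]
models = fit_ensemble(rows, y)
full = predict_ensemble(models, [0.5, 0.5, 0.5])
partial = predict_ensemble(models, [0.5, None, 0.5])  # x1 is missing
```

The second prediction shows the property highlighted in the text: when a predictor is missing, the sub-models that need it are simply left out, so no imputation is required at prediction time. The records not drawn for a given sub-model could also be used to estimate its prediction error, which this sketch omits.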
Finally, the model can give a prediction interval, which can use the different estimates of each sub-model to obtain a probability distribution of the estimate. For the calculation of the probability distribution of the estimate, the weight of each sub-model is used, which is given by a score that is calculated as the product of the coefficient of determination of each model (adjusted R²), the AIC¹ (Akaike information criterion) and a coefficient of determination for prediction. Thus, a sub-model with a high score generally implies an adequate performance in the 3 different measures. The final estimate is given by the estimate value that maximizes the likelihood of the probability distribution of the estimate.
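The scoring and combination step can be illustrated as follows. The AIC rescaling is the one given in the paper's footnote; everything else in this Python sketch is an assumption: negative R² values are clipped to zero, the distribution of the estimate is approximated by a weighted Gaussian kernel density with a hand-picked bandwidth, and its maximum is located on a grid. The numbers are invented.

```python
import math

def normalized_aic(aics):
    """1 - (AIC - min)/(max - min), so that higher values are better."""
    lo, hi = min(aics), max(aics)
    if hi == lo:
        return [1.0] * len(aics)
    return [1 - (a - lo) / (hi - lo) for a in aics]

def scores(adj_r2, aics, pred_r2):
    """Sub-model weight: product of the three quality measures."""
    return [
        max(r, 0.0) * a * max(p, 0.0)   # clipping at zero is an assumption
        for r, a, p in zip(adj_r2, normalized_aic(aics), pred_r2)
    ]

def weighted_mode(estimates, weights, bandwidth, grid_size=200):
    """Location of the maximum of a weighted Gaussian kernel density."""
    lo, hi = min(estimates), max(estimates)
    if hi == lo:
        return lo
    best_x, best_density = lo, -1.0
    for g in range(grid_size + 1):
        x = lo + (hi - lo) * g / grid_size
        density = sum(
            w * math.exp(-0.5 * ((x - e) / bandwidth) ** 2)
            for e, w in zip(estimates, weights)
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x

# Three hypothetical sub-models predicting a log-concentration.
estimates = [4.0, 4.2, 6.0]
weights = scores(adj_r2=[0.6, 0.5, 0.2],
                 aics=[100.0, 110.0, 150.0],
                 pred_r2=[0.5, 0.4, 0.1])
final = weighted_mode(estimates, weights, bandwidth=0.3)
```

One consequence of the min-max rescaling is visible here: the sub-model with the worst AIC receives a score of exactly zero, so the outlying estimate of 6.0 has no influence on the final value.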

¹ Since models with lower AIC are better, 1 - (AIC - min(AIC))/(max(AIC) - min(AIC)) is used, so that higher values correspond to better models.

3 Results
The results for the 3 different pollutants are presented below, starting with PM10, for which the process and results are described in greater detail; for PM2.5 and O3, the results obtained by following the same process are presented. The distribution of the different pollutants over the data set (Fig. 3) is very asymmetric, with very large values being infrequent. However, the very high values are of special interest for prediction, which is why the first decision taken was to predict the logarithm of the pollutant concentration, which has a more symmetric distribution (more similar to a Gaussian) and makes all methods more efficient in the prediction of high concentrations.
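The effect of the logarithm on asymmetry can be checked with a short Python example. The sample values are invented, and the moment-based skewness coefficient is just one common way to quantify asymmetry; it is not taken from the paper.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric sample)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical daily maxima with one extreme episode.
daily_max = [35, 40, 42, 45, 50, 55, 60, 70, 90, 180]
logged = [math.log(v) for v in daily_max]

raw_skew = skewness(daily_max)
log_skew = skewness(logged)
```

A model fitted on the log scale returns log-concentrations, so its predictions would be passed through `math.exp` before being compared with the air quality thresholds.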

[Figure 3: two density plots with one curve per pollutant (O3, PM10, PM2.5). Left: distribution of the daily maximum pollutant value, in µg/m³ (PM10, PM2.5) and ppb (O3). Right: distribution of the daily log-maximum pollutant value.]

Fig. 3. Distribution of daily maximum values, observed for each pollutant (left) and
for its logarithm (right).

To calculate the prediction error for each of the models, a random set of data was selected for cross-validation, which corresponds to 13% of the total records (384 records). This validation data set was selected by means of a random sample where the probability of selection of each record is inversely proportional to its frequency, in order to determine how good each model is at predicting extreme cases, which are the ones of main interest. This validation set was never used in the training of any of the models, and its error is measured by the formula err_i = […], where C_i^obs is the i-th observation of the pollutant (C = PM10, PM2.5, O3) and C_i^est is the model prediction corresponding to the i-th observation. That is, it is a quadratic error relative to the measured value. err is therefore a vector, for which the different quartiles and the mean are estimated.
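The validation procedure can be sketched in Python as follows. Two points are assumptions rather than the authors' definitions. First, "frequency" is taken here as the number of records in a fixed-width bin of the pollutant value, and the sample is drawn without replacement using weighted random keys. Second, the error expression is not legible in this copy, so the sketch uses one possible reading of "quadratic error relative to the measured value", namely ((obs - est) / obs)². Function names and data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, quantiles

def inverse_frequency_sample(values, k, bin_width, seed=0):
    """Indices of k records drawn without replacement, with selection
    weight 1/count for the bin that each value falls in."""
    rng = random.Random(seed)
    bins = [int(v // bin_width) for v in values]
    counts = Counter(bins)
    # Weighted sampling by random keys u ** (1 / w); here w = 1 / count.
    keys = [(rng.random() ** counts[b], i) for i, b in enumerate(bins)]
    return sorted(i for _, i in sorted(keys, reverse=True)[:k])

def relative_squared_error(observed, predicted):
    """Assumed reading of the error: squared error relative to the observation."""
    return [((o - p) / o) ** 2 for o, p in zip(observed, predicted)]

def summarize(errors):
    """Quartiles and mean of the error vector, as reported in the tables."""
    q1, q2, q3 = quantiles(errors, n=4, method="inclusive")
    return {"min": min(errors), "q1": q1, "median": q2,
            "mean": mean(errors), "q3": q3, "max": max(errors)}

# 90 ordinary days and 10 extreme days; 13% of the records are held out.
concentrations = [60.0] * 90 + [210.0] * 10
held_out = inverse_frequency_sample(concentrations, k=13, bin_width=50)

errors = relative_squared_error([50, 100, 200, 80], [55, 90, 200, 100])
stats = summarize(errors)
```

With weights of this kind the rare, high-concentration days are much more likely to end up in the validation set than their share of the data would suggest, which matches the stated goal of judging the models on extreme cases.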
All models (Regression [REG], Regularized regression [RIDGE], Neural Networks [NN], Random Forest [RF], Regression Trees [TREE]) were calibrated to maximize their prediction power while avoiding overfitting. For regression, the stepwise method was used. For regularized regression, ridge regression, LASSO and Elastic Net were tested with an optimal value of the regularization parameter, and ridge regression always gave the best results. In the case of Neural Networks, the number of hidden layers and the threshold parameter were optimized, and different initial values for the weights were tested. This process was carried out for each pollutant, with the following results.
3.1 PM10
The hybrid model (CM-MLPred) aims to combine the advantages of new nonlin-
ear variables and interactions that a Neural Network provides, as well as the
multiplicity of predictors that a model such as Random Forest has. For the
training of this model, 5,000 sub-models were used, and they add 900 new vari-
ables, consisting of 300 non-linear transformations (polynomials of degree 2-3
and logarithms) as well as 600 variable interactions ranging from 2-6 variables,
where each sub-model contains 2-350 variables. A sample of the base variables
can be seen in Fig. 4, where the importance of the variables for the RF model is
presented.

[Figure 4: horizontal bar chart titled "Variable importance (RF)"; x-axis: IncNodePurity, from 0 to 20. Variables listed on the vertical axis: RadSolarmediantezonweek33, RadSolarmediantezonweek34, RadSolarmediantezonweek6, RadSolarmediantezonweek8, RadSolarmediantezonweek35, RadSolarmediantezonweek32, RadSolarmediantezonweek7, daySunday, meanvalPM10week50, RadSolarmediantezonweek9, year, maxvalPM10week50, dayn, medianvalPM10_week50, maxvalPM10, RadSolar max tezon week 33, meanvalPM10week52, medianvalPM10week52, RadSolarmax_tezonweek3, medianvalPM10week1, maxvalPM10week52, quarterQ3, RadSolarmaxtezonweek1, meanvalPM10week1, meanvalPM10, medianvalO3week15, RadSolarmaxtezonweek28, medianvalO3week13, maxvalPM10week5, RadSolarmediantezonweek10.]

Fig. 4. Variable importance for PM10 in the RF model

The following comparisons of the different models use the validation data set, which was never used to train the models or to calibrate their parameters.

[Figure 5: two panels. Left: box plot, "Distribution of the prediction error for PM10" (x-axis: Error; one box per model). Right: scatterplot, "Observed vs Predicted for PM10" (x-axis: Observed value; y-axis: Predicted; one series per model). Models: CM-MLPred, NN, REG, RF, RIDGE, TREE.]

Fig. 5. Comparison of the different models for the prediction error of PM10, in the validation set

Figure 5 shows the distribution of the squared prediction error as a box plot. From both graphs it can be seen that the hybrid method, CM-MLPred, presents more cases with low errors than the other models and has a lower variability in its errors.
Table 1 shows the different quartiles for the error obtained with each model,
as well as their mean.

Table 1. Errors for the validation data set using the different models tested, for PM10

Model       Min.    1st Qu.  Median  Mean    3rd Qu.  Max.

CM-MLPred   0.0000  0.0138   0.0545  0.2349  0.1419    8.4287
REG         0.0000  0.0244   0.0890  0.2969  0.2176   10.5835
RIDGE       0.0000  0.0242   0.0897  0.2879  0.2242    7.5782
NN          0.0000  0.0230   0.0828  0.3087  0.2110    9.8603
TREE        0.0000  0.0387   0.1036  0.3495  0.2795   15.3808
RF          0.0000  0.0273   0.0948  0.2821  0.2291    7.4604

Taking these results into account, the best performing model was CM-MLPred, with the difference in the third quartile being the most relevant, indicating a lower probability of making large errors. Neural Networks (NN) could be considered the second best model.
Figure 5 also shows the observed vs estimated values of PM10 for each model, so the model that is closest to a straight line with slope 1 (dotted line) is the best one. In this graph it can be seen that CM-MLPred is the closest to that line and that the other models tend to have problems when predicting extreme values, especially for very high PM10 values, which are precisely the values we are interested in predicting.
3.2 PM2.5

[Figure 6: two panels. Left: box plot, "Distribution of the prediction error for PM25" (x-axis: Error; one box per model). Right: scatterplot, "Observed vs Predicted for PM25" (x-axis: Observed value; y-axis: Predicted; one series per model). Models: CM-MLPred, NN, REG, RF, RIDGE, TREE.]

Fig. 6. Comparison of the different models for the prediction error of PM2.5, in the validation set

Figure 6 shows the distribution of the prediction errors in a box plot. It can be seen that the hybrid model, CM-MLPred, has more cases with low errors than the other models and a lower variability in its errors. Figure 6 also shows the observed vs predicted values of PM2.5 for each model.
Table 2 shows the different quartiles for the error obtained with each model,
as well as their mean.

Table 2. Errors for the validation data set using the different models tested, for PM2.5

Model       Min.    1st Qu.  Median  Mean    3rd Qu.  Max.

CM-MLPred   0.0000  0.0181   0.0761  0.2527  0.2250    6.8016
REG         0.0000  0.0328   0.1003  0.3392  0.2799   10.4971
RIDGE       0.0000  0.0345   0.1054  0.3334  0.2823    8.2566
NN          0.0000  0.0355   0.1205  0.2152  0.2478    6.7171
TREE        0.0000  0.0453   0.1325  0.3819  0.3086    8.4904
RF          0.0000  0.0364   0.0989  0.3088  0.2712    5.9467

The best performing model was CM-MLPred, with the difference in the third quartile being the most relevant, indicating a lower probability of making large errors. Neural Networks (NN) could be considered the second best model.
Although CM-MLPred performed better than the other models, it is also observed that for PM2.5 it is more difficult to make adequate predictions, and the difference between the methods is smaller (compared to PM10), which may be expected given the low correlations of the predictor variables.

3.3 O3

[Figure 7: two panels. Left: box plot, "Distribution of the prediction error for O3" (x-axis: Error; one box per model). Right: scatterplot, "Observed vs Predicted for O3" (x-axis: Observed value; y-axis: Predicted; one series per model). Models: CM-MLPred, NN, REG, RF, RIDGE, TREE.]

Fig. 7. Comparison of the different models for the prediction error of O3, in the validation data set

Table 3 shows the different quartiles for the error obtained with each model, as
well as their mean.

Table 3. Errors for the validation data set using the different models tested, for O3

Model       Min.    1st Qu.  Median  Mean    3rd Qu.  Max.

CM-MLPred   0.0000  0.0185   0.0740  0.2897  0.1865   13.3532
REG         0.0000  0.0230   0.0873  0.4950  0.3019   14.2492
RIDGE       0.0000  0.0281   0.0882  0.4896  0.2684   19.4817
NN          0.0000  0.0341   0.1160  0.6401  0.4048   17.9026
TREE        0.0000  0.0332   0.1031  0.5523  0.2925   27.3900
RF          0.0000  0.0278   0.0924  0.4630  0.2312   22.3349

Figure 7 shows the distribution of the prediction errors for O3 as well as a plot of observed vs predicted values for each model.
The best performing model was CM-MLPred, with the difference in the third quartile being very relevant, indicating a lower probability of making large errors. Unlike for PM10 and PM2.5, Neural Networks (NN) performed worse in this case, with the second best model being Random Forest.
Although CM-MLPred performed better than the other models, it is also observed that for O3 it is more difficult to make adequate predictions, and the differences between the other methods are smaller (compared to PM10), which may be expected given the low correlations of the predictor variables.

4 Summary and Conclusions

A data set of meteorological variables and previous air pollutant concentrations was constructed in order to study different AI/ML models to predict future air pollutant concentrations. By analyzing the results and performance of these models in a cross-validation data set, a hybrid ad-hoc model was developed including the most relevant features of the studied models. The following conclusions are drawn from the results obtained:

• PM10 is the pollutant that was most correlated with the meteorological variables and the one for which the best predictions are achieved.
• PM2.5 is the pollutant that presented the greatest prediction problems.
• O3 is mainly dominated by solar radiation and the time of the year.
• For all pollutants, the hybrid model (CM-MLPred) was the one that obtained the best predictions and the one that performed better in predicting extreme values.
• The developed model can be adapted to other pollutants and cities.
• Because these pollutants show a pattern related to the day of the week, the effect of human activity is evident; therefore, the inclusion of related variables, such as human mobility, could significantly improve the results.

In future work we will include human mobility data as well as orography data, to improve the results.

Data Availability Statement

The data that support the findings of this study are openly available in figshare at https://dx.doi.org/10.6084/m9.figshare.16589822, under the Creative Commons Attribution (CC BY) license.

References
1. Arellano-Vdzquez, M., Minutti-Martinez, C., Zamora-Machado, M.: Automated
characterization and prediction of wind conditions using gaussian mixtures. In:
Martinez-Villasenor, L., Herrera-AlcAntara, O., Ponce, H., Castro-Espinoza, F.A.
(eds.) Advances in Soft Computing, vol. 12468, pp. 158-168. Springer, Cham
(2020). https://doi.org/10.1007/978-3-030-60884-212
2. Carbajal-Hernandez, J.J., Sdanchez-Ferndndez, L.P., Carrasco-Ochoa, J.A.,
Martinez-Trinidad, J.F.: Assessment and prediction of air quality using fuzzy logic
and autoregressive models. Atmos. Environ. 60, 37-50 (2012). https: //doi-org/10.
1016/j.atmosenv.2012.06.004
3. Carreén-Sierra, S., Salcido, A., Castro, T., Celada-Murillo, A.T.: Cluster analysis
of the wind events and seasonal wind circulation patterns in the Mexico city region.
Atmosphere 6(8), 1006-1031 (2015)
4. CONAGUA: Automated weather stations, August 2021. https://www.conagua.gob.
mx/tools/GUI/EMAS.php
264 C. Minutti-Martinez et al.

5. Council, N.R.: Estimating Mortality Risk Reduction and Economic Benefits from
Controlling Ozone Air Pollution. The National Academies Press, Washington, DC
(2008). https://doi.org/10.17226/12198
6. De Foy, B., et al.: Mexico city basin wind circulation during the MCMA-2003
field campaign. Atmos. Chem. Phys. Discuss. 5(3), 2503-2558 (2005). https: //hal.
archives-ouvertes.fr/hal-00303903
7. De Foy, B., Clappier, A., Molina, L.T., Molina, M.J.: Distinct wind convergence
patterns in the Mexico city basin due to the interaction of the gap winds with
the synoptic flow. Atmos. Chem. Phys. 6(5), 1249-1265 (2006). https://www.org/
10.5194/acp-6-1249-2006
8. Di, Q., et al.: An ensemble-based model of PM2.5 concentration across the contigu-
ous united states with high spatiotemporal resolution. Environ. Int. 130, 104909
(2019). https://doi.org/10.1016/j.envint.2019.104909
9. de Foy, B., et al.: Basin-scale wind transport during the MILAGRO field campaign
and comparison to climatology using cluster analysis. Atmos. Chem. Phys. 8(5),
1209-1224 (2008). https://www.org/10.5194/acp-8-1209-2008
10. Heinzerling, A., Hsu, J., Yip, F.: Respiratory health effects of ultrafine particles
in children: a literature review. Water Air Soil Pollut. 227(1), 32 (2015). https: //
doi.org/10.1007/s11270-015-2726-6
11. Iskandaryan, D., Ramos, F., Trilles, S.: Air quality prediction in smart cities using
machine learning technologies based on sensor data: a review. Appl. Sci. 10(7)
(2020). https: //doi-org/10.3390/app10072401
12. Jauregui, E.: Local wind and air pollution interaction in the Mexico basin.
Atmésfera 1(3) (2011). https://www.revistascca.unam.mx/atm/index.php/atm/
article/view/25944
13. Minutti, C.: Pollutant and meteorological data for the prediction of air pollutants
in Mexico city, September 2021. https: //doi-org/10.6084/m9.figshare.16589822.v1
14. R Core Team: R: A Language and Environment for Statistical Computing. R Foun-
dation for Statistical Computing, Vienna, Austria (2021). https: //www.R-project.
org/
15. RAMA: Automatic air quality monitoring network (2021). http: //www.aire.cdmx.
gob.mx/default.php?ope %27aKBh%27
16. Sanchez-Pérez, P.A., Robles, M., Jaramillo, O.A.: Real time Markov chains: wind
states in anemometric data. J. Renew. Sustain. Energy 8(2), 023304 (2016).
https://doi.org/10.1063/1.4943120
17. Shah, S.A.A., Almaraashi, W.A.M., Nadeem, M.S.A., Habib, N., Shim, S.O.: A
hybrid model for forecasting of particulate matter concentrations based on
multiscale characterization and machine learning techniques. Math. Biosci. Eng.
18(3), 1992 (2021). https://doi.org/10.3934/mbe.2021104
18. Yoo, J., Shin, D., Shin, D.: Prediction system for fine particulate matter
concentration index by meteorological and air pollution material factors based on
machine learning. In: Proceedings of the Tenth International Symposium on
Information and Communication Technology, SoICT 2019, pp. 479-485. Association for
Computing Machinery, New York (2019). https://doi.org/10.1145/3368926.3369684
19. Zhao, Y., Hasan, Y.A.: Fine particulate matter concentration level prediction by
using tree-based ensemble classification algorithms. Int. J. Adv. Comput. Sci. Appl.
4(5) (2013). https://doi.org/10.14569/IJACSA.2013.040503

