A Data-driven Approach to Forecasting Ground-level Ozone Concentration
Abstract

The ability to forecast the concentration of air pollutants in an urban region is crucial for decision-makers wishing to reduce the impact of pollution on public health through active measures (e.g. temporary traffic closures). In this study, we present a machine learning approach applied to forecasts of the day-ahead maximum value of ozone concentration for several geographical locations in southern Switzerland. Due to the low density of measurement stations and to the complex orography of the use-case terrain, we adopted feature selection methods instead of explicitly restricting relevant features to a neighborhood of the prediction sites, as is common in spatio-temporal forecasting methods. We then used Shapley values to assess the explainability of the learned models in terms of feature importance and feature interactions in relation to ozone predictions. Our analysis suggests that the trained models effectively learned explanatory cross-dependencies among atmospheric variables. Finally, we show how weighting observations helps to increase the accuracy of the forecasts for specific ranges of ozone's daily peak values.

Keywords: Shapley values; Genetic algorithms; Environmental forecasting; Evaluating forecasts; Multivariate time series
© 2021 The Author(s). Published by Elsevier B.V. on behalf of International Institute of
Forecasters. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.1016/j.ijforecast.2021.07.008
D. Marvin, L. Nespoli, D. Strepparava et al. International Journal of Forecasting 38 (2022) 970–987
bring to this work. Section 2 introduces the dataset and the nomenclature we use in the paper to refer to the different features, and Section 3 presents the forecasting problem peculiarities and the problem formulation. Section 4 describes the two feature selection methods that were tested, namely a custom genetic algorithm and a feature selection method based on Shapley values. Section 5 outlines the regression algorithms used to perform the analysis. Section 6 introduces the deterministic and probabilistic key performance indicators (KPIs) that we used to evaluate the different forecasters. Section 7.1 presents the results of the two tested feature selection algorithms. In Section 7.2 we study how different features and feature interactions affect the final predictions of the forecast, using Shapley values. In Section 7.3, we show the numerical results for the tested forecasting algorithms, while in Section 7.4 we focus on predictions of extreme events. Finally, Section 8 concludes the paper with a summary of our main findings.

1.1. Related works

Tropospheric ozone concentration has been the subject of several studies, both for prediction (the task of finding a map from a set of covariates to a target) and for forecasting (predicting the values of the target in advance, in future time steps). In Al Abri, Edirisinghe, Nawadha, and Kingdom (2015) different non-parametric models from the WEKA toolkit were tested to derive the ozone concentration from a set of eight different gaseous chemicals and atmospheric conditions measured at a single location. Similarly, WEKA was used in Mohan and Saranya (2019) to adapt models representing atmospheric conditions to the ozone concentration at ground level, which showed that even summer ozone peaks can be accurately predicted if the atmospheric conditions are known. In Feng, Zhang, Sun, and Zhang (2011), meteorological data from a site near Beijing were used to predict the hourly ozone concentration at that point by using a neural network whose weights were trained using a genetic algorithm. In addition, different models were adapted for different times of the day. In Sheta, Faris, Rodan, Kovač-Andrić, and Al-Zoubi (2018), a nonlinear state-space model using PM10, temperature, wind speed, and relative humidity as inputs was identified by using a neural network to predict ozone concentration. The model was then compared with linear models and a multilayer perceptron. In Siwek and Osowski (2016) the authors used a dataset of 55 characteristics (meteorological conditions and their statistical transformations) collected in Warsaw to predict various air pollutants. They showed that by reducing the number of features with a pre-selection step, the final accuracy of the prediction could be increased. Two pre-selection methods were compared: a genetic algorithm and a stepwise greedy strategy for linear models.

The task of forecasting PM2.5 and ozone concentrations for three large Chinese cities was considered in Lv, Cobourn, and Bai (2016). Like in our study, the authors considered multiple monitoring stations, but the final values of the relevant atmospheric variables were weighted averages of neighbors of the target cities. The forecasts were obtained by fitting knowledge-based empirical formulae using historical data. No systematic investigation of the interaction of variables was carried out. The authors showed how the maximum daily temperature is the single most relevant variable in predicting (and forecasting) ozone concentration. They consistently found a strong correlation between the numerical weather prediction forecast error of this value and the error for ozone prediction. In Eslami, Choi, Lops, and Sayeed (2019) the authors proposed a deep convolutional neural network (CNN) to forecast the hourly ozone concentrations for the day ahead, over 25 monitoring sites. Despite the ability of the CNN to predict daily ozone trends correctly, the authors found that it under-predicted high ozone peaks during the summer. In Gong and Ordieres-Meré (2016), the authors focused on forecasts of extreme ozone concentrations, which are also the most useful to predict. Forecasting extreme events is, in fact, more complicated than predicting them, as demonstrated in Mohan and Saranya (2019). When one is mostly interested in predicting these tail events, sampling techniques can be applied in order to mitigate the class imbalance problem (rare events are under-represented in the training data). The authors of this study applied different sampling methods to increase the classification accuracy of ozone concentration, considering three different classes. They found that under-sampling can indeed increase the classification performance. Unfortunately, a drawback of this technique is that several data of the most represented class are discarded, which could lead to a lack of generalization of the model, due to over-fitting or a reduction of cross-learning (learning patterns from data in a given class, which are also present in a second unobserved class, which could increase the prediction accuracy).

1.2. Contributions

In the presence of a high number of relevant features, the task of forecasting the next-day peak in the concentration of ozone becomes highly challenging, due to the low number of observations on which a forecasting algorithm can be trained. In fact, having a dataset consisting of a few years of observations could result in having a number of features higher than the number of observations, as in the presented case. On the other hand, observations further back in time may not be representative of the current situation, as the mixture of nitrogen oxides in the air has changed over time following vehicle fleet renewal. As a consequence, due to the scarce number of instances, we could not apply under-sampling techniques, as done in Mohan and Saranya (2019). The only effective way to train a model is by applying dimensionality reduction techniques. Our first contribution consists of evaluating two different methods to perform feature selection. First, we tested a genetic algorithm, as was done in Siwek and Osowski (2016) for the pollutant prediction task. In this case, we crafted custom mutation and crossover functions tailored to the forecasting task. The second approach we tested is based on Shapley values (Lundberg & Lee, 2017). We then evaluated and compared the two feature selection methods. To show that the feature selection step is beneficial in increasing the accuracy of
the predictive algorithm, we compared our models with two control cases: one in which the model uses all the available features and one in which we pick the features entirely at random. Our second contribution is to compare the performance of different popular learning algorithms trained on the selected features. Third, we investigate the effect of imposing weights on the observations with the highest daily ozone concentration on the algorithms' forecasting quality of extreme values. Our final contribution is an a posteriori explanation of feature importance. We investigate the more relevant feature interactions in predicting the ozone peak and explain our findings in terms of atmospheric physics.

Table 1
Geographic context of air quality and weather monitoring stations.

Station     Code   Altitude [m a.s.l.]   Context   Main O3 source
Locarno     l1     200                   Urban     Industry
Brione      l2     486                   Suburb    Valley floor
Bioggio     l3     314                   Suburb    Industry
Tesserete   l4     518                   Rural     Valley floor
Chiasso     l5     230                   Urban     Industry
Mendrisio   l6     354                   Suburb    Industry
Sagno       l7     704                   Rural     Valley floor

Table 2
Geographic context of weather forecasting locations.

Location      Code   Altitude [m a.s.l.]   Context
Comprovasco   p1     575                   Rural
Matro         p2     2171                  Mountain
Bioggio       p3     518                   Suburb
Tesserete     p4     626                   Rural
Chiasso       p5     240                   Urban
Sagno         p6     704                   Suburb

Table 3
Dataset description. The symbol † denotes a variable that is both measured and forecasted by the NWP service, while the symbol ‡ indicates a signal that is only forecasted.

Signal                                Symbol   Unit
Nitrogen oxide                        NO       [µg/m³]
Nitrogen dioxide                      NO2      [µg/m³]
Generic nitrogen oxides               NOx      [ppb]
Ozone                                 O3       [µg/m³]
† Global irradiance                   G        [W/m²]
† Atmospheric pressure                P        [hPa]
† Precipitation                       Prec     [mm]
† Relative humidity                   RH       [%]
† Temperature                         T        [°C]
‡ Dew point                           TD       [°C]
† Wind direction, vectorial average   Wd       [°]
† Wind speed, vectorial average       Ws       [m/s]
† Cloud cover                         CN       [–]
2. Dataset

2.1. Geographical context and data acquisition

In this study, we focused on the Canton of Ticino, the southernmost canton of Switzerland. In this region, the concentration of air pollutants is generally higher than in the rest of the country and is influenced by both the orography and the level of urbanization and industrialization. The natural shield provided by the Alps makes Ticino the region with the highest solar radiation rate in Switzerland. Ticino is characterized by a densely populated and heavily trafficked southern region, and by a sparsely populated and more mountainous northern region. It borders Lombardy to the south, the most industrialized region in Italy.

We used data acquired from several air quality and weather stations distributed in the region. In addition, the numerical weather prediction (NWP) service of the Swiss Federal Office of Meteorology and Climatology MeteoSwiss provided weather forecasts for some of these locations. Fig. 1 shows the position of the monitoring stations and the locations for which the weather forecasts are available. Tables 1 and 2 describe the geographical context for the monitoring stations and the weather forecasting locations, respectively.

Due to its photochemical origin, O3 shows a strong seasonal pattern, with higher concentrations in summer. For this reason we focused our analysis only on the period from May to September in the years between 2015 and 2019. This period was chosen to take into account a significant number of measurements, i.e. enough to train the algorithms correctly, while simultaneously avoiding the use of previous years, when the emissions of the precursors NO, NO2, and NOx in Switzerland were more intense than today. We used data from the first four years to train the forecasting algorithms and data from 2019 to test them.

A variety of signals covering weather and air quality with hourly resolution were considered for the model, as shown in Table 3. Most monitoring stations record the full set of signals specified in Table 3 on site. The few exceptions, where some signals were not collected locally, were managed with data from the nearest available stations. When training the forecasting algorithms, we considered data measured up to 72 h in the past (except for the ARIMAX model) and weather forecasts up to 33 h in the future. The value of 72 h was chosen based on preliminary results, in which we considered history lengths of up to seven days and systematically shortened them. As we did not experience significant improvements in accuracy from using a history length greater than 72 h, we fixed this value for all the experiments.

For various reasons, such as station maintenance, data transmission failure, or power outages, some of the data in the time series of the years from 2015 to 2019 are missing. During the training period, data completeness is above 99%, but two stations have substantial data holes in 2019: in Tesserete and Sagno, respectively, 50 and 15 full days of measurements are missing during the test period. All the missing data in the training set were filled using a random forest with surrogate splits, trained to predict the missing data using the station itself and its neighbors.

2.2. Feature engineering
Fig. 1. Map of Ticino with the locations of stations used in the study. Red shows the stations where air quality and weather measurements are
collected, and blue shows the locations for which weather forecasts are available. The maps were originally downloaded from d-maps-1 (2020a)
and d-maps-2 (2020b). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
the overall number of features and minimize the computational effort, we partly replaced the hourly values of the measured and forecasted signals with basic statistical aggregations, i.e. minimum, maximum, and average values over a longer time period, as illustrated in Table 4.

Based on suggestions from experts in the field of atmospheric physics,² we further manipulated some of the signals available in the dataset to create additional features. The engineered features are listed in Table 4. In addition, we included a categorical feature, called RHW, which describes the general weather situation in Switzerland for the prediction day using 12 weather types.

² Environment Observatory of Southern Switzerland (OASI).

For each location, separate forecasting models are trained using a subset of the matrix of all features. This subset contains data specific to the location and information from the neighboring stations. For NWP data, hourly values are used for the specific location, while bins are used for the data of the neighboring stations, as summarized in Table 4. For example, the dataset of Chiasso contains the hourly NWP data from Chiasso itself and bin aggregations from Sagno.

Given the different number of stations involved each time, the number of features for each model varies between 1700 and 2100.

2.3. Nomenclature

The ozone forecast at any station for any given day D is computed twice: first at 18:30 (16:30 UTC) on the previous day D − 1, which we call the EVE forecast, and second at 06:30 (04:30 UTC) on the same day D, here called the MOR forecast. This is because the weather forecasts issued by the NWP services are published twice a day, at 05:00 and at 14:00 local time. Fig. 2 illustrates the time window for a generic day. For each station, we tested eight different prediction methods at both prediction times, EVE and MOR, for a total of 16 models per station.

When labeling the aggregated data in Table 4, we use the following conventions. Measured quantities are denoted by the letter m, and weather forecasts provided by NWP services are denoted by the letter s. The index is the difference in hours between the last available data point and the acquisition time. Following this convention, m0 refers to the last measured data point available, i.e. the value measured at 06:00 for the MOR forecast and at 18:00 for the EVE forecast. Likewise, m1 refers to the value measured at 05:00 and 17:00, and so on up to m23. The same temporal indexing applies to the values provided by NWP services. For MOR we call s0 the forecasted value at 06:00, s1 the value for 07:00, and so on. In the EVE case we call s0 the predicted value at 18:00, s1 the predicted
Table 4
Summary of all the features used in this study. More information about the MOR and EVE cases is given in Section 2.3.

Signal kind                           Time interval                                  Code                    Aggregation
All measured data                     Past 24 h (m0, ..., m23)                       mi                      Hourly values
                                      From 0 to 24 h before                          24h                     Mean
                                      From 0 to 48 h before                          48h                     Mean
                                      From 0 to 72 h before                          72h                     Mean
All forecasts (same station)          MOR: from s0 to s32; EVE: from s0 to s29       si                      Hourly values
All forecasts (neighboring station)   MOR: from s0 to s7; EVE: from s0 to s6         b1                      Minimum, maximum, and
                                      MOR: from s8 to s16; EVE: from s7 to s13       b2                      average of every bin bi
                                      MOR: from s17 to s24; EVE: from s14 to s19     b3
                                      MOR: from s25 to s32; EVE: from s20 to s29     b4
Measured NOx                          MOR: previous afternoon (m6 to m18);           NOx12h                  Mean
                                      EVE: previous morning (m6 to m18)
Forecasted T                          MOR: upcoming afternoon (s6 to s18);           T̂PM, T̂PM,squared        Mean and squared mean
                                      EVE: upcoming afternoon (s18 to s29)
Forecasted T                          MOR: all hourly values, from s0 to s32;        T̂max                    Maximum
                                      EVE: all hourly values, from s0 to s29
Forecasted TD                         MOR: all hourly values, from s0 to s32;        T̂Dmax, T̂Dmax,transf     Maximum, (Maximum + 20)³
                                      EVE: all hourly values, from s0 to s29
Forecasted G                          MOR: upcoming morning (s0 to s6);              ĜAM, ĜPM                Mean
                                      MOR: upcoming afternoon (s6 to s18);
                                      EVE: upcoming morning (s6 to s18);
                                      EVE: upcoming afternoon (s18 to s29)
Forecasted CN                         MOR: upcoming morning (s0 to s6);              ĈNAM, ĈNPM              Mean
                                      MOR: upcoming afternoon (s6 to s18);
                                      EVE: upcoming morning (s6 to s18);
                                      EVE: upcoming afternoon (s18 to s29)
Forecasted Prec                       Upcoming 24 h (s0 to s23)                      P̂rec24h,sum             Sum
Measured O3                           O3 measurements of the previous day            YO3                     Maximum
Forecasted RHW                        One categorical value for the prediction day   RHW                     –
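As a concrete illustration, the bin aggregation applied to forecasts from neighboring stations (rows b1–b4 of Table 4, MOR case) can be sketched as follows; the helper name is ours, not from the paper:

```python
import numpy as np

# Bin edges for the MOR case, from Table 4: b1 = s0..s7, b2 = s8..s16,
# b3 = s17..s24, b4 = s25..s32 (slice ends are exclusive).
MOR_BINS = [(0, 8), (8, 17), (17, 25), (25, 33)]

def bin_features(hourly_forecast, bins=MOR_BINS):
    """Replace an hourly forecast vector from a neighboring station with
    the minimum, maximum, and average of each aggregation bin."""
    feats = {}
    for i, (lo, hi) in enumerate(bins, start=1):
        chunk = np.asarray(hourly_forecast[lo:hi], dtype=float)
        feats[f"b{i}_min"] = chunk.min()
        feats[f"b{i}_max"] = chunk.max()
        feats[f"b{i}_mean"] = chunk.mean()
    return feats

# 33 hourly values s0..s32, as in the MOR case
feats = bin_features(np.arange(33.0))
```

For the EVE case only the bin edges change (s0–s6, s7–s13, s14–s19, s20–s29), yielding the same 12 features per signal and neighboring station.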
Fig. 2. Time window of input and output data. Times are given in local time (CEST).
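The timing conventions of Fig. 2 and the m/s indexing of Section 2.3 can be made concrete with a small helper (local-time arithmetic only; the function names are ours):

```python
from datetime import datetime, timedelta

# Anchor hours of the two daily forecasts (local time, Section 2.3):
# for MOR the last measurement m0 is the 06:00 value, for EVE the 18:00
# value, and s0 is the forecast for that same hour.
ANCHOR = {"MOR": 6, "EVE": 18}

def measured_hour(i, case, day=datetime(2019, 7, 1)):
    """Wall-clock time of the measured value m_i (i hours back)."""
    return day.replace(hour=ANCHOR[case]) - timedelta(hours=i)

def forecast_hour(i, case, day=datetime(2019, 7, 1)):
    """Wall-clock time of the forecasted value s_i (i hours ahead)."""
    return day.replace(hour=ANCHOR[case]) + timedelta(hours=i)
```

For instance, measured_hour(1, "MOR") falls at 05:00, and forecast_hour(10, "EVE") falls at 04:00 on the following day, matching the examples given in Section 2.3.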
value at 19:00, and so on. The structure of the aggregation bins is shown in Table 4.

To refer to each specific component of the models, we denote the features based on the location of the measurement and the time to which it refers, combining the codes of Tables 3, 4, 1, and 2. For example, G_{m1}^{l1} designates the global irradiance measured in Locarno at 05:00 in the MOR model and at 17:00 in the EVE model. Similarly, T̂_{s10}^{p3} is the forecasted temperature in Bioggio at 16:00 in the MOR model, and at 04:00 on the following day in the EVE model. NO_{72h}^{l2} is the mean value of all the measured NO concentrations up to 72 h before the prediction, in Brione.

3. Problem formulation

The problem of forecasting the daily maximum ozone signal presents the following characteristics:

1. The signal is strongly seasonal, due to the presence of annual patterns in both the anthropogenic and non-anthropogenic processes governing ozone generation.
2. The signal is non-stationary, since its variance is subject to inter- and intra-annual fluctuations.

3. The forecasts' dependence on the features is non-linear, as described in the literature and as further detailed in Section 7.2.

4. The forecasted values are physically bounded by the photo-chemistry and advective phenomena regulating the formation and transport of ozone in the troposphere and atmosphere.

5. In our use case, the monitoring stations providing measurements of features relevant for ozone forecasting, such as temperature and past ozone and NOx values, are not sufficiently dense (nor at similar distances from the prediction points) to provide a regular mesh, as can be seen in Fig. 1. In this case, the use of spatio-temporal Gaussian processes (Kupilik & Witmer, 2018), Gaussian Markov random fields (Cameletti, Lindgren, Simpson, & Rue, 2013), or other graph-based spatio-temporal techniques (Carrillo et al., 2020) can lead to poor results.

Given the above considerations, we chose to model the daily ozone maxima using separate predictors for each location, while still taking into account relevant features from nearby locations. The neighboring stations for each prediction point are illustrated in Fig. 1, where the whole region is divided into three macro-zones. In any case, we let the feature selection processes described in Section 4 discriminate whether a given measurement station is relevant. Calling n the number of observations and k the number of features, we define a training and a test dataset, Dtr = {xtr, ytr} and Dte = {xte, yte}, respectively, where xtr ∈ R^(ntr×k) and xte ∈ R^(nte×k) are matrices of features, and ytr ∈ R^ntr and yte ∈ R^nte are the target vectors, containing the maximum hourly ozone concentration of the same days. In this paper, ntr and nte are equal to 587 and 151, respectively, that is, the number of available days between May and September for the 2015–2018 period and for 2019. On the other hand, the number of features, k, is fixed at 30 for all the numerical experiments and raised to 100 for the study of predicting high peaks, described in Section 7.4. These numbers were chosen experimentally, by systematically increasing them and choosing the k value beyond which the predictors' performance no longer increased significantly.

We train a model f(xtr, Θ), where Θ is the set of the model's parameters, in order to produce the forecasts for unseen data xte ∈ R^(nte×k):

    ŷte = f(xte, Θ).    (1)

In order to compare the results across the different approaches, we used regression-specific key performance indicators (KPIs), as classification scores can only be compared while using the same bins for the choice of the classes. Different values for the bins' edges are used in the ozone prediction literature, since those are typically chosen based on the local legislation. As such, we trained the model f(xtr, Θ) minimizing the L2 loss:

    Θ* = argmin_Θ ‖ytr − f(xtr, Θ)‖₂².    (2)

We highlight that this notation must be slightly adapted for the ARIMAX model introduced in Section 5; in this case the model can be described as f(xtr, ytr, Θ), where the endogenous input signal ytr is opportunely shifted with the use of the backshift operator, as further explained in Section 5.

4. Feature selection methods

Given the large number of features in each model, if we were to train the prediction algorithms using all the variables, whose number largely exceeds the number of available observations, we could potentially incur numerical problems of solution non-uniqueness and multicollinearity that would corrupt the prediction process. Moreover, even if the dataset contained a proportional number of observations, an excessive number of features would still result in a long computational time, which is justified only if the forecasting performance is better than that of an algorithm trained on a subset of the features.

We decided to perform feature selection using a custom implementation of a genetic algorithm (GA), as well as using a procedure from game theory that exploits Shapley values. The effectiveness of these two approaches is compared in Section 7.1 against a model composed of features picked at random and a model composed of all the available features.

4.1. Feature selection using a genetic algorithm

In our implementation of the GA, an individual A is defined as a subset of the entire feature set F with cardinality k:

    A ⊂ F,  |A| = k    (3)

where k is the number of retained features. As anticipated in Section 3, in this study we set k = 30.

We defined a crossover function that ensures that the offspring that emerges from it still retains k features from its parents, with no repetitions. Formally, the offspring C is a subset of the union of the sets of its parents A and B, with cardinality k:

    C ⊂ (A ∪ B),  |C| = k    (4)

We defined a custom mutation function so that each feature of the offspring C is either the original feature of its parent A with probability 95%, or a new feature from B ⊂ (F \ A) with probability 5%, and such that C has only unique features. In practice, we generate two sequences a and b from the sets A and B, by randomly fixing their order, and iterate on them to generate the new set C, which is composed of the elements of the sequence c, where the ith element of c is defined as:

    ci = ai 1(u ≥ 0.05) + bi 1(u < 0.05),  u ∼ U[0, 1]    (5)

The fitness function is defined as the out-of-bag mean squared error (MSE) of a random forest composed of 30 bootstrap-aggregated (bagged) decision trees, trained on the k active features of the individual. We selected a population size of 10k individuals, a crossover fraction of 80%, and an elite count of 5% of the population size. The GA stops after 100 stall generations, and the feature set of the best individual is selected.
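The crossover and mutation rules of Eqs. (3)–(5) can be sketched as follows; this is a minimal illustration under our own naming, and it omits the out-of-bag random-forest fitness evaluation:

```python
import random

def crossover(parent_a, parent_b, k):
    """Eq. (4): the offspring keeps k unique features drawn from the
    union of its parents' feature sets."""
    return set(random.sample(sorted(parent_a | parent_b), k))

def mutate(individual, feature_set, p=0.05):
    """Eq. (5): each feature is kept with probability 1 - p, or replaced
    by an unused feature drawn from F \\ A, so the offspring stays unique."""
    kept = list(individual)
    pool = random.sample(sorted(set(feature_set) - individual), len(kept))
    return {a if random.random() >= p else b for a, b in zip(kept, pool)}

F = set(range(2000))                   # ~2000 candidate features (Section 2.2)
A = set(random.sample(sorted(F), 30))  # an individual with k = 30 features
B = set(random.sample(sorted(F), 30))
child = mutate(crossover(A, B, 30), F)
```

Because the replacement pool is drawn from unused features and sampled without replacement, the offspring always keeps exactly k distinct features, as required by Eqs. (3) and (4).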
4.2. Feature selection using Shapley values

Another method to assess the importance of each feature is presented in Lundberg and Lee (2017). This method assigns feature importance scores using Shapley values, which originated in the field of game theory, where they are used to estimate the contribution of various agents in increasing the welfare of a community. These are expressed as:

    φi(f, xtr) = Σ_{z ⊆ xtr} [ |z|! (k − |z| − 1)! / k! ] [ f(z, Θ) − f(z\i, Θ) ]    (6)

where f(xtr, Θ) is a regression model, in our specific case the NGBoost algorithm, which is introduced in Section 5; xtr is the feature set on which the model has been trained; Θ is a model-specific set of parameters; k is the number of variables in the training set xtr; and z\i denotes the minus set operation, that is, the subtraction of the ith feature from the reduced dataset z. The authors in Lundberg and Lee (2017) showed that such coefficients have highly desirable properties that favorably affect their ability in the (local) explanation of the models, and have been shown to be consistent and more robust with respect to other, more widespread methods for the evaluation of feature importance. Furthermore, the authors in Lundberg et al. (2020) recently proposed a computationally efficient algorithm specifically tailored for tree-based models, and made it available through the shap Python package. The shap package provides an exact computation of Shapley value explanations for tree-based models. This provides local explanations with theoretical guarantees of local accuracy and consistency, which increase the robustness of the method, since it does not rely on the random samplings that would be required to find the Shapley values using approximate algorithms.

5. Regression models

After the k features that best explain the data are selected, we use them to create the regression matrix and produce the test forecasts. For this work, we studied the output of several parametric models, such as linear regression, ridge regression, LASSO, and ARIMAX, as well as more complex non-parametric tree-based algorithms, such as random forests, XGBoost, NGBoost, and LSBoost, as described below.

Penalized linear regression algorithms. Ridge regression is a method designed to avoid collinearity issues and near-singular matrix inversions when solving linear regression problems, especially in the case in which the number of features is large compared to the number of observations. In this case, the regression coefficients β ∈ R^k are quadratically penalized with parameter λ, such that the closed-form solution becomes:

    β̂R = (xtrᵀ xtr + λ Ik)⁻¹ xtrᵀ ytr,    (7)

where Ik is an identity matrix of size k. In this work, λ is tuned in cross-validation on the training set. Instead of penalizing β using the L2 norm, least absolute shrinkage and selection operator (LASSO) regression (Tibshirani, 1996) penalizes β using the L1 norm, such that some of the elements of β̂ can be set to zero. Unlike ridge regression, LASSO does not have a closed-form solution, but rather must be approximated through numerical methods.

ARIMAX. The well-known autoregressive integrated moving average with explanatory variables (ARIMAX) model is defined as:

    ŷt = β xtr + Σ_{i=1}^{p} φi B^i(y′t) − Σ_{i=1}^{q} θi εt−i + εt,    (8)

where B is the backshift operator, i.e. B^n(zt) = zt−n; εt is additive white Gaussian noise; and the time series y′ is the result of differencing ytr d times. The matrix xtr contains the daily values of the selected features up to the day before the prediction, and β denotes the regression coefficients as usual.

The models are created and calculated with statsmodels's SARIMAX function, and the parameters p, q, and d are chosen via grid search and fitted using maximum likelihood estimation; we considered only d = 0, since we do not have important trends, while we set 7 and 3 as the maximum values for p and q, respectively. We stress that, even if the features contained in xtr refer to the last 72 h, the ARIMAX model has been left free to extend the endogenous signal's influence on the forecast up to seven previous days, that is, p = 7. However, for all the considered locations, the grid search returned p ≤ 3.

Random forests and quantile random forests. The random forest (RF) algorithm independently fits several decision trees, each trained on a different dataset, created from the original one through random re-sampling of the observations and keeping only a fraction of the overall features, chosen at random (Hastie, Tibshirani, & Friedman, 2009). The final prediction of the RF is then a (possibly weighted) average of the trees' responses. One important variant of RF algorithms is quantile regression forests (QRF). The main difference from RF is that QRF keeps the value of all the observations in the fitted trees' nodes, not just their mean, and assesses the conditional distribution based on this information. Here, we used the Matlab TreeBagger class, which implements the QRF algorithm described in Meinshausen (2014).

Tree-based boosting algorithms. Boosting algorithms employ additive training: starting from a constant model, at each iteration a new tree or any other so-called ''weak learner'' hk(x) is added to the overall model Fk(x), so that Fk+1(x) = Fk(x) + η hk(x), where η ≤ 1 is a hyper-parameter denoting the learning rate, which helps reduce over-fitting. The least-squares gradient boosting (LSBoost) algorithm applies boosting in functional space: each weak learner h tries to learn the gradient (with respect to the previous model Fk(x)) of the least-squares loss function. In other words, hk is fitted on the overall prediction error at iteration k − 1.

A different approach is used by the XGBoost algorithm (Chen & Guestrin, 2016), which fits the additive
model Fk(x) in parameter space, that is, using a second-order approximation of the loss as a function of the parameters of the weak learners (decision trees). This approximation and other techniques used by XGBoost (like an approximate histogram search for selecting splitting points in the trees) result in a speedup of the training process, with respect to LSBoost or RF algorithms, without sacrificing accuracy. At the same time, the algorithm introduces quadratic penalization on the parameters' values and on the overall complexity of the trees, which can be tuned to further mitigate over-fitting.

In addition to the QRF algorithm, we used a second algorithm that is able to assess the conditional probability distribution of the predictions: natural gradient boosting (NGBoost) (Duan et al., 2019). While none of the previous algorithms introduced assumptions on the probability distribution of the observations, NGBoost explicitly fits the parameters of a parametric probability distribution on each observation. This is made possible by exploiting the tree structure of the underlying weak learner, since observations in the same leaves share the same probability distribution parameters. The algorithm is fitted in functional space, but instead of directly learning the maximum likelihood gradient, the authors propose to cor-

percentage error, S denotes forecast skill, and A denotes accuracy. RMSEpers is the RMSE of the persistence model, i.e. the model where the prediction at day D + 1 is equal to the measured value at day D. C(yi) is the function that associates every measured or forecasted value to the respective class, explicitly given by

    C(yi) = 1 if 0 < yi ≤ 60,
            2 if 60 < yi ≤ 120,
            3 if 120 < yi ≤ 135,
            4 if 135 < yi ≤ 180,
            5 if 180 < yi ≤ 240,
            6 if 240 < yi.    (14)

These values are the thresholds of classes of increasing severity of air pollution, as indicated by the Swiss Society of Air Protection Officers (Cercl'Air) (Swiss Society of Air Protection Officers, 2019). Class 3 is especially narrow compared to the other classes; as a result, it will be harder for the regression forecasting algorithms to correctly predict this class.

Finally, we evaluated those algorithms which also returned conditional distributions—that is, QRF and
rect it with the Fisher information. This results in fitting
NGBoost—using two additional KPIs. The first KPI is the
the so-called natural gradient, which makes the learning
reliability (Pinson, McSharry, & Madsen, 2010), defined as
process invariant to reparametrization of the underlying
probability distribution. n
1∑
We fitted LSBoost models using Matlab’s fitrensem- R(τ ) = 1{yi < ŷi,τ }, (15)
n
ble function, tuning its hyper-parameters via Bayesian i=1
optimization and using five-fold cross-validation. The XG-
where ŷi,τ is the quantile predicted by the algorithm at
Boost and NGBoost algorithms were fitted using their
the level τ ∈ [0, 1]. This KPI calculates how many of the
official xgboost and ngboost Python packages, respec-
tively, and hyper-parameters were selected using a grid total number of measured values are indeed lower than
search, always using a five-fold cross-validation strat- the quantile predicted on the same observations. If the
egy. We highlight that tuning the hyper-parameters in forecasting algorithm were perfect, the R(τ ) curve would
cross-validation mitigates over-fitting issues with the re- lie on the bisector of the first quadrant.
gression algorithms. The second probabilistic KPI is the average quantile
loss function, also known as pinball loss (Bentzien &
6. Key performance indicators Friederichs, 2014):
n
The performance of the forecasting algorithms intro- 1∑
ρ̄ (τ ) = ρτ (yi − ŷi,τ ), (16)
duced in Section 5 was evaluated using to the following n
i=1
standard performance indicators:
n where the function ρτ (x) is defined as
1 ∑
τ |x| x ≥ 0,
{
RMSE = √ (ŷi − yi )2 (9) if
n ρτ (x) = (17)
i=1 (1 − τ ) |x| if x < 0,
n
1 ∑⏐ ⏐ This KPI measures how narrow the predicted probability
MAE = ⏐ŷi − yi ⏐ (10)
n density function is around the observations. It can be
i=1
shown that this loss is minimized, independently of the
n ⏐ ⏐
100% ∑ ⏐ yi − ŷi ⏐ underlying distribution which generated the data, when
MAPE = ⏐ ⏐ (11)
n ⏐ yi ⏐ the predicted quantiles are the true ones. It should be
i=1
noted that for τ = 0.5, the corresponding value ρ̄ (τ ) is
RMSE
S=1− (12) half the value of the MAE statistic. To further evaluate the
RMSEpers performance of the quantiles as a single score, we inte-
n
100% ∑ { } grate ρ̄ (τ ) over the [0, 1] interval, as outlined in Gneiting
A= 1 C (ŷi ) = C (yi ) (13) and Raftery (2007). Thus we define
n
i=1 ∫ 1
where RMSE is the root mean squared error, MAE is Q-score = ρ̄ (τ ) dτ . (18)
the mean absolute error, MAPE is the mean absolute 0
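As a concrete illustration, the deterministic KPIs of Eqs. (9)-(13) and the probabilistic KPIs of Eqs. (15)-(17) can be written in a few lines of Python. This is our own sketch rather than the code used in the study, and all function names are ours:

```python
import math

# Severity-class thresholds of Eq. (14) (Cercl'Air, in µg/m3).
BOUNDS = [60, 120, 135, 180, 240]

def severity_class(y):
    """Map an ozone value to its severity class 1..6."""
    for k, bound in enumerate(BOUNDS, start=1):
        if y <= bound:
            return k
    return 6

def rmse(y, yhat):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(p - t) for t, p in zip(y, yhat)) / len(y)

def mape(y, yhat):
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y, yhat)) / len(y)

def skill(y, yhat, y_pers):
    # Eq. (12): improvement over the day-D persistence forecast y_pers.
    return 1.0 - rmse(y, yhat) / rmse(y, y_pers)

def accuracy(y, yhat):
    # Eq. (13): share of forecasts landing in the correct severity class.
    hits = sum(severity_class(p) == severity_class(t) for t, p in zip(y, yhat))
    return 100.0 * hits / len(y)

def reliability(y, yhat_tau):
    # Eq. (15): empirical coverage of the tau-quantile forecasts.
    return sum(t < q for t, q in zip(y, yhat_tau)) / len(y)

def pinball(y, yhat_tau, tau):
    # Eqs. (16)-(17): average quantile (pinball) loss at level tau.
    def rho(x):
        return tau * abs(x) if x >= 0 else (1.0 - tau) * abs(x)
    return sum(rho(t - q) for t, q in zip(y, yhat_tau)) / len(y)
```

For τ = 0.5 the pinball loss reduces to half the MAE, as noted above, and the Q-score of Eq. (18) follows by integrating the pinball loss over a grid of τ values, for instance with a trapezoidal rule.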
977
D. Marvin, L. Nespoli, D. Strepparava et al. International Journal of Forecasting 38 (2022) 970–987
7. Results
Fig. 5. Nemenyi test for comparing the RMSE performance of the four feature selection methods across all locations.
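For reference, the procedure behind Fig. 5 ranks the four feature selection methods by RMSE at each location and declares two methods significantly different when their average ranks differ by more than the critical distance CD = q_α √(k(k+1)/(6N)), where k is the number of methods and N the number of locations. A minimal sketch (our own illustration, not the authors' code; ties are ignored for brevity, and q_α = 2.569 is the tabulated value for k = 4 methods at α = 0.05):

```python
import math

def average_ranks(rmse_by_location):
    """rmse_by_location: one dict {method: RMSE} per location.
    Lower RMSE gets the better (smaller) rank; ties are ignored."""
    methods = list(rmse_by_location[0])
    totals = dict.fromkeys(methods, 0.0)
    for scores in rmse_by_location:
        for rank, m in enumerate(sorted(methods, key=scores.get), start=1):
            totals[m] += rank
    return {m: totals[m] / len(rmse_by_location) for m in methods}

def nemenyi_critical_distance(k, n_locations, q_alpha=2.569):
    """Two methods differ significantly if their average ranks
    differ by more than this distance."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_locations))
```

With the 14 location/case pairs of this study and k = 4, the critical distance evaluates to roughly 1.25 average-rank units.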
Table 5
Comparison between different feature selections for the best, average, and worst stations.

            BIO MOR                      CHI MOR                      MEN EVE
            SHAP   GA     All    Rand    SHAP   GA     All    Rand    SHAP   GA     All    Rand
LSBoost
  MAE       13.46  14.17  14.41  15.58   14.50  15.18  15.27  19.02   16.69  17.24  16.90  19.84
  RMSE      17.68  18.38  18.98  20.35   19.36  21.25  20.03  24.21   21.59  22.61  21.70  24.92
  MAPE      10.34  11.06  11.29  12.32   10.65  11.05  11.33  14.42   12.75  13.57  13.52  15.59
  S         0.430  0.408  0.388  0.344   0.419  0.362  0.399  0.273   0.333  0.302  0.330  0.231
  Accuracy  71.43  68.71  71.43  69.39   71.43  71.43  70.75  60.54   65.07  60.96  59.59  59.59
XGBoost
  MAE       14.05  14.32  14.63  15.55   15.84  15.66  17.34  21.13   16.78  17.69  17.54  21.43
  RMSE      18.58  19.01  19.44  21.09   20.87  21.26  21.58  26.27   22.23  22.80  22.27  26.94
  MAPE      10.91  11.17  11.67  12.24   11.77  11.54  12.95  16.56   13.05  13.66  14.01  16.88
  S         0.401  0.387  0.373  0.320   0.374  0.362  0.352  0.212   0.314  0.296  0.313  0.169
  Accuracy  68.71  72.79  65.99  70.07   70.07  70.07  69.39  57.14   65.75  62.33  60.27  56.85
NGBoost
  MAE       13.85  14.38  15.02  15.64   15.04  15.88  15.85  18.72   16.12  17.05  17.54  20.14
  RMSE      18.50  18.70  20.04  20.60   19.90  21.64  21.11  23.58   21.32  22.79  22.76  25.46
  MAPE      10.73  11.13  11.96  12.36   11.06  11.85  11.98  14.26   12.54  13.29  14.12  16.04
  S         0.404  0.397  0.354  0.336   0.403  0.351  0.367  0.292   0.342  0.297  0.298  0.214
  Accuracy  68.71  68.71  68.71  68.03   69.39  70.07  67.35  59.86   66.44  60.27  58.90  57.53
Table 7
Main results of the study with Shapley values feature selection. Values in boldface indicate the lowest RMSE for the corresponding location.

            BIO            CHI            MEN            LOC            BRI            SAG            TES
            MOR    EVE     MOR    EVE     MOR    EVE     MOR    EVE     MOR    EVE     MOR    EVE     MOR    EVE
RF
  MAE       13.81  15.28   15.20  15.82   15.70  16.10   14.08  14.43   14.52  14.71   14.88  15.37   13.64  15.03
  RMSE      18.07  19.82   20.21  21.66   20.54  21.30   18.33  19.01   19.03  20.20   19.48  20.62   19.41  21.63
  MAPE      10.74  11.83   11.13  11.69   12.35  12.65   11.78  12.36   12.49  12.80   11.93  12.25   12.08  13.82
  S         0.418  0.364   0.394  0.352   0.364  0.342   0.394  0.374   0.404  0.370   0.295  0.256   0.201  0.114
  Accuracy  69.39  67.81   66.67  65.75   65.99  62.33   68.03  62.33   66.67  67.12   61.76  62.22   66.33  65.98
  Q-score   4.987  5.396   5.452  5.765   5.590  5.749   5.133  5.239   5.208  5.364   5.259  5.662   4.494  4.961
LSBoost
  MAE       13.46  13.96   14.50  15.56   14.66  16.69   14.20  14.30   14.25  14.50   15.08  14.59   13.37  14.86
  RMSE      17.68  18.18   19.36  21.03   19.50  21.59   18.58  19.03   18.78  20.53   19.39  19.73   18.89  20.73
  MAPE      10.34  10.78   10.65  11.47   11.54  12.75   11.71  11.98   12.03  12.53   12.09  11.53   11.71  13.44
  S         0.430  0.416   0.419  0.371   0.396  0.333   0.386  0.373   0.412  0.360   0.298  0.288   0.222  0.151
  Accuracy  71.43  67.81   71.43  67.12   66.67  65.07   67.35  69.18   68.03  69.18   61.76  61.48   70.41  64.95
XGBoost
  MAE       14.05  15.10   15.84  16.80   16.01  16.78   14.81  15.84   15.57  15.18   15.92  16.96   13.73  15.31
  RMSE      18.58  19.58   20.87  22.04   21.45  22.23   18.85  22.21   19.58  21.31   20.25  22.44   19.33  21.27
  MAPE      10.91  11.73   11.77  12.30   12.35  13.05   12.17  13.83   13.17  13.19   12.66  13.73   11.92  13.77
  S         0.401  0.371   0.374  0.341   0.336  0.314   0.377  0.268   0.387  0.335   0.266  0.190   0.204  0.129
  Accuracy  68.71  73.29   70.07  67.12   65.31  65.75   66.67  70.55   68.03  65.07   61.76  56.30   65.31  64.95
NGBoost
  MAE       13.85  14.83   15.04  16.23   14.84  16.12   14.96  14.02   15.18  15.09   19.57  19.77   19.34  21.76
  RMSE      18.50  19.56   19.90  21.27   19.62  21.32   18.92  18.84   19.78  20.37   25.36  26.44   26.12  28.92
  MAPE      10.73  11.32   11.06  11.94   11.68  12.54   12.42  12.06   12.83  13.08   16.93  16.86   18.37  20.90
  S         0.404  0.372   0.403  0.364   0.393  0.342   0.375  0.379   0.381  0.365   0.082  0.046   −0.07  −0.18
  Accuracy  68.71  69.18   69.39  65.75   65.99  66.44   61.90  67.12   65.31  69.18   47.79  50.37   59.18  52.58
  Q-score   5.947  6.298   6.298  7.485   6.429  7.671   6.453  6.333   6.241  5.979   12.14  12.50   25.94  27.36
LM
  MAE       15.27  15.64   17.46  17.49   16.55  26.42   16.12  27.26   15.32  73.43   15.14  14.67   13.01  171.7
  RMSE      21.67  20.32   23.03  23.37   21.74  109.9   20.84  144.2   20.41  693.6   19.78  19.99   19.03  945.3
  MAPE      12.01  12.32   13.38  13.62   12.92  21.12   13.67  22.76   12.95  62.56   12.22  11.60   11.25  154.0
  S         0.301  0.347   0.309  0.301   0.327  −2.39   0.311  −3.75   0.361  −20.6   0.284  0.279   0.217  −37.7
  Accuracy  68.71  65.75   66.67  61.64   65.31  61.64   65.31  65.07   68.71  66.44   63.24  64.44   67.35  17.53
LASSO
  MAE       15.76  15.91   17.54  17.68   17.94  17.56   15.33  16.26   15.38  18.11   15.59  14.87   12.95  14.18
  RMSE      21.13  20.43   22.99  23.58   23.22  22.89   20.18  21.07   20.73  23.82   20.11  20.32   18.83  20.64
  MAPE      12.30  12.40   13.31  13.45   14.12  14.00   12.99  13.87   13.27  15.79   12.52  11.67   11.35  12.92
  S         0.319  0.344   0.310  0.295   0.281  0.293   0.333  0.306   0.351  0.257   0.272  0.267   0.225  0.155
  Accuracy  65.99  64.38   63.27  61.64   58.50  59.59   65.31  63.70   68.03  62.33   58.82  63.70   65.31  64.95
Ridge
  MAE       15.23  15.63   17.46  17.47   16.51  26.42   16.12  27.26   15.16  71.56   15.13  14.67   12.97  168.5
  RMSE      21.61  20.32   23.03  23.32   21.70  109.9   20.84  144.2   20.23  670.5   19.77  19.99   19.00  914.6
  MAPE      11.97  12.31   13.38  13.59   12.88  21.12   13.67  22.76   12.83  60.96   12.22  11.61   11.21  151.1
  S         0.304  0.347   0.309  0.302   0.328  −2.39   0.311  −3.75   0.367  −19.9   0.284  0.279   0.218  −36.4
  Accuracy  68.71  66.44   66.67  61.64   65.31  61.64   65.31  65.07   69.39  65.75   64.71  64.44   67.35  16.49
ARIMAX
  MAE       17.57  16.87   17.12  17.72   16.57  18.05   14.8   14.74   16.3   14.94   16.25  14.38   12.89  13.32
  RMSE      23.34  22.28   23.21  22.77   21.95  23.42   20.39  19.79   22.76  19.73   21.35  19.07   19.17  20.88
  MAPE      13.42  13.05   12.92  13.26   12.44  13.94   12.33  12.43   13.41  12.81   13.09  11.44   10.89  11.6
  S         0.25   0.28    0.3    0.32    0.32   0.28    0.32   0.35    0.29   0.38    0.21   0.29    0.71   0.69
  Accuracy  60.54  63.7    63.95  61.38   63.95  57.24   63.95  68.49   65.31  69.18   56.46  63.7    53.06  58.9
Figs. 11, 12, and 13 graphically illustrate the results for the best (Bioggio MOR), average (Chiasso MOR), and worst (Mendrisio EVE) cases. Each figure is composed of four plots. The top one shows a comparison between the main forecasting algorithms and the measured values. The second plot shows the prediction intervals at levels 20%, 40%, 60%, and 80% issued by the RF algorithm, as well as the RF prediction. The third and fourth plots further investigate the goodness of fit of the quantiles of the RF and NGBoost algorithms.

7.4. High peaks prediction

The analysis presented so far focused on the dataset in its entirety, aiming to provide the best KPIs over all the data, independently of the air pollution severity class. However, when predicting ozone concentrations, it is generally more important to be able to correctly predict high concentrations, as they can pose a health risk. For this reason, we decided to prompt the predictor to focus more on high concentrations, in our case classes 4, 5, and 6 as defined in Eq. (14), by introducing weighted training. We assigned different weights to the observations depending on the severity class they were in. We found that it was beneficial to assign a weight w1 ∈ [20, 200] to observations in class 6, w2 ∈ [20, 200] to observations in class 5, w3 ∈ [10, 20] to observations in class 4, and a fixed weight of 1 to all the other observations. The key idea is to find the optimal set of weights w1, w2, w3 for each location and case with improved forecasting quality at high concentrations. The same set of weights is used during feature selection with SHAP and NGBoost and to train the prediction algorithm with the selected features. Applying weights during feature selection should help select the most important features to recognize the highest ozone concentrations.
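The class-dependent weighting just described can be sketched as follows. This is our own illustration: the actual pipeline feeds these weights to the SHAP-based feature selection and to NGBoost, both omitted here, and `sample_weights` and `weight_grid` are names we introduce:

```python
# Class-dependent observation weights: w1 for class 6, w2 for class 5,
# w3 for class 4, and a fixed weight of 1 for all other classes (Eq. (14)).

def severity_class(y, bounds=(60, 120, 135, 180, 240)):
    for k, bound in enumerate(bounds, start=1):
        if y <= bound:
            return k
    return 6

def sample_weights(y_train, w1, w2, w3):
    per_class = {6: w1, 5: w2, 4: w3}
    return [per_class.get(severity_class(y), 1.0) for y in y_train]

def weight_grid():
    # The grid searched in the study: w1, w2 in {20, 40, ..., 200}, w3 in {10, 20}.
    for w1 in range(20, 201, 20):
        for w2 in range(20, 201, 20):
            for w3 in (10, 20):
                yield w1, w2, w3
```

Each candidate triple would be passed as per-sample weights when fitting the regressor and scored by the class accuracy on the target subset of observations.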
Fig. 11. Result plots of the station yielding the best results, Bioggio MOR.
Fig. 12. Result plots of a station yielding average results, Chiasso MOR.
Fig. 13. Result plots of the station yielding the worst results, Mendrisio EVE.
We sought to increase the classification accuracy of three particular subsets of our data using weighted training, by evaluating the best combination of weights to optimize prediction accuracy for the following:

• Values in classes 4, 5, and 6 (O3 > 135 µg/m3)
• Values in classes 5 and 6 (O3 > 180 µg/m3)
• Values in class 6 (O3 > 240 µg/m3)

We analyzed the four stations with the highest number of extreme measurements in 2019: Bioggio, Mendrisio, Locarno, and Chiasso. All these stations registered at least one event of class 6 and many events of class 5, as shown in Table 8. In particular, Chiasso registered 27 measurements above 180 µg/m3, of which four were above 240 µg/m3.

Table 8
Number of particular events registered in the analyzed stations in 2019.

Station     Values above 135 µg/m3   Values above 180 µg/m3   Values above 240 µg/m3
            (cl. 4, 5, 6)            (cl. 5, 6)               (cl. 6)
Chiasso     46                       23                       4
Bioggio     47                       17                       1
Mendrisio   70                       22                       2
Locarno     41                       9                        1

In contrast to what we observed with unweighted training, we noticed that when using weighted training, increasing the number of selected features above 30 improved the prediction accuracy at high ozone concentrations. Therefore, we increased to 100 the number of features that the algorithm can use to perform its prediction. We used SHAP as the feature selection method and NGBoost as the regressor. We calculated the KPIs of the models for each combination of the weights w1, w2 ∈ {20, 40, 60, ..., 200} and w3 ∈ {10, 20}.

Fig. 14 shows the aggregated accuracy of the prediction for the three different classes of interest. Figs. 14(a) and 14(b) illustrate the distribution of the accuracy when considering only the ozone values measured above 135 µg/m3. Similarly, Figs. 14(c), 14(d) and 14(e), 14(f) show the accuracy when restricting ourselves to values above 180 and 240 µg/m3, respectively. The continuous iso-lines are obtained with cubic interpolation. It is difficult to infer which weights give the best results when trying to maximize the accuracy of observations above 135 µg/m3; for the other two cases, a high w2 and a variable w1 give the best results for observations above 180 µg/m3, whereas a high w1 and a low w2 appear to be the best combination of weights for correctly predicting observations above 240 µg/m3.

Fig. 14. Plot of the results of the weights grid search with 100 selected features, aggregated across the stations of Chiasso, Bioggio, Mendrisio, and Locarno.

Table 9 shows the results of weighted training for the considered stations. We report the KPIs and the three sets of weights that gave the best accuracy for each fraction of the dataset, compared to the results obtained when no weights are applied. We can see that the introduction of weights does not unduly affect the KPIs, and in fact improves them in some cases. For Chiasso MOR, we also show in Fig. 15 the complete confusion matrices obtained. It can be seen that when actively trying to enhance the recognition of observations above 135 µg/m3, the correct classification of these values increased from about 50%–75% to about 80%–85%. Similarly, for observations above 180 µg/m3 the correct recognition rate increases from 50%–70% to 80%–90% in all stations but Locarno, where it stops at 70%. Finally, in Mendrisio and Chiasso, we could correctly predict all values above 240 µg/m3, which was not achieved in the unweighted analysis. This is not the case for Bioggio and Locarno, where the only class 6 value is never recognized.

Table 9
Results of the weighted analysis.

                              w1   w2   w3    cl. [4,5,6]  cl. [5,6]  cl. 6    MAE    RMSE   Tot. Acc.
BIO MOR
  No weights                  –    –    –     72.72        61.11      0.00     13.28  17.66  70.95
  Max. accuracy cl. [4,5,6]   140  80   20    85.07        77.78      0.00     13.03  17.52  73.68
  Max. accuracy cl. [5,6]     120  200  10    79.10        88.89      0.00     13.61  18.25  71.05
  Max. accuracy cl. 6         180  100  10    77.61        72.22      0.00     12.62  16.18  70.39
BIO EVE
  No weights                  –    –    –     69.69        61.11      0.00     14.22  18.50  68.71
  Max. accuracy cl. [4,5,6]   40   100  20    84.84        77.78      0.00     14.59  18.47  73.51
  Max. accuracy cl. [5,6]     60   180  10    83.33        83.33      0.00     15.08  19.27  69.53
  Max. accuracy cl. 6         80   40   10    83.33        72.22      0.00     14.24  18.31  73.50
CHI MOR
  No weights                  –    –    –     72.60        55.56      0.00     14.61  18.79  71.52
  Max. accuracy cl. [4,5,6]   180  20   10    80.82        74.07      75.00    13.75  19.54  75.50
  Max. accuracy cl. [5,6]     140  200  20    79.45        81.48      50.00    15.51  21.33  70.86
  Max. accuracy cl. 6         200  40   10    75.34        70.37      100.00   13.53  19.14  71.52
CHI EVE
  No weights                  –    –    –     65.75        48.15      0.00     15.28  20.43  67.35
  Max. accuracy cl. [4,5,6]   80   20   20    82.19        74.07      75.00    14.90  19.96  71.52
  Max. accuracy cl. [5,6]     200  200  10    75.34        85.19      75.00    15.26  20.13  73.51
  Max. accuracy cl. 6         180  20   20    75.34        70.37      75.00    15.26  20.09  67.55
MEN MOR
  No weights                  –    –    –     77.14        68.18      50.00    14.69  19.49  71.62
  Max. accuracy cl. [4,5,6]   120  40   20    84.29        81.82      100.00   14.17  19.02  71.43
  Max. accuracy cl. [5,6]     180  60   10    80.00        86.36      100.00   14.56  19.40  67.35
  Max. accuracy cl. 6         200  40   10    77.14        81.82      100.00   14.34  19.23  69.39
MEN EVE
  No weights                  –    –    –     74.28        72.72      50.00    15.29  19.38  68.02
  Max. accuracy cl. [4,5,6]   120  120  20    82.86        81.82      50.00    16.13  21.13  67.81
  Max. accuracy cl. [5,6]     200  60   10    74.29        90.91      100.00   16.53  21.06  64.38
  Max. accuracy cl. 6         180  40   10    81.43        81.82      100.00   15.54  19.75  67.81
LOC MOR
  No weights                  –    –    –     54.90        40.00      0.00     14.12  18.45  64.86
  Max. accuracy cl. [4,5,6]   180  100  20    80.39        60.00      0.00     14.43  18.52  72.11
  Max. accuracy cl. [5,6]     40   140  10    72.55        70.00      0.00     15.06  19.65  68.03
  Max. accuracy cl. 6         60   180  10    70.59        70.00      0.00     13.92  17.97  70.07
LOC EVE
  No weights                  –    –    –     52.94        40.00      0.00     14.09  18.84  65.99
  Max. accuracy cl. [4,5,6]   40   80   20    80.39        50.00      0.00     14.37  18.61  69.18
  Max. accuracy cl. [5,6]     60   200  10    76.47        70.00      0.00     14.71  19.83  67.12
  Max. accuracy cl. 6         100  40   10    66.67        50.00      0.00     13.93  17.75  67.81

8. Conclusions

In this study, we forecasted the day-ahead maximum ground-level ozone concentration during the summer of 2019 in seven localities in southern Switzerland, using a physics-agnostic, data-driven approach. Due to the high number of signals potentially affecting the predictions, we performed preliminary feature selection using two methods, which we compared. The selected features were then used to train different state-of-the-art forecasting algorithms. Analyzing feature importance and interactions using Shapley values suggested that the models trained through our learning pipeline effectively learned explanatory cross-dependencies among atmospheric variables described in the ozone photochemistry literature.

Our analysis showed that gradient boosting algorithms, and in particular least-squares boosting and natural gradient boosting, consistently outperformed the other tested forecasting methods. Where possible, we further compared our results with those of other papers, and we were able to conclude that our results are similar to previous analyses and, in some cases, even better.

We then evaluated the effect of weighted training to increase the accuracy of predictions for high ozone concentrations. Our analysis showed that this method is
feasible, as it increases forecast accuracy without compromising overall forecast quality. Future directions for this work include the formulation of probabilistic techniques for robust estimation of annual ozone concentration peaks, which are the most difficult events to predict due to their scarcity in the training set. In this view, training forecasters with ad-hoc-generated adversarial examples could result in a better forecast of the conditional probability distributions.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was funded by the Environment Observatory of Southern Switzerland (OASI, www.ti.ch/oasi) of the Department of Territory of Canton Ticino (DT).

References

Al Abri, E. S., Edirisinghe, E. A., & Nawadha, A. (2015). Modelling ground-level ozone concentration using ensemble learning algorithms. In International conference on data mining (DMIN) (No. x) (pp. 148–154). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing.

Bentzien, S., & Friederichs, P. (2014). Decomposition and graphical portrayal of the quantile score. Quarterly Journal of the Royal Meteorological Society, 140(683), 1924–1934. http://dx.doi.org/10.1002/qj.2284.

Calvert, J. G., Orlando, J. J., Stockwell, W. R., & Wallington, T. J. (2015). The mechanisms of reactions influencing atmospheric ozone. Oxford University Press.

Cameletti, M., Lindgren, F., Simpson, D., & Rue, H. (2013). Spatio-temporal modeling of particulate matter concentration through the SPDE approach. AStA Advances in Statistical Analysis, 97(2), 109–131. http://dx.doi.org/10.1007/s10182-012-0196-3.

Carrillo, R. E., Leblanc, M., Schubnel, B., Langou, R., Topfel, C., & Alet, P.-J. (2020). High-resolution PV forecasting from imperfect data: A graph-based solution. Energies, 13(21), 5763. http://dx.doi.org/10.3390/en13215763.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 785–794). http://dx.doi.org/10.1145/2939672.2939785.

Crutzen, P. J., Lawrence, M. G., & Pöschl, U. (1998). On the background photochemistry of tropospheric ozone. Tellus, Series B (Chemical and Physical Meteorology), 51(1). http://dx.doi.org/10.3402/tellusb.v51i1.16264.

d-maps-1 (2020a). Map of Europe. d-maps.com, https://d-maps.com/carte.php?num_car=2232&lang=en. (Accessed 24 April 2020).

d-maps-2 (2020b). Map of Canton Ticino. d-maps.com, https://d-maps.com/carte.php?num_car=10350&lang=en. (Accessed 24 April 2020).

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1), 1–30.
Duan, T., Avati, A., Ding, D. Y., Thai, K. K., Basu, S., Ng, A. Y., et al. (2019). NGBoost: Natural gradient boosting for probabilistic prediction. http://arxiv.org/abs/1910.03225.

Eslami, E., Choi, Y., Lops, Y., & Sayeed, A. (2019). A real-time hourly ozone prediction system using deep convolutional neural network. Neural Computing and Applications. https://doi.org/10.1007/s00521-019-04282-x.

Feng, Y., Zhang, W., Sun, D., & Zhang, L. (2011). Ozone concentration forecast method based on genetic algorithm optimized back propagation neural networks and support vector machine data classification. Atmospheric Environment, 45(11), 1979–1985. http://dx.doi.org/10.1016/j.atmosenv.2011.01.022.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. http://dx.doi.org/10.1198/016214506000001437.

Gong, B., & Ordieres-Meré, J. (2016). Prediction of daily maximum ozone threshold exceedances by preprocessing and ensemble artificial intelligence techniques. Environmental Modelling & Software, 84, 290–303. http://dx.doi.org/10.1016/j.envsoft.2016.06.020.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). Springer. http://dx.doi.org/10.1007/b94608.

Hollander, M., & Wolfe, D. (1999). Nonparametric statistical methods (2nd ed.). Wiley Series in Probability and Mathematical Statistics.

Kourentzes, N., Svetunkov, I., & Schaer, O. (2020). https://github.com/trnnick/tsutils/.

Kupilik, M., & Witmer, F. (2018). Spatio-temporal violent event prediction using Gaussian process regression. Journal of Computational Social Science, 1(2), 437–451. http://dx.doi.org/10.1007/s42001-018-0024-y.

Lu, X., Zhang, L., & Shen, L. (2019). Meteorology and climate influences on tropospheric ozone: a review of natural sources, chemistry, and transport patterns. Current Pollution Reports, 5(4), 238–260. http://dx.doi.org/10.1007/s40726-019-00118-3.

Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., et al. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence. http://dx.doi.org/10.1038/s42256-019-0138-9.

Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems.

Lv, B., Cobourn, W. G., & Bai, Y. (2016). Development of nonlinear empirical models to forecast daily PM2.5 and ozone levels in three large Chinese cities. Atmospheric Environment, 147, 209–223. http://dx.doi.org/10.1016/j.atmosenv.2016.10.003.

Meinshausen, N. (2014). Quantile regression forests. Journal of Machine Learning Research, 131, 65–78. http://dx.doi.org/10.1016/j.jmva.2014.06.005.

Mohan, S., & Saranya, P. (2019). A novel bagging ensemble approach for predicting summertime ground-level ozone concentration. Journal of the Air and Waste Management Association, 69(2), 220–233. http://dx.doi.org/10.1080/10962247.2018.1534701.

Monks, P. S., Archibald, A. T., Colette, A., Cooper, O., Coyle, M., Derwent, R., et al. (2015). Tropospheric ozone and its precursors from the urban to the global scale from air quality to short-lived climate forcer. Atmospheric Chemistry and Physics, 15(15), 8889–8973. http://dx.doi.org/10.5194/acp-15-8889-2015.

Pinson, P., McSharry, P., & Madsen, H. (2010). Reliability diagrams for non-parametric density forecasts of continuous variables: Accounting for serial correlation. Quarterly Journal of the Royal Meteorological Society, 136(646), 77–90. http://dx.doi.org/10.1002/qj.559.

Pusede, S. E., Gentner, D. R., Wooldridge, P. J., Browne, E. C., Rollins, A. W., Min, K. E., et al. (2014). On the temperature dependence of organic reactivity, nitrogen oxides, ozone production, and the impact of emission controls in San Joaquin Valley, California. Atmospheric Chemistry and Physics, 14(7), 3373–3395. http://dx.doi.org/10.5194/acp-14-3373-2014.

Pusede, S. E., Steiner, A. L., & Cohen, R. C. (2015). Temperature and recent trends in the chemistry of continental surface ozone. Chemical Reviews, 115(10), 3898–3918. http://dx.doi.org/10.1021/cr5006815.

Sheta, A., Faris, H., Rodan, A., Kovač-Andrić, E., & Al-Zoubi, A. M. (2018). Cycle reservoir with regular jumps for forecasting ozone concentrations: Two real cases from the east of Croatia. Air Quality, Atmosphere and Health, 11(5), 559–569. http://dx.doi.org/10.1007/s11869-018-0561-9.

Siwek, K., & Osowski, S. (2016). Data mining methods for prediction of air pollution. International Journal of Applied Mathematics and Computer Science, 26(2), 467–478. http://dx.doi.org/10.1515/amcs-2016-0033.

Stewart, D. R., Saunders, E., Perea, R. A., Fitzgerald, R., Campbell, D. E., & Stockwell, W. R. (2017). Linking air quality and human health effects models: An application to the Los Angeles air basin. Environmental Health Insights, 11. http://dx.doi.org/10.1177/1178630217737551.

Swiss Society of Air Protection Officers (2019). Indice de pollution de l'air à court terme IPC [Short-term air pollution index] (in French). URL: https://cerclair.ch/assets/pdf/27a_2019_08_28_F_Indice_de_pollution_de_lair_court_terme.pdf.

The Swiss Federal Council (1985). Ordinance on Air Pollution Control (OAPC). URL: https://www.admin.ch/opc/en/classified-compilation/19850321/index.html.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267–288.

Walcek, C. J., & Yuan, H. H. (1995). Calculated influence of temperature-related factors on ozone formation rates in the lower troposphere. Journal of Applied Meteorology. http://dx.doi.org/10.1175/1520-0450(1995)034<1056:CIOTRF>2.0.CO;2.

World Health Organization (2003). Health aspects of air pollution with particulate matter, ozone and nitrogen dioxide: Report on a WHO working group (p. 95). Bonn, Germany: WHO Regional Office for Europe. http://www.euro.who.int/_data/assets/pdf_file/0005/112199/E79097.pdf.