A Data Driven Interpretable Ensemble Framework Based On Tree Models For Forecasting The Occurrence of COVID 19 in The USA

This research article presents a data-driven ensemble framework that uses tree-based machine learning models to forecast daily new COVID-19 cases in the USA and to identify key influencing factors. The study developed three base learners (Random Forest, XGBoost, and LightGBM) and integrated them through linear ensemble methods. LightGBM was the most accurate base learner, and the least absolute deviation (LAD) ensemble was the most accurate model overall. Key findings indicate that vaccination, mask-wearing, reduced mobility, and government interventions positively impact COVID-19 control and prevention efforts.


Environmental Science and Pollution Research (2023) 30:13648–13659

https://doi.org/10.1007/s11356-022-23132-3

RESEARCH ARTICLE

A data-driven interpretable ensemble framework based on tree models for forecasting the occurrence of COVID-19 in the USA

Hu-Li Zheng¹ · Shu-Yi An² · Bao-Jun Qiao² · Peng Guan¹ · De-Sheng Huang³ · Wei Wu¹

Received: 12 May 2022 / Accepted: 16 September 2022 / Published online: 22 September 2022
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022

Abstract
The prevalence of coronavirus disease 2019 (COVID-19) has become one of the most serious public health crises. Tree-based machine learning methods, with the advantages of high efficiency and strong interpretability, have been widely used in predicting diseases. A data-driven interpretable ensemble framework based on tree models was designed to forecast daily new cases of COVID-19 in the USA and to determine the important factors related to COVID-19. Based on a hyperparameter optimization technique, we developed three machine learning algorithms based on decision trees, namely random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), and used three linear ensemble models to integrate their outcomes for better prediction accuracy. Finally, the SHapley Additive explanation (SHAP) value was used to obtain the feature importance ranking. Our outcomes demonstrated that, among the three base learners, prediction accuracy in descending order was LightGBM, XGBoost, and RF. The optimized least absolute deviation (LAD) ensemble was the most precise prediction model, reducing the prediction error (MAE) of the best base learner (LightGBM) by approximately 3.111%, while vaccination, wearing masks, less mobility, and government interventions had positive effects on the control and prevention of COVID-19.

Keywords COVID-19 · Ensemble · Hyperopt · XGBoost · LightGBM · RF

Responsible Editor: Marcus Schulz

* Wei Wu [email protected]
Hu-Li Zheng [email protected]
Shu-Yi An [email protected]
Bao-Jun Qiao [email protected]
Peng Guan [email protected]
De-Sheng Huang [email protected]

1 Department of Epidemiology, School of Public Health, China Medical University, No. 77 Puhe Road, Shenyang, Liaoning Province, China
2 Liaoning Provincial Center for Disease Control and Prevention, Shenyang, Liaoning, China
3 Department of Mathematics, School of Intelligent Medicine, China Medical University, Shenyang, Liaoning, China

Introduction

The novel virus that can cause severe acute respiratory disease (COVID-19) has become one of the most serious public health crises, and the current COVID-19 outbreak remains a global pandemic. Globally, as of 5:40 p.m. Central European Time, November 2, 2021, there have been 240 million cumulative confirmed cases of COVID-19, including 5 million deaths, according to the World Health Organization (https://covid19.who.int/). This new epidemic has attracted global attention. The common symptoms of COVID-19 are coughing, fever, fatigue, anorexia, headache, rhinorrhea, and myalgia, and SARS-CoV-2 infection is believed to be transmitted through aerosols or droplets (Guan et al. 2020; Mao et al. 2020). Since the epidemic was proclaimed a pandemic, various precautions have been taken to control its spread, covering disease prevention and control on public transportation, distancing policies, population-wide movement control, wearing of masks, and vaccinations (Ng et al. 2020; Shen et al. 2020). Due to the measures undertaken, the daily confirmed cases in China
have decreased drastically, but globally, the virus has not yet been completely stopped owing to its high infectious power and strong pathogenicity. Since the outbreak, the USA has always been one of the areas most affected by the epidemic. Therefore, we decided to predict the trend of the epidemic in the USA and to offer relevant prevention suggestions.

Various models have been used to predict COVID-19 transmission. Based on the traditional susceptible–infected–recovered compartment model, many novel dynamical models (Abbasi et al. 2020; Campillo-Funollet et al. 2021; Sun et al. 2020) have been proposed that consider many factors, such as deaths, daily admissions, discharges, and quarantine. Time series models have also been used to predict COVID-19, such as the autoregressive integrated moving average method (Ceylan 2020) and its variants (Ahmar and Del Val 2020; ArunKumar et al. 2021). However, the above methods cannot consider a broad range of factors affecting the development of the epidemic and cannot address the changing environment.

More importantly, with the continuous upgrading of computer information and software technology, artificial intelligence is becoming widely used in medical systems to detect diseases and make clinical diagnoses. Machine learning, including deep learning, is considered an indispensable part of artificial intelligence (Yu et al. 2021) and has also been widely used in predicting COVID-19. A spatial–temporal analysis framework was developed by combining random forest (RF) regression and a multiobjective optimization algorithm to predict the daily cases and death rate in Asia (Pan et al. 2021). The prediction effects of support vector regression and stacking-ensemble learning were better than those of comparison models in Brazil (Ribeiro et al. 2020). The Prophet algorithm was also considered to have reliable prediction ability in South Korea (Asfahan et al. 2020). Recently, special attention has been paid to deep learning methods because of their excellent universality and superior nonlinear approximation in time series analysis. Dairi et al. (2021) confirmed that hybrid convolutional neural network–long short-term memory and hybrid gated recurrent unit–convolutional neural networks could efficiently predict COVID-19 cases. Another study also showed that the bidirectional long short-term memory method could be used for pandemic prediction and better planning and management (Shahid et al. 2020). However, it is worth noting that deep learning models are not very interpretable because of their black-box nature, and these related studies could not adequately analyze the correlations between the influencing factors and the disease.

Previous studies that only applied traditional epidemic models were subject to underfitting or overfitting problems and had poor generalization ability. Compared with other models, machine learning models have the advantages of excellent universality, superior nonlinear approximation, interpretability, and resistance to overfitting in time series analysis, and they can analyze a large number of features simultaneously. Therefore, we believe that machine learning models are the most suitable for predicting COVID-19 trends and analyzing influencing factors at the same time. RF has previously been used in research on COVID-19 and various other diseases (Sarica et al. 2017; Wu et al. 2021; Yang et al. 2020). The eXtreme Gradient Boosting (XGBoost) method was used to predict the mortality of COVID-19 in Wuhan, China (K. Wang et al. 2020). XGBoost has also been used for disease prediction and risk factor analysis in other diseases, such as smoking-induced noncommunicable disease (Davagdorj et al. 2020) and kidney disease (Chen et al. 2019). The Light Gradient Boosting Machine (LightGBM) model showed better discrimination ability than the traditional model in predicting the all-cause mortality of patients (Zheng et al. 2021a, b), but it has not been used in COVID-19 prediction. To date, none of these three methods has been used to forecast daily cases of COVID-19 in the USA, which is one of the innovations in our research.

Furthermore, because a single machine learner can be inferior to an ensemble model, which can reduce deviation and improve robustness (L. Wang et al. 2021; Ye et al. 2021), in this study we used three linear ensemble methods, namely simple averaging (SA), ordinary least squares (OLS), and least absolute deviation (LAD) ensembles, to integrate three tree-based models (RF, XGBoost, and LightGBM) to predict the prevalence of COVID-19 in the USA and to analyze its influencing factors for COVID-19 prevention and control.

Fig. 1 Framework of ensemble methods for forecasting COVID-19 occurrence. RF, random forest; XGBoost, eXtreme Gradient Boosting; LightGBM, Light Gradient Boosting Machine; SA, simple averaging; OLS, ordinary least squares; LAD, least absolute deviation; SHAP, SHapley Additive explanation

Methods

Framework of ensemble methods

To achieve the two objectives of prediction and prevention, three main steps were implemented: data preparation, prediction with single machine learning models, and ensemble methods, as shown in Fig. 1. First, all of our data come from public datasets. We tried our best to gather comprehensive data for the ensemble machine learning models. In addition to the number of daily new cases, the other data were divided into four categories, namely, self-protection, social policy indicators, community mobility, and time indices. Second, based on the hyperparameter
optimization technique (Hyperopt) to tune the parameters automatically, we developed three machine learning algorithms based on decision trees, namely RF, XGBoost, and LightGBM, to forecast daily new cases. Finally, three linear ensemble methods (SA, OLS, and LAD ensembles) were adopted to repredict the daily new cases by combining the results of the three basic models for better prediction accuracy and robustness. At the same time, for the sake of interpretability, we determined the impact of the included variables on the outcomes using SHapley Additive explanation (SHAP) values, to identify the important factors for COVID-19 transmission.

Data collection and preprocessing

The data were collected from the following four public data sources: (1) cases of COVID-19 and the number of vaccinations in the USA were obtained from the official website of the Centers for Disease Control and Prevention of the USA (https://covid.cdc.gov); (2) the usage rate of masks was collected from the Institute for Health Metrics and Evaluation (https://covid19.healthdata.org/united-states-of-america?view=mask-use&tab=trend), which could reflect people's awareness of self-protection; (3) social policy indicators were obtained via the Oxford Covid Government Response Tracker (Hale et al. 2021), which could quantify the extent of government responses; (4) the travel popularity data were obtained through Google Community Mobility Reports (https://www.google.com/covid19/mobility/), reflecting the movement trend of citizens over time through geographical location. In addition to the above information sources, we added day-of-week and holiday information, a time trend item, and lags as input variables. There were no missing data in this study.

In the early stage of the epidemic, the number of cases was small and unstable, so our study period ran from 1 April 2020 to 31 August 2021. We deleted some variables that did not change significantly during the study period and some repeated variables. Considering the seasonality, autocorrelation, and partial autocorrelation, we included the trend item and variables lagged by 1, 2, and 7 days as input features. Finally, we took the daily new cases as the outcome variable and included a total of 35 input features, as shown in Table 1, and preliminarily explored the correlation between the target and input features through Spearman's correlation, as shown in Fig. 2. Although some features had a statistically weak association with daily cases, they were still included to ensure the integrity of the feature set and to avoid ignoring features that are meaningful in reality. More importantly, retaining weakly correlated features hardly affected the prediction results of the three machine learners, because these decision-tree-based models could assign appropriate weights to each feature by self-learning from the training set data.

Data analysis for the three machine learning models (RF, XGBoost, and LightGBM) was conducted using Python, version 3.8.8. We used the sklearn.metrics, sklearn.model_selection, and matplotlib.pyplot modules, the shap, hyperopt, xgboost, and lightgbm packages, and RandomForestRegressor from scikit-learn. Data analysis for the three linear ensemble methods was conducted using R, version 4.0.5, with the ForecastComb, forecast, ggplot2, graphics, and tseries packages. Methods were performed in accordance with relevant guidelines and regulations.

Random forest (RF)

Random forest (RF), based on bagging integration, is one of the most common and powerful supervised learning algorithms and can solve both regression and classification problems (Breiman 2001). Its technique is to create multiple samples from the same set of data, resample them through the bootstrap technique, and randomly select predictors to form each node of the decision tree. The randomness of time series can also be well handled by RF (Casiraghi et al. 2020). The random forest model can be described as Eq. (1), where P represents the number of decision trees and F_i(x) denotes the prediction of the i-th tree.

$$Y = \frac{1}{P}\sum_{i=1}^{P} F_i(x) \tag{1}$$

LightGBM and XGBoost models

Gradient boosting is a tree-based machine learning ensemble method (Kim et al. 2021) that can improve the accuracy and robustness of overall training and prediction by integrating multiple weak learners. In this study, we utilized two relatively advanced and fast gradient boosting algorithms: XGBoost and LightGBM. The most important feature of XGBoost is that it can automatically use CPU multithreading for parallel computation and includes algorithmic improvements that raise accuracy. The XGBoost algorithm can be summarized by Eq. (2), where l denotes the loss function, f_j denotes a weak learner, Ω denotes the regularization term (Nishio et al. 2018), and ŷ_i^{(j-1)} is the prediction accumulated over the previous j − 1 rounds.

$$L_j = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(j-1)} + f_j(x_i)\right) + \Omega(f_j) \tag{2}$$

LightGBM is a histogram-based decision tree algorithm that has two novel techniques to improve performance and reduce computing time: Gradient-based One-Side Sampling and Exclusive Feature Bundling (Ke et al. 2017). The first technique retains instances with large gradients and randomly omits only instances with small gradients, to retain the accuracy of the information gain estimation. The second makes it possible

Table 1  All input features

Category                  Feature code  Feature
Self-protection           x1            Mask (%)
                          x2            People receiving 1 or more doses (cumulative)
                          x3            People fully vaccinated (cumulative)
Social policy indicators  x4            School closing
                          x5            Workplace closing
                          x6            Cancel public events
                          x7            Restrictions on gatherings
                          x8            Close public transport
                          x9            Requirements to stay at home
                          x10           Restrictions on internal movement
                          x11           International travel controls
                          x12           Income support
                          x13           Debt/contract relief
                          x14           Public information campaigns
                          x15           Testing policy
                          x16           Contact tracing
                          x17           Protection of elderly people
                          x18           Government response index
                          x19           Containment health index
                          x20           Economic support index
Community mobility        x21           Retail and recreation percent change from baseline (%)
                          x22           Grocery and pharmacy percent change from baseline (%)
                          x23           Parks percent change from baseline (%)
                          x24           Transit stations percent change from baseline (%)
                          x25           Workplaces percent change from baseline (%)
                          x26           Residential percent change from baseline (%)
Time index                x27           Holiday
                          x28           Previous day is holiday
                          x29           Previous 2 days is holiday
                          x30           Previous 3 days is holiday
                          x31           Day of week
                          x32           Trend
                          x33           Lag1
                          x34           Lag2
                          x35           Lag7
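
The time-index rows at the bottom of Table 1 (Day of week, Trend, Lag1, Lag2, Lag7) could be derived from the daily case series roughly as follows. This is a minimal sketch, not the authors' preprocessing code: the function name `build_time_features`, the 0-based trend counter, the Monday=0 weekday encoding, and the decision to drop the first seven days (which lack a full set of lags) are all assumptions, and the case counts are synthetic.

```python
from datetime import date, timedelta


def build_time_features(dates, cases, lags=(1, 2, 7)):
    """Return one feature dict per day for which every requested lag exists."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(cases)):
        row = {
            "date": dates[t],
            "target": cases[t],                 # daily new cases to be predicted
            "trend": t,                         # assumed: 0-based day counter
            "day_of_week": dates[t].weekday(),  # assumed: Monday=0 ... Sunday=6
        }
        for k in lags:
            row[f"lag{k}"] = cases[t - k]       # cases observed k days earlier
        rows.append(row)
    return rows


# Synthetic data: ten days starting at the first day of the study period.
start = date(2020, 4, 1)
dates = [start + timedelta(days=i) for i in range(10)]
cases = [100 + 10 * i for i in range(10)]
rows = build_time_features(dates, cases)
```

Each dictionary in `rows` corresponds to one training example; the holiday indicators (x27 to x30) would need an external holiday calendar and are left out here.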

to design a nearly lossless method to reduce the number of features in sparse high-dimensional data (Yu 2019).

Fig. 2 Spearman correlation between daily new cases and input features

Hyperopt (a hyperparameter optimization technique)

Hyperopt is a distributed asynchronous hyperparameter optimization method (Bergstra et al. 2013) based on Bayesian optimization, which has been used in the parameter optimization of end products (Shahriari et al. 2015), such as recommendation systems, medical analysis tools, and speech recognizers. Grid search is often used for this purpose. However, when the number of parameters increases, grid search becomes infeasible because of the large amount of computation, making it less efficient than Hyperopt. The Tree-structured Parzen Estimator (TPE) was selected as the search algorithm. We therefore sought the best parameter combinations of the three machine learners with this optimization method, which was an innovation of our research. Combined with tenfold cross validation, we took the mean absolute percentage error (MAPE) as the objective function to be minimized over the parameter space that we defined. The following parameter spaces were used for parameter optimization.

i. For random forest, the parameters and their ranges were as follows: n_estimators, 160–190; and max_depth,
8–15. More detailed explanations of the random forest parameters are available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
ii. For XGBoost, the parameters and their ranges were as follows: n_estimators, 130–190; learning_rate, 0.070–0.095; max_depth, 5–9; min_child_weight, 8–16; subsample, 0.85–0.95; colsample_bytree, 0.8–0.9; reg_lambda, 1–8; and alpha, 10–20. More detailed explanations of the XGBoost parameters are available at https://xgboost.ai.
iii. For LightGBM, the parameters and their ranges were as follows: n_estimators, 110–140; learning_rate, 0.09–0.1; max_bin, 130–150; max_depth, 4–6; num_leaves, 8–12; bagging_freq, 30–40; bagging_fraction, 0.85–0.95; feature_fraction, 0.85–0.95; lambda_l1, 170–200; and lambda_l2, 0.0000035–0.0000045. More detailed explanations of the LightGBM parameters are available at https://lightgbm.readthedocs.io/en/latest/.

Ensemble methods

In this study, the three ensemble models used the outputs of the three basic models as input variables for secondary prediction, which could increase the forecast accuracy and robustness. The SA method gives each base model the same weight. The principle of OLS is to choose the weights so that the sum of the squares of the errors between the estimated value and the actual value is the smallest. The weights of the OLS ensemble are generally learned from the training data, but the weighted average method is not necessarily better than the simple average method (Shahhosseini et al. 2020), especially for large-scale integration and situations in which the performance of individual learners is similar. The LAD ensemble computes forecast combination weights using the principle of minimum absolute deviation. One characteristic of LAD is that it minimizes the absolute values of the errors rather than the squared error loss used by OLS and constrained least squares. We hoped to improve the accuracy of prediction by taking the outputs of the three machine learning methods as the inputs of the three ensembles.

Model evaluation

In our study, three accuracy metrics were applied to evaluate the performance of the models: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE), as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \tag{3}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{4}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \times 100\% \tag{5}$$

where $y_i$ denotes the observed values, $\hat{y}_i$ is the prediction, and $n$ denotes the number of data points.

MAE is the arithmetic mean of the absolute errors between the predicted and actual values. RMSE is the square root of the average squared deviation of predictions from real values. MAPE quantifies the accuracy as a percentage and is calculated as the mean of the absolute percentage errors over all time points.

Feature importance measurement

Understanding data through machine learning models was also one of our research goals. Our chosen models (RF, XGBoost, and LightGBM) all have natural methods to quantify the importance of input features. However, for the interpretability of the ensembles and consistency across all models, TreeExplainer (Lundberg et al. 2020), an explanation method for trees, was adopted to measure and rank the feature importance. This method can easily calculate the optimal local explanation according to the desirable properties from game theory. Using the SHapley Additive explanation (SHAP) values of the whole dataset, local explanations can be calculated efficiently and accurately, and a series of tools can be developed to explain the global behavior of models and to directly capture feature interactions (Lundberg et al. 2020). We calculated SHAP values for all variables for the three basic machine learners and then combined these importance values. More principles and calculation details can be found in Mangalathu et al. (2020).

Results and discussions

Characteristics of cases of COVID-19 in the USA

This study focused on the number of daily new cases in the USA from April 1, 2020, to August 31, 2021. First, we decomposed the data. The daily cases, seasonality, trend, and remainder are displayed from top to bottom in Fig. 3. The figure shows the full development of daily new cases in the USA, and it reveals both a seasonal pattern and a trend in our data. Second, to understand the seasonality of the data more clearly, we drew a seasonal subseries plot (Fig. 4). The horizontal lines indicate the mean for each day of the week across all weeks. This figure clearly depicts the underlying periodicity and shows the regular pattern within a cycle. Within a week, the numbers of cases on Monday, Tuesday, and Wednesday are higher than those on other days.

Fig. 3  Decomposition of the daily COVID-19 cases in the USA
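
The text does not state which decomposition procedure produced Fig. 3. As one possible reading, the sketch below implements a classical additive decomposition with a weekly cycle: a centered 7-day moving average for the trend, per-position averages of the detrended values for the seasonal part, and the leftover as the remainder. The function name `decompose_additive` and the synthetic series are assumptions made for illustration.

```python
from statistics import mean


def decompose_additive(y, period=7):
    """Split y into trend, seasonal, and remainder parts (y = T + S + R).

    Assumes an odd `period` so the moving-average window is centered, and a
    series long enough that every position of the cycle has interior points.
    """
    half = period // 2
    n = len(y)
    interior = range(half, n - half)

    # Trend: centered moving average over one full cycle (None at the edges).
    trend = [None] * n
    for t in interior:
        trend[t] = mean(y[t - half:t + half + 1])

    # Seasonal: average detrended value at each position of the cycle,
    # shifted so that the effects sum to zero over one cycle.
    by_position = [[] for _ in range(period)]
    for t in interior:
        by_position[t % period].append(y[t] - trend[t])
    raw = [mean(values) for values in by_position]
    offset = mean(raw)
    seasonal = [raw[t % period] - offset for t in range(n)]

    # Remainder: whatever trend and seasonality leave unexplained.
    remainder = [None if trend[t] is None else y[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder


# Synthetic series: linear growth plus a weekly pattern that sums to zero.
weekly = [30, 20, 10, 0, -10, -20, -30]
y = [1000 + 5 * t + weekly[t % 7] for t in range(28)]
trend, seasonal, remainder = decompose_additive(y)
```

On this noise-free series the remainder is zero wherever the trend is defined; on real case counts it would carry the irregular component shown in the bottom panel of such a plot.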

Fig. 4 Seasonal subseries plot of weekly COVID-19 cases in the USA

Prediction effects of all models

Our entire time series data include the daily new cases as the outcome variable and 35 input variables. The daily data in the USA from 1 April 2020 to 31 August 2021 were split into two parts: a training set (from 1 April 2020 to 31 July 2021) to construct the three basic models (RF, XGBoost, and LightGBM), and a test set (from 1 to 31 August 2021) to validate the predictive performance of each model. Then, three ensemble methods were used to integrate the results of the three basic models. Figures 5 and 6 illustrate the relationship between real COVID-19 cases and the predicted values achieved by the three basic learners and the three ensembles, respectively. It can be seen from the figures that the prediction effect of the ensemble methods is better than that of the single methods. In addition, we found that the LightGBM and ensemble models perform better at the data inflection points in Figs. 5 and 6, showing that these models could have good prediction ability for complex and changeable data and situations.

Performance measures for all models

We set two identical parameters for the three basic machine learning models to render them comparable: tenfold cross validation and seed number 2021. The details of the model evaluation criteria are shown in Table 2. Among the three basic machine learning methods, considering all of the criteria, the accuracy of prediction in descending order is LightGBM, XGBoost, and RF. LightGBM could more accurately forecast the COVID-19 trend in the USA. The SA ensemble could not greatly improve the prediction accuracy. Compared with the base learners, the remaining two ensembles provided better accuracy. From Table 2, the optimized LAD ensemble is the most precise prediction model, with an MAE of 8540.411, reducing the prediction error of the best base learner (LightGBM) by approximately 3.111%. Moreover, its other metrics are lower than those of the basic learners. The advantage of LightGBM among the base learners might arise because there are many categorical variables in our data and LightGBM can offer good accuracy with integer-encoded categorical features. Moreover, its leaf-wise algorithm tends to achieve smaller losses than level-wise algorithms, such as XGBoost. Interestingly, the optimized LAD ensemble could further improve the prediction accuracy for the COVID-19 outbreak in the USA. This improvement could be explained by the LAD method being able to resist outliers in the data, while the OLS method gives more weight to outliers.

Feature importance

Based on the SHAP-based method, we obtained the feature SHAP values of the three basic machine learning methods and then determined the feature importance according to the weights achieved by the optimal ensemble method, i.e., the LAD ensemble. The artificially controllable external factors are the focus of our attention, as shown in Fig. 7. The mean absolute SHAP value indicates the average impact on the model output magnitude. Figure 7 presents the top
10 most important features for the outcomes of the RF, XGBoost, LightGBM, and LAD ensemble models. Community mobility features account for a large proportion of the top 10 features. The four subgraphs show that the workplace and residential percent changes from baseline are ranked first and second. There are five community mobility indicators in Fig. 7d. Social policy indicators, such as restrictions on gatherings, government response index, and requirements to stay at home, also appear frequently in Fig. 7. The cumulative number of people receiving 1 dose or more also ranks high. Wearing a mask is also an important feature. The currently developed vaccine has remained effective for moderate and severe COVID-19, even in the face of virus variants (Thiruvengadam et al. 2021). The effectiveness of social isolation and face covering in epidemic control has also been confirmed in other countries (Trauer et al. 2021).

Fig. 5 Predicted versus observed COVID-19 cases for RF, XGBoost, and LightGBM. RF, random forest; XGBoost, eXtreme Gradient Boosting; LightGBM, Light Gradient Boosting Machine

Fig. 6 Predicted versus observed COVID-19 cases for SA, OLS, and LAD ensembles. SA, simple averaging; OLS, ordinary least squares; LAD, least absolute deviation; SHAP, SHapley Additive explanation

The great innovation of our study was that a large number of input variables, covering self-protection, social policy, community mobility, and time indices, were entered into the models. The study was the first to use LightGBM to predict the epidemic situation in the USA and to use a hyperparameter optimization technique (Hyperopt) to tune parameters. The use of ensemble methods, which could improve prediction accuracy, is another of our strengths. The SHapley Additive explanation (SHAP) value was used to improve the interpretability of the model.

However, our study still had many limitations that must be addressed by future studies. This research was based on the USA, a whole country with a very large geographical area and various weather environments and landforms. Thus, it is difficult for us to find a suitable meteorological index or air pollution index to exactly describe the characteristics of this geographical environment, although meteorological factors
Table 2  Performance of all models study. Future research could conduct make prediction analysis
Model Dataset Evaluation metrics
over a larger range and include more environmental and spa-
tiotemporal variables. The number of cases worldwide is still
MAE RMSE MAPE (%)
increasing rapidly. New variants of coronavirus have been found
RF Training set 2582.654 4080.452 4.176 in many countries. In the USA, the delta variant has increased
Test set 5140.966 17,719.748 9.522 the risk of hospitalization and death (Bast et al. 2021). To resist
XGBoost Training set 2194.963 3881.644 3.824 COVID-19, it is crucial to formulate a clear reporting policy
Test set 9804.773 13,320.913 7.172 for potential global health emergencies (Chams et al. 2020),
LightGBM Training set 1956.127 2562.883 4.161 also making it possible for the world to jointly build a global
Test set 8814.679 12,522.886 6.267 COVID-19 database, including virus variants, government poli-
SA ensemble Training set 2059.457 3054.155 3.708 cies, population mobility and other relevant data. Then, a larger
Test set 9691.268 13,989.89 7.014 and more comprehensive dataset can be created to better serve
OLS ensemble Training set 1932.322 2542.013 4.016 the predictive model and government decision-making, aim-
Test set 8760.691 12,475.390 6.239 ing to penetrate COVID-19 evolution in more countries. On
LAD ensemble Training set 1923.844 2592.111 3.887 the foundation of big data, future research could build a more
Test set 8540.411 12,303.870 6.088 comprehensive and practical prediction model from the aspects
of space–time geography, medical resources, economic support
RF, random forest; XGBoost, eXtreme Gradient Boosting; LightGBM,
and feature interaction.
Light Gradient Boosting Machine; SA, simple averaging; OLS, ordi-
nary least square; LAD, least absolute deviation; MAE, mean absolute
error; RMSE, root mean square error; MAPE, mean absolute percent-
age error Conclusions

and air quality can affect the spread of COVID-19 (Copat et al. In this study, data related to the COVID-19 epidemic were
2020; Zheng et al. 2021a, b). In fact, the occurrence of COVID- collected as much as possible, and a total of 35 variables
19 should be impacted by the spatiotemporal variations espe- were entered into the machine learning models. Based on the
cially in large areas. Spatial factors were not considered in this authenticity and validity of our data source, we confirmed

Fig. 7  Feature importance analysis by SHAP values. RF, random forest; XGBoost, eXtreme Gradient Boosting; LightGBM, Light Gradient
Boosting Machine; LAD, least absolute deviation; SHAP, SHapley Additive explanation


that, among the three basic machine learning models, LightGBM had the best prediction performance, with the lowest test-set RMSE and MAPE. Moreover, the ensemble models, especially the LAD ensemble (test-set MAPE of 6.088%), could further improve the prediction accuracy. At the same time, the results of importance ranking illustrated that vaccination, wearing a mask, less mobility, and appropriate government intervention measures could effectively reduce the incidence rate, providing an evidence base for governments to formulate relevant policies on the prevention of and response to COVID-19. Our models can be applied to many other countries in which similar data are available.

Author contribution The conceptualization was performed by ZHL and WW. ZHL designed and drafted the manuscript. ASY, QBJ, and GP participated in the data collection. ZHL, HDS, and GP participated in the data analysis. ASY and WW critically revised the manuscript.

Funding This study was supported by the National Natural Science Foundation of China (grant numbers: 81202254 and 71974199) and the Science Foundation of Liaoning Provincial Department of Education (LJKQZ2021027).

Data availability All data came from public databases, and the sources have been presented in the manuscript.

Declarations

Ethics approval and consent to participate Not applicable.

Consent for publication All of the authors agreed to publish the manuscript.

Competing interests The authors declare no competing interests.

References

Abbasi Z, Zamani I, Mehra AHA, Shafieirad M, Ibeas A (2020) Optimal control design of impulsive SQEIAR epidemic models with application to COVID-19. Chaos Solitons Fractals 139:110054. https://doi.org/10.1016/j.chaos.2020.110054
Ahmar AS, Del Val EB (2020) SutteARIMA: Short-term forecasting method, a case: Covid-19 and stock market in Spain. Sci Total Environ 729:138883. https://doi.org/10.1016/j.scitotenv.2020.138883
ArunKumar KE, Kalaga DV, Sai Kumar CM, Chilkoor G, Kawaji M, Brenza TM (2021) Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: auto-regressive integrated moving average (ARIMA) and seasonal auto-regressive integrated moving average (SARIMA). Appl Soft Comput 103:107161. https://doi.org/10.1016/j.asoc.2021.107161
Asfahan S, Gopalakrishnan M, Dutt N, Niwas R, Chawla G, Agarwal M et al (2020) Using a simple open-source automated machine learning algorithm to forecast COVID-19 spread: a modelling study. Adv Respir Med 88(5):400–405. https://doi.org/10.5603/ARM.a2020.0156
Bast E, Tang F, Dahn J, Palacio A (2021) Increased risk of hospitalisation and death with the delta variant in the USA. Lancet Infect Dis 21(12):1629–1630. https://doi.org/10.1016/S1473-3099(21)00685-X
Bergstra J, Yamins D, Cox DD (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Paper presented at the Proceedings of the 30th International Conference on Machine Learning
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
Campillo-Funollet E, Van Yperen J, Allman P, Bell M, Beresford W, Clay J et al (2021) Predicting and forecasting the impact of local outbreaks of COVID-19: use of SEIR-D quantitative epidemiological modelling for healthcare demand and capacity. Int J Epidemiol 50(4):1103–1113. https://doi.org/10.1093/ije/dyab106
Casiraghi E, Malchiodi D, Trucco G, Frasca M, Cappelletti L, Fontana T et al (2020) Explainable machine learning for early assessment of COVID-19 risk prediction in emergency departments. IEEE Access 8:196299–196325. https://doi.org/10.1109/access.2020.3034032
Ceylan Z (2020) Estimation of COVID-19 prevalence in Italy, Spain, and France. Sci Total Environ 729:138817. https://doi.org/10.1016/j.scitotenv.2020.138817
Chams N, Chams S, Badran R, Shams A, Araji A, Raad M et al (2020) COVID-19: a multidisciplinary review. Front Public Health 8:383. https://doi.org/10.3389/fpubh.2020.00383
Chen T, Li X, Li Y, Xia E, Qin Y, Liang S et al (2019) Prediction and risk stratification of kidney outcomes in IgA nephropathy. Am J Kidney Dis 74(3):300–309. https://doi.org/10.1053/j.ajkd.2019.02.016
Copat C, Cristaldi A, Fiore M, Grasso A, Zuccarello P, Signorelli SS et al (2020) The role of air pollution (PM and NO(2)) in COVID-19 spread and lethality: a systematic review. Environ Res 191:110129. https://doi.org/10.1016/j.envres.2020.110129
Dairi A, Harrou F, Zeroual A, Hittawe MM, Sun Y (2021) Comparative study of machine learning methods for COVID-19 transmission forecasting. J Biomed Inform 118:103791. https://doi.org/10.1016/j.jbi.2021.103791
Davagdorj K, Pham VH, Theera-Umpon N, Ryu KH (2020) XGBoost-based framework for smoking-induced noncommunicable disease prediction. Int J Environ Res Public Health 17(18). https://doi.org/10.3390/ijerph17186513
Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX et al (2020) Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med 382(18):1708–1720. https://doi.org/10.1056/NEJMoa2002032
Hale T, Angrist N, Goldszmidt R, Kira B, Petherick A, Phillips T, Webster S et al (2021) A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nat Hum Behav. https://doi.org/10.1038/s41562-021-01079-8
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W et al (2017) LightGBM: a highly efficient gradient boosting decision tree. Paper presented at the Proceedings of the 31st International Conference on Neural Information Processing Systems
Kim BW, Choi MC, Kim MK, Lee J-W, Kim MT, Noh JJ et al (2021) Machine learning for recurrence prediction of gynecologic cancers using Lynch syndrome-related screening markers. Cancers 13(22). https://doi.org/10.3390/cancers13225670
Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B et al (2020) From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2(1):56–67. https://doi.org/10.1038/s42256-019-0138-9
Mangalathu S, Hwang S-H, Jeon J-S (2020) Failure mode and effects analysis of RC members based on machine-learning-based SHapley Additive exPlanations (SHAP) approach. Eng Struct 219:110927. https://doi.org/10.1016/j.engstruct.2020.110927


Mao L, Jin H, Wang M, Hu Y, Chen S, He Q et al (2020) Neurologic manifestations of hospitalized patients with coronavirus disease 2019 in Wuhan, China. JAMA Neurol 77(6):683–690. https://doi.org/10.1001/jamaneurol.2020.1127
Ng CFS, Seposo XT, Moi ML, Tajudin MABA, Madaniyazi L, Sahani M (2020) Characteristics of COVID-19 epidemic and control measures to curb transmission in Malaysia. Int J Infect Dis 101:409–411. https://doi.org/10.1016/j.ijid.2020.10.027
Nishio M, Nishizawa M, Sugiyama O, Kojima R, Yakami M, Kuroda T et al (2018) Computer-aided diagnosis of lung nodule using gradient tree boosting and Bayesian optimization. PLoS ONE 13(4):e0195875. https://doi.org/10.1371/journal.pone.0195875
Pan Y, Zhang L, Yan Z, Lwin MO, Skibniewski MJ (2021) Discovering optimal strategies for mitigating COVID-19 spread using machine learning: experience from Asia. Sustain Cities Soc 75:103254. https://doi.org/10.1016/j.scs.2021.103254
Ribeiro M, da Silva RG, Mariani VC, Coelho LDS (2020) Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil. Chaos Solitons Fractals 135:109853. https://doi.org/10.1016/j.chaos.2020.109853
Sarica A, Cerasa A, Quattrone A (2017) Random forest algorithm for the classification of neuroimaging data in Alzheimer’s disease: a systematic review. Front Aging Neurosci 9:329. https://doi.org/10.3389/fnagi.2017.00329
Shahhosseini M, Hu G, Archontoulis SV (2020) Forecasting corn yield with machine learning ensembles. Front Plant Sci 11:1120. https://doi.org/10.3389/fpls.2020.01120
Shahid F, Zameer A, Muneeb M (2020) Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos Solitons Fractals 140:110212. https://doi.org/10.1016/j.chaos.2020.110212
Shahriari B, Swersky K, Wang Z, Adams RP, Freitas ND (2015) Taking the human out of the loop: a review of Bayesian optimization. Proc IEEE 104(1):148–175
Shen J, Duan H, Zhang B, Wang J, Ji JS, Wang J et al (2020) Prevention and control of COVID-19 in public transportation: experience from China. Environ Pollut 266(Pt 2):115291. https://doi.org/10.1016/j.envpol.2020.115291
Sun J, Chen X, Zhang Z, Lai S, Zhao B, Liu H et al (2020) Forecasting the long-term trend of COVID-19 epidemic using a dynamic model. Sci Rep 10(1):21122. https://doi.org/10.1038/s41598-020-78084-w
Thiruvengadam R, Awasthi A, Medigeshi G, Bhattacharya S, Mani S, Sivasubbu S et al (2021) Effectiveness of ChAdOx1 nCoV-19 vaccine against SARS-CoV-2 infection during the delta (B.1.617.2) variant surge in India: a test-negative, case-control study and a mechanistic study of post-vaccination immune responses. Lancet Infect Dis. https://doi.org/10.1016/s1473-3099(21)00680-0
Trauer JM, Lydeamore MJ, Dalton GW, Pilcher D, Meehan MT, McBryde ES et al (2021) Understanding how Victoria, Australia gained control of its second COVID-19 wave. Nat Commun 12(1):6266. https://doi.org/10.1038/s41467-021-26558-4
Wang K, Zuo P, Liu Y, Zhang M, Zhao X, Xie S et al (2020) Clinical and laboratory predictors of in-hospital mortality in patients with coronavirus disease-2019: a cohort study in Wuhan, China. Clin Infect Dis 71(16):2079–2088. https://doi.org/10.1093/cid/ciaa538
Wang L, Zhu Z, Sassoubre L, Yu G, Liao C, Hu Q et al (2021) Improving the robustness of beach water quality modeling using an ensemble machine learning approach. Sci Total Environ 765:142760. https://doi.org/10.1016/j.scitotenv.2020.142760
Wu QW, Xia JF, Ni JC, Zheng CH (2021) GAERF: predicting lncRNA-disease associations by graph auto-encoder and random forest. Brief Bioinform 22(5). https://doi.org/10.1093/bib/bbaa391
Yang L, Wu H, Jin X, Zheng P, Hu S, Xu X et al (2020) Study of cardiovascular disease prediction model based on random forest in eastern China. Sci Rep 10(1):5245. https://doi.org/10.1038/s41598-020-62133-5
Ye GH, Alim M, Guan P, Huang DS, Zhou BS, Wu W (2021) Improving the precision of modeling the incidence of hemorrhagic fever with renal syndrome in mainland China with an ensemble machine learning approach. PLoS ONE 16(3):e0248597. https://doi.org/10.1371/journal.pone.0248597
Yu CS, Chang SS, Chang TH, Wu JL, Lin YJ, Chien HF et al (2021) A COVID-19 pandemic artificial intelligence-based system with deep learning forecasting and automatic statistical data acquisition: development and implementation study. J Med Internet Res 23(5):e27806. https://doi.org/10.2196/27806
Yu X (2019) Light Gradient Boosting Machine: an efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric Water Manag 225:105758
Zheng C, Tian J, Wang K, Han L, Yang H, Ren J et al (2021a) Time-to-event prediction analysis of patients with chronic heart failure comorbid with atrial fibrillation: a LightGBM model. BMC Cardiovasc Disord 21(1):379. https://doi.org/10.1186/s12872-021-02188-y
Zheng HL, Guo ZL, Wang ML, Yang C, An SY, Wu W (2021b) Effects of climate variables on the transmission of COVID-19: a systematic review of 62 ecological studies. Environ Sci Pollut Res Int 28(39):54299–54316. https://doi.org/10.1007/s11356-021-15929-5

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
