
Qiu - Machine Learning Approaches to predict Peak Demand Days - 2020

This study investigates the use of machine learning models to predict peak demand days for cardiovascular disease (CVD) admissions based on environmental factors in Chengdu, China, from 2015 to 2017. Six algorithms were compared, with the LightGBM model achieving the highest predictive performance, indicating its potential as a decision-making tool for healthcare resource management. The findings highlight the significant contributions of meteorological conditions and air pollutants to CVD admissions, emphasizing the importance of accurate demand forecasting in healthcare settings.


Qiu et al.

BMC Medical Informatics and Decision Making (2020) 20:83


https://doi.org/10.1186/s12911-020-1101-8

RESEARCH ARTICLE Open Access

Machine learning approaches to predict peak demand days of cardiovascular admissions considering environmental exposure
Hang Qiu1,2* , Lin Luo2, Ziqi Su3, Li Zhou4, Liya Wang2 and Yucheng Chen5,6

Abstract
Background: Accumulating evidence has linked environmental exposure, such as ambient air pollution and
meteorological factors, to the development and severity of cardiovascular diseases (CVDs), resulting in increased
healthcare demand. Effective prediction of demand for healthcare services, particularly those associated with peak
events of CVDs, can be useful in optimizing the allocation of medical resources. However, few studies have attempted
to adopt machine learning approaches with excellent predictive abilities to forecast the healthcare demand for CVDs.
This study aims to develop and compare several machine learning models in predicting the peak demand days of
CVDs admissions using the hospital admissions data, air quality data and meteorological data in Chengdu, China from
2015 to 2017.
Methods: Six machine learning algorithms, including logistic regression (LR), support vector machine (SVM), artificial
neural network (ANN), random forest (RF), extreme gradient boosting (XGBoost), and light gradient boosting machine
(LightGBM) were applied to build the predictive models with a unique feature set. The area under a receiver operating
characteristic curve (AUC), logarithmic loss function, accuracy, sensitivity, specificity, precision, and F1 score were used
to evaluate the predictive performances of the six models.
Results: The LightGBM model exhibited the highest AUC (0.940, 95% CI: 0.900–0.980), which was significantly higher
than that of LR (0.842, 95% CI: 0.783–0.901), SVM (0.834, 95% CI: 0.774–0.894) and ANN (0.890, 95% CI: 0.836–0.944), but
did not differ significantly from that of RF (0.926, 95% CI: 0.879–0.974) and XGBoost (0.930, 95% CI: 0.878–0.982). In
addition, the LightGBM has the optimal logarithmic loss function (0.218), accuracy (91.3%), specificity (94.1%), precision
(0.695), and F1 score (0.725). Feature importance identification indicated that the contribution rate of meteorological
conditions and air pollutants for the prediction was 32 and 43%, respectively.

* Correspondence: [email protected]
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, No.2006, Xiyuan Ave, West Hi-Tech Zone, 611731 Chengdu, Sichuan, P.R. China
2 Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if
changes were made. The images or other third party material in this article are included in the article's Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the
data made available in this article, unless otherwise stated in a credit line to the data.


Conclusion: This study suggests that ensemble learning models, especially the LightGBM model, can be used to
effectively predict the peak events of CVDs admissions, and therefore could be a very useful decision-making tool for
medical resource management.
Keywords: Machine learning, Cardiovascular disease, Hospital admission, Prediction, Environmental exposure

Background
Cardiovascular diseases (CVDs) are the leading cause of death worldwide; about 17.9 million deaths were attributable to CVDs in 2016, representing approximately 31% of all global deaths in that year [1]. Even though behavioral factors, including physical inactivity, smoking, unhealthy diets and obesity, are well-known risk factors for CVDs, a large body of studies have indicated that environmental exposure [2–4], such as ambient air pollution [5–9] and temperature variability [10–12], also makes a significant contribution to CVDs, resulting in increased risk of morbidity. For example, using conditional logistic regression models, Liu et al. [13] conducted a multi-city study in 26 Chinese cities, and the results showed that elevated concentrations of sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), and ozone (O3) were associated with increased risk of hospitalization for heart failure. Another national time-series study conducted in 184 Chinese cities linked temperature variability to the increase of hospital admissions for CVDs and its subtypes using over-dispersed Poisson regression models [14]. Although these statistical regression models can assess the associations of environmental exposure with CVDs morbidity [15–17], they are often incapable of providing sufficiently accurate morbidity prediction for healthcare management. Moreover, we lack information on the effect of a complex mixture of environmental exposure on CVDs morbidity.
With an increasing number of CVDs patients putting pressure on the limited medical resources, the prediction of healthcare demands, particularly those associated with peak events, has gained greater attention. Time series forecasting approaches, such as the autoregressive integrated moving average (ARIMA) model and the seasonal ARIMA model, are widely applied in predicting problems regarding emergency department visits [18, 19], new admission inpatients [20] and inpatients discharge [21]. However, these models have difficulties solving the complex nonlinear relationship among multi-factors, and their forecasting abilities to extrapolate are limited.
Recently, machine learning algorithms, which can solve the nonlinear relationship among multi-dimensional variables, have been shown to be effective in prediction, and are being used successfully in various healthcare applications, such as medical diagnosis [22, 23] and disease risk prediction [24, 25]. Nevertheless, only a very limited number of studies have attempted to adopt machine-learning based data-driven approaches to forecast the demand for healthcare services associated with environmental exposure, and these few studies predominately focused on the application of artificial neural network (ANN) [26–29]. For instance, Kassomenos et al. [30] applied ANN and stepwise regression models to predict the daily number of hospital admissions for CVDs and respiratory diseases considering air pollution and meteorological conditions, and ANN performed better than the regression model. Moreover, there were relatively fewer machine-learning based studies on predicting peak event of healthcare demand associated with environmental exposure [31]. To the best of our knowledge, only one study has used ANN to forecast peak demand days of emergency department visits for chronic respiratory diseases based on weather and environmental pollution. Although part of other machine learning algorithms performed better than ANN in other fields [32], it is unclear how effective the other machine learning approaches are in predicting the healthcare services demand associated with environmental exposure, which leaves open the potential for the development of more accurate predictive models using other algorithms.
In this study, we contribute to the existing body of knowledge by developing and comparing various machine learning models in predicting the peak demand days of CVDs admissions based on hospital admissions data, air quality data and meteorological data in Chengdu, China from 2015 to 2017. Six types of machine learning models, including logistic regression (LR), support vector machine (SVM), ANN, random forest (RF), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM), were constructed, and their predictive performances were also compared. The study shows the potential of machine learning approaches for predicting peak events of CVDs admissions, and identifies the most suitable model for decision making.

Methods
Overview of the research framework
This study attempted to predict the peak demand days of CVDs admissions using machine learning techniques. The block diagram of the classified prediction process is shown in Fig. 1. In brief, the time series dataset, which was comprised of CVDs admissions, meteorological data
Fig. 1 Block Diagram of Classified Prediction Process

and air quality data, was pre-processed. Second, the generalized additive model (GAM) was built to choose the lag day of meteorological conditions and air pollutants for CVDs admission. Then, six machine learning algorithms, including LR, SVM, ANN, RF, XGBoost and LightGBM, were applied to construct the predictive models, and the models’ parameters were optimized with 10-fold cross validation. After that, the predictive models were validated, then the performances of these models were compared. Finally, we predicted the peak demand days of CVDs admissions based on the optimal machine learning model.
The details are discussed in the following sub-sections.

Data collection and preprocessing
Hospital admissions data
Data for the daily number of hospital admissions for patients with CVDs who lived in urban areas of Chengdu was obtained from the Health Information Center of Sichuan Province, China. This data contains aggregate numbers of CVDs admissions in all the tertiary and secondary hospitals of Chengdu each day with primary diagnosis of CVDs (International Classification of Diseases, 10th Revision codes: I00-I99) from 1 January 2015 to 31 December 2017, which is 1096 days of continuous data.
Additionally, we focused on the peak demand of CVDs admissions, and the binary variable was generated from the daily number of CVDs admissions. In the absence of a known threshold for daily CVDs admissions, the peak demand was defined on the basis of an 85th percentile threshold (304 hospital admissions per day) by reference to the previous studies [31, 33]. Specifically, the days on which the daily number of CVDs admissions were equal to or above the 85th percentile threshold were defined as peak demand days. Thus, the binary variable of CVDs admissions is highly imbalanced, with 931 samples of non-peak demand and 165 samples of peak demand. This binary variable of CVDs admissions was used as the primary dependent variable in the analysis.

Meteorological data and air quality data
Meteorological data, including temperature, relative humidity and rainfall, were derived from the Chengdu Meteorological Monitoring Database (http://data.cma.cn/). Hourly data of air pollutants, including PM2.5 (particulate matter with aerodynamic diameter ≤ 2.5 μm), PM10 (particulate matter with aerodynamic diameter ≤ 10 μm), SO2, NO2, CO and O3, were obtained from the China National Environmental Monitoring Center (http://www.cnemc.cn/), which provides real-time monitoring of hourly concentrations of air pollutants to the general public. We averaged the 24-h mean concentrations for PM2.5, PM10, SO2, NO2 and CO, and calculated maximum 8-h moving average concentrations for O3 from the air quality monitoring stations interspersed among the urban areas of Chengdu. Concentrations of particulate matter with an aerodynamic diameter between 2.5 and 10 μm (PMC) were calculated by subtracting daily average concentrations of PM2.5 from PM10 [9, 34].

Data preprocessing
Data for the daily number of hospital admissions for CVDs, meteorological data and air quality data were collected from different data sources. We merged these three datasets to form a time series dataset by date (i.e. 1
January 2015 to 31 December 2017). The time series features were extracted from date, including year, month (month of year), day (day of month), holiday (public holidays) and DOW (day of week).
During the study period, the percentages of missing values from the monitoring stations were 1.28% (14/1096) for meteorological conditions, and 3.19% (35/1096) for air pollutants. The linear interpolation which has acceptable performance and reliability was used to fill in the missing values of meteorological conditions and air pollutants [35, 36].

Feature extraction
As illustrated in the above section, the features for predicting the peak demand days of CVDs admissions included time series features, meteorological condition features and air pollutant features. Accumulating epidemiological studies have suggested that the effect of meteorological conditions and air pollutants on CVDs admissions is delayed, and the lag effect is related to the regional environment [8, 12, 37]. Hence, we employed an over-dispersed GAM, which allowed the quasi-Poisson distribution to analyze the lag effects of daily meteorological conditions and air pollutants on CVDs admissions, and chose the lag day based on the minimum Generalized Cross-Validation (GCV) values which measure models fit [5, 34]. The lag effects of single day lags (from lag0 to lag6) and cumulative day lags (from lag01 to lag06) were taken into consideration. The penalized spline approaches were applied to control for potential confounding of long-term trends, seasonality and meteorological effects [38]. Moreover, dummy variables of holiday and DOW were controlled.
The results demonstrated that temperature, relative humidity, rainfall, PM2.5, PM10, PMC, SO2, NO2, CO and O3 were associated with CVDs admissions, with the minimum GCV values at lag04, lag06, lag06, lag3, lag3, lag3, lag0, lag0, lag0 and lag6, respectively.
Finally, the independent variables for forecasting the peak demand days of CVDs admissions included fifteen features, which are shown in Table 1.

Machine learning methods
In this study, six well-accepted machine learning algorithms, including LR, SVM, ANN, RF, XGBoost and LightGBM, were applied to develop predictive models with the unique feature set. These machine learning methods were considered according to their following characteristics.
LR is a common and basic algorithm, which is widely used in disease risk prediction and epidemiology [39]. SVM is a discriminative classification technique, which has been widely applied in medical diagnostics and other fields, especially with small sample sets [40]. ANN, inspired by biological neural networks, has a remarkable ability to determine the meaning and rules of complicated data [41, 42]. RF, an ensemble algorithm, applies a bootstrap algorithm to extract multiple samples from the training set randomly, and trains the samples with the weak classifier (i.e. decision tree) [43]. RF’s final result is determined by the majority of votes over all decision trees, thereby improving its predictive accuracy and preventing the model from over-fitting. XGBoost is a distributed gradient boosting algorithm and has gained wide popularity and attention in machine learning competitions [44, 45]. XGBoost chooses a weak classifier to

Table 1 The features for prediction


Feature category Features Description
Time series features year year of the date of hospital admission
month month of year
day day of month
holiday public holidays
DOW day of week
Meteorological condition features Tem_lag04 mean temperature for the moving average of current day and previous four days (lag04)
RH_lag06 relative humidity for the moving average of current day and previous six days (lag06)
Rain_lag06 rainfall for the moving average of current day and previous six days (lag06)
Air pollutants features PM2.5_lag3 PM2.5 at the previous three days (lag3)
PM10_lag3 PM10 at the previous three days (lag3)
PMC_lag3 PMC at the previous three days (lag3)
SO2_lag0 SO2 at the current day (lag0)
NO2_lag0 NO2 at the current day (lag0)
CO_lag0 CO at the current day (lag0)
O3_lag6 O3 at the previous six days (lag6)
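
The two kinds of lag feature listed in Table 1 can be illustrated with a minimal sketch. It assumes a single-day lag (e.g. lag3) takes the value from k days earlier, and a cumulative lag (e.g. lag04) is the moving average of the current day and the previous k days, as the table descriptions state. The function names and the sample values are hypothetical and are not taken from the study data.

```python
from statistics import mean

def single_lag(series, k):
    """Value observed k days before each day (lag k); None where history is too short."""
    return [series[i - k] if i >= k else None for i in range(len(series))]

def cumulative_lag(series, k):
    """Moving average of the current day and the previous k days (lag 0k)."""
    return [mean(series[i - k:i + 1]) if i >= k else None
            for i in range(len(series))]

# Hypothetical daily values, for illustration only (not study data).
temperature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
pm25 = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]

tem_lag04 = cumulative_lag(temperature, 4)  # analogous to Tem_lag04
pm25_lag3 = single_lag(pm25, 3)             # analogous to PM2.5_lag3
print(tem_lag04)
print(pm25_lag3)
```

The first k days have no complete history, so the sketch marks them as None; how the study handled those leading days is not stated in the text.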
facilitate efficient optimization algorithms, adds an L2 regularization term of leaf weights to achieve lower variance, and uses the second-order Taylor series as the cost function to retain more information about the target function, thereby improving its predictive accuracy. LightGBM is a distributed and high-performance gradient boosting framework based on a decision tree algorithm designed for fast computational time, especially with very large data sets [46]. It utilizes two novel techniques: gradient-based one-side sampling and exclusive feature bundling, which respectively are used to deal with the huge number of data samples and massive amount of features [47].
All above-mentioned models were trained and tested on a partitioned 80/20 percentage split of the dataset by stratified random sampling. Simultaneously, in situations where there was imbalanced class data combined with unequal error costs, these models’ performance metrics were not representative of reasonable performances. Therefore, it was necessary to balance the dataset to get true performance values for the classifier; hence, we adjusted weights inversely proportional to class frequencies in the input data when training the machine learning models.
The parameters of these six predictive models were determined by grid search and 10-fold cross-validation in training the dataset. To be specific, we partitioned the training dataset into ten equally sized pieces, and we utilized the grid search with nine pieces to tune the parameters, while the remaining piece was used as the validation set. We repeated this process ten times. The best parameters for predictive models were obtained with the best score, which itself was obtained by averaging the scores over these ten repetitions. Table 2 shows the values of the parameters for each model.

Model assessment
We calculated the AUC from receiver operating characteristic (ROC) analysis to evaluate the predictive utilities of the models, and the AUC of the six machine learning models was compared based on the DeLong method (p-value < 0.05 was deemed to indicate statistical significance) [48]. Meanwhile, logarithmic loss function (log-loss) was applied to quantify the accuracy of the classifier by punishing the wrong classification. Furthermore, the evaluation indicators of the confusion matrix, including accuracy, sensitivity, specificity, precision, and F1 score, were used to analyze the relationship between the actual values and the predicted values for the peak demand of CVDs admissions.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$F_1\ \text{score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
where TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative; $\text{Recall} = \frac{TP}{TP + FN}$.

Results
Descriptive statistics
The statistical information of daily CVDs hospital admissions, meteorological conditions and air pollutants concentrations is summarized in Table 3. During the study period, the average of daily hospital admissions for CVDs was 208 inpatients, the minimum value was 33, and the maximum value was 476. The daily average levels of temperature, relative humidity and rainfall were 17.0 °C, 80.4% and 2.6 mm, respectively. The daily average concentrations were 60.3 μg/m3 for PM2.5, 99.3 μg/m3 for PM10, 39.0 μg/m3 for PMC, 13.9 μg/m3 for SO2, 55.0 μg/m3 for NO2, 96.0 μg/m3 for O3 and 1.1 mg/m3 for CO.

Evaluation and comparison of the predictive models
Based on the above-mentioned features in Table 1, we constructed six machine learning models to predict the peak demand days for CVDs admissions. Using the optimal parameters for each model, the predictive models were corroborated via a validation set which was derived from the training dataset by 10-fold cross-validation. The box plot of AUC for each model with 10-fold cross-validation in training dataset is shown in Fig. 2. The AUC for LR, SVM, ANN, RF, XGBoost and LightGBM was 0.817 (95% confidence interval (CI): 0.795–0.839), 0.814 (95% CI: 0.792–0.836), 0.844 (95% CI: 0.814–0.875), 0.929 (95% CI: 0.906–0.951), 0.945 (95% CI: 0.922–0.967) and 0.9454 (95% CI: 0.921–0.967), respectively. The XGBoost model achieved the best AUC, and its performance was significantly better than LR (p-value < 0.001), SVM (p-value < 0.001) and ANN (p-value < 0.001), but did not differ significantly from RF (p-value = 0.264) and LightGBM (p-value = 0.933).
Based on the validation result for the training dataset, we predicted the peak demand days for CVDs admissions in an independent testing dataset. The ROC curve for the predictive models in that testing dataset is shown in Fig. 3. The AUC of LR, SVM, ANN, RF, XGBoost and LightGBM was 0.842 (95% CI: 0.783–0.901), 0.834 (95% CI: 0.774–0.894), 0.890 (95% CI: 0.836–0.944), 0.926
Table 2 Summary of parameter values in each model


Models Parameters Values Meaning of parameters
LR penalty L1 penalty function
SVM kernel linear kernel function
C 5 penalty parameter of the error term
ANN kernel initializer uniform kernel initializer function
activation1 relu activation of hidden layer
activation2 sigmoid activation of output layer
optimizer Adam training optimization algorithm
epochs 300 number of times shown to the network
batch size 20 batch size
dropout 0.0 dropout rate
RF n estimators 695 number of iterations
max depth 4 maximum depth of variable interactions
max features 7 number of features for the best split
XGBoost learning rate 0.1 learning rate
n estimators 100 number of iterations
eta 0.01 control of learning rate
max depth 3 maximum depth of variable interactions
gamma 0.6 minimum loss reduction required to make a further partition on a leaf node of the tree
subsample 0.7 subsample ratio
co-sample by tree 0.6 subsample ratio of columns when constructing each tree
min child weight 2 sum of the minimum weights that leaf nodes need to observe
LightGBM learning rate 0.1 learning rate
n estimators 100 number of iterations
max depth 8 maximum depth of variable interactions
num leaves 10 number of leaves in each tree
bagging fraction 0.7 percentage of sampling used in each iteration
feature fraction 0.9 ratio of features to build the tree in each iteration
min data in leaf 5 minimum number of records in a leaf
min split gain 0.0 smallest gain of the split
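
The tuning procedure behind Table 2 (grid search scored by 10-fold cross-validation, keeping the setting with the best average score) can be sketched as follows. This is only an illustration under stated assumptions: the fold splitter is a simple contiguous split, `toy_score` is a hypothetical stand-in for "fit on the training folds and return the validation AUC", and the candidate grid is invented. It does not reproduce the authors' code or search space.

```python
def kfold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_cv_score(score_fn, params, n_samples, k=10):
    """Average validation score of one parameter setting over k folds."""
    folds = kfold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # The other k-1 folds form the training part for this round.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fn(params, train_idx, val_idx))
    return sum(scores) / k

def grid_search(score_fn, grid, n_samples, k=10):
    """Return the parameter setting with the highest mean cross-validated score."""
    return max(grid, key=lambda p: mean_cv_score(score_fn, p, n_samples, k))

def toy_score(params, train_idx, val_idx):
    # Hypothetical scorer: a real one would train a model on train_idx
    # and evaluate it (e.g. by AUC) on val_idx.
    return 1.0 - abs(params["max_depth"] - 8) / 10

grid = [{"max_depth": d} for d in (3, 4, 8)]  # hypothetical candidates
best = grid_search(toy_score, grid, n_samples=100)
print(best)
```

In practice the folds would usually be stratified so that each keeps the same share of peak days; the contiguous split here is kept deliberately simple.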

(95% CI: 0.879–0.974), 0.930 (95% CI: 0.878–0.982) and 0.940 (95% CI: 0.900–0.980), respectively. The LightGBM model had the highest AUC value among all these predictive models, and the performance was significantly better than LR (p-value < 0.001), SVM (p-value < 0.001), ANN (p-value = 0.03), but did not differ significantly from RF (p-value = 0.222) and XGBoost (p-value = 0.489).
Furthermore, we used log-loss, accuracy, sensitivity, specificity, precision, and F1 score to compare the performances of these six machine learning models in the independent testing dataset (Table 4). The LightGBM model exhibited the best AUC (0.940), log-loss (0.218), accuracy (0.913), specificity (0.941), precision (0.695), and F1 score (0.725) in this testing dataset, and the RF model had the best sensitivity (0.909). Thus, the LightGBM model achieved the best performance among the six machine learning models.

The identification of feature importance
As illustrated in the above section, the LightGBM model achieved the best performance; it offers the most powerful predictors for predicting the peak demand days of CVDs admissions. The identification of feature importance based on LightGBM is shown in Fig. 4. The contribution rate of time series features, meteorological conditions and air pollutants for predicting the peak demand days of CVDs admissions was 25, 32 and 43%, respectively. Among the meteorological condition features, the top-ranked features were Tem_lag04 and RH_lag06, respectively. Similarly, the top-ranked features among the air pollutants were NO2_lag0 and SO2_lag0, respectively.

Discussion
The six machine learning models were developed to predict the peak demand days for CVDs admissions, and as
Table 3 Summary statistics of daily CVDs admissions, meteorological conditions and air pollutants concentrations in Chengdu,
2015–2017
Mean Standard Deviation Minimum Median Maximum
CVDs hospital admissions (n) 208 90 33 206 476
Meteorological Conditions
Temperature (°C) 17.0 7.2 −1.1 17.8 30.2
Relative Humidity (%) 80.4 8.8 43.0 80.8 98.3
Rainfall (mm) 2.6 8.7 0.0 0.0 122.0
Air Pollutants Concentrations
PM2.5 (μg/m3) 60.3 42.4 6.1 48.4 324.5
PM10 (μg/m3) 99.3 64.7 14.3 79.8 492.5
PMC (μg/m3) 39.0 25.8 4.8 31.6 238.2
SO2 (μg/m3) 13.9 5.8 3.9 12.7 37.9
NO2 (μg/m3) 55.0 17.3 15.7 53.0 130.4
O3 (μg/m3) 96.0 54.6 5.6 85.3 290.4
CO (mg/m3) 1.1 0.4 0.4 1.0 2.8
CVDs Cardiovascular diseases
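
The peak-demand label and the class weighting described in the Methods can be sketched in a few lines. Two assumptions are made: the percentile is computed with the standard library's inclusive interpolation, and "weights inversely proportional to class frequencies" is read as the common n_samples / (n_classes × class count) rule. The admissions series is hypothetical; only the class counts (931 non-peak, 165 peak) come from the text.

```python
from statistics import quantiles

def label_peak_days(daily_admissions, percentile=85):
    """Return (threshold, labels); label 1 marks days at or above the threshold."""
    # quantiles(..., n=100) yields the 1st to 99th percentile cut points.
    cuts = quantiles(daily_admissions, n=100, method="inclusive")
    threshold = cuts[percentile - 1]
    labels = [1 if x >= threshold else 0 for x in daily_admissions]
    return threshold, labels

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Hypothetical daily admission counts, for illustration only.
admissions = list(range(100, 300, 10))
threshold, labels = label_peak_days(admissions)
print(threshold, sum(labels))

# Class counts reported in the study: 931 non-peak days, 165 peak days.
weights = balanced_class_weights({"non_peak": 931, "peak": 165})
print(weights)
```

With this rule each class contributes the same total weight, so a peak day counts about 5.6 times as much as a non-peak day (931 / 165).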

a result of our study, the optimal model has been identified. To the best of our knowledge, no studies have applied machine learning models other than ANN in the prediction of peak event of healthcare demand. This is the first study to construct and compare various machine learning models in terms of predicting the peak events of CVDs admissions using meteorological data, air quality data and hospital admissions data.
Our study found that the ensemble learning models, including LightGBM, RF and XGBoost, outperformed ANN, SVM and LR, achieved overall accuracies of > 0.86 and AUCs of > 0.92. This suggests that the ensemble learning models have better generalization capabilities compared to other models for predicting the peak demand days of CVDs admissions. The LightGBM exhibited the best performance among the ensemble learning models. Compared with ANN, SVM and LR, the AUC of LightGBM significantly improved by 5.65, 12.66 and 11.61%, respectively. Even though most predictive models have higher recall and lower precision, this could be acceptable as insufficient allocation of medical resources in peak days can lead to costly outcomes. The results of our study indicate that ensemble learning models are well suited for the prediction of peak demand for healthcare services.

Fig. 2 Box plot of AUC for machine learning models with 10-fold cross-validation in training dataset. °: the outliers of box plot; *: the model is
significantly different from the XGBoost model. LR: logistic regression; SVM: support vector machine; ANN: artificial neural network; RF: random forest;
XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine
Fig. 3 ROC curve of machine learning models in testing dataset. LR: logistic regression; SVM: support vector machine; ANN: artificial neural network;
RF: random forest; XGBoost: extreme gradient boosting; LightGBM: light gradient boosting machine

The lag patterns of meteorological conditions and air pollutants have been well-documented in epidemiological studies [8, 12, 16], and suggest that the lag effects of environmental exposure have regional differences. However, to date, very few machine-learning based studies have analyzed the lag effect of environmental exposure when predicting the peak demand for healthcare services. Krishan et al. [31] applied representative lags to predictors based on the results from other studies to forecast the peak demand days of emergency department visits, but did not incorporate the actual situation of the study area. In our study, we utilized GAM to analyze the lag effect of meteorological conditions and air pollutants on CVDs admissions in our study areas. GAM is useful in the detection of early warning signals for future peak demand.
Environmental exposure, such as ambient air pollution and extreme temperatures, is an important but underappreciated risk factor contributing to the development and severity of CVDs [4]. Accumulating evidence from epidemiological studies has linked environmental exposure to increased risk of CVDs morbidity [5–12]. However, evidence of the effect of a complex mixture of environmental exposure on CVDs morbidity is still limited. Machine learning techniques provide an opportunity for developing algorithms that classify individuals with complex interaction factors. In our study, the contribution of the special ambient air pollutants and climatic characteristics of the area to the peak demand days of CVDs admissions was successfully modeled. The identification of feature importance based on the optimal model showed that among the environmental exposure features, the 4 top-ranked features were Tem_lag04, RH_lag06, NO2_lag0 and SO2_lag0, respectively, and the contribution rate of meteorological conditions and air pollutants to the prediction was 32 and 43%, respectively. These results suggest that environmental exposure is an important predictor.
Our study has several strengths. First, considering the lag effects of the complex mixture of environmental exposure and their regional differences, we utilized an over-dispersed GAM to analyze the lag effects of

Table 4 Evaluation indicators of the machine learning models on the testing dataset

Model        AUC (95% CI)          Log-loss   Accuracy   Sensitivity   Specificity   Precision   F1 score
LR           0.842 (0.783–0.901)   0.513      0.766      0.848         0.751         0.378       0.523
SVM          0.834 (0.774–0.894)   0.344      0.748      0.879         0.724         0.362       0.513
ANN          0.890 (0.836–0.944)   0.296      0.858      0.333         0.951         0.551       0.415
RF           0.926 (0.879–0.974)   0.358      0.862      0.909         0.854         0.527       0.667
XGBoost      0.930 (0.878–0.982)   0.277      0.876      0.818         0.886         0.563       0.667
LightGBM a   0.940 (0.900–0.980)   0.218      0.913      0.758         0.941         0.695       0.725

a The optimal model. LR logistic regression, SVM support vector machine, ANN artificial neural network, RF random forest, XGBoost extreme gradient boosting, LightGBM light gradient boosting machine
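
As a reading aid for Table 4, the following minimal Python sketch shows how the threshold-based indicators and the log-loss are conventionally defined. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not the study's code or data. The last line checks that the F1 score reported for LightGBM agrees with its reported precision and sensitivity.

```python
from math import log


def classification_metrics(tp, fp, fn, tn):
    """Compute the threshold-based indicators from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of peak days correctly flagged
    specificity = tn / (tn + fp)   # share of non-peak days correctly cleared
    precision = tp / (tp + fp)     # share of flagged days that were peaks
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


def f1_from(precision, sensitivity):
    """Harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, for illustration only (not the paper's confusion matrix).
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)

# Consistency check on the LightGBM row of Table 4 (precision 0.695, sensitivity 0.758).
lightgbm_f1 = round(f1_from(0.695, 0.758), 3)
```

The same check reproduces the reported F1 score for the other five models, which suggests the table is internally consistent.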

Fig. 4 Feature importance ranking based on the LightGBM model
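
The top-ranked features named in the discussion (for example Tem_lag04) are lagged exposures, and the discussion states that the lag with the minimum GCV value was chosen as the predictor. A minimal Python sketch of both ideas follows. It assumes that a label such as lag04 denotes the average over the current day and the previous four days, which this section does not state; the helper names, the temperature series and the GCV scores are made-up placeholders, not values from the study's GAM fits.

```python
def moving_average_lag(series, max_lag):
    """Average of the current day and the previous `max_lag` days.

    Days without enough history are returned as None.
    """
    out = []
    for t in range(len(series)):
        if t < max_lag:
            out.append(None)
        else:
            window = series[t - max_lag : t + 1]
            out.append(sum(window) / len(window))
    return out


def best_lag(gcv_by_lag):
    """Return the lag label whose fit has the smallest GCV score."""
    return min(gcv_by_lag, key=gcv_by_lag.get)


# Made-up daily mean temperatures (illustration only).
temperature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
tem_lag04 = moving_average_lag(temperature, max_lag=4)

# Made-up GCV scores, one per candidate lag (illustration only).
gcv_scores = {"lag0": 1.32, "lag01": 1.29, "lag04": 1.21, "lag06": 1.25}
chosen = best_lag(gcv_scores)
```

Selecting the lag per exposure in this way, rather than fixing it in advance, is what lets different variables enter the classifier with different lags.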

meteorological conditions and air pollutants on CVDs admissions, and chose the lag day with the minimum GCV value as the optimal predictor, rather than using the current day or relying on previous research, which makes our predictive models more practical. In addition, we applied six well-accepted machine learning algorithms to construct predictive models, which reflects our commitment to presenting a wide variety of approaches. Specifically, LR represents the basic machine learning model, SVM and ANN are widely used in prediction, and RF, XGBoost and LightGBM are ensemble learning models. As discussed earlier, we found that ensemble learning models, especially the LightGBM model, have higher prediction capabilities than LR or ANN, which can help decision makers find more suitable models for the prediction of healthcare demand, especially during peak events. To the best of our knowledge, this study is the first to develop and compare various well-accepted machine learning models that consider environmental exposure to predict the peak events of CVDs admissions. Our results contribute to the limited research in this field, as they provide useful and comprehensive information to those who seek to identify the most suitable model for decision making.
Our study also has some limitations that need to be addressed. First, we considered only two well-studied environmental exposures, meteorological conditions and ambient air pollutants, but other environmental factors, such as exposure to the metals arsenic, cadmium and lead, also play important roles in the development and severity of CVDs [4]. Second, we constructed only classification models to predict the peak demand days of CVDs admissions. Further study is required to forecast the number of admissions for CVDs accurately based on regression models. Third, the current model is designed for non-communicable diseases, such as CVDs, which are associated with environmental exposure, and the model might not be suitable for forecasting the peak events of infectious diseases.

Conclusions
This study used machine learning approaches to forecast the peak demand days for CVDs admissions based on hospital admissions data, air quality data and meteorological data. The results revealed that ensemble learning models, especially the LightGBM model, can accurately predict the peak events of CVDs admissions. Meanwhile, the identification of feature importance based on LightGBM indicated that meteorological conditions and air pollutants made significant contributions to the accuracy of prediction. These findings show that machine learning approaches have potential in the prediction of the peak events of CVDs, and the predictive capacity of ensemble learning models makes them valid tools supporting decisions regarding medical resource management.

Abbreviations
ANN: Artificial neural network; ARIMA: Autoregressive integrated moving average; AUC: Area under a receiver operating characteristic curve; CO: Carbon monoxide; CVDs: Cardiovascular diseases; DOW: Day of week; GAM: Generalized additive model; GCV: Generalized Cross-Validation; LightGBM: Light gradient boosting machine; LR: Logistic regression; NO2: Nitrogen dioxide; O3: Ozone; PM2.5: Particulate matter with aerodynamic diameter ≤ 2.5 μm; PM10: Particulate matter with aerodynamic diameter ≤ 10 μm; PMC: Particulate matter with an aerodynamic diameter between 2.5
and 10 μm; RF: Random forest; ROC: Receiver operating characteristic; SO2: Sulfur dioxide; SVM: Support vector machine; XGBoost: Extreme gradient boosting

Acknowledgements
We thank the Health Information Center of Sichuan Province for its permission to use the data.

Authors’ contributions
HQ proposed and designed the study. HQ, LL and ZQS performed the experiments and analyzed the data. LYW and LZ collected the data and performed the statistical analyses. HQ and LL wrote the manuscript. ZQS and YCC revised the manuscript. All authors have read and approved the final manuscript.

Funding
This research was supported by the National Natural Science Foundation of China (No. 71661167005) and the Key Research and Development Program of Sichuan Province (No. 2018SZ0114, No. 2019YFS0271), which provide financial support in the design of study and analysis of data, and the 1·3·5 Project for Disciplines of Excellence–Clinical Research Incubation Project, West China Hospital, Sichuan University (Grant No. 2018HXFH023, ZYJC18013), which provide financial support in interpretation of data and writing the manuscript.

Availability of data and materials
The meteorological and air quality datasets analyzed during the current study are available at http://data.cma.cn/ and http://www.cnemc.cn/. Daily data of hospital admissions for CVDs are available from the Health Information Center of Sichuan Province, but restrictions are applied to these data, which were used under license for the current study, and so are not publicly available. The daily number of hospital admissions for patients with CVDs are however available from authors upon reasonable requests, and with permission of the Health Information Center of Sichuan Province, China.

Ethics approval and consent to participate
This study was approved by the Health Information Center of Sichuan Province. Informed consent was waived because this research did not involve individual data.

Consent for publication
Not applicable. The study does not include details relating to an individual person.

Competing interests
The authors declare that they have no competing interests.

Author details
1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, No.2006, Xiyuan Ave, West Hi-Tech Zone, 611731 Chengdu, Sichuan, P.R. China. 2 Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China. 3 Department of Statistics, Faculty of Science, University of British Columbia, Vancouver, Canada. 4 Health Information Center of Sichuan Province, Chengdu, China. 5 Cardiology Division, West China Hospital, Sichuan University, Chengdu, China. 6 West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China.

Received: 17 December 2019 Accepted: 23 April 2020

References
1. WHO: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 1 September 2019).
2. Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, Samet JM. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA. 2006;295(10):1127–34.
3. Peng RD, Chang HH, Bell ML, McDermott A, Zeger SL, Samet JM, Dominici F. Coarse particulate matter air pollution and hospital admissions for cardiovascular and respiratory diseases among Medicare patients. JAMA. 2008;299(18):2172–9.
4. Cosselman KE, Navas-Acien A, Kaufman JD. Environmental factors in cardiovascular disease. Nat Rev Cardiol. 2015;12(11):627–42.
5. Zhu X, Qiu H, Wang L, Duan Z, Yu H, Deng R, Zhang Y, Zhou L. Risks of hospital admissions from a spectrum of causes associated with particulate matter pollution. Sci Total Environ. 2019;656:90–100.
6. Hui L, Yaohua T, Xiao X, Juan J, Jing S, Yaying C, Chao H, Man L, Yonghua H. Ambient particulate matter concentrations and hospital admissions in 26 of China’s largest cities: a case-crossover study. Epidemiology. 2018;29(5):649–57.
7. Tatiane F, Maria F, Clarice dF, Felipe N, Washington J, Nelson G. Effects of particulate matter and its chemical constituents on elderly hospital admissions due to circulatory and respiratory diseases. Int J Environ Res Public Health. 2016;13(10):947–57.
8. Soleimani Z, Darvishi Boloorani A, Khalifeh R, Griffin DW, Mesdaghinia A. Short-term effects of ambient air pollution and cardiovascular events in shiraz, Iran, 2009 to 2015. Environ Sci Pollut Res Int. 2019;26(7):6359–67.
9. Chen M, Qiu H, Wang L, Zhou L, Zhao F. Attributable risk of cardiovascular hospital admissions due to coarse particulate pollution: a multi-city time-series analysis in southwestern China. Atmos Environ. 2019;218:117014.
10. Zhao Q, Zhao Y, Li S. Impact of ambient temperature on clinical visits for cardio-respiratory diseases in rural villages in Northwest China. Sci Total Environ. 2018;612:379–85.
11. Ha S, Nguyen K, Liu D, Mannisto T, Nobles C, Sherman S, Mendola P. Ambient temperature and risk of cardiovascular events at labor and delivery: a case-crossover study. Environ Res. 2017;159:622–8.
12. Phung D, Thai PK, Guo Y, Morawska L, Rutherford S, Chu C. Ambient temperature and risk of cardiovascular hospitalization: an updated systematic review and meta-analysis. Sci Total Environ. 2016;550:1084–102.
13. Liu H, Tian Y, Song J, Cao Y, Hu Y. Effect of ambient air pollution on hospitalization for heart failure in 26 of China's largest cities. Am J Cardiol. 2017;121(5):628–33.
14. Tian Y, Liu H, Si Y, Cao Y, Song J, Li M, Wu Y, Wang X, Xiang X, Juan J. Association between temperature variability and daily hospital admissions for cause-specific cardiovascular disease in urban China: a national time-series study. PLoS Med. 2019;16(1):e1002738.
15. Hsu WH, Hwang S-A, Kinney PL, Lin S. Seasonal and temperature modifications of the association between fine particulate air pollution and cardiovascular hospitalization in New York state. Sci Total Environ. 2017;578:626–32.
16. Ma Y, Zhao Y, Yang S, Zhou J, Yang D. Short-term effects of ambient air pollution on emergency room admissions due to cardiovascular causes in Beijing, China. Environ Pollut. 2017;230:974–80.
17. Vahedian M, Khanjani N, Mirzaee M, Koolivand A. Ambient air pollution and daily hospital admissions for cardiovascular diseases in Arak, Iran. Arya Atherosclerosis. 2017;13(3):117–34.
18. Juang WC, Huang S-J, Huang F-D, Cheng P-W, Wann S-R. Application of time series analysis in modelling and forecasting emergency department visits in a medical Centre in southern Taiwan. BMJ Open. 2017;7(11):e018628.
19. Jilani T, Housley G, Figueredo G, Tang PS, Hatton J, Shaw D. Short and Long term predictions of hospital emergency department attendances. Int J Med Inform. 2019;129:167–74.
20. Zhou L, Ping Z, Dongdong W, Cheng C, Hao H. Time series model for forecasting the number of new admission inpatients. Bmc Med Inform Decis Mak. 2018;18(1):39–49.
21. Zhu T, Luo L, Zhang X, Shi Y, Shen W. Time series approaches for forecasting the number of hospital daily discharged inpatients. IEEE J Biomed Health Inform. 2017;21:515–26.
22. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.
23. Gunčar G, Kukar M, Notar M, Brvar M, Černelč P, Notar M, Notar M. An application of machine learning to haematological diagnosis. Sci Rep. 2018;8(1):411.
24. Qiu H, Yu HY, Wang LY, Yao Q, Wu SN, Yin C, Fu B, Zhu XJ, Zhang YL, Xing Y, et al. Electronic health record driven prediction for gestational diabetes mellitus in early pregnancy. Sci Rep. 2017;7(1):16417.
25. Lim J, Kim J, Cheon S. A deep neural network-based method for early detection of osteoarthritis using statistical data. Int J Environ Res Public Health. 2019;16(7):1281.
26. Kassomenos P, Petrakis M, Sarigiannis D, Gotti A, Karakitsios S. Identifying the contribution of physical and chemical stressors to the daily number of hospital admissions implementing an artificial neural network model. Air Quality Atmosphere Health. 2011;4(3–4):263–72.
27. Shakerkhatibi M, Dianat I, Jafarabadi MA, Azak R, Kousha A. Air pollution and hospital admissions for cardiorespiratory diseases in Iran: artificial neural network versus conditional logistic regression. Int J Environ Sci Technol. 2015;12(11):3433–42.
28. Moustris KP, Larissi IK, Nastos PT, Paliatsos AG. Seven-days-ahead forecasting of childhood asthma admissions using artificial neural networks in Athens, Greece. Int J Environ Health Res. 2012;22(2):93–104.
29. Polezer G, Tadano YS, Siqueira HV, Godoi AFL, Yamamoto CI, de André PA, Pauliquevis T, MdF A, Oliveira A, PHN S. Assessing the impact of PM2.5 on respiratory disease using artificial neural networks. Environ Pollut. 2018;235:394–403.
30. Kassomenos P, Papaloukas C, Petrakis M, Karakitsios S. Assessment and prediction of short term hospital admissions: the case of Athens, Greece. Atmospheric Environ. 2008;42(30):7078–86.
31. Khatri KL, Tamil LS. Early detection of peak demand days of chronic respiratory diseases emergency department visits using artificial neural networks. IEEE J Biomed Health Inform. 2017;99:285–90.
32. Wu C-C, Yeh W-C, Hsu W-D, Islam MM, Nguyen PA, Poly TN, Wang Y-C, Yang H-C, Li Y-C. Prediction of fatty liver disease using machine learning algorithms. Comput Methods Prog Biomed. 2019;170:23–9.
33. Soyiri IN, Reidpath DD, Sarran C. Forecasting peak asthma admissions in London: an application of quantile regression models. Int J Biometeorol. 2013;57(4):569–78.
34. Qiu H, Zhu X, Wang L, Pan J, Pu X, Zeng X, Zhang L, Peng Z, Zhou L. Attributable risk of hospital admissions for overall and specific mental disorders due to particulate matter pollution: a time-series study in Chengdu, China. Environ Res. 2019;170:230–7.
35. Junninen H, Niska H, Tuppurainen K, Ruuskanen J, Kolehmainen M. Methods for imputation of missing values in air quality data sets. Atmos Environ. 2004;38(18):2895–907.
36. Qiu H, Tan K, Long F, Wang L, Yu H, Deng R, Long H, Zhang Y, Pan J. The Burden of COPD Morbidity Attributable to the Interaction between Ambient Air Pollution and Temperature in Chengdu, China. Int J Environ Res Public Health. 15(3):492.
37. Ma Y, Zhang H, Zhao Y, Zhou J, Yang S, Zheng X, Wang S. Short-term effects of air pollution on daily hospital admissions for cardiovascular diseases in western China. Environ Sci Pollut Res. 2017;24(16):14071–9.
38. Chen G, Zhang Y, Zhang W, Li S, Guo Y. Attributable risks of emergency hospital visits due to air pollutants in China: a multi-city study. Environ Pollut. 2017;228:43–9.
39. Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform. 2002;35(5–6):352–9.
40. Cortes C, Vapnik VN. Support vector networks. Mach Learn. 1995;20(3):273–97.
41. Marcel VG, Sander B. Editorial: Artificial Neural Networks as Models of Neural Information Processing. Front Computational Neurosci. 2017;11:114.
42. White H. Learning in artificial neural networks: a statistical perspective. Neural Comput. 2014;1(4):425–64.
43. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
44. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.
45. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.
46. Ke GL, Meng Q, Finley T, Wang TF, Chen W, Ma WD, Ye QW, Liu TY. LightGBM: a highly efficient gradient boosting decision tree. Adv Neur In. 2017;30:46–54.
47. Deng L, Pan J, Xu X, Yang W, Liu C, Liu H. PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine. BMC Bioinformatics. 2018;19:136–45.
48. Delong ER, Delong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
