
Bayesian Network Reasoning and Machine Learning With Multiple Data Features

This paper presents a Bayesian network model utilizing multi-featured data from 31 provinces in China to predict air quality and monitor air pollution risks. The model achieves a 90% accuracy rate in early warning and diagnosis of air pollution causes, outperforming other machine learning methods. It emphasizes the importance of effective forecasting and early warning systems to address the severe air pollution issues affecting public health and economic activities in China.


Natural Hazards (2021) 107:2555–2572

https://doi.org/10.1007/s11069-021-04504-3

ORIGINAL PAPER

Bayesian network reasoning and machine learning with multiple data features:
air pollution risk monitoring and early warning

Xiaoliang Xie¹ • Jinxia Zuo²,³ • Bingqi Xie²,³ • Thomas A. Dooling⁴ • Selvarajah Mohanarajah⁵

Received: 23 August 2020 / Accepted: 4 January 2021 / Published online: 18 January 2021
© The Author(s), under exclusive licence to Springer Nature B.V. part of Springer Nature 2021

Abstract
From a macro-perspective, and based on a data-driven machine learning approach, this paper
utilizes multi-featured data from 31 provinces and regions in China to build a Bayesian
network (BN) analysis model for predicting the air quality index and giving early warning of air
pollution risk at the city level. Further, a two-layer BN for analyzing influencing factors of various
air pollutants is developed. Subsequently, the model is applied to forecast the trends of
temporal and spatial changes in the form of probabilistic inference and to investigate the
degree of impact incurred from individual influencing factors. From the comparisons with
the results obtained from other machine learning approaches and algorithms such as neural
networks, it is concluded that by comprehensively using the established BN, one can not
only reach a monitoring and early warning accuracy rate of 90% but also scrutinize and
diagnose the main cause of air pollution risk changes from the perspective of probability.

Keywords AQI prediction · Air pollution risk · Bayesian network · Machine learning · Statistical analysis

✉ Jinxia Zuo

¹ College of Mathematics and Statistics, Hunan University of Technology and Business, Changsha 410205, China
² Institute of Big Data and Internet Innovation, Hunan University of Technology and Business, Changsha 410205, China
³ Key Laboratory of Hunan Province for Statistical Learning and Intelligent Computation, Hunan University of Technology and Business, Changsha 410205, China
⁴ Department of Chemistry and Physics, University of North Carolina at Pembroke, Pembroke, NC 28372, USA
⁵ Department of Mathematics and Computer Science, University of North Carolina at Pembroke, Pembroke, NC 28372, USA


1 Introduction

China's recent rapid development has been accompanied by a severe air pollution
problem, and the government and the public have gradually realized that poor air quality is
seriously threatening human health and economic activities. Global attention to air
pollution continues to intensify, and effective forecasting and early warning models
have become vitally necessary.
To a large extent, air pollution poses the biggest threat to people's health (Fan et al.
2019). From 2013 to 2016, China's average PM2.5 concentration (57.75 µg/m³) was five
times higher than the WHO's standard (10 µg/m³) and six times higher than that of the USA
(8.47 µg/m³) (Zhou et al. 2019). Air pollution causes more than 3 million deaths worldwide
each year (Lelieveld et al. 2015), and China alone accounts for 41.2% of these deaths.
From 1990 to 2010, the death rate (14.9%) caused by air pollution was nearly an order of
magnitude higher than that of road traffic (0.97%) and AIDS (1.92%) (Yang et al. 2013).
In addition, air pollution also brings huge economic losses. The loss caused by particulate
matter pollution in Shanghai in 2001 was approximately 625.4 million US dollars,
accounting for 1.03% of the city’s GDP (Chen et al. 2019). In 2007, the total economic loss
caused by shortened working hours in 30 provinces in China was 48.55 billion US dollars
(approximately 1.1% of GDP). It is roughly equivalent to Vietnam’s annual GDP in 2010.
Wu et al. (2020) estimated that the direct economic loss and the indirect medical cost for
haze disaster were, respectively, 3.16 billion US dollars and 1.18 billion US dollars for
Jiangsu Province in 2013.
Air pollution is caused by many factors, including atmospheric conditions, geographical
location, traffic conditions, and human activities. Therefore, air pollution is considered by
scholars such as Ghaemi (2018), to be a complex and nonlinear problem. The investigation
on the complexity of air quality issues has led to a variety of predictive models and
analytical methods.
In order to reduce the economic losses and health hazards caused by air pollutants,
many scholars have employed a variety of air quality prediction and early warning methods
in their research. The main prediction methods can be divided into two categories: methods
based on physical diffusion simulation and methods based on machine learning. The
former mainly predict air quality by simulating the diffusion of pollutants. For
example, Rakowska et al. (2014) established a prediction model by simulating the transmission
of air pollutants in the street, but this method is limited by its reliance on
empirical parameter assumptions (Egan et al. 2014; Wang et al. 2010).
The latter methods are based on machine learning, statistical learning, and big data-driven
methods. In recent years, many scholars have utilized big data, data mining, and
machine learning to build AQI prediction and air pollution risk analysis models for
prediction and for risk prevention and control (Zhu et al. 2017; Sun
et al. 2017; Xu et al. 2017).
Hosseini, Safari, Mazinani (2017) proposed a novel type 2 fuzzy inference system to
solve the uncertainty and imprecision in AQI. The prediction accuracy of this model can
reach 94%. Kang et al. (2017) used the BP neural network optimized by genetic simulated
annealing algorithm to predict the AQI to solve the problem that the neural network easily
falls into a local optimum. Fang et al. (2020) proposed an AQI prediction BP neural
network based on wavelet transform input.
Ghaemi et al. (2018) designed a spatiotemporal system using an online algorithm based
on support vector machine (SVM), which takes into account the dynamics of air quality
and high spatiotemporal variability to address the shortcomings of traditional support
vector machines and artificial neural networks. In addition, many scholars have incorporated
time series and space into the model to perform more spatiotemporal analysis (Li et al.
2017).
In China, artificial neural networks (ANN), fuzzy theoretical models, linear system
analysis, environmental quality measurement models, and statistical theory methods are
employed to predict air quality. Wang et al. (2019) used an isolation forest algorithm to
analyze outliers in the air and established an air quality early warning system. Zhao et al.
(2019) used historical air quality and meteorological monitoring data of Wuhan and applied
the firework algorithm to optimize the established model from different angles,
obtaining an AQI prediction model with good performance.
In addition to monitoring air pollution, in recent years, scholars have conducted studies
on factors affecting air quality. Zhou et al. (2019) used data envelopment analysis (DEA)
to analyze the factors that contribute to PM2.5 and determined the impact of meteorological
factors and human activities on pollutants. Yang et al. (2015) used the North China
Plain as an example to study the characteristics and formation mechanisms of continuous
haze in China and to study the influence of meteorological factors on various pollutants.
Chen et al. (2019) analyzed the relationship between PM2.5 in Beijing and the three
industries, showing that the tertiary industry had a suppressive effect on PM2.5 emissions.
However, a BN can also be used to realize AQI prediction and risk monitoring.
Machine learning based on BN is competitive when compared with other machine learning
approaches and algorithms, as it can graphically represent complex uncertain systems in a
simple and understandable way, combine data-driven structure learning with expert
experience, and use parameter learning results for risk
warning and decision-making assistance (Uusitalo 2007). Moreover, such networks
allow data to be combined with domain knowledge, making it easy to incorporate
knowledge of different accuracy and from different sources in a mathematically consistent
manner. Finally, data may not be collected in a timely manner because of delays or
failures in the detection system, and the adoption of BN provides a
natural way (e.g., the EM algorithm) to handle delayed and missing data.
Bayesian research includes two directions, Bayes' theorem and BN. BN is developed on
the basis of Bayes' theorem and has obvious advantages in displaying models and simplifying
computational complexity. Some scholars have exploited the Bayesian treatment of
uncertainty and its tolerance of incomplete data to conduct related research. Liu et al. (2008) utilized
Bayes' theorem to introduce uncertainty and parameter estimation in the modeling process
and established a Bayesian model for air quality prediction in Xiamen. Mcmillan et al.
(2005) applied Bayesian statistics to establish a hierarchical Bayesian
model for ozone spatial–temporal prediction and to determine the probability distribution of
ozone spatial–temporal change.
Zhu et al. (2016) proposed a Gaussian BN based on urban big data to establish the
spatiotemporal causality of air pollutants and to learn the parameters of that causality. This
model can be used to determine the causal relationships of spatiotemporal pollutants, find
the source of pollution, and understand the interactions between various
sources.
Existing research on air quality covers small areas, gives insufficient detail on
influencing factors, and/or neglects spatial spillover effects. In view of this, BN is used here to
analyze air quality at the level of provinces and municipalities.
In this paper, the data of 31 provinces and municipalities in China and the measurement
results of pollution incurred from other provinces (a total of 22 kinds of data) are used at a
macro-level to implement the Bayesian network's monitoring, uncertainty reasoning
diagnosis, and early warning functions, for comprehensively analyzing the influencing
factors of air pollution risk and determining the causes, sources, and changes in air quality
through early warning.
The structure of the rest of the paper is as follows. Section 2 contains the model
development and hypothesis setting, theory of BN, selection of indicators, and methods for
measuring the degree of pollution incurred from other provinces. In Sect. 3, a two-layer
analysis model of air quality forecast and air pollution risk early warning system based on a
simplified network is developed. In Sect. 4, the Netica software is used for parameter
learning of the constructed model, and the learning results are applied to make relevant
inferences and diagnoses. In Sect. 5, the test data set is used to verify the effectiveness of
the model, the evaluation and comparison of model prediction effects are provided, the
analysis results are summarized, policy opinions are proposed, and shortcomings of the
model are discussed.

2 Models and assumptions

2.1 Bayesian network theory

A Bayesian network is a combination of probability and graph theory. It is also known as a belief
network and is a graphical model that describes the dependency relationships between
variables; it is suitable for expressing and analyzing uncertain and probabilistic events. Its
advantages include the ability to make use of the structure of the problem in accordance
with the principles of probability theory, reducing the computational difficulty of
reasoning.
In general, a directed acyclic graph (DAG) reflects a set of conditionally independent
relationships between a set of variables (nodes), such as observable variables, hidden
variables, and unknown parameters, while the arcs between the nodes represent probabilistic
dependencies among the corresponding random variables.
If there is no arrow between the variables in the nodes, the random variables are said to
be conditionally independent of each other. Two nodes may be connected by a single
arrow; the node that points to the other node is a "parent" node of this other node, and this
other node is a "descendant or child" of the node. Such a pair of nodes will generate a
conditional probability value.
According to Friedman et al. (1997), a network is defined by a pair $B(G, \Theta)$, where
$G = (X, E)$ is a DAG, with $d$ discrete random variables denoted by $X = \{X_1, X_2, \ldots, X_d\}$.
The arcs represent the direct dependencies between these variables.
Each node $X_i$ takes $r_i$ possible values encoded as $x_{i1}, x_{i2}, \ldots, x_{ir_i}$. Associated with each
node $X_i$ is a conditional probability distribution, collectively represented by
$\Theta = (\theta_i)_{1 \le i \le d}$, which quantifies how much a node depends on its parents. The set of all
parent nodes of a node $X_i$ in $G$ is denoted by $Pa(X_i)$, where the parent nodes describe the
cause and the child node shows the effect. $Pa(X_i)$ has $q_i = \prod_{m \in \Omega_i} r_m$ possible configurations,
where $\Omega_i = \{m : X_m \in Pa(X_i)\}$ is the set of all subscripts $m$ such that $X_m$ is an
immediate parent of $X_i$.
According to Khakzad (2019), the conditional probabilities $\Theta$ are known as the network
parameters which can either be elicited from subject matter experts or be learned from
data. The conditional distribution of $X_i \mid Pa(X_i)$ is defined by the matrix of probabilities
$\theta_i = (\theta_{ijk})_{1 \le j \le q_i,\; 1 \le k \le r_i}$, where $\theta_{ijk}$ represents the probability of the $k$-th
$(k = 1, \ldots, r_i)$ state of $X_i$ given the $j$-th parent configuration $Pa(X_i) = j$, that is,
$\theta_{ijk} = P(X_i = k \mid Pa(X_i) = j)$. The joint probability
distribution of $(X_1, X_2, \ldots, X_d)$ is as follows:

$$P(X_1, X_2, \ldots, X_d) = \prod_{i=1}^{d} p(X_i \mid Pa(X_i)) = \prod_{i=1}^{d} \theta(X_i \mid Pa(X_i)) \tag{1}$$

The Bayesian inference was based on Bayes' theorem as follows:

$$P(Pa(X_i) \mid X_i) = \frac{P(Pa(X_i))\, P(X_i \mid Pa(X_i))}{P(X_i)} \tag{2}$$

Applied in this study, each influencing factor of air pollution risk is regarded as a parent
node or a cause, and the pollution risk is the child node as the result. Formula (2) can be
used to infer the probability of various states of the air pollution risk as the child node
under the condition of determining the status of the parent node. According to the obtained
probabilities of each state, the risk level is the state corresponding to the maximum
probability.
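As a minimal sketch of how Eqs. (1) and (2) can be used, the following Python snippet builds a toy network with a single parent node and the risk node as its child. The parent name ("winter" versus "other") and every probability in it are invented for illustration; they are assumptions and do not come from the data of this study.

```python
# Toy example: one parent node (cause) and one child node (air pollution risk).
# All states and numbers are illustrative assumptions, not values from the study.

# Prior over the parent states, P(Pa)
prior = {"winter": 0.25, "other": 0.75}

# Conditional probability table, theta = P(risk | Pa)
cpt = {
    "winter": {"low": 0.5, "mid": 0.3, "high": 0.2},
    "other": {"low": 0.8, "mid": 0.15, "high": 0.05},
}


def joint(parent, risk):
    """Eq. (1): the joint probability factorizes as P(Pa) * P(risk | Pa)."""
    return prior[parent] * cpt[parent][risk]


def marginal(risk):
    """P(risk), obtained by summing the joint over all parent states."""
    return sum(joint(p, risk) for p in prior)


def posterior(parent, risk):
    """Eq. (2): P(Pa | risk) = P(Pa) * P(risk | Pa) / P(risk)."""
    return joint(parent, risk) / marginal(risk)


def most_likely_risk(parent):
    """The reported risk level is the state with the maximum probability."""
    return max(cpt[parent], key=cpt[parent].get)
```

Under these assumed numbers, `posterior("winter", "high")` answers the diagnostic question (how likely the parent state is once a high risk has been observed), while `most_likely_risk("winter")` picks the risk state with the largest conditional probability.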

2.2 Index selection and processing

Among the many factors affecting air quality are human socioeconomic activities,
energy consumption, traffic emissions, and environmental awareness. At the same time, the
atmospheric conditions in different regions contribute to the formation of pollutants. Based on
previous research, we found that air quality has a spatial correlation. In order to
consider the spatial correlation of the AQI between regions, we also introduce a decision
variable, the degree of pollution incurred from other provinces.
Moreover, according to China's national regulations, the AQI is divided into six
levels (0–50 stands for grade 1 or excellent air quality, 51–100 for grade 2 or good air
quality, 101–150 for grade 3 or lightly polluted, 151–200 for grade 4 or
moderately polluted, 201–300 for grade 5 or heavily polluted, and above 300 for grade 6
or severely polluted). To better clarify the risk level of air pollution, this article divides the
air pollution risk into three grades: low risk if the level is 1 or 2, medium risk
if the level is 3 or 4, and high risk if the level is 5 or 6.
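The grading rule can be written as a short lookup. The sketch below is illustrative only: the function names are hypothetical, the grade 4 band is assumed to end at 200 (since grade 5 starts at 201), and the labels "low", "mid", and "high" follow the risk states listed in Table 2.

```python
def aqi_grade(aqi):
    """Map an AQI value to its national grade (1 to 6)."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    upper_bounds = [50, 100, 150, 200, 300]  # upper bound of grades 1 to 5
    for grade, bound in enumerate(upper_bounds, start=1):
        if aqi <= bound:
            return grade
    return 6  # anything above 300


def risk_level(aqi):
    """Collapse the six grades into the three risk levels used in this article."""
    grade = aqi_grade(aqi)
    if grade <= 2:
        return "low"
    if grade <= 4:
        return "mid"
    return "high"
```

For example, an AQI of 35 falls in grade 1 and is therefore "low" risk, while an AQI of 201 falls in grade 5 and is "high" risk.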
In addition, the location of each province is classified according to the traditional six
regional classification methods of the country, namely North China, Northeast China, East
China, Central South China, Southwest China, and Northwest China. At the same time,
time is also quantified, and the year is divided into four quarters. Therefore, the construction
of the index system in this article involves the cross-integration of multiple data.

2.3 Measurement and indicators of pollution incurred from other provinces

Studies show that air pollutants have a spatial spillover effect, meaning that air quality
pollution in a province may be affected by pollutants in other provinces. Therefore, in this
study, we consider the interaction of air conditions in other provinces as one of the
influencing factors. However, according to the existing research, the intensity of the pollution
has not been utilized to distinguish the level of impact. Therefore, we will use the
spatial correlation matrix and scatter plot to measure the impact on each province by the air
quality of other provinces.


According to the research ideas of spatial econometrics, it is necessary to test the spatial
correlation of the data before selecting a specific research method to determine whether the
selected dependent variable has spatial correlation and measure whether the air quality of
each province has a spatial spillover effect. The Moran index (Moran's I) is generally used as
the test statistic for the spatial correlation of data. Moran's I is divided into the global
Moran's I and the local Moran's I; the calculation formulas are as follows (Wu et al. 2019a, b):

$$I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{S_0 \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3}$$

$$I_i = \frac{(x_i - \bar{x})}{S^2} \sum_{j=1}^{n} w_{ij} (x_j - \bar{x}) \tag{4}$$

where $S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$, $S^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, $x_i$ denotes the AQI
value for the $i$-th province, $n$ is the number of provinces, and $w_{ij}$ is the spatial weight between
province $i$ and province $j$. The setting principle of element $w_{ij}$ in $W$ is:

$$w_{ij} = \begin{cases} 0, & d_{ij} > D \\ 1, & d_{ij} \le D \end{cases} \tag{5}$$

where $D$ is the value calculated by the software GeoDa, and $i$ and $j$ represent two regions or
provinces.
When the global Moran's I in Eq. (3) is used, the result reflects the correlation value
between all provinces at the same time. In contrast, in the calculation of the local Moran's I, a
numerical result will be given for each row. This value indicates the degree of association
between a province and its neighboring provinces. In this paper, in the process of measuring
the degree of pollution incurred from other provinces, each row needs to be given a
measurement result of being affected by its neighboring provinces; thus the local Moran's I is
appropriate.
The main steps of the measurement are as follows. The first step is to calculate the spatial
correlation matrix W with GeoDa. The second step is to use W to calculate the local Moran's I and draw
the Moran scatter plot. The third step is to use the Moran scatter plot to measure the degree of
influence by other provinces.
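The second step can be illustrated with a small pure-Python sketch of the global and local Moran's I. It uses the standard form of the local statistic, with the neighbour deviation (x_j minus the mean) inside the sum. The four AQI values and the binary weight matrix are hypothetical; in the study itself W is produced by GeoDa, which is not reproduced here.

```python
def global_morans_i(x, w):
    """Global Moran's I for values x and spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    s0 = sum(sum(row) for row in w)  # sum of all spatial weights
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return n * cross / (s0 * sum(d * d for d in dev))


def local_morans_i(x, w):
    """Local Moran's I: one value per province (row of w)."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    s2 = sum(d * d for d in dev) / n  # variance with divisor n
    return [
        dev[i] / s2 * sum(w[i][j] * dev[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical example: four provinces in a row, adjacent ones are neighbours.
aqi = [120.0, 110.0, 60.0, 50.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

global_i = global_morans_i(aqi, weights)
local_i = local_morans_i(aqi, weights)
```

With these definitions the global statistic equals the sum of the local values divided by the total weight, which is a convenient consistency check on any implementation.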
According to the calculated Moran's I, when Moran's I > 0 (resp. I < 0), it indicates that the
observations between regions have a positive (resp. negative) spatial correlation; when
Moran's I = 0, it indicates that the observations between regions are independent of each
other (Wu et al. 2019a, b). The new measurement concept proposed in this paper using
Moran's I and scatter plots is shown in the following table.

Table 1 Index measurement results

| Quadrant | Meaning | New measurement concept (degree of influence by other provinces) |
| --- | --- | --- |
| First quadrant | HH (high–high) | 4 |
| Second quadrant | LH (low–high) | 3 |
| Third quadrant | LL (low–low) | 1 |
| Fourth quadrant | HL (high–low) | 2 |


If province i appears in the first quadrant of the scatter chart, it is a high–high area,
indicating that the air pollution in area i is affected by other areas at level 4 (4 representing
the highest degree of influence, and 1 representing the lowest degree of influence)
(Table 1). According to this principle, we can complete the quantification of the degree of
pollution incurred from other provinces and use specific numerical values to measure the
degree of air pollution in province i affected by other provinces.
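The quantification rule of Table 1 can be sketched as a quadrant lookup. The function names are hypothetical, and the quadrant is determined in the usual Moran scatter plot sense: the horizontal coordinate is the province's own deviation from the mean and the vertical coordinate is the weighted sum of its neighbours' deviations. How points lying exactly on an axis are treated is an assumption of this sketch.

```python
# Degree of influence by other provinces for each Moran scatter plot quadrant (Table 1).
INFLUENCE_BY_QUADRANT = {"HH": 4, "LH": 3, "LL": 1, "HL": 2}


def quadrant(z, lag):
    """Classify a point of the Moran scatter plot.

    z   -- deviation of the province's own AQI from the mean
    lag -- weighted sum of its neighbours' deviations (spatial lag)
    Points lying exactly on an axis are assigned to the "low" side here,
    which is an arbitrary choice made for this sketch.
    """
    own = "H" if z > 0 else "L"
    neighbours = "H" if lag > 0 else "L"
    return own + neighbours


def influence_degree(z, lag):
    """Return the 1 to 4 degree of pollution incurred from other provinces."""
    return INFLUENCE_BY_QUADRANT[quadrant(z, lag)]
```

A province with above-average AQI surrounded by above-average neighbours, for instance `influence_degree(1.2, 0.8)`, lands in the high–high quadrant and receives level 4.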

3 Model construction

3.1 Data sources

In this study, various provinces and municipalities in mainland China are taken as the
research objects. The data on the concentration of various pollutants came from the
monthly data provided by the website of China Air Quality Online Monitoring and
Analysis Platform, released by the national environmental protection department, including
the measured monthly PM2.5, PM10, SO2, CO, NO2, and O3. Meteorological data use
monthly data provided by the China Statistical Yearbook, including four meteorological
factors, monthly average temperature, monthly precipitation, monthly average relative
humidity, and monthly average sunshine hours in each province from December 2013 to
December 2018.
Socioeconomic data were collected from the statistical yearbooks of each province from
2014 to 2019 and from the China Statistical Yearbook that contains data for the 31
provincial and autonomous regions. It contains eight categories of data to quantify human
activities, including urbanization level, population density, urban area, number of civilian
cars, public transportation, total coal consumption, urban green area, and proportion of
tertiary industry. The data are summarized in Table 2.
Bayesian network learning is divided into structure learning and parameter learning, in which
structure learning clarifies the relationship between various indicators, and parameter
learning learns the probability distribution of states in the network based on sample data. In
order to build an effective air pollution monitoring model, this study uses continuous data
to carry out the structure learning of BN.
In this study, the BN is built with two layers to realize the monitoring and early warning
simultaneously. The first layer determines the BN between six pollutants (Sect. 3.2),
meteorological factors, spatiotemporal factors, and the measurement results of pollution
incurred from other provinces. The second layer of the network establishes a BN for the
selected six pollutants related to human behavior, and through the sensitivity analysis of
the network, it can provide early warning of human behaviors that contribute to the
changes of various pollutants.
On the other hand, in the parameter learning process of the BN, in order to realize the
diagnostic reasoning function and graphing of BN, discrete data are needed to learn
the parameters of the BN, so that the result of the parameter learning is a finite state matrix.
This paper used R language to discretize part of the data, and the rest was discretized
according to existing regulations. The data after discretization and grouping correspond to
the risk status of each indicator; the larger the value, the higher the risk probability.
Therefore, the purpose of the model proposed in this paper is to realize the prediction
function of air quality when using continuous data corresponding to each index and use
probability to determine the air pollution risk level after discrete quantification of the
index. The results of the discrete quantization of the data are shown in Table 2.


Table 2 Introduction of indicators and situation after discretization

| Node | Type | Details | Description | Number of discrete levels |
| --- | --- | --- | --- | --- |
| Parent nodes (cause) | Pollutants | PM2.5 | Particles less than 2.5 microns in diameter can enter the alveoli directly (µg/m³) | 6 |
| | | PM10 | Diameter less than 10 microns, long duration (µg/m³) | 5 |
| | | SO2 | The most common sulfur oxides are toxic | 4 |
| | | CO | Mainly from coal combustion (µg/m³) | 4 |
| | | NO2 | Mainly from motor vehicle exhaust | 5 |
| | | O3 | Mainly from the petrochemical industry | 5 |
| | Meteorological environment | Average temperature (A) | The arithmetic mean of the temperature value of each observation | 3 |
| | | Precipitation (B) | Depth of precipitation in horizontal plane | 3 |
| | | Relative humidity (C) | Moisture content in the air | 3 |
| | | Sunshine hours (D) | Sun exposure hours (h) | 4 |
| | Regions (E) | North, northeast, east, central south, southwest, northwest in China | Six administrative regions in China | 6 |
| | Quarters (F) | 1,2,3,4 quarter | Divide a year into four quarters | 4 |
| | Space spillover effect (G) | The degree of pollution incurred from other provinces | The influence of the spatial position relationship between the regions on their interconnection | 4 |
| | Human activities | Urbanization level (H) | Ratio of total urban population to total regional population | 4 |
| | | Population density (I) | Number of population per unit of land area (person/square kilometer) | 5 |
| | | Urban area (J) | Form a certain population and building area (hectare) | 4 |
| | | Number of civilian cars (K) | Number of civilian cars (vehicles) | 4 |
| | | Number of public transportation (L) | Number of public transportation (vehicles) | 5 |
| | | Total coal consumption (M) | Total coal energy consumption (10,000 tons of standard coal) | 5 |
| | | Urban green area (N) | Urban green space and garden area | 4 |
| | | Tertiary industry as a percentage of GDP (O) | Percentage of GDP in the tertiary industry | 6 |
| Child node (effect) | AQI | Air pollution risk | | "low," "mid," "high" |
| | | Air quality index | | Continuous |


Table 3 Correlation coefficients of pollutant nodes

| | PM2.5 | PM10 | SO2 | CO | NO2 | O3 | AQI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PM2.5 | 1.00 | 0.89 | 0.58 | 0.73 | 0.75 | -0.45 | 0.93 |
| PM10 | 0.89 | 1.00 | 0.61 | 0.73 | 0.73 | -0.34 | 0.89 |
| SO2 | 0.58 | 0.61 | 1.00 | 0.60 | 0.45 | -0.38 | 0.55 |
| CO | 0.73 | 0.73 | 0.60 | 1.00 | 0.66 | -0.50 | 0.69 |
| NO2 | 0.75 | 0.73 | 0.45 | 0.66 | 1.00 | -0.37 | 0.73 |
| O3 | -0.45 | -0.34 | -0.38 | -0.50 | -0.37 | 1.00 | -0.20 |
| AQI | 0.93 | 0.89 | 0.55 | 0.69 | 0.73 | -0.20 | 1.00 |

3.2 Bayesian network

Structural learning of BN is mainly divided into two categories, learning based on con-
straint algorithms and/or based on scoring functions. Our approach will combine both of
them for structure learning. The first step will use the relationship between nodes to create
preliminary relationship constraints and determine the preliminary structure, so as to
simplify the learning results. The second step is to apply the hill-climbing (hc) algorithm
for further learning.

3.2.1 Preliminary structure learning

Correlation analysis is performed for each parent node and its child node (AQI) in this
paper. The results of the correlation analysis are shown in Table 3, which gives the
correlation coefficients between some variables. An absolute value
greater than 0.4 was selected as the condition for the initial establishment of the BN.
The threshold of 0.4 was chosen based on Scutari's study (2017) and
multiple experiments; it yields a reasonably sparse BN. For
example, the correlation coefficient between PM2.5 and PM10 is 0.89, which indicates that
there is an obvious dependency between these two indicators, so an edge is added between
the two nodes. Based on this determination method, this paper determined the preliminary
undirected BN structure diagram.

Fig. 1 The first layer of Bayesian network

Fig. 2 The integrated two-layer Bayesian network
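The thresholding rule can be sketched in a few lines of Python using the coefficients of Table 3. The function name and data layout are hypothetical, and the result is only the preliminary undirected edge list among the pollutant nodes and AQI, not the structure learned in the study.

```python
from itertools import combinations

NODES = ["PM2.5", "PM10", "SO2", "CO", "NO2", "O3", "AQI"]

# Upper triangle of Table 3 (the matrix is symmetric).
CORR = {
    ("PM2.5", "PM10"): 0.89, ("PM2.5", "SO2"): 0.58, ("PM2.5", "CO"): 0.73,
    ("PM2.5", "NO2"): 0.75, ("PM2.5", "O3"): -0.45, ("PM2.5", "AQI"): 0.93,
    ("PM10", "SO2"): 0.61, ("PM10", "CO"): 0.73, ("PM10", "NO2"): 0.73,
    ("PM10", "O3"): -0.34, ("PM10", "AQI"): 0.89,
    ("SO2", "CO"): 0.60, ("SO2", "NO2"): 0.45, ("SO2", "O3"): -0.38,
    ("SO2", "AQI"): 0.55,
    ("CO", "NO2"): 0.66, ("CO", "O3"): -0.50, ("CO", "AQI"): 0.69,
    ("NO2", "O3"): -0.37, ("NO2", "AQI"): 0.73,
    ("O3", "AQI"): -0.20,
}


def preliminary_edges(corr, threshold=0.4):
    """Return the undirected edges whose absolute correlation exceeds the threshold."""
    return [pair for pair in combinations(NODES, 2) if abs(corr[pair]) > threshold]


edges = preliminary_edges(CORR)
```

With the 0.4 threshold, the pair PM2.5 and PM10 (0.89) is kept, whereas O3 and AQI (-0.20) is dropped.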

3.2.2 Structure learning based on hill-climbing algorithm

The second step is to use the hc algorithm to learn from the training data set. In order to reduce the
uncertainty of the model, the data were divided into six equal parts, and BN structure training
was performed on each part. Six different BN graphs were obtained by
using the bnlearn package in R. In the final BN graph, the dependency
relationships between nodes were determined by the edges that appear in more than
50% of the six BN graphs, where 50% is set based on the degree of expected
simplification for the average BN model. According to Scutari's documents (2013), the BN
obtained by the above data processing and structure sparsification method is called the average
BN model and has a certain recognition function for noisy data.

Fig. 3 Trend of probability value above low risk in each region
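The edge-voting step of the average BN model can be illustrated as follows. This is a plain-Python sketch of the voting rule only: it does not run hill climbing and does not call bnlearn, and the six edge lists are hypothetical placeholders for the structures learned from the six data parts.

```python
from collections import Counter


def average_structure(graphs, threshold=0.5):
    """Keep the directed edges that occur in more than `threshold` of the graphs."""
    counts = Counter(edge for graph in graphs for edge in set(graph))
    return {edge for edge, c in counts.items() if c / len(graphs) > threshold}


# Six hypothetical learned structures, each a list of (parent, child) edges.
learned = [
    [("PM2.5", "AQI"), ("PM10", "AQI"), ("CO", "PM2.5")],
    [("PM2.5", "AQI"), ("PM10", "AQI")],
    [("PM2.5", "AQI"), ("CO", "PM2.5")],
    [("PM2.5", "AQI"), ("PM10", "AQI"), ("CO", "PM2.5")],
    [("PM2.5", "AQI"), ("NO2", "AQI")],
    [("PM2.5", "AQI"), ("PM10", "AQI")],
]

averaged = average_structure(learned)
```

In this made-up example the edge from CO to PM2.5 appears in exactly three of six graphs, so it does not exceed the 50% threshold and is dropped, which shows how the rule sparsifies the final graph.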
The application of this method can be used to simplify the dependency relationship
between various indicators, yet still retain the purpose of using the BN and ensure that it
can play a certain analytical role. The structure of the first layer of BN obtained by using R
is shown in Fig. 1. The integrated two-layer BN is shown in Fig. 2, where the relationship
between the nodes of the first layer is shown in detail in Fig. 1, and the nodes of the second
layer are indicators related to human activities. The specific meaning of each symbol is
described in Table 2 (e.g., H stands for urbanization level).

4 Results and analysis

This section conducts the parameter learning using the built BN structure. The purpose of
BN parameter learning is mainly to learn the probability of each node of the constructed
BN structure graph, that is, to update the original prior distribution of the variables when
given a specific state of a node. Abundant information and results can be obtained
from the learning results of the BN structure.
In order to quantitatively study the relationship between air pollution risk and multiple
influencing factors, further clarify the cause of pollution, and facilitate effective control
of the source of pollution, we conduct four experiments using probabilities
from the BN: (1) the temporal and spatial changes of air pollution risk; (2) the relationship
between air pollution risk and spatial spillover effects; (3) the key pollution sources for air
pollution risk early warning; and (4) the sensitivity analysis between six types of pollutants
and human activities.
The four experiments mainly focus on the analysis of air pollution risk and its influencing
factors. The first three experiments select some representative and controversial
factors in air pollution risk to analyze and summarize the results of BN parameter learning.
We analyzed the degree of influence of different influencing factors on the change of air
pollution risk and further investigated the reasons for the change of risk.

Fig. 4 Trend of probability value above low risk in each quarter


Experiment 4 builds on the analysis of the first three experiments, using a two-layer
BN to further investigate risk sources. The sensitivity analysis is conducted to determine the likelihood
of related human activities that cause the changes in pollutant concentration, and the
analysis results can help relevant departments to implement more stringent air pollution
risk management and control from the level of human activities.

4.1 Trend analysis of spatiotemporal changes

First, we use the first layer of the BN structure graph obtained in Sect. 3.2.2 as the
structural basis for Netica to draw the network reasoning diagram. After all the data are
discretized using R, they are imported into the software for parameter learning.
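The discretization step can be sketched as follows. The paper performs it in R and does not list its cut points here, so this Python sketch is only an illustration: the breakpoints are an assumption based on the six standard AQI categories used in China, and the function name `aqi_level` and the sample values are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of AQI levels 1-5; values above 300 fall in level 6.
# Assumption: standard Chinese AQI categories, not taken from the paper.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]


def aqi_level(aqi: float) -> int:
    """Map a continuous AQI value to a discrete level from 1 (best) to 6 (worst)."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left counts how many upper bounds lie strictly below the value.
    return bisect_left(AQI_UPPER_BOUNDS, aqi) + 1


# Hypothetical monthly average AQI values for one province.
monthly_aqi = [42, 50, 87, 160, 305]
levels = [aqi_level(value) for value in monthly_aqi]
print(levels)
```

Each continuous value becomes one of a small number of states, which is the form a discrete BN needs for its conditional probability tables.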
The parameter learning result in this section is the conditional probability table of low
pollution risk given the state of a spatial or temporal node. It quantifies the strength of
the dependence between the spatiotemporal dimensions and pollution risk and allows the
air quality status of the provinces across the country to be analyzed from a
macro-perspective.
Netica is used to learn the posterior probability: when the state of the spatial
node is given, the BN parameter learning formula $P(\text{Air pollution risk} = \text{"low"} \mid \text{loc} = 1)$
can be used to obtain the probability value corresponding to area 1 (Fig. 3). Similarly, the
respective probability values for $\text{loc} = 2, 3, 4, 5, 6$ can be obtained.
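A minimal sketch of how such a conditional probability can be estimated from discretized records is given below. It uses simple relative frequencies; Netica's own learning procedure may differ (for example, in how it handles priors or missing values). The records and the function name `conditional_probability` are hypothetical.

```python
from collections import Counter

# Hypothetical discretized records: (region code loc, risk level).
records = [
    (1, "low"), (1, "low"), (1, "high"), (1, "middle"),
    (2, "low"), (2, "low"), (2, "low"), (2, "middle"),
]


def conditional_probability(records, loc, risk):
    """Estimate P(risk | loc) as a relative frequency over the records."""
    loc_counts = Counter(region for region, _ in records)
    joint_counts = Counter(records)
    if loc_counts[loc] == 0:
        raise ValueError(f"no records for loc={loc}")
    return joint_counts[(loc, risk)] / loc_counts[loc]


p_low_loc1 = conditional_probability(records, loc=1, risk="low")
p_low_loc2 = conditional_probability(records, loc=2, risk="low")
print(p_low_loc1, p_low_loc2)
```

Repeating the query for each region code gives one probability per area, which is the kind of quantity plotted per region in Fig. 3.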
It can be seen from Fig. 3 that air pollution in North China, Northeast China, and
Northwest China is worse than in the rest of the country in each year. From 2014 to 2018,
the probability of low risk in each region rose overall, which means that air quality across
the country improved and was brought under better control over this period. In North
China, the probability of low risk increased by 9.6% from 2014 to 2018.
Sheng et al. (2019) pointed out, from the perspective of energy conservation and
emission reduction and environmental governance efficiency, that the efficiency of envi-
ronmental governance gradually decreased from the southeast to the north to the southwest
in China. The air quality in the northern region is lower than the national average and is
related to the governance efficiency in the northern region.
When the state of the temporal node is given, Netica is used to learn the
posterior probability distribution of low risk, as shown in Fig. 4, which illustrates the trend
of the probability value above the low-risk condition in each quarter. On the whole, pollution
risk was lower in the second and third quarters than in the first and fourth quarters. This is
because the colder season generally falls in January and December, which is wintertime
in China. The low temperature and harsh meteorological environment prevent
the diffusion and dilution of pollutants. In winter, due to the large temperature difference

Table 4 Results of diagnostic reasoning

Risk level   Low risk          Middle risk       High risk
G            1        2        3        4        5        6
1            0.711    0.279    0.164    0.054    0.076    0.142
2            0.0248   0.0593   0.408    0.760    0.128    0.297
3            0.13     0.366    0.131    0.06     0.188    0.239
4            0.134    0.296    0.297    0.126    0.608    0.322


Table 5 Probability change of each pollutant under extreme conditions

Pollutant   Probability value of the best case (%)     Probability value of the worst case (%)
            Low risk   High risk   Changes             Low risk   High risk   Changes
PM2.5       75.3       14.6        ↓60.7               3.15       37.2        ↑34.05
PM10        91.7       19          ↓72.7               1.73       24.1        ↑22.37
CO          44.7       12.3        ↓32.4               1.85       25.3        ↑23.45
SO2         34         5.11        ↓28.89              4.72       47.5        ↑42.78
NO2         70.6       12.8        ↓57.8               0.85       13.7        ↑12.85
O3          2.9        27.8        ↑24.9               3.14       29.2        ↑26.06

between day and night and higher relative humidity, pollutants in the air are prone to
oversaturation, resulting in excessive concentrations of pollutants.

4.2 Impact analysis of space spillover effect

In the parameter learning process, the relationship between the influence of other
provinces and air pollution risk is selected as an example of diagnostic reasoning.
According to the index measuring pollution incurred from other provinces (Sect. 2.3), the
degree to which province i is influenced at each time point is quantified as 1, 2, 3, or 4.
The parameter learning results, shown in Table 4, are obtained using the following
formula:
 
$$P(G = 1/\dots/4 \mid \text{risk level} = \text{"low"}/\text{"middle"}/\text{"high"}) \qquad (6)$$

where G is the degree of pollution incurred from other provinces, which takes one of four
states (1–4).
Table 4 shows the diagnostic function of the BN: when the state of the target node (air
pollution risk) is known, the probability that each province is affected by the other pro-
vinces at different levels can be obtained. Table 4 shows that when the air pollution risk is
low, the impact of other provinces is relatively small; but when the air pollution risk is
high, i.e., when the AQI level is 6 or higher, the impact of other provinces is relatively
moderate, indicating space spillover effects can have some impact on air quality but they
are not the root cause.
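The diagnostic reading of Table 4 can be checked with a short sketch. It assumes that each of the six columns of Table 4 is a conditional distribution over G for one level (1 to 6), which is an interpretation of the table layout rather than something the text states explicitly; the helper names `column` and `most_likely_g` are hypothetical.

```python
# Table 4 values: rows are G = 1..4, columns are levels 1..6.
# Assumption: each column is P(G = g | level).
TABLE4 = {
    1: [0.711, 0.279, 0.164, 0.054, 0.076, 0.142],
    2: [0.0248, 0.0593, 0.408, 0.760, 0.128, 0.297],
    3: [0.13, 0.366, 0.131, 0.06, 0.188, 0.239],
    4: [0.134, 0.296, 0.297, 0.126, 0.608, 0.322],
}


def column(level):
    """Return the distribution over G for one level as {g: probability}."""
    return {g: row[level - 1] for g, row in TABLE4.items()}


def most_likely_g(level):
    """Return the degree of outside influence with the highest probability."""
    col = column(level)
    return max(col, key=col.get)


for level in range(1, 7):
    total = sum(column(level).values())
    # Each column should sum to 1 up to rounding in the published table.
    assert abs(total - 1.0) < 0.005
    print(level, most_likely_g(level), round(total, 4))
```

Under this reading, every column sums to 1 within rounding, and the weakest degree of outside influence (G = 1, probability 0.711) dominates at level 1, in line with the statement that the impact of other provinces is small when risk is low.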
According to the measurement results, provinces such as Guizhou and Guangxi
have consistently appeared at the low-risk level and are less affected by other provinces.
The BN results coincide with the measurement results: these provinces are generally less
affected by other provinces and often have good air quality. They generally belong to
less developed urban agglomerations, where poor economic conditions and low levels of
urbanization have not caused much damage to air quality.


4.3 Diagnostic reasoning of air pollution influencing factors

This section analyzes the causality between the six pollutants and air pollution risk in detail
and uses the BN to analyze the main influencing factors that cause risk changes.
According to the results of BN parameter learning, from 2014 to 2018 there
is a 72.1% probability that the risk is at a low level, and the remaining levels account for
27.9%. In general, the risk of air pollution is under control, and air quality was
extremely poor only for a small part of the time.
This experiment investigates how the probability of each of the six pollutants being in
its best or worst state changes as the air pollution risk moves between the low and high
levels. The greater the change in probability, the more likely the pollutant is to be a
main source of pollution risk. The probability changes of the extreme states of the
pollutants under low risk and high risk are shown in Table 5.
Among the extreme cases of the pollutants, SO2 and PM2.5 show the largest increases in
the worst case, 42.78 and 34.05 percentage points, respectively, so changes in these two
factors have a major impact on air pollution risk. For O3, the probabilities of both extreme
cases increase, indicating that the relationship between changes in O3 and changes in air
pollution risk is weaker.
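The "Changes" column for the worst case in Table 5 can be reproduced and ranked with a few lines. The values are copied from Table 5; the variable and function names are illustrative only.

```python
# Table 5, worst-case columns: (probability under low risk, under high risk), in percent.
WORST_CASE = {
    "PM2.5": (3.15, 37.2),
    "PM10": (1.73, 24.1),
    "CO": (1.85, 25.3),
    "SO2": (4.72, 47.5),
    "NO2": (0.85, 13.7),
    "O3": (3.14, 29.2),
}


def probability_change(low, high):
    """Change in probability (percentage points) from low risk to high risk."""
    return round(high - low, 2)


changes = {name: probability_change(low, high) for name, (low, high) in WORST_CASE.items()}
ranking = sorted(changes, key=changes.get, reverse=True)
print(changes)
print(ranking)
```

SO2 and PM2.5 come out on top, matching the text. O3 ranks third on this measure alone, but its best-case probability also rises from low to high risk (2.9 to 27.8), which is why the text treats its link to risk as weaker.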

4.4 Analysis of influencing factors of pollutants

When the pollutants affecting the air quality are identified, the second-layer BN is used to
analyze the source of the pollution. Particularly, by analyzing the impact of human
activities on changes in pollutant concentrations, the study can provide early warning of
related human activities.
Based on the second layer of the established BN, a sensitivity analysis of the six
pollutants was carried out to assess how strongly each of the eight human activities
influences changes in pollutant concentration. The sensitivity analysis takes the six
pollutants as the target objects, and when the pollution level of each
Fig. 5 Comparison of predicted values and actual values


Table 6 Performance comparison of multiple prediction methods

Methods RMSE MAPE Accuracy Is it possible to illustrate
hc 30.2 0.2827 66.5% Yes
Improved hc 8.1749 0.0759 90% Yes
LASVM 6.54 71% No
BP 42.3130 63.34% No

target object changes, the change in the probability of the corresponding influencing factors
is clarified. The greater the possibility of change, the higher the sensitivity of the node, the
greater the influence of the factor index on the target object, and the more serious the
consequences.
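The idea of this sensitivity analysis can be sketched as follows. The paper does not give the exact formula, and Netica's sensitivity report may use a different measure, so the score below (the spread of a factor's probability across the target's states) is only one simple way to express "how much the probability changes". All numbers and names are hypothetical.

```python
# Hypothetical posteriors P(factor = "high" | pollutant level) for two factors.
POSTERIORS = {
    "tertiary_industry_share": {"low": 0.20, "middle": 0.35, "high": 0.55},
    "urban_area": {"low": 0.30, "middle": 0.32, "high": 0.36},
}


def sensitivity(posterior_by_level):
    """Spread of the factor's probability across the target's states."""
    values = list(posterior_by_level.values())
    return max(values) - min(values)


scores = {name: round(sensitivity(p), 2) for name, p in POSTERIORS.items()}
most_sensitive = max(scores, key=scores.get)
print(scores, most_sensitive)
```

A factor whose probability barely moves when the pollutant level changes gets a low score, while a factor whose probability shifts strongly is flagged as the more sensitive node.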
The sensitivity analysis results show that, among these human activities, the proportion
of tertiary industry in GDP has the greatest impact on pollutant concentration, accounting
for 20.9%. For all five pollutants other than SO2, the tertiary industry's share of GDP
is the most sensitive factor. The proportion of tertiary industry in GDP is therefore
a key factor in the change of pollutant concentration.
The research results show that the urbanization level and population density have the
greatest impact on NO2, the urban area has the largest impact on O3, the number of civilian
cars and public transportation has the greatest impact on SO2 and NO2 (nitrogen oxides),
and total coal consumption is the most important factor in increasing PM2.5.

5 Verification and conclusion

5.1 Verification and comparison

The BN-based air quality monitoring model constructed in this paper is applied to the
remaining undiscretized data (the test data set) to test its predictive ability, and the
discretized data are used to test its risk early warning capability.
The comparison of predicted values and actual values is shown in Fig. 5. The "difference"
line is the result of subtracting the predicted value from the actual value.
For each record in the test data set, we import all data except the AQI into the model and
then predict the risk level. Of the 388 records, 350 are
consistent with the actual classification from the measurement model established in
Sect. 3.2, implying that the model achieves about 90% accuracy.
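The accuracy figure follows directly from the two counts reported above:

```latex
\text{accuracy} = \frac{\text{correctly classified records}}{\text{total records}} = \frac{350}{388} \approx 0.902
```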
To validate the established BN for air quality measurement
and early warning, it is compared with results reported by other authors during the past
5 years. For convenience, the algorithm underlying the model constructed in this paper is
called the improved hc.
In Table 6, the part using LASVM refers to the experimental result of Ghaemi et al.
(2018), and the part of BP algorithm refers to the experimental result of Fang et al. (2020).
From the comparison of multiple prediction methods shown in Table 6, the model
proposed in this paper shows a better prediction capability and illustration ability. Com-
pared with the improved hc, the accuracy of the hc without simplification is worse, which
demonstrates that the adoption of the averaged BN model has a certain effect in simpli-
fying the BN.
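For reference, the two error measures in Table 6 can be computed as sketched below. The paper does not spell out its definitions, so the standard forms are assumed here, with MAPE expressed as a fraction to match the scale of the values in Table 6. The sample data are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted AQI values.
actual = [100.0, 80.0, 120.0]
predicted = [110.0, 76.0, 114.0]
print(rmse(actual, predicted), mape(actual, predicted))
```

Lower values of both measures indicate predictions closer to the actual series, which is the sense in which the improved hc row is compared with the plain hc row.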


For the air quality prediction and risk early warning model developed in this study, the
BN offers probabilistic and graphical functions that neural networks do not possess. It
reduces the complexity of the air quality prediction and risk early warning model and
expresses the dependency relationships between the input variables in the form of
probabilities, while retaining a relatively high predictive ability.
On the other hand, from the perspective of prediction alone, some methods reach higher
accuracy. The genetic simulated annealing algorithm can achieve 98.92% accuracy, and the
BP neural network model based on the wavelet transform can achieve 95.6% prediction
accuracy. Wu et al. (2019a, b) considered the
dynamic characteristics of the AQI sequence and proposed the CEEMD-Elman model
based on complementary ensemble empirical mode decomposition (CEEMD) to make
predictions, with an accuracy rate of 94.12%.

5.2 Conclusion and discussion

In this study, a monitoring model for air quality and a risk warning system have been
developed and validated. By analyzing the BN structure chart established in this paper and
using the inference and diagnosis advantages of the BN, the following conclusions can
be drawn to contribute to air pollution risk monitoring and early warning in various
cities.
(1) PM2.5 and SO2 are among the most critical factors for the deterioration of air
pollution risk. The government should implement corresponding measures to monitor
abnormal discharges of these two pollutants in a timely manner and determine whether
the related human activities have clearly increased.
(2) The spatial spillover effect exists in various air pollutants, and data also show that
there is a significant spatial correlation in the air quality index. The provincial and
municipal associations should cooperate with other provinces to jointly monitor and
control air quality. The formation of urban cooperation groups that resist air quality
deterioration will play a certain role in the effective control of air quality.
(3) High air pollution risk is more likely to occur in winter, when the temperature is
relatively low. In winter, emissions from pollution sources should therefore be
controlled more strictly. In the regional comparison of air quality, North China and
Northeast China are the bottleneck areas for national air quality
improvement.
By employing the BN, this study aims to predict air quality and build a risk
warning system. The model not only guarantees a certain prediction
accuracy but also uses the BN parameter learning process to make inferences and
diagnoses. The findings are of great significance in helping the government or related
departments to identify the source of the problem and implement corresponding effective
interventions to control the air quality and reduce the impact of bad air quality on human
health.
However, the model is built mainly from monthly data, so it can only measure
and warn of air quality from a macro-perspective. A disadvantage of the model is that
it cannot be updated dynamically every hour.
Further research directions: (1) In this study, probabilistic inference is used for
reasoning and measuring uncertainty. This is not suitable in some cases, because
probabilities are additive whereas the information in many applications is not. In such
cases, reasoning can be performed by adopting Dempster–Shafer's evidence theory


and Choquet capacity for uncertainty reasoning. (2) The factors considered in this paper
and the model established are fixed, but the air quality changes with time and space, and
each moment and each state is dynamic. Hence, it is possible to implement an adaptive
dynamic change model.

Acknowledgements This work was supported by Projects of the National Social Science Foundation of
China (No: 19BTJ011).

Compliance with ethical standards

Conflict of interest The authors declare that there are no conflicts of interest for the publication of this study.

References
Chen JB, Chen KY et al (2019) PM2.5 pollution and inhibitory effects on industry development: a bidi-
rectional correlation effect mechanism. Int J Environ Res Public Health 16(7):1159
Chen JB, Chen KY, Wang G et al (2019) Indirect economic impact incurred by haze pollution: an
econometric and input-output joint model. Int J Environ Res Public Health 16(13):2328
Egan SD, Stuefer M, Webley P et al (2014) WRF-Chem modeling of sulfur dioxide emissions from the 2008
Kasatochi Volcano. Ann Geophys. https://doi.org/10.4401/ag-6626
Fan FY, Lei YL, Li L (2019) Health damage assessment of particulate matter pollution in Jing-Jin-Ji region
of China. Environ Sci Pollut Res 26:7883–7895
Fang Z, Zhang L, Huang Y (2020) A novel BP neural network with wavelet transform inputs for air quality
index prediction. IOP Conf Ser Mater Sci Eng 735:012059
Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29(2–3):131–163
Ghaemi Z, Alimohammadi A et al (2018) LaSVM-based big data learning system for dynamic prediction of
air pollution in Tehran. Environ Monit Assess 190:300
Kang Z, Qu ZY (2017) Application of BP neural network optimized by genetic simulated annealing
algorithm to prediction of air quality index in Lanzhou. In: 2017 2nd IEEE International Conference on
Computational Intelligence and Applications, pp 155–160. https://doi.org/10.1109/CIAPP.2017.8167199
Khakzad N (2019) System safety assessment under epistemic uncertainty: using imprecise probabilities in
Bayesian network. Saf Sci 116:149–160
Lelieveld J, Evans JS, Fnais M, Giannadaki D, Pozzer A (2015) The contribution of outdoor air pollution
sources to premature mortality on a global scale. Nature 525:367–371
Li X, Peng L, Yao X et al (2017) Long short-term memory neural network for air pollutant concentration
predictions: method development and evaluation. Environ Pollut 231:997–1004
Liu Y, Guo H, Mao G et al (2008) A Bayesian hierarchical model for urban air quality prediction under
uncertainty. Atmos Environ 42(36):8464–8469
Mcmillan N, Bortnick SM, Irwin ME et al (2005) A hierarchical Bayesian model to estimate and forecast
ozone through space and time. Atmos Environ 39(8):1373–1382
Rakowska A, Wong KC, Townsend T (2014) Impact of traffic volume and composition on the air quality
and pedestrian exposure in urban street canyon. Atmos Environ 98:260–270
Safari A, Hosseini R, Mazinani M (2017) A novel type-2 adaptive neuro fuzzy inference system classifier
for modelling uncertainty in prediction of air pollution disaster. Int J Eng Trans B Appl
30(11):1746–1751
Scutari M, Nagarajan R (2013) Identifying significant edges in graphical models of molecular networks.
Artif Intell Med 57(3):207–217
Scutari M, Auconi P, Caldarelli G et al (2017) Bayesian networks analysis of malocclusion data. Sci Rep
7(1):1–11
Sheng X, Peng BH, Elahi E, Wei G (2019) Regional convergence of energy-environmental efficiency: from
the perspective of environmental constraints. Environ Sci Pollut Res 26(25):25467–25475
Sun W, Sun JY (2017) Daily PM2.5 concentration prediction based on principal component analysis and
LSSVM optimized by cuckoo search algorithm. J Environ Manage 188:144–152


Uusitalo L (2007) Advantages and challenges of Bayesian networks in environmental modelling. Ecol
Model 203(3–4):312–318
Wang JZ, Yang WD (2019) Air quality early warning system based on nonlinear correction strategy. Syst
Eng Theory Pract 39(8):2138–2151 (in Chinese)
Wang L, Jang C, Zhang Y et al (2010) Assessment of air quality benefits from national air pollution control
policies in China. Part II: evaluation of air quality predictions and air quality benefits assessment.
Atmos Environ 44(28):3449–3457
Wu MM, Xu JX, Wang Q (2019a) AQI prediction of CEEMD-Elman neural network based on data
decomposition. China Environ Sci 39(11):4580–4588 (in Chinese)
Wu XH, Chen Y et al (2019b) Study of haze emission efficiency based on new co-opetition data envel-
opment analysis. Expert Syst. https://doi.org/10.1111/exsy.12466
Wu XH, Guo J, Wei G et al (2020) Economic losses and willingness to pay for haze: the data analysis based
on 1123 residential families in Jiangsu province, China. Environ Sci Pollut Res 27:17864–17877
Xu YZ, Yang WD, Wang JZ (2017) Air quality early-warning system for cities in China. Atmos Environ
148:239–257
Yang GH, Yu W et al (2013) Rapid health transition in China, 1990–2010: findings from the global burden
of disease study 2010. Lancet 381:1987–2015
Yang YR, Liu XG, Qu Y et al (2015) Characteristics and formation mechanism of continuous hazes in
China: a case study during the autumn of 2014 in the North China Plain. Atmos Chem Phys
15(14):8165–8178
Zhao JH, Dong T, Cai B (2019) AQI prediction based on long short-term memory model with spatial-
temporal optimizations and fireworks algorithm. J Wuhan Univ (Nat Sci Ed) 65(3):250–262
(in Chinese)
Zhou Y, Li L, Sun RL et al (2019) Haze influencing factors: a data envelopment analysis approach. Int J
Environ Res Public Health 16(6):914
Zhu JY, Zheng Y, Yi XW et al (2016) A Gaussian Bayesian model to identify spatio-temporal causalities for
air pollution based on urban big data. In: Computer Communications Workshops (INFOCOM
WKSHPS) IEEE Conf. (San Francisco, CA, USA: IEEE), pp 3–8. https://doi.org/10.1109/INFCOMW.2016.7562036
Zhu S, Lian X, Liu H et al (2017) Daily air quality index forecasting with hybrid models: a case in China.
Environ Pollut 231:1232–1244

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
