Optimal policy for structure maintenance: A deep reinforcement learning framework

Structural Safety

a Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, Harbin Institute of Technology, Harbin 150090, China
b Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin 150090, China
c School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
Keywords: Bridge maintenance policy; Deep reinforcement learning (DRL); Markov decision process (MDP); Deep Q-network (DQN); Convolutional neural network (CNN)

The cost-effective management of aged infrastructure is an issue of worldwide concern. Markov decision process (MDP) models have been used in developing structural maintenance policies. Recent advances in the artificial intelligence (AI) community have shown that deep reinforcement learning (DRL) has the potential to solve large MDP optimization tasks. This paper proposes a novel automated DRL framework to obtain an optimized structural maintenance policy. The DRL framework contains a decision maker (the AI agent) and the structure that needs to be maintained (the AI task environment). The agent outputs maintenance policies and chooses maintenance actions, and the task environment determines the state transition of the structure and returns rewards to the agent under the given maintenance actions. The advantages of the DRL framework include: (1) a deep neural network (DNN) is employed to learn the state-action Q value (defined as the predicted discounted expectation of the return for the consequences under a given state-action pair), either from simulations or from historical data, and the policy is then obtained from the Q value; (2) the optimization of the learning process is sample-based, so it can learn directly from real historical data collected from multiple bridges (i.e., big data from a large number of bridges); and (3) a general framework is used for different structure maintenance tasks with minimal changes to the neural network architecture. Case studies for a simple bridge deck with seven components and a long-span cable-stayed bridge with 263 components are performed to demonstrate the proposed procedure. The results show that DRL is efficient at finding the optimal policy for maintenance tasks for both simple and complex structures.
… the current state. The decision maker chooses a plan from a finite set of maintenance actions based on actual observations from an inspection of the structural condition, then performs the maintenance action on the structure and receives a reward as a consequence of the action. The effects of natural deterioration, hazards, and maintenance actions are all depicted as transition matrices among possible states, which are based on the physical model, expert experience, or statistics of the bridge management history. Generally, worse conditions cause a higher risk of structural failure and financial losses. A maintenance action can improve the structural condition and reduce these risks, and has a certain cost associated with it. Therefore, this process is a natural optimization problem. The Bellman equation provides a mathematical framework for MDPs, and dynamic programming (DP) and linear programming (LP) algorithms are frequently employed to obtain the optimal maintenance policies [11,17,22].

DP and LP algorithms require the expectation of a return reward, which is calculated using the Bellman equation. However, the calculation is expensive and inefficient for problems with large state or action spaces [26]. From the perspective of artificial intelligence (AI), the maintenance policy-making problem can be treated as a special case of reinforcement learning (RL) and solved efficiently using a family of sampling algorithms, such as the bootstrapping temporal-difference (TD) method and the Monte Carlo tree search (MCTS) method. Hence, a general deep reinforcement learning (DRL) framework is proposed here for structural maintenance policy decisions. The optimization is sample-based, so it can learn directly from simulations or from real historical data collected from multiple bridges. A deep neural network (DNN)-structured agent enables the framework to be used in various bridge cases with little change required to the network architecture (depth and layer size).

RL is inspired by the trial-and-error process of human learning in psychology. The agent (the decision maker in the maintenance tasks) learns to perform optimal actions by interacting with the task environment (the bridge in the maintenance tasks). In these interactions, the agent performs maintenance actions in the task environment based on the current state, and the environment responds to the agent by returning the state transition and the reward (Fig. 1). The reward is an indicator of the goodness of the action performed under the given states and is usually in the form of financial costs or benefits. From such interactions, the agent acquires knowledge about the task environment and assigns credit to maintain a value function, from which an optimized policy is obtained. DRL approximates the value function using DNNs [19,25] for problems with large state or action spaces. DRL has been recognized as an important component for constructing general AI systems and has been applied to various engineering tasks for decision-making and control, e.g., video games [19], robots [13], question-answering systems [4] and self-driving vehicles [23].

Fig. 1. Schematic of reinforcement learning [26] and its mapping relation with the structural maintenance task.

This study proposes a DRL framework for the automated policy making of bridge maintenance actions. Section 2 establishes the DRL framework for maintenance and introduces the optimization method for learning the optimal maintenance policy. From the perspective of DRL, as shown in Fig. 1, the task environment comprises the physical properties of the bridge structure, the agent represents the BMS, an action is one of the possible maintenance actions, and the reward corresponds to the financial cost of the maintenance action and the possible associated risks under certain conditions. Section 3 illustrates the application of the DRL framework to a general bridge maintenance task using both a simple bridge deck structure and a complex cable-stayed bridge. The performance of the proposed DRL method is compared with hand-crafted condition-based maintenance (CBM) and time-based maintenance (TBM) policies. The conclusions and discussions are summarized in Section 4.

2. DRL framework for structural maintenance policies

DRL is employed to obtain the optimal maintenance policy for a bridge, keeping it in acceptable condition while minimizing the costs of maintenance. Deep learning (DL) approximates the value function with a DNN, and RL provides the policy improvement mechanism.

2.1. Q-learning and Deep Q-Network (DQN)

DRL-based maintenance policy making is based on MDP models, which are frequently employed to describe the processes of structural maintenance. An MDP is a tuple <S, A, P, R, γ>, where S is the structural state space (i.e., the discrete structural rating set according to the inspection manual, such as very good, good, fair, poor, urgent, or critical); A = {a_1, a_2, …, a_m} is the possible maintenance action space (i.e., the predefined maintenance actions with m levels, such as no maintenance, minor maintenance, major maintenance, and replace); and P = P(S_{t+1} | S_t, A_t) is the state transition probability matrix, i.e., the probability of the structural state transitioning from S_t at year t to S_{t+1} at year t + 1 when maintenance action A_t is performed at year t. When the action A_t is 'no repair', the structural state degrades probabilistically due to natural erosion or damage. When the maintenance action A_t is of a higher level (minor maintenance, major maintenance, or replace), the structural state is enhanced to different levels with the corresponding probabilities. Given the state of the bridge, a decision regarding maintenance is made based on the maintenance policy. This policy is defined as the conditional probability of an action under the given state, π(A_t | S_t) = P(A_t | S_t), where a good policy should consider the consequences of a given maintenance action. Here, the reward function R_t = R(S_t, A_t) serves as an indicator of the consequence, R = E[R_t | S_t], and γ ∈ [0, 1] is a discount factor accounting for long-term rewards. The reward in this study is defined as the negative of the maintenance costs plus the risk to the structure: effective maintenance may increase the maintenance costs while decreasing the structural risks. The goal of DRL-based maintenance policy making is to learn the maintenance policy that maximizes the total reward over the entire lifespan of the structure. All the states, maintenance actions, and reward function sets are defined a priori based on the inspection and maintenance manual and the cost criterion, while the state transition probability matrix is obtained from practical experience or a numerical model of the structure.

A sequence of bridge states, maintenance actions, and rewards that depicts the entire history of the RL task following a certain policy π is denoted as an episode: S_1, A_1, R_1, S_2, …, R_{T−1}, S_T, A_T, R_T ~ π. A probabilistic version and a sample version of a trajectory following a policy π are shown in Fig. 2(a) and (b), respectively.
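To make the MDP concrete, the following is a minimal sketch of such a task environment in the OpenAI Gym style (reset/step interface) adopted later in Section 3. The class name, the state/action sizes, and the placeholder transition and reward values are illustrative assumptions, not the matrices of Figs. 6 and 7.

```python
import numpy as np

class BridgeComponentEnv:
    """Minimal MDP task environment sketch: one component, yearly time step.

    S: discrete condition ratings (0 = best), A: maintenance levels,
    P[a][s, s']: transition probabilities, R(s, a): yearly reward (negative cost).
    All numbers below are illustrative placeholders, not the paper's values.
    """

    def __init__(self, n_states=4, n_actions=3, lifespan=100):
        self.n_states, self.n_actions, self.T = n_states, n_actions, lifespan
        self.P = np.zeros((n_actions, n_states, n_states))
        # 'no repair': probabilistic deterioration toward worse ratings.
        self.P[0] = 0.9 * np.eye(n_states) + 0.1 * np.eye(n_states, k=1)
        self.P[0, -1, -1] = 1.0
        # minor repair: partial recovery toward better ratings.
        self.P[1] = 0.7 * np.eye(n_states) + 0.3 * np.eye(n_states, k=-1)
        self.P[1, 0, 0] = 1.0
        # replace: back to the best condition with certainty.
        self.P[2, :, 0] = 1.0
        self.cost = np.array([0.0, 0.1, 1.0])          # maintenance cost per action
        self.risk = np.array([0.01, 0.02, 0.1, 0.3])   # yearly risk cost per condition

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, a):
        reward = -(self.cost[a] + self.risk[self.s])                    # R_t = R(S_t, A_t)
        self.s = np.random.choice(self.n_states, p=self.P[a, self.s])   # S_{t+1} ~ P(.|S_t, A_t)
        self.t += 1
        done = self.t >= self.T                                          # episode ends at lifespan T
        return self.s, reward, done, {}

# One episode (sample trajectory) under a random policy pi(a|s):
env = BridgeComponentEnv()
s, done, episode = env.reset(), False, []
while not done:
    a = np.random.randint(env.n_actions)        # replace with pi(a|s) for a learned policy
    s_next, r, done, _ = env.step(a)
    episode.append((s, a, r, s_next))            # transition tuple (S_t, A_t, R_t, S_{t+1})
    s = s_next
```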
Fig. 2. Trajectory following the policy π [26]. The hollow node denotes the state, the solid node denotes the action, and the rectangular hollow node denotes the terminal state.

The optimal policy should consider all the sequences over the entire lifespan of the bridge. Therefore, the return G_t, which balances the short- and long-term rewards (where T is the lifespan of the bridge), and the state-action value function Q_π(S_t, A_t), which is its expectation, are introduced as:

    G_t = R_{t+1} + γ R_{t+2} + ⋯ = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}
    Q_π(S_t, A_t) = E_π[G_t | S_t, A_t]                                                  (1)

Given the value of Q_π(·), one can find the optimal maintenance action that maximizes the expectation of the return. When the state and action spaces are relatively large, calculating Q_π(·) is very expensive. A deep Q-network (DQN) provides an efficient way to approximate this value. The optimization method used is the training method based on the sample version shown in Fig. 2(b), where the data are used to train the DQN. In practical applications, data can be sourced from bridges and components in similar natural environments (a big-data source) or from simulations between the BMS agent and the bridge (a simulation data source). Field and experimental results can be embedded in the physical properties of the task environment by specifying the state transition matrices and the reward functions. In this way, the agent can account for engineering concerns (such as rare hazard events) by learning to recognize the task environment through the interactions.

The DQN is given by Q_w(s, a) ≈ Q_π(s, a), parameterized by w. A DNN-structured Q network is employed in this study due to its powerful nonlinear representation and mapping capacity [19,24]. The incremental form of the iteration in the training is:

    Q(S_t, A_t) ← Q(S_t, A_t) + αδ_t = Q(S_t, A_t) + α[R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]          (2)

The training samples are generated from the interactions between the BMS agent and the bridge structure in the MC sampling style under the given policy. Given the structural state S_t of year t, the maintenance action is sampled from the policy, A_t ~ π(a | S_t), the reward is calculated as R_t = R(S_t, A_t), and the consequent state S_{t+1} ~ P(s | S_t, A_t) is then obtained. The simulation continues over the episode until t = T. These interactions generate the dataset of transition tuples (S_t, A_t, R_t, S_{t+1}) over the episode, and the target Y is calculated thereafter. In this way, Eq. (2) implements the expectation of Eq. (1) using MC simulations. The general transition data (S_t, A_t, R_t, S_{t+1}, Y) are employed as the training dataset. The input of the neural network is the state s = S_t and the output is the corresponding state-action value Q(S_t, a, w), where the weights w are updated by minimizing the mean-squared error (MSE):

    J(w) = (1/2) E[(Y − Q_w(S_t, a))²]                                                   (3)

via stochastic gradient descent (SGD) [12]:

    w ← w + α (Y − Q_w(S_t, a)) ∇_w Q_w(S_t, a)                                          (4)

During training, the maintenance action is sampled with an ε-greedy policy:

    π(A_t | S_t) = { 1 − ε + ε/m   if A_t = argmax_a Q(S_t, a)
                   { ε/m           otherwise                                             (5)

where m is the number of legal actions in each step. In addition, constraints can be imposed on the policy during training to implement real-life constraints; for example, if the maximum maintenance cost is limited to C_max, then the ε-greedy policy may be changed to the following:

    if C_t ≥ C_max:
        π(A_t | S_t) = { 1   if A_t = 0
                       { 0   otherwise
    else (C_t < C_max):
        π(A_t | S_t) = { 1 − ε + ε/m   if A_t = argmax_a Q(S_t, a)
                       { ε/m           otherwise

All the examples in Section 3 are unconstrained; however, constraints can easily be considered in this way. The pseudo-code of Q-learning is listed as follows:

Q-learning
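As a concrete illustration (a minimal sketch, not the implementation released with the paper's source code), the following tabular Q-learning loop is consistent with the update of Eq. (2) and the ε-greedy policy of Eq. (5), including the optional C_max budget constraint. It assumes the illustrative BridgeComponentEnv interface sketched earlier in this section; all names are placeholders.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, spent, C_max=None):
    """Eq. (5): probability 1 - eps + eps/m for the greedy action, eps/m otherwise.
    If a budget C_max is given and already exceeded, force action 0 ('no repair')."""
    m = len(Q_row)
    if C_max is not None and spent >= C_max:
        return 0
    probs = np.full(m, epsilon / m)
    probs[np.argmax(Q_row)] += 1.0 - epsilon
    return np.random.choice(m, p=probs)

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.3, C_max=None):
    """Tabular Q-learning with the incremental update of Eq. (2)."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s, done, spent = env.reset(), False, 0.0
        while not done:
            a = epsilon_greedy(Q[s], epsilon, spent, C_max)
            s_next, r, done, _ = env.step(a)
            spent += env.cost[a]      # running maintenance cost C_t (one interpretation)
            # Eq. (2): Q(S_t,A_t) <- Q(S_t,A_t) + alpha*[R_t + gamma*max_a Q(S_{t+1},a) - Q(S_t,A_t)]
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Usage (greedy policy afterwards: a*(s) = argmax_a Q[s, a]):
# Q = q_learning(BridgeComponentEnv())
```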
2.2. CNN

DNNs can have a very large number of parameters spread over deep layers. A convolutional neural network (CNN) is a kind of DNN with an architecture that can learn spatio-temporal features. It has fewer, but more efficient, shared parameters, which are in convolutional form.

Besides the deep architecture, three properties contribute to the efficiency of CNNs: local connections, shared weights, and pooling [12,27,32]. Unlike a fully connected network, in which the output nodes of each layer are connected to all the input nodes in the next layer, the …
Fig. 3. DRL architecture containing the four stages: (1) state encoding, (2) feature learning by the CNN, (3) Q learning by a fully connected network, and (4) policy making using the ε-greedy method based on the output of the Q network. The input is the stack of the one-hot encoding of the structural conditions and the binary coding of the relevant year. The input state is best reshaped to approximately a square to fit the 2-D CNN network.
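As a concrete illustration of stage (1), the sketch below encodes the deck-system state used later in Section 3.1 (seven component conditions with six possible ratings each, plus a 7-bit binary-coded year) into the 7 × 7 × 1 input. The function name, the exact stacking order, and the sample values (which follow the format of Fig. 8) are illustrative assumptions.

```python
import numpy as np

def encode_state(conditions, year, n_ratings=6, year_bits=7):
    """Stack one-hot component conditions with a binary-coded year and reshape to a
    (7, 7, 1) image-like input for the 2-D CNN: 7 components x 6 ratings one-hot = 42
    entries, plus 7 bits for the year = 49 = 7 x 7."""
    one_hot = np.zeros((len(conditions), n_ratings))
    one_hot[np.arange(len(conditions)), conditions] = 1.0
    year_code = [(year >> b) & 1 for b in range(year_bits)]   # binary coding of the year
    flat = np.concatenate([one_hot.ravel(), np.array(year_code, dtype=float)])
    side = int(np.sqrt(flat.size))                             # 49 -> 7
    return flat.reshape(side, side, 1)

# Example in the format of Fig. 8: seven component conditions and year 49.
x = encode_state(conditions=[0, 4, 4, 3, 1, 2, 3], year=49)
print(x.shape)   # (7, 7, 1)
```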
Fig. 6. State transition probability matrices under a moderate environment and no maintenance.
Fig. 7. State transition probability matrices for all deck system components.
Table 1
Reward rate depending on condition rating and maintenance level (Rate_action × Rate_condition).

Rate_action \ Rate_condition   0.80 (cond. 0)   0.85 (cond. 1)   0.90 (cond. 2)   0.95 (cond. 3)   1.0 (cond. 4)   1.0 (cond. 5)
0.0 (action 0)                 0.0 × 0.80       0.0 × 0.85       0.0 × 0.90       0.0 × 0.95       0.0 × 1.0       0.0 × 1.0
0.1 (action 1)                 0.1 × 0.80       0.1 × 0.85       0.1 × 0.90       0.1 × 0.95       0.1 × 1.0       0.1 × 1.0
0.3 (action 2)                 0.3 × 0.80       0.3 × 0.85       0.3 × 0.90       0.3 × 0.95       0.3 × 1.0       0.3 × 1.0
0.4 (action 3)                 1.0 × 0.80       1.0 × 0.85       1.0 × 0.90       1.0 × 0.95       1.0 × 1.0       1.0 × 1.0

Note: the total costs of the deck system components are 80, 60, 80, 60, 100, 120, and 100, respectively; the risk rates of all components for the six conditions are 0.01, 0.01, 0.02, 0.03, 0.1, and 0.3 times the total costs, respectively. In the next section, the rewards (costs) of each year are normalized by the sum of the total costs of all deck components (i.e., 600).
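Following Table 1 and its note, the yearly reward of the deck system (the negative of the maintenance cost plus the probabilistic risk cost, normalized by 600) can be sketched as follows. The rate for action 3 is taken as 1.0 per the cell entries of Table 1, and the function and variable names are illustrative.

```python
import numpy as np

# Table 1 rates (deck system, Case I).
TOTAL_COST = np.array([80, 60, 80, 60, 100, 120, 100], dtype=float)   # per component
RATE_CONDITION = np.array([0.80, 0.85, 0.90, 0.95, 1.0, 1.0])         # conditions 0..5
RATE_ACTION = np.array([0.0, 0.1, 0.3, 1.0])                          # actions 0..3 (replace = 1.0)
RATE_RISK = np.array([0.01, 0.01, 0.02, 0.03, 0.1, 0.3])              # yearly risk rate, conditions 0..5
NORMALIZER = TOTAL_COST.sum()                                          # 600

def yearly_reward(conditions, actions):
    """Negative of (maintenance cost + probabilistic risk cost), normalized by 600."""
    conditions, actions = np.asarray(conditions), np.asarray(actions)
    maintenance = TOTAL_COST * RATE_CONDITION[conditions] * RATE_ACTION[actions]
    risk = TOTAL_COST * RATE_RISK[conditions]
    return -(maintenance.sum() + risk.sum()) / NORMALIZER

# Example with the state/action pair of Fig. 8:
print(yearly_reward([0, 4, 4, 3, 1, 2, 3], [0, 1, 2, 1, 0, 0, 1]))
```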
… structural risk rate related to bridge states, and cost_total(c) × rate_risk(s) measures the risk by probabilistic financial costs, as listed in Table 1.

DRL paradigm
The size of the Q value table for the deck system maintenance task is |S||A| = 100 × 6^7 × 4^7, which is too large to be solved using the DP method. In the DRL framework, the state S_t is treated as the input and the state-action value Q(S_t, a) is approximated by the outputs of the network. The ε-greedy policy in Eq. (5) is employed to sample the maintenance action in the training step. The convolutional DQN architecture shown in Fig. 3(a) is employed to obtain the optimal maintenance actions for the encoded input state. The input size is 7 × 7 × 1, where the third dimension is extended to make it easier to connect to the neural network. Next, there are four convolutional layers, with the sizes of each layer being 7 × 7 × 4, 4 × 4 × 16, 2 × 2 × 32 and 1 × 1 × 64, respectively. The sizes of the kernel for each layer are 3 × 3, 3 × 3, 3 × 3, and 2 × 2, respectively, and the stride sizes of each …

… to t = 100 under the given policy π_w(·) and transition matrices P to collect the maintenance transition data (S_t, A_t, R_t, S_{t+1}). (3) Storing: Calculate the MC target Y = G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1} and save (S_t, A_t, R_t, S_{t+1}, Y) to the memory buffer M. (4) Updating: Update the network parameters w and then the policy π_w using the batch data sampled from M. (5) Iteration: Repeat steps (2)–(4) until convergence. In step (2), the network determines the DRL policy π_w(·), which takes the state S_t as input and outputs Q_w(S_t, a); the maintenance action is sampled from the output policy A_t ~ π_w(a | S_t). In step (4), the network is updated using the batch dataset sampled from the memory buffer M: the network takes the batch of states {S_t} as input and outputs the corresponding {Q_w(S_t, a)}, and {A_t, Y} are used to update the parameters via Eq. (4). The capacity of the memory buffer M is 10^4 and the training batch size is 10^3. Once the memory buffer is full, the newly generated simulation data (S_t, A_t, R_t, S_{t+1}, Y) overwrite the memory buffer so that the network can always be trained on the newly updated dataset.
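A sketch of a convolutional Q-network with the layer sizes listed above, together with the batch update of step (4), is given below using tf.keras. The strides (1, 2, 2, 1) and the padding are assumptions chosen here to reproduce the stated feature-map sizes, and the output head (one Q-value per component-action pair) is a simplification for illustration rather than the released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_COMPONENTS, N_ACTIONS = 7, 4     # deck system of Case I

def build_q_network():
    """Convolutional Q-network: 7x7x1 encoded state -> conv stack -> Q-values."""
    inputs = layers.Input(shape=(7, 7, 1))
    x = layers.Conv2D(4, 3, strides=1, padding="same", activation="relu")(inputs)   # 7x7x4
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)       # 4x4x16
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 2x2x32
    x = layers.Conv2D(64, 2, strides=1, padding="valid", activation="relu")(x)      # 1x1x64
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    q = layers.Dense(N_COMPONENTS * N_ACTIONS)(x)
    q = layers.Reshape((N_COMPONENTS, N_ACTIONS))(q)    # Q_w(S_t, a) per component
    return tf.keras.Model(inputs, q)

q_net = build_q_network()
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)   # learning rate 0.001, as in Section 3.1

@tf.function
def update_step(states, actions, targets):
    """Step (4): minimize the MSE of Eq. (3) between the MC target Y and the
    Q-values of the actions actually taken (actions is an int tensor of shape (batch, 7))."""
    targets = tf.reshape(targets, (-1, 1))                           # Y broadcast over components
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                        # (batch, 7, 4)
        q_taken = tf.gather(q_all, actions, axis=2, batch_dims=2)    # (batch, 7)
        loss = 0.5 * tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```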
Fig. 8. Sample: the state is s = [0, 4, 4, 3, 1, 2, 3, 49] and the greedy action is a = [0, 1, 2, 1, 0, 0, 1].

The CNN-structured DQN is built using TensorFlow, and the task environment is established according to the standard of the OpenAI Gym environment. The learning rate is set as α = 0.001, the discount factor is set as γ = 0.95, and the parameter ε decays with continued iterations. All the results are obtained on a desktop PC running 64-bit Ubuntu with an i7-4770 processor at 3.4 GHz and 8 GB of RAM. The detailed source code is available at https://github.com/HIT-SMC/Bridge-maintenance.

3.1.2. Results
The performance of the DRL policy during training is shown in Fig. 9 as a function of the simulation steps. The total cost is the sum of the maintenance costs over the simulation (the first expression in Eq. (7), cost_total(c) × rate_condition(s) × rate_action(a)), and the DQN loss is the mean squared error J(w). The DRL policy rapidly converges from a randomly initialized policy to the policy with the lowest cost after 10,000 training steps, in approximately 25 min. The model still explores new state-action pairs to find a better policy (represented as spikes on the cost and DQN loss curves and denoted by the rectangle) due to the ε-greedy mechanism. In each new exploration, the costs increase, corresponding to increases in the DQN loss. However, because the optimal policy has already been found and will not change in the specified task environment, the DRL policy quickly reverts to the optimal solution after each exploration. The exploration nevertheless continues during training.

Table 2 compares the normalized costs for different maintenance policies (DRL, time-based, and condition-based maintenance policies) over 1,000 MC simulations of bridge maintenance from t = 1 to 100. In each simulation, the maintenance action A_t is sampled from the given policy under the specific state S_t, and the consequent state S_{t+1} is sampled from the state transition matrices P(· | S_t, A_t). 'Time-X' denotes the time-based policy of making minor repairs on all deck components every X years (X = 5, 10, 15, 20). 'Condition-X' denotes the condition-based policy of making minor repairs on the deck system components whose conditions are worse than condition X (X = 1, 2, 3). The comparison shows that the DRL policy is optimal among these policies: it has the lowest average normalized life-cycle cost of −1.3885. The Condition-1 policy, with a similar average normalized life-cycle cost of −1.3938, is ranked second, while the other policies cost significantly more.

The life-cycle condition distribution in Fig. 10(a) shows that the DRL and Condition-1 policies have similar life-cycle condition distributions; that is, the deck system components mainly remain in either condition 1 or 0. This may be due to the formulation of the reward. The yearly structural risk rate related to the structural condition, rate_risk(s), is set as 0.01, 0.01, 0.02, 0.03, 0.1, and 0.3 for conditions 0 to 5 (see the note to Table 1), which assumes that conditions 0 and 1 are the same from a structural risk and structural performance perspective, while the other conditions cost more. Therefore, the top two policies opt for maintenance when the condition is worse than 1, and they always keep the structural condition between 0 and 1.

Fig. 10(b) compares the number of maintenance actions and the action distributions in every 5-year period. The results show that the time-based policies perform maintenance actions uniformly at the specified interval. The condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. The DRL policy opts for fewer actions during both the early and later years. This suggests that in the last few years of the life cycle, the DRL policy tends to require less frequent maintenance because the risk expectation reduces, and the benefit of maintenance and the risk consequence of no repair may reach a balance under the terminal constraint. However, the Condition-1 policy depends only on the condition and is …

Fig. 9. Performance of DRL policy during training. The rectangle indicates the exploring spike.
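Relating Fig. 8 to the trained network: the greedy maintenance action for an encoded state is the per-component arg max of the network output. A short sketch, reusing the illustrative encode_state and q_net helpers defined in the earlier sketches:

```python
import numpy as np

# Encode the Fig. 8 state, add a batch dimension, and read off the greedy actions.
x = encode_state(conditions=[0, 4, 4, 3, 1, 2, 3], year=49)[None, ...]
q_values = q_net(x).numpy()[0]              # shape (7, 4): one row of Q-values per component
greedy_action = q_values.argmax(axis=-1)    # comparable in format to a = [0, 1, 2, 1, 0, 0, 1]
```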
Table 2
Case I: Total cost comparison (1,000 simulations).

Policies        DRL       Condition-1   Condition-2   Condition-3   Time-5    Time-10   Time-15   Time-20
Total cost μ    −1.3885   −1.3938       −1.6427       −2.0669       −2.6008   −1.9287   −1.9133   −2.1070
Total cost σ    0.0685    0.0663        0.0994        0.1649        0.0199    0.0988    0.2371    0.4057
Fig. 10. Comparison of different maintenance policies for 1000 simulations of the simple deck system.
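The hand-crafted baselines of Table 2 and Fig. 10 can be written compactly. The sketch below implements the 'Time-X' and 'Condition-X' policies and the Monte Carlo cost comparison; it assumes a deck-system environment whose step() accepts a vector of per-component actions (an assumption beyond the single-component sketch given earlier), and deck_env is a hypothetical instance of such an environment.

```python
import numpy as np

def time_based_policy(conditions, year, interval):
    """'Time-X': minor repair (action 1) on all components every X years."""
    n = len(conditions)
    return np.ones(n, dtype=int) if year % interval == 0 else np.zeros(n, dtype=int)

def condition_based_policy(conditions, year, threshold):
    """'Condition-X': minor repair (action 1) on components worse than condition X."""
    return (np.asarray(conditions) > threshold).astype(int)

def evaluate(env, policy, n_sim=1000):
    """Mean and std of the normalized life-cycle cost over MC simulations (t = 1..100)."""
    totals = []
    for _ in range(n_sim):
        conditions, done, total, year = env.reset(), False, 0.0, 1
        while not done:
            actions = policy(conditions, year)
            conditions, reward, done, _ = env.step(actions)   # S_{t+1} ~ P(.|S_t, A_t)
            total += reward
            year += 1
        totals.append(total)
    return np.mean(totals), np.std(totals)

# e.g. evaluate(deck_env, lambda c, y: time_based_policy(c, y, interval=10))
#      evaluate(deck_env, lambda c, y: condition_based_policy(c, y, threshold=1))
```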
Fig. 12. Performance of DRL policy during training. The rectangle indicates the exploring spike.
Fig. 13. Comparison of different maintenance policies for 1000 simulations of the long-span cable-stayed bridge system.
Table 3
Case II: Total cost comparison (1,000 simulations).

Policies        DRL      Condition-1   Condition-2   Condition-3   Time-5   Time-10   Time-15   Time-20
Total cost μ    −1.267   −1.279        −1.567        −1.953        −2.055   −2.274    −3.216    −4.157
Total cost σ    0.014    0.013         0.019         0.033         0.013    0.077     0.119     0.141
Fig. 14. State transition matrices under a moderate environment and no maintenance for the long-span cable-stayed bridge.
Fig. 15. State transition matrices under different levels of maintenance for the long-span cable-stayed bridge.
… condition X. The DRL policy is the optimal one among all the policies: it has the lowest average normalized life-cycle cost of −1.267. Condition-1 is second, with an average normalized life-cycle cost of −1.279.

Moreover, as in Fig. 10(a), both the DRL and Condition-1 policies lead to similar life-cycle condition distributions. The components mainly remain in either condition 1 or 0, as shown in Fig. 13(a). Once again, the reason is that the reward function assumes that conditions 0 and 1 are the same from the structural risk and structural performance perspective. Therefore, the DRL policy always chooses to keep the structural condition between 0 and 1.

The distributions of maintenance actions for the different policies are shown in Fig. 13(b). The DRL policy opts for fewer actions during the final few years compared to Condition-1, which implies that the DRL network learns to take the age of the bridge into consideration when making policy decisions. The time-based policies have uniformly distributed maintenance actions at the specified interval, and the condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. These results are similar to those described in Section 3.1.2, which implies that DRL is effective in finding the optimal policy for different maintenance tasks.