Optimal policy for structure maintenance: A deep reinforcement learning framework

Structural Safety

a Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, Harbin Institute of Technology, Harbin 150090, China
b Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin 150090, China
c School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
Keywords: Bridge maintenance policy; Deep reinforcement learning (DRL); Markov decision process (MDP); Deep Q-network (DQN); Convolutional neural network (CNN)

The cost-effective management of aged infrastructure is an issue of worldwide concern. Markov decision process (MDP) models have been used in developing structural maintenance policies. Recent advances in the artificial intelligence (AI) community have shown that deep reinforcement learning (DRL) has the potential to solve large MDP optimization tasks. This paper proposes a novel automated DRL framework to obtain an optimized structural maintenance policy. The DRL framework contains a decision maker (the AI agent) and the structure that needs to be maintained (the AI task environment). The agent outputs maintenance policies and chooses maintenance actions, and the task environment determines the state transition of the structure and returns rewards to the agent under the given maintenance actions. The advantages of the DRL framework include: (1) a deep neural network (DNN) is employed to learn the state-action Q value (defined as the predicted discounted expectation of the return for the consequences under a given state-action pair), either from simulations or from historical data, and the policy is then obtained from the Q value; (2) the optimization of the learning process is sample-based, so it can learn directly from real historical data collected from multiple bridges (i.e., big data from a large number of bridges); and (3) a general framework is used for different structure maintenance tasks with minimal changes to the neural network architecture. Case studies for a simple bridge deck with seven components and a long-span cable-stayed bridge with 263 components are performed to demonstrate the proposed procedure. The results show that DRL is efficient at finding the optimal policy for maintenance tasks for both simple and complex structures.
… the current state. The decision maker chooses a plan from a finite set of maintenance actions based on actual observations from an inspection of the structural condition, then performs the maintenance action on the structure and receives a reward as a consequence of the action. The effects of natural deterioration, hazards, and maintenance actions are all depicted as transition matrices among possible states, which are based on the physical model, expert experience, or statistics of the bridge management history. Generally, worse conditions cause a higher risk of structural failure and financial losses. A maintenance action can improve the structural condition and reduce these risks, and has a certain cost associated with it. Therefore, this process is a natural optimization problem. The Bellman equation provides a mathematical framework for MDPs, and dynamic programming (DP) and linear programming (LP) algorithms are frequently employed to obtain the optimal maintenance policies [11,17,22].

DP and LP algorithms require the expectation of a return reward, which is calculated using the Bellman equation. However, the calculation is expensive and inefficient for problems with large state or action spaces [26]. From the perspective of artificial intelligence (AI), the maintenance policy-making problem can be treated as a special case of reinforcement learning (RL) and solved efficiently using a family of sampling algorithms, such as the bootstrapping temporal-difference (TD) method and the Monte Carlo tree search (MCTS) method. Hence, a general deep reinforcement learning (DRL) framework is proposed here for structural maintenance policy decisions. The optimization is sample-based, so it can learn directly from simulations or from real historical data collected from multiple bridges. A deep neural network (DNN)-structured agent enables the framework to be used in various bridge cases with little change required to the network architecture (depth and layer size).

RL is inspired by the trial-and-error process of human learning in psychology. The agent (the decision maker in the maintenance tasks) learns to perform optimal actions by interacting with the task environment (the bridge in the maintenance tasks). In these interactions, the agent performs maintenance actions in the task environment based on the current state, and the environment responds to the agent by returning the state transition and the reward (Fig. 1). The reward is an indicator of the goodness of the action performed under the given states and is usually in the form of financial costs or benefits. From such interactions, the agent acquires knowledge about the task environment and assigns credit to maintain a value function, from which an optimized policy is obtained. DRL approximates the value function using DNNs [19,25] for problems with large state or action spaces. DRL has been recognized as an important component for constructing general AI systems and has been applied to various engineering tasks for decision-making and control, e.g., video games [19], robots [13], question-answering systems [4] and self-driving vehicles [23].

Fig. 1. Schematic of reinforcement learning [26] and its mapping relation with the structural maintenance task.

This study proposes a DRL framework for the automated policy making of bridge maintenance actions. Section 2 establishes the DRL framework for maintenance and introduces the optimization method for learning the optimal maintenance policy. From the perspective of DRL, as shown in Fig. 1, the task environment comprises the physical properties of the bridge structure, the agent represents the BMS, an action is one of the possible maintenance actions, and the reward corresponds to the financial cost of the maintenance action and the possible associated risks under certain conditions. Section 3 illustrates the application of the DRL framework to a general bridge maintenance task using both a simple bridge deck structure and a complex cable-stayed bridge. The performance of the proposed DRL method is compared with hand-crafted condition-based maintenance (CBM) and time-based maintenance (TBM) policies. The conclusions and discussions are summarized in Section 4.

2. DRL framework for structural maintenance policies

DRL is employed to obtain the optimal maintenance policy for a bridge, keeping it in acceptable condition while minimizing the costs of maintenance. Deep learning (DL) approximates the value function with a DNN, and RL provides the policy improvement mechanism.

2.1. Q-learning and Deep Q-Network (DQN)

DRL-based maintenance policy making is based on MDP models, which are frequently employed to describe the processes of structural maintenance. An MDP is a tuple <S, A, P, R, γ>, where S is the structural state space (i.e., the discrete structural rating set according to the inspection manual, such as very good, good, fair, poor, urgent, or critical); A = {a_1, a_2, …, a_m} is the possible maintenance action space (i.e., the predefined maintenance actions with m levels, such as no maintenance, minor maintenance, major maintenance, and replace); and P = P(S_{t+1} | S_t, A_t) is the state transition probability matrix, i.e., the probability of the structural state transitioning from S_t at year t to S_{t+1} at year t + 1 when maintenance action A_t is performed at year t. When the action A_t is 'no repair', the structural state degrades probabilistically due to natural erosion or damage. When the maintenance action A_t is of a higher level (minor maintenance, major maintenance, or replace), the structural state is enhanced to different levels with the corresponding probabilities. Given the state of the bridge, a decision regarding maintenance is made based on the maintenance policy. This policy is defined as the conditional probability of an action under the given state, π(A_t | S_t) = P(A_t | S_t), where a good policy should consider the consequences of a given maintenance action. Here, the reward function R_t = R(S_t, A_t) serves as an indicator of the consequence, R = E[R_t | S_t], and γ ∈ [0, 1] is a discount factor accounting for long-term rewards. The reward in this study is defined as the negative of the maintenance costs plus the risk to the structure: effective maintenance may increase the maintenance costs while decreasing the structural risks. The goal of DRL-based maintenance policy making is to learn the maintenance policy that maximizes the total reward over the entire lifespan of the structure. All the states, maintenance actions, and reward function sets are defined a priori based on the inspection and maintenance manual and the cost criterion, while the state transition probability matrix is obtained from practical experience or a numerical model of the structure.

A sequence of bridge states, maintenance actions, and rewards that depicts the entire history of the RL task following a certain policy π is denoted as an episode: S_1, A_1, R_1, S_2, …, R_{T−1}, S_T, A_T, R_T ~ π. A probabilistic version and a sample version of a trajectory following a policy π are shown in Fig. 2(a) and (b), respectively.
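To make the MDP concrete, the following is a minimal sketch of such a task environment in the OpenAI Gym style (reset/step interface) adopted later in Section 3. The class name, the state/action sizes, and the placeholder transition and reward values are illustrative assumptions, not the matrices of Figs. 6 and 7.

```python
import numpy as np

class BridgeComponentEnv:
    """Minimal MDP task environment sketch: one component, yearly time step.

    S: discrete condition ratings (0 = best), A: maintenance levels,
    P[a][s, s']: transition probabilities, R(s, a): yearly reward (negative cost).
    All numbers below are illustrative placeholders, not the paper's values.
    """

    def __init__(self, n_states=4, n_actions=3, lifespan=100):
        self.n_states, self.n_actions, self.T = n_states, n_actions, lifespan
        self.P = np.zeros((n_actions, n_states, n_states))
        # 'no repair': probabilistic deterioration toward worse ratings.
        self.P[0] = 0.9 * np.eye(n_states) + 0.1 * np.eye(n_states, k=1)
        self.P[0, -1, -1] = 1.0
        # minor repair: partial recovery toward better ratings.
        self.P[1] = 0.7 * np.eye(n_states) + 0.3 * np.eye(n_states, k=-1)
        self.P[1, 0, 0] = 1.0
        # replace: back to the best condition with certainty.
        self.P[2, :, 0] = 1.0
        self.cost = np.array([0.0, 0.1, 1.0])          # maintenance cost per action
        self.risk = np.array([0.01, 0.02, 0.1, 0.3])   # yearly risk cost per condition

    def reset(self):
        self.s, self.t = 0, 0
        return self.s

    def step(self, a):
        reward = -(self.cost[a] + self.risk[self.s])                    # R_t = R(S_t, A_t)
        self.s = np.random.choice(self.n_states, p=self.P[a, self.s])   # S_{t+1} ~ P(.|S_t, A_t)
        self.t += 1
        done = self.t >= self.T                                          # episode ends at lifespan T
        return self.s, reward, done, {}

# One episode (sample trajectory) under a random policy pi(a|s):
env = BridgeComponentEnv()
s, done, episode = env.reset(), False, []
while not done:
    a = np.random.randint(env.n_actions)        # replace with pi(a|s) for a learned policy
    s_next, r, done, _ = env.step(a)
    episode.append((s, a, r, s_next))            # transition tuple (S_t, A_t, R_t, S_{t+1})
    s = s_next
```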
Fig. 2. Trajectory following the policy π [26]. The hollow node denotes the state, the solid node denotes the action, and the rectangular hollow node denotes the terminal state.

The optimal policy should consider all the sequences over the entire lifespan of the bridge. Therefore, the return G_t, which balances the short- and long-term rewards (where T is the lifespan of the bridge), and the state-action value function Q_π(S_t, A_t), which is its expectation, are introduced as:

    G_t = R_{t+1} + γ R_{t+2} + ⋯ = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1}
    Q_π(S_t, A_t) = E_π[G_t | S_t, A_t]                                                  (1)

Given the value of Q_π(·), one can find the optimal maintenance action that maximizes the expectation of the return. When the state and action spaces are relatively large, calculating Q_π(·) is very expensive. A deep Q-network (DQN) provides an efficient way to approximate this value. The optimization method used is the training method based on the sample version shown in Fig. 2(b), where the data are used to train the DQN. In practical applications, data can be sourced from bridges and components in similar natural environments (a big-data source) or from simulations between the BMS agent and the bridge (a simulation data source). Field and experimental results can be embedded in the physical properties of the task environment by specifying the state transition matrices and the reward functions. In this way, the agent can account for engineering concerns (such as rare hazard events) by learning to recognize the task environment through the interactions.

The DQN is given by Q_w(s, a) ≈ Q_π(s, a), parameterized by w. A DNN-structured Q network is employed in this study due to its powerful nonlinear representation and mapping capacity [19,24]. The incremental form of the iteration in the training is:

    Q(S_t, A_t) ← Q(S_t, A_t) + αδ_t = Q(S_t, A_t) + α[R_t + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]          (2)

The training samples are generated from the interactions between the BMS agent and the bridge structure in the MC sampling style under the given policy. Given the structural state S_t of year t, the maintenance action is sampled from the policy, A_t ~ π(a | S_t), the reward is calculated as R_t = R(S_t, A_t), and the consequent state S_{t+1} ~ P(s | S_t, A_t) is then obtained. The simulation continues over the episode until t = T. These interactions generate the dataset of transition tuples (S_t, A_t, R_t, S_{t+1}) over the episode, and the target Y is calculated thereafter. In this way, Eq. (2) implements the expectation of Eq. (1) using MC simulations. The general transition data (S_t, A_t, R_t, S_{t+1}, Y) are employed as the training dataset. The input of the neural network is the state s = S_t and the output is the corresponding state-action value Q(S_t, a, w), where the weights w are updated by minimizing the mean-squared error (MSE):

    J(w) = (1/2) E[(Y − Q_w(S_t, a))²]                                                   (3)

via stochastic gradient descent (SGD) [12]:

    w ← w + α (Y − Q_w(S_t, a)) ∇_w Q_w(S_t, a)                                          (4)

During training, the maintenance action is sampled with an ε-greedy policy:

    π(A_t | S_t) = { 1 − ε + ε/m   if A_t = argmax_a Q(S_t, a)
                   { ε/m           otherwise                                             (5)

where m is the number of legal actions in each step. In addition, constraints can be imposed on the policy during training to implement real-life constraints; for example, if the maximum maintenance cost is limited to C_max, then the ε-greedy policy may be changed to the following:

    if C_t ≥ C_max:
        π(A_t | S_t) = { 1   if A_t = 0
                       { 0   otherwise
    else (C_t < C_max):
        π(A_t | S_t) = { 1 − ε + ε/m   if A_t = argmax_a Q(S_t, a)
                       { ε/m           otherwise

All the examples in Section 3 are unconstrained; however, constraints can easily be considered in this way. The pseudo-code of Q-learning is listed as follows:

Q-learning
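As a concrete illustration (a minimal sketch, not the implementation released with the paper's source code), the following tabular Q-learning loop is consistent with the update of Eq. (2) and the ε-greedy policy of Eq. (5), including the optional C_max budget constraint. It assumes the illustrative BridgeComponentEnv interface sketched earlier in this section; all names are placeholders.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, spent, C_max=None):
    """Eq. (5): probability 1 - eps + eps/m for the greedy action, eps/m otherwise.
    If a budget C_max is given and already exceeded, force action 0 ('no repair')."""
    m = len(Q_row)
    if C_max is not None and spent >= C_max:
        return 0
    probs = np.full(m, epsilon / m)
    probs[np.argmax(Q_row)] += 1.0 - epsilon
    return np.random.choice(m, p=probs)

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.3, C_max=None):
    """Tabular Q-learning with the incremental update of Eq. (2)."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s, done, spent = env.reset(), False, 0.0
        while not done:
            a = epsilon_greedy(Q[s], epsilon, spent, C_max)
            s_next, r, done, _ = env.step(a)
            spent += env.cost[a]      # running maintenance cost C_t (one interpretation)
            # Eq. (2): Q(S_t,A_t) <- Q(S_t,A_t) + alpha*[R_t + gamma*max_a Q(S_{t+1},a) - Q(S_t,A_t)]
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Usage (greedy policy afterwards: a*(s) = argmax_a Q[s, a]):
# Q = q_learning(BridgeComponentEnv())
```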
2.2. CNN

DNNs can have a very large number of parameters spread over deep layers. A convolutional neural network (CNN) is a kind of DNN with an architecture that can learn spatio-temporal features. It has fewer, but more efficient, shared parameters, which are in convolutional form.

Besides the deep architecture, three properties contribute to the efficiency of CNNs: local connections, shared weights, and pooling [12,27,32]. Unlike a fully connected network, in which the output nodes of each layer are connected to all the input nodes in the next layer, the …
Fig. 3. DRL architecture containing the four stages: (1) state encoding, (2) feature learning by the CNN, (3) Q learning by a fully connected network, and (4) policy making using the ε-greedy method based on the output of the Q network. The input is the stack of the one-hot encoding of the structural conditions and the binary coding of the relevant year. The input state is best reshaped to approximately a square to fit the 2-D CNN network.
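As a concrete illustration of stage (1), the sketch below encodes the deck-system state used later in Section 3.1 (seven component conditions with six possible ratings each, plus a 7-bit binary-coded year) into the 7 × 7 × 1 input. The function name, the exact stacking order, and the sample values (which follow the format of Fig. 8) are illustrative assumptions.

```python
import numpy as np

def encode_state(conditions, year, n_ratings=6, year_bits=7):
    """Stack one-hot component conditions with a binary-coded year and reshape to a
    (7, 7, 1) image-like input for the 2-D CNN: 7 components x 6 ratings one-hot = 42
    entries, plus 7 bits for the year = 49 = 7 x 7."""
    one_hot = np.zeros((len(conditions), n_ratings))
    one_hot[np.arange(len(conditions)), conditions] = 1.0
    year_code = [(year >> b) & 1 for b in range(year_bits)]   # binary coding of the year
    flat = np.concatenate([one_hot.ravel(), np.array(year_code, dtype=float)])
    side = int(np.sqrt(flat.size))                             # 49 -> 7
    return flat.reshape(side, side, 1)

# Example in the format of Fig. 8: seven component conditions and year 49.
x = encode_state(conditions=[0, 4, 4, 3, 1, 2, 3], year=49)
print(x.shape)   # (7, 7, 1)
```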
Fig. 6. State transition probability matrices under a moderate environment and no maintenance.
Fig. 7. State transition probability matrices for all deck system components.
Table 1
Reward rate depending on condition rating and maintenance level (Rate_action × Rate_condition).

Rate_action \ Rate_condition   0.80 (cond. 0)   0.85 (cond. 1)   0.90 (cond. 2)   0.95 (cond. 3)   1.0 (cond. 4)   1.0 (cond. 5)
0.0 (action 0)                 0.0 × 0.80       0.0 × 0.85       0.0 × 0.90       0.0 × 0.95       0.0 × 1.0       0.0 × 1.0
0.1 (action 1)                 0.1 × 0.80       0.1 × 0.85       0.1 × 0.90       0.1 × 0.95       0.1 × 1.0       0.1 × 1.0
0.3 (action 2)                 0.3 × 0.80       0.3 × 0.85       0.3 × 0.90       0.3 × 0.95       0.3 × 1.0       0.3 × 1.0
0.4 (action 3)                 1.0 × 0.80       1.0 × 0.85       1.0 × 0.90       1.0 × 0.95       1.0 × 1.0       1.0 × 1.0

Note: the total costs of the deck system components are 80, 60, 80, 60, 100, 120, and 100, respectively; the risk rates of all components for the six conditions are 0.01, 0.01, 0.02, 0.03, 0.1, and 0.3 times the total costs, respectively. In the next section, the rewards (costs) of each year are normalized by the sum of the total costs of all deck components (i.e., 600).
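Following Table 1 and its note, the yearly reward of the deck system (the negative of the maintenance cost plus the probabilistic risk cost, normalized by 600) can be sketched as follows. The rate for action 3 is taken as 1.0 per the cell entries of Table 1, and the function and variable names are illustrative.

```python
import numpy as np

# Table 1 rates (deck system, Case I).
TOTAL_COST = np.array([80, 60, 80, 60, 100, 120, 100], dtype=float)   # per component
RATE_CONDITION = np.array([0.80, 0.85, 0.90, 0.95, 1.0, 1.0])         # conditions 0..5
RATE_ACTION = np.array([0.0, 0.1, 0.3, 1.0])                          # actions 0..3 (replace = 1.0)
RATE_RISK = np.array([0.01, 0.01, 0.02, 0.03, 0.1, 0.3])              # yearly risk rate, conditions 0..5
NORMALIZER = TOTAL_COST.sum()                                          # 600

def yearly_reward(conditions, actions):
    """Negative of (maintenance cost + probabilistic risk cost), normalized by 600."""
    conditions, actions = np.asarray(conditions), np.asarray(actions)
    maintenance = TOTAL_COST * RATE_CONDITION[conditions] * RATE_ACTION[actions]
    risk = TOTAL_COST * RATE_RISK[conditions]
    return -(maintenance.sum() + risk.sum()) / NORMALIZER

# Example with the state/action pair of Fig. 8:
print(yearly_reward([0, 4, 4, 3, 1, 2, 3], [0, 1, 2, 1, 0, 0, 1]))
```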
… structural risk rate related to bridge states, and cost_total(c) × rate_risk(s) measures the risk by probabilistic financial costs, as listed in Table 1.

DRL paradigm
The size of the Q value table for the deck system maintenance task is |S||A| = 100 × 6^7 × 4^7, which is too large to be solved using the DP method. In the DRL framework, the state S_t is treated as the input and the state-action value Q(S_t, a) is approximated by the outputs of the network. The ε-greedy policy in Eq. (5) is employed to sample the maintenance action in the training step. The convolutional DQN architecture shown in Fig. 3(a) is employed to obtain the optimal maintenance actions for the encoded input state. The input size is 7 × 7 × 1, where the third dimension is extended to make it easier to connect to the neural network. Next, there are four convolutional layers, with the sizes of each layer being 7 × 7 × 4, 4 × 4 × 16, 2 × 2 × 32 and 1 × 1 × 64, respectively. The sizes of the kernel for each layer are 3 × 3, 3 × 3, 3 × 3, and 2 × 2, respectively, and the stride sizes of each …

… to t = 100 under the given policy π_w(·) and transition matrices P to collect the maintenance transition data (S_t, A_t, R_t, S_{t+1}). (3) Storing: Calculate the MC target Y = G_t = Σ_{k=0}^{T−t−1} γ^k R_{t+k+1} and save (S_t, A_t, R_t, S_{t+1}, Y) to the memory buffer M. (4) Updating: Update the network parameters w and then the policy π_w using the batch data sampled from M. (5) Iteration: Repeat steps (2)–(4) until convergence. In step (2), the network determines the DRL policy π_w(·), which takes the state S_t as input and outputs Q_w(S_t, a); the maintenance action is sampled from the output policy A_t ~ π_w(a | S_t). In step (4), the network is updated using the batch dataset sampled from the memory buffer M: the network takes the batch of states {S_t} as input and outputs the corresponding {Q_w(S_t, a)}, and {A_t, Y} are used to update the parameters via Eq. (4). The capacity of the memory buffer M is 10^4 and the training batch size is 10^3. Once the memory buffer is full, the newly generated simulation data (S_t, A_t, R_t, S_{t+1}, Y) overwrite the memory buffer so that the network can always be trained on the newly updated dataset.
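A sketch of a convolutional Q-network with the layer sizes listed above, together with the batch update of step (4), is given below using tf.keras. The strides (1, 2, 2, 1) and the padding are assumptions chosen here to reproduce the stated feature-map sizes, and the output head (one Q-value per component-action pair) is a simplification for illustration rather than the released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_COMPONENTS, N_ACTIONS = 7, 4     # deck system of Case I

def build_q_network():
    """Convolutional Q-network: 7x7x1 encoded state -> conv stack -> Q-values."""
    inputs = layers.Input(shape=(7, 7, 1))
    x = layers.Conv2D(4, 3, strides=1, padding="same", activation="relu")(inputs)   # 7x7x4
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)       # 4x4x16
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)       # 2x2x32
    x = layers.Conv2D(64, 2, strides=1, padding="valid", activation="relu")(x)      # 1x1x64
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    q = layers.Dense(N_COMPONENTS * N_ACTIONS)(x)
    q = layers.Reshape((N_COMPONENTS, N_ACTIONS))(q)    # Q_w(S_t, a) per component
    return tf.keras.Model(inputs, q)

q_net = build_q_network()
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)   # learning rate 0.001, as in Section 3.1

@tf.function
def update_step(states, actions, targets):
    """Step (4): minimize the MSE of Eq. (3) between the MC target Y and the
    Q-values of the actions actually taken (actions is an int tensor of shape (batch, 7))."""
    targets = tf.reshape(targets, (-1, 1))                           # Y broadcast over components
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                        # (batch, 7, 4)
        q_taken = tf.gather(q_all, actions, axis=2, batch_dims=2)    # (batch, 7)
        loss = 0.5 * tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```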
Fig. 8. Sample: the state is s = [0, 4, 4, 3, 1, 2, 3, 49] and the greedy action is a = [0, 1, 2, 1, 0, 0, 1].

The CNN-structured DQN is built using TensorFlow, and the task environment is established according to the standard of the OpenAI Gym environment. The learning rate is set as α = 0.001, the discount factor is set as γ = 0.95, and the parameter ε decays with continued iterations. All the results are obtained on a desktop PC running 64-bit Ubuntu with an i7-4770 processor at 3.4 GHz and 8 GB of RAM. The detailed source code is available at https://github.com/HIT-SMC/Bridge-maintenance.

3.1.2. Results
The performance of the DRL policy during training is shown in Fig. 9 as a function of the simulation steps. The total cost is the sum of the maintenance costs over the simulation (the first expression in Eq. (7), cost_total(c) × rate_condition(s) × rate_action(a)), and the DQN loss is the mean squared error J(w). The DRL policy rapidly converges from a randomly initialized policy to the policy with the lowest cost after 10,000 training steps, in approximately 25 min. The model still explores new state-action pairs to find a better policy (represented as spikes on the cost and DQN loss curves and denoted by the rectangle) due to the ε-greedy mechanism. In each new exploration, the costs increase, corresponding to increases in the DQN loss. However, because the optimal policy has already been found and will not change in the specified task environment, the DRL policy quickly reverts to the optimal solution after each exploration. The exploration nevertheless continues during training.

Table 2 compares the normalized costs for different maintenance policies (DRL, time-based, and condition-based maintenance policies) over 1,000 MC simulations of bridge maintenance from t = 1 to 100. In each simulation, the maintenance action A_t is sampled from the given policy under the specific state S_t, and the consequent state S_{t+1} is sampled from the state transition matrices P(· | S_t, A_t). 'Time-X' denotes the time-based policy of making minor repairs on all deck components every X years (X = 5, 10, 15, 20). 'Condition-X' denotes the condition-based policy of making minor repairs on the deck system components whose conditions are worse than condition X (X = 1, 2, 3). The comparison shows that the DRL policy is optimal among these policies: it has the lowest average normalized life-cycle cost of −1.3885. The Condition-1 policy, with a similar average normalized life-cycle cost of −1.3938, is ranked second, while the other policies cost significantly more.

The life-cycle condition distribution in Fig. 10(a) shows that the DRL and Condition-1 policies have similar life-cycle condition distributions; that is, the deck system components mainly remain in either condition 1 or 0. This may be due to the formulation of the reward. The yearly structural risk rate related to the structural condition, rate_risk(s), is set as 0.01, 0.01, 0.02, 0.03, 0.1, and 0.3 for conditions 0 to 5 (see the note to Table 1), which assumes that conditions 0 and 1 are the same from a structural risk and structural performance perspective, while the other conditions cost more. Therefore, the top two policies opt for maintenance when the condition is worse than 1, and they always keep the structural condition between 0 and 1.

Fig. 10(b) compares the number of maintenance actions and the action distributions in every 5-year period. The results show that the time-based policies perform maintenance actions uniformly at the specified interval. The condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. The DRL policy opts for fewer actions during both the early and later years. This suggests that in the last few years of the life cycle, the DRL policy tends to require less frequent maintenance because the risk expectation reduces, and the benefit of maintenance and the risk consequence of no repair may reach a balance under the terminal constraint. However, the Condition-1 policy depends only on the condition and is …

Fig. 9. Performance of DRL policy during training. The rectangle indicates the exploring spike.
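Relating Fig. 8 to the trained network: the greedy maintenance action for an encoded state is the per-component arg max of the network output. A short sketch, reusing the illustrative encode_state and q_net helpers defined in the earlier sketches:

```python
import numpy as np

# Encode the Fig. 8 state, add a batch dimension, and read off the greedy actions.
x = encode_state(conditions=[0, 4, 4, 3, 1, 2, 3], year=49)[None, ...]
q_values = q_net(x).numpy()[0]              # shape (7, 4): one row of Q-values per component
greedy_action = q_values.argmax(axis=-1)    # comparable in format to a = [0, 1, 2, 1, 0, 0, 1]
```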
Table 2
Case I: Total cost comparison (1,000 simulations).

Policies        DRL       Condition-1   Condition-2   Condition-3   Time-5    Time-10   Time-15   Time-20
Total cost μ    −1.3885   −1.3938       −1.6427       −2.0669       −2.6008   −1.9287   −1.9133   −2.1070
Total cost σ    0.0685    0.0663        0.0994        0.1649        0.0199    0.0988    0.2371    0.4057
Fig. 10. Comparison of different maintenance policies for 1000 simulations of the simple deck system.
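The hand-crafted baselines of Table 2 and Fig. 10 can be written compactly. The sketch below implements the 'Time-X' and 'Condition-X' policies and the Monte Carlo cost comparison; it assumes a deck-system environment whose step() accepts a vector of per-component actions (an assumption beyond the single-component sketch given earlier), and deck_env is a hypothetical instance of such an environment.

```python
import numpy as np

def time_based_policy(conditions, year, interval):
    """'Time-X': minor repair (action 1) on all components every X years."""
    n = len(conditions)
    return np.ones(n, dtype=int) if year % interval == 0 else np.zeros(n, dtype=int)

def condition_based_policy(conditions, year, threshold):
    """'Condition-X': minor repair (action 1) on components worse than condition X."""
    return (np.asarray(conditions) > threshold).astype(int)

def evaluate(env, policy, n_sim=1000):
    """Mean and std of the normalized life-cycle cost over MC simulations (t = 1..100)."""
    totals = []
    for _ in range(n_sim):
        conditions, done, total, year = env.reset(), False, 0.0, 1
        while not done:
            actions = policy(conditions, year)
            conditions, reward, done, _ = env.step(actions)   # S_{t+1} ~ P(.|S_t, A_t)
            total += reward
            year += 1
        totals.append(total)
    return np.mean(totals), np.std(totals)

# e.g. evaluate(deck_env, lambda c, y: time_based_policy(c, y, interval=10))
#      evaluate(deck_env, lambda c, y: condition_based_policy(c, y, threshold=1))
```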
Fig. 12. Performance of DRL policy during training. The rectangle indicates the exploring spike.
Fig. 13. Comparison of different maintenance policies for 1000 simulations of the long-span cable-stayed bridge system.
Table 3
Case II: Total cost comparison (1,000 simulations).

Policies        DRL      Condition-1   Condition-2   Condition-3   Time-5   Time-10   Time-15   Time-20
Total cost μ    −1.267   −1.279        −1.567        −1.953        −2.055   −2.274    −3.216    −4.157
Total cost σ    0.014    0.013         0.019         0.033         0.013    0.077     0.119     0.141
Fig. 14. State transition matrices under a moderate environment and no maintenance for the long-span cable-stayed bridge.
Fig. 15. State transition matrices under different levels of maintenance for the long-span cable-stayed bridge.
… condition X. The DRL policy is the optimal one among all the policies: it has the lowest average normalized life-cycle cost of −1.267. Condition-1 is second, with an average normalized life-cycle cost of −1.279.

Moreover, as in Fig. 10(a), both the DRL and Condition-1 policies lead to similar life-cycle condition distributions. The components mainly remain in either condition 1 or 0, as shown in Fig. 13(a). Once again, the reason is that the reward function assumes that conditions 0 and 1 are the same from the structural risk and structural performance perspective. Therefore, the DRL policy always chooses to keep the structural condition between 0 and 1.

The distributions of maintenance actions for the different policies are shown in Fig. 13(b). The DRL policy opts for fewer actions during the final few years compared to Condition-1, which implies that the DRL network learns to take the age of the bridge into consideration when making policy decisions. The time-based policies have uniformly distributed maintenance actions at the specified interval, and the condition-based policies do not opt for much maintenance in the early years because of the initially good condition of the bridge. These results are similar to those described in Section 3.1.2, which implies that DRL is effective in finding the optimal policy for different maintenance tasks.