
Wei 2019

This article proposes a deep reinforcement learning framework to determine the optimal time to reclose transmission lines that were tripped during a cyber attack on the power grid. The framework establishes an environment to simulate power system dynamics during attack recovery. Using training data from this environment, the deep RL strategy can learn to choose the reclosing time that minimizes cyber attack impacts under different scenarios in real-time. Numerical results show the proposed strategy is effective at reducing impacts compared to conventional recovery approaches.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSG.2019.2956161, IEEE Transactions on Smart Grid
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2018 1

Cyber-Attack Recovery Strategy for Smart Grid Based on Deep Reinforcement Learning
Fanrong Wei, Zhiqiang Wan, Haibo He, Fellow, IEEE

F. Wei, Z. Wan and H. He are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881 USA (e-mail: [email protected]; [email protected]; [email protected]). (Corresponding author: Haibo He)

This work was supported by the Office of Naval Research under grant N00014-18-1-2396. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research.

1949-3053 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract: The integration of cyber-physical systems increases the vulnerabilities of critical power infrastructures. Once malicious attackers take over the substation control authorities, they can trip all the transmission lines to block the power transfer. As a consequence, asynchrony will emerge between the separated regions that had been interconnected by these transmission lines. To recover from the attack, a straightforward way is to reclose these transmission lines once the attack is detected. However, this may cause severe impacts on the power system, such as current inrush and power swing. Therefore, it is critical to choose the reclosing time properly to mitigate these impacts. In this paper, we propose a recovery strategy to reclose the tripped transmission lines at the optimal reclosing time. In particular, a deep reinforcement learning (RL) framework is adopted to endow the strategy with adaptability to uncertain cyber-attack scenarios and the ability to make decisions in real time. In this framework, an environment is established to simulate the power system dynamics during the attack-recovery process and generate the training data. With these data, the deep RL based strategy can be trained to determine the optimal reclosing time. Numerical results show that the proposed strategy can minimize the cyber-attack impacts under different scenarios.

Index Terms: Cyber-attack, recovery strategy, deep reinforcement learning (RL), optimal reclosing time.

I. INTRODUCTION

A developing trend of traditional power systems is the integration of the cyber and physical domains to improve control flexibility and efficiency [1]. However, the deep integration of cyber-physical systems increases the risk of cyber-attacks on critical power infrastructures. Indeed, a report [2] testifies that the number of security-related incidents in power systems is rising year by year.

To ensure power safety, researchers have conducted many excellent works. From the perspective of attackers, Refs. [3–5] demonstrate the recent trends in cyber-attacks such as multi-switch attacks and false data injection (FDI) attacks. As to the defenders, signature-based and anomaly-detection based methods are proposed to protect the cyber-physical systems from potential attackers [6, 7]. Ref. [8] proposes a tri-level optimization model to formulate the coordinated attack scenario and identify the optimal defending strategy. Ref. [9] suggests a modified semi-Markov process model to describe the attack procedures against the intrusion tolerant system.

However, attackers are becoming more concealed and more difficult to identify before they launch an attack [10]. Meanwhile, attackers will hide the attack consequences, delaying identification by the defense system [11]. Therefore, the defense systems can hardly be immune to all potential attacks at any time and anywhere. In particular, the hijacking of the Supervisory Control and Data Acquisition (SCADA) network has been reported several times [12]. Once attackers have hacked into the SCADA system, there are two alternative ways to sabotage the power system. One is to keep deteriorating the power infrastructure over a long period of time so as to cause permanent damage to it, as in the well-known cyber-attacks in Iran [13]. For this kind of cyber-attack, fast and reliable intrusion detection is important. The other is to trip all the transmission lines connected to the substation [14], resulting in severe $N-k$ (e.g., $N-3$ or $N-4$) contingencies. For this kind of cyber-attack, the recovery strategy proposed in this paper is as important as the intrusion detection strategy.

Usually, only the $N-1$/$N-2$ contingencies have been pre-arranged in the emergency plan of the power system [15]. In addition, unlike the sequential contingencies caused by extreme weather [16, 17], the $N-k$ contingencies caused by a cyber-attack occur simultaneously within several seconds. Thus the unexpected $N-k$ contingencies will have significant impacts on power system operation, leading to severe system stability problems and even cascading outages.

Note that there is no permanent fault in the tripped transmission lines. Therefore, the power system will be fully recovered if we can reclose all the tripped transmission lines before cascading failures occur. However, as the power transfer is temporarily blocked because of the cyber-attack, asynchrony will gradually emerge between the separated regions of the power system that were once interconnected by the tripped lines. Thus, severe impacts (e.g., current inrush and power swing) are likely to be incurred when we try to reunite the asynchronous power system with reclosing operations. Under this circumstance, the widely adopted principles of "the faster the recovery, the better" [18] or recovery after a constant time-delay [19] will bring risks to the stability of the power system.

Some efforts have been devoted to alleviating the impacts of reclosing transmission lines. Ref. [20] finds that the reclosing time has a prominent influence on the degree of reclosing impacts. Ref. [21] shows that a faster reclosing is not necessarily better, and that there exists an optimal reclosing time which could minimize the impacts of reclosing operations. On this basis, Ref. [22] proposes an offline optimal reclosing strategy by enumerating all the possible reclosing times in
a typical operation scenario. The numerical results indicate that the optimal reclosing strategy can significantly alleviate the reclosing impacts. Therefore, we can expect a definite improvement when adopting the concept of optimal reclosing time in the cyber-attack recovery.

The offline strategy above is based on the assumption that the power system is deterministic. However, due to the uncertainties of the power system, such as the unpredictable activities of attackers and the uncertainty in the system parameters [3], this strategy is not suitable for the real-time cyber-attack recovery process. One may argue that we can enumerate all the possible reclosing times in real time to achieve better adaptability. However, this is extremely time-consuming because of the massive numerical simulations required. Given the extremely dangerous post-attack situations, prolonged waiting for recovery decisions will substantially increase the risks of cascading failures [23]. Therefore, it is of great significance to put forward a recovery strategy characterized by environmental adaptability and real-time decision-making ability.

In recent years, deep RL has achieved promising results in many complex real-time decision-making applications under uncertain environments [24–29]. More specifically, a deep Q-network has achieved a level comparable to that of a professional human player on the Atari 2600 [24]. As to the power grid, Refs. [25, 26] present deep RL methods to determine the optimal strategy for real-time EV charging scheduling. Ref. [27] proposes a novel autonomous control framework, Grid Mind, for secure operation of power grids based on deep RL technologies. Ref. [28] suggests a data-driven, model-free method for load frequency control (LFC) against renewable energy uncertainties based on deep RL in a continuous action domain. Ref. [29] validates that RL methods can be more efficient when making real-time decisions in uncertain environments. Ref. [30] presents a deep RL based method to solve the optimal active power dispatch in a near-optimal way. Although the training process is relatively time-consuming, the RL based method could adapt to the uncertain environment without any retraining process.

In this paper, we develop a recovery strategy to optimally reclose the transmission lines based on a popular deep RL algorithm, Deep Deterministic Policy Gradient (DDPG) [31]. Since the action space of our problem is continuous, we choose DDPG instead of the double deep Q-learning algorithm adopted in [32], which is used for discrete action spaces. The DDPG contains a critic network and an actor network. The critic network is used to approximate the value function while the actor network is used to generate the action. An environment is established to simulate the state transition of the power system under a potential recovery action. Meanwhile, a reward based on the transient energy function is calculated to evaluate the performance of the recovery action. With the data (state, action, and reward), the actor and critic networks of DDPG are trained offline. After the training process, the proposed approach can be deployed for online cyber-attack recovery. The contributions of this paper are as follows.

1) A novel cyber-attack recovery strategy based on deep RL is proposed to determine the optimal reclosing time. Compared with the existing recovery strategies (e.g., reclosing immediately or after a fixed time-delay), the proposed one significantly improves the recovery performance.

2) An environment is established to simulate the power system dynamics under cyber-attack. Specifically, the states of the power system are obtained by a numerical integration method, while the recovery performance is evaluated by the transient energy function.

3) The proposed strategy can generate the optimal recovery actions right after the identification of the cyber-attack, and thus alleviate the potential risks of cascading outages.

4) The proposed strategy is endowed with good adaptability, in that it continuously yields optimal or near-optimal actions under different cyber-attack scenarios.

The rest of this paper is organized as follows. We introduce the cyber-attack recovery problem in Section II and propose a numerical method to simulate the attack-recovery process in Section III. In Section IV, a deep RL based recovery strategy is used to solve the problem. Simulation results are provided in Section V and the conclusion is given in Section VI.

II. PROBLEM FORMULATION

The power substation is a crucial battlefield of cyber security. In particular, a typical attack strategy in Ref. [6] is adopted to demonstrate the potential attack-defense process, the details of which are shown in Fig. 1.

[Figure 1: flowchart. Labels recovered from the figure: stages "Intrusion Attempts", "Attack & Defense", "Recovery"; "Normal Operation"; "Malicious connections"; "Attack to trip breakers" (Modify GOOSE, Modify SMV, Modify MMS) vs. "Defense system" ("Competition"); "Has attack succeeded?"; "Identified before attack"; "Identified after attack"; "Block intruder's connection"; "Fast evaluation"; "Mitigation actions"; "Recover all breakers"; "Focus of this paper".]

Fig. 1. A potential competition between attackers and defenders.

It can be seen from Fig. 1 that, after obtaining the authority of the substation, the attackers can trip the circuit breakers in three ways: the first is to trip directly by modifying the Generic Object Oriented Substation Event (GOOSE); the second is to misguide local relays by injecting false Sampled Measured Value (SMV); the third is to sabotage the relay setting values by fabricating false Manufacturing Message Specification (MMS) messages. Meanwhile, the defense system will crosscheck the operation logs and heterogeneous sampling data periodically to identify the malicious activities. If the defense system can identify the attack before the attackers act, it will block the attacker's connection and return to normal operation. However, if the attack is successfully launched, the attackers can trip all the transmission lines connected to the substation. Thus, the integrity of the power system will be severely damaged.

Fig. 2 shows a cyber-attack on a modified 9-bus system. The attacker invades the substation corresponding to BUS10 and trips all the transmission lines connected to this substation. After the attack, multiple power transmission channels (L1, L2 and L3) will be lost if malicious attackers
invade the substation corresponding to BUS10. As a consequence, the blocked power will accelerate/decelerate the generator rotors of G1/G2/G3 and result in asynchrony in the power system. When we try to reclose the transmission lines corresponding to the asynchronous generators, the power system will encounter severe impacts (e.g., current inrush and power swing) if the reclosing time is not appropriately selected [21]. Therefore, the problem lies in how to recover the tripped lines at an appropriate time.

[Figure 2: single-line diagram of the modified 9-bus system with buses BUS 1 to BUS 10, generators G1, G2, G3 and loads LD1, LD2, LD3. Power channels lost: 1) L1: Bus 6 to Bus 10; 2) L2: Bus 8 to Bus 10; 3) L3: Bus 9 to Bus 10.]

Fig. 2. A modified 9-bus system under cyber-attack.

III. SIMULATION OF ATTACK-RECOVERY PROCESS

In this section, we propose a numerical method to simulate the entire attack-recovery process. As depicted in Fig. 3, before the attack, the system is operating in a normal state. Then, the malicious attackers launch cyber-attacks on the power system. When the cyber-attack is identified by the defenders, a recovery action will be generated to recover the power system. After executing the recovery action, the power system will be fully restored and enter the post-action state.

[Figure 3: timeline of the attack-recovery process. Steady state $s_0$ (rotor angle $\delta_0$, rotor speed $\omega_0$), obtained from power system parameters and power flow data; stealthy attack at $t_a$; identification after time delay $\Delta t_I$ at $t_I$, giving the pre-action state $s_I$ (rotor angle $\delta_I$, rotor speed $\omega_I$); recovery action $a$ with reclose delays $\Delta t_{L1}$, $\Delta t_{L2}$, $\Delta t_{L3}$ at times $t_{L1}$, $t_{L2}$, $t_{L3}$; post-action state $s_N$ (rotor angle $\delta_N$, rotor speed $\omega_N$) at $t_N$; reward given by the transient energy $V$ (potential energy and kinetic energy).]

Fig. 3. The attack-recovery process integrated in the numerical simulation.

To simulate the process described above, we first present the dynamic equations of the power system under cyber-attack. Then, we propose a module to generate the potential recovery action. Finally, the transient energy function is formulated to evaluate the performance of the recovery action.

A. Dynamic Equations of Power System

The dynamic equations describe the rotational dynamics of synchronous generators in the power system. During normal operation, the relative position of the rotor angle is fixed and the relative rotor speed is zero. After the cyber-attack, the rotor either accelerates or decelerates according to the difference between the mechanical power and the electromagnetic power.

For the $i$th generator, the rotor dynamics equation can be presented as [33]:

$$
\begin{aligned}
\frac{d\delta_i}{dt} &= \omega_i \\
\frac{d\omega_i}{dt} &= \frac{1}{M_i}\left(P_{mi} - P_{ei} - D_i \omega_i\right)
\end{aligned}
\tag{1}
$$

where $\delta_i$ is the rotor angle, $\omega_i$ is the rotor speed, $M_i$ is the generator's moment of inertia, $D_i$ is the damping constant, and $P_{mi}$ is the mechanical power. $P_{ei}$ is the electromagnetic power determined by

$$
P_{ei} = E_i^2 G_{ii} + \sum_{j=1,\, j \neq i}^{n} \left[ C_{ij} \sin(\delta_i - \delta_j) + D_{ij} \cos(\delta_i - \delta_j) \right]
\tag{2}
$$

where $E_i$ is the constant voltage behind the transient reactance and $n$ is the number of generators. The coefficients are $C_{ij} = E_i E_j B_{ij}$ and $D_{ij} = E_i E_j G_{ij}$, with $Y_{ii} = G_{ii} + jB_{ii}$ and $Y_{ij} = G_{ij} + jB_{ij}$. $Y$ is the reduced admittance matrix of the system, $Y = Y_{ss} - Y_{se} Y_{ee}^{-1} Y_{es}$, where $Y_{ee}$ denotes the mutual admittance of the passive nodes only, $Y_{ss}$ denotes the mutual admittance of the generation nodes only, and $Y_{es}$ and $Y_{se}$ denote the mutual admittance between the passive nodes and the generation nodes. The detailed derivation of $Y$ can be found in [34].

We select the rotor angle $\delta$ and rotor speed $\omega$ as the state of the multi-generator power system. More specifically, the rotor speed is related to the kinetic energy of the power system, while the rotor angle of the generators is related to the potential energy of the power system. For the modified 9-bus system in Fig. 2, $s = [\delta_1, \delta_2, \delta_3, \omega_1, \omega_2, \omega_3]^T$.

For notational brevity, equations (1)-(2) can be simplified as:

$$
\frac{ds}{dt} = f(s, t)
\tag{3}
$$

To obtain the state transitions of the dynamic equation (3), an implicit trapezoidal numerical integration method is adopted from [35]. The method requires the solution of nonlinear algebraic equations at each iteration:

$$
s_{n+1} = s_n + \frac{\Delta t}{2}\left[ f(s_n, t_n) + f(s_{n+1}, t_{n+1}) \right]
\tag{4}
$$

where $\Delta t$ is the time step of the numerical integration.

We can define the following residual vector for the trapezoidal method:

$$
R(w) = w - s_n - \frac{\Delta t}{2}\left[ f(s_n, t_n) + f(w, t_{n+1}) \right]
\tag{5}
$$

Thus, $s_{n+1}$ in equation (4) is given by the solution of $R(s_{n+1}) = 0$.

One of the standard methods for solving nonlinear algebraic equations is the Newton-Raphson method. It begins with an initial guess for $s_{n+1}$ and solves a linearized version of $R = 0$ to find a correction to the initial guess for $s_{n+1}$.


B. Recovery Action

1) The definition: To restore the integrity of the power system, we should reclose the tripped lines at a proper time after the identification of the cyber-attack. Therefore, for the power system in Fig. 2, the action can be denoted as $a = [\Delta t_{L1}, \Delta t_{L2}, \Delta t_{L3}]$, where $\Delta t_{Li}$, $i \in \{1, 2, 3\}$, is the reclosing time delay from the identification time $t_I$. The detailed timeline of attack-recovery is shown in Fig. 3.

In order to avoid the potential system collapse caused by power flow transfer, the tripped lines need to be reclosed within a specified time boundary. Usually, the time boundary is related to the setting time of the over-current relay, which may be triggered by the power flow transfer. Therefore, the reclosing time delay $\Delta t_{Li}$ should satisfy

$$
0 \le \Delta t_{Li} \le T_u, \quad i \in \{1, 2, 3\}
\tag{6}
$$

where $T_u$ is the upper boundary of the recovery time delay.

In the timeline of the numerical simulation, the reclosing time $t_{Li}$ of transmission line $Li$ will be:

$$
t_{Li} = t_I + \Delta t_{Li} + \varepsilon_{op}
\tag{7}
$$

where $\varepsilon_{op}$ denotes the operation delay of the circuit breaker after a reclosing command is issued.

2) The simplification tricks: In practice, there are some simplification tricks which can reduce the solution space of the action and thus relieve the computational burden. The details are presented below.

By replacing each bus with a node, the region to be restored in Fig. 2 can be abstracted as the topology shown in Fig. 4. We define a node connected to generators or loads as an active node, and any other node as a passive node. Thus, N6, N8 and N9 are the active nodes, while N10 is a passive node. Lines L1, L2, L3 are the transmission lines to be reclosed.

Assume that the lines are reclosed in the order L1 → L2 → L3; the topology changes are shown in Fig. 4 (a)∼(c). In Fig. 4 (a), when we reclose L1, there is no power flow between N6 and N10 because N10 is a passive node. There is no power flow until the active nodes N8 and N9 are connected to N10 by reclosing L2 and L3. Similarly, if we reclose L2 first, there is no power flow until we reclose L1 or L3. If we reclose L3 first, there is no power flow until we reclose L1 or L2.

[Figure 4: three panels showing nodes N6, N8, N9, N10 and lines L1, L2, L3: (a) First reclosing, (b) Second reclosing, (c) Third reclosing. Legend: active node, passive node, tripped line, reclosed line w/o power flow, reclosed line w/ power flow.]

Fig. 4. The changes of topology with the reclosing action.

From the analysis above, we can further conclude that the time of the first reclosing has no impact on the power flow because there is no power flow after the first reclosing. Therefore, we can reclose one transmission line immediately after the identification of the cyber-attack. By this means, the solution space will be reduced without sacrificing the performance of the recovery action. Noticing that any of L1, L2 and L3 can serve as the first reclosing line, the action $a$ can be simplified as:

$$
[\Delta t_{L1}, \Delta t_{L2}, \Delta t_{L3}] \rightarrow [Lk, \Delta t_{L\alpha}, \Delta t_{L\beta}]
\tag{8}
$$

where $Lk$ denotes the index of the transmission line to be reclosed immediately after cyber-attack identification, $Lk \in \{L1, L2, L3\}$. The indices satisfy $\alpha, \beta \in \{1, 2, 3\}$, $\alpha, \beta \neq k$ and $\alpha < \beta$.

3) The selection of $Lk$ based on prior knowledge: With prior knowledge of the power flow, it is possible to determine $Lk$ directly to further reduce the solution space of $a$. For instance, if there were two power send-in channels but only one send-out channel, it would be better to immediately reclose the send-out channel rather than the two send-in channels. For the 9-bus system in Fig. 2 with the parameters listed in the appendix, the transmission line L1 serves as the only send-out channel in all possibilities. Under this circumstance, L1 will be reclosed immediately after the identification of the cyber-attack. Without loss of generality, in this paper, the action will be simplified as $a = [\Delta t_{L2}, \Delta t_{L3}]$, and $\Delta t_{L1}$ will be set to 0.

4) The recovery effects: During the process of attack and recovery, the admittance matrix $Y$ will be changed. Specifically, attacks can be regarded as removing the existing branches from the admittance matrix, while recovery actions will restore the removed branches. Therefore, the attack and recovery process can be characterized by the modification of the admittance matrix.

According to equation (2), when the parameters in the admittance matrix $Y$ change, the electromagnetic power $P_{ei}$ changes accordingly. By this means, the effect of $a$ is reflected in the dynamic equation.

5) Special situations: In practical situations, it is possible that BUS10 could be split into two nodes, i.e., BUS10-1 and BUS10-2. Under this circumstance, there are three potential post-attack situations: a) all the transmission lines connected to BUS10-1 and BUS10-2 have been tripped; b) only the transmission lines connected to BUS10-1 have been tripped; c) only the transmission lines connected to BUS10-2 have been tripped. Consequently, all three situations should be generated and trained offline to make sure the recovery strategy can adapt to each of them.

C. Reward to Evaluate the Recovery Performance

The topology of the power system will be fully restored after we reclose all the tripped transmission lines L1, L2, and L3. We define the post-action time $t_N$ as

$$
t_N = \max\{t_{L2}, t_{L3}\} + \Delta t
\tag{9}
$$

To evaluate the performance of recovery, we observe the post-action state $s_N$ at $t_N$, and then calculate the transient energy function proposed in Ref. [36]. The transient energy is used as the reward $r$ to evaluate the recovery performance.

In order to calculate the transient energy, firstly, the rotor angle $\delta$ and rotor speed $\omega$ obtained in the numerical simulation will be transformed into the center of inertia (COI) framework,
which can offer a good physical insight into the behavior of executing this action, a reward is calculated according to
synchronous generators. equation (15) and passed back to the deep RL algorithm.
The motion center of the generators in the COI reference This reward assesses the quality of taking the action under
frame can be represented by the current state. During the training process, the deep RL
n algorithm will adjust the parameters of the neural networks
1 X
δCOI = Mi δi (10) such that the reward can be maximized.
MΣ i=1 There are various deep RL algorithms, such as DDPG and
n DQN. Specifically, the DDPG has a continuous action space,
1 X while the DQN has a descrete one. Because the action space
ωCOI = Mi ωi (11)
MΣ i=1 of a is continuous, the DDPG is adopted in this problem.
n
X
MΣ = Mi (12) B. Generation of Training Data
i=1 The generation of training data is shown in Algorithm 1.
where n is the total number of generators in the power system. Firstly, we generate the potential load combinations with the
δCOI and ωCOI are the reference rotor angle and reference rotor load parameters in the appendix. Then we calculate its optimal
speed in the COI framework. In this framework, the rotor angle power flow and obtain the initial state s0 . Note that, if the gen-
θi and rotor speed ω̂i can be expressed as eration combination is ready as well, the power flow instead of
optimal power flow should be adopted. Subsequently, for each
θi = δi − δCOI (13) load combination, we randomly generate the identification
ω̂i = ωi − ωCOI (14) delay within the upper boundary as ∆tI ≤ TI and obtain
the identification time tI . Note that, TI should be determined
Then, the transient energy can be calculated by equation by the frequency of the cross-check activity conducted by the
(15). n this equation, the first term is the kinetic energy and defensive system. By this means, the identification cases are
depends on rotor speed only, while the last three terms together
form the potential energy of the system which depends only formulated.
on rotor angle. For each identification case, we randomly generate the
n n potential action a with the restriction of equation (6). Then
1X X
V = Mi ω̂i2 − Pi (θi − θis ) we calculate tL1 , tL2 , tL3 , and tN with equations (7)∼(9).
2 j=1 i=1 Then, we begin the numerical simulation with the time
n−1
X n
X points ta , tI , tL1 , tL2 , tL3 , and tN . When t = ta , we launch
− [Cij (cos θij − cos θijs )] (15) the attack. When t = tI , the attack is identified, and we
i=1 j=i+1
observe the system state sI . When t = tL1 , tL2 , tL3 , we
n−1 n
θi − θis + θj − θjs
 
+
X X
Dij s 
sin θij − sin θij
reclose the corresponding transmission line. When t = tN , we
s
i=1 j=i+1
θij − θij observe the system state sN and then terminate the numerical
simulation. Then we calculate the reward r with state s0 and
where the steady state $\theta_i^s$ equals the initial state $\theta_0$.

IV. RECOVERY STRATEGY BASED ON DEEP RL

In this section, we adopt a deep RL algorithm, Deep Deterministic Policy Gradient (DDPG) [31], to determine the optimal reclosing time. The proposed approach is shown in Fig. 5. We first generate the training data by numerical simulation and store them in an experience replay memory. Then, a DDPG algorithm learns the optimal recovery policy from these data. The generation of the training data and the training of DDPG are illustrated in Algorithms 1 and 2, respectively.

A. Preliminary of Reinforcement Learning

The aim of RL is to learn a mapping from the input state to the action so as to maximize/minimize a numerical reward signal [37]. Unlike supervised learning, where ground-truth examples are provided by a knowledgeable external supervisor, RL should discover which action yields the most reward by trial-and-error search.

Fig. 6 shows the schematic diagram of deep RL. We first observe the state s from the environment. Then, the deep RL algorithm generates the action for the environment. After

s_N by equation (15). Finally, a transition (s_I, a, r) is stored in the replay memory D. Repeat the above procedures until enough transitions have been obtained.

C. Training of DDPG

The transitions in the replay memory are used to train the DDPG algorithm such that it can learn the optimal reclosing strategy. As shown in Fig. 5, the DDPG algorithm consists of two major networks: an actor network and a critic network. The actor network generates the continuous action a from the input state s, and the critic network outputs the evaluation of the action from the input state s and action a. For the details of DDPG, we refer the readers to [31].

During the training process, we first randomly initialize the critic network $Q(s, a \mid \theta^Q)$ and the actor network $\mu(s \mid \theta^\mu)$ with parameters $\theta^Q$ and $\theta^\mu$. Then, a minibatch of transitions $F = \{(s_I^i, a^i, r^i)\}_{i=1}^{\#F}$ is randomly sampled from replay memory D. With these transitions, the parameters of the critic network $\theta^Q$ are updated by minimizing the loss

$$L = \sum_{i=1}^{\#F} \left( r^i - Q(s_I^i, a^i \mid \theta^Q) \right)^2, \quad (16)$$

which is, up to the constant factor 1/#F, the mean squared error between the actual reward and the value approximated by the critic network.
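To relate (16) to the general DDPG critic update in [31], the following sketch writes out the regression target. It assumes that each simulated recovery is treated as a one-step episode, which is consistent with the stored transitions (s_I, a, r) containing no successor state. The discount factor $\gamma$, the successor state $s'^i$, and the target networks $Q'$ and $\mu'$ belong to the general formulation in [31] and do not appear in this paper's formulation.

```latex
% General DDPG regression target from [31], with successor state s'^i:
y^i = r^i + \gamma \, Q'\big(s'^i, \mu'(s'^i \mid \theta^{\mu'}) \,\big|\, \theta^{Q'}\big)

% For one-step (terminal) transitions the bootstrap term is dropped, so y^i = r^i and
L = \sum_{i=1}^{\#F} \big( y^i - Q(s_I^i, a^i \mid \theta^Q) \big)^2
  = \sum_{i=1}^{\#F} \big( r^i - Q(s_I^i, a^i \mid \theta^Q) \big)^2

% Gradient used by a gradient-descent step on the critic parameters:
\nabla_{\theta^Q} L
  = -2 \sum_{i=1}^{\#F} \big( r^i - Q(s_I^i, a^i \mid \theta^Q) \big) \,
    \nabla_{\theta^Q} Q(s_I^i, a^i \mid \theta^Q)
```

Under this reading, the critic is a plain regressor from (state, action) pairs to simulated rewards, and no target network is needed to stabilize training.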


[Figure: block diagram. Scenario generation (load combinations, delay of identification, recovery action) supplies s_0, t_I and a to the environment; transitions (s_I, a, r) are stored in the experience replay memory; a sampling algorithm draws mini-batches of (s_I, a, r) to update the critic network through the loss function L and the actor network through the policy gradient ∇J.]

Fig. 5. Optimal cyber-attack recovery based on DDPG algorithm.
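The data-generation path of Fig. 5 (scenario generation, environment, experience replay memory), which Algorithm 1 spells out, can be sketched as three nested loops. This is a minimal sketch under stated assumptions, not the authors' implementation: `simulate` is a hypothetical placeholder for the numerical simulation, its reward is a toy expression, and forming each reclosing time as t_I plus the breaker delay plus a sampled action delay is an assumption made for illustration.

```python
import random

T_ATTACK = 0.1      # t_a used in the paper's training setup (s)
EPS_OP = 0.02       # operation delay of the circuit breakers (s)


def simulate(load, t_identify, action):
    """Hypothetical placeholder for the numerical simulation.

    Returns (s_I, r). A real environment would integrate the system
    dynamics and derive r from the post-action state s_N.
    """
    s_I = (load, t_identify)                               # toy identification state
    r = -sum((t - t_identify - 0.3) ** 2 for t in action)  # toy reward, not the paper's
    return s_I, r


def generate_training_data(loads, G, K, max_ident_delay=1.5,
                           max_action_delay=1.0, step=0.05, seed=0):
    """Fill a replay memory with (s_I, a, r) transitions."""
    rng = random.Random(seed)
    replay_memory = []                                     # D
    # Delay grids sampled every `step` seconds up to their upper bounds.
    ident_grid = [k * step for k in range(int(round(max_ident_delay / step)) + 1)]
    action_grid = [k * step for k in range(int(round(max_action_delay / step)) + 1)]
    for load in loads:                                     # H load combinations
        for _ in range(G):                                 # G identification cases
            t_identify = T_ATTACK + rng.choice(ident_grid)
            for _ in range(K):                             # K action options
                # a = [t_L1, t_L2, t_L3]: one reclosing time per tripped line
                action = [t_identify + EPS_OP + rng.choice(action_grid)
                          for _ in range(3)]
                s_I, r = simulate(load, t_identify, action)
                replay_memory.append((s_I, action, r))
    return replay_memory


D = generate_training_data(loads=[(1.2, 1.1, 0.6), (1.3, 1.2, 0.7)], G=3, K=4)
```

With the settings reported in the numerical study (H = 27, G = 15, K = 400), a loop of this shape would store 27 × 15 × 400 = 162,000 transitions.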

Algorithm 2 Training of DDPG
1: Initialization
2: Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor network $\mu(s \mid \theta^\mu)$ with parameters $\theta^Q$ and $\theta^\mu$.
3: for episode = 1, M do
4:   Sample a random minibatch $F = \{(s_I^i, a^i, r^i)\}_{i=1}^{\#F}$ from replay memory D.
5:   Update the critic network by minimizing the loss:
6:   $L = \sum_{i=1}^{\#F} \left( r^i - Q(s_I^i, a^i \mid \theta^Q) \right)^2$
7:   Update the actor network with the sampled policy gradient:
8:   $\nabla_{\theta^\mu} J = \sum_{i=1}^{\#F} \nabla_a Q(s, a \mid \theta^Q) \big|_{s=s^i,\, a=\mu(s^i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s^i}$
9: end for

[Figure: loop between the deep RL algorithm and the environment, exchanging action, reward and state.]

Fig. 6. The schematic diagram of deep RL.
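One iteration of Algorithm 2 can be sketched in NumPy as follows. The paper trains neural networks in TensorFlow; this sketch instead assumes a toy critic that is linear in hand-picked features and a linear actor, so that both gradients can be written by hand. All function names, dimensions, learning rates and the toy reward are illustrative assumptions rather than values from the paper.

```python
import numpy as np


def features(s, a):
    """phi(s, a) = [s, a, a^2, 1]; the toy critic is linear in phi."""
    return np.hstack([s, a, a ** 2, np.ones((s.shape[0], 1))])


def critic(theta_q, s, a):
    return features(s, a) @ theta_q            # Q(s, a | theta_q), shape (batch,)


def actor(theta_mu, s):
    return s @ theta_mu                        # mu(s | theta_mu), shape (batch, n_a)


def ddpg_step(theta_q, theta_mu, s, a, r, lr_q=1e-3, alpha=1e-3):
    n_s, n_a = s.shape[1], a.shape[1]
    # Critic: L = sum_i (r_i - Q(s_i, a_i))^2, as in (16); one descent step.
    err = r - critic(theta_q, s, a)
    loss = float(np.sum(err ** 2))
    theta_q = theta_q - lr_q * (-2.0 * features(s, a).T @ err)
    # Actor: sum_i grad_a Q |_{a = mu(s_i)} * grad_theta mu |_{s_i}, as in (17).
    a_mu = actor(theta_mu, s)
    w_a = theta_q[n_s:n_s + n_a]               # coefficients of a
    w_a2 = theta_q[n_s + n_a:n_s + 2 * n_a]    # coefficients of a^2
    dq_da = w_a + 2.0 * w_a2 * a_mu            # grad_a Q, shape (batch, n_a)
    grad_mu = s.T @ dq_da                      # chain rule for mu = s @ theta_mu
    theta_mu = theta_mu + alpha * grad_mu      # ascent step, as in (18)
    return theta_q, theta_mu, loss


rng = np.random.default_rng(0)
s = rng.random((32, 6))                        # minibatch of 32 toy states
a = rng.random((32, 3))                        # toy actions [t_L1, t_L2, t_L3]
r = -np.sum((a - 0.5) ** 2, axis=1)            # toy reward, not the paper's
theta_q, theta_mu = np.zeros(6 + 3 + 3 + 1), np.zeros((6, 3))
theta_q, theta_mu, loss_before = ddpg_step(theta_q, theta_mu, s, a, r)
_, _, loss_after = ddpg_step(theta_q, theta_mu, s, a, r)
```

The critic is updated before the actor, matching the order of steps 5 and 7, so the actor gradient is evaluated with the freshly updated critic parameters.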


Algorithm 1 Generation of Training Data
1: Initialization
2: Initialize replay memory D.
3: Generate potential load combinations in the power system.
4: for Load Combination = 1, H do
5:   Select a load combination, calculate its optimal power flow, obtain the initial state s_0.
6:   Randomly generate the potential time delay of the identification process ∆t_I to formulate the identification cases.
7:   for Identification Case = 1, G do
8:     Randomly generate potential action options for each case.
9:     for Action Option = 1, K do
10:      Run the numerical simulation.
11:      Launch the cyber-attack when t = t_a.
12:      Observe identification state s_I when t = t_I.
13:      Execute action option a = [t_L1, t_L2, t_L3].
14:      Observe the post-action state s_N when t = t_N.
15:      Terminate the numerical simulation and calculate r.
16:      Store transition (s_I, a, r) in replay memory D.
17:    end for
18:  end for
19: end for

The parameters of the actor network are updated with the sampled policy gradient

$$\nabla_{\theta^\mu} J = \sum_{i=1}^{\#F} \nabla_a Q(s, a \mid \theta^Q) \big|_{s=s^i,\, a=\mu(s^i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s^i}. \quad (17)$$

Then, the parameters of the actor network are updated by

$$\theta^\mu \leftarrow \theta^\mu + \alpha \nabla_{\theta^\mu} J, \quad (18)$$

where α is the learning rate.

After the training process, the actor network will be deployed for online cyber-attack recovery.

V. NUMERICAL STUDY

A. Simulation Setup

1) The environment: We adopt the modified 9-bus system illustrated in Fig. 2 to verify the performance of the recovery strategy. The parameters of the power system are shown in the appendix. The sampling interval of the load parameters is 0.1 p.u. Assume that the power outputs of the generators follow the schedule of the optimal power flow. The upper bounds of the identification delay T_I and the action delay T_u are 1.5s and 1.0s, respectively, with a sampling interval of 0.05s. The operation delay of the circuit breakers is ε_op = 0.02s. The time step of the numerical simulation is ∆t = 0.01s.

2) Training parameters: The parameters of the scenario generation process are: H = 27, G = 15, K = 400, t_a = 0.1s. The critic network is a 4-layer network, and the numbers of units in its input layer, hidden layers #1 and #2, and output layer are 6, 64, 64 and 2, respectively. The batch size of the sampled transitions F for training is 32. The number of training epochs, M, equals 400,000. The training process takes about 15 minutes on a computer with one


NVIDIA TITAN Xp GPU and one i7-6800K CPU. After the training process, the proposed approach can be deployed for cyber-attack recovery. It takes less than 0.1ms to generate one schedule. The code is written in Python with TensorFlow.

B. Validation of Recovery Strategy

Several numerical simulations are conducted to illustrate the advantages of the proposed scheme. Three schemes are designed as follows.

Scheme 1 is an immediate reclosing strategy following the widely adopted principle "the sooner, the better" [18]. Thus, the tripped lines will be reclosed immediately after the identification of the cyber-attack.

Scheme 2 is a fixed-time strategy based on the reclosing module in the local protective relaying system [19]. The tripped lines will be reclosed after a fixed reclosing delay of 0.8s (40-cycle reclosing at a frequency of 50Hz).

Scheme 3 is the strategy proposed in this paper.

These three strategies will be evaluated on two scenarios, i.e., Scenario 1 with a stable first swing, and Scenario 2 with an unstable first swing. In particular, the load parameters of Scenario 1 are LD1 = 1.20 p.u., LD2 = 1.10 p.u., LD3 = 0.65 p.u., and the identification delay is ∆t_I = 0.60s; the load parameters of Scenario 2 are LD1 = 1.35 p.u., LD2 = 1.20 p.u., LD3 = 0.70 p.u., and the identification delay is ∆t_I = 0.40s. Note that the data generated in Scenario 1 and Scenario 2 are not included in the training data, so that we can evaluate the adaptability of the DDPG algorithm.

1) Scenario 1 with stable first swing: The process from normal operation, to cyber-attack, and then to recovery is numerically simulated. The transient energy curves are recorded in Fig. 7, where t_a = 0.10s is the time when the cyber-attack happens. Then, after an identification delay of 0.50s, the cyber-attack will be identified by the defensive system at t = 0.60s. At this time point, schemes 1∼3 will be implemented. As an immediate reclosing strategy, Scheme 1 will reclose the tripped lines immediately after the identification of the cyber-attack. Since the reclosing delay is 0.02s, the reclosing time for Scheme 1 is t_s1 = 0.62s. Scheme 2 will reclose the tripped lines after a fixed reclosing delay of 0.80s following the tripping activities at t = 0.10s. Therefore, the reclosing signal will be issued at 0.90s, and the reclosing time for Scheme 2 is t_s2 = 0.92s. For scheme 3, the reclosing times of L2 and L3 are t_s32 = 0.83s and t_s33 = 1.10s, respectively.

Fig. 7. The transient energy of three schemes in Scenario 1.

It can be observed that the transient energy stays zero before the cyber-attack. Then, after the attack, the transient energy will begin to change with a certain regularity, and each reclosing operation will modify the regularity to some extent. Finally, once the reclosing operations are completed, the transient energy will no longer change drastically but will decay slowly with the damping of the power system.

In order to maintain the stability of the power system, we need to ensure that the post-action transient energy is as low as possible. As shown in Fig. 7, the post-action transient energies of the three schemes are 0.82 p.u., 0.31 p.u., and 0.06 p.u., respectively. Among all these schemes, scheme 1 with an immediate reclosing is much worse than the other two schemes, while the transient energy of scheme 3 is the lowest.

Subsequently, the generator currents of these three schemes after the recovery actions are shown in Fig. 8. Note that an excessive current may cause the operation of generator protection or even severe damage to the generator. As shown in Fig. 8, the peak values of the generator current of the three schemes are 2.21 p.u., 1.74 p.u., and 1.64 p.u., respectively. We can see that the peak current of scheme 3 is the lowest of all these schemes.

Fig. 8. The generator currents of three schemes in Scenario 1.

The degree of power swing, which is another important index to evaluate the recovery performance, can be measured by the peak-valley range of the rotor angle. The results of the three schemes are shown in Fig. 9 (a)-(c). The peak-valley ranges of G1 after recovery are 0.94 rad, 0.49 rad, and 0.31 rad, respectively; those of G2 are 2.04 rad, 1.21 rad, and 0.51 rad, respectively; while those of G3 are 0.21 rad, 0.15 rad, and 0.09 rad, respectively. We can observe that scheme 3 outperforms the other two from the perspective of power swing.

In summary, the numerical results in this scenario indicate that the proposed strategy can greatly improve the system stability and reduce the reclosing impacts compared to the two baseline schemes.

2) Scenario 2 with unstable first swing: In this scenario, we intend to verify whether an immediate reclosing is essential in the perilous situation in which the system is about to lose its stability. The transient energy and rotor angle of the three schemes are illustrated in Fig. 10 and Fig. 11, respectively. As shown in Fig. 10, the transient energy of scheme 2 keeps increasing monotonically after the recovery actions at t_s2 = 0.92s, indicating that the power system has lost its stability. This observation is confirmed by the rotor angle in Fig. 11 (b). For scheme 1 and scheme 3, which have regained stability after the recovery, the post-recovery transient energies are 0.64 p.u. and 0.07 p.u., respectively. Meanwhile, the ranges of power swing of G1 in Fig. 11 (a) and Fig. 11 (c) are 0.84


rad and 0.36 rad, respectively; those of G2 are 1.76 rad and 0.48 rad, respectively; while those of G3 are 0.32 rad and 0.11 rad, respectively. Thus the performance of scheme 3 is better than that of scheme 1 in this scenario.

We can conclude from the above results that the reclosing action should be implemented in a very short time, or the system is likely to lose stability just like the case of scheme 2. Meanwhile, "the sooner, the better" is not the optimal principle for cyber-attack recovery even in a perilous situation. These results verify that the proposed strategy can keep the power system stable, as well as reduce the impacts of the cyber-attack.

Fig. 9. The power swing of three schemes in Scenario 1. (a) The rotor angle of Scheme 1. (b) The rotor angle of Scheme 2. (c) The rotor angle of Scheme 3.

Fig. 10. The transient energy of three schemes in Scenario 2.

Fig. 11. The power swing of three schemes in Scenario 2. (a) The rotor angle of Scheme 1. (b) The rotor angle of Scheme 2. (c) The rotor angle of Scheme 3.

C. Comparison between DDPG and DQN

In this subsection, the DDPG method adopted in this paper is compared with the DQN method to further validate the merits of the proposed approach.

Firstly, we record the convergence rate of DDPG and DQN during the training process. In order to use the double DQN, we first discretized the action space with a step size of 0.05 p.u. The comparison results are shown in Fig. 12. It can be seen that the DDPG method converges faster than the DQN method. In addition, the training loss of DDPG remains smaller than that of DQN after 400,000 epochs of training.

Fig. 12. The convergence rate of DDPG and DQN.

Then, we generate all the possible actions in Scenario 1 by brute force and obtain the corresponding transient energy. Note that the resolution of the brute-force search is 0.02s. As shown in Fig. 13, the lowest transient energy among them is 0.05 p.u., while those of DDPG and DQN are 0.06 p.u. and 0.11 p.u., respectively. It can be seen that DDPG outperforms the DQN method in terms of output optimality. In addition, the action generated by DDPG is shown to be optimal or, strictly speaking, near-optimal.

Fig. 13. The relationship between reclosing action and transient energy.

Finally, the DDPG and DQN methods are tested with more scenarios to verify the adaptability. The test scenarios are generated by changing the parameter of load LD3 on the basis of Scenario 1. Then, we adopt a brute-force based method to find the top 5% and 10% performances corresponding to these scenarios and record the results in Fig. 14. Note that, among all the scenarios shown in Fig. 14, only the scenarios marked in red are used to generate the training data. It can be seen that the DDPG method always stays within the top 10% of all the potential actions, performing better than the DQN method.

Fig. 14. The performance of DDPG and DQN with more scenarios.

D. Application in Large Realistic System

In this subsection, a realistic benchmark, the Nordic system model [38], is adopted to test the proposed approach. The Nordic system model has been matched to 8760-hour historical data from the Nordpool market in 2015. We assume that BUS 5101 is under a cyber-attack in which all the connected 420kV lines are tripped by the attacker. The transmission lines to be recovered include: L1 from BUS 5501, L2 from BUS 5102, L3 from BUS 5103, and L4 from BUS 3359. After the power flow calculation, L4 turns out to be the only send-out channel all the time, thus it should be reclosed immediately after the identification of the cyber-attack. The action variables to be decided by DDPG are t_L1, t_L2, t_L3. During the training process, we select 4 typical days per month and 6 typical hours per day from the historical data in 2015 to initialize the power system. The simulation is conducted with PSS/E. The parameters of the scenario generation process are: H = 288, G = 10, K = 125, T_a = 1.0s. The rotor angle and rotor speed recorded in the transitions are sampled from the generators in BUS 5600, BUS 6500, and BUS 8500.

In the testing process, a new test scenario, Scenario 3, is adopted to validate the effectiveness of the proposed approach. This scenario is based on the power flow at 6 pm on 6/25/2015. In the numerical simulation, the cyber-attack starts at t = 1.00s and is identified at t = 1.50s. The three schemes to be tested in this scenario are the same as the ones adopted in Section V-B. As to the recovery time, Scheme 1 is [1.52,1.52,1.52,1.52]; Scheme 2 is [1.82,1.82,1.82,1.82]; Scheme 3 is [2.28,1.52,1.77,1.52].

Fig. 15. The transient energy of three schemes in Scenario 3.

Fig. 16. The power swing of three schemes in Scenario 3. (a) The rotor angle of Scheme 1. (b) The rotor angle of Scheme 2. (c) The rotor angle of Scheme 3.

The performance of the three schemes is shown in Figs. 15∼17. As shown in Fig. 15, the post-recovery transient energies of the three schemes (recorded at t = 2.28s) are 0.62 p.u., 1.84 p.u., and 0.49 p.u., respectively. As to the power swing of these three schemes shown in Fig. 16, the peak-valley ranges of G5600 during the period of 2s∼10s are 0.48 rad, 0.81 rad, and 0.39 rad, respectively; those of G6500 are 0.18 rad, 0.35 rad, and 0.14 rad, respectively; while those of G8500 are 0.19 rad, 0.37 rad, and 0.15 rad, respectively. As to the degree of frequency fluctuation shown in Fig. 17, the frequency fluctuations of G5600 after recovery are 49.79∼50.23 Hz, 49.65∼50.38 Hz, and 49.84∼50.17 Hz, respectively; those of G6500 are 49.92∼50.09 Hz, 49.90∼50.14


Hz, and 49.94∼50.07 Hz, respectively; while those of G8500 are 49.93∼50.12 Hz, 49.88∼50.21 Hz, and 49.94∼50.11 Hz, respectively. We can observe that scheme 3 keeps performing better than the other two from all the perspectives, i.e., transient energy, power swing, and frequency fluctuation.

To sum up, the proposed approach remains effective when applied to a large and realistic power system.

Fig. 17. The frequency fluctuation of three schemes in Scenario 3. (a) The frequency of Scheme 1. (b) The frequency of Scheme 2. (c) The frequency of Scheme 3.

VI. CONCLUSION

In this paper, a deep RL based recovery strategy is proposed to optimally reclose the transmission lines lost in the cyber-attack. By optimizing the reclosing time, the proposed strategy enables a better recovery performance than that of other recovery strategies. Meanwhile, the deep RL framework endows the recovery strategy with environmental adaptability and real-time decision-making ability. Numerical results indicate that the proposed strategy can minimize the cyber-attack impacts under different cyber-attack scenarios.

APPENDIX

Compared with the standard WSCC 9-bus system, the modified one differs in the parameters of transmission lines and loads, as shown in Tables A.I and A.II, respectively.

TABLE A.I
PARAMETERS OF TRANSMISSION LINES

From (Bus)   To (Bus)   Res (p.u.)   Xin (p.u.)   Cap (p.u.)
4            5          0.064        0.364        0.250
4            6          0.064        0.364        0.200
5            7          0.108        0.628        0.600
6            10         0.064        0.364        0.160
6            9          0.108        0.628        0.300
9            10         0.064        0.364        0.100
7            8          0.064        0.364        0.240
8            10         0.064        0.364        0.060

TABLE A.II
PARAMETERS OF LOADS

Parameter   Location (Bus)   Active power (p.u.)   Reactive power (p.u.)
LD1         5                1.10∼1.30             0.45
LD2         6                1.00∼1.20             0.45
LD3         8                0.50∼0.70             0.25

REFERENCES

[1] R. Liu, C. Vellaithurai, S. S. Biswas, T. T. Gamage, and A. K. Srivastava, “Analyzing the cyber-physical impact of cyber events on the power grid,” IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2444–2453, Sep. 2015.
[2] U.S. Dept. Homeland Security, “ICS-CERT year in review 2016,” 2016.
[3] S. Liu, B. Chen, T. Zourntos, D. Kundur, and K. Butler-Purry, “A coordinated multi-switch attack for cascading failures in smart grid,” IEEE Trans. Smart Grid, vol. 5, no. 3, pp. 1183–1195, May 2014.
[4] C. Liu, M. Zhou, J. Wu, C. Long, and D. Kundur, “Financially motivated FDI on SCED in real-time electricity markets: Attacks and mitigation,” IEEE Trans. Smart Grid, vol. 10, no. 2, pp. 1949–1959, March 2019.
[5] H. Zhang, Y. Qi, J. Wu, L. Fu, and L. He, “DoS attack energy management against remote state estimation,” IEEE Trans. Control of Network Syst., vol. 5, no. 1, pp. 383–394, March 2018.
[6] J. Hong, C. Liu, and M. Govindarasu, “Integrated anomaly detection for cyber security of the substations,” IEEE Trans. Smart Grid, vol. 5, no. 4, pp. 1643–1653, July 2014.
[7] E. Mousavinejad, F. Yang, Q. Han, and L. Vlacic, “A novel cyber attack detection method in networked control systems,” IEEE Trans. Cybern., vol. 48, no. 11, pp. 3254–3264, Nov 2018.
[8] K. Lai, M. Illindala, and K. Subramaniam, “A tri-level optimization model to mitigate coordinated attacks on electric power systems in a cyber-physical environment,” Applied Energy, vol. 235, pp. 204–218, February 2019.
[9] Y. Zhang, L. Wang, and Y. Xiang, “Power system reliability analysis with intrusion tolerance in SCADA systems,” IEEE Trans. Smart Grid, vol. 7, no. 2, pp. 669–683, March 2016.
[10] C. Zhao, J. He, P. Cheng, and J. Chen, “Analysis of consensus-based distributed economic dispatch under stealthy attacks,” IEEE Trans. Ind. Electron., vol. 64, no. 6, pp. 5107–5117, June 2017.
[11] C. Wang, C. Ten, Y. Hou, and A. Ginter, “Cyber inference system for substation anomalies against alter-and-hide attacks,” IEEE Trans. Power Syst., vol. 32, no. 2, pp. 896–909, March 2017.
[12] G. Liang, S. R. Weller, J. Zhao, F. Luo, and Z. Y. Dong, “The 2015 Ukraine blackout: Implications for false data injection attacks,” IEEE Trans. Power Syst., vol. 32, no. 4, pp. 3317–3318, July 2017.
[13] J. P. Farwell and R. Rohozinski, “Stuxnet and the future of cyber war,” Survival, vol. 53, no. 1, pp. 23–40, 2011.
[14] C. Ten, K. Yamashita, Z. Yang, A. V. Vasilakos, and A. Ginter, “Impact assessment of hypothesized cyberattacks on interconnected bulk power systems,” IEEE Trans. Smart Grid, vol. 9, no. 5, pp. 4405–4425, Sep. 2018.
[15] A. A. Babalola, R. Belkacemi, and S. Zarrabian, “Real-time cascading failures prevention for multiple contingencies in smart grids through a multi-agent system,” IEEE Trans. Smart Grid, vol. 9, no. 1, pp. 373–385, Jan 2018.
[16] L. Chen, H. Zhang, Q. Wu, and V. Terzija, “A numerical approach for hybrid simulation of power system dynamics considering extreme icing events,” IEEE Trans. Smart Grid, vol. 9, no. 5, pp. 5038–5046, Sep. 2018.
[17] C. Wang, Y. Hou, F. Qiu, S. Lei, and K. Liu, “Resilience enhancement with sequentially proactive operation strategies,” IEEE Trans. Power Syst., vol. 32, no. 4, pp. 2847–2857, July 2017.
[18] S. Soltan and G. Zussman, “EXPOSE the line failures following a cyber-physical attack on the power grid,” IEEE Trans. Control of Network Syst., vol. 6, no. 1, pp. 451–461, March 2019.
[19] “IEEE guide for automatic reclosing of line circuit breakers for AC distribution and transmission lines,” IEEE Std C37.104-2002, pp. 1–62, April 2003.
[20] Y. Mansour, E. Vaahedi, A. Y. Chang, B. R. Corns, and et.al, “BC Hydro’s on-line transient stability assessment (TSA) model development,


analysis and post-processing,” IEEE Trans. Power Systems, vol. 10, no. 1, pp. 241–253, Feb 1995.
[21] M. B. Djuric and V. V. Terzija, “A new approach to the arcing faults detection for fast autoreclosure in transmission systems,” IEEE Trans. Power Delivery, vol. 10, no. 4, pp. 1793–1798, Oct 1995.
[22] B. H. Zhang, Y. C. Yuan, Z. Chen, and Z. Q. Bo, “Computation of optimal reclosure time for transmission lines,” IEEE Trans. Power Syst., vol. 17, no. 3, pp. 670–675, Aug 2002.
[23] J. Guo, F. Liu, J. Wang, J. Lin, and S. Mei, “Toward efficient cascading outage simulation and probability analysis in power systems,” IEEE Trans. Power Syst., vol. 33, no. 3, pp. 2370–2382, May 2018.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, March 2015.
[25] Z. Wan, H. Li, H. He, and D. Prokhorov, “Model-free real-time EV charging scheduling based on deep reinforcement learning,” IEEE Trans. Smart Grid, pp. 1–1, 2019.
[26] Z. Wan, C. Jiang, M. Fahad, Z. Ni, Y. Guo, and H. He, “Robot-assisted pedestrian regulation based on deep reinforcement learning,” IEEE Trans. Cybernetics, pp. 1–14, 2018.
[27] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, “Deep-reinforcement-learning-based autonomous voltage control for power grid operations,” IEEE Trans. Power Syst., pp. 1–1, 2019.
[28] Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search,” IEEE Trans. Power Syst., vol. 34, no. 2, pp. 1653–1656, March 2019.
[29] J. Duan, Z. Yi, D. Shi, C. Lin, X. Lu, and Z. Wang, “Reinforcement-learning-based optimal control of hybrid energy storage systems in hybrid AC-DC microgrids,” IEEE Trans. Ind. Informat., vol. 15, no. 9, pp. 5355–5364, Sep. 2019.
[30] J. Duan, H. Li, X. Zhang, R. Diao, B. Zhang, D. Shi, X. Lu, Z. Wang, and S. Wang, “A deep reinforcement learning based approach for optimal active power dispatch,” arXiv preprint arXiv:1908.11543, 2019.
[31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” ICLR, 2016.
[32] V. Bui, A. Hussain, and H. Kim, “Double deep Q-learning-based distributed operation of battery energy storage system considering uncertainties,” IEEE Trans. Smart Grid, pp. 1–1, 2019.
[33] H. D. Chiang, F. F. Wu, and P. P. Varaiya, “A BCU method for direct analysis of power system transient stability,” IEEE Trans. Power Syst., vol. 9, no. 3, pp. 1194–1208, Aug 1994.
[34] F. Dorfler and F. Bullo, “Synchronization of power networks: Network reduction and effective resistance,” IFAC Proceedings Volumes, vol. 43, no. 19, pp. 197–202, 2010.
[35] F. L. Alvarado, R. H. Lasseter, and J. J. Sanchez, “Testing of trapezoidal integration with damping for the solution of power transient problems,” IEEE Trans. Power Apparat. Syst., vol. PAS-102, no. 12, pp. 3783–3790, Dec 1983.
[36] H. D. Chiang and C. C. Chu, “Theoretical foundation of the BCU method for direct stability analysis of network-reduction power system models with small transfer conductances,” IEEE Trans. Circuit. Syst. I: Fund. Theory and Appl., vol. 42, no. 5, pp. 252–265, May 1995.
[37] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
[38] L. Vanfretti and S. H. Olsen, “An open data repository and a data processing software toolset of an equivalent Nordic grid model matched to historical electricity market data,” Data in Brief, vol. 11, pp. 349–357, 2017.

Fanrong Wei received the B.S. and Ph.D. degrees in electrical engineering from Huazhong University of Science and Technology (HUST) in 2013 and 2018. He is currently a research scholar at the University of Rhode Island (URI). His research mainly focuses on cyber-physical security, optimal power system/microgrid scheduling, and protective relays.

Zhiqiang Wan (S’16) received his B.S. degree from the Harbin Institute of Technology, Harbin, China, in 2012. He received his M.S. degree in the School of Electrical and Electronics Engineering, Huazhong University of Science and Technology (HUST) in 2015. He is presently working towards his Ph.D. degree in the School of Electrical, Computer and Biomedical Engineering, University of Rhode Island (URI), Kingston, Rhode Island, USA. His current research interests include deep learning, deep reinforcement learning, and cyber-physical systems, with a particular interest in smart grid applications. He was a recipient of the URI Graduate Student Research & Scholarship Excellence Award in the Life Sciences, Physical Sciences, and Engineering in 2019, the Best Paper Award in the IEEE Power & Energy Society General Meeting (PESGM) in 2018, and the Best Paper Award in the IEEE 11th International Conference on Power Electronics and Drive Systems (PEDS) in 2015.

Haibo He (SM’11-F’18) received the B.S. and M.S. degrees in electrical engineering from the Huazhong University of Science and Technology, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University in 2006. He is currently the Robert Haas Endowed Chair Professor at the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island. His current research interests include computational intelligence, machine learning, data mining, and various applications. He has published one sole-author research book (Wiley), edited one book (Wiley-IEEE) and six conference proceedings (Springer), and authored and co-authored more than 300 peer-reviewed journal and conference papers. He was the general chair of the IEEE Symposium Series on Computational Intelligence (SSCI 2014). He received the IEEE International Conference on Communications Best Paper Award (2014), IEEE CIS Outstanding Early Career Award (2014), and National Science Foundation CAREER Award (2011). He is currently the Editor-in-Chief of the IEEE Transactions on Neural Networks and Learning Systems.
