0% found this document useful (0 votes)
19 views

Applications of Deep Reinforcement Learning in Nuclear Energy:a Review

This review paper discusses the applications of deep reinforcement learning (DRL) in the nuclear energy sector, highlighting its advantages in solving complex decision-making problems. It summarizes the principles of DRL, its current research status in other energy fields, and outlines potential future directions for its application in nuclear energy. The paper aims to provide insights into the capabilities of DRL and its relevance for addressing challenges in the nuclear industry.

Uploaded by

302602682
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views

Applications of Deep Reinforcement Learning in Nuclear Energy:a Review

This review paper discusses the applications of deep reinforcement learning (DRL) in the nuclear energy sector, highlighting its advantages in solving complex decision-making problems. It summarizes the principles of DRL, its current research status in other energy fields, and outlines potential future directions for its application in nuclear energy. The paper aims to provide insights into the capabilities of DRL and its relevance for addressing challenges in the nuclear industry.

Uploaded by

302602682
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Nuclear Engineering and Design 429 (2024) 113655

Contents lists available at ScienceDirect

Nuclear Engineering and Design


journal homepage: www.elsevier.com/locate/nucengdes

Applications of deep reinforcement learning in nuclear energy: A review


Yongchao Liu a,b, Bo Wang a,b,* , Sichao Tan a,b,*, Tong Li a,b, Wei Lv a,b ,
Zhenfeng Niu a,b, Jiangkuan Li a,b , Puzhen Gao a,b, Ruifeng Tian a,b
a
College of Nuclear Science and Technology, Harbin Engineering University, Harbin 150001, PR China
b
Heilongjiang Provincial Key Laboratory of Nuclear Power System & Equipment, Harbin Engineering University, Harbin 150001, PR China

A R T I C L E I N F O A B S T R A C T

Keywords: In recent years, Deep reinforcement learning (DRL), as an important branch of artificial intelligence (AI), has
Nuclear energy been widely used in physics and engineering domains. It combines the perceptual advantages of deep learning
Artificial intelligence (DL) and the decision-making advantages of reinforcement learning (RL), and is very suitable for solving the
Deep reinforcement learning
“perception-decision” problem with high-dimensional and nonlinear characteristics. In this paper, firstly, the
Review
algorithm principle, mainstream framework, characteristics and advantages of DRL are summarized. Secondly,
Prospects
the application research status of DRL in other energy fields is reviewed, which provides reference for the
possible impact and future research direction in the field of nuclear energy. Thirdly, the main research directions
of DRL in the field of nuclear energy are summarized and commented, and the application architecture and
advantages of DRL are illustrated through specific application cases. Finally, the advantages, limitations and
future development direction of DRL in the field of nuclear energy are discussed. The goal of this review is to
provide an understanding of DRL capabilities along with state-of-the-art applications in nuclear energy to re­
searchers wishing to address new problems with these methods.

which is huge and equipped with a large number of sensors. Massive


high-dimensional data are generated in the process of operation,
1. Introduction
which contains a lot of decision-making value information to be
mined.
As a safe, clean, stable and reliable energy supply mode, nuclear
• Complex operating characteristics. The operation of nuclear energy
energy is helpful to alleviate global climate change and environmental
system is the coupling embodiment of nuclear physics, flow, heat
pollution, reduce global dependence on fossil energy, ensure energy
transfer, chemistry and many other processes, which has the char­
security and realize sustainable development. In recent years, the deep
acteristics of nonlinearity, strong coupling, multivariable and time-
integration of information technology and industry has become a trend,
varying. It is difficult to make effective decision analysis for dy­
and various digital technologies represented by artificial intelligence
namic operation process.
(AI) have been rapidly applied in many high-tech fields and achieved
• Multi-objective operation requirements. The operation of nuclear
excellent application results. The nuclear energy industry has accumu­
energy system has the requirements of safety, economy and
lated a large number of underutilized operation data for decades, and AI
maneuverability. It is necessary to dynamically adjust the optimal
technology based on data-driven has great application potential in the
allocation of resources according to different objectives, and at the
field of nuclear energy. AI technology can reduce the operating cost of
same time, weigh multiple objectives to realize the decision of
the nuclear industry chain at all stages, greatly reduce the operational
optimal operation strategy.
risk of human beings, and effectively improve the economy and safety of
• Autonomous control target. The function of AI is reflected in assist­
the nuclear industry. In order to realize the deep integration of AI and
ing operators in decision-making, reducing the operation burden and
nuclear energy industry, we must first make clear the characteristics and
supporting the realization of the future goal of completely autono­
requirements of nuclear energy industry system.
mous control operation of nuclear energy engineering system.
• The system structure is complex and the data dimension is high. The
In recent years, DL as an important research hotspot in the field of AI,
nuclear energy system is composed of many subsystems and devices,

* Corresponding authors.
E-mail addresses: [email protected] (B. Wang), [email protected] (S. Tan).

https://doi.org/10.1016/j.nucengdes.2024.113655
Received 24 May 2024; Received in revised form 11 October 2024; Accepted 17 October 2024
0029-5493/© 2024 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

• Achieve multi-objective common optimization. DRL can achieve the


Nomenclature optimization of specified object and integrate the requirements of
optimization objects into reward design. By adjusting the energy
Acronyms distribution strategy, the balance between supply and demand and
AC Actor-Critic the maximization of energy utilization efficiency can be achieved.
A2C Advantage Actor-Critic • Good robustness and adaptability. DRL is data-driven, so it can
A3C Asynchronous Advantage Actor-Critic adaptively adjust and optimize strategies according to new data and
ACKTR Actor-Critic using Kronecker-Factored Trust Region environmental changes. This makes it have great advantages in
AI Artificial Intelligence dealing with the dynamic and uncertain problems in the energy
DDPG Deep Deterministic Policy Gradient industry.
DL Deep Learning
DNN Deep Neural Networks Based on the above analysis, it can be seen that DRL is very suitable
DPG Deterministic Policy Gradient for solving the difficult problems of “perception” and “decision-making”
DQN Deep Q-network with high dimensional and nonlinear characteristics in the nuclear en­
DRL Deep Reinforcement Learning ergy field. Although the application potential is huge, the literature on
KFAC Kronecker-Factored Approximate Curvature the application of DRL technology in the field of nuclear energy is still
MDP Markov Decision Process limited, and there is no corresponding systematic review to summarize
ML Machine Learning the research progress, existing challenges and future development
PPO Proximal policy optimization prospects of DRL in the field of nuclear energy.
RL Reinforcement Learning This review investigates and sorts out the theoretical basis of DRL,
SAC Soft Actor-Critic the application research work of DRL in other non-nuclear energy fields,
TD Timing Difference and summarizes and analyzes the application of DRL in the nuclear
TD3 Twin Delayed Deep Deterministic Policy Gradient energy field. Combining the above research, the research progress,
TRPO Trust Region Policy Optimization application limitations and future development trends of DRL technol­
ogy in the field of nuclear energy are given. It is hoped that this study
can provide relevant reference for researchers to conduct DRL research
in the field of nuclear energy.
has made great breakthroughs in image recognition (Tian, 2020; Liu
et al., 2020; Ai et al., 2021), speech recognition (Al Smadi et al., 2015; 2. Foundations of DRL
Corea and Corea, 2019; Hema and Marquez, 2023), natural language
processing (Lalwani et al., 2018; Zaib et al., 2020; Juhn and Liu, 2020) In this section, the basic concepts of DRL are introduced. It first
and other fields. A large number of scholars in the nuclear field have also
clarifies the boundaries and positioning of RL in the field of machine
carried out a lot of applied research on DL to solve key issues such as learning (ML), then introduces the basic concepts and main methods of
digital modeling of nuclear reactor systems (Oliveira and Almeida,
RL, and finally focuses on the combination of DL and RL, and outlines
2013; Li et al., 2016; Sobes et al., 2021), fault diagnosis (Qi et al., 2023; the principles, application characteristics, and its development of the
Liu et al., 2019; Jiang et al., 2020), time series prediction (El-Sefy et al.,
current mainstream DRL.
2021; Kim et al., 2021; Lei et al., 2022), design and operation optimi­
zation (Li et al., 2022; Hiscox et al., 2020; Mc Leod and Rivera, 2023),
and control (Zhou and Tan, 2023; Li et al., 2016). Different from DL; 2.1. ML Branches
which focuses on the perception and expression of things, reinforcement
learning is essentially more inclined to learn the optimal strategies for ML consists of four learning styles: supervised learning, unsupervised
solving problems, especially dynamic problems. AlphaGo, as the world’s
first AI agent to defeat human professional Go players and world
champions, its main working principle is RL. RL is widely used in robot
control (Perrusquía et al., 2021); (Salvato et al., 2021), energy system
(Jasmin et al., 2011); (Yang et al., 2021), games (Jaderberg et al., 2019);
(Khabbaz et al., 2017), and other fields. With the increasing complexity
of tasks, RL increasingly needs to rely on DL to learn the abstract rep­
resentation of input data, thus forming a frontier research hotspot in the
field of AI, namely DRL. DRL has excellent application effects in in­
dustrial fields such as autonomous driving, robot control, energy in­
dustry and power grid system. DRL has the following application
advantages in solving energy industry problems.

• Strong perception and decision-making ability. DRL combines the


perceptual ability of DL and the decision-making ability of RL. It can
learn and extract useful features directly from a large number of
energy industry data, and make optimal decisions for complex
operation scenarios.
• Dealing with complexity and high-dimensional problems. The en­
ergy industry system has complex system structure and dynamic
operation characteristics. At the same time, the input and output of
energy industry problems usually have high-dimensional character­
istics. DRL can deal with complex, continuous and high-dimensional
state space and action space by using the powerful nonlinear fitting
advantage of deep neural network. Fig.1. Branches and applications of ML algorithms.

2
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

learning, semi-supervised learning and RL. Fig. 1 illustrates the bound­ of RL algorithms.
aries and applications between the four learning styles. Supervised
learning enables prediction and classification in unlabeled data by 2.2.2. Value-based methods
learning from already labeled data. Unsupervised learning is used to In a value-based approach, agent aims to learn an optimal value
process unlabeled data and automatically learns to discover patterns and function, which in turn determines its policy by selecting the action with
structures in the data in order to classify and make predictions on un­ the highest estimated value. Typically, we define the state value func­
known data. Semi-supervised learning is a type of learning between tion and the state-action value function.
supervised and unsupervised learning. It mainly uses a small portion of The state-value function is defined as:
labeled data to train the model and a large amount of unlabeled data to
V π (s) = Eτ∼π [R(τ) |s ] (1)
refine the model in order to improve its predictive ability. RL learns the
optimal decision-making policy through the interaction of the agent where V π (s) represents the expected discounted cumulative reward
with the environment, with the goal of maximizing the cumulative following the trajectory τ according to the policy π from the state st
reward as the learning objective. RL has good applications in robot onwards. Here, E denotes the expectation and R(τ) represents the reward
navigation, autonomous driving, industrial control and other fields. value of trajectory τ.Meanwhile, the trajectory τ refers to the sequence of
states, actions and rewards that an agent encounters in the environment
2.2. RL as it follows its policy π.
The action-value function is defined as:
RL is usually categorized into two types, model-free algorithms and
Qπ (s, a) = E [R(τ)|s, a] (2)
model-based algorithms (Sutton and Barto, 2018). Model-based ap­ τ∼π
proaches incorporate a model of the environment in which they interact.
which denotes expected discounted cumulative reward for starting
Whereas, this paper focuses on model-free RL algorithms, which have
from state st and taking action at and then following trajectory τ ac­
been widely studied for their ease of use and implement ability in
cording to policy π.
various domains. Model-free approaches are further distinguished into
The following relationship exists between the state value function
value-based and policy-based approaches (Sutton and Barto, 1999;
and the state-action value function:
Thrun and Littman, 2000).
RL includes several main components, including agent, environment, V π (s) = E [Qπ (s, a)] (3)
a∼π
state, and action, as shown in Fig. 2. Among them, agent is the entity
that performs actions and learns from the environment, and updates the That is, V π (s) is the weighted average of the probabilities of Qπ (s, a)
policy function and value function through data samples; environment is for all possible actions on a per-action basis.
the world with which the agent interacts; state is the specific situation of One of the value-based approaches is Q-learning, which finds the
the environment at a certain moment, which is usually used as the basis optimal policy by learning the Q-function. In traditional Q-learning, the
for decision-making; action is the operation that agent can perform to Q-function is represented by a Q-table, which provides an estimate
influence the environment; reward is the direct feedback for performing Q* (s, a) of the optimal Q-value for each of the state-action pairs (s, a) in
a specific action, which is the basis for RL. the state space S and the action space A. The Q-table starts with a
random initialization, and its values are gradually updated according to
2.2.1. Markov process Bellman’s optimality conditions as the agent continues to explore the
Markov process is a mathematical model used in RL to describe the environment (Bellman, 1966):
environment, and its core property is that the determination of the next
Q* (s, a) = R(s, a) + γmaxaʹ Q* (sʹ, aʹ) (4)
state of the system depends only on the current state and is independent
of past states or actions, i.e., it is Markovian. A Markov process consists where R(s, a) is the reward received for taking action α in state s , and γ is
of a set of states and their transfer probabilities. The introduction of the discount factor for future rewards.
actions and rewards yields a Markov decision process (Bellman, 1957;
Howard, 1960) (MDP). The MDP consists of an environment state (S), an 2.2.3. Policy-based methods
action (A), a reward (R), a state transfer probability (T) and a discount The policy-based approach optimizes the parameterized policy
factor (γ). The policy π is a mapping from state space to action space. πθ (a|s ) directly instead of determining the optimal policy through value
Agent takes action at ∈ A in state st ∈ S according to the policy π, moves function estimation (Thrun and Littman, 2000). Compared to the value-
to the next state according to the state transfer probability T, and re­ based approach, the policy-based approach naturally handles high-
ceives reward rt ∈ R. The goal of RL is to optimize the policy to maximize dimensional action spaces and is capable of learning stochastic pol­
the reward. The MDP provides a clear framework for dealing with this icies. Although they may fall into local minima, policy-based methods
type of decision-making problem and lays the foundation for the design have more stable convergence.
In order to evaluate the merits of a policy, it is necessary to define an
objective function that is based on the expected cumulative reward:
J(θ) = E [R(τ)] (5)
τ∼π θ

The optimization goal is to find the optimal parameter θ* to maxi­


mize J(θ), that is,

θ* = argmaxθ E[R(τ)] (6)


To this end, it is necessary to calculate the gradient of the cost
function about θ. This task is initially not simple because we need to find
the gradient of the policy parameter θ, and the effect of policy changes
on the state distribution is unknown.
In fact, a change in policy alters the set of accessed states, which may
affect performance in unpredictable ways. Using the chain-law trick of
Fig.2. RL Framework.

3
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

log-probability, we can represent ∇θ J(θ) as a computable expectation: Three main classes of algorithms and their variants in policy gradient
[ ] methods include deep deterministic policy gradient (DDPG) (Lillicrap
∑T
∇θ J(θ) = E ∇θ log(πθ (at |st ) )R(τ) (7) et al., 2015), statistical methods for natural gradients, and the Actor-
τ∼π θ
t=0 Critic (AC) (Sutton and Barto, 2018) framework. DDPG is the most
common one, which is a combination of deterministic policy gradient
where log denotes the logarithm function, capturing the log-proba­ (DPG) (Silver et al., 2014) and AC. Its variants mainly include Distrib­
bility’s contribution, and ∇θ signifies the gradient with respect to the uted DDPG (Barth-Maron et al., 2018) and twin delayed deep deter­
policy parameters θ . ministic policy gradient (TD3) (Fujimoto et al., 2018), among others.
This gradient is then utilized to update the policy parameters: Statistical methods for natural gradient mainly include trust region
policy optimization (TRPO) (Schulman et al., 2015) and variants Actor-
θ←θ + λ∇θ J(θ) (8)
Critic using kronecker-factored trust region (ACKTR) (Wu et al., 2017)
where λ specifies the learning rate, which is the step size for the updates and proximal policy optimization (PPO) (Schulman et al., 2017), etc. AC
to the parameter θ . frameworks mainly include Advantage Actor-Critic (A2C) (Babaeizadeh
et al., 2016), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al.,
2016) and soft Actor-Critic (SAC) (Haarnoja et al., 2018), etc.
2.3. DRL
2.3.1. DL
The rise and development of DRL in the field of information tech­ DL is a technique that mimics the human brain’s ability to process
nology is closely related to the breakthrough of RL and the improvement information and is based on neural networks. These networks transmit
of computational performance (LeCun et al., 2015). Among them, deep signals through weighted connections and use activation functions to
neural networks (Schmidhuber, 2015) (DNN) contain more hidden model nonlinear processing. Neural networks learn and optimize per­
layers between the input and output layers, which makes it possible to formance by calculating the gradient of the loss function and updating
extract high-level features in complex and high-dimensional data spaces the parameters using back-propagation. More knowledge on DL can be
with superior prediction effects. DRL synthesizes the decision-making found in the literature (LeCun et al., 2015; Schmidhuber, 2015).
advantages of RL and the perceptual advantages of DNNs, bridges the
gap between high-dimensional perceptual inputs and actions, and en­ 2.3.2. DRL algorithm based on value function
ables the extension of RL to the high-dimensional state and action spaces DRL algorithms based on value functions mainly approximate the
with decision-making problems. Nowadays, DRL has been applied in value function or the action value function by means of DNNs. These
several research areas, and since 2010, the number of publications algorithms usually use temporal differential learning (Sutton, 1988) or
mentioning “deep reinforcement learning” has been increasing Q-learning methods (Watkins and Dayan, 1992) to update the functions.
geometrically every year, as shown in Fig. 3. Currently, the main value function-based DRL algorithms are shown in
Model-free DRL-based algorithms can be categorized into two clas­ Fig. 4, and the principal features of DQN and its variant algorithms are
ses, value function-based and policy gradient-based, at the theoretical described below.
level, as shown in Fig. 4. Among them, deep Q-network (DQN) (Mnih
et al., 2015) is the originator of the DRL algorithm that approximates the (1) DQN
action-valued function by DNNs. In order to enhance the speed and
stability of DQN algorithms, many scalability methods and improvement DQN is a breakthrough DRL algorithm, proposed by DeepMind team
ideas have been continuously proposed, including but not limited to in 2015 (Mnih et al., 2015), and has achieved to outperform human
Distributional DQN (Nair et al., 2015), Double DQN (Van Hasselt et al., players on Atari 2600 game. The specific algorithm update process is
2016), Prioritized Experience Replay DQN (Schaul et al., 2015), Dueling shown in Fig. 5.
DQN (Wang et al., 2016), Noisy DQN (Fortunato et al., 2017), Rainbow Assuming that the neural network model is f, its parameter is θ, and
DQN (Hessel et al., 2018), etc. the true action-value function is Q(s, a), the DQN learning algorithm
On the other hand, DL can also be used to approximate policies. wants to solve the optimal parameter so that the output of the neural
network is as close as possible to the value of the true action-value
function Q, i.e., f(s, a|θ) ≈ Q(s, a).
In DQN algorithm also need to estimate the value of the action like Q-
learning algorithm, in order to make the neural network approximation
of the DQN algorithm of the value of the action and the real value of the
action with less error, the defined objective function is shown in Eq. (9).

J(θ) = E[(R(s, a) + γmaxQ(sʹ, aʹ) − f(s, a|θ))2 ] (9)


a∈A

where sʹ denotes the next state, aʹ denotes the action selected at the next
state, θ denotes the parameters of the neural network, f(s, a|θ) is the
value estimated by the neural network, and R(s, a) +γmaxa∈A Q(sʹ, aʹ)
denotes the timing difference (TD) error objective. To optimize the pa­
rameters of the neural network, the gradient descent method is used to
minimize the objective function, and the update formula is shown in Eq.
(10).
∂J(θ)
θ = θ− α (10)
∂θ

where α specifies the learning rate, which is the step size for the updates
to the parameter θ .
Fig.3. Number of publications mentioning “Deep Reinforcement Learning” The DQN algorithm introduces an empirical replay method and a
(Garnier et al., 2021).

4
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.4. Classification of DRL algorithms.

Fig.5. DQN Principal Framework.

dual network method to improve itself. The empirical replay method can same environment and acquiring varied empirical data.
effectively reduce the correlation of each consecutive sample of the Experience replay mechanism: the experience of all actors interact­
training, which facilitates the DL perception, and the training network ing with the environment is collected and stored in a shared experience
f(s, a|θ) in the dual network structure is used to evaluate the value of the pool.
current action in the output, and the target network f(sʹ, a|θ* ) is used to Parallel learners: multiple learners compute the gradient of the loss
evaluate the value of the action in the temporal difference error target. function in parallel, utilizing data from the experience pool and
uploading the gradient to the parameter server for updating the pa­
(2) Distributional DQN. rameters of the Q-network.
Parameter server: accepts the gradient information from the learner
Distributional DQN (Nair et al., 2015) proposes a distributed archi­ and updates the parameters of the Q-network by the gradient descent
tecture to solve the problem of DQN relying on a single machine during method.
training, which results in long training cycles. This architecture is
divided into four core parts: (3) Double DQN.
Parallel actors: the architecture deploys multiple actors (N), each
with its own copy of the Q-network, executing different policies in the Double DQN (Van Hasselt et al., 2016) addresses the problem of Q-
learning overestimation bias through decoupling selection and

5
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

bootstrap action evaluation. The maxa∈A Q(sʹ, aʹ) in the temporal differ­ specific applications. DeepMind’s research demonstrates the successful
ence error objective can be transformed as shown in the following integration of these techniques and points to the synergistic effects they
equation. can have when brought together. The ablation study further elucidates
the specific contribution of each component to the overall performance.
maxa∈A Q(sʹ, aʹ) = Qw− (sʹ, argmaxaʹ Qw− (sʹ, aʹ) ) (11)
The experimental validation results are shown in Fig. 6, where the ex­
The computation of Q-values in the above equation all use the target periments show that the Rainbow DQN performs well in benchmarks on
network. Double DQN for external Q-value computation uses the target the Atari 2600, with significant improvements in both learning effi­
network and internal Q-value computation uses the training network. ciency and final performance. The contribution of each DQN variant to
The improved TD error target is shown in the following equation. the overall performance is also shown based on the detailed ablation
experiment results.
r + γQw− (sʹ, argmaxaʹ Qw (sʹ, aʹ) ) (12)
In the above formula, r represents the reward at the current timestep. 2.3.3. DRL algorithm based on policy gradient
The remaining part signifies the discounted reward estimated by the Q Policy gradient-based DRL algorithms use policy gradient methods to
networks, where Qw− denotes the Q-value from the target network, and directly parameterize a policy and learn the policy parameters by opti­
Qw denotes the Q-value from the training network. mizing an objective function with expected returns. The advantage of
The Dueling DQN (Schaul et al., 2015) algorithm proposes a new such methods is that they are naturally applicable to high-dimensional
Dueling network architecture that can generalize various actions by action spaces and continuous action spaces, and are capable of
representing state values and action rewards separately. The Dueling learning stochastic policies. Currently, the main DRL algorithms based
DQN algorithm pays more attention to the differences in the dominance on policy gradient are shown in Fig. 4, and the principal features of the
values of different actions with the Q-value function as shown in the main algorithms are described in detail below.
following equation. The policy network in the policy gradient algorithm is denoted by πθ ,
where θ is the weight parameter of the policy. The optimization objec­
Qη,α,β (s, a) = Vη,α (s) + Aη,β (s, a) (13) ∑
tive of the algorithm is to maximize maxθ E[R|πθ ], where R = Tt=0 rt
Vη,α (s) denotes the state value function, Vη,α (s) denotes the action represents the total reward value obtained by the intelligent in a series of
dominance function, η denotes the network parameters shared by both, actions. The updating process of the policy gradient algorithm is
and α and β denote the network parameters of the state value function detailed as follows: assuming that a complete set of states, actions, and
and action dominance function, respectively. rewards constitutes a trajectory τ = (s0 , a0 , r1 , ⋯, sT− 1 , aT− 1 , rT− 1 , sT ),
then the policy gradient can be expressed as:
(4) Prioritized Experience Replay DQN. T− 1

g = R∇θ lnπ (at |st ; θ) (14)
Prioritized Experience Replay DQN (Wang et al., 2016) proposes a t=0

prioritized experience replay method to improve the speed of policy


In the above equation, R guides the direction of the trajectory up­
updates. The method prioritizes the experience based on the magnitude
date. When R is positive and large, the probability of occurrence of the
of the TD error; the larger the absolute value of the TD error, the higher
trajectory of the corresponding action sequence increases; vice versa, it
the probability that the corresponding experience will be sampled. Pri­ ∑ 1
decreases. ∇θ T− τ=0 lnπ (at |st ; θ) pushes the policy to update in the direc­
ority experience playback uses two techniques, stochastic prioritization
tion of the fastest gradient change. The rules for updating the policy
and importance-sampling weight, to improve the sampling process.
parameters are as follows:
Stochastic prioritization samples by TD error probability, increasing
sample diversity and ensuring that all samples have a chance of being θ←θ + αg (15)
selected. Importance-sampling weights, on the other hand, regulate the
parameter update rate and maintain learning stability. where α represents the learning rate. After many iterations, the policy
network will increase the probability of trajectories with higher cumu­
(5) Noisy DQN. lative reward R to appear τ while decreasing the probability of trajec­
tories with lower cumulative reward R to appear.
Noisy DQN (Fortunato et al., 2017) enhances the exploration
mechanism of DRL algorithms on the architecture of DQN. It introduces (1) AC
noise into the neural network weights via NoisyNet, and unlike tradi­
tional DQN that rely on fixed parameters such as epsilon to perform The AC method (Sutton and Barto, 2018) effectively solves the
exploration, Noisy DQN is able to adaptively adjust the exploration in­ problem that inefficient actions in a series of actions in traditional policy
tensity. This mechanism avoids the problem that traditional DQN may gradient algorithms may be overshadowed by other, better actions and
ignore differences in the value of actions when selecting actions, and thus their impact cannot be accurately assessed by combining the
ensures that the exploration process is no longer limited to random se­ evaluation of the Q-function with policy optimization. The AC archi­
lection, so that all non-optimal actions are no longer explored equally. tecture consists of two parts, the “actuator” and the “evaluator”, as
NoisyNet’s exploration mechanism is able to automatically learn when shown in Fig. 7. The actuator updates the behavior based on the policy
and how much noise to add to maximize the long term gains during the gradient algorithm, while the evaluator applies the value function
training process, so that the exploration mechanism dynamically adjusts method to assess the behavior. The advantage of the AC architecture is
according to the environmental feedback. This adaptive exploration that it simplifies the sequential update of the policy gradient into a single
mechanism improves efficiency and maintains learning performance. step, avoiding waiting for the entire sequence to end before evaluating
and updating the policy, which reduces the difficulty of data collection
(6) Rainbow DQN. and reduces the variance of the policy gradient.

Rainbow DQN (Hessel et al., 2018) is an integrated DQN variant that (2) DDPG
combines six Distributional DQN, Double DQN, Prioritized Experience
Replay DQN, Dueling DQN, Noisy DQN, and Multi-step learning Inno­ The DPG algorithm (Silver et al., 2014), which performs action se­
vative techniques that overcome the limitations of standard DQN in lection in a deterministic manner, selects a fixed action in a given state.

6
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.6. Performance of DQN algorithm on 57 Atari games (Hessel et al., 2018).

For the value function network:


( ) 1 ∑[ ( ʹ) ( ) ]2
L θQ = rt + γQʹ st+1 , μʹ(st+1 |θμ )|θQ − Q st , at |θQ (18)
ʹ
N t

( )
θQt+1 = θQt + αQ ∇θQ L θQ (19)

Here, θQ denotes the parameter vector of the value function network,


( )
L θQ is an assessment of the capability of the value function network,
employing its gradients with respect to the network parameters for
updates, αQ represent the learning rates for the value function network.
In typical implementations of the DDPG algorithm, the parameters
are updated using a soft update strategy to enhance stability:
Fig.7. AC algorithm architecture.
θQt+1 ←τθQt + (1 − τ)θQt+1
ʹ
(20)

The deterministic policy provides an efficient treatment of the contin­ ʹ


θμt+1 ←τθμt + (1 − τ)θμt+1 (21)
uous action space and solves the original probabilistic policy action
sampling, which results in a significant increase in computation in large where τ denotes the soft update rate and its value is much less than one,
action spaces. DDPG (Barth-Maron et al., 2018) combines the DQN and
θQt and θμt are the parameters of the critic and actor networks at time step
DPG algorithms to enhance the applicability of the algorithm. In DDPG,
t ,respectively, while θQt+1 and θμt+1 are the updated parameters of the
ʹ ʹ
( )
deterministic policy μ(s|θμ ) and value function Q s, a|θQ are repre­
target networks following the soft update. The DDPG algorithm also
sented by neural networks with parameters θμ and θQ , respectively. The
increases the exportability by adding noise to further improve the per­
policy network is responsible for updating the policy, which is equiva­
formance. In the task of continuous action space, DDPG shows stable
lent to the actuator in the AC structure; the value function network
performance.
evaluates the actions and provides the gradient information, acting as an
evaluator. The updating process of the policy network and the value
(3) TD3
function network are represented as follows, respectively:
For the policy network:
The TD3 algorithm (Fujimoto et al., 2018) is proposed on the basis of
[ ⃒ ]
( )⃒ ⃒ DDPG, aiming to further improve the training stability and performance
∇θμ J ≈ Eμʹ ∇a Q s, a|θQ ⃒s=st ,a=μ(st ) ∇θμ μ(s|θμ ) ⃒ (16)
s=st of the algorithm for continuous action space problems. The TD3 algo­
rithm mainly solves the problems in DDPG through the following three
θμt+1 = θμt + αμ ∇θμ J (17) key technical improvements:

Here, ∇θμ J represent the gradient of the policy’s performance J with


a. Dual Q-learning mechanism: TD3 introduces two independent value
respect to the policy parameters θμ ,and αμ represent the learning rates
function evaluators, Q1 and Q2, instead of the single evaluator in
for the policy network. DDPG. When the policy is updated, TD3 takes the smaller of these

7
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

two value functions for gradient descent, which effectively reduces stability and sample efficiency problems in RL by introducing the Trust
the phenomenon of policy over-estimation. region. The TRPO algorithm endeavors to ensure the stability of the
b. Delayed Update Mechanism: TD3 implements a delay mechanism in learning process by ensuring that the new policy does not deviate
the update frequency of the policy network, i.e., the value function is excessively from the old policy when updating the policy. Specifically,
updated several times before the policy network is updated once. TRPO aims to maximize the objective function while constraining the
This mechanism ensures the stability of the value estimation and policy update step size to avoid generating disruptive and drastic up­
provides a more reliable learning target for the policies. dates. The core of this approach is to solve the following problem:
c. Target policy smoothing regularization: TD3 adds noise to the goal [ ]
πθ (a|s) πθ
policy, which enhances the exploration ability of the algorithm maxθ Es,a∼πθold A old (s, a) (24)
πθold (a|s)
through this smoothing process and reduces the interference of
environmental noise in the learning process. while being subject to certain KL dispersion constraints:
[ ]
Table 1 demonstrates the maximum average returns of the policy Es∼πθold DKL (π θold (⋅|s)‖πθ (⋅|s)) ≤ δ (25)
gradient-based DRL algorithms in different OpenAI gym environments.
The TD3 algorithm exhibits higher sample efficiency and better per­ where, DKL is the Kullback-Leibler divergence between the old and new
formance compared to other algorithms in selected test environments. policies, which measures the difference in the distribution of the policies
The successful practice of TD3 not only provides a new perspective for for a given state, and δ is a predefined small value to limit the difference
solving continuous action space problems, but also contributes an between the old and new policies. In this way, TRPO ensures that the
important theoretical foundation for the development of DRL algorithms iterative updating of policies is both large enough that effective new
and empirical experience. policies can be learned, and small enough to avoid excessive deviations
that lead to performance degradation. Compared to traditional policy
(4) A3C gradient methods, TRPO demonstrates higher sample efficiency and
better performance in continuous control tasks and high-dimensional
The A3C algorithm (Mnih et al., 2016) is based on the idea of action space problems. In addition, the TRPO algorithm is relatively
Asynchronous RL, which utilizes multiple threads to execute agent ac­ less sensitive to the choice of hyperparameters, which makes the algo­
tions in asynchronous mode. Each actuator may be in a different state rithm more practical.
and take different actions at any given moment. This asynchronous
mechanism reduces the correlation between samples and can be an (6) ACKTR
effective alternative to experience replay, allowing the parameters to be
updated using the On-policy method. The A3C algorithm significantly The ACKTR algorithm (Wu et al., 2017) further improves the policy
reduces the hardware requirements compared to traditional policy gradient approach based on TRPO. The algorithm aims to optimize the
gradient algorithms, which require high-performance GPU processors, policy update process through efficient natural gradient descent. The
whereas the A3C only uses multi-core CPUs. natural gradient descent takes into account the geometric properties of
Also, the dominance function can be used as an alternative in the the parameter space to make the update closer to the shape of the policy
evaluation of the value function, which is defined as follows: probability distribution, thus improving the learning efficiency.
The key innovation of ACKTR is the application of the Kronecker-
At = Q(st , at ) − V(st ) (22) Factored Approximate Curvature (KFAC) approximation to the Fisher
Or: information matrix in natural gradient descent, which significantly re­
duces computational complexity and maintains the accuracy of the
At = rt + γV(st+1 ) − V(st ) (23) approximation. This is particularly critical for large-scale problems
where a complete computation of the Fisher information matrix is
where, the term Q(st , at ) denotes the action-value function at state st impractical. ACKTR optimizes the policy network of Actor and enhances
when action at is taken, and the term V(st ) represents the value function the value network update of Critic through KFAC. In several experi­
of the current state st . ments, ACKTR demonstrates superior performance relative to A3C and
By using the dominance function instead of the Q-value function, the TRPO, especially in complex environments, showing lower variance and
probability of occurrence of quality behaviors can be enhanced more higher data efficiency. This makes ACKTR particularly suitable for sit­
effectively. In addition, the use of the dominance function can further uations where sample efficiency is critical.
reduce the variance of the algorithm. In addition, A3C has a wide range
of applications and is suitable for a variety of 2D and 3D environments, (7) PPO
as well as discrete and continuous action spaces, showing great
versatility. The PPO algorithm (Schulman et al., 2017) further improves the
utility and flexibility of the algorithm with a simplified objective func­
(5) TRPO tion based on TRPO. It focuses on solving the problems that DQN cannot
handle continuous actions, and the A3C algorithm needs to adjust
The TRPO algorithm has made significant progress in solving the numerous hyper-parameters and faces low sampling efficiency. A

Table 1
Performance of policy-based DRL algorithms in a gym environment (Fujimoto et al., 2018).
Environment DDPG TD3 TRPO ACKTR PPO SAC

HalfCheetah 3305.6 9636.95 − 15.57 1450.46 1795.43 2347.19


Hopper 2020.46 3564.07 2471.3 2428.39 2164.7 2996.66
Walker2d 1843.85 4682.82 2321.47 1216.7 3317.69 1283.67
Ant 1005.3 4372.44 − 75.85 1821.94 1083.2 655.35
Reacher − 6.51 ¡3.60 − 111.43 − 4.26 − 6.18 − 4.44
InvPendulum 1000 1000.00 985.4 1000 1000 1000
InvDoublePendulum 9355.52 9337.47 205.85 9081.92 8977.94 8487.15

8
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

substitute target is used in the PPO: limited to keep the training stable. The PPO accomplishes this with the
following improved objective function:
L(θ) = ̂
E[rt (θ) A
̂ t] (26)
E[min((rt (θ), clip(rt (θ), 1 − ε, 1 + ε) ) A
L(θ) = ̂ ̂ )] (27)
Here, rt (θ) = πθ (at |st )/π θold (at |st ) represents the ratio between the old
and new strategies, while A ̂ t represents the estimate of the dominance where, clip(rt (θ), 1 − ε, 1 + ε) is used to place a limit on the policy ratio
function. PPO aims to optimize the policy so that the new policy pro­ rt (θ) to remain within [1 − ε,1 + ε]. This restriction ensures that the new
duces better actions relative to the old one, but the improvement is policy does not deviate too far and ensures the stability of the algorithm.

Table 2
Summary of DRL application in non-nuclear field.
Field Application Direction DRL Objective Algorithm Year Reference

Automatic Driving Path tracking, Smoothness and accuracy of planned path tracking PPO 2020 (Shan et al., 2020)
path planning PPO 2019 (Chen et al., 2019)
DDPG 2021 (Tian et al., 2021)
Driving decision Safe and accurate vehicle interaction DDQN, TD3, 2019 (Chen et al., 2019)
SAC 2018 (Jaritz et al., 2018)
A3C
Energy saving, Efficient automobile power efficiency DQN, DDQN 2019 (Qi et al., 2019)
environmental protection
Robot Task decision, control Precise robot control TD3 2019 (Johannink et al., 2019)
Path planning (single, group) Flexible and efficient robot motion A3C 2020 (Liu et al., 2020)
DQN 2019 (Bae et al., 2019)
PPO 2018 (Long et al., 2018)
Failure-tolerant control Robust strategy of robot fault TD3 2022 (Yan et al., 2022)
PPO 2020 (Okamoto and
Kawamoto, 2020)
Unknown exploration Reduce repeated exploration and improve exploration DQN 2016 (Tai and Liu, 2016)
efficiency DDPG 2020 (Hu et al., 2020)
Socializing, human behavior Enhance the ability of robots to socialize with people DQN 2016 (Qureshi et al., 2016)
Learning SARSA, Q- 2021 (Akalin and Loutfi,
learning 2021)
Energy Wind Wind energy scheduling Less fluctuation caused by uncertainty of wind power DQN 2021 (Meng et al., 2021)
Industry Energy management output, improve energy utilization efficiency DDPG 2022 (Zhu et al., 2022)
Unit control Optimize power output AC 2024 (Mazare, 2024)
DNN, MBRL 2023 (Xie et al., 2023)
Q-learning 2016 (Wei et al., 2016)
Q-learning 2015 (Wei et al., 2015)
Q-learning 2019 (Saenz-Aguirre et al.,
DDPG 2021 2019)
(Zhang et al., 2021)
Energy storage prediction Effectively manage the uncertainty of wind power DQN 2020 (Oh and Wang, 2020)
forecasting
Market management and Guide market behavior A3C 2022 (Sanayha and Vateekul,
tendering 2022)
Solar Solar energy scheduling Improve operational performance and reduce costs Q-learning 2020 (Correa-Jullian et al.,
Energy management Q-learning 2014 2020)
(Leo et al., 2014)
Efficiency optimization Increase the total energy collected by solar panels SARSA 2017 (Abel et al., 2017)
Q-learning
Prediction Accurate solar irradiance prediction DQN 2021 (Jalali et al., 2021)
Economic evaluation Solar microgrid effectiveness and flexibility RL 2023 (Yuan and Xie, 2023)
Sensor node power Improve battery performance and reduce excessive power SARSA 2017 (Shresthamali et al.,
management consumption 2017)
Energy of Unit control optimization Optimize the system output power or actively dissipate the Q-learning 2020 (Bruzzone et al., 2020)
Waves wave energy. SARSA, Q- 2021 (Pierart et al., 2021)
learning 2022 (Zou et al., 2022)
DQN 2022 (Sarkar et al., 2022)
PPO 2022 (Xie et al., 2022)
PPO
Desalination of sea water Reduce the uncertainty and fluctuation of fresh water DDPG 2022 (Yin and Lei, 2022)
supply
Power System System control Improve the control efficiency of power and voltage of SIRL 2024 (Song et al., 2024)
power grid system. MADDPG 2023 (Li et al., 2023)
MADRL 2024 (Zhang et al., 2024)
DQN 2022 (Xu et al., 2022)
Q-learning 2023 (Zhang et al., 2023)
Q-learning 2021 (Liu et al., 2021)
Power grid dispatching Reduce operating costs and optimize voltage distribution GRL 2023 (Chen et al., 2023)
DDPG 2022 (Chen et al., 2022)
Defense and protection Improve the ability of attack detection and defense and DQN 2023 (Deng et al., 2023)
fault protection 2023 (Yang et al)
2023 (Lin et al., 2023)
Market Guide market behavior DQN 2023 (Li et al., 2023)
AC 2023 (Zhang et al., 2023)
Carbon emission Improve the economic indicators of carbon emissions PPO 2022 (Xu et al., 2022)

9
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

In tests of MuJoCo, Atari 2600 games, and 3D environments, the PPO and hills and promoted them in unknown road environments and tra­
algorithm shows superior performance compared to the A2C and A3C jectories. In addition, RL also plays an important role in energy saving
algorithms. and environmental protection during driving. The vehicle energy man­
agement system designed by Qi et al. (Qi et al., 2019) independently
(8) SAC learns the best fuel / power distribution from the interaction between
the vehicle and the traffic environment, and improves the fuel economy
The SAC algorithm (Haarnoja et al., 2018) focuses on solving the of the vehicle.
convergence problem faced by policy gradient algorithms. SAC in­
troduces the concept of entropy into the objective function, aiming to (2) Robot.
maximize the cumulative reward as well as the entropy, encouraging the
agent to take more stochastic actions in completing the task. The In recent years, DRL algorithm has significantly improved the
objective function of SAC is as follows: adaptability of mobile robots in unstructured environments and the
T
ability of trajectory planning in complex tasks. Therefore, the develop­

J= E(st ,at )∼ρπ [r(st , at ) + αH(π(⋅|st ) ) ] (28) ment and continuous improvement of DRL algorithm for robot control
t=0 has become a hot research topic at home and abroad.
First of all, RL can realize the task control of robot object grasping,
where, H(π(⋅|st ) ) represents the entropy of the policy given the current assembly and so on. Johannink et al. (Johannink et al., 2019) achieved a
state, ρπ represents the distribution of state-action pairs under policy π , good training robot to complete the above control tasks. Secondly, RL
and α represents the importance of the entropy relative to the reward. can control the trajectory of robots in complex environments, so that
The SAC algorithm encourages exploration by maximizing entropy, they can better cope with dynamic obstacles and uncertain environ­
which not only prevents the agent from prematurely converging to a ments. Liu et al. (Liu et al., 2020) realized motion planning in crowded
suboptimal policy, but also improves the robustness of the algorithm. and chaotic scenes of public environment. In addition, group robot path
SAC exhibits superior performance over DDPG and PPO in several planning is also one of the research topics of RL methods. Bae et al. (Bae
continuous control tasks. et al., 2019) realized multi-robot path planning by analyzing the image
information of the surrounding environment. Long et al. (Long et al.,
3. DRL in nuclear energy 2018) trained a large number of robots in a rich and complex environ­
ment, and verified the sensor-level collision avoidance strategy. RL can
The previous section 2 introduces the DRL algorithm in detail, which help robots make adaptive decisions in the face of emergencies or system
ultimately needs to be implemented in practical applications in the field failures. Yan et al. (Yan et al., 2022) carried out active fault-tolerant
of nuclear energy. In this section, we first summarize and analyze the control for robots with manipulator failures to ensure system security.
application research experience of DRL method in the industrial field. Okamoto et al. (Okamoto and Kawamoto, 2020) used RL to obtain
Then, the research trends and research status of DRL in the field of robust strategies for robot faults. RL can also provide solutions for un­
nuclear energy are introduced. Finally, the key research directions and known exploration of robots. Tai et al. (Tai and Liu, 2016) designed a
application cases of DRL in the field of nuclear energy are introduced in mobile robot that uses only the depth information of RGB-D sensors to
detail. explore in an unknown indoor environment. Hu et al. (Hu et al., 2020)
designed a multi-mobile robot collaborative exploration strategy that
Research experience in DRL in the industrial field saves time and energy costs. In addition, RL methods are also used for
robots to learn social rules, human behavior patterns, and human so­
As an advanced and powerful learning framework, the advancement cialization. (Qureshi et al., 2016; Akalin and Loutfi, 2021).
of DRL method stems from the ability of RL method to deal with complex
and dynamic environments and the ability to make adaptive decisions in (3) Energy industry.
the face of unknown scenarios. In recent years, it has become a high-
profile cutting-edge technology in the fields of autonomous driving, DRL technology has rich research and application in energy demand
robot complex control, energy and power, and has shown its excellent forecasting, energy production optimization, energy efficiency
application potential. This section summarizes the application cases of improvement, energy system control and so on.
DRL in autonomous driving, robot control, energy and power and other In the field of wind energy, firstly, RL plays an important role in the
industries, as shown in Table 2. Provide a reference basis for future scheduling management of wind energy. The advanced real-time
research and applications of DRL in the field of nuclear energy. scheduling strategy of distributed energy system proposed by Meng
et al. (Meng et al., 2021) explores the optimal real-time scheduling for
(1) Automatic driving. 24-hour wind demand disturbance. The intelligent scheduling strategy
of distribution network proposed by Zhu et al. (Zhu et al., 2022) reduces
The automatic driving decision-making system based on DRL algo­ power fluctuation. Secondly, RL can effectively deal with the dynamic
rithm gives the vehicle the ability to learn independently and operate and nonlinear characteristics of wind energy system. The wind power
driving, which provides technical support for the real realization of control system can learn and optimize the parameters of the blade angle
intelligent vehicle driving. and speed of the wind turbine in real time, and adapt to the changing
Firstly, RL achieves intelligent and safe navigation in path planning. wind speed and meteorological conditions. Mahmood et al. (Mazare,
Shan et al. (Shan et al., 2020) and Chen et al. (Chen et al., 2019) used RL 2024) realized the adaptive optimal safe wind power generation control
to adjust the predefined path deviation to adapt to path tracking in of variable speed wind turbine system. Xie et al. (Xie et al., 2023) used
medium and high speed driving scenarios. Tian et al. (Tian et al., 2021) RL to control the torque and pitch of wind turbines, showing excellent
achieved accurate path tracking through behavioral cloning of human robustness and control performance in the case of uncertainty and un­
driving habits. Secondly, the automatic driving system needs to make expected actuator failures. Wei et al. (Wei et al., 2016; Wei et al., 2015)
real-time decisions in a complex and changeable traffic environment, solved the control problem of variable speed wind energy conversion
such as overtaking, lane change, etc. Chen et al. (Chen et al., 2019) used system of permanent magnet synchronous generator. The data-driven
the RL method to realize the automatic driving task in the urban dense yaw proposed by Saenz-Aguirre et al. (Saenz-Aguirre et al., 2019) re­
vehicle scene. Jaritz et al. (Jaritz et al., 2018) achieved faster conver­ alizes the control of wind direction under different conditions. Z hang
gence and more robust driving on complex road structures such as turns et al. (Zhang et al., 2021) proposed a sparse coordinated control strategy

10
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

for ordinary power system stabilizers. RL also plays an important role in industrial fields, it is concluded that the application research of DRL
the prediction and management of energy storage systems (Oh and mainly focuses on the following five points.
Wang, 2020), market management and bidding (Sanayha and Vateekul,
2022), etc. • Task decision planning. DRL has the ability to interact with the
In the field of solar energy, firstly, RL intelligently adjusts the storage environment to learn the best strategy, and can make accurate and
and distribution of energy through real-time data, weather forecast and safe decision-making plans. For example, vehicles avoid obstacles
power demand of solar power generation systems. Correa-Jullian et al. and change lanes in automatic driving. It is suitable for solving the
(Correa-Jullian et al., 2020) and Leo et al. (Leo et al., 2014) determine problems such as the formulation of operating rules for complex
the optimal operation plan and optimal scheduling scheme of solar tasks in nuclear power plants and the inspection of robots in nuclear
water heating system or solar microgrid system. Secondly, RL plays a power plants.
key role in the efficiency optimization of solar power generation sys­ • Resource management scheduling. DRL has the ability of accurate
tems. Abel et al. (Abel et al., 2017) used RL to increase the total energy prediction based on data in complex and changeable environment,
harvested by solar panels and optimized the power generation efficiency which can timely predict demand and reasonably allocate and
of the power generation system. At the same time, RL has excellent schedule resources. For example, power dispatching in power grid
application effects in solar radiation prediction (Jalali et al., 2021), solar system. It is suitable for solving the problems of power generation
project economic evaluation (Yuan and Xie, 2023), adaptive solar power planning, transmission line selection and load distribution of nuclear
management (Shresthamali et al., 2017) and other aspects. power plants.
In the field of wave energy, RL applications mainly focus on unit • Operation design optimization. DRL has the ability to approach the
control and optimization and seawater desalination. Bruzzone et al. optimal solution through trial and error and learning in high-
(Bruzzone et al., 2020), Pierart et al. (Pierart et al., 2021), Zou et al. (Zou dimensional space, and can give the optimal solution to achieve
et al., 2022), Sarkar et al. (Sarkar et al., 2022) discussed the application the goal under specific constraints. For example, the power genera­
of RL in the control of wave energy converter, and dynamically adjusted tion efficiency of solar panels is optimized. It is suitable for solving
the speed ratio of the generator according to the sea conditions to the problems of nuclear fuel assembly configuration and core
maximize the output power of the system. Xie et al. (Xie et al., 2022) structure design.
used DRL to find the optimal wave dissipation strategy in the control of • System autonomous control. DRL has the ability of good adaptability
wave energy plate breakwaters. Yin et al. (Yin and Lei, 2022) used DRL and robust control in complex and uncertain environment, and can
to control seawater desalination systems to smooth freshwater develop adaptive emergency control scheme and realize precise
production. control. For example, fault-tolerant control under robot failure. It is
suitable for solving the problems of reactor startup and shutdown,
(4) Power system. safety control under reactor accidents, etc.
• Unknown exploratory learning. DRL has the ability to gradually find
DRL technology has made important contributions to improving the strategies to adapt to the unknown environment through continuous
operation efficiency and stability of the power grid and promoting the trial and learning, and can give operational strategies to adapt to the
optimal allocation and efficient utilization of energy. complex unknown environment. For example, robots explore un­
In the field of power grid system, RL is widely used in system control, known environments to avoid obstacles. It is suitable for solving the
power grid dispatching, defense and protection, market and so on. In problems of autonomous and safe operation of nuclear system in
terms of system control, optimize the frequency and voltage control of deep space environment.
the power grid system. Song et al. (Song et al., 2024) combined V2G
control with power plant frequency control to complete the flame 3.2. Application trends of DRL in nuclear energy field
retardant task. Li et al. (Li et al., 2023) reduced power loss and opti­
mized voltage fluctuation through real-time voltage control. Zhang et al. Due to the particularity of safety in the nuclear energy field, the
(Zhang et al., 2024) proposed a distributed voltage control for active research in the nuclear energy field is relatively conservative. Cutting-
distribution networks. Xu et al. (Xu et al., 2022) embedded prior edge technologies such as DRL, which has emerged in recent years,
knowledge into RL to realize real-time power correction control of have not yet been studied on a large scale in nuclear power. The relevant
complex power systems. Zhang et al. (Zhang et al., 2023) and Liu et al.
(Liu et al., 2021) achieve optimal multi-area coordination while
improving the control performance of frequency deviation caused by
strong random interference. Uncertainty in dynamic economic dispatch
is a key issue in the safe and economic operation of power systems. The
power system diagram RL method designed by Chen et al. (Chen et al.,
2023) can achieve higher quality solutions in online operations. Chen
et al. (Chen et al., 2022)’s collaborative scheduling framework reduces
operating costs and optimizes voltage distribution. Secondly, RL
methods can be used for fault detection and processing of power grid
systems, as well as defense against various potential threats and attacks.
Deng et al. (Deng et al., 2023) and Yang et al. (Yang et al) used DRL to
perform fault detection and cascade protection of power grid systems,
respectively. Lin et al. (Lin et al., 2023) used DRL to design a false data
injection attack model and an adversarial detection method. In addition,
RL also has certain research in energy market forecasting and electricity
price adjustment (Li et al., 2023; Zhang et al., 2023), carbon emission
flow optimization (Qin et al., 2022), etc.

(5) Application experience.


Fig.8. The change trend of DRL research literature in the field of nu­
Through the above analysis of application cases in different clear energy.

11
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

research mainly appeared after 2020, as shown in Fig. 8, and showed a fusion tokamak device, as shown in Fig. 11, and achieved a good control
blowout trend in 2022 and 2023.This section mainly summarizes and effect. Relevant research results were published in the journal Nature.
analyzes the research of DRL in the field of nuclear energy, and provides This is also a breakthrough in the application of DRL technology in the
reference for the subsequent research. field of nuclear energy.
The research of DRL technology in the field of nuclear energy mainly
focuses on autonomous control, operation and maintenance strategy 3.4. Operation and maintenance strategy optimization
optimization, design optimization, fault diagnosis, nuclear technology
application, etc., as summarized in Table 3. The specific research di­ DRL technology is also applied to the optimization of operation and
rection, results and cases will be introduced in detail below. maintenance strategy of nuclear energy system, which effectively re­
duces the cost investment and shutdown risk during the operation and
3.3. Autonomous control maintenance of nuclear energy system equipment, and improves the
economy and reliability of nuclear energy application. Zhao et al. (Zhao
In recent years, research on DRL technology in the field of nuclear and Smidts, 2022) used DRL technology to optimize the maintenance
energy has mainly focused on the autonomous control of nuclear energy strategy of the pump system of nuclear power plant. Hao et al. (Hao
systems. The application of DRL improves the automation and intelli­ et al., 2023) proposed a DRL operation and maintenance strategy to
gence level of nuclear energy systems, further ensures the safety of address the imperfect knowledge of nuclear power system degradation
system operation, optimizes control performance, and reduces control model and the challenge of partial observability of system degradation
and operation costs. Lee et al. used DRL technology to independently state. Pylorof et al. (Pylorof and Garcia, 2022) used RL to solve the
control the reactor power rise process (Lee et al., 2020) and the cold problem of supervision and control in advanced energy systems. Yi et al.
shutdown process (Lee et al., 2022) of nuclear power plants. Experi­ (Yi et al., 2022) used DRL to maximize the benefits of hydrogen pro­
ments prove the feasibility and potential of DRL technology for nuclear duction in a nuclear renewable integrated energy system according to
power plant reactor start-stop autonomous control. The reactor auto­ the flexibility requirements of the grid.
matic power control architecture proposed by them is shown in Fig. 9. Hao et al. (Hao et al., 2023a,b) uses DRL method to search for the
The research architecture of autonomous control of nuclear power best operation and maintenance strategy of information physical energy
plants using DRL is similar to this. First, a nuclear power plant simulator system. According to available information (such as production plan,
is used as the environment for DRL applications. Then, the training and component remaining useful life, component status, remaining main­
optimization of DRL agents are realized through interaction. Finally, the tenance time and system status), select the best operation and mainte­
application of DRL controllers is realized. At the same time, they nance operation to be performed. It adapts to the production fluctuation
compared the DRL control effect with the traditional PID control effect and the uncertainty of power demand, so as to ensure the more reliable
during the cold shutdown process, as shown in Fig. 10 and Table 3. From and safe production and supply of power system. At the same time, the
the comparison results of the figure and the table, it can be seen that the author applied it to the Advanced Lead-cooled Fast Reactor European
DRL control method has a better control effect than the traditional PID Demonstrator (ALFRED). By simulating the operation and maintenance
control, and it also proves the potential of DRL in the autonomous decision sequence, the comparison results of different maintenance
control of nuclear power plants. strategies are shown in Tables 4 and 5. As can be seen from the table, the
Kim et al. (Kim et al., 2023), Bae et al. (Bae et al., 2023) and Park method based on DRL has a higher average profit. 95 % Conditional
et al. (Park et al., 2022) also conducted relevant studies in the start-stop Value at Risk (CVaR) is the lowest, the loss is the lowest, and the number
control of nuclear power plant reactors and proved that DRL has the of safe/serious shutdowns is the least. During the five-year mission, the
potential to realize automatic control operation. common strategies of Corrective Maintenance (CM) and Preventive
In the nuclear power critical system equipment control. Li et al. (Li Maintenance (PM) have caused a lot of component failures. DRL oper­
et al., 2022; Li et al., 2023) studied the control of steam pressure at the ation and maintenance strategy makes use of the information of
outlet of a direct-flow steam generator. Zhang et al. (Zhang et al., 2024) component health status, which greatly reduces the number of system
used DRL to achieve safe and efficient operation of nuclear steam supply dysfunction (safe shutdown and serious shutdown) in nuclear power
system. Dong et al. (Dong et al., 2020) carried out system control opti­ plants. The results show that the optimal solution found by DRL is better
mization on the basis of energy system dynamics. Li et al. (Li et al., 2021) than that provided by the most advanced operation and maintenance
realized the coordinated control of nuclear reactor power control and strategy.
steam generator level control.
In the case of nuclear power system transmission accident control 3.5. Design optimization
disposal. Li et al. (Li et al., 2022) designed an active fault-tolerant
control system for the fault of the direct-flow steam generator sensor, In the field of nuclear energy design optimization, DRL technology
which ensured the normal and stable operation of the direct-flow steam has also contributed excellent application results. It effectively reduces
generator. Saeed et al. (Saeed et al., 2023) conducted emergency the calculation cost of complex design, search and other issues in the
treatment for the feedwater loss accident of the small modular reactor nuclear field, and improves the security of system equipment in the
IP-200, and Lee et al. (Lee et al., 2021) realized emergency treatment design stage. In physical computational optimization, Gu et al. (Gu et al.,
after the coolant loss accident of the nuclear power plant, proving that 2023) used DRL to solve the large-scale optimization problem in the
DRL can deal with emergencies on the premise of ensuring the safety of expensive particle transport simulation, and verified the critical search
the power plant. and nuclear component geometry optimization of small modular
The control of different pile types is also studied. Chen et al. (Chen reactors.
and Ray, 2022) designed a nonlinear control system for boiling water Dang et al. (Dang and Ishii, 2022) realized the prediction of the
reactor, and experiments proved the advantages of DRL control system interface area of turbulent two-phase flow using DRL, and verified the
in anti-interference, perturbation stability and setpoint tracking. Zhong performance improvement brought by the new method through the
et al. (Zhong et al., 2022) designed a set of infinite horizon optimal experimental database. An interface area prediction method based on
tracking control method for continuous time nonlinear systems, and physical information RL is proposed. This method aims to capture the
achieved simulation verification in 2500 MW PWR nuclear power plant. complexity of two-phase flow by using the interface area transport
Mattioni et al. (Mattioni et al., 2023) and Degrave et al. (Degrave et al., equation and the advantage of RL. A simple MDP model is used to
2022) applied the DRL method to the actual control of the high- simulate the change of bubble interface state driven by turbulent
temperature plasma device of the magnetic confinement nuclear interaction in the flow field, and the interface structure of transient

12
Table 3
Summary of DRL research in nuclear energy field.

Y. Liu et al.
Application direction Research object DRL objective Algorithm Year Reference

Control Reactor control Reactor rising power control 1. Optimize control strategy A3C 2020 (Lee et al., 2020)
Cold shutdown control 2. Realize autonomous control of startup and shutdown SAC 2022 (Lee et al., 2022)
Autonomous start operation coordination control processes SAC 2023 (Kim et al., 2023)
Multi-objective control during nuclear power plant shutdown 3. Ensure control security SAC 2023 (Bae et al., 2023)
Thermal start-up automatic control A3C 2022 (Park et al., 2022)
Key system equipment OTSG outlet steam pressure control 1. Good tracking ability and anti-interference ability PPO 2022 (Li et al., 2022)
control OTSG Outlet pressure control 2. Small overshoot control effect, fast stabilization speed, strong PPO 2023 (Li et al., 2023)
Multi-objective control of key nuclear steam supply system adaptive ability SAC 2023 (Zhang et al., 2024)
Optimal control of thermoelectric response of nuclear steam supply MLP-based RL 2020 (Dong et al., 2020)
system
Coordinated control of nuclear reactor power and steam generator level DDPG 2021 (Li et al., 2021)
Accident control OTSG sensor fault control 1. Accurate and rapid control in case of accidents to ensure the DQN 2022 (Li et al., 2022)
Autonomous control of small modular reactor under LOFW accident normal and stable operation of the system AC 2023 (Saeed et al., 2023)
Autonomous control under LOCA accident in nuclear power plant 2. Handle emergencies under the premise of ensuring the safety SAC 2021 (Lee et al., 2021)
of nuclear power plants
Control of different Nonlinear control system of boiling water reactor 1 Improve the control performance of pressurized water reactor, DDPG 2022 (Chen and Ray,
reactor types boiling water reactor and nuclear fusion device 2022)
Nonlinear tracking control for 2500 MW PWR nuclear power plant 2. Improve security IRL 2022 (Zhong et al., 2022)
Plasma control 3. Reduce the design workload and control cost Dynamic 2023 (Mattioni et al.,
DNN-DRL 2023)
Tokamak high temperature plasma control MPO 2022 (Degrave et al.,
2022)
Optimize O&M policy Optimization of pump system maintenance strategy for nuclear power 1. Reduce maintenance cost and improve economy MBRL 2022 (Zhao and Smidts,
plant 2. Track changes in demand and improve energy efficiency 2022)
Find the most profitable nuclear power plant operation and DRL 2023 (Hao et al., 2023)
maintenance strategy
Nuclear power plant system operation supervision and optimization SAC 2022 (Pylorof and Garcia,
2022)
13

To provide optimized operation and maintenance strategy for European PPO, IL 2023 (Hao et al., 2023)
advanced lead-cooled fast reactor demonstration plant
Information physical energy system operation and maintenance strategy PPO, IL 2023 (Hao et al., 2023)
optimization
Renewable integrated energy system to maximize the efficiency of TD3 2022 (Yi et al., 2022)
hydrogen production SAC
PPO
Design Physical computation Critical search and nuclear component geometry optimization for small 1. Solve large-scale optimization problems in particle transport NEORL 2023 (Gu et al., 2023)
optimization optimization modular reactors simulation
Prediction of interface area of turbulent two-phase flow 2. Improve computing performance and reduce optimization IAT-PhyRL 2022 (Dang and Ishii,
cost 2022)
Nuclear fuel Learn the experience of nuclear fuel engineer and choose the solution 1. Use expert knowledge effectively PPO 2021 (Radaideh and
optimization 2. Reduce optimization costs Shirvan, 2021)
Nuclear fuel optimization 3. Improve fuel efficiency PESA 2022 (Radaideh and
Shirvan, 2022)

Nuclear Engineering and Design 429 (2024) 113655


Nuclear reactor control, nuclear fuel optimization application NEORL 2023 (Radaideh et al.,
2023)
Fuel loading mode optimization PPO 2023 (Seurin and
Shirvan, 2023)
Boiling water reactor fuel assembly optimization DQN 2021 (Radaideh et al.,
PPO 2021)
Safety diagnosis Nuclear facility safety function status diagnosis Improve the diagnostic effect and ensure the safe operation of DQN 2020 (Park et al., 2020)
Nuclear power plant equipment fault diagnosis nuclear power plant DQN 2023 (Zhong et al., 2023)
Nuclear power plant rotating machinery fault diagnosis CNN-RL, 2022 (Qian and Liu,
GRU-RL 2022)
Nuclear technology application Robot path planning in nuclear radiation environment Improve the safety and reliability of robots in nuclear DQN 2023 (Wu et al., 2023)
environments
Nuclear medicine Examination risk management architecture 1. Support the prevention of risks and injuries during medical SARSA 2023 (Shah et al., 2023)
examination
2. Reduce operating costs
Autonomous radiation source detection for radiation emergencies Improved detection performance PPO 2023 (Hu et al., 2023)
Proton linac beam orbit correction Improved calibration performance DDPG 2022 (Yang et al., 2022)
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.9. DRL-based automatic power rise control architecture for nuclear power plant reactors (Lee et al., 2020).

Fig.10. Comparison of pressure between controller and regulator (Lee et al., 2022).

bubble two-phase flow is described, as shown in Fig. 12. The environ­ optimization of nuclear fuel assembly combination with high dimen­
mental state set of MDP design includes hydrodynamic and two-phase sional and computationally heavy physical problems and reduced the
flow parameters related to interfacial area transport of bubbles. The optimization cost. At the same time, Radaideh et al. (Radaideh et al.,
average bubble diameter is changed by selecting the multiplication ratio 2021) provides a decision support tool for the optimization of boiling
within a given range. Based on the experimental database of vertical water reactor components. Seurin et al. (Seurin and Shirvan, 2023)
upward bubble air–water flow, the performance of the new method is studied the optimization of nuclear fuel loading mode. The author es­
verified, as shown in Fig. 13. The results show that, compared with IATE tablishes the connection between RL and the fuel designer’s strategy of
model, IAT-PhyRL model has better prediction effect among different meeting specific constraints and goals by moving fuel rods in practice.
experimental data sources. It is proved that RL model can find the The application verification is carried out on two low-dimensional (~2
optimal strategy to capture various types of interface transport phe­ × 106 combination) and high-dimensional (~1031 combination) boiling
nomena from the designed MDP environment. water reactor components. RL algorithm is more effective than SO al­
Other design optimization research mainly focuses on nuclear fuel gorithm in solving high-dimensional problems (10 × 10 assembly) by
optimization, and DRL method can effectively improve fuel efficiency, embedding expert knowledge in the form of game rules and effectively
reduce cost and ensure safety. Radaideh et al. (Radaideh and Shirvan, exploring the search space, as shown in Fig. 14. For a given computing
2021; Radaideh and Shirvan, 2022; Radaideh et al., 2023) solved the resource and time frame related to fuel design, RL algorithm is superior

14
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.11. Control architecture of DRL on magnetically confined fusion Tokamak device (Degrave et al., 2022).

of DRL fault diagnosis model proposed by the author is shown in Fig. 15.
Table 4
The input of the model is the sensor data feature, and the output is the
Comparison results of operating performance (Lee et al., 2022).
fault category. The model is rewarded by classifying right and wrong, so
Performance PID-Based DRL-Based as to learn more essential features. Through two experimental fault cases
Controller Controller
of bearings and gears, the results are shown in Fig. 16. The results show
Pressurizer Average deviation ±0.3248 kg/cm2 (ZN) ±0.2816 kg/ that DRL model has slower convergence speed than DL model, but it has
Pressure error from 27 kg/cm2 ± 0.1805 kg/cm2 cm2
better stability after reaching the convergence state. Moreover, the
(DRL)
Reaching time to 27 32 min (ZN) 10 min diagnosis effect of DRL model is better than that of traditional DL model
kg/cm2 10 min (DRL) and classic ML model in both initial sample size and small sample
Pressurizer Average deviation ±9.56 % (ZN) ±8.79 % scenarios.
Level error from 50 % ±6.55 % (DRL)
Reaching time to 50 +144 min (ZN) +93 min
% +38 min (DRL)
3.7. Nuclear technology application

In addition to the nuclear power mentioned above, DRL has also


Table 5 achieved good applications in nuclear technologies such as radiation
Performance of operation and maintenance strategy in ALFRED (Hao et al., detection, nuclear medicine and nuclear robotics. Wu et al. (Wu et al.,
2023). 2023) used DRL to realize the path planning of robots in nuclear envi­
Maintenance Average 95 % CVaR Average Average ronment under the constraints of path length and cumulative radiation
strategy Profit [109 euro] number of CM number of PM dose. Even in a complex environment, the barrier-free optimal path with
[109euro] (Ranking) (Ranking) (Ranking)
low radiation dose can be planned. Shah et al. (Shah et al., 2023) applies
(Ranking)
DRL technology to physical examination of nuclear medicine depart­
Corrective 0.09 ± 1.41 ± 38.12 ± 5.64 −
ment, and the proposed risk management system can support the pre­
0.13 (5) 0.88 (5) (5)
Scheduled 0.53 ± 0.93 ± 24.32 ± 1.98 60.47 ± 6.55 vention of risks and injuries during medical examination and reduce
0.12 (4) 0.53 (4) (4) (4) operating costs. Hu et al. (Hu et al., 2023) studied the autonomous ra­
Predictive 1.18 ± 0.30 ± 0.03 ± 0.01 (1) 44.13 ± 5.86 diation source detection of radiation emergencies, and the strategy
0.07 (3) 0.17 (3) (3) provided by DRL realized good exploration and location. Yang et al.
PPO 1.39 ± 0.04 ± 0.05 ± 0.03 (3) 42.03 ± 3.98
(Yang et al., 2022) corrected the beam trajectory of proton linear
0.02 (2) 0.03 (2) (2)
PPO(GTST- 1.44 ± 0.01 ± 0.04 ± 0.02 (2) 41.97 ± 4.06 accelerator and applied it to radioactive waste treatment and cancer
MLD) 0.02 (1) 0.01 (1) (1) treatment. In the extremely complex accelerator system, it is very
difficult to accurately correct the beam offset. The author puts forward a
new method of beam trajectory correction based on multi-agent RL, as
to SO by finding more feasible modes (4–5 times more than SO) and
shown in Fig. 17. Where (a) is the physical structure diagram of MEBT
improving the search speed, which shows that RL has excellent
segment in proton linear accelerator. (b) the process of representing RL.
computing efficiency. The results of this work clearly prove the effec­
The red box in (a) corresponds to the RL environment. (c) is a sectional
tiveness of RL as another decision support tool for nuclear fuel assembly
view of a quadrupole, in which the dipole coil is a calibration coil, the
optimization.
red part is a vertical control current, and the green part is a horizontal
current. Beam matching is achieved by calibrating the beam offset in the
3.6. Safety diagnosis horizontal and vertical phase space. The experimental results show that
the average beam offset calibrated by this method is about 0.39 mm,
DRL has been studied in fault diagnosis of nuclear power system. which achieves the expected experimental goal. It can control the beam
Park et al. (Park et al., 2020) used DRL to continuously monitor and trajectory of MEBT, and has application value and competitiveness in
diagnose the safe operation of nuclear facilities, and the results proved beam trajectory correction of linear accelerator.
the feasibility of DRL in the safety function diagnosis of nuclear facil­
ities. Zhong et al. (Zhong et al., 2023) used DRL to solve the problem of 4. Discussion
incorrect classification caused by data imbalance in the fault diagnosis
of nuclear power plant equipment. Qian et al. (Qian and Liu, 2022) used According to the summary and analysis of the previous two chapters,
DRL to solve the problem of fault diagnosis of rotating machinery in it can be seen that the current theoretical research and practical appli­
nuclear power plants in the case of small sample learning. The structure cation research on DRL have made great progress. Especially in the field

15
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.12. MDP environment describing the change of interface area of vertical air–water two-phase flow (Dang and Ishii, 2022).

of non-nuclear energy, research results have been fruitful, and research is needed, no online solution is required, and the design structure is
in the field of nuclear energy has also shown a positive and upward greatly simplified.
trend. This section will discuss the advantages, limitations and prospects
of future research directions of the DRL algorithm in the field of nuclear 4.2. Restriction
energy, and provide guidance for researchers or engineers on the
application of DRL in the field of nuclear energy. The excellent performance of DRL-based methods leads us to believe
that DRL technology will occupy an important place in future applied
4.1. Advantage research on AI in nuclear energy. However, the design, invocation, and
deployment of DRL algorithms are still challenging due to the excessive
Chapter 3 summarizes and reflects the application effects of DRL. The complexity of nuclear energy systems, safety constraints, and practical
advantages of DRL will not be described in detail here, but they mainly system issues such as high data acquisition costs. Therefore, the authors
include the following points. summarize the possible limitations and difficulties when DRL is inten­
sively studied in the nuclear energy field.
a. Strong adaptability: DRL can handle complex, non-linear environ­
ments and has strong adaptability. In particular, it has huge advan­ (1) Generalization
tages in solving problems such as high-dimensional space and non-
exhaustive problems. From the research on the application of DRL in nuclear energy, it is
b. Learning ability: DRL has a powerful learning ability. The researcher evident that there are difficulties in algorithm integration and software
only needs to specify the learning goal, and DRL can automatically environment compatibility and generalization. There is a lack of a
learn and optimize the policy by continuously interacting with the common research interface for integrating DRL into nuclear energy
environment. systems. The software environments for building nuclear energy system
c. Generalization ability: DRL has good generalization ability. Knowl­ simulators are not uniform and compatible with existing mature algo­
edge and experience can be shared between multiple tasks to rithm libraries. This leads to significant difficulties and additional
improve learning efficiency. workload in deploying off-the-shelf algorithms. This in turn leads to
d. Easy to apply: DRL does not need to establish a mathematical model excessive focus on algorithm implementation rather than adaptation to
of the system during the application process, less a priori knowledge solve the problem when faced with a newly acquired nuclear energy
system problem. A unified deployment interface and mature algorithm

16
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.13. Prediction results of interfacial area transport under different flow conditions (Dang and Ishii, 2022).

Fig.14. In the case of 10 × 10 optimization, the total number of feasible patterns found by different methods (Seurin and Shirvan, 2023).

implementations can significantly improve the efficiency of DRL requires a large amount of time to collect enough data samples to ensure
research in nuclear energy. that the model achieves the expected performance.The dimension of the
state-action space, the complexity of the control problem, the computing
(2) Sampling of data. power of the computer, and the trial-and-error nature of RL all
contribute to the inefficiency of the adoption, which in turn increases
The most controversial topic of DRL is its poor learning efficiency.RL the training cost. In the future, how to improve the sampling efficiency

17
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.15. Fault diagnosis model based on DRL (Qian and Liu, 2022).

Fig.16. Effect comparison of different fault diagnosis models (Qian and Liu, 2022).

of the algorithm and the efficiency of data usage will be a non-negligible The difficulty of setting up DRL application training mainly focuses
issue. on the algorithm selection and the setting of reward function, which
determines the final application effect of DRL. Although different DRL
(3) Training setup. algorithms have corresponding application characteristics, it is still
difficult to choose the appropriate DRL algorithm for practical

18
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fig.17. Beam trajectory correction framework based on multi-agent RL (Yang et al., 2022).

application, which needs to be considered in combination with various The ultimate goal of any RL algorithm is real-world deployment.
factors such as system characteristics, control problems, algorithmic Currently, one of the main difficulties in deploying DRL in nuclear en­
advantages, etc. There is no fixed standard for the setting of DRL re­ ergy systems lies in transferring simulation results to real-world sce­
wards, which is more dependent on experience. It is a great challenge to narios. Due to the specificity of nuclear energy systems, there is a lack of
set the reward function efficiently, fit the optimization objective, and actual data samples and they are not suitable for training interactions in
avoid reward sparsity. Summarizing the experience of efficient DRL real systems. Considering how to improve the accuracy of modeling,
training settings for problems in nuclear energy field will be a long reduce the gap between the simulation training environment and the
exploration process. real physical environment, and realize the effective migration of
knowledge learned in the simulator for deployment to the real world are
(4) Security and Robustness. all issues that need to be focused on at present.

The safety and robustness of DRL modeling applications are critical


4.3. Future research prospect
for complex, high safety demanding systems such as nuclear energy
systems. Processing parameter uncertainties, data errors, and mis­
According to the introduction above, we have defined the operating
matches between the simulator and the real system can lead to safety
characteristics and requirements of the nuclear industry system, the
problems when the model is applied. Adversarial RL and Robust RL are
technical advantages of DRL, the future development restrictions of
both better explorations to address the above issues. Still, further efforts
DRL, the application experience in other industrial fields and the
are needed to develop safer and more robust RL algorithms to ensure
research status in the field of nuclear energy. From the above five per­
zero constraint violations during testing or even training.
spectives, it is considered that the application of DRL in the field of
nuclear energy is one of the inevitable trends in the future. The suc­
(5) Adaptation and generalization
cessful and mature application experience of DRL in other industrial
fields summarized in part (5) of Section 3.1 can further guide future
When nuclear energy systems face more diverse tasks. Existing DRL
research on DRL in the fields of nuclear energy, such as task decision
research on how to quickly adapt to the needs of various applications,
planning, resource management scheduling, operation design optimi­
utilize good a priori knowledge, share good training experience,
zation, system autonomous control and unknown exploratory learning.
enhance adaptability, improve generalization, reduce time cost, and
From the research point of view in the field of nuclear energy, the
save computational resources will be promising research directions.
realization of autonomous control of nuclear energy system is still one of
the main research directions in the future, and autonomous control will
(6) From Simulation to Reality.
be the ultimate goal of instrumentation and control technology of the
next generation of nuclear energy system. Although the main research

19
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

results focus on the autonomous control direction of nuclear energy Declaration of competing interest
system at present, there are still too few research results. Especially the
future research directions of system-level coordinated control, fault- The authors declare that they have no known competing financial
tolerant control and optimization of existing control strategies. interests or personal relationships that could have appeared to influence
Relying on DRL’s powerful search and optimization ability, the design the work reported in this paper.
and operation optimization of nuclear energy system equipment in the
future is still the key research direction, but at present it is mainly Acknowledgements
limited to nuclear fuel research. From the perspective of DRL applica­
tion, it will be the main research direction to solve the leap from This work is supported by the National Natural Science Foundation
simulation to reality in the future and realize the application, verifica­ of China (No.12405196) and the Nuclear + X Joint Innovation Fund for
tion and optimization in the actual system. Solving the problems of al­ Ph.D. Students in Nuclear Science of Harbin Engineering University
gorithm integration and compatibility and universality of software (GXB202306).
environment will greatly promote the research efficiency of DRL in the
field of nuclear energy, which is a difficult problem to be overcome in Data availability
the future.
Data will be made available on request.
5. Conclusion
References
As a key and excellent frontier technology, DRL has great potential in
solving key problems in the field of nuclear energy. However, the Abel, D., Reif, E., Littman, M.L., 2017. Improving solar panel efficiency using
reinforcement learning. RLDM 2017.
literature about DRL in the field of nuclear energy is still very shallow. In Ai, S., Koe, A.S.V., Huang, T., 2021. Adversarial perturbation in remote sensing image
the present article, we reviewed the available contributions on the topic recognition. Appl. Soft Comput. 105, 107252.
to provide the reader with a comprehensive state of play of the possi­ Akalin, N., Loutfi, A., 2021. Reinforcement learning approaches in social robotics.
Sensors 21 (4), 1292.
bilities of DRL in nuclear energy. Al Smadi, T., Al Issa, H.A., Trad, E., Al Smadi, K.A., 2015. Artificial intelligence for
The application scenarios in the field of nuclear engineering are speech recognition based on neural networks. Journal of Signal and Information
complex and changeable, and the DRL algorithm system is complex. Processing 6 (02), 66.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., Kautz, J., 2016. Reinforcement
According to the principle and characteristics of the algorithm, the learning through asynchronous advantage actor-critic on a gpu. arXiv preprint arXiv:
optimal algorithm adaptation for application scenarios should be 1611.06256.
selected from the perspectives of continuity and dimension of input and Bae, H., Kim, G., Kim, J., Qian, D., Lee, S., 2019. Multi-robot path planning method using
reinforcement learning. Appl. Sci. 9 (15), 3057.
output, optimization of target requirements, limitation of computing
Bae, J., Kim, J.M., Lee, S.J., 2023. Deep reinforcement learning for a multi-objective
resources, computational stability, and balance between time and ac­ operation in a nuclear power plant. Nucl. Eng. Technol.
curacy. At present, DRL research in the field of nuclear energy mainly Barth-Maron, G., Hoffman, M.W., Budden, D., Dabney, W., Horgan, D., Tb, D., Lillicrap,
focuses on autonomous control, operation and maintenance strategy T., 2018. Distributed distributional deterministic policy gradients. arXiv preprint
arXiv:1804.08617.
optimization, design optimization, fault diagnosis, nuclear technology Bellman, R., 1957. A Markovian decision process. Journal of Mathematics and Mechanics
application and so on. Especially, there are a lot of engineering research 679–684.
requirements for autonomous intelligent control of complex nuclear Bellman, R., 1966. Dynamic programming. Science 153 (3731), 34–37.
Bruzzone, L., Fanghella, P., Berselli, G., 2020. Reinforcement learning control of an
energy systems with different types and working conditions. From the onshore oscillating arm wave energy converter. Ocean Eng. 206, 107346.
research of other energy fields, it can be concluded that DRL is mainly Chen, L., Chen, Y., Yao, X., Shan, Y., Chen, L., 2019. In: An Adaptive Path Tracking
good at solving the optimization and control problems of energy system, Controller Based on Reinforcement Learning with Urban Driving Application. IEEE,
pp. 2411–2416.
and has accumulated a lot of research experience. From the perspective Chen, X., Ray, A., 2022. Deep reinforcement learning control of a boiling water reactor.
of technical maturity and nuclear energy engineering requirements, the IEEE Trans. Nucl. Sci. 69 (8), 1820–1832.
research direction of DRL in the future nuclear energy field mainly fo­ Chen, Z., Wang, R., Sun, K., Zhang, T., Du, P., Zhao, Q., 2022. A modified long short-term
memory-deep deterministic policy gradient-based scheduling method for active
cuses on two directions: autonomous control and efficiency optimiza­ distribution networks. Front. Energy Res. 10, 913130.
tion. The resistance to promote the application of DRL in nuclear energy Chen, J., Yuan, B., Tomizuka, M., 2019. In: October). Model-Free Deep Reinforcement
field is mainly reflected in the compatibility and universality of algo­ Learning for Urban Autonomous Driving. IEEE, pp. 2765–2771.
Chen, J., Yu, T., Pan, Z., Zhang, M., Deng, B., 2023. A scalable graph reinforcement
rithm integration and software environment, the robustness and appli­
learning algorithm based stochastic dynamic dispatch of power system under high
cation fault tolerance of the model, and the migration and deployment of penetration of renewable energy. Int. J. Electr. Power Energy Syst. 152, 109212.
model knowledge from simulation to reality. Corea, F., Corea, F., 2019. AI and Speech Recognition. An Introduction to Data:
As of now, the application ability and robustness of DRL algorithm in Everything You Need to Know About AI. Big Data and Data Science, pp. 53–56.
Correa-Jullian, C., Droguett, E.L., Cardemil, J.M., 2020. Operation scheduling in a solar
nuclear energy system still need to be explored. With the support of the thermal system: a reinforcement learning-based framework. Appl. Energy 268,
continuous progress of DRL algorithm and the challenge of nuclear en­ 114943.
ergy industry, DRL will become a research hotspot of AI in the field of Dang, Z., Ishii, M., 2022. Towards stochastic modeling for two-phase flow interfacial area
predictions: a physics-informed reinforcement learning approach. Int. J. Heat Mass
nuclear energy in the next few years and bring considerable benefits. Transf. 192, 122919.
Degrave, J., Felici, F., Buchli, J., Neunert, M., Tracey, B., Carpanese, F., Riedmiller, M.,
CRediT authorship contribution statement 2022. Magnetic control of tokamak plasmas through deep reinforcement learning.
Nature 602 (7897), 414–419.
Deng, X., Wang, S., Wang, W., Yu, P., Xiong, X., 2023. Optimal defense strategy for AC/
Yongchao Liu: Writing – original draft, Resources, Investigation, DC hybrid power grid cascading failures based on game theory and deep
Conceptualization. Bo Wang: Writing – review & editing, Supervision. reinforcement learning. Front. Energy Res. 11, 1167316.
Dong, Z., Huang, X., Dong, Y., Zhang, Z., 2020. Multilayer perception based
Sichao Tan: Writing – review & editing, Supervision. Tong Li: Valida­ reinforcement learning supervisory control of energy systems with application to a
tion, Methodology, Investigation. Wei Lv: Validation, Investigation. nuclear steam supply system. Appl. Energy 259, 114193.
Zhenfeng Niu: Methodology, Investigation. Jiangkuan Li: Supervision, El-Sefy, M., Yosri, A., El-Dakhakhni, W., Nagasaki, S., Wiebe, L., 2021. Artificial neural
network for predicting nuclear power plant dynamic behaviors. Nucl. Eng. Technol.
Investigation. Puzhen Gao: Writing – review & editing, Supervision.
53 (10), 3275–3285.
Ruifeng Tian: Supervision, Conceptualization. Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Legg, S., 2017.
Noisy networks for exploration. arXiv preprint arXiv:1706.10295.

20
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Fujimoto, S., Hoof, H., & Meger, D. (2018, July). Addressing function approximation Li, P., Shen, J., Wu, Z., Yin, M., Dong, Y., Han, J., 2023. Optimal real-time Voltage/Var
error in actor-critic methods. In International conference on machine learning (pp. control for distribution network: Droop-control based multi-agent deep
1587-1596). PMLR. reinforcement learning. Int. J. Electr. Power Energy Syst. 153, 109370.
Garnier, P., Viquerat, J., Rabault, J., Larcher, A., Kuhnle, A., Hachem, E., 2021. A review Li, G., Wang, X., Liang, B., Li, X., Zhang, B., Zou, Y., 2016. Modeling and control of
on deep reinforcement learning for fluid mechanics. Comput. Fluids 225, 104973. nuclear reactor cores for electricity generation: A review of advanced technologies.
Gu, X., Radaideh, M.I., Liang, J., 2023. OpenNeoMC: a framework for design Renew. Sustain. Energy Rev. 60, 116–128.
optimization in particle transport simulations based on OpenMC and NEORL. Ann. Li, Z., Wang, J., Ding, M., 2022. A review on optimization methods for nuclear reactor
Nucl. Energy 180, 109450. fuel reloading analysis. Nucl. Eng. Des. 397, 111950.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018, July). Soft actor-critic: Off-policy Li, D., Yang, Q., Ma, L., Peng, Z., Liao, X., 2023. Offense and defence against adversarial
maximum entropy deep reinforcement learning with a stochastic actor. In sample: a reinforcement learning method in energy trading market. Front. Energy
International conference on machine learning (pp. 1861-1870). PMLR. Res. 10, 1071973.
Hao, Z., Di Maio, F., Zio, E., 2023. Monte Carlo tree search-based deep reinforcement Li, C., Yu, R., Yu, W., Wang, T., 2022. Pressure control of Once-through steam generator
learning for flexible operation & maintenance optimization of a nuclear power plant. using Proximal policy optimization algorithm. Ann. Nucl. Energy 175, 109232.
Journal of Safety and Sustainability. Li, C., Yu, R., Yu, W., Wang, T., 2022. Fault-tolerant control system for once-through
Hao, Z., Di Maio, F., Zio, E., 2023a. A sequential decision problem formulation and deep steam generator based on reinforcement learning algorithm. Nucl. Eng. Technol. 54
reinforcement learning solution of the optimization of O&M of cyber-physical energy (9), 3283–3292.
systems (CPESs) for reliable and safe power production and supply. Reliab. Eng. Syst. Li, C., Yu, R., Yu, W., Wang, T., 2023. Reinforcement learning-based control with
Saf. 235, 109231. application to the once-through steam generator system. Nucl. Eng. Technol.
Hao, Z., Di Maio, F., Zio, E., 2023b. Flexible operation and maintenance optimization of Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Wierstra, D., 2015.
aging cyber-physical energy systems by deep reinforcement learning. Nucl. Eng. Continuous control with deep reinforcement learning. arXiv preprint arXiv:
Technol. 1509.02971.
Hema, C., Marquez, F.P.G., 2023. Emotional speech recognition using cnn and deep Lin, X., An, D., Cui, F., Zhang, F., 2023. False data injection attack in smart grid: attack
learning techniques. Appl. Acoust. 211, 109492. model and reinforcement learning-based detection method. Front. Energy Res. 10,
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Silver, D., 1104989.
2018, April. Rainbow: Combining improvements in deep reinforcement learning. In Liu, Y.K., Ai, X., Ayodeji, A., Wu, M.P., Peng, M.J., Xia, H., Yu, W.F., 2019. Enhanced
Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1). graph-based fault diagnostic system for nuclear power plants. Nucl. Sci. Tech. 30,
Hiscox, B., Sobes, V., Popov, E., Archibald, R., Betzler, B., Terrani, K., 2020. AI 1–14.
Optimization of the Reactor Unit Cell to Support TCR Optimization. Oak Ridge Liu, L., Dugas, D., Cesari, G., Siegwart, R., Dubé, R., 2020. In: Robot Navigation in
National Lab.(ORNL), Oak Ridge, TN (United States). Crowded Environments Using Deep Reinforcement Learning. IEEE, pp. 5671–5677.
Howard, R.A., 1960. Dynamic programming and markov processes. Liu, L., Wang, Y., Chi, W., 2020. Image recognition technology based on machine
Hu, J., Niu, H., Carrasco, J., Lennox, B., Arvin, F., 2020. Voronoi-based multi-robot learning. IEEE Access.
autonomous exploration in unknown environments via deep reinforcement learning. Liu, Y., Zhang, L., Xi, L., Sun, Q., Zhu, J., 2021. Automatic generation control for
IEEE Trans. Veh. Technol. 69 (12), 14413–14423. distributed multi-region interconnected power system with function approximation.
Hu, H., Wang, J., Chen, A., Liu, Y., 2023. An autonomous radiation source detection Front. Energy Res. 9, 700069.
policy based on deep reinforcement learning with generalized ability in unknown Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., Pan, J., 2018. In: Towards Optimally
environments. Nucl. Eng. Technol. 55 (1), 285–294. Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning.
Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Castaneda, A.G., IEEE, pp. 6252–6259.
Graepel, T., 2019. Human-level performance in 3D multiplayer games with Mattioni, A., Zoboli, S., Mavkov, B., Astolfi, D., Andrieu, V., Witrant, E., Prieur, C., 2023.
population-based reinforcement learning. Science 364 (6443), 859–865. Enhancing deep reinforcement learning with integral action to control tokamak
Jalali, S.M.J., Khodayar, M., Ahmadian, S., Shafie-Khah, M., Khosravi, A., Islam, S.M.S., safety factor. Fusion Eng. Des. 196, 114008.
Catalão, J.P., 2021. In: A New Ensemble Reinforcement Learning Strategy for Solar Mazare, M., 2024. Adaptive optimal secure wind power generation control for variable
Irradiance Forecasting Using Deep Optimized Convolutional Neural Network speed wind turbine systems via reinforcement learning. Appl. Energy 353, 122034.
Models. IEEE, pp. 1–6. Mc Leod, J.E.N., Rivera, S.S., 2023. Reliability Optimization of New Generation Nuclear
Jaritz, M., De Charette, R., Toromanoff, M., Perot, E., Nashashibi, F., 2018. In: End-to- Power Plants Using Artificial Intelligence. In: Reliability Engineering and
End Race Driving with Deep Reinforcement Learning. IEEE, pp. 2070–2075. Computational Intelligence for Complex Systems: Design, Analysis and Evaluation.
Jasmin, E.A., Ahamed, T.I., Raj, V.J., 2011. Reinforcement learning approaches to Cham, Springer Nature Switzerland, pp. 159–173.
economic dispatch problem. Int. J. Electr. Power Energy Syst. 33 (4), 836–845. Meng, F., Bai, Y., Jin, J., 2021. An advanced real-time dispatching strategy for a
Jiang, B. T., Zhou, J., & Huang, X. B. (2020, August). Artificial Neural Networks in distributed energy system based on the reinforcement learning algorithm. Renew.
Condition Monitoring and Fault Diagnosis of Nuclear Power Plants: A Concise Energy 178, 13–24.
Review. In International Conference on Nuclear Engineering (Vol. 83778, p. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu,
V002T08A032). American Society of Mechanical Engineers. K. (2016, June). Asynchronous methods for deep reinforcement learning. In
Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Levine, S., 2019. In: International conference on machine learning (pp. 1928-1937). PMLR.
Residual Reinforcement Learning for Robot Control. IEEE, pp. 6023–6029. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G.,
Juhn, Y., Liu, H., 2020. Artificial intelligence approaches using natural language Hassabis, D., 2015. Human-level control through deep reinforcement learning.
processing to advance EHR-based clinical research. J. Allergy Clin. Immunol. 145 Nature 518 (7540), 529–533.
(2), 463–469. Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Silver, D.,
Khabbaz, A.H., Pouyan, A.A., Fateh, M., Abolghasemi, V., 2017. In: An Adaptive RL 2015. Massively parallel methods for deep reinforcement learning. arXiv preprint
Based Fuzzy Game for Autistic Children. IEEE, pp. 47–52. arXiv:1507.04296.
Kim, J.M., Bae, J., Lee, S.J., 2023. Strategy to coordinate actions through a plant Oh, E., Wang, H., 2020. Reinforcement-learning-based energy storage system operation
parameter prediction model during startup operation of a nuclear power plant. Nucl. strategies to manage wind power forecast uncertainty. IEEE Access 8, 20965–20976.
Eng. Technol. 55 (3), 839–849. Okamoto, W., Kawamoto, K., 2020. In: Reinforcement Learning with Randomized
Kim, H., Jo, Y., Lee, D., 2021. Feasibility study on AI-based prediction for CRUD induced Physical Parameters for Fault-Tolerant Robots. IEEE, pp. 1–4.
power shift in PWRs. Proceedings of the Transactions of the Korean Nuclear Society Oliveira, M.V.D., Almeida, J.C.S.D., 2013. Modeling and control of a nuclear power plant
Virtual Autumn Meeting, Korea (online). using AI techniques.
Lalwani, T., Bhalotia, S., Pal, A., Rathod, V., Bisen, S., 2018. Implementation of a chatbot Park, J., Kim, T., Seong, S., 2020. Providing support to operators for monitoring safety
system using AI and NLP. International Journal of Innovative Research in Computer functions using reinforcement learning. Prog. Nucl. Energy 118, 103123.
Science & Technology (IJIRCST) 6 (3). Park, J., Kim, T., Seong, S., Koo, S., 2022. Control automation in the heat-up mode of a
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444. nuclear power plant using reinforcement learning. Prog. Nucl. Energy 145, 104107.
Lee, D., Arigi, A.M., Kim, J., 2020. Algorithm for autonomous power-increase operation Perrusquía, A., Yu, W., Li, X., 2021. Multi-agent reinforcement learning for redundant
using deep reinforcement learning and a rule-based system. IEEE Access 8, robot control in task-space. Int. J. Mach. Learn. Cybern. 12, 231–241.
196727–196746. Pierart, F., Manríquez, C., Campos, P., 2021. In: Reinforcement Learning Algorithms
Lee, D., Kim, H., Choi, Y., Kim, J., 2021. In: Development of Autonomous Operation Applied to Reactive and Resistive Control of a Wave Energy Converter. IEEE,
Agent for Normal and Emergency Situations in Nuclear Power Plants. IEEE, pp. 1–6.
pp. 240–247. Pylorof, D., Garcia, H.E., 2022. A reinforcement learning approach to long-horizon
Lee, D., Koo, S., Jang, I., Kim, J., 2022. Comparison of deep reinforcement learning and operations, health, and maintenance supervisory control of advanced energy
PID controllers for automatic cold shutdown operation. Energies 15 (8), 2834. systems. Eng. Appl. Artif. Intel. 116, 105454.
Lei, J., Ren, C., Li, W., Fu, L., Li, Z., Ni, Z., Yu, T., 2022. Prediction of crucial nuclear Qi, B., Liang, J., Tong, J., 2023. Fault diagnosis techniques for nuclear power plants: a
power plant parameters using long short-term memory neural networks. Int. J. review from the artificial intelligence perspective. Energies 16 (4), 1850.
Energy Res. 46 (15), 21467–21479. Qi, X., Luo, Y., Wu, G., Boriboonsomsin, K., Barth, M., 2019. Deep reinforcement learning
Leo, R., Milton, R.S., Sibi, S., 2014. In: Reinforcement Learning for Optimal Energy enabled self-learning control for energy efficient driving. Transportation Research
Management of a Solar Microgrid. IEEE, pp. 183–188. Part C: Emerging Technologies 99, 67–81.
Li, J., Liu, Y., Qing, X., Xiao, K., Zhang, Y., Yang, P., Yang, Y.M. 2021, November. The Qian, G., Liu, J., 2022. Development of deep reinforcement learning-based fault
application of Deep Reinforcement Learning in Coordinated Control of Nuclear diagnosis method for rotating machinery in nuclear power plants. Prog. Nucl. Energy
Reactors. In Journal of Physics: Conference Series (Vol. 2113, No. 1, p. 012030). IOP 152, 104401.
Publishing. Qin, P., Ye, J., Hu, Q., Song, P., Kang, P., 2022. Deep reinforcement learning based power
system optimal carbon emission flow. Front. Energy Res. 10, 1017128.

21
Y. Liu et al. Nuclear Engineering and Design 429 (2024) 113655

Qureshi, A.H., Nakamura, Y., Yoshikawa, Y., Ishiguro, H., 2016. In: Robot Gains Social Wei, C., Zhang, Z., Qiao, W., Qu, L., 2015. Reinforcement-learning-based intelligent
Intelligence through Multimodal Deep Reinforcement Learning. IEEE, pp. 745–751. maximum power point tracking control for wind energy conversion systems. IEEE
Radaideh, M.I., Shirvan, K., 2021. Rule-based reinforcement learning methodology to Trans. Ind. Electron. 62 (10), 6360–6370.
inform evolutionary algorithms for constrained optimization of engineering Wei, C., Zhang, Z., Qiao, W., Qu, L., 2016. An adaptive network-based reinforcement
applications. Knowl.-Based Syst. 217, 106836. learning method for MPPT control of PMSG wind energy conversion systems. IEEE
Radaideh, M.I., Du, K., Seurin, P., Seyler, D., Gu, X., Wang, H., Shirvan, K., 2023. NEORL: Trans. Power Electron. 31 (11), 7837–7848.
Neuroevolution optimization with reinforcement learning—applications to carbon- Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., Ba, J., 2017. Scalable trust-region method
free energy systems. Nucl. Eng. Des. 112423. for deep reinforcement learning using kronecker-factored approximation. Advances
Radaideh, M.I., Shirvan, K., 2022. PESA: prioritized experience replay for parallel hybrid in neural information processing systems, 30.
evolutionary and swarm algorithms-application to nuclear fuel. Nucl. Eng. Technol. Wu, Z., Yin, Y., Liu, J., Zhang, D., Chen, J., Jiang, W., 2023. A novel path planning
54 (10), 3864–3877. approach for mobile robot in radioactive environment based on improved deep Q
Radaideh, M.I., Wolverton, I., Joseph, J., Tusar, J.J., Otgonbaatar, U., Roy, N., network algorithm. Symmetry 15 (11), 2048.
Shirvan, K., 2021. Physics-informed reinforcement learning optimization of nuclear Xie, J., Dong, H., Zhao, X., 2023. Data-driven torque and pitch control of wind turbines
assembly design. Nucl. Eng. Des. 372, 110966. via reinforcement learning. Renew. Energy.
Saeed, H.A., Peng, M., Wang, H., Rasool, A., 2023. Autonomous control model for Xie, Y., Zhao, X., Luo, M., 2022. An active-controlled heaving plate breakwater trained
emergency operation of small modular reactor. Ann. Nucl. Energy 190, 109874. by an intelligent framework based on deep reinforcement learning. Ocean Eng. 244,
Saenz-Aguirre, A., Zulueta, E., Fernandez-Gamiz, U., Lozano, J., Lopez-Guede, J.M., 110357.
2019. Artificial neural network based reinforcement learning for wind turbine yaw Xu, P., Zhang, J., Lu, J., Zhang, H., Gao, T., Chen, S., 2022. A prior knowledge-embedded
control. Energies 12 (3), 436. reinforcement learning method for real-time active power corrective control in
Salvato, E., Fenu, G., Medvet, E., Pellegrino, F.A., 2021. Crossing the reality gap: a survey complex power systems. Front. Energy Res. 10, 1009545.
on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Yan, Z., Tan, J., Liang, B., Liu, H., Yang, J., 2022. In: Active Fault-Tolerant Control
Access 9, 153171–153187. Integrated with Reinforcement Learning Application to Robotic Manipulator. IEEE,
Sanayha, M., Vateekul, P., 2022. Model-based deep reinforcement learning for wind pp. 2656–2662.
energy bidding. Int. J. Electr. Power Energy Syst. 136, 107625. Yang, X., Chen, Y., Wang, J., Zheng, H., Liu, H., Zhou, D., Zhou, Q., 2022. Online beam
Sarkar, S., Gundecha, V., Shmakov, A., Ghorbanpour, S., Babu, A.R., Faraboschi, P., orbit correction of MEBT in CiADS based on multi-agent reinforcement learning
Fievez, J., 2022, June. Multi-agent reinforcement learning controller to maximize algorithm. Ann. Nucl. Energy 179, 109346.
energy efficiency for multi-generator industrial wave energy converter. In Yang, Y., Tu, F., Huang, S., Tu, Y., Liu, T. Research on CNN-LSTM DC power system fault
Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 11, pp. diagnosis and differential protection strategy based on reinforcement learning.
12135-12144). frontiers in energy research, 11, 1258549.
Schaul, T., Quan, J., Antonoglou, I., Silver, D., 2015. Prioritized experience replay. arXiv Yang, T., Zhao, L., Li, W., Zomaya, A.Y., 2021. Dynamic energy dispatch strategy for
preprint arXiv:1511.05952. integrated energy system based on improved deep reinforcement learning. Energy
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 235, 121377.
85–117. Yi, Z., Luo, Y., Westover, T., Katikaneni, S., Ponkiya, B., Sah, S., Khanna, R., 2022. Deep
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region reinforcement learning based optimization for a tightly coupled nuclear renewable
policy optimization. In International conference on machine learning (pp. 1889- integrated energy system. Appl. Energy 328, 120113.
1897). PMLR. Yin, X., Lei, M., 2022. Deep reinforcement learning based coastal seawater desalination
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy via a pitching paddle wave energy converter. Desalination 543, 115986.
optimization algorithms. arXiv preprint arXiv:1707.06347. Yuan, G., Xie, F., 2023. Digital Twin-Based economic assessment of solar energy in smart
Seurin, P., Shirvan, K., 2023. Assessment of Reinforcement Learning Algorithms for microgrids using reinforcement learning technique. Sol. Energy 250, 398–408.
Nuclear Power Plant Fuel Optimization. arXiv preprint arXiv:2305.05812. Zaib, M., Sheng, Q.Z., Emma Zhang, W., 2020. February). A short survey of pre-trained
Shah, S.I.H., Naeem, M., Paragliola, G., Coronato, A., Pechenizkiy, M., 2023. An AI- language models for conversational ai-a new age in nlp. In: Proceedings of the
empowered infrastructure for risk prevention during medical examination. Expert Australasian Computer Science Week Multiconference, pp. 1–4.
Syst. Appl. 225, 120048. Zhang, B., Cao, D., Hu, W., Ghias, A.M., Chen, Z., 2024. Physics-Informed Multi-Agent
Shan, Y., Zheng, B., Chen, L., Chen, L., Chen, D., 2020. A reinforcement learning-based deep reinforcement learning enabled distributed voltage control for active
adaptive path tracking approach for autonomous driving. IEEE Trans. Veh. Technol. distribution network using PV inverters. Int. J. Electr. Power Energy Syst. 155,
69 (10), 10581–10595. 109641.
Shresthamali, S., Kondo, M., Nakamura, H., 2017. Adaptive power management in solar Zhang, T., Dong, Z., Huang, X., 2024. Multi-objective optimization of thermal power and
energy harvesting sensor node using reinforcement learning. ACM Transactions on outlet steam temperature for a nuclear steam supply system with deep reinforcement
Embedded Computing Systems (TECS) 16 (5s), 1–21. learning. Energy 286, 129526.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M., 2014. In: Zhang, G., Hu, W., Cao, D., Huang, Q., Chen, Z., Blaabjerg, F., 2021. A novel deep
Deterministic Policy Gradient Algorithms. PMLR, pp. 387–395. reinforcement learning enabled sparsity promoting adaptive control method to
Sobes, V., Hiscox, B., Popov, E., Archibald, R., Hauck, C., Betzler, B., Terrani, K., 2021. improve the stability of power systems with wind energy penetration. Renew.
AI-based design of a nuclear reactor core. Sci. Rep. 11 (1), 19646. Energy 178, 363–376.
Song, X., Sun, J., Tan, S., Ling, R., Chai, Y., Guerrero, J.M., 2024. Cooperative grid Zhang, S., Lan, F., Xue, B., Chen, Q., Qiu, X., 2023. A novel automatic generation control
frequency control under asymmetric V2G capacity via switched integral method with hybrid sampling for the multi-area interconnected grid. Front. Energy
reinforcement learning. Int. J. Electr. Power Energy Syst. 155, 109679. Res. 11, 1280724.
Sutton, R.S., 1988. Learning to predict by the methods of temporal differences. Mach. Zhang, F., Yang, Q., Li, D., 2023. A deep reinforcement learning-based bidding strategy
Learn. 3, 9–44. for participants in a peer-to-peer energy trading scenario. Front. Energy Res. 10,
Sutton, R.S., Barto, A.G., 1999. Reinforcement learning: an introduction. Robotica 17 (2), 1017438.
229–235. Zhao, Y., Smidts, C., 2022. Reinforcement learning for adaptive maintenance policy
Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: An introduction [M]. MIT press. optimization under imperfect knowledge of the system degradation model and
Tai, L., Liu, M., 2016. Towards cognitive exploration through deep reinforcement partial observability of system states. Reliab. Eng. Syst. Saf. 224, 108541.
learning for mobile robots. arXiv preprint arXiv:1610.01733. Zhong, W., Wang, M., Wei, Q., Lu, J., 2022. A new neuro-optimal nonlinear tracking
Thrun, S., Littman, M.L., 2000. Reinforcement learning: an introduction. AI Mag. 21 (1), control method via integral reinforcement learning with applications to nuclear
103. systems. Neurocomputing 483, 361–369.
Tian, Y., 2020. Artificial intelligence image recognition method based on convolutional Zhong, X., Zhang, L., Ban, H., 2023. Deep reinforcement learning for class imbalance
neural network algorithm. IEEE Access 8, 125731–125744. fault diagnosis of equipment in nuclear power plants. Ann. Nucl. Energy 184,
Tian, Y., Cao, X., Huang, K., et al., 2021. Learning to drive like human beings: a method 109685.
based on deep reinforcement learning[J]. IEEE Trans. Intell. Transp. Syst. 23 (7), Zhou, G., Tan, D., 2023. Review of nuclear power plant control research: neural network-
6357–6367. based methods. Ann. Nucl. Energy 181, 109513.
Van Hasselt, H., Guez, A., Silver, D., 2016, March. Deep reinforcement learning with Zhu, J., Hu, W., Xu, X., Liu, H., Pan, L., Fan, H., Chen, Z., 2022. Optimal scheduling of a
double q-learning. In Proceedings of the AAAI conference on artificial intelligence wind energy dominated distribution network via a deep reinforcement learning
(Vol. 30, No. 1). approach. Renew. Energy 201, 792–801.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N., 2016. In: Dueling Zou, S., Zhou, X., Khan, I., Weaver, W.W., Rahman, S., 2022. Optimization of the
Network Architectures for Deep Reinforcement Learning. PMLR, pp. 1995–2003. electricity generation of a wave energy converter using deep reinforcement learning.
Watkins, C.J., Dayan, P., 1992. Q-Learning. Machine Learning 8, 279–292. Ocean Eng. 244, 110363.

22

You might also like