Applications of Deep Reinforcement Learning in Nuclear Energy: A Review
Keywords: Nuclear energy; Artificial intelligence; Deep reinforcement learning; Review; Prospects

Abstract

In recent years, deep reinforcement learning (DRL), as an important branch of artificial intelligence (AI), has been widely used in physics and engineering domains. It combines the perceptual advantages of deep learning (DL) and the decision-making advantages of reinforcement learning (RL), and is well suited to solving "perception-decision" problems with high-dimensional and nonlinear characteristics. In this paper, firstly, the algorithm principles, mainstream frameworks, characteristics and advantages of DRL are summarized. Secondly, the application research status of DRL in other energy fields is reviewed, which provides a reference for its possible impact and future research directions in the field of nuclear energy. Thirdly, the main research directions of DRL in the field of nuclear energy are summarized and commented on, and the application architecture and advantages of DRL are illustrated through specific application cases. Finally, the advantages, limitations and future development directions of DRL in the field of nuclear energy are discussed. The goal of this review is to provide an understanding of DRL capabilities along with state-of-the-art applications in nuclear energy to researchers wishing to address new problems with these methods.
learning, semi-supervised learning and RL. Fig. 1 illustrates the boundaries and applications of the four learning styles. Supervised learning enables prediction and classification on unlabeled data by learning from already labeled data. Unsupervised learning is used to process unlabeled data; it automatically learns to discover patterns and structures in the data in order to classify and make predictions on unknown data. Semi-supervised learning lies between supervised and unsupervised learning: it mainly uses a small portion of labeled data to train the model and a large amount of unlabeled data to refine the model in order to improve its predictive ability. RL learns the optimal decision-making policy through the interaction of an agent with its environment, with maximizing the cumulative reward as the learning objective. RL has found good applications in robot navigation, autonomous driving, industrial control and other fields.

2.2. RL

RL is usually categorized into two types, model-free algorithms and model-based algorithms (Sutton and Barto, 2018). Model-based approaches incorporate a model of the environment with which they interact, whereas this paper focuses on model-free RL algorithms, which have been widely studied for their ease of use and implementability in various domains. Model-free approaches are further distinguished into value-based and policy-based approaches (Sutton and Barto, 1999; Thrun and Littman, 2000).

RL includes several main components: agent, environment, state, action and reward, as shown in Fig. 2. The agent is the entity that performs actions and learns from the environment, updating its policy function and value function from data samples; the environment is the world with which the agent interacts; the state is the specific situation of the environment at a certain moment, which is usually used as the basis for decision-making; an action is an operation the agent can perform to influence the environment; and the reward is the direct feedback for performing a specific action, which is the basis of RL.

2.2.1. Markov process

The Markov process is a mathematical model used in RL to describe the environment. Its core property is that the next state of the system depends only on the current state and is independent of past states and actions, i.e., it is Markovian. A Markov process consists of a set of states and their transition probabilities. Introducing actions and rewards yields a Markov decision process (MDP) (Bellman, 1957; Howard, 1960). The MDP consists of an environment state space (S), an action space (A), a reward (R), a state transition probability (T) and a discount factor (γ). The policy π is a mapping from the state space to the action space. The agent takes action a_t ∈ A in state s_t ∈ S according to the policy π, moves to the next state according to the state transition probability T, and receives reward r_t ∈ R. The goal of RL is to optimize the policy so as to maximize the reward. The MDP provides a clear framework for dealing with this type of decision-making problem and lays the foundation for the design of RL algorithms.

2.2.2. Value-based methods

In a value-based approach, the agent aims to learn an optimal value function, which in turn determines its policy by selecting the action with the highest estimated value. Typically, we define the state value function and the state-action value function.

The state-value function is defined as:

V^π(s) = E_{τ∼π}[R(τ) | s]    (1)

where V^π(s) represents the expected discounted cumulative reward obtained by following the trajectory τ according to the policy π from the state s_t onwards. Here, E denotes the expectation and R(τ) represents the reward value of trajectory τ. Meanwhile, the trajectory τ refers to the sequence of states, actions and rewards that an agent encounters in the environment as it follows its policy π.

The action-value function is defined as:

Q^π(s, a) = E_{τ∼π}[R(τ) | s, a]    (2)

which denotes the expected discounted cumulative reward for starting from state s_t, taking action a_t and then following trajectory τ according to policy π.

The following relationship exists between the state value function and the state-action value function:

V^π(s) = E_{a∼π}[Q^π(s, a)]    (3)

That is, V^π(s) is the average of Q^π(s, a) over all possible actions, weighted by the probability with which the policy selects each action.

One of the value-based approaches is Q-learning, which finds the optimal policy by learning the Q-function. In traditional Q-learning, the Q-function is represented by a Q-table, which provides an estimate Q*(s, a) of the optimal Q-value for each of the state-action pairs (s, a) in the state space S and the action space A. The Q-table starts from a random initialization, and its values are gradually updated according to Bellman's optimality condition as the agent continues to explore the environment (Bellman, 1966):

Q*(s, a) = R(s, a) + γ max_{a'} Q*(s', a')    (4)

where R(s, a) is the reward received for taking action a in state s, and γ is the discount factor for future rewards.
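As a concrete illustration of the update toward the Bellman target of Eq. (4), the following minimal sketch (our own example, not code from any work reviewed here) runs tabular Q-learning with an incremental learning rate and ε-greedy exploration on a small randomly generated MDP; the environment, its transition table and all hyperparameters are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                                   # toy MDP, dynamics invented for the example
P = rng.integers(0, n_states, size=(n_states, n_actions))    # deterministic next state s' = P[s, a]
R = rng.normal(size=(n_states, n_actions))                   # reward R(s, a)
gamma, alpha, eps = 0.9, 0.1, 0.1

Q = np.zeros((n_states, n_actions))                          # Q-table, initialized before exploration
s = 0
for step in range(20000):
    # epsilon-greedy action selection
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = int(P[s, a]), R[s, a]
    # incremental temporal-difference step toward the Bellman optimality target of Eq. (4)
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    s = s_next

print("Learned Q-table:\n", Q)
```

In DRL, the Q-table above is replaced by a neural network approximator, which is exactly the step taken by the DQN family discussed in Section 2.3.2.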
2.2.3. Policy-based methods

The policy-based approach optimizes the parameterized policy π_θ(a|s) directly instead of determining the optimal policy through value function estimation (Thrun and Littman, 2000). Compared to the value-based approach, the policy-based approach naturally handles high-dimensional action spaces and is capable of learning stochastic policies. Although they may fall into local minima, policy-based methods have more stable convergence.

In order to evaluate the merits of a policy, it is necessary to define an objective function based on the expected cumulative reward:

J(θ) = E_{τ∼π_θ}[R(τ)]    (5)

By using the log-probability (likelihood-ratio) trick, we can represent ∇_θ J(θ) as a computable expectation:

∇_θ J(θ) = E_{τ∼π_θ}[ Σ_{t=0}^{T} ∇_θ log(π_θ(a_t|s_t)) R(τ) ]    (7)

where log denotes the logarithm function, capturing the contribution of the log-probability, and ∇_θ signifies the gradient with respect to the policy parameters θ.

This gradient is then utilized to update the policy parameters:

θ ← θ + λ ∇_θ J(θ)    (8)

where λ specifies the learning rate, which is the step size for the updates to the parameter θ.
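To make Eqs. (7) and (8) concrete, here is a minimal Monte-Carlo policy-gradient sketch of our own, using a softmax policy on a toy one-step task; the task, its rewards, the batch size and the learning rate are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
true_reward = np.array([1.0, 2.0])          # assumed per-action rewards of a toy one-state task
theta = np.zeros(2)                          # parameters of a softmax policy pi_theta(a)
lam, batch = 0.05, 64                        # learning rate lambda and trajectories per update

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for it in range(300):
    grad = np.zeros_like(theta)
    for _ in range(batch):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        R = true_reward[a] + rng.normal(scale=0.1)   # sampled return R(tau)
        glogp = np.eye(2)[a] - pi                    # grad of log pi_theta(a) for a softmax policy
        grad += glogp * R                            # Monte-Carlo estimate of Eq. (7)
    theta += lam * grad / batch                      # gradient ascent step of Eq. (8)

print("pi_theta =", softmax(theta))                  # should favour the higher-reward action
```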
2.3. DRL

The rise and development of DRL in the field of information technology is closely related to breakthroughs in RL and the improvement of computational performance (LeCun et al., 2015). Deep neural networks (DNN) (Schmidhuber, 2015) contain additional hidden layers between the input and output layers, which makes it possible to extract high-level features from complex, high-dimensional data spaces with superior prediction performance. DRL synthesizes the decision-making advantages of RL and the perceptual advantages of DNNs, bridges the gap between high-dimensional perceptual inputs and actions, and enables the extension of RL to decision-making problems with high-dimensional state and action spaces. Nowadays, DRL has been applied in several research areas, and since 2010 the number of publications mentioning "deep reinforcement learning" has been increasing geometrically every year, as shown in Fig. 3.

Fig.3. Number of publications mentioning "Deep Reinforcement Learning" (Garnier et al., 2021).

Model-free DRL-based algorithms can be categorized into two classes at the theoretical level, value function-based and policy gradient-based, as shown in Fig. 4. Among them, deep Q-network (DQN) (Mnih et al., 2015) is the originator of the DRL algorithms that approximate the action-value function by DNNs. In order to enhance the speed and stability of DQN algorithms, many scalability methods and improvement ideas have been continuously proposed, including but not limited to Distributional DQN (Nair et al., 2015), Double DQN (Van Hasselt et al., 2016), Prioritized Experience Replay DQN (Schaul et al., 2015), Dueling DQN (Wang et al., 2016), Noisy DQN (Fortunato et al., 2017), Rainbow DQN (Hessel et al., 2018), etc.

On the other hand, DL can also be used to approximate policies. Three main classes of algorithms and their variants in policy gradient methods include deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), statistical methods for natural gradients, and the Actor-Critic (AC) (Sutton and Barto, 2018) framework. DDPG is the most common one, being a combination of deterministic policy gradient (DPG) (Silver et al., 2014) and AC; its variants mainly include Distributed DDPG (Barth-Maron et al., 2018) and twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018), among others. Statistical methods for natural gradients mainly include trust region policy optimization (TRPO) (Schulman et al., 2015) and its variants Actor-Critic using Kronecker-factored trust region (ACKTR) (Wu et al., 2017) and proximal policy optimization (PPO) (Schulman et al., 2017), etc. AC frameworks mainly include Advantage Actor-Critic (A2C) (Babaeizadeh et al., 2016), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), etc.

2.3.1. DL

DL is a technique that mimics the human brain's ability to process information and is based on neural networks. These networks transmit signals through weighted connections and use activation functions to model nonlinear processing. Neural networks learn and optimize performance by calculating the gradient of a loss function and updating the parameters using back-propagation. More knowledge on DL can be found in the literature (LeCun et al., 2015; Schmidhuber, 2015).

2.3.2. DRL algorithm based on value function

DRL algorithms based on value functions mainly approximate the state value function or the action value function by means of DNNs. These algorithms usually use temporal difference learning (Sutton, 1988) or Q-learning methods (Watkins and Dayan, 1992) to update the functions. Currently, the main value function-based DRL algorithms are shown in Fig. 4, and the principal features of DQN and its variant algorithms are described below.

(1) DQN

DQN is a breakthrough DRL algorithm proposed by the DeepMind team in 2015 (Mnih et al., 2015), and it managed to outperform human players on Atari 2600 games. The specific algorithm update process is shown in Fig. 5.

Assume that the neural network model is f, its parameter is θ, and the true action-value function is Q(s, a). The DQN learning algorithm seeks the optimal parameter such that the output of the neural network is as close as possible to the value of the true action-value function Q, i.e., f(s, a|θ) ≈ Q(s, a).

Like Q-learning, the DQN algorithm also needs to estimate the value of each action. To keep the error between the neural network approximation of the action value and the true action value small, the objective function is defined as shown in Eq. (9):

J(θ) = E[ (R(s, a) + γ max_{a'∈A} Q(s', a') − f(s, a|θ))^2 ]    (9)

where s' denotes the next state, a' denotes the action selected in the next state, θ denotes the parameters of the neural network, f(s, a|θ) is the value estimated by the neural network, and R(s, a) + γ max_{a'∈A} Q(s', a') denotes the temporal difference (TD) error objective. To optimize the parameters of the neural network, the gradient descent method is used to minimize the objective function, and the update formula is shown in Eq. (10):

θ = θ − α ∂J(θ)/∂θ    (10)

where α specifies the learning rate, which is the step size for the updates to the parameter θ.

The DQN algorithm introduces an experience replay method and a dual network method to improve itself. The experience replay method effectively reduces the correlation between consecutive training samples, which facilitates the DL perception; in the dual network structure, the training network f(s, a|θ) is used to evaluate the value of the current action in the output, while the target network f(s', a'|θ*) is used to evaluate the value of the action in the temporal difference error target.
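The following compressed PyTorch sketch (our own illustration, not the implementation of any cited work) shows how these pieces fit together: transitions are stored in a replay buffer, mini-batches are sampled to break sample correlation, the TD target is computed with a frozen target network, and the squared TD error of Eq. (9) is minimized by gradient descent. The toy transition generator, network sizes and hyperparameters are all assumptions.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))       # training network f(s,a|theta)
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # target network f(s,a|theta*)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                                 # shared experience pool

for step in range(500):
    # toy transition generator standing in for a real environment
    s, a = torch.randn(obs_dim), random.randrange(n_actions)
    r, s_next, done = random.random(), torch.randn(obs_dim), random.random() < 0.1
    replay.append((s, a, r, s_next, done))

    if len(replay) >= 64:
        batch = random.sample(replay, 64)                     # de-correlate consecutive samples
        s_b = torch.stack([t[0] for t in batch])
        a_b = torch.tensor([t[1] for t in batch])
        r_b = torch.tensor([t[2] for t in batch])
        s2_b = torch.stack([t[3] for t in batch])
        d_b = torch.tensor([float(t[4]) for t in batch])
        with torch.no_grad():                                 # TD target uses the frozen target network
            td_target = r_b + gamma * (1 - d_b) * target_net(s2_b).max(dim=1).values
        q_sa = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, td_target)        # squared TD error, cf. Eq. (9)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    if step % 100 == 0:
        target_net.load_state_dict(q_net.state_dict())        # periodic target-network refresh
```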
(2) Distributional DQN.

Distributional DQN (Nair et al., 2015) proposes a distributed architecture to solve the problem of DQN relying on a single machine during training, which results in long training cycles. This architecture is divided into four core parts:

Parallel actors: the architecture deploys multiple actors (N), each with its own copy of the Q-network, executing different policies in the same environment and acquiring varied empirical data.

Experience replay mechanism: the experience of all actors interacting with the environment is collected and stored in a shared experience pool.

Parallel learners: multiple learners compute the gradient of the loss function in parallel, utilizing data from the experience pool and uploading the gradients to a parameter server for updating the parameters of the Q-network.

Parameter server: accepts the gradient information from the learners and updates the parameters of the Q-network by the gradient descent method.

(3) Double DQN.

Double DQN (Van Hasselt et al., 2016) addresses the problem of Q-learning overestimation bias by decoupling action selection from bootstrap action evaluation. The max_{a'∈A} Q(s', a') term in the temporal difference error objective can be transformed as shown in the following equation:

max_{a'∈A} Q(s', a') = Q_{w^-}(s', argmax_{a'} Q_{w^-}(s', a'))    (11)

In the above equation, the Q-values are all computed with the target network. Double DQN instead uses the target network for the outer Q-value computation and the training network for the inner Q-value computation. The improved TD error target is shown in the following equation:

r + γ Q_{w^-}(s', argmax_{a'} Q_w(s', a'))    (12)

In the above formula, r represents the reward at the current timestep. The remaining part signifies the discounted return estimated by the Q-networks, where Q_{w^-} denotes the Q-value from the target network, and Q_w denotes the Q-value from the training network.

The Dueling DQN (Wang et al., 2016) algorithm proposes a new dueling network architecture that can generalize across actions by representing state values and action advantages separately. The Dueling DQN algorithm pays more attention to the differences in the dominance (advantage) values of different actions, with the Q-value function given by the following equation:

Q_{η,α,β}(s, a) = V_{η,α}(s) + A_{η,β}(s, a)    (13)

where V_{η,α}(s) denotes the state value function, A_{η,β}(s, a) denotes the action dominance function, η denotes the network parameters shared by both, and α and β denote the network parameters of the state value function and the action dominance function, respectively.

(4) Prioritized Experience Replay DQN.

Prioritized Experience Replay DQN (Schaul et al., 2015) proposes a non-uniform sampling scheme for the experience pool in which transitions with larger temporal difference errors are replayed more frequently, so that learning concentrates on the most informative experiences.

(5) Rainbow DQN.

Rainbow DQN (Hessel et al., 2018) is an integrated DQN variant that combines six innovative techniques (Distributional DQN, Double DQN, Prioritized Experience Replay DQN, Dueling DQN, Noisy DQN, and multi-step learning) that overcome the limitations of standard DQN in specific applications. DeepMind's research demonstrates the successful integration of these techniques and points to the synergistic effects they can have when brought together. The ablation study further elucidates the specific contribution of each component to the overall performance. The experimental validation results are shown in Fig. 6, where the experiments show that Rainbow DQN performs well in benchmarks on the Atari 2600, with significant improvements in both learning efficiency and final performance. The contribution of each DQN variant to the overall performance is also shown based on the detailed ablation experiment results.

2.3.3. DRL algorithm based on policy gradient

Policy gradient-based DRL algorithms use policy gradient methods to directly parameterize a policy and learn the policy parameters by optimizing an objective function of expected returns. The advantage of such methods is that they are naturally applicable to high-dimensional action spaces and continuous action spaces, and they are capable of learning stochastic policies. Currently, the main DRL algorithms based on policy gradients are shown in Fig. 4, and the principal features of the main algorithms are described in detail below.

(1) Policy gradient

The policy network in the policy gradient algorithm is denoted by π_θ, where θ is the weight parameter of the policy. The optimization objective of the algorithm is max_θ E[R | π_θ], where R = Σ_{t=0}^{T} r_t represents the total reward obtained by the agent over a series of actions. The updating process of the policy gradient algorithm is detailed as follows: assuming that a complete set of states, actions, and rewards constitutes a trajectory τ = (s_0, a_0, r_1, ⋯, s_{T−1}, a_{T−1}, r_{T−1}, s_T), the policy gradient can be expressed as:

g = Σ_{t=0}^{T−1} R ∇_θ ln π(a_t|s_t; θ)    (14)

(2) DDPG

The DPG algorithm (Silver et al., 2014) performs action selection in a deterministic manner, selecting a fixed action in a given state.
In DDPG, the Q-network (critic) parameters θ^Q are adjusted along the gradient of the critic loss L(θ^Q) with learning rate α^Q:

θ^Q_{t+1} = θ^Q_t + α^Q ∇_{θ^Q} L(θ^Q)    (19)
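Because the surrounding derivation (Eqs. (15)-(18)) is not reproduced above, the short PyTorch sketch below is offered only as our own generic illustration of a DDPG-style update step: a critic trained on a TD loss as in Eq. (19), followed by a deterministic actor updated through the critic. Target networks, the replay buffer and exploration noise used by the full algorithm are omitted, and all shapes, data and constants are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)   # plays the role of alpha_Q in Eq. (19)

# a fabricated mini-batch of transitions (s, a, r, s') standing in for replayed experience
s = torch.randn(64, obs_dim); a = torch.rand(64, act_dim) * 2 - 1
r = torch.randn(64, 1); s_next = torch.randn(64, obs_dim)

# critic step: gradient step on the TD loss L(theta_Q), cf. Eq. (19)
with torch.no_grad():
    a_next = actor(s_next)
    td_target = r + gamma * critic(torch.cat([s_next, a_next], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), td_target)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# actor step: deterministic policy gradient, ascend the critic's value of actor(s)
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```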
(3) TD3

TD3 improves on DDPG with three key mechanisms:

a. Clipped double Q-learning: TD3 learns two value functions and uses the smaller of the two value functions for gradient descent, which effectively reduces the phenomenon of policy over-estimation.
b. Delayed update mechanism: TD3 implements a delay mechanism in the update frequency of the policy network, i.e., the value function is updated several times before the policy network is updated once. This mechanism ensures the stability of the value estimation and provides a more reliable learning target for the policy.
c. Target policy smoothing regularization: TD3 adds noise to the target policy, which enhances the exploration ability of the algorithm through this smoothing process and reduces the interference of environmental noise in the learning process.

Table 1 reports the maximum average returns of the policy gradient-based DRL algorithms in different OpenAI Gym environments. The TD3 algorithm exhibits higher sample efficiency and better performance compared to the other algorithms in the selected test environments. The successful practice of TD3 not only provides a new perspective for solving continuous action space problems, but also contributes an important theoretical foundation and empirical experience for the development of DRL algorithms.

Table 1
Performance of policy-based DRL algorithms in a gym environment (Fujimoto et al., 2018).
Environment | DDPG | TD3 | TRPO | ACKTR | PPO | SAC

(4) A3C

The A3C algorithm (Mnih et al., 2016) is based on the idea of asynchronous RL, which utilizes multiple threads to execute agent actions in asynchronous mode. Each actor may be in a different state and take different actions at any given moment. This asynchronous mechanism reduces the correlation between samples and can be an effective alternative to experience replay, allowing the parameters to be updated in an on-policy manner. The A3C algorithm significantly reduces the hardware requirements compared to traditional policy gradient algorithms, which require high-performance GPU processors, whereas A3C only uses multi-core CPUs.

Also, the dominance (advantage) function can be used as an alternative in the evaluation of the value function; it is defined as follows:

A_t = Q(s_t, a_t) − V(s_t)    (22)

or:

A_t = r_t + γ V(s_{t+1}) − V(s_t)    (23)

where the term Q(s_t, a_t) denotes the action-value function at state s_t when action a_t is taken, and the term V(s_t) represents the value function of the current state s_t.

By using the dominance function instead of the Q-value function, the probability of occurrence of high-quality behaviors can be enhanced more effectively. In addition, the use of the dominance function can further reduce the variance of the algorithm. A3C also has a wide range of applications and is suitable for a variety of 2D and 3D environments, as well as discrete and continuous action spaces, showing great versatility.
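As a small numeric illustration of Eq. (23) (an example of our own, with made-up value estimates and rewards), the one-step temporal-difference form of the dominance function can be computed as follows; the same quantity reappears as the advantage estimate Â_t in the PPO objective below.

```python
import numpy as np

gamma = 0.99
# assumed value estimates V(s_t) along a short trajectory and the rewards received
V = np.array([1.00, 0.90, 0.85, 0.70, 0.00])   # V[t] = V(s_t); last entry is a terminal bootstrap of 0
r = np.array([0.10, 0.05, 0.20, 0.50])         # r_t received after each step

# Eq. (23): A_t = r_t + gamma * V(s_{t+1}) - V(s_t)
advantage = r + gamma * V[1:] - V[:-1]
print(advantage)
```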
(5) TRPO

The TRPO algorithm has made significant progress in solving the stability and sample efficiency problems in RL by introducing a trust region. The TRPO algorithm endeavors to ensure the stability of the learning process by ensuring that the new policy does not deviate excessively from the old policy when updating the policy. Specifically, TRPO aims to maximize the objective function while constraining the policy update step size to avoid generating disruptive and drastic updates. The core of this approach is to solve the following problem:

max_θ E_{s,a∼π_{θold}}[ (π_θ(a|s) / π_{θold}(a|s)) A^{π_{θold}}(s, a) ]    (24)

while being subject to a KL divergence constraint:

E_{s∼π_{θold}}[ D_KL(π_{θold}(⋅|s) ‖ π_θ(⋅|s)) ] ≤ δ    (25)

where D_KL is the Kullback-Leibler divergence between the old and new policies, which measures the difference in the distributions of the policies for a given state, and δ is a predefined small value that limits the difference between the old and new policies. In this way, TRPO ensures that the iterative updating of policies is both large enough that effective new policies can be learned and small enough to avoid excessive deviations that lead to performance degradation. Compared to traditional policy gradient methods, TRPO demonstrates higher sample efficiency and better performance in continuous control tasks and high-dimensional action space problems. In addition, the TRPO algorithm is relatively insensitive to the choice of hyperparameters, which makes the algorithm more practical.

(6) ACKTR

The ACKTR algorithm (Wu et al., 2017) further improves the policy gradient approach based on TRPO. The algorithm aims to optimize the policy update process through efficient natural gradient descent. Natural gradient descent takes into account the geometric properties of the parameter space to make the update closer to the shape of the policy probability distribution, thus improving the learning efficiency.

The key innovation of ACKTR is the application of the Kronecker-Factored Approximate Curvature (KFAC) approximation to the Fisher information matrix in natural gradient descent, which significantly reduces computational complexity while maintaining the accuracy of the approximation. This is particularly critical for large-scale problems where a complete computation of the Fisher information matrix is impractical. ACKTR optimizes the policy network of the Actor and enhances the value network update of the Critic through KFAC. In several experiments, ACKTR demonstrates superior performance relative to A3C and TRPO, especially in complex environments, showing lower variance and higher data efficiency. This makes ACKTR particularly suitable for situations where sample efficiency is critical.

(7) PPO

The PPO algorithm (Schulman et al., 2017) further improves the utility and flexibility of the algorithm with a simplified objective function based on TRPO. It focuses on solving the problems that DQN cannot handle continuous actions and that the A3C algorithm needs to adjust numerous hyper-parameters and suffers from low sampling efficiency. A surrogate objective is used in PPO:

L(θ) = Ê[ r_t(θ) Â_t ]    (26)

Here, r_t(θ) = π_θ(a_t|s_t) / π_{θold}(a_t|s_t) represents the ratio between the new and old policies, while Â_t represents the estimate of the dominance function. PPO aims to optimize the policy so that the new policy produces better actions relative to the old one, but the improvement is limited to keep the training stable. PPO accomplishes this with the following improved objective function:

L(θ) = Ê[ min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t) ]    (27)

where clip(r_t(θ), 1 − ε, 1 + ε) limits the policy ratio r_t(θ) to remain within [1 − ε, 1 + ε]. This restriction ensures that the new policy does not deviate too far and ensures the stability of the algorithm.
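A minimal PyTorch rendering of the clipped surrogate loss of Eq. (27), using fabricated log-probabilities and advantage estimates purely for illustration (our own sketch, not the implementation of any cited work), is shown below; in practice the log-probabilities would come from the current and frozen policy networks, and Â_t from an advantage estimator such as Eq. (23).

```python
import torch

eps = 0.2                                               # clip range epsilon
# assumed sampled quantities for one mini-batch
logp_new = torch.randn(256, requires_grad=True)         # log pi_theta(a_t|s_t)
logp_old = logp_new.detach() + 0.1 * torch.randn(256)   # log pi_theta_old(a_t|s_t), held fixed
adv = torch.randn(256)                                  # advantage estimates A_hat_t

ratio = torch.exp(logp_new - logp_old)                  # r_t(theta) of Eq. (26)
unclipped = ratio * adv
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
loss = -torch.min(unclipped, clipped).mean()            # negative of Eq. (27), minimised by the optimiser
loss.backward()                                         # gradients flow to the policy via logp_new
```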
Table 2
Summary of DRL application in non-nuclear field.
Field Application Direction DRL Objective Algorithm Year Reference
Automatic Driving Path tracking, Smoothness and accuracy of planned path tracking PPO 2020 (Shan et al., 2020)
path planning PPO 2019 (Chen et al., 2019)
DDPG 2021 (Tian et al., 2021)
Driving decision Safe and accurate vehicle interaction DDQN, TD3, 2019 (Chen et al., 2019)
SAC 2018 (Jaritz et al., 2018)
A3C
Energy saving, Efficient automobile power efficiency DQN, DDQN 2019 (Qi et al., 2019)
environmental protection
Robot Task decision, control Precise robot control TD3 2019 (Johannink et al., 2019)
Path planning (single, group) Flexible and efficient robot motion A3C 2020 (Liu et al., 2020)
DQN 2019 (Bae et al., 2019)
PPO 2018 (Long et al., 2018)
Failure-tolerant control Robust strategy of robot fault TD3 2022 (Yan et al., 2022)
PPO 2020 (Okamoto and
Kawamoto, 2020)
Unknown exploration Reduce repeated exploration and improve exploration DQN 2016 (Tai and Liu, 2016)
efficiency DDPG 2020 (Hu et al., 2020)
Socializing, human behavior Enhance the ability of robots to socialize with people DQN 2016 (Qureshi et al., 2016)
Learning SARSA, Q- 2021 (Akalin and Loutfi,
learning 2021)
Energy Wind Wind energy scheduling Less fluctuation caused by uncertainty of wind power DQN 2021 (Meng et al., 2021)
Industry Energy management output, improve energy utilization efficiency DDPG 2022 (Zhu et al., 2022)
Unit control Optimize power output AC 2024 (Mazare, 2024)
DNN, MBRL 2023 (Xie et al., 2023)
Q-learning 2016 (Wei et al., 2016)
Q-learning 2015 (Wei et al., 2015)
Q-learning 2019 (Saenz-Aguirre et al.,
DDPG 2021 2019)
(Zhang et al., 2021)
Energy storage prediction Effectively manage the uncertainty of wind power DQN 2020 (Oh and Wang, 2020)
forecasting
Market management and Guide market behavior A3C 2022 (Sanayha and Vateekul,
tendering 2022)
Solar Solar energy scheduling Improve operational performance and reduce costs Q-learning 2020 (Correa-Jullian et al.,
Energy management Q-learning 2014 2020)
(Leo et al., 2014)
Efficiency optimization Increase the total energy collected by solar panels SARSA 2017 (Abel et al., 2017)
Q-learning
Prediction Accurate solar irradiance prediction DQN 2021 (Jalali et al., 2021)
Economic evaluation Solar microgrid effectiveness and flexibility RL 2023 (Yuan and Xie, 2023)
Sensor node power Improve battery performance and reduce excessive power SARSA 2017 (Shresthamali et al.,
management consumption 2017)
Energy of Unit control optimization Optimize the system output power or actively dissipate the Q-learning 2020 (Bruzzone et al., 2020)
Waves wave energy. SARSA, Q- 2021 (Pierart et al., 2021)
learning 2022 (Zou et al., 2022)
DQN 2022 (Sarkar et al., 2022)
PPO 2022 (Xie et al., 2022)
PPO
Desalination of sea water Reduce the uncertainty and fluctuation of fresh water DDPG 2022 (Yin and Lei, 2022)
supply
Power System System control Improve the control efficiency of power and voltage of SIRL 2024 (Song et al., 2024)
power grid system. MADDPG 2023 (Li et al., 2023)
MADRL 2024 (Zhang et al., 2024)
DQN 2022 (Xu et al., 2022)
Q-learning 2023 (Zhang et al., 2023)
Q-learning 2021 (Liu et al., 2021)
Power grid dispatching Reduce operating costs and optimize voltage distribution GRL 2023 (Chen et al., 2023)
DDPG 2022 (Chen et al., 2022)
Defense and protection Improve the ability of attack detection and defense and DQN 2023 (Deng et al., 2023)
fault protection 2023 (Yang et al)
2023 (Lin et al., 2023)
Market Guide market behavior DQN 2023 (Li et al., 2023)
AC 2023 (Zhang et al., 2023)
Carbon emission Improve the economic indicators of carbon emissions PPO 2022 (Xu et al., 2022)
In tests on MuJoCo, Atari 2600 games, and 3D environments, the PPO algorithm shows superior performance compared to the A2C and A3C algorithms.

(8) SAC

The SAC algorithm (Haarnoja et al., 2018) focuses on solving the convergence problems faced by policy gradient algorithms. SAC introduces the concept of entropy into the objective function, aiming to maximize the cumulative reward as well as the entropy, encouraging the agent to take more stochastic actions while completing the task. The objective function of SAC is as follows:

J = Σ_{t=0}^{T} E_{(s_t,a_t)∼ρ_π}[ r(s_t, a_t) + α H(π(⋅|s_t)) ]    (28)

where H(π(⋅|s_t)) represents the entropy of the policy given the current state, ρ_π represents the distribution of state-action pairs under policy π, and α represents the importance of the entropy relative to the reward. The SAC algorithm encourages exploration by maximizing entropy, which not only prevents the agent from prematurely converging to a suboptimal policy, but also improves the robustness of the algorithm. SAC exhibits superior performance over DDPG and PPO in several continuous control tasks.
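To show how the entropy bonus of Eq. (28) enters a differentiable objective, the sketch below (our own toy example; the Gaussian policy, the reward and the temperature α are assumptions) evaluates a one-step Monte-Carlo estimate of the entropy-regularized return for a reparameterized Gaussian policy in PyTorch.

```python
import torch
from torch.distributions import Normal

alpha = 0.2                                        # entropy temperature alpha in Eq. (28)
# assumed policy: a diagonal Gaussian over a 1-D action, parameterised per state
mean = torch.zeros(128, 1, requires_grad=True)
log_std = torch.zeros(128, 1, requires_grad=True)
policy = Normal(mean, log_std.exp())

actions = policy.rsample()                         # reparameterised sample a_t ~ pi(.|s_t)
rewards = -actions.pow(2).squeeze(-1)              # toy reward r(s_t, a_t): prefer actions near 0
entropy = policy.entropy().squeeze(-1)             # H(pi(.|s_t))

objective = (rewards + alpha * entropy).mean()     # Monte-Carlo estimate of Eq. (28) for one step
(-objective).backward()                            # maximise by descending the negated objective
```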
3. DRL in nuclear energy

Section 2 introduced the DRL algorithms in detail; ultimately, these algorithms need to be implemented in practical applications in the field of nuclear energy. In this section, we first summarize and analyze the application research experience of DRL methods in the industrial field. Then, the research trends and research status of DRL in the field of nuclear energy are introduced. Finally, the key research directions and application cases of DRL in the field of nuclear energy are introduced in detail.

3.1. Research experience in DRL in the industrial field

As an advanced and powerful learning framework, the strength of the DRL method stems from the ability of RL to deal with complex and dynamic environments and to make adaptive decisions in the face of unknown scenarios. In recent years, it has become a high-profile cutting-edge technology in the fields of autonomous driving, complex robot control, and energy and power, and has shown excellent application potential. This section summarizes the application cases of DRL in autonomous driving, robot control, energy and power and other industries, as shown in Table 2, to provide a reference basis for future research and applications of DRL in the field of nuclear energy.

(1) Automatic driving.

The automatic-driving decision-making system based on DRL algorithms gives the vehicle the ability to learn independently and operate the driving task, which provides technical support for the real realization of intelligent vehicle driving.

Firstly, RL achieves intelligent and safe navigation in path planning. Shan et al. (2020) and Chen et al. (2019) used RL to adjust the predefined path deviation to adapt to path tracking in medium- and high-speed driving scenarios. Tian et al. (2021) achieved accurate path tracking through behavioral cloning of human driving habits. Secondly, the automatic driving system needs to make real-time decisions in a complex and changeable traffic environment, such as overtaking and lane changing. Chen et al. (2019) used the RL method to realize the automatic driving task in urban dense-vehicle scenes. Jaritz et al. (2018) achieved faster convergence and more robust driving on complex road structures such as turns and hills, and generalized the learned behavior to unknown road environments and trajectories. In addition, RL also plays an important role in energy saving and environmental protection during driving. The vehicle energy management system designed by Qi et al. (2019) independently learns the best fuel/power distribution from the interaction between the vehicle and the traffic environment, and improves the fuel economy of the vehicle.

(2) Robot.

In recent years, DRL algorithms have significantly improved the adaptability of mobile robots in unstructured environments and their trajectory-planning ability in complex tasks. Therefore, the development and continuous improvement of DRL algorithms for robot control has become a hot research topic worldwide.

First of all, RL can realize task control for robot object grasping, assembly and so on. Johannink et al. (2019) successfully trained a robot to complete such control tasks. Secondly, RL can control the trajectory of robots in complex environments, so that they can better cope with dynamic obstacles and uncertain environments. Liu et al. (2020) realized motion planning in crowded and chaotic public-environment scenes. In addition, group robot path planning is also one of the research topics of RL methods. Bae et al. (2019) realized multi-robot path planning by analyzing image information of the surrounding environment. Long et al. (2018) trained a large number of robots in a rich and complex environment, and verified a sensor-level collision avoidance strategy. RL can help robots make adaptive decisions in the face of emergencies or system failures. Yan et al. (2022) carried out active fault-tolerant control for robots with manipulator failures to ensure system security. Okamoto et al. (Okamoto and Kawamoto, 2020) used RL to obtain robust strategies for robot faults. RL can also provide solutions for unknown exploration by robots. Tai et al. (Tai and Liu, 2016) designed a mobile robot that uses only the depth information of RGB-D sensors to explore an unknown indoor environment. Hu et al. (2020) designed a multi-mobile-robot collaborative exploration strategy that saves time and energy costs. In addition, RL methods are also used for robots to learn social rules, human behavior patterns, and human socialization (Qureshi et al., 2016; Akalin and Loutfi, 2021).

(3) Energy industry.

DRL technology has rich research and application in energy demand forecasting, energy production optimization, energy efficiency improvement, energy system control and so on.

In the field of wind energy, firstly, RL plays an important role in the scheduling management of wind energy. The advanced real-time scheduling strategy for a distributed energy system proposed by Meng et al. (2021) explores the optimal real-time scheduling for 24-hour wind demand disturbance. The intelligent scheduling strategy for distribution networks proposed by Zhu et al. (2022) reduces power fluctuation. Secondly, RL can effectively deal with the dynamic and nonlinear characteristics of wind energy systems: a wind power control system can learn and optimize the blade angle and speed parameters of the wind turbine in real time, adapting to changing wind speed and meteorological conditions. Mazare (2024) realized adaptive optimal safe wind power generation control of a variable-speed wind turbine system. Xie et al. (2023) used RL to control the torque and pitch of wind turbines, showing excellent robustness and control performance in the case of uncertainty and unexpected actuator failures. Wei et al. (2016, 2015) solved the control problem of the variable-speed wind energy conversion system of a permanent magnet synchronous generator. The data-driven yaw control proposed by Saenz-Aguirre et al. (2019) realizes the control of wind direction under different conditions. Zhang et al. (2021) proposed a sparse coordinated control strategy for ordinary power system stabilizers. RL also plays an important role in the prediction and management of energy storage systems (Oh and Wang, 2020), market management and bidding (Sanayha and Vateekul, 2022), etc.

In the field of solar energy, firstly, RL intelligently adjusts the storage and distribution of energy through real-time data, weather forecasts and the power demand of solar power generation systems. Correa-Jullian et al. (2020) and Leo et al. (2014) determine the optimal operation plan and optimal scheduling scheme of a solar water heating system or solar microgrid system. Secondly, RL plays a key role in the efficiency optimization of solar power generation systems. Abel et al. (2017) used RL to increase the total energy harvested by solar panels and optimized the power generation efficiency of the system. At the same time, RL has excellent application effects in solar radiation prediction (Jalali et al., 2021), solar project economic evaluation (Yuan and Xie, 2023), adaptive solar power management (Shresthamali et al., 2017) and other aspects.

In the field of wave energy, RL applications mainly focus on unit control and optimization and on seawater desalination. Bruzzone et al. (2020), Pierart et al. (2021), Zou et al. (2022) and Sarkar et al. (2022) discussed the application of RL in the control of wave energy converters, dynamically adjusting the speed ratio of the generator according to sea conditions to maximize the output power of the system. Xie et al. (2022) used DRL to find the optimal wave dissipation strategy in the control of wave-energy plate breakwaters. Yin et al. (Yin and Lei, 2022) used DRL to control seawater desalination systems to smooth freshwater production.

(4) Power system.

DRL technology has made important contributions to improving the operating efficiency and stability of the power grid and promoting the optimal allocation and efficient utilization of energy.

In the field of power grid systems, RL is widely used in system control, power grid dispatching, defense and protection, markets and so on. In terms of system control, RL is used to optimize the frequency and voltage control of the power grid. Song et al. (2024) combined V2G control with power plant frequency control to complete the frequency regulation task. Li et al. (2023) reduced power loss and optimized voltage fluctuation through real-time voltage control. Zhang et al. (2024) proposed a distributed voltage control for active distribution networks. Xu et al. (2022) embedded prior knowledge into RL to realize real-time power correction control of complex power systems. Zhang et al. (2023) and Liu et al. (2021) achieve optimal multi-area coordination while improving the control performance for frequency deviation caused by strong random interference. Uncertainty in dynamic economic dispatch is a key issue in the safe and economic operation of power systems; the power-system graph RL method designed by Chen et al. (2023) can achieve higher-quality solutions in online operation, and the collaborative scheduling framework of Chen et al. (2022) reduces operating costs and optimizes voltage distribution. Secondly, RL methods can be used for fault detection and processing in power grid systems, as well as defense against various potential threats and attacks. Deng et al. (2023) and Yang et al. used DRL to perform fault detection and cascade protection of power grid systems, respectively. Lin et al. (2023) used DRL to design a false data injection attack model and an adversarial detection method. In addition, RL has also been studied for energy market forecasting and electricity price adjustment (Li et al., 2023; Zhang et al., 2023), carbon emission flow optimization (Qin et al., 2022), etc.

Summarizing the above application research experience of DRL in non-nuclear industrial fields, it is concluded that the application research of DRL mainly focuses on the following five points.

• Task decision planning. DRL has the ability to interact with the environment to learn the best strategy, and can make accurate and safe decision-making plans, for example, avoiding obstacles and changing lanes in automatic driving. It is suitable for solving problems such as the formulation of operating rules for complex tasks in nuclear power plants and the inspection of nuclear power plants by robots.
• Resource management scheduling. DRL has the ability to make accurate data-based predictions in complex and changeable environments, so it can predict demand in time and reasonably allocate and schedule resources, for example, power dispatching in the power grid. It is suitable for solving the problems of power generation planning, transmission line selection and load distribution of nuclear power plants.
• Operation design optimization. DRL has the ability to approach the optimal solution through trial and error and learning in high-dimensional spaces, and can give the optimal solution that achieves the goal under specific constraints, for example, optimizing the power generation efficiency of solar panels. It is suitable for solving the problems of nuclear fuel assembly configuration and core structure design.
• System autonomous control. DRL has good adaptability and robust control capability in complex and uncertain environments, and can develop adaptive emergency control schemes and realize precise control, for example, fault-tolerant control under robot failure. It is suitable for solving the problems of reactor startup and shutdown, and safety control under reactor accidents.
• Unknown exploratory learning. DRL has the ability to gradually find strategies that adapt to an unknown environment through continuous trial and learning, and can give operational strategies suited to complex unknown environments, for example, robots exploring unknown environments while avoiding obstacles. It is suitable for solving the problems of autonomous and safe operation of nuclear systems in deep space environments.

3.2. Application trends of DRL in nuclear energy field

Due to the particular importance of safety in the nuclear energy field, research in this field is relatively conservative. Cutting-edge technologies such as DRL, which have emerged in recent years, have not yet been studied on a large scale in nuclear power.
The relevant research mainly appeared after 2020, as shown in Fig. 8, and showed a blowout trend in 2022 and 2023. This section mainly summarizes and analyzes the research on DRL in the field of nuclear energy, and provides a reference for subsequent research.

The research on DRL technology in the field of nuclear energy mainly focuses on autonomous control, operation and maintenance strategy optimization, design optimization, fault diagnosis, nuclear technology applications, etc., as summarized in Table 3. The specific research directions, results and cases are introduced in detail below.

3.3. Autonomous control

In recent years, research on DRL technology in the field of nuclear energy has mainly focused on the autonomous control of nuclear energy systems. The application of DRL improves the automation and intelligence level of nuclear energy systems, further ensures the safety of system operation, optimizes control performance, and reduces control and operation costs. Lee et al. used DRL technology to autonomously control the reactor power ascension process (Lee et al., 2020) and the cold shutdown process (Lee et al., 2022) of nuclear power plants. Experiments prove the feasibility and potential of DRL technology for autonomous start-stop control of nuclear power plant reactors. The reactor automatic power control architecture they proposed is shown in Fig. 9. The research architecture for autonomous control of nuclear power plants using DRL is generally similar: first, a nuclear power plant simulator is used as the environment for the DRL application; then, the training and optimization of DRL agents are realized through interaction; finally, the trained DRL controller is deployed. They also compared the DRL control effect with the traditional PID control effect during the cold shutdown process, as shown in Fig. 10 and Table 4. From the comparison results in the figure and the table, it can be seen that the DRL control method achieves a better control effect than traditional PID control, which also demonstrates the potential of DRL in the autonomous control of nuclear power plants.

Kim et al. (2023), Bae et al. (2023) and Park et al. (2022) also conducted relevant studies on the start-stop control of nuclear power plant reactors and proved that DRL has the potential to realize automatic control operation.

In the control of key nuclear power system equipment, Li et al. (2022, 2023) studied the control of steam pressure at the outlet of a once-through steam generator. Zhang et al. (2024) used DRL to achieve safe and efficient operation of a nuclear steam supply system. Dong et al. (2020) carried out system control optimization on the basis of energy system dynamics. Li et al. (2021) realized the coordinated control of nuclear reactor power control and steam generator level control.

In the handling of nuclear power system transient accidents, Li et al. (2022) designed an active fault-tolerant control system for faults of the once-through steam generator sensor, which ensured the normal and stable operation of the once-through steam generator. Saeed et al. (2023) conducted emergency treatment for the loss-of-feedwater accident of the small modular reactor IP-200, and Lee et al. (2021) realized emergency treatment after a loss-of-coolant accident in a nuclear power plant, proving that DRL can deal with emergencies on the premise of ensuring the safety of the power plant.

The control of different reactor types has also been studied. Chen et al. (Chen and Ray, 2022) designed a nonlinear control system for a boiling water reactor, and experiments proved the advantages of the DRL control system in anti-interference, perturbation stability and setpoint tracking. Zhong et al. (2022) designed an infinite-horizon optimal tracking control method for continuous-time nonlinear systems, and achieved simulation verification for a 2500 MW PWR nuclear power plant. Mattioni et al. (2023) and Degrave et al. (2022) applied the DRL method to the actual control of the high-temperature plasma of a magnetic confinement nuclear fusion tokamak device, as shown in Fig. 11, and achieved a good control effect; the relevant research results were published in the journal Nature. This is also a breakthrough in the application of DRL technology in the field of nuclear energy.

3.4. Operation and maintenance strategy optimization

DRL technology is also applied to the optimization of operation and maintenance strategies of nuclear energy systems, which effectively reduces the cost investment and shutdown risk during the operation and maintenance of nuclear energy system equipment, and improves the economy and reliability of nuclear energy applications. Zhao et al. (Zhao and Smidts, 2022) used DRL technology to optimize the maintenance strategy of the pump system of a nuclear power plant. Hao et al. (2023) proposed a DRL operation and maintenance strategy to address imperfect knowledge of the nuclear power system degradation model and the challenge of partial observability of the system degradation state. Pylorof et al. (Pylorof and Garcia, 2022) used RL to solve the problem of supervision and control in advanced energy systems. Yi et al. (2022) used DRL to maximize the benefits of hydrogen production in a nuclear-renewable integrated energy system according to the flexibility requirements of the grid.

Hao et al. (2023a,b) used a DRL method to search for the most profitable operation and maintenance strategy of a cyber-physical energy system. According to the available information (such as the production plan, component remaining useful life, component status, remaining maintenance time and system status), the best operation and maintenance action to be performed is selected. The strategy adapts to production fluctuations and the uncertainty of power demand, so as to ensure more reliable and safe production and supply of the power system. At the same time, the authors applied it to the Advanced Lead-cooled Fast Reactor European Demonstrator (ALFRED). By simulating operation and maintenance decision sequences, the comparison results of different maintenance strategies are shown in Tables 4 and 5. As can be seen from the tables, the method based on DRL has a higher average profit; its 95 % Conditional Value at Risk (CVaR) is the lowest, its loss is the lowest, and its number of safe/serious shutdowns is the smallest. During the five-year mission, the common strategies of Corrective Maintenance (CM) and Preventive Maintenance (PM) led to many component failures. The DRL operation and maintenance strategy makes use of information on component health status, which greatly reduces the number of system dysfunctions (safe shutdowns and serious shutdowns) in nuclear power plants. The results show that the optimal solution found by DRL is better than that provided by the most advanced operation and maintenance strategies.

3.5. Design optimization

In the field of nuclear energy design optimization, DRL technology has also contributed excellent application results. It effectively reduces the calculation cost of complex design, search and other issues in the nuclear field, and improves the safety of system equipment at the design stage. In physical computational optimization, Gu et al. (2023) used DRL to solve large-scale optimization problems in expensive particle transport simulations, and verified it on the criticality search and nuclear component geometry optimization of small modular reactors.

Dang et al. (Dang and Ishii, 2022) realized the prediction of the interfacial area of turbulent two-phase flow using DRL, and verified the performance improvement brought by the new method against an experimental database. An interfacial area prediction method based on physics-informed RL is proposed. This method aims to capture the complexity of two-phase flow by using the interfacial area transport equation and the advantages of RL.
Table 3
Summary of DRL research in nuclear energy field.
Application direction Research object DRL objective Algorithm Year Reference
Control Reactor control Reactor rising power control 1. Optimize control strategy A3C 2020 (Lee et al., 2020)
Cold shutdown control 2. Realize autonomous control of startup and shutdown SAC 2022 (Lee et al., 2022)
Autonomous start operation coordination control processes SAC 2023 (Kim et al., 2023)
Multi-objective control during nuclear power plant shutdown 3. Ensure control security SAC 2023 (Bae et al., 2023)
Thermal start-up automatic control A3C 2022 (Park et al., 2022)
Key system equipment OTSG outlet steam pressure control 1. Good tracking ability and anti-interference ability PPO 2022 (Li et al., 2022)
control OTSG Outlet pressure control 2. Small overshoot control effect, fast stabilization speed, strong PPO 2023 (Li et al., 2023)
Multi-objective control of key nuclear steam supply system adaptive ability SAC 2023 (Zhang et al., 2024)
Optimal control of thermoelectric response of nuclear steam supply MLP-based RL 2020 (Dong et al., 2020)
system
Coordinated control of nuclear reactor power and steam generator level DDPG 2021 (Li et al., 2021)
Accident control OTSG sensor fault control 1. Accurate and rapid control in case of accidents to ensure the DQN 2022 (Li et al., 2022)
Autonomous control of small modular reactor under LOFW accident normal and stable operation of the system AC 2023 (Saeed et al., 2023)
Autonomous control under LOCA accident in nuclear power plant 2. Handle emergencies under the premise of ensuring the safety SAC 2021 (Lee et al., 2021)
of nuclear power plants
Control of different Nonlinear control system of boiling water reactor 1 Improve the control performance of pressurized water reactor, DDPG 2022 (Chen and Ray,
reactor types boiling water reactor and nuclear fusion device 2022)
Nonlinear tracking control for 2500 MW PWR nuclear power plant 2. Improve security IRL 2022 (Zhong et al., 2022)
Plasma control 3. Reduce the design workload and control cost Dynamic 2023 (Mattioni et al.,
DNN-DRL 2023)
Tokamak high temperature plasma control MPO 2022 (Degrave et al.,
2022)
Optimize O&M policy Optimization of pump system maintenance strategy for nuclear power 1. Reduce maintenance cost and improve economy MBRL 2022 (Zhao and Smidts,
plant 2. Track changes in demand and improve energy efficiency 2022)
Find the most profitable nuclear power plant operation and DRL 2023 (Hao et al., 2023)
maintenance strategy
Nuclear power plant system operation supervision and optimization SAC 2022 (Pylorof and Garcia,
2022)
To provide optimized operation and maintenance strategy for European PPO, IL 2023 (Hao et al., 2023)
advanced lead-cooled fast reactor demonstration plant
Information physical energy system operation and maintenance strategy PPO, IL 2023 (Hao et al., 2023)
optimization
Renewable integrated energy system to maximize the efficiency of TD3 2022 (Yi et al., 2022)
hydrogen production SAC
PPO
Design Physical computation Critical search and nuclear component geometry optimization for small 1. Solve large-scale optimization problems in particle transport NEORL 2023 (Gu et al., 2023)
optimization optimization modular reactors simulation
Prediction of interface area of turbulent two-phase flow 2. Improve computing performance and reduce optimization IAT-PhyRL 2022 (Dang and Ishii,
cost 2022)
Nuclear fuel Learn the experience of nuclear fuel engineer and choose the solution 1. Use expert knowledge effectively PPO 2021 (Radaideh and
optimization 2. Reduce optimization costs Shirvan, 2021)
Nuclear fuel optimization 3. Improve fuel efficiency PESA 2022 (Radaideh and
Shirvan, 2022)
Fig.9. DRL-based automatic power rise control architecture for nuclear power plant reactors (Lee et al., 2020).
Fig.10. Comparison of pressure between controller and regulator (Lee et al., 2022).
bubble two-phase flow is described, as shown in Fig. 12. The environmental state set of the MDP design includes hydrodynamic and two-phase flow parameters related to the interfacial area transport of bubbles. The average bubble diameter is changed by selecting a multiplication ratio within a given range. Based on an experimental database of vertical upward bubble air–water flow, the performance of the new method is verified, as shown in Fig. 13. The results show that, compared with the IATE model, the IAT-PhyRL model gives better predictions across the different experimental data sources. This demonstrates that the RL model can find the optimal strategy for capturing various types of interfacial transport phenomena from the designed MDP environment.
Other design optimization research mainly focuses on nuclear fuel optimization, where DRL methods can effectively improve fuel efficiency, reduce cost and ensure safety. Radaideh et al. (Radaideh and Shirvan, 2021; Radaideh and Shirvan, 2022; Radaideh et al., 2023) solved the optimization of nuclear fuel assembly combinations, a high-dimensional and computationally heavy physical problem, and reduced the optimization cost. At the same time, Radaideh et al. (Radaideh et al., 2021) provide a decision support tool for the optimization of boiling water reactor assemblies. Seurin et al. (Seurin and Shirvan, 2023) studied the optimization of nuclear fuel loading patterns. The authors establish a connection between RL and the fuel designer's practical strategy of meeting specific constraints and goals by moving fuel rods. The application is verified on a low-dimensional (~2 × 10^6 combinations) and a high-dimensional (~10^31 combinations) boiling water reactor assembly problem. The RL algorithm is more effective than the SO algorithm in solving the high-dimensional problem (10 × 10 assembly) by embedding expert knowledge in the form of game rules and effectively exploring the search space, as shown in Fig. 14. For a given computing resource and time frame related to fuel design, the RL algorithm is superior to the SO algorithm.
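To make the MDP formulation above concrete, the following sketch shows one way such an environment could be organized: the state collects flow conditions and the current mean bubble diameter, the action selects a bounded multiplication ratio, and the reward is the negative error between the predicted interfacial area concentration and an experimental reference value. The class name InterfacialAreaEnv, the ratio grid and the one-step episode structure are illustrative assumptions, not the implementation of Dang and Ishii (2022); only the generic relation a_i = 6α/d between interfacial area concentration, void fraction and mean bubble diameter is standard.

```python
import numpy as np

class InterfacialAreaEnv:
    """Hypothetical MDP for interfacial-area transport prediction (sketch only).

    State : [superficial gas velocity, superficial liquid velocity,
             axial position, current mean bubble diameter]
    Action: index into a discrete set of multiplication ratios applied
            to the mean bubble diameter.
    Reward: negative absolute error between the predicted interfacial
            area concentration (IAC) and a measured reference value.
    """

    RATIOS = np.linspace(0.8, 1.2, 9)  # bounded multiplication ratios (assumed range)

    def __init__(self, experiments):
        # experiments: list of dicts holding flow conditions and measured IAC
        self.experiments = experiments
        self.case = None
        self.diameter = None

    def reset(self):
        self.case = self.experiments[np.random.randint(len(self.experiments))]
        self.diameter = self.case["d_init"]          # initial mean bubble diameter [m]
        return self._state()

    def step(self, action):
        self.diameter *= self.RATIOS[action]         # adjust the mean bubble size
        iac_pred = 6.0 * self.case["alpha"] / self.diameter   # a_i = 6*alpha/d
        reward = -abs(iac_pred - self.case["iac_measured"])   # prediction error
        done = True                                  # one data point per episode
        return self._state(), reward, done, {}

    def _state(self):
        c = self.case
        return np.array([c["jg"], c["jf"], c["z"], self.diameter], dtype=np.float32)
```

An agent trained over many such one-step episodes would learn which multiplication ratio best reproduces the measured interfacial area for a given set of flow conditions.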
Fig.11. Control architecture of DRL on magnetically confined fusion Tokamak device (Degrave et al., 2022).
The structure of the DRL fault diagnosis model proposed by the authors (Qian and Liu, 2022) is shown in Fig. 15. The input of the model is the sensor data feature, and the output is the fault category. The model is rewarded according to whether its classification is right or wrong, so as to learn more essential features. The method was tested on two experimental fault cases, bearings and gears, and the results are shown in Fig. 16. The results show that the DRL model converges more slowly than the DL model, but it has better stability after reaching the converged state. Moreover, the diagnosis performance of the DRL model is better than that of the traditional DL model and classic ML models in both the initial-sample-size and small-sample scenarios.

Table 4
Comparison results of operating performance (Lee et al., 2022).

Parameter | Performance metric | PID-based controller | DRL-based controller
Pressurizer pressure | Average deviation from 27 kg/cm2 | ±0.3248 kg/cm2 (ZN); ±0.1805 kg/cm2 (DRL) | ±0.2816 kg/cm2
Pressurizer pressure | Reaching time to 27 kg/cm2 | 32 min (ZN); 10 min (DRL) | 10 min
Pressurizer level | Average deviation from 50 % | ±9.56 % (ZN); ±6.55 % (DRL) | ±8.79 %
Pressurizer level | Reaching time to 50 % | +144 min (ZN); +38 min (DRL) | +93 min
3.7. Nuclear technology application
Fig.12. MDP environment describing the change of interface area of vertical air–water two-phase flow (Dang and Ishii, 2022).
of non-nuclear energy, research results have been fruitful, and research in the field of nuclear energy has also shown a positive and upward trend. This section discusses the advantages, limitations and prospects of future research directions of the DRL algorithm in the field of nuclear energy, and provides guidance for researchers and engineers on the application of DRL in this field.

4.1. Advantages

Chapter 3 summarizes and reflects on the application effects of DRL. The advantages of DRL are therefore not repeated in detail here; they mainly include the following points.

a. Strong adaptability: DRL can handle complex, non-linear environments. In particular, it has great advantages in problems with high-dimensional spaces that cannot be solved by exhaustive search.
b. Learning ability: DRL has a powerful learning ability. The researcher only needs to specify the learning goal, and DRL can automatically learn and optimize the policy by continuously interacting with the environment.
c. Generalization ability: DRL has good generalization ability. Knowledge and experience can be shared between multiple tasks to improve learning efficiency.
d. Easy to apply: DRL does not need an explicit mathematical model of the system during the application process; less a priori knowledge is needed, no online solution is required, and the design structure is greatly simplified.

4.2. Restrictions

The excellent performance of DRL-based methods leads us to believe that DRL technology will occupy an important place in future applied research on AI in nuclear energy. However, the design, invocation, and deployment of DRL algorithms are still challenging due to the excessive complexity of nuclear energy systems, safety constraints, and practical system issues such as high data acquisition costs. Therefore, the authors summarize the possible limitations and difficulties of intensive DRL research in the nuclear energy field.

(1) Generalization

From the research on the application of DRL in nuclear energy, it is evident that there are difficulties in algorithm integration, software environment compatibility and generalization. There is a lack of a common research interface for integrating DRL into nuclear energy systems, and the software environments used to build nuclear energy system simulators are neither uniform nor compatible with existing mature algorithm libraries. This leads to significant difficulties and additional workload in deploying off-the-shelf algorithms, which in turn leads to excessive focus on algorithm implementation rather than on adapting the algorithm to the problem when a new nuclear energy system problem is encountered. A unified deployment interface and mature algorithm implementations can significantly improve the efficiency of DRL research in nuclear energy.
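As a purely illustrative sketch of what a common interface could look like, the snippet below hides a toy plant simulator behind the widely used Gymnasium reset()/step() convention, so that off-the-shelf DRL libraries can interact with it unchanged. PlantSimulator, its single state variable and the setpoint are hypothetical placeholders, not an existing nuclear simulation code or an interface proposed in the cited works.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class PlantSimulator:
    """Stand-in for a plant simulation code (hypothetical)."""
    def initialize(self):
        self.pressure = 15.0                       # MPa, arbitrary starting point
    def advance(self, valve_cmd, dt=1.0):
        self.pressure += 0.5 * float(valve_cmd) * dt   # toy first-order response
    def observe(self):
        return np.array([self.pressure], dtype=np.float32)

class PlantControlEnv(gym.Env):
    """Gymnasium wrapper exposing the simulator through reset()/step()."""
    def __init__(self, setpoint=15.5):
        self.sim = PlantSimulator()
        self.setpoint = setpoint
        self.observation_space = spaces.Box(0.0, 20.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.steps = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.sim.initialize()
        self.steps = 0
        return self.sim.observe(), {}

    def step(self, action):
        self.sim.advance(action[0])
        obs = self.sim.observe()
        reward = -abs(float(obs[0]) - self.setpoint)   # dense tracking reward
        self.steps += 1
        terminated = False
        truncated = self.steps >= 200                  # episode time limit
        return obs, reward, terminated, truncated, {}
```

Any library that speaks this API, such as Stable-Baselines3, could then be trained on the wrapper without bespoke glue code.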
Fig.13. Prediction results of interfacial area transport under different flow conditions (Dang and Ishii, 2022).
Fig.14. In the case of 10 × 10 optimization, the total number of feasible patterns found by different methods (Seurin and Shirvan, 2023).
(2) Sampling of data

The most controversial topic in DRL is its poor learning efficiency. RL requires a large amount of time to collect enough data samples to ensure that the model achieves the expected performance. The dimension of the state-action space, the complexity of the control problem, the available computing power, and the trial-and-error nature of RL all contribute to this inefficiency, which in turn increases the training cost.
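A generic mitigation that the off-policy algorithms cited in this review (DQN, SAC, TD3) already rely on is experience replay: transitions are stored and reused across many gradient updates instead of being discarded after a single use. The minimal buffer below is a textbook sketch rather than the implementation of any particular study.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly re-sample past experience; each stored transition can be
        # reused in many gradient updates, improving data efficiency.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```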
Fig.15. Fault diagnosis model based on DRL (Qian and Liu, 2022).
Fig.16. Effect comparison of different fault diagnosis models (Qian and Liu, 2022).
In the future, how to improve the sampling efficiency of the algorithms and the efficiency of data usage will be a non-negligible issue.

(3) Training setup

The difficulty of setting up DRL application training mainly lies in algorithm selection and the design of the reward function, which together determine the final application effect of DRL.
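The sketch below illustrates the sparsity issue for a hypothetical pressure-control task: a sparse reward is only informative once the target band is already reached, whereas a shaped, dense reward provides a signal toward the setpoint at every step and can also penalize aggressive control actions. The setpoint, tolerance and weights are invented for illustration and are not taken from the cited studies.

```python
def sparse_reward(pressure, setpoint=15.5, tol=0.05):
    """Reward only inside the target band: easy to specify, hard to learn from."""
    return 1.0 if abs(pressure - setpoint) < tol else 0.0

def shaped_reward(pressure, action, setpoint=15.5, w_err=1.0, w_act=0.1):
    """Dense reward: tracking error plus a penalty on control effort.

    The error term gives a learning signal at every step, mitigating reward
    sparsity; the action penalty discourages aggressive actuator movements
    that a real plant could not tolerate.
    """
    return -w_err * abs(pressure - setpoint) - w_act * abs(action)
```

In practice the weights themselves become tuning knobs, which is exactly the experience-dependent aspect of reward setting noted below.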
Fig.17. Beam trajectory correction framework based on multi-agent RL (Yang et al., 2022).
Although different DRL algorithms have corresponding application characteristics, it is still difficult to choose the appropriate DRL algorithm for a practical application; the choice needs to be considered in combination with various factors such as the system characteristics, the control problem and the algorithmic advantages. There is no fixed standard for the setting of DRL rewards, which depends largely on experience. It is a great challenge to set the reward function efficiently, fit the optimization objective, and avoid reward sparsity. Summarizing the experience of efficient DRL training setups for problems in the nuclear energy field will be a long exploration process.

(4) Security and Robustness

The ultimate goal of any RL algorithm is real-world deployment. Currently, one of the main difficulties in deploying DRL in nuclear energy systems lies in transferring simulation results to real-world scenarios. Due to the specificity of nuclear energy systems, actual data samples are scarce and the real systems are not suitable for trial-and-error training interactions. How to improve the accuracy of modeling, reduce the gap between the simulation training environment and the real physical environment, and realize the effective migration of knowledge learned in the simulator to real-world deployment are all issues that need to be addressed at present.
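One widely used, generic technique for narrowing this simulation-to-reality gap is domain randomization: uncertain physical parameters of the simulator are re-sampled every training episode so that the learned policy cannot overfit to a single idealized plant model. In the sketch below, the parameter names, their ranges and the agent/make_env interfaces are all hypothetical placeholders, not quantities from the reviewed works.

```python
import random

# Illustrative ranges for uncertain simulator parameters (assumed, not from the paper).
PARAM_RANGES = {
    "heat_transfer_coeff": (0.9, 1.1),   # multiplier on the nominal value
    "sensor_noise_std":    (0.0, 0.02),  # additive measurement noise
    "actuator_delay_s":    (0.0, 0.5),   # control-loop delay
}

def randomized_params(rng=random):
    """Draw one set of simulator parameters for the next training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

def train(agent, make_env, episodes=1000):
    """Train while re-sampling the simulated physics every episode.

    `agent` and `make_env` are placeholder interfaces: make_env must accept the
    randomized parameters and return a Gymnasium-style environment.
    """
    for _ in range(episodes):
        env = make_env(**randomized_params())
        obs, _ = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            agent.learn(obs, reward)
            done = terminated or truncated
```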
results focus on the autonomous control direction of nuclear energy systems at present, there are still too few research results. System-level coordinated control, fault-tolerant control and the optimization of existing control strategies are especially important future research directions. Relying on DRL's powerful search and optimization ability, the design and operation optimization of nuclear energy system equipment remains a key research direction for the future, although at present it is mainly limited to nuclear fuel research. From the perspective of DRL application, solving the leap from simulation to reality and realizing application, verification and optimization in actual systems will be a main research direction. Solving the problems of algorithm integration and of the compatibility and universality of software environments will greatly improve the efficiency of DRL research in the field of nuclear energy, and is a difficult problem to be overcome in the future.
5. Conclusion

As a key frontier technology, DRL has great potential for solving key problems in the field of nuclear energy. However, the literature on DRL in the field of nuclear energy is still very limited. In the present article, we reviewed the available contributions on the topic to provide the reader with a comprehensive state of play of the possibilities of DRL in nuclear energy.

The application scenarios in the field of nuclear engineering are complex and changeable, and the DRL algorithm family is equally complex. According to the principles and characteristics of the algorithms, the best-suited algorithm for an application scenario should be selected from the perspectives of the continuity and dimension of inputs and outputs, the optimization target requirements, the limitations of computing resources, computational stability, and the balance between time and accuracy. At present, DRL research in the field of nuclear energy mainly focuses on autonomous control, operation and maintenance strategy optimization, design optimization, fault diagnosis, nuclear technology applications and so on. In particular, there are many engineering requirements for the autonomous intelligent control of complex nuclear energy systems of different types and under different working conditions. From the research in other energy fields, it can be concluded that DRL is mainly good at solving the optimization and control problems of energy systems, and a lot of research experience has been accumulated there. From the perspective of technical maturity and nuclear energy engineering requirements, future DRL research in the nuclear energy field will mainly focus on two directions: autonomous control and efficiency optimization. The resistance to promoting the application of DRL in the nuclear energy field mainly lies in the compatibility and universality of algorithm integration and software environments, the robustness and fault tolerance of the models, and the migration and deployment of model knowledge from simulation to reality.

As of now, the application ability and robustness of DRL algorithms in nuclear energy systems still need to be explored. With the continuous progress of DRL algorithms and the challenges posed by the nuclear energy industry, DRL will become a research hotspot of AI in the field of nuclear energy in the next few years and bring considerable benefits.

CRediT authorship contribution statement

Yongchao Liu: Writing – original draft, Resources, Investigation, Conceptualization. Bo Wang: Writing – review & editing, Supervision. Sichao Tan: Writing – review & editing, Supervision. Tong Li: Validation, Methodology, Investigation. Wei Lv: Validation, Investigation. Zhenfeng Niu: Methodology, Investigation. Jiangkuan Li: Supervision, Investigation. Puzhen Gao: Writing – review & editing, Supervision. Ruifeng Tian: Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 12405196) and the Nuclear + X Joint Innovation Fund for Ph.D. Students in Nuclear Science of Harbin Engineering University (GXB202306).

Data availability

Data will be made available on request.

References

Abel, D., Reif, E., Littman, M.L., 2017. Improving solar panel efficiency using reinforcement learning. RLDM 2017.
Ai, S., Koe, A.S.V., Huang, T., 2021. Adversarial perturbation in remote sensing image recognition. Appl. Soft Comput. 105, 107252.
Akalin, N., Loutfi, A., 2021. Reinforcement learning approaches in social robotics. Sensors 21 (4), 1292.
Al Smadi, T., Al Issa, H.A., Trad, E., Al Smadi, K.A., 2015. Artificial intelligence for speech recognition based on neural networks. Journal of Signal and Information Processing 6 (02), 66.
Babaeizadeh, M., Frosio, I., Tyree, S., Clemons, J., Kautz, J., 2016. Reinforcement learning through asynchronous advantage actor-critic on a GPU. arXiv preprint arXiv:1611.06256.
Bae, H., Kim, G., Kim, J., Qian, D., Lee, S., 2019. Multi-robot path planning method using reinforcement learning. Appl. Sci. 9 (15), 3057.
Bae, J., Kim, J.M., Lee, S.J., 2023. Deep reinforcement learning for a multi-objective operation in a nuclear power plant. Nucl. Eng. Technol.
Barth-Maron, G., Hoffman, M.W., Budden, D., Dabney, W., Horgan, D., Tb, D., Lillicrap, T., 2018. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
Bellman, R., 1957. A Markovian decision process. Journal of Mathematics and Mechanics 679–684.
Bellman, R., 1966. Dynamic programming. Science 153 (3731), 34–37.
Bruzzone, L., Fanghella, P., Berselli, G., 2020. Reinforcement learning control of an onshore oscillating arm wave energy converter. Ocean Eng. 206, 107346.
Chen, L., Chen, Y., Yao, X., Shan, Y., Chen, L., 2019. In: An Adaptive Path Tracking Controller Based on Reinforcement Learning with Urban Driving Application. IEEE, pp. 2411–2416.
Chen, X., Ray, A., 2022. Deep reinforcement learning control of a boiling water reactor. IEEE Trans. Nucl. Sci. 69 (8), 1820–1832.
Chen, Z., Wang, R., Sun, K., Zhang, T., Du, P., Zhao, Q., 2022. A modified long short-term memory-deep deterministic policy gradient-based scheduling method for active distribution networks. Front. Energy Res. 10, 913130.
Chen, J., Yuan, B., Tomizuka, M., 2019, October. Model-Free Deep Reinforcement Learning for Urban Autonomous Driving. IEEE, pp. 2765–2771.
Chen, J., Yu, T., Pan, Z., Zhang, M., Deng, B., 2023. A scalable graph reinforcement learning algorithm based stochastic dynamic dispatch of power system under high penetration of renewable energy. Int. J. Electr. Power Energy Syst. 152, 109212.
Corea, F., Corea, F., 2019. AI and Speech Recognition. An Introduction to Data: Everything You Need to Know About AI. Big Data and Data Science, pp. 53–56.
Correa-Jullian, C., Droguett, E.L., Cardemil, J.M., 2020. Operation scheduling in a solar thermal system: a reinforcement learning-based framework. Appl. Energy 268, 114943.
Dang, Z., Ishii, M., 2022. Towards stochastic modeling for two-phase flow interfacial area predictions: a physics-informed reinforcement learning approach. Int. J. Heat Mass Transf. 192, 122919.
Degrave, J., Felici, F., Buchli, J., Neunert, M., Tracey, B., Carpanese, F., Riedmiller, M., 2022. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602 (7897), 414–419.
Deng, X., Wang, S., Wang, W., Yu, P., Xiong, X., 2023. Optimal defense strategy for AC/DC hybrid power grid cascading failures based on game theory and deep reinforcement learning. Front. Energy Res. 11, 1167316.
Dong, Z., Huang, X., Dong, Y., Zhang, Z., 2020. Multilayer perception based reinforcement learning supervisory control of energy systems with application to a nuclear steam supply system. Appl. Energy 259, 114193.
El-Sefy, M., Yosri, A., El-Dakhakhni, W., Nagasaki, S., Wiebe, L., 2021. Artificial neural network for predicting nuclear power plant dynamic behaviors. Nucl. Eng. Technol. 53 (10), 3275–3285.
Fortunato, M., Azar, M.G., Piot, B., Menick, J., Osband, I., Graves, A., Legg, S., 2017. Noisy networks for exploration. arXiv preprint arXiv:1706.10295.
Fujimoto, S., Hoof, H., & Meger, D. (2018, July). Addressing function approximation Li, P., Shen, J., Wu, Z., Yin, M., Dong, Y., Han, J., 2023. Optimal real-time Voltage/Var
error in actor-critic methods. In International conference on machine learning (pp. control for distribution network: Droop-control based multi-agent deep
1587-1596). PMLR. reinforcement learning. Int. J. Electr. Power Energy Syst. 153, 109370.
Garnier, P., Viquerat, J., Rabault, J., Larcher, A., Kuhnle, A., Hachem, E., 2021. A review Li, G., Wang, X., Liang, B., Li, X., Zhang, B., Zou, Y., 2016. Modeling and control of
on deep reinforcement learning for fluid mechanics. Comput. Fluids 225, 104973. nuclear reactor cores for electricity generation: A review of advanced technologies.
Gu, X., Radaideh, M.I., Liang, J., 2023. OpenNeoMC: a framework for design Renew. Sustain. Energy Rev. 60, 116–128.
optimization in particle transport simulations based on OpenMC and NEORL. Ann. Li, Z., Wang, J., Ding, M., 2022. A review on optimization methods for nuclear reactor
Nucl. Energy 180, 109450. fuel reloading analysis. Nucl. Eng. Des. 397, 111950.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018, July). Soft actor-critic: Off-policy Li, D., Yang, Q., Ma, L., Peng, Z., Liao, X., 2023. Offense and defence against adversarial
maximum entropy deep reinforcement learning with a stochastic actor. In sample: a reinforcement learning method in energy trading market. Front. Energy
International conference on machine learning (pp. 1861-1870). PMLR. Res. 10, 1071973.
Hao, Z., Di Maio, F., Zio, E., 2023. Monte Carlo tree search-based deep reinforcement Li, C., Yu, R., Yu, W., Wang, T., 2022. Pressure control of Once-through steam generator
learning for flexible operation & maintenance optimization of a nuclear power plant. using Proximal policy optimization algorithm. Ann. Nucl. Energy 175, 109232.
Journal of Safety and Sustainability. Li, C., Yu, R., Yu, W., Wang, T., 2022. Fault-tolerant control system for once-through
Hao, Z., Di Maio, F., Zio, E., 2023a. A sequential decision problem formulation and deep steam generator based on reinforcement learning algorithm. Nucl. Eng. Technol. 54
reinforcement learning solution of the optimization of O&M of cyber-physical energy (9), 3283–3292.
systems (CPESs) for reliable and safe power production and supply. Reliab. Eng. Syst. Li, C., Yu, R., Yu, W., Wang, T., 2023. Reinforcement learning-based control with
Saf. 235, 109231. application to the once-through steam generator system. Nucl. Eng. Technol.
Hao, Z., Di Maio, F., Zio, E., 2023b. Flexible operation and maintenance optimization of Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Wierstra, D., 2015.
aging cyber-physical energy systems by deep reinforcement learning. Nucl. Eng. Continuous control with deep reinforcement learning. arXiv preprint arXiv:
Technol. 1509.02971.
Hema, C., Marquez, F.P.G., 2023. Emotional speech recognition using cnn and deep Lin, X., An, D., Cui, F., Zhang, F., 2023. False data injection attack in smart grid: attack
learning techniques. Appl. Acoust. 211, 109492. model and reinforcement learning-based detection method. Front. Energy Res. 10,
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Silver, D., 1104989.
2018, April. Rainbow: Combining improvements in deep reinforcement learning. In Liu, Y.K., Ai, X., Ayodeji, A., Wu, M.P., Peng, M.J., Xia, H., Yu, W.F., 2019. Enhanced
Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1). graph-based fault diagnostic system for nuclear power plants. Nucl. Sci. Tech. 30,
Hiscox, B., Sobes, V., Popov, E., Archibald, R., Betzler, B., Terrani, K., 2020. AI 1–14.
Optimization of the Reactor Unit Cell to Support TCR Optimization. Oak Ridge Liu, L., Dugas, D., Cesari, G., Siegwart, R., Dubé, R., 2020. In: Robot Navigation in
National Lab.(ORNL), Oak Ridge, TN (United States). Crowded Environments Using Deep Reinforcement Learning. IEEE, pp. 5671–5677.
Howard, R.A., 1960. Dynamic programming and markov processes. Liu, L., Wang, Y., Chi, W., 2020. Image recognition technology based on machine
Hu, J., Niu, H., Carrasco, J., Lennox, B., Arvin, F., 2020. Voronoi-based multi-robot learning. IEEE Access.
autonomous exploration in unknown environments via deep reinforcement learning. Liu, Y., Zhang, L., Xi, L., Sun, Q., Zhu, J., 2021. Automatic generation control for
IEEE Trans. Veh. Technol. 69 (12), 14413–14423. distributed multi-region interconnected power system with function approximation.
Hu, H., Wang, J., Chen, A., Liu, Y., 2023. An autonomous radiation source detection Front. Energy Res. 9, 700069.
policy based on deep reinforcement learning with generalized ability in unknown Long, P., Fan, T., Liao, X., Liu, W., Zhang, H., Pan, J., 2018. In: Towards Optimally
environments. Nucl. Eng. Technol. 55 (1), 285–294. Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning.
Jaderberg, M., Czarnecki, W.M., Dunning, I., Marris, L., Lever, G., Castaneda, A.G., IEEE, pp. 6252–6259.
Graepel, T., 2019. Human-level performance in 3D multiplayer games with Mattioni, A., Zoboli, S., Mavkov, B., Astolfi, D., Andrieu, V., Witrant, E., Prieur, C., 2023.
population-based reinforcement learning. Science 364 (6443), 859–865. Enhancing deep reinforcement learning with integral action to control tokamak
Jalali, S.M.J., Khodayar, M., Ahmadian, S., Shafie-Khah, M., Khosravi, A., Islam, S.M.S., safety factor. Fusion Eng. Des. 196, 114008.
Catalão, J.P., 2021. In: A New Ensemble Reinforcement Learning Strategy for Solar Mazare, M., 2024. Adaptive optimal secure wind power generation control for variable
Irradiance Forecasting Using Deep Optimized Convolutional Neural Network speed wind turbine systems via reinforcement learning. Appl. Energy 353, 122034.
Models. IEEE, pp. 1–6. Mc Leod, J.E.N., Rivera, S.S., 2023. Reliability Optimization of New Generation Nuclear
Jaritz, M., De Charette, R., Toromanoff, M., Perot, E., Nashashibi, F., 2018. In: End-to- Power Plants Using Artificial Intelligence. In: Reliability Engineering and
End Race Driving with Deep Reinforcement Learning. IEEE, pp. 2070–2075. Computational Intelligence for Complex Systems: Design, Analysis and Evaluation.
Jasmin, E.A., Ahamed, T.I., Raj, V.J., 2011. Reinforcement learning approaches to Cham, Springer Nature Switzerland, pp. 159–173.
economic dispatch problem. Int. J. Electr. Power Energy Syst. 33 (4), 836–845. Meng, F., Bai, Y., Jin, J., 2021. An advanced real-time dispatching strategy for a
Jiang, B. T., Zhou, J., & Huang, X. B. (2020, August). Artificial Neural Networks in distributed energy system based on the reinforcement learning algorithm. Renew.
Condition Monitoring and Fault Diagnosis of Nuclear Power Plants: A Concise Energy 178, 13–24.
Review. In International Conference on Nuclear Engineering (Vol. 83778, p. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... & Kavukcuoglu,
V002T08A032). American Society of Mechanical Engineers. K. (2016, June). Asynchronous methods for deep reinforcement learning. In
Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Levine, S., 2019. In: International conference on machine learning (pp. 1928-1937). PMLR.
Residual Reinforcement Learning for Robot Control. IEEE, pp. 6023–6029. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G.,
Juhn, Y., Liu, H., 2020. Artificial intelligence approaches using natural language Hassabis, D., 2015. Human-level control through deep reinforcement learning.
processing to advance EHR-based clinical research. J. Allergy Clin. Immunol. 145 Nature 518 (7540), 529–533.
(2), 463–469. Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Silver, D.,
Khabbaz, A.H., Pouyan, A.A., Fateh, M., Abolghasemi, V., 2017. In: An Adaptive RL 2015. Massively parallel methods for deep reinforcement learning. arXiv preprint
Based Fuzzy Game for Autistic Children. IEEE, pp. 47–52. arXiv:1507.04296.
Kim, J.M., Bae, J., Lee, S.J., 2023. Strategy to coordinate actions through a plant Oh, E., Wang, H., 2020. Reinforcement-learning-based energy storage system operation
parameter prediction model during startup operation of a nuclear power plant. Nucl. strategies to manage wind power forecast uncertainty. IEEE Access 8, 20965–20976.
Eng. Technol. 55 (3), 839–849. Okamoto, W., Kawamoto, K., 2020. In: Reinforcement Learning with Randomized
Kim, H., Jo, Y., Lee, D., 2021. Feasibility study on AI-based prediction for CRUD induced Physical Parameters for Fault-Tolerant Robots. IEEE, pp. 1–4.
power shift in PWRs. Proceedings of the Transactions of the Korean Nuclear Society Oliveira, M.V.D., Almeida, J.C.S.D., 2013. Modeling and control of a nuclear power plant
Virtual Autumn Meeting, Korea (online). using AI techniques.
Lalwani, T., Bhalotia, S., Pal, A., Rathod, V., Bisen, S., 2018. Implementation of a chatbot Park, J., Kim, T., Seong, S., 2020. Providing support to operators for monitoring safety
system using AI and NLP. International Journal of Innovative Research in Computer functions using reinforcement learning. Prog. Nucl. Energy 118, 103123.
Science & Technology (IJIRCST) 6 (3). Park, J., Kim, T., Seong, S., Koo, S., 2022. Control automation in the heat-up mode of a
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444. nuclear power plant using reinforcement learning. Prog. Nucl. Energy 145, 104107.
Lee, D., Arigi, A.M., Kim, J., 2020. Algorithm for autonomous power-increase operation Perrusquía, A., Yu, W., Li, X., 2021. Multi-agent reinforcement learning for redundant
using deep reinforcement learning and a rule-based system. IEEE Access 8, robot control in task-space. Int. J. Mach. Learn. Cybern. 12, 231–241.
196727–196746. Pierart, F., Manríquez, C., Campos, P., 2021. In: Reinforcement Learning Algorithms
Lee, D., Kim, H., Choi, Y., Kim, J., 2021. In: Development of Autonomous Operation Applied to Reactive and Resistive Control of a Wave Energy Converter. IEEE,
Agent for Normal and Emergency Situations in Nuclear Power Plants. IEEE, pp. 1–6.
pp. 240–247. Pylorof, D., Garcia, H.E., 2022. A reinforcement learning approach to long-horizon
Lee, D., Koo, S., Jang, I., Kim, J., 2022. Comparison of deep reinforcement learning and operations, health, and maintenance supervisory control of advanced energy
PID controllers for automatic cold shutdown operation. Energies 15 (8), 2834. systems. Eng. Appl. Artif. Intel. 116, 105454.
Lei, J., Ren, C., Li, W., Fu, L., Li, Z., Ni, Z., Yu, T., 2022. Prediction of crucial nuclear Qi, B., Liang, J., Tong, J., 2023. Fault diagnosis techniques for nuclear power plants: a
power plant parameters using long short-term memory neural networks. Int. J. review from the artificial intelligence perspective. Energies 16 (4), 1850.
Energy Res. 46 (15), 21467–21479. Qi, X., Luo, Y., Wu, G., Boriboonsomsin, K., Barth, M., 2019. Deep reinforcement learning
Leo, R., Milton, R.S., Sibi, S., 2014. In: Reinforcement Learning for Optimal Energy enabled self-learning control for energy efficient driving. Transportation Research
Management of a Solar Microgrid. IEEE, pp. 183–188. Part C: Emerging Technologies 99, 67–81.
Li, J., Liu, Y., Qing, X., Xiao, K., Zhang, Y., Yang, P., Yang, Y.M. 2021, November. The Qian, G., Liu, J., 2022. Development of deep reinforcement learning-based fault
application of Deep Reinforcement Learning in Coordinated Control of Nuclear diagnosis method for rotating machinery in nuclear power plants. Prog. Nucl. Energy
Reactors. In Journal of Physics: Conference Series (Vol. 2113, No. 1, p. 012030). IOP 152, 104401.
Publishing. Qin, P., Ye, J., Hu, Q., Song, P., Kang, P., 2022. Deep reinforcement learning based power
system optimal carbon emission flow. Front. Energy Res. 10, 1017128.
Qureshi, A.H., Nakamura, Y., Yoshikawa, Y., Ishiguro, H., 2016. In: Robot Gains Social Wei, C., Zhang, Z., Qiao, W., Qu, L., 2015. Reinforcement-learning-based intelligent
Intelligence through Multimodal Deep Reinforcement Learning. IEEE, pp. 745–751. maximum power point tracking control for wind energy conversion systems. IEEE
Radaideh, M.I., Shirvan, K., 2021. Rule-based reinforcement learning methodology to Trans. Ind. Electron. 62 (10), 6360–6370.
inform evolutionary algorithms for constrained optimization of engineering Wei, C., Zhang, Z., Qiao, W., Qu, L., 2016. An adaptive network-based reinforcement
applications. Knowl.-Based Syst. 217, 106836. learning method for MPPT control of PMSG wind energy conversion systems. IEEE
Radaideh, M.I., Du, K., Seurin, P., Seyler, D., Gu, X., Wang, H., Shirvan, K., 2023. NEORL: Trans. Power Electron. 31 (11), 7837–7848.
Neuroevolution optimization with reinforcement learning—applications to carbon- Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., Ba, J., 2017. Scalable trust-region method
free energy systems. Nucl. Eng. Des. 112423. for deep reinforcement learning using kronecker-factored approximation. Advances
Radaideh, M.I., Shirvan, K., 2022. PESA: prioritized experience replay for parallel hybrid in neural information processing systems, 30.
evolutionary and swarm algorithms-application to nuclear fuel. Nucl. Eng. Technol. Wu, Z., Yin, Y., Liu, J., Zhang, D., Chen, J., Jiang, W., 2023. A novel path planning
54 (10), 3864–3877. approach for mobile robot in radioactive environment based on improved deep Q
Radaideh, M.I., Wolverton, I., Joseph, J., Tusar, J.J., Otgonbaatar, U., Roy, N., network algorithm. Symmetry 15 (11), 2048.
Shirvan, K., 2021. Physics-informed reinforcement learning optimization of nuclear Xie, J., Dong, H., Zhao, X., 2023. Data-driven torque and pitch control of wind turbines
assembly design. Nucl. Eng. Des. 372, 110966. via reinforcement learning. Renew. Energy.
Saeed, H.A., Peng, M., Wang, H., Rasool, A., 2023. Autonomous control model for Xie, Y., Zhao, X., Luo, M., 2022. An active-controlled heaving plate breakwater trained
emergency operation of small modular reactor. Ann. Nucl. Energy 190, 109874. by an intelligent framework based on deep reinforcement learning. Ocean Eng. 244,
Saenz-Aguirre, A., Zulueta, E., Fernandez-Gamiz, U., Lozano, J., Lopez-Guede, J.M., 110357.
2019. Artificial neural network based reinforcement learning for wind turbine yaw Xu, P., Zhang, J., Lu, J., Zhang, H., Gao, T., Chen, S., 2022. A prior knowledge-embedded
control. Energies 12 (3), 436. reinforcement learning method for real-time active power corrective control in
Salvato, E., Fenu, G., Medvet, E., Pellegrino, F.A., 2021. Crossing the reality gap: a survey complex power systems. Front. Energy Res. 10, 1009545.
on sim-to-real transferability of robot controllers in reinforcement learning. IEEE Yan, Z., Tan, J., Liang, B., Liu, H., Yang, J., 2022. In: Active Fault-Tolerant Control
Access 9, 153171–153187. Integrated with Reinforcement Learning Application to Robotic Manipulator. IEEE,
Sanayha, M., Vateekul, P., 2022. Model-based deep reinforcement learning for wind pp. 2656–2662.
energy bidding. Int. J. Electr. Power Energy Syst. 136, 107625. Yang, X., Chen, Y., Wang, J., Zheng, H., Liu, H., Zhou, D., Zhou, Q., 2022. Online beam
Sarkar, S., Gundecha, V., Shmakov, A., Ghorbanpour, S., Babu, A.R., Faraboschi, P., orbit correction of MEBT in CiADS based on multi-agent reinforcement learning
Fievez, J., 2022, June. Multi-agent reinforcement learning controller to maximize algorithm. Ann. Nucl. Energy 179, 109346.
energy efficiency for multi-generator industrial wave energy converter. In Yang, Y., Tu, F., Huang, S., Tu, Y., Liu, T. Research on CNN-LSTM DC power system fault
Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 11, pp. diagnosis and differential protection strategy based on reinforcement learning.
12135-12144). frontiers in energy research, 11, 1258549.
Schaul, T., Quan, J., Antonoglou, I., Silver, D., 2015. Prioritized experience replay. arXiv Yang, T., Zhao, L., Li, W., Zomaya, A.Y., 2021. Dynamic energy dispatch strategy for
preprint arXiv:1511.05952. integrated energy system based on improved deep reinforcement learning. Energy
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 235, 121377.
85–117. Yi, Z., Luo, Y., Westover, T., Katikaneni, S., Ponkiya, B., Sah, S., Khanna, R., 2022. Deep
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region reinforcement learning based optimization for a tightly coupled nuclear renewable
policy optimization. In International conference on machine learning (pp. 1889- integrated energy system. Appl. Energy 328, 120113.
1897). PMLR. Yin, X., Lei, M., 2022. Deep reinforcement learning based coastal seawater desalination
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy via a pitching paddle wave energy converter. Desalination 543, 115986.
optimization algorithms. arXiv preprint arXiv:1707.06347. Yuan, G., Xie, F., 2023. Digital Twin-Based economic assessment of solar energy in smart
Seurin, P., Shirvan, K., 2023. Assessment of Reinforcement Learning Algorithms for microgrids using reinforcement learning technique. Sol. Energy 250, 398–408.
Nuclear Power Plant Fuel Optimization. arXiv preprint arXiv:2305.05812. Zaib, M., Sheng, Q.Z., Emma Zhang, W., 2020. February). A short survey of pre-trained
Shah, S.I.H., Naeem, M., Paragliola, G., Coronato, A., Pechenizkiy, M., 2023. An AI- language models for conversational ai-a new age in nlp. In: Proceedings of the
empowered infrastructure for risk prevention during medical examination. Expert Australasian Computer Science Week Multiconference, pp. 1–4.
Syst. Appl. 225, 120048. Zhang, B., Cao, D., Hu, W., Ghias, A.M., Chen, Z., 2024. Physics-Informed Multi-Agent
Shan, Y., Zheng, B., Chen, L., Chen, L., Chen, D., 2020. A reinforcement learning-based deep reinforcement learning enabled distributed voltage control for active
adaptive path tracking approach for autonomous driving. IEEE Trans. Veh. Technol. distribution network using PV inverters. Int. J. Electr. Power Energy Syst. 155,
69 (10), 10581–10595. 109641.
Shresthamali, S., Kondo, M., Nakamura, H., 2017. Adaptive power management in solar Zhang, T., Dong, Z., Huang, X., 2024. Multi-objective optimization of thermal power and
energy harvesting sensor node using reinforcement learning. ACM Transactions on outlet steam temperature for a nuclear steam supply system with deep reinforcement
Embedded Computing Systems (TECS) 16 (5s), 1–21. learning. Energy 286, 129526.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M., 2014. In: Zhang, G., Hu, W., Cao, D., Huang, Q., Chen, Z., Blaabjerg, F., 2021. A novel deep
Deterministic Policy Gradient Algorithms. PMLR, pp. 387–395. reinforcement learning enabled sparsity promoting adaptive control method to
Sobes, V., Hiscox, B., Popov, E., Archibald, R., Hauck, C., Betzler, B., Terrani, K., 2021. improve the stability of power systems with wind energy penetration. Renew.
AI-based design of a nuclear reactor core. Sci. Rep. 11 (1), 19646. Energy 178, 363–376.
Song, X., Sun, J., Tan, S., Ling, R., Chai, Y., Guerrero, J.M., 2024. Cooperative grid Zhang, S., Lan, F., Xue, B., Chen, Q., Qiu, X., 2023. A novel automatic generation control
frequency control under asymmetric V2G capacity via switched integral method with hybrid sampling for the multi-area interconnected grid. Front. Energy
reinforcement learning. Int. J. Electr. Power Energy Syst. 155, 109679. Res. 11, 1280724.
Sutton, R.S., 1988. Learning to predict by the methods of temporal differences. Mach. Zhang, F., Yang, Q., Li, D., 2023. A deep reinforcement learning-based bidding strategy
Learn. 3, 9–44. for participants in a peer-to-peer energy trading scenario. Front. Energy Res. 10,
Sutton, R.S., Barto, A.G., 1999. Reinforcement learning: an introduction. Robotica 17 (2), 1017438.
229–235. Zhao, Y., Smidts, C., 2022. Reinforcement learning for adaptive maintenance policy
Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: An introduction [M]. MIT press. optimization under imperfect knowledge of the system degradation model and
Tai, L., Liu, M., 2016. Towards cognitive exploration through deep reinforcement partial observability of system states. Reliab. Eng. Syst. Saf. 224, 108541.
learning for mobile robots. arXiv preprint arXiv:1610.01733. Zhong, W., Wang, M., Wei, Q., Lu, J., 2022. A new neuro-optimal nonlinear tracking
Thrun, S., Littman, M.L., 2000. Reinforcement learning: an introduction. AI Mag. 21 (1), control method via integral reinforcement learning with applications to nuclear
103. systems. Neurocomputing 483, 361–369.
Tian, Y., 2020. Artificial intelligence image recognition method based on convolutional Zhong, X., Zhang, L., Ban, H., 2023. Deep reinforcement learning for class imbalance
neural network algorithm. IEEE Access 8, 125731–125744. fault diagnosis of equipment in nuclear power plants. Ann. Nucl. Energy 184,
Tian, Y., Cao, X., Huang, K., et al., 2021. Learning to drive like human beings: a method 109685.
based on deep reinforcement learning[J]. IEEE Trans. Intell. Transp. Syst. 23 (7), Zhou, G., Tan, D., 2023. Review of nuclear power plant control research: neural network-
6357–6367. based methods. Ann. Nucl. Energy 181, 109513.
Van Hasselt, H., Guez, A., Silver, D., 2016, March. Deep reinforcement learning with Zhu, J., Hu, W., Xu, X., Liu, H., Pan, L., Fan, H., Chen, Z., 2022. Optimal scheduling of a
double q-learning. In Proceedings of the AAAI conference on artificial intelligence wind energy dominated distribution network via a deep reinforcement learning
(Vol. 30, No. 1). approach. Renew. Energy 201, 792–801.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., Freitas, N., 2016. In: Dueling Zou, S., Zhou, X., Khan, I., Weaver, W.W., Rahman, S., 2022. Optimization of the
Network Architectures for Deep Reinforcement Learning. PMLR, pp. 1995–2003. electricity generation of a wave energy converter using deep reinforcement learning.
Watkins, C.J., Dayan, P., 1992. Q-Learning. Machine Learning 8, 279–292. Ocean Eng. 244, 110363.