0% found this document useful (0 votes)
34 views

Reinforcement Learning-Based Optimal Scheduling Model of Battery Energy Storage System at The Building Level

Uploaded by

nakranitirth7
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

Reinforcement Learning-Based Optimal Scheduling Model of Battery Energy Storage System at The Building Level

Uploaded by

nakranitirth7
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Renewable and Sustainable Energy Reviews 190 (2024) 114054

Contents lists available at ScienceDirect

Renewable and Sustainable Energy Reviews


journal homepage: www.elsevier.com/locate/rser

Reinforcement learning-based optimal scheduling model of battery energy


storage system at the building level
Hyuna Kang, Seunghoon Jung, Hakpyeong Kim, Jaewon Jeoung, Taehoon Hong *
Department of Architecture and Architectural Engineering, Yonsei University, Seoul, 03722, Republic of Korea

A R T I C L E I N F O A B S T R A C T

Keywords: Installing the battery energy storage system (BESS) and optimizing its schedule to effectively address the
Battery energy storage system intermittency and volatility of photovoltaic (PV) systems has emerged as a critical research challenge. None­
PV system theless, some existing studies still have limitations in terms of the efficiency of the BESS scheduling due to the
Reinforcement learning
lack of comprehensive consideration of diverse user objectives. As a response to this gap, this study aimed to
Optimal scheduling
develop a reinforcement learning (RL)-based optimal scheduling model to better reflect the continuous behaviors
in the complex real world. To this end, focused on residential buildings connected to the grid and equipped with
a BESS and PV system, its optimal scheduling models were developed using four algorithms from among the
various RL techniques according to training methods. The results of the case study showed that the developed RL-
based optimal scheduling model using Proximal Policy Optimization (PPO) can be applied to effectively operate
the BESS with a PV system, considering possible uncertainties in the real world. The case study demonstrated the
effectiveness and feasibility of the developed RL-based optimal scheduling model. Compared to other algorithms,
the PPO-based RL model has better decision-making for optimal BESS scheduling strategies to maximize their
self-sufficiency rate and economic profits by coping with changing variables in the real world. Therefore, the RL-
based BESS scheduling model will offer an optimal solution, specifically tailored for use within a virtual power
plant, where numerous buildings continuously share electricity.

intermittent properties of solar energy, which are relatively inflexible


1. Introduction [5,6]. The other is uncertainties of load patterns, load profiles, market
price, and so on [7]. Addressing these challenges is imperative for
Although many nations are seeking to increase their renewable en­ enhancing the economic viability and encouraging proactive BESS
ergy supplies so as to achieve carbon neutrality, the instability of adoption by energy prosumers. As a result, developing an intelligent
renewable energy supplies is becoming an issue due to the unprece­ operation and scheduling model that accounts for both intermittency
dented abnormal climate [1]. Moreover, as the energy consumption of and uncertainties within BESS with PV systems is a pressing challenge
residential buildings rises alongside increases in energy prices acceler­ for researchers.
ating, an increase in electricity rates has become inevitable [2]. There­ There are many existing studies regarding the optimal operating
fore, the importance of a battery energy storage system (BESS) is scheduling model of a BESS with a PV system for practical applications.
emerging as a complementary solution to address the volatility and First, the self-sufficiency rate has become one of the main objectives of
intermittency of renewable energy systems, efficiently storing surplus implementing net-zero energy building [8,9]. Matching battery capacity
electric power in a battery and discharging it when power is needed [3]. and building load profiles is significant for maximizing the
In particular, for a photovoltaic (PV) system, one in the spotlight as a self-consumption rate in buildings [10,11]. Second, existing models
distributed generation (DG), it is necessary to install the BESS together considering economic profit pursued maximizing the economic profits of
[4]. prosumers generated through energy trading [12–15]. By adhering to
Although various advantages of the BESS are known, BESS with a PV the optimal operating strategy, prosumers can exchange the surplus
system in buildings has two obstacles to obtaining its ultimate goals: power in their BESS for monetary rewards when the power generation is
decreasing peak load while increasing self-sufficiency. One is the higher than the load demand [16–18]. Third, several existing models

* Corresponding author.
E-mail addresses: [email protected] (H. Kang), [email protected] (S. Jung), [email protected] (H. Kim), [email protected] (J. Jeoung),
[email protected] (T. Hong).

https://doi.org/10.1016/j.rser.2023.114054
Received 17 October 2022; Received in revised form 23 October 2023; Accepted 1 November 2023
Available online 16 November 2023
1364-0321/© 2023 Elsevier Ltd. All rights reserved.
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

Abbreviation MDP Markov Decision Process


MILP Mixed-Integer Linear Programming
A2C Advantage Actor-Critic PPO Proximal Policy Optimization
BESS Battery Energy Storage System PSO Particle Swarm Optimization
DG Distributed Generation PV Photovoltaic
DP Dynamic Programming REC Renewable Energy Certificate
DQN Deep Q-Network RL Reinforcement Learning
EMS Energy Management System SAC Soft Actor-Critic
EV Electric Vehicles SMP System marginal Price
GA Genetic Algorithm SOC State of Charge
GAE Generalized Advantage Estimation TD3 Twin Delayed Deep Deterministic Policy Gradient
KEPCO Korea Electric Power Corporation VPP Virtual Power Plant
KPX Korea Power Exchange

provide BESS control strategies for peak shaving [19,20]. To simulate and surging peak demand, the developed RL-based optimal sched­
the complicated components and constraints in the power grid system, uling model for BESS with PV systems emerges as a formidable
the existing model used a mathematical approach depending on various solution.
objectives [21–23]. When the target systems are simple, mathematical • The meticulously developed model, tailored to South Korean resi­
optimization models can manage optimal BESS scheduling by maxi­ dential settings and aligned with energy policies and load patterns, is
mizing or minimizing the objective functions [24,25]. When all func­ primed for commercial success. It can be commercially available in
tions are linear or convex in mathematical models, deterministic the near future since it is trained and applied to residential buildings
optimization algorithms such as mixed-integer linear programming based on energy policies and load patterns in South Korea, under the
(MILP) can be used [12,26,27]. When there are nonlinear functions and supporting scheme for BESS users (e.g., Renewable Energy Certifi­
uncertainties in mathematical models, metaheuristic algorithms such as cate (REC) weight). Ultimately, it can play a pivotal role in the smart
genetic algorithms (GA) are appropriate [28,29]. Moreover, dynamic grid balancing power supply and demand by proposing a profitable
programming (DP) coupled with the Markov decision process (MDP) and rational operation strategy for carbon neutrality by 2050 in
improved the model performance [30]. Recently, some existing studies South Korea [35–38]. The developed model has the potential to be a
developed a decision-making model using reinforcement learning (RL) pivotal player in achieving a harmonious power supply and demand
which can learn by itself in certain environments and quickly determine balance in the smart grid.
reliable results while reducing the complexity of models [31–34].
Based on the literature review, existing studies on the optimal The remaining structure of this paper is as follows: Chapter 2 de­
scheduling of energy systems have the following limitations. First, most scribes the problem definition and modeling methodology of the BESS
scheduling models focused on one objective, and only a few studies with a PV system based on RL. Chapter 3 describes a case study and
addressed multi-objective problems. In the real world, when various model validation. In Chapter 4, the performance of developed models is
stakeholders want to install the BESS with a PV system in buildings, they evaluated by applying them to a target building and comparative anal­
will evaluate the performance differently depending on their interests. ysis with the existing model as well as feasibility analysis of the optimal
Second, many existing scheduling models use rule-based, stochastic, and scheduling model are conducted. Finally, the conclusions provide a
deterministic algorithms, though these can reduce the optimization summary of the results and contributions in Chapter 5.
performance when uncertainties or complexity in the real world arise
and should be reflected. To provide an optimal solution even in a 2. Materials and methods
multidimensional energy community, intelligent models that can adapt
to the changing environment and learn independently must be applied 2.1. Problem definition
to the energy system.
Therefore, this study aimed to develop an RL-based optimal sched­ Although the purpose of potential users, various benefits, different
uling model to better reflect the continuous behaviors in the complex components and environmental conditions should all be defined for the
real world. To this end, a case study was performed using four distinct optimal scheduling of BESS with a PV system, it is difficult to clearly
algorithms from a spectrum of RL techniques. Based on the aforemen­ define them. If the BESS with PV system is used, economic profit can be
tioned research gaps, the major contributions of this study are as obtained mainly through arbitrage trading, and the self-sufficiency rate
follows: of residential buildings can be increased by self-using the power gen­
eration. Furthermore, peak loads can be reduced by storing surplus
• To address trade-offs depending on various stakeholders and objec­ power in the battery and using it during times of high-power load in the
tives, this study conducted multi-objective optimization for the microgrid. This study proposed a framework for developing the optimal
scheduling of BESS with a PV system by maximizing the self- scheduling model using an RL algorithm to determine the more realistic
sufficiency rate with economic profit as well as reducing peak scheduling solution considering highly volatile variables such as the
load. It seeks to maximize the self-sufficiency rate while simulta­ electric load, power generation, or BESS charging state.
neously optimizing economic profitability and reducing peak load. Fig. 1 shows a solution framework for developing an RL-based
This multi-dimensional approach sets our research apart from the optimal scheduling model of BESS with a PV system. In the RL envi­
conventional. ronment, the power balance of the system was configured in four com­
• RL-based optimal scheduling model of BESS with a PV system can be ponents: (i) BESS, (ii) PV system, (iii) residential building, and (iv)
effective in a situation in which both the volatility of the power external grid. Before training in the RL algorithm, the power flow in the
supply and peak demand are rapidly increasing, it can be helpful to energy system is defined through the mathematical model. To balance
both BESS owners and operators. As we confront a rapidly evolving the power system, the sum of all power flow is defined as 0, and then a
landscape characterized by heightened volatility in power supply database such as PV generation, electricity load, electric price, and BESS

2
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

Fig. 1. Framework for RL-based optimal scheduling model of BESS with PV system.

specification was established to formulate the mathematical model. In the power flows and components were expressed as a mathematical
the RL agent algorithm, the state means the current situation of the model, and the system configuration and energy exchanges are as fol­
power system (i.e., day and hour, electricity generation and load and lows (refer to Fig. 2).
energy state of BESS, and electricity sale price and purchase price ac­ Based on the defined power flows, the power balance of the electric
cording to timestep t). In the case of residential buildings in South Korea, system, which is the power flows between system components, can be
since the purchase price of electricity is higher than the sale price, defined as Eq. (1). The electricity transaction volume with the external
increasing the SSR is advantageous in terms of cost as less electricity is grid (PEG
t ) can be calculated in sales and purchases as in Eq. (2), power
purchased. Since the optimization results can be overestimated when flow should not occur simultaneously (refer to Eq. (3)). Moreover, the
the SSR and economic profit are separately considered, this study maximum value of PEG is determined according to the contract power
t
assumed that increasing the SSR also increases economic profit. As a (refer to Eq. (4)).
result, the reward is maximizing the SSR and minimizing peak load. RL
algorithm keeps training until the reward no longer increases. Then a PPV EG BESS
t + Pt + Pt − PLoad
t =0 (1)
trained RL-based optimal scheduling model was applied to a case study
in South Korea to verify the scheduling model. The performance of the PEG EG2MG
t = Pt − PMG2EG
t (2)
developed model was assessed by comparing other optimization
methods. Ultimately, by reflecting the uncertainty of future scenarios PEG2MG
t × PMG2EG
t =0 (3)
affecting the BESS scheduling in the real world, the developed RL-based
model is expected to adapt to various environments by interacting with PEG
t ≤ Pcontract (4)
the environment.
where PLoad
t is the electricity load of the end-use at timestep t, PPV
t is the
2.2. Modeling of BESS with PV system power generation from PV system at timestep t, PBESS t is the net power
exchange of the battery at timestep, PEG2MG
t and PMG2EG
t are the electric
2.2.1. System modeling power imported from external grid to microgrid and exported from
To determine the optimal scheduling of BESS with PV system, the microgrid to external grid at timestep, PEGt is the net power exchange of
power flows of the grid system should be defined. The system configu­ the external grid at timestep, and Pcontract is the contracted power at
ration depends on the components of grid and power flows. In this study,

3
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

Fig. 2. System configuration.

timestep (△t = 1, the number of timesteps T = 24). Ecapacity = (1− Aging (d)) • Ecapacity (12)
nominal

2.2.2. BESS modeling where, Pcapacity is the power capacity of BESS, Et is the energy state of
A mathematical model of BESS was constructed. The power flow is BESS at timestep t, ηsd is self-discharge rate, ηc and ηd are the charging-
expressed as Eq. (5) and charging and discharging at the same time discharging efficiency, Pc,t and Pd,t are power charged to and discharged
should be prevented (refer to Eq. (6)). The electric power charging and from BESS at timestep t, Ecapacity is the energy capacity of BESS, SOCmin
discharging of the BESS are limited by the power capacity of the BESS and SOCmax is the minimum and maximum SOC of BESS, Aging (d) is
(refer to Eq. (7)). The energy state of the battery at a given time step can cyclic aging of BESS in day d, Lifetime cycle is the lifetime cycle of BESS,
be expressed as Eq. (8). The electrical energy state of the BESS is limited and Ecapacity nominal is the nominal capacity of BESS in the initial state.
to a certain state of charge (SOC) to prevent excessive aging and power
Moreover, binary variables in this study can be used to convert
failure of the BESS (refer to Eqs. (9) and (10)). The energy throughput
nonlinear equations or inequalities into linear expressions. In particular,
model among the many different BESS cyclic aging models was used. In
since the multiplication of a binary variable and a non-negative
the energy throughput model, the capacity of the BESS is reduced by the
continuous variable can be represented as a linear expression, it is
ratio of the current amount of the BESS charge and discharge to the
possible to linearize constraints in the defined optimization problem,
maximum amount of the BESS charge and discharge during its lifetime,
including the multiplication of continuous variables. In Chapter 2.2.1,
which can be obtained by multiplying the nominal capacity of the BESS
Eq. (3) to prevent power transmission to and from the external grid from
by the maximum cycle of the BESS (refer to Eqs. (11) and (12)).
occurring at the same time, and Eq. (6) to avoid power charge and
PBESS = Pd,t − Pc,t (5) discharge of the battery from occurring at the same time are expressed as
t
multiplications of two power flows which are continuous variables.
Pd,t × Pc,t = 0 (6) Therefore, those equations were converted to implement linear expres­
sions with binary variables.
PBESS
t ≤ Pcapacity (7) First, to prevent the simultaneous occurrence of power transmission
to and from the external grid, binary variables were applied to represent
{ }
Et+1 = (1 − ηsd )Et + (1 − ηc )Pc,t + (1 − ηd )Pd,t Δt (8) Eqs. (3) and (6) as linear expressions. By dividing power transmissions to
the grid and from the grid, Eq. (4) to limit the maximum value of power
Et ≥ Ecapacity • SOCmin (9) transmission can be converted to Eqs. (13) and (14) in which the value
obtained by multiplying the non-negative variable Pcontract and the bi­
Et ≤ Ecapacity • SOCmax (10) nary variables that indicate the presence or absence of flow (i.e., bMG2EG
t
∑T ( ) and bEG2MG
t ) is limited to the maximum value of each flow. And Eq. (15)
Aging (d) = t=0 Pd,t + Pc,t •Δt
(11) where the sum of the two binary variables is set to 1 was added to
Ecapacity • Lifetime cycle•2 prevent the power flows from occurring all at once.

4
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

PMG2EG
t ≤ bMG2EG
t × Pcontract (13) 2.3. RL model structure

PEG2MG
t ≤ bEG2MG
t × Pcontract (14) 2.3.1. Markov Decision Process (MDP)
Existing mathematical models such as Mixed-Integer Linear Pro­
bMG2EG
t + bEG2MG
t =1 (15) gramming (MILP) that require definitions for energy systems operate
based on historical data, so it is difficult to reflect the changing real
where PEG2MG
t and PMG2EG
t are the power flow imported from external world. On the other hand, RL-based algorithms can be used for real-time
grid to microgrid and exported from microgrid to external grid at energy management of BESS without defining the whole energy system.
timestep, bMG2EG and bEG2MG are the binary variable of the electric power As a form of machine learning, the RL is a method in which the agent
t t
purchased from the electricity market and sold in the electricity market defined in an environment recognizes the current state and selects the
at timestep t, and Pcontract is the contract power at timestep. best action to maximize the future reward while interacting with the
Next, to prevent the simultaneous occurrence of power charge and surrounding environment [31–34]. The RL, which models states and
discharge of the battery, binary variables were applied to represent Eqs. actions using deep neural networks, can efficiently handle
(6) and (7) as linear expressions. By dividing the power charge and high-dimensional state spaces and complex action spaces. Therefore, in
discharge, Eq. (7) to limit the maximum value of power flow of the order to reflect the uncertainty of future scenarios affecting the BESS
battery can be converted to Eqs. (16) and (17) in which the value ob­ scheduling in the real world, this study used the RL model, which is
tained by multiplying the non-negative variable Pcapacity and the binary expected to adapt to various environments by interacting with the
variables that indicate the presence or absence of flow (i.e., bc,t and bd,t ) environment.
is limited to the maximum value of each flow. In addition, Eq. (18) The RL model can be designed using the Markov Decision Process
where the sum of the two binary variables is set to 1 was added to (MDP), which consists of the state, action, transition probability, and
prevent the power charge and discharge of the battery from occurring all reward (S, A, P, R) [39–41]. Therefore, the MDP was defined to solve the
at once. optimization problem (i.e., real-time scheduling of the BESS using
multi-objective optimization).
Pc,t ≤ bc,t × Pcapacity (16) First, state (S) is defined as the current situation of an environment
observed by the agent, and in RL, the state (st ) is moved into the next
Pd,t ≤ bd,t × Pcapacity (17) state (st+1 ) according to the action of the agent. In this study, power
generation, electricity load, SOC of BESS, month, and hour were defined
bc,t + bd,t = 1 (18) as states to consider factors that have an essential influence on deter­
mining optimal scheduling.
where Pc,t and Pd,t are the electric power charged to and discharged from
Second, action (A) in MDP refers to a possible action that an agent
BESS at timestep t, Pcap is the power capacity of BESS, and bc,t and bd,t are
can take in a specific state. In the optimal scheduling of BESS with a PV
the binary variable of the electric power charged to BESS and discharged
system, action is defined as determining power flows (i.e., PEG BESS
t , Pt ) as
from BESS at timestep t.
decision variables (i.e., continuous variables) of the objective function.
Depending on the current energy state of BESS, constraints are needed
2.2.3. Objective function
for action space. By referring to previous studies, the definition was
BESS are devices with various uses, including peak shaving, voltage
constructed as follows. When the action is in a charged state, it is defined
support, frequency regulation, mitigation of transmission congestion,
as being charged up to Ecapacity • SOCmax in case Et+1 is larger than
reduction of energy losses and arbitrage, which peak shaving, mitigation
Ecapacity • SOCmax . Meanwhile, when the action is in a discharged state, it
of transmission congestion, reduction of energy losses and arbitrage can
be controlled via power flow scheduling. Peak shaving can be achieved is defined as being discharged up to Ecapacity • SOCmin in case Et+1 is
by reducing the peak load of power imported from the external grid to smaller than Ecapacity • SOCmin [42–45].
the microgrid. Meanwhile, the mitigation of transmission congestion Third, the transition probability (P) refers to the probability that
and reduction of energy loss is made possible by increasing the self- something will happen to the next state when a specific action is taken in
sufficiency rate. Particularly in South Korea, arbitrage is impossible the state [46]. Even if a specific action is carried out in a specific situ­
because electricity prices for household consumers are higher than retail ation due to various environmental factors, the result is expressed as a
prices for prosumers. Accordingly, in order to improve profitability, the probability rather than an equation. Power generation, electricity load,
amount of electricity purchased from the external grid should be mini­ month, and hour are provided from the state of the next step, and the
mized to increase the self-sufficiency rate. Therefore, in this study, energy state of BESS may vary depending on the action.
maximizing the self-sufficiency rate and minimizing the peak load of the Fourth, reward (R) is a weighted-sum multi-objective reward func­
residential building were defined as the objective function of the opti­ tion, which means the expected value of the reward the agent receives
mization problem. The self-sufficiency rate is represented by Eq. (20), when an arbitrary action is taken in the current state. Like the objectives
while the peak load is represented by Eq. (21). Furthermore, to consider of the optimization problem defined in this study, the reward of the RL
the two objectives simultaneously, this study defined an objective model was defined as maximizing the self-sufficiency rate and mini­
function based on the weighted sum method (refer to Eq. (22)). mizing the peak load of the residential building. The self-sufficiency rate
increases as the electric power transmitted from the external grid to the
T ( )
∑ PEG2MG microgrid are reduced. Therefore, at every step, a negative value of
self-sufficiency rate = 1− t
(20)
t=1 Pload
t PEG2MG
t is given as a reward (R1) for the first objective. For the peak load,
for each last step of every day, the negative value of that daily maximum
( )
Peak load = max PEG2MG
t (21) load is given as a reward (R2) for the second objective. Both rewards are
provided to the agent as the normalized average values.
t

self-sufficiency rate
Objective function= w•
self-sufficiency rateaverage
+ (1 − w) 2.3.2. RL model structure
As a kind of machine learning, the RL is a method in which the agent
(− Peak load)
• , (0 ≤ w ≤ 1) (22) defined in an environment recognizes the current state and selects the
Peak loadaverage
best action to maximize the future reward while interacting with the
surrounding environment. The agent receives a reward for every action

5
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

and learns by maximizing the cumulative reward through trial and error • Model-based and model-free: Depending on whether or not there is a
[47]. The RL can efficiently handle high-dimensional state spaces and model for the environment, RL can be categorized into model-based
complex behavior spaces through modeling for states and behaviors. and model-free algorithms [47]. The model-based algorithm can
predict a change in the environment according to the action in
• Policy: Policy refers to the behavior pattern of the agent in a certain advance and plan the optimal action in advance. However, it is
state and is a function that connects the state to the action. The policy difficult to make an accurate model for the BESS with PV system
can be divided into a deterministic policy that determines only one applied to various buildings in the city. Therefore, since the
action in a certain state and a stochastic policy that determines the model-free RL does not need to set up or learn a model for the
probability distribution of actions. environment in advance, this study used it because it can simply
• Value function: This is a predictive function of how much reward will determine an optimal behavior despite the agent’s changes in
be provided with state and action. This appears as a weighted sum of various external environments.
all rewards to be received later when the state and action are taken, • On-policy and off-policy: Among the model-free RL, the algorithms
and a discounting factor can be used to indicate a preference for the can be categorized depending on whether or not the algorithm up­
reward. date the same policy which is used to select actions: (i) on-policy:
• Model: The model refers to the agent’s expectation of how the next learning algorithms that update the same policy which the agent is
state and next reward can be determined. It can be divided into either currently using for deciding actions (e.g., SARSA, REINFORCE,
the state model or reward model. Advantage Actor-Critic (A2C) and Proximal Policy Optimization
(PPO)); and (ii) off-policy: learning algorithms that update a policy
The RL model includes policy and value functions. At each step t of which is different from the one that is used for deciding actions (e.g.,
episode, the agent learns the reward (Rt ) according to each state (st ) of Deep Q-Network (DQN), Twin Delayed Deep Deterministic Policy
the environment and the action (at ) of the agent through the policy and Gradient (TD3), and Soft Actor-Critic (SAC)). For the on-policy
function, and then delivers it to the environment. Afterwards, the model, action and learning are performed with the same policy,
environment returns to the next observation (i.e. next state st+1 ) and and the learning is repeated while accumulating learning results in
reward (Rt+1 ), which are changed according to the agent’s action, to the one experience memory. In comparison, the off-policy model is a
agent. This process is repeated until the reward the agent receives from method of repeating learning by separating behavior and learning
the environment is maximized. In addition, an important issue in RL is experience and building them up separately.
the balance between exploration and exploitation. As the agent performs • Value function and policy: The on-policy and off-policy algorithms
trial and error learning (i.e., exploration), the agent learns and updates can also be categorized depending on whether or not the value
the policy function and value function to receive the best reward (i.e., function and policy are used for learning process as follows: (i) value-
exploitation). In addition, the policy and value functions obtain more based: a learning method mainly using value function (e.g., SARSA,
information about the environment through exploration and are upda­ DQN); (ii) policy-based: a learning method by directly evaluating
ted based on the known information to maximize reward through and improving the policy without a value function (e.g., REIN­
exploitation [48]. As a result, referring to the design environment and FORCE) and (iii) actor-critic: a learning method using both value
set-up configuration of OpenAI Gym, this study designed the RL envi­ function and policy (e.g., A2C, PPO, TD3 and SAC) [50–53].
ronment [49].
This study aims to solve the problem of continuous space in the ac­
2.4. Development of the RL model for optimal scheduling of BESS with PV tion and environment (i.e., hourly scheduling of energy flows) according
system to the time stamp. However, value-based algorithms cannot be used
when there are high-dimension or many types of actions. In addition,
The RL model has a different learning direction and performance policy-based algorithms decrease model performance due to the high
depending on its training characteristics. The RL model can be imple­ gradient variance problem. Therefore, in both on-policy and off-policy
mented in various forms depending on the purpose of use, and can be algorithms, four actor-critic algorithms (i.e., A2C, PPO, TD3, and SAC)
mainly classified, as follows (refer to Table 1). which are mainly used for continuous and complex environments are
adopted for the defined problem in this study.
In summary, which has the same target policy to update as the cur­
rent behavior policy, searches the environment to update the current
behavior policy as it is. Therefore, since the state distribution becomes
Table 1 dependent on the current policy, it is data dependent, and there is a
Classification of RL algorithms. possibility of converging to the local optimal. However, it mainly de­
Classification Policy based Value based Actor-Critic termines the optimal value by going around to avoid falling into the
On-policy • REINFORCE • Monte Carlo • Asynchronous local optimal, and it can be more accurate and convenient than the off
• REINFORCE Learning (MC) Advantage Actor- policy in the exploration strategy. Accordingly, A2C and PPO algo­
with advantage • Temporal- critic (A3C) rithms, which have one behavior policy and target policy, determine the
Difference (TD) • Advantage Actor- optimal value by going around to avoid falling into the local optimal. If
• State Action Critic (A2C)
Reward State • Trust Region Policy
the learning is done enough, it can be a more robust model than off-
Action (SARSA) Optimization policy by determining the optimal behavior even in a complex envi­
(TRPO) ronment that reflects the real world.
• Proximal Policy Meanwhile, off-policy has a different behavior policy and target
Optimization (PPO)
policy. Because of the learning using their other data, off-policy has a
Off-policy • Q-learning • Deep Deterministic large variance between these policies and has a slow convergence speed.
• Deep Q- Policy Gradient This can be solved by using Importance Sampling (i.e., Estimating a
Network (DQN) (DDPG)
• Double DQN • Twin Delayed Deep
value from the distribution of samples obtained by different policies) or
Deterministic Policy by deterministically transforming the action selection of the target
Gradient (TD3) policy (i.e., limitation of action selection). However, since the action of
• Soft Actor-Critic the target policy is limited, an exploration strategy is needed to avoid
(SAC)
falling into the local optimal. Off-policy converges quickly with high

6
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

learning efficiency, but biased results can be derived by accumulating where θ and φ are parameters of the actor network (i.e., policy function
even erroneous experiences continuously. Accordingly, TD3 and SAC π) and the critic network, respectively. Jπ (θ) and JQ (θ) are the objective
algorithms as an off-policy model have mainly high performance in functions for the actor network and critic network, respectively. γ is the
solving simple problems through RL. discount factor for future rewards. at and st are action and state at
timestep t, respectively. rt+1 and st+1 are reward and state at timestep t +
2.4.1. Advantage actor-critic (A2C) 1, respectively. V is the value function for state. A and Q are the
The advantage of A2C is that it can improve convergence perfor­ advantage function and Q-value function for action and state, respec­
mance and stabilize learning as it uses the advantage function (i.e., A(st , tively.
at )) as a value function [53]. A2C is an on-policy algorithm that trains
the agent after collecting the transitions (i.e., < st , at , rt+1 , st+1 >) ob­ 2.4.2. Proximal policy optimization (PPO)
tained from the latest policy (πθ ) in the experience money (refer to PPO is one of the on-policy RLs, and it can learn stably due to its
Fig. 3). An A2C agent consists of two interdependent networks, actor and better sample complexity (the number of training samples that are
critic. needed for the algorithm) [50,54]. Since the on-policy algorithm uses
The actor updates the parameters of the network to maximize the only the current policy (πθ ) for exploration, it can fall into a local
expected reward that the agent can receive and determines the optimal maximum. To overcome this, PPO improved the model performance by
action. In A2C, the advantage function received from the critic is used adjusting the exploration (refer to Fig. 4).
when the actor is trained through the policy gradient (refer to Eq. (23)). The actor in the PPO model updates the policy network parameters
The advantage function indicates how much better it is to take a specific by using the surrogate function and clipping method as an objective
action (at ) in a given state (st ) than to take a general action (refer to Eq. function instead of the advantage function in A2C model (refer to Eq.
(24)). The use of the advantage function makes it possible to reduce (27)) [55]. As for the surrogate function, by using the ratio (rt (θ)) be­
variance and improve convergence performance when the agent is tween the new parameter (θ) in the current step and the old parameter
learning. (θold ) in the previous step, the parameter is prevented from significantly
The critic updates the parameters of the network to minimize the TD updating at once, thus stable learning is performed (refer to Eq. (28)). As
error (i.e., JQ (θ)) and evaluates the potential reward depending on the for the clipping method, the local maximum by incorrect exploration
state (refer to Eq. (25)). Meanwhile, advantage is expressed in the were prevented by constraining it in such a way that the policy
relationship between V(st+1 ) and V(st ) rather than between Q(st , at ) and parameter is updated only within a certain range (i.e., 1 − ϵ,1 + ϵ) policy
V(st ) according to the bellman equation [47] (refer to Eq. (26)). parameter (refer to Eq. (29)).
Consequently, the critic network can be effectively built with only one V The critic updates the parameters of the network to minimize the TD
function, not two networks, Q and V functions. error (i.e., JQ (φ)) as in A2C. In other words, the mean squared error is
[∑ ]
calculated so that the V(st ) expected in the state is close to the sum of the
∇θ Jπ (θ)= E ∇θ log πθ (at |st )A(st , at ) (23)
target V(st+1 ) and the reward (refer to Eq. (30)). Meanwhile, when the
advantage function is calculated and transmitted from the critic to the
A(st , at )= Q(st , at )− V(st ) (24)
actor, Generalized Advantage Estimation (GAE) is applied [56] (refer to
Eq. (31)), which in turn adjusts bias-variance trade-offs and stabilizes
JQ (φ) = (rt+1 +γV(st+1 )− V(st ))2 (25)
learning.
A(st , at ) = rt+1 +γV(st+1 )− V(st ) (26)

Fig. 3. Architecture of the Advantage Actor-Critic (A2C) algorithm.

7
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

Fig. 4. Architecture of the Proximal Policy Optimization (PPO) algorithm.

[∑ ]
Jπ (θ)= E rt (θ)A(st , at ) (27) (
AGAE = (1 − λ) A(1) +λA(2) + λ2 A(3) +…
)
(31)

rt (θ) =
πθ (at |st )
(28) An = δn γn− 1
(32)
πθold (at |st )
δ = rt +γV(st+1 )− V(st ) (33)
JCLIP
π (θ)= E[min(rt (θ)A,clip(rt (θ), 1 − ϵ, 1 + ϵ)A)] (29)
where θ and φ are parameters of the actor network (i.e., policy function
JQ (φ) = (rt+1 +γV(st+1 )− V(st ))2 (30) π) and the critic network, respectively. Jπ (θ), JCLIP (θ) and JQ (φ) are the
π
objective functions for actor network with and without clipping method

Fig. 5. Architecture of the Twin Delayed DDPG (TD3) algorithm.

8
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

and the objective functions for critic network. rt (θ) is the ratio between particular, SAC uses entropy regularization to adjust the trade-off be­
old and new policy function. ϵ is clipping constraint. γ and λ are the tween exploration and exploitation (refer to Fig. 6).
discount factor and GAE parameter. at and st are action and state at The actor updates the policy parameters to maximize the Q-function
timestep t, respectively. rt+1 and st+1 are reward and state at timestep t + received from the critic as an off-policy algorithm. However, in SAC,
1, respectively. V is the value function for state. A and AGAE are the entropy regularized Q-function (entropy term added to the Q-function as
advantage functions applied with and without GAE, respectively. δ is TD a bonus) is used instead of the Q-function (refer to Eqs. (39) and (40)).
advantage function for action and state. Entropy (H) is a measure of policy randomness. It is set to 0 in a
deterministic policy, while it is set to a value between 0 and 1 in a
2.4.3. Twin delayed deep deterministic policy gradient (TD3) stochastic policy. The entropy regularization coefficient α is multiplied
TD3 performs stable learning by compensating for the limitation that by the entropy term to automatically adjust the trade-off between
has been excessively influenced by incorrect exploration while learning exploration and exploitation while being explicitly learned. Ultimately,
and directing the actor in the direction of maximizing the Q-function the entropy term naturally decreases as the iteration progresses. This
[52]. TD3 is an off-policy algorithm that collects transitions (i.e., < st , at , prevents the policy from converging to the local maximum in the initial
rt+1 , st+1 >) collected from the policy at all time points in the replay stage, thus allowing for the optimal policy while continuing exploration.
buffer and then trained the agent based on the transition history (refer to Meanwhile, in SAC, the action was selected stochastically rather than
Fig. 5). deterministically to provide a similar effect to the addition of explora­
The actor updates the policy parameter to maximize the Q-function tion noise. As the optimal action, as well as the suboptimal action, was
received from the critic (refer to Eq. (34)). In TD3, a delayed update frequently selected, the agent was able to show robustness even in an
method (which updates policy parameters after multiple updates of unexpected state.
critic) is used to reduce the variance of the Q-function for stable As in TD3, the critic performs clipped double q-learning to update the
learning. Meanwhile, in the off-policy algorithm, the actor transmits the parameters of the two critic networks (refer to Eq. (41)). To perform
target action to the critic network to estimate the target Q-function (y). faster and more stable training, the critic selects the smaller Q value of
In addition, target policy smoothing for estimating the target Q-function the two value networks as the target Q-function (y) (refer to Eq. (42)). To
by adding noise (ϵ′) to the target action π(st+1 ) is used in TD3 (refer to Eq. reduce the overestimation bias, two target Q-functions are used even in
(35)). The target Q-function is sampled in the action range that includes the training process of the two critic networks. However, in SAC, Q-
an error term, not a fixed target Q-function to minimize instability function is replaced with a soft Q-function according to soft bellman
during which the target Q-function is either underestimated or over­ iteration (refer to Eq. (43)). Soft Q-function means changing the hard
estimated. As with the target action, the action added exploration noise max of the Bellman equation to softmax. The softmax is used to create
(ϵ) was selected (refer to Eq. (36)). efficient samples in continuous action space and perform critic conver­
The critic updates parameters with the use of clipped double Q- gence and stable learning.
learning using two critic networks [52]. In clipping the double [∑ ]
Jπ (θ)= E Q(st , at )+αH(π(•|st )) (39)
q-learning, the TD error (i.e.,JQ (φ)) is calculated using the smaller of the
target Q-functions derived from the two critic networks (refer to Eqs. [∑ ]
(37) and (38)). The target Q-function contains an error because it is a Jπ (θ)= E Q(st , at )− αlog π(at |st ) (40)
value estimated by receiving the action π(st+1 ) from the target policy.
The policy selects a suboptimal action as the actor is learned based on [
JQ (φ)= E (y − Q(st , at ))2
]
(41)
the overestimated maximum value of the target Q-function with errors.
Therefore, the use of the minimum target Q-function derived from the ( )
two critic networks makes it possible to minimize the impact of accu­ y = rt+1 +γ minQ(st+1 , at+1 )− αlog(π(at+1 |st+1 )) (42)
1,2
mulated errors arising from iterations, and thus to solve overestimation
bias. Qsoft (st , at )= E[rt+1 +γsoftmaxQ(st+1 , at+1 )] (43)
[∑ ]
∇θ Jπ (θ)= E ∇θ log πθ (at |st )Q(st , at ) (34) where θ and φ are parameters of the actor network (i.e., policy function
π) and the critic network, respectively. Jπ (θ) and JQ (φ) are the objective
at+1 = π(st+1 ) + ϵ′ (35) functions for actor network and the critic network. H and α are the en­
tropy and the entropy regularization coefficient, respectively. γ is the
at = π(st )+ϵ (36) discount factor. at and st are action and state at timestep t, respectively.
rt+1 and st+1 are reward and state at timestep t + 1, respectively. Qsoft and
JQ (φ) = (y − Q(st , at ))2 (37) y are the soft Q-function and the target Q-function, respectively.

y = rt+1 +γminQ(st+1 , at+1 ) (38) 3. Case study


1,2

where θ and φ are parameters of the actor network (i.e., policy function 3.1. Description of the case study
π) and the critic network, respectively. Jπ (θ) and JQ (φ) are the objective
functions for actor network and the critic network. ϵ and ϵ′ are the This study aimed to apply the optimal scheduling of ESS with PV to
exploration noises for the state and the next state. γ is the discount actual residential buildings and evaluate the performance of the devel­
factor. at and st are action and state at timestep t, respectively. rt+1 and oped optimization model based on empirical data on electricity loads
st+1 are reward and state at timestep t + 1, respectively. Q and y are the and power generation. A case study was conducted to verify the devel­
Q-value function and the target Q-function, respectively. oped scheduling model of ESS with PV and assess its effects. The target
area for the case study was Chungbuk, an area where the PV system
2.4.4. Soft actor-critic (SAC) distribution capacity per residential building is the highest in South
SAC is an RL that optimizes stochastic policy as one of the off-policy Korea [57,58]. Therefore, a residential building in Chungbuk was tar­
algorithms. Like TD3, SAC uses clipped double Q-learning and performs geted with a large-capacity PV system (i.e., 3 kW, maximum capacity for
stable learning by compensating for the limitation of value-learning with residential buildings according to regulations for the electricity busi­
the use of stochastic policy instead of target policy smoothing. In ness) and high energy use intensity (EUI) (i.e., 4.95 kWh/m2, higher

9
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

Fig. 6. Architecture of the Soft Actor-Critic (SAC) algorithm.

than the average EUI of the residential buildings in South Korea) [12]. In dollars using the average exchange rate (i.e., KRW1,101/USD) during
order to establish the training and test sets of the developed optimal the same period as in the case study.
scheduling model, this study collected the actual data on the hourly
electricity load and power generation from the target residential
building for two years (i.e., training set: from April 2018 to March 2019, 3.2. Model validation
test sets: from January to December 2020). The analysis period of this
study was carried out at hourly intervals to reflect the electricity sale 3.2.1. Training of the RL model
price that changes every hour. To this end, the hourly data on the In this study, in order to compare the performance of four scheduling
electric power system and electricity price were collected from resi­ models according to the different optimization algorithms, the devel­
dential buildings in Chungbuk and assumptions were defined. Moreover, oped RL model was trained. The training was conducted for each model
the capacity and performance of the BESS were set as follows based on based on actual data for one year (i.e., from April 2018 to March 2019),
the existing ESS product and previous studies [59–62].: (i) energy ca­ and one episode consisted of a month, which was set to a total of 24 steps
pacity: 3.3 kWh; (ii) power capacity: 3 kW; (iii) maximum SOC: 90 %; per day (i.e., hourly unit). In addition, the trained RL models were tested
(iv) minimum SOC: 20 %; (v) self-discharge rate: 5 % per 30days; (vi) with actual datasets (i.e., from January to December 2020) to verify the
charge and discharge efficiency: 5 %, respectively; (vii) daily cycle limit: effectiveness of the developed model. To train the developed models
1 to 4 times; and (viii) the energy state of ESS of the first timestep of the optimally, the following hyperparameters were set through trial and
first day starts with the minimum SOC. error for each model based on existing studies (refer to Table 3) [50–53].
Meanwhile, a case study was conducted for residential buildings in As for the A2C and the PPO, the adaptive moment estimation
South Korea, so electricity business regulation and support schemes for (ADAM) optimizer was used for the parameter update in both models,
residential buildings in South Korea were applied during the analysis and the related hyperparameters were set as follows in common in both
period. Accordingly, the data on the electricity price variables for sale models. A warm-up cosine scheduler was used as the learning rate
and purchase (i.e., SMP, REC price, REC weight, price of demand charge
and price of energy charge) were collected from the Korea Power Ex­ Table 3
change (KPX) and Korea Electric Power Corporation (KEPCO) [63]. In A summary of hyperparameter values for the developed models.
South Korea, progressive tariffs are imposed on electricity used in resi­ Classification Policy optimization Q-learning
dential buildings. Because the electricity bill that residents have to pay A2C PPO TD3 SAC
can rapidly increase with increasing electricity consumption [64], the
Learning rate 0.001 0.001 0.001 0.001
progressive tariff in South Korea was considered for scheduling (refer to Total timesteps 6,000,000 6,000,000 6,000,000 6,000,000
Table 2). The collected electricity price data were converted to US Buffer size (Experience 576 768 17,520 17,520
memory)
Batch size – 96 64 64
value function coefficient 0.5 0.5 – –
Table 2 Gamma factor 1 1 1 1
Electricity progressive tariff in South Korea. The generalized advantage 0.9 0.9 – –
Progressive electricity purchase Demand charge ($) Energy charge ($/kWh) estimation (GAE) lambda
factor
1 - 200 kWh 0.83 0.08 Learning starts – – 744 744
201 - 400 kWh 1.45 0.17 Target update rate (TAU) – – 0.001 0.001
Above 400 kWh 6.63 0.26 Training frequency – – 744, 744,
’step’ ’step’
Note: The exchange rate (KRW/USD) is 1101 KRW to a USD.

10
H. Kang et al. Renewable and Sustainable Energy Reviews 190 (2024) 114054

scheduler, and the initial learning rate was set to 0.001. Learning was comparative analysis with the existing model using MLP was carried out
terminated when the timesteps reached 6,000,000. The value function to verify the validity of the optimal solution suggested by the RL-based
coefficient was set to 0.5. The Gamma factor was set to 1 because the model.
length of one episode is fixed (i.e., one month). The generalized
advantage estimation (GAE) lambda factor was set to 0.9. On the other 4. Results and discussion
hand, experience memory was set to 576 for A2C and 768 for PPO so
that parameters were updated at timesteps corresponding to 24 and 32 4.1. Performance evaluation of the developed models
days, respectively. The batch size was set to 96 in PPO, and A2C did not
use a minibatch. Fig. 7 shows the change of reward during the 6.0 × 106 training step
As for the TD3 and the SAC, an ADAM optimizer was used for the in four RL algorithms (i.e., A2C, PPO, TD3, and SAC) at weights (w) 0,
parameter update in both models. A warm-up cosine scheduler was used 0.5, and 1.0. At weight = 0, the objective of a scheduling model is to
as the learning rate scheduler, and the initial learning rate was set to minimize peak load, and at weight = 1, the objective is to maximize the
0.001. Learning starts when timesteps reaches 744 and ends when self-sufficiency rate. At weight (w) = 0.5, the two objectives are equally
timesteps reaches 6,000,000. The buffer size was set to 17,520 so that considered in the scheduling model. As the reward increases, the
transitions collected during timesteps corresponding to 2 years were optimal solution of the RL-based BESS scheduling model maximizes the
used for training. The batch size was set to 64. The gamma factor was set self-sufficiency rate of the residential building or minimizes the peak
to 1. The parameter of the target model was soft updated every 744 load to further improve the stability of the power supply in microgrids.
timesteps corresponding to 31 days with the target update rate (TAU) Overall, the reward of A2C and PPO (i.e., on-policy) increased as the
value of 0.001 [53]. training step was repeated through stable learning, while the reward of
TD3 and SAC (i.e., off-policy) did not converge and appeared erratic
3.2.2. Comparative analysis with the existing model regardless of the number of training steps. At w = 0, the final reward was
In order to validate the optimization results (i.e., hourly electricity 56.3, 60.2, 58.4, and 59.0, respectively, for A2C, PPO, TD3, and SAC. At
flows of the BESS with PV system) of the developed RL-based model, the w = 1, the final reward was 56.3, 61.4, 60.0, and 59.8, respectively, for
comparative analysis of the developed RL-model with existing optimi­ A2C, PPO, TD3, and SAC. At w = 0.5, the final reward was 56.2, 61.8,
zation was conducted. MILP is one of the mathematical optimization 57.9, and 53.4, respectively, for A2C, PPO, TD3, and SAC. As a result,
frameworks and deals with the optimization problem that maximizes or the lowest reward was obtained in the A2C by determining the erratic
minimizes the demand through the efficient allocation of limited resources. MILP has been frequently used in the optimization of energy systems with many variables along the time domain, owing to its advantages of enabling global optimization and helping to solve large problems. However, in MILP, all the constraints and objective functions should be linear, and the computation time may increase depending on the level of detail.

In this study, an optimal scheduling model of the BESS with PV was developed using MILP, which can find Pareto optimal solutions for maximizing both the economic and environmental benefits. Moreover, the payment of the electricity bill based on the progressive tariff in South Korea is made on a monthly basis, and as such, for more accurate optimization results, the daily optimization should be carried out over the entire month. Here, the MILP performed multi-objective optimization at the times at which the electricity purchase price increases, so as to accurately reflect the progressive tariff. Notably, in order to reflect the progressive tariff imposed on residential buildings in South Korea (i.e., the base charge and power purchase price increase by degrees according to the aggregated monthly power consumption), the progressive tariff level at the time of an increase in the power purchase price was confirmed, and the power purchase price was updated in the following order: (i) the cumulative electricity purchase up to a specific timestep is calculated to confirm the progressive tariff level; (ii) this step is repeated at each timestep as long as the progressive tariff level does not change; and (iii) if the progressive tariff level changes at the corresponding timestep, the electricity purchase prices are updated according to the new progressive tariff level from the next step. As a result, based on the optimal charge and discharge scheduling of the BESS with PV acquired through the optimization, the monthly net revenue and the electricity self-sufficiency rate, which were the bases of the payment of the electricity bill, were calculated. For detailed descriptions of the calculation process for the mathematical model of the electric system and BESS, please refer to Jung et al. [12].
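To make the update order above concrete, a minimal sketch of the tier lookup and price update is given below. The tier boundaries, unit prices, and function names are illustrative assumptions, not the actual KEPCO rates or the exact implementation used in this study.

```python
# Sketch of the tariff update in steps (i)-(iii); tier bounds (kWh/month) and
# unit prices (KRW/kWh) are illustrative placeholders, not the actual KEPCO tariff.
TIER_BOUNDS = [200.0, 400.0]
TIER_PRICES = [100.0, 190.0, 280.0]

def tariff_level(cumulative_purchase_kwh):
    """Step (i): confirm the progressive tariff level from the cumulative purchase."""
    for level, bound in enumerate(TIER_BOUNDS):
        if cumulative_purchase_kwh <= bound:
            return level
    return len(TIER_BOUNDS)

def update_purchase_prices(hourly_purchase_kwh):
    """Steps (ii)-(iii): keep the current price while the tariff level is unchanged,
    and apply the new price from the timestep after the level changes."""
    prices, cumulative, level = [], 0.0, 0
    price = TIER_PRICES[level]
    for purchase in hourly_purchase_kwh:
        prices.append(price)                 # price applied at this timestep
        cumulative += purchase
        new_level = tariff_level(cumulative)
        if new_level != level:               # step (iii): the level has changed
            level, price = new_level, TIER_PRICES[new_level]
    return prices
```

Because the RL agent accumulates rewards over the monthly period, the same tier information can be carried in its state, which is the basis of the comparison drawn in the next paragraph.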
Although the MILP model considers the progressive tariff, it uses only a linear mathematical model and has a disadvantage: the price according to the progressive tariff can be applied only after the hourly scheduling of electricity flows has been derived. On the other hand, the RL-based model is capable of considering the price for the progressive tariff level from the hourly scheduling of electricity flows based on the cumulative rewards over the monthly period, and it is thus expected to better reflect the actual electricity tariff system than the MILP. Therefore, in this study, a

action. In contrast, the highest reward was obtained with the PPO. Particularly at weight = 0.5, the reward of the PPO was 61.8, which was the highest among all scenarios. TD3 and SAC, which converge quickly, showed that biased and erratic results can be derived while repeating learning when implementing a complex environment such as the BESS scheduling model. Moreover, the sufficiently trained A2C and PPO can determine the optimal behavior even in such a complex environment to determine the electricity flows in the BESS scheduling model. In particular, the PPO, which achieved the highest reward through stable learning, is possibly the best algorithm for the optimal scheduling model proposed in this study.

Fig. 7. Training curves tracking the reward of the agent according to optimization weight and RL algorithms.
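As a point of reference for the training behavior shown in Fig. 7, the sketch below outlines a Gym-style scheduling environment in which the agent observes the hour, PV generation, building load, and battery state of charge, and selects a continuous charge/discharge action. The class name, state variables, battery parameters, and simplified dynamics are assumptions for illustration only, not the exact environment of this study.

```python
import numpy as np

class BessSchedulingEnv:
    """Gym-style sketch of a BESS scheduling environment (illustrative only).

    Observation: [hour of day, PV generation (kW), building load (kW), battery SOC (kWh)]
    Action:      charge (+) / discharge (-) command, normalized to [-1, 1]
    """

    def __init__(self, pv_kw, load_kw, capacity_kwh=3.3, p_max_kw=3.0, eff=0.95):
        # The capacity, power limit, and efficiency are placeholder values.
        self.pv, self.load = np.asarray(pv_kw, float), np.asarray(load_kw, float)
        self.capacity, self.p_max, self.eff = capacity_kwh, p_max_kw, eff

    def reset(self):
        self.t, self.soc = 0, 0.5 * self.capacity
        return self._obs()

    def step(self, action):
        a = float(np.clip(np.asarray(action, float).reshape(-1)[0], -1.0, 1.0))
        power = a * self.p_max                                  # kW over a 1-h step
        if power >= 0.0:                                        # charge
            drawn = min(power, (self.capacity - self.soc) / self.eff)
            self.soc += drawn * self.eff                        # energy stored
            bus = drawn                                         # energy taken from the PV/grid bus
        else:                                                   # discharge
            released = min(-power, self.soc)
            self.soc -= released
            bus = -released * self.eff                          # energy delivered to the bus
        grid_import = max(self.load[self.t] - self.pv[self.t] + bus, 0.0)
        reward = -grid_import                                   # placeholder reward term
        self.t += 1
        done = self.t >= len(self.load)
        return self._obs(), reward, done, {}

    def _obs(self):
        t = min(self.t, len(self.load) - 1)
        return np.array([t % 24, self.pv[t], self.load[t], self.soc], dtype=np.float32)
```

With this interface, the reward term can be swapped for the weighted objective discussed below without changing the environment itself.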

Fig. 8 shows the relationship between the self-sufficiency rate and the peak load according to the optimization weight of the BESS with a PV system applied with optimal scheduling under the four RL algorithms. All four RL algorithms showed a tendency for the self-sufficiency rate to increase as the weight decreased (i.e., the goal of maximizing the self-sufficiency rate), and a tendency for the maximum load to decrease as the weight increased (i.e., the goal of minimizing the maximum load). This trend was most pronounced in the PPO, which formed a Pareto front. Along the Pareto front, the PPO clearly decided to increase the SSR and to lower the peak load according to the weight. In contrast, the off-policy TD3 and SAC models showed biased results because the optimization weight did not lead them to an appropriate decision.

Fig. 8. The comparison between the four RL algorithms at weights from 0 to 1.
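One plausible way to encode this weighting, assuming the reward is scalarized from the two objectives, is sketched below; the normalization constant and function name are hypothetical, not the exact reward definition of this study.

```python
def weighted_reward(ssr_step, peak_so_far_kwh, w, peak_ref_kwh=3.0):
    """Scalarized reward balancing self-sufficiency and peak load (illustrative).

    w = 0 rewards only the self-sufficiency rate (SSR) of the current step,
    while w = 1 rewards only peak-load reduction; intermediate weights trace
    the trade-off. peak_ref_kwh is an assumed normalization constant.
    """
    peak_term = 1.0 - min(peak_so_far_kwh / peak_ref_kwh, 1.0)  # larger when the peak stays low
    return (1.0 - w) * ssr_step + w * peak_term
```

Sweeping w from 0 to 1 then yields one trained policy per weight, and the non-dominated (SSR, peak load) pairs across those policies trace the Pareto front observed for the PPO.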

As shown in Table 4 and Fig. 9, the average self-sufficiency rate of the PPO was 2.22 %, 0.64 %, and 3.32 % higher than that of A2C, TD3, and SAC, respectively. Overall, when operated to maximize the self-sufficiency rate (w = 0), the self-sufficiency rate of the PPO was the highest among the four RL models. Notably, at weight = 0.6, the self-sufficiency rate showed the largest differences, at 3.16 %, 1.71 %, and 6.18 % compared to A2C, TD3, and SAC, respectively. These results mean that optimal scheduling using the PPO can be advantageous when the BESS is used to improve the self-sufficiency rate in residential buildings.

Table 4
Self-sufficiency rate and peak load according to the optimization weight.

Optimization weight   Self-sufficiency rate (SSR) (%)        Peak load (kWh)
                      A2C a   PPO b   TD3 c   SAC d          A2C    PPO    TD3    SAC
0                     33.3    35.6    33.9    32.6           1.30   1.30   1.31   1.29
0.1                   33.1    35.4    33.7    33.2           1.29   1.27   1.29   1.27
0.2                   32.9    35.5    34.6    32.4           1.29   1.33   1.31   1.22
0.3                   32.9    35.6    34.7    32.9           1.30   1.25   1.22   1.21
0.4                   32.4    35.7    34.6    33.4           1.31   1.26   1.28   1.22
0.5                   32.9    35.4    35.0    31.9           1.28   1.26   1.25   1.23
0.6                   32.4    35.6    33.8    29.4           1.26   1.26   1.27   1.25
0.7                   32.9    35.2    34.1    30.9           1.29   1.24   1.26   1.26
0.8                   32.7    33.7    33.2    28.5           1.31   1.21   1.23   1.21
0.9                   30.6    30.6    33.4    29.4           1.29   1.17   1.21   1.23
1                     29.5    31.7    32.0    28.8           1.28   1.21   1.23   1.23
Average               32.33   34.55   33.91   31.22          1.29   1.25   1.26   1.24

Note: a A2C is the advantage actor-critic algorithm; b PPO is proximal policy optimization; c TD3 is the twin delayed deep deterministic policy gradient algorithm; d SAC is the soft actor-critic algorithm; and the grey shaded area marks either the highest self-sufficiency rate or the lowest peak load.
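For reference, the two metrics reported in Table 4 can be computed from an hourly schedule roughly as follows; the function and variable names are hypothetical, and the expression assumes that exported PV surplus does not count toward self-sufficiency.

```python
import numpy as np

def evaluate_schedule(load_kwh, pv_to_load_kwh, battery_to_load_kwh, grid_import_kwh):
    """Self-sufficiency rate (%) and peak load (kWh per timestep) of a schedule."""
    load = np.asarray(load_kwh, float)
    self_supplied = np.asarray(pv_to_load_kwh, float) + np.asarray(battery_to_load_kwh, float)
    ssr = 100.0 * self_supplied.sum() / load.sum()   # share of demand covered on site
    peak = float(np.max(grid_import_kwh))            # largest grid import in any timestep
    return ssr, peak
```

Here the peak load is taken as the largest grid import in any single timestep, consistent with the hourly values reported in Table 4.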

On the other hand, the PPO and TD3 made different decisions depending on the operational purpose, and in the case of A2C, the peak load could not be lowered compared to the other models (refer to Figs. 3–9). The average peak load of the PPO was 0.039 kWh and 0.008 kWh lower than that of A2C and TD3, respectively, whereas it was 0.012 kWh higher than that of the SAC. Since the SAC determined the electricity flows with a lower self-sufficiency rate and peak load regardless of the optimization objectives, the average peak load of the SAC was lower by 0.051 kWh, 0.012 kWh, and 0.027 kWh compared to A2C, PPO, and TD3, respectively. As a result, when operated to minimize the peak load (w = 1), the peak load of the PPO also appeared to be the lowest. However, selecting an excellent RL algorithm is not appropriate just because the reward (self-sufficiency rate and peak load) derived from the RL-based BESS scheduling model is high or low. In RL algorithms, the optimal solution is calculated through training according to the time step with only some historical data, and thus the converged values may yield unrealistic and biased results.

These results confirmed that the PPO, with the highest learning performance, selects appropriate decisions that maximize the SSR and minimize the peak load considering the optimization weights. In the TD3 and SAC, the optimization weights for the self-sufficiency rate and peak load were not considered, with the results clustered in the center (i.e., the optimization objectives were not adequately reflected), although there were cases where the SAC showed the best performance in terms of the peak load.

4.2. Comparative analysis with the existing model

To ensure the validity of the optimal solution obtained in Section 4.1, the results of the case study were compared to those of the existing model using MILP. Due to the limitations of the existing model, only the optimal solution at weight = 0 was available, because the analysis was carried out with a focus on the increase of the self-sufficiency rate. Fig. 10 shows the comparison of the self-sufficiency rate and peak load of each model against the existing model using MILP. The self-sufficiency rate of the existing model was 36.4 %, which was 3.1 %, 0.8 %, 2.5 %, and 3.8 % higher than that of A2C, PPO, TD3, and SAC, respectively. The peak load of the existing model was 1.301 kWh, which was 0.005 and 0.01 kWh higher than that of A2C and SAC, respectively, but lower than that of TD3 by 0.011 kWh; in particular, it was the same as the peak load of the PPO. These results indicate that the PPO, with the smallest difference from the existing model in both the self-sufficiency rate and the peak load, was steady and well trained.

To sum up, the difference between the existing model and the PPO was the smallest, which suggests that, in addition to reflecting reality, the PPO can be seen as the best RL algorithm in terms of SSR and peak load reduction. In addition, although the SAC is a better alternative than the PPO for reducing the peak load, it may be unrealistic due to its large error from the existing model.

Therefore, the PPO algorithm showed the smallest difference from the existing model in both SSR and peak load, indicating its ability to provide results that closely align with reality. This suggests that the PPO offers a valid and realistic approach to BESS scheduling. The findings suggest that the PPO outperformed the other RL algorithms in terms of both SSR and peak load reduction. It offered the smallest variation from the existing model, indicating its potential as the best algorithm for practical implementation.

While the SAC showed promise in peak load reduction, its significant deviation from the existing model raises concerns about its realism and practicality in real-world applications. In summary, this study implies that the PPO algorithm is a promising choice for BESS scheduling. It successfully balances the objectives of maximizing the self-sufficiency rate and minimizing the peak load, offering results that closely match real-world scenarios. This indicates its potential for practical application in residential buildings and microgrids. However, the study underlines the importance of considering specific optimization objectives and weighing the trade-offs between these objectives when implementing the BESS in diverse real-world contexts.

Fig. 9. The comparison between four RL algorithms in terms of SSR and peak load.

Fig. 10. Comparison of self-sufficiency rate and peak load of each model over the existing model.

4.3. Implications of this study

The findings of this study hold significant implications for the use of RL algorithms in optimizing the BESS in conjunction with PV systems for energy management in residential buildings and microgrids.

The study evaluated the performance of the four RL algorithms (A2C, PPO, TD3, and SAC) under varying optimization weights (w) to assess their ability to balance the objectives of minimizing the peak load and maximizing the self-sufficiency rate (SSR). First, the on-policy algorithms, A2C and PPO, exhibited stable learning and increasing rewards as the training steps were repeated. In contrast, the off-policy algorithms, TD3 and SAC, showed erratic and non-converging behavior. This suggests that on-policy algorithms are more suitable for stable learning in complex environments like BESS scheduling. Second, the algorithms that reached optimal solutions via stable learning were A2C and PPO. When adequately trained, they were able to determine the optimal behavior in the BESS scheduling model, maximizing the SSR and minimizing the peak load. Among these, the PPO consistently achieved the highest rewards, making it a promising algorithm for practical applications. Third, the findings underscore the importance of selecting the right RL algorithm for BESS scheduling in real-world scenarios. While high rewards are desirable, the study cautions against relying solely on reward values, as converged values may yield unrealistic or biased results. Fourth, with the highest reward, the PPO demonstrated the ability to balance the objectives effectively by considering the optimization weights. This is particularly crucial when users have varying preferences and objectives for BESS operation. The PPO's performance highlights its capacity to address the trade-off relationships between maximizing the self-sufficiency rate and minimizing the peak load.
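As a practical illustration of this comparison, the four agents can be trained under a common interface. The sketch below assumes a Gym-compatible version of the environment outlined earlier, wrapped by a hypothetical make_bess_env() helper, and uses the Stable-Baselines3 implementations of A2C, PPO, TD3, and SAC with default hyperparameters rather than the tuned settings of this study.

```python
# Sketch: training the four agents on a Gym-compatible BESS environment.
# `make_bess_env()` is a hypothetical helper wrapping the scheduling environment;
# stable-baselines3 must be installed, and the settings shown are library defaults.
from stable_baselines3 import A2C, PPO, SAC, TD3

ALGORITHMS = {"A2C": A2C, "PPO": PPO, "TD3": TD3, "SAC": SAC}

def train_all(make_bess_env, total_timesteps=200_000):
    policies = {}
    for name, algo in ALGORITHMS.items():
        env = make_bess_env()
        # A2C and PPO are on-policy; TD3 and SAC are off-policy with replay
        # buffers, which is the distinction drawn in the discussion above.
        model = algo("MlpPolicy", env, verbose=0)
        model.learn(total_timesteps=total_timesteps)
        policies[name] = model
    return policies
```

Evaluating each trained policy over the same test period, and recording the resulting SSR and peak load, reproduces the kind of comparison summarized in Table 4 and Fig. 9.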
5. Conclusion

Globally, the battery energy storage system (BESS) still lacks economic efficiency due to its high price. However, adopting it in cities is essential to complement the intermittency and volatility of photovoltaic (PV) systems. To deal with the various uncertainties of microgrids in the real world, this study developed a reinforcement learning (RL)-based optimal scheduling model of the BESS with a PV system using four RL algorithms. As a case study, a residential building in Chungbuk was targeted, and the developed RL model was applied accordingly.

The case study results demonstrated the effectiveness and feasibility of the developed RL-based optimal scheduling model. Compared to the other algorithms, the results showed that the proximal policy optimization (PPO)-based RL model has better decision-making in the case of residential buildings. Moreover, the PPO-based RL model with simple modeling was determined to be the optimal scheduling solution of the BESS for maximizing the self-sufficiency rate of a residential building, performing similarly to the existing mixed integer linear programming-based model that defined all environments in detail.

The RL-based BESS scheduling model is poised to offer an optimal solution and to specialize in application within a virtual power plant, where numerous buildings continuously share electricity based on the BESS. However, due to limitations in data availability, this study did not conduct a scale-up case study to include various battery sizes, building use types, or regions in the energy-sharing community. A promising avenue for future research involves expanding the scope of this study to encompass a broader and more diverse range of scenarios. If there is an opportunity to conduct a scale-up case study by encompassing various battery sizes, building use types, and regions within the energy-sharing community, future research can offer a more comprehensive insight into the potential effectiveness of the RL-based BESS scheduling model in a wider array of real-world settings. This expanded investigation will contribute valuable insights into the scalability and adaptability of the model, further enhancing its practical utility in the realm of energy management and distribution. It is worthy of further investigation to reflect the real world considering the uncertain variables (e.g., the volatility of the market price and the accelerated degradation of the BESS) so as to more accurately determine the optimal battery capacity in the energy-sharing community. In the near future, with the improved computational performance of the developed RL model, our focus also extends to a sensitivity analysis involving the various variables and long-term economic considerations to enhance the quality of the results.

Author contribution statement

Hyuna Kang: Writing - original draft, Investigation, Visualization, Conceptualization, Methodology, Validation, Supervision. Seunghoon Jung: Methodology, Visualization, Resources, Supervision, Project administration, Funding acquisition. Hakpyeong Kim: Investigation, Methodology, Writing - Review & Editing. Jongbaek An: Conceptualization, Writing - Review & Editing. Juwon Hong: Investigation, Resources, Writing - Review & Editing. Seungkeun Yeom: Conceptualization, Visualization. Taehoon Hong*: Methodology, Validation, Writing - Review & Editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT; Ministry of Science and ICT) (NRF-2021R1A3B1076769).

References

[1] Symeonidou MM, Zioga C, Papadopoulos AM. Life cycle cost optimization analysis of battery storage system for residential photovoltaic panels. J Clean Prod 2021;309:127234. https://doi.org/10.1016/J.JCLEPRO.2021.127234.
[2] Kang H, An J, Kim H, Ji C, Hong T, Lee S. Changes in energy consumption according to building use type under COVID-19 pandemic in South Korea. Renew Sustain Energy Rev 2021;148:111294. https://doi.org/10.1016/j.rser.2021.111294.
[3] Vand B, Ruusu R, Hasan A, Manrique Delgado B. Optimal management of energy sharing in a community of buildings using a model predictive control. Energy Convers Manag 2021;239:114178. https://doi.org/10.1016/j.enconman.2021.114178.
[4] KNREC. 2020 renewable energy white paper. South Korea: Korea Renewable Energy Center (KNREC); 2020.
[5] Sun W, Zou Y, Zhang X, Guo N, Zhang B, Du G. High robustness energy management strategy of hybrid electric vehicle based on improved soft actor-critic deep reinforcement learning. Energy 2022;258:124806. https://doi.org/10.1016/j.energy.2022.124806.
[6] Faraji J, Ketabi A, Hashemi-Dezaki H. Optimization of the scheduling and operation of prosumers considering the loss of life costs of battery storage systems. J Energy Storage 2020;31:101655. https://doi.org/10.1016/j.est.2020.101655.
[7] Bahramara S, Sheikhahmadi P, Golpîra H. Co-optimization of energy and reserve in standalone micro-grid considering uncertainties. Energy 2019;176:792–804. https://doi.org/10.1016/j.energy.2019.04.057.
[8] Zhang Y, Ma T, Elia Campana P, Yamaguchi Y, Dai Y. A techno-economic sizing method for grid-connected household photovoltaic battery systems. Appl Energy 2020;269:115106. https://doi.org/10.1016/j.apenergy.2020.115106.
[9] Lorenzi G, Silva CAS. Comparing demand response and battery storage to optimize self-consumption in PV systems. Appl Energy 2016;180:524–35. https://doi.org/10.1016/j.apenergy.2016.07.103.
[10] Nyholm E, Goop J, Odenberger M, Johnsson F. Solar photovoltaic-battery systems in Swedish households – self-consumption and self-sufficiency. Appl Energy 2016;183:148–59. https://doi.org/10.1016/j.apenergy.2016.08.172.
[11] Camilo FM, Castro R, Almeida ME, Pires VF. Economic assessment of residential PV systems with self-consumption and storage in Portugal. Sol Energy 2017;150:353–62. https://doi.org/10.1016/j.solener.2017.04.062.
[12] Jung S, Kang H, Lee M, Hong T. An optimal scheduling model of an energy storage system with a photovoltaic system in residential buildings considering the economic and environmental aspects. Energy Build 2020;209:109701. https://doi.org/10.1016/j.enbuild.2019.109701.
[13] Rallo H, Canals Casals L, De La Torre D, Reinhardt R, Marchante C, Amante B. Lithium-ion battery 2nd life used as a stationary energy storage system: ageing and economic analysis in two real cases. J Clean Prod 2020;272:122584. https://doi.org/10.1016/J.JCLEPRO.2020.122584.
[14] Abdulla K, De Hoog J, Muenzel V, Suits F, Steer K, Wirth A, et al. Optimal operation of energy storage systems considering forecasts and battery degradation. IEEE Trans Smart Grid 2018;9:2086–96. https://doi.org/10.1109/TSG.2016.2606490.
[15] van der Stelt S, AlSkaif T, van Sark W. Techno-economic analysis of household and community energy storage for residential prosumers with smart appliances. Appl Energy 2018;209:266–76. https://doi.org/10.1016/J.APENERGY.2017.10.096.
[16] Choi S, Min S-W. Optimal scheduling and operation of the ESS for prosumer market environment in grid-connected industrial complex. IEEE Trans Ind Appl 2018;54:1949–57. https://doi.org/10.1109/TIA.2018.2794330.
[17] Luo Y, Itaya S, Nakamura S, Davis P. Autonomous cooperative energy trading between prosumers for microgrid systems. 39th annual IEEE conference on local computer networks workshops. 2014. p. 693–6. https://doi.org/10.1109/LCNW.2014.6927722.
[18] Wainstein ME, Dargaville R, Bumpus A. Social virtual energy networks: exploring innovative business models of prosumer aggregation with virtual power plants. 2017 IEEE power & energy society innovative smart grid technologies conference (ISGT). 2017. p. 1–5. https://doi.org/10.1109/ISGT.2017.8086022.
[19] Agamah SU, Ekonomou L. Energy storage system scheduling for peak demand reduction using evolutionary combinatorial optimisation. Sustain Energy Technol Assessments 2017;23:73–82. https://doi.org/10.1016/j.seta.2017.08.003.
[20] Hong Z, Wei Z, Li J, Han X. A novel capacity demand analysis method of energy storage system for peak shaving based on data-driven. J Energy Storage 2021;39:102617. https://doi.org/10.1016/j.est.2021.102617.
[21] Ahsan SM, Khan HA, Hassan N, Arif SM, Lie T-T. Optimized power dispatch for solar photovoltaic-storage system with multiple buildings in bilateral contracts. Appl Energy 2020;273:115253. https://doi.org/10.1016/j.apenergy.2020.115253.
[22] Zhou Y, Cao S. Energy flexibility investigation of advanced grid-responsive energy control strategies with the static battery and electric vehicles: a case study of a high-rise office building in Hong Kong. Energy Convers Manag 2019;199:111888. https://doi.org/10.1016/j.enconman.2019.111888.
[23] Rodrigues DL, Ye X, Xia X, Zhu B. Battery energy storage sizing optimisation for different ownership structures in a peer-to-peer energy sharing community. Appl Energy 2020;262:114498. https://doi.org/10.1016/j.apenergy.2020.114498.
[24] Campana PE, Cioccolanti L, François B, Jurasz J, Zhang Y, Varini M, et al. Li-ion batteries for peak shaving, price arbitrage, and photovoltaic self-consumption in commercial buildings: a Monte Carlo Analysis. Energy Convers Manag 2021;234:113889. https://doi.org/10.1016/j.enconman.2021.113889.
[25] Medved S, Domjan S, Arkar C. Contribution of energy storage to the transition from net zero to zero energy buildings. Energy Build 2021;236:110751. https://doi.org/10.1016/j.enbuild.2021.110751.
[26] Sæther G, Crespo del Granado P, Zaferanlouei S. Peer-to-peer electricity trading in an industrial site: value of buildings flexibility on peak load reduction. Energy Build 2021;236:110737. https://doi.org/10.1016/j.enbuild.2021.110737.
[27] Cremi MR, Pantaleo AM, van Dam KH, Shah N. Optimal design and operation of an urban energy system applied to the Fiera Del Levante exhibition centre. Appl Energy 2020;275:115359. https://doi.org/10.1016/j.apenergy.2020.115359.
[28] Toubeau J-F, Bottieau J, Grève Z De, Vallée F, Bruninx K. Data-driven scheduling of energy storage in day-ahead energy and reserve markets with probabilistic guarantees on real-time delivery. IEEE Trans Power Syst 2021;36:2815–28. https://doi.org/10.1109/TPWRS.2020.3046710.
[29] Liu J, Cao S, Chen X, Yang H, Peng J. Energy planning of renewable applications in high-rise residential buildings integrating battery and hydrogen vehicle storage. Appl Energy 2021;281:116038. https://doi.org/10.1016/j.apenergy.2020.116038.
[30] Al-Kanj L, Nascimento J, Powell WB. Approximate dynamic programming for planning a ride-hailing system using autonomous fleets of electric vehicles. Eur J Oper Res 2020;284:1088–106. https://doi.org/10.1016/j.ejor.2020.01.033.
[31] Yang JJ, Yang M, Wang MX, Du PJ, Yu YX. A deep reinforcement learning method for managing wind farm uncertainties through energy storage system control and external reserve purchasing. Int J Electr Power Energy Syst 2020;119:105928. https://doi.org/10.1016/J.IJEPES.2020.105928.
[32] Hussain A, Bui VH, Kim HM. Deep reinforcement learning-based operation of fast charging stations coupled with energy storage system. Elec Power Syst Res 2022;210:108087. https://doi.org/10.1016/J.EPSR.2022.108087.
[33] Soares A, Geysen D, Spiessens F, Ectors D, De Somer O, Vanthournout K. Using reinforcement learning for maximizing residential self-consumption – results from a field test. Energy Build 2020. https://doi.org/10.1016/j.enbuild.2019.109608.
[34] Wang Y, Qiu D, Strbac G. Multi-agent deep reinforcement learning for resilience-driven routing and scheduling of mobile energy storage systems. Appl Energy 2022;310:118575. https://doi.org/10.1016/J.APENERGY.2022.118575.
[35] Parra D, Patel MK. Effect of tariffs on the performance and economic benefits of PV-coupled battery systems. Appl Energy 2016;164:175–87. https://doi.org/10.1016/J.APENERGY.2015.11.037.
[36] Hoppmann J, Volland J, Schmidt TS, Hoffmann VH. The economic viability of battery storage for residential solar photovoltaic systems – a review and a simulation model. Renew Sustain Energy Rev 2014;39:1101–18. https://doi.org/10.1016/J.RSER.2014.07.068.
[37] Li J. Optimal sizing of grid-connected photovoltaic battery systems for residential houses in Australia. Renew Energy 2019;136:1245–54. https://doi.org/10.1016/J.RENENE.2018.09.099.
[38] The 2050 Carbon Neutrality and Green Growth Commission. Carbon neutrality scenarios. Presidential Commission on Carbon Neutrality and Green Growth; 2022.
[39] Bellman RE. A Markovian decision process. Journal of Mathematics and Mechanics 1957;6:679–84.
[40] Yao L, Dong Q, Jiang J, Ni F. Deep reinforcement learning for long-term pavement maintenance planning. Comput Aided Civ Infrastruct Eng 2020;35:1230–45. https://doi.org/10.1111/mice.12558.
[41] Li S, Snaiki R, Wu T. A knowledge-enhanced deep reinforcement learning-based shape optimizer for aerodynamic mitigation of wind-sensitive structures. Comput Aided Civ Infrastruct Eng 2021;36:733–46. https://doi.org/10.1111/mice.12655.
[42] Quoilin S, Kavvadias K, Mercier A, Pappone I, Zucker A. Quantifying self-consumption linked to solar home battery systems: statistical analysis and economic assessment. Appl Energy 2016;182:58–67. https://doi.org/10.1016/j.apenergy.2016.08.077.
[43] Bortolini M, Gamberi M, Graziani A. Technical and economic design of photovoltaic and battery energy storage system. Energy Convers Manag 2014;86:81–92. https://doi.org/10.1016/j.enconman.2014.04.089.
[44] Cerino Abdin G, Noussan M. Electricity storage compared to net metering in residential PV applications. J Clean Prod 2018;176:175–86. https://doi.org/10.1016/j.jclepro.2017.12.132.
[45] Olaszi BD, Ladanyi J. Comparison of different discharge strategies of grid-connected residential PV systems with energy storage in perspective of optimal battery energy storage system sizing. Renew Sustain Energy Rev 2017;75:710–8. https://doi.org/10.1016/j.rser.2016.11.046.
[46] Jung S, Jeoung J, Kang H, Hong T. Optimal planning of a rooftop PV system using GIS-based reinforcement learning. Appl Energy 2021;298:117239. https://doi.org/10.1016/J.APENERGY.2021.117239.
[47] Sutton RS, Barto AG. Reinforcement learning: an introduction. Second edition. https://doi.org/10.1109/MED.2013.6608833; 2012.
[48] Ziebart BD, Maas A, Bagnell JA, Dey AK. Maximum entropy inverse reinforcement learning. Proceedings of the national conference on artificial intelligence. 2008.
[49] Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, et al. OpenAI Gym 2016;1–4.
[50] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms 2017;1–12.
[51] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. 35th International Conference on Machine Learning, ICML 2018 2018;5:2976–89.
[52] Fujimoto S, van Hoof H, Meger D. Addressing function approximation error in actor-critic methods. In: Dy J, Krause A, editors. Proceedings of the 35th international conference on machine learning, vol. 80. PMLR; 2018. p. 1587–96.
[53] Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, et al. Asynchronous methods for deep reinforcement learning. In: Balcan MF, Weinberger KQ, editors. Proceedings of the 33rd international conference on machine learning, vol. 48. New York, New York, USA: PMLR; 2016. p. 1928–37.
[54] Schulman J, Levine S, Moritz P, Jordan M, Abbeel P. Trust region policy optimization. 32nd international conference on machine learning, ICML 2015 2015;3:1889–97.
[55] Konda VR, Tsitsiklis JN. Actor-critic algorithms. Adv Neural Inf Process Syst; 2000.
[56] Schulman J, Moritz P, Levine S, Jordan MI, Abbeel P. High-dimensional continuous control using generalized advantage estimation. 4th International Conference on Learning Representations, ICLR 2016 - conference track proceedings, vols. 1–14; 2016.
[57] Statistics Korea. 2017 population and housing census. 2018.
[58] Korea New Renewable Energy Center. 2017 new & renewable energy statistics. 2018.
[59] LG Chem. RESU3.3 battery pack specification. 2017.
[60] Beaudin M, Zareipour H, Schellenberg A, Rosehart W. Energy storage for mitigating the variability of renewable electricity sources. Energy storage for smart grids: planning and operation for renewable and variable energy resources (VERs), vol. 14; 2014. p. 1–33. https://doi.org/10.1016/B978-0-12-410491-4.00001-4.
[61] Aneke M, Wang M. Energy storage technologies and real life applications – a state of the art review. Appl Energy 2016;179:350–77. https://doi.org/10.1016/j.apenergy.2016.06.097.
[62] Chatzivasileiadi A, Ampatzi E, Knight I. Characteristics of electrical energy storage technologies and their applications in buildings. Renew Sustain Energy Rev 2013. https://doi.org/10.1016/j.rser.2013.05.023.
[63] Korea Power Exchange. n.d.
[64] KEPCO. General terms and conditions. 2018.