Energy Management For Microgrids Using A Hierarchical Game-Machine Learning Algorithm
Abstract—This paper presents an energy management strategy for microgrids using a multiagent game-learning algorithm. The microgrid is powered by photovoltaic (PV) systems equipped with batteries and is intended to operate in islanded mode. The proposed energy management strategy is applied to wireless communication networks by addressing the tradeoff between the communication signal's quality of service (QoS) and energy availability. A two-layer algorithm combining a multiagent game and reinforcement learning (RL) is designed for base stations (BSs) in order to accomplish this goal. The proposed method improves the microgrid's performance and converges faster than a direct RL approach. The designed energy management algorithm was tested in multiple case studies.

Keywords—microgrid, reinforcement learning, game theory, distributed control, multi-agent system, optimization, communication system, load shaping

This work is supported and funded by the Hillman Foundation.

I. INTRODUCTION

A microgrid is a local, independently controlled electric system with distributed energy resources (DER), combined with storage devices and flexible loads. Such a system can be operated in a non-autonomous or an autonomous way, depending on whether it is connected to the main grid [1]. Thanks to its independent control and local DERs, a microgrid can power its local area when the main grid is interrupted during a natural disaster [2]. This feature has drawn much attention from designers seeking high system resiliency, such as those in the communications industry. Considering the crucial role played by communication networks, a few companies have already explored using renewable energy sources and microgrids to power communication facilities consisting of a group of base stations (BSs). A practical question arises as to how to utilize the stored energy in such a microgrid considering renewable resources' partially stochastic characteristics. Past research has focused on searching for an energy management strategy that solves this problem. One of the approaches is switching BSs on/off to minimize the total load demand, as discussed in [3, 4] and [5]. Other methodologies considering green energy availability and delay performance include GALA [6], IDEA [7], and TEA [8]. These approaches rely on a central controller to search for and broadcast the energy management strategy. The alternative is to distribute the decision-making process to the individual BS controllers. Such an approach formulates the original energy management problem as a multiagent optimization problem and aims at solving for an equilibrium. In [9], energy management is modeled as a multi-player game. However, the computation cost for a large-scale system might be too high and impractical [10]. A reinforcement learning algorithm was introduced in [11] so that the computation time is manageable. This reinforcement learning (RL) algorithm enables the BS to search for an operating point via trial and error but requires a few days of "training."

In this paper, we propose a hierarchical load response algorithm combining a virtual two-player game and an RL process. Applying this algorithm, the controllers in a microgrid solve the immediate load response as a two-player game and adjust the user's load model using an RL algorithm. Because a two-player game can be solved in polynomial time, the controllers obtain a reasonable load plan quickly in real time and gain the capability of adapting to an unknown environment with the aid of RL. Additionally, compared to a direct RL approach, the two-player game simplifies the action searching space; thus, the converging speed is higher. As the study results will show, this approach obtains an energy management strategy with performance comparable to an exhaustive heuristic search and a shorter training period than a direct RL algorithm.

II. MICROGRID ENERGY MANAGEMENT

An example of the proposed microgrid for communication networks is shown in Fig. 1. The microgrid consists of a photovoltaic (PV) power generator, three communication BSs, and battery units. The batteries are responsible for absorbing excess generated power or powering the load when the generated power is insufficient. The BSs in this microgrid utilize communication traffic shaping (CTS) to adjust their energy consumption [12]. In a general form, the load at each BS can be expressed as
$P_{BS}^i = r_t^i \left( P_b + P_c\,\delta(t) \right) = r_t^i P_L$    (1)

where $r_t^i$ is the ratio of the BS's load to the total microgrid load, $P_b$ is the base BS power, $P_c$ is the controllable power, a linear function of the traffic shaping factor (TSF) $\delta$, and $P_L$ is the total microgrid load demand. Generally, the higher $\delta$ is, the better the communication quality is. A traffic shaper controls ("shapes") the actual throughput (equivalent to the total volume of traffic) at the output of a communication system. Since shaping traffic entails a reduction of bit rate, it leads to an increased delay for data traffic or a higher compression ratio for interactive video or speech traffic. In an LTE base station, a radio frame is divided into minimum units of transmit resources called "resource blocks" (RB). Without traffic shaping, all ongoing calls require $RB_T$ resource blocks.

Fig. 1. Communication microgrid scheme.

$P_B = G_R - P_L$    (6)

Hence the battery energy output is

$\Delta E_{bat}(t) = E_R - E_L = \int_{t_0}^{t} P_B(x)\,dx$    (7)
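To make the power and energy balance of (1), (6), and (7) concrete, the following is a minimal numerical sketch, not the authors' simulation code. The one-hour time step, the placeholder PV profile, the aggregation of the three BSs into a single total load, and the clipping of the stored energy are illustrative assumptions; only the 24 kWh battery capacity and the 0.7 initial SoC come from the parameters reported later in Table I.

```python
# Sketch of the per-BS load split (1), battery power (6), and battery energy (7).
import numpy as np

def bs_load(r_i, p_base, p_ctrl, delta):
    """Per-BS load of (1): r_i * (P_b + P_c * delta) = r_i * P_L."""
    return r_i * (p_base + p_ctrl * delta)

def battery_energy(pv_kw, load_kw, dt_h=1.0, e0_kwh=0.7 * 24.0, cap_kwh=24.0):
    """Battery power (6) P_B = G_R - P_L, integrated as in (7) over hourly steps."""
    p_batt = np.asarray(pv_kw) - np.asarray(load_kw)      # (6)
    energy = e0_kwh + np.cumsum(p_batt) * dt_h            # (7)
    return np.clip(energy, 0.0, cap_kwh)                  # clipping is an added assumption

# Example: the whole fleet (r = 1) with three BSs at TSF delta = 0.6 over six hours.
pv = [0.0, 0.5, 1.2, 1.5, 0.8, 0.1]                       # assumed PV output (kW)
total_load = [bs_load(1.0, 0.2 * 3, 0.8 * 3, 0.6)] * 6    # base/controllable power from Table I
soc = battery_energy(pv, total_load) / 24.0
print("battery SoC trajectory:", np.round(soc, 3))
```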
cause the microgrid to shut down. Therefore, it is good practice to keep the microgrid batteries' SoC at a relatively high level in island mode. In this project, the performance of the microgrid is evaluated by an objective function measuring the weighted sum of communication quality and battery SoC distribution:

$obj(t) = w_{com}\, f_{com}(P_{BS}, t) + w_{SoC}\, P\big(SoC(t_d) > SoC_{goal}\big)$    (12)

where $f_{com}(P_{BS}, t)$ calculates the average normalized peak signal-to-noise ratio (PSNR) of the communication network computed by (3). The BS's goal is to search for a TSF strategy $\delta(t)$ that maximizes the objective function (12). Details on how $f_{com}(P_{BS}, t)$ is calculated can be found in [9].

The proposed algorithm consists of two parts: the immediate TSF decision game and the BS load-ratio learning. The scheme of the algorithm is shown in Fig. 2. In this section, the details of the two parts are explained.

Fig. 2. Game-learning algorithm scheme.

A. Two-player TSF game

Initially, the load demand (1) can be computed using the load and weather forecast information and modeled as a virtual two-player game similar to [9]. For a user with a load ratio $r_t^i$, it is assumed that a virtual user takes all the rest of the load, $(1 - r_t^i)(P_b + P_c\,\delta(t))$. Therefore, the objective function (12), which depends on the TSF choices of both the actual and virtual user, is

$obj\big(t, \delta_i(t), \delta_{-i}(t)\big) = w_{com}\, f_{com}\big(P_{BS}, t, \delta_i(t), \delta_{-i}(t)\big) + w_{SoC}\, P\big(SoC(t_d) > SoC_{goal} \mid \delta_i(t), \delta_{-i}(t)\big)$    (13)

where $\delta_{-i}$ indicates the virtual user's TSF action. Each user's objective is to maximize (13) considering the virtual player's possible action.

If a natural disaster hits the microgrid, the communications between BSs might be cut off, and components could be damaged. In this condition, the behavior modes of the BSs could differ. Some BSs may experience a surge in communication load demand, while other BSs may reduce their load consumption in order to conserve more stored energy. Therefore, the BSs in this microgrid do not necessarily share a common goal. Instead, it is safer for each BS to assume the worst scenario, in which its virtual player is trying to minimize its objective. With this assumption, the objective of the BSs becomes

$\max_{\delta_i(t)} \min_{\delta_{-i}(t)} obj\big(t, \delta_i(t), \delta_{-i}(t)\big)$    (14)

which makes it a zero-sum game. The solution of (14) is the same as the solution of a corresponding linear programming problem

Maximize    $z$
Subject to  $A^{T} x \ge z\,e$
            $e^{T} x - 1 = 0$    (15)
            $x \ge 0$

where $x$ is player I's strategy vector indicating the probabilities of playing its TSF actions, $e$ is a vector of ones with the same length as $x$, $A$ is the payoff table computed by deducing the players' choices of TSF as shown in Fig. 3, $z$ is the BS's expected payoff, and $x_i$ is its probability of choosing the $i$th TSF. This correspondence between the zero-sum game and linear programming is based on the connection between the Minmax Theorem and the Duality Theorem [16]. The linear programming problem can be solved by applying a simplex or interior-point algorithm [17].

Fig. 3. Example payoff table of a user.

                           Player II
                           P21: δ21               P22: δ22
Player I   P11: δ11        obj(t, δ11, δ21)       obj(t, δ11, δ22)
           P12: δ12        obj(t, δ12, δ21)       obj(t, δ12, δ22)
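As a concrete illustration of (15), the sketch below, which is not from the paper, builds the row player's linear program and solves it with SciPy's linprog. The function name solve_tsf_game is ours, and the payoff matrix is a random placeholder standing in for the obj values of Fig. 3, which the paper computes from (13).

```python
# Sketch: solving the two-player zero-sum TSF game (14)-(15) as a linear program.
import numpy as np
from scipy.optimize import linprog

def solve_tsf_game(A):
    """Return the BS's mixed TSF strategy x and its guaranteed payoff z."""
    m, n = A.shape                                # m BS actions, n virtual-player actions
    # Decision variables v = [x_1 ... x_m, z]; maximizing z <=> minimizing -z.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # A^T x >= z e  <=>  z - A^T x <= 0, one row per virtual-player action.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # e^T x = 1 so that x is a probability vector.
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]    # x >= 0, z free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Example with the TSF grid used in Section IV and an arbitrary payoff table.
tsf = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
A = np.random.default_rng(0).uniform(0.0, 1.0, size=(5, 5))  # placeholder obj values
x, z = solve_tsf_game(A)
print("mixed strategy over TSFs:", np.round(x, 3), "expected payoff:", round(z, 3))
```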
B. Load-ratio learning

During the learning process, the BSs search for their optimal load-ratio policies through interactions with their environment and adapt their decision-making process in a trial-and-error manner.
At first, each BS is given a load-ratio list

$L_r = [p_1, p_2, \ldots, p_M]$    (16)

where $p_i$ indicates the probability that the BS's load takes $i/M$ of the microgrid's total load. At time $t$, each BS randomly picks its load ratio according to its load-ratio list:

$r_t^i(t) = i/M$ with probability $p_i$    (17)

After making the load-ratio choice, the BS conducts the two-player game solving and observes the resulting system status change. Then, a reward is computed and given to each agent to update its load-ratio policy, and the above process repeats. This process is similar to that of a policy iteration, but the updating law and its converging objective are different due to the multi-agent arrangement [18]. The learning sequence of the agent is shown below:

1. At time $t$, the BS chooses a load ratio according to its load-ratio policy $L_r^i$. Suppose the load ratio taken is $r_t^i$.

2. After conducting the two-player game and solving for its solution, each BS applies the obtained TSF strategy vector $x$.

3. At the next LCI decision time $t+1$, the BSs collect the system status (SoC and PSNR information) and compute their payoffs. Suppose the reward of player $i$ is $r_i(t)$.

4. The BS updates its load-ratio policy according to the rule

$L_r^i(t+1) = L_r^i(t) + \beta\, r_i(t)\,\big(e_{a_i} - L_r^i(t)\big)$    (18)

where $0 < \beta < 1$ is a learning rate parameter and $e_{a_i}$ is a unit vector with its $r_t^i$th component equal to one. This algorithm is known as the linear reward-inaction algorithm ($L_{R-I}$). One critical feature of this learning algorithm is that convergence of the agents' strategies to a pure NE is guaranteed if the learning rate is sufficiently small, regardless of the number of players; that is, all probability vectors $L_r^i$ converge to unit vectors. Such convergence is in terms of probability, and mixed-strategy NEs are not proven to be stable. However, compared to other multi-agent approaches such as mixed games and multi-agent Q-learning, this algorithm is one of the few that guarantees a form of convergence [19]. A full description of the $L_{R-I}$ algorithm can be found in [20]. A modification made to this algorithm in this study is a set of limits on the load-ratio policy such that

$p_i \le p_{max} = 0.8 \quad \forall i$    (19)

$p_i \ge p_{min} = \dfrac{1 - 0.8}{M - 1} \quad \forall i$    (20)

These limits are set so that the agents are not trapped in local optima as the environment changes.

C. Design of reward function

The reward function has a critical influence on the BS's strategy obtained through RL. For example, if one action always leads to the highest payoff, the BS would develop a strategy of playing that action only. In this study, two sets of reward functions are given to the BSs depending on their available information.

1) Reward function with communication

If the communication network in the microgrid is functioning normally and no component is lost, the BSs can exchange their choices of TSFs and battery SoC status with each other. The original objective function (12) is then computable at every BS; thus, (12) works as the reward function:

$r_i = w_{com}\, f_{com}\big(P_{BS}, t, \delta_1(t), \ldots, \delta_N(t)\big) + w_{SoC}\, P\big(SoC(t_d) > SoC_{goal}\big)$    (21)

A penalty is given to the BS when the probability of reaching the SoC goal is too small, as shown in (22). This setting encourages the BS to pick a lower TSF so that more energy is saved when the energy situation is critical.

$r_i = 1 - \delta_i(t), \quad P\big(SoC(t_{end}) \ge SoC_{goal}\big) < 0.5$    (22)

2) Reward function with local information

If the communication link in the microgrid is damaged or malfunctioning, it is possible that some of the BSs are unable to observe the other BSs' moves and power-load profiles. In this situation, the disconnected BSs are limited to evaluating their performance with local information using the function shown in (23)

$r_i = w_{load}\,\dfrac{1}{1 + e^{-\delta_i(t)\,SoC(t)}} + w_{SoC}\,\dfrac{1}{1 + e^{-SoC(t)/\delta_i(t)}}$    (23)

which is an approximation of the original objective function, where $\delta_i(t)$ and $SoC(t)$ are the user's local TSF and battery SoC level. The first term, $\frac{1}{1 + e^{-\delta_i(t)\,SoC(t)}}$, is a normalized communication quality index, which gives the agent a high reward when both $SoC(t)$ and $\delta_i(t)$ are large. The second term, $\frac{1}{1 + e^{-SoC(t)/\delta_i(t)}}$, is an approximated energy availability index. Similar to the standard operation reward, a penalty is given when the SoC level is critical

$r_i = 1 - \delta_i(t), \quad SoC(t) \le SoC_{min}$    (24)
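The sketch below, which is not the authors' code, puts one load-ratio learning step together: the local-information reward of (23) with the penalty of (24), and the $L_{R-I}$ update (18) with the limits (19) and (20). The helper names, the SoC penalty threshold of 0.2, and the renormalization after clipping are illustrative assumptions; the weights and the initial policy come from Tables I and II.

```python
# Sketch: local reward (23)-(24) and one L_{R-I} load-ratio policy update (18)-(20).
import numpy as np

def local_reward(delta_i, soc, w_load=0.5, w_soc=0.5, soc_min=0.2):
    """Approximate objective (23); penalty reward (24) when the SoC is critical."""
    if soc <= soc_min:
        return 1.0 - delta_i                          # (24)
    comm = 1.0 / (1.0 + np.exp(-delta_i * soc))       # communication quality index
    avail = 1.0 / (1.0 + np.exp(-soc / delta_i))      # energy availability index
    return w_load * comm + w_soc * avail

def lri_update(policy, action_idx, reward, beta=0.1, p_max=0.8):
    """One L_{R-I} step: shift probability mass toward the chosen load ratio."""
    M = len(policy)
    p_min = (1.0 - p_max) / (M - 1)                   # lower limit as in (20)
    e = np.zeros(M)
    e[action_idx] = 1.0                               # unit vector e_{a_i} of (18)
    new_policy = policy + beta * reward * (e - policy)
    # Apply the limits (19)-(20); renormalizing afterwards is an added assumption.
    new_policy = np.clip(new_policy, p_min, p_max)
    return new_policy / new_policy.sum()

# One learning step starting from the initial policy of Table II.
policy = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
idx = np.random.default_rng(1).choice(len(policy), p=policy)  # sample a load ratio, per (17)
r = local_reward(delta_i=0.6, soc=0.7)
policy = lri_update(policy, idx, r)
print("updated load-ratio policy:", np.round(policy, 3))
```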
IV. ANALYSIS VERIFICATION

A case study applying this game-learning algorithm was conducted in MATLAB. The scheme of the simulated system coincides with the one in Fig. 1, which consists of three BSs. It is assumed that the users share the same fundamental parameters, listed in Table I. The users are asked to pick TSFs and update their load-ratio policies every hour. The initial load-ratio policy of a BS is shown in Table II, and the available TSFs of a BS are $\delta(t) \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$.

TABLE I. EVALUATION PARAMETER VALUES

Symbol    Parameter                                  Value
w_com     Communication quality weight               0.5
w_SoC     Energy availability weight                 0.5
E_bat     Battery fully charged energy               24 kWh
          Solar power expectation                    1 kW
          BS base load expectation                   0.2 kW
          BS traffic-dependent load expectation      0.8 kW
          Solar power variance                       4,000
          BS traffic-dependent load variance         4,000
          Desired battery SoC level                  0.8
          Initial battery SoC level                  0.7
BW        BS total bandwidth                         10 MHz
a         PSNR-bit rate curve parameter              10.4
          PSNR-bit rate curve parameter              -23.8
r         Nominal transmit bit rate                  2 Mbps

TABLE II. INITIAL LOAD-RATIO POLICY OF A USER

r_t^i     0.2    0.4    0.6    0.8    1
p_i       0.1    0.6    0.1    0.1    0.1

The PSNR and battery SoC of the overall system are demonstrated, and the system performance is compared to that of a direct RL algorithm and a heuristic algorithm with complete information. The exhaustive heuristic algorithm maximizes the objective function at each TSF decision time:

Maximize $obj\big(t, \delta(t)\big) \quad \forall t$    (25)

The day-averaged cumulated objective function evaluates the performance of a microgrid,

$eva = \dfrac{\sum_{t=1}^{end} obj(t)}{day}$    (26)

because such a value reflects the system's long-term behavior in terms of the objective function. This value is called the system performance index in the following context.
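For illustration, the following sketch, not from the paper, shows how the exhaustive heuristic baseline (25) and the performance index (26) can be evaluated. The objective function here is a placeholder lambda standing in for the full weighted objective of (12), and the hourly decision grid and 24-hour day follow the simulation setup described above.

```python
# Sketch: exhaustive heuristic TSF selection (25) and the performance index (26).
import numpy as np

TSF_GRID = [0.2, 0.4, 0.6, 0.8, 1.0]   # available TSFs from Section IV

def heuristic_tsf(objective, t):
    """Pick the TSF that maximizes the objective at decision time t, as in (25)."""
    return max(TSF_GRID, key=lambda delta: objective(t, delta))

def performance_index(obj_values, hours_per_day=24):
    """Day-averaged cumulated objective, the eva of (26)."""
    obj_values = np.asarray(obj_values)
    return obj_values.sum() / (len(obj_values) / hours_per_day)

# Example with a placeholder objective that simply prefers mid-range TSFs.
objective = lambda t, d: 1.0 - abs(d - 0.6)
schedule = [heuristic_tsf(objective, t) for t in range(48)]   # two-day horizon
eva = performance_index([objective(t, d) for t, d in enumerate(schedule)])
print("heuristic TSF schedule (first 5 h):", schedule[:5], "eva:", round(eva, 3))
```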
In the simulation, the system is operating normally and the BSs are free to communicate with each other. The system PSNR and battery SoC during a two-day simulation are shown in Fig. 4. As the result shows, the obtained system PSNR is highly related to the system SoC level, which is similar to the one obtained by the heuristic algorithm (25), as shown in Fig. 5. Both algorithms keep the system SoC above the desired goal (0.8), but the game-learning algorithm has a higher minimal PSNR. According to [14], a moderately good target for video stream quality is 37 dB PSNR, whereas 32 dB PSNR is considered acceptable. Also, the overall system performance of the game-learning algorithm computed by (26) is eva_gl = 18.5058, compared to eva_heu = 18.9941 for the heuristic one. Therefore, in the sense of the overall objective function, the game-learning algorithm has a performance similar to the exhaustive heuristic algorithm.

A comparison between the game-learning and a direct RL approach is also conducted under the normal condition. If the TSF strategy is found through a direct RL approach linking SoC states with the TSF actions, such as the one in [11], the system PSNR and SoC have the relationship shown in Fig. 6, with a system performance eva_rl = 10.8693, lower than that of the game-learning algorithm. This performance loss is mainly due to the exclusion of the time dimension from its searching space. Additionally, the direct RL approach demands a longer training period. With a learning rate of 0.1, the direct RL requires approximately 15 days until the system performance reaches a stable level, which can be observed from the training curve shown in Fig. 7. In comparison, when the same system is operated applying the game-learning
method, the obtained learning curve is shown in Fig. 8, which shows that the system starts with a higher performance index as well as a faster converging speed. The performance improvement comes from the game-solving process, which takes the power and load predictions into consideration, and the increased learning speed is likely caused by the smaller searching space: an agent applying direct RL has a two-dimensional SoC-TSF searching space, while an agent in the game-learning algorithm only needs to explore the one-dimensional load-ratio policy.

Fig. 6. LSR and system SoC applying direct RL, normal condition.
Fig. 7. Learning curve of a microgrid applying the direct RL algorithm.
Fig. 8. Learning curve of a microgrid applying the game-learning algorithm.

V. CONCLUSION

This paper presented a two-layer game-machine learning energy management mechanism for microgrids to optimize their energy usage. In this particular case, the analysis focused on a system applicable to wireless communication networks, but the same approach can be used in other applications with a partially controllable load. The simulation results show that the energy management strategy obtained by the game-learning algorithm performs better than applying reinforcement learning alone and is close to that of an exhaustive heuristic search algorithm. Benefiting from the reduced searching space, the converging speed of the proposed algorithm is higher than that of a direct reinforcement learning algorithm. Additionally, the algorithm showed strong resilience against system damage such as partial power loss and communication network failure. In the future, the performance of the adapting feature of the algorithm under a dynamic environment will be evaluated.

REFERENCES
[1] N. Hatziargyriou, Microgrids: Architectures and Control. John Wiley & Sons, 2013.
[2] M. Hanna, "When disasters strike distributed systems," vol. 15. Newton: King Content Company, 1995, p. 54.
[3] E. Oh, K. Son, and B. Krishnamachari, "Dynamic Base Station Switching-On/Off Strategies for Green Cellular Networks," IEEE Transactions on Wireless Communications, vol. 12, no. 5, pp. 2126-2136, 2013.
[4] N. Yu, Y. Miao, L. Mu, H. Du, H. Huang, and X. Jia, "Minimizing Energy Cost by Dynamic Switching ON/OFF Base Stations in Cellular Networks," IEEE Transactions on Wireless Communications, vol. 15, no. 11, pp. 7457-7469, 2016.
[5] A. Stavridis, S. Narayanan, M. D. Renzo, L. Alonso, H. Haas, and C. Verikoukis, "A base station switching on-off algorithm using traditional MIMO and spatial modulation," in 2013 IEEE 18th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), 2013, pp. 68-72.
[6] T. Han and N. Ansari, "Green-energy Aware and Latency Aware user associations in heterogeneous cellular networks," pp. 4946-4951: IEEE.
[7] D. Liu, Y. Chen, K. K. Chai, and T. Zhang, "Distributed delay-energy aware user association in 3-tier HetNets with hybrid energy sources," pp. 1109-1114: IEEE.
[8] V. Chamola, B. Krishnamachari, and B. Sikdar, "Green Energy and Delay Aware Downlink Power Control and User Association for Off-Grid Solar-Powered Base Stations," IEEE Systems Journal, vol. 12, no. 3, pp. 2622-2633, 2018.
[9] R. Hu, A. Kwasinski, and A. Kwasinski, "Mixed strategy load management strategy for wireless communication network micro grid," pp. 1-8: IEEE.
[10] C. Daskalakis and C. H. Papadimitriou, "Three-player games are hard," 2005.
[11] R. Hu and A. Kwasinski, "Energy management for microgrids using a reinforcement learning algorithm," in 2018 IEEE Green Energy and Smart Systems Conference (IGESSC), 2018.
[12] "Microgrids for disaster preparedness and recovery with electricity continuity plans and systems," Premium Official News, Plus Media Solutions, 2015.
[13] A. Kwasinski and A. Kwasinski, "Integrating cross-layer LTE resources and energy management for increased powering of base stations from renewable energy," pp. 498-505: IFIP.
[14] A. Kwasinski and A. Kwasinski, "The role of multimedia source codecs in green cellular networks," pp. 1-6: IEEE, 2016.
[15] S. Vandael, B. Claessens, D. Ernst, T. Holvoet, and G. Deconinck, "Reinforcement Learning of Heuristic EV Fleet Charging in a Day-Ahead Electricity Market," IEEE Transactions on Smart Grid, vol. 6, no. 4, pp. 1795-1805, 2015.
[16] S. Homer and A. L. Selman, Computability and Complexity Theory, 2nd ed. New York: Springer, 2011.
[17] H. Karloff, Linear Programming (Progress in Theoretical Computer Science). Boston: Birkhäuser, 1991.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[19] L. Buşoniu, R. Babuška, and B. De Schutter, "Multi-agent Reinforcement Learning: An Overview," vol. 310. Berlin, Heidelberg: Springer, 2010, pp. 183-221.
[20] P. S. Sastry, V. V. Phansalkar, and M. A. L. Thathachar, "Decentralized learning of Nash equilibria in multi-person stochastic games with incomplete information," IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, no. 5, pp. 769-777, 1994.