
IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, VOL. 5, NO. 1, MARCH 2021

A Reinforcement Learning Approach to Optimize Energy Usage in RF-Charging Sensor Networks

Honglin Ren and Kwan-Wu Chin

Manuscript received June 24, 2020; revised September 29, 2020 and November 16, 2020; accepted November 18, 2020. Date of publication November 23, 2020; date of current version March 18, 2021. The editor coordinating the review of this article was J. Chen. (Corresponding author: Honglin Ren.) The authors are with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]). Digital Object Identifier 10.1109/TGCN.2020.3039840

Abstract—We consider a Radio Frequency (RF)-charging network where sensor devices harvest energy from a solar-powered Hybrid Access Point (HAP) and transmit their data to the HAP. We aim to optimize the power allocation of both the HAP and devices to maximize their Energy Efficiency (EE), which is defined as the total received data (in bits) for each Joule of consumed energy. Unlike prior works, we consider the case where both the HAP and devices have causal knowledge of channel state information and their energy arrival process. We model the power allocation problem as a Two-layer Markov Decision Process (TMDP), and propose a decentralized Q-Learning (QL) solution that employs linear function approximation to represent the large state space. The simulation results show that when the HAP and devices employ our solution, their EE is orders of magnitude higher than competing policies.

Index Terms—Wireless sensor networks, energy harvesting, Q-learning, transmit power control.

Fig. 1. An example RF-energy harvesting network. The gray frame represents a bad channel condition. The dotted arrows show an energy transmission. The solid arrow indicates a data transmission.

I. INTRODUCTION

Sensor devices form the perception layer of upcoming Internet of Things (IoT) networks, where they are used to monitor or effect an environment [1]. Specifically, they will be relied upon to realize smart cities and smart manufacturing [2]. A challenging issue faced by these sensor devices is their limited energy, which affects their operation time [3]. In this respect, a promising energy source is Radio Frequency (RF) energy transfer [4], [5], which brings about three advantages: 1) far-field charging; 2) data transmissions and charging operate over the same frequency; and 3) multiple devices are able to harvest energy simultaneously from a single transmission. For example, in [4], the authors demonstrated a prototype with a camera that is powered by access points in a WiFi network.

Apart from that, energy transmitters such as Power Beacons (PBs) or Hybrid Access Points (HAPs) that are used to charge sensor devices may have an energy constraint. This is because operators are now aiming to reduce the carbon footprint or operating expenditure of their network [6] by deploying solar-powered PBs/HAPs [7].

To this end, we consider a network comprising a solar-powered HAP that charges one or more sensing devices with RF-energy harvesting capability; see Figure 1. The HAP is responsible for charging and collecting data from devices. To achieve this, the HAP uses a time frame that has an energy and a data transfer phase. After charging devices in the energy transfer phase, devices transmit their data to the HAP in a Time Division Multiple Access (TDMA) manner. Our aim is to optimize these two phases by maximizing the EE of the HAP and devices, where EE is defined as the total received data per Joule of consumed energy.

To illustrate how EE can be maximized, consider Figure 1. In each frame, assume the HAP receives 1 Joule of energy. For ease of exposition, assume there are only two channel states: bad or good; note, our problem considers non-negative real-valued channel gains. Charging/transmission is only successful when the channel is in the good state. Assume that when the channel state is good, devices harvest 1 μJ of energy, and a bit costs 1 μJ to transmit. Consider two strategies: 1) strategy-1, where the HAP and devices transmit in both frame t and t+1, and 2) strategy-2, where the HAP and devices only transmit when the channel state is good; i.e., frame t+1. For strategy-1, over two frames, the HAP collects 1 bit using 2 Joules of energy because the channel state in frame t is bad, meaning the energy used for charging is wasted. On the other hand, if the HAP uses strategy-2, it collects 2 bits over two frames. This is because the HAP stores energy when the channel state is bad, and charges devices and collects data when the channel is good.
The previous example shows that a HAP and devices should only transmit when the channel condition is favourable. However, in practice, the problem is challenging because the HAP and devices have only causal knowledge of energy arrivals and CSI; i.e., they only know current and past information of the said quantities. Moreover, the HAP may experience battery overflow if it does not transmit frequently. To this end, this article makes the following contributions:

• We present a Two-layer Markov Decision Process (TMDP) followed by a Two-layer Q-learning (TQL) approach; these approaches, referred to respectively as TQL-1 and TQL-2, allow the HAP and devices to learn the optimal transmit power policy in each time frame based on their current energy level. In TQL-2, their state space also incorporates the current channel gain. Another innovation is the use of novel features, as part of the Linear Function Approximation (LFA) framework [8], to evaluate each action or transmission power based on the current state of the HAP or devices. We note that although we consider a solar-powered HAP, our solution continues to work because it does not assume any specific energy arrival distribution; hence, it is also suitable for use if the HAP is equipped with other types of renewable energy sources.

• Previous works, see Section II, assume non-causal knowledge, whereby nodes are aware of future channel gain information, which is impractical. Apart from that, studies that consider causal CSI only consider the worst case or robust setting, or aim to maintain data/energy queue stability. In contrast, we consider a distributed learning approach that adapts to stochastic energy arrivals and channel gains, and assumes only causal knowledge.

• The simulation results show TQL-1 and TQL-2 achieve respectively more than 48 and 389 times the EE achieved by random and greedy strategies. Moreover, we found that decaying the learning rate as opposed to fixing it to a constant results in EE that is at least 12.97% higher. This accelerates the training process at the start of training, and as the learning rate decays, the agent can explore more possible actions before converging to a solution. In addition, our approaches are capable of maximizing EE while meeting the minimum throughput requirement of devices. Interestingly, with larger variation in channel condition or an increase in the maximum transmit power of the HAP, both our approaches obtained better EE due to higher sum rates. When we increase the solar panel size at the HAP, meaning it harvests more energy, the HAP uses a high transmission power to avoid energy overflow, which results in a high data rate but lower EE.

Next, we first review prior works in Section II before formalizing our system model in Section III. After that, Section IV presents our problem formally. Section V outlines TQL-1 and TQL-2. Section VI contains evaluation of our approaches. Section VII concludes this article.

II. RELATED WORKS

There are many studies that aim to maximize EE in Wireless Powered Communication Networks (WPCNs) [9]. To the best of our knowledge, there are no works that consider a solar-powered HAP operating in a WPCN whereby all nodes are able to learn the best transmission power to maximize EE. Critically, we consider a distributed approach.

Prior works such as [10], [11], [12], and [13] use mathematical optimization to optimize EE. In particular, many works rely on the Dinkelbach method [14]. For example, the authors of [10] optimize sub-carrier allocation to avoid interference and also to determine the transmit power of users. Wu et al. [11] jointly determine the transmission time of users and the transmit power at the HAP and devices. Zewde and Gursoy [12] determine the time interval for energy transfer and data communication while avoiding packet loss due to data buffer overflow. In [13], the authors investigate a Time Division Mode Switching (TDMS) scheme where users simultaneously transmit data to a HAP. The problem is to optimize the charging interval and transmit power of users. In [15], the authors consider a wireless powered cellular network, where a user and a device-to-device pair cooperatively transmit data to the HAP. The problem is to determine the beamforming weights used by the HAP, and the transmit power and transmission interval of users. Song and Zheng [16] employ Particle Swarm Optimization (PSO) to determine the charging interval and transmit power of users in a Non-Orthogonal Multiple Access (NOMA)-based WPCN. These works outline a centralized solution and require global information or perfect CSI.

In summary, our work has the following differences to prior works on optimizing the EE of a WPCN: 1) it considers causal knowledge of channel gain and energy arrivals; 2) it uses a solar-powered HAP to charge devices; and 3) it proposes and studies a distributed, learning solution. To date, there are only a few works that have considered causal channel gain information when optimizing the EE of a system. For example, Hu and Yang [17] dynamically determine the transmission time and transmit power of each user while maintaining data queue stability. In [18], the authors consider deriving a schedule for sensor nodes to avoid collision. Unlike [17] and [18], we consider a learning solution that allows the HAP and devices to adapt their transmit power dynamically. We also note that past works have not considered a solar-powered HAP. Moreover, prior solutions are optimized for specific scenarios and have no learning capability. There are also works such as [19] that consider estimating channel gains in order to improve transmissions of data. However, they do not consider optimizing energy efficiency in a multi-user system with both energy harvesting devices and a HAP.

III. SYSTEM MODEL

Table I summarizes our notations. We consider an RF-charging network that is composed of a HAP and a set of wireless powered devices N = {1, . . . , |N|}, where |N| is the total number of devices and each device is indexed by i ∈ N. Fig. 2 shows that the HAP harvests energy from solar and uses it for charging devices and receiving data from devices.
TABLE I. Common Nomenclature.

Fig. 2. A RF-charging network.

Fig. 3. The structure of a frame.

Devices always have data to transmit and no Quality of Service (QoS) requirement, where they are tolerant of delays resulting from deferred transmissions or energy outage.

We assume Time Division Multiple Access (TDMA), where time is divided into T frames. Each frame with length T̄ is indexed by t, where t = 1, . . . , T. As shown in Fig. 3, each frame has a charging slot and |N| data slots. Each data slot has duration τ; in practice, this duration is set to the coherence time of the channel [20]. The charging slot has duration τ_0 = (T̄ − |N| · τ). The HAP broadcasts energy to devices in the charging slot and each device transmits data to the HAP in its allocated slot. Note, in this work we consider the HAP and each device to have causal CSI knowledge.

The solar energy arrival process at the HAP is governed by a Hidden Markov Model (HMM) with parameters determined by the solar irradiance data collected in the month of June from 2008 to 2010, at a solar site located in Elizabeth City State University [21]. We assume a non-zero energy arrival process, i.e., during day time, meaning the HAP does not experience any energy outage. The HAP is only aware of its current and past energy arrivals. Let ě_0^t denote the energy harvested by the HAP in frame t. The HAP has a rechargeable battery with capacity Ē_0,max. Its current amount of energy in frame t is denoted by e_0^t. Let p_max be the HAP's maximum transmit power, where its transmit power in time slot t is denoted as p_0^t. We also consider circuit power consumption during transmissions. Let p_c be the circuit power at the HAP. Define o_0^t as the amount of energy wasted at the HAP due to overflow. Formally, we have,

$o_0^t = e_0^t + \check{e}_0^t - (p_0^t + p_c)\,\tau_0 - \bar{E}_{0,\max}$.    (1)

The uplink and downlink channel gain in frame t between a device i and the HAP is denoted as ĝ_i0^t and ǧ_0i^t, respectively. We assume block fading where ĝ_i0^t and ǧ_0i^t are fixed in a given frame but vary across frames [20]. We consider the Log-distance path loss model, defined by

$PL_{d_0 \to d_i}(\mathrm{dB}) = PL(d_0) + 10\,n \log_{10}\!\left(\frac{d_i}{d_0}\right) + \chi$,    (2)

where PL(d_0) is the path loss at a reference distance d_0, and d_i is the distance from the HAP to device i. The path loss exponent is denoted as n, and the term χ models small scale fading that is modeled as a zero-mean Gaussian distributed variable with a standard deviation σ. Note, channel estimation is complementary to our work; any energy efficient methods can be used to acquire the CSI to/from devices; e.g., [20]. In this article, we do not consider the energy cost of channel estimation. This can be modeled by allocating some of the energy harvested by the HAP or devices for channel estimation at the start of each frame. Note, this has the effect of scaling our results and it does not change our conclusion or findings.
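For concreteness, the following Python sketch draws a per-frame channel gain from the log-distance model of Equ. (2). The reference distance d0 and the path loss at d0 are illustrative assumptions; the paper only fixes n = 2.5 and σ = 6 dB in its evaluation.

import math
import random

def channel_gain(d_i, pl_d0_db=30.0, d0=1.0, n=2.5, sigma_db=6.0):
    """Sample a linear-scale channel gain from the log-distance model, Equ. (2).

    pl_d0_db and d0 are placeholder values, not settings from the paper.
    """
    chi = random.gauss(0.0, sigma_db)                 # small-scale fading term (dB)
    pl_db = pl_d0_db + 10.0 * n * math.log10(d_i / d0) + chi
    return 10.0 ** (-pl_db / 10.0)                    # path loss (dB) -> linear gain

# Example: block-fading gain for a device 4 m from the HAP in one frame.
g = channel_gain(4.0)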
Denote the harvested energy at device i in frame t as ě_i^t. The value of ě_i^t is determined by the transmission power of the HAP p_0^t and the channel gain ǧ_0i^t. Let p̌_i^t = p_0^t · ǧ_0i^t be the incident power at the antenna of a device. The received power η(p̌_i^t) at device i is a non-linear function of its incident power [22]. Formally, we have ě_i^t = τ_0 · η(p̌_i^t). Each device has a rechargeable battery with capacity Ē_max. The transmission power p_i^t of device i in frame t depends on its available energy e_i^t, where initially we have e_i^0 = 0. Specifically, if p_i^t = 0 then device i does not transmit in frame t. Otherwise, it transmits at power p_i^t, which is bounded by its available energy. Also, the harvested energy in frame t is available in frame t+1 [23]. Denote p̌_c as the circuit power at devices. The energy level e_i^{t+1} at the beginning of frame t+1 is expressed as

$e_i^{t+1} = e_i^t + \check{e}_i^t - (p_i^t + \check{p}_c)\,\tau \le \bar{E}_{\max}$.    (3)
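As a minimal sketch of the energy bookkeeping in Equ. (1) and (3), the Python below computes the HAP's overflow and a device's next battery level for one frame. The harvesting function eta() is a stand-in for the non-linear harvester model of [22], which the paper does not spell out, and the clipping of the overflow at zero is our interpretation.

def hap_overflow(e0, e0_arrival, p0, p_c, tau0, E0_max):
    """Energy wasted at the HAP due to overflow, Equ. (1), clipped at zero."""
    return max(0.0, e0 + e0_arrival - (p0 + p_c) * tau0 - E0_max)

def device_next_energy(e_i, p0, g_0i, p_i, p_c_dev, tau, tau0, E_max, eta):
    """Next battery level of device i, Equ. (3); eta() maps incident power to
    received power (placeholder for the non-linear harvester model [22])."""
    e_harv = tau0 * eta(p0 * g_0i)                  # energy harvested this frame
    e_next = e_i + e_harv - (p_i + p_c_dev) * tau   # spend transmit + circuit energy
    return min(e_next, E_max)                       # battery capped at E_max

# Toy usage with an assumed linear 50% conversion efficiency as eta().
e_next = device_next_energy(e_i=0.1, p0=1.0, g_0i=1e-3, p_i=0.01,
                            p_c_dev=0.001, tau=0.1, tau0=0.5, E_max=1.0,
                            eta=lambda p_in: 0.5 * p_in)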
In frame t, device i transmits a total of d_i^t amount of data, which is defined as

$d_i^t = \tau\, B \log_2\!\left(1 + \frac{p_i^t\, \hat{g}_{i0}^t}{n_0}\right)$,    (4)

where B is the channel bandwidth, and n_0 is the noise power. In the above expression, d_i^t is only non-zero for a device i when it decides to transmit; i.e., p_i^t > 0 in frame t. Note that Equ. (4) represents the asymptotic or upper bound capacity of device i's uplink. In practice, the HAP simply notes the amount of bits received from each device at the end of each frame. The total received data d̄ at the HAP is defined as

$\bar{d} = \sum_{t=1}^{T}\sum_{i=1}^{|N|} d_i^t$.    (5)

The performance metric EE Φ_EE is defined as the ratio between the sum rate and the total amount of energy consumed at the HAP. Formally, we have,

$\Phi_{EE} = \frac{\bar{d}}{\sum_{t=1}^{T} (p_0^t + p_c)\,\tau_0}$.    (6)
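To make the metrics concrete, here is a short Python sketch of Equ. (4)-(6); the bandwidth, noise power and the numbers in the usage example are placeholders, not the paper's simulation settings.

import math

def rate_bits(p_i, g_i0, tau=0.1, B=1e6, n0=1e-12):
    """Data transmitted by device i in its slot, Equ. (4); zero if it stays idle."""
    return tau * B * math.log2(1.0 + p_i * g_i0 / n0) if p_i > 0 else 0.0

def hap_ee(data_per_frame, p0_per_frame, p_c, tau0):
    """EE at the HAP over T frames, Equ. (5)-(6): total bits / HAP energy."""
    total_bits = sum(sum(frame) for frame in data_per_frame)      # Equ. (5)
    total_energy = sum((p0 + p_c) * tau0 for p0 in p0_per_frame)
    return total_bits / total_energy

# Two frames, two devices (bits per device), and the HAP's power per frame.
ee = hap_ee(data_per_frame=[[1e3, 2e3], [0.0, 5e3]],
            p0_per_frame=[0.5, 1.0], p_c=0.1, tau0=0.5)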
IV. THE PROBLEM

We aim to jointly determine the transmit power of the HAP and devices. Ideally, the HAP aims to use the minimum transmission power to charge devices that yields the maximum amount of data from devices. We divide the problem into two parts; namely, 1) the HAP determines its transmission power p_0^t in each frame, and 2) each device determines its transmission power p_i^t in its assigned data slot. Next, we introduce the problem at the HAP and devices separately.

The objective of the HAP is to maximize EE over a given T horizon. Formally, we have

$\max_{p_0^t}\; \mathbb{E}\!\left[\frac{\sum_{t=1}^{T}\sum_{i=1}^{|N|} d_i^t}{\sum_{t=1}^{T}(p_0^t + p_c)\,\tau_0}\right]$
$\text{s.t.}\quad p_0^t \le p_{\max},\; \forall t \in T$
$\quad\quad (p_0^t + p_c)\,\tau_0 \le e_0^t,\; \forall t \in T$
$\quad\quad e_0^t + \check{e}_0^t - (p_0^t + p_c)\,\tau_0 - o_0^t \le \bar{E}_{0,\max},\; \forall t \in T$.    (7)

The expectation is taken with respect to the random channel gains between the HAP and each device, and the energy arrival rate at the HAP over T frames.

Our problem is challenging because the HAP has causal CSI and energy arrival information. If it transmits during a bad channel condition, it experiences a high energy loss, which lowers EE. In this case, the HAP should defer its transmission or avoid using a high transmission power. However, doing so may lead to energy overflow.

Devices aim to use their RF energy efficiently to maximize the amount of data transmitted to the HAP. For a device i, define Φ_EEi as

$\Phi_{EEi} = \frac{\sum_{t=1}^{T} d_i^t}{\sum_{t=1}^{T}(p_i^t + \check{p}_c)\,\tau}$.    (8)

In words, the EE at device i is its sum rate d_i^t divided by its consumed energy over T frames. The objective is to maximize Φ_EEi. Formally, we have

$\max_{p_i^t}\; \mathbb{E}\!\left[\Phi_{EEi}\right]$
$\text{s.t.}\quad (p_i^t + \check{p}_c)\,\tau \le e_i^t,\; \forall t \in T$
$\quad\quad e_i^t + \check{e}_i^t - (p_i^t + \check{p}_c)\,\tau \le \bar{E}_{\max},\; \forall t \in T$.    (9)

Similar to the HAP, the challenge at each device is that it does not know the channel condition and its energy arrival process.

V. TWO-LAYER Q-LEARNING (TQL)

We apply Q-Learning [24] to solve our problem. In particular, an agent interacts with the environment to learn the best action for a given state over time. Advantageously, it does not require the probability distribution of states; i.e., it is a so-called model-free approach [8]. For a given state, an agent determines a policy or an action that results in it receiving a high expected value, which is calculated using the immediate reward and the value of future states.

Fig. 4 provides an overview of our two-layer approach; the HAP is located at the upper layer, and devices form the lower layer. At both layers, agents decide the transmission power of nodes. We run a Q-Learning (QL) agent at the HAP and at each device [8]. At the beginning of frame t, the HAP decides a transmit power p_0^t to charge devices. Each device then selects a transmit power p_i^t to transmit its data to the HAP. As will become clear later, the goal of both agents is to learn a policy that allows them to select the best transmit power given their available energy. In particular, they will use the feedback received after taking each action in each state to reinforce or discourage an action.

In the following sections, we introduce MDP and QL. After that, we model the power allocation problem as a two-layer MDP (TMDP) [25]. Then, in Section V-C, we explain LFA. We then outline TQL algorithms used to determine the optimal policy at each layer.

A. Markov Decision Process (MDP) and Q-Learning

An MDP is defined as a tuple ⟨A, S, T, R⟩, where S is a set of states; A is a set of actions; R is a set of rewards; and T is the transition probability from one state to another. In each state s^t ∈ S, an agent takes an action a^t ∈ A and obtains a reward r^t ∈ R. Then the state s^t transitions to s^{t+1} as per the transition model T. A policy π is a mapping from a state s^t to an action a^t, i.e., a^t = π(s^t) [8].

To evaluate a policy π, we define a value function V^π(s^t). In particular, V^π(s^t) returns the expected reward when an agent follows policy π from state s^t and in all subsequent states. Let π* denote the optimal policy, meaning π* yields the maximum expected reward V^{π*}(s^t) in all states s^t ∈ S. As shown in [26], the policy π* can be found via the Value Iteration algorithm for a model-based MDP.

For a model-free MDP, one can use Q-learning to attain the optimal policy π* when the state transition model is unknown. In the Q-learning framework [24], an agent interacts with the environment when taking an action a^t in state s^t. It then obtains a reward r^t and observes the state move to s^{t+1}. In this case, the agent maintains an action-value function Q^π(s^t, a^t), where Q^π(s^t, a^t) returns the expected reward for taking action a^t from s^t and following policy π thereafter. Let Q^{π*}(s^t, a^t) be the optimal action-value function that corresponds to π*. The aim for an agent is to approximate the optimal Q^{π*}(s^t, a^t) value from each Q^π(s^t, a^t) during interaction with the environment. Formally, an action-value function Q^π(s^t, a^t) is defined as,

$Q^{\pi}(s^t, a^t) = \mathbb{E}\!\left[\sum_{k=1}^{\infty} \gamma^{k-1}\, r^{t+k} \,\middle|\, s^t, a^t\right]$,    (10)

where γ ∈ [0, 1) is a discount factor. We are now ready to introduce the update rule for Q^π(s^t, a^t). In the Q-learning framework [8], an agent selects an action a^t when it is in state s^t, and obtains a reward r^t. It then moves to state s^{t+1} and estimates the next highest value Q^π(s^{t+1}, a^{t+1}) over available actions a^{t+1} ∈ A. In this case, the agent updates Q^π(s^t, a^t) based on its current reward r^t and the next estimated value Q^π(s^{t+1}, a^{t+1}). Therefore, Q^π(s^t, a^t) is updated as,

$Q^{\pi}(s^t, a^t) \leftarrow Q^{\pi}(s^t, a^t) + \alpha\left[r^t + \gamma \max_{a^{t+1} \in A} Q^{\pi}(s^{t+1}, a^{t+1}) - Q^{\pi}(s^t, a^t)\right]$,    (11)

where α ∈ [0, 1] is a step size that controls the learning rate.
Fig. 4. A two-layer MDP model.

B. A Two-Layer MDP Model

At the upper layer or HAP, its MDP model is defined as:

• State S_0: in frame t, the state s_0^t for the HAP is defined by its battery level e_0^t and the channel condition ǧ_0i^t. The battery level is assumed to be continuous; i.e., e_0^t ∈ [0, Ē_0,max].

• Action A_0: the HAP has a finite action set A_0 = {a_0^t = p_0^t | p_0^t ∈ {0, κ_0, 2κ_0, . . . , P̄_0,max}}, where P̄_0,max is the maximum transmit power and κ_0 = 0.02 · P̄_0,max is a step size used to discretize the transmission power. If the battery level is lower than the minimum transmit power κ_0 Watts, the only action is not to transmit.

• State transition T_0: the transition function models the probability of a state transition to s_0^{t+1} when the arrived energy at the HAP is ě_0^t and action a_0^t is taken in state s_0^t. We assume a model-free setting, and hence, the HAP is not aware of the probability distribution that governs the transition between states.

• Reward R_0: We define a reward for each of the following scenarios: 1) no channel information is available, and 2) the HAP has the current channel condition ǧ_0i^t. The reward r_0^t for scenario 1) is defined as,

$r_0^t = \begin{cases} \Phi_{EE}, & p_0^t \neq 0 \\ -0.00001, & p_0^t = 0 \wedge o_0^t > 0 \\ 0, & \text{otherwise}, \end{cases}$    (12)

where Φ_EE is the EE as defined in Equ. (6). The HAP obtains a negative reward when it uses no energy in a frame that causes battery overflow. For scenario 2), using current channel information, the HAP evaluates the received energy at each device ě_i^t. Define ζ as,

$\zeta = \frac{\sum_{t=1}^{T}\sum_{i=1}^{|N|} \check{e}_i^t}{\sum_{t=1}^{T}(p_0^t + p_c)\,\tau_0}$.    (13)

In words, ζ is the charging efficiency of the HAP, which is equal to the ratio between the received energy at all devices over the amount of energy consumed by the HAP. The reward r_0^t for scenario 2) is,

$r_0^t = \begin{cases} \zeta, & p_0^t \neq 0 \\ -0.00001, & p_0^t = 0 \wedge o_0^t > 0 \\ 0, & \text{otherwise}. \end{cases}$    (14)

At each device, the MDP is defined as follows:

• State S_i: In frame t, the state s_i^t for device i is defined by its battery level e_i^t and the channel condition ĝ_i0^t. Similarly, the battery level can take any value from a continuous range, which is e_i^t ∈ [0, Ē_max]. Therefore, the set S_i contains an infinite number of states.

• Action A_i^t: Each device i has the following actions: A_i^t = {a_i^t = p_i^t | p_i^t ∈ {0, κ_i, 2κ_i, . . . , e_i^t/τ}}, where e_i^t/τ represents the highest available transmission power, and κ_i = 0.02 · e_i^t/τ is a step size to discretize the transmission power. Note, if the battery level is lower than the minimum transmission power, i.e., κ_i Watts, the only action is not to transmit.

• State transition T_i: The transition of state s_i^t to s_i^{t+1} occurs when device i takes an action a_i^t and the harvested energy is ě_i^t. We assume a model-free approach where devices are unaware of the probability distribution that determines the amount of arrived energy.

• Reward R_i: the reward for frame t is r_i^t = Φ_EEi, where Φ_EEi is the EE of a device; see Equ. (8).
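The following Python sketch shows one way the HAP's discretized action set and the rewards of Equ. (12) and (14) could be built. The helper names and surrounding bookkeeping are our own, not part of the authors' implementation.

def hap_actions(battery, p0_max, step_frac=0.02):
    """Discrete action set A_0: multiples of kappa_0 = 0.02 * P0_max; only
    'do not transmit' is allowed when the battery is below the minimum power."""
    kappa0 = step_frac * p0_max
    if battery < kappa0:
        return [0.0]
    return [k * kappa0 for k in range(int(p0_max / kappa0) + 1)]

def hap_reward(p0, overflow, phi_ee, use_charging_efficiency=False, zeta=0.0):
    """Reward r_0^t: Equ. (12) when no CSI is available (return Phi_EE), or
    Equ. (14) when the current CSI is available (return zeta)."""
    if p0 != 0:
        return zeta if use_charging_efficiency else phi_ee
    return -0.00001 if overflow > 0 else 0.0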
C. Q-Learning With LFA

Note that the HAP and each device has a continuous state space, which results in a large state-action space that is computationally intractable. To this end, we use LFA and outline novel features to represent the state-action space. Specifically, we employ a parametric function to approximate Q^π(s^t, a^t). Advantageously, LFA allows both the HAP and devices to find the global solution [27]. Another reason for adopting LFA is that it requires much less computational resources as compared to using a neural network, thus making it ideal for use on an energy constrained device [27].

LFA uses a linear combination of M feature functions f_m(s^t, a^t), m = 1, 2, . . . , M, to map each action-value pair (s^t, a^t) into a feature value. Let f be an M × 1 array of feature values. Let θ ∈ R^{M×1} be a column vector containing a set of weights that indicate the contribution of feature values. Therefore, Q^π(s^t, a^t) is approximated as the weighted sum of feature values [8], i.e.,

$Q^{\pi}(s^t, a^t) \approx \tilde{Q}^{\pi}(s^t, a^t, \theta) = f^{T}\theta$,    (15)

where Q̃^π(s^t, a^t, θ) is the approximated action-value function, and T denotes the transpose of a matrix. The vector θ controls the linear combination of feature functions to approximate Q^π(s^t, a^t). To minimize the difference between Q̃^π(s^t, a^t, θ) and Q^π(s^t, a^t), we use the Gradient Descent algorithm to update θ. Specifically, when an agent takes an action a^{t+1} in state s^{t+1}, and obtains a reward r^{t+1}, the update rule for θ is defined as follows,

$\Delta = r^{t+1} + \gamma \max_{a^{t+1} \in A} \tilde{Q}^{\pi}(s^{t+1}, a^{t+1}, \theta) - \tilde{Q}^{\pi}(s^t, a^t, \theta)$,    (16)
$\theta \leftarrow \theta + \alpha\, \Delta\, f$,    (17)

where Δ is the temporal difference error.

We are now ready to present the feature functions for the HAP and devices. The features are designed based on the following principle: for a given state, favorable action(s) should have a high value.

1) HAP's Feature Functions: We have M = 3 feature functions at the HAP; namely f̂_1(s_0^t, a_0^t), f̂_2(s_0^t, a_0^t) and f̂_3(s_0^t, a_0^t). The feature function f̂_1(s_0^t, a_0^t) indicates whether the action a_0^t avoids battery overflow. Formally, we have,

$\hat{f}_1(s_0^t, a_0^t) = \begin{cases} 1, & e_0^{t+1} < \bar{E}_{0,\max} \\ 0, & \text{otherwise}, \end{cases}$    (18)

where e_0^{t+1} is the next battery level defined in Equ. (3). The feature function f̂_2(s_0^t, a_0^t) evaluates the selected transmission power p_0^t at the HAP and it is defined as,

$\hat{f}_2(s_0^t, a_0^t) = \begin{cases} c_1 \times \dfrac{1}{\frac{(p_0^t + p_c)\,\tau_0}{e_0^t} \cdot 100\%}, & a_0^t = p_0^t \neq 0 \\ 0, & p_0^t = 0, \end{cases}$    (19)

where c_1 = 0.02 is a constant to normalize f̂_2(s_0^t, a_0^t) ∈ (0, 1]; p_c is the circuit power at the HAP; and ((p_0^t + p_c)·τ_0 / e_0^t) · 100% represents the proportion of consumed energy when the HAP uses p_0^t. Observe that a smaller amount of consumed energy leads to a higher feature value. This is reasonable as we want the HAP to use the minimum amount of energy in order to maximize EE. The feature function f̂_3(s_0^t, a_0^t) evaluates the charging efficiency when the HAP uses a transmission power p_0^t. Formally,

$\hat{f}_3(s_0^t, a_0^t) = \begin{cases} c_2 \times \dfrac{\sum_{i=1}^{|N|} \check{e}_i^t}{(p_0^t + p_c)\,\tau_0}, & a_0^t = p_0^t \neq 0 \\ 0, & p_0^t = 0, \end{cases}$    (20)

where c_2 = 10^3 is a constant value to normalize f̂_3(s_0^t, a_0^t) ∈ (0, 1]. In words, this feature returns a non-zero value that is proportional to the amount of received energy at devices and inversely proportional to the energy consumed by the HAP. That is, this feature returns a value proportional to the resulting EE for a given action. Note, the HAP only uses f̂_3(s_0^t, a_0^t) when the current channel information ǧ_0i^t is available.

2) A Device's Feature Functions: Each device uses the function f̌_1(s_i^t, a_i^t) to evaluate its battery condition after taking action a_i^t and uses the function f̌_2(s_i^t, a_i^t) to evaluate the selected transmission power p_i^t. Formally, we have,

$\check{f}_1(s_i^t, a_i^t) = \begin{cases} 1, & e_i^{t+1} < \bar{E}_{\max} \\ 0, & \text{otherwise}, \end{cases}$    (21)

$\check{f}_2(s_i^t, a_i^t) = \begin{cases} c_1 \times \dfrac{1}{\frac{(p_i^t + \check{p}_c)\,\tau}{e_i^t} \cdot 100\%}, & a_i^t = p_i^t \neq 0 \\ 0, & p_i^t = 0. \end{cases}$    (22)

Feature f̌_1(s_i^t, a_i^t) returns a higher value if there is no overflow, whilst f̌_2(s_i^t, a_i^t) returns a higher non-zero value if device i uses a low but non-zero transmit power.
where Δ is the temporal difference error. ter ∈ (0,1) controls the amount of exploration, where an
We are now ready to present the feature functions for the agent selects an action randomly with probability , and with
HAP and devices. The features are designed based on the fol- probability 1− it selects the action a t that has the highest
lowing principle: for a given state, favorable action(s) should expected reward, i.e., Pr [a t = maxa t ∈A Q π (s t , a t )] = 1 − .
have a high value. We introduce two TQL algorithms; namely TQL-1 and TQL-2.
1) HAP’s Feature Functions: We have M = 3 feature func- TQL-1 assumes causal CSI, while TQL-2 has only the current
tions at the HAP; namely fˆ1 (s0t , a0t ), fˆ2 (s0t , a0t ) and fˆ3 (s0t , a0t ). CSI. Therefore, when the HAP uses TQL-1, we use Equ. (12)
The feature function fˆ1 (s0t , a0t ) indicates whether the action a0t as the reward function. In this case, the HAP has two feature
avoids battery overflow. Formally, we have, functions; i.e., Equ. (18)-(19). On the other hand, when the

 t t 1, e0t+1 < Ē0,max HAP uses TQL-2, its reward is as per Equ. (14), and it has
ˆ
f1 s0 , a0 = (18) three feature functions: Equ. (18)-(20). Define π̂ and πˇi to be
0, Otherwise,
a vector of power allocation policy at the HAP and at device
where e0t+1 is the next battery level defined in Equ. (3). The i. Let K be a set of episodes, and |K| denotes the total number
feature function fˆ2 (s0t , a0t ) evaluates the selected transmission of episodes. Each episode k ∈ K contains Γ frames, where Γ
power p0t at the HAP and it is defined as, is a constant value. We assume the solar energy arrival rate ě0t
⎧ 1
remains fixed in each episode k, i.e., ě0t = ě k . We now present
 t t  ⎨ c1 × (p0t +pc )·τ0 , a0t = p0t = 0 how the HAP and each device i learn to find the optimal power
fˆ2 s0 , a0 = t
e0
·100% (19) allocation policy.

0, p0t = 0, 1) Algorithm 1: The HAP sets the weight of each feature
function to one initially. At the beginning of each frame, the
where c1 = 0.02 is a constant to normalize fˆ2 (s0t , a0t ) ∈ (0, 1]; HAP selects an action a0t based on its current state s0t using the
(p t +p )·τ
pc is the circuit power at the HAP; 0 e tc 0 · 100% rep- −greedy policy; see Algorithm 1, line 5. The HAP then col-
0
resents the proportion of consumed energy when the HAP lects data from devices by calling the function CollectData(.)
Algorithm 1: Pseudocode for TQL-1
  Input: γ, α, ε
  Output: π̂
  1  Initialize θ̂ to ones
  2  for each episode k ∈ K do
  3      Solar energy arrival rate ě_0^t = ě^k
  4      for each frame t in episode k do
  5          Select an action a_0^t ∈ A_0 using π̂
  6          Total received data d̄ ← CollectData(.)
  7          Calculate r_0^t as per Equ. (12)
  8          Observe the next state s_0^{t+1}
  9          Calculate max_{a_0^{t+1} ∈ A_0} Q̃^π̂(s_0^{t+1}, a_0^{t+1}, θ̂)
  10         Obtain π̂ based on ε-greedy
  11         Update θ̂ as per Equ. (15)-(17)
  12     end
  13 end
  14 return π̂

Algorithm 2: Pseudocode for Device i
  Input: γ, α, ε
  Output: π̌_i
  1  Initialize each θ̌_i to ones
  2  for each frame t in episode k do
  3      Select an action a_i^t ∈ A_i^t using π̌_i
  4      Calculate r_i^t as per Equ. (8)
  5      Observe the next state s_i^{t+1}, and next action set A_i^{t+1}
  6      Calculate max_{a_i^{t+1} ∈ A_i^{t+1}} Q̃^π̌_i(s_i^{t+1}, a_i^{t+1}, θ̌_i)
  7      Obtain π̌_i based on ε-greedy
  8      Update θ̌_i as per Equ. (15)-(17)
  9  end
  10 return π̌_i

Algorithm 3: Pseudocode for TQL-2
  Input: γ, α, ε
  Output: π̂
  1  Initialize θ̂ to ones
  2  for each episode k ∈ K do
  3      Solar energy arrival rate ě_0^t = ě^k
  4      for each frame t in episode k do
  5          Select an action a_0^t ∈ A_0 using π̂
  6          Calculate r_0^t as per Equ. (14)
  7          Observe the next state s_0^{t+1}
  8          Calculate max_{a_0^{t+1} ∈ A_0} Q̃^π̂(s_0^{t+1}, a_0^{t+1}, θ̂)
  9          Update θ̂ as per Equ. (15)-(17)
  10         Obtain π̂ based on ε-greedy
  11     end
  12 end
  13 return π̂

1) Algorithm 1: The HAP sets the weight of each feature function to one initially. At the beginning of each frame, the HAP selects an action a_0^t based on its current state s_0^t using the ε-greedy policy; see Algorithm 1, line 5. The HAP then collects data from devices by calling the function CollectData(.), see line 6. Based on the total received data d̄ and the selected action a_0^t in state s_0^t, the HAP then obtains a reward r_0^t as per Equ. (12), see line 7. After that, it observes the next state s_0^{t+1}. Then the HAP uses Equ. (15) to estimate the value for each action a_0^{t+1} ∈ A_0 in state s_0^{t+1}, see line 9. After this, in line 10 the HAP selects a power allocation policy π̂ using ε-greedy. Next, it updates θ̂ using the obtained reward r_0^t, the current value Q̃^π̂(s_0^t, a_0^t, θ̂) and the estimated value for the next state Q̃^π̂(s_0^{t+1}, a_0^{t+1}, θ̂), see line 11.

2) Algorithm 2: Initially, a device i sets the weight for each feature function to one. In each frame, it first uses π̌_i to select an action a_i^t and obtains the corresponding reward r_i^t, see lines 3-4. After that, the device observes the next state s_i^{t+1}, and the next action set A_i^{t+1}. Note, as defined in Section V-B, its energy level e_i^{t+1} yields its corresponding action set A_i^{t+1}. Next, the device estimates the next value Q̃^π̌_i(s_i^{t+1}, a_i^{t+1}, θ̌_i) by calculating the value for each action a_i^{t+1} ∈ A_i^{t+1}, see line 6. In line 7, the device then sets its power allocation policy π̌_i using ε-greedy. The last step is to update θ̌_i using the current reward r_i^t, the current value Q̃^π̌_i(s_i^t, a_i^t, θ̌_i) and the next value Q̃^π̌_i(s_i^{t+1}, a_i^{t+1}, θ̌_i), see line 8.

3) Algorithm 3: As shown in line 1, the HAP initializes the weight for each feature function to one. In state s_0^t, the HAP selects an action a_0^t using ε-greedy, see line 5. Then it uses the current channel condition ǧ_0i^t to calculate the reward r_0^t, see line 6. After that, it observes the next state s_0^{t+1}, and calculates the next estimated value. Next, the HAP updates θ̂ using Equ. (15)-(17). In line 10, the HAP uses ε-greedy to obtain its power allocation policy π̂. In the last step, devices perform the learning approach as illustrated in Algorithm 2.

We note that the computational complexity in each frame for both TQL-1 and TQL-2 is a function of the action space A_0 or A_i^t, whose size corresponds to the step size κ_0 or κ_i. This is because the HAP or each device needs to search through their action space to identify the best action for a given state; see line 8 of Algorithm 3 for example. Lastly, we note that instead of running Algorithm 2 at each device, it is also possible to employ works such as SensorGym [28] at the HAP, whereby devices upload their state to the HAP; i.e., training is carried out at the HAP. The HAP then uses the state information of devices to learn the optimal policy, which it then uploads to devices. This means a device only needs to look up the best action given its current observed state.
VI. EVALUATION

We run our experiments in MATLAB on an Intel I5-4690 @ 3.50 GHz computer with 8 GB of RAM. We consider an RF-charging network with five devices randomly placed 2 to 6 meters away from the HAP. Each frame is composed of one charging slot with a length of τ_0 = 0.5 s and five data slots; each slot has length τ = 0.1 s. We note our proposed methods remain valid for other values of τ and τ_0 because the problem, statistics of channel gains and energy arrivals remain the same. The energy arrival rate of each episode at the HAP is modelled as a Markov chain as per [21]. Each episode contains Γ = 5 frames. The highest transmission power for the HAP is P̄_0,max = 30 dBm [11]. The energy conversion efficiency η ∈ [0, 0.624] is set according to the RF-energy harvester P2110B [29]. We note that the RF energy receiver has an operating range of [−12, 10] dBm [29]. The battery capacity of the HAP and devices is Ē_0,max = 30 J and Ē_max = 1 J, respectively. We assume that at the beginning of each simulation, devices and the HAP have no residual energy. The channel from the HAP to devices is modelled according to the Log-distance path loss model, Equ. (2), where the path loss exponent is set to 2.5 and the standard deviation is σ = 6 dB. Moreover, we set ε to decay every five frames, where ε is set to one initially. For TQL-1, we set ε for the HAP and devices to 10/t and 1/t, where the ε of devices has an initial value of 0.1. For TQL-2, we set ε for the HAP and devices to 1/t. Note, ε for the HAP stops decreasing when it reaches 0.02. To train the agent at the HAP and devices, for each episode, the weight of feature functions is updated based on the update rule defined by Equ. (16)-(17). We run the proposed TQL algorithms for 100000 frames to guarantee convergence to the optimal solution; referring to Fig. 5a, we see that the training process requires approximately 20000 frames. Note, all the results are an average of 50 simulations. Table II summarizes the parameter values used in our simulations.

TABLE II. Parameters Used in Simulations.

Note that as our problem is new, see Section II, there are no prior methods we can compare against fairly. In particular, existing approaches require non-causal or future energy arrivals while our problem calls for a solution that uses only causal information. To this end, we only compare against the following methods:

• Random. The HAP or a device randomly selects a transmission power based on its available energy.

• Greedy. The HAP or a device always uses its highest possible transmission power, with a maximum of 30 dBm, in each frame.

• Low Energy Consumption (LEC). In each frame, the HAP uses κ = δ · P̄_0,max dBm to charge devices, and in turn, a device uses 2% of its residual energy to transmit data back to the HAP. Note that in our TQL approaches, we consider κ_0 = 0.02 · P̄_0,max and κ_i = 0.02 · e_i^t/τ as the step size to discretize the transmission power of the HAP and devices, respectively.

We study five scenarios. The first scenario considers various learning rate α values. Second, we study how energy arrival impacts performance. The third considers the HAP's maximum transmit power P̄_0,max. Fourth, we investigate different SINR thresholds. Finally, we consider different channel conditions. We record four metrics: EE as per (6), sum rate as per (4), percentage of wasted energy due to overflow at the HAP, and the HAP's charging efficiency ζ; see Equ. (13).

A. Learning Rate

The learning rate α changes from 0.1 to 0.0001. We also investigate decaying learning rates, which are initially set to one and decay every five frames. For TQL-1, the HAP and devices have different decaying learning rates, where the HAP's learning rate is set to 250/t; devices have a learning rate of 5/t, which stops decreasing when its value is 0.001. As for TQL-2, the HAP and devices have a learning rate of 250/t, where the learning rate for the HAP decays from 0.1 to 0.01. Simulation results are recorded every 1000 frames.

1) Learning Rate for TQL-1: Referring to Fig. 5a, when α is fixed to a value lower than 0.1, TQL-1 has the lowest EE, which is 55.24% lower than when it uses α = 0.1. However, if it has a decaying learning rate, its EE is the highest; i.e., 1.41 × 10^7 bits/Joule. This is because in the first 20000 frames, TQL-1 learns faster than when it uses α = 0.1. After 20000 frames, as its learning rate decays, TQL-1 uses a small step size to explore more possible actions before converging to a solution. Thus, in the remaining experiments, we set the learning rate of the HAP and devices to 250/t and 5/t, respectively.

2) Learning Rate for TQL-2: Referring to Fig. 5b, when TQL-2 uses α = 0.01, it converges to a solution in 20000 frames. However, when it uses α = 0.001 and α = 0.0001, it does not converge in 100000 frames. Moreover, when TQL-2 uses α = 0.1, it has an EE that gradually decreases after 20000 frames. This is because the step size is too large to obtain a solution. In contrast, a decaying learning rate allows TQL-2 to learn fast in the first 25000 frames. When the decaying learning rate is lower than 0.01, it allows TQL-2 to use a small step size to obtain a solution. As a result, when TQL-2 uses a decaying learning rate, it has an EE that is 12.97% higher than when using α = 0.01. Therefore, in subsequent experiments, we use a learning rate of 250/t for both the HAP and devices.

3) Learning Rate Versus Convergence Time: We now study the relationship between channel conditions and learning rates. We increase σ from 6 dB to 8 dB. We calculate the Mean Square Error (MSE) of the difference between the estimated and true value as per Equ. (16) for each frame. Our TQL methods are deemed to have converged if their MSE is lower than the average MSE.

From Fig. 5c, we see that as σ increases, TQL-1 converges slower when it uses a decaying learning rate. Specifically, it converges in 13135 frames when σ = 6 dB, and converges in 15054 frames as σ increases to 8 dB. Similarly, varying channel condition has a minor impact on the convergence rate when TQL-1 uses a small learning rate; i.e., α = 0.01, α = 0.001 and α = 0.0001. This is because a higher channel variation does not affect the system as much when an agent uses a small learning rate. In contrast, when TQL-1 uses a large learning rate, say α = 0.1, its convergence rate fluctuates from 43263 to 29412 frames.
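The decaying schedules described above can be written compactly; the following Python sketch implements an inverse-time decay with a floor, using constants mentioned in this section as examples (the pairing of constants and floors is illustrative).

def decayed(value_at_t1, t, floor):
    """Inverse-time decay, e.g., 250/t for the HAP learning rate, with a floor."""
    return max(value_at_t1 / max(t, 1), floor)

# Examples loosely based on the reported settings.
alpha_hap    = decayed(250.0, t=1000, floor=0.01)   # HAP learning rate
alpha_device = decayed(5.0,   t=1000, floor=0.001)  # device learning rate
eps_hap      = decayed(1.0,   t=1000, floor=0.02)   # HAP exploration rate (TQL-2)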
Fig. 5. Learning rate vs. EE and convergence time.

Fig. 5d shows that TQL-2 converges slower when σ increases from 6 to 8 dB. Interestingly, this has an insignificant impact on the convergence rate when TQL-2 uses a large learning rate; i.e., α = 0.01 and α = 0.1. This is because when the HAP uses TQL-2 it is able to exploit the current channel condition.

B. Energy Arrival Rates

To study the impact of energy arrivals, we increase the solar panel from 50 to 1500 cm2. Fig. 6(d) shows that a higher energy arrival rate leads to more energy wastage. As the solar panel increases to 1500 cm2, LEC wastes 97.61% of the received energy because the HAP transmits at a fixed transmission power of 13 dBm to charge devices. In contrast, Random and Greedy only waste 50.23% and 22.29% of the arrived energy when the solar panel size is 1500 cm2. TQL-1 and TQL-2 train the HAP to transmit when the channel state is good. According to Fig. 6(c) and Fig. 6(d), when the solar panel size is 500 cm2, TQL-1 and TQL-2 maintain the highest charging efficiency, i.e., 4.95 × 10^−5 and 2.33 × 10^−4, with an energy wastage of 49.31% and 32.32%. Fig. 6(b) shows that as the HAP has more energy, the sum rate improves for the tested methods. Moreover, when the solar panel increases to 1500 cm2, the sum rate of TQL-1 and TQL-2 increases to 1.41 × 10^6 bits/s and 2.40 × 10^6 bits/s, respectively. On the one hand, more available energy at the HAP allows more energy to be transferred to devices. On the other hand, TQL-1 and TQL-2 train devices to remain idle until the channel condition becomes favorable. Therefore, TQL-1 and TQL-2 improve their sum rate. Interestingly, when the solar panel size is 50 cm2, LEC has a sum rate that is 238.30% higher than the sum rate of Greedy. This is because LEC allows a device to use 2% of its residual energy to transmit data. However, devices using Greedy frequently have insufficient energy to transmit. Consequently, as shown in Fig. 6(a), when the solar panel size is 50 cm2, LEC has an EE that is 389.49% higher than the EE of Greedy.

Fig. 6. Impact of energy arrival rate at the HAP.

From Fig. 6(a) we see that TQL-2 has higher EE than TQL-1. However, the difference in EE between TQL-2 and TQL-1 decreases from 482.17% to 92.37% with higher energy arrivals. The reason is that TQL-2 has channel information, and it selects a transmission power that maximizes the charging efficiency whilst avoiding battery overflow. In contrast, as TQL-1 has no channel information, the HAP prefers a low transmission power that maximizes EE. Based on Fig. 6(c), when the solar panel size is 50 cm2, TQL-2 has a charging efficiency that is 800.00% higher than the charging efficiency of TQL-1. Consequently, when there is no energy overflow at the HAP, TQL-2 has an EE that is 482.17% higher than the EE of TQL-1. As the energy arrival rate increases, energy overflow starts to occur. Therefore, to maximize EE, there is a trade-off between maintaining charging efficiency and avoiding energy waste. For instance, when the solar panel size is 500 cm2, the energy wasted by TQL-1 and TQL-2 is 49.31% and 32.32%, while Greedy has no energy wastage. However, TQL-1 and TQL-2 have a charging efficiency that is 70.97% and 705.45% higher than the charging efficiency of Greedy. As a result, the EE of TQL-1 and TQL-2 is 3118.22% and 7210.64% higher than the EE of Greedy. According to Fig. 6(c), when the solar panel increases from 200 cm2 to 1500 cm2, the difference in charging efficiency between TQL-1 and TQL-2 reduces from 562.02% to 231.59%. Fig. 6(d) shows that when the solar panel size is 1500 cm2, both TQL-1 and TQL-2 waste more than 60% of the arrived energy. In this case, we observe that with increasing energy arrival rates, energy overflow at the HAP occurs frequently. Therefore, TQL-1 and TQL-2 use a high transmission power to avoid overflow even when the channel is bad. Consequently, the EE gap between TQL-1 and TQL-2 decreases with higher energy arrivals.
Fig. 7. Impact of P̄_0,max.

C. Impact of Transmission Power

In this experiment, we vary the HAP's maximum transmission power P̄_0,max from 30 dBm to 40 dBm. The HAP has a 200 cm2 solar panel. Fig. 7(d) shows that as P̄_0,max increases from 30 dBm to 40 dBm, the amount of energy wasted by LEC decreases from 71.23% to 0%. The Greedy and Random methods have no wasted energy. Hence, they are omitted. We note that when P̄_0,max increases from 30 dBm to 40 dBm, the transmission power of LEC increases from 13 dBm to 23 dBm. Moreover, the average amount of energy that arrives at the HAP is 3.75 × 10^3 Joule. That is, in each frame, the average amount of energy that arrives at the HAP is 0.0375 Joule. Therefore, when P̄_0,max = 30 dBm, LEC can only use 0.01 Joule in each frame, which is lower than its harvested energy, i.e., 0.0375 Joule. In this case, 71.23% of the arrived energy is wasted by the HAP. However, when P̄_0,max = 40 dBm, LEC can use 0.10 Joule when the HAP has sufficient energy, and hence, there is no energy overflow at the HAP. From Fig. 7(c), we see that as P̄_0,max increases from 30 dBm to 40 dBm, the charging efficiency of LEC increases from 2.9 × 10^−7 to 1.31 × 10^−5. This is because in our RF energy receiver model, the energy conversion efficiency η increases from 0% to 62.4% when the received power increases from −12 dBm to 3 dBm. Therefore, the sum rate of LEC increases from 1.52 × 10^3 to 2.50 × 10^5 when P̄_0,max increases from 30 dBm to 40 dBm; see Fig. 7(b). Consequently, the EE of LEC improves to 6.65 × 10^6 when P̄_0,max = 40 dBm, which is 4496.53% higher than when we set P̄_0,max = 30 dBm.

Fig. 8. Impact of SINR threshold.

Fig. 7(a) shows that as the HAP has a higher transmit power P̄_0,max, EE improves for TQL-1 and TQL-2. From Fig. 7(c) we see that as P̄_0,max increases from 30 dBm to 40 dBm, the charging efficiency of TQL-1 and TQL-2 improves by 19.65% and 39.14%, respectively. As a result, the sum rate of TQL-1 and TQL-2 improves by 192.65% and 4.77%. Based on Fig. 7(d), the energy wasted by TQL-1 reduces from 45.12% to 0.79% when P̄_0,max increases from 30 dBm to 40 dBm. Similarly, TQL-2 also reduces the wasted energy from 0.87% to 0%. This is because TQL-1 and TQL-2 train the HAP to use a high transmission power when the channel is good. Therefore, as the HAP has a higher P̄_0,max, TQL-1 and TQL-2 can use more energy to charge devices during good channel conditions, which improves the charging efficiency and also avoids energy overflow. Consequently, the EE of TQL-1 and TQL-2 improves by 21.08% and 4.77%, as P̄_0,max increases from 30 dBm to 40 dBm.

Fig. 9. Impact of σ.

D. Impact of SINR Thresholds

We increase the SINR threshold from 0 dB to 80 dB; transmissions that do not satisfy the minimum SINR threshold are marked as failed. The HAP has a 200 cm2 solar panel and a maximum transmit power of P̄_0,max = 30 dBm. Based on Fig. 8, the SINR threshold has no impact on the EE of Greedy. This is because Greedy causes the HAP to deplete its received energy to charge devices, and in turn, devices use up their harvested energy to transmit their data. In this case, despite the SINR threshold increasing from 0 dB to 80 dB, devices using the Greedy policy are able to meet their SINR threshold. In contrast, the SINR threshold has a significant impact on the EE attained by TQL-1, TQL-2 and LEC. When the SINR threshold increases from 0 dB to 80 dB, the EE of TQL-1, TQL-2 and LEC reduces by 63.01%, 48.12% and 94.02%,
respectively. Specifically, the EE of TQL-1 and TQL-2 starts to reduce when the SINR threshold is 40 dB. However, the EE of LEC reduces by 1.65% when the SINR threshold is 20 dB. On the one hand, we see from Fig. 6(c) that, when the solar panel size is 200 cm2, the amount of energy that LEC provides to devices is 0.57% and 0.09% of the harvested energy as compared to the case when the HAP uses TQL-1 and TQL-2. On the other hand, LEC allows a device to use 2% of its residual energy to transmit data, which causes some transmissions to fail to meet their SINR threshold. In contrast, TQL-1 and TQL-2 train devices to remain idle until the channel becomes good. In this case, devices that have sufficient energy use a high transmission power. Consequently, devices are able to meet their required SINR threshold, which ranges from 0 to 40 dB, when using TQL-1 and TQL-2.

E. Channel Conditions

To study the impact of channel conditions, we increase σ from 6 dB to 8 dB. The HAP has a 200 cm2 solar panel and P̄_0,max = 30 dBm. Fig. 9(c) shows that as σ increases from 6 dB to 8 dB, the charging efficiency of Greedy, Random and LEC increases to 6.12 × 10^−5, 7.27 × 10^−5 and 2.00 × 10^−5, respectively. From Fig. 9(a), when σ increases from 6 dB to 8 dB, the EE of TQL-1 and TQL-2 increased by 99.01% and 57.79%, respectively. Firstly, Fig. 9(c) shows that for σ ranging from 6 dB to 8 dB, the amount of harvested energy at devices increases by 142.48% and 124.35%. In this case, devices in TQL-1 and TQL-2 have more energy for transmissions. Secondly, TQL-1 and TQL-2 train devices to remain idle until the channel is favorable. This allows devices to reduce energy consumption when the channel is bad, and thus save energy for use when the channel is good. Consequently, as σ increases from 6 dB to 8 dB, the sum rate of TQL-1 and TQL-2 increases by 98.89% and 57.78%, respectively; see Fig. 9(b). Therefore, the EE of TQL-1 and TQL-2 increased to 1.55 × 10^7 bits/Joule and 7.57 × 10^7 bits/Joule.

VII. CONCLUSION

In this article, we consider optimizing EE without channel state and energy arrival information. We have proposed a decentralized approach. That is, the HAP and each device determine their transmission power with only causal knowledge. The simulation results show that it is not necessary for each device to transmit in its assigned data slot. A better strategy is for the HAP and devices to remain idle until they deem the channel is good. Furthermore, our approach is able to improve EE while meeting SINR thresholds of up to 40 dB. Lastly, an interesting research direction is to extend works such as [30], whereby devices learn the best modulation scheme or data rate for each system state that results in the highest expected sum-rate over time.

REFERENCES

[1] C. Yang, K.-W. Chin, T. He, and Y. Liu, "On sampling time maximization in wireless powered Internet of Things," IEEE Trans. Green Commun. Netw., vol. 3, no. 3, pp. 641-650, Sep. 2019.
[2] A. Zanella, N. Bui, A. Castellani, L. Vangelista, and M. Zorzi, "Internet of Things for smart cities," IEEE Internet Things J., vol. 1, no. 1, pp. 22-32, Feb. 2014.
[3] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. M. Leung, and Y. L. Guan, "Wireless energy harvesting for the Internet of Things," IEEE Commun. Mag., vol. 53, no. 6, pp. 102-108, Jun. 2015.
[4] V. Talla, B. Kellogg, B. Ransford, N. Saman, S. Gollakota, and J. R. Smith, "Powering the next billion devices with Wi-Fi," in Proc. ACM CoNEXT, Heidelberg, Germany, Dec. 2015, pp. 1-13.
[5] X. Lu, P. Wang, D. Niyato, D.-I. Kim, and Z. Han, "Wireless networks with RF energy harvesting: A contemporary survey," IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 757-767, 2nd Quart., 2015.
[6] X. Huang and N. Ansari, "Energy sharing within EH-enabled wireless communication networks," IEEE Wireless Commun., vol. 22, no. 3, pp. 144-149, Jun. 2015.
[7] T. Wang, C. Jiang, and Y. Ren, "Access points selection in super WiFi network powered by solar energy harvesting," in Proc. IEEE WCNC, Doha, Qatar, Apr. 2016, pp. 1-5.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[9] S. Buzzi, I. Chih-Lin, T. E. Klein, H. V. Poor, C. Yang, and A. Zappone, "A survey of energy-efficient techniques for 5G networks and challenges ahead," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 697-709, Apr. 2016.
[10] Q. Wu, W. Chen, M. Tao, J. Li, H. Tang, and J. Wu, "Resource allocation for joint transmitter and receiver energy efficiency maximization in downlink OFDMA systems," IEEE Trans. Commun., vol. 63, no. 2, pp. 416-430, Feb. 2015.
[11] Q. Wu, M. Tao, D. W. K. Ng, W. Chen, and R. Schober, "Energy-efficient resource allocation for wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2312-2327, Mar. 2016.
[12] T. A. Zewde and M. C. Gursoy, "Energy-efficient time allocation for wireless energy harvesting communication networks," in Proc. IEEE Globecom Workshops, Washington, DC, USA, Dec. 2016, pp. 1-6.
[13] X. Lin, L. Huang, C. Guo, P. Zhang, M. Huang, and J. Zhang, "Energy-efficient resource allocation in TDMS-based wireless powered communication networks," IEEE Commun. Lett., vol. 21, no. 4, pp. 861-864, Apr. 2017.
[14] W. Dinkelbach, "On nonlinear fractional programming," Manag. Sci., vol. 13, no. 7, pp. 492-498, Mar. 1967.
[15] S. Mao, S. Leng, J. Hu, and K. Yang, "Energy-efficient resource allocation for cooperative wireless powered cellular networks," in Proc. IEEE ICC, Kansas City, MO, USA, May 2018, pp. 1-6.
[16] M. Song and M. Zheng, "Energy efficiency optimization for wireless powered sensor networks with nonorthogonal multiple access," IEEE Sens. Lett., vol. 2, no. 1, pp. 1-4, Mar. 2018.
[17] J. Hu and Q. Yang, "Dynamic energy-efficient resource allocation in wireless powered communication network," Wireless Netw., vol. 25, no. 6, pp. 3005-3018, 2019.
[18] T. Liu, X. Wang, and L. Zheng, "A cooperative SWIPT scheme for wirelessly powered sensor networks," IEEE Trans. Commun., vol. 65, no. 6, pp. 2740-2752, Jun. 2017.
[19] P. Mukherjee and S. De, "cDIP: Channel-aware dynamic window protocol for energy-efficient IoT communications," IEEE Internet Things J., vol. 5, no. 6, pp. 4474-4485, Dec. 2018.
[20] Y. Zeng and R. Zhang, "Optimized training design for wireless energy transfer," IEEE Trans. Commun., vol. 63, no. 2, pp. 536-546, Feb. 2015.
[21] M.-L. Ku, Y. Chen, and K. J. R. Liu, "Data-driven stochastic models and policies for energy harvesting sensor communications," IEEE J. Sel. Areas Commun., vol. 33, no. 8, pp. 1505-1520, Aug. 2015.
[22] E. Boshkovska, D. W. K. Ng, N. Zlatanov, and R. Schober, "Practical non-linear energy harvesting model and resource allocation for SWIPT systems," IEEE Commun. Lett., vol. 19, no. 12, pp. 2082-2085, Dec. 2015.
[23] S. Sudevalayam and P. Kulkarni, "Energy harvesting sensor nodes: Survey and implications," IEEE Commun. Surveys Tuts., vol. 13, no. 3, pp. 443-461, 3rd Quart., 2011.
[24] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, pp. 279-292, May 1992.
[25] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: Wiley, 2014.
[26] R. Bellman, "Dynamic programming," Science, vol. 153, pp. 34-37, Jul. 1966.
[27] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How, "A tutorial on linear function approximators for dynamic programming and reinforcement learning," Found. Trends Mach. Learn., vol. 6, no. 4, pp. 375-454, Dec. 2013.
[28] A. Murad, K. Bach, F. A. Kraemer, and G. Taylor, "IoT sensor gym: Training autonomous IoT devices with deep reinforcement learning," in Proc. 9th Int. Conf. Internet Things, Bilbao, Spain, Oct. 2019, pp. 1-4.
[29] Data Sheet for P2110B Power Harvester. Accessed: 2016. [Online]. Available: https://www.powercastco.com/wp-content/uploads/2016/12/P2110B-Datasheet-Rev-3.pdf
[30] P. Mukherjee and S. De, "Dynamic feedback-based adaptive modulation for energy-efficient communication," IEEE Commun. Lett., vol. 23, no. 5, pp. 946-949, May 2019.

Honglin Ren received the Bachelor of Engineering degree (First Class Hons.) in telecommunication engineering from the University of Wollongong, Australia, and Zhengzhou University, China, in 2017. He is currently pursuing the Ph.D. degree with the University of Wollongong. His current research focuses on energy efficiency optimization in wireless-powered networks.

Kwan-Wu Chin received the Bachelor of Science (First Class Hons.) and Ph.D. (with commendation) degrees from Curtin University, Australia, in 1997 and 2000, respectively. He was a Senior Research Engineer with Motorola from 2000 to 2003. In 2004, he joined the University of Wollongong as a Senior Lecturer, where he was promoted to an Associate Professor in 2011. To date, he holds four U.S. patents, and has published more than 150 conference and journal articles. His current research areas include medium access control protocols for wireless networks, and resource allocation algorithms/policies for communications networks.