A Reinforcement Learning Approach to Optimize Energy Usage in RF-Charging Sensor Networks
I. INTRODUCTION
SENSOR devices form the perception layer of upcoming Internet of Things (IoT) networks, where they are used to monitor or affect an environment [1]. Specifically, they will be relied upon to realize smart cities and smart manufacturing [2]. A challenging issue faced by these sensor devices is their limited energy, which affects their operation time [3]. In this respect, a promising energy source is Radio Frequency (RF) energy transfer [4], [5], which brings about three advantages: 1) far-field charging; 2) data transmissions and charging operate over the same frequency; and 3) multiple devices are able to harvest energy simultaneously from a single transmission. For example, in [4], the authors demonstrated a prototype with a camera that is powered by access points in a WiFi network.

Apart from that, energy transmitters such as Power Beacons (PBs) or Hybrid Access Points (HAPs) that are used to charge sensor devices may have energy constraints. This is because operators are now aiming to reduce the carbon footprint or operating expenditure of their network [6] by deploying solar-powered PBs/HAPs [7].

To this end, we consider a network comprising a solar-powered HAP that charges one or more sensing devices with RF-energy harvesting capability; see Figure 1. The HAP is responsible for charging and collecting data from devices. To achieve this, the HAP uses a time frame that has an energy and a data transfer phase. After charging devices in the energy transfer phase, devices transmit their data to the HAP in a Time Division Multiple Access (TDMA) manner. Our aim is to optimize these two phases by maximizing the Energy Efficiency (EE) of the HAP and devices, where EE is defined as the total received data per Joule of consumed energy.

To illustrate how EE can be maximized, consider Figure 1. In each frame, assume the HAP receives 1 Joule of energy. For ease of exposition, assume there are only two channel states: bad or good; note, our problem considers non-negative real valued channel gains. Charging/transmission is only successful when the channel is in the good state. Assume that when the channel state is good, devices harvest 1 μJ of energy, and a bit costs 1 μJ to transmit. Consider two strategies: 1) strategy-1, where the HAP and devices transmit in both frame t and t + 1, and 2) strategy-2, where the HAP and devices only transmit when the channel state is good; i.e., frame t + 1. For strategy-1, over two frames, the HAP collects 1 bit using 2 Joules of energy because the channel state in frame t is bad, meaning the energy used for charging is wasted. On the other hand, if the HAP uses strategy-2, it collects 2 bits over two frames. This is because the HAP stores energy when the channel state is bad, and charges devices and collects data when the channel is good.
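The arithmetic in this example can be checked with a short script. The following Python sketch uses the illustrative numbers above and assumes that the energy harvested in a good frame scales linearly with the energy the HAP spends on charging, which is what allows strategy-2 to deliver two bits; it is not part of the system model developed later.

```python
HARVEST_PER_JOULE = 1e-6   # devices harvest 1 uJ per Joule the HAP spends in a good frame (assumed)
COST_PER_BIT = 1e-6        # a bit costs 1 uJ to transmit
ARRIVAL_PER_FRAME = 1.0    # the HAP receives 1 Joule per frame

def run(strategy, channel_states=('bad', 'good')):
    """Return (bits collected, Joules spent) over the two-frame example."""
    stored, bits, spent = 0.0, 0, 0.0
    for state in channel_states:
        stored += ARRIVAL_PER_FRAME
        if strategy == 'always' or state == 'good':
            spent += stored                      # the HAP uses all stored energy to charge
            if state == 'good':                  # charging only succeeds in a good frame
                bits += int(stored * HARVEST_PER_JOULE / COST_PER_BIT)
            stored = 0.0
    return bits, spent

print(run('always'))     # strategy-1: (1, 2.0) -> 1 bit for 2 Joules
print(run('good-only'))  # strategy-2: (2, 2.0) -> 2 bits over the same two frames
```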
The previous example shows that a HAP and devices should only transmit when the channel condition is favourable. However, in practice, the problem is challenging because the HAP and devices have only causal knowledge of energy arrivals and CSI; i.e., they only know current and past information of the said quantities. Moreover, the HAP may experience battery overflow if it does not transmit frequently. To this end, this article makes the following contributions:
• We present a Two-layer Markov Decision Process (TMDP) followed by a Two-layer Q-learning (TQL) approach; these approaches, referred to respectively as TQL-1 and TQL-2, allow the HAP and devices to learn the optimal transmit power policy in each time frame based on their current energy level. In TQL-2, their state space also incorporates the current channel gain. Another innovation is the use of novel features, as part of the Linear Function Approximation (LFA) framework [8], to evaluate each action or transmission power based on the current state of the HAP or devices. We note that although we consider a solar-powered HAP, our solution continues to work because it does not assume any specific energy arrival distribution; hence, it is also suitable for use if the HAP is equipped with other types of renewable energy sources.
• Previous works, see Section II, assume non-causal knowledge, whereby nodes are aware of future channel gain information, which is impractical. Apart from that, studies that consider causal CSI only consider the worst case or robust setting, or aim to maintain data/energy queue stability. In contrast, we consider a distributed learning approach that adapts to stochastic energy arrivals and channel gains, and assumes only causal knowledge.
• The simulation results show TQL-1 and TQL-2 achieve respectively more than 48 and 389 times the EE achieved by random and greedy strategies. Moreover, we found that decaying the learning rate, as opposed to fixing it to a constant, results in an EE that is at least 12.97% higher. This accelerates the training process at the start of training, and as the learning rate decays, the agent can explore more possible actions before converging to a solution. In addition, our approaches are capable of maximizing EE while meeting the minimum throughput requirement of devices. Interestingly, with a larger variation in channel condition or an increase in the maximum transmit power of the HAP, both our approaches obtained better EE due to higher sum rates. When we increase the solar panel size at the HAP, meaning it harvests more energy, the HAP uses a high transmission power to avoid energy overflow, which results in a high data rate but lower EE.

Next, we first review prior works in Section II before formalizing our system model in Section III. After that, Section IV presents our problem formally. Section V outlines TQL-1 and TQL-2. Section VI contains an evaluation of our approaches. Section VII concludes this article.

II. RELATED WORKS

There are many studies that aim to maximize EE in Wireless Powered Communication Networks (WPCNs) [9]. To the best of our knowledge, there are no works that consider a solar-powered HAP operating in a WPCN whereby all nodes are able to learn the best transmission power to maximize EE. Critically, we consider a distributed approach.

Prior works such as [10], [11], [12], and [13] use mathematical optimization to optimize EE. In particular, many works rely on the Dinkelbach method [14]. For example, the authors of [10] optimize sub-carrier allocation to avoid interference and also to determine the transmit power of users. Wu et al. [11] jointly determine the transmission time of users and the transmit power at the HAP and devices. Zewde and Gursoy [12] determine the time interval for energy transfer and data communication while avoiding packet loss due to data buffer overflow. In [13], the authors investigate a Time Division Mode Switching (TDMS) scheme where users simultaneously transmit data to a HAP. The problem is to optimize the charging interval and transmit power of users. In [15], the authors consider a wireless powered cellular network, where a user and a device-to-device pair cooperatively transmit data to the HAP. The problem is to determine the beamforming weights used by the HAP, and the transmit power and transmission interval of users. Song and Zheng [16] employ Particle Swarm Optimization (PSO) to determine the charging interval and transmit power of users in a Non-Orthogonal Multiple Access (NOMA)-based WPCN. These works outline centralized solutions and require global information or perfect CSI.

In summary, our work has the following differences to prior works on optimizing the EE of a WPCN: 1) it considers causal knowledge of channel gain and energy arrivals; 2) it uses a solar-powered HAP to charge devices; and 3) it proposes and studies a distributed, learning solution. To date, there are only a few works that have considered causal channel gain information when optimizing the EE of a system. For example, Hu and Yang [17] dynamically determine the transmission time and transmit power of each user while maintaining data queue stability. In [18], the authors consider deriving a schedule for sensor nodes to avoid collisions. Unlike [17] and [18], we consider a learning solution that allows the HAP and devices to adapt their transmit power dynamically. We also note that past works have not considered a solar-powered HAP. Moreover, prior solutions are optimized for specific scenarios and have no learning capability. There are also works such as [19] that consider estimating channel gains in order to improve data transmissions. However, they do not consider optimizing energy efficiency in a multi-user system with both energy harvesting devices and a HAP.

III. SYSTEM MODEL

Table I summarizes our notations. We consider a RF-charging network that is composed of a HAP and a set of wireless powered devices N = {1, . . . , |N|}, where |N| is the total number of devices and each device is indexed by i ∈ N. Fig. 2 shows that the HAP harvests energy from solar and uses it for charging devices and receiving data from devices.
TABLE I. COMMON NOMENCLATURE

Fig. 2. A RF-charging network.

Devices always have data to transmit and no Quality of Service (QoS) requirement, where they are tolerant of delays resulting from deferred transmissions or energy outage.

We assume Time Division Multiple Access (TDMA), where time is divided into T frames. Each frame with length T̄ is indexed by t, where t = 1, . . . , T. As shown in Fig. 3, each frame has a charging slot and |N| data slots. Each data slot has duration τ; in practice, this duration is set to the coherence time of the channel [20]. The charging slot has duration τ0 = (T̄ − |N| · τ). The HAP broadcasts energy to devices in the charging slot and each device transmits data to the HAP in its allocated slot. Note, in this work we consider that the HAP and each device have causal CSI knowledge.

The solar energy arrival process at the HAP is governed by a Hidden Markov Model (HMM) with parameters determined by the solar irradiance data collected in the month of June from 2008 to 2010, at a solar site located at Elizabeth City State University [21]. We assume a non-zero energy arrival process, i.e., during day time, meaning the HAP does not experience any energy outage. The HAP is only aware of its current and past energy arrivals. Let ě0t denote the energy harvested by the HAP in frame t. The HAP has a rechargeable battery with capacity Ē0,max. Its current amount of energy in frame t is denoted by e0t. Let pmax be the HAP's maximum transmit power, where its transmit power in time slot t is denoted as p0t. We also consider circuit power consumption during transmissions. Let pc be the circuit power at the HAP. Define o0t as the amount of energy wasted at the HAP due to overflow. Formally, we have

$$o_0^t = e_0^t + \check{e}_0^t - \left(p_0^t + p_c\right)\tau_0 - \bar{E}_{0,\max}. \tag{1}$$

The uplink and downlink channel gains in frame t between a device i and the HAP are denoted as ĝi0t and ǧ0it, respectively. We assume block fading, where ĝi0t and ǧ0it are fixed in a given frame but vary across frames [20]. We consider the Log-distance path loss model, defined by

$$PL_{d_0 \to d_i}(\mathrm{dB}) = PL(d_0) + 10\,n\log_{10}\frac{d_i}{d_0} + \chi, \tag{2}$$

where PL(d0) is the path loss at a reference distance d0, and di is the distance from the HAP to device i. The path loss exponent is denoted as n, and the term χ models small scale fading as a zero-mean Gaussian distributed variable with a standard deviation σ. Note, channel estimation is complementary to our work; any energy efficient method can be used to acquire the CSI to/from devices; e.g., [20]. In this article, we do not consider the energy cost of channel estimation. This can be modeled by allocating some of the energy harvested by the HAP or devices for channel estimation at the start of each frame. Note, this has the effect of scaling our results and it does not change our conclusions or findings.

Denote the harvested energy at device i in frame t as ěit. The value of ěit is determined by the transmission power of the HAP p0t and the channel gain ǧ0it. Let p̌it = p0t · ǧ0it be the incident power at the antenna of a device. The received power η(p̌it) at device i is a non-linear function of its incident power [22]. Formally, we have ěit = τ0 · η(p̌it). Each device has a rechargeable battery with capacity Ēmax. The transmission power pit of device i in frame t depends on its available energy eit, where initially we have ei0 = 0. Specifically, if pit = 0 then device i does not transmit in frame t. Otherwise, it transmits at power pit, which is bounded by its available energy. Also, the harvested energy in frame t is available in frame t + 1 [23]. Denote p̌c as the circuit power at devices. The energy level eit+1 at the beginning of frame t + 1 is expressed as

$$e_i^{t+1} = e_i^t + \check{e}_i^t - \left(p_i^t + \check{p}_c\right)\tau \le \bar{E}_{\max}. \tag{3}$$
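For concreteness, the following Python sketch evaluates the log-distance model of Equ. (2) and the battery update of Equ. (3); the reference path loss PL(d0), the linear stand-in for the harvester function η(·), and all numeric inputs are placeholders rather than the parameters used in our simulations.

```python
import math
import random

def path_loss_db(d_i, d_0=1.0, pl_d0_db=30.0, n=2.5, sigma_db=6.0):
    """Equ. (2): PL(d0) + 10*n*log10(d_i/d_0) + chi, with chi ~ N(0, sigma^2)."""
    chi = random.gauss(0.0, sigma_db)
    return pl_d0_db + 10.0 * n * math.log10(d_i / d_0) + chi

def harvested_energy(p0, gain, tau0, eta=lambda p_in: 0.3 * p_in):
    """Energy a device harvests in the charging slot, e_i = tau0 * eta(p0 * g_0i).
    eta() is only a linear stand-in for the non-linear harvester model of [22]."""
    return tau0 * eta(p0 * gain)

def battery_update(e_t, e_harvested, p_i, p_c, tau, e_max):
    """Equ. (3): next energy level, kept within the battery capacity E_max."""
    return min(e_t + e_harvested - (p_i + p_c) * tau, e_max)

# Example: a device 4 m from the HAP, which charges at 1 W for tau0 = 0.5 s.
gain = 10.0 ** (-path_loss_db(4.0) / 10.0)   # linear downlink gain from the dB loss
e_h = harvested_energy(1.0, gain, tau0=0.5)
print(battery_update(e_t=0.0, e_harvested=e_h, p_i=0.0, p_c=0.0, tau=0.1, e_max=1.0))
```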
In frame t, device i transmits a total of dit amount of data, which is defined as

$$d_i^t = \tau \cdot B \log_2\left(1 + \frac{p_i^t\,\hat{g}_{i0}^t}{n_0}\right), \tag{4}$$

where B is the channel bandwidth, and n0 is the noise power. In the above expression, dit is only non-zero for a device i when it decides to transmit; i.e., pit > 0 in frame t. Note that Equ. (4) represents the asymptotic or upper bound capacity of device i's uplink. In practice, the HAP simply notes the amount of bits received from each device at the end of each frame. The total received data d̄ at the HAP is defined as

$$\bar{d} = \sum_{t=1}^{T}\sum_{i=1}^{|\mathcal{N}|} d_i^t. \tag{5}$$

The performance metric EE ΦEE is defined as the ratio between the sum rate and the total amount of energy consumed at the HAP. Formally, we have

$$\Phi_{EE} = \frac{\bar{d}}{\sum_{t=1}^{T}\left(p_0^t + p_c\right)\tau_0}. \tag{6}$$

IV. THE PROBLEM

We aim to jointly determine the transmit power of the HAP and devices. Ideally, the HAP aims to use the minimum transmission power to charge devices that yields the maximum amount of data from devices. We divide the problem into two parts; namely, 1) the HAP determines its transmission power p0t in each frame, and 2) each device determines its transmission power pit in its assigned data slot. Next, we introduce the problem at the HAP and devices separately.

The objective of the HAP is to maximize EE over a given T horizon. Formally, we have

$$\max_{p_0^t}\ \mathbb{E}\left[\frac{\sum_{t=1}^{T}\sum_{i=1}^{|\mathcal{N}|} d_i^t}{\sum_{t=1}^{T}\left(p_0^t + p_c\right)\tau_0}\right]$$

In words, the EE at device i is its sum rate dit divided by its consumed energy over T frames. The objective is to maximize ΦEEi. Formally, we have

$$\max_{p_i^t}\ \mathbb{E}\left[\Phi_{EE_i}\right]\quad \text{s.t.}\ \left(p_i^t + \check{p}_c\right)\tau \le e_i^t,\ \forall t \in T;\qquad e_i^t + \check{e}_i^t - \left(p_i^t + \check{p}_c\right)\tau \le \bar{E}_{\max},\ \forall t \in T. \tag{9}$$

Similar to the HAP, the challenge at each device is that it does not know the channel condition and its energy arrival process.
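The quantities in Equ. (4)-(6) and the feasibility conditions of Equ. (9) can be computed directly, as in the Python sketch below; the bandwidth, noise power, circuit powers and channel gains are made-up values used purely for illustration.

```python
import math

TAU, TAU0 = 0.1, 0.5          # data slot and charging slot lengths (seconds)
B, N0, PC = 1e6, 1e-12, 1e-3  # illustrative bandwidth, noise power and HAP circuit power

def data_bits(p_i, g_i0, tau=TAU, bandwidth=B, n0=N0):
    """Equ. (4): d_i^t = tau * B * log2(1 + p_i * g_i0 / n0); zero when the device is idle."""
    return tau * bandwidth * math.log2(1.0 + p_i * g_i0 / n0) if p_i > 0 else 0.0

def energy_efficiency(p0_per_frame, bits_per_frame, p_c=PC, tau0=TAU0):
    """Equ. (5)-(6): total received bits over the HAP's total consumed energy."""
    d_bar = sum(sum(frame) for frame in bits_per_frame)
    consumed = sum((p0 + p_c) * tau0 for p0 in p0_per_frame)
    return d_bar / consumed

def device_feasible(p_i, e_i, e_harvested, p_c_dev, tau, e_max):
    """Constraints of Equ. (9): enough energy to transmit, and no battery overflow."""
    spent = (p_i + p_c_dev) * tau
    return spent <= e_i and e_i + e_harvested - spent <= e_max

# Two frames, two devices, with made-up transmit powers and uplink gains.
bits = [[data_bits(0.00, 1e-6), data_bits(0.01, 2e-6)],
        [data_bits(0.02, 1e-6), data_bits(0.01, 2e-6)]]
print(energy_efficiency([1.0, 1.0], bits))
print(device_feasible(0.01, e_i=0.05, e_harvested=0.001, p_c_dev=1e-4, tau=TAU, e_max=1.0))
```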
V. TWO-LAYER Q-LEARNING (TQL)

We apply Q-Learning [24] to solve our problem. In particular, an agent interacts with the environment to learn the best action for a given state over time. Advantageously, it does not require the probability distribution of states; i.e., it is a so-called model-free approach [8]. For a given state, an agent determines a policy or an action that results in it receiving a high expected value, which is calculated using the immediate reward and the value of future states.

Fig. 4 provides an overview of our two-layer approach; the HAP is located at the upper layer, and devices form the lower layer. At both layers, agents decide the transmission power of nodes. We run a Q-Learning (QL) agent at the HAP and at each device [8]. At the beginning of frame t, the HAP decides a transmit power p0t to charge devices. Each device then selects a transmit power pit to transmit its data to the HAP. As will become clear later, the goal of both agents is to learn a policy that allows them to select the best transmit power given their available energy. In particular, they will use the feedback received after taking each action in each state to reinforce or discourage an action.

In the following sections, we introduce MDP and QL. After that, we model the power allocation problem as a two-layer MDP (TMDP) [25]. Then, in Section V-C, we explain LFA. We then outline the TQL algorithms used to determine the optimal policy at each layer.
C. Q-Learning With LFA

Note that the HAP and each device have a continuous state space, which results in a large state-action space that is computationally intractable. To this end, we use LFA and outline novel features to represent the state-action space. Specifically, we employ a parametric function to approximate Qπ(st, at). Advantageously, LFA allows both the HAP and devices to find the global solution [27]. Another reason for adopting LFA is that it requires much less computational resources as compared to using a neural network, thus making it ideal for use on an energy constrained device [27].

LFA uses a linear combination of M feature functions fm(st, at), m = 1, 2, . . . , M to map each action-value pair (st, at) into a feature value. Let f be a M × 1 array of feature values. Let θ ∈ RM×1 be a column vector containing a set of weights that indicate the contribution of feature values. Therefore, Qπ(st, at) is approximated as the weighted sum of feature values [8], i.e.,

$$Q^{\pi}\left(s^t, a^t\right) \approx \tilde{Q}^{\pi}\left(s^t, a^t, \boldsymbol{\theta}\right) = \mathbf{f}^{\mathsf{T}}\boldsymbol{\theta}, \tag{15}$$

where Q̃π(st, at, θ) is the approximated action-value function, and T denotes the transpose of a matrix. The vector θ controls the linear combination of feature functions to approximate Qπ(st, at). To minimize the difference between Q̃π(st, at, θ) and Qπ(st, at), we use the Gradient Descent method to update θ as per Equ. (16)-(17).

… uses p0t. Observe that a smaller amount of consumed energy leads to a higher feature value. This is reasonable as we want the HAP to use the minimum amount of energy in order to maximize EE. The feature function f̂3(s0t, a0t) evaluates the charging efficiency when the HAP uses a transmission power p0t. Formally,

$$\hat{f}_3\left(s_0^t, a_0^t\right) = \begin{cases} c_2 \times \dfrac{\sum_{i=1}^{|\mathcal{N}|}\check{e}_i^t}{\left(p_0^t + p_c\right)\tau_0}, & a_0^t = p_0^t \neq 0 \\ 0, & p_0^t = 0, \end{cases} \tag{20}$$

where c2 = 10^3 is a constant value to normalize f̂3(s0t, a0t) ∈ (0, 1]. In words, this feature returns a non-zero value that is proportional to the amount of received energy at devices and inversely proportional to the energy consumed by the HAP. That is, this feature returns a value proportional to the resulting EE for a given action. Note, the HAP only uses f̂3(s0t, a0t) when the current channel information ǧ0it is available.

2) A Device's Feature Functions: Each device uses the function f̌1(sit, ait) to evaluate its battery condition after taking action ait and uses the function f̌2(sit, ait) to evaluate the selected transmission power pit. Formally, we have,

$$\check{f}_1\left(s_i^t, a_i^t\right) = \begin{cases} 1, & e_i^{t+1} < \bar{E}_{\max} \\ 0, & \text{otherwise}, \end{cases} \tag{21}$$

$$\check{f}_2\left(s_i^t, a_i^t\right) = \begin{cases} c_1 \times \dfrac{1}{\frac{\left(p_i^t + \check{p}_c\right)\tau}{e_i^t}\times 100\%}, & a_i^t = p_i^t \neq 0 \\ 0, & p_i^t = 0, \end{cases} \tag{22}$$
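As a small illustration of the LFA machinery, the Python sketch below evaluates Equ. (15) together with the features of Equ. (20) and Equ. (21); the first entry of the HAP's feature vector is a placeholder standing in for the HAP's other feature functions, which are not reproduced above, and all numeric inputs are illustrative.

```python
import numpy as np

def q_value(features, theta):
    """Equ. (15): the approximate action value is the inner product f^T * theta."""
    return float(np.dot(features, theta))

def hap_feature_f3(harvested_at_devices, p0, p_c, tau0, c2=1e3):
    """Equ. (20): charging efficiency of the chosen HAP power, scaled by c2;
    the feature is zero when the HAP does not transmit (p0 = 0)."""
    if p0 == 0:
        return 0.0
    return c2 * sum(harvested_at_devices) / ((p0 + p_c) * tau0)

def device_feature_f1(e_next, e_max):
    """Equ. (21): 1 if the action leaves the device battery below capacity, else 0."""
    return 1.0 if e_next < e_max else 0.0

# One (state, action) pair for the HAP: a placeholder feature followed by f3.
f_hap = np.array([0.8, hap_feature_f3([1e-4, 2e-4], p0=1.0, p_c=1e-3, tau0=0.5)])
theta_hap = np.ones(2)          # feature weights are initialised to one, as in Algorithms 2-3
print(q_value(f_hap, theta_hap))
print(device_feature_f1(e_next=0.4, e_max=1.0))
```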
… in line 10 the HAP selects a power allocation policy π̂ using ε-greedy. Next, it updates θ̂ using the obtained reward r0t, the current value Q̃π̂(s0t, a0t, θ̂) and the estimated value for the next state Q̃π̂(s0t+1, a0t+1, θ̂), see line 11.

2) Algorithm 2: Initially, a device i sets the weight for each feature function to one. In each frame, it first uses π̌i to select an action ait and obtains the corresponding reward rit, see lines 3-4. After that, the device observes the next state sit+1, and the next action set Ait+1. Note, as defined in Section V-B, its energy level eit+1 yields its corresponding action set Ait+1. Next, the device estimates the next value Q̃π̌i(sit+1, ait+1, θ̌i) by calculating the value for each action ait+1 ∈ Ait+1, see line 6. In line 7, the device then sets its power allocation policy π̌ using ε-greedy. The last step is to update θ̌i using the current reward rit, the current value Q̃π̌i(sit, ait, θ̌i) and the next value Q̃π̌i(sit+1, ait+1, θ̌i), see line 8.

3) Algorithm 3: As shown in line 1, the HAP initializes the weight for each feature function to one. In state s0t, the HAP selects an action a0t using ε-greedy, see line 1. Then it uses the current channel condition ǧ0it to calculate the reward r0t, see line 6. After that, it observes the next state s0t+1, and calculates the next estimated value. Next, the HAP updates θ̂ using Equ. (15)-(17). In line 10, the HAP uses ε-greedy to obtain its power allocation policy π̂. In the last step, devices perform the learning approach as illustrated in Algorithm 2.

Algorithm 3 (lines 6-13):
6: Calculate r0t as per Equ. (14)
7: Observe the next state s0t+1
8: Calculate max over a0t+1 ∈ A0 of Q̃π̂(s0t+1, a0t+1, θ̂)
9: Update θ̂ as per Equ. (15)-(17)
10: Obtain π̂ based on ε-greedy
11: end
12: end
13: return π̂

We note that the computational complexity in each frame for both TQL-1 and TQL-2 is a function of the action space A0 or Ait, whose size corresponds to the step size κ0 or κi. This is because the HAP or each device needs to search through their action space to identify the best action for a given state; see line 8 of Algorithm 3 for example. Lastly, we note that instead of running Algorithm 2 at each device, it is also possible to employ works such as SensorGym [28] at the HAP, whereby devices upload their state to the HAP; i.e., training is carried out at the HAP. The HAP then uses the state information of devices to learn the optimal policy, which it then uploads to devices. This means a device only needs to look up the best action given its current observed state.
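The per-frame learning step used by the agents can be sketched as follows. The Python snippet mirrors the select/observe/update cycle of lines 6-10 of Algorithm 3 with toy quantities; the update follows the standard semi-gradient Q-learning rule for LFA from [8] rather than reproducing Equ. (16)-(17) exactly, and the discount factor, feature extractor and reward value are assumptions.

```python
import random
import numpy as np

GAMMA = 0.9                     # discount factor (assumed; not specified in this section)

def epsilon_greedy(actions, q_of, epsilon):
    """Pick a random action with probability epsilon, otherwise the highest-valued one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=q_of)

def lfa_update(theta, features, reward, best_next_q, alpha, gamma=GAMMA):
    """Semi-gradient Q-learning step with LFA [8]:
    theta <- theta + alpha * (r + gamma * max_a' Q~(s', a') - Q~(s, a)) * f."""
    td_error = reward + gamma * best_next_q - float(np.dot(features, theta))
    return theta + alpha * td_error * features

# One HAP frame with toy quantities.
actions = [0.0, 0.5, 1.0]                       # discretised transmit powers (Watts, assumed)
theta = np.ones(3)                              # feature weights start at one
feats = lambda a: np.array([1.0, a, 0.2 * a])   # stand-in feature extractor
q = lambda a: float(np.dot(feats(a), theta))
a = epsilon_greedy(actions, q, epsilon=0.1)     # choose a transmit power
reward = 0.7                                    # placeholder for r_0^t of Equ. (14)
best_next = max(q(x) for x in actions)          # value of the best next action
theta = lfa_update(theta, feats(a), reward, best_next, alpha=0.1)
print(a, theta)
```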
VI. EVALUATION

We run our experiments in MATLAB on an Intel I5-4690 @ 3.50 GHz computer with 8 GB of RAM. We consider a RF-charging network with five devices randomly placed 2 to 6 meters away from the HAP. Each frame is composed of one charging slot with a length of τ0 = 0.5 s and five data slots; each slot has length τ = 0.1 s. We note our proposed methods remain valid for other values of τ and τ0 because the problem, and the statistics of channel gains and energy arrivals, remain the same. The energy arrival rate of each episode at the HAP is modelled as a Markov chain as per [21]. Each episode contains Γ = 5 frames. The highest transmission power for the HAP is P̄0,max = 30 dBm [11]. The energy conversion efficiency η ∈ [0, 0.624] is set according to the RF-energy harvester P2110B [29]. We note that the RF energy receiver has an operating range of [−12, 10] dBm [29]. The battery capacity of the HAP and devices is Ē0,max = 30 J and Ēmax = 1 J, respectively. We assume that at the beginning of each simulation, devices and the HAP have no residual energy. The channel from the HAP to devices is modelled according to the Log-distance path loss model of Equ. (2), where the path loss exponent is set to 2.5 and the standard deviation is σ = 6 dB. Moreover, we set ε to decay every five frames, where ε is set to one initially. For TQL-1, we set ε for the HAP and devices to 10/t and 1/t, where the ε of devices has an initial value of 0.1. For TQL-2, we set ε for the HAP and devices to 1/t. Note, ε for the HAP stops decreasing when it reaches 0.02. To train the agent at the HAP and devices, for each episode, the weights of the feature functions are updated based on the update rule defined by Equ. (16)-(17). We run the proposed TQL algorithms for 100000 frames to guarantee convergence to the optimal solution; referring to Fig. 5a, we see that the training process requires approximately 20000 frames. Note, all the results are an average of 50 simulations. Table II summarizes the parameter values used in our simulations.

TABLE II. PARAMETERS USED IN SIMULATIONS

Note that as our problem is new, see Section II, there are no prior methods we can compare against fairly. In particular, existing approaches require non-causal or future energy arrivals while our problem calls for a solution that uses only causal information. To this end, we only compare against the following methods:
• Random. The HAP or a device randomly selects a transmission power based on its available energy.
• Greedy. The HAP or a device always uses its highest possible transmission power, with a maximum of 30 dBm, in each frame.
• Low Energy Consumption (LEC). In each frame, the HAP uses κ = δ · P̄0,max dBm to charge devices, and in turn, a device uses 2% of its residual energy to transmit data back to the HAP. Note that in our TQL approaches, we consider κ0 = 0.02 · P̄0,max and κi = 0.02 · eit/τ as the step sizes to discretize the transmission power of the HAP and devices, respectively.

We study five scenarios. The first scenario considers various learning rate α values. Second, we study how energy arrival impacts performance. The third considers the HAP's maximum transmit power P̄0,max. Fourth, we investigate different SINR thresholds. Finally, we consider different channel conditions. We record four metrics: EE as per (6), sum rate as per (4), percentage of wasted energy due to overflow at the HAP, and the HAP's charging efficiency ζ; see Equ. (13).

A. Learning Rate

The learning rate α changes from 0.1 to 0.0001. We also investigate decaying learning rates, which are initially set to one and decay every five frames. For TQL-1, the HAP and devices have different decaying learning rates, where the HAP's learning rate is set to 250/t; devices have a learning rate of 5/t, which stops decreasing when its value is 0.001. As for TQL-2, the HAP and devices have a learning rate of 250/t, where the learning rate for the HAP decays from 0.1 to 0.01. Simulation results are recorded every 1000 frames.

1) Learning Rate for TQL-1: Referring to Fig. 5a, when α is fixed to a value lower than 0.1, TQL-1 has the lowest EE, which is 55.24% lower than when it uses α = 0.1. However, if it has a decaying learning rate, its EE is the highest; i.e., 1.41 × 10^7 bits/Joule. This is because in the first 20000 frames, TQL-1 learns faster than when it uses α = 0.1. After 20000 frames, as its learning rate decays, TQL-1 uses a small step size to explore more possible actions before converging to a solution. Thus, in the remaining experiments, we set the learning rate of the HAP and devices to 250/t and 5/t, respectively.

2) Learning Rate for TQL-2: Referring to Fig. 5b, when TQL-2 uses α = 0.01, it converges to a solution in 20000 frames. However, when it uses α = 0.001 and α = 0.0001, it does not converge in 100000 frames. Moreover, when TQL-2 uses α = 0.1, it has an EE that gradually decreases after 20000 frames. This is because the step size is too large to obtain a solution. In contrast, a decaying learning rate allows TQL-2 to learn fast in the first 25000 frames. When the decaying learning rate is lower than 0.01, it allows TQL-2 to use a small step size to obtain a solution. As a result, when TQL-2 uses a decaying learning rate, it has an EE that is 12.97% higher than when using α = 0.01. Therefore, in subsequent experiments, we use a learning rate of 250/t for both the HAP and devices.

3) Learning Rate Versus Convergence Time: We now study the relationship between channel conditions and learning rates. We increase σ from 6 dB to 8 dB. We calculate the Mean Square Error (MSE) of the difference between the estimated and true values as per Equ. (16) for each frame. Our TQL methods are deemed to have converged if their MSE is lower than the average MSE.

From Fig. 5c, we see that as σ increases, TQL-1 converges more slowly when it uses a decaying learning rate. Specifically, it converges in 13135 frames when σ = 6 dB, and converges in 15054 frames as σ increases to 8 dB. Similarly, varying channel conditions have a minor impact on the convergence rate when TQL-1 uses a small learning rate; i.e., α = 0.01, α = 0.001 and α = 0.0001. This is because a higher channel variation does not affect the system as much when an agent uses a small learning rate. In contrast, when TQL-1 uses a large learning rate, say α = 0.1, its convergence rate fluctuates from 43263 to 29412 frames.

Fig. 5d shows that TQL-2 converges more slowly when σ increases from 6 to 8 dB. Interestingly, σ has an insignificant impact on the convergence rate when TQL-2 uses a large learning rate; i.e., α = 0.01 and α = 0.1. This is because when the HAP uses TQL-2 it is able to exploit the current channel condition.
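For reference, the decay schedules and the action discretisation described above can be written compactly as in the Python sketch below. The floors follow the settings reported for the HAP (learning rate floor of 0.01 in TQL-2, ε floor of 0.02), while the 0-30 dBm grid and the one-decay-step-per-five-frames reading are our assumptions.

```python
def decayed(initial, t, floor):
    """Hyperbolic decay initial/t, clipped from below at a floor value."""
    return max(initial / t, floor)

def hap_learning_rate(t):
    """HAP learning rate 250/t; the floor of 0.01 mirrors the TQL-2 setting above."""
    return decayed(250.0, t, 0.01)

def hap_epsilon(t, period=5):
    """Exploration rate: starts at one, decays once every five frames, floor of 0.02."""
    return decayed(1.0, max(1, t // period), 0.02)

def hap_actions(p0_max_dbm=30.0, step_fraction=0.02):
    """Discretise the HAP transmit power with step kappa_0 = 0.02 * P0_max (in dBm),
    assuming the grid starts at 0 dBm."""
    kappa = step_fraction * p0_max_dbm
    return [round(k * kappa, 3) for k in range(int(p0_max_dbm / kappa) + 1)]

print(hap_learning_rate(1), hap_learning_rate(100000))   # 250.0 at the start, then the 0.01 floor
print(hap_epsilon(4), hap_epsilon(100000))               # 1.0 early on, 0.02 late in training
print(hap_actions()[:5], hap_actions()[-1])              # 0.0 ... 30.0 dBm in 0.6 dBm steps
```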
B. Energy Arrival Rates

To study the impact of energy arrivals, we increase the solar panel from 50 to 1500 cm^2. Fig. 6(d) shows that a higher energy arrival rate leads to more energy wastage. As the solar panel increases to 1500 cm^2, LEC wastes 97.61% of the received energy because the HAP transmits at a fixed transmission power of 13 dBm to charge devices. In contrast, Random and Greedy only waste 50.23% and 22.29% of the arrived energy when the solar panel size is 1500 cm^2. TQL-1 and TQL-2 train the HAP to transmit when the channel state is good. According to Fig. 6(c) and Fig. 6(d), when the solar panel size is 500 cm^2, TQL-1 and TQL-2 maintain the highest charging efficiency, i.e., 4.95 × 10^-5 and 2.33 × 10^-4, with an energy wastage of 49.31% and 32.32%. Fig. 6(b) shows that as the HAP has more energy, the sum rate improves for the tested methods. Moreover, when the solar panel increases to 1500 cm^2, the sum rate of TQL-1 and TQL-2 increases to 1.41 × 10^6 bits/s and 2.40 × 10^6 bits/s, respectively. On the one hand, more available energy at the HAP allows more energy to be transferred to devices. On the other hand, TQL-1 and TQL-2 train devices to remain idle until the channel condition becomes favorable. Therefore, TQL-1 and TQL-2 improve their sum rate. Interestingly, when the solar panel size is 50 cm^2, LEC has a sum rate that is 238.30% higher than the sum rate of Greedy. This is because LEC allows a device to use 2% of its residual energy to transmit data. However, devices using Greedy frequently have insufficient energy to transmit.
Consequently, as shown in Fig. 6(a), when the solar panel size is 50 cm^2, LEC has an EE that is 389.49% higher than the EE of Greedy.

From Fig. 6(a) we see that TQL-2 has a higher EE than TQL-1. However, the difference in EE between TQL-2 and TQL-1 decreases from 482.17% to 92.37% with higher energy arrivals. The reason is that TQL-2 has channel information, and it selects a transmission power that maximizes the charging efficiency whilst avoiding battery overflow. In contrast, TQL-1 has no channel information, so the HAP prefers a low transmission power that maximizes EE. Based on Fig. 6(c), when the solar panel size is 50 cm^2, TQL-2 has a charging efficiency that is 800.00% higher than the charging efficiency of TQL-1. Consequently, when there is no energy overflow at the HAP, TQL-2 has an EE that is 482.17% higher than the EE of TQL-1. As the energy arrival rate increases, energy overflow starts to occur. Therefore, to maximize EE, there is a trade-off between maintaining charging efficiency and avoiding energy waste. For instance, when the solar panel size is 500 cm^2, the energy wasted by TQL-1 and TQL-2 is 49.31% and 32.32%, while Greedy has no energy wastage. However, TQL-1 and TQL-2 have a charging efficiency that is 70.97% and 705.45% higher than the charging efficiency of Greedy. As a result, the EE of TQL-1 and TQL-2 is 3118.22% and 7210.64% higher than the EE of Greedy. According to Fig. 6(c), when the solar panel increases from 200 cm^2 to 1500 cm^2, the difference in charging efficiency between TQL-1 and TQL-2 reduces from 562.02% to 231.59%. Fig. 6(d) shows that when the solar panel size is 1500 cm^2, both TQL-1 and TQL-2 waste more than 60% of the arrived energy. In this case, we observe that with increasing energy arrival rates, energy overflow at the HAP occurs frequently. Therefore, TQL-1 and TQL-2 use a high transmission power to avoid overflow even when the channel is bad. Consequently, the EE gap between TQL-1 and TQL-2 decreases with higher energy arrivals.

C. Impact of Transmission Power

In this experiment, we vary the HAP's maximum transmission power P̄0,max from 30 dBm to 40 dBm. The HAP …
Fig. 9. Impact of σ.

… P̄0,max increases from 30 dBm to 40 dBm; see Fig. 7(b). Consequently, the EE of LEC improves to 6.65 × 10^6 when P̄0,max = 40 dBm, which is 4496.53% higher than when we set P̄0,max = 30 dBm.

Fig. 7(a) shows that as the HAP has a higher transmit power P̄0,max, EE improves for TQL-1 and TQL-2. From Fig. 7(c) we see that as P̄0,max increases from 30 dBm to 40 dBm, the charging efficiency of TQL-1 and TQL-2 improves by 19.65% and 39.14%, respectively. As a result, the sum rate of TQL-1 and TQL-2 improves by 192.65% and 4.77%. Based on Fig. 7(d), the energy wasted by TQL-1 reduces from 45.12% to 0.79% when P̄0,max increases from 30 dBm to 40 dBm. Similarly, TQL-2 also reduces the wasted energy from 0.87% to 0%. This is because TQL-1 and TQL-2 train the HAP to use a high transmission power when the channel is good. Therefore, as the HAP has a higher P̄0,max, TQL-1 and TQL-2 can use more energy to charge devices during good channel conditions, which improves the charging efficiency and also avoids energy overflow. Consequently, the EE of TQL-1 and TQL-2 improves by 21.08% and 4.77%, as P̄0,max increases from 30 dBm to 40 dBm.

D. Impact of SINR Thresholds

We increase the SINR threshold from 0 dB to 80 dB; transmissions that do not satisfy the minimum SINR threshold are marked as failed. The HAP has a 200 cm^2 solar panel and a maximum transmit power of P̄0,max = 30 dBm. Based on Fig. 8, the SINR threshold has no impact on the EE of Greedy. This is because Greedy causes the HAP to deplete its received energy to charge devices, and in turn, devices use up their harvested energy to transmit their data. In this case, despite the SINR threshold increasing from 0 dB to 80 dB, devices using the Greedy policy are able to meet their SINR threshold. In contrast, the SINR threshold has a significant impact on the EE attained by TQL-1, TQL-2 and LEC. When the SINR threshold increases from 0 dB to 80 dB, the EE of TQL-1, TQL-2 and LEC reduces by 63.01%, 48.12% and 94.02%, respectively.
Specifically, the EE of TQL-1 and TQL-2 starts to reduce when the SINR threshold is 40 dB. However, the EE of LEC reduces by 1.65% when the SINR threshold is 20 dB. On the one hand, we see from Fig. 6(c) that when the solar panel size is 200 cm^2, the amount of energy that LEC provides to devices is 0.57% and 0.09% of the harvested energy as compared to the case when the HAP uses TQL-1 and TQL-2. On the other hand, LEC allows a device to use 2% of its residual energy to transmit data, which causes some transmissions to fail to meet their SINR threshold. In contrast, TQL-1 and TQL-2 train devices to remain idle until the channel becomes good. In this case, devices that have sufficient energy use a high transmission power. Consequently, devices are able to meet their required SINR threshold, which ranges from 0 to 40 dB, when using TQL-1 and TQL-2.

E. Channel Conditions

To study the impact of channel conditions, we increase σ from 6 dB to 8 dB. The HAP has a 200 cm^2 solar panel and P̄0,max = 30 dBm. Fig. 9(c) shows that as σ increases from 6 dB to 8 dB, the charging efficiency of Greedy, Random and LEC increases to 6.12 × 10^-5, 7.27 × 10^-5 and 2.00 × 10^-5, respectively. From Fig. 9(a), when σ increases from 6 dB to 8 dB, the EE of TQL-1 and TQL-2 increases by 99.01% and 57.79%, respectively. Firstly, Fig. 9(c) shows that for σ ranging from 6 dB to 8 dB, the amount of harvested energy at devices increases by 142.48% and 124.35%. In this case, devices in TQL-1 and TQL-2 have more energy for transmissions. Secondly, TQL-1 and TQL-2 train devices to remain idle until the channel is favorable. This allows devices to reduce energy consumption when the channel is bad, thereby saving energy for use when the channel is good. Consequently, as σ increases from 6 dB to 8 dB, the sum rate of TQL-1 and TQL-2 increases by 98.89% and 57.78%, respectively; see Fig. 9(b). Therefore, the EE of TQL-1 and TQL-2 increases to 1.55 × 10^7 bits/Joule and 7.57 × 10^7 bits/Joule.

VII. CONCLUSION

In this article, we consider optimizing EE without channel state and energy arrival information. We have proposed a decentralized approach. That is, the HAP and each device determine their transmission power with only causal knowledge. The simulation results show that it is not necessary for each device to transmit in its assigned data slot. A better strategy is for the HAP and devices to remain idle until they deem the channel is good. Furthermore, our approach is able to improve EE while ensuring SINR thresholds higher than 40 dB. Lastly, an interesting research direction is to extend works such as [30], whereby devices learn the best modulation scheme or data rate for each system state that results in the highest expected sum rate over time.

REFERENCES

[1] C. Yang, K.-W. Chin, T. He, and Y. Liu, "On sampling time maximization in wireless powered Internet of Things," IEEE Trans. Green Commun. Netw., vol. 3, no. 3, pp. 641–650, Sep. 2019.
[2] A. Zanella, N. Bui, A. Castellani, L. Vangelista, and M. Zorzi, "Internet of Things for smart cities," IEEE Internet Things J., vol. 1, no. 1, pp. 22–32, Feb. 2014.
[3] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. M. Leung, and Y. L. Guan, "Wireless energy harvesting for the Internet of Things," IEEE Commun. Mag., vol. 53, no. 6, pp. 102–108, Jun. 2015.
[4] V. Talla, B. Kellogg, B. Ransford, N. Saman, S. Gollakota, and J. R. Smith, "Powering the next billion devices with Wi-Fi," in Proc. ACM CoNEXT, Heidelberg, Germany, Dec. 2015, pp. 1–13.
[5] X. Lu, P. Wang, D. Niyato, D.-I. Kim, and Z. Han, "Wireless networks with RF energy harvesting: A contemporary survey," IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 757–767, 2nd Quart., 2015.
[6] X. Huang and N. Ansari, "Energy sharing within EH-enabled wireless communication networks," IEEE Wireless Commun., vol. 22, no. 3, pp. 144–149, Jun. 2015.
[7] T. Wang, C. Jiang, and Y. Ren, "Access points selection in super WiFi network powered by solar energy harvesting," in Proc. IEEE WCNC, Doha, Qatar, Apr. 2016, pp. 1–5.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[9] S. Buzzi, I. Chih-Lin, T. E. Klein, H. V. Poor, C. Yang, and A. Zappone, "A survey of energy-efficient techniques for 5G networks and challenges ahead," IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 697–709, Apr. 2016.
[10] Q. Wu, W. Chen, M. Tao, J. Li, H. Tang, and J. Wu, "Resource allocation for joint transmitter and receiver energy efficiency maximization in downlink OFDMA systems," IEEE Trans. Commun., vol. 63, no. 2, pp. 416–430, Feb. 2015.
[11] Q. Wu, M. Tao, D. W. K. Ng, W. Chen, and R. Schober, "Energy-efficient resource allocation for wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2312–2327, Mar. 2016.
[12] T. A. Zewde and M. C. Gursoy, "Energy-efficient time allocation for wireless energy harvesting communication networks," in Proc. IEEE Globecom Workshops, Washington, DC, USA, Dec. 2016, pp. 1–6.
[13] X. Lin, L. Huang, C. Guo, P. Zhang, M. Huang, and J. Zhang, "Energy-efficient resource allocation in TDMS-based wireless powered communication networks," IEEE Commun. Lett., vol. 21, no. 4, pp. 861–864, Apr. 2017.
[14] W. Dinkelbach, "On nonlinear fractional programming," Manag. Sci., vol. 13, no. 7, pp. 492–498, Mar. 1967.
[15] S. Mao, S. Leng, J. Hu, and K. Yang, "Energy-efficient resource allocation for cooperative wireless powered cellular networks," in Proc. IEEE ICC, Kansas City, MO, USA, May 2018, pp. 1–6.
[16] M. Song and M. Zheng, "Energy efficiency optimization for wireless powered sensor networks with nonorthogonal multiple access," IEEE Sens. Lett., vol. 2, no. 1, pp. 1–4, Mar. 2018.
[17] J. Hu and Q. Yang, "Dynamic energy-efficient resource allocation in wireless powered communication network," Wireless Netw., vol. 25, no. 6, pp. 3005–3018, 2019.
[18] T. Liu, X. Wang, and L. Zheng, "A cooperative SWIPT scheme for wirelessly powered sensor networks," IEEE Trans. Commun., vol. 65, no. 6, pp. 2740–2752, Jun. 2017.
[19] P. Mukherjee and S. De, "cDIP: Channel-aware dynamic window protocol for energy-efficient IoT communications," IEEE Internet Things J., vol. 5, no. 6, pp. 4474–4485, Dec. 2018.
[20] Y. Zeng and R. Zhang, "Optimized training design for wireless energy transfer," IEEE Trans. Commun., vol. 63, no. 2, pp. 536–546, Feb. 2015.
[21] M.-L. Ku, Y. Chen, and K. J. R. Liu, "Data-driven stochastic models and policies for energy harvesting sensor communications," IEEE J. Sel. Areas Commun., vol. 33, no. 8, pp. 1505–1520, Aug. 2015.
[22] E. Boshkovska, D. W. K. Ng, N. Zlatanov, and R. Schober, "Practical non-linear energy harvesting model and resource allocation for SWIPT systems," IEEE Commun. Lett., vol. 19, no. 12, pp. 2082–2085, Dec. 2015.
[23] S. Sudevalayam and P. Kulkarni, "Energy harvesting sensor nodes: Survey and implications," IEEE Commun. Surveys Tuts., vol. 13, no. 3, pp. 443–461, 3rd Quart., 2011.
[24] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, pp. 279–292, May 1992.
[25] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: Wiley, 2014.
[26] R. Bellman, "Dynamic programming," Science, vol. 153, pp. 34–37, Jul. 1966.
[27] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How, "A tutorial on linear function approximators for dynamic programming and reinforcement learning," Found. Trends Mach. Learn., vol. 6, no. 4, pp. 375–454, Dec. 2013.
[28] A. Murad, K. Bach, F. A. Kraemer, and G. Taylor, "IoT sensor gym: Training autonomous IoT devices with deep reinforcement learning," in Proc. 9th Int. Conf. Internet Things, Bilbao, Spain, Oct. 2019, pp. 1–4.
[29] Data Sheet for P2110B Power Harvester. Accessed: 2016. [Online]. Available: https://www.powercastco.com/wp-content/uploads/2016/12/P2110B-Datasheet-Rev-3.pdf
[30] P. Mukherjee and S. De, "Dynamic feedback-based adaptive modulation for energy-efficient communication," IEEE Commun. Lett., vol. 23, no. 5, pp. 946–949, May 2019.

Kwan-Wu Chin received the Bachelor of Science (First Class Hons.) and Ph.D. (with commendation) degrees from Curtin University, Australia, in 1997 and 2000, respectively. He was a Senior Research Engineer with Motorola from 2000 to 2003. In 2004, he joined the University of Wollongong as a Senior Lecturer, where he was promoted to an Associate Professor in 2011. To date, he holds four U.S. patents, and has published more than 150 conference and journal articles. His current research areas include medium access control protocols for wireless networks, and resource allocation algorithms/policies for communications networks.