Cooperative Computation Offloading and Resource Allocation For Blockchain-Enabled Mobile Edge Computing: A Deep Reinforcement Learning Approach
2327-4662 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2019.2961707, IEEE Internet of
Things Journal
mobile devices will be susceptible to attacks. Therefore, the trust model needs to be considered in cooperative computation offloading.

3) Dynamic Optimization: Moreover, in most of the existing works [6], [10], [11], the computation offloading decision and resource allocation strategies are optimized over a single time slot, so the long-term computation offloading performance cannot be characterized [12], [13]. The design and optimization of blockchain-enabled MEC systems should account for the environmental dynamics, such as the time-varying channel conditions and the task arrivals.

To deal with the first two challenges, in this paper, we propose to maximize the weighted sum of the computation rate and the transaction throughput for blockchain-enabled MEC systems by jointly optimizing the cooperative computation offloading decision and resource allocation. Specifically, the computation tasks are offloaded from mobile devices to MEC servers through cooperative communications, wherein blockchain technology is applied to guarantee data security. For the dynamic optimization issue, we formulate the joint optimization as a Markov decision process (MDP) problem, and develop an efficient deep reinforcement learning (DRL) based offloading decision and resource allocation algorithm to solve the problem.

The contributions of this paper are summarized as follows.

• In most existing works [10], [11], [14], [15], the design and optimization of blockchain and MEC are done separately, which results in sub-optimal performance. We propose a cooperative computation offloading framework for blockchain-enabled MEC systems that enables the joint analysis of the MEC computation rate and the blockchain transaction throughput while considering the trust model.

• The study jointly considers the offloading decision, power allocation, block size, and block interval to maximize the weighted sum of the computation rate of the MEC systems and the transaction throughput of the blockchain systems. Considering the dynamic characteristics of the wireless channels and the available resources, the optimization problem is formulated as an MDP. Since the action space of the MDP problem contains both continuous and discrete actions, traditional learning algorithms, such as Q-learning [16] and SARSA [17], are ineffective. An asynchronous advantage actor-critic (A3C) reinforcement learning algorithm is therefore introduced to solve the MDP problem, in which deep neural networks are optimized by using asynchronous gradient descent and by eliminating the correlation of the data.

• The proposed algorithm and other baseline schemes are implemented by using Tensorflow on a Python-based simulator. Extensive simulation results show that the proposed algorithm converges well and achieves significant performance improvements over existing algorithms. Furthermore, we observe that the proposed scheme can achieve the optimal trade-off between the performance of the MEC system and that of the blockchain system.

The rest of this paper is organized as follows. In Section II, we discuss related research. We introduce the system model in Section III. In Section IV, the trust calculation in blockchain-enabled MEC systems is described. In Section V, the joint problem of offloading decision and resource allocation is formulated. We introduce the offloading decision and resource allocation in the A3C framework in Section VI. The performance of the proposed algorithm is evaluated and analyzed by simulations in Section VII. Finally, in Section VIII, we conclude this paper and look forward to future work.

II. RELATED WORK

The cooperative computation offloading problem has been widely studied for MEC and cloud computing systems [18]. Hong et al. [19] proposed a quality of service (QoS) cooperative computation offloading problem for robot swarms in cloud systems aimed at minimizing latency. Cao et al. [20] studied a novel cooperative computation offloading scheme based on both computation and communication in MEC systems to improve the energy efficiency of latency-constrained computation. Guo et al. [21] presented an efficient dynamic offloading and resource scheduling strategy to decrease energy consumption and latency. However, these approaches do not take into account the security and privacy of data in cooperative computation offloading.

The application of blockchain in MEC systems can significantly improve the network security, data integrity, and computation validity of the system [22]. The computation offloading problem has also been studied for blockchain-enabled MEC systems [11], [14]. Liu et al. [10] proposed a novel blockchain-based framework with an adaptive block size in MEC systems, which considered two offloading models, i.e., offloading to MEC servers or to nearby device-to-device users. Kang et al. [23] proposed a secure and distributed vehicular blockchain for data management in vehicular edge computing and networks. Based on the common decentralization characteristic of MEC and blockchain technology, Xu et al. [24] proposed a trustless crowd-intelligence ecosystem to alleviate network congestion. However, these works only consider blockchain as an overlay system above the MEC system, which gives rise to sub-optimal performance. Furthermore, these approaches utilize static optimization techniques, which cannot characterize the long-term computation offloading performance. Therefore, their methods cannot be applied in practical dynamic systems.

Deep reinforcement learning (DRL) is emerging as one of the efficient methods to obtain the optimal decision-making policy and maximize long-term rewards. Therefore, the use of DRL to solve computation offloading problems for blockchain-enabled MEC systems has attracted considerable interest from academia. Qiu et al. [25] proposed a model-free DRL-based computation offloading scheme for blockchain-enabled MEC systems while considering mining tasks and data processing tasks. A computation offloading problem based on an advanced deep Q-learning network (DQN) was presented to minimize energy consumption and delay [26]. However, all of the above algorithms can only handle discrete action spaces and do not apply to continuous action cases. Therefore, we develop an A3C-based cooperative computation offloading and resource allocation algorithm to achieve the optimal trade-off between the performance of the MEC system and that of the blockchain system.
forward scheme. The offloaded data received in both phases is combined at the BS using maximal ratio combining (MRC) [33]. The transmit rate at which mobile device n offloads the computation tasks through relay node r is given by

R_{n,r}(t) = (B/2) log_2( 1 + [ g_{r,n}(t) g_n(t) P_{tot,n}(t) / (σ²_{r,n}(t) σ²_n(t)) ] / [ g_{n,r}(t)/σ²_{n,r}(t) + g_{r,n}(t)/σ²_{r,n}(t) − g_n(t)/σ²_n(t) ] ),
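As a numerical sanity check, the rate expression above can be evaluated directly. This is a sketch with our own variable names (g_* are channel power gains, var_* the corresponding noise variances), not code from the paper:

```python
import math

def relay_rate(bandwidth_hz, g_nr, g_rn, g_n, var_nr, var_rn, var_n, p_tot):
    """Transmit rate of mobile device n offloading via relay node r.
    The factor bandwidth_hz / 2 reflects the two transmission phases
    combined at the BS with MRC."""
    num = g_rn * g_n * p_tot / (var_rn * var_n)
    den = g_nr / var_nr + g_rn / var_rn - g_n / var_n
    return bandwidth_hz / 2 * math.log2(1 + num / den)
```

For instance, with unit gains, unit noise variances, unit power, and g_nr = 1/3, the SNR term inside the logarithm evaluates to 3, so the rate is (B/2)·log2(4).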
constant. Then, the time for the MEC server to process the tasks of mobile device n is given by

τ_n(t) = D_n(t)L_n / s_n = r′_n(t) ψ_n L_n / s_n.  (13)

Since the computation results are very small, we ignore the return time of the computing results in this paper. The computation rate of MEC server n is given by s_n/L_n. The time for the computation tasks of mobile device n to be successfully executed is given by

t′_{n,off}(t) = ψ_n + τ_n(t).  (14)

Fig. 2: The process of the consensus algorithm (phases: pre-prepare, prepare, persist; nodes 0–3).
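The timing model of (13)–(14) is simple enough to sketch directly; the function below uses our own names (d_n for the offloaded data size, l_n for the processing density, s_n for the server CPU frequency, psi_n for the transmit time):

```python
def mec_execution_time(d_n, l_n, s_n, psi_n):
    """Time for the offloaded tasks of device n to complete."""
    tau_n = d_n * l_n / s_n   # server processing time, eq. (13)
    return psi_n + tau_n      # eq. (14); result-return time is neglected
```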
Accordingly, the total computation rate and the total time of mobile device n are respectively given by

r_n(t) = a_n(t) r_loc(t) + (1 − a_n(t)) ( r_{n,off}(t) + s_n/L_n ),  (15)

t_{tot,n}(t) = a_n(t) t_loc + (1 − a_n(t)) t′_{n,off}(t),  (16)

where a_n(t) ∈ {0, 1} is the offloading decision of mobile device n. When a_n(t) = 1, the computation tasks of mobile device n are executed locally. Otherwise, the computation tasks are offloaded to the MEC server.

C. Blockchain System

Any node in the blockchain can collect the transactions from the MEC system. To improve the system performance, some blockchain nodes with a high number of votes are selected as consensus nodes to participate in generating and verifying blocks. The number of votes for a consensus node candidate is determined by the number of stakes it holds, its available resources, and its trust value. We assume that the stake and the available computing resource of blockchain node n in slot t are represented by Φ_n(t) and T_n(t), respectively. The available computing resource of a node is the remaining resource after processing the offloaded tasks. Denote the sets of the stake and available computing resource of the nodes by Φ_s(t) = {Φ_1(t), Φ_2(t), ..., Φ_n(t)} and T(t) = {T_1(t), T_2(t), ..., T_n(t)}, respectively. We assume that the MEC server has a first-in-first-out (FIFO) data buffer to store the arrived but not yet executed offloaded tasks. Hence, the dynamics of the processing queue at the beginning of the (t+1)-th time slot are given by

F_n(t+1) = max{F_n(t) − s_n + ρ_n r_n(t), 0},  (17)

where ρ_n is the processing density (in CPU cycles/bit). Then, the computing resource available to the blockchain system at MEC server n in slot t is given by

T_n(t) = max{F′ − F_n(t), T_min},  (18)

where F′ and T_min are the total computing capacity of the MEC servers and the minimum computing capacity required by the blockchain system, respectively. Let D^trust_n denote the trust value of blockchain nodes. In this paper, we assume that the K block producers generate blocks in turn [15]. Let S_b(t) and T_b(t) denote the block size (in MB) and the block interval (in seconds) in slot t, respectively.

After generating a block, the block needs to be verified. In the consensus process, we utilize the delegated Byzantine fault tolerance (dBFT) consensus mechanism [34]. When there are K consensus nodes in the consensus process, we assume that K ≥ 3f + 1, where f is the maximum number of fault-tolerant nodes. In the consensus process, the leader node is called the speaker, and the others are called members. The speaker is responsible for broadcasting new block proposals to the other nodes, and the members are responsible for voting on the new block proposal. When the number of votes is not less than K − f, the proposal is passed. The speaker p of the consensus process is determined by

p = (h − v) mod N,  (19)

where h is the block height of the current consensus, and v is the view number. The process of the consensus algorithm is shown in Fig. 2.

The algorithm can be divided into three phases: pre-prepare, prepare, and persist. During the pre-prepare phase, the speaker for this round is responsible for broadcasting a message to the other members; meanwhile, the speaker launches a proposal. In the prepare phase, the members broadcast the message and vote. When a consensus node receives no less than K − f signatures for the block, it enters the third phase, in which a block is successfully generated. Meanwhile, the block is broadcast to the whole blockchain system, and then the next round of the consensus process begins.

Let T_c(t) denote the time cost of the consensus process. For simplicity, the consensus process is divided into two parts, i.e., message propagation and message verification, where verification includes signature verification, message authentication code (MAC) generation, and MAC verification [35]. Then, the latency of the consensus process in slot t is given by

T_c(t) = T_p(t) + T_v(t),  (20)

where T_p(t) and T_v(t) are the message propagation time and the validation time in slot t, respectively.

Similar to [15], we utilize the latency time to finality (LTF) as the latency of the blockchain system. The LTF is given by

T_total(t) = T_c(t) + T_b(t).  (21)

Then, the transaction throughput [15] can be expressed as

Ψ(t) = ⌊S_b(t)/χ⌋ / T_b(t),  (22)

where χ is the average size of transactions.
IV. TRUST CALCULATION IN BLOCKCHAIN-ENABLED MEC SYSTEMS

For secure communications, only relay nodes with high trust values should be chosen to relay the offloaded data to the MEC server. If computing tasks are relayed by a relay node with a low trust value, the relay node may take malicious actions, e.g., dropping the relayed data packets. Therefore, each mobile device should interact with relay nodes with high trust values to avoid potential security threats. Similarly, malicious block producers may generate fake blocks, so the selected consensus nodes should also have high trust values. To compute the trust values of nodes (relay nodes and consensus nodes), we jointly utilize two common evaluation approaches, i.e., direct trust and indirect trust (recommendation) [36]. Direct trust values of nodes are calculated based on subjective logic, while indirect trust values are computed based on third-party recommendations. In this work, we evaluate the trust value of a node by a real number ranging from 0 to 1. As in most of the literature, such as [14], [37], the trust threshold is set to 0.5. In other words, a node is trustworthy when its trust value is higher than 0.5; otherwise, it is not trustworthy. Next, we first calculate the trust value of relay nodes.

A. Calculation of Direct Trust

Similar to [36], we utilize node honesty and node capacity to calculate direct trust. Since the mobile communication channels between mobile devices and relay nodes are unstable and noisy, the communication behaviors in computation offloading involve considerable uncertainty. We tackle this uncertainty by using a Subjective Logic framework [38]. The trust value of mobile device n toward relay node r in the Subjective Logic framework can be described as a triplet ω_{n→r} = {b_{n→r}, d_{n→r}, ν_{n→r}}, where b_{n→r}, d_{n→r}, and ν_{n→r} represent belief, disbelief, and uncertainty, respectively. In particular, the relationships among them are determined by

b_{n→r}, d_{n→r}, ν_{n→r} ∈ [0, 1],
b_{n→r} + d_{n→r} + ν_{n→r} = 1.  (23)

nodes [36]. Therefore, the values of α_{n→r} and β_{n→r} can be respectively recast as

α^new_{n→r} = α_{n→r} + P^plr_{n→r} × (α_{n→r} + β_{n→r}),  (26)

β^new_{n→r} = β_{n→r} − P^plr_{n→r} × (α_{n→r} + β_{n→r}),  (27)

where P^plr_{n→r} is the packet loss rate. Similar to [36], the packet loss rate is estimated by the following equation:

P^plr_{n→r} = 1 − ( Σ_b^c ω(b) × ω(b) ) / ( Σ_b^c ω(b) ),  (28)

where ω(b) is the weight value of a historical link state, and let link = (ω(1), ω(2), ..., ω(b)) be the historical link state record. The weight value is given by ω(b) = 2b / (c(c+1)), where b and c are the serial number of ω(b) in link and the number of state records, respectively.

On the other hand, we assume that all relay nodes have the same initial energy consumption rate and energy level. When malicious nodes launch malicious attacks, they always consume anomalous energy. Therefore, we utilize energy as a quality of service (QoS) trust metric to measure whether a relay node is malicious or not. Let P^pen_{n→r} be the energy consumption rate, which is obtained by using the Ray Projection method [41] (P^pen_{n→r} ∈ [0, 1]). Then, the node competence (NC) is given by

NC_{n→r} = { 1 − P^pen_{n→r},  if E^res_{n→r} ≥ θ,
           { 0,               otherwise,          (29)

where E^res_{n→r} and θ are the residual energy of a relay node and the energy threshold, respectively.

As mentioned above, the node trust relies on node honesty and node competence. Then, the direct trust of a relay node is defined as

D^direct_{n→r} = { 0.5 + (NH_{n→r} − 0.5) × NC_{n→r},  if NH_{n→r} ≥ 0.5,
                { NH_{n→r} × NC_{n→r},                otherwise.          (30)
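The honesty/competence machinery of (28)–(30) can be sketched as follows; NH (node honesty) is assumed to be already derived from the subjective-logic triplet and is passed in directly, and the function names are ours:

```python
def weight(b: int, c: int) -> float:
    # Recency weight of the b-th of c historical link states, eq. (28):
    # omega(b) = 2b / (c(c+1)); the c weights sum to 1.
    return 2 * b / (c * (c + 1))

def node_competence(p_pen: float, e_res: float, theta: float) -> float:
    # Eq. (29): competence from the energy-consumption rate p_pen in [0, 1],
    # zeroed out when the residual energy falls below the threshold theta.
    return 1 - p_pen if e_res >= theta else 0.0

def direct_trust(nh: float, nc: float) -> float:
    # Eq. (30): combine node honesty NH and node competence NC.
    if nh >= 0.5:
        return 0.5 + (nh - 0.5) * nc
    return nh * nc
```

Note how (30) keeps an honest node's trust at or above the 0.5 threshold even when its competence is low, while a dishonest node's trust is scaled down further by its competence.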
trust evaluation. Therefore, we need to detect whether a recommendation is reliable before calculating the trust value. For this purpose, we present a simple way to detect the trust value by defining the recommended reliability R^rel_{n→r}. To begin with, we compute the average value of all updated recommendations for candidate r, denoted by R^ave_r. Then, we obtain the difference between the recommendation value and the average value. The greater the difference, the lower the reliability of the recommendation. Therefore, the recommended reliability R^rel_{n→r} is given by

R^rel_{n→r} = 1 − |R^{rec,i}_{n→r} − R^ave_r|,  (31)

where R^{rec,i}_{n→r} represents the recommended value of the i-th update in the blockchain system.

If the recommended reliability of a recommender is less than 0.5, the recommended value cannot be used to compute the recommended trust, even if the recommended value itself is high. Therefore, we obtain the recommended trust based on the recommended reliability and the recommended value as follows:

R^recom_{n→r} = ( Σ_{i=1}^{I} R^rel_{n→r} × R^{rec,i}_{n→r} ) / I,  (32)

where I is the number of updates. Therefore, the relay node trust is given by

D^trust_{n→r} = { D^direct_{n→r},                                    if α^new_{n→r} ≥ Th_num,
               { ω_direct D^direct_{n→r} + ω_recom R^recom_{n→r},    otherwise,               (33)

where ω_direct, ω_recom, and Th_num are the weight values of the direct value and the recommendation, and the threshold number of interactions between recommenders and the blockchain system, respectively, with ω_direct ∈ [0, 1], ω_recom ∈ [0, 1], and ω_direct + ω_recom = 1. Similarly, the trust value of the nodes in the blockchain system is evaluated using the same method.

V. PROBLEM FORMULATION

It is well known that wireless channels have the Markovian property [42], [43]. Therefore, the blockchain-enabled MEC system is formulated as a discrete MDP to maximize the system reward. Since it is impossible to predict the state transition probabilities and rewards in advance in a mobile environment, we propose a model-free approach based on deep reinforcement learning to solve the MDP problem. The MDP is defined by a tuple <S, A, P, r>, where S is the state set of the system, A is the action set of the system, P is the state transition probabilities, and r is the reward function.

A. State Space

The system state consists of the channel state G(t), the available computing resources T(t), the stakes Φ_s(t), and the trust values of the relay nodes and blockchain nodes D^trust(t) = {D^trust_{n→r}(t), D^trust_n(t)}, which is denoted as

S(t) ≜ {G(t), T(t), Φ_s(t), D^trust(t)}.  (34)

Since the state space is continuous, the probability of being in a particular state is zero. The probability that the process will leave the state s(t) and transition to the next state s(t+1) after taking an action a(t) ∈ A can be expressed as

Pr(s(t+1) | s(t), a(t)) = ∫_{S_{t+1}} f(s(t), a(t), s′) ds′,  (35)

where f is the state transition probability density function.

B. Action Space

The action space includes the offloading decision a(t), the power allocation P(t), the block size S_b(t), and the block interval T_b(t). We utilize A(t) = [a(t), P(t), a_Sk(t), a_Tk(t)] to define the action set.

Offloading Decision: The offloading decision is denoted by

a(t) ≜ [a_1(t), a_2(t), ..., a_N(t)].  (36)

Power Allocation Decision: The power allocation decision will be obtained based on achieving a maximum reward. We denote the power allocation decision by P(t), as shown below:

P(t) ≜ [P_total,1(t), P_total,2(t), ..., P_total,N(t)].  (37)

Block Size and Block Interval: The delegators are elected by voting based on the number of stakes held by the blockchain nodes, the trust values of the blockchain nodes, and the available computing resources. After determining the delegators, they take turns producing blocks. By using the limits fractional method, the block size and block interval decisions are respectively given by

a_Sk(t) ∈ [0.2, Ṡ_b],  (38)

a_Tk(t) ∈ [0.1, Ṫ_b],  (39)

where Ṡ_b and Ṫ_b are the block size limit and the maximum block interval, respectively.

C. Reward Function

In this paper, we formulate an optimization problem to maximize the weighted sum of the computation rate of the MEC system and the transaction throughput of the blockchain system, which jointly optimizes the offloading decision, power allocation, block size, and block interval. Then, the joint optimization problem is formulated as

max_{A(t)} E[ Σ_{t=0}^{T−1} ( ω_1 ω_2 Σ_{n=1}^{N} r_n(t) + (1 − ω_1) Ψ(t) ) ]  s.t. C1–C4,  (40)
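A hybrid action of this kind — binary offloading decisions, continuous powers, and the bounded block-size/interval decisions of (38)–(39) — can be sketched as a small container plus a projection step. Here sb_max mirrors the 8 MB block-size limit used later in the simulation settings, while tb_max is purely illustrative:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HybridAction:
    """A(t) = [a(t), P(t), a_Sk(t), a_Tk(t)]."""
    offload: List[int]        # a_n(t) in {0, 1}
    power: List[float]        # P_total,n(t)
    block_size: float         # a_Sk(t)
    block_interval: float     # a_Tk(t)

def clip_block_decisions(a_sk: float, a_tk: float,
                         sb_max: float = 8.0, tb_max: float = 10.0):
    """Project raw actor outputs onto the feasible ranges (38)-(39)."""
    return (min(max(a_sk, 0.2), sb_max),
            min(max(a_tk, 0.1), tb_max))
```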
where ω_1 (0 < ω_1 < 1) is a weight factor that combines the objective functions into a single one, and ω_2 is a mapping factor that ensures the objective functions are at the same level. ε (ε ≤ ∆t) is the maximum tolerable average delay of the offloading tasks, and P_T is the sum of the power available to all mobile devices and relay nodes in the network. Then, we define the reward function as

r_t = { O(t),  if C1–C4 are satisfied,
      { 0,     otherwise,               (41)

where O(t) = ω_1 ω_2 Σ_{n=1}^{N} r_n(t) + (1 − ω_1) Ψ(t).

VI. OFFLOADING DECISION AND RESOURCE ALLOCATION IN THE A3C FRAMEWORK

Compared with other DRL algorithms, such as actor-critic learning (AC), advantage actor-critic learning (A2C), and policy-based learning, A3C is a faster, simpler, and more robust parallel reinforcement learning algorithm proposed by Google DeepMind in 2016 [44]. It can reliably train deep neural network policies. Unlike the underlying reinforcement learning algorithms, such as actor-critic, which is an on-policy search algorithm, and Q-learning, which is an off-policy value-based search algorithm, A3C combines the benefits of the value-based and the policy-based methods [44]. More importantly, it works in discrete as well as continuous action spaces. A3C utilizes asynchronous actor-learners, i.e., multiple CPU threads on a single machine, to learn more efficiently. Multiple actor-learners running in parallel interact with their own environments and obtain different exploration policies. Moreover, the exploration policy of each actor-learner is independent of those of the others. Hence, the overall exploration policy available for training becomes more diverse.

In an A3C algorithm, we need to maintain a policy π(a_t|s_t; θ) (a set of action probability outputs) with parameter θ and an estimate of the value function V(s_t; θ_v) (how good a certain state is) with parameter θ_v. Compared with traditional policy gradient methods, A3C is more intelligent because the agent utilizes the estimated value function (the critic) to update the policy (the actor). The policy and the value function are updated at the terminal state or after a maximum of t_max steps. In the policy-based methods, the update rule uses the discounted return, which is given by

R_t(θ_v) = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v),  (42)

where k varies from state to state and is upper-bounded by t_max, r_{t+i} is the immediate reward, and γ ∈ (0, 1] is the discount factor.

However, this estimate can cause high variance. To reduce the variance of the estimate, the advantage estimate is adopted, which is given by

A(a_t, s_t) = Q(a_t, s_t) − V(s_t).  (43)

Since the value Q(a_t, s_t) cannot be determined in A3C, the discounted return is used as the estimate of Q(a_t, s_t) to generate an estimate of the advantage. Then, the advantage function is given by

A(a_t, s_t; θ, θ_v) = R_t(θ_v) − V(s_t; θ_v) = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v) − V(s_t; θ_v).  (44)

The policy π(a_t|s_t; θ) and the value function V(s_t; θ_v) are approximated by a single convolutional neural network. In particular, the policy function is output by a softmax layer, and the estimate of the value function is output by a linear layer. In A3C, all network weights are stored in a central parameter server [45]. In the beginning, each actor-learner sets its network parameters to those of the server. Then, multiple actor-learners learn concurrently and optimize the convolutional neural network through asynchronous gradient descent. After computing the gradients, the actor-learners send the updates to the server, and the server propagates the new weights to the actor-learners to ensure that they share a common policy.

Two loss functions are associated with the two network outputs. For the policy loss function, we have

f_π(θ) = log π(a_t | s_t; θ)(R_t − V(s_t; θ_v)) + βH(π(s_t; θ)),  (45)

where H(π(s_t; θ)) is the entropy, and β is a hyperparameter that controls the strength of the entropy regularization term.

Differentiating the policy loss function in (45) with respect to the parameter θ, we have

∇_θ f_π(θ) = ∇_θ log π(a_t | s_t; θ)(R_t − V(s_t; θ_v)) + β∇_θ H(π(s_t; θ)).  (46)

The loss function for the estimated value function is given by

f_v(θ_v) = (R_t − V(s_t; θ_v))².  (47)

Similarly, differentiating the value loss function in (47) with respect to θ_v yields

∇_{θ_v} f_v(θ_v) = 2(R_t − V(s_t; θ_v))∇_{θ_v} V(s_t; θ_v).  (48)

The loss functions can be minimized by adopting the RMSProp algorithm, which has been widely used in deep learning. The estimate of the gradient under RMSProp is given by

g = αg + (1 − α)∆θ²,  (49)

where α is the momentum, and ∆θ is the accumulated gradient of the loss function.

Then, the RMSProp parameters are updated by descending along the estimated gradient:

θ ← θ − η ∆θ / √(g + ϵ),  (50)

where η is the learning rate, and ϵ is a small positive number.

Based on (46) and (48), the details of the A3C algorithm used in our proposed approach are shown in Algorithm 1.
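The update pipeline of (42)–(50) — n-step returns, advantage, and the RMSProp step — can be sketched for a single scalar parameter (a toy stand-in for the network weights; the function names are ours):

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """k-step discounted returns, eq. (42): work backwards from the
    bootstrap value V(s_{t+k}) through the collected rewards."""
    returns = []
    r = bootstrap_value
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    returns.reverse()
    return returns

def advantage(ret, value):
    # Eq. (44): A = R_t - V(s_t)
    return ret - value

def rmsprop_update(theta, grad, g, lr=1e-3, alpha=0.99, eps=1e-8):
    """One RMSProp step, eqs. (49)-(50), on a scalar parameter."""
    g = alpha * g + (1 - alpha) * grad ** 2
    theta = theta - lr * grad / (g + eps) ** 0.5
    return theta, g
```

For example, with rewards [1, 1], a zero bootstrap value, and γ = 0.5, eq. (42) gives R_t = 1 + 0.5·1 = 1.5 for the first step.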
Algorithm 1 A3C-Based Computation Offloading and Re- TABLE II: Summary of the Simulation Parameters
source Allocation Algorithm
Parameters Definition Values
Initialization: B Bandwidth 180 KHz [50]
• Assume that θ and θv are parameters the actor network PT Maximum power available 1 W [51]
and critic network in global network. φn Transmit time 0.4 s [47]
• Assume that θ′ and θv′ are parameters the actor network N0 Noise power density −174 dBm/Hz [3]
and critic network in local network. χ Average transaction size 200 KB
• Set global counter T = 0 and local step counter t = 1. Ṡb Block size 8 MB [15]
• Set Tmax , tg , γ, learning rate η, ϵ, and tmax . ε Tolerable maximum delay 1 s [5]
• Set the number of agents W . Ln Processing density 737.5 cycle/bit [5]
Iteration: F Computation capability 2.5 GHz [5]
1: while T < Tmax do ϖ2 ,ϖ1 The weighted values 0.2, 0.0001
2: for w = 1 to W do ηa ,ηc Learning rate 10−3 ,10−2
3: Reset global gradient dθ = 0 and dθv = 0. ξ Shadowing standard deviation 10 dB [52]
4: Synchronize local parameters θ′ = θ and θv′ = θv .
5: Set t0 = t and obtain system state S(t).
6: repeat
7: Obtain action A(t) according to policy π(A(t)|S(t); θ′).
8: Execute action A(t), observe reward R(t), and observe next state S(t + 1).
9: t = t + 1.
10: until t − t0 == tmax
11: if t % tg == 0 then
12: R = V(S(t); θv′).
13: end if
14: for i = t − 1 to t0 do
15: R = R(i) + γR.
16: Compute the policy gradient ∇θ′ fπ(θ′) according to (46).
17: Accumulate the gradient dθ = dθ + ∇θ′ fπ(θ′).
18: Compute the value gradient ∇θv′ fv(θv′) according to (48).
19: Accumulate the gradient dθv = dθv + ∇θv′ fv(θv′).
20: end for
21: Asynchronously update the weight parameters θ and θv according to (50).
22: end for
23: end while

VII. SIMULATION RESULTS AND ANALYSIS

A. Simulation Parameters

We consider a network that consists of an MEC system and a blockchain system, with 30 mobile devices scattered over a 2 × 2 km² area [47]. The number of relay nodes within the coverage of each BS is 5. The CPU-cycle frequency of the mobile devices and the MEC servers is 1 GHz [5] and 2.4 GHz [48], respectively. Other simulation parameters are summarized in Table II, where the path-loss models and the shadowing fading are standard settings provided by 3GPP [49]. In our simulations, we use a computer with a 6-core Intel Core i5-8400 CPU and 32 GB of memory. The software environment is TensorFlow 1.10.0 with Python 3.6 on Ubuntu 18.04.2 LTS. For the blockchain system, we used the virtualization for distributed ledger technology (vDLT) platform we developed, which is a service-oriented blockchain system with virtualization and decoupled management/control and execution. Different block sizes can be set dynamically in vDLT by changing its parameters. For more details, please see http://vdlt.io/approach.html.

Using TensorFlow's built-in module TensorBoard, we visualize our A3C architecture, as shown in Fig. 3. In Fig. 3, the architecture of the proposed algorithm consists of a global network and six worker agents. We can observe that the proposed algorithm starts by constructing the global network. Then, the parameters of the global network are propagated synchronously to each worker agent. In Fig. 4, we show the internal structure of one of the worker agents. Every worker agent has its own network and environment. By interacting with their own environments, the worker agents update the global network parameters.
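The n-step return used in steps 12–15 of the algorithm, which bootstraps R from the critic's value estimate V(S(t); θv′) and then sweeps backwards with R = R(i) + γR, can be sketched in plain Python. This is an illustrative fragment with toy numbers, not the paper's TensorFlow implementation; the function name `n_step_returns` is our own.

```python
# Illustrative sketch of the backward n-step return accumulation in the
# A3C update (algorithm steps 12-15). The critic's value estimate seeds
# the recursion; each earlier step adds its reward to the discounted tail.
def n_step_returns(rewards, bootstrap_value, gamma):
    """Compute R_i = r_i + gamma * R_{i+1}, seeded with V(S(t); theta_v')."""
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):   # for i = t-1 down to t0
        R = r + gamma * R         # R = R(i) + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Toy example: three rewards, critic bootstrap value 10, gamma = 0.9
rets = n_step_returns([1.0, 2.0, 3.0], 10.0, 0.9)
# rets[2] = 3 + 0.9*10   = 12.0
# rets[1] = 2 + 0.9*12   = 12.8
# rets[0] = 1 + 0.9*12.8 = 12.52
```

Subtracting the critic's value estimate V(S(i)) from each return would give the advantage that enters the policy-gradient term of (46).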
Fig. 3: Visualization of the proposed deep reinforcement learning algorithm using TensorBoard.
Fig. 5: The total reward versus episode for different learning rates of the actor network (ηa = 10^-3, 10^-4, 10^-5).

Fig. 6: The total reward versus episode for different learning rates of the critic network (ηc = 10^-1, 10^-2, 10^-3).
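The ordering observed in Figs. 5 and 6 matches what plain gradient descent shows: within the stable range, a larger learning rate closes the gap to the optimum in fewer steps. The following toy sketch uses a quadratic objective of our own choosing, not the paper's actor or critic networks:

```python
# Toy example: gradient descent on f(theta) = theta^2 with different
# learning rates. Within the stable range, a larger rate needs fewer
# iterations to reach a given tolerance.
def steps_to_converge(lr, theta=1.0, tol=1e-3, max_steps=100000):
    """Count iterations until |theta| < tol under theta -= lr * f'(theta)."""
    steps = 0
    while abs(theta) >= tol and steps < max_steps:
        theta -= lr * 2.0 * theta   # f'(theta) = 2 * theta
        steps += 1
    return steps

n_fast = steps_to_converge(0.1)    # mirrors eta_c = 1e-1
n_mid = steps_to_converge(0.01)    # mirrors eta_c = 1e-2
n_slow = steps_to_converge(0.001)  # mirrors eta_c = 1e-3
```

Here the iterate contracts by a factor of 0.8 per step for a rate of 0.1, versus 0.98 and 0.998 for the smaller rates, which is consistent with the largest critic learning rate converging first in Fig. 6.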
From the figure, we can observe that with the critic learning rate ηc = 0.1 the proposed algorithm converges first, followed by ηc = 0.01 and ηc = 0.001.

In Fig. 7, we illustrate how the total available power PT affects the average reward. We can observe that the average reward increases as PT increases, and that the proposed algorithm outperforms the other schemes. From the figure, the average reward of all schemes grows slowly once the total available power PT exceeds 0.7 W. The reason is that as the transmit power increases, although the transmit rate increases, the communication overhead, such as energy consumption, also increases, which can affect the computation rate.

Fig. 7: Average reward versus the total power available PT.

Fig. 8 and Fig. 9 show the impact of the CPU-cycle frequency sn of the MEC servers on the average computation rate and the average transaction throughput, respectively. From Fig. 8, we can observe that the average computation rate of all schemes increases slowly with the increase in the CPU-cycle frequency sn. However, from Fig. 9, it is observed that the average transaction throughput decreases with the increase in the CPU-cycle frequency sn for all schemes. That is because the computing resources of the MEC servers are limited: when the MEC servers consume more computing resources to perform the offloaded tasks, fewer computing resources are available to the blockchain system.

Fig. 8: Average computation rate versus the CPU-cycle frequency sn.

Fig. 9: Average transaction throughput versus the CPU-cycle frequency sn.

Fig. 10 shows the comparison of the average reward versus the number of mobile devices. As can be seen from the figure, as the number of mobile devices increases, the average reward keeps increasing. Owing to the joint optimization of the offloading decision, the transmit power allocation, the block size, and the block interval, the proposed algorithm always benefits compared with algorithms that optimize only a subset of these variables.

Fig. 10: Average reward versus the number of mobile devices N.

In Fig. 11, we examine the average reward under different maximum block intervals Ṫb. It can be observed that the average reward of all schemes decreases with the increase in the maximum block interval. That is because the transaction throughput decreases with the increase in the maximum block interval when the other parameters are unchanged. To verify the impact of the average transaction size χ on the average reward, we evaluate the performance obtained by the proposed scheme under different average transaction sizes, as shown in Fig. 12. From the figure, we can observe that the average reward of all schemes decreases with increasing average transaction size. The reason is that one block can only contain a small number of transactions when the transactions are large. Furthermore, we can also find that the proposed scheme obtains the highest average reward over the whole range of average transaction sizes, followed by FBS and the Only-offloading scheme, with FBT the lowest. Similarly, we also evaluate the impact of the block size Ṡb on the average reward, as shown in Fig. 13. Observe that the average reward
slowly increases with the increase in block size, except for FBS. That is because the LTF constraint limits the maximum number of transactions in a block. Another observation is that the proposed scheme always performs the best, followed by Only-offloading and FBT.

Fig. 11: Average reward versus the maximum block interval Ṫb.

Fig. 12: Average reward versus the average transaction size χ.

Fig. 13: Average reward versus the block size limit Ṡb.

Moreover, we randomly choose a blockchain node in the blockchain system and display its computing resources at 90 randomly selected episodes for B = 150 kHz, B = 300 kHz, and B = 450 kHz in Fig. 14. From the figure, the queue length of the blockchain node at different episodes is finite, because the transmit rates are different. Besides, we can observe that the available computing resources of the blockchain node decrease with the increase in the bandwidth B. That is because the transmit rate increases with the increase in bandwidth.

Fig. 14: The computing resources of a randomly chosen blockchain node at 90 randomly selected episodes for the proposed algorithm.

In Figs. 15 and 16, we show the impact of the CPU-cycle frequency sn on the average reward, the computation rate of the MEC system, and the throughput of the blockchain system, respectively. From Fig. 15, it is in accordance with our intuition that the reward improves for a given ϖ1 as the CPU-cycle frequency sn increases. Besides, we can observe that the average reward increases as ϖ1 increases. That is because the performance of the MEC system is mainly affected by changes in the CPU-cycle frequency, while the performance of the blockchain system is almost constant, as shown in Fig. 16. We can thus achieve a tradeoff between the performance of the MEC system and the performance of the blockchain system based on Fig. 16.

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, we studied a blockchain-enabled MEC system and, considering the trust values of the nodes (i.e., relay nodes and consensus nodes), investigated the problem of jointly maximizing the computation rate of the MEC system and the transaction throughput of the blockchain system. To satisfy the performance requirements of the system, we jointly optimized the cooperative offloading decision, the power allocation, the block size, and the block interval. Due to the dynamic characteristics of the wireless channels and available resources, the formulated optimization
Fig. 15: The impact of different parameter settings (ϖ1 = 0.3 and 0.7; sn = 1, 2, and 3 GHz) on the average reward.

Fig. 16: The tradeoff under different parameter settings of ϖ1 and sn.
problem was modeled as an MDP. An A3C algorithm, which can stably train neural networks, was developed to cope with the MDP problem. Simulation results have shown the efficiency of our proposed algorithm, which achieves fast convergence and better performance than the other algorithms under different parameter settings. Meanwhile, we can also observe that the algorithm can achieve the optimal trade-off between the computation rate of the MEC system and the transaction throughput of the blockchain system. In future work, we will study interference management in blockchain-enabled MEC systems.

REFERENCES

[1] M. Patel, B. Naughton, C. Chan, N. Sprecher, S. Abeta, A. Neal et al., "Mobile-edge computing introductory technical white paper," White paper, mobile-edge computing (MEC) industry initiative, pp. 1089–7801, 2014.
[2] J. Kwak, Y. Kim, J. Lee, and S. Chong, "DREAM: Dynamic resource and task allocation for energy minimization in mobile cloud systems," IEEE J. Sel. Areas Commun., vol. 33, no. 12, pp. 2510–2523, Dec. 2015.
[3] J. Du, L. Zhao, X. Chu, F. R. Yu, J. Feng, and C. I, "Enabling low-latency applications in LTE-A based mixed fog/cloud computing systems," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1757–1771, Feb. 2019.
[4] J. Feng, Q. Pei, F. R. Yu, X. Chu, and B. Shang, "Computation offloading and resource allocation for wireless powered mobile edge computing with latency constraint," IEEE Wireless Communications Letters, accepted, 2019.
[5] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, "Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems," IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994–6009, Sept. 2017.
[6] J. Zhao, Q. Li, Y. Gong, and K. Zhang, "Computation offloading and resource allocation for cloud assisted mobile edge computing in vehicular networks," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7944–7956, Aug. 2019.
[7] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, "Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1927–1941, Sep. 2018.
[8] Z. Xiong, Y. Zhang, D. Niyato, P. Wang, and Z. Han, "When mobile blockchain meets edge computing," IEEE Comm. Mag., vol. 56, no. 8, pp. 33–39, Aug. 2018.
[9] D. Miller, "Blockchain and the Internet of things in the industrial sector," IT Professional, vol. 20, no. 3, pp. 15–18, May 2018.
[10] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, "Distributed resource allocation in blockchain-based video streaming systems with mobile edge computing," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 695–708, Jan. 2019.
[11] N. Zhao, H. Wu, and Y. Chen, "Coalition game-based computation resource allocation for wireless blockchain networks," IEEE Internet of Things J., pp. 1–1, 2019.
[12] T. Chen and G. B. Giannakis, "Bandit convex optimization for scalable and dynamic IoT management," IEEE Internet of Things J., vol. 6, no. 1, pp. 1276–1286, 2018.
[13] M. H. Amini, H. Arasteh, and P. Siano, "Sustainable smart cities through the lens of complex interdependent infrastructures: Panorama and state-of-the-art," in Sustainable Interdependent Networks II, 2019, pp. 45–68.
[14] J. Kang, Z. Xiong, D. Niyato, D. Ye, D. I. Kim, and J. Zhao, "Toward secure blockchain-enabled internet of vehicles: Optimizing consensus management using reputation and contract theory," IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 2906–2920, Mar. 2019.
[15] M. Liu, F. Yu, Y. Teng, V. Leung, and M. Song, "Performance optimization for blockchain-enabled industrial internet of things (IIoT) systems: A deep reinforcement learning approach," IEEE Trans. Ind. Info., pp. 1–1, 2019.
[16] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[17] G. A. Rummery and M. Niranjan, On-line Q-learning Using Connectionist Systems. University of Cambridge, Department of Engineering, Cambridge, England, 1994, vol. 37.
[18] F. Guo, F. R. Yu, H. Zhang, H. Ji, M. Liu, and V. C. M. Leung, "Adaptive resource allocation in future wireless networks with blockchain and mobile edge computing," IEEE Trans. Wireless Commun., accepted, 2019.
[19] Z. Hong, H. Huang, S. Guo, W. Chen, and Z. Zheng, "QoS-aware cooperative computation offloading for robot swarms in cloud robotics," IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 4027–4041, Apr. 2019.
[20] X. Cao, F. Wang, J. Xu, R. Zhang, and S. Cui, "Joint computation and communication cooperation for energy-efficient mobile edge computing," IEEE Internet of Things J., vol. 6, no. 3, pp. 4188–4200, Jun. 2019.
[21] S. Guo, J. Liu, Y. Yang, B. Xiao, and Z. Li, "Energy-efficient dynamic computation offloading and cooperative task scheduling in mobile cloud computing," IEEE Trans. Mobile Comput., vol. 18, no. 2, pp. 319–333, Feb. 2019.
[22] R. Yang, F. R. Yu, P. Si, Z. Yang, and Y. Zhang, "Integrated blockchain and edge computing systems: A survey, some research issues and challenges," IEEE Communications Surveys & Tutorials, vol. 21, no. 2, pp. 1508–1532, 2019.
[23] J. Kang, R. Yu, X. Huang, M. Wu, S. Maharjan, S. Xie, and Y. Zhang, "Blockchain for secure and efficient data sharing in vehicular edge computing and networks," IEEE Internet of Things J., accepted, 2019.
[24] J. Xu, S. Wang, B. Bhargava, and F. Yang, "A blockchain-enabled trustless crowd-intelligence ecosystem on mobile edge computing," IEEE Trans. Ind. Infor., pp. 1–1, accepted, 2019.
[25] X. Qiu, L. Liu, W. Chen, Z. Hong, and Z. Zheng, "Online deep reinforcement learning for computation offloading in blockchain-empowered mobile edge computing," IEEE Trans. Veh. Technol., pp. 1–1, 2019.
[26] D. C. Nguyen, P. N. Pathirana, M. Ding, and A. Seneviratne, "Secure computation offloading in blockchain based IoT networks with deep reinforcement learning," arXiv preprint arXiv:1908.07466, 2019.
[27] M. S. Alam, J. W. Mark, and X. S. Shen, "Relay selection and resource allocation for multi-user cooperative OFDMA networks," IEEE Trans. Wireless Commun., vol. 12, no. 5, pp. 2193–2205, May 2013.
[28] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, Oct. 2016.
[29] L. Liu, C. Chen, Q. Pei, S. Maharjan, and Y. Zhang, "Vehicular edge computing and networking: A survey," arXiv preprint arXiv:1908.06849, 2019.
[30] G. Han, J. Jiang, L. Shu, and M. Guizani, "An attack-resistant trust model based on multidimensional trust metrics in underwater acoustic sensor network," IEEE Trans. Mobile Comput., vol. 14, no. 12, pp. 2447–2459, Dec. 2015.
[31] Z. Yao, D. Kim, and Y. Doh, "PLUS: Parameterized and localized trust management scheme for sensor networks security," in 2006 IEEE International Conference on Mobile Ad Hoc and Sensor Systems, 2006, pp. 437–446.
[32] A. P. Miettinen and J. K. Nurminen, "Energy efficiency of mobile clients in cloud computing," in Proc. USENIX Conf. Hot Topics Cloud Comput. (HotCloud), Boston, MA, USA, Jun. 2012, pp. 1–7.
[33] M. S. Alam, J. W. Mark, and X. S. Shen, "Relay selection and resource allocation for multi-user cooperative OFDMA networks," IEEE Trans. Wireless Commun., vol. 12, no. 5, pp. 2193–2205, May 2013.
[34] "Antshares digital assets for everyone," [Online]. Available: https://www.antshares.org.
[35] A. Clement, E. L. Wong, L. Alvisi, M. Dahlin, and M. Marchetti, "Making Byzantine fault tolerant systems tolerate Byzantine faults," in NSDI, vol. 9, 2009, pp. 153–168.
[36] G. Han, J. Jiang, L. Shu, and M. Guizani, "An attack-resistant trust model based on multidimensional trust metrics in underwater acoustic sensor network," IEEE Trans. Mobile Comput., vol. 14, no. 12, pp. 2447–2459, Dec. 2015.
[37] R. A. Shaikh, H. Jameel, B. J. d'Auriol, H. Lee, S. Lee, and Y.-J. Song, "Group-based trust management scheme for clustered wireless sensor networks," IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 11, pp. 1698–1712, 2009.
[38] A. Josang, "An algebra for assessing trust in certification chains," in Proc. Netw. Distrib. Syst. Secur. Symp., 1999, pp. 1–10.
[39] Q. Liu, Y. Liao, B. Tang, and L. Yu, "A trust model based on subjective logic for multi-domains in grids," in Proc. Pacific-Asia Workshop on Computational Intelligence and Industrial Application, 2008, pp. 882–886.
[40] N. Oren, T. J. Norman, and A. Preece, "Subjective logic and arguing with evidence," Artificial Intelligence, vol. 171, no. 10-15, pp. 838–854, 2007.
[41] M. Chen, Y. Zhou, and L. Tang, "Ray projection method and its applications based on grey prediction," Chin. J. Stat. Decis., vol. 1, no. 1, pp. 13–20, 2007.
[42] M. Simsek, M. Bennis, and I. Güvenç, "Learning based frequency- and time-domain inter-cell interference coordination in HetNets," IEEE Trans. Veh. Technol., vol. 64, no. 10, pp. 4589–4602, 2014.
[43] Y. Wei, F. R. Yu, M. Song, and Z. Han, "User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 17, no. 1, pp. 680–692, 2018.
[44] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
[45] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," arXiv preprint arXiv:1611.06256, 2016.
[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[47] C. You, K. Huang, H. Chae, and B. Kim, "Energy-efficient resource allocation for mobile-edge computation offloading," IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
[48] Y. Kim, H. Lee, and S. Chong, "Mobile computation offloading for application throughput fairness and energy efficiency," IEEE Trans. Wireless Commun., vol. 18, no. 1, pp. 3–19, Jan. 2019.
[49] "3GPP TR 36.839 v11.0.0," 3GPP, Tech. Rep., Sept. 2012.
[50] "Evolved universal terrestrial radio access (E-UTRA): Physical channels and modulation," 3GPP TS 36.211 V8.6.0 Std., Mar. 2009.
[51] P. Guo, W. Hou, L. Guo, W. Sun, C. Liu, H. Bao, L. H. K. Duong, and W. Liu, "Fault-tolerant routing mechanism in 3D optical network-on-chip based on node reuse," IEEE Trans. Parallel Distrib. Syst., accepted, 2019.
[52] D. Zhai, R. Zhang, Y. Wang, H. Sun, L. Cai, and Z. Ding, "Joint user pairing, mode selection, and power control for D2D-capable cellular networks enhanced by nonorthogonal multiple access," IEEE Internet of Things J., vol. 6, no. 5, pp. 8919–8932, Oct. 2019.

Jie Feng is currently pursuing the Ph.D. degree in Communication and Information Systems at Xidian University, Xi'an, China. She has also been a visiting Ph.D. student at Carleton University since January 2019. Her current research interests include mobile edge computing, blockchain, deep reinforcement learning, device-to-device communication, resource allocation, convex optimization, and stochastic network optimization.

F. Richard Yu (S'00-M'04-SM'08-F'18) received the Ph.D. degree in electrical engineering from the University of British Columbia (UBC) in 2003. From 2002 to 2006, he was with Ericsson (in Lund, Sweden) and a start-up in California, USA. He joined Carleton University in 2007, where he is currently a Professor. He received the IEEE Outstanding Service Award in 2016, the IEEE Outstanding Leadership Award in 2013, the Carleton Research Achievement Award in 2012, the Ontario Early Researcher Award (formerly Premier's Research Excellence Award) in 2011, the Excellent Contribution Award at IEEE/IFIP TrustCom 2010, the Leadership Opportunity Fund Award from the Canada Foundation for Innovation in 2009, and the Best Paper Awards at IEEE ICNC 2018, VTC 2017 Spring, ICC 2014, Globecom 2012, IEEE/IFIP TrustCom 2009, and the Int'l Conference on Networking 2005. His research interests include wireless cyber-physical systems, connected/autonomous vehicles, security, distributed ledger technology, and deep learning. He serves on the editorial boards of several journals, including as Co-Editor-in-Chief for Ad Hoc & Sensor Wireless Networks and Lead Series Editor for IEEE Transactions on Vehicular Technology, IEEE Transactions on Green Communications and Networking, and IEEE Communications Surveys & Tutorials. He has served as the Technical Program Committee (TPC) Co-Chair of numerous conferences. Dr. Yu is a registered Professional Engineer in the province of Ontario, Canada, a Fellow of the Institution of Engineering and Technology (IET), and a Fellow of the IEEE. He is a Distinguished Lecturer, the Vice President (Membership), and an elected member of the Board of Governors (BoG) of the IEEE Vehicular Technology Society.

Qingqi Pei received his B.S., M.S., and Ph.D. degrees in Computer Science and Cryptography from Xidian University, in 1998, 2005, and 2008, respectively. He is now a Professor and member of the State Key Laboratory of Integrated Services Networks. He is also a Professional Member of the ACM, a Senior Member of the IEEE, and a Senior Member of the Chinese Institute of Electronics and the China Computer Federation. His research interests focus on privacy preserving, blockchain, and edge computing security.
Xiaoli Chu (M'06-SM'15) received the B.Eng. degree in electronic and information engineering from Xi'an Jiao Tong University, Xi'an, China, in 2001, and the Ph.D. degree in electrical and electronic engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2005. She is a Senior Lecturer with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K. From September 2005 to April 2012, she was with the Centre for Telecommunications Research, King's College London. She has published more than 100 peer-reviewed journal and conference papers. She is the Lead Editor/author of the book Heterogeneous Cellular Networks: Theory, Simulation and Deployment (Cambridge University Press, 2013) and the book 4G Femtocells: Resource Allocation and Interference Management (Springer, 2013).

Jianbo Du received the B.S. and M.S. degrees from Xi'an University of Posts and Telecommunications in 2007 and 2013, respectively, and the Ph.D. degree in communication and information systems from Xidian University, Xi'an, Shaanxi, China, in 2018. She is now a teacher with the Department of Communication and Information Engineering, Xi'an University of Posts and Telecommunications. Her research interests include mobile edge computing, resource management, NOMA, deep reinforcement learning, convex optimization, stochastic network optimization, and heuristic algorithms and their applications in wireless communications.