
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2020.3024611, IEEE Transactions on Industrial Informatics.

Multi-Agent Actor-Critic Network-Based Incentive Mechanism for Mobile Crowdsensing in Industrial Systems

Bo Gu, Member, IEEE, Xinxin Yang, Ziqi Lin, Weiwei Hu, Mamoun Alazab, Senior Member, IEEE, Rupak Kharel, Senior Member, IEEE

Abstract—Mobile crowdsensing (MCS) is an appealing sensing paradigm that leverages the sensing capabilities of smart devices and the inherent mobility of device owners to accomplish sensing tasks with the aim of constructing powerful industrial systems. Incentivizing mobile users (MUs) to participate in sensing activities and contribute high-quality data is of paramount importance to the success of MCS services. In this paper, we formulate the competitive interactions between a sensing platform (SP) and MUs as a multi-stage Stackelberg game with the SP as the leader player and the MUs as the followers. Given the unit prices announced by MUs, the SP calculates the quantity of sensing time to purchase from each MU by solving a convex optimization problem. Then, each follower observes the trading records and iteratively adjusts its pricing strategy in a trial-and-error manner based on a multi-agent deep reinforcement learning algorithm. Simulation results demonstrate the efficiency of the proposed method.1

Index Terms—Cognitive sensor networks, mobile crowd sensing (MCS), Stackelberg game, deep reinforcement learning (DRL), incentive mechanism, MADDPG

Manuscript received April 29, 2020; revised June 5, 2020; accepted September 5, 2020. Date of publication XXX XX, 2020; date of current version September 11, 2020. This work was partially supported by the National Key R&D Program of China (2019YFB1704702).
B. Gu, X. Yang, Z. Lin and W. Hu are with the School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou 510275, China (E-mail: [email protected]).
M. Alazab is with the College of Engineering, IT and Environment, Charles Darwin University, Australia (E-mail: [email protected]).
R. Kharel is with the Faculty of Science and Engineering, Manchester Metropolitan University, Manchester, UK, M15 6BH (e-mail: [email protected]).
1 Source code is available at https://github.com/glab2019/MADDPG-Based-MCS

I. INTRODUCTION

In current industrial spaces, data and information regarding working conditions, equipment status, environmental context, and so forth must be collected in a real-time manner at various locations. This information can be used for the monitoring, prediction, modelling and control of industrial systems to provide advanced services [1]–[4]. Generally, conventional sensing architectures require deploying sensors at all points of interest (PoIs), which substantially increases both the operational expenditure and the capital expenditure. In recent years, the proliferation of smart devices and advances in wireless communications technology have changed the methods used to collect data. These devices are usually equipped with various sensors (e.g., GPS, accelerometer and camera). By leveraging their sensing capabilities and the inherent mobility of device owners, it is possible to accomplish fine-grained sensing tasks. This sensing paradigm, called mobile crowdsensing (MCS), is especially suitable for large-scale sensing tasks due to its cost efficiency.

Nevertheless, the successful operation of MCS relies on the participation of MUs. Due to the increased battery consumption and the potential risk of data leakage, MUs are not willing to perform sensing tasks unless they are sufficiently compensated. Therefore, incentive mechanisms are essential for device owners to share part of their time, efforts and resources for sensing tasks. How to determine the appropriate reward to recruit MUs for sensing tasks so as to improve the data quality while satisfying the monetary budget constraints of task initiators (TIs) is a key problem in the MCS environment.

Previous work on task assignment and incentive design relies on the assumption that the sensing platform (SP) perfectly knows the private information of the MUs (i.e., user-perceived sensing cost and data quality). However, this assumption is quite strong, especially in a free market, where MUs compete with each other for sensing opportunities and believe that concealing their private information is beneficial. Under such an asymmetric information scenario, it is highly challenging for the SP to provide an appropriate reward in keeping with the quality of the data that the MU contributes. On one hand, a selfish MU may pretend to have a higher data quality to earn an extra reward. On the other hand, in the absence of the MUs' private information, the SP may provide a payment that is lower than the MUs deserve and, consequently, discourage the cooperation of MUs.

Game theory is regarded as a powerful tool for analysing the competitive interactions between two parties with conflicting interests. Recently, many game theory-based incentive mechanisms [5], [6] have been proposed to motivate truthful contribution in MCS systems. In [6], a Bayesian game is formulated to analyse the interactions between the SP and MUs, based on which the authors proposed a distributed incentive mechanism to obtain a Nash equilibrium solution while simultaneously maximizing the SP's utility. However, the above-mentioned algorithms require excessive information exchange (such as user-perceived sensing cost, data quality and utility functions), which is infeasible in practice due to the unbearable signalling overhead. Moreover, the previous works focus on the analysis of the interaction between the SP and candidate MUs, while the competition among MUs is ignored.


To address the above-mentioned problems and achieve an efficient and stable solution in the absence of complete information, we first formulate the task assignment and incentive design problem in the MCS environment as a multi-stage Stackelberg game that includes a leader and multiple followers. We consider the case where all the MUs offer their prices to the SP simultaneously. In the first stage, given the prices announced by the MUs, the SP determines the optimal task assignment by solving a utility maximization problem. In the second stage, based on the feedback from the environment (i.e., how the tasks are assigned to MUs), we propose a multi-agent deep deterministic policy gradient (MADDPG)-based [7] framework to learn the optimal pricing policy of each MU without requiring prior knowledge of the pricing strategies implemented by other MUs. The main contributions of the proposed algorithm are five-fold.

• We fully consider the interactions between the SP and MUs, as well as the competition among MUs, by constructing a Stackelberg leader-follower game. The Nash equilibrium, which gives plausible results for full competition, is obtained as a solution for the Stackelberg game.
• The proposed algorithm is scalable in the sense that it requires neither prior knowledge of the user-perceived sensing cost nor the actions simultaneously taken by other MUs. The signalling overhead can therefore be reduced substantially.
• The proposed algorithm is based on MADDPG, a centralized training and distributed execution DRL technique that is adaptable to environmental dynamics and parameter uncertainty.
• The proposed algorithm does not rely on existing training data. Each MU acts as a learning agent that interacts with its environment continuously to generate a sequence of experience for training purposes.
• Simulation results confirm that the proposed algorithm can learn a good policy that guides each MU to determine the price of a unit amount of its sensing time.

II. RELATED WORK

The existing research on task assignment and incentive design of MCS can be divided into two categories: optimization-based methods and reinforcement learning (RL)-based methods.

A. Optimization-Based Methods

In recent years, several task assignment and incentive design algorithms for MCS have been proposed based on convex optimization [8]–[10], game theory [11], [12], matching theory [13], and Lyapunov optimization [14], with the aim of optimizing data quality, sensing coverage or latency.

Data quality: In [11], Cao et al. designed a game theory-based incentive mechanism for MCS to encourage the appropriate user equipment nearby to share their resources for sensing tasks; then, they proposed a task-migration algorithm that enables resource sharing among users to enhance data quality. In [12], Jiang et al. proposed a group-based sensing framework where sensing data are stored and processed on user equipment locally and then shared among MUs in a P2P mode. Then, a pricing scheme is proposed to improve data quality while ensuring appropriate incentive for participants. In [15], Huang et al. proposed a data quality verification framework based on the consensus mechanism of blockchain. In [16], Guo et al. focused on a visual crowdsensing environment and proposed a dynamic and utility-enhanced incentive mechanism to match quality requirements to providers. In [17], Jiang et al. proposed a multi-layer MCS framework that allows different TIs to publish their data requirements so that common data items can be shared among TIs.

Sensing coverage and latency: In [9], Zhang et al. proposed a multi-task assignment mechanism to maximize the utility of the SP while satisfying a stringent delay constraint. The formulated problem is proved to be NP-hard, and a heuristic algorithm is presented to achieve a near-optimal task assignment. In [18], Cheung et al. focused on delay-sensitive sensing tasks and proposed a dynamic programming-based algorithm to maximize the user's profit. In [8], Ko et al. proposed a data-usage and mobility-aware task assignment mechanism for MCS by taking into account the sensing coverage and energy consumption of MUs.

B. Reinforcement Learning-Based Methods

RL and deep RL (DRL) [19] are commonly used machine learning paradigms in which agents learn how to map states to actions so as to maximize a long-term reward via continuous interaction with the environment. Specifically, RL uses a reward to guide agents to make better decisions. In recent years, considerable research effort has been devoted to studying task assignment and incentive design in MCS based on RL techniques [20]–[22]. In [23], Zhou et al. used DRL to verify the original data set to improve the efficiency and robustness of the MCS system. In [24], Xiao et al. focused on a vehicular environment in which the interaction between the SP and vehicles is modelled as a congestion-sensing game. Then, they proposed a Q-learning-based [19] algorithm to guide the SP and the vehicles to make optimal decisions. In [25], Chen et al. extended the study to a multi-agent scenario and proposed an RL-based algorithm to obtain the optimal sensing policy in real time to optimize the benefit of sensing participants.

In this paper, we assume a free market where MUs compete with each other for sensing opportunities. Through interactions with the operating environment, MUs learn an optimal pricing policy to maximize their payoff while satisfying their time budget constraints and the monetary budget constraint of the SP. We employ MADDPG to accelerate convergence and increase the robustness of the proposed algorithm.

III. PROBLEM FORMULATION

As shown in Fig. 1, an MCS system is composed of multiple stakeholders: (i) TIs that wish to collect sensing data at PoIs; (ii) MUs that share part of their time and resources to perform sensing tasks; and (iii) an SP that acts as a third-party agency to recruit MUs on behalf of TIs.


Without loss of generality, we consider a general "many-to-many" scenario where each task can be allocated to multiple MUs and each MU is permitted to participate in multiple sensing tasks. Moreover, we assume that MUs receive monetary payment for participating in sensing activities. If sufficiently compensated, MUs in the proximity of a PoI may participate in sensing activities. Initially, TIs publish their sensing tasks on the platform together with the parameters used to describe a task, including (i) the phenomena of interest, (ii) the target area, and (iii) the total budget for rewarding participants. Then, the SP recruits MUs to perform sensing tasks to meet the requirements of TIs.

Fig. 1: Mobile crowdsensing architecture (TIs submit sensing tasks and budgets to the sensing platform, which runs the incentive mechanism and task assignment and collects data from the MUs in the target area).

Time is divided into time slots. In each slot, we assume that the numbers of MUs and sensing tasks are denoted by U and V, respectively. Formally, the sets of MUs and sensing tasks are represented by U = {1, 2, ..., U} and V = {1, 2, ..., V}, respectively. Each MU i, ∀i ∈ U, has a limited available time t_i to perform sensing tasks. The interactions between the MUs and the platform are summarized as follows:

1. As shown in Fig. 1, MU i, who is interested in sensing tasks, first registers with the SP by indicating its available sensing time t_i and desired reward per unit of sensing time p_i.
2. Given the price and time budget of each MU, the SP determines the quantity of sensing time to purchase from each MU.
3. MUs perform the assigned sensing tasks and upload the sensing data to the SP.
4. The SP aggregates all sensing data and sends a report to the TIs.

IV. MULTI-STAGE STACKELBERG GAME

A Stackelberg leader-follower game [26] is a strategic game in which a leader commits to a strategy first and then followers move accordingly. In general, all the players in the game are assumed to be self-interested in the sense that they adopt strategies that consider those of others with the aim of maximizing their own benefit. In particular, considering the possible countermeasures of the followers, the leader first chooses a strategy that maximizes its own utility. Then, the followers maximize their utility based on the observation of the strategy adopted by the leader. In this paper, we fully account for the competition among followers. The best response of each follower is derived based on an MADDPG algorithm, where each follower interacts with its operating environment to learn a good policy to optimize its long-term reward without requiring any prior knowledge of the actions taken by the others. Formally, the Stackelberg leader-follower game is defined as follows:

• Player: The SP and the MUs are the players of the game, with the SP as the leader and the candidate MUs as the followers.
• Strategy: For each MU, the strategy is to determine the price of a unit quantity of its sensing time; for the platform, the strategy is to choose the quantity of sensing time to purchase from each candidate MU.
• Payoff: The payoff functions for each MU i ∈ U and the SP are shown in Eqs. (1) and (4), respectively.

For each sensing task j ∈ V, the quantity of sensing time to purchase from MU i is denoted by x_{ij}. Then, the payoff of MU i is given by

\psi_i(p_i, x_i) = \sum_{j \in V} p_i x_{ij}, \quad \forall i \in U \qquad (1)

where p_i is the unit price announced by MU i and x_i = \{x_{ij}\}_{j \in V}.

The payoff of TI j is defined by

\phi_j(x_j, p) = \pi_j\!\left(\sum_{i \in U} \omega_{ij} x_{ij}\right) - \sum_{i \in U} p_i x_{ij}, \quad \forall j \in V \qquad (2)

where x_j = \{x_{ij}\}_{i \in U}, p = \{p_i\}_{i \in U}, \omega_{ij} represents the data quality offered by MU i when performing sensing task j, and \pi_j(\cdot) is the utility function, which is an increasing concave function of the quality-weighted sensing time.

Then, the payoff of the SP, which is the sum payoff of all TIs and MUs (i.e., the social welfare), is given by

\phi(x, p) = \sum_{i \in U} \psi_i(p_i, x_i) + \sum_{j \in V} \phi_j(x_j, p) = \sum_{j \in V} \pi_j\!\left(\sum_{i \in U} \omega_{ij} x_{ij}\right) \qquad (3)

where x = \{x_{ij}\}_{i \in U, j \in V}.

Supposing that \pi_j(\cdot) = \log(\cdot), \forall j \in V, we have

\phi(x, p) = \sum_{j \in V} \log\!\left(\sum_{i \in U} \omega_{ij} x_{ij}\right) \qquad (4)

Definition 1: Letting x_i^* = \{x_{ij}^*\}_{j \in V}, x^* = \{x_{ij}^*\}_{i \in U, j \in V} and p^* = \{p_i^*\}_{i \in U}, the point (x^*, p^*) is a Nash equilibrium if it satisfies

\psi_i(p_i^*, x_i^*) \geq \psi_i(p_i, x_i^*), \quad \forall p_i \neq p_i^*, \; \forall i \in U \qquad (5)

and

\phi(x^*, p^*) \geq \phi(x, p^*), \quad \forall x \neq x^* \qquad (6)

In the next two sections, we illustrate how to derive the Nash equilibrium.
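To make the payoff structure of Eqs. (1)–(4) concrete, the following minimal Python sketch evaluates the MU payoffs, the TI payoffs and the social welfare for a given price vector and assignment. The variable names (prices, x, w) and the toy numbers are purely illustrative and are not taken from the authors' released code; NumPy is assumed to be available.

import numpy as np

def mu_payoffs(prices, x):
    """Eq. (1): psi_i = p_i * sum_j x_ij."""
    return prices * x.sum(axis=1)

def ti_payoffs(prices, x, w):
    """Eq. (2) with pi_j = log: phi_j = log(sum_i w_ij x_ij) - sum_i p_i x_ij."""
    quality_time = (w * x).sum(axis=0)              # one entry per task j
    payments = (prices[:, None] * x).sum(axis=0)    # total payment per task j
    return np.log(quality_time) - payments

def social_welfare(x, w):
    """Eqs. (3)-(4): the payments cancel, leaving only the log utility terms."""
    return np.log((w * x).sum(axis=0)).sum()

# Toy example with U = 3 MUs and V = 2 tasks.
prices = np.array([0.4, 0.5, 0.6])                        # unit prices p_i
w = np.array([[1.1, 1.2], [1.2, 1.3], [1.3, 1.4]])        # qualities w_ij
x = np.array([[2.0, 1.0], [1.5, 2.0], [1.0, 3.0]])        # assigned time x_ij
print(mu_payoffs(prices, x), ti_payoffs(prices, x, w), social_welfare(x, w))

As Eq. (3) indicates, the payments are an internal transfer between TIs and MUs, so the social welfare depends only on the quality-weighted sensing time.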


V. PLATFORM SOCIAL WELFARE OPTIMIZATION

Given a unit price vector p, the platform aims to optimize its payoff.

Problem 1:

\max_{x} \; \sum_{j \in V} \log\!\left(\sum_{i \in U} \omega_{ij} x_{ij}\right)
\text{s.t.} \; \sum_{j \in V} x_{ij} \leq t_i, \; \forall i \in U; \quad \sum_{i \in U} p_i x_{ij} \leq b_j, \; \forall j \in V \qquad (7)

where t_i is the time budget of MU i to perform sensing tasks and b_j is the budget of TI j.

Theorem 1. The optimal solution to Problem 1 is given as follows:

x_{ij}^* =
\begin{cases}
\dfrac{b_j w_{ij}}{p_i \sum_{i \in U} w_{ij}}, & \text{if } I \geq 0 \\[2mm]
\dfrac{t_i w_{ij}}{\sum_{j \in V} w_{ij}}, & \text{otherwise}
\end{cases}
\quad \forall i \in U, \forall j \in V \qquad (8)

where I = \sum_{j \in V} \log\!\left(\sum_{i \in U} \dfrac{b_j w_{ij}^2}{p_i \sum_{i \in U} w_{ij}}\right) - \sum_{j \in V} \log\!\left(\sum_{i \in U} \dfrac{t_i w_{ij}^2}{\sum_{j \in V} w_{ij}}\right).

Proof. See Appendix A.
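The closed form of Theorem 1 is straightforward to evaluate. The sketch below is one possible NumPy implementation of Eq. (8); the function name and the (U, V)-shaped array layout are our own conventions, not the authors' code. Note that the indicator I is simply the welfare gap between the budget-limited and time-limited candidate assignments.

import numpy as np

def optimal_assignment(p, t, b, w):
    """SP task assignment from Theorem 1, Eq. (8).

    p: (U,) unit prices, t: (U,) time budgets, b: (V,) TI budgets,
    w: (U, V) sensing qualities w_ij.  Returns x of shape (U, V).
    """
    col_sum = w.sum(axis=0)      # sum_i w_ij, per task j
    row_sum = w.sum(axis=1)      # sum_j w_ij, per MU i
    x_budget = b[None, :] * w / (p[:, None] * col_sum[None, :])   # budget-limited branch
    x_time = t[:, None] * w / row_sum[:, None]                    # time-limited branch
    # I from Theorem 1 = welfare of the budget branch minus welfare of the time branch.
    I = (np.log((w * x_budget).sum(axis=0)).sum()
         - np.log((w * x_time).sum(axis=0)).sum())
    return x_budget if I >= 0 else x_time

# Example at the paper's simulation scale (U = 8 MUs, V = 2 tasks).
U, V = 8, 2
i = np.arange(1, U + 1)
w = np.stack([1.0 + i / 10, 1.1 + i / 10], axis=1)
p, t, b = np.full(U, 0.5), np.full(U, 20.0), np.full(V, 20.0)
print(optimal_assignment(p, t, b, w))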

VI. MADDPG-BASED USER PAYOFF OPTIMIZATION

In this section, we present how each MU chooses its best response to the strategy taken by the SP. We first provide an overview of the RL [27] techniques that form the basis of our proposed algorithm. RL is a machine learning paradigm in which agents learn how to map environmental states to actions to optimize their long-term returns via trial-and-error experience. In time step t, an agent selects an action a_t based on the environment's state s_t. Then, the agent receives a numerical reward r_{t+1}, and the environment transitions to a new state s_{t+1}. The agent can finally learn a good policy in such a trial-and-error manner. Generally, RL algorithms can be classified into two categories: action-value methods and policy gradient methods.

A. Action-Value Methods

Q-learning is a well-known RL algorithm, where an agent interacts with the operating environment continuously to update the Q-value, i.e., the utility of taking action a in state s given policy \pi:

Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right] \qquad (9)

We denote the optimal action-value function by Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a). According to the Bellman optimality equation, Q^*(s, a) can be expressed as

Q^*(s, a) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a'} Q^*(s', a') \mid s_t = s, a_t = a\right] \qquad (10)

where s' is the new state after taking action a. The key idea behind Q-learning is that the optimal action-value function Q^*(s, a) is continuously updated based on the sequence of experience samples from actual or simulated interactions with the environment.

When the state and action spaces are large, traditional RL algorithms (e.g., Q-learning) may fail to find a good policy in a reasonable period of time. Deep Q-Network (DQN) [28], which combines Q-learning with a deep neural network, has the potential to overcome this shortcoming.
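For readers unfamiliar with action-value methods, the following minimal tabular sketch shows how the Bellman target of Eq. (10) is turned into an incremental Q-learning update. The environment interface (env.reset, env.step) and the hyperparameters are generic placeholders, not part of the paper's system.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the current value estimates
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r + gamma * Q[s_next].max() * (not done)   # Bellman target, Eq. (10)
            Q[s, a] += alpha * (target - Q[s, a])                # incremental update
            s = s_next
    return Q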
B. Policy Gradient

Q-learning and DQN learn the values of actions: the larger the value of an action, the more likely the action is to be selected (e.g., by an ε-greedy algorithm [27]). In contrast, policy gradient algorithms learn a parameterized policy that maximizes an objective function, which enables agents to select actions without referring to value functions. The objective function is defined as follows:

J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}[R] \qquad (11)

where \theta is the policy's parameter vector, p^{\pi} is the state distribution and \pi_{\theta} is the parameterized policy. According to the policy gradient theorem, the policy gradient can be expressed as follows:

\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) \, Q^{\pi}(s, a)\right] \qquad (12)

Policy gradient algorithms aim to optimize J(\theta), and the policy parameters can be updated in the direction of \nabla_{\theta} J(\theta):

\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta) \qquad (13)

where \alpha is a non-negative step size for each adjustment.

The deterministic policy gradient (DPG) algorithm [29] extends the policy gradient to a deterministic policy \mu_{\theta}, and the gradient of the objective function J(\theta) = \mathbb{E}_{s \sim p^{\mu}}[R(s, a)] in DPG can be expressed as

\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\nabla_{\theta} \mu_{\theta}(s) \, \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}\right] \qquad (14)

where \mathcal{D} is a randomly sampled mini-batch that contains samples generated by the learning agent.

Deep DPG (DDPG) [30] is a combination of DPG and deep learning, where two deep neural networks are used as approximators for the policy \mu and the critic Q^{\mu}. As shown in Fig. 2, the input of the actor network is the state s, and the output is the deterministic action a. On the other hand, the inputs of the critic network include the state s and the action a generated by the actor network, and the output is the estimated Q-value.

In a multi-agent learning system, the state transition of the environment depends on the joint behaviour of the agents. In other words, the actions taken by one agent directly affect the actions of other agents evolving in the same system, which may lead to slow policy convergence. Thus, we adopt an improved algorithm termed MADDPG [7], which has two key techniques:


1. Centralized training and distributed execution: agents share their experience during training, but each agent uses only local observations at execution time;
2. Augmented critic: the critic is augmented by inferring the policies of the other agents [7].

Fig. 2: MADDPG architecture (for each agent i, an online actor and an online critic with parameters \theta^{\mu_i} and \theta^{Q_i}, corresponding target networks refreshed by soft updates, OU exploration noise, and a replay buffer D storing transitions (o^t, a^t, r^t, o^{t+1})).

C. Proposed Algorithm

As described in the previous section, given the feedback from the environment (i.e., how the tasks are assigned to MUs), each MU i chooses a unit price with the aim of maximizing its payoff:

\max_{p_i} \; \psi_i(p_i, x_i) \quad \text{s.t.} \; p_i \geq 0 \qquad (15)

Without prior knowledge of the actions taken by other agents, we derive the optimal action of each agent based on the MADDPG algorithm.

State observation: The state is a vector of environment features corresponding to our problem defined in objective (15). We take into account the observations of the previous L time slots so that agents have a better opportunity to track changes in the environment. In each time slot t, the observation space of MU i is defined as

o_i^t = \{p_i^{t-1}, x_i^{t-1}, \cdots, p_i^{t-L}, x_i^{t-L}\}, \quad \forall i \in U \qquad (16)

Action: In time slot t, each MU i observes how tasks are assigned in the previous L slots and then determines the unit price in the current slot, i.e., p_i^t, according to the output of the actor network:

a_i^t = \mu_i\!\left(o_i^t \mid \theta^{\mu_i}\right) \qquad (17)

Note that the action space here is continuous, which satisfies the requirements of MADDPG.

Reward function: We design a reward function to optimize the objective (15). The reward function of MU i is defined as

r_i^t = \log\!\left(1 + p_i^t \sum_{j \in V} x_{ij}^t - c_i^t \sum_{j \in V} x_{ij}^t\right) \qquad (18)

where c_i^t is the unit sensing cost of MU i. The logarithm function is adopted to ensure a negative reward (or penalty) when the payment received (i.e., p_i^t \sum_{j \in V} x_{ij}^t) is not enough to compensate the sensing cost (i.e., c_i^t \sum_{j \in V} x_{ij}^t).
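The per-agent observation, action and reward of Eqs. (16)–(18) can be wrapped in a small environment-style helper. The sketch below is only one possible reading of these definitions; the class name, the history layout and the clipping inside the logarithm are our own assumptions, and the SP side would be provided by the closed-form assignment of Theorem 1 (e.g., the optimal_assignment helper from the earlier sketch).

import numpy as np
from collections import deque

class PricingMarket:
    """Tracks the last L (price, assignment) pairs per MU and builds Eqs. (16)-(18)."""

    def __init__(self, n_mu, n_task, history_len):
        self.L = history_len
        self.hist = [deque(maxlen=history_len) for _ in range(n_mu)]
        for h in self.hist:                       # start from an all-zero trading history
            h.extend([np.zeros(1 + n_task)] * history_len)

    def observe(self, i):
        """Eq. (16): concatenation of MU i's previous L announced prices and assignments."""
        return np.concatenate(list(self.hist[i]))

    def step(self, prices, x, costs):
        """Record one trading round and return the per-MU rewards of Eq. (18)."""
        earned = (prices - costs) * x.sum(axis=1)                 # payment minus sensing cost
        rewards = np.log(np.maximum(1.0 + earned, 1e-6))          # clipped for numerical safety (assumption)
        for i, h in enumerate(self.hist):
            h.appendleft(np.concatenate(([prices[i]], x[i])))
        return rewards

# One round would look like:
#   x = optimal_assignment(prices, t, b, w)   # SP's best response (Theorem 1)
#   r = market.step(prices, x, costs)         # MU rewards, Eq. (18)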
As shown in Fig. 2, MADDPG consists of two networks for each learning agent, namely, a critic and an actor. In each time slot t, the input of each actor network is the observation of environmental features o_i^t, and the output is the deterministic action. On the other hand, the inputs of the critic network include o_i^t and a_i^t, and the output is the estimated value of the current state.

Critic: The loss function of the critic network is calculated by

L_i = \frac{1}{T} \sum_{t=1}^{T} \left(y_i^t - Q_i\!\left(o^t, a^t \mid \theta^{Q_i}\right)\right)^2 \qquad (19)

where a^t = \{a_1^t, \cdots, a_N^t\}, o^t = \{o_1^t, \cdots, o_N^t\} and y_i^t is the target action value defined by

y_i^t = r_i^t + \gamma Q_i'\!\left(o^{t+1}, a'^{\,t+1} \mid \theta^{Q_i'}\right) \qquad (20)


By conducting intensive experiments, we observe that the proposed algorithm achieves the best performance when \gamma is equal to 0. Namely, the immediate reward appropriately reflects the true state-action value.

Actor: The gradient of the expected reward for agent i with deterministic policy \mu_{\theta_i} is given by

\nabla_{\theta^{\mu_i}} J \approx \frac{1}{T} \sum_{t=1}^{T} \nabla_{\theta^{\mu_i}} a_i \, \nabla_{a_i} Q_i\!\left(o^t, a_i, a_{N \setminus i}^t \mid \theta^{Q_i}\right)\Big|_{a_i = \mu_i(o_i^t \mid \theta^{\mu_i})} \qquad (21)

where a_{N \setminus i}^t = \{a_1^t, \cdots, a_{i-1}^t, a_{i+1}^t, \cdots, a_N^t\}.

Algorithm 1 summarizes the training process of MADDPG.

Algorithm 1 Training process of MADDPG
Initialize:
    Randomly initialize the critic network Q_i(o, a | \theta^{Q_i}) and the actor \mu_i(o | \theta^{\mu_i}) with weights \theta^{Q_i} and \theta^{\mu_i}
    Initialize the target networks Q_i' and \mu_i' with weights \theta^{Q_i'} \leftarrow \theta^{Q_i}, \theta^{\mu_i'} \leftarrow \theta^{\mu_i}
    Initialize the replay buffer D
    Initialize the hyperparameters \tau, \gamma
for episode = 1, 2, ..., M do
    Initialize a random process N for action exploration
    Receive the initial observation state o^1
    for t = 1, 2, ..., T do
        For each agent i, select action a_i^t = \mu_i(o_i^t | \theta^{\mu_i}) + N_t according to the current policy and exploration noise
        Execute action a^t = {a_1^t, ..., a_N^t}; calculate the reward r^t = {r_1^t, ..., r_N^t} according to Eq. (18) and the new state o^{t+1} according to Eq. (16)
        Store the transition (o^t, a^t, r^t, o^{t+1}) in D
        o^t \leftarrow o^{t+1}
        for agent i = 1, ..., N do
            Sample a random minibatch of S transitions (o^j, a^j, r^j, o^{j+1}) from D
            Update the critic by minimizing the loss according to Eq. (19)
            Update the actor using the sampled policy gradient according to Eq. (21)
        end for
        Update the target networks:
            \theta^{Q_i'} \leftarrow \tau \theta^{Q_i} + (1 - \tau) \theta^{Q_i'}
            \theta^{\mu_i'} \leftarrow \tau \theta^{\mu_i} + (1 - \tau) \theta^{\mu_i'}
    end for
end for
VII. N UMERICAL E VALUATION is impossible in practical systems but can serve as an
A. Simulation Setup optimal benchmark.
• Random: Unit prices are randomly generated in the range
We consider an MCS system that consists of multiple
of (0, 1).
MUs and multiple TIs. The payoff maximization problem
is constrained by time and monetary budget simultaneously. We compare these algorithms in two scenarios. In the first
The data qualities offered by the MUs are sorted in an scenario, the time budgets of the MUs (i.e., ti ) vary from 5
ascending order. For simplicity, we set wi,1 = 1 + i/10 and to 25, and the monetary budgets of the TIs are fixed to 20. In
wi,2 = 1.1 + i/10. The sensing qualities are fixed throughout the second scenario, the monetary budgets of TIs are randomly
the whole experiment. Because self-interested agents aim to selected from 20 to 40, and the time budgets of the MUs are
maximize their immediate benefits, the discount factor of long- fixed to 20. To illustrate the impact of budget constraints, we
term rewards is set to zero. To accelerate the training process, also fix the quality weights of the MUs during the experiments
we adopt a relatively small network for each agent. As shown and set U and V to 8 and 2, respectively.
in Fig 4, the actor network and critic network are constructed In the first scenario, as shown in Fig. 5(a), 5(b) and 5(c),
with one input layer, three hidden layers and one output the proposed algorithm achieves near-optimal performance in
layer. The three hidden layers have 128, 64 and 32 neurons, the absence of prior knowledge of data quality offered by the
respectively. Furthermore, both the actor and critic networks MUs. When the average time budget of the MUs is small,
take ReLU function as activation of all hidden layers, and the intensive competition among the TIs (i.e., the buyers) results in
actor network takes Tanh function as activation of output layer a seller’s market. As a consequence, the payoff of TIs achieved
to generate policies. The other simulation parameters of our in our proposed algorithm is even lower than that achieved
experiment are provided in Table I. in the random algorithm as shown in Fig. 5(b). Furthermore,
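For reference, the actor and critic architectures described above (three hidden layers of 128, 64 and 32 ReLU units, with a Tanh output for the actor) could be written in PyTorch as follows. The layer factory, class names and the use of nn.Sequential are stylistic choices and not taken from the paper's code; the critic takes the concatenated joint observations and actions, consistent with the centralized-critic design of MADDPG.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, out_act=None):
    """Input -> 128 -> 64 -> 32 -> output, with ReLU on all hidden layers (Fig. 4)."""
    layers = [nn.Linear(in_dim, 128), nn.ReLU(),
              nn.Linear(128, 64), nn.ReLU(),
              nn.Linear(64, 32), nn.ReLU(),
              nn.Linear(32, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim=1):
        super().__init__()
        # Tanh output; the unit price would then be rescaled to a valid positive range.
        self.net = mlp(obs_dim, act_dim, out_act=nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic: consumes the observations and actions of all agents."""
    def __init__(self, joint_obs_dim, joint_act_dim):
        super().__init__()
        self.net = mlp(joint_obs_dim + joint_act_dim, 1)

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=1))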
the proposed algorithm significantly outperforms the random
algorithm in terms of the social welfare, as shown in Fig. 5(c).
B. Behaviour of the Proposed Algorithm This is not surprising because random pricing cannot fully
We first investigate the convergence of the proposed algo- utilize the MUs’ sensing resource, whereas in our proposed
rithm. Figure 3(a) illustrates the unit prices of MUs versus algorithm, the MUs are encouraged to dynamically adjust their
the number of iterations. Figures 3(b) and 3(c) show the unit prices to achieve a resource-efficient MCS system.
corresponding amounts of sensing time for tasks 1 and 2, In the second scenario, the proposed algorithm outperforms
respectively. Notably, the number of iterations required for the random algorithm in terms of the payoff of TIs and social
convergence is acceptable (approximately 200 iterations). Ini- welfare in most cases as shown in Fig. 6(a), 6(b) and 6(c).
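For completeness, here is one possible reading of the two baselines in code. The Random baseline is direct. For QPOP, we take quality-proportional base prices and scale them by a single factor so that the indicator I from Theorem 1 becomes zero, using the fact that multiplying all prices by c shifts I by -V log c; this scaling construction is only our interpretation of "consuming the exact budgets" and should be treated as an assumption rather than the paper's exact procedure.

import numpy as np

def random_prices(n_mu, rng=None):
    """Random baseline: unit prices drawn uniformly from (0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(0.0, 1.0, size=n_mu)

def indicator_I(p, t, b, w):
    """Indicator I from Theorem 1 for a given price vector p."""
    col_sum = w.sum(axis=0)
    row_sum = w.sum(axis=1)
    budget_term = (b[None, :] * w ** 2 / (p[:, None] * col_sum[None, :])).sum(axis=0)
    time_term = (t[:, None] * w ** 2 / row_sum[:, None]).sum(axis=0)
    return np.log(budget_term).sum() - np.log(time_term).sum()

def qpop_prices(t, b, w):
    """QPOP-style prices: proportional to each MU's average quality, scaled so that I = 0.

    Scaling all prices by c changes I by -V*log(c), so c = exp(I(base)/V) zeroes it.
    """
    base = w.mean(axis=1)                     # quality-proportional base prices (assumption)
    V = w.shape[1]
    c = np.exp(indicator_I(base, t, b, w) / V)
    return c * base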
We compare these algorithms in two scenarios. In the first scenario, the time budgets of the MUs (i.e., t_i) vary from 5 to 25, and the monetary budgets of the TIs are fixed to 20. In the second scenario, the monetary budgets of the TIs are randomly selected from 20 to 40, and the time budgets of the MUs are fixed to 20. To illustrate the impact of the budget constraints, we also fix the quality weights of the MUs during the experiments and set U and V to 8 and 2, respectively.

In the first scenario, as shown in Fig. 5(a), 5(b) and 5(c), the proposed algorithm achieves near-optimal performance in the absence of prior knowledge of the data quality offered by the MUs. When the average time budget of the MUs is small, intensive competition among the TIs (i.e., the buyers) results in a seller's market. As a consequence, the payoff of the TIs achieved by our proposed algorithm is even lower than that achieved by the random algorithm, as shown in Fig. 5(b). Furthermore, the proposed algorithm significantly outperforms the random algorithm in terms of social welfare, as shown in Fig. 5(c). This is not surprising because random pricing cannot fully utilize the MUs' sensing resources, whereas in our proposed algorithm, the MUs are encouraged to dynamically adjust their unit prices to achieve a resource-efficient MCS system.

Fig. 5: A comparison of our algorithm and other algorithms for different MU time budgets. (a) Payoff of MUs. (b) Payoff of TIs. (c) Welfare.

In the second scenario, the proposed algorithm outperforms the random algorithm in terms of the payoff of the TIs and the social welfare in most cases, as shown in Fig. 6(a), 6(b) and 6(c). With the increase of the average monetary budgets of the TIs, MUs tend to charge higher prices in response to the observation of the task-assignment strategy adopted by the SP.



Therefore, the TIs have to either pay more in order to incentivize MUs to participate in sensing activities or reduce the quantity of sensing time to purchase, resulting in a decreased payoff of the TIs. In other words, sensing effort becomes a seller's market. As shown in Fig. 6(b) and 6(c), the payoff of the TIs and the social welfare achieved by our proposed algorithm are even lower than those achieved by the random algorithm when the average monetary budget of the TIs is extremely high (e.g., close to 40).

Fig. 6: A comparison of our algorithm and other algorithms for different monetary budgets of the SP. (a) Payoff of MUs. (b) Payoff of TIs. (c) Welfare.

In summary, the proposed algorithm significantly outperforms the random algorithm in terms of the MUs' payoff and the social welfare, and the performance of the proposed algorithm is close to that of the QPOP benchmark. Nevertheless, QPOP requires complete information about the quality weights and the unconditional cooperation of the MUs. In contrast, our algorithm requires only local observations from interactions with the operating environment.

VIII. CONCLUSION

This paper presented a distributed task assignment and incentive mechanism based on a Stackelberg game and MADDPG. Since the proposed algorithm requires neither prior knowledge of the user-perceived sensing cost nor the actions simultaneously taken by other MUs, the signalling overhead is greatly reduced. Furthermore, a salient feature of the proposed algorithm is that it can guide self-interested agents to achieve an overall design objective from a sequence of state-action-reward observations. We compare the proposed algorithm with two existing algorithms. Simulation results show that both the SP and the MUs can learn to achieve a near-optimal payoff while satisfying the stringent monetary budget constraints of the TIs.

APPENDIX A
PROOF OF THEOREM 1

The corresponding Lagrangian form of Problem 1 is

L(x, \lambda, \mu) = -\sum_{j \in V} \log\!\left(\sum_{i \in U} w_{ij} x_{ij}\right) + \sum_{i \in U} \lambda_i\!\left(\sum_{j \in V} x_{ij} - t_i\right) + \sum_{j \in V} \mu_j\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) \qquad (22)
where \lambda = \{\lambda_i\} and \mu = \{\mu_j\} are the Lagrangian multipliers. The KKT conditions are given as follows:

\frac{\partial L(x_{ij}, \lambda_i, \mu_j)}{\partial x_{ij}} = 0, \; \forall i \in U, \forall j \in V; \quad
\lambda_i\!\left(\sum_{j \in V} x_{ij} - t_i\right) = 0, \; \forall i \in U; \quad
\mu_j\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) = 0, \; \forall j \in V; \quad
x_{ij} \geq 0, \; \forall i \in U, \forall j \in V; \quad
\lambda_i \geq 0, \; \forall i \in U; \quad
\mu_j \geq 0, \; \forall j \in V \qquad (23)

Eliminating \lambda_i, we have

\left(\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} - \mu_j p_i\right)\!\left(\sum_{j \in V} x_{ij} - t_i\right) = 0, \; \forall i \in U; \quad
\mu_j\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) = 0, \; \forall j \in V; \quad
x_{ij} \geq 0, \; \forall i, j; \quad
\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} - \mu_j p_i \geq 0, \; \forall i \in U; \quad
\mu_j \geq 0, \; \forall j \in V \qquad (24)

When

\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} - \mu_j p_i = 0 \qquad (25)

we have

\mu_j = \frac{w_{ij}}{p_i \sum_{i \in U} w_{ij} x_{ij}} > 0 \qquad (26)

Since

\mu_j\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) = 0 \qquad (27)

we have

\sum_{i \in U} p_i x_{ij} = b_j \qquad (28)

Namely, if \sum_{j \in V} x_{ij} - t_i \neq 0, then \sum_{i \in U} p_i x_{ij} - b_j = 0, \forall j \in V. Similarly, we can prove that if \sum_{i \in U} p_i x_{ij} - b_j \neq 0, then \sum_{j \in V} x_{ij} - t_i = 0.

Therefore, Problem 1 can be decomposed into two sub-problems as follows.

Sub-problem 1:

\max_{x} \; \sum_{j \in V} \log\!\left(\sum_{i \in U} w_{ij} x_{ij}\right)
\text{s.t.} \; \sum_{j \in V} x_{ij} \leq t_i, \; \forall i \in U; \quad \sum_{i \in U} p_i x_{ij} = b_j, \; \forall j \in V \qquad (29)

The KKT conditions of the corresponding Lagrangian form of sub-problem 1 are given as follows:

\frac{\partial L(x_{ij}, \lambda_i, \mu_j)}{\partial x_{ij}} = 0, \; \forall i \in U, j \in V; \quad
\sum_{i \in U} p_i x_{ij} - b_j = 0, \; \forall j \in V; \quad
\lambda_i\!\left(\sum_{j \in V} x_{ij} - t_i\right) = 0, \; \forall i \in U; \quad
x_{ij} \geq 0, \; \forall i \in U, j \in V; \quad
\lambda_i \geq 0, \; \forall i \in U \qquad (30)

Eliminating \lambda_i, we have

\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} - \mu_j p_i \geq 0, \; \forall i \in U; \quad
\left(\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} - \mu_j p_i\right)\!\left(\sum_{j \in V} x_{ij} - t_i\right) = 0, \; \forall i \in U; \quad
x_{ij} \geq 0, \; \forall i, j; \quad
\sum_{i \in U} p_i x_{ij} - b_j = 0, \; \forall j \in V \qquad (31)

On one hand, the stationarity and complementary slackness in Eq. (30) imply that

\frac{w_{ij}}{\sum_{i \in U} w_{ij} x_{ij}} \geq \mu_j p_i \qquad (32)

and

x_{ij}^* =
\begin{cases}
\dfrac{w_{ij}}{p_i \mu_j^* \sum_{i \in U} w_{ij}}, & \mu_j^* > 0 \\[2mm]
0, & \text{otherwise}
\end{cases} \qquad (33)

For simplicity, Eq. (33) can be rewritten as follows:

x_{ij}^* = \left(\frac{w_{ij}}{p_i \mu_j^* \sum_{i \in U} w_{ij}}\right)^{\!+} \qquad (34)

with (\cdot)^+ = \max(\cdot, 0).

On the other hand, the primal feasibility in Eq. (30) implies that

\sum_{i \in U} p_i \cdot \frac{w_{ij}}{p_i \mu_j^* \sum_{i \in U} w_{ij}} = b_j \qquad (35)

Therefore,

x_{ij}^* = \frac{b_j w_{ij}}{p_i \sum_{i \in U} w_{ij}} \qquad (36)

Sub-problem 2:

\max_{x} \; \sum_{j \in V} \log\!\left(\sum_{i \in U} w_{ij} x_{ij}\right)
\text{s.t.} \; \sum_{j \in V} x_{ij} = t_i, \; \forall i \in U; \quad \sum_{i \in U} p_i x_{ij} \leq b_j, \; \forall j \in V \qquad (37)

The KKT conditions of the corresponding Lagrangian form of sub-problem 2 are given as follows:

\frac{\partial L(x_{ij}, \lambda_i, \mu_j)}{\partial x_{ij}} = 0, \; \forall i \in U, j \in V; \quad
\mu_j\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) = 0, \; \forall j \in V; \quad
x_{ij} \geq 0, \; \forall i \in U, j \in V; \quad
\sum_{j \in V} x_{ij} - t_i = 0, \; \forall i \in U; \quad
\mu_j \geq 0, \; \forall j \in V \qquad (38)

Eliminating \mu_j, we have

\frac{w_{ij}}{p_i \sum_{i \in U} w_{ij} x_{ij}} - \frac{\lambda_i}{p_i} \geq 0, \; \forall j \in V; \quad
\left(\frac{w_{ij}}{p_i \sum_{i \in U} w_{ij} x_{ij}} - \frac{\lambda_i}{p_i}\right)\!\left(\sum_{i \in U} p_i x_{ij} - b_j\right) = 0, \; \forall j \in V; \quad
x_{ij} \geq 0, \; \forall i, j; \quad
\sum_{j \in V} x_{ij} - t_i = 0, \; \forall i \in U \qquad (39)

On one hand, the stationarity and complementary slackness in Eq. (38) imply that

\frac{w_{ij}}{p_i \sum_{i \in U} w_{ij} x_{ij}} \geq \frac{\lambda_i}{p_i} \qquad (40)

and

x_{ij}^* =
\begin{cases}
\dfrac{w_{ij}}{\lambda_i^* \sum_{i \in U} w_{ij}}, & \lambda_i^* > 0 \\[2mm]
0, & \text{otherwise}
\end{cases} \qquad (41)

Equation (41) can be rewritten as follows:

x_{ij}^* = \left(\frac{w_{ij}}{\lambda_i^* \sum_{i \in U} w_{ij}}\right)^{\!+} \qquad (42)

On the other hand, the primal feasibility in Eq. (38) implies that

\sum_{j \in V} x_{ij} = t_i \qquad (43)

Therefore,

x_{ij}^* = \frac{t_i w_{ij}}{\sum_{j \in V} w_{ij}} \qquad (44)

In summary, the optimal solution to Problem 1 is given as follows:

x_{ij}^* =
\begin{cases}
\dfrac{b_j w_{ij}}{p_i \sum_{i \in U} w_{ij}}, & \text{if } I \geq 0 \\[2mm]
\dfrac{t_i w_{ij}}{\sum_{j \in V} w_{ij}}, & \text{otherwise}
\end{cases}
\quad \forall i \in U, \forall j \in V \qquad (45)

where I = \sum_{j \in V} \log\!\left(\sum_{i \in U} \dfrac{b_j w_{ij}^2}{p_i \sum_{i \in U} w_{ij}}\right) - \sum_{j \in V} \log\!\left(\sum_{i \in U} \dfrac{t_i w_{ij}^2}{\sum_{j \in V} w_{ij}}\right).
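Since Problem 1 (Eq. (7)) is a convex program (a concave log-sum objective with linear constraints), it can also be solved directly with a generic solver, which provides an independent reference point for the closed form of Eq. (45). The sketch below uses the cvxpy package, which we assume is available (any convex solver would do); the helper name and the comparison with optimal_assignment from the earlier sketch are illustrative.

import cvxpy as cp
import numpy as np

def solve_problem1(p, t, b, w):
    """Solve Problem 1 (Eq. (7)) directly as a convex program."""
    U, V = w.shape
    x = cp.Variable((U, V), nonneg=True)
    quality_time = cp.sum(cp.multiply(w, x), axis=0)       # sum_i w_ij x_ij per task j
    objective = cp.Maximize(cp.sum(cp.log(quality_time)))
    constraints = [cp.sum(x, axis=1) <= t,                  # time budgets
                   x.T @ p <= b]                            # monetary budgets
    cp.Problem(objective, constraints).solve()
    return x.value

U, V = 8, 2
i = np.arange(1, U + 1)
w = np.stack([1.0 + i / 10, 1.1 + i / 10], axis=1)
p, t, b = np.full(U, 0.5), np.full(U, 20.0), np.full(V, 20.0)
x_solver = solve_problem1(p, t, b, w)
welfare = lambda x: np.log((w * x).sum(axis=0)).sum()
print(welfare(x_solver))   # compare against welfare(optimal_assignment(p, t, b, w)) from Theorem 1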
REFERENCES

[1] Z. Zhou, J. Feng, B. Gu, B. Ai, S. Mumtaz, J. Rodriguez, and M. Guizani, "When mobile crowd sensing meets UAV: Energy-efficient task assignment and route planning," IEEE Transactions on Communications, vol. 66, no. 11, pp. 5526–5538, 2018.
[2] X. Kong, F. Xia, J. Li, M. Hou, M. Li, and Y. Xiang, "A shared bus profiling scheme for smart cities based on heterogeneous mobile crowdsourced data," IEEE Transactions on Industrial Informatics, vol. 16, no. 2, pp. 1436–1444, 2020.
[3] T. Hussain, K. Muhammad, J. D. Ser, S. W. Baik, and V. H. C. De Albuquerque, "Intelligent embedded vision for summarization of multiview videos in IIoT," IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2592–2602, 2020.
[4] H. Liao, Z. Zhou, X. Zhao, L. Zhang, S. Mumtaz, A. Jolfaei, S. H. Ahmed, and A. K. Bashir, "Learning-based context-aware resource allocation for edge computing-empowered industrial IoT," IEEE Internet of Things Journal, pp. 1–1, 2019.
[5] D. Yang, G. Xue, X. Fang, and J. Tang, "Incentive mechanisms for crowdsensing: Crowdsourcing with smartphones," IEEE/ACM Transactions on Networking, vol. 24, no. 3, pp. 1732–1744, June 2016.
[6] S. Ji and T. Chen, "Incentive mechanisms for discretized mobile crowdsensings," IEEE Transactions on Wireless Communications, vol. 15, no. 1, pp. 146–161, Jan. 2016.


[7] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
[8] H. Ko, S. Pack, and V. C. Leung, "Coverage-guaranteed and energy-efficient participant selection strategy in mobile crowdsensing," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 3202–3211, 2018.
[9] X. Li and X. Zhang, "Multi-task allocation under time constraints in mobile crowdsensing," IEEE Transactions on Mobile Computing, 2019.
[10] Q. Pham, S. Mirjalili, N. Kumar, M. Alazab, and W. Hwang, "Whale optimization algorithm with applications to resource allocation in wireless networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 4, pp. 4285–4297, 2020.
[11] B. Cao, S. Xia, J. Han, and Y. Li, "A distributed game methodology for crowdsensing in uncertain wireless scenario," IEEE Transactions on Mobile Computing, vol. 19, no. 1, pp. 15–28, 2019.
[12] C. Jiang, L. Gao, L. Duan, and J. Huang, "Scalable mobile crowdsensing via peer-to-peer data sharing," IEEE Transactions on Mobile Computing, vol. 17, no. 4, pp. 898–912, 2017.
[13] S. Bhattacharjee, N. Ghosh, V. K. Shah, and S. K. Das, "QnQ: Quality and quantity based unified approach for secure and trustworthy mobile crowdsensing," IEEE Transactions on Mobile Computing, vol. 19, no. 1, pp. 200–216, 2018.
[14] X. Wang, R. Jia, X. Tian, X. Gan, L. Fu, and X. Wang, "Location-aware crowdsensing: Dynamic task assignment and truth inference," IEEE Transactions on Mobile Computing, 2018.
[15] J. Huang, L. Kong, H.-N. Dai, W. Ding, L. Cheng, G. Chen, X. Jin, and P. Zeng, "Blockchain based mobile crowd sensing in industrial systems," IEEE Transactions on Industrial Informatics, 2020.
[16] B. Guo, H. Chen, Q. Han, Z. Yu, D. Zhang, and Y. Wang, "Worker-contributed data utility measurement for visual crowdsensing systems," IEEE Transactions on Mobile Computing, vol. 16, no. 8, pp. 2379–2391, 2016.
[17] C. Jiang, L. Gao, L. Duan, and J. Huang, "Data-centric mobile crowdsensing," IEEE Transactions on Mobile Computing, vol. 17, no. 6, pp. 1275–1288, 2017.
[18] M. H. Cheung, F. Hou, and J. Huang, "Delay-sensitive mobile crowdsensing: Algorithm design and economics," IEEE Transactions on Mobile Computing, vol. 17, no. 12, pp. 2761–2774, 2018.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[20] C. H. Liu, Z. Chen, and Y. Zhan, "Energy-efficient distributed mobile crowd sensing: A deep learning approach," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1262–1276, 2019.
[21] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, "Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach," IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2377–2392, 2019.
[22] F. Farivar, M. S. Haghighi, A. Jolfaei, and M. Alazab, "Artificial intelligence for detection, estimation, and compensation of malicious attacks in nonlinear cyber-physical systems and industrial IoT," IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2716–2725, 2020.
[23] Z. Zhou, H. Liao, B. Gu, K. M. S. Huq, S. Mumtaz, and J. Rodriguez, "Robust mobile crowd sensing: When deep learning meets edge computing," IEEE Network, vol. 32, no. 4, pp. 54–60, 2018.
[24] L. Xiao, T. Chen, C. Xie, H. Dai, and H. V. Poor, "Mobile crowdsensing games in vehicular networks," IEEE Transactions on Vehicular Technology, vol. 67, no. 2, pp. 1535–1545, 2017.
[25] Y. Chen and H. Wang, "IntelligentCrowd: Mobile crowdsensing via multi-agent reinforcement learning," CoRR, vol. abs/1809.07830, 2018. [Online]. Available: http://arxiv.org/abs/1809.07830
[26] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA, USA: MIT Press, 1991.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[29] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. ICML, 2014, pp. 1–9.
[30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," Comput. Sci., vol. 8, no. 6, p. A187, 2015.

Bo Gu received the Ph.D. degree from Waseda University, Japan, in 2013. He was a research engineer with Sony Digital Network Applications, Japan, from 2007 to 2011, an assistant professor with Waseda University, Japan, from 2011 to 2016, and an associate professor with Kogakuin University, Japan, from 2016 to 2018. He is currently an associate professor with the School of Intelligent Systems Engineering, Sun Yat-sen University. His research interests include the Internet of Things, edge computing, network economics, and machine learning. He was the recipient of the IEEE ComSoc Communications Systems Integration & Modeling (CSIM) Technical Committee Best Journal Article Award in 2019, the Asia-Pacific Network Operations and Management Symposium (APNOMS) Best Paper Award in 2016, and the IEICE Young Researcher's Award in 2011.

Xinxin Yang received the B.Eng. degree from Hebei University of Technology, China, in 2020. He is currently pursuing his master's degree in pattern recognition and intelligent systems at Sun Yat-sen University. His research interests include edge computing, the Internet of Things and machine learning.

Ziqi Lin received the B.Eng. degree from Dongguan University of Technology, China, in 2019. He is currently pursuing his master's degree in control engineering at Sun Yat-sen University. His research interests include the Internet of Things, edge computing, and machine learning.

Weiwei Hu received the bachelor's degree from the Marine Engineering College, Dalian Maritime University, in 2020. He is currently pursuing the master's degree in Control Science and Engineering at Sun Yat-sen University. His research interests include blockchain, the Internet of Things and edge computing.


Mamoun Alazab received the Ph.D. degree in computer science from the School of Science, Information Technology and Engineering, Federation University of Australia. He is currently an Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. He is also a cyber security researcher and practitioner with industry and academic experience. His research is multidisciplinary and focuses on cyber security and digital forensics of computer systems, with an emphasis on cybercrime detection and prevention, including cyber terrorism and cyber warfare.

Rupak Kharel received the Ph.D. degree in secure communication systems from Northumbria University, U.K., in 2011. He is currently a Reader (Associate Professor) within the Department of Computing and Mathematics, Manchester Metropolitan University. His research interests include various use cases and challenges of the IoT and cyber physical systems, including the Internet of Vehicles (IoV), cyber security, physical layer security, and 5G and beyond systems. He is a Principal Investigator of multiple government and industry funded research projects. Rupak is a Senior Member of the IEEE, a member of the IET and a Fellow of the Higher Education Academy (FHEA), U.K.

