Abstract—Mobile crowdsensing (MCS) is an appealing sensing paradigm that leverages the sensing capabilities of smart devices and the inherent mobility of device owners to accomplish sensing tasks with the aim of constructing powerful industrial systems. Incentivizing mobile users (MUs) to participate in sensing activities and contribute high-quality data is of paramount importance to the success of MCS services. In this paper, we formulate the competitive interactions between a sensing platform (SP) and MUs as a multi-stage Stackelberg game with the SP as the leader and the MUs as the followers. Given the unit prices announced by the MUs, the SP calculates the quantity of sensing time to purchase from each MU by solving a convex optimization problem. Then, each follower observes the trading records and iteratively adjusts its pricing strategy in a trial-and-error manner based on a multi-agent deep reinforcement learning algorithm. Simulation results demonstrate the efficiency of the proposed method.¹

Index Terms—Cognitive sensor networks, mobile crowd sensing (MCS), Stackelberg game, deep reinforcement learning (DRL), incentive mechanism, MADDPG.

Manuscript received April 29, 2020; revised June 5, 2020; accepted September 5, 2020. Date of publication XXX XX, 2020; date of current version September 11, 2020. This work was partially supported by the National Key R&D Program of China (2019YFB1704702).
B. Gu, X. Yang, Z. Lin and W. Hu are with the School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou 510275, China (e-mail: [email protected]).
M. Alazab is with the College of Engineering, IT and Environment, Charles Darwin University, Australia (e-mail: [email protected]).
M. Kharel is with the Faculty of Science and Engineering, Manchester Metropolitan University, Manchester M15 6BH, UK (e-mail: [email protected]).
¹Source code is available at https://github.com/glab2019/MADDPG-Based-MCS

I. INTRODUCTION

In current industrial spaces, data and information regarding working conditions, equipment status, environmental context, and so forth must be collected in a real-time manner at various locations. This information can be used for the monitoring, prediction, modelling and control of industrial systems to provide advanced services [1]–[4]. Generally, conventional sensing architectures require deploying sensors at all points of interest (PoIs), which substantially increases both the operational expenditure and the capital expenditure. In recent years, the proliferation of smart devices and advances in wireless communications technology have changed the methods used to collect data. These devices are usually equipped with various sensors (e.g., GPS, accelerometer and camera). By leveraging their sensing capabilities and the inherent mobility of device owners, it is possible to accomplish fine-grained sensing tasks. This sensing paradigm, called mobile crowdsensing (MCS), is especially suitable for large-scale sensing tasks due to its cost efficiency.

Nevertheless, the successful operation of MCS relies on the participation of MUs. Due to the increased battery consumption and the potential risk of data leakage, MUs are not willing to perform sensing tasks unless they are sufficiently compensated. Therefore, incentive mechanisms are essential for device owners to share part of their time, effort and resources for sensing tasks. How to determine the appropriate reward to recruit MUs for sensing tasks so as to improve the data quality while satisfying the monetary budget constraints of task initiators (TIs) is a key problem in the MCS environment.

Previous work on task assignment and incentive design relies on the assumption that the sensing platform (SP) perfectly knows the private information of the MUs (i.e., user-perceived sensing cost and data quality). However, this assumption is quite strong, especially in a free market, where MUs compete with each other for sensing opportunities and believe that concealing their private information is beneficial. Under such an asymmetric information scenario, it is highly challenging for the SP to provide an appropriate reward in keeping with the quality of the data that the MU contributes. On the one hand, a selfish MU may pretend to have a higher data quality to earn an extra reward. On the other hand, in the absence of MUs' private information, the SP may provide a payment that is lower than the MUs deserve and, consequently, discourage the cooperation of MUs.

Game theory is regarded as a powerful tool for analysing the competitive interactions between two parties with conflicting interests. Recently, many game theory-based incentive mechanisms [5], [6] have been proposed to motivate truthful contribution in MCS systems. In [6], a Bayesian game is formulated to analyse the interactions between the SP and MUs, based on which the authors proposed a distributed incentive mechanism to obtain a Nash equilibrium solution while simultaneously maximizing the SP's utility. However, the above-mentioned algorithms require excessive information exchange (such as user-perceived sensing cost, data quality and utility functions), which is infeasible in practice due to the unbearable signalling overhead. Moreover, the previous works focus on the analysis of the interaction between the SP and candidate MUs, while the competition among MUs is ignored.
To address the above-mentioned problems and achieve an efficient and stable solution in the absence of complete information, we first formulate the task assignment and incentive design problem in the MCS environment as a multi-stage Stackelberg game that includes a leader and multiple followers. We consider the case where all the MUs offer their prices to the SP simultaneously. In the first stage, given the prices announced by the MUs, the SP determines the optimal task assignment by solving a utility maximization problem. In the second stage, based on the feedback from the environment (i.e., how the tasks are assigned to MUs), we propose a multi-agent deep deterministic policy gradient (MADDPG)-based [7] framework to learn the optimal pricing policy of each MU without requiring prior knowledge of the pricing strategies implemented by other MUs. The main contributions of the proposed algorithm are five-fold.

• We fully consider the interactions between the SP and MUs, as well as the competition among MUs, by constructing a Stackelberg leader-follower game. The Nash equilibrium, which gives plausible results under full competition, is obtained as a solution for the Stackelberg game.
• The proposed algorithm is scalable in the sense that it requires neither prior knowledge of the user-perceived sensing cost nor the actions simultaneously taken by other MUs. The signalling overhead can therefore be reduced substantially.
• The proposed algorithm is based on MADDPG, a centralized-training and distributed-execution DRL technique that is adaptable to environmental dynamics and parameter uncertainty.
• The proposed algorithm does not rely on existing training data. Each MU acts as a learning agent that interacts with its environment continuously to generate a sequence of experience for training purposes.
• Simulation results confirm that the proposed algorithm can learn a good policy that guides each MU to determine the price of a unit amount of its sensing time.

II. RELATED WORK

The existing research on task assignment and incentive design for MCS can be divided into two categories: optimization-based methods and reinforcement learning (RL)-based methods.

A. Optimization-Based Methods

In recent years, several task assignment and incentive design algorithms for MCS have been proposed based on convex optimization [8]–[10], game theory [11], [12], matching theory [13], and Lyapunov optimization [14] with the aim of optimizing data quality, sensing coverage or latency.

Data quality: In [11], Cao et al. designed a game theory-based incentive mechanism for MCS to encourage appropriate nearby user equipment to share their resources for sensing tasks; then, they proposed a task-migration algorithm that enables resource sharing among users to enhance data quality. In [12], Jiang et al. proposed a group-based sensing framework where sensing data are stored and processed on user equipment locally and then shared among MUs in a P2P mode. Then, a pricing scheme is proposed to improve data quality while ensuring appropriate incentives for participants. In [15], Huang et al. proposed a data quality verification framework based on the consensus mechanism of blockchain. In [16], Guo et al. focused on a visual crowdsensing environment and proposed a dynamic and utility-enhanced incentive mechanism to match quality requirements to providers. In [17], Jiang et al. proposed a multi-layer MCS framework that allows different TIs to publish their data requirements so that common data items can be shared among TIs.

Sensing coverage and latency: In [9], Zhang et al. proposed a multi-task assignment mechanism to maximize the utility of the SP while satisfying a stringent delay constraint. The formulated problem is proved to be NP-hard, and a heuristic algorithm is presented to achieve a near-optimal task assignment. In [18], Cheung et al. focused on delay-sensitive sensing tasks and proposed a dynamic programming-based algorithm to maximize the user's profit. In [8], Ko et al. proposed a data-usage and mobility-aware task assignment mechanism for MCS by taking into account the sensing coverage and energy consumption of MUs.

B. Reinforcement Learning-Based Methods

RL and deep RL (DRL) [19] are commonly used machine learning paradigms in which agents learn how to map states to actions to maximize a long-term reward via continuous interaction with the environment. Specifically, RL uses a reward to guide agents to make better decisions. In recent years, considerable research effort has been devoted to studying task assignment and incentive design in MCS based on RL techniques [20]–[22]. In [23], Zhou et al. used DRL to verify the original data set to improve the efficiency and robustness of the MCS system. In [24], Xiao et al. focused on a vehicular environment in which the interaction between the SP and vehicles is modelled as a congestion-sensing game. Then, they proposed a Q-learning-based [19] algorithm to guide the SP and the vehicles to make optimal decisions. In [25], Chen et al. extended the study to a multi-agent scenario and proposed an RL-based algorithm to obtain the optimal sensing policy in real time to optimize the benefit of sensing participants.

In this paper, we assume a free market where MUs compete with each other for sensing opportunities. Through interactions with the operating environment, MUs learn an optimal pricing policy to maximize their payoff while satisfying their time budget constraints and the monetary budget constraint of the SP. We employ MADDPG to accelerate convergence and increase the robustness of the proposed algorithm.

III. PROBLEM FORMULATION

As shown in Fig. 1, an MCS system is composed of multiple stakeholders: (i) TIs that wish to collect sensing data at PoIs; (ii) MUs that share part of their time and resources to perform sensing tasks; and (iii) an SP that acts as a third-party agency to recruit MUs on behalf of TIs. Without loss of generality, we consider a general "many-to-many" scenario where each task can be allocated to multiple MUs and each MU is permitted to
participate in multiple sensing tasks. Moreover, we assume that MUs receive monetary payments for participating in sensing activities. If sufficiently compensated, MUs in the proximity of a PoI may participate in sensing activities. Initially, TIs publish their sensing tasks on the platform together with the parameters used to describe a task, including (i) the phenomena of interest, (ii) the target area, and (iii) the total budget for rewarding participants. Then, the SP recruits MUs to perform sensing tasks to meet the requirements of TIs.

Fig. 1: Mobile crowdsensing architecture.

Time is divided into time slots. In each slot, the numbers of MUs and sensing tasks are denoted by U and V, respectively. Formally, the sets of MUs and sensing tasks are represented by U = {1, 2, ..., U} and V = {1, 2, ..., V}, respectively. Each MU i, ∀i ∈ U, has a limited available time t_i to perform sensing tasks. The interactions between MUs and the platform are summarized as follows:

1. As shown in Fig. 1, MU i, who is interested in sensing tasks, first registers with the SP by indicating its available sensing time t_i and desired reward per unit of sensing time p_i.
2. Given the price and time budget of each MU, the SP determines the quantity of sensing time to purchase from each MU.
3. MUs perform the assigned sensing tasks and upload the sensing data to the SP.
4. The SP aggregates all sensing data and sends a report to the TIs.

IV. MULTI-STAGE STACKELBERG GAME

A Stackelberg leader-follower game [26] is a strategic game in which a leader commits to a strategy first and then the followers move accordingly. In general, all the players in the game are assumed to be self-interested in the sense that they adopt strategies that consider those of others with the aim of maximizing their own benefit. In particular, considering the possible countermeasures of the followers, the leader first chooses a strategy that maximizes its own utility. Then, the followers determine their strategies in response. The payoffs of the MUs and the SP are shown in Eqs. (1) and (4), respectively.

For each sensing task j ∈ V, the quantity of sensing time to purchase from MU i is denoted by x_ij. Then, the payoff of MU i is given by

\psi_i(p_i, x_i) = p_i \sum_{j \in \mathcal{V}} x_{ij}, \quad \forall i \in \mathcal{U}    (1)

where p_i is the unit price announced by MU i and x_i = {x_ij}_{j∈V}.

The payoff of TI j is defined by

\phi_j(x_j, p) = \pi_j\Big( \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij} \Big) - \sum_{i \in \mathcal{U}} p_i x_{ij}, \quad \forall j \in \mathcal{V}    (2)

where x_j = {x_ij}_{i∈U}, p = {p_i}_{i∈U}, ω_ij represents the data quality offered by MU i when performing sensing task j, and π_j(·) is the utility function, which is an increasing concave function of the quality-weighted sensing time.

Then, the payoff of the SP, which is the sum payoff of all TIs and MUs (i.e., the social welfare), is given by

\phi(x, p) = \sum_{i \in \mathcal{U}} \psi_i(p_i, x_i) + \sum_{j \in \mathcal{V}} \phi_j(x_j, p) = \sum_{j \in \mathcal{V}} \pi_j\Big( \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij} \Big)    (3)

where x = {x_ij}_{i∈U, j∈V}.

Supposing that π_j(·) = log(·), ∀j ∈ V, we have

\phi(x, p) = \sum_{j \in \mathcal{V}} \log\Big( \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij} \Big)    (4)

Definition 1: Letting x*_i = {x*_ij}_{j∈V}, x* = {x*_ij}_{i∈U, j∈V} and p* = {p*_i}_{i∈U}, the point (x*, p*) is a Nash equilibrium if it satisfies

\psi_i(p_i^*, x_i^*) \ge \psi_i(p_i, x_i^*), \quad \forall p_i \ne p_i^*, \forall i \in \mathcal{U}    (5)

and

\phi(x^*, p^*) \ge \phi(x, p^*), \quad \forall x \ne x^*    (6)

In the next two sections, we illustrate how to derive the Nash equilibrium.
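To make the payoff structure concrete, the following Python sketch evaluates Eqs. (1)–(4) for a given allocation and price vector. It is an illustrative snippet rather than part of the paper, and the array shapes and the toy numbers are our own assumptions.

```python
import numpy as np

def mu_payoff(p, x):
    """Eq. (1): payoff of each MU i, psi_i = p_i * sum_j x_ij."""
    return p * x.sum(axis=1)

def ti_payoff(w, p, x):
    """Eq. (2) with pi_j = log: payoff of each TI j."""
    quality_weighted_time = (w * x).sum(axis=0)   # sum_i w_ij x_ij, shape (V,)
    payments = (p[:, None] * x).sum(axis=0)       # sum_i p_i x_ij, shape (V,)
    return np.log(quality_weighted_time) - payments

def social_welfare(w, x):
    """Eqs. (3)-(4): SP payoff = sum_j log(sum_i w_ij x_ij)."""
    return np.log((w * x).sum(axis=0)).sum()

# toy example: 3 MUs, 2 tasks (values are illustrative only)
p = np.array([1.0, 1.2, 0.8])                        # unit prices announced by MUs
w = np.array([[1.1, 1.2], [1.2, 1.3], [1.3, 1.4]])   # data quality w_ij
x = np.array([[2.0, 1.0], [0.5, 2.5], [3.0, 0.0]])   # purchased sensing time x_ij
print(mu_payoff(p, x), ti_payoff(w, p, x), social_welfare(w, x))
```

Note that, as Eq. (3) states, the payments cancel between MUs and TIs, so the SP's payoff depends only on the quality-weighted sensing time.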
V. PLATFORM SOCIAL WELFARE OPTIMIZATION

Given a unit price vector p, the platform aims to optimize its payoff.

Problem 1:

\max_{x} \sum_{j \in \mathcal{V}} \log\Big( \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij} \Big)
\text{s.t.} \sum_{j \in \mathcal{V}} x_{ij} \le t_i, \quad \forall i \in \mathcal{U}    (7)
\phantom{\text{s.t.}} \sum_{i \in \mathcal{U}} p_i x_{ij} \le b_j, \quad \forall j \in \mathcal{V}

where t_i is the time budget of MU i to perform sensing tasks and b_j is the monetary budget of TI j.
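Problem 1 is a concave maximization over a polyhedral feasible set, so it can also be solved numerically with an off-the-shelf convex solver. The sketch below, which assumes the cvxpy package and hypothetical parameter values, is only meant to illustrate the problem structure; the paper instead derives the closed-form solution given next in Theorem 1.

```python
import cvxpy as cp
import numpy as np

U, V = 8, 2                                   # numbers of MUs and sensing tasks (illustrative)
rng = np.random.default_rng(0)
w = 1.0 + rng.random((U, V))                  # data quality w_ij (assumed values)
p = 0.5 + rng.random(U)                       # unit prices announced by the MUs
t = rng.uniform(5, 25, size=U)                # time budgets t_i
b = rng.uniform(20, 40, size=V)               # monetary budgets b_j

x = cp.Variable((U, V), nonneg=True)          # sensing time to purchase, x_ij >= 0
welfare = cp.sum(cp.log(cp.sum(cp.multiply(w, x), axis=0)))   # objective of Eq. (7)
constraints = [cp.sum(x, axis=1) <= t,        # time budget of each MU
               p @ x <= b]                    # monetary budget of each TI
prob = cp.Problem(cp.Maximize(welfare), constraints)
prob.solve()
print("optimal social welfare:", prob.value)
print("allocation x*:\n", np.round(x.value, 3))
```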
Theorem 1. The optimal solution to Problem 1 is given as follows:

x^*_{ij} = \begin{cases} \dfrac{b_j \omega_{ij}}{p_i \sum_{i \in \mathcal{U}} \omega_{ij}}, & \text{if } I \ge 0 \\[4pt] \dfrac{t_i \omega_{ij}}{\sum_{j \in \mathcal{V}} \omega_{ij}}, & \text{otherwise} \end{cases} \quad \forall i \in \mathcal{U}, \forall j \in \mathcal{V}    (8)

where

I = \sum_{j \in \mathcal{V}} \log\Big( \sum_{i \in \mathcal{U}} \frac{b_j \omega_{ij}^2}{p_i \sum_{i \in \mathcal{U}} \omega_{ij}} \Big) - \sum_{j \in \mathcal{V}} \log\Big( \sum_{i \in \mathcal{U}} \frac{t_i \omega_{ij}^2}{\sum_{j \in \mathcal{V}} \omega_{ij}} \Big).

Proof. See Appendix A.

Q*(s, a) is continuously updated based on the sequence of experience samples from actual or simulated interactions with the environment. When the state and action spaces are large, traditional RL algorithms (e.g., Q-learning) may fail to find a good policy in a reasonable period of time. Deep Q-Network (DQN) [28], which combines Q-learning with a deep neural network, has the potential to overcome this shortcoming.

B. Policy Gradient

Q-learning and DQN learn the values of actions: the larger the value of an action is, the more likely the action is to be selected (e.g., the ε-greedy algorithm [27]). In contrast, policy gradient algorithms learn a parameterized policy that maximizes an objective function, which enables agents to select actions without referring to value functions. The objective function is defined as follows,

J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}[R]    (11)

where θ is the policy's parameter vector, p^π is the state distribution and π_θ is the parameterized policy. According to the policy gradient theorem, the policy gradient can be expressed as

\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}\big[ \nabla_{\theta} \log \pi_{\theta}(a|s) \, Q^{\pi}(s, a) \big].
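As a concrete illustration of Eq. (11) and the policy gradient theorem, the sketch below estimates the gradient with the standard score-function (REINFORCE-style) estimator for a small tabular softmax policy on a toy problem. This is generic background code under our own assumptions, not the algorithm used in this paper, which relies on deterministic policies and MADDPG.

```python
import numpy as np

rng = np.random.default_rng(1)
num_states, num_actions = 4, 3
theta = np.zeros((num_states, num_actions))      # parameters of a tabular softmax policy

def policy(s):
    z = theta[s] - theta[s].max()                # softmax over actions in state s
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta for the softmax policy."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def sample_gradient(horizon=10):
    """Score-function estimate: R * sum_t grad log pi(a_t|s_t) on a toy episode."""
    g, R = np.zeros_like(theta), 0.0
    s = rng.integers(num_states)
    for _ in range(horizon):
        a = rng.choice(num_actions, p=policy(s))
        R += 1.0 if a == s % num_actions else 0.0   # toy reward signal
        g += grad_log_pi(s, a)
        s = rng.integers(num_states)                # toy random transitions
    return R * g

alpha = 0.05
for _ in range(2000):                               # plain REINFORCE-style updates
    theta += alpha * sample_gradient()
```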
Fig. 2: MADDPG architecture (environment, online actor/critic networks, target actor/critic networks, replay buffer D, and soft target updates).

The MADDPG framework has two key features:

1. Centralized training and distributed execution: agents share their experience during training, but each agent uses only local observations at execution time;
2. Augmented critic: the critic is augmented by inferring the policies of the other agents [7].

C. Proposed Algorithm

As described in the previous section, given the feedback from the environment (i.e., how the tasks are assigned to MUs), each MU i chooses a unit price with the aim of maximizing its payoff:

\max_{p_i} \psi_i(p_i, x_i) \quad \text{s.t.} \quad p_i \ge 0    (15)

Without prior knowledge of the actions taken by other agents, we derive the optimal action of each agent based on the MADDPG algorithm. The training procedure is summarized as follows:

    Initialize the online networks Q_i and µ_i with weights θ^{Q_i} and θ^{µ_i}
    Initialize the target networks Q'_i and µ'_i with weights θ^{Q'_i} ← θ^{Q_i}, θ^{µ'_i} ← θ^{µ_i}
    Initialize the replay buffer D
    Initialize the hyperparameters τ, γ
    for episode = 1, 2, ..., M do
        Initialize a random process N for action exploration
        Receive the initial observation state o^1
        for t = 1, 2, ..., T do
            For each agent i, select action a_i^t = µ_i(o_i^t | θ^{µ_i}) + N_t according to the current policy and exploration noise
            Execute the action a^t = {a_1^t, ..., a_N^t}; calculate the reward r^t = {r_1^t, ..., r_N^t} according to Eq. (18) and the new state o^{t+1} according to Eq. (16)
            Store the transition (o^t, a^t, r^t, o^{t+1}) in D
            o^t ← o^{t+1}
            for agent i = 1, ..., N do
                Sample a random minibatch of S transitions (o^j, a^j, r^j, o^{j+1}) from D
                Update the critic by minimizing the loss according to Eq. (19)
                Update the actor using the sampled policy gradient according to Eq. (21)
            end for
            Update the target networks:
                θ^{Q'_i} ← τ θ^{Q_i} + (1 − τ) θ^{Q'_i}
                θ^{µ'_i} ← τ θ^{µ_i} + (1 − τ) θ^{µ'_i}
        end for
    end for
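For reference, a minimal PyTorch sketch of the per-agent actor and critic networks and of the soft target update used in the listing above is given below. The layer sizes, dimensions and names are our own placeholders rather than the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a local observation o_i to a continuous (tanh-squashed) pricing action."""
    def __init__(self, obs_dim, hidden=(32, 128, 64)):
        super().__init__()
        layers, last = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        layers += [nn.Linear(last, 1), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic: takes the joint observation and joint action of all agents."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))

def soft_update(online, target, tau=0.01):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

# per-agent online and target networks (3 agents and obs_dim=8 are illustrative)
actor = Actor(obs_dim=8)
critic = Critic(joint_obs_dim=8 * 3, joint_act_dim=3)
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
soft_update(actor, actor_targ)
soft_update(critic, critic_targ)
```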
State observation: The state is a vector of environment features corresponding to our problem defined in objective (15). We take into account the observations of the previous L time slots so that agents have a better opportunity to track changes in the environment. In each time slot t, the observation space of MU i is defined as

o_i^t = \{ p_i^{t-1}, x_i^{t-1}, \cdots, p_i^{t-L}, x_i^{t-L} \}, \quad \forall i \in \mathcal{U}    (16)

Action: In time slot t, each MU i observes how tasks are assigned in the previous L slots and then determines the unit price in the current slot, i.e., p_i^t, according to the outputs of the actor network:

a_i^t = \mu_i( o_i^t | \theta^{\mu_i} )    (17)

Note that the action space here is continuous, which satisfies the requirements of MADDPG.

Reward Function: We design a reward function to optimize the objective (15). The reward function of MU i is defined as

r_i^t = \log\Big( 1 + p_i^t \sum_{j \in \mathcal{V}} x_{ij}^t - c_i^t \sum_{j \in \mathcal{V}} x_{ij}^t \Big)    (18)

where c_i^t is the unit sensing cost of MU i. The logarithm function is adopted to ensure a negative reward (or penalty) when the payment received (i.e., p_i^t \sum_{j \in \mathcal{V}} x_{ij}^t) is not enough to compensate the sensing cost (i.e., c_i^t \sum_{j \in \mathcal{V}} x_{ij}^t).

As shown in Fig. 2, MADDPG consists of two networks, namely, a critic and an actor for each learning agent. In each time slot t, the input of each actor network is the observation of environmental features o_i^t, and the output is the deterministic action. On the other hand, the inputs of the critic network include o_i^t and a_i^t, and the output is the estimated value of the current state.

Critic: The loss function of the critic network is calculated by

L_i = \frac{1}{T} \sum_{t=1}^{T} \big( y_i^t - Q_i(o^t, a^t | \theta^{Q_i}) \big)^2    (19)

where a^t = {a_1^t, ..., a_N^t}, o^t = {o_1^t, ..., o_N^t} and y_i^t is the target action value defined by

y_i^t = r_i^t + \gamma Q'_i\big( o^{t+1}, a'^{(t+1)} | \theta^{Q'_i} \big)    (20)

By conducting intensive experiments, we observe that the proposed algorithm achieves the best performance when γ
is equal to 0. Namely, the immediate reward appropriately reflects the true state-action value.

Actor: The gradient of the expected reward for agent i with deterministic policy µ_{θ_i} is given by

\nabla_{\theta^{\mu_i}} J \approx \frac{1}{T} \sum_{t=1}^{T} \nabla_{\theta^{\mu_i}} \mu_i(o_i^t | \theta^{\mu_i}) \, \nabla_{a_i} Q_i\big( o^t, a_i, a^t_{N \setminus i} | \theta^{Q_i} \big) \Big|_{a_i = \mu_i(o_i^t | \theta^{\mu_i})}    (21)
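Putting Eqs. (19)–(21) together, one MADDPG update for agent i can be sketched as follows in PyTorch, building on the hypothetical Actor/Critic modules shown earlier. The tensor shapes, optimizer handling and batch layout are our own illustrative assumptions; note that with the reported discount factor γ = 0, the target in Eq. (20) reduces to the immediate reward.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.0, tau=0.01):
    """One critic/actor update for agent i on a sampled minibatch.

    batch: dict of joint tensors with assumed shapes
      obs, next_obs: [B, N, obs_dim]; acts: [B, N, act_dim]; rews: [B, N].
    """
    obs, acts = batch["obs"], batch["acts"]
    rews, next_obs = batch["rews"], batch["next_obs"]
    B, N, _ = obs.shape

    # Critic update, Eqs. (19)-(20): regress Q_i towards y_i = r_i + gamma * Q'_i(o', a').
    with torch.no_grad():
        next_acts = torch.stack([target_actors[k](next_obs[:, k]) for k in range(N)], dim=1)
        q_next = target_critics[i](next_obs.reshape(B, -1), next_acts.reshape(B, -1)).squeeze(-1)
        y = rews[:, i] + gamma * q_next
    q = critics[i](obs.reshape(B, -1), acts.reshape(B, -1)).squeeze(-1)
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update, Eq. (21): ascend Q_i with respect to agent i's own action.
    joint = [acts[:, k] for k in range(N)]          # other agents' actions from the batch
    joint[i] = actors[i](obs[:, i])                 # replace a_i by mu_i(o_i)
    joint = torch.stack(joint, dim=1).reshape(B, -1)
    actor_loss = -critics[i](obs.reshape(B, -1), joint).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()

    # Soft update of the target networks.
    for online, target in ((critics[i], target_critics[i]), (actors[i], target_actors[i])):
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```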
TABLE I: Simulation parameters
    Number of MUs (U): 8
    Number of TIs (V): 2
    Monetary budget of TI j (b_j): [20, 40]
    Time budget of MU i (t_i): [5, 25]
    Unit sensing cost of MU i (c_i^t): 0.05
    Sensing quality of MU i when performing task 1 (w_i1): 1 + i/10
    Sensing quality of MU i when performing task 2 (w_i2): 1.1 + i/10
    Actor learning rate (α_a): 0.001
    Critic learning rate (α_c): 0.001
    Soft update (τ): 0.01
    Discount factor (γ): 0

to the monetary budget constraints. In the most extreme circumstances, a substantial penalty is imposed on the agent if there is a mismatch between its unit price and sensing quality, and the SP may not purchase any sensing time at all. This result is not surprising because the SP prefers MUs that offer high sensing quality and low unit price. The penalty is exacerbated as the mismatch increases. As a consequence, agents must compete with others and dynamically select their unit price to maintain a relatively high benefit, as shown in Figures 3(a) and 3(d).

Figure 3(e) shows that the SP regains a relatively high payoff after a sharp initial decline because of the competition among MUs, which also results in the fluctuation shown in Figure 3(f). Despite the fluctuation in an individual MU's payoff, the total payoff of MUs converges to the monetary budget of the TIs, as expected.

C. Comparison with Existing Algorithms

We also compare our algorithm with two existing algorithms.
Fig. 3 (x-axis: Iteration): (a) Unit Prices of MUs. (b) Amount of Sensing Time (Task 1). (c) Amount of Sensing Time (Task 2). (d) Payoff of MUs. (e) Payoff of TIs and SP. (f) Sum Payoff of MUs vs. Monetary Budget.
[Actor network structure: input layer, ReLU layers (32, 128, 64 units), Tanh output layer.]

interactions with the operating environment.

VIII. CONCLUSION

This paper presented a distributed task assignment and incentive mechanism based on a Stackelberg game and MADDPG. Since the proposed algorithm requires neither prior
…
Fig. 5: A comparison of our algorithm and other algorithms for different MU time budgets. (Panels: Payoff of MUs; Payoff of TIs; Welfare.)

Fig. 6: A comparison of our algorithm and other algorithms for different monetary budgets of the SP (x-axis: Average Monetary Budget of TIs). (a) Payoff of MUs. (b) Payoff of TIs. (c) Welfare.
APPENDIX A
PROOF OF THEOREM 1

Let λ_i ≥ 0, ∀i ∈ U, and µ_j ≥ 0, ∀j ∈ V, denote the Lagrange multipliers associated with the time budget and monetary budget constraints of Problem 1, respectively. Eliminating λ_i, we have

\Big( \frac{\omega_{ij}}{\sum_{i \in \mathcal{U}} \omega_{ij} x_{ij}} - \mu_j p_i \Big) \Big( \sum_{j \in \mathcal{V}} x_{ij} - t_i \Big) = 0, \quad \forall i \in \mathcal{U}
\mu_j \Big( \sum_{i \in \mathcal{U}} p_i x_{ij} - b_j \Big) = 0, \quad \forall j \in \mathcal{V}
x_{ij} \ge 0, \quad \forall i, j    (24)
\frac{\omega_{ij}}{\sum_{i \in \mathcal{U}} \omega_{ij} x_{ij}} - \mu_j p_i \ge 0, \quad \forall i \in \mathcal{U}
\mu_j \ge 0, \quad \forall j \in \mathcal{V}

When

\frac{\omega_{ij}}{\sum_{i \in \mathcal{U}} \omega_{ij} x_{ij}} - \mu_j p_i = 0,    (25)

we have

\mu_j = \frac{\omega_{ij}}{p_i \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij}} > 0.    (26)

Since µ_j > 0, the complementary slackness condition yields

\sum_{i \in \mathcal{U}} p_i x_{ij} = b_j.    (28)

Namely, if \sum_{j \in \mathcal{V}} x_{ij} - t_i \ne 0, then \sum_{i \in \mathcal{U}} p_i x_{ij} - b_j = 0. Similarly, we can prove that if \sum_{i \in \mathcal{U}} p_i x_{ij} - b_j \ne 0, then \sum_{j \in \mathcal{V}} x_{ij} - t_i = 0. Therefore, Problem 1 can be decomposed into two sub-problems as follows.

Sub-problem 1:

\max_{x} \sum_{j \in \mathcal{V}} \log\Big( \sum_{i \in \mathcal{U}} \omega_{ij} x_{ij} \Big)
\text{s.t.} \sum_{j \in \mathcal{V}} x_{ij} \le t_i, \quad \forall i \in \mathcal{U}    (29)
\phantom{\text{s.t.}} \sum_{i \in \mathcal{U}} p_i x_{ij} = b_j, \quad \forall j \in \mathcal{V}
[7] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
[8] H. Ko, S. Pack, and V. C. Leung, "Coverage-guaranteed and energy-efficient participant selection strategy in mobile crowdsensing," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 3202–3211, 2018.
[9] X. Li and X. Zhang, "Multi-task allocation under time constraints in mobile crowdsensing," IEEE Transactions on Mobile Computing, 2019.
[10] Q. Pham, S. Mirjalili, N. Kumar, M. Alazab, and W. Hwang, "Whale optimization algorithm with applications to resource allocation in wireless networks," IEEE Transactions on Vehicular Technology, vol. 69, no. 4, pp. 4285–4297, 2020.
[11] B. Cao, S. Xia, J. Han, and Y. Li, "A distributed game methodology for crowdsensing in uncertain wireless scenario," IEEE Transactions on Mobile Computing, vol. 19, no. 1, pp. 15–28, 2019.
[12] C. Jiang, L. Gao, L. Duan, and J. Huang, "Scalable mobile crowdsensing via peer-to-peer data sharing," IEEE Transactions on Mobile Computing, vol. 17, no. 4, pp. 898–912, 2017.
[13] S. Bhattacharjee, N. Ghosh, V. K. Shah, and S. K. Das, "QnQ: Quality and quantity based unified approach for secure and trustworthy mobile crowdsensing," IEEE Transactions on Mobile Computing, vol. 19, no. 1, pp. 200–216, 2018.
[14] X. Wang, R. Jia, X. Tian, X. Gan, L. Fu, and X. Wang, "Location-aware crowdsensing: Dynamic task assignment and truth inference," IEEE Transactions on Mobile Computing, 2018.
[15] J. Huang, L. Kong, H.-N. Dai, W. Ding, L. Cheng, G. Chen, X. Jin, and P. Zeng, "Blockchain based mobile crowd sensing in industrial systems," IEEE Transactions on Industrial Informatics, 2020.
[16] B. Guo, H. Chen, Q. Han, Z. Yu, D. Zhang, and Y. Wang, "Worker-contributed data utility measurement for visual crowdsensing systems," IEEE Transactions on Mobile Computing, vol. 16, no. 8, pp. 2379–2391, 2016.
[17] C. Jiang, L. Gao, L. Duan, and J. Huang, "Data-centric mobile crowdsensing," IEEE Transactions on Mobile Computing, vol. 17, no. 6, pp. 1275–1288, 2017.
[18] M. H. Cheung, F. Hou, and J. Huang, "Delay-sensitive mobile crowdsensing: Algorithm design and economics," IEEE Transactions on Mobile Computing, vol. 17, no. 12, pp. 2761–2774, 2018.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[20] C. H. Liu, Z. Chen, and Y. Zhan, "Energy-efficient distributed mobile crowd sensing: A deep learning approach," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1262–1276, 2019.
[21] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, "Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach," IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2377–2392, 2019.
[22] F. Farivar, M. S. Haghighi, A. Jolfaei, and M. Alazab, "Artificial intelligence for detection, estimation, and compensation of malicious attacks in nonlinear cyber-physical systems and industrial IoT," IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2716–2725, 2020.
[23] Z. Zhou, H. Liao, B. Gu, K. M. S. Huq, S. Mumtaz, and J. Rodriguez, "Robust mobile crowd sensing: When deep learning meets edge computing," IEEE Network, vol. 32, no. 4, pp. 54–60, 2018.
[24] L. Xiao, T. Chen, C. Xie, H. Dai, and H. V. Poor, "Mobile crowdsensing games in vehicular networks," IEEE Transactions on Vehicular Technology, vol. 67, no. 2, pp. 1535–1545, 2017.
[25] Y. Chen and H. Wang, "IntelligentCrowd: Mobile crowdsensing via multi-agent reinforcement learning," CoRR, vol. abs/1809.07830, 2018. [Online]. Available: http://arxiv.org/abs/1809.07830
[26] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA, USA: MIT Press, 1991.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[29] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in Proc. ICML, 2014, pp. 1–9.
[30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," Comput. Sci., vol. 8, no. 6, p. A187, 2015.

Bo Gu received the Ph.D. degree from Waseda University, Japan, in 2013. He was a research engineer with Sony Digital Network Applications, Japan, from 2007 to 2011, an assistant professor with Waseda University, Japan, from 2011 to 2016, and an associate professor with Kogakuin University, Japan, from 2016 to 2018. He is currently an associate professor with the School of Intelligent Systems Engineering, Sun Yat-sen University. His research interests include the Internet of Things, edge computing, network economics, and machine learning. He was the recipient of the IEEE ComSoc Communications Systems Integration & Modeling (CSIM) Technical Committee Best Journal Article Award in 2019, the Asia-Pacific Network Operations and Management Symposium (APNOMS) Best Paper Award in 2016, and the IEICE Young Researcher's Award in 2011.

Xinxin Yang received the B.Eng. degree from Hebei University of Technology, China, in 2020. He is currently pursuing his master's degree in pattern recognition and intelligent systems at Sun Yat-sen University. His research interests include edge computing, the Internet of Things, and machine learning.

Ziqi Lin received the B.Eng. degree from Dongguan University of Technology, China, in 2019. He is currently pursuing his master's degree in control engineering at Sun Yat-sen University. His research interests include the Internet of Things, edge computing, and machine learning.

Weiwei Hu received the bachelor's degree from the Marine Engineering College, Dalian Maritime University, in 2020. He is currently pursuing the master's degree in control science and engineering at Sun Yat-sen University. His research interests include blockchain, the Internet of Things, and edge computing.