This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2019.2927314, IEEE
Transactions on Mobile Computing
IEEE TRANS. ON MOBILE COMPUTING, VOL. XX, NO. XX, XX XX 1
Abstract—The explosive increase of mobile devices with built-in sensors such as GPS, accelerometer, gyroscope and camera has
made the design of mobile crowdsensing (MCS) applications possible, creating a new interface between humans and their
surroundings. To date, various MCS applications have been designed in which task initiators (TIs) recruit mobile users (MUs) to
complete the required sensing tasks. In this paper, deep reinforcement learning (DRL) based techniques are investigated to address
the problem of assigning satisfactory yet profitable incentives to multiple TIs and MUs in a MCS game. Specifically, we first
formulate the problem as a multi-leader multi-follower Stackelberg game, where the TIs are the leaders and the MUs are the followers.
Then, the existence of the Stackelberg Equilibrium (SE) is proved. Considering the challenge of computing the SE, a DRL based
Dynamic Incentive Mechanism (DDIM) is proposed. It enables the TIs to learn the optimal pricing strategies directly from game
experiences without knowing the private information of the MUs. Finally, numerical experiments are provided to illustrate the
effectiveness of the proposed incentive mechanism compared with both state-of-the-art and baseline approaches.
Index Terms—Incentive mechanism, Multi-leader multi-follower mobile crowdsensing, Stackelberg Equilibrium, Deep reinforcement
learning
1536-1233 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
TABLE 1
List of important notations used in this paper.

n, N : Index of a MU; number of MUs.
m, M : Index of a TI; number of TIs.
t_m^n, t_m, t^n : Sensing time MU n spends for TI m; vector of sensing time MUs spend for TI m; vector of sensing time MU n spends for all TIs.
(t_m^n)*, t_m^*, (t^n)* : Optimum of t_m^n, t_m, t^n.
p_m^n, p_m, p^n : Price TI m determines for MU n; vector of prices TI m determines for all MUs; vector of prices all TIs pay for MU n.
(p_m^n)*, p_m^*, (p^n)* : Optimum of p_m^n, p_m, p^n.
C_m^n(t_m^n) : Cost of MU n for implementing TI m's task.
κ_n, δ_m : Maximum sensing time of MU n in one sensing slot; budget of TI m.
ω_m^n : Sensing quality of MU n for TI m's task.
e_m^n : Mobility index of MU n from TI m's PoI.
ϕ_m(t_m) : Utility function of TI m.
ψ_n(t^n, p^n) : Payoff function of MU n.
φ_m(t_m, p_m) : Payoff function of TI m.

Fig. 1. An example of a MCS system. (The figure shows the TIs as leaders and the MUs as followers: in Stage I, each TI announces its payment strategy to the MUs through the MCS server; in Stage II, the MUs perform sensing, report the sensing data, and receive the payments.)

In this paper, as shown in Fig. 1, we design an incentive mechanism for MCS based on a two-stage, non-cooperative game known as the Stackelberg game. Specifically, we consider multiple TIs and multiple MUs who can participate in the MCS system simultaneously. Each MU can arbitrarily divide its resource to serve different TIs. We first study the optimal solution and Nash equilibrium of the MCS game. Based on the insight provided by the optimal solution and Nash equilibrium of the MCS game, we extend our work to a dynamic incentive mechanism by formulating the MCS game as a multi-agent Markov decision process (MDP). This makes the model more flexible, which is different from the single-agent discrete models used by existing work [12].

To address the challenges of multiple TIs and the continuous decision space of MUs, we propose an approach based on multi-agent DRL with policy gradient. Our approach can effectively learn the optimal pricing strategy directly from the MCS game history without any prior knowledge about the participants' utility functions. It has merits over model-based MCS game strategies in that it is totally model-free and provides a general solution to MCS systems. Thus, it can be applied to complex and unpredictable scenarios where it is difficult to obtain precise system models.

Differing from previous works, our contribution is three-fold.

1) We formulate the MCS system as a multi-leader multi-follower Stackelberg game, and the existence of the SE is proved. To the best of our knowledge, this is one of the first works that models the MCS system as a multi-leader multi-follower Stackelberg game.
2) Since the SE cannot be solved directly, we transform the MCS game into a multi-agent MDP and design a DRL based incentive mechanism called "DDIM", which enables each TI to learn the optimal pricing strategy directly from the game history without any prior knowledge about the MUs' private information. DDIM can learn the pricing strategy not only under a deterministic environment but also under a stochastic environment in which the MUs enter or leave the TIs' PoIs dynamically.
3) Numerical results demonstrate the effectiveness of the proposed DDIM scheme when compared with both state-of-the-art and baseline approaches.

The remainder of this paper is organized as follows. In Section 2, we discuss the related works. Section 3 presents the system model. Section 4 describes the problem formulation. Section 5 gives the analysis of the Nash equilibrium and optimal solution. Section 6 provides the detailed design of the DDIM approach. In Section 7, numerical experiments are conducted to evaluate the system performance. Section 8 discusses the paper and Section 9 concludes it. Table 1 lists the important notations used in this paper.

2 RELATED WORK

Currently, a large number of previous works have been dedicated to designing incentive mechanisms for MCS. Auction is one of the most widely used incentive mechanism design frameworks for MCS. Yang et al. [9, 10] proposed an incentive mechanism for user-centric MCS using an auction method. Several papers [13-15] have taken into consideration that MUs may come into the MCS system in an online manner. Recently, many works [16-18] have considered the quality of the sensing data. Jin et al. [19] proposed an incentive mechanism for privacy-aware data aggregation in MCS. Gan et al. [20] proposed a game-based incentive mechanism for multi-resource sharing to maintain the social fairness-efficiency tradeoff. However, in these works, the MUs, as the sellers, bid for the sensing tasks.

Yang et al. [9, 10] modeled a platform-centric incentive mechanism for MCS using a Stackelberg game approach. Duan et al. [21] used the Stackelberg game to design a threshold revenue model for the MUs. Cheung et al. [22] designed a delay-sensitive MCS mechanism based on the Stackelberg game. In [11], Maharjan et al. proposed a multimedia application of quality-aware MCS based on the Stackelberg game. Chen et al. [23] modeled crowdsourcing systems as a two-stage non-cooperative game and investigated the behaviors of MUs under global network effects. Nie et al. [24, 25] modeled the rewarding and participation in a MCS system as a two-stage single-leader multi-follower game, where the reward was designed taking into account the underlying social network effects amid the mobile social network. However, these works only considered one sensing task in one sensing slot; meanwhile, they did not take the privacy of MUs into consideration. Xiao et al. [12] proposed a secure MCS game with only one sensing task
in a sensing slot based on the Stackelberg game, and they took the privacy of MUs into account by designing a DQN approach. In [26], the authors studied the MCS game in vehicular networks, where a Q-learning approach was applied to derive the equilibrium of the game. Peng et al. [27] and Chakeri et al. [28] proposed incentive mechanisms based on a two-stage non-cooperative game for MCS with multiple crowdsourcers, yet these works assumed that the MUs could participate in only one sensing task in one sensing slot. Even though there are some works, such as [29], that have studied the multi-leader multi-follower Stackelberg game, they cannot be directly applied to our work. In that work, the authors designed an evolutionary algorithm to find the optimal strategies in the Stackelberg game; however, it assumes that each player has only one decision variable, and its objective of maximizing the total payoffs of all players is different from ours.

There are some works on MCS systems with multiple TIs and multiple MUs [30-32]. He et al. [30] designed an incentive mechanism for MCS systems based on the Walrasian Equilibrium. Duan et al. [32] devised an incentive mechanism for the MCS system that benefits all the TIs, MUs and the MCS server in a balanced manner. However, these works need a central controller to control the market, which is infeasible in free market scenarios. In this work, we formulate the MCS system with multiple TIs and multiple MUs as a Stackelberg game in free market crowdsensing; the main challenges are how to prove the existence of the SE and how to compute it.

3 SYSTEM MODEL

We consider a MCS system which consists of M TIs and N MUs. Let M = {1, 2, · · · , M} be the set of TIs and N = {1, 2, · · · , N} be the set of MUs. Each TI m aims to recruit some of the N MUs located near its PoIs to gather sensing data and establish a MCS application.

3.1 MU Modeling

MUs and TIs all aim to maximize their payoffs in the trading process, which consists of the following two steps: (a) each TI m determines the sensing price p_m^n paid for occupying a unit amount of each MU n's sensing time, and (b) each MU n chooses t_m^n amount of sensing time to serve TI m. Let t^n = (t_m^n)_{∀m∈M} be the vector of sensing time that MU n spends for all tasks, and p^n = (p_m^n)_{∀m∈M} be the vector of prices that all TIs pay for MU n. The mobility index of MU n from TI m's PoI (denoted by e_m^n) is the probability that MU n leaves the PoI of TI m during the sensing task, which depends on the movement speed of the MU. Each participant (i.e., either a MU or a TI) is associated with a payoff function, which represents its benefit. MU n obtains p_m^n t_m^n amount of benefit when it allocates t_m^n amount of sensing time to TI m. At the same time, performing TI m's sensing task for t_m^n units of sensing time incurs a cost denoted as C_m^n(t_m^n), which could incorporate several factors such as the physical or mental tiredness of MUs, battery drainage, bandwidth occupation, etc. Without loss of generality, the cost function C_m^n(t_m^n) is assumed to be a monotonically increasing, differentiable and strictly convex function of t_m^n, for each m-n pair [9, 10, 31-34]. Piecewise linear functions [10] and quadratic functions [32, 33] are two examples widely used in previous works. In this paper, a quadratic function of the effort level is selected for each MU, i.e., C_m^n(t_m^n) = a_m^n (t_m^n)^2 + b_m^n t_m^n, with a_m^n > 0 and b_m^n > 0, which can be used to model the increasing marginal cost for every additional unit of effort exerted. The coefficients a_m^n and b_m^n differ across pairs since different MUs have different levels of availability and different sensing tasks are at different levels of difficulty. This kind of cost function has been widely accepted to represent the cost of a MU, e.g., [32, 33]. For MU n, the payoff function can be formulated as:

ψ_n(t^n, p^n) = Σ_m p_m^n t_m^n − Σ_m C_m^n(t_m^n).    (1)

That is, (1) is the benefit MU n can obtain by selling its sensing service to different TIs.

3.2 TI Modeling

Let p_m = (p_m^n)_{∀n∈N} be the vector of prices that TI m determines for all the MUs, and t_m = (t_m^n)_{∀n∈N} be the vector of sensing time that all MUs spend for TI m. Each TI m is associated with a utility function ϕ_m(·) that measures the sensing quality of all the MUs. Here, we assume that ϕ_m(·) is a monotonically increasing, differentiable, and strictly concave function of t_m. Due to the heterogeneous characteristics of mobile devices, they may contribute differently to the quality of sensing for a given amount of sensing time. The definition of quality of sensing varies across applications. For example, in the MedWatch system, the quality of sensing refers to the quality (e.g., resolution, contrast, sharpness) of uploaded photos; photos with higher quality help the TI better identify visible problems with medical devices. In air quality monitoring MCS systems, quality of sensing refers to a MU's estimation accuracy of air quality. Obviously, the higher the MUs' quality of sensing, the higher the utilities the TIs obtain; e.g., the utility function ϕ_m(·) of TI m is monotonic in the quality of sensing [17, 18]. To capture this, the weight ω_m^n is used to indicate the contribution of a unit of MU n's sensing time to the quality of sensing of TI m.

Then, the payoff of each TI m consists of two parts: (a) the utility gained from collecting the MUs' sensing data, and (b) the incentives paid for the sensing service of the MUs, i.e.:

φ_m(t_m, p_m) = ϕ_m(t_m) − Σ_n p_m^n t_m^n.    (2)

In this paper, the widely used utility function ϕ_m(t_m) = µ_m log(1 + Σ_n log(1 + ω_m^n t_m^n)) is adopted, as in the previous works [9-11, 16], where µ_m is the weight for different TIs. The log(1 + ω_m^n t_m^n) term reflects TI m's diminishing return on the work of MU n, and the outer log term reflects TI m's diminishing return on the participating MUs. For convenience, we set σ_m = 1 + Σ_n log(1 + ω_m^n t_m^n), and thus g_m(σ_m) = µ_m log(σ_m).

4 PROBLEM FORMULATION

In a MCS system, the TIs and MUs negotiate the pricing and task allocation strategies to maximize their own payoffs. Specifically, the objectives for TI m and MU n can be formulated as constrained optimization problems.
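Before these optimization problems are formalized, the payoff definitions of Eqns. (1) and (2) can be evaluated numerically. The following is an illustrative sketch only; the parameter values are arbitrary toy choices, not numbers from the paper:

```python
import math

def mu_payoff(t_n, p_n, a_n, b_n):
    """Eqn. (1): psi_n = sum_m [p_m^n t_m^n - C_m^n(t_m^n)], quadratic cost."""
    return sum(p * t - (a * t * t + b * t)
               for t, p, a, b in zip(t_n, p_n, a_n, b_n))

def ti_utility(t_m, w_m, mu_m):
    """phi_m(t_m) = mu_m * log(1 + sum_n log(1 + w_m^n t_m^n))."""
    sigma = 1.0 + sum(math.log(1.0 + w * t) for w, t in zip(w_m, t_m))
    return mu_m * math.log(sigma)

def ti_payoff(t_m, p_m, w_m, mu_m):
    """Eqn. (2): phi_m(t_m) - sum_n p_m^n t_m^n."""
    return ti_utility(t_m, w_m, mu_m) - sum(p * t for p, t in zip(p_m, t_m))

# Toy numbers (assumed): one MU serving two TIs, one TI recruiting two MUs.
print(round(mu_payoff([0.5, 0.3], [1.0, 0.8], [0.4, 0.5], [0.1, 0.2]), 4))
print(round(ti_payoff([0.5, 0.3], [1.0, 0.8], [2.0, 1.5], 5.0), 4))
```

Note that with zero sensing time, σ_m = 1 and the utility vanishes, matching the diminishing-return shape described above.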
For TI m, the sensing service pricing problem is:

OPTI: max_{p_m} φ_m(t_m, p_m)    (3)
s.t. Σ_n p_m^n t_m^n ≤ δ_m, p_m^n > 0,

where δ_m is the credit budget of TI m.

For MU n,

OPMU: max_{t^n} ψ_n(t^n, p^n)    (4)
s.t. Σ_m t_m^n ≤ κ_n, t_m^n > 0,

where κ_n is the maximum sensing time of MU n.

Since each participant has its own objective, it is necessary to find a pricing strategy p_m for each TI m and a sensing time allocation strategy t^n for each MU n such that all of them can receive benefits. We formulate the incentive mechanism as a Stackelberg game based on two-stage non-cooperative game theory. In a Stackelberg game, the participants are classified into two groups, leaders and followers, where the leaders have the privilege of moving first while the followers move according to the leaders' actions. Specifically, our problem is modeled as a multi-leader multi-follower Stackelberg game with two stages, where the TIs act as the leaders and all the MUs act as the followers. First, each TI m determines its pricing strategy p_m. Then, each MU n acts as a game follower choosing its sensing time strategy t^n to maximize its own payoff.

Definition 1. SE: The strategy set ((p_m^n)*, (t_m^n)*) for all n ∈ N and m ∈ M constitutes a SE of the MCS game if the following conditions are satisfied:
(a): (p_m^n)* is a Nash Equilibrium for the TIs, i.e.,

φ_m(t_m^*(p_m^*, p_{-m}^*), p_m^*, p_{-m}^*) ≥ φ_m(t_m^*(p_m, p_{-m}^*), p_m, p_{-m}^*),    (5)

where p_{-m}^* = (p_{m1}^n)*_{∀m1∈M\m, ∀n∈N} denotes the pricing strategies of the TIs other than m.
(b): For a given pricing strategy (p^n)*, the optimal response from MU n is (t^n)*, which is the unique maximizer of ψ_n(t^n, (p^n)*).

5 OPTIMAL SOLUTION AND NASH EQUILIBRIUM

5.1 Optimal Solution for MUs

As formulated in Section 4, for a given set of prices announced by the TIs, MU n calculates its optimal sensing time response by solving the optimization problem OPMU in Eqn. (4). Obviously, OPMU is a convex optimization problem. Hence, the stationary solution for each MU is unique and optimal. Note that p_m^n must be greater than b_m^n, otherwise MU n will not participate in TI m's sensing task. This is because, if p_m^n < b_m^n and MU n participates in TI m's task, then supposing MU n contributes any sensing time t̂_m^n ∈ R+, its benefit for serving m is (p_m^n − b_m^n) t̂_m^n − a_m^n (t̂_m^n)^2, which is less than 0. This implies that any TI m who wants to recruit MU n for its own task must offer a sensing price of at least b_m^n, otherwise MU n will not contribute any sensing data for that task.

Theorem 1. For any given feasible p^n, the optimal sensing time allocation of MU n is

(t_m^n)* = (p_m^n − b_m^n)/(2a_m^n), if Σ_{m=1}^{M} (p_m^n − b_m^n)/(2a_m^n) ≤ κ_n;
(t_m^n)* = (F1 + F2 + F3)/F4, else if F1 + F2 + F3 > 0;
(t_m^n)* = 0, otherwise,    (6)

where

F1 = Σ_{m1≠m} Π_{m2≠m1,m} a_{m2}^n (p_m^n − p_{m1}^n),
F2 = Σ_{m1≠m} Π_{m2≠m1,m} a_{m2}^n (b_{m1}^n − b_m^n),
F3 = 2 Π_{m1≠m} a_{m1}^n κ_n,  F4 = 2 Σ_m Π_{m1≠m} a_{m1}^n.    (7)

Proof. Assume that there are M TIs in the MCS system. OPMU becomes:

max Σ_m (p_m^n t_m^n − a_m^n (t_m^n)^2 − b_m^n t_m^n)    (8)
s.t. Σ_m t_m^n ≤ κ_n, t_m^n > 0.    (9)

Obviously, OPMU is a strictly convex optimization problem. Hence, there exists a unique solution. Setting λ_0^n and λ_m^n as Lagrangian multipliers, the optimization problem (8)-(9) becomes:

L_n = Σ_m (p_m^n t_m^n − a_m^n (t_m^n)^2 − b_m^n t_m^n) − λ_0^n (Σ_m t_m^n − κ_n) + Σ_m λ_m^n t_m^n.    (10)

The KKT conditions are

∂L_n/∂t_m^n = 0, ∀m ∈ M,    (11)
λ_0^n (Σ_m t_m^n − κ_n) = 0, λ_m^n t_m^n = 0,    (12)
λ_0^n, λ_m^n ≥ 0, t_m^n > 0, Σ_m t_m^n ≤ κ_n.    (13)

Eqn. (11) can be converted to:

p_m^n − 2a_m^n t_m^n − b_m^n − λ_0^n + λ_m^n = 0, ∀m ∈ M.    (14)

Since t_m^n > 0, we obtain λ_m^n = 0. Then, the optimal sensing time allocation takes one of the following cases:
i) Case I: λ_0^n = 0. According to Eqn. (14), we have:

t_m^n = (p_m^n − b_m^n)/(2a_m^n).    (15)

ii) Case II: λ_0^n > 0. According to Eqn. (14), we have t_m^n = (p_m^n − b_m^n − λ_0^n)/(2a_m^n). Substituting t_m^n into Eqn. (12), we have

λ_0^n = (Σ_m Π_{m1≠m} a_{m1}^n (p_m^n − b_m^n) − 2 Π_m a_m^n κ_n) / (Σ_m Π_{m1≠m} a_{m1}^n),

and:

t_m^n = (F1 + F2 + F3)/F4,    (16)

where F1, F2, F3 and F4 satisfy Eqn. (7).

Then, using Eqns. (15) and (16), the optimal sensing allocation strategies for M TIs and N MUs, which cover cases I-II for any given feasible p^n, satisfy Eqn. (6).

From Theorem 1, the MUs can adjust their sensing time allocation strategies according to the TIs' pricing strategies and obtain the maximum payoffs. The detailed sensing service computing process for MU n is described by Algorithm 1.
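Theorem 1's closed form can be sketched directly in code. The sketch below is illustrative only, with assumed toy parameters; Case II is computed through the multiplier λ_0^n in an equivalent ratio form (divide the numerator and denominator of λ_0^n by Π_m a_m^n), which avoids building the products F1-F4 explicitly but yields the same allocation:

```python
def best_response(p, a, b, kappa):
    """Optimal t_m^n of Theorem 1 for one MU (all prices assumed > b_m^n).

    Case I: the unconstrained optimum t_m = (p_m - b_m) / (2 a_m) fits in kappa.
    Case II: the time budget binds; lambda0 = (sum((p_m - b_m)/a_m) - 2*kappa)
             / sum(1/a_m), then t_m = (p_m - b_m - lambda0) / (2 a_m),
             floored at 0 per the third case of Eqn. (6).
    """
    t = [(pm - bm) / (2.0 * am) for pm, am, bm in zip(p, a, b)]
    if sum(t) <= kappa:
        return t                               # Case I (lambda0 = 0)
    lam0 = (sum((pm - bm) / am for pm, am, bm in zip(p, a, b)) - 2.0 * kappa) \
        / sum(1.0 / am for am in a)
    return [max((pm - bm - lam0) / (2.0 * am), 0.0)
            for pm, am, bm in zip(p, a, b)]    # Case II

# Example (assumed numbers): two TIs price one MU whose budget is kappa = 1.
print(best_response([2.0, 1.5], [1.0, 0.5], [0.2, 0.1], 1.0))
```

In the constrained case the returned times sum exactly to κ_n, as Eqn. (12) requires when λ_0^n > 0.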
Algorithm 1 The sensing service computing process for MU n.
Input: The TIs' pricing strategies p^n for MU n;
Output: The sensing time allocation strategy t^n for MU n;
1: Receive the pricing strategies p^n from all the TIs;
2: if Σ_{m=1}^{M} (p_m^n − b_m^n)/(2a_m^n) ≤ κ_n then
3:   t_m^n = (p_m^n − b_m^n)/(2a_m^n);
4: else if F1 + F2 + F3 > 0 then
5:   t_m^n = (F1 + F2 + F3)/F4;
6: else
7:   t_m^n = 0;
8: end if

When a MU receives the TIs' pricing strategies, it starts its sensing time allocation computing process (Line 1). First, the MU checks whether its sensing time is sufficient (Line 2). If so, the TIs can get enough sensing time from the MUs and do not need to compete with each other (Line 3). However, when the MU's sensing time is limited, there is competition among the TIs. In this case, if TI m sets a feasible price for MU n, then MU n allocates an amount of sensing time larger than 0 to it (Lines 4-5); otherwise, MU n does not participate in its sensing task (Line 7). Note that there is no game played among the MUs. Each MU responds to the pricing strategies announced by the TIs using only its local information. The pricing strategies depend on all the sensing time allocation strategies selected by the MUs, and hence the MUs indirectly affect each other's decisions.

5.2 Nash Equilibrium Analysis for TIs

Since each TI behaves in a selfish manner and all of them are rational, all the TIs aim at maximizing their own payoffs. In this situation, the competition among the TIs can be formulated as a non-cooperative game, whose solution is the well-known Nash equilibrium. Let δ_m > 0 be the budget of TI m. Each TI aims to recruit more sensing time at a lower price. If there were only a single TI, it could set a very low price to maximize its payoff. Assume that δ_m is given for each TI m. As mentioned in Section 4, the sensing service pricing problem OPTI of TI m is shown in Eqn. (3).

In the proposed MCS game, the non-cooperative game among the TIs can be described as follows:

• Players: the TIs m ∈ M.
• Strategy: the pricing strategy p_m of any TI m.
• Utility: the payoff functions of the TIs given in Eqn. (2).

Lemma 1. If the following conditions are satisfied, there exists a Nash Equilibrium in the non-cooperative game [35].

• The player set is finite.
• The strategy sets are closed, bounded, and convex.
• The utility functions are continuous and quasi-concave in the strategy space.

Theorem 2. There exists a Nash Equilibrium in the non-cooperative game among the TIs.

Proof. As analyzed in Section 5.1, when the TIs announce their pricing strategies, each MU n gives its sensing time allocation strategy t_m^n for TI m. For convenience, in the following, ϕ_m will be used in place of ϕ_m(t_m). The Hessian matrix of ϕ_m is defined as H = (∂²ϕ_m/∂p_m^{n1}∂p_m^{n2}) ∈ R^{N×N}. Also, we use g'_m(σ_m) and g''_m(σ_m) to denote ∂g_m(σ_m)/∂σ_m and ∂²g_m(σ_m)/∂σ_m², respectively. According to the definition of ϕ_m, the second-order derivative of ϕ_m with respect to (w.r.t.) p_m^n is

∂²ϕ_m/∂(p_m^n)² = ((g''_m(σ_m) − g'_m(σ_m))/(1 + t_m^n)²) (∂t_m^n/∂p_m^n)².    (17)

The second-order partial derivative of ϕ_m is

∂²ϕ_m/∂p_m^{n1}∂p_m^{n2} = g''_m(σ_m) (1/((1 + t_m^{n1})(1 + t_m^{n2}))) (∂t_m^{n1}/∂p_m^{n1})(∂t_m^{n2}/∂p_m^{n2}).    (18)

Set the diagonal matrix H1 = diag[λ^1, λ^2, · · · , λ^N], where λ^n = −(g'_m(σ_m)/(1 + t_m^n)²)(∂t_m^n/∂p_m^n)². Obviously, we have g'_m(σ_m) > 0. As a result, it can be derived that λ^n ≤ 0.

Furthermore, set H2 = g''_m(σ_m)(H2(n1, n2)) ∈ R^{N×N}, where:

H2(n1, n2) = H2(n2, n1) = (1/((1 + t_m^{n1})(1 + t_m^{n2}))) (∂t_m^{n1}/∂p_m^{n1})(∂t_m^{n2}/∂p_m^{n2}).    (19)

Then, we can rewrite H2 as H2 = g''_m(σ_m) q q^T, where q = (q^n) ∈ R^{N×1} and q^n = (1/(1 + t_m^n)) ∂t_m^n/∂p_m^n. According to the definition of the Hessian matrix of ϕ_m, we have H = H1 + H2. Randomly select a vector v ∈ R^{N×1} whose elements are not all 0. Then, we have v^T H v = v^T H1 v + v^T H2 v. Based on the definition of H1, it is easy to obtain v^T H1 v = Σ_n λ^n (v^n)² ≤ 0. And according to the definition of H2, we have

v^T H2 v = g''_m(σ_m) v^T q q^T v = g''_m(σ_m) (Σ_n (v^n/(1 + t_m^n)) ∂t_m^n/∂p_m^n)².    (20)

Since g''_m(σ_m) = −µ_m/σ_m² < 0 and ∂t_m^n/∂p_m^n ≥ 0, it is clear that v^T H2 v ≤ 0. Therefore, we have v^T H v ≤ 0. This indicates that ϕ_m is a concave function. Meanwhile, we have ∂²(−p_m^n t_m^n)/∂(p_m^n)² ≤ 0, which means that −p_m^n t_m^n is also a concave function. Since a sum of concave functions is concave, the payoff functions of the TIs are all concave. In addition, it is clear that the strategy sets of the TIs are closed, bounded and convex. Based on Lemma 1, there exists a Nash Equilibrium in the non-cooperative game among the TIs.

Since there is a Nash Equilibrium among the TIs, and OPMU has a unique maximizer for any given p_m, the Stackelberg game possesses a SE.

Theorem 3. The Stackelberg game formulated in the considered multi-leader multi-follower MCS game possesses a SE.

According to Theorem 3, each TI m can determine its optimal pricing strategy, from which it cannot unilaterally deviate to receive more payoff. Furthermore, the MUs can determine their optimal sensing time allocation strategies according to the TIs' pricing strategies to gain the maximum payoffs. In order to solve OPTI optimally, we start by relaxing the positivity constraint on p_m^n for convenience of analysis. Let us define L_m as:

L_m = φ_m(t_m, p_m) − λ_m (Σ_n p_m^n t_m^n − δ_m).    (21)
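Although the TI-side equilibrium conditions derived next have no closed form, a toy instance can be explored numerically with best-response dynamics, each TI hill-climbing its payoff (2) over a price grid against the MUs' Theorem-1 response. This is an illustrative sketch only (it is not the DDIM algorithm), and all parameters below are assumed:

```python
import math

def mu_response(p, a, b, kappa):
    """Theorem 1 response of one MU to the price vector p (prices > b assumed)."""
    t = [(pm - bm) / (2.0 * am) for pm, am, bm in zip(p, a, b)]
    if sum(t) <= kappa:
        return t                                   # Case I
    lam0 = (sum((pm - bm) / am for pm, am, bm in zip(p, a, b)) - 2.0 * kappa) \
        / sum(1.0 / am for am in a)
    return [max((pm - bm - lam0) / (2.0 * am), 0.0)
            for pm, am, bm in zip(p, a, b)]        # Case II

def ti_payoff(m, prices, a, b, kappa, w_m, mu_m):
    """Eqn. (2) payoff of TI m: utility of recruited time minus payments."""
    t = mu_response(prices, a, b, kappa)
    return mu_m * math.log(1.0 + math.log(1.0 + w_m * t[m])) - prices[m] * t[m]

# Toy instance: two symmetric TIs compete for one MU (all numbers assumed).
a, b, kappa = [0.5, 0.5], [0.1, 0.1], 0.8
w, mu = [2.0, 2.0], [3.0, 3.0]
grid = [0.1 + 0.01 * i for i in range(300)]        # candidate prices

prices = [1.0, 1.0]
for _ in range(100):                               # best-response dynamics
    new = list(prices)
    for m in (0, 1):
        new[m] = max(grid, key=lambda pm: ti_payoff(
            m, [pm if j == m else new[j] for j in (0, 1)],
            a, b, kappa, w[m], mu[m]))
    if new == prices:                              # mutual best responses reached
        break
    prices = new
print(prices, mu_response(prices, a, b, kappa))
```

At the resulting price pair, neither TI can improve its payoff by deviating within the grid, which is exactly the (grid-restricted) Nash condition of Definition 1(a).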
Fig. 2. (left part) MDP for a MCS game; (right part) proposed DRL based modeling of DDIM for each TI. (For each TI, the figure shows the state transition under the policy, an actor network that outputs the action, a critic network used to calculate the loss, the payoff used as the reward, and a replay buffer that is closed up when full and opened up when empty.)
The first-order optimality condition for the TIs leads to ∂L_m/∂p_m^n = 0. For a given TI m, ∂L_m/∂p_m^n = 0 for all n ∈ N gives N equations; letting m range over M gives M such sets of equations. Then, a Nash Equilibrium exists for the TIs in the pricing strategy selection game, which can be obtained by solving the following equations together:

{∂L_m/∂p_m^n = 0, ∀n, m}.    (22)

Solving these M × N equations, we can obtain p_m^*. From Theorem 2, we know there exists a Nash Equilibrium for the TIs; therefore, Eqn. (22) has a solution. Using p_m^*, we can compute t_m^*. However, even though we know there exists a SE, we still do not know how to compute it because of the complexity of Eqn. (22). That is, it is impossible to get an analytical solution of Eqn. (22). Moreover, the TIs would have to know the private information of the MUs to solve Eqn. (22), and in practical scenarios it is impossible for the TIs to know the private information of the MUs a priori. Therefore, it is necessary to design an efficient algorithm for the incentive mechanism in this multi-leader multi-follower MCS system. Since no existing works can be applied directly to solve our problem, we design a DRL-based approach to obtain the SE, as shown in the next section.

6 PROPOSED DRL-BASED DYNAMIC INCENTIVE MECHANISM (DDIM)

Since the optimization problem OPTI in Eqn. (3) is non-linear with a complicated form, and the OPTI problems of the TIs are tightly coupled, it is difficult to solve them explicitly. Furthermore, a_m^n and b_m^n represent the private information of an m-n pair, yet they are necessary for solving OPTI in Eqn. (3). Obtaining them can be unrealistic in practice when MUs are unwilling to expose their private information. Therefore, a DRL approach is employed to enable the TIs to learn the optimal pricing strategies directly from the negotiation history without prior knowledge about any MU. In the following, we first formulate the MCS game as a multi-agent MDP. Then, we present the DRL approach for the TIs to learn the optimal pricing strategies.

6.1 MDP for MCS Game

As shown in the left part of Fig. 2, we formulate the MCS game as a multi-agent MDP for MCS (referred to as MMDP). It is composed of the state space S_m = {s_m}, the action space A_m = {p_m}, the state transition probability P = {P_m}, and the reward R_m = {r_m}, which are described in detail in the next few sections. Then, MMDP = <S_m, A_m, P_m, R_m> [36]. Furthermore, each TI acts as an agent in the MMDP, and the environment consists of the N MUs.

6.1.1 State Space
S_m = {s_m}. We denote the pricing strategy of TI m at the k-th game and the vector of sensing time that all MUs spend for TI m as p_m(k) and t_m(k), respectively. Then, the state of TI m in the MMDP is defined as s_m(k) = [p_m(k − L), t_m(k − L), · · · , p_m(k − 1), t_m(k − 1)]. That is, the state of TI m is comprised of the past L game records between TI m and all MUs. We call it the "game history matrix".

6.1.2 Action Space
A_m = {p_m}. The action of TI m at the k-th game is defined as p_m(k), which is the pricing strategy profile of TI m at the k-th game.

6.1.3 State Transition Probability Function
P = {P_m}, where P_m : S_m × A_m × S_m → [0, 1] represents the transition probability distribution of TI m's state. Assume that the current state and action of TI m are s_m(k) = s and p_m(k) = p, respectively. Then, the probability of s_m(k + 1) = s′ is P(s′|s, p).

6.1.4 Reward Function
R_m : S_m × A_m → R, where the reward function of TI m is defined as

r_m(k) = ξ φ_m(t_m(k), p_m(k)),    (23)

where ξ is a scaling factor.
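The state, action, and reward just defined can be captured in a small container. This sketch is illustrative; the class and function names are assumed, not from the paper, and the ξ value is an arbitrary choice:

```python
from collections import deque

class TIAgentState:
    """Rolling game-history state s_m(k) = [p_m(k-L), t_m(k-L), ..., p_m(k-1), t_m(k-1)]."""

    def __init__(self, history_len, num_mus):
        self.L = history_len
        self.num_mus = num_mus
        self.records = deque(maxlen=history_len)   # each record: (p_m, t_m)

    def observe(self, p_m, t_m):
        """Append the k-th game record (pricing profile, sensing-time profile)."""
        self.records.append((list(p_m), list(t_m)))

    def state(self):
        """Flattened game history matrix; zero-padded before L games have occurred."""
        pad = self.L - len(self.records)
        flat = [0.0] * (2 * self.num_mus * pad)
        for p_m, t_m in self.records:
            flat.extend(p_m + t_m)
        return flat

def reward(phi_value, xi=0.01):
    """Eqn. (23): r_m(k) = xi * phi_m(t_m(k), p_m(k)); xi is a scaling factor."""
    return xi * phi_value

s = TIAgentState(history_len=3, num_mus=2)
s.observe([1.0, 1.2], [0.4, 0.3])
print(len(s.state()))   # 2 * num_mus * L entries
```

The `deque` with `maxlen=L` automatically discards the oldest record, so the state always reflects exactly the past L games.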
∇_{α_m} J_m = E_{s_m∼ρ_m(s), p_m∼π_{α_m}} [∇_{α_m} log π_{α_m}(p_m|s_m) Q_m(s_m, p_m)]
= E_{s_m∼ρ_m(s), p_m∼π_{ᾱ_m}} [∇_{α_m} log π_{α_m}(p_m|s_m) f_m(s_m, p_m) A_m(s_m, p_m)],    (25)

where f_m(s_m, p_m) = π_{α_m}(p_m|s_m)/π_{ᾱ_m}(p_m|s_m). The state-action value function is Q_m(s_m, p_m) = E[Σ_{l=1}^{∞} γ_m^l r_m(k + l) | s_m(k) = s_m, p_m(k) = p_m], the advantage function is A_m(s_m, p_m) = Q_m(s_m, p_m) − V_m(s_m), and ᾱ_m is the parameter of the policy used for sampling p_m. Notably, V_m(s_m) = E_{p_m∼π_{α_m}} Q_m(s_m, p_m).

Moreover, to reduce the oscillation caused by the gradient method during training, we integrate the proximal policy optimization (PPO) method proposed in [39] to stabilize the training process, which clips the policy gradient as:

∇_{α_m} J_m^{Clip} = ∇_{α_m} E_{s_m∼ρ_m(s), p_m∼π_{ᾱ_m}} [min(f_m A_m, η(f_m) A_m)]
≈ Σ_{k=1}^{D} ∇_{α_m} log π_{α_m}(p_m(k)|s_m(k)) min(f_m(k) Â_m(k), η(f_m(k)) Â_m(k)),    (26)

where η(x) is the piecewise function equal to 1 − ε for x < 1 − ε, x for 1 − ε ≤ x ≤ 1 + ε, and 1 + ε for x > 1 + ε (i.e., it clips the importance ratio into [1 − ε, 1 + ε]), f_m(k) = π_{α_m}(p_m(k)|s_m(k))/π_{ᾱ_m}(p_m(k)|s_m(k)), Â_m(k) = Σ_{l=k}^{D} r_m(l) + V_{β_m}(s_m(D + 1)) − V_{β_m}(s_m(k)), and D is the number of samples for estimating the policy gradient at one training step. Through the policy gradient method, the actor can be optimized.

Finally, the critic V_{β_m} can be optimized by minimizing the following loss function:

L_m(β_m) = E_{s_m∼ρ_m(s_m)} [(−V_{β_m}(s_m) + E_{s′_m∼P_m, p_m∼π_{ᾱ_m}}[r_m + V_{β_m}(s′_m)])²]
≈ Σ_{k=1}^{D} [−V_{β_m}(s_m(k)) + Σ_{l=k}^{D} r_m(l) + V_{β_m}(s_m(D + 1))]²,    (27)

where D is the number of samples for training the critic.

6.3 Proposed DRL based Modeling of DDIM for Each TI

As shown in the right part of Fig. 2, each TI is an independent agent with one actor network and one critic network. At the k-th game, TI m updates its state s_m(k) (i.e., the game history matrix) as the input of the actor network. The actor network then generates its action p_m(k) (i.e., the pricing policy). After all TIs broadcast their pricing policies (p_1(k), p_2(k), · · · , p_M(k)), the MUs determine their sensing time allocation strategies for all the TIs (t_1(k), t_2(k), · · · , t_M(k)). Upon receiving the MUs' strategies, TI m calculates its payoff r_m(k), and the newest game record [s_m(k), p_m(k), s_m(k + 1), r_m(k)] is stored into the replay buffer D.

6.3.1 Update Actor and Critic

The actor and critic networks are updated when the replay buffer is filled with D records.
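Putting Eqs. (25)-(29) together, one batch update can be sketched as below. This is a minimal NumPy sketch under our own simplifications: the function and variable names are ours, `eps` stands for the clipping constant ε, and simple scalar arrays stand in for the actor and critic networks.

```python
import numpy as np

def batch_update_quantities(rewards, values, v_final, ratios, eps=0.2):
    """Quantities of Eqs. (26)-(27) for one replay-buffer batch of D records.

    rewards[k] : payoff r_m(k) of the k-th record
    values[k]  : critic estimate V_{beta_m}(s_m(k))
    v_final    : bootstrap value V_{beta_m}(s_m(D+1))
    ratios[k]  : importance ratio f_m(k) = pi_alpha(p|s) / pi_alphabar(p|s)
    eps        : PPO clipping constant (illustrative value)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ratios = np.asarray(ratios, dtype=float)

    # A_hat(k) = sum_{l >= k} r_m(l) + V(s(D+1)) - V(s(k))
    reward_tails = np.cumsum(rewards[::-1])[::-1]
    advantages = reward_tails + v_final - values

    # clipped surrogate terms min(f * A, clip(f, 1-eps, 1+eps) * A) of Eq. (26)
    clipped = np.minimum(ratios * advantages,
                         np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages)

    # critic loss of Eq. (27): squared error against the same bootstrapped targets
    critic_loss = float(np.sum((reward_tails + v_final - values) ** 2))
    return advantages, clipped, critic_loss
```

The actor parameter αm then moves by gradient ascent with step τ1 on the clipped surrogate (Eq. (28)), and the critic parameter βm by gradient descent with step τ2 on the critic loss (Eq. (29)).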
Specifically, TI m first calculates all the state values Vβm(sm(k)) in D through the critic network. Then, it calculates all ∑_{l=k}^{D} rm(l), fm(k), and Âm(k) in D. After all Vβm(sm(k)), ∑_{l=k}^{D} rm(l), fm(k), and Âm(k) in D are calculated, the policy gradient ∇αm Jm^Clip is estimated and the parameter of the actor network is updated with the gradient ascent method as:

α′m ← αm + τ1 ∇αm Jm^Clip(αm),   (28)

where τ1 is the learning rate for training the actor network and α′m represents the renewed parameter of the actor network. Furthermore, the critic network can be updated by minimizing the loss function Lm(βm) with the gradient descent method:

β′m ← βm − τ2 ∇βm Lm(βm),   (29)

where τ2 is the learning rate for training the critic network and β′m represents the renewed parameter of the critic network.

6.3.2 Detailed Explanation of DDIM
Algorithm 2 shows the pseudocode of DDIM. When a game starts, each TI initializes its game history matrix (Line 1). During each game period, it first updates its game history matrix (Line 7). Then, taking its game history matrix as the input of its policy network, each TI derives its pricing policy and sends it to the MUs (Lines 8-10). After receiving the MUs' sensing time (Line 11), the TIs calculate their payoffs and store their game information into the replay buffer (Lines 12-16). The predefined penalty ϑ is applied to pm(k) if it makes TI m exceed its budget (Lines 13-15). The actor and critic networks are updated every D epochs using the past D training samples stored in the replay buffer (Lines 3 and 18-19). After these two neural networks are updated with gradient ascent and gradient descent, respectively, the replay buffer is cleared (Line 20), and the game finishes when the maximum number of training episodes is reached (Line 2). Furthermore, both the actor and critic networks are chosen as fully-connected neural networks with two hidden layers. Specifically, the actor network παm takes state sm(k) as its input and outputs the mean and variance of action pm. The critic Vβm takes state sm(k) as its input and outputs the estimated state value of sm(k).

7 PERFORMANCE EVALUATION

7.1 Simulation Settings
Simulations are conducted to evaluate the performance of the proposed DDIM approach with M TIs and N MUs. The parameter µm of each TI m is generated uniformly at random from the range [30, 120]. The values of a^n_m, b^n_m, and ω^n_m between MU n and TI m are generated uniformly at random within (0, 1). The budget δm of each TI m is randomly generated within [80, 100]. The neural network parameters in our simulation are selected through fine-tuning. In this paper, the actor and critic networks of each agent both have two hidden fully-connected layers, with 200 and 50 neurons, respectively. We set D = 20 and L = 5 by default.

7.2 Compared State-of-the-Art and Baseline Approaches
We compare our work with two state-of-the-art approaches and two baseline approaches.
• Platform-centric Model (PCM) [10]: a modified version of a classical incentive mechanism based on a Stackelberg game, in which the MCS server originally can only issue one task in one sensing slot. We extend this work to multiple sensing tasks, where the tasks appear in a random sequence.
• Platform-centric Model with data quality (PCMDQ) [11]: different from PCM, in this work the authors took data quality into consideration.
• Greedy: a heuristic algorithm that greedily chooses the policy with the maximum reward from the replay buffer.
• Random: the TIs randomly select their prices, and the MUs randomly decide their sensing time.
We use the total payoff value ζ = ∑_m φm + ∑_n ψn for performance evaluation, where φm and ψn are the payoffs of each TI and MU as in Eqn. (1) and (2), respectively.

7.3 Results
First, we show the convergence of our proposed DDIM approach. In this group of simulations, the maximum sensing time of each MU n is randomly generated from the range (6, 10). We set the number of TIs to M = 2 and the number of MUs to N = 8. Fig. 3(a) shows the pricing strategy of TI 1. We see that our proposed DDIM approach converges quickly to the optimal prices, within fewer than 200 episodes. Since the TIs constantly adjust their pricing strategies, the MUs also adjust their sensing time allocation strategies accordingly. As shown in Fig. 3(d), the MUs' sensing time allocation strategies also converge to stable states. Specifically, from Fig. 3(e) and Fig. 3(f), we see that the payoffs of the TIs and MUs converge quickly, which means that the DDIM approach allows the participants to quickly obtain optimal payoffs.

Next, we show how each MU and TI behaves in an MCS game under all five approaches, in terms of their payoffs, as shown in Table 2. We set the numbers of MUs and TIs both to N = M = 6. The MUs' maximum sensing time κn is randomly generated from (2, 4). We can see that DDIM achieves the highest total payoff over all MUs and TIs, ζ = 375.3, compared to 203.6 and 214.9 for PCM and PCMDQ, respectively. The Greedy and Random approaches are slightly better than PCM and PCMDQ, but our DDIM approach still achieves at least an 8.6% improvement. For each MU and TI, we see that several TIs achieve higher payoffs in PCM and PCMDQ at the beginning. This is because initially the MUs are not aware that performing subsequent sensing tasks will bring higher benefits. Thus, the MUs offer as much of their sensing time as possible to serve the TIs, which lets the TIs set very low prices initially and obtain higher payoffs. However, this strategy causes the MUs to quickly exhaust their sensing time, so they cannot serve any further TIs. For example, in PCM and PCMDQ, TIs 4-6 are unable to recruit any MU. On the contrary, in DDIM all TIs compete with each other, so the MUs allocate their sensing time in a much better way.
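To make the per-TI cycle of Algorithm 2 (Secs. 6.3 and 6.3.2) concrete — update the game-history state, price, observe the MUs' reply, store the record, and train every D records — it can be sketched as below. The `actor`, `mu_sensing_time`, and `ti_payoff` functions are toy stand-ins of our own, not the paper's networks or utility functions.

```python
D = 20  # replay-buffer size between updates (default from Sec. 7.1)

def actor(state):
    # stand-in pricing policy: price nudged by the recent average payoff
    return 1.0 + 0.1 * (sum(state) / len(state) if state else 0.0)

def mu_sensing_time(price):
    # stand-in MU response: offered sensing time grows with the price, capped
    return min(2.0 * price, 10.0)

def ti_payoff(price, t):
    # stand-in TI payoff: concave utility of sensing time minus the payment
    return 3.0 * t ** 0.5 - price * t

replay_buffer, state, updates = [], [], 0
for k in range(60):                       # 60 games for a single TI
    p = actor(state)                      # action p_m(k) from the game history
    t = mu_sensing_time(p)                # MUs' sensing-time reply
    r = ti_payoff(p, t)                   # payoff r_m(k)
    next_state = (state + [r])[-5:]       # roll a short game-history window
    replay_buffer.append((state, p, next_state, r))  # record [s, p, s', r]
    state = next_state
    if len(replay_buffer) == D:           # every D records: train, then clear
        updates += 1                      # (Eqs. (28)-(29) updates would go here)
        replay_buffer.clear()

print(updates)  # -> 3
```

With 60 games and D = 20, the actor and critic would be trained three times, and the buffer is empty after each update, matching Lines 18-20 of the algorithm as described above.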
[Fig. 3: (a) Pricing strategy of TI 1. (b) MU sensing strategies to TI 1. (c) Pricing strategy of TI 2. (d) MU sensing strategies to TI 2. (e) Payoffs of TIs. (f) Payoffs of MUs. x-axis: Episode (0-200).]
TABLE 2
Payoff values for the five approaches

MU/TI id          1      2      3      4      5      6
DDIM     ψn     8.7   11.3   10.7    9.2    8.2   12.3
         φm    46.8   58.3   46.9   39.4   53.5   70
         ζ = ∑n ψn + ∑m φm = 375.3
Greedy   ψn     8     10.8    9.6    8.4    7     11.2
         φm    32     48     35.8   28.2   39.2  107.5
         ζ = ∑n ψn + ∑m φm = 345.7
PCM      ψn     3.1    5.4    4.9    5.9    2.6    3.0
         φm    80.6   89.1    9.0    0      0      0
         ζ = ∑n ψn + ∑m φm = 203.6
PCMDQ    ψn     3.8    7.6    8.8    6.4    3.3    3.2
         φm    81.7   90.2    9.9    0      0      0
         ζ = ∑n ψn + ∑m φm = 214.9
Random   ψn     8.6   11      9.9    8.6    7.8   11.8
         φm    20.8   35.2   23     18     28     40
         ζ = ∑n ψn + ∑m φm = 222.7

This competition has been formulated as a non-cooperative game in this work, whose solution is a Nash equilibrium, i.e., all the sensing time is optimally allocated, which makes DDIM outperform PCM and PCMDQ. In Greedy, the TIs can only see their instantaneous rewards rather than the long-term ones of the future; and in Random, the TIs always choose their pricing strategies randomly, which leads them to pick worse pricing strategies and thus obtain lower payoffs.

Next, we show the overall performance comparison when varying the numbers of TIs and MUs, in Fig. 4 and Fig. 5. In this set of simulations, the MUs' maximum sensing time is randomly generated from the range (2, 3). In the simulation of Fig. 4, we set the number of MUs to 40, and in the simulation of Fig. 5, we set the number of TIs to 20. Fig. 4(a) shows the overall payoff value for all MUs and TIs, where we see that DDIM achieves clearer gains as more TIs are assumed. For example, ζ = 1750 when M = 20 for DDIM, but only 1250 for Greedy with the same number of TIs. To understand the reasons behind this, we examine the respective payoffs of the TIs and MUs, as shown in Fig. 4(b) and Fig. 4(c). We observe from Fig. 4(b) that TI payoffs decrease as the number of TIs increases. That is, although the sensing time supply remains unchanged, more TIs lead to more competition among the TIs. Therefore, the TIs need to increase their prices to recruit sensing time from the MUs, resulting in a decline in their payoffs, as shown in Fig. 4(b), and a rise in the MUs' payoffs, as shown in Fig. 4(c). We also see that the payoffs of the TIs in PCM and PCMDQ drop more quickly than in DDIM, since the degree of competition among the TIs does not change in the PCM and PCMDQ approaches. In addition, more TIs do not influence how the MUs behave in the PCM and PCMDQ schemes. This is because, in those approaches, the MUs only serve the early-arriving TIs; once the MUs' sensing time is exhausted, the subsequent TIs have no influence on them and the payoffs of the MUs remain unchanged. For example, as shown in Fig. 4(c), when M = 5, the sensing time of all the MUs in PCM has been exhausted, so the subsequently arriving TIs cannot recruit any MU and the payoffs of the MUs will not change.
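As a quick sanity check, each total ζ in Table 2 is the sum of the listed per-participant payoffs; for the DDIM row:

```python
# DDIM row of Table 2: MU payoffs psi_n and TI payoffs phi_m for n, m = 1..6
psi = [8.7, 11.3, 10.7, 9.2, 8.2, 12.3]
phi = [46.8, 58.3, 46.9, 39.4, 53.5, 70.0]

zeta = sum(psi) + sum(phi)  # zeta = sum_n psi_n + sum_m phi_m
print(round(zeta, 1))       # -> 375.3
```

The same computation reproduces the totals of the other four rows.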
[Fig. 4: (a) Total payoffs of TIs and MUs when varying no. of TIs. (b) Payoffs of TIs when varying no. of TIs. (c) Payoffs of MUs when varying no. of TIs. x-axis: Number of TIs (5-25); curves: DDIM, Greedy, PCM, PCMDQ, Random.]

[Fig. 5: (a) Total payoffs of TIs and MUs when varying no. of MUs. (b) Payoffs of TIs when varying no. of MUs. (c) Payoffs of MUs when varying no. of MUs. x-axis: Number of MUs (5-40); curves: DDIM, Greedy, PCM, PCMDQ, Random.]
Then, we show the trend of the total payoffs of all MUs and TIs when varying the number of MUs, in Fig. 5(a). We see that DDIM achieves obviously better results than all the other four approaches. For example, ζ = 1450 when N = 30 for DDIM, but only 495 for PCM and 1150 for Greedy with the same number of MUs. To understand the reasons behind this, we examine the respective payoffs of the TIs and MUs, as shown in Fig. 5(b) and Fig. 5(c). From Fig. 5(b) and Fig. 5(c), we can see that TI payoffs increase with more MUs, while MU payoffs decrease. This is because the TIs have more options to buy sensing service from more MUs.
[Figure: (a) Total payoffs of TIs and MUs when varying κmax. (b) Payoffs of TIs when varying κmax. (c) Payoffs of MUs when varying κmax. x-axis: κmax (1-5); curves: DDIM, Greedy, PCM, PCMDQ.]

Fig. 6 shows the overall performance comparison when [...] N = 6, respectively. We simulate the sensing quality ω^n_m of MU n for TI m as randomly distributed in (ωmax − 0.2, ωmax), where ωmax was varied from 0.2 to 1 with an increment of 0.2. We see that, with the increase of ωmax, all five approaches follow the same trend. This is because a larger ω^n_m implies that the MUs can provide sensing data of higher quality, thus making the TIs obtain better sensing performance. Meanwhile, DDIM consistently obtains the best performance compared with all the baselines. Because with the increase of ωmax, although the TIs in all the five [...] the TIs will obtain more utilities given the same sensing time of the MUs. In Fig. 8(c), the average payoff of all the MUs increases with a larger value of µm. This is because, with a larger µm, the TIs obtain more utilities given the same amount of sensing time of the MUs; the TIs will then increase their prices for the MUs to get more sensing time, and thus the MUs will obtain higher payoffs.

Fig. 9 shows the change of the TIs' payoffs when varying µ1. In this simulation, we fix the MUs' maximum sensing time κn as randomly generated from the range (2, 3). We set M = 3, N = 6, and µ2 = µ3 = 50. We simulate that µ1 was varied from 20 to 100 with an increment of 20. From Fig. 9, we can observe that the payoff of TI 1 increases with µ1, while the payoffs of TIs 2 and 3 slowly decrease. This is because, with the increase of µ1, TI 1 will obtain more utilities given the same amount of sensing time of the MUs, thus paying higher prices to obtain more sensing time from the MUs.

[Fig. 10. Payoffs of TIs when varying e^m_n. x-axis: e^m_n (0.1-0.5); curves: TI 1, TI 2, TI 3.]

Fig. 10 shows the change of the TIs' average payoffs when varying e^m_n. In this simulation, MU n leaves TI m's PoI in each time slot with a probability e^m_n. We simulate that e^m_n was varied from 0.1 to 0.5 with an increment of 0.1.
[Fig. 8: (a) Total payoffs of TIs and MUs when varying µm. (b) Payoffs of TIs when varying µm. (c) Payoffs of MUs when varying µm. x-axis: µm (30-70); curves: DDIM, Greedy, PCM.]
From Fig. 10, we can observe that the average payoffs of the TIs decrease with e^m_n. This is because, with the increase of e^m_n, more MUs leave the PoIs of the TIs, making the sensing data worthless to the TIs. DDIM can learn the MUs' mobility model and can then make the best decision for each TI, while the other approaches cannot. From Fig. 10, we can see that DDIM obtains the best performance. For example, the average payoff of the TIs is 35 when e^m_n = 0.5 for DDIM, but only 11 for PCM and 30 for Greedy with the same e^m_n.

8 DISCUSSIONS

8.1 Extension to General Incentive Mechanism Design of the Multi-Leader Multi-Follower MCS Case
In this paper, the incentive mechanism of multi-leader multi-follower MCS is modeled as a two-stage Stackelberg game. In this game, the main challenge is how to achieve the SE for each participant (TIs and MUs) in an unfamiliar environment. From Algorithm 1, we know that the MUs can obtain the optimal sensing time allocation strategies directly from the prices published by the TIs. Therefore, the remaining challenge becomes how to achieve the optimal pricing strategies for the TIs. To address this challenge, we design Algorithm 2, in which we build DDIM on a DRL approach. With DDIM, each TI can directly decide its pricing strategy from the previous game records without knowing any prior information about the other TIs and MUs. Therefore, the incentive mechanism design by the DRL approach in this paper can be generalized to any multi-leader multi-follower MCS case that can be modeled as a two-stage Stackelberg game.

8.2 Extension to a Privacy-Preserving Model and the Security of MUs
In the incentive mechanism designed in this paper, when the TIs decide their optimal pricing strategies (Eqn. (22)), they need to know the exact utility functions of the other participants. Since the parameters a^n_m, b^n_m of each MU n are private information [12, 32, 40], the TIs may not have sufficient information to solve Eqn. (22). In our future work, we are going to use a privacy-preserving model to protect the security of the MUs, such as encrypting the location information of the MUs through a privacy-preserving model to avoid privacy leakage.

8.3 Time Complexity During Actual Execution
During the execution phase, given the state information as input, each TI m utilizes its own actor network παm to output an action; thus, the computational complexity depends merely on an actor network. According to [41], the time complexity of a fully-connected deep neural network is computed as the number of multiplications: O(∑_{f=1}^{F} ϵf · ϵf−1), where ϵf is the number of neural units in fully-connected layer f. In this work, we design the actor network παm with two fully-connected hidden layers. Therefore, executing the actor network will not impose a computational burden on the TIs.

9 CONCLUSION
In this paper, we designed the pricing and sensing time allocation strategies for MCS systems with multiple TIs and multiple MUs to incentivize MUs to participate. We studied this problem from a free-market perspective with the goal of achieving a SE. We first formulated the incentive mechanism as a Stackelberg game, which can reveal the characteristics of the supply-demand pattern of MCS. We then analyzed and proved that a SE exists; however, its closed-form expression cannot be obtained. Next, we proposed a DRL-based solution called DDIM to solve it. The TIs can learn the optimal pricing strategies directly from game experience without knowing the private information of the MUs. Extensive simulation results showed that DDIM outperforms the state-of-the-art and baseline approaches.

REFERENCES
[1] A. Draghici and M. V. Steen, "A survey of techniques for automatically sensing the behavior of a crowd," ACM Comput. Surv., vol. 51, no. 1, p. 21, 2018.
[2] B. Guo, Z. Wang, Z. Yu, et al., "Mobile crowd sensing and computing: The review of an emerging human-powered sensing paradigm," ACM Comput. Surv., vol. 48, no. 1, pp. 7:1–7:31, 2015.
[3] "Patientslikeme," https://www.patientslikeme.com/.
[4] P. Zhou, Y. Zheng, and M. Li, "How long to wait?: predicting bus arrival time with mobile phone based participatory sensing," in ACM MobiSys, 2012, pp. 379–392.
[5] Y. Cheng, X. Li, Z. Li, et al., "Aircloud: a cloud-based air-quality monitoring system for everyone," in ACM SenSys, 2014, pp. 251–265.
[6] F. Rebecchi, M. D. De Amorim, V. Conan, A. Passarella, R. Bruno, and M. Conti, "Data offloading techniques in cellular networks: A survey," IEEE Commun. Surv. Tut., vol. 17, no. 2, pp. 580–603, 2015.
[7] M. Xiao, J. Wu, L. Huang, et al., "Online task assignment for crowdsensing in predictable mobile social networks," IEEE Trans. Mob. Comput., vol. 16, no. 8, pp. 2306–2320, 2017.
[8] E. Wang, Y. Yang, J. Wu, W. Liu, and X. Wang, "An efficient prediction-based user recruitment for mobile crowdsensing," IEEE Trans. Mob. Comput., vol. 17, no. 1, pp. 16–28, 2018.
[9] D. Yang, G. Xue, X. Fang, et al., "Crowdsourcing to smartphones: incentive mechanism design for mobile phone sensing," in ACM MobiCom, 2012, pp. 173–184.
[10] D. Yang, G. Xue, X. Fang, and J. Tang, "Incentive mechanisms for crowdsensing: Crowdsourcing with smartphones," IEEE/ACM Trans. Net., vol. 24, no. 3, pp. 1732–1744, 2016.
[11] S. Maharjan, Y. Zhang, and S. Gjessing, "Optimal incentive design for cloud-enabled multimedia crowdsourcing," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2470–2481, 2016.
[12] L. Xiao, Y. Li, G. Han, et al., "A secure mobile crowdsensing game with deep reinforcement learning," IEEE Trans. Inf. Foren. Sec., vol. 13, no. 1, pp. 35–47, 2018.
[13] K. Han, C. Zhang, J. Luo, et al., "Truthful scheduling mechanisms for powering mobile crowdsensing," IEEE Trans. Comput., vol. 65, no. 1, pp. 294–307, 2016.
[14] D. Zhao, X.-Y. Li, and H. Ma, "How to crowdsource tasks truthfully without sacrificing utility: Online incentive mechanisms with budget constraint," in IEEE INFOCOM, vol. 14, 2014, pp. 1213–1221.
[15] L. Gao, F. Hou, and J. Huang, "Providing long-term participation incentive in participatory sensing," in IEEE INFOCOM, 2015, pp. 2803–2811.
[16] J. Wang, J. Tang, D. Yang, et al., "Quality-aware and fine-grained incentive mechanisms for mobile crowdsensing," in IEEE ICDCS, 2016, pp. 354–363.
[17] H. Jin, L. Su, D. Chen, et al., "Quality of information aware incentive mechanisms for mobile crowd sensing systems," in ACM MobiHoc, 2015, pp. 167–176.
[18] H. Jin, L. Su, D. Chen, H. Guo, K. Nahrstedt, and J. Xu, "Thanos: Incentive mechanism with quality awareness for mobile crowd sensing," IEEE Trans. Mob. Comput., 2018.
[19] H. Jin, L. Su, H. Xiao, and K. Nahrstedt, "Incentive mechanism for privacy-aware data aggregation in mobile crowd sensing systems," IEEE/ACM Trans. Net., vol. 26, no. 5, pp. 2019–2032, 2018.
[20] X. Gan, Y. Li, W. Wang, L. Fu, and X. Wang, "Social crowdsourcing to friends: An incentive mechanism for multi-resource sharing," IEEE JSAC, vol. 35, no. 3, pp. 795–808, 2017.
[21] L. Duan, T. Kubo, K. Sugiyama, J. Huang, T. Hasegawa, and J. Walrand, "Incentive mechanisms for smartphone collaboration in data acquisition and distributed computing," in IEEE INFOCOM, 2012, pp. 1701–1709.
[22] M. H. Cheung, F. Hou, and J. Huang, "Delay-sensitive mobile crowdsensing: Algorithm design and economics," IEEE Trans. Mob. Comput., 2018.
[23] Y. Chen, B. Li, and Q. Zhang, "Incentivizing crowdsourcing systems with network effects," in IEEE INFOCOM, 2016, pp. 1–9.
[24] J. Nie, J. Luo, Z. Xiong, D. Niyato, et al., "A stackelberg game approach towards socially-aware incentive mechanisms for mobile crowdsensing," arXiv preprint arXiv:1807.08412v1, 2018.
[25] J. Nie, Z. Xiong, D. Niyato, et al., "A socially-aware incentive mechanisms for mobile crowdsensing service market," arXiv preprint arXiv:1807.08412v1, 2018.
[26] L. Xiao, T. Chen, C. Xie, H. Dai, and H. V. Poor, "Mobile crowdsensing games in vehicular networks," IEEE Trans. Veh. Technol., vol. 67, no. 2, pp. 1535–1545, 2018.
[27] J. Peng, Y. Zhu, W. Shu, and M. Wu, "When data contributors meet multiple crowdsourcers: Bilateral competition in mobile crowdsourcing," Comput. Netw., vol. 95, pp. 1–14, 2016.
[28] A. Chakeri and L. G. Jaimes, "An incentive mechanism for crowdsensing markets with multiple crowdsourcers," IEEE Internet of Things J., vol. 5, no. 2, pp. 708–715, 2018.
[29] A. Sinha, P. Malo, A. Frantsev, and K. Deb, "Finding optimal strategies in a multi-period multi-leader–follower stackelberg game using an evolutionary algorithm," Computers & Operations Research, vol. 41, pp. 374–385, 2014.
[30] S. He, D.-H. Shin, J. Zhang, J. Chen, and P. Lin, "An exchange market approach to mobile crowdsensing: pricing, task allocation, and walrasian equilibrium," IEEE JSAC, vol. 35, no. 4, pp. 921–934, 2017.
[31] Y. Zhan, Y. Xia, and J. Zhang, "Quality-aware incentive mechanism based on payoff maximization for mobile crowdsensing," Ad Hoc Netw., vol. 72, pp. 44–55, 2018.
[32] X. Duan, C. Zhao, S. He, P. Cheng, and J. Zhang, "Distributed algorithms to compute walrasian equilibrium in mobile crowdsensing," IEEE Trans. Ind. Electron., vol. 64, no. 5, pp. 4048–4057, 2017.
[33] M. H. Cheung, F. Hou, and J. Huang, "Make a difference: Diversity-driven social mobile crowdsensing," in IEEE INFOCOM, 2017, pp. 1–9.
[34] T. Liu and Y. Zhu, "Social welfare maximization in participatory smartphone sensing," Comput. Netw., vol. 73, pp. 195–209, 2014.
[35] D. Fudenberg and J. Tirole, Game Theory. Massachusetts: Cambridge Press, 1991.
[36] R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction. MIT Press, 1998.
[37] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in NIPS, 2000, pp. 1008–1014.
[38] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in NIPS, 2000, pp.
Yufeng Zhan received his Ph.D. degree from the School of Automation, Beijing Institute of Technology, Beijing, China, in 2018. His research interests include mobile computing, machine learning, and networked control systems.

Jiang Zhang is currently a final-year undergraduate student at the School of Computer Science and Technology, Beijing Institute of Technology, China. His research interests include machine learning and control science and technology.