Discovery of Ranking Fraud For Mobile Apps
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2014.2320733, IEEE Transactions on Knowledge and Data Engineering.
Abstract—Ranking fraud in the mobile App market refers to fraudulent or deceptive activities intended to bump Apps up the popularity list. Indeed, it has become increasingly common for App developers to use shady means, such as inflating their Apps’ sales or posting phony App ratings, to commit ranking fraud. While the importance of preventing ranking fraud has been widely recognized, there is limited understanding and research in this area. To this end, in this paper we provide a holistic view of ranking fraud and propose a ranking fraud detection system for mobile Apps. Specifically, we first propose to accurately locate ranking fraud by mining the active periods, namely leading sessions, of mobile Apps. Such leading sessions can be leveraged for detecting the local anomaly, instead of the global anomaly, of App rankings. Furthermore, we investigate three types of evidences, i.e., ranking based evidences, rating based evidences and review based evidences, by modeling Apps’ ranking, rating and review behaviors through statistical hypothesis tests. In addition, we propose an optimization based aggregation method to integrate all the evidences for fraud detection. Finally, we evaluate the proposed system with real-world App data collected from the iOS App Store over a long time period. In the experiments, we validate the effectiveness of the proposed system, and show the scalability of the detection algorithm as well as some regularity of ranking fraud activities.
Index Terms—Mobile Apps, Ranking Fraud Detection, Evidence Aggregation, Historical Ranking Records, Rating and Review.
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. The framework of our ranking fraud detection system for mobile Apps. [Flowchart: INPUT (mobile Apps’ historical records) → mining leading sessions → ranking based, rating based and review based evidences → evidence aggregation → OUTPUT.]

and leading sessions in detail later. In other words, ranking fraud usually happens in these leading sessions. Therefore, detecting ranking fraud of mobile Apps is actually to detect ranking fraud within the leading sessions of mobile Apps. Specifically, we first propose a simple yet effective algorithm to identify the leading sessions of each App based on its historical ranking records. Then, through analysis of Apps’ ranking behaviors, we find that fraudulent Apps often have different ranking patterns in each leading session compared with normal Apps. Thus, we characterize some fraud evidences from Apps’ historical ranking records, and develop three functions to extract such ranking based fraud evidences. Nonetheless, ranking based evidences can be affected by App developers’ reputation and some legitimate marketing campaigns, such as “limited-time discounts”. As a result, it is not sufficient to use only ranking based evidences. Therefore, we further propose two types of fraud evidences based on Apps’ rating and review history, which reflect some anomaly patterns in Apps’ historical rating and review records. In addition, we develop an unsupervised evidence-aggregation method to integrate these three types of evidences for evaluating the credibility of leading sessions of mobile Apps. Figure 1 shows the framework of our ranking fraud detection system for mobile Apps.

It is worth noting that all the evidences are extracted by modeling Apps’ ranking, rating and review behaviors through statistical hypothesis tests. The proposed framework is scalable and can be extended with other domain-generated evidences for ranking fraud detection. Finally, we evaluate the proposed system with real-world App data collected from Apple’s App Store over a long time period, i.e., more than two years. Experimental results show the effectiveness of the proposed system, the scalability of the detection algorithm, as well as some regularity of ranking fraud activities.

Overview. The remainder of this paper is organized as follows. In Section 2, we introduce some preliminaries and show how to mine leading sessions for mobile Apps. Section 3 presents how to extract ranking, rating and review based evidences and combine them for ranking fraud detection. In Section 4 we provide some further discussion of the proposed approach. In Section 5, we report the experimental results on two long-term real-world data sets. Section 6 provides a brief review of related works. Finally, in Section 7, we conclude the paper and propose some future research directions.

2 IDENTIFYING LEADING SESSIONS FOR MOBILE APPS

In this section, we first introduce some preliminaries, and then show how to mine leading sessions for mobile Apps from their historical ranking records.

2.1 Preliminaries

The App leaderboard demonstrates the top $K$ popular Apps with respect to different categories, such as “Top Free Apps” and “Top Paid Apps”. Moreover, the leaderboard is usually updated periodically (e.g., daily). Therefore, each mobile App $a$ has many historical ranking records, which can be denoted as a time series $R^a = \{r^a_1, \cdots, r^a_i, \cdots, r^a_n\}$, where $r^a_i \in \{1, \ldots, K, +\infty\}$ is the ranking of $a$ at time stamp $t_i$; $+\infty$ means $a$ is not ranked in the top $K$ list; and $n$ denotes the number of all ranking records. Note that the smaller the value of $r^a_i$, the higher the ranking position the App obtains.

By analyzing the historical ranking records of mobile Apps, we observe that Apps are not always ranked high in the leaderboard, but only in some leading events. For example, Figure 2(a) shows an example of leading events of a mobile App. Formally, we define a leading event as follows.

Definition 1 (Leading Event): Given a ranking threshold $K^* \in [1, K]$, a leading event $e$ of App $a$ contains a time range $T_e = [t^e_{start}, t^e_{end}]$ and corresponding rankings of $a$, which satisfies $r^a_{start} \le K^* < r^a_{start-1}$ and $r^a_{end} \le K^* < r^a_{end+1}$. Moreover, $\forall t_k \in (t^e_{start}, t^e_{end})$, we have $r^a_k \le K^*$.

Note that we apply a ranking threshold $K^*$, which is usually smaller than $K$, because $K$ may be very big (e.g., more than 1000), and the ranking records beyond $K^*$ (e.g., 300) are not very useful for detecting ranking manipulations.

Furthermore, we also find that some Apps have several adjacent leading events which are close to each other and form a leading session. For example, Figure 2(b) shows an example of adjacent leading events of a given mobile App, which form two leading sessions. Particularly, a leading event which has no other nearby neighbors can also be treated as a special leading session. The formal definition of a leading session is as follows.

Definition 2 (Leading Session): A leading session $s$ of App $a$ contains a time range $T_s = [t^s_{start}, t^s_{end}]$ and $n$ adjacent leading events $\{e_1, \ldots, e_n\}$, which satisfies $t^s_{start} = t^{e_1}_{start}$, $t^s_{end} = t^{e_n}_{end}$, and there is no other leading session $s^*$ that makes $T_s \subseteq T_{s^*}$. Meanwhile, $\forall i \in [1, n)$, we have $(t^{e_{i+1}}_{start} - t^{e_i}_{end}) < \phi$, where $\phi$ is a predefined time threshold for merging leading events.

Intuitively, the leading sessions of a mobile App represent its periods of popularity, so the ranking manipulation will only take place in these leading sessions.
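The event-and-session mining of Definitions 1 and 2 can be sketched in Python as follows. This is an illustrative sketch, not the paper's actual algorithm: the function name `mine_leading_sessions` and the representation of events as inclusive index pairs are our own assumptions.

```python
from typing import List, Tuple

INF = float("inf")  # rank +inf means "not in the top-K list"

def mine_leading_sessions(ranks: List[float], k_star: int, phi: int) -> List[Tuple[int, int]]:
    """Return leading sessions as inclusive (start, end) index pairs.

    A leading event is a maximal run of time stamps whose rank is <= k_star
    (Definition 1); adjacent events whose gap is smaller than phi time stamps
    are merged into one leading session (Definition 2).
    """
    # Step 1: find leading events as maximal runs with rank <= k_star.
    events = []
    start = None
    for i, r in enumerate(ranks):
        if r <= k_star and start is None:
            start = i
        elif r > k_star and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(ranks) - 1))

    # Step 2: merge events separated by less than phi time stamps.
    sessions: List[Tuple[int, int]] = []
    for s, e in events:
        if sessions and s - sessions[-1][1] < phi:
            sessions[-1] = (sessions[-1][0], e)  # extend the current session
        else:
            sessions.append((s, e))
    return sessions
```

For example, with $K^* = 25$ and $\phi = 3$, the ranking series `[INF, 10, 8, INF, 20, INF, INF, INF, 15, INF]` yields two leading sessions: one merging the events at indices 1–2 and 4, and one containing only the event at index 8.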
historical leading sessions. Then, we can compute the evidence by

$$\Psi_4(s) = 1 - P\big(N(\mu_R, \sigma_R) \ge \Delta R_s\big). \quad (11)$$

EVIDENCE 5. In the App rating records, each rating can be categorized into $|L|$ discrete rating levels, e.g., 1 to 5, which represent the user preferences for an App. The rating distribution with respect to rating level $l_i$ in a normal App $a$'s leading session $s$, $p(l_i|R_{s,a})$, should be consistent with the distribution in $a$'s historical rating records, $p(l_i|R_a)$, and vice versa. Specifically, we can compute the distribution by $p(l_i|R_{s,a}) = \frac{N^s_{l_i}}{N^s_{(.)}}$, where $N^s_{l_i}$ is the number of ratings in $s$ at level $l_i$, and $N^s_{(.)}$ is the total number of ratings in $s$. Meanwhile, we can compute $p(l_i|R_a)$ in a similar way. Then, we use the Cosine similarity between $p(l_i|R_{s,a})$ and $p(l_i|R_a)$ to estimate the difference as follows.

$$D(s) = \frac{\sum_{i=1}^{|L|} p(l_i|R_{s,a}) \times p(l_i|R_a)}{\sqrt{\sum_{i=1}^{|L|} p^2(l_i|R_{s,a})} \times \sqrt{\sum_{i=1}^{|L|} p^2(l_i|R_a)}}. \quad (12)$$

downloads, and thus propel the App's ranking position in the leaderboard. Although some previous works on review spam detection have been reported in recent years [14], [19], [21], the problems of detecting the local anomaly of reviews in leading sessions and capturing them as evidences for ranking fraud detection are still under-explored. To this end, here we propose two fraud evidences based on Apps' review behaviors in leading sessions for detecting ranking fraud.

EVIDENCE 6. Indeed, most review manipulations are implemented by bot farms due to the high cost of human resources. Therefore, review spammers often post multiple duplicate or near-duplicate reviews on the same App to inflate downloads [19], [21]. In contrast, normal Apps usually have diversified reviews, since users have different personal perceptions and usage experiences. Based on the above observations, here we define a fraud signature $Sim(s)$, which denotes the average mutual similarity between the reviews within leading session $s$. Specifically, this fraud signature can
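Equation (12) can be computed directly from the raw rating lists. Below is a minimal Python sketch; the function names `rating_distribution` and `cosine_similarity` are our own, not from the paper.

```python
from collections import Counter
from math import sqrt

def rating_distribution(ratings, levels=(1, 2, 3, 4, 5)):
    """p(l_i | R): the fraction of ratings at each discrete level l_i."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts[level] / total for level in levels]

def cosine_similarity(p, q):
    """D(s) in Equation (12): cosine similarity of two rating distributions."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = sqrt(sum(pi * pi for pi in p)) * sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm > 0 else 0.0
```

A session whose rating distribution matches the App's history gives $D(s)$ close to 1; a heavily manipulated session (e.g., a burst of 5-star ratings against a mixed history) gives a noticeably smaller value.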
EVIDENCE 7. From real-world observations, we find that each review $c$ is always associated with a specific latent topic $z$. For example, some reviews may be related to the latent topic “worth to play” while some may be related to the latent topic “very boring”. Meanwhile, since different users have different personal preferences for mobile Apps, each App $a$ may have a different topic distribution in its historical review records. Intuitively, the topic distribution of reviews in a normal leading session $s$ of App $a$, i.e., $p(z|s)$, should be consistent with the topic distribution in all historical review records of $a$, i.e., $p(z|a)$. This is because the review topics are based on the users' personal usage experiences, not the popularity of mobile Apps. In contrast, if the reviews of $s$ have been manipulated, the two topic distributions will be markedly different. For example, the leading session may contain more positive topics, such as “worth to play” and “popular”.

In this paper we propose to leverage topic modeling to extract the latent topics of reviews. Specifically, here we adopt the widely used Latent Dirichlet Allocation (LDA) model [9] for learning latent semantic topics. To be more specific, the historical reviews of a mobile App $a$, i.e., $C_a$, are assumed to be generated as follows. First, before generating $C_a$, $K$ prior conditional distributions of words given latent topics $\{\phi_z\}$ are generated from a prior Dirichlet distribution $\beta$. Second, a prior latent topic distribution $\theta_a$ is generated from a prior Dirichlet distribution $\alpha$ for each mobile App $a$. Then, to generate the $j$-th word in $C_a$, denoted as $w_{a,j}$, the model first generates a latent topic $z$ from $\theta_a$ and then generates $w_{a,j}$ from $\phi_z$. The training process of the LDA model is to learn proper latent variables $\theta = \{P(z|C_a)\}$ and $\phi = \{P(w|z)\}$ maximizing the posterior distribution of review observations, i.e., $P(C_a|\alpha, \beta, \theta, \phi)$. In this paper, we use a Markov chain Monte Carlo method, namely Gibbs sampling [12], for training the LDA model. If we denote the reviews in leading session $s$ of $a$ as $C_{s,a}$, we can use the KL-divergence to estimate the difference of topic distributions between $C_a$ and $C_{s,a}$:

$$D_{KL}(s\|a) = \sum_k P(z_k|C_{s,a}) \ln \frac{P(z_k|C_{s,a})}{P(z_k|C_a)}, \quad (16)$$

where $P(z_k|C_a)$ and $P(z_k|C_{s,a}) \propto P(z_k) \prod_{w \in C_{s,a}} P(w|z_k)$ can be obtained through the LDA training process. A higher value of $D_{KL}(s\|a)$ indicates a greater difference of topic distributions between $C_a$ and $C_{s,a}$. Therefore, if a leading session has a significantly higher value of $D_{KL}(s\|a)$ compared with other leading sessions of Apps in the leaderboard, it has a high probability of containing ranking fraud. To capture this, we define statistical hypotheses to compute the significance of $D_{KL}(s\|a)$ for each leading session as follows.

◃ HYPOTHESIS 0: The signature $D_{KL}(s\|a)$ of leading session $s$ is not useful for detecting ranking fraud.
◃ HYPOTHESIS 1: The signature $D_{KL}(s\|a)$ of leading session $s$ is significantly higher than expectation.

Here, we also use the Gaussian approximation to compute the p-value for the above hypotheses. Specifically, we assume $D_{KL}(s\|a)$ follows the Gaussian distribution $D_{KL}(s\|a) \sim N(\mu_{D_L}, \sigma_{D_L})$, where $\mu_{D_L}$ and $\sigma_{D_L}$ can be learnt by the MLE method from the observations of $D_{KL}(s\|a)$ in all Apps' historical leading sessions. Then, we can compute the evidence by

$$\Psi_7(s) = 1 - P\big(N(\mu_{D_L}, \sigma_{D_L}) \ge D_{KL}(s\|a)\big). \quad (17)$$

The values of the two evidences $\Psi_6(s)$ and $\Psi_7(s)$ are in the range $[0, 1]$. Meanwhile, the higher the evidence value a leading session has, the more likely this session contains ranking fraud activities.

3.4 Evidence Aggregation

After extracting three types of fraud evidences, the next challenge is how to combine them for ranking fraud detection. Indeed, there are many ranking and evidence aggregation methods in the literature, such as permutation based models [17], [18], score based models [11], [26] and Dempster-Shafer rules [10], [23]. However, some of these methods focus on learning a global ranking for all candidates, which is not proper for detecting ranking fraud for new Apps. Other methods are based on supervised learning techniques, which depend on labeled training data and are hard to exploit here. Instead, we propose an unsupervised approach based on fraud similarity to combine these evidences.

Specifically, we define the final evidence score $\Psi^*(s)$ as a linear combination of all the existing evidences, as in Equation (18). Note that here we propose to use the linear combination because it has been proven to be effective and is widely used in relevant domains, such as ranking aggregation [16], [20].

$$\Psi^*(s) = \sum_{i=1}^{N_\Psi} w_i \times \Psi_i(s), \quad \text{s.t.} \; \sum_{i=1}^{N_\Psi} w_i = 1, \quad (18)$$

where $N_\Psi = 7$ is the number of evidences, and weight $w_i \in [0, 1]$ is the aggregation parameter of evidence $\Psi_i(s)$. Thus, the problem of evidence aggregation becomes how to learn the proper parameters $\{w_i\}$ from the training leading sessions.

We first propose an intuitive assumption as Principle 1 for our evidence aggregation approach. Specifically, we assume that effective evidences should have similar evidence scores for each leading session, while poor evidences will generate different scores from others. In other words, evidences that tend to be consistent with the plurality of evidences will be given higher weights, and evidences which tend to disagree will be given smaller weights. To this end, for each evidence score $\Psi_i(s)$, we can measure its consistency using the variance-like measure

$$\sigma_i(s) = \big(\Psi_i(s) - \overline{\Psi}(s)\big)^2, \quad (19)$$

where $\overline{\Psi}(s)$ is the average evidence score of leading session $s$ obtained from all $N_\Psi$ evidences. If $\sigma_i(s)$ is small, the corresponding $\Psi_i(s)$ should be given a bigger
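Given the per-topic probabilities from LDA, Equations (16) and (17) reduce to a KL-divergence plus a Gaussian upper-tail probability. The following is a minimal Python sketch under that reading; the function names are our own, and the Gaussian tail is computed with the standard error function rather than any library the paper may have used.

```python
from math import erf, log, sqrt

def kl_divergence(p_session, p_history):
    """D_KL(s || a) in Equation (16) over the two latent-topic distributions."""
    return sum(ps * log(ps / pa) for ps, pa in zip(p_session, p_history) if ps > 0)

def gaussian_upper_tail(x, mu, sigma):
    """P(N(mu, sigma) >= x), via the Gaussian CDF expressed with erf."""
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

def psi7(dkl, mu, sigma):
    """Equation (17): evidence score from the significance of D_KL(s||a).

    mu and sigma are the MLE fits over D_KL observed in all Apps'
    historical leading sessions.
    """
    return 1 - gaussian_upper_tail(dkl, mu, sigma)
```

A session whose topic distribution matches the App's history has $D_{KL} \approx 0$ and a low evidence score; a session far above the fitted mean gets $\Psi_7(s)$ close to 1.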
weight and vice versa. Therefore, given an App set $A = \{a_i\}$ with their leading sessions $\{s_j\}$, we can define the evidence aggregation problem as an optimization problem that minimizes the weighted variances of the evidences over all leading sessions; that is,

$$\arg\min_w \sum_{a \in A} \sum_{s \in a} \sum_{i=1}^{N_\Psi} w_i \cdot \sigma_i(s), \quad (20)$$

$$\text{s.t.} \quad \sum_{i=1}^{N_\Psi} w_i = 1; \quad \forall w_i \ge 0. \quad (21)$$

In this paper, we exploit the gradient based approach with exponentiated updating [15], [16] to solve this problem. To be specific, we first assign $w_i = \frac{1}{N_\Psi}$ as the initial value; then for each $s$, we can compute the gradient by

$$\nabla_i = \frac{\partial\, w_i \cdot \sigma_i(s)}{\partial w_i} = \sigma_i(s). \quad (22)$$

Thus, we can update the weight $w_i$ by

$$w_i = \frac{w_i^* \times \exp(-\lambda \nabla_i)}{\sum_{j=1}^{N_\Psi} w_j^* \times \exp(-\lambda \nabla_j)}, \quad (23)$$

where $w_i^*$ is the last updated value of $w_i$, and $\lambda$ is the learning rate, which is empirically set to $\lambda = 10^{-2}$ in our experiments.

Finally, we can exploit Equation (18) to estimate the final evidence score of each leading session. Moreover, given a leading session $s$ with a predefined threshold $\tau$, we can determine that $s$ has ranking fraud if $\Psi^*(s) > \tau$.

However, sometimes using only evidence scores for evidence aggregation is not appropriate. This is because different evidences may have different score ranges for evaluating leading sessions. For example, some evidences may always generate higher scores for leading sessions than the average evidence score, although they can detect fraudulent leading sessions and rank them in accurate positions.

Therefore, here we propose another assumption as Principle 2 for our evidence aggregation approach. Specifically, we assume that effective evidences should rank leading sessions from a similar conditional distribution, while poor evidences will lead to a more uniformly random ranking distribution [16]. To this end, given a set of leading sessions, we first rank them by each evidence score and obtain $N_\Psi$ ranked lists. Let us denote $\pi_i(s)$ as the ranking of session $s$ returned by $\Psi_i(s)$; then we can calculate the average ranking for leading session $s$ by

$$\overline{\pi}(s) = \frac{1}{N_\Psi} \sum_{i=1}^{N_\Psi} \pi_i(s). \quad (24)$$

Then, for each evidence score $\Psi_i(s)$, we can measure its consistency using the variance-like measure

$$\sigma_i^*(s) = \big(\pi_i(s) - \overline{\pi}(s)\big)^2. \quad (25)$$

If $\sigma_i^*(s)$ is small, the corresponding $\Psi_i(s)$ should be given a bigger weight and vice versa. Then we can replace $\sigma_i(s)$ by $\sigma_i^*(s)$ in Equation (20), and exploit the similar gradient based approach introduced above for learning the weights of evidences.

4 DISCUSSION

Here, we provide some discussion of the proposed ranking fraud detection system for mobile Apps.

First, the download information is an important signature for detecting ranking fraud, since ranking manipulation uses so-called “bot farms” or “human water armies” to inflate App downloads and ratings in a very short time. However, the instant download information of each mobile App is often not available for analysis. In fact, Apple and Google do not provide accurate download information on any App. Furthermore, App developers themselves are also reluctant to release their download information for various reasons. Therefore, in this paper, we mainly focus on extracting evidences from Apps' historical ranking, rating and review records for ranking fraud detection. However, our approach is scalable for integrating other evidences if available, such as evidences based on download information and App developers' reputation.

Second, the proposed approach can detect ranking fraud that happened in Apps' historical leading sessions. However, sometimes we need to detect such ranking fraud from Apps' current ranking observations. Actually, given the current ranking $r^a_{now}$ of an App $a$, we can detect ranking fraud for it in two different cases. First, if $r^a_{now} > K^*$, where $K^*$ is the ranking threshold introduced in Definition 1, we believe $a$ is not involved in ranking fraud, since it is not in a leading event. Second, if $r^a_{now} \le K^*$, which means $a$ is in a new leading event $e$, we treat this case as a special case with $t^e_{end} = t^e_{now}$ and $\theta_2 = 0$. Therefore, such real-time ranking frauds can also be detected by the proposed approach.

Finally, after detecting ranking fraud for each leading session of a mobile App, the remaining problem is how to estimate the credibility of this App. Indeed, our approach discovers the local anomaly instead of the global anomaly of mobile Apps. Thus, we should take such local characteristics into consideration when estimating the credibility of Apps. To be specific, we define an App fraud score $F(a)$ for each App $a$ according to how many leading sessions of $a$ contain ranking fraud:

$$F(a) = \sum_{s \in a} [[\Psi^*(s) > \tau]] \times \Psi^*(s) \times \Delta t_s, \quad (26)$$

where $s \in a$ denotes that $s$ is a leading session of App $a$, and $\Psi^*(s)$ is the final evidence score of leading session $s$, which can be calculated by Equation (18). In particular, we define a signal function $[[x]]$ (i.e., $[[x]] = 1$ if $x = True$, and $0$ otherwise) and a fraud threshold $\tau$ to decide the top $k$ fraudulent leading sessions. Moreover, $\Delta t_s = (t^s_{end} - t^s_{start} + 1)$ is the time range of $s$, which indicates the duration of the ranking fraud. Intuitively, an App that contains more leading sessions with high fraud evidence scores and long time durations will have a higher App fraud score.
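The exponentiated-gradient updates of Equations (22) and (23) can be sketched as follows. This is an illustrative Python sketch, not the paper's code: the per-session update order and the iteration count `n_iter` are our own assumptions, since the text does not spell out the full training schedule.

```python
from math import exp

def learn_weights(evidence_scores, lam=1e-2, n_iter=50):
    """Exponentiated-gradient weight learning for evidence aggregation.

    evidence_scores: one list per leading session, each holding the N_psi
    evidence values for that session. Per Equations (19) and (22), the
    gradient for evidence i is its squared deviation from the per-session
    mean; Equation (23) then renormalizes the exponentially updated weights,
    down-weighting evidences that disagree with the plurality.
    """
    n = len(evidence_scores[0])
    w = [1.0 / n] * n  # initial uniform weights w_i = 1/N_psi
    for _ in range(n_iter):
        for scores in evidence_scores:
            mean = sum(scores) / n
            grad = [(s - mean) ** 2 for s in scores]        # Eq. (22): grad_i = sigma_i(s)
            unnorm = [wi * exp(-lam * g) for wi, g in zip(w, grad)]
            z = sum(unnorm)
            w = [u / z for u in unnorm]                     # Eq. (23)
    return w
```

Running this on sessions where two evidences agree and a third consistently deviates yields a smaller weight for the deviating evidence, while the weights stay non-negative and sum to one, as Equation (21) requires.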
Fig. 6. The distribution of the number of Apps w.r.t. different rankings. (a) Top Free 300 data set; (b) Top Paid 300 data set.
Fig. 7. The distribution of the number of Apps w.r.t. different numbers of ratings. (a) Top Free 300 data set; (b) Top Paid 300 data set.
Fig. 8. The distribution of the number of Apps w.r.t. different numbers of leading events. (a) Top Free 300 data set; (b) Top Paid 300 data set.
Fig. 9. The distribution of the number of Apps w.r.t. different numbers of leading sessions. (a) Top Free 300 data set; (b) Top Paid 300 data set.
Fig. 10. The distribution of the number of leading sessions w.r.t. different numbers of leading events. (a) Top Free 300 data set; (b) Top Paid 300 data set.

5.3.1 Baselines

The first baseline Ranking-RFD stands for Ranking evidence based Ranking Fraud Detection, which estimates ranking fraud for each leading session by using only the ranking based evidences (i.e., $\Psi_1$ to $\Psi_3$). These three evidences are integrated by our aggregation approach.

The second baseline Rating-RFD stands for Rating evidence based Ranking Fraud Detection, which estimates the ranking fraud for each leading session by using only the rating based evidences (i.e., $\Psi_4$ and $\Psi_5$). These two evidences are integrated by our aggregation approach.

The third baseline Review-RFD stands for Review evidence based Ranking Fraud Detection, which estimates the ranking fraud for each leading session by using only the review based evidences (i.e., $\Psi_6$ and $\Psi_7$). These two evidences are integrated by our aggregation approach.

Particularly, here we only use the rank based aggregation approach (i.e., Principle 2) for integrating evidences in the above baselines. This is because these baselines are mainly used for evaluating the effectiveness of different kinds of evidences, and our preliminary experiments validated that baselines with Principle 2 always outperform baselines with Principle 1.

The last baseline E-RFD stands for Evidence based Ranking Fraud Detection, which estimates the ranking fraud for each leading session by ranking, rating and review based evidences without evidence aggregation. Specifically, it ranks leading sessions by Equation (18), where each $w_i$ is set to $1/7$ equally. This baseline is used for evaluating the effectiveness of our ranking aggregation method.

Note that, according to Definition 3, we need to define some ranking ranges before extracting ranking based evidences for EA-RFD-1, EA-RFD-2, Rank-RFD and E-RFD. In our experiments, we segment the rankings into 5 different ranges, i.e., [1, 10], [11, 25], [26, 50], [51, 100], [101, 300], which are commonly used in App leaderboards. Furthermore, we use the LDA model to extract review topics as introduced in Section 3.3. Particularly, we first normalize each review by the Stop-Words Remover [6] and the Porter Stemmer [7]. Then, the number of latent topics $K_z$ is set to 20 according to the perplexity based estimation approach [8], [31]. The two parameters $\alpha$ and $\beta$ for training the LDA model are set to $50/K$ and $0.1$ according to [13].

Fig. 11. The screenshots of our fraud evaluation platform.

5.3.2 The Experimental Setup

To study the performance of ranking fraud detection by each approach, we set up the evaluation as follows.

First, for each approach, we selected 50 top ranked leading sessions (i.e., the most suspicious sessions), 50 middle ranked leading sessions (i.e., the most uncertain sessions), and 50 bottom ranked leading sessions (i.e., the most normal sessions) from each data set. Then, we merged all the selected sessions into a pool, which consists of 587 unique sessions from 281 unique Apps in the “Top Free 300” data set, and 541 unique sessions from 213 unique Apps in the “Top Paid 300” data set.

Second, we invited five human evaluators who are familiar with Apple's App Store and mobile Apps to manually label the selected leading sessions with score 2 (i.e., Fraud), 1 (i.e., Not Sure) or 0 (i.e., Non-fraud). Specifically, for each selected leading session, each evaluator gave a proper score by comprehensively considering the profile information of the App (e.g., descriptions, screenshots), the trend of rankings during this session, the App leaderboard information during this session, the trend of ratings during this session, and the reviews during this session. Moreover, they could also download and try the corresponding Apps to obtain user experiences. Particularly, to facilitate their evaluation, we developed a Ranking Fraud Evaluation Platform, which ensures that the evaluators can easily browse all the information. Also, the platform presents leading sessions in random order, which guarantees there is no relationship between the leading sessions' order and their fraud scores. Figure 11 shows the screenshots of the platform. The left panel shows the main menu, the right upper panel shows the reviews for the given session, and the right lower panel shows the ranking related information for the given session. After human evaluation, each leading session $s$ is assigned a fraud score $f(s) \in [0, 10]$. As a result, all five evaluators agreed on 86 fraud sessions and 113 non-fraud sessions in the “Top Free 300” data set. Note that 11 labeled fraud sessions among them are from the externally reported suspicious Apps [4], [5], which validates the effectiveness of our human judgement. Similarly, all five evaluators agreed on 94 fraud sessions and 119 non-fraud sessions in the “Top Paid 300” data set. Moreover, we computed the Cohen's kappa coefficient [1] between each pair of evaluators to estimate
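Cohen's kappa for a pair of evaluators can be computed from their two label lists with the standard textbook formula, sketched below in Python (the function name is our own; this is not the paper's implementation, and it assumes the raters do not agree at exactly the chance-expected rate, which would make the denominator zero).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters' labels over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of the product of marginal rates.
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

Identical label lists give kappa = 1, while agreement at exactly the chance rate gives kappa = 0.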
an unjustifiable favorable relevance or importance [30]. For example, Ntoulas et al. [22] have studied various aspects of content-based spam on the Web and presented a number of heuristic methods for detecting content-based spam. Zhou et al. [30] have studied the problem of unsupervised Web ranking spam detection. Specifically, they proposed efficient online link spam and term spam detection methods based on spamicity. Recently, Spirin and Han [25] have presented a survey on Web spam detection, which comprehensively introduces the principles and algorithms in the literature. Indeed, work on Web ranking spam detection is mainly based on analyzing the ranking principles of search engines, such as PageRank and query term frequency. This is different from ranking fraud detection for mobile Apps.

The second category focuses on detecting online review spam. For example, Lim et al. [19] have identified several representative behaviors of review spammers and modeled these behaviors to detect spammers. Wu et al. [27] have studied the problem of detecting hybrid shilling attacks on rating data. Their approach is based on semi-supervised learning and can be used for trustworthy product recommendation. Xie et al. [28] have studied the problem of singleton review spam detection. Specifically, they solved this problem by detecting the co-anomaly patterns in multiple review-based time series. Although some of the above approaches can be used for anomaly detection on historical rating and review records, they are not able to extract fraud evidences for a given time period (i.e., a leading session).

Finally, the third category includes the studies on mobile App recommendation. For example, Yan and Chen [29] developed a mobile App recommender system, named Appjoy, which builds a preference matrix from users' App usage records instead of using explicit user ratings. Also, to solve the sparsity problem of App usage records, Shi and Ali [24] studied several recommendation models and proposed a content-based collaborative filtering model, named Eigenapp, for recommending Apps on their Web site Getjar. In addition, some researchers have studied the problem of exploiting enriched contextual information for mobile App recommendation. For example, Zhu et al. [32] proposed a uniform framework for personalized context-aware recommendation, which can integrate both context independency and dependency assumptions. However, to the best of our knowledge, none of the previous works has studied the problem of ranking fraud detection for mobile Apps.

7 CONCLUDING REMARKS

In this paper, we developed a ranking fraud detection system for mobile Apps. Specifically, we first showed that ranking fraud happens in leading sessions and provided a method for mining the leading sessions of each App from its historical ranking records. Then, we identified ranking based evidences, rating based evidences and review based evidences for detecting ranking fraud. Moreover, we proposed an optimization based aggregation method to integrate all the evidences for evaluating the credibility of leading sessions of mobile Apps. A unique perspective of this approach is that all the evidences can be modeled by statistical hypothesis tests, so it can easily be extended with other evidences from domain knowledge to detect ranking fraud. Finally, we validated the proposed system with extensive experiments on real-world App data collected from Apple's App Store. Experimental results showed the effectiveness of the proposed approach.

In the future, we plan to study more effective fraud evidences and analyze the latent relationships among ratings, reviews and rankings. Moreover, we will extend our ranking fraud detection approach to other mobile App related services, such as mobile App recommendation, for enhancing the user experience.

Acknowledgement. This work was supported in part by grants from the National Science Foundation for Distinguished Young Scholars of China (Grant No. 61325010), the Natural Science Foundation of China (NSFC, Grant No. 71329201), the National High Technology Research and Development Program of China (Grant No. SS2014AA012303), Science and Technology Development of Anhui Province (Grant No. 1301022064), and the International Science and Technology Cooperation Plan of Anhui Province (Grant No. 1303063008). This work was also partially supported by grants from the National Science Foundation (NSF) via grant numbers CCF-1018151 and IIS-1256016.

REFERENCES

[1] http://en.wikipedia.org/wiki/Cohen's_kappa.
[2] http://en.wikipedia.org/wiki/Information_retrieval.
[3] https://developer.apple.com/news/index.php?id=02062012a.
[4] http://venturebeat.com/2012/07/03/apples-crackdown-on-app-ranking-manipulation/.
[5] http://www.ibtimes.com/apple-threatens-crackdown-biggest-app-store-ranking-fraud-406764.
[6] http://www.lextek.com/manuals/onix/index.html.
[7] http://www.ling.gu.se/~lager/mogul/porter-stemmer.
[8] L. Azzopardi, M. Girolami, and C. J. van Rijsbergen. Investigating the relationship between language model perplexity and IR precision-recall measures. In Proceedings of the 26th International Conference on Research and Development in Information Retrieval (SIGIR '03), pages 369–370, 2003.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[10] Y. Ge, H. Xiong, C. Liu, and Z.-H. Zhou. A taxi driving fraud detection system. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM '11, pages 181–190, 2011.
[11] D. F. Gleich and L.-H. Lim. Rank aggregation via nuclear norm minimization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 60–68, 2011.
[12] T. L. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of the National Academy of Sciences of the USA, pages 5228–5235, 2004.
[13] G. Heinrich. Parameter estimation for text analysis. Technical report, University of Leipzig, 2008.
[14] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08, pages 219–230, 2008.
[15] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing, STOC '95, pages 209–218, 1995.
[16] A. Klementiev, D. Roth, and K. Small. An unsupervised learning algorithm for rank aggregation. In Proceedings of the 18th European Conference on Machine Learning, ECML '07, pages 616–623, 2007.
[17] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based models. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 472–479, 2008.
[18] A. Klementiev, D. Roth, K. Small, and I. Titov. Unsupervised rank aggregation with domain-specific expertise. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI '09, pages 1101–1106, 2009.
[19] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw. Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 939–948, 2010.
[20] Y.-T. Liu, T.-Y. Liu, T. Qin, Z.-M. Ma, and H. Li. Supervised rank aggregation. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 481–490, 2007.
[21] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh. Spotting opinion spammers using behavioral footprints. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, 2013.
[22] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 83–92, 2006.
[23] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[24] K. Shi and K. Ali. Getjar mobile application recommendations with very sparse datasets. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 204–212, 2012.
[25] N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. SIGKDD Explorations Newsletter, 13(2):50–64, May 2012.
[26] M. N. Volkovs and R. S. Zemel. A flexible generative model for preference aggregation. In Proceedings of the 21st International Conference on World Wide Web, WWW '12, pages 479–488, 2012.
[27] Z. Wu, J. Wu, J. Cao, and D. Tao. HySAD: a semi-supervised hybrid shilling attack detector for trustworthy product recommendation. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 985–993, 2012.
[28] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern discovery. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 823–831, 2012.
[29] B. Yan and G. Chen. AppJoy: personalized mobile application discovery. In Proceedings of the 9th International Conference on Mobile Systems, Applications, and Services, MobiSys '11, pages 113–126, 2011.
[30] B. Zhou, J. Pei, and Z. Tang. A spamicity approach to web spam detection. In Proceedings of the 2008 SIAM International Conference on Data Mining, SDM '08, pages 277–288, 2008.
[31] H. Zhu, H. Cao, E. Chen, H. Xiong, and J. Tian. Exploiting enriched contextual information for mobile app classification. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 1617–1621, 2012.
[32] H. Zhu, E. Chen, K. Yu, H. Cao, H. Xiong, and J. Tian. Mining personal context-aware preferences for mobile users. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, ICDM '12, pages 1212–1217, 2012.
[33] H. Zhu, H. Xiong, Y. Ge, and E. Chen. Ranking fraud detection for mobile apps: A holistic view. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, 2013.

Hengshu Zhu is currently a Ph.D. student in the School of Computer Science and Technology at the University of Science and Technology of China (USTC), China. He was supported by the China Scholarship Council (CSC) as a visiting research student at Rutgers, the State University of New Jersey, USA, for more than one year. He received his B.E. degree in Computer Science from USTC, China, in 2009. His main research interests include mobile data mining, recommender systems, and social networks. During his Ph.D. study, he received the KSEM-2011 and WAIM-2013 Best Student Paper Awards. He has published a number of papers in refereed journals and conference proceedings, such as IEEE TMC, ACM TIST, WWW Journal, KAIS, ACM CIKM, and IEEE ICDM. He has also served as a reviewer for numerous journals, such as IEEE TSMC-B, KAIS, and WWW Journal.

Hui Xiong is currently an Associate Professor and Vice Chair of the Management Science and Information Systems Department, and the Director of the Rutgers Center for Information Assurance at Rutgers, the State University of New Jersey, where he received a two-year early promotion/tenure (2009), the Rutgers University Board of Trustees Research Fellowship for Scholarly Excellence (2009), and the ICDM-2011 Best Research Paper Award (2011). He received the B.E. degree from the University of Science and Technology of China (USTC), China, the M.S. degree from the National University of Singapore (NUS), Singapore, and the Ph.D. degree from the University of Minnesota (UMN), USA. His general area of research is data and knowledge engineering, with a focus on developing effective and efficient data analysis techniques for emerging data-intensive applications. He has published prolifically in refereed journals and conference proceedings (3 books, 40+ journal papers, and 60+ conference papers). He is a co-Editor-in-Chief of the Encyclopedia of GIS, and an Associate Editor of IEEE Transactions on Knowledge and Data Engineering (TKDE) and the Knowledge and Information Systems (KAIS) journal. He has served regularly on the organization and program committees of numerous conferences, including as a Program Co-Chair of the Industrial and Government Track for the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining and a Program Co-Chair for the 2013 IEEE International Conference on Data Mining (ICDM-2013). He is a senior member of the ACM and IEEE.

Yong Ge received his Ph.D. in Information Technology from Rutgers, The State University of New Jersey in 2013, the M.S. degree in Signal and Information Processing from the University of Science and Technology of China (USTC) in 2008, and the B.E. degree in Information Engineering from Xi'an Jiao Tong University in 2005. He is currently an Assistant Professor at the University of North Carolina at Charlotte. His research interests include data mining and business analytics. He received the ICDM-2011 Best Research Paper Award, Excellence in Academic Research (one per school) at Rutgers Business School in 2013, and the Dissertation Fellowship at Rutgers University in 2012. He has published prolifically in refereed journals and conference proceedings, such as IEEE TKDE, ACM TOIS, ACM TKDD, ACM TIST, ACM SIGKDD, SIAM SDM, IEEE ICDM, and ACM RecSys. He has served as a Program Committee member at ACM SIGKDD 2013, the International Conference on Web-Age Information Management 2013, and IEEE ICDM 2013. He has also served as a reviewer for numerous journals, including IEEE TKDE, ACM TIST, KAIS, Information Science, and TSMC-B.

Enhong Chen is currently a Professor and Vice Dean of the School of Computer Science, and Vice Director of the National Engineering Laboratory for Speech and Language Information Processing of the University of Science and Technology of China (USTC), and a winner of the National Science Fund for Distinguished Young Scholars of China. He received the B.S. degree from Anhui University, the Master's degree from Hefei University of Technology, and the Ph.D. degree in computer science from USTC. His research interests include data mining and machine learning, social network analysis, and recommender systems. He has published many papers in refereed journals and conferences, including TKDE, TMC, KDD, ICDM, NIPS and CIKM. He has served on program committees of numerous conferences including KDD, ICDM, and SDM. He received the Best Application Paper Award at KDD-2008 and the Best Research Paper Award at ICDM-2011. He is a senior member of the IEEE.