
Article in IEEE Transactions on Knowledge and Data Engineering, May 2018. DOI: 10.1109/TKDE.2018.2840974




Privacy-Preserving Social Media Data Publishing for Personalized Ranking-Based Recommendation

Dingqi Yang, Bingqing Qu, and Philippe Cudré-Mauroux
Department of Informatics, University of Fribourg, Switzerland. E-mail: {dingqi.yang, bingqing.qu, philippe.cudre-mauroux}@unifr.ch

Abstract—Personalized recommendation is crucial to help users find pertinent information. It often relies on a large collection of user
data, in particular users’ online activity (e.g., tagging/rating/checking-in) on social media, to mine user preference. However, releasing
such user activity data makes users vulnerable to inference attacks, as private data (e.g., gender) can often be inferred from the users’
activity data. In this paper, we propose PrivRank, a customizable and continuous privacy-preserving social media data publishing
framework protecting users against inference attacks while enabling personalized ranking-based recommendations. Its key idea is to
continuously obfuscate user activity data such that the privacy leakage of user-specified private data is minimized under a given data
distortion budget, which bounds the ranking loss incurred from the data obfuscation process in order to preserve the utility of the data
for enabling recommendations. An empirical evaluation on both synthetic and real-world datasets shows that our framework can
efficiently provide effective and continuous protection of user-specified private data, while still preserving the utility of the obfuscated
data for personalized ranking-based recommendation. Compared to state-of-the-art approaches, PrivRank achieves both a better
privacy protection and a higher utility in all the ranking-based recommendation use cases we tested.

Index Terms—Privacy-preserving data publishing, Customized privacy protection, Personalization, Ranking-based recommendation,
Social media, Location based social networks

1 INTRODUCTION

DEVELOPING effective recommendation engines is critical in the era of Big Data in order to provide pertinent information to the users. To deliver high-quality and personalized recommendations, online services such as e-commerce applications typically rely on a large collection of user data, particularly user activity data on social media, such as tagging/rating records, comments, check-ins, or other types of user activity data. In practice, many users are willing to release the data (or data streams) about their online activities on social media to a service provider in exchange for getting high-quality personalized recommendations. In this paper, we refer to such user activity data as public data. However, users often consider part of the data from their social media profile as private, such as gender, income level, political view, or social contacts. In the following, we refer to those data as private data. Although users may refuse to release private data, the inherent correlation between public and private data often causes serious privacy leakage. For example, one's political affiliation can be inferred from her rating of TV shows [1]; one's gender can be inferred from her activities on location-based social networks [2]. These studies show that private data often suffers from inference attacks [3], where an adversary analyzes a user's public data to illegitimately gain knowledge about her private data. It is thus crucial to protect user private data when releasing public data to recommendation engines.

To tackle this problem, privacy-preserving data publishing has been widely studied [4]. Its basic idea is to provide protection on the private data by distorting the public data before its publication, at the expense of a loss of utility of the public data in the later processing stages. For the use case of recommendation engines, utility refers to the personalization performance based on the distorted public data, i.e., whether the recommendation engines can accurately predict the individual's preference based on the obfuscated data. There is an intrinsic trade-off between privacy and personalization. On one hand, more distortion of public data leads to better privacy protection, as it makes it harder for adversaries to infer private data. On the other hand, it also incurs a higher loss in utility, as highly distorted public data prevents recommendation engines from accurately predicting users' real preferences.

To apply privacy-preserving data publishing techniques in the case of social media based recommendation, one immediate strategy is to obfuscate user public data on the user side before it is sent to social media. However, such an approach is unrealistic as it hinders key benefits for users. In real-world use cases, social media provides users with a social sharing platform, where they can interact with their friends by intentionally sharing their comments/ratings on items, blogs, photos, videos, or even their real-time locations. For example, when a user has watched a good movie and wants to share her high rating on it with her friends, she does not want the rating to be obfuscated in any sense.

As it is inappropriate to obfuscate user public data before it is sent to social media, an alternative solution is to protect user privacy when releasing their public data from social media to any other third-party services. Specifically, many third-party services for social media require access to user activity data (or data streams) in order to provide them with
personalized recommendations. In addition to such public data, these services may require optional access to users' profiles. While some privacy-conscious users want to keep certain data from their profiles (e.g., gender) as private, other non-privacy-conscious users may not care about the same type of private data and choose to release them. Subsequently, an adversary could illegitimately infer the private data of the privacy-conscious users by learning the correlation between the public and the private data from the non-privacy-conscious users. Therefore, it is indispensable to provide privacy protection when releasing user public data from social media.

In this paper, we study the problem of privacy-preserving publishing of user social media data by considering both the specific requirements of user privacy on social media and the data utility for enabling high-quality personalized recommendation. Towards this goal, we face the following three challenges. First, since users often have different privacy concerns [5], a specific type of data (e.g., gender) may be considered as private by some users, while other users may prefer to consider it as public in order to get better personalized services. Therefore, the first challenge is to provide users with customizable privacy protection, i.e., to protect user-specified private data only. Second, when subscribing to third-party services, users often allow the service providers to access not only their historical public data, but also their future public data as a data stream. Although the obfuscated historical public data can efficiently reduce privacy leakage, the continuous release of the user activity feed will incrementally increase such leakage (see Figure 7 for details). Therefore, the second challenge is to provide continuous privacy protection over user activity data streams. Third, we consider the case of ranking-based (or top-N) recommendation, which is more practical and has been widely adopted in practice by many e-commerce platforms [6]. As ranking-based recommendation algorithms mainly leverage the ranking of items for preference prediction, they are sensitive to the ranking loss incurred from the data obfuscation process. However, the computation of ranking losses often implies a high cost that is super-linear in the number of items used for recommendation [7]. Therefore, the third challenge is to efficiently bound ranking loss in data obfuscation.

Aiming at overcoming the above challenges, we propose PrivRank, a customizable and continuous privacy-preserving data publishing framework protecting users against inference attacks while enabling personalized ranking-based recommendation. It provides continuous protection of user-specified private data against inference attacks by obfuscating both the historical and streaming user activity data before releasing them, while still preserving the utility of the published data for enabling personalized ranking-based recommendation by efficiently limiting the pairwise ranking loss incurred from data obfuscation. Our main contributions are summarized as follows:

• First, considering the use case of recommendation based on social media data, we identify a privacy-preserving data publishing problem by analyzing the specific privacy requirements and users' benefits of social media.

• Second, we propose a customizable and continuous data obfuscation framework for user activity data on social media. The key idea is to measure the privacy leakage of user-specified private data from public data based on mutual information, and then to obfuscate public data such that the privacy leakage is minimized under a given data distortion budget, which can ensure the utility of the released data. To handle the real-world use case of third-party services built on top of social media, our framework considers both historical and online user activity data:
  – Historical data publishing: When a user subscribes to a third-party service for the first time, the service provider has access to the user's entire historical public data. To obfuscate the user's historical data, we minimize the privacy leakage from her historical data by obfuscating her data using data from another user whose historical data is similar but with less privacy leakage.
  – Online data publishing: After the user has subscribed to third-party services, the service provider also has real-time access to her future public data stream. Due to efficiency considerations, online data publishing should be performed based on incoming data instances only (e.g., a rating/tagging/checking-in activity on an item), without accessing the user's historical data. Therefore, we minimize the privacy leakage from each individual activity data instance by obfuscating the data stream on-the-fly.

• Third, to guarantee the utility of the obfuscated data for enabling personalized ranking-based recommendation, we measure and bound the data distortion using a pairwise ranking loss metric, i.e., the Kendall-τ rank distance [8]. To efficiently incorporate such ranking loss, we propose a bootstrap sampling process to quickly approximate the Kendall-τ distance.

• Finally, we conduct an extensive empirical evaluation of PrivRank. The results show that PrivRank can continuously provide customized protection of user-specified private data, while the obfuscated data can still be exploited to enable high-quality personalized ranking-based recommendation.

The rest of the paper is organized as follows. We present the related work in Section 2. The preliminaries of our work are presented in Section 3. Afterward, we first define our threat model in Section 4, and present our historical and online data publishing methods in Sections 5 and 6, respectively. The experimental evaluation is shown in Section 7. We conclude our work in Section 8.

2 RELATED WORK

To protect user privacy when publishing user data, the current practice mainly relies on policies or user agreements, e.g., on the use and storage of the published data [4]. However, this approach cannot guarantee that the users' sensitive information is actually protected from a malicious attacker. Therefore, to provide effective privacy protection when releasing user data, privacy-preserving data publishing has been widely studied. Its key idea is to obfuscate user data such that the published data remains useful for some
application scenarios while the individual's privacy is preserved. According to the attacks considered, existing work can be classified into two categories.

The first category is based on heuristic techniques to protect ad-hoc defined user privacy [4]. Specific solutions mainly tackle the privacy threat when attackers are able to link the data owner's identity to a record, or to an attribute in the published data. For example, to protect user privacy from identity disclosure, K-anonymity [9] obfuscates the released data so that each record cannot be distinguished from at least k-1 other records. However, since these techniques usually have ad-hoc privacy definitions, they have been proven to be non-universal and can only be successful against limited adversaries [10].

The second category is theory-based and focuses on the uninformative principle [11], i.e., on the fact that the published data should provide attackers with as little additional information as possible beyond background knowledge. Differential privacy [12] is a well-known technique that is known to guarantee user privacy against attackers with arbitrary background knowledge. Information-theoretic privacy protection approaches have also been proposed in that context. They try to quantitatively measure privacy leakage based on various entropy-based metrics such as conditional entropy [10] and mutual information [13], and to design privacy-protection mechanisms based on those measures. Although the concept of differential privacy is stricter (i.e., against attackers with arbitrary background knowledge) than that of information-theoretic approaches, the latter is intuitively more accessible and fits the practical requirements of many application domains [10]. In particular, information theory can provide intuitive guidelines to quantitatively measure the amount of a user's private information that an adversary can learn by observing and analyzing the user's public data (i.e., the privacy leakage of private data from public data).

In this study, we advocate the information-theoretic approach. Specifically, we measure the privacy leakage of user private data from public data based on mutual information, and then obfuscate public data such that the privacy leakage is minimized under a given data distortion budget. In the current literature, existing data obfuscation methods mainly ensure data utility by bounding the data distortion using metrics such as the Euclidean distance [14], the squared L2 distance [15], the Hamming distance [1] or the Jensen-Shannon distance [2]. They are analogous to limiting the loss of predicting user ratings on items, where the goal is to minimize the overall difference (e.g., mean absolute error) between the predicted ratings and the real ratings from the users. Although minimizing such a rating prediction error is widely adopted by the research community, ranking-based (or top-N) recommendation is more practical and is actually adopted by many e-commerce platforms [6]. Specifically, different from rating prediction, which tries to infer how users rate items, ranking-based recommendation tries to determine a ranked list of items for the user, where the top items are most likely to be appealing to her. However, we argue that bounding data distortion using traditional metrics is not optimal for ranking-based recommendation, whose goal is to minimize the ranking difference (e.g., pairwise ranking loss or mean average precision) between the predicted ranking list and the actual list from the users. Therefore, different from existing methods that bound data distortion using non-ranking-based measures, our approach considers bounding the ranking loss incurred from the data obfuscation process using the Kendall-τ rank distance [8] to preserve the utility of the published data for personalized ranking-based recommendation. In addition, as the computation of ranking losses often implies a high cost that is super-linear in the number of items for recommendation [7], we develop a bootstrap sampling process to quickly approximate the Kendall-τ distance.

Compared to our previous work [2], this paper makes the following improvements: 1) we extend the scope of our privacy-preserving data publishing problem from location based social networks to general social media; 2) we improve the data utility guarantee by explicitly considering the use case of personalized ranking-based recommendation, and re-design the privacy-preserving data publishing framework by bounding ranking loss; 3) we discuss and compare different types of ranking losses and select the Kendall-τ distance, and propose a bootstrap-sampling process for its fast approximation; 4) we re-design and conduct new experiments with two ranking-based recommendation use cases to show the effectiveness of our framework and its superiority over our previous work [2] for enabling ranking-based recommendations; 5) we conduct a thorough scalability study with synthetic datasets, and show that PrivRank can scale up to large datasets.

3 PRELIMINARIES

3.1 System Workflow

Figure 1 illustrates the end-to-end workflow of our system. PrivRank is implemented as a supplementary module to existing social media platforms, in order to let users enjoy high-quality personalized recommendations from third-party services under a customized privacy guarantee.

1) When users interact with each other via a social media service, they voluntarily share their activity data, particularly the tagging/rating/checking-in activities, which massively imply their preferences.

2) When a user wants to subscribe to third-party services, she typically needs to give them access to this kind of activity data. Specifically, right after the user's subscription, third-party services can immediately access the user's historical activity data. Before releasing such data, and according to the user's own criteria, the historical data publishing module obfuscates her historical activity data to protect user-specified private data against inference attacks. Afterward, when the user continuously reports her activity on social media, the online data publishing module obfuscates each activity (e.g., adding a tag to a photo, rating a movie or checking in at a POI) from her activity streams before sending it to third-party services. All data obfuscation is performed with the utility guarantee for personalized ranking-based recommendation by limiting the ranking loss incurred from data obfuscation.

3) Despite receiving obfuscated public data, the third-party services can still provide high-quality personalized ranking-based recommendation to the users.
Fig. 1. System workflow for privacy-preserving publishing of social media data: 1) Users report their activity (i.e., public data) on social media services; 2) PrivRank publishes the obfuscated public data to third-party service providers; 3) The third-party service providers can still deliver high-quality personalized ranking-based recommendation to users.

Our system workflow is beneficial to all the involved entities. First, a user still shares her actual activities with her friends on a social media platform, while now enjoying high-quality personalized recommendations from a third-party service under a customized privacy guarantee, as only obfuscated user activity data are released from the social media platform to the third-party service. Second, the third-party service may attract more users (in particular privacy-conscious users) when providing high-quality recommendation services with privacy protection. Third, the social media platform may also boost its business by attracting more users and more third-party services by providing privacy-preserving data publishing. In addition, another advantage of our framework lies in its easy integration with the existing social media platform, where the latter does not need to be changed significantly (as shown in Figure 1). Therefore, these benefits could incentivize both the social media platform and the third-party service to implement PrivRank.
3.2 User Preference Modeling from Social Media Data

Users' activities on social media massively imply their preferences. Individual social media services often provide users with a unique feature (or a certain type of items) for interaction, such as photos for Flickr, videos for YouTube, music for Last.fm, and POIs for Foursquare. By interacting with these items on social media (e.g., tagging a photo, rating a video or checking-in at a POI), users explicitly or implicitly express their preferences on those items. In this work, we consider such user activity as public data. Formally, let U and I denote the sets of users and items, respectively. A typical representation of such public data from a user u (u ∈ U) is a vector V^u (of size |I|) that encodes any type of user preference, such as the user's ratings (on a 1-5 scale), her tags/thumbs-ups (in a binary format), or her cumulative number of interactions (e.g., the number of check-ins on POIs). When the user subscribes to a third-party service for the first time, the service provider will have immediate access to the user's (historical) public data vector V^u, which contains all the user's historical activities. Afterward, the service provider can also observe the future user activity feed, where each activity (e.g., rating/tagging/checking-in) on item i (i ∈ I) will be used to update the corresponding element V_i^u of the public data vector.
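To make this representation concrete, the following minimal Python sketch builds such public data vectors from raw activity records. It is an illustration only: the function name, the record format, and the encoding modes are assumptions made for this sketch, not part of the paper's implementation.

from collections import defaultdict

def build_public_vectors(activities, items, mode="count"):
    """Build one public data vector V^u per user from raw activity records.

    activities: iterable of (user, item, value) tuples, e.g. a check-in
                carries value=1, a rating carries its 1-5 score.
    items:      ordered list of all items I; each V^u has length |I|.
    mode:       "count"  -> cumulative number of interactions (check-ins),
                "rating" -> keep the latest explicit rating,
                "binary" -> 0/1 tags or thumbs-ups.
    """
    index = {item: k for k, item in enumerate(items)}
    vectors = defaultdict(lambda: [0.0] * len(items))
    for user, item, value in activities:
        k = index[item]
        if mode == "count":
            vectors[user][k] += 1            # cumulative interaction count
        elif mode == "rating":
            vectors[user][k] = float(value)  # latest explicit rating
        else:
            vectors[user][k] = 1.0           # binary preference signal
    return dict(vectors)

# Example: two users checking in at three POIs.
acts = [("u1", "poi_a", 1), ("u1", "poi_a", 1), ("u1", "poi_c", 1), ("u2", "poi_b", 1)]
V = build_public_vectors(acts, ["poi_a", "poi_b", "poi_c"])
print(V["u1"])  # [2.0, 0.0, 1.0]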
Fig. 2. A toy example of ranking loss. The real user rating a on three items can be obfuscated to â1 or â2. While the Euclidean distances for both obfuscations are exactly the same, the ranking losses from the two obfuscations are different. Compared to the ranking list i1 < i2 < i3 in the original rating a, the obfuscated rating â1 does not incur any ranking loss as we still observe i1 < i2 < i3. However, the obfuscated rating â2 incurs a certain ranking loss as we find i1 < i3 < i2 there.

3.3 Ranking-Based Recommendation

Based on the aforementioned public data vectors, ranking-based recommendation outputs a ranked list of items for a user, where the top items are most likely to be appealing to her. The related algorithms mainly leverage the existing ranking of items in the learning process to predict the missing rank of the items for recommendation [16]. Therefore, ranking-based recommendation algorithms are sensitive to the ranking loss incurred from the data obfuscation process, rather than to other types of loss measured by the Euclidean or squared L2 distance, for example. Moreover, those traditional data distortion measures are not analogous to ranking loss [16]. Figure 2 shows an example where the same data distortion budget measured by Euclidean distance may imply different ranking losses. Therefore, considering the ranking loss incurred from data obfuscation is critical for ranking-based recommendation.
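The effect shown in Figure 2 can also be checked numerically. The sketch below uses toy ratings chosen for illustration (not the exact values of the figure): both obfuscated vectors sit at the same Euclidean distance from the original ratings, yet only the second one flips the order of an item pair.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kendall_tau_distance(a, b):
    """Normalized Kendall-tau distance: fraction of item pairs ranked
    differently by the two score vectors (ties skipped for brevity)."""
    n, disagreements, pairs = len(a), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] == a[j]:
                continue
            pairs += 1
            if (a[i] - a[j]) * (b[i] - b[j]) < 0:
                disagreements += 1
    return disagreements / pairs if pairs else 0.0

a      = [1.0, 2.0, 3.0]  # original ratings on items i1 < i2 < i3
a_hat1 = [1.0, 3.0, 4.0]  # obfuscation preserving the ranking
a_hat2 = [1.0, 3.0, 2.0]  # obfuscation swapping i2 and i3

print(euclidean(a, a_hat1), euclidean(a, a_hat2))   # both sqrt(2)
print(kendall_tau_distance(a, a_hat1))              # 0.0
print(kendall_tau_distance(a, a_hat2))              # ~0.33 (one pair flipped)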
4 THREAT MODEL

In this study, we consider the inference attack [3] as the targeted threat model. As described above, we consider that each user has two types of data: i) public data (e.g., her activity data) that she is willing to release for getting personalized recommendations, and ii) private data (e.g., gender) that she wants to keep private. We denote public data as X ∈ 𝒳 and private data as Y ∈ 𝒴, where 𝒳 and 𝒴 are the sets of values that X and Y can take, respectively. Since Y is often linked to X by their joint probability p(X, Y), an adversary who observes X is able to gain some knowledge about Y. To reduce such privacy leakage, the basic idea is to release a distorted X̂ instead of X such that it is hard to infer Y from X̂.

4.1 Inference Attack

Inference attacks assume that an adversary has a method q to infer Y, where the adversary always tries to select q such that the cost (e.g., inference error) of using q to infer Y is minimized [13]. Therefore, before observing X̂, q can be obtained by solving the following problem:

    c = min_q E_Y[C(Y, q)]    (1)
where C(Y, q) is the expected cost function of inferring Y using q. After observing the distorted public data X̂, q can be obtained by solving the following problem:

    ĉ = min_q E_{Y|X̂}[C(Y, q) | X̂]    (2)

The adversary's cost gain after observing X̂ is as follows:

    ∆C = c − ĉ,    (3)

which measures how much knowledge the adversary gains w.r.t. the inference of Y after observing X̂. The idea of privacy protection is to find X̂ such that the privacy leakage ∆C is reduced, while the obfuscated X̂ can still be used to enable personalized recommendation.

4.2 Basic Idea of Our Solution

In order to reduce the privacy leakage ∆C, we obfuscate X to obtain X̂ based on a probabilistic obfuscation function p_{X̂|X}, which encodes the conditional probability of releasing X̂ when observing X. Intuitively, p_{X̂|X} should be designed such that any inference attack on Y is rendered weak. Meanwhile, it should also keep some utility of X̂ by limiting the distortion budget in the obfuscation process, which can be modeled by a constraint ∆X as follows:

    E_{X̂,X}(dist(X̂, X)) ≤ ∆X    (4)

where dist(X̂, X) is a certain distance metric that measures the difference between X̂ and X. ∆X limits the expected distortion w.r.t. the probabilistic obfuscation function p_{X̂|X}. The data distortion budget can ensure the utility of the released data. Considering the data utility for enabling personalized ranking-based recommendation, we measure and bound the data distortion budget using a ranking distance. In summary, the key idea of our solution is to learn the p_{X̂|X} that minimizes ∆C under a given distortion budget ∆X.

5 HISTORICAL DATA PUBLISHING

To publish historical public data in a privacy-preserving way, the key idea is to probabilistically obfuscate a user's historical public data vector to that of another user which is similar but has less privacy leakage. In this context, data obfuscation operates on one's whole public data vector, rather than obfuscating her individual activity records one by one (over the user's activity stream). Compared to the streaming scheme, we show that such a historical data obfuscation scheme can achieve the same level of privacy protection with a lower data distortion budget (see Section 7.3 for details).

Figure 3 gives an overview of the historical data publishing process. First, aiming at reducing the problem complexity stemming from learning the optimal obfuscation function, we incorporate a clustering step in our framework to cluster a large number of users into a limited number of groups based on their public data, as similar user activities often cause similar privacy leakage [2]. Second, based on the user clusters, we quantitatively measure the privacy leakage of user-specified private data (e.g., gender) from public data, and then learn the optimal obfuscation function by minimizing the privacy leakage under a given distortion budget which bounds the ranking loss. Finally, based on the learned obfuscation function, we perform probabilistic data obfuscation. Customized privacy protection is achieved in the way that, for the specified private data (e.g., gender), a corresponding obfuscation function is generated.

Fig. 3. Historical data publishing.

5.1 User Clustering

We try to obfuscate a user's historical public data vector to that of another user. Directly learning the optimal obfuscation function p_{X̂|X} from individual users' public data incurs a complexity growing quadratically with the number of users |U|. To reduce the problem complexity, the user clustering phase clusters users into a limited and fixed number of groups according to their public data vectors. The complexity of learning the optimal obfuscation function between user clusters rather than between individual users is hence reduced and independent of |U|. Specifically, we cluster the set of users U based on their historical public data vectors V^u. We adopt average-linkage hierarchical clustering [17] using the Euclidean distance for the sake of simplicity. Based on the clustering results, we obtain the mapping from users U to clusters 𝒢, where each element G (G ∈ 𝒢) is the centroid of the corresponding cluster.
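A minimal sketch of this clustering step, assuming NumPy and SciPy are available; the paper only specifies average-linkage hierarchical clustering with Euclidean distance, so the cluster count and variable names below are illustrative choices.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_users(V, n_clusters=200):
    """Average-linkage hierarchical clustering of users by their
    historical public data vectors V (one row per user)."""
    Z = linkage(V, method="average", metric="euclidean")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")  # labels 1..n_clusters
    centroids = np.vstack([V[labels == g].mean(axis=0)
                           for g in range(1, labels.max() + 1)])
    return labels, centroids

# Example with random toy vectors (500 users, 50 items).
rng = np.random.default_rng(0)
V = rng.poisson(1.0, size=(500, 50)).astype(float)
labels, centroids = cluster_users(V, n_clusters=20)
print(labels.shape, centroids.shape)  # (500,) (20, 50)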
5.2 Cluster-wise Obfuscation Function Learning

The optimal obfuscation function is learned based on user clusters. Therefore, values for the public data X and for the released public data X̂ refer to user clusters 𝒢. Without loss of generality, in the following derivation we keep using X and Y for public and private data, respectively. In the following, we first formally present our privacy-utility trade-off, then the utility guarantee obtained by bounding the ranking loss, followed by the obfuscation function learning algorithm.

5.2.1 Balancing Privacy and Utility

The privacy leakage in this paper is measured by ∆C, which represents the information gain of an adversary after observing the released public data X̂. When using a log-loss cost function, Calmon et al. [13] proved that ∆C becomes the mutual information between the released public data X̂ and the specific private data Y:

    ∆C = I(X̂, Y) = Σ_{x̂,y} p(x̂, y) log [ p(x̂, y) / (p(x̂) p(y)) ]    (5)
As noted above, we use the probabilistic obfuscation function p_{X̂|X} to generate the released public data X̂. Therefore, the joint probability of X̂ and Y can be computed as:

    p(x̂, y) = Σ_x p_{X̂|X}(x̂|x) p_{X,Y}(x, y)    (6)

The marginal probabilities p_X̂(x̂), p_X(x) and p_Y(y) can be calculated as follows:

    p_X̂(x̂) = Σ_{x,y} p_{X̂|X}(x̂|x) p_{X,Y}(x, y)    (7)

    p_X(x) = Σ_y p_{X,Y}(x, y),    p_Y(y) = Σ_x p_{X,Y}(x, y)    (8)

Combined with the above equations, the mutual information between the released public data X̂ and the private data Y can be derived as:

    I(X̂, Y) = Σ_{x̂,y} p(x̂, y) log [ p(x̂, y) / p(x̂) ] − Σ_y p(y) log p(y)    (9)

where the second term is the entropy of Y, i.e., −Σ_y p(y) log p(y), which is a constant for the specified private data (e.g., gender) in a given dataset. Hence, we ignore this term in the following derivations and obtain:

    I(X̂, Y) = Σ_{x̂,y} p(x̂, y) log [ p(x̂, y) / p(x̂) ]    (10)

Combined with Equations 6 and 7, the mutual information can then be derived as a function of only two factors, namely the joint probability p_{X,Y}, which can be empirically obtained from a given dataset, and the obfuscation function p_{X̂|X}:

    I(X̂, Y) = Σ_{x̂,x,y} p_{X̂|X}(x̂|x) p_{X,Y}(x, y) · log [ Σ_{x'} p_{X̂|X}(x̂|x') p_{X,Y}(x', y) / Σ_{x'',y'} p_{X̂|X}(x̂|x'') p_{X,Y}(x'', y') ]    (11)

The optimal obfuscation function p_{X̂|X} is learned such that I(X̂, Y) is minimized under a given distortion budget ∆X.
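For intuition, the privacy leakage can be evaluated numerically for any candidate obfuscation function. The sketch below (NumPy assumed, names illustrative) computes the mutual information of Eq. 5, forming the joint p(x̂, y) and the marginal p(x̂) from p_{X,Y} and p_{X̂|X} as in Eqs. 6 and 7; the objective of Eqs. 10-11 differs from this value only by the constant entropy of Y.

import numpy as np

def privacy_leakage(p_xy, p_xhat_given_x, eps=1e-12):
    """Mutual information I(X_hat; Y) of Eq. 5 for a candidate obfuscation
    function, with p(x_hat, y) and p(x_hat) obtained via Eqs. 6 and 7.

    p_xy:           |X| x |Y| joint probability matrix p_{X,Y}(x, y).
    p_xhat_given_x: |X_hat| x |X| matrix whose column x is p_{X_hat|X}(.|x),
                    each column summing to 1.
    """
    p_xhat_y = p_xhat_given_x @ p_xy                  # Eq. 6
    p_xhat = p_xhat_y.sum(axis=1, keepdims=True)      # Eq. 7
    p_y = p_xy.sum(axis=0, keepdims=True)             # Eq. 8
    ratio = p_xhat_y / np.maximum(p_xhat @ p_y, eps)
    return float(np.sum(p_xhat_y * np.log(np.maximum(ratio, eps))))

# Toy example: 3 public values, a binary private attribute.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.25]])
print(privacy_leakage(p_xy, np.eye(3)))              # identity release: leaks I(X; Y) > 0
print(privacy_leakage(p_xy, np.full((3, 3), 1/3)))   # uniform release: leaks ~0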
5.2.2 Bounding Ranking Loss for Utility

To provide optimal utility guarantees for personalized ranking-based recommendation, we consider bounding the data distortion dist(X̂, X) based on ranking loss. There are typically three types of ranking loss functions [16], namely pointwise, pairwise, and listwise, which are defined on the basis of single items, pairs of items, and all ranked items, respectively. As a pointwise loss function measures the loss of (ranking) score for individual items, it is analogous to non-ranking-based distance metrics. A theoretical study on these three types of ranking loss functions [16] shows that pairwise and listwise losses are indeed upper bounds of the two quantities 1-MAP and 1-NDCG, respectively, where Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) are two popular metrics for evaluating ranking-based information retrieval algorithms [18], [19]. In other words, for ranking-based recommendation algorithms, minimizing pairwise/listwise loss is equivalent to maximizing the predicted ranking quality measured by MAP or NDCG. Following this idea, we also want to bound the data distortion incurred from data obfuscation by limiting the pairwise ranking loss when obfuscating X into X̂.

We choose to measure the pairwise ranking loss using a widely known metric, i.e., the Kendall-τ rank distance [8]. It measures the number of pairwise disagreements between two ranking lists. For two users a and b, we denote their public data vectors as V^a and V^b, respectively. The Kendall-τ rank distance K(V^a, V^b) is then computed as:

    K(V^a, V^b) = Σ_{V_i^a > V_j^a} 1_{V_i^b < V_j^b}    (12)

where V_i^a is the ranking score of item i in list V^a, and so on. 1_cond is an indicator function which is equal to 1 when cond is true and 0 otherwise. As Eq. 12 counts the absolute number of pairwise disagreements, we normalize it by dividing by |I|(|I| − 1)/2, so that the normalized Kendall-τ distance lies in the interval [0, 1]:

    K(V^a, V^b) = [ 1 / (|I|(|I| − 1)/2) ] Σ_{V_i^a > V_j^a} 1_{V_i^b < V_j^b}    (13)

A value of 1 indicates maximum disagreement, while 0 indicates that the two lists express the same ranking. For the sake of simplicity, all mentions of the Kendall-τ distance refer to the normalized Kendall-τ distance (Eq. 13) in the following.

In practice, a large number of items yields a high cost when computing the Kendall-τ distance. Since the computation of the Kendall-τ distance requires a total of |I|(|I| − 1)/2 pairwise comparisons, the resulting computation complexity is O(n²), where n is the number of items |I|. To efficiently compute the Kendall-τ distance for large item sets, we propose to use a bootstrap sampling process [20] to approximate the Kendall-τ distance. Specifically, instead of computing all |I|(|I| − 1)/2 comparisons, we randomly sample S pairs of items for comparison. After counting the absolute number of disagreements in the S sampled pairs, we normalize it by dividing by |S|:

    K(V^a, V^b) ≈ (1/|S|) Σ_{V_i^a > V_j^a, (i,j)∈S} 1_{V_i^b < V_j^b}    (14)
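A direct sketch of the exact normalized distance (Eq. 13) and of its bootstrap-sampling approximation (Eq. 14), assuming NumPy; the sample size and function names are illustrative.

import numpy as np

def kendall_tau_exact(va, vb):
    """Eq. 13: fraction of the |I|(|I|-1)/2 item pairs whose relative
    order differs between the two score vectors; O(n^2) comparisons."""
    n = len(va)
    disagree, total = 0, n * (n - 1) // 2
    for i in range(n):
        for j in range(i + 1, n):
            if va[i] > va[j] and vb[i] < vb[j]:
                disagree += 1
            elif va[i] < va[j] and vb[i] > vb[j]:
                disagree += 1
    return disagree / total

def kendall_tau_bootstrap(va, vb, num_samples=10_000, seed=0):
    """Eq. 14: estimate the same quantity from S randomly sampled item
    pairs instead of all pairs; cost is O(S) rather than O(n^2)."""
    rng = np.random.default_rng(seed)
    n = len(va)
    i = rng.integers(0, n, size=num_samples)
    j = rng.integers(0, n, size=num_samples)
    valid = i != j                                   # discard degenerate pairs
    flipped = (va[i] - va[j]) * (vb[i] - vb[j]) < 0  # strict pairwise disagreement
    return flipped[valid].mean()

rng = np.random.default_rng(1)
va = rng.random(2000)
vb = va + 0.3 * rng.random(2000)        # a mildly perturbed copy of va
print(kendall_tau_bootstrap(va, vb))    # close to the exact O(n^2) value

In practice, the sample size trades estimation accuracy against speed; the experiments later use S on the order of 10^3 to 10^4 (see Section 7.2).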
5.2.3 Optimal Obfuscation Function Learning

Considering the above ranking loss as a constraint to ensure high data utility, we now present our algorithm that learns the optimal cluster-wise obfuscation function p_{Ĝ|G}. For a given dataset, we can empirically determine p_{G,Y} according to the private data Y (e.g., gender). Thus, the obfuscation function p_{Ĝ|G} can be learned by Algorithm 1, which contains a convex optimization problem with three constraints (which can be solved by many solvers, such as CVX [21]). The first constraint is the distortion budget, which bounds the expected Kendall-τ distance w.r.t. the probabilistic obfuscation function p_{Ĝ|G}. Note that it is easy to compute the Kendall-τ distance between Ĝ and G. The last two constraints are probability constraints on p_{Ĝ|G}. To stress the protected private data Y, we denote the corresponding optimal obfuscation function
as p_{Ĝ|G,Y}. Note that we do not assume any specific inference attack method in our framework, and that any inference attack on Y should be rendered weak.

As we try to find the optimal obfuscation probability between each pair of user clusters, the problem complexity of learning the optimal p_{Ĝ|G} in Algorithm 1 is O(n²), where n is the number of user clusters |𝒢| rather than the number of users |U|. The evaluation later shows that a small number of clusters can indeed provide efficient privacy protection (see Section 7.7 for details).

Algorithm 1 Cluster-wise obfuscation function learning
Require: Joint probability p_{G,Y}, and distortion budget ∆X
1: Solve the optimization problem for p_{Ĝ|G}:
       min_{p_{Ĝ|G}} I(Ĝ, Y)
       s.t. E_{Ĝ,G}(K(Ĝ, G)) ≤ ∆X
            p_{Ĝ|G}(ĝ|g) ∈ [0, 1], ∀g, ĝ ∈ 𝒢
            Σ_{ĝ∈𝒢} p_{Ĝ|G}(ĝ|g) = 1, ∀g ∈ 𝒢
2: return p_{Ĝ|G,Y}
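The optimization in Algorithm 1 can be prototyped with an off-the-shelf disciplined convex programming tool. The sketch below uses CVXPY as an assumed stand-in for the CVX-style solvers mentioned above, and minimizes the full mutual information of Eq. 5, which has the same minimizer as the objective of Eq. 11 since the two differ only by the constant entropy of Y; all names are illustrative.

import numpy as np
import cvxpy as cp

def learn_cluster_obfuscation(p_gy, kendall, budget):
    """Sketch of Algorithm 1: learn the column-stochastic matrix
    P[g_hat, g] = p_{G_hat|G}(g_hat | g) that minimizes I(G_hat; Y)
    subject to an expected Kendall-tau distortion budget.

    p_gy:    |G| x |Y| empirical joint probability of clusters and the
             user-specified private attribute (sums to 1).
    kendall: |G| x |G| matrix of Kendall-tau distances between cluster
             centroids, kendall[g_hat, g] = K(g_hat, g).
    budget:  distortion budget Delta_X.
    """
    n = p_gy.shape[0]
    p_g = p_gy.sum(axis=1)                 # p(g)
    p_y = p_gy.sum(axis=0, keepdims=True)  # p(y) as a 1 x |Y| row

    P = cp.Variable((n, n), nonneg=True)   # P[g_hat, g]
    joint = P @ p_gy                       # p(g_hat, y), affine in P (Eq. 6)
    marginal = cp.sum(joint, axis=1, keepdims=True)  # p(g_hat) (Eq. 7)

    # I(G_hat; Y) as a sum of relative entropies; convex in P.
    leakage = cp.sum(cp.rel_entr(joint, marginal @ p_y))
    # Expected Kendall-tau distortion E[K(G_hat, G)]; linear in P.
    distortion = cp.sum(cp.multiply(kendall, P) @ p_g)

    problem = cp.Problem(cp.Minimize(leakage),
                         [cp.sum(P, axis=0) == 1,   # each column is a distribution
                          distortion <= budget])
    problem.solve()                        # any exponential-cone-capable solver (e.g., SCS)
    return P.value

# Toy usage: 5 clusters, binary private attribute, random centroid distances.
rng = np.random.default_rng(0)
p_gy = rng.random((5, 2)); p_gy /= p_gy.sum()
D = rng.random((5, 5)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
print(np.round(learn_cluster_obfuscation(p_gy, D, budget=0.2), 3))

The distortion constraint is linear in P and the relative-entropy objective is jointly convex in its arguments, so the program stays convex, consistent with the claim above.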

5.3 Probabilistic Historical Data Obfuscation

Since the learned obfuscation function is based on user clusters, we still need to bridge the gap between clusters and users to obfuscate individual public data vectors. Algorithm 2 describes the probabilistic data obfuscation process. Specifically, for a user u, we first obtain the corresponding obfuscation function p_{Ĝ|G,Y} to protect her private data Y (Lines 2-3). We then obfuscate her cluster G to another cluster Ĝ based on the obfuscation function p_{Ĝ|G,Y}(Ĝ|G) (Lines 4-5). Finally, since all users in cluster Ĝ share similar public data vectors, we randomly select one user û in the cluster Ĝ, and leverage her public data vector V^û to obfuscate (replace) V^u (Lines 6-7).

Algorithm 2 Probabilistic historical data obfuscation
Require: Obfuscation functions p_{Ĝ|G} for all possible Y
1: for u ∈ U do
2:    Get u-specified private data Y
3:    Get obfuscation function p_{Ĝ|G,Y} for Y
4:    Get u's cluster G, where u ∈ G
5:    Obfuscate the user's cluster G to Ĝ based on p_{Ĝ|G,Y}(Ĝ|G)
6:    Randomly select a user û in cluster Ĝ
7:    Obfuscate V^u to V^û (i.e., release V^û in place of V^u)
8: end for
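Applying the learned function then reduces to categorical sampling. Below is a minimal sketch of Algorithm 2, assuming NumPy, the cluster labels from Section 5.1, and a learned matrix P as above; names are illustrative, and every cluster is assumed to be non-empty.

import numpy as np

def obfuscate_historical(V, labels, P, rng=None):
    """Sketch of Algorithm 2: for every user, sample a target cluster from
    p_{G_hat|G,Y}, pick a random user of that cluster, and release that
    user's public data vector instead of the original one.

    V:      |U| x |I| matrix of historical public data vectors.
    labels: length-|U| array of cluster indices in [0, |G|).
    P:      |G| x |G| matrix with P[g_hat, g] = p_{G_hat|G,Y}(g_hat | g).
    """
    rng = rng or np.random.default_rng()
    n_clusters = P.shape[0]
    members = [np.flatnonzero(labels == g) for g in range(n_clusters)]
    released = np.empty_like(V)
    for u, g in enumerate(labels):
        probs = P[:, g] / P[:, g].sum()             # guard against numerical drift
        g_hat = rng.choice(n_clusters, p=probs)     # obfuscate the cluster
        donor = rng.choice(members[g_hat])          # random user in cluster g_hat
        released[u] = V[donor]                      # replace V^u by V^u_hat
    return released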
6 ONLINE DATA STREAM PUBLISHING

After a user has subscribed to third-party services, the service providers have access to the user's future activity streams. Therefore, we protect her private data by obfuscating her activity stream on-the-fly. Different from historical data publishing, the streaming nature of user activity imposes the following constraint on online data obfuscation: due to the time and space efficiency requirements of real-time data publishing (i.e., single-pass processing with limited memory) [22], online data obfuscation can only be performed based on the incoming activity data instance itself (e.g., a new rating/tagging/checking-in activity on an item), without accessing the user's historical data. In other words, we want to obfuscate each activity to another one with less privacy leakage. However, a rating/tagging/checking-in activity of a user will probably lead to a certain modification of the user's public data vector, which encodes a certain type of user preference, such as the user's ratings (on a 1-5 scale), her tags/thumbs-ups (in a binary format), or her cumulative number of interactions (e.g., the number of check-ins on POIs). Therefore, obfuscating an activity to another one often causes a different ranking loss for different public data vectors (i.e., different users). Figure 4 shows a toy example where the same obfuscation causes different ranking losses for two different users' public data. Therefore, the probabilistic obfuscation function here should be personalized, i.e., the ranking loss should be measured based on the user's own public data.

Fig. 4. A toy example of online data obfuscation causing different ranking losses. The public data vector encodes the count of check-ins at POIs (i.e., the cumulative count of a user's interactions with items), and incrementally incorporates incoming check-ins from the user activity feed. Suppose the incoming activity is a check-in at POI i2, and the obfuscation maps it to i3. By adding this activity/obfuscated activity to two different users' public data, we observe different ranking losses. While this obfuscation causes no ranking loss for user a (i.e., we always find i3 > i2 > i1 before and after obfuscation), it causes some ranking loss for user b (i.e., the ranking i2 > i3 no longer holds after obfuscation).

Figure 5 shows the online data publishing process. First, by measuring the privacy leakage of the user-specified private data from each activity, we learn the personalized optimal obfuscation function such that the privacy leakage is minimized under a given distortion budget. Second, for each incoming data instance of a user from the user activity data stream, we perform the probabilistic obfuscation according to the learned obfuscation function of that user.

Fig. 5. Online data publishing.

6.1 Personalized Activity-wise Obfuscation Function Learning

The activity-wise obfuscation function is learned based on individual activities, where an activity refers to a rating/tagging/checking-in activity on an item. The idea here is to obfuscate an activity on one item using another
item with less privacy leakage. Therefore, values for the public data X and for the released public data X̂ refer to the item set I. To minimize the privacy leakage of the private data Y from public data i (i ∈ I), we follow the same privacy-utility trade-off framework as for cluster-wise obfuscation function learning (Section 5.2.1), and try to minimize I(î, Y).

The personalized obfuscation function is learned by considering individual users' public data in order to better bound the ranking loss. Let V^u + i denote the public data resulting from adding an incoming activity i to the public data V^u of user u. For an obfuscation from item i to item î, we measure the ranking loss caused by this obfuscation as K(V^u + i, V^u + î). Note that we can compute K(V^u + i, V^u + î) quickly and exactly, without the bootstrap-sampling-based approximation, as the potential pairwise ranking differences can only involve ranking pairs containing item i or item î. Subsequently, the computation complexity becomes O(n), where n is the number of items |I|.

In summary, we first empirically calculate p_{i,Y}, and then learn the optimal obfuscation function p^u_{î|i,Y} for each user u using Algorithm 3, which also contains a convex optimization problem. The complexity of solving the optimization problem in Algorithm 3 depends only on |I|.

Algorithm 3 Personalized activity-wise obfuscation function learning
Require: Joint probability p_{i,Y}, distortion budget ∆X, user public data vector V^u
1: Solve the optimization problem for p_{î|i}:
       min_{p_{î|i}} I(î, Y)
       s.t. E_{î,i}(K(V^u + i, V^u + î)) ≤ ∆X
            p_{î|i}(î|i) ∈ [0, 1], ∀i, î ∈ I
            Σ_{î∈I} p_{î|i}(î|i) = 1, ∀i ∈ I
2: return p^u_{î|i,Y}
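The O(n) ranking-loss computation used in the constraint of Algorithm 3 can be written out directly: when only the scores of items i and î differ between the two vectors, only the pairs touching those two items can disagree. A sketch under that observation (NumPy assumed, names illustrative):

import numpy as np

def ranking_loss_single_activity(V, i, i_hat, delta=1.0):
    """Normalized Kendall-tau distance K(V^u + i, V^u + i_hat): the
    distance between the user's vector after recording the true activity
    on item i and after recording the obfuscated activity on item i_hat.
    Only pairs touching i or i_hat can disagree, hence the O(|I|) cost.
    """
    if i == i_hat:
        return 0.0
    a = np.asarray(V, dtype=float).copy()
    b = a.copy()
    a[i] += delta          # vector with the real activity
    b[i_hat] += delta      # vector with the obfuscated activity
    n = len(a)
    disagree = 0
    for item in (i, i_hat):
        for k in range(n):
            if k == i or k == i_hat:
                continue
            if (a[item] - a[k]) * (b[item] - b[k]) < 0:
                disagree += 1
    if (a[i] - a[i_hat]) * (b[i] - b[i_hat]) < 0:   # the (i, i_hat) pair itself
        disagree += 1
    return disagree / (n * (n - 1) / 2)

# Toy usage: a check-in on item 1 is obfuscated to item 2.
V = np.array([0.0, 1.0, 1.0, 3.0])
print(ranking_loss_single_activity(V, 1, 2))  # ~0.167: only the (i, i_hat) pair flips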
6.2 Probabilistic Online Activity Obfuscation

Based on the learned obfuscation functions, we obfuscate each incoming activity from a user's activity stream using Algorithm 4. For each incoming activity on item i, we first obtain the corresponding obfuscation function p^u_{î|i,Y} to protect the u-specified private data Y (Lines 1-2). We then obfuscate the activity on item i based on p^u_{î|i,Y}(î|i), and map the activity onto item î as obfuscated data (Lines 3-4).

Algorithm 4 Probabilistic online activity obfuscation
Require: Obfuscation functions p^u_{î|i} for all possible Y, an incoming activity of user u on item i
1: Get u-specified private data Y
2: Get personalized obfuscation function p^u_{î|i,Y} for u and Y
3: Obfuscate i to î based on p^u_{î|i,Y}(î|i)
4: return obfuscated activity of user u on item î

7 EXPERIMENTAL EVALUATION

We empirically evaluate the effectiveness and efficiency of our framework. Specifically, based on real-world datasets, we first investigate the trade-off between privacy protection and personalization performance for ranking-based recommendation. Second, we study the continuous privacy protection performance by evaluating the privacy leakage over time. Third, we evaluate the customization performance of privacy protection by comparing the privacy leakage of user-specified private data with that of other data. Fourth, we further explore the utility guarantee for ranking-based recommendation under different loss metrics. Fifth, based on synthetic datasets, we study the impact of private data settings. Finally, we evaluate the runtime performance of our framework. We start by introducing our experimental setup below before reporting on the evaluation results.

7.1 Experimental Setup

7.1.1 Dataset

Although there are many public social media datasets available for benchmarking recommendation systems, very few of them provide the corresponding private data for privacy studies. Therefore, we collected our own dataset from a location-based social network, Foursquare, for evaluation. Specifically, users can share their real-time presence on Foursquare by checking-in at Points of Interest (POIs), e.g., a bar or a supermarket. Such spatiotemporal user activity data are widely used for enabling various personalized recommendations [23], [24], [25], as check-in data can be regarded as "foot rating", where a higher visiting frequency at a POI implies a more positive preference. Using the method described in [26], [27], we crawled Foursquare check-in data via Twitter Public Streams (https://dev.twitter.com/streaming/public) for about 18 months (from Apr. 2012 to Sep. 2013) in two big cities (i.e., New York City and Tokyo), and consider them as public data. Table 1 shows the statistics of the resulting datasets we collected.

TABLE 1
Characteristics of the experimental datasets

    Dataset          New York City    Tokyo
    User number      3,669            6,870
    POI number       1,861            2,811
    Check-in number  893,722          1,290,445

In addition, we also collected the corresponding user profile data as private data. Due to Foursquare's privacy policy, only limited profile data (i.e., name and gender) is included in the check-ins. Fortunately, as the dataset is collected via Twitter, we also have access to the corresponding Twitter profiles, which typically include additional information such as the number of followers and "followings". In this paper, due to the limited availability of user profile data in the collected dataset, we define two attributes as private, i.e., gender (male/female) and social status [28] (a yardstick to measure the popularity of a user in a social network). For a user u, social status is computed as the ratio of the number of u's followers to the number of users u follows (i.e., "followings"): social(u) = #followers(u) / #followings(u). We also discretize u's social status as popular (social(u) > 1) and non-popular (social(u) ≤ 1). We note that our framework is not limited to these two types of private data, and it can incorporate any categorical attributes as private data.

7.1.2 Evaluation Use Cases and Metrics

Privacy evaluation is traditionally based on simulations, and tries to show that the defined privacy is satisfied with a reasonable computation overhead [29]. In this paper, we take a step forward to quantitatively evaluate both our
privacy protection and data utility. Specifically, we implement two inference attack methods to directly assess the performance of our privacy protection, and use two real-world ranking-based recommendation use cases to evaluate the resulting utility of the obfuscated data.

Privacy. Inference attacks [3] on private data try to infer a user's private information Y (e.g., gender) from her released public data X̂, which can be regarded as a classification problem for discrete data. Therefore, we adopt here two common classification algorithms as inference attack methods, namely Support Vector Machine (SVM) and Naive Bayes (NB). We assume that adversaries have trained their classifiers based on the original public data X and private data Y from some non-privacy-conscious users [13], who do not care about their privacy and publish all their data. We randomly sample 50% of all users as such non-privacy-conscious users for training the classifiers, and then perform inference attacks on the private data Y of the rest of the users based on their obfuscated activity data X̂. We use the Area Under the Curve (AUC), which is a widely used metric for classification problems [30], to evaluate the performance of the inference attacks. We report the value (1-AUC) as a privacy protection metric in the experiments. A higher value of (1-AUC) implies better privacy protection. The ideal privacy protection is achieved when AUC = 0.5 (i.e., 1 − AUC = 0.5), which implies that any inference attack method performs no better than a random guess.

Utility. In this work, utility refers to the ranking-based recommendation performance. We select two typical use cases, i.e., POI recommendation [23] and context-aware activity recommendation [24], as our target scenarios.

• POI Rec. POI recommendation [23] tries to recommend to a user a list of POIs that she would be interested in. To implement this use case, we first consider the cumulative check-in number of a user on a POI as the rating (i.e., the cumulative number of interactions as a preference score) to build a user-POI matrix, and then leverage a Bayesian personalized ranking algorithm [31] to predict the ranked list. Note that POI Rec is a common use case for user-item recommendation.

• Activity Rec. Context-aware activity recommendation [24] tries to come up with a list of activities (represented by POI categories, e.g., restaurant or bar) that a given user may be interested in based on her current context (i.e., location and time). We first discretize the context (i.e., time slots and location grid cells) of check-in data to build a user-context-activity tensor using the 0/1-based scheme (i.e., binary format of preference scores), and then leverage a ranking tensor factorization algorithm [32] for ranking prediction.

For both use cases, we first randomly split the original public data X into a training dataset Xtrain (80%) and a test set Xtest (20%), and then use our framework to obfuscate Xtrain into X̂train. Subsequently, we apply the recommendation algorithms on the obfuscated data X̂train, and then make predictions for the test dataset Xtest, which represents the users' true preferences. Our goal is to verify that the obfuscated data X̂train can still be used to accurately predict the users' true preferences in Xtest. To evaluate the quality of the resulting recommendations, we use Mean Average Precision (MAP) [18], which is a widely used metric in information retrieval to assess the quality of rankings. A higher value of MAP implies better performance. Each reported result is the mean value of ten repeated trials.
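For concreteness, the two metrics can be computed as follows; scikit-learn is assumed for the AUC, and the MAP helper is a standard average-precision computation rather than code from the paper.

import numpy as np
from sklearn.metrics import roc_auc_score

def privacy_score(y_true, attack_scores):
    """1 - AUC of the inference attack; 0.5 means the attack is no better
    than random guessing (the ideal protection level)."""
    return 1.0 - roc_auc_score(y_true, attack_scores)

def mean_average_precision(ranked_items_per_user, relevant_items_per_user):
    """MAP over users: for each user, the average precision of the predicted
    ranking against the held-out (test) items, then averaged over users."""
    ap_values = []
    for user, ranked in ranked_items_per_user.items():
        relevant = relevant_items_per_user.get(user, set())
        if not relevant:
            continue
        hits, precision_sum = 0, 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                hits += 1
                precision_sum += hits / rank
        ap_values.append(precision_sum / len(relevant))
    return float(np.mean(ap_values))

# Toy usage.
print(privacy_score([0, 1, 1, 0], [0.2, 0.7, 0.4, 0.6]))              # 0.25
print(mean_average_precision({"u1": ["a", "b", "c"]}, {"u1": {"a", "c"}}))  # ~0.83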
7.1.3 Baseline Approaches

In order to demonstrate the effectiveness of our framework, we compare it with the following baselines:

• Random obfuscation (Rand). For historical data obfuscation, it randomly obfuscates each user public data vector V^u to another V^u' with a given probability p_rand. For online activity obfuscation, it randomly obfuscates each user activity on item i to another item i' with probability p_rand. Here, p_rand controls the distortion budget in both cases.

• Frapp [33]. It is a generalized matrix-theoretic framework of data perturbation for privacy-preserving mining. Its key idea is to obfuscate one's activity data to itself with a higher probability than to others. For historical data obfuscation, it obfuscates a user u's public data vector V^u to V^u' with probability p_frapp = γe if u = u', and p_frapp = e otherwise. Here e is used for probability normalization, i.e., e = 1/(γ + |U| − 1). For online data obfuscation, it obfuscates each activity on item i to another item i' with probability p_frapp = γe if i = i', and p_frapp = e otherwise (here e = 1/(γ + |I| − 1)). The distortion budget is controlled by γ in both cases.

• Differential privacy (Diff) [12] is a state-of-the-art method to protect privacy regardless of the adversary's prior knowledge. It can be implemented for different types of data, such as numeric data [34], categorical data [35], set-valued data [36] or location data [37]. Here we adopt the exponential mechanism [35], [38], [39] in our experiments as it fits our use case of categorical data. More importantly, it is straightforward to implement for online data obfuscation, which we explicitly consider in PrivRank. We exclude other sophisticated differential privacy methods for categorical data (such as [40]), as they do not handle online data obfuscation. We implement the exponential mechanism as follows. For historical data obfuscation, it obfuscates V^u to V^u' with a probability that decreases exponentially with the distance d(V^u, V^u'), i.e., p_diff(V^u'|V^u) ∝ exp(−βd(V^u, V^u')), where β ≥ 0. β actually controls the distortion budget. The exponential mechanism satisfies 2βd_max-differential privacy, where d_max = max_{u,u'∈U} d(V^u, V^u'). For online activity obfuscation, this method obfuscates each activity of user u on item i with p_diff(i'|i) ∝ exp(−βd(V^u + i', V^u + i)). Here we also use the Kendall-τ distance for d(·,·). The distortion budget is controlled by β in both cases.
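For the online variant of the Diff baseline, the obfuscation distribution over candidate items follows directly from the definition above. A minimal sketch (NumPy assumed, names illustrative), taking precomputed ranking distances to each candidate item as input:

import numpy as np

def exponential_mechanism_probs(distances, beta):
    """Obfuscation distribution of the Diff baseline for one incoming
    activity: p(i'|i) is proportional to exp(-beta * d(V^u + i', V^u + i)),
    where distances[i'] holds the precomputed Kendall-tau distance for
    each candidate item i'. A larger beta means less distortion and a
    weaker guarantee (2*beta*d_max-differential privacy)."""
    logits = -beta * np.asarray(distances, dtype=float)
    logits -= logits.max()                 # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy usage: 4 candidate items; the true item has distance 0 to itself.
d = np.array([0.0, 0.1, 0.4, 0.7])
for beta in (0.0, 5.0, 50.0):
    print(beta, np.round(exponential_mechanism_probs(d, beta), 3))
# beta=0 yields a uniform choice; a large beta concentrates on the true item.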
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

(a) POI Rec with SVM (b) POI Rec with NB (c) Activity Rec with SVM (d) Activity Rec with NB
Fig. 6. Privacy-Utility trade-off with different methods on the NYC dataset (All baselines are significantly outperformed by PrivRank at the **0.01 or *0.05 level (p-value) with paired t-test.)

7.2 Privacy & Utility Trade-off
In this experiment, we vary the parameters that control the distortion budget for our method and the baselines, and observe the resulting trade-off between privacy (AUC) and utility (MAP). As our framework and the baselines use different parameters to control the distortion budget, they are not directly comparable. Therefore, we tune the obfuscation budgets of the different methods to directly show the privacy-utility trade-off, which is more informative for comparison. We consider gender as private data in all experiments unless explicitly mentioned otherwise. For the sake of runtime efficiency, we empirically select the number of user clusters |G| and the number of pair samples |S| when approximating the Kendall-τ distance for each use case: |G| = 200, |S| = 10^4 for POI Rec and |G| = 200, |S| = 10^3 for Activity Rec. The selection of these parameters is discussed below in the experiment on runtime performance in Section 7.7.
Figure 6 shows the privacy-utility trade-off results for the different privacy-preserving data obfuscation methods on the NYC dataset. First, we clearly observe the trade-off between privacy protection and the utility of enabling ranking-based recommendation for all methods. On one hand, better privacy protection can be achieved with a higher distortion budget, as highly distorted public data makes it harder for adversaries to infer user private data. On the other hand, higher distortion budgets incur a higher loss of data utility, as highly distorted public data also prevents recommendation algorithms from accurately predicting user preferences. Second, we observe that compared to other methods, PrivRank consistently achieves better privacy protection and higher utility at the same time (i.e., the resulting data points are closer to the upper-right corner of the plot) in all cases. We also conduct a paired t-test between each baseline method and PrivRank, and find that PrivRank significantly outperforms all baselines at either the 0.01 or 0.05 level (p-value). Due to the space limitation, we only show results obtained for the NYC dataset in all the experiments (results for the Tokyo dataset are similar).
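For reference, the two evaluation axes used throughout this section could be computed along the following lines. This is a hedged sketch only: the SVM attacker configuration and the helper names are assumptions, and the paper's exact attack features and recommendation pipeline are not reproduced here.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def privacy_one_minus_auc(released_profiles, private_labels, seed=0):
    # Train an inference attack on the released (obfuscated) data and report 1-AUC:
    # the closer to 0.5, the less the attacker learns about the private attribute.
    X_tr, X_te, y_tr, y_te = train_test_split(
        released_profiles, private_labels, test_size=0.2, random_state=seed)
    attack = SVC(probability=True, random_state=seed).fit(X_tr, y_tr)
    return 1.0 - roc_auc_score(y_te, attack.predict_proba(X_te)[:, 1])

def mean_average_precision(ranked_lists, relevant_sets):
    # MAP over users: average precision of each user's ranked recommendation list.
    aps = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits, precisions = 0, []
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(aps))

A paired significance test like the one reported in Fig. 6 could then be run on per-fold (or per-user) scores of PrivRank and each baseline, e.g., with scipy.stats.ttest_rel.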
7.3 Privacy Protection over User Activity Streams
To study the privacy protection performance over time, we first obfuscate the historical data using our method, and then compare different online obfuscation methods for the future activity streams. Specifically, we select the first 14 months of data as historical data, and the last 4 months of data as activity streams. Such a setting corresponds to the case where a user subscribes to a third-party service at the end of the 14th month, and the service can access her activity stream from that time on. To obfuscate the historical data, we set the distortion budget of our method to 0.2, leading to a privacy protection performance of 1-AUC=0.44 (with SVM) and a utility of MAP=0.05 for POI Rec, and 1-AUC=0.45 (with SVM) and MAP=0.4 for Activity Rec. Based on the obfuscated historical data, we now focus on how different online data obfuscation methods perform over time.
First, as our method and the baselines use different parameters to control the distortion budget for online data obfuscation, we tune these parameters to maintain the same level of data utility (i.e., MAP=0.05 for POI Rec and MAP=0.4 for Activity Rec), and show the privacy protection performance over time. Figure 7 shows the results on the NYC dataset. We observe that although historical data obfuscation can effectively protect user private data (at time 0), the privacy protection performance rapidly decreases over time if attackers observe the actual user activity streams (shown as No privacy in Figure 7). To provide continuous privacy protection, online data obfuscation can actually alleviate this problem to some extent, as we observe a slower decrease of the privacy protection performance over time for all data obfuscation methods. More importantly, our method outperforms all baselines by achieving the highest values of 1-AUC over time in all cases. However, all the methods still show decreasing privacy protection performance over time.
Second, we maintain the same level of privacy protection for our online data obfuscation method by increasing the distortion budget, and obtain PrivRank (MAP=0.043) for POI Rec and PrivRank (MAP=0.36) for Activity Rec (as shown in Figure 8). This observation shows that, for the same level of privacy protection, online data obfuscation needs to sacrifice more utility (caused by a higher data distortion budget) than historical data obfuscation. This is due to the real-time data publishing constraint, i.e., online obfuscation can only be done based on the incoming activity data instance itself, without an overview of the user's whole public data history. Meanwhile, PrivRank still outperforms all baseline methods by achieving the best privacy protection performance over time under the same data utility guarantee.
7.4 Performance of Customized Privacy Protection
Since users often have different privacy requirements, our framework is designed to protect user-specific private data. In this experiment, we consider the aforementioned two types of private data, i.e., gender and social status. Specifically, users have three options according to the types of private data to be protected: 1) protecting gender only (PrivRank-Gender); 2) protecting social status only (PrivRank-Social); 3) protecting both gender and social status (PrivRank-Both).

(a) POI Rec with SVM (b) POI Rec with NB (c) Activity Rec with SVM (d) Activity Rec with NB
Fig. 7. Privacy protection performance over time on NYC dataset (maintaining the same level of data utility)

(a) POI Rec with SVM (b) POI Rec with NB (c) Activity Rec with SVM (d) Activity Rec with NB
Fig. 8. Privacy protection performance over time on NYC dataset (maintaining the same level of privacy protection for PrivRank)

We configure our framework with those three settings, and report on the customized privacy protection performance. We tune the distortion budget for all the data obfuscation methods to keep the same data utility, i.e., MAP=0.05 for POI Rec and MAP=0.4 for Activity Rec.
Figure 9 shows the privacy protection results for both gender and social status on the NYC dataset. We observe that PrivRank-Gender (or PrivRank-Social) outperforms all other methods when protecting the targeted gender (or social status) data, by achieving the highest values of 1-AUC. In particular, compared to PrivRank-Both, which treats both data as private, PrivRank-Gender (or PrivRank-Social) can provide better privacy protection on gender (or social status). In other words, better privacy protection can be achieved under the same data utility guarantee when less private data has to be protected.
In addition, we observe that different types of private data suffer from different levels of privacy leakage. For example, Figure 9 shows that a user's gender can be inferred more accurately than her social status. In practice, this observation can be used to help users decide which private data should be protected, by providing them with a quantitative metric based on AUC to indicate the privacy leakage of all potential private data.

(a) POI Rec with SVM (b) POI Rec with NB (c) Activity Rec with SVM (d) Activity Rec with NB
Fig. 9. Customization performance of privacy protection

7.5 Utility with Different Loss Metrics
In this experiment, we study the privacy-utility trade-off using our method with different loss metrics, including the Euclidean distance, squared L2 distance, cosine distance, Jensen-Shannon distance (JSD), and two ranking-based metrics, i.e., Spearman correlation and Kendall-τ distance. We keep the other parameters the same as in the previous experiments.
Figure 10 shows the privacy-utility trade-off with different loss metrics. First, we observe that ranking-based loss metrics outperform non-ranking-based metrics by achieving better privacy protection and utility at the same time. In other words, bounding the ranking loss incurred from data obfuscation can better preserve the ranking relations in the public data, which leads to a smaller utility loss in learning-to-rank algorithms. Moreover, our proposed method using the Kendall-τ distance achieves the best results in all cases, as most learning-to-rank algorithms actually rely on pairwise/listwise ranking relations in the training data [16], which are optimally preserved by our method.
Compared to our previous work [2] that uses the Jensen-Shannon distance, PrivRank can effectively improve the utility of ranking-based recommendation under the same level of privacy protection. For example, for a given privacy protection of 1-AUC=0.4 for the POI Rec use case, PrivRank shows an improvement of 9% and 12% in MAP under the attack methods SVM and NB, respectively.
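As an illustration of the loss metrics compared in this experiment, the sketch below evaluates each of them on an original and an obfuscated preference vector; the definitions follow common usage (e.g., SciPy's implementations) and may differ in detail from the paper's own implementation.

import numpy as np
from scipy.spatial.distance import euclidean, cosine, jensenshannon
from scipy.stats import spearmanr, kendalltau

def loss_metrics(v, v_obf, eps=1e-12):
    # Normalize to probability distributions for the Jensen-Shannon distance.
    p = v / (v.sum() + eps)
    q = v_obf / (v_obf.sum() + eps)
    rho, _ = spearmanr(v, v_obf)
    tau, _ = kendalltau(v, v_obf)
    return {
        "euclidean": euclidean(v, v_obf),
        "squared_l2": float(np.sum((v - v_obf) ** 2)),
        "cosine": cosine(v, v_obf),
        "jsd": jensenshannon(p, q),
        # Ranking-based losses: map correlations in [-1, 1] to distances in [0, 1].
        "spearman": (1.0 - rho) / 2.0,
        "kendall_tau": (1.0 - tau) / 2.0,
    }

# Toy usage on non-negative preference scores over 20 items.
rng = np.random.default_rng(1)
v = rng.random(20)
print(loss_metrics(v, v + 0.1 * rng.random(20)))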
7.6 Impact of Private Data Setting
In this experiment, we study the impact of different settings for private data, i.e., the number of private attributes and the size of their domain. As our real-world Foursquare dataset has a very limited number of private attributes (only two), here we use synthetic datasets generated by the IBM synthetic data generator for itemsets², which is originally designed for frequent itemset mining.

2. https://github.com/zakimjz/IBMGenerator

(a) POI Rec with SVM (b) POI Rec with NB (c) Activity Rec with SVM (d) Activity Rec with NB
Fig. 10. Privacy-Utility trade-off with different loss metrics for data obfuscation (NYC dataset)

(a) Number of private attributes (b) Private attribute domain size
Fig. 11. Impact of different settings for private data

We generate 50K transactions with 100 items per transaction on average (1K different items in total), and keep the other parameters at their defaults. We regard each transaction as a user, and randomly sample m items as private attributes while regarding the rest as public data. The utility is computed using the POI Rec task, where we now want to recommend items to users. We tune the parameters for each method to let it have the same utility (MAP=0.16).
Figure 11(a) shows the average 1-AUC over all private attributes w.r.t. the number of private attributes. Here we set the domain size of the private attributes to two. We observe that PrivRank outperforms all baselines, while its performance slightly decreases with an increasing number of private attributes. With a small number of private attributes, PrivRank is able to put more focus on protecting the specified data by minimizing its privacy leakage from public data, and thus provides better customized privacy protection.
Figure 11(b) shows the impact of the domain size for one private attribute. We observe that PrivRank still outperforms all baselines. Similar to the impact of the number of private attributes, the performance of PrivRank also slightly decreases with increasing domain sizes, because a larger domain size implies less obfuscation budget for each value of the private attribute.

7.7 Runtime Performance
As our framework includes both a historical and an online data publishing module, we separately discuss their runtime performance. The prototype of our framework is implemented on a commodity PC (Intel Core i7-[email protected], 16GB RAM, OS X), running MATLAB and CVX with MOSEK [21].

(a) Privacy protection (b) Total learning time
Fig. 12. Runtime and privacy performance for POI Rec with SVM

7.7.1 Historical Data Publishing
For learning the obfuscation function, we adopt a user clustering step to reduce the problem complexity and a bootstrap sampling process for fast computation of the Kendall-τ ranking loss. Subsequently, we study the total learning time of the obfuscation function w.r.t. the number of user clusters |G|, the bootstrap sampling size |S| and the number of users |U|.
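The bootstrap sampling step just mentioned can be illustrated with the following minimal sketch, which approximates the Kendall-τ ranking loss from |S| randomly sampled item pairs instead of enumerating all pairs; the exact sampling scheme used in the paper may differ.

import numpy as np

def sampled_kendall_tau_distance(v, v_obf, num_pairs, rng=None):
    # Estimate the fraction of item pairs whose relative order differs between the
    # original preference vector v and its obfuscated version v_obf.
    rng = rng or np.random.default_rng()
    n = len(v)
    i = rng.integers(0, n, size=num_pairs)
    j = rng.integers(0, n, size=num_pairs)
    mask = i != j  # discard degenerate pairs
    disagree = np.sign(v[i[mask]] - v[j[mask]]) != np.sign(v_obf[i[mask]] - v_obf[j[mask]])
    return float(disagree.mean())

# Toy usage with |S| = 10^4 sampled pairs (the value used for POI Rec in Section 7.2).
rng = np.random.default_rng(2)
v = rng.random(500)
v_obf = v + 0.05 * rng.random(500)
print(sampled_kendall_tau_distance(v, v_obf, num_pairs=10_000, rng=rng))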
First, by varying |G| and |S|, we report the privacy protection results (1-AUC) and the total learning time (including user clustering, Kendall-τ distance computation and obfuscation function learning) for the NYC dataset with POI Rec, the NYC dataset with Activity Rec and the synthetic dataset, in Figures 12, 13 and 14, respectively. For the NYC dataset, we fix the utility at MAP=0.05 for POI Rec and MAP=0.4 for Activity Rec, and consider gender as private data. For the synthetic dataset, we fix the utility at MAP=0.16, and randomly sample one item as private data. On one hand, we observe that the privacy protection performance improves when increasing |G| and |S|. Specifically, a larger |G| can capture finer-grained user groups, which allows our method to find the optimal function that achieves better privacy protection under the same distortion budget. Meanwhile, a larger bootstrap sampling size |S| can approximate the actual Kendall-τ distance more accurately, which can better measure the ranking loss incurred by the optimal obfuscation function. In addition, we also find that 1-AUC converges after a certain point. On the other hand, the total learning time continuously increases when increasing |G| and |S|. Therefore, the two parameters |G| and |S| can be selected at the convergence point for 1-AUC.
Second, by fixing |G| and |S| to the convergence point for 1-AUC, we vary the number of users |U| in the synthetic datasets and report the learning time in Figure 15(a). We observe that only the user clustering time increases linearly with |U|, while the Kendall-τ distance computation and obfuscation function learning times are independent of |U|. PrivRank can easily scale up to 100K users (taking 3,378 seconds on our test PC). Note that learning the optimal obfuscation function is an offline step, and it only depends on the joint probability p_{G,Y}. In practice, we can regularly update the obfuscation function to sustain its effectiveness.
Based on the learned obfuscation function, the probabilistic data obfuscation in Algorithm 2 can be performed very efficiently (i.e., 3.3ms per user in all cases).

TABLE 2
Runtime performance for online data publishing
Dataset (Utility)             | NYC (POI Rec) | NYC (Activity Rec) | TKY (POI Rec) | TKY (Activity Rec)
Obfuscation function learning | 662 sec       | 120 sec            | 1,438 sec     | 276 sec

(a) Privacy protection (b) Total learning time
Fig. 13. Runtime and privacy performance for Activity Rec with SVM

(a) Privacy protection (b) Total learning time
Fig. 14. Runtime and privacy performance on synthetic dataset

(a) Impact of |U| on historical data publishing (b) Impact of |I| on online data publishing
Fig. 15. Impact of |U| and |I| on the scalability

7.7.2 Online Data Publishing
The complexity of learning the obfuscation function for online data publishing (Algorithm 3) depends only on the number of items |I|. For the synthetic dataset, we vary the number of items and show its impact on the runtime performance in Figure 15(b). We observe that both the Kendall-τ distance computation time and the obfuscation function learning time increase with the number of items. PrivRank can easily scale up to a large dataset with 10K items (taking 6,291 seconds on our test PC). For the Foursquare datasets, we keep the same parameters as in the previous experiments and report the runtime performance for both the NYC and TKY datasets in Table 2. Our test PC is able to learn the optimal obfuscation function in a reasonable time in all cases. In addition, although learning the personalized obfuscation function needs to be performed for each individual for online data publishing, this offline step can be easily parallelized w.r.t. the number of users, as one user's obfuscation function learning process is independent from the others.
Due to the streaming nature of user activity data, the efficiency of probabilistic online data obfuscation is particularly important. Our method (Algorithm 4) is able to perform the obfuscation process at a high speed of 2,200 activity instances per second on all datasets, which can easily accommodate user activity streams from most social media platforms. For example, the Foursquare check-in stream had a peak-day record of 8 million check-ins/day (about 92 check-ins/sec on average) in 2016³.

3. http://blog.foursquare.com/post/142900756695/since-foursquare-launched-in-2009-there-have-been
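The reported throughput is plausible because, once the obfuscation function is learned, publishing an incoming activity reduces to a single categorical draw. The following sketch only illustrates this kind of per-activity sampling from a precomputed item-to-item obfuscation matrix; the class and variable names are assumptions, and the sketch is not a reproduction of Algorithm 4.

import numpy as np

class OnlineObfuscator:
    def __init__(self, obfuscation_matrix):
        # obfuscation_matrix[i, j]: probability of releasing item j when the true item is i.
        P = np.asarray(obfuscation_matrix, dtype=float)
        self.P = P / P.sum(axis=1, keepdims=True)  # normalize rows defensively
        self.num_items = self.P.shape[0]

    def obfuscate(self, item_id, rng=None):
        # One categorical draw per incoming activity instance.
        rng = rng or np.random.default_rng()
        return int(rng.choice(self.num_items, p=self.P[item_id]))

# Toy usage: 1,000 items with a near-identity obfuscation matrix.
n = 1000
P = np.full((n, n), 0.1 / (n - 1))
np.fill_diagonal(P, 0.9)
obfuscator = OnlineObfuscator(P)
released_item = obfuscator.obfuscate(item_id=42)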
8 CONCLUSIONS AND FUTURE WORK
This paper introduced PrivRank, a customizable and continuous privacy-preserving social media data publishing framework. It continuously protects user-specified data against inference attacks by releasing obfuscated user activity data, while still ensuring the utility of the released data to power personalized ranking-based recommendations. To provide customized protection, the optimal data obfuscation is learned such that the privacy leakage of user-specified private data is minimized; to provide continuous privacy protection, we consider both historical and online activity data publishing; to ensure the data utility for enabling ranking-based recommendation, we bound the ranking loss incurred from the data obfuscation process using the Kendall-τ rank distance. We showed through extensive experiments that PrivRank can provide an efficient and effective protection of private data, while still preserving the utility of the published data for different ranking-based recommendation use cases.
In the future, we plan to extend our framework by considering data types with continuous values rather than discretized values, and to explore further data utility beyond personalized recommendation.

ACKNOWLEDGMENTS
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 683253/GraphInt).

REFERENCES
[1] S. Salamatian, A. Zhang, F. du Pin Calmon, S. Bhamidipati, N. Fawaz, B. Kveton, P. Oliveira, and N. Taft, "How to hide the elephant-or the donkey-in the room: Practical privacy against statistical inference for large data," in Proc. of GlobalSIP. IEEE, 2013.
[2] D. Yang, D. Zhang, Q. Bingqing, and P. Cudre-Mauroux, "Privcheck: Privacy-preserving check-in data publishing for personalized location based services," in Proc. of UbiComp'16. ACM, 2016.
[3] C. Li, H. Shirani-Mehr, and X. Yang, "Protecting individual information against inference attacks in data publishing," in Advances in Databases: Concepts, Systems and Applications. Springer, 2007, pp. 422–433.
[4] B. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Computer Survey, vol. 42, no. 4, p. 14, 2010.

[5] I. A. Junglas, N. A. Johnson, and C. Spitzmüller, "Personality traits and concern for privacy: an empirical study in the context of location-based services," European Journal of Information Systems, vol. 17, no. 4, pp. 387–402, 2008.
[6] P. Cremonesi, Y. Koren, and R. Turrin, "Performance of recommender algorithms on top-n recommendation tasks," in Proc. of RecSys'10. ACM, 2010, pp. 39–46.
[7] N. Li, R. Jin, and Z.-H. Zhou, "Top rank optimization in linear time," in Advances in Neural Information Processing Systems, 2014, pp. 1502–1510.
[8] M. G. Kendall, "Rank correlation methods." 1948.
[9] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[10] L. Sankar, S. R. Rajagopalan, and H. V. Poor, "Utility-privacy tradeoffs in databases: An information-theoretic approach," IEEE Transactions on Information Forensics and Security, vol. 8, no. 6, pp. 838–852, 2013.
[11] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, p. 3, 2007.
[12] C. Dwork, "Differential privacy," in Automata, Languages and Programming. Springer, 2006, pp. 1–12.
[13] F. du Pin Calmon and N. Fawaz, "Privacy against statistical inference," in Proc. of Allerton'12. IEEE, 2012, pp. 1401–1408.
[14] A. Zhang, S. Bhamidipati, N. Fawaz, and B. Kveton, "Priview: Media consumption and recommendation meet privacy against inference attacks," IEEE Web, vol. 2, 2014.
[15] S. Salamatian, A. Zhang, F. du Pin Calmon, S. Bhamidipati, N. Fawaz, B. Kveton, P. Oliveira, and N. Taft, "Managing your private and public data: Bringing down inference attacks against your privacy," IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 7, pp. 1240–1255, 2015.
[16] W. Chen, T.-Y. Liu, Y. Lan, Z.-M. Ma, and H. Li, "Ranking measures and loss functions in learning to rank," in Proc. of NIPS, 2009, pp. 315–323.
[17] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," PNAS, vol. 95, no. 25, pp. 14863–14868, 1998.
[18] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern Information Retrieval. ACM Press New York, 1999, vol. 463.
[19] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of ir techniques," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
[20] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
[21] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control. Springer, 2008, pp. 95–110.
[22] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, "Approximate medians and other quantiles in one pass and with limited memory," in ACM SIGMOD Record, vol. 27, no. 2. ACM, 1998, pp. 426–435.
[23] D. Yang, D. Zhang, Z. Yu, and Z. Wang, "A sentiment-enhanced personalized location recommendation system," in Proc. of HT'13. ACM, 2013, pp. 119–128.
[24] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, "Modeling user activity preference by leveraging user spatial temporal characteristics in lbsns," IEEE Transactions on System, Man, Cybernetics: System, vol. 45, no. 1, pp. 129–142, 2015.
[25] Z. Yu, H. Xu, Z. Yang, and B. Guo, "Personalized travel package with multi-point-of-interest recommendation based on crowdsourced user footprints," IEEE Transactions on Human-Machine Systems, vol. 46, no. 1, pp. 151–158, 2016.
[26] D. Yang, D. Zhang, L. Chen, and B. Qu, "Nationtelescope: Monitoring and visualizing large-scale collective behavior in lbsns," Journal of Network and Computer Applications, vol. 55, pp. 170–180, 2015.
[27] D. Yang, D. Zhang, and B. Qu, "Participatory cultural mapping based on collective behavior data in location-based social networks," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 7, no. 3, p. 30, 2016.
[28] Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui, "Exploring millions of footprints in location sharing services." Proc. of ICWSM'11, vol. 2011, pp. 81–88, 2011.
[29] X. Zhao, L. Li, and G. Xue, "Checking in without worries: Location privacy in location based social networks," in Proc. of INFOCOM'13. IEEE, 2013, pp. 3003–3011.
[30] C. X. Ling, J. Huang, and H. Zhang, "Auc: a better measure than accuracy in comparing learning algorithms," in Advances in Artificial Intelligence. Springer, 2003, pp. 329–341.
[31] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "Bpr: Bayesian personalized ranking from implicit feedback," in Proc. of UAI'09. AUAI Press, 2009, pp. 452–461.
[32] D. Yang, D. Zhang, Z. Yu, and Z. Yu, "Fine-grained preference-aware location search leveraging crowdsourced digital footprints from lbsns," in Proc. of UbiComp'13. ACM, 2013, pp. 479–488.
[33] S. Agrawal and J. R. Haritsa, "A framework for high-accuracy privacy-preserving mining," in Proc. of the ICDE'05. IEEE, 2005, pp. 193–204.
[34] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[35] F. McSherry and K. Talwar, "Mechanism design via differential privacy," in Proc. of FOCS'07. IEEE, 2007, pp. 94–103.
[36] R. Chen, N. Mohammed, B. C. Fung, B. C. Desai, and L. Xiong, "Publishing set-valued data via differential privacy," PVLDB, vol. 4, no. 11, pp. 1087–1098, 2011.
[37] L. Wang, D. Yang, X. Han, T. Wang, D. Zhang, and X. Ma, "Location privacy-preserving task allocation for mobile crowdsensing with differential geo-obfuscation," in Proceedings of the 26th International Conference on World Wide Web. ACM, 2017, pp. 627–636.
[38] C. Dwork, "Differential privacy: A survey of results," in Proc. of TAMC. Springer, 2008, pp. 1–19.
[39] Z. Huang and S. Kannan, "The exponential mechanism for social welfare: Private, truthful, and nearly optimal," in Proc. of FOCS'12, 2012.
[40] Y. Shen and H. Jin, "Privacy-preserving personalized recommendation: An instance-based approach via differential privacy," in Proc. of ICDM. IEEE, 2014, pp. 540–549.

Dingqi Yang is a senior researcher in the Department of Computer Science, University of Fribourg, Switzerland. He received his Ph.D. in Computer Science from Pierre and Marie Curie University (Paris VI) and Institut Mines-TELECOM/TELECOM SudParis in 2015, where he won both the Doctorate Award and the Institut Mines-TELECOM Press Mention. His research interests lie in big social media data analytics, ubiquitous computing and smart city applications.

Bingqing Qu is a post-doc researcher in the Department of Computer Science, University of Fribourg, Switzerland. She received her Ph.D. in Computer Science from the University of Rennes 1 in 2016. Her research interests include historical document analysis, multimedia content analysis, social media data mining and computer vision.

Philippe Cudre-Mauroux is a Full Professor and the director of the eXascale Infolab at the University of Fribourg in Switzerland. He received his Ph.D. from the Swiss Federal Institute of Technology EPFL, where he won both the Doctorate Award and the EPFL Press Mention. Before joining the University of Fribourg he worked on information management infrastructures for IBM Watson Research, Microsoft Research Asia, and MIT. His research interests are in next-generation, Big Data management infrastructures for non-relational data. Webpage: http://exascale.info/phil
