t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity
Ninghui Li    Tiancheng Li
Department of Computer Science, Purdue University
{ninghui, li83}@cs.purdue.edu

Suresh Venkatasubramanian
AT&T Labs Research
[email protected]
Abstract
The k-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain identifying attributes) contains at least k records. Recently, several authors have recognized that k-anonymity cannot prevent attribute disclosure. The notion of ℓ-diversity has been proposed to address this; ℓ-diversity requires that each equivalence class has at least ℓ well-represented values for each sensitive attribute.
In this paper we show that ℓ-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We choose to use the Earth Mover's Distance measure for our t-closeness requirement. We discuss the rationale for t-closeness and illustrate its advantages through examples and experiments.
1. Introduction
Agencies and other organizations often need to publish microdata, e.g., medical data or census data, for research and other purposes. Typically, such data is stored in a table, and each record (row) corresponds to one individual. Each record has a number of attributes, which can be divided into the following three categories. (1) Attributes that clearly identify individuals. These are known as explicit identifiers and include Social Security Number, Address, Name, and so on. (2) Attributes whose values, when taken together, can potentially identify an individual. These are known as quasi-identifiers, and may include, e.g., Zip-code, Birthdate, and Gender. (3) Attributes that are considered sensitive, such as Disease and Salary.
When releasing microdata, it is necessary to prevent the sensitive information of the individuals from being disclosed. Two types of information disclosure have been identified in the literature [4, 9]: identity disclosure and attribute disclosure.
Identity disclosure occurs when an individual is linked to a particular record in the released table. Attribute disclosure occurs when new information about some individuals is revealed, i.e., the released data makes it possible to infer the characteristics of an individual more accurately than would be possible before the data release. Identity disclosure often leads to attribute disclosure: once an individual is re-identified, the corresponding sensitive values are revealed. Attribute disclosure can occur with or without identity disclosure. It has been recognized that even disclosure of false attribute information may cause harm [9]. An observer of a released table may incorrectly perceive that an individual's sensitive attribute takes a particular value, and behave accordingly based on that perception. This can harm the individual, even if the perception is incorrect.
While the released table gives useful information to researchers, it presents disclosure risk to the individuals whose data are in the table. Therefore, our objective is to limit the disclosure risk to an acceptable level while maximizing the benefit. This is achieved by anonymizing the data before release. The first step of anonymization is to remove explicit identifiers. However, this is not enough, as an adversary may already know the quasi-identifier values of some individuals in the table. This knowledge can come either from personal acquaintance (e.g., knowing a particular individual in person) or from other publicly available databases (e.g., a voter registration list) that include both explicit identifiers and quasi-identifiers. A common anonymization approach is generalization, which replaces quasi-identifier values with values that are less specific but semantically consistent. As a result, more records will have the same set of quasi-identifier values. We define an equivalence class of an anonymized table to be a set of records that have the same values for the quasi-identifiers.
To effectively limit disclosure, we need to measure the disclosure risk of an anonymized table. To this end, Samarati and Sweeney [15, 16, 18] introduced k-anonymity as the property that each record is indistinguishable from at least k − 1 other records with respect to the quasi-identifier attributes. Tables 1 and 2 show an example: Table 2 is a 3-anonymous version of Table 1.
Table 1. Original Patients Table

      ZIP Code   Age    Disease
  1   47677      29     Heart Disease
  2   47602      22     Heart Disease
  3   47678      27     Heart Disease
  4   47905      43     Flu
  5   47909      52     Heart Disease
  6   47906      47     Cancer
  7   47605      30     Heart Disease
  8   47673      36     Cancer
  9   47607      32     Cancer

Table 2. A 3-Anonymous Version of Table 1

      ZIP Code   Age    Disease
  1   476**      2*     Heart Disease
  2   476**      2*     Heart Disease
  3   476**      2*     Heart Disease
  4   4790*      ≥ 40   Flu
  5   4790*      ≥ 40   Heart Disease
  6   4790*      ≥ 40   Cancer
  7   476**      3*     Heart Disease
  8   476**      3*     Cancer
  9   476**      3*     Cancer
Machanavajjhala et al. [12] gave a number of interpretations of the term "well-represented" in this principle:

1. Distinct ℓ-diversity. The simplest interpretation of "well-represented" is to require that each equivalence class has at least ℓ distinct values for the sensitive attribute.

2. Entropy ℓ-diversity. An equivalence class E has entropy ℓ-diversity if the entropy of the distribution of sensitive values in E is at least log ℓ:

Entropy(E) = -\sum_{s \in S} p(E, s) \log p(E, s) \geq \log \ell

where S is the domain of the sensitive attribute and p(E, s) is the fraction of records in E that have sensitive value s; a short checking sketch follows this list.

3. Recursive (c, ℓ)-diversity. This ensures that the most frequent value does not appear too frequently: letting r_i denote the number of times the i-th most frequent sensitive value appears in E, the class satisfies recursive (c, ℓ)-diversity if r_1 < c(r_\ell + r_{\ell+1} + \cdots + r_m) for a given constant c.
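As a concrete illustration, the following is a minimal sketch (our own, not from the paper) of checking entropy ℓ-diversity for a single equivalence class, directly following the formula above; the function name and input format are assumptions.

```python
import math
from collections import Counter

def entropy_l_diverse(values, l):
    """values: the sensitive values of the records in one equivalence class.
    Returns True if Entropy(E) >= log(l)."""
    n = len(values)
    probs = [c / n for c in Counter(values).values()]  # p(E, s) for each s
    entropy = -sum(p * math.log(p) for p in probs)     # Entropy(E)
    return entropy >= math.log(l)
```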
3. Limitations of ℓ-Diversity

While the ℓ-diversity principle represents an important step beyond k-anonymity in protecting against attribute disclosure, it has several shortcomings that we now discuss.

ℓ-diversity may be difficult and unnecessary to achieve.

Example 2 Suppose that the original data has only one sensitive attribute: the test result for a particular virus. It takes two values: positive and negative. Further suppose that there are 10000 records, with 99% of them being negative and only 1% being positive.
Table 3. Original Salary/Disease Table

      ZIP Code   Age    Salary   Disease
  1   47677      29     3K       gastric ulcer
  2   47602      22     4K       gastritis
  3   47678      27     5K       stomach cancer
  4   47905      43     6K       gastritis
  5   47909      52     11K      flu
  6   47906      47     8K       bronchitis
  7   47605      30     7K       bronchitis
  8   47673      36     9K       pneumonia
  9   47607      32     10K      stomach cancer

Table 4. A 3-Diverse Version of Table 3

      ZIP Code   Age    Salary   Disease
  1   476**      2*     3K       gastric ulcer
  2   476**      2*     4K       gastritis
  3   476**      2*     5K       stomach cancer
  4   4790*      ≥ 40   6K       gastritis
  5   4790*      ≥ 40   11K      flu
  6   4790*      ≥ 40   8K       bronchitis
  7   476**      3*     7K       bronchitis
  8   476**      3*     9K       pneumonia
  9   476**      3*     10K      stomach cancer
Before seeing the released table, the observer has some prior belief B0 about an individual's sensitive attribute. Then, in a hypothetical step, the observer is given a completely generalized version of the data table where all attributes in a quasi-identifier are removed (or, equivalently, generalized to the most general values). The observer's belief is influenced by Q, the distribution of the sensitive attribute value in the whole table, and changes to B1. Finally, the observer is given the released table. By knowing the quasi-identifier values of the individual, the observer is able to identify the equivalence class that the individual's record is in, and to learn the distribution P of sensitive attribute values in this class. The observer's belief changes to B2.
The ℓ-diversity requirement is motivated by limiting the difference between B0 and B2 (although it does so only indirectly, by requiring that P has a level of diversity). We choose instead to limit the difference between B1 and B2. In other words, we assume that Q, the distribution of the sensitive attribute in the overall population in the table, is public information. We do not limit the observer's information gain about the population as a whole, but limit the extent to which the observer can learn additional information about specific individuals.
To justify our assumption that Q should be treated as public information, we observe that with generalizations, the most one can do is to generalize all quasi-identifier attributes to the most general value. Thus, as long as a version of the data is to be released, the distribution Q will be released.1 We also argue that if one wants to release the table at all, one intends to release the distribution Q; this distribution is what makes the data in the table useful. In other words, one wants Q to be public information. A large change from B0 to B1 means that the data table contains a lot of new information, e.g., the new data table corrects some widely held belief that was wrong. In some sense, the larger the difference between B0 and B1 is, the more valuable the data is. Since the knowledge gain between B0 and B1 is about the whole population, we do not limit this gain. We limit the gain from B1 to B2 by limiting the distance between P and Q. Intuitively, if P = Q, then B1 and B2 should be the same. If P and Q are close, then B1 and B2 should be close as well, even if B0 may be very different from both B1 and B2.
Definition 2 (The t-closeness Principle) An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.
1 Note that even with suppression, a distribution will still be released. This distribution may be slightly different from the distribution with no records suppressed; however, from our point of view, we only need to consider the released distribution and its distance from the distributions in the equivalence classes.
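To make the definition concrete, here is a minimal sketch (our own illustration, not the paper's algorithm) of checking t-closeness for a table, assuming some distance function dist(P, Q) between distributions, such as the EMD discussed below; the record format and all names are assumptions.

```python
from collections import Counter

def has_t_closeness(records, qi_attrs, s_attr, t, dist):
    """records: list of dicts mapping attribute names to values.
    Returns True if every equivalence class induced by qi_attrs is
    within distance t of the overall sensitive-value distribution."""
    def distribution(rows):
        counts = Counter(r[s_attr] for r in rows)
        return {v: c / len(rows) for v, c in counts.items()}
    overall = distribution(records)            # Q: whole-table distribution
    classes = {}
    for r in records:                          # group by quasi-identifier values
        classes.setdefault(tuple(r[a] for a in qi_attrs), []).append(r)
    return all(dist(distribution(rows), overall) <= t
               for rows in classes.values())
```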
Given two distributions P = (p_1, p_2, ..., p_m) and Q = (q_1, q_2, ..., q_m), two well-known distance measures are as follows. The variational distance is defined as:

D[P, Q] = \sum_{i=1}^{m} \frac{1}{2} |p_i - q_i|

and the Kullback-Leibler (KL) distance is defined as:

D[P, Q] = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i} = H(P, Q) - H(P)

where H(P) = -\sum_{i=1}^{m} p_i \log p_i is the entropy of P and H(P, Q) = -\sum_{i=1}^{m} p_i \log q_i is the cross-entropy of P and Q.

2 We use the notation {v1, v2, ..., vm} to denote the uniform distribution where each value in {v1, v2, ..., vm} is equally likely.
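For reference, here is a minimal sketch of these two measures, assuming P and Q are probability vectors over the same m values (with q_i > 0 wherever p_i > 0 for the KL distance):

```python
import math

def variational_distance(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_distance(p, q):
    # Terms with p_i = 0 contribute 0 by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```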
These distance measures do not reflect the semantic distance among values. Recall Example 3 (Tables 3 and 4), where the overall distribution of the Salary attribute is Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}.2 The first equivalence class in Table 4 has distribution P1 = {3k, 4k, 5k} and the second equivalence class has distribution P2 = {6k, 8k, 11k}. Our intuition is that P1 results in more information leakage than P2, because the values in P1 are all in the lower end; thus we would like to have D[P1, Q] > D[P2, Q]. The distance measures mentioned above are not able to do so, because from their point of view values such as 3k and 6k are just different points with no other semantic meaning.

In short, we have a metric space for the attribute values, so that a ground distance is defined between any pair of values. We then have two probability distributions over these values, and we want the distance between the two probability distributions to depend upon the ground distances among these values. This requirement leads us to the Earth Mover's Distance (EMD) [14], which is actually the Monge-Kantorovich transportation distance [5] in disguise.

The EMD is based on the minimal amount of work needed to transform one distribution into another by moving distribution mass. Intuitively, one distribution is seen as a mass of earth spread in the space and the other as a collection of holes in the same space; the EMD measures the least amount of work needed to fill the holes with earth, where a unit of work corresponds to moving a unit of earth by a unit of ground distance.
Formally, let P = (p_1, p_2, ..., p_m), Q = (q_1, q_2, ..., q_m), and let d_{ij} be the ground distance between element i of P and element j of Q. We want to find a flow F = [f_{ij}], where f_{ij} is the mass moved from element i to element j, that minimizes the overall work:

\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij}

subject to the following constraints:

f_{ij} \geq 0, \quad 1 \leq i \leq m, \; 1 \leq j \leq m \qquad (c1)

p_i - \sum_{j=1}^{m} f_{ij} + \sum_{j=1}^{m} f_{ji} = q_i, \quad 1 \leq i \leq m \qquad (c2)

\sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \sum_{i=1}^{m} p_i = \sum_{i=1}^{m} q_i = 1 \qquad (c3)

These three constraints guarantee that P is transformed into Q by the mass flow F. Once the transportation problem is solved, the EMD is defined to be the total work,3 i.e.,

D[P, Q] = \mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij}

One useful property of the EMD concerns the merging of equivalence classes: if E1 and E2, with distributions P1 and P2, are merged into a class with distribution P, then

D[P, Q] \leq \frac{|E_1|}{|E_1| + |E_2|} D[P_1, Q] + \frac{|E_2|}{|E_1| + |E_2|} D[P_2, Q]

so merging equivalence classes never increases the distance to the overall distribution.

3 More generally, the EMD is the total work divided by the total flow. However, since we are calculating the distance between two probability distributions, the total flow is always 1, as shown in formula (c3).
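As an illustration, the transportation problem above can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch (our own, not the paper's implementation, and assuming NumPy and SciPy are available); it encodes constraint (c2) as equality constraints over the m² flow variables and reproduces the value D[P1, Q] = 0.375 used later in the analysis of Example 3.

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, d):
    """EMD between distributions p and q under ground-distance matrix d,
    solved as the transportation LP (c1)-(c3)."""
    m = len(p)
    c = np.asarray(d, dtype=float).flatten()   # objective: sum d_ij * f_ij
    # Constraint (c2): sum_j f_ij - sum_j f_ji = p_i - q_i for each i.
    A_eq = np.zeros((m, m * m))
    for i in range(m):
        A_eq[i, i * m:(i + 1) * m] += 1.0      # outflow from element i
        A_eq[i, i::m] -= 1.0                   # inflow into element i
    b_eq = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Ground distance of Example 3: values v1 = 3k, ..., v9 = 11k, d_ij = |i-j|/8.
m = 9
d = np.abs(np.subtract.outer(np.arange(m), np.arange(m))) / (m - 1)
Q = np.full(m, 1 / 9)                          # overall distribution
P1 = np.zeros(m); P1[:3] = 1 / 3               # {3k, 4k, 5k}
print(round(emd(P1, Q, d), 3))                 # 0.375
```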
Equal Distance: The ground distance between any two distinct values of a categorical attribute is defined to be 1 (it is easy to verify that this is a metric). For this ground distance, the EMD reduces to the variational distance:

D[P, Q] = \frac{1}{2} \sum_{i=1}^{m} |p_i - q_i| = \sum_{p_i \geq q_i} (p_i - q_i) = -\sum_{p_i < q_i} (p_i - q_i)
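A quick cross-check of this identity, assuming the emd() sketch above:

```python
# With equal ground distance (d_ij = 1 for i != j), the EMD coincides
# with the variational distance.
d_eq = 1.0 - np.eye(3)
P = np.array([0.5, 0.3, 0.2]); Qc = np.array([0.2, 0.3, 0.5])
print(round(emd(P, Qc, d_eq), 3))              # 0.3
print(round(0.5 * np.abs(P - Qc).sum(), 3))    # 0.3, the same value
```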
Hierarchical Distance: The distance between two values of a categorical attribute is based on the minimum level at which these two values are generalized to the same value according to the domain hierarchy. Mathematically, let H be the height of the domain hierarchy; the distance between two values v1 and v2 (which are leaves of the hierarchy) is defined to be level(v1, v2)/H, where level(v1, v2) is the height of the lowest common ancestor node of v1 and v2. It is straightforward to verify that this hierarchical-distance measure is also a metric.
Given a domain hierarchy and two distributions P and Q, we define the extra of a leaf node that corresponds to element i to be p_i − q_i, and the extra of an internal node N to be the sum of the extras of the leaf nodes below N. This extra function can be defined recursively as:

extra(N) = \begin{cases} p_i - q_i & \text{if } N \text{ is the leaf for element } i \\ \sum_{C \in Child(N)} extra(C) & \text{otherwise} \end{cases}

where Child(N) is the set of children of node N.
We further define two functions on internal nodes:

pos\_extra(N) = \sum_{C \in Child(N) \,\wedge\, extra(C) > 0} |extra(C)|

neg\_extra(N) = \sum_{C \in Child(N) \,\wedge\, extra(C) < 0} |extra(C)|

We use cost(N) to denote the cost of moving mass between N's children branches. An optimal flow moves exactly extra(N) into or out of the subtree rooted at N. Suppose that pos_extra(N) > neg_extra(N); then extra(N) = pos_extra(N) − neg_extra(N), and this amount needs to move out of the subtree. (This cost is counted in the cost of N's parent node.) In addition, one has to move neg_extra(N) among the children branches to even them out; thus,

cost(N) = \frac{height(N)}{H} \min(pos\_extra(N), neg\_extra(N))

and the EMD under the hierarchical distance is the sum of cost(N) over all internal nodes N.
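The recursion above translates directly into code. The following is a minimal sketch (our own; the Node representation and the toy hierarchy are assumptions) that computes extra() bottom-up while accumulating cost(N):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    height: int                  # height of this node in the hierarchy
    children: list = field(default_factory=list)
    leaf_index: int = None       # set only for leaf nodes

def hierarchical_emd(root, p, q, H):
    total = 0.0
    def extra(n):
        nonlocal total
        if n.leaf_index is not None:             # leaf: extra = p_i - q_i
            return p[n.leaf_index] - q[n.leaf_index]
        ex = [extra(c) for c in n.children]
        pos = sum(e for e in ex if e > 0)        # pos_extra(N)
        neg = -sum(e for e in ex if e < 0)       # neg_extra(N)
        total += (n.height / H) * min(pos, neg)  # cost(N)
        return pos - neg                         # extra(N)
    extra(root)
    return total

# Toy two-level hierarchy over four leaf values:
leaves = [Node(0, leaf_index=i) for i in range(4)]
root = Node(2, [Node(1, leaves[:2]), Node(1, leaves[2:])])
print(hierarchical_emd(root, [0.5, 0.5, 0, 0], [0.25] * 4, H=2))  # 0.5
```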
Table 5. A table that has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease

      ZIP Code   Age    Salary   Disease
  1   4767*      ≤ 40   3K       gastric ulcer
  3   4767*      ≤ 40   5K       stomach cancer
  8   4767*      ≤ 40   9K       pneumonia
  4   4790*      ≥ 40   6K       gastritis
  5   4790*      ≥ 40   11K      flu
  6   4790*      ≥ 40   8K       bronchitis
  2   4760*      ≤ 40   4K       gastritis
  7   4760*      ≤ 40   7K       bronchitis
  9   4760*      ≤ 40   10K      stomach cancer
6. Experiments

We now revisit Example 3 in Section 3, to show how t-closeness with EMD handles the difficulties of ℓ-diversity. Recall that Q = {3k, 4k, 5k, 6k, 7k, 8k, 9k, 10k, 11k}, P1 = {3k, 4k, 5k}, and P2 = {6k, 8k, 11k}. We calculate D[P1, Q] and D[P2, Q] using EMD. Let v1 = 3k, v2 = 4k, ..., v9 = 11k; we define the distance between vi and vj to be |i − j|/8, so that the maximal distance is 1. We have D[P1, Q] = 0.375 and D[P2, Q] = 0.167.4
For the Disease attribute, we use the hierarchy in Figure 1 to define the ground distances. For example, the distance between Flu and Bronchitis is 1/3, the distance between Flu and Pulmonary embolism is 2/3, and the distance between Flu and Stomach cancer is 3/3 = 1. Then the distance between the distribution {gastric ulcer, gastritis, stomach cancer} and the overall distribution is 0.5, while the distance between the distribution {gastric ulcer, stomach cancer, pneumonia} and the overall distribution is 0.278.

Table 5 shows another anonymized version of Table 3. It has 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease. The Similarity Attack is prevented in Table 5.
4 One optimal mass flow that transforms P1 to Q is to move 1/9 probability mass across the following pairs: (5k → 11k), (5k → 10k), (5k → 9k), (4k → 8k), (4k → 7k), (4k → 6k), (3k → 5k), (3k → 4k). The cost of this is 1/9 × (6 + 5 + 4 + 4 + 3 + 2 + 2 + 1)/8 = 27/72 = 3/8 = 0.375.
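The second value can be checked the same way with the emd() sketch from Section 5 (Q and d as defined there):

```python
P2 = np.zeros(9); P2[[3, 5, 8]] = 1 / 3   # {6k, 8k, 11k} among v1..v9
print(round(emd(P2, Q, d), 3))            # 0.167
```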
Table 6. Description of the Adult dataset used in the experiments

      Attribute        Type          # of values   Height
  1   Age              Numeric       74            5
  2   Workclass        Categorical   8             3
  3   Education        Categorical   16            4
  4   Country          Categorical   41            3
  5   Marital Status   Categorical   7             3
  6   Race             Categorical   5             3
  7   Gender           Categorical   2             2
  8   Occupation       Sensitive     14            3
  9   Salary           Sensitive     2             2
We also generated all 26 minimal recursive (4, 4)-diversity tables, and found that 17 of them are vulnerable to the similarity attack.
Efficiency  We compare the efficiency and data quality of five privacy measures: (1) k-anonymity; (2) entropy ℓ-diversity; (3) recursive (c, ℓ)-diversity; (4) k-anonymity with t-closeness (t = 0.2); and (5) k-anonymity with t-closeness (t = 0.15).
Results of the efficiency experiments are shown in Figure 3. Again we use the Occupation attribute as the sensitive attribute. Figure 3(a) shows the running times with fixed k = 5, ℓ = 5 and varied quasi-identifier size s, where 2 ≤ s ≤ 7. A quasi-identifier of size s consists of the first s attributes listed in Table 6. Figure 3(b) shows the running times of the five privacy measures with the same quasi-identifier but with different parameters for k and ℓ. As shown in the figures, entropy ℓ-diversity runs faster than the other four measures; the difference gets larger when ℓ increases. This is because with large ℓ, entropy ℓ-diversity prunes the search lattice earlier.
Data Quality  Our third set of experiments compares the data quality of the five privacy measures using the discernibility metric [2] and the Minimal Average Group Size [10, 15].
7. Related Work

The problem of information disclosure has been studied extensively in the framework of statistical databases. A number of information disclosure limitation techniques [3] have been designed for data publishing, including sampling, cell suppression, rounding, data swapping, and perturbation. These techniques, however, compromise the data integrity of the tables. Samarati and Sweeney [15, 16, 18] introduced the k-anonymity approach and used generalization and suppression techniques to preserve information truthfulness. Numerous algorithms [2, 6, 11, 10, 16, 17] have been proposed to achieve the k-anonymity requirement. Optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [13].
Recently, a number of authors have recognized that k-anonymity does not prevent attribute disclosure, e.g., [12, 19, 21]. Machanavajjhala et al. [12] proposed ℓ-diversity. As we discuss in detail in Section 3, while ℓ-diversity is an important step beyond k-anonymity, it has a number of limitations. Xiao and Tao [21] observe that ℓ-diversity cannot always prevent attribute disclosure.
Multiple Sensitive Attributes  Multiple sensitive attributes present additional challenges. Suppose we have two sensitive attributes, U and V. One can consider the two attributes separately, i.e., an equivalence class E has t-closeness if E has t-closeness with respect to both U and V. Another approach is to consider the joint distribution of the two attributes. To use this approach, one has to choose the ground distance between pairs of sensitive attribute values. A simple formula for calculating EMD may be difficult to derive, and the relationship between t and the level of privacy becomes more complicated.
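The first, attribute-by-attribute approach is straightforward to state in code; the following is a minimal sketch under our own naming, reusing a dist(P, Q) function as before:

```python
def t_close_multi(class_dists, overall_dists, t, dist):
    """class_dists / overall_dists: dicts mapping each sensitive attribute
    name to its distribution within the class / in the whole table.
    Checks t-closeness with respect to every sensitive attribute separately."""
    return all(dist(class_dists[a], overall_dists[a]) <= t
               for a in overall_dists)
```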
Other Anonymization Techniques  t-closeness allows us to take advantage of anonymization techniques other than generalization of quasi-identifiers and suppression of records. For example, instead of suppressing a whole record, one can hide some sensitive attributes of the record; one advantage is that the number of records in the anonymized table is accurate, which may be useful in some applications. Because this technique does not affect quasi-identifiers, it does not help achieve k-anonymity and hence has not been considered before. Removing a value only decreases diversity; therefore, it does not help to achieve ℓ-diversity either. However, under t-closeness, removing an outlier may smooth a distribution and bring it closer to the overall distribution. Another possible technique is to generalize a sensitive attribute value, rather than hiding it completely. An interesting question is how to effectively combine these techniques with generalization and suppression to achieve better data quality.
Limitations of Using EMD in t-closeness  The t-closeness principle can be applied using other distance measures. While EMD is the best measure we have found so far, it is certainly not perfect. In particular, the relationship between the value t and the information gain is unclear. For example, the EMD between the two distributions (0.01, 0.99) and (0.11, 0.89) is 0.1, and the EMD between (0.4, 0.6) and (0.5, 0.5) is also 0.1. However, one may argue that the change between the first pair is much more significant than that between the second pair: in the first pair, the probability of taking the first value increases from 0.01 to 0.11, a 1000% increase, while in the second pair the increase is only 25%. In general, what we need is a measure that combines the distance-estimation properties of the EMD with the probability-scaling nature of the KL distance.
References
[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
[2] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. 21st Intnl. Conf. Data Engg. (ICDE), 2005.