Differential Privacy
Cynthia Dwork
Microsoft Research
[email protected]
1 Introduction
we will return to this point later. However, in this work privacy is paramount: we
will first define our privacy goals and then explore what utility can be achieved
given that the privacy goals will be satisfied.¹
A 1977 paper of Dalenius [6] articulated a desideratum that foreshadows for
databases the notion of semantic security defined five years later by Goldwasser
and Micali for cryptosystems [15]: access to a statistical database should not
enable one to learn anything about an individual that could not be learned
without access.² We show this type of privacy cannot be achieved. The obstacle
is in auxiliary information, that is, information available to the adversary other
than from access to the statistical database, and the intuition behind the proof
of impossibility is captured by the following example. Suppose one’s exact height
were considered a highly sensitive piece of information, and that revealing the
exact height of an individual were a privacy breach. Assume that the database
yields the average heights of women of different nationalities. An adversary who
has access to the statistical database and the auxiliary information “Terry Gross
is two inches shorter than the average Lithuanian woman” learns Terry Gross’
height, while anyone learning only the auxiliary information, without access to
the average heights, learns relatively little.
There are two remarkable aspects to the impossibility result: (1) it applies
regardless of whether or not Terry Gross is in the database and (2) Dalenius’
goal, formalized as a relaxed version of semantic security, cannot be achieved,
while semantic security for cryptosystems can be achieved. The first of these
leads naturally to a new approach to formulating privacy goals: the risk to one’s
privacy, or in general, any type of risk, such as the risk of being denied automobile
insurance, should not substantially increase as a result of participating in a
statistical database. This is captured by differential privacy.
The discrepancy between the possibility of achieving (something like) seman-
tic security in our setting and in the cryptographic one arises from the utility
requirement. Our adversary is analogous to the eavesdropper, while our user is analogous to the message recipient, and yet there is no decryption key to set them apart: they are one and the same. Very roughly, the database is designed
to convey certain information. An auxiliary information generator knowing the
data therefore knows much about what the user will learn from the database.
This can be used to establish a shared secret with the adversary/user that is
unavailable to anyone not having access to the database. In contrast, consider
a cryptosystem and a pair of candidate messages, say, {0, 1}. Knowing which
message is to be encrypted gives one no information about the ciphertext; in-
tuitively, the auxiliary information generator has “no idea” what ciphertext the
eavesdropper will see. This is because by definition the ciphertext must have no
utility to the eavesdropper.
¹ In this respect the work on privacy diverges from the literature on secure function evaluation, where privacy is ensured only modulo the function to be computed: if the function is inherently disclosive then privacy is abandoned.
² Semantic security against an eavesdropper says that nothing can be learned about a plaintext from the ciphertext that could not be learned without seeing the ciphertext.
In this paper we prove the impossibility result, define differential privacy, and
observe that the interactive techniques developed in a sequence of papers [8, 13,
3, 12] can achieve any desired level of privacy under this measure. In many
cases very high levels of privacy can be ensured while simultaneously providing
extremely accurate information about the database.
There are two natural models for privacy mechanisms: interactive and non-
interactive. In the non-interactive setting the data collector, a trusted entity,
publishes a “sanitized” version of the collected data; the literature uses terms
such as “anonymization” and “de-identification”. Traditionally, sanitization em-
ploys techniques such as data perturbation and sub-sampling, as well as removing
well-known identifiers such as names, birthdates, and social security numbers.
It may also include releasing various types of synopses and statistics. In the in-
teractive setting the data collector, again trusted, provides an interface through
which users may pose queries about the data, and get (possibly noisy) answers.
Very powerful results for the interactive approach have been obtained ([13,
3, 12] and the present paper), while the non-interactive case has proven to be
more difficult (see [14, 4, 5]), possibly due to the difficulty of supplying utility
that has not yet been specified at the time the sanitization is carried out. This
intuition is given some teeth in [12], which shows concrete separation results.
The distribution D completely captures any information that the adversary (and
the simulator) has about the database, prior to seeing the output of the auxiliary
information generator. For example, it may capture the fact that the rows in
the database correspond to people owning at least two pets. Note that in the
statement of the theorem all parties have access to D and may have a description
of C hard-wired in; however, the adversary’s strategy does not use either of these.
Intuitively, Part (2a) implies that we can extract ℓ bits of randomness from the utility vector, which can be used as a one-time pad to hide any privacy breach of the same length. (For the full proof, i.e., when not necessarily all of w is learned by the adversary/user, we will need to strengthen Part (2a).) Let ℓ0 denote the least ℓ satisfying (both clauses of) Part 2. We cannot assume that ℓ0 can be found in finite time; however, for any tolerance γ let nγ be as in Part 1, so all but a γ fraction of the support of D is strings of length at most nγ. For any fixed γ it is possible to find an ℓγ ≤ ℓ0 such that ℓγ satisfies both clauses of Assumption 2(2) on all databases of length at most nγ. We can assume that γ is hard-wired into all our machines, and that they all follow the same procedure for computing nγ and ℓγ. Thus, Part 1 allows the more powerful order of quantifiers in the statement of the theorem; without it we would have to let A and A∗ depend on D (by having ℓ hard-wired in). Finally, Part 3 is a nontriviality condition.
The strategy for X and A is as follows. On input DB ∈R D, X randomly chooses a privacy breach y for DB of length ℓ = ℓγ, if one exists, which occurs with probability at least 1 − γ. It also computes the utility vector, w. Finally, it chooses a seed s and uses a strong randomness extractor to obtain from w an ℓ-bit almost-uniformly distributed string r [16, 17]; that is, r = Ext(s, w), and the distribution on r is within statistical distance ε from Uℓ, the uniform distribution on strings of length ℓ, even given s and y. The auxiliary information will be z = (s, y ⊕ r).
⁵ Although this case is covered by the more general case, in which not all of w need be learned, it permits a simpler proof that exactly captures the height example.
Since the adversary learns all of w, from s it can obtain r = Ext(s, w) and hence y. We next argue that A∗ wins with probability (almost) bounded by µ, yielding a gap of at least 1 − (γ + µ + ε).
Assumption 2(3) implies that Pr[A∗(D) wins] ≤ µ. Let dℓ denote the maximum, over all y ∈ {0, 1}ℓ, of the probability, over choice of DB ∈R D, that y is a privacy breach for DB. Since ℓ = ℓγ does not depend on DB, Assumption 2(3) also implies that dℓ ≤ µ.
By Assumption 2(2a), even conditioned on y, the extracted r is (almost) uniformly distributed, and hence so is y ⊕ r. Consequently, the probability that X produces z is essentially independent of y. Thus, the simulator’s probability of producing a privacy breach of length ℓ for the given database is bounded by dℓ + ε ≤ µ + ε, as it can generate simulated “auxiliary information” with a distribution within distance ε of the correct one.
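To make this simple case concrete, the following Python sketch mimics the construction under toy assumptions: the utility vector w is modeled as a high-entropy byte string that the adversary learns in full, a salted hash stands in for the strong extractor Ext, and the breach y is a fixed-length string. The names (ext, auxiliary_info, adversary) and parameters are illustrative and not from the paper.

    import hashlib
    import secrets

    ELL = 16  # length of the privacy breach y, in bytes (illustrative choice)

    def ext(seed, w):
        # Stand-in for a strong randomness extractor Ext(s, w): given a random
        # seed and a high-min-entropy source, the output is close to uniform.
        return hashlib.sha256(seed + w).digest()[:ELL]

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def auxiliary_info(w, y):
        # The auxiliary information generator X knows the database, hence w,
        # and a privacy breach y; it publishes z = (s, y XOR r), r = Ext(s, w).
        s = secrets.token_bytes(16)
        return s, xor(y, ext(s, w))

    def adversary(w_learned, z):
        # The adversary/user learns all of w from the mechanism, so it can
        # recompute r = Ext(s, w) and unmask y; a simulator without w cannot.
        s, masked = z
        return xor(masked, ext(s, w_learned))

    # Toy run: w plays the role of the utility vector, y of the privacy breach.
    w = secrets.token_bytes(64)
    y = secrets.token_bytes(ELL)
    assert adversary(w, auxiliary_info(w, y)) == y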
The more interesting case is when the sanitization does not necessarily reveal all of w; rather, the guarantee is only that it always reveals a vector w′ within Hamming distance κ/c of w, for a constant c to be determined.⁶ The difficulty with the previous approach is that if the privacy mechanism is randomized then the auxiliary information generator may not know which w′ is seen by the adversary. Thus, even given the seed s, the adversary may not be able to extract the same random pad from w′ that the auxiliary information generator extracted from w. This problem is solved using fuzzy extractors [10].
Definition 1. An (M, m, ℓ, t, ε) fuzzy extractor is given by procedures (Gen, Rec).
1. Gen is a randomized generation procedure. On input w ∈ M it outputs an “extracted” string r ∈ {0, 1}ℓ and a public string p. For any distribution W on M of min-entropy m, if (R, P) ← Gen(W) then the distributions (R, P) and (Uℓ, P) are within statistical distance ε.
2. Rec is a deterministic reconstruction procedure allowing recovery of r = R(w) from the corresponding public string p = P(w) together with any vector w′ of distance at most t from w. That is, if (r, p) ← Gen(w) and ‖w − w′‖1 ≤ t, then Rec(w′, p) = r.
In other words, r = R(w) looks uniform, even given p = P(w), and r = R(w) can be reconstructed from p = P(w) and any w′ sufficiently close to w.
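For intuition, here is a toy instantiation of the (Gen, Rec) interface over binary strings. It uses the familiar code-offset idea with a simple repetition code; this is a sketch under those assumptions, not the construction of [10], and the particular parameters (K = 7, SHA-256 as the extraction step) are illustrative only.

    import hashlib
    import secrets

    K = 7  # repetition factor; up to 3 flipped bits per block are corrected

    def _encode(bits):
        # Repetition code: each message bit becomes K identical code bits.
        return [b for b in bits for _ in range(K)]

    def _decode(codeword):
        # Majority vote within each block of K code bits.
        return [int(sum(codeword[i * K:(i + 1) * K]) > K // 2)
                for i in range(len(codeword) // K)]

    def gen(w):
        # Gen: from input w (a list of bits) output an extracted string r and a
        # public helper string p (the offset of w against a random codeword).
        msg = [secrets.randbelow(2) for _ in range(len(w) // K)]
        c = _encode(msg)
        p = [wi ^ ci for wi, ci in zip(w, c)]
        r = hashlib.sha256(bytes(msg)).digest()
        return r, p

    def rec(w_prime, p):
        # Rec: from any w' close to w in Hamming distance, recover the same r.
        msg = _decode([wi ^ pi for wi, pi in zip(w_prime, p)])
        return hashlib.sha256(bytes(msg)).digest()

    # Usage: r survives a few bit flips in w.
    w = [secrets.randbelow(2) for _ in range(70)]
    r, p = gen(w)
    w_prime = list(w)
    w_prime[0] ^= 1
    w_prime[9] ^= 1
    assert rec(w_prime, p) == r

In this toy version the uniformity of r given p holds only to the extent that w has sufficient min-entropy; a real fuzzy extractor provides the statistical guarantee of Definition 1 with explicit parameters.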
We now strengthen Assumption 2(2a) to say that the entropy of the source San(W) (vectors obtained by interacting with the sanitization mechanism, all of distance at most κ/c from the true utility vector) is high even conditioned on any privacy breach y of length ℓ and on the public string P output by Gen(W).
⁶ One could also consider privacy mechanisms that produce good approximations to the utility vector with a certain probability for the distribution D, where the probability is taken over the choice of DB ∈R D and the coins of the privacy mechanism. The theorem and proof hold mutatis mutandis.
Assumption 3 For some ℓ satisfying Assumption 2(2b), for any privacy breach y ∈ {0, 1}ℓ, the min-entropy of (San(W)|y) is at least k + ℓ, where k is the length of the public strings p produced by the fuzzy extractor.⁷
Strategy when w need not be fully learned: For a given database DB, let w be the utility vector. This can be computed by X, who has access to the database. X simulates interaction with the privacy mechanism to determine a “valid” w′ close to w (within Hamming distance κ/c). The auxiliary information generator runs Gen(w′), obtaining (r = R(w′), p = P(w′)). It computes nγ and ℓ = ℓγ (as above, only now satisfying Assumptions 3 and 2(2b) for all DB ∈ D of length at most nγ), and uniformly chooses a privacy breach y of length ℓγ, assuming one exists. It then sets z = (p, r ⊕ y).
Let w″ be the version of w seen by the adversary. Clearly, assuming 2κ/c ≤ t in Definition 1, the adversary can reconstruct r: since w′ and w″ are both within κ/c of w, they are within distance 2κ/c of each other, and so w″ is within the “reconstruction radius” t for any (r, p) ← Gen(w′). Once the adversary has reconstructed r, obtaining y is immediate. Thus the adversary is able to produce a privacy breach with probability at least 1 − γ. It remains to analyze the probability with which the simulator, having access only to z but not to the privacy mechanism (and hence not to any w″ close to w), produces a privacy breach.
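Schematically, in the notation above, the fuzzy-extractor version of the attack and the reason reconstruction succeeds are:

    X:          (r, p) ← Gen(w′),  z = (p, r ⊕ y),  where ‖w − w′‖1 ≤ κ/c;
    Adversary:  sees w″ with ‖w − w″‖1 ≤ κ/c, so ‖w′ − w″‖1 ≤ 2κ/c ≤ t,
                hence Rec(w″, p) = r and y = (r ⊕ y) ⊕ r.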
In the sequel, we let B denote the best machine, among all those with access to the given information, at producing a privacy breach (“winning”). By Assumption 2(3), Pr[B(D, San(DB)) wins] ≤ µ, where the probability is taken over the coin tosses of the privacy mechanism and the machine B, and the choice of DB ∈R D. Since p = P(w′) is computed from w′, which in turn is computable from San(DB), we have
p1 = Pr[B(D, p) wins] ≤ µ
where the probability space is now also over the choices made by Gen(), that is,
the choice of p = P(w′). Now, let Uℓ denote the uniform distribution on ℓ-bit strings. Concatenating a random string u ∈R Uℓ to p cannot help B to win, so
p2 = Pr[B(D, p, u) wins] = p1 ≤ µ
where the probability space is now also over choice of u. For any fixed string y ∈ {0, 1}ℓ we have Uℓ = Uℓ ⊕ y, so for all y ∈ {0, 1}ℓ, and in particular, for all privacy breaches y of DB,
p3 = Pr[B(D, p, u ⊕ y) wins] = p2 ≤ µ.
Let W denote the distribution on utility vectors and let San(W) denote the distribution on the versions of the utility vectors learned by accessing the database through the privacy mechanism. Since the distributions (P, R) = Gen(W′) and (P, Uℓ) have distance at most ε, it follows that for any y ∈ {0, 1}ℓ
p4 = Pr[B(D, p, r ⊕ y) wins] ≤ p3 + ε ≤ µ + ε.
Now, p4 is an upper bound on the probability that the simulator wins, given D and the auxiliary information z = (p, r ⊕ y), so we have obtained a gap of at least
(1 − γ) − (µ + ε) = 1 − (γ + µ + ε)
between the winning probabilities of the adversary and the simulator. Setting ∆ = 1 − (γ + µ + ε) proves Theorem 1.
⁷ A good fuzzy extractor “wastes” little of the entropy on the public string. Better fuzzy extractors are better for the adversary, since the attack requires ℓ bits of residual min-entropy after the public string has been generated.
We remark that, unlike in the case of most applications of fuzzy extractors
(see, in particular, [10, 11]), in this proof we are not interested in hiding partial
information about the source, in our case the approximate utility vectors W′, so
we don’t care how much min-entropy is used up in generating p. We only require
sufficient residual min-entropy for the generation of the random pad r. This is
because an approximation to the utility vector revealed by the privacy mecha-
nism is not itself disclosive; indeed it is by definition safe to release. Similarly, we
don’t necessarily need to maximize the tolerance t, although if we have a richer
class of fuzzy extractors the impossibility result applies to more relaxed privacy
mechanisms (those that reveal worse approximations to the true utility vector).
4 Differential Privacy
As noted in the example of Terry Gross’ height, an auxiliary information generator with information about someone not even in the database can cause a privacy breach for this person. In order to sidestep this issue we change from absolute guarantees about disclosures to relative ones: any given disclosure will be,
within a small multiplicative factor, just as likely whether or not the individual
participates in the database. As a consequence, there is a nominally increased
risk to the individual in participating, and only nominal gain to be had by con-
cealing or misrepresenting one’s data. Note that a bad disclosure can still occur,
but our guarantee assures the individual that it will not be the presence of her
data that causes it, nor could the disclosure be avoided through any action or
inaction on the part of the user.
Definition 2. A randomized function K gives ε-differential privacy if for all data sets D1 and D2 differing on at most one element, and all S ⊆ Range(K),
Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S] .    (1)
could have on the output of the query function; we refer to this quantity as the sensitivity of the function.
For many types of queries ∆f will be quite small. In particular, the simple count-
ing queries (“How many rows have property P ?”) have ∆f ≤ 1. Our techniques
work best (i.e., introduce the least noise) when ∆f is small. Note that sensitivity
is a property of the function alone, and is independent of the database.
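For reference, and consistent with its use in (4) below, the sensitivity can be written explicitly as

    ∆f = max ‖f(D1) − f(D2)‖1 ,

where the maximum is taken over all pairs of data sets D1, D2 differing on at most one element. For a counting query this maximum is 1, since adding, removing, or changing a single row changes the count by at most one.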
The privacy mechanism, denoted Kf for a query function f, computes f(X) and adds noise with a scaled symmetric exponential distribution with variance σ2 (to be determined in Theorem 4) in each component, described by the density function
Pr[Kf(X) = a] ∝ exp(−‖f(X) − a‖1/σ) .
Proof. Starting from (3), we apply the triangle inequality within the exponent,
yielding for all possible responses r
Pr[Kf(D1) = r] ≤ Pr[Kf(D2) = r] × exp(‖f(D1) − f(D2)‖1/σ) .    (4)
The second term in this product is bounded by exp(∆f /σ), by the definition of
∆f . Thus (1) holds for singleton sets S = {a}, and the theorem follows by a
union bound.
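As a concrete illustration of Kf for a counting query (so ∆f ≤ 1), the Python sketch below adds symmetric exponential (Laplace) noise of scale σ = ∆f/ε to the true answer; with this choice the factor exp(∆f/σ) in the proof equals exp(ε). The toy database, the parameter name epsilon, and the helper functions are illustrative assumptions rather than part of the paper.

    import random

    def laplace_noise(scale):
        # The difference of two exponential variables with mean `scale` has the
        # symmetric exponential (Laplace) density proportional to exp(-|x|/scale).
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def private_count(rows, has_property, epsilon):
        # Counting query "how many rows have property P?": sensitivity is at
        # most 1, so noise of scale 1/epsilon yields the exp(epsilon) bound.
        true_count = sum(1 for row in rows if has_property(row))
        return true_count + laplace_noise(1.0 / epsilon)

    # Toy usage on hypothetical data: ages of six individuals.
    db = [23, 45, 31, 62, 58, 37]
    print(round(private_count(db, lambda age: age >= 40, epsilon=0.1), 2))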
References
[1] N. R. Adam and J. C. Wortmann, Security-Control Methods for Statistical
Databases: A Comparative Study, ACM Computing Surveys 21(4): 515-556 (1989).
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. ACM
SIGMOD International Conference on Management of Data, pp. 439–450, 2000.
[3] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ
framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Sym-
posium on Principles of Database Systems, pages 128–138, June 2005.
[4] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Toward privacy in
public databases. In Proceedings of the 2nd Theory of Cryptography Conference,
pages 363–385, 2005.
[5] S. Chawla, C. Dwork, F. McSherry, and K. Talwar. On the utility of privacy-
preserving histograms. In Proceedings of the 21st Conference on Uncertainty in
Artificial Intelligence, 2005.
[6] T. Dalenius, Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, pp. 429–444, 1977.
[7] D. E. Denning, Secure statistical databases with random sample queries, ACM
Transactions on Database Systems, 5(3):291–315, September 1980.
[8] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Pro-
ceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles
of Database Systems, pages 202–210, 2003.
[9] D. Dobkin, A.K. Jones, and R.J. Lipton, Secure databases: Protection against
user influence. ACM Trans. Database Syst. 4(1), pp. 97–106, 1979.
[10] Y. Dodis, L. Reyzin and A. Smith, Fuzzy extractors: How to generate strong keys
from biometrics and other noisy data. In Proceedings of EUROCRYPT 2004, pp.
523–540, 2004.
[11] Y. Dodis and A. Smith, Correcting Errors Without Leaking Partial Information,
In Proceedings of the 37th ACM Symposium on Theory of Computing, pp. 654–663,
2005.
[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitiv-
ity in private data analysis. In Proceedings of the 3rd Theory of Cryptography
Conference, pages 265–284, 2006.
[13] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned
databases. In Advances in Cryptology: Proceedings of Crypto, pages 528–544, 2004.
[14] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy
preserving data mining. In Proceedings of the 22nd ACM SIGMOD-SIGACT-
SIGART Symposium on Principles of Database Systems, pages 211–222, June
2003.
[15] S. Goldwasser and S. Micali, Probabilistic encryption. Journal of Computer and System Sciences 28, pp. 270–299, 1984; preliminary version appeared in Proceedings of the 14th Annual ACM Symposium on Theory of Computing, 1982.
[16] N. Nisan and D. Zuckerman. Randomness is linear in space. J. Comput. Syst.
Sci., 52(1):43–52, 1996.
[17] Ronen Shaltiel. Recent developments in explicit constructions of extractors. Bul-
letin of the EATCS, 77:67–95, 2002.
[18] L. Sweeney, Weaving technology and policy together to maintain confidentiality. Journal of Law, Medicine and Ethics, 25(2–3), pp. 98–110, 1997.
[19] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), pp. 571–588, 2002.