
Wasserstein Differential Privacy

Chengyi Yang¹, Jiayin Qi²*, Aimin Zhou¹

¹ Shanghai Institute of AI for Education, School of Computer Science and Technology, and Key Laboratory of MEA (Ministry of Education), East China Normal University
² Cyberspace Institute of Advanced Technology, Guangzhou University
[email protected], [email protected], [email protected]

* Corresponding Author
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

arXiv:2401.12436v1 [cs.LG] 23 Jan 2024

Abstract

Differential privacy (DP) has achieved remarkable results in the field of privacy-preserving machine learning. However, existing DP frameworks do not satisfy all the conditions for becoming metrics, which prevents them from deriving better basic privacy properties and leads to exaggerated values of privacy budgets. We propose Wasserstein differential privacy (WDP), an alternative DP framework to measure the risk of privacy leakage, which satisfies the properties of symmetry and triangle inequality. We show and prove that WDP has 13 excellent properties, which provide theoretical support for the better performance of WDP over other DP frameworks. In addition, we derive a general privacy accounting method called the Wasserstein accountant, which enables WDP to be applied in stochastic gradient descent (SGD) scenarios containing subsampling. Experiments on basic mechanisms, compositions and deep learning show that the privacy budgets obtained by the Wasserstein accountant are relatively stable and less influenced by the order. Moreover, the overestimation of privacy budgets can be effectively alleviated. The code is available at https://github.com/Hifipsysta/WDP.

Introduction

Differential privacy (Dwork et al. 2006b) is a mathematically rigorous definition of privacy, providing quantifiable descriptions of the risk of leaking sensitive information. In the early stage, research on differential privacy mainly focused on the issue of statistical queries (SQ) (McSherry 2009; Kasiviswanathan et al. 2011). As the risk of privacy leakage in machine learning came to attention (Wang, Si, and Wu 2015; Shokri et al. 2017; Zhu, Liu, and Han 2019), differential privacy has gradually been applied for privacy protection in deep learning (Shokri and Shmatikov 2015; Abadi et al. 2016; Phan et al. 2019; Cheng et al. 2022).

However, these techniques are always constructed on the postulation of standard DP (Dwork et al. 2006b), which only covers the worst-case scenario and tends to overestimate privacy budgets under the measure of maximum divergence (Triastcyn and Faltings 2020). Although the most commonly applied approximate differential privacy ((ε, δ)-DP) (Dwork et al. 2006a) ignores extreme situations of small probability by introducing a relaxation term δ called the failure probability, it is believed that (ε, δ)-DP cannot strictly handle composition problems (Mironov 2017; Dong, Roth, and Su 2022). To address the above issues, further research has considered the specific data distribution, which can be divided into two main directions: the distribution of the privacy loss and the distribution of the unique difference. For example, concentrated differential privacy (CDP) (Dwork and Rothblum 2016), zero-concentrated differential privacy (zCDP) (Bun and Steinke 2016), and truncated concentrated differential privacy (tCDP) (Bun et al. 2018) all assume that the mean of the privacy loss follows a subgaussian distribution, while Bayesian differential privacy (BDP) (Triastcyn and Faltings 2020) considers the distribution of the only different data entry x′. Nevertheless, they are all defined by the upper bound of a divergence, which implies that their privacy budgets are overly pessimistic (Triastcyn and Faltings 2020).

In this paper, we introduce a variant of differential privacy from another perspective. We define the privacy budget through the upper bound of the Wasserstein distance between adjacent distributions, which we call Wasserstein differential privacy (WDP). From a semantic perspective, WDP also follows the concept of indistinguishability (Dwork et al. 2006b) in differential privacy. Specifically, for all possible adjacent databases D and D′, WDP reflects the maximum variation of optimal transport (OT) cost between the distributions queried by an adversary before and after any data entry change in the database.

Intuitively speaking, the advantages of WDP can be divided into at least two aspects. (1) WDP focuses on individuals within the distribution, rather than focusing on the entire distribution like a divergence, which is consistent with the original intention of differential privacy to protect individual private information from leakage. (2) More importantly, WDP satisfies all the conditions to become a metric, including non-negativity, symmetry and triangle inequality (see Propositions 1-3), which is not fully possessed by privacy loss under divergence-based definitions, as divergence itself does not satisfy symmetry and the triangle inequality (see Proposition 11 in the appendix of Mironov (2017)).
The combination of DP and OT has been taken into consideration in several existing works. Their contributions are essentially to provide privacy guarantees for computing the Wasserstein distance between data domains (Tien, Habrard, and Sebban 2019), distributions (Rakotomamonjy and Ralaivola 2021) or graph embeddings (Jin and Chen 2022). Our work, in contrast, computes privacy budgets through the Wasserstein distance, and the contributions are summarized as follows:

Firstly, we propose an alternative DP framework called Wasserstein differential privacy (WDP), which satisfies the three basic properties of a metric (non-negativity, symmetry and triangle inequality), and is easy to convert to and from other DP frameworks (see Propositions 9-11).

Secondly, we show that WDP has 13 excellent properties. More notably, basic sequential composition, group privacy and advanced composition are all derived from the triangle inequality, which shows the advantages of WDP as a metric DP.

Thirdly, we derive advanced composition, privacy loss and the absolute moment under WDP, and finally develop the Wasserstein accountant to track and account privacy budgets in subsampling algorithms such as SGD in deep learning.

Fourthly, we conduct experiments to evaluate WDP on basic mechanisms, compositions and deep learning. Results show that applying WDP as the privacy framework can effectively avoid overstating privacy budgets.

Related Work

Pure differential privacy (ε-DP) (Dwork et al. 2006b) provides strict guarantees for all measured events through the maximum divergence. To address the long-tailed distribution generated by a privacy mechanism, (ε, δ)-DP (Dwork et al. 2006a) ignores extremely low probability events through a relaxation term δ. However, (ε, δ)-DP is considered an overly relaxed definition (Bun et al. 2018) and cannot effectively handle composition problems, such as leading to parameter explosion (Mironov 2017) or failing to capture correct hypothesis testing (Dong, Roth, and Su 2022). In view of this, CDP (Dwork and Rothblum 2016) applies a subgaussian assumption to the mean of the privacy loss. zCDP (Bun and Steinke 2016) captures that the privacy loss is a subgaussian random variable through Rényi divergence. Rényi differential privacy (RDP) (Mironov 2017) proposes a more general definition of DP based on Rényi divergence. tCDP (Bun et al. 2018) further relaxes zCDP. BDP (Triastcyn and Faltings 2020) considers the distribution of the unique different entries. Subspace differential privacy (Gao, Gong, and Yu 2022) and integer subspace differential privacy (Dharangutte et al. 2023) consider privacy computing scenarios with external constraints. However, these concepts are all based on divergence, so their privacy losses do not have the properties of metrics. Although f-DP and its special case Gaussian differential privacy (GDP) (Dong, Roth, and Su 2022) innovatively define privacy based on the trade-off function between two types of errors in hypothesis testing, they are difficult to associate with other DP frameworks.

Wasserstein Differential Privacy

In this section, we introduce the concept of Wasserstein distance and define our Wasserstein differential privacy.

Definition 1 (Wasserstein distance (Rüschendorf 2009)). For two probability distributions P and Q defined over R, their µ-Wasserstein distance is

\[
W_\mu(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu}. \tag{1}
\]

Where ρ(x, y) = ∥x − y∥ is the norm defined on the probability space Ω = X × Y, and Γ(P, Q) is the set of all possible joint distributions γ(x, y) > 0 satisfying ∫ γ(x, y) dy = P(x) and ∫ γ(x, y) dx = Q(y).

In a practical sense, ρ(x, y) can be regarded as the cost of transporting one unit of mass from x to y, and γ(x, y) can be seen as a transport plan representing the share to be moved from P to Q, which measures how much mass must be transported in order to complete the transportation.

In particular, when µ is equal to 1, we obtain the 1-Wasserstein distance applied in the Wasserstein generative adversarial network (WGAN) (Arjovsky, Chintala, and Bottou 2017; Gulrajani et al. 2017). The successful application of the 1-Wasserstein distance in WGAN should be attributed to Kantorovich-Rubinstein duality, which effectively reduces the computational complexity of the Wasserstein distance.

Definition 2 (Kantorovich-Rubinstein distance (Kantorovich and Rubinshten 1958)). By Kantorovich-Rubinstein duality, the 1-Wasserstein distance can be equivalently expressed as the Kantorovich-Rubinstein distance

\[
K(P, Q) = \sup_{\|\varphi\|_L \leq 1} \mathbb{E}_{x \sim P}[\varphi(x)] - \mathbb{E}_{y \sim Q}[\varphi(y)]. \tag{2}
\]

Where φ : X → R is the so-called Kantorovich potential, giving the optimal transport map by a closed-form formula, and ∥φ∥L is the Lipschitz bound of the Kantorovich potential; ∥φ∥L ≤ 1 indicates that φ satisfies the 1-Lipschitz condition with

\[
\|\varphi\|_L = \sup_{x \neq y} \frac{\rho(\varphi(x), \varphi(y))}{\rho(x, y)}. \tag{3}
\]
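As a concrete illustration of Definitions 1 and 2 (our addition, not part of the original text), the order-1 case can be evaluated directly for one-dimensional empirical distributions; SciPy ships exactly this quantity:

```python
# A minimal sketch (ours): the 1-Wasserstein (Kantorovich-Rubinstein)
# distance of Definition 2 between two empirical 1-D distributions.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
p_samples = rng.normal(0.0, 1.0, 100_000)  # samples from P
q_samples = rng.normal(0.5, 1.0, 100_000)  # samples from Q, shifted by 0.5

# For 1-D distributions, W_1 is the L1 distance between the quantile
# functions, which SciPy computes directly from the samples.
print(wasserstein_distance(p_samples, q_samples))  # ~0.5, the mean shift
```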
Definition 3 ((µ, ε)-WDP). A randomized algorithm M is said to satisfy (µ, ε)-Wasserstein differential privacy if for any adjacent datasets D, D′ ∈ D and all measurable subsets S ⊆ R the following inequality holds:

\[
W_\mu(\Pr[M(D) \in S], \Pr[M(D') \in S]) = \left( \inf_{\gamma \in \Gamma(\Pr_M(D), \Pr_M(D'))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \leq \varepsilon. \tag{4}
\]

Where M(D) and M(D′) represent the two outputs when algorithm M is performed on datasets D and D′ respectively. Pr[M(D) ∈ S] and Pr[M(D′) ∈ S] are the corresponding probability distributions, also denoted as PrM(D) and PrM(D′) in this paper. The value of Wµ(PrM(D), PrM(D′)) is the privacy loss under (µ, ε)-WDP, and its upper bound ε is called the privacy budget.
Symbolic representations. WDP can also be represented as Wµ(M(D), M(D′)) ≤ ε. To emphasize that the inputs are two probability distributions, we denote WDP as Wµ(PrM(D), PrM(D′)) ≤ ε. To avoid confusion, we also represent RDP as Dα(PrM(D)∥PrM(D′)) ≤ ε, although the representation Dα(M(D)∥M(D′)) ≤ ε implies that the results depend on the randomized algorithm and the queried data. Both are reasonable because M(D) can be seen as a random variable that satisfies M(D) ∼ PrM(D).

For convenience of computation, we define Kantorovich differential privacy (KDP) as an alternative way to obtain the privacy loss or privacy budget under (1, ε)-WDP.

Definition 4 (Kantorovich Differential Privacy). A randomized algorithm M satisfies ε-KDP if it satisfies (1, ε)-WDP, which can also be written in the form of Kantorovich-Rubinstein duality:

\[
K(\Pr_M(D), \Pr_M(D')) = \sup_{\|\varphi\|_L \leq 1} \mathbb{E}_{x \sim \Pr_M(D)}[\varphi(x)] - \mathbb{E}_{x \sim \Pr_M(D')}[\varphi(x)] \leq \varepsilon. \tag{5}
\]

ε-KDP is equivalent to (1, ε)-WDP, and can be computed more efficiently through the duality formula based on the Kantorovich-Rubinstein distance.

Properties of WDP

Proposition 1 (Symmetry). Let M be a (µ, ε)-WDP algorithm. For any µ ≥ 1 and ε ≥ 0 the following equation holds:

\[
W_\mu(\Pr_M(D), \Pr_M(D')) = W_\mu(\Pr_M(D'), \Pr_M(D)) \leq \varepsilon. \tag{6}
\]

The symmetric property of (µ, ε)-WDP is implied by its definition. Specifically, the joint distribution Γ(·) satisfies Γ(PrM(D′), PrM(D)) = Γ(PrM(D), PrM(D′)). In addition, Kantorovich differential privacy also satisfies this property, and the proof is available in the appendix.

Proposition 2 (Triangle Inequality). Let D1, D2, D3 ∈ D be three arbitrary datasets. Suppose there are fewer differing data entries between D1 and D2 than between D1 and D3, and the differences between D1 and D2 are included in the differences between D1 and D3. For any randomized algorithm M satisfying (µ, ε)-WDP with µ ≥ 1, we have

\[
W_\mu(\Pr_M(D_1), \Pr_M(D_3)) \leq W_\mu(\Pr_M(D_1), \Pr_M(D_2)) + W_\mu(\Pr_M(D_2), \Pr_M(D_3)). \tag{7}
\]

The proof is available in the appendix; Minkowski's inequality is applied in the deduction. Proposition 2 can also be understood as saying that the cost of converting from PrM(D1) to PrM(D2) and then to PrM(D3) is not lower than the cost of converting from PrM(D1) to PrM(D3) directly. The triangle inequality is indispensable in proving several properties, such as basic sequential composition (see Proposition 6), group privacy (see Proposition 13) and advanced composition (see Theorem 1).

Proposition 3 (Non-Negativity). For µ ≥ 1 and any randomized algorithm M, we have Wµ(PrM(D), PrM(D′)) ≥ 0.
Proof. See proof of Proposition 3 in the appendix.

Proposition 4 (Monotonicity). For 1 ≤ µ1 ≤ µ2, we have Wµ1(PrM(D), PrM(D′)) ≤ Wµ2(PrM(D), PrM(D′)); equivalently, (µ2, ε)-WDP implies (µ1, ε)-WDP.
The proof is available in the appendix; the derivation is completed with the help of Lyapunov's inequality.

Proposition 5 (Parallel Composition). Suppose a dataset D is divided disjointly into n parts denoted Di, i = 1, 2, · · · , n, and each randomized algorithm Mi is performed on its separate dataset Di. If Mi : D → Ri satisfies (µ, εi)-WDP for i = 1, 2, · · · , n, then the set of randomized algorithms M = {M1, M2, · · · , Mn} satisfies (µ, max{ε1, ε2, · · · , εn})-WDP.
Proof. See proof of Proposition 5 in the appendix.

Proposition 6 (Sequential Composition). Consider a series of randomized algorithms M = {M1, · · · , Mi, · · · , Mn} performed on a dataset sequentially. If every Mi : D → Ri satisfies (µ, εi)-WDP, then M satisfies $(\mu, \sum_{i=1}^{n} \varepsilon_i)$-WDP.
Proof. See proof of Proposition 6 in the appendix.

Proposition 7 (Laplace Mechanism). If an algorithm f : D → R has sensitivity ∆pf and the order µ ≥ 1, then the Laplace mechanism ML = f(x) + Lap(0, λ) preserves $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(\sqrt{2[1/\lambda + \exp(-1/\lambda) - 1]})^{1/\mu}\bigr)$-WDP.
Proof. See proof of Proposition 7 in the appendix.

Proposition 8 (Gaussian Mechanism). If an algorithm f : D → R has sensitivity ∆pf and the order µ ≥ 1, then the Gaussian mechanism MG = f(x) + N(0, σ²) preserves $\bigl(\mu,\ \tfrac{1}{2}(\Delta_p f/\sigma)^{1/\mu}\bigr)$-WDP.
The proof of the Gaussian mechanism is available in the appendix. The relations between the parameters and the privacy budgets of the Laplace and Gaussian mechanisms are summarized in Table 1.
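The closed forms in Propositions 7 and 8 are cheap to evaluate; a minimal sketch (ours, assuming only the stated formulas):

```python
# A sketch (ours) of the closed-form WDP budgets of Propositions 7 and 8.
import math

def wdp_laplace(sens, lam, mu):
    """(mu, eps)-WDP budget of the Laplace mechanism Lap(0, lam)."""
    kl = 1.0 / lam + math.exp(-1.0 / lam) - 1.0  # order-1 Renyi divergence term
    return 0.5 * sens * math.sqrt(2.0 * kl) ** (1.0 / mu)

def wdp_gaussian(sens, sigma, mu):
    """(mu, eps)-WDP budget of the Gaussian mechanism N(0, sigma^2)."""
    return 0.5 * (sens / sigma) ** (1.0 / mu)

for mu in (1, 2, 5, 10):  # budgets grow slowly with mu (cf. Proposition 4)
    print(mu, wdp_laplace(1.0, 1.0, mu), wdp_gaussian(1.0, 1.0, mu))
```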
Proposition 9 (From DP to WDP). If M preserves ε-DP with sensitivity ∆pf, it also satisfies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon\cdot(e^\varepsilon - 1))^{1/(2\mu)}\bigr)$-WDP.
Proof. See proof of Proposition 9 in the appendix.

Proposition 10 (From RDP to WDP). If M preserves (α, ε)-RDP with sensitivity ∆pf, it also satisfies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon)^{1/(2\mu)}\bigr)$-WDP.
Proof. See proof of Proposition 10 in the appendix.
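Propositions 9 and 10 translate directly into one-line conversions; the sketch below (ours, with illustrative parameter values) evaluates them:

```python
# A sketch (ours) of the conversions into WDP (Propositions 9 and 10).
import math

def dp_to_wdp(eps, sens, mu):
    """eps-DP implies (mu, .)-WDP (Proposition 9)."""
    return 0.5 * sens * (2.0 * eps * (math.exp(eps) - 1.0)) ** (1.0 / (2.0 * mu))

def rdp_to_wdp(eps, sens, mu):
    """(alpha, eps)-RDP implies (mu, .)-WDP (Proposition 10); alpha drops out."""
    return 0.5 * sens * (2.0 * eps) ** (1.0 / (2.0 * mu))

print(dp_to_wdp(1.0, 1.0, mu=2), rdp_to_wdp(1.0, 1.0, mu=2))
```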
Proposition 11 (From WDP to RDP and DP). Suppose µ ≥ 1 and log(pM(·)) is an L-Lipschitz function. If M preserves (µ, ε)-WDP with sensitivity ∆pf, it also satisfies $\bigl(\alpha,\ \tfrac{\alpha}{\alpha-1} L \cdot \varepsilon^{\mu/(\mu+1)}\bigr)$-RDP. Specifically, when α → ∞, M satisfies $L \cdot \varepsilon^{\mu/(\mu+1)}$-DP.
The proof is available in the appendix. Here pM(·) is the probability density function of the distribution PrM(·).

Proposition 12 (Post-Processing). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm, and let G : R → R′ be an arbitrary randomized mapping. For any order µ ∈ [1, ∞) and all measurable subsets S ⊆ R, G(M)(·) is also (µ, ε)-Wasserstein differentially private, namely

\[
W_\mu(\Pr[G(M(D)) \in S], \Pr[G(M(D')) \in S]) \leq \varepsilon. \tag{8}
\]

Proof. See proof of Proposition 12 in the appendix.

Proposition 13 (Group Privacy). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm. Then for any pair of datasets D, D′ ∈ D differing in k data entries x1, · · · , xk, M is (µ, kε)-Wasserstein differentially private.
Proof. See proof of Proposition 13 in the appendix.
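For reference, the budget arithmetic behind Propositions 5, 6 and 13 is as simple as it looks; a minimal sketch (ours, with illustrative budgets):

```python
# A sketch (ours) of budget arithmetic under Propositions 5, 6 and 13,
# all budgets sharing the same order mu.
eps = [0.1, 0.2, 0.3]  # per-algorithm WDP budgets (illustrative)

sequential = sum(eps)   # Proposition 6: sequential budgets add up
parallel = max(eps)     # Proposition 5: disjoint parts take the maximum
group = 4 * 0.1         # Proposition 13: k = 4 differing entries scale eps by k

print(sequential, parallel, group)
```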
Differential Privacy Framework | Laplace Mechanism | Gaussian Mechanism
DP | $1/\lambda$ | $\infty$
RDP for order α | α > 1: $\frac{1}{\alpha-1}\log\left[\frac{\alpha}{2\alpha-1}\exp\left(\frac{\alpha-1}{\lambda}\right)+\frac{\alpha-1}{2\alpha-1}\exp\left(-\frac{\alpha}{\lambda}\right)\right]$; α = 1: $1/\lambda + \exp(-1/\lambda) - 1$ | $\alpha/(2\sigma^2)$
WDP for order µ | $\frac{1}{2}\Delta_p f\left(\sqrt{2[1/\lambda+\exp(-1/\lambda)-1]}\right)^{1/\mu}$ | $\frac{1}{2}(\Delta_p f/\sigma)^{1/\mu}$

Table 1: Privacy budgets of DP, RDP and WDP for basic mechanisms. The Laplace-mechanism and Gaussian-mechanism budgets of DP and RDP with sensitivity 1 are obtained from Table 2 in Mironov (2017). For WDP, the sensitivity ∆pf can be an arbitrary positive constant.

Implementation in Deep Learning

Advanced Composition

To derive advanced composition under WDP, we first define generalized (µ, ε)-WDP.

Definition 5 (Generalized (µ, ε)-WDP). A randomized mechanism M is generalized (µ, ε)-Wasserstein differentially private if for any two adjacent datasets D, D′ ∈ D it holds that

\[
\Pr[W_\mu(\Pr_M(D), \Pr_M(D')) \geq \varepsilon] \leq \delta. \tag{9}
\]

According to the above definition, (µ, ε)-WDP can be regarded as a special case of generalized (µ, ε)-WDP as δ tends to zero. Definition 5 is helpful for designing the Wasserstein accountant applied in private deep learning, and we deduce several necessary theorems based on this notion in the following.

Theorem 1 (Advanced Composition). Suppose a randomized algorithm M consists of a sequence of (µ, ε)-WDP algorithms M1, M2, · · · , MT, which perform on dataset D adaptively and satisfy Mt : D → Rt, t ∈ {1, 2, · · · , T}. M is generalized (µ, ε)-Wasserstein differentially private with ε > 0 and µ ≥ 1 if for any two adjacent datasets D, D′ ∈ D it holds that

\[
\exp\left[\beta \sum_{t=1}^{T} \mathbb{E}\bigl(W_\mu(\Pr_{M_t}(D), \Pr_{M_t}(D'))\bigr) - \beta\varepsilon\right] \leq \delta. \tag{10}
\]

Where β is a customization parameter satisfying β > 0.
Proof. See proof of Theorem 1 in the appendix.

Privacy Loss and Absolute Moment

Theorem 2. Suppose an algorithm M consists of a sequence of private algorithms M1, M2, · · · , MT protected by the Gaussian mechanism and satisfying Mt : D → R, t ∈ {1, 2, · · · , T}. If the subsampling probability, scale parameter and l2-sensitivity of algorithm Mt are represented by q ∈ [0, 1], σ > 0 and dt ≥ 0, then the privacy loss under WDP at epoch t is

\[
W_\mu(\Pr_{M_t}(D), \Pr_{M_t}(D')) = \inf_{d_t} \left[ \sum_{i=1}^{n} \mathbb{E}(|Z_{ti}|^\mu) \right]^{1/\mu}, \quad Z_t \sim \mathcal{N}\bigl(q d_t, (2 - 2q + 2q^2)\sigma^2\bigr). \tag{11}
\]

Where PrMt(D) is the outcome distribution when performing Mt on D at epoch t, and dt = ∥gt − gt′∥2 represents the l2 norm between the pair of adjacent gradients gt and gt′. In addition, Zt is a vector following the Gaussian distribution, and Zti represents the i-th component of Zt.
Proof. See proof of Theorem 2 in the appendix.

Note that E(|Zti|µ) is the µ-order raw absolute moment of the Gaussian distribution N(qdt, (2 − 2q + 2q²)σ²). The raw moments of a Gaussian distribution can be obtained by taking the µ-th order derivatives of the moment generating function with respect to z. Nevertheless, we do not adopt such an indirect approach; we derive a direct formula, as shown in Lemma 1.

Lemma 1 (Raw Absolute Moment). Assume that Zt ∼ N(qdt, (2 − 2q + 2q²)σ²). Then the raw absolute moment of Zt is

\[
\mathbb{E}(|Z_t|^\mu) = (2\,\mathrm{Var})^{\mu/2}\, \frac{\mathrm{GF}\!\left(\frac{\mu+1}{2}\right)}{\sqrt{\pi}}\, K\!\left(-\frac{\mu}{2}, \frac{1}{2}; -\frac{q^2 d_t^2}{2\,\mathrm{Var}}\right). \tag{12}
\]

Where Var represents the variance of the random variable Zt and can be expressed as Var = (2 − 2q + 2q²)σ². GF((µ+1)/2) represents the Gamma function

\[
\mathrm{GF}\!\left(\frac{\mu+1}{2}\right) = \int_0^\infty x^{\frac{\mu+1}{2}-1} e^{-x}\, dx, \tag{13}
\]

and K(−µ/2, 1/2; −q²dt²/(2Var)) represents Kummer's confluent hypergeometric function,

\[
K\!\left(-\frac{\mu}{2}, \frac{1}{2}; -\frac{q^2 d_t^2}{2\,\mathrm{Var}}\right) = \sum_{n=0}^{\infty} \frac{q^{2n} d_t^{2n}}{n! \cdot 4^n (1-q+q^2)^n \sigma^{2n}} \prod_{i=1}^{n} \frac{\mu - 2i + 2}{2i - 1}. \tag{14}
\]

Proof. Our mathematical deduction is based on the work of Winkelbauer (2012), and the proof is available in the appendix.
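Equation 12 is easy to evaluate numerically: the Gamma function and Kummer's confluent hypergeometric function are both available in SciPy. The sketch below (ours, with illustrative parameter values) computes E(|Zt|µ) and checks it against a Monte Carlo estimate:

```python
# A sketch (ours) of Lemma 1: the raw absolute moment of
# Z_t ~ N(q*d_t, (2 - 2q + 2q^2) * sigma^2) via Equation 12.
import numpy as np
from scipy.special import gamma, hyp1f1  # hyp1f1 is Kummer's function K

def raw_absolute_moment(mu, q, d_t, sigma):
    mean = q * d_t
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    return ((2.0 * var) ** (mu / 2.0) * gamma((mu + 1.0) / 2.0) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2.0, 0.5, -mean ** 2 / (2.0 * var)))

# Monte Carlo sanity check with illustrative parameters.
rng = np.random.default_rng(0)
mu, q, d_t, sigma = 2.5, 0.01, 0.8, 1.0
var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
z = rng.normal(q * d_t, np.sqrt(var), 1_000_000)
print(raw_absolute_moment(mu, q, d_t, sigma), np.mean(np.abs(z) ** mu))
```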
Wasserstein Accountant in Deep Learning

Next, we deduce the Wasserstein accountant applied in private deep learning. We obtain Theorem 3 based on the above preparations, including advanced composition, privacy loss and the absolute moment under WDP.

Theorem 3 (Tail Bound). Under the conditions described in Theorem 2, M satisfies (µ, ε)-WDP for

\[
\log \delta = \beta \sum_{t=1}^{T} \inf_{d_t} \left[ \sum_{i=1}^{n} \mathbb{E}(|Z_{ti}|^\mu) \right]^{1/\mu} - \beta\varepsilon. \tag{15}
\]

Where Zt ∼ N(qdt, (2 − 2q + 2q²)σ²) and dt = ∥gt − gt′∥2. The proof of Theorem 3 is available in the appendix.

In the other direction, if we have determined δ and want to know the privacy budget ε, we can utilize the result in Corollary 1.

Corollary 1. Under the conditions described in Theorem 2, M satisfies (µ, ε)-WDP for

\[
\varepsilon = \sum_{t=1}^{T} \inf_{d_t} \left[ \sum_{i=1}^{n} \mathbb{E}(|Z_{ti}|^\mu) \right]^{1/\mu} - \frac{1}{\beta} \log \delta. \tag{16}
\]

Corollary 1 is more commonly used than Theorem 3, since the total privacy budget generated by an algorithm plays a more important role in privacy computing.
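Corollary 1 translates directly into an accounting loop. The sketch below (ours, not the released implementation) accumulates the per-epoch terms of Equation 16 for scalar gradients (n = 1) and evaluates each term at the observed dt rather than resolving the infimum; all parameter values are illustrative:

```python
# A sketch (ours) of the Wasserstein accountant of Corollary 1 for scalar
# gradients (n = 1), using the Lemma 1 closed form for E(|Z_t|^mu).
import numpy as np
from scipy.special import gamma, hyp1f1

def abs_moment(mu, mean, var):
    # Equation 12: E|Z|^mu for Z ~ N(mean, var).
    return ((2.0 * var) ** (mu / 2.0) * gamma((mu + 1.0) / 2.0) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2.0, 0.5, -mean ** 2 / (2.0 * var)))

def wasserstein_epsilon(d_ts, q, sigma, mu, beta, delta):
    # Equation 16: eps = sum_t [E|Z_t|^mu]^(1/mu) - log(delta) / beta.
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    total = sum(abs_moment(mu, q * d_t, var) ** (1.0 / mu) for d_t in d_ts)
    return total - np.log(delta) / beta

d_ts = np.full(100, 0.5)  # illustrative per-epoch gradient gaps d_t = ||g_t - g_t'||_2
print(wasserstein_epsilon(d_ts, q=0.01, sigma=1.0, mu=2.0, beta=10.0, delta=1e-10))
```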
Experiments

The experiments in this paper consist of four parts. Firstly, we test the Laplace mechanism and Gaussian mechanism under RDP and WDP with varying orders. Secondly, we carry out composition experiments and compare our Wasserstein accountant with the Bayesian accountant and the moments accountant. Thirdly, we consider the application scenario of deep learning, and train a convolutional neural network (CNN) optimized by differentially private stochastic gradient descent (DP-SGD) (Abadi et al. 2016) on the task of image classification. At last, we demonstrate the impact of hyperparameter variations on privacy budgets. All experiments were performed on a single machine with Ubuntu 18.04, 40 Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40GHz, and two NVIDIA Quadro RTX 8000 GPUs.

Basic Mechanisms

We conduct experiments to test the Laplace mechanism and Gaussian mechanism under RDP and WDP. These experiments are based on the results of Propositions 7 and 8 and Table 1. We set the scale parameters of the Laplace mechanism and Gaussian mechanism to 1, 2, 3 and 5 respectively. The order µ of WDP is allowed to vary from 1 to 10, and so is the order α of RDP. We plot the values of the privacy budgets ε with increasing orders, and the results are shown in Figure 1.

We can observe that the privacy budgets of WDP increase as µ grows, which corresponds to our monotonicity property (see Proposition 4). More importantly, we find that the privacy budgets of WDP are not susceptible to the order µ, because their curves all exhibit slow upward trends. However, the privacy budgets of RDP experience a steep increase under the Gaussian mechanism when the noise scale equals 1, simply because the order α increases. In addition, the slopes of the RDP curves with different noise scales are significantly different. These phenomena leave users confused about order selection and risk assessment through privacy budgets when utilizing RDP.

(a) LM for RDP (b) LM for WDP (c) GM for RDP (d) GM for WDP

Figure 1: Privacy budget curves of (µ, ε)-WDP and (α, ε)-RDP for the Laplace mechanism (LM) and Gaussian mechanism (GM) with varying orders, where λ and σ are the scales of LM and GM respectively; the sensitivities are set to 1 and remain unchanged.

Composition

For the convenience of comparison, we adopt the same settings as the composition experiment in Triastcyn and Faltings (2020). We imitate heavy-tailed gradient distributions by generating synthetic gradients from a Weibull distribution with shape parameter 0.5 and size 50 × 1000.

The hyperparameter σ remains unchanged after being set to 0.2, and the threshold of gradient clipping C is set to the {0.05, 0.50, 0.75, 0.99}-quantiles of the gradient norm in turn. To observe the original variations of the privacy budgets, we do not clip gradients; thus, C only affects the Gaussian noise with variance C²σ² in DP-SGD (Abadi et al. 2016) in this experiment. In addition, we provide the composition results with gradient clipping in the appendix for comparison.

In Figure 2, we have the following key observations. (1) The curves obtained from the Wasserstein accountant (WA) almost replicate the changes and trends depicted by the curves obtained from the moments accountant (MA) and the Bayesian accountant (BA). (2) The privacy budgets under WA are always the lowest, and this advantage becomes more significant as C increases.

The above results show that the Wasserstein accountant can retain the privacy features expressed by MA and BA at a lower privacy budget.

(a) 0.05-quantile of ∥gt∥ (b) 0.50-quantile of ∥gt∥ (c) 0.75-quantile of ∥gt∥ (d) 0.99-quantile of ∥gt∥

Figure 2: Privacy budgets over synthetic gradients obtained by the moments accountant under DP, the Bayesian accountant under BDP and the Wasserstein accountant under WDP, without gradient clipping.
In the experiment of deep learning, we allow different DP
Basic Mechanisms frameworks to adjust the noise scale σ according to their
We conduct experiments to test Laplace Mechanism and own needs. The reasons are as follows: (1) MA supported
Gaussian Mechnism under RDP and WDP. Our experiments by DP can easily lead to gradient explosion when the noise
are based on the results of Proposition 7, 8 and Table 1. We scale is small, thus σ can only take a relatively larger value to
set the scale parameters of Laplace mechanism and Gaus- avoid this situation. However, an excessive noise limits the
sian mechanism as 1, 2, 3 and 5 respectively. The order µ performance of BDP and WDP. (2) In addition, this setting
of WDP is allowed to varies from 1 to 10, and so is the or- enables our experimental results more convenient to com-
der α of RDP. We plot the values of privacy budgets ε with pare with that in BDP (Triastcyn and Faltings 2020), because
increasing orders, and the results are shown in Figure 1. the deep learning experiment in BDP is also designed in this
We can observe that the privacy budgets of WDP increase way.
with µ growing, which corresponds to our monotonicity Table 2 shows the results obtained under the above ex-
property (see Proposition 4). More importantly, we find that perimental settings. We can observe the following phe-
the privacy budgets of WDP are not susceptible to the order nomenons: (1) WDP requires lower privacy budgets than DP
µ, because their curves all exhibit slow upward trends. How- and RDP to achieve the same level of test accuracy. (2) The
ever, the privacy budgets of RDP experience a steep increase convergence speed of the deep learning model under WA is
Table 2 shows the results obtained under the above experimental settings. We observe the following phenomena: (1) WDP requires lower privacy budgets than DP and RDP to achieve the same level of test accuracy. (2) The convergence speed of the deep learning model under WA is faster than that under MA and BA. Taking the experiments on the MNIST dataset as an example, DP and BDP need more than 100 epochs and 50 epochs of training respectively to achieve an accuracy of 96%, while our WDP reaches the same level after only 16 epochs of training.

BDP (Triastcyn and Faltings 2020) attributes its better performance over DP to considering gradient distribution information. Similarly, we can analyze the advantages of WDP from the following aspects. (1) From the perspective of the definition, WDP also utilizes gradient distribution information, through γ ∈ Γ(PrM(D), PrM(D′)); from the perspective of the Wasserstein accountant, the information of the gradient distribution is included in dt and Zt. (2) More importantly, privacy budgets under WDP do not explode even under low-noise conditions, because the Wasserstein distance is more stable than Rényi divergence or maximum divergence; this is similar to the reason why WGAN (Arjovsky, Chintala, and Bottou 2017) succeeds in alleviating the problem of mode collapse by applying the Wasserstein distance.

Effect of β and δ

We also conduct experiments to illustrate the relation between privacy budgets and the related hyperparameters. These experiments are based on the results from Theorem 3 and Corollary 1, which were proved above. In Figure 3(a), the hyperparameter β in WDP is allowed to vary from 1 to 50, and the failure probability δ of WDP takes values in {10⁻¹⁰, 10⁻⁸, 10⁻⁵, 10⁻³}. In Figure 3(b), the failure probability δ is allowed to vary from 10⁻¹⁰ to 10⁻⁵, and the hyperparameter β under WDP takes values in {1, 2, 5, 10}.

(a) ε varies with β (b) ε varies with δ

Figure 3: The impact of β and δ. The coordinates of the horizontal axis in 3(b) are on a logarithmic scale.

We observe that β has a clear effect on the value of ε in Figure 3(a): ε decreases quickly while β is less than 10, and very slowly once it is greater than 10. In Figure 3(b), ε appears to decrease uniformly with the exponential growth of δ.

Discussion

Relations to Other DP Frameworks

We establish bridges between WDP, DP and RDP through Propositions 9, 10 and 11. We know that ε-DP implies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon\cdot(e^\varepsilon-1))^{1/(2\mu)}\bigr)$-WDP and (α, ε)-RDP implies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon)^{1/(2\mu)}\bigr)$-WDP. In addition, (µ, ε)-WDP implies $\bigl(\alpha,\ \tfrac{\alpha}{\alpha-1}L\cdot\varepsilon^{\mu/(\mu+1)}\bigr)$-RDP or $L\cdot\varepsilon^{\mu/(\mu+1)}$-DP.
Dataset | Accuracy (Non-Private) | Accuracy (Private) | Privacy: DP (δ = 10⁻⁵) | Privacy: BDP (δ = 10⁻¹⁰) | Privacy: WDP (δ = 10⁻¹⁰)
MNIST | 99% | 96% | 2.2 (0.898) | 0.95 (0.721) | 0.76 (0.681)
CIFAR-10 | 86% | 73% | 8.0 (0.999) | 0.76 (0.681) | 0.52 (0.627)
SVHN | 93% | 92% | 5.0 (0.999) | 0.87 (0.705) | 0.40 (0.599)
F-MNIST | 92% | 90% | 2.9 (0.623) | 0.91 (0.713) | 0.45 (0.611)

Table 2: Privacy budgets accounted by DP, BDP and WDP on MNIST, CIFAR-10, SVHN and Fashion-MNIST (F-MNIST). The values in parentheses are the probability of a potential attack succeeding, computed by P(A) = 1/(1 + e⁻ᵋ) (see Section 3 in Triastcyn and Faltings (2020)).

With the above basic conclusions, we can obtain more derivative relationships through RDP or DP. For example, (µ, ε)-WDP implies $\tfrac{1}{2}\bigl(L \cdot \varepsilon^{\mu/(\mu+1)}\bigr)^2$-zCDP (zero-concentrated differential privacy) according to Proposition 1.4 in Bun and Steinke (2016).
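These conversions out of WDP are again one-liners; the sketch below (ours) chains Proposition 11 with Proposition 1.4 of Bun and Steinke (2016), under the L-Lipschitz assumption on log pM(·):

```python
# A sketch (ours): (mu, eps)-WDP -> RDP -> DP -> zCDP, chaining
# Proposition 11 with Bun & Steinke (2016), Proposition 1.4.
def wdp_to_rdp(eps, mu, L, alpha):
    return alpha / (alpha - 1.0) * L * eps ** (mu / (mu + 1.0))

def wdp_to_dp(eps, mu, L):
    return L * eps ** (mu / (mu + 1.0))      # the alpha -> infinity limit

def wdp_to_zcdp(eps, mu, L):
    return 0.5 * wdp_to_dp(eps, mu, L) ** 2  # via Bun & Steinke, Prop. 1.4

print(wdp_to_rdp(0.5, 2.0, 1.0, alpha=10.0),
      wdp_to_dp(0.5, 2.0, 1.0), wdp_to_zcdp(0.5, 2.0, 1.0))
```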
(Rao and Nayak 1985). While JSD does not satisfy the trian-
Advantages from Metric Property gle inequality unless we take its square root instead (Osán,
The privacy losses of DP, RDP and BDP are all non- Bussandri, and Lamberti 2018). Nevertheless, it still tends to
negative but asymmetric, and do not satisfy triangle inequal- exaggerate privacy budgets excessively, as it is defined based
ity (Mironov 2017). Several obvious advantages of WDP as on divergence.
a metric DP have been mentioned in the introduction (see Comparability. Another question worth explaining is why
Section ) and verified in the experiments (see Section ), and the privacy budgets obtained by DP, RDP, and WDP can
here we provide more additional details. be compared. (1) Their process of computing privacy bud-
Triangle inequality. (1) Several properties including basic gets follows the same mapping, namely M : D → R.
sequential composition, group privacy and advanced com- (2) They are essentially measuring the differences in distri-
position are derived from triangle inequality. (2) Properties butions between adjacent datasets, although their respective
in WDP are more comprehensible and easier to utilize than measurement methods are different. (3) Privacy budgets can
those in RDP. For example, RDP have to introduce addi- be uniformly transformed into the probability of successful
tional conditions of 2c -stable and α ≥ 2c+1 to derive group attacks (Triastcyn and Faltings 2020).
privacy (see Proposition 2 in Mironov (2017)), where c is a Computational problem. Although obtaining the Wasser-
constant. In contrast, our WDP utilizes its intrinsic triangle stein distance requires relatively high computational
inequality to obtain group privacy without introducing any costs (Dudley 1969; Fournier and Guillin 2015), we do not
complex concepts or conditions. need to worry about this issue. Because WDP does not
Symmetry. We have considered that the asymmetry of pri- need to directly calculate the Wasserstein distance no matter
vacy loss would not be transferred to the privacy bud- in basic privacy mechanisms or Wasserstein accountant for
get. Specifically, even if Dα (P rM (D)∥P rM (D′ )) ̸= deep learning (see Proposition 7-8 and Theorem 1-3).
Dα (P rM (D′ )∥P rM (D)), Dα (P rM (D)∥P rM (D′ )) ≤ ε
still implies Dα (P rM (D′ )∥P rM (D)) ≤ ε, because neigh- Conclusion
boring datasets D and D′ can be all possible pairs. Even so, In this paper, we propose an alternative DP framework called
symmetrical privacy loss still has at least two advantages: (1) Wasserstein differential privacy (WDP) based on Wasser-
When computing privacy budgets, it can reduce the amount stein distance. WDP satisfies the properties of symme-
of computation for traversing adjacent datasets by half. (2) try, triangle inequality and non-negativity that other DPs
When proving properties, it is not necessary to exchange do not satisfy all, which enables the privacy losses un-
datasets and deduce it again like non-metric DP (e.g. see der WDP to become real metrics. We prove that WDP has
Proof of Theorem 3 in Triastcyn and Faltings (2020)). several excellent properties (see Proposition 1-13) through
Lyapunov’s inequality, Minkowski’s inequality, Jensen’s in-
Limitations equality, Markov’s inequality, Pinsker’s inequality and tri-
WDP has excellent mathematical properties as a metric DP, angle inequality. We also derive advanced composition the-
and can effectively alleviate exploding privacy budgets as orem, privacy loss and absolute moment under the postula-
an alternative DP framework. However, when the volume of tion of WDP and finally obtain Wasserstein accountant to
data in the queried database is extremely small, WDP may compute cumulative privacy budgets in deep learning (see
release a much smaller privacy budget than other DP frame- Theorem 1-3 and Lemma 1). Our evaluations on basic mech-
works. Fortunately, this situation only occurs when there is anisms, compositions and deep learning show that WDP en-
very little data available in the dataset. WDP has great po- ables privacy budgets to be more stable and can effectively
tential in deep learning that requires a large amount of data avoid the overestimation or even explosion on privacy.
Acknowledgments

This work is supported by National Natural Science Foundation of China (No. 72293583, No. 72293580), Science and Technology Commission of Shanghai Municipality Grant (No. 22511105901), Defense Industrial Technology Development Program (JCKY2019204A007) and Sino-German Research Network (GZ570).

References

Abadi, M.; Chu, A.; Goodfellow, I. J.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep Learning with Differential Privacy. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS), 308–318.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 214–223.
Bobkov, S.; and Ledoux, M. 2019. One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances. Memoirs of the American Mathematical Society, 261(1259).
Bun, M.; Dwork, C.; Rothblum, G. N.; and Steinke, T. 2018. Composable and Versatile Privacy via Truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 74–86. ACM.
Bun, M.; and Steinke, T. 2016. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. In Theory of Cryptography Conference (TCC), volume 9985, 635–658.
Cheng, A.; Wang, J.; Zhang, X. S.; Chen, Q.; Wang, P.; and Cheng, J. 2022. DPNAS: Neural Architecture Search for Deep Learning with Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 6358–6366.
Clement, P.; and Desch, W. 2008. An Elementary Proof of the Triangle Inequality for the Wasserstein Metric. Proceedings of the American Mathematical Society, 136(1): 333–339.
Dharangutte, P.; Gao, J.; Gong, R.; and Yu, F. 2023. Integer Subspace Differential Privacy. In Williams, B.; Chen, Y.; and Neville, J., eds., Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 7349–7357. AAAI Press.
Dong, J.; Roth, A.; and Su, W. J. 2022. Gaussian Differential Privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1): 3–37.
Dudley, R. M. 1969. The Speed of Mean Glivenko-Cantelli Convergence. Annals of Mathematical Statistics, 40: 40–50.
Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; and Naor, M. 2006a. Our Data, Ourselves: Privacy via Distributed Noise Generation. In Vaudenay, S., ed., 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), volume 4004, 486–503. Springer.
Dwork, C.; and Lei, J. 2009. Differential Privacy and Robust Statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), 371–380.
Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. D. 2006b. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, Third Theory of Cryptography Conference (TCC), volume 3876, 265–284. Springer.
Dwork, C.; and Roth, A. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4): 211–407.
Dwork, C.; and Rothblum, G. N. 2016. Concentrated Differential Privacy. arXiv preprint arXiv:1603.01887.
Erven, T. V.; and Harremoës, P. 2014. Rényi Divergence and Kullback-Leibler Divergence. IEEE Transactions on Information Theory, 60(7): 3797–3820.
Fedotov, A. A.; Harremoës, P.; and Topsøe, F. 2003. Refinements of Pinsker's Inequality. IEEE Transactions on Information Theory, 49(6): 1491–1498.
Fournier, N.; and Guillin, A. 2015. On the Rate of Convergence in Wasserstein Distance of the Empirical Measure. Probability Theory and Related Fields, 162: 707–738.
Gao, J.; Gong, R.; and Yu, F. 2022. Subspace Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 3986–3995.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), 5767–5777.
Jin, H.; and Chen, X. 2022. Gromov-Wasserstein Discrepancy with Local Differential Privacy for Distributed Structural Graphs. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), 2115–2121.
Kantorovich, L. V.; and Rubinshten, G. S. 1958. On a Space of Completely Additive Functions. Vestnik Leningrad Univ, 13(7): 52–59.
Kasiviswanathan, S. P.; Lee, H. K.; Nissim, K.; Raskhodnikova, S.; and Smith, A. D. 2011. What Can We Learn Privately? SIAM Journal on Computing, 40(3): 793–826.
Krizhevsky, A.; and Hinton, G. 2009. Learning Multiple Layers of Features from Tiny Images. Handbook of Systemic Autoimmune Diseases, 1(4).
Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11): 2278–2324.
McSherry, F. 2009. Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 19–30.
Mironov, I. 2017. Rényi Differential Privacy. In 30th IEEE Computer Security Foundations Symposium (CSF), 263–275.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Osán, T. M.; Bussandri, D. G.; and Lamberti, P. W. 2018. Monoparametric Family of Metrics Derived from Classical Jensen–Shannon Divergence. Physica A: Statistical Mechanics and its Applications, 495: 336–344.
Panaretos, V. M.; and Zemel, Y. 2019. Statistical Aspects of
Wasserstein Distances. Annual Review of Statistics and Its
Application, 6(1).
Phan, N.; Vu, M. N.; Liu, Y.; Jin, R.; Dou, D.; Wu, X.; and
Thai, M. T. 2019. Heterogeneous Gaussian Mechanism: Pre-
serving Differential Privacy in Deep Learning with Provable
Robustness. In International Joint Conference on Artificial
Intelligence (IJCAI), 4753–4759.
Rakotomamonjy, A.; and Ralaivola, L. 2021. Differen-
tially Private Sliced Wasserstein Distance. In Proceedings
of the 38th International Conference on Machine Learning
(ICML), volume 139, 8810–8820.
Rao, C.; and Nayak, T. 1985. Cross entropy, Dissimilar-
ity Measures, and Characterizations of Quadratic Entropy.
IEEE Transactions on Information Theory, 31(5): 589–593.
Rüschendorf, L. 2009. Optimal Transport. Old and New.
Jahresbericht der Deutschen Mathematiker-Vereinigung,
111(2): 18–21.
Shokri, R.; and Shmatikov, V. 2015. Privacy-Preserving
Deep Learning. In Proceedings of ACM SIGSAC Conference
on Computer and Communications Security (CCS), 1310–
1321.
Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017.
Membership Inference Attacks Against Machine Learning
Models. In IEEE Symposium on Security and Privacy (SP),
3–18.
Tien, N. L.; Habrard, A.; and Sebban, M. 2019. Differ-
entially Private Optimal Transport: Application to Domain
Adaptation. In Proceedings of the 28th International Joint
Conference on Artificial Intelligence (IJCAI), 2852–2858.
Triastcyn, A.; and Faltings, B. 2020. Bayesian Differential
Privacy for Machine Learning. In International Conference
on Machine Learning (ICML), 9583–9592.
Wang, Y.; Si, C.; and Wu, X. 2015. Regression Model Fit-
ting under Differential Privacy and Model Inversion Attack.
In International Joint Conference on Artificial Intelligence
(IJCAI), 1003–1009.
Winkelbauer, A. 2012. Moments and Absolute Moments of
the Normal Distribution. arXiv preprint arXiv:1209.4340.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST:
a Novel Image Dataset for Benchmarking Machine Learning
Algorithms. arXiv preprint arXiv:1708.07747.
Zhu, L.; Liu, Z.; and Han, S. 2019. Deep Leakage from
Gradients. In Advances in Neural Information Processing
Systems (NeurIPS), 14747–14756.
Proof of Propositions and Theorems

Proof of Proposition 1

Proposition 1 (Symmetry). Let M be a (µ, ε)-WDP algorithm. For any µ ≥ 1 and ε ≥ 0 the following equation holds:

\[
W_\mu(\Pr_M(D), \Pr_M(D')) = W_\mu(\Pr_M(D'), \Pr_M(D)) \leq \varepsilon.
\]

Proof. Considering the definition of (µ, ε)-WDP, we have

\[
W_\mu(\Pr_M(D), \Pr_M(D')) = \left( \inf_{\gamma \in \Gamma(\Pr_M(D), \Pr_M(D'))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \leq \varepsilon.
\]

The symmetry of Wasserstein differential privacy is then obvious, for the reason that the joint distribution has the property Γ(PrM(D′), PrM(D)) = Γ(PrM(D), PrM(D′)).

Next, we prove that Kantorovich differential privacy also satisfies symmetry. Considering the definition of Kantorovich differential privacy, we have

\[
K(\Pr_M(D), \Pr_M(D')) = \sup_{\|\varphi\|_L \leq 1} \mathbb{E}_{x \sim \Pr_M(D)}[\varphi(x)] - \mathbb{E}_{x \sim \Pr_M(D')}[\varphi(x)] \tag{17}
\]

and

\[
K(\Pr_M(D'), \Pr_M(D)) = \sup_{\|\varphi\|_L \leq 1} \mathbb{E}_{x \sim \Pr_M(D')}[\varphi(x)] - \mathbb{E}_{x \sim \Pr_M(D)}[\varphi(x)]. \tag{18}
\]

If we set ψ(x) = −φ(x), then the latter formula can be written as

\[
K(\Pr_M(D'), \Pr_M(D)) = \sup_{\|\psi\|_L \leq 1} \mathbb{E}_{x \sim \Pr_M(D')}[-\psi(x)] - \mathbb{E}_{x \sim \Pr_M(D)}[-\psi(x)]
= \sup_{\|\psi\|_L \leq 1} \mathbb{E}_{x \sim \Pr_M(D)}[\psi(x)] - \mathbb{E}_{x \sim \Pr_M(D')}[\psi(x)]
= K(\Pr_M(D), \Pr_M(D')). \tag{19}
\]

Proof of Proposition 2

Proposition 2 (Triangle Inequality). Let D1, D2, D3 ∈ D be three arbitrary datasets. Suppose there are fewer differing data entries between D1 and D2 than between D1 and D3, and the differences between D1 and D2 are included in the differences between D1 and D3. For any randomized algorithm M satisfying (µ, ε)-WDP with µ ≥ 1, we have

\[
W_\mu(\Pr_M(D_1), \Pr_M(D_3)) \leq W_\mu(\Pr_M(D_1), \Pr_M(D_2)) + W_\mu(\Pr_M(D_2), \Pr_M(D_3)). \tag{20}
\]

Proof. The triangle inequality has been proved in Proposition 2.1 of Clement and Desch (2008). Here we provide a simpler proof from another perspective.

Firstly, we introduce another mathematical form defining the Wasserstein distance (see Definition 6.1 in Rüschendorf (2009) or Equation 1 in Panaretos and Zemel (2019)):

\[
W_\mu(P, Q) = \inf_{X \sim P,\ Y \sim Q} \left[ \mathbb{E}\, \rho(X, Y)^\mu \right]^{1/\mu}, \quad \mu \geq 1. \tag{21}
\]

Where X and Y are random vectors, and the infimum is taken over all possible pairs of X and Y that are marginally distributed as P and Q.

Let X1, X2, X3 be three random variables following the distributions PrM(D1), PrM(D2), PrM(D3) respectively. Then

\[
W_\mu(\Pr_M(D_1), \Pr_M(D_3)) = \inf_{X_1 \sim \Pr_M(D_1),\ X_3 \sim \Pr_M(D_3)} \left[ \mathbb{E}\, \rho(X_1, X_3)^\mu \right]^{1/\mu} \tag{22}
\]
\[
\leq \inf_{X_1 \sim \Pr_M(D_1),\ X_2 \sim \Pr_M(D_2)} \left[ \mathbb{E}\, \rho(X_1, X_2)^\mu \right]^{1/\mu} + \inf_{X_2 \sim \Pr_M(D_2),\ X_3 \sim \Pr_M(D_3)} \left[ \mathbb{E}\, \rho(X_2, X_3)^\mu \right]^{1/\mu} \tag{23}
\]
\[
= W_\mu(\Pr_M(D_1), \Pr_M(D_2)) + W_\mu(\Pr_M(D_2), \Pr_M(D_3)). \tag{24}
\]

Here Equation 23 is established by applying Minkowski's inequality, ∥X1 + X2∥r ≤ ∥X1∥r + ∥X2∥r with 1 < r < ∞.

Proof of Proposition 3

Proposition 3 (Non-Negativity). For µ ≥ 1 and any randomized algorithm M, we have Wµ(PrM(D), PrM(D′)) ≥ 0.

Proof. The integrand ρ(x, y) ≥ 0, for the reason that it is a cost function in the sense of optimal transport (Rüschendorf 2009) and a norm in the statistical sense (Panaretos and Zemel 2019). γ(x, y) is a probability measure, so γ(x, y) > 0 holds. Then, according to the definition of WDP,

\[
\left( \inf_{\gamma \in \Gamma(\Pr_M(D), \Pr_M(D'))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \geq 0.
\]
Proof of Proposition 4

Proposition 4 (Monotonicity). For 1 ≤ µ1 ≤ µ2, we have Wµ1(PrM(D), PrM(D′)) ≤ Wµ2(PrM(D), PrM(D′)); equivalently, (µ2, ε)-WDP implies (µ1, ε)-WDP.

Proof. Consider the expectation form of Wasserstein differential privacy (see Equation 21) and apply Lyapunov's inequality,

\[
\left[ \mathbb{E} |\cdot|^{\mu_1} \right]^{1/\mu_1} \leq \left[ \mathbb{E} |\cdot|^{\mu_2} \right]^{1/\mu_2}, \quad 1 \leq \mu_1 \leq \mu_2. \tag{25}
\]

We obtain

\[
W_{\mu_1}(\Pr_M(D), \Pr_M(D')) = \inf_{X \sim M(D),\ Y \sim M(D')} \left[ \mathbb{E}\, \rho(X, Y)^{\mu_1} \right]^{1/\mu_1}
\leq \inf_{X \sim M(D),\ Y \sim M(D')} \left[ \mathbb{E}\, \rho(X, Y)^{\mu_2} \right]^{1/\mu_2}
= W_{\mu_2}(\Pr_M(D), \Pr_M(D')). \tag{26}
\]
Proof of Proposition 5

Proposition 5 (Parallel Composition). Suppose a dataset D is divided disjointly into n parts denoted Di, i = 1, 2, · · · , n, and each randomized algorithm Mi is performed on its separate dataset Di. If Mi : D → Ri satisfies (µ, εi)-WDP for i = 1, 2, · · · , n, then the set of randomized algorithms M = {M1, M2, · · · , Mn} satisfies (µ, max{ε1, ε2, · · · , εn})-WDP.

Proof. From the definition of WDP, we obtain

\[
W_\mu(\Pr_M(D'), \Pr_M(D)) = \left( \inf_{\gamma \in \Gamma(\Pr_M(D), \Pr_M(D'))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \tag{27}
\]
\[
\leq \max_i \left\{ \left( \inf_{\gamma \in \Gamma(\Pr_{M_i}(D_i), \Pr_{M_i}(D_i'))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \right\}, \quad \forall M_i \subseteq M,\ D_i \subseteq D \tag{28}
\]
\[
\leq \max\{\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_n\}. \tag{29}
\]

Inequality 28 is tenable for the following reasons. (1) The privacy budget in the WDP framework focuses on the upper bound of the privacy loss or distance. (2) The randomized algorithm in M that leads to the maximum differential privacy budget is some particular Mi, because only one differential privacy mechanism can be applied in both Wµ(PrM(D′), PrM(D)) and Wµ(PrMi(Di), PrMi(Di′)). (3) There is only one differing element between D, D′ as well as between Di, Di′, and from the perspective of entire distributions this difference looms larger when the data volume is small; the query algorithm in differential privacy requires hiding individual differences, and a larger amount of data helps to hide them.

Proof of Proposition 6

Proposition 6 (Sequential Composition). Consider a series of randomized algorithms M = {M1, · · · , Mi, · · · , Mn} performed on a dataset sequentially. If every Mi : D → Ri satisfies (µ, εi)-WDP, then M satisfies $(\mu, \sum_{i=1}^{n} \varepsilon_i)$-WDP.

Proof. Consider the mathematical forms of (µ, εi)-WDP:

\[
W_\mu(\Pr_{M_1}(D), \Pr_{M_1}(D')) \leq \varepsilon_1, \quad
W_\mu(\Pr_{M_2}(D), \Pr_{M_2}(D')) \leq \varepsilon_2, \quad \cdots, \quad
W_\mu(\Pr_{M_n}(D), \Pr_{M_n}(D')) \leq \varepsilon_n. \tag{30}
\]

According to the basic properties of inequalities, we obtain an upper bound on the sum of the Wasserstein distances:

\[
\sum_{i=1}^{n} W_\mu(\Pr_{M_i}(D), \Pr_{M_i}(D')) \leq \sum_{i=1}^{n} \varepsilon_i. \tag{31}
\]

According to the triangle inequality of the Wasserstein distance (see Proposition 2), we have

\[
\sum_{i=1}^{n} W_\mu(\Pr_{M_i}(D), \Pr_{M_i}(D')) \geq W_\mu(\Pr_M(D), \Pr_M(D')). \tag{32}
\]

Thus, we obtain $W_\mu(\Pr_M(D), \Pr_M(D')) \leq \sum_{i=1}^{n} \varepsilon_i$.
Proof of Proposition 7

Proposition 7 (Laplace Mechanism). If an algorithm f : D → R has sensitivity ∆pf and the order µ ≥ 1, then the Laplace mechanism ML = f(x) + Lap(0, λ) preserves $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(\sqrt{2[1/\lambda + \exp(-1/\lambda) - 1]})^{1/\mu}\bigr)$-Wasserstein differential privacy.

Proof. Considering the Wasserstein distance between two Laplace distributions, we have

\[
W_\mu(\mathrm{Lap}(0, \lambda), \mathrm{Lap}(\Delta_p f, \lambda)) = \left( \inf_{\gamma \in \Gamma(\mathrm{Lap}(0,\lambda),\, \mathrm{Lap}(\Delta_p f,\lambda))} \int_{\mathcal{X} \times \mathcal{Y}} \rho(x, y)^\mu \, d\gamma(x, y) \right)^{1/\mu} \tag{33}
\]
\[
\leq \left( \inf_{\gamma \in \Gamma(\mathrm{Lap}(0,\lambda),\, \mathrm{Lap}(\Delta_p f,\lambda))} \int_{\mathcal{X} \times \mathcal{Y}} \Delta_p f^\mu \, d\gamma(x, y) \right)^{1/\mu} \tag{34}
\]
\[
= \Delta_p f \left( \inf_{\gamma \in \Gamma(\mathrm{Lap}(0,\lambda),\, \mathrm{Lap}(\Delta_p f,\lambda))} \int 1 \, d\gamma(x, y) \right)^{1/\mu} \tag{35}
\]
\[
= \Delta_p f \inf_{X \sim \mathrm{Lap}(0,\lambda),\ Y \sim \mathrm{Lap}(\Delta_p f,\lambda)} \left[ \mathbb{E}\, \mathbb{1}_{X \neq Y} \right]^{1/\mu} \tag{36}
\]
\[
= \frac{1}{2} \Delta_p f \left( \|\mathrm{Lap}(0, \lambda) - \mathrm{Lap}(\Delta_p f, \lambda)\|_{TV} \right)^{1/\mu} \tag{37}
\]
\[
\leq \frac{1}{2} \Delta_p f \left( \sqrt{2 D_{KL}(\mathrm{Lap}(0, \lambda) \| \mathrm{Lap}(\Delta_p f, \lambda))} \right)^{1/\mu}. \tag{38}
\]

Where ∆pf is the lp-sensitivity between two datasets (see Definition 8), and p is its order, which can be set to any positive integer as needed. X and Y are random variables following the Laplace distributions (see Equation 36). In addition, ∥·∥TV represents the total variation, and DKL(P∥Q) represents the Kullback-Leibler (KL) divergence between P and Q, which is also equal to the Rényi divergence of order one, D1(P∥Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)).

We obtain Equation 37 from Equation 36 because of the probabilistic interpretation of the total variation when ρ(x, y) = 1, which is presented on page 10 of Rüschendorf (2009). Equation 38 is established by Pinsker's inequality (see Section I in Fedotov, Harremoës, and Topsøe (2003)):

\[
D_{KL}(P \| Q) \geq \frac{1}{2} \|P - Q\|_{TV}^2. \tag{39}
\]

Pinsker's inequality establishes a relation between the KL divergence and the total variation; P and Q represent the distributions of two random variables.

To obtain the final result, we apply the outcome of the Laplace mechanism under Rényi DP of order one (see Table II in Mironov (2017)):

\[
D_1(\mathrm{Lap}(0, \lambda) \| \mathrm{Lap}(1, \lambda)) = 1/\lambda + \exp(-1/\lambda) - 1. \tag{40}
\]

We then obtain the outcome of the Laplace mechanism under Wasserstein DP:

\[
W_\mu(\mathrm{Lap}(0, \lambda), \mathrm{Lap}(1, \lambda)) \leq \frac{1}{2} \Delta_p f \left( \sqrt{2[1/\lambda + \exp(-1/\lambda) - 1]} \right)^{1/\mu}. \tag{41}
\]

Proof of Proposition 8

Proposition 8 (Gaussian Mechanism). If an algorithm


 f :D →R  has sensitivity ∆p f and the order µ ≥ 1, then Gaussian
1
1
2

mechanism MG = f (x) + N 0, σ preserves µ, 2 (∆p f /σ) -Wasserstein differential privacy.
µ
Proof. By directly calculating the Wasserstein distance between Gaussian distributions, we have
 Z  µ1
2
 2
 µ
Wµ N 0, σ , N ∆p f, σ = inf
2 2
ρ (x, y) dγ (x, y) (42)
γ∈Γ(N (0,σ ),N (∆p f,σ )) X ×Y
 Z  µ1
≤ inf ∆p f µ dγ (x, y) (43)
γ∈Γ(N (0,σ ),N (∆p f,σ 2 ))
2
X ×Y
 Z  µ1
= ∆p f inf 1dγ (x, y) (44)
γ∈Γ(N (0,σ ),N (∆p f,σ 2 ))
2

1
= ∆p f inf [E 1X̸=Y ] µ (45)
X∼N (0,σ 2 )
Y ∼N (∆p f,σ 2 )

1 1
= ∆p f ∥N (0, σ 2 ) − N (∆p f, σ 2 )∥T V µ (46)
2
q  µ1
1
≤ ∆p f 2DKL (N (0, σ 2 )∥N (∆p f, σ 2 )) . (47)
2
Where ∆p f is the lp -sensitivity between two datasets (see Definition 8). X and Y are random variables follows Gaussian
distribution. ∥ · ∥T V represents the total variation. DKL (P ∥Q) represents the KL divergence between P and Q, which is also
equal to one-order Rényi divergence D1 (P ∥Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov
(2017)).
We can obtain Equation 46 from Equation 45 because of the probabilistic interpretation of total variation when ρ(x, y) = 1
(see page 10 in Rüschendorf (2009)). Equation 47 can be established because of Pinsker’s inequality (see Section I in Fedotov,
Harremoës, and Topsøe (2003))
1
DKL (P ∥Q) ≥ ∥P − Q∥2T V . (48)
2
Pinsker’s inequality establishs a relation between KL divergence and total variation, and P and Q represent the distributions of
two random variables.
To obtain the final result, we apply the property of Gaussian Mechanism under Rényi DP of order one (see Proposition 7 and
Table II in Mironov (2017)) as follow
(∆p f )2
D1 (N (0, σ 2 )∥N (1, σ 2 )) = . (49)
2σ 2
Then we will obtain the outcome of Gaussian Mechnism under wasserstein DP as follow
r ! µ1 1
2

2
 2
 1 (∆ p f ) 1 ∆p f µ
Wµ N 0, σ , N 1, σ ≤ 2 = . (50)
2 2σ 2 2 σ
 1

Thus we have proved that if algorithm f has sensitivity 1, then the Gaussian mechanism MG satisfies µ, 12 (∆p f /σ) µ -WDP.

Proof of Proposition 9

Proposition 9 (From DP to WDP). If M preserves ε-DP with sensitivity ∆pf, it also satisfies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon\cdot(e^\varepsilon - 1))^{1/(2\mu)}\bigr)$-WDP.

Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33-38, we have

\[
W_\mu(\Pr_M(D), \Pr_M(D')) \leq \frac{1}{2} \Delta_p f \left( \sqrt{2 D_{KL}(\Pr_M(D) \| \Pr_M(D'))} \right)^{1/\mu}. \tag{51}
\]

To deduce further, we apply Lemma 3.18 in Dwork and Roth (2014): if two random variables X, Y satisfy D∞(X∥Y) ≤ ε and D∞(Y∥X) ≤ ε, then

\[
D_1(X \| Y) \leq \varepsilon \cdot (e^\varepsilon - 1). \tag{52}
\]

It should be noted that the condition of ε-DP ensures that D∞(X∥Y) ≤ ε and D∞(Y∥X) ≤ ε hold (see Remark 3.2 in Dwork and Roth (2014)). Based on Equations 51 and 52, we have

\[
W_\mu(\Pr_M(D), \Pr_M(D')) \leq \frac{1}{2} \Delta_p f \left( \sqrt{2\varepsilon \cdot (e^\varepsilon - 1)} \right)^{1/\mu} = \frac{1}{2} \Delta_p f \left( 2\varepsilon \cdot (e^\varepsilon - 1) \right)^{1/(2\mu)}. \tag{53}
\]
Proof of Proposition 10

Proposition 10 (From RDP to WDP). If M preserves (α, ε)-RDP with sensitivity ∆pf, it also satisfies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon)^{1/(2\mu)}\bigr)$-WDP.

Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33-38, we have

\[
W_\mu(\Pr_M(D), \Pr_M(D')) \leq \frac{1}{2} \Delta_p f \left( \sqrt{2 D_{KL}(\Pr_M(D) \| \Pr_M(D'))} \right)^{1/\mu}. \tag{54}
\]

Where DKL(PrM(D)∥PrM(D′)) represents the KL divergence between PrM(D) and PrM(D′), which can also be written as the Rényi divergence of order 1 (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)):

\[
D_{KL}(\Pr_M(D) \| \Pr_M(D')) = D_1(\Pr_M(D) \| \Pr_M(D')). \tag{55}
\]

In addition, from the monotonicity property of RDP, we have

\[
D_{\mu_1}(\Pr_M(D) \| \Pr_M(D')) \leq D_{\mu_2}(\Pr_M(D) \| \Pr_M(D')) \tag{56}
\]

for 1 ≤ µ1 < µ2 and arbitrary PrM(D) and PrM(D′). From the condition that M preserves (α, ε)-RDP, we have

\[
D_\alpha(\Pr_M(D) \| \Pr_M(D')) \leq \varepsilon, \quad \alpha \geq 1. \tag{57}
\]

Combining Equations 55, 56 and 57, we have

\[
D_{KL}(\Pr_M(D) \| \Pr_M(D')) = D_1(\Pr_M(D) \| \Pr_M(D')) \leq D_\alpha(\Pr_M(D) \| \Pr_M(D')) \leq \varepsilon. \tag{58}
\]

Combining Equations 54 and 58, we have

\[
W_\mu(\Pr_M(D), \Pr_M(D')) \leq \frac{1}{2} \Delta_p f \left( \sqrt{2\varepsilon} \right)^{1/\mu} = \frac{1}{2} \Delta_p f (2\varepsilon)^{1/(2\mu)}. \tag{59}
\]

Therefore, (α, ε)-RDP implies $\bigl(\mu,\ \tfrac{1}{2}\Delta_p f\,(2\varepsilon)^{1/(2\mu)}\bigr)$-WDP.

Proof of Proposition 11

Proposition 11 (From WDP to RDP and DP). Suppose µ ≥ 1 and log(pM(·)) is an L-Lipschitz function. If M preserves (µ, ε)-WDP with sensitivity ∆pf, it also satisfies $\bigl(\alpha,\ \tfrac{\alpha}{\alpha-1} L \cdot \varepsilon^{\mu/(\mu+1)}\bigr)$-RDP. Specifically, when α → ∞, it satisfies $L \cdot \varepsilon^{\mu/(\mu+1)}$-DP.

Proof. Considering the definition of an L-Lipschitz function, we have

\[
|\log p_M(D) - \log p_M(D')| \leq L\, |p_M(D) - p_M(D')| \tag{60}
\]
\[
\log \frac{p_M(D)}{p_M(D')} \leq L\, |p_M(D) - p_M(D')| \tag{61}
\]
\[
-L\, |p_M(D) - p_M(D')| \leq \log \frac{p_M(D)}{p_M(D')} \leq L\, |p_M(D) - p_M(D')| \tag{62}
\]
\[
e^{-L |p_M(D) - p_M(D')|} \leq \frac{p_M(D)}{p_M(D')} \leq e^{L |p_M(D) - p_M(D')|}. \tag{63}
\]

Considering the Rényi divergence of order α, we have

\[
D_\alpha(\Pr_M(D) \| \Pr_M(D')) = \frac{1}{\alpha - 1} \log \mathbb{E}_{\Pr_M(D')} \left[ \left( \frac{p_M(D)}{p_M(D')} \right)^{\alpha} \right] \tag{64}
\]
\[
\leq \frac{1}{\alpha - 1} \log \mathbb{E}_{\Pr_M(D')} \left[ e^{\alpha L |p_M(D) - p_M(D')|} \right] \tag{65}
\]
\[
\leq \frac{1}{\alpha - 1} \log \mathbb{E}_{\Pr_M(D')} \left[ e^{\alpha L \Delta_p f} \right]. \tag{66}
\]

According to the definition of sensitivity, we know that

\[
\begin{cases}
p_M(D) \leq p_M(D') + \Delta_p f, & p_M(D) \geq p_M(D') \\
p_M(D') \leq p_M(D) + \Delta_p f, & p_M(D) \leq p_M(D').
\end{cases} \tag{67}
\]

From Theorem 2.7 in Bobkov and Ledoux (2019), we have

\[
\Delta_p f \leq W_\mu(\Pr_M(D), \Pr_M(D'))^{\mu/(\mu+1)}. \tag{68}
\]

Combining Equations 66 and 68, we have

\[
D_\alpha(\Pr_M(D) \| \Pr_M(D')) \leq \frac{1}{\alpha - 1} \log \mathbb{E}_{\Pr_M(D')} \left[ e^{\alpha L [W_\mu(\Pr_M(D), \Pr_M(D'))]^{\mu/(\mu+1)}} \right] \tag{69}
\]
\[
= \frac{1}{\alpha - 1} \log \mathbb{E}_{\Pr_M(D')} \left[ e^{\alpha L \varepsilon^{\mu/(\mu+1)}} \right] \tag{70}
\]
\[
= \frac{1}{\alpha - 1} \log e^{\alpha L \varepsilon^{\mu/(\mu+1)}} \tag{71}
\]
\[
= \frac{\alpha}{\alpha - 1} L \varepsilon^{\mu/(\mu+1)}. \tag{72}
\]

By the same method, we can also prove that

\[
D_\alpha(\Pr_M(D') \| \Pr_M(D)) \leq \frac{\alpha}{\alpha - 1} L \varepsilon^{\mu/(\mu+1)}. \tag{73}
\]

Next, we consider the special case α → ∞. From the definition of the max divergence, we have

\[
D_\infty(\Pr_M(D) \| \Pr_M(D')) = \sup_{\Pr_M(D)} \log \frac{p_M(D)}{p_M(D')}. \tag{74}
\]

Referring to Equation 63, we have

\[
D_\infty(\Pr_M(D) \| \Pr_M(D')) \leq \sup_{\Pr_M(D)} L\, |p_M(D) - p_M(D')| = L \Delta_p f. \tag{75}
\]

Referring to Equation 68, we know that

\[
D_\infty(\Pr_M(D) \| \Pr_M(D')) \leq L \varepsilon^{\mu/(\mu+1)}. \tag{76}
\]

By the same method, we can also prove that

\[
D_\infty(\Pr_M(D') \| \Pr_M(D)) \leq L \varepsilon^{\mu/(\mu+1)}. \tag{77}
\]

Proof of Proposition 12
Proposition 12 (Post-Processing). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm, and G : R → R′
be an arbitrary randomized mapping. For any order µ ∈ [1, ∞) and all measurable subsets S ⊆ R, G(M)(·) is also (µ, ε)-
Wasserstein differentially private, namely
W_µ(Pr[G(M(D)) ∈ S], Pr[G(M(D′)) ∈ S]) ≤ ε. (78)
Proof. Let T = {x ∈ R : G(x) ∈ S}; then we have

W_µ(Pr[G(M(D)) ∈ S], Pr[G(M(D′)) ∈ S]) = W_µ(Pr[M(D) ∈ T], Pr[M(D′) ∈ T]) (79)

= W_µ(Pr_M(D), Pr_M(D′)) ≤ ε. (80) ∎
Proof of Proposition 13
Proposition 13 (Group Privacy). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm. Then for any pair of datasets D, D′ ∈ D differing in k data entries x′_1, · · · , x′_k, M(D) is (µ, kε)-Wasserstein differentially private.
Proof. We decompose the group privacy problem and denote D, D′_1 as a pair of adjacent datasets differing only in x′_1. Similarly, we denote D′_1 and D′_2, D′_2 and D′_3, · · · , D′_{k−1} and D′ as the other k − 1 pairs of adjacent datasets differing only in x′_2, x′_3, · · · , x′_k, respectively.
Recall that WDP satisfies the triangle inequality (Proposition 2); then we have

W_µ(Pr_M(D), Pr_M(D′)) ≤ W_µ(Pr_M(D), Pr_M(D′_1)) + W_µ(Pr_M(D′_1), Pr_M(D′_2)) + · · ·
+ W_µ(Pr_M(D′_{k−2}), Pr_M(D′_{k−1})) + W_µ(Pr_M(D′_{k−1}), Pr_M(D′)) ≤ kε. (81) ∎
Proof of Theorem 1
Theorem 1 (Advanced Composition). Suppose a randomized algorithm M consists of a sequence of (µ, ε)-WDP algorithms M_1, M_2, · · · , M_T, which perform on dataset D adaptively and satisfy M_t : D → R_t, t ∈ {1, 2, · · · , T}. M is generalized (µ, ε)-Wasserstein differentially private with ε > 0 and µ ≥ 1 if for any two adjacent datasets D, D′ ∈ D it holds that

exp[β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε] ≤ δ, (82)

where β is a customizable parameter that satisfies β > 0.


Proof. With the definition of generalized (µ, ε)-WDP, we have

Pr[W_µ(Pr_M(D), Pr_M(D′)) ≥ ε] ≤ Pr[β Σ_{t=1}^T W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) ≥ βε] (83)

≤ E[exp(β Σ_{t=1}^T W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)))] / exp(βε) (84)

≤ exp(β E[Σ_{t=1}^T W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))]) / exp(βε) (85)

= exp(β Σ_{t=1}^T E[W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))]) / exp(βε) (86)

= exp[β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε]. (87)

Here Inequality 83 holds because the triangle inequality (see Proposition 2) ensures that

Σ_{t=1}^T W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) ≥ W_µ(Pr_M(D), Pr_M(D′)).

Inequality 84 holds because of Markov's inequality:

Pr(|·| ≥ c) ≤ E(φ(|·|)) / φ(c), c > 0. (88)

Here φ(·) can be any monotonically increasing non-negative function. To simplify the computation of privacy budgets in WDP, we set φ(·) = exp(·). Inequality 85 holds because of Jensen's inequality, and Equation 86 follows from the linearity of expectation. Thus, we find that requiring Equation 87 to be at most δ implies Pr[W_µ(Pr_M(D), Pr_M(D′)) ≥ ε] ≤ δ. ∎

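To illustrate the tail bound numerically, here is a minimal Python sketch (a hypothetical helper; the per-step expected Wasserstein losses are assumed to have been estimated already, e.g., via Theorem 2) that evaluates the failure probability δ in Equation 82:

import math

def composition_delta(expected_losses, eps, beta):
    """Theorem 1: delta = exp(beta * sum_t E[W_mu loss at step t] - beta * eps)."""
    return math.exp(beta * sum(expected_losses) - beta * eps)

# Example: 100 steps, each contributing an expected Wasserstein loss of 0.01
print(composition_delta([0.01] * 100, eps=2.0, beta=1.0))  # exp(-1) ≈ 0.368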
Proof of Theorem 2
Theorem 2. Suppose an algorithm M consists of a sequence of private algorithms M_1, M_2, · · · , M_T protected by the Gaussian mechanism and satisfying M_t : D → R, t ∈ {1, 2, · · · , T}. If the subsampling probability, scale parameter and l_2-sensitivity of algorithm M_t are denoted by q ∈ [0, 1], σ > 0 and d_t ≥ 0, then the privacy loss under WDP at epoch t is

W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ}, Z_t ∼ N(qd_t, (2 − 2q + 2q²)σ²), (89)

where Pr_{M_t}(D) is the outcome distribution when performing M on D at epoch t, and d_t = ∥g_t − g′_t∥_2 is the l_2 norm between the pair of adjacent gradients g_t and g′_t. In addition, Z_t is a vector following a Gaussian distribution, and Z_{ti} is the i-th component of Z_t.
Proof. With the Gaussian mechanism in a subsampling scenario, we have

Pr_{M_t}(D) = (1 − q)N(0, σ²) + qN(d_t, σ²),
Pr_{M_t}(D′) = N(0, σ²).

To facilitate the later proof, we slightly simplify the expression of Pr_{M_t}(D):

Pr_{M_t}(D) = (1 − q)N(0, σ²) + qN(d_t, σ²) (90)

= N(0, (1 − q)²σ²) + N(qd_t, q²σ²) (91)

= N(qd_t, (1 − 2q + 2q²)σ²). (92)
Then we compute the privacy loss at epoch t:

W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{X_t∼Pr_{M_t}(D), Y_t∼Pr_{M_t}(D′)} [E∥X_t − Y_t∥^µ]^{1/µ}. (93)

Let Z_t = X_t − Y_t; thus we have

Z_t ∼ N(qd_t, (2 − 2q + 2q²)σ²). (94)
The privacy loss is

W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{d_t} [E(∥Z_t∥^µ)]^{1/µ}. (95)

Referring to the definition of the norm, we can obtain

∥Z_t∥ = (Σ_{i=1}^n |Z_{ti}|^µ)^{1/µ} ⇒ ∥Z_t∥^µ = Σ_{i=1}^n |Z_{ti}|^µ. (96)

According to the summation property of expectation, we have

E[∥Z_t∥^µ] = E[Σ_{i=1}^n |Z_{ti}|^µ] = Σ_{i=1}^n E(|Z_{ti}|^µ). (97)

Finally, we have

W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ}. (98) ∎

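Equation 98 can be checked numerically by Monte Carlo sampling from the distribution of Z_t. The following Python sketch (a hypothetical illustration using NumPy; it treats d_t as fixed rather than taking the infimum over d_t) estimates the per-epoch Wasserstein loss:

import numpy as np

def wdp_loss_epoch_mc(q, sigma, d_t, mu, n_dims=1, n_samples=100_000, seed=0):
    """Monte Carlo estimate of Equation 98 for a fixed d_t:
    W_mu ~ (sum_i E|Z_ti|^mu)^(1/mu), with Z_ti ~ N(q*d_t, (2-2q+2q^2)*sigma^2)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt((2 - 2 * q + 2 * q**2) * sigma**2)
    z = rng.normal(q * d_t, std, size=(n_samples, n_dims))
    return float(((np.abs(z) ** mu).mean(axis=0).sum()) ** (1 / mu))

# Example: subsampling rate 0.01, noise scale 1.0, gradient gap 0.5, order mu = 2
print(wdp_loss_epoch_mc(q=0.01, sigma=1.0, d_t=0.5, mu=2.0))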
Proof of Theorem 3
Theorem 3 (Tail Bound). Under the conditions described in Theorem 2, M satisfies (µ, δ)-WDP for

log δ = β Σ_{t=1}^T inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ} − βε, Z ∼ N(qd_t, (2 − 2q + 2q²)σ²). (99)
Proof. In Theorem 1, we have proved that

exp[β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε] ≤ δ. (100)

Taking logarithms on both sides of Equation 100, we can obtain

β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε ≤ log δ. (101)

In Theorem 2, we have proved that

W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ}, (102)

where Z ∼ N(qd_t, (2 − 2q + 2q²)σ²) and d_t = ∥g_t − g′_t∥_2.
Plugging Equation 102 into Equation 101, we can obtain

β Σ_{t=1}^T E[inf_{d_t} (Σ_{i=1}^n E(|Z_{ti}|^µ))^{1/µ}] − βε ≤ log δ, (103)

where E(|Z|^µ) can be obtained with the help of Lemma 1, so we regard it as a single computable quantity.
Observing Equation 103, we find that the uncertainty comes from two sources: the Gaussian random variable Z and the norm of the pairwise gradients ∥g_t − g′_t∥_2. However, both uncertainties are already eliminated by the inner expectation and the operation of infimum. Thus, the outer expectation is no longer needed, and the expression simplifies to
β Σ_{t=1}^T inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ} − βε ≤ log δ. (104)
Since we always want the probability of failure to be as small as possible, we replace the inequality with an equality:

log δ = β Σ_{t=1}^T inf_{d_t} [Σ_{i=1}^n E(|Z_{ti}|^µ)]^{1/µ} − βε. (105) ∎

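Putting Theorems 2 and 3 together, a minimal Wasserstein accountant can be sketched in Python as follows (hypothetical helper names; step_losses[t] denotes the quantity inf_{d_t} [Σ_i E(|Z_ti|^µ)]^{1/µ} from Theorem 2, computable via Lemma 1):

import math

def log_delta(step_losses, eps, beta):
    """Theorem 3 (Equation 105): log(delta) = beta * sum_t loss_t - beta * eps."""
    return beta * sum(step_losses) - beta * eps

def eps_at_delta(step_losses, delta, beta):
    """Equation 105 solved for eps: eps = sum_t loss_t - log(delta) / beta."""
    return sum(step_losses) - math.log(delta) / beta

# Example: 1000 SGD steps, each with Wasserstein loss 0.002, target delta = 1e-5
print(eps_at_delta([0.002] * 1000, delta=1e-5, beta=1.0))  # ≈ 13.51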
Proof of Lemma 1
Lemma 1 (Raw Absolute Moment). Assume that Z ∼ N(qd_t, (2 − 2q + 2q²)σ²). Then the raw absolute moment of Z is

E(|Z|^µ) = (2Var)^{µ/2} · (GF((µ+1)/2)/√π) · K(−µ/2, 1/2; −q²d_t²/(2Var)),

where Var denotes the variance of the Gaussian random variable Z, expressed as Var = (2 − 2q + 2q²)σ². GF((µ+1)/2) denotes the Gamma function

GF((µ+1)/2) = ∫_0^∞ x^{(µ+1)/2 − 1} e^{−x} dx, (106)

and K(−µ/2, 1/2; −q²d_t²/(2Var)) denotes Kummer's confluent hypergeometric function

K(−µ/2, 1/2; −q²d_t²/(2Var)) = Σ_{n=0}^∞ [q^{2n} d_t^{2n} / (n! · 4^n (1 − q + q²)^n σ^{2n})] · Π_{i=1}^n (µ − 2i + 2)/(1 + 2i − 2). (107)
Proof. From Equation 17 in Winkelbauer (2012), we can obtain the expression of E(|Z|^µ) as

E(|Z|^µ) = (2Var)^{µ/2} · (GF((µ+1)/2)/√π) · K(−µ/2, 1/2; −q²d_t²/(2Var)), (108)

where K(−µ/2, 1/2; −q²d_t²/(2Var)) can be deduced further as follows:

K(−µ/2, 1/2; −q²d_t²/(2Var)) = K(−µ/2, 1/2; −q²d_t²/(2(2 − 2q + 2q²)σ²)) (109)

= K(−µ/2, 1/2; −q²d_t²/(4(1 − q + q²)σ²)) (110)

= Σ_{n=0}^∞ [(−µ/2)_n / (1/2)_n] · (−q²d_t²/(4(1 − q + q²)σ²))^n / n! (111)

= Σ_{n=0}^∞ [(−µ/2)_n / (1/2)_n] · (−1)^n · (q²d_t²/(4(1 − q + q²)σ²))^n / n! (112)

= Σ_{n=0}^∞ (−1)^n [(−µ/2)_n / (1/2)_n] · (q²d_t²)^n / (n! · (4(1 − q + q²)σ²)^n) (113)

= Σ_{n=0}^∞ (−1)^n [(−µ/2)_n / (1/2)_n] · q^{2n} d_t^{2n} / (n! · 4^n (1 − q + q²)^n σ^{2n}). (114)
Here (−µ/2)_n is the rising factorial of −µ/2 (see Winkelbauer (2012)):

(−µ/2)_n = GF(−µ/2 + n) / GF(−µ/2) (115)

= (−µ/2) · (−µ/2 + 1) · ... · (−µ/2 + n − 1) (116)

= (−1)^n (1/2)^n · µ · (µ − 2) · ... · (µ − 2n + 2). (117)

Similarly,

(1/2)_n = GF(1/2 + n) / GF(1/2) (118)

= (1/2) · (1/2 + 1) · ... · (1/2 + n − 1) (119)

= (1/2)^n · [1 · 3 · ... · (1 + 2n − 2)]. (120)
From Equations 117 and 120, we have

(−µ/2)_n / (1/2)_n = (−1)^n · µ · (µ − 2) · ... · (µ − 2n + 2) / [1 · 3 · ... · (1 + 2n − 2)] (121)

= (−1)^n · Π_{i=1}^n (µ − 2(i − 1)) / Π_{i=1}^n (1 + 2i − 2) (122)

= (−1)^n · Π_{i=1}^n (µ − 2i + 2)/(1 + 2i − 2). (123)

Combining Equations 114 and 123, we can obtain

K(−µ/2, 1/2; −q²d_t²/(2Var)) = Σ_{n=0}^∞ [q^{2n} d_t^{2n} / (n! · 4^n (1 − q + q²)^n σ^{2n})] · Π_{i=1}^n (µ − 2i + 2)/(1 + 2i − 2) (124)

= Σ_{n=0}^∞ [q^{2n} d_t^{2n} / (n! · 2^n (Var)^n)] · Π_{i=1}^n (µ − 2i + 2)/(1 + 2i − 2), (125)

since 4^n (1 − q + q²)^n σ^{2n} = 2^n (Var)^n. ∎

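In practice, Equation 108 can be evaluated directly with SciPy's Gamma and Kummer confluent hypergeometric functions instead of truncating the series in Equation 125. A minimal sketch (assuming NumPy and SciPy are available; the helper name is ours):

import numpy as np
from scipy.special import gamma, hyp1f1

def raw_abs_moment(q, sigma, d_t, mu):
    """Lemma 1 / Equation 108: E(|Z|^mu) for Z ~ N(q*d_t, (2-2q+2q^2)*sigma^2),
    i.e. (2*Var)^(mu/2) * Gamma((mu+1)/2) / sqrt(pi) * 1F1(-mu/2, 1/2, -m^2/(2*Var))."""
    var = (2 - 2 * q + 2 * q**2) * sigma**2
    mean = q * d_t
    return ((2 * var) ** (mu / 2) * gamma((mu + 1) / 2) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2, 0.5, -mean**2 / (2 * var)))

# Sanity check: for mu = 2 the raw absolute moment equals Var + mean^2
print(raw_abs_moment(q=0.01, sigma=1.0, d_t=0.5, mu=2.0))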
Experiments
Composition with Clipping
Figure 4 demonstrates how the privacy budget changes as the number of steps increases. We find that the impact of C on the privacy budget is reduced, because the gradient norm is limited by the clipping threshold, and the gap in privacy budgets between different DP frameworks has narrowed. However, this does not prevent WDP from still obtaining the lowest cumulative privacy budget, and this value grows a bit more slowly than that of DP and BDP.

Figure 4: Privacy budgets over synthetic gradients obtained by moments accountant under DP, Bayesian accountant under BDP and Wasserstein accountant under WDP when applying gradient clipping. Panels: (a) 0.05-quantile of ∥g_t∥; (b) 0.50-quantile of ∥g_t∥; (c) 0.75-quantile of ∥g_t∥; (d) 0.99-quantile of ∥g_t∥.
Several Basic Concepts in Differential Privacy
Definition 6 (Differential Privacy (Dwork et al. 2006b)). A randomized algorithm M : D → R is (ε, δ)-differentially private
if for any adjacent datasets D, D′ ∈ D and all measurable subsets S ⊆ R the following inequality holds:
Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ, (126)

where Pr[·] denotes probability and ε is known as the privacy budget. In particular, if δ = 0, M is said to preserve ε-DP or pure DP.
Definition 7 (Privacy Loss of DP). For a randomized algorithm M : D → R with outcome o, the privacy loss of M is defined as

Loss(o) = log(Pr[M(D) = o] / Pr[M(D′) = o]). (127)

The privacy budget is the strict upper bound of the privacy loss in ε-differential privacy, and a quasi upper bound of the privacy loss with confidence 1 − δ in (ε, δ)-differential privacy.
Definition 8 (l_p-Sensitivity (Dwork and Lei 2009)). Sensitivity in DP theory is defined as the maximum p-norm distance between the outputs of the same query function on two adjacent datasets D and D′:

∆_p f = sup_{ρ(D,D′)≤1} ∥f(D) − f(D′)∥_p, (128)

where f : D → R^d is a d-dimensional query function and ρ(D, D′) = ∥D − D′∥_p is the norm function between D and D′. l_p-sensitivity measures the largest difference over all possible adjacent datasets; for example, a counting query f(D) = |D| has ∆_p f = 1.
Definition 9 (Rényi Differential Privacy (Mironov 2017)). A randomized algorithm M : D → R is said to preserve (α, ε)-RDP if for any adjacent datasets D, D′ ∈ D the following holds:

D_α(Pr_M(D)∥Pr_M(D′)) = (1/(α−1)) log E_{o∼M(D′)}[(p_{M(D)}(o)/p_{M(D′)}(o))^α] ≤ ε, (129)

where α ∈ (1, +∞) is the order of RDP and o is the output of algorithm M. Pr_M(D) and Pr_M(D′) are probability distributions, while p_M(D) and p_M(D′) are probability density functions.
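As a concrete instance of Definition 9, the Gaussian mechanism with l_2-sensitivity ∆ and noise scale σ satisfies (α, α∆²/(2σ²))-RDP (Mironov 2017). A one-line Python sketch (hypothetical helper):

def rdp_gaussian(alpha, sensitivity, sigma):
    """RDP budget of the Gaussian mechanism: eps = alpha * Delta**2 / (2 * sigma**2)."""
    return alpha * sensitivity**2 / (2 * sigma**2)

print(rdp_gaussian(alpha=2.0, sensitivity=1.0, sigma=1.0))  # 1.0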
Definition 10 (Strong Bayesian Differential Privacy (Triastcyn and Faltings 2020)). A randomized algorithm M : D → R is said to satisfy (ε_b, δ_b)-strong Bayesian differential privacy if for any adjacent datasets D, D′ ∈ D the following holds:

Pr[log(p(o|D)/p(o|D′)) ≥ ε_b] ≤ δ_b, (130)

where ε_b and δ_b are the privacy budget and failure probability in BDP (Triastcyn and Faltings 2020), o = M(·) is the output, and p(o|D) and p(o|D′) are the probability density functions under the adjacent datasets.
Definition 11 (Bayesian Differential Privacy (Triastcyn and Faltings 2020)). Suppose the only different data entry x′ follows a certain distribution b(x), namely x′ ∼ b(x). A randomized algorithm M : D → R is said to satisfy (ε_b, δ_b)-Bayesian differential privacy if for any neighboring datasets D, D′ ∈ D and any set of outcomes O the following holds:

Pr[M(D) ∈ O] ≤ e^{ε_b} Pr[M(D′) ∈ O] + δ_b. (131)
From the above definitions, we find that strong BDP is inspired by RDP, and the definition of BDP is similar to that of DP. Therefore, the weaknesses of DP, BDP and RDP are similar: (1) their privacy losses do not satisfy symmetry and the triangle inequality, which prevents them from being metrics; (2) their privacy budgets tend to be overstated. To alleviate these problems, we propose Wasserstein differential privacy in this paper, expecting to achieve better properties in privacy computing and thus better performance in private machine learning.
