Wasserstein Differential Privacy: Chengyi Yang, Jiayin Qi, Aimin Zhou
Table 1: Privacy budgets of DP, RDP and WDP for basic mechanisms. The Laplace-mechanism and Gaussian-mechanism budgets of DP and RDP with sensitivity 1 are taken from Table II in Mironov (2017). For WDP, the sensitivity ∆_p f can be an arbitrary positive constant.

Framework | Laplace mechanism Lap(0, λ) | Gaussian mechanism N(0, σ²)
DP | 1/λ | ∞
RDP of order α | α > 1: (1/(α−1)) log( (α/(2α−1)) exp((α−1)/λ) + ((α−1)/(2α−1)) exp(−α/λ) ); α = 1: 1/λ + exp(−1/λ) − 1 | α/(2σ²)
WDP of order µ | (1/2) ∆_p f (2[1/λ + exp(−1/λ) − 1])^{1/(2µ)} | (1/2) (∆_p f/σ)^{1/µ}
Figure 1: Privacy budget curves of (µ, ε)-WDP and (α, ε)-RDP for the Laplace mechanism (LM) and Gaussian mechanism (GM) with varying orders, where λ and σ are the scale parameters of LM and GM respectively. The sensitivities are set to 1 and remain unchanged.
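The closed forms in Table 1 can be evaluated directly to reproduce curves like those in Figure 1. The following Python sketch is our own illustration (function and parameter names are not from the paper); the DP budget of the Gaussian mechanism is unbounded and therefore omitted.

```python
import numpy as np

def dp_laplace(lam):
    """epsilon of the Laplace mechanism under pure DP (Table 1): 1/lambda."""
    return 1.0 / lam

def rdp_laplace(alpha, lam):
    """RDP budget of the Laplace mechanism of order alpha (Mironov 2017, Table II)."""
    if alpha == 1:
        return 1.0 / lam + np.exp(-1.0 / lam) - 1.0
    return np.log(alpha / (2 * alpha - 1) * np.exp((alpha - 1) / lam)
                  + (alpha - 1) / (2 * alpha - 1) * np.exp(-alpha / lam)) / (alpha - 1)

def rdp_gaussian(alpha, sigma):
    """RDP budget of the Gaussian mechanism of order alpha: alpha / (2 sigma^2)."""
    return alpha / (2 * sigma ** 2)

def wdp_laplace(mu, lam, sens=1.0):
    """WDP budget of the Laplace mechanism of order mu (Proposition 7)."""
    return 0.5 * sens * (2 * (1.0 / lam + np.exp(-1.0 / lam) - 1.0)) ** (1.0 / (2 * mu))

def wdp_gaussian(mu, sigma, sens=1.0):
    """WDP budget of the Gaussian mechanism of order mu (Proposition 8)."""
    return 0.5 * (sens / sigma) ** (1.0 / mu)
```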
(a) 0.05-quantile of ∥gt ∥ (b) 0.50-quantile of ∥gt ∥ (c) 0.75-quantile of ∥gt ∥ (d) 0.99-quantile of ∥gt ∥
Figure 2: Privacy budgets over synthetic gradients obtained by moments accountant under DP, Bayesian accountant under BDP
and Wasserstein accountant under WDP without gradient clipping.
Table 2: Privacy budgets accounted by DP, BDP and WDP on MNIST, CIFAR-10, SVHN and Fashion-MNIST (F-MNIST). The values in parentheses are the probabilities of a successful potential attack, computed as P(A) = 1/(1 + e^{−ε}) (see Section 3 in Triastcyn and Faltings (2020)).
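A minimal sketch of the conversion used in the parentheses above (the function name is ours):

```python
import numpy as np

def attack_success_probability(eps):
    """P(A) = 1 / (1 + exp(-eps)) (Triastcyn and Faltings 2020, Section 3)."""
    return 1.0 / (1.0 + np.exp(-eps))
```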
With the above basic conclusions, we can obtain more derivative relationships through RDP or DP. For example, we can obtain that (µ, ε)-WDP implies ((1/2)L² · ε^{µ/(µ+1)})-zCDP (zero-concentrated differential privacy) according to Proposition 1.4 in Bun and Steinke (2016).

Advantages from Metric Property
The privacy losses of DP, RDP and BDP are all non-negative but asymmetric, and do not satisfy the triangle inequality (Mironov 2017). Several obvious advantages of WDP as a metric DP have been mentioned in the introduction and verified in the experiments; here we provide additional details.
Triangle inequality. (1) Several properties, including basic sequential composition, group privacy and advanced composition, are derived from the triangle inequality. (2) Properties in WDP are more comprehensible and easier to utilize than those in RDP. For example, RDP has to introduce the additional conditions of 2^c-stability and α ≥ 2^{c+1} to derive group privacy (see Proposition 2 in Mironov (2017)), where c is a constant. In contrast, WDP utilizes its intrinsic triangle inequality to obtain group privacy without introducing any complex concepts or conditions.
Symmetry. We have considered that the asymmetry of the privacy loss is not transferred to the privacy budget. Specifically, even if D_α(Pr_M(D)∥Pr_M(D′)) ≠ D_α(Pr_M(D′)∥Pr_M(D)), D_α(Pr_M(D)∥Pr_M(D′)) ≤ ε still implies D_α(Pr_M(D′)∥Pr_M(D)) ≤ ε, because the neighboring datasets D and D′ range over all possible pairs. Even so, a symmetric privacy loss still has at least two advantages: (1) When computing privacy budgets, it reduces the amount of computation for traversing adjacent datasets by half. (2) When proving properties, it is not necessary to exchange the datasets and repeat the deduction, as is required for non-metric DP (e.g. see the proof of Theorem 3 in Triastcyn and Faltings (2020)).

Limitations
WDP has excellent mathematical properties as a metric DP, and can effectively alleviate exploding privacy budgets as an alternative DP framework. However, when the volume of data in the queried database is extremely small, WDP may release a much smaller privacy budget than other DP frameworks. Fortunately, this situation only occurs when very little data is available in the dataset. WDP has great potential in deep learning, which requires a large amount of data to train neural network models.

Additional Specifications
Other possibility. Symmetry can also be obtained by replacing the Rényi divergence with the Jensen-Shannon divergence (JSD) (Rao and Nayak 1985). However, JSD does not satisfy the triangle inequality unless we take its square root instead (Osán, Bussandri, and Lamberti 2018), and, being defined through a divergence, it still tends to exaggerate privacy budgets excessively.
Comparability. Another question worth explaining is why the privacy budgets obtained by DP, RDP and WDP can be compared. (1) Their processes of computing privacy budgets follow the same mapping, namely M : D → R. (2) They essentially measure the differences between the output distributions on adjacent datasets, although their respective measurement methods differ. (3) Privacy budgets can be uniformly transformed into the probability of a successful attack (Triastcyn and Faltings 2020).
Computational problem. Although obtaining the Wasserstein distance requires relatively high computational costs (Dudley 1969; Fournier and Guillin 2015), we do not need to worry about this issue, because WDP never has to calculate the Wasserstein distance directly, neither in the basic privacy mechanisms nor in the Wasserstein accountant for deep learning (see Propositions 7-8 and Theorems 1-3).

Conclusion
In this paper, we propose an alternative DP framework called Wasserstein differential privacy (WDP), based on the Wasserstein distance. WDP satisfies symmetry, the triangle inequality and non-negativity, properties that other DP frameworks do not all satisfy, and which make the privacy losses under WDP true metrics. We prove that WDP has several excellent properties (see Propositions 1-13) through Lyapunov's inequality, Minkowski's inequality, Jensen's inequality, Markov's inequality, Pinsker's inequality and the triangle inequality. We also derive an advanced composition theorem, the privacy loss and the absolute moment under the postulation of WDP, and finally obtain the Wasserstein accountant to compute cumulative privacy budgets in deep learning (see Theorems 1-3 and Lemma 1). Our evaluations on basic mechanisms, compositions and deep learning show that WDP makes privacy budgets more stable and can effectively avoid overestimation or even explosion of the privacy budget.
Acknowledgments
This work is supported by National Natural Science Foundation of China (No. 72293583, No. 72293580), Science and Technology Commission of Shanghai Municipality Grant (No. 22511105901), Defense Industrial Technology Development Program (JCKY2019204A007) and Sino-German Research Network (GZ570).

References
Abadi, M.; Chu, A.; Goodfellow, I. J.; McMahan, H. B.; Mironov, I.; Talwar, K.; and Zhang, L. 2016. Deep Learning with Differential Privacy. In Proceedings of ACM SIGSAC Conference on Computer and Communications Security (CCS), 308–318.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), 214–223.
Bobkov, S.; and Ledoux, M. 2019. One-Dimensional Empirical Measures, Order Statistics, and Kantorovich Transport Distances. Memoirs of the American Mathematical Society, 261(1259).
Bun, M.; Dwork, C.; Rothblum, G. N.; and Steinke, T. 2018. Composable and Versatile Privacy via Truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 74–86. ACM.
Bun, M.; and Steinke, T. 2016. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. In Theory of Cryptography Conference (TCC), volume 9985, 635–658.
Cheng, A.; Wang, J.; Zhang, X. S.; Chen, Q.; Wang, P.; and Cheng, J. 2022. DPNAS: Neural Architecture Search for Deep Learning with Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 6358–6366.
Clement, P.; and Desch, W. 2008. An Elementary Proof of the Triangle Inequality for the Wasserstein Metric. Proceedings of the American Mathematical Society, 136(1): 333–339.
Dharangutte, P.; Gao, J.; Gong, R.; and Yu, F. 2023. Integer Subspace Differential Privacy. In Williams, B.; Chen, Y.; and Neville, J., eds., Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI), 7349–7357. AAAI Press.
Dong, J.; Roth, A.; and Su, W. J. 2022. Gaussian Differential Privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1): 3–37.
Dudley, R. M. 1969. The Speed of Mean Glivenko-Cantelli Convergence. Annals of Mathematical Statistics, 40: 40–50.
Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; and Naor, M. 2006a. Our Data, Ourselves: Privacy via Distributed Noise Generation. In Vaudenay, S., ed., 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), volume 4004, 486–503. Springer.
Dwork, C.; and Lei, J. 2009. Differential Privacy and Robust Statistics. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), 371–380.
Dwork, C.; McSherry, F.; Nissim, K.; and Smith, A. D. 2006b. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, Third Theory of Cryptography Conference (TCC), volume 3876, 265–284. Springer.
Dwork, C.; and Roth, A. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4): 211–407.
Dwork, C.; and Rothblum, G. N. 2016. Concentrated Differential Privacy. arXiv preprint arXiv:1603.01887.
Erven, T. V.; and Harremoës, P. 2014. Rényi Divergence and Kullback-Leibler Divergence. IEEE Transactions on Information Theory, 60(7): 3797–3820.
Fedotov, A. A.; Harremoës, P.; and Topsøe, F. 2003. Refinements of Pinsker's Inequality. IEEE Transactions on Information Theory, 49(6): 1491–1498.
Fournier, N.; and Guillin, A. 2015. On the Rate of Convergence in Wasserstein Distance of the Empirical Measure. Probability Theory and Related Fields, 162: 707–738.
Gao, J.; Gong, R.; and Yu, F. 2022. Subspace Differential Privacy. In Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 3986–3995.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. C. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), 5767–5777.
Jin, H.; and Chen, X. 2022. Gromov-Wasserstein Discrepancy with Local Differential Privacy for Distributed Structural Graphs. In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI), 2115–2121.
Kantorovich, L. V.; and Rubinshten, G. S. 1958. On a Space of Completely Additive Functions. Vestnik Leningrad Univ, 13(7): 52–59.
Kasiviswanathan, S. P.; Lee, H. K.; Nissim, K.; Raskhodnikova, S.; and Smith, A. D. 2011. What Can We Learn Privately? SIAM Journal on Computing, 40(3): 793–826.
Krizhevsky, A.; and Hinton, G. 2009. Learning Multiple Layers of Features from Tiny Images. Handbook of Systemic Autoimmune Diseases, 1(4).
Lecun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11): 2278–2324.
McSherry, F. 2009. Privacy Integrated Queries: An Extensible Platform for Privacy-Preserving Data Analysis. In Proceedings of ACM International Conference on Management of Data (SIGMOD), 19–30.
Mironov, I. 2017. Rényi Differential Privacy. In 30th IEEE Computer Security Foundations Symposium (CSF), 263–275.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Osán, T. M.; Bussandri, D. G.; and Lamberti, P. W. 2018. Monoparametric Family of Metrics Derived from Classical Jensen–Shannon Divergence. Physica A: Statistical Mechanics and its Applications, 495: 336–344.
Panaretos, V. M.; and Zemel, Y. 2019. Statistical Aspects of
Wasserstein Distances. Annual Review of Statistics and Its
Application, 6(1).
Phan, N.; Vu, M. N.; Liu, Y.; Jin, R.; Dou, D.; Wu, X.; and
Thai, M. T. 2019. Heterogeneous Gaussian Mechanism: Pre-
serving Differential Privacy in Deep Learning with Provable
Robustness. In International Joint Conference on Artificial
Intelligence (IJCAI), 4753–4759.
Rakotomamonjy, A.; and Ralaivola, L. 2021. Differen-
tially Private Sliced Wasserstein Distance. In Proceedings
of the 38th International Conference on Machine Learning
(ICML), volume 139, 8810–8820.
Rao, C.; and Nayak, T. 1985. Cross entropy, Dissimilar-
ity Measures, and Characterizations of Quadratic Entropy.
IEEE Transactions on Information Theory, 31(5): 589–593.
Rüschendorf, L. 2009. Optimal Transport. Old and New.
Jahresbericht der Deutschen Mathematiker-Vereinigung,
111(2): 18–21.
Shokri, R.; and Shmatikov, V. 2015. Privacy-Preserving
Deep Learning. In Proceedings of ACM SIGSAC Conference
on Computer and Communications Security (CCS), 1310–
1321.
Shokri, R.; Stronati, M.; Song, C.; and Shmatikov, V. 2017.
Membership Inference Attacks Against Machine Learning
Models. In IEEE Symposium on Security and Privacy (SP),
3–18.
Tien, N. L.; Habrard, A.; and Sebban, M. 2019. Differ-
entially Private Optimal Transport: Application to Domain
Adaptation. In Proceedings of the 28th International Joint
Conference on Artificial Intelligence (IJCAI), 2852–2858.
Triastcyn, A.; and Faltings, B. 2020. Bayesian Differential
Privacy for Machine Learning. In International Conference
on Machine Learning (ICML), 9583–9592.
Wang, Y.; Si, C.; and Wu, X. 2015. Regression Model Fit-
ting under Differential Privacy and Model Inversion Attack.
In International Joint Conference on Artificial Intelligence
(IJCAI), 1003–1009.
Winkelbauer, A. 2012. Moments and Absolute Moments of
the Normal Distribution. arXiv preprint arXiv:1209.4340.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST:
a Novel Image Dataset for Benchmarking Machine Learning
Algorithms. arXiv preprint arXiv:1708.07747.
Zhu, L.; Liu, Z.; and Han, S. 2019. Deep Leakage from
Gradients. In Advances in Neural Information Processing
Systems (NeurIPS), 14747–14756.
Proof of Propositions and Theorems
Proof of Proposition 1
Proposition 1 (Symmetry). Let M be a (µ, ε)-WDP algorithm. For any µ ≥ 1 and ε ≥ 0, the following holds:
W_µ(Pr_M(D), Pr_M(D′)) = W_µ(Pr_M(D′), Pr_M(D)) ≤ ε.
Proof. Considering the definition of (µ,ε)-WDP, we have
W_µ(Pr_M(D), Pr_M(D′)) = inf_{γ∈Γ(Pr_M(D), Pr_M(D′))} [ ∫_{X×Y} ρ(x, y)^µ dγ(x, y) ]^{1/µ} ≤ ε.
The symmetry of Wasserstein differential privacy is obvious, because the set of joint distributions satisfies
Γ(Pr_M(D′), Pr_M(D)) = Γ(Pr_M(D), Pr_M(D′)).
Next, we want to prove that Kantorovich differential privacy also satisfies symmetry. Considering the definition of Kantorovich differential privacy, we have
K(Pr_M(D), Pr_M(D′)) = sup_{∥φ∥_L ≤ 1} E_{x∼Pr_M(D)}[φ(x)] − E_{x∼Pr_M(D′)}[φ(x)]  (17)
and
K(Pr_M(D′), Pr_M(D)) = sup_{∥φ∥_L ≤ 1} E_{x∼Pr_M(D′)}[φ(x)] − E_{x∼Pr_M(D)}[φ(x)].  (18)
If we set ψ(x) = −φ(x), then the latter formula can be written as
K(Pr_M(D′), Pr_M(D)) = sup_{∥ψ∥_L ≤ 1} E_{x∼Pr_M(D′)}[−ψ(x)] − E_{x∼Pr_M(D)}[−ψ(x)]
= sup_{∥ψ∥_L ≤ 1} E_{x∼Pr_M(D)}[ψ(x)] − E_{x∼Pr_M(D′)}[ψ(x)]
= K(Pr_M(D), Pr_M(D′)).
Proof of Proposition 2
Proposition 2 (Triangle Inequality). Let D1, D2, D3 ∈ D be three arbitrary datasets. Suppose there are fewer differing data entries between D1 and D2 than between D1 and D3, and the differences between D1 and D2 are included in the differences between D1 and D3. For any randomized algorithm M satisfying (µ, ε)-WDP with µ ≥ 1, we have
W_µ(Pr_M(D1), Pr_M(D3)) ≤ W_µ(Pr_M(D1), Pr_M(D2)) + W_µ(Pr_M(D2), Pr_M(D3)).  (20)
Proof. The triangle inequality has been proved as Proposition 2.1 in Clement and Desch (2008); here we provide a simpler proof from another perspective.
Firstly, we introduce another mathematical form that defines the Wasserstein distance (see Definition 6.1 in Rüschendorf
(2009) or Equation 1 in Panaretos and Zemel (2019))
W_µ(P, Q) = inf_{X∼P, Y∼Q} [E ρ(X, Y)^µ]^{1/µ},  µ ≥ 1,  (21)
where X and Y are random vectors, and the infimum is taken over all possible pairs of X and Y that are marginally distributed as P and Q.
Let X1, X2, X3 be three random variables following the distributions Pr_M(D1), Pr_M(D2), Pr_M(D3) respectively. Then
W_µ(Pr_M(D1), Pr_M(D3)) = inf_{X1∼Pr_M(D1), X3∼Pr_M(D3)} [E ρ(X1, X3)^µ]^{1/µ}  (22)
≤ inf_{X1∼Pr_M(D1), X2∼Pr_M(D2)} [E ρ(X1, X2)^µ]^{1/µ} + inf_{X2∼Pr_M(D2), X3∼Pr_M(D3)} [E ρ(X2, X3)^µ]^{1/µ}  (23)
= W_µ(Pr_M(D1), Pr_M(D2)) + W_µ(Pr_M(D2), Pr_M(D3)),
where Inequality 23 follows from the pointwise triangle inequality ρ(X1, X3) ≤ ρ(X1, X2) + ρ(X2, X3) together with Minkowski's inequality.
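The metric axioms above can be illustrated numerically for µ = 1 using SciPy's first-order Wasserstein distance; the sampled Gaussians below are arbitrary stand-ins for the output distributions Pr_M(D_i), not distributions from the paper.

```python
import numpy as np
from scipy.stats import wasserstein_distance  # first-order (mu = 1) Wasserstein distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10000)  # stand-in samples for Pr_M(D1)
b = rng.normal(0.5, 1.0, 10000)  # stand-in samples for Pr_M(D2)
c = rng.normal(1.0, 1.0, 10000)  # stand-in samples for Pr_M(D3)

w_ab = wasserstein_distance(a, b)
assert np.isclose(w_ab, wasserstein_distance(b, a))  # symmetry (Proposition 1)
assert wasserstein_distance(a, c) <= w_ab + wasserstein_distance(b, c) + 1e-9  # triangle inequality (Proposition 2)
assert w_ab >= 0.0  # non-negativity (Proposition 3)
```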
Proof of Proposition 3
Proposition 3 (Non-Negativity). For µ ≥ 1 and any randomized algorithm M, we have Wµ (P rM (D), P rM (D′ )) ≥ 0.
Proof. The integrand ρ(x, y) ≥ 0, because it is a cost function in the sense of optimal transport (Rüschendorf 2009) and a norm in the statistical sense (Panaretos and Zemel 2019). Moreover, γ(x, y) is a probability measure, so γ(x, y) ≥ 0. Then, according to the definition of WDP, the integral satisfies
inf_{γ∈Γ(Pr_M(D), Pr_M(D′))} [ ∫_{X×Y} ρ(x, y)^µ dγ(x, y) ]^{1/µ} ≥ 0.
Proof of Proposition 4
Proposition 4 (Monotonicity). For 1 ≤ µ1 ≤ µ2, we have W_{µ1}(Pr_M(D), Pr_M(D′)) ≤ W_{µ2}(Pr_M(D), Pr_M(D′)); equivalently, (µ2, ε)-WDP implies (µ1, ε)-WDP.
Proof. Consider the expectation form of Wasserstein differential privacy (see Equation 21) and apply Lyapunov's inequality
[E|·|^{µ1}]^{1/µ1} ≤ [E|·|^{µ2}]^{1/µ2},  1 ≤ µ1 ≤ µ2.  (25)
We obtain that
W_{µ1}(Pr_M(D), Pr_M(D′)) = inf_{X∼M(D), Y∼M(D′)} [E ρ(X, Y)^{µ1}]^{1/µ1}
≤ inf_{X∼M(D), Y∼M(D′)} [E ρ(X, Y)^{µ2}]^{1/µ2} = W_{µ2}(Pr_M(D), Pr_M(D′)).  (26)
Proof of Proposition 5
Proposition 5 (Parallel Composition). Suppose a dataset D is divided disjointly into n parts denoted D_i, i = 1, 2, ..., n, and each randomized algorithm M_i is performed on its own part D_i. If M_i : D → R_i satisfies (µ, ε_i)-WDP for i = 1, 2, ..., n, then the set of randomized algorithms M = {M_1, M_2, ..., M_n} satisfies (µ, max{ε_1, ε_2, ..., ε_n})-WDP.
Proof. From the definition of WDP, we obtain that
W_µ(Pr_M(D′), Pr_M(D)) = inf_{γ∈Γ(Pr_M(D), Pr_M(D′))} [ ∫_{X×Y} ρ(x, y)^µ dγ(x, y) ]^{1/µ}  (27)
≤ max_i inf_{γ∈Γ(Pr_{M_i}(D_i), Pr_{M_i}(D_i′))} [ ∫_{X×Y} ρ(x, y)^µ dγ(x, y) ]^{1/µ},  ∀ M_i ⊆ M, D_i ⊆ D  (28)
≤ max{ε_1, ε_2, ..., ε_n}.  (29)
Inequality 28 holds for the following reasons. (1) The privacy budget in the WDP framework concerns the upper bound of the privacy loss or distance. (2) The randomized algorithm in M that yields the maximum privacy budget is some particular M_i, because only one differential privacy mechanism is applied in both W_µ(Pr_M(D′), Pr_M(D)) and W_µ(Pr_{M_i}(D_i), Pr_{M_i}(D_i′)). (3) There is only one differing element between D and D′ as well as between D_i and D_i′, and from the perspective of the whole distribution this difference weighs more when the data volume is small; the query algorithm in differential privacy is required to hide individual differences, and a larger amount of data helps to hide them.
Proof of Proposition 6
Proposition 6 (Sequential Composition). Consider a series of randomized algorithms M = {M_1, ..., M_i, ..., M_n} performed on a dataset sequentially. If every M_i : D → R_i satisfies (µ, ε_i)-WDP, then M satisfies (µ, Σ_{i=1}^n ε_i)-WDP.
Proof. Consider the mathematical form of (µ, ε_i)-WDP:
W_µ(Pr_{M_1}(D), Pr_{M_1}(D′)) ≤ ε_1,
W_µ(Pr_{M_2}(D), Pr_{M_2}(D′)) ≤ ε_2,
···
W_µ(Pr_{M_n}(D), Pr_{M_n}(D′)) ≤ ε_n.  (30)
According to the basic properties of inequalities, we can obtain the upper bound of the sum of Wasserstein distances:
Σ_{i=1}^n W_µ(Pr_{M_i}(D), Pr_{M_i}(D′)) ≤ Σ_{i=1}^n ε_i.  (31)
According to the triangle inequality of the Wasserstein distance (see Proposition 2), we have
Σ_{i=1}^n W_µ(Pr_{M_i}(D), Pr_{M_i}(D′)) ≥ W_µ(Pr_M(D), Pr_M(D′)).  (32)
Thus, we obtain that W_µ(Pr_M(D), Pr_M(D′)) ≤ Σ_{i=1}^n ε_i.
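For illustration, the two composition rules just proved reduce to a max and a sum over the per-algorithm budgets; a trivial sketch (helper names are ours):

```python
def parallel_budget(eps_list):
    """Proposition 5: on disjoint parts of a dataset, the joint budget is the max."""
    return max(eps_list)

def sequential_budget(eps_list):
    """Proposition 6: on the same dataset, the budgets add up."""
    return sum(eps_list)

print(parallel_budget([0.5, 0.25, 0.25]))    # 0.5
print(sequential_budget([0.5, 0.25, 0.25]))  # 1.0
```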
Proof of Proposition 7
Proposition 7 (Laplace Mechanism). If an algorithm f : D → R has sensitivity ∆_p f and the order µ ≥ 1, then the Laplace mechanism M_L = f(x) + Lap(0, λ) preserves (µ, (1/2) ∆_p f (2[1/λ + exp(−1/λ) − 1])^{1/(2µ)})-Wasserstein differential privacy.
Proof. Considering the Wasserstein distance between two Laplace distributions, we have
W_µ(Lap(0, λ), Lap(∆_p f, λ)) = inf_{γ∈Γ(Lap(0,λ), Lap(∆_p f,λ))} [ ∫_{X×Y} ρ(x, y)^µ dγ(x, y) ]^{1/µ}  (33)
≤ inf_{γ∈Γ(Lap(0,λ), Lap(∆_p f,λ))} [ ∫_{X×Y} (∆_p f)^µ dγ(x, y) ]^{1/µ}  (34)
= ∆_p f inf_{γ∈Γ(Lap(0,λ), Lap(∆_p f,λ))} [ ∫_{X×Y} 1 dγ(x, y) ]^{1/µ}  (35)
= ∆_p f inf_{X∼Lap(0,λ), Y∼Lap(∆_p f,λ)} [E 1_{X≠Y}]^{1/µ}  (36)
= (1/2) ∆_p f (∥Lap(0, λ) − Lap(∆_p f, λ)∥_TV)^{1/µ}  (37)
≤ (1/2) ∆_p f ( √(2 D_KL(Lap(0, λ)∥Lap(∆_p f, λ))) )^{1/µ}.  (38)
Here ∆_p f is the l_p-sensitivity between two datasets (see Definition 8), and p is its order, which can be set to any positive integer as needed. X and Y are random variables following the Laplace distribution (see Equation 36). In addition, ∥·∥_TV denotes the total variation, and D_KL(P∥Q) denotes the Kullback-Leibler (KL) divergence between P and Q, which is also equal to the Rényi divergence of order one, D_1(P∥Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)). We can obtain Equation 37 from Equation 36 because of the probabilistic interpretation of the total variation when ρ(x, y) = 1, which is given on page 10 of Rüschendorf (2009). Equation 38 holds because of Pinsker's inequality (see Section I in Fedotov, Harremoës, and Topsøe (2003))
D_KL(P∥Q) ≥ (1/2) ∥P − Q∥²_TV,  (39)
which establishes a relation between the KL divergence and the total variation, where P and Q represent the distributions of two random variables.
To obtain the final result, we apply the budget of the Laplace mechanism under Rényi DP of order one (see Table II in Mironov (2017)):
D_1(Lap(0, λ)∥Lap(1, λ)) = 1/λ + exp(−1/λ) − 1.  (40)
Then we obtain the budget of the Laplace mechanism under Wasserstein DP:
W_µ(Lap(0, λ), Lap(1, λ)) ≤ (1/2) ∆_p f ( √(2[1/λ + exp(−1/λ) − 1]) )^{1/µ}.  (41)
Proof of Proposition 8
Proposition 8 (Gaussian Mechanism). If an algorithm f : D → R has sensitivity ∆_p f and the order µ ≥ 1, then the Gaussian mechanism M_G = f(x) + N(0, σ²) preserves (µ, (1/2)(∆_p f/σ)^{1/µ})-Wasserstein differential privacy.
Proof. Following the same steps as Equations 33-35 with Gaussian distributions, we have
W_µ(N(0, σ²), N(∆_p f, σ²)) ≤ ∆_p f inf_{X∼N(0,σ²), Y∼N(∆_p f,σ²)} [E 1_{X≠Y}]^{1/µ}  (45)
= (1/2) ∆_p f (∥N(0, σ²) − N(∆_p f, σ²)∥_TV)^{1/µ}  (46)
≤ (1/2) ∆_p f ( √(2 D_KL(N(0, σ²)∥N(∆_p f, σ²))) )^{1/µ}.  (47)
Here ∆_p f is the l_p-sensitivity between two datasets (see Definition 8), X and Y are random variables following the Gaussian distribution, ∥·∥_TV denotes the total variation, and D_KL(P∥Q) denotes the KL divergence between P and Q, which is also equal to the Rényi divergence of order one, D_1(P∥Q) (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)). We can obtain Equation 46 from Equation 45 because of the probabilistic interpretation of the total variation when ρ(x, y) = 1 (see page 10 in Rüschendorf (2009)). Equation 47 holds because of Pinsker's inequality (see Section I in Fedotov, Harremoës, and Topsøe (2003))
D_KL(P∥Q) ≥ (1/2) ∥P − Q∥²_TV,  (48)
which establishes a relation between the KL divergence and the total variation, where P and Q represent the distributions of two random variables.
To obtain the final result, we apply the property of the Gaussian mechanism under Rényi DP of order one (see Proposition 7 and Table II in Mironov (2017)):
D_1(N(0, σ²)∥N(1, σ²)) = (∆_p f)² / (2σ²).  (49)
Then we obtain the budget of the Gaussian mechanism under Wasserstein DP:
W_µ(N(0, σ²), N(1, σ²)) ≤ (1/2) ( √(2 (∆_p f)² / (2σ²)) )^{1/µ} = (1/2) (∆_p f/σ)^{1/µ}.  (50)
Thus we have proved that if the algorithm f has sensitivity 1, the Gaussian mechanism M_G satisfies (µ, (1/2)(∆_p f/σ)^{1/µ})-WDP.
Proof of Proposition 9
Proposition 9 (From DP to WDP). If M preserves ε-DP with sensitivity ∆_p f, it also satisfies (µ, (1/2) ∆_p f (2ε · (e^ε − 1))^{1/(2µ)})-WDP.
Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33-38, we have
W_µ(Pr_M(D), Pr_M(D′)) ≤ (1/2) ∆_p f ( √(2 D_KL(Pr_M(D)∥Pr_M(D′))) )^{1/µ}.  (51)
To deduce further, we apply Lemma 3.18 in Dwork and Roth (2014), which states that if two random variables X, Y satisfy D_∞(X∥Y) ≤ ε and D_∞(Y∥X) ≤ ε, then
D_1(X∥Y) ≤ ε · (e^ε − 1).  (52)
It should be noted that the condition of ε-DP ensures that D_∞(X∥Y) ≤ ε and D_∞(Y∥X) ≤ ε hold (see Remark 3.2 in Dwork and Roth (2014)). Based on Equations 51 and 52, we have
W_µ(Pr_M(D), Pr_M(D′)) ≤ (1/2) ∆_p f ( √(2ε · (e^ε − 1)) )^{1/µ} = (1/2) ∆_p f (2ε · (e^ε − 1))^{1/(2µ)}.  (53)
Proof of Proposition 10
Proposition 10 (From RDP to WDP). If M preserves (α, ε)-RDP with sensitivity ∆_p f, it also satisfies (µ, (1/2) ∆_p f (2ε)^{1/(2µ)})-WDP.
Proof. Considering the definition of Wasserstein differential privacy and referring to Equations 33-38, we have
W_µ(Pr_M(D), Pr_M(D′)) ≤ (1/2) ∆_p f ( √(2 D_KL(Pr_M(D)∥Pr_M(D′))) )^{1/µ},  (54)
where D_KL(Pr_M(D)∥Pr_M(D′)) is the KL divergence between Pr_M(D) and Pr_M(D′), which can also be written as the Rényi divergence of order one (see Theorem 5 in Erven and Harremoës (2014) or Definition 3 in Mironov (2017)):
D_KL(Pr_M(D)∥Pr_M(D′)) = D_1(Pr_M(D)∥Pr_M(D′)).  (55)
In addition, from the monotonicity property of RDP, we have
D_{µ1}(Pr_M(D)∥Pr_M(D′)) ≤ D_{µ2}(Pr_M(D)∥Pr_M(D′))  (56)
for 1 ≤ µ1 < µ2 and arbitrary Pr_M(D) and Pr_M(D′). From the condition that M preserves (α, ε)-RDP, we have
D_α(Pr_M(D)∥Pr_M(D′)) ≤ ε,  α ≥ 1.  (57)
Combining Equations 55, 56 and 57, we have
D_KL(Pr_M(D)∥Pr_M(D′)) = D_1(Pr_M(D)∥Pr_M(D′)) ≤ D_α(Pr_M(D)∥Pr_M(D′)) ≤ ε.  (58)
Combining Equations 54 and 58, we have
W_µ(Pr_M(D), Pr_M(D′)) ≤ (1/2) ∆_p f ( √(2ε) )^{1/µ} = (1/2) ∆_p f (2ε)^{1/(2µ)}.  (59)
Therefore, (α, ε)-RDP implies (µ, (1/2) ∆_p f (2ε)^{1/(2µ)})-WDP.
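Propositions 9 and 10 give closed-form conversions into WDP budgets; a small Python sketch (function names are ours, not from the paper):

```python
import numpy as np

def wdp_from_dp(eps, sens, mu):
    """(mu, .)-WDP budget implied by eps-DP with sensitivity sens (Proposition 9)."""
    return 0.5 * sens * (2.0 * eps * (np.exp(eps) - 1.0)) ** (1.0 / (2.0 * mu))

def wdp_from_rdp(eps, sens, mu):
    """(mu, .)-WDP budget implied by (alpha, eps)-RDP for any alpha >= 1 (Proposition 10)."""
    return 0.5 * sens * (2.0 * eps) ** (1.0 / (2.0 * mu))
```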
Proof of Proposition 11
Proposition 11 (From WDP to RDP). Suppose µ ≥ 1 and log(p_M(·)) is an L-Lipschitz function. If M preserves (µ, ε)-WDP with sensitivity ∆_p f, it also satisfies (α, (α/(α−1)) L · ε^{µ/(µ+1)})-RDP. Specifically, when α → ∞, it satisfies L · ε^{µ/(µ+1)}-DP.
Proof. Considering the definition of an L-Lipschitz function, we have
|log p_M(D) − log p_M(D′)| ≤ L |p_M(D) − p_M(D′)|,  (60)
|log (p_M(D)/p_M(D′))| ≤ L |p_M(D) − p_M(D′)|,  (61)
−L |p_M(D) − p_M(D′)| ≤ log (p_M(D)/p_M(D′)) ≤ L |p_M(D) − p_M(D′)|,  (62)
e^{−L|p_M(D) − p_M(D′)|} ≤ p_M(D)/p_M(D′) ≤ e^{L|p_M(D) − p_M(D′)|}.  (63)
Considering the Rényi divergence of order α, we have
D_α(Pr_M(D)∥Pr_M(D′)) = (1/(α−1)) log E_{Pr_M(D′)} [(p_M(D)/p_M(D′))^α]  (64)
≤ (1/(α−1)) log E_{Pr_M(D′)} [e^{αL|p_M(D) − p_M(D′)|}]  (65)
≤ (1/(α−1)) log E_{Pr_M(D′)} [e^{αL∆_p f}].  (66)
According to the definition of sensitivity, we know that
p_M(D) ≤ p_M(D′) + ∆_p f when p_M(D) ≥ p_M(D′), and p_M(D′) ≤ p_M(D) + ∆_p f when p_M(D) ≤ p_M(D′).  (67)
From Theorem 2.7 in Bobkov and Ledoux (2019), we have
∆_p f ≤ W_µ(Pr_M(D), Pr_M(D′))^{µ/(µ+1)}.  (68)
Combining Equations 66 and 68, we have
D_α(Pr_M(D)∥Pr_M(D′)) ≤ (1/(α−1)) log E_{Pr_M(D′)} [e^{αL [W_µ(Pr_M(D), Pr_M(D′))]^{µ/(µ+1)}}]  (69)
= (1/(α−1)) log E_{Pr_M(D′)} [e^{αL ε^{µ/(µ+1)}}]  (70)
= (1/(α−1)) log e^{αL ε^{µ/(µ+1)}}  (71)
= (α/(α−1)) L ε^{µ/(µ+1)}.  (72)
Through the same method, we can also prove that
D_α(Pr_M(D′)∥Pr_M(D)) ≤ (α/(α−1)) L ε^{µ/(µ+1)}.  (73)
Next, we consider the special case that α → ∞. From the definition of the max divergence, we have
D_∞(Pr_M(D)∥Pr_M(D′)) = sup_{Pr_M(D)} log (p_M(D)/p_M(D′)).  (74)
Referring to Equation 63, we have
D_∞(Pr_M(D)∥Pr_M(D′)) ≤ sup_{Pr_M(D)} L |p_M(D) − p_M(D′)| = L ∆_p f.  (75)
Combining this with Equation 68 yields D_∞(Pr_M(D)∥Pr_M(D′)) ≤ L ε^{µ/(µ+1)}, which completes the proof.
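A corresponding sketch of the conversion in Proposition 11 (the Lipschitz constant L must be supplied by the user; the function name is ours):

```python
def rdp_from_wdp(eps, lipschitz_l, mu, alpha):
    """(alpha, .)-RDP budget implied by (mu, eps)-WDP when log p_M is
    L-Lipschitz (Proposition 11); alpha -> inf recovers L * eps^(mu/(mu+1))-DP."""
    return alpha / (alpha - 1.0) * lipschitz_l * eps ** (mu / (mu + 1.0))
```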
Proof of Proposition 12
Proposition 12 (Post-Processing). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm, and let G : R → R′ be an arbitrary randomized mapping. For any order µ ∈ [1, ∞) and all measurable subsets S ⊆ R′, G(M)(·) is also (µ, ε)-Wasserstein differentially private, namely
W_µ(Pr[G(M(D)) ∈ S], Pr[G(M(D′)) ∈ S]) ≤ ε.  (78)
Proof. Let T = {x ∈ R : G(x) ∈ S}. Then Pr[G(M(D)) ∈ S] = Pr[M(D) ∈ T], and hence
W_µ(Pr[G(M(D)) ∈ S], Pr[G(M(D′)) ∈ S]) = W_µ(Pr[M(D) ∈ T], Pr[M(D′) ∈ T]) ≤ ε.
Proof of Proposition 13
Proposition 13 (Group Privacy). Let M : D → R be a (µ, ε)-Wasserstein differentially private algorithm. Then for any pair of datasets D, D′ ∈ D differing in k data entries x′_1, ..., x′_k, M is (µ, kε)-Wasserstein differentially private.
Proof. We decompose the group privacy problem and denote by D and D′_1 a pair of adjacent datasets differing only in x′_1. Similarly, we denote by D′_1 and D′_2, D′_2 and D′_3, ..., D′_{k−1} and D′ the other k − 1 pairs of adjacent datasets, differing only in x′_2, x′_3, ..., x′_k respectively.
Recalling that WDP satisfies the triangle inequality (Proposition 2), we have
W_µ(Pr_M(D), Pr_M(D′)) ≤ W_µ(Pr_M(D), Pr_M(D′_1)) + W_µ(Pr_M(D′_1), Pr_M(D′_2)) + ··· + W_µ(Pr_M(D′_{k−2}), Pr_M(D′_{k−1})) + W_µ(Pr_M(D′_{k−1}), Pr_M(D′)) ≤ kε.  (81)
Proof of Theorem 1
Theorem 1 (Advanced Composition). Suppose a randomized algorithm M consists of a sequence of (µ, ε)-WDP algorithms M_1, M_2, ..., M_T, which are performed on dataset D adaptively and satisfy M_t : D → R_t, t ∈ {1, 2, ..., T}. M is generalized (µ, ε)-Wasserstein differentially private with ε > 0 and µ ≥ 1 if for any two adjacent datasets D, D′ ∈ D it holds that
exp[ β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε ] ≤ δ.  (82)
Here Equation 83 holds because the triangle inequality (see Proposition 2) ensures that
Σ_{t=1}^T W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) ≥ W_µ(Pr_M(D), Pr_M(D′)).
Proof of Theorem 2
Theorem 2. Suppose an algorithm M consists of a sequence of private algorithms M_1, M_2, ..., M_T protected by the Gaussian mechanism and satisfying M_t : D → R, t ∈ {1, 2, ..., T}. If the subsampling probability, scale parameter and l_2-sensitivity of algorithm M_t are denoted by q ∈ [0, 1], σ > 0 and d_t ≥ 0 respectively, then the privacy loss under WDP at epoch t is
" n # µ1
X
′ µ
Wµ (P rMt (D), P rMt (D )) = inf E (|Zti | ) ,
dt (89)
i=1
2 2
Zt ∼ N qdt , (2 − 2q + 2q )σ .
Where P rMt (D) is the outcome distribution when performing M on D at epoch t. dt = ∥gt − gt′ ∥2 represents the l2 norm
between pairs of adjacent gradients gt and gt′ . In addition, Zt is a vector follows Gaussian distribution, and Zti represents the
i-th component of Zt .
Proof. With the Gaussian mechanism in a subsampling scenario, we have
Pr_{M_t}(D) = (1 − q) N(0, σ²) + q N(d_t, σ²),
Pr_{M_t}(D′) = N(0, σ²).
To facilitate the later proof, we slightly simplify the expression of Pr_{M_t}(D):
Pr_{M_t}(D) = (1 − q) N(0, σ²) + q N(d_t, σ²)  (90)
= N(0, (1 − q)² σ²) + N(q d_t, q² σ²)  (91)
= N(q d_t, (1 − 2q + 2q²) σ²).  (92)
Then we compute the privacy loss at epoch t:
W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{X_t ∼ Pr_{M_t}(D), Y_t ∼ Pr_{M_t}(D′)} [E ∥X_t − Y_t∥^µ]^{1/µ}.  (93)
Finally, we have
W_µ(Pr_{M_t}(D), Pr_{M_t}(D′)) = inf_{d_t} [ Σ_{i=1}^n E(|Z_{ti}|^µ) ]^{1/µ}.  (98)
Proof of Theorem 3
Theorem 3 (Tail Bound). Under the conditions described in Theorem 2, M satisfies (µ, δ)-WDP for
log δ = β Σ_{t=1}^T inf_{d_t} [ Σ_{i=1}^n E(|Z_{ti}|^µ) ]^{1/µ} − βε,  Z ∼ N(q d_t, (2 − 2q + 2q²) σ²).  (99)
Proof. In Theorem 1, we proved that
exp[ β Σ_{t=1}^T E(W_µ(Pr_{M_t}(D), Pr_{M_t}(D′))) − βε ] ≤ δ,  (100)
where Z ∼ N(q d_t, (2 − 2q + 2q²) σ²) and d_t = ∥g_t − g_t′∥_2.
Plugging Equation 102 into Equation 101, we can obtain
β Σ_{t=1}^T E[ inf_{d_t} ( Σ_{i=1}^n E(|Z_{ti}|^µ) )^{1/µ} ] − βε ≤ log δ.  (103)
Here E(|Z|^µ) can be obtained with the help of Lemma 1, so we regard it as a single computable quantity. Observing Equation 103, we find that the uncertainty comes from two sources: the Gaussian random variable Z and the norm of the pairwise gradients ∥g_t − g_t′∥_2. However, these two uncertainties are already eliminated by the inner expectation and the infimum. Thus, the outer expectation E is no longer needed, and the expression can be simplified as
β Σ_{t=1}^T inf_{d_t} [ Σ_{i=1}^n E(|Z_{ti}|^µ) ]^{1/µ} − βε ≤ log δ.  (104)
Since we always want the probability of failure to be as small as possible, we replace the inequality with an equality:
log δ = β Σ_{t=1}^T inf_{d_t} [ Σ_{i=1}^n E(|Z_{ti}|^µ) ]^{1/µ} − βε.  (105)
Proof of Lemma 1
Lemma 1 (Raw Absolute Moment). Assume that Z ∼ N(q d_t, (2 − 2q + 2q²) σ²). The raw absolute moment of Z is
E(|Z|^µ) = (2 Var)^{µ/2} · ( GF((µ+1)/2) / √π ) · K(−µ/2, 1/2; −q² d_t² / (2 Var)),
where Var denotes the variance of the Gaussian random variable Z, which can be expressed as Var = (2 − 2q + 2q²) σ². GF((µ+1)/2) denotes the Gamma function
GF((µ+1)/2) = ∫_0^∞ x^{(µ+1)/2 − 1} e^{−x} dx,  (106)
and K(−µ/2, 1/2; −q² d_t² / (2 Var)) denotes Kummer's confluent hypergeometric function
K(−µ/2, 1/2; −q² d_t² / (2 Var)) = Σ_{n=0}^∞ [ q^{2n} d_t^{2n} / (n! · 4^n (1 − q + q²)^n σ^{2n}) ] · Π_{i=1}^n (µ − 2i + 2)/(1 + 2i − 2).  (107)
Proof. From Equation 17 in Winkelbauer (2012), we can obtain the expression of E(|Z|^µ) as
E(|Z|^µ) = (2 Var)^{µ/2} · ( GF((µ+1)/2) / √π ) · K(−µ/2, 1/2; −q² d_t² / (2 Var)).  (108)
The term K(−µ/2, 1/2; −q² d_t² / (2 Var)) can be deduced further as follows:
K(−µ/2, 1/2; −q² d_t² / (2 Var)) = K(−µ/2, 1/2; −q² d_t² / (2(2 − 2q + 2q²) σ²))  (109)
= K(−µ/2, 1/2; −q² d_t² / (4(1 − q + q²) σ²))  (110)
= Σ_{n=0}^∞ (−µ/2)_n (−q² d_t² / (4(1 − q + q²) σ²))^n / ((1/2)_n n!)  (111)
= Σ_{n=0}^∞ (−1)^n (−µ/2)_n (q² d_t² / (4(1 − q + q²) σ²))^n / ((1/2)_n n!)  (112)
= Σ_{n=0}^∞ (−1)^n [ (−µ/2)_n / (1/2)_n ] (q² d_t²)^n / (n! · (4(1 − q + q²) σ²)^n)  (113)
= Σ_{n=0}^∞ (−1)^n [ (−µ/2)_n / (1/2)_n ] q^{2n} d_t^{2n} / (n! · 4^n (1 − q + q²)^n σ^{2n}).  (114)
Here (−µ/2)_n is the rising factorial of −µ/2 (see Winkelbauer (2012)):
(−µ/2)_n = GF(−µ/2 + n) / GF(−µ/2)  (115)
= (−µ/2) · (−µ/2 + 1) · ... · (−µ/2 + n − 1)  (116)
= (−1)^n (1/2)^n [ µ · (µ − 2) · ... · (µ − 2n + 2) ].  (117)
Similarly,
(1/2)_n = GF(1/2 + n) / GF(1/2)  (118)
= (1/2) · (1/2 + 1) · ... · (1/2 + n − 1)  (119)
= (1/2)^n [ 1 · 3 · ... · (1 + 2n − 2) ].  (120)
From Equations 117 and 120, we have
(−µ/2)_n / (1/2)_n = (−1)^n [ µ · (µ − 2) · ... · (µ − 2n + 2) ] / [ 1 · 3 · ... · (1 + 2n − 2) ]  (121)
= (−1)^n Π_{i=1}^n (µ − 2(i − 1)) / Π_{i=1}^n (1 + 2i − 2)  (122)
= (−1)^n Π_{i=1}^n (µ − 2i + 2) / (1 + 2i − 2).  (123)
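Lemma 1 and Theorem 3 together yield a concrete Wasserstein accountant. The sketch below is our own illustration (it assumes SciPy's hyp1f1 for Kummer's function); it evaluates E(|Z|^µ) by Equation 108 and accumulates log δ as in Equation 105, plugging in observed gradient gaps d_t instead of taking the infimum, which gives a conservative value.

```python
import numpy as np
from scipy.special import gamma, hyp1f1

def abs_moment(mu, q, sigma, d_t):
    """E(|Z|^mu) for Z ~ N(q*d_t, (2 - 2q + 2q^2) sigma^2), Equation 108;
    Kummer's function K(a, b; x) is SciPy's hyp1f1."""
    var = (2.0 - 2.0 * q + 2.0 * q ** 2) * sigma ** 2
    return ((2.0 * var) ** (mu / 2.0) * gamma((mu + 1.0) / 2.0) / np.sqrt(np.pi)
            * hyp1f1(-mu / 2.0, 0.5, -(q * d_t) ** 2 / (2.0 * var)))

def log_delta(mu, q, sigma, gaps, beta, eps):
    """Theorem 3 tail bound: gaps[t] is the list of per-coordinate gradient
    gaps d_t at step t; plugging the observed d_t in place of the infimum
    over d_t yields a conservative (larger) log delta."""
    total = sum(sum(abs_moment(mu, q, sigma, d) for d in d_t) ** (1.0 / mu)
                for d_t in gaps)
    return beta * total - beta * eps
```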
Experiments
Composition with Clipping
Figure 4 demonstrates how the privacy budget changes as the step increases. We find that the impact of the clipping threshold C on the privacy budget decreases, because the gradient norm is limited by the clipping threshold, and the gap between the privacy budgets of different DP frameworks narrows. Nevertheless, WDP still attains the lowest cumulative privacy budget, and this value grows a bit more slowly than that of DP and BDP.
(a) 0.05-quantile of ∥gt ∥ (b) 0.50-quantile of ∥gt ∥ (c) 0.75-quantile of ∥gt ∥ (d) 0.99-quantile of ∥gt ∥
Figure 4: Privacy budgets over synthetic gradients obtained by moments accountant under DP, Bayesian accountant under BDP
and Wasserstein accountant under WDP when applying gradient clipping.
Several Basic Concepts in Differential Privacy
Definition 6 (Differential Privacy (Dwork et al. 2006b)). A randomized algorithm M : D → R is (ε, δ)-differentially private if for any adjacent datasets D, D′ ∈ D and all measurable subsets S ⊆ R the following inequality holds:
Pr[M(D) ∈ S] ≤ e^ε Pr[M(D′) ∈ S] + δ,  (126)
where Pr[·] denotes probability and ε is known as the privacy budget. In particular, if δ = 0, M is said to preserve ε-DP or pure DP.
Definition 7 (Privacy Loss of DP). For a randomized algorithm M : D → R with outcome o, the privacy loss of M is defined as
Loss(o) = log ( Pr[M(D) = o] / Pr[M(D′) = o] ).  (127)
The privacy budget is the strict upper bound of the privacy loss in ε-differential privacy, and a quasi upper bound of the privacy loss with confidence 1 − δ in (ε, δ)-differential privacy.
Definition 8 (l_p-Sensitivity (Dwork and Lei 2009)). Sensitivity in DP theory is defined by the maximum p-norm distance between the values of the same query function on two adjacent datasets D and D′:
∆_p f = sup_{ρ(D, D′) ≤ 1} ∥f(D) − f(D′)∥_p,  (128)
where f : D → R^d is a d-dimensional query function and ρ(D, D′) = ∥D − D′∥_p is the norm function between D and D′. The l_p-sensitivity measures the largest difference over all possible adjacent datasets.
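For concreteness, the classic Laplace mechanism calibrates its noise scale to this sensitivity; a standard textbook sketch, not code from the paper:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng=None):
    """Classic eps-DP Laplace mechanism: add Lap(0, sensitivity/eps) noise
    calibrated to the l1-sensitivity of the query (Dwork et al. 2006b)."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(0.0, sensitivity / eps)
```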
Definition 9 (Rényi Differential Privacy (Mironov 2017)). A randomized algorithm M : D → R is said to preserve (α, ε)-RDP if for any adjacent datasets D, D′ ∈ D the following holds:
D_α(Pr_M(D)∥Pr_M(D′)) = (1/(α−1)) log E_{o∼M(D′)} [ (p_{M(D)}(o) / p_{M(D′)}(o))^α ] ≤ ε,  (129)
where α ∈ (1, +∞) is the order of RDP and o is the output of algorithm M. Pr_M(D) and Pr_M(D′) are probability distributions, while p_{M(D)} and p_{M(D′)} are probability density functions.
Definition 10 (Strong Bayesian Differential Privacy (Triastcyn and Faltings 2020)). A randomized algorithm M : D → R is said to satisfy (ε_b, δ_b)-strong Bayesian differential privacy if for any adjacent datasets D, D′ ∈ D the following holds:
Pr[ log (p(o|D) / p(o|D′)) ≥ ε_b ] ≤ δ_b,  (130)
where ε_b and δ_b are the privacy budget and failure probability in BDP (Triastcyn and Faltings 2020), o is the output satisfying o = M(·), and p(o|D) and p(o|D′) are the probability density functions of the outputs on the adjacent datasets.
Definition 11 (Bayesian Differential Privacy (Triastcyn and Faltings 2020)). Suppose the only differing data entry x′ follows a certain distribution b(x), namely x′ ∼ b(x). A randomized algorithm M : D → R is said to satisfy (ε_b, δ_b)-Bayesian differential privacy if for any neighboring datasets D, D′ ∈ D and any set of outcomes O the following holds:
Pr[M(D) ∈ O] ≤ e^{ε_b} Pr[M(D′) ∈ O] + δ_b.  (131)
From the above definitions, we find that strong BDP is inspired by RDP, and the definition of BDP is similar to that of DP. Therefore, the weaknesses of DP, BDP and RDP are similar: (1) Their privacy losses do not satisfy symmetry or the triangle inequality, which prevents them from being metrics. (2) Their privacy budgets tend to be overstated. To alleviate these problems, we propose Wasserstein differential privacy in this paper, expecting to achieve better properties in privacy computing and thus better performance in private machine learning.