Thesis
Dionysis Manousakas
Dedicated to my parents.
I would not be here writing this very line if it were not for them . . .
Declaration
This thesis is the result of my own work and includes nothing which is the outcome of
work done in collaboration except as declared in the preface and specified in the text. It
is not substantially the same as any work that has already been submitted before for any
degree or other qualification except as declared in the preface and specified in the text.
It does not exceed the prescribed word limit of 60,000 words for the Computer Science
Degree Committee, including appendices, footnotes, tables and equations.
Dionysis Manousakas
October 2020
Abstract
from the generated data summaries. Thus we deliver robustified scalable representations
for inference that are suitable for applications involving contaminated and unreliable
data sources.
We demonstrate the performance of proposed summarization techniques on multiple
parametric statistical models, and diverse simulated and real-world datasets, from
music genre features to hospital readmission records, considering a wide range of data
dimensionalities.
Acknowledgements
This work contains the outcomes of prolonged research endeavours which would not have
been accomplished without my interaction with many outstanding colleagues. I express
my deep gratitude to my supervisor, Cecilia Mascolo, for generously granting me vast
stretches of freedom to pursue my own research direction, her unabated trust in my
work, and her encouragement and advice over the ups and downs of my PhD. I’m also
grateful to Trevor Campbell for his bold and intellectually stimulating collaboration: my
perspective on science has been crucially enriched by his adherence to first principles
and his enthusiasm for research. My close collaboration with Alastair Beresford was
most valuable over the time I started delving into the subject of data privacy—also,
Borja Balle’s teaching on differential privacy has been instrumental in unveiling the
wonders and beauty of the field. I’m indebted to my labmates at the Mobile Systems
Group, Nokia Bell Labs, Max Planck Institute, and the rest of my collaborators: Shubham
Aggarwal, Sourav Bhattacharya, Alberto Jesús Coca, Abir De, Manuel Gomez Rodriguez,
Fahim Kawsar, Nic Lane, Akhil Mathur, Chulhong Min, Alberto Gil Ramos, and
Rik Sarkar. My warm thanks to Michael Schaub and Kostas Kyriakopoulos: their
mentorship during my first awkward steps in academic inquiry provided me with solid
foundations that proved invaluable throughout my PhD. Thanks to Matthias Grossglauser
and Amanda Prorok for carefully examining this thesis, contributing their valuable
suggestions and hosting an enjoyable viva.
I gratefully acknowledge the financial support received from Nokia Bell Labs, Lundgren
Fund, Darwin College Cambridge, and the Department of Computer Science & Technology
that backed my research.
Writings of Sebald, Bolaño, Benjamin, Parra, and Herbert, and activities at the Darwin College
Film Club and the university Canoeing and Tango societies kept me reinvigorating company
during rest breaks. Antonis, Jenny and Panos made my time enjoyable, supporting me with
their friendship at all times—Pano, your light-spirited theory on information overflow
has been a paradoxical source of inspiration over my research on coresets. Finally, my
deepest thanks to Mum, Dad, Lena and Memos for keeping a shelter of unwavering love
and support alive through all my successes and failures.
Table of contents
1 Introduction 1
1.1 Thesis statement and main contributions . . . . . . . . . . . . . . . . . . 3
1.2 Organization of the dissertation . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background Material 7
2.1 Comparing probability distributions . . . . . . . . . . . . . . . . . . . . . 7
2.2 Exponential families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Probabilistic learning at a glance . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Laplace’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Sampling methods . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Variational inference . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.4 Bayesian coresets . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Robust inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Standard Bayesian inference and lack of robustness in the large-data
regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Robustified generalized Bayesian posteriors . . . . . . . . . . . . . 16
2.5 Representing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.2 Finite-dimensional random projections . . . . . . . . . . . . . . . 20
2.6 Differential privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Bayesian Pseudocoresets 53
4.1 Related work & contributions . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Existing Bayesian coresets . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 High-dimensional data . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Bayesian pseudocoresets . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1 Pseudocoreset variational inference . . . . . . . . . . . . . . . . . 58
4.3.2 Stochastic optimization . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.3 Differentially private scheme . . . . . . . . . . . . . . . . . . . . . 61
4.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Gaussian mean inference . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.2 Bayesian linear regression . . . . . . . . . . . . . . . . . . . . . . 65
4.4.3 Bayesian logistic regression . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Summary & discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Conclusions 89
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.1 Privacy loss of coarsened structured data . . . . . . . . . . . . . . 89
6.1.2 Privacy-preserving Bayesian coresets in high dimensions . . . . . . 90
6.1.3 Robust Bayesian coresets under misspecification . . . . . . . . . . 90
6.2 Future research directions . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Coresets for models with structured likelihoods . . . . . . . . . . 91
6.2.2 Implicit differential privacy amplification of data-dependent com-
pressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2.3 Human-centric summaries for scalable inference . . . . . . . . . . 92
6.2.4 Compressing datasets for meta-learning . . . . . . . . . . . . . . . 92
Nomenclature 117
Bibliography 121
Chapter 1
Introduction
Machine learning pervades most modeling and decision-making tools of modern society:
scientists rely on the wealth of stored medical records to decipher the underlying causes
of diseases, web-scale recommender systems learn from users’ experience to suggest music,
movies, and products tailored to our habits, and driving-intelligence systems are capable
of navigating self-driving cars in complex, never-seen-before environments.
From the statistical point of view, Bayesian modeling offers a powerful unifying
framework where experts and practitioners alike can leverage domain-specific knowledge,
learn from new observations, share statistical strength across components of hierarchical
models, and take advantage of predictions which can account for model uncertainty.
Having access to larger datasets is invaluable for statistical models, as it affords deeper
insight into the process that gives rise to the data.
At the same time, handling massive-scale datasets in machine learning instigates a
number of computational, societal, and statistical reliability challenges. First, beyond
basic statistical settings, performing inference—i.e. computing expectations of interest
under posterior distributions updated in the light of new observations—does not scale
to large datasets; hence, learning in most interesting models requires additional effort
from the data analyst to explore the statistical-computational trade-off of the problem,
and turn to a suitable approximate inference method instead. Apart from addressing
scalability, modern approximate inference methods should also be able to offer guarantees
of convergence to the exact posterior distribution given sufficient computational resources,
admit efficient quality measurement, and work seamlessly in high dimensions, where much
of modern large-scale data lives (e.g. genes, or social networks).
Secondly, a large fraction of modern massive-scale machine learning applications
involves observations stemming from privacy-sensitive data domains, for example health
records or behavioural studies. The sensitive information content of such sources makes it
crucial for data contributors that inference methods satisfy formal guarantees of statistical
privacy. To this end, the gold standard is relying on the established framework of
differential privacy: the existing toolset of privatising mechanisms and tight privacy loss
estimation techniques, reinforced by the massive population sizes of modern datasets,
allow individual information to be statistically protected while still extracting accurate
insights about the population under study.
Thirdly, real-world big data are often highly heterogeneous, contain outliers and noise,
or might be subject to data poisoning. The aforementioned phenomena are typically
expressed as patterns which cannot be fully captured within the parametric assumptions
of the statistical model. As a result, standard Bayesian inference techniques, which do not
take extra care to downweight the contributions of outlying datapoints, lack robustness
and, attempting to describe the full set of observations, might eventually yield unreliable
posteriors.
How should we develop methods for large-scale data analysis that sufficiently address
the problem of scalability, while formally preserving privacy and enhancing inferential
results with robustness against mismatching observations? When faced with a dataset too
large to be processed all at once, an obvious approach is to retain only a representative
part of it. In this thesis, we build on the data summarization idea, which is validated by a
critical insight in our massive-scale learning setup: when fitting a parametric probabilistic
model on a large dataset, much of the data is redundant. Therefore, compressing the
dataset under the strategic criterion of maximally reducing redundancy with respect to a
given statistical model, opens an avenue for scalable data analysis without substantially
sacrificing the accuracy of methods. The data summarization method of choice in this
work is constructing coresets: small, weighted collections of points in the data space that
can succinctly and parsimoniously represent the complete dataset in a problem-dependent
way.
Data Summarization and Differential Privacy. The aim of summarization is
ostensibly in accord with the requirements of privacy, making it a good candidate to build
privacy-preserving methods: informally, in both cases the target is to ensure encoding the
prevailing patterns of the dataset, without revealing information about any individual
datapoint in particular. However, an intricacy lies in that releasing part of the data,
though perfectly acceptable for the purposes of coresets, directly breaches privacy, as it
obviously exposes the full private information of the summarizing datapoints. Private
coreset construction forms a challenging problem of releasing data in the non-interactive,
or offline setting—namely in scenarios where a data owner aims to publicly release
randomised privacy-preserving reductions of their data to third-parties, without knowing
what statistics might be computed next. Differentially private schemes for coresets
applicable in computational geometry already exist in the literature (Feldman et al., 2009;
2017). In the area of machine learning, the idea of releasing private dataset compressions
via synthetic datapoints has been pursued in kernel mean embeddings (Balog et al., 2018)
and compressive learning (Schellekens et al., 2019), with the utility of the private method
scaling adversely with data dimension. Work limited to sparse regression (Zhou et al.,
2007) has considered the high-dimensional data setting and proposed a method that
compresses data via random linear or affine transformations. Nevertheless, none of these
approaches is directly applicable to summarising for general-purpose Bayesian inference.
Data Summarization and Outlier Detection. Several approximate inference methods
have proved brittle to observations that "deviate markedly from other members of
the sample" (Grubbs, 1969). Outliers are a common complication emerging in real-world
problems, attributed to limited precision, noise, uncertainty and adversarial behaviour
often arising over data collection procedures. Since the pioneering work of Tukey (1960)
and de Finetti (1961), discerning outliers has concerned the research community for over
60 years, shaping the area of robust statistics (Huber and Ronchetti, 2009). To this end,
non-parametric distance-based techniques are a predominant approach that decouples
outlier detection from statistical assumptions regarding the data-generating distribution,
hence this paradigm has found broad applicability in machine learning and data mining.
On the other hand, scaling distance computation to massive datasets is particularly
resource intensive, while, further to computational intractability, distance-based analysis
in high dimensions faces complications due to the curse of dimensionality (Donoho,
2000; Vershynin, 2018; Wainwright, 2019). Summarization has been leveraged for the
purposes of outlier detection in non-probabilistic clustering in prior work by Lucic et al.
(2016a). In the case of Bayesian learning, addressing inference on contaminated data via
summarization critically relies on using as the criterion of coreset quality a robustified
posterior, which is by definition insensitive to small deviations in the data space. The
intuition is that adding an outlier to a summary comprised of a majority of inliers
will have an insignificant impact on the quality of the robust posterior defined on the
summary points; hence, greedy incremental schemes of summarization can handily
reject outlying observations while efficiently compressing the dataset.
2. Propose novel principled methods that can directly address real-world considerations
of privacy and robustness when performing inference via summarization, without
increasing the corresponding computational and memory footprint compared to
the existing state-of-the-art methods.
In addition, the following paper was written during my PhD but is not discussed in
this thesis:
Chapter 2
Background Material
This chapter aims to set the context for the remainder of this thesis. Various concepts
pertaining to this thesis, including Bayesian inference, exponential family distributions
and differential privacy, are briefly introduced in the following.
Divergence          ϕ(ξ)
Kullback-Leibler    ξ log ξ
β-divergence        ξ^{β+1}/(β(β+1)) − ξ/β + 1/(β+1),   β > 0

Table 2.1: Convex functions used for reductions of relative entropy and density power to
Bregman divergences on the domain of probability density functions.
One can easily show that the β-divergence converges to the KL divergence when
β → 0. Both divergences are asymmetric and do not satisfy the triangle inequality.
Moreover, both divergences are instances of the family of Bregman divergences (Banerjee
et al., 2005; Cichocki and Amari, 2010; Amari, 2016), i.e. a class of dissimilarity
measures that can be expressed as dϕ(p, q) = ϕ(p) − ϕ(q) − ⟨∇ϕ(q), p − q⟩ using a strictly
convex, differentiable function ϕ : K → R, for all p, q in a convex set K ⊆ R^d. In
the case of two probability density functions π1, π2, the Bregman divergence admits
the form Dϕ(π1, π2) = ∫ [ϕ(π1(θ)) − ϕ(π2(θ)) − ϕ′(π2(θ))(π1(θ) − π2(θ))] dθ. The convex
functions defining the corresponding divergences are presented in Table 2.1.
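To make Table 2.1 concrete, the following minimal sketch (an illustration, not thesis code; it assumes numpy, and all function names are hypothetical) evaluates the Bregman form Dϕ for both generators by trapezoidal quadrature on a 1-D grid of densities:

```python
import numpy as np

# Convex generators from Table 2.1 and their derivatives.
def phi_kl(xi):
    return xi * np.log(xi)

def dphi_kl(xi):
    return np.log(xi) + 1.0

def phi_beta(xi, beta=0.5):
    return xi**(beta + 1) / (beta * (beta + 1)) - xi / beta + 1.0 / (beta + 1)

def dphi_beta(xi, beta=0.5):
    return (xi**beta - 1.0) / beta

def bregman_divergence(p1, p2, phi, dphi, grid):
    """D_phi(pi1, pi2) = int [phi(p1) - phi(p2) - phi'(p2)(p1 - p2)] dtheta,
    approximated by the trapezoidal rule on a 1-D grid."""
    integrand = phi(p1) - phi(p2) - dphi(p2) * (p1 - p2)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))

# Example: N(0, 1) vs N(1, 1); the KL generator recovers KL = 0.5.
theta = np.linspace(-12.0, 12.0, 20001)
p1 = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)
p2 = np.exp(-0.5 * (theta - 1.0)**2) / np.sqrt(2 * np.pi)
print(bregman_divergence(p1, p2, phi_kl, dphi_kl, theta))      # ~0.5
print(bregman_divergence(p1, p2, phi_beta, dphi_beta, theta))  # beta-divergence
```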
We call t(x) : X → R^d the sufficient statistics of the data, h(x) the base density, and

Z(θ) := log ∫ e^{⟨θ, t(x)⟩} h(x) ν(dx)   (2.5)

the log-partition function.
The parameter space of interest, referred to as the natural parameter space, is the
space Ω ⊆ Rd that contains all θ such that Z(θ) is finite. We say that a family is regular
if Ω is open.
An important property of exponential family densities is that the derivatives of the
log-partition function Z are related to the moments of the sufficient statistics as follows.

Proposition 2. For x distributed according to the exponential family density with natural
parameter θ, it holds that ∇Z(θ) = E[t(x)] and ∇²Z(θ) = Cov[t(x)].
Proposition 2 allows efficient approximations for the gradient and Hessian of Z using
empirical estimates of the first two moments of the sufficient statistic; we take advantage
of this property in the variational inference schemes to be introduced in Chapters 4
and 5.
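As a quick numerical illustration of this property (a sketch assuming numpy; the example model and names are not from the thesis), the gradient and Hessian of Z at a given θ can be estimated from empirical moments of the sufficient statistic:

```python
import numpy as np

def logZ_grad_hess(t_samples):
    """Monte Carlo estimates of the derivatives of the log-partition function:
    grad Z(theta) = E[t(x)] and hess Z(theta) = Cov[t(x)], with x drawn from
    the exponential-family density at natural parameter theta.
    `t_samples` is an (S, d) array of sufficient statistics."""
    grad = t_samples.mean(axis=0)
    hess = np.atleast_2d(np.cov(t_samples, rowvar=False))
    return grad, hess

# Example: base density h = N(0, 1) and t(x) = x, so x ~ N(theta, 1) and
# Z(theta) = theta^2 / 2; at theta = 2: grad Z = 2 and hess Z = 1.
rng = np.random.default_rng(0)
t = rng.normal(loc=2.0, scale=1.0, size=(200000, 1))
grad, hess = logZ_grad_hess(t)
print(grad, hess)  # ~[2.], [[1.]]
```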
The posterior distribution over the parameters θ given observed data x follows from
Bayes' rule,

π(θ|x) = π(x|θ) π0(θ) / π(x),   (2.8)

where π0 denotes the prior.
Henceforth any quantity of interest g(·) involving the assumed probabilistic model is
calculated using expectations under the posterior—which is considered to be the complete
information about θ given the data x—as follows
E_{θ∼π(θ|x)}[g(θ)] := ∫ g(θ) π(θ|x) dθ.   (2.9)
Marginalising, i.e. computing the integral of Eq. (2.10), can be done using analytical tools
for a number of simple Bayesian models—some of which will be discussed in the remainder,
including Gaussian mean inference, Bayesian and neural linear regression—where the
likelihood is conjugate to the prior. However, for the vast majority of interesting statistical
models marginalization cannot be done in closed form and should be approximated instead.
Aiming to address such cases, approximate Bayesian inference has emerged as an active
research area for many decades. In the remainder of the section we present an overview of
existing approaches addressing approximate inference that are relevant to our algorithms.
For a more detailed exposition, including methods beyond the scope of this thesis (e.g.
expectation propagation), cf. (Bishop, 2006; Murphy, 2012; Angelino et al., 2016).
Let us write the posterior of Eq. (2.8) in the following equivalent form

π(θ|x) = (1/Z) e^{−E(θ)},   (2.11)

where E(θ) := − log π(θ, x) is called the energy function, and Z is the unknown normal-
ization constant. Taking the Taylor series expansion of E(θ) (up to order 2) around the
mode θ̂ := arg min_θ E(θ), we obtain the approximation
π̂(θ, x) := e^{−E(θ̂)} exp( −(1/2) (θ − θ̂)ᵀ Λ (θ − θ̂) ), with Λ := ∇²E(θ̂), so that

π(θ|x) ≈ (1/Z) π̂(θ, x) ∝ N(θ̂, Λ^{−1}),   (2.12)

i.e. the posterior can be approximated by a (unimodal) Gaussian, where the mean
corresponds to the minimum of the energy function and the covariance is the inverse
Hessian of the energy function evaluated at the mode. Clearly, using standard numerical
optimization routines, e.g. quasi-Newton methods, we can achieve fast convergence to θ̂.
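A minimal sketch of Laplace's method (assuming scipy; here the covariance is taken from BFGS's internal inverse-Hessian estimate at the optimum rather than an exact second-derivative computation, so this is an approximation):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approximation(energy, theta0):
    """Gaussian approximation N(theta_hat, Lambda^{-1}) of exp(-energy(theta)):
    theta_hat minimizes the energy; Lambda^{-1} is approximated by BFGS's
    inverse-Hessian estimate at the optimum."""
    res = minimize(energy, theta0, method="BFGS")
    return res.x, res.hess_inv

# Example: energy of a correlated 2-D Gaussian posterior; the method should
# recover mean ~[0, 0] and covariance ~Sigma.
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
Prec = np.linalg.inv(Sigma)
energy = lambda th: 0.5 * th @ Prec @ th
mean, cov = laplace_approximation(energy, np.zeros(2))
print(mean, cov)
```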
In the absence of analytical formulae, integrals in the form of Eq. (2.9) can be approxi-
mated via empirical averaging, using samples from the target posterior distribution
∫ g(θ) π(θ|x) dθ ≈ (1/S) ∑_{s=1}^{S} g(θ_s),   (θ_s)_{s=1}^{S} ∼ π(θ|x) i.i.d.   (2.13)
Markov Chain Monte Carlo (MCMC), the workhorse of approximate Bayesian inference,
is a framework of established tools that pursue the above idea efficiently (Geyer, 1992;
Gilks, 2005; Robert and Casella, 2005).
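For illustration, a minimal random-walk Metropolis sketch (assuming numpy; names and the target are illustrative) that produces the samples used in Eq. (2.13):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis: approximate samples from a posterior known
    only up to its normalization constant through log_post."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    out = np.empty((n_samples, theta.size))
    for s in range(n_samples):
        prop = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        out[s] = theta
    return out

# Example: Eq. (2.13) with g(theta) = theta^2 and a standard normal target;
# the Monte Carlo average should approach E[theta^2] = 1.
draws = metropolis_hastings(lambda th: -0.5 * float(th @ th), np.zeros(1), 50000)
print((draws**2).mean())  # ~1.0
```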
Variational inference (VI) (Jordan et al., 1999; Blei et al., 2017) takes a fundamentally
different approach to addressing approximate inference. The problem formulation under-
pinning all VI methods is to find a member q ∗ within a family of tractable probability
densities Q that most closely matches our true posterior π (typically in the KL-sense)
In this way, Bayesian posterior inference is reduced to an optimization problem; hence,
techniques allowing scaling up optimization (e.g. random subsampling) can in principle
be applied in VI methods, enabling scalable inference of approximate posteriors (Hoffman
et al., 2013).
We note in passing that, in classical Variational Bayes schemes, expanding the KL
divergence according to Eq. (2.1) makes the log-evidence appear in the objective
DKL (q(θ)||π(θ|x)) = Eθ∼q [log q(θ)] − Eθ∼q [log π(x, θ)] + log π(x). (2.15)
Since this term is not a function of q, it can be subtracted and the problem is reformulated
as minimizing the remaining two terms, the negation of which is known as the evidence
lower bound (ELBO)
q^∗(θ; x) := arg min_{q∈Q} (−ELBO(q, x)),   ELBO(q, x) := E_q[log π(x, θ)] − E_q[log q(θ)].   (2.16)
Via Jensen’s inequality, the ELBO can be shown to be a lower bound of the marginal
log-likelihood of x as expectation w.r.t. q. As opposed to MCMC methods, theoretical
guarantees for inferential results of the solution to Eq. (2.14) can only be obtained for a
few simple statistical models for the following main reasons: optimization methods in
typically non-convex landscapes can often converge to bad local optima; also, depending
on the statistical divergence and variational family used, VI might return miscalibrated
posterior variance estimates (Bishop, 2006, Chapter 10).
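A minimal sketch of this optimization view (assuming numpy; it uses a univariate Gaussian variational family and reparameterized gradients, which is one of several possible estimators, and is not the scheme developed later in the thesis):

```python
import numpy as np

def fit_gaussian_vi(dlog_joint, n_iters=5000, n_mc=64, lr=0.01, seed=0):
    """Stochastic-gradient VI over Q = {N(m, s^2)}: ascend a Monte Carlo
    estimate of the ELBO of Eq. (2.16) via the reparameterization
    theta = m + s * eps, eps ~ N(0, 1). `dlog_joint` is d/dtheta log pi(x, theta);
    the entropy of q contributes +1 to the gradient w.r.t. log s."""
    rng = np.random.default_rng(seed)
    m, log_s = 0.0, 0.0
    for _ in range(n_iters):
        eps = rng.standard_normal(n_mc)
        s = np.exp(log_s)
        g = dlog_joint(m + s * eps)            # pathwise gradients at samples
        m += lr * g.mean()
        log_s += lr * ((g * eps * s).mean() + 1.0)
    return m, np.exp(log_s)

# Example: log pi(x, theta) = -(theta - 1)^2 + const, i.e. an exact posterior
# N(1, 0.5); VI should recover m ~ 1 and s ~ sqrt(0.5).
m, s = fit_gaussian_vi(lambda th: -2.0 * (th - 1.0))
print(m, s)
```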
The simplest family Q that can be used for VI is the mean-field variational family
which relies on the simplifying assumption of independence among the coordinates of θ,
i.e. q(θ) := ∏_{d=1}^{D} q_d(θ_d). Our VI schemes in Chapters 4 and 5 propose approximations
within the exponential family instead, which generally allow less restricted posteriors.
Additionally, they can circumvent the use of ELBO, and instead be directly applied on
In more recent work, Campbell and Beronov (2019) cast Bayesian coreset construction as
a problem of sparse variational inference within an exponential family, named Riemannian
coresets. Riemannian coresets removed the requirement for fixing a coarse posterior that
appears when computing the norm in practical Hilbert coreset constructions, achieving
full automation and improvement of approximation quality (measured through the KL
divergence) over a larger range of summary sizes.
π(θ|x) = (1/Z′) π(x|θ) π0(θ),   (2.17)
where Z ′ is a normalization constant corresponding to the (typically intractable) marginal
likelihood term π(x). When the datapoints x are conditionally independent given θ, the
likelihood function factorizes as π(x|θ) = ∏_{n=1}^{N} π(x_n|θ). An equivalent formulation
of the Bayesian posterior as a solution to a convex optimization problem over the
density space was introduced by Williams (1980) and Zellner (1988), and used in various
subsequent works including (Zhu et al., 2014; Dai et al., 2016; Futami et al., 2018).
Concretely, Eq. (2.17) can be recovered by solving the problem
arg min_{q(θ)∈P} ( DKL(q(θ)||π0(θ)) − ∑_{n=1}^{N} ∫ q(θ) log π(x_n|θ) dθ ),   (2.18)
where P is the valid density space, while the Bayesian posterior can be expressed as
π(θ|x) = (1/Z′) exp( −N dKL(π̂(x)||π(x|θ)) ) π0(θ).   (2.19)

In the last expression, π̂(x) := (1/N) ∑_{n=1}^{N} δ(x − x_n) is the empirical distribution of the
observed datapoints and δ is the Dirac delta function. The exponent dKL(π̂(x)||π(x|θ)) :=
−(1/N) ∑_{n=1}^{N} log π(x_n|θ) corresponds (up to a constant) to the cross-entropy, which is equal
to the empirical average of negative log-likelihoods of the datapoints, and quantifies the
expected loss incurred by our estimates for the model parameters θ over the available
observations, under the Kullback-Leibler divergence.
When N is large, the Bayesian posterior is strongly affected by perturbations in
the observed data space. To develop an intuition on this, assuming that the true and
observed data distributions admit densities πθ and πobs respectively, we can rewrite an
approximation of Eq. (2.19) via the KL divergence as in (Miller and Dunson, 2019)
π(θ|x) ∝ exp( ∑_{n=1}^{N} log π(x_n|θ) ) π0(θ) ≐ exp( N ∫ π_obs log π_θ ) π0(θ),

where ≐ denotes agreement to first order in the exponent.¹ Hence, due to the large N in the
exponent, small changes in πobs will have a large impact on the posterior.
¹ I.e., a_n ≐ b_n iff (1/n) log(a_n/b_n) → 0.
where

dβ(π̂(x)||π(x|θ)) := ∑_{n=1}^{N} f_n(θ),   f_n(θ) := −((β+1)/β) π(x_n|θ)^β + ∫_X π(χ|θ)^{1+β} dχ,   (2.23)
with β > 0. In the remainder of the thesis we refer to quantities defined in Eqs. (2.22)
and (2.23) as the β-posterior and β-likelihood respectively. Notably, the individual
terms fn (θ) of the β-likelihood allow attributing different strength of influence to each of
the datapoints, depending on their accordance with the model assumptions. As densities
get raised to a suitable power β, outlying observations are exponentially downweighted.
When β → 0, the Bayes’ posterior of Eqs. (2.17) and (2.19) is recovered, and all datapoints
are treated equally.
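For a univariate Gaussian model the integral in f_n(θ) is available in closed form, which the following sketch exploits (assuming numpy/scipy; the example is illustrative, not thesis code):

```python
import numpy as np
from scipy.stats import norm

def beta_likelihood_terms(x, theta, sigma=1.0, beta=0.5):
    """Individual terms f_n(theta) of the beta-likelihood in Eq. (2.23) for a
    univariate model N(x; theta, sigma^2). The integral of the (1+beta)-th
    power of a Gaussian density has the closed form
    (2*pi*sigma^2)**(-beta/2) / sqrt(1 + beta)."""
    integral = (2.0 * np.pi * sigma**2) ** (-beta / 2.0) / np.sqrt(1.0 + beta)
    return -(beta + 1.0) / beta * norm.pdf(x, theta, sigma) ** beta + integral

# An outlier far from theta contributes f_n close to the constant `integral`,
# so its influence on the beta-posterior is bounded; under the log-likelihood
# its influence would grow without bound.
x = np.array([0.1, -0.3, 0.2, 8.0])  # last datapoint is an outlier
print(beta_likelihood_terms(x, theta=0.0))
```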
In the presentation above we focused on modeling observations (x_n)_{n=1}^{N} (unsupervised
learning). In the case of supervised learning on data pairs (x_n, y_n)_{n=1}^{N} ∈ (X × Y)^N, the
respective expression for the individual terms of the β-likelihood² is (Basu et al., 1998)

f_n(θ) := −((β+1)/β) π(y_n|x_n, θ)^β + ∫_Y π(ψ|x_n, θ)^{1+β} dψ.   (2.24)

² In this context, for simplicity, we use the notation f_n(·) to denote f(y_n|x_n, ·).
Figure 2.1: Effects of altering the statistical divergence when conducting inference on
datasets containing outliers. (a) Influence of individual datapoints under the Kullback-
Leibler and the β-divergence: the concavity of influence under the β-divergence illustrates
the robustness of the inferred posterior to outliers. (b) Posterior estimates of Gaussian
density on observations containing a small fraction of outliers under classical and
robustified inference.
sampled from a Student t(3) distribution, while observations with negative coordinates
were omitted from the presented plot due to symmetry. We can notice that the KL
divergence allows unbounded influence, indicating the brittleness of inference on the tails
of the observed distribution. In contrast, moving away from the mean, individual
datapoints' influence under the β-divergence is initially characterised by a regime of increase
until reaching a maximum (which depends on the selected robustness hyperparameter),
succeeded by attenuation down to zero at the tails of the data distribution. At the same
time, this experiment makes clear that for decision problems critically relying on the tail
information of the observations, KL might be the divergence of choice, as the density
power divergence would downweight the importance of datapoints lying far from the
mean.
Fig. 2.1b shows the posterior density estimation for classical and robustified Bayesian
inference on 1K datapoints sampled from a contaminated distribution 0.99 × N (0, 1) +
0.01 × N (5, 25). The posterior under the KL divergence tries to explain the long tails
of the observations—which are the effects of the contaminating component—eventually
overestimating the variance of the data distribution. On the other hand, using the density
power divergence with β = 0.5 during inference allows us to declare the long tails as outliers,
and provides more accurate modeling of the inliers’ component.
ϕ : X → H, (2.25)
is sought which transforms the datapoints from the original data space {x_n}_{n=1}^{N}, x_n ∈ X,
into feature representations in a Hilbert space {ϕ(x_n)}_{n=1}^{N}, ϕ(x_n) ∈ H. Then the patterns
of interest can be revealed via applications of inner products in the Hilbert space
⟨A, ϕ(x)⟩H . There is an extensive literature on constructing data representations; for the
purposes of this thesis, in the remainder of the section we focus on two of them: kernel
methods and random projections.
2.5.1 Kernels
The main tool in kernel methods (Schölkopf et al., 2002) is the kernel function defined
below.
∑_{i,j=1}^{N} c_i c_j k(x_i, x_j) ≥ 0.   (2.26)
A kernel representation might be lacking an explicit closed form, but can always be
accessed via the inner product of Eq. (2.27), which is the central object of interest in
learning with kernels.
• Radial Basis Function (RBF) kernels k(x, x′ ) = f (d(x, x′ )), where d is a metric on
X and f is a function on R+ .
Kernel methods induce non-parametric representations on the data, i.e. when given
a set with N datapoints of dimension d, kernels effectively map each datapoint to an
N -dimensional representation.
The weighting distribution π̂ can be selected from a set of cheap posterior approximations,
for example using Laplace’s method, or running a few rounds of an MCMC algorithm. In
the general case, the norm of Eq. (2.29) is not available in closed form, hence a random
projection can be used instead to approximate it according to the following steps:
1. Sample J values for θ i.i.d. from the weighting distribution: (θ̂_j)_{j=1}^{J} ∼ π̂.

2. For n = 1, …, N compute a J-dimensional projection f̂_n := √(1/J) [f_n(θ̂_1), …, f_n(θ̂_J)].
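A minimal sketch of these projection steps (assuming numpy; the Gaussian likelihood and weighting distribution below are hypothetical placeholders for illustration):

```python
import numpy as np

def random_log_likelihood_projection(log_liks, theta_samples):
    """Steps 1-2 above: stack the log-likelihood values f_n(theta_j) over J
    draws from the weighting distribution pi_hat, scaled by sqrt(1/J), so
    Euclidean inner products of the rows approximate the pi_hat-weighted
    function-space inner products of the f_n."""
    J = len(theta_samples)
    F = np.stack([log_liks(th) for th in theta_samples], axis=1)  # shape (N, J)
    return F / np.sqrt(J)

# Hypothetical example: Gaussian log-likelihoods of 1-D data, with pi_hat a
# cheap posterior approximation N(mu_hat, tau^2) (e.g. from Laplace's method).
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=100)
log_liks = lambda th: -0.5 * (x - th) ** 2        # f_n(theta), up to a constant
theta_hat = rng.normal(1.0, 0.3, size=50)         # J = 50 draws from pi_hat
F_hat = random_log_likelihood_projection(log_liks, theta_hat)
print(F_hat.shape)  # (100, 50): one J-dimensional vector per datapoint
```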
The most common mechanisms that enable releasing numerical queries f under DP
rely on randomization via injecting additive noise. The amount of noise is calibrated to
the global sensitivity of the query, defined (for the ℓ2 norm) as Δ2(f) := max ‖f(x) − f(x′)‖2,
with the maximum taken over pairs of neighbouring datasets x, x′.
To achieve (ε, δ)-DP one can use the Gaussian Mechanism, which returns

f(x) + Z,   Z ∼ N(0, σ²I),   where σ ≥ √(2 log(1.25/δ)) Δ2(f) / ε.   (2.33)
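A minimal sketch of the mechanism (assuming numpy; the mean-release example and its sensitivity bound are illustrative):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, seed=None):
    """Release f(x) + Z, Z ~ N(0, sigma^2 I), with sigma calibrated to the
    L2 sensitivity as in Eq. (2.33); the guarantee holds for eps < 1."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    rng = np.random.default_rng(seed)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: privately release the mean of N records bounded in [0, 1];
# changing one record moves the mean by at most 1/N, so Delta_2 = 1/N.
x = np.random.default_rng(1).uniform(size=1000)
print(x.mean(), gaussian_mechanism(x.mean(), 1.0 / len(x), eps=0.5, delta=1e-5))
```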
DP is equipped with a suite of properties that facilitate reasoning about privacy
guarantees over complicated analysis tasks on a sensitive data collection in a modular
fashion. In the remainder we review a fraction of them which are frequently encountered
in machine learning settings.
A useful fact about DP algorithms is that a data analyst cannot weaken their privacy
guarantees by doing any computation on their output that does not depend on the private
input itself.
DP composition theorems accumulate the total privacy cost over the application of
a sequence of mechanisms. The moments accountant is a recently proposed technique
that allows computing tight bounds for ε and δ, offering the following guarantees:
Proposition 8 (Moments Accountant (Abadi et al., 2016)). Given 0 < ε < 1 and
0 < δ < 1, to ensure (ε, Tδ′ + δ)-DP over the composition of T mechanisms M1, …, MT,
it suffices that each Mi is (ε′, δ′)-DP, where ε′ = ε / (2√(2T log(2/δ))) and δ′ = δ/T.
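The per-mechanism budget prescribed by Proposition 8 is simple arithmetic; a sketch (assuming numpy; the budget values are illustrative):

```python
import numpy as np

def per_mechanism_budget(eps, delta, T):
    """Per-mechanism (eps', delta') so that composing T mechanisms is
    (eps, T*delta' + delta)-DP by Proposition 8:
    eps' = eps / (2*sqrt(2*T*log(2/delta))) and delta' = delta / T."""
    return eps / (2.0 * np.sqrt(2.0 * T * np.log(2.0 / delta))), delta / T

# Example: splitting a (1.0, 1e-5) budget across T = 100 releases.
print(per_mechanism_budget(eps=1.0, delta=1e-5, T=100))
```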
The above tools are required for carrying out the privacy analysis of the subsampled
Gaussian mechanism (Abadi et al., 2016), which will be used for privatising the variational
inference scheme introduced in Chapter 4.
Chapter 3
Quantifying Privacy Loss of Human Mobility Graph Topology
In this chapter, we present a case study on population scale empirical data, which demon-
strates that releases of deidentified and reduced representations of structured individual
records might still breach the privacy of information-contributing participants. This
analysis motivates the necessity of developing new formal privacy-preserving frameworks
for scalable learning via data summarization, which is further studied in Chapter 4.
Human mobility is often represented as a mobility network, or graph, with nodes
representing places of significance which an individual visits, such as their home, work,
places of social amenity, etc., and edge weights corresponding to probability estimates of
movements between these places. Previous research has shown that individuals can be
identified by a small number of geolocated nodes in their mobility network, rendering
mobility trace anonymization a hard task. In this chapter we build on prior work, and
demonstrate that, even when all location and timestamp information is removed from
nodes, the graph topology of an individual mobility network itself is often uniquely
identifying. Further, we observe that a mobility network is often unique, even when
only a small number of the most popular nodes and edges are considered. We evaluate
our approach using a large dataset of cell-tower location traces from 1,500 smartphone
handsets with a mean duration of 430 days. We process the data to derive the top−N
places visited by the device in the trace, and find that 93% of traces have a unique
top−10 mobility network, and all traces are unique when considering top−15 mobility
networks. Since mobility patterns, and therefore mobility networks for an individual, vary
over time, we use graph kernel distance functions, to determine whether two mobility
networks, taken at different points in time, represent the same individual. We then show
that our distance metrics, while imperfect predictors, perform significantly better than a
random strategy, and therefore our approach represents a significant loss in privacy.
device. However, the technique could also be used to link together different user profiles
which represent the same individual.
Our approach differs from previous studies in location data deanonymization (De
Mulder et al., 2008; Golle and Partridge, 2009; Gambs et al., 2014; Naini et al., 2016), in
that we aim to quantify the breach risk in preprocessed location data that do not disclose
explicit geographic information, and where instead locations are replaced with a set of
user-specific pseudonyms. Moreover, we also do not assume specific timing information
for the visits to abstract locations, merely ordering and coarse duration of stays.
We evaluate the power of our approach over a large dataset of traces from 1,500
smartphones, where cell tower identifiers (cids) are used for localization. Our results show
that the examined data reductions contain structural information which may uniquely
identify users. This fact then supports the development of techniques to efficiently
reidentify individual mobility profiles. Conversely, our analysis may also support the
development of techniques to indistinguishably cluster users into larger groups with similar
mobility; such an approach may then be able to offer better anonymity guarantees.
A summary of the contributions of this chapter is as follows:
All the above-mentioned previous work assumes that locations are expressed using a
universal set of symbols or global identifiers, either corresponding to (potentially obfus-
cated) geographic coordinates, or pseudonymous stay points. Hence, cross-referencing
between individuals in the population is possible. This is inapplicable when location in-
formation is anonymized separately for each individual. Lin et al. (2015) presented a user
verification method in this setting. It is based on statistical profiles of individual indoor
and outdoor mobility, including cell tower ID and WiFi access point information. In
contrast, here we employ network representations based solely on cell tower ID sequences
without explicit time information.
Often, studies in human mobility aim to model properties of a population, thus location
data are published as aggregate statistics computed over the locations of individuals.
This has traditionally been considered a secure way to obfuscate the sensitive information
contained in individual location data, especially when released aggregates conform to
k−anonymity principles (Sweeney, 2002). However, recent results have questioned this
assumption. Xu et al. (2017) recovered movement trajectories of individuals with accuracy
levels between 73% and 91% from aggregate location information computed from
cellular records involving 100,000 users. Similarly, Pyrgelis et al. (2017)
performed a set of inference attacks on aggregate location time-series data and detected
serious privacy loss, even when individual data are perturbed by differentially private
mechanisms before aggregation.
Therefore we interpret k−anonymity in this chapter to mean that the mobility network
of an individual in a population should be identical to the mobility network of at least
k − 1 other individuals. Recent work casts doubt on the protection guarantees offered
by k−anonymity in location privacy (Shokri et al., 2010), motivating the definition of
l−diversity (Machanavajjhala et al., 2007) and t−closeness (Li et al., 2007). Although
k−anonymity may be insufficient to ensure privacy in the presence of adversarial knowl-
edge, k−anonymity is a good metric to use to measure the uniqueness of an individual
in the data. Moreover, this framework is straightforwardly generalizable to the case of
graph data.
Structural equivalence in the space of graphs corresponds to isomorphism and, based
on this, we can define k−anonymity on unweighted graphs as follows:
After clustering our population of graphs into isomorphism classes, we can also define
the identifiability set and anonymity size (Pfitzmann and Hansen, 2010) as follows:
means that the transition by an individual to the next location in the mobility network
can be accurately modelled by considering only their current location. For example, the
probability that an individual visits the shops or work next depends only on where they
are located now, and a more detailed past history of places recently visited does not offer
significant improvements to the model. The alternative is that a sequence of the states is
better modelled by higher-order Markov chains, namely that transitions depend on the
current state and one or more previously visited states. For example, the probability
that an individual visits the shops or work next depends not only on where they are
now, but where they were earlier in the day or week. If higher-order Markov chains are
required, we should assume a larger state-space and use these states as the nodes of our
individual mobility networks. Recently proposed methods on optimal order selection of
sequential data (Xu et al., 2016; Scholtes, 2017) can be directly applied at this step.
Let us assume a mobility dataset from a population of users u ∈ U . We introduce
two network representations of a user's mobility.
represents the information that u had at least one recorded transition from viu to vju .
In some of our experiments, we prune the mobility networks of users by reducing the
size of the mobility network to the N most frequent places and rearranging the edges in
the network accordingly. We refer to these networks as top−N mobility networks.
ϕ(G′ )
* +
′ ϕ(G)
K(G, G ) = , . (3.1)
||ϕ(G)|| ||ϕ(G′ )||
One interpretation of this function is as the cosine similarity of the graphs in the feature
space defined by the map of the kernel.
In our experiments we apply a number of kernel functions on our mobility datasets
and assess their suitability for deanonymization applications on mobility networks. We
note in advance that, as the degree distribution and all substructure counts of a graph
remain unchanged under structure-preserving bijection of the vertex set, all examined
graph kernels are invariant under isomorphism. We briefly introduce these kernels in the
remainder of the section.
The degree distribution of nodes in the graph can be used to quantify the similarity
between two graphs. For example, we can use a histogram of weighted or unweighted
node degree as a feature vector. We can then compute the pairwise distance of two
graphs by taking either the inner product of the feature vectors, or passing them through
a Gaussian RBF kernel. Here, the hyperparameters of the kernel are the variance σ (in case an RBF is used), and
the number of bins in the histogram.
Kernels can use counts on substructures, such as subtree patterns, shortest paths,
walks, or limited-size subgraphs. This family of kernels is called R−convolution graph
kernels (Haussler, 1999). In this way, graphs are represented as vectors with elements
corresponding to the frequency of each such substructure over the graph. Hence, if
s1 , s2 , ... ∈ S are the substructures of interest and # (si ∈ G) the counts of si in graph
G, we get as feature map vectors

ϕ(G) = [#(s1 ∈ G), #(s2 ∈ G), …],   (3.3)

and the corresponding kernel

K(G, G′) = ∑_{s∈S} #(s ∈ G) #(s ∈ G′).   (3.4)
In the following, we briefly present some kernels in this category and explain how
they are adapted in our experiments.
Shortest-Path Kernel
The Shortest-Path (SP) graph kernel (Borgwardt and Kriegel, 2005) expresses the
similarity between two graphs by counting the co-occurring shortest paths in the graphs.
It can be written in the form of Eq. (3.3), where each element s_i ∈ S is a triplet
(a_start^i, a_end^i, n_i), with n_i the length of the path and a_start^i, a_end^i the attributes of the
starting and ending nodes. The shortest-path sets are computable in polynomial time
using, for example, the Floyd-Warshall algorithm, and the resulting kernel can be
evaluated with complexity O(|V|⁴), where |V| is the number of nodes in the network.
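A simplified sketch of this kernel (assuming networkx; it counts the triplets via the library's all-pairs shortest paths rather than an explicit Floyd-Warshall implementation, and the node attribute key "a" is an assumption):

```python
import networkx as nx
from collections import Counter

def shortest_path_features(G):
    """Counts of triplets (a_start, a_end, path_length) over all node pairs,
    with node attributes read from key 'a'."""
    feats = Counter()
    for u, dists in nx.all_pairs_shortest_path_length(G):
        for v, n in dists.items():
            if u != v:
                feats[(G.nodes[u]["a"], G.nodes[v]["a"], n)] += 1
    return feats

def sp_kernel(G1, G2):
    """Shortest-path kernel as the substructure co-occurrence sum of Eq. (3.4)."""
    f1, f2 = shortest_path_features(G1), shortest_path_features(G2)
    return sum(f1[s] * f2[s] for s in f1.keys() & f2.keys())

# Example on two small attributed graphs.
G1, G2 = nx.path_graph(4), nx.cycle_graph(4)
nx.set_node_attributes(G1, 1, "a")
nx.set_node_attributes(G2, 1, "a")
print(sp_kernel(G1, G2))
```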
Figure 3.1: Computation of the Weisfeiler-Lehman subtree kernel of height h = 1 for two
attributed graphs. Panels (a), (b): G and G′ with multiset node attributes; (c)-(e): label
compression; (f), (g): the relabeled graphs; (h): the extracted feature vectors
ϕ1(G) = [2, 1, 3, 2, 1, 0, 2, 1, 0] and ϕ1(G′) = [2, 1, 3, 2, 1, 1, 1, 0, 1].
The idea of the WL kernel is to measure co-occurrences of subtree patterns across node
attributed graphs.
Computation progresses over iterations as follows:
1. each node attribute is augmented with a multiset of attributes from adjacent nodes;
2. each node attribute is then compressed into a single attribute label for the next
iteration; and

3. the kernel is computed as the inner product of label-count vectors, K(G, G′) = ⟨ϕh(G), ϕh(G′)⟩,

where ϕh(G) and ϕh(G′) are the vectors of labels extracted after running h steps of the
computation (Fig. 3.1h). They consist of h blocks, where the i-th component of the j-th
block corresponds to the frequency of label i at the j-th iteration of the computation.
The computational complexity of the kernel scales linearly with the number of edges |E|
and the length h of the WL graph sequence.
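A compact sketch of these iterations (assuming networkx; label compression is done here by string concatenation instead of a hash table of fresh labels, which is equivalent up to renaming, and the attribute key "a" is an assumption):

```python
import networkx as nx
from collections import Counter

def wl_features(G, h=2):
    """WL subtree features: iteratively augment each node label with the sorted
    multiset of neighbor labels, compress, and count labels at every step."""
    labels = {v: str(G.nodes[v]["a"]) for v in G}
    feats = Counter(labels.values())
    for _ in range(h):
        labels = {v: labels[v] + "|" + ",".join(sorted(labels[u] for u in G[v]))
                  for v in G}
        feats.update(labels.values())
    return feats

def wl_kernel(G1, G2, h=2):
    """Inner product of WL label-count vectors after h iterations."""
    f1, f2 = wl_features(G1, h), wl_features(G2, h)
    return sum(f1[s] * f2[s] for s in f1.keys() & f2.keys())

G1, G2 = nx.path_graph(4), nx.star_graph(3)
nx.set_node_attributes(G1, 1, "a")
nx.set_node_attributes(G2, 1, "a")
print(wl_kernel(G1, G2, h=2))
```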
Deep graph kernels (DKs) additionally learn latent representations of the substructures (Yanardag
and Vishwanathan, 2015). Hence, these kernels can quantify similar substructure co-
occurrence, offering more robust feature representations. DKs are based on computing
the following inner product: K(G, G′) = ϕ(G)ᵀ M ϕ(G′), where M is a matrix encoding
the similarity between the substructures themselves.
We assume that an adversary has access to a set of mobility networks G ∈ Gtraining with
disclosed identities (or labels) lG ∈ L and a set of mobility networks G′ ∈ Gtest with
undisclosed identities lG′ ∈ L.
Generally we can think of lG′ ∈ J ⊃ L and assign some fixed probability mass to the
labels lG′ ∈ J \ L. However, here we make the closed world assumption that the training
(a) user 1: 1st half of the observation period (b) user 1: 2nd half of the observation period
(c) user 2: 1st half of the observation period (d) user 2: 2nd half of the observation period
Figure 3.2: Top−20 networks for two random users from the Device Analyzer dataset.
Depicted edges correspond to the highest 10th percentile of frequent transitions in the
respective observation window. The networks show a high degree of similarity between
the mobility profiles of the same user over the two observation periods. Moreover, the
presence of single directed edges in the profile of user 2 forms a discriminative pattern
that allows us to distinguish user 2 from user 1.
and test networks come from the same population. We make this assumption for two
reasons: first, it is a common assumption in works on deanonymization and, second, we
cannot directly update our beliefs on lG′ ∈ J \ L by observing samples from L.
We define a normalised similarity metric among the networks K : Gtraining ×Gtest → R+ .
We hypothesize that a training and test mobility network belonging to the same person
have common or similar connectivity patterns, thus a high degree of similarity.
The intention of an adversary is to deanonymize a given test network G′ ∈ Gtest , by
appropriately defining a vector of probabilities over the possible identities in L.
An uninformed adversary has no information about the networks of the population
and, in the absence of any other side knowledge, the prior belief of the adversary about
the identity of G′ is a uniform distribution over all possible identities:
P(lG′ = lGi) := 1/|L|, for every Gi ∈ Gtraining.
An informed adversary has access to the population of training networks and can
compute the pairwise similarities of G′ with each Gi ∈ Gtraining using a kernel function K.
Hence the adversary can update her belief for the possible identities in L according to
the values of K. Therefore, when the adversary attempts to deanonymize identities in
the data, she assigns probabilities that follow a non-decreasing function of the computed
pairwise similarity of each label. Denoting this function by f , we can write the updated
adversarial probability estimate for each identity as follows:
f (K(Gi , G′ ))
PK (lG′ = lGi |Gtraining ) := X , for every Gi ∈ Gtraining . (3.8)
f (K(Gj , G′ ))
j∈L
In the case of the uninformed adversary, the true label for any user is expected to have
rank |L|/2. Under this policy, the amount of privacy for each user is proportional to the
size of the population.
In the case of the informed adversary, knowledge of Gtraining and the use of K will
induce some non-negative privacy loss, which will result in the expected rank of a user
being smaller than |L|/2. The privacy loss (PL) can be quantified as follows:
PL(G′; Gtraining, K) := PK(lG′ = lG′true | Gtraining) / P(lG′ = lG′true) − 1   (3.9)
and cannot be exceeded in real-world datasets, where the test networks are expected to
be noisy copies of the training networks existing in our system. The step of comparing
with the set of training networks adds computational complexity of O(|Gtraining |) to the
similarity metric cost.
Moreover, our framework can naturally facilitate incorporating new data to our beliefs
when multiple examples per individual exist in the training dataset. For example, when
multiple instances of mobility networks per user are available, we can use k−nearest
neighbors techniques in the comparison of distances with the test graph.
We evaluate our methodology on the Device Analyzer dataset (Wagner et al., 2014).
Device Analyzer contains records of smartphone usage collected from over 30,000 study
participants around the globe. Collected data include information about system status
and parameters, running background processes, cellular and wireless connectivity. For
privacy purposes, released cid information is given a unique pseudonym separately for
each user, and contains no geographic, or semantic, information concerning the location
of users. Thus we cannot determine geographic proximity between the nodes, and the
location data of two users cannot be directly aligned.
For our experiments, we analysed cid information collected from 1,500 handsets with
the largest number of recorded location datapoints in the dataset. Fig. 3.3a shows the
observation period for these handsets; note that the mean is greater than one year but
there is a lot of variance across the population. We selected these 1,500 handsets in order
to examine the reidentifiability of devices with rich longitudinal mobility profiles. This
allowed us to study the various attributes of individual mobility affecting privacy in detail.
As mentioned in the previous section, the cost of computing the adversarial posterior
probability for the deanonymization of a given unlabeled network scales linearly with the
population size.
Figure 3.3: Empirical statistical findings of the Device Analyzer dataset. (a) Distribution
of the observation period duration. (b) Normalized histogram and empirical probability
density estimate of network size for the full mobility networks over the population.
(c) Complementary cumulative distribution function (CCDF) for the node degree in
the mobility network of a typical user from the population, displayed on log-log scale.
(d) Normalized histogram and probability density of average edge weight over the networks.
We began by selecting the optimal order of the network representations derived from the
mobility trajectories of the 1,500 handsets selected from the Device Analyzer dataset.
We first parsed the cid sequences from the mobility trajectories into mobility networks.
In order to remove cids associated with movement, we only defined nodes for cids which
were visited by the handset for at least 15 minutes. Movements from one cid to another
were then recorded as edges in the mobility network.
As outlined in Section 3.3.1, we analysed the pathways of the Device Analyzer dataset
during the entire observation period, applying the model selection method of Scholtes
(2017).3 This method tests graphical models of varying orders and selects the optimal
order by balancing the model complexity and the explanatory power of observations.
We tested higher-order models up to order three. In the case of top−20 mobility
networks, we found routine patterns in the mobility trajectories were best explained with
models of order two for more than 20% of the users. However, when considering top−100,
top−200, top−500 and full mobility networks, we found that the optimal model for our
dataset has order one for more than 99% of the users; see Fig. 3.4. In other words, when
considering mobility trajectories which visit less frequent locations in the graph, the
overall increase in likelihood of the data for higher-order models cannot compensate for
the complexity penalty induced by the larger state space. Hence, while there might still
be regions in the graph which are best represented by a higher-order model, the optimal
order describing the entire graph is one. Therefore we use a model of order one in the
rest of this chapter.
Table 3.1: Summary statistics of mobility networks in the Device Analyzer dataset.
an increase in the variance of their statistics, and leads to smaller density, larger diameter
and larger average shortest-path values.
A recurrent edge traversal in a mobility network occurs when a previously traversed
edge is traversed for a second or subsequent time. We then define recurrence rate as
the percentage of edge traversals which are recurrent. We find that mobility networks
display a high recurrence rate, varying from 68.8% on average for full networks to 84.7%
for the top−50 networks, indicating that the mobility of the users is mostly comprised of
repetitive transitions between a small set of nodes in a mobility network.
Fig. 3.3b displays the normalized histogram and probability density estimate of
network size for full mobility networks. We observe that sizes of a few hundred nodes are
most likely in our dataset; however, mobility networks of more than 1,000 nodes also
exist. Reducing the variance in network size will prove helpful in cross-network
similarity metrics, hence we also consider truncated versions of the networks.
As shown in Fig. 3.3c, the parsed mobility network of a typical user is characterized
by a heavy-tailed degree distribution. We observe that a small number of locations have
high degree and correspond to dominant states for a person’s mobility routine, while a
large number of locations are only visited a few times throughout the entire observation
period and have a small degree.
Fig. 3.3d shows the estimated probability distribution of average edge weight. This
peaks in the range from two to four, indicating that many transitions captured in the full
mobility network are rarely repeated. However, most of the total weight of the network
is attributed to the tail of this distribution, which corresponds to the edges that the user
frequently repeats.
N               4      5        6          7        8        9
# undirected   11     34      156      1,044   12,346  274,668

N               4      5          6            7
# directed  2,128  9,608  1,540,944  882,033,440

Table 3.2: Sequences of non-isomorphic graphs for undirected and directed graphs of
increasing size.
⁴ http://pallini.di.uniroma1.it/
Figure 3.5: Identifiability set and k−anonymity for undirected and directed top−N
mobility networks for increasing number of nodes. Displayed is also the theoretical upper
bound of identifiability for networks with N nodes.
Figure 3.6: Anonymity size statistics over the population of top−N mobility networks
for increasing network size.
attributed to patterns that are widely shared (e.g. the trip from work to home, and from
home to work).
Fig. 3.6 shows some additional statistics of the anonymous isomorphic clusters formed
for varying network sizes. Median anonymity becomes one for network sizes of five and
eight in directed and undirected networks respectively; see Fig. 3.6a. In Fig. 3.6b we
observe that the population arranges into clusters with small anonymity even for very
small network sizes: around 5% of the users have at most 10-anonymity when considering
only five locations in their network, while this percentage increases to 80% and 100% for
networks with 10 and 15 locations. This result confirms that anonymity is even harder
to attain when the directionality of edges is provided, since the space of directed networks
is much larger than the space of undirected networks with the same number of nodes.
The above empirical results indicate that the diversity of individuals’ mobility is
reflected in the network representations we use, thus we can meaningfully proceed to
discriminative tasks on the population of mobility networks.
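As an illustration of how such isomorphism classes and anonymity sizes can be computed (a simple quadratic-time sketch assuming networkx; the counts in this chapter were presumably obtained with dedicated tooling such as nauty, cf. the footnote link above):

```python
import networkx as nx
from collections import defaultdict

def anonymity_sizes(graphs):
    """Group a population of graphs into isomorphism classes and return, for
    each graph index, the size of its class (its anonymity set size); graphs
    in singleton classes are uniquely identifying."""
    classes = defaultdict(list)        # representative index -> member indices
    for i, G in enumerate(graphs):
        for rep in classes:
            if nx.is_isomorphic(graphs[rep], G):
                classes[rep].append(i)
                break
        else:
            classes[i].append(i)
    return {i: len(members) for members in classes.values() for i in members}

# Example: two isomorphic paths and one star; k-anonymity is the class size.
population = [nx.path_graph(4), nx.star_graph(3), nx.path_graph(4)]
print(anonymity_sizes(population))  # {0: 2, 2: 2, 1: 1}
```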
to add some information about the different roles of nodes in users’ mobility routine.
Such attributes are computed independently for each user on the basis of the topological
information of each network. After experimenting with several schemes, we obtained the
best performance on the kernels when dividing locations into three categories with respect
to the frequency in which each node is visited by the user. Concretely, we computed the
distribution of users’ visits to locations and added the following values to the nodes:
a_{c=3}(v_i^u) :=
    3,  if v_i^u ∈ top−20% locations of u
    2,  if v_i^u ∉ top−20% locations of u and v_i^u ∈ top−80% locations
    1,  otherwise.
3.5.3 Evaluation
We evaluated graph kernel functions from the following categories:
The Cumulative Density Functions (CDF s) of the true label rank for the best performing
kernel of each category are presented in Fig. 3.7.
If mobility networks are unique, an ideal retrieval mechanism would correspond to a
curve that reaches 1 at rank one, indicating a system able to correctly deanonymize all
traces by matching the closest training graph. This would be the case when users’ training
and test networks are identical, thus the knowledge of the latter implies maximum privacy
loss.
Figure 3.7: CDF of true rank over the population according to different kernels.
Our baseline, random, is a strategy which reflects the policy of an adversary with
zero knowledge about the mobility networks of the users, who simply returns uniformly
random orderings of the labels. The CDF of true labels’ rank for random lies on the
diagonal line. We observe that atomic substructure based kernels significantly outperform
the random baseline performance by defining a meaningful similarity ranking across the
mobility networks.
The best overall performance is achieved by the DSP kernel on graphs pruned to
200 nodes. In particular, this kernel places the true identity among the closest 10
networks for 10% of the individuals, and among the closest 200 networks for 80% of
the population. The Shortest-Path kernel has an intuitive interpretation in the case of
mobility networks, since its atomic substructures take into account the hop distances
among the locations in a user’s mobility network and the popularity categories of the
departing and arrival locations. The deep variant can also account for variation at the
level of such substructures, which are more realistic when considering the stochasticity
in the mobility patterns inherent to our dataset.
The best performance of the Weisfeiler-Lehman kernel is achieved by its deep variant
for h = 2 iterations of the WL test for a mobility network pruned to 200 nodes. This
phenomenon is explainable via the statistical properties of the mobility networks. As
we saw in Section 3.4.3, the networks display a power-law degree distribution and small
diameters. Taking into account the steps of the WL test, it is clear that these topological
properties will lead the node relabeling scheme to cover the entire network after a very
small number of iterations. Thus local structural patterns will be described by few
features produced in the first iterations of the test. Furthermore, the feature space of the
kernel increases very quickly as a function of h, which leads to sparsity and low levels of
similarity over the population of networks.
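To make the relabeling argument concrete, the following schematic Python sketch performs a single WL iteration on an adjacency structure; it is a simplification of the kernel computation actually used, which additionally accumulates label counts into feature vectors at each iteration.

    def wl_iteration(adj, labels):
        # adj: dict node -> iterable of neighbours; labels: dict node -> label
        new_labels = {}
        for v in adj:
            # hash the node's label together with the sorted multiset of its
            # neighbours' labels, producing the compressed relabeling
            signature = (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            new_labels[v] = hash(signature)
        return new_labels

On a small-diameter network, a few such iterations suffice for each node's label to depend on almost the whole graph, which explains the saturation at h = 2 observed above.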
Histograms of length 10³ were also computed for the unweighted and weighted degree
distributions and passed through a Gaussian RBF kernel. We can see that the unweighted
degree distribution DD gives almost a random ranking, as this kernel produces a very
high-dimensional mapping, which is heavily dependent on the network size. When
including the normalized edge weights, the WD kernel only barely outperforms a random
ranking. Repeating the experiment on pruned versions, which partly mitigates dimensionality effects, did not significantly improve the performance; these results are not presented for brevity.
Based on the insights obtained from our experiment, we can make the following
observations with respect to attributes of individual mobility and their impact on the
identifiability of networks:
Figure 3.8: Boxplot of rank for the true labels of the population according to a Deep Shortest-Path kernel and to a random ordering.
Figure 3.9: Privacy loss over the test data of our population for an adversary adopting the informed policy of (3.10). Median privacy loss is 2.52.
The obtained ordering implies a significant decrease in user privacy, since the ranking
can be leveraged by an adversary to determine the most likely matches between a training
mobility network and a test mobility network. The adversary can estimate the true
identity of a given test network G′, as suggested in Section 3.3.4.2, by applying a simple probabilistic policy that uses pairwise similarity information. For example, let us examine
the privacy loss implied by the update rule in (3.8) for function f :
$$f\left(K_{\mathrm{DSP}}(G_i, G')\right) := \frac{1}{\mathrm{rank}\left(K_{\mathrm{DSP}}(G_i, G')\right)}. \qquad (3.10)$$
This means that the adversary updates her probability estimate for the identity
corresponding to a test network, by assigning to each possible identity a probability that
is inversely proportional to the rank of the similarity between the test network and the
training network corresponding to the identity.
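A minimal sketch of this policy in Python is shown below. The privacy-loss helper assumes a reading of Eq. (3.9), whose exact form lies outside this excerpt, as the multiplicative gain of the informed estimate over the uniform baseline minus one; this is consistent with the 2.52 versus 3.52 figures reported next.

    import numpy as np

    def identity_probabilities(sims):
        # sims: length-n array of similarities K_DSP(G_i, G') to the test network
        ranks = np.empty(len(sims))
        ranks[np.argsort(-sims)] = np.arange(1, len(sims) + 1)  # rank 1 = closest
        p = 1.0 / ranks              # probability inversely proportional to rank
        return p / p.sum()

    def privacy_loss(p, true_idx):
        # gain of the informed estimate over the uniform strategy, minus one
        return p[true_idx] * len(p) - 1.0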
From equation (3.9), we can compute the induced privacy loss for each test network,
and the statistics of privacy loss over the networks of the Device Analyzer population.
Fig. 3.9 demonstrates considerable privacy loss with a median of 2.52. This means that
the informed adversary can achieve a median deanonymization probability 3.52 times
higher than an uninformed adversary. Moreover, the positive mean of privacy loss (≈ 27)
means that the probabilities of the true identities of the test networks have, on average,
much higher values in the adversarial estimate compared to the uninformed random
strategy. Hence, revealing the kernel values makes an adversarial attack easier.
Fitting a generative model to the mobility traces and then defining a kernel on this model (Song et al., 2011) may provide
better anonymity, and therefore privacy, and it would also support the generation of
artificial traces which mimic the mobility of users.
Chapter 4
Bayesian Pseudocoresets
(Dwork et al., 2006c; Dwork and Roth, 2014). Past work addresses this issue in the
context of clustering and computational geometry (Feldman et al., 2009; 2017)—with
the remarkable property that the privatized coreset may be queried ad infinitum without
loss of privacy—but no such method exists for Bayesian posterior inference.
In this chapter, we develop a novel technique for data summarization in the context
of Bayesian inference, under the constraints that the method is scalable and easy to
use, creates an intuitive summarization, applies to high-dimensional data, and enables
privacy control. Inspired by past work (Madigan et al., 2002; Snelson and Ghahramani,
2005; Zhou et al., 2007; Titsias, 2009), instead of using constituent datapoints, we
use synthetic pseudodata to summarize the large dataset, resulting in a pseudocoreset.
We show that in the high-dimensional problem setting of Proposition 16, the optimal
pseudocoreset with just one pseudodata point recovers the exact posterior, a significant
improvement upon the optimal standard coreset of any size. As in past work on Bayesian
coresets (Campbell and Beronov, 2019), we formulate pseudocoreset construction as
variational inference, and provide a stochastic optimization method (Section 4.3). As a
consequence of the use of pseudodata—as well as privacy-preserving stochastic gradient
descent mechanisms (Abadi et al., 2016; Jälkö et al., 2017; Park et al., 2020)—we show
that our method can easily be modified to output a privatized pseudocoreset. The chapter
concludes with experimental results demonstrating the performance of pseudocoresets on
real and synthetic data (Section 4.4).
Consider a target density of the form

$$\pi(\theta) := \frac{1}{Z} \exp\left(\sum_{n=1}^{N} f(x_n, \theta)\right) \pi_0(\theta). \qquad (4.1)$$
In the setting of Bayesian inference with conditionally independent data, the potentials
are data log-likelihoods, i.e. f (xn , θ) := log π(xn |θ), π0 is the prior density, π is the
posterior, and Z is the marginal likelihood of the data. Rather than working directly
with π(θ) for posterior inference—which requires a Θ(N ) computation per evaluation—a
Bayesian coreset approximation of the form
$$\pi_w(\theta) := \frac{1}{Z(w)} \exp\left(\sum_{n=1}^{N} w_n f(x_n, \theta)\right) \pi_0(\theta) \qquad (4.2)$$
for w ∈ RN , w ≥ 0 may be used in most popular posterior inference schemes (Neal, 2011;
Ranganath et al., 2014; Kucukelbir et al., 2017). If the number of nonzero entries ∥w∥0
of w is small, this results in a significant reduction in computational burden. Recent
work has formulated the problem of constructing a Bayesian coreset of size M ∈ N as
sparse variational inference (Campbell and Beronov, 2019),

$$w^\star = \arg\min_{w \in \mathbb{R}^N} D_{\mathrm{KL}}(\pi_w \,\|\, \pi) \quad \text{s.t.} \quad w \ge 0, \ \|w\|_0 \le M, \qquad (4.3)$$

and showed that the objective can be minimized using stochastic estimates of $\nabla_w D_{\mathrm{KL}}(\pi_w \| \pi)$ based on samples from the coreset posterior $\pi_w$.
Coresets, as formulated in Eq. (4.3), are limited to using the original datapoints themselves
to summarize the whole dataset. Proposition 16 shows that this is problematic when
summarizing high-dimensional data; in the common setting of posterior inference for a
Gaussian mean, the KL divergence DKL (πw⋆ ||π) of the optimal coreset of any size scales
with the dimension of the data. The proof may be found in Appendix A.1.
Proposition 16. Suppose we use $(X_n)_{n=1}^N \overset{i.i.d.}{\sim} \mathcal{N}(0, I)$ in $\mathbb{R}^d$ to perform posterior inference in a Bayesian model with prior $\mu \sim \mathcal{N}(0, I)$ and likelihood $(X_n)_{n=1}^N \overset{i.i.d.}{\sim} \mathcal{N}(\mu, I)$. Then $\forall M < d$ and $\delta \in [0, 1]$, with probability at least $1 - \delta$ the optimal size-$M$ coreset $w^\star$ satisfies

$$D_{\mathrm{KL}}(\pi_{w^\star} \,\|\, \pi) \ge \frac{1}{2}\, \frac{N - M}{1 + N}\, F_{d-M}^{-1}\!\left(\delta \binom{N}{M}^{-1}\right), \qquad (4.4)$$

where $F_{d-M}$ denotes the CDF of the $\chi^2$ distribution with $d - M$ degrees of freedom.
Figure 4.1: Gaussian mean inference under pseudocoreset (PSVI) against standard coreset
(SparseVI) summarization for N = 1, 000 datapoints. (a) Progression of PSVI vs. Spar-
seVI construction for coreset sizes M = 0, 1, 5, 12, 30, 100, in 500 dimensions (displayed
are datapoint projections on 2 random dimensions). PSVI and SparseVI coreset predic-
tive 3σ ellipses are displayed in red and blue respectively, while the true posterior 3σ
ellipse is shown in black. PSVI has the ability to immediately move pseudopoints towards
the true posterior mean, while SparseVI has to add a larger number of existing points
in order to obtain a good posterior approximation. See Fig. 4.2b for the quantitative KL
comparison. (b) Optimal coreset KL divergence lower bound from Proposition 16 as a
function of dimension with δ = 0.5, and coreset size M evenly spaced from 0 to 100 in
increments of 5.
In the setting of Proposition 16, the coreset posterior $\pi_w$ is Gaussian with

$$\Sigma_w = \left(1 + \sum_{n=1}^{N} w_n\right)^{-1} I, \qquad \mu_w = \Sigma_w \sum_{n=1}^{N} w_n X_n. \qquad (4.5)$$
Corollary 17. Suppose the same setting as in Proposition 16. The optimal size-M pseudocoreset $(u^\star, w^\star)$ defined on pseudodata $u_1, \dots, u_M \in \mathbb{R}^d$ achieves $D_{\mathrm{KL}}(\pi_{u^\star, w^\star} \| \pi) = 0$, for any size $M \ge 1$ and any data dimension $d$.
This is not surprising; the mean of the data is precisely a sufficient statistic for the
data in this simple setting. However, it does illustrate that carefully-chosen pseudodata
may be able to represent the overall dataset—as “approximate sufficient statistics”—far
better than any reasonably small collection of the original data. This intuition has
been used before, e.g., for scalable Gaussian process inference (Snelson and Ghahramani,
2005; Titsias, 2009), privacy-preserving compression in linear regression (Zhou et al.,
2007), herding (Welling, 2009; Chen et al., 2010; Huszár and Duvenaud, 2012), and deep
generative models (Tomczak and Welling, 2018).
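As a quick numerical check of Corollary 17, the following script (our own illustration) places a single pseudopoint at the sample mean with weight N and recovers the exact Gaussian mean posterior of Eq. (4.5); the pseudocoreset posterior takes the same closed form with the pseudodata and weights in place of the data.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 1000, 50
    X = rng.standard_normal((N, d))    # (X_n) ~ N(0, I), as in Proposition 16

    # exact posterior: Eq. (4.5) with unit weights; the covariance is
    # isotropic, so we store only the scalar factor
    Sigma_exact = 1.0 / (1.0 + N)
    mu_exact = Sigma_exact * X.sum(axis=0)

    # single pseudopoint at the sample mean, carrying weight N
    u1, w1 = X.mean(axis=0), float(N)
    Sigma_pc = 1.0 / (1.0 + w1)
    mu_pc = Sigma_pc * w1 * u1

    assert np.isclose(Sigma_pc, Sigma_exact)
    assert np.allclose(mu_pc, mu_exact)  # KL divergence to the posterior is 0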
In this section, we extend the realm of applicability of pseudopoint compression
methods to the general class of Bayesian posterior inference problems with condition-
ally independent data, resulting in Bayesian pseudocoresets. Building on recent work
(Campbell and Beronov, 2019), we formulate pseudocoreset construction as a variational
inference problem where both the weights and pseudopoint locations are parameters of
the variational posterior approximation, and develop a stochastic algorithm to solve the
optimization.
A Bayesian pseudocoreset posterior takes the form

$$\pi_{u,w}(\theta) = \frac{1}{Z(u, w)} \exp\left(\sum_{m=1}^{M} w_m f(u_m, \theta)\right) \pi_0(\theta), \qquad (4.6)$$
where $u := (u_m)_{m=1}^M$ are $M$ pseudodata points $u_m \in \mathbb{R}^d$, $(w_m)_{m=1}^M$ are nonnegative weights,
f : Rd × Θ → R is a potential function parametrized by a pseudodata point, and Z(u, w)
is the corresponding normalization constant rendering πu,w a probability density. In the
setting of Bayesian posterior inference, um will take the same form as the data, while the
potentials are the log-likelihood functions, i.e. f (um , θ) = log π(um |θ). We construct a
coreset by minimizing the KL divergence over both the pseudodata locations and weights,

$$u^\star, w^\star = \arg\min_{u \in \mathbb{R}^{d \times M},\ w \in \mathbb{R}^M_{\ge 0}} D_{\mathrm{KL}}(\pi_{u,w} \,\|\, \pi). \qquad (4.7)$$

In contrast with Eq. (4.3), we do not need an explicit sparsity constraint; the coreset size is limited to M directly through the selection of the number of pseudodata and weights.
Denote the vectors of original data potentials f (θ) ∈ RN and synthetic pseudodata
potentials f˜(θ) ∈ RM as f (θ) := [f1 (θ) . . . fN (θ)]T and f˜(θ) := [f (u1 , θ) . . . f (uM , θ)]T
respectively, where we suppress the (θ) for brevity where clear from context. Denote
Eu,w and Covu,w to be the expectation and covariance operator for the pseudocoreset
posterior πu,w . Then we may write the KL divergence in Eq. (4.7) as
$$D_{\mathrm{KL}}(\pi_{u,w} \,\|\, \pi) = \mathrm{E}_{u,w}[\log \pi_{u,w}(\theta)] - \mathrm{E}_{u,w}[\log \pi(\theta)] = \log Z(1) - \log Z(u, w) - 1^T \mathrm{E}_{u,w}[f] + w^T \mathrm{E}_{u,w}[\tilde{f}]. \qquad (4.8)$$
As we will employ gradient descent steps as part of our algorithm to minimize the
variational objective over the parameters u, w, we need to evaluate the derivative of
the KL divergence Eq. (4.8). Despite the presence of the intractable normalization
constants and expectations, we show in Appendix A.2 that gradients can be expressed
using moments of the pseudodata and original data potential vectors. In particular, the
gradients of the KL divergence with respect to the weights w and to a single pseudodata
location um are
$$\nabla_w D_{\mathrm{KL}} = -\mathrm{Cov}_{u,w}\left[\tilde{f},\ f^T 1 - \tilde{f}^T w\right], \qquad \nabla_{u_m} D_{\mathrm{KL}} = -w_m\, \mathrm{Cov}_{u,w}\left[h(u_m),\ f^T 1 - \tilde{f}^T w\right], \qquad (4.9)$$
where h(·, θ) := ∇u f (·, θ), and the θ argument is again suppressed for brevity.
The gradients in Eq. (4.9) involve expectations of (gradient) log-likelihoods from the
model. Although there are a few particular Bayesian models where these can be evaluated
in closed-form (e.g. the synthetic experiment in Section 4.4.1; see also Appendix A.3.1),
this is not usually the case. In order to make the proposed pseudocoreset method
broadly applicable, in this section we develop a black-box stochastic optimization scheme
(Algorithm 1) for Eq. (4.7).
The pseudocoreset is initialized at a uniform random subsample of the data,

$$u_m \leftarrow x_{b_m}, \quad w_m \leftarrow N/M, \quad m = 1, \dots, M, \qquad (4.10)$$

where

$$\mathcal{B} \sim \mathrm{UnifSubset}([N], M), \quad \mathcal{B} := \{b_1, \dots, b_M\}. \qquad (4.11)$$
The stochastic gradient estimates $\hat{\nabla}_w \in \mathbb{R}^M$ and $\hat{\nabla}_{u_m} \in \mathbb{R}^d$ are based on $S \in \mathbb{N}$ samples $\theta_s \sim \pi_{u,w}$ drawn i.i.d. from the coreset approximation and a minibatch of $B \in \mathbb{N}$ datapoints from the full dataset:
$$\hat{\nabla}_w := -\frac{1}{S} \sum_{s=1}^{S} \tilde{g}_s \left(\frac{N}{B}\, g_s^T 1 - \tilde{g}_s^T w\right), \qquad (4.13)$$

$$\hat{\nabla}_{u_m} := -w_m\, \frac{1}{S} \sum_{s=1}^{S} \tilde{h}_{m,s} \left(\frac{N}{B}\, g_s^T 1 - \tilde{g}_s^T w\right), \qquad (4.14)$$

where

$$\tilde{h}_{m,s} := \nabla_u f(u_m, \theta_s) - \frac{1}{S} \sum_{s'=1}^{S} \nabla_u f(u_m, \theta_{s'}), \quad g_s := \left(f(\theta_s) - \frac{1}{S} \sum_{s'=1}^{S} f(\theta_{s'})\right)\Bigg|_{\mathcal{B}}, \quad \tilde{g}_s := \tilde{f}(\theta_s) - \frac{1}{S} \sum_{s'=1}^{S} \tilde{f}(\theta_{s'}), \quad \mathcal{B} \sim \mathrm{UnifSubset}([N], B), \qquad (4.15)$$
and (·)|B denotes restriction of a vector to only those indices in B ⊂ [N ]. Crucially, note
that this computation does not scale with N , but rather with the number of coreset
points M, the sample and minibatch sizes S and B, and the dimension d. Obtaining $\theta_s \sim \pi_{u,w}$ i.i.d. efficiently via Markov chain Monte Carlo sampling algorithms (Hoffman and
Gelman, 2014; Jacob et al., 2020) is (roughly) O(M ) per sample, because the coreset is
always of size M ; and we need not compute the entire vector gs ∈ RN per sample s, but
rather only those B ≪ N indices in the minibatch B, resulting in a cost of O(B). Aside
from that, all computations involving g̃s ∈ RM and h̃m,s ∈ Rd are at most O(M d). Each
of these computations is repeated S times over the coreset posterior samples.
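A vectorized NumPy sketch of these estimators is given below; the array layout and function name are ours, and sampling θ_s from the pseudocoreset posterior together with evaluating the potentials is assumed done upstream.

    import numpy as np

    def psvi_gradients(f_batch, f_tilde, h, w, N):
        # f_batch: (S, B) potentials f(x_b, theta_s) on the uniform minibatch
        # f_tilde: (S, M) pseudopoint potentials f(u_m, theta_s)
        # h:       (S, M, d) pseudopoint gradients grad_u f(u_m, theta_s)
        # w:       (M,) pseudocoreset weights
        S, B = f_batch.shape
        g = f_batch - f_batch.mean(axis=0)        # centred g_s restricted to B
        g_tilde = f_tilde - f_tilde.mean(axis=0)  # centred g~_s
        h_tilde = h - h.mean(axis=0)              # centred h~_{m,s}
        resid = (N / B) * g.sum(axis=1) - g_tilde @ w  # (N/B) g_s^T 1 - g~_s^T w
        grad_w = -(g_tilde * resid[:, None]).mean(axis=0)                     # (4.13)
        grad_u = -w[:, None] * (h_tilde * resid[:, None, None]).mean(axis=0)  # (4.14)
        return grad_w, grad_u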
$$P[\mathcal{M}(x) \in A] \le e^{\varepsilon} P[\mathcal{M}(x') \in A] + \delta.$$
As in Section 2.6, we consider two datasets x, x′ as adjacent (denoted x ≈ x′ ) if their
Hamming distance equals 1, i.e. x′ can be obtained from x by adding or removing an
element. ε controls the effect that removal or addition of an element can have on the
output distribution of M, while δ captures the failure probability, and is preferably
o(1/N ).
In this section, we develop a differentially private version of pseudocoreset construction.
Beyond modifying our initialization scheme, private pseudocoreset construction comes as a natural extension of Algorithm 1, replacing the gradient computations that involve points of the true dataset with their differentially private counterparts.
4.3.3.2 Optimization
Examining lines 4–19 of Algorithm 1, the only steps that involve handling the original
data occur at lines 8, 12, and 14, when we use the minibatch subsample to compute
log-likelihoods and gradients. Due to the post-processing property of differential pri-
vacy (Dwork and Roth, 2014), all of the other computations in Algorithm 1 (e.g. sampling
from the pseudocoreset posterior, computing pseudopoint log-likelihoods, etc.) incur no
privacy cost. Therefore, we only need to control the influence of private data entering the gradient computation through the vector of $(g_s^T 1)_{s=1}^S$ terms.
To accomplish this, we apply the subsampled Gaussian mechanism repeatedly,
since this also allows us to use a moments accountant technique to keep tight estimates
of privacy parameters (Abadi et al., 2016; Wang et al., 2019). As in the nonprivate
scheme, in each optimization step we uniformly subsample a minibatch B = {x1 , . . . , xB }
of private datapoints. We then replace the gsT 1 term in lines 12 and 14 with a randomized
privatization:
replace $(g_s^T 1)_{s=1}^S$ with

$$\sum_{i=1}^{B} \frac{G_i}{\max\left(1, \frac{\|G_i\|_2}{C}\right)} + Z, \qquad Z \sim \mathcal{N}(0, \sigma^2 C^2 I), \qquad (4.16)$$

where $G_i := \left(f(x_i, \theta_s) - \frac{1}{S} \sum_{s'=1}^{S} f(x_i, \theta_{s'})\right)_{s=1}^{S} \in \mathbb{R}^S$ for all $x_i \in \mathcal{B}$, and $C, \sigma > 0$ are parameters controlling the amount of privacy. This modification to Algorithm 1 has been
shown in past work to obtain the privacy guarantee provided in Corollary 19; crucially,
the privacy cost of our construction is independent of the pseudocoreset size.
Corollary 19 (Abadi et al. (2016)). There exist constants $c_1, c_2$ such that Algorithm 1 modified per Eq. (4.16) is $(\varepsilon, \delta)$-differentially private for any $\varepsilon < c_1 q^2 T$, $\delta > 0$, and $\sigma \ge c_2\, q \sqrt{T \log(1/\delta)} / \varepsilon$, where $q := B/N$ is the fraction of data in a minibatch and $T$ is the number of optimization steps.
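A sketch of the privatized replacement of Eq. (4.16) is shown below; computing the centred per-datapoint vectors G_i is assumed done upstream, and the names are ours.

    import numpy as np

    def privatized_g_sum(G, C, sigma, rng):
        # G: (B, S) array whose i-th row is G_i in R^S for private point x_i
        norms = np.linalg.norm(G, axis=1)
        clipped = G / np.maximum(1.0, norms / C)[:, None]    # row-wise norm clip
        noise = rng.normal(0.0, sigma * C, size=G.shape[1])  # Z ~ N(0, s^2 C^2 I)
        return clipped.sum(axis=0) + noise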
Figure 4.2: (a) Gaussian mean inference, d = 200. (b) Gaussian mean inference, d = 500. (c) Bayesian linear regression, d = 100.
Using the exact posterior, we derive the exact moments used in the gradient formulae
from Eq. (4.9) in closed form (see Appendix A.3.1),
$$\mathrm{Cov}_{u,w}[f_n, f_m] = v_n^T \Psi v_m + \tfrac{1}{2} \mathrm{tr}\, \Psi^T \Psi, \qquad \mathrm{Cov}_{u,w}[\tilde{f}_n, f_m] = \tilde{v}_n^T \Psi v_m + \tfrac{1}{2} \mathrm{tr}\, \Psi^T \Psi,$$
$$\mathrm{Cov}_{u,w}[h(u_i), f_n] = Q^{-T} \Psi v_n, \qquad \mathrm{Cov}_{u,w}[h(u_i), \tilde{f}_n] = Q^{-T} \Psi \tilde{v}_n, \qquad (4.18)$$
where Q is the lower triangular matrix of the Cholesky decomposition of Σ (i.e. Σ = QQT ),
Ψ := Q−1 Σu,w Q−T , vn := Q−1 (xn − µu,w ), and ṽm := Q−1 (um − µu,w ). We vary the
pseudocoreset size from M = 1 to 200, and set the total number of iterations to
T = 500. We use learning rates γt (M ) = α(M )t−1 , where α(M ) = 1 for SparseVI and
α(M ) = max(1.1 − 0.005M, 0.2) for PSVI. As verified in Figs. 4.2a and 4.2b, Hilbert
coresets provide poor quality summarizations in the high-dimensional regime, even for
large coreset sizes. Despite showing faster decrease of approximation error for a larger
range of coreset sizes, SparseVI is also fundamentally limited by the use of the original
datapoints, per Proposition 16. Furthermore, we observe that the quality of all previous
coreset methods when d = 500 is significantly lower compared to d = 200. On the other
hand, the KL divergence for PSVI decreases significantly more quickly, giving a near
perfect approximation for the true posterior with a single pseudodata point regardless
of data dimension. As shown earlier in Fig. 4.1a, PSVI has the capacity to move the
pseudodata points in order to capture the true posterior very efficiently.
pseudodata point values $u_m$, and use noise level σ = 5. Our choice of hyperparameters implies privacy parameters ε = 0.2 and δ = 1/N for each of the datasets. We initialise each pseudocoreset of size M by sampling $(x_m)_{m=1}^M \sim \mathcal{N}(0, I)$ i.i.d., and sampling $\theta, (y_m)_{m=1}^M$ from the statistical model.
Results presented in Fig. 4.3 demonstrate that PSVI achieves consistently the smallest
posterior approximation error in the small coreset size regime, offering improvement
compared to SparseVI and being competitive with GIGA (Optimal), without the
requirement for specifying a weighting function. In Fig. 4.3a, for M ≥ d GIGA (Op-
timal) follows a much steeper decrease in KL divergence, reflecting the dependence
of its approximation quality on dataset dimension per Proposition 16. In contrast,
Figure 4.4: Approximate posterior quality over decreasing differential privacy guarantees
for private pseudocoresets of varying size (DP-PSVI) plotted against private variational
inference (DP-VI, Jälkö et al. (2017)). δ is always kept fixed at 1/N. Markers on the right end of each plot display the error bars of the approximation achieved by the corresponding nonprivate posteriors. Results are displayed over 5 trials for each construction.
PSVI typically reaches its minimum at M < d. The difference in approximation quality
becomes clearer in higher dimensions (e.g. Music, where d = 237). Perhaps surprisingly,
the private pseudocoreset construction has only marginally worse approximation quality
compared to nonprivate PSVI and generally achieves better performance in comparison
to the other state-of-the-art nonprivate coreset constructions.
In Fig. 4.4 we present the achieved posterior approximation quality via DP-PSVI,
against a competitive state-of-the-art method for general-purpose private inference (DP-
VI, Jälkö et al., 2017). The plots display the behaviour of methods over a wide range
of ε values, achieved using varying levels of privatization noise, and δ always set to
1/N . For logistic regression, DP-VI infers an approximate posterior from the family
of Gaussians with diagonal covariance via ADVI (Kucukelbir et al., 2017), followed by
an additional Laplace approximation. Note that by design, DP-VI is constrained by
the usual Gaussian variational approximation, while DP-PSVI is more flexible and
can approach the true posterior as M increases; this effect is reflected in the nonprivate posteriors as well, as data dimensionality grows (see for example Fig. 4.4c). Indeed, we
verify that in the high-privacy regime DP-PSVI for sufficient pseudocoreset size (which is
typically small for tested real-world datasets) offers posterior approximation with better
KL divergence compared to DP-VI. Our findings indicate that private PSVI offers
efficient releases of big data via informative pseudopoints, which enable arbitrary post
processing (e.g. running any nonprivate black-box algorithm for Bayesian inference),
under strong privacy guarantees and without reducing the quality of inference.
Our learnable pseudodata are also generally not as interpretable as the points of
previous coreset methods, as they are not real data. Moreover, the level of interpretability is model-specific. This creates a risk of misinterpretation of pseudocoreset points in
practice. On the other hand, our optimization framework does allow the introduction of
interpretability constraints (e.g. pseudodata sparsity) to explicitly capture interpretability
requirements.
Pseudocoreset-based summarization is susceptible to reproducing potential biases and
unfairness existing in the original dataset. Majority-group datapoints in the full dataset
which capture information relevant to the statistical task of interest are expected to
remain over-represented in the learned summary; while minority-group datapoints might
be eliminated, if their distinguishing features are not related to inference. Amending
the initialization step to contain such datapoints, or using a prior that strongly favors a
debiased version of the dataset, could both mitigate these concerns; but more study is
warranted.
Chapter 5
β-Cores: Robust Large-Scale Bayesian Data Summarization in the Presence of Outliers
We evaluate our construction on a range of statistical models, including Gaussian mean inference, logistic and neural linear regression, demonstrating
its superiority to existing Bayesian summarization methods in the presence of outliers.
5.2 Method
In this section we present β-Cores, our unified solution to the robustness and scalability
challenges of large-scale Bayesian inference. Section 5.2.1 introduces the main quantity
of interest in our inference method, and shows how it addresses the aforementioned issues. Sec-
tion 5.2.2 presents an iterative algorithm that allows efficient approximate computations
of our posterior.
The central object of our method is the weighted β-posterior

$$\pi_{\beta,w}(\theta|x) = \frac{1}{Z(\beta, w)} \exp\left(\sum_{n=1}^{N} w_n f_n(\theta)\right) \pi_0(\theta). \qquad (5.1)$$
Here we aim to approximate the full-data β-posterior via a sparse β-posterior, which can be expressed as the solution of

$$w^\star = \arg\min_{w \in \mathbb{R}^N} D_{\mathrm{KL}}(\pi_{\beta,w} \,\|\, \pi_\beta) \quad \text{s.t.} \quad w \ge 0, \ \|w\|_0 \le M. \qquad (5.2)$$
In the following we denote expectations and covariances under θ ∼ πβ,w (θ|x) as Eβ,w and
Covβ,w respectively. Then the KL divergence is written as
" #
πβ,w
DKL (πβ,w ||πβ ) := Eβ,w log . (5.3)
πβ
In our formulation it is easy to observe that posteriors of Eq. (5.1) form a set of exponential
family distributions (Wainwright and Jordan, 2008), with natural parameters $w \in \mathbb{R}^N_{\ge 0}$, sufficient statistics $(f_n(\theta))_{n=1}^N$, and log-partition function $\log Z(\beta, w)$. Following Campbell
and Beronov (2019), the objective can be expanded as
$$D_{\mathrm{KL}}(\pi_{\beta,w} \,\|\, \pi_\beta) = \log Z(\beta) - \log Z(\beta, w) - \sum_{n=1}^{N} \mathrm{E}_{\beta,w}\left[f_n(\theta) - w_n f_n(\theta)\right], \qquad (5.4)$$
and minimized via gradient descent on w. The gradient of the objective of Eq. (5.4) can
be derived in closed form, as
$$\nabla_w D_{\mathrm{KL}}(\pi_{\beta,w} \,\|\, \pi_\beta) = -\mathrm{Cov}_{\beta,w}\left[f,\ (1 - w)^T f\right], \qquad (5.5)$$
where S is the number of samples from the coreset posterior, and T is the total number of iterations of the coreset weights optimization. The full incremental construction is outlined in Algorithm 2.
The optimization problem of Eq. (5.2) is intractable due to the cardinality constraint;
hence, our incremental scheme takes the approach of approximating the solution to the
original problem via solving a sequence of interleaved combinatorial and continuous
optimization problems as follows:
For i ∈ {1, . . . , M}:

Next datapoint selection (combinatorial optimization):

$$m^\star = \arg\min_{m \in [N]} D_{\mathrm{KL}}\left(\pi_{\beta, w \leftarrow w \cup \{x_m\}} \,\|\, \pi_\beta\right) \qquad (5.6)$$

Coreset weights update (continuous optimization):

$$w = \arg\min_{w \ge 0} D_{\mathrm{KL}}(\pi_{\beta,w} \,\|\, \pi_\beta) \quad \text{s.t.} \quad \mathrm{supp}(w) \subseteq \mathcal{I} \cup \{m^\star\}. \qquad (5.7)$$
In Eq. (5.6) we have introduced the notation πβ,w←w∪{xm } to consider the coreset
expansion that assigns potentially non-zero weight to a datapoint xm .
We first select the next datapoint to include in our coreset summary Eq. (5.6), via a
greedy selection criterion. Although maximizing the local decrease in KL via Eq. (5.5) seems to be the natural greedy choice here, this would incur the impractical cost of
resampling from the coreset posterior for all potential expansions of the coreset with a
new datapoint. Moreover, even if we can tolerate this cost, adding a single unweighted
datapoint is likely to induce a negligible effect on the coreset posterior, especially in
massive dataset settings. Submodularity of the objective would be a clearly attractive
property, as it could possibly point to a cheap greedy strategy with provable suboptimality
guarantees—however, our analysis in Appendix B.2 demonstrates that this property is
generally not satisfied for our problem.
Hence, considering that the weight of the active support for the updated coreset
will be optimized in the subsequent step Eq. (5.7) of the algorithm, an efficient method
for informative datapoint selection can be based on adding a datapoint that correlates
well with the direction of residual error. Thus we finally rely instead on the following
correlation maximization criterion:
$$x_m = \arg\max_{x_n \in \mathcal{I} \cup \mathcal{B}} \begin{cases} \left|\mathrm{Corr}_{\beta,w}\left[f_n,\ \frac{N}{B} 1^T f - w^T f\right]\right|, & w_n > 0\\[4pt] \mathrm{Corr}_{\beta,w}\left[f_n,\ \frac{N}{B} 1^T f - w^T f\right], & w_n = 0, \end{cases} \qquad (5.8)$$
where we denoted by I the set of coreset points. Eq. (5.8) additionally allows us to expand
the information-geometric interpretation of Riemannian coresets presented in Campbell
and Beronov (2019) in our construction. This criterion is invariant to scaling each
potential $f_n$ by any positive constant, and selects the potential that has the largest correlation with the direction of the residual error.
After adding a new datapoint to the summary, we optimize Eq. (5.7), updating the
coreset weight vector $w \in \mathbb{R}^N_{\ge 0}$ via T steps of projected stochastic gradient descent, for
which we use the Monte Carlo estimate of Eq. (5.5) per line 17 of Algorithm 2.
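The two interleaved steps admit a compact sketch (ours; the correlations are estimated from S coreset-posterior samples, and the case split follows Eq. (5.8)):

    import numpy as np

    def select_next(F_cand, F_batch, w, N, active):
        # F_cand:  (S, n) potentials f_n(theta_s) for candidates in I u B
        # F_batch: (S, B) potentials over the uniform minibatch
        # w: (n,) current weights; active: boolean mask where w_n > 0
        S, B = F_batch.shape
        resid = (N / B) * F_batch.sum(axis=1) - F_cand @ w  # residual direction
        Fc = F_cand - F_cand.mean(axis=0)
        rc = resid - resid.mean()
        corr = (Fc * rc[:, None]).mean(axis=0) / (Fc.std(axis=0) * rc.std() + 1e-12)
        score = np.where(active, np.abs(corr), corr)
        return int(np.argmax(score))

    def weights_step(w, grad, lr):
        # one projected stochastic gradient step for Eq. (5.7)
        return np.maximum(w - lr * grad, 0.0)  # project back onto w >= 0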
(ii) PSVI (Manousakas et al., 2020), the method introduced in Chapter 4, which
runs a batch optimization on a set of pseudopoints, and uses standard likelihood
evaluations to jointly learn the pseudopoints’ weights and locations, so that the
extracted summary resembles the statistics of the full dataset.
We default the number of iterations in the optimization loop over gradient-based
coreset constructions to T = 500, using a learning rate γt ∝ t−1 and S = 100 random
projections per gradient computation. From Section 5.3.1 to Section 5.3.4, the values
for β are selected via cross-validation on a held-out dataset. For consistency with the
compared baselines, we evaluate inference results obtained by β-Cores using the classical
Bayesian posterior from Eq. (2.17) conditioned on the corresponding robustified data
summary. Additional details on the benchmark datasets used are presented in Appendix B.3.
$$\theta \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad x_n \overset{i.i.d.}{\sim} \mathcal{N}(\theta, \Sigma), \quad n = 1, \dots, N. \qquad (5.9)$$
Figure 5.1: (a) Scatterplot of the observed datapoints projected on two random axes,
overlaid by the corresponding coreset points and predictive posterior 3σ ellipses for
increasing coreset size (from left to right). Exact posterior (illustrated in black) is
computed on the dataset after removing the group of outliers. From top to bottom, the
level of structured contamination increases. Classic Riemannian coresets are prone to
model misspecification, adding points from the outlying component, while β-Cores adds
points only from the uncontaminated subpopulation yielding better posterior estimation.
(b) Reverse KL divergence between coreset and true posterior (the latter computed on
clean data), averaged over 5 trials. Solid lines display the median KL divergence, with
shaded areas showing 25th and 75th percentiles of KL divergence.
indeed, Fig. 5.1b shows that the achieved KL divergence from the exact posterior is of the same order of magnitude regardless of the failure probability.
We can however notice that, for coreset sizes growing beyond 60 points—despite
remaining consistently better compared to the baselines—β-Cores starts to present
some instability over trials in contaminated dataset instances. This effect is attributed
to the small value of the β hyperparameter selected for the demonstration (so that this
value can successfully model the case of clean data). As a result, eventually some outliers
might be allowed to enter the summary for large coreset sizes. The instability can be
resolved by increasing β according to the observations’ failure probability, and will be
further discussed in Section 5.3.5.
The closed form of β-likelihood terms required in our construction is computed in Ap-
pendix B.1.2.
Data corruption is simulated by generating unstructured outliers in the input and
output space similarly to (Futami et al., 2018): For corruption rate F , we sample two
random subsets of size F · N from the training data. For the datapoints in the first
subset, we replace the value of half of the features with Gaussian noise sampled i.i.d.
from N(0, 5); for the datapoints in the other subset, we flip the binary label. During construction we use the Laplace approximation to efficiently draw samples from the (non-conjugate) coreset posterior, while at evaluation time coreset posterior samples are obtained via NUTS (Hoffman and Gelman, 2014). We evaluate the accuracy over the
test set, predicting labels according to the maximum log-likelihood rule for θs sampled
from the coreset posterior distribution. The learning rate schedule was set to γt = c0 t−1 ,
with c0 set to 1 for SparseVI and β-Cores, and 0.1 for PSVI. The values for β and the learning rates γt were chosen via cross-validation.
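The corruption procedure admits a short sketch (one plausible reading of the description above: we take N(0, 5) to have variance 5, and corrupt a fixed random half of the feature columns for the selected datapoints):

    import numpy as np

    def corrupt(X, y, F, seed=0):
        rng = np.random.default_rng(seed)
        X, y = X.copy(), y.copy()
        N, d = X.shape
        k = int(F * N)
        rows = rng.choice(N, size=k, replace=False)       # feature-noise subset
        cols = rng.choice(d, size=d // 2, replace=False)  # half of the features
        X[np.ix_(rows, cols)] = rng.normal(0.0, np.sqrt(5.0), size=(k, len(cols)))
        flips = rng.choice(N, size=k, replace=False)      # label-flip subset
        y[flips] = -y[flips]
        return X, y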
Figure 5.2: Predictive accuracy vs coreset size for logistic regression experiments over
10 trials on 3 large-scale datasets. Solid lines display the median accuracy, with shaded
areas showing 25th and 75th percentiles. Dataset corruption rate F , and β value used in
β-Cores for each experiment are shown on the figures. The bottom row plots illustrate
the achieved predictive performance under no contamination.
Fig. 5.2 illustrates that β-Cores shows competitive performance with the classical
Riemannian coresets in the absence of data contamination (bottom row), while it con-
sistently achieves the best predictive accuracy in corrupted datasets (top row). On the
other hand, ordinary summarization techniques, although overall outperforming random
sampling for small coreset sizes, soon attain degraded predictive performance on poisoned
data: by construction, via increasing coreset size, Riemannian coresets are expected to
converge to the Bayesian posterior computed on the corrupted dataset. All baselines
present noticeable degradation in their predictive accuracy when corruption is introduced
(typically more than 5%), which is not the case for our method: β-Cores is designed
to support corrupted input and, for a well-tuned hyperparameter β, maintains similar
performance in the presence of outliers; in practice it can even achieve an improvement (as occurs for the WebSpam data).
Figure 5.3: Test RMSE vs coreset size for neural linear regression experiments averaged
over 30 trials. Solid lines display the median RMSE, with shaded areas showing 25th
and 75th percentiles. Dataset corruption rate F , and β value used in β-Cores for each
experiment are shown on the figures. The bottom row plots illustrate the achieved
predictive performance under no contamination.
$$y_n = \theta^T z(x_n) + \epsilon_n, \qquad \epsilon_n \overset{i.i.d.}{\sim} \mathcal{N}(0, \sigma^2), \quad n = 1, \dots, N. \qquad (5.11)$$
The neural network is trained to learn an adaptive basis z(·) from N datapoint pairs
$(x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$, which we then use to regress $(y_n)_{n=1}^N$ on $(z(x_n))_{n=1}^N$, and yield uncertainty-
aware estimates of θ. More details on the model-specific formulae entering coresets
construction are provided in Appendix B.1.3. Input and output related outliers are
simulated as in Section 5.3.2, while here, for the output related outliers, yn gets replaced
by Gaussian noise. Corruption occurs over a percentage F % of the total number of
minibatches of the dataset, while the remaining minibatches are left uncontaminated.
Each poisoned minibatch gets 70% of its points substituted by outliers.
Figure 5.4: Predictive accuracy against number of groups (left) and number of data-
points (right) selected for inference. Compared group selection schemes are β-Cores, selection according to Shapley-value-based ranking, and random selection. The experiment is repeated over 5 trials, on a contaminated dataset containing 10% crafted outliers distributed non-uniformly across groups (top row), and on a clean dataset (bottom
row).
knowledge regarding which group combinations are most beneficial in summarizing the
entire population of interest (Pinsler et al., 2019; Vahidian et al., 2020), and hence should
be prioritised during data collection.
In this study we use a subset of more than 60K datapoints from the HospitalRead-
missions dataset (for further details see Appendix B.3). Using combinations of age, race
and gender information of data contributors, we form a total of 165 subpopulations within
the training dataset. Data contamination is simulated identically to the experiment of
Section 5.3.2, while now we also consider the case of varying levels of contamination
across the subpopulations. In particular, we form groups of roughly equal size where
0%, 10% and 20% of the datapoints get replaced by outliers; this yields a dataset in which approximately 10% of all datapoints are outliers.
Figure 5.5: Attributes of selected groups after running 10 iterations of β-Cores with
β = 0.6 on the contaminated HospitalReadmissions dataset (repeated over 5 random
trials).
We evaluate the predictive accuracy achieved by doing inference on the data subset
obtained after running 10 iterations of the β-Cores extension for groups (which gives
a maximum of 10 selected groups). We compare against (i) a random sampler, and
(ii) a baseline which ranks all groups according to their Shapley value and selects the
groups with the highest ranks. Shapley value is a concept originating in cooperative
game theory (Shapley, 1953), which has recently found applications in data valuation
and outlier detection (Ghorbani and Zou, 2019). In the context of our experiment, it quantifies the marginal contribution of each group to the predictive accuracy of the model over all possible group coalitions that can be formed. As this quantity is notoriously expensive to compute on large datasets, we use a Monte Carlo estimator
which samples 5K possible permutations of groups, and for each permutation it computes
marginals for coalitions formed by the first 20 groups.1
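A schematic of this truncated Monte Carlo estimator follows; the accuracy oracle is a placeholder for retraining and evaluating the model on a coalition of groups.

    import numpy as np

    def mc_shapley(groups, accuracy, n_perms=5000, truncate=20, seed=0):
        # groups: list of group identifiers
        # accuracy: callable mapping a set of groups to predictive accuracy
        rng = np.random.default_rng(seed)
        shapley = {g: 0.0 for g in groups}
        counts = {g: 0 for g in groups}
        for _ in range(n_perms):
            perm = rng.permutation(len(groups))
            coalition, prev_acc = set(), accuracy(set())
            for pos in perm[:truncate]:  # truncation: diminishing marginals
                g = groups[pos]
                coalition.add(g)
                acc = accuracy(coalition)
                shapley[g] += acc - prev_acc
                counts[g] += 1
                prev_acc = acc
        return {g: shapley[g] / max(counts[g], 1) for g in groups}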
As illustrated in Fig. 5.4, β-Cores with β = 0.6 offers the best solution to our
problem, and is able to reach predictive accuracy exceeding 75% by fitting a coreset on
no more than 2 groups. Fig. 5.5 displays the demographic information of selected groups.
We can notice that subpopulations of female and older patients are more informative for
the classification task, while Caucasian and African-American groups are preferred to
smaller racial minorities. Importantly, β-Cores is able to separate clean from contaminated groups. For the β value used, we can see that over the set of trials only one group
1. The latter truncation is supported by the observation that marginal contributions to the predictive accuracy diminish as the dataset size increases.
with an outlier level of 10% is allowed to enter a summary, which already contains 3
uncontaminated groups.
Shapley-value-based ranking treats outliers better than random sampling: as outliers are expected to have a negative marginal contribution to predictive accuracy, their Shapley rank is generally lower compared to clean data groups, hence the latter are favoured. On the other hand, Shapley computation is much slower than random sampling and β-Cores, and is specific to the evaluation metric of interest; moreover, Shapley values are not designed to find data-efficient combinations of groups, hence this baseline can still retain redundancy in the selected data subset.
Figure 5.6: Predictive performance of β-Cores for varying values of the robustness
hyperparameter β. At each experiment, results are averaged over 5 trials. Solid lines
display the median of the predictive metric, with shaded areas showing the corresponding
25th and 75th percentiles.
Further directions include developing more methods for adaptive tuning of the
robustness hyperparameter β, as well as applying our techniques to more complicated
statistical models, including ones with structured likelihood functions (e.g. time-series
and temporal point processes). Moreover, future experimentation may consider stronger
adversarial settings where summaries are initialized to data subsets that already contain
outliers.
Chapter 6
Conclusions
In this thesis, we have presented three original pieces of work drawing on one of the
fundamental research problems in large-scale machine learning: finding scalable dataset
reductions under constraints commonly arising in real-world data analysis applications.
Our premise has been that principled dataset summarization methods can be harnessed
to enable efficient approximations for the purposes of large-scale data analysis without
compromising requirements of privacy and robustness. In this section, we briefly recap
our key contributions and suggest directions for future research.
6.1 Summary
Our variational formulations for coreset construction, Eqs. (4.7) and (5.2), rely on the assumption that the data likelihood function factorises into a product of individual
datapoint potentials. To the best of our knowledge, the idea of constructing coresets has
not yet been used for inference in models with structured likelihood functions, including
time-series and point processes. Recent results on parameter estimation for Hawkes pro-
cesses using uniform downsampling (Li and Ke, 2019) indicate important improvements
in efficiency when learning in massive temporal event sequences via reducing data, even
without explicitly optimizing for redundancy in the extracted data subsets.
and

$$\Sigma_w = \frac{I_d}{1 + \sum_{n=1}^{N} w_n}, \qquad \mu_w = \Sigma_w \sum_{n=1}^{N} w_n X_n. \qquad (A.2)$$
Thus,
$$D_{\mathrm{KL}}(\pi_w \,\|\, \pi) \ge \frac{1}{2} (\mu_1 - \mu_w)^T \Sigma_1^{-1} (\mu_1 - \mu_w). \qquad (A.4)$$
Suppose we pick a set I ⊆ [N ], |I| = M of active indices n where the optimal wn ≥ 0,
and enforce that all others $n \notin \mathcal{I}$ satisfy $w_n = 0$. Then denoting

$$Y = [X_n : n \notin \mathcal{I}] \in \mathbb{R}^{d \times (N-M)}, \qquad X = [X_n : n \in \mathcal{I}] \in \mathbb{R}^{d \times M}, \qquad (A.5)$$
$$D_{\mathrm{KL}}(\pi_w \,\|\, \pi) \ge \frac{1}{2(N+1)} 1^T Y^T Y 1 + 1^T Y^T X \left(\frac{1}{N+1} 1 - \frac{w}{1 + 1^T w}\right) + \frac{N+1}{2} \left(\frac{1}{N+1} 1 - \frac{w}{1 + 1^T w}\right)^T X^T X \left(\frac{1}{N+1} 1 - \frac{w}{1 + 1^T w}\right). \qquad (A.6)$$
The numerator is the squared norm of Y 1 minus its projection onto the subspace spanned
by the M columns of X. Since Y 1 ∼ N (0, (N − M )I), Y 1 ∈ Rd is an isotropic Gaussian,
then its projection into the orthogonal complement of any fixed subspace of dimension
M is also an isotropic Gaussian of dimension d − M with the same variance. Since
the columns of X are also independent and isotropic, its column subspace is uniformly
distributed. Therefore, for each possible choice of I,
$$D_{\mathrm{KL}}\left(\pi_{w_{\mathcal{I}}^\star} \,\|\, \pi\right) \ge \frac{N - M}{2(N+1)}\, Z_{\mathcal{I}}, \qquad Z_{\mathcal{I}} \sim \chi^2(d - M). \qquad (A.8)$$
Note that the $Z_{\mathcal{I}}$ will have dependence across the $\binom{N}{M}$ different choices of index subset $\mathcal{I}$. Thus, the probability that all $Z_{\mathcal{I}}$ are large is

$$P\left(\min_{\mathcal{I} \subseteq [N],\, |\mathcal{I}| = M} Z_{\mathcal{I}} > \epsilon\right) \ge 1 - \binom{N}{M} P(Z_{\mathcal{I}} \le \epsilon) = 1 - \binom{N}{M} F_{d-M}(\epsilon), \qquad (A.9)$$
where Fk is the CDF for the χ2 distribution with k degrees of freedom. The result
follows.
Throughout, expectations and covariances over the random parameter θ with no explicit
subscripts are taken under pseudocoreset posterior πu,w . We also interchange differen-
tiation and integration without explicitly verifying that sufficient conditions to do so
hold.
Combining, we have

$$\nabla_w \mathrm{E}[a(\theta)] = \mathrm{E}\left[\left(\tilde{f}(\theta) - \mathrm{E}[\tilde{f}(\theta)]\right) a(\theta)\right]. \qquad (A.13)$$

Subtracting $0 = \mathrm{E}[a(\theta)]\, \mathrm{E}\left[\tilde{f}(\theta) - \mathrm{E}[\tilde{f}(\theta)]\right]$ yields

$$\nabla_w \mathrm{E}[a(\theta)] = \mathrm{Cov}\left[\tilde{f}(\theta), a(\theta)\right]. \qquad (A.14)$$
The gradient with respect to w in Eq. (4.9) follows by substituting 1T f (θ) and wT f˜(θ)
for a(θ) and using the product rule.
$$\nabla_{u_i} D_{\mathrm{KL}} = -\nabla_{u_i} \log Z(u, w) - \nabla_{u_i} \mathrm{E}[f(\theta)^T 1] + \nabla_{u_i} \mathrm{E}[\tilde{f}(\theta)^T w]. \qquad (A.15)$$
Using the product rule and recalling from the main text that h(·, θ) := ∇u f (·, θ),
$$\nabla_{u_i} \mathrm{E}[a(u, \theta)] = \mathrm{E}[\nabla_{u_i} a(u, \theta)] + \mathrm{E}\left[a(u, \theta)\left(w_i h(u_i, \theta) - \nabla_{u_i} \log Z(u, w)\right)\right]. \qquad (A.17)$$

Taking the gradient of the log normalization constant using similar techniques, and combining,

$$\nabla_{u_i} \mathrm{E}[a(u, \theta)] = \mathrm{E}[\nabla_{u_i} a(u, \theta)] + w_i \mathrm{E}\left[a(u, \theta)\left(h(u_i, \theta) - \mathrm{E}[h(u_i, \theta)]\right)\right] \qquad (A.19)$$

$$\phantom{\nabla_{u_i} \mathrm{E}[a(u, \theta)]} = \mathrm{E}[\nabla_{u_i} a(u, \theta)] + w_i \mathrm{Cov}[a(u, \theta), h(u_i, \theta)]. \qquad (A.20)$$
The gradient with respect to ui in Eq. (4.9) follows by substituting f (θ)T 1 and f˜(θ)T w
for a(u, θ).
$$\mathrm{Cov}[\tilde{f}_n, f_m] = \tilde{v}_n^T \Psi v_m + \frac{1}{2} \mathrm{tr}\, \Psi^T \Psi. \qquad (A.22)$$
We now evaluate the remaining covariance Cov[h(ui ), fm ]; the derivation of Cov[h(ui ), f˜m ]
follows similarly. We begin by explicitly evaluating the log-likelihood gradient and its
expectation,
and likewise,
$$\Sigma_{u,w} = \left(\sigma_0^{-2} I + \sigma^{-2} \sum_{m=1}^{M} w_m u_m u_m^T\right)^{-1}, \qquad \mu_{u,w} = \Sigma_{u,w} \left(\sigma_0^{-2} I \mu_0 + \sigma^{-2} \sum_{m=1}^{M} w_m y_m u_m\right). \qquad (A.35)$$
Here we present some more plots demonstrating the dependence of Hilbert coresets' approximation quality on the dimension of random projections in the Bayesian linear regression setting presented in Fig. 4.2c. We recall that the projection dimension used in this experiment, and throughout the entire experiments section, was set to 100. Higher projection dimensions are typically expensive to obtain in practice. As demonstrated in Fig. A.1, a higher projection dimension enables better posterior approximation in this problem for
both GIGA (Optimal) and GIGA (Realistic). However, PSVI remains competitive
in the small coreset regime, even for Hilbert coresets with extremely large projection
dimensionality, demonstrating the information-geometric limitations that Hilbert coreset
constructions are known to face (Campbell and Beronov, 2019).
Figure A.1: Hilbert coreset approximation quality for projection dimensions 200, 2,000, and 10,000.
The aim of inference is to compute the posterior over the latent parameter $\theta = [\theta_0 \dots \theta_d]^T \in \mathbb{R}^{d+1}$. The log-likelihood of each datapoint can be expressed as

$$f_n := f(x_n, y_n | \theta) = \mathbb{1}[y_n = -1] \log\left(1 - \frac{1}{1 + e^{-z_n^T \theta}}\right) - \mathbb{1}[y_n = 1] \log\left(1 + e^{-z_n^T \theta}\right) = -\log\left(1 + \exp(-y_n z_n^T \theta)\right). \qquad (A.37)$$
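For reference, a numerically stable evaluation of these potentials (our own helper, with z_n denoting the augmented feature vector of the model above):

    import numpy as np

    def log_likelihood(Z, y, theta):
        # Z: (N, d+1) augmented feature vectors z_n; y: (N,) labels in {-1, +1}
        margins = y * (Z @ theta)
        # f_n = -log(1 + exp(-y_n z_n^T theta)); logaddexp avoids overflow
        return -np.logaddexp(0.0, -margins)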
For the logistic regression experiments, we used subsampled and full versions of the datasets presented in Table A.1, including a synthetic dataset with x ∈ R² sampled i.i.d. from N(0, I) and y ∈ {−1, 1} sampled from the respective logistic likelihood with θ = [3, 3, 0]^T (Synthetic).
Table A.1: Datasets used in the logistic regression experiments.

Dataset name      N        D
Synthetic         500      2
Phishing          500      10
ChemReact         500      10
Transactions      100,000  50
ChemReact100      26,733   100
Music             8,419    237
In the small-scale experiment, the number of overall gradient updates was set to T = 1, 500,
while minibatch size was set to B = 400. Learning rate schedule for SparseVI and
PSVI was γt = 0.1t−1 . Results presented in Fig. A.2 indicate that PSVI achieves superior
quality to SparseVI for small coreset sizes, and is competitive to GIGA (Optimal),
while the latter unrealistically uses true posterior samples to tune a weighting function required during construction.
Coreset posterior samples were drawn from a Laplace approximation fitted on current coreset weights and points. To optimize initial
learning rates for SparseVI and PSVI, we did a grid search over {0.1, 1, 10}.
In the differential privacy experiment, we were not concerned with the extra privacy cost of the hyperparameter optimization task. Estimation of the differential privacy cost in all experiments was based on the TensorFlow Privacy implementation of the moments accountant
for the subsampled Gaussian mechanism.1 For DP-PSVI we used the best learning
hyperparameters found for PSVI on the corresponding dataset. The demonstrated range
of privacy budgets was generated by decreasing the variance σ of additive Gaussian
noise and keeping the rest of hyperparameters involved in privacy accounting fixed.
Regarding DP-VI, over our experiments we also kept the subsampling ratio fixed. We
based our implementation of DP-VI on authors’ code,2 adapting noise calibration
according to the adjacency relation used in Section 4.3.3, and the standard differential
privacy definition (Dwork and Roth, 2014). In our experiment, we used the AdaGrad
optimizer, with learning rate 0.01, 2,000 iterations, and minibatch size 200.
Gradient clipping values for DP-VI results presented in Fig. 4.4, for Transactions,
ChemReact100, and Music datasets were tuned via grid search over {1, 5, 10, 50}.
The values of gradient clipping constant giving best privacy profiles for each dataset,
used in Fig. 4.4, were 10, 5, and 5 respectively.
1. https://github.com/tensorflow/privacy
2. https://github.com/DPBayes/DPVI-code
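For illustration, the accountant can be queried as in the sketch below. The helper name follows the library's public API, but the exact signature may differ between tensorflow_privacy versions, and the numbers are placeholders rather than the exact experimental values.

    import tensorflow_privacy as tfp

    N, B, T = 100_000, 200, 2_000  # illustrative dataset/minibatch/step counts
    epochs = T * B / N             # the accountant is parameterized in epochs
    eps, order = tfp.compute_dp_sgd_privacy(
        n=N, batch_size=B, noise_multiplier=5.0, epochs=epochs, delta=1.0 / N)
    print(f"achieves ({eps:.3g}, {1.0 / N})-differential privacy")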
Figure A.3: Comparison of PSVI and SparseVI approximate posterior quality vs CPU
time requirements for logistic regression experiment of Section 4.4.
Experiments were performed on a CPU cluster node with a 2x Intel Xeon Gold 6142
and 12GB RAM. In the case of PSVI the computation of coreset sizes from 1 to
100 was parallelized per single size over 32 cores in total. Fig. A.3 shows posterior
approximation error vs required CPU time for all coreset construction algorithms over
logistic regression on the small-scale and large-scale datasets. As opposed to existing
incremental coreset construction schemes, batch construction of PSVI reduces the
dependence between coreset size and processing cost: for SparseVI, Θ(M²) gradient
computations are required, as this method builds up a coreset one point at a time; in
contrast, PSVI requires Θ(M ) gradients since it learns all pseudodata points jointly.
Although each gradient step of PSVI is more expensive, practically this implies a steeper
decrease in approximation error over processing time compared to SparseVI. In the
case of differentially private PSVI, some extra CPU requirements are added due to the
subsampled Gaussian mechanism computations.
Figure A.2: Results for the Synthetic, Phishing, and ChemReact datasets.
B.1 Models
In this section we present the derivations of the β-likelihood terms of Eqs. (2.23) and (2.24) required in the β-Cores constructions for the statistical models of our experiments.
Hence, omitting the constant term due to the shift-invariance of potentials entering Al-
gorithm 2, we get up to proportionality
$$f_n(\mu) \propto \frac{1}{\beta} \exp\left(-\frac{\beta}{2} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)\right). \qquad (B.3)$$
Recall that in the neural linear regression model, yn − θT z(xn ) ∼ N (0, σ 2 ), n = 1, . . . , N .
Then the Gaussian log-likelihoods corresponding to individual observations (after drop-
ping normalization constants), are written as
$$f_n(\theta) = -\frac{1}{2\sigma^2} \left(y_n - \theta^T z(x_n)\right)^2. \qquad (B.6)$$
Assuming a prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2 I)$, the coreset posterior is a Gaussian $\pi_w(\theta) = \mathcal{N}(\mu_w, \Sigma_w)$, with mean and covariance computable in closed form as follows:

$$\Sigma_w := \left(\sigma_0^{-2} I + \sigma^{-2} \sum_{m=1}^{M} w_m z(x_m) z(x_m)^T\right)^{-1}, \qquad (B.7)$$

$$\mu_w := \Sigma_w \left(\sigma_0^{-2} I \mu_0 + \sigma^{-2} \sum_{m=1}^{M} w_m y_m z(x_m)\right). \qquad (B.8)$$
By substituting into Eq. (2.24) and omitting constants, the β-likelihood terms for our adaptive-basis linear regression are written as

$$f_n(\theta) \propto e^{-\beta \left(y_n - \theta^T z(x_n)\right)^2 / (2\sigma^2)}. \qquad (B.9)$$
Let C be the output of the coreset construction applied on a dataset D. In regression problems, the predictive posterior on a test data pair (x_t, y_t) is then approximated through the coreset posterior.
In the neural linear experiment, the predictive posterior is a Gaussian given by the
following formula
$$\pi(y_t | x_t, \mathcal{C}) = \mathcal{N}\left(y_t;\ \mu_w^T z(x_t),\ \sigma^2 + z(x_t)^T \Sigma_w z(x_t)\right). \qquad (B.11)$$
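A direct transcription of Eqs. (B.7), (B.8) and (B.11) in NumPy (function names ours):

    import numpy as np

    def coreset_posterior(Z, y, w, sigma, sigma0, mu0):
        # Z: (M, k) basis features z(x_m); y: (M,) targets; w: (M,) weights
        k = Z.shape[1]
        precision = np.eye(k) / sigma0**2 + (Z.T * w) @ Z / sigma**2
        Sigma_w = np.linalg.inv(precision)
        mu_w = Sigma_w @ (mu0 / sigma0**2 + Z.T @ (w * y) / sigma**2)
        return mu_w, Sigma_w

    def predictive(z_t, mu_w, Sigma_w, sigma):
        # predictive mean and variance at a test input, per Eq. (B.11)
        return mu_w @ z_t, sigma**2 + z_t @ Sigma_w @ z_t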
B.2 Characterization of Riemannian coresets' combinatorial optimization objective
$$m^\star = \arg\max_{m \in [N]} -D_{\mathrm{KL}}\left(\pi_{\beta, w \leftarrow w \cup \{x_m\}} \,\|\, \pi_\beta\right). \qquad (B.12)$$
Definition 20 (Submodularity). The set function d is submodular if and only if for all
S ⊆ X and xj , xk ∈ X \ S, we have d(S ∪ {xj }) − d(S) ≥ d(S ∪ {xj , xk }) − d(S ∪ {xk }).
In the next proposition, we demonstrate a problem instance where the necessary and
sufficient condition of Definition 20 is violated for the function d considered in Eq. (B.13), hence proving that, absent further assumptions, our objective is not submodular.
Proof. For convenience let’s focus on the case of Gaussian mean inference for the classical
Bayesian posterior (β → 0), where the objective can be handily written in closed form.
Similar arguments will in principle carry over for arbitrary βs and statistical models. We
recall from Eq. (A.3) that
" ! ! #
1 1+N 1+N T
d(S) = − −d log N −d+d N + (1 + N )(µ1 − µw ) (µ1 − µw )
2 1+ M ||IS ||1 1+ M ||IS ||1
1
= − (1 + N )||µ1 − µw ||22 , (B.14)
2
where

$$\mu_1 = \frac{1}{1+N} \sum_{n=1}^{N} x_n, \qquad \mu_w = \frac{1}{1+N}\, \frac{N}{M} \sum_{x_i \in S} x_i. \qquad (B.15)$$
Let us consider a set of observations containing two mirrored datapoints x0, −x0, such that x0 ≠ µ1; one can then verify that the condition of Definition 20 is violated.
• a dataset used to predict whether a citizen’s income exceeds 50K$ per year extracted
from USA 1994 census data (Adult),
• a set of various features from homes in the suburbs of Boston, Massachusetts, used to model housing prices (Housing), and
• a dataset used to predict the release year of songs from associated audio features
(Songs).
Acronyms/Abbreviations
cf. confer
cit. cited
DP Differentially Private
etc. et cetera
i.e. id est
KL Kullback-Leibler
MC Monte Carlo
PL Privacy Loss
VI Variational Inference
Roman Symbols
D Dataset
H Hilbert space
Greek Symbols
ϵ a random variable
Other Symbols
[N ] [1, . . . , N ]
# Number of
◦ Composition of functions
:= Defined as
⟨· , ·⟩ Inner product
P Probability
Dβ β-divergence
dβ β-cross-entropy
dKL Cross-entropy
∝ Proportional to
∼ Distributed as
Z Integer numbers
N Natural numbers
R Real numbers
Superscripts
ˆ Empirical estimate
˜ Computed on pseudodata
Distributions
χ2 Chi-square distribution
N Normal distribution
Operators
E Expectation
Corr Correlation
Cov Covariance
tr Matrix trace
Var Variance
ID Identifier
Graphs
SP Shortest Path
WL Weisfeiler-Lehman
Bibliography
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K. and Zhang,
L. (2016). “Deep Learning with Differential Privacy”. ACM SIGSAC Conference on
Computer and Communications Security (cit. on pp. 22, 55, 62, 63, 91).
Agarwal, P. K., Har-Peled, S., Varadarajan, K. R., et al. (2005). “Geometric approximation
via coresets”. Combinatorial and computational geometry 52, pp. 1–30 (cit. on p. 54).
Ağır, B., Huguenin, K., Hengartner, U. and Hubaux, J.-P. (2016). “On the privacy
implications of location semantics”. Proceedings on Privacy Enhancing Technologies
2016.4 (cit. on p. 26).
Amari, S.-i. (2016). Information Geometry and Its Applications. Springer (cit. on p. 8).
Andrieu, C., De Freitas, N., Doucet, A. and Jordan, M. I. (2003). “An introduction to
MCMC for machine learning”. Machine Learning 50.1-2, pp. 5–43 (cit. on p. 12).
Bachem, O., Lucic, M. and Krause, A. (2015). “Coresets for nonparametric estimation—
the case of DP-means”. International Conference on Machine Learning (cit. on
p. 54).
Bachem, O., Lucic, M. and Krause, A. (2017). Practical Coreset Constructions for
Machine Learning. arXiv: 1703.06476 (cit. on p. 91).
Balle, B., Barthe, G., Gaboardi, M. and Geumlek, J. (2019). “Privacy Amplification
by Mixing and Diffusion Mechanisms”. Advances in Neural Information Processing
Systems. Vol. 32 (cit. on p. 83).
Balog, M., Tolstikhin, I. and Schölkopf, B. (2018). “Differentially private database release
via kernel mean embeddings”. International Conference on Machine Learning (cit. on
p. 3).
Banerjee, A., Merugu, S., Dhillon, I. S. and Ghosh, J. (2005). “Clustering with Bregman
Divergences”. J. Mach. Learn. Res. 6, pp. 1705–1749 (cit. on p. 8).
Bardenet, R., Doucet, A. and Holmes, C. (2014). “Towards scaling up Markov chain Monte
Carlo: an adaptive subsampling approach”. International Conference on Machine
Learning (cit. on p. 12).
Barreno, M., Nelson, B., Joseph, A. D. and Tygar, J. D. (2010). “The security of machine
learning”. Machine Learning (cit. on p. 70).
Bassily, R., Smith, A. and Thakurta, A. (2014). “Private Empirical Risk Minimiza-
tion: Efficient Algorithms and Tight Error Bounds”. IEEE Annual Symposium on
Foundations of Computer Science (cit. on p. 91).
Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (Sept. 1998). “Robust and efficient
estimation by minimising a density power divergence”. Biometrika 85.3, pp. 549–559
(cit. on pp. 8, 17).
Beimel, A., Nissim, K. and Stemmer, U. (2013). “Characterizing the sample complexity
of private learners”. Proceedings of the 4th conference on Innovations in Theoretical
Computer Science, pp. 97–110 (cit. on pp. 22, 91).
Berger, J. O., Moreno, E., Pericchi, L. R., Bayarri, M. J., Bernardo, J. M., Cano, J. A.,
De la Horra, J., Martín, J., Ríos Insúa, D., Betrò, B., et al. (1994). “An overview of
robust Bayesian analysis”. Test 3.1, pp. 5–124 (cit. on p. 71).
Beyer, K. S., Goldstein, J., Ramakrishnan, R. and Shaft, U. (1999). “When Is ‘Nearest Neighbor’ Meaningful?” Proceedings of the 7th International Conference on Database Theory. Springer-Verlag (cit. on p. 50).
Bhatia, K., Ma, Y.-A., Dragan, A. D., Bartlett, P. L. and Jordan, M. I. (2019). Bayesian
Robustness: A Nonasymptotic Viewpoint. arXiv: 1907.11826 (cit. on p. 71).
Bhattacharya, S., Manousakas, D., Ramos, A. G. C., Venieris, S. I., Lane, N. D. and
Mascolo, C. (2020). “Countering Acoustic Adversarial Attacks in Microphone-equipped
Smart Home Devices”. Proceedings of the ACM on Interactive, Mobile, Wearable and
Ubiquitous Technologies 4.2, pp. 1–24 (cit. on p. 6).
Biggio, B., Nelson, B. and Laskov, P. (2012). “Poisoning Attacks against Support Vec-
tor Machines”. Proceedings of the 29th International Coference on International
Conference on Machine Learning (cit. on p. 70).
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer (cit. on pp. 11,
13).
Braverman, V., Feldman, D. and Lang, H. (2016). New frameworks for offline and
streaming coreset constructions. arXiv: 1612.00889 (cit. on p. 54).
Chen, M., Gao, C. and Ren, Z. (2018). “Robust covariance and scatter matrix estimation
under Huber’s contamination model”. The Annals of Statistics 46.5, pp. 1932–1960
(cit. on p. 77).
Chen, Y., Welling, M. and Smola, A. (2010). “Super-Samples from Kernel Herding”.
Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence
(cit. on p. 58).
Cichocki, A. and Amari, S.-i. (2010). “Families of Alpha- Beta- and Gamma- Divergences:
Flexible and Robust Measures of Similarities”. Entropy 12.6, pp. 1532–1568 (cit. on
p. 8).
Csató, L. and Opper, M. (2002). “Sparse on-line Gaussian processes”. Neural Computation
14.3, pp. 641–668 (cit. on p. 54).
Dai, B., He, N., Dai, H. and Song, L. (2016). “Provable Bayesian Inference via Particle
Mirror Descent”. Artificial Intelligence and Statistics (cit. on p. 15).
Dawid, A. P., Musio, M. and Ventura, L. (2016). “Minimum scoring rule inference”.
Scandinavian Journal of Statistics 43.1, pp. 123–138 (cit. on p. 16).
De Mulder, Y., Danezis, G., Batina, L. and Preneel, B. (2008). “Identification via location-
profiling in GSM networks”. Proceedings of the 2008 ACM Workshop on Privacy in
the Electronic Society (cit. on pp. 25, 26).
Diakonikolas, I., Kamath, G., Kane, D., Li, J., Steinhardt, J. and Stewart, A. (2019).
“Sever: A Robust Meta-Algorithm for Stochastic Optimization”. Proceedings of the
36th International Conference on Machine Learning (cit. on p. 70).
Dickens, C., Meissner, E., Moreno, P. G. and Diethe, T. (2020). Interpretable Anomaly
Detection with Mondrian Pólya Forests on Data Streams. arXiv: 2008.01505 (cit. on
p. 70).
Drineas, P. and Mahoney, M. (2005). “On the Nyström method for approximating a Gram
matrix for improved kernel-based learning”. Journal of Machine Learning Research 6
(cit. on p. 54).
Dua, D. and Graff, C. (2017). UCI Machine Learning Repository (cit. on p. 108).
Duchi, J., Hazan, E. and Singer, Y. (2010). “Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization”. The 23rd Conference on Learning Theory
(cit. on p. 83).
Duchi, J., Hazan, E. and Singer, Y. (2011). “Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization”. Journal of Machine Learning Research 12.61,
pp. 2121–2159 (cit. on p. 83).
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C. and Pregibon, D. (1999). “Squashing
flat files flatter”. ACM Conference on Knowledge Discovery and Data Mining (cit. on
p. 54).
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I. and Naor, M. (2006a). “Our Data,
Ourselves: Privacy via Distributed Noise Generation”. International Conference on
The Theory and Applications of Cryptographic Techniques (cit. on p. 61).
Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006b). “Calibrating Noise to
Sensitivity in Private Data Analysis”. Conference on Theory of Cryptography (cit. on
p. 61).
Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006c). “Calibrating noise to
sensitivity in private data analysis”. Theory of Cryptography Conference. Springer,
pp. 265–284 (cit. on pp. 21, 28, 55).
Eguchi, S. and Kano, Y. (2001). Robustifying maximum likelihood estimation. Tech. rep.
(cit. on pp. 8, 16).
Feldman, D., Faulkner, M. and Krause, A. (2011). “Scalable training of mixture models
via coresets”. Advances in Neural Information Processing Systems (cit. on p. 54).
Feldman, D., Fiat, A., Kaplan, H. and Nissim, K. (2009). “Private Coresets”. ACM
Symposium on Theory of Computing (cit. on pp. 3, 55).
Feldman, D., Volkov, M. and Rus, D. (2016). “Dimensionality reduction of massive sparse
datasets using coresets”. Advances in Neural Information Processing Systems (cit. on
p. 54).
Feldman, D., Xiang, C., Zhu, R. and Rus, D. (2017). “Coresets for Differentially Private
k-Means Clustering and Applications to Privacy in Mobile Sensor Networks”. Inter-
national Conference on Information Processing in Sensor Networks (cit. on pp. 3,
55).
Finn, C., Abbeel, P. and Levine, S. (2017). “Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks”. International Conference on Machine Learning (cit.
on p. 92).
Fujisawa, H. and Eguchi, S. (2008). “Robust Parameter Estimation with a Small Bias
against Heavy Contamination”. Journal of Multivariate Analysis, pp. 2053–2081 (cit. on
p. 16).
Futami, F., Sato, I. and Sugiyama, M. (2018). “Variational Inference based on Robust
Divergences”. Proceedings of the Twenty-First International Conference on Artificial
Intelligence and Statistics (cit. on pp. 15, 17, 71, 76, 80).
Gambs, S., Killijian, M.-O. and Núñez Del Prado Cortez, M. (2014). “De-anonymization
Attack on Geolocated Data”. Journal of Computer and System Sciences 80 (cit. on
pp. 25, 26).
Geyer, C. J. (Nov. 1992). “Practical Markov Chain Monte Carlo”. Statistical Science 7.4,
pp. 473–483 (cit. on p. 12).
Ghorbani, A. and Zou, J. (2019). “Data Shapley: Equitable Valuation of Data for Machine
Learning”. Proceedings of the 36th International Conference on Machine Learning
(cit. on pp. 70, 85).
Ghosh, A. and Basu, A. (2016). “Robust Bayes estimation using the density power
divergence”. Annals of the Institute of Statistical Mathematics 68.2, pp. 413–437
(cit. on p. 17).
Ginart, A., Guan, M., Valiant, G. and Zou, J. Y. (2019). “Making AI Forget You: Data
Deletion in Machine Learning”. Advances in Neural Information Processing Systems
(cit. on p. 92).
Golle, P. and Partridge, K. (2009). “On the anonymity of home/work location pairs”.
International Conference on Pervasive Computing. Springer (cit. on p. 25).
Grant, E., Finn, C., Levine, S., Darrell, T. and Griffiths, T. L. (2018). “Recasting
Gradient-Based Meta-Learning as Hierarchical Bayes”. International Conference on
Learning Representations (cit. on p. 92).
Hoffman, M. D., Blei, D. M., Wang, C. and Paisley, J. (2013). “Stochastic Variational
Inference”. Journal of Machine Learning Research 14, pp. 1303–1347 (cit. on pp. 13,
70).
Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. 2nd ed. Wiley Series in
Probability and Statistics (cit. on pp. 3, 71).
Huggins, J., Campbell, T. and Broderick, T. (2016). “Coresets for Scalable Bayesian
Logistic Regression”. Advances in Neural Information Processing Systems (cit. on
pp. 14, 54, 71, 72).
Huggins, J., Campbell, T., Kasprzak, M. and Broderick, T. (2020). “Validated Variational
Inference via Practical Posterior Error Bounds”. International Conference on Artificial
Intelligence and Statistics (cit. on p. 67).
Jacob, P. E., O’Leary, J. and Atchadé, Y. F. (2020). “Unbiased Markov chain Monte
Carlo methods with couplings”. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 82.3, pp. 543–600 (cit. on p. 61).
Jälkö, J., Dikmen, O. and Honkela, A. (2017). “Differentially Private Variational Inference
for Non-conjugate Models”. Uncertainty in Artificial Intelligence (cit. on pp. 55, 66).
Jewson, J., Smith, J. Q. and Holmes, C. (2018). “Principles of Bayesian inference using
general divergence criteria”. Entropy 20.6, p. 442 (cit. on pp. 16, 17).
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (Nov. 1999). “An
Introduction to Variational Methods for Graphical Models”. Machine Learning 37.2,
pp. 183–233 (cit. on p. 13).
Kang, J. H., Welbourne, W., Stewart, B. and Borriello, G. (2005). “Extracting places
from traces of locations”. ACM SIGMOBILE Mobile Computing and Communications
Review 9 (cit. on p. 31).
Karger, D. R., Oh, S. and Shah, D. (2011). “Iterative learning for reliable crowdsourcing
systems”. Advances in Neural Information Processing Systems (cit. on p. 70).
Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S. and Smith, A. (2011).
“What can we learn privately?” SIAM Journal on Computing 40.3, pp. 793–826
(cit. on p. 22).
Korattikara, A., Chen, Y. and Welling, M. (2014). “Austerity in MCMC land: Cutting
the Metropolis-Hastings budget”. International Conference on Machine Learning
(cit. on p. 12).
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A. and Blei, D. M. (2017). “Automatic
Differentiation Variational Inference”. Journal of Machine Learning Research 18.14
(cit. on pp. 56, 66).
Kullback, S. and Leibler, R. A. (1951). “On information and sufficiency”. The Annals of
Mathematical Statistics 22.1, pp. 79–86 (cit. on p. 7).
Kurtek, S. and Bharath, K. (2015). “Bayesian sensitivity analysis with the Fisher–Rao
metric”. Biometrika 102.3, pp. 601–616 (cit. on p. 17).
Laurila, J. K., Gatica-Perez, D., Aad, I., Bornet, O., Do, T.-M.-T., Dousse, O., Eberle,
J., Miettinen, M., et al. (2012). “The mobile data challenge: Big data for mobile
computing research”. Pervasive Computing (cit. on p. 24).
Lewis, D. D., Yang, Y., Rose, T. G. and Li, F. (2004). “RCV1: A New Benchmark
Collection for Text Categorization Research”. Journal of Machine Learning Research
5, pp. 361–397 (cit. on p. 70).
Li, B., Wang, Y., Singh, A. and Vorobeychik, Y. (2016). “Data poisoning attacks on
factorization-based collaborative filtering”. Advances in Neural Information Processing
Systems (cit. on p. 70).
Li, N., Qardaji, W. and Su, D. (2012). “On sampling, anonymization, and differential
privacy or, k-anonymization meets differential privacy”. Proceedings of the 7th ACM
Symposium on Information, Computer and Communications Security, pp. 32–33
(cit. on p. 91).
Li, T. and Ke, Y. (2019). “Thinning for Accelerating the Learning of Point Processes”.
Advances in Neural Information Processing Systems (cit. on p. 91).
Lin, M., Cao, H., Zheng, V. W., Chang, K. C.-C. and Krishnaswamy, S. (2015). “Mobile
user verification/identification using statistical mobility profile”. 2015 International
Conference on Big Data and Smart Computing (cit. on p. 27).
Liu, Q., Peng, J. and Ihler, A. T. (2012). “Variational Inference for Crowdsourcing”.
Advances in Neural Information Processing Systems (cit. on p. 70).
Lucic, M., Bachem, O. and Krause, A. (2016a). “Linear-Time Outlier Detection via
Sensitivity”. Proceedings of the Twenty-Fifth International Joint Conference on
Artificial Intelligence (cit. on pp. 3, 70).
Lucic, M., Bachem, O. and Krause, A. (2016b). “Strong coresets for hard and soft
Bregman clustering with applications to exponential family mixtures”. International
Conference on Artificial Intelligence and Statistics (cit. on p. 54).
Lucic, M., Faulkner, M., Krause, A. and Feldman, D. (2017). “Training Gaussian mixture
models at scale via coresets”. The Journal of Machine Learning Research 18.1,
pp. 5885–5909 (cit. on p. 14).
Manousakas, D., Mascolo, C., Beresford, A. R., Chan, D. and Sharma, N. (2018).
“Quantifying privacy loss of human mobility graph topology”. Proceedings on Privacy
Enhancing Technologies 2018.3, pp. 5–21 (cit. on p. 6).
Manousakas, D., Xu, Z., Mascolo, C. and Campbell, T. (2020). “Bayesian Pseudocoresets”.
Advances in Neural Information Processing Systems (cit. on pp. 6, 77).
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv: 1301.3781 (cit. on p. 35).
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002).
“Network Motifs: Simple Building Blocks of Complex Networks”. Science 298.5594,
pp. 824–827 (cit. on p. 29).
Musco, C. and Musco, C. (2017). “Recursive sampling for the Nyström method”. Advances
in Neural Information Processing Systems (cit. on p. 54).
Naini, F. M., Unnikrishnan, J., Thiran, P. and Vetterli, M. (2016). “Where You Are Is
Who You Are: User Identification by Matching Statistics”. IEEE Transactions on
Information Forensics and Security 11.2 (cit. on pp. 25, 26).
Olejnik, L., Castelluccia, C. and Janc, A. (2014). “On the uniqueness of Web browsing
history patterns”. Annales des Télécommunications 69 (cit. on p. 29).
Park, M., Foulds, J. R., Chaudhuri, K. and Welling, M. (2020). “Variational Bayes In
Private Settings (VIPS)”. Journal of Artificial Intelligence Research 68 (cit. on p. 55).
Paschou, P., Lewis, J., Javed, A. and Drineas, P. (2010). “Ancestry informative markers
for fine-scale individual assignment to worldwide populations”. Journal of Medical
Genetics (cit. on p. 70).
Pfitzmann, A. and Hansen, M. (2010). A terminology for talking about privacy by data min-
imization: Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity,
and Identity Management (cit. on p. 30).
Pyrgelis, A., Troncoso, C. and De Cristofaro, E. (2017). “What Does The Crowd Say
About You? Evaluating Aggregation-based Location Privacy”. Proceedings on Privacy
Enhancing Technologies 2017.4, pp. 156–176 (cit. on p. 27).
Rahimi, A. and Recht, B. (2008). “Random features for large-scale kernel machines”.
Advances in Neural Information Processing Systems (cit. on p. 20).
Ranganath, R., Gerrish, S. and Blei, D. (2014). “Black Box Variational Inference”.
International Conference on Artificial Intelligence and Statistics (cit. on p. 56).
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L. and Moy, L.
(2010). “Learning from crowds”. Journal of Machine Learning Research 11.4 (cit. on
p. 70).
Ríos Insúa, D. and Ruggeri, F. (2012). Robust Bayesian Analysis. Vol. 152. Springer
Science & Business Media (cit. on p. 71).
Riquelme, C., Tucker, G. and Snoek, J. (2018). “Deep Bayesian Bandits Showdown: An
Empirical Comparison of Bayesian Deep Networks for Thompson Sampling”. 6th
International Conference on Learning Representations (cit. on p. 81).
Robert, C. P. and Casella, G. (2005). Monte Carlo Statistical Methods (Springer Texts
in Statistics). Berlin, Heidelberg: Springer-Verlag. ISBN: 0387212396 (cit. on p. 12).
Rossi, L., Williams, M. J., Stich, C. and Musolesi, M. (2015). “Privacy and the City:
User Identification and Location Semantics in Location-Based Social Networks”.
Proceedings of the Ninth International Conference on Web and Social Media (cit. on
p. 26).
Samek, W., Blythe, D., Müller, K.-R. and Kawanabe, M. (2013). “Robust spatial filtering
with beta divergence”. Advances in Neural Information Processing Systems (cit. on
p. 105).
Schellekens, V., Chatalic, A., Houssiau, F., de Montjoye, Y.-A., Jacques, L. and Gri-
bonval, R. (2019). “Differentially private compressive k-means”. IEEE International
Conference on Acoustics, Speech and Signal Processing (cit. on p. 3).
Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press (cit. on p. 19).
Sheng, V. S., Provost, F. and Ipeirotis, P. G. (2008). “Get Another Label? Improving
Data Quality and Data Mining Using Multiple, Noisy Labelers”. Proceedings of the
14th ACM International Conference on Knowledge Discovery and Data Mining (cit. on
p. 70).
Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K.
(2011). “Weisfeiler-Lehman graph kernels”. Journal of Machine Learning Research 12,
pp. 2539–2561 (cit. on p. 33).
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K. and Borgwardt, K. (2009).
“Efficient graphlet kernels for large graph comparison”. Artificial Intelligence and
Statistics, pp. 488–495 (cit. on p. 29).
Shokri, R., Troncoso, C., Diaz, C., Freudiger, J. and Hubaux, J.-P. (2010). “Unraveling
an old cloak: k-anonymity for location privacy”. Proceedings of the 9th Annual ACM
Workshop on Privacy in the Electronic Society (cit. on p. 30).
Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M.,
Prabhat, M. and Adams, R. (2015). “Scalable Bayesian Optimization Using Deep
Neural Networks”. Proceedings of the 32nd International Conference on Machine
Learning (cit. on p. 81).
Song, Y., Stolfo, S. and Jebara, T. (2011). Markov models for network-behavior modeling
and anonymization. Tech. rep. Department of Computer Science, Columbia University
(cit. on p. 51).
Steinhardt, J., Koh, P. W. W. and Liang, P. S. (2017). “Certified defenses for data
poisoning attacks”. Advances in Neural Information Processing Systems (cit. on
p. 70).
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J. and Clore,
J. N. (2014). “Impact of HbA1c measurement on hospital readmission rates: analysis
of 70,000 clinical database patient records”. BioMed Research International (cit. on
p. 108).
Thoma, M., Cheng, H., Gretton, A., Han, J., Kriegel, H. P., Smola, A., Song, L., Yu, P. S.,
Yan, X. and Borgwardt, K. M. (2010). “Discriminative frequent subgraph mining
with optimality guarantees”. Statistical Analysis and Data Mining 3.5, pp. 302–318
(cit. on p. 50).
Wang, C. and Blei, D. M. (2018). “A general method for robust Bayesian modeling”.
Bayesian Analysis (cit. on p. 71).
Wang, D., Irani, D. and Pu, C. (2012). “Evolutionary study of web spam: Webb Spam
Corpus 2011 versus Webb Spam Corpus 2006”. 8th International Conference on
Collaborative Computing: Networking, Applications and Worksharing (cit. on p. 108).
Wang, D., Liu, H. and Liu, Q. (2018). “Variational Inference with Tail-adaptive f-
Divergence”. Advances in Neural Information Processing Systems (cit. on p. 76).
Wang, Y., Kucukelbir, A. and Blei, D. M. (2017). “Robust Probabilistic Modeling with
Bayesian Data Reweighting”. Proceedings of the 34th International Conference on
Machine Learning (cit. on pp. 67, 71).
Welling, M. and Teh, Y. W. (2011). “Bayesian learning via stochastic gradient Langevin
dynamics”. Proceedings of the 28th International Conference on Machine Learning
(cit. on pp. 12, 70).
Whitehill, J., Wu, T.-F., Bergsma, J., Movellan, J. R. and Ruvolo, P. L. (2009). “Whose
vote should count more: Optimal integration of labels from labelers of unknown
expertise”. Advances in Neural Information Processing Systems (cit. on p. 70).
Williams, C. and Seeger, M. (2001). “Using the Nyström method to speed up kernel
machines”. Advances in Neural Information Processing Systems (cit. on p. 54).
Xu, F., Tu, Z., Li, Y., Zhang, P., Fu, X. and Jin, D. (2017). “Trajectory Recovery From
Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data”. Proceedings of
the 26th International Conference on World Wide Web (cit. on p. 27).
Yan, X. and Han, J. (2002). “gSpan: Graph-Based Substructure Pattern Mining”. Pro-
ceedings of the 2002 IEEE International Conference on Data Mining (cit. on pp. 29,
50).
Yen, T.-F., Xie, Y., Yu, F., Yu, R. P. and Abadi, M. (2012). “Host Fingerprinting and
Tracking on the Web: Privacy and Security Implications”. The 19th Annual Network
and Distributed System Security Symposium. Internet Society (cit. on p. 29).
Zang, H. and Bolot, J. (2011). “Anonymization of Location Data Does Not Work: A Large-
scale Measurement Study”. Proceedings of the 17th Annual International Conference
on Mobile Computing and Networking. ACM (cit. on pp. 24, 26).
Zellner, A. (1988). “Optimal Information Processing and Bayes’s Theorem”. The American
Statistician 42.4, pp. 278–280 (cit. on p. 15).
Zhang, J. Y., Khanna, R., Kyrillidis, A. and Koyejo, O. (2021a). “Bayesian Coresets:
Revisiting the Nonconvex Optimization Perspective”. International Conference on
Artificial Intelligence and Statistics (cit. on p. 14).
Zhang, R., Li, Y., De Sa, C., Devlin, S. and Zhang, C. (2021b). “Meta-Learning Diver-
gences for Variational Inference”. Proceedings of The 24th International Conference
on Artificial Intelligence and Statistics (cit. on p. 76).
Zhang, Y., Chen, X., Zhou, D. and Jordan, M. I. (2016). “Spectral methods meet EM: A
provably optimal algorithm for crowdsourcing”. The Journal of Machine Learning
Research 17.1, pp. 3537–3580 (cit. on p. 70).
Zheleva, E. and Getoor, L. (2009). “To join or not to join: the illusion of privacy in
social networks with mixed public and private user profiles”. Proceedings of the 18th
International Conference on World Wide Web (cit. on p. 27).
Zheng, V. W., Pan, S. J., Yang, Q. and Pan, J. J. (2008). “Transferring Multi-device
Localization Models using Latent Multi-task Learning”. Proceedings of the Twenty-
Third AAAI Conference on Artificial Intelligence (cit. on p. 70).
Zhu, J., Chen, N. and Xing, E. P. (2014). “Bayesian Inference with Posterior Regular-
ization and Applications to Infinite Latent SVMs”. Journal of Machine Learning
Research 15.53, pp. 1799–1847 (cit. on p. 15).
Zhuang, H., Parameswaran, A., Roth, D. and Han, J. (2015). “Debiasing Crowdsourced
Batches”. Proceedings of the 21st ACM International Conference on Knowledge
Discovery and Data Mining (cit. on p. 70).