Statistical Distances
36-705
Today we will discuss distances and metrics between distributions that are useful in statistics.
We will discuss them in two contexts:
1. There are metrics that are analytically useful in a variety of statistical problems, i.e.
they have intimate connections with estimation and testing.
2. There are metrics that are useful in data analysis, i.e. given data we want to measure
some notion of distance between (the distribution of) subsets of the data and use this
in some way.
There is of course overlap between these contexts and so there are distances that are useful
in both, but they are motivated by slightly different considerations.
1. The Total Variation distance: for distributions P and Q (with densities p and q), the TV distance is defined as:
TV(P, Q) = sup_A |P(A) − Q(A)|,
where A is just any measurable subset of the sample space, i.e. the TV distance is measuring the maximal difference between the probability of an event under P versus under Q.
The TV distance is equivalent to the ℓ1 distance between the densities, i.e. one can show that:
TV(P, Q) = (1/2) ∫ |p(x) − q(x)| dx.
2. The χ² divergence: The χ² divergence is defined for distributions P and Q such that Q dominates P, i.e. if Q(A) = 0 for some set A then it has to be the case that P(A) is also 0. For such distributions:
χ²(P, Q) = ∫_{x: q(x)>0} p²(x)/q(x) dx − 1.
3. The Hellinger distance: The squared Hellinger distance is defined as:
H²(P, Q) = ∫ (√p(x) − √q(x))² dx = 2 (1 − ρ(P, Q)),
where ρ(P, Q) = ∫ √(p(x) q(x)) dx is known as the affinity between P and Q.
4. The Kullback-Leibler (KL) divergence: for P dominated by Q, the KL divergence is defined as:
KL(P, Q) = ∫ p(x) log(p(x)/q(x)) dx.
All of these fundamental statistical distances are special cases of what are known as f -
divergences. The field of information theory has devoted considerable effort to studying
families of distances (α, β, φ, f -divergences) and so on, and this has led to a fruitful interface
between statistics and information theory. An f -divergence is defined for a convex function
f with f (1) = 0:
D_f(P, Q) = ∫ q(x) f(p(x)/q(x)) dx.
You can look up (for instance on Wikipedia) which functions lead to each of the divergences
we defined above.
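To make these definitions concrete, here is a minimal numerical sketch (illustrative, not from the notes) that computes the divergences above as f-divergences between two discrete distributions. The generator functions below are one standard choice; the distributions p and q are arbitrary examples, and the asserts compare against the direct definitions.

```python
import numpy as np

# A minimal sketch: f-divergences between two discrete distributions p and q
# with strictly positive entries.

def f_divergence(p, q, f):
    """D_f(P, Q) = sum_x q(x) f(p(x) / q(x))."""
    return np.sum(q * f(p / q))

# One standard choice of generator functions (each convex with f(1) = 0).
generators = {
    "TV":          lambda t: 0.5 * np.abs(t - 1),
    "chi^2":       lambda t: (t - 1) ** 2,
    "Hellinger^2": lambda t: (np.sqrt(t) - 1) ** 2,
    "KL":          lambda t: t * np.log(t),
}

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

for name, f in generators.items():
    print(name, f_divergence(p, q, f))

# Sanity checks against the direct definitions.
assert np.isclose(f_divergence(p, q, generators["TV"]), 0.5 * np.sum(np.abs(p - q)))
assert np.isclose(f_divergence(p, q, generators["chi^2"]), np.sum(p ** 2 / q) - 1)
assert np.isclose(f_divergence(p, q, generators["Hellinger^2"]),
                  2 * (1 - np.sum(np.sqrt(p * q))))
assert np.isclose(f_divergence(p, q, generators["KL"]), np.sum(p * np.log(p / q)))
```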
2 Hypothesis Testing Lower Bounds
A basic hypothesis testing problem is the following: suppose that we have two distributions
P0 and P1 , and we consider the following experiment: I toss a fair coin and if it comes up
heads I give you a sample from P0 and if it comes up tails I give you a sample from P1 . Let
T = 0 if the coin comes up heads and T = 1 otherwise. You only observe the sample X, and
need to tell me which distribution it came from.
This is exactly like our usual simple versus simple hypothesis testing problem, except I pick
each hypothesis with probability 1/2. Now, suppose you have a test Ψ : X → {0, 1}. Define its error rate as:
P(Ψ(X) ≠ T) = (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)].
Roughly, we might believe that if P0 and P1 are close in an appropriate distance measure then the error rate of the best possible test should be high, and otherwise the error rate should be low. This is made precise by Le Cam's Lemma, which states that:
inf_Ψ P(Ψ(X) ≠ T) = (1/2) (1 − TV(P0, P1)).
Before we prove this result we should take some time to appreciate it. What Le Cam’s
Lemma tells us is that if two distributions are close in TV then no test can distinguish
them. In some sense TV is the right notion of distance for statistical applications. In fact,
in theoretical CS, the TV distance is sometimes referred to as the statistical distance. It is
also the case that the likelihood ratio test achieves this bound exactly (this should not surprise you, since it is a simple-versus-simple hypothesis test).
Proof: For any test Ψ we can denote its acceptance region A, i.e. if X ∈ A then Ψ(X) = 0.
Then,
(1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] = (1/2) [P0(X ∉ A) + P1(X ∈ A)]
= (1/2) [1 − (P0(X ∈ A) − P1(X ∈ A))].
So to find the best test we simply minimize the RHS, or equivalently:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] = (1/2) (1 − sup_A (P0(X ∈ A) − P1(X ∈ A)))
= (1/2) (1 − TV(P0, P1)).
Close analogues of Le Cam’s Lemma hold for all of the other divergences above, i.e. roughly,
if any of the χ2 , Hellinger or KL divergences are small then we cannot reliably distinguish
between the two distributions. If you want a formal statement see Theorem 2.2 in Tsybakov’s
book.
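As a sanity check on Le Cam's Lemma, the sketch below (illustrative, with made-up distributions p0 and p1 on a four-point alphabet) brute-forces the best test over all acceptance regions and confirms that its error equals (1 − TV(P0, P1))/2.

```python
import numpy as np
from itertools import combinations

# Brute-force check of Le Cam's Lemma on a small finite alphabet:
# the best test's error equals (1 - TV(P0, P1)) / 2.

p0 = np.array([0.1, 0.2, 0.3, 0.4])
p1 = np.array([0.4, 0.3, 0.2, 0.1])

tv = 0.5 * np.sum(np.abs(p0 - p1))

best_error = 1.0
support = range(len(p0))
# Enumerate every acceptance region A (the set where the test outputs 0).
for r in range(len(p0) + 1):
    for A in combinations(support, r):
        A = list(A)
        # Average of the two error probabilities: P0(X not in A) and P1(X in A).
        error = 0.5 * ((1 - p0[A].sum()) + p1[A].sum())
        best_error = min(best_error, error)

print(best_error, 0.5 * (1 - tv))
assert np.isclose(best_error, 0.5 * (1 - tv))
```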
3 Tensorization
Given the above fact, that all of the distances we defined so far are in some sense “fun-
damental” in hypothesis testing, a natural question is why do we need all these different
distances?
The answer is a bit technical, but roughly, when we want to compute a lower bound (i.e.
understand the fundamental statistical difficulty of our problem) some divergences might be
easier to compute than others. For instance, it is often the case that for mixture distributions
the χ² divergence is easy to compute, while for many parametric models the KL divergence is natural (in part because it is closely related to the Fisher information). Knowing which divergence to use when is a bit of an art, but having many tools in your toolbox is always useful.
One natural thing that will arise in statistical applications is that unlike the above setting
of Le Cam’s Lemma we will observe n i.i.d. samples X1 , . . . , Xn ∼ P , rather than just one
sample. Everything we have said so far works in exactly the same way, except we need
to calculate the distance between the product measures, i.e. d(Pⁿ, Qⁿ), where the product measure Pⁿ has density:
p(X1, . . . , Xn) = ∏_{i=1}^n p(Xi).
For the TV distance this turns out to be quite difficult to do directly. However, one of the
most useful properties of the Hellinger and KL distance is that they tensorize, i.e. they
behave nicely with respect to product distributions. In particular, we have the following
useful relationships:
KL(Pⁿ, Qⁿ) = n KL(P, Q),
H²(Pⁿ, Qⁿ) = 2 (1 − (1 − H²(P, Q)/2)ⁿ) = 2 (1 − ρ(P, Q)ⁿ),
where ρ(P, Q) = ∫ √(p(x) q(x)) dx is the affinity defined earlier. The key point is that when we see n i.i.d. samples it is easy to compute the KL and Hellinger.
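Here is a small illustrative check of these identities (the distributions p and q below are arbitrary): for two distributions on a two-point alphabet we form the n-fold product distributions explicitly and compare the KL and squared Hellinger of the products to n KL(P, Q) and 2(1 − ρ(P, Q)ⁿ).

```python
import numpy as np
from itertools import product

# Check the tensorization identities by explicitly forming the n-fold products.

p = np.array([0.3, 0.7])
q = np.array([0.6, 0.4])
n = 4

def kl(a, b):
    return np.sum(a * np.log(a / b))

def hellinger_sq(a, b):
    return 2 * (1 - np.sum(np.sqrt(a * b)))

# Joint pmfs of (X1, ..., Xn) under P^n and Q^n.
pn = np.array([np.prod(p[list(x)]) for x in product(range(len(p)), repeat=n)])
qn = np.array([np.prod(q[list(x)]) for x in product(range(len(q)), repeat=n)])

rho = np.sum(np.sqrt(p * q))
assert np.isclose(kl(pn, qn), n * kl(p, q))                  # KL(P^n, Q^n) = n KL(P, Q)
assert np.isclose(hellinger_sq(pn, qn), 2 * (1 - rho ** n))  # H^2(P^n, Q^n) = 2(1 - rho^n)
print(kl(pn, qn), hellinger_sq(pn, qn))
```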
4 Hypothesis Testing Upper Bounds
One can ask if there are analogous upper bounds, i.e. for instance if the distance between
P0 and P1 gets larger, are there quantitatively better tests for distinguishing them?
For Hellinger and TV the answer turns out to be yes (and for χ2 and KL the answer is yes
under some assumptions). Formally, given n samples from either P0 or P1 you can construct
tests that distinguish between P0 and P1 such that, for some constants c1, c2 > 0:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ c1 exp(−c2 n TV²(P0, P1)),
and similarly
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ c1 exp(−c2 n H²(P0, P1)).
These are sometimes called large-deviation inequalities (and in most cases precise constants
are known).
It turns out that even for distinguishing two hypotheses that are separated in the Hellinger
distance, the likelihood ratio test is optimal (and achieves the above result). The proof is
short and elegant.
Proof: We recall the elementary bound:
log x ≤ x − 1, for all x > 0,
which in turn tells us that:
log ρ(P0, P1) ≤ ρ(P0, P1) − 1 = −H²(P0, P1)/2.
So now let us analyze the LRT which rejects the null if:
∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1.
Let us study its Type I error (its Type II error bound follows essentially the same logic). We
note that:
P0( ∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1 ) = P0( ∏_{i=1}^n P1(Xi)/P0(Xi) ≥ 1 )
= P0( ∏_{i=1}^n √(P1(Xi)/P0(Xi)) ≥ 1 )
≤ E_{P0} [ ∏_{i=1}^n √(P1(Xi)/P0(Xi)) ],
using Markov’s inequality. Now, using independence we see that:
P0( ∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1 ) ≤ ( E_{P0} [ √(P1(X)/P0(X)) ] )ⁿ
= exp(n log ρ(P0, P1))
≤ exp(−n H²(P0, P1)/2).
Putting this together with an identical bound under the alternative, we obtain:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ exp(−n H²(P0, P1)/2).
The result is quite nice – it says that the LRT can distinguish two distributions reliably provided their Hellinger distance is large compared to 1/√n. Furthermore, the bound has an exponential form, so you might imagine that it will interact nicely with a union bound (in a multiple testing setup where we want to distinguish between several distributions).
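A quick simulation (illustrative; the Bernoulli parameters, sample size and number of trials are assumptions) runs the LRT for Bernoulli(0.3) versus Bernoulli(0.6) and checks that its average error sits below exp(−n H²/2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the LRT for Bernoulli(p0) vs Bernoulli(p1) and compare its average
# error over the two hypotheses with the bound exp(-n H^2 / 2).

p0, p1 = 0.3, 0.6
n, trials = 50, 20000

# Squared Hellinger distance between Bernoulli(p0) and Bernoulli(p1).
H2 = 2 * (1 - (np.sqrt(p0 * p1) + np.sqrt((1 - p0) * (1 - p1))))

def lrt_rejects_null(x):
    # Reject the null (declare P1) when the likelihood ratio prod p0(Xi)/p1(Xi) <= 1.
    log_lr = np.sum(np.log(np.where(x == 1, p0 / p1, (1 - p0) / (1 - p1))))
    return log_lr <= 0

type1 = np.mean([lrt_rejects_null(rng.binomial(1, p0, n)) for _ in range(trials)])
type2 = np.mean([not lrt_rejects_null(rng.binomial(1, p1, n)) for _ in range(trials)])

avg_error = 0.5 * (type1 + type2)
print(avg_error, np.exp(-n * H2 / 2))   # simulated error vs. the Hellinger bound
assert avg_error <= np.exp(-n * H2 / 2)
```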
5 Inequalities
Given the fact that these four distances are fundamental and that they have potentially
different settings where they are easy to use it is also useful to have inequalities that relate
these distances.
The following inequalities reveal a sort of hierarchy between the distances:
TV(P, Q) ≤ H(P, Q) ≤ √(KL(P, Q)) ≤ √(χ²(P, Q)).
This chain of inequalities should explain why it is the case that if any of these distances are
too small we cannot distinguish the distributions. In particular, if any distance is too small
then the TV must be small and we then use Le Cam’s Lemma.
There are also reverse inequalities in some cases (but not all). For instance:
(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q),
so up to the square factor Hellinger and TV are closely related.
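These inequalities are easy to spot-check numerically. The sketch below (illustrative; random 5-point distributions drawn from a Dirichlet) verifies the chain and the reverse Hellinger bound on many random pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Spot-check TV <= H <= sqrt(KL) <= sqrt(chi^2) and H^2/2 <= TV <= H
# on randomly drawn discrete distributions.

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.sum(np.abs(p - q))
    H = np.sqrt(2 * (1 - np.sum(np.sqrt(p * q))))
    kl = np.sum(p * np.log(p / q))
    chi2 = np.sum(p ** 2 / q) - 1
    assert tv <= H + 1e-12
    assert H <= np.sqrt(kl) + 1e-12
    assert np.sqrt(kl) <= np.sqrt(chi2) + 1e-12
    assert 0.5 * H ** 2 <= tv + 1e-12

print("all inequalities verified on 1000 random pairs")
```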
1. Fisher information distance: for two distributions Pθ1, Pθ2 we have that:
d(Pθ1, Pθ2) = (θ1 − θ2)ᵀ I(θ1) (θ1 − θ2).
2. Mahalanobis distance: for two distributions Pθ1, Pθ2, with means µ1, µ2 and covariances Σ1, Σ2, the Mahalanobis distance would be:
d(Pθ1, Pθ2) = (µ1 − µ2)ᵀ Σ1⁻¹ (µ1 − µ2).
This is just the Fisher distance for the Gaussian family with known covariance.
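As a small illustration (the means and covariance below are made up), the squared Mahalanobis distance is just a quadratic form; scipy's mahalanobis helper returns the square root of the same quantity.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Squared Mahalanobis distance between two Gaussians with means mu1, mu2,
# using the covariance Sigma1 of the first distribution.

mu1 = np.array([0.0, 1.0])
mu2 = np.array([2.0, -1.0])
Sigma1 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

diff = mu1 - mu2
d_sq = diff @ np.linalg.inv(Sigma1) @ diff      # (mu1 - mu2)^T Sigma1^{-1} (mu1 - mu2)

# scipy returns the square root of this quadratic form.
d_sq_scipy = mahalanobis(mu1, mu2, np.linalg.inv(Sigma1)) ** 2

print(d_sq, d_sq_scipy)
assert np.isclose(d_sq, d_sq_scipy)
```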
Huber's model is thus very closely related to model mis-specification in TV. As a result, the TV minimum distance estimator is very robust to outliers. The main drawback, however, is a computational one: the TV minimum distance estimator is often difficult to compute.
If two distributions P, Q are identical then it should be clear that for any (measurable) function f, it must be the case that:
E_{X∼P}[f(X)] = E_{X∼Q}[f(X)].
One might wonder if the reverse implication is true, i.e. if P ≠ Q, must there be some witness function f such that:
E_{X∼P}[f(X)] ≠ E_{X∼Q}[f(X)]?
It turns out that this statement is indeed true. In particular, we have the following lemma.
Lemma 2 Two distributions P, Q are identical if and only if for every continuous function f ∈ C(X),
E_{X∼P}[f(X)] = E_{X∼Q}[f(X)].
This result suggests that to measure the distance between two distributions we could use a so-called integral probability metric (IPM):
d_F(P, Q) = sup_{f∈F} |E_{X∼P}[f(X)] − E_{X∼Q}[f(X)]|,
where F is a class of functions. We note that the TV distance is thus just an IPM with F = {f : ‖f‖∞ ≤ 1} (up to the factor of 2 in our normalization). This class of functions (as well as the class of all continuous functions) is too large to be useful statistically, i.e. these IPMs are not easy to estimate from data, so we instead use function classes of smooth functions.
One popular choice is the class of 1-Lipschitz functions, which gives the 1-Wasserstein distance:
W1(P, Q) = sup_{f: ‖f‖_Lip ≤ 1} |E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)]|.
As with the TV there are many alternative ways of defining the Wasserstein distance. In particular, there are very nice interpretations of Wasserstein as a distance between “couplings” and as a so-called transportation distance.
The Wasserstein distance has the somewhat nice property of being well-defined between a
discrete and continuous distribution, i.e. the two distributions you are comparing do not
need to have the same support. This is one of the big reasons why it is popular in ML.
In particular, a completely reasonable estimate of the Wasserstein distance between two
distributions, given samples from each of them is the Wasserstein distance between the
corresponding empirical measures, i.e. we estimate:
Ŵ1(P, Q) = W1(Pn, Qn),
where Pn (for instance) is the distribution that puts mass 1/n on each sample point.
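For univariate samples this plug-in estimate is a one-liner with scipy. The example below (with assumed Gaussian samples) estimates W1 between N(0, 1) and N(1, 1); in one dimension the true value is the mean shift, which is 1 here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

# Plug-in estimate: W1 between the empirical measures of the two samples.
n = 5000
x = rng.normal(loc=0.0, scale=1.0, size=n)   # samples from P = N(0, 1)
y = rng.normal(loc=1.0, scale=1.0, size=n)   # samples from Q = N(1, 1)

print(wasserstein_distance(x, y))   # close to the true W1, which is 1 here
```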
There are many other nice ways to interpret the Wasserstein distance, and it has many other
elegant properties. It is central in the field of optimal transport. A different expression for
the Wasserstein distance involves couplings: a coupling is a joint distribution J over X and Y such that the marginal over X is P and the marginal over Y is Q. Then the W1 distance (or more generally the Wp distance) is:
Wp(P, Q) = ( inf_J E_{(X,Y)∼J} ‖X − Y‖^p )^{1/p},
where the infimum is over all couplings J of P and Q.
One can also replace the Euclidean distance by any metric on the space on which P and Q
are defined. More generally, there is a way to view Wasserstein distances as measuring the
cost of optimally moving mass of the distribution P to make it look like the distribution Q
(hence, the term optimal transport).
The Wasserstein distance also arises frequently in image processing. In part, this is be-
cause the Wasserstein barycenter (a generalization of a mean) of a collection of distributions
preserves the shape of the distribution. This is quite unlike the “usual” average of the
distributions.
Another popular choice is to take F to be the unit ball of a Reproducing Kernel Hilbert Space (RKHS); the resulting IPM is called the Maximum Mean Discrepancy (MMD). Because the space is an RKHS with a kernel k, it turns out we can write this distance as:
MMD²(P, Q) = E_{X,X′∼P} k(X, X′) + E_{Y,Y′∼Q} k(Y, Y′) − 2 E_{X∼P, Y∼Q} k(X, Y).
Intuitively, we are contrasting how similar the samples from P look to each other (and how similar the samples from Q look to each other) with how similar the samples from P are to the samples from Q. If P and Q are the same then all of these expectations are the same and the MMD is 0.
The key point that makes the MMD so popular is that it is completely trivial to estimate the
MMD since it is a bunch of expected values for which we can use the empirical expectations.
We have discussed U-statistics before; the MMD can be estimated by a simple U-statistic:
M̂MD²(P, Q) = (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(Xi, Xj) + (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(Yi, Yj) − (2/n²) Σ_{i=1}^n Σ_{j=1}^n k(Xi, Yj).
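Here is a minimal sketch of this estimator with a Gaussian RBF kernel (the function names, bandwidth and sample sizes are illustrative assumptions). The within-sample averages drop the diagonal terms, exactly as in the U-statistic above.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), computed pairwise.
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_sq_unbiased(X, Y, bandwidth=1.0):
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    # Drop the diagonal terms in the within-sample averages (U-statistic).
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(500, 2))                         # samples from P
Y = rng.normal(0.5, 1.0, size=(500, 2))                         # samples from Q (shifted mean)
print(mmd_sq_unbiased(X, Y))                                    # noticeably positive
print(mmd_sq_unbiased(X, rng.normal(0.0, 1.0, size=(500, 2))))  # close to 0
```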
However, the usefulness of the MMD hinges not on how well we can estimate it but on how
strong a notion of distance it is, i.e. for distributions that are quite different (say in the
TV/Hellinger/. . . distances), is it the case that the MMD is large? This turns out to be
quite a difficult question to answer.
9 Fano’s Inequality
We focused primarily on the role of f -divergences in testing but they are equally fundamental
in providing lower bounds for estimation. In estimation we obtain samples X1 , . . . , Xn ∼ Pθ
where Pθ ∈ PΘ, and our goal is to estimate θ (say with small ℓ2 error).
In an intuitive sense, estimation is a lot like a multiple hypothesis testing problem of the following form – I give you samples from one of M distributions {Pθ1, . . . , PθM} and I ask you to figure out which one it was. Our estimator Ψ in this context simply takes in n samples and returns an index in {1, . . . , M}.
We could imagine the setting where we sample an index u uniformly from {1, . . . , M} and generate n samples from Pθu. We define the following notion of error:
err = P(Ψ(X1, . . . , Xn) ≠ u),
where the probability is over both the random index u and the samples.
Fano’s inequality (and others like it) relate how hard this testing problem is (i.e. they lower
bound err) to a function of some distance between the distributions {Pθ1 , . . . , PθM }.
Suppose that M ≥ 3, and for some small constant c1 > 0:
(1/M²) Σ_{i=1}^M Σ_{j=1}^M KL(Pθi ‖ Pθj) ≤ c1 log M,
then for some other constant c2 > 0, err > c2. In words, Fano's inequality says that if the average pairwise KL divergence is small then the multiple testing problem is difficult.
So how does this relate to estimation? Suppose we additionally ensure that for all pairs (i, j)
we have that ‖θi − θj‖²₂ ≥ (2δ)², then it is easy to verify that the minimax estimation error:
inf_θ̂ sup_{θ∈Θ} E‖θ̂ − θ‖²₂ ≥ inf_θ̂ sup_{θ∈{θ1,...,θM}} E‖θ̂ − θ‖²₂ ≥ inf_θ̂ (1/M) Σ_{i=1}^M E_{X1,...,Xn∼Pθi} ‖θ̂ − θi‖²₂ ≥ c2 δ².
This should make sense: if we can find parameters that are well-separated but whose underlying distributions are very close, then the estimation problem should be difficult.
An Application: So why do we need Fano’s inequality? Suppose we consider establishing
a lower bound for estimating the mean of a Normal distribution. We have already seen that
the minimax error is at least σ²d/n (but this required a complicated Bayes argument). On the other hand, using a simple versus simple testing problem (with two separated Normals) we can easily show via Le Cam's lemma that the error is at least σ²/n. To get the right dimension
dependence we will need to use the full power of Fano’s inequality.
Let us see how this works. We need to know a few facts: one is that there is a packing of the radius-4δ sphere of size c2^d (for some c > 0), such that for every pair i ≠ j:
‖θi − θj‖₂ ≥ 2δ,
i.e. there are roughly 2^d vectors in the 4δ-sphere that are well-separated (i.e. are separated by at least 2δ).
Additionally, we can calculate the KL divergence between n samples from N(θi, σ²Id) and n samples from N(θj, σ²Id):
KL(Pθi, Pθj) = (n/(2σ²)) ‖θi − θj‖²₂ ≤ c n δ²/σ².
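A quick Monte Carlo check (illustrative; the dimension, σ and the θ's below are arbitrary) confirms the per-sample formula ‖θi − θj‖²/(2σ²); multiplying by n then gives the n-sample KL by tensorization.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Monte Carlo check: KL(N(theta_i, s^2 I), N(theta_j, s^2 I)) = ||theta_i - theta_j||^2 / (2 s^2).
d, sigma = 5, 2.0
theta_i = rng.normal(size=d)
theta_j = rng.normal(size=d)

Pi = multivariate_normal(mean=theta_i, cov=sigma ** 2 * np.eye(d))
Pj = multivariate_normal(mean=theta_j, cov=sigma ** 2 * np.eye(d))

X = theta_i + sigma * rng.normal(size=(200000, d))      # samples from P_{theta_i}
kl_mc = np.mean(Pi.logpdf(X) - Pj.logpdf(X))            # Monte Carlo estimate of the KL
kl_formula = np.sum((theta_i - theta_j) ** 2) / (2 * sigma ** 2)

print(kl_mc, kl_formula)   # close; multiply by n for the KL between the n-sample distributions
```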