Statistical Distances
36-705
Today we will discuss distances and metrics between distributions that are useful in statistics.
We will discuss them in two contexts:
1. There are metrics that are analytically useful in a variety of statistical problems, i.e.
they have intimate connections with estimation and testing.
2. There are metrics that are useful in data analysis, i.e. given data we want to measure
some notion of distance between (the distribution of) subsets of the data and use this
in some way.
There is of course overlap between these contexts and so there are distances that are useful
in both, but they are motivated by slightly different considerations.
1. The Total Variation distance: for distributions P and Q (with densities p and q), the TV distance is defined as:
TV(P, Q) = sup_A |P(A) − Q(A)|,
where A is just any measurable subset of the sample space, i.e. the TV distance is measuring the maximal difference between the probability of an event under P versus under Q.
The TV distance is equivalent to the ℓ1 distance between the densities, i.e. one can show that:
TV(P, Q) = (1/2) ∫ |p(x) − q(x)| dx.
2. The χ² divergence: The χ² divergence is defined for distributions P and Q such that Q dominates P, i.e. if Q(A) = 0 for some set A then it has to be the case that P(A) is also 0. For such distributions:
χ²(P, Q) = ∫_{x: q(x)>0} p²(x)/q(x) dx − 1.
3. The Hellinger distance: The squared Hellinger distance is defined as:
H²(P, Q) = ∫ (√p(x) − √q(x))² dx = 2 (1 − ρ(P, Q)),
where ρ(P, Q) = ∫ √(p(x) q(x)) dx is known as the affinity between P and Q.
4. The Kullback-Leibler (KL) divergence: for P dominated by Q, the KL divergence is defined as:
KL(P, Q) = ∫ p(x) log(p(x)/q(x)) dx.
All of these fundamental statistical distances are special cases of what are known as f -
divergences. The field of information theory has devoted considerable effort to studying
families of distances (α, β, φ, f -divergences) and so on, and this has led to a fruitful interface
between statistics and information theory. An f -divergence is defined for a convex function
f with f (1) = 0:
D_f(P, Q) = ∫ q(x) f(p(x)/q(x)) dx.
You can look up (for instance on Wikipedia) which functions lead to each of the divergences
we defined above.
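To make these definitions concrete, here is a minimal numerical sketch (illustrative, not from the notes) that computes the divergences above as f-divergences between two discrete distributions. The generator functions below are one standard choice; the distributions p and q are arbitrary examples, and the asserts compare against the direct definitions.

```python
import numpy as np

# A minimal sketch: f-divergences between two discrete distributions p and q
# with strictly positive entries.

def f_divergence(p, q, f):
    """D_f(P, Q) = sum_x q(x) f(p(x) / q(x))."""
    return np.sum(q * f(p / q))

# One standard choice of generator functions (each convex with f(1) = 0).
generators = {
    "TV":          lambda t: 0.5 * np.abs(t - 1),
    "chi^2":       lambda t: (t - 1) ** 2,
    "Hellinger^2": lambda t: (np.sqrt(t) - 1) ** 2,
    "KL":          lambda t: t * np.log(t),
}

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

for name, f in generators.items():
    print(name, f_divergence(p, q, f))

# Sanity checks against the direct definitions.
assert np.isclose(f_divergence(p, q, generators["TV"]), 0.5 * np.sum(np.abs(p - q)))
assert np.isclose(f_divergence(p, q, generators["chi^2"]), np.sum(p ** 2 / q) - 1)
assert np.isclose(f_divergence(p, q, generators["Hellinger^2"]),
                  2 * (1 - np.sum(np.sqrt(p * q))))
assert np.isclose(f_divergence(p, q, generators["KL"]), np.sum(p * np.log(p / q)))
```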
2 Hypothesis Testing Lower Bounds
A basic hypothesis testing problem is the following: suppose that we have two distributions
P0 and P1 , and we consider the following experiment: I toss a fair coin and if it comes up
heads I give you a sample from P0 and if it comes up tails I give you a sample from P1 . Let
T = 0 if the coin comes up heads and T = 1 otherwise. You only observe the sample X, and
need to tell me which distribution it came from.
This is exactly like our usual simple versus simple hypothesis testing problem, except I pick
each hypothesis with probability 1/2. Now, suppose you have a test Ψ : X → {0, 1}. Define its error rate as:
P(Ψ(X) ≠ T) = (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)].
Roughly, we might believe that if P0 and P1 are close in an appropriate distance measure then the error rate of the best possible test should be high, and otherwise the error rate should be low. This is made precise by Le Cam's Lemma, which states that:
inf_Ψ P(Ψ(X) ≠ T) = (1/2) (1 − TV(P0, P1)).
Before we prove this result we should take some time to appreciate it. What Le Cam’s
Lemma tells us is that if two distributions are close in TV then no test can distinguish
them. In some sense TV is the right notion of distance for statistical applications. In fact,
in theoretical CS, the TV distance is sometimes referred to as the statistical distance. It is
also the case that the likelihood ratio test achieves this bound exactly (this should not surprise you, since it is a simple-versus-simple hypothesis test).
Proof: For any test Ψ we can denote its acceptance region A, i.e. if X ∈ A then Ψ(X) = 0.
Then,
(1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] = (1/2) [P0(X ∉ A) + P1(X ∈ A)]
= (1/2) [1 − (P0(X ∈ A) − P1(X ∈ A))].
So to find the best test we simply minimize the RHS, or equivalently:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] = (1/2) (1 − sup_A (P0(X ∈ A) − P1(X ∈ A)))
= (1/2) (1 − TV(P0, P1)).
Close analogues of Le Cam’s Lemma hold for all of the other divergences above, i.e. roughly,
if any of the χ2 , Hellinger or KL divergences are small then we cannot reliably distinguish
between the two distributions. If you want a formal statement see Theorem 2.2 in Tsybakov’s
book.
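As a sanity check on Le Cam's Lemma, the sketch below (illustrative, with made-up distributions p0 and p1 on a four-point alphabet) brute-forces the best test over all acceptance regions and confirms that its error equals (1 − TV(P0, P1))/2.

```python
import numpy as np
from itertools import combinations

# Brute-force check of Le Cam's Lemma on a small finite alphabet:
# the best test's error equals (1 - TV(P0, P1)) / 2.

p0 = np.array([0.1, 0.2, 0.3, 0.4])
p1 = np.array([0.4, 0.3, 0.2, 0.1])

tv = 0.5 * np.sum(np.abs(p0 - p1))

best_error = 1.0
support = range(len(p0))
# Enumerate every acceptance region A (the set where the test outputs 0).
for r in range(len(p0) + 1):
    for A in combinations(support, r):
        A = list(A)
        # Average of the two error probabilities: P0(X not in A) and P1(X in A).
        error = 0.5 * ((1 - p0[A].sum()) + p1[A].sum())
        best_error = min(best_error, error)

print(best_error, 0.5 * (1 - tv))
assert np.isclose(best_error, 0.5 * (1 - tv))
```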
3 Tensorization
Given the above fact, that all of the distances we defined so far are in some sense “fun-
damental” in hypothesis testing, a natural question is why do we need all these different
distances?
The answer is a bit technical, but roughly, when we want to compute a lower bound (i.e.
understand the fundamental statistical difficulty of our problem) some divergences might be
easier to compute than others. For instance, it is often the case that for mixture distributions
the χ² divergence is easy to compute, while for many parametric models the KL divergence is natural (in part because it is closely related to the Fisher information). Knowing which divergence to use when is a bit of an art, but having many tools in your toolbox is always useful.
One natural thing that will arise in statistical applications is that unlike the above setting
of Le Cam’s Lemma we will observe n i.i.d. samples X1 , . . . , Xn ∼ P , rather than just one
sample. Everything we have said so far works in exactly the same way, except we need
to calculate the distance between the product measures, i.e. d(Pⁿ, Qⁿ), where the product measure Pⁿ has density:
p(X1, . . . , Xn) = ∏_{i=1}^n p(Xi).
For the TV distance this turns out to be quite difficult to do directly. However, one of the
most useful properties of the Hellinger and KL distance is that they tensorize, i.e. they
behave nicely with respect to product distributions. In particular, we have the following
useful relationships:
KL(Pⁿ, Qⁿ) = n KL(P, Q),
H²(Pⁿ, Qⁿ) = 2 (1 − (1 − H²(P, Q)/2)ⁿ) = 2 (1 − ρ(P, Q)ⁿ),
where ρ(P, Q) = ∫ √(p(x) q(x)) dx is the affinity defined earlier. The key point is that when we see n i.i.d. samples it is easy to compute the KL and Hellinger.
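Here is a small illustrative check of these identities (the distributions p and q below are arbitrary): for two distributions on a two-point alphabet we form the n-fold product distributions explicitly and compare the KL and squared Hellinger of the products to n KL(P, Q) and 2(1 − ρ(P, Q)ⁿ).

```python
import numpy as np
from itertools import product

# Check the tensorization identities by explicitly forming the n-fold products.

p = np.array([0.3, 0.7])
q = np.array([0.6, 0.4])
n = 4

def kl(a, b):
    return np.sum(a * np.log(a / b))

def hellinger_sq(a, b):
    return 2 * (1 - np.sum(np.sqrt(a * b)))

# Joint pmfs of (X1, ..., Xn) under P^n and Q^n.
pn = np.array([np.prod(p[list(x)]) for x in product(range(len(p)), repeat=n)])
qn = np.array([np.prod(q[list(x)]) for x in product(range(len(q)), repeat=n)])

rho = np.sum(np.sqrt(p * q))
assert np.isclose(kl(pn, qn), n * kl(p, q))                  # KL(P^n, Q^n) = n KL(P, Q)
assert np.isclose(hellinger_sq(pn, qn), 2 * (1 - rho ** n))  # H^2(P^n, Q^n) = 2(1 - rho^n)
print(kl(pn, qn), hellinger_sq(pn, qn))
```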
4 Hypothesis Testing Upper Bounds
One can ask if there are analogous upper bounds, i.e. for instance if the distance between
P0 and P1 gets larger, are there quantitatively better tests for distinguishing them?
For Hellinger and TV the answer turns out to be yes (and for χ2 and KL the answer is yes
under some assumptions). Formally, given n samples from either P0 or P1 you can construct
tests that distinguish between P0 and P1 such that, for some constants c1, c2 > 0:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ c1 exp(−c2 n TV²(P0, P1)),
and similarly
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ c1 exp(−c2 n H²(P0, P1)).
These are sometimes called large-deviation inequalities (and in most cases precise constants
are known).
It turns out that even for distinguishing two hypotheses that are separated in the Hellinger
distance, the likelihood ratio test is optimal (and achieves the above result). The proof is
short and elegant.
Proof: We recall the elementary bound:
log x ≤ x − 1, for all x > 0,
which in turn tells us that:
log ρ(P0, P1) ≤ ρ(P0, P1) − 1 = −H²(P0, P1)/2.
So now let us analyze the LRT which rejects the null if:
∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1.
Let us study its Type I error (its Type II error bound follows essentially the same logic). We
note that:
P0( ∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1 ) = P0( ∏_{i=1}^n P1(Xi)/P0(Xi) ≥ 1 )
= P0( ∏_{i=1}^n √(P1(Xi)/P0(Xi)) ≥ 1 )
≤ E_{P0} [ ∏_{i=1}^n √(P1(Xi)/P0(Xi)) ],
using Markov’s inequality. Now, using independence we see that:
P0( ∏_{i=1}^n P0(Xi)/P1(Xi) ≤ 1 ) ≤ ( E_{P0} [ √(P1(X)/P0(X)) ] )ⁿ
= exp(n log ρ(P0, P1))
≤ exp(−n H²(P0, P1)/2).
Putting this together with an identical bound under the alternative, we obtain:
inf_Ψ (1/2) [P0(Ψ(X) ≠ 0) + P1(Ψ(X) ≠ 1)] ≤ exp(−n H²(P0, P1)/2).
The result is quite nice – it says that the LRT can distinguish two distributions reliably provided their Hellinger distance is large compared to 1/√n. Furthermore, the bound has an exponential form, so you might imagine that it will interact nicely with a union bound (in a multiple testing setup where we want to distinguish between several distributions).
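A quick simulation (illustrative; the Bernoulli parameters, sample size and number of trials are assumptions) runs the LRT for Bernoulli(0.3) versus Bernoulli(0.6) and checks that its average error sits below exp(−n H²/2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the LRT for Bernoulli(p0) vs Bernoulli(p1) and compare its average
# error over the two hypotheses with the bound exp(-n H^2 / 2).

p0, p1 = 0.3, 0.6
n, trials = 50, 20000

# Squared Hellinger distance between Bernoulli(p0) and Bernoulli(p1).
H2 = 2 * (1 - (np.sqrt(p0 * p1) + np.sqrt((1 - p0) * (1 - p1))))

def lrt_rejects_null(x):
    # Reject the null (declare P1) when the likelihood ratio prod p0(Xi)/p1(Xi) <= 1.
    log_lr = np.sum(np.log(np.where(x == 1, p0 / p1, (1 - p0) / (1 - p1))))
    return log_lr <= 0

type1 = np.mean([lrt_rejects_null(rng.binomial(1, p0, n)) for _ in range(trials)])
type2 = np.mean([not lrt_rejects_null(rng.binomial(1, p1, n)) for _ in range(trials)])

avg_error = 0.5 * (type1 + type2)
print(avg_error, np.exp(-n * H2 / 2))   # simulated error vs. the Hellinger bound
assert avg_error <= np.exp(-n * H2 / 2)
```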
5 Inequalities
Given the fact that these four distances are fundamental and that they have potentially
different settings where they are easy to use it is also useful to have inequalities that relate
these distances.
The following inequalities reveal a sort of hierarchy between the distances:
TV(P, Q) ≤ H(P, Q) ≤ √(KL(P, Q)) ≤ √(χ²(P, Q)).
This chain of inequalities should explain why it is the case that if any of these distances are
too small we cannot distinguish the distributions. In particular, if any distance is too small
then the TV must be small and we then use Le Cam’s Lemma.
There are also reverse inequalities in some cases (but not all). For instance:
(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q),
so up to the square factor Hellinger and TV are closely related.
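These inequalities are easy to spot-check numerically. The sketch below (illustrative; random 5-point distributions drawn from a Dirichlet) verifies the chain and the reverse Hellinger bound on many random pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Spot-check TV <= H <= sqrt(KL) <= sqrt(chi^2) and H^2/2 <= TV <= H
# on randomly drawn discrete distributions.

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.sum(np.abs(p - q))
    H = np.sqrt(2 * (1 - np.sum(np.sqrt(p * q))))
    kl = np.sum(p * np.log(p / q))
    chi2 = np.sum(p ** 2 / q) - 1
    assert tv <= H + 1e-12
    assert H <= np.sqrt(kl) + 1e-12
    assert np.sqrt(kl) <= np.sqrt(chi2) + 1e-12
    assert 0.5 * H ** 2 <= tv + 1e-12

print("all inequalities verified on 1000 random pairs")
```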
1. Fisher information distance: for two distributions Pθ1, Pθ2 we have that:
d(Pθ1, Pθ2) = (θ1 − θ2)ᵀ I(θ1) (θ1 − θ2).
2. Mahalanobis distance: for two distributions Pθ1, Pθ2, with means µ1, µ2 and covariances Σ1, Σ2, the Mahalanobis distance would be:
d(Pθ1, Pθ2) = (µ1 − µ2)ᵀ Σ1⁻¹ (µ1 − µ2).
This is just the Fisher distance for the Gaussian family with known covariance.
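As a small illustration (the means and covariance below are made up), the squared Mahalanobis distance is just a quadratic form; scipy's mahalanobis helper returns the square root of the same quantity.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Squared Mahalanobis distance between two Gaussians with means mu1, mu2,
# using the covariance Sigma1 of the first distribution.

mu1 = np.array([0.0, 1.0])
mu2 = np.array([2.0, -1.0])
Sigma1 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

diff = mu1 - mu2
d_sq = diff @ np.linalg.inv(Sigma1) @ diff      # (mu1 - mu2)^T Sigma1^{-1} (mu1 - mu2)

# scipy returns the square root of this quadratic form.
d_sq_scipy = mahalanobis(mu1, mu2, np.linalg.inv(Sigma1)) ** 2

print(d_sq, d_sq_scipy)
assert np.isclose(d_sq, d_sq_scipy)
```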
Huber's model is thus very closely related to model mis-specification in TV. As a result, the TV minimum distance estimator is very robust to outliers. The main drawback, however, is a computational one: the TV minimum distance estimator is often difficult to compute.
If two distributions P, Q are identical then it should be clear that for any (measurable) function f, it must be the case that:
E_{X∼P}[f(X)] = E_{X∼Q}[f(X)].
One might wonder if the reverse implication is true, i.e. if P ≠ Q, must there be some witness function f such that:
E_{X∼P}[f(X)] ≠ E_{X∼Q}[f(X)]?
It turns out that this statement is indeed true. In particular, we have the following lemma.
Lemma 2 Two distributions P, Q are identical if and only if for every continuous function f ∈ C(X),
E_{X∼P}[f(X)] = E_{X∼Q}[f(X)].
This result suggests that to measure the distance between two distributions we could use a so-called integral probability metric (IPM):
d_F(P, Q) = sup_{f∈F} |E_{X∼P}[f(X)] − E_{X∼Q}[f(X)]|,
where F is a class of functions. We note that the TV distance is thus just an IPM with F = {f : ‖f‖∞ ≤ 1} (up to the factor of 2 in our normalization). This class of functions (as well as the class of all continuous functions) is too large to be useful statistically, i.e. these IPMs are not easy to estimate from data, so we instead use function classes of smooth functions.
One popular choice is the class of 1-Lipschitz functions, which gives the 1-Wasserstein distance:
W1(P, Q) = sup_{f: ‖f‖_Lip ≤ 1} |E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)]|.
As with the TV there are many alternative ways of defining the Wasserstein distance. In particular, there are very nice interpretations of Wasserstein as a distance between “couplings” and as a so-called transportation distance.
The Wasserstein distance has the somewhat nice property of being well-defined between a
discrete and continuous distribution, i.e. the two distributions you are comparing do not
need to have the same support. This is one of the big reasons why it is popular in ML.
In particular, a completely reasonable estimate of the Wasserstein distance between two
distributions, given samples from each of them is the Wasserstein distance between the
corresponding empirical measures, i.e. we estimate:
Ŵ1(P, Q) = W1(Pn, Qn),
where Pn (for instance) is the distribution that puts mass 1/n on each sample point.
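For univariate samples this plug-in estimate is a one-liner with scipy. The example below (with assumed Gaussian samples) estimates W1 between N(0, 1) and N(1, 1); in one dimension the true value is the mean shift, which is 1 here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)

# Plug-in estimate: W1 between the empirical measures of the two samples.
n = 5000
x = rng.normal(loc=0.0, scale=1.0, size=n)   # samples from P = N(0, 1)
y = rng.normal(loc=1.0, scale=1.0, size=n)   # samples from Q = N(1, 1)

print(wasserstein_distance(x, y))   # close to the true W1, which is 1 here
```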
There are many other nice ways to interpret the Wasserstein distance, and it has many other
elegant properties. It is central in the field of optimal transport. A different expression for
the Wasserstein distance involves couplings: a coupling is a joint distribution J over X and Y such that the marginal over X is P and the marginal over Y is Q. Then the W1 distance (or more generally the Wp distance) is:
Wp(P, Q) = ( inf_J E_{(X,Y)∼J} ‖X − Y‖^p )^{1/p},
where the infimum is over all couplings J of P and Q.
One can also replace the Euclidean distance by any metric on the space on which P and Q
are defined. More generally, there is a way to view Wasserstein distances as measuring the
cost of optimally moving mass of the distribution P to make it look like the distribution Q
(hence, the term optimal transport).
The Wasserstein distance also arises frequently in image processing. In part, this is be-
cause the Wasserstein barycenter (a generalization of a mean) of a collection of distributions
preserves the shape of the distribution. This is quite unlike the “usual” average of the
distributions.
Another popular choice is to take F to be the unit ball of a Reproducing Kernel Hilbert Space (RKHS); the resulting IPM is called the Maximum Mean Discrepancy (MMD). Because the space is an RKHS with a kernel k, it turns out we can write this distance as:
MMD²(P, Q) = E_{X,X′∼P} k(X, X′) + E_{Y,Y′∼Q} k(Y, Y′) − 2 E_{X∼P, Y∼Q} k(X, Y).
Intuitively, we are contrasting how similar the samples from P look to each other (and how similar the samples from Q look to each other) with how similar the samples from P are to the samples from Q. If P and Q are the same then all of these expectations are the same and the MMD is 0.
The key point that makes the MMD so popular is that it is completely trivial to estimate the
MMD since it is a bunch of expected values for which we can use the empirical expectations.
We have discussed U-statistics before; the MMD can be estimated by a simple U-statistic:
M̂MD²(P, Q) = (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(Xi, Xj) + (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(Yi, Yj) − (2/n²) Σ_{i=1}^n Σ_{j=1}^n k(Xi, Yj).
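Here is a minimal sketch of this estimator with a Gaussian RBF kernel (the function names, bandwidth and sample sizes are illustrative assumptions). The within-sample averages drop the diagonal terms, exactly as in the U-statistic above.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)), computed pairwise.
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_sq_unbiased(X, Y, bandwidth=1.0):
    n, m = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    # Drop the diagonal terms in the within-sample averages (U-statistic).
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(500, 2))                         # samples from P
Y = rng.normal(0.5, 1.0, size=(500, 2))                         # samples from Q (shifted mean)
print(mmd_sq_unbiased(X, Y))                                    # noticeably positive
print(mmd_sq_unbiased(X, rng.normal(0.0, 1.0, size=(500, 2))))  # close to 0
```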
However, the usefulness of the MMD hinges not on how well we can estimate it but on how
strong a notion of distance it is, i.e. for distributions that are quite different (say in the
TV/Hellinger/. . . distances), is it the case that the MMD is large? This turns out to be
quite a difficult question to answer.
9 Fano’s Inequality
We focused primarily on the role of f -divergences in testing but they are equally fundamental
in providing lower bounds for estimation. In estimation we obtain samples X1 , . . . , Xn ∼ Pθ
where Pθ ∈ PΘ, and our goal is to estimate θ (say with small ℓ2 error).
In an intuitive sense, estimation is a lot like a multiple hypothesis testing problem of the following form – I give you samples from one of M distributions {Pθ1, . . . , PθM} and I ask you to figure out which one it was. Our estimator Ψ in this context simply takes in n samples and returns an index in {1, . . . , M}.
We could imagine the setting where we sample an index u uniformly from {1, . . . , M} and generate n samples from Pθu. We define the following notion of error:
err = P(Ψ(X1, . . . , Xn) ≠ u),
where the probability is over both the random index u and the samples.
Fano’s inequality (and others like it) relate how hard this testing problem is (i.e. they lower
bound err) to a function of some distance between the distributions {Pθ1 , . . . , PθM }.
Suppose that M ≥ 3, and for some small constant c1 > 0:
(1/M²) Σ_{i=1}^M Σ_{j=1}^M KL(Pθi ‖ Pθj) ≤ c1 log M,
then for some other constant c2 > 0, err > c2. In words, Fano's inequality says that if the average pairwise KL divergence is small then the multiple testing problem is difficult.
So how does this relate to estimation? Suppose we additionally ensure that for all pairs (i, j)
we have that ‖θi − θj‖²₂ ≥ (2δ)², then it is easy to verify that the minimax estimation error:
inf_θ̂ sup_{θ∈Θ} E‖θ̂ − θ‖²₂ ≥ inf_θ̂ sup_{θ∈{θ1,...,θM}} E‖θ̂ − θ‖²₂ ≥ inf_θ̂ (1/M) Σ_{i=1}^M E_{X1,...,Xn∼Pθi} ‖θ̂ − θi‖²₂ ≥ c2 δ².
This should make sense: if we can find parameters that are well-separated but whose underlying distributions are very close, then the estimation problem should be difficult.
An Application: So why do we need Fano’s inequality? Suppose we consider establishing
a lower bound for estimating the mean of a Normal distribution. We have already seen that
the minimax error is at least σ²d/n (but this required a complicated Bayes argument). On the other hand, using a simple versus simple testing problem (with two separated Normals) we can easily show via Le Cam's lemma that the error is at least σ²/n. To get the right dimension
dependence we will need to use the full power of Fano’s inequality.
Let us see how this works. We need to know a few facts: one is that there is a packing of the radius-4δ sphere of size c2^d (for some c > 0), such that for every pair i ≠ j:
‖θi − θj‖₂ ≥ 2δ,
i.e. there are roughly 2^d vectors in the 4δ-sphere that are well-separated (i.e. are separated by at least 2δ).
Additionally, we can calculate the KL divergence between n samples from N(θi, σ²Id) and n samples from N(θj, σ²Id):
KL(Pθi, Pθj) = (n/(2σ²)) ‖θi − θj‖²₂ ≤ c n δ²/σ².
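A quick Monte Carlo check (illustrative; the dimension, σ and the θ's below are arbitrary) confirms the per-sample formula ‖θi − θj‖²/(2σ²); multiplying by n then gives the n-sample KL by tensorization.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Monte Carlo check: KL(N(theta_i, s^2 I), N(theta_j, s^2 I)) = ||theta_i - theta_j||^2 / (2 s^2).
d, sigma = 5, 2.0
theta_i = rng.normal(size=d)
theta_j = rng.normal(size=d)

Pi = multivariate_normal(mean=theta_i, cov=sigma ** 2 * np.eye(d))
Pj = multivariate_normal(mean=theta_j, cov=sigma ** 2 * np.eye(d))

X = theta_i + sigma * rng.normal(size=(200000, d))      # samples from P_{theta_i}
kl_mc = np.mean(Pi.logpdf(X) - Pj.logpdf(X))            # Monte Carlo estimate of the KL
kl_formula = np.sum((theta_i - theta_j) ** 2) / (2 * sigma ** 2)

print(kl_mc, kl_formula)   # close; multiply by n for the KL between the n-sample distributions
```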