Lec 5: Introduction to Probability and Statistics
Email: [email protected]
URL: https://www.zabaras.com/
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Contents
Markov and Chebyshev Inequalities
References
• Following closely Chris Bishop's PRML book, Chapter 2
Markov and Chebyshev Inequalities
You can show (Markov's inequality) that if 𝑋 is a non-negative integrable random variable, then for any 𝑎 > 0:
$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$
Indeed:
$$\mathbb{E}[X] = \int_0^\infty x\,p(x)\,dx \ge \int_a^\infty x\,p(x)\,dx \ge a\int_a^\infty p(x)\,dx = a\,\Pr[X \ge a]$$
You can generalize this using any function of the random variable 𝑋 as:
$$\Pr\big[f(X) \ge a\big] \le \frac{\mathbb{E}[f(X)]}{a}$$
Using 𝑓(𝑋) = (𝑋 − 𝔼[𝑋])2 , we derive the following Chebyshev inequality:
For any 𝜀 > 0,
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge \varepsilon\,\big] \le \frac{\sigma^2}{\varepsilon^2}$$
In terms of standard deviations, we can restate this as:
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge k\sigma\,\big] \le \frac{1}{k^2}$$
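As a quick numerical sanity check (not part of the original slides), the following Python/NumPy sketch compares empirical tail probabilities of an exponential random variable with the Markov and Chebyshev bounds; the distribution and the thresholds are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10**6)   # non-negative X with E[X] = 1, var[X] = 1

a = 3.0
markov_bound = x.mean() / a                  # Pr[X >= a] <= E[X]/a
print(np.mean(x >= a), "<=", markov_bound)

eps = 2.0
cheb_bound = x.var() / eps**2                # Pr[|X - E[X]| >= eps] <= var[X]/eps^2
print(np.mean(np.abs(x - x.mean()) >= eps), "<=", cheb_bound)
```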
Example of the Law of Large Numbers
Assume that we sample $S = \{x_1, x_2, \ldots, x_N\}$, $x_j \in \mathbb{R}^2$, from a two-dimensional Gaussian:
$$\mathcal{N}(x \mid x_0, \boldsymbol{\Sigma}) = \frac{1}{2\pi\,(\det\boldsymbol{\Sigma})^{1/2}} \exp\!\left(-\frac{1}{2}(x - x_0)^T \boldsymbol{\Sigma}^{-1}(x - x_0)\right)$$
Our problem is to estimate $x_0$ and $\boldsymbol{\Sigma}$.
Empirical Mean and Empirical Covariance
From the law of large numbers, we calculate:
$$x_0 = \mathbb{E}[X] \approx \frac{1}{N}\sum_{j=1}^{N} x_j = \bar{x}$$
To compute the covariance matrix, note that if 𝑋1, 𝑋2, … are i.i.d., then so are 𝑓(𝑋1), 𝑓(𝑋2), … for any function 𝑓:
$$\boldsymbol{\Sigma} = \mathbb{E}\big[(X - x_0)(X - x_0)^T\big] \approx \frac{1}{N}\sum_{j=1}^{N}(x_j - \bar{x})(x_j - \bar{x})^T = \bar{\boldsymbol{\Sigma}}$$
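A minimal Python/NumPy sketch of this slide (the true mean and covariance below are made-up values for illustration): draw N samples from a 2-D Gaussian and compare the empirical mean and covariance with the true parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x0_true = np.array([1.0, -2.0])                    # assumed true mean
Sigma_true = np.array([[2.0, 0.6], [0.6, 1.0]])    # assumed true covariance

N = 100_000
S = rng.multivariate_normal(x0_true, Sigma_true, size=N)   # samples x_j in R^2

x_bar = S.mean(axis=0)                              # empirical mean
Sigma_bar = (S - x_bar).T @ (S - x_bar) / N         # empirical covariance (1/N convention)

print("mean estimate:\n", x_bar)
print("covariance estimate:\n", Sigma_bar)
```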
The Central Limit Theorem
Let (𝑋1, 𝑋2, … , 𝑋𝑁) be independent and identically distributed (i.i.d.) continuous random variables, each with expectation 𝜇 and variance 𝜎².
Define:
$$Z_N = \frac{1}{\sigma\sqrt{N}}\big(X_1 + X_2 + \cdots + X_N - N\mu\big) = \frac{\bar{X}_N - \mu}{\sigma/\sqrt{N}}, \qquad \bar{X}_N = \frac{1}{N}\sum_{j=1}^{N} X_j$$
As 𝑁 → ∞, the distribution of 𝑍𝑁 converges to the distribution of a standard normal random variable:
$$\lim_{N\to\infty} P\big(Z_N \le x\big) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$
Equivalently, if $\bar{X}_N = \frac{1}{N}\sum_{j=1}^{N} X_j$, then for 𝑁 large,
$$\bar{X}_N \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right) \quad \text{as } N \to \infty$$
The CLT and the Gaussian Distribution
As an example, assume 𝑁 variables (𝑋1, 𝑋2, … 𝑋𝑁 ) each of
which has a uniform distribution over [0, 1] and then
consider the distribution of the sample mean
(𝑋1 + 𝑋2 + … + 𝑋𝑁 )/𝑁. For large 𝑁, this distribution tends
to a Gaussian. The convergence as 𝑁 increases can be
rapid.
[Figure: histograms of the sample mean of 𝑁 uniform [0, 1] variables for 𝑁 = 1, 2, and 10; the distribution rapidly approaches a Gaussian as 𝑁 grows.]
MatLab Code
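The figure above comes from the course's MATLAB code; a rough Python equivalent (an illustrative sketch, not the original script) is:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, N in zip(axes, [1, 2, 10]):
    # 10000 realizations of the mean of N uniform(0,1) variables
    means = rng.uniform(0, 1, size=(10_000, N)).mean(axis=1)
    ax.hist(means, bins=30, density=True)
    ax.set_title(f"N = {N}")
plt.tight_layout()
plt.show()
```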
The CLT and the Gaussian Distribution
We plot a histogram of $\frac{1}{N}\sum_{i=1}^{N} x_{ij}$, $j = 1, \ldots, 10000$, where $x_{ij} \sim \text{Beta}(1, 5)$.
[Figure: histograms of the sample mean of 𝑁 Beta(1, 5) samples; for 𝑁 = 10 the histogram is already close to a Gaussian.]
Run centralLimitDemo from PMTK
Accuracy of Monte Carlo Approximation
In the Monte Carlo approximation of the mean 𝜇 by the sample mean 𝜇̄, we have:
For $\mu = \mathbb{E}[f(x)]$:
$$\bar{\mu} - \mu \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{N}\right), \qquad \sigma^2 = \mathbb{E}[f^2(x)] - \mathbb{E}[f(x)]^2 \approx \frac{1}{N}\sum_{s=1}^{N}\big(f(x_s) - \bar{\mu}\big)^2 \equiv \bar{\sigma}^2$$
We can now derive the following error bars (using central
intervals):
$$\Pr\!\left[\mu - 1.96\,\frac{\bar{\sigma}}{\sqrt{N}} \;\le\; \bar{\mu} \;\le\; \mu + 1.96\,\frac{\bar{\sigma}}{\sqrt{N}}\right] = 0.95$$
The number of samples needed to drop the error below a tolerance 𝜀 is then:
$$1.96\,\frac{\bar{\sigma}}{\sqrt{N}} \le \varepsilon \;\Rightarrow\; N \ge \frac{4\bar{\sigma}^2}{\varepsilon^2}$$
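A short Python sketch of these error bars; the integrand f and the sampling distribution p(x) below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x**2                       # illustrative integrand
x = rng.normal(1.5, 0.25, size=1_000)    # samples from an assumed p(x)

N = x.size
mu_bar = f(x).mean()
sigma_bar = f(x).std()                               # sqrt of (1/N) sum (f(x_s) - mu_bar)^2
half_width = 1.96 * sigma_bar / np.sqrt(N)           # 95% central-interval error bar
print(f"mu_bar = {mu_bar:.4f} +/- {half_width:.4f}")

eps = 0.01
N_needed = int(np.ceil(4 * sigma_bar**2 / eps**2))   # samples needed so the error <= eps
print("samples needed for error <=", eps, ":", N_needed)
```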
Monte Carlo Approximation of Distributions
Computing the distribution of 𝑦 = 𝑥², where 𝑝(𝑥) is uniform on (−1, 1). The MC approximation is shown on the right: take samples from 𝑝(𝑥), square them, and then plot the histogram.
[Figure: the uniform density 𝑝(𝑥) on (−1, 1) and the MC histogram approximation of 𝑝(𝑦).]
Run changeofVarsDemo1d
from PMTK
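In Python the same construction takes only a few lines (a sketch mirroring the PMTK demo, with x uniform on (−1, 1) assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100_000)    # samples from p(x) = Uniform(-1, 1)
y = x**2                                # push the samples through y = x^2

# histogram approximation of p(y); the exact density is p(y) = 1/(2*sqrt(y)) on (0, 1]
plt.hist(y, bins=50, density=True, label="MC approximation of p(y)")
ygrid = np.linspace(1e-3, 1, 200)
plt.plot(ygrid, 1 / (2 * np.sqrt(ygrid)), label="exact p(y)")
plt.legend()
plt.show()
```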
Accuracy of Monte Carlo Approximation
The accuracy of MC increases with the number of samples. Shown are histograms (left) and kernel density estimates (right) for 10 and 100 samples; the actual distribution, 𝒩(𝑥 | 1.5, 0.25²), is shown in red.
[Figure: histograms and kernel density estimates from 10 samples (top) and 100 samples (bottom), compared with the true density 𝒩(𝑥 | 1.5, 0.25²).]
Run mcAccuracyDemo from PMTK
Example of CLT: Estimating 𝜋 by MC
Use the CLT to approximate 𝜋. Let 𝑥, 𝑦~𝒰[−𝑟, 𝑟].
$$I = \int_{-r}^{r}\!\!\int_{-r}^{r} \mathbb{I}(x^2 + y^2 \le r^2)\,dx\,dy = \pi r^2$$
$$\pi = \frac{I}{r^2} = \frac{1}{r^2}\int_{-r}^{r}\!\!\int_{-r}^{r} 4r^2\,\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy = \int_{-r}^{r}\!\!\int_{-r}^{r} 4\,\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy$$
$$\bar{\pi} \approx \frac{4}{N}\sum_{s=1}^{N}\mathbb{I}(x_s^2 + y_s^2 \le r^2), \qquad x_s, y_s \sim \mathcal{U}[-r, r]$$
[Figure: samples (𝑥, 𝑦) ~ 𝒰[−𝑟, 𝑟]²; points with 𝑥² + 𝑦² ≤ 𝑟² fall inside the circle.]
We find 𝜋̄ = 3.1416 with standard error 0.09.
Run mcEstimatePi from PMTK
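A minimal Python sketch of this estimator, including the CLT-based standard error (r = 1 is assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
N, r = 100_000, 1.0
x = rng.uniform(-r, r, size=N)
y = rng.uniform(-r, r, size=N)

g = 4.0 * (x**2 + y**2 <= r**2)        # g_s = 4 * I(x_s^2 + y_s^2 <= r^2)
pi_bar = g.mean()                      # MC estimate of pi
std_err = g.std() / np.sqrt(N)         # CLT standard error of the estimate
print(f"pi_bar = {pi_bar:.4f} +/- {std_err:.4f}")
```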
CLT: The Binomial Tends to a Gaussian
One consequence of the CLT is that the binomial distribution
$$\text{Bin}(m \mid N, \theta) = \binom{N}{m}\theta^m (1 - \theta)^{N - m}$$
tends for large 𝑁 to a Gaussian:
$$\frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{m}{N} \sim \mathcal{N}\!\left(\theta, \frac{\theta(1-\theta)}{N}\right), \qquad m \sim \mathcal{N}\big(N\theta,\; N\theta(1-\theta)\big)$$
[Figure: Bin(𝑁 = 10, 𝜃 = 0.25) compared with its Gaussian approximation.]
Matlab Code
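A short Python check of this approximation for N = 10 and θ = 0.25 (matching the figure), using scipy's binom and norm:

```python
import numpy as np
from scipy.stats import binom, norm

N, theta = 10, 0.25
gauss = norm(loc=N * theta, scale=np.sqrt(N * theta * (1 - theta)))

for m in range(N + 1):
    # compare Bin(m | N, theta) with the Gaussian density evaluated at m
    print(m, f"{binom.pmf(m, N, theta):.4f}", f"{gauss.pdf(m):.4f}")
```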
Poisson Process
Consider that we count the number of photons from a light source. Let
𝑁(𝑡) be the number of photons observed in the time interval [0, 𝑡]. 𝑁(𝑡) is
an integer-valued random variable. We make the following assumptions:
a. Stationarity: Let ∆1 and ∆2 be any two time intervals of equal length, 𝑛
any non-negative integer. Assume that
Prob. of 𝑛 photons in ∆1 = Prob. of 𝑛 photons in ∆2
b. Independent increments: Let ∆1, ∆2, … , ∆𝑛 be non-overlapping time
intervals and 𝑘1, 𝑘2, … , 𝑘𝑛 non-negative integers. Denote by 𝐴𝑗 the event
defined as
𝐴𝑗 = 𝑘𝑗 photons arrive in the time interval ∆𝑗
Assume that these events are mutually independent, i.e.
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2)\cdots P(A_n)$$
Under these assumptions one can show that 𝑁(𝑡) is a Poisson process:
$$P\big(N(t) = n\big) = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}, \qquad \lambda > 0,\; n = 0, 1, 2, \ldots,$$
i.e. $N(t) \sim \text{Poisson}(\theta)$ with $\theta = \lambda t$ and
$$P(N = n) = e^{-\theta}\,\frac{\theta^n}{n!}$$
Poisson Process
Consider the Poisson (discrete) distribution, $N \in \{0, 1, 2, \ldots\}$:
$$P(N = n) = \text{Poisson}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$
Approximating a Poisson Distribution With a Gaussian
Theorem: A random variable 𝑋 ~ Poisson(𝜃) can be considered as the sum of 𝑛 independent random variables 𝑋𝑖 ~ Poisson(𝜃/𝑛).
[Figure: Poisson pmfs (dots) and their Gaussian approximations (solid lines) for several means, including mean = 15 and mean = 20.]
Poisson distributions (dots) vs. their Gaussian approximations (solid lines) for various values of the mean 𝜃. The higher the 𝜃, the smaller the distance between the two distributions. See this MatLab implementation.
Kullback-Leibler Distance Between Two Densities
Let us consider the following two distributions:
$$\pi_n = \text{Poisson}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$
$$\pi_x = \text{Gaussian}(x \mid \theta, \theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\!\left(-\frac{1}{2\theta}(x - \theta)^2\right)$$
The KL distance between them is
$$\text{KL}\big(\text{Poisson}(\cdot \mid \theta),\ \text{Gaussian}(\cdot \mid \theta, \theta)\big) = \sum_{n=0}^{\infty}\text{Poisson}(n \mid \theta)\,\log\frac{\text{Poisson}(n \mid \theta)}{\text{Gaussian}(n \mid \theta, \theta)}$$
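The sketch below evaluates this sum numerically in Python (truncating the sum at a large n); the specific θ values are illustrative and mirror the trend in the next figure.

```python
import numpy as np
from scipy.stats import poisson, norm

def kl_poisson_gaussian(theta, n_max=500):
    n = np.arange(n_max)
    p = poisson.pmf(n, theta)                          # Poisson(n | theta)
    q = norm.pdf(n, loc=theta, scale=np.sqrt(theta))   # Gaussian(n | theta, theta)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

for theta in [1, 5, 15, 20, 50]:
    print(theta, kl_poisson_gaussian(theta))
```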
Approximating a Poisson Distribution With a Gaussian
[Figure: KL distance between the Poisson distribution and its Gaussian approximation as a function of the mean 𝜃 (log scale); the distance decreases as 𝜃 increases.]
Application of the CLT: Noise Signals
Consider a discrete measurement whose output is a noise vector of length 𝑛. We estimate the mean and the variance of the noise in a single measurement as follows:
$$x_0 = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad s^2 = \frac{1}{n}\sum_{j=1}^{n}\big(x_j - x_0\big)^2$$
To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:
$$\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \in \mathbb{R}^n$$
The average noise is a realization of the random variable
$$\bar{X} = \frac{1}{N}\sum_{k=1}^{N} X^{(k)}$$
If $X^{(1)}, X^{(2)}, \ldots$ are i.i.d., $\bar{X}$ is asymptotically a Gaussian by the CLT, and its variance is $\text{var}(\bar{X}) = \dfrac{s^2}{N}$. We need to repeat the experiment until
$$\frac{s^2}{N} \le \varepsilon^2$$
where 𝜀 is a given tolerance.
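A Python sketch of this averaging experiment (Gaussian noise vectors of length 50 with σ² = 1, as in the next slide):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50                                          # length of each noise vector
for N in [1, 5, 10, 25]:
    noise = rng.normal(0.0, 1.0, size=(N, n))   # N repeated noise measurements
    x_bar = noise.mean(axis=0)                  # averaged noise vector
    print(N, "std of averaged noise:", x_bar.std(), "predicted:", 1 / np.sqrt(N))
```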
Noise Reduction By Averaging
Gaussian noise vectors of size 50 are used, with 𝜎² = 1; 𝑠 is the std of a single noise vector.
[Figure: averaged noise vectors for 𝑁 = 1, 5, 10, and 25 measurements; the estimated noise level $\sqrt{s^2/N}$ decreases as 𝑁 grows.]
Introduction to Information Theory
Information theory is concerned with representing data in a compact fashion (data compression or source coding), and with transmitting and storing it in a way that is robust to errors (channel coding).
Introduction to Information Theory
Decoding messages sent over noisy channels requires
having a good probability model of the kinds of
messages that people tend to send.
• David MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003 (available online).
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley, 2006.
• A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding, McGraw-Hill, 1979.
Introduction to Information Theory
Consider a discrete random variable 𝑥. We ask how
much information (‘degree of surprise’) is received when
we observe (learn) a specific value for this variable?
Entropy
From ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦) and 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦), it is easily shown that ℎ(𝑥) must be given by the logarithm of 𝑝(𝑥), and so we have
$$h(x) = -\log_2 p(x)$$
The entropy of the random variable 𝑥 is the expectation of ℎ(𝑥):
$$\mathbb{H}[x] = -\sum_x p(x)\log_2 p(x)$$
Noiseless Coding Theorem (Shannon)
Example 1 (Coding theory): 𝑥 is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of 𝑥?
All states equally likely:
$$\mathbb{H}[x] = -8 \times \frac{1}{8}\log_2\frac{1}{8} = 3 \text{ bits}$$
Example 2: consider a variable having 8 possible states
{𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓, 𝑔, ℎ} for which the respective (non-uniform) probabilities
are given by ( 1/2 , 1/4 , 1/8 , 1/16 , 1/64 , 1/64 , 1/64 , 1/64 ).
The entropy in this case is smaller than for the uniform distribution:
$$\mathbb{H}[x] = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{16}\cdot 4 + \frac{4}{64}\cdot 6 = 2 \text{ bits}$$
Distributions 𝑝(𝑥) that are sharply peaked around a few values will
have a relatively low entropy, whereas those that are spread more
evenly across many values will have higher entropy.
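A short Python computation of the two entropies in these examples (3 bits for the uniform case, 2 bits for the non-uniform one):

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

uniform = np.full(8, 1 / 8)
skewed = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
print(entropy_bits(uniform))  # 3.0 bits
print(entropy_bits(skewed))   # 2.0 bits
```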
Maximum Entropy: Uniform Distribution
The maximum entropy configuration can be found by maximizing H using
a Lagrange multiplier to enforce the normalization constraint on the
probabilities. Thus we maximize
$$\widetilde{\mathbb{H}} = -\sum_i p(x_i)\ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right)$$
We find $p(x_i) = 1/M$, where 𝑀 is the number of possible states, and $\mathbb{H} = \ln M$.
To verify that the stationary point is indeed a maximum, we can evaluate
the 2nd derivative of the entropy, which gives
$$\frac{\partial^2\widetilde{\mathbb{H}}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\,\frac{1}{p_i}$$
where $I_{ij}$ are the elements of the identity matrix.
$$\mathbb{H} = -\sum_i p(x_i)\ln p(x_i) = \sum_i p(x_i)\ln\frac{1}{p(x_i)} \le \ln\left(\sum_i p(x_i)\,\frac{1}{p(x_i)}\right) = \ln M$$
Here we used Jensen’s inequality (for the concave function log)
Example: Biosequence Analysis
Recall the DNA sequence logo example discussed earlier. Positions where the distribution over the bases is spread out have high entropy, while the limiting case of zero entropy corresponds to a deterministic distribution.
[Figure: sequence logo over sequence positions 1-15.]
Binary Variable
Consider a binary random variable 𝑋 ∈ {0, 1}; we can write 𝑝(𝑋 = 1) = 𝜃 and 𝑝(𝑋 = 0) = 1 − 𝜃.
Hence the entropy becomes (the binary entropy function):
$$\mathbb{H}[X] = -\theta\log_2\theta - (1 - \theta)\log_2(1 - \theta)$$
MatLab function bernoulliEntropyFig from PMTK
[Figure: the binary entropy function versus 𝑝(𝑋 = 1); it attains its maximum of 1 bit at 𝑝(𝑋 = 1) = 0.5.]
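A minimal Python sketch of this plot (an illustrative stand-in for the PMTK bernoulliEntropyFig script):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(1e-6, 1 - 1e-6, 400)
H = -theta * np.log2(theta) - (1 - theta) * np.log2(1 - theta)   # binary entropy (bits)

plt.plot(theta, H)
plt.xlabel("p(X = 1)")
plt.ylabel("H[X] (bits)")
plt.show()
```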
Differential Entropy
Divide 𝑥 into bins of width Δ. Assuming 𝑝(𝑥) is
continuous, for each such bin, there must exist 𝑥𝑖 such
that
$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,dx = p(x_i)\,\Delta \quad (= \text{probability of falling in bin } i)$$
$$\mathbb{H}_\Delta = -\sum_i p(x_i)\Delta\,\ln\big(p(x_i)\Delta\big) = -\sum_i p(x_i)\Delta\,\ln p(x_i) - \ln\Delta$$
$$\lim_{\Delta\to 0}\left\{-\sum_i p(x_i)\Delta\,\ln p(x_i)\right\} = -\int p(x)\ln p(x)\,dx \quad (\text{can be negative})$$
Differential Entropy
For a density defined over multiple continuous
variables, denoted collectively by the vector 𝒙, the
differential entropy is given by
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x}$$
Under a linear change of variables $\boldsymbol{x} = \boldsymbol{A}\boldsymbol{y}$ (so that $p(\boldsymbol{y}) = p(\boldsymbol{x})\,|\boldsymbol{A}|$):
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{y})\ln\frac{p(\boldsymbol{y})}{|\boldsymbol{A}|}\,d\boldsymbol{y} = \mathbb{H}[\boldsymbol{y}] + \ln|\boldsymbol{A}| \;\Rightarrow\; \mathbb{H}[\boldsymbol{x}] - \mathbb{H}[\boldsymbol{y}] = \ln|\boldsymbol{A}|$$
Differential Entropy and the Gaussian Distribution
The distribution that maximizes the differential entropy with
constraints on the first two moments is a Gaussian:
$$\widetilde{\mathbb{H}} = -\int p(x)\ln p(x)\,dx \;+\; \lambda_1\underbrace{\left(\int p(x)\,dx - 1\right)}_{\text{normalization}} \;+\; \lambda_2\underbrace{\left(\int x\,p(x)\,dx - \mu\right)}_{\text{given mean}} \;+\; \lambda_3\underbrace{\left(\int (x - \mu)^2 p(x)\,dx - \sigma^2\right)}_{\text{given variance}}$$
Using calculus of variations,
$$p(x) = e^{-1 + \lambda_1 + \lambda_2 x + \lambda_3(x - \mu)^2} \;\Rightarrow\; p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad (\text{using the constraints to fix } \lambda_1, \lambda_2, \lambda_3)$$
The differential entropy of the Gaussian is
$$\mathbb{H}[x] = \frac{1}{2}\big(1 + \ln(2\pi\sigma^2)\big) = \frac{1}{2}\ln\big((2\pi e)^d\det\boldsymbol{\Sigma}\big), \qquad d = 1,\ \det\boldsymbol{\Sigma} = \sigma^2$$
Note that ℍ[𝑥] < 0 for 𝜎² < 1/(2𝜋𝑒).
Kullback-Leibler Divergence and Cross Entropy
Consider some unknown distribution 𝑝(𝑥), and suppose
that we have modeled this using an approximating
distribution 𝑞(𝑥).
KL Divergence and Cross Entropy
The cross entropy
$$\mathbb{H}[p, q] = -\int p(x)\ln q(x)\,dx$$
is the average number of bits needed to encode data coming from a source with distribution 𝑝 when we use model 𝑞 to define our codebook.
ℍ[𝑝] = ℍ[𝑝, 𝑝] is the expected number of bits using the true model.
The KL divergence is the average number of extra bits
needed to encode the data, because we used distribution q
to encode the data instead of the true distribution p.
The "extra number of bits" interpretation makes it clear that KL(𝑝 ∥ 𝑞) ≥ 0, and that the KL is equal to zero iff 𝑞 = 𝑝:
$$\text{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx = \int p(x)\ln\frac{p(x)}{q(x)}\,dx$$
The KL distance is not a symmetric quantity, that is,
$$\text{KL}(p\,\|\,q) \neq \text{KL}(q\,\|\,p)$$
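A small Python example of these quantities for two discrete distributions; the specific p and q below are arbitrary illustrative choices.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # true source distribution (assumed)
q = np.array([0.4, 0.3, 0.2, 0.1])        # model used to build the codebook (assumed)

H_p   = -np.sum(p * np.log2(p))           # entropy of p: bits needed with the true model
H_pq  = -np.sum(p * np.log2(q))           # cross entropy H(p, q): bits needed when coding with q
KL_pq = np.sum(p * np.log2(p / q))        # extra bits: KL(p || q) = H(p, q) - H(p)
KL_qp = np.sum(q * np.log2(q / p))        # reverse divergence, generally different

print(H_p, H_pq, KL_pq, KL_qp)            # note KL_pq >= 0 and KL_pq != KL_qp in general
```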
KL Divergence Between Two Gaussians
Consider 𝑝(𝑥) = 𝒩(𝑥|𝜇, 𝜎2) and 𝑞(𝑥) = 𝒩(𝑥|𝑚, 𝑠2).
$$\text{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx$$
$$-\int p(x)\ln q(x)\,dx = \int \mathcal{N}(x \mid \mu, \sigma^2)\left[\frac{1}{2}\ln\big(2\pi s^2\big) + \frac{(x - m)^2}{2s^2}\right]dx, \qquad \int p(x)\ln p(x)\,dx = -\frac{1}{2}\ln\big(2\pi e\sigma^2\big)$$
Finally we obtain:
$$\text{KL}(p\,\|\,q) = \frac{1}{2}\left[\ln\frac{s^2}{\sigma^2} + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2} - 1\right]$$
KL Divergence Between Two Gaussians
Consider now 𝑝(𝒙) = 𝒩(𝒙|𝝁, 𝚺) and 𝑞(𝒙) = 𝒩(𝒙|𝒎, 𝑳).
$$\text{KL}(p\,\|\,q) = -\int p(\boldsymbol{x})\ln q(\boldsymbol{x})\,d\boldsymbol{x} + \int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x}$$
where
$$-\int p(\boldsymbol{x})\ln q(\boldsymbol{x})\,d\boldsymbol{x} = \frac{1}{2}\Big[D\ln 2\pi + \ln|\boldsymbol{L}| + \mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\Big]$$
$$\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x} = -\frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{D}{2}\big(1 + \ln 2\pi\big)$$
Combining the two terms,
$$\text{KL}(p\,\|\,q) = \frac{1}{2}\ln\frac{|\boldsymbol{L}|}{|\boldsymbol{\Sigma}|} - \frac{D}{2} + \frac{1}{2}\Big[\mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\Big]$$
Jensen’s Inequality
For a convex function 𝑓, Jensen’s inequality gives (can
be proven easily by induction)
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\ \sum_i\lambda_i = 1$$
Jensen’s Inequality
Using Jensen's inequality
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\ \sum_i\lambda_i = 1,$$
and setting $\lambda_i = p_i$, we obtain $f\big(\mathbb{E}[x]\big) \le \mathbb{E}[f(x)]$.
We can generalize this result to continuous random variables:
$$f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
We will use this shortly in the context of the KL distance.
We often use Jensen’s inequality for concave functions
(e.g. log 𝑥). In that case, be sure you reverse the
inequality!
Jensen’s Inequality: Example
As another example of Jensen’s inequality, consider the
arithmetic and geometric means of a set of real variables:
$$\bar{x}_A = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \bar{x}_G = \left(\prod_{i=1}^{M} x_i\right)^{1/M}$$
Applying Jensen's inequality to the concave function ln 𝑥 (for positive 𝑥𝑖) gives $\ln\bar{x}_A \ge \frac{1}{M}\sum_i\ln x_i = \ln\bar{x}_G$, i.e. $\bar{x}_G \le \bar{x}_A$.
The Kullback-Leibler Divergence
$$f\big(\mathbb{E}[x]\big) \le \mathbb{E}[f(x)], \qquad f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
Using Jensen’s inequality, we can show (−log is a
convex function) that:
$$\text{KL}(p\,\|\,q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \;\ge\; -\ln\int p(x)\,\frac{q(x)}{p(x)}\,dx = -\ln\int q(x)\,dx = 0$$
Principle of Insufficient Reason
An important consequence of the information inequality is
that the discrete distribution with the maximum entropy is
the uniform distribution.
The Kullback-Leibler Divergence
Data compression is in some way related to density
estimation.
$$\text{KL}(p\,\|\,q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \;\approx\; \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(x_n \mid \boldsymbol{\theta}) + \ln p(x_n)\big]$$
where the sum is the sample-average approximation of the mean.
The KL Divergence Vs. MLE
Note that only the first term in this approximation is a function of 𝑞 (i.e. of 𝜽); hence minimizing the KL divergence with respect to 𝜽 is equivalent to maximizing the likelihood of the data under 𝑞.
$$\text{KL}(p\,\|\,q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \;\approx\; \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(x_n \mid \boldsymbol{\theta}) + \ln p(x_n)\big]$$
Conditional Entropy
For a joint distribution, the conditional entropy is
$$\mathbb{H}[y \mid x] = -\iint p(y, x)\ln p(y \mid x)\,dy\,dx$$
Conditional Entropy for Discrete Variables
Consider the conditional entropy for discrete variables
$$\mathbb{H}[y \mid x] = -\sum_i\sum_j p(y_i, x_j)\ln p(y_i \mid x_j)$$
To understand further the meaning of conditional entropy,
let us consider the implications of H[𝑦|𝑥] = 0.
We have:
$$\mathbb{H}[y \mid x] = -\sum_j\left[\sum_i p(y_i \mid x_j)\ln p(y_i \mid x_j)\right] p(x_j) = 0$$
Since each bracketed term is non-negative, it must vanish for every 𝑥𝑗 with 𝑝(𝑥𝑗) > 0; that is, each conditional 𝑝(𝑦|𝑥𝑗) puts all its mass on a single value, so 𝑦 is a deterministic function of 𝑥.
$$\mathbb{I}[x, y] \equiv \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
Mutual Information
The mutual information represents the reduction in the
uncertainty about 𝑥 once we learn the value of 𝑦 (and
reversely).
$$\mathbb{I}[x, y] \equiv \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
[Figure: diagram relating ℍ[𝑥], ℍ[𝑦], ℍ[𝑥|𝑦], ℍ[𝑦|𝑥], and 𝕀[𝑥, 𝑦].]
In a Bayesian setting, 𝑝(𝑥) is the prior, 𝑝(𝑥|𝑦) the posterior, and 𝕀[𝑥, 𝑦] represents the reduction in uncertainty in 𝑥 once we observe 𝑦.
Note that ℍ[𝑥, 𝑦] ≤ ℍ[𝑥] + ℍ[𝑦]
This is easy to prove by noticing that
$$\mathbb{I}[x, y] = \mathbb{H}[y] - \mathbb{H}[y \mid x] \ge 0 \quad (\text{a KL divergence})$$
and
$$\mathbb{H}[x, y] = \mathbb{H}[y \mid x] + \mathbb{H}[x],$$
from which
$$\mathbb{H}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{I}[x, y] \le \mathbb{H}[x] + \mathbb{H}[y].$$
The equality here holds only if 𝑥, 𝑦 are independent:
$$\mathbb{H}[x, y] = -\iint p(x, y)\ln p(x, y)\,dy\,dx = -\iint p(x, y)\big[\ln p(x) + \ln p(y)\big]\,dy\,dx = \mathbb{H}[x] + \mathbb{H}[y] \quad (\text{sufficiency condition})$$
$$\mathbb{H}[y \mid x] = \mathbb{H}[y] \;\Rightarrow\; \mathbb{I}[x, y] = 0 \;\Rightarrow\; p(x, y) = p(x)\,p(y) \quad (\text{necessary condition})$$
Mutual Information for Correlated Gaussians
Consider two correlated Gaussians as follows:
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}\right)$$
Thus:
$$\mathbb{I}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{H}[x, y] = \frac{1}{2}\log\frac{1}{1 - \rho^2} = -\frac{1}{2}\log(1 - \rho^2)$$
Note:
𝜌 = 0 (independent 𝑋, 𝑌) ⇒ 𝕀[𝑥, 𝑦] = 0
𝜌 = ±1 (linearly correlated 𝑋, 𝑌) ⇒ 𝕀[𝑥, 𝑦] → ∞
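A quick Python check of this formula, comparing I[x, y] = H[x] + H[y] − H[x, y] (computed from Gaussian differential entropies) with −½ log(1 − ρ²); σ² = 1 and the ρ values are assumed for illustration.

```python
import numpy as np

def gaussian_entropy(Sigma):
    # differential entropy of a zero-mean Gaussian: 0.5 * ln((2*pi*e)^d * det(Sigma))
    Sigma = np.atleast_2d(Sigma)
    d = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

sigma2 = 1.0
for rho in [0.0, 0.5, 0.9, 0.99]:
    Sigma = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    I = 2 * gaussian_entropy([[sigma2]]) - gaussian_entropy(Sigma)   # H[x] + H[y] - H[x, y]
    print(rho, I, -0.5 * np.log(1 - rho**2))                         # matches the closed form
```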
Pointwise Mutual Information
A quantity which is closely related to 𝑀𝐼 is the pointwise
mutual information or 𝑃𝑀𝐼. For two events (not random
variables) 𝑥 and 𝑦, this is defined as
$$\text{PMI}(x, y) := \log\frac{p(x, y)}{p(x)\,p(y)} = \log\frac{p(x \mid y)}{p(x)} = \log\frac{p(y \mid x)}{p(y)}$$
This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI, 𝕀[𝑥, 𝑦], of 𝑋 and 𝑌 is just the expected value of the PMI.
Mutual Information
For continuous random variables, it is common to first discretize or quantize them into bins and to compute how many values fall in each histogram bin (Scott 1979).
Maximal Information Coefficient
This statistic, appropriately normalized, is known as the maximal information coefficient (MIC).
We first define:
$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big(X(G); Y(G)\big)}{\log\min(x, y)}$$
where 𝒢(𝑥, 𝑦) is the set of two-dimensional grids of size 𝑥 × 𝑦, and 𝑋(𝐺), 𝑌(𝐺) are the discretizations of the variables onto this grid.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518-1524.
Correlation Coefficient Vs MIC
See mutualInfoAllPairsMixed and miMixedDemo from PMTK3.