Lec5 IntroToProbabilityAndStatistics

This document discusses topics in probability and statistics including the law of large numbers, the central limit theorem, and statistical inference. It introduces the Markov and Chebyshev inequalities and describes how the average of i.i.d. random variables converges to the mean: in probability by the weak law of large numbers and almost surely by the strong law. The central limit theorem states that the distribution of the suitably standardized sample mean of i.i.d. random variables converges in distribution to a normal distribution as the sample size increases. Statistical inference involves either parametric or non-parametric approaches to estimating properties of the underlying probability distribution that generated the observed data.


Introduction to

Probability and Statistics


(Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA

Email: [email protected]
URL: https://www.zabaras.com/

August 29, 2018

Contents
 Markov and Chebyshev Inequalities

 The Law of Large Numbers, Central Limit Theorem, Monte Carlo Approximation of Distributions, Estimating 𝜋, Accuracy of Monte Carlo Approximation, Approximating the Binomial with a Gaussian, Approximating the Poisson Distribution with a Gaussian, Application of the CLT to Noise Signals

 Information Theory, Entropy, KL Divergence, Jensen's Inequality, Mutual Information, Maximal Information Coefficient
References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2

• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Markov and Chebyshev Inequalities
 Markov's inequality: if 𝑋 is a non-negative integrable random variable, then for any 𝑎 > 0:
$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$
Indeed: $\mathbb{E}[X] = \int_0^\infty x\,\pi(x)\,dx \ge \int_a^\infty x\,\pi(x)\,dx \ge a\int_a^\infty \pi(x)\,dx = a\,\Pr[X \ge a]$.

 You can generalize this using any non-negative function 𝑓 of the random variable 𝑋:
$$\Pr[f(X) \ge a] \le \frac{\mathbb{E}[f(X)]}{a}$$

 Using 𝑓(𝑋) = (𝑋 − 𝔼[𝑋])², we derive the Chebyshev inequality: for any 𝜀 > 0,
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge \varepsilon\,\big] \le \frac{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}$$

 In terms of standard deviations, setting 𝜀 = 𝑘𝜎 we can restate this as $\Pr\big[\,|X - \mathbb{E}[X]| \ge k\sigma\,\big] \le \frac{1}{k^2}$.

 Thus the probability of 𝑋 being more than 2𝜎 away from 𝔼[𝑋] is ≤ 1/4.
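A minimal Python sketch (not part of the original slides) that checks the Chebyshev bound empirically; the exponential distribution and the sample size below are arbitrary illustrative choices.

```python
# Empirical check of Chebyshev's inequality Pr[|X - E[X]| >= k*sigma] <= 1/k^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # E[X] = 2, sigma = 2
mu, sigma = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k = {k}: empirical tail {empirical:.4f} <= bound {1 / k**2:.4f}")
```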
The Law of Large Numbers (LLN)
 Let 𝑋ᵢ, 𝑖 = 1, 2, . . . , 𝑛, be independent and identically distributed (i.i.d.) random variables with finite mean 𝔼(𝑋ᵢ) = 𝜇 and variance Var(𝑋ᵢ) = 𝜎².

 Let
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

 Note that
$$\mathbb{E}\big[\bar{X}_n\big] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu, \qquad \mathrm{Var}\big[\bar{X}_n\big] = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}$$

 Weak LLN: $\lim_{n\to\infty} \Pr\big[\,|\bar{X}_n - \mu| \ge \varepsilon\,\big] = 0$ for all 𝜀 > 0.

 Strong LLN: $\lim_{n\to\infty} \bar{X}_n = \mu$ almost surely. This means that with probability one, the average of any realizations 𝑥₁, 𝑥₂, … of the random variables 𝑋₁, 𝑋₂, … converges to the mean.
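As a quick illustration (a Python sketch, not part of the slides), the running average of Uniform[0, 1] draws settles near the true mean 0.5 as 𝑛 grows:

```python
# Running sample average of i.i.d. Uniform[0, 1] draws; the true mean is 0.5.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(f"n = {n:6d}: sample average = {running_mean[n - 1]:.4f}  (true mean 0.5)")
```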
Statistical Inference: Parametric & Non-Parametric Approach
 Assume that we have a set of observations
$$S = \{x_1, x_2, \ldots, x_N\}, \qquad x_j \in \mathbb{R}^n$$

 The problem is to infer the underlying probability distribution that gives rise to the data 𝑆.

 Parametric problem: The underlying probability density has a specified known form that depends on a number of parameters. The problem of interest is to infer those parameters.

 Non-parametric problem: No analytical expression for the probability density is available. The description consists of defining the dependency or independency of the data. This leads to numerical exploration.

 A typical situation for the parametric model is when the distribution is the PDF of a random variable $X : \Omega \to \mathbb{R}^n$.
Example of the Law of Large Numbers
 Assume that we sample $S = \{x_1, x_2, \ldots, x_N\}$, $x_j \in \mathbb{R}^2$.

 We consider a parametric model with the 𝑥ⱼ being realizations of $X \sim \mathcal{N}(x_0, \Sigma)$, $\Sigma \in \mathbb{R}^{2\times 2}$, where we take both the mean 𝑥₀ and the covariance 𝛴 as unknowns.

 The probability density of 𝑋 is:
$$\pi(x \mid x_0, \Sigma) = \frac{1}{2\pi\,(\det\Sigma)^{1/2}}\,\exp\!\left(-\tfrac{1}{2}(x - x_0)^T \Sigma^{-1}(x - x_0)\right)$$

 Our problem is to estimate 𝑥₀ and $\Sigma \in \mathbb{R}^{2\times 2}$.
Empirical Mean and Empirical Covariance
 From the law of large numbers, we calculate:
$$x_0 = \mathbb{E}[X] \approx \frac{1}{N}\sum_{j=1}^{N} x_j = \bar{x}$$

 To compute the covariance matrix, note that if 𝑋₁, 𝑋₂, … are i.i.d., so are 𝑓(𝑋₁), 𝑓(𝑋₂), … for any function $f : \mathbb{R}^2 \to \mathbb{R}^k$.

 Then we can compute:
$$\Sigma = \mathrm{cov}(X) = \mathbb{E}\big[(x - \mathbb{E}[X])(x - \mathbb{E}[X])^T\big] \approx \mathbb{E}\big[(x - \bar{x})(x - \bar{x})^T\big]
\;\Rightarrow\;
\Sigma \approx \frac{1}{N}\sum_{j=1}^{N}(x_j - \bar{x})(x_j - \bar{x})^T = \bar{\Sigma}$$

 The above formulas define the empirical mean and empirical covariance.
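A small Python sketch of these estimators on synthetic data (the true mean and covariance below are arbitrary illustrative values; note the 1/𝑁 normalization used on the slide):

```python
# Empirical mean and empirical covariance of 2-d samples, using the 1/N convention
# from the slides (np.cov uses 1/(N-1) by default, hence bias=True in the check).
import numpy as np

rng = np.random.default_rng(2)
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal(true_mean, true_cov, size=5000)   # shape (N, 2)

x_bar = x.mean(axis=0)
Sigma_bar = (x - x_bar).T @ (x - x_bar) / x.shape[0]

print("empirical mean:", x_bar)
print("empirical covariance:\n", Sigma_bar)
print("numpy check:\n", np.cov(x.T, bias=True))
```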
The Central Limit Theorem
 Let (𝑋₁, 𝑋₂, … , 𝑋_N) be independent and identically distributed (i.i.d.) random variables, each with expectation 𝜇 and variance 𝜎².

 Define:
$$Z_N = \frac{1}{\sigma\sqrt{N}}\big(X_1 + X_2 + \ldots + X_N - N\mu\big) = \frac{\bar{X}_N - \mu}{\sigma/\sqrt{N}}, \qquad \bar{X}_N = \frac{1}{N}\sum_{j=1}^{N} X_j$$

 As 𝑁 → ∞, the distribution of 𝑍_N converges to the distribution of a standard normal random variable:
$$\lim_{N\to\infty} P\big[Z_N \le x\big] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$

 For 𝑁 large, $\bar{X}_N \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right)$ approximately.

 This is somewhat of a justification for the common assumption of Gaussian noise.
The CLT and the Gaussian Distribution
 As an example, assume 𝑁 variables (𝑋1, 𝑋2, … 𝑋𝑁 ) each of
which has a uniform distribution over [0, 1] and then
consider the distribution of the sample mean
(𝑋1 + 𝑋2 + … + 𝑋𝑁 )/𝑁. For large 𝑁, this distribution tends
to a Gaussian. The convergence as 𝑁 increases can be
rapid.
N 2
4 4

N 1 N  10
4

3.5 3.5
3.5

3 3
3

2.5 2.5
2.5

2 2 2

1.5 1.5 1.5

1 1 1

0.5 0.5 0.5

0 0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

MatLab Code

Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras) 10
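The referenced demo is a MATLAB script; the following Python sketch (assuming matplotlib is available) produces an analogous set of histograms for 𝑁 = 1, 2, 10:

```python
# Histograms of the mean of N Uniform[0, 1] variables for N = 1, 2, 10,
# illustrating the CLT; analogous to (not identical to) the referenced MATLAB demo.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, (1, 2, 10)):
    means = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"N = {N}")
plt.tight_layout()
plt.show()
```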
The CLT and the Gaussian Distribution
 We plot a histogram of $\frac{1}{N}\sum_{i=1}^{N} x_{ij}$, 𝑗 = 1:10000, where $x_{ij} \sim \mathrm{Beta}(1, 5)$.

 As 𝑁 → ∞, the distribution tends towards a Gaussian.

[Figure: histograms of the sample mean of 𝑁 Beta(1, 5) draws for 𝑁 = 1, 𝑁 = 5, and 𝑁 = 10. Run centralLimitDemo from PMTK]
Accuracy of Monte Carlo Approximation
 In the Monte Carlo approximation of the mean $\mu = \mathbb{E}[f(x)]$ using the sample mean 𝜇̄, we have:
$$\bar{\mu} - \mu \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{N}\right), \qquad
\sigma^2 = \mathbb{E}[f^2(x)] - \mathbb{E}[f(x)]^2 \approx \frac{1}{N}\sum_{s=1}^{N}\big(f(x_s) - \bar{\mu}\big)^2 \equiv \bar{\sigma}^2$$

 We can now derive the following error bars (using central intervals):
$$\Pr\!\left[\mu - 1.96\frac{\bar{\sigma}}{\sqrt{N}} \le \bar{\mu} \le \mu + 1.96\frac{\bar{\sigma}}{\sqrt{N}}\right] = 0.95$$

 The number of samples needed to drop the error below 𝜀 is then:
$$1.96\frac{\bar{\sigma}}{\sqrt{N}} \le \varepsilon \;\Rightarrow\; N \ge \frac{4\bar{\sigma}^2}{\varepsilon^2}$$
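A Python sketch of these error bars, taking as an arbitrary example 𝑓(𝑥) = 𝑥² with 𝑥 ~ 𝒩(1.5, 0.25²):

```python
# Monte Carlo estimate of mu = E[f(x)] with a 95% central-interval error bar
# and the sample size needed to reach a target accuracy epsilon.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x**2
N = 10_000
x = rng.normal(1.5, 0.25, size=N)

mu_bar = f(x).mean()
sigma_bar = f(x).std()                       # sample estimate of sigma
half_width = 1.96 * sigma_bar / np.sqrt(N)
print(f"mu_bar = {mu_bar:.4f} +/- {half_width:.4f}  (95% interval)")

eps = 0.001
print("samples needed for error below eps:", int(np.ceil(4 * sigma_bar**2 / eps**2)))
```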
Monte Carlo Approximation of Distributions
 Computing the distribution of 𝑦 = 𝑥², where 𝑝(𝑥) is uniform on [−1, 1]. The MC approximation is obtained by taking samples from 𝑝(𝑥), squaring them, and plotting the histogram.

[Figure: the uniform density 𝑝(𝑥) on [−1, 1] (left), the density 𝑝(𝑦) (middle), and its Monte Carlo approximation (right). Run changeofVarsDemo1d from PMTK]
Accuracy of Monte Carlo Approximation
 The accuracy of MC increases with the number of samples. Histograms (on the left) and pdfs obtained by kernel density estimation (on the right) for samples from $\mathcal{N}(x \mid 1.5, 0.25^2)$, using 10 samples (top) and 100 samples (bottom). The actual distribution is shown in red.

[Figure: MC approximations with 10 and 100 samples vs the exact density 𝒩(x | 1.5, 0.25²). Run mcAccuracyDemo from PMTK]
Example of CLT: Estimating 𝜋 by MC
 Use the CLT to approximate 𝜋. Let 𝑥, 𝑦 ~ 𝒰[−𝑟, 𝑟]:
$$I = \int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,dx\,dy = \pi r^2$$
$$\pi = \frac{1}{r^2} I = \frac{1}{r^2}\,4r^2\int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy
= 4\int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy$$
$$\bar{\pi} \approx \frac{4}{N}\sum_{s=1}^{N}\mathbb{I}(x_s^2 + y_s^2 \le r^2), \qquad x_s, y_s \sim \mathcal{U}[-r, r]$$

 Here 𝑥, 𝑦 are uniform random variables on [−𝑟, +𝑟] with 𝑟 = 2, so that 𝑝(𝑥) = 𝑝(𝑦) = 1/(2𝑟).

 We find 𝜋̄ ≈ 3.1416 with standard error 0.09.

[Figure: samples 𝑥, 𝑦 ~ 𝒰[−𝑟, 𝑟] with 𝑟 = 2; the points inside the circle 𝑥² + 𝑦² ≤ 𝑟² are used to estimate 𝜋. Run mcEstimatePi from PMTK]
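A Python sketch of the same estimator (independent of the referenced PMTK MATLAB demo), reporting the estimate and its Monte Carlo standard error; each term 4·𝕀(·) plays the role of 𝑓(𝑥) in the error-bar formula of the previous slides:

```python
# Monte Carlo estimate of pi with r = 2, plus its standard error.
import numpy as np

rng = np.random.default_rng(5)
r, N = 2.0, 100_000
x = rng.uniform(-r, r, size=N)
y = rng.uniform(-r, r, size=N)

f = 4.0 * (x**2 + y**2 <= r**2)          # each term has expectation pi
pi_bar = f.mean()
std_err = f.std() / np.sqrt(N)
print(f"pi estimate = {pi_bar:.4f} +/- {std_err:.4f}")
```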
CLT: The Binomial Tends to a Gaussian
 One consequence of the CLT is that the binomial distribution
$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N-m}$$
which is a distribution over 𝑚 defined by the sum of 𝑁 observations of a random binary variable 𝑥, will tend to a Gaussian as 𝑁 → ∞:
$$\frac{x_1 + x_2 + \ldots + x_N}{N} = \frac{m}{N} \sim \mathcal{N}\!\left(\mu, \frac{\mu(1-\mu)}{N}\right), \qquad
m \sim \mathcal{N}\big(N\mu,\, N\mu(1-\mu)\big)$$

[Figure: the binomial distribution Bin(m | N = 10, μ = 0.25) over m = 0, …, 10. Matlab Code]
Poisson Process
Consider that we count the number of photons from a light source. Let
𝑁(𝑡) be the number of photons observed in the time interval [0, 𝑡]. 𝑁(𝑡) is
an integer-valued random variable. We make the following assumptions:
a. Stationarity: Let ∆1 and ∆2 be any two time intervals of equal length, 𝑛
any non-negative integer. Assume that
Prob. of 𝑛 photons in ∆1 = Prob. of 𝑛 photons in ∆2
b. Independent increments: Let ∆1, ∆2, … , ∆𝑛 be non-overlapping time
intervals and 𝑘1, 𝑘2, … , 𝑘𝑛 non-negative integers. Denote by 𝐴𝑗 the event
defined as
𝐴𝑗 = 𝑘𝑗 photons arrive in the time interval ∆𝑗
Assume that these events are mutually independent, i.e.
$$P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1)\,P(A_2)\cdots P(A_n)$$

c. Negligible probability of coincidence: Assume that the probability of two or more events at the same time is negligible. More precisely, 𝑁(0) = 0 and
$$\lim_{h\to 0}\frac{P\big[N(h) > 1\big]}{h} = 0$$
Poisson Process
 If these assumptions hold, then for a given time 𝑡, 𝑁(𝑡) is a Poisson process:
$$P\big[N(t) = n\big] = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}, \qquad \lambda > 0,\; n = 0, 1, 2, \ldots$$

 Let us fix 𝑡 = 𝑇 = observation time and define a random variable 𝑁 = 𝑁(𝑇). Let us define the parameter 𝜃 = 𝜆𝑇. We then denote:
$$N \sim \mathrm{Poisson}(\theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Springer, 2007.
 S. Ghahramani, Fundamentals of Probability, 1996.
Poisson Process
 Consider the Poisson (discrete) distribution over 𝑁 ∈ {0, 1, 2, …}:
$$P(N = n) = \pi_{\mathrm{Poisson}}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$

 The mean and the variance are both equal to 𝜃:
$$\mathbb{E}[N] = \sum_{n=0}^{\infty} n\,\pi_{\mathrm{Poisson}}(n \mid \theta) = \theta, \qquad
\mathbb{E}\big[(N - \theta)^2\big] = \theta$$
Approximating a Poisson Distribution With a Gaussian
 Theorem: A random variable 𝑋 ~ Poisson(𝜃) can be considered as the sum of 𝑛 independent random variables 𝑋ᵢ ~ Poisson(𝜃/𝑛).ᵃ

 According to the Central Limit Theorem, when 𝑛 is large enough,
$$X_i \sim \mathrm{Poisson}\!\left(\frac{\theta}{n}\right) \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} X_i \sim \mathcal{N}\!\left(\frac{\theta}{n}, \frac{\theta}{n^2}\right) \;\text{(approximately)}$$

 Then $X = \sum_{i=1}^{n} X_i$, which by the Theorem is a draw from Poisson(𝜃), also follows a Gaussian for large 𝑛, with:
$$\mathbb{E}[X] = n\,\frac{\theta}{n} = \theta, \qquad \mathrm{var}[X] = n^2\,\frac{\theta}{n^2} = \theta$$

 Thus $X \sim \mathcal{N}(\theta, \theta)$ approximately.

 The approximation of a Poisson distribution with a Gaussian for large 𝑛 is a result of the CLT!

ᵃ For a proof that the sum of independent Poisson random variables is a Poisson distribution, see this document.
Approximating a Poisson Distribution with a Gaussian
[Figure: Poisson distributions (dots) vs their Gaussian approximations (solid line) for means 𝜃 = 5, 10, 15, and 20. The higher the 𝜃, the smaller the distance between the two distributions. See this MatLab implementation.]
Kullback-Leibler Distance Between Two Densities
 Let us consider the following two distributions:
$$\pi(n) = \pi_{\mathrm{Poisson}}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}, \qquad
\pi(x) = \pi_{\mathrm{Gaussian}}(x \mid \theta, \theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\!\left(-\frac{1}{2\theta}(x - \theta)^2\right)$$

 We often use the Kullback-Leibler distance to define the distance between two distributions. In particular, in approximating the Poisson distribution with a Gaussian distribution, we have:
$$\mathrm{KL}\big(\pi_{\mathrm{Poisson}}(\cdot \mid \theta),\, \pi_{\mathrm{Gaussian}}(\cdot \mid \theta, \theta)\big)
= \sum_{n=0}^{\infty}\pi_{\mathrm{Poisson}}(n \mid \theta)\,\log\frac{\pi_{\mathrm{Poisson}}(n \mid \theta)}{\pi_{\mathrm{Gaussian}}(n \mid \theta, \theta)}$$
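A Python sketch (an illustration, not the MatLab implementation referenced on these slides) that evaluates this sum numerically, working in log space for stability and truncating the sum at a large 𝑛:

```python
# KL distance between Poisson(theta) and its Gaussian approximation N(theta, theta),
# evaluated by summing over the integers n = 0, ..., n_max.
import numpy as np
from scipy.stats import poisson, norm

def kl_poisson_gaussian(theta, n_max=200):
    n = np.arange(0, n_max + 1)
    log_p = poisson.logpmf(n, theta)                        # Poisson log-probabilities
    log_q = norm.logpdf(n, loc=theta, scale=np.sqrt(theta)) # Gaussian log-density at integers
    return np.sum(np.exp(log_p) * (log_p - log_q))

for theta in (1, 5, 10, 20):
    print(f"theta = {theta:2d}: KL = {kl_poisson_gaussian(theta):.4f}")
```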
Approximating a Poisson Distribution With a Gaussian
[Figure: the KL distance of the Poisson distribution from its Gaussian approximation as a function of the mean 𝜃, on a logarithmic scale. The horizontal line indicates where the KL distance has dropped to 1/10 of its value at 𝜃 = 1. See the following MatLab implementation.]
Application of the CLT: Noise Signals
 Consider discrete sampling where the output is a noise signal of length 𝑛.

 The noise vector $x \in \mathbb{R}^n$ is a realization of $X : \Omega \to \mathbb{R}^n$.

 We estimate the mean and the variance of the noise in a single measurement as follows:
$$x_0 = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad s^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - x_0)^2$$

 To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:
$$\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \in \mathbb{R}^n$$

 The average noise is a realization of a random variable:
$$\bar{X} = \frac{1}{N}\sum_{k=1}^{N} X^{(k)} : \Omega \to \mathbb{R}^n$$

 If $X^{(1)}, X^{(2)}, \ldots$ are i.i.d., $\bar{X}$ is asymptotically Gaussian by the CLT, and its variance is $\mathrm{var}(\bar{X}) = s^2/N$. We need to repeat the experiment until 𝑠²/𝑁 drops below a given tolerance.
Noise Reduction By Averaging
Gaussian noise vectors of size 50 are used with 𝜎² = 1; 𝑠 is the std of a single noise vector. The estimated noise level is $\sqrt{s^2/N}$.

[Figure: a single noise vector (N = 1) and averages over N = 5, 10, and 25 repeated measurements; the amplitude of the averaged noise shrinks as N grows. See the following MatLab implementation.]

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Springer, 2007.
Introduction to Information Theory
 Information theory is concerned with
  representing data in a compact fashion (data compression or source coding), and
  transmitting and storing it in a way that is robust to errors (error correction or channel coding).

 Compactly representing data requires allocating short codewords to highly probable bit strings and reserving longer codewords for less probable bit strings.

 E.g. in natural language, common words ("a", "the", "and") are much shorter than rare words.

 D. MacKay, Information Theory, Inference and Learning Algorithms (Video Lectures)
Introduction to Information Theory
 Decoding messages sent over noisy channels requires
having a good probability model of the kinds of
messages that people tend to send.

 We need models that can predict which kinds of data


are likely and which unlikely.

• David MacKay, Information Theory, Inference and Learning Algorithms , 2003 (available on line)
• Thomas M. Cover, Joy A. Thomas , Elements of Information Theory , Wiley, 2006.
• Viterbi, A. J. and J. K. Omura (1979). Principles of Digital Communication and Coding. McGraw-Hill.

Introduction to Information Theory
 Consider a discrete random variable 𝑥. We ask how
much information (‘degree of surprise’) is received when
we observe (learn) a specific value for this variable?

 Observing a highly probable event provides little


additional information.
 If we have two events 𝑥 and 𝑦 that are unrelated, then
the information gain from observing both of them should
be ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦).

 Two unrelated events will be statistically independent, so


𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦).

Entropy
 From ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦) and 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦), it is easily shown that ℎ(𝑥) must be given by the logarithm of 𝑝(𝑥), and so we have
$$h(x) = -\log_2 p(x) \ge 0$$
(the units of ℎ(𝑥) are bits, 'binary digits').

 Low probability events correspond to high information content.

 When transmitting a random variable, the average amount of transmitted information is the entropy of 𝑋:
$$\mathbb{H}[X] = -\sum_{k=1}^{K} p(X = k)\,\log_2 p(X = k)$$
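A minimal Python helper (illustrative, not from the slides) implementing this definition in bits:

```python
# Entropy in bits of a discrete distribution given as a vector of probabilities.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # the term 0*log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform over 4 states)
print(entropy_bits([0.9, 0.1]))                  # ~0.469 bits
```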
Noiseless Coding Theorem (Shannon)
 Example 1 (Coding theory): 𝑥 is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of 𝑥?
All states equally likely: $\mathbb{H}[x] = -8\times\frac{1}{8}\log_2\frac{1}{8} = 3$ bits.

 Example 2: Consider a variable having 8 possible states {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓, 𝑔, ℎ} for which the respective (non-uniform) probabilities are given by (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

 The entropy in this case is smaller than for the uniform distribution:
$$\mathbb{H}[x] = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - \frac{4}{64}\log_2\frac{1}{64} = 2 \text{ bits}$$

 Note: shorter codes are used for the more probable events vs longer codes for the less probable events:
$$\text{average code length} = 1\times\frac{1}{2} + 2\times\frac{1}{4} + 3\times\frac{1}{8} + 4\times\frac{1}{16} + 4\times 6\times\frac{1}{64} = 2 \text{ bits}$$

Shannon's Noiseless Coding Theorem (1948): The entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
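The 2-bit figures in Example 2 can be checked numerically; the code lengths (1, 2, 3, 4, 6, 6, 6, 6) below are those implied by the average-code-length computation above:

```python
# Entropy and average code length for the 8-state example with probabilities
# (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) and code lengths (1,2,3,4,6,6,6,6).
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])

H = -np.sum(p * np.log2(p))
avg_len = np.sum(p * lengths)
print(f"entropy = {H} bits, average code length = {avg_len} bits")  # both 2.0
```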
Alternative Definition of Entropy
 Consider a set of 𝑁 identical objects that are to be divided amongst a set of bins, such that there are 𝑛ᵢ objects in the 𝑖th bin. Consider the number of different ways of allocating the objects to the bins.
 In the 𝑖th bin there are 𝑛ᵢ! ways of reordering the objects (microstates), and so the total number of ways of allocating the 𝑁 objects to the bins is given by the multiplicity
$$W = \frac{N!}{\prod_i n_i!}$$
 The entropy is defined as
$$\mathbb{H} = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!$$
 We now consider the limit 𝑁 → ∞ (with the ratios 𝑛ᵢ/𝑁 held fixed) and use $\ln N! \simeq N\ln N - N$, $\ln n_i! \simeq n_i\ln n_i - n_i$:
$$\mathbb{H} = -\lim_{N\to\infty}\sum_i \frac{n_i}{N}\ln\frac{n_i}{N} = -\sum_i p_i\ln p_i$$
where $p_i = n_i/N$ is the probability of an object being assigned to the 𝑖th bin. The occupation numbers 𝑝ᵢ correspond to macrostates.
Alternative Definition of Entropy
 Interpret the bins as the states 𝑥ᵢ of a discrete random variable 𝑋, where 𝑝(𝑋 = 𝑥ᵢ) = 𝑝ᵢ. The entropy of the random variable 𝑋 is then
$$\mathbb{H}[p] = -\sum_i p(x_i)\,\ln p(x_i)$$

 Distributions 𝑝(𝑥) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
Maximum Entropy: Uniform Distribution
 The maximum entropy configuration can be found by maximizing ℍ using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize
$$\widetilde{\mathbb{H}} = -\sum_i p(x_i)\ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right)$$
 We find 𝑝(𝑥ᵢ) = 1/𝑀, where 𝑀 is the number of possible states, and ℍ = ln 𝑀.
 To verify that the stationary point is indeed a maximum, we can evaluate the 2nd derivative of the entropy, which gives
$$\frac{\partial^2\widetilde{\mathbb{H}}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\frac{1}{p_i}$$
where 𝐼ᵢⱼ are the elements of the identity matrix.

 For any discrete distribution with 𝑀 states, we have ℍ[𝑥] ≤ ln 𝑀:
$$-\sum_i p(x_i)\ln p(x_i) = \sum_i p(x_i)\ln\frac{1}{p(x_i)} \le \ln\left(\sum_i p(x_i)\frac{1}{p(x_i)}\right) = \ln M$$
 Here we used Jensen's inequality (for the concave function log).
Example: Biosequence Analysis
 Recall the DNA sequence logo example discussed earlier.

 The height of each bar is defined to be 2 − ℍ, where ℍ is the entropy of that distribution and 2 = log₂4 is the maximum possible entropy.

 Thus a bar of height 0 corresponds to a uniform distribution over the four letters, whereas a bar of height 2 corresponds to a deterministic distribution.

[Figure: DNA sequence logo; the vertical axis is in bits (0 to 2), the horizontal axis is the sequence position (1 to 15). seqlogoDemo from PMTK]
Binary Variable
 Consider a binary random variable 𝑋 ∈ {0, 1}, and write 𝑝(𝑋 = 1) = 𝜃 and 𝑝(𝑋 = 0) = 1 − 𝜃.

 Hence the entropy becomes (the binary entropy function):
$$\mathbb{H}[X] = -\theta\log_2\theta - (1 - \theta)\log_2(1 - \theta)$$

 The maximum value of 1 occurs when the distribution is uniform, 𝜃 = 0.5.

[Figure: the binary entropy function H(X) plotted against p(X = 1), with its maximum of 1 bit at θ = 0.5. MatLab function bernoulliEntropyFig from PMTK]
Differential Entropy
 Divide 𝑥 into bins of width Δ. Assuming 𝑝(𝑥) is continuous, for each such bin there must exist an 𝑥ᵢ such that
$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,dx = p(x_i)\Delta \;=\; \text{probability of falling in bin } i$$
$$\mathbb{H}_\Delta = -\sum_i p(x_i)\Delta\,\ln\big(p(x_i)\Delta\big) = -\sum_i p(x_i)\Delta\,\ln p(x_i) - \ln\Delta$$
$$\lim_{\Delta\to 0}\left\{-\sum_i p(x_i)\Delta\,\ln p(x_i)\right\} = -\int p(x)\ln p(x)\,dx \quad \text{(can be negative)}$$

 The ln Δ term is omitted since it diverges as Δ → 0 (indicating that infinitely many bits are needed to describe a continuous variable exactly).
Differential Entropy
 For a density defined over multiple continuous variables, denoted collectively by the vector 𝒙, the differential entropy is given by
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x}$$

 Differential entropy (unlike the discrete entropy) can be negative.

 When doing a variable transformation 𝒚(𝒙), use 𝑝(𝒙)𝑑𝒙 = 𝑝(𝒚)𝑑𝒚; e.g. if 𝒚 = 𝑨𝒙 then:
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{y})\ln\big(p(\boldsymbol{y})\,|\boldsymbol{A}|\big)\,d\boldsymbol{y} = \mathbb{H}[\boldsymbol{y}] - \ln|\boldsymbol{A}| \;\Rightarrow\; \mathbb{H}[\boldsymbol{y}] = \mathbb{H}[\boldsymbol{x}] + \ln|\boldsymbol{A}|$$
Differential Entropy and the Gaussian Distribution
 The distribution that maximizes the differential entropy subject to constraints on the first two moments is a Gaussian. We maximize
$$\widetilde{\mathbb{H}} = -\int p(x)\ln p(x)\,dx
+ \lambda_1\left(\int p(x)\,dx - 1\right)
+ \lambda_2\left(\int x\,p(x)\,dx - \mu\right)
+ \lambda_3\left(\int (x-\mu)^2 p(x)\,dx - \sigma^2\right)$$
where the three constraints enforce normalization, the given mean, and the given variance.

 Using calculus of variations, setting the functional derivative to zero gives
$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3(x-\mu)^2 = 0
\;\Rightarrow\; p(x) = e^{-1+\lambda_1+\lambda_2 x+\lambda_3(x-\mu)^2}$$
and using the constraints to fix the multipliers,
$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

 Evaluating the differential entropy of the Gaussian, we obtain (an expression for a multivariate Gaussian of dimension 𝑑 is also given):
$$\mathbb{H}[x] = \frac{1}{2}\big(1 + \ln(2\pi\sigma^2)\big) = \frac{1}{2}\ln\big((2\pi e)^d\det\boldsymbol{\Sigma}\big), \qquad d = 1,\; \det\boldsymbol{\Sigma} = \sigma^2$$
 Note that ℍ[𝑥] < 0 for 𝜎² < 1/(2𝜋𝑒).
Kullback-Leibler Divergence and Cross Entropy
 Consider some unknown distribution 𝑝(𝑥), and suppose that we have modeled this using an approximating distribution 𝑞(𝑥).

 If we use 𝑞(𝑥) to construct a coding scheme for the purpose of transmitting values of 𝑥 to a receiver, then the additional information required to specify 𝑥 is the KL divergence:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx$$
(we transmit with the code based on 𝑞(𝑥), but average with respect to the exact probability 𝑝(𝑥)).

 The cross entropy is defined as:
$$\mathbb{H}(p, q) = -\int p(x)\ln q(x)\,dx$$
KL Divergence and Cross Entropy
 The cross entropy $\mathbb{H}(p, q) = -\int p(x)\ln q(x)\,dx$ is the average number of bits needed to encode data coming from a source with distribution 𝑝 when we use model 𝑞 to define our codebook.

 ℍ(𝑝) = ℍ(𝑝, 𝑝) is the expected number of bits using the true model, so the KL divergence is the average number of extra bits needed to encode the data because we used distribution 𝑞 to encode the data instead of the true distribution 𝑝.

 The "extra number of bits" interpretation makes it clear that KL(𝑝 ‖ 𝑞) ≥ 0, and that the KL is equal to zero iff 𝑞 = 𝑝:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx$$

 The KL distance is not a symmetric quantity, that is, $\mathrm{KL}(p\,\|\,q) \ne \mathrm{KL}(q\,\|\,p)$.
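A short Python sketch of these quantities for discrete distributions (in bits), illustrating KL(𝑝‖𝑞) = ℍ(𝑝, 𝑞) − ℍ(𝑝) ≥ 0 and the asymmetry; the two example distributions are arbitrary:

```python
# Cross entropy H(p, q), entropy H(p) = H(p, p), and KL(p || q) in bits.
import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p[p > 0] * np.log2(q[p > 0]))

def kl(p, q):
    return cross_entropy(p, q) - cross_entropy(p, p)   # H(p, q) - H(p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both >= 0 and generally unequal
```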
KL Divergence Between Two Gaussians
 Consider 𝑝(𝑥) = 𝒩(𝑥|𝜇, 𝜎²) and 𝑞(𝑥) = 𝒩(𝑥|𝑚, 𝑠²).
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx
= \int \mathcal{N}(x\mid\mu,\sigma^2)\,\frac{1}{2}\left[\ln(2\pi s^2) + \frac{(x-m)^2}{s^2}\right]dx - \frac{1}{2}\ln\big(2\pi e\sigma^2\big)$$

 Note that the first term can be computed using the moments and normalization condition of a Gaussian, and the second term comes from the differential entropy of a Gaussian.

 Finally we obtain:
$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left[\ln\frac{s^2}{\sigma^2} + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2} - 1\right]$$
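A Python sketch (illustrative) checking this closed form against numerical integration, for arbitrary example parameters:

```python
# KL(N(mu, sigma^2) || N(m, s^2)): closed form vs numerical integration.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.5, 1.2     # parameters of p (arbitrary example values)
m, s = -0.3, 2.0         # parameters of q

closed_form = 0.5 * (np.log(s**2 / sigma**2)
                     + (sigma**2 + mu**2 - 2*mu*m + m**2) / s**2 - 1)

integrand = lambda x: norm.pdf(x, mu, sigma) * (norm.logpdf(x, mu, sigma)
                                                - norm.logpdf(x, m, s))
numerical, _ = quad(integrand, -30, 30)

print(closed_form, numerical)   # the two values agree to several decimals
```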
KL Divergence Between Two Gaussians
 Consider now 𝑝(𝒙) = 𝒩(𝒙|𝝁, 𝚺) and 𝑞(𝒙) = 𝒩(𝒙|𝒎, 𝑳).
$$-\int p(\boldsymbol{x})\ln q(\boldsymbol{x})\,d\boldsymbol{x}
= \frac{1}{2}\int \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})\Big[D\ln(2\pi) + \ln|\boldsymbol{L}| + (\boldsymbol{x}-\boldsymbol{m})^T\boldsymbol{L}^{-1}(\boldsymbol{x}-\boldsymbol{m})\Big]d\boldsymbol{x}$$
$$= \frac{1}{2}\Big[D\ln(2\pi) + \ln|\boldsymbol{L}| + \mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\Big]$$
$$-\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x} = \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{D}{2}\ln(2\pi e)$$
$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left[\ln\frac{|\boldsymbol{L}|}{|\boldsymbol{\Sigma}|} - D + \mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\right]$$
Jensen’s Inequality
 For a convex function 𝑓, Jensen's inequality gives (it can be proven easily by induction)
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\; \sum_i \lambda_i = 1$$

 This is equivalent (assume 𝑀 = 2) to our requirement for convexity, 𝑓″(𝑥) > 0.
 Assume 𝑓″(𝑥) > 0 (strict convexity) for any 𝑥. A Taylor expansion around 𝑥₀ gives
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2}f''(x^*)(x - x_0)^2 \ge f(x_0) + f'(x_0)(x - x_0)$$
For 𝑥 = 𝑎, 𝑏:
$$f(a) \ge f(x_0) + f'(x_0)(a - x_0), \qquad f(b) \ge f(x_0) + f'(x_0)(b - x_0)$$
$$\Rightarrow\;\lambda f(a) + (1-\lambda)f(b) \ge f(x_0) + f'(x_0)\big(\lambda a + (1-\lambda)b - x_0\big)$$
Setting 𝑥₀ = 𝜆𝑎 + (1 − 𝜆)𝑏, Jensen's inequality is thus shown:
$$\lambda f(a) + (1-\lambda)f(b) \ge f\big(\lambda a + (1-\lambda)b\big)$$
Jensen’s Inequality
 Assume Jensen's inequality. We should show the converse, i.e. that 𝑓 is convex (𝑓′ is non-decreasing).
 Set 𝑎 = 𝑏 − 2𝜀, equivalently 𝑏 = 𝑎 + 2𝜀 > 𝑎, with 𝜀 > 0. Using Jensen's inequality with 𝜆 = 1/2:
$$\tfrac{1}{2}f(a) + \tfrac{1}{2}f(b) \ge f(0.5a + 0.5b)
= \tfrac{1}{2}f\big(0.5(b-2\varepsilon) + 0.5b\big) + \tfrac{1}{2}f\big(0.5a + 0.5(a+2\varepsilon)\big)
= \tfrac{1}{2}f(b-\varepsilon) + \tfrac{1}{2}f(a+\varepsilon)$$
$$\Rightarrow\; f(b) - f(b-\varepsilon) \ge f(a+\varepsilon) - f(a)$$

 For 𝜀 small, we thus have:
$$\frac{f(b) - f(b-\varepsilon)}{\varepsilon} \ge \frac{f(a+\varepsilon) - f(a)}{\varepsilon}
\;\;\text{or}\;\; f'(b) \ge f'(a) \;\Rightarrow\; f(\cdot)\text{ is convex}$$
Jensen’s Inequality
 Using Jensen's inequality
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\; \sum_i \lambda_i = 1$$
for a discrete random variable (set 𝜆ᵢ = 𝑝ᵢ) results in:
$$f\big(\mathbb{E}[x]\big) \le \mathbb{E}\big[f(x)\big]$$
 We can generalize this result to continuous random variables:
$$f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
 We will use this shortly in the context of the KL distance.
 We often use Jensen's inequality for concave functions (e.g. log 𝑥). In that case, be sure to reverse the inequality!
Jensen’s Inequality: Example
 As another example of Jensen's inequality, consider the arithmetic and geometric means of a set of positive real variables:
$$\bar{x}_A = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \bar{x}_G = \left(\prod_{i=1}^{M} x_i\right)^{1/M}$$

 Using Jensen's inequality for 𝑓(𝑥) = ln 𝑥 (concave), i.e. $\mathbb{E}[\ln x] \le \ln\mathbb{E}[x]$, we can show:
$$\ln\bar{x}_G = \frac{1}{M}\sum_{i=1}^{M}\ln x_i \le \ln\!\left(\frac{1}{M}\sum_{i=1}^{M} x_i\right) = \ln\bar{x}_A \;\Rightarrow\; \bar{x}_G \le \bar{x}_A$$
The Kullback-Leibler Divergence
$$f\big(\mathbb{E}[x]\big) \le \mathbb{E}\big[f(x)\big], \qquad f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
 Using Jensen's inequality, we can show (−ln is a convex function) that:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \ge -\ln\int p(x)\frac{q(x)}{p(x)}\,dx = -\ln\int q(x)\,dx = 0$$

 Thus we derive the following information inequality:
$$\mathrm{KL}(p\,\|\,q) \ge 0, \quad \text{with } \mathrm{KL}(p\,\|\,q) = 0 \text{ if and only if } p(x) = q(x)$$
Principle of Insufficient Reason
 An important consequence of the information inequality is that the discrete distribution with the maximum entropy is the uniform distribution.

 More precisely, ℍ(𝑋) ≤ log |𝒳|, where |𝒳| is the number of states for 𝑋, with equality iff 𝑝(𝑥) is uniform. To see this, let 𝑢(𝑥) = 1/|𝒳|. Then
$$0 \le \mathrm{KL}(p\,\|\,u) = \sum_x p(x)\log p(x) - \sum_x p(x)\log u(x) = \log|\mathcal{X}| - \mathbb{H}(X)$$

 This principle of insufficient reason argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another.
The Kullback-Leibler Divergence
 Data compression is in some way related to density estimation.

 The Kullback-Leibler divergence measures the distance between two distributions and is zero when the two densities are identical.

 Suppose the data is generated from an unknown 𝑝(𝒙) that we try to approximate with a parametric model 𝑞(𝒙|𝜽). Suppose we have observed training points 𝒙ₙ ~ 𝑝(𝒙), 𝑛 = 1, …, 𝑁. Then, using a sample average approximation of the mean:
$$\mathrm{KL}(p\,\|\,q) = -\int p(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p(\boldsymbol{x})}\,d\boldsymbol{x}
\approx \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta}) + \ln p(\boldsymbol{x}_n)\big]$$
The KL Divergence Vs. MLE
 Note that only the first term is a function of 𝑞.

 Thus minimizing KL(𝑝 ‖ 𝑞) is equivalent to maximizing the likelihood function for 𝜽 under the distribution 𝑞:
$$\mathrm{KL}(p\,\|\,q) = -\int p(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p(\boldsymbol{x})}\,d\boldsymbol{x}
\approx \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta}) + \ln p(\boldsymbol{x}_n)\big]$$

 So the MLE estimate minimizes the KL divergence to the empirical distribution
$$p_{\mathrm{emp}}(\boldsymbol{x}) = \frac{1}{N}\sum_{n=1}^{N}\delta(\boldsymbol{x}_n - \boldsymbol{x})$$
$$\arg\min_{q}\,\mathrm{KL}\big(p_{\mathrm{emp}}\,\|\,q\big), \qquad
\mathrm{KL}\big(p_{\mathrm{emp}}\,\|\,q\big) = -\int p_{\mathrm{emp}}(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p_{\mathrm{emp}}(\boldsymbol{x})}\,d\boldsymbol{x}
= \mathrm{const} - \frac{1}{N}\sum_{n=1}^{N}\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta})$$
Conditional Entropy
 For a joint distribution, the conditional entropy is
$$\mathbb{H}[y \mid x] = -\iint p(y, x)\ln p(y \mid x)\,dy\,dx$$

 This represents the average information needed to specify 𝑦 if we already know the value of 𝑥.

 It is easily seen, using 𝑝(𝑦, 𝑥) = 𝑝(𝑦|𝑥)𝑝(𝑥) and substituting inside the log in $\mathbb{H}[x, y] = -\iint p(x, y)\ln p(x, y)\,dy\,dx$, that the conditional entropy satisfies the relation
$$\mathbb{H}[x, y] = \mathbb{H}[y \mid x] + \mathbb{H}[x]$$
where ℍ[𝑥, 𝑦] is the differential entropy of 𝑝(𝑥, 𝑦) and ℍ[𝑥] is the differential entropy of 𝑝(𝑥).
Conditional Entropy for Discrete Variables
 Consider the conditional entropy for discrete variables:
$$\mathbb{H}[y \mid x] = -\sum_i\sum_j p(y_i, x_j)\ln p(y_i \mid x_j)$$
 To understand further the meaning of conditional entropy, let us consider the implications of ℍ[𝑦|𝑥] = 0.
 We have:
$$\mathbb{H}[y \mid x] = -\sum_j\left(\sum_i p(y_i \mid x_j)\ln p(y_i \mid x_j)\right)p(x_j) = 0$$

 Since each term −𝑝 ln 𝑝 is non-negative, we can conclude that for each 𝑥ⱼ such that 𝑝(𝑥ⱼ) ≠ 0 the following must hold: 𝑝(𝑦ᵢ|𝑥ⱼ) ln 𝑝(𝑦ᵢ|𝑥ⱼ) = 0 for every 𝑖.
 Since 𝑝 log 𝑝 = 0 ⟺ 𝑝 = 0 or 𝑝 = 1, and since 𝑝(𝑦ᵢ|𝑥ⱼ) is normalized, there is only one 𝑦ᵢ such that 𝑝(𝑦ᵢ|𝑥ⱼ) = 1, with all other 𝑝(·|𝑥ⱼ) = 0. Thus 𝑦 is a function of 𝑥.
Mutual Information
 If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the KL divergence between the joint distribution and the product of the marginals. This is the mutual information:
$$\mathbb{I}[x, y] \equiv \mathrm{KL}\big(p(x, y)\,\|\,p(x)p(y)\big) = -\iint p(x, y)\ln\frac{p(x)p(y)}{p(x, y)}\,dx\,dy \ge 0$$
with 𝕀[𝑥, 𝑦] = 0 iff 𝑥, 𝑦 are independent.
 The mutual information is related to the conditional entropy through
$$\mathbb{I}[x, y] = -\iint p(x, y)\ln\frac{p(y)}{p(y \mid x)}\,dx\,dy = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
$$\mathbb{I}[x, y] = \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
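A minimal Python sketch computing 𝕀[𝑥, 𝑦] for a discrete joint distribution (the 2×2 table below is an arbitrary example), verifying 𝕀 = ℍ[𝑥] + ℍ[𝑦] − ℍ[𝑥, 𝑦]:

```python
# Mutual information of a discrete joint distribution p(x, y), in nats.
import numpy as np

pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])          # arbitrary example joint distribution
px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)

mi = np.sum(pxy * np.log(pxy / (px * py)))

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
print(mi, H(px) + H(py) - H(pxy))       # the two numbers agree
```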
Mutual Information
 The mutual information represents the reduction in the uncertainty about 𝑥 once we learn the value of 𝑦 (and conversely):
$$\mathbb{I}[x, y] = \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$

 In a Bayesian setting, 𝑝(𝑥) is the prior, 𝑝(𝑥|𝑦) the posterior, and I[𝑥, 𝑦] represents the reduction in uncertainty in 𝑥 once we observe 𝑦.
Note that H[x, y] ≤ H[x] + H[y]
 This is easy to prove by noticing that
$$\mathbb{I}[x, y] = \mathbb{H}[y] - \mathbb{H}[y \mid x] \ge 0 \quad (\text{a KL divergence})$$
and
$$\mathbb{H}[x, y] = \mathbb{H}[y \mid x] + \mathbb{H}[x]$$
from which
$$\mathbb{H}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{I}[x, y] \le \mathbb{H}[x] + \mathbb{H}[y]$$
 The equality here is true only if 𝑥, 𝑦 are independent:
$$\mathbb{H}[x, y] = -\iint p(x, y)\ln p(x, y)\,dy\,dx = -\iint p(x, y)\big[\ln p(x) + \ln p(y)\big]dy\,dx = \mathbb{H}[x] + \mathbb{H}[y] \quad \text{(sufficiency)}$$
$$\mathbb{H}[y \mid x] = \mathbb{H}[y] \;\Rightarrow\; \mathbb{I}[x, y] = 0 \;\Rightarrow\; p(x, y) = p(x)p(y) \quad \text{(necessity)}$$
Mutual Information for Correlated Gaussians
 Consider two correlated Gaussians as follows:
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} X \\ Y \end{pmatrix}\,\middle|\,\begin{pmatrix} 0 \\ 0 \end{pmatrix},\,\begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}\right)$$

 For each of these variables we can write:
$$\mathbb{H}[X] = \mathbb{H}[Y] = \frac{1}{2}\ln\big(2\pi e\sigma^2\big)$$

 The joint entropy is also given similarly as
$$\mathbb{H}[X, Y] = \frac{1}{2}\ln\big[(2\pi e)^2\sigma^4(1 - \rho^2)\big]$$

 Thus:
$$\mathbb{I}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{H}[x, y] = \frac{1}{2}\log\frac{1}{1 - \rho^2}$$

 Note:
$$\rho = 0 \;(\text{independent } X, Y) \;\Rightarrow\; \mathbb{I}[x, y] = 0, \qquad
\rho = 1 \;(\text{linearly correlated } X = Y) \;\Rightarrow\; \mathbb{I}[x, y] \to \infty$$
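A Python sketch (illustrative; the grid limits and resolution are arbitrary choices) comparing the closed form ½ log 1/(1 − 𝜌²) with a brute-force grid integration of the mutual-information integral:

```python
# Mutual information of two correlated zero-mean Gaussians: formula vs grid integration.
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.0, 0.8
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

mi_exact = 0.5 * np.log(1.0 / (1.0 - rho**2))      # closed form from the slide

# Numerical check: integrate p(x,y) ln[ p(x,y) / (p(x)p(y)) ] on a grid
grid = np.linspace(-6, 6, 601)
dx = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
pxy = multivariate_normal(mean=[0, 0], cov=cov).pdf(np.dstack([X, Y]))
px = norm(0, sigma).pdf(X)
py = norm(0, sigma).pdf(Y)
mi_numeric = np.sum(pxy * np.log(pxy / (px * py))) * dx * dx

print(mi_exact, mi_numeric)   # both ~0.51 nats for rho = 0.8
```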
Pointwise Mutual Information
 A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) 𝑥 and 𝑦, this is defined as
$$\mathrm{PMI}(x, y) \equiv \log\frac{p(x, y)}{p(x)\,p(y)} = \log\frac{p(x \mid y)}{p(x)} = \log\frac{p(y \mid x)}{p(y)}$$
 This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI, 𝕀[𝑥, 𝑦], of 𝑋 and 𝑌 is just the expected value of the PMI.

 This is the amount we learn from updating the prior 𝑝(𝑥) into the posterior 𝑝(𝑥|𝑦), or equivalently, updating the prior 𝑝(𝑦) into the posterior 𝑝(𝑦|𝑥).
Mutual Information
 For continuous random variables, it is common to first discretize or quantize them into bins, and to compute how many values fall in each histogram bin (Scott 1979).

 The number of bins used, and the location of the bin boundaries, can have a significant effect on the results.

 One can estimate the MI directly, without performing density estimation (Learned-Miller, 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved.

 Scott, D. (1979). On optimal and data-based histograms. Biometrika 66(3), 605–610.
 Learned-Miller, E. (2004). Hyperspacings and the estimation of information theoretic quantities. Technical Report 04-104, U. Mass. Amherst Comp. Sci. Dept.
 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
 Speed, T. (2011, December). A correlation for the 21st century. Science 334, 1502–1503.
*Use MatLab function miMixedDemo from Kevin Murphy's PMTK
Maximal Information Coefficient
 This statistic, appropriately normalized, is known as the maximal information coefficient (MIC).
 We first define:
$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big[X(G); Y(G)\big]}{\log\min(x, y)}$$
 Here 𝒢(𝑥, 𝑦) is the set of 2d grids of size 𝑥 × 𝑦, and 𝑋(𝐺), 𝑌(𝐺) represents a discretization of the variables onto this grid. (The maximization over bin locations is performed efficiently using dynamic programming.)

 Now define the MIC as
$$\mathrm{MIC} \equiv \max_{x, y \,:\, xy \le B} m(x, y)$$

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
Maximal Information Coefficient
 The MIC is defined as:
$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big[X(G); Y(G)\big]}{\log\min(x, y)}, \qquad \mathrm{MIC} \equiv \max_{x, y \,:\, xy \le B} m(x, y)$$
 𝐵 is some sample-size dependent bound on the number of bins we can use and still reliably estimate the distribution (Reshef et al. suggest 𝐵 ≈ 𝑁^0.6).

 MIC lies in the range [0, 1], where 0 represents no relationship between the variables, and 1 represents a noise-free relationship of any form, not just linear.

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
Correlation Coefficient Vs MIC
(see mutualInfoAllPairsMixed and miMixedDemo from PMTK3)

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.

 The data consists of 357 variables measuring a variety of social, economic, etc. indicators, collected by the WHO.
 On the left, we see the correlation coefficient (CC) plotted against the MIC for all 63,566 variable pairs.
 On the right, we see scatter plots for particular pairs of variables.
Correlation Coefficient Vs MIC
 The point marked 𝐶 has a low CC and a low MIC. From the corresponding scatter plot we see that there is no relationship between these two variables.

 The points marked 𝐷 and 𝐻 have high CC (in absolute value) and high MIC, and we see from the scatter plot that they represent nearly linear relationships.
Correlation Coefficient Vs MIC
 The points 𝐸, 𝐹, and 𝐺 have low CC but high MIC. They correspond to non-linear (and sometimes, as in 𝐸 and 𝐹, one-to-many) relationships between the variables.

 Statistics (such as MIC) based on mutual information can be used to discover interesting relationships between variables in a way that correlation coefficients cannot.
