Lec5 IntroToProbabilityAndStatistics

This document discusses topics in probability and statistics including the law of large numbers, the central limit theorem, and statistical inference. It introduces the Markov and Chebyshev inequalities and describes how the average of i.i.d. random variables converges to the mean: in probability by the weak law of large numbers and almost surely by the strong law. The central limit theorem states that the distribution of the suitably standardized sample mean of i.i.d. random variables converges in distribution to a normal distribution as the sample size increases. Statistical inference involves either parametric or non-parametric approaches to estimating properties of the underlying probability distribution that generated the observed data.


Introduction to

Probability and Statistics


(Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA

Email: [email protected]
URL: https://www.zabaras.com/

August 29, 2018

Contents
 Markov and Chebyshev Inequalities

 The Law of Large Numbers, Central Limit Theorem, Monte Carlo Approximation of Distributions, Estimating 𝜋, Accuracy of Monte Carlo Approximation, Approximating the Binomial with a Gaussian, Approximating the Poisson Distribution with a Gaussian, Application of the CLT to Noise Signals

 Information Theory, Entropy, KL Divergence, Jensen's Inequality, Mutual Information, Maximal Information Coefficient
References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2

• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Markov and Chebyshev Inequalities
 Markov's inequality: if 𝑋 is a non-negative integrable random variable, then for any 𝑎 > 0:
$$\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a}$$
Indeed: $\mathbb{E}[X] = \int_0^\infty x\,\pi(x)\,dx \ge \int_a^\infty x\,\pi(x)\,dx \ge a\int_a^\infty \pi(x)\,dx = a\,\Pr[X \ge a]$.

 You can generalize this using any non-negative function 𝑓 of the random variable 𝑋:
$$\Pr[f(X) \ge a] \le \frac{\mathbb{E}[f(X)]}{a}$$

 Using 𝑓(𝑋) = (𝑋 − 𝔼[𝑋])², we derive the Chebyshev inequality: for any 𝜀 > 0,
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge \varepsilon\,\big] \le \frac{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}{\varepsilon^2} = \frac{\sigma^2}{\varepsilon^2}$$

 In terms of standard deviations, setting 𝜀 = 𝑘𝜎 we can restate this as $\Pr\big[\,|X - \mathbb{E}[X]| \ge k\sigma\,\big] \le \frac{1}{k^2}$.

 Thus the probability of 𝑋 being more than 2𝜎 away from 𝔼[𝑋] is ≤ 1/4.
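A minimal Python sketch (not part of the original slides) that checks the Chebyshev bound empirically; the exponential distribution and the sample size below are arbitrary illustrative choices.

```python
# Empirical check of Chebyshev's inequality Pr[|X - E[X]| >= k*sigma] <= 1/k^2.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # E[X] = 2, sigma = 2
mu, sigma = x.mean(), x.std()

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k = {k}: empirical tail {empirical:.4f} <= bound {1 / k**2:.4f}")
```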
The Law of Large Numbers (LLN)
 Let 𝑋ᵢ, 𝑖 = 1, 2, . . . , 𝑛, be independent and identically distributed (i.i.d.) random variables with finite mean 𝔼(𝑋ᵢ) = 𝜇 and variance Var(𝑋ᵢ) = 𝜎².

 Let
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

 Note that
$$\mathbb{E}\big[\bar{X}_n\big] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu, \qquad \mathrm{Var}\big[\bar{X}_n\big] = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}$$

 Weak LLN: $\lim_{n\to\infty} \Pr\big[\,|\bar{X}_n - \mu| \ge \varepsilon\,\big] = 0$ for all 𝜀 > 0.

 Strong LLN: $\lim_{n\to\infty} \bar{X}_n = \mu$ almost surely. This means that with probability one, the average of any realizations 𝑥₁, 𝑥₂, … of the random variables 𝑋₁, 𝑋₂, … converges to the mean.
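As a quick illustration (a Python sketch, not part of the slides), the running average of Uniform[0, 1] draws settles near the true mean 0.5 as 𝑛 grows:

```python
# Running sample average of i.i.d. Uniform[0, 1] draws; the true mean is 0.5.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 10_000, 100_000):
    print(f"n = {n:6d}: sample average = {running_mean[n - 1]:.4f}  (true mean 0.5)")
```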
Statistical Inference: Parametric & Non-Parametric Approach
 Assume that we have a set of observations
$$S = \{x_1, x_2, \ldots, x_N\}, \qquad x_j \in \mathbb{R}^n$$

 The problem is to infer the underlying probability distribution that gives rise to the data 𝑆.

 Parametric problem: The underlying probability density has a specified known form that depends on a number of parameters. The problem of interest is to infer those parameters.

 Non-parametric problem: No analytical expression for the probability density is available. The description consists of defining the dependency or independency of the data. This leads to numerical exploration.

 A typical situation for the parametric model is when the distribution is the PDF of a random variable $X : \Omega \to \mathbb{R}^n$.
Example of the Law of Large Numbers
 Assume that we sample $S = \{x_1, x_2, \ldots, x_N\}$, $x_j \in \mathbb{R}^2$.

 We consider a parametric model with the 𝑥ⱼ being realizations of $X \sim \mathcal{N}(x_0, \Sigma)$, $\Sigma \in \mathbb{R}^{2\times 2}$, where we take both the mean 𝑥₀ and the covariance 𝛴 as unknowns.

 The probability density of 𝑋 is:
$$\pi(x \mid x_0, \Sigma) = \frac{1}{2\pi\,(\det\Sigma)^{1/2}}\,\exp\!\left(-\tfrac{1}{2}(x - x_0)^T \Sigma^{-1}(x - x_0)\right)$$

 Our problem is to estimate 𝑥₀ and $\Sigma \in \mathbb{R}^{2\times 2}$.
Empirical Mean and Empirical Covariance
 From the law of large numbers, we calculate:
$$x_0 = \mathbb{E}[X] \approx \frac{1}{N}\sum_{j=1}^{N} x_j = \bar{x}$$

 To compute the covariance matrix, note that if 𝑋₁, 𝑋₂, … are i.i.d., so are 𝑓(𝑋₁), 𝑓(𝑋₂), … for any function $f : \mathbb{R}^2 \to \mathbb{R}^k$.

 Then we can compute:
$$\Sigma = \mathrm{cov}(X) = \mathbb{E}\big[(x - \mathbb{E}[X])(x - \mathbb{E}[X])^T\big] \approx \mathbb{E}\big[(x - \bar{x})(x - \bar{x})^T\big]
\;\Rightarrow\;
\Sigma \approx \frac{1}{N}\sum_{j=1}^{N}(x_j - \bar{x})(x_j - \bar{x})^T = \bar{\Sigma}$$

 The above formulas define the empirical mean and empirical covariance.
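A small Python sketch of these estimators on synthetic data (the true mean and covariance below are arbitrary illustrative values; note the 1/𝑁 normalization used on the slide):

```python
# Empirical mean and empirical covariance of 2-d samples, using the 1/N convention
# from the slides (np.cov uses 1/(N-1) by default, hence bias=True in the check).
import numpy as np

rng = np.random.default_rng(2)
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal(true_mean, true_cov, size=5000)   # shape (N, 2)

x_bar = x.mean(axis=0)
Sigma_bar = (x - x_bar).T @ (x - x_bar) / x.shape[0]

print("empirical mean:", x_bar)
print("empirical covariance:\n", Sigma_bar)
print("numpy check:\n", np.cov(x.T, bias=True))
```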
The Central Limit Theorem
 Let (𝑋₁, 𝑋₂, … , 𝑋_N) be independent and identically distributed (i.i.d.) random variables, each with expectation 𝜇 and variance 𝜎².

 Define:
$$Z_N = \frac{1}{\sigma\sqrt{N}}\big(X_1 + X_2 + \ldots + X_N - N\mu\big) = \frac{\bar{X}_N - \mu}{\sigma/\sqrt{N}}, \qquad \bar{X}_N = \frac{1}{N}\sum_{j=1}^{N} X_j$$

 As 𝑁 → ∞, the distribution of 𝑍_N converges to the distribution of a standard normal random variable:
$$\lim_{N\to\infty} P\big[Z_N \le x\big] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$

 For 𝑁 large, $\bar{X}_N \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right)$ approximately.

 This is somewhat of a justification for the common assumption of Gaussian noise.
The CLT and the Gaussian Distribution
 As an example, assume 𝑁 variables (𝑋1, 𝑋2, … 𝑋𝑁 ) each of
which has a uniform distribution over [0, 1] and then
consider the distribution of the sample mean
(𝑋1 + 𝑋2 + … + 𝑋𝑁 )/𝑁. For large 𝑁, this distribution tends
to a Gaussian. The convergence as 𝑁 increases can be
rapid.
N 2
4 4

N 1 N  10
4

3.5 3.5
3.5

3 3
3

2.5 2.5
2.5

2 2 2

1.5 1.5 1.5

1 1 1

0.5 0.5 0.5

0 0 0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

MatLab Code

Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras) 10
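The referenced demo is a MATLAB script; the following Python sketch (assuming matplotlib is available) produces an analogous set of histograms for 𝑁 = 1, 2, 10:

```python
# Histograms of the mean of N Uniform[0, 1] variables for N = 1, 2, 10,
# illustrating the CLT; analogous to (not identical to) the referenced MATLAB demo.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, N in zip(axes, (1, 2, 10)):
    means = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"N = {N}")
plt.tight_layout()
plt.show()
```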
The CLT and the Gaussian Distribution
 We plot a histogram of $\frac{1}{N}\sum_{i=1}^{N} x_{ij}$, 𝑗 = 1:10000, where $x_{ij} \sim \mathrm{Beta}(1, 5)$.

 As 𝑁 → ∞, the distribution tends towards a Gaussian.

[Figure: histograms of the sample mean of 𝑁 Beta(1, 5) draws for 𝑁 = 1, 𝑁 = 5, and 𝑁 = 10. Run centralLimitDemo from PMTK]
Accuracy of Monte Carlo Approximation
 In the Monte Carlo approximation of the mean $\mu = \mathbb{E}[f(x)]$ using the sample mean 𝜇̄, we have:
$$\bar{\mu} - \mu \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{N}\right), \qquad
\sigma^2 = \mathbb{E}[f^2(x)] - \mathbb{E}[f(x)]^2 \approx \frac{1}{N}\sum_{s=1}^{N}\big(f(x_s) - \bar{\mu}\big)^2 \equiv \bar{\sigma}^2$$

 We can now derive the following error bars (using central intervals):
$$\Pr\!\left[\mu - 1.96\frac{\bar{\sigma}}{\sqrt{N}} \le \bar{\mu} \le \mu + 1.96\frac{\bar{\sigma}}{\sqrt{N}}\right] = 0.95$$

 The number of samples needed to drop the error below 𝜀 is then:
$$1.96\frac{\bar{\sigma}}{\sqrt{N}} \le \varepsilon \;\Rightarrow\; N \ge \frac{4\bar{\sigma}^2}{\varepsilon^2}$$
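A Python sketch of these error bars, taking as an arbitrary example 𝑓(𝑥) = 𝑥² with 𝑥 ~ 𝒩(1.5, 0.25²):

```python
# Monte Carlo estimate of mu = E[f(x)] with a 95% central-interval error bar
# and the sample size needed to reach a target accuracy epsilon.
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x**2
N = 10_000
x = rng.normal(1.5, 0.25, size=N)

mu_bar = f(x).mean()
sigma_bar = f(x).std()                       # sample estimate of sigma
half_width = 1.96 * sigma_bar / np.sqrt(N)
print(f"mu_bar = {mu_bar:.4f} +/- {half_width:.4f}  (95% interval)")

eps = 0.001
print("samples needed for error below eps:", int(np.ceil(4 * sigma_bar**2 / eps**2)))
```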
Monte Carlo Approximation of Distributions
 Computing the distribution of 𝑦 = 𝑥², where 𝑝(𝑥) is uniform on [−1, 1]. The MC approximation is obtained by taking samples from 𝑝(𝑥), squaring them, and plotting the histogram.

[Figure: the uniform density 𝑝(𝑥) on [−1, 1] (left), the density 𝑝(𝑦) (middle), and its Monte Carlo approximation (right). Run changeofVarsDemo1d from PMTK]
Accuracy of Monte Carlo Approximation
 The accuracy of MC increases with the number of samples. Histograms (on the left) and pdfs obtained by kernel density estimation (on the right) for samples from $\mathcal{N}(x \mid 1.5, 0.25^2)$, using 10 samples (top) and 100 samples (bottom). The actual distribution is shown in red.

[Figure: MC approximations with 10 and 100 samples vs the exact density 𝒩(x | 1.5, 0.25²). Run mcAccuracyDemo from PMTK]
Example of CLT: Estimating 𝜋 by MC
 Use the CLT to approximate 𝜋. Let 𝑥, 𝑦 ~ 𝒰[−𝑟, 𝑟]:
$$I = \int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,dx\,dy = \pi r^2$$
$$\pi = \frac{1}{r^2} I = \frac{1}{r^2}\,4r^2\int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy
= 4\int_{-r}^{r}\!\int_{-r}^{r}\mathbb{I}(x^2 + y^2 \le r^2)\,p(x)p(y)\,dx\,dy$$
$$\bar{\pi} \approx \frac{4}{N}\sum_{s=1}^{N}\mathbb{I}(x_s^2 + y_s^2 \le r^2), \qquad x_s, y_s \sim \mathcal{U}[-r, r]$$

 Here 𝑥, 𝑦 are uniform random variables on [−𝑟, +𝑟] with 𝑟 = 2, so that 𝑝(𝑥) = 𝑝(𝑦) = 1/(2𝑟).

 We find 𝜋̄ ≈ 3.1416 with standard error 0.09.

[Figure: samples 𝑥, 𝑦 ~ 𝒰[−𝑟, 𝑟] with 𝑟 = 2; the points inside the circle 𝑥² + 𝑦² ≤ 𝑟² are used to estimate 𝜋. Run mcEstimatePi from PMTK]
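A Python sketch of the same estimator (independent of the referenced PMTK MATLAB demo), reporting the estimate and its Monte Carlo standard error; each term 4·𝕀(·) plays the role of 𝑓(𝑥) in the error-bar formula of the previous slides:

```python
# Monte Carlo estimate of pi with r = 2, plus its standard error.
import numpy as np

rng = np.random.default_rng(5)
r, N = 2.0, 100_000
x = rng.uniform(-r, r, size=N)
y = rng.uniform(-r, r, size=N)

f = 4.0 * (x**2 + y**2 <= r**2)          # each term has expectation pi
pi_bar = f.mean()
std_err = f.std() / np.sqrt(N)
print(f"pi estimate = {pi_bar:.4f} +/- {std_err:.4f}")
```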
CLT: The Binomial Tends to a Gaussian
 One consequence of the CLT is that the binomial distribution
$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m}\mu^m (1 - \mu)^{N-m}$$
which is a distribution over 𝑚 defined by the sum of 𝑁 observations of a random binary variable 𝑥, will tend to a Gaussian as 𝑁 → ∞:
$$\frac{x_1 + x_2 + \ldots + x_N}{N} = \frac{m}{N} \sim \mathcal{N}\!\left(\mu, \frac{\mu(1-\mu)}{N}\right), \qquad
m \sim \mathcal{N}\big(N\mu,\, N\mu(1-\mu)\big)$$

[Figure: the binomial distribution Bin(m | N = 10, μ = 0.25) over m = 0, …, 10. Matlab Code]
Poisson Process
Consider that we count the number of photons from a light source. Let
𝑁(𝑡) be the number of photons observed in the time interval [0, 𝑡]. 𝑁(𝑡) is
an integer-valued random variable. We make the following assumptions:
a. Stationarity: Let ∆1 and ∆2 be any two time intervals of equal length, 𝑛
any non-negative integer. Assume that
Prob. of 𝑛 photons in ∆1 = Prob. of 𝑛 photons in ∆2
b. Independent increments: Let ∆1, ∆2, … , ∆𝑛 be non-overlapping time
intervals and 𝑘1, 𝑘2, … , 𝑘𝑛 non-negative integers. Denote by 𝐴𝑗 the event
defined as
𝐴𝑗 = 𝑘𝑗 photons arrive in the time interval ∆𝑗
Assume that these events are mutually independent, i.e.
$$P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1)\,P(A_2)\cdots P(A_n)$$

c. Negligible probability of coincidence: Assume that the probability of two or more events at the same time is negligible. More precisely, 𝑁(0) = 0 and
$$\lim_{h\to 0}\frac{P\big[N(h) > 1\big]}{h} = 0$$
Poisson Process
 If these assumptions hold, then for a given time 𝑡, 𝑁(𝑡) is a Poisson process:
$$P\big[N(t) = n\big] = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}, \qquad \lambda > 0,\; n = 0, 1, 2, \ldots$$

 Let us fix 𝑡 = 𝑇 = observation time and define a random variable 𝑁 = 𝑁(𝑇). Let us define the parameter 𝜃 = 𝜆𝑇. We then denote:
$$N \sim \mathrm{Poisson}(\theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Springer, 2007.
 S. Ghahramani, Fundamentals of Probability, 1996.
Poisson Process
 Consider the Poisson (discrete) distribution over 𝑁 ∈ {0, 1, 2, …}:
$$P(N = n) = \pi_{\mathrm{Poisson}}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}$$

 The mean and the variance are both equal to 𝜃:
$$\mathbb{E}[N] = \sum_{n=0}^{\infty} n\,\pi_{\mathrm{Poisson}}(n \mid \theta) = \theta, \qquad
\mathbb{E}\big[(N - \theta)^2\big] = \theta$$
Approximating a Poisson Distribution With a Gaussian
 Theorem: A random variable 𝑋 ~ Poisson(𝜃) can be considered as the sum of 𝑛 independent random variables 𝑋ᵢ ~ Poisson(𝜃/𝑛).ᵃ

 According to the Central Limit Theorem, when 𝑛 is large enough,
$$X_i \sim \mathrm{Poisson}\!\left(\frac{\theta}{n}\right) \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} X_i \sim \mathcal{N}\!\left(\frac{\theta}{n}, \frac{\theta}{n^2}\right) \;\text{(approximately)}$$

 Then $X = \sum_{i=1}^{n} X_i$, which by the Theorem is a draw from Poisson(𝜃), also follows a Gaussian for large 𝑛, with:
$$\mathbb{E}[X] = n\,\frac{\theta}{n} = \theta, \qquad \mathrm{var}[X] = n^2\,\frac{\theta}{n^2} = \theta$$

 Thus $X \sim \mathcal{N}(\theta, \theta)$ approximately.

 The approximation of a Poisson distribution with a Gaussian for large 𝑛 is a result of the CLT!

ᵃ For a proof that the sum of independent Poisson random variables is a Poisson distribution, see this document.
Approximating a Poisson Distribution with a Gaussian
[Figure: Poisson distributions (dots) vs their Gaussian approximations (solid line) for means 𝜃 = 5, 10, 15, and 20. The higher the 𝜃, the smaller the distance between the two distributions. See this MatLab implementation.]
Kullback-Leibler Distance Between Two Densities
 Let us consider the following two distributions:
$$\pi(n) = \pi_{\mathrm{Poisson}}(n \mid \theta) = e^{-\theta}\,\frac{\theta^n}{n!}, \qquad
\pi(x) = \pi_{\mathrm{Gaussian}}(x \mid \theta, \theta) = \frac{1}{\sqrt{2\pi\theta}}\exp\!\left(-\frac{1}{2\theta}(x - \theta)^2\right)$$

 We often use the Kullback-Leibler distance to define the distance between two distributions. In particular, in approximating the Poisson distribution with a Gaussian distribution, we have:
$$\mathrm{KL}\big(\pi_{\mathrm{Poisson}}(\cdot \mid \theta),\, \pi_{\mathrm{Gaussian}}(\cdot \mid \theta, \theta)\big)
= \sum_{n=0}^{\infty}\pi_{\mathrm{Poisson}}(n \mid \theta)\,\log\frac{\pi_{\mathrm{Poisson}}(n \mid \theta)}{\pi_{\mathrm{Gaussian}}(n \mid \theta, \theta)}$$
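A Python sketch (an illustration, not the MatLab implementation referenced on these slides) that evaluates this sum numerically, working in log space for stability and truncating the sum at a large 𝑛:

```python
# KL distance between Poisson(theta) and its Gaussian approximation N(theta, theta),
# evaluated by summing over the integers n = 0, ..., n_max.
import numpy as np
from scipy.stats import poisson, norm

def kl_poisson_gaussian(theta, n_max=200):
    n = np.arange(0, n_max + 1)
    log_p = poisson.logpmf(n, theta)                        # Poisson log-probabilities
    log_q = norm.logpdf(n, loc=theta, scale=np.sqrt(theta)) # Gaussian log-density at integers
    return np.sum(np.exp(log_p) * (log_p - log_q))

for theta in (1, 5, 10, 20):
    print(f"theta = {theta:2d}: KL = {kl_poisson_gaussian(theta):.4f}")
```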
Approximating a Poisson Distribution With a Gaussian
[Figure: the KL distance of the Poisson distribution from its Gaussian approximation as a function of the mean 𝜃, on a logarithmic scale. The horizontal line indicates where the KL distance has dropped to 1/10 of its value at 𝜃 = 1. See the following MatLab implementation.]
Application of the CLT: Noise Signals
 Consider discrete sampling where the output is a noise signal of length 𝑛.

 The noise vector $x \in \mathbb{R}^n$ is a realization of $X : \Omega \to \mathbb{R}^n$.

 We estimate the mean and the variance of the noise in a single measurement as follows:
$$x_0 = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad s^2 = \frac{1}{n}\sum_{j=1}^{n}(x_j - x_0)^2$$

 To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:
$$\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \in \mathbb{R}^n$$

 The average noise is a realization of a random variable:
$$\bar{X} = \frac{1}{N}\sum_{k=1}^{N} X^{(k)} : \Omega \to \mathbb{R}^n$$

 If $X^{(1)}, X^{(2)}, \ldots$ are i.i.d., $\bar{X}$ is asymptotically Gaussian by the CLT, and its variance is $\mathrm{var}(\bar{X}) = s^2/N$. We need to repeat the experiment until 𝑠²/𝑁 drops below a given tolerance.
Noise Reduction By Averaging
Gaussian noise vectors of size 50 are used with 𝜎² = 1; 𝑠 is the std of a single noise vector. The estimated noise level is $\sqrt{s^2/N}$.

[Figure: a single noise vector (N = 1) and averages over N = 5, 10, and 25 repeated measurements; the amplitude of the averaged noise shrinks as N grows. See the following MatLab implementation.]

 D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Springer, 2007.
Introduction to Information Theory
 Information theory is concerned with
  representing data in a compact fashion (data compression or source coding), and
  transmitting and storing it in a way that is robust to errors (error correction or channel coding).

 Compactly representing data requires allocating short codewords to highly probable bit strings and reserving longer codewords for less probable bit strings.

 E.g. in natural language, common words ("a", "the", "and") are much shorter than rare words.

 D. MacKay, Information Theory, Inference and Learning Algorithms (Video Lectures)
Introduction to Information Theory
 Decoding messages sent over noisy channels requires
having a good probability model of the kinds of
messages that people tend to send.

 We need models that can predict which kinds of data


are likely and which unlikely.

• David MacKay, Information Theory, Inference and Learning Algorithms , 2003 (available on line)
• Thomas M. Cover, Joy A. Thomas , Elements of Information Theory , Wiley, 2006.
• Viterbi, A. J. and J. K. Omura (1979). Principles of Digital Communication and Coding. McGraw-Hill.

Introduction to Information Theory
 Consider a discrete random variable 𝑥. We ask how
much information (‘degree of surprise’) is received when
we observe (learn) a specific value for this variable?

 Observing a highly probable event provides little


additional information.
 If we have two events 𝑥 and 𝑦 that are unrelated, then
the information gain from observing both of them should
be ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦).

 Two unrelated events will be statistically independent, so


𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦).

Entropy
 From ℎ(𝑥, 𝑦) = ℎ(𝑥) + ℎ(𝑦) and 𝑝(𝑥, 𝑦) = 𝑝(𝑥)𝑝(𝑦), it is easily shown that ℎ(𝑥) must be given by the logarithm of 𝑝(𝑥), and so we have
$$h(x) = -\log_2 p(x) \ge 0$$
(the units of ℎ(𝑥) are bits, 'binary digits').

 Low probability events correspond to high information content.

 When transmitting a random variable, the average amount of transmitted information is the entropy of 𝑋:
$$\mathbb{H}[X] = -\sum_{k=1}^{K} p(X = k)\,\log_2 p(X = k)$$
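A minimal Python helper (illustrative, not from the slides) implementing this definition in bits:

```python
# Entropy in bits of a discrete distribution given as a vector of probabilities.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # the term 0*log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform over 4 states)
print(entropy_bits([0.9, 0.1]))                  # ~0.469 bits
```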
Noiseless Coding Theorem (Shannon)
 Example 1 (Coding theory): 𝑥 is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of 𝑥?
All states equally likely: $\mathbb{H}[x] = -8\times\frac{1}{8}\log_2\frac{1}{8} = 3$ bits.

 Example 2: Consider a variable having 8 possible states {𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓, 𝑔, ℎ} for which the respective (non-uniform) probabilities are given by (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

 The entropy in this case is smaller than for the uniform distribution:
$$\mathbb{H}[x] = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - \frac{4}{64}\log_2\frac{1}{64} = 2 \text{ bits}$$

 Note: shorter codes are used for the more probable events vs longer codes for the less probable events:
$$\text{average code length} = 1\times\frac{1}{2} + 2\times\frac{1}{4} + 3\times\frac{1}{8} + 4\times\frac{1}{16} + 4\times 6\times\frac{1}{64} = 2 \text{ bits}$$

Shannon's Noiseless Coding Theorem (1948): The entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
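The 2-bit figures in Example 2 can be checked numerically; the code lengths (1, 2, 3, 4, 6, 6, 6, 6) below are those implied by the average-code-length computation above:

```python
# Entropy and average code length for the 8-state example with probabilities
# (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) and code lengths (1,2,3,4,6,6,6,6).
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])

H = -np.sum(p * np.log2(p))
avg_len = np.sum(p * lengths)
print(f"entropy = {H} bits, average code length = {avg_len} bits")  # both 2.0
```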
Alternative Definition of Entropy
 Consider a set of 𝑁 identical objects that are to be divided amongst a set of bins, such that there are 𝑛ᵢ objects in the 𝑖th bin. Consider the number of different ways of allocating the objects to the bins.
 In the 𝑖th bin there are 𝑛ᵢ! ways of reordering the objects (microstates), and so the total number of ways of allocating the 𝑁 objects to the bins is given by the multiplicity
$$W = \frac{N!}{\prod_i n_i!}$$
 The entropy is defined as
$$\mathbb{H} = \frac{1}{N}\ln W = \frac{1}{N}\ln N! - \frac{1}{N}\sum_i \ln n_i!$$
 We now consider the limit 𝑁 → ∞ (with the ratios 𝑛ᵢ/𝑁 held fixed) and use $\ln N! \simeq N\ln N - N$, $\ln n_i! \simeq n_i\ln n_i - n_i$:
$$\mathbb{H} = -\lim_{N\to\infty}\sum_i \frac{n_i}{N}\ln\frac{n_i}{N} = -\sum_i p_i\ln p_i$$
where $p_i = n_i/N$ is the probability of an object being assigned to the 𝑖th bin. The occupation numbers 𝑝ᵢ correspond to macrostates.
Alternative Definition of Entropy
 Interpret the bins as the states 𝑥ᵢ of a discrete random variable 𝑋, where 𝑝(𝑋 = 𝑥ᵢ) = 𝑝ᵢ. The entropy of the random variable 𝑋 is then
$$\mathbb{H}[p] = -\sum_i p(x_i)\,\ln p(x_i)$$

 Distributions 𝑝(𝑥) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
Maximum Entropy: Uniform Distribution
 The maximum entropy configuration can be found by maximizing ℍ using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize
$$\widetilde{\mathbb{H}} = -\sum_i p(x_i)\ln p(x_i) + \lambda\left(\sum_i p(x_i) - 1\right)$$
 We find 𝑝(𝑥ᵢ) = 1/𝑀, where 𝑀 is the number of possible states, and ℍ = ln 𝑀.
 To verify that the stationary point is indeed a maximum, we can evaluate the 2nd derivative of the entropy, which gives
$$\frac{\partial^2\widetilde{\mathbb{H}}}{\partial p(x_i)\,\partial p(x_j)} = -I_{ij}\frac{1}{p_i}$$
where 𝐼ᵢⱼ are the elements of the identity matrix.

 For any discrete distribution with 𝑀 states, we have ℍ[𝑥] ≤ ln 𝑀:
$$-\sum_i p(x_i)\ln p(x_i) = \sum_i p(x_i)\ln\frac{1}{p(x_i)} \le \ln\left(\sum_i p(x_i)\frac{1}{p(x_i)}\right) = \ln M$$
 Here we used Jensen's inequality (for the concave function log).
Example: Biosequence Analysis
 Recall the DNA sequence logo example discussed earlier.

 The height of each bar is defined to be 2 − ℍ, where ℍ is the entropy of that distribution and 2 = log₂4 is the maximum possible entropy.

 Thus a bar of height 0 corresponds to a uniform distribution over the four letters, whereas a bar of height 2 corresponds to a deterministic distribution.

[Figure: DNA sequence logo; the vertical axis is in bits (0 to 2), the horizontal axis is the sequence position (1 to 15). seqlogoDemo from PMTK]
Binary Variable
 Consider a binary random variable 𝑋 ∈ {0, 1}, and write 𝑝(𝑋 = 1) = 𝜃 and 𝑝(𝑋 = 0) = 1 − 𝜃.

 Hence the entropy becomes (the binary entropy function):
$$\mathbb{H}[X] = -\theta\log_2\theta - (1 - \theta)\log_2(1 - \theta)$$

 The maximum value of 1 occurs when the distribution is uniform, 𝜃 = 0.5.

[Figure: the binary entropy function H(X) plotted against p(X = 1), with its maximum of 1 bit at θ = 0.5. MatLab function bernoulliEntropyFig from PMTK]
Differential Entropy
 Divide 𝑥 into bins of width Δ. Assuming 𝑝(𝑥) is continuous, for each such bin there must exist an 𝑥ᵢ such that
$$\int_{i\Delta}^{(i+1)\Delta} p(x)\,dx = p(x_i)\Delta \;=\; \text{probability of falling in bin } i$$
$$\mathbb{H}_\Delta = -\sum_i p(x_i)\Delta\,\ln\big(p(x_i)\Delta\big) = -\sum_i p(x_i)\Delta\,\ln p(x_i) - \ln\Delta$$
$$\lim_{\Delta\to 0}\left\{-\sum_i p(x_i)\Delta\,\ln p(x_i)\right\} = -\int p(x)\ln p(x)\,dx \quad \text{(can be negative)}$$

 The ln Δ term is omitted since it diverges as Δ → 0 (indicating that infinitely many bits are needed to describe a continuous variable exactly).
Differential Entropy
 For a density defined over multiple continuous variables, denoted collectively by the vector 𝒙, the differential entropy is given by
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x}$$

 Differential entropy (unlike the discrete entropy) can be negative.

 When doing a variable transformation 𝒚(𝒙), use 𝑝(𝒙)𝑑𝒙 = 𝑝(𝒚)𝑑𝒚; e.g. if 𝒚 = 𝑨𝒙 then:
$$\mathbb{H}[\boldsymbol{x}] = -\int p(\boldsymbol{y})\ln\big(p(\boldsymbol{y})\,|\boldsymbol{A}|\big)\,d\boldsymbol{y} = \mathbb{H}[\boldsymbol{y}] - \ln|\boldsymbol{A}| \;\Rightarrow\; \mathbb{H}[\boldsymbol{y}] = \mathbb{H}[\boldsymbol{x}] + \ln|\boldsymbol{A}|$$
Differential Entropy and the Gaussian Distribution
 The distribution that maximizes the differential entropy subject to constraints on the first two moments is a Gaussian. We maximize
$$\widetilde{\mathbb{H}} = -\int p(x)\ln p(x)\,dx
+ \lambda_1\left(\int p(x)\,dx - 1\right)
+ \lambda_2\left(\int x\,p(x)\,dx - \mu\right)
+ \lambda_3\left(\int (x-\mu)^2 p(x)\,dx - \sigma^2\right)$$
where the three constraints enforce normalization, the given mean, and the given variance.

 Using calculus of variations, setting the functional derivative to zero gives
$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3(x-\mu)^2 = 0
\;\Rightarrow\; p(x) = e^{-1+\lambda_1+\lambda_2 x+\lambda_3(x-\mu)^2}$$
and using the constraints to fix the multipliers,
$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

 Evaluating the differential entropy of the Gaussian, we obtain (an expression for a multivariate Gaussian of dimension 𝑑 is also given):
$$\mathbb{H}[x] = \frac{1}{2}\big(1 + \ln(2\pi\sigma^2)\big) = \frac{1}{2}\ln\big((2\pi e)^d\det\boldsymbol{\Sigma}\big), \qquad d = 1,\; \det\boldsymbol{\Sigma} = \sigma^2$$
 Note that ℍ[𝑥] < 0 for 𝜎² < 1/(2𝜋𝑒).
Kullback-Leibler Divergence and Cross Entropy
 Consider some unknown distribution 𝑝(𝑥), and suppose that we have modeled this using an approximating distribution 𝑞(𝑥).

 If we use 𝑞(𝑥) to construct a coding scheme for the purpose of transmitting values of 𝑥 to a receiver, then the additional information required to specify 𝑥 is the KL divergence:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx - \left(-\int p(x)\ln p(x)\,dx\right) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx$$
(we transmit with the code based on 𝑞(𝑥), but average with respect to the exact probability 𝑝(𝑥)).

 The cross entropy is defined as:
$$\mathbb{H}(p, q) = -\int p(x)\ln q(x)\,dx$$
KL Divergence and Cross Entropy
 The cross entropy $\mathbb{H}(p, q) = -\int p(x)\ln q(x)\,dx$ is the average number of bits needed to encode data coming from a source with distribution 𝑝 when we use model 𝑞 to define our codebook.

 ℍ(𝑝) = ℍ(𝑝, 𝑝) is the expected number of bits using the true model, so the KL divergence is the average number of extra bits needed to encode the data because we used distribution 𝑞 to encode the data instead of the true distribution 𝑝.

 The "extra number of bits" interpretation makes it clear that KL(𝑝 ‖ 𝑞) ≥ 0, and that the KL is equal to zero iff 𝑞 = 𝑝:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx$$

 The KL distance is not a symmetric quantity, that is, $\mathrm{KL}(p\,\|\,q) \ne \mathrm{KL}(q\,\|\,p)$.
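A short Python sketch of these quantities for discrete distributions (in bits), illustrating KL(𝑝‖𝑞) = ℍ(𝑝, 𝑞) − ℍ(𝑝) ≥ 0 and the asymmetry; the two example distributions are arbitrary:

```python
# Cross entropy H(p, q), entropy H(p) = H(p, p), and KL(p || q) in bits.
import numpy as np

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p[p > 0] * np.log2(q[p > 0]))

def kl(p, q):
    return cross_entropy(p, q) - cross_entropy(p, p)   # H(p, q) - H(p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both >= 0 and generally unequal
```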
KL Divergence Between Two Gaussians
 Consider 𝑝(𝑥) = 𝒩(𝑥|𝜇, 𝜎²) and 𝑞(𝑥) = 𝒩(𝑥|𝑚, 𝑠²).
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx
= \int \mathcal{N}(x\mid\mu,\sigma^2)\,\frac{1}{2}\left[\ln(2\pi s^2) + \frac{(x-m)^2}{s^2}\right]dx - \frac{1}{2}\ln\big(2\pi e\sigma^2\big)$$

 Note that the first term can be computed using the moments and normalization condition of a Gaussian, and the second term comes from the differential entropy of a Gaussian.

 Finally we obtain:
$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left[\ln\frac{s^2}{\sigma^2} + \frac{\sigma^2 + \mu^2 - 2\mu m + m^2}{s^2} - 1\right]$$
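A Python sketch (illustrative) checking this closed form against numerical integration, for arbitrary example parameters:

```python
# KL(N(mu, sigma^2) || N(m, s^2)): closed form vs numerical integration.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.5, 1.2     # parameters of p (arbitrary example values)
m, s = -0.3, 2.0         # parameters of q

closed_form = 0.5 * (np.log(s**2 / sigma**2)
                     + (sigma**2 + mu**2 - 2*mu*m + m**2) / s**2 - 1)

integrand = lambda x: norm.pdf(x, mu, sigma) * (norm.logpdf(x, mu, sigma)
                                                - norm.logpdf(x, m, s))
numerical, _ = quad(integrand, -30, 30)

print(closed_form, numerical)   # the two values agree to several decimals
```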
KL Divergence Between Two Gaussians
 Consider now 𝑝(𝒙) = 𝒩(𝒙|𝝁, 𝚺) and 𝑞(𝒙) = 𝒩(𝒙|𝒎, 𝑳).
$$-\int p(\boldsymbol{x})\ln q(\boldsymbol{x})\,d\boldsymbol{x}
= \frac{1}{2}\int \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})\Big[D\ln(2\pi) + \ln|\boldsymbol{L}| + (\boldsymbol{x}-\boldsymbol{m})^T\boldsymbol{L}^{-1}(\boldsymbol{x}-\boldsymbol{m})\Big]d\boldsymbol{x}$$
$$= \frac{1}{2}\Big[D\ln(2\pi) + \ln|\boldsymbol{L}| + \mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\Big]$$
$$-\int p(\boldsymbol{x})\ln p(\boldsymbol{x})\,d\boldsymbol{x} = \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{D}{2}\ln(2\pi e)$$
$$\mathrm{KL}(p\,\|\,q) = \frac{1}{2}\left[\ln\frac{|\boldsymbol{L}|}{|\boldsymbol{\Sigma}|} - D + \mathrm{Tr}\big(\boldsymbol{L}^{-1}(\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^T)\big) - \boldsymbol{\mu}^T\boldsymbol{L}^{-1}\boldsymbol{m} - \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{\mu} + \boldsymbol{m}^T\boldsymbol{L}^{-1}\boldsymbol{m}\right]$$
Jensen’s Inequality
 For a convex function 𝑓, Jensen's inequality gives (it can be proven easily by induction)
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\; \sum_i \lambda_i = 1$$

 This is equivalent (assume 𝑀 = 2) to our requirement for convexity, 𝑓″(𝑥) > 0.
 Assume 𝑓″(𝑥) > 0 (strict convexity) for any 𝑥. A Taylor expansion around 𝑥₀ gives
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2}f''(x^*)(x - x_0)^2 \ge f(x_0) + f'(x_0)(x - x_0)$$
For 𝑥 = 𝑎, 𝑏:
$$f(a) \ge f(x_0) + f'(x_0)(a - x_0), \qquad f(b) \ge f(x_0) + f'(x_0)(b - x_0)$$
$$\Rightarrow\;\lambda f(a) + (1-\lambda)f(b) \ge f(x_0) + f'(x_0)\big(\lambda a + (1-\lambda)b - x_0\big)$$
Setting 𝑥₀ = 𝜆𝑎 + (1 − 𝜆)𝑏, Jensen's inequality is thus shown:
$$\lambda f(a) + (1-\lambda)f(b) \ge f\big(\lambda a + (1-\lambda)b\big)$$
Jensen’s Inequality
 Assume Jensen's inequality. We should show the converse, i.e. that 𝑓 is convex (𝑓′ is non-decreasing).
 Set 𝑎 = 𝑏 − 2𝜀, equivalently 𝑏 = 𝑎 + 2𝜀 > 𝑎, with 𝜀 > 0. Using Jensen's inequality with 𝜆 = 1/2:
$$\tfrac{1}{2}f(a) + \tfrac{1}{2}f(b) \ge f(0.5a + 0.5b)
= \tfrac{1}{2}f\big(0.5(b-2\varepsilon) + 0.5b\big) + \tfrac{1}{2}f\big(0.5a + 0.5(a+2\varepsilon)\big)
= \tfrac{1}{2}f(b-\varepsilon) + \tfrac{1}{2}f(a+\varepsilon)$$
$$\Rightarrow\; f(b) - f(b-\varepsilon) \ge f(a+\varepsilon) - f(a)$$

 For 𝜀 small, we thus have:
$$\frac{f(b) - f(b-\varepsilon)}{\varepsilon} \ge \frac{f(a+\varepsilon) - f(a)}{\varepsilon}
\;\;\text{or}\;\; f'(b) \ge f'(a) \;\Rightarrow\; f(\cdot)\text{ is convex}$$
Jensen’s Inequality
 Using Jensen's inequality
$$f\!\left(\sum_{i=1}^{M}\lambda_i x_i\right) \le \sum_{i=1}^{M}\lambda_i f(x_i), \qquad \lambda_i \ge 0,\; \sum_i \lambda_i = 1$$
for a discrete random variable (set 𝜆ᵢ = 𝑝ᵢ) results in:
$$f\big(\mathbb{E}[x]\big) \le \mathbb{E}\big[f(x)\big]$$
 We can generalize this result to continuous random variables:
$$f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
 We will use this shortly in the context of the KL distance.
 We often use Jensen's inequality for concave functions (e.g. log 𝑥). In that case, be sure to reverse the inequality!
Jensen’s Inequality: Example
 As another example of Jensen's inequality, consider the arithmetic and geometric means of a set of positive real variables:
$$\bar{x}_A = \frac{1}{M}\sum_{i=1}^{M} x_i, \qquad \bar{x}_G = \left(\prod_{i=1}^{M} x_i\right)^{1/M}$$

 Using Jensen's inequality for 𝑓(𝑥) = ln 𝑥 (concave), i.e. $\mathbb{E}[\ln x] \le \ln\mathbb{E}[x]$, we can show:
$$\ln\bar{x}_G = \frac{1}{M}\sum_{i=1}^{M}\ln x_i \le \ln\!\left(\frac{1}{M}\sum_{i=1}^{M} x_i\right) = \ln\bar{x}_A \;\Rightarrow\; \bar{x}_G \le \bar{x}_A$$
The Kullback-Leibler Divergence
$$f\big(\mathbb{E}[x]\big) \le \mathbb{E}\big[f(x)\big], \qquad f\!\left(\int x\,p(x)\,dx\right) \le \int f(x)\,p(x)\,dx$$
 Using Jensen's inequality, we can show (−ln is a convex function) that:
$$\mathrm{KL}(p\,\|\,q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx \ge -\ln\int p(x)\frac{q(x)}{p(x)}\,dx = -\ln\int q(x)\,dx = 0$$

 Thus we derive the following information inequality:
$$\mathrm{KL}(p\,\|\,q) \ge 0, \quad \text{with } \mathrm{KL}(p\,\|\,q) = 0 \text{ if and only if } p(x) = q(x)$$
Principle of Insufficient Reason
 An important consequence of the information inequality is that the discrete distribution with the maximum entropy is the uniform distribution.

 More precisely, ℍ(𝑋) ≤ log |𝒳|, where |𝒳| is the number of states for 𝑋, with equality iff 𝑝(𝑥) is uniform. To see this, let 𝑢(𝑥) = 1/|𝒳|. Then
$$0 \le \mathrm{KL}(p\,\|\,u) = \sum_x p(x)\log p(x) - \sum_x p(x)\log u(x) = \log|\mathcal{X}| - \mathbb{H}(X)$$

 This principle of insufficient reason argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another.
The Kullback-Leibler Divergence
 Data compression is in some way related to density estimation.

 The Kullback-Leibler divergence measures the distance between two distributions and is zero when the two densities are identical.

 Suppose the data is generated from an unknown 𝑝(𝒙) that we try to approximate with a parametric model 𝑞(𝒙|𝜽). Suppose we have observed training points 𝒙ₙ ~ 𝑝(𝒙), 𝑛 = 1, …, 𝑁. Then, using a sample average approximation of the mean:
$$\mathrm{KL}(p\,\|\,q) = -\int p(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p(\boldsymbol{x})}\,d\boldsymbol{x}
\approx \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta}) + \ln p(\boldsymbol{x}_n)\big]$$
The KL Divergence Vs. MLE
 Note that only the first term is a function of 𝑞.

 Thus minimizing KL(𝑝 ‖ 𝑞) is equivalent to maximizing the likelihood function for 𝜽 under the distribution 𝑞:
$$\mathrm{KL}(p\,\|\,q) = -\int p(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p(\boldsymbol{x})}\,d\boldsymbol{x}
\approx \frac{1}{N}\sum_{n=1}^{N}\big[-\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta}) + \ln p(\boldsymbol{x}_n)\big]$$

 So the MLE estimate minimizes the KL divergence to the empirical distribution
$$p_{\mathrm{emp}}(\boldsymbol{x}) = \frac{1}{N}\sum_{n=1}^{N}\delta(\boldsymbol{x}_n - \boldsymbol{x})$$
$$\arg\min_{q}\,\mathrm{KL}\big(p_{\mathrm{emp}}\,\|\,q\big), \qquad
\mathrm{KL}\big(p_{\mathrm{emp}}\,\|\,q\big) = -\int p_{\mathrm{emp}}(\boldsymbol{x})\ln\frac{q(\boldsymbol{x})}{p_{\mathrm{emp}}(\boldsymbol{x})}\,d\boldsymbol{x}
= \mathrm{const} - \frac{1}{N}\sum_{n=1}^{N}\ln q(\boldsymbol{x}_n \mid \boldsymbol{\theta})$$
Conditional Entropy
 For a joint distribution, the conditional entropy is
$$\mathbb{H}[y \mid x] = -\iint p(y, x)\ln p(y \mid x)\,dy\,dx$$

 This represents the average information needed to specify 𝑦 if we already know the value of 𝑥.

 It is easily seen, using 𝑝(𝑦, 𝑥) = 𝑝(𝑦|𝑥)𝑝(𝑥) and substituting inside the log in $\mathbb{H}[x, y] = -\iint p(x, y)\ln p(x, y)\,dy\,dx$, that the conditional entropy satisfies the relation
$$\mathbb{H}[x, y] = \mathbb{H}[y \mid x] + \mathbb{H}[x]$$
where ℍ[𝑥, 𝑦] is the differential entropy of 𝑝(𝑥, 𝑦) and ℍ[𝑥] is the differential entropy of 𝑝(𝑥).
Conditional Entropy for Discrete Variables
 Consider the conditional entropy for discrete variables:
$$\mathbb{H}[y \mid x] = -\sum_i\sum_j p(y_i, x_j)\ln p(y_i \mid x_j)$$
 To understand further the meaning of conditional entropy, let us consider the implications of ℍ[𝑦|𝑥] = 0.
 We have:
$$\mathbb{H}[y \mid x] = -\sum_j\left(\sum_i p(y_i \mid x_j)\ln p(y_i \mid x_j)\right)p(x_j) = 0$$

 Since each term −𝑝 ln 𝑝 is non-negative, we can conclude that for each 𝑥ⱼ such that 𝑝(𝑥ⱼ) ≠ 0 the following must hold: 𝑝(𝑦ᵢ|𝑥ⱼ) ln 𝑝(𝑦ᵢ|𝑥ⱼ) = 0 for every 𝑖.
 Since 𝑝 log 𝑝 = 0 ⟺ 𝑝 = 0 or 𝑝 = 1, and since 𝑝(𝑦ᵢ|𝑥ⱼ) is normalized, there is only one 𝑦ᵢ such that 𝑝(𝑦ᵢ|𝑥ⱼ) = 1, with all other 𝑝(·|𝑥ⱼ) = 0. Thus 𝑦 is a function of 𝑥.
Mutual Information
 If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the KL divergence between the joint distribution and the product of the marginals. This is the mutual information:
$$\mathbb{I}[x, y] \equiv \mathrm{KL}\big(p(x, y)\,\|\,p(x)p(y)\big) = -\iint p(x, y)\ln\frac{p(x)p(y)}{p(x, y)}\,dx\,dy \ge 0$$
with 𝕀[𝑥, 𝑦] = 0 iff 𝑥, 𝑦 are independent.
 The mutual information is related to the conditional entropy through
$$\mathbb{I}[x, y] = -\iint p(x, y)\ln\frac{p(y)}{p(y \mid x)}\,dx\,dy = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
$$\mathbb{I}[x, y] = \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$
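A minimal Python sketch computing 𝕀[𝑥, 𝑦] for a discrete joint distribution (the 2×2 table below is an arbitrary example), verifying 𝕀 = ℍ[𝑥] + ℍ[𝑦] − ℍ[𝑥, 𝑦]:

```python
# Mutual information of a discrete joint distribution p(x, y), in nats.
import numpy as np

pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])          # arbitrary example joint distribution
px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)

mi = np.sum(pxy * np.log(pxy / (px * py)))

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
print(mi, H(px) + H(py) - H(pxy))       # the two numbers agree
```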
Mutual Information
 The mutual information represents the reduction in the uncertainty about 𝑥 once we learn the value of 𝑦 (and conversely):
$$\mathbb{I}[x, y] = \mathbb{H}[x] - \mathbb{H}[x \mid y] = \mathbb{H}[y] - \mathbb{H}[y \mid x]$$

 In a Bayesian setting, 𝑝(𝑥) is the prior, 𝑝(𝑥|𝑦) the posterior, and I[𝑥, 𝑦] represents the reduction in uncertainty in 𝑥 once we observe 𝑦.
Note that H[x, y] ≤ H[x] + H[y]
 This is easy to prove by noticing that
$$\mathbb{I}[x, y] = \mathbb{H}[y] - \mathbb{H}[y \mid x] \ge 0 \quad (\text{a KL divergence})$$
and
$$\mathbb{H}[x, y] = \mathbb{H}[y \mid x] + \mathbb{H}[x]$$
from which
$$\mathbb{H}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{I}[x, y] \le \mathbb{H}[x] + \mathbb{H}[y]$$
 The equality here is true only if 𝑥, 𝑦 are independent:
$$\mathbb{H}[x, y] = -\iint p(x, y)\ln p(x, y)\,dy\,dx = -\iint p(x, y)\big[\ln p(x) + \ln p(y)\big]dy\,dx = \mathbb{H}[x] + \mathbb{H}[y] \quad \text{(sufficiency)}$$
$$\mathbb{H}[y \mid x] = \mathbb{H}[y] \;\Rightarrow\; \mathbb{I}[x, y] = 0 \;\Rightarrow\; p(x, y) = p(x)p(y) \quad \text{(necessity)}$$
Mutual Information for Correlated Gaussians
 Consider two correlated Gaussians as follows:
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} X \\ Y \end{pmatrix}\,\middle|\,\begin{pmatrix} 0 \\ 0 \end{pmatrix},\,\begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix}\right)$$

 For each of these variables we can write:
$$\mathbb{H}[X] = \mathbb{H}[Y] = \frac{1}{2}\ln\big(2\pi e\sigma^2\big)$$

 The joint entropy is also given similarly as
$$\mathbb{H}[X, Y] = \frac{1}{2}\ln\big[(2\pi e)^2\sigma^4(1 - \rho^2)\big]$$

 Thus:
$$\mathbb{I}[x, y] = \mathbb{H}[x] + \mathbb{H}[y] - \mathbb{H}[x, y] = \frac{1}{2}\log\frac{1}{1 - \rho^2}$$

 Note:
$$\rho = 0 \;(\text{independent } X, Y) \;\Rightarrow\; \mathbb{I}[x, y] = 0, \qquad
\rho = 1 \;(\text{linearly correlated } X = Y) \;\Rightarrow\; \mathbb{I}[x, y] \to \infty$$
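A Python sketch (illustrative; the grid limits and resolution are arbitrary choices) comparing the closed form ½ log 1/(1 − 𝜌²) with a brute-force grid integration of the mutual-information integral:

```python
# Mutual information of two correlated zero-mean Gaussians: formula vs grid integration.
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.0, 0.8
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

mi_exact = 0.5 * np.log(1.0 / (1.0 - rho**2))      # closed form from the slide

# Numerical check: integrate p(x,y) ln[ p(x,y) / (p(x)p(y)) ] on a grid
grid = np.linspace(-6, 6, 601)
dx = grid[1] - grid[0]
X, Y = np.meshgrid(grid, grid)
pxy = multivariate_normal(mean=[0, 0], cov=cov).pdf(np.dstack([X, Y]))
px = norm(0, sigma).pdf(X)
py = norm(0, sigma).pdf(Y)
mi_numeric = np.sum(pxy * np.log(pxy / (px * py))) * dx * dx

print(mi_exact, mi_numeric)   # both ~0.51 nats for rho = 0.8
```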
Pointwise Mutual Information
 A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) 𝑥 and 𝑦, this is defined as
$$\mathrm{PMI}(x, y) \equiv \log\frac{p(x, y)}{p(x)\,p(y)} = \log\frac{p(x \mid y)}{p(x)} = \log\frac{p(y \mid x)}{p(y)}$$
 This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI, 𝕀[𝑥, 𝑦], of 𝑋 and 𝑌 is just the expected value of the PMI.

 This is the amount we learn from updating the prior 𝑝(𝑥) into the posterior 𝑝(𝑥|𝑦), or equivalently, updating the prior 𝑝(𝑦) into the posterior 𝑝(𝑦|𝑥).
Mutual Information
 For continuous random variables, it is common to first discretize or quantize them into bins, and to compute how many values fall in each histogram bin (Scott 1979).

 The number of bins used, and the location of the bin boundaries, can have a significant effect on the results.

 One can estimate the MI directly, without performing density estimation (Learned-Miller, 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved.

 Scott, D. (1979). On optimal and data-based histograms. Biometrika 66(3), 605–610.
 Learned-Miller, E. (2004). Hyperspacings and the estimation of information theoretic quantities. Technical Report 04-104, U. Mass. Amherst Comp. Sci. Dept.
 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
 Speed, T. (2011, December). A correlation for the 21st century. Science 334, 1502–1503.
*Use MatLab function miMixedDemo from Kevin Murphy's PMTK
Maximal Information Coefficient
 This statistic, appropriately normalized, is known as the maximal information coefficient (MIC).
 We first define:
$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big[X(G); Y(G)\big]}{\log\min(x, y)}$$
 Here 𝒢(𝑥, 𝑦) is the set of 2d grids of size 𝑥 × 𝑦, and 𝑋(𝐺), 𝑌(𝐺) represents a discretization of the variables onto this grid. (The maximization over bin locations is performed efficiently using dynamic programming.)

 Now define the MIC as
$$\mathrm{MIC} \equiv \max_{x, y \,:\, xy \le B} m(x, y)$$

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
Maximal Information Coefficient
 The MIC is defined as:
$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big[X(G); Y(G)\big]}{\log\min(x, y)}, \qquad \mathrm{MIC} \equiv \max_{x, y \,:\, xy \le B} m(x, y)$$
 𝐵 is some sample-size dependent bound on the number of bins we can use and still reliably estimate the distribution (Reshef et al. suggest 𝐵 ≈ 𝑁^0.6).

 MIC lies in the range [0, 1], where 0 represents no relationship between the variables, and 1 represents a noise-free relationship of any form, not just linear.

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
Correlation Coefficient Vs MIC
(see mutualInfoAllPairsMixed and miMixedDemo from PMTK3)

 Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.

 The data consists of 357 variables measuring a variety of social, economic, etc. indicators, collected by the WHO.
 On the left, we see the correlation coefficient (CC) plotted against the MIC for all 63,566 variable pairs.
 On the right, we see scatter plots for particular pairs of variables.
Correlation Coefficient Vs MIC
 The point marked 𝐶 has a low CC and a low MIC. From the corresponding scatter plot we see that there is no relationship between these two variables.

 The points marked 𝐷 and 𝐻 have high CC (in absolute value) and high MIC, and we see from the scatter plot that they represent nearly linear relationships.
Correlation Coefficient Vs MIC
 The points 𝐸, 𝐹, and 𝐺 have low CC but high MIC. They correspond to non-linear (and sometimes, as in 𝐸 and 𝐹, one-to-many) relationships between the variables.

 Statistics (such as MIC) based on mutual information can be used to discover interesting relationships between variables in a way that correlation coefficients cannot.
