Information Theory Differential Entropy

The document defines differential entropy as a measure of uncertainty for continuous random variables: the expected value of the negative logarithm of the probability density function. Three key properties are discussed: (1) differential entropy can be negative; (2) differential entropy depends only on the probability density function, not on the random variable itself; (3) the differential entropy of a normal distribution grows with the logarithm of its variance. Several further concepts are introduced, including joint and conditional differential entropy, relative entropy, and mutual information, together with their relationships.

Differential Entropy

Definition
Let $X$ be a random variable with cumulative distribution function $F(x) = \Pr(X \le x)$. If $F(x)$ is continuous, the r.v. is said to be continuous. Let $f(x) = F'(x)$ when the derivative is defined. If $\int f(x)\,dx = 1$, then $f(x)$ is called the probability density function (pdf) of $X$.

The set where $f(x) > 0$ is called the support set of $X$.



Definition
The differential entropy $h(X)$ of a continuous r.v. $X$ with density function $f(x)$ is defined as

$$h(X) = -\int_S f(x) \log f(x)\, dx \qquad (1)$$

where $S$ is the support set of the r.v. Since $h(X)$ depends only on $f(x)$, the differential entropy is sometimes written as $h(f)$ rather than $h(X)$.

Ex. 1: (Uniform distribution)


$$f(x) = \begin{cases} \dfrac{1}{a}, & 0 \le x < a \\ 0, & \text{otherwise} \end{cases}$$

$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, dx = \log a$$

Note: For $a < 1$, $\log a < 0$, and $h(X) = \log a < 0$.

However, $2^{h(X)} = 2^{\log a} = a$ is the volume of the support set, which is always non-negative.
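As a quick numerical illustration (not part of the original notes, NumPy assumed available), the sketch below evaluates $-\int f \log_2 f\, dx$ on a fine grid for a few values of $a$ and confirms that $h(X) = \log_2 a$ can be negative while $2^{h(X)} = a$ is always positive.

```python
import numpy as np

# For X ~ Uniform(0, a): h(X) = -∫ (1/a) log2(1/a) dx over (0, a) = log2(a) bits.
for a in [0.5, 1.0, 4.0]:
    dx = a / 100_000
    xs = np.arange(dx / 2, a, dx)          # midpoints of a fine grid on (0, a)
    f = np.full_like(xs, 1.0 / a)          # uniform density f(x) = 1/a
    h = -np.sum(f * np.log2(f)) * dx       # Riemann sum for -∫ f log2 f dx
    print(f"a={a}: h(X) = {h:.4f} bits, log2(a) = {np.log2(a):.4f}, 2**h = {2 ** h:.4f}")
```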

Ex. 2: (Normal distribution)


Let $X \sim \phi(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2}$, then

$$
\begin{aligned}
h(\phi) &= -\int \phi \ln \phi \\
        &= -\int \phi(x) \left[ -\frac{x^2}{2\sigma^2} - \ln \sqrt{2\pi\sigma^2} \right] dx \\
        &= \frac{E[X^2]}{2\sigma^2} + \frac{1}{2}\ln (2\pi\sigma^2) \\
        &= \frac{1}{2} + \frac{1}{2}\ln (2\pi\sigma^2) \\
        &= \frac{1}{2}\ln e + \frac{1}{2}\ln (2\pi\sigma^2) \\
        &= \frac{1}{2}\ln (2\pi e \sigma^2) \ \text{nats.}
\end{aligned}
$$

Changing the base of the logarithm, we have

$$h(\phi) = \frac{1}{2}\log (2\pi e \sigma^2) \ \text{bits.}$$
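A small Monte Carlo check (an added illustration, assuming NumPy): for $X \sim \mathcal{N}(0, \sigma^2)$, the sample average of $-\log_2 f(X)$ should approach the closed form $\frac{1}{2}\log_2(2\pi e \sigma^2)$ bits.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0
x = rng.normal(0.0, sigma, size=1_000_000)                          # samples of X ~ N(0, sigma^2)
ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)  # ln f(x)
h_mc = -ln_f.mean() / np.log(2)                                     # E[-ln f(X)], converted to bits
h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(f"Monte Carlo: {h_mc:.4f} bits, closed form: {h_exact:.4f} bits")
```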

Theorem 1
Let $X_1, X_2, \ldots, X_n$ be a sequence of r.v.s drawn i.i.d. according to the density $f(x)$. Then

$$-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability.}$$

Proof: The proof follows directly from the weak law of large numbers.
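The sketch below (an added example, using a Gaussian density so that $h(X)$ is known in closed form) illustrates the theorem: the normalized log-likelihood $-\frac{1}{n}\log f(X_1,\ldots,X_n)$ concentrates around $h(X)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
h_true = 0.5 * np.log(2 * np.pi * np.e * sigma**2)          # h(X) in nats for N(0, sigma^2)
for n in [10, 1_000, 100_000]:
    x = rng.normal(0.0, sigma, size=n)                      # X_1, ..., X_n i.i.d. ~ f
    ln_f = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
    print(f"n={n:>6}: -(1/n) sum ln f(X_i) = {-ln_f.mean():.4f} nats, h(X) = {h_true:.4f} nats")
```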

Def: For $\epsilon > 0$ and any $n$, we define the typical set $A_\epsilon^{(n)}$ w.r.t. $f(x)$ as follows:

$$A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\},$$

where $f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(x_i)$.

Def: The volume $\mathrm{Vol}(A)$ of a set $A \subset \mathbb{R}^n$ is defined as

$$\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \cdots dx_n.$$

Thm: The typical set $A_\epsilon^{(n)}$ has the following properties:

1. $\Pr\!\left(A_\epsilon^{(n)}\right) > 1 - \epsilon$ for $n$ sufficiently large.
2. $\mathrm{Vol}\!\left(A_\epsilon^{(n)}\right) \le 2^{\,n(h(X)+\epsilon)}$ for all $n$.
3. $\mathrm{Vol}\!\left(A_\epsilon^{(n)}\right) \ge (1-\epsilon)\, 2^{\,n(h(X)-\epsilon)}$ for $n$ sufficiently large.

Thm: The set $A_\epsilon^{(n)}$ is the smallest-volume set with probability $\ge 1 - \epsilon$, to first order in the exponent. The volume of the smallest set that contains most of the probability is approximately $2^{nh}$. This is an $n$-dimensional volume, so the corresponding side length is $(2^{nh})^{1/n} = 2^h$. The differential entropy is therefore the logarithm of the equivalent side length of the smallest set that contains most of the probability: low entropy implies that the r.v. is confined to a small effective volume, and high entropy indicates that the r.v. is widely dispersed.

Relation of Differential Entropy to Discrete Entropy


[Figure: quantization of a continuous r.v. — the density f(x) is divided into bins of width $\Delta$.]

Suppose we divide the range of $X$ into bins of length $\Delta$. Let us assume that the density is continuous within the bins.

By the mean value theorem, there is a value $x_i$ within each bin such that

$$f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx.$$

Consider the quantized r.v. $X^\Delta$, which is defined by

$$X^\Delta = x_i, \quad \text{if } i\Delta \le X < (i+1)\Delta.$$

Then the probability that $X^\Delta = x_i$ is

$$p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = f(x_i)\,\Delta.$$

The entropy of the quantized version is

$$
\begin{aligned}
H(X^\Delta) &= -\sum_i p_i \log p_i \\
            &= -\sum_i f(x_i)\Delta \log \bigl(f(x_i)\Delta\bigr) \\
            &= -\sum_i f(x_i)\Delta \log f(x_i) - \sum_i f(x_i)\Delta \log \Delta \\
            &= -\sum_i f(x_i)\Delta \log f(x_i) - \log \Delta,
\end{aligned}
$$

since $\sum_i f(x_i)\Delta = \int f(x)\, dx = 1$.

If $f(x) \log f(x)$ is Riemann integrable, then

$$-\sum_i f(x_i)\Delta \log f(x_i) \to -\int f(x) \log f(x)\, dx, \quad \text{as } \Delta \to 0.$$

This proves the following Thm: If the density $f(x)$ of the r.v. $X$ is Riemann integrable, then

$$H(X^\Delta) + \log \Delta \to h(f) = h(X), \quad \text{as } \Delta \to 0.$$

Thus the entropy of an $n$-bit quantization of a continuous r.v. $X$ is approximately $h(X) + n$, since $\Delta = 2^{-n}$ for an $n$-bit uniform quantizer.
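To illustrate the theorem numerically (an added sketch, assuming NumPy and SciPy are available), the code below quantizes a standard Gaussian with bin width $\Delta$ and checks that $H(X^\Delta) + \log_2 \Delta$ approaches $h(X) = \frac{1}{2}\log_2(2\pi e) \approx 2.047$ bits as $\Delta \to 0$.

```python
import numpy as np
from scipy.stats import norm

h_exact = 0.5 * np.log2(2 * np.pi * np.e)              # h(X) for X ~ N(0, 1), in bits
for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-12.0, 12.0 + delta, delta)      # bins covering essentially all the mass
    p = np.diff(norm.cdf(edges))                       # p_i = integral of f(x) over bin i
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))                        # entropy of the quantized r.v. X^Delta
    print(f"Delta={delta}: H(X^Delta) + log2(Delta) = {H + np.log2(delta):.4f}, h(X) = {h_exact:.4f}")
```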

Joint and Conditional Differential Entropy:

$$h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n)\, dx^n$$

$$h(X \mid Y) = -\int f(x, y) \log f(x \mid y)\, dx\, dy$$

$$h(X \mid Y) = h(X, Y) - h(Y)$$

Theorem (Entropy of a multivariate normal distribution)

Let $X_1, X_2, \ldots, X_n$ have a multivariate normal distribution with mean $\mu$ and covariance matrix $K$. Then

$$h(X_1, X_2, \ldots, X_n) = h\bigl(\mathcal{N}_n(\mu, K)\bigr) = \frac{1}{2} \log \bigl((2\pi e)^n |K|\bigr) \ \text{bits,}$$

where $|K|$ denotes the determinant of $K$.

pf: $(X_1, X_2, \ldots, X_n) \sim \mathcal{N}_n(\mu, K)$, with

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n\, |K|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \mu)^T K^{-1} (\mathbf{x} - \mu) \right).$$

Then

$$
\begin{aligned}
h(f) &= -\int f(\mathbf{x}) \left[ -\frac{1}{2} (\mathbf{x} - \mu)^T K^{-1} (\mathbf{x} - \mu) - \ln\bigl((\sqrt{2\pi})^n |K|^{1/2}\bigr) \right] d\mathbf{x} \\
     &= \frac{1}{2} E\!\left[ \sum_{i,j} (X_i - \mu_i)(K^{-1})_{ij}(X_j - \mu_j) \right] + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
     &= \frac{1}{2} \sum_{i,j} E\bigl[(X_i - \mu_i)(X_j - \mu_j)\bigr](K^{-1})_{ij} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr)
\end{aligned}
$$

$$
\begin{aligned}
&= \frac{1}{2} \sum_{j} \sum_{i} E\bigl[(X_j - \mu_j)(X_i - \mu_i)\bigr](K^{-1})_{ij} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
&= \frac{1}{2} \sum_{j} \sum_{i} K_{ji}(K^{-1})_{ij} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
&= \frac{1}{2} \sum_{j} (K K^{-1})_{jj} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
&= \frac{1}{2} \sum_{j} I_{jj} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
&= \frac{n}{2} + \frac{1}{2} \ln\bigl((2\pi)^n |K|\bigr) \\
&= \frac{1}{2} \ln\bigl((2\pi e)^n |K|\bigr) \ \text{nats} \\
&= \frac{1}{2} \log\bigl((2\pi e)^n |K|\bigr) \ \text{bits.}
\end{aligned}
$$
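A numerical check (an added example, NumPy assumed): for a hypothetical 2×2 covariance matrix $K$, the closed form $\frac{1}{2}\log_2\bigl((2\pi e)^n |K|\bigr)$ agrees with a Monte Carlo estimate of $E[-\log_2 f(\mathbf{X})]$.

```python
import numpy as np

rng = np.random.default_rng(2)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])                                  # example covariance matrix (assumed)
n = K.shape[0]
h_exact = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

x = rng.multivariate_normal(np.zeros(n), K, size=500_000)   # samples of X ~ N_n(0, K)
Kinv = np.linalg.inv(K)
quad = np.einsum("ij,jk,ik->i", x, Kinv, x)                 # x^T K^{-1} x for each sample
ln_f = -0.5 * quad - 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(K))
h_mc = -ln_f.mean() / np.log(2)                             # nats -> bits
print(f"closed form: {h_exact:.4f} bits, Monte Carlo: {h_mc:.4f} bits")
```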

Relative Entropy and Mutual Information

$$D(f \,\|\, g) = \int f \log \frac{f}{g}$$

$$I(X;Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy$$

$$I(X;Y) = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X) = h(X) + h(Y) - h(X, Y)$$

$$I(X;Y) = D\bigl(f(x, y) \,\|\, f(x) f(y)\bigr)$$
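Worked example (added, not from the original notes): for jointly Gaussian $(X, Y)$ with unit variances and correlation $\rho$, the identity $I(X;Y) = h(X) + h(Y) - h(X,Y)$ together with the Gaussian entropy formula gives $I(X;Y) = -\frac{1}{2}\log_2(1-\rho^2)$.

```python
import numpy as np

rho = 0.8
K = np.array([[1.0, rho],
              [rho, 1.0]])                                         # covariance of (X, Y), unit variances
h_x = 0.5 * np.log2(2 * np.pi * np.e)                              # h(X)
h_y = 0.5 * np.log2(2 * np.pi * np.e)                              # h(Y)
h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))   # h(X, Y)
I = h_x + h_y - h_xy
print(f"I(X;Y) = {I:.4f} bits, -(1/2) log2(1 - rho^2) = {-0.5 * np.log2(1 - rho**2):.4f} bits")
```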

Remark: The mutual information between two continuous r.v.s is the limit of the mutual information between their quantized versions:

$$I(X^\Delta; Y^\Delta) = H(X^\Delta) - H(X^\Delta \mid Y^\Delta) \approx \bigl(h(X) - \log \Delta\bigr) - \bigl(h(X \mid Y) - \log \Delta\bigr) = I(X;Y).$$

Properties of $h(X)$, $D(p \,\|\, q)$, $I(X;Y)$

$$D(f \,\|\, g) \ge 0$$

pf:

$$-D(f \,\|\, g) = \int_S f \log \frac{g}{f} \le \log \int_S f\, \frac{g}{f} = \log \int_S g \le \log 1 = 0 \quad (\text{Jensen's inequality}).$$

It follows that $I(X;Y) \ge 0$ and $h(X \mid Y) \le h(X)$.
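A numerical illustration of $D(f\,\|\,g) \ge 0$ (added sketch with two arbitrarily chosen Gaussian densities): the Monte Carlo estimate of $E_f[\ln f(X) - \ln g(X)]$ comes out nonnegative, and it would be zero only if $f = g$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=1_000_000)                      # samples from f = N(0, 1)
ln_f = -0.5 * np.log(2 * np.pi) - 0.5 * x**2                  # ln f(x)
ln_g = -0.5 * np.log(2 * np.pi * 2.0) - (x - 1.0) ** 2 / 4.0  # ln g(x), g = N(1, variance 2)
print(f"D(f||g) estimate: {np.mean(ln_f - ln_g):.4f} nats (nonnegative)")
```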

$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n h(X_i \mid X_1, X_2, \ldots, X_{i-1})$$

$$h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^n h(X_i)$$

Theorems:

$h(X + c) = h(X)$: translation does not change the differential entropy.

$h(aX) = h(X) + \log |a|$

pf: Let $Y = aX$. Then $f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right)$, and

$$
\begin{aligned}
h(aX) &= -\int f_Y(y) \log f_Y(y)\, dy \\
      &= -\int \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \log\!\left( \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \right) dy \\
      &= -\int f_X(x) \log f_X(x)\, dx + \log |a| \\
      &= h(X) + \log |a|.
\end{aligned}
$$

Corollary: $h(A\mathbf{X}) = h(\mathbf{X}) + \log |\det(A)|$
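Quick check of $h(aX) = h(X) + \log|a|$ (added, using the Gaussian closed form): scaling $X \sim \mathcal{N}(0,1)$ by $a$ gives a $\mathcal{N}(0, a^2)$ variable, whose entropy exceeds $h(X)$ by exactly $\log_2|a|$ bits.

```python
import numpy as np

a = 5.0
h_x = 0.5 * np.log2(2 * np.pi * np.e)             # h(X) for X ~ N(0, 1)
h_ax = 0.5 * np.log2(2 * np.pi * np.e * a**2)     # h(aX), since aX ~ N(0, a^2)
print(f"h(aX) - h(X) = {h_ax - h_x:.4f} bits, log2|a| = {np.log2(abs(a)):.4f} bits")
```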

Theorem: The multivariate normal distribution maximizes the entropy over all distributions with the same covariance.

Let the random vector $\mathbf{X} \in \mathbb{R}^n$ have zero mean and covariance $K = E[\mathbf{X}\mathbf{X}^T]$ (i.e., $K_{ij} = E[X_i X_j]$, $1 \le i, j \le n$). Then

$$h(\mathbf{X}) \le \frac{1}{2} \log\bigl((2\pi e)^n |K|\bigr), \quad \text{with equality iff } \mathbf{X} \sim \mathcal{N}_n(0, K).$$

Pf:
Let $g(\mathbf{x})$ be any density satisfying $\int g(\mathbf{x})\, x_i x_j\, d\mathbf{x} = K_{ij}$ for all $i, j$. Let $\phi_K$ be the density of a $\mathcal{N}_n(0, K)$ vector. Note that $\log \phi_K(\mathbf{x})$ is a quadratic form and $\int x_i x_j\, \phi_K(\mathbf{x})\, d\mathbf{x} = K_{ij}$. Then

$$
\begin{aligned}
0 \le D(g \,\|\, \phi_K) &= \int g \log \frac{g}{\phi_K} \\
  &= -h(g) - \int g \log \phi_K \\
  &= -h(g) - \int \phi_K \log \phi_K \\
  &= -h(g) + h(\phi_K),
\end{aligned}
$$

where the substitution $\int g \log \phi_K = \int \phi_K \log \phi_K$ follows from the fact that $g$ and $\phi_K$ yield the same moments of the quadratic form $\log \phi_K(\mathbf{x})$. Hence the Gaussian distribution maximizes the entropy over all distributions with the same covariance.

Let $X$ be a random variable with differential entropy $h(X)$. Let $\hat{X}$ be an estimate of $X$, and let $E(X - \hat{X})^2$ be the expected prediction error. Let $h(X)$ be in nats.

Theorem: For any r.v. $X$ and estimator $\hat{X}$,

$$E(X - \hat{X})^2 \ge \frac{1}{2\pi e}\, e^{2 h(X)},$$

with equality iff $X$ is Gaussian and $\hat{X}$ is the mean of $X$.

pf: Let $\hat{X}$ be any estimator of $X$. Then

$$
\begin{aligned}
E(X - \hat{X})^2 &\ge \min_{\hat{X}} E(X - \hat{X})^2 \qquad (1) \\
  &= E\bigl(X - E(X)\bigr)^2 \qquad [\text{the mean of } X \text{ is the best estimator of } X] \\
  &= \mathrm{Var}(X) \\
  &\ge \frac{1}{2\pi e}\, e^{2 h(X)}. \qquad (2)
\end{aligned}
$$

[The Gaussian distribution has the maximum entropy for a given variance, i.e., $h(X) \le \frac{1}{2} \ln (2\pi e \sigma^2)$.]

We have equality in (1) only if $\hat{X}$ is the best estimator (i.e., $\hat{X}$ is the mean of $X$), and equality in (2) only if $X$ is Gaussian.

Corollary: Given side information $Y$ and estimator $\hat{X}(Y)$, it follows that

$$E\bigl(X - \hat{X}(Y)\bigr)^2 \ge \frac{1}{2\pi e}\, e^{2 h(X \mid Y)}.$$

This bound plays the role of Fano's inequality for continuous random variables.
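Sanity check of the estimation bound in the Gaussian case (added illustration, NumPy assumed): with $X \sim \mathcal{N}(\mu, \sigma^2)$ and the estimator $\hat{X} = E[X]$, the error $E(X - \hat{X})^2 = \sigma^2$ meets the bound $\frac{1}{2\pi e} e^{2h(X)}$ with equality, since $h(X) = \frac{1}{2}\ln(2\pi e \sigma^2)$ nats.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=1_000_000)
mse = np.mean((x - mu) ** 2)                         # E(X - X_hat)^2 with X_hat = E[X] = mu
h = 0.5 * np.log(2 * np.pi * np.e * sigma**2)        # h(X) in nats
bound = np.exp(2 * h) / (2 * np.pi * np.e)           # equals sigma^2 for the Gaussian
print(f"E(X - X_hat)^2 = {mse:.4f}, lower bound = {bound:.4f}, sigma^2 = {sigma**2:.4f}")
```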
