Information Theory: Differential Entropy
Definition
Let X be a random variable with cumulative distribution function F(x) = \Pr(X \le x). If F(x) is continuous, the random variable is said to be continuous. Let f(x) = F'(x) when the derivative is defined. If \int f(x)\,dx = 1, then f(x) is called the probability density function of X.
Definition
The differential entropy h(X) of a continuous random variable X with density function f(x) is defined as

h(X) = -\int_S f(x) \log f(x)\, dx        (1)

where S is the support set of the random variable. Since h(X) depends only on the density f(x), the differential entropy is sometimes written h(f) rather than h(X).
Example (uniform distribution): If X is uniform on [0, a], then f(x) = 1/a on [0, a] and h(X) = -\int_0^a \frac{1}{a} \log\frac{1}{a}\, dx = \log a. For a < 1 this is negative, so, unlike discrete entropy, differential entropy can be negative. However, 2^{h(X)} = 2^{\log a} = a is the volume of the support set, which is always non-negative, as expected.
Example (normal distribution): Let X \sim \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/2\sigma^2}. Then

h(\phi) = -\int \phi \ln \phi
        = -\int \phi(x) \left[ -\frac{x^2}{2\sigma^2} - \ln\sqrt{2\pi\sigma^2} \right] dx
        = \frac{E[X^2]}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)
        = \frac{1}{2} + \frac{1}{2}\ln(2\pi\sigma^2)
        = \frac{1}{2}\ln e + \frac{1}{2}\ln(2\pi\sigma^2)
        = \frac{1}{2}\ln(2\pi e \sigma^2)  nats.

Changing the base of the logarithm,

h(\phi) = \frac{1}{2}\log(2\pi e \sigma^2)  bits.
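As a quick numerical sanity check (a sketch assuming NumPy; the value of sigma is arbitrary), the closed form above can be compared with a direct Riemann-sum approximation of -\int f \ln f\, dx:

import numpy as np

# Compare the closed form 0.5*ln(2*pi*e*sigma^2) with a numerical
# approximation of -int f(x) ln f(x) dx for a N(0, sigma^2) density.
sigma = 1.7
x = np.linspace(-12 * sigma, 12 * sigma, 200001)
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = -np.trapz(f * np.log(f), x)                 # differential entropy in nats
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)                              # the two values should agree closely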
Theorem 1 (AEP for continuous random variables)
Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random variables drawn according to the density f(x). Then

-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X)   in probability.

Proof: The proof follows directly from the weak law of large numbers.
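A minimal Monte Carlo sketch of this convergence (assuming NumPy; the Gaussian samples and the chosen sigma and n are illustrative):

import numpy as np

# For i.i.d. N(0, sigma^2) samples, -(1/n) * sum(ln f(X_i)) should concentrate
# around h(X) = 0.5*ln(2*pi*e*sigma^2) (in nats).
rng = np.random.default_rng(0)
sigma, n = 2.0, 100_000
x = rng.normal(0.0, sigma, size=n)
log_f = -x**2 / (2 * sigma**2) - 0.5 * np.log(2 * np.pi * sigma**2)
print(-log_f.mean())                              # empirical -(1/n) ln f(X_1,...,X_n)
print(0.5 * np.log(2 * np.pi * np.e * sigma**2))  # h(X)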
Definition: For \epsilon > 0 and any n, we define the typical set A_\epsilon^{(n)} w.r.t. f(x) as

A_\epsilon^{(n)} = \left\{ (x_1, \ldots, x_n) \in S^n : \left| -\tfrac{1}{n}\log f(x_1, \ldots, x_n) - h(X) \right| \le \epsilon \right\}.

Properties:
1. \Pr\big(A_\epsilon^{(n)}\big) > 1 - \epsilon for n sufficiently large.
2. \mathrm{Vol}\big(A_\epsilon^{(n)}\big) \le 2^{n(h(X)+\epsilon)} for all n.
3. \mathrm{Vol}\big(A_\epsilon^{(n)}\big) \ge (1-\epsilon)\, 2^{n(h(X)-\epsilon)} for n sufficiently large.
Theorem: A_\epsilon^{(n)} is the smallest-volume set with probability at least 1 - \epsilon, to first order in the exponent.

Interpretation: the volume of the smallest set that contains most of the probability is approximately 2^{nh(X)}. This is an n-dimensional volume, so the corresponding side length is (2^{nh(X)})^{1/n} = 2^{h(X)}. The differential entropy is thus the logarithm of the equivalent side length of the smallest set that contains most of the probability: low entropy implies that the random variable is confined to a small effective volume, while high entropy indicates that it is widely dispersed.
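A small numerical illustration of this "equivalent side length" (a sketch assuming NumPy; the 95% interval is only a rough comparison, not part of the theorem):

import numpy as np

# For N(0, sigma^2), 2^h (h in bits) equals sqrt(2*pi*e)*sigma ~ 4.13*sigma,
# which is comparable to the width of the central interval carrying most of the mass.
sigma = 1.0
h_bits = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(2 ** h_bits)        # equivalent side length, ~4.13*sigma
print(2 * 1.96 * sigma)   # width of the central ~95% interval, ~3.92*sigma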
Quantization of a continuous random variable
Suppose we divide the range of X into bins of length \Delta. Let us assume that the density is continuous within each bin.
By the mean value theorem, there exists a value x_i within each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx.

Define the quantized random variable X^\Delta by X^\Delta = x_i if i\Delta \le X < (i+1)\Delta, so that

\Pr(X^\Delta = x_i) = p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = f(x_i)\,\Delta.
The entropy of the quantized version is

H(X^\Delta) = -\sum_i p_i \log p_i
            = -\sum_i f(x_i)\,\Delta \log\big( f(x_i)\,\Delta \big)
            = -\sum_i \Delta\, f(x_i) \log f(x_i) - \log \Delta,

since \sum_i f(x_i)\,\Delta = \int f(x)\, dx = 1.
This proves the following theorem: If the density f(x) of the random variable X is Riemann integrable, then

H(X^\Delta) + \log\Delta \to h(f) = h(X)   as \Delta \to 0.

In particular, with an n-bit quantization \Delta = 2^{-n}, the entropy of X^\Delta is approximately h(X) + n.
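A sketch of this limit for a standard normal density (assuming NumPy; the grid span, the bin sizes, and the helper name quantized_entropy_bits are choices made here for illustration):

import numpy as np

# H(X^Delta) + log2(Delta) should approach h(X) = 0.5*log2(2*pi*e) as Delta -> 0.
def quantized_entropy_bits(delta, sigma=1.0, span=12.0):
    edges = np.arange(-span * sigma, span * sigma, delta)
    mids = edges + delta / 2
    p = np.exp(-mids**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2) * delta  # p_i ~ f(x_i)*Delta
    p = p / p.sum()                           # renormalize away the truncated tail mass
    return -np.sum(p * np.log2(p))

h_true = 0.5 * np.log2(2 * np.pi * np.e)      # h(X) for N(0, 1), in bits
for delta in (1.0, 0.5, 0.1, 0.01):
    print(delta, quantized_entropy_bits(delta) + np.log2(delta), h_true)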
The joint differential entropy of X_1, X_2, \ldots, X_n with joint density f(x_1, \ldots, x_n) is defined as

h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n)\, dx^n.
Theorem: Let X_1, X_2, \ldots, X_n have a multivariate normal distribution with mean \mu and covariance matrix K. Then

h(X_1, X_2, \ldots, X_n) = h\big(\mathcal{N}_n(\mu, K)\big) = \frac{1}{2}\log\big( (2\pi e)^n |K| \big)  bits,

where |K| denotes the determinant of K.
Proof: Let (X_1, X_2, \ldots, X_n) \sim \mathcal{N}_n(\mu, K), so that

f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x}-\mu)^T K^{-1} (\mathbf{x}-\mu) \right).

Then

h(f) = -\int f(\mathbf{x}) \left[ -\tfrac{1}{2} (\mathbf{x}-\mu)^T K^{-1} (\mathbf{x}-\mu) - \ln\big( (2\pi)^{n/2} |K|^{1/2} \big) \right] d\mathbf{x}
     = \tfrac{1}{2} E\Big[ \sum_{i,j} (X_i - \mu_i)(K^{-1})_{ij}(X_j - \mu_j) \Big] + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \tfrac{1}{2} \sum_{i,j} E\big[ (X_j - \mu_j)(X_i - \mu_i) \big] (K^{-1})_{ij} + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \tfrac{1}{2} \sum_j \sum_i K_{ji} (K^{-1})_{ij} + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \tfrac{1}{2} \sum_j (K K^{-1})_{jj} + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \tfrac{1}{2} \sum_j I_{jj} + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \frac{n}{2} + \tfrac{1}{2}\ln\big( (2\pi)^n |K| \big)
     = \tfrac{1}{2}\ln\big( (2\pi e)^n |K| \big)  nats
     = \tfrac{1}{2}\log\big( (2\pi e)^n |K| \big)  bits.
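The closed form is easy to evaluate directly (a sketch assuming NumPy; the 2x2 matrix K below is just an example):

import numpy as np

# h(N_n(mu, K)) = 0.5*log2((2*pi*e)^n * det(K)) bits, independent of mu.
K = np.array([[2.0, 0.8],
              [0.8, 1.0]])
n = K.shape[0]
sign, logdet = np.linalg.slogdet(K)       # sign is +1 for a positive-definite K
h_bits = 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))
print(h_bits)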
The relative entropy (Kullback-Leibler divergence) between two densities f and g is

D(f \| g) = \int f \log\frac{f}{g},

and the mutual information between X and Y with joint density f(x, y) is

I(X;Y) = \int f(x, y) \log\frac{f(x, y)}{f(x) f(y)}\, dx\, dy.

As in the discrete case,

I(X;Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X, Y),
I(X;Y) = D\big( f(x, y) \,\|\, f(x) f(y) \big).
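For a concrete check (a sketch assuming NumPy; a bivariate normal with unit variances and correlation rho is chosen because its entropies are known in closed form):

import numpy as np

# For a bivariate normal with correlation rho, I(X;Y) = h(X)+h(Y)-h(X,Y) = -0.5*ln(1-rho^2).
rho = 0.6
K = np.array([[1.0, rho],
              [rho, 1.0]])
h_x = 0.5 * np.log(2 * np.pi * np.e)       # h(X), unit variance, in nats
h_y = 0.5 * np.log(2 * np.pi * np.e)       # h(Y)
h_xy = 0.5 * np.log((2 * np.pi * np.e)**2 * np.linalg.det(K))
print(h_x + h_y - h_xy)                    # I(X;Y) via differential entropies
print(-0.5 * np.log(1 - rho**2))           # closed form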
Remark: The mutual information between two continuous random variables is the limit of the mutual information between their quantized versions.
Properties of h(X), D(f \| g), I(X;Y)

D(f \| g) \ge 0, with equality iff f = g almost everywhere.

Proof:
-D(f \| g) = \int_S f \log\frac{g}{f} \le \log \int_S f \cdot \frac{g}{f}   (Jensen's inequality)
           = \log \int_S g \le \log 1 = 0.

Corollary: I(X;Y) \ge 0 and h(X|Y) \le h(X).
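A numerical check of D(f \| g) \ge 0 for two Gaussian densities (a sketch assuming NumPy; the particular means and variances are arbitrary):

import numpy as np

# D(f||g) for f = N(1, 1) and g = N(-2, 4), approximated by numerical integration.
x = np.linspace(-30, 30, 400001)
f = np.exp(-(x - 1.0)**2 / 2) / np.sqrt(2 * np.pi)
g = np.exp(-(x + 2.0)**2 / (2 * 4.0)) / np.sqrt(2 * np.pi * 4.0)
D = np.trapz(f * np.log(f / g), x)
print(D)        # positive; it would be zero only if f = g almost everywhere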
Theorems:
h(X + c) = h(X): translation does not change the differential entropy.
h(aX) = h(X) + \log|a|.
Proof: Let Y = aX. Then f_Y(y) = \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right), and

h(aX) = -\int f_Y(y) \log f_Y(y)\, dy
      = -\int \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \log\left( \frac{1}{|a|} f_X\!\left(\frac{y}{a}\right) \right) dy
      = -\int f_X(x) \log f_X(x)\, dx + \log|a|
      = h(X) + \log|a|.

Corollary: h(AX) = h(X) + \log|\det(A)|.
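The scaling property can also be checked numerically (a sketch assuming NumPy; X ~ N(0, 1) and a = 3 are illustrative choices):

import numpy as np

# h(aX) = h(X) + ln|a|, verified by numerical integration of the two densities.
a = 3.0
x = np.linspace(-25, 25, 500001)
f_x = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)                            # density of X ~ N(0, 1)
f_y = np.exp(-x**2 / (2 * a**2)) / (np.abs(a) * np.sqrt(2 * np.pi))     # density of Y = aX
h_x = -np.trapz(f_x * np.log(f_x), x)
h_y = -np.trapz(f_y * np.log(f_y), x)
print(h_y, h_x + np.log(np.abs(a)))     # the two values should agree closely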
Theorem: The multivariate normal distribution maximizes the entropy over all distributions with the same covariance.
Let the random vector X \in \mathbb{R}^n have zero mean and covariance K = E[XX^T] (i.e., K_{ij} = E[X_i X_j], 1 \le i, j \le n). Then

h(X) \le \frac{1}{2}\log\big( (2\pi e)^n |K| \big),

with equality iff X \sim \mathcal{N}_n(0, K).
Proof: Let g(\mathbf{x}) be any density satisfying \int g(\mathbf{x})\, x_i x_j\, d\mathbf{x} = K_{ij} for all i, j. Let \phi_K be the density of a \mathcal{N}_n(0, K) vector. Note that \log \phi_K(\mathbf{x}) is a quadratic form and \int x_i x_j\, \phi_K(\mathbf{x})\, d\mathbf{x} = K_{ij}. Then

0 \le D(g \| \phi_K) = \int g \log\frac{g}{\phi_K}
  = -h(g) - \int g \log \phi_K
  = -h(g) - \int \phi_K \log \phi_K
  = -h(g) + h(\phi_K),
where the substitution \int g \log \phi_K = \int \phi_K \log \phi_K follows from the fact that g and \phi_K yield the same moments of the quadratic form \log \phi_K(\mathbf{x}). Hence h(g) \le h(\phi_K): the Gaussian distribution maximizes the entropy over all distributions with the same covariance.
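As an illustration (a sketch; the uniform density is chosen only because its entropy is known in closed form):

import numpy as np

# Among densities with variance sigma^2, the Gaussian has the largest differential
# entropy; a uniform density with the same variance (width sqrt(12)*sigma) is lower.
sigma = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # ~1.419 nats
h_uniform = np.log(np.sqrt(12.0) * sigma)              # ~1.242 nats
print(h_gauss, h_uniform, h_gauss >= h_uniform)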
Let X be a random variable with differential entropy h(X). Let \hat{X} be an estimate of X, and let E(X - \hat{X})^2 be the expected prediction error. Let h(X) be measured in nats.

Theorem: For any random variable X and estimator \hat{X},

E(X - \hat{X})^2 \ge \frac{1}{2\pi e}\, e^{2 h(X)},

with equality iff X is Gaussian and \hat{X} is the mean of X.
Proof:
E(X - \hat{X})^2 \ge E\big( X - E(X) \big)^2        (1)   [the mean of X is the best estimator of X]
                = \mathrm{Var}(X)
                \ge \frac{1}{2\pi e}\, e^{2 h(X)}    (2)   [the Gaussian distribution has the maximum entropy for a given variance, i.e., h(X) \le \tfrac{1}{2}\ln(2\pi e \sigma^2)]
We have equality in (1) only if \hat{X} is the best estimator (i.e., \hat{X} is the mean of X), and equality in (2) only if X is Gaussian.

Corollary: Given side information Y and an estimator \hat{X}(Y), it follows that

E\big( X - \hat{X}(Y) \big)^2 \ge \frac{1}{2\pi e}\, e^{2 h(X|Y)}.

This bound is the estimation-theoretic counterpart of Fano's inequality.
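The unconditional bound is easy to check on a non-Gaussian example (a sketch assuming NumPy; Uniform[0, 1] is used because h(X) = 0 there):

import numpy as np

# E(X - Xhat)^2 >= (1/(2*pi*e)) * e^{2 h(X)}, with h in nats; here X ~ Uniform[0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1_000_000)
mse_best = np.mean((x - x.mean())**2)       # best constant estimator is the mean; MSE ~ 1/12
h_nats = 0.0                                # h(X) = log 1 = 0 for Uniform[0, 1]
bound = np.exp(2 * h_nats) / (2 * np.pi * np.e)
print(mse_best, bound, mse_best >= bound)   # 1/12 ~ 0.083 >= 1/(2*pi*e) ~ 0.059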