
Lecture 2 Entropy

September 2, 2022


Outline

1 Self-information

2 Entropy

3 Joint entropy and conditional entropy

4 Relative entropy and mutual information

5 Chain rules for entropy, relative entropy and mutual information


Self-information


Let E be an event belonging to a given event space and having probability pE := Pr(E), where 0 ≤ pE ≤ 1.
I(E), the self-information of E: the amount of information one gains when learning that E has occurred, or equivalently, the amount of uncertainty one had about E prior to learning that it has happened.

Question: What properties should I(E) have?


The properties that I(E) is expected to have

I(E) should be a decreasing function of pE. In other words, this property states that I(E) = I(pE), where I(·) is a real-valued function defined over [0, 1].
I(pE) should be continuous in pE.
If E1 and E2 are two independent events, then I(E1 ∩ E2) = I(E1) + I(E2), or equivalently, I(pE1 × pE2) = I(pE1) + I(pE2).


Theorem
The only function defined over p ∈ [0, 1] and satisfying
I(p) is monotonically decreasing in p;
I(p) is a continuous function of p for 0 ≤ p ≤ 1;
I(p1 × p2 ) = I(p1 ) + I(p2 );
is I(p) = −c · logb(p), where c is a positive constant and the base
b of the logarithm is a real number larger than one.

The constant c above is by convention normalized to c = 1.


The base b of the logarithm determines the type of units used
in measuring information.
We will use the base-2 logarithm throughout unless otherwise
specified.
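
As a quick numerical illustration (a minimal Python sketch, not part of the original slides; the name self_information is ours), the self-information I(p) = −log2(p) and its additivity over independent events can be checked as follows:

import math

def self_information(p):
    """Self-information I(p) = -log2(p) in bits, for 0 < p <= 1."""
    return -math.log2(p)

# A fair coin flip carries 1 bit; a 1-in-8 event carries 3 bits.
print(self_information(0.5))    # 1.0
print(self_information(0.125))  # 3.0

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2).
p1, p2 = 0.5, 0.25
print(math.isclose(self_information(p1 * p2),
                   self_information(p1) + self_information(p2)))  # True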



Entropy

Definition
The entropy H(X) of a discrete random variable X with probability mass distribution or probability mass function (pmf) PX(·) is defined by

H(X) := − ∑_{x∈X} PX(x) · log2 PX(x) (bits).

H(X) represents the statistical average (mean) amount of information one gains when learning that one of its |X| outcomes has occurred.
H(X) = E[− log2 PX(X)] = E[I(X)], where I(X) := − log2 PX(X).
We adopt the convention 0 · log2 0 = 0.
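
A minimal Python sketch of this definition (the function name entropy is ours, not from the slides); the convention 0 · log2 0 = 0 is implemented by skipping zero-probability outcomes:

import math

def entropy(pmf):
    """H(X) = -sum over x of P(x) * log2 P(x), in bits; zero terms are skipped."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)

# Uniform distribution on 4 outcomes: H = log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A deterministic outcome carries no uncertainty.
print(entropy([1.0, 0.0, 0.0]))           # 0.0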


Lemma
H(X) ≥ 0.

Proof.
0 ≤ p(x) ≤ 1 implies that log(1/p(x)) ≥ 0.

Lemma
Hb(X) = (logb a) Ha(X).

Proof.
logb p = (logb a)(loga p).
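
A quick numerical check of the base-change lemma (a sketch; the helper name entropy_base is ours, not from the slides):

import math

def entropy_base(pmf, b):
    """Entropy of a pmf computed with base-b logarithms."""
    return sum(-p * math.log(p, b) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.125, 0.125]
h_bits = entropy_base(pmf, 2)       # H_2(X) in bits
h_nats = entropy_base(pmf, math.e)  # H_e(X) in nats
# Lemma with b = e and a = 2: H_e(X) = (log_e 2) * H_2(X).
print(math.isclose(h_nats, math.log(2) * h_bits))  # True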


Example
Let X = 1 with probability p, and X = 0 with probability 1 − p.
Then

H(X) = −p log p − (1 − p) log(1 − p) =: H(p) (bits).
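
A short sketch of this binary entropy function (the name binary_entropy is ours):

import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0: a fair coin carries one full bit
print(binary_entropy(0.1))  # about 0.469: a strongly biased coin is less uncertain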



How to measure information content?

You are given 12 balls, all equal in weight except for one that is either heavier or lighter. You are also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left pan and the same number on the right pan. There are three possible outcomes: either the weights are equal, or the balls on the left are heavier, or the balls on the right are heavier.
Your task is to design a strategy to determine which is the odd ball and whether it is heavier or lighter than the others in as few uses of the balance as possible.


How can one measure information?

When you have identified the odd ball and whether it is heavy or light, how much information have you gained?
Once you have designed a strategy, draw a tree showing, for each of the possible outcomes of a weighing, what weighing you perform next. At each node in the tree, how much information have the outcomes so far given you, and how much information remains to be gained?
How much information is gained on the first step of the weighing problem if 6 balls are weighed against the other 6? How much is gained if 4 are weighed against 4 on the first step, leaving out 4 balls? (A numerical sketch of this comparison follows below.)
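
A sketch of that last comparison, under the standard modeling assumption (not stated explicitly on the slides) that all 24 hypotheses, 12 balls times {heavy, light}, are equally likely a priori; the information gained by a weighing is taken to be the entropy of its outcome distribution:

import math
from fractions import Fraction

def entropy(probs):
    return sum(-float(p) * math.log2(float(p)) for p in probs if p > 0)

# 24 equally likely hypotheses: 12 balls x {heavy, light}.
# Weighing 6 against 6: the pans can never balance, and each of the two
# remaining outcomes covers 12 of the 24 hypotheses.
outcome_6v6 = [Fraction(12, 24), Fraction(12, 24), Fraction(0, 24)]
print(entropy(outcome_6v6))  # 1.0 bit

# Weighing 4 against 4 (leaving 4 balls out): each of the three outcomes
# (left heavy, right heavy, balanced) covers 8 of the 24 hypotheses.
outcome_4v4 = [Fraction(8, 24)] * 3
print(entropy(outcome_4v4))  # log2(3), about 1.585 bits -- the maximum possible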


[Figure: a three-level decision tree for a weighing strategy, starting by weighing balls 1 2 3 4 against 5 6 7 8 and showing, for each outcome of each weighing, which balls to weigh next until the odd ball and whether it is heavy or light are identified.]



Joint entropy

Definition
The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y),

which can also be expressed as

H(X, Y) = −E log p(X, Y).
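
A minimal sketch of this definition (the name joint_entropy and the dictionary representation of the joint pmf are ours):

import math

def joint_entropy(pxy):
    """H(X, Y) = -sum over (x, y) of p(x, y) log2 p(x, y), in bits."""
    return sum(-p * math.log2(p) for p in pxy.values() if p > 0)

# Two independent fair bits: H(X, Y) = H(X) + H(Y) = 2 bits.
pxy = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(joint_entropy(pxy))  # 2.0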


Conditional entropy

Definition
If (X, Y) ∼ p(x, y), the conditional entropy H(Y|X) is defined as

H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
       = − ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)
       = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
       = −E log p(Y|X).
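
A sketch following the third line of the definition, H(Y|X) = −∑ p(x, y) log2 p(y|x) (the helper names are ours):

import math
from collections import defaultdict

def conditional_entropy(pxy):
    """H(Y|X) = -sum over (x, y) of p(x, y) log2 p(y|x), in bits."""
    px = defaultdict(float)
    for (x, _y), p in pxy.items():
        px[x] += p  # marginal p(x)
    return sum(-p * math.log2(p / px[x])
               for (x, _y), p in pxy.items() if p > 0)

# Y is an exact copy of a fair bit X: knowing X leaves no uncertainty about Y.
pxy = {(0, 0): 0.5, (1, 1): 0.5}
print(conditional_entropy(pxy))  # 0.0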


Chain rule

Theorem (Chain rule)

H(X, Y ) = H(X) + H(Y |X). (3.1)


Proof.

H(X, Y) = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
        = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log(p(x)p(y|x))
        = − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
        = − ∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
        = H(X) + H(Y|X).


Equivalently, we can write

log p(X, Y ) = log p(X) + log p(Y |X)

and take the expectation of both sides of the equation to obtain the theorem.
Corollary

H(X, Y |Z) = H(X|Z) + H(Y |X, Z).


Example
Let (X, Y) have the joint distribution p(x, y) given by the following table (columns indexed by X, rows by Y):

         X = 1    X = 2    X = 3    X = 4
Y = 1    1/8      1/16     1/32     1/32
Y = 2    1/16     1/8      1/32     1/32
Y = 3    1/16     1/16     1/16     1/16
Y = 4    1/4      0        0        0

The marginal distributions of X and Y are (1/2, 1/4, 1/8, 1/8) and (1/4, 1/4, 1/4, 1/4) respectively, and hence H(X) = 7/4 bits and H(Y) = 2 bits. Also

H(X|Y) = ∑_{i=1}^{4} p(Y = i) H(X|Y = i) = 11/8 bits.

Similarly H(Y|X) = 13/8 bits, and H(X, Y) = 27/8 bits.
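
The numbers in this example can be reproduced with a short script (a sketch; the helper name H is ours); the conditional entropies are recovered through the chain rule H(X, Y) = H(X) + H(Y|X) from the preceding slides:

import math
from fractions import Fraction as F

# Joint pmf p(x, y) from the table above, keyed by (x, y).
table = {
    (1, 1): F(1, 8),  (2, 1): F(1, 16), (3, 1): F(1, 32), (4, 1): F(1, 32),
    (1, 2): F(1, 16), (2, 2): F(1, 8),  (3, 2): F(1, 32), (4, 2): F(1, 32),
    (1, 3): F(1, 16), (2, 3): F(1, 16), (3, 3): F(1, 16), (4, 3): F(1, 16),
    (1, 4): F(1, 4),  (2, 4): 0,        (3, 4): 0,        (4, 4): 0,
}

def H(probs):
    return sum(-float(p) * math.log2(float(p)) for p in probs if p > 0)

px = [sum(p for (x, _), p in table.items() if x == i) for i in range(1, 5)]
py = [sum(p for (_, y), p in table.items() if y == j) for j in range(1, 5)]

print(H(px))        # H(X) = 1.75 bits  (marginal of X is 1/2, 1/4, 1/8, 1/8)
print(H(py))        # H(Y) = 2.0 bits   (marginal of Y is uniform)
hxy = H(table.values())
print(hxy)          # H(X, Y) = 27/8 = 3.375 bits
print(hxy - H(px))  # H(Y|X) = 13/8 = 1.625 bits
print(hxy - H(py))  # H(X|Y) = 11/8 = 1.375 bits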



Relative entropy

Definition
The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined as

D(p ∥ q) = ∑_{x∈X} p(x) log [p(x)/q(x)] = Ep log [p(X)/q(X)].
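
A minimal sketch of this definition (kl_divergence is our name); it assumes q(x) > 0 wherever p(x) > 0, so that every term is finite:

import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) log2 (p(x)/q(x)), in bits.

    Assumes q[x] > 0 whenever p[x] > 0; terms with p[x] = 0 contribute 0.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # 1.0 bit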


Mutual information

Definition
Consider two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y):

I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y) / (p(x)p(y))]
        = D(p(x, y) ∥ p(x)p(y))
        = Ep(x,y) log [p(X, Y) / (p(X)p(Y))].


The definition of mutual information can be rewritten as

I(X; Y) = ∑_{x,y} p(x, y) log [p(x, y) / (p(x)p(y))]
        = ∑_{x,y} p(x, y) log [p(x|y) / p(x)]
        = − ∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y)
        = − ∑_{x} p(x) log p(x) − (− ∑_{x,y} p(x, y) log p(x|y))
        = H(X) − H(X|Y).

Similarly,

I(X; Y) = H(Y) − H(Y|X).
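
A numerical sketch checking that these expressions agree on a small joint pmf (the example pmf and all names are ours):

import math

# A small joint pmf over X in {0, 1} and Y in {0, 1} with dependence.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (xx, _), p in pxy.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in pxy.items() if yy == y) for y in (0, 1)}

def H(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

# KL form: I(X; Y) = sum p(x, y) log2 [ p(x, y) / (p(x) p(y)) ].
mi_kl = sum(p * math.log2(p / (px[x] * py[y]))
            for (x, y), p in pxy.items() if p > 0)

hx, hy, hxy = H(px.values()), H(py.values()), H(pxy.values())
mi_x = hx - (hxy - hy)  # H(X) - H(X|Y), using H(X|Y) = H(X, Y) - H(Y)
mi_y = hy - (hxy - hx)  # H(Y) - H(Y|X)

print(mi_kl, mi_x, mi_y)  # all three are about 0.278 bits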

Relationship between entropy and mutual information


Proposition
The mutual information between a random variable X and itself is equal to the entropy of X, i.e., I(X; X) = H(X). (This follows from I(X; X) = H(X) − H(X|X) and the fact that H(X|X) = 0.)


Example
Let X = {0, 1}, and consider two distributions p and q on X. Let p(0) = 1 − r, p(1) = r and q(0) = 1 − s, q(1) = s. Then

D(p∥q) = (1 − r) log [(1 − r)/(1 − s)] + r log (r/s)

and

D(q∥p) = (1 − s) log [(1 − s)/(1 − r)] + s log (s/r).

If r = s, then D(p∥q) = D(q∥p) = 0. Note that in general D(p∥q) ≠ D(q∥p). For example, if r = 1/2 and s = 1/4, then

D(p∥q) = 0.2075 bits, D(q∥p) = 0.1887 bits.
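
These two values can be reproduced directly from the definition (a short sketch; the kl helper is ours):

import math

def kl(p, q):
    """D(p || q) in bits for two pmfs over the same alphabet."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

r, s = 1 / 2, 1 / 4
p = [1 - r, r]  # p(0), p(1)
q = [1 - s, s]  # q(0), q(1)

print(round(kl(p, q), 4))  # 0.2075 bits
print(round(kl(q, p), 4))  # 0.1887 bits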



Chain rules for entropy, relative entropy and mutual information


Theorem
Let X1, X2, . . . , Xn be drawn according to p(x1, x2, . . . , xn). Then

H(X1, X2, · · · , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, · · · , X1).
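
A numerical sketch of this chain rule for three binary variables with a randomly generated joint pmf (all names are ours):

import itertools
import math
import random

random.seed(0)

# A random joint pmf p(x1, x2, x3) over {0, 1}^3.
keys = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in keys]
p = {k: w / sum(weights) for k, w in zip(keys, weights)}

def H(probs):
    return sum(-q * math.log2(q) for q in probs if q > 0)

def marginal(joint, n):
    """Marginal pmf of the first n coordinates."""
    m = {}
    for k, q in joint.items():
        m[k[:n]] = m.get(k[:n], 0.0) + q
    return m

def cond_entropy_term(joint, i):
    """H(X_i | X_{i-1}, ..., X_1) = -sum p(x_1..x_i) log2 p(x_i | x_1..x_{i-1})."""
    num = marginal(joint, i)      # p(x_1, ..., x_i)
    den = marginal(joint, i - 1)  # p(x_1, ..., x_{i-1}); the empty prefix has mass 1
    return sum(-q * math.log2(q / den[k[:-1]]) for k, q in num.items() if q > 0)

chain = sum(cond_entropy_term(p, i) for i in (1, 2, 3))
print(math.isclose(H(p.values()), chain))  # True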


Conditional mutual information

Definition
The conditional mutual information of random variables X and Y given Z is defined by

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)
          = Ep(x,y,z) log [p(X, Y|Z) / (p(X|Z)p(Y|Z))].


Chain rule for mutual information

Theorem

I(X1, X2, · · · , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, Xi−2, · · · , X1).


Proof.

I(X1, X2, · · · , Xn; Y)
= H(X1, X2, · · · , Xn) − H(X1, X2, · · · , Xn|Y)
= ∑_{i=1}^{n} H(Xi|Xi−1, · · · , X1) − ∑_{i=1}^{n} H(Xi|Xi−1, · · · , X1, Y)
= ∑_{i=1}^{n} I(Xi; Y|Xi−1, Xi−2, · · · , X1).


Conditional relative entropy

Definition
For joint probability mass functions p(x, y) and q(x, y), the conditional relative entropy D(p(y|x)∥q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x). More precisely,

D(p(y|x)∥q(y|x)) = ∑_{x} p(x) ∑_{y} p(y|x) log [p(y|x)/q(y|x)]
                 = Ep(x,y) log [p(Y|X)/q(Y|X)].

Chain rule for relative entropy

Theorem (Chain rule for relative entropy)

D(p(x, y)∥q(x, y)) = D(p(x)∥q(x)) + D(p(y|x)∥q(y|x)).


Proof.

D(p(x, y)∥q(x, y)) = ∑_{x} ∑_{y} p(x, y) log [p(x, y)/q(x, y)]
                   = ∑_{x} ∑_{y} p(x, y) log [(p(x)p(y|x)) / (q(x)q(y|x))]
                   = ∑_{x} ∑_{y} p(x, y) log [p(x)/q(x)] + ∑_{x} ∑_{y} p(x, y) log [p(y|x)/q(y|x)]
                   = D(p(x)∥q(x)) + D(p(y|x)∥q(y|x)).
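
A numerical sketch of this chain rule on a pair of small joint pmfs over {0, 1} × {0, 1} (the pmfs and helper names are ours):

import math

# Two joint pmfs over (x, y) in {0, 1} x {0, 1}.
p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
q = {(0, 0): 0.40, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.30}

def marg_x(joint):
    return {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}

def kl(a, b):
    """D(a || b) in bits for dictionaries with the same keys."""
    return sum(ak * math.log2(ak / b[k]) for k, ak in a.items() if ak > 0)

px, qx = marg_x(p), marg_x(q)

# Conditional relative entropy: sum over (x, y) of p(x, y) log2 [p(y|x)/q(y|x)].
d_cond = sum(pxy * math.log2((pxy / px[x]) / (q[(x, y)] / qx[x]))
             for (x, y), pxy in p.items() if pxy > 0)

print(math.isclose(kl(p, q), kl(px, qx) + d_cond))  # True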
