
Exercise 6

Information Theory I
Prof. Dr. Heinz Koeppl, M.Sc. Maximilian Gehri

WS 2023/24
Sheet 6

Exercises (Prepare at home and discuss in session)

Task 6.1

Consider a convolutional code with the generator sequences g(0) = [1001] and g(1) = [1101].
(a) Draw the code tree.
(b) Draw the state diagram.
(c) Encode the message m = [1011010].

Solution:

From the generator sequences, we obtain the following relations between the input m(i) and the outputs c(0)(i), c(1)(i):

c(0)(i) = m(i) + m(i − 3),
c(1)(i) = m(i) + m(i − 1) + m(i − 3),

where m(i) = 0 for i < 0. We begin by writing down, for arbitrary time i ≥ 0, the table of encoder states (m(i − 3), m(i − 2), m(i − 1)), input m(i), and outputs c(0)(i), c(1)(i):

State name   m(i−3) m(i−2) m(i−1)   m(i)   c(0)(i) c(1)(i)
a            0 0 0                  0      0 0
a            0 0 0                  1      1 1
b            0 0 1                  0      0 1
b            0 0 1                  1      1 0
c            0 1 0                  0      0 0
c            0 1 0                  1      1 1
d            0 1 1                  0      0 1
d            0 1 1                  1      1 0
e            1 0 0                  0      1 1
e            1 0 0                  1      0 0
f            1 0 1                  0      1 0
f            1 0 1                  1      0 1
g            1 1 0                  0      1 1
g            1 1 0                  1      0 0
h            1 1 1                  0      1 0
h            1 1 1                  1      0 1

From this table, it is straightforward to draw both diagrams. In the following, we label transitions with input and output
using the notation m | c(0) c(1) .
(a) The code tree:

[Code tree (figure): starting at state a, each node branches upward on input 0 and downward on input 1, with branches labeled m | c(0)c(1) according to the table above; after three input bits the tree reaches all states a-h.]

(b) The state diagram:


(c) Encoding can be done, for example, by using the state diagram: we start in state a and then move through the state sequence b → c → f → d → g → f → c to produce the output sequence [11011101010010].
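
For a quick cross-check of (c), the relations c(0)(i) = m(i) + m(i−3) and c(1)(i) = m(i) + m(i−1) + m(i−3) can be implemented directly. The following Python sketch (ours, not part of the original sheet) reproduces the output sequence:

    # Sketch: feed-forward convolutional encoder for g(0) = [1001], g(1) = [1101].
    def encode(m):
        reg = [0, 0, 0]                  # shift register holding m(i-1), m(i-2), m(i-3)
        out = []
        for bit in m:
            c0 = bit ^ reg[2]            # c(0)(i) = m(i) + m(i-3)  (mod 2)
            c1 = bit ^ reg[0] ^ reg[2]   # c(1)(i) = m(i) + m(i-1) + m(i-3)
            out += [c0, c1]
            reg = [bit] + reg[:2]        # shift the register by one position
        return out

    print(encode([1, 0, 1, 1, 0, 1, 0]))  # -> [1,1,0,1,1,1,0,1,0,1,0,0,1,0]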

Task 6.2

In this exercise, we consider a recursive convolutional code with the rational transfer functions

g(0)(D) = (1 + D)/(1 + D + D^2),   g(1)(D) = 1.
(a) Draw the state diagram for this code.
(b) Encode the message m = [100101110].

[Figure (belongs to Task 6.1(b)): state diagram with states a-h and transitions labeled m | c(0)c(1), as listed in the table above.]

Solution:

The most systematic way to proceed is to write down, for arbitrary time i ≥ 0, the table of all possible encoder inputs m(i), states (m(i − 1), c(0)(i − 2), c(0)(i − 1)), and outputs c(0)(i), c(1)(i). From the transfer functions, we have the relations

m(i) + m(i − 1) = c(0)(i) + c(0)(i − 1) + c(0)(i − 2),
c(1)(i) = m(i),

where m(i) = c(0)(i) = 0 for i < 0. The table then reads
State name   m(i−1) c(0)(i−2) c(0)(i−1)   m(i)   c(0)(i) c(1)(i)   New state
a            0 0 0                        0      0 0               a
a            0 0 0                        1      1 1               d
b            1 0 0                        0      1 0               c
b            1 0 0                        1      0 1               b
c            0 0 1                        0      1 0               g
c            0 0 1                        1      0 1               f
d            1 0 1                        0      0 0               e
d            1 0 1                        1      1 1               h
e            0 1 0                        0      1 0               c
e            0 1 0                        1      0 1               b
f            1 1 0                        0      0 0               a
f            1 1 0                        1      1 1               d
g            0 1 1                        0      0 0               e
g            0 1 1                        1      1 1               h
h            1 1 1                        0      1 0               g
h            1 1 1                        1      0 1               f
(a) The state diagram is shown below.
(b) Encoding the message by using the state diagram, starting in state a, gives the sequence of states d, e, c, f, a, d, h, f, a
with output sequence [110010010011110100].
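
As in Task 6.1, the encoding can be cross-checked with a short Python sketch (ours, not part of the sheet) that implements the recursion c(0)(i) = m(i) + m(i−1) + c(0)(i−1) + c(0)(i−2) together with c(1)(i) = m(i):

    # Sketch: recursive encoder with g(0)(D) = (1+D)/(1+D+D^2), g(1)(D) = 1.
    def encode_rsc(m):
        m_prev, c0_prev2, c0_prev1 = 0, 0, 0   # state: m(i-1), c(0)(i-2), c(0)(i-1)
        out = []
        for bit in m:
            c0 = bit ^ m_prev ^ c0_prev1 ^ c0_prev2  # from (1+D+D^2) c(0) = (1+D) m
            c1 = bit                                 # c(1)(i) = m(i)
            out += [c0, c1]
            m_prev, c0_prev2, c0_prev1 = bit, c0_prev1, c0
        return out

    print(encode_rsc([1, 0, 0, 1, 0, 1, 1, 1, 0]))
    # -> [1,1,0,0,1,0,0,1,0,0,1,1,1,1,0,1,0,0]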

[Figure (belongs to Task 6.2(a)): state diagram with states a-h and transitions labeled m | c(0)c(1), as listed in the table above.]

Task 6.3

Find the rate-distortion function R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂) for X ∼ Bernoulli(1/2) and the distortion measure

d(x, x̂) = 0 if x = x̂,   d(x, x̂) = 1 if x = 1, x̂ = 0,   d(x, x̂) = ∞ if x = 0, x̂ = 1.

Solution:

We express the distortion E[d(X, X̂)] in terms of probabilities:

E[d(X, X̂)] = Σ_{x,x̂} p_X(x) p(x̂ | x) d(x, x̂) = Σ_x Σ_{x̂ ≠ x} p_X(x) p(x̂ | x) d(x, x̂)
           = p_X(0) p(X̂ = 1 | X = 0) d(0, 1) + p_X(1) p(X̂ = 0 | X = 1) d(1, 0)
           = (1/2) (p(X̂ = 1 | X = 0) · ∞ + p(X̂ = 0 | X = 1)).

We conclude that E[d(X, X̂)] < ∞ implies p(X̂ = 1 | X = 0) = 0. Since D < ∞ we can eliminate this degree of freedom
in the optimization problem by setting p(X̂ = 1 | X = 0) = 0 and p(X̂ = 0 | X = 0) = 1. For brevity we denote
p := p(X̂ = 0 | X = 1). Then
E[d(X, X̂)] = p/2 ≤ D.
We show two approaches to express the mutual information as a function of p.

Approach 1: Straightforward computation

" # " #
p(X)p(X̂ | X) p(X̂ | X) X p(x̂ | x)
I(X; X̂) =E log = E log P = pX (x)p(x̂ | x) log P ′ ′
x′ p(x̂ | x )pX (x )
P
p(X) x p(X̂ | x)p(x) x p(X̂ | x)p(x) x̂,x
1X p(x̂ | x)
=1 + p(x̂ | x) log P ′
2
x̂,x x′ p(x̂ | x )
!
1 1 p(X̂ = 0 | X = 1)
=1 − log(1 + p(X̂ = 0 | X = 1)) + p(X̂ = 0 | X = 1) log ,
2 2 1 + p(X̂ = 0 | X = 1)

where we used pX (x) = 1/2 in the second line and the degree of freedom reduction in the third line. Then
   
1 1 p 1 p
I(X; X̂) = 1 − log(1 + p) + p log = 1 − (1 + p)HB .
2 2 1+p 2 1+p
This expression can be used for constraint optimization with constraints 0 ≤ p ≤ min{2D, 1}.

Approach 2: Via entropies


We express the mutual information as

I(X; X̂) = H(X) − H(X | X̂) = H_B(1/2) − p(X̂ = 0) H(X | X̂ = 0) − p(X̂ = 1) H(X | X̂ = 1).

Then we represent the conditional entropies as binary entropies:

I(X; X̂) = 1 − p(X̂ = 0) H_B(p(X = 1 | X̂ = 0)) − p(X̂ = 1) H_B(p(X = 1 | X̂ = 1)).

Note that p(X = 0 | X̂ = 1) ∝ p(X̂ = 1 | X = 0) = 0 by Bayes' theorem, so p(X = 1 | X̂ = 1) = 1 and therefore H_B(p(X = 1 | X̂ = 1)) = 0. Also by Bayes' theorem we get

p(X = 1 | X̂ = 0) = p(X̂ = 0 | X = 1) / (2 p(X̂ = 0)),

and with p(X̂ = 0) = (1/2)(p(X̂ = 0 | X = 1) + p(X̂ = 0 | X = 0)) = (1/2)(p(X̂ = 0 | X = 1) + 1) we can express the mutual information as

I(X; X̂) = 1 − ((p + 1)/2) H_B(p / (p + 1))

with p := p(X̂ = 0 | X = 1) (as above).

Conclusion: We observe that

∂_p I(X; X̂) = (1/2) log [p / (p + 1)] < 0

for all p > 0, and I(X; X̂) = 0 at p = 1. Thus the maximum possible p obeying the constraint attains the minimum rate. We obtain

R(D) = 1 − (D + 1/2) H_B(D / (D + 1/2))   for all D ≤ 1/2,
R(D) = 0                                  for all D ≥ 1/2.
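
The closed form can be sanity-checked numerically. The sketch below (ours; the helper names hb and rate_numeric are arbitrary) minimizes I(X; X̂) = 1 − ((1 + p)/2) H_B(p/(1 + p)) over a grid of feasible p and compares the result with the closed-form R(D):

    # Sketch: numerical check of R(D) = 1 - (D + 1/2) * H_B(D / (D + 1/2)) for D <= 1/2.
    import numpy as np

    def hb(q):
        q = np.clip(q, 1e-12, 1 - 1e-12)   # binary entropy in bits, safe at the endpoints
        return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

    def rate_numeric(D, grid=100001):
        p = np.linspace(0, min(2 * D, 1.0), grid)    # feasible p = p(Xhat = 0 | X = 1)
        return np.min(1 - 0.5 * (1 + p) * hb(p / (1 + p)))

    for D in [0.1, 0.25, 0.4]:
        closed = 1 - (D + 0.5) * hb(D / (D + 0.5))
        print(D, rate_numeric(D), closed)            # the two values agree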

Task 6.4

Let d(x, x̂) be a distortion function. We have a source X ∼ p(x). Let R(D) be the associated rate distortion function.
(a) Find R̃(D) in terms of R(D), where R̃(D) is the rate distortion function associated with the distortion d̃(x, x̂) = d(x, x̂) + a for some constant a > 0. (They are not equal!)
(b) Now suppose that d(x, x̂) ≥ 0 for all x, x̂ and define a new distortion function d∗ (x, x̂) = b d(x, x̂), where b ≥ 0.
Find the associated rate distortion function R∗ (D) in terms of R(D).

Solution:

(a)

R̃(D) = inf_{p(x̂|x): E[d̃(X,X̂)] ≤ D} I(X; X̂)
      = inf_{p(x̂|x): E[d(X,X̂)] + a ≤ D} I(X; X̂)
      = inf_{p(x̂|x): E[d(X,X̂)] ≤ D − a} I(X; X̂)
      = R(D − a).

(b) If b > 0,

R*(D) = inf_{p(x̂|x): E[d*(X,X̂)] ≤ D} I(X; X̂)
      = inf_{p(x̂|x): E[b d(X,X̂)] ≤ D} I(X; X̂)
      = inf_{p(x̂|x): E[d(X,X̂)] ≤ D/b} I(X; X̂)
      = R(D/b);

else if b = 0, then d* = 0 ≤ D for all D ≥ 0 and hence

R*(D) = 0 for all D ≥ 0.
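
As a quick illustration, for a Bernoulli(1/2) source with Hamming distortion, where R(D) = 1 − H_B(D) for 0 ≤ D ≤ 1/2, these results give R̃(D) = 1 − H_B(D − a) for a ≤ D ≤ a + 1/2 and R*(D) = 1 − H_B(D/b) for 0 ≤ D ≤ b/2 (with b > 0).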

Task 6.5

Let X ∼ N(0, σ²) and let the distortion measure be squared error. Here, we do not allow block descriptions, i.e., we use a (2^(nR), n)-rate distortion code with n = 1.

Show that the optimal reproduction points for 1-bit quantization (i.e., R = 1) are ±√(2/π) σ and that the expected distortion for 1-bit quantization is ((π − 2)/π) σ². Compare this with the distortion-rate bound D(R) = σ² 2^(−2R) for R = 1.

Solution:

The distribution of X is given via its density

p(x) = N(x | 0, σ²) = (1/√(2πσ²)) e^(−x²/(2σ²)).
Since the distribution is symmetric around zero and we have only two reproduction points, we can treat the cases x ≤ 0 and x > 0 separately, with one reproduction point each. Let x̂ = −a be the reproduction point for x ≤ 0 and x̂ = a

for x > 0. The optimal points minimize the squared-error distortion D = E[(X − X̂)²], for which we find

D = E[(X − X̂)²] = ∫_{−∞}^{0} p(x)(x + a)² dx + ∫_{0}^{∞} p(x)(x − a)² dx
  (1)= 2 ∫_{0}^{∞} p(x)(x − a)² dx
  = 2 [ a² ∫_{0}^{∞} p(x) dx + ∫_{0}^{∞} x² p(x) dx − 2a ∫_{0}^{∞} x p(x) dx ]
  = 2 [ (1/2) a² + (1/2) σ² − 2a (1/√(2πσ²)) ∫_{0}^{∞} x e^(−x²/(2σ²)) dx ]
  (2)= a² + σ² − (4a/√(2πσ²)) σ² ∫_{0}^{∞} e^(−y) dy
  = a² + σ² − 4aσ²/√(2πσ²),
where in (1) we used the symmetry property and in (2) applied integration by substitution with y = x²/(2σ²), so dy = (2x/(2σ²)) dx and dx = σ² dy/x. We find the optimal reproduction point by differentiating D with respect to a and setting the derivative to zero:

∂D/∂a = 2a − 4σ²/√(2πσ²) = 0   ⇒   a* = √(2/π) σ.
For the second partial derivative we obtain ∂²D/∂a² = 2 > 0, from which we see that a* actually minimizes the distortion. Hence, the optimal reproduction points are ±√(2/π) σ. For the optimal reproduction points we find the distortion evaluated at a* = √(2/π) σ to be

D|_{a=a*} = (a*)² + σ² − 4a*σ²/√(2πσ²) = ((π − 2)/π) σ².
The optimal distortion for R = 1 (the distortion-rate function evaluated at R = 1) yields a lower bound on the single-bit
quantization (i.e., R = 1) distortion of a Gaussian random variable X:

D_opt = D(R = 1) = σ²/4 < ((π − 2)/π) σ².
This shows that the scalar single-bit quantizer does not achieve the distortion-rate bound D(1). Analogously to channel coding, the minimum distortion D(R) for a fixed rate R is usually achieved only asymptotically for large block length n.
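
A short numerical sketch (ours, not part of the sheet) confirms both the minimizer and the resulting distortion:

    # Sketch: minimize D(a) = a^2 + sigma^2 - 4*a*sigma/sqrt(2*pi) over a grid of a
    # and compare with a* = sqrt(2/pi)*sigma and D(a*) = (pi - 2)/pi * sigma^2.
    import numpy as np

    sigma = 1.0
    a_grid = np.linspace(0.0, 3.0, 200001)
    D = a_grid**2 + sigma**2 - 4 * a_grid * sigma / np.sqrt(2 * np.pi)
    i = np.argmin(D)
    print(a_grid[i], np.sqrt(2 / np.pi) * sigma)      # ~0.7979 in both cases
    print(D[i], (np.pi - 2) / np.pi * sigma**2)       # ~0.3634 in both cases
    print(sigma**2 / 4)                               # 0.25, the distortion-rate bound D(1)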

Task 6.6

Consider the ordinary Gaussian channel with two correlated observations of X, i.e., Y = [Y1 , Y2 ]T , where

Y1 = X + Z1 , Y2 = X + Z2

with a power constraint P on X, and [Z1, Z2]^T ∼ N(0, K), where

K = [ N    Nρ
      Nρ   N  ].

Find the capacity for the cases ρ = 1, and ρ = −1.

Solution:

For ρ = 1, we have Z1 = Z2 , so that Y1 = Y2 . Clearly, the capacity will then be equal to the capacity for only one
measurement Y1 . For ρ = −1, we have Z1 = −Z2 , so that Y1 + Y2 = 2X. Then X can be determined exactly, so that the
capacity is infinite.
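
Explicitly, for ρ = 1 the capacity is therefore C = (1/2) log2(1 + P/N), the capacity of a single AWGN observation with noise variance N, while for ρ = −1 the capacity is infinite.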

Task 6.7

Consider three parallel additive Gaussian noise channels

Yi = Xi + Zi , i = 1, 2, 3,

where the noise distribution is given by

[Z1, Z2, Z3]^T ∼ N(0, σ² diag(1, 2, 5)).

We assume a common input power constraint Σ_{i=1}^{3} E[X_i²] ≤ P. Find the distribution of X = [X1, X2, X3]^T that leads to maximum capacity for the three cases P = 0.5σ², P = 2σ², and P = 10σ².

Solution:

The solution of this problem can be found with the water-filling algorithm. The maximizing distribution for x is
a multivariate zero-mean Gaussian with independent components, where the variances can most easily be found
graphically:

[Figure: water-filling diagrams for the three cases (a), (b), (c); x-axis: channel number 1-3 with noise levels σ², 2σ², 5σ²; the water level rises as the total power P increases.]

Therefore, the solutions for the three cases are

(a) P1 = σ 2 /2, P2 = 0, P3 = 0.
(b) P1 = 3σ 2 /2, P2 = σ 2 /2, P3 = 0.
(c) P1 = 5σ 2 , P2 = 4σ 2 , P3 = σ 2 .
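
The water level can also be found numerically. Below is a small water-filling sketch (ours, not part of the sheet); noise levels and powers are given in units of σ²:

    # Sketch: water-filling over parallel Gaussian channels with noise levels N_i.
    # Bisection finds the water level nu with sum_i max(nu - N_i, 0) = P; then P_i = (nu - N_i)^+.
    import numpy as np

    def waterfill(noise, P, iters=100):
        noise = np.asarray(noise, dtype=float)
        lo, hi = noise.min(), noise.max() + P        # the water level lies in this interval
        for _ in range(iters):
            nu = 0.5 * (lo + hi)
            if np.maximum(nu - noise, 0).sum() > P:
                hi = nu
            else:
                lo = nu
        return np.maximum(nu - noise, 0)

    for P in [0.5, 2.0, 10.0]:                       # powers in units of sigma^2
        print(P, waterfill([1.0, 2.0, 5.0], P))
    # -> [0.5, 0, 0], [1.5, 0.5, 0], [5, 4, 1], matching cases (a), (b), (c)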

Task 6.8

Consider three parallel additive Gaussian noise channels

Yi = Xi + Zi , i = 1, 2, 3,

where the noise distribution is given by

                     [ 1  0  0 ]
[Z1, Z2, Z3]^T ∼ N(0,[ 0  1  ρ ]).
                     [ 0  ρ  1 ]

Assuming a common input power constraint Σ_{i=1}^{3} E[|X_i|²] ≤ P = 6ρ, find the maximum capacity and the optimum distribution of power among the channels for which it is obtained.

Solution:

The case of correlated noise in a Gaussian channel can be handled by applying the water-filling algorithm in coordinates
in which the covariance matrix is diagonal.
In this case, the covariance matrix Σ has the eigenvalues λ1 = 1, λ2 = 1 + ρ, and λ3 = 1 − ρ with corresponding normalized eigenvectors [1, 0, 0]^T, [0, 1/√2, 1/√2]^T, and [0, 1/√2, −1/√2]^T. Defining

    [ 1   0      0    ]        [ 1   0     0   ]
S = [ 0  1/√2   1/√2  ],   D = [ 0  1+ρ    0   ],
    [ 0  1/√2  −1/√2  ]        [ 0   0    1−ρ  ]

we then have Σ = SDS^T. We then apply the water-filling algorithm to the diagonal matrix D. Let P̃_i, i = 1, 2, 3, be
the allocated power in the new coordinates. By applying the water-filling algorithm, we find that

P̃1 = 2ρ, P̃2 = ρ, P̃3 = 3ρ,

where we assumed that ρ ≥ 0, as otherwise the input power constraint would be meaningless. The allocated power in
the original coordinates, i.e., the correlation matrix for the input which achieves the capacity, is then found to be
   
      [ 2ρ  0   0  ]       [ 2ρ   0    0  ]
S^T   [ 0   ρ   0  ] S  =  [ 0   2ρ   −ρ  ].
      [ 0   0   3ρ ]       [ 0   −ρ   2ρ  ]
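
The maximum capacity itself then follows from the water level ν = 1 + 2ρ as C = (1/2) Σ_{i=1}^{3} log2(1 + P̃_i/λ_i) = (1/2) log2((1 + 2ρ)³ / ((1 + ρ)(1 − ρ))) bits per channel use (for 0 ≤ ρ < 1).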

Questions of understanding (Solve and discuss during the session)

Task 6.9: Probabilistic decoding algorithms

(a) In Task 4.6 (f) we have shown that MAP decoding minimizes the error probability Pe. State the probabilities of error that are minimized by the sum-product and the max-product message passing algorithms.
(b) Consider both block-wise ML-decoding and block-wise MAP-decoding for any linear binary (n, k)-block code with
P (c) chosen as in the lecture. Over which set do you optimize such that the decoders are equivalent? Explain your
answer.
(c) Discuss why bit-wise MAP decoding does not always yield valid codewords.
(d) Are bit-wise ML-decoding and bit-wise MAP-decoding via the sum-product algorithm equivalent for any linear binary
(n, k)-block code with P (c) chosen as in the lecture? Explain your answer.
Hint: Consider the case that the codebook contains at least one codeword with ci = 1 for all i ∈ {1, . . . , n} and afterwards consider the opposite case. Is P(ci) always uniform for all i?

(e) Discuss how ordinary minimum distance decoding and block-wise MAP decoding via the max-product algorithm
compare to each other.
(f) So far we used the sum-product and the max-product algorithm only for binary symmetric channels. What do we
need to change (in comparison to the BSC) if we want to decode a message sent over a binary erasure channel with
erasure probability ε?

Solution:

(a) The sum-product algorithm is a bit-wise MAP decoder (also called a bit-wise MAP estimator) ĉi = argmaxci p(ci | r)
for all i ∈ {1, . . . , n} and the max-product algorithm is a block-wise MAP decoder ĉ = argmaxc p(c | r). Therefore
the respective decoders minimize Pe = P (gi (r) ̸= ci ) w.r.t. the decoding function gi for all i ∈ {1, . . . , n} and
Pe = P (g(r) ̸= c) w.r.t. the decoding function g.
(b) The block-wise MAP decoder is equivalent to an ML decoder if the considered search space is the codebook C and not the space of length-n sequences. The prior p(c) = 2^(−k) M(c) is uniform on C, but not on {0, 1}^n.
Case 1:

ĉ_ML,1 = argmax_{c∈C} ∏_{j=1}^{n} p(r_j | c_j) = argmax_{c∈C} M(c) ∏_{j=1}^{n} p(r_j | c_j) = ĉ_MAP.

Case 2:

ĉ_ML,2 = argmax_{c∈{0,1}^n} ∏_{j=1}^{n} p(r_j | c_j) ≠ argmax_{c∈{0,1}^n} M(c) ∏_{j=1}^{n} p(r_j | c_j) = ĉ_MAP = ĉ_ML,1.

(c) As we have seen in the solution of part (b), the block-wise MAP decoder effectively uses the codebook as its search space. In contrast, the search space of bit-wise MAP decoding is not constrained, so that the decoding function of the entire codeword g(r) = [g_1(r), . . . , g_n(r)] maps to the whole space {0, 1}^n instead of just mapping to the codebook. The decoding function g processes the bits independently of each other. One can also say that the bit-wise MAP decoder uses an averaged likelihood function:

ĉ_i = argmax_{c_i} Σ_{¬c_i} p(c | r) = argmax_{c_i} Σ_{¬c_i} p(r | c) p(c) = argmax_{c_i} E_c[p(r | c) | c_i] p(c_i),

where Ec is the expectation over all codewords (and not over the received words r). This averaging reduces the
information, which length-n sequences are codewords, to the information how often r is produced (on average)
by a codeword that has value ci at the i-th position. The remaining information is not enough to enforce that the
decoding yields valid codewords.
(d) Let us start with the definition of the two decoding functions:

ĉ_{i,MAP} = argmax_{c_i ∈ {0,1}} p(c_i | r) = argmax_{c_i ∈ {0,1}} p(r | c_i) p(c_i),
ĉ_{i,ML} = argmax_{c_i ∈ {0,1}} p(r | c_i),

where p(c_i) = Σ_{¬c_i} p(c) = 2^(−k) Σ_{¬c_i} M(c). We first consider the case mentioned in the hint. If there exists at least
one codeword with ci = 1, then the i-th column gi of the generator matrix G = [g1 , . . . , gn ] contains at least one
non-zero entry. Recall that ci = mgi for some message m ∈ {0, 1}k . Assume that the number of non-zero entries in
this particular column is w ≤ k. We want to calculate the number of codewords with ci = 0 to evaluate P (ci = 0).
To do this, we reorder the rows of G, such that the first w coordinates of gi are equal to one and the remaining
coordinates are all zero. This changes the mapping from message to codeword, but it does not change the set
of codewords C and so p(ci ) is also unchanged. To get a codeword with ci = 0 the number of ones in the first w
coordinates of m must be even. The remaining (k − w) elements of m can be chosen arbitrarily, since their choice has no
effect on ci . We evaluate the number of codewords with ci = 0 to
Σ_{¬c_i} M(c)|_{c_i=0} = Σ_{r even} (w choose r) 2^(k−w) = 2^(k−w) 2^(w−1) = 2^(k−1),

where we used the identity Σ_{r even} (w choose r) = 2^(w−1). This implies p(c_i = 0) = p(c_i = 1) = 1/2, i.e., the marginal distribution of c_i on {0, 1} is uniform. Now we turn to the case where there exists an i such that c_i = 0 for all codewords. As an example, consider the (5, 2)-code with generator matrix
 
G = [ 0 0 1 1 0
      0 1 1 0 1 ].

It is a valid linear code since all rows are linearly independent, but c1 = 0 for all codewords. In this example, we
obviously have P(c_1 = 0) = 1, which violates the assumption that c_1 is uniformly distributed on {0, 1}. So only under the assumption that for all i ∈ {1, . . . , n} the codebook contains at least one codeword with c_i = 1 can we in general conclude that bit-wise MAP decoding via the sum-product algorithm equals the bit-wise ML estimator given above. However, under certain assumptions, e.g., for a binary symmetric channel with bit-flip probability ε < 1/2, the decoding functions can still be equivalent.
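
This can be verified by brute force. The sketch below (ours, not part of the sheet) enumerates all codewords c = mG and prints the marginals P(c_i = 0); for the example generator matrix it returns 1 for the first bit position and 1/2 for all others:

    # Sketch: enumerate all codewords c = m G (mod 2) and compute the bit marginals P(c_i = 0)
    # under a uniform prior over the 2^k messages.
    import itertools
    import numpy as np

    G = np.array([[0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 1]])         # the (5, 2) example from part (d)
    k, n = G.shape
    codewords = np.array([(np.array(m) @ G) % 2
                          for m in itertools.product([0, 1], repeat=k)])
    print((codewords == 0).mean(axis=0))    # -> [1.0, 0.5, 0.5, 0.5, 0.5]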
(e) As we have shown in part (b) the block-wise MAP decoder is equivalent to the block-wise ML-decoder if P (c) is
chosen uniform on the space of codewords. Recall that minimum distance decoding is equivalent to block-wise ML
decoding for a binary symmetric channel with bit-flip probability ε < 1/2. So if these two conditions are satisfied, then
the max-product algorithm is equivalent to ordinary minimum distance decoding. Note that in this case minimum
distance decoding minimizes the probability of a block decoding error. In all other cases the two decoding methods
are potentially different and minimum distance decoding may either not minimize the block decoding error or it
may even be inappropriate. For example, minimum distance decoding is not well-defined for the binary erasure
channel (Why?).
(f) The messages at terminal factor nodes are initialized as

(p(r_j | c_j = 0), p(r_j | c_j = 1)) = (1 − ε, 0) if r_j = 0,   (ε, ε) if r_j = e,   (0, 1 − ε) if r_j = 1.

Otherwise the algorithms remain unchanged.
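
As an illustration, the initialization could be coded as follows (a sketch under the assumption that the messages are stored as pairs of channel likelihoods; the function name is ours):

    # Sketch: channel-likelihood initialization at the terminal factor nodes
    # for a binary erasure channel with erasure probability eps ('e' marks an erasure).
    def bec_channel_message(r_j, eps):
        if r_j == 0:
            return (1.0 - eps, 0.0)     # (p(r_j | c_j = 0), p(r_j | c_j = 1))
        if r_j == 1:
            return (0.0, 1.0 - eps)
        return (eps, eps)               # erased symbol: both code bits equally likely

    print([bec_channel_message(r, 0.2) for r in [0, 'e', 1]])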

Task 6.10: Rate distortion theory

(a) For a source X with a finite alphabet X and Hamming distortion, describe a (not necessarily constructive) sequence
of rate distortion codes that is asymptotically lossless and achieves the rate R = H(X).
(b) Suppose that X = {1, 2, 3, 4}, X̂ = {1, 2, 3, 4}, p(x_i) = 1/4 for i = 1, 2, 3, 4, and X_1, X_2, . . . are i.i.d. ∼ p(x). The distortion matrix d(x, x̂) is given by
d(x, x̂)   x̂=1  x̂=2  x̂=3  x̂=4
x=1        0     0     1     1
x=2        0     0     1     1
x=3        1     1     0     0
x=4        1     1     0     0

(i) Find R(0), the rate necessary to describe the process with zero distortion.
(ii) Find the rate distortion function R(D).
Hint: There are some irrelevant distinctions in alphabets X and X̂ , which allow the problem to be collapsed.
(iii) Suppose we have a non-uniform distribution p(xi ) = pi for i = 1, 2, 3, 4. What is R(D) ?

Solution:

(a) In the lecture we have seen a quote by Shannon, which motivates that discarding all non-typical sequences and assigning fixed-length descriptions to all typical sequences can be used to construct an asymptotically error-free source code with asymptotic rate H(X).
Let ϵ = 1/n, set X̂ = X, and consider the typical sequences T_ϵ^(n) ⊆ X^n. Set M = |T_ϵ^(n)| and let g : {1, . . . , M} → T_ϵ^(n) be an arbitrary mapping from an index set to the typical sequences. Since g is invertible we define the encoding function f such that

f(x) = g^(−1)(x) if x ∈ T_ϵ^(n), and f(x) = 1 else.
We simply map all non-typical sequences to the same index as it is asymptotically irrelevant where we map them. It
is left to show that this sequence of rate distortion codes is actually asymptotically lossless and achieves the rate
R = H(X):
0 ≤ E[d(X^n, X̂^n)] = (1/n) Σ_{i=1}^{n} P(X_i ≠ X̂_i) ≤ (1/n) Σ_{i=1}^{n} P(X^n ≠ X̂^n) = P(X^n ≠ X̂^n) = P(X^n ∉ T_ϵ^(n)) < 1/n,

where we used that P(X^n ∈ T_ϵ^(n)) > 1 − ϵ. So we have lim_{n→∞} E[d(X^n, X̂^n)] = 0.
By construction, the rate of the n-th rate distortion code is R^(n) = (log M)/n = (log |T_ϵ^(n)|)/n. The typical set is bounded as (1 − ϵ) 2^(n(H(X)−ϵ)) ≤ |T_ϵ^(n)| ≤ 2^(n(H(X)+ϵ)), such that

log(1 − ϵ)/n + H(X) − ϵ ≤ R^(n) ≤ H(X) + ϵ,

and hence lim_{n→∞} R^(n) = H(X).
Comment: We could have used a more constructive Ansatz by designing the encoding function as a modified
version of a block-Huffman encoder, block-Fano encoder or arithmetic (block-Elias) encoder. Each of these choices
asymptotically leads to having very long codewords for the non-typical sequences (much longer than length nH(X))
and approximately length-nH(X) codewords for all typical sequences. An encoder can thus be constructed by mapping all sequences whose codeword length is at least n − n(1 − H(X))/2 to the index 1 and all others to the binary integer given by their codeword. Asymptotically we can thus still exploit the properties of the typical set, e.g., we have M ≤ |T_ϵ^(n)| + 2 for large n.
(b) Considering the distortion matrix, we can exploit the fact that the distortion measure does not distinguish between the symbols 1 and 2, nor between 3 and 4. Let Y be a binary random variable such that

Y = 0 if X ∈ {1, 2},   Y = 1 if X ∈ {3, 4},

and define Ŷ analogously from X̂. The distortion matrix becomes

d(y, ŷ)   ŷ=0  ŷ=1
y=0        0     1
y=1        1     0
Now the problem is reduced to a binary source with Hamming distortion measure. Recall the definition of the rate distortion function:

R(D) = min_{p(ŷ|y): E[d(Y,Ŷ)] ≤ D} I(Y; Ŷ).

We know that Y is a binary source and its probability distribution is given by

p(Y = 0) = p(X = 1) + p(X = 2) = 1/2,
p(Y = 1) = p(X = 3) + p(X = 4) = 1/2,

and the source entropy H(Y) is

H(Y) = H_B(1/2) = 1 bit/symbol.
Now we need to represent E[d(Y, Ŷ)] in terms of probabilities:

E[d(Y, Ŷ)] = Σ_{i=1}^{2} Σ_{j=1}^{2} d(y_i, ŷ_j) p(Ŷ = ŷ_j | Y = y_i) p(Y = y_i)
           = d(0, 1) p(Ŷ = 1 | Y = 0) p(Y = 0) + d(1, 0) p(Ŷ = 0 | Y = 1) p(Y = 1).


(1) For D = 0:
E[d(Y, Ŷ )] = 0
Therefore,
p(Ŷ = 1|Y = 0) = p(Ŷ = 0|Y = 1) = 0

Now we can express I(Y ; Ŷ ) in terms of entropies:


I(Y ; Ŷ ) = H(Y ) − H(Y |Ŷ )
= HB ( 21 ) − H(Y |Ŷ )
= 1 − H(Y |Ŷ )
Notice that the conditional probabilities p(Ŷ = 1|Y = 0) and p(Ŷ = 0|Y = 1) are both set to zero. Since both
Y and Ŷ are binary random variables we conclude that the value of one variable completely determines the
value of the other. Therefore,
H(Ŷ |Y ) = H(Y |Ŷ ) = 0

R(0) = H_B(1/2) = 1 bit/symbol.

(2) Recall the expression of E[d(Y, Ŷ )] in terms of probabilities:


E[d(Y, Ŷ)] = d(0, 1) p(Ŷ = 1 | Y = 0) p(Y = 0) + d(1, 0) p(Ŷ = 0 | Y = 1) p(Y = 1)
           = p(Ŷ = 1 | Y = 0) p(Y = 0) + p(Ŷ = 0 | Y = 1) p(Y = 1) = p(Y ≠ Ŷ) ≤ D.
To find the lower bound we consider the worst case where P_e := p(Y ≠ Ŷ) = D (for P_e ≤ D ≤ 1/2 we have H_B(P_e) ≤ H_B(D)).
Recall Fano's inequality: for Ŷ an estimator of Y (with the same support) we have

H(Y | Ŷ) ≤ P_e log2(|Y| − 1) + H_B(P_e) ≤ H_B(D),

where |Y| denotes the alphabet size of Y and the last step uses |Y| = 2.
The first inequality is an equality if the distribution p(y | Ŷ = ŷ, Ŷ ̸= Y ) is uniform on Y \ {ŷ}. For |Y| = 2 this
is always true. With this upper bound for the conditional entropy we conclude:
I(Y; Ŷ) = H(Y) − H(Y | Ŷ)        (1)
        = H_B(1/2) − H(Y | Ŷ)    (2)
        ≥ 1 − H_B(D)             (3)
R(D) = 1 − H_B(D)                (4)
Remark: Line (4) follows since the bound (3) holds for every admissible p(ŷ | y) and is attained with equality, e.g., by a reverse test channel p(y | ŷ) given by a BSC with crossover probability D; hence R(D) = 1 − H_B(D) for 0 ≤ D ≤ 1/2.

(3) Finally, we generalize the rate distortion function beyond the uniformly distributed source. In this task we assume a non-uniform distribution for the source X, p(x_i) = p_i for i = 1, 2, 3, 4. Since we reduced the problem to a binary one, the distribution of the binary random variable Y must be expressed in terms of the distribution of X, as already discussed in subtask (1) of this problem.

I(Y; Ŷ) = H(Y) − H(Y | Ŷ)                        (5)
        = H(p_1 + p_2, p_3 + p_4) − H(Y | Ŷ)     (6)
        ≥ H(p_1 + p_2, p_3 + p_4) − H_B(D)       (7)
R(D) = H_B(p_1 + p_2) − H_B(D)                   (8)
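
The closed forms in (2) and (3) can be cross-checked with the Blahut-Arimoto algorithm. The sketch below (ours, not part of the sheet; parameter names are arbitrary) sweeps the slope parameter beta for the reduced binary source and compares each resulting (D, R) pair with H_B(p(Y = 0)) − H_B(D):

    # Sketch: Blahut-Arimoto for the rate-distortion function of the reduced binary source Y
    # with Hamming distortion; compares the numeric (D, R) points with H_B(p0) - H_B(D).
    import numpy as np

    def hb(q):
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

    def blahut_arimoto(p_y, d, beta, iters=500):
        r = np.full(d.shape[1], 1.0 / d.shape[1])    # output marginal r(y_hat), start uniform
        for _ in range(iters):
            q = r[None, :] * np.exp(-beta * d)       # unnormalized p(y_hat | y)
            q /= q.sum(axis=1, keepdims=True)
            r = p_y @ q                              # update the output marginal
        D = np.sum(p_y[:, None] * q * d)             # distortion of the current test channel
        R = np.sum(p_y[:, None] * q * np.log2(q / r[None, :]))   # mutual information I(Y; Y_hat)
        return D, R

    p_y = np.array([0.5, 0.5])                       # p(Y=0), p(Y=1); use [p1+p2, p3+p4] for (3)
    d = np.array([[0.0, 1.0], [1.0, 0.0]])           # Hamming distortion on the reduced alphabet
    for beta in [1.0, 2.0, 4.0]:
        D, R = blahut_arimoto(p_y, d, beta)
        print(D, R, hb(p_y[0]) - hb(D))              # R matches the closed form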
