exercise6-sol
Information Theory I
Prof. Dr. Heinz Koeppl, M.Sc. Maximilian Gehri
WS 2023/24
Sheet 6
Task 6.1
Consider a convolutional code with the generator sequences g^(0) = [1001] and g^(1) = [1101].
(a) Draw the code tree.
(b) Draw the state diagram.
(c) Encode the message m = [1011010].
Solution:
From the generator sequences, we obtain the following relations between the input m(i) and the outputs c^(0)(i), c^(1)(i):

c^(0)(i) = m(i) ⊕ m(i−3),
c^(1)(i) = m(i) ⊕ m(i−1) ⊕ m(i−3),

where m(i) = 0 for i < 0. We begin by writing down, for arbitrary time i ≥ 0, the table of encoder states (m(i−3), m(i−2), m(i−1)), input m(i), and outputs c^(0)(i), c^(1)(i):
State name | m(i−3) m(i−2) m(i−1) | m(i) | c^(0)(i) c^(1)(i)
a          | 0 0 0               | 0    | 0 0
a          | 0 0 0               | 1    | 1 1
b          | 0 0 1               | 0    | 0 1
b          | 0 0 1               | 1    | 1 0
c          | 0 1 0               | 0    | 0 0
c          | 0 1 0               | 1    | 1 1
d          | 0 1 1               | 0    | 0 1
d          | 0 1 1               | 1    | 1 0
e          | 1 0 0               | 0    | 1 1
e          | 1 0 0               | 1    | 0 0
f          | 1 0 1               | 0    | 1 0
f          | 1 0 1               | 1    | 0 1
g          | 1 1 0               | 0    | 1 1
g          | 1 1 0               | 1    | 0 0
h          | 1 1 1               | 0    | 1 0
h          | 1 1 1               | 1    | 0 1
From this table, it is straightforward to draw both diagrams. In the following, we label transitions with input and output using the notation m | c^(0)c^(1).
(a) The code tree:
[Figure: code tree starting in state a; each branch is labeled m | c^(0)c^(1) and leads to the successor states a-h given by the table above.]

(b) The state diagram:

[Figure: state diagram over the eight states a-h; each state has two outgoing transitions labeled m | c^(0)c^(1), as listed in the table above.]

(c) Following the state diagram from state a (equivalently, applying the relations above directly), the message m = [1011010] is encoded to c = [11 01 11 01 01 00 10]; the sketch below reproduces this computation.
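As a quick cross-check (not part of the original solution), here is a small Python sketch of the feed-forward encoder; it implements the two relations above and, applied to m = [1011010], reproduces the codeword stated in part (c). Function names and the list-based bit representation are arbitrary choices.

def conv_encode(m, g0=(1, 0, 0, 1), g1=(1, 1, 0, 1)):
    """Encode the bit list m with generator sequences g0 and g1 (constraint length 4)."""
    out = []
    for i in range(len(m)):
        c0, c1 = 0, 0
        for k in range(4):                      # c^(j)(i) = XOR_k g_j[k] * m(i-k)
            mk = m[i - k] if i - k >= 0 else 0  # m(i) = 0 for i < 0
            c0 ^= g0[k] & mk
            c1 ^= g1[k] & mk
        out.extend([c0, c1])
    return out

print(conv_encode([1, 0, 1, 1, 0, 1, 0]))
# -> [1,1, 0,1, 1,1, 0,1, 0,1, 0,0, 1,0], i.e. c = [11 01 11 01 01 00 10]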
Task 6.2
In this exercise, we consider a recursive convolutional code with the rational transfer functions

g^(0)(D) = (1 + D)/(1 + D + D^2),    g^(1)(D) = 1.
(a) Draw the state diagram for this code.
(b) Encode the message m = [100101110].
Solution:
The most systematic way to proceed is to write down, for arbitrary time i ≥ 0, the table of all possible encoder inputs m(i), states (m(i−1), c^(0)(i−2), c^(0)(i−1)), and outputs c^(0)(i), c^(1)(i). From the transfer functions, we have the relations

c^(0)(i) = m(i) ⊕ m(i−1) ⊕ c^(0)(i−1) ⊕ c^(0)(i−2),
c^(1)(i) = m(i),

where m(i) = c^(0)(i) = 0 for i < 0. The table then reads
State name | m(i−1) c^(0)(i−2) c^(0)(i−1) | m(i) | c^(0)(i) c^(1)(i) | New state
a          | 0 0 0                        | 0    | 0 0               | a
a          | 0 0 0                        | 1    | 1 1               | d
b          | 1 0 0                        | 0    | 1 0               | c
b          | 1 0 0                        | 1    | 0 1               | b
c          | 0 0 1                        | 0    | 1 0               | g
c          | 0 0 1                        | 1    | 0 1               | f
d          | 1 0 1                        | 0    | 0 0               | e
d          | 1 0 1                        | 1    | 1 1               | h
e          | 0 1 0                        | 0    | 1 0               | c
e          | 0 1 0                        | 1    | 0 1               | b
f          | 1 1 0                        | 0    | 0 0               | a
f          | 1 1 0                        | 1    | 1 1               | d
g          | 0 1 1                        | 0    | 0 0               | e
g          | 0 1 1                        | 1    | 1 1               | h
h          | 1 1 1                        | 0    | 1 0               | g
h          | 1 1 1                        | 1    | 0 1               | f
(a) The state diagram:

[Figure: state diagram over the eight states a-h; each state has two outgoing transitions labeled m | c^(0)c^(1), as listed in the table above.]
(b) Encoding the message by using the state diagram, starting in state a, gives the sequence of states d, e, c, f, a, d, h, f, a
with output sequence [110010010011110100].
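A minimal Python sketch of the recursive encoder (not part of the original sheet), assuming the recursion derived above; applied to m = [100101110] it reproduces the output sequence of part (b).

def rsc_encode(m):
    """Recursive encoder for g^(0)(D) = (1+D)/(1+D+D^2), g^(1)(D) = 1."""
    m_prev, c0_pprev, c0_prev = 0, 0, 0        # state (m(i-1), c0(i-2), c0(i-1))
    out = []
    for mi in m:
        c0 = mi ^ m_prev ^ c0_prev ^ c0_pprev  # c0(i) = m(i)+m(i-1)+c0(i-1)+c0(i-2)
        c1 = mi                                # systematic output
        out.extend([c0, c1])
        m_prev, c0_pprev, c0_prev = mi, c0_prev, c0
    return out

print("".join(map(str, rsc_encode([1, 0, 0, 1, 0, 1, 1, 1, 0]))))
# -> 110010010011110100, as stated in part (b)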
Task 6.3
Find the rate-distortion function R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂) for X ∼ Bernoulli(1/2) and the distortion measure

d(x, x̂) = 0 if x = x̂,    d(x, x̂) = 1 if x = 1, x̂ = 0,    d(x, x̂) = ∞ if x = 0, x̂ = 1.
Solution:
Since d(0, 1) = ∞, a finite expected distortion E[d(X, X̂)] < ∞ requires p(X̂ = 1 | X = 0) = 0. Because D < ∞, we can eliminate this degree of freedom in the optimization problem by setting p(X̂ = 1 | X = 0) = 0 and p(X̂ = 0 | X = 0) = 1. For brevity we denote p := p(X̂ = 0 | X = 1). Then

E[d(X, X̂)] = P(X = 1, X̂ = 0) = p/2 ≤ D.

We show two approaches to express the mutual information as a function of p.
I(X; X̂) = E[ log( p(X) p(X̂ | X) / ( p(X) Σ_x p(X̂ | x) p(x) ) ) ] = E[ log( p(X̂ | X) / Σ_x p(X̂ | x) p(x) ) ]
         = Σ_{x̂,x} p_X(x) p(x̂ | x) log( p(x̂ | x) / Σ_{x'} p(x̂ | x') p_X(x') )
         = 1 + (1/2) Σ_{x̂,x} p(x̂ | x) log( p(x̂ | x) / Σ_{x'} p(x̂ | x') )
         = 1 − (1/2) log( 1 + p(X̂ = 0 | X = 1) ) + (1/2) p(X̂ = 0 | X = 1) log( p(X̂ = 0 | X = 1) / (1 + p(X̂ = 0 | X = 1)) ),

where we used p_X(x) = 1/2 in the second step and the degree-of-freedom reduction in the last step. Then

I(X; X̂) = 1 − (1/2) log(1 + p) + (1/2) p log( p / (1 + p) ) = 1 − ((1 + p)/2) H_B( p / (1 + p) ).
This expression can be used for a constrained optimization with the constraint 0 ≤ p ≤ min{2D, 1}. Since I(X; X̂) is decreasing in p, the minimum is attained at p = min{2D, 1}, which gives R(D) = 1 − ((1 + min{2D, 1})/2) H_B( min{2D, 1} / (1 + min{2D, 1}) ).
For the second approach, we write

I(X; X̂) = H(X) − H(X | X̂) = H_B(1/2) − p(X̂ = 0) H(X | X̂ = 0) − p(X̂ = 1) H(X | X̂ = 1)
         = 1 − p(X̂ = 0) H_B( p(X = 1 | X̂ = 0) ) − p(X̂ = 1) H_B( p(X = 1 | X̂ = 1) ).
Note that p(X = 0 | X̂ = 1) ∝ p(X̂ = 1 | X = 0) = 0 by Bayes' theorem and therefore H_B( p(X = 1 | X̂ = 1) ) = 0. Also by Bayes' theorem we get

p(X = 1 | X̂ = 0) = p(X̂ = 0 | X = 1) / ( 2 p(X̂ = 0) ),

and with p(X̂ = 0) = (1/2)( p(X̂ = 0 | X = 1) + p(X̂ = 0 | X = 0) ) = (1/2)( p(X̂ = 0 | X = 1) + 1 ) we can express the mutual information as

I(X; X̂) = 1 − ((p + 1)/2) H_B( p / (p + 1) )
with p := p(X̂ = 0 | X = 1) (as above).
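The closed-form expression can be checked numerically. The following sketch (illustrative, not part of the original solution) evaluates I(X; X̂) directly for the admissible test channels parametrized by p and compares it with 1 − ((1 + p)/2) H_B(p/(1 + p)).

import numpy as np

def h_b(q):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    q = np.clip(q, 1e-15, 1 - 1e-15)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def mutual_information(p):
    """I(X; Xhat) in bits for X ~ Bernoulli(1/2), p(Xhat=1|X=0) = 0, p(Xhat=0|X=1) = p."""
    px = np.array([0.5, 0.5])
    channel = np.array([[1.0, 0.0],        # x = 0 -> xhat = 0 with probability 1
                        [p, 1.0 - p]])     # x = 1 -> xhat = 0 with probability p
    p_xh = px @ channel                    # marginal of Xhat
    joint = px[:, None] * channel
    mask = joint > 0
    denom = (px[:, None] * p_xh[None, :])[mask]
    return float(np.sum(joint[mask] * np.log2(joint[mask] / denom)))

for p in [0.0, 0.2, 0.5, 1.0]:
    print(p, round(mutual_information(p), 6),
          round(float(1 - 0.5 * (1 + p) * h_b(p / (1 + p))), 6))
# Minimizing over 0 <= p <= min(2D, 1): since I(X; Xhat) decreases in p,
# the minimum (and hence R(D)) is attained at p = min(2D, 1).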
Task 6.4
Let d(x, x̂) be a distortion function. We have a source X ∼ p(x). Let R(D) be the associated rate distortion function.
(a) Find R̃(D) in terms of R(D), where R̃(D) is the rate distortion function associated with the distortion d̃(x, x̂) = d(x, x̂) + a for some constant a > 0. (They are not equal!)
(b) Now suppose that d(x, x̂) ≥ 0 for all x, x̂ and define a new distortion function d∗ (x, x̂) = b d(x, x̂), where b ≥ 0.
Find the associated rate distortion function R∗ (D) in terms of R(D).
Solution:

(a) Since E[d̃(X, X̂)] = E[d(X, X̂)] + a, the constraint E[d̃(X, X̂)] ≤ D is equivalent to E[d(X, X̂)] ≤ D − a. Hence

R̃(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D−a} I(X; X̂) = R(D − a).

(b) If b > 0, the constraint E[d*(X, X̂)] = b E[d(X, X̂)] ≤ D is equivalent to E[d(X, X̂)] ≤ D/b, so R*(D) = R(D/b). If b = 0, the constraint is satisfied by every conditional distribution p(x̂ | x), and hence R*(D) = 0.
Task 6.5
Let X ∼ N(0, σ²) and let the distortion measure be squared error. Here, we do not allow block descriptions, i.e., we use a (2^{nR}, n)-rate distortion code with n = 1.

Show that the optimal reproduction points for 1-bit quantization (i.e., R = 1) are ±√(2/π) σ and that the expected distortion for 1-bit quantization is ((π − 2)/π) σ². Compare this with the distortion-rate bound D(R) = σ² 2^{−2R} for R = 1.
Solution:
By symmetry of the Gaussian density, the two reproduction points can be written as x̂ = −a for x ≤ 0 and x̂ = a for x > 0. The optimal points minimize the squared-error distortion D = E[(X − X̂)²], for which we find
D = E[(X − X̂)²] = ∫_{−∞}^{0} p(x) (x + a)² dx + ∫_{0}^{∞} p(x) (x − a)² dx
 (1) = 2 ∫_{0}^{∞} p(x) (x − a)² dx
     = 2 [ a² ∫_{0}^{∞} p(x) dx + ∫_{0}^{∞} x² p(x) dx − 2a ∫_{0}^{∞} x p(x) dx ]
     = 2 [ (1/2) a² + (1/2) σ² − 2a (1/√(2πσ²)) ∫_{0}^{∞} x e^{−x²/(2σ²)} dx ]
 (2) = a² + σ² − (4a σ²/√(2πσ²)) ∫_{0}^{∞} e^{−y} dy
     = a² + σ² − 4aσ²/√(2πσ²),
where in (1) we used the symmetry property and in (2) applied integration by substitution with y = x²/(2σ²) ⇒ dy = 2x/(2σ²) dx ⇒ dx = σ² dy/x. We find the optimal reproduction point by differentiating D with respect to a and setting the derivative to zero:

∂D/∂a = 2a − 4σ²/√(2πσ²) = 0  ⇒  a* = √(2/π) σ.
For the second partial derivative we obtain ∂²D/∂a² = 2 > 0, from which we see that a* indeed minimizes the distortion. Hence, the optimal reproduction points are ±√(2/π) σ. Evaluating the distortion at a* = √(2/π) σ gives

D|_{a=a*} = (a*)² + σ² − 4a*σ²/√(2πσ²) = ((π − 2)/π) σ².
The optimal distortion for R = 1 (the distortion-rate function evaluated at R = 1) yields a lower bound on the distortion of any single-bit quantizer for a Gaussian random variable X:

D_opt = D(R = 1) = σ² 2^{−2} = σ²/4 < ((π − 2)/π) σ².

This shows that the scalar single-bit quantizer does not achieve the optimal distortion limit for single-bit quantization. Analogously
to channel coding, the minimum distortion value D(R) for a fixed rate R is usually achieved only asymptotically for
large block length n.
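A short numerical sanity check of these numbers (illustrative only; σ = 1 and the sample size are arbitrary choices): it estimates the distortion of the quantizer with reproduction points ±√(2/π)σ by Monte Carlo and prints the analytic value and the distortion-rate bound for comparison.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                   # assumed for illustration
x = rng.normal(0.0, sigma, 1_000_000)

a_star = np.sqrt(2.0 / np.pi) * sigma
x_hat = np.where(x > 0, a_star, -a_star)      # 1-bit quantizer with points +-a*
print("Monte Carlo distortion :", np.mean((x - x_hat) ** 2))
print("analytic (pi-2)/pi s^2 :", (np.pi - 2) / np.pi * sigma**2)   # ~0.3634
print("bound D(R=1) = s^2/4   :", sigma**2 / 4)                     # 0.25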
Task 6.6
Consider the ordinary Gaussian channel with two correlated observations of X, i.e., Y = [Y1 , Y2 ]T , where
Y1 = X + Z1, Y2 = X + Z2,

where Z1 and Z2 are jointly Gaussian noise variables with correlation coefficient ρ. Find the channel capacity for ρ = 1 and ρ = −1.
Solution:
For ρ = 1, we have Z1 = Z2 , so that Y1 = Y2 . Clearly, the capacity will then be equal to the capacity for only one
measurement Y1 . For ρ = −1, we have Z1 = −Z2 , so that Y1 + Y2 = 2X. Then X can be determined exactly, so that the
capacity is infinite.
Task 6.7
Yi = Xi + Zi , i = 1, 2, 3,
Solution:
The solution of this problem can be found with the water-filling algorithm. The maximizing distribution for x is
a multivariate zero-mean Gaussian with independent components, where the variances can most easily be found
graphically:
[Figure: water-filling diagrams for sub-tasks (a)-(c); noise levels in units of σ² versus channel number 1, 2, 3.]
(a) P1 = σ 2 /2, P2 = 0, P3 = 0.
(b) P1 = 3σ 2 /2, P2 = σ 2 /2, P3 = 0.
(c) P1 = 5σ 2 , P2 = 4σ 2 , P3 = σ 2 .
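Since the noise variances and power budgets of sub-tasks (a)-(c) are not reproduced above, the following sketch uses a generic water-filling routine; the noise levels N = [σ², 2σ², 5σ²] and the budgets σ²/2, 2σ², 10σ² are assumptions chosen here only so that the three stated allocations come out, not data taken from the original task.

import numpy as np

def water_filling(noise, total_power, tol=1e-12):
    """Power allocation P_i = max(nu - N_i, 0) with sum(P_i) = total_power."""
    noise = np.asarray(noise, dtype=float)
    lo, hi = noise.min(), noise.max() + total_power
    while hi - lo > tol:                       # bisection on the water level nu
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - noise, 0.0).sum() > total_power:
            hi = nu
        else:
            lo = nu
    return np.maximum(lo - noise, 0.0)

sigma2 = 1.0
N = np.array([1.0, 2.0, 5.0]) * sigma2                       # assumed noise variances
for P_total in (0.5 * sigma2, 2.0 * sigma2, 10.0 * sigma2):  # assumed power budgets
    P = water_filling(N, P_total)
    C = 0.5 * np.sum(np.log2(1.0 + P / N))
    print(P_total, P.round(3), "capacity:", round(C, 3), "bit")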
Task 6.8
Yi = Xi + Zi , i = 1, 2, 3,
Solution:
The case of correlated noise in a Gaussian channel can be handled by applying the water-filling algorithm in coordinates
in which the covariance matrix is diagonal.
In this case, the covariance matrix Σ has the eigenvalues λ1 = 1, λ2 = 1 + ρ, and λ3 = 1 − ρ with corresponding normalized eigenvectors [1, 0, 0]^T, [0, 1/√2, 1/√2]^T, and [0, 1/√2, −1/√2]^T. Defining

S = [ 1    0      0
      0  1/√2   1/√2
      0  1/√2  −1/√2 ],      D = [ 1    0     0
                                   0  1+ρ    0
                                   0    0   1−ρ ],
we then have Σ = S D S^T. We then apply the water-filling algorithm to the diagonal matrix D. Let P̃i, i = 1, 2, 3, be the allocated power in the new coordinates. By applying the water-filling algorithm, we find that

P̃1 = 2ρ,  P̃2 = ρ,  P̃3 = 3ρ,

where we assumed that ρ ≥ 0, as otherwise the input power constraint would be meaningless. The allocated power in the original coordinates, i.e., the correlation matrix of the input which achieves the capacity, is then found to be
S^T diag(2ρ, ρ, 3ρ) S = [ 2ρ   0    0
                           0   2ρ   −ρ
                           0   −ρ   2ρ ].
Task 6.9

(a) In Task 4.6 (f) we have shown that MAP decoding minimizes the error probability Pe. State the probabilities of error that are minimized by the sum-product and the max-product message passing algorithms.
(b) Consider both block-wise ML-decoding and block-wise MAP-decoding for any linear binary (n, k)-block code with
P (c) chosen as in the lecture. Over which set do you optimize such that the decoders are equivalent? Explain your
answer.
(c) Discuss why bit-wise MAP decoding does not always yield valid codewords.
(d) Are bit-wise ML-decoding and bit-wise MAP-decoding via the sum-product algorithm equivalent for any linear binary
(n, k)-block code with P (c) chosen as in the lecture? Explain your answer.
Hint: Consider the case that the codebook contains at least one codeword with ci = 1 for all i ∈ {1, . . . , n} and afterwards consider the opposite case. Is P(ci) always uniform for all i?
(e) Discuss how ordinary minimum distance decoding and block-wise MAP decoding via the max-product algorithm
compare to each other.
(f) So far we used the sum-product and the max-product algorithm only for binary symmetric channels. What do we
need to change (in comparison to the BSC) if we want to decode a message sent over a binary erasure channel with
erasure probability ε?
Solution:
(a) The sum-product algorithm is a bit-wise MAP decoder (also called a bit-wise MAP estimator) ĉi = argmaxci p(ci | r)
for all i ∈ {1, . . . , n} and the max-product algorithm is a block-wise MAP decoder ĉ = argmaxc p(c | r). Therefore
the respective decoders minimize Pe = P (gi (r) ̸= ci ) w.r.t. the decoding function gi for all i ∈ {1, . . . , n} and
Pe = P (g(r) ̸= c) w.r.t. the decoding function g.
(b) The block-wise MAP-decoder is equivalent to an ML-decoder if the considered search space is the codebook C and
not the space of length-n sequences. The prior p(c) = 2−k M (c) is uniform on C, but not on {0, 1}n .
Case 1:

ĉ_ML,1 = argmax_{c ∈ C} ∏_{j=1}^{n} p(rj | cj) = argmax_{c ∈ C} M(c) ∏_{j=1}^{n} p(rj | cj) = ĉ_MAP.

Case 2:

ĉ_ML,2 = argmax_{c ∈ {0,1}^n} ∏_{j=1}^{n} p(rj | cj) ≠ argmax_{c ∈ {0,1}^n} M(c) ∏_{j=1}^{n} p(rj | cj) = ĉ_MAP = ĉ_ML,1.
(c) As we have seen in the solution of part (b), the block-wise MAP decoder effectively uses the codebook as its search space. In contrast, the search space of bit-wise MAP decoding is not constrained, such that the decoding function of the entire codeword g(r) = [g1(r), . . . , gn(r)] maps to the whole space {0, 1}^n instead of just mapping to the codebook. The decoding function g processes the bits independently from each other. One can also speak of the bit-wise MAP decoder using an averaged likelihood function:

ĉi = argmax_{ci} Σ_{¬ci} p(c | r) = argmax_{ci} Σ_{¬ci} p(r | c) p(c) = argmax_{ci} E_c[ p(r | c) | ci ] p(ci),

where E_c is the expectation over all codewords (and not over the received words r). This averaging reduces the information about which length-n sequences are codewords to the information of how likely r is produced (on average) by a codeword that has value ci at the i-th position. The remaining information is not enough to enforce that the decoding yields valid codewords.
(d) Let us start with the definitions of the two decoding functions:

ĉ_i,ML = argmax_{ci} p(r | ci),    ĉ_i,MAP = argmax_{ci} p(ci | r) = argmax_{ci} p(r | ci) p(ci),

where p(ci) = Σ_{¬ci} p(c) = 2^{−k} Σ_{¬ci} M(c). We first consider the case mentioned in the hint. If there exists at least
one codeword with ci = 1, then the i-th column gi of the generator matrix G = [g1 , . . . , gn ] contains at least one
non-zero entry. Recall that ci = mgi for some message m ∈ {0, 1}k . Assume that the number of non-zero entries in
this particular column is w ≤ k. We want to calculate the number of codewords with ci = 0 to evaluate P (ci = 0).
To do this, we reorder the rows of G, such that the first w coordinates of gi are equal to one and the remaining
coordinates are all zero. This changes the mapping from message to codeword, but it does not change the set
of codewords C and so p(ci ) is also unchanged. To get a codeword with ci = 0 the number of ones in the first w
coordinates of m must be even. The remaining (k − w) elements can be chosen arbitrarily, since their choice has no
effect on ci . We evaluate the number of codewords with ci = 0 to
Σ_{¬ci} M(c)|_{ci=0} = Σ_{r even} (w choose r) 2^{k−w} = 2^{k−w} 2^{w−1} = 2^{k−1},
where we used the identity Σ_{r even} (w choose r) = 2^{w−1}. This implies p(ci) = 1/2, i.e., the marginal distribution of ci on {0, 1}
is uniform. Now we turn to the case, where there exists one i, such that ci = 0 for all codewords. As an example
consider the (5, 2)-code with generator matrix

G = [ 0 0 1 1 0
      0 1 1 0 1 ].
It is a valid linear code since all rows are linearly independent, but c1 = 0 for all codewords. In this example, we
obviously have P(c1 = 0) = 1, which violates the assumption that c1 is uniformly distributed on {0, 1}. So only under the assumption that for all i ∈ {1, . . . , n} the codebook contains at least one codeword with ci = 1 can we in general conclude that bit-wise MAP decoding via the sum-product algorithm equals the bit-wise ML estimator given above. However, under certain assumptions, e.g., for a binary symmetric channel with bit-flip probability ε < 1/2, the two decoding functions can still be equivalent.
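The statements about the bit marginals can be checked by enumerating all codewords; the sketch below (illustrative, helper names are my own) does this for the (5, 2) example above: position 1 has p(c1 = 0) = 1, while every position whose column of G contains a nonzero entry has p(ci = 0) = 1/2.

import itertools
import numpy as np

def bit_marginals(G):
    """Fraction of codewords with c_i = 0 for each position i (linear code over GF(2))."""
    k = G.shape[0]
    codewords = np.array([(np.array(msg) @ G) % 2
                          for msg in itertools.product([0, 1], repeat=k)])
    return 1.0 - codewords.mean(axis=0)

G = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 0, 1]])    # the (5, 2) example from part (d)
print(bit_marginals(G))            # -> [1.  0.5 0.5 0.5 0.5]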
(e) As we have shown in part (b) the block-wise MAP decoder is equivalent to the block-wise ML-decoder if P (c) is
chosen uniform on the space of codewords. Recall that minimum distance decoding is equivalent to block-wise ML
decoding for a binary symmetric channel with bit-flip probability ε < 1/2. So if these two conditions are satisfied, then
the max-product algorithm is equivalent to ordinary minimum distance decoding. Note that in this case minimum
distance decoding minimizes the probability of a block decoding error. In all other cases the two decoding methods
are potentially different and minimum distance decoding may either not minimize the block decoding error or it
may even be inappropriate. For example, minimum distance decoding is not well-defined for the binary erasure
channel (Why?).
(f) The messages at the terminal factor nodes are initialized as

    (p(rj | cj = 0), p(rj | cj = 1)) = (1 − ε, 0)   if rj = 0,
                                       (ε, ε)       if rj = e,
                                       (0, 1 − ε)   if rj = 1.
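A minimal sketch of this initialization (the representation of the received symbols, e.g. 'e' for an erasure, and the function names are my own), with the BSC case shown for comparison:

def init_messages_bec(r, eps):
    """Initial messages (p(r_j | c_j = 0), p(r_j | c_j = 1)) for a binary erasure channel."""
    table = {0: (1.0 - eps, 0.0), 'e': (eps, eps), 1: (0.0, 1.0 - eps)}
    return [table[rj] for rj in r]

def init_messages_bsc(r, delta):
    """Initial messages for a binary symmetric channel with flip probability delta."""
    table = {0: (1.0 - delta, delta), 1: (delta, 1.0 - delta)}
    return [table[rj] for rj in r]

print(init_messages_bec([0, 'e', 1], eps=0.1))
# -> [(0.9, 0.0), (0.1, 0.1), (0.0, 0.9)]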
Task 6.10

(a) For a source X with a finite alphabet X and Hamming distortion, describe a (not necessarily constructive) sequence of rate distortion codes that is asymptotically lossless and achieves the rate R = H(X).
(b) Suppose that X = {1, 2, 3, 4}, X̂ = {1, 2, 3, 4}, p(xi) = 1/4 for i = 1, 2, 3, 4, and X1, X2, . . . are i.i.d. ∼ p(x). The
distortion matrix d(x, x̂) is given by
x \ x̂ | 1 2 3 4
1     | 0 0 1 1
2     | 0 0 1 1
3     | 1 1 0 0
4     | 1 1 0 0
(i) Find R(0), the rate necessary to describe the process with zero distortion.
(ii) Find the rate distortion function R(D).
Hint: There are some irrelevant distinctions in alphabets X and X̂ , which allow the problem to be collapsed.
(iii) Suppose we have a non-uniform distribution p(xi ) = pi for i = 1, 2, 3, 4. What is R(D) ?
Solution:
(a) In the lecture we have seen a quote by Shannon which motivates discarding all non-typical sequences; assigning fixed-length descriptions to all typical sequences can then be used to construct an asymptotically error-free source code with asymptotic rate H(X).
Let ϵ = 1/n, set X̂ = X and consider the typical sequences T_ϵ^(n) ⊆ X^n. Set M = |T_ϵ^(n)| and let g : {1, . . . , M} → T_ϵ^(n) be an arbitrary one-to-one mapping from an index set onto the typical sequences. Since g is invertible, we define the encoding function f such that

f(x) = g^{−1}(x) if x ∈ T_ϵ^(n), and f(x) = 1 otherwise.
We simply map all non-typical sequences to the same index as it is asymptotically irrelevant where we map them. It
is left to show that this sequence of rate distortion codes is actually asymptotically lossless and achieves the rate
R = H(X):
0 ≤ E[d(X^n, X̂^n)] = (1/n) Σ_{i=1}^{n} P(Xi ≠ X̂i) ≤ (1/n) Σ_{i=1}^{n} P(X^n ≠ X̂^n) = P(X^n ≠ X̂^n) = P(X^n ∉ T_ϵ^(n)) < ϵ = 1/n,

where we used that P(X^n ∈ T_ϵ^(n)) > 1 − ϵ for n sufficiently large. So we have lim_{n→∞} E[d(X^n, X̂^n)] = 0.
By construction, the rate of the n-th rate distortion code is R^(n) = (log M)/n = (log |T_ϵ^(n)|)/n. The typical set is bounded as (1 − ϵ) 2^{n(H(X)−ϵ)} ≤ |T_ϵ^(n)| ≤ 2^{n(H(X)+ϵ)}, such that

log(1 − ϵ)/n + H(X) − ϵ ≤ R^(n) ≤ H(X) + ϵ,

and hence lim_{n→∞} R^(n) = H(X).
Comment: We could have used a more constructive ansatz by designing the encoding function as a modified version of a block-Huffman encoder, block-Fano encoder or arithmetic (block-Elias) encoder. Each of these choices asymptotically leads to very long codewords for the non-typical sequences (much longer than length nH(X)) and codewords of length approximately nH(X) for all typical sequences. An encoder can thus be constructed by mapping all sequences whose codeword length is at least n − n(1 − H(X))/2 to the index 1 and all other sequences to the binary integer given by their codeword. Asymptotically we can thus still exploit the properties of the typical set, e.g., we have M ≤ |T_ϵ^(n)| + 2 for large n.
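To illustrate the construction, the following sketch (not part of the original solution) counts the weakly ϵ-typical sequences of an i.i.d. source over a small alphabet by summing over types and checks that log M / n approaches H(X) up to ±ϵ and finite-n effects. The source distribution, ϵ and the block lengths are illustrative choices.

import itertools
import math

p = [0.6, 0.3, 0.1]                       # illustrative source distribution
H = -sum(q * math.log2(q) for q in p)     # source entropy in bits

def typical_set_size(n, eps):
    """Count the weakly eps-typical length-n sequences by summing over types."""
    size = 0
    # counts of the first |X|-1 symbols; the last count is then determined
    for counts in itertools.product(range(n + 1), repeat=len(p) - 1):
        last = n - sum(counts)
        if last < 0:
            continue
        c = list(counts) + [last]
        logp = sum(ci * math.log2(qi) for ci, qi in zip(c, p))  # log2 p(x^n) of this type
        if abs(-logp / n - H) <= eps:
            # multinomial coefficient = number of sequences with these counts
            size += math.factorial(n) // math.prod(math.factorial(ci) for ci in c)
    return size

for n in (10, 50, 200):
    M = typical_set_size(n, eps=0.1)
    print(n, round(math.log2(M) / n, 4), "vs H(X) =", round(H, 4))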
(b) Considering the distortion matrix, we can exploit the fact that the distortion measure does not distinguish between the symbols within the pair (1, 2), nor between those within the pair (3, 4). Let Y be a binary random variable such that

Y = 0 if X ∈ {1, 2},  and  Y = 1 if X ∈ {3, 4}.
(i) We know that Y is a binary source and its probability distribution is given by

p(Y = 0) = p(X = 1) + p(X = 2) = 1/2,
p(Y = 1) = p(X = 3) + p(X = 4) = 1/2.

Describing X with zero distortion is equivalent to describing Y losslessly, so R(0) = H_B(1/2) = 1 bit/symbol.
(ii) After the reduction, we have a binary source Y ∼ Bernoulli(1/2) with Hamming distortion, whose rate distortion function is R(D) = H_B(1/2) − H_B(D) = 1 − H_B(D) for 0 ≤ D ≤ 1/2, and R(D) = 0 for D > 1/2.
(iii) Finally, we generalize the rate distortion function to a non-uniform source distribution p(xi) = pi for i = 1, 2, 3, 4. Since we have reduced the problem to a binary one, the distribution of the binary random variable Y must be expressed in terms of the distribution of X, as in sub-task (i): p(Y = 0) = p1 + p2 and p(Y = 1) = p3 + p4. The same reduction as before then yields R(D) = H_B(p1 + p2) − H_B(D) for 0 ≤ D ≤ min{p1 + p2, p3 + p4}, and R(D) = 0 otherwise.
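As a numerical cross-check of the collapsed-binary result, the sketch below runs the standard Blahut-Arimoto algorithm (not part of the original solution) on the original 4-symbol problem with the uniform source and compares each computed (D, R) point with 1 − H_B(D); the Lagrange slopes and the iteration count are arbitrary choices.

import numpy as np

def blahut_arimoto(p_x, d, s, iters=500):
    """One point of the rate-distortion curve for Lagrange slope s <= 0; returns (D, R) in bits."""
    q = np.full(d.shape[1], 1.0 / d.shape[1])     # reproduction marginal, uniform start
    for _ in range(iters):
        A = q[None, :] * np.exp(s * d)            # unnormalized test channel
        Q = A / A.sum(axis=1, keepdims=True)      # p(xhat | x)
        q = p_x @ Q
    D = np.sum(p_x[:, None] * Q * d)
    R = np.sum(p_x[:, None] * Q * np.log2(np.maximum(Q, 1e-300) / np.maximum(q[None, :], 1e-300)))
    return D, R

def h_b(q):
    q = np.clip(q, 1e-15, 1 - 1e-15)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

d = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)         # distortion matrix from part (b)
p_x = np.full(4, 0.25)

for s in (-8.0, -4.0, -2.0, -1.0):
    D, R = blahut_arimoto(p_x, d, s)
    print(round(D, 4), round(R, 4), "vs 1 - H_B(D) =", round(float(1 - h_b(D)), 4))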