5.3 Kraft Inequality and Optimal Codeword Length
$$-\frac{1}{n}\log p(x) \to H_\infty(X), \quad n \to \infty,$$

in probability. The set of typical sequences should then be defined as the sequences such that

$$-\frac{1}{n}\log p(x) \approx H_\infty(X)$$
This leads to the source coding theorem for ergodic sequences.
Theorem 22 Let $X_n$ be a stationary ergodic process. Then there exists a code which maps sequences $x$ of length $n$ into binary sequences, such that the mapping is invertible and

$$\frac{1}{n} E\big[\ell(x)\big] \leq H_\infty(X) + \varepsilon$$
For the second code the average codeword length is

$$L(C_2) = \sum_x p(x)\ell(x) = 1.25 \text{ bits}$$
Considering the classifications above we see that the codewords are unique, meaning it is a non-singular code. However, considering the codeword sequence

$$y = 00110\ldots$$

we encounter a problem in decoding. Since no codeword contains any double zeros, it is easy to see that the first zero corresponds to the symbol $x_1$. But then the next pair of bits, 01, can either mean the combination $x_1, x_2$ or the single symbol $x_3$. There is no way to make a clear decision between the alternatives, which means the code is not uniquely decodable. The problem occurs because the codewords have unequal lengths and we cannot use any separator between them. Unequal lengths are, however, a requirement for compression.
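This kind of ambiguity can be checked mechanically by enumerating all segmentations of a code sequence into codewords. A minimal sketch in Python; note that the codeword table here is a hypothetical one, chosen only to be consistent with the description above (the actual code $C_2$ is given in Table 5.1):

```python
# Brute-force check of unique decodability: enumerate every way a bit
# string can be segmented into codewords. The table is a hypothetical
# stand-in matching the text (no double zeros, a first 0 decodes as x1,
# and 01 is ambiguous); the real C2 is defined in Table 5.1.
code = {"x1": "0", "x2": "1", "x3": "01", "x4": "10"}

def parses(y, code):
    """Return every segmentation of the bit string y into codewords."""
    if y == "":
        return [[]]  # one way to parse the empty string
    result = []
    for sym, word in code.items():
        if y.startswith(word):
            for tail in parses(y[len(word):], code):
                result.append([sym] + tail)
    return result

for p in parses("00110", code):
    print(p)
# More than one segmentation is printed, e.g. ['x1', 'x3', 'x4'] and
# ['x1', 'x1', 'x2', 'x2', 'x1'], so the code is not uniquely decodable.
```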
The third code, $C_3$, is an example of a singular code, i.e. it is not a non-singular code: the first two codewords are the same. The average length is clearly improved, $L(C_3) = 1.125$, but the price is that the code is evidently not decodable.
The fourth code, $C_4$, has unique codewords for all symbols, hence it is non-singular. Since all codewords start with a zero, and this is the only occurrence of zeros, any code sequence can be uniquely decoded, implying the code is uniquely decodable. The only flaw of the code is that we cannot see the end of a codeword until we have seen the first symbol of the next codeword. This is because the code is not prefix-free.¹ For example, the codeword 0 is a prefix of all the other codewords, and 01 is a prefix of 011 and 0111. The average codeword length is $L(C_4) = 1.875$.
¹Often in the literature this type of code is called a prefix code. However, this name is a bit misleading since there should be no prefixes in the code, and the name prefix-free gives a better understanding of the concept.
Finally, the last code, $C_5$, is both non-singular and uniquely decodable. We can also, by inspection, see that the code is prefix-free. The fact that it is prefix-free gives an easy decoding rule: as soon as a codeword has been found in the sequence, it can be decoded to the corresponding source symbol. For example, if the code sequence is

$$y = 01011010111\ldots$$

we start from the beginning and look for codewords. The first we find is 0, which corresponds to $x_1$. Secondly we get 10, giving $x_2$, and after that 110, representing $x_3$. Continuing like this we can decode the sequence as follows:
$$y = \underbrace{0}_{x_1}\;\underbrace{10}_{x_2}\;\underbrace{110}_{x_3}\;\underbrace{10}_{x_2}\;\underbrace{111}_{x_4}\;\ldots$$
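The decoding rule is simple enough to write down directly. A small sketch (Python), using the codeword table of $C_5$:

```python
# Sequential decoding of the prefix-free code C5 from Table 5.1: read
# bits one at a time and emit a symbol as soon as the collected bits
# form a codeword. No lookahead is needed, because no codeword is a
# prefix of another.
code = {"0": "x1", "10": "x2", "110": "x3", "111": "x4"}

def decode(y):
    symbols, word = [], ""
    for bit in y:
        word += bit
        if word in code:        # a complete codeword has been read
            symbols.append(code[word])
            word = ""           # start collecting the next codeword
    return symbols

print(decode("01011010111"))    # ['x1', 'x2', 'x3', 'x2', 'x4']
```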
From the above reasoning we can conclude that prefix-free codes are desirable since they are very easy to decode. Codes in the wider class of uniquely decodable codes can also be decoded uniquely, but decoding might require that we consider the complete code sequence before we can even start. Clearly, the class of prefix-free codes is a subclass of the uniquely decodable codes. One basic criterion for a code to be uniquely decodable is that the set of codewords is non-overlapping, that is, the class of uniquely decodable codes is a subclass of the non-singular codes. Furthermore, the class of non-singular codes is a subclass of all codes. In Figure 5.4 a graphical representation of the relation between the classes is shown.
Figure 5.4: All prefix-free codes are uniquely decodable, and all uniquely decodable codes are non-singular. (The nested classes, from the outside in: all codes, non-singular codes, uniquely decodable codes, prefix-free codes.)
In the continuation of this section we will consider prefix-free codes. For this analysis we first need to consider a tree structure. A general tree² has a root node which may have one or more child nodes. Each child node may also have one or more child nodes, and so on. A node that does not have any child nodes is called a leaf. In this text we will consider $D$-ary trees, which means that each node has either zero or $D$ child nodes. In Figure 5.5 two examples of $D$-ary trees are shown, the left with $D = 2$ and the right with $D = 3$. Notice that the trees grow to the right from the root. Normally, in computer science, trees grow downwards, but in many topics related to information theory and communication theory they are drawn from left to right.

The depth of a node is the number of branches on the path from the root node to that node; the root node itself has depth 0. In the left tree of Figure 5.5 the node labeled A
²More generally, in graph theory a tree is a graph in which any two nodes are connected by exactly one path.
Figure 5.5: Examples of a binary (D = 2) and a 3-ary tree.
has depth 2 and the node labeled B has depth 4. A tree is said to be full if all leaves are located at the same depth. In Figure 5.6 a full $D$-ary tree with $D = 2$ and depth 3 is shown. In a full $D$-ary tree of depth $d$ there are $D^d$ leaves; in the tree of Figure 5.6 there are $2^3 = 8$ leaves.
Figure 5.6: Example of a full binary tree of depth 3.
A prefix-free code with a code alphabet of size $D$ can be represented in a $D$-ary tree. The first letter of the codeword is represented by a branch stemming from the root. The second letter is represented by a branch stemming from a node at depth 1, and so on. The end of a codeword is reached when there are no children, i.e. when a leaf is reached. In this way a structure is built where no sequence in the tree can be a prefix of another codeword. In Figure 5.7 the prefix-free code $C_5$ from Table 5.1 is shown in a binary tree representation. In this representation the probabilities of the source symbols are also added. The label in each tree node is the sum of the probabilities of the source symbols stemming from that node, i.e. the probability that a codeword goes through that node. Among the codes in Table 5.1, the reference code $C_1$ is also prefix-free. In Figure 5.8 a tree representation of this code is shown. Since all codewords are of equal length we get a full binary tree.
x      p(x)   C_5
x_1    1/2    0
x_2    1/4    10
x_3    1/8    110
x_4    1/8    111
Figure 5.7: Representation of the prefix-free code $C_5$ in a binary tree. The label in each node is the probability that a codeword passes through it: the root has probability 1, and the inner nodes along the 1-branches have probabilities 1/2 and 1/4.
x      p(x)   C_1
x_1    1/2    00
x_2    1/4    01
x_3    1/8    10
x_4    1/8    11
Figure 5.8: Representation of the code $C_1$ in a (full) binary tree. The root has probability 1 and the two nodes at depth 1 have probabilities 3/4 and 1/4.
There are many advantages with the tree representation. One is that we get a graphical interpretation of the code, which on many occasions is a great help for our intuitive understanding. Another advantage is that it gives an alternative way to calculate the average codeword length, formulated in the next lemma.
Lemma 23 (Path length lemma) In a tree representation of a prefix-free code, the average codeword length $E[\ell(x)]$ equals the sum of the probabilities of the inner nodes, including the root.

We will not give a proof of the lemma; instead we give an example that follows the same outline as a proof would.
Example 31 Consider again the prefix-free code $C_5$. The average codeword length can be derived as

$$L = \sum_x p(x)\ell(x) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = 1.75$$

The derivation can be rewritten as

$$L = \underbrace{\frac{1}{2}}_{x_1} + \underbrace{\frac{1}{4}+\frac{1}{4}}_{x_2} + \underbrace{\frac{1}{8}+\frac{1}{8}+\frac{1}{8}}_{x_3} + \underbrace{\frac{1}{8}+\frac{1}{8}+\frac{1}{8}}_{x_4} = \underbrace{\frac{1}{2}+\frac{1}{4}+\frac{1}{8}+\frac{1}{8}}_{=1} + \underbrace{\frac{1}{4}+\frac{1}{8}+\frac{1}{8}}_{=1/2} + \underbrace{\frac{1}{8}+\frac{1}{8}}_{=1/4} = 1.75$$
By rearranging the terms of the sum we see that each leaf probability appears once for each node on the path from the root to that leaf. Hence, when summing over all inner-node probabilities, the contribution from a leaf is its probability times its depth, i.e. $p(x)\ell(x)$.
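The argument is easy to verify numerically. A short sketch (Python) that computes the average length of $C_5$ both directly and via the inner-node probabilities:

```python
from fractions import Fraction

# Numerical check of the path length lemma for C5. The codewords double
# as tree addresses: every proper prefix of a codeword (including the
# empty string, i.e. the root) is an inner node of the code tree.
p = {"0": Fraction(1, 2), "10": Fraction(1, 4),
     "110": Fraction(1, 8), "111": Fraction(1, 8)}

# Average length directly: sum of p(x) times the codeword length.
L_direct = sum(prob * len(word) for word, prob in p.items())

# Average length via the lemma: sum of inner-node probabilities, where
# an inner node's probability is the total probability of the codewords
# passing through it.
inner = {word[:i] for word in p for i in range(len(word))}
L_lemma = sum(prob for node in inner
              for word, prob in p.items() if word.startswith(node))

print(L_direct, L_lemma)   # 7/4 7/4
```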
We are now ready to state and prove a famous result, first published in a Master of Science thesis by Leon Kraft in 1949 [11]. It gives a requirement on the codeword lengths that can be used to form a prefix-free code. The result was generalized by Brockway McMillan in 1956 to cover all uniquely decodable codes [15]. The result is often called the Kraft inequality or the Kraft-McMillan inequality.
Theorem 24 (Kraft inequality) There exists a prefix-free $D$-ary code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_k$ if and only if

$$\sum_{i=1}^{k} D^{-\ell_i} \leq 1$$
To show this we consider a $D$-ary prefix-free code where the longest codeword length is $\ell_{\max} = \max_x \ell(x)$. This code can be represented in a $D$-ary tree. A full $D$-ary tree of depth $\ell_{\max}$ has $D^{\ell_{\max}}$ leaves. Use this tree to represent the codeword for the symbol $x_i$. This codeword is of length $\ell_i$. Since the code should be prefix-free, the sub-tree spanned with $x_i$ as root is not allowed to be used by any other codeword, hence it should be removed from the tree, see Figure 5.9. So, when inserting the codeword for symbol $x_i$ we delete $D^{\ell_{\max}-\ell_i}$ leaves in the tree. Since we cannot remove more leaves than the tree has from the beginning, we get that for a prefix-free code

$$\sum_i D^{\ell_{\max}-\ell_i} \leq D^{\ell_{\max}}$$

By canceling $D^{\ell_{\max}}$ on both sides, this proves that for a prefix-free code we have

$$\sum_i D^{-\ell_i} \leq 1$$
Figure 5.9: Tree construction for a general prefix-free code. The codeword for $x_i$ sits at depth $\ell_i$; removing the sub-tree below it deletes $D^{\ell_{\max}-\ell_i}$ leaves at depth $\ell_{\max}$.
To show that we can construct a prefix-free code if the inequality is fulfilled, we start by assuming that $\sum_i D^{-\ell_i} \leq 1$ and that the codeword lengths are ordered, $\ell_1 \leq \ell_2 \leq \cdots \leq \ell_k$, where $\ell_k = \ell_{\max}$. Then we use the same construction as above: start with the shortest codeword and remove the corresponding sub-tree. After $i < k$ steps the number of leaves left is

$$D^{\ell_{\max}} - \sum_{n=1}^{i} D^{\ell_{\max}-\ell_n} = D^{\ell_{\max}}\Big(1 - \sum_{n=1}^{i} D^{-\ell_n}\Big) > 0$$

where the last inequality comes from the assumption, i.e. as long as $i < k$ we have $\sum_{n=1}^{i} D^{-\ell_n} < 1$. In other words, as long as $i < k$ there are leaves left at depth $\ell_{\max}$. The last codeword only needs one leaf since it is of maximum length, which shows that it is always possible to construct a prefix-free code if the inequality is fulfilled.
As a complement to the Kraft inequality we also give, without proof, the generalization by McMillan to uniquely decodable codes.

Theorem 25 (McMillan) There exists a uniquely decodable $D$-ary code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_k$ if and only if

$$\sum_{i=1}^{k} D^{-\ell_i} \leq 1$$
Notice that the second part of the proof is constructive. It describes a method to find a tree representing a prefix-free code for the given codeword lengths.
Example 32 To construct a binary prefix-free code with codeword lengths $\{2, 2, 2, 3, 4\}$, first check that it is possible according to the Kraft inequality. The derivation

$$\sum_i 2^{-\ell_i} = 2^{-2} + 2^{-2} + 2^{-2} + 2^{-3} + 2^{-4} = 3\cdot\frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16} < 1$$

shows that such a code exists. For example, we can use the following binary tree representation to get the code listed in the table.
shows that there exists such code. For example we can use the following binary tree
representation to get the tabular to the right.
1
x
5 0 1
x
4 0 1
x
3 0
1
x
2
1
x
1 0
0
x C
1
x
1
00 2
x
2
01 2
x
3
10 2
x
4
110 3
x
5
1110 4
Notice that there is one unused leaf in the tree. In this example it means that the codeword for $x_5$ is unnecessarily long and can be shortened to 111.
In the example we see how we can use the tree representation to get a code.
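The construction in the proof can also be written out in a few lines of code. A sketch (Python), with function names of my own; it reproduces the code of Example 32 and previews Example 33 below, where the inequality fails:

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality."""
    return sum(Fraction(1, D**l) for l in lengths)

def construct_code(lengths):
    """Binary codeword assignment mimicking the proof: take the lengths
    shortest first, give each one the next free node at its depth, and
    skip the sub-tree below it."""
    if kraft_sum(lengths) > 1:
        return None                    # no prefix-free code exists
    codewords, node, depth = [], 0, 0
    for l in sorted(lengths):
        node <<= l - depth             # descend to depth l
        codewords.append(format(node, f"0{l}b"))
        node += 1                      # move to the next free node
        depth = l
    return codewords

print(construct_code([2, 2, 2, 3, 4]))   # ['00', '01', '10', '110', '1110']
print(construct_code([1, 2, 2, 3, 4]))   # None (the sum is 19/16 > 1)
```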
Example 33 To construct a binary prefix-free code with codeword lengths $\{1, 2, 2, 3, 4\}$, first check whether it is possible according to the Kraft inequality. However, the derivation

$$\sum_i 2^{-\ell_i} = 2^{-1} + 2^{-2} + 2^{-2} + 2^{-3} + 2^{-4} = \frac{1}{2} + 2\cdot\frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{19}{16} > 1$$

shows that it is not possible to construct such a code. There are not enough leaves in the binary tree, and these lengths cannot fit into it.
With the Kraft inequality we have a mathematical condition that must be fulfilled for a prefix-free code to exist. With this it is possible to set up an optimization problem for the codeword lengths. One standard method for minimizing a function under a side constraint is the method of Lagrange multipliers. For this we set up the optimization function

$$J = \sum_i p(x_i)\,\ell_i + \lambda\Big(\sum_i D^{-\ell_i} - 1\Big)$$

Setting the derivative of $J$ with respect to each $\ell_k$ equal to zero, we get an equation system to solve,

$$\frac{\partial}{\partial \ell_k} J = p(x_k) - \lambda D^{-\ell_k}\ln D = 0$$

which gives

$$D^{-\ell_k} = \frac{p(x_k)}{\lambda \ln D} \qquad (5.5)$$
We can use the condition from the Kraft inequality to get

$$\sum_i D^{-\ell_i} = \sum_i \frac{p(x_i)}{\lambda \ln D} = \frac{1}{\lambda \ln D} \leq 1 \qquad (5.6)$$
Combining (5.5) and (5.6) we get

$$D^{-\ell_k} = p(x_k)\,\frac{1}{\lambda \ln D} \leq p(x_k)$$

Taking the logarithm and multiplying by $(-1)$ we obtain the optimal codeword length for codeword $k$ as

$$\ell_k^{(\mathrm{opt})} \geq -\log_D p(x_k) \qquad (5.7)$$
The average codeword length for a prefix-free code can now be derived³ as

$$L_{\mathrm{opt}} = \sum_i p(x_i)\,\ell_i^{(\mathrm{opt})} \geq -\sum_i p(x_i)\log_D p(x_i) = H_D(X) = \frac{H(X)}{\log D}$$
We state this result as the following theorem.
Theorem 26 The average codeword length $L = E[\ell(x)]$ of a prefix-free code is lower bounded by the entropy of the source, i.e.

$$L \geq H_D(X) = \frac{H(X)}{\log D} \qquad (5.8)$$

with equality if and only if $\ell(x) = -\log_D p(x)$.
³The notation

$$H_D(X) = -\sum_x p(x)\log_D p(x) = -\sum_x p(x)\frac{\log p(x)}{\log D} = \frac{H(X)}{\log D}$$

is used for the entropy when derived over the base $D$ instead of base 2.
From the previous theorem on the optimal codeword length and the construction method in the proof of the Kraft inequality, we can find a method to design a code. The codeword length for source symbol $x_i$ should satisfy $\ell_i^{(\mathrm{opt})} \geq -\log_D p(x_i)$. Since $-\log_D p(x_i)$ might not be an integer, we take the closest integer that fulfills the inequality, i.e. let

$$\ell_i = \Big\lceil -\log_D p(x_i) \Big\rceil$$
To see that we can actually use these lengths for a code, we first need to check that the Kraft inequality is fulfilled:

$$\sum_i D^{-\ell_i} = \sum_i D^{-\lceil -\log_D p(x_i)\rceil} \leq \sum_i D^{-(-\log_D p(x_i))} = \sum_i p(x_i) = 1$$

which shows that it is possible to construct such a code. This code construction is named the Shannon-Fano code.
For this construction the codeword lengths lie in the interval

$$-\log_D p(x_i) \leq \ell_i \leq -\log_D p(x_i) + 1$$
Taking the expectation of the above gives

$$E\big[-\log_D p(x)\big] \leq E\big[\ell(x)\big] \leq E\big[-\log_D p(x)\big] + 1$$

that is, $H_D(X) \leq L \leq H_D(X) + 1$.
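The length assignment is a one-liner in code. A sketch (Python); the distribution used below is the one from Example 34 that follows:

```python
import math

def shannon_fano_lengths(probs, D=2):
    """Codeword lengths l_i = ceil(-log_D p(x_i))."""
    # Note: for probabilities that are exact powers of D the floating-
    # point logarithm may round the wrong way; a robust implementation
    # would handle that case explicitly.
    return [math.ceil(-math.log(p, D)) for p in probs]

probs = [0.45, 0.25, 0.20, 0.10]         # the distribution of Example 34
lengths = shannon_fano_lengths(probs)
print(lengths)                           # [2, 2, 3, 4]

# The lengths satisfy the Kraft inequality by construction, so a
# prefix-free code with these lengths always exists.
print(sum(2.0 ** -l for l in lengths))   # 0.6875, i.e. 11/16 <= 1
```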
The following example shows that the Shannon-Fano code is not necessarily an optimal code.
Example 34 Consider a random variable with four outcomes according to the table below. The table also lists the optimal codeword lengths $-\log p(x)$ and the lengths $\ell = \lceil -\log p(x)\rceil$ of the codewords in a Shannon-Fano code.
x      p(x)    −log p(x)    ℓ = ⌈−log p(x)⌉
x_1    0.45    1.152        2
x_2    0.25    2            2
x_3    0.20    2.32         3
x_4    0.10    3.32         4
From the above we know that the Kraft inequality is fulfilled, but as a further clarification we derive it:

$$\sum_i 2^{-\ell_i} = \frac{1}{4} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{11}{16} < 1$$
This shows that it is possible to construct a prefix-free code with the listed codeword lengths. Following the procedure described earlier we get the following binary tree representation and code list.
(Binary tree representation: the inner nodes have probabilities 1 at the root, 0.7 and 0.3 at depth 1, 0.3 at the node 10, and 0.1 at the node 101.)

x      p(x)    C
x_1    0.45    00
x_2    0.25    01
x_3    0.20    100
x_4    0.10    1010
In the tree, following the paths 11 and 1011, there are unused leaves. The first one can be used to show that this code is not optimal: by moving the label for $x_4$ to the leaf at 11, the length of this codeword decreases from 4 to 2, which results in a code with lower average codeword length. Hence, the Shannon-Fano construction does not in general give an optimal code. However, it is still bounded by (5.8). To see this we first derive the entropy of the random variable,

$$H(X) = H(0.45, 0.25, 0.2, 0.1) = 1.815$$

With use of the path length lemma we can derive the average codeword length

$$L = 1 + 0.7 + 0.3 + 0.3 + 0.1 = 2.4$$

which lies between $H(X) = 1.815$ and $H(X) + 1 = 2.815$.
To see how this relates to random processes, we view sequences $x = x_1 x_2 \ldots x_n$ of length $n$ as the alphabet of the source. Then we are interested in the average codeword length per source symbol in the vector. By using Theorem 26 it is possible to bound the optimal codeword length as in the next corollary.
Corollary 28 Consider a coding from a length-$n$ vector of source symbols, $x = (x_1 x_2 \ldots x_n)$, to a binary codeword of length $\ell(x)$. Then the average codeword length per source symbol for an optimal prefix-free code satisfies

$$\frac{1}{n} H(X_1 X_2 \ldots X_n) \leq L \leq \frac{1}{n} H(X_1 X_2 \ldots X_n) + \frac{1}{n}$$

where $L = \frac{1}{n}E[\ell(x)]$.
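As a rough numerical illustration of the corollary (under assumptions of my own: a binary memoryless source with $P(1) = 0.1$, so that $\frac{1}{n}H(X_1 \ldots X_n) = H(X)$, coded with Shannon-Fano lengths per block, which are within one bit of optimal):

```python
import math
from itertools import product

def block_rate(p, n):
    """Shannon-Fano bits per source symbol when blocks of n symbols
    from a binary memoryless source with P(1) = p are coded jointly."""
    total = 0.0
    for block in product([0, 1], repeat=n):
        prob = p ** sum(block) * (1 - p) ** (n - sum(block))
        total += prob * math.ceil(-math.log2(prob))
    return total / n

p = 0.1   # illustrative choice of source, not taken from the text
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"H(X) = {H:.3f} bits/symbol")
for n in (1, 2, 4, 8):
    # Each rate lies between H(X) and H(X) + 1/n, as the corollary states.
    print(n, round(block_rate(p, n), 3))
```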
Letting $n$ approach infinity will sandwich the length at $\frac{1}{n}H(X_1 X_2 \ldots X_n) \to H_\infty(X)$. Expressed more formally, we get the following corollary.

Corollary 29 If $X_1 X_2 \ldots X_n$ is a stationary stochastic process, the average codeword length for an optimal binary prefix-free code satisfies

$$L \to H_\infty(X), \quad n \to \infty$$

where $H_\infty(X)$ is the entropy rate of the process.