
Fundamental Limits of Source Codes and Optimal Codes

EE4740/SC42160 Data Compression: Entropy & Sparsity Perspectives

Delft University of Technology, The Netherlands


Recap and Summary

Last lecture:
● Non-singular, uniquely decodable, and prefix codes
● Kraft inequality and bounds on the optimal code length

Today:
● Constructions of prefix codes: Shannon, Huffman, Arithmetic
● Optimality of Huffman code
Reference: Cover and Thomas, Chapters 5 and 13 (particularly, the parts
presented on the slides)

2 / 38
Recap

● Information produced by a discrete information source is represented using the
alphabet X = {x1, x2, . . . , xm}

● Source code: A source code C for a random variable X is a mapping from X,
the range of X, to D∗, the set of finite-length strings of symbols from a D-ary
alphabet.

● The expected length L(C) of a source code C(x) for a random variable X with
probability mass function p(x) is given by

L(C) = ∑_{x∈X} p(x) l(x),

where l(x) is the length of the codeword associated with x.
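A quick numerical illustration of this definition (the probabilities and code table below are our own toy example, not from the slides):

# Expected length L(C) = sum over x of p(x) * l(x)
p = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}          # hypothetical pmf
code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # a prefix code for it

L = sum(p[x] * len(code[x]) for x in p)
print(L)  # 0.4*1 + 0.3*2 + 0.2*3 + 0.1*3 = 1.9 bits/symbol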

3 / 38
Recap: Types of Codes

A code is nonsingular if every element of X maps into a different string in D∗

A code is called uniquely decodable if its n-th extension is nonsingular for every n

A code is called a prefix code or an instantaneous code if no codeword is a prefix
of any other codeword

4 / 38
Recap: Optimal Prefix Code

Theorem (Kraft inequality)


For any prefix code over an alphabet of size D, the codeword lengths l1 , l2 , . . . , lm
must satisfy the inequality
∑_{i=1}^{m} D^{-li} ≤ 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists a
prefix code with these word lengths.

Prefix code with the minimum expected length

min_{l1, l2, ..., lm} ∑_i pi li   such that   ∑_i D^{-li} ≤ 1

⟹ li∗ = − logD pi (non-integer choice)

This non-integer choice yields the expected codeword length HD(X)
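A small sketch (the helper names are ours) that checks the Kraft inequality for a set of lengths and computes the non-integer optimal lengths:

import math

def kraft_sum(lengths, D=2):
    # A prefix code with these codeword lengths exists iff this sum is <= 1.
    return sum(D ** (-l) for l in lengths)

def ideal_lengths(probs, D=2):
    # Non-integer optimal lengths l_i* = -log_D p_i; they achieve L = H_D(X).
    return [-math.log(p, D) for p in probs]

print(kraft_sum([1, 2, 3, 3]))           # 1.0 -> Kraft holds with equality
print(ideal_lengths([0.5, 0.25, 0.25]))  # [1.0, 2.0, 2.0]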

5 / 38
Optimal Code

● If − log pi is not an integer, we use li = ⌈− log pi ⌉, which satisfies the Kraft
inequality
● Because the ceiling adds at most 1 bit of overhead, the optimal prefix code satisfies

HD(X) ≤ L < HD(X) + 1

● McMillan's theorem says that any uniquely decodable code satisfies the Kraft
inequality, implying there is no better choice than a prefix code

Today:
● Constructions of prefix codes: Shannon, Huffman, Arithmetic
● Optimality of Huffman code

6 / 38
Representing Prefix-free Code
Symbol   Codeword c
A        0
B        10
C        110
D        1110
E        1111

● Tree representation: Codewords correspond to the leaf nodes
[Figure: binary code tree; edges labeled 0/1 from the root, with leaves A (0), B (10), C (110), D (1110), E (1111)]
● Interval representation: Codewords correspond to intervals, which are
non-intersecting subsets of [0, 1)
● Given a codeword c = (c1, ..., cl) ∈ {0, 1}^l, the interval is

Ic = [bc , bc + 2^{-l}) where bc = 0.c1 c2 . . . cl ∈ [0, 1)

7 / 38
Interval Representation
Given a codeword c = (c1, ..., cl) ∈ {0, 1}^l, the interval is

Ic = [bc , bc + 2^{-l}) where bc = 0.c1 c2 . . . cl ∈ [0, 1)

Symbol   Codeword c   Number bc   Interval
A        0            0           [0, 0.5)
B        10           0.5         [0.5, 0.75)
C        110          0.75        [0.75, 0.875)
D        1110         0.875       [0.875, 0.9375)
E        1111         0.9375      [0.9375, 1)


● Why non-intersecting?: For any b ∈ Ic , we have

bc ≤ b < bc + 2^{-l} = 0.c1 c2 . . . cl 111 . . . ,

i.e., the binary expansion of b starts with c1 c2 . . . cl .

The interval Ic consists of all b ∈ [0, 1) whose binary expansion starts with bc . Since in
a prefix code no codeword is a prefix of another, the intervals are non-intersecting.
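A minimal sketch of this mapping (the function is ours): each binary codeword is mapped to bc and the dyadic interval [bc , bc + 2^{-l}):

def codeword_interval(c):
    # c is a binary string c1 c2 ... cl; bc = 0.c1 c2 ... cl in binary.
    l = len(c)
    b = int(c, 2) / 2 ** l
    return b, b + 2 ** (-l)

for c in ["0", "10", "110", "1110", "1111"]:
    print(c, codeword_interval(c))
# 0 -> (0.0, 0.5), 10 -> (0.5, 0.75), 110 -> (0.75, 0.875),
# 1110 -> (0.875, 0.9375), 1111 -> (0.9375, 1.0)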

8 / 38
Construction 1: Shannon Code

● Shannon used an interval-based construction, arranging the symbols in order
of decreasing probability and considering the cumulative probability

Fi = ∑_{j=1}^{i−1} pj

Symbol pi Fi li
A 4/10 0 2
B 3/10 4/10 2
C 2/10 7/10 3
D 1/10 9/10 4

● Expand Fi in binary and keep its first li bits, where

− log pi ≤ li < − log pi + 1

● The codeword for Fi differs from all succeeding ones in one or more of its li places,
since the succeeding cumulative probabilities are at least 2^{-li} larger
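A sketch of this construction (the helper is ours; for exact arithmetic one could use fractions.Fraction instead of floats):

import math

def shannon_code(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> binary codeword.
    code, F = {}, 0.0
    # Arrange the symbols in order of decreasing probability.
    for sym, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        l = math.ceil(-math.log2(p))       # -log p_i <= l_i < -log p_i + 1
        bits, frac = "", F                 # first l_i bits of the expansion of F_i
        for _ in range(l):
            frac *= 2
            bits += str(int(frac))
            frac -= int(frac)
        code[sym] = bits
        F += p                             # cumulative probability for the next symbol
    return code

# Same source as in the table above.
print(shannon_code({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}))
# {'A': '00', 'B': '01', 'C': '101', 'D': '1110'}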

9 / 38
Decimal Fractions to Binary: Algorithm

1 Multiply the fraction by 2, while noting down the resulting integer and fraction
parts of the product

2 Keep multiplying each successive resulting fraction by 2 until you get a resulting
fraction product of zero

3 Now, write all the integer parts of the product in each step

For example, 0.625 is 0.101 in binary:

0.625 × 2 = 1 + 0.25 (bit 1),   0.25 × 2 = 0 + 0.5 (bit 0),   0.5 × 2 = 1 + 0 (bit 1)
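The same procedure as a small function (ours), truncated to a maximum number of bits:

def frac_to_binary(x, max_bits=16):
    # Binary expansion of x in [0, 1) by repeated multiplication by 2.
    bits = ""
    while x > 0 and len(bits) < max_bits:
        x *= 2
        bits += "1" if x >= 1 else "0"
        x -= int(x)
    return bits or "0"

print(frac_to_binary(0.625))  # '101'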

10 / 38
Shannon Code

Symbol   pi     Fi     li   Code
AA       4/10   0      2    00
AB       3/10   4/10   2    01
BA       2/10   7/10   3    101
BB       1/10   9/10   4    1110

H(X) = 1.846
L2 = 2.4

11 / 38
Shannon Code: Tree Representation

● Compute
− log pi ≤ li < − log pi + 1

● Start with a complete binary tree of depth maxi li


● Place the symbol with the shortest length on a node at its corresponding length,
discard its children, move to the next symbol, and repeat this process
sequentially
● Both constructions lead to the same code if we select the available node
with the smallest binary value; a different rule for selecting the next available
node will lead to a different code (with the same lengths).

12 / 38
Shannon Code: Some Observations

● Codeword lengths are chosen as li = ⌈− log pi ⌉

● Shannon code is asymptotically optimal as n → ∞

● Shannon code may be much worse than the optimal code. For example when
p1 = 0.99 and p2 = 0.01 since ⌈− log p2 ⌉ = 7

13 / 38
Example: Shannon Code

Symbol   pi     Fi     ⌈− log p(x)⌉   Code
A        0.55   0      1              0
B        0.25   0.55   2              10
C        0.1    0.8    4              1100
D        0.1    0.9    4              1110

[Figure: code tree with leaves A (0), B (10), C (1100), D (1110)]

What is the optimal (shortest expected length) prefix code? Huffman code!

14 / 38
Huffman Code

● An optimal (shortest expected length) prefix code for a given distribution can be
constructed by a simple algorithm discovered by Huffman

● Huffman coding arranges the symbols in order of decreasing probability and
joins the two least probable symbols together

● The new set of symbols is reordered, and the two least probable symbols are joined
again. This process repeats until only two symbols are left

● Working backward, a 0 or 1 is prepended to the codewords of the two merged groups
(see the sketch below)
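A compact sketch of this procedure using a binary heap (the helper and its tie-breaking are ours; different tie-breaking gives different, but equally optimal, codes):

import heapq
import itertools

def huffman_code(probs):
    # probs: dict symbol -> probability. Returns dict symbol -> binary codeword.
    counter = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}        # prepend 0 to one group
        merged.update({s: "1" + w for s, w in c2.items()})  # and 1 to the other
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

print(huffman_code({"A": 0.25, "B": 0.5, "C": 0.125, "D": 0.125}))
# {'B': '0', 'A': '10', 'C': '110', 'D': '111'} with this tie-breaking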

15 / 38
Example: Huffman Code

16 / 38
Some Examples

1 p(x) = {0.25, 0.5, 0.125, 0.125}


● Shannon code = {10, 0, 110, 111}
● Huffman code = {10, 0, 110, 111}

2 p(x) = {0.25, 0.25, 0.2, 0.15, 0.15}


● Shannon code = {00,01,100,101,110}
● Huffman code = {10, 00, 01, 110, 111}

17 / 38
Huffman Code on English Text

Image courtesy: Twitter, Simon Pampena and Stanford data compression course notes

18 / 38
Practical prefix-free coding

● Huffman codes are actually used in practice due to their optimality and easy
construction

● Examples:
● http/2 header compression
● ZLIB, DEFLATE, GZIP, PNG compression
● JPEG compression

● After multiple (domain-dependent) transformations, the data needs to be
losslessly compressed; this is typically where Huffman codes are used

19 / 38
Non-unique Optimal Codes

● Consider a random variable X with a distribution (1/3, 1/3, 1/4, 1/12)

● The Huffman coding procedure results in codeword lengths of

(2, 2, 2, 2) OR (1, 2, 3, 3)

● Both these codes achieve the same expected codeword length of 2

● In the second code, the third symbol has length 3, which is greater than
⌈−log 1/4⌉ = 2

20 / 38
Optimality of Codes

Theorem (Necessary conditions for Optimality)


For any distribution, there exists an optimal prefix code (with minimum expected
length) that satisfies the following properties:
1 The lengths are ordered inversely with the probabilities (i.e., if pj > pk , then
lj ≤ lk ).
2 The two longest codewords have the same length.

● A code C is optimal if for any other uniquely decodable code Ĉ , we have

L(C) ≤ L(Ĉ)

21 / 38
Proof of Condition 1

Theorem (Necessary conditions for Optimality)


For any distribution, there exists an optimal prefix code that satisfies the following
properties:
1 The lengths are ordered inversely with the probabilities (i.e., if pj > pk , then
lj ≤ lk ).
2 The two longest codewords have the same length.

● Let us assume that our code C is optimal but for pk > pj it has lk > lj .

● Now, let us consider another prefix-free code Ĉ where we exchange the codewords
corresponding to j and k.

L(Ĉ) − L(C) = ∑_{i=1}^{m} pi (l̂i − li ) = pk (lj − lk ) + pj (lk − lj ) = (pk − pj )(lj − lk ) < 0

● Hence, this is in contradiction to our assumption that code C is optimal. This
proves Condition 1.

22 / 38
Proof of Condition 2

Theorem (Necessary conditions for Optimality)


For any distribution, there exists an optimal prefix code that satisfies the following
properties:
1 The lengths are ordered inversely with the probabilities (i.e., if pj > pk , then
lj ≤ lk ).
2 The two longest codewords have the same length.

● If the two longest codewords are not of the same length, one can delete the last
bit of the longer one, preserving the prefix property and achieving lower expected
codeword length.

● Hence, the two longest codewords must have the same length.

● For example, let 0110101 be the unique longest codeword. Since there is no other
codeword of length 7, we can safely drop its last bit while keeping the prefix
property, and we get a shorter average code length!

23 / 38
Huffman Code and Optimality Conditions

Theorem (Necessary conditions for Optimality)


For any distribution, there exists an optimal prefix code that satisfies the following
properties:
1 The lengths are ordered inversely with the probabilities (i.e., if pj > pk , then
lj ≤ lk ).
2 The two longest codewords have the same length.

● The symbols with higher probability are merged later, so their code lengths are
shorter, satisfying Condition 1.
● We always choose the two symbols with the smallest probability and combine
them, so Condition 2 is also satisfied.
Verifying that the Huffman code has these desirable properties does not by itself prove
its optimality. See the Cover & Thomas textbook for a rigorous proof.

Theorem (Optimality of Huffman Code)


Huffman coding is optimal.

24 / 38
Huffman Code: Some Observations

● The main property is that after merging the two smallest-probability symbols of the
distribution (p1 , p2 , . . . , pm ), the remaining tree is optimal for the reduced
distribution (p1 , . . . , pm−2 , pm−1 + pm )

● Huffman coding is a “greedy” algorithm in that it coalesces the two least likely
symbols at each stage.

● The typical prefix-free code decoder works by walking through the code tree,
bit by bit, until it reaches a leaf node (see the sketch below)
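A minimal decoder sketch along these lines (ours), storing the tree as a nested dict:

def build_tree(code):
    # Turn a prefix code (symbol -> codeword) into a binary trie of nested dicts.
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["leaf"] = sym
    return root

def decode(bits, code):
    # Walk the trie from the root; every time a leaf is reached, emit its symbol.
    root = build_tree(code)
    node, out = root, []
    for bit in bits:
        node = node[bit]
        if "leaf" in node:
            out.append(node["leaf"])
            node = root
    return "".join(out)

code = {"A": "0", "B": "10", "C": "110", "D": "1110", "E": "1111"}
print(decode("0101110", code))  # 'ABD'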

25 / 38
Issues with Symbol Codes

● Design Huffman code for the distribution p(x) = (4/5, 1/5)


Symbol pi Code
A 4/5 0
B 1/5 1
H(X ) = 0.722
L1 = 1
● The expected length of Huffman code L ≤ H(X ) + 1
● We can try Huffman code with extensions (block codes)
Symbol pi Code
AA 16/25 0
AB 4/25 10
BA 4/25 110
BB 1/25 111
H(X ) = 1.444
L2 = 1.560/2 = 0.780
● For n extensions, H(X ) ≤ Ln < H(X ) + 1/n
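A sketch (ours) that builds the n-th extension of the source above and reports the per-symbol Huffman length, illustrating H(X) ≤ Ln < H(X) + 1/n; it only tracks expected lengths, not codewords:

import heapq
import itertools
import math

def huffman_expected_length(probs):
    # Expected codeword length: each merge adds one bit to every codeword below it,
    # so L equals the sum of the probabilities of all merged (internal) nodes.
    heap = [[p, 0.0] for p in probs]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, l1 = heapq.heappop(heap)
        p2, l2 = heapq.heappop(heap)
        heapq.heappush(heap, [p1 + p2, l1 + l2 + p1 + p2])
    return heap[0][1]

p = [4 / 5, 1 / 5]
H = -sum(q * math.log2(q) for q in p)                  # about 0.722 bits
for n in (1, 2, 3):
    blocks = [math.prod(b) for b in itertools.product(p, repeat=n)]
    print(n, huffman_expected_length(blocks) / n)      # 1.0, 0.78, ... -> approaches H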

26 / 38
A New Optimal Code

● Huffman with block size n achieves compression as close as 1/n bits/symbol to


entropy H(X )

● The issue is that, as we increase the block size n, the codebook size grows
exponentially as |X|^n

● The larger the codebook size the more complicated the encoding/decoding
becomes, the more memory we need, the higher the latency etc.

● Huffman code is not easily extended to longer block lengths (extensions)


without redoing all the calculations

● The idea is therefore impractical for large blocks. Arithmetic coding addresses this
issue.

27 / 38
Arithmetic Coding

● Arithmetic coding encodes entire data as a single block

● Codebook size: Codewords are computed on the fly. No pre-computed codebook
is involved.

● Codeword length: H(X) ≤ LArithmetic ≤ H(X) + (σ + 1)/n,
i.e., σ bits of overhead for the entire sequence, not for each symbol

● Arithmetic coding achieves almost the same compression as Huffman coding with
extensions, but it is practical.

28 / 38
An Example: Encoding

Given a sequence of n = 6 i.i.d. random variables with source alphabet
X = {A, B, C, D} and pmf {0.2, 0.5, 0.2, 0.1}, consider the encoding of the sequence
CBAABD.
1 Find an interval (or a range) [L, H) ⊂ [0, 1) corresponding to the entire
sequence. We start with [L, H) = [0, 1) and then subdivide the interval as we see
each symbol, depending on its probability.
● C → [0.7, 0.9)
● CB → [0.74, 0.84)
● CBA → [0.74, 0.76)
● ⋮
● CBAABD → [0.7426, 0.7428)
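A sketch of this interval refinement (the function is ours; floating point is used for clarity, while practical coders use integer arithmetic):

def sequence_interval(seq, probs):
    # Cumulative intervals: symbol -> [F(x), F(x) + p(x)) inside [0, 1).
    cum, F = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (F, F + p)
        F += p
    low, high = 0.0, 1.0
    for sym in seq:                      # refine [low, high) symbol by symbol
        width = high - low
        lo, hi = cum[sym]
        low, high = low + width * lo, low + width * hi
        print(sym, round(low, 6), round(high, 6))
    return low, high

sequence_interval("CBAABD", {"A": 0.2, "B": 0.5, "C": 0.2, "D": 0.1})
# C [0.7, 0.9), CB [0.74, 0.84), CBA [0.74, 0.76), ..., CBAABD [0.7426, 0.7428)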

29 / 38
Example: Finding The Interval

30 / 38
Example: Finding The Interval

Given a sequence of n = 6 i.i.d. random variables with source alphabet
X = {A, B, C, D} and pmf {0.2, 0.5, 0.2, 0.1}, consider the encoding of the sequence
(C, B, A, A, B, D).

Find an interval (or a range) [L, H) ⊂ [0, 1) corresponding to the entire sequence.
We start with [L, H) = [0, 1) and then subdivide the interval as we see each symbol,
depending on its probability.

CBAABD → [0.7426, 0.7428)

31 / 38
Decoding
● 0.1011111000100 → v1 = 0.74267578125

● We can start decoding from the first interval I0 (x n ) = [0, 1) by comparing with
the cumulative distribution

0.74267578125 ∈ [0.7, 0.9) ⟹ C

● Shift the interval back to [0, 1):

v2 = (v1 − 0.7)/0.2 = 0.21337890625 ∈ [0.2, 0.7) ⟹ B

● Continue until you decode all the symbols


● When to stop: Inform the number of symbols beforehand OR use a special
message to stop (extra-overhead of σ bits)
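A decoding sketch (ours) following these steps, assuming the number of symbols n is known in advance:

def arithmetic_decode(value, probs, n):
    cum, F = {}, 0.0
    for sym, p in probs.items():         # cumulative intervals, as in the encoder
        cum[sym] = (F, F + p)
        F += p
    out = []
    for _ in range(n):
        for sym, (lo, hi) in cum.items():
            if lo <= value < hi:         # which interval contains the current value?
                out.append(sym)
                value = (value - lo) / (hi - lo)   # shift back to [0, 1)
                break
    return "".join(out)

v = int("1011111000100", 2) / 2 ** 13    # 0.74267578125
print(arithmetic_decode(v, {"A": 0.2, "B": 0.5, "C": 0.2, "D": 0.1}, n=6))  # 'CBAABD'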

32 / 38
Natural Example

● Suppose we have an i.i.d. source described by the following symbol probabilities:

p(A) = 1/2, p(B) = 1/4, p(C) = 1/8, p(D) = 1/8.

Let the sequence be AAB...
● The first A is mapped to [0, 0.1) (binary) ⟹ emit a 0 as the first bit, because all
numbers in this interval must begin with bit 0.
● For the next A, the interval becomes [0, 0.01) ⟹ send another 0.
● For the next B, the interval is [0.001, 0.0011). Any number within this range can
encode AAB, so the total codeword for AAB could be 001, 0010, 001011, etc.
● In this case, the decoder can interpret bits as they arrive: the first 0 maps to A,
and so forth.

33 / 38
Out of Order Example

● Suppose we have an i.i.d. source described by the following symbol probabilities
(in this order):

p(D) = 1/8, p(B) = 1/4, p(A) = 1/2, p(C) = 1/8.

Let the sequence be AAB...
● The first A is mapped to [0.011, 0.111) (binary), so its first bit can be 0 or 1
● So we can not start encoding right away, unlike Huffman coding

34 / 38
Unique Representation

● We can uniquely represent the interval [L, H) by picking the midpoint: (L + H)/2
● The midpoint could have a very long expansion, so we round it off after B bits
to get the codeword w

− log(H − L) ≤ B ≤ − log(H − L) + 1

● For example, if the interval is [0.7426, 0.7428), its length is

In = H − L = 0.0002 ⟹ B = 13

Our choice is 0.7427 → w = 0.1011111000100

● This choice yields a prefix-free code

35 / 38
Optimality of Arithmetic Codes
● What is the size of the interval (H − L) for the input x^n?

H − L = ∏_{i=1}^{n} p(xi) = p(x^n)

● The number of bits required to encode x^n is < σ − log p(x^n) + 1

● Length of the encoded sequence is < (1/n)(σ − log p(x^n) + 1) bits/symbol

● Expected codelength is

LArithmetic < E[(σ − log p(X^n) + 1)/n] = H(X) + (σ + 1)/n

● LArithmetic → H(X) for large n

36 / 38
Huffman Vs Arithmetic Coding

● For both schemes, the expected codelength goes to entropy as n → ∞

● Unlike Huffman coding, arithmetic coding does not have exponentially
growing memory demands

● Arithmetic coding can be applied to non-i.i.d. models (sources with memory)

● A newer family of compression algorithms, Asymmetric Numeral Systems, can
achieve compression performance similar to arithmetic coding, but speeds closer
to that of Huffman coding.
Check out Charles Bloom’s blog:
http://cbloomrants.blogspot.com/2014/01/1-30-14-understanding-ans-1.html

37 / 38
Summary

● Shannon code
● Huffman code
● Optimality of Huffman code
● Arithmetic codes

End of our discussion on entropy-based data compression!

Next lecture: Sparsity perspective!

38 / 38
