lec-02
Algorithms
Ragesh Jaiswal
CSE, IIT Delhi
Greedy Algorithms: Huffman Coding
The encoding of “e” should be shorter than the encoding of
“x”, since “e” occurs far more often in English text.
In fact Morse code was designed with this in mind.
[Figures omitted. Source: Wikipedia]
Suppose you receive the following Morse code from your
friend:
[Figure omitted: the Morse code sequence. Source: Wikipedia]
What is the message?
Prefix-free encoding: An encoding 𝑓 is called prefix-free if
for any pair of distinct letters (𝑎₁, 𝑎₂) of the alphabet,
𝑓(𝑎₁) is not a prefix of 𝑓(𝑎₂).
Morse code is clearly not prefix-free: for example, the code
for E (".") is a prefix of the code for A (".-").
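The definition above can be checked mechanically. The following is a minimal sketch, assuming an encoding is stored as a Python dictionary from symbols to codeword strings; the name is_prefix_free and the four-letter Morse subset are illustrative choices, not part of the slides.

```python
def is_prefix_free(code):
    """Return True if no codeword is a prefix of another symbol's codeword."""
    words = list(code.values())
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            # Compare every ordered pair of distinct symbols.
            if i != j and v.startswith(u):
                return False
    return True

# A small, illustrative subset of International Morse code.
morse_subset = {"E": ".", "T": "-", "A": ".-", "N": "-."}

print(is_prefix_free(morse_subset))                      # False: "." is a prefix of ".-"
print(is_prefix_free({"a": "0", "b": "10", "c": "11"}))  # True
```

The pairwise check takes quadratic time in the number of symbols, which is fine for small alphabets such as the 26 letters considered here.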
Consider a binary tree with 26 leaves and associate each
letter of the alphabet with a leaf in this tree.
Binary Tree: A rooted tree where each non-leaf node has at most
two children.
Label an edge 0 if it connects a parent to its left
child and 1 otherwise.
𝑓(𝑥) = the sequence of edge labels on the path from the root to 𝑥.
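The tree-to-code rule can be sketched in a few lines. This is only an illustration under an assumed representation: a leaf is a string, an internal node is a pair (left, right), and a missing child is None. The function name codes_from_tree and the three-leaf tree are hypothetical.

```python
def codes_from_tree(tree, prefix=""):
    """Map each leaf symbol to the edge labels on its root-to-leaf path.

    A tree is either a leaf (a string) or a pair (left, right); either
    side of a pair may be None when that child is absent.
    """
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    codes = {}
    if left is not None:
        codes.update(codes_from_tree(left, prefix + "0"))   # left edge is labeled 0
    if right is not None:
        codes.update(codes_from_tree(right, prefix + "1"))  # right edge is labeled 1
    return codes

# Hypothetical three-leaf tree: x on the left, (y, z) on the right.
example = ("x", ("y", "z"))
print(codes_from_tree(example))  # {'x': '0', 'y': '10', 'z': '11'}
```

Because symbols sit only at leaves, the path to one symbol never passes through another symbol's leaf, which is why a code read off a tree in this way is prefix-free.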
[Figure: binary tree for 𝑓 with leaves 𝑎, 𝑏, 𝑐, 𝑑; left edges are labeled 0 and right edges 1]
• 𝑓(𝑎) = 01
• 𝑓(𝑏) = 000
• 𝑓(𝑐) = 101
• 𝑓(𝑑) = 111
• Is 𝑓 prefix-free?
[Figure: binary tree for 𝑔 with leaves 𝑎, 𝑏, 𝑐, 𝑑; left edges are labeled 0 and right edges 1]
• 𝑔(𝑎) = 0
• 𝑔(𝑏) = 11
• 𝑔(𝑐) = 101
• 𝑔(𝑑) = 100
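One consequence of prefix-freeness is that a bit string can be decoded left to right without separators. The sketch below assumes the encoding 𝑔 listed above, stored as a dictionary; the helper name decode is an illustrative choice.

```python
def decode(bits, code):
    """Decode a bit string by emitting a symbol as soon as a codeword matches.

    This is only correct for prefix-free codes, where the first match
    is the only possible one.
    """
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

g = {"a": "0", "b": "11", "c": "101", "d": "100"}
encoded = "".join(g[s] for s in "abcd")
print(encoded)             # 011101100
print(decode(encoded, g))  # abcd
```

With Morse code the same procedure would fail, because a dot could be a complete letter or the start of a longer one.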
Example: Σ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑡(𝑎) = 0.6, 𝑡(𝑏) = 0.2, 𝑡(𝑐) = 0.1, 𝑡(𝑑) = 0.1
[Figure: the binary tree for 𝑓, with leaves 𝑎, 𝑏, 𝑐, 𝑑]
• 𝑂_𝑓 ?
Example: Σ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑡(𝑎) = 0.6, 𝑡(𝑏) = 0.2, 𝑡(𝑐) = 0.1, 𝑡(𝑑) = 0.1
[Figure: the binary tree for 𝑔, with leaves 𝑎, 𝑏, 𝑐, 𝑑]
• 𝑂_𝑔 ?
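The two questions above can be answered numerically. The extracted slides do not define 𝑂, so this sketch assumes it denotes the expected number of bits per symbol, that is, the sum over all symbols 𝑥 of 𝑡(𝑥) times the length of the codeword of 𝑥. The function name expected_length is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def expected_length(code, freq):
    """Average number of bits per symbol: sum of t(x) * |code(x)|."""
    return sum(freq[x] * len(code[x]) for x in freq)

# Frequencies t and the two encodings f and g from the example.
t = {"a": Fraction(6, 10), "b": Fraction(2, 10),
     "c": Fraction(1, 10), "d": Fraction(1, 10)}
f = {"a": "01", "b": "000", "c": "101", "d": "111"}
g = {"a": "0", "b": "11", "c": "101", "d": "100"}

print(float(expected_length(f, t)))  # 2.4
print(float(expected_length(g, t)))  # 1.6
```

Under that assumption, 𝑔 is the better encoding for these frequencies because it gives the most frequent symbol the shortest codeword.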
Running time?
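As a point of reference for the running-time question, here is a sketch of the textbook Huffman construction using a binary heap. It is not taken from the slides, whose own pseudocode is not reproduced here. The name huffman_code is hypothetical, symbols are assumed to be strings (not tuples), and the frequency table is assumed to be non-empty.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Build a prefix-free code by repeatedly merging the two least frequent subtrees."""
    tie = count()  # tie-breaker so the heap never has to compare tree objects
    heap = [(w, next(tie), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # The merged subtree is a pair (left, right) with the combined weight.
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):  # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                        # leaf holding a symbol
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes

t = {"a": 6, "b": 2, "c": 1, "d": 1}  # frequencies from the example, scaled by 10
code = huffman_code(t)
print({x: len(code[x]) for x in sorted(code)})  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

For an alphabet of n symbols the loop performs n - 1 merges, each costing O(log n) heap operations, so this version runs in O(n log n) time. The exact bit patterns depend on how ties are broken, which is why only the codeword lengths are printed.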
A DNA sequence has four characters 𝐴, 𝐶, 𝑇, 𝐺, which
appear with frequencies 30%, 20%, 10%, and 40%
respectively.
We have to encode a sequence of length 1 million (10⁶) in
bits.
If we use two bits for each character, then the size of the
encoding will be 2 million bits.
Huffman coding:
𝑓(𝐴) = 10, 𝑓(𝐶) = 110, 𝑓(𝑇) = 111, 𝑓(𝐺) = 0
We will need 1.9 million bits.
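The two totals can be verified with a short calculation. This sketch assumes the character counts match the stated percentages exactly; the variable names are illustrative.

```python
length = 10**6
percent = {"A": 30, "C": 20, "T": 10, "G": 40}
f = {"A": "10", "C": "110", "T": "111", "G": "0"}

# Fixed-length encoding: two bits for each of the four characters.
fixed_bits = 2 * length

# Variable-length encoding: (number of occurrences) * (codeword length), summed.
huffman_bits = sum(length * percent[x] // 100 * len(f[x]) for x in f)

print(fixed_bits)    # 2000000
print(huffman_bits)  # 1900000
```

The saving comes from 𝐺, the most frequent character, costing one bit instead of two, which outweighs the extra bit spent on 𝐶 and 𝑇.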
End