lec-02

The document discusses Huffman coding, a method for encoding data to minimize the number of bits communicated, particularly in the context of sending emails. It explains the concept of prefix-free encoding and the construction of binary trees to achieve optimal encoding based on the frequency of characters. The document also outlines the Huffman algorithm for creating an optimal binary tree for a given set of characters and their frequencies.

Uploaded by

yogeshgarg0612

CSL 356: Analysis and Design of Algorithms
Ragesh Jaiswal
CSE, IIT Delhi
Greedy Algorithms: Huffman Coding
Greedy Algorithms: Huffman Coding
• A wants to send an email to B and wants to minimize the amount of communication (the number of bits sent).
• How do you encode an email into bits?
  • ASCII: a fixed 8 bits per character (one byte, as usually stored)
  • Is this the best way to encode the email, given that the goal is to minimize the communication?
  • Different letters occur with different frequencies in a typical English document.

[Figure omitted. Source: Wikipedia]
Greedy Algorithms: Huffman Coding
• The encoding of “e” should be shorter than the encoding of “x”.
• In fact, Morse code was designed with this in mind.

[Figures omitted. Source: Wikipedia]
Greedy Algorithms: Huffman Coding
• Suppose you receive the following Morse code from your friend:

[Figure omitted: a Morse code sequence. Source: Wikipedia]

• What is the message?
Greedy Algorithms: Huffman Coding
• Prefix-free encoding: An encoding 𝑓 is called prefix-free if for every pair of distinct letters (𝑎1, 𝑎2), 𝑓(𝑎1) is not a prefix of 𝑓(𝑎2).
• Morse code is not prefix-free: for example, the code for “e” (a single dot) is a prefix of the code for “a” (dot dash).
• Consider a binary tree with 26 leaves and associate each letter with a leaf of this tree.
  • Binary tree: a rooted tree in which each non-leaf node has at most two children.
• Label an edge 0 if it connects a parent to its left child and 1 otherwise.
• 𝑓(𝑥) = the sequence of edge labels on the path from the root to the leaf 𝑥.
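The rule for reading 𝑓(𝑥) off a tree can be sketched in a few lines of Python. This is only an illustration: the tree representation (a leaf is a one-character string, an internal node is a pair `(left, right)`, a missing child is `None`) and the name `codes_from_tree` are assumptions made for the sketch, not part of the lecture.

```python
def codes_from_tree(node, prefix=""):
    """Return {letter: codeword} by walking every root-to-leaf path."""
    if node is None:                   # missing child: nothing to report
        return {}
    if isinstance(node, str):          # leaf: the labels collected so far are the codeword
        return {node: prefix}
    left, right = node
    table = {}
    table.update(codes_from_tree(left, prefix + "0"))   # left edge is labeled 0
    table.update(codes_from_tree(right, prefix + "1"))  # right edge is labeled 1
    return table


# Hypothetical 3-leaf tree: leaf a is the root's left child,
# leaves b and c hang below the root's right child.
example_tree = ("a", ("b", "c"))
print(codes_from_tree(example_tree))   # {'a': '0', 'b': '10', 'c': '11'}
```

Because every letter sits at a leaf, no root-to-leaf path can continue through another letter's leaf, which is why an encoding read off a tree this way is prefix-free.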
Greedy Algorithms: Huffman Coding
• Simple example with 4 letters (left edges are labeled 0, right edges are labeled 1):

root
  0: internal node
    0: internal node
      0: b
    1: a
  1: internal node
    0: internal node
      1: c
    1: internal node
      1: d

• 𝑓(𝑎) = 01
• 𝑓(𝑏) = 000
• 𝑓(𝑐) = 101
• 𝑓(𝑑) = 111
• Is 𝑓 prefix-free?
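Whether the 4-letter encoding 𝑓 is prefix-free can also be checked mechanically. The following is a minimal sketch, assuming an encoding is stored as a Python dict from letter to codeword string; the name `is_prefix_free` is illustrative and not from the lecture. It simply tests every ordered pair of distinct letters against the definition.

```python
from itertools import permutations


def is_prefix_free(encoding):
    """True if no letter's codeword is a prefix of another letter's codeword."""
    return not any(
        encoding[y].startswith(encoding[x])   # is f(x) a prefix of f(y)?
        for x, y in permutations(encoding, 2)  # all ordered pairs of distinct letters
    )


f = {"a": "01", "b": "000", "c": "101", "d": "111"}
print(is_prefix_free(f))                       # True

# Two Morse codewords: "e" is a dot, "a" is dot dash.
print(is_prefix_free({"e": ".", "a": ".-"}))   # False
```

The pairwise test takes quadratic time in the number of letters, which is fine for small alphabets such as the 26 letters considered here.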


Greedy Algorithms: Huffman Coding
• Suppose you are given a prefix-free encoding 𝑔.
• Can you construct a binary tree with 26 leaves, associate each leaf with a letter, and label the edges as defined previously, such that for every letter 𝑥, the sequence of edge labels on the path from the root to 𝑥 equals 𝑔(𝑥)?
• Simple example with 4 letters:

root
  0: a
  1: internal node
    0: internal node
      0: d
      1: c
    1: b

• 𝑔(𝑎) = 0
• 𝑔(𝑏) = 11
• 𝑔(𝑐) = 101
• 𝑔(𝑑) = 100
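The construction asked for above can be sketched as follows, using the 4-letter encoding 𝑔. The representation is an assumption for illustration: the tree is a nested dict keyed by the edge labels "0" and "1", and a leaf is the letter itself. The names `tree_from_codes` and `decode` are hypothetical. The sketch assumes the input is prefix-free and that every codeword is non-empty.

```python
def tree_from_codes(encoding):
    """Build a binary tree (nested dicts keyed by '0'/'1') from a prefix-free encoding."""
    root = {}
    for letter, word in encoding.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})   # follow or create an internal node
        node[word[-1]] = letter               # the last edge ends at the letter's leaf
    return root


def decode(bits, root):
    """Decode a bit string by walking down from the root and restarting at each leaf."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):             # reached a leaf: emit its letter
            out.append(node)
            node = root
    return "".join(out)


g = {"a": "0", "b": "11", "c": "101", "d": "100"}
tree = tree_from_codes(g)
print(decode("0111011000", tree))             # abcda
```

The decoder never needs separators between codewords: because 𝑔 is prefix-free, the first leaf reached is always the correct letter.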


Greedy Algorithms: Huffman Coding
• Problem: Given an alphabet Σ containing 𝑛 symbols and their frequencies of occurrence (𝑡(𝑎1), 𝑡(𝑎2), … , 𝑡(𝑎𝑛)), find the prefix-free encoding 𝑓 that minimizes:
𝑂𝑓 = |𝑓(𝑎1)| ⋅ 𝑡(𝑎1) + |𝑓(𝑎2)| ⋅ 𝑡(𝑎2) + ⋯ + |𝑓(𝑎𝑛)| ⋅ 𝑡(𝑎𝑛)
  • |𝑓(𝑎𝑖)| above is the number of bits in the codeword of 𝑎𝑖.

• Example: Σ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑡(𝑎) = 0.6, 𝑡(𝑏) = 0.2, 𝑡(𝑐) = 0.1, 𝑡(𝑑) = 0.1
• Simple example with 4 symbols: what is 𝑂𝑓 for the following tree?

root
  0: internal node
    0: internal node
      0: b
    1: a
  1: internal node
    0: internal node
      1: c
    1: internal node
      1: d


Greedy Algorithms: Huffman Coding
• Same problem and same example: Σ = {𝑎, 𝑏, 𝑐, 𝑑}, 𝑡(𝑎) = 0.6, 𝑡(𝑏) = 0.2, 𝑡(𝑐) = 0.1, 𝑡(𝑑) = 0.1
• Simple example with 4 symbols: what is 𝑂𝑓 for the following tree?

root
  0: a
  1: internal node
    0: internal node
      0: d
      1: c
    1: b
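The two "𝑂𝑓 ?" questions can be answered by direct computation. The sketch below assumes encodings and frequencies are stored as Python dicts; the function name `cost` is illustrative. The two encodings are the ones read off the two example trees (the first gives 𝑎 = 01, 𝑏 = 000, 𝑐 = 101, 𝑑 = 111; the second gives 𝑎 = 0, 𝑏 = 11, 𝑐 = 101, 𝑑 = 100).

```python
def cost(encoding, freq):
    """Expected codeword length: sum of |f(a)| * t(a) over all symbols a."""
    return sum(len(encoding[s]) * freq[s] for s in freq)


t = {"a": 0.6, "b": 0.2, "c": 0.1, "d": 0.1}
first_tree = {"a": "01", "b": "000", "c": "101", "d": "111"}
second_tree = {"a": "0", "b": "11", "c": "101", "d": "100"}

# Rounded only to hide floating-point noise in the printed values.
print(round(cost(first_tree, t), 6))    # 2.4  (2*0.6 + 3*0.2 + 3*0.1 + 3*0.1)
print(round(cost(second_tree, t), 6))   # 1.6  (1*0.6 + 2*0.2 + 3*0.1 + 3*0.1)
```

The second tree is cheaper because the most frequent symbol 𝑎 gets the shortest codeword.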


Greedy Algorithms: Huffman Coding
• Definition: Given a binary tree, the depth of a vertex 𝑣, denoted by 𝑑(𝑣), is the length of the path from the root to 𝑣.
• Every binary tree (with the symbols at its leaves) gives a prefix-free encoding and every prefix-free encoding gives a binary tree. Under this correspondence, the codeword length |𝑓(𝑎)| equals the depth of the leaf labeled 𝑎. We will now use these properties to rephrase the previous problem in terms of binary trees and depths of leaves.
Greedy Algorithms: Huffman Coding
• Problem: Given an alphabet Σ containing 𝑛 symbols and their frequencies of occurrence (𝑡(𝑎1), 𝑡(𝑎2), … , 𝑡(𝑎𝑛)), find a binary tree 𝑇 with 𝑛 leaves (each leaf labeled with a distinct symbol) that minimizes:
𝑂𝑇 = 𝑑(𝑎1) ⋅ 𝑡(𝑎1) + 𝑑(𝑎2) ⋅ 𝑡(𝑎2) + ⋯ + 𝑑(𝑎𝑛) ⋅ 𝑡(𝑎𝑛)
  • 𝑑(𝑎𝑖) above is the depth of the leaf labeled with symbol 𝑎𝑖.

• What are the properties of the optimal tree 𝑇?

1. Claim: 𝑇 is a complete binary tree.
  • Complete binary tree: every non-leaf node has exactly two children (also called a full binary tree).
  • Why: if some non-leaf node had only one child, replacing that node by its child would shorten the path to every leaf below it, so 𝑂𝑇 could be reduced (assuming positive frequencies).
Greedy Algorithms: Huffman Coding
• Properties of the optimal tree 𝑇 (continued):

2. Claim: Consider the two symbols 𝑥 and 𝑦 with the least frequency. Then 𝑥 and 𝑦 have maximum depth in any optimal 𝑇, and there is an optimal 𝑇 in which 𝑥 and 𝑦 are siblings.
Greedy Algorithms: Huffman Coding
• Let Ω be a new symbol not present in Σ. Consider the following (smaller) problem:
  • Σ′ = (Σ ∖ {𝑥, 𝑦}) ∪ {Ω}
  • For all 𝑧 in Σ′ ∖ {Ω}, 𝑡′(𝑧) = 𝑡(𝑧)
  • 𝑡′(Ω) = 𝑡(𝑥) + 𝑡(𝑦)
  Find the optimal binary tree for the new alphabet Σ′ and the new frequencies given by 𝑡′.
• Let 𝑇′ be an optimal binary tree for the above problem.
• Consider the leaf 𝑣 labeled with Ω in 𝑇′. Construct a new tree 𝑇 from 𝑇′ by giving 𝑣 two children that are leaves, labeled 𝑥 and 𝑦.
• Claim: 𝑇 is an optimal tree for the original problem.
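A short calculation shows why the reduction works. The notation below is an assumption made for this sketch: 𝑑 with a subscript names the tree in which the depth is measured, and the objective of the smaller problem is written with 𝑇′ as its subscript. Since 𝑥 and 𝑦 are children of the leaf that carried Ω, each lies one level deeper than Ω did, and all other leaves keep their depths.

```latex
\begin{align*}
d_T(x) &= d_T(y) = d_{T'}(\Omega) + 1, \\
O_T &= \sum_{z \in \Sigma \setminus \{x, y\}} d_T(z)\, t(z) + d_T(x)\, t(x) + d_T(y)\, t(y) \\
    &= \sum_{z \in \Sigma' \setminus \{\Omega\}} d_{T'}(z)\, t'(z)
       + \bigl(d_{T'}(\Omega) + 1\bigr)\bigl(t(x) + t(y)\bigr) \\
    &= O_{T'} + t(x) + t(y).
\end{align*}
```

The two objectives differ by the constant 𝑡(𝑥) + 𝑡(𝑦), which does not depend on the shape of the tree. Among trees in which 𝑥 and 𝑦 are siblings, minimizing one objective therefore minimizes the other, and by the second claim some optimal tree for the original problem has 𝑥 and 𝑦 as siblings.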
Greedy Algorithms: Huffman Coding
Huffman(Σ)
- Let 𝑣1, … , 𝑣𝑛 be nodes, one for each symbol of Σ, with 𝑡(𝑣𝑖) set to the frequency of that symbol
- 𝑆 = {𝑣1, … , 𝑣𝑛}
- While (|𝑆| > 1)
    - Pick the two nodes 𝑥 and 𝑦 in 𝑆 with the smallest values of 𝑡
    - Create a new node 𝑧 and set 𝑡(𝑧) = 𝑡(𝑥) + 𝑡(𝑦)
    - Set 𝑥 as the left child of 𝑧 and 𝑦 as the right child
    - Remove 𝑥 and 𝑦 from 𝑆 and add 𝑧 to 𝑆
- Return the only node left in 𝑆 as the root node of the Huffman tree

• Running time?
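One possible way to implement the pseudocode is sketched below in Python. It is not the lecture's implementation: the choice of a binary heap (`heapq`) for the set 𝑆, the tuple representation of internal nodes, and the names `huffman` and `code_lengths` are assumptions made for illustration. With a heap, each of the 𝑛 − 1 loop iterations costs O(log 𝑛), which suggests an overall running time of O(𝑛 log 𝑛); scanning 𝑆 linearly for the two minima would instead give O(𝑛²).

```python
import heapq
from itertools import count


def huffman(freq):
    """Return the root of a Huffman tree for {symbol: frequency}.

    A leaf is the symbol itself; an internal node is a (left, right) tuple,
    so symbols must not be tuples in this sketch.
    """
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(t, next(tie), sym) for sym, t in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        tx, _, x = heapq.heappop(heap)   # node with the smallest frequency
        ty, _, y = heapq.heappop(heap)   # node with the second smallest
        heapq.heappush(heap, (tx + ty, next(tie), (x, y)))  # x left, y right
    return heap[0][2]


def code_lengths(node, depth=0):
    """Return {symbol: depth of its leaf}, i.e. the codeword lengths."""
    if not isinstance(node, tuple):      # leaf
        return {node: depth}
    left, right = node
    return {**code_lengths(left, depth + 1), **code_lengths(right, depth + 1)}


root = huffman({"a": 0.6, "b": 0.2, "c": 0.1, "d": 0.1})
print(code_lengths(root))   # a has depth 1, b depth 2, c and d depth 3
```

For the 4-symbol example with frequencies 0.6, 0.2, 0.1, 0.1, the resulting depths 1, 2, 3, 3 match the second example tree shown earlier.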
Greedy Algorithms: Huffman Coding
• A DNA sequence has four characters 𝐴, 𝐶, 𝑇, 𝐺, and these characters appear with frequencies 30%, 20%, 10%, and 40%, respectively.
• We have to encode a sequence of length 1 million (10⁶) in bits.
• If we use two bits for each character, then the size of the encoding will be 2 million bits.
• Huffman coding:
  • 𝑓(𝐴) = 10, 𝑓(𝐶) = 110, 𝑓(𝑇) = 111, 𝑓(𝐺) = 0
  • We will need 1.9 million bits: (2 ⋅ 0.3 + 3 ⋅ 0.2 + 3 ⋅ 0.1 + 1 ⋅ 0.4) ⋅ 10⁶ = 1.9 ⋅ 10⁶.
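The bit count can be checked by actually encoding a sequence. The sketch below builds a hypothetical DNA string in which the four characters occur in exactly the stated proportions (the particular ordering is an arbitrary assumption; only the counts matter for the total), encodes it with the Huffman code from the slide, and compares the result with the fixed two-bit encoding. The name `encode` is illustrative.

```python
code = {"A": "10", "C": "110", "T": "111", "G": "0"}


def encode(sequence, code):
    """Concatenate the codewords of the characters in the sequence."""
    return "".join(code[ch] for ch in sequence)


# 10 characters with the stated proportions: 30% A, 20% C, 10% T, 40% G.
block = "A" * 3 + "C" * 2 + "T" * 1 + "G" * 4
sequence = block * 100_000            # 1,000,000 characters in total

bits = encode(sequence, code)
print(len(sequence))                  # 1000000
print(len(bits))                      # 1900000  (Huffman code)
print(2 * len(sequence))              # 2000000  (two bits per character)
```

Each 10-character block costs 3 ⋅ 2 + 2 ⋅ 3 + 1 ⋅ 3 + 4 ⋅ 1 = 19 bits, so the Huffman code saves 5% over the fixed-length code for these frequencies.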
End
