Data Compression
Unit I
Topics Covered:
Compression Techniques: Lossless compression, Lossy compression, Measures of
performance, Modeling and coding. Mathematical Preliminaries for Lossless Compression: A
brief introduction to information theory, Models: Physical models, Probability models,
Markov models, Composite source model, Coding: Uniquely decodable codes, Prefix codes.
Data Compression
Compression is used just about everywhere. All the images you get on the web are
compressed, typically in the JPEG or GIF formats; most modems use compression; HDTV will
be compressed using MPEG-2; several file systems automatically compress files when
stored; and the rest of us do it by hand. The neat thing about compression, as with the other
topics we will cover in this course, is that the algorithms used in the real world make heavy
use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs.
Furthermore, algorithms with strong theoretical foundations play a critical role in real-world
applications.
Compression Techniques
Lossless Methods
Lossless compression techniques involve no loss of information: the original data can be
recovered exactly from the compressed data.
A) Run-length
B) Huffman
C) Lempel-Ziv
Lossy Methods
Lossy compression techniques involve some loss of information, and data that have been
compressed using lossy techniques generally cannot be recovered or reconstructed exactly.
In return for accepting this distortion in the reconstruction, we can generally obtain much
higher compression ratios than are possible with lossless compression.
In many applications, this lack of exact reconstruction is not a problem. For example,
when storing or transmitting speech, the exact value of each sample of speech is not
necessary. Depending on the quality required of the reconstructed speech, varying amounts
of loss of information about the value of each sample can be tolerated. If the quality of the
reconstructed speech is to be similar to that heard on the telephone, a significant loss of
information can be tolerated. However, if the reconstructed speech needs to be of the
quality heard on a compact disc, the amount of information loss that can be tolerated is
much lower. Similarly, when viewing a reconstruction of a video sequence, the fact that the
reconstruction is different from the original is generally not important as long as the
differences do not result in annoying artifacts. Thus, video is generally compressed using
lossy compression.
A) JPEG
B) MPEG
C) MP3
Entropy
In the context of information theory Shannon simply replaced “state” with “message”, so S
is a set of possible messages, and p(s) is the probability of message s ∈ S. The entropy of the
set is

H(S) = Σ p(s) log2(1/p(s)), where the sum is over all s ∈ S.

Shannon also defined the notion of the self-information of a message as

i(s) = log2(1/p(s)),

measured in bits; the less probable a message is, the more information its occurrence conveys.
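For illustration only (not part of the original notes), the following Python sketch computes the self-information of individual messages and the entropy of a distribution; the probability values are made up.

import math

def self_information(p):
    """Self-information i(s) = log2(1/p(s)), in bits."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Entropy H(S) = sum over s of p(s) * log2(1/p(s)), in bits per message."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Made-up example distribution over four messages
probs = [0.5, 0.25, 0.125, 0.125]
for p in probs:
    print("p =", p, " i(s) =", self_information(p), "bits")
print("H(S) =", entropy(probs), "bits per message")   # 1.75 for this distribution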
Measures of Performance
In lossy compression, the reconstruction differs from the original data. Therefore, in
order to determine the efficiency of a compression algorithm, we need some way of
quantifying the difference. The difference between the original and the reconstruction is
often called the distortion.
Lossy techniques are generally used for the compression of data that originate as analog
signals, such as speech and video. In compression of speech and video, the final arbiter of
quality is the human observer. Because human responses are difficult to model mathematically, many
approximate measures of distortion are used to determine the quality of the reconstructed
waveforms.
Other terms that are also used when talking about differences between the reconstruction
and the original are fidelity and quality. When we say that the fidelity or quality of a
reconstruction is high, we mean that the difference between the reconstruction and the
original is small. Whether this difference is a mathematical difference or a perceptual
difference should be evident from the context.
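As a hedged illustration (the function names and sample values below are my own, not from the notes), one common approximate distortion measure is the mean squared error between the original and reconstructed samples, often reported alongside the compression ratio:

def compression_ratio(original_bits, compressed_bits):
    """Ratio of original size to compressed size (higher means more compression)."""
    return original_bits / compressed_bits

def mean_squared_error(original, reconstruction):
    """A common approximate distortion measure: the average squared difference
    between original and reconstructed samples."""
    assert len(original) == len(reconstruction)
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / len(original)

# Illustrative numbers only
original = [10, 12, 15, 14, 11]
reconstruction = [10, 13, 14, 14, 12]    # a lossy reconstruction that differs slightly
print("MSE distortion:", mean_squared_error(original, reconstruction))   # 0.6
print("Compression ratio:", compression_ratio(4096, 1024))               # 4.0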
The development of data compression algorithms for a variety of data can be divided
into two phases: modeling and coding. In the modeling phase we try to extract information
about any redundancy in the data and describe it in the form of a model; in the coding phase
we encode a description of the model and a description of how the data differ from the model.
Run-Length Coding
Run-length encoding is probably the simplest method of compression. It can be used to
compress data made of any combination of symbols. It does not need to know the frequency
of occurrence of symbols, and it can be very efficient if the data are represented as 0s and 1s.
The general idea behind this method is to replace a run of consecutive occurrences of a
symbol with the symbol followed by the number of occurrences.
The method can be even more efficient if the data uses only two symbols (for example 0
and 1) in its bit pattern and one symbol is more frequent than the other.
Example:
000000000000001000011000000000000
The run lengths are 14, 4, 0, and 12: we count the number of 0s before each 1 (and the
trailing 0s at the end) and convert each count into a 4-bit binary number:
As: 14=1110
4=0100
0=0000
12=1100
The compressed data is: 1110010000001100
Encoding Algorithm for Run Length Coding
1- Count the number of 0’s between two 1’s.
2- If the number is less than 15, write it down in binary form.
3- If it is greater than or equal to 15, write down 1111, followed by a binary number
indicating the rest of the 0s. If the count is more than 30, repeat this process.
4- If the data starts with a 1, write down 0000 at the beginning.
5- If the data ends with a 1, write down 0000 at the end.
6- Send the binary string.
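The following Python sketch is one possible implementation of the scheme above (the function name and the exact treatment of long and trailing runs are assumptions on my part, following rules 1-6 as stated):

def rle_encode(bits):
    """Encode a string of 0s and 1s as a sequence of 4-bit run lengths of 0s,
    following the rules listed above."""
    out = []
    run = 0
    for b in bits:
        if b == '0':
            run += 1
        else:                                # a '1' ends the current run of 0s
            while run >= 15:                 # rule 3: long runs emit 1111, then the rest
                out.append('1111')
                run -= 15
            out.append(format(run, '04b'))   # rules 2 and 4: a leading '1' yields 0000
            run = 0
    while run >= 15:                         # rule 5: the final group holds the trailing 0s
        out.append('1111')
        run -= 15
    out.append(format(run, '04b'))           # 0000 if the data ends with a 1
    return ''.join(out)

# The worked example from the notes
data = '000000000000001000011000000000000'
print(rle_encode(data))                      # 1110010000001100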
Huffman Code
Huffman codes are optimal prefix codes generated from a set of probabilities by a
particular algorithm, the Huffman Coding Algorithm. David Huffman developed the
algorithm as a student in a class on information theory at MIT in 1950. The algorithm
is now probably the most prevalently used component of compression algorithms,
used as the back end of GZIP, JPEG and many other utilities.
The Huffman algorithm is very simple and is most easily described in terms of how it
generates the prefix-code tree.
1- Start with a forest of trees, one for each message. Each tree contains a single
vertex with weight w_i = p_i.
2- Repeat the following two steps until only a single tree remains:
3- Select the two trees with the lowest weight roots (w1 and w2).
4- Combine them into a single tree by adding a new root with weight w1 + w2, and
making the two trees its children. It does not matter which is the left or right child,
but our convention will be to put the lower weight root on the left if w1 ≠ w2.
For a code of size n this algorithm will require n-1 steps since every complete binary
tree with n leaves has n-1 internal nodes, and each step creates one internal node. If
we use a priority queue with O(log n) time insertions and find-mins (e.g., a heap) the
algorithm will run in O(n log n) time. The key property of Huffman codes is that they
generate optimal prefix codes, a result originally proved by Huffman.
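A minimal Python sketch of the construction, assuming Python's heapq module as the priority queue (symbol names and probabilities in the example are made up):

import heapq

def huffman_code(probs):
    """Build a prefix code from a {symbol: probability} dict using Huffman's algorithm."""
    # Forest of single-vertex trees, one per message; the counter breaks ties so
    # heapq never has to compare the tree structures themselves.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    # Repeat until only a single tree remains: merge the two lowest-weight roots.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # lowest-weight root becomes the left child
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    # Walk the final tree: left edges get '0', right edges get '1'.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            # an empty prefix only occurs when there is a single message
            codes[tree] = prefix or '0'
    walk(heap[0][2], '')
    return codes

# Made-up probabilities
print(huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (exact codewords may vary with ties)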
Important Terms