Data Compression
Unit I
Topics Covered:
Compression Techniques: Lossless compression, Lossy compression, Measures of
performance, Modeling and coding. Mathematical Preliminaries for Lossless Compression: A
brief introduction to information theory, Models: Physical models, Probability models,
Markov models, Composite source model, Coding: Uniquely decodable codes, Prefix codes.
Data Compression
Compression is used just about everywhere. All the images you get on the web are
compressed, typically in the JPEG or GIF formats; most modems use compression; HDTV will
be compressed using MPEG-2; several file systems automatically compress files when
stored; and the rest of us do it by hand. The neat thing about compression, as with the other
topics we will cover in this course, is that the algorithms used in the real world make heavy
use of a wide set of algorithmic tools, including sorting, hash tables, tries, and FFTs.
Furthermore, algorithms with strong theoretical foundations play a critical role in real-world
applications.
Compression Techniques
Lossless Methods
Lossless compression techniques involve no loss of information: the original data can be
recovered exactly from the compressed data.
A) Run-length
B) Huffman
C) Lempel-Ziv
Lossy Methods
Lossy compression techniques involve some loss of information, and data that have been
compressed using lossy techniques generally cannot be recovered or reconstructed exactly.
In return for accepting this distortion in the reconstruction, we can generally obtain much
higher compression ratios than are possible with lossless compression.
In many applications, this lack of exact reconstruction is not a problem. For example,
when storing or transmitting speech, the exact value of each sample of speech is not
necessary. Depending on the quality required of the reconstructed speech, varying amounts
of loss of information about the value of each sample can be tolerated. If the quality of the
reconstructed speech is to be similar to that heard on the telephone, a significant loss of
information can be tolerated. However, if the reconstructed speech needs to be of the
quality heard on a compact disc, the amount of information loss that can be tolerated is
much lower. Similarly, when viewing a reconstruction of a video sequence, the fact that the
reconstruction is different from the original is generally not important as long as the
differences do not result in annoying artifacts. Thus, video is generally compressed using
lossy compression.
A) JPEG
B) MPEG
C) MP3
Entropy
In the context of information theory Shannon simply replaced “state” with “message”, so S
is a set of possible messages, and p(s) is the probability of message s ∈ S. The entropy of the
set is

H(S) = Σ p(s) log2(1/p(s)), where the sum is over all s ∈ S.

Shannon also defined the notion of the self-information of a message as

i(s) = log2(1/p(s)),

measured in bits; the less probable a message is, the more information its occurrence conveys.
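For illustration only (not part of the original notes), the following Python sketch computes the self-information of individual messages and the entropy of a distribution; the probability values are made up.

import math

def self_information(p):
    """Self-information i(s) = log2(1/p(s)), in bits."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Entropy H(S) = sum over s of p(s) * log2(1/p(s)), in bits per message."""
    return sum(p * self_information(p) for p in probs if p > 0)

# Made-up example distribution over four messages
probs = [0.5, 0.25, 0.125, 0.125]
for p in probs:
    print("p =", p, " i(s) =", self_information(p), "bits")
print("H(S) =", entropy(probs), "bits per message")   # 1.75 for this distribution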
Measures of Performance
In lossy compression, the reconstruction differs from the original data. Therefore, in
order to determine the efficiency of a compression algorithm, we need some way of
quantifying the difference. The difference between the original and the reconstruction is
often called the distortion.
Lossy techniques are generally used for the compression of data that originate as analog
signals, such as speech and video. In compression of speech and video, the final arbiter of
quality is the human observer. Because human responses are difficult to model mathematically, many
approximate measures of distortion are used to determine the quality of the reconstructed
waveforms.
Other terms that are also used when talking about differences between the reconstruction
and the original are fidelity and quality. When we say that the fidelity or quality of a
reconstruction is high, we mean that the difference between the reconstruction and the
original is small. Whether this difference is a mathematical difference or a perceptual
difference should be evident from the context.
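As a hedged illustration (the function names and sample values below are my own, not from the notes), one common approximate distortion measure is the mean squared error between the original and reconstructed samples, often reported alongside the compression ratio:

def compression_ratio(original_bits, compressed_bits):
    """Ratio of original size to compressed size (higher means more compression)."""
    return original_bits / compressed_bits

def mean_squared_error(original, reconstruction):
    """A common approximate distortion measure: the average squared difference
    between original and reconstructed samples."""
    assert len(original) == len(reconstruction)
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / len(original)

# Illustrative numbers only
original = [10, 12, 15, 14, 11]
reconstruction = [10, 13, 14, 14, 12]    # a lossy reconstruction that differs slightly
print("MSE distortion:", mean_squared_error(original, reconstruction))   # 0.6
print("Compression ratio:", compression_ratio(4096, 1024))               # 4.0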
The development of data compression algorithms for a variety of data can be divided
into two phases: modeling and coding. In the modeling phase we try to extract information
about any redundancy in the data and describe it in the form of a model; in the coding phase
we encode a description of the model and a description of how the data differ from the model.
Run-Length Coding
Run-length encoding is probably the simplest method of compression. It can be used to
compress data made of any combination of symbols. It does not need to know the frequency
of occurrence of symbols, and it can be very efficient if the data are represented as 0s and 1s.
The general idea behind this method is to replace a run of consecutive occurrences of a
symbol with the symbol followed by the number of occurrences.
The method can be even more efficient if the data uses only two symbols (for example 0
and 1) in its bit pattern and one symbol is more frequent than the other.
Example:
000000000000001000011000000000000
The run lengths are 14, 4, 0, and 12: we count the number of 0s before each 1 (and the
trailing 0s at the end) and convert each count into a 4-bit binary number:
As: 14=1110
4=0100
0=0000
12=1100
The compressed data is: 1110010000001100
Encoding Algorithm for Run Length Coding
1- Count the number of 0’s between two 1’s.
2- If the number is less than 15, write it down in binary form.
3- If it is greater than or equal to 15, write down 1111, followed by a binary number
indicating the rest of the 0s. If the count is more than 30, repeat this process.
4- If the data starts with a 1, write down 0000 at the beginning.
5- If the data ends with a 1, write down 0000 at the end.
6- Send the binary string.
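The following Python sketch is one possible implementation of the scheme above (the function name and the exact treatment of long and trailing runs are assumptions on my part, following rules 1-6 as stated):

def rle_encode(bits):
    """Encode a string of 0s and 1s as a sequence of 4-bit run lengths of 0s,
    following the rules listed above."""
    out = []
    run = 0
    for b in bits:
        if b == '0':
            run += 1
        else:                                # a '1' ends the current run of 0s
            while run >= 15:                 # rule 3: long runs emit 1111, then the rest
                out.append('1111')
                run -= 15
            out.append(format(run, '04b'))   # rules 2 and 4: a leading '1' yields 0000
            run = 0
    while run >= 15:                         # rule 5: the final group holds the trailing 0s
        out.append('1111')
        run -= 15
    out.append(format(run, '04b'))           # 0000 if the data ends with a 1
    return ''.join(out)

# The worked example from the notes
data = '000000000000001000011000000000000'
print(rle_encode(data))                      # 1110010000001100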
Huffman Code
Huffman codes are optimal prefix codes generated from a set of probabilities by a
particular algorithm, the Huffman Coding Algorithm. David Huffman developed the
algorithm as a student in a class on information theory at MIT in 1950. The algorithm
is now probably the most prevalently used component of compression algorithms,
used as the back end of GZIP, JPEG and many other utilities.
The Huffman algorithm is very simple and is most easily described in terms of how it
generates the prefix-code tree.
1- Start with a forest of trees, one for each message. Each tree contains a single
vertex with weight w_i = p_i.
2- Repeat the following two steps until only a single tree remains:
3- Select the two trees with the lowest weight roots (w1 and w2).
4- Combine them into a single tree by adding a new root with weight w1 + w2, and
making the two trees its children. It does not matter which is the left or right child,
but our convention will be to put the lower weight root on the left if w1 ≠ w2.
For a code of size n this algorithm will require n-1 steps since every complete binary
tree with n leaves has n-1 internal nodes, and each step creates one internal node. If
we use a priority queue with O(log n) time insertions and find-mins (e.g., a heap) the
algorithm will run in O(n log n) time. The key property of Huffman codes is that they
generate optimal prefix codes, a result originally proved by Huffman.
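A minimal Python sketch of the construction, assuming Python's heapq module as the priority queue (symbol names and probabilities in the example are made up):

import heapq

def huffman_code(probs):
    """Build a prefix code from a {symbol: probability} dict using Huffman's algorithm."""
    # Forest of single-vertex trees, one per message; the counter breaks ties so
    # heapq never has to compare the tree structures themselves.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    # Repeat until only a single tree remains: merge the two lowest-weight roots.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # lowest-weight root becomes the left child
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))
        counter += 1
    # Walk the final tree: left edges get '0', right edges get '1'.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            # an empty prefix only occurs when there is a single message
            codes[tree] = prefix or '0'
    walk(heap[0][2], '')
    return codes

# Made-up probabilities
print(huffman_code({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (exact codewords may vary with ties)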
Important Terms