
Data Compression Techniques

By Sukanta Behera
Reg. No. 07SBSCA048
Data Compression
 Lossless data compression: Store/transmit big files using few bytes so that the original files can be perfectly retrieved. Example: zip.

 Lossy data compression: Store/transmit big files using few bytes so that the original files can be approximately retrieved. Example: mp3.

 Motivation: Save storage space and/or bandwidth.
Definition of Codec
 Let Σ be an alphabet and let S ⊆ Σ* be a set of possible messages.

 A lossless codec (c, d) consists of
   a coder c: S → {0,1}*
   a decoder d: {0,1}* → Σ*
 so that ∀x ∈ S: d(c(x)) = x.
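In code, the definition is just a pair of mutually inverse maps. A minimal Python sketch with a made-up table of code words (the table is ours, not the slides'):

```python
# A lossless codec for S = {0,1}^2, messages written as bit strings.
# The code words are chosen by hand purely for illustration.
code = {"00": "0", "01": "10", "10": "110", "11": "111"}
inverse = {w: x for x, w in code.items()}   # c is injective, so this inverts it

def c(x):    # coder c: S -> {0,1}*
    return code[x]

def d(w):    # decoder d: {0,1}* -> S, defined on the code words c(S)
    return inverse[w]

# The defining property of a lossless codec: d(c(x)) = x for all x in S.
assert all(d(c(x)) == x for x in code)
```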
Remarks
 It is necessary for c to be an injective map.

 If we do not worry about efficiency, we don't have to specify d if we have specified c.

 Terminology: Sometimes we just say "code" rather than "codec".

 Terminology: The set c(S) is called the set of code words of the codec. In examples to follow, we often just state the set of code words.
Proposition
 Let S = {0,1}^n. Then, for any codec (c, d), there is some x ∈ S so that |c(x)| ≥ n.

 Why: there are 2^n messages but only 2^n − 1 binary strings of length less than n, so an injective coder cannot give every message a shorter code word.

 "Compression is impossible."
Proposition
 For any message x, there is a codec (c, d) so that |c(x)| = 1.

 "The Encyclopedia Britannica can be compressed to 1 bit."
Remarks
 We cannot compress all data. Thus, we must concentrate on compressing "relevant" data.

 It is trivial to compress data known in advance. We should concentrate on compressing data about which there is uncertainty.

 We will use probability theory as a tool to model uncertainty about relevant data.
Can random data be compressed?
 Suppose Σ = {0,1} and S = {0,1}^2.

 We know we cannot compress all data, but can we do well on the average?

 Let us assume the uniform distribution on S and look at the expected length of the code words, as the sketch below does.
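One way to make the question concrete, sketched in Python: take a hypothetical prefix code on S = {0,1}^2 and compute the expected code-word length when each message has probability 1/4.

```python
# A hypothetical prefix code on S = {0,1}^2.
code = {"00": "0", "01": "10", "10": "110", "11": "111"}

# Uniform distribution: every message has probability 1/4.
expected_length = sum(0.25 * len(w) for w in code.values())
print(expected_length)   # 2.25 -- no better than the trivial 2-bit code
```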
Definition of prefix codes
 A prefix code c is a code with the property that for all distinct messages x and y, c(x) is not a prefix of c(y). (A mechanical check is sketched below.)

 Example: fixed-length codes (such as ASCII).

 Example: {0, 11, 10}.

 All codes in this course will be prefix codes.
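The prefix property is easy to test mechanically. A short Python sketch (the function name is ours):

```python
def is_prefix_code(codewords):
    """True if no code word is a prefix of a different code word."""
    return not any(u != v and v.startswith(u)
                   for u in codewords for v in codewords)

print(is_prefix_code({"0", "11", "10"}))  # True  (the example above)
print(is_prefix_code({"0", "01", "11"}))  # False ("0" is a prefix of "01")
```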
Proposition
 If c is a prefix code for S = Σ^1, then c^n is a prefix code for S = Σ^n, where

   c^n(x₁ x₂ … xₙ) = c(x₁) · c(x₂) · … · c(xₙ)
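A Python sketch of the extended code: encoding is plain concatenation, and the prefix property is exactly what lets the decoder split the stream greedily. The alphabet and code below are made up:

```python
code = {"a": "0", "b": "10", "c": "11"}   # a prefix code on Σ = {a, b, c}
inverse = {w: x for x, w in code.items()}

def encode(msg):
    # c^n: concatenate the code words of the individual symbols
    return "".join(code[x] for x in msg)

def decode(bits):
    # Since c is a prefix code, exactly one code word can match at each
    # position, so a greedy left-to-right scan recovers the message.
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "input was not a concatenation of code words"
    return "".join(out)

assert decode(encode("abcab")) == "abcab"
```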
Prefix codes and trees
 Set of code words of a prefix code: {0, 11, 10}.

 [Figure: a rooted binary tree with edges labeled 0 and 1; the leaves, read from the root, give the code words 0, 10, and 11.]
Alternative view of prefix codes
 A prefix code is an assignment of the messages of S to the leaves of a rooted binary tree.

 The code word of a message x is found by reading the labels on the edges on the path from the root of the tree to the leaf corresponding to x. (See the sketch below.)
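One concrete rendering of this view, as a Python sketch: an internal node is a pair (left, right) of subtrees, a leaf is a message, and the code word is the string of edge labels on the way down. The three-message tree is made up:

```python
def codewords(tree, path=""):
    """Map each leaf (message) of a nested-pair binary tree to the
    code word spelled by the edge labels from the root: 0 left, 1 right."""
    if isinstance(tree, tuple):
        left, right = tree
        return {**codewords(left, path + "0"), **codewords(right, path + "1")}
    return {tree: path}   # a leaf: one message, one code word

# Leaf "x" under edge 0; leaves "y" and "z" under 10 and 11.
print(codewords(("x", ("y", "z"))))   # {'x': '0', 'y': '10', 'z': '11'}
```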
Binary trees and the interval [0,1)
 [Figure: the tree for {0, 10, 11} drawn over the number line 0, 1/4, 1/2, 3/4, 1; the leaf for 0 covers [0, 1/2), the leaf for 10 covers [1/2, 3/4), and the leaf for 11 covers [3/4, 1).]
Alternative view of prefix codes
 A prefix code is an assignment of the messages of S to disjoint dyadic intervals.

 A dyadic interval is a real interval of the form [k·2^−m, (k+1)·2^−m) with k+1 ≤ 2^m. The corresponding code word is the m-bit binary representation of k.
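This correspondence is directly computable: an m-bit code word is the m-bit binary expansion of some k, and it owns [k·2^−m, (k+1)·2^−m). A Python sketch with exact fractions:

```python
from fractions import Fraction

def dyadic_interval(codeword):
    """Map an m-bit code word, read as the binary expansion of k,
    to the endpoints of the dyadic interval [k/2^m, (k+1)/2^m)."""
    m = len(codeword)
    k = int(codeword, 2)
    return Fraction(k, 2**m), Fraction(k + 1, 2**m)

# The prefix code {0, 10, 11} from the figure: pairwise disjoint intervals
for w in ["0", "10", "11"]:
    print(w, dyadic_interval(w))   # [0,1/2), [1/2,3/4), [3/4,1) as fractions
```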
Kraft-McMillan Inequality
 Let m₁, m₂, … be the lengths of the code words of a prefix code. Then ∑ᵢ 2^−mᵢ ≤ 1.

 Let m₁, m₂, … be integers with ∑ᵢ 2^−mᵢ ≤ 1. Then there is a prefix code c so that the mᵢ are the lengths of the code words of c. (A construction is sketched below.)
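The constructive direction can be sketched in a few lines: sort the lengths and hand out dyadic intervals from left to right. A Python sketch (the function names are ours):

```python
from fractions import Fraction

def kraft_sum(lengths):
    return sum(Fraction(1, 2**m) for m in lengths)

def prefix_code_from_lengths(lengths):
    """Build a prefix code with the given word lengths, assuming
    the Kraft sum is at most 1."""
    assert kraft_sum(lengths) <= 1
    words, k = [], Fraction(0)
    for m in sorted(lengths):
        # k is a multiple of 2^-m at this point, so k*2^m is an integer
        # whose m-bit binary expansion becomes the next code word
        words.append(format(int(k * 2**m), "b").zfill(m))
        k += Fraction(1, 2**m)
    return words

print(kraft_sum([1, 2, 2]))                 # 1, so a prefix code exists
print(prefix_code_from_lengths([1, 2, 2]))  # ['0', '10', '11']
```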
Probability
 A probability distribution p on S is a map p: S → [0,1] so that ∑_{x∈S} p(x) = 1.

 A U-valued stochastic variable is a map Y: S → U.

 If Y: S → ℝ is a stochastic variable, its expected value E[Y] is ∑_{x∈S} p(x)·Y(x).
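A tiny numerical illustration of these definitions; the distribution and the stochastic variable are made up:

```python
# A distribution p on S = {a, b, c} and a real-valued stochastic variable Y.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
Y = {"a": 1.0, "b": 2.0, "c": 4.0}

# E[Y] = sum over x in S of p(x) * Y(x)
E = sum(p[x] * Y[x] for x in p)
print(E)   # 0.5*1 + 0.25*2 + 0.25*4 = 2.5
```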
Self-entropy
 Given a probability distribution p on S, the self-entropy of x ∈ S is defined as H(x) = −log₂ p(x).

 The self-entropy of a message with probability 1 is 0 bits.

 The self-entropy of a message with probability 0 is +∞.

 The self-entropy of a message with probability ½ is 1 bit.

 We often measure entropy in the unit "bits".
Entropy
 Given a probability distribution p on S, its entropy H[p] is defined as E[H], i.e.

   H[p] = −∑_{x∈S} p(x) log₂ p(x).

 For a stochastic variable X, its entropy H[X] is the entropy of its underlying distribution:

   H[X] = −∑ᵢ Pr[X=i] log₂ Pr[X=i]
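The formula translates directly into code. A Python sketch; the second distribution is made up to show a non-uniform case:

```python
from math import log2

def entropy(p):
    """H[p] = -sum of p(x) * log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

print(entropy({"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}))   # 2.0 bits
print(entropy({"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}))  # 1.75 bits
```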
Facts
 The entropy of the uniform distribution on {0,1}^n is n bits. Any other distribution on {0,1}^n has strictly smaller entropy.

 If X₁ and X₂ are independent stochastic variables, then H(X₁, X₂) = H(X₁) + H(X₂). (Checked numerically below.)

 For any function f, H(f(X)) ≤ H(X).
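The additivity fact can be checked numerically: for independent variables, the joint probability is the product of the marginals. A sketch with made-up marginals (the two printed values agree up to floating-point rounding):

```python
from math import log2

def entropy(p):
    return -sum(q * log2(q) for q in p.values() if q > 0)

p1 = {"a": 0.5, "b": 0.5}          # marginal distribution of X1
p2 = {"0": 0.75, "1": 0.25}        # marginal distribution of X2
joint = {(x, y): p1[x] * p2[y] for x in p1 for y in p2}

print(entropy(p1) + entropy(p2))   # H(X1) + H(X2)
print(entropy(joint))              # H(X1, X2): the same number
```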
Shannon's theorem
 Let S be a set of messages and let X be an S-valued stochastic variable.

 For all prefix codes c on S,
   E[|c(X)|] ≥ H[X].

 There is a prefix code c on S so that
   E[|c(X)|] < H[X] + 1.
 In fact, for all x in S, |c(x)| < H(x) + 1.
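The standard construction behind the upper bound is the Shannon code: give each message the length ⌈−log₂ p(x)⌉, i.e. its self-entropy rounded up. A Python sketch with a made-up distribution:

```python
from math import log2, ceil
from fractions import Fraction

p = {"a": 0.6, "b": 0.3, "c": 0.1}                   # made-up distribution
lengths = {x: ceil(-log2(q)) for x, q in p.items()}  # {'a': 1, 'b': 2, 'c': 4}

# These lengths satisfy Kraft-McMillan, so a prefix code with them exists.
assert sum(Fraction(1, 2**m) for m in lengths.values()) <= 1

H = -sum(q * log2(q) for q in p.values())
E = sum(p[x] * lengths[x] for x in p)
print(H, E)   # H = 1.295..., E = 1.6: indeed H <= E < H + 1
```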
