09 Basic Compression
Introduction to Compression
Dr Kirill Sidorov
[email protected]
www.facebook.com/kirill.sidorov
Estimated entropy for English text: H_English ≈ 0.6–1.3 bits/letter. (If all
26 letters and the space were equally probable, then it would be
H_0 = log2 27 ≈ 4.755 bits/letter.)
Shannon 1948
Basically:
The ideal code length for an event with probability p is L(p) = −log2 p
ones and zeros (or generally, −logb p if instead we use b possible values
for codes).
External link: Shannon’s original 1948 paper.
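As a quick illustration of these formulas (a minimal Python sketch, not part of the original slides), the ideal code length −log2 p and the bound H_0 = log2 27 can be computed directly:

from math import log2

# Ideal code length L(p) = -log2(p) for events of different probability.
for p in (0.5, 0.25, 0.05):
    print(f"p = {p:<5} ideal length = {-log2(p):.2f} bits")

# Bound if all 26 letters plus the space were equally likely:
print(f"H_0 = {log2(27):.3f} bits/letter")   # ~4.755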
Why lossy compression?
Basic reason:
• Compression ratio of lossless methods (e.g. Huffman coding,
arithmetic coding, LZW) is not high enough for audio/video.
• By cleverly making a small sacrifice in terms of fidelity of data, we
can often achieve very high compression ratios.
• Cleverly = sacrifice information that is perceptually unimportant.
Lossless compression algorithms
• Entropy encoding:
  • Shannon-Fano algorithm.
  • Huffman coding.
  • Arithmetic coding.
• Repetitive sequence suppression.
• Run-Length Encoding (RLE).
• Pattern substitution.
• Lempel-Ziv-Welch (LZW) algorithm.
Simple repetition suppression
If a series of successive identical tokens (e.g. zeros) appears, we can
replace the run with a flag and a repetition count. For example, 894 followed
by a run of 32 zeros can be encoded as
894f32
where f is the flag for zero and 32 is the length of the run.
Run-Length Encoding (RLE)
Original sequence:
111122233333311112222
can be encoded as (value, run length) pairs:
(1,4), (2,3), (3,6), (1,4), (2,4)
The savings are dependent on the data. In the worst case (random noise) the
encoding is larger than the original file:
2 × integer rather than 1 × integer if the original data is an integer
vector/array.
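A minimal Python sketch of RLE using the pair representation above (the names rle_encode/rle_decode are illustrative, not from the slides):

def rle_encode(seq):
    # Collapse each run into a (value, run length) pair.
    pairs = []
    for x in seq:
        if pairs and pairs[-1][0] == x:
            pairs[-1][1] += 1
        else:
            pairs.append([x, 1])
    return [tuple(p) for p in pairs]

def rle_decode(pairs):
    # Expand each pair back into its run.
    return [v for v, n in pairs for _ in range(n)]

data = [1,1,1,1,2,2,2,3,3,3,3,3,3,1,1,1,1,2,2,2,2]
print(rle_encode(data))                     # [(1, 4), (2, 3), (3, 6), (1, 4), (2, 4)]
assert rle_decode(rle_encode(data)) == data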
Basic idea of entropy coding:
• Count occurrences of tokens (to estimate probabilities).
• Assign shorter codes to more probable symbols and vice versa.
Example:
Consider a finite string S over alphabet {A, B, C, D, E}:
S = ACABADADEAABBAAAEDCACDEAAABCDBBEDCBACAE
Symbol A B C D E
Count 15 7 6 6 5
Shannon-Fano algorithm
Encoding with the Shannon-Fano algorithm
A top-down approach (a sketch follows below):
1 Sort symbols according to their frequencies/probabilities, e.g. A, B, C, D, E.
2 Recursively divide the symbols into two parts, each with approximately the
same total count; give one part a 0 and the other a 1, and repeat until every
part contains a single symbol.
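A minimal Python sketch of this top-down splitting (the name shannon_fano and the balancing heuristic are illustrative assumptions, not taken from the slides):

def shannon_fano(counts):
    # Sort symbols by decreasing frequency.
    symbols = sorted(counts, key=counts.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(counts[s] for s in group)
        # Find the split point that best balances the two halves' counts.
        running, best_i, best_diff = 0, 1, float("inf")
        for i, s in enumerate(group[:-1], start=1):
            running += counts[s]
            diff = abs(total - 2 * running)
            if diff < best_diff:
                best_diff, best_i = diff, i
        left, right = group[:best_i], group[best_i:]
        for s in left:
            codes[s] += "0"
        for s in right:
            codes[s] += "1"
        split(left)
        split(right)

    split(symbols)
    return codes

counts = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}
print(shannon_fano(counts))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
# -> 89 bits for the 39-symbol string, vs 39 x 3 = 117 bits with fixed 3-bit codes.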
The following points are worth noting about the Shannon-Fano and Huffman
algorithms:
• Decoding for both algorithms is trivial as long as the
coding table/book is sent before the data.
• There is a bit of an overhead for sending this.
• But it is negligible if |string| ≫ |alphabet|.
• Unique prefix property: no code is a prefix to any other code (all
symbols are at the leaf nodes) → great for decoder, unambiguous.
• If prior statistics are available and accurate, then Huffman coding is
very good.
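Huffman coding builds the code bottom-up by repeatedly merging the two least frequent symbols (or subtrees). A minimal Python sketch using the standard library heapq, applied to the string S from the earlier example (the name huffman_code is illustrative):

import heapq
from collections import Counter

def huffman_code(text):
    counts = Counter(text)
    # Heap items: (count, tie-breaker, {symbol: code-so-far}).
    heap = [(c, i, {sym: ""}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

S = "ACABADADEAABBAAAEDCACDEAAABCDBBEDCBACAE"
codes = huffman_code(S)
print(codes)                              # the most frequent symbol (A, count 15) gets the shortest code
print(sum(len(codes[c]) for c in S))      # 87 bits, vs 39 x 3 = 117 with fixed 3-bit codes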
Arithmetic coding
The idea behind arithmetic coding is to encode the entire message into a
single number, n, with 0.0 ≤ n < 1.0.
• Consider the probability line segment [0, 1), and
• assign to every symbol a range in this interval:
• the width of the range is proportional to the symbol's probability,
• its position is given by the cumulative probability of the preceding symbols.
Symbol Range
A [0.0, 0.5)
B [0.5, 0.75)
C [0.75, 1.0)
• Subdivide the range of the first symbol according to the probabilities of
the second symbol, then of the third symbol, etc.
For a symbol with sub-range [s_low, s_high) the update is:
high = low + range × s_high
low = low + range × s_low
(both computed from the old value of low), where:
• range = high − low keeps track of where (and how wide) the next range should be,
• high and low delimit the interval that specifies the output number,
• initially high = 1.0, low = 0.0.
Arithmetic coding example
Encode the message BACA. After the first symbol, B, we have
range = 0.25, low = 0.5, high = 0.75.
For the second symbol the interval is subdivided into:
Symbol Range
BA [0.5, 0.625)
BB [0.625, 0.6875)
BC [0.6875, 0.75)
We now reapply the subdivision to get, for our third symbol
(range = 0.125, low = 0.5, high = 0.625):
Symbol Range
BAA [0.5, 0.5625)
BAB [0.5625, 0.59375)
BAC [0.59375, 0.625)
Arithmetic coding example
Subdivide again:
(range = 0.03125, low = 0.59375, high = 0.625):
Symbol Range
BACA [0.59375, 0.609375)
BACB [0.609375, 0.6171875)
BACC [0.6171875, 0.625)
So the (unique) output code for BACA is any number in the range:
[0.59375, 0.609375).
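A minimal Python sketch of this interval-narrowing step, using the symbol ranges from the table above (the name arithmetic_encode is illustrative; a practical coder would also need renormalisation to avoid running out of floating-point precision):

# Symbol ranges as in the table above.
ranges = {"A": (0.0, 0.5), "B": (0.5, 0.75), "C": (0.75, 1.0)}

def arithmetic_encode(message):
    # Narrow [low, high) once per symbol; any number in the final interval codes the message.
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        s_low, s_high = ranges[sym]
        high = low + span * s_high
        low = low + span * s_low
    return low, high

print(arithmetic_encode("BACA"))   # (0.59375, 0.609375)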
Decoding
To decode, find which symbol's range the received number falls into, output
that symbol, then subtract the symbol's low bound and divide by the width of
its range, and repeat for the next symbol.
Representing the code in binary
Fractions in decimal:
0.1 decimal = 10^−1 = 1/10
0.01 decimal = 10^−2 = 1/100
0.11 decimal = 10^−1 + 10^−2 = 11/100
So in binary we get:
0.1 binary = 2^−1 = 1/2
0.01 binary = 2^−2 = 1/4
0.11 binary = 2^−1 + 2^−2 = 3/4
For example, consider two symbols with probabilities
prob(X) = 2/3
prob(Y) = 1/3
Suppose the binary expansions of the final low (L) and high (H) first differ
at bit t:
L = 0.d1 d2 d3 . . . d(t−1) 0 . . .
H = 0.d1 d2 d3 . . . d(t−1) 1 . . .
• We can select and transmit the t bits: d1 d2 d3 . . . d(t−1) 1.
Lempel-Ziv-Welch (LZW) algorithm
Basic idea/analogy:
Suppose we want to encode the Oxford Concise English
dictionary, which contains about 159,000 entries: why not just transmit each
word as an 18-bit number (2^18 = 262,144 > 159,000)?
Problems:
• Too many bits per word,
• Everyone needs a dictionary to decode back to English.
• Only works for English text.
Solution:
• Find a way to build the dictionary adaptively.
• Original methods (LZ) due to Lempel and Ziv in 1977/8.
• Quite a few variations on LZ.
• Terry Welch's improvement (1984): the (patented) LZW algorithm.
• LZW introduced the idea that only the initial dictionary needs to be
transmitted to enable decoding:
The decoder is able to build the rest of the table from the encoded
sequence.
LZW compression algorithm
w = NIL;
while ( read a character k ) {
    if wk exists in the dictionary
        w = wk;
    else {
        add wk to the dictionary;
        output the code for w;
        w = k;
    }
}
output the code for w;   /* flush the final phrase */
• Original LZW used a dictionary with 4K entries; the first 256 (0-255) are
the ASCII codes.
LZW compression algorithm example:
Input string is "^WED^WE^WEE^WEB^WET".
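The step-by-step trace is not reproduced here; instead, a minimal Python sketch of the encoder above (the name lzw_compress is illustrative), run on this input:

def lzw_compress(text):
    # Initial dictionary: the 256 single-byte characters.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w, out = "", []
    for k in text:
        wk = w + k
        if wk in dictionary:
            w = wk
        else:
            dictionary[wk] = next_code       # add wk to the dictionary
            next_code += 1
            out.append(dictionary[w])        # output the code for w
            w = k
    if w:
        out.append(dictionary[w])            # flush the final phrase
    return out

print(lzw_compress("^WED^WE^WEE^WEB^WET"))
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
# i.e. ^ W E D <256> E <260> <261> <257> B <260> T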
LZW decompression algorithm (basic version):
read a character k;
output k;
w = k;
while ( read a character k )
/* k could be a character or a code. */
{
    entry = dictionary entry for k;
    output entry;
    add w + entry[0] to dictionary;
    w = entry;
}
Note: LZW decoder only needs the initial dictionary. The decoder is able
to build the rest of the table from the encoded sequence.
LZW decompression algorithm with special-case handling:
read a character k;
output k;
w = k;
while ( read a character k )
/* k could be a character or a code. */
{
    entry = dictionary entry for k;
    /* Special case: k is a code the decoder has not created yet
       (the encoder runs one step ahead), so it must be w + w[0]. */
    if (entry == NIL)   /* not found */
        entry = w + w[0];
    output entry;
    if (w != NIL)
        add w + entry[0] to dictionary;
    w = entry;
}
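A minimal Python sketch of this decoder (the name lzw_decompress is illustrative), which rebuilds the dictionary on the fly and inverts the output of the lzw_compress sketch above:

def lzw_decompress(codes):
    # Initial dictionary: the 256 single-byte characters.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    codes = iter(codes)
    w = dictionary[next(codes)]
    out = [w]
    for k in codes:
        if k in dictionary:
            entry = dictionary[k]
        else:
            entry = w + w[0]                  # special case: code not yet in the table
        out.append(entry)
        dictionary[next_code] = w + entry[0]  # add w + entry[0] to the dictionary
        next_code += 1
        w = entry
    return "".join(out)

codes = lzw_compress("^WED^WE^WEE^WEB^WET")
assert lzw_decompress(codes) == "^WED^WE^WEE^WEB^WET"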
Transform coding
A simple transform coding example, for a 2×2 block of pixels A, B (top row)
and C, D (bottom row):
1 Take the top-left pixel A as the base value for the block.
2 Calculate three other transformed values by taking the
difference between these (respective) pixels and pixel A,
i.e. B − A, C − A, D − A.
3 Store the base pixel and the differences as the values of the
transform.
Transform coding example
Given the above we can easily form the forward transform:
X0 = A
X1 = B − A
X2 = C − A
X3 = D − A
The inverse transform recovers the original pixels:
A = X0
B = X1 + X0
C = X2 + X0
D = X3 + X0
Compressing data with this transform? If the 2×2 block of pixel values is
120 130
125 120
then we get:
X0 = 120
X1 = 10
X2 = 5
X3 = 0
We can then compress these values by using fewer bits to represent the small
difference values than would be needed for the raw pixel values.
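A minimal Python sketch of this forward/inverse transform pair (helper names are illustrative), applied to the block above:

def forward_transform(block):
    # block = [[A, B], [C, D]] -> [X0, X1, X2, X3]
    (A, B), (C, D) = block
    return [A, B - A, C - A, D - A]

def inverse_transform(X):
    X0, X1, X2, X3 = X
    return [[X0, X1 + X0], [X2 + X0, X3 + X0]]

block = [[120, 130], [125, 120]]
X = forward_transform(block)
print(X)                                  # [120, 10, 5, 0]
assert inverse_transform(X) == block      # lossless as long as X is not quantised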
Transform coding example: discussion
A closely related idea is differential (predictive) encoding: predict each
value from the previous one (0 for the first) and store only the differences.
Actual data:    9 10  7  6
Predicted data: 0  9 10  7
Difference:    +9 +1 −3 −1
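A minimal Python sketch of this previous-sample predictor (helper names are illustrative):

def differential_encode(samples):
    # Predict each sample by the previous one (0 for the first); keep the differences.
    diffs, prev = [], 0
    for x in samples:
        diffs.append(x - prev)
        prev = x
    return diffs

def differential_decode(diffs):
    out, prev = [], 0
    for d in diffs:
        prev += d
        out.append(prev)
    return out

data = [9, 10, 7, 6]
print(differential_encode(data))          # [9, 1, -3, -1]
assert differential_decode(differential_encode(data)) == data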
Vector quantisation
• Search engine (encoder):
• Group (cluster) the data into vectors (blocks).
• For each block, find the closest code vector in the code book.
• When decoding, the output needs to be unblocked (smoothed).
Vector quantisation code book construction
How to cluster data?
• Use some clustering technique,
e.g. K-means, Voronoi decomposition.
• Essentially, cluster on some closeness measure, minimising the
within-cluster variance or distance.
K-Means
• Initialise: choose k initial centroids (e.g. k randomly chosen data points).
• Assign: assign each point to the cluster whose centroid yields the least
within-cluster squared distance. (This partitions according to the
Voronoi diagram with seeds = centroids.)
• Update: set each new centroid to be the mean of the points assigned to
its cluster.
• Repeat the assign and update steps until the assignments stop changing
(a minimal sketch follows below).
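A minimal NumPy sketch of these two alternating steps (the name kmeans, the fixed iteration count and the random initialisation are illustrative assumptions):

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise: k randomly chosen data points as centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

pts = np.random.default_rng(1).random((100, 2))   # 100 random 2-D points
centroids, labels = kmeans(pts, k=4)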
Vector Quantisation Code Book Construction
How to code?
• For each cluster, choose the mean (or median) point as the
representative code vector for all points in that cluster (see the sketch below).
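A minimal NumPy sketch of the resulting encoder/decoder, reusing the kmeans sketch above to build the code book (all names are illustrative; blocks here stand for image blocks flattened to vectors):

def vq_encode(blocks, codebook):
    # Index of the nearest code vector for each block.
    dists = np.linalg.norm(blocks[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def vq_decode(indices, codebook):
    # Reconstruction is a simple table lookup.
    return codebook[indices]

blocks = np.random.default_rng(2).random((50, 4))   # e.g. 50 flattened 2x2 blocks
codebook, _ = kmeans(blocks, k=8)                   # code book from clustering
indices = vq_encode(blocks, codebook)               # what gets stored/transmitted
reconstruction = vq_decode(indices, codebook)       # approximate blocks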
Vector quantisation image coding example