4th stage    Subject: Data Compression
Lecture No.: 8
Lecture time: 10:30 AM-2:30 PM    Classroom No.:
Instructor: Dr. Ali Kadhum AL-Quraby    Department of Computer Science
Dictionary Methods
Statistical compression methods use a statistical model of the data, which is why the quality of
compression they achieve depends on how good that model is.
Dictionary-based compression methods do not use a statistical model, nor do they use variable-size
codes. Instead, they select strings of symbols and encode each string as a token using a dictionary.
String Compression
In general, compression methods based on strings of symbols can be more efficient than methods that
compress individual symbols. In principle, better compression is possible if the symbols of the
alphabet have very different probabilities of occurrence. We use a simple example to show that the
probabilities of strings of symbols vary more than the probabilities of the individual symbols
constituting the strings.
We start with a 2-symbol alphabet a1 and a2, with probabilities P1 = 0.8 and P2 = 0.2, respectively.
The average probability is 0.5, and we can get an idea of the variance (how much the individual
probabilities deviate from the average) by calculating the sum of absolute differences:
variance = |0.8 - 0.5| + |0.2 - 0.5| = 0.6.
Any variable-size code would assign 1-bit codes to the two symbols, so the average size of the code
is one bit per symbol.
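As a quick check, the deviation sum used above can be computed directly. The short Python sketch below is not part of the lecture; it simply evaluates the same arithmetic for the single-symbol alphabet:

    # Deviation-sum "variance": sum of |p - average probability| over all symbols.
    probs = [0.8, 0.2]
    avg = sum(probs) / len(probs)                  # 0.5
    variance = sum(abs(p - avg) for p in probs)    # |0.8-0.5| + |0.2-0.5| = 0.6
    print(avg, variance)                           # prints 0.5 0.6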
We now generate all the strings of two symbols. There are four of them, shown in Table 3.1a,
String    Probability           Code
a1a1      0.8 × 0.8 = 0.64      0
a1a2      0.8 × 0.2 = 0.16      11
a2a1      0.2 × 0.8 = 0.16      100
a2a2      0.2 × 0.2 = 0.04      101
together with their probabilities and a set of Huffman codes. The average probability is 0.25, so a
sum of absolute differences similar to the one above yields
variance = |0.64 - 0.25| + |0.16 - 0.25| + |0.16 - 0.25| + |0.04 - 0.25| = 0.78.
The average size of the Huffman code is
1 × 0.64 + 2 × 0.16 + 3 × 0.16 + 3 × 0.04 = 1.56 bits per string, which is 0.78 (1.56/2 = 0.78) bits per
symbol.
In the next step we similarly create all eight strings of three symbols. They are shown in Table 3.1b,

String     Probability                 Code
a1a1a1     0.8 × 0.8 × 0.8 = 0.512     0
a1a1a2     0.8 × 0.8 × 0.2 = 0.128     100
a1a2a1     0.8 × 0.2 × 0.8 = 0.128     101
a1a2a2     0.8 × 0.2 × 0.2 = 0.032     11100
a2a1a1     0.2 × 0.8 × 0.8 = 0.128     110
a2a1a2     0.2 × 0.8 × 0.2 = 0.032     11101
a2a2a1     0.2 × 0.2 × 0.8 = 0.032     11110
a2a2a2     0.2 × 0.2 × 0.2 = 0.008     11111
together with their probabilities and a set of Huffman codes. The average probability is 0.125, so a
sum of absolute differences similar to the ones above yields
variance = |0.512 - 0.125| + 3|0.128 - 0.125| + 3|0.032 - 0.125| + |0.008 - 0.125| = 0.792.
The average size of the Huffman code in this case is
1 × 0.512 + 3 × 3 × 0.128 + 5 × 3 × 0.032 + 5 × 0.008 = 2.184 bits per string, which equals 0.728
(2.184/3) bits per symbol.
As we keep generating longer and longer strings, the probabilities of the strings differ more and more
from their average, and the average code size gets better (Table 3.1c).
Str.    Variance    Avg. size
size    of prob.    of code
1       0.6         1
2       0.78        0.78
3       0.792       0.728
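The same experiment can be automated. The Python sketch below is not from the lecture; it reproduces Table 3.1c by enumerating all strings of length 1, 2, and 3 over {a1, a2}, building a Huffman code with a standard heap-based construction, and printing the deviation-sum variance and the average bits per symbol (the function and variable names are my own):

    import heapq
    from itertools import count, product

    P = {"a1": 0.8, "a2": 0.2}

    def huffman_lengths(probs):
        """Return the Huffman code length (in bits) for each probability."""
        tie = count(len(probs))                     # unique tie-breakers for the heap
        heap = [(p, i, [i]) for i, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        while len(heap) > 1:
            p1, _, leaves1 = heapq.heappop(heap)    # two least-probable subtrees
            p2, _, leaves2 = heapq.heappop(heap)
            for leaf in leaves1 + leaves2:
                lengths[leaf] += 1                  # each merge adds one bit to these codes
            heapq.heappush(heap, (p1 + p2, next(tie), leaves1 + leaves2))
        return lengths

    for n in (1, 2, 3):
        strings = list(product(P, repeat=n))
        probs = []
        for s in strings:
            p = 1.0
            for sym in s:
                p *= P[sym]                         # symbols are independent
            probs.append(p)
        avg = 1 / len(probs)
        variance = sum(abs(p - avg) for p in probs)
        lengths = huffman_lengths(probs)
        bits_per_symbol = sum(p * l for p, l in zip(probs, lengths)) / n
        print(n, round(variance, 3), round(bits_per_symbol, 3))
    # Expected output:
    # 1 0.6 1.0
    # 2 0.78 0.78
    # 3 0.792 0.728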
LZ77 (Sliding Window)
The principle of this method (which is sometimes referred to as LZ1) [Ziv and Lempel 77] is to use
part of the previously-seen input stream as the dictionary. The encoder maintains a window to the
input stream and shifts the input in that window from right to left as strings of symbols are being
encoded. Thus, the method is based on a sliding window. The window below is divided into two
parts. The part on the left is the search buffer. This is the current dictionary, and it includes symbols
that have recently been input and encoded. The part on the right is the look-ahead buffer, containing
text yet to be encoded. In practical implementations the search buffer is some thousands of bytes long,
while the look-ahead buffer is only tens of bytes long. The vertical bar between the t and the e below
represents the current dividing line between the two buffers. We assume that the text
sir_sid_eastman_easily_t has already been compressed, while the text eases_sea_sick_seals still
needs to be compressed.
coded text . . . sir_sid_eastman_easily_t|eases_sea_sick_seals . . . text to be read
The encoder scans the search buffer backwards (from right to left) looking for a match for the first
symbol e in the look-ahead buffer. It finds one at the e of the word easily. This e is at a distance
(offset) of 8 from the end of the search buffer. The encoder then matches as many symbols following
the two e's as possible. Three symbols eas match in this case, so the length of the match is 3. The
encoder then continues the backward scan, trying to find longer matches. In our case, there is one
more match at the word eastman, with offset 16, and it has the same length. The encoder selects the
longest match or, if they are all the same length, the last one found, and prepares the token (16, 3, "e").
It is possible to follow LZ77 with Huffman, or some other statistical coding of the tokens, where
small offsets are assigned shorter codes. This method, proposed by Bernd Herd, is called LZH.
Having many small offsets implies better compression in LZH.
In general, an LZ77 token has three parts: offset, length, and next symbol in the look-ahead buffer
(which, in our case, is the second e of the word teases). This token is written on the output stream,
and the window is shifted to the right (or, alternatively, the input stream is moved to the left) four
positions: three positions for the matched string and one position for the next symbol.
sir_sid_eastman_easily_tease|s_sea_sick_seals . . .
If the backward search yields no match, an LZ77 token with zero offset and length and with the
unmatched symbol is written. This is also the reason a token has a third component. Tokens with
zero offset and length are common at the beginning of a compression job, when the search buffer is
empty or almost empty. The first five steps in encoding our example are the following:
|sir_sid_eastman_        ⇒ (0, 0, "s")
s|ir_sid_eastman_e       ⇒ (0, 0, "i")
si|r_sid_eastman_ea      ⇒ (0, 0, "r")
sir|_sid_eastman_eas     ⇒ (0, 0, "_")
sir_|sid_eastman_easi    ⇒ (4, 2, "d")
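To make the procedure concrete, here is a minimal LZ77 encoder sketch in Python. It is not the lecture's reference implementation: the buffer sizes are toy values, matches are restricted to the search buffer, and ties between equal-length matches keep the one found last in the backward scan (the largest offset), as described above.

    def lz77_encode(text, search_size=32, lookahead_size=8):
        """Encode text as a list of (offset, length, next symbol) tokens."""
        tokens = []
        pos = 0
        while pos < len(text):
            search = text[max(0, pos - search_size):pos]    # current dictionary
            lookahead = text[pos:pos + lookahead_size]
            best_offset, best_len = 0, 0
            for i in range(len(search)):                    # leftmost match = largest offset
                length = 0
                while (length < len(lookahead) - 1 and
                       i + length < len(search) and
                       search[i + length] == lookahead[length]):
                    length += 1
                if length > best_len:                       # strict > keeps largest offset on ties
                    best_len = length
                    best_offset = len(search) - i           # distance from end of search buffer
            next_symbol = lookahead[best_len]
            tokens.append((best_offset, best_len, next_symbol))
            pos += best_len + 1                             # slide window past match + next symbol
        return tokens

    print(lz77_encode("sir_sid_eastman_easily_teases_sea_sick_seals"))
    # The first tokens are (0, 0, 's'), (0, 0, 'i'), (0, 0, 'r'), (0, 0, '_'), (4, 2, 'd'), ...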
LZ78
The LZ78 method (which is sometimes referred to as LZ2) [Ziv and Lempel 78] does not use any
search buffer, look-ahead buffer, or sliding window. Instead, there is a dictionary of previously
encountered strings. This dictionary starts empty (or almost empty).
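A minimal LZ78 encoder sketch (my own illustration, not code from the lecture) shows how such a dictionary grows: it starts with only the empty string, each output token is (index of the longest dictionary phrase that matches, next symbol), and that phrase extended by the symbol becomes a new dictionary entry.

    def lz78_encode(text):
        """Encode text as a list of (dictionary index, next symbol) tokens."""
        dictionary = {"": 0}                  # the "almost empty" starting dictionary
        tokens = []
        phrase = ""
        for ch in text:
            if phrase + ch in dictionary:
                phrase += ch                  # keep extending the known phrase
            else:
                tokens.append((dictionary[phrase], ch))
                dictionary[phrase + ch] = len(dictionary)   # learn the new phrase
                phrase = ""
        if phrase:                            # flush a leftover phrase at end of input
            tokens.append((dictionary[phrase[:-1]], phrase[-1]))
        return tokens

    print(lz78_encode("sir_sid_eastman_easily_teases"))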