4th stage    Subject: Data Compression    Lecture No.: 8    Lecture time: 10:30 AM-2:30 PM    Class room no.:    Instructor: Dr. Ali Kadhum AL-Quraby    Department of Computer Science

Dictionary Methods

Statistical compression methods use a statistical model of the data, which is why the quality of compression they achieve depends on how good that model is. Dictionary-based compression methods do not use a statistical model, nor do they use variable-size codes. Instead they select strings of symbols and encode each string as a token using a dictionary.

String Compression

In general, compression methods based on strings of symbols can be more efficient than methods that compress individual symbols. In principle, better compression is possible if the symbols of the alphabet have very different probabilities of occurrence. We use a simple example to show that the probabilities of strings of symbols vary more than the probabilities of the individual symbols constituting the strings.

We start with a 2-symbol alphabet a1 and a2, with probabilities P1 = 0.8 and P2 = 0.2, respectively. The average probability is 0.5, and we can get an idea of the variance (how much the individual probabilities deviate from the average) by calculating the sum of absolute differences:

    variance = |0.8 - 0.5| + |0.2 - 0.5| = 0.6.

Any variable-size code would assign 1-bit codes to the two symbols, so the average size of the code is one bit per symbol.

We now generate all the strings of two symbols. There are four of them, shown in Table 3.1a together with their probabilities and a set of Huffman codes.

    Table 3.1a
    String   Probability          Code
    a1a1     0.8 x 0.8 = 0.64     0
    a1a2     0.8 x 0.2 = 0.16     11
    a2a1     0.2 x 0.8 = 0.16     100
    a2a2     0.2 x 0.2 = 0.04     101

The average probability is 0.25, so a sum of absolute differences similar to the one above yields

    variance = |0.64 - 0.25| + |0.16 - 0.25| + |0.16 - 0.25| + |0.04 - 0.25| = 0.78.

The average size of the Huffman code is 1 x 0.64 + 2 x 0.16 + 3 x 0.16 + 3 x 0.04 = 1.56 bits per string, which is 0.78 (= 1.56/2) bits per symbol.

In the next step we similarly create all eight strings of three symbols. They are shown in Table 3.1b together with their probabilities and a set of Huffman codes.

    Table 3.1b
    String    Probability                Code
    a1a1a1    0.8 x 0.8 x 0.8 = 0.512    0
    a1a1a2    0.8 x 0.8 x 0.2 = 0.128    100
    a1a2a1    0.8 x 0.2 x 0.8 = 0.128    101
    a1a2a2    0.8 x 0.2 x 0.2 = 0.032    11100
    a2a1a1    0.2 x 0.8 x 0.8 = 0.128    110
    a2a1a2    0.2 x 0.8 x 0.2 = 0.032    11101
    a2a2a1    0.2 x 0.2 x 0.8 = 0.032    11110
    a2a2a2    0.2 x 0.2 x 0.2 = 0.008    11111

The average probability is 0.125, so a sum of absolute differences similar to the ones above yields

    variance = |0.512 - 0.125| + 3|0.128 - 0.125| + 3|0.032 - 0.125| + |0.008 - 0.125| = 0.792.

The average size of the Huffman code in this case is 1 x 0.512 + 3 x (3 x 0.128) + 5 x (3 x 0.032) + 5 x 0.008 = 2.184 bits per string, which equals 0.728 (= 2.184/3) bits per symbol.

As we keep generating longer and longer strings, the probabilities of the strings deviate more and more from their average, and the average code size gets better (Table 3.1c).

    Table 3.1c
    String size    Variance of prob.    Avg. size of code
    1              0.6                  1
    2              0.78                 0.78
    3              0.792                0.728
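The rows of Table 3.1c can be checked directly. The following Python sketch (not part of the original lecture; the helper names huffman_lengths and string_stats are mine) enumerates all 2^n strings over {a1, a2}, assigns each its probability, builds a Huffman code for the strings, and prints the variance and the average code size per symbol:

import heapq
from itertools import product
from math import prod

def huffman_lengths(probs):
    """Return Huffman code lengths (in bits) for a list of probabilities."""
    # heap entries: (probability, unique tie-breaker, indices of merged symbols)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:          # each merge adds one bit to every member
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def string_stats(n, p=(0.8, 0.2)):
    """Variance (sum of |prob - average|) and Huffman bits per symbol
    over all 2**n strings of n symbols."""
    probs = [prod(p[s] for s in word) for word in product((0, 1), repeat=n)]
    avg = 1 / len(probs)
    variance = sum(abs(q - avg) for q in probs)
    bits = sum(l * q for l, q in zip(huffman_lengths(probs), probs))
    return variance, bits / n

for n in (1, 2, 3):
    v, b = string_stats(n)
    print(f"n={n}: variance={v:.3f}, bits/symbol={b:.3f}")
# prints 0.600/1.000, 0.780/0.780, 0.792/0.728 -- the rows of Table 3.1c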
LZ77 (Sliding Window)

The principle of this method (which is sometimes referred to as LZ1) [Ziv and Lempel 77] is to use part of the previously-seen input stream as the dictionary. The encoder maintains a window to the input stream and shifts the input in that window from right to left as strings of symbols are being encoded. Thus, the method is based on a sliding window. The window below is divided into two parts. The part on the left is the search buffer. This is the current dictionary, and it includes symbols that have recently been input and encoded. The part on the right is the look-ahead buffer, containing text yet to be encoded. In practical implementations the search buffer is some thousands of bytes long, while the look-ahead buffer is only tens of bytes long. The vertical bar between the t and the e below represents the current dividing line between the two buffers. We assume that the text sir_sid_eastman_easily_t has already been compressed, while the text eases_sea_sick_seals still needs to be compressed.

    <- coded text...  sir_sid_eastman_easily_t|eases_sea_sick_seals...  <- text to be read

The encoder scans the search buffer backwards (from right to left) looking for a match for the first symbol e in the look-ahead buffer. It finds one at the e of the word easily. This e is at a distance (offset) of 8 from the end of the search buffer. The encoder then matches as many symbols following the two e's as possible. Three symbols eas match in this case, so the length of the match is 3. The encoder then continues the backward scan, trying to find longer matches. In our case, there is one more match, at the word eastman, with offset 16, and it has the same length. The encoder selects the longest match or, if they are all the same length, the last one found, and prepares the token (16, 3, "e").

It is possible to follow LZ77 with Huffman, or some other statistical coding of the tokens, where small offsets are assigned shorter codes. This method, proposed by Bernd Herd, is called LZH. Having many small offsets implies better compression in LZH.

In general, an LZ77 token has three parts: offset, length, and next symbol in the look-ahead buffer (which, in our case, is the second e of the word teases). This token is written on the output stream, and the window is shifted to the right (or, alternatively, the input stream is moved to the left) four positions: three positions for the matched string and one position for the next symbol.

    sir_sid_eastman_easily_tease|s_sea_sick_seals...

If the backward search yields no match, an LZ77 token with zero offset and length and with the unmatched symbol is written. This is also the reason a token has a third component. Tokens with zero offset and length are common at the beginning of any compression job, when the search buffer is empty or almost empty. The first five steps in encoding our example are the following:

    |sir_sid_eastman        => (0, 0, "s")
    s|ir_sid_eastman_e      => (0, 0, "i")
    si|r_sid_eastman_ea     => (0, 0, "r")
    sir|_sid_eastman_eas    => (0, 0, "_")
    sir_|sid_eastman_easi   => (4, 2, "d")
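The matching procedure described above is easy to express in code. Here is a minimal LZ77 encoder sketch in Python (not the lecture's code; the buffer-size parameters and the tie-breaking comparison are my reading of the text above). It emits (offset, length, next symbol) tokens, with the offset counted backwards from the dividing line:

def lz77_encode(text, search_size=4096, lookahead_size=16):
    """Minimal LZ77 encoder: emits (offset, length, next-symbol) tokens."""
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_off = 0, 0
        # leave at least one symbol to serve as the token's third component
        max_len = min(lookahead_size, len(text) - pos - 1)
        # scan the search buffer backwards (right to left)
        for start in range(pos - 1, max(0, pos - search_size) - 1, -1):
            length = 0
            # a match may extend past the dividing line (overlap is allowed)
            while length < max_len and text[start + length] == text[pos + length]:
                length += 1
            # '>=' keeps the last match found on ties, so of the two
            # equal-length matches in the example, (16, 3, "e") wins
            if length > 0 and length >= best_len:
                best_len, best_off = length, pos - start
        tokens.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return tokens

# The first five tokens reproduce the steps listed above:
# lz77_encode("sir_sid_eastman_easily_teases_sea_sick_seals")
# -> [(0,0,'s'), (0,0,'i'), (0,0,'r'), (0,0,'_'), (4,2,'d'), ...]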
LZ78

The LZ78 method (which is sometimes referred to as LZ2) [Ziv and Lempel 78] does not use any search buffer, look-ahead buffer, or sliding window. Instead, there is a dictionary of previously encountered strings. This dictionary starts empty (or almost empty).
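The lecture text stops here, before giving the LZ78 token format. In the standard formulation each token is a pair (dictionary index, next symbol), with index 0 standing for the empty prefix; the following Python sketch assumes that formulation:

def lz78_encode(text):
    """Minimal LZ78 encoder: emits (dictionary-index, symbol) tokens.
    Index 0 stands for the empty string; the dictionary starts empty
    and grows by one phrase per token."""
    dictionary = {}              # phrase -> index (1-based)
    tokens, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch         # keep extending the current phrase
        else:
            tokens.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                   # flush a trailing phrase that is already known
        tokens.append((dictionary[phrase], ""))
    return tokens

# lz78_encode("sir_sid_eastman")
# -> [(0,'s'), (0,'i'), (0,'r'), (0,'_'), (1,'i'), (0,'d'), (4,'e'), (0,'a'), ...]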