Text Processing: Data Structures and Algorithms in Java 1/47
Text Processing
String S a b c a b a a b c a b a c
Pattern p a b a a
p a b a a
p a b a a
Brute-Force algorithm demo - 3
Step 3: compare p[3] with S[3]
S a b c a b a a b c a b a c
p a b a a
Mismatch occurs here.
p a b a a
Finally, a match would be found after shifting p three times to the right side.
Drawbacks of this approach: if m is the length of pattern p and n the length of string
S, the worst-case matching time is of the order O(mn), which is slow for long inputs.
What makes this approach slow is that elements of S that have already been
compared are compared again in later iterations. For example, when the first
mismatch is detected in the comparison of p[3] with S[3], pattern p is moved one
position to the right and matching resumes from there. The first comparison after
the shift is between p[1]=a and S[2]=b. Note that S[2]=b was already involved in a
comparison in step 2, so this is a repeated use of S[2].
It is these repeated comparisons that lead to the runtime of O(mn).
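The brute-force procedure described above can be sketched in Java as follows. This is only an illustrative sketch: the class name BruteForceMatch and the method find are hypothetical, and the code uses Java's 0-based string indices, so a reported position of 3 corresponds to three shifts of p.

```java
// Illustrative sketch only; BruteForceMatch and find are hypothetical names.
class BruteForceMatch {

    // Returns the 0-based index of the first occurrence of p in s, or -1 if none.
    static int find(String s, String p) {
        int n = s.length();
        int m = p.length();
        // Try every alignment (shift) of p against s.
        for (int shift = 0; shift + m <= n; shift++) {
            int j = 0;
            // Compare left to right until a mismatch or the end of p.
            while (j < m && s.charAt(shift + j) == p.charAt(j)) {
                j++;
            }
            if (j == m) {
                return shift; // all m characters matched
            }
            // On a mismatch, p moves one position to the right and the
            // comparison restarts from p's first character, so characters
            // of s may be examined again: this is the source of O(mn).
        }
        return -1;
    }

    public static void main(String[] args) {
        // String and pattern from the demo above.
        System.out.println(find("abcabaabcabac", "abaa"));
    }
}
```

Each alignment may re-read up to m characters of S, which is where the O(mn) bound comes from.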
The Knuth-Morris-Pratt Algorithm
Knuth, Morris and Pratt proposed a linear-time
algorithm for the string matching problem.
A matching time of O(n+m) is achieved by
avoiding comparisons with elements of S that
have previously been involved in a comparison
with some element of the pattern p to be
matched; i.e., backtracking on the string S
never occurs.
Compute-Prefix-Function (p)
  m ← length[p]        // p: pattern to be matched
  Π[1] ← 0
  k ← 0
  for q ← 2 to m
    do while k > 0 and p[k+1] != p[q]
         do k ← Π[k]
       if p[k+1] = p[q]
         then k ← k + 1
       Π[q] ← k
  return Π
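A possible Java rendering of Compute-Prefix-Function is sketched below. It is an assumption-laden translation, not the slides' own code: the class name PrefixFunction and the method compute are hypothetical, and the arrays are 0-based, so pi[q] here plays the role of Π[q+1] in the pseudocode and the comparison p[k+1] becomes p.charAt(k).

```java
// Illustrative sketch only; PrefixFunction and compute are hypothetical names.
class PrefixFunction {

    // pi[q] = length of the longest proper prefix of p[0..q]
    // that is also a suffix of p[0..q] (0-based indices).
    static int[] compute(String p) {
        int m = p.length();
        int[] pi = new int[m];   // pi[0] is 0 by default
        int k = 0;               // length of the currently matched prefix
        for (int q = 1; q < m; q++) {
            // Fall back to shorter borders until the next character fits.
            while (k > 0 && p.charAt(k) != p.charAt(q)) {
                k = pi[k - 1];
            }
            if (p.charAt(k) == p.charAt(q)) {
                k++;
            }
            pi[q] = k;
        }
        return pi;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(compute("ababaca")));
    }
}
```

The inner while loop never moves q backwards; it only shortens k, which is why the whole computation stays linear in m.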
Example: computing Π for p = a b a b a c a (m = 7). Initially Π[1] = 0 and k = 0.
q 1 2 3 4 5 6 7
p a b a b a c a
Step 1: q = 2, k = 0. p[1] != p[2], so Π[2] = 0.
Π 0 0
Step 2: q = 3, k = 0. p[1] = p[3], so k = 1 and Π[3] = 1.
Π 0 0 1
Step 3: q = 4, k = 1. p[2] = p[4], so k = 2 and Π[4] = 2.
Π 0 0 1 2
Step 4: q = 5, k = 2. p[3] = p[5], so k = 3 and Π[5] = 3.
Π 0 0 1 2 3
Step 5: q = 6, k = 3. p[4] != p[6], so k = Π[3] = 1; p[2] != p[6], so k = Π[1] = 0;
p[1] != p[6], so Π[6] = 0.
Π 0 0 1 2 3 0
Step 6: q = 7, k = 0. p[1] = p[7], so k = 1 and Π[7] = 1.
Π 0 0 1 2 3 0 1
Note: KMP finds every occurrence of p in S. That is why KMP does not terminate in
step 12; rather, it searches the remainder of S for any more occurrences of p.
S b a c b a b a b a b a c a a b
p a b a b a c a
Let us execute the KMP algorithm to find
whether p occurs in S.
For p, the prefix function Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
Initially: n = size of S = 15;
m = size of p = 7
Step 1: i = 1, q = 0
Comparing p[1] with S[1]
S b a c b a b a b a b a c a a b
p a b a b a c a
p[1] does not match S[1], so p is shifted one position to the right.
Step 2: i = 2, q = 0
Comparing p[1] with S[2]
S b a c b a b a b a b a c a a b
p   a b a b a c a
p[1] matches S[2]. Since there is a match, p is not shifted.
Step 3: i = 3, q = 1
Comparing p[2] with S[3]: p[2] does not match S[3].
S b a c b a b a b a b a c a a b
p   a b a b a c a
Backtracking on p: q = Π[1] = 0, so p[1] is compared with S[3]. They do not match either.
Step 4: i = 4, q = 0
Comparing p[1] with S[4]: p[1] does not match S[4].
S b a c b a b a b a b a c a a b
p       a b a b a c a
Step 5: i = 5, q = 0
Comparing p[1] with S[5]: p[1] matches S[5].
S b a c b a b a b a b a c a a b
p         a b a b a c a
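Step 6: i = 6, q = 1
Comparing p[2] with S[6]: p[2] matches S[6].
S b a c b a b a b a b a c a a b
p         a b a b a c a
Step 7: i = 7, q = 2
Comparing p[3] with S[7]: p[3] matches S[7].
S b a c b a b a b a b a c a a b
p         a b a b a c a
Step 8: i = 8, q = 3
Comparing p[4] with S[8]: p[4] matches S[8].
S b a c b a b a b a b a c a a b
p         a b a b a c a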
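Step 9: i = 9, q = 4
Comparing p[5] with S[9]: p[5] matches S[9].
S b a c b a b a b a b a c a a b
p         a b a b a c a
Step 10: i = 10, q = 5
Comparing p[6] with S[10]: p[6] does not match S[10].
S b a c b a b a b a b a c a a b
p         a b a b a c a
Backtracking on p: after the mismatch q = Π[5] = 3, so p[4] is compared with S[10]. They match.
Step 11: i = 11, q = 4
Comparing p[5] with S[11]: p[5] matches S[11].
S b a c b a b a b a b a c a a b
p             a b a b a c a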
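Step 12: i = 12, q = 5
Comparing p[6] with S[12]: p[6] matches S[12].
S b a c b a b a b a b a c a a b
p             a b a b a c a
Step 13: i = 13, q = 6
Comparing p[7] with S[13]: p[7] matches S[13], so q = 7 = m.
S b a c b a b a b a b a c a a b
p             a b a b a c a
Pattern p has been found to occur completely in string S. The total number of shifts
that took place before the match was found is i - m = 13 - 7 = 6.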
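The matching steps above can be sketched in Java as shown below. This is a minimal sketch under stated assumptions, not the slides' own listing: the names KmpSearch, prefix and search are hypothetical, indices are 0-based (so the match that ends at i = 13 is reported as start index 6, which equals the number of shifts), and after a full match the code falls back through the prefix function so that it keeps looking for further occurrences.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; KmpSearch, prefix and search are hypothetical names.
class KmpSearch {

    // Prefix function with 0-based indices: pi[q] is the length of the longest
    // proper prefix of p[0..q] that is also a suffix of it.
    static int[] prefix(String p) {
        int[] pi = new int[p.length()];
        int k = 0;
        for (int q = 1; q < p.length(); q++) {
            while (k > 0 && p.charAt(k) != p.charAt(q)) k = pi[k - 1];
            if (p.charAt(k) == p.charAt(q)) k++;
            pi[q] = k;
        }
        return pi;
    }

    // Returns the 0-based start index of every occurrence of p in s.
    static List<Integer> search(String s, String p) {
        List<Integer> hits = new ArrayList<>();
        if (p.isEmpty()) return hits;   // assumption: an empty pattern yields no hits
        int[] pi = prefix(p);
        int q = 0;                      // number of pattern characters matched so far
        for (int i = 0; i < s.length(); i++) {   // i never moves backwards
            while (q > 0 && p.charAt(q) != s.charAt(i)) {
                q = pi[q - 1];          // backtrack on p only
            }
            if (p.charAt(q) == s.charAt(i)) q++;
            if (q == p.length()) {
                hits.add(i - q + 1);    // full match ending at position i
                q = pi[q - 1];          // continue searching for more occurrences
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // String S and pattern p from the steps above.
        System.out.println(search("bacbabababacaab", "ababaca")); // prints [6]
    }
}
```

Because i only increases and q can drop at most as often as it has risen, the search runs in O(n) after the O(m) prefix computation.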
• Types of compression
– Lossy: MP3, JPG
– Lossless: ZIP, GZ
• Compression Algorithm:
– Huffman Encoding
– Lempel-Ziv
– RLE: Run Length Encoding
• Performance of compression depends on
file types.
Compress data by encoding symbols
contained in it (lossless compression)
• The information content of the source X = (x1, x2, …, xn), called the
entropy of the source, is defined by:
Lave = H = P(x1)L(x1) + · · · + P(xn)L(xn)
where L(xi) = -log2 P(xi) is the minimum length of a
codeword for symbol xi (Claude E. Shannon, 1948). Shannon's
entropy represents an absolute limit on the best possible
lossless compression of any communication.
• To compare the efficiency of different data compression
methods when applied to the same data, the same measure is
used; this measure is the compression rate:
compression rate = (length(input) - length(output)) / length(input)
Codeword: the sequence of bits of a code corresponding to a symbol.
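As a small numeric illustration of the two measures above, the sketch below computes the entropy of a probability distribution and the compression rate of an input/output size pair. The class and method names (EntropyDemo, entropy, compressionRate) are hypothetical, and the probabilities are assumed to be given directly as an array that sums to 1.

```java
// Illustrative sketch only; EntropyDemo and its methods are hypothetical names.
class EntropyDemo {

    // H = sum over i of P(xi) * L(xi), with L(xi) = -log2 P(xi).
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0.0) {                       // a zero-probability symbol contributes nothing
                double bits = -Math.log(p) / Math.log(2.0);
                h += p * bits;
            }
        }
        return h;
    }

    // (length(input) - length(output)) / length(input)
    static double compressionRate(long inputLength, long outputLength) {
        return (double) (inputLength - outputLength) / inputLength;
    }

    public static void main(String[] args) {
        // Two equally likely symbols need one bit each on average.
        System.out.println(entropy(new double[] {0.5, 0.5}));
        // A 100-unit input compressed to 40 units.
        System.out.println(compressionRate(100, 40));
    }
}
```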
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword)
of variable length to every message value
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits
1011 ?
Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in
which every bit string can be decomposed into
codewords in exactly one way.
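The ambiguity in the example above can be checked mechanically by enumerating every way of splitting a bit string into codewords. The following is a sketch only: DecodeAmbiguity, exampleCode and decodings are hypothetical names, and the code assumes that no codeword is empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical.
class DecodeAmbiguity {

    // The code from the example: a = 1, b = 01, c = 101, d = 011.
    static Map<Character, String> exampleCode() {
        Map<Character, String> code = new LinkedHashMap<>();
        code.put('a', "1");
        code.put('b', "01");
        code.put('c', "101");
        code.put('d', "011");
        return code;
    }

    // Returns every message whose encoding is exactly `bits`, sorted alphabetically.
    // Assumes all codewords are non-empty.
    static List<String> decodings(String bits, Map<Character, String> code) {
        List<String> out = new ArrayList<>();
        collect(bits, 0, code, new StringBuilder(), out);
        Collections.sort(out);
        return out;
    }

    private static void collect(String bits, int pos, Map<Character, String> code,
                                StringBuilder soFar, List<String> out) {
        if (pos == bits.length()) {          // consumed all bits: one complete decoding
            out.add(soFar.toString());
            return;
        }
        for (Map.Entry<Character, String> e : code.entrySet()) {
            if (bits.startsWith(e.getValue(), pos)) {
                soFar.append(e.getKey());
                collect(bits, pos + e.getValue().length(), code, soFar, out);
                soFar.setLength(soFar.length() - 1);   // undo and try the next codeword
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(decodings("1011", exampleCode())); // prints [aba, ad, ca]
    }
}
```

A code is uniquely decodable exactly when this enumeration never yields more than one message for any bit string; the three results for 1011 show that the example code is not.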
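Symbols and probabilities: E .40, A .20, C .15, D .11, B .09, F .05
Step 1: merge the two least probable nodes, B (.09) and F (.05), into BF (.14).
Nodes: E .40, A .20, C .15, BF .14, D .11
Step 2: merge BF (.14) and D (.11) into BFD (.25).
Nodes: E .40, BFD .25, A .20, C .15
Step 3: merge A (.20) and C (.15) into AC (.35).
Nodes: E .40, AC .35, BFD .25
Step 4: merge AC (.35) and BFD (.25) into BFDAC (.60).
Nodes: BFDAC .60, E .40
Finally, the two remaining nodes are joined under the root, and the two branches
leaving each internal node are labeled 0 and 1. The codeword of a symbol is the
sequence of labels on the path from the root to its leaf.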
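The merging procedure illustrated above can be sketched with a priority queue. This is one possible implementation, not necessarily the one the slides had in mind: HuffmanSketch, exampleWeights and buildCodes are hypothetical names, the probabilities are scaled by 100 to integer weights as an assumption to avoid floating-point comparisons, and the choice of which branch gets 0 or 1 is arbitrary, so only the codeword lengths are meaningful.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Illustrative sketch only; all names here are hypothetical.
class HuffmanSketch {

    private static final class Node {
        final int weight;
        final char symbol;       // meaningful only for leaves
        final Node left, right;  // both null for leaves

        Node(char symbol, int weight) {
            this.symbol = symbol; this.weight = weight; this.left = null; this.right = null;
        }
        Node(Node left, Node right) {
            this.symbol = '\0'; this.weight = left.weight + right.weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null; }
    }

    // Probabilities from the example, scaled by 100 (an assumption of this sketch).
    static Map<Character, Integer> exampleWeights() {
        Map<Character, Integer> w = new LinkedHashMap<>();
        w.put('E', 40); w.put('A', 20); w.put('C', 15);
        w.put('D', 11); w.put('B', 9);  w.put('F', 5);
        return w;
    }

    static Map<Character, String> buildCodes(Map<Character, Integer> weights) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (Map.Entry<Character, Integer> e : weights.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue()));
        }
        // Repeatedly merge the two lightest nodes until one tree remains.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a, b));
        }
        Map<Character, String> codes = new TreeMap<>();
        if (!queue.isEmpty()) assign(queue.poll(), "", codes);
        return codes;
    }

    // Walk the tree, appending 0 for a left branch and 1 for a right branch.
    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n.isLeaf()) {
            codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix); // single-symbol input
            return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        System.out.println(buildCodes(exampleWeights()));
    }
}
```

For the example weights, E ends up with a 1-bit codeword, A, C and D with 3 bits, and B and F with 4 bits, matching the depths of the leaves in the merge sequence.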
Huffman Coding example - 8
Raw : FFFFOOOOFFFOOFFFFFOOOOOOOO
4 4 3 2 5 7
Compressed: 4F4O3F2O5F7O
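The Raw/Compressed pair above follows the run-length encoding (RLE) scheme listed earlier: each run of identical characters is replaced by its length followed by the character. A minimal sketch is given below; the names RunLengthSketch, encode and decode are hypothetical, and the code assumes that the raw text contains no digit characters.

```java
// Illustrative sketch only; RunLengthSketch, encode and decode are hypothetical names.
class RunLengthSketch {

    // Replaces each run of identical characters with "<count><char>".
    // Assumes the raw text contains no digits.
    static String encode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            int run = 1;
            while (i + run < raw.length() && raw.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    // Inverse of encode: expands "<count><char>" pairs back into runs.
    static String decode(String packed) {
        StringBuilder out = new StringBuilder();
        int count = 0;
        for (char c : packed.toCharArray()) {
            if (Character.isDigit(c)) {
                count = count * 10 + (c - '0');   // counts may have several digits
            } else {
                for (int k = 0; k < count; k++) out.append(c);
                count = 0;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("FFFFOOOOFFFOO")); // prints 4F4O3F2O
    }
}
```

RLE pays off only when the input has long runs; text with few repeated neighbours can grow, since every run of length 1 becomes two characters.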