String Matching
• Rabin-Karp Algorithm
• Knuth-Morris-Pratt (KMP) Algorithm
• Finite Automata: Θ(n) matching time
s=0  a b a a
s=1    a b a a
s=2      a b a a
s=3        a b a a
Shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)
GCET-CSE String Matching 5
Example
            1  2  3  4
pattern P   a  b  a  a
            1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  a
s=3                  a  b  a  a
s=9                                    a  b  a  a
Terminology
• Concatenation of 2 strings x and y is xy
– E.g., x=“putra”, y=“jaya” ⇒ xy = “putrajaya”
• A string w is a prefix of a string x, if x=wy
for some string y
– E.g., “putra” is a prefix of “putrajaya”
• A string w is a suffix of a string x, if x=yw
for some string y
– E.g., “jaya” is a suffix of “putrajaya”
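The three definitions above can be checked directly in code. This is a minimal sketch; the helper names `is_prefix` and `is_suffix` are illustrative and not part of the slides.

```python
# Illustrative helpers (names are assumptions); w, x, y follow the definitions above.
def is_prefix(w: str, x: str) -> bool:
    """True if x = w + y for some string y."""
    return x[:len(w)] == w

def is_suffix(w: str, x: str) -> bool:
    """True if x = y + w for some string y."""
    return len(w) <= len(x) and x[len(x) - len(w):] == w

x, y = "putra", "jaya"
xy = x + y  # concatenation of x and y
print(xy, is_prefix(x, xy), is_suffix(y, xy))  # putrajaya True True
```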
NAÏVE-STRING-MATCHER (T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m
        if P[1..m] = T[(s+1)..(s+m)]
            print “pattern occurs with shift” s
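The pseudocode above translates almost line for line into Python. The sketch below assumes 0-based Python strings, so the slice `T[s:s + m]` plays the role of T[(s+1)..(s+m)], and it returns the shifts as a list instead of printing them.

```python
def naive_string_matcher(T: str, P: str) -> list:
    """Return every shift s with 0 <= s <= n-m at which P occurs in T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):      # s = 0 .. n-m
        if T[s:s + m] == P:         # compare P with the window at shift s
            shifts.append(s)
    return shifts

# Text and pattern from the Example slide.
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```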
Analysis: Worst-case Example
1 2 3 4
pattern P a a a b
1 2 3 4 5 6 7 8 9 10 11 12 13
text T a a a a a a a a a a a a a
a a a b
a a a b
Worst-case Analysis
• There are m comparisons for each shift
in the worst case
• There are n-m+1 shifts
• So, the worst-case running time is
Θ((n-m+1)m)
– In the example on the previous slide, we have
(13-4+1)·4 = 40 comparisons in total
• The naïve method is inefficient because
information gained at one shift is not reused at later shifts
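The worst-case count can be confirmed by instrumenting the naïve matcher. This is a sketch under the assumption that one comparison means one character-against-character test and that a shift is abandoned at its first mismatch; `count_comparisons` is an illustrative name.

```python
def count_comparisons(T: str, P: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if T[s + j] != P[j]:    # stop this shift at the first mismatch
                break
    return comparisons

# Worst-case example: every shift matches a a a and then fails on b.
print(count_comparisons("a" * 13, "aaab"))  # 40
```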
Rabin-Karp Algorithm
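• Characters are treated as digits, so a string such as 3 1 4 3 is read as the decimal number 3143
• p = decimal value of P
• ts = decimal value of the length-m substring T[(s+1)..(s+m)]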
Rabin-Karp basic example
pattern P 3 1 4 7
1 2 3 4 5 6 7 8 9 10 11 12 13
text T 3 1 4 7 5 3 3 4 2 3 1 4 7
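Sliding the window one position to the right (text fragment 9 0 2 3 1 4 1 5 2 6 7 3, m = 5, q = 13): the old window is 3 1 4 1 5 and the new window is 1 4 1 5 2
14152 ≡ (31415 − 3 · 10000) · 10 + 2 (mod 13)
      ≡ (7 − 3 · 3) · 10 + 2 (mod 13)
      ≡ 8 (mod 13)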
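The update step in this calculation can be written as a small function. This is a sketch: the name `roll` and its parameters are illustrative, with radix d = 10 and modulus q = 13 taken from the example.

```python
def roll(t_s: int, old_digit: int, new_digit: int, m: int, q: int, d: int = 10) -> int:
    """Hash of the next window, given the hash t_s (mod q) of the current one.

    Removes the high-order digit, shifts left by one digit, and appends
    the new low-order digit, all modulo q.
    """
    h = pow(d, m - 1, q)    # d^(m-1) mod q; here 10000 mod 13 = 3
    return ((t_s - old_digit * h) * d + new_digit) % q

# Window 31415 (7 mod 13) slides to 14152: drop the leading 3, append 2.
print(roll(31415 % 13, 3, 2, m=5, q=13))  # 8
```

Python's `%` always returns a non-negative result for a positive modulus, so the intermediate value −18 needs no extra correction.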
Problem of Spurious Hits
• ts ≡ p (mod q) does not imply that ts = p
– Modular equivalence does not necessarily
mean that two integers are equal
• A case in which ts ≡ p (mod q) but ts ≠ p
is called a spurious hit
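Example (q = 13): pattern P = 3 1 4 1 5, p mod 13 = 7
position      1  2  3  4  5  6  7  8  9 10 11 12 13 14
text T        2  3  1  4  1  5  2  6  7  3  9  9  2  1
shift s       0  1  2  3  4  5  6  7  8  9
ts mod 13     1  7  8  4  5 10 11  7  9 11
Valid match: s = 1 (window 3 1 4 1 5). Spurious hit: s = 7 (window 6 7 3 9 9, and 67399 mod 13 = 7).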
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm,
but uses modular arithmetic as described
• For each hit, i.e., for each s where ts ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
• In the worst case, every shift is verified
– The running time can be shown to be O((n-m+1)m)
• The average-case running time is O(n+m)
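The structure described above can be sketched as follows. This is an illustrative version, not a definitive implementation: it assumes the text and pattern are strings of decimal digits (radix d = 10), uses q = 13 as in the slides, and returns the spurious hits separately so they can be inspected.

```python
def rabin_karp(T: str, P: str, q: int = 13, d: int = 10):
    """Return (valid_shifts, spurious_hits) for digit strings T and P."""
    n, m = len(T), len(P)
    valid, spurious = [], []
    if m == 0 or m > n:
        return valid, spurious
    h = pow(d, m - 1, q)                 # weight of the high-order digit
    p = t = 0
    for i in range(m):                   # hash of P and of the first window
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    for s in range(n - m + 1):
        if p == t:                       # hit: verify character by character
            if T[s:s + m] == P:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                    # roll the hash to the next window
            t = ((t - int(T[s]) * h) * d + int(T[s + m])) % q
    return valid, spurious

print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

A larger modulus q would make spurious hits rarer; q = 13 is kept here only so that the output matches the small example.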
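KMP Prefix Function π: Example Trace (P = a b a b a c a)
Computing π[2], with k = 0:
i     1 2 3 4 5 6 7
π[i]  0 0
While k > 0: 0 > 0 is false
If P[k+1] = P[2]: P[0+1] = P[2] compares a with b, which is false
So π[2] = k = 0
Computing π[6], with k = 0:
i     1 2 3 4 5 6 7
π[i]  0 0 1 2 3 0
If P[k+1] = P[6]: P[0+1] = P[6] compares a with c, which is false
So π[6] = k = 0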
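The trace corresponds to the standard prefix-function computation, sketched below. It assumes 0-based Python indexing, so `pi[i]` in the code holds the value written π[i+1] in the trace, and the pattern a b a b a c a is the one used on the later slides.

```python
def compute_prefix_function(P: str) -> list:
    """pi[i] = length of the longest proper prefix of P[:i+1] that is also its suffix."""
    m = len(P)
    pi = [0] * m
    k = 0                                # characters matched so far
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:    # fall back to a shorter border
            k = pi[k - 1]
        if P[k] == P[q]:                 # extend the current border
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```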
• Advantages of KMP
– It runs in optimal O(m+n) time, which is very fast
– The algorithm never needs to move backward in the input text
– This makes it well suited to processing very large files (network
streaming/external devices/..)
• Disadvantage
– KMP is less effective when the alphabet is large (more
chance of a mismatch)
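The point about never moving backward in the text can be illustrated with a matcher that consumes its input through an iterator, so each text character is read exactly once. This is a sketch; `kmp_stream` is an illustrative name, and a non-empty pattern is assumed.

```python
def kmp_stream(chars, P: str):
    """Yield each 0-based shift at which P occurs, reading chars once, left to right."""
    m = len(P)
    if m == 0:
        return
    # Prefix function of P (0-based), built inline to keep the block self-contained.
    pi, k = [0] * m, 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    k = 0                                # pattern characters currently matched
    for i, c in enumerate(chars):        # the text index i only ever increases
        while k > 0 and P[k] != c:
            k = pi[k - 1]
        if P[k] == c:
            k += 1
        if k == m:                       # full match ending at position i
            yield i - m + 1
            k = pi[k - 1]

# iter(...) stands in for a stream that cannot be rewound.
print(list(kmp_stream(iter("abcabaabcabaa"), "abaa")))  # [3, 9]
```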
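• On input c, the automaton moves to state 6
• On input c when the previously visited string is ababaca:
– The string read so far is ababacac; compare the prefix ababaca of P with the suffix babacac
– Find the longest prefix of P that is also a suffix of ababacac
– Its length is 0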
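The rule used in these steps (the next state is the length of the longest prefix of P that is a suffix of the string read so far) can be sketched as code. The function names and the alphabet {a, b, c} are assumptions for illustration, and the table is built by brute force for clarity, not speed.

```python
def compute_transition_function(P: str, alphabet: str) -> dict:
    """delta[(q, a)] = length of the longest prefix of P that is a suffix of P[:q] + a."""
    m = len(P)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            seen = P[:q] + a
            k = min(m, q + 1)
            while k > 0 and not seen.endswith(P[:k]):
                k -= 1                   # try the next shorter prefix
            delta[(q, a)] = k
    return delta

def automaton_matcher(T: str, P: str, alphabet: str) -> list:
    """Return the 0-based shifts at which P occurs in T."""
    delta = compute_transition_function(P, alphabet)
    m, q, shifts = len(P), 0, []
    for i, c in enumerate(T):            # one table lookup per text character
        q = delta[(q, c)]
        if q == m:
            shifts.append(i - m + 1)
    return shifts

delta = compute_transition_function("ababaca", "abc")
print(delta[(5, "c")], delta[(7, "c")])  # 6 0
```

Here state 5 means that ababa has been read, so input c completes the prefix ababac and leads to state 6; from state 7, input c leaves no matching prefix, so the next state is 0.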
[Figure: state-transition diagram of the string-matching automaton; the forward edges spell a b a b a c a, with back edges labeled a and b]