String Matching: KMP, Rabin-Karp, and Naïve
String Matching
• Introduction
• Naïve Algorithm
• Rabin-Karp Algorithm
• Knuth-Morris-Pratt (KMP) Algorithm
Introduction
• What is string matching?
– Finding all occurrences of a pattern in a given text (or
body of text)
• Many applications
– While using editor/word processor/browser
– Login name & password checking
– Virus detection
– Header analysis in data communications
– DNA sequence analysis, Web search engines (e.g. Google),
image analysis
Brute Force
• The brute-force algorithm aligns the pattern with the text and compares
them one character at a time until a mismatch is found, then shifts the
pattern one position to the right and starts again
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares the pattern to every length-M substring of the text,
for up to M(N-M+1) character comparisons. For example, M=5.
• This kind of case can occur for image data.
[Figure: pattern P = a b a a (positions 1 to 4) aligned with the text at shifts s = 3 and s = 9]
Naïve String-Matching Algorithm
Input: Text string T[1..n] and pattern P[1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER (T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m
        if P[1..m] = T[(s+1)..(s+m)]
            print "pattern occurs with shift" s
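The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not part of the original slides: the function name `naive_string_matcher` is hypothetical, and shifts are reported 0-based, matching the pseudocode's s.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n-m, as in the pseudocode
        if text[s:s + m] == pattern:    # P[1..m] = T[(s+1)..(s+m)]
            shifts.append(s)
    return shifts


print(naive_string_matcher("abaabaabaa", "abaa"))
```

The slice comparison hides the inner character-by-character loop, which is where the factor m in the running time comes from.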
Naïve Algorithm
[Figure: pattern P = a a a b shown at successive shifts against the text]
Worst-case Analysis
• There are m comparisons for each shift in the
worst case
• There are n-m+1 shifts
• So, the worst-case running time is Θ((n-m+1)m)
– In the example on the previous slide (n = 13, m = 4), we have
(13-4+1)·4 = 40 comparisons in total
• Naïve method is inefficient because information
from a shift is not used again
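To make the count concrete, the sketch below counts character comparisons made by the naïve method. The pattern "aaab" and the lengths 13 and 4 come from the slides; the text of thirteen a's is an assumption chosen to trigger the worst case, since the slide's actual text is not recoverable. The name `count_comparisons` is hypothetical.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                   # mismatch: give up on this shift
    return comparisons


text = "a" * 13       # assumed worst-case text, not taken from the slides
pattern = "aaab"
print(count_comparisons(text, pattern))
```

Every shift matches three a's before failing on the b, so each of the 10 shifts costs the full 4 comparisons.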
Rabin-Karp Algorithm
• Has a worst-case running time of O((n-m+1)m)
but average-case is O(n+m)
– Also works well in practice
• Based on number-theoretic notion of
modular equivalence
• We assume that Σ = {0, 1, 2, …, 9}, i.e.,
each character is a decimal digit
– In general, use radix-d where d = |Σ|
Rabin-Karp Approach
• We can view a string of k characters (digits)
as a length-k decimal number
– E.g., the string “31425” corresponds to the
decimal number 31,425
• Given a pattern P [1..m], let p denote the
corresponding decimal value
• Given a text T [1..n], let t_s denote the decimal
value of the length-m substring T [(s+1)..(s+m)]
for s = 0, 1, …, (n-m)
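As an illustration of these definitions, the sketch below computes t_s for every shift of a digit string. The first value is built with Horner's rule; each later value is derived from the previous one by dropping the leading digit and appending the next one. That rolling update is the standard technique and is an assumption here, since the slides that would show it did not survive. The name `window_values` is hypothetical.

```python
def window_values(text: str, m: int) -> list[int]:
    """Decimal value t_s of every length-m window of a digit string."""
    n = len(text)
    if m == 0 or m > n:
        return []
    t = 0
    for ch in text[:m]:
        t = 10 * t + int(ch)            # Horner's rule for t_0
    values = [t]
    high = 10 ** (m - 1)                # weight of the leading digit
    for s in range(n - m):
        # Drop digit T[s+1], shift left one place, append digit T[s+m+1].
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("23141526739921", 5))
```

Each update takes a constant number of arithmetic operations, provided the numbers stay small; that is the motivation for working modulo q.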
The Rabin-Karp algorithm
Rabin-Karp Approach …contd
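• Example with modulus q = 13; the pattern value is p mod 13 = 7
text position:  1 2 3 4 5 6 7 8 9 10 11 12 13 14
text digit:     2 3 1 4 1 5 2 6 7 3 9 9 2 1
shift s:        0 1 2 3 4 5  6  7 8 9
t_s mod 13:     1 7 8 4 5 10 11 7 9 11
• s = 1 (window 31415) is a valid match; s = 7 (window 67399) is a spurious hit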
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm,
but uses modular arithmetic as described
• For each hit, i.e., for each s where t_s ≡ p (mod q),
verify character by character
whether s is a valid shift or a spurious hit
• In the worst case, every shift is verified
– Running time can be shown as O((n-m+1)m)
• Average-case running time is O(n+m)
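A minimal Python sketch of this structure follows, assuming a digit alphabet (d = 10) and the modulus q = 13 used in the slide example. The function name `rabin_karp` and the choice to return valid shifts and spurious hits separately are illustrative assumptions, not the slides' own formulation.

```python
def rabin_karp(text: str, pattern: str, q: int = 13, d: int = 10):
    """Return (valid_shifts, spurious_hits) for a digit-string pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)                # d^(m-1) mod q, weight of leading digit
    p = t = 0
    for i in range(m):                  # hash of pattern and of first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                      # a hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                   # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious


print(rabin_karp("23141526739921", "31415"))
```

With a larger modulus, spurious hits become rarer, which is what keeps the average case close to O(n+m).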
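The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based on two techniques:
comparing P against T from right to left, and, on a mismatch, jumping P
forward according to the mismatched text character.
• When a mismatch occurs at T[i] == x, there are 3 possible cases, tried in order.
[Figure: P aligned under T and compared from its right end; the mismatch is at T[i] == x against P[j]]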
Case 1
• If P contains x somewhere, then try to
shift P right to align the last occurrence
of x in P with T[i].
• Then move i and j right, so that j is at the end of P.
[Figure: before and after the shift; the last x in P is aligned with T[i], and inew and jnew point at the end of P]
Case 2
• If P contains x somewhere, but a shift right
to the last occurrence is not possible (the last x in P is after
position j), then shift P right by 1 character to T[i+1].
• Then move i and j right, so that j is at the end of P.
[Figure: before and after the shift; x occurs in P after position j, and P moves right by one character]
Case 3
• If cases 1 and 2 do not apply (there is no x in P), then shift P to
align P[0] with T[i+1].
• Then move i and j right, so that j is at the end of P.
[Figure: before and after the shift; P[0] is aligned with T[i+1]]
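The three cases can be folded into one index update, sketched below. This is a simplified matcher that uses only the jump on the mismatched text character and returns the first match; the name `boyer_moore` and the comparison counter are illustrative additions, not part of the slides.

```python
def boyer_moore(text: str, pattern: str) -> tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    last = {x: k for k, x in enumerate(pattern)}   # last index of each char
    comparisons = 0
    i = j = m - 1                       # start at the right end of P
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == 0:
                return i, comparisons   # whole pattern matched
            i -= 1
            j -= 1
        else:
            lo = last.get(text[i], -1)  # -1 when x is not in P (case 3)
            # Case 1: lo < j  -> align the last x in P with T[i]
            # Case 2: lo > j  -> shift P right by one character
            # Case 3: lo = -1 -> align P[0] with T[i+1]
            i += m - min(j, lo + 1)
            j = m - 1
    return -1, comparisons


print(boyer_moore("a pattern matching algorithm", "rithm"))
```

The expression `m - min(j, lo + 1)` is one common way to write the jump; writing out the three branches explicitly would behave the same.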
Boyer-Moore Example (1)
T: a pattern matching algorithm
P: rithm
[Figure: P shown at successive alignments under T; the 11 character comparisons are numbered 1 to 11, ending with the match of "rithm" at the end of T]
Last Occurrence Function
• Boyer-Moore's algorithm preprocesses the
pattern P and the alphabet A to build a last
occurrence function L()
– L() maps all the letters in A to integers
– L(x) is the index of the last occurrence of x in P, or -1 if x does not occur in P
• A = {a, b, c, d}
• P: "abacab" (indices 0 1 2 3 4 5)
x      a   b   c   d
L(x)   4   5   3   -1
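The table can be reproduced with a few lines of Python. This is a sketch under the assumption that L(x) is the 0-based index of the last occurrence of x in P, with -1 for letters that do not occur; the name `last_occurrence` is hypothetical.

```python
def last_occurrence(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each letter of the alphabet to its last index in pattern, or -1."""
    # str.rfind returns -1 when the character is absent.
    return {x: pattern.rfind(x) for x in alphabet}


print(last_occurrence("abacab", "abcd"))
```

For a large alphabet, a dictionary built in one pass over P (with a default of -1 on lookup) avoids scanning P once per letter.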
Analysis
• Boyer-Moore worst case running time is
O(nm + |A|), where |A| is the size of the alphabet
Knuth-Morris-Pratt (KMP) Algorithm (continued)
• If a mismatch occurs between the text and pattern P at P[j],
what is the most we can shift the pattern to avoid wasteful comparisons?
Example
[Figure: T and P aligned, with a mismatch at j = 5; after the shift, comparison resumes at jnew = 2]
Why j == 5
Failure Function Example
• P: "abaaba"
   j: 012345
• Failure function table (k == j-1):
j      0   1   2   3   4
F(j)   0   0   1   1   2
Using the Failure Function
Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b
k      0   1   2   3   4
F(k)   0   0   1   0   1
[Figure: P shown at five successive alignments under T; the 19 character comparisons are numbered 1 to 19 (1-6, then 7, then 8-12, then 13, then 14-19), ending with a full match]
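The example can be traced with the sketch below, which builds the failure function and then scans the text without ever moving backwards in T. The names `failure_function` and `kmp_match`, the comparison counter, and the choice to return only the first match are illustrative assumptions rather than the slides' own code.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = size of the largest prefix of P[0..k] that is a suffix of P[1..k]."""
    m = len(pattern)
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]             # fall back to a shorter prefix
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_match(text: str, pattern: str) -> tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    fail = failure_function(pattern)
    i = j = comparisons = 0
    while i < n:
        comparisons += 1
        if text[i] == pattern[j]:
            if j == m - 1:
                return i - m + 1, comparisons
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]             # reuse what already matched
        else:
            i += 1
    return -1, comparisons


print(failure_function("abacab"))
print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

Because i never decreases, the scan over T contributes O(n) work and building the table contributes O(m).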
Why is F(4) == 1?
• P: "abacab"
• F(4) means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
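The definition can be checked directly by brute force. The sketch below is deliberately naive and follows the wording of the definition; the name `failure_by_definition` is hypothetical, and this is not how the table would be built in an efficient implementation.

```python
def failure_by_definition(pattern: str, k: int) -> int:
    """Size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    prefix_source = pattern[:k + 1]     # P[0..k]
    suffix_source = pattern[1:k + 1]    # P[1..k]
    for size in range(len(suffix_source), 0, -1):   # try the largest size first
        if prefix_source[:size] == suffix_source[-size:]:
            return size
    return 0


print(failure_by_definition("abacab", 4))
print([failure_by_definition("abaaba", k) for k in range(5)])
```

Taking the suffix from P[1..k] rather than P[0..k] rules out the trivial answer in which the whole string matches itself.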
KMP Advantages
• KMP runs in optimal time: O(m+n)
– very fast