
Outline

String Matching
• Introduction
• Naïve Algorithm
• Rabin-Karp Algorithm
• Boyer-Moore Algorithm
• Knuth-Morris-Pratt (KMP) Algorithm
Introduction
• What is string matching?
– Finding all occurrences of a pattern in a given text (or
body of text)
• Many applications
– While using editor/word processor/browser
– Login name & password checking
– Virus detection
– Header analysis in data communications
– DNA sequence analysis, Web search engines (e.g. Goo
gle), image analysis
Brute Force
• The Brute Force algorithm compares the pattern to the text, one
character at a time, until mismatching characters are found.
• The algorithm can be designed to stop on either the first
occurrence of the pattern, or upon reaching the end of the text.
Brute Force Pseudo-Code
• Here’s the pseudo-code:

do
    if (text letter == pattern letter)
        compare next letter of pattern to next letter of text
    else
        move pattern down text by one letter
while (entire pattern found or end of text)
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares the pattern to each substring of text of length M.
For example, M=5.
• This kind of case can occur for image data.

Total number of comparisons: M(N-M+1)

Worst-case time complexity: O(MN)
Brute Force-Complexity (cont.)
• Given a pattern M characters in length, and a text N characters in
length...
• Best case if pattern found: finds the pattern in the first M positions of
the text. For example, M=5.

Total number of comparisons: M

Best-case time complexity: O(M)
Brute Force-Complexity (cont.)
• Given a pattern M characters in length, and a text N characters in length...
• Best case if pattern not found: always a mismatch on the first character.
For example, M=5.

Total number of comparisons: N

Best-case time complexity: O(N)
String-Matching Problem
• The text is in an array T[1..n] of length n
• The pattern is in an array P[1..m] of length m
• Elements of T and P are characters from a finite alphabet Σ
– E.g., Σ = {0,1} or Σ = {a, b, …, z}
• Usually T and P are called strings of characters
String-Matching Problem …contd
• We say that pattern P occurs with shift s in text T if:
a) 0 ≤ s ≤ n-m and
b) T[(s+1)..(s+m)] = P[1..m]
• If P occurs with shift s in T, then s is a valid shift, otherwise s is an invalid shift
• String-matching problem: finding all valid shifts for a given T and P
Example 1

          1 2 3 4 5 6 7 8 9 10 11 12 13
text T:   a b c a b a a b c a  b  a  c

pattern P (s=3):  a b a a
                  1 2 3 4

shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)
Example 2

pattern P:  a b a a
            1 2 3 4

          1 2 3 4 5 6 7 8 9 10 11 12 13
text T:   a b c a b a a b c a  b  a  a

s=3:            a b a a
s=9:                        a  b  a  a
Naïve String-Matching Algorithm
Input: Text strings T[1..n] and P[1..m]
Result: All valid shifts displayed

NAÏVE-STRING-MATCHER (T, P)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n-m
        if P[1..m] = T[(s+1)..(s+m)]
            print “pattern occurs with shift” s
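The pseudo-code above translates almost line for line into Python; a minimal sketch, using 0-based indexing so that shift s compares P against T[s..s+m-1]:

```python
def naive_string_matcher(T, P):
    """Try every shift s = 0..n-m and compare P to the window directly."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):        # every candidate shift
        if T[s:s + m] == P:           # compare P[1..m] with T[(s+1)..(s+m)]
            shifts.append(s)
    return shifts

print(naive_string_matcher("abcabaabcabac", "abaa"))  # -> [3], as in Example 1
```

Each window comparison costs up to m character checks, which is where the Θ((n-m+1)m) worst case comes from.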
Naïve Algorithm
• The Naïve algorithm consists of checking, at every position in the text
between 0 and n-m, whether an occurrence of the pattern starts there or not.
• After each attempt, it shifts the pattern by exactly one position to the right.

Example (from left to right):

text:       a b c a b c a
shift = 0:  a b c a
shift = 1:    a b c a
shift = 2:      a b c a
shift = 3:        a b c a
Analysis: Worst-case Example

pattern P:  a a a b
            1 2 3 4

          1 2 3 4 5 6 7 8 9 10 11 12 13
text T:   a a a a a a a a a a  a  a  a

          a a a b
            a a a b
            …
Worst-case Analysis
• There are m comparisons for each shift in the worst case
• There are n-m+1 shifts
• So, the worst-case running time is Θ((n-m+1)m)
– In the example on the previous slide, we have (13-4+1)·4 comparisons in total
• The Naïve method is inefficient because information from one shift is not used again
Naïve Algorithm

Example (from right to left):

text:       a b c a b c a
shift = 3:        a b c a
shift = 2:      a b c a
shift = 1:    a b c a
shift = 0:  a b c a

The pattern occurs with shifts 0 and 3.
Rabin-Karp Algorithm
• Has a worst-case running time of O((n-m+1)m) but the average case is O(n+m)
– Also works well in practice
• Based on the number-theoretic notion of modular equivalence
• We assume that Σ = {0, 1, 2, …, 9}, i.e., each character is a decimal digit
– In general, use radix-d where d = |Σ|
Rabin-Karp Approach
• We can view a string of k characters (digits) as a length-k decimal number
– E.g., the string “31425” corresponds to the decimal number 31,425
• Given a pattern P[1..m], let p denote the corresponding decimal value
• Given a text T[1..n], let ts denote the decimal value of the length-m
substring T[(s+1)..(s+m)] for s = 0, 1, …, (n-m)
Rabin-Karp Approach …contd
• ts = p iff T[(s+1)..(s+m)] = P[1..m]
• s is a valid shift iff ts = p
• p can be computed in O(m) time
– p = P[m] + 10(P[m-1] + 10(P[m-2] + …))   (Horner’s rule)
• t0 can similarly be computed in O(m) time
• The other values t1, t2, …, tn-m can be computed in O(n-m) time,
since ts+1 can be computed from ts in constant time
Rabin-Karp Approach …contd
• ts+1 = 10(ts – 10^(m-1)·T[s+1]) + T[s+m+1]
– E.g., if T = {…,3,1,4,1,5,2,…}, m=5 and ts = 31,415, then
  ts+1 = 10(31415 – 10000·3) + 2 = 14152
– Thus we can compute p in Θ(m) time and can compute t0, t1, t2, …, tn-m
in Θ(n-m+1) time
– And we can find all occurrences of the pattern P[1..m] in text T[1..n]
with Θ(m) preprocessing time and Θ(n-m+1) matching time.
• But… a problem: this assumes p and ts are small numbers
– They may be too large to work with easily
Rabin-Karp Approach …contd
• Solution: we can use modular arithmetic with a suitable modulus, q
– E.g.,
– ts+1 ≡ (10(ts – T[s+1]·h) + T[s+m+1]) (mod q)
– where h = 10^(m-1) (mod q)
• q is chosen as a small prime number; e.g., 13 for radix 10
– Generally, if the radix is d, then dq should fit within one computer word
How values modulo 13 are computed

3 1 4 1 5 2
(old high-order digit: 3; new low-order digit: 2; ts ≡ 7, ts+1 ≡ 8)

14152 ≡ ((31415 – 3·10000)·10 + 2) (mod 13)
      ≡ ((7 – 3·3)·10 + 2) (mod 13)
      ≡ 8 (mod 13)
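The arithmetic above can be checked directly; a minimal sketch of the constant-time rolling update, with radix d = 10 and modulus q = 13 as on this slide:

```python
q, d, m = 13, 10, 5
h = pow(d, m - 1, q)         # h = 10^(m-1) mod 13 == 3
t_s = 31415 % q              # hash of the old window "31415" -> 7
# drop the old high-order digit 3, shift left, append the new digit 2
t_next = (d * (t_s - 3 * h) + 2) % q
print(t_next)                # 8, which equals 14152 mod 13
```

Python's `%` always returns a non-negative result, so the subtraction never produces a negative hash; in languages like C, an extra `+ q` is needed before the final reduction.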
Problem of Spurious Hits
• ts ≡ p (mod q) does not imply that ts = p
– Modular equivalence does not necessarily mean that two integers are equal
• A case in which ts ≡ p (mod q) when ts ≠ p is called a spurious hit
• On the other hand, if two integers are not modular equivalent,
then they cannot be equal
Example

pattern:  3 1 4 1 5   → mod 13 → 7

          1 2 3 4 5 6 7 8 9 10 11 12 13 14
text:     2 3 1 4 1 5 2 6 7 3  9  9  2  1

mod 13:   1 7 8 4 5 10 11 7 9 11

The 7 at shift s = 1 is a valid match; the 7 at shift s = 7 is a spurious hit.
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm, but uses modular arithmetic as described
• For each hit, i.e., for each s where ts ≡ p (mod q), verify character by
character whether s is a valid shift or a spurious hit
• In the worst case, every shift is verified
– The running time can be shown to be O((n-m+1)m)
• The average-case running time is O(n+m)
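Putting the pieces together, here is a sketch of the full algorithm for general character strings; the radix d = 256 (treating characters as bytes) and the prime q = 101 are illustrative choices, not values from the slides:

```python
def rabin_karp(T, P, d=256, q=101):
    """Rabin-Karp matcher; every hit is verified to filter spurious hits."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)              # weight of the high-order character
    p = t = 0
    for i in range(m):                # preprocessing: O(m), Horner's rule
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s + m] == P:  # verify hit character by character
            shifts.append(s)
        if s < n - m:                 # rolling update of the window hash, O(1)
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts
```

Only hash hits trigger the O(m) character-by-character check, which is why the expected running time is O(n+m) even though the worst case matches the naïve bound.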
3. The Boyer-Moore Algorithm
• The Boyer-Moore pattern matching algorithm is based on two techniques.
• 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end
• 2. The character-jump technique
– when a mismatch occurs at T[i] == x,
the character in pattern P[j] is not the same as T[i]

T: … x a …
     i
P: … b a
     j

• There are 3 possible cases, tried in order.
Case 1
• If P contains x somewhere, then try to shift P right to align the last
occurrence of x in P with T[i]. Then move i and j right, so j is at the
end of P.
Case 2
• If P contains x somewhere, but a shift right to the last occurrence is not
possible (the last occurrence of x in P is after the j position), then shift
P right by 1 character to T[i+1]. Then move i and j right, so j is at the
end of P.
Case 3
• If cases 1 and 2 do not apply (there is no x in P), then shift P to align
P[0] with T[i+1]. Then move i and j right, so j is at the end of P.
Boyer-Moore Example (1)

T: a p a t t e r n   m a t c h i n g   a l g o r i t h m
P: r i t h m

(the numbers 1–11 in the original figure show the order of the comparisons;
characters are compared right to left, and the pattern is shifted by the
character-jump rule until “r i t h m” matches at the end of T)
Last Occurrence Function
• Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A
to build a last occurrence function L()
– L() maps all the letters in A to integers
• L(x) is defined as:   // x is a letter in A
– the largest index i such that P[i] == x, or
– -1 if no such index exists
L() Example
• A = {a, b, c, d}
• P: "abacab"
       0 1 2 3 4 5

x     a  b  c  d
L(x)  4  5  3  -1

L() stores indexes into P[]


Note
• In Boyer-Moore code, L() is calculated when the pattern P is read in.
• Usually L() is stored as an array
– something like the table in the previous slide
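As a sketch, L() and the character-jump search can be written as follows; this implements only the character-jump (bad character) heuristic described above, and the function names are illustrative:

```python
def last_occurrence(P, A):
    """L(x): largest index i with P[i] == x, or -1 if x does not occur in P."""
    L = {x: -1 for x in A}
    for i, ch in enumerate(P):       # later occurrences overwrite earlier ones
        L[ch] = i
    return L

def boyer_moore(T, P, A):
    """Scan P right to left; on mismatch, jump using L() (cases 1-3)."""
    L = last_occurrence(P, A)
    n, m = len(T), len(P)
    i = j = m - 1                    # i indexes T, j indexes P
    while i < n:
        if T[i] == P[j]:
            if j == 0:
                return i             # whole pattern matched; leftmost position
            i, j = i - 1, j - 1      # looking-glass: move backwards through P
        else:
            l = L.get(T[i], -1)      # cases 1-3 collapse into this one jump
            i += m - min(j, l + 1)
            j = m - 1                # restart comparison at the end of P
    return -1                        # no occurrence
```

The expression `m - min(j, l + 1)` covers all three cases: when the last occurrence of `T[i]` is left of `j` it aligns them (case 1), when it is right of `j` or absent it shifts by at least one (cases 2 and 3).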
Boyer-Moore Example (2)

T: a b a c a a b a d c a b a c a b a a b b
P: a b a c a b

x     a  b  c  d
L(x)  4  5  3  -1

(the numbers 1–13 in the original figure show the order of the comparisons;
the match is found at position 10)
Analysis
• Boyer-Moore’s worst-case running time is O(nm + |A|)
• But Boyer-Moore is fast when the alphabet (A) is large,
slow when the alphabet is small.
– e.g. good for English text, poor for binary
• Boyer-Moore is significantly faster than brute force for searching English text.
Worst Case Example
• T: "aaaaa…a"
• P: "baaaaa"

T: a a a a a a a a a …
P: b a a a a a
     b a a a a a
       b a a a a a
         b a a a a a

(the numbers 1–24 in the original figure show the comparison order;
every alignment costs m comparisons)
4. The KMP Algorithm
• The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text
in a left-to-right order (like the brute force algorithm).
• But it shifts the pattern more intelligently than the brute force algorithm.
continued
• If a mismatch occurs between the text and pattern P at P[j], what is the
most we can shift the pattern to avoid wasteful comparisons?
• Answer: the largest prefix of P[0..j-1] that is a suffix of P[1..j-1]
Example

(in the original figure, a mismatch occurs at j = 5 and the pattern is
shifted so that jnew = 2)
Why jnew == 2 (when j == 5)

• Find the largest prefix (start) of:
"a b a a b"  (P[0..j-1])
which is a suffix (end) of:
"b a a b"  (P[1..j-1])
• Answer: "a b"
• Set j = 2   // the new j value
KMP Failure Function
• KMP preprocesses the pattern to find matches of prefixes of the pattern
with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1)
• The failure function F(k) is defined as the size of the largest prefix
of P[0..k] that is also a suffix of P[1..k].
Failure Function Example
(k == j-1)
• P: "abaaba"
  j:  012345

k     0 1 2 3 4
F(k)  0 0 1 1 2

F(k) is the size of the largest such prefix.
• In code, F() is represented by an array, like the table.
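The table above can be computed in O(m) time without comparing prefixes from scratch; a minimal sketch that fills F for every k from 0 to m-1:

```python
def failure_function(P):
    """F[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k]."""
    m = len(P)
    F = [0] * m
    k = 0                         # length of the currently matched prefix
    for j in range(1, m):
        while k > 0 and P[j] != P[k]:
            k = F[k - 1]          # fall back to the next shorter candidate prefix
        if P[j] == P[k]:
            k += 1                # extend the matched prefix by one character
        F[j] = k
    return F

print(failure_function("abaaba"))  # -> [0, 0, 1, 1, 2, 3]
```

The first five entries match the table on this slide (the slides only tabulate k = 0..4, since k = j-1 and j ≤ m-1).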
Why is F(4) == 2?   P: "abaaba"
• F(4) means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
= 2
Using the Failure Function
• Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.
– if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then
k = j-1;
j = F(k);   // obtain the new j
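The complete search can be sketched as follows; the failure-function computation is repeated inside so the snippet is self-contained, and the index convention (j counts matched pattern characters) is one common way to phrase the rule above:

```python
def kmp_search(T, P):
    """Return all match positions of P in T; i never moves backwards in T."""
    n, m = len(T), len(P)
    # failure function, as on the previous slides
    F = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and P[j] != P[k]:
            k = F[k - 1]
        if P[j] == P[k]:
            k += 1
        F[j] = k
    matches = []
    j = 0                            # number of pattern characters matched so far
    for i in range(n):               # i only moves forward through T
        while j > 0 and T[i] != P[j]:
            j = F[j - 1]             # on mismatch, reuse the failure function
        if T[i] == P[j]:
            j += 1
        if j == m:                   # full match ending at position i
            matches.append(i - m + 1)
            j = F[j - 1]             # keep searching for further matches
    return matches
```

Each text character is pushed onto the match at most once and every fallback shortens j, so the total work is O(m+n).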
Example

T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k     0 1 2 3 4
F(k)  0 0 1 0 1

(the numbers 1–19 in the original figure show the order of the comparisons;
i never moves backwards in T, and the pattern is found at position 10)
Why is F(4) == 1?   P: "abacab"
• F(4) means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
= 1
KMP Advantages
• KMP runs in optimal time: O(m+n)
– very fast
• The algorithm never needs to move backwards in the input text, T
– this makes the algorithm good for processing very large files that are
read in from external devices or through a network stream
KMP Disadvantages
• KMP doesn’t work so well as the size of the alphabet increases
– more chance of a mismatch (more possible mismatches)
– mismatches tend to occur early in the pattern, but KMP is faster
when the mismatches occur later
