String Matching

Uploaded by

keshavpal583
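• Introduction
• Naïve Algorithm
• Rabin-Karp Algorithm
• Knuth-Morris-Pratt (KMP) Algorithm
• Finite Automata

Algorithm            Preprocessing time   Matching time
Naïve                0                    O((n-m+1)m)
Rabin-Karp           Θ(m)                 O((n-m+1)m)
Knuth-Morris-Pratt   Θ(m)                 Θ(n)
Finite Automata      O(m|Σ|)              Θ(n)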

GCET-CSE String Matching 1


Introduction
• What is string matching?
– Finding all occurrences of a pattern in a
given text (or body of text)
• Many applications
– While using editor/word processor/browser
– Login name & password checking
– Virus detection
– Header analysis in data communications
– DNA sequence analysis



String-Matching Problem
• The text is in an array T [1..n] of length n
• The pattern is in an array P [1..m] of
length m
• Elements of T and P are characters from
a finite alphabet Σ
– E.g., Σ = {0,1} or Σ = {a, b, …, z}
• Usually T and P are called strings of
characters
String-Matching Problem (contd.)

• We say that pattern P occurs with shift s in text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
• If P occurs with shift s in T, then s is a
valid shift, otherwise s is an invalid shift
• String-matching problem: finding all valid
shifts for a given T and P



Pattern P   a  b  a  a
index       1  2  3  4  5  6  7  8  9 10 11 12 13
Text T      a  b  c  a  b  a  a  b  c  a  b  a  c

s=0         a  b  a  a
s=1            a  b  a  a
s=2               a  b  a  a
s=3                  a  b  a  a
shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)
Example
pattern P   a  b  a  a
index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  b  c  a  b  a  a  b  c  a  b  a  a

s=3                  a  b  a  a
s=9                                    a  b  a  a
Terminology
• Concatenation of 2 strings x and y is xy
– E.g., x=“putra”, y=“jaya” ⇒ xy = “putrajaya”
• A string w is a prefix of a string x, if x=wy
for some string y
– E.g., “putra” is a prefix of “putrajaya”
• A string w is a suffix of a string x, if x=yw
for some string y
– E.g., “jaya” is a suffix of “putrajaya”



Naïve String-Matching Algorithm
Input: Text strings T [1..n] and P[1..m]
Result: All valid shifts displayed

NAÏVE-STRING-MATCHER (T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
      if P[1..m] = T [(s+1)..(s+m)]
          print “pattern occurs with shift” s
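The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `naive_string_matcher` is an assumption, and shifts are reported 0-based, which coincides with the shift s used in the slides even though T and P are indexed from 1 there.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # s = 0 .. n-m
        # Slice comparison plays the role of P[1..m] = T[(s+1)..(s+m)].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# Text and pattern from the earlier example slide.
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```

The slice comparison hides the inner character loop, so the worst-case cost is still proportional to (n-m+1)m.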
Analysis: Worst-case Example
pattern P   a  a  a  b
index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      a  a  a  a  a  a  a  a  a  a  a  a  a

            a  a  a  b
               a  a  a  b
               …
Worst-case Analysis
• There are m comparisons for each shift
in the worst case
• There are n-m+1 shifts
• So, the worst-case running time is
Θ((n-m+1)m)
– In the example on the previous slide, we have
(13-4+1)·4 = 40 comparisons in total
• The naïve method is inefficient because
information gained at one shift is not reused at later shifts
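The count of 40 comparisons can be checked with a small sketch that makes the character-by-character loop explicit. The helper name `count_comparisons` is hypothetical and only serves this illustration.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                      # mismatch: give up on this shift
    return comparisons


# Worst-case example from the slides: T = a^13, P = aaab.
print(count_comparisons("a" * 13, "aaab"))  # 40
```

Every one of the 10 shifts matches three characters and fails on the fourth, which is exactly the (n-m+1)·m bound.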
Rabin-Karp Algorithm
• Has a worst-case running time of O((n-m+1)m)
but average-case is O(n+m)
• Characters are digits from Σ = {0, 1, 2, …, 9}
– In general, use radix d, where d = |Σ|
• Pattern P[1..m]
• Text T[1..n]
• p = decimal value of the pattern P[1..m]
• t_s = decimal value of the length-m substring
T[(s+1)..(s+m)]
Rabin-Karp Algorithm
• If t_s = p, then s is a valid shift
• Decimal value of the pattern (Horner's rule)
p = P[m] + 10 (P[m-1] + 10 (P[m-2] + … + 10 (P[2] + 10 P[1]) … ))
• Decimal value of the text at shift 0 (s = 0)
t_0 = T[m] + 10 (T[m-1] + 10 (T[m-2] + … + 10 (T[2] + 10 T[1]) … ))
• Decimal value of the text at shift s+1, computed from t_s
t_{s+1} = 10 (t_s - 10^{m-1} T[s+1]) + T[s+m+1]
• Problem: this assumes that p and t_s are small numbers.
For longer patterns the values become too large to handle in constant time with fixed-size machine words
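As a rough illustration of the recurrence above, the following sketch computes every window value t_s for a digit string without modular reduction. The names `horner_value` and `window_values` are assumptions made for this example; Python integers are unbounded, so the size problem mentioned above does not show up here.

```python
def horner_value(digits: str) -> int:
    """Decimal value of a digit string, evaluated with Horner's rule."""
    value = 0
    for ch in digits:
        value = 10 * value + int(ch)
    return value


def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} using the rolling update."""
    high = 10 ** (m - 1)                   # weight of the high-order digit
    t = horner_value(text[:m])             # t_0
    values = [t]
    for s in range(len(text) - m):
        # Drop the old high-order digit, shift left, append the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


# Digits from the "basic example" slide: t_0 = 3147, t_1 = 1475.
print(window_values("3147533423147", 4)[:2])  # [3147, 1475]
```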
Rabin-Karp Algorithm
• Solution: we can use modular arithmetic with a
suitable modulus, q
• q is chosen as a small prime number e.g., 13
for radix 10
– Generally, if the radix is d, then dq should fit
within one computer word
• t_{s+1} = (d (t_s - h T[s+1]) + T[s+m+1]) mod q
where h = d^{m-1} mod q
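The modular update can be sketched as below, assuming digit strings with radix d = 10 and modulus q = 13 as in the slides. The function name `rolling_hashes` is hypothetical. Python's `%` always returns a value in 0..q-1 for positive q, so the intermediate negative value needs no special handling.

```python
def rolling_hashes(text: str, m: int, d: int = 10, q: int = 13) -> list[int]:
    """Return t_s mod q for every shift s, using the rolling update."""
    h = pow(d, m - 1, q)                   # h = d^(m-1) mod q
    t = 0
    for ch in text[:m]:                    # Horner's rule, reduced mod q
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(len(text) - m):
        t = (d * (t - h * int(text[s])) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


# Text from the spurious-hit example slide, window length 5.
print(rolling_hashes("23141526739921", 5))
# [1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
```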


Rabin-Karp basic example
pattern P   3  1  4  3
index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      3  1  4  3  0  3  3  4  2  3  1  4  3

p = decimal value of P = 3143
t_s = decimal value of the length-m substring T[(s+1)..(s+m)]; for example, t_0 = 3143
Rabin-Karp basic example
pattern P   3  1  4  7
index       1  2  3  4  5  6  7  8  9 10 11 12 13
text T      3  1  4  7  5  3  3  4  2  3  1  4  7

p   = 3 x 1000 + 1 x 100 + 4 x 10 + 7 = 3147
t_0 = 3 x 1000 + 1 x 100 + 4 x 10 + 7 = 3147

Computing t_1 from t_0:
3147 – 3000 = 147
10 x 147 = 1470
1470 + 5 = 1475 = t_1

t_1 = 10 (t_0 - 10^3 x T[1]) + T[5]
t_1 = 10 (3147 – 1000 x 3) + 5 = 1475
Modulus Division
• q is the modulus (the divisor)
• r = a mod q is the remainder of dividing a by q
• If q = 13 and a = 10000
• r = 10000 mod 13 = 3
• Similarly
• r = 31415 mod 13 = 7
• r = 14152 mod 13 = 8
• For negative a, add q to the negative remainder so that 0 ≤ r < q
• r = -10000 mod 13 = -3 + 13 = 10
• r = -31415 mod 13 = -7 + 13 = 6
• r = -14152 mod 13 = -8 + 13 = 5
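The remainders above can be reproduced in Python, with one caveat worth spelling out: Python's `%` already returns a non-negative result for a positive modulus, whereas C-style remainders (shown here through `math.fmod`) keep the sign of the dividend and need the "+ q" correction used on the slide. The variable names are illustrative only.

```python
import math

q = 13
# Python's % takes the sign of the divisor, so results are already in 0..q-1.
remainders = {a: a % q for a in (10000, 31415, 14152, -10000, -31415, -14152)}
for a, r in remainders.items():
    print(f"{a} mod {q} = {r}")

# math.fmod keeps the sign of the dividend, so q is added back by hand.
c_style = math.fmod(-10000, q)             # -3.0
print(int(c_style) + q)                    # 10
```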
Rabin-Karp Algorithm
• Calculate the hash value of the pattern and of the first length-m
substring of the text.
• If the hash values are unequal, calculate the hash value of the next
length-m substring of the text (next shift, s+1).
• If the hash values are equal, do a brute-force (character by character)
comparison.
• In this way there is only one hash comparison per text substring;
characters are compared only when the hash values agree.


Compute t_4 mod q using t_3 mod q (q = 13)

text T       9  0  2  3  1  4  1  5  2  6  7  3
t_3 window            3  1  4  1  5                (t_3 mod 13 = 7)
t_4 window               1  4  1  5  2             (t_4 mod 13 = 8)

old high-order digit: 3, new low-order digit: 2

14152 ≡ (31415 – 3 · 10000) · 10 + 2 (mod 13)
      ≡ (7 – 3 · 3) · 10 + 2 (mod 13)
      ≡ 8 (mod 13)
Problem of Spurious Hits
• t_s ≡ p (mod q) does not imply that t_s = p
– Modular equivalence does not necessarily
mean that two integers are equal
• A case in which t_s ≡ p (mod q) when t_s ≠
p is called a spurious hit
• On the other hand, if two integers are not
equivalent modulo q, then they cannot be
equal
Example
pattern      3  1  4  1  5         p mod 13 = 7

index        1  2  3  4  5  6  7  8  9 10 11 12 13 14
text         2  3  1  4  1  5  2  6  7  3  9  9  2  1

shift s      0  1  2  3  4  5  6  7  8  9
t_s mod 13   1  7  8  4  5 10 11  7  9 11

s = 1: valid match (31415)
s = 7: spurious hit (67399 ≡ 7 (mod 13), but 67399 ≠ 31415)
Rabin-Karp Algorithm
• Basic structure like the naïve algorithm,
but uses modular arithmetic as described
• For each hit, i.e., for each s where t_s ≡ p
(mod q), verify character by character
whether s is a valid shift or a spurious hit
• In the worst case, every shift is verified
– Running time can be shown to be O((n-m+1)m)
• Average-case running time is O(n+m)
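Putting the pieces together, a Rabin-Karp matcher might look like the sketch below. It is restricted to digit strings with d = 10 and q = 13 to match the slides, and the function name `rabin_karp` and the choice to return spurious hits separately are assumptions made for illustration.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)                   # h = d^(m-1) mod q
    p = t = 0
    for i in range(m):                     # hash of P and of the first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                         # hit: verify character by character
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:                      # roll the hash to shift s+1
            t = (d * (t - h * int(text[s])) + int(text[s + m])) % q
    return valid, spurious


# Example slide: pattern 31415 in text 23141526739921.
print(rabin_karp("23141526739921", "31415"))  # ([1], [7])
```

With a modulus as small as 13, spurious hits are frequent; the verification step is what keeps the result correct.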
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.



Knuth-Morris-Pratt (KMP)



Knuth-Morris-Pratt (KMP) algorithm
• Idea: after some number of characters (say q) of P have matched
T and a mismatch then occurs, the q matched characters allow
us to determine immediately that certain shifts are
invalid, so we can go directly to the next shift that is potentially
valid.
• The matched characters in T are in fact a prefix of P, so
P alone is enough to determine whether a shift is invalid
or not.
• Define a prefix function π, which encapsulates the
knowledge about how the pattern P matches against
shifts of itself.
– π : {1, 2, …, m} → {0, 1, …, m-1}
– π[q] = max{k : k < q and P_k is a suffix of P_q}, where P_k = P[1..k]; that is, π[q] is the length of the
longest prefix of P that is a proper suffix of P_q.


Prefix function
If we precompute the prefix function of
P (against itself), then whenever
a mismatch occurs, the prefix function
tells us which shifts are invalid
and can be ruled out directly, so we move
straight to the next potentially valid shift.
Moreover, the characters already known to match
need not be compared again, since they are
equal.




Compute Prefix()
m=7, [1]=0 & k=0
1 2 3 4 5 6 7
P a b a b a c a For q=2

1 2 3 4 5 6 7 While 0>0 x
 0 0 If P[0+1] = P[2]
a b x
[2]=0 k

GCET-CSE String Matching 29


Compute Prefix()
1 2 3 4 5 6 7 For q=3
P a b a b a c a While 0>0 x
1 2 3 4 5 6 7 If P[0+1] = P[3]
 0 0 1 a a ok
k=k+1=1
[3]=1 k

GCET-CSE String Matching 30


Compute Prefix()
1 2 3 4 5 6 7 For q=4
P a b a b a c a While 1>0 ok
1 2 3 4 5 6 7 & P[1+1] ≠ P[4]
 0 0 1 2 b b x
If P[1+1] = P[4]
b b ok
k=k+1=2
[4]=2 k
GCET-CSE String Matching 31
Compute Prefix()
1 2 3 4 5 6 7 For q=5
P a b a b a c a While 2>0 ok
1 2 3 4 5 6 7 & P[2+1] ≠ P[5]
 0 0 1 2 3 a a x
If P[2+1] = P[5]
a a ok
k=k+1=3
[5]=3 k
GCET-CSE String Matching 32
Compute Prefix()
1 2 3 4 5 6 7 For q=6
P a b a b a c a While 3>0 ok
1 2 3 4 5 6 7 & P[3+1] ≠ P[6] b≠c ok
 0 0 1 2 3 k=[k] k= [3]=1  k
While 1>0 ok
& P[1+1] ≠ P[6] b≠c ok
k= [1]=0 k
While 0>0 x
GCET-CSE String Matching 33
Compute Prefix()
1 2 3 4 5 6 7 For q=6 continue..
P a b a b a c a Now updated k=0

1 2 3 4 5 6 7 If P[0+1] = P[6]
 0 0 1 2 3 0 a c x
[6]=0 k

GCET-CSE String Matching 34


Compute Prefix()
1 2 3 4 5 6 7 For q=7
P a b a b a c a While 0>0 x
1 2 3 4 5 6 7 If P[0+1] = P[7]
 0 0 1 2 3 0 1 a a ok
k=0+1=1
[7]=1 k

GCET-CSE String Matching 35
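The trace above follows the usual prefix-function loop, which can be sketched in Python as follows. The function name `compute_prefix` is an assumption; to keep the slides' 1-based indices, the returned list has an unused entry at index 0, so P[k+1] in the slides corresponds to `pattern[k]` here.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Return pi with pi[q] defined for q = 1..m (pi[0] is unused padding)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # While k > 0 and P[k+1] != P[q], fall back to a shorter border.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:   # P[k+1] = P[q]: extend the border
            k += 1
        pi[q] = k
    return pi


print(compute_prefix("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```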


KMP algorithm



KMP-Algorithm
i   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
T   b  a  c  b  a  b  a  b  a  c  a  b  a  b  a

q   1  2  3  4  5  6  7
P   a  b  a  b  a  c  a
π   0  0  1  2  3  0  1

n=15, m=7, initially q=0 (“ok” means the condition holds, “x” means it fails)

For i=1 (q=0)
  While 0>0: x
  If P[0+1] = T[1]: a, b: x
  If q = m: x
For i=2 (q=0)
  While 0>0: x
  If P[0+1] = T[2]: a, a: ok, q=q+1=1
  If 1 = 7: x
For i=3 (q=1)
  While 1>0 & P[1+1] ≠ T[3]: b≠c: ok, q=π[1]=0
  While 0>0: x
  If P[0+1] = T[3]: a, c: x
  If 0 = 7: x
For i=4 (q=0)
  While 0>0: x
  If P[0+1] = T[4]: a, b: x
  If 0 = 7: x
For i=5 (q=0)
  While 0>0: x
  If P[0+1] = T[5]: a, a: ok, q=q+1=1
  If 1 = 7: x
For i=6 (q=1)
  While 1>0 & P[1+1] ≠ T[6]: b, b: x
  If P[1+1] = T[6]: b, b: ok, q=q+1=2
  If 2 = 7: x
For i=7 (q=2)
  While 2>0 & P[2+1] ≠ T[7]: a, a: x
  If P[2+1] = T[7]: a, a: ok, q=q+1=3
  If 3 = 7: x
For i=8 (q=3)
  While 3>0 & P[3+1] ≠ T[8]: b, b: x
  If P[3+1] = T[8]: b, b: ok, q=q+1=4
  If 4 = 7: x
For i=9 (q=4)
  While 4>0 & P[4+1] ≠ T[9]: a, a: x
  If P[4+1] = T[9]: a, a: ok, q=q+1=5
  If 5 = 7: x
For i=10 (q=5)
  While 5>0 & P[5+1] ≠ T[10]: c, c: x
  If P[5+1] = T[10]: c, c: ok, q=q+1=6
  If 6 = 7: x
For i=11 (q=6)
  While 6>0 & P[6+1] ≠ T[11]: a, a: x
  If P[6+1] = T[11]: a, a: ok, q=q+1=7
  If 7 = 7: ok, match at shift i-m = 11-7 = 4
  q=π[7]=1
For i=12 (q=1)
  While 1>0 & P[1+1] ≠ T[12]: b, b: x
  If P[1+1] = T[12]: b, b: ok, q=q+1=2
  If 2 = 7: x
----- Continued till end of text -----
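The trace above can be reproduced with the following sketch of a KMP matcher. The names `compute_prefix` and `kmp_matcher` are assumptions for this example; the loop variable i runs from 1 to n as in the slides, so T[i] is `text[i - 1]` and P[q+1] is `pattern[q]`.

```python
def compute_prefix(pattern: str) -> list[int]:
    """Prefix function with pi[q] defined for q = 1..m (pi[0] unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix(pattern)
    q = 0                                  # number of characters matched
    shifts = []
    for i in range(1, n + 1):
        while q > 0 and pattern[q] != text[i - 1]:
            q = pi[q]                      # next character does not match
        if pattern[q] == text[i - 1]:
            q += 1                         # next character matches
        if q == m:                         # all of P matched
            shifts.append(i - m)
            q = pi[q]                      # look for the next match
    return shifts


print(kmp_matcher("bacbababacababa", "ababaca"))  # [4]
```

The text index i only moves forward, which is the property the slides highlight as an advantage for streamed input.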
Analysis of KMP algorithm
• The running time of COMPUTE-PREFIX-FUNCTION is
Θ(m), and that of KMP-MATCHER is Θ(m) + Θ(n).
• Using amortized analysis (potential method) (for
COMPUTE-PREFIX-FUNCTION):
– Associate a potential of k with the current state k of the
algorithm.
– Consider the code in lines 5 to 9.
– The initial potential is 0; line 6 decreases k, since π[k] < k, and k never
becomes negative.
– Line 8 increases k by at most 1.
– Amortized cost = actual cost + potential increase
– = (number of repetitions of line 5 + O(1)) + (a potential decrease of at least
the number of repetitions of line 5, plus the O(1) increase in line 8) = O(1).


Knuth-Morris-Pratt (KMP) algorithm

• Advantages of KMP
– It runs in O(m+n) time, which is optimal
– The algorithm never needs to move backward in the input text
– It is well suited to processing very large files (network
streaming, external devices, …)

• Disadvantage
– KMP becomes less effective as the alphabet size grows (more
chance of a mismatch)


FINITE AUTOMATA



Pattern: ababaca

In state q the automaton has matched the first q characters of the pattern. On each input, the next state is the length of the longest prefix of the pattern that is a suffix of the characters read so far.

State ‘0’ (nothing matched)
  input {a}: it moves to state ‘1’
  inputs {b} & {c}: it stays in state ‘0’
State ‘1’ (previously visited {a})
  input {a}: string read is {aa}; the longest prefix of the pattern that is a suffix of {aa} is {a}, length=1
  input {b}: it moves to state ‘2’
  input {c}: string read is {ac}; length=0
State ‘2’ (previously visited {ab})
  input {a}: it moves to state ‘3’
  input {b}: string read is {abb}; length=0
  input {c}: string read is {abc}; length=0
State ‘3’ (previously visited {aba})
  input {a}: string read is {abaa}; length=1 {a}
  input {b}: it moves to state ‘4’
  input {c}: string read is {abac}; length=0
State ‘4’ (previously visited {abab})
  input {a}: it moves to state ‘5’
  input {b}: string read is {ababb}; length=0
  input {c}: string read is {ababc}; length=0
State ‘5’ (previously visited {ababa})
  input {a}: string read is {ababaa}; length=1 {a}
  input {b}: string read is {ababab}; length=4 {abab}
  input {c}: it moves to state ‘6’
State ‘6’ (previously visited {ababac})
  input {a}: it moves to state ‘7’
  input {b}: string read is {ababacb}; length=0
  input {c}: string read is {ababacc}; length=0
State ‘7’ (previously visited {ababaca}, the whole pattern)
  input {a}: string read is {ababacaa}; length=1 {a}
  input {b}: string read is {ababacab}; length=2 {ab}
  input {c}: string read is {ababacac}; length=0


Pattern: ababaca

State   a  b  c   Pattern
0       1  0  0   a
1       1  2  0   b
2       3  0  0   a
3       1  4  0   b
4       5  0  0   a
5       1  4  6   c
6       7  0  0   a
7       1  2  0

[Figure: state-transition diagram of the automaton; the forward edges spell a b a b a c a, with back edges labelled a and b]
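The table above can be generated mechanically from the rule "longest prefix of the pattern that is a suffix of what has been read". The sketch below is a straightforward, unoptimized version; the function name `compute_transition_function` and the representation of each row as a dictionary are assumptions made for this illustration.

```python
def compute_transition_function(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Return delta, where delta[q][a] is the next state from state q on input a."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            read = pattern[:q] + a         # characters matched so far, plus a
            k = min(m, q + 1)
            # Shrink k until P[1..k] is a suffix of the string read.
            while k > 0 and not read.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


for q, row in enumerate(compute_transition_function("ababaca", "abc")):
    print(q, row["a"], row["b"], row["c"])
```

Each table entry is found by trying candidate lengths from the largest down, so the construction is slower than the O(m|Σ|) bound achievable with more careful methods.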


Pattern: ababaca
Text = abababacaba

Input           a  b  a  b  a  b  a  c  a  b  a  …
Current state   0  1  2  3  4  5  4  5  6  7  2
Next state      1  2  3  4  5  4  5  6  7  2  3

State 7 is reached after the 9th character, so the pattern occurs with shift 9-7 = 2.
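Running the automaton over a text is a single table lookup per character. The sketch below hard-codes the transition table for ababaca from the slides; the names `DELTA` and `automaton_matcher` are assumptions for this example.

```python
# Transition table for P = ababaca, copied from the slides (rows are states 0..7).
DELTA = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
    {"a": 3, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 0},
    {"a": 5, "b": 0, "c": 0},
    {"a": 1, "b": 4, "c": 6},
    {"a": 7, "b": 0, "c": 0},
    {"a": 1, "b": 2, "c": 0},
]


def automaton_matcher(text: str, delta: list[dict[str, int]], m: int):
    """Return (states visited after each character, valid shifts)."""
    q = 0
    trace, shifts = [], []
    for i, ch in enumerate(text, start=1):
        q = delta[q][ch]                   # one lookup per text character
        trace.append(q)
        if q == m:                         # accepting state: pattern ends at i
            shifts.append(i - m)
    return trace, shifts


trace, shifts = automaton_matcher("abababacaba", DELTA, 7)
print(trace)   # [1, 2, 3, 4, 5, 4, 5, 6, 7, 2, 3]
print(shifts)  # [2]
```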

