CS369 StringAlgs PDF
Georgy Gimel’farb
(with basic contributions from M. J. Dinneen, Wikipedia, and web materials by
Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.)
Outline String matching Naïve Automaton Rabin-Karp KMP Boyer-Moore Others
Recommended reading:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
C. Charras and T. Lecroq: Exact String Matching Algorithms. Univ. de Rouen, 1997
Image sources:
http://www.biotechnologyonline.gov.au/popups/img helix.html
http://www.insectscience.org/2.10/ref/fig5a.gif
Image source: http://biology.kenyon.edu/courses/biol114/Chap08/longread sequence.gif
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Notation:
m – the length (size) of the pattern; n – the length of the searched text
Example text: ababbaabaaab
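Transitions δ(q, c) for the pattern abaa (ε denotes the empty initial state):

q       ε     a     ab    aba    abaa
c = a   a     a     aba   abaa   a
c = b   ε     ab    ε     ab     ab

State diagram: start → ε → a → ab → aba → abaa along the edges c=a, c=b, c=a, c=a; the remaining edges follow the table.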
See also: http://www.ics.uci.edu/~eppstein/161/960222.html
Automaton:
Initial state: ε (the empty string)
Final states: {abaa}
Transitions q_next = δ(q_curr, c):

q_curr \ c   a      b
ε            a      ε
a            a      ab
ab           aba    ε
aba          abaa   ab
abaa         a      ab

Trace (the bracketed character is the one read at that step):

Step  Text             Transition
1     [a]babbaabaaab   ε → a
2     a[b]abbaabaaab   a → ab
3     ab[a]bbaabaaab   ab → aba
4     aba[b]baabaaab   aba → ab
5     abab[b]aabaaab   ab → ε
6     ababb[a]abaaab   ε → a
7     ababba[a]baaab   a → a
8     ababbaa[b]aaab   a → ab
9     ababbaab[a]aab   ab → aba
10    ababbaaba[a]ab   aba → abaa (final state)
11    ababbaabaa[a]b   abaa → a
12    ababbaabaaa[b]   a → ab
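The transition table and the trace above can be reproduced with a short sketch. This is only an illustration under stated assumptions: the function names `build_automaton` and `run_automaton` are hypothetical, states are stored as prefix lengths rather than prefix strings, the alphabet is assumed to be {a, b}, and the table is built by a simple brute-force construction rather than an efficient one.

```python
def build_automaton(x, alphabet):
    """Return delta, where delta[q][c] is the length of the longest
    prefix of x that is a suffix of x[:q] + c (brute-force construction)."""
    m = len(x)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            s = x[:q] + c
            k = min(m, len(s))
            # Shrink k until the prefix x[:k] is a suffix of s.
            while k > 0 and not s.endswith(x[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def run_automaton(x, y, alphabet="ab"):
    """Feed the text y to the automaton for x; return the visited
    states (as prefixes of x) and the start positions of all matches."""
    delta = build_automaton(x, alphabet)
    q, trace, hits = 0, [], []
    for j, c in enumerate(y):
        q = delta[q][c]          # one table lookup per text character
        trace.append(x[:q])
        if q == len(x):          # final state reached: x ends at position j
            hits.append(j - len(x) + 1)
    return trace, hits


trace, hits = run_automaton("abaa", "ababbaabaaab")
print(trace)
print(hits)  # [6] (0-based start of the match found at step 10)
```

A character outside the assumed alphabet raises a `KeyError` in this sketch; a fuller version would map such characters to state 0.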
Rabin-Karp Algorithm
Main features:
• Uses a hashing function
(i.e., checking whether the window contents “look like” the pattern is cheaper than checking for an exact match)
• Preprocessing phase: O(m) time and constant space
• Searching phase time complexity:
  • O(mn) in the worst case
  • O(n + m) in the expected case
• Well suited to searching for multiple patterns x
Rabin-Karp Algorithm
Using external hash and rehash functions:
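A minimal sketch of the idea, assuming a polynomial rolling hash. The names `hash_value`, `rehash` and `rabin_karp`, and the constants `BASE` and `MOD`, are illustrative choices and not necessarily the functions used in the lecture.

```python
BASE = 256             # assumed radix for the polynomial hash (illustrative)
MOD = 1_000_000_007    # assumed prime modulus (illustrative)


def hash_value(s):
    """Hash of s, computed from scratch in O(len(s)) time."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rehash(h, old, new, high):
    """Hash of the window after dropping `old` on the left and
    appending `new` on the right; high = BASE**(m-1) mod MOD."""
    return ((h - ord(old) * high) * BASE + ord(new)) % MOD


def rabin_karp(x, y):
    """Return the 0-based start positions of all occurrences of x in y."""
    m, n = len(x), len(y)
    if m == 0 or m > n:
        return []
    high = pow(BASE, m - 1, MOD)
    hx, hy = hash_value(x), hash_value(y[:m])   # preprocessing: O(m)
    hits = []
    for j in range(n - m + 1):
        # Compare characters only when the window "looks like" the pattern.
        if hx == hy and y[j:j + m] == x:
            hits.append(j)
        if j + m < n:
            hy = rehash(hy, y[j], y[j + m], high)   # O(1) window update
    return hits


print(rabin_karp("abaa", "ababbaabaaab"))  # [6]
```

The explicit comparison `y[j:j + m] == x` guards against hash collisions; it is what makes the worst case O(mn) when many windows share the pattern's hash.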
Knuth-Morris-Pratt Algorithm
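[Figure: the pattern x matches a text segment u and mismatches at the next character a; the longest border v of u is marked at both ends of u; x is then shifted by |u| − |v| so that its prefix v, followed by a character c, lines up with the suffix v of u.]
The length of the longest string v that is both a prefix and a suffix of u, where the two occurrences are followed by different characters (va and vc in the figure), gives the next search index next[i].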
Knuth-Morris-Pratt Preprocessing
All the values next[i], 0 ≤ i ≤ m, can be computed in total time O(m), where m = |x|;
the shift after a mismatch at pattern position i is i − next[i]
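One way to compute the table in O(m) time is sketched below. The function name `kmp_next` is hypothetical, and the convention next[0] = −1 is an assumption taken from the kmpNext table in the Charras and Lecroq reference; the lecture may index the table differently.

```python
def kmp_next(x):
    """Return next[0..m] for the pattern x.

    next[i] is the length of the longest border v of x[:i] such that
    the character after the prefix v differs from x[i]; -1 means that
    no such border exists and the text pointer simply advances.
    """
    m = len(x)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        # Fall back through shorter borders until x[i] extends one.
        while j > -1 and x[i] != x[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            nxt[i] = nxt[j]   # same next character: reuse the shorter border
        else:
            nxt[i] = j
    return nxt


print(kmp_next("abaa"))  # [-1, 0, -1, 1, 1]
```

Both pointers only move forward overall, because every step back through `nxt[j]` is paid for by an earlier increment of j; this gives the O(m) bound.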
Searching phase:
ababbaabaaab
ABAa Shift by 2 (next[3]=1)
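The searching phase can be sketched as follows. To keep the example self-contained, the table for the pattern abaa is supplied by hand. Only the entry next[3] = 1 comes from the slide; the other entries are an assumption based on the convention next[0] = −1. The function name `kmp_search` is hypothetical.

```python
def kmp_search(x, y, nxt):
    """Scan y once for the pattern x using a precomputed table nxt.

    Returns the 0-based match positions and the list of shifts
    (i - nxt[i]) applied after each mismatch.
    """
    m = len(x)
    hits, shifts = [], []
    i = 0                          # current position in the pattern
    for j, c in enumerate(y):      # the text pointer j never moves back
        while i > -1 and x[i] != c:
            shifts.append(i - nxt[i])
            i = nxt[i]
        i += 1
        if i == m:                 # whole pattern matched, ending at j
            hits.append(j - m + 1)
            i = nxt[i]
    return hits, shifts


# Table for x = "abaa", supplied by hand (assumed convention nxt[0] = -1).
NEXT_ABAA = [-1, 0, -1, 1, 1]

hits, shifts = kmp_search("abaa", "ababbaabaaab", NEXT_ABAA)
print(hits)       # [6]
print(shifts[0])  # 2, the first shift shown in the trace above
```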
Boyer-Moore Algorithm
or, if y[j+i] does not occur in x, the left end of the window is aligned with
the character y[j+i+1] immediately after y[j+i]:
ababbaabaaab
...a            Shift by 2
  ..aA          Shift by 1
   aBAA         Shift by 3
      ABAA      Shift by 3
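The bad-character shift described above can be sketched in isolation. This is a deliberately simplified variant, not the full Boyer-Moore algorithm: it omits the good-suffix rule and falls back to a shift of 1 whenever the bad-character rule would not move the window forward, so some of its shifts are smaller than those of the full algorithm. The function names are hypothetical.

```python
def bad_character_table(x):
    """Map each character of x[:-1] to the distance from its rightmost
    occurrence to the last position of x."""
    m = len(x)
    return {c: m - 1 - i for i, c in enumerate(x[:-1])}


def bm_bad_character_search(x, y):
    """Right-to-left window comparison with bad-character shifts only.

    Returns the 0-based match positions and the shift used after
    each attempt.
    """
    m, n = len(x), len(y)
    bc = bad_character_table(x)
    hits, shifts = [], []
    j = 0                                  # left end of the window
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:
            i -= 1                         # compare from right to left
        if i < 0:
            hits.append(j)
            shift = 1                      # simplification: no good-suffix rule
        else:
            # A character missing from the table gets distance m, which
            # places the left end of the window at y[j+i+1].
            shift = max(1, bc.get(y[j + i], m) - (m - 1 - i))
        shifts.append(shift)
        j += shift
    return hits, shifts


hits, shifts = bm_bad_character_search("abaa", "ababbaabaaab")
print(hits)        # [6]
print(shifts[:2])  # [2, 1]
```

For a mismatching text character that does not occur in x at all, for example the c in `bm_bad_character_search("abaa", "abcabaa")`, the first shift is i + 1 = 3, which moves the window just past that character.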
The ORs in the above rule account for the four possible ways to
approximate the pattern with errors