String Search
Some applications.
■ Word processors.
■ Virus scanning.
■ Text information retrieval systems. (Lexis, Nexis)
■ Digital libraries.
■ Natural language processing.
■ Specialized databases.
■ Computational molecular biology.
■ Web search engines.

Reference: Chapter 19, Algorithms in C by R. Sedgewick, Addison Wesley, 1990.

Princeton University • COS 423 • Theory of Algorithms • Spring 2002 • Kevin Wayne
Analysis of Brute Force

Analysis of brute force.
■ Running time depends on pattern and text.
  – can be slow when strings repeat themselves
■ Worst case: MN comparisons.
  – too slow when M and N are large
■ Need better ideas in general.

How To Save Comparisons

How to avoid recomputation?
■ Pre-analyze the search pattern.
■ Ex: suppose that the first 5 characters of the pattern are all a's.
  – If t[0..4] matches p[0..4], then t[1..4] matches p[0..3].
  – no need to check i = 1, j = 0, 1, 2, 3
  – saves 4 comparisons
Search Pattern: a a a a a b
Search Text:    a a a a a a a a a a a a a a a a a a a a b

[Figure: brute-force search of the pattern aaaaab over the text above, showing the repeated character comparisons made at successive offsets]
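The brute-force method analyzed above can be sketched as follows (a minimal sketch; the function name is my own):

```c
#include <string.h>

/* Brute-force search: try every offset i of the text and compare the
 * pattern character by character. Worst case is about M*N comparisons,
 * e.g. pattern aaaaab against text aaaa...aab as in the figure. */
int brute(const char p[], const char t[]) {
    int M = strlen(p), N = strlen(t);
    for (int i = 0; i <= N - M; i++) {
        int j = 0;
        while (j < M && t[i+j] == p[j])
            j++;
        if (j == M) return i;   /* match at offset i */
    }
    return -1;                  /* no match */
}
```

On repetitive inputs like the example, the inner loop runs almost M steps at nearly every offset, which is exactly the MN behavior the analysis warns about.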
Knuth-Morris-Pratt

KMP algorithm.
■ Use knowledge of how the search pattern repeats itself.
■ Build FSA from pattern.
■ Run FSA on text.
■ O(M + N) worst-case running time.

[Figure: FSA for the pattern aabaaa, states 0–6, with transitions on a and b; state 6 is the accept state]
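Running the FSA on the text can be sketched as follows. The transition tables are read off the diagram for the pattern aabaaa over the two-letter alphabet {a, b} (the hard-coded tables and the function name are illustrative assumptions):

```c
#include <string.h>

/* Simulate the KMP FSA for the pattern aabaaa on text t. State s counts
 * how many pattern characters currently match; reaching state 6 accepts. */
int kmp_run(const char t[]) {
    static const int on_a[] = {1, 2, 2, 4, 5, 6};  /* next state on 'a' */
    static const int on_b[] = {0, 0, 3, 0, 0, 3};  /* next state on 'b' */
    int M = 6, N = strlen(t), s = 0;
    for (int i = 0; i < N; i++) {
        s = (t[i] == 'a') ? on_a[s] : on_b[s];
        if (s == M) return i - M + 1;   /* match starts here */
    }
    return -1;
}
```

Each text character is consumed exactly once and the FSA never moves backward in the text, which is where the O(M + N) bound comes from.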
KMP Algorithm

Given the FSA, string search is easy.
■ The array next[] contains the next FSA state if the character mismatches.

FSA Construction for KMP

FSA construction for KMP.
■ FSA builds itself!
Example. Building the FSA for aabaaabb.
■ State 7: p[0..6] = aabaaab.
  – assume you know the state X for p[1..6] = abaaab: X = 3
  – if the next char is b (match): go forward to state 7+1 = 8
  – if the next char is a (mismatch): go to the state for abaaaba, i.e. state X on 'a' = 4
  – update X to the state for p[1..7] = abaaabb, i.e. state X on 'b' = 0

[Figure: FSA for the pattern aabaaab, states 0–7; state 7 is the accept state]
FSA Construction for KMP

Search Pattern: a a b a a a b b

 j   pattern[1..j]   X   next
 0                   0   0
 1   a               1   0
 2   a b             0   2
 3   a b a           1   0
 4   a b a a         2   0
 5   a b a a a       2   3
 6   a b a a a b     3   2

FSA transition table (states 0–6):
      0  1  2  3  4  5  6
  a   1  2  2  4  5  6  2
  b   0  0  3  0  0  3  7

[Figure: FSA for the pattern aabaaab, states 0–7; state 7 is the accept state]
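The construction traced above can be sketched in C. Here `build_fsa` is a hypothetical name, and X is the state the FSA would reach on p[1..j-1], exactly as in the table's X column:

```c
#include <string.h>

#define R 2   /* alphabet {a, b} */

/* Build the KMP FSA for pattern p (length M <= 15): dfa[c][j] is the next
 * state from state j on character 'a'+c. On a mismatch in state j the FSA
 * behaves as if it had read p[1..j-1], i.e. as if it were in state X, so
 * column j starts as a copy of column X; the match transition then
 * advances to j+1. */
void build_fsa(const char p[], int M, int dfa[R][16]) {
    for (int c = 0; c < R; c++) dfa[c][0] = 0;
    dfa[p[0]-'a'][0] = 1;
    for (int X = 0, j = 1; j < M; j++) {
        for (int c = 0; c < R; c++)
            dfa[c][j] = dfa[c][X];    /* mismatch: copy state X's moves */
        dfa[p[j]-'a'][j] = j + 1;     /* match: advance */
        X = dfa[p[j]-'a'][X];         /* X becomes the state for p[1..j] */
    }
}
```

For aabaaab this reproduces the table above: the a-row 1 2 2 4 5 6 2 and the b-row 0 0 3 0 0 3 7, with X taking the values 1, 0, 1, 2, 2, 3 — the "FSA builds itself" step, since each new column is filled in by running the partially built FSA.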
Boyer-Moore

Example. Bad character heuristic with the pattern sting.

Search Text:    a s t r i n g s s e a r c h c o n s i s t i n g o f
Search Pattern: s t i n g

[Figure: successive right-to-left alignments of sting against the text; each character comparison is marked Mismatch, Match, or No comparison]
Boyer-Moore

Boyer-Moore algorithm (1974).
■ Right-to-left scanning.
■ Heuristic 1: advance offset i using the "bad character rule."
  – extremely effective for English text

[Figure: two alignments of the pattern xcabdabdab against the text xxxxxxxbabxxxxxxxxxxxxxx, showing the shift made by the bad character rule]

Boyer-Moore analysis.
■ O(N / M) average case if a given letter usually doesn't occur in the string.
  – English text: 10 character search string, 26 char alphabet
  – time decreases as pattern length increases
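The bad character rule alone can be sketched as follows (a minimal version with only heuristic 1; the `right[]` table and function name are my own):

```c
#include <string.h>

/* Boyer-Moore with the bad character rule only: scan the pattern right to
 * left; on a mismatch against text char c, shift the pattern so that its
 * rightmost occurrence of c lines up with c, or one position right if c
 * does not occur (right[c] == -1) or the shift would be backward. */
int bm_search(const char p[], const char t[]) {
    int M = strlen(p), N = strlen(t), right[256];
    for (int c = 0; c < 256; c++) right[c] = -1;
    for (int j = 0; j < M; j++)
        right[(unsigned char)p[j]] = j;         /* rightmost occurrence */
    for (int i = 0; i <= N - M; ) {
        int j;
        for (j = M - 1; j >= 0 && p[j] == t[i+j]; j--)
            ;                                   /* right-to-left scan */
        if (j < 0) return i;                    /* match at offset i */
        int skip = j - right[(unsigned char)t[i+j]];
        i += (skip > 1) ? skip : 1;             /* bad character rule */
    }
    return -1;
}
```

When the mismatched text character does not occur in the pattern at all, the whole pattern jumps past it, which is what gives the O(N / M) average-case behavior claimed above.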
Karp-Rabin

Idea: use hashing.
■ Compute hash function for each text position.
■ No explicit hash table!
  – just compare with pattern hash

Example.
■ Hash "table" size = 97.

Search Pattern: 5 9 2 6 5        59265 % 97 = 95

Search Text: 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6
  31415 % 97 = 84
  14159 % 97 = 94
  41592 % 97 = 76
  15926 % 97 = 18
  59265 % 97 = 95  (match)

Karp-Rabin

Idea: use hashing.
■ Compute hash function for each text position.

Problems.
■ Need full compare on hash match to guard against collisions.
  – 59265 % 97 = 95
  – 59362 % 97 = 95
■ Hash function depends on M characters.
  – running time on search miss = MN
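The window hashes in the example can be reproduced with a short helper (a throwaway sketch, not the algorithm itself — each hash is recomputed from scratch here, which is exactly the MN cost the slide warns about; the function name is my own):

```c
#include <string.h>

/* Return the first position whose M-digit window of t hashes (mod q) to
 * the pattern hash ph, or -1 if none does. */
int first_hash_match(const char t[], int M, int q, int ph) {
    for (int i = 0; i + M <= (int)strlen(t); i++) {
        int h = 0;                               /* hash of t[i..i+M-1] */
        for (int j = 0; j < M; j++)
            h = (h * 10 + (t[i+j] - '0')) % q;   /* e.g. 31415 % 97 = 84 */
        if (h == ph) return i;
    }
    return -1;
}
```

With the slide's data, first_hash_match("314159265358979323846", 5, 97, 95) returns 4: the window 59265 starting at position 4 is the only one that hashes to 95.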
Karp-Rabin

Key idea: it is fast to compute the hash function of adjacent substrings.
■ Use the previous hash to compute the next hash.
■ O(1) time per hash, except the first one.

Example.
■ Pre-compute: 10000 % 97 = 9
■ Previous hash: 41592 % 97 = 76
■ Next hash: 15926 % 97 = ((76 − 4 · 9) · 10 + 6) % 97 = 18

The slide's code fragment, completed into a full search (the pattern/window hashing loop and the sliding-window loop are filled in following the same scheme; as in the fragment, no full comparison is done on a hash match):

   #include <string.h>

   #define q 3355439                     // table size
   #define d 256                         // radix

   int rksearch(char p[], char t[]) {
      int i, j, dM = 1, h1 = 0, h2 = 0;
      int M = strlen(p), N = strlen(t);

      for (j = 1; j < M; j++)            // precompute d^(M-1) % q
         dM = (d * dM) % q;
      for (j = 0; j < M; j++) {
         h1 = (h1*d + p[j]) % q;         // hash of pattern
         h2 = (h2*d + t[j]) % q;         // hash of t[0..M-1]
      }
      for (i = 0; h1 != h2; i++) {       // slide the window
         if (i + M >= N) return -1;      // no match
         h2 = (h2 + d*q - t[i]*dM) % q;  // remove leading char t[i]
         h2 = (h2*d + t[i+M]) % q;       // add trailing char t[i+M]
      }
      return i;                          // match at t[i..i+M-1]
   }
Karp-Rabin

Karp-Rabin algorithm.
■ Choose the table size at RANDOM to be a huge prime.
■ Expected running time is O(M + N).
■ O(MN) worst case, but this is (unbelievably) unlikely.

Randomized algorithms.
■ Monte Carlo: don't check for collisions.
  – algorithm can be wrong, but running time is guaranteed linear
■ Las Vegas: if collision, start over with a new random table size.
  – algorithm always correct, but running time is expected linear

Advantages.
■ Extends to 2D patterns and other generalizations.