53SubstringSearch 2x2
Substring search
Goal. Find pattern of length M in a text of length N.
pattern  N E E D L E
text     I N A H A Y S T A C K N E E D L E I N A
match                          N E E D L E
Substring search applications
Computer forensics. Search memory or disk for signatures, e.g., all URLs or RSA keys that the user has entered.
http://citp.princeton.edu/memory

Identify patterns indicative of spam.
・ PROFITS
・ L0SE WE1GHT
・ herbal Viagra
・ There is no catch.
・ This is a one-time mailing.
・ This message is sent in compliance with spam regulations.
Screen scraping. Extract relevant data from web page.
Ex. Find string delimited by <b> and </b> after first occurrence of pattern Last Trade:.

http://finance.yahoo.com/q?s=goog
...
<tr>
<td class= "yfnc_tablehead1"
width= "48%">
Last Trade:
</td>
<td class= "yfnc_tabledata1">
<big><b>452.92</b></big>
</td></tr>
<td class= "yfnc_tablehead1"
width= "48%">
Trade Time:
</td>
<td class= "yfnc_tabledata1">
...

Java library. The indexOf() method in Java's string library returns the index of the first occurrence of a given string, starting at a given offset.

public class StockQuote
{
   public static void main(String[] args)
   {
      String name = "http://finance.yahoo.com/q?s=";
      In in = new In(name + args[0]);
      String text = in.readAll();
      int start = text.indexOf("Last Trade:", 0);
      int from = text.indexOf("<b>", start);
      int to = text.indexOf("</b>", from);
      String price = text.substring(from + 3, to);
      StdOut.println(price);
   }
}

% java StockQuote goog
582.93
% java StockQuote msft
24.84
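The indexOf() chain above can be tried without a network connection. The following is a minimal sketch, not the course's code: the class name PriceExtractor, the method extract(), and the hard-coded page fragment are assumptions made for illustration, and the null return for a failed search is a choice the slide does not make.

```java
class PriceExtractor {
    // Returns the text between the first <b>...</b> pair that follows label,
    // or null if any of the three searches fails (indexOf returns -1).
    static String extract(String page, String label) {
        int start = page.indexOf(label, 0);
        if (start < 0) return null;
        int from = page.indexOf("<b>", start);
        if (from < 0) return null;
        int to = page.indexOf("</b>", from);
        if (to < 0) return null;
        return page.substring(from + 3, to);   // skip the 3 chars of "<b>"
    }

    public static void main(String[] args) {
        // Hypothetical page fragment shaped like the one on the slide.
        String page = "<tr><td>Last Trade:</td>"
                    + "<td><big><b>452.92</b></big></td></tr>";
        System.out.println(extract(page, "Last Trade:"));
    }
}
```

Checking each indexOf() result matters in practice: if the page layout changes, the unchecked version on the slide would throw from substring() instead of reporting a miss.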
5.3 SUBSTRING SEARCH
‣ introduction
‣ brute force
‣ Knuth-Morris-Pratt
‣ Boyer-Moore
‣ Rabin-Karp
Algorithms
ROBERT SEDGEWICK | KEVIN WAYNE
http://algs4.cs.princeton.edu

Brute-force substring search
 i  j  i+j   0 1 2 3 4 5 6 7 8 9 10
       txt   A B A C A D A B R A C
 0  2   2    A B R A                  pat
 1  0   1      A B R A
 2  1   3        A B R A
 3  0   3          A B R A
 4  1   5            A B R A
 5  0   5              A B R A
 6  4  10                A B R A      match: return i when j is M
(In the slide figure, entries in red are mismatches, entries in gray are for reference only, and entries in black match the text.)
Brute-force substring search: Java implementation
Check for pattern starting at each text position.

 i  j  i+j   0 1 2 3 4 5 6 7 8 9 10
       txt   A B A C A D A B R A C
 4  3   7            A D A C R
 5  0   5              A D A C R

public static int search(String pat, String txt)
{
   int M = pat.length();
   int N = txt.length();
   for (int i = 0; i <= N - M; i++)
   {
      int j;
      for (j = 0; j < M; j++)
         if (txt.charAt(i+j) != pat.charAt(j))
            break;
      if (j == M) return i;      // index in text where pattern starts
   }
   return N;                     // not found
}

Brute-force substring search: worst case
Brute-force algorithm can be slow if text and pattern are repetitive.

 i  j  i+j   0 1 2 3 4 5 6 7 8 9
       txt   A A A A A A A A A B
 0  4   4    A A A A B                pat
 1  4   5      A A A A B
 2  4   6        A A A A B
 3  4   7          A A A A B
 4  4   8            A A A A B
 5  5  10              A A A A B      match
Brute-force substring search (worst case)

Worst case. ~ M N char compares.
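The worst-case claim can be checked by counting compares. The sketch below is the brute-force loop from the slide with a counter added; the class name BruteForceCount and the static field compares are assumptions for illustration only.

```java
class BruteForceCount {
    static long compares;   // character compares made by the last call to search()

    // Same loop structure as the brute-force implementation, plus a counter.
    static int search(String pat, String txt) {
        int M = pat.length(), N = txt.length();
        compares = 0;
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++) {
                compares++;
                if (txt.charAt(i + j) != pat.charAt(j)) break;
            }
            if (j == M) return i;   // found: pattern starts at i
        }
        return N;                   // not found
    }

    public static void main(String[] args) {
        int at = search("AAAAB", "AAAAAAAAAB");
        System.out.println("index " + at + ", compares " + compares);
    }
}
```

For pattern AAAAB in text AAAAAAAAAB, each of the 6 start positions costs 5 compares, so the counter ends at 30 = M (N - M + 1), which is ~ M N when N is much larger than M.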
Backup
In many applications, we want to avoid backup in text stream.
・Treat input as stream of data.
・Abstract model: standard input.
[figure: the stream "ATTACK AT DAWN" feeding a substring search machine]

Brute-force algorithm needs backup for every mismatch.
[figure: text A A A A A A A A A A A A A A A A A A A A A B with pattern A A A A A B aligned under it; after the matched chars and a mismatch, the text pointer backs up and the pattern shifts right one position]

Approach 1. Maintain buffer of last M characters.
Approach 2. Stay tuned.

Brute-force substring search: alternate implementation
Same sequence of char compares as previous implementation.
・ i points to end of sequence of already-matched chars in text.
・ j stores # of already-matched chars (end of sequence in pattern).

 i  j   0 1 2 3 4 5 6 7 8 9 10
  txt   A B A C A D A B R A C
 7  3           A D A C R
 5  0             A D A C R

public static int search(String pat, String txt)
{
   int i, N = txt.length();
   int j, M = pat.length();
   for (i = 0, j = 0; i < N && j < M; i++)
   {
      if (txt.charAt(i) == pat.charAt(j)) j++;
      else { i -= j; j = 0; }      // explicit backup
   }
   if (j == M) return i - M;
   else        return N;
}
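Approach 1 (maintain a buffer of the last M characters) can be sketched as follows. This is an illustration under stated assumptions, not the course's code: the class BufferedStreamSearch, its methods, and the use of java.io.Reader as the stream model are all chosen here. It returns -1 for "not found" because the length of a stream is not known in advance.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferedStreamSearch {
    // Returns the stream offset where pat first occurs, or -1 if it never does.
    // The stream is read once, front to back; only the last M chars are kept.
    static long search(String pat, Reader in) throws IOException {
        int M = pat.length();
        if (M == 0) return 0;
        char[] buf = new char[M];   // circular buffer: char at offset k lives in buf[k % M]
        long n = 0;                 // number of chars read so far
        int c;
        while ((c = in.read()) != -1) {
            buf[(int) (n % M)] = (char) c;
            n++;
            if (n < M) continue;    // window not full yet
            int j = 0;              // compare window [n-M, n) with the pattern
            while (j < M && buf[(int) ((n - M + j) % M)] == pat.charAt(j)) j++;
            if (j == M) return n - M;
        }
        return -1;
    }

    // Convenience wrapper for in-memory text.
    static long searchString(String pat, String txt) {
        try {
            return search(pat, new StringReader(txt));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchString("NEEDLE", "INAHAYSTACKNEEDLEINA"));
    }
}
```

The buffer removes the need to back up in the stream itself, but the number of char compares is still ~ M N in the worst case; avoiding that is what Approach 2 is about.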
Practical challenge. Avoid backup in text stream (often no room or time to save text).
Now is the time for all people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good
people to come to the aid of their party. Now is the time for all of the good people to come to the aid of
their party. Now is the time for all good people to come to the aid of their party. Now is the time for
each good person to come to the aid of their party. Now is the time for all good people to come to the aid
of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for many or all good people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for all good Democrats to come to the aid of their party. Now is the time for all people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for many good people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for a lot of good people to come to the aid of their
party. Now is the time for all of the good people to come to the aid of their party. Now is the time for
all good people to come to the aid of their attack at dawn party. Now is the time for each person to come
to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is
the time for all good Republicans to come to the aid of their party. Now is the time for all good people
to come to the aid of their party. Now is the time for many or all good people to come to the aid of their
party. Now is the time for all good people to come to the aid of their party. Now is the time for all good
Democrats to come to the aid of their party.
Knuth-Morris-Pratt substring search
Intuition. Suppose we are searching in text for pattern BAAAAAAAAA.
・Suppose we match 5 chars in pattern, with mismatch on 6th char.
・We know previous 6 chars in text are BAAAAB.
・Don't need to back up text pointer! (assuming { A, B } alphabet)

text                              A B A A A A B A A A A A A A A A
after mismatch on sixth char        B A A A A A A A A A
brute-force backs up to try this      B A A A A A A A A A
and this                                B A A A A A A A A A
and this                                  B A A A A A A A A A
and this                                    B A A A A A A A A A
and this                                      B A A A A A A A A A
but no backup is needed                       B A A A A A A A A A
Text pointer backup in substring searching

Deterministic finite state automaton (DFA)
DFA is abstract string-searching machine.
・Finite number of states (including start and halt).
・Exactly one transition for each char in alphabet.
・Accept if sequence of transitions leads to halt state.

internal representation
 j               0  1  2  3  4  5
 pat.charAt(j)   A  B  A  B  A  C
 dfa['A'][j]     1  1  3  1  5  1
 dfa['B'][j]     0  2  0  4  0  4
 dfa['C'][j]     0  0  0  0  0  6

If in state j reading char c:
  if j is 6 halt and accept
  else move to state dfa[c][j]

graphical representation
[state diagram: states 0 to 6; match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6; all other transitions as in the table]
DFA simulation on text A A B A C A A B A B A C A A

 j               0  1  2  3  4  5
 pat.charAt(j)   A  B  A  B  A  C
 dfa['A'][j]     1  1  3  1  5  1
 dfa['B'][j]     0  2  0  4  0  4
 dfa['C'][j]     0  0  0  0  0  6

[state diagram: states 0 to 6; match transitions 0 -A-> 1 -B-> 2 -A-> 3 -B-> 4 -A-> 5 -C-> 6; all other transitions as in the table]
substring found
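The simulation can be replayed in code with the table typed in by hand. This is a small sketch under assumptions: the class name DfaSimulation is invented here, the table is indexed by c - 'A' rather than by the char itself, and the input may contain only the characters A, B, and C.

```java
class DfaSimulation {
    // DFA[c - 'A'][j] for pattern A B A B A C over the alphabet {A, B, C},
    // copied from the table on the slide.
    static final int[][] DFA = {
        {1, 1, 3, 1, 5, 1},   // on 'A'
        {0, 2, 0, 4, 0, 4},   // on 'B'
        {0, 0, 0, 0, 0, 6},   // on 'C'
    };
    static final int M = 6;   // pattern length, which is also the accept state

    // Returns the index of the first occurrence in txt, or txt.length() if none.
    // Assumes txt contains only 'A', 'B', and 'C'.
    static int search(String txt) {
        int i, j;
        for (i = 0, j = 0; i < txt.length() && j < M; i++)
            j = DFA[txt.charAt(i) - 'A'][j];   // one table lookup per text char
        if (j == M) return i - M;
        return txt.length();
    }

    public static void main(String[] args) {
        System.out.println(search("AABACAABABACAA"));
    }
}
```

On the demo text the machine passes through states 1, 1, 2, 3, 0, 1, 1, 2, 3, 4, 5, 6 and halts after the 12th character, so the match starts at index 12 - 6 = 6.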
Interpretation of Knuth-Morris-Pratt DFA
Q. What is interpretation of DFA state after reading in txt[i]?
A. State = number of characters in pattern that have been matched
   (length of longest prefix of pat[] that is a suffix of txt[0..i]).
[state diagram: DFA for A B A B A C]

Knuth-Morris-Pratt substring search: Java implementation
Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.

Running time.
・Simulate DFA on text: at most N character accesses.
・Build DFA: how to do efficiently? [warning: tricky algorithm ahead]
Key differences from brute-force implementation.
・Need to precompute dfa[][] from pattern.
・Text pointer i never decrements.
・Could use input stream.

public int search(In in)
{
   int i, j;
   for (i = 0, j = 0; !in.isEmpty() && j < M; i++)
      j = dfa[in.readChar()][j];         // no backup
   if (j == M) return i - M;
   else        return NOT_FOUND;
}

 j               0  1  2  3  4  5
 pat.charAt(j)   A  B  A  B  A  C
 dfa['A'][j]     1  1  3  1  5  1
 dfa['B'][j]     0  2  0  4  0  4
 dfa['C'][j]     0  0  0  0  0  6
[state diagram: DFA for A B A B A C]

Constructing the DFA for KMP substring search for A B A B A C
Include one state for each character in pattern (plus accept state).

 j               0  1  2  3  4  5
 pat.charAt(j)   A  B  A  B  A  C
 dfa['A'][j]
 dfa['B'][j]
 dfa['C'][j]
[state diagram: states 0 to 6, no transitions yet]
How to build DFA from pattern?
Match transition. If in state j and next char c == pat.charAt(j), go to j+1.
(First j characters of pattern have already been matched; next char matches; now first j+1 characters of pattern have been matched.)

 j               0  1  2  3  4  5
 pat.charAt(j)   A  B  A  B  A  C
 dfa['A'][j]     1     3     5
 dfa['B'][j]        2     4
 dfa['C'][j]                    6
[state diagram: states 0 to 6 with match transitions only]

How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j), then the last j-1 characters of input are pat[1..j-1], followed by c.

To compute dfa[c][j]: Simulate pat[1..j-1] on DFA (still under construction!) and take transition c.
Running time. Seems to require j steps.

Ex. dfa['A'][5] = 1; dfa['B'][5] = 4
  simulate BABA; take transition 'A' = dfa['A'][3]
  simulate BABA; take transition 'B' = dfa['B'][3]
5's outgoing mismatch edges are same as BABA's edges.
[state diagram: DFA for A B A B A C with the simulation of BABA ending in state 3]
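The rule "simulate pat[1..j-1] on the DFA and take transition c" can be coded directly, before any speed-up. The sketch below is illustrative only: the class name NaiveDfaBuilder, the method build(), and the parameter R (alphabet size, with every pattern char assumed to be smaller than R) are assumptions. It re-runs the simulation for every column, so it takes about j steps per state, as the slide warns.

```java
class NaiveDfaBuilder {
    // Builds dfa[c][j] for pat over an alphabet of size R (all chars < R).
    static int[][] build(String pat, int R) {
        int M = pat.length();
        int[][] dfa = new int[R][M];
        for (int j = 0; j < M; j++) {
            // Simulate pat[1..j-1] on the columns built so far.
            // Every state visited is smaller than j, so those columns exist.
            int X = 0;
            for (int k = 1; k < j; k++)
                X = dfa[pat.charAt(k)][X];
            // Mismatch transitions of state j are the transitions of state X.
            // (For j = 0 this copies column 0 onto itself, which is all zeros.)
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];
            // Match transition.
            dfa[pat.charAt(j)][j] = j + 1;
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = build("ABABAC", 256);
        System.out.println(dfa['A'][5] + " " + dfa['B'][5] + " " + dfa['C'][5]);
    }
}
```

For A B A B A C this reproduces the example on the slide: simulating BABA ends in state 3, so state 5 inherits dfa['A'][3] = 1 and dfa['B'][3] = 4.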
How to build DFA from pattern?
Mismatch transition. If in state j and next char c != pat.charAt(j), then the last j-1 characters of input are pat[1..j-1], followed by c.

State X interpretation: Result of simulating DFA on text after backing up to the next brute force start position.
[state diagram: DFA for A B A B A C with state X marked; simulation of BABAC]

Knuth-Morris-Pratt construction demo (in linear time)
Include one state for each character in pattern (plus accept state).
[state diagram: states 0 to 6, with X and j marked]
Knuth-Morris-Pratt construction demo (in linear time)
[state diagram: completed DFA for A B A B A C]

Constructing the DFA for KMP substring search: Java implementation
      ...
      X = dfa[pat.charAt(j)][X];      // update restart state
   }
}

Running time. M character accesses (but space/time proportional to R M).
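Only the last statement of the construction code ("update restart state") is legible above. The following sketch is one way to complete it that is consistent with that statement and with the match and mismatch rules from the preceding slides; it follows the usual textbook formulation of the KMP DFA construction. The class name KmpSketch, the fixed alphabet size R = 256, and the accessor transition() are assumptions, and the pattern must be nonempty with all chars below R.

```java
class KmpSketch {
    private static final int R = 256;   // assumed alphabet size (extended ASCII)
    private final int M;                // pattern length
    private final int[][] dfa;          // dfa[c][j] = next state from state j on char c

    KmpSketch(String pat) {
        M = pat.length();
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        // X is the restart state: the state the DFA would be in after pat[1..j-1].
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];        // copy mismatch cases
            dfa[pat.charAt(j)][j] = j + 1;    // set match case
            X = dfa[pat.charAt(j)][X];        // update restart state
        }
    }

    // Returns the index of the first occurrence, or txt.length() if none.
    int search(String txt) {
        int N = txt.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];        // text pointer i never decrements
        if (j == M) return i - M;
        return N;
    }

    int transition(char c, int j) { return dfa[c][j]; }

    public static void main(String[] args) {
        System.out.println(new KmpSketch("ABABAC").search("AABACAABABACAA"));
    }
}
```

Maintaining X incrementally is what removes the "j steps" per state: each column costs R copies plus two table updates, which gives the time and space proportional to R M stated above.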
KMP substring search analysis
Proposition. KMP substring search accesses no more than M + N chars to search for a pattern of length M in a text of length N.
Pf. Each pattern char accessed once when constructing the DFA; each text char accessed once (in the worst case) when simulating the DFA.

Proposition. KMP constructs dfa[][] in time and space proportional to R M.

Knuth-Morris-Pratt: brief history
・Independently discovered by two theoreticians and a hacker.
  – Knuth: inspired by esoteric theorem, discovered linear algorithm
  – Pratt: made running time independent of alphabet size
  – Morris: built a text editor for the CDC 6400 computer
・Theory meets practice.

[image of the first page of the published paper]
SIAM J. COMPUT. Vol. 6, No. 2, June 1977
Key words, pattern, string, text-editing, pattern-matching, trie memory, searching, period of a string, palindrome, optimum algorithm, Fibonacci string, regular expression
* Received by the editors August 29, 1974, and in revised form April 7, 1976.
Computer Science Department, Stanford University, Stanford, California 94305. The work of this author was supported in part by the National Science Foundation under Grant GJ 36473X and by the Office of Naval Research under Contract NR 044-402.
Xerox Palo Alto Research Center, Palo Alto, California 94304.

Boyer-Moore
・Can skip as many as M text chars when finding one not in the pattern.

 i  j    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
  text   F  I  N  D  I  N  A  H  A  Y  S  T  A  C  K  N  E  E  D  L  E  I  N  A
 0  5    N  E  E  D  L  E                                                          pattern
 5  5                   N  E  E  D  L  E
11  4                                     N  E  E  D  L  E
15  0                                                 N  E  E  D  L  E             return i = 15
Case 1. Mismatch character not in pattern.
before
 txt  . . . . . . T L E . . . . . .
 pat        N E E D L E
after
 txt  . . . . . . T L E . . . . . .
 pat                N E E D L E
mismatch character 'T' not in pattern: increment i one character beyond 'T'

Case 2a. Mismatch character in pattern.
before
 txt  . . . . . . N L E . . . . . .
 pat        N E N D L E
after
 txt  . . . . . . N L E . . . . . .
 pat          N E N D L E
mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'

Case 2b. Mismatch character in pattern (but heuristic no help).
before
 txt  . . . . . . E L E . . . . . .
 pat        N E E D L E
aligned with rightmost 'E' (pattern would move left)
 txt  . . . . . . E L E . . . . . .
 pat    N E E D L E
mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?
after
 txt  . . . . . . E L E . . . . . .
 pat          N E E D L E
mismatch character 'E' in pattern: increment i by 1
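The three cases above can be combined into one skip rule: record the rightmost index of each character in the pattern (-1 if absent) and, on a mismatch at pattern index j, skip max(1, j - right[c]). The sketch below follows the usual textbook form of the mismatched character heuristic; the class name BoyerMooreSketch and the fixed alphabet size R = 256 are assumptions, and all chars are assumed to be below R.

```java
import java.util.Arrays;

class BoyerMooreSketch {
    // Returns the index of the first occurrence of pat in txt, or txt.length() if none.
    static int search(String pat, String txt) {
        int R = 256;                           // assumed alphabet size
        int M = pat.length(), N = txt.length();

        // right[c] = index of rightmost occurrence of c in pat, or -1 (Case 1).
        int[] right = new int[R];
        Arrays.fill(right, -1);
        for (int j = 0; j < M; j++)
            right[pat.charAt(j)] = j;

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {        // scan pattern right to left
                char c = txt.charAt(i + j);
                if (pat.charAt(j) != c) {
                    // Case 1:  right[c] == -1, skip j + 1 (one beyond the mismatch)
                    // Case 2a: 0 <= right[c] < j, align text char with rightmost c
                    // Case 2b: right[c] > j, heuristic would back up, so skip 1
                    skip = Math.max(1, j - right[c]);
                    break;
                }
            }
            if (skip == 0) return i;                  // no mismatch: match at i
        }
        return N;
    }

    public static void main(String[] args) {
        System.out.println(search("NEEDLE", "FINDINAHAYSTACKNEEDLEINA"));
    }
}
```

On the text F I N D I N A H A Y S T A C K N E E D L E I N A the alignments tried are i = 0, 5, 11, and 15, with the match found at 15.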
Boyer-Moore: mismatched character heuristic
Boyer-Moore: Java implementation
Boyer-Moore: analysis
Boyer-Moore variant. Can improve worst case to ~ 3 N character compares by adding a KMP-like rule to guard against repetitive patterns.

Michael Rabin, Turing Award '76
Dick Karp, Turing Award '85
Rabin-Karp fingerprint search
Basic idea = modular hashing.
・Compute a hash of pattern characters 0 to M - 1.
・For each i, compute a hash of text characters i to M + i - 1.
・If pattern hash = text substring hash, check for a match.

pat.charAt(i)
 i   0 1 2 3 4
     2 6 5 3 5   % 997 = 613

txt.charAt(i)
 i   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
     3 1 4 1 5 9 2 6 5 3 5  8  9  7  9  3
 0   3 1 4 1 5   % 997 = 508
 1     1 4 1 5 9   % 997 = 201
 2       4 1 5 9 2   % 997 = 715
 3         1 5 9 2 6   % 997 = 971
 4           5 9 2 6 5   % 997 = 442
 5             9 2 6 5 3   % 997 = 929
 6               2 6 5 3 5   % 997 = 613   match: return i = 6
Basis for Rabin-Karp substring search

Review: Computing hash functions for a string (from hashing lecture)
Modular hash function. Using the notation t_i for txt.charAt(i), we wish to compute
   x_i = t_i R^{M-1} + t_{i+1} R^{M-2} + … + t_{i+M-1} R^0 (mod Q)
Intuition. M-digit, base-R integer, modulo Q.
Horner's method. Linear-time method to evaluate degree-M polynomial.

// Compute hash for M-digit key
private long hash(String key, int M)
{
   long h = 0;
   for (int j = 0; j < M; j++)
      h = (h * R + key.charAt(j)) % Q;
   return h;
}

pat.charAt()     (R = 10, Q = 997)
 i   0 1 2 3 4
     2 6 5 3 5
 0   2           % 997 = 2
 1   2 6         % 997 = (2*10 + 6) % 997 = 26
 2   2 6 5       % 997 = (26*10 + 5) % 997 = 265
 3   2 6 5 3     % 997 = (265*10 + 3) % 997 = 659
 4   2 6 5 3 5   % 997 = (659*10 + 5) % 997 = 613
Computing the hash value for the pattern with Horner's method

Goal. Find efficient way to compute hashes for each position in text.
Slow way to compute hashCode (R = 10): "26535" = 2*10000 + 6*1000 + 5*100 + 3*10 + 5
Horner's method to compute hashCode: "26535" = ((((2) *10 + 6) * 10 + 5) * 10 + 3) * 10 + 5
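The Horner table can be reproduced with a few lines of code. This sketch differs from the hash() method on the slide in one deliberate way: it treats each character as a decimal digit (charAt(j) - '0') so that the numbers match the worked example with R = 10 and Q = 997. The class name HornerHash and the method hashDigits() are assumptions for illustration.

```java
class HornerHash {
    // Hash of the first M digits of key, read as a base-R integer modulo Q.
    // Assumes key consists of the characters '0'..'9' and that R <= 10.
    static long hashDigits(String key, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + (key.charAt(j) - '0')) % Q;   // one Horner step, reduced mod Q
        return h;
    }

    public static void main(String[] args) {
        // Prints the running values 2, 26, 265, 659, 613 from the table.
        for (int m = 1; m <= 5; m++)
            System.out.println(hashDigits("26535", m, 10, 997));
    }
}
```

Reducing modulo Q after every step keeps the intermediate value below Q * R, so it never overflows a long for reasonable Q, which is the point of the modulus property.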
Modulus property. Given an expression % Q, if the expression contains only multiplication and addition operations, can replace any term X by X % Q. (Don't have to replace all of them!)

Challenge. How to efficiently compute x_{i+1} given that we know x_i.
   x_i     = t_i R^{M-1} + t_{i+1} R^{M-2} + … + t_{i+M-1} R^0
   x_{i+1} = (x_i - t_i R^{M-1}) R + t_{i+M}

     4 1 5 9 2      current value
   - 4 0 0 0 0
       1 5 9 2      subtract leading digit
          * 1 0     multiply by radix
     1 5 9 2 0
           + 6      add new trailing digit
     1 5 9 2 6      new value
Key computation in Rabin-Karp substring search (move right one position in the text)

Example: Using mod trick and incremental approach to get hash
Given.    x_i:     48500 % 9 = 8
Find.     x_{i+1}: 85006 % 9
Hint.     R^{M-1}, also called RM: 10000 % 9 = 1
Solution. 85006 = (48500 - 4*10000) * 10 + 6
          85006 % 9 = ((48500 - 4*10000) * 10 + 6) % 9
                    = ((8 - 4*1) * 1 + 6) % 9
                    = 10 % 9
                    = 1
Key idea: To compute x_{i+1}, use x_i, t_i, t_{i+M}, RM, R, and Q.

Rabin-Karp substring search example
First 5 entries. Horner's rule.
Later entries. Use incremental approach and mod trick.
(R = 10, Q = 997, RM = 10000 % 997 = 30)

 i    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      3 1 4 1 5 9 2 6 5 3 5  8  9  7  9  3
 0    3           % 997 = 3
 1    3 1         % 997 = (3*10 + 1) % 997 = 31
 2    3 1 4       % 997 = (31*10 + 4) % 997 = 314
 3    3 1 4 1     % 997 = (314*10 + 1) % 997 = 150
 4    3 1 4 1 5   % 997 = (150*10 + 5) % 997 = 508
 5      1 4 1 5 9   % 997 = ((508 + 3*(997 - 30))*10 + 9) % 997 = 201
 6        4 1 5 9 2   % 997 = ((201 + 1*(997 - 30))*10 + 2) % 997 = 715
 7          1 5 9 2 6   % 997 = ((715 + 4*(997 - 30))*10 + 6) % 997 = 971
 8            5 9 2 6 5   % 997 = ((971 + 1*(997 - 30))*10 + 5) % 997 = 442
 9              9 2 6 5 3   % 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
 10               2 6 5 3 5   % 997 = ((929 + 9*(997 - 30))*10 + 5) % 997 = 613   match: return i-M+1 = 6
Rabin-Karp substring search example

The extra factor of Q (the 997 in each later row) is added to ensure that the sum is positive, since % is the remainder operator, not the modulus operator, in Java.
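The incremental update can be checked against the example table. The sketch below computes the hash of every length-M window of a digit string, using Horner's rule for the first window and the subtract, multiply, add step for the rest. As in the worked example, characters are treated as decimal digits; the class name RollingHashDemo and the method windowHashes() are assumptions for illustration.

```java
class RollingHashDemo {
    // Returns x[0..N-M], where x[k] is the hash of digits[k..k+M-1]
    // read as a base-R integer modulo Q. Assumes digits are '0'..'9'.
    static long[] windowHashes(String digits, int M, int R, long Q) {
        int N = digits.length();

        long RM = 1;                               // R^(M-1) % Q
        for (int i = 1; i <= M - 1; i++)
            RM = (R * RM) % Q;

        long h = 0;                                // first window: Horner's rule
        for (int j = 0; j < M; j++)
            h = (h * R + (digits.charAt(j) - '0')) % Q;

        long[] x = new long[N - M + 1];
        x[0] = h;
        for (int i = M; i < N; i++) {
            int lead = digits.charAt(i - M) - '0'; // digit leaving the window
            int next = digits.charAt(i) - '0';     // digit entering the window
            h = (h + Q - RM * lead % Q) % Q;       // subtract leading digit (+Q keeps it nonnegative)
            h = (h * R + next) % Q;                // multiply by radix, add trailing digit
            x[i - M + 1] = h;
        }
        return x;
    }

    public static void main(String[] args) {
        long[] x = windowHashes("3141592653589793", 5, 10, 997);
        for (int k = 0; k <= 6; k++)
            System.out.println(k + ": " + x[k]);
    }
}
```

The first seven values are 508, 201, 715, 971, 442, 929, 613, matching the table, and the last of them equals the pattern hash of 26535.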
public class RabinKarp
{
   private long patHash;    // pattern hash value
   private int M;           // pattern length
   private long Q;          // modulus
   private int R;           // radix
   private long RM;         // R^(M-1) % Q

   public RabinKarp(String pat) {
      M = pat.length();
      R = 256;
      Q = longRandomPrime();               // a large prime (but avoid overflow)

      RM = 1;                              // precompute R^(M-1) (mod Q)
      for (int i = 1; i <= M-1; i++)       // equivalent of 10000 % 9 from example
         RM = (R * RM) % Q;
      patHash = hash(pat, M);
   }

   public int search(String txt)
   { /* see next slide */ }
}

Monte Carlo version. Return match if hash match.

public int search(String txt)
{
   int N = txt.length();
   long txtHash = hash(txt, M);            // long, to match the return type of hash()
   if (patHash == txtHash) return 0;
   for (int i = M; i < N; i++)             // check for hash collision using rolling hash function
   {
      // adding Q ensures new txtHash is positive, doesn't affect magnitude of answer
      txtHash = (txtHash + Q - RM*txt.charAt(i-M) % Q) % Q;
      txtHash = (txtHash*R + txt.charAt(i)) % Q;
      if (patHash == txtHash) return i - M + 1;
   }
   return N;
}

Las Vegas version. Check for substring match if hash match; continue search if false collision.
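The Las Vegas version differs from the Monte Carlo code only in what happens on a hash match. The sketch below is an illustration under assumptions, not the course's code: the class name RabinKarpLasVegas and the helper check() are invented here, the modulus is a fixed prime (1000000007) rather than the random prime from longRandomPrime(), and a guard for texts shorter than the pattern is added.

```java
class RabinKarpLasVegas {
    private final String pat;
    private final int M;                      // pattern length
    private final long Q = 1_000_000_007L;    // assumed fixed prime; the slide picks one at random
    private final int R = 256;                // radix
    private final long RM;                    // R^(M-1) % Q
    private final long patHash;

    RabinKarpLasVegas(String pat) {
        this.pat = pat;
        M = pat.length();
        long rm = 1;
        for (int i = 1; i <= M - 1; i++)
            rm = (R * rm) % Q;
        RM = rm;
        patHash = hash(pat, M);
    }

    // Horner's method, as in the hash() method shown earlier.
    private long hash(String key, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (h * R + key.charAt(j)) % Q;
        return h;
    }

    // Verify a candidate match char by char; false means a hash collision.
    private boolean check(String txt, int offset) {
        return txt.startsWith(pat, offset);
    }

    // Returns the index of the first occurrence, or txt.length() if none.
    int search(String txt) {
        int N = txt.length();
        if (N < M) return N;
        long txtHash = hash(txt, M);
        if (patHash == txtHash && check(txt, 0)) return 0;
        for (int i = M; i < N; i++) {
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;   // remove leading char
            txtHash = (txtHash * R + txt.charAt(i)) % Q;                // add trailing char
            int offset = i - M + 1;
            if (patHash == txtHash && check(txt, offset)) return offset;
            // on a false collision, fall through and keep searching
        }
        return N;
    }

    public static void main(String[] args) {
        System.out.println(new RabinKarpLasVegas("NEEDLE").search("INAHAYSTACKNEEDLEINA"));
    }
}
```

With the check in place the answer is always correct; the price is that a run of false collisions could in principle cost up to M compares per position, which is why the guarantee for this version is only probabilistic.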
Rabin-Karp analysis
Rabin-Karp fingerprint search

algorithm            version                                            op count (guarantee)   op count (typical)   backup in input?   correct?   extra space
brute force          -                                                  M N                    1.1 N                yes                yes        1
Knuth-Morris-Pratt   full DFA (Algorithm 5.6)                           2 N                    1.1 N                no                 yes        M R
Knuth-Morris-Pratt   mismatch transitions only                          3 N                    1.1 N                no                 yes        M
Boyer-Moore          full algorithm                                     3 N                    N / M                yes                yes        R
Boyer-Moore          mismatched char heuristic only (Algorithm 5.7)     M N                    N / M                yes                yes        R
Rabin-Karp †         Monte Carlo (Algorithm 5.8)                        7 N                    7 N                  no                 yes †      1
Rabin-Karp †         Las Vegas                                          7 N †                  7 N                  no †               yes        1

† probabilistic guarantee, with uniform hash function
Cost summary for substring-search implementations