
CS369 StringAlgs PDF

The document discusses string matching algorithms. It outlines several algorithms including the naive/brute-force algorithm, automaton-based search, Rabin-Karp algorithm, Knuth-Morris-Pratt algorithm, and Boyer-Moore algorithm. It provides time complexities for preprocessing and matching for each algorithm and describes the sliding window mechanism and examples of the naive algorithm.


Outline String matching Naïve Automaton Rabin-Karp KMP Boyer-Moore Others

String Matching Algorithms

Georgy Gimel’farb
(with basic contributions from M. J. Dinneen, Wikipedia, and web materials by
Ch. Charras and Thierry Lecroq, Russ Cox, David Eppstein, etc.)

COMPSCI 369 Computational Science


1 String matching algorithms
2 Naïve, or brute-force search
3 Automaton search
4 Rabin-Karp algorithm
5 Knuth-Morris-Pratt algorithm
6 Boyer-Moore algorithm
7 Other string matching algorithms

Learning outcomes: Be familiar with string matching algorithms

Recommended reading:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
C. Charras and T. Lecroq: Exact String Matching Algorithms. Univ. de Rouen, 1997


String Matching (Searching)


String matching or searching algorithms try to find places where one or several strings (also called patterns) occur within a larger string (the searched text):

Text:    ... try to find places where one or several strings (also...
Pattern: ace
Match:   ... try to find pl[ace]s where one or several strings (also...

Formally, both the pattern and the searched text are concatenations of elements of an alphabet (a finite set) Σ
• Σ may be a usual human alphabet, for example, the Latin letters a through z or the Greek letters α through ω
• Other applications may use the binary alphabet, Σ = {0, 1}, or, in bioinformatics, the DNA alphabet, Σ = {A, C, G, T}


String Searching: DNA alphabet


The DNA alphabet contains only four “letters”, which form fixed pairs in the double-helical structure of DNA:
• A – adenine: A pairs with T
• C – cytosine: C pairs with G
• G – guanine: G pairs with C
• T – thymine: T pairs with A

Figure sources: http://www.biotechnologyonline.gov.au/popups/img helix.html and http://www.insectscience.org/2.10/ref/fig5a.gif


String Searching: DNA alphabet (continued)

[Figure omitted. Source: http://biology.kenyon.edu/courses/biol114/Chap08/longread sequence.gif]


String Searching (Matching) Problems

Source: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

String matching: Find one or, more generally, all the occurrences of a pattern $x = [x_0 x_1 \ldots x_{m-1}]$; $x_i \in \Sigma$; $i = 0, \ldots, m-1$, in a text (string) $y = [y_0 y_1 \ldots y_{n-1}]$; $y_j \in \Sigma$; $j = 0, \ldots, n-1$
• Two basic variants:
  1 Given a pattern, find its occurrences in any initially unknown text
    • Solutions by preprocessing the pattern using finite automata models or combinatorial properties of strings
  2 Given a text, find occurrences of any initially unknown pattern
    • Solutions by indexing the text with the help of trees or finite automata
• In COMPSCI 369: only algorithms of the first kind
• Algorithms of the second kind: look e.g. at Google...


String Matching: Sliding Window Mechanism

• Sliding window: Scan the text with a window whose size is generally equal to m
• An attempt: Align the left end of the window with a text position and compare the characters in the window with those of the pattern
• Each attempt (step) is associated with the position j in the text at which the window covers $y_j \ldots y_{j+m-1}$
• Shift the window to the right after a whole match of the pattern or after a mismatch
Effectiveness of the search depends on the order of comparisons:
1 The order is not relevant (e.g. the naïve, or brute-force algorithm)
2 The natural left-to-right order (the reading direction)
3 The right-to-left order (the best algorithms in practice)
4 A specific order (the best theoretical bounds)

Single Pattern Algorithms (Summary)

Notation:
m – the length (size) of the pattern; n – the length of the searched text

String search algorithm    Preprocessing time   Matching time
Naïve                      0 (none)             Θ(n · m)
Rabin-Karp                 Θ(m)                 average Θ(n + m), worst Θ(n · m)
Finite state automaton     Θ(m|Σ|)              Θ(n)
Knuth-Morris-Pratt         Θ(m)                 Θ(n)
Boyer-Moore                Θ(m + |Σ|)           Ω(n/m), O(n)
Bit based (approximate)    Θ(m + |Σ|)           Θ(n)

See http://www-igm.univ-mlv.fr/~lecroq/string for some animations of these and many other string algorithms


Naïve (Brute-Force) Algorithm

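int naiveSearch(const char *x, int m, const char *y, int n) {
    int i, j;
    for (j = 0; j <= n - m; j++) {
        for (i = 0; i < m && x[i] == y[i + j]; i++);  /* compare window with pattern */
        if (i >= m) return j;                          /* occurrence at position j */
    }
    return -1;                                         /* not found */
}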

Main features of this easy (but slow) O(nm) algorithm:
• No preprocessing phase
• Only constant extra space needed
• Always shifts the window by exactly 1 position to the right
• Comparisons can be done in any order
• At most m(n − m + 1) text character comparisons, i.e. O(mn) in the worst case
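The loop above can be instrumented to count character comparisons. The following is a minimal Python sketch, not part of the original slides; the function name `naive_search` and its return shape are assumptions made for illustration. Unlike the slide code, it keeps going after a match so that all occurrences are reported.

```python
def naive_search(x, y):
    """Return (positions of all occurrences of x in y, number of character comparisons)."""
    m, n = len(x), len(y)
    matches, comparisons = [], 0
    for j in range(n - m + 1):          # window covers y[j .. j+m-1]
        i = 0
        while i < m:
            comparisons += 1            # one comparison of x[i] with y[j+i]
            if x[i] != y[j + i]:
                break
            i += 1
        if i == m:                      # the whole pattern matched
            matches.append(j)
    return matches, comparisons


print(naive_search("abaa", "ababbaabaaab"))
```

The comparison counter includes the comparison that detects each mismatch, which is the usual convention when comparing string matching algorithms.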


Naïve Algorithm: An Example

Pattern: abaa; searched string: ababbaabaaab

ababbaabaaab          ............  (upper case: match; #: mismatch; dot: not compared)
abaa________  step 1  ABA#          mismatch: 4th letter
_abaa_______  step 2  _#...         mismatch: 1st letter
__abaa______  step 3  __AB#.        mismatch: 3rd letter
___abaa_____  step 4  ___#...       mismatch: 1st letter
____abaa____  step 5  ____#...      mismatch: 1st letter
_____abaa___  step 6  _____A#..     mismatch: 2nd letter
______abaa__  step 7  ______ABAA    success
_______abaa_  step 8  _______#...   mismatch: 1st letter
________abaa  step 9  ________A#..  mismatch: 2nd letter
Runs with 9 window steps and 19 character comparisons (4 + 1 + 3 + 1 + 1 + 2 + 4 + 1 + 2)


Automaton Based Search


Main features:
• Building the minimal deterministic finite automaton (DFA) accepting strings from the language L = Σ*x
• L is the set of all strings of characters from Σ ending with the pattern x
• Time complexity O(m|Σ|) of this preprocessing (m = |x|, i.e. the size of x)
• Time complexity O(n) of the search in a string y of size n if the DFA is stored in a direct access table
• Most suitable for searching within many different strings y for the same given pattern x

The DFA has m + 1 states, one per prefix of x (ε denotes the empty prefix):
$x = x_0 x_1 x_2 x_3 \Rightarrow \{\varepsilon,\ x_0,\ x_0 x_1,\ x_0 x_1 x_2,\ x_0 x_1 x_2 x_3\}$
x = abaa ⇒ {ε, a, ab, aba, abaa}


Building the Minimal DFA for L = Σ*x

• The DFA $(Q, \Sigma, \delta : Q \times \Sigma \to Q, q_0 \in Q, F \subseteq Q)$ to recognise the language $L = \Sigma^* x$:
  • Q – the set of all the prefixes of $x = x_0 \cdots x_{m-1}$:
    $Q = \{\varepsilon, x_0, x_0 x_1, \ldots, x_0 \cdots x_{m-2}, x_0 \cdots x_{m-1}\}$
  • $q_0 = \varepsilon$ – the state representing the empty prefix
  • F = {x} – the state representing the whole pattern x
  • δ – the state+character to state transition function
    • For q ∈ Q and c ∈ Σ, δ(q, c) = qc if and only if qc ∈ Q
    • Otherwise δ(q, c) = p such that p is the longest suffix of qc which is a prefix of x (i.e. p ∈ Q)
• Once the DFA is built, searching for the word x in a text y consists of parsing y with the DFA, beginning with the initial state $q_0$
• Each time the unique final state in F is entered, an occurrence of x is reported
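The definition of δ can be transcribed almost literally. The following Python sketch is an illustration only: the names `build_dfa` and `dfa_search` are assumptions, states are represented by prefix lengths 0..m, and the construction is the direct (slow) reading of the definition rather than the O(m|Σ|) preprocessing quoted for this method.

```python
def build_dfa(x, alphabet):
    """delta[q][c] = length of the longest suffix of x[:q] + c that is a prefix of x."""
    m = len(x)
    delta = []
    for q in range(m + 1):              # state q stands for the prefix x[:q]
        row = {}
        for c in alphabet:
            s = x[:q] + c
            k = min(len(s), m)
            while k > 0 and not s.endswith(x[:k]):
                k -= 1                  # try the next shorter prefix of x
            row[c] = k
        delta.append(row)
    return delta


def dfa_search(x, y, alphabet):
    """Return (start positions of occurrences of x in y, sequence of visited states)."""
    delta = build_dfa(x, alphabet)
    m, q = len(x), 0
    matches, states = [], []
    for j, c in enumerate(y):
        q = delta[q].get(c, 0)          # a character outside the alphabet resets to the empty prefix
        states.append(q)
        if q == m:                      # final state reached: x ends at position j
            matches.append(j - m + 1)
    return matches, states


print(dfa_search("abaa", "ababbaabaaab", "ab"))
```

Each text character triggers exactly one table lookup, which is where the O(n) search time comes from.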

Automaton Search: x = abaa and y = ababbaabaaab

Σ = {a, b}; Q = {ε, a, ab, aba, abaa}; q0 = ε; F = {x} = {abaa}

Transitions δ(q, c):
c \ q    ε    a    ab    aba    abaa
a        a    a    aba   abaa   a
b        ε    ab   ε     ab     ab

[State diagram: start at ε; the states ε, a, ab, aba, abaa are chained by the edges c = a, c = b, c = a, c = a; all other edges follow the transition table.]

See also: http://www.ics.uci.edu/~eppstein/161/960222.html


Automaton Search: x = abaa and y = ababbaabaaab

Automaton: initial state ε; final states {abaa}

Transitions $q_{next} = \delta(q_{curr}, c)$:
q_curr \ c   a      b
ε            a      ε
a            a      ab
ab           aba    ε
aba          abaa   ab
abaa         a      ab

Step   Character read   Transition
1      y0 = a           ε → a
2      y1 = b           a → ab
3      y2 = a           ab → aba
4      y3 = b           aba → ab
5      y4 = b           ab → ε
6      y5 = a           ε → a
7      y6 = a           a → a
8      y7 = b           a → ab
9      y8 = a           ab → aba
10     y9 = a           aba → abaa   (final state: occurrence at position 6)
11     y10 = a          abaa → a
12     y11 = b          a → ab

Runs with 12 steps and 12 character comparisons


Rabin-Karp Algorithm

Main features:
• Uses a hashing function (i.e., it is cheaper to check whether the window contents “look like” the pattern than to check for an exact match)
• Preprocessing phase: O(m) time complexity and constant space
• Searching phase time complexity:
  • O(mn) in the worst case
  • O(n + m) in the expected case
• Good when multiple patterns x are searched for


Rabin-Karp Hashing Details


Desirable properties of a hashing function hash(...) for string matching:
• Efficiency of computation
• High discrimination for strings
• Easy computation of $\mathrm{hash}(y_{j+1} \ldots y_{j+m})$ from the previous window, i.e.
  $\mathrm{hash}(y_{j+1} \ldots y_{j+m}) = \mathrm{rehash}(y_j, y_{j+m}, \mathrm{hash}(y_j \ldots y_{j+m-1}))$
For a word w of length m, let hash(w) be defined as:
  $\mathrm{hash}(w_0 \ldots w_{m-1}) = (w_0 2^{m-1} + w_1 2^{m-2} + \ldots + w_{m-1} 2^0) \bmod q$
where q is a large number. Then $\mathrm{rehash}(a, b, h) = (2h - a 2^m + b) \bmod q$
• Preprocessing phase: computing hash(x)
  • It can be done in constant space and O(m) time
• Searching phase: comparing hash(x) with $\mathrm{hash}(y_j \ldots y_{j+m-1})$ for $0 \le j \le n - m$
  • If the hash values are equal, still check the equality $x = y_j \ldots y_{j+m-1}$ character by character
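The hash and rehash definitions can be tried out directly. The sketch below is an illustration under stated assumptions: the modulus `Q = 2**31 - 1` is an arbitrary large prime chosen here, characters are mapped to numbers with `ord`, and `rehash` takes the window length `m` as an extra parameter because it needs $2^m$. The names `rk_hash`, `rehash` and `rabin_karp` are hypothetical.

```python
Q = 2**31 - 1   # assumed modulus; any large number q fits the definition


def rk_hash(w, q=Q):
    """hash(w) = (w0*2^(m-1) + w1*2^(m-2) + ... + w(m-1)*2^0) mod q, by Horner's rule."""
    h = 0
    for ch in w:
        h = (2 * h + ord(ch)) % q
    return h


def rehash(a, b, h, m, q=Q):
    """Hash of the next window: drop the leading character a, append the character b."""
    return (2 * h - ord(a) * pow(2, m, q) + ord(b)) % q


def rabin_karp(x, y, q=Q):
    """Return (start positions of all occurrences of x in y, hash value of every window)."""
    m, n = len(x), len(y)
    matches, hashes = [], []
    if m == 0 or m > n:
        return matches, hashes
    hx, hy = rk_hash(x, q), rk_hash(y[:m], q)
    for j in range(n - m + 1):
        hashes.append(hy)
        if hx == hy and y[j:j + m] == x:    # verify a hash hit character by character
            matches.append(j)
        if j < n - m:                       # no window follows the last one
            hy = rehash(y[j], y[j + m], hy, m, q)
    return matches, hashes


print(rk_hash("abaa"), rabin_karp("abaa", "ababbaabaaab"))
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in `rehash` needs no extra correction; in C or Java the intermediate value may have to be adjusted by adding q.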

Rabin-Karp Algorithm
Using external hash and rehash functions:

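static int RabinKarp(String x, String y)
{
    int m = x.length();
    int n = y.length();
    if (m > n) return -1;           // the pattern cannot fit into the text
    int hx = hash(x, 0, m - 1);     // hash value of the pattern
    int hy = hash(y, 0, m - 1);     // hash value of the first window
    for (int j = 0; j <= n - m; j++)
    {
        // substring's end index is exclusive; equals() compares contents
        if (hx == hy && y.substring(j, j + m).equals(x)) return j;
        if (j < n - m)              // no window follows the last one
            hy = rehash(y.charAt(j), y.charAt(j + m), hy);
    }
    return -1; // not found
}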

Rabin-Karp with x = abaa and y = ababbaabaaab

hash(abaa) = 1459 → hx, the hash value for the pattern x
(character values are ASCII codes: a = 97, b = 98)

Window           hy (hash value for the substring of y)
hash(y0..y3)   = 1460    [abab]baabaaab
hash(y1..y4)   = 1466    a[babb]aabaaab
hash(y2..y5)   = 1461    ab[abba]abaaab
hash(y3..y6)   = 1467    aba[bbaa]baaab
hash(y4..y7)   = 1464    abab[baab]aaab
hash(y5..y8)   = 1457    ababb[aaba]aab
hash(y6..y9)   = 1459    ababba[abaa]ab : return 6
hash(y7..y10)  = 1463    ababbaa[baaa]b
hash(y8..y11)  = 1456    ababbaab[aaab]


Rabin-Karp Algorithm (searching multiple patterns)


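Extending the search to multiple patterns of the same length:
static void RabinKarpMult(String[] x, String y)
{
    int m = x[0].length();          // all patterns have the same length m
    int n = y.length();
    if (m > n) return;
    int[] hx = new int[x.length];
    for (int i = 0; i < x.length; i++)
        hx[i] = hash(x[i], 0, m - 1);
    int hy = hash(y, 0, m - 1);
    for (int j = 0; j <= n - m; j++) {
        for (int k = 0; k < x.length; k++)
            if (hx[k] == hy && y.substring(j, j + m).equals(x[k]))
                matchProcess(k, j); // pattern k occurs at position j
        if (j < n - m)              // no window follows the last one
            hy = rehash(y.charAt(j), y.charAt(j + m), hy);
    }
}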
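The inner loop over all pattern hashes can be replaced by a single dictionary lookup per window. The following Python sketch shows that variant; it is a swapped-in design choice, not the code from the slides, and the names `rk_hash` and `rabin_karp_multi` as well as the modulus `Q` are assumptions.

```python
Q = 2**31 - 1   # assumed modulus


def rk_hash(w, q=Q):
    """Polynomial hash with base 2, as defined for the single-pattern search."""
    h = 0
    for ch in w:
        h = (2 * h + ord(ch)) % q
    return h


def rabin_karp_multi(patterns, y, q=Q):
    """Return (k, j) pairs meaning: pattern number k occurs in y at position j."""
    m, n = len(patterns[0]), len(y)
    if any(len(p) != m for p in patterns):
        raise ValueError("all patterns must have the same length")
    by_hash = {}                            # hash value -> indices of patterns with that hash
    for k, p in enumerate(patterns):
        by_hash.setdefault(rk_hash(p, q), []).append(k)
    found = []
    if m == 0 or m > n:
        return found
    top = pow(2, m, q)                      # weight of the character leaving the window
    hy = rk_hash(y[:m], q)
    for j in range(n - m + 1):
        for k in by_hash.get(hy, []):       # only patterns whose hash equals the window hash
            if y[j:j + m] == patterns[k]:   # confirm character by character
                found.append((k, j))
        if j < n - m:
            hy = (2 * hy - ord(y[j]) * top + ord(y[j + m])) % q
    return found


print(rabin_karp_multi(["abaa", "bbaa", "aaaa"], "ababbaabaaab"))
```

With the dictionary, the work per window no longer grows with the number of patterns unless many of them share a hash value.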

Knuth-Morris-Pratt Algorithm

Searches for occurrences of a pattern x within a main text string y by employing a simple observation: after a mismatch, the pattern itself allows us to determine where to begin the next match, bypassing re-examination of previously matched characters
• Preprocessing phase: O(m) space and time complexity
• Searching phase: O(n + m) time complexity (independent of the alphabet size)
• At most 2n − 1 character comparisons during the text scan
• The maximum number of comparisons for a single text character: $\le \log_\eta m$, where $\eta = \frac{1 + \sqrt{5}}{2}$ is the golden ratio
The algorithm was devised by Knuth and Pratt and, independently, by Morris; the three published it jointly in 1977


Knuth-Morris-Pratt Window Shift Idea


Let offset i, 0 < i < m, be the first mismatched position for a pattern x matched to the text string y starting at index position j (i.e. $x_0 \ldots x_{i-1} = y_j \ldots y_{j+i-1} = u$, but $x_i = a \ne y_{j+i} = b$):

y:  ... [      u      ] b ...       mismatch at text position i + j
x:      [      u      ] a ...       pattern aligned at position j; u ends with v
x:            [   v   ] c ...       pattern after the shift; u starts with v, and c ≠ a

The length of the longest string v that is both a prefix and a suffix of u, with its two occurrences followed by different characters (a after the suffix and c after the prefix, as above), gives the next search index next[i]

Knuth-Morris-Pratt Preprocessing
All the shift distances next[i] for 0 ≤ i ≤ m can be computed in total time O(m), where m = |x|

static void computeNext(String x, int[] next) {  // next has length m + 1
    int m = x.length();
    int i = 0;
    int j = next[0] = -1;          // end of window marker
    while (i < m) {
        // fall back to shorter borders until x[i] extends one of them
        while (j > -1 && x.charAt(i) != x.charAt(j)) j = next[j];
        i++;
        j++;
        if (i < m && x.charAt(i) == x.charAt(j))
            next[i] = next[j];     // the same character would mismatch again
        else
            next[i] = j;
    }
}

Knuth-Morris-Pratt Main Algorithm


The main search runs in time O(n), where n = |y|.

static int KMP(String x, String y) {
    int m = x.length(); int n = y.length();
    if (m == 0) return 0;          // the empty pattern occurs at position 0
    int[] next = new int[m + 1];
    computeNext(x, next);
    int i = 0; int j = 0;          // indices in x and y
    while (j < n) {
        while (i > -1 && x.charAt(i) != y.charAt(j)) i = next[i];
        i++;
        j++;
        if (i >= m) return j - i;  // match: the occurrence starts at j - i
    }
    return -1;                     // not found
}

So the total time of the KMP algorithm is O(m + n)
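Both phases can be checked with a short Python sketch. It is an illustration only: the names `compute_next` and `kmp_search` are assumptions, a non-empty pattern is assumed, and, unlike the routine above, the search continues after a match (resuming from next[m]) so that all occurrences are reported.

```python
def compute_next(x):
    """Return the table next[0..m] for a non-empty pattern x."""
    m = len(x)
    nxt = [0] * (m + 1)
    i, j = 0, -1
    nxt[0] = -1                             # end of window marker
    while i < m:
        while j > -1 and x[i] != x[j]:
            j = nxt[j]                      # fall back to a shorter border
        i += 1
        j += 1
        if i < m and x[i] == x[j]:
            nxt[i] = nxt[j]                 # same character would mismatch again
        else:
            nxt[i] = j
    return nxt


def kmp_search(x, y):
    """Return the start positions of all occurrences of the non-empty pattern x in y."""
    m, n = len(x), len(y)
    nxt = compute_next(x)
    matches = []
    i = j = 0                               # indices in x and y
    while j < n:
        while i > -1 and x[i] != y[j]:
            i = nxt[i]
        i += 1
        j += 1
        if i >= m:                          # whole pattern matched, ending at y[j-1]
            matches.append(j - i)
            i = nxt[i]                      # continue, allowing overlapping occurrences
    return matches


print(compute_next("abaa"), kmp_search("abaa", "ababbaabaaab"))
```

The text index j never moves backwards, which is why the scan needs only O(n) time regardless of the alphabet size.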



KMP with x = abaa and y = ababbaabaaab


Preprocessing phase:
i          0    1    2    3    4
x[i]       a    b    a    a    -
next[i]   -1    0   -1    1    1

Searching phase (upper case: matched character; lower case: mismatched character; dot: not compared):

ababbaabaaab
ABAa            Shift by 2 (next[3]=1)
  .Ba.          Shift by 3 (next[2]=-1)
     Ab..       Shift by 1 (next[1]=0)
      ABAA      Shift by 3 (match found)


Boyer-Moore Algorithm

Main features of this “best practical choice” algorithm:
• Performing the comparisons from right to left
• Preprocessing phase: O(m + |Σ|) time and space complexity
• Searching phase: O(m + n) time complexity
  • 3n text character comparisons in the worst case when searching for a non-periodic pattern
  • O(n/m) best performance
Two precomputed functions to shift the window to the right:
• The good-suffix shift (also called matching shift)
• The bad-character shift (also called occurrence shift)


Good-Suffix (Matching) Shifts


A mismatch $x_i \ne y_{j+i}$ occurs during a matching attempt at position j, so that $x_{i+1} \ldots x_{m-1} = y_{j+i+1} \ldots y_{j+m-1} = u$
The shift: align the segment u in y with its rightmost occurrence in x that is preceded by a character different from $x_i$
or, if no such segment exists in x, align the longest suffix v of u in y with a matching prefix v of x

See details in http://www-igm.univ-mlv.fr/~lecroq/string/index.html



Bad-Character (Occurrence) Shifts


A mismatch $x_i \ne y_{j+i}$ occurs during a matching attempt at position j, so that $x_{i+1} \ldots x_{m-1} = y_{j+i+1} \ldots y_{j+m-1} = u$
The shift: align the text character $y_{j+i}$ with its rightmost occurrence in $x_0 \ldots x_{m-2}$
or, if $y_{j+i}$ does not occur in x, align the left end of the window with the character $y_{j+i+1}$ immediately after $y_{j+i}$

See details in http://www-igm.univ-mlv.fr/~lecroq/string/index.html



Boyer-Moore with x = abaa and y = ababbaabaaab

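Bad-character shifts: Bc[a]=1 and Bc[b]=2.
Good-suffix shifts Gs[0..3] are 3, 3, 1, and 2, respectively.

(upper case: matched character; lower case: mismatched character; dot: not compared)

ababbaabaaab
...a            Shift by 2
  ..aA          Shift by 1
   aBAA         Shift by 3
      ABAA      Shift by 3 (match found)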
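The two shift tables and the trace above can be reproduced with the Python sketch below. It follows the table definitions used on the Charras and Lecroq pages as far as they are summarised here; the function names are assumptions, the suffix table is computed by a simple quadratic loop instead of the linear-time method, and a non-empty pattern is assumed.

```python
def bad_character_table(x):
    """Bc[c] = distance from the rightmost occurrence of c in x[0..m-2] to the end of x."""
    m = len(x)
    bc = {}
    for i in range(m - 1):
        bc[x[i]] = m - 1 - i                # occurrences further right overwrite earlier ones
    return bc                               # characters missing from the table shift by m


def suffixes(x):
    """suff[i] = length of the longest common suffix of x[0..i] and x (quadratic version)."""
    m = len(x)
    suff = [0] * m
    for i in range(m):
        k = 0
        while k <= i and x[i - k] == x[m - 1 - k]:
            k += 1
        suff[i] = k
    return suff


def good_suffix_table(x):
    """Gs[i] = shift to apply after a mismatch at pattern position i."""
    m = len(x)
    suff = suffixes(x)
    gs = [m] * m
    j = 0
    for i in range(m - 1, -1, -1):          # case: a suffix of u is a prefix of x
        if suff[i] == i + 1:
            while j < m - 1 - i:
                if gs[j] == m:
                    gs[j] = m - 1 - i
                j += 1
    for i in range(m - 1):                  # case: u reoccurs inside x
        gs[m - 1 - suff[i]] = m - 1 - i
    return gs


def boyer_moore(x, y):
    """Return (start positions of all occurrences of x in y, list of window shifts)."""
    m, n = len(x), len(y)
    bc, gs = bad_character_table(x), good_suffix_table(x)
    matches, shifts = [], []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == y[i + j]:  # compare from right to left
            i -= 1
        if i < 0:
            matches.append(j)
            shift = gs[0]
        else:
            shift = max(gs[i], bc.get(y[i + j], m) - m + 1 + i)
        shifts.append(shift)
        j += shift
    return matches, shifts


print(bad_character_table("abaa"), good_suffix_table("abaa"))
print(boyer_moore("abaa", "ababbaabaaab"))
```

Taking the maximum of the two candidate shifts is safe because each of them, on its own, never skips an occurrence.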


Shift-AND Matching Algorithm

Also known as the Baeza-Yates–Gonnet algorithm; it is related to the Wu-Manber k-differences algorithm.

The main features of this bit-based algorithm are:
• efficient if the pattern length is no longer than the memory-word size of the machine;
• preprocessing phase in O(m + |Σ|) time and space complexity;
• searching phase in O(n) time complexity;
• adapts easily to approximate string matching.


Shift-AND Matching Algorithm (continued)


For a fixed text position i, the algorithm uses a state vector ŝ, where

  s[j] = 1 iff y[i − j, ..., i] = x[0, ..., j]

For c ∈ Σ let T[c] be a (Boolean) bit vector of length m = |x| that indicates where c occurs in x.
The next state vector at position i + 1 is computed very fast:

  ŝ = ((ŝ << 1) + 1) & T[y[i + 1]]

A match is found whenever s[m − 1] = 1.
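The update rule maps directly onto integer bit operations. In the Python sketch below, bit j of the integer `s` plays the role of s[j]; the name `shift_and` and the returned list of intermediate states are assumptions added for illustration. Python integers are unbounded, so the word-size restriction mentioned for this algorithm does not show up here.

```python
def shift_and(x, y):
    """Return (start positions of all occurrences of x in y, state after each text character)."""
    m = len(x)
    T = {}                                  # T[c]: bit j is set iff x[j] == c
    for j, c in enumerate(x):
        T[c] = T.get(c, 0) | (1 << j)
    s = 0                                   # no prefix of x matched yet
    matches, states = [], []
    for i, c in enumerate(y):
        # after the shift bit 0 is clear, so "| 1" equals the "+ 1" of the update rule
        s = ((s << 1) | 1) & T.get(c, 0)
        states.append(s)
        if m > 0 and s & (1 << (m - 1)):    # s[m-1] = 1: x ends at position i
            matches.append(i - m + 1)
    return matches, states


print(shift_and("abaa", "ababbaabaaab"))
```

Each text character costs one shift, one OR and one AND, independent of the pattern length as long as the state fits into a machine word.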



Shift-AND with x = abaa and y = ababbaabaaab

The character vectors for the pattern x are:

x \ T[]   a   b
a         1   0
b         0   1
a         1   0
a         1   0

The main search progresses as follows (one column of s[] per text character; the 1 in the last row marks the match ending at y9):

x \ s[]   a b a b b a a b a a a b
a         1 0 1 0 0 1 1 0 1 1 1 0
b         0 1 0 1 0 0 0 1 0 0 0 1
a         0 0 1 0 0 0 0 0 1 0 0 0
a         0 0 0 0 0 0 0 0 0 1 0 0


Wu-Manber Approximation Matching Algorithm

The Shift-AND algorithm can be modified to detect string matching with at most k errors (or k differences).

The possibilities for matching x[0, ..., j] with a substring of y that ends at position i with e errors:
1 Match: x[j] = y[i] and a match with e errors between x[0, ..., j − 1] and a substring of y ending at i − 1.
2 Substitution: a match with e − 1 errors between x[0, ..., j − 1] and a substring of y ending at i − 1.
3 Insertion: a match with e − 1 errors between x[0, ..., j] and a substring of y ending at i − 1.
4 Deletion: a match with e − 1 errors between x[0, ..., j − 1] and a substring of y ending at i.


Wu-Manber State Update Procedure

We introduce new state vectors $\hat{s}_e$ that represent the matches where 0 ≤ e ≤ k errors have occurred. [Note: $\hat{s} = \hat{s}_0$.]

The state updating is a generalization of the Shift-AND rule:

$$s'_e = \bigl(((s_e \ll 1) + 1) \;\mathrm{AND}\; T[y[i+1]]\bigr) \;\mathrm{OR}\; ((s_{e-1} \ll 1) + 1) \;\mathrm{OR}\; ((s'_{e-1} \ll 1) + 1) \;\mathrm{OR}\; s_{e-1}$$

Here, $s_*$ and $s'_*$ denote the state at character position i and i + 1, respectively, of the text string y

The ORs in the above rule account for the 4 possible ways to approximate the pattern with errors
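The update rule can be sketched in Python with one integer per error level. This is an illustration under stated assumptions: the name `wu_manber` is hypothetical, the function reports end positions rather than start positions (an approximate occurrence has no unique start), and the initial states with the e lowest bits set are a choice made here that the rule itself does not spell out.

```python
def wu_manber(x, y, k):
    """Return the positions i such that x matches a substring of y ending at i with at most k errors."""
    m = len(x)
    T = {}                                  # T[c]: bit j is set iff x[j] == c
    for j, c in enumerate(x):
        T[c] = T.get(c, 0) | (1 << j)
    mask = (1 << m) - 1                     # keep only the m state bits
    # assumed initial states: a prefix of length <= e matches the empty text with <= e deletions
    s = [((1 << e) - 1) & mask for e in range(k + 1)]
    ends = []
    for i, c in enumerate(y):
        t = T.get(c, 0)
        new = [0] * (k + 1)
        new[0] = ((s[0] << 1) | 1) & t      # plain Shift-AND rule for e = 0
        for e in range(1, k + 1):
            new[e] = ((((s[e] << 1) | 1) & t)       # match
                      | ((s[e - 1] << 1) | 1)       # substitution
                      | ((new[e - 1] << 1) | 1)     # deletion (uses the already updated state)
                      | s[e - 1]) & mask            # insertion
        s = new
        if m > 0 and s[k] & (1 << (m - 1)):
            ends.append(i)
    return ends


print(wu_manber("abaa", "ababbaabaaab", 0))
print(wu_manber("abaa", "ababbaabaaab", 1))
```

With k = 0 the function reduces to exact Shift-AND matching; each additional error level adds a constant number of word operations per text character.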