0% found this document useful (0 votes)
16 views

12_strings.v3

String matching
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views

12_strings.v3

String matching
Copyright
© © All Rights Reserved
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 111

String processing

algorithms
David Kauchak
cs161
Summer 2009
Administrative
 Check your scores on coursework
 SCPD Final exam: e-mail me with proctor

information
 Office hours next week?

 Reminder: HW6 due Wed. 8/12 before class

and no late homework


Where did “dynamic programming” come from?

Richard Bellman On the Birth of


Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/
or50/1526-5463-2002-50-01-
0048.pdf
Strings
 Let Σ be an alphabet, e.g. Σ = ( , a, b, c, …,
z)
 A string is any member of Σ*, i.e. any

sequence of 0 or more members of Σ


 ‘this is a string’  Σ*
 ‘this is also a string’  Σ*
 ‘1234’  Σ*
String operations
 Given strings s1 of length n and s2 of length m
 Equality: is s1 = s2? (case sensitive or
insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’

 Running time
 O(n) where n is length of shortest string
String operations
 Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’

 Running time
 Θ(n+m)
String operations
 Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

 Running time
 Θ(n)
String operations
 Length:return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 24

 Running time
 O(1) or Θ(n) depending on implementation
String operations
 Prefix: Get the first j characters in the string

Prefix(‘this is a string’, 4) → ‘this’

 Running time
 Θ(j)
 Suffix: Get the last j characters in the string
Suffix(‘this is a string’, 6) → ‘string’

 Running time
 Θ(j)
String operations
 Substring – Get the characters between i and
j inclusive

Substring(‘this is a string’, 4, 8) → ‘s is ’

 Running time
 Θ(j - i)
 Prefix?
 Prefix(S, i) = Substring(S, 1, i)
 Suffix?
 Suffix(S, i) = Substring(S, i+1, length(n))
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:

ABACED ABACCED DABACCED

Insert ‘C’ Insert ‘D’


Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED

Delete ‘A’
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED BACE

Delete ‘A’ Delete ‘D’


Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Substitution:

ABACED ABADED ABADES

Sub ‘D’ for ‘C’ Sub ‘S’ for ‘D’


Edit distance examples

Edit(Kitten, Mitten) = 1

Operations:

Sub ‘M’ for ‘K’ Mitten


Edit distance examples

Edit(Happy, Hilly) = 3

Operations:

Sub ‘a’ for ‘i’ Hippy


Sub ‘l’ for ‘p’ Hilpy
Sub ‘l’ for ‘p’ Hilly
Edit distance examples

Edit(Banana, Car) = 5

Operations:

Delete ‘B’ anana


Delete ‘a’ nana
Delete ‘n’ naa
Sub ‘C’ for ‘n’ Caa
Sub ‘a’ for ‘r’ Car
Edit distance examples

Edit(Simple, Apple) = 3

Operations:

Delete ‘S’ imple


Sub ‘A’ for ‘i’ Ample
Sub ‘m’ for ‘p’ Apple
Is edit distance symmetric?
 that is, is Edit(s1, s2) = Edit(s2, s1)?

Edit(Simple, Apple) =? Edit(Apple, Simple)

 Why?
 sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
 delete ‘i’ → insert ‘i’
 insert ‘i’ → delete ‘i’
Calculating edit distance

X=ABCBDAB

Y=BDCABA

Ideas?
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

After all of the operations, X needs


to equal Y
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

Operations: Insert
Delete
Substitute
Insert

X=ABCBDA?

Y=BDCAB?
Insert

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n , Y1...m  1 )


Delete

X=ABCBDA?

Y=BDCAB?
Delete

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m )


Substition

X=ABCBDA?

Y=BDCAB?
Substition

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m  1 )


Anything else?

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) Edit ( X 1...n  1 , Y1...m  1 )


Combining results
Insert: Edit ( X , Y ) 1  Edit ( X 1...n , Y1...m  1 )

Delete: Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m )

Substitute: Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m  1 )

Equal: Edit ( X , Y ) Edit ( X 1...n  1 , Y1...m  1 )


Combining results
 1  Edit(X 1...n ,Y1...m  1 ) insertion

Edit ( X , Y ) min  1  Edit ( X 1...n  1 , Y1...m ) deletion
 Diff ( x , y )  Edit ( X
 n m 1... n  1 , Y1... m  1 ) equal/substitution
Running time

Θ(nm)
Variants
 Only include insertions and deletions
 What does this do to substitutions?
 Include swaps, i.e. swapping two adjacent
characters counts as one edit
 Weight insertion, deletion and substitution
differently
 Weight specific character insertion, deletion
and substitutions differently
 Length normalize the edit distance
String matching
 Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
String matching
 Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
Uses
 grep/egrep

 search

 find

 java.lang.String.contains()
Naive implementation
Is it correct?
Running time?

 What is the cost of the equality check?


 Best case: O(1)
 Worst case: O(m)
Running time?

 Best case
 Θ(n) – when the first character of the pattern does
not occur in the string
 Worst case
 O((n-m+1)m)
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA

repeated work!
Worst case
P = AAAA

S = AAAAAAAAAAAAA

Ideally, after the first match, we’d


know to just check the next
character to see if it is an ‘A’
Patterns
 Whichof these patterns will have that
problem?

P = ABAB

P = ABDC

P = BAA

P = ABBCDDCAABB
Patterns
 Whichof these patterns will have that
problem?

P = ABAB If the pattern has a


suffix that is also a
P = ABDC prefix then we will
have this problem

P = BAA

P = ABBCDDCAABB
Finite State Automata (FSA)
 An FSA is defined by 5 components
 Q is the set of states
q0 q1 q2 … qn
Finite State Automata (FSA)
 An FSA is defined by 5 components
 Q is the set of states
q0 q1 q2 … qn

 q0 is the start state q7

 A  Q, is the set of accepting states where |A| > 0


 Σ is the alphabet (e.g. {A, B}
  is the transition function from Q x Σ to Q
QΣ Q B

q0 A q1
q0 B q2
q0 q1
A
q2 …
A


q1 A q1
FSA operation

B A A

q0 q1 q1 q1
A B A

B
B

An FSA starts at state q0 and reads the characters of the input


string one at a time.
If the automaton is in state q and reads character a, then it
transitions to state (q,a).
If the FSA reaches an accepting state (q  A), then the FSA has
found a match.
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

What pattern does this represent?


FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(abcdab, ababcd) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(abcdab, ababcd) = 2
Suffix function
 Thesuffix function σ(x,y) is the index of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(daabac, abacac) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(daabac, abacac) = 4
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(dabb, abacd) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(dabb, abacd) = 0
Building a string matching
automata
 Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings

P = ababaca

Ideas?
Building a string matching automata
P = ababaca
Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
 the set of accepting states, A = {qm}
 vocab Σ all symbols in P, plus one more
representing all symbols not in P
 The transition function for q  Q and a  Σ is

defined as:
 (q, a) = σ(p1…qa, P)
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 ? a σ(a, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
B,C
state a b c P
q0 1 0 0 a q0 q1
A
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 ? b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 1 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 6 c
q6 7 0 0 a
q7 1 2 0
Matching runtime
 Once we’ve built the FSA, what is the
runtime?
 Θ(n) - Each symbol causes a state transition and
we only visit each character once
 What is the cost to build the FSA?
 How many entries in the table?
 m|Σ| - Best case: Ω(m|Σ|)
 How long does it take to calculate the suffix
function at each entry?
 Naïve: O(m2)
 Overall naïve: O(m3|Σ|)
 Overall fast implementation O(m|Σ|)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA Hash P

T(P)

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

match
Hash m symbol
S = BABABBABABA sequences and compare

T(ABA)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

S = BABABBABABA

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

 Given T(si…i+m-1) we must be


able to efficiently calculate
S = BABABBABABA T(si+1…i+m)

T(BAB)
=
T(P)
Calculating the hash function
 For simplicity, assume Σ = (0, 1, 2, …, 9). (in
general we can use a base larger than 10).
 A string can then be viewed as a decimal

number
 How do we efficiently calculate the numerical

representation of a string?

T(‘9847261’) = ?
Horner’s rule
T ( p1...m )  pm  10( pm  1  10( pm  2  ...  10( p2  10 p1 )))

9847261
9 * 10 = 90

(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Horner’s rule
T ( p1...m )  pm  10( pm  1  10( pm  2  ...  10( p2  10 p1 )))

9847261 Running time?


9 * 10 = 90
Θ(m)
(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4
963801572348267
T(si…i+m-1)

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 801
963801572348267
T(si…i+m-1) subtract highest order digit

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8010
963801572348267
T(si…i+m-1)
shift digits up

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8015
963801572348267
T(si…i+m-1)
add in the lowest digit

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 Running time?


- Θ(m) for s1…m
963801572348267
- O(1) for the rest
T(si…i+m-1)

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Algorithm so far…
 Is it correct?
 Each string has a unique numerical value and we
compare that with each value in the string
 Running time
 Preprocessing:
 Θ(m)
 Matching
 Θ(n-m+1)

Is there any problem with this analysis?


Algorithm so far…
 Is it correct?
 Each string has a unique numerical value and we
compare that with each value in the string
 Running time
 Preprocessing:
 Θ(m)
 Matching
 Θ(n-m+1)

How long does the check T(P) = T(si…i+m-1) take?


Modular arithmetics
 The run time assumptions we made were
assuming arithmetic operations were
constant time, which is not true for large
numbers
 To keep the numbers small, we’ll use
modular arithmetics, i.e. all operations are
performed mod q
 a+b = (a+b) mod q
 a*b = (a*b) mod q
 …
Modular arithmetics
 If T(A) = T(B), then T(A) mod q = T(B) mod q
 In general, we can apply mods as many times as
we want and we will not effect the result
 What is the downside to this modular
approach?
 Spurious hits: if T(A) mod q = T(B) mod q that
does not necessarily mean that T(A) = T(B)
 If we find a hit, we must check that the actual
string matches the pattern
Runtime
 Preprocessing
 Θ(m)
 Running time
 Best case:
 Θ(n-m+1) – No matches and no spurious hits
 Worst case
 Θ((n-m+1)m)
Average case running time
 Assume v valid matches in the string
 What is the probability of a spurious hit?
 As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

 What is the probability under this assumption?


Average case running time
 Assume v valid matches in the string
 What is the probability of a spurious hit?
 As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

 What is the probability under this assumption? 1/q


Average case running time
 How many spurious hits?
 n/q
 Average case running time:
O(n-m+1) + O(m(v+n/q)

iterate over the checking matches


positions and spurious hits
Matching running times

Algorithm Preprocessing time Matching time


Naïve 0 O((n-m+1)m)
FSA Θ(m|Σ|) Θ(n)
Rabin-Karp Θ(m) O(n-m+1)m)
Knuth-Morris-Pratt Θ(m) Θ(n)

You might also like