12_strings.v3
algorithms
David Kauchak
cs161
Summer 2009
Administrative
Check your scores on coursework
SCPD Final exam: e-mail me with proctor
information
Office hours next week?
Running time
O(n) where n is the length of the shorter string
String operations
Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’
Running time
Θ(n+m)
String operations
Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’
Running time
Θ(n)
String operations
Length: return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 22
Running time
O(1) or Θ(n) depending on implementation
String operations
Prefix: Get the first j characters in the string
Running time
Θ(j)
Suffix: Get the last j characters in the string
Suffix(‘this is a string’, 6) → ‘string’
Running time
Θ(j)
String operations
Substring – Get the characters between i and
j inclusive
Substring(‘this is a string’, 4, 8) → ‘s is ’
Running time
Θ(j - i)
Prefix?
Prefix(S, i) = Substring(S, 1, i)
Suffix?
Suffix(S, i) = Substring(S, length(S) - i + 1, length(S))
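The 1-indexed, inclusive conventions used for Substring, Prefix and Suffix can be sketched with Python slices, which are 0-indexed and exclude the end position. The function names below are illustrative only and do not come from any library.

```python
def substring(s, i, j):
    """Characters i..j of s, 1-indexed and inclusive, as on the slides."""
    return s[i - 1:j]


def prefix(s, j):
    """First j characters of s."""
    return substring(s, 1, j)


def suffix(s, j):
    """Last j characters of s."""
    return substring(s, len(s) - j + 1, len(s))


print(repr(substring("this is a string", 4, 8)))  # 's is '
print(suffix("this is a string", 6))              # string
```

Each call copies the characters it returns, which is where the Θ(j) and Θ(j - i) running times come from.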
Edit distance
(aka Levenshtein distance)
Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:
ABACED
Deletion:
ABACED → BACED
Delete ‘A’
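Edit(Kitten, Mitten) = 1
Operations: sub ‘M’ for ‘K’
Edit(Happy, Hilly) = 3
Operations: sub ‘i’ for ‘a’, sub ‘l’ for ‘p’, sub ‘l’ for ‘p’
Edit(Banana, Car) = 5
Operations: sub ‘C’ for ‘B’, sub ‘r’ for ‘n’, delete ‘a’, delete ‘n’, delete ‘a’
Edit(Simple, Apple) = 3
Operations: delete ‘S’, sub ‘A’ for ‘i’, sub ‘p’ for ‘m’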
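Edit distance is symmetric: Edit(s1, s2) = Edit(s2, s1)
Why? Every operation can be undone by one operation:
sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
delete ‘i’ → insert ‘i’
insert ‘i’ → delete ‘i’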
Calculating edit distance
X=ABCBDAB
Y=BDCABA
Ideas?
Calculating edit distance
Look at the last characters of X and Y; everything before them is a smaller subproblem
X=ABCBDA?
Y=BDCAB?
Operations: Insert, Delete, Substitute
Insert: Edit(X1…n, Y1…m) = 1 + Edit(X1…n, Y1…m-1)
Delete: Edit(X1…n, Y1…m) = 1 + Edit(X1…n-1, Y1…m)
Substitution: Edit(X1…n, Y1…m) = 1 + Edit(X1…n-1, Y1…m-1)
Equal (last characters match): Edit(X1…n, Y1…m) = Edit(X1…n-1, Y1…m-1)
Take the minimum over the cases that apply
Θ(nm)
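A minimal sketch of the Θ(nm) calculation, assuming the standard dynamic programming table in which entry d[i][j] holds the edit distance between the first i characters of X and the first j characters of Y. The function name and table layout are choices made for this example.

```python
def edit_distance(x, y):
    """Levenshtein distance between x and y using a (n+1) x (m+1) table."""
    n, m = len(x), len(y)
    # d[i][j] = edit distance between x[:i] and y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of x
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Equal: last characters match, no extra cost
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(
                    d[i][j - 1],      # insert
                    d[i - 1][j],      # delete
                    d[i - 1][j - 1],  # substitute
                )
    return d[n][m]


print(edit_distance("Kitten", "Mitten"))  # 1
print(edit_distance("Banana", "Car"))     # 5
```

The table has (n+1)(m+1) entries and each is filled in constant time, which gives the Θ(nm) bound.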
Variants
Only include insertions and deletions
What does this do to substitutions?
Include swaps, i.e. swapping two adjacent
characters counts as one edit
Weight insertion, deletion and substitution
differently
Weight specific character insertion, deletion
and substitutions differently
Length normalize the edit distance
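Several of these variants can be sketched by giving each operation its own cost. The parameter names (ins, dele, sub) and the choice to normalize by the longer string's length are assumptions made for this example; other normalizations are possible. Setting sub=2 models the insertions-and-deletions-only variant, since a substitution then costs the same as a delete followed by an insert.

```python
def weighted_edit_distance(x, y, ins=1, dele=1, sub=1):
    """Edit distance where each operation type has its own cost."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching characters are free; otherwise pay the substitution cost
            diag = 0 if x[i - 1] == y[j - 1] else sub
            d[i][j] = min(
                d[i][j - 1] + ins,
                d[i - 1][j] + dele,
                d[i - 1][j - 1] + diag,
            )
    return d[n][m]


def normalized_edit_distance(x, y):
    """Unit-cost edit distance divided by the longer length (one possible choice)."""
    longest = max(len(x), len(y), 1)
    return weighted_edit_distance(x, y) / longest


print(weighted_edit_distance("kitten", "mitten"))         # 1
print(weighted_edit_distance("kitten", "mitten", sub=2))  # 2
```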
String matching
Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA
S = DCABABBABABA
Uses
grep/egrep
search
find
java.lang.String.contains()
Naive implementation
Is it correct?
Running time?
Best case
Θ(n) – when the first character of the pattern does
not occur in the string
Worst case
O((n-m+1)m)
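The naive implementation itself did not survive in this text, so the following is a minimal sketch of the usual approach: try every starting position in S and compare P character by character. The function name and the 0-indexed return values are choices made for this example.

```python
def naive_match(p, s):
    """Return every 0-indexed position in s where p occurs."""
    m, n = len(p), len(s)
    matches = []
    for i in range(n - m + 1):  # n - m + 1 candidate start positions
        j = 0
        # Compare up to m characters, stopping at the first mismatch
        while j < m and s[i + j] == p[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches


print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

When the first character of P never occurs in S, the inner loop stops immediately at every position, which gives the best case. When every comparison succeeds for almost m characters, each of the n - m + 1 positions costs up to m comparisons.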
Worst case
P = AAAA
S = AAAAAAAAAAAAA
repeated work!
Patterns
Which of these patterns will have that
problem?
P = ABAB
P = ABDC
P = BAA
P = ABBCDDCAABB
Finite State Automata (FSA)
An FSA is defined by 5 components
Q is the set of states
q0 q1 q2 … qn
q0 is the start state
A is the set of accepting states
Σ is the alphabet of input symbols
The transition function maps a state and an input symbol to the next state, e.g.
q0 A q1
q0 B q2
q1 A q1
FSA operation
P = ABA
S = BABABBABABA
[Figure: state diagram of the FSA for P = ABA, with transitions labelled A and B, stepped through S one symbol at a time]
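Stepping an FSA over S can be sketched as a table lookup per symbol. The transition table below for P = ABA was written out by hand for this example (the slide's diagram is not recoverable from the text), and the names run_fsa and delta_aba are hypothetical.

```python
def run_fsa(delta, accepting, s, start=0):
    """Feed s to the automaton; return 0-indexed positions where it is in an accepting state."""
    state = start
    hits = []
    for pos, symbol in enumerate(s):
        state = delta[state][symbol]  # exactly one transition per input symbol
        if state in accepting:
            hits.append(pos)
    return hits


# Hand-built table for P = ABA: state k means "the last k symbols read match the first k of P"
delta_aba = {
    0: {"A": 1, "B": 0},
    1: {"A": 1, "B": 2},
    2: {"A": 3, "B": 0},
    3: {"A": 1, "B": 2},
}

print(run_fsa(delta_aba, {3}, "BABABBABABA"))  # [3, 8, 10]
```

The reported positions are where an occurrence of P ends; subtracting 2 gives the start positions 1, 6 and 8.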
Suffix function
The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
σ(abcdab, ababcd) = 2
σ(daabac, abacac) = 4
σ(dabb, abacd) = 0
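A direct sketch of the suffix function, trying the longest candidate first. The name sigma is chosen here to mirror σ; this is the simple quadratic version, not an optimized one.

```python
def sigma(x, y):
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


print(sigma("abcdab", "ababcd"))  # 2
print(sigma("daabac", "abacac"))  # 4
print(sigma("dabb", "abacd"))     # 0
```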
Building a string matching automaton
Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings
P = ababaca
Ideas?
Building a string matching automaton
P = ababaca
Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
the set of accepting states, A = {qm}
vocab Σ all symbols in P, plus one more
representing all symbols not in P
The transition function for q ∈ Q and a ∈ Σ is
defined as:
δ(q, a) = σ(p1…pq a, P)
(the first q symbols of P followed by a)
Transition function
P = ababaca
δ(q, a) = σ(p1…pq a, P)
Worked entries:
δ(q0, a) = σ(a, ababaca) = 1
δ(q0, b) = σ(b, ababaca) = 0
δ(q0, c) = σ(c, ababaca) = 0
δ(q3, a): we’ve seen ‘aba’ so far, σ(abaa, ababaca) = 1
δ(q5, b): we’ve seen ‘ababa’ so far, σ(ababab, ababaca) = 4
Full table (the P column is the next pattern symbol expected in that state):
state  a  b  c  P
q0     1  0  0  a
q1     1  2  0  b
q2     3  0  0  a
q3     1  4  0  b
q4     5  0  0  a
q5     1  4  6  c
q6     7  0  0  a
q7     1  2  0
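The table can be reproduced by applying the definition entry by entry. This is a sketch of the naive construction only, and the function names are hypothetical; symbols outside the given alphabet are treated like the extra "not in P" symbol and send the automaton back to q0.

```python
def sigma(x, y):
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def build_transition_table(p, alphabet):
    """delta[q][a] = sigma(p[:q] + a, p) for every state q in 0..m and symbol a."""
    return [{a: sigma(p[:q] + a, p) for a in alphabet} for q in range(len(p) + 1)]


def fsa_match(p, s, alphabet):
    """Return the 0-indexed start positions of p in s using the automaton."""
    delta = build_transition_table(p, alphabet)
    state, starts = 0, []
    for pos, symbol in enumerate(s):
        state = delta[state].get(symbol, 0)  # unknown symbol: back to q0
        if state == len(p):                  # accepting state q_m reached
            starts.append(pos - len(p) + 1)
    return starts


for q, row in enumerate(build_transition_table("ababaca", "abc")):
    print(f"q{q}", row["a"], row["b"], row["c"])
```

Each of the (m+1)|Σ| entries calls sigma, which tries up to m candidate lengths and compares up to m characters for each, so this version costs O(m³|Σ|) to build.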
Matching runtime
Once we’ve built the FSA, what is the
runtime?
Θ(n) - Each symbol causes a state transition and
we only visit each character once
What is the cost to build the FSA?
How many entries in the table?
m|Σ| - Best case: Ω(m|Σ|)
How long does it take to calculate the suffix
function at each entry?
Naïve: O(m²)
Overall naïve: O(m³|Σ|)
Overall fast implementation O(m|Σ|)
Rabin-Karp algorithm
- Use a function T that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
S = BABABBABABA
Hash P: T(P)
Hash m symbol sequences of S and compare each with T(P):
T(BAB) = T(P)?
T(ABA) = T(P)? match
T(BAB) = T(P)?
…
Rabin-Karp algorithm
For this to be useful/efficient, what needs to be true about T?
P = ABA
S = BABABBABABA
…
T(BAB) = T(P)?
How do we efficiently calculate the numerical
representation of a string?
T(‘9847261’) = ?
Horner’s rule
T(p_1…m) = p_m + 10(p_{m-1} + 10(p_{m-2} + … + 10(p_2 + 10 p_1)))
9847261
9 * 10 = 90
… = 9847261
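Horner's rule can be sketched as a single left-to-right pass: multiply the running value by the base, then add the next digit. The function name and the base parameter are choices made for this example.

```python
def horner(digits, base=10):
    """Numerical value of a digit string, using one multiply and one add per symbol."""
    value = 0
    for ch in digits:
        value = value * base + int(ch)
    return value


print(horner("9847261"))  # 9847261
```

This takes Θ(m) time for an m-symbol string and never computes a power of the base explicitly.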
Calculating the hash on the string
Given T(s_i…i+m-1) how can we efficiently
calculate T(s_i+1…i+m)?
m=4
963801572348267
T(s_i…i+m-1) = 3801
subtract highest order digit: 3801 - 3 * 1000 = 801
shift digits up: 801 * 10 = 8010
add in the lowest digit: 8010 + 5 = 8015
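The three-step update can be sketched directly, followed by a matcher built on it. Everything beyond the update itself is an assumption for illustration: the function names, using character codes with base 256, reducing modulo q = 1000003 to keep the numbers small, and re-checking the characters on a hash hit to rule out collisions.

```python
def roll(t, old_digit, new_digit, m, base=10):
    """Slide an m-digit window one position to the right."""
    t -= old_digit * base ** (m - 1)  # subtract highest order digit
    t *= base                         # shift digits up
    return t + new_digit              # add in the lowest digit


def rabin_karp(p, s, base=256, q=1_000_003):
    """Return 0-indexed start positions of p in s (sketch with an assumed modulus q)."""
    m, n = len(p), len(s)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, q)  # weight of the highest order symbol, mod q
    tp = ts = 0
    for k in range(m):          # Horner's rule for T(P) and the first window of S
        tp = (tp * base + ord(p[k])) % q
        ts = (ts * base + ord(s[k])) % q
    matches = []
    for i in range(n - m + 1):
        # Equal hashes may be a collision, so confirm by comparing the characters
        if tp == ts and s[i:i + m] == p:
            matches.append(i)
        if i < n - m:
            ts = ((ts - ord(s[i]) * high) * base + ord(s[i + m])) % q
    return matches


print(roll(3801, 3, 5, m=4))             # 8015
print(rabin_karp("ABA", "BABABBABABA"))  # [1, 6, 8]
```

Each window update is constant time, so computing T for all m symbol sequences of S costs Θ(n) after the initial Θ(m) hash.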
Σ* 1…q