12_strings.v3

String matching

Uploaded by

mori.alizadeh.2000m

String processing algorithms
David Kauchak
cs161
Summer 2009
Administrative
• Check your scores on coursework
• SCPD Final exam: e-mail me with proctor information
• Office hours next week?
• Reminder: HW6 due Wed. 8/12 before class and no late homework

Where did “dynamic programming” come from?

Richard Bellman on the Birth of Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/or50/1526-5463-2002-50-01-0048.pdf
Strings
• Let Σ be an alphabet, e.g. Σ = {‘ ’, a, b, c, …, z}
• A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ
  • ‘this is a string’ ∈ Σ*
  • ‘this is also a string’ ∈ Σ*
  • ‘1234’ ∉ Σ*
String operations
• Given strings s1 of length n and s2 of length m
• Equality: is s1 = s2? (case sensitive or insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’

• Running time
  • O(min(n, m)), i.e. linear in the length of the shorter string
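
A minimal Python sketch of the equality check, assuming a character-by-character comparison that stops at the first mismatch (the function name equal and its case_sensitive flag are illustrative, not from the slides):

```python
def equal(s1, s2, case_sensitive=True):
    """Compare two strings character by character."""
    if len(s1) != len(s2):
        return False  # strings of different lengths can never be equal
    if not case_sensitive:
        # Normalizing case is itself a linear pass over both strings.
        s1, s2 = s1.lower(), s2.lower()
    for a, b in zip(s1, s2):
        if a != b:
            return False  # stop at the first mismatch
    return True

print(equal("this is a string", "this is a string"))                        # True
print(equal("this is a string", "this is another string"))                  # False
print(equal("this is a string", "THIS IS A STRING"))                        # False
print(equal("this is a string", "THIS IS A STRING", case_sensitive=False))  # True
```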
String operations
• Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’

• Running time
  • Θ(n+m)
String operations
• Substitute: Exchange all occurrences of a particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

• Running time
  • Θ(n)
String operations
• Length: return the number of characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 22

• Running time
  • O(1) or Θ(n) depending on implementation
String operations
• Prefix: Get the first j characters in the string
Prefix(‘this is a string’, 4) → ‘this’

• Running time
  • Θ(j)
• Suffix: Get the last j characters in the string
Suffix(‘this is a string’, 6) → ‘string’

• Running time
  • Θ(j)
String operations
• Substring: Get the characters between i and j inclusive
Substring(‘this is a string’, 4, 8) → ‘s is ’

• Running time
  • Θ(j - i)
• Prefix?
  • Prefix(S, i) = Substring(S, 1, i)
• Suffix?
  • Suffix(S, i) = Substring(S, n-i+1, n), where n = Length(S)
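
The three operations can be sketched in Python on top of a single slicing helper. This assumes 1-indexed, inclusive positions as in the Substring example; the function names are illustrative. Since Suffix takes the last j characters, it starts at position n - j + 1.

```python
def substring(s, i, j):
    """Characters at 1-indexed positions i..j, inclusive."""
    return s[i - 1:j]

def prefix(s, j):
    """First j characters of s."""
    return substring(s, 1, j)

def suffix(s, j):
    """Last j characters of s."""
    n = len(s)
    return substring(s, n - j + 1, n)

s = "this is a string"
print(repr(substring(s, 4, 8)))  # 's is '
print(repr(prefix(s, 4)))        # 'this'
print(repr(suffix(s, 6)))        # 'string'
```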
Edit distance
(aka Levenshtein distance)
• Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2

Insertion:
ABACED → ABACCED → DABACCED
(insert ‘C’, then insert ‘D’)

Deletion:
ABACED → BACED → BACE
(delete ‘A’, then delete ‘D’)

Substitution:
ABACED → ABADED → ABADES
(sub ‘D’ for ‘C’, then sub ‘S’ for ‘D’)

Edit distance examples

Edit(Kitten, Mitten) = 1

Operations:

Sub ‘M’ for ‘K’ → Mitten

Edit distance examples

Edit(Happy, Hilly) = 3

Operations:

Sub ‘i’ for ‘a’ → Hippy
Sub ‘l’ for ‘p’ → Hilpy
Sub ‘l’ for ‘p’ → Hilly

Edit distance examples

Edit(Banana, Car) = 5

Operations:

Delete ‘B’ → anana
Delete ‘a’ → nana
Delete ‘n’ → naa
Sub ‘C’ for ‘n’ → Caa
Sub ‘r’ for ‘a’ → Car

Edit distance examples

Edit(Simple, Apple) = 3

Operations:

Delete ‘S’ → imple
Sub ‘A’ for ‘i’ → Ample
Sub ‘p’ for ‘m’ → Apple
Is edit distance symmetric?
• That is, is Edit(s1, s2) = Edit(s2, s1)?

Edit(Simple, Apple) =? Edit(Apple, Simple)

• Yes. Why? Each operation can be undone by one operation:
  • sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
  • delete ‘i’ → insert ‘i’
  • insert ‘i’ → delete ‘i’
Calculating edit distance

X=ABCBDAB

Y=BDCABA

Ideas?
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

After all of the operations, X needs to equal Y

Operations: Insert, Delete, Substitute
Insert

X=ABCBDA?

Y=BDCAB?

Edit(X, Y) = 1 + Edit(X_{1...n}, Y_{1...m-1})

Delete

X=ABCBDA?

Y=BDCAB?

Edit(X, Y) = 1 + Edit(X_{1...n-1}, Y_{1...m})

Substitution

X=ABCBDA?

Y=BDCAB?

Edit(X, Y) = 1 + Edit(X_{1...n-1}, Y_{1...m-1})

Anything else?

X=ABCBDA?

Y=BDCAB?

Equal

X=ABCBDA?

Y=BDCAB?

Edit(X, Y) = Edit(X_{1...n-1}, Y_{1...m-1})

Combining results
Insert: Edit(X, Y) = 1 + Edit(X_{1...n}, Y_{1...m-1})

Delete: Edit(X, Y) = 1 + Edit(X_{1...n-1}, Y_{1...m})

Substitute: Edit(X, Y) = 1 + Edit(X_{1...n-1}, Y_{1...m-1})

Equal: Edit(X, Y) = Edit(X_{1...n-1}, Y_{1...m-1})

Combining results

Edit(X, Y) = min {
    1 + Edit(X_{1...n}, Y_{1...m-1})                   insertion
    1 + Edit(X_{1...n-1}, Y_{1...m})                   deletion
    Diff(x_n, y_m) + Edit(X_{1...n-1}, Y_{1...m-1})    equal/substitution
}

where Diff(x_n, y_m) is 0 if x_n = y_m and 1 otherwise
Running time

Θ(nm)
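
The recurrence can be evaluated bottom-up with a table, which is where the Θ(nm) bound comes from. A minimal Python sketch, assuming the usual base cases (turning an empty prefix into a prefix of length j costs j insertions, and the reverse costs deletions); the function name edit_distance and the table name d are illustrative:

```python
def edit_distance(x, y):
    """Levenshtein distance between x and y via dynamic programming."""
    n, m = len(x), len(y)
    # d[i][j] = Edit(first i characters of x, first j characters of y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                1 + d[i][j - 1],         # insertion
                1 + d[i - 1][j],         # deletion
                diff + d[i - 1][j - 1],  # equal/substitution
            )
    return d[n][m]

print(edit_distance("Kitten", "Mitten"))  # 1
print(edit_distance("Happy", "Hilly"))    # 3
print(edit_distance("Banana", "Car"))     # 5
print(edit_distance("Simple", "Apple"))   # 3
```

Each of the (n+1)(m+1) table cells is filled in constant time.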
Variants
• Only include insertions and deletions
  • What does this do to substitutions?
• Include swaps, i.e. swapping two adjacent characters counts as one edit
• Weight insertion, deletion and substitution differently
• Weight specific character insertion, deletion and substitutions differently
• Length normalize the edit distance
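
Two of these variants can be sketched by parameterizing the operation costs. The Python sketch below is one possible formulation, not the course's implementation; the function name weighted_edit_distance and the parameters ins, dele and sub are assumptions made for illustration. It keeps only two rows of the table at a time.

```python
def weighted_edit_distance(x, y, ins=1, dele=1, sub=1):
    """Edit distance with configurable per-operation costs."""
    n, m = len(x), len(y)
    prev = [j * ins for j in range(m + 1)]  # row for the empty prefix of x
    for i in range(1, n + 1):
        cur = [i * dele] + [0] * m
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else sub
            cur[j] = min(
                ins + cur[j - 1],     # insertion
                dele + prev[j],       # deletion
                diff + prev[j - 1],   # equal/substitution
            )
        prev = cur
    return prev[m]

print(weighted_edit_distance("Kitten", "Mitten"))          # 1
print(weighted_edit_distance("Kitten", "Mitten", sub=2))   # 2
print(weighted_edit_distance("Simple", "Apple", sub=2))    # 5
```

Setting sub=2 makes a substitution cost the same as a deletion followed by an insertion, which is what the insertion/deletion-only variant amounts to.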
String matching
• Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S
P = ABA

S = DCABABBABABA
Uses
• grep/egrep
• search
• find
• java.lang.String.contains()
Naive implementation
Is it correct?
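
A minimal Python sketch of a naive matcher, offered as one plausible reading of the approach rather than the course's exact pseudocode. It assumes 0-indexed positions and tries every shift of the pattern, comparing character by character and stopping at the first mismatch; the name naive_match is illustrative.

```python
def naive_match(p, s):
    """Return all 0-indexed positions where p occurs in s."""
    m, n = len(p), len(s)
    matches = []
    for i in range(n - m + 1):      # every possible shift of p
        j = 0
        while j < m and s[i + j] == p[j]:
            j += 1                  # extend the match one character
        if j == m:                  # all m characters matched
            matches.append(i)
    return matches

print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```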
Running time?

• What is the cost of the equality check?
  • Best case: O(1)
  • Worst case: O(m)

Running time?

• Best case
  • Θ(n), when the first character of the pattern does not occur in the string
• Worst case
  • O((n-m+1)m)
Worst case
P = AAAA

S = AAAAAAAAAAAAA

Repeated work!

Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’
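
The repeated work can be made concrete by counting character comparisons. A small Python sketch, assuming the naive strategy of trying every shift and stopping at the first mismatch (count_comparisons is an illustrative name):

```python
def count_comparisons(p, s):
    """Number of character comparisons made by naive matching of p in s."""
    m, n = len(p), len(s)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if s[i + j] != p[j]:
                break  # mismatch: move on to the next shift
    return comparisons

s = "A" * 13
print(count_comparisons("AAAA", s))  # 40 = (13 - 4 + 1) * 4
print(count_comparisons("BAAA", s))  # 10: every shift fails on its first character
```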
Patterns
• Which of these patterns will have that problem?

P = ABAB

P = ABDC

P = BAA

P = ABBCDDCAABB

If the pattern has a suffix that is also a prefix then we will have this problem
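
The prefix/suffix condition can be tested directly. A minimal Python sketch, assuming only proper, non-empty prefixes count (otherwise the whole pattern would trivially qualify); longest_border is an illustrative name:

```python
def longest_border(p):
    """Length of the longest proper, non-empty prefix of p that is also a suffix.

    Returns 0 if no such prefix exists.
    """
    for k in range(len(p) - 1, 0, -1):
        if p[:k] == p[-k:]:
            return k
    return 0

for p in ["ABAB", "ABDC", "BAA", "ABBCDDCAABB"]:
    print(p, longest_border(p))
# ABAB 2
# ABDC 0
# BAA 0
# ABBCDDCAABB 3
```

Under this reading, ABAB (shared AB) and ABBCDDCAABB (shared ABB) have the problem, while ABDC and BAA do not.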
Finite State Automata (FSA)
• An FSA is defined by 5 components
  • Q is the set of states: q0, q1, q2, …, qn
  • q0 is the start state
  • A ⊆ Q is the set of accepting states, where |A| > 0
  • Σ is the alphabet (e.g. {A, B})
  • δ is the transition function from Q × Σ to Q

Q    Σ    Q
q0   A    q1
q0   B    q2
q1   A    q1
…

[Figure: state diagram with states q0, q1, q2, … and transitions labeled A and B]
FSA operation

[Figure: state diagram of an FSA with transitions labeled A and B]

An FSA starts at state q0 and reads the characters of the input string one at a time.
If the automaton is in state q and reads character a, then it transitions to state δ(q, a).
If the FSA reaches an accepting state (q ∈ A), then the FSA has found a match.
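
The operation described above can be sketched in Python with the transition function stored as a dictionary. The automaton in the example is a hypothetical one over {A, B} that accepts whenever the input read so far ends in "AB"; it is not taken from the slides, and run_fsa is an illustrative name.

```python
def run_fsa(delta, start, accepting, text):
    """Run the FSA over text; return the 0-indexed positions after which
    the automaton is in an accepting state."""
    state = start
    hits = []
    for pos, ch in enumerate(text):
        state = delta[(state, ch)]  # follow the transition delta(q, a)
        if state in accepting:
            hits.append(pos)
    return hits

# Hypothetical automaton: state 2 means "the last two characters were AB".
delta = {
    (0, "A"): 1, (0, "B"): 0,
    (1, "A"): 1, (1, "B"): 2,
    (2, "A"): 1, (2, "B"): 0,
}
print(run_fsa(delta, start=0, accepting={2}, text="ABBAB"))  # [1, 4]
```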
FSA operation
P = ABA

[Figure: state diagram of the FSA for P = ABA, with transitions labeled A and B]
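
One way to obtain such an automaton for a pattern is sketched below in Python. It assumes a non-empty pattern, states 0..m where state q means "the last q characters read equal the first q characters of P", and a deliberately simple (not the fastest) construction; the names build_matcher and fsa_match are illustrative.

```python
def build_matcher(p, alphabet):
    """Transition table for a string-matching automaton for pattern p."""
    m = len(p)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            seen = p[:q] + a          # what we have matched, plus the new character
            k = min(m, q + 1)
            # Next state: longest prefix of p that is a suffix of seen.
            while k > 0 and not seen.endswith(p[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta

def fsa_match(p, s):
    """Return all 0-indexed start positions of p in s using the automaton."""
    m = len(p)
    delta = build_matcher(p, set(p) | set(s))
    state, starts = 0, []
    for pos, ch in enumerate(s):
        state = delta[(state, ch)]
        if state == m:                # accepting state: all of p matched
            starts.append(pos - m + 1)
    return starts

print(fsa_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

Once the table is built, each character of S is read exactly once, so the scan itself takes Θ(n) time.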