String Edit PDF
Computational Linguistics
CMPSCI 591N, Spring 2006
University of Massachusetts Amherst
Andrew McCallum
Dynamic Programming
(Not much to do with programming in the
CS sense.)
Dynamic programming is efficient in finding
optimal solutions for cases with lots of
overlapping sub-problems.
It solves problems by recombining solutions
to sub-problems, when the sub-problems
themselves may share sub-sub-problems.
Fibonacci Numbers
1 1 2 3 5 8 13 21 34 ...
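To make "overlapping sub-problems" concrete, here is a small sketch that counts how often a plain recursive Fibonacci function is called for each argument. The names `naive_fib` and `calls` are made up for this illustration; indexing starts at fib(0) = 0, fib(1) = 1.

```python
def naive_fib(n, calls):
    """Plain recursion with no table; calls[k] counts evaluations of fib(k)."""
    calls[n] = calls.get(n, 0) + 1
    if n < 2:
        return n
    return naive_fib(n - 1, calls) + naive_fib(n - 2, calls)


calls = {}
result = naive_fib(10, calls)
total_calls = sum(calls.values())
# Small arguments are recomputed many times; a table would compute each once.
print(result, total_calls, calls[1])
```

The same small arguments are evaluated again and again, which is exactly the waste a table removes.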
Andrew McCallum, UMass Amherst,
including material from William Cohen
DP Example:
Calculating Fibonacci Numbers
Dynamic Programming: avoid repeated calls by remembering
function values already calculated.
table = {}

def fib(n):
    # Reuse the stored value if fib(n) was already calculated.
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n-1) + fib(n-2)
        table[n] = value
        return value
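For comparison, Python's standard library can maintain the table automatically. This is a sketch of the same idea using `functools.lru_cache`; the name `fib_cached` is only for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache: every computed value is remembered
def fib_cached(n):
    """Same recursion as above; the decorator stores and reuses results."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(30))
```

The decorator plays the role of the explicit dictionary: each distinct argument is computed once.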
DP Example:
Calculating Fibonacci Numbers
...or alternately, in a list instead of a dictionary...
def fib(n):
    # table[i] holds the i-th Fibonacci number; table[0] is already 0.
    table = [0] * (n+1)
    if n > 0:
        table[1] = 1
    for i in range(2, n+1):
        table[i] = table[i-2] + table[i-1]
    return table[n]
We will see this pattern many more times in this course:
1. Create a table (of the right dimensions to describe our problem).
2. Fill the table, re-using solutions to previous sub-problems.
Andrew
Amdrewz
1. substitute m with n
2. delete the z
Distance = 2
Operation costs: copy (cost 0), substitute (cost 1), insert (cost 1), delete (cost 1)
alignment     s:  W I L L - I A M _ C O H E N    (- is a gap)
              t:  W I L L L I A M _ C O H O N
edit op:          C C C C I C C C C C C C S C
cost so far:      0 0 0 0 1 1 1 1 1 1 1 1 2 2
Edit operations for turning SPAKE into PARK: delete, insert, substitute.
[Figure: table of costs cij, with the letters of SPAKE along one axis and the letters of PARK along the other. Cells c00 through c31 are filled in; the next cell, marked ???, is reached from its three filled neighbors by a subst, insert, or delete step.]
D(i,j) = score of best alignment from s1..si to t1..tj

$$
D(i,j) = \min
\begin{cases}
D(i-1,j-1) & \text{if } s_i = t_j \quad \text{(copy)} \\
D(i-1,j-1) + 1 & \text{if } s_i \neq t_j \quad \text{(substitute)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
$$
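The recurrence for D(i,j) can be sketched as a table-filling function. The boundary values D(i,0) = i and D(0,j) = j are an assumption added here (the usual convention: aligning a prefix with the empty string costs one operation per character), and the name `edit_distance` is only for this example.

```python
def edit_distance(s, t):
    """Fill the D table row by row and return D(len(s), len(t))."""
    n, m = len(s), len(t)
    # Step 1: create a table of the right dimensions.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # assumed boundary: i characters against the empty string
    for j in range(1, m + 1):
        D[0][j] = j  # assumed boundary: empty string against j characters
    # Step 2: fill the table, re-using solutions to smaller sub-problems.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if s[i - 1] == t[j - 1] else 1  # copy costs 0, substitute 1
            D[i][j] = min(D[i - 1][j - 1] + diag,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D[n][m]


print(edit_distance("Amdrewz", "Andrew"))
```

Because every cell looks only at its three already-filled neighbors, the whole table costs time proportional to len(s) * len(t).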
Defining d(si,tj) = 0 if si = tj and 1 otherwise folds the copy and substitute cases into one term:

$$
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(substitute)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
$$

Final cost of aligning all of both strings: the last cell, D(|s|,|t|).
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment.
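A trace can be recovered without storing extra pointers by walking back from the last cell and checking which neighbor produced each minimum. The sketch below assumes boundary values D(i,0) = i and D(0,j) = j, and labels the operations as in the alignment example (s is edited into t, so a character found only in t is an I). With the opposite convention the I and D labels swap. When several optimal scripts exist it returns just one of them; `edit_ops` is a made-up name.

```python
def edit_ops(s, t):
    """Return one optimal edit script from s to t as a string over C, S, I, D."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Walk back from the last cell, preferring the diagonal move on ties.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        d = 0 if (i > 0 and j > 0 and s[i - 1] == t[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append("C" if d == 0 else "S")  # copy or substitute
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # character of s with no partner in t
            i -= 1
        else:
            ops.append("I")  # character of t with no partner in s
            j -= 1
    return "".join(reversed(ops))


print(edit_ops("Amdrewz", "Andrew"))
```

The number of non-C letters in the returned script equals the edit distance.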
Smith-Waterman
Find longest soft matching subsequence
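Needleman-Wunsch distance

$$
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(substitute)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
$$

d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)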
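A minimal sketch of this distance with a pluggable character cost follows. The gap cost parameter `G` (default 1) borrows its name from the Smith-Waterman slide, and `unit_cost`, `typo_cost`, and the `NEAR_KEYS` pairs are hypothetical illustrations, not values from the lecture.

```python
def needleman_wunsch(s, t, d, G=1):
    """Global alignment distance with character cost d(a, b) and gap cost G."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + G  # assumed boundary: a run of gaps
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + G
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),
                          D[i - 1][j] + G,
                          D[i][j - 1] + G)
    return D[n][m]


def unit_cost(a, b):
    """0 for a copy, 1 for any substitution (plain edit distance)."""
    return 0 if a == b else 1


# Hypothetical pairs of keys that are easy to confuse when typing.
NEAR_KEYS = {frozenset(p) for p in ("iu", "lk", "mn", "oi", "hg", "nb")}


def typo_cost(a, b):
    """Like unit_cost, but confusable key pairs cost only 0.5."""
    if a == b:
        return 0
    return 0.5 if frozenset((a, b)) in NEAR_KEYS else 1


print(needleman_wunsch("William Cohen", "Wukkuan Cigeb", unit_cost))
print(needleman_wunsch("William Cohen", "Wukkuan Cigeb", typo_cost))
```

With `unit_cost` the function reduces to plain edit distance; a cheaper cost for likely typos pulls such pairs closer together.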
William Cohen
Wukkuan Cigeb
Smith-Waterman distance
Instead of looking at each sequence in its
entirety, this compares segments of all
possible lengths and chooses whichever
maximizes the similarity measure.
For every cell the algorithm calculates all
possible paths leading to it. These paths can
be of any length and can contain insertions
and deletions.
Smith-Waterman distance

$$
D(i,j) = \min
\begin{cases}
0 & \text{(start over)} \\
D(i-1,j-1) + d(s_i,t_j) & \text{(subst/copy)} \\
D(i-1,j) + G & \text{(insert)} \\
D(i,j-1) + G & \text{(delete)}
\end{cases}
$$

G = 1, d(c,c) = -2, d(c,d) = +1
[Table: Smith-Waterman scores D(i,j) for "s'allonger" (rows 1 to 10) against "lounge" (columns 1 to 6). The rows for s, ', and a are all 0; starred cells mark the trace of the best-scoring local match.]
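The Smith-Waterman recurrence in its min form can be sketched as follows, using the slide's parameters G = 1, d(c,c) = -2, d(c,d) = +1 as defaults. Treating the first row and column as 0 and reporting the lowest cell anywhere in the table are assumptions of this sketch, and `smith_waterman` is a made-up name.

```python
def smith_waterman(s, t, G=1, match=-2, mismatch=1):
    """Return the lowest (best) local alignment score in the D table."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: a local match may start anywhere.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(0,                      # start over
                          D[i - 1][j - 1] + d,    # subst/copy
                          D[i - 1][j] + G,        # gap, first form
                          D[i][j - 1] + G)        # gap, second form
            best = min(best, D[i][j])
    return best


print(smith_waterman("s'allonger", "lounge"))
```

Because of the 0 option, unrelated prefixes and suffixes cost nothing, so only the best-matching pair of segments determines the score.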
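$$
D(i,j) = \max
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(subst/copy)} \\
IS(i-1,j-1) + d(s_i,t_j) \\
IT(i-1,j-1) + d(s_i,t_j)
\end{cases}
$$

$$
IS(i,j) = \max
\begin{cases}
D(i-1,j) - A \\
IS(i-1,j) - B
\end{cases}
$$

$$
IT(i,j) = \max
\begin{cases}
D(i,j-1) - A \\
IT(i,j-1) - B
\end{cases}
$$

IS(i,j) and IT(i,j) take the place of the single-step insert and delete terms D(i-1,j) - 1 and D(i,j-1) - 1: starting a gap from D costs A, and extending an existing gap costs B.
[Figure: state diagram with states D, IS, and IT; edge labels -d(si,tj), -A, and -B.]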
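The three tables D, IS, and IT can be filled together, as in the sketch below. This scheme is commonly called an affine gap penalty. Everything numeric here is an assumption for illustration: the scores match = +2 and mismatch = -1, the defaults A = 2 and B = 1, the global (whole-string) boundary with D(0,0) = 0 and every other undefined cell at minus infinity, and the name `affine_gap_score`.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, A=2, B=1, match=2, mismatch=-1):
    """Best whole-string alignment score using the D / IS / IT recurrences."""
    n, m = len(s), len(t)
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0  # assumed start: nothing aligned yet
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                # s_i aligned with t_j, coming from any of the three states.
                D[i][j] = max(D[i - 1][j - 1],
                              IS[i - 1][j - 1],
                              IT[i - 1][j - 1]) + d
            if i > 0:
                # s_i aligned with a gap: open (cost A) or extend (cost B).
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            if j > 0:
                # t_j aligned with a gap: open (cost A) or extend (cost B).
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])


print(affine_gap_score("abc", "abxxxc"))            # one long gap, B < A
print(affine_gap_score("abc", "abxxxc", A=2, B=2))  # every gap step costs A
```

With B smaller than A, one long gap scores better than the same number of gap characters charged at the opening cost each time.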
Homework #2
The assignment
Start with my stredit.py code
Make some modifications
Write a little about your experiences