This document provides an overview of string edit distance and dynamic programming. It discusses how dynamic programming can be used to efficiently calculate the minimum number of edit operations (insertions, deletions, substitutions) needed to transform one string into another. The document presents the algorithm to fill a 2D table in a bottom-up manner to store the edit distances between prefixes of the two strings. It also describes how to trace back through the table to find the optimal sequence of edits.

String Edit Distance

(and intro to dynamic programming)

Lecture #4

Computational Linguistics
CMPSCI 591N, Spring 2006
University of Massachusetts Amherst

Andrew McCallum

Andrew McCallum, UMass Amherst,

including material from William Cohen

Dynamic Programming
(Not much to do with programming in the
CS sense.)
Dynamic programming is efficient in finding
optimal solutions for cases with lots of
overlapping sub-problems.
It solves problems by recombining solutions
to sub-problems, when the sub-problems
themselves may share sub-sub-problems.


Fibonacci Numbers

1 1 2 3 5 8 13 21 34 ...

Calculating Fibonacci Numbers

F(n) = F(n-1) + F(n-2),
where F(0)=0, F(1)=1.

Non-Dynamic Programming implementation

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n-1) + fib(n-2)
For fib(8), how many calls to function fib(n)?
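One way to answer the question is simply to count. The sketch below wraps the same naive recursion in a helper that increments a counter on every call; the name `count_fib_calls` is made up for this illustration and is not part of the lecture code.

```python
def count_fib_calls(n):
    """Return (fib(n), number of calls the naive recursion makes)."""
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1  # one more invocation of fib
        if k == 0 or k == 1:
            return k
        return fib(k - 1) + fib(k - 2)

    value = fib(n)
    return value, calls


print(count_fib_calls(8))  # prints (21, 67)
```

The call count satisfies calls(n) = 1 + calls(n-1) + calls(n-2), so it grows as fast as the Fibonacci numbers themselves.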

DP Example:
Calculating Fibonacci Numbers
Dynamic Programming: avoid repeated calls by remembering
function values already calculated.
table = {}
def fib(n):
    global table
    # Reuse the stored value if fib(n) was already computed
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n-1) + fib(n-2)
        table[n] = value
        return value

DP Example:
Calculating Fibonacci Numbers
...or alternately, in a list instead of a dictionary...
def fib(n):
    if n < 2:  # the table below needs at least two slots
        return n
    table = [0] * (n+1)
    table[0] = 0
    table[1] = 1
    for i in range(2, n+1):
        table[i] = table[i-2] + table[i-1]
    return table[n]
We will see this pattern many more times in this course:
1. Create a table (of the right dimensions to describe our problem).
2. Fill the table, re-using solutions to previous sub-problems.

String Edit Distance

Given two strings (sequences) return the distance
between the two strings as measured by...
...the minimum number of character edit operations
needed to turn one sequence into the other.

Andrew
Amdrewz

1. substitute m to n
2. delete the z
Distance = 2
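As a quick check of the two edits listed above, the sketch below applies them to "Amdrewz" and confirms the result is "Andrew". The helper names `substitute` and `delete` are illustrative only.

```python
def substitute(s, pos, ch):
    """Replace the character at index pos with ch."""
    return s[:pos] + ch + s[pos + 1:]


def delete(s, pos):
    """Remove the character at index pos."""
    return s[:pos] + s[pos + 1:]


word = "Amdrewz"
word = substitute(word, 1, "n")      # edit 1: m becomes n
word = delete(word, len(word) - 1)   # edit 2: drop the trailing z
print(word)  # prints Andrew
```

Two edits suffice; no single edit turns one string into the other, so the distance is 2.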


String distance metrics: Levenshtein

Given strings s and t
Distance is the cost of the shortest sequence of edit
commands that transforms s to t (or equivalently t
to s).
Simple set of operations:

Copy character from s over to t        (cost 0)
Delete a character in s                (cost 1)
Insert a character in t                (cost 1)
Substitute one character for another   (cost 1)

This is Levenshtein distance

Levenshtein distance - example

distance("William Cohen", "Willliam Cohon")

s (aligned)    W I L L - I A M _ C O H E N
t (aligned)    W I L L L I A M _ C O H O N
edit op        C C C C I C C C C C C C S C
cost so far    0 0 0 0 1 1 1 1 1 1 1 1 2 2

(- marks the gap in s; _ stands for the space character; C = copy, I = insert, S = substitute)

Alignment is a little bit like a parse.
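The "cost so far" row in the William Cohen example is just a running sum of per-operation costs. A minimal sketch, assuming the one-letter codes C, S, I, D for copy, substitute, insert, and delete (the dictionary `OP_COST` and function `running_cost` are names invented here):

```python
from itertools import accumulate

# Unit costs for Levenshtein distance: copy is free, everything else costs 1
OP_COST = {"C": 0, "S": 1, "I": 1, "D": 1}


def running_cost(ops):
    """Cumulative cost after each edit operation in the string ops."""
    return list(accumulate(OP_COST[op] for op in ops))


ops = "CCCCICCCCCCCSC"  # the edit-op row from the example
costs = running_cost(ops)
print(costs)      # prints [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
print(costs[-1])  # prints 2, the total cost of this alignment
```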

Finding the Minimum

What is the minimum number of operations for....?

Another fine day in the park

Anybody can see him pick the ball
Not so easy.... not so clear.
Not only are the strings longer, but it isn't immediately
obvious where the alignments should happen.
What if we consider all possible alignments by brute force?
How many alignments are there?
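To get a feel for the size of the brute-force search, the sketch below counts alignments under one particular assumption: an alignment is any sequence of three step types (pair two characters, skip a character of s, skip a character of t) that consumes both strings. Under that assumption the counts are the Delannoy numbers; the function name `count_alignments` is illustrative.

```python
def count_alignments(m, n):
    """Number of step sequences aligning a length-m string with a length-n string."""
    # With one string empty there is exactly one alignment (all skips)
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Last step is a pairing, a skip in s, or a skip in t
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[m][n]


print(count_alignments(2, 2))  # prints 13
print(count_alignments(3, 3))  # prints 63
s = "Another fine day in the park"
t = "Anybody can see him pick the ball"
print(count_alignments(len(s), len(t)))  # a very large number
```

The count grows exponentially with string length, which is why enumerating alignments is impractical. Note that the counting itself already uses the table-filling pattern.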

Dynamic Program Table for String Edit

Measure distance between strings PARK and SPAKE

          P     A     R     K
    .     .     .     .     .
S   .     .     .     .     .
P   .     .     .     .     .
A   .     .    cij    .     .
K   .     .     .     .     .
E   .     .     .     .     .

cij = the number of edit operations needed to align PA with SPA.

Dynamic Programming to the Rescue!

How to take our big problem and chop it into building-block pieces.

Given some partial solution, it isn't hard to figure
out what a good next immediate step is.
Partial solution =
This is the cost for aligning s up to position i
with t up to position j.
Next step =
In order to align up to positions i in s and j in
t, should the last operation be a substitute,
insert, or delete?
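The "partial solution / next step" idea can be written directly as a recursive function with remembered values, in the same spirit as the dictionary-based Fibonacci code. This is a sketch, not the lecture's implementation: it uses `functools.lru_cache` for the remembering, and the names `edit_distance` and `cost` are invented here.

```python
from functools import lru_cache


def edit_distance(s, t):
    """Levenshtein distance between s and t, computed top-down with memoization."""

    @lru_cache(maxsize=None)
    def cost(i, j):
        # cost(i, j) is the partial solution: cheapest alignment of s[:i] with t[:j]
        if i == 0:
            return j  # only characters of t remain
        if j == 0:
            return i  # only characters of s remain
        d = 0 if s[i - 1] == t[j - 1] else 1
        # Next step: which operation comes last?
        return min(cost(i - 1, j - 1) + d,  # copy or substitute
                   cost(i - 1, j) + 1,      # s[i-1] paired with a gap
                   cost(i, j - 1) + 1)      # t[j-1] paired with a gap

    return cost(len(s), len(t))


print(edit_distance("SPAKE", "PARK"))      # prints 3
print(edit_distance("Amdrewz", "Andrew"))  # prints 2
```

Because it recurses, this version is only suitable for short strings; a bottom-up table avoids the recursion depth limit.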

Dynamic Program Table for String Edit

Measure distance between strings PARK and SPAKE

(Figure: the same table, with columns P A R K and rows S P A K E, annotated with the edit operations for turning SPAKE into PARK. The three possible moves into a cell are labeled delete, insert, and substitute.)
Dynamic Program Table for String Edit

Measure distance between strings PARK and SPAKE

          P     A     R     K
    c00   c01   c02   c03   c04
S   c10   c11   c12   c13   c14
P   c20   c21   c22   c23   c24
A   c30   c31   ???
K
E

Dynamic Program Table for String Edit

(Figure: the same table. The unknown cell ??? is computed from three neighbors that are already filled in: c21 on the diagonal (subst), c22 above (insert), and c31 to the left (delete).)

D(i,j) = score of best alignment from s1..si to t1..tj

       = min { D(i-1,j-1),     if si = tj     //copy
               D(i-1,j-1)+1,   if si != tj    //substitute
               D(i-1,j)+1                     //insert
               D(i,j-1)+1                     //delete }

Computing Levenshtein distance - 2

D(i,j) = score of best alignment from s1..si to t1..tj

       = min { D(i-1,j-1) + d(si,tj)   //subst/copy
               D(i-1,j) + 1            //insert
               D(i,j-1) + 1            //delete }

(simplify by letting d(c,d)=0 if c=d, 1 else)

also let D(i,0)=i (for i inserts) and D(0,j)=j
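The recurrence and its boundary values translate into a table-filling routine almost line for line. The sketch below returns the whole table so that individual cells can be inspected; it puts s on the rows and t on the columns, and the name `build_table` is made up for this illustration.

```python
def build_table(s, t):
    """Return D, where D[i][j] is the edit distance between s[:i] and t[:j]."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i  # boundary: D(i,0) = i
    for j in range(1, len(t) + 1):
        D[0][j] = j  # boundary: D(0,j) = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,  # subst/copy
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D


for row in build_table("SPAKE", "PARK"):
    print(row)
# prints:
# [0, 1, 2, 3, 4]
# [1, 1, 2, 3, 4]
# [2, 1, 2, 3, 4]
# [3, 2, 1, 2, 3]
# [4, 3, 2, 2, 2]
# [5, 4, 3, 3, 3]
```

The bottom-right entry, 3, is the distance between SPAKE and PARK. The entry in row A, column A is 1: aligning SPA with PA takes one edit.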


Dynamic Program Table Initialized

          P     A     R     K
    0     1     2     3     4
S   1
P   2
A   3
K   4
E   5

D(i,j) = score of best alignment from s1..si to t1..tj

       = min { D(i-1,j-1) + d(si,tj)   //substitute
               D(i-1,j) + 1            //insert
               D(i,j-1) + 1            //delete }

Dynamic Program Table ... filling in

(Figure: the table being filled in one cell at a time. Each new cell is computed from its three already-filled neighbors using the recurrence for D(i,j).)


Dynamic Program Table ... filling in

The bottom-right cell holds the final cost of aligning all of both strings.

DP String Edit Distance

def stredit(s1, s2):
    "Calculate Levenshtein edit distance for strings s1 and s2."
    len1 = len(s1)  # horizontally (one column per character of s1)
    len2 = len(s2)  # vertically (one row per character of s2)
    # Allocate the table
    table = [None]*(len2+1)
    for i in range(len2+1): table[i] = [0]*(len1+1)
    # Initialize the table
    for i in range(1, len2+1): table[i][0] = i
    for i in range(1, len1+1): table[0][i] = i
    # Do dynamic programming
    for i in range(1, len2+1):
        for j in range(1, len1+1):
            if s1[j-1] == s2[i-1]:
                d = 0
            else:
                d = 1
            table[i][j] = min(table[i-1][j-1] + d,
                              table[i-1][j]+1,
                              table[i][j-1]+1)
    # The bottom-right cell is the distance between the full strings
    return table[len2][len1]

Remembering the Alignment (trace)

D(i,j) = min { D(i-1,j-1) + d(si,tj)   //subst/copy
               D(i-1,j) + 1            //insert
               D(i,j-1) + 1            //delete }

A trace indicates where the min value came from,
and can be used to find edit operations and/or
a best alignment

(may be more than 1)
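One way to recover a trace without storing back-pointers is to walk backwards from the final cell and re-check which neighbor could have produced each value. The sketch below does that and returns one best alignment (ties are broken in favor of the diagonal). The function name `align` and the op labels "copy", "substitute", and "gap" are choices made for this illustration; "gap" covers both insert and delete, with "-" showing which string has the gap.

```python
def align(s, t):
    """Return (distance, ops); each op is (label, char_from_s, char_from_t)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Trace back from the bottom-right cell to the top-left cell
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        d = 0 if (i > 0 and j > 0 and s[i - 1] == t[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append(("copy" if d == 0 else "substitute", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("gap", s[i - 1], "-"))  # character of s has no partner
            i -= 1
        else:
            ops.append(("gap", "-", t[j - 1]))  # character of t has no partner
            j -= 1
    ops.reverse()
    return D[n][m], ops


distance, ops = align("William Cohen", "Willliam Cohon")
print(distance)  # prints 2
for op in ops:
    if op[0] != "copy":
        print(op)  # the two non-copy operations
```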


Three Enhanced Variants

Needleman-Wunsch
    Variable costs

Smith-Waterman
    Find longest soft matching subsequence

Affine Gap Distance
    Make repeated deletions (insertions) cheaper

(Implement one for homework?)


Needleman-Wunsch distance

D(i,j) = min { D(i-1,j-1) + d(si,tj)   //subst/copy
               D(i-1,j) + G            //insert
               D(i,j-1) + G            //delete }

G = gap cost

d(c,d) is an arbitrary
distance function on
characters (e.g. related
to typo frequencies,
amino acid
substitutability, etc)

William Cohen
Wukkuan Cigeb
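A sketch of the variable-cost recurrence follows, with the character distance passed in as a function. The cost function `typo_cost` and the hand-picked set `NEIGHBORS` of adjacent keyboard keys are hypothetical, chosen only to cover the "William Cohen" / "Wukkuan Cigeb" example; real costs would be estimated from typo data.

```python
def needleman_wunsch(s, t, d, G=1.0):
    """Distance between s and t with character cost d(a, b) and gap cost G."""
    D = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * G
    for j in range(1, len(t) + 1):
        D[0][j] = j * G
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] + G,
                          D[i][j - 1] + G)
    return D[len(s)][len(t)]


# Hypothetical: pairs of keys that sit next to each other on a QWERTY keyboard
NEIGHBORS = {("i", "u"), ("l", "k"), ("m", "n"), ("o", "i"), ("h", "g"), ("n", "b")}


def unit_cost(a, b):
    """Plain Levenshtein character cost."""
    return 0.0 if a == b else 1.0


def typo_cost(a, b):
    """Assumed cost: substituting a neighboring key is half price."""
    if a == b:
        return 0.0
    if (a, b) in NEIGHBORS or (b, a) in NEIGHBORS:
        return 0.5
    return 1.0


print(needleman_wunsch("William Cohen", "Wukkuan Cigeb", unit_cost))  # prints 8.0
print(needleman_wunsch("William Cohen", "Wukkuan Cigeb", typo_cost))  # prints 4.0
```

With unit costs the two names look far apart (8 substitutions); with the typo-aware costs the same substitutions cost half as much, reflecting that each one is a plausible slip of the finger.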

Smith-Waterman distance
Instead of looking at each sequence in its
entirety, this compares segments of all
possible lengths and chooses whichever
maximizes the similarity measure.
For every cell the algorithm calculates all
possible paths leading to it. These paths can
be of any length and can contain insertions
and deletions.


Smith-Waterman distance
D(i,j) = min { 0                       //start over
               D(i-1,j-1) + d(si,tj)   //subst/copy
               D(i-1,j) + G            //insert
               D(i,j-1) + G            //delete }

G = 1
d(c,c) = -2
d(c,d) = +1
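A minimal sketch of this recurrence, using the parameter values given on the slide (G = 1, d(c,c) = -2, d(c,d) = +1). Scores are negative for good matches, so the best local alignment is the lowest cell in the table. The function name `smith_waterman` and its return shape are choices made here, and the boundary cells are assumed to be 0.

```python
def smith_waterman(s, t, G=1, match=-2, mismatch=1):
    """Return (best, D): the lowest cell score and the full score table."""
    # Assumption: row 0 and column 0 are all zeros
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(0,                    # start over
                          D[i - 1][j - 1] + d,  # subst/copy
                          D[i - 1][j] + G,
                          D[i][j - 1] + G)
            best = min(best, D[i][j])
    return best, D


print(smith_waterman("abc", "xabcx")[0])  # prints -6
print(smith_waterman("abxc", "abc")[0])   # prints -5
print(smith_waterman("abc", "xyz")[0])    # prints 0
```

In the first call the surrounding x characters cost nothing, because the "start over" option lets the alignment begin and end anywhere. In the second, the stray x is absorbed as a gap of cost 1 between two matched stretches.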

Example output from Python

(Table: Smith-Waterman score matrix for "lounge" (rows, indexed 0 to 6) against "s'allonger" (columns, indexed 1 to 10). Most cells are 0; negative scores appear where the two strings match locally, and cells on the reported path are marked with *, running through -2, -1, -3, -5 and ending at -7.)

(My implementation of HW#2, task choice #2. -McCallum)


Affine gap distances

Smith-Waterman fails on some pairs that
seem quite similar:
William W. Cohen
William W. "Don't call me Dubya" Cohen
Intuitively, a single long insertion is cheaper
than a lot of short insertions

Affine gap distances - 2

Idea:
Current cost of a gap of n characters: nG
Make this cost: A + (n-1)B, where A is cost of
opening a gap, and B is cost of continuing a
gap.
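The difference between the two cost schemes is easy to see numerically. In the sketch below the values G = 1, A = 1, and B = 0.25 are arbitrary choices for illustration, as are the function names.

```python
def linear_gap_cost(n, G):
    """Current scheme: every gap character costs G."""
    return n * G


def affine_gap_cost(n, A, B):
    """Affine scheme: A to open the gap, B for each further character."""
    if n == 0:
        return 0
    return A + (n - 1) * B


n = 20  # one long insertion of 20 characters
print(linear_gap_cost(n, G=1))             # prints 20
print(affine_gap_cost(n, A=1, B=0.25))     # prints 5.75
print(n * affine_gap_cost(1, A=1, B=0.25)) # prints 20: twenty separate one-character gaps
```

With B smaller than A, one long gap is much cheaper than the same number of characters spread over many short gaps, which matches the intuition stated above.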


Affine gap distances - 3

D(i,j) = max { D(i-1,j-1) + d(si,tj)    //subst/copy
               IS(i-1,j-1) + d(si,tj)
               IT(i-1,j-1) + d(si,tj) }

IS(i,j) = max { D(i-1,j) - A
                IS(i-1,j) - B }

Best score in which si is aligned with a gap

IT(i,j) = max { D(i,j-1) - A
                IT(i,j-1) - B }

Best score in which tj is aligned with a gap
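The three recurrences can be filled in together, one cell at a time. The sketch below is a global-alignment version under several assumptions not stated on the slide: d(si,tj) is a similarity (+1 for a match, -1 for a mismatch, since the recurrences maximize), the only starting point is D(0,0) = 0 with every other cell initially minus infinity, and the final score is the best of the three tables at the last cell. The name `affine_gap_score` and the default values of A and B are illustrative.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, A=2.0, B=0.5, match=1.0, mismatch=-1.0):
    """Best global alignment score with gap-open penalty A and gap-extend penalty B."""
    n, m = len(s), len(t)
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # si aligned with tj
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # si aligned with a gap
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # tj aligned with a gap
    D[0][0] = 0.0  # assumed starting point
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
            if i > 0:
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            if j > 0:
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])


print(affine_gap_score("abc", "abc"))  # prints 3.0
print(affine_gap_score("abcd", "ad"))  # prints -0.5
```

In the second call the best alignment matches a and d (+2) and pays for a single gap of two characters (2 + 0.5), rather than opening two separate gaps.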


Affine gap distances as automata

(Figure: automaton with the match state D and the two gap states. Transitions into D are labeled -d(si,tj), transitions that open a gap are labeled -A, and the self-loop that continues a gap is labeled -B.)
Generative version of affine gap automata (Bilenko & Mooney, TechReport 02)

HMM emits pairs: (c,d) in state M, pairs (c,-) in state D, and pairs (-,d) in state I.
For each state there is a multinomial distribution on pairs.
The HMM can be trained with EM from a sample of pairs of matched strings (s,t)
E-step is forward-backward; M-step uses some ad hoc smoothing

Affine gap edit-distance learning: experimental results (Bilenko & Mooney)

Experimental method: parse records into fields; append a
few key fields together; sort by similarity; pick a
threshold T and call all pairs with distance(s,t) < T
duplicates; picking T to maximize F-measure.
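The threshold-picking step can be sketched as a small search over candidate values of T. This is an illustration of the idea only, not the procedure used in the cited experiments: the function names, the choice of candidate thresholds, and the toy data are all assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs):
    """scored_pairs: list of (distance, is_duplicate). Return (T, F) with the highest F."""
    total_dups = sum(1 for _, dup in scored_pairs if dup)
    distances = sorted({dist for dist, _ in scored_pairs})
    candidates = distances + [distances[-1] + 1]  # also try "everything is a duplicate"
    best_T, best_F = None, -1.0
    for T in candidates:
        predicted = [dup for dist, dup in scored_pairs if dist < T]
        true_pos = sum(1 for dup in predicted if dup)
        precision = true_pos / len(predicted) if predicted else 0.0
        recall = true_pos / total_dups if total_dups else 0.0
        F = f_measure(precision, recall)
        if F > best_F:
            best_T, best_F = T, F
    return best_T, best_F


# Hypothetical labeled pairs: (distance, whether the pair is a true duplicate)
pairs = [(1, True), (2, True), (3, False), (4, True), (6, False)]
T, F = best_threshold(pairs)
print(T, round(F, 3))  # prints 6 0.857
```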


Affine gap edit-distance learning: experimental results (Bilenko & Mooney)

(Figure: Precision/recall for MAILING dataset duplicate detection)

Affine gap distances experiments (from McCallum, Nigam, Ungar KDD2000)
Goal is to match data like this:


Homework #2

The assignment
    Start with my stredit.py code
    Make some modifications
    Write a little about your experiences

Some possible modifications
    Implement Needleman-Wunsch, Smith-Waterman, or Affine Gap Distance.
    Create a little spell-checker: if entered word isn't in the dictionary, return the dictionary word that is closest.
    Change implementation to operate on sequences of words rather than characters... get an online translation dictionary, and find alignments between English & French or English & Russian!
    Try to learn the parameters of the function from data. (Tough.)
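As a starting point for the spell-checker idea in the list above, here is a minimal sketch. It is not the course's stredit.py: `edit_distance` is a stand-in Levenshtein routine written for this example, `closest_word` is an invented name, and the word list is a toy dictionary.

```python
def edit_distance(s, t):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j - 1] + d, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[len(t)]


def closest_word(word, dictionary):
    """Return word if it is in the dictionary, else the nearest dictionary word."""
    if word in dictionary:
        return word
    # Ties go to whichever candidate appears first in the list
    return min(dictionary, key=lambda w: edit_distance(word, w))


words = ["park", "spake", "lounge"]  # toy dictionary
print(closest_word("parc", words))   # prints park
print(closest_word("lounge", words)) # prints lounge
```

Scanning the whole dictionary for every misspelled word is slow for large word lists, but it is enough to experiment with different distance functions.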

