
Copyright

by

Mikhail Yuryevich Bilenko

2006
The Dissertation Committee for Mikhail Yuryevich Bilenko
certifies that this is the approved version of the following dissertation:

Learnable Similarity Functions and Their Application


to Record Linkage and Clustering

Committee:

Raymond J. Mooney, Supervisor

William W. Cohen

Inderjit S. Dhillon

Joydeep Ghosh

Peter H. Stone
Learnable Similarity Functions and Their Application
to Record Linkage and Clustering

by

Mikhail Yuryevich Bilenko, B.S.; M.S.

Dissertation

Presented to the Faculty of the Graduate School of

The University of Texas at Austin

in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

The University of Texas at Austin

August 2006
To my dear grandparents
Acknowledgments

I am indebted to many people for all the help and support I received over the past six years.
Most notably, no other person influenced me as much as my advisor, Ray Mooney. Ray has
been a constant source of motivation, ideas, and guidance. His encouragement to explore
new directions while maintaining focus has kept me productive and excited about research.
Ray’s passion about science, combination of intellectual depth and breadth, and great per-
sonality have all made working with him a profoundly enriching and fun experience.
Members of my committee, William Cohen, Inderjit Dhillon, Joydeep Ghosh and
Peter Stone, have provided me with many useful ideas and comments. William’s work has
been a great inspiration for applying machine learning methods to record linkage, and I am
very grateful to him for providing detailed feedback and taking the time to visit Austin. In-
derjit and Joydeep gave me a terrific introduction to data mining in my first year in graduate
school, and have had a significant influence on me through their work on clustering. I deeply
appreciate Peter’s helpful insights, encouragement and advice on matters both related and
unrelated to this research.
I have been lucky to work with several wonderful collaborators. Throughout my
graduate school years, Sugato Basu has been the perfect partner on countless projects, a
great sounding board, and a kind friend. Talking to Arindam Banerjee has always been
both eye-opening and motivating. During my internship at Google, working with Mehran
Sahami has been enlightening and exciting, and I have appreciated Mehran’s kindness and
wisdom on many occasions. Exploring several new research directions in the past year with
Beena Kamath, Peng Bi, Tuyen Huynh, Maytal Saar-Tsechansky and Duy Vu has been a
stimulating finish for my graduate career.
My academic siblings in the Machine Learning Group are a wonderful clever bunch
who have been a joy to be around. I am most grateful to Prem Melville for his trusty
friendship, razor-sharp wit, and reliable company in all sorts of places at all sorts of hours;
doing science (and not-quite-science) side-by-side with Prem has been a highlight of the
past years. Lily Mihalkova and Razvan Bunescu have been essential in providing much
needed moral support in the office when it was most needed in the final stretch. Sugato,
Prem, Lily, Razvan, Un Yong Nahm, John Wong, Ruifang Ge, Rohit Kate, and Stewart
Yang have been always ready to help with papers, practice talks and anything else in work
and life, and I hope to see them often for years to come.
A number of UTCS staff members have provided me their support during the past
six years. The administrative expertise of Stacy Miller, Gloria Ramirez and Katherine Utz
has been much appreciated on many occasions. My life has been made much easier by the
friendly department computing staff who have provided a well-oiled and smoothly running
environment for all the experiments in this thesis.
I am grateful to many fellow UTCS grad students for their intellectually stimulat-
ing camaraderie. Joseph Modayil has been a great friend, a dream roommate, and baker
extraordinaire whose editing skills have been invaluable a number of times, including the
present moment. Yuliya Lierler’s warm friendship has brightened many days, and seeing
her in Austin after her German sojourn is a perfect graduation gift. The company of Aniket
Murarka, Amol Nayate, Nalini Belaramani, Patrick Beeson, Nick Jong, Alison Norman and
Serita Nelesen has been a precious pleasure on too many occasions to count, and I hope I
will enjoy it in the future as often as possible.
My friends outside computer science have made my years in Austin a fantastic
experience. From climbing trips to live music shows to random coffeeshop jaunts, they
have given my life much-needed balance and spice.

My road to this point would not be possible without the people I love. My sister Na-
talia, parents Olga and Yury, and grandparents Tamara Alexandrovna, Nina Alexandrovna,
Alexander Mikhailovich and David Isakovich have always given me their unconditional
love and support. My family has inspired me, encouraged me, and helped me in every
possible way, and there are no words to express my gratitude. Sandra, Dick, Gretchen and
Grant Fruhwirth selflessly opened their home to me when I first came to the US, have been
by my side during my college years, and will always be family to me. Finally, I thank Anna
Zaster for making my life complete. I would not be happy without you.
The research in this thesis was supported by the University of Texas MCD Fellow-
ship, the National Science Foundation under grants IIS-0117308 and EIA-0303609, and a
Google Research Grant.

Mikhail Yuryevich Bilenko

The University of Texas at Austin


August 2006

Learnable Similarity Functions and Their Application
to Record Linkage and Clustering

Publication No.

Mikhail Yuryevich Bilenko, Ph.D.


The University of Texas at Austin, 2006

Supervisor: Raymond J. Mooney

Many machine learning and data mining tasks depend on functions that estimate similarity
between instances. Similarity computations are particularly important in clustering and
information integration applications, where pairwise distances play a central role in many
algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures,
such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space
data. However, standard distance functions are frequently suboptimal as they do not capture
the appropriate notion of similarity for a particular domain, dataset, or application.
In this thesis, we present several approaches for addressing this problem by em-
ploying learnable similarity functions. Given supervision in the form of similar or dis-
similar pairs of instances, learnable similarity functions can be trained to provide accurate
estimates for the domain and task at hand. We study the problem of adapting similarity
functions in the context of several tasks: record linkage, clustering, and blocking. For each
of these tasks, we present learnable similarity functions and training algorithms that lead to
improved performance.
In record linkage, also known as duplicate detection and entity matching, the goal
is to identify database records referring to the same underlying entity. This requires esti-
mating similarity between corresponding field values of records, as well as overall simi-
larity between records. For computing field-level similarity between strings, we describe
two learnable variants of edit distance that lead to improvements in linkage accuracy. For
learning record-level similarity functions, we employ Support Vector Machines to combine
similarities of individual record fields in proportion to their relative importance, yielding
a high-accuracy linkage system. We also investigate strategies for efficient collection of
training data which can be scarce due to the pairwise nature of the record linkage task.
In clustering, similarity functions are essential as they determine the grouping of
instances that is the goal of clustering. We describe a framework for integrating learnable
similarity functions within a probabilistic model for semi-supervised clustering based on
Hidden Markov Random Fields (HMRFs). The framework accommodates learning vari-
ous distance measures, including those based on Bregman divergences (e.g., parameterized
Mahalanobis distance and parameterized KL-divergence), as well as directional measures
(e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data repre-
sentations. Similarity functions are learned within the HMRF-KMeans algorithm derived
from the framework, leading to significant improvements in clustering accuracy.
The third application we consider, blocking, is critical in making record linkage
and clustering algorithms scalable to large datasets, as it facilitates efficient selection of
approximately similar instance pairs without explicitly considering all possible pairs. Pre-
viously proposed blocking methods require manually constructing a similarity function or
a set of similarity predicates, followed by hand-tuning of parameters. We propose learning
blocking functions automatically from linkage and semi-supervised clustering supervision,
which allows automatic construction of blocking methods that are efficient and accurate.
This approach yields computationally cheap learnable similarity functions that can be used
for scaling up in a variety of tasks that rely on pairwise distance computations, including
record linkage and clustering.

Contents

Acknowledgments v

Abstract viii

List of Figures xiv

Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Chapter 2 Background 7
2.1 Similarity functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Similarity Functions for String Data . . . . . . . . . . . . . . . . . 8
2.1.2 Similarity Functions for Numeric Data . . . . . . . . . . . . . . . . 12
2.2 Record Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Blocking in Record Linkage and Clustering . . . . . . . . . . . . . . . . . 18
2.5 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Chapter 3 Learnable Similarity Functions in Record Linkage 21


3.1 Learnable Similarity Functions for Strings . . . . . . . . . . . . . . . . . . 21

3.1.1 Learnable Edit Distance with Affine Gaps . . . . . . . . . . . . . . 22
3.1.2 Learnable Segmented Edit Distance . . . . . . . . . . . . . . . . . 29
3.2 Learnable Record Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Combining Similarity Across Fields . . . . . . . . . . . . . . . . . 38
3.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Training-Set Construction for Learning Similarity Functions . . . . . . . . 45
3.3.1 Likely-positive Selection of Training Pairs . . . . . . . . . . . . . 45
3.3.2 Weakly-labeled Selection . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Chapter 4 Learnable Similarity Functions in Semi-supervised Clustering 56


4.1 Similarity Functions in Clustering . . . . . . . . . . . . . . . . . . . . . . 56
4.2 The HMRF Model for Semi-supervised Clustering . . . . . . . . . . . . . 57
4.2.1 HMRF Model Components . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Joint Probability in the HMRF Model . . . . . . . . . . . . . . . . 60
4.3 Learnable Similarity Functions in the HMRF Model . . . . . . . . . . . . . 62
4.3.1 Parameter Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Parameterized Squared Euclidean Distance . . . . . . . . . . . . . 66
4.3.3 Parameterized Cosine Distance . . . . . . . . . . . . . . . . . . . . 67
4.3.4 Parameterized Kullback-Leibler Divergence . . . . . . . . . . . . . 68
4.4 Learning Similarity Functions within the HMRF-KMeans Algorithm . . . . 70
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.2 Clustering Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.7 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Chapter 5 Learnable Similarity Functions in Blocking 93


5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Adaptive Blocking Formulation . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Disjunctive blocking . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.2 DNF Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.1 Pairwise Training Data . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.2 Learning Blocking Functions . . . . . . . . . . . . . . . . . . . . . 101
5.3.3 Blocking with the Learned Functions . . . . . . . . . . . . . . . . 104
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.1 Methodology and Datasets . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

Chapter 6 Future Work 113


6.1 Multi-level String Similarity Functions . . . . . . . . . . . . . . . . . . . . 113
6.2 Discriminative Pair HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 Active Learning of Similarity Functions . . . . . . . . . . . . . . . . . . . 115
6.4 From Adaptive Blocking to Learnable Metric Mapping . . . . . . . . . . . 116

Chapter 7 Conclusions 117

Bibliography 120

Vita 136

List of Figures

2.1 The K-Means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 A generative model for edit distance with affine gaps . . . . . . . . . . . . 23


3.2 Training algorithm for generative string distance with affine gaps . . . . . . 26
3.3 Sample coreferent records from the Reasoning dataset . . . . . . . . . . . . 27
3.4 Mean average precision values for field-level record linkage . . . . . . . . 29
3.5 Field linkage results for the Face dataset . . . . . . . . . . . . . . . . . . . 30
3.6 Field linkage results for the Constraint dataset . . . . . . . . . . . . . . . . 30
3.7 Field linkage results for the Reasoning dataset . . . . . . . . . . . . . . . . 31
3.8 Field linkage results for the Reinforcement dataset . . . . . . . . . . . . . . 31
3.9 Segmented pair HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.10 Sample coreferent records from the Cora dataset . . . . . . . . . . . . . . 35
3.11 Sample coreferent records from the Restaurant dataset . . . . . . . . . . . 35
3.12 Field-level linkage results for the unsegmented Restaurant dataset . . . . . 37
3.13 Field-level linkage results for the unsegmented Cora dataset . . . . . . . . 37
3.14 Computation of record similarity from individual field similarities . . . . . 39
3.15 MARLIN overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.16 Record-level linkage results on the Cora dataset . . . . . . . . . . . . . . . 43
3.17 Record-level linkage results on the Restaurant dataset . . . . . . . . . . . . 43
3.18 Mean average precision values for record-level linkage . . . . . . . . . . . 45

3.19 Classifier comparison for record-level linkage on the Cora dataset . . . . . 46
3.20 Classifier comparison for record-level linkage on the Restaurant dataset . . 46
3.21 Comparison of random and likely-positive training example selection on
the Restaurant dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.22 Comparison of random and likely-positive training example selection on
the Cora dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.23 Comparison of using weakly-labeled non-coreferent pairs with using ran-
dom labeled record pairs on the Restaurant dataset . . . . . . . . . . . . . 51
3.24 Comparison of using weakly-labeled non-coreferent pairs with using ran-
dom labeled record pairs on the Cora dataset . . . . . . . . . . . . . . . . 51

4.1 A Hidden Markov Random Field for semi-supervised clustering . . . . . . 59


4.2 Graphical plate model of variable dependence in HMRF-based semi-supervised
clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 The HMRF-KMeans algorithm . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Results for deuc on the Iris dataset . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Results for deuc on the Iris dataset with full and per-cluster parameterizations 81
4.6 Results for deuc on the Wine dataset . . . . . . . . . . . . . . . . . . . . . . 82
4.7 Results for deuc on the Wine dataset with full and per-cluster parameterizations 82
4.8 Results for deuc on the Protein dataset . . . . . . . . . . . . . . . . . . . . 83
4.9 Results for deuc on the Protein dataset with full and per-cluster parameteri-
zations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Results for deuc on the Ionosphere dataset . . . . . . . . . . . . . . . . . . 84
4.11 Results for deuc on the Ionosphere dataset with full and per-cluster parame-
terizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.12 Results for deuc on the Digits-389 dataset . . . . . . . . . . . . . . . . . . . 85
4.13 Results for deuc on the Digits-389 dataset with full and per-cluster parame-
terizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.14 Results for deuc on the Letters-IJL dataset . . . . . . . . . . . . . . . . . . 86
4.15 Results for deuc on the Letters-IJL dataset with full and per-cluster parame-
terizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.16 Results for dcosA on the News-Different-3 dataset . . . . . . . . . . . . . . . 88
4.17 Results for dIA on the News-Different-3 dataset . . . . . . . . . . . . . . . . 88
4.18 Results for dcosA on the News-Related-3 dataset . . . . . . . . . . . . . . . 89
4.19 Results for dIA on the News-Related-3 dataset . . . . . . . . . . . . . . . . 89
4.20 Results for dcosA on the News-Similar-3 dataset . . . . . . . . . . . . . . . 90
4.21 Results for dIA on the News-Similar-3 dataset . . . . . . . . . . . . . . . . 90

5.1 Examples of blocking functions from different record linkage domains . . . 94


5.2 Blocking key values for a sample record . . . . . . . . . . . . . . . . . . . 96
5.3 Red-blue Set Cover view of disjunctive blocking . . . . . . . . . . . . . . 99
5.4 The algorithm for learning disjunctive blocking . . . . . . . . . . . . . . . 102
5.5 The algorithm for learning DNF blocking . . . . . . . . . . . . . . . . . . 104
5.6 Blocking accuracy results for the Cora dataset . . . . . . . . . . . . . . . . 109
5.7 Blocking accuracy results for the Addresses dataset . . . . . . . . . . . . . 109

Chapter 1

Introduction

1.1 Motivation

Similarity functions play a central role in machine learning and data mining tasks where
algorithms rely on estimates of distance between objects. Consequently, a large number of
similarity functions have been developed for different data types, varying greatly in their
expressiveness, mathematical properties, and assumptions. However, the notion of simi-
larity can differ depending on the particular domain, dataset, or task at hand. Similarity
between certain object features may be highly indicative of overall object similarity, while
other features may be unimportant.
Many commonly used functions make the assumption that different instance fea-
tures contribute equally to similarity (e.g., edit distance or Euclidean distance), while oth-
ers use statistical properties of a given dataset to transform the feature space (e.g., TF-IDF
weighted cosine similarity or Mahalanobis distance) (Duda, Hart, & Stork, 2001). These
similarity functions make strong assumptions regarding the optimal representation of data,
while they may or may not be appropriate for specific datasets and tasks. Therefore, it is
desirable to learn similarity functions from training data to capture the correct notion of dis-
tance for a particular task in a given domain. While learning similarity functions via feature
selection and feature weighting has been extensively studied in the context of classifica-
tion algorithms (Aha, 1998; Wettschereck, Aha, & Mohri, 1997), use of adaptive distance
measures in other tasks remains largely unexplored. In this thesis, we develop methods for
adapting similarity functions to provide accurate similarity estimates in the context of the
following three problems:

• Record Linkage
Record linkage is the general task of identifying syntactically different object de-
scriptions that refer to the same underlying entity (Winkler, 2006). It has
been previously studied by researchers in several areas as duplicate detection, entity
resolution, object identification, and data cleaning, among several other coreferent
names for this problem. Examples of record linkage include matching of coreferent
bibliographic citations (Giles, Bollacker, & Lawrence, 1998), identifying the same
person in different Census datasets (Winkler, 2006), and linking different offers for
the same product from multiple online retailers for comparison shopping (Bilenko,
Basu, & Sahami, 2005). In typical settings, performing record linkage requires two
kinds of similarity functions: those that estimate similarity between individual object
attributes, and those that combine such estimates to obtain overall object similarity.
Object similarities are then used by matching or clustering algorithms to partition
datasets into groups of equivalent objects, or perform pairwise record matching be-
tween distinct data sources.

• Semi-supervised Clustering
Clustering is an unsupervised learning problem in which the objective is to partition
a set of objects into meaningful groups (clusters) so that objects within the same clus-
ter are more similar to each other than to objects outside the cluster (Jain, Murty, &
Flynn, 1999). In pure unsupervised settings, this objective can take on many forms
depending on the semantics of “meaningful” in a specific context and on the choice of
the similarity function. In semi-supervised clustering, prior information is provided

2
to aid the grouping either in the form of objects labeled as belonging to certain cate-
gories (Basu, Banerjee, & Mooney, 2002), or in the form of pairwise constraints indi-
cating preference for placing them in same or different clusters (Wagstaff & Cardie,
2000).

• Blocking
Blocking is the task of efficiently selecting a minimal subset of approximately sim-
ilar object pairs from the set of all possible object pairs in a given dataset (Kelley,
1985). Because computing similarity for all object pairs is computationally costly
for large datasets, to be scalable, record linkage and clustering algorithms that rely
on pairwise distance estimates require blocking methods that efficiently retrieve the
subset of object pairs for subsequent similarity computation. Blocking can be viewed
as applying a computationally inexpensive similarity function to the entire dataset to
obtain approximately similar pairs.

In these tasks, dissimilarity estimates provided by distance functions directly in-
fluence the task output and therefore can have a significant effect on performance. Thus,
ensuring that employed similarity functions are appropriate for a given domain is essential
for obtaining high accuracy.
This thesis presents several techniques for training similarity functions to provide
accurate, domain-specific distance estimates in the context of record linkage, semi-supervised
clustering and blocking. Proposed techniques are based on parameterizing traditional dis-
tance functions, such as edit distance or Euclidean distance, and learning parameter values
that are appropriate for a given domain.
Learning is performed using training data in the form of pairwise supervision which
consists of object pairs known to be similar or dissimilar. Such supervision has different
semantics in different tasks. In record linkage, pairs of records or strings that refer to
the same or different entities are known as matching and non-matching pairs (Winkler,
2006). In clustering, pairs of objects that should be placed in the same cluster or different

clusters are known as must-link and cannot-link pairs, respectively (Wagstaff & Cardie,
2000). Finally, in blocking, either of the above types of supervision can be used depending
on the task for which blocking is employed. Regardless of the setting, pairwise supervision
is a common form of prior knowledge that is either available in many domains, or is easy
to obtain via manual labeling. Our methods exploit such pairwise supervision in the three
tasks listed above to learn accurate distance functions that reflect an appropriate notion of
similarity for a given domain.

1.2 Thesis Contributions

The goal of this thesis is proposing learnable variants of similarity functions commonly
used in record linkage and clustering, developing algorithms for training such functions
using pairwise supervision within these tasks, and performing experiments to study the
effectiveness of the proposed methods. The contributions of the thesis are outlined below:

• We describe two learnable variants of affine-gap edit distance, a string similarity func-
tion commonly used in record linkage on string data. Based on pair Hidden Markov
Models (pair HMMs) originally developed for aligning biological sequences (Durbin,
Eddy, Krogh, & Mitchison, 1998), our methods lead to accuracy improvements over
unlearned affine-gap edit distance and TF-IDF cosine similarity. One of the two pro-
posed variants integrates string distance computation with string segmentation, pro-
viding a joint model for these two tasks that leads to more accurate string similarity
estimates with little or no segmentation supervision. Combining learnable affine-gap
edit distances across different fields using Support Vector Machines produces nearly
perfect (above 0.99 F-measure) results on two standard benchmark datasets.

• We propose two strategies that facilitate efficient construction of training sets for
learning similarity functions in record linkage: weakly-labeled negative and likely-
positive pair selection. These techniques facilitate selecting informative training ex-
amples without the computational costs of traditional active learning methods, which
allows learning accurate similarity functions using small amounts of training data.

• We describe a framework for learning similarity functions within the Hidden Markov
Random Field (HMRF) model for semi-supervised clustering (Basu, Bilenko, Baner-
jee, & Mooney, 2006). This framework leads to embedding similarity function train-
ing within an iterative clustering algorithm, HMRF-KMeans, which allows learn-
ing similarity functions from a combination of unlabeled data and labeled supervision
in the form of same-cluster and different-cluster pairwise constraints. Our approach
accommodates a number of parameterized similarity functions, leading to improved
clustering accuracy on a number of text and numeric benchmark datasets.

• We develop a new framework for learning blocking functions that provides efficient
and accurate selection of approximately similar object pairs for record linkage and
clustering tasks. Previous work on blocking methods has relied on manually con-
structed blocking functions with manually tuned parameters, while our method au-
tomatically constructs blocking functions using training data that can be naturally
obtained within record linkage and clustering tasks. We empirically demonstrate that
our technique results in an order of magnitude increase in efficiency while maintain-
ing high accuracy.

1.3 Thesis Outline

Below is a summary of the remaining chapters in the thesis:

• Chapter 2, Background. We provide the background on commonly used string
and numeric similarity functions, and describe the record linkage, semi-supervised
clustering and blocking tasks.

• Chapter 3, Learnable Similarity Functions in Record Linkage. We show how
record linkage accuracy can be improved by using learnable string distances for indi-
vidual attributes and employing Support Vector Machines to combine such distances.
The chapter also discusses strategies for collecting informative training examples for
training similarity functions in record linkage.

• Chapter 4, Learnable Similarity Functions in Semi-supervised Clustering. This
chapter presents a summary of the HMRF framework for semi-supervised clustering
and describes how it incorporates learnable similarity functions that lead to improved
clustering accuracy.

• Chapter 5, Learnable Similarity Functions in Blocking. In this chapter we present
a new method for automatically constructing blocking functions that efficiently select
pairs of approximately similar objects for a given domain.

• Chapter 6, Future Work. This chapter discusses several directions for future re-
search based on the work presented in this thesis.

• Chapter 7, Conclusions. In this chapter we review and summarize the main contri-
butions of this thesis.

Some of the work presented here has been described in prior publications. Material
presented in Chapter 3 appeared in (Bilenko & Mooney, 2003a) and (Bilenko & Mooney,
2003b), except for work described in Section 3.1.2 which has not been previously published.
Material presented in Chapter 4 is a summary of work presented in a series of publications
on the HMRF model for semi-supervised clustering: (Bilenko, Basu, & Mooney, 2004),
(Bilenko & Basu, 2004), (Basu, Bilenko, & Mooney, 2004), and (Basu et al., 2006). Finally,
an early version of the material described in Chapter 5 has appeared in (Bilenko, Kamath,
& Mooney, 2006).

Chapter 2

Background

Because many data mining and machine learning algorithms require estimating similarity
between objects, a number of distance functions for various data types have been developed.
In this section, we provide a brief overview of several popular distance functions for text
and vector-space data. We also provide background on three important problems, record
linkage, clustering, and blocking, solutions for which rely on similarity estimates between
observations. Finally, we introduce active learning methods that select informative training
examples from a pool of unlabeled data.
Let us briefly describe the notation that we will use in the rest of this thesis. Strings
are denoted by lower-case italic letters such as s and t; brackets are used for string char-
acters and subsequences: s[i] stands for the i-th character of string s, while s[i:j] represents the
contiguous subsequence of s from i-th to j-th character. We use lowercase letters such as x
and y for vectors, and uppercase letters such as A and M for matrices. Sets are denoted by
script uppercase letters such as X and Y .
We use the terms “distance function” and “similarity function” interchangeably
when referring to binary functions that estimate degree of difference or likeness between
instances.

2.1 Similarity functions

2.1.1 Similarity Functions for String Data

Techniques for calculating similarity between strings can be separated into two broad groups:
sequence-based functions and vector-space-based functions. Sequence-based functions com-
pute string similarity by viewing strings as contiguous sequences of either characters or
tokens. Differences between sequences are assumed to be the result of applying edit opera-
tions that transform specific elements in one or both strings. Vector space-based functions,
on the other hand, do not view strings as contiguous sequences but as unordered bags of ele-
ments. Below we describe the two most popular similarity functions from these groups, edit
distance and TF-IDF cosine similarity. Detailed discussion of these similarity functions can
be found in (Gusfield, 1997) and (Baeza-Yates & Ribeiro-Neto, 1999), respectively. For an
overview of various string similarity functions proposed in the context of string matching
and record linkage tasks, see (Winkler, 2006) and (Cohen, Ravikumar, & Fienberg, 2003a).

Edit Distance

Edit distance is a dissimilarity function for sequences that is widely used in many applica-
tions in natural text and speech processing (Jelinek, 1998), bioinformatics (Durbin et al.,
1998), and data integration (Cohen, Ravikumar, & Fienberg, 2003b; Winkler, 2006). Clas-
sical (Levenshtein) edit distance between two strings is defined as the minimum number of
edit operations (deletions, insertions, and substitutions of elements) required to transform
one string into another (Levenshtein, 1966). The minimum number of such operations can
be computed using dynamic programming in time equal to the product of string lengths.
Edit distance can be character-based or token-based: the former assumes that every string
is a sequence of characters, while the latter views strings as sequences of tokens.
For example, consider calculating character-based edit distance between strings
s = “12 8 Street” and t = “12 8th St.”. There are several character edit operation sequences
of length 6 that transform s into t, implying that Levenshtein distance between s and t is 6.
For example, the following six edit operations applied to s transform it into t:

1. Insert “t”: “12 8 Street” → “12 8t Street”
2. Insert “h”: “12 8t Street” → “12 8th Street”
3. Substitute “r” with “.”: “12 8th Street” → “12 8th St.eet”
4. Delete “e”: “12 8th St.eet” → “12 8th St.et”
5. Delete “e”: “12 8th St.et” → “12 8th St.t”
6. Delete “t”: “12 8th St.t” → “12 8th St.”
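To make the dynamic-programming computation behind this example concrete, here is a minimal Python sketch of classical Levenshtein distance; the function and its use of the example strings above are purely illustrative and not part of the original text.

```python
def levenshtein(s, t):
    """Classical edit distance: the minimum number of insertions,
    deletions, and substitutions needed to transform s into t."""
    m, n = len(s), len(t)
    # D[i][j] = edit distance between prefixes s[:i] and t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + cost)   # match or substitution
    return D[m][n]

print(levenshtein("12 8 Street", "12 8th St."))      # prints 6
```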

Wagner and Fisher (1974) generalized edit distance by allowing edit operations
to have different costs. Needleman and Wunsch (1970) extended edit distance further to
distinguish the cost of contiguous insertions or deletions, known as gaps, and Gotoh (1982)
subsequently introduced the affine (linear) model for gap cost yielding an efficient dynamic
programming algorithm for computing edit distance with gaps. The following recursions
are used to compute affine-gap edit distance $d(s,t)$ between strings $s$ and $t$ in $O(|s||t|)$
computational time:

\[
\begin{aligned}
M(i,j) &= \min
\begin{cases}
M(i-1,\, j-1) + c(s_{[i]}, t_{[j]}) \\
I_1(i-1,\, j-1) + c(s_{[i]}, t_{[j]}) \\
I_2(i-1,\, j-1) + c(s_{[i]}, t_{[j]})
\end{cases} \\[6pt]
I_1(i,j) &= \min
\begin{cases}
M(i-1,\, j) + d + c(s_{[i]}, \varepsilon) \\
I_1(i-1,\, j) + e + c(s_{[i]}, \varepsilon)
\end{cases} \\[6pt]
I_2(i,j) &= \min
\begin{cases}
M(i,\, j-1) + d + c(\varepsilon, t_{[j]}) \\
I_2(i,\, j-1) + e + c(\varepsilon, t_{[j]})
\end{cases} \\[6pt]
d(s,t) &= \min\bigl(M(|s|,|t|),\ I_1(|s|,|t|),\ I_2(|s|,|t|)\bigr)
\end{aligned}
\tag{2.1}
\]

where $c(s_{[i]}, t_{[j]})$ is the cost of substituting (or matching) the $i$-th element of $s$ and the $j$-th element of
$t$, $c(s_{[i]}, \varepsilon)$ and $c(\varepsilon, t_{[j]})$ are the costs of inserting elements $s_{[i]}$ and $t_{[j]}$ into the first and second
strings respectively (aligning this element with a gap in the other string), and $d$ and $e$ are
the costs of starting a gap and extending it by one element. Entries $(i,j)$ in matrices $M$, $I_1$,
and $I_2$ correspond to the minimal cost of an edit operation sequence between string prefixes
$s_{[1:i]}$ and $t_{[1:j]}$, with the sequence respectively ending in a match/substitution, an insertion into
the first string, or an insertion into the second string.
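The recursions in (2.1) translate directly into a dynamic program. The sketch below is a simplified illustration with hand-picked uniform costs: the per-element insertion costs c(s[i], ε) and c(ε, t[j]) are folded into the gap-opening and gap-extension parameters, an assumption made here for brevity rather than part of the definition above.

```python
import itertools

def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.5):
    """Affine-gap edit distance following recursions (2.1).
    M tracks alignments ending in a match/substitution; I1 and I2 track
    alignments ending in a gap in t or in s, respectively."""
    INF = float("inf")
    m, n = len(s), len(t)
    M  = [[INF] * (n + 1) for _ in range(m + 1)]
    I1 = [[INF] * (n + 1) for _ in range(m + 1)]
    I2 = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):                       # leading gap in t
        I1[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):                       # leading gap in s
        I2[0][j] = gap_open + (j - 1) * gap_extend
    for i, j in itertools.product(range(1, m + 1), range(1, n + 1)):
        c = 0.0 if s[i - 1] == t[j - 1] else sub_cost
        M[i][j]  = c + min(M[i - 1][j - 1], I1[i - 1][j - 1], I2[i - 1][j - 1])
        I1[i][j] = min(M[i - 1][j] + gap_open, I1[i - 1][j] + gap_extend)
        I2[i][j] = min(M[i][j - 1] + gap_open, I2[i][j - 1] + gap_extend)
    return min(M[m][n], I1[m][n], I2[m][n])

print(affine_gap_distance("12 8 Street", "12 8th St."))
```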
Any sequence of edit operations transforming one string into another corresponds
to an alignment of the two strings. Alignment is a representation of the two strings obtained
by inserting empty characters into the strings in place of insertions, and placing the two
strings one above the other. Following is the alignment of strings s and t corresponding to
the sequence of edit operations shown in the example above:

1 2 8 ε ε S t r e e t
1 2 8 t h S t . ε ε ε                                            (2.2)

This representation shows that the sequence of edit operations for any alignment
can be viewed as a production of the two strings in parallel by emitting elements from
either one or both strings simultaneously. This view will be central in the development of
learnable affine-gap edit distance in Chapter 3.

Jaccard and TF-IDF Cosine Similarity

While sequence-based string similarity functions work well for estimating distance between
shorter strings, they become too computationally expensive and less accurate for longer
strings. For example, when differences between equivalent strings are due to long-range
transpositions of multiple words, sequence-based similarity functions assign high cost to
non-aligned string segments, resulting in low similarity scores for strings that share many
common words. At the same time, computing string edit distance becomes computationally
prohibitive for larger strings such as text documents on the Web because its computational
complexity is quadratic in string size.
The vector-space model of text avoids these problems by viewing strings as “bags
of tokens” and disregarding the order in which the tokens occur in the strings (Salton &
McGill, 1983). Jaccard similarity can then be used as the simplest method for computing
likeness as the proportion of tokens shared by both strings. If strings s and t are represented
by sets of tokens S and T , Jaccard similarity is:

\[
\mathrm{sim}_{Jaccard}(s,t) = \frac{|S \cap T|}{|S \cup T|}
\tag{2.3}
\]
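As a quick illustration of Equation (2.3), token-level Jaccard similarity takes only a few lines of Python; whitespace tokenization is assumed here purely for the example.

```python
def jaccard_similarity(s, t):
    """Proportion of distinct tokens shared by the two strings (Eq. 2.3)."""
    S, T = set(s.lower().split()), set(t.lower().split())
    return len(S & T) / len(S | T) if S | T else 0.0

print(jaccard_similarity("12 8th St.", "12 8th Street Apt 3"))   # 2 shared / 6 total
```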

The primary problem with Jaccard similarity is that it does not take into account
the relative importance of different tokens. Tokens that occur frequently in a given string
should have higher contribution to similarity than those that occur only a few times, as should those
tokens that are rare among the set of strings under consideration. The Term Frequency-
Inverse Document Frequency (TF-IDF) weighting scheme achieves this by associating a
weight $w_{v_i,s} = \frac{N(v_i,s)}{\max_{v_j \in s} N(v_j,s)} \cdot \log\frac{N}{N(v_i)}$ with every token $v_i$ from string $s$, where $N(v_i,s)$ is the
number of times $v_i$ occurs in $s$ (term frequency), $N$ is the number of strings in the overall
corpus under consideration, and $N(v_i)$ is the number of strings in the corpus that include $v_i$
(document frequency).
Given a corpus of strings that yields the set V of distinct tokens after tokenization,
a string $s$ can be represented as a $|V|$-dimensional vector of weights, every non-zero com-
ponent of which corresponds to a token present in $s$. TF-IDF cosine similarity between two
strings is defined as the cosine of the angle between their vector representations:

\[
\mathrm{sim}_{TF\text{-}IDF}(s,t) = \frac{w_s^T w_t}{\|w_s\|\,\|w_t\|} = \frac{\sum_{v_i \in V} w_{s,v_i}\, w_{t,v_i}}{\sqrt{\sum_{s_i \in S} w_{s,s_i}^2}\,\sqrt{\sum_{t_i \in T} w_{t,t_i}^2}}
\tag{2.4}
\]

With the help of appropriate inverted index data structures, TF-IDF cosine similar-
ity is computationally efficient due to high sparsity of most vectors, and provides a rea-
sonable off-the-shelf metric for long strings and text documents. Tokenization is typically
performed by treating each individual word of certain minimum length as a separate token,
usually excluding a fixed set of functional “stop words” and optionally stemming tokens
to their roots (Baeza-Yates & Ribeiro-Neto, 1999). An alternative tokenization scheme is
known as n-grams: it relies on using all overlapping contiguous character subsequences of
length n as tokens.
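The following sketch combines the TF-IDF weighting scheme with the cosine similarity of Equation (2.4) for a small in-memory corpus. It is a simplified illustration (whitespace tokenization, no stop-word removal, stemming, or inverted index), and the toy corpus is an assumption for the example.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Map each string to a sparse {token: TF-IDF weight} dictionary."""
    tokenized = [s.lower().split() for s in corpus]
    N = len(corpus)
    df = Counter(tok for toks in tokenized for tok in set(toks))   # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        max_tf = max(tf.values())
        vectors.append({tok: (cnt / max_tf) * math.log(N / df[tok])
                        for tok, cnt in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (Eq. 2.4)."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

corpus = ["12 8th St.", "12 8th Street", "500 W 23rd Street"]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))
```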

2.1.2 Similarity Functions for Numeric Data

Euclidean and Mahalanobis distances

For data represented by vectors in Euclidean space, the Minkowski family of metrics, also
known as the Lk norms, includes most commonly used similarity measures for objects de-
scribed by d-dimensional vectors (Duda et al., 2001):

\[
L_k(x_i, x_j) = \Bigl(\sum_{l=1}^{d} \bigl|x_{il} - x_{jl}\bigr|^k\Bigr)^{1/k}
\tag{2.5}
\]

The L2 norm, commonly known as Euclidean distance, is frequently used for low-
dimensional vector data. Its popularity is due to a number of factors:

• Intuitive simplicity: the $L_2$ norm corresponds to straight-line distance between points
in Euclidean space;

• Invariance to rotation or translation in feature space;

• Mathematical metric properties: non-negativity ($L_2(x_i, x_j) \geq 0$), reflexivity ($L_2(x_i, x_j) = 0$
iff $x_i = x_j$), symmetry ($L_2(x_i, x_j) = L_2(x_j, x_i)$), and triangle inequality ($L_2(x_i, x_j) +
L_2(x_j, x_k) \geq L_2(x_i, x_k)$), that allow using it in many algorithms that rely on metric
assumptions.
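For vector data, the Minkowski distances of Equation (2.5) are one-liners; the brief NumPy sketch below (NumPy being an assumption of this example) shows the L1 and L2 cases.

```python
import numpy as np

def minkowski(x, y, k=2):
    """L_k distance of Eq. (2.5); k=1 is Manhattan distance, k=2 is Euclidean."""
    return float(np.sum(np.abs(x - y) ** k) ** (1.0 / k))

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(minkowski(x, y, k=1), minkowski(x, y, k=2))   # 7.0 and 5.0
```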

If distance is computed among points of a given dataset, Mahalanobis distance is an
extension of Euclidean distance that takes into account the data mean as well as variance
of each dimension and correlations between the different dimensions, which are estimated
from the dataset. Given a set of observation vectors $\{x_1, \ldots, x_n\}$, Mahalanobis distance is
defined as:

\[
d_{Mah}(x_i, x_j) = \bigl((x_i - x_j)^T \Sigma^{-1} (x_i - x_j)\bigr)^{1/2}
\tag{2.6}
\]

where $\Sigma^{-1}$ is the inverse of the covariance matrix $\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)^T$, and
$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the data mean.
Essentially, Mahalanobis distance attempts to give each dimension equal weight
when computing distance by scaling its contribution in inverse proportion to its variance, while tak-
ing into account co-variances between the dimensions.
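A NumPy sketch of Equation (2.6), estimating the covariance from the dataset as described above; the use of a pseudo-inverse to guard against singular covariance matrices is an extra assumption of this example.

```python
import numpy as np

def mahalanobis(xi, xj, X):
    """Mahalanobis distance (Eq. 2.6) with covariance estimated from dataset X."""
    sigma = np.cov(X, rowvar=False, bias=True)   # rows of X are observations
    sigma_inv = np.linalg.pinv(sigma)            # pseudo-inverse for robustness
    diff = xi - xj
    return float(np.sqrt(diff @ sigma_inv @ diff))

X = np.random.RandomState(0).randn(100, 3) * np.array([1.0, 5.0, 0.5])
print(mahalanobis(X[0], X[1], X))
```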

Cosine Similarity

Minkowski metrics including Euclidean distance suffer from the curse of dimensionality
when they are applied to high-dimensional data (Friedman, 1997). As the dimensionality
of the Euclidean space increases, sparsity of observations increases exponentially with the
number of dimensions, which leads to observations becoming equidistant in terms of Eu-
clidean distance. Cosine similarity, or normalized dot product, has been widely used as an
alternative similarity function for high-dimensional data (Duda et al., 2001):

\[
\mathrm{sim}_{cos}(x, y) = \frac{x^T y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{d} x_i\, y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}
\tag{2.7}
\]

If applied to normalized vectors, cosine similarity obeys metric properties when
converted to distance by subtracting it from 1. In general, however, it is not a metric in the
mathematical sense, and it is not invariant to translations and linear transformations.

Information-theoretic Measures

In certain domains, data can be described by probability distributions, e.g., text docu-
ments can be represented as probability distributions over words generated by a multinomial
model (Pereira, Tishby, & Lee, 1993). Kullback-Leibler (KL) divergence, also known as
relative entropy, is a widely used distance measure for such data:

\[
d_{KL}(x_i, x_j) = \sum_{m=1}^{d} x_{im} \log \frac{x_{im}}{x_{jm}}
\tag{2.8}
\]

where $x_i$ and $x_j$ are instances described by probability distributions over $d$ events: $\sum_{m=1}^{d} x_{im} =
\sum_{m=1}^{d} x_{jm} = 1$. Note that KL divergence is not symmetric: $d_{KL}(x_i, x_j) \neq d_{KL}(x_j, x_i)$ for any
$x_i \neq x_j$. In domains where a symmetrical distance function is needed, Jensen-Shannon di-
vergence, also known as KL divergence to the mean, is used:

\[
d_{JS}(x_i, x_j) = \frac{1}{2}\Bigl(d_{KL}\bigl(x_i, \tfrac{x_i + x_j}{2}\bigr) + d_{KL}\bigl(x_j, \tfrac{x_i + x_j}{2}\bigr)\Bigr)
\tag{2.9}
\]

Kullback-Leibler divergence is widely used in information theory (Cover & Thomas,
1991), where it is interpreted as the expected extra length of a message sampled from dis-
tribution $x_i$ encoded using a coding scheme that is optimal for distribution $x_j$.
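Equations (2.8) and (2.9) can be sketched directly for discrete distributions; the small smoothing constant below is an assumption added so that zero probabilities do not produce infinite values.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence (Eq. 2.8) between discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence (Eq. 2.9): KL divergence to the mean."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2.0
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric
print(js_divergence(p, q))                        # symmetric
```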

2.2 Record Linkage

As defined in Chapter 1, the goal of record linkage is identifying instances that differ syn-
tactically yet refer to the same underlying object. Matching of coreferent bibliographic
citations and identifying multiple variants of a person’s name or address in medical, cus-
tomer, or census databases are instances of this problem. A number of researchers in dif-
ferent communities have studied variants of record linkage tasks: after being introduced
in the context of matching medical records by Newcombe, Kennedy, Axford, and James
(1959), it was investigated under a number of names including merge/purge (Hernández &
Stolfo, 1995), heterogeneous database integration (Cohen, 1998), hardening soft databases
(Cohen, Kautz, & McAllester, 2000), reference matching (McCallum, Nigam, & Ungar,
2000), de-duplication (Sarawagi & Bhamidipaty, 2002; Bhattacharya & Getoor, 2004),
fuzzy duplicate elimination (Ananthakrishna, Chaudhuri, & Ganti, 2002; Chaudhuri, Gan-
jam, Ganti, & Motwani, 2003), entity-name clustering and matching (Cohen & Richman,
2002), identity uncertainty (Pasula, Marthi, Milch, Russell, & Shpitser, 2003; McCallum &
Wellner, 2004a), object consolidation (Michalowski, Thakkar, & Knoblock, 2003), robust
reading (Li, Morie, & Roth, 2004), reference reconciliation (Dong, Halevy, & Madhavan,
2005), object identification (Singla & Domingos, 2005), and entity resolution (Bhattacharya
& Getoor, 2006).
The seminal work of Fellegi and Sunter (1969) described several key ideas that
have been used or re-discovered by most record linkage researchers, including combining
similarity estimates across multiple fields, using blocking to reduce the set of candidate
record pairs under consideration, and using a similarity threshold to separate the corefer-
ent and non-coreferent object pairs. Fellegi and Sunter (1969) considered record linkage
in an unsupervised setting where no examples of coreferent and non-coreferent pairs are
available. In this setting, several methods have been proposed that rely on learning prob-
abilistic models with latent variables that encode the matching decisions (Winkler, 1993;
Ravikumar & Cohen, 2004). In the past decade, a number of researchers have considered
record linkage settings where pairwise supervision is available, allowing the application of
such classification techniques as decision trees (Elfeky, Elmagarmid, & Verykios, 2002; Te-
jada, Knoblock, & Minton, 2001), logistic regression (Cohen & Richman, 2002), Bayesian
networks (Winkler, 2002), and Support Vector Machines (Bilenko & Mooney, 2003a; Co-
hen et al., 2003a; Minton, Nanjo, Knoblock, Michalowski, & Michelson, 2005) to obtain
record-level distance functions that combine the field-level similarities. These methods treat
individual field similarities as features and train a classifier to distinguish between coref-
erent and non-coreferent records, using the confidence of the classifier’s prediction as the
similarity estimate.
The majority of solutions for record linkage treat it as a modular problem that is
solved in multiple stages. In the first stage, blocking is performed to obtain a set of candidate
record pairs to be investigated for co-reference, since the computational cost of computing
pairwise similarities between all pairs of records in a large database is often prohibitive; see
Section 2.4 for discussion of blocking. In the second stage, similarity is computed between
individual fields of candidate record pairs. In the final linkage stage, similarity is computed
between candidate pairs, and highly similar records are labeled as matches that describe
the same entity. Linkage can be performed either via pairwise inference where decisions
for the different candidate pairs are made independently, or via collective inference over all
candidate record pairs (Pasula et al., 2003; Wellner, McCallum, Peng, & Hay, 2004; Singla
& Domingos, 2005).

2.3 Clustering

Clustering is typically defined as the problem of partitioning a dataset into disjoint groups so
that observations belonging to the same cluster are similar, while observations belonging to
different clusters are dissimilar. Clustering has been widely studied for several decades, and
a great variety of algorithms for clustering have been proposed (Jain et al., 1999). Several
large groups of clustering algorithms can be distinguished that include hierarchical cluster-
ing methods that attempt to create a hierarchy of data partitions (Kaufman & Rousseeuw,
1990), partitional clustering methods that separate instances into disjoint clusters (Karypis
& Kumar, 1998; Shi & Malik, 2000; Strehl, 2002; Banerjee, Merugu, Dhilon, & Ghosh,
2005b), and overlapping clustering techniques that allow instances to belong to multiple
clusters (Segal, Battle, & Koller, 2003; Banerjee, Krumpelman, Basu, Mooney, & Ghosh,
2005c).
Traditionally, clustering has been viewed as a form of unsupervised learning, since
no class labels for the data are provided. In semi-supervised clustering, supervision from
a user is incorporated in the form of class labels or pairwise constraints on objects which
can be used to initialize clusters, guide the clustering process, and improve the clustering
algorithm parameters (Basu, 2005).
Work presented in Chapter 4 is based on K-Means, a widely used clustering algo-
rithm that performs iterative relocation of cluster centroids to locally minimize the total dis-
tance between the data points and the centroids. Given a set of data points $\mathcal{X} = \{x_i\}_{i=1}^{N}$, $x_i \in
\mathbb{R}^m$, let $\{\mu_h\}_{h=1}^{K}$ represent the $K$ cluster centroids, and $y_i$ be the cluster assignment of a point
$x_i$, where $y_i \in \mathcal{Y}$ and $\mathcal{Y} = \{1, \ldots, K\}$. The Euclidean K-Means algorithm creates $K$ disjoint
subsets of $\mathcal{X}$, $\{\mathcal{X}_l\}_{l=1}^{K}$, whose union is $\mathcal{X}$, so that the following objective function is (locally)
minimized:

\[
J_{kmeans}(\mathcal{X}, \mathcal{Y}) = \sum_{x_i \in \mathcal{X}} \|x_i - \mu_{y_i}\|^2
\tag{2.10}
\]

Intuitively, this objective function measures the tightness of each cluster as the sum
of squared Euclidean distances between every point in the cluster and the centroid. Fig-
ure 2.1 presents the pseudocode for the algorithm.

Algorithm: K-MEANS

Input: Set of data points X = {x_i}_{i=1}^{n}, x_i ∈ R^d, number of clusters K
Output: Disjoint K-partitioning {X_h}_{h=1}^{K} of X such that objective function
        J_kmeans is optimized
Method:
1. Initialize clusters: initial centroids {µ_h^(0)}_{h=1}^{K} are selected at random
2. Repeat until convergence:
   2a. assign_cluster: assign each data point x_i to the cluster h* (i.e., set X_{h*}^{(t+1)}),
       where h* = argmin_h ||x_i − µ_h^(t)||^2
   2b. estimate_means: µ_h^{(t+1)} ← (1 / |X_h^{(t+1)}|) Σ_{x_i ∈ X_h^{(t+1)}} x_i
   2c. t ← t + 1

Figure 2.1: The K-Means algorithm
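For reference, the pseudocode of Figure 2.1 translates almost line-for-line into NumPy; the convergence test on centroid movement and the empty-cluster handling below are reasonable choices assumed for this sketch rather than prescribed by the figure.

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Euclidean K-Means (Figure 2.1): alternate cluster assignment and
    centroid re-estimation until the centroids stop moving."""
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]          # step 1
    for _ in range(max_iter):
        # step 2a: assign each point to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 2b: re-estimate each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == h].mean(axis=0)
                                  if np.any(labels == h) else centroids[h]
                                  for h in range(K)])
        if np.linalg.norm(new_centroids - centroids) < tol:      # convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.RandomState(1).randn(50, 2) + c
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, K=3)
```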

Recently, it has been shown that K-Means-style algorithms can be derived based on
a number of dissimilarity functions including directional measures such as cosine similar-
ity (Banerjee, Dhillon, Ghosh, & Sra, 2005a) and a large class of functions known as Breg-
man divergences, which include squared Euclidean distance and KL-divergence (Banerjee
et al., 2005b).

2.4 Blocking in Record Linkage and Clustering

Because the number of similarity computations grows quadratically with the size of the
input dataset, scaling up to large datasets is problematic for tasks that require similarities
between all instance pairs. Additionally, even for small datasets, estimation of the full
similarity matrix can be difficult if computationally costly similarity functions, distance
metrics or kernels are used. At the same time, in many tasks, the majority of similarity
computations are unnecessary because most instance pairs are highly dissimilar and have
no influence on the task output. Avoiding the unnecessary computations results in a sparse
similarity matrix, and a number of algorithms become practical for large datasets when
provided with sparse similarity matrices, e.g. the collective inference algorithms for record
linkage (Pasula et al., 2003; McCallum & Wellner, 2004b; Singla & Domingos, 2005).
Blocking methods efficiently select a subset of instance pairs for subsequent sim-
ilarity computation, ignoring the remaining pairs as highly dissimilar and therefore irrele-
vant. A number of blocking algorithms have been proposed by researchers in recent years,
all of which rely on a manually tuned set of predicates or parameters (Fellegi & Sunter,
1969; Kelley, 1985; Jaro, 1989; Hernández & Stolfo, 1995; McCallum et al., 2000; Baxter,
Christen, & Churches, 2003; Chaudhuri et al., 2003; Jin, Li, & Mehrotra, 2003; Winkler,
2005).
Key-based blocking methods form blocks by applying some unary predicate to each
record and assigning all records that return the same value (key) to the same block (Kelley,
1985; Jaro, 1989; Winkler, 2005). For example, such predicates as Same Zipcode or Same
3-character Prefix of Surname could be used to perform key-based blocking in a name-
address database, resulting in blocks that contain records with the same value of the Zipcode
attribute and the same first three characters of the Surname attribute, respectively.
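A key-based blocking pass of this kind amounts to grouping records by the value of a blocking predicate and emitting candidate pairs within each group. The sketch below uses hypothetical field names (Zipcode, Surname) purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

def key_based_blocking(records, key_fn):
    """Group records by blocking key and yield candidate pairs within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"Surname": "Johnson", "Zipcode": "78712"},
    {"Surname": "Johnsen", "Zipcode": "78712"},
    {"Surname": "Smith",   "Zipcode": "10001"},
]
same_zip    = list(key_based_blocking(records, key_fn=lambda r: r["Zipcode"]))
same_prefix = list(key_based_blocking(records, key_fn=lambda r: r["Surname"][:3]))
```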
Another popular blocking technique is the sorted neighborhood method proposed
by Hernández and Stolfo (1995). This method forms blocks by sorting the records in a
database using lexicographic criteria and selecting all records that lie within a window of
fixed size. Multiple sorting passes are performed to increase coverage.
The canopies blocking algorithm of McCallum et al. (2000) relies on a similar-
ity function that allows efficient retrieval of all records within a certain distance threshold
from a randomly chosen record. Blocks are formed by randomly selecting a “canopy cen-
ter” record and retrieving all records that are similar to the chosen record within a certain
(“loose”) threshold. Records that are closer than a “tight” threshold are removed from the
set of possible canopy centers, which is initialized with all records in the dataset. This
process is repeated iteratively, resulting in formation of blocks selected around the canopy
centers. Inverted index-based similarity functions such as Jaccard or TF-IDF cosine sim-
ilarity are typically used with the canopies method as they allow fast selection of nearest
neighbors based on co-occurring tokens. Inverted indices are also used in the blocking
method of Chaudhuri et al. (2003), who proposed using indices based on character n-grams
for efficient selection of candidate record pairs.
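The canopies procedure described above can be sketched as follows; using plain pairwise distance computations and a toy token-overlap distance here is an assumption for readability, whereas the actual method relies on an inverted index for fast retrieval of neighbors under the loose threshold.

```python
import random
from itertools import combinations

def canopy_blocking(records, distance, loose, tight, seed=0):
    """Canopy blocking: pick a random canopy center, put every record within
    the 'loose' threshold into its canopy, and remove records within the
    'tight' threshold from the set of possible centers (tight <= loose)."""
    rng = random.Random(seed)
    candidates = set(range(len(records)))
    canopies = []
    while candidates:
        center = rng.choice(sorted(candidates))
        canopy = [i for i in range(len(records))
                  if distance(records[center], records[i]) <= loose]
        canopies.append(canopy)
        candidates -= {i for i in canopy
                       if distance(records[center], records[i]) <= tight}
    # candidate pairs are all pairs that co-occur in some canopy
    return {tuple(sorted(p)) for c in canopies for p in combinations(c, 2)}

def jaccard_distance(a, b):
    A, B = set(a.split()), set(b.split())
    return 1.0 - len(A & B) / len(A | B)

recs = ["12 8th St.", "12 8th Street", "500 W 23rd Street", "500 West 23rd St."]
print(canopy_blocking(recs, jaccard_distance, loose=0.8, tight=0.4))
```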
Recently, Jin et al. (2003) proposed a blocking method based on mapping database
records to a low-dimensional metric space based on string values of individual attributes.
While this method can be used with arbitrary similarity functions, it is computationally
expensive compared to the sorting and index-based methods.

2.5 Active Learning

When training examples are selected for a learning task at random, they may be suboptimal
in the sense that they do not lead to a maximally attainable improvement in performance.
Active learning methods attempt to identify those examples that lead to maximal accuracy
improvements when added to the training set (Lewis & Catlett, 1994; Cohn, Ghahramani,
& Jordan, 1996; Tong, 2001). During each round of active learning, the example that is
estimated to improve performance the most when added to the training set is identified and
labeled. The system is then re-trained on the training set including the newly added labeled
example.

Three broad classes of active learning methods exist: (1) uncertainty sampling tech-
niques (Lewis & Catlett, 1994) attempt to identify examples for which the learning algo-
rithm is least certain in its prediction; (2) query-by-committee methods (Seung, Opper, &
Sompolinsky, 1992) utilize a committee of learners and use disagreement between commit-
tee members as a measure of training examples’ informativeness; (3) estimation of error re-
duction techniques (Lindenbaum, Markovitch, & Rusakov, 1999; Roy & McCallum, 2001)
select examples which, when labeled, lead to greatest reduction in error by minimizing
prediction variance.
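As a concrete instance of the first class, uncertainty sampling for a binary probabilistic classifier simply queries the unlabeled example whose predicted probability is closest to 0.5. The sketch below assumes a model exposing a predict_proba-style interface, which mirrors common toolkits rather than anything specified in this chapter.

```python
import numpy as np

def uncertainty_sample(model, unlabeled_X):
    """Index of the unlabeled example the model is least certain about."""
    probs = model.predict_proba(unlabeled_X)[:, 1]   # P(class = 1) per example
    return int(np.argmin(np.abs(probs - 0.5)))       # closest to the boundary

# Typical loop (sketch): idx = uncertainty_sample(clf, pool); obtain a label
# for pool[idx], add it to the training set, retrain clf, and repeat.
```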
Active learning was shown to be a successful strategy for improving performance
using small amounts of training data on a number of tasks, including classification (Cohn
et al., 1996), clustering (Hofmann & Buhmann, 1998; Basu, Banerjee, & Mooney, 2004),
and record linkage (Sarawagi & Bhamidipaty, 2002; Tejada, Knoblock, & Minton, 2002).

Chapter 3

Learnable Similarity Functions in


Record Linkage

In this chapter, we describe the use of learnable similarity functions in record linkage, where
they improve the accuracy of distance estimates in two tasks: computing similarity of string
values between individual record fields, and combining such similarities across multiple
fields to obtain overall record similarity. At the field level, two adaptive variants of edit
distance are described that allow learning the costs of string transformations to reflect their
relative importance in a particular domain. At the record level, we employ Support Vector
Machines, a powerful discriminative classifier, to distinguish between pairs of similar and
dissimilar records. We also propose two strategies for automatically selecting informative
pairwise training examples. These strategies do not require the human effort needed by
active learning methods, yet vastly outperform random pair selection.

3.1 Learnable Similarity Functions for Strings

In typical linkage applications, individual record fields are represented by short string val-
ues whose length does not exceed several dozen characters or tokens. For such strings,
differences between coreferent values frequently arise due to local string transformations at
either character or token level, e.g., misspellings, abbreviations, insertions, and deletions.
To capture such differences, similarity functions must estimate the total cost associated with
performing these transformations on string values.
As described in Section 2.1.1, edit distance estimates string dissimilarity by com-
puting the cost of a minimal sequence of edit operations required to transform one string
into another. However, the importance of different edit operations varies from domain to
domain. For example, a digit substitution makes a big difference in a street address since
it effectively changes the house or apartment number, while a single letter substitution is
semantically insignificant because it is more likely to be caused by a typographic error or
an abbreviation. For token-level edit distance, some tokens are unimportant and therefore
their insertion cost should be low, e.g., for token St. in street addresses.
Ability to vary the gap cost is a significant advantage of affine-gap edit distance over
Levenshtein edit distance, which penalizes all insertions independently (Bilenko & Mooney,
2002). Frequency and length of gaps in string alignments also vary from domain to domain.
For example, during linkage of coreferent bibliographic citations, gaps are common for the
author field where names are often abbreviated, yet rare for the title field which is typically
unchanged between citations to the same paper.
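For reference, here is a minimal sketch of the deterministic affine-gap edit distance recurrence (Gotoh-style, with a match/substitution matrix and one gap matrix per string) that the learnable variants below adapt; the particular costs are placeholders rather than the parameters used later in the experiments.

def affine_gap_distance(x, y, match=-1.0, subst=1.0, gap_open=2.0, gap_extend=0.5):
    """Deterministic affine-gap edit distance: minimal total cost of an alignment
    in which a gap of length L costs gap_open + (L - 1) * gap_extend."""
    INF = float("inf")
    n, m = len(x), len(y)
    # M[i][j]: best cost aligning x[:i], y[:j] ending in a match/substitution;
    # I1/I2: best cost ending in a gap aligned to characters of x or y, respectively.
    M  = [[INF] * (m + 1) for _ in range(n + 1)]
    I1 = [[INF] * (m + 1) for _ in range(n + 1)]
    I2 = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        I1[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        I2[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = match if x[i - 1] == y[j - 1] else subst
            M[i][j] = c + min(M[i - 1][j - 1], I1[i - 1][j - 1], I2[i - 1][j - 1])
            I1[i][j] = min(M[i - 1][j] + gap_open, I1[i - 1][j] + gap_extend)
            I2[i][j] = min(M[i][j - 1] + gap_open, I2[i][j - 1] + gap_extend)
    return min(M[n][m], I1[n][m], I2[n][m])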
Therefore, adapting affine-gap edit distance to a particular domain requires learning
the costs for different edit operations and the costs of gaps. In the following subsections,
we present two methods that perform such learning of edit distance parameters using a corpus of
coreferent string pairs from a given domain.

3.1.1 Learnable Edit Distance with Affine Gaps

The Pair HMM Model

We propose learning the costs of edit distance parameters using a three-state pair HMM
shown in Figure 3.1. It extends the one-state model used by Ristad and Yianilos (1998)
to learn parameters of Levenshtein edit distance, and is analogous to models proposed
in Durbin et al. (1998) for scoring alignments of biological sequences.

Figure 3.1: A generative model for edit distance with affine gaps (three states: M for matches and substitutions, I1 and I2 for gaps in either string)
For any pair of strings, the model can generate all possible alignments between them
as sequences of state transitions and edit operation emissions, where emissions correspond
to productions of elements of the two strings in parallel, including gaps. Each possible
alignment is associated with the probability of observing the corresponding sequence of
transitions and emissions.
The three states of the model generate gaps in the alignment in states I1 and I2 , and
generate matches and substitutions in state M. Transitions between state M and states I1
and I2 in the pair HMM correspond to starting a gap in the deterministic affine-gap edit dis-
tance model, while self-transitions in states I1 and I2 model gap extensions. Probabilities of
these transitions, σ and δ, correspond to gap opening and extension costs, while probabil-
ities µ, γM and γI correspond to the relative frequency of continued matching, gap ending,
and observing adjacent gaps (these transitions have no direct analog in the deterministic
model).1
1 Traditional edit distance algorithms as well as pair HMMs described by Durbin et al. (1998) also disallow
gaps in the two strings to be contiguous. This restriction corresponds to prohibiting transitions between states I1
and I2, but in the record linkage domain it is unnecessary since the two strings may have parallel non-matching regions.

Emissions in the pair HMM correspond to individual edit operations that generate
both strings in parallel. Given A* = A ∪ {ε}, the symbol alphabet extended with the special
“gap” character ε, the full set of edit operations is E = EM ∪ EI1 ∪ EI2, where EM = {⟨a, b⟩ :
a, b ∈ A} is the set of all substitution and matching operations, while EI1 = {⟨a, ε⟩ : a ∈ A}
and EI2 = {⟨ε, a⟩ : a ∈ A} are the insertions into the first and into the second string,
respectively. Each state associates its set of emissions with a probability distribution. Thus,
emission probabilities in the pair HMM, PM = {p(e) : e ∈ EM}, PI1 = {p(e) : e ∈ EI1}, and
PI2 = {p(e) : e ∈ EI2}, correspond to costs of individual edit operations in the deterministic
model. Edit operations with higher probabilities produce character pairs that are likely to
be aligned in a given domain, e.g., substitutions between the delimiter characters of phone
numbers or deletions of punctuation in addresses. For each state in the pair HMM, there is
an associated probability of starting or ending the string alignment in that state, corresponding
to the frequency of observing alignments with gaps at the beginning or at the end.
Because in record linkage applications the order of two strings is unimportant,
several parameters in the model are tied to make alignments symmetrical with respect
to the two strings. Tied parameters include probabilities of transitions entering and exiting
the insertion states: σ, γM, γI, and δ; emission probabilities for the insertion states:
p(⟨a, ε⟩) = p(⟨ε, a⟩); and emission probabilities for substitutions: p(⟨a, b⟩) = p(⟨b, a⟩).
Two methods can be used for computing edit distance using a trained pair HMM.
The Viterbi algorithm computes the highest-probability alignment of two strings, while the
forward (or backward) algorithm computes the total probability of observing all possible
alignments of the strings, which can be beneficial if several high-probability alignments
exist (Rabiner, 1989). If performed in log-space, the algorithms are analogous to the de-
terministic edit distance computation shown in Eq. (2.1), with the negative logarithms of
probabilities replacing the corresponding costs. The three matrices of the deterministic
affine-gap edit distance described in Section 2.1.1 correspond to the dynamic programming
matrices computed by the Viterbi, forward and backward algorithms. Each entry (i, j) in
the dynamic programming matrices for states M, I1, and I2 contains the forward, backward,
or Viterbi probability of aligning prefixes x[1:i] and y[1:j] and ending the transition sequence(s)
in the corresponding state.

Training

Given a training set of N coreferent string pairs D = {(x_i^(1), x_i^(2))}, the transition and emission
probabilities in the model can be learned using a variant of the Baum-Welch algorithm
outlined in Figure 3.2, which is an Expectation-Maximization procedure for learning parameters
of HMMs (Rabiner, 1989); Ristad and Yianilos (1998) used an analogous algorithm
for training their one-state model for Levenshtein distance. The training procedure
iterates between two steps, expectation (E-step) and maximization (M-step), converging to a
(local) maximum of the log-likelihood of the training data L = Σ_{i=1..N} log p_Θ(x_i^(1), x_i^(2)), where
Θ = {µ, δ, σ, γM, γI, PM, PI1, PI2} is the set of emission and transition probabilities being
learned. In the E-step, the forward and backward matrices are computed for every training
pair to accumulate the expected number of transitions and emissions given current parameter
values (Rabiner, 1989). In the M-step, parameters Θ are updated by re-normalizing the
expectations of transition and emission probabilities accumulated in the E-step.
Once trained, the model can be used for estimating similarity between pairs of
strings by using the forward algorithm to compute the probability of generating all pos-
sible string alignments. To prevent numerical underflow for long strings, this computation
should be performed in log-space.
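The sketch below shows a log-space forward recursion for a three-state pair HMM and applies the length correction discussed next; the dictionary-based parameterization, the uniform treatment of alignment start and end, and the function names are simplifications and assumptions rather than the exact model of Figure 3.1.

import math
from itertools import product

def logsumexp(vals):
    """Stable log of a sum of exponentials, used to add probabilities in log-space."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals)) if m > -math.inf else -math.inf

def phmm_distance(x, y, log_trans, log_emit_M, log_emit_I1, log_emit_I2):
    """Forward algorithm for a 3-state pair HMM with states M, I1 ('1'), I2 ('2').
    F[s][i][j] holds the log-probability of generating prefixes x[:i], y[:j] and being
    in state s; log_trans maps (source, target) state pairs to log transition
    probabilities, and the emission dictionaries map character pairs (with None
    standing for the gap symbol) to log emission probabilities."""
    NEG = -math.inf
    n, m = len(x), len(y)
    F = {s: [[NEG] * (m + 1) for _ in range(n + 1)] for s in "M12"}
    F["M"][0][0] = 0.0  # start in state M with probability 1 -- a simplification
    for i, j in product(range(n + 1), range(m + 1)):
        if i > 0 and j > 0:   # emit the pair (x[i-1], y[j-1]) in state M
            F["M"][i][j] = log_emit_M[(x[i - 1], y[j - 1])] + logsumexp(
                [F[s][i - 1][j - 1] + log_trans[(s, "M")] for s in "M12"])
        if i > 0:             # emit (x[i-1], gap) in state I1
            F["1"][i][j] = log_emit_I1[(x[i - 1], None)] + logsumexp(
                [F[s][i - 1][j] + log_trans[(s, "1")] for s in "M12"])
        if j > 0:             # emit (gap, y[j-1]) in state I2
            F["2"][i][j] = log_emit_I2[(None, y[j - 1])] + logsumexp(
                [F[s][i][j - 1] + log_trans[(s, "2")] for s in "M12"])
    log_p = logsumexp([F[s][n][m] for s in "M12"])   # sum over all alignments
    return -log_p / (n + m)                          # length-corrected distance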
Modeling edit distances with pair HMMs has an intrinsic drawback: because the probability
of generating a string pair decreases with string length, alignments of longer strings
have lower probabilities than alignments of shorter strings. This problem is alleviated by
using the length-corrected distance d(x, y) = -log p(x, y)^(1/(|x|+|y|)), which is equivalent to
scaling the deterministic edit distance by the sum of string lengths.
Algorithm: AFFINE-GAP PAIR-HMM EM TRAINING
Input: a corpus of coreferent string pairs D = {(x_i^(1), x_i^(2))}
Output: a set of pair HMM parameters Θ = {µ, δ, σ, γM, γI, PM, PI1, PI2}
Method:
  until convergence
    E-STEP: for each (x_i^(1), x_i^(2)) ∈ D
      (M^(f), I1^(f), I2^(f)) = FORWARD(x_i^(1), x_i^(2))
      (M^(b), I1^(b), I2^(b)) = BACKWARD(x_i^(1), x_i^(2))
      for j = 1, ..., |x_i^(1)|
        for k = 1, ..., |x_i^(2)|
          accumulate E[µ], E[σ], E[δ], E[γM], E[γI],
            E[p(⟨x_i^(1)[j], ε⟩)], E[p(⟨ε, x_i^(2)[k]⟩)], E[p(⟨x_i^(1)[j], x_i^(2)[k]⟩)]
    M-STEP:
      µ = E[µ] / (E[µ] + 2E[σ])
      σ = E[σ] / (E[µ] + 2E[σ])
      δ = E[δ] / (E[δ] + E[γM] + E[γI])
      γM = E[γM] / (E[δ] + E[γM] + E[γI])
      γI = E[γI] / (E[δ] + E[γM] + E[γI])
      for each p(e) ∈ PM:  p(e) = E[p(e)] / Σ_{p(e_M) ∈ PM} E[p(e_M)]
      for each p(e) ∈ PI1: p(e) = E[p(e)] / Σ_{p(e_I1) ∈ PI1} E[p(e_I1)]
      for each p(e) ∈ PI2: p(e) = E[p(e)] / Σ_{p(e_I2) ∈ PI2} E[p(e_I2)]

Figure 3.2: Training algorithm for generative string distance with affine gaps

Furthermore, the standard pair HMM must normalize the probability of the exact match operations against the
probabilities of substitutions, and normalize µ, the probability of the self-transition in state M,
against 2σ, the probabilities of starting a gap in either string. These normalizations imply
that even a perfect matching of a string to itself has a less-than-1 probability, which is
counter-intuitive, yet unavoidable within the pair HMM generative framework. However,
we have found that setting the costs (negative log-probabilities) of M-state self-transitions and match
emissions to 0 leads to improved empirical results, although the pair HMM model does not
provide a principled way of encoding this intuition.

Experimental Evaluation

We evaluated the proposed model for learnable affine-gap edit distance on four datasets.
Face, Constraint, Reasoning, and Reinforcement are single-field datasets containing unsegmented
citations to computer science papers in the corresponding areas from the Citeseer
digital library (Giles et al., 1998). Face contains 349 citations to 242 papers, Constraint
contains 295 citations to 199 papers, Reasoning contains 514 citations to 196 unique papers,
and Reinforcement contains 406 citations to 148 papers. Figure
3.3 presents sample coreferent records from one of the datasets.
Every dataset was randomly split into 2 folds for cross-validation during each exper-
imental run. A larger number of folds is impractical since it would result in fewer coreferent
pairs per fold. To create the folds, coreferent records were grouped together, and the result-
ing clusters were randomly assigned to the folds. All results are reported over 10 random
splits, where for each split the two folds were used alternately for training and testing.
During each trial, learnable edit distance is trained as described above using ran-
domly sampled pairs of coreferent strings from the training fold. After training, edit dis-
tance is computed between all pairs of strings in the testing fold. Then, pairs are iteratively
labeled as coreferent in order of decreasing similarity. After labeling of each successive
string pair, accuracy is evaluated using pairwise precision and recall, which are computed
as follows:

precision = (# of correct coreferent pairs) / (# of labeled pairs)
recall = (# of correct coreferent pairs) / (# of true coreferent pairs)

Figure 3.3: Sample coreferent records from the Reasoning dataset

L. P. Kaelbling. An architecture for intelligent reactive systems. In Reasoning
About Actions and Plans: Proceedings of the 1986 Workshop. Morgan Kaufmann, 1986

Kaelbling, L. P., 1987. An architecture for intelligent reactive systems. In
M. P. Georgeff & A. L. Lansky, eds., Reasoning about Actions and Plans, Morgan
Kaufmann, Los Altos, CA, 395-410

We also compute mean average precision (MAP), defined as follows:

MAP = (1/n) Σ_{i=1}^{n} precision(i)          (3.1)

where n is the number of true coreferent pairs in the dataset, and precision(i) is the pair-
wise precision computed after correctly labeling i-th coreferent pair. These measures eval-
uate how well a similarity function distinguishes between coreferent and non-coreferent
string pairs: a perfect string distance would assign higher similarity to all coreferent pairs
than to any non-coreferent pair, achieving 1.0 on all metrics. On the precision-recall curve,
precision at any recall level corresponds to the fraction of pairs above a certain similarity
threshold that are coreferent, while lowering the threshold results in progressive identifica-
tion of more truly coreferent pairs. For averaging the results across multiple trials, precision
is interpolated at fixed recall levels following the standard methodology from information
retrieval (Baeza-Yates & Ribeiro-Neto, 1999).
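The following sketch computes these quantities from similarity scores produced for all evaluated pairs; the input format (a list of (similarity, is_coreferent) tuples) and the function name are assumptions made for illustration.

def pairwise_metrics(scored_pairs):
    """Label pairs as coreferent in order of decreasing similarity; record pairwise
    precision each time a true coreferent pair is reached, and return the mean of
    those precision values (MAP, Eq. 3.1) together with (recall, precision) points."""
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    total_coreferent = sum(1 for _, is_dup in ranked if is_dup)
    correct, precisions, curve = 0, [], []
    for labeled, (_, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            correct += 1
            precisions.append(correct / labeled)
            curve.append((correct / total_coreferent, correct / labeled))
    mean_avg_precision = sum(precisions) / len(precisions) if precisions else 0.0
    return mean_avg_precision, curve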
To evaluate the usefulness of adapting affine-gap string edit distance to a specific
domain, we compare the pair HMM-based learnable affine-gap edit distance with its fixed-
cost equivalent on the task of identifying equivalent field values, as well as with classic
Levenshtein distance. The following results are presented:

• PHMM LEARNABLE ED: learnable affine-gap edit distance based on characters,
trained as described above using the EM algorithm shown in Figure 3.2;

• UNLEARNED ED: fixed-cost affine-gap edit distance (Gusfield, 1997) with a substitution
cost of 5, gap opening cost of 5, gap extension cost of 1, and match cost of -5,
which are the parameters previously suggested by Monge and Elkan (1996);

• LEVENSHTEIN: classic Levenshtein distance described in Section 2.1.1.

Figure 3.4: Mean average precision values for field-level record linkage

Distance metric                 Face    Constraint  Reasoning  Reinforcement
pHMM Learnable edit distance    0.960   0.968       0.955      0.961
Unlearned edit distance         0.956   0.956       0.946      0.952
Levenshtein edit distance       0.901   0.874       0.892      0.899

Precision-recall curves for the four datasets are shown in Figures 3.5-3.8. These
results are summarized in Figure 3.4. Each entry contains the mean average precision over
the 20 evaluated folds. Improvements of the learnable edit distance over the fixed-cost vari-
ant are significant at the 0.05 level using a two-tailed paired t-test for all datasets. These
results demonstrate that learned affine-gap edit distance outperforms its deterministic equiv-
alent in identifying coreferent values in individual fields, which in turn is significantly more
accurate than Levenshtein distance.

3.1.2 Learnable Segmented Edit Distance

Affine-gap edit distance and the corresponding pair HMM model described in Section 3.1.1
treat strings as homogeneous entities. In domains where strings are composed of multiple
fields, such as bibliographic citations, ignoring this internal structure discards the differences
between the edit distance parameters appropriate for each field: some string
transformations may be frequent in one field but rare in another. For affine-gap edit
distance derived from a pair HMM, rarity of certain operations (e.g., rarity of gaps for
title values) corresponds to a lower value of σ, the probability of the gap-opening transition.
Training an individual pair HMM distance for every field allows making such distinctions.
Therefore, segmenting strings into individual fields can improve the accuracy of similarity
computations, and in domains where accurate segmentation is available, or original data is
described by multiple fields, combining multiple field-specific distances was shown to be
effective for the record linkage task, as results in Section 3.2 will show.

Figure 3.5: Field linkage results for the Face dataset

Figure 3.6: Field linkage results for the Constraint dataset

Figure 3.7: Field linkage results for the Reasoning dataset

Figure 3.8: Field linkage results for the Reinforcement dataset

However, in domains where supervision in the form of segmented strings for training
an information extraction system is limited, field values cannot be extracted reliably.
Segmentation mistakes lead to erroneous field-level similarity estimates, and combining
them may produce worse results than using a single string similarity function.

Segmented Pair HMM

We propose a new type of pair HMM, the segmented pair HMM (spHMM), that overcomes
the above limitations by combining segmentation and edit distance computation within a
single framework. A sample spHMM is shown in Figure 3.9. It can be viewed as an
interconnected sequence of pair HMMs, where the emission and transition probabilities
within each pair HMM are trained for a particular field, while probabilities of transitions
between the pair HMMs capture the field structure of the strings. The model generates
string matchings by emitting alignments of individual fields in corresponding components,
transitioning between them at segment boundaries in both strings simultaneously.

Figure 3.9: Segmented pair HMM (k field-specific pair HMMs, each with states M^(i), I1^(i), I2^(i), connected by cross-field transitions τ)

As in regular pair HMMs, edit distance between two strings x and y in the spHMM
is computed as the negative logarithm of the probability of generating the string pair over
all possible alignments, d(x, y) = -log p(x, y), which can be computed using the standard
forward (or backward) algorithm. This allows aggregating alignment probabilities over the
different possible segmentations of the two strings, which is not achievable if segmentation
and matching are performed in isolation. The obtained distance value is length-corrected to
avoid penalizing longer strings: d(x, y) = -log p(x, y)^(1/(|x|+|y|)).

Training

As with the learnable affine-gap edit distance without segmentation described in Section 3.1.1,
transition and emission probabilities of the spHMM are learned using a training set
D = {(x_i^(1), x_i^(2))}_{i=1..N} consisting of N string pairs. Training is performed using an extension
of the Expectation-Maximization (EM) procedure shown in Figure 3.2 that learns an
extended set of emission and transition probabilities for all k pair HMMs in the spHMM:
Θ = {µ^(i), δ^(i), σ^(i), γM^(i), γI^(i), τ_1^(i), ..., τ_k^(i), PM^(i), PI1^(i), PI2^(i)}, i = 1, ..., k. Probabilities of transitions between
pair HMMs, {τ_1^(i), ..., τ_k^(i)}, i = 1, ..., k, are learned by decomposing them into transitions outgoing
from individual states M^(i), I1^(i), I2^(i) into any other state outside the i-th component, and
tying the parameters over all such transitions for any two pair HMMs.
Training can incorporate any combination of supervision used for segmentation and
string similarity computation tasks. There are three types of training data that may be
available:
(i) pairs of coreferent segmented strings, e.g.
author title other year
M.J. Kearns. The Computational Complexity of Machine Learning. MIT Press, Cambridge, MA (1990).
Michael Kearns. The Computational Complexity of Machine Learning. MIT Press, 1990.
author title other year

(ii) pairs of coreferent unsegmented strings, e.g.


M. Kearns, R. Schapire and L. Sellie, Towards efficient agnostic learning. COLT, 1992
Kearns, M., Schapire, R., and Sellie, L. (1992) Toward efficient agnostic learning. In Proc. 5th Ann. Workshop on
Computational Learning Theory. Pittsburgh, PA: Morgan Kaufmann.

(iii) individual segmented strings, e.g.


author year title venue other
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121 (2), 256-285

Each individual segmented string x_i is converted to a pairwise training example
by creating a coreferent training pair (x_i, x_i), which allows accumulating expectations of
emissions for characters or tokens in that string along with accumulating the expectations
of cross-component transitions. Forward and backward procedures are modified for seg-
mented string pairs so that expectations are only accumulated for component pair HMMs
that produce alignments for the corresponding fields, while for unsegmented string pairs,
expectations are accumulated over all component pair HMMs, thus considering alignments
over all possible segmentations of the two strings.
Because the proposed model is designed for settings where supervision is limited,
and the number of parameters in the above model can be very large, training may result
in poor parameter estimates due to sparsity of training data. To address this, we employ
a variant of shrinkage, or deleted interpolation – a smoothing technique previously used
in generative models for language modeling (Jelinek & Mercer, 1980) and information ex-
traction (Freitag & McCallum, 1999). We simultaneously train two models: one that emits
actual observations (individual characters or tokens for character-based and token-based
edit distances respectively), and another that distinguishes between several large classes of
emissions (characters, digits and punctuation for character-based edit distance, and vocab-
ulary, non-vocabulary, and numeric tokens for token-based edit distance). Parameters of
the two models are then interpolated (“shrunk”) using the method of Freitag and McCallum
(1999).
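A minimal sketch of the interpolation step is shown below, assuming the fine-grained and class-level emission distributions have already been estimated; the fixed mixing weight and the renormalization are simplifications standing in for the weight-estimation method of Freitag and McCallum (1999), and the function names are illustrative.

def shrink_emission_probs(specific, coarse, symbol_class, weight=0.7):
    """Shrinkage / deleted interpolation: mix the probability of each specific
    emission with the probability its coarse class receives, then renormalize
    so the result is again a distribution over the specific emissions."""
    mixed = {sym: weight * p + (1.0 - weight) * coarse[symbol_class(sym)]
             for sym, p in specific.items()}
    total = sum(mixed.values())
    return {sym: p / total for sym, p in mixed.items()}

def char_class(c):
    """Coarse emission classes for character-level edit distance."""
    return "digit" if c.isdigit() else ("alpha" if c.isalpha() else "punct")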

Experimental Results

We perform evaluation following the cross-validation procedure described in Section 3.1.1


on two datasets where segmentation is available: Restaurant and Cora. Restaurant is a
database obtained by Tejada et al. (2002), who integrated records from Fodor’s and Zagat’s
guidebooks to obtain 864 restaurant names and addresses that include 112 duplicates. Cora
is a collection of 1295 distinct citations to 122 Computer Science research papers from the
Cora Computer Science research paper search engine collected by McCallum et al. (2000).
The citations were automatically segmented into multiple fields such as author, title, venue,
etc. by an information extraction system, resulting in some noise in the field values. Figures
3.10 and 3.11 present sample coreferent records from the two datasets in segmented form.

Figure 3.10: Sample coreferent records from the Cora dataset

author: W. W. Cohen, R. E. Shapire, and Y. Singer. | title: Learning to order things. | venue: In Advances in Neural Information Processing Systems 10, | year: 1998
author: William W. Cohen, Rob Schapire, and Yoram Singer. | title: Learning to order things. | venue: To appear in NIPS-97, | year: 1997

Figure 3.11: Sample coreferent records from the Restaurant dataset

name: Fenix | address: 8358 Sunset Blvd. | city: West Hollywood | phone: 213/848-6677 | cuisine: American
name: Fenix at the Argyle | address: 8358 Sunset Blvd. | city: W. Hollywood | phone: 213-848-6677 | cuisine: French (new)
Since the spHMM is designed for domains where entities are represented by strings
containing multiple fields, we omit the available segmentation for all records in the test fold,
while retaining it in the training fold for segmented supervision of types (i) and (iii) in the
list above. For Cora, five fields are distinguished: author, title, venue, year, and other, where
the other field may include such information as page numbers, names of editors, location,
etc. For Restaurant, the fields name, street address, city, and cuisine are distinguished. We
employ token-based edit distance in all experiments, since in these domains the differences
between the fields are mainly at the token, not character, level.
We compare the accuracy of spHMM-learned affine-gap edit distance with the fol-
lowing baselines:

• PHMM: learnable affine-gap edit distance without segmentation, described in Section 3.1.1;

• SEQ: a baseline that uses labeled and unlabeled strings to train the IE system of Grenager,
Klein, and Manning (2005), which was specifically designed to handle unsupervised
data. Individual affine-gap edit distances are learned for all extracted fields, and
an SVM classifier is trained to combine them; Section 3.2 describes this process in
detail. During testing, the IE system segments each string into fields. Learned affine-gap
edit distances are computed for all extracted fields, and then combined using the
SVM classifier to obtain overall string similarity estimates.

Comparison with the PHMM baseline evaluates whether incorporating segmentation
in learnable affine-gap edit distance yields improvements, while comparison with the
SEQ baseline evaluates the effect of performing the segmentation and string matching steps
sequentially.
We consider four combinations of training data for the spHMM: segmented string
pairs only, a mixture of segmented and unsegmented pairs, a mixture of unsegmented and
individual segmented strings, and unsegmented pairs only. In all experiments, 50 string pairs
are used; the three numbers in the identifiers spHMM-50-0-0, spHMM-25-25-0, spHMM-0-50-50,
and spHMM-0-50-0 represent the amount of supervision for the three supervision
types in the order listed in the previous section. For example, spHMM-25-25-0
uses 25 segmented coreferent pairs, 25 unsegmented coreferent pairs, and no individual
segmented strings for training. The spHMM-50-0-0, spHMM-25-25-0, and spHMM-0-50-0
curves demonstrate the effects of training on various combinations of segmented and
unsegmented pairwise supervision, while the spHMM-0-50-50 curve shows the effects of
adding some individual segmented supervision to the pairwise unsegmented supervision.
Figures 3.12 and 3.13 contain precision-recall curves for the two datasets. The
results demonstrate that affine-gap edit distance based on spHMM outperforms both regular
learnable affine-gap edit distance and the sequential combination of segmentation and
learnable affine-gap edit distance on both datasets. The improvement is less pronounced on
the Cora dataset compared to the Restaurant dataset: this is due to the fact that the field
structure in citations is more complex than in restaurant records, since the ordering of the
fields varies significantly. As a result, learning an accurate segmentation model is more
difficult for Cora. If field extraction is performed in isolation, segmentation errors degrade
the quality of similarity computations significantly as can be seen from the SEQ results.

In contrast, spHMM is able to improve over non-segmented learnable edit distance by
combining similarities from the multiple alignments.

Figure 3.12: Field-level linkage results for the unsegmented Restaurant dataset

Figure 3.13: Field-level linkage results for the unsegmented Cora dataset
The utility of training the model on segmented versus unsegmented string pairs is
also dependent on the difficulty of the segmentation task. Because segmentations produced
by the trained model are less reliable in Cora than in Restaurant, utilizing more segmented
training data does not result in statistically significant improvements. In Restaurant records,
the field structure is more regular, and a small amount of either segmented pairs or in-
dividual segmented strings improves results obtained with just unsegmented pairs, as the
differences between the spHMM-0-50-0 and the other spHMM results demonstrate.
Overall, the results show that incorporating segmentation into learnable edit dis-
tance yields an improved similarity function for string linkage even without segmented
training data, while further improvements are obtained when small amounts of segmented
supervision are provided.

3.2 Learnable Record Similarity

3.2.1 Combining Similarity Across Fields

Because correspondence between overall record similarity and individual field similarities
can vary greatly depending on field importance, an accurate record similarity function must
weigh fields in proportion to their contribution to the true similarity between records. For
example, similarities of the author and title fields in bibliographic citations are more significant
than similarity of the year field, and an accurate distance measure for overall citations
must reflect this. While statistical aspects of combining similarity scores for individual
fields have been addressed in previous work on record linkage (Winkler, 1999), availability
of labeled duplicates allows a more direct approach that uses a binary classifier which com-
putes a similarity function (Tejada et al., 2002; Elfeky et al., 2002; Sarawagi & Bhamidi-
paty, 2002; Cohen & Richman, 2002). Given a database containing records composed of k
different fields and a set of m similarity functions, {d_1(·,·), ..., d_m(·,·)}, we can represent
any pair of records by an mk-dimensional vector. Each component of the vector contains
the similarity between two field values computed using one of the m similarity functions.

Figure 3.14: Computation of record similarity from individual field similarities
As in training string similarity functions, pairs of coreferent records can be used
to construct a training set of such feature vectors by assigning them a positive class label.
Pairs of non-coreferent records form a complementary set of negative examples, which
can be very large due to the pairwise nature of the matching task and therefore requires
subsampling; this problem is addressed in more detail in Section 3.3.
A binary classifier is then trained on such supervision to discriminate between pairs
of records corresponding to coreferent and non-coreferent pairs. Previous work in this area
relied on Bayesian classifiers (Winkler, 2002), decision trees (Tejada et al., 2002; Elfeky
et al., 2002), and logistic regression (Cohen & Richman, 2002). We employ a Support
Vector Machine (SVM) with an RBF kernel which in the last decade has proven to be a
top-performing classifier on a wide array of categorization tasks (Shawe-Taylor & Cristian-
ini, 2000). Properties that make SVMs particularly appropriate for discriminating between
coreferent and non-coreferent record pairs include their resilience to noise, ability to handle
correlated features, and robustness to the relative sizes of training samples from different
classes. The latter requirement is particularly important, given that the proportion of coref-

39
erent records in a database is very difficult to estimate in realistic record linkage applications
due to the pairwise nature of the task.
Once trained, the SVM provides a confidence estimate for each record pair which
can be treated as an estimate of similarity between records. The confidence estimate is de-
rived from the margin of a particular example, that is, its distance from the hyperplane that
separates the two classes. It has been shown that margins can be converted to confidence or
probability values via a logistic transformation (Wahba, 1999; Platt, 1999b).
Figure 3.14 illustrates the process of computing record similarity using multiple
similarity measures over each field and an SVM to categorize the resulting feature vector
as belonging to the class of duplicates or non-duplicates. For each field of the database,
two similarity functions, d1 and d2 , are applied to compute similarity for that field. The
values computed by these measures form the feature vector that is then classified by a
support vector machine, producing a confidence value that represents similarity between
the database records.
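A schematic sketch of this record-level step is given below; it assumes scikit-learn is available, represents records as dictionaries sharing the same field names, and uses SVC with Platt-style probability calibration as a stand-in for the SVM implementation actually used in the experiments. All names and the data format are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

def pair_features(rec_a, rec_b, field_sims):
    """Build the mk-dimensional feature vector for a record pair: one entry per
    (field, similarity function) combination."""
    return np.array([sim(rec_a[field], rec_b[field])
                     for field in rec_a          # k fields (same keys in both records)
                     for sim in field_sims])     # m similarity functions

def train_record_similarity(labeled_pairs, field_sims):
    """labeled_pairs: (record_a, record_b, is_coreferent) triples. Returns a
    classifier whose positive-class probability acts as learned record similarity."""
    X = np.array([pair_features(a, b, field_sims) for a, b, _ in labeled_pairs])
    y = np.array([int(dup) for _, _, dup in labeled_pairs])
    clf = SVC(kernel="rbf", probability=True)  # probability=True fits a logistic mapping of margins
    clf.fit(X, y)
    return clf

def record_similarity(clf, rec_a, rec_b, field_sims):
    """Confidence of the coreferent class, used as the record-level similarity."""
    features = pair_features(rec_a, rec_b, field_sims).reshape(1, -1)
    return clf.predict_proba(features)[0, 1]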

The Overall Record Linkage Framework

An overall view of our system, MARLIN (Multiply Adaptive Record Linkage with INduc-
tion), is presented in Figure 3.15. The training phase consists of two steps. First, the
learnable string similarity functions are trained for each record field. The training corpus of
field-level coreferent and non-coreferent pairs is obtained by taking pairs of values for each
field from the set of coreferent record pairs. Because equivalent records may contain indi-
vidual fields that are not coreferent, training data can be noisy. For example, if one record
describing a restaurant contains “Asian” in the cuisine field, and an equivalent record con-
tains “Seafood”, a noisy training pair is formed that implies equivalence between these
two strings. However, this issue does not pose a serious problem for our approach for two
reasons. First, particularly noisy fields that are unhelpful for identifying record-level dupli-
cates will be considered irrelevant by the classifier that combines similarities from different

40
Training:
Field training data extractor
Distance
Labeled Field duplicates Metric
coreferent pairs
Learner

Learned
Coreferent and non−coreferent
Record training parameters
record pairs
data extractor

Coreferent and Learned


Binary classifier non−coreferent Distance
distance vectors Metrics

Testing:
Candidate pair extractor
Potential Learned
Database duplicates
Distance
records
Metrics

Distance features
Binary classifier

Identified
duplicates

Figure 3.15: M ARLIN overview

fields. Second, the presence of such pairs in the database indicates that there is a degree
of similarity between such values, and using them in training allows the learnable record
similarity function to capture that likelihood as much as possible.
After individual string similarity functions are learned, they are used to compute
distances for each field of training record pairs to obtain training data for the binary classifier
in the form of vectors composed of distance features.
The record linkage phase starts with generation of potential coreferent pairs. Since
producing all possible pairs of records and computing similarity between them is too ex-
pensive for large databases, MARLIN incorporates several blocking strategies to efficiently
obtain candidate record pairs that are approximately similar and warrant detailed distance
computations. Blocking is discussed in detail in Chapter 5, in which we describe an adaptive
framework for training blocking functions.
Learned string similarity functions are then used to calculate distances for each field
of every candidate record pair, forming field feature vectors for the classifier. Confidence
estimates for belonging to the class of coreferent pairs are produced by the binary classifier
for each candidate pair, and pairs are sorted by decreasing similarity to evaluate similarity
function accuracy as discussed in Section 3.1.1.

3.2.2 Experimental Results

We evaluated the performance of multi-field record linkage within the MARLIN framework
using the SVM implementation in the WEKA software toolkit (Witten & Frank, 1999)
that relies on the Sequential Minimal Optimization (SMO) training algorithm for the under-
lying quadratic optimization problem (Platt, 1999a). We conducted two sets of experiments.
First, we compared the performance of learnable and non-learnable variants of affine-gap
edit distance as components of a record-level similarity function that combines their predic-
tions for individual fields. We have again used the Restaurant and Cora datasets, this time
using the field segmentation provided with the datasets.
Figures 3.16 and 3.17 present the precision-recall curves for record linkage us-
ing SVM as the combining classifier and different field-level similarity functions: learned
edit distance, unlearned edit distance, and TF-IDF weighted cosine similarity. The results
demonstrate that using learnable string edit distance with affine gaps leads to improved per-
formance even when similarities from multiple fields are combined. At high recall levels
(above 90%), using learnable edit distance performs particularly well, indicating that it pro-
vides better field similarity estimates for particularly difficult coreferent pairs, leading to
more accurate computation of the overall record similarity.
Figure 3.16: Record-level linkage results on the Cora dataset

Figure 3.17: Record-level linkage results on the Restaurant dataset

Second, we compared the performance of several classifiers that have been recently
employed for the record linkage task by different researchers. Using the implementations
in the WEKA toolkit, we compared the following classifiers using both unlearned and
learnable affine-gap edit distances as the underlying field similarity functions:

• SVM-RBF: Support Vector Machine with the Gaussian kernel;

• SVM-LINEAR: Support Vector Machine with the linear kernel;

• ADABOOST: the boosting algorithm of Freund and Schapire (1996) that uses J48, WEKA's
implementation of the C4.5 decision tree (Quinlan, 1993), as the base classifier;

• MAXENT: logistic regression (le Cessie & van Houwelingen, 1992);

• BAYESNET: a Bayesian network learner that uses the K2 structure learning algorithm
(Cooper & Herskovits, 1992).

Figures 3.19 and 3.20 present results for the experiments that used learnable affine-
gap edit distance on Restaurant and Cora datasets, while mean average precision (MAP)
values for all experiments are shown in Figure 3.18.
Overall, the results demonstrate that Support Vector Machines yield the best accu-
racy on both datasets, outperforming the other classifiers significantly. Both the Gaussian
and the linear kernel provide equivalently good performance, which is not surprising since
the classification is performed on a very low-dimensional problem. Other classifiers per-
form significantly worse for both datasets. We conclude that SVM-based learnable record
similarity is a robust, accurate similarity function for combining similarities of multiple
fields in the record linkage setting. We also note that using learnable affine-gap edit distance
as the field similarity function provides better results than using unlearned edit distance, al-
though statistically significant differences are only observed on parts of the learning curve
for most classifiers (e.g., for SVM-RBF and SVM-LINEAR the improvements are statistically
significant at the 0.05 level using a two-tailed paired t-test only at 98% and 100% recall,
respectively). However, the improvements are consistent and suggest that using learnable
edit distance for field-level comparisons leads to accuracy improvements even when fields
are combined by a classifier.

Figure 3.18: Mean average precision values for record-level linkage

                      Restaurant                    Cora
               pHMM ED    Unlearned ED     pHMM ED    Unlearned ED
SVM-RBF        0.999      0.996            0.998      0.997
SVM-linear     0.994      0.994            0.998      0.997
AdaBoost       0.948      0.927            0.975      0.974
MaxEnt         0.938      0.937            0.824      0.815
BayesNet       0.884      0.873            0.976      0.967

3.3 Training-Set Construction for Learning Similarity Functions

Training string and record similarity functions in real-world scenarios requires selecting
a set of pairs for a human expert to label as coreferent or non-coreferent, or asking the
expert to identify all groups of coreferent records in the dataset, which is not feasible for
large datasets. Since typical corpora and databases contain few coreferent records, selecting
random pairs as potential training examples leads to training sets with extremely few coref-
erent pairs (positive examples). As a result, such randomly selected training sets are highly
skewed toward non-coreferent pairs, which leads to suboptimal performance of similarity
functions trained on this data. We propose two heuristic approaches for collecting training
data: static-active learning and weakly-labeled selection, and present experimental results
on their effectiveness.

3.3.1 Likely-positive Selection of Training Pairs

Traditional active learning systems are “dynamic”: labels of training examples selected in
earlier rounds influence which unlabeled examples are deemed most informative in sub-
sequent rounds. While prior work has examined dynamic active learning approaches to
adaptive record linkage (Sarawagi & Bhamidipaty, 2002; Tejada et al., 2002), such strate-
gies may not always be feasible due to high computational costs exacerbated by the large
number of potential training examples. We propose using a “static” active learning method
for selecting pairs of records that are likely to be coreferent, as a middle ground between
computationally expensive dynamic active learning methods that try to identify the most
informative training examples and random selection that is efficient but fails to select useful
training data.

Figure 3.19: Classifier comparison for record-level linkage on the Cora dataset

Figure 3.20: Classifier comparison for record-level linkage on the Restaurant dataset
Our approach relies on the fact that off-the-shelf string similarity functions, such
as TF-IDF cosine similarity, can accurately identify coreferent strings or records at low
recall levels (high confidence) even when coreferent and non-coreferent pairs are difficult
to distinguish at high recall levels (low confidence). Therefore, when a random sample of
records from a database is taken and similarity between them is computed using such an
off-the-shelf similarity function, string or record pairs with high similarity scores are likely
to be coreferent. By asking the user to label strings or records with high textual similarity,
a training sample with a high proportion of coreferent pairs can be obtained. At the same
time, non-coreferent pairs selected using this method are likely to be “near-miss” negative
examples that are more informative for training than randomly selected record pairs most
of which tend to be “obvious” non-coreferent pairs. Because training sets constructed using
this method have a dramatically different distribution of coreferent and non-coreferent pairs
from their actual distribution in the dataset, adding some randomly selected non-coreferent
pairs is desirable to decrease the difference between the two distributions and provide the
learner more negative examples.
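The sketch below illustrates this mixed selection strategy using scikit-learn's TF-IDF vectorizer over whole-record strings; the brute-force pairwise similarity computation, the counts, and the function name are simplifications for illustration (the actual experiments used a token-based inverted index to avoid comparing all pairs of records).

import random
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_training_pairs(record_strings, n_static_active=25, n_random=15, seed=0):
    """Rank all record pairs by TF-IDF cosine similarity, take the top pairs as
    likely-positive ('static-active') candidates, and mix in random pairs; all
    selected pairs are then labeled by the user."""
    tfidf = TfidfVectorizer().fit_transform(record_strings)
    sims = cosine_similarity(tfidf)
    pairs = list(combinations(range(len(record_strings)), 2))
    pairs.sort(key=lambda ij: sims[ij[0], ij[1]], reverse=True)
    static_active = pairs[:n_static_active]
    remaining = pairs[n_static_active:]
    random_pairs = random.Random(seed).sample(remaining, min(n_random, len(remaining)))
    return static_active + random_pairs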
Figures 3.21 and 3.22 demonstrate the comparative utility of static-active selection
and random selection for choosing training record pairs on Restaurant and Cora datasets
respectively. The record similarity function was trained on 40 training examples comprised
of randomly selected record pairs and/or the most similar pairs selected by a static-active
method using TF-IDF cosine similarity. Using a token-based inverted index for the vector-
space model (Baeza-Yates & Ribeiro-Neto, 1999) allowed efficient selection of static-active
training examples without computing similarity between all pairs of records. All experiments
utilized SVMlight for computing the learnable record similarity function and two unlearned
string similarity functions for field comparisons: TF-IDF cosine similarity and edit
distance with affine gaps.

Figure 3.21: Comparison of random and likely-positive training example selection on the Restaurant dataset

Figure 3.22: Comparison of random and likely-positive training example selection on the Cora dataset

For both datasets, the highest performance is achieved when record similarity func-
tions are trained using a mix of static-active and randomly selected pairs. However, employ-
ing many random pairs with a few static-active examples yields the best results on Cora,
while on Restaurant the highest performance is achieved when the system is trained on a
balanced mix of static-active and random examples. This difference is explained by the
makeup of the two datasets. Cora has a higher absolute number of coreferent pairs than
Restaurant (8,592 versus 56 for each fold); coreferent pairs in Cora also represent a larger
proportion of all record pairs (4.1% versus 0.06% for each fold). On Restaurant, random
selection results in training sets that contain almost no coreferent pairs, while including
a significant number of pairs selected using the static-active technique leads to balanced
training sets that contain sufficient positive and negative examples. On Cora, however, ran-
domly selected record pairs are likely to contain a few coreferent pairs. Including a limited
number of record pairs chosen using the static-active technique results in the best perfor-
mance, but as more static-active examples are added, performance decreases because highly
similar coreferent pairs take the place of informative non-coreferent pairs in the training set.
Thus, the worst performance on Restaurant occurs when all training examples are chosen
randomly because coreferent pairs are almost never encountered, while on Cora using only
examples chosen by static-active selection results in the opposite problem: extremely few
non-coreferent pairs are found, and the class distribution of training data is highly skewed
toward non-coreferent pairs.
Based on these results, we conclude that best training sets for learnable record sim-
ilarity functions are obtained when randomly chosen pairs of records are combined with
pairs chosen using static-active selection. The specific proportion in which the two kinds of
training data should be mixed can be estimated based on the outcome of labeling randomly
chosen pairs. If coreferent pairs are exceptionally rare, a significant number of static-active
examples is required to obtain a sufficient sample of coreferent pairs, while databases with a
large number of coreferent records need only a small number of record pairs selected using
the static-active methodology to complete a representative training set.
Overall, we show that a reasonable baseline to which dynamic active learning meth-
ods for adaptive similarity functions should be compared is not the one that uses only ran-
domly selected training pairs, but one that employs the static-active method to overcome
the extreme skewness in class distribution that is typical for similarity function learning and
record linkage problems.

3.3.2 Weakly-labeled Selection

While the static-active method allows identifying coreferent training pairs for learnable
similarity functions, the inverse problem can be encountered in some real-world situations:
a “legacy” training set consisting of identified coreferent pairs may be available, while in-
formative non-coreferent pairs need to be collected. For such situations we consider an
unsupervised technique for obtaining negative examples. Since coreferent records are rare
in a typical database, two randomly selected records are likely to be non-coreferent, and
therefore can potentially be used as negative training examples for learning similarity func-
tions. To help ensure that no coreferent records are included among these pairs, only pairs of
records that do not share a significant number of common tokens should be included as neg-
ative examples. Such selection of “weakly-labeled” (and potentially noisy) non-coreferent
record pairs is the unsupervised analog of static-active selection of coreferent pairs. The
process can also be thought of as the opposite of blocking or canopies techniques that use
off-the-shelf metrics to avoid comparing “obvious” non-coreferent records to speed up the
record linkage process.
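A minimal sketch of this selection heuristic follows; reading "share no more than a given fraction of tokens" as Jaccard overlap, as well as the threshold and the function names, are assumptions made for illustration.

import random
from itertools import combinations

def token_overlap(a, b):
    """Fraction of shared tokens (Jaccard overlap) between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def weakly_labeled_negatives(record_strings, n_pairs, max_overlap=0.2, seed=0):
    """Sample random record pairs that share at most max_overlap of their tokens;
    such pairs are assumed non-coreferent and used as (possibly noisy) negatives."""
    candidates = [(i, j) for i, j in combinations(range(len(record_strings)), 2)
                  if token_overlap(record_strings[i], record_strings[j]) <= max_overlap]
    return random.Random(seed).sample(candidates, min(n_pairs, len(candidates)))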
We compared the record linkage accuracy of MARLIN trained on weakly-labeled
negatives with training on user-labeled negatives. Figures 3.23 and 3.24 present the results
of these experiments on the Restaurant and Cora datasets. Weakly-labeled negatives were
selected randomly from record pairs that shared no more than 20% of tokens to minimize the
noise.

Figure 3.23: Comparison of using weakly-labeled non-coreferent pairs with using random
labeled record pairs on the Restaurant dataset

Figure 3.24: Comparison of using weakly-labeled non-coreferent pairs with using random
labeled record pairs on the Cora dataset

All experiments used training sets composed of two parts: half the examples were
positives randomly selected among user-labeled coreferent pairs, and the other half was
composed of either weakly-labeled non-coreferent records or randomly selected labeled
record pairs. SVMlight was employed to compute record similarity, and TF-IDF cosine
similarity and edit distance with affine gaps were used as the underlying string similarity
functions for individual fields.
The results again demonstrate that the utility of the heuristic selection of training
data for similarity function learning is dataset-dependent. On Restaurant, where coreferent
pairs are scarce and randomly selected records are truly non-coreferent with very high
probability, using weakly-labeled non-coreferent pairs yields results identical to randomly
selected labeled coreferent pairs when a large number of examples is selected, and improves
slightly over random selection when the training set is small. We conjecture that biasing
the SVM with “negative but slightly similar” examples when very little training data is
available allows learning a better separating hyperplane. On Cora, using weakly-labeled
negatives leads to slight degradation of system accuracy, which is expected since coreferent
pairs are relatively frequent, and noise is likely to be introduced when negative examples
are collected in an unsupervised manner. However, the drop in performance is small, and in
situations where human labeling of negatives is expensive or infeasible (e.g. due to privacy
issues), using weakly-labeled selection is a viable avenue for unsupervised acquisition of
negative training examples for similarity function learning.

3.4 Related Work

Several researchers described methods for learning string similarity functions in prior work.
For string edit distance, Ristad and Yianilos (1998) proposed learning the costs of individual
edit operations of Levenshtein distance using a probabilistic generative model. In their
model, a string alignment is equivalent to a sequence of character pairs generated by edit
operations emitted by a hidden Markov model with a single non-terminal state. We have
followed the same approach in developing a learning model for affine-gap edit distance,
which provides significantly better similarity estimates for natural text strings (Bilenko &
Mooney, 2002; Cohen et al., 2003a).
Both our model and the model of Ristad-Yianilos are instances of pair Hidden
Markov Models, proposed earlier for biological sequence alignments in bioinformatics (Durbin
et al., 1998). Using such models for record linkage requires several modifications that we
have described. Among those, parameter tying, gap-to-gap transitions, and length normal-
ization are important for obtaining good performance of pair HMM-based edit distance in
natural language string similarity computations.
Two other models for learning the costs of individual edit operations have been
proposed by Zhu and Ungar (2000) and Yancey (2004). Zhu and Ungar (2000) have used
genetic algorithms for learning the costs of several manually constructed edit operations.
Yancey (2004) has employed a variant of Expectation-Maximization for learning the prob-
abilities of individual edit operations, where only highest-probability (Viterbi) alignments
were used to accumulate expectations. Both of these approaches are adaptive variants of
Levenshtein distance and do not include taking gaps into account.
In recent years, two models of learnable edit distance have been proposed based
on discriminative classifiers. Joachims (2003) formulated the problem of learning edit op-
eration costs as maximum-margin optimization, and showed how it can be solved using
SVMs. However, this formulation relies on availability of actual string alignments, not just
coreferent string pairs, and therefore requires significant labeling effort to obtain training
data. McCallum, Bellare, and Pereira (2005) described a model for learning the parameters
of affine-gap edit distance based on Conditional Random Fields (CRFs), a discriminative
analog of HMMs. While they have obtained improvements over the field-level results we
presented in Section 3.1.1 on some of the datasets, their method relies on a number of extra
matching features, some of which could also be implemented in the HMM-based model.
Additionally, training algorithms for CRF-based models are more complex than EM-based
training of HMMs and incur significant computational costs.
A number of record linkage researchers have relied on classifiers to combine simi-
larity estimates across multiple fields. Approaches in the statistical literature have tradition-
ally relied on generative classifiers such as Naive Bayes and Bayesian networks (Winkler,
2002), while in recent machine learning research a number of classifiers have been used,
including decision trees (Elfeky et al., 2002; Tejada et al., 2002; Sarawagi & Bhamidipaty,
2002) and logistic regression (Cohen & Richman, 2002). We have shown that Support Vec-
tor Machines outperform these methods significantly on both datasets that we considered.
Sarawagi and Bhamidipaty (2002) and Tejada et al. (2002) have proposed active
learning methods that obtain informative training examples for learning record-level sim-
ilarity functions between records. The training set construction strategies we described in
Section 3.3 approximate these methods without the computational cost of active learning
for selecting likely positives, and without the need for a human oracle for weak negatives.
Recent work on record linkage has focused on the third stage of the record link-
age process described in Section 2.2: clustering for obtaining groups of coreferent records.
In particular, a number of methods have been proposed for collectively grouping coreferent
records and obtaining the complete partitioning of datasets into such groups (Pasula et al.,
2003; Wellner et al., 2004; Li et al., 2004; Singla & Domingos, 2005). Our work addresses
an orthogonal problem, accurate computation of record and field similarities, and the meth-
ods presented in this chapter can be used as input to the collective linkage approaches, since
they rely on pairwise similarity estimates between records or their fields.

3.5 Chapter Summary

In this chapter, we have shown how learnable similarity functions lead to significant perfor-
mance improvements in the record linkage task. Because record linkage requires accurate
distance estimates between individual field values and overall records, adapting similarity
functions that provide these estimates allows learning domain-specific parameters to com-

pute similarity with higher accuracy.
Two learnable variants of affine-gap edit distance based on pair HMMs that we
described learn edit operation and gap costs that discriminate between coreferent and non-
coreferent strings. For record-level similarity, we have shown that using Support Vector
Machines leads to accurate distance estimations between records composed of multiple
fields. We have demonstrated that employing learnable field-level similarity functions is
still advantageous over using unlearned methods in multi-field domains when field similar-
ities are combined by a classifier. Finally, we have shown that informative training examples
for these methods can be collected without relying on active learning methods, and possibly
without even relying on human supervision.

Chapter 4

Learnable Similarity Functions in Semi-supervised Clustering

In this chapter, we show how learnable similarity functions improve clustering accuracy
when employed in a semi-supervised clustering setting. We describe a probabilistic model
for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that ac-
commodates a wide variety of learnable similarity functions. This model yields a clustering
algorithm, HMRF-KM EANS, that integrates similarity function learning with constraint-
based clustering, improving on algorithms that perform these tasks in isolation.

4.1 Similarity Functions in Clustering

As discussed in Section 2.3, clustering inherently relies on similarity estimations as its goal
is to group instances that are alike while separating instances that are dissimilar. For many
datasets, off-the-shelf functions may fail to provide similarity estimates that place same-
cluster points nearby and different-cluster points far apart, preventing the discovery of a
desired partitioning of a dataset. Examples of same-cluster and different-cluster instance
pairs that are available in the semi-supervised clustering setting provide supervision for

training the similarity function to produce appropriate distance estimates, making it easier
to create clusters that respect the pairwise supervision when grouping the unlabeled data.
For some datasets, clusters of different shapes may be desirable, which effectively
indicates that datapoints in these clusters lie in different subspaces of the overall data
space. Recovering such partitioning requires using an individual similarity function for
each cluster, a fact that is exploited in unsupervised clustering algorithms like Expectation-
Maximization that estimate distinct density parameters for different clusters. In the semi-
supervised setting, pairwise constraints provide additional information about the shape of
underlying clusters that can be captured if the similarity function is learned using both su-
pervised and unsupervised data.
The HMRF framework for semi-supervised clustering presented below addresses
the above considerations in a principled probabilistic model and leads to a clustering algo-
rithm, HMRF-KM EANS, that combines the advantages of constraint-based and similarity-
based approaches to semi-supervised clustering. The following section presents an overview
of the overall HMRF framework, more detailed description of which can be found in (Basu
et al., 2006). Then, use of learnable similarity functions within the framework is described
in detail. Three examples of similarity functions and their parameterizations for use with
HMRF-KM EANS are provided for squared Euclidean distance, cosine distance and KL
divergence. Through parameterization, each of these functions becomes adaptive in the
semi-supervised clustering setting, which allows learning the appropriate notion of similar-
ity using both the pairwise constraints and the unlabeled data.

4.2 The HMRF Model for Semi-supervised Clustering

We assume that we are given a set of $n$ data points $X = \{x_i\}_{i=1}^n$, where each $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector. Supervision consists of two sets of pairwise constraints over points in $X$: must-link constraints $C_{ML} = \{(x_i, x_j)\}$ and cannot-link constraints $C_{CL} = \{(x_i, x_j)\}$, where $(x_i, x_j) \in C_{ML}$ implies that $x_i$ and $x_j$ should belong to the same cluster, while $(x_i, x_j) \in C_{CL}$ implies that $x_i$ and $x_j$ should belong to different clusters. The constraints may be accompanied by associated violation costs $W$, where $w_{ij}$ represents the cost of violating the constraint between points $x_i$ and $x_j$, if such a constraint exists.
The model relies on selecting a distortion measure $d_A$ to compute dissimilarity between points: $d_A : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. The distortion measure corresponds to a learnable similarity function, with $A$ being the set of parameters to learn, which is typically a matrix or a vector of weights. The objective of semi-supervised clustering is to partition the data points $X$ into $K$ disjoint clusters $\{X_1, \ldots, X_K\}$ so that the total distortion between the points and the corresponding cluster representatives is minimized according to the given distortion measure $d_A$, while constraint violations are kept to a minimum.

4.2.1 HMRF Model Components

The Hidden Markov Random Field (HMRF) (Zhang, Brady, & Smith, 2001) probabilistic framework for semi-supervised constrained clustering consists of the following components:

• An observable set $X = \{x_i\}_{i=1}^n$ corresponding to the given data points $X$. Note that we overload notation and use $X$ to refer to both the given set of data points and their corresponding random variables.

• An unobservable (hidden) set $Y = \{y_i\}_{i=1}^n$ corresponding to cluster assignments of points in $X$. Each hidden variable $y_i$ encodes the cluster label of the point $x_i$ and takes values from the set of cluster indices $\{1, \ldots, K\}$.

• An unobservable (hidden) set of generative model parameters $\Theta$, which consists of distortion measure parameters $A$ and cluster representatives $M = \{\mu_i\}_{i=1}^K$: $\Theta = \{A, M\}$.

• An observable set of constraint variables $C = \{c_{12}, c_{13}, \ldots, c_{n-1,n}\}$. Each $c_{ij}$ is a ternary variable taking on a value from the set $\{-1, 0, 1\}$, where $c_{ij} = 1$ indicates that $(x_i, x_j) \in C_{ML}$, $c_{ij} = -1$ indicates that $(x_i, x_j) \in C_{CL}$, and $c_{ij} = 0$ corresponds to pairs $(x_i, x_j)$ that are not constrained.

Figure 4.1: A Hidden Markov Random Field for semi-supervised clustering

Fig. 4.1 shows the HMRF for a hypothetical five-point dataset $X$. The datapoints correspond to variables $(x_1, \ldots, x_5)$ that have cluster labels $Y = (y_1, \ldots, y_5)$, which may each take on values $(1, 2, 3)$ denoting the three clusters. Three pairwise constraints are provided: two must-link constraints $(x_1, x_2)$ and $(x_1, x_4)$, and one cannot-link constraint $(x_2, x_3)$. Corresponding constraint variables are $c_{12} = 1$, $c_{14} = 1$, and $c_{23} = -1$; all other variables in $C$ are set to zero. The task is to partition the five points into three clusters. Fig. 4.1 demonstrates one possible clustering configuration which does not violate any constraints. The must-linked points $x_1$, $x_2$, and $x_4$ belong to cluster 1; the point $x_3$, which is cannot-linked with $x_2$, is assigned to cluster 2; $x_5$, which is not involved in any constraints, belongs to cluster 3.

Figure 4.2: Graphical plate model of variable dependence in HMRF-based semi-supervised clustering

4.2.2 Joint Probability in the HMRF Model

The graphical plate model (Buntine, 1994) of the dependence between the random variables
in the HMRF is shown in Figure 4.2, where the unshaded nodes represent the hidden vari-
ables, the shaded nodes are the observed variables, the directed links show dependencies
between the variables, while the lack of an edge between two variables implies conditional
independence. The prior probability of Θ is assumed to be independent of C . The probabil-
ity of observing the label configuration Y depends on the constraints C and current genera-
tive model parameters Θ. Observed datapoints corresponding to variables X are generated
using the model parameters Θ based on cluster labels Y , independent of the constraints C .
The variables $X$ are assumed to be mutually independent: each $x_i$ is generated individually from a conditional probability distribution $\Pr(x_i \mid y_i, \Theta)$. Then, the joint probability of $X$, $Y$, and $\Theta$, given $C$, can be factorized as follows:

$$\Pr(X, Y, \Theta \mid C) = \Pr(\Theta)\, \Pr(Y \mid \Theta, C) \prod_{i=1}^{n} p(x_i \mid y_i, \Theta) \qquad (4.1)$$

where $p(\cdot \mid y_i, \Theta)$ is the parameterized probability density function for the $y_i$-th cluster, from which $x_i$ is generated. This probability density corresponds to the clustering distortion measure $d_A$, and is discussed in detail in Section 4.3 below.

Each hidden random variable $y_i \in Y$ representing the cluster label of $x_i \in X$ is associated with a set of neighbors $N_i$, defined as all points to which $x_i$ is must-linked or cannot-linked: $N_i = \{y_j \mid (x_i, x_j) \in C_{ML} \cup (x_i, x_j) \in C_{CL}\}$. We make the Markov assumption that each label $y_i$ is conditionally independent of all other labels in $Y$ given the labels of its neighbors. The resulting random field over the hidden variables $Y$ is a Markov Random Field (MRF), in which by the Hammersley-Clifford theorem (Hammersley & Clifford, 1971) the prior probability of a particular label configuration $Y$ can be expressed as a Gibbs distribution (Geman & Geman, 1984):

$$\Pr(Y \mid \Theta, C) = \frac{1}{Z} \exp\Big(-\sum_{i,j} v(i,j)\Big) \qquad (4.2)$$

where $Z$ is the partition function (normalizing term), and each $v(i,j)$ is the potential function encoding the compatibility of cluster labels $y_i$ and $y_j$. Because label compatibility is only relevant for pairs of points that participate in constraints, we define $v(i,j)$ as follows:

$$v(i,j) = \begin{cases} w_{ij}\, f_{ML}(i,j) & \text{if } c_{ij} = 1 \text{ and } y_i \ne y_j \\ w_{ij}\, f_{CL}(i,j) & \text{if } c_{ij} = -1 \text{ and } y_i = y_j \\ 0 & \text{otherwise} \end{cases} \qquad (4.3)$$

where $f_{ML}$ and $f_{CL}$ are penalty functions that encode the lowered probability of observing configurations of $Y$ where must-link and cannot-link constraints are violated respectively, and $w_{ij}$ is the user-provided constraint weight that can be used to indicate its importance. Penalty functions are chosen to correlate with the distortion measure by depending on the distortion measure parameters $A$, and will be described in detail in Section 4.3 below. Overall, this formulation for observing the label assignment (clustering) $Y$ results in higher probabilities being assigned to configurations in which cluster assignments do not violate the provided constraints.

Then, the joint probability on the HMRF can be expressed as follows:

$$\Pr(X, Y, \Theta \mid C) = \Pr(\Theta) \left( \frac{1}{Z} \exp\Big(-\!\sum_{(i,j):\, c_{ij} \ne 0}\! v(i,j)\Big) \right) \left( \prod_{i=1}^{n} p(x_i \mid y_i, \Theta) \right) \qquad (4.4)$$

The first factor in the above expression describes a probability distribution over
the model parameters preventing them from attaining degenerate values, thereby providing
regularization. The second factor is the conditional probability of observing a particular
label configuration given the provided constraints, effectively assigning a higher probability
to configurations where the cluster assignments do not violate the constraints. Finally, the
third factor is the conditional probability of generating the observed data points given the
labels and the parameters: if maximum-likelihood estimation (MLE) was performed on the
HMRF, the goal would have been to maximize this term in isolation.
Overall, maximizing the joint HMRF probability in Eq.(4.4) is equivalent to jointly
maximizing the likelihood of generating datapoints from the model and the probability of
label assignments that respect the constraints, while regularizing the model parameters.

4.3 Learnable Similarity Functions in the HMRF Model

The joint probability formulation in Eq. (4.4) provides a general framework for incorporating various similarity functions in clustering by choosing a particular form of $p(x_i \mid y_i, \Theta)$, the probability density that generates the $i$-th instance $x_i$ from cluster $y_i$. In this work, we restrict our attention to probability densities from the exponential family, where the expectation parameter corresponding to cluster $y_i$ is $\mu_{y_i}$, the mean of the points of that cluster. Using this assumption and the bijection between regular exponential distributions and regular Bregman divergences (Banerjee et al., 2005b), the conditional density for observed data can be represented as follows:

$$p(x_i \mid y_i, \Theta) = \frac{1}{Z_\Theta} \exp\big(-d_A(x_i, \mu_{y_i})\big) \qquad (4.5)$$

where $d_A(x_i, \mu_{y_i})$ is the Bregman divergence between $x_i$ and $\mu_{y_i}$, corresponding to the exponential density $p$, and $Z_\Theta$ is the normalizer. Different similarity functions can be expressed via this exponential form:

• If $x_i$ and $\mu_{y_i}$ are vectors in Euclidean space, and $d_A$ is the square of the $L_2$ distance parameterized by a positive semidefinite weight matrix $A$, $d_A(x_i, \mu_{y_i}) = \|x_i - \mu_{y_i}\|_A^2$, then the cluster conditional probability is a $d$-dimensional multivariate normal density with covariance matrix $A^{-1}$: $p(x_i \mid y_i, \Theta) = \frac{1}{(2\pi)^{d/2} |A|^{-1/2}} \exp\big(-\frac{1}{2}\|x_i - \mu_{y_i}\|_A^2\big)$ (Kearns, Mansour, & Ng, 1997);

• If $x_i$ and $\mu_{y_i}$ are probability distributions, and $d_A$ is KL-divergence ($d_A(x_i, \mu_{y_i}) = \sum_{m=1}^{d} x_{im} \log \frac{x_{im}}{\mu_{y_i m}}$), then the cluster conditional probability is a multinomial distribution (Dhillon & Guan, 2003).

The relation in Eq. (4.5) holds even if $d_A$ is not a Bregman divergence but a directional distance measure such as cosine distance. Then, if $x_i$ and $\mu_{y_i}$ are vectors of unit length and $d_A$ is one minus the dot product of the vectors, $d_A(x_i, \mu_{y_i}) = 1 - \frac{\sum_{m=1}^{d} x_{im} \mu_{y_i m}}{\|x_i\| \|\mu_{y_i}\|}$, then the cluster conditional probability is a von Mises-Fisher (vMF) distribution with unit concentration parameter (Banerjee et al., 2005a), which is the spherical analog of a Gaussian.
Putting Eq.(4.5) into Eq.(4.4) and taking logarithms gives the following clustering
objective function, minimizing which is equivalent to maximizing the joint probability over
the HMRF in Eq.(4.4):

$$J_{obj} = \sum_{x_i \in X} d_A(x_i, \mu_{y_i}) + \sum_{c_{ij} \in C} v(i,j) - \log \Pr(\Theta) + \log Z + n \log Z_\Theta \qquad (4.6)$$

Thus, an optimal clustering is obtained by minimizing $J_{obj}$ over the hidden variables $Y$ and parameters $\Theta$, which are comprised of cluster centroids $M$ and distortion measure parameters $A$ (note that given cluster assignments $Y$, the means $M = \{\mu_i\}_{i=1}^K$ are uniquely determined).

Selecting an appropriate distortion measure dA for a clustering task typically in-
volves knowledge about properties of the particular domain and dataset. For example,
squared Euclidean distance is most appropriate for low-dimensional data, while cosine
similarity is most fitting for data described by vectors in high-dimensional space where
directional differences are important but vector lengths are irrelevant.
Once a distortion measure is chosen for a given domain, the functions fML and fCL
must be defined to penalize must-link and cannot-link constraint violations respectively,
as described in Section 4.2.2. Each violation penalty is scaled proportionally to the “egre-
giousness” of the violation with respect to the current similarity function. That is, a violated
must-link constraint carries a heavy penalty in the objective function if the distance between
its points is high: this indicates that the two points are highly dissimilar, and the current pa-
rameterization of the similarity function is grossly inadequate. Likewise, if the two points of a violated cannot-link constraint are similar, the penalty is high since the parameterization of the similarity function is inappropriate: the points should be dissimilar.
To reflect this intuition, the penalty functions are defined as follows:

$$f_{ML}(i,j) = \varphi(x_i, x_j) \qquad (4.7)$$

$$f_{CL}(i,j) = \varphi_{\max} - \varphi(x_i, x_j) \qquad (4.8)$$

where $\varphi : X \times X \to \mathbb{R}^+$ is a non-negative function that penalizes constraint violations, while $\varphi_{\max}$ is an upper bound on the maximum value of $\varphi$ over any pair of points in the dataset; examples of such bounds for specific distortion functions are described below. The function $\varphi$ is chosen to be identical or proportional to the distortion measure, assigning higher penalties to violations of must-link constraints between points that are distant with respect to the current parameter values of the distortion measure. Conversely, penalties for violated cannot-link constraints are higher for points that have low distance between them. With this formulation of the penalty functions, constraint violations will lead to changes in the distortion measure parameters that attempt to mend the violations. The potential function $v(i,j)$ in Eq. (4.3) then becomes:

$$v(i,j) = \begin{cases} w_{ij}\, \varphi(x_i, x_j) & \text{if } c_{ij} = 1 \text{ and } y_i \ne y_j \\ w_{ij}\, \big(\varphi_{\max} - \varphi(x_i, x_j)\big) & \text{if } c_{ij} = -1 \text{ and } y_i = y_j \\ 0 & \text{otherwise} \end{cases} \qquad (4.9)$$

and the objective function for semi-supervised clustering in Eq.(4.6) can be expressed as:

$$J_{obj} = \sum_{x_i \in X} d_A(x_i, \mu_{y_i}) + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij}\, \varphi(x_i, x_j) + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \big(\varphi_{\max} - \varphi(x_i, x_j)\big) - \log \Pr(A) + n \log Z_\Theta \qquad (4.10)$$

Note that the MRF partition function term log Z has been dropped from the objective
function. Its estimation cannot be performed in closed form for most non-trivial dependency
structures, and while approximate inference methods could be employed for computing
it (Kschischang, Frey, & Loeliger, 2001; Wainwright & Jordan, 2003), experiments with the
different methods have shown that minimizing the simplified objective yields comparable
results (Bilenko & Basu, 2004).
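To make the structure of this simplified objective concrete, the following minimal Python/NumPy sketch evaluates the objective of Eq. (4.10) for a given label assignment (with the log Z terms dropped, as discussed above). It is an illustration only, not the implementation used in the experiments; the callables distortion, phi and the bound phi_max stand for the measure-specific choices introduced in the next sections.

import numpy as np

def objective(X, labels, centroids, distortion, phi, phi_max, ML, CL, w, log_prior=0.0):
    """Simplified HMRF objective of Eq. (4.10): distortion to cluster centroids plus
    penalties for violated must-link (ML) and cannot-link (CL) constraints,
    minus the log-prior on the distortion parameters (partition-function terms dropped)."""
    J = sum(distortion(X[i], centroids[labels[i]]) for i in range(len(X)))
    for (i, j) in ML:                      # violated must-links: points ended up in different clusters
        if labels[i] != labels[j]:
            J += w[(i, j)] * phi(X[i], X[j])
    for (i, j) in CL:                      # violated cannot-links: points ended up in the same cluster
        if labels[i] == labels[j]:
            J += w[(i, j)] * (phi_max - phi(X[i], X[j]))
    return J - log_prior

# toy usage with squared Euclidean distance as both the distortion and the penalty function
if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
    labels = [0, 0, 1, 1]
    centroids = np.array([X[:2].mean(axis=0), X[2:].mean(axis=0)])
    d = lambda a, b: float(np.sum((a - b) ** 2))
    ML, CL = [(0, 1)], [(1, 2)]
    w = {(0, 1): 1.0, (1, 2): 1.0}
    print(objective(X, labels, centroids, d, d, phi_max=100.0, ML=ML, CL=CL, w=w))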

4.3.1 Parameter Priors

Following the definition of $\Theta$ in Section 4.2.1, the prior term $\log \Pr(\Theta)$ in Eq. (4.6) and the subsequent equations can be factored as follows:

$$\log \Pr(\Theta) = \log \big( \Pr(A)\, \Pr(M) \big) = \log \Pr(A) + P_M$$

where the distortion parameters $A$ are assumed to be independent of the cluster centroids $M = \{\mu_i\}_{i=1}^K$, and uniform priors are considered over the cluster centroids (leading to the constant term $P_M$). For different distortion measures, parameter values may exist that lead to degenerate solutions of the optimization problem. For example, for squared Euclidean distance, the zero matrix $A = 0$ is one such solution. To prevent degenerate solutions, $\Pr(A)$ is used to regularize the parameter values using a prior distribution.

If the standard Gaussian prior was used on the parameters of the distortion function, it would allow the parameters to take negative values. Since it is desirable to constrain the parameter values to be non-negative, it is more appropriate to use the Rayleigh distribution (Papoulis & Pillai, 2001). Assuming independence of the parameters $a_i \in A$, the prior term based on the Rayleigh distribution is the following:

$$\Pr(A) = \prod_{a_i \in A} \frac{a_i}{s^2} \exp\left(-\frac{a_i^2}{2s^2}\right) \qquad (4.11)$$

where $s$ is the width parameter.


Next, we consider three examples of commonly used distortion measures and their
parameterizations for use with HMRF-KM EANS: squared Euclidean distance, cosine dis-
tance and KL divergence. Through learning, each of these similarity functions reflects the
correct notion of similarity provided by the pairwise constraints, leading to better clustering
accuracy.

4.3.2 Parameterized Squared Euclidean Distance

Squared Euclidean distance is parameterized using a symmetric positive-definite matrix $A$ as follows:

$$d_{euc_A}(x_i, x_j) = \|x_i - x_j\|_A^2 = (x_i - x_j)^T A (x_i - x_j) \qquad (4.12)$$

This form of the parameterized squared Euclidean distance is equivalent to Mahalanobis distance with an arbitrary positive semidefinite weight matrix $A$ in place of the inverse covariance matrix, and it was previously used by Xing, Ng, Jordan, and Russell (2003) and Bar-Hillel, Hertz, Shental, and Weinshall (2003). Such a formulation can also be viewed as a projection of every instance $x$ onto a space spanned by $A^{1/2}$: $x \to A^{1/2}x$.

The $\varphi$ function that penalizes constraint violations is defined as $\varphi(x_i, x_j) = d_{euc_A}(x_i, x_j)$. One possible initialization of the upper bound for cannot-link penalties is $\varphi^{\max}_{euc_A} = \sum_{(x_i, x_j) \in C_{CL}} d_{euc_A}(x_i, x_j)$, which guarantees that the penalty is always positive. Using these definitions in the objective in Eq. (4.10), the following objective function is obtained for semi-supervised clustering with parameterized squared Euclidean distance:

$$J_{euc_A} = \sum_{x_i \in X} d_{euc_A}(x_i, \mu_{y_i}) + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij}\, d_{euc_A}(x_i, x_j) + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \big(\varphi^{\max}_{euc_A} - d_{euc_A}(x_i, x_j)\big) - \log \Pr(A) - n \log \det(A) \qquad (4.13)$$

Note that the $\log Z_\Theta$ term in the general objective function in Eq. (4.10) is computable in closed form for a Gaussian distribution with covariance matrix $A^{-1}$, resulting in the $\log \det(A)$ term.
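As a concrete illustration, the parameterized distance of Eq. (4.12) and the upper bound suggested above can be computed as follows (a minimal NumPy sketch, not the implementation used in the experiments; the function names are ours):

import numpy as np

def d_euc_A(x, y, A):
    """Parameterized squared Euclidean (Mahalanobis-style) distance of Eq. (4.12)."""
    diff = x - y
    return float(diff @ A @ diff)

def phi_max_euc(cannot_links, X, A):
    """One possible upper bound on cannot-link penalties: the sum of parameterized
    distances over all cannot-link pairs, as suggested in Section 4.3.2."""
    return sum(d_euc_A(X[i], X[j], A) for (i, j) in cannot_links)

# example: with A = identity this reduces to the ordinary squared Euclidean distance
X = np.array([[1.0, 2.0], [3.0, 0.0]])
A = np.eye(2)
print(d_euc_A(X[0], X[1], A))          # 8.0
print(phi_max_euc([(0, 1)], X, A))     # 8.0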

4.3.3 Parameterized Cosine Distance

Cosine distance can be parameterized using a symmetric positive-definite matrix $A$, which leads to the following distortion measure:

$$d_{cos_A}(x_i, x_j) = 1 - \frac{x_i^T A x_j}{\|x_i\|_A \|x_j\|_A} \qquad (4.14)$$

Because computing the full matrix $A$ is computationally very expensive for realistic high-dimensional domains, a diagonal matrix is considered in this case, such that $a = \mathrm{diag}(A)$ is a vector of positive weights, intuitively corresponding to the relative importance of each dimension.
To use parameterized cosine distance as the adaptive distortion measure for clustering, the $\varphi$ function is defined as $\varphi(x_i, x_j) = d_{cos_A}(x_i, x_j)$. Using this definition along with Eq. (4.10), and setting $\varphi_{\max} = 1$ as an upper bound on $\varphi(x_i, x_j)$, the following objective function is obtained for semi-supervised clustering with parameterized cosine distance:

$$J_{cos_A} = \sum_{x_i \in X} d_{cos_A}(x_i, \mu_{y_i}) + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij}\, d_{cos_A}(x_i, x_j) + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \big(1 - d_{cos_A}(x_i, x_j)\big) - \log \Pr(A) \qquad (4.15)$$

Note that the log ZΘ term is difficult to compute in closed form (Banerjee et al.,
2005a), so it is assumed to be constant during the clustering process and therefore dropped
from the objective function. This assumption is reasonable given an appropriate prior Pr(A),
and experimentally we have not observed problems with algorithm convergence due to this
approximation.
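A corresponding sketch for the diagonally parameterized cosine distance of Eq. (4.14) is given below (illustrative NumPy code; the function name and the representation of diag(A) as a weight vector are our choices):

import numpy as np

def d_cos_A(x, y, a):
    """Parameterized cosine distance of Eq. (4.14) with a diagonal weight matrix
    A = diag(a); a is a vector of positive per-dimension weights."""
    num = np.sum(a * x * y)                      # x^T A y for diagonal A
    norm_x = np.sqrt(np.sum(a * x * x))          # ||x||_A
    norm_y = np.sqrt(np.sum(a * y * y))          # ||y||_A
    return 1.0 - num / (norm_x * norm_y)

# example: uniform weights recover the standard cosine distance
x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])
print(d_cos_A(x, y, np.ones(3)))   # 0.5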

4.3.4 Parameterized Kullback-Leibler Divergence

In domains where each instance can be described as a probability distribution, Kullback-Leibler divergence can be used to measure similarity between instances. In previous work, Cohn, Caruana, and McCallum (2003) parameterized KL-divergence by multiplying every component by a real weight: $d_{KL}'(x_i, x_j) = \sum_{m=1}^{d} a_m x_{im} \log \frac{x_{im}}{x_{jm}}$.

We use a similar parameterization of KL divergence, where the vector of positive weights, $a$, corresponds to a diagonal matrix $A$. However, since after the reweighting each instance is no longer a probability distribution, this parameterization requires using I-divergence, a function that also belongs to the class of Bregman divergences (Banerjee et al., 2005b). I-divergence has the form $d_I(x_i, x_j) = \sum_{m=1}^{d} x_{im} \log \frac{x_{im}}{x_{jm}} - \sum_{m=1}^{d} (x_{im} - x_{jm})$, where $x_i$ and $x_j$ no longer need to be probability distributions but can be any non-negative vectors.¹ The parameterized I-divergence is expressed as follows:

$$d_{I_A}(x_i, x_j) = \sum_{m=1}^{d} a_m x_{im} \log \frac{x_{im}}{x_{jm}} - \sum_{m=1}^{d} a_m (x_{im} - x_{jm}) \qquad (4.16)$$

which can be interpreted as scaling every component of the original probability distribution
by a weight contained in the corresponding component of a, and then taking I-divergence
between the transformed vectors.
The HMRF framework requires defining an appropriate penalty violation function $\varphi$ that is symmetric, since the constraint pairs are unordered. To meet this requirement, a sum of weighted I-divergences from $x_i$ and $x_j$ to the mean vector $\frac{x_i + x_j}{2}$ is used. This parameterized I-divergence to the mean, $d_{IM_A}$, is equivalent to weighted Jensen-Shannon divergence (Cover & Thomas, 1991), the symmetric KL-divergence to the mean, and is defined as follows:

$$d_{IM_A}(x_i, x_j) = \sum_{m=1}^{d} a_m \left( x_{im} \log \frac{2x_{im}}{x_{im} + x_{jm}} + x_{jm} \log \frac{2x_{jm}}{x_{im} + x_{jm}} \right) \qquad (4.17)$$

Then, defining the constraint violation function $\varphi$ as $\varphi(x_i, x_j) = d_{IM_A}(x_i, x_j)$ yields the following objective function for semi-supervised clustering with parameterized I-divergence:

$$J_{I_A} = \sum_{x_i \in X} d_{I_A}(x_i, \mu_{y_i}) + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij}\, d_{IM_A}(x_i, x_j) + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \big(d^{\max}_{IM_A} - d_{IM_A}(x_i, x_j)\big) - \log \Pr(A) \qquad (4.18)$$

The upper bound $d^{\max}_{IM_A}$ can be initialized as $d^{\max}_{IM_A} = \sum_{m=1}^{d} a_m$, which follows from the fact that unweighted Jensen-Shannon divergence is bounded above by 1 (Lin, 1991).

¹For probability distributions, I-divergence and KL-divergence are equivalent.
As for cosine distance, the log ZΘ term is difficult to compute in closed form for
parameterized I-divergence (Banerjee et al., 2005a), so it is assumed to be constant during
the clustering process and therefore dropped from the objective function. Again, this as-
sumption is reasonable given an appropriate prior Pr(A), and experimentally we have not
observed problems with algorithm convergence due to this approximation.
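The two weighted divergences of Eqs. (4.16) and (4.17) can be sketched as follows (illustrative NumPy code assuming strictly positive vector components; the function names are ours, not part of the original implementation):

import numpy as np

def d_I_A(x, y, a):
    """Weighted I-divergence of Eq. (4.16); x and y are non-negative vectors
    (assumed strictly positive here to keep the logarithms finite)."""
    return float(np.sum(a * x * np.log(x / y)) - np.sum(a * (x - y)))

def d_IM_A(x, y, a):
    """Weighted I-divergence to the mean of Eq. (4.17), a weighted Jensen-Shannon
    divergence, used as the symmetric constraint-violation function."""
    m = x + y
    return float(np.sum(a * (x * np.log(2 * x / m) + y * np.log(2 * y / m))))

# example: both divergences are non-negative and d_IM_A is symmetric in its arguments
x = np.array([0.7, 0.2, 0.1])
y = np.array([0.1, 0.3, 0.6])
a = np.ones(3)
print(d_I_A(x, y, a), d_IM_A(x, y, a), d_IM_A(y, x, a))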

4.4 Learning Similarity Functions within the HMRF-KMeans Algorithm

Since the cluster assignments and the generative model parameters are unknown in a clus-
tering setting, minimizing the general objective function in Eq.(4.10) is an “incomplete-data
problem”. A popular solution technique for such problems is the Expectation-Maximization
(EM) algorithm (Dempster, Laird, & Rubin, 1977). The K-Means algorithm (MacQueen,
1967) is known to be equivalent to the EM algorithm with hard clustering assignments, un-
der certain assumptions (Kearns et al., 1997; Basu et al., 2002; Banerjee et al., 2005b). This
section describes a K-Means-type hard partitional clustering algorithm, HMRF-KM EANS,
that finds a local minimum of the semi-supervised clustering objective function Jobj in
Eq.(4.10).
The outline of the algorithm is presented in Fig. 4.3. The basic idea of HMRF-
KM EANS is as follows. First, the constraints are used to obtain a good initialization of the
cluster centroids. Then, in the E-step, given the current cluster representatives, every data
point is re-assigned to the cluster which minimizes its contribution to Jobj . In the M-step,
the cluster centroids $M = \{\mu_i\}_{i=1}^K$ are re-estimated given current assignments to minimize $J_{obj}$ for the current assignment of points to clusters. The clustering distortion measure $d_A$
is subsequently updated in the M-step to reduce the objective function by modifying the
parameters A of the distortion measure.

Algorithm: HMRF-KMEANS
Input: Set of data points X = {x_i}_{i=1}^n,
       set of constraints C,
       parameterized distortion measure d_A(·, ·),
       constraint violation costs W,
       desired number of clusters K.
Output: Disjoint K-partitioning {X_i}_{i=1}^K of X such that the objective
        function J_obj in Eq. (4.10) is (locally) minimized.
Method:
1. Initialize the K cluster centroids M^(0) = {µ_i^(0)}_{i=1}^K, set t ← 0
2. Repeat until convergence
   2a. E-step: Given centroids M^(t) and distortion parameters A^(t),
       re-assign cluster labels Y^(t+1) = {y_i^(t+1)}_{i=1}^n on X to minimize J_obj.
   2b. M-step(A): Given cluster labels Y^(t+1) and distortion parameters A^(t),
       re-calculate centroids M^(t+1) = {µ_i^(t+1)}_{i=1}^K to minimize J_obj.
   2c. M-step(B): Given cluster labels Y^(t+1) and centroids M^(t+1),
       re-estimate parameters A^(t+1) of the distortion measure to reduce J_obj.
   2d. t ← t+1

Figure 4.3: The HMRF-KMEANS algorithm

Note that this corresponds to the generalized EM algorithm (Dempster et al., 1977;
Neal & Hinton, 1998), where the objective function is reduced but not necessarily mini-
mized in the M-step. Effectively, the E-step minimizes Jobj over cluster assignments Y , the
M-step(A) minimizes Jobj over cluster centroids M , and the M-step(B) reduces Jobj over
the parameters A of the distortion measure dA . The E-step and the M-step are repeated until
a specified convergence criterion is reached.
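A schematic Python rendering of this generalized EM loop is given below for illustration only. The callables init_centroids, assign_labels, update_centroids, update_distortion_params and objective are placeholders for the constraint-based initialization, E-step, M-step(A), M-step(B) and objective computation described above; they are not part of the original implementation.

def hmrf_kmeans(X, K, constraints, A, objective, init_centroids, assign_labels,
                update_centroids, update_distortion_params, max_iter=100, tol=1e-4):
    """Schematic generalized-EM loop of HMRF-KMEANS (Fig. 4.3): alternate
    constraint-sensitive assignment (E-step), centroid re-estimation (M-step(A)),
    and distortion-parameter updates (M-step(B)) until the objective stabilizes."""
    M = init_centroids(X, K, constraints)                       # constraint-based initialization
    prev_J = float("inf")
    for _ in range(max_iter):
        Y = assign_labels(X, M, A, constraints)                 # E-step: minimize J_obj over labels
        M = update_centroids(X, Y, K)                           # M-step(A): minimize J_obj over centroids
        A = update_distortion_params(X, Y, M, A, constraints)   # M-step(B): reduce J_obj over A
        J = objective(X, Y, M, A, constraints)
        if prev_J - J < tol:                                    # stop when the objective no longer decreases
            break
        prev_J = J
    return Y, M, A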
Detailed discussion of the initialization, E-step, and M-step(A) of the algorithm
along with the proof of convergence can be found in (Basu, 2005), while in this section we
focus on M-step(B) where the distortion measure parameters are updated to decrease the
objective function.
For certain distortion measure parameterizations, minimization via taking partial derivatives and solving for the parameter values may be feasible, e.g., for squared Euclidean distance with uniform parameter priors (Bilenko et al., 2004), in which case the weight matrix $A$ is obtained in M-Step(B) as:

$$A = |X| \Bigg( \sum_{x_i \in X} (x_i - \mu_{y_i})(x_i - \mu_{y_i})^T + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij} (x_i - x_j)(x_i - x_j)^T + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \Big( \sum_{(x'_i, x'_j) \in C_{CL}} (x'_i - x'_j)(x'_i - x'_j)^T - (x_i - x_j)(x_i - x_j)^T \Big) \Bigg)^{-1} \qquad (4.19)$$

Since the weight matrix $A$ is obtained by inverting the summation of covariance matrices in Eq. (4.19), that summation (corresponding to $\frac{1}{|X|} A^{-1}$) must not be singular. If at any iteration the summation is singular, it can be conditioned via adding the identity matrix multiplied by a small fraction of the trace of $A^{-1}$: $A^{-1} = A^{-1} + \varepsilon\, \mathrm{tr}(A^{-1})\, I$. If the weight matrix $A$ resulting from the inversion is negative definite, it is mended by projecting on the set $C = \{A : A \succeq 0\}$ of positive semi-definite matrices, to ensure that the squared Euclidean distance parameterized by $A$ is a Mahalanobis distance (Golub & van Loan, 1989).
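The two numerical safeguards just described can be sketched as follows (illustrative NumPy code; the function names are ours, and eigenvalue clipping is one standard way to realize the projection onto the positive semidefinite cone mentioned in the text):

import numpy as np

def condition_and_invert(S, eps=1e-6):
    """Given the summed scatter matrix S (proportional to A^{-1} in Eq. (4.19)),
    condition it when (near-)singular by adding eps * tr(S) * I, then invert."""
    if np.linalg.matrix_rank(S) < S.shape[0]:
        S = S + eps * np.trace(S) * np.eye(S.shape[0])
    return np.linalg.inv(S)

def project_psd(A):
    """Project a symmetric matrix onto the set of positive semidefinite matrices
    by zeroing out negative eigenvalues, so that d_euc_A stays a Mahalanobis distance."""
    A = (A + A.T) / 2.0                       # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

# toy usage on a singular scatter matrix
S = np.array([[2.0, 0.0], [0.0, 0.0]])
A = project_psd(condition_and_invert(S))
print(A)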
In general, for parameterized Bregman divergences or directional distances with
general parameter priors, it is difficult to obtain a closed form update for the parameters of
the distortion measure that can minimize the objective function. Gradient descent provides
an alternative avenue for learning the distortion measure parameters.
For squared Euclidean distance, a full parameter matrix $A$ is updated during gradient descent using the rule $A = A - \eta \frac{\partial J_{euc_A}}{\partial A}$ (where $\eta$ is the learning rate). Using Eq. (4.13), $\frac{\partial J_{euc_A}}{\partial A}$ can be expressed as:

$$\frac{\partial J_{euc_A}}{\partial A} = \sum_{x_i \in X} \frac{\partial d_{euc_A}(x_i, \mu_{y_i})}{\partial A} + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij} \frac{\partial d_{euc_A}(x_i, x_j)}{\partial A} + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \left( \frac{\partial \varphi^{\max}_{euc_A}}{\partial A} - \frac{\partial d_{euc_A}(x_i, x_j)}{\partial A} \right) - \frac{\partial \log \Pr(A)}{\partial A} - n \frac{\partial \log \det(A)}{\partial A} \qquad (4.20)$$

The gradient of the parameterized squared Euclidean distance is given by:

$$\frac{\partial d_{euc_A}(x_i, x_j)}{\partial A} = (x_i - x_j)(x_i - x_j)^T$$

The derivative of the upper bound $\varphi^{\max}_{euc_A}$ is $\frac{\partial \varphi^{\max}_{euc_A}}{\partial A} = \sum_{(x_i, x_j) \in C_{CL}} (x_i - x_j)(x_i - x_j)^T$ if $\varphi^{\max}_{euc_A}$ is computed as described in Section 4.3.2. In practice, one can initialize $\varphi^{\max}_{euc_A}$ with a sufficiently large constant, which would make its derivative zero. Accordingly, an extra condition must then be inserted into the algorithm to guarantee that penalties for violated cannot-link constraints are never negative, in which case the constant must be increased.
When Rayleigh priors are used on the set of parameters $A$, the partial derivative of the log-prior with respect to every individual parameter $a_m \in A$, $\frac{\partial \log \Pr(A)}{\partial a_m}$, is given by:

$$\frac{\partial \log \Pr(A)}{\partial a_m} = \frac{1}{a_m} - \frac{a_m}{s^2} \qquad (4.21)$$

The gradient of the distortion normalizer $\log \det(A)$ term is as follows:

$$\frac{\partial \log \det(A)}{\partial A} = 2A^{-1} - \mathrm{diag}(A^{-1}) \qquad (4.22)$$

For parameterized cosine distance and KL divergence, a diagonal parameter matrix $A$ is considered, where $a = \mathrm{diag}(A)$ is a vector of positive weights. During gradient descent, each weight $a_m$ is updated individually: $a_m = a_m - \eta \frac{\partial J_{obj}}{\partial a_m}$ ($\eta$ is the learning rate). Using Eq. (4.10), $\frac{\partial J_{obj}}{\partial a_m}$ can be expressed as:

$$\frac{\partial J_{obj}}{\partial a_m} = \sum_{x_i \in X} \frac{\partial d_A(x_i, \mu_{y_i})}{\partial a_m} + \sum_{\substack{(x_i, x_j) \in C_{ML} \\ \text{s.t. } y_i \ne y_j}} w_{ij} \frac{\partial \varphi(x_i, x_j)}{\partial a_m} + \sum_{\substack{(x_i, x_j) \in C_{CL} \\ \text{s.t. } y_i = y_j}} w_{ij} \left( \frac{\partial \varphi_{\max}}{\partial a_m} - \frac{\partial \varphi(x_i, x_j)}{\partial a_m} \right) - \frac{\partial \log \Pr(A)}{\partial a_m} \qquad (4.23)$$

The gradients of the corresponding distortion measures and constraint potential functions for parameterized cosine distance and KL divergence are the following:

$$\frac{\partial d_{cos_A}(x_i, x_j)}{\partial a_m} = -\frac{x_{im} x_{jm} \|x_i\|_A \|x_j\|_A - x_i^T A x_j\, \frac{x_{im}^2 \|x_j\|_A^2 + x_{jm}^2 \|x_i\|_A^2}{2\|x_i\|_A \|x_j\|_A}}{\|x_i\|_A^2 \|x_j\|_A^2}$$

$$\frac{\partial d_{I_A}(x_i, x_j)}{\partial a_m} = x_{im} \log \frac{x_{im}}{x_{jm}} - (x_{im} - x_{jm})$$

$$\frac{\partial d_{IM_A}(x_i, x_j)}{\partial a_m} = x_{im} \log \frac{2x_{im}}{x_{im} + x_{jm}} + x_{jm} \log \frac{2x_{jm}}{x_{im} + x_{jm}} \qquad (4.24)$$

while the gradient of the upper bound, $\frac{\partial \varphi_{\max}}{\partial a_m}$, is 0 for parameterized cosine distance and 1 for parameterized KL divergence, as follows from the expressions for these constants in Sections 4.3.3 and 4.3.4.
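Putting the pieces together, a single M-step(B) update of the diagonal weights for the weighted cosine distance might look as follows (an illustrative sketch under the choices phi = d_cos_A, phi_max = 1, and a Rayleigh prior with width s; the helper names and the simple clipping that keeps the weights positive are our assumptions, not part of the original implementation):

import numpy as np

def grad_d_cos(x, y, a):
    """Gradient of the weighted cosine distance d_cos_A(x, y) with respect to the
    diagonal weights a (Eq. (4.24))."""
    nx = np.sqrt(np.sum(a * x * x))
    ny = np.sqrt(np.sum(a * y * y))
    xay = np.sum(a * x * y)
    num = x * y * nx * ny - xay * (x * x * ny**2 + y * y * nx**2) / (2 * nx * ny)
    return -num / (nx**2 * ny**2)

def gradient_step(X, labels, centroids, a, ML, CL, w, eta=0.08, s=1.0):
    """One gradient-descent step on the diagonal weights a for the cosine-distance
    objective (Eq. (4.23)); since phi_max = 1, its derivative is zero."""
    g = np.zeros_like(a, dtype=float)
    for i in range(len(X)):                              # unlabeled-data term
        g += grad_d_cos(X[i], centroids[labels[i]], a)
    for (i, j) in ML:                                    # violated must-links
        if labels[i] != labels[j]:
            g += w[(i, j)] * grad_d_cos(X[i], X[j], a)
    for (i, j) in CL:                                    # violated cannot-links
        if labels[i] == labels[j]:
            g -= w[(i, j)] * grad_d_cos(X[i], X[j], a)
    g -= (1.0 / a - a / s**2)                            # minus the Rayleigh log-prior gradient of Eq. (4.21)
    a_new = a - eta * g                                  # descent step on J_obj
    return np.maximum(a_new, 1e-8)                       # keep the weights positive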
For all distortion metrics, individual similarity function parameters can be learned
for each cluster, allowing the clusters to lie in different subspaces. To implement cluster-
specific similarity function learning, the above updates should be based only on points
belonging to the cluster, ignoring the rest of the dataset.
Overall, the distance learning step results in modifying the distortion measure pa-
rameters so that data points in violated must-link constraints are brought closer together,
while points in violated cannot-link constraints are pulled apart, and each dimension is
scaled proportionally to data variance. This process leads to a transformed data space that
facilitates partitioning of the unlabeled data by attempting to mend the constraint violations
while capturing the natural variance in the data.

4.5 Experimental Results

This section describes the experiments that were performed to demonstrate the effectiveness
of using learnable similarity functions within HMRF-KM EANS.

4.5.1 Datasets

Experiments were run on both low-dimensional and high-dimensional datasets to eval-


uate the HMRF-KM EANS framework with different distortion measures. For the low-
dimensional datasets, on which squared Euclidean distance was used as the distortion mea-
sure, the following datasets were considered:

• Three datasets from the UCI repository: Iris, Wine, and Ionosphere (Blake & Merz, 1998);

• The Protein dataset used by Xing et al. (2003) and Bar-Hillel et al. (2003);

• Randomly sampled subsets from the Digits and Letters handwritten character recognition datasets, also from the UCI repository. For Digits and Letters, two sets of three classes were chosen: {I, J, L} from Letters and {3, 8, 9} from Digits, sampling 10% of the data points from the original datasets randomly. These classes were chosen since they represent difficult visual discrimination problems.

Table 4.1 summarizes the properties of the low-dimensional datasets: the number
of instances, the number of dimensions, and the number of classes.

Table 4.1: Low-dimensional datasets used in experimental evaluation

              Iris   Wine   Ionosphere   Protein   Letters   Digits
Instances      150    178          351       116       227      317
Dimensions       4     13           34        20        16       16
Classes          3      3            2         6         3        3

For the high-dimensional text data, 3 datasets that have the characteristics of being sparse, high-dimensional, and having a small number of points compared to the dimensionality of the space were considered. This is done for two reasons:

• When clustering sparse high-dimensional data, e.g., text documents represented using the vector space model, it is particularly difficult to cluster small datasets, as observed by Dhillon and Guan (2003). The purpose of performing experiments on these subsets is to scale down the sizes of the datasets for computational reasons but at the same time not scale down the difficulty of the tasks.

• Clustering a small number of sparse high-dimensional data points is a likely scenario in realistic applications. For example, when clustering the search results in a web-search engine like Vivísimo², the number of webpages that are being clustered is typically in the order of hundreds. However, the dimensionality of the feature space, corresponding to the number of unique words in all the webpages, is in the order of thousands. Moreover, each webpage is sparse, since it contains only a small number of all the possible words. On such datasets, clustering algorithms can easily get stuck in local optima: in such cases it has been observed that there is little relocation of documents between clusters for most initializations, which leads to poor clustering quality after convergence of the algorithm (Dhillon & Guan, 2003). Supervision in the form of pairwise constraints is most beneficial in such cases and may significantly improve clustering quality.

Three datasets were derived from the 20-Newsgroups collection.3 This collection
has messages harvested from 20 different Usenet newsgroups, 1000 messages from each
newsgroup. From the original dataset, a reduced dataset was created by taking a random
subsample of 100 documents from each of the 20 newsgroups. Three datasets were cre-
ated by selecting 3 categories from the reduced collection. News-Similar-3 consists of 3
newsgroups on similar topics (comp.graphics, comp.os.ms-windows, comp.windows.x)
with significant overlap between clusters due to cross-posting. News-Related-3 consists
of 3 newsgroups on related topics (talk.politics.misc, talk.politics.guns, and
2 http://www.vivisimo.com
3 http://www.ai.mit.edu/people/jrennie/20Newsgroups

talk.politics.mideast). News-Different-3 consists of articles posted in 3 newsgroups
that cover different topics (alt.atheism, rec.sport.baseball, sci.space) with well-
separated clusters. All the text datasets were converted to the vector-space model by tok-
enization, stop-word removal, TF-IDF weighting, and removal of very high-frequency and
low-frequency words, following the methodology of Dhillon and Modha (2001). Table 4.2
summarizes the properties of the high-dimensional datasets.

Table 4.2: High-dimensional datasets used in experimental evaluation

              News-Different-3   News-Related-3   News-Similar-3
Instances                  300              300              300
Dimensions                3251             3225             1864
Classes                      3                3                3

4.5.2 Clustering Evaluation

Normalized mutual information (NMI) was used as the clustering evaluation measure. NMI
is an external clustering validation metric that estimates the quality of the clustering with
respect to a given underlying class labeling of the data: it measures how closely the cluster-
ing algorithm could reconstruct the underlying label distribution in the data (Strehl, Ghosh,
& Mooney, 2000). If Ŷ is the random variable denoting the cluster assignments of the points
and Y is the random variable denoting the underlying class labels on the points, then the
NMI measure is defined as:

$$NMI = \frac{I(Y; \hat{Y})}{\big(H(Y) + H(\hat{Y})\big)/2} \qquad (4.25)$$

where $I(X; Y) = H(X) - H(X \mid Y)$ is the mutual information between the random variables $X$ and $Y$, $H(X)$ is the Shannon entropy of $X$, and $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$ (Cover & Thomas, 1991). NMI effectively measures the amount of statistical
information shared by the random variables representing the cluster assignments and the

user-labeled class assignments of the data points. Though various clustering evaluation
measures have been used in the literature, NMI and its variants have become popular lately
among clustering practitioners (Dom, 2001; Fern & Brodley, 2003; Meila, 2003).
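A minimal NumPy implementation of this measure is sketched below for concreteness (natural logarithms are used, and the base cancels in the ratio); it is an illustrative re-implementation, not the evaluation code used in the experiments.

import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized mutual information of Eq. (4.25): I(Y; Y_hat) divided by the
    average of the entropies H(Y) and H(Y_hat)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    eps = 1e-12
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    # joint distribution over (class, cluster) pairs and its marginals
    joint = np.array([[np.sum((labels_true == c) & (labels_pred == k)) / n
                       for k in clusters] for c in classes])
    p_true, p_pred = joint.sum(axis=1), joint.sum(axis=0)
    h_true = -np.sum(p_true * np.log(p_true + eps))
    h_pred = -np.sum(p_pred * np.log(p_pred + eps))
    mi = np.sum(joint * (np.log(joint + eps) - np.log(np.outer(p_true, p_pred) + eps)))
    return mi / ((h_true + h_pred) / 2.0)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # identical partitions up to renaming: 1.0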

4.5.3 Methodology

Learning curves were generated using two-fold cross-validation performed over 20 runs on
each dataset. In every trial, 50% of the dataset was set aside as the training fold. Every point
on the learning curve corresponds to the number of constraints on pairs of data points from
the training fold. These constraints are obtained by randomly selecting pairs of points from
the training fold and creating must-link or cannot-link constraints depending on whether the
underlying classes of the two points are the same or different. Unit constraint costs W were
used for all constraints (original and inferred), since the datasets did not provide individual
weights for the constraints. The gradient step size η for learning the distortion measure
parameters and the Rayleigh prior width parameter s were set based on pilot studies. The
gradient step size was set to $\eta = 100.0$ for clustering with weighted cosine distance $d_{cos_A}$ and to $\eta = 0.08$ for weighted I-divergence $d_{I_A}$. The Rayleigh prior width parameter was set
to s = 1. In a real-life setting, the free parameters of the algorithm could be tuned using
cross-validation with a hold-out set. The clustering algorithm was run on the whole dataset,
but NMI was calculated using points in the test fold.
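The constraint-generation procedure just described can be summarized by the following sketch (illustrative Python; the function and variable names are ours, and duplicate pairs are possible in this simple version).

import random

def sample_constraints(labels, train_idx, num_constraints, seed=0):
    """Generate pairwise supervision as in the experimental methodology: sample
    random pairs from the training fold and label each pair must-link or
    cannot-link according to whether the two points share an underlying class."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    while len(must_link) + len(cannot_link) < num_constraints:
        i, j = rng.sample(train_idx, 2)
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

# example: 10 constraints from a toy labeling of 6 training points
ML, CL = sample_constraints([0, 0, 1, 1, 2, 2], train_idx=list(range(6)), num_constraints=10)
print(len(ML), len(CL))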
Sensitivity experiments were performed with HMRF-KM EANS to study the ef-
fectiveness of employing learnable similarity functions. The proposed HMRF-KM EANS
algorithm was compared with three ablations, as well as with unsupervised K-Means clus-
tering. The following variants were compared for distortion measures $d_{cos_A}$, $d_{I_A}$, and $d_{euc_A}$:

• HMRF-KMEANS-C-D-R is the complete HMRF-KMEANS algorithm that incorporates constraints in cluster assignments (C), performs distortion measure learning (D), and also performs regularization (R) using a Rayleigh prior as described in Section 4.3.1;

• HMRF-KMEANS-C-D is the first ablation of HMRF-KMEANS that includes all components except for regularization of distortion measure parameters;

• HMRF-KMEANS-C is an ablation of HMRF-KMEANS that uses pairwise supervision for initialization and cluster assignments, but does not perform distortion measure learning;

• RCA-KMEANS is the K-Means algorithm that uses distortion measure parameters learned with the Relevant Components Analysis (RCA) algorithm of Bar-Hillel et al. (2003);

• KMEANS is the unsupervised K-Means algorithm.

The goal of these experiments was to evaluate the utility of distortion measure learning within the HMRF framework and to identify settings in which particular components are beneficial. For low-dimensional datasets, we also compared several distinct possibilities for parameterizing the distance metric $d_{euc_A}$:

• HMRF-KMEANS-C-D-R is the complete HMRF-KMEANS algorithm that learns a single diagonal weight matrix for the entire dataset ($A$ is diagonal and identical for all clusters);

• HMRF-KMEANS-C-D-R-M is the complete HMRF-KMEANS algorithm that learns $K$ diagonal weight matrices $A_1, \ldots, A_K$, so that each cluster corresponds to a distinct similarity function;

• HMRF-KMEANS-C-D-R-Full is the complete HMRF-KMEANS algorithm that learns a single fully-parameterized Mahalanobis distance: $A$ is a $d \times d$ positive-definite matrix that is identical for all clusters.

The goal of these experiments is to study the utility of learning a full parameteri-
zation of the similarity function (effectively training a Mahalanobis distance) versus only
using a diagonal parameterization (learning weights for a Euclidean distance), since the

latter is significantly cheaper computationally. Results obtained with learning individual
similarity functions for each cluster illustrate the utility of allowing different clusters to lie
in different subspaces, as opposed to learning a single set of similarity function parameters
for the entire dataset.

4.5.4 Results and Discussion

Low-dimensional datasets: Figures 4.4-4.15 show learning curves for the ablation ex-
periments on the six low-dimensional datasets. Across all datasets, the overall HMRF-
KM EANS approach without regularization (KM EANS -C-D) outperforms the constraints-
only ablation and unsupervised KMeans. Since the performance of KM EANS -C-D-R is
not substantially different from KM EANS -C-D, it can be concluded that regularization does
not lead to performance improvements on low-dimensional datasets. This can be explained
by the fact that the number of distortion measure parameters is small for low-dimensional
domains while estimates obtained from data do not have high variance, and therefore incor-
porating a prior in the probabilistic model is not necessary.
For the Wine, Protein, and Digits-389 datasets, the difference between ablations
that utilize metric learning (KM EANS -C-D-R and KM EANS -C-D) and those that do not
(KM EANS -C and KM EANS) at the beginning of the learning curve indicates that even in
the absence of constraints, weighting features by their variance (essentially using unsuper-
vised Mahalanobis distance) improves clustering accuracy. For the Wine dataset, additional
constraints provide only an insubstantial improvement in cluster quality, which
shows that meaningful feature weights are obtained from scaling by variance using just the
unlabeled data.
Figure 4.4: Results for $d_{euc}$ on the Iris dataset

Figure 4.5: Results for $d_{euc}$ on the Iris dataset with full and per-cluster parameterizations

Figure 4.6: Results for $d_{euc}$ on the Wine dataset

Figure 4.7: Results for $d_{euc}$ on the Wine dataset with full and per-cluster parameterizations

Figure 4.8: Results for $d_{euc}$ on the Protein dataset

Figure 4.9: Results for $d_{euc}$ on the Protein dataset with full and per-cluster parameterizations

Figure 4.10: Results for $d_{euc}$ on the Ionosphere dataset

Figure 4.11: Results for $d_{euc}$ on the Ionosphere dataset with full and per-cluster parameterizations

Figure 4.12: Results for $d_{euc}$ on the Digits-389 dataset

Figure 4.13: Results for $d_{euc}$ on the Digits-389 dataset with full and per-cluster parameterizations

Figure 4.14: Results for $d_{euc}$ on the Letters-IJL dataset

Figure 4.15: Results for $d_{euc}$ on the Letters-IJL dataset with full and per-cluster parameterizations

(Each of Figures 4.4-4.15 plots NMI against the number of pairwise constraints.)

Comparing the performance of different variants of HMRF-KMEANS with RCA, we can see that the ability to embed similarity function learning within the clustering algorithm leads to significantly better results for HMRF-KMEANS. This is explained by
the fact that RCA utilizes only the pairwise constraints for learning the similarity function
parameters, while HMRF-KM EANS uses both the constraints and the unlabeled data, ad-
justing the parameters gradually in the course of clustering.
The results for learning full-matrix and per-cluster parameterizations of the similar-
ity function demonstrate that both of these extensions can lead to significant improvements
in clustering quality. However, the relative usefulness of these two techniques varies be-
tween the datasets. Multiple similarity functions are beneficial for all datasets except for
Protein where they did not affect the results, and Iris, where they had a negative effect. Us-
ing the full matrix parameterization also did not affect Protein results, and had a negative
effect on Digits, while it improved results on the other four datasets. This inconsistency can
be explained by the fact that the relative success of the two techniques depends on the prop-
erties of a particular dataset: using a full weight matrix helps when the features are highly
correlated, while using per-cluster parameterization lead to improvements when clusters in
the dataset are of different shapes or lie in different subspaces of the original space. A
combination of the two techniques is most helpful when both of these requirements are
satisfied, as for Wine and Letters, which was observed by visualizing low-dimensional pro-
jections of these datasets. For other datasets with the exception of Protein, either per-cluster
parameterization or the full weight matrix leads to maximum performance in isolation.
Some of the HMRF-KM EANS learning curves display a characteristic “dip”, where
clustering accuracy decreases as a few initial constraints are provided, but after a certain
point starts to increase and eventually rises above the initial point on the learning curve.
One possible explanation of this phenomenon is overfitting: having just a few constraints
provides unreliable supervision, forcing the algorithm to converge to inferior local op-
tima, while increasing the number of provided constraints allows overcoming this effect.
Overall, when both constraints and distortion measure learning are utilized, the unified ap-
proach benefits from the individual strengths of the two methods, as can be seen from the
KM EANS -C-D results.

Figure 4.16: Results for $d_{cos_A}$ on the News-Different-3 dataset

Figure 4.17: Results for $d_{I_A}$ on the News-Different-3 dataset

Figure 4.18: Results for $d_{cos_A}$ on the News-Related-3 dataset

Figure 4.19: Results for $d_{I_A}$ on the News-Related-3 dataset

Figure 4.20: Results for $d_{cos_A}$ on the News-Similar-3 dataset

Figure 4.21: Results for $d_{I_A}$ on the News-Similar-3 dataset

(Each of Figures 4.16-4.21 plots NMI against the number of pairwise constraints.)
High-dimensional datasets: Figures 4.16, 4.18 and 4.20 present the results for
the ablation experiments where weighted cosine similarity dcosA was used as the distortion
measure, while Figures 4.17, 4.19 and 4.21 summarize experiments where weighted I di-
vergence dIA was used.
As the results demonstrate, the full HMRF-KM EANS algorithm with regularization
(KM EANS -C-D-R) outperforms the unsupervised K-Means baseline as well as the ablated
versions of the algorithm for both distortion measures dcosA and dIA . As can be seen from
results for zero pairwise constraints in Figs. 4.16-4.21, distortion measure learning is bene-
ficial even in the absence of any pairwise constraints, since it allows capturing the relative
importance of the different attributes in the unsupervised data. In the absence of super-
vised data or when no constraints are violated, distance learning attempts to minimize the
objective function by adjusting the weights given the distortion between the unsupervised
datapoints and their corresponding cluster representatives.
For high-dimensional datasets, regularization is clearly beneficial to performance,
as can be seen from the improved performance of KM EANS -C-D-R over KM EANS -C-D
on all datasets. This can be explained by the fact that the number of distortion measure
parameters is large for high-dimensional datasets, and therefore algorithm-based estimates
of parameters tend to be unreliable unless they incorporate a prior.
Overall, the experimental results demonstrate that learning similarity functions within the HMRF-KMEANS algorithm leads to significant improvements in clustering accuracy, effectively exploiting both supervision in the form of pairwise constraints and the unsupervised data.

4.6 Related Work

Several semi-supervised clustering approaches have been proposed that incorporate adaptive dis-
tortion measures, including parameterizations of Jensen-Shannon divergence (Cohn et al.,
2003) as well as Euclidean and Mahalanobis distances (Klein, Kamvar, & Manning, 2002;

Bar-Hillel et al., 2003; Xing et al., 2003). These techniques use only constraints to learn the distortion measure parameters and ignore unlabeled data in the parameter learning step; they also separate training of the similarity function from the clustering process.
In contrast, the HMRF model provides an integrated framework which incorpo-
rates both learning the distortion measure parameters and constraint-sensitive cluster as-
signments. In HMRF-KM EANS, the parameters of the similarity function are learned iter-
atively as the clustering progresses, utilizing both unlabeled data and pairwise constraints.
The parameters are modified to decrease the parameterized distance between points in violated must-link constraints and increase it between points in violated cannot-link constraints, while allowing
constraint violations if they accompany a more cohesive clustering.

4.7 Chapter Summary

This chapter has demonstrated the utility of learnable similarity functions in semi-supervised
clustering, and presented a general approach for employing them within a general proba-
bilistic framework based on Hidden Markov Random Fields (HMRFs). The framework
accommodates a broad class of similarity functions (Bregman divergences), as well as di-
rectional measures such as cosine distance, making it applicable to a wide variety of do-
mains.
The framework yields an EM-style clustering algorithm, HMRF-KM EANS, that
maximizes the joint probability of observed data points, their cluster assignments, and dis-
tortion measure parameters. The fact that the similarity functions are trained within the
clustering algorithm allows utilizing both labeled and unlabeled data in learning similarity
function parameters, which leads to results that are superior to learning similarity functions
in isolation.

Chapter 5

Learnable Similarity Functions in Blocking

In this chapter, we show how learnable similarity functions can be employed not only for
improving the accuracy of tasks that rely on pairwise similarity computations, but also
for improving their scalability. We introduce an adaptive framework for learning blocking
functions that are efficient and accurate for a given domain by automatically constructing
them from combinations of blocking predicates. Our approach allows formulating this task
as an instance of the Red-Blue Set Cover problem, approximation algorithms for which can
be used for learning blocking functions.

5.1 Motivation

As discussed in Section 2.4, intelligent data analysis tasks that rely on computing pairwise
similarities require blocking methods for scaling up to large datasets due to the quadratic
number of instance pairs in a given dataset. Manual selection of fields and parameter tuning
are required by all existing blocking strategies to reduce the number of returned dissimilar
pairs while retaining the similar pairs.

Since an appropriate blocking strategy can be highly domain-dependent, the ad-hoc
construction and manual tuning of blocking methods is difficult. They may lead to over-
selection of many dissimilar pairs which impedes efficiency, or, worse, under-selection of
important similar pairs which decreases accuracy. Because there can be many potentially
useful blocking criteria over multiple object attributes, there is a need for automating the
process of constructing blocking strategies so that all or nearly all same-entity or same-
cluster pairs are retained while the maximum number of dissimilar pairs is discarded.
In subsequent sections, we formalize the problem of learning an optimal blocking
strategy using training data. In many record linkage domains, some fraction of instances
contains true entity identifiers, e.g., UPC (bar code) numbers for retail products, SSN num-
bers for individuals, or DOI identifiers for citations. Presence of such labeled data allows
evaluating possible blocking functions and selecting from them one that is optimal, that is,
one that selects all or nearly all positive record pairs (that refer to the same entity), and a
minimal number of negative pairs (that refer to different entities).
We propose to construct blocking functions based on sets of general blocking pred-
icates which efficiently select all instance pairs that satisfy some binary similarity criterion.
Figure 5.1 contains examples of predicates for specific record fields in different domains.
We formulate the problem of learning an optimal blocking function as the task of finding
a combination of blocking predicates that captures all or nearly all coreferent object pairs
and a minimal number of non-coreferent pairs. Our approach is general in the sense that we
do not place restrictions on the similarity functions computed on instance pairs selected
by blocking, such as requiring them to be an inner product or to correspond to a distance
metric.

Domain                   Blocking Predicate
Census Data              Same 1st Three Chars in Last Name
Product Normalization    Common token in Manufacturer
Citations                Publication Year same or off-by-one

Figure 5.1: Examples of blocking functions from different record linkage domains

We consider two types of blocking functions: (1) disjunctions of blocking pred-
icates, and (2) predicates combined in disjunctive normal form (DNF). While finding a
globally optimal solution for these formulations is NP-hard, we describe an effective ap-
proximation method for them and discuss implementation issues. Empirical evaluation on
synthetic and real-world record linkage datasets demonstrates the efficiency of our tech-
niques.

5.2 Adaptive Blocking Formulation

Let us formally define the problem of learning an optimal blocking function. We assume
that a training dataset $\mathcal{D}_{train} = \{X, Y\}$ is available that includes a set $X = \{x_i\}_{i=1}^n$ of $n$
records known to refer to $m$ true objects: $Y = \{y_i\}_{i=1}^n$, where each $y_i$ is the true object
identifier for the $i$-th record: $y_i \in \{1, \ldots, m\}$. Each record $x_i$ may have one or more fields.
We assume that a set of $s$ general blocking predicates $\{p_i\}_{i=1}^s$ is available, where
each predicate $p_i$ corresponds to two functions (a code sketch of one such predicate follows
these definitions):

• Indexing function $h_i(\cdot)$ is a unary function that is applied to a field value from some
domain $Dom(h_i)$ (e.g., strings, integers, or categories) and generates one or more keys
for the field value: $h_i : Dom(h_i) \rightarrow U^*$, where $U$ is the set of all possible keys;

• Equality function $p_i(\cdot, \cdot)$ returns 1 if the intersection of the key sets produced by
the indexing function on its arguments is non-empty, and returns zero otherwise:
$p_i(x_j, x_k) = 1$ iff $h_i(x_j) \cap h_i(x_k) \neq \emptyset$. Any pair $(x_j, x_k)$ for which $p_i(x_j, x_k) = 1$ is
covered by the predicate $p_i$.
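To make these two functions concrete, the following Python sketch (with hypothetical names, not code from the thesis) implements a "Contain Common Token" predicate as an indexing function that maps a field value to a set of keys, together with an equality function defined through key-set intersection:

import re

def contain_common_token_keys(field_value):
    # Indexing function h_i: lower-case the value, strip punctuation,
    # and return its set of tokens as blocking keys.
    cleaned = re.sub(r"[^\w\s]", " ", field_value.lower())
    return set(cleaned.split())

def covers(index_fn, x_j, x_k):
    # Equality function p_i: 1 iff the key sets of the two field values
    # intersect, i.e., the pair (x_j, x_k) is covered by the predicate.
    return int(bool(index_fn(x_j) & index_fn(x_k)))

# The pair below is covered because the two strings share several tokens.
print(covers(contain_common_token_keys,
             "Boosting a weak learning algorithm by majority",
             "Boosting a Weak Learner by Majority"))   # prints 1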

Each general blocking predicate can be instantiated for a particular field (or a com-
bination of fields) in a given domain, resulting in several specific blocking predicates for the
domain. Given a database with $d$ fields and a set of $s$ general blocking predicates, we obtain
$t \leq s \cdot d$ specific predicates $\mathcal{P} = \{p_i\}_{i=1}^t$ by applying the general predicates to all fields of
Sample record:
  author: Freund, Y.   year: (1995)   title: Boosting a weak learning algorithm by majority
  venue: Information and Computation   other: 121(2), 256-285

Blocking predicates and key sets produced by their indexing functions for the record
(per field: author, title, venue, year, other):

Contain Common Token:
  author: {freund, y}   title: {boosting, a, weak, learning, algorithm, by, majority}
  venue: {information, and, computation}   year: {1995}   other: {121, 2, 256, 285}

Exact Match:
  author: {'freund y'}   title: {'boosting a weak learning algorithm by majority'}
  venue: {'information and computation'}   year: {'1995'}   other: {'121 2 256 285'}

Same 1st Three Chars:
  author: {fre}   title: {boo}   venue: {inf}   year: {199}   other: {121}

Contains Same or Off-By-One Integer:
  author: {}   title: {}   venue: {}   year: {1994 1995, 1995 1996}
  other: {120 121, 121 122, 1 2, 2 3, 255 256, 256 257, 284 285, 285 286}

Figure 5.2: Blocking key values for a sample record

the appropriate type. For example, suppose we have four general predicates defined for all
textual fields: “Contain Common Token”, “Exact Match”, “Same 1st Three Chars”, and
“Contains Same or Off-By-One Integer”. When these general predicates are instantiated
for the bibliographic citation domain with five textual fields (author, title, venue, year, and
other), we obtain $5 \times 4 = 20$ specific blocking predicates for this domain. Figure 5.2 demon-
strates the values produced by the indexing functions of these specific blocking predicates
on a sample citation record (we assume that all strings are converted to lower-case and
punctuation is removed before the application of the indexing functions).
Multiple blocking predicates are combined by an overall blocking function $f_{\mathcal{P}}$ con-
structed using the set $\mathcal{P}$ of predicates. Like the individual predicates, $f_{\mathcal{P}}$ corresponds to an
indexing function that can be applied to any record, and an equality function for any pair of
records. Pairs for which this equality function returns 1 are covered: they comprise the set
of candidate pairs returned for subsequent similarity computation, while pairs for which the
blocking function returns 0 are ignored (uncovered). Efficient generation of the set of can-
didate pairs requires computing the indexing function for all records, followed by retrieval
of all candidate pairs using inverted indices.
Given the set $\mathcal{P} = \{p_i\}_{i=1}^t$ containing $t$ specific blocking predicates, the objective
of the adaptive blocking framework is to identify an optimal blocking function $f^*_{\mathcal{P}}$ that
combines all or a subset of the predicates in $\mathcal{P}$ so that the set of candidate pairs it returns
contains all or nearly all coreferent (positive) record pairs and a minimal number of non-
coreferent (negative) record pairs.
Formally, this objective can be expressed as follows:

$$f^*_{\mathcal{P}} = \arg\min_{f_{\mathcal{P}}} \sum_{(x_i, x_j) \in \mathcal{R}} f_{\mathcal{P}}(x_i, x_j) \qquad (5.1)$$
$$\text{s.t.} \quad |\mathcal{B}| - \sum_{(x_i, x_j) \in \mathcal{B}} f_{\mathcal{P}}(x_i, x_j) < \varepsilon$$

where $\mathcal{R} = \{(x_i, x_j) : y_i \neq y_j\}$ is the set of non-coreferent pairs, $\mathcal{B} = \{(x_i, x_j) : y_i = y_j\}$ is
the set of coreferent pairs, and $\varepsilon$ is a small value indicating that up to $\varepsilon$ coreferent pairs may
remain uncovered, thus accommodating noise and particularly difficult coreferent pairs.
The optimal blocking function $f^*_{\mathcal{P}}$ must be found in a hypothesis space that corresponds to
some method of combining the individual blocking predicates. In this chapter, we consider
two classes of blocking functions:

• Disjunctive blocking selects record pairs that are covered by at least one blocking
predicate from the subset of predicates that comprise the blocking function. This
strategy can be viewed as covering pairs for which the equality function of at least
one of the selected predicates returns 1. The blocking function is trained by selecting
a subset of blocking predicates from $\mathcal{P}$.

• Disjunctive Normal Form (DNF) blocking selects object pairs that are covered by
at least one conjunction of blocking predicates from a constructed set of conjunctions.
This strategy can be viewed as covering record pairs for which at least one equality
function of a conjunction of predicates returns 1. The blocking function is trained by
constructing a DNF formula from the blocking predicates.

Each type of blocking function leads to a distinct formulation of the objective (5.1),
and we consider them individually in the following subsections.

5.2.1 Disjunctive blocking

Given a set of specific blocking predicates $\mathcal{P} = \{p_i\}_{i=1}^t$, a disjunctive blocking function
corresponds to selecting some subset of predicates $\mathcal{P}' \subseteq \mathcal{P}$, performing blocking using
each $p_i \in \mathcal{P}'$, and then selecting record pairs that share at least one common key in the
key sets computed by the indexing functions of the selected predicates. Thus, the equality
function for the disjunctive blocking function based on the subset $\mathcal{P}' = \{p_{i_1}, \ldots, p_{i_k}\}$ of pred-
icates returns 1 if the equality function for at least one predicate returns 1: $f_{\mathcal{P}'}(x_i, x_j) =
[\, p_{i_1}(x_i, x_j) + \cdots + p_{i_k}(x_i, x_j) \,]$, where $[\pi] = 1$ if $\pi > 0$, and 0 otherwise. If the equality func-
tion for the overall blocking function $f_{\mathcal{P}'}$ returns 1 for a pair $(x_i, x_j)$, we say that this pair is
covered by the blocking function.
Learning the optimal blocking function $f^*_{\mathcal{P}}$ requires selecting a subset $\mathcal{P}^*$ of predi-
cates that results in all or nearly all coreferent pairs being covered by at least one predicate
in $\mathcal{P}^*$, and a minimal number of non-coreferent pairs being covered. Then the general
adaptive blocking problem in Eq. (5.1) can be written as follows:

Figure 5.3: Red-Blue Set Cover view of disjunctive blocking. (The figure depicts three rows
of vertices: negative pairs $\mathcal{R} = \{r_1, \ldots, r_\rho\} = \{(x_i, x_j) : y_i \neq y_j\}$, blocking predicates
$\mathcal{P} = \{p_1, \ldots, p_t\}$, and positive pairs $\mathcal{B} = \{b_1, \ldots, b_\beta\} = \{(x_i, x_j) : y_i = y_j\}$, with edges
indicating which pairs each predicate covers.)

$$w^* = \arg\min_{w} \sum_{(x_i, x_j) \in \mathcal{R}} \left[\, w^T p(x_i, x_j) > 0 \,\right] \qquad (5.2)$$
$$\text{s.t.} \quad |\mathcal{B}| - \sum_{(x_i, x_j) \in \mathcal{B}} \left[\, w^T p(x_i, x_j) > 0 \,\right] < \varepsilon$$
$$w \text{ is binary}$$

where $w$ is a binary vector of length $t$ encoding which of the blocking predicates are se-
lected as a part of $f_{\mathcal{P}}$, and $p(x_i, x_j) = [\, p_1(x_i, x_j), \ldots, p_t(x_i, x_j) \,]^T$ is a vector of binary values
returned by the equality functions of the $t$ predicates for the pair $(x_i, x_j)$.
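To illustrate what objective (5.2) measures, the short sketch below (our own illustrative helper, not part of the proposed algorithms) counts the negative and positive training pairs covered under a candidate binary selection vector $w$; the learner seeks a $w$ with a small negative count while the positive count stays within $\varepsilon$ of $|\mathcal{B}|$:

def disjunctive_cover_counts(w, labeled_pairs):
    # w: list of 0/1 predicate selections;
    # labeled_pairs: list of (p_vec, is_coreferent), where p_vec[i] = p_i(x_j, x_k).
    neg_covered = pos_covered = 0
    for p_vec, is_coreferent in labeled_pairs:
        covered = any(w_i and p_i for w_i, p_i in zip(w, p_vec))  # [w^T p > 0]
        if covered:
            if is_coreferent:
                pos_covered += 1
            else:
                neg_covered += 1
    return neg_covered, pos_covered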
This formulation of the learnable blocking problem is equivalent to the Red-Blue Set
Cover problem if ε = 0 (Carr, Doddi, Konjevod, & Marathe, 2000). Figure 5.3 illustrates the
equivalence. The task of selecting a subset of predicates is represented by a graph with three
sets of vertices. The bottom row of $\beta$ vertices corresponds to positive (coreferent) record
pairs designated as the set of blue elements $\mathcal{B} = \{b_1, \ldots, b_\beta\}$. The top row of $\rho$ vertices
corresponds to negative (non-coreferent) record pairs designated as the set of red elements
$\mathcal{R} = \{r_1, \ldots, r_\rho\}$. The middle row of $t$ vertices represents the set of blocking predicates $\mathcal{P}$,
where each $p_i \in \mathcal{P}$ corresponds to a set covering some red and blue elements. Every edge
between an element vertex and a predicate vertex indicates that the record pair represented
by the element vertex is covered by the predicate. Learning the optimal disjunctive blocking
function is then equivalent to selecting a subset of predicate vertices with their incident
edges so that at least $\beta - \varepsilon$ blue (positive) vertices have at least one incident edge, while the
cover cost, equal to the number of red (negative) vertices with at least one incident edge, is
minimized.

5.2.2 DNF Blocking

In some domains, a disjunctive combination of blocking predicates may be an insufficient
representation of the optimal blocking strategy. For example, in US Census data, conjunc-
tions of predicates such as “Same Zipcode AND Same 1st Char in Surname” yield useful
blocking criteria (Winkler, 2005). To incorporate such blocking criteria, we must extend the
disjunctive formulation described above to a formulation based on combining predicates in
disjunctive normal form (DNF). Then, the hypothesis space for the blocking function must
include disjunctions of not just individual blocking predicates, but also of their conjunc-
tions.
A search for the optimal DNF blocking function can be viewed as solving an ex-
tended variant of the red-blue set cover problem. In that variant, the cover is constructed
using not only the sets representing the original predicates, but also using additionally
constructed sets representing predicate conjunctions. Because the number of all possible
conjunctions is exponential, only conjunctions up to fixed length k are considered. In Fig-
ure 5.3, considering a conjunction of blocking predicates corresponds to adding a vertex to
the middle row, with edges connecting it to the red and blue vertices present in the intersec-
tion of covered vertex sets for the individual predicates in the conjunction.
The learnable blocking problem based on DNF blocking functions is then equiv-
alent to constructing a set of conjunctions followed by selection of a set of predicate and
conjunction vertices so that at least $\beta - \varepsilon$ positive (blue) vertices have at least one incident
edge, while the cost, equal to the number of negative (red) nodes with at least one incident
edge, is minimized.

5.3 Algorithms

5.3.1 Pairwise Training Data

For clustering settings, supervision corresponds to sets of must-link (same-cluster) and
cannot-link (different-cluster) pairs. For record linkage, supervision is available in many
domains in the form of records for which the true entities to which they refer are known,
as discussed in Section 5.1. Such labeled records comprise the training dataset $\mathcal{D}_{train} =
\{X, Y\}$ that can be used to generate the pairwise supervision for learning the blocking
function in the form of coreferent (positive) and non-coreferent (negative) record pairs.
function in the form of coreferent (positive) and non-coreferent (negative) record pairs. For
large databases, it is impractical to explicitly generate and store in memory all positive pairs
and negative pairs. However, the set of covered pairs for each predicate can be computed
using the indexing function of the predicate to form an inverted index based on the key
values returned by the indexing function. Then, bit arrays can be used to store the cover of
each predicate, obtained by iteration over the inverted index.
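The following sketch (illustrative only; the helper names and data layout are our own) shows this idea: a predicate's inverted index groups record ids by blocking key, covered pairs are enumerated within each posting list, and the cover over the labeled training pairs is stored compactly as a bit array:

from collections import defaultdict
from itertools import combinations

def predicate_cover_bits(records, index_fn, pair_bit):
    # records: {record_id: field value}; index_fn: the predicate's indexing function;
    # pair_bit: maps each labeled pair (i, j), i < j, to a bit position.
    # Returns a bytearray with a bit set for every labeled pair covered by the predicate.
    inverted = defaultdict(set)
    for rec_id, value in records.items():
        for key in index_fn(value):
            inverted[key].add(rec_id)
    cover = bytearray((len(pair_bit) + 7) // 8)
    for posting in inverted.values():
        for i, j in combinations(sorted(posting), 2):
            bit = pair_bit.get((i, j))
            if bit is not None:
                cover[bit // 8] |= 1 << (bit % 8)
    return cover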
If training data is unavailable, it can be obtained automatically by performing link-
age or clustering without blocking, and then using the linkage or clustering results as train-
ing data for learning a blocking function for the given domain.

5.3.2 Learning Blocking Functions

Disjunctive Blocking

The equivalence of learning optimal disjunctive blocking and the red-blue set cover problem
described in Section 5.2.1 has discouraging implications for the practitioner. The red-blue
set cover problem is NP-hard, and Carr et al. (2000) have shown that unless P=NP, it cannot
be efficiently approximated within a factor $O(2^{\log^{1-\delta} t})$, where $\delta = 1/\log\log^c t$ and $t$ is the num-
ber of predicates under consideration. On the other hand, several approximate algorithms
have been proposed for the red-blue set cover problem (Carr et al., 2000; Peleg, 2000). We
base our approach on a modified version of Peleg’s greedy algorithm that has an approxi-

Algorithm: ApproxRBSetCover
Input: Training set B = {b1, . . . , bβ} and R = {r1, . . . , rρ} where
    each bi ∈ B is a pair of coreferent records (xi1, xi2) s.t. yi1 = yi2
    each ri ∈ R is a pair of non-coreferent records (xi1, xi2) s.t. yi1 ≠ yi2
  Set of blocking predicates P = {p1, . . . , pt}
  Maximum number of coreferent pairs allowed to be uncovered, ε
  Maximum number of pairs that any predicate may cover, η
Output: A disjunctive blocking function based on a subset P∗ ⊆ P
Method:
1. Discard from P all predicates pi for which r(pi) > η:
   P ← {pi ∈ P | r(pi) ≤ η}.
2. If |B| − |B(P)| > ε, return P (cover is not feasible, η is too low).
3. Set γ = √(t / log β).
4. Discard all ri covered by more than γ predicates:
   R ← {ri ∈ R | deg(ri, P) ≤ γ}.
5. Construct an instance of weighted set cover T by discarding
   elements of R, creating a set τi for each pi ∈ P, and setting
   its weight ω(τi) = r(pi).
6. T∗ ← ∅
7. while |B| > ε
8.     select τi ∈ T that maximizes b(τi)/ω(τi)
9.     B ← B − B(τi)
10.    T∗ ← T∗ ∪ {τi}
11. Return the set of predicates P∗ corresponding to T∗.

Figure 5.4: The algorithm for learning disjunctive blocking

mation ratio of $2\sqrt{t \log \beta}$ (Peleg, 2000). This algorithm is particularly appropriate for the
adaptive blocking setting as it involves early discarding of particularly costly sets (block-
ing predicates that cover too many non-coreferent pairs), leading to more space-efficient
learning of the blocking function. In the remaining discussion, we use the term “blocking
predicates” in place of “sets” considered in the original set cover problem.
The outline of the algorithm ApproxRBSetCover is shown in Figure 5.4. The
algorithm is provided with training data in the form of $\beta$ coreferent record pairs $\mathcal{B} =
\{b_1, \ldots, b_\beta\}$ and $\rho$ non-coreferent record pairs $\mathcal{R} = \{r_1, \ldots, r_\rho\}$, where each $r_i$ and $b_i$ rep-
resents a record pair $(x_{i_1}, x_{i_2})$. For each predicate $p_i \in \mathcal{P}$, let covered negatives $\mathcal{R}(p_i)$ be
the set of negative pairs it covers, predicate cost $r(p_i) = |\mathcal{R}(p_i)|$ be the number of negative
pairs it covers, covered positives $\mathcal{B}(p_i)$ be the set of positive pairs it covers, and coverage
$b(p_i) = |\mathcal{B}(p_i)|$ be the number of covered positives. For each negative pair $r_i = (x_{i_1}, x_{i_2})$,
let the degree $deg(r_i, \mathcal{P})$ be the number of predicates in $\mathcal{P}$ that cover it; the degree
for a positive pair, $deg(b_i, \mathcal{P})$, is defined analogously. In step 1 of the algorithm, blocking
predicates that cover too many negative pairs are discarded, where the parameter η can be
set to a fraction of the total number of pairs in the dataset. Then, negative pairs covered
by too many predicates are discarded in step 4, which intuitively corresponds to disregard-
ing non-coreferent pairs that are highly similar and are placed in the same block by most
predicates. Again, this parameter can be set as a fraction of the available predicate set.
Next, a standard weighted set cover problem is set up for the remaining predicates
and pairs by setting the cost of each predicate to be the number of negatives it covers and
removing the negatives. The resulting weighted set cover problem is solved in steps 6-11
using Chvatal’s greedy approximation algorithm (Chvatal, 1979). The algorithm iteratively
constructs the cover, at each step adding the blocking predicate pi that maximizes a greedy
heuristic: the ratio of the number of previously uncovered positives over the predicate cost.
To soften the constraint requiring all positive pairs to be covered, we add an early stopping
condition permitting up to ε positives to remain uncovered. In practice, ε should be set to
0 at first, and then gradually increased if the cover identified by the algorithm is too costly
for the application at hand (that is, when covering all positives incurs covering too many
negatives).
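The greedy core of ApproxRBSetCover can be sketched as follows (covers are represented as Python sets rather than bit arrays for readability; this is a simplified illustration of Figure 5.4, not the exact implementation used in the experiments):

import math
from collections import defaultdict

def approx_rb_set_cover(pos_cover, neg_cover, eps, eta):
    # pos_cover[p] / neg_cover[p]: sets of positive / negative pair ids covered by predicate p.
    # Step 1: discard predicates that cover more than eta negative pairs.
    preds = [p for p in pos_cover if len(neg_cover[p]) <= eta]
    all_pos = set().union(*(pos_cover[p] for p in preds)) if preds else set()
    # Steps 3-4: compute gamma and discard negatives covered by more than gamma predicates.
    log_beta = math.log(len(all_pos)) if len(all_pos) > 1 else 1.0
    gamma = math.sqrt(len(preds) / log_beta) if preds else 0.0
    neg_degree = defaultdict(int)
    for p in preds:
        for r in neg_cover[p]:
            neg_degree[r] += 1
    kept_neg = {r for r, d in neg_degree.items() if d <= gamma}
    # Step 5: predicate cost = number of remaining negatives it covers (clamped to 1).
    cost = {p: max(len(neg_cover[p] & kept_neg), 1) for p in preds}
    # Steps 6-11: Chvatal-style greedy weighted set cover over the positive pairs.
    uncovered, selected = set(all_pos), []
    while len(uncovered) > eps and preds:
        best = max(preds, key=lambda p: len(pos_cover[p] & uncovered) / cost[p])
        if not pos_cover[best] & uncovered:
            break  # no remaining predicate covers any uncovered positives
        selected.append(best)
        uncovered -= pos_cover[best]
    return selected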

DNF Blocking

Learning DNF blocking can be viewed as an extension of learning disjunctive blocking
where not only individual blocking predicates may be selected, but also their conjunctions.
We assume that conjunctions that include up to k predicates are considered. Because enu-
merating over all possible conjunctions of predicates results in an exponential number of
predicate sets under consideration, we propose a two-stage procedure, shown in Figure 5.5.
First, a set of $t \cdot (k-1)$ predicate conjunctions of lengths from 2 to $k$ is created in

Algorithm: ApproxDNF
Input: Training set B = {b1, . . . , bβ} and R = {r1, . . . , rρ} where
    each bi is a pair of coreferent records (xi1, xi2) s.t. yi1 = yi2
    each ri is a pair of non-coreferent records (xi1, xi2) s.t. yi1 ≠ yi2
  Set of blocking predicates P = {p1, . . . , pt}
  Maximum number of coreferent pairs allowed to be uncovered, ε
  Maximum number of pairs that any predicate may cover, η
  Maximum conjunction length, k
Output: A DNF blocking function based on P:
  (pi1 ∧ · · · ∧ pi′1) ∨ · · · ∨ (pin ∧ · · · ∧ pi′n), each i′j ≤ k
Method:
1. Discard from P all predicates pi for which r(pi) > η:
   P ← {pi ∈ P | r(pi) ≤ η}.
2. P(c) ← ∅
3. For each pi ∈ P
4.     Construct k − 1 candidate conjunctions pi(c) = pi ∧ · · · ∧ pik
       by greedily selecting each pij that maximizes cover b(pi(c))/r(pi(c)),
       adding each pi(c) to P(c).
5. Return ApproxRBSetCover(R, B, P ∪ P(c), ε, η).

Figure 5.5: The algorithm for learning DNF blocking

a greedy fashion. Candidate conjunctions are constructed iteratively starting with each
predicate $p_i \in \mathcal{P}$. At each step, another predicate is added to the current conjunction so
that the ratio between the number of positives and the number of negatives covered by the
conjunction is maximally improved.
After the candidate set of conjunctions of lengths from 2 to k is constructed, the
conjunctions are added to $\mathcal{P}$, the set of individual predicates. Then, the ApproxRBSetCover
algorithm described in the previous section is used to learn a blocking function that
corresponds to a DNF formula over the blocking predicates.
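A sketch of the greedy conjunction construction (again over set-based covers, with our own names) could look as follows; every intermediate conjunction of length 2 through k is kept as a candidate:

def build_candidate_conjunctions(pos_cover, neg_cover, k):
    # pos_cover[p] / neg_cover[p]: sets of positive / negative pairs covered by predicate p.
    # Returns {conjunction (tuple of predicates): (positives covered, negatives covered)}.
    def ratio(pos, neg):
        return len(pos) / (len(neg) + 1.0)   # +1 avoids division by zero

    candidates = {}
    for p in pos_cover:
        conj, pos, neg = (p,), pos_cover[p], neg_cover[p]
        for _ in range(k - 1):
            best, best_pos, best_neg = None, None, None
            for q in pos_cover:
                if q in conj:
                    continue
                new_pos, new_neg = pos & pos_cover[q], neg & neg_cover[q]
                if best is None or ratio(new_pos, new_neg) > ratio(best_pos, best_neg):
                    best, best_pos, best_neg = q, new_pos, new_neg
            if best is None or not best_pos:
                break   # no extension covers any positive pairs
            conj, pos, neg = conj + (best,), best_pos, best_neg
            candidates[conj] = (pos, neg)
    return candidates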

5.3.3 Blocking with the Learned Functions

Efficiency considerations, which are the primary motivation for this work, require the
learned blocking functions to perform the actual blocking on new, unlabeled data in an
effective manner. After the blocking function is learned using training data, it should be
applied to the test data (for the actual linkage or clustering task) without explicitly con-
structing all pairs of records and evaluating the predicates on them. This is achieved by
applying the indexing function for every blocking predicate or conjunction in the learned
blocking function to every record in the test dataset. Thus, an inverted index is constructed
for each predicate or conjunction in the blocking function. In each inverted index, every key
is associated with a list of instances for which the indexing function of the corresponding
predicate returns the key value. Disjunctive and DNF blocking can then be performed by
iterating over every key in all inverted indices and returning all pairs of records that occur
in the same list for any key.
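A minimal sketch of this blocking step is given below (assuming each disjunct of the learned function is represented by one indexing function; for a conjunction of predicates, keys can be formed as the cross product of the conjuncts' key sets; helper names are ours):

from collections import defaultdict
from itertools import combinations, product

def candidate_pairs(records, disjunct_index_fns):
    # records: {record_id: record}; disjunct_index_fns: one indexing function per
    # disjunct (predicate or conjunction) of the learned blocking function.
    # Returns the set of covered (candidate) record-id pairs.
    pairs = set()
    for index_fn in disjunct_index_fns:
        inverted = defaultdict(set)
        for rec_id, record in records.items():
            for key in index_fn(record):
                inverted[key].add(rec_id)
        for posting in inverted.values():
            pairs.update(combinations(sorted(posting), 2))
    return pairs

def conjunction_index(index_fn_a, index_fn_b):
    # Indexing function for a two-predicate conjunction: a record pair shares a key
    # only if it shares a key under both conjuncts (cross product of key sets).
    return lambda record: set(product(index_fn_a(record), index_fn_b(record)))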

5.4 Experimental Results

5.4.1 Methodology and Datasets

We evaluate our methods for learning blocking functions using two
metrics, reduction ratio and recall. They are defined with respect to the number of
coreferent and non-coreferent record pairs that get covered by a blocking function fP in
a database of n records:

$$ReductionRatio = 1.0 - \frac{\sum_{(x_i, x_j) \in \mathcal{R}} f_{\mathcal{P}}(x_i, x_j) + \sum_{(x_i, x_j) \in \mathcal{B}} f_{\mathcal{P}}(x_i, x_j)}{n(n-1)/2}$$

$$Recall = \frac{\sum_{(x_i, x_j) \in \mathcal{B}} f_{\mathcal{P}}(x_i, x_j)}{|\mathcal{B}|}$$
Intuitively, recall captures blocking accuracy by measuring the proportion of truly
coreferent record pairs that have been covered by the blocking function. An ideal blocking
function would have recall of 1.0, indicating that all coreferent pairs are covered. Reduction
ratio measures the efficiency gain due to blocking by measuring what proportion of all pairs
in the dataset is filtered out by the blocking function. Without blocking, reduction ratio is 0
since all record pairs are returned, while a higher number indicates what proportion of pairs
is not covered, and therefore will not require similarity computations in the subsequent
record linkage stages or in the clustering algorithm. Note that efficiency savings due to
blocking are more substantial if collective (graph-based) inference methods are used for
linkage or clustering (Pasula et al., 2003; McCallum & Wellner, 2004a; Singla & Domingos,
2005; Bhattacharya & Getoor, 2006), as the time complexity of these methods increases
superlinearly with the number of record pairs under consideration.
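Both metrics are straightforward to compute from the candidate pairs returned by a blocking function and the true object identifiers of the test records; a minimal sketch (our own helper, following the definitions above):

def blocking_metrics(candidate_pairs, labels):
    # candidate_pairs: set of covered record-id pairs (i, j), ordered consistently
    # with the sorted record ids; labels: {record_id: true object identifier}.
    # Assumes at least one coreferent pair exists in the test data.
    ids = sorted(labels)
    total_pairs = len(ids) * (len(ids) - 1) // 2
    coreferent = {(ids[a], ids[b])
                  for a in range(len(ids)) for b in range(a + 1, len(ids))
                  if labels[ids[a]] == labels[ids[b]]}
    reduction_ratio = 1.0 - len(candidate_pairs) / total_pairs
    recall = len(candidate_pairs & coreferent) / len(coreferent)
    return reduction_ratio, recall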
Results are obtained using 10 runs of two-fold cross-validation. Using a higher
number of folds would result in fewer coreferent records in the test fold, which would
artificially make the blocking task easier. During each run, the dataset is split into two
folds by randomly assigning all records for every underlying entity to one of the folds. The
blocking function is then trained using record pairs generated from the training fold. The
learned blocking function is used to perform blocking on the test fold, based on which recall
and reduction ratio are measured.
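The entity-level split can be sketched as follows (a simplified version with our own names): all records that refer to the same underlying entity are assigned to the same randomly chosen fold, so coreferent pairs never straddle the train/test boundary.

import random

def entity_level_folds(labels, seed=0):
    # labels: {record_id: true object identifier}.
    # Returns two lists of record ids, assigning all records of each entity to one fold.
    rng = random.Random(seed)
    fold_of_entity = {}
    folds = ([], [])
    for rec_id, entity in labels.items():
        if entity not in fold_of_entity:
            fold_of_entity[entity] = rng.randint(0, 1)
        folds[fold_of_entity[entity]].append(rec_id)
    return folds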
We present results on two datasets: Cora and Addresses. The Cora dataset is de-
scribed in Section 3.1.1. While it is a relatively small-scale dataset, results of Chapter 3
illustrate that good linkage performance on this domain requires computationally intensive
string similarity functions; it has also been shown that linkage on that dataset benefits from
collective linkage methods (Singla & Domingos, 2005), justifying the need for blocking.
Addresses is a dataset containing names and addresses of 50,000 9-field records for 10,000
unique individuals that was generated using the DBGen program provided by Hernández
and Stolfo (1995).
We use the following general predicates for constructing learnable blocking
functions (indexing functions for a few of them are sketched after the list):

• Exact Match: covers instances that have the same value for the field;

• Contain Common Token: covers instances that contain a common token in the field
value;

• Contain Common Integer: covers instances that contain a common token consisting
of digits in the field value;

• Contain Same or Off-by-One Integer: covers instances that contain integer tokens
that are equal or differ by at most 1;

• Same N First Chars, N = 3, 5, 7: covers instances that have a common character prefix
in the field value;

• Contain Common Token N-gram, N = 2, 4, 6: covers instances that contain a common
length-N subsequence of tokens;

• Token-based TF-IDF > δ, δ = 0.2, 0.4, 0.6, 0.8, 1.0: covers instances where token-
based TF-IDF cosine similarity between field values is greater than the threshold δ;

• N-gram-based TF-IDF > δ, δ = 0.2, 0.4, 0.6, 0.8, 1.0, N = 3, 5: covers instances
where TF-IDF cosine similarity between n-gram representations of field values is
greater than the threshold δ.
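For concreteness, indexing functions for a few of these general predicates could look as follows (illustrative sketches with our own names; the TF-IDF-based predicates additionally require corpus-level statistics and are omitted here):

def exact_match_keys(value):
    # "Exact Match": the whole normalized field value is the single key.
    return {" ".join(value.lower().split())}

def first_n_chars_keys(value, n=3):
    # "Same N First Chars": the key is the length-n prefix of the normalized value.
    normalized = " ".join(value.lower().split())
    return {normalized[:n]} if normalized else set()

def token_ngram_keys(value, n=2):
    # "Contain Common Token N-gram": keys are all length-n token subsequences.
    tokens = value.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}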

As described in Section 5.2, these general predicates are instantiated for all fields
in the given database. Algorithms presented in Section 5.3.2 are used to construct blocking
functions by selecting subsets of the resulting field-specific predicates. For DNF blocking,
conjunctions of length 2 were employed, as experiments with longer conjunctions did not
lead to improvements over blocking based on a 2-DNF.
We vary the value of parameter ε (which specifies the number of coreferent pairs
allowed to remain uncovered) by setting it to $r\beta$ for different values of desired recall $r$ between
0.0 and 1.0, where β is the number of coreferent record pairs in the training fold. This
parameter captures the dependence between the reduction ratio and recall: if ε is high, fewer
predicates are selected, resulting in lower recall since not all coreferent pairs are retrieved.
At the same time, the reduction ratio is higher for higher ε since fewer pairs are covered by
the learned blocking function, leading to higher efficiency. By varying ε, we obtain a series
of results that demonstrate the trade-off between obtaining higher recall and improving the
reduction ratio.
We compare the proposed methods with Canopies (McCallum et al., 2000), a
popular blocking method relying on token-based or n-gram-based TF-IDF similarity com-
puted using an inverted index. In a previous study, Baxter et al. (2003) compared several
manually-tuned blocking strategies and found Canopies to produce the best overall results.
Canopies also allows trading off recall and the reduction ratio by varying the threshold
parameter that controls the coverage of the blocking.1 We tried both
token-based Canopies and tri-gram-based Canopies and chose the best-performing vari-
ants as baselines: token-based indexing for Cora, and tri-gram indexing for Addresses. This
difference is due to the fact that most variation between coreferent citations in Cora is due
to insertions and deletions of whole words, making token-based similarity more appropri-
ate. Coreferent records in Addresses, on the other hand, mostly differ due to misspellings and
character-level transformations that n-gram similarity is able to capture.

5.4.2 Results and Discussion

Figures 5.6 and 5.7 show the reduction ratio versus recall curves for the two types of learned
blocking functions described above and for Canopies. From these results, we observe that
both variants of adaptive blocking outperform the unlearned baseline: combining multiple
predicates allows achieving higher recall levels as well as achieving higher reduction ratios.
DNF blocking is more accurate than disjunctive blocking, and on Addresses it also achieves
higher recall, while for Cora the maximum recall is comparable. Because DNF blocking
is based on predicate conjunctions, non-coreferent pairs are more easily avoided by the blocking
function: conjunctions effectively form high-precision, low-recall rules that cover smaller
subsets of coreferent pairs but fewer non-coreferent pairs compared to single predicates.
1 The original Canopies algorithm allows varying two separate threshold parameters; however, empirical
results have shown that using the same value for both thresholds yields the highest performance (McCallum et al.,
2000).

Figure 5.6: Blocking accuracy results for the Cora dataset (reduction ratio vs. recall curves
for DNF blocking, disjunctive blocking, and Canopies)
Figure 5.7: Blocking accuracy results for the Addresses dataset (reduction ratio vs. recall
curves for DNF blocking, disjunctive blocking, and Canopies)

While none of the methods achieve 100% recall (as it would effectively require returning
all record pairs), for both datasets adaptive blocking is able to achieve higher recall than
Canopies. Thus, using learnable blocking functions leads to both accuracy and efficiency
improvements.

                         Cora       Addresses
DNF Blocking             23,499     4,890,410
Disjunctive Blocking     41,439     4,090,283
Canopies                 125,986    1,745,995
Total number of pairs    606,182    312,487,500

Table 5.1: Average number of pairs covered by the learned blocking functions at the highest
achieved recall

Table 5.1 shows the actual number of record pairs returned by the different block-
ing methods at highest achieved recall. These results demonstrate the significance of dif-
ferences in the reduction ratio between the different blocking functions: because the total
number of pairs is very large, differences in the reduction ratio translate into significant sav-
ings in the number of pairs for which similarity must be computed. Note that the smaller
number of pairs returned by Canopies and disjunctive blocking on Addresses corresponds
to a significantly lower recall, while for a fixed recall level DNF blocking either does as
well or better.

                       Cora    Addresses
DNF Blocking           26.9    735.81
Disjunctive Blocking   32.4    409.4
Canopies               16.0    572.7

Table 5.2: Average blocking time, CPU seconds

Table 5.2 shows the blocking times for the different methods measured at maximum
achieved recall. Learnable blocking functions incur a relatively modest increase in compu-
tational time despite the fact that they utilize many predicates. This is due to the fact that
the learned predicates that cover few negatives typically require smaller inverted indices
than the one built by Canopies using tokens or n-grams where each token or n-gram occurs
in many strings. Many predicates employed by the adaptive blocking functions, on the other
hand, map each string to a single key, resulting in more efficient retrieval of covered pairs.
Inverted indices corresponding to conjunctions are even more efficient as they contain many
keys (the cross product of the key sets for the predicates in the conjunction) and incur less
chaining, which is the reason for better performance of DNF blocking compared to disjunc-
tive blocking on Cora, where the number of predicates in the constructed blocking function
is similar for the two methods. On Addresses, DNF blocking constructs blocking functions
containing more predicates, which on one hand incurs a computational penalty, but on the
other allows it to achieve higher recall.
Overall, the results demonstrate that adaptive blocking functions significantly im-
prove the efficiency of record linkage, and provide an attractive methodology for scaling up
data mining tasks that rely on similarity computations between pairs of instances.

5.5 Related Work

A number of blocking methods have been proposed by researchers for speeding up record
linkage and clustering (Fellegi & Sunter, 1969; Kelley, 1985; Newcombe, 1988; Jaro, 1989;
Hernández & Stolfo, 1995; McCallum et al., 2000; Baxter et al., 2003; Chaudhuri et al.,
2003; Jin et al., 2003; Gu & Baxter, 2004; Winkler, 2005); see the summary of these
methods in Section 2.4. A key distinction between prior work and our approach is that
previously described methods focus on improving blocking efficiency while assuming that
an accurate blocking function is known and its parameters have been tuned manually. In
contrast, our approach attempts to construct an optimal blocking function automatically.
Because blocking functions can be learned using any combination of similarity predicates
on different record fields, and no assumptions are made about the number of record fields
or their type, our approach can be used for adapting the blocking function in any domain,
while allowing human experts to add domain-specific predicates.

Our predicate-based blocking approach is most closely related to key-based meth-
ods developed by researchers working on record linkage for Census data (Kelley, 1985;
Newcombe, 1988; Jaro, 1989; Winkler, 2005). Techniques described by Kelley (Kelley,
1985) and Winkler (Winkler, 2005) are particularly relevant as they describe methodolo-
gies for evaluating the accuracy of individual blocking predicates, and could be integrated
with our approach for further speedup of blocking function learning.
Our formulation for training disjunctive and DNF blocking functions is related to
machine learning algorithms for learning disjunctive rules and DNFs (Mitchell, 1997). A
principal difference between that work and the learnable blocking problem is that in our
setting the learned disjunctions must cover all positive record pairs while minimizing the
number of covered negative pairs, while rule learning methods generally attempt to equally
minimize the number of errors on both positive and negative examples. Cost-sensitive
machine learning methods (Elkan, 2001) may provide a foundation for other approaches
to adaptive blocking, and we hope that our initial work will encourage the development of
alternative learnable blocking techniques.

5.6 Chapter Summary

In this chapter, we formulated the adaptive blocking problem as the task of learning a func-
tion that returns a minimal number of non-coreferent record pairs while returning all or
nearly all coreferent pairs. We described two types of blocking functions: disjunctive and
DNF blocking. Formulating the learning problem as an instance of the Red-Blue Set Cover
problem allowed us to adopt a well-known approximation algorithm for that problem to
construct blocking functions. Experimental results demonstrated the ability of our approach
to learn efficient and accurate blocking functions automatically.

Chapter 6

Future Work

Because learnable similarity functions are a part of many machine learning and data analy-
sis tasks, there is a large number of applications where adapting distance computations can
have a significant effect on performance. These applications can be found in such fields
as natural language processing, information retrieval, vision, robotics and bioinformatics,
where application-specific similarity functions are often employed. Adapting such func-
tions in situ in these applications can be achieved using the framework used in the three
applications considered in this thesis: learning from pairwise supervision. While specific
applications in the above areas are beyond the scope of this thesis, in subsequent sections
we describe several directions for future work that are directly related to the applications
and similarity functions considered in prior chapters.

6.1 Multi-level String Similarity Functions

Improvements obtained using learnable affine-gap edit distance over its unlearned equiv-
alent demonstrated the benefits of adapting string similarity computations to a given do-
main. However, edit distance has certain properties that may limit its suitability in some
domains. For example, it does not directly handle transpositions of entire fragments (e.g.,
token swaps), and while edit operations for short-range transpositions can be added at
considerable computational cost, handling long-term transpositions is problematic. Order-
insensitive similarity functions such as cosine similarity, on the other hand, have no trouble
dealing with token transpositions, yet they depend on accurate tokenization and suffer when
edit operations occur at the character level.
The SoftTFIDF variant of cosine similarity recently proposed by Cohen et al. (2003a)
attempts to amend this drawback of cosine similarity, yet it cannot adapt to a given domain
beyond the IDF weighting. An exciting challenge for future work lies in developing learn-
able string similarity functions that integrate adaptive string comparison at the character,
token, and document (string) levels. Such similarity functions must rely on joint similarity
computation across the levels while remaining computationally efficient. While segmented
pair HMMs presented in Section 3.1.2 are a first step in this direction, developing string
similarity models that perform further structural analysis of strings remains an open re-
search issue. Progress in this area will have impact in all tasks that rely on string similarity
functions such as record linkage and information retrieval.

6.2 Discriminative Pair HMMs

The Expectation-Maximization algorithm that we described in Chapter 3 for training pair
HMMs only utilizes positive supervision: the learning procedure maximizes the likelihood
of observing alignments of coreferent pairs. However, it may be advantageous to exploit
negative supervision, that is, pairs of non-coreferent strings, since some “near-miss” nega-
tive examples can be very informative.
A recently proposed edit distance model based on Conditional Random Fields (CRFs)
has structure that allows training with both positive and negative examples so that the model
directly learns to discriminate between the two kinds of pairs (McCallum et al., 2005). The
CRF edit distance model consists of two three-state edit distance transducers, one of which
computes the alignment probabilities for coreferent strings, while the other computes align-
ment probabilities for non-coreferent strings.
Although the CRF-based model has different probabilistic semantics (alignments
are not generated since the model is conditioned on the fact that any two strings under con-
sideration are aligned), the coupled structure of that model can be implemented as a pair
HMM. Considering such coupled structures within the pair HMM framework is an interest-
ing area for future work, since it would allow applying discriminative training methods that
explicitly attempt to learn model parameters that effectively distinguish between coreferent
and non-coreferent strings (Eisner, 2002). Another avenue for future work on alternative
pair HMM structures involves deriving learnable models for local alignment that focus
on scoring matching alignment fragments while disregarding the mismatched sequences
around them (Gusfield, 1997). In domains where large gaps are commonplace yet small
matching sequences may be very informative, e.g., in linkage of retail product descriptions,
pair HMM structures that model local alignment may yield better performance, and inves-
tigating this possibility is an interesting future direction.

6.3 Active Learning of Similarity Functions

As discussed in Section 3.3, the goal of active learning methods for similarity functions is
identifying pairs of objects whose equivalence or non-equivalence is informative for im-
proving distance estimates. The classifier-based record similarity described in Section 3.2
lends itself nicely to active learning techniques developed for classification, which has been
explored by Sarawagi and Bhamidipaty (2002) and Tejada et al. (2002) in the record linkage
context.
One of the biggest challenges in selecting useful training example pairs lies with
the fact that the space of possible pairs grows quadratically with the number of examples,
and static-active and weakly-labeled methodologies we proposed in Section 3.3 address this
challenge. However, these methods are based on heuristics, while developing more princi-
pled active learning methods remains an interesting direction for future work. Such methods
must directly attempt to identify example pairs that would lead to maximal improvement
of similarity estimates. Traditional active learning approaches such as uncertainty sam-
pling (Lewis & Catlett, 1994), query-by-committee (Seung et al., 1992), estimation error
reduction (Lindenbaum et al., 1999; Roy & McCallum, 2001), and version space reduc-
tion (Tong, 2001) could be adopted for this task, and developing such methods for directly
improving the learning of similarity functions like edit distance or distortion measures de-
scribed in Chapter 4 is an area yet to be explored.

6.4 From Adaptive Blocking to Learnable Metric Mapping

The predicate-based methodology that we proposed in Chapter 5 for automatically obtain-
ing accurate blocking functions requires specifying an initial set of blocking predicates.
Although a sufficiently general set of predicates for textual data is easy to encode, in future
work it would be interesting to explore learnable blocking methods that are not predicate-
based but rely on mapping records to metric spaces. Several existing blocking methods rely
on such mapping, such as those of Jin et al. (2003) and Chaudhuri et al. (2003). Learning
algorithms that would make these methods adaptive could pursue two directions: searching
for an optimal mapping of data to metric space, or transforming the metric space after the
mapping to allow efficient yet accurate selection of approximately similar records.
This problem is related to methods for fast nearest-neighbor searching, a number
of which have been developed in the past decade (Indyk & Motwani, 1998; Liu, Moore,
Gray, & Yang, 2004; Beygelzimer, Kakade, & Langford, 2006). However, using these
techniques for domains where data is described by multiple fields of heterogeneous types
is non-trivial as they typically rely on strong metric assumptions on the data space, and do
not scale efficiently to high-dimensional data. Developing adaptive nearest-neighbor search
methods for heterogeneous data is an interesting area for future work that has applications in
blocking as well as in other tasks where retrieving approximately similar objects efficiently
is important, e.g., in classification methods and database retrieval.

Chapter 7

Conclusions

Research presented in this thesis has focused on learning similarity functions from pairwise
supervision. We have shown that by parameterizing several popular distance functions and
learning parameter values from examples of similar and dissimilar instance pairs, we obtain
increases in accuracy of similarity computations, which lead to performance improvements
in tasks that rely on them: record linkage, semi-supervised clustering, and blocking.
First, we have considered learning similarity functions in the context of record link-
age where they are used for two tasks: computing similarity between individual field values
and combining these similarities across multiple fields. For field-level similarity computa-
tions, we have described two adaptive variants of affine-gap edit distance in which the costs
of string transformations are learned on a corpus of coreferent string pairs. Our approach is
based on pair HMMs, a probabilistic model for generating string alignments. Learning the
costs of affine-gap edit distance parameters allows adapting the underlying string match-
ing algorithm to each field’s domain, while using segmented pair HMMs integrates such
adaptation with performing string segmentation that is helpful in domains where strings are
composed of multiple fields from different domains.
For computing similarity between records in linkage, we have demonstrated that
Support Vector Machines (SVMs) effectively combine similarities from individual fields in
proportion to their relative importance. Using learnable similarity functions at both field and
record levels leads to improved results over using record-level learnable similarity functions
that combine unlearned field-level similarities.
We have proposed two strategies for selecting informative pairs of coreferent or
non-coreferent examples for training similarity functions in record linkage. One of the
proposed strategies, weakly-labeled negative selection, does not require labeled supervision,
while the other, likely positive pair selection, avoids the computational costs of the standard
active learning methods. Both of these strategies facilitate efficient selection of training
pairs that allows learning accurate similarity functions on small training sets.
Second, we have demonstrated the utility of employing learnable similarity func-
tions in semi-supervised clustering. By incorporating similarity function learning within the
HMRF-KMEANS algorithm for semi-supervised clustering, we were able to leverage both
labeled pairwise supervision and unlabeled data when adapting the similarity functions.
Our approach allows learning individual similarity functions for different clusters which is
useful for domains where clusters have different shapes. The proposed framework can be
used with a variety of distortion (distance) functions that include directional measures such
as cosine similarity, and Bregman divergences that include Euclidean distance and Kull-
back Leibler divergence. Ablation experiments have demonstrated that the HMRF-based
approach combines the strengths of learnable similarity functions and constrained cluster-
ing to obtain significant improvements in clustering quality.
In the context of blocking, the third application we considered, we have proposed
methods for learning similarity functions that efficiently select approximately similar pairs
of examples. Because blocking is required for scaling record linkage and many pairwise
clustering algorithms up to large datasets, our technique shows that learnable similarity
functions can be employed not only for increasing accuracy of data mining tasks, but also
for improving their scalability. Unlike previous blocking methods that require manual tun-
ing and hand-construction of blocking functions, our approach is adaptive as it optimizes
the blocking function for a given domain using pairwise supervision that can be naturally
obtained in linkage and clustering tasks.
For the three tasks under consideration, we have evaluated the effectiveness of uti-
lizing learnable similarity functions, comparing their accuracy on standard benchmarks
with that of unlearned similarity functions typically used in these tasks. Our experiments
demonstrate that learnable similarity functions effectively utilize the pairwise training data
to make distance estimates more accurate for a given domain, resulting in overall perfor-
mance improvements on the tasks.
Overall, the work presented in this thesis contributes methods leading to state-of-
the art performance on the considered tasks and provides a number of useful algorithms
for practitioners in record linkage, semi-supervised clustering, and blocking. This research
demonstrates the power of using similarity functions that can adapt to a given domain using
pairwise supervision, and we hope that it will motivate further research in trainable distance
functions, as well as encourage employing such functions in various applications where
distance estimates between instances are required.

Bibliography

Aha, D. W. (1998). Feature weighting for lazy learning algorithms. In Liu, H., & Motoda,
H. (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspec-
tive. Kluwer Academic Publishers, Norwell, MA.

Ananthakrishna, R., Chaudhuri, S., & Ganti, V. (2002). Eliminating fuzzy duplicates in
data warehouses. In Proceedings of the 28th International Conference on Very Large
Databases (VLDB-2002) Hong Kong, China.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press,
New York.

Banerjee, A., Dhillon, I., Ghosh, J., & Sra, S. (2005a). Clustering on the unit hypersphere
using von Mises-Fisher distributions. Journal of Machine Learning Research, 6,
1345–1382.

Banerjee, A., Merugu, S., Dhilon, I., & Ghosh, J. (2005b). Clustering with Bregman diver-
gences. Journal of Machine Learning Research, 6, 1705–1749.

Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. J., & Ghosh, J. (2005c). Model-based
overlapping clustering. In Proceedings of the Eleventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD-05).

Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions
using equivalence relations. In Proceedings of 20th International Conference on
Machine Learning (ICML-2003), pp. 11–18.

Basu, S. (2005). Semi-supervised Clustering: Probabilistic Models, Algorithms and Exper-
iments. Ph.D. thesis, University of Texas at Austin.

Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. In
Proceedings of 19th International Conference on Machine Learning (ICML-2002),
pp. 19–26.

Basu, S., Banerjee, A., & Mooney, R. J. (2004). Active semi-supervision for pairwise
constrained clustering. In Proceedings of the 2004 SIAM International Conference
on Data Mining (SDM-04).

Basu, S., Bilenko, M., Banerjee, A., & Mooney, R. J. (2006). Probabilistic semi-supervised
clustering with constraints. In Chapelle, O., Schölkopf, B., & Zien, A. (Eds.), Semi-
Supervised Learning. MIT Press, Cambridge, MA.

Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-
supervised clustering. In Proceedings of the Tenth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining (KDD-2004), pp. 59–68 Seattle,
WA.

Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods
for record linkage. In Proceedings of the 2003 ACM SIGKDD Workshop on Data
Cleaning, Record Linkage, and Object Consolidation, pp. 25–27 Washington, DC.

Beygelzimer, A., Kakade, S., & Langford, J. (2006). Cover trees for nearest neighbor. In
Proceedings of 23rd International Conference on Machine Learning (ICML-2006).

Bhattacharya, I., & Getoor, L. (2004). Deduplication and group detection using links. In
Proceedings of the 2004 ACM SIGKDD Workshop on Link Analysis and Group De-
tection.

Bhattacharya, I., & Getoor, L. (2006). A latent dirichlet model for unsupervised entity
resolution. In 6th SIAM Conference on Data Mining (SDM-2006) Bethesda, MD.

Bilenko, M., Kamath, B., & Mooney, R. J. (2006). Adaptive blocking: Learning to scale up
record linkage. In Proceedings of the WWW-2006 Workshop on Information Integra-
tion on the Web.

Bilenko, M., & Basu, S. (2004). A comparison of inference techniques for semi-supervised
clustering with hidden Markov random fields. In Proceedings of the ICML-2004
Workshop on Statistical Relational Learning and its Connections to Other Fields
(SRL-2004) Banff, Canada.

Bilenko, M., Basu, S., & Mooney, R. J. (2004). Integrating constraints and metric learning
in semi-supervised clustering. In Proceedings of 21st International Conference on
Machine Learning (ICML-2004), pp. 81–88 Banff, Canada.

Bilenko, M., Basu, S., & Sahami, M. (2005). Adaptive product normalization: Using online
learning for record linkage in comparison shopping. In Proceedings of the 5th IEEE
International Conference on Data Mining (ICDM-2005), pp. 58–65.

Bilenko, M., & Mooney, R. J. (2002). Learning to combine trained distance metrics for
duplicate detection in databases. Tech. rep. AI 02-296, Artificial Intelligence Labo-
ratory, University of Texas at Austin, Austin, TX.

Bilenko, M., & Mooney, R. J. (2003a). Adaptive duplicate detection using learnable string
similarity measures. In Proceedings of the Ninth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39–48 Wash-
ington, DC.

Bilenko, M., & Mooney, R. J. (2003b). On evaluation and training-set construction for
duplicate detection. In Proceedings of the KDD-03 Workshop on Data Cleaning,
Record Linkage, and Object Consolidation, pp. 7–12 Washington, DC.

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases.
http://www.ics.uci.edu/˜mlearn/MLRepository.html.

Buntine, W. L. (1994). Operations for learning graphical models. Journal of Artificial
Intelligence Research, 2, 159–225.

Carr, R. D., Doddi, S., Konjevod, G., & Marathe, M. (2000). On the Red-Blue Set Cover
problem. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA-2000) San Francisco, CA.

Chaudhuri, S., Ganjam, K., Ganti, V., & Motwani, R. (2003). Robust and efficient fuzzy
match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD Interna-
tional Conference on Management of Data (SIGMOD-03), pp. 313–324. ACM Press.

Chvatal, V. (1979). A greedy heuristic for the set covering problem. Mathematics of Oper-
ations Research, 4(3), 233–235.

Cohen, W. W. (1998). Integration of heterogeneous databases without common domains
using queries based on textual similarity. In Proceedings of the 1998 ACM SIGMOD
International Conference on Management of Data (SIGMOD-98), pp. 201–212 Seat-
tle, WA.

Cohen, W. W., Kautz, H., & McAllester, D. (2000). Hardening soft information sources.
In Proceedings of the Sixth International Conference on Knowledge Discovery and
Data Mining (KDD-2000) Boston, MA.

Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003a). A comparison of string distance
metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on
Information Integration on the Web, pp. 73–78 Acapulco, Mexico.

Cohen, W. W., Ravikumar, P., & Fienberg, S. E. (2003b). A comparison of string metrics for
matching names and records. In Proceedings of the 2003 ACM SIGKDD Workshop on
Data Cleaning, Record Linkage, and Object Consolidation, pp. 13–18 Washington,
DC.

Cohen, W. W., & Richman, J. (2002). Learning to match and cluster large high-dimensional
data sets for data integration. In Proceedings of the Eighth ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining (KDD-2002), pp.
475–480 Edmonton, Alberta.

Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user
feedback. Tech. rep. TR2003-1892, Cornell University.

Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical
models. Journal of Artificial Intelligence Research, 4, 129–145.

Cooper, G. G., & Herskovits, E. (1992). A Bayesian method for the induction of proba-
bilistic networks from data. Machine Learning, 9, 309–347.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley-Interscience.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Dhillon, I. S., & Modha, D. S. (2001). Concept decompositions for large sparse text data
using clustering. Machine Learning, 42, 143–175.

Dhillon, I. S., & Guan, Y. (2003). Information theoretic clustering of sparse co-occurrence
data. In Proceedings of the Third IEEE International Conference on Data Mining
(ICDM-2003), pp. 517–521.

Dom, B. E. (2001). An information-theoretic external cluster-validity measure. Research
report RJ 10219, IBM.

Dong, X., Halevy, A., & Madhavan, J. (2005). Reference reconciliation in complex infor-
mation spaces. In Proceedings of the 2005 ACM SIGMOD international conference
on Management of data (SIGMOD-2005), pp. 85–96 Baltimore, MD.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (Second edition).
Wiley, New York.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Eisner, J. (2002). Parameter estimation for probabilistic finite-state transducers. In Proceed-
ings of the 40th Annual Meeting of the Association for Computational Linguistics
(ACL-2002).

Elfeky, M. G., Elmagarmid, A. K., & Verykios, V. S. (2002). TAILOR: A record linkage
tool box. In Proceedings of the 18th International Conference on Data Engineering
(ICDE-2002), pp. 17–28.

Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the Seven-
teenth International Joint Conference on Artificial Intelligence (IJCAI-2001).

Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American
Statistical Association, 64(328), 1183–1210.

Fern, X., & Brodley, C. (2003). Random projection for high dimensional data clustering:
A cluster ensemble approach. In Proceedings of 20th International Conference on
Machine Learning (ICML-2003).

Freitag, D., & McCallum, A. (1999). Information extraction with HMMs and shrinkage. In
Papers from the Sixteenth National Conference on Artificial Intelligence (AAAI-99)
Workshop on Machine Learning for Information Extraction, pp. 31–36 Orlando, FL.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Pro-
ceedings of the Thirteenth International Conference on Machine Learning (ICML-
96), pp. 148–156 Bari, Italy. Morgan Kaufmann.

Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1), 55–77.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 6, 721–742.

Giles, C. L., Bollacker, K., & Lawrence, S. (1998). CiteSeer: An automatic citation index-
ing system. In Proceedings of the Third ACM Conference on Digital Libraries, pp.
89–98 Pittsburgh, PA.

Golub, G. H., & van Loan, C. F. (1989). Matrix Computations (Second edition). Johns
Hopkins University Press.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.

Grenager, T., Klein, D., & Manning, C. D. (2005). Unsupervised learning of field segmen-
tation models for information extraction. In Proceedings of the 43rd Annual Meeting
of the Association for Computational Linguistics (ACL-05).

Gu, L., & Baxter, R. (2004). Adaptive filtering for efficient record linkage. In Proceedings
of the Fourth SIAM International Conference on Data Mining (SDM-04).

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York.

Hammersley, J., & Clifford, P. (1971). Markov fields on graphs and lattices. Unpublished
manuscript.

Hernández, M. A., & Stolfo, S. J. (1995). The merge/purge problem for large databases. In
Proceedings of the 1995 ACM SIGMOD International Conference on Management
of Data (SIGMOD-95), pp. 127–138 San Jose, CA.

Hofmann, T., & Buhmann, J. M. (1998). Active data clustering. In Advances in Neural
Information Processing Systems 10.

Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the
curse of dimensionality. In Proceedings of the 30th ACM Symposium on Theory of
Computing (STOC-98), pp. 604–613.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing
Surveys, 31(3), 264–323.

Jaro, M. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84, 414–420.

Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from
sparse data. In Pattern Recognition in Practice, pp. 381–402.

Jelinek, F. (1998). Statistical Methods for Speech Recognition. MIT Press, Cambridge,
MA.

Jin, L., Li, C., & Mehrotra, S. (2003). Efficient record linkage in large data sets. In
Proceedings of the 8th International Conference on Database Systems for Advanced
Applications (DASFAA-03), pp. 137–148 Kyoto, Japan.

Joachims, T. (2003). Learning to align sequences: A maximum-margin approach. Tech. rep., Cornell University, Department of Computer Science.

Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning
irregular graphs. SIAM Journal on Scientific Computing, 20(1), 359–392.

Kaufman, L., & Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley and Sons, New York.

Kearns, M., Mansour, Y., & Ng, A. Y. (1997). An information-theoretic analysis of hard
and soft assignment methods for clustering. In Proceedings of 13th Conference on
Uncertainty in Artificial Intelligence (UAI-97), pp. 282–293.

Kelley, R. P. (1985). Advances in record linkage methodology: a method for determining the best blocking strategy. In Record Linkage Techniques - 1985: Proceedings of the Workshop on Exact Matching Methodologies, pp. 199–203 Arlington, VA.

Klein, D., Kamvar, S. D., & Manning, C. (2002). From instance-level constraints to space-
level constraints: Making the most of prior knowledge in data clustering. In Pro-
ceedings of 19th International Conference on Machine Learning (ICML-2002), pp.
307–314 Sydney, Australia.

Kschischang, F. R., Frey, B., & Loeliger, H.-A. (2001). Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

le Cessie, S., & van Houwelingen, J. C. (1992). Ridge estimators in logistic regression.
Applied Statistics, 41(1), 191–201.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised
learning. In Proceedings of the Eleventh International Conference on Machine Learn-
ing (ICML-94), pp. 148–156 San Francisco, CA. Morgan Kaufmann.

Li, X., Morie, P., & Roth, D. (2004). Identification and tracing of ambiguous names: Dis-
criminative and generative approaches. In Proceedings of the 19th National Confer-
ence on Artificial Intelligence (AAAI-2004).

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on
Information Theory, 37(1), 145–151.

Lindenbaum, M., Markovitch, S., & Rusakov, D. (1999). Selective sampling for nearest
neighbor classifiers. In Proceedings of the Sixteenth National Conference on Artificial
Intelligence (AAAI-99), pp. 366–371.

Liu, T., Moore, A., Gray, A., & Yang, K. (2004). An investigation of practical approximate
nearest neighbor algorithms. In Advances in Neural Information Processing Systems
16.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate obser-
vations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and
Probability, pp. 281–297.

McCallum, A., Bellare, K., & Pereira, F. (2005). A conditional random field for
discriminatively-trained finite-state string edit distance. In Proceedings of the 21st
Conference on Uncertainty in Artificial Intelligence (UAI-2005).

McCallum, A., Nigam, K., & Ungar, L. (2000). Efficient clustering of high-dimensional
data sets with application to reference matching. In Proceedings of the Sixth Inter-
national Conference on Knowledge Discovery and Data Mining (KDD-2000), pp.
169–178 Boston, MA.

McCallum, A., & Wellner, B. (2004a). Conditional models of identity uncertainty with ap-
plication to noun coreference. In Advances in Neural Information Processing Systems
17.

McCallum, A., & Wellner, B. (2004b). Conditional models of identity uncertainty with ap-
plication to noun coreference. In Advances in Neural Information Processing Systems
17.

Meila, M. (2003). Comparing clusterings by the variation of information. In Proceedings
of the 16th Annual Conference on Computational Learning Theory, pp. 173–187.

Michalowski, M., Thakkar, S., & Knoblock, C. A. (2003). Exploiting secondary sources for
automatic object consolidation. In Proceedings of the ACM SIGKDD-03 Workshop on
Data Cleaning, Record Linkage, and Object Consolidation, pp. 34–36 Washington,
DC.

Minton, S. N., Nanjo, C., Knoblock, C. A., Michalowski, M., & Michelson, M. (2005). A
heterogeneous field matching method for record linkage. In Proceedings of the 5th
IEEE International Conference on Data Mining (ICDM-2005), pp. 314–321.

Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York, NY.

Monge, A. E., & Elkan, C. (1996). The field matching problem: Algorithms and applica-
tions. In Proceedings of the Second International Conference on Knowledge Discov-
ery and Data Mining (KDD-96), pp. 267–270 Portland, OR.

Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Jordan, M. I. (Ed.), Learning in Graphical Models, pp.
355–368. MIT Press.

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search
for similarities in the amino acid sequences of two proteins. Journal of Molecular
Biology, 48, 443–453.

Newcombe, H. B. (1988). Handbook of record linkage: methods for health and statistical
studies, administration, and business. Oxford University Press.

Newcombe, H. B., Kennedy, J. M., Axford, S. J., & James, A. P. (1959). Automatic linkage
of vital records. Science, 130, 954–959.

Papoulis, A., & Pillai, S. U. (2001). Probability, Random Variables and Stochastic Pro-
cesses (Fourth edition). McGraw-Hill Inc., New York.

Pasula, H., Marthi, B., Milch, B., Russell, S., & Shpitser, I. (2003). Identity uncertainty and
citation matching. In Advances in Neural Information Processing Systems 15. MIT
Press.

Peleg, D. (2000). Approximation algorithms for the Label-Cover_MAX and Red-Blue Set
Cover problems. In Proceedings of the 7th Scandinavian Workshop on Algorithm
Theory (SWAT-2000), LNCS 1851.

Pereira, F. C. N., Tishby, N., & Lee, L. (1993). Distributional clustering of English words.
In Proceedings of the 31st Annual Meeting of the Association for Computational
Linguistics (ACL-93), pp. 183–190 Columbus, Ohio.

Platt, J. (1999a). Fast training of support vector machines using sequential minimal op-
timization. In Schölkopf, B., Burges, C. J. C., & Smola, A. J. (Eds.), Advances in
Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge,
MA.

Platt, J. C. (1999b). Probabilistic outputs for support vector machines and comparisons
to regularized likelihood methods. In Smola, A. J., Bartlett, P., Schölkopf, B., &
Schuurmans, D. (Eds.), Advances in Large Margin Classifiers, pp. 185–208. MIT
Press.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San
Mateo, CA.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.

Ravikumar, P., & Cohen, W. W. (2004). A hierarchical graphical model for record linkage. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-2004).

Ristad, E. S., & Yianilos, P. N. (1998). Learning string edit distance. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 20(5), 522–532.

Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estima-
tion of error reduction. In Proceedings of 18th International Conference on Machine
Learning (ICML-2001), pp. 441–448. Morgan Kaufmann, San Francisco, CA.

Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw
Hill, New York.

Sarawagi, S., & Bhamidipaty, A. (2002). Interactive deduplication using active learning.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2002) Edmonton, Alberta.

Segal, E., Battle, A., & Koller, D. (2003). Decomposing gene expression into cellular
processes. In Proceedings of the 8th Pacific Symposium on Biocomputing.

Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings
of the ACM Workshop on Computational Learning Theory Pittsburgh, PA.

Shawe-Taylor, J., & Cristianini, N. (2000). Kernel Methods for Pattern Analysis. Cam-
bridge University Press.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

Singla, P., & Domingos, P. (2005). Object identification with attribute-mediated depen-
dences. In Proceedings of the 9th European Conference on Principles and Practice
of Knowledge Discovery in Databases (PKDD-2005) Porto, Portugal.

Strehl, A. (2002). Relationship-based clustering and cluster ensembles for high-
dimensional data mining. Ph.D. thesis, The University of Texas at Austin.

Strehl, A., Ghosh, J., & Mooney, R. (2000). Impact of similarity measures on web-page
clustering. In Workshop on Artificial Intelligence for Web Search (AAAI 2000), pp.
58–64.

Tejada, S., Knoblock, C. A., & Minton, S. (2001). Learning object identification rules for
information integration. Information Systems Journal, 26(8), 635–656.

Tejada, S., Knoblock, C. A., & Minton, S. (2002). Learning domain-independent string
transformation weights for high accuracy object identification. In Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-2002) Edmonton, Alberta.

Tong, S. (2001). Active Learning: Theory and Applications. Ph.D. thesis, Stanford Univer-
sity, Stanford, CA.

Wagner, R. A., & Fischer, M. J. (1974). The string-to-string correction problem. Journal of
the Association for Computing Machinery, 21, 168–173.

Wagstaff, K., & Cardie, C. (2000). Clustering with instance-level constraints. In Proceed-
ings of the Seventeenth International Conference on Machine Learning (ICML-2000),
pp. 1103–1110 Stanford, CA.

Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the
randomized GACV. In Burges, C. J., Schölkopf, B., & Smola, A. J. (Eds.), Advances
in Kernel Methods – Support Vector Learning, pp. 69–88. MIT Press.

Wainwright, M. J., & Jordan, M. I. (2003). Graphical models, exponential families, and
variational inference. Tech. rep. 649, Department of Statistics, University of Califor-
nia, Berkeley.

Wellner, B., McCallum, A., Peng, F., & Hay, M. (2004). An integrated, conditional model
of information extraction and coreference with application to citation matching. In
Proceedings of 20th Conference on Uncertainty in Artificial Intelligence (UAI-2004)
Banff, Canada.

Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms. AI Review, 11,
273–314.

Winkler, W. E. (1993). Improved decision rules in the Fellegi-Sunter model of record link-
age. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.

Winkler, W. E. (1999). The state of record linkage and current research problems. Tech.
rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.

Winkler, W. E. (2002). Methods for record linkage and Bayesian networks. Tech. rep.,
Statistical Research Division, U.S. Census Bureau, Washington, DC.

Winkler, W. E. (2005). Approximate string comparator search strategies for very large
administrative lists. Tech. rep., Statistical Research Division, U.S. Census Bureau,
Washington, DC.

Winkler, W. E. (2006). Overview of record linkage and current research directions. Tech.
rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.

Witten, I. H., & Frank, E. (1999). Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with
application to clustering with side-information. In Advances in Neural Information
Processing Systems 15, pp. 505–512 Cambridge, MA. MIT Press.

Yancey, W. E. (2004). An adaptive string comparator for record linkage. Tech. rep., Statis-
tical Research Division, U.S. Census Bureau, Washington, DC.

Zhang, Y., Brady, M., & Smith, S. (2001). Hidden Markov random field model and segmen-
tation of brain MR images. IEEE Transactions on Medical Imaging, 20(1), 45–57.

Zhu, J. J., & Ungar, L. H. (2000). String edit analysis for merging databases. In Proceedings
of the KDD-2000 Workshop on Text Mining.

Vita

Misha Bilenko was born in Saratov, Russia in 1978. After graduating from Physico-
technical Lyceum in 1995, he decided to go west and found himself in Spokane, WA, where
he studied Computer Science and Physics at Whitworth College, graduating summa cum
laude in 1999. After spending a year back in Russia, he began graduate studies in the De-
partment of Computer Sciences at the University of Texas at Austin in 2000. Still unable to
resist the urge to go west, he will next join Microsoft Research in sunny Redmond, WA.

Permanent Address: Department of Computer Sciences
Taylor Hall 2.124
University of Texas at Austin
Austin, TX 78712-1188
USA

This dissertation was typeset with LaTeX 2ε by the author.¹

¹ LaTeX 2ε is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is a trademark of the American Mathematical Society. The macros used in formatting this dissertation were written by Dinesh Das, Department of Computer Sciences, The University of Texas at Austin, and extended by Bert Kay, James A. Bednar, and Ayman El-Khashab.
