Applications of Corpus-Based Semantic Similarity and Word Segmentation to Database Schema Matching

DOI 10.1007/s00778-007-0067-9
REGULAR PAPER
Received: 24 July 2006 / Revised: 8 May 2007 / Accepted: 8 July 2007 / Published online: 18 October 2007
© Springer-Verlag 2007

Abstract In this paper, we present a method for database schema matching: the problem of identifying elements of two given schemas that correspond to each other. Schema matching is useful in e-commerce exchanges, in data integration/warehousing, and in semantic web applications. We first present two corpus-based methods: one method is for determining the semantic similarity of two target words and the other is for automatic word segmentation. Then we present a name-based element-level database schema matching method that exploits both the semantic similarity and the word segmentation methods. Our word similarity method uses pointwise mutual information (PMI) to sort lists of important neighbor words of two target words; the words which are common in both lists are selected and their PMI values are aggregated to calculate the relative similarity score. Our word segmentation method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented" text. It also uses a modified forward–backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. Finally, we exploit both the semantic similarity and the word segmentation methods in our proposed name-based element-level schema matching method. This method uses a single property (i.e., element name) for schema matching and nevertheless achieves an F-measure score that is comparable to the methods that use multiple properties (e.g., element name, text description, data instance, context description). Our schema matching method also uses normalized and modified versions of the longest common subsequence string matching algorithm, with weight factors to allow for a balanced combination. We validate our methods with experimental studies, the results of which suggest that these methods can be a useful addition to the set of existing methods.

Keywords Database schema matching · Semantic similarity · Word segmentation · Corpus-based methods

A. Islam (B) · D. Inkpen · I. Kiringa
School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
e-mail: [email protected]
URL: www.site.uottawa.ca/∼mdislam

D. Inkpen
e-mail: [email protected]
URL: www.site.uottawa.ca/∼diana

I. Kiringa
e-mail: [email protected]
URL: www.site.uottawa.ca/∼kiringa

1 Introduction

Schema matching is the problem of identifying elements of two given schemas that correspond to each other. It has been the focus of research since the 1970s in the artificial intelligence, databases, and knowledge representation communities. Schema matching can also be defined as discovering semantically corresponding attributes in different schemas. Traditionally, the problem of matching schemas has essentially relied on finding pairwise-attribute correspondences. Though schema matching identifies elements that correspond to each other, it does not explain how they correspond. For example, it might say that FirstName and LastName in one schema are related to Name in the other, but it does not say that concatenating the former yields the latter. Automatically discovering these correspondences or matches is inherently difficult.

Today, many researchers realize that schema matching is a core problem in e-commerce exchanges, in data integration/warehousing, and in semantic web applications.
Schema matching is fundamental for enabling query mediation and data exchange across information sources [2,57]. While schema matching has always been a problematic and interesting aspect of information integration, the problem is exacerbated as the number of information sources to be integrated, and hence the number of integration problems that must be solved, grows. Such schema matching problems arise both in "classical" scenarios such as company mergers, and in "new" scenarios such as the integration of diverse sets of information sources queryable over the web.

We present a schema matching method that uses a single property (i.e., element name) for matching and achieves an F-measure score comparable to that of methods that use multiple properties (e.g., element name, text description, data instance, context description). Using a single property instead of multiple properties can speed up the matching process; this is important when schema matching is used in peer-to-peer (P2P) data management systems or in online query processing environments. If the properties that we use for schema matching contain element names or any type of text description, then we need to focus on both string matching and semantic similarity of words, because sometimes only string matching or only semantic similarity of words provides good mapping results, and sometimes we need to use both in a balanced way. Names in schemas are often not segmented (words are connected together to form a name); therefore a good word segmentation method is required for better schema matching results. We propose two corpus-based methods: one for determining the semantic similarity of words and the other for word segmentation; then we formulate a name-based schema matching method that uses these two corpus-based methods. By corpus we mean a large collection of general-purpose English text.

We were motivated to propose corpus-based similarity and word segmentation methods for several reasons. First, we focused our attention on corpus-based measures because of their large type coverage (types of words). The types that are used in real-world database schema elements are often not found in dictionaries. Second, off-the-shelf corpus-based similarity measures do not perform as well as dictionary-based measures, which drew us to devise a corpus-based similarity measure whose performance would be comparable to that of dictionary-based measures. Third, some existing corpus-based word segmentation methods provide a good precision score but a low recall score, and as a result a low F-measure score; we were therefore inspired to propose a corpus-based word segmentation method that provides a good F-measure score.

In this paper, we make the following contributions.

• First, we present a corpus-based method for determining the semantic similarity of two target words. Our method uses pointwise mutual information (PMI) to sort lists of important neighbor words of the two target words, distinguish the words which are common in both lists, and aggregate their PMI values from the opposite list to calculate the relative similarity score. Evaluation results show that our method outperforms several competing corpus-based methods.
• Second, we present a new corpus-based method for automatic word segmentation. Our method uses corpus type frequency information to choose the type with maximum length and frequency from "desegmented"¹ text. It also uses a modified forward–backward matching technique using maximum length frequency and entropy rate if any non-matching portions of the text exist. Evaluation results show that our method outperforms several competing corpus-based segmentation methods.
• Third, we present a name-based element-level schema matching method that exploits our proposed corpus-based word similarity and word segmentation methods together with a substring matching algorithm. Finally, we point out some areas where these methods, or modified versions of them, can be exploited.

¹ Words are connected together to form a name.

The remainder of this paper is organized as follows. Section 2 introduces the idea of corpus-based semantic similarity of words, and describes in detail our proposed method, with experimental results. In Sect. 3, we discuss corpus-based word segmentation and related work. We also describe our proposed word segmentation method with examples and evaluation. Then Sect. 4 presents database schema matching: a brief overview of schema matching approaches and our proposed name-based element-level hybrid schema matching method. Finally, we conclude in Sect. 5 with a brief discussion of future work.

2 Semantic similarity of words

Semantic relatedness refers to the degree to which two concepts or words are related (or not), whereas semantic similarity is a special case, or a subset, of semantic relatedness. Humans are able to easily judge if a pair of words is related in some way. For example, most would agree that apple and orange are more related than are apple and toothbrush. Budanitsky and Hirst [9] point out that semantic similarity is used when similar entities such as apple and orange or table and furniture are compared. These entities are close to each other in an is–a hierarchy. For example, apple and orange are hyponyms of fruit, and table is a hyponym of furniture. However, even dissimilar entities may be semantically related, for example, glass and water, tree and shade, or gym and weights. In this case the two entities are intrinsically not
similar, but are related by some relationship. Sometimes this relationship may be one of the classical relationships, such as meronymy (is part of) as in computer–keyboard, or a non-classical one as in glass–water, tree–shade and gym–weights. Thus two entities are semantically related if they are semantically similar (close together in the is–a hierarchy) or share any other classical or non-classical relationship.

Measures of the semantic similarity of words have long been used in applications in natural language processing and related areas, such as the automatic creation of thesauri [23,35,37], automatic indexing, text annotation and summarization [36], text classification, word sense disambiguation [34,35,65], information extraction and retrieval [10,62,64], lexical selection, automatic correction of word errors in text, and discovering word senses directly from text [45]. A word similarity measure is also used for language modeling by grouping similar words into classes [8]. In databases, word similarity can be used to solve semantic heterogeneity, a key problem in any data sharing system, whether it is a federated database, a data integration system, a message passing system, a web service or a peer-to-peer data management system [39].

2.1 Related work on semantic similarity of words

Many different measures of semantic similarity between word pairs have been proposed: some using statistical or distributional techniques [24,37], some using lexical databases (thesauri), and some hybrid approaches combining distributional and lexical techniques. PMI-IR [61] uses PMI and information retrieval (IR) to measure the similarity of pairs of words. PMI-IR is a statistical approach that uses a huge data source, the web: the PMI of two words is approximated by the number of web documents where they co-occur. Another well-known statistical approach to measuring semantic similarity is latent semantic analysis (LSA) [31]. We briefly discuss these two approaches in the next subsections.

Individual words in a given text corpus have more or less differing contexts around them. The context of a word is composed of words co-occurring with it within a certain window around it. Distributional measures use statistics acquired from large text corpora to determine how similar the contexts of two words are. These measures are also used as approximations to measures of semantic similarity of words, because words found in similar contexts tend to be semantically similar. Such measures have traditionally been referred to as measures of distributional similarity. If two words have many co-occurring words, then similar things are being said about both of them, and therefore they are likely to be semantically similar. Conversely, if two words are semantically similar, then they are likely to be used in a similar fashion in text and thus end up with many common co-occurrences. For example, the semantically similar car and vehicle are expected to have a number of common co-occurring words such as parking, garage, model, industry, accident, traffic and so on, in a large enough text corpus.

Various distributional similarity measures were discussed in [63], where the co-occurrence types of a target word are the contexts in which it occurs, and these have associated frequencies which may be used to form probability estimates. Lesk [33] was one of the first to apply the cosine measure, which computes the cosine of the angle between two vectors, to word similarity. The Jensen–Shannon (JS) divergence measure [15,50] and the skew divergence measure [32] are based on the Kullback–Leibler (KL) divergence measure. Jaccard's coefficient [56] calculates the proportion of features belonging to either word that are shared by both words. In the simplest case, the features of a word are defined as the contexts in which it has been seen to occur. PMI was first used to measure word similarity by Church and Hanks [13], where positive values indicate that words occur together more than would be expected under an independence assumption, and negative values indicate that one word tends to appear only when the other does not. Jaccard-MI is a variant [37] in which the features of a word are those contexts for which the pointwise mutual information between the word and the context is positive. Average mutual information corresponds to the expected value of two random variables, using the same equation as PMI, and was used as a word similarity measure by [15,52]. Cosine of pointwise mutual information was used by [45] to uncover word senses from text. The L1 norm method was proposed as an alternative word similarity measure in language modeling, to overcome zero-frequency problems for bigrams [15]. A likelihood ratio was used by [20] to test word similarity under the assumption that the words in text have a binomial distribution.

There are several dictionary-based approaches to measuring the similarity of words. Most of the dictionary-based approaches use WordNet [43], a broad-coverage lexical network of English words. Some use Roget's Thesaurus. Budanitsky and Hirst [9] presented a detailed overview of several WordNet-based measures. We briefly discuss Lin's [38] approach, one of the hybrid measures using both WordNet and a corpus, in the next subsection. Jarmasz and Szpakowicz [27] implemented a similarity measure using Roget's Thesaurus.

2.1.1 Latent semantic analysis (LSA)

LSA [31], a high-dimensional linear association model, analyzes a large corpus of natural text and generates a representation that captures the similarity of words and text passages. The underlying idea is that the aggregation of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each
other [31]. The model tries to answer how people acquire as much knowledge as they do on the basis of as little information as they get. It uses the singular value decomposition (SVD) to find the semantic representations of words by analyzing the statistical relationships among words in a large corpus of text. The corpus is broken up into chunks of text approximately the size of a small text or paragraph. Landauer and Dumais mentioned in [31]: ". . . we took a sample consisting of (usually) the whole text or its first 2,000 characters, whichever was less, for a mean text sample length of 151 words, roughly the size of a rather long paragraph". Analyzing each text or paragraph, the number of occurrences of each word is set in a matrix with a column for each word and a row for each paragraph. Then each cell of the matrix (a word-by-context matrix, X) is transformed from the raw frequency count into the log of the count. After that, each cell is divided by the entropy of the column, given by −Σ p log p, where the summation is over all the paragraphs in which the word appeared. The next step is to apply SVD to X, to decompose X into a product of three matrices,

X = W S P^T,

where W and P are in column orthonormal form (i.e., the columns are orthogonal) and S is the diagonal matrix of non-zero entries (singular values). To reduce dimensions, the rows of W and P corresponding to the highest entries of S are kept. In other words, the new lower-dimensional matrices W_L, P_L and S_L are the matrices produced by removing the columns and rows with the smallest singular values from W, P and S. This new matrix,

X_L = W_L S_L P_L^T,

is a compressed matrix which represents all the words and text samples in a lower-dimensional space. Then the similarity of two words, using LSA, is measured by the cosine of the angle between their corresponding row vectors.
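To make the decomposition concrete, here is a minimal sketch of LSA-style word similarity (ours, not the code of [31]); the toy word-by-context matrix, the vocabulary and the dimension k = 2 are illustrative assumptions, and the log/entropy weighting step is only noted in a comment.

import numpy as np

# Toy word-by-context count matrix X (rows: words, columns: text chunks).
# A faithful LSA pipeline would first replace each count by its log and
# divide by the entropy of the word's distribution over the chunks.
words = ["car", "vehicle", "banana"]
X = np.array([[3.0, 0.0, 2.0, 1.0],
              [2.0, 1.0, 2.0, 0.0],
              [0.0, 4.0, 0.0, 3.0]])

# SVD: X = W S P^T with W, P column-orthonormal and S diagonal.
W, s, Pt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values: X_L = W_L S_L P_L^T.
k = 2
vecs = W[:, :k] * s[:k]          # row vectors representing the words

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vecs[0], vecs[1]))  # car vs. vehicle: high
print(cosine(vecs[0], vecs[2]))  # car vs. banana: lower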
2.1.2 PMI-IR

PMI-IR [61], a simple unsupervised learning algorithm for recognizing synonyms, uses PMI as follows:

score(choice_i) = p(problem & choice_i) / p(choice_i)

Here, problem represents the problem word, and {choice_1, choice_2, . . . , choice_n} represent the alternatives. p(problem & choice_i) is the probability that problem and choice_i co-occur. In other words, each choice is simply scored by the conditional probability of the problem word, given the choice word, p(problem | choice_i). If problem and choice_i are statistically independent, then the probability that they co-occur is given by the product p(problem) · p(choice_i). If they are not independent, and they have a tendency to co-occur, then p(problem & choice_i) will be greater than p(problem) · p(choice_i).

PMI-IR used the AltaVista Advanced Search query syntax to calculate the probabilities. In the simplest case, two words co-occur when they appear in the same document:

score_1(choice_i) = hits(problem AND choice_i) / hits(choice_i)

Here, hits(x) is the number of hits (the number of documents retrieved) when the query x is given to AltaVista. AltaVista provides how many documents contain both problem and choice_i, and then how many documents contain choice_i alone. The ratio of these two numbers is the score for choice_i. There are three other versions of this scoring equation, based on the closeness of the pairs in documents, considering antonyms, and taking context into account.

2.1.3 Lin's measure

Lin [38] noticed that most of the similarity measures were tied to a particular application, domain or resource, and he then attempted to define a similarity measure that would be both universal and theoretically justified. He used the following three intuitions as a basis:

(1) The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are. The commonality between A and B is measured by I(common(A, B)), where common(A, B) is a proposition that states the commonalities between A and B, and I(s) is the amount of information contained in a proposition s.
(2) The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are. The difference between A and B is measured by I(description(A, B)) − I(common(A, B)), where description(A, B) is a proposition that describes what A and B are.
(3) The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

Given these assumptions and definitions and the apparatus of information theory, Lin proved the following theorem:

Similarity Theorem. The similarity between A and B is measured by the ratio between the amount of information
needed to state the commonality of A and B and the information needed to fully describe what A and B are:

sim(A, B) = log P(common(A, B)) / log P(description(A, B))

Lin demonstrated how this similarity theorem could be applied in different domains using WordNet and a corpus. For example, his measure of similarity between two concepts in a taxonomy is a corollary of this theorem:

sim(A, B) = 2 log P(lso(A, B)) / (log P(A) + log P(B)),

where the probabilities P(x) are determined by

P(x) = ( Σ_{w ∈ W(x)} count(w) ) / N,

where W(x) is the set of words (nouns) in the corpus whose senses are subsumed by concept x, and N is the total number of word (noun) tokens in the corpus that are also present in WordNet.

2.2 Proposed second-order co-occurrence PMI method

Let w1 and w2 be the two words for which we need to determine the semantic similarity, and let C = {c1, c2, . . . , cm} denote a large corpus of text (after some preprocessing, e.g., stop word elimination and lemmatization) containing m words (tokens). Also, let T = {t1, t2, . . . , tn} be the set of all unique words (types) which occur in the corpus C. Unlike the corpus C, which is an ordered list containing many occurrences of the same words, T is a set containing no repeated words. Throughout this section, we will use w to denote either w1 or w2.

We set a parameter α, which determines how many words before and after the target word w will be included in the context window. The window also contains the target word w itself, resulting in a window size of 2α + 1 words. The steps in determining the semantic similarity involve scanning the corpus and then extracting some functions related to frequency counts.

We define the type frequency function,

f^t(t_i) = |{k : c_k = t_i}|, where i = 1, 2, . . . , n,

which tells us how many times the type t_i appeared in the entire corpus. Let

f^b(t_i, w) = |{k : c_k = w and c_{k±j} = t_i}|,

where i = 1, 2, . . . , n and −α ≤ j ≤ α, be the bigram frequency function. f^b(t_i, w) tells us how many times word t_i appeared with word w in a window of size 2α + 1 words.

Then we define the pointwise mutual information function for only those words having f^b(t_i, w) > 0:

f^pmi(t_i, w) = log2( f^b(t_i, w) × m / (f^t(t_i) f^t(w)) ),

where f^t(t_i) f^t(w) > 0 and m is the total number of tokens in corpus C, as mentioned earlier. Now, for word w1, we define a set of words, X, sorted in descending order by their PMI values with w1, and take the top-most β1 words having f^pmi(t_i, w1) > 0:

X = {X_i}, where i = 1, 2, . . . , β1 and f^pmi(t_1, w1) ≥ f^pmi(t_2, w1) ≥ · · · ≥ f^pmi(t_{β1−1}, w1) ≥ f^pmi(t_{β1}, w1)

Similarly, for word w2, we define a set of words, Y, sorted in descending order by their PMI values with w2, and take the top-most β2 words with f^pmi(t_i, w2) > 0:

Y = {Y_i}, where i = 1, 2, . . . , β2 and f^pmi(t_1, w2) ≥ f^pmi(t_2, w2) ≥ · · · ≥ f^pmi(t_{β2−1}, w2) ≥ f^pmi(t_{β2}, w2)

Note that we have not yet determined the values for the βs (either β1 or β2), which actually depend on the word w and on the number of types in the corpus (this will be discussed in the next section).

Again, we define the β-PMI summation function. For word w1, the β-PMI summation function is:

f^β(w1) = Σ_{i=1}^{β1} ( f^pmi(X_i, w2) )^γ, where f^pmi(X_i, w2) > 0 and f^pmi(X_i, w1) > 0,

which sums all the positive PMI values of words in the set Y that are also common to the words in the set X. In other words, this function aggregates the positive PMI values of all the semantically close words of w2 which are also common in w1. Note that we call these words semantically close because they all have high PMI values with w2; this does not ensure closeness with respect to the distance within the window size. Similarly, for word w2, the β-PMI summation function is:

f^β(w2) = Σ_{i=1}^{β2} ( f^pmi(Y_i, w1) )^γ, where f^pmi(Y_i, w1) > 0 and f^pmi(Y_i, w2) > 0,

which sums all the positive PMI values of words in the set X that are also common to the words in the set Y. In other words, this function aggregates the positive PMI values of all the semantically close words of w1 which are also common in w2. We have not yet discussed the criteria for choosing the exponential parameter γ (this will be discussed in the next subsection).
Finally, we define the semantic PMI similarity function between two words, w1 and w2:

Sim(w1, w2) = f^β(w1)/β1 + f^β(w2)/β2

Fig. 1 Results on the 80 TOEFL questions

2.2.1 Choosing the values of β and γ

The values of β1 and β2 are computed from the type frequencies of the target words and the number of types in the corpus:

β_i = ( log( f^t(w_i) ) )² × log2(n) / δ,

where i = 1, 2 and δ is a constant.
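The whole SOC-PMI computation can be summarized in a short sketch (ours, not the authors' implementation). It assumes a preprocessed, tokenized corpus in which both target words occur, and the default values for alpha, gamma and delta below are placeholders rather than the tuned ones:

import math
from collections import Counter

def soc_pmi(corpus, w1, w2, alpha=5, gamma=3.0, delta=6.5):
    # corpus: list of tokens after stop-word removal and lemmatization.
    m = len(corpus)                      # number of tokens
    ft = Counter(corpus)                 # type frequency function f^t
    n = len(ft)                          # number of types

    def fb(word):                        # bigram frequencies f^b(t_i, word)
        counts = Counter()
        for k, tok in enumerate(corpus):
            if tok == word:
                lo, hi = max(0, k - alpha), min(m, k + alpha + 1)
                for j in range(lo, hi):
                    if j != k:
                        counts[corpus[j]] += 1
        return counts

    def pmi(word):                       # positive f^pmi(t_i, word) values
        return {t: v for t, f in fb(word).items()
                if (v := math.log2(f * m / (ft[t] * ft[word]))) > 0}

    def beta(word):                      # beta_i = (log f^t(w_i))^2 log2(n) / delta
        return max(1, int(math.log(ft[word]) ** 2 * math.log2(n) / delta))

    p1, p2 = pmi(w1), pmi(w2)
    b1, b2 = beta(w1), beta(w2)
    X = sorted(p1, key=p1.get, reverse=True)[:b1]   # top-beta1 neighbours of w1
    Y = sorted(p2, key=p2.get, reverse=True)[:b2]   # top-beta2 neighbours of w2

    f_beta1 = sum(p2[t] ** gamma for t in X if t in p2)  # beta-PMI summations
    f_beta2 = sum(p1[t] ** gamma for t in Y if t in p1)
    return f_beta1 / b1 + f_beta2 / b2                   # Sim(w1, w2)

Applied to a large corpus, higher scores indicate that the two words share more of their high-PMI neighbours.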
Fig. 3 Results on the 50 ESL questions

Fig. 4 Number of missing question or answer words (words not found)

Fig. 5 Correlation of noun pairs (Miller & Charles 30 noun pairs; Rubenstein & Goodenough 65 noun pairs)

These correlation values are very good for a corpus-based measure, considering that a baseline vector space method using cosine obtains 0.406 for the first set and 0.472 for the second set. For dictionary-based measures [27], the correlations are slightly higher, but comparable to ours.

The WordNet-based measures, implemented in the WordNet::Similarity package by Pedersen et al. [46], achieve lower accuracy on the two data sets than the Roget measure [27]. The fact that the Roget measure performs better than the corpus-based measures is to be expected, because Roget's thesaurus can be seen as a classification system. It is composed of six primary classes, and each is composed of multiple divisions and then sections. This may be conceptualized as a tree containing over 1,000 branches for individual meaning clusters or semantically linked words. These words are not exactly synonyms, but can be viewed as colors or connotations of a meaning, or as a spectrum of a concept. One of the most general words is chosen to typify the spectrum as its headword, which labels the whole group.

Second-order co-occurrence PMI may be helpful as a tool to aid in the automatic construction of thesauri. In future work we can try the SOC-PMI method for this task.

3 Corpus-based word segmentation

Word segmentation is an important problem in many natural language processing tasks; for example, in speech recognition, where there is no explicit word boundary information given within a continuous speech utterance, or in interpreting written languages such as Chinese, Japanese and Thai, where words are not delimited by white-space but instead must be inferred from the basic character sequence. We differentiate the terms word breaking and word segmentation. Word breaking refers to the process of segmenting known words that are predefined in a lexicon. Word segmentation
refers to the process of both lexicon word segmentation and unknown word or new word³ detection. Automatic word segmentation is a basic requirement for unsupervised learning in morphological analysis. Developing a morphological analyzer for a new language by hand can be costly and time consuming, requiring a great deal of effort by highly specialized experts.

³ New words in this paper refer to out-of-vocabulary words that are neither recognized as named entities or factoids, nor derived by morphological rules. These words are mostly domain-specific and/or time-sensitive.

In databases, word segmentation can be used in schema matching to solve semantic heterogeneity, a key problem in any data sharing system, whether it is a federated database, a data integration system, a message passing system, a web service or a peer-to-peer data management system [39]. The name of an element in a database typically contains words that are descriptive of the element's semantics. N-grams⁴ have been shown to work well in the presence of short forms, incomplete names and spelling errors that are common in schema names [19].

⁴ A sequence of n consecutive characters.

Also, extracting words (word segmentation) from a scanned document page or a PDF is an important and basic step in document structure analysis and understanding systems; incorrect word segmentation during OCR leads to errors in information retrieval and in understanding the document.

A common approach involving an extensive word list combined with an informed segmentation algorithm can help achieve a certain degree of accuracy in word segmentation, but the greatest barrier to accurate word segmentation is in recognizing unknown words, words not in the lexicon of the segmenter. This problem depends both on the source of the lexicon and on the correspondence between the text in question and the lexicon. Fung and Wu [21] reported that segmentation accuracy is significantly higher when the lexicon is constructed using the same type of corpus as the corpus on which it is tested.

The term maximum-length descending-frequency means that we choose maximum-length n-grams that have a minimum threshold frequency and then we look for further n-grams in descending order based on length. If two n-grams have the same length, then we choose the n-gram with the higher frequency first, and then the n-gram with the next higher frequency if any of its characters are not a part of the previous one. If we follow this procedure, after some iterations we can be in a state with some remaining characters (we call it the residue) that are not matched with any type in the corpus. To solve this, we use the leftMaxMatching and rightMaxMatching algorithms presented in Sect. 3.2, along with the entropy rate.

3.1 Related work on word segmentation

Word segmentation methods can be roughly classified as either dictionary-based or statistically based methods, while many state-of-the-art systems use hybrid approaches. In dictionary-based methods, given an input character string, only words that are stored in the dictionary can be identified. The performance of these methods thus depends to a large degree upon the coverage of the dictionary, which unfortunately may never be complete because new words appear constantly. Therefore, in addition to the dictionary, many systems also contain special components for unknown word identification. In particular, statistical methods have been widely applied because they use a probabilistic or cost-based scoring mechanism, rather than a dictionary, to segment the text [22].

A simple word segmentation algorithm is to consider each character a distinct word. This is practical for Chinese because the average word length is very short, usually between one and two characters, depending on the corpus [21], and actual words can be recognized with this algorithm. Although it does not assist in tasks such as parsing, part-of-speech tagging or text-to-speech systems [60], the character-as-word segmentation algorithm has been used to obtain good performance in Chinese information retrieval, a task in which the words in a text play a major role in indexing.

One of the most popular methods is maximum matching (MM), usually augmented with heuristics to deal with ambiguities in segmentation. Another very common approach to word segmentation is to use a variation of the maximum matching algorithm, frequently referred to as the greedy algorithm. The greedy algorithm starts at the first character in a text and, using a word list for the language being segmented, attempts to find the longest word in the list starting with that character. If a word is found, the maximum-matching algorithm marks a boundary at the end of the longest word, then begins the same longest-match search starting at the character following the match. If no match is found in the word list, the greedy algorithm simply segments that character as a word and begins the search starting at the next character. A variation of the greedy algorithm segments a sequence of unmatched characters as a single word; this variant is more likely to be successful in writing systems with longer average word lengths. In this manner, an initial segmentation can be obtained that is more informed than a simple character-as-word approach. As a demonstration of the application of the character-as-word and greedy algorithms, consider an example of "desegmented" English, in which all the white space has been removed: the "desegmented" version of the text the most favourite music of all time would thus be themostfavouritemusicofalltime. Applying the character-as-word algorithm would result in the useless sequence of tokens t h e m o s t f a v o u r i t e m u s i c o f a l l t i m e, which is why this
algorithm only makes sense for languages such as Chinese. Applying the greedy algorithm with a "perfect" word list containing all known English words would first identify the word them, since that is the longest sequence of letters starting at the initial t which forms an actual word. Starting at the o following them, the algorithm would then find no match. Continuing in this manner, themostfavouritemusicofalltime would be segmented by the greedy algorithm as them o s t favourite music of all time. A variant of the maximum matching algorithm is the reverse maximum matching algorithm, in which the matching proceeds from the end of the string of characters rather than from the beginning. In the foregoing example, themostfavouritemusicofalltime would be segmented as the most favourite music o fall time by the reverse maximum matching algorithm. Greedy matching from the beginning and the end of the string of characters enables an algorithm such as forward–backward matching, in which the results are composed and the segmentation optimized based on the two results [16].

Many unsupervised methods have been proposed for segmenting raw character sequences with no boundary information into words [4,5,11,12,17,25,29]. Brent [4] gives a good survey of these methods. Most current approaches use some form of expectation maximization (EM) to learn a probabilistic speech-or-text model and then employ Viterbi decoding procedures [48] to segment new speech or text into words. One reason that EM is widely adopted for unsupervised learning is that it is guaranteed to converge to a good probability model that locally maximizes the likelihood or posterior probability of the training data. For the problem of word segmentation, EM is typically applied by first extracting a set of candidate multi-grams from a given training corpus [17], initializing a probability distribution over this set, and then using the standard iteration to adjust the probabilities of the multi-grams to increase the posterior probability of the training data.

Saffran et al. [55] proposed that word segmentation from continuous speech may be achieved using transitional probabilities (TP) between adjacent syllables A and B, where TP(A → B) = P(AB)/P(A), with P(AB) being the frequency of B following A, and P(A) the total frequency of A. Word boundaries are postulated at local minima, where the TP is lower than its neighbors.

In corpus-based word segmentation, there is either no explicit model learnt, as when neural networks [54] or lazy learning [14] are used, or the derived models are less sophisticated and do not use any abstractions of the word constituents found in data [7,41]. Using annotated corpora greatly facilitates learning. However, there are situations in which one is interested in unsupervised learning (UL), that is, learning from unannotated corpora. Motivation for UL can vary from purely pragmatic, such as the high cost or unavailability of annotated corpora, to theoretical, when language is modeled as yet another communication code within the framework of information theory [58].

3.2 Proposed word segmentation method

Let S = l1 l2 l3 . . . lm denote a text of m consecutive characters, without any spaces in between, which we need to segment, and let C = {c1, c2, . . . , cτ} denote a large corpus of text containing τ words (tokens). Also, let Tp = {t1, t2, . . . , tp} be the set of all (p) unique words (types) which occur in the corpus C, and let Tf = {f1, f2, . . . , fp} be the set of frequencies of all the corresponding types in Tp, where fx is the frequency of type tx. Unlike the corpus C, which is an ordered list containing many occurrences of the same words, Tp is a set containing no repeated words. Again, let n be the maximum length of any possible word in the segmented word list, where n ≤ m, and let Np = {l1, l2, . . . , ln, l1l2, l2l3, . . . , l1l2 . . . ln, . . .} be the set of all possible n-grams, where η = |Np| is the total number of n-grams in Np. We can also write Np as Np = {w1, w2, . . . , wη}. Let Nf = {f1, f2, . . . , fη} be the set of frequencies of all the corresponding n-grams of Np, taken from Tf, where fx is the frequency of wx. To get rid of the noise types of the corpus, we assign a set of minimum frequencies for each possible length from 1 to n to be considered a valid word: Mf = {α1, α2, . . . , αn}, where αx is the minimum frequency required to be a valid word⁵ of length x. The minimum required frequency αx is inversely proportional to the word length x. The steps of the method are as follows:

⁵ We experimented with the BNC and found the minimum frequency range for nearly all valid words up to words of length 34, as in the BNC the length of the longest valid word is 34. For n = 34, Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; for example, nearly all valid words of length 3 in the BNC have a minimum frequency of 50 or more.

Step 1: Sort all the elements of Np in descending order based on length (in characters). Then sort the same-length words of the sorted Np again, in descending order of their frequencies in Nf (call the result N′p). For each element in N′p, do the next steps:
Step 2: If S ≠ Ø and the current maximum-length n-gram (say wn) in N′p satisfies fn ≥ α|wn| and wn ∈ S (i.e., S ∩ wn = wn), then add wn to the segmented word list S′ (i.e., S′ ← S′ ∪ wn), remove wn from S (i.e., S ← S \ wn), and add a blank space as a boundary mark.
Step 3: If S ≠ Ø and not all elements in N′p are done, then update wn with the next maximum-length n-gram from N′p and go to step 2.
Step 4: Rearrange all the words of S′ in accordance with S. If S = Ø, then output S′ and exit. Otherwise, for each remaining chunk⁶ r in S, call matchResidue(r), then output S′ and exit.

⁶ A single chunk may contain one or more characters.
In matchResidue, if leftMaxMatching and rightMaxMatching return the same number of words, then we use the entropy rate to decide which set of words we will accept. The intuition behind using the entropy rate is that if one set of words has a larger average frequency (we use the normalized frequency in the entropy rate) than the other set, then the first set of words is more meaningful than the second (Figs. 6, 7, 8).

3.3 A walk-through example

As a demonstration of the application of the proposed algorithms, consider the same example of "desegmented" English text, S = {themostfavouritemusicofalltime}⁷. We have used the BNC corpus to calculate Tp and Tf. Let n = 9 be the maximum length⁸ of all possible words in S, and let Mf = {1000, 500, 50, 16, 15, 12, 10, 3, 2}. Table 1 shows the sorted n-grams N′p and their frequencies N′f for this specific example.

⁷ S is a set with one string element; a space in the string will be used to replace a substring that is taken out, in order to distinguish the next parts to be processed.
⁸ Though in the BNC, the length of the longest valid word is 34.

For each element wn (say, favourite) in N′p:
Step 2: wn satisfies fn ≥ α|wn|, as 4671 ≥ 2, and wn is a substring of S.
S′ = {favourite} and S = {themost musicofalltime}.
Step 3: Not all elements in N′p are done; update wn = {alltime} and go to step 2.
Step 2: wn does not satisfy fn ≥ α|wn|, as 6 < 10, though wn is a substring of S.
Step 3: Not all elements in N′p are done; update wn = {favour} and go to step 2.
Step 2: Condition fails, as wn is not a substring of S.
Step 3: Not all elements in N′p are done; update wn = {musico} and go to step 2.
Step 2: Condition fails, as wn does not satisfy fn ≥ α|wn| (10 < 12).
Step 3: Not all elements in N′p are done; update wn = {music} and go to step 2.
Step 2: wn satisfies fn ≥ α|wn|, as 15134 ≥ 15, and wn is a substring of S.
S′ = {favourite, music} and S = {themost ofalltime}.
We will only show step 2 for the remaining elements in N′p that satisfy the conditions.
Step 2: wn = {them}
S′ = {favourite, music, them} and S = {ost ofalltime}.
Step 2: wn = {time}
Fig. 6 The leftMaxMatching algorithm; its output S′t is the segmented word list after left maximum matching

Fig. 7 The rightMaxMatching algorithm; its output S″t is the segmented word list after right maximum matching
S′ = {favourite, music, them, time} and S = {ost ofall}.
Step 2: wn = {fall}
S′ = {favourite, music, them, time, fall} and S = {ost o}.
Step 4: Rearrange: S′ = {them, favourite, music, fall, time}, and S ≠ Ø, so call matchResidue(ost) and then matchResidue(o).
Table 1 Sorted n-grams and their frequencies

N′p        N′f          N′p   N′f
favourite  4671         tem   31
alltime    6            emo   20
favour     6805         ost   18
musico     10           of    3052752
music      15134        it    1054552
them       167457       he    641236
time       164294       me    131869
most       98276        us    80206
fall       11202        co    17476
item       3780         th    16486
rite       293          st    15565
allt       28           al    7299
emus       14           fa    2172
musi       3            em    1641
the        6057315      os    1005
all        282012       te    831
our        93463        si    658
tim        3401         mo    639
hem        305          ti    615
sic        292          im    576

Case 2: matchResidue(o) is called.

S′ = S′ \ {wn−1, wn}
   = {the, most, favourite, music, fall, time} \ {music, fall}
   = {the, most, favourite, time}
St = {musicofall}
S′t = {music, of, all} ← leftMaxMatching(musicofall)
S″t = {mus, ico, fall} ← rightMaxMatching(musicofall)

As in this case |S′t| = |S″t|, we need to find whether S′t or S″t maximizes the entropy rate, (1/|x|) Σ_{i=1}^{|x|} log2(f_i), where x ∈ {S′t, S″t}. The entropy rate for S′t is (13.89 + 21.54 + 18.11)/3, and for S″t it is (8.07 + 6.57 + 13.45)/3. So S′ = {the, most, favourite, time} ∪ S′t, as (1/|S′t|) Σ_{i=1}^{|S′t|} log2(f_i) > (1/|S″t|) Σ_{i=1}^{|S″t|} log2(f_i). Finally, S′ = {the, most, favourite, music, of, all, time}.
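Since the listings in Figs. 6 and 7 are only summarized above, the following is a hedged reconstruction of the greedy left/right maximum matching and of the entropy-rate tie-break; the function names follow the text, everything else is our own rendering:

import math

def left_max_matching(chunk, type_freq, min_freq):
    # Repeatedly take the longest valid prefix of the chunk (cf. Fig. 6).
    words = []
    while chunk:
        best = chunk[0]                  # single-character fallback
        for i in range(min(len(chunk), len(min_freq)), 0, -1):
            if type_freq.get(chunk[:i], 0) >= min_freq[i - 1]:
                best = chunk[:i]
                break
        words.append(best)
        chunk = chunk[len(best):]
    return words

def right_max_matching(chunk, type_freq, min_freq):
    # The same idea, proceeding from the right end of the chunk (cf. Fig. 7).
    words = []
    while chunk:
        best = chunk[-1]
        for i in range(min(len(chunk), len(min_freq)), 0, -1):
            if type_freq.get(chunk[-i:], 0) >= min_freq[i - 1]:
                best = chunk[-i:]
                break
        words.insert(0, best)
        chunk = chunk[:-len(best)]
    return words

def entropy_rate(words, type_freq):
    # (1/|x|) sum_i log2(f_i): the candidate with the larger average wins.
    return sum(math.log2(type_freq[w]) for w in words) / len(words)

# The tie-break above: {music, of, all} scores (13.89 + 21.54 + 18.11)/3.
print(entropy_rate(["music", "of", "all"],
                   {"music": 15134, "of": 3052752, "all": 282012}))  # ~17.85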
Fig. 9 Precision, recall and F-measure (in %) of different segmentation methods tested on the Brown corpus (de Marcken; Peng & Schuurmans; Kit & Wilks; our method)

We compute word precision (P), word recall (R) and word F-measure (F) as follows:

P = TP/(TP + FP)
R = TP/(TP + FN)
F = (1 + β) P R / (β P + R) = 2 P R / (P + R), with β = 1,

such that precision and recall are weighted equally. Here, TP, FP and FN stand for True Positive, False Positive and False Negative, respectively. For instance, if the target segmentation is "we are human" and the model outputs "weare human", then precision is 1/2 ("human" out of "weare" and "human"), recall is 1/3 ("human" out of "we", "are" and "human") and the F-measure is 2/5.
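These measures can be computed as follows (a small sketch of ours; it assumes a word counts as a true positive only when both of its boundaries agree with the target segmentation, which reproduces the example above):

def word_prf(gold, predicted):
    def spans(words):                       # (start, end) span of every word
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    g, p = spans(gold), spans(predicted)
    tp = len(g & p)                         # words with both boundaries correct
    precision, recall = tp / len(p), tp / len(g)
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f

print(word_prf(["we", "are", "human"], ["weare", "human"]))
# -> (0.5, 0.333..., 0.4), i.e. P = 1/2, R = 1/3, F = 2/5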
We used the type frequency from the BNC and tested our segmentation method on part of the BNC corpus. Specifically, we converted a portion of the corpus to lowercase letters and removed all white space and punctuation. We used 285K characters and 57,904 tokens as our test data. We obtained an 84.28% word precision rate, an 81.63% word recall rate and an 82.93% word F-measure.

In a second test, we used the type frequency from the BNC and tested our segmentation method on the Brown corpus, to make sure that we test on different vocabulary from the training data. This ensures that some of the words in the test set were not previously seen (out-of-vocabulary words). There were 4,705,022 characters and 1,003,881 tokens in the Brown corpus. We obtained an 89.92% word precision rate, a 94.69% word recall rate and a 92.24% word F-measure. The average number of tokens per line could be the reason for obtaining a better result when we tested on the Brown corpus, as 8.49 and 16.07 are the average numbers of tokens per line in the Brown corpus and the BNC corpus, respectively.

One of the best known results on segmenting the Brown corpus is due to Kit and Wilks [29], who use a description-length gain method. They trained their model on the whole corpus (6.13M) and reported results on the training set, obtaining a boundary precision of 79.33%, a boundary recall of 63.01% and a boundary F-measure of 70.23%. Peng and Schuurmans [47] trained their model on a subset of the corpus (4,292K) and tested on unseen data. After the lexicon is optimized, they obtained 16.19% higher recall and 4.73% lower precision, resulting in an improvement of 5.2% in boundary F-measure. de Marcken [18] also used a minimum description length (MDL) framework and a hierarchical model to learn a word lexicon from raw speech. However, this work does not explicitly yield word boundaries, but instead recursively decomposes an input string down to the level of individual characters. As pointed out by Brent [4], this study gives credit for detecting a word if any node in the hierarchical decomposition spans the word. Under this measure, de Marcken [18] reports a word recall rate of 90.5% on the Brown corpus. However, his method creates numerous chunks and therefore only achieves a word precision rate of 17%. Christiansen et al. [12] used a simple recurrent neural network approach and report a word precision rate of 42.7% and a word recall rate of 44.9% on spontaneous child-directed British English. Brent and Cartwright [5] used an MDL approach and reported a word precision rate of 41.3% and a word recall rate of 47.3% on the CHILDES collection. Brent [4] achieved about 70% word precision and 70% word recall by employing additional language modeling and smoothing techniques. Peng and Schuurmans [47] obtained a 74.6% word precision rate, a 79.2% word recall rate and a 75.4% word F-measure on the Brown corpus. A balance of high precision and high recall is the main advantage of our proposed method. However, it is difficult to draw a direct comparison between these results because of the different test corpora used by different authors. Figure 9 summarizes the results of the different methods tested on the Brown corpus, based on precision, recall and F-measure. Though all the methods in Fig. 9 use the Brown corpus, the testing data sets in the Brown corpus are not exactly the same.

Actually, this method can effectively distill new words, special terms and proper nouns when the corpus covers a huge collection of both domain-dependent and domain-independent words, and it can effectively avoid statistical errors in shorter strings which belong to a longer one. However, names are not always easy to exploit and contain abbreviations and special characters that vary between domains.
4 Database schema matching

A schema matching system can speed up the matching process by either automatically discovering good mappings, or at least by proposing likely matches that are then verified by a human expert [28]. Rahm and Bernstein [49] point out that it is not possible to determine fully automatically all matches between two schemas, primarily because most schemas have some semantics that affects the matching criteria but is not formally expressed, or often not even documented. The implementation of the matching should therefore only determine match candidates, which the user can accept, reject or change. Furthermore, the user should be able to specify matches for elements for which the system was unable to find satisfactory match candidates.

Fig. 10 Classification of schema matching approaches [49]

4.1.1 Linguistic approaches

Linguistic matchers use element names and text (i.e., words/tokens or sentences) to find semantically similar schema elements. We discuss here two schema-level approaches, name matching and description matching.

Element name matching. Element name-based matching matches schema elements with equal or similar names. Similarity of names can be defined and measured in various ways, including the use of synonyms and hypernyms.
Solving any task related to synonyms and hypernyms normally requires the use of thesauri or dictionaries. General natural-language dictionaries such as LDOCE⁹ or WordNet¹⁰ may be useful; perhaps even multi-language dictionaries (e.g., English–German or German–English) to deal with input schemas in different languages. In addition, name matching can use domain- or enterprise-specific dictionaries and is–a taxonomies containing common names, synonyms and descriptions of schema elements, abbreviations, etc. These specific dictionaries require a substantial effort to be built up in a consistent way. The effort may be worth the investment, especially for schemas with relatively flat structure, where dictionaries provide the most valuable matching hints. But corpus-based methods could be a better choice than dictionary-based methods, as a balanced corpus covers a huge collection of both domain-dependent and domain-independent words, including special terms and proper nouns. Furthermore, tools are needed to enable names to be accessed and (re-)used, such as within a schema editor when defining new schemas.

Homonyms (one of two or more words that have the same sound and often the same spelling but differ in meaning) can mislead a matching algorithm, as homonyms are similar names that refer to different elements. Clearly, homonyms may be a part of natural language, such as bank (embankment, river bank) and bank (place where money is kept). A name matcher can reduce the number of wrongly matched candidates using mismatch information supplied by users or dictionaries, though this requires a substantial effort; or, at least, the matcher can offer a warning of the potential ambiguity due to multiple meanings of the name.

Name-based matching can identify multiple relevant matches for a given schema element; that is, it is not limited to finding just 1:1 matches. For example, it can match "address" with both "home address" and "office address".

In the case of synonyms and hypernyms, the join-like processing involves a list D of word pairs and their similarity as a further input. Assume a relation-like representation D(name1, name2, similarity); match candidates are then pairs satisfying conditions such as

(D.name2 = S2.name) and (D.similarity > threshold)

The constraint here is that D will have to contain all relevant pairs of the transitive closure over similar names. For instance, if sim(A, B) = 0.6 and sim(B, C) = 0.7 are in D, then probabilistically we would expect D also to contain sim(B, A) = 0.6, sim(C, B) = 0.7 and sim(A, C) = x, sim(C, A) = x. Probabilistically, we would expect the similarity value x to be 0.6 × 0.7 = 0.42, but this depends on some factors such as the type of similarity, the use of homonyms, and perhaps other factors. For example, we might have sim(deliver, ship) = 0.9 and sim(ship, boat) = 0.9, but not sim(deliver, boat) = x for a high similarity value x. Bright et al. [6] discuss another approach, which assigns different weights to different types of similarity relationships.
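The following sketch illustrates this join-like processing; the schema contents, the entries of D and the threshold are invented for the example, and only pairs explicitly present in D can match:

# D holds (name1, name2) -> similarity; it must itself contain any needed
# transitive pairs, since no closure is computed at match time.
D = {("deliver", "ship"): 0.9, ("ship", "boat"): 0.9}

def match_candidates(schema1, schema2, D, threshold=0.5):
    def sim(a, b):
        if a == b:
            return 1.0
        return max(D.get((a, b), 0.0), D.get((b, a), 0.0))
    return [(s1, s2, sim(s1, s2))
            for s1 in schema1 for s2 in schema2 if sim(s1, s2) > threshold]

print(match_candidates(["deliver"], ["ship", "boat"], D))
# 'deliver'-'ship' qualifies, but 'deliver'-'boat' does not unless
# sim(deliver, boat) is explicitly added to D.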
4.1.2 Description matching

Often, schemas contain text descriptions of elements that typically explain the meaning of elements in natural language, to express the intended semantics of schema elements. But the quality of these descriptions varies a lot. These comments can also be evaluated linguistically to determine the similarity between schema elements. For instance, this would help find that the following elements match, by a linguistic analysis of the comments associated with each schema element:

S1: empn // employee name
S2: name // name of employee

This linguistic analysis could be as simple as extracting keywords from the description, which are used for synonym comparison, much like name matching. Some approaches consider rule-based schema matching, which is domain dependent [44].

4.2 Proposed name-based element-level schema matching method
Fig. 11 The MCLCS1 algorithm; it returns ri, the maximal consecutive LCS starting at character 1
We use three different modified versions of LCS and then take a weighted sum of these¹¹. Kondrak [30] showed that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. Melamed [40] normalized LCS by dividing the length of the longest common subsequence by the length of the longer string, and called it the longest common subsequence ratio (LCSR). But LCSR does not take into account the length of the smaller string, which sometimes has a significant value in the similarity score.

¹¹ We use modified versions because in our experiments we obtained better results (precision and recall for schema matching) than when using the original LCS, or other string similarity measures.

We normalize the LCS so that it takes into account the lengths of both the smaller and the larger string, and call it the normalized longest common subsequence (NLCS), which is

v1 = NLCS(ri, sj) = (length(LCS(ri, sj)))² / (length(ri) × length(sj))

While in classical LCS the common subsequence need not be consecutive, in database schema matching a consecutive common subsequence is important for a high degree of matching. We use the maximal consecutive longest common subsequence starting at character 1, MCLCS1 (Fig. 11), and the maximal consecutive longest common subsequence starting at any character n, MCLCSn (Fig. 12). In Fig. 11, we present an algorithm that takes two strings as input and returns the smaller string, or the maximal consecutive portions of the smaller string that consecutively match with the larger string, where matching must start from the first character (character 1) for both strings. In Fig. 12, we present another algorithm that
takes two strings as input and returns the smaller string, or the maximal consecutive portions of the smaller string that consecutively match with the larger string, where matching may start from any character (character n) for both strings. We normalize MCLCS1 and MCLCSn and call them the normalized MCLCS1 (NMCLCS1) and the normalized MCLCSn (NMCLCSn), respectively:

v2 = NMCLCS1(ri, sj) = (length(MCLCS1(ri, sj)))² / (length(ri) × length(sj))
v3 = NMCLCSn(ri, sj) = (length(MCLCSn(ri, sj)))² / (length(ri) × length(sj))

We take the weighted sum of the individual values v1, v2 and v3 to determine the string similarity score, where w1, w2, w3 are weights and w1 + w2 + w3 = 1. Therefore, the similarity of the two strings is:

α = w1 v1 + w2 v2 + w3 v3

We set equal weights for our experiments. Theoretically, v3 ≥ v2.

For example, if ri = albastru and sj = alabaster, then

LCS(ri, sj) = albastr
MCLCS1(ri, sj) = al
MCLCSn(ri, sj) = bast
NLCS(ri, sj) = 7²/(8 × 9) = 0.68
NMCLCS1(ri, sj) = 2²/(8 × 9) = 0.056
NMCLCSn(ri, sj) = 4²/(8 × 9) = 0.22
String similarity, α = w1 v1 + w2 v2 + w3 v3 = 0.33 × 0.68 + 0.33 × 0.056 + 0.33 × 0.22 = 0.32

We then use a word similarity measure, normalize it (Fig. 13), and combine it with the string similarity to obtain a final similarity score. We now describe our schema matching method in detail.
MC LC S1 (ri , s j ) = al S j = {s1 , s2 , . . . , sn } has n tokens and n ≥ m. Otherwise,
MC LC Sn (ri , s j ) = bast we switch Ri and S j .
Step 2: We count the number of ri s (say, δ) for which ri = s j ,
N LC S(ri , s j ) = 72 /(8 × 9) = 0.68
for all r ∈ Ri and for all s ∈ Si ; that is, there are δ tokens in
N MC LC S + 1 = 22 /(8 × 9) = 0.056 Ri that exactly match with S j , where δ ≤ m. We remove all
N MC LC Sn (ri , s j ) = 42 /(8 × 9) = 0.22 δ tokens from both of Ri and S j . So, Ri = {r1 , r2 , . . . , rm−δ }
String similarity, α = w1 v1 + w2 v2 + w3 v3 and S j = {s1 , s2 , . . . , sn−δ }. If m − δ = 0, we go to step 6.
Step 3: We construct a (m − δ) × (n − δ) matching matrix
= 0.33 × 0.68 + 0.33 × 0.056 + 0.33 × 0.22 = 0.32
(say, M1 = (αi j )(m−δ)×(n−δ) ) using the following process:
We then use word similarity measure, normalize it (Fig. 13) we assume any token ri ∈ Ri has τ characters; that is, ri =
and combine it with the string similarity to obtain a final sim- {c1 c2 . . . cτ } and any token s j ∈ S j has η characters; that
ilarity score. We now describe our schema matching method is, s j = {c1 c2 . . . cη }where τ ≤ η. In other words, η is the
in detail. length of the larger token and τ is the length of the smaller
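To make these definitions concrete, here is a minimal Python sketch of LCS, MCLCS1 and MCLCSn together with their normalized, weighted combination. The code and its function names are our own illustrative choices (Figs. 11 and 12 give the reference algorithms); run on the example above, it prints approximately 0.32.

    def lcs(a, b):
        """Classical longest common subsequence (not necessarily consecutive)."""
        m, n = len(a), len(b)
        dp = [[""] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if a[i] == b[j]:
                    dp[i + 1][j + 1] = dp[i][j] + a[i]
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
        return dp[m][n]

    def mclcs_1(a, b):
        """Maximal consecutive LCS starting at character 1 (the common prefix)."""
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return a[:k]

    def mclcs_n(a, b):
        """Maximal consecutive LCS starting at any character n (longest common substring)."""
        best = ""
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                    k += 1
                if k > len(best):
                    best = a[i:i + k]
        return best

    def _normalize(sub, a, b):
        """v = {length(sub)}^2 / (length(a) x length(b))."""
        return len(sub) ** 2 / (len(a) * len(b))

    def string_similarity(a, b, w1=1/3, w2=1/3, w3=1/3):
        """Weighted sum of NLCS, NMCLCS1 and NMCLCSn (w1 + w2 + w3 = 1)."""
        v1 = _normalize(lcs(a, b), a, b)      # NLCS
        v2 = _normalize(mclcs_1(a, b), a, b)  # NMCLCS1
        v3 = _normalize(mclcs_n(a, b), a, b)  # NMCLCSn
        return w1 * v1 + w2 * v2 + w3 * v3

    print(string_similarity("albastru", "alabaster"))  # ~0.32, as in the worked example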
We then use a word similarity measure, normalize it (Fig. 13), and combine it with the string similarity to obtain a final similarity score. We now describe our schema matching method in detail.

Consider two given database schemas R = {R1, R2, ..., Rσ} and S = {S1, S2, ..., Sχ}; for each element in one database schema, we try to identify a matching element in the other schema, if any, using element names. We assume that schema R has σ elements and Ri is the element's name, where i = 1, ..., σ. Similarly, schema S has χ elements and Sj is the element's name, where j = 1, ..., χ. Note that some elements in R can match multiple elements in S, and vice versa. So, our task is to identify whether an element name Ri ∈ R matches an element name Sj ∈ S. Both Ri and Sj are strings of characters. Our method provides a similarity score between 0 and 1, inclusively. If the similarity score is above a certain threshold, then the elements are considered match candidates. If we set the threshold to 1 and the similarity score reaches this value, only then are we certain about their matching. For all other cases, we can only determine more or less probable match candidates. The method comprises the following six steps.

Step 1: We use all special characters, punctuation marks, and capital letters, if any, as initial word boundaries, and eliminate all these special characters and punctuation marks. After this initial word segmentation, we pass each of these segmented words to our word segmentation method and lemmatize them to generate tokens. We assume Ri = {r1, r2, ..., rm} has m tokens and Sj = {s1, s2, ..., sn} has n tokens, with n ≥ m; otherwise, we switch Ri and Sj.

Step 2: We count the number of tokens ri (say, δ) for which ri = sj, for all ri ∈ Ri and all sj ∈ Sj; that is, there are δ tokens in Ri that exactly match with Sj, where δ ≤ m. We remove all δ tokens from both Ri and Sj. So, Ri = {r1, r2, ..., rm−δ} and Sj = {s1, s2, ..., sn−δ}. If m − δ = 0, we go to step 6.

Step 3: We construct an (m − δ) × (n − δ) matching matrix (say, M1 = (αij)(m−δ)×(n−δ)) using the following process: we assume any token ri ∈ Ri has τ characters, that is, ri = {c1c2...cτ}, and any token sj ∈ Sj has η characters, that is, sj = {c1c2...cη}, where τ ≤ η. In other words, η is the length of the larger token and τ is the length of the smaller token. We calculate the following:

v1 ← NLCS(ri, sj)
v2 ← NMCLCS1(ri, sj)
v3 ← NMCLCSn(ri, sj)
αij ← w1v1 + w2v2 + w3v3

αij is a weighted sum of v1, v2 and v3, where w1, w2, w3 are weights and w1 + w2 + w3 = 1. We set equal weights for our experiments. We put αij in the row i and column j position of the matrix, for all i = 1, ..., m − δ and j = 1, ..., n − δ:

M1 =
  | α11       α12       ...  α1j       ...  α1(n−δ)       |
  | α21       α22       ...  α2j       ...  α2(n−δ)       |
  | ...       ...       ...  ...       ...  ...           |
  | α(m−δ)1   α(m−δ)2   ...  α(m−δ)j   ...  α(m−δ)(n−δ)   |

Step 4: We construct an (m − δ) × (n − δ) similarity matrix (say, M2 = (βij)(m−δ)×(n−δ)) using the following process: we put βij (βij ← similarityMatching(ri, sj), Fig. 13) in the row i and column j position of the matrix, for all i = 1, ..., m − δ and j = 1, ..., n − δ:

M2 =
  | β11       β12       ...  β1j       ...  β1(n−δ)       |
  | β21       β22       ...  β2j       ...  β2(n−δ)       |
  | ...       ...       ...  ...       ...  ...           |
  | β(m−δ)1   β(m−δ)2   ...  β(m−δ)j   ...  β(m−δ)(n−δ)   |

Step 5: We construct another (m − δ) × (n − δ) joint matrix (say, M = (γij)(m−δ)×(n−δ)) using M ← ψM1 + ϕM2 (i.e., γij = ψαij + ϕβij), where ψ is the matching-matrix weight factor and ϕ is the similarity-matrix weight factor, with ψ + ϕ = 1. Setting either of these factors to 0 means that we do not include that matrix. Setting both factors to 0.5 means we consider them equally important.

M =
  | γ11       γ12       ...  γ1j       ...  γ1(n−δ)       |
  | γ21       γ22       ...  γ2j       ...  γ2(n−δ)       |
  | ...       ...       ...  ...       ...  ...           |
  | γ(m−δ)1   γ(m−δ)2   ...  γ(m−δ)j   ...  γ(m−δ)(n−δ)   |

After constructing the joint matrix M, we find the maximum-valued matrix element, γij. We add this matrix element to a list (say, ρ, with ρ ← ρ ∪ {γij}) if γij ≥ ς (we discuss the similarity threshold ς in the next section). We remove all the matrix elements of the ith row and jth column from M. We repeat finding the maximum-valued matrix element γij, adding it to ρ, and removing all the matrix elements of the corresponding row and column, until either γij < ς, or m − δ − |ρ| = 0, or both.

Step 6: We sum up all the elements in ρ and add δ to it to get a total score. We multiply this total score by the reciprocal harmonic mean of m and n (the harmonic mean of m and n is 2mn/(m + n), so its reciprocal is (m + n)/(2mn)) to obtain a balanced similarity score between 0 and 1, inclusively:

SimilarityScore(Ri, Sj) = (δ + Σ_{i=1..|ρ|} ρi) × (m + n) / (2mn)

A compact code sketch of these six steps is given at the end of Sect. 4.2.1.

4.2.1 Choosing the values of ζ, λ and ς

The parameter ζ is the minimum number of characters for which we continue the matching process. Theoretically, ζ could be any value between 1 and m, inclusively. We usually set ζ to 1. If we set ζ to 1, then we get the expected matching results for small-length tokens. For example, consider three sample tokens named min, max and similarity, with ζ set to 1: the pair min max returns m and the pair min similarity returns Ø when we use MCLCS1; when we use MCLCSn, the first pair returns m and the second pair returns mi. But if we set ζ to 2, the pair min max returns Ø for both MCLCS1 and MCLCSn, and if we set ζ to 3, the pair min similarity returns Ø for both MCLCS1 and MCLCSn. Basically, λ depends on the semantic similarity method we use. We choose the value of λ based on the maximum range of the similarity values for the semantic similarity method we use. We usually set λ to 20 when we use the SOCPMI semantic similarity method, because our experiments showed that 20 is the maximum of the region of values for best matches. We can use any other similarity measure, including a dictionary-based or a hybrid approach; for example, if we use the Roget-based measure [27], then we need to set λ to 16. One of the main advantages of using corpus-based distributional measures is that they cover significantly more tokens than any dictionary-based measure. Theoretically, ς could be any value between 0 and 1, exclusively, but we usually set ς close to 0 (we set ς = 0.01 for all of our experiments). Matrix elements with values lower than ς may have a negative impact on the matching result, so it is better to omit them.
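Consolidating the six steps above and the parameters just discussed, the following Python sketch shows one way the method can be realized. It is a minimal sketch, under simplifying assumptions: tokenization and lemmatization are done by the caller, semantic_similarity stands in for the normalized word similarity of Fig. 13, string_similarity is reused from the earlier sketch, and the cutoff ζ is omitted. The names are our own, not part of the method's specification.

    from typing import Callable, List

    def schema_element_similarity(r_tokens: List[str],
                                  s_tokens: List[str],
                                  semantic_similarity: Callable[[str, str], float],
                                  psi: float = 0.5,     # matching-matrix weight factor (psi + phi = 1)
                                  phi: float = 0.5,     # similarity-matrix weight factor
                                  sigma: float = 0.01   # threshold below which joint entries are dropped
                                  ) -> float:
        # Step 1 (tail): ensure n >= m by switching the token lists if needed.
        if len(r_tokens) > len(s_tokens):
            r_tokens, s_tokens = s_tokens, r_tokens
        m, n = len(r_tokens), len(s_tokens)

        # Step 2: remove exact token matches and count them (delta).
        delta, r_rest, s_rest = 0, list(r_tokens), list(s_tokens)
        for t in list(r_rest):
            if t in s_rest:
                r_rest.remove(t)
                s_rest.remove(t)
                delta += 1

        # Steps 3-5: joint matrix gamma_ij = psi * alpha_ij + phi * beta_ij.
        joint = [[psi * string_similarity(r, s) + phi * semantic_similarity(r, s)
                  for s in s_rest] for r in r_rest]

        # Greedy selection: repeatedly take the maximum entry >= sigma,
        # then delete its row and column.
        rho, rows, cols = [], set(range(len(r_rest))), set(range(len(s_rest)))
        while rows and cols:
            i, j = max(((i, j) for i in rows for j in cols),
                       key=lambda ij: joint[ij[0]][ij[1]])
            if joint[i][j] < sigma:
                break
            rho.append(joint[i][j])
            rows.discard(i)
            cols.discard(j)

        # Step 6: scale by the reciprocal harmonic mean of m and n.
        return (delta + sum(rho)) * (m + n) / (2 * m * n)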
4.3 Walk-through examples

We provide two examples that describe the proposed method and determine the similarity score. In Example 1, we use two real element names from a database schema; in Example 2, we use two element names that we created, in order to better illustrate the method (to cover all its strengths at once).

4.3.1 Example 1

Let Ri = "maxprice", Sj = "High_Price".
Step 1: After eliminating all special characters and punctuation, if any, and then applying the word segmentation method and lemmatizing, we get Ri = {max, price} and Sj = {high, price}, where m = 2 and n = 2.
Step 2: Because only one token (i.e., price) in Ri exactly matches with Sj, we set δ to 1. We remove price from both Ri and Sj. So, Ri = {max} and Sj = {high}. As m − δ ≠ 0, we proceed to the next step.
Step 3: We construct a 1 × 1 matching matrix, M1. Consider the max high pair, where η = 4 is the length of the larger token (high), τ = 3 is the length of the smaller token (max), and 0 is the maximal length of the consecutive portions of the smaller token that consecutively match with the larger token. So, v1 = v2 = v3 = 0 and α11 = 0.

M1 =
          high
  max     0

Step 4: We construct a 1 × 1 similarity matrix, M2. Here, λ = 20, as we used the SOCPMI method.

M2 =
          high
  max     0.326

Step 5: We construct a 1 × 1 joint matrix, M, and assign equal weight factors by setting both ψ and ϕ to 0.5.

M =
          high
  max     0.163

We find the only maximum-valued matrix element, γij = 0.163, and add it to ρ, as γij ≥ ς (we use ς = 0.01 in this example). So, ρ = {0.163}. The new M is empty after removing the ith (i = 1) row and jth (j = 1) column. We proceed to the next step, as m − δ − |ρ| = 0. (Here, m = 2, δ = 1 and |ρ| = 1.)
Step 6:

SimilarityScore(Ri, Sj) = (δ + Σ_{i=1..|ρ|} ρi) × (m + n) / (2mn)
= (1 + 0.163) × 4/8
= 0.582
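As a quick check, the sketch from Sect. 4.2 reproduces this walk-through if we stub the semantic measure with the single SOCPMI value shown in M2 above (stub_semantic is, of course, only an illustrative stand-in):

    # Stub semantic measure: returns the single SOCPMI value from M2 above.
    def stub_semantic(r, s):
        return 0.326 if {r, s} == {"max", "high"} else 0.0

    score = schema_element_similarity(["max", "price"], ["high", "price"], stub_semantic)
    print(score)  # 0.5815, i.e. ≈ 0.582, as computed in Step 6 above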
4.3.2 Example 2

Let
Ri = "allmileage_make_maxkm",
Sj = "make_minmile_distance_possible_take".

Step 1: After eliminating all special characters and punctuation, if any, and then applying the word segmentation method and lemmatizing, we get Ri = {all, mileage, make, max, km} and Sj = {make, min, mile, distance, possible, take}, where m = 5 and n = 6.
Step 2: Only one token (i.e., make) in Ri exactly matches with Sj, therefore we set δ to 1. We remove make from both Ri and Sj. So, Ri = {all, mileage, max, km} and Sj = {min, mile, distance, possible, take}. As m − δ ≠ 0, we proceed to the next step.
Step 3: We construct a 4 × 5 matching matrix, M1. Consider the mileage possible pair, where length(LCS(mileage, possible)) = 3, η = 8 is the length of the larger token (possible), τ = 7 is the length of the smaller token (mileage), and 2 is the maximal length of the consecutive portions of the smaller token that consecutively match with the larger token, where matching starts from the third character of the smaller token and the seventh character of the larger token. So,

v1 = 3^2/(8 × 7) = 0.16
v2 = 0
v3 = 2^2/(8 × 7) = 0.071

and α24 = 0.33 × v1 + 0.33 × v2 + 0.33 × v3 = 0.076.

M1 =
            min     mile    distance  possible  take
  all       0       0.055   0.041     0.027     0.082
  mileage   0.188   0.565   0.058     0.076     0.058
  max       0.11    0.082   0.027     0         0.055
  km        0.11    0.082   0         0         0.123

Step 4: We construct a 4 × 5 similarity matrix, M2. Here, λ = 20, as we used the SOCPMI method.

M2 =
            min     mile    distance  possible  take
  all       0.172   0.233   0.48      0         0.813
  mileage   0.587   0.976   0.826     0         0.558
  max       0.199   0.194   0.141     0         0.243
  km        0.67    0.962   0.89      0         0.408

Step 5: We construct a 4 × 5 joint matrix, M, and assign equal weight factors by setting both ψ and ϕ to 0.5.

M =
            min     mile    distance  possible  take
  all       0.086   0.144   0.26      0.013     0.447
  mileage   0.388   0.771   0.442     0.038     0.308
  max       0.154   0.138   0.084     0         0.149
  km        0.39    0.522   0.445     0         0.266

We find the maximum-valued matrix element, γij = 0.771, and add it to ρ, as γij ≥ ς (we use ς = 0.01 in this example). So, ρ = {0.771}. The new M after removing the ith (i = 2) row and jth (j = 2) column is:
M =
            min     distance  possible  take
  all       0.086   0.26      0.013     0.447
  max       0.154   0.084     0         0.149
  km        0.39    0.445     0         0.266

We find the maximum-valued matrix element, γij = 0.447, for this new M and add it to ρ, as γij ≥ ς. So, ρ = {0.771, 0.447}. The new M after removing the ith (i = 1) row and jth (j = 4) column is:

M =
          min     distance  possible
  max     0.154   0.084     0
  km      0.39    0.445     0

Here, 0.445 is the maximum-valued matrix element and γij ≥ ς. So, ρ = {0.771, 0.447, 0.445}. The new M after removing the ith (i = 2) row and jth (j = 2) column is:

M =
          min     possible
  max     0.154   0

We find 0.154 as the maximum-valued matrix element and γij ≥ ς. So, ρ = {0.771, 0.447, 0.445, 0.154}. The new M is empty after removing the ith (i = 1) row and jth (j = 1) column. We proceed to the next step, as m − δ − |ρ| = 0. (Here, m = 5, δ = 1 and |ρ| = 4.)
Step 6:

SimilarityScore(Ri, Sj) = (δ + Σ_{i=1..|ρ|} ρi) × (m + n) / (2mn)
= (1 + 1.817) × 11/60
= 0.516

4.4 Evaluation and experimental results

We now present experimental results that demonstrate the performance of our method. All the schemas we used in our experiments are from Madhavan et al. [39], where they used web form schemas from two different domains, auto and real estate. Web form schema matching is the problem of identifying corresponding input fields in the web forms. Each web form schema is a set of elements, one for each input. The properties of each input include: the hidden input name or element name that is passed to the server when the form is processed, the description text, and sample values in the option box. We tested on the same data as Madhavan et al. [39], all of it, while they used 75% of it, randomly selected. We could not reproduce the exact 75% that they used. Figures 14 and 15 are two sample schemas from the auto domain (vname are the element names to be matched), while Fig. 16 is their manual mapping (the tags <left> and <right> are used to show an element name from the first schema that matches with an element name from the second schema).

In each domain, they manually created mappings between randomly chosen schema pairs. The matches were one–many; that is, an element can match any number of elements in the other schema. These manually created mappings are used as a gold standard to compare the mapping performance of the different methods, including our method. Table 2 provides detailed information about each of the two domains and our results.

In each domain, we compared each predicted mapping pair against the manually created mapping pairs. For our experiment, we only used element names for matching. We used 11 different similarity thresholds ranging from 0 to 1 with interval 0.1; e.g., in the auto domain, when we used similarity threshold 0.1, our method matched 961 elements, out of which 628 elements were among the 769 manually matched elements. Precision vs. similarity threshold curves and recall vs. similarity threshold curves of the two web domains for the 11 different similarity thresholds are shown in Figs. 17 and 18, respectively. P–R curves of the two web domains for the 11 different similarity thresholds are shown in Fig. 19, where the similarity threshold decreases from left to right in the figure. Figure 20 shows F-measure vs. similarity threshold curves; it is obvious that a lower similarity threshold (≈ 0.2) gives a better F-measure score.

The reason a lower similarity threshold obtains a better F-measure score is that we always take into account both the string similarity and the semantic word similarity measures. If two strings have a perfect semantic word similarity score (≈ 1) and no string similarity score (≈ 0), it is practically a perfect matching (e.g., car and vehicle); this lowers the total similarity score. Again, we multiply this total score by the reciprocal harmonic mean of m and n to obtain a balanced similarity score, which also lowers the final similarity value.

In Fig. 18, when we use a string similarity threshold score of 1 (i.e., matching the element names exactly, therefore no semantic similarity matching is needed), we obtain recall values of 0.133 and 0.107 for the auto and real estate domains, respectively. We can consider these scores as the baselines.

Madhavan et al. [39] used three methods: direct, pivot and augment. They selected a random 25% of the manually created mappings in each domain as training data and tested on the remaining 75% of the mappings. In the augment method, they used different base learners, such as a name learner, text learner, data instance learner, and context learner, and then used a meta-learner to combine the predictions of the different base learners into a single similarity score. To train a learner, the augment method requires learner-specific
positive and negative examples for the element on which it is being trained. The direct method uses the same base learners as the augment method, but the training data for these learners are extracted only from the schemas being matched. Pivot is the method that computes the cosine distance of the interpretation vectors of the two elements directly.

In Fig. 21, the direct, pivot and augment methods for the auto domain achieved precision of around 0.76, 0.74 and 0.92, recall of around 0.74, 0.78, 0.72, and F-measure of around 0.73, 0.74 and 0.78, respectively. We achieved around 0.78 as precision, recall and F-measure with 0.2 as the similarity threshold.

In Fig. 22, the direct, pivot and augment methods for the real estate domain achieved precision of around 0.78, 0.71 and 0.76, recall of around 0.69, 0.74, 0.81, and F-measure of around 0.71, 0.71 and 0.78, respectively. We achieved precision of 0.68, recall of 0.75, and F-measure of 0.72 with 0.2 as the similarity threshold.

Generally, it seems that precision matters more than recall in the schema matching problem. But pragmatically it is not possible to determine fully automatically all matches between two schemas, and the implementation of the matching therefore only determines match candidates that are then verified by a human expert. If a human expert is involved in the verification procedure, then recall is as important as precision; that is, F-measure matters more than precision.
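For concreteness, the quoted auto-domain counts at similarity threshold 0.1 (961 predicted pairs, 628 of them among the 769 manually created pairs) yield the following values under the standard precision/recall/F-measure definitions:

    # Auto domain at similarity threshold 0.1: 961 predicted pairs,
    # 628 of them among the 769 manually created pairs.
    predicted, correct, gold = 961, 628, 769
    precision = correct / predicted                             # ~0.65
    recall = correct / gold                                     # ~0.82
    f_measure = 2 * precision * recall / (precision + recall)   # ~0.73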
Our method is computationally less intensive than the method of Madhavan et al. [39] because it uses a single property in matching (element names), while their method uses multiple properties (names, descriptions, instance values, context). We feel that a rigorous comparison is not possible, since the algorithm of Madhavan et al. is not described in sufficient detail. However, we believe that the complexity of our matcher is similar to that of Madhavan et al.'s name learner; in addition, they include the complexity of three other learners.

Finally, we wanted to measure the contribution of our two new methods, the semantic similarity method and the text segmentation method, to the task of database schema matching.

[Table 2 Characteristics of the evaluation domains and our results. Columns: domain name; number of schemas; number of manual mappings; similarity threshold score in our method; number of predicted mapping pairs; number of correct mapping pairs; number of manually created mapping pairs. Rows for the Auto and Real estate domains; the cell values are not recoverable from this copy.]

[Figs. 17 and 18 Precision vs. similarity threshold score curves and recall vs. similarity threshold score curves of the two web domains (Auto, Real estate).]
[Fig. 19 P–R curves of the two web domains for the 11 different similarity thresholds (the similarity threshold decreases from left to right).]

[Fig. 20 F-measure vs. similarity threshold curves of the two web domains for the 11 different similarity thresholds.]

[Fig. 21 Results on the auto domain (values in %, for the direct, pivot and augment methods and our method).]

[Fig. 22 Results on the real estate domain (values in %, for the direct, pivot and augment methods and our method).]

When we used Lin's [38] WordNet-based word similarity method instead of our corpus-based word similarity method, for the auto domain, we achieved a precision of 0.71, recall of 0.63, and F-measure of 0.66. It matched 680 elements, out of which 485 elements were among the 769 manually matched elements. We compare this to the results of our best run for the auto domain, which had an F-measure of 0.78 (with precision and recall of 0.78 as well).12 The decrease in F-measure due to the replacement of our semantic similarity method is 0.12; the decrease in precision and recall is 0.07 and 0.15, respectively.

When we used a simplistic segmentation approach (segmentation using only capitalization and punctuation) instead of our proposed word segmentation approach, for the auto domain, we achieved a precision of 0.76, recall of 0.68, and F-measure of 0.71. It matched 687 elements, out of which 521 elements were among the 769 manually matched elements. The loss in F-measure, precision and recall due to the simplistic segmentation method is 0.08, 0.02 and 0.10, respectively.

We can conclude that our semantic similarity method contributed a significant improvement (12 percentage points for the auto domain). Our word segmentation method also contributed significantly (an improvement of 8 percentage points).

12 We used 0.2 as the similarity threshold, to have the same threshold for the compared systems.

5 Conclusion and future work

5.1 Conclusion

In this paper, we addressed the task of database schema matching. First, we evaluated a new corpus-based word similarity measure, called SOC-PMI, and compared it with existing word similarity measures. We performed an intrinsic evaluation on the noun pairs mentioned earlier. We also performed a task-based evaluation: solving synonym test questions. One of the main characteristics of the SOC-PMI method is that we can determine the semantic similarity of two words even though they do not co-occur within the window size at all in the corpus. Actually, we are considering second-order co-occurrences, as we are judging also by the co-occurrences of the neighbor words, not only the co-occurrence of the two
target words. This is not the case for PMI-IR and many other corpus-based semantic similarity measures.

Second, we proposed a word segmentation method that could be exploited in a web search engine to provide better suggestions when the search text contains three or more words in the "desegmented" part. The method can also effectively distill new words, special terms and proper nouns when the corpus covers a huge collection of both domain-dependent and domain-independent words, and it can effectively avoid statistical errors in shorter strings which belong to a longer one. Experimental results show that our word segmentation method can segment words with high precision and high recall.

Finally, we exploited both the semantic similarity and the word segmentation method in our proposed name-based element-level schema matching method. Our schema matching method uses a single property (i.e., element name) for matching and achieves an F-measure score comparable to the methods that use multiple properties (e.g., element name, text description, data instance, context description). If we use a single property instead of multiple properties, it can speed up the matching process, which is important when schema matching is used in P2P data management or online query processing in a P2P environment. Our method is scalable, in the sense that, if needed, we could also add other properties (i.e., text description, data instance matching) to obtain a better schema matching result.
5.2 Future work

We plan to apply our proposed second-order co-occurrence PMI method to other tasks, such as measuring the semantic similarity of two texts and detecting semantic outliers in speech recognition transcripts. The SOC-PMI method may also be helpful as a tool to aid in the automatic construction of the synonyms of a word. A very naïve approach would be as follows (sketched in code below). First, we need to sort the significant-words list based on PMI values for the word (say, x) whose synonyms we are interested in finding. If there are n significant words in this list, we will apply the SOC-PMI method to each possible pair formed by x and one of the n words. Instead of taking the similarity value, we will consider all the second-order co-occurrence types and sort this types list based on PMI values. The words at the top of the list could be the best candidates for synonyms of the word.
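A minimal sketch of this naïve procedure follows; significant_words and second_order_types are hypothetical helpers standing in for the PMI machinery described earlier, not an existing API:

    def synonym_candidates(x, top_k=10):
        # significant_words(x): neighbor words of x ranked by PMI (hypothetical helper).
        # second_order_types(x, w): (type, PMI) pairs for the pair (x, w) (hypothetical helper).
        pooled = {}
        for w in significant_words(x):
            for t, pmi in second_order_types(x, w):
                pooled[t] = pooled.get(t, 0.0) + pmi
        # Words at the top of the pooled, PMI-sorted type list are the candidates.
        return sorted(pooled, key=pooled.get, reverse=True)[:top_k]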
Our corpus-based word segmentation method can be extended as a hybrid method with some additions to the algorithms. The absence of type frequencies in dictionaries means that we can only use the length of the types. In that case, we need to focus on which type to choose among same-length types that share some common characters. Again, we cannot choose whether to take the elements of leftMaxMatching or rightMaxMatching when both of them return the same number of elements, as we cannot use the entropy rate in the absence of type frequencies. Future directions also involve integrating the current word segmentation algorithm into a larger system for comprehensive and context-based word analysis.

Our proposed schema matching method, together with the semantic similarity of words method, can be further extended for the tasks of paraphrase recognition, entailment identification, and measuring the semantic similarity of texts. A corpus-based measure is useful to identify any similarity between words like President and Clinton from the sentences 'Mr. President was supposed to visit Europe' and 'Mr. Clinton was supposed to visit Europe'. The proposed schema matching method can also be updated in exactly the same way as the text similarity approach to exploit text description matching, another approach to schema matching.

Incorporating equality of canonical name representations for special prefix/suffix symbols (e.g., CName → customer name, and EmpNO → employee number) would enhance the performance of the method. We tried the Opaui,13 a collection of lists of acronyms, abbreviations, and initialisms on the World Wide Web, which has 353,494 entries from 128 different categories, but it even made the results worse. The reason is that misleading acronyms, abbreviations or initialisms return lower string similarity and semantic word similarity scores.

Our name-based schema matching method can be augmented with rule-based methods to improve the accuracy in domain-specific schema matching.

Notice that our algorithms assumed that most element names are tokenizable (contain words or fragments of words), but not all of them. There are indeed types of data where it was nearly impossible to obtain matches using element name matching. For such cases, we got very low similarity values. For example, in the real estate domain, a schema named "CommercialRealEstate" had five fields/elements: cata, beds, catb, catc, state, and another schema named "RealyInvestor-1" had seven fields/elements: OptionListSelectedTypes, tScMinPriceSale, tScMaxPriceSale, tScMinSfSale, tScMaxSfSale, tScMinUnits, tScMaxUnits. Their manual matches are as follows: (cata = tScMinPriceSale and tScMaxPriceSale), (catb = tScMinSfSale and tScMaxSfSale), (catc = OptionListSelectedTypes). However, even considering cases like this one, we obtained good results on our experimental data sets, which are from real-world web data sources. This means that this type of data is not very frequent in real-world web data sources. To test this hypothesis further, we collected 112 element names from 12 websites.14 Among them, only eight names were not tokenizable.

13 http://www.abbreviationz.com/.
14 The list is available at http://www.site.uottawa.ca/~diana/elements.htm.
To deal with non-tokenizable cases, we also plan to combine our name-based schema matcher with other existing matchers, in order to address specific situations that our method does not cover. When the element names are not words or fragments of words, we need to use an instance matcher that looks at the type of the values in two columns, or at the values of the instances. To quickly test this idea, we implemented a simple type instance matcher that verifies the type of instance values. In case our name matcher decided to match two fields, we did not accept the match if the fields had different types, for example if one field was a string and the other was numeric. In this way, for the auto domain, we eliminated 52 incorrect matches, increasing the precision from 0.78 to 0.83 and the F-measure from 0.78 to 0.80. The recall stayed the same because all the eliminated matches were indeed wrong matches. If the instances are words, we can re-use our semantic and string similarity matching at the level of the instances. Sometimes two columns might match if similar words are used to denote different fields in two different databases. In such cases, the precision of the matching can be increased by matching the text descriptions of the columns, if available. Our word-level similarity measure can be used to determine the similarity level of two texts.
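A minimal sketch of such a type-based filter, assuming the instance values are available as strings, is given below (infer_type and filter_by_type are our illustrative names, not identifiers from our implementation):

    def infer_type(values):
        """Classify a column's instance values as 'numeric' or 'string'."""
        def is_number(v):
            try:
                float(v)
                return True
            except ValueError:
                return False
        return "numeric" if all(is_number(v) for v in values) else "string"

    def filter_by_type(candidate_pairs, instances_a, instances_b):
        """Drop name-based match candidates whose instance values differ in type."""
        return [(a, b) for a, b in candidate_pairs
                if infer_type(instances_a[a]) == infer_type(instances_b[b])]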
References

1. Allison, L., Dix, T.I.: A bit-string longest-common-subsequence algorithm. Inf. Process. Lett. 23, 305–310 (1986)
2. Batini, C., Lenzerini, M., Navathe, S.B.: A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18(4), 323–364 (1986)
3. Bell, G.S., Sethi, A.: Matching records in a national medical patient index. Commun. ACM 44(9), 83–88 (2001)
4. Brent, M.: An efficient, probabilistically sound algorithm for segmentation and word discovery. Mach. Learn. 34, 71–106 (1999)
5. Brent, M., Cartwright, T.: Distributional regularity and phonotactics are useful for segmentation. Cognition 61, 93–125 (1996)
6. Bright, M.W., Hurson, A.R., Pakzad, S.H.: Automated resolution of semantic heterogeneity in multidatabases. ACM Trans. Database Syst. (TODS) 19(2), 212–253 (1994)
7. Brill, E.: Some advances in transformation-based part of speech tagging. In: Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 748–753. AAAI Press/MIT Press (1994)
8. Brown, P.F., DeSouza, P.V., Mercer, R.L., Watson, T.J., Della Pietra, V.J., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18, 467–479 (1992)
9. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of semantic distance. Comput. Linguist. 32(1) (2006)
10. Buckley, C., Salton, J.A., Singhal, A.: Automatic query expansion using SMART: TREC 3. In: Proceedings of the Third Text REtrieval Conference, Gaithersburg (1995)
11. Christiansen, M., Allen, J.: Coping with variation in speech segmentation. In: Proceedings of GALA 1997: Language Acquisition: Knowledge Representation and Processing, pp. 327–332 (1997)
12. Christiansen, M., Allen, J., Seidenberg, M.: Learning to segment speech using multiple cues: a connectionist model. Lang. Cogn. Process. 13, 221–268 (1998)
13. Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Comput. Linguist. 16(1), 22–29 (1990)
14. Daelemans, W., van den Bosch, A., Weijters, A.: IGTree: using trees for compression and classification in lazy learning algorithms. Artif. Intell. Rev. 11, 407–423 (1997)
15. Dagan, I., Lee, L., Pereira, F.C.N.: Similarity-based models of word cooccurrence probabilities. Mach. Learn. 34(1–3), 43–69 (1999)
16. Dale, R., Moisl, H., Somers, H.: Handbook of Natural Language Processing. Marcel Dekker, New York (2000)
17. Deligne, S., Bimbot, F.: Language modeling by variable length sequences: theoretical formulation and evaluation of multigrams. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95) (1995)
18. de Marcken, C.: The unsupervised acquisition of a lexicon from continuous speech. Technical Report AI Memo No. 1558, M.I.T., Cambridge, MA (1995)
19. Do, H.H., Rahm, E.: COMA—a system for flexible combination of schema matching approaches. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), pp. 610–621 (2002)
20. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 61–74 (1993)
21. Fung, P., Wu, D.: Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In: Fourth Conference on Applied Natural Language Processing, Stuttgart, pp. 180–181 (1994)
22. Gao, J., Li, M., Wu, A., Huang, C.-N.: Chinese word segmentation and named entity recognition: a pragmatic approach. Comput. Linguist. 31(4) (2005)
23. Grefenstette, G.: Automatic thesaurus generation from raw text using knowledge-poor techniques. In: Making Sense of Words, 9th Annual Conference of the UW Centre for the New OED and Text Research (1993)
24. Grefenstette, G.: Finding semantic similarity in raw text: the Deese antonyms. In: Goldman, R., Norvig, P., Charniak, E., Gale, B. (eds.) Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, pp. 61–65. AAAI Press (1992)
25. Hua, Y.: Unsupervised word induction using MDL criterion. In: Proceedings of ISCSL 2000, Beijing (2000)
26. Inkpen, D., Désilets, A.: Semantic similarity for detecting recognition errors in automatic speech transcripts. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2005), Vancouver (2005)
27. Jarmasz, M., Szpakowicz, S.: Roget's thesaurus and semantic similarity. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2003), Borovets, Bulgaria, pp. 212–219 (2003)
28. Kang, J., Naughton, J.F.: On schema matching with opaque column names and data values. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2003), San Diego, pp. 205–216 (2003)
29. Kit, C., Wilks, Y.: Unsupervised learning of word boundary with description length gain. In: Proceedings of the CoNLL99 ACL Workshop, Bergen (1999)
30. Kondrak, G.: N-gram similarity and distance. In: Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), Buenos Aires, pp. 115–126 (2005)
31. Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychol. Rev. 104(2), 211–240 (1997)
32. Lee, L.: Measures of distributional similarity. In: Proceedings of the Association for Computational Linguistics (ACL-1999), pp. 23–32 (1999)
33. Lesk, M.E.: Word-word associations in document retrieval systems. Am. Doc. 20(1), 27–38 (1969)
34. Lesk, M.E.: Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the Conference of the Special Interest Group on Design of Communication (SIGDOC), Toronto (1986)
35. Li, H., Abe, N.: Word clustering and disambiguation based on co-occurrence data. In: Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL), pp. 749–755 (1998)
36. Lin, C.Y., Hovy, E.H.: Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), Edmonton (2003)
37. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL), pp. 768–774 (1998)
38. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, pp. 296–304 (1998)
39. Madhavan, J., Bernstein, P., Doan, A., Halevy, A.: Corpus-based schema matching. In: International Conference on Data Engineering (ICDE-05), pp. 57–68 (2005)
40. Melamed, I.D.: Bitext maps and alignment via pattern recognition. Comput. Linguist. 25(1), 107–130 (1999)
41. Mikheev, A.: Automatic rule induction for unknown word guessing. Comput. Linguist. 23(3), 405–423 (1997)
42. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Lang. Cogn. Process. 6(1), 1–28 (1991)
43. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
44. Milo, T., Zohar, S.: Using schema matching to simplify heterogeneous data translation. In: Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 122–133 (1998)
45. Pantel, P., Lin, D.: Discovering word senses from text. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 613–619 (2002)
46. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity—measuring the relatedness of concepts. In: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), July 25–29, San Jose (Intelligent Systems Demonstration)
47. Peng, F., Schuurmans, D.: A hierarchical EM approach to word segmentation. In: Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS 2001), pp. 475–480, Tokyo, Japan (2001)
48. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
49. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. Int. J. Very Large Data Bases (VLDB) 10(4), 334–350 (2001)
50. Rao, C.R.: Diversity: its measurement, decomposition, apportionment and analysis. Sankhya: Indian J. Stat. 44(A), 1–22 (1983)
51. Resnik, P.: Using information content to evaluate semantic similarity. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, pp. 448–453 (1995)
52. Rosenfeld, R.: A maximum entropy approach to adaptive statistical language modeling. Comput. Speech Lang. 10, 187–228 (1996)
53. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Commun. ACM 8(10), 627–633 (1965)
54. Rumelhart, D.E., McClelland, J.: On learning the past tense of English verbs. In: Parallel Distributed Processing, vol. II, pp. 216–271. MIT Press, Cambridge (1986)
55. Saffran, J.R., Newport, E.L., Aslin, R.N.: Word segmentation: the role of distributional cues. J. Mem. Lang. 35, 606–621 (1996)
56. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill (1983)
57. Seligman, L., Rosenthal, A., Lehner, P., Smith, A.: Data integration: where does the time go? Bull. Tech. Comm. Data Eng. 25(3) (2002)
58. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1963)
59. Sproat, R., Shih, C., Gale, W., Chang, N.: A stochastic finite-state word-segmentation algorithm for Chinese. Comput. Linguist. 22(3), 377–404 (1996)
60. Sproat, R., Shih, C., Gale, W., Chang, N.: A stochastic word segmentation algorithm for a Mandarin text-to-speech system. In: 32nd Annual Meeting of the Association for Computational Linguistics, pp. 66–72, Las Cruces (1994)
61. Turney, P.D.: Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), pp. 491–502 (2001)
62. Vechtomova, O., Robertson, S.: Integration of collocation statistics into the probabilistic retrieval model. In: 22nd Annual Colloquium on Information Retrieval Research, Cambridge (2000)
63. Weeds, J., Weir, D., McCarthy, D.: Characterising measures of lexical distributional similarity. In: Proceedings of the 20th International Conference on Computational Linguistics (COLING-2004), pp. 1015–1021, Geneva (2004)
64. Xu, J., Croft, B.: Improving the effectiveness of information retrieval. ACM Trans. Inf. Syst. 18(1), 79–112 (2000)
65. Yarowsky, D.: Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In: Proceedings of the International Conference on Computational Linguistics (COLING-92), pp. 454–460, Nantes (1992)