
Proceedings of HLT-NAACL 2003, Main Papers, pp. 197-204
Edmonton, May-June 2003

Unsupervised methods for developing taxonomies by combining syntactic and statistical information

Dominic Widdows
Center for the Study of Language and Information, Stanford University
[email protected]

Abstract

This paper describes an unsupervised algorithm for placing unknown words into a taxonomy and evaluates its accuracy on a large and varied sample of words. The algorithm works by first using a large corpus to find semantic neighbors of the unknown word, which we accomplish by combining latent semantic analysis with part-of-speech information. We then place the unknown word in the part of the taxonomy where these neighbors are most concentrated, using a class-labelling algorithm developed especially for this task. This method is used to reconstruct parts of the existing WordNet database, obtaining results for common nouns, proper nouns and verbs. We evaluate the contribution made by part-of-speech tagging and show that automatic filtering using the class-labelling algorithm gives a fourfold improvement in accuracy.

1 Introduction

The importance of automatic methods for enriching lexicons, taxonomies and knowledge bases from free text is well-recognized. For rapidly changing domains such as current affairs, static knowledge bases are inadequate for responding to new developments, and the cost of building and maintaining resources by hand is prohibitive.

This paper describes experiments which develop automatic methods for taking an original taxonomy as a skeleton and fleshing it out with new terms which are discovered in free text. The method is completely automatic and it is completely unsupervised apart from using the original taxonomic skeleton to suggest possible classifications for new terms. We evaluate how accurately our methods can reconstruct the WordNet taxonomy (Fellbaum, 1998).

The problem of enriching the lexical information in a taxonomy can be posed in two complementary ways. Firstly, given a particular taxonomic class (such as fruit) one could seek members of this class (such as apple, banana). This problem is addressed by Riloff and Shepherd (1997), Roark and Charniak (1998) and more recently by Widdows and Dorow (2002). Secondly, given a particular word (such as apple), one could seek suitable taxonomic classes for describing this object (such as fruit, foodstuff). The work in this paper addresses the second of these questions.

The goal of automatically placing new words into a taxonomy has been attempted in various ways for at least ten years (Hearst and Schütze, 1993). The process for placing a word w in a taxonomy T using a corpus C often contains some version of the following stages:

• For a word w, find words from the corpus C whose occurrences are similar to those of w. Consider these the ‘corpus-derived neighbors’ N(w) of w.

• Assuming that at least some of these neighbors are already in the taxonomy T, map w to the place in the taxonomy where these neighbors are most concentrated.

Hearst and Schütze (1993) added 27 words to WordNet using a version of this process, with a 63% accuracy at assigning new words to one of a number of disjoint WordNet ‘classes’ produced by a previous algorithm. (Direct comparison with this result is problematic since the number of classes used is not stated.) A more recent example is the top-down algorithm of Alfonseca and Manandhar (2001), which seeks the node in T which shares the most collocational properties with the word w, adding 42 concepts taken from The Lord of the Rings with an accuracy of 28%.

The algorithm as presented above leaves many degrees of freedom and open questions. What methods should be used to obtain the corpus-derived neighbors N(w)? This question is addressed in Section 2. Given a collection of neighbors, how should we define a “place in the taxonomy where these neighbors are most concentrated?” This question is addressed in Section 3, which defines a robust class-labelling algorithm for mapping a list of words into a taxonomy. In Section 4 we describe experiments, determining the accuracy with which these methods can be used to reconstruct the WordNet taxonomy. To our knowledge, this is the first such evaluation for a large sample of words. Section 5 discusses related work and other problems to which these techniques can be adapted.
2 Finding semantic neighbors: Combining latent semantic analysis with part-of-speech information

There are many empirical techniques for recognizing when words are similar in meaning, rooted in the idea that “you shall know a word by the company it keeps” (Firth, 1957). It is certainly the case that words which repeatedly occur with similar companions often have related meanings, and common features used for determining this similarity include shared collocations (Lin, 1999), co-occurrence in lists of objects (Widdows and Dorow, 2002) and latent semantic analysis (Landauer and Dumais, 1997; Hearst and Schütze, 1993).

The method used to obtain semantic neighbors in our experiments was a version of latent semantic analysis, descended from that used by Hearst and Schütze (1993, §4). First, 1000 frequent words were chosen as column labels (after removing stopwords (Baeza-Yates and Ribiero-Neto, 1999, p. 167)). Other words were assigned co-ordinates determined by the number of times they occurred within the same context-window (15 words) as one of the 1000 column-label words in a large corpus. This gave a matrix where every word is represented by a row-vector determined by its co-occurrence with frequently occurring, meaningful words. Since this matrix was very sparse, singular value decomposition (known in this context as latent semantic analysis (Landauer and Dumais, 1997)) was used to reduce the number of dimensions from 1000 to 100. This reduced vector space is called WordSpace (Hearst and Schütze, 1993, §4). Similarity between words was then computed using the cosine similarity measure (Baeza-Yates and Ribiero-Neto, 1999, p. 28). Such techniques for measuring similarity between words have been shown to capture semantic properties: for example, they have been used successfully for recognizing synonymy (Landauer and Dumais, 1997) and for finding correct translations of individual terms (Widdows et al., 2002).

The corpus used for these experiments was the British National Corpus, which is tagged for parts-of-speech. This enabled us to build syntactic distinctions into WordSpace — instead of just giving a vector for the string test, we were able to build separate vectors for the nouns, verbs and adjectives test. An example of the contribution of part-of-speech information to extracting semantic neighbors of the word fire is shown in Table 1. As can be seen, the noun fire (as in the substance/element) and the verb fire (mainly used to mean firing some sort of weapon) are related to quite different areas of meaning. Building a single vector for the string fire confuses this distinction — the neighbors of fire treated just as a string include words related to both the meaning of fire as a noun (more frequent in the BNC) and as a verb.

Part of the goal of our experiments was to investigate the contribution that this part-of-speech information made for mapping words into taxonomies. As far as we are aware, these experiments are the first to investigate the combination of latent semantic indexing with part-of-speech information.
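As an illustration, the following sketch shows one way the WordSpace construction just described could be implemented in Python. It assumes a PoS-tagged corpus supplied as lists of tokens such as 'fire_nn1' (dropping the tag reproduces the strings-only space), and it uses numpy and scipy purely for convenience; the function names, the stopword handling and every other detail are illustrative assumptions, not a description of the software actually used for these experiments.

    import numpy as np
    from collections import Counter
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import svds

    def build_wordspace(tagged_sentences, n_columns=1000, window=15, dims=100,
                        stopwords=frozenset()):
        """Build a reduced co-occurrence space in the spirit of Section 2.
        tagged_sentences: lists of tokens such as 'fire_nn1' (word + PoS tag)."""
        # 1. Choose the most frequent non-stopword tokens as column labels.
        freq = Counter(t for sent in tagged_sentences for t in sent)
        columns = [t for t, _ in freq.most_common() if t not in stopwords][:n_columns]
        col_index = {t: j for j, t in enumerate(columns)}
        row_index = {t: i for i, t in enumerate(freq)}

        # 2. Count co-occurrences of every token with the column-label words
        #    inside a fixed context window.
        counts = lil_matrix((len(row_index), n_columns))
        for sent in tagged_sentences:
            for i, t in enumerate(sent):
                lo, hi = max(0, i - window), i + window + 1
                for c in sent[lo:hi]:
                    if c in col_index and c != t:
                        counts[row_index[t], col_index[c]] += 1

        # 3. Reduce the sparse matrix to `dims` dimensions with a truncated SVD
        #    (latent semantic analysis) and length-normalise the row vectors so
        #    that dot products are cosine similarities.
        u, s, _ = svds(counts.tocsr().astype(np.float64), k=dims)
        vectors = u * s
        vectors /= np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-12
        return row_index, vectors

    def neighbors(word, row_index, vectors, k=12):
        """Return the k nearest neighbours of `word` by cosine similarity."""
        sims = vectors @ vectors[row_index[word]]
        inv = {i: t for t, i in row_index.items()}
        ranked = np.argsort(-sims)
        return [(inv[i], float(sims[i])) for i in ranked[:k + 1] if inv[i] != word][:k]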
3 Finding class-labels: Mapping collections of words into a taxonomy

Given a collection of words or multiword expressions which are semantically related, it is often important to know what these words have in common. All adults with normal language competence and world knowledge are adept at this task — we know that plant, animal and fungus are all living things, and that plant, factory and works are all kinds of buildings. This ability to classify objects, and to work out which of the possible classifications of a given object is appropriate in a particular context, is essential for understanding and reasoning about linguistic meaning. We will refer to this process as class-labelling.

The approach demonstrated here uses a hand-built taxonomy to assign class-labels to a collection of similar nouns. As with much work of this nature, the taxonomy used is WordNet (version 1.6), a freely-available broad-coverage lexical database for English (Fellbaum, 1998). Our algorithm finds the hypernyms which subsume as many as possible of the original nouns, as closely as possible.¹ The concept v is said to be a hypernym of w if w is a kind of v. For this reason this sort of a taxonomy is sometimes referred to as an ‘IS A hierarchy’.

¹ Another method which could be used for class-labelling is given by the conceptual density algorithm of Agirre and Rigau (1996), which those authors applied to word-sense disambiguation. A different but related idea is presented by Li and Abe (1998), who use a principle from information theory to model selectional preferences for verbs using different classes from a taxonomy. Their algorithm and goals are different from ours: we are looking for a single class-label for semantically related words, whereas for modelling selectional preferences several classes may be appropriate.

fire (string only) fire nn1 fire vvi
fire 1.000000 fire nn1 1.000000 fire vvi 1.000000
flames 0.709939 flames nn2 0.700575 guns nn2 0.663820
smoke 0.680601 smoke nn1 0.696028 firing vvg 0.537778
blaze 0.668504 brigade nn1 0.589625 cannon nn0 0.523442
firemen 0.627065 fires nn2 0.584643 gun nn1 0.484106
fires 0.617494 firemen nn2 0.567170 fired vvd 0.478572
explosion 0.572138 explosion nn1 0.551594 detectors nn2 0.477025
burning 0.559897 destroyed vvn 0.547631 artillery nn1 0.469173
destroyed 0.558699 burning aj0 0.533586 attack vvb 0.468767
brigade 0.532248 blaze nn1 0.529126 firing nn1 0.459000
arson 0.528909 arson nn1 0.522844 volley nn1 0.458717
accidental 0.519310 alarms nn2 0.512332 trained vvn 0.447797
chimney 0.489577 destroyed vvd 0.512130 enemy nn1 0.445523
blast 0.488617 burning vvg 0.502052 alert aj0 0.443610
guns 0.487226 burnt vvn 0.500864 shoot vvi 0.443308
damaged 0.484897 blast nn1 0.498635 defenders nn2 0.438886

Table 1: Semantic neighbors of fire with different parts-of-speech. The scores are cosine similarities

For example, the possible hypernyms given for the word oak in WordNet 1.6 are

    oak ⇒ wood ⇒ plant material ⇒ material, stuff ⇒ substance, matter ⇒ object, physical object ⇒ entity, something

    oak, oak tree ⇒ tree ⇒ woody plant, ligneous plant ⇒ vascular plant, tracheophyte ⇒ plant, flora, plant life ⇒ life form, organism, being, living thing ⇒ entity, something

Let S be a set of nouns or verbs. If the word w ∈ S is recognized by WordNet, the WordNet taxonomy assigns to w an ordered set of hypernyms H(w). Consider the union

    H = ∪_{w ∈ S} H(w).

This is the set of all hypernyms of any member of S. Our intuition is that the most appropriate class-label for the set S is the hypernym h ∈ H which subsumes as many as possible of the members of S as closely as possible in the hierarchy. There is a trade-off here between subsuming ‘as many as possible’ of the members of S, and subsuming them ‘as closely as possible’. This line of reasoning can be used to define a whole collection of ‘class-labelling algorithms’.

For each w ∈ S and for each h ∈ H, define the affinity score function α(w, h) between w and h to be

    α(w, h) = f(dist(w, h))   if h ∈ H(w)
    α(w, h) = −g(w, h)        if h ∉ H(w)        (1)

where dist(w, h) is a measure of the distance between w and h, f is some positive, monotonically decreasing function, and g is some positive (possibly constant) function. The function f accords ‘positive points’ to h if h subsumes w, and the condition that f be monotonically decreasing ensures that h gets more positive points the closer it is to w. The function g subtracts ‘penalty points’ if h does not subsume w. This function could depend in many ways on w and h — for example, there could be a smaller penalty if h is a very specific concept than if h is a very general concept.

The distance measure dist(w, h) could take many forms, and there are already a number of distance measures available to use with WordNet (Budanitsky and Hirst, 2001). The easiest method for assigning a distance between words and their hypernyms is to count the number of intervening levels in the taxonomy. This assumes that the distance in specificity between ontological levels is constant, which is of course not the case, a problem addressed by Resnik (1999).

Given an appropriate affinity score, it is a simple matter to define the best class-label for a collection of objects.
Definition 1 Let S be a set of nouns, let H = ∪_{w ∈ S} H(w) be the set of hypernyms of S and let α(w, h) be an affinity score function as defined in equation (1). The best class-label hmax(S) for S is the node hmax ∈ H with the highest total affinity score summed over all the members of S, so hmax is the node which gives the maximum score

    max_{h ∈ H} Σ_{w ∈ S} α(w, h).

Since H is determined by S, hmax is solely determined by the set S and the affinity score α. In the event that hmax is not unique, it is customary to take the most specific class-label available.

Example

A particularly simple example of this kind of algorithm is used by Hearst and Schütze (1993). First they partition the WordNet taxonomy into a number of disjoint sets which are used as class-labels. Thus each concept has a single ‘hypernym’, and the ‘affinity-score’ between a word w and a class h is simply the set membership function, α(w, h) = 1 if w ∈ h and 0 otherwise. A collection of words is assigned a class-label by majority voting.

3.1 Ambiguity

In theory, rather than a class-label for related strings, we would like one for related meanings — the concepts to which the strings refer. To implement this for a set of words, we alter our affinity score function α as follows. Let C(w) be the set of concepts to which the word w could refer. (So each c ∈ C(w) is a possible sense of w.) Then

    α(w, h) = max_{c ∈ C(w)} [ f(dist(c, h)) if h ∈ H(c);  −g(w, c) if h ∉ H(c) ].        (2)

This implies that the ‘preferred-sense’ of w with respect to the possible subsumer h is the sense closest to h. In practice, our class-labelling algorithm implements this preference by computing the affinity score α(c, h) for all c ∈ C(w) and only using the best match. This selective approach is much less noisy than simply averaging the probability mass of the word over each possible sense (the technique used in (Li and Abe, 1998), for example).

3.2 Choice of scoring functions for the class-labelling algorithm

The precise choice of class-labelling algorithm depends on the functions f and g in the affinity score function α of equation (2). There is some tension here between being correct and being informative: ‘correct’ but uninformative class-labels (such as entity, something) can be obtained easily by preferring nodes high up in the hierarchy, but since our goal in this work was to classify unknown words in an informative and accurate fashion, the functions f and g had to be chosen to give an appropriate balance. After a variety of heuristic tests, the function f was chosen to be

    f = 1 / dist(w, h)²,

where for the distance function dist(w, h) we chose the computationally simple method of counting the number of taxonomic levels between w and h (inclusively, to avoid dividing by zero). For the penalty function g we chose the constant g = 0.25.

The net effect of choosing the reciprocal-distance-squared and a small constant penalty function was that hypernyms close to the concept in question received magnified credit, but possible class-labels were not penalized too harshly for missing out a node. This made the algorithm simple and robust to noise but with a strong preference for detailed information-bearing class-labels. This configuration of the class-labelling algorithm was used in all the experiments described below.
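The following sketch shows how the affinity score of equation (2) and the class-label of Definition 1 could be computed over WordNet hypernym paths with the concrete choices f = 1/dist² and g = 0.25 given above. It uses NLTK's WordNet interface purely for illustration (the experiments reported here used WordNet 1.6, so labels and scores obtained this way will not match the paper exactly), and the helper names are assumptions of this sketch rather than the original implementation.

    from nltk.corpus import wordnet as wn

    def affinity_per_word(word):
        """alpha(word, h) for every WordNet hypernym h of any sense of `word`,
        maximised over senses as in equation (2), with f(dist) = 1 / dist**2 and
        dist counted inclusively (a direct hypernym is at distance 1)."""
        scores = {}
        for sense in wn.synsets(word):                # each sense c in C(word)
            for path in sense.hypernym_paths():       # path runs from the root down to `sense`
                ancestors = list(reversed(path))[1:]  # nearest hypernym first, sense excluded
                for dist, h in enumerate(ancestors, start=1):
                    score = 1.0 / dist ** 2
                    if score > scores.get(h, 0.0):
                        scores[h] = score             # keep only the best-matching sense
        return scores

    def best_class_labels(words, penalty=0.25, top=4):
        """Definition 1: rank candidate hypernyms by total affinity over `words`,
        charging the constant penalty g = 0.25 when a hypernym fails to subsume
        a word, and return the `top` highest-scoring (synset, score) pairs."""
        per_word = [affinity_per_word(w) for w in words if wn.synsets(w)]
        candidates = set().union(*per_word)
        totals = {h: sum(s.get(h, -penalty) for s in per_word) for h in candidates}
        return sorted(totals.items(), key=lambda kv: -kv[1])[:top]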
4 Experiments and Evaluation

To test the success of our approach to placing unknown words into the WordNet taxonomy on a large and significant sample, we designed the following experiment. If the algorithm is successful at placing unknown words in the correct new place in a taxonomy, we would expect it to place already known words in their current position. The experiment to test this worked as follows.

• For a word w, find the neighbors N(w) of w in WordSpace. Remove w itself from this set.

• Find the best class-label hmax(N(w)) for this set (using Definition 1).

• Test to see if, according to WordNet, hmax is a hypernym of the original word w, and if so check how closely hmax subsumes w in the taxonomy.

Since our class-labelling algorithm gives a ranked list of possible hypernyms, credit was given for correct classifications in the top 4 places. This algorithm was tested on singular common nouns (PoS-tag nn1), proper nouns (PoS-tag np0) and finite present-tense verbs (PoS-tag vvb). For each of these classes, a random sample of words was selected with corpus frequencies ranging from 1000 to 250. For the noun categories, 600 words were sampled, and for the finite verbs, 420. For each word w, we found semantic neighbors with and without using part-of-speech information. The same experiments were carried out using 3, 6 and 12 neighbors: we will focus on the results for 3 and 12 neighbors since those for 6 neighbors turned out to be reliably ‘somewhere in between’ these two.
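A minimal sketch of this evaluation loop is given below, reusing the hypothetical neighbors() and best_class_labels() helpers from the earlier sketches. The height bookkeeping mirrors the 'Height', 'Wrong' and 'Missing' columns of Table 2, but it is an illustration under those assumptions rather than the exact evaluation code.

    from nltk.corpus import wordnet as wn

    def taxonomic_height(word, label):
        """Smallest inclusive distance at which the synset `label` subsumes any
        sense of `word`, or None if it is not a hypernym of the word at all."""
        best = None
        for sense in wn.synsets(word):
            for path in sense.hypernym_paths():
                ancestors = list(reversed(path))[1:]
                if label in ancestors:
                    d = ancestors.index(label) + 1
                    best = d if best is None else min(best, d)
        return best

    def evaluate(test_words, row_index, vectors, k=12, top=4):
        """Tally the outcomes of Section 4: 'missing' words absent from WordNet,
        'wrong' words whose top-ranked labels never subsume them, and otherwise
        the height of the closest correct label."""
        outcome = {'missing': 0, 'wrong': 0, 'heights': []}
        for w in test_words:
            # For the PoS-tagged space, strip the tag before the WordNet lookup.
            if not wn.synsets(w):
                outcome['missing'] += 1
                continue
            neigh = [n for n, _ in neighbors(w, row_index, vectors, k) if n != w]
            labels = best_class_labels(neigh, top=top)
            heights = [taxonomic_height(w, h) for h, _ in labels]
            heights = [h for h in heights if h is not None]
            if heights:
                outcome['heights'].append(min(heights))  # credit the closest correct label
            else:
                outcome['wrong'] += 1
        return outcome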
Results for Common Nouns

The best results for reproducing WordNet classifications were obtained for common nouns, and are summarized in Table 2, which shows the percentage of test words w which were given a class-label h which was a correct hypernym according to WordNet (so for which h ∈ H(w)). For those words for which a correct classification was found, the ‘Height’ columns refer to the number of levels in the hierarchy between the target word w and the class-label h. If the algorithm failed to find a class-label h which is a hypernym of w, the result was counted as ‘Wrong’. The ‘Missing’ column records the number of words in the sample which are not in WordNet at all.

The following trends are apparent. For finding any correct class-label, the best results were obtained by taking 12 neighbors and using part-of-speech information, which found a correct classification for 485/591 = 82% of the common nouns that were included in WordNet. This compares favorably with previous experiments, though as stated earlier it is difficult to be sure we are comparing like with like. Finding the hypernym which immediately subsumes w (with no intervening nodes) exactly reproduces a classification given by WordNet, and as such was taken to be a complete success. Taking fewer neighbors and using PoS-information both improved this success rate, the best accuracy obtained being 86/591 = 15%. However, this configuration actually gave the worst results at obtaining a correct classification overall.

                 Height:   1     2     3     4     5     6     7     8     9    10   Wrong  Missing
Common Nouns (sample size 600)
  3 neighbors
    With PoS             14.3  26.1  33.1  37.8  39.8  40.6  41.5  42.0  42.0  42.0   56.5    1.5
    Strings only         11.8  23.3  31.3  36.6  39.6  41.1  42.1  42.3  42.3  42.3   56.1    1.5
  12 neighbors
    With PoS             10.0  21.8  36.5  48.5  59.3  70.0  76.6  78.8  79.8  80.8   17.6    1.5
    Strings only          8.5  21.5  33.6  46.8  57.1  66.5  72.8  74.6  75.3  75.8   22.6    1.5
Proper Nouns (sample size 600)
  3 neighbors
    With PoS             10.6  13.8  15.5  16.5  18.0  18.6  18.8  18.8  19.1  19.3   25.0   55.6
    Strings only          9.8  14.3  16.1  18.6  19.5  20.1  20.8  21.1  21.5  21.6   22.1   55.6
  12 neighbors
    With PoS             10.5  14.5  16.3  18.1  22.0  23.8  25.5  28.0  28.5  29.3   15.0   55.6
    Strings only          9.5  13.8  17.5  20.8  22.3  24.6  26.6  30.7  32.5  34.3   10.0   55.6
Verbs (sample size 420)
  3 neighbors
    With PoS             17.6  30.2  36.1  40.4  42.6  43.0  44.0  44.0  44.0  44.0   52.6    3.3
    Strings only         24.7  39.7  43.3  45.4  47.1  48.0  48.3  48.8  49.0  49.0   47.6    3.3
  12 neighbors
    With PoS             19.0  36.4  43.5  48.8  52.8  54.2  55.2  55.4  55.7  55.9   40.7    3.3
    Strings only         28.0  48.3  55.9  60.2  63.3  64.2  64.5  65.0  65.0  65.0   31.7    3.3

Table 2: Percentage of words which were automatically assigned class-labels which subsume them in the WordNet taxonomy, showing the number of taxonomic levels between the target word and the class-label

                 Height:    1      2      3      4      5      6     Wrong
Common Nouns             0.799  0.905  0.785  0.858  0.671  0.671  0.569
Proper Nouns             1.625  0.688  0.350  0.581  0.683  0.430  0.529
Verbs                    1.062  1.248  1.095  1.103  1.143  0.750  0.669

Table 3: Average affinity score of class-labels for successful and unsuccessful classifications

In conclusion, taking more neighbors makes the chances of obtaining some correct classification for a word w greater, but taking fewer neighbors increases the chances of ‘hitting the nail on the head’. The use of part-of-speech information reliably increases the chances of correctly obtaining both exact and broadly correct classifications, though careful tuning is still necessary to obtain optimal results for either.

Results for Proper Nouns and Verbs

The results for proper nouns and verbs (also in Table 2) demonstrate some interesting problems. On the whole, the mapping is less reliable than for common nouns, at least when it comes to reconstructing WordNet as it currently stands.

Proper nouns are rightly recognized as one of the categories where automatic methods for lexical acquisition are most important (Hearst and Schütze, 1993, §4). It is impossible for a single knowledge base to keep up-to-date with all possible meanings of proper names, and this would be undesirable without considerable filtering abilities because proper names are often domain-specific.

In our experiments, the best results for proper nouns were those obtained using 12 neighbors, where a correct classification was found for 206/266 = 77% of the proper nouns that were included in WordNet, using no part-of-speech information. Part-of-speech information still helps for mapping proper nouns into exactly the right place, but in general degrades performance.

Several of the proper names tested are geographical, and in the BNC they often refer to regions of the British Isles which are not in WordNet. For example, hampshire is labelled as a territorial division, which as an English county it certainly is, but in WordNet hampshire is instead a hyponym of domestic sheep. For many of the proper names which our evaluation labelled as ‘wrongly classified’, the classification was in fact correct but a different meaning from those given in WordNet. The challenge for these situations is how to recognize when corpus methods give a correct meaning which is different from the meaning already listed in a knowledge base. Many of these meanings will be systematically related (such as the way a region is used to name an item or product from that region, as with the hampshire example above) by generative processes which are becoming well understood by theoretical linguists (Pustejovsky, 1995), and linguistic theory may help our statistical algorithms considerably by predicting what sort of new meanings we might expect a known word to assume through metonymy and systematic polysemy.

Typical first names of people such as lisa and ralph almost always have neighbors which are also first names (usually of the same gender), but these words are not represented in WordNet. This lexical category is ripe for automatic discovery: preliminary experiments using the two names above as ‘seed-words’ (Roark and Charniak, 1998; Widdows and Dorow, 2002) show that by taking a few known examples, finding neighbors and removing words which are already in WordNet, we can collect first names of the same gender with at least 90% accuracy.

Verbs pose special problems for knowledge bases. The usefulness of an IS A hierarchy for pinpointing information and enabling inference is much less clear-cut than for nouns. For example, sleeping does entail breathing and arriving does imply moving, but the aspectual properties, argument structure and case roles may all be different. The more restrictive definition of troponymy is used in WordNet to describe those properties of verbs that are inherited through the taxonomy (Fellbaum, 1998, Ch 3). In practice, the taxonomy of verbs in WordNet tends to have fewer levels and many more branches than the noun taxonomy. This led to problems for our class-labelling algorithm — class-labels obtained for the verb play included exhaust, deploy, move and behave, all of which are ‘correct’ hypernyms according to WordNet, while possible class-labels obtained for the verb appeal included keep, defend, reassert and examine, all of which were marked ‘wrong’. For our methods, the WordNet taxonomy as it stands appears to give much less reliable evaluation criteria for verbs than for common nouns. It is also plausible that similarity measures based upon simple co-occurrence are better for modelling similarity between nominals than between verbs, an observation which is compatible with psychological experiments on word-association (Fellbaum, 1998, p. 90).

In our experiments, the best results for verbs were clearly those obtained using 12 neighbors and no part-of-speech information, for which some correct classification was found for 273/406 = 59% of the verbs that were included in WordNet, and which achieved better results than those using part-of-speech information even for finding exact classifications. The shallowness of the taxonomy for verbs means that most classifications which were successful at all were quite close to the word in question, which should be taken into account when interpreting the results in Table 2.

As we have seen, part-of-speech information degraded performance overall for proper nouns and verbs. This may be because combining all uses of a particular word-form into a single vector is less prone to problems of data sparseness, especially if these word-forms are semantically related in spite of part-of-speech differences.² It is also plausible that discarding part-of-speech information
should improve the classification of verbs for the following reason. Classification using corpus-derived neighbors is markedly better for common nouns than for verbs, and most of the verbs in our sample (57%) also occur as common nouns in WordSpace. (In contrast, only 13% of our common nouns also occur as verbs, a reliable asymmetry for English.) Most of these noun senses are semantically related in some way to the corresponding verbs. Since using neighboring words for classification is demonstrably more reliable for nouns than for verbs, putting these parts-of-speech together in a single vector in WordSpace might be expected to improve performance for verbs but degrade it for nouns.

² This issue is reminiscent of the question of whether stemming improves or harms information retrieval (Baeza-Yates and Ribiero-Neto, 1999) — the received wisdom is that stemming (at best) improves recall at the expense of precision, and our findings for proper nouns are consistent with this.

Filtering using Affinity scores

One of the benefits of the class-labelling algorithm (Definition 1) presented in this paper is that it returns not just class-labels but an affinity score measuring how well each class-label describes the class of objects in question. The affinity score turns out to be significantly correlated with the likelihood of obtaining a successful classification. This can be seen very clearly in Table 3, which shows the average affinity score for correct class-labels of different heights above the target word, and for incorrect class-labels — as a rule, correct and informative class-labels have significantly higher affinity scores than incorrect class-labels. It follows that the affinity score can be used as an indicator of success, and so filtering out class-labels with poor scores can be used as a technique for improving accuracy.

To test this, we repeated our experiments using 3 neighbors and this time only using class-labels with an affinity score greater than 0.75, the rest being marked ‘unknown’. Without filtering, there were 1143 successful and 1380 unsuccessful outcomes: with filtering, these numbers changed to 660 and 184 respectively. Filtering discarded some 87% of the incorrect labels and kept more than half of the correct ones, which amounts to at least a fourfold improvement in accuracy (the ratio of correct to incorrect outcomes rises from 1143/1380 ≈ 0.83 to 660/184 ≈ 3.6). The improvement was particularly dramatic for proper nouns, where filtering removed 270 out of 283 incorrect results and still retained half of the correct ones.
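As a sketch of this filtering step, the cut-off can simply be applied to the scored labels returned by the hypothetical best_class_labels() helper above; the 0.75 threshold is the one used in these experiments, but the function itself is illustrative.

    def filtered_class_labels(words, threshold=0.75, top=4):
        """Keep only class-labels whose summed affinity score clears the threshold;
        report 'unknown' rather than guessing when none does."""
        kept = [(h, s) for h, s in best_class_labels(words, top=top) if s > threshold]
        return kept if kept else 'unknown'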
Conclusions

For common nouns, where WordNet is most reliable, our mapping algorithm performs comparatively well, accurately classifying several words and finding some correct information about most others. The optimum number of neighbors is smaller if we want to try for an exact classification and larger if we want information that is broadly reliable. Part-of-speech information noticeably improves the process of both broad and narrow classification. For proper names, many classifications are correct, and many which are absent or incorrect according to WordNet are in fact correct meanings which should be added to the knowledge base for (at least) the domain in question. Results for verbs are more difficult to interpret: reasons for this might include the shallowness and breadth of the WordNet verb hierarchy, the suitability of our WordSpace similarity measure, and many theoretical issues which should be taken into account for a successful approach to the classification of verbs.

Filtering using the affinity score from the class-labelling algorithm can be used to dramatically increase performance.

5 Related work and future directions

The experiments in this paper describe one combination of algorithms for lexical acquisition: both the finding of semantic neighbors and the process of class-labelling could take many alternative forms, and an exhaustive evaluation of such combinations is far beyond the scope of this paper. Various mathematical models and distance measures are available for modelling semantic proximity, and more detailed linguistic preprocessing (such as chunking, parsing and morphology) could be used in a variety of ways. As an initial step, the way the granularity of part-of-speech classification affects our results for lexical acquisition will be investigated. The class-labelling algorithm could be adapted to use more sensitive measures of distance (Budanitsky and Hirst, 2001), and correlations between taxonomic distance and WordSpace similarity used as a filter.

The coverage and accuracy of the initial taxonomy we are hoping to enrich has a great influence on success rates for our methods as they stand. Since these are precisely the aspects of the taxonomy we are hoping to improve, this raises the question of whether we can use automatically obtained hypernyms as well as the hand-built ones to help classification. This could be tested by randomly removing many nodes from WordNet before we begin, and measuring the effect of using automatically derived classifications for some of these words (possibly those with high confidence scores) to help with the subsequent classification of others.

The use of semantic neighbors and class-labelling for computing with meaning goes far beyond the experimental set-up for lexical acquisition described in this paper — for example, Resnik (1999) used the idea of a most informative subsuming node (which can be regarded as a kind of class-label) for disambiguation, as did Agirre and Rigau (1996) with the conceptual density algorithm. Taking a whole domain as a ‘context’, this approach to disambiguation can be used for lexical tuning. For example, using the Ohsumed corpus of medical abstracts, the top few neighbors of operation are amputation, disease, therapy and resection. Our algorithm gives medical care, medical aid and therapy as possible class-labels for this set, which successfully picks out the sense
of operation which is most important for the medical domain.

The level of detail which is appropriate for defining and grouping terms depends very much on the domain in question. For example, the immediate hypernyms offered by WordNet for the word trout include

    fish, foodstuff, salmonid, malacopterygian, teleost fish, food fish, saltwater fish

Many of these classifications are inappropriately fine-grained for many circumstances. To find a degree of abstraction which is suitable for the way trout is used in the BNC, we found its semantic neighbors, which include herring, swordfish, turbot, salmon and tuna. The highest-scoring class-labels for this set are

    2.911  saltwater fish
    2.600  food fish
    1.580  fish
    1.400  scombroid, scombroid fish
    0.972  teleost fish

The preferred labels are the ones most humans would answer if asked what a trout is. This process can be used to select the concepts from an ontology which are appropriate to a particular domain in a completely unsupervised fashion, using only the documents from that domain whose meanings we wish to describe.
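For illustration, this domain-tuning step corresponds to a call like the following with the hypothetical helpers sketched earlier; with a current WordNet and the simplified scoring of the sketch, the exact labels and scores will differ from the WordNet 1.6 figures quoted above.

    # Corpus-derived neighbors of 'trout' in the BNC, as listed above.
    domain_neighbors = ['herring', 'swordfish', 'turbot', 'salmon', 'tuna']
    for synset, score in best_class_labels(domain_neighbors, top=5):
        print(f"{score:.3f}  {', '.join(synset.lemma_names())}")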
Demonstration

Interactive demonstrations of the class-labelling algorithm and WordSpace are available on the web at http://infomap.stanford.edu/classes and http://infomap.stanford.edu/webdemo. An interface to WordSpace incorporating the part-of-speech information is currently under consideration.

Acknowledgements

This research was supported in part by the Research Collaboration between the NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation and CSLI, Stanford University, and by EC/NSF grant IST-1999-11438 for the MUCHMORE project.

References

E. Agirre and G. Rigau. 1996. Word sense disambiguation using conceptual density. In Proceedings of COLING'96, pages 16–22, Copenhagen, Denmark.

Enrique Alfonseca and Suresh Manandhar. 2001. Improving an ontology refinement method with hyponymy patterns. In Third International Conference on Language Resources and Evaluation, pages 235–239, Las Palmas, Spain.

Ricardo Baeza-Yates and Berthier Ribiero-Neto. 1999. Modern Information Retrieval. Addison Wesley / ACM Press.

A. Budanitsky and G. Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA. NAACL.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

J. Firth. 1957. A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis, Philological Society, Oxford; reprinted in Palmer, F. (ed. 1968) Selected Papers of J. R. Firth, Longman, Harlow.

Marti Hearst and Hinrich Schütze. 1993. Customizing a lexicon to better suit a computational task. In ACL SIGLEX Workshop, Columbus, Ohio.

T. Landauer and S. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition. Psychological Review, 104(2):211–240.

Hang Li and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In ACL 1999, pages 317–324.

James Pustejovsky. 1995. The Generative Lexicon. MIT Press, Cambridge, MA.

Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:93–130.

Ellen Riloff and Jessica Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Claire Cardie and Ralph Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124. Association for Computational Linguistics, Somerset, New Jersey.

Brian Roark and Eugene Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In COLING-ACL, pages 1110–1116.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In 19th International Conference on Computational Linguistics, pages 1093–1099, Taipei, Taiwan, August.

Dominic Widdows, Beate Dorow, and Chiu-Ki Chan. 2002. Using parallel corpora to enrich multilingual lexical resources. In Third International Conference on Language Resources and Evaluation, pages 240–245, Las Palmas, Spain, May.
