Unsupervised Methods For Developing Taxonomies by Combining Syntactic and Statistical Information
Dominic Widdows
Center for the Study of Language and Information, Stanford University
[email protected]
    oak, oak tree ⇒ tree ⇒ woody plant, ligneous plant ⇒ vascular plant, tracheophyte ⇒ plant, flora, plant life ⇒ life form, organism, being, living thing ⇒ entity, something

Let S be a set of nouns or verbs. If the word w ∈ S is recognized by WordNet, the WordNet taxonomy assigns to w an ordered set of hypernyms H(w). Consider the union

    H = ⋃_{w∈S} H(w).

This is the set of all hypernyms of any member of S. Our intuition is that the most appropriate class-label for the set S is the hypernym h ∈ H which subsumes as many as possible of the members of S as closely as possible in the hierarchy. There is a trade-off here between subsuming 'as many as possible' of the members of S and subsuming them 'as closely as possible'. This line of reasoning can be used to define a whole collection of 'class-labelling algorithms'.

For each w ∈ S and for each h ∈ H, define the affinity score function α(w, h) between w and h to be

    α(w, h) =  f(dist(w, h))   if h ∈ H(w),
               −g(w, h)        if h ∉ H(w),        (1)

where dist(w, h) is a measure of the distance between w and h, f is some positive, monotonically decreasing function, and g is some positive (possibly constant) function. The function f accords 'positive points' to h if h subsumes w, and the condition that f be monotonically decreasing ensures that h gets more positive points the closer it is to w. The function g subtracts 'penalty points' if h does not subsume w. This function could depend in many ways on w and h — for example, there could be a smaller penalty if h is a very specific concept than if h is a very general concept.

The distance measure dist(w, h) could take many forms, and there are already a number of distance measures available to use with WordNet (Budanitsky and Hirst, 2001). The easiest method for assigning a distance between words and their hypernyms is to count the number of intervening levels in the taxonomy. This assumes that the distance in specificity between ontological levels is constant, which is of course not the case, a problem addressed by Resnik (1999).

Given an appropriate affinity score, it is a simple matter to define the best class-label for a collection of objects.

Definition 1 Let S be a set of nouns, let H = ⋃_{w∈S} H(w) be the set of hypernyms of S, and let α(w, h) be an affinity score function as defined in equation (1). The best class-label h_max(S) for S is the node h_max ∈ H with the highest total affinity score summed over all the members of S, so h_max is the node which gives the maximum score

    max_{h∈H} Σ_{w∈S} α(w, h).

Since H is determined by S, h_max is determined solely by the set S and the affinity score α. In the event that h_max is not unique, it is customary to take the most specific class-label available.

Example
A particularly simple example of this kind of algorithm is used by Hearst and Schütze (1993). First they partition the WordNet taxonomy into a number of disjoint sets which are used as class-labels. Thus each concept has a single 'hypernym', and the 'affinity-score' between a word w and a class h is simply the set membership function, α(w, h) = 1 if w ∈ h and 0 otherwise. A collection of words is assigned a class-label by majority voting.
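As a concrete illustration, Definition 1 takes only a few lines to implement. The following Python sketch is ours rather than the paper's (names such as best_class_labels and hyp_dists are hypothetical), and it assumes the distances dist(w, h) have already been looked up in the taxonomy:

    from collections import defaultdict

    def best_class_labels(hyp_dists, f, g, top_n=4):
        """Rank candidate class-labels for a word set S (Definition 1).

        hyp_dists maps each word w in S to a dict {h: dist(w, h)} of the
        hypernyms H(w) and their taxonomic distances from w; f is the
        positive, monotonically decreasing credit function and g(w, h)
        the penalty for candidates that fail to subsume w.
        """
        # H: the union of the hypernym sets of all members of S.
        candidates = set().union(*[set(d) for d in hyp_dists.values()])
        scores = defaultdict(float)
        for h in candidates:
            for w, dists in hyp_dists.items():
                if h in dists:
                    scores[h] += f(dists[h])   # h subsumes w: positive credit
                else:
                    scores[h] -= g(w, h)       # h misses w: subtract penalty
        # Highest total affinity first; the top-ranked node is h_max.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

For instance, with the scoring functions eventually fixed in Section 3.2 (f = 1/dist² and constant g = 0.25), best_class_labels({'oak': {'tree': 1, 'plant': 2}, 'ash': {'tree': 1, 'plant': 2}}, lambda d: 1/d**2, lambda w, h: 0.25) ranks tree (total affinity 2.0) above the less specific plant (0.5), with ties broken in favor of the more specific label as noted above.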
3.1 Ambiguity
In theory, rather than a class-label for related strings, we would like one for related meanings — the concepts to which the strings refer. To implement this for a set of words, we alter our affinity score function α as follows. Let C(w) be the set of concepts to which the word w could refer. (So each c ∈ C(w) is a possible sense of w.) Then

    α(w, h) = max_{c∈C(w)}  f(dist(c, h))   if h ∈ H(c),
                            −g(w, c)        if h ∉ H(c).        (2)

This implies that the 'preferred sense' of w with respect to the possible subsumer h is the sense closest to h. In practice, our class-labelling algorithm implements this preference by computing the affinity score α(c, h) for all c ∈ C(w) and only using the best match. This selective approach is much less noisy than simply averaging the probability mass of the word over each possible sense (the technique used by Li and Abe (1998), for example).
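For illustration (a sketch using NLTK's WordNet interface as a stand-in for the paper's own WordNet access; the function name sense_affinity is ours), equation (2) scores every sense of a word against a candidate subsumer h — here a WordNet synset — and keeps only the best match:

    from nltk.corpus import wordnet as wn

    def sense_affinity(word, h, f, g, pos=wn.NOUN):
        """Equation (2): the best affinity between any sense c of word and h."""
        scores = []
        for c in wn.synsets(word, pos=pos):          # C(w): the senses of word
            # hypernym_paths() gives root-to-c chains; where h lies on a chain,
            # count the levels between h and c inclusively, so dist >= 1.
            dists = [len(path) - path.index(h)
                     for path in c.hypernym_paths() if h in path]
            scores.append(f(min(dists)) if dists else -g(word, c))
        return max(scores) if scores else None

Only the best-matching sense contributes, implementing the 'preferred sense' behavior described above.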
3.2 Choice of scoring functions for the class-labelling algorithm
The precise choice of class-labelling algorithm depends on the functions f and g in the affinity score function α of equation (2). There is some tension here between being correct and being informative: 'correct' but uninformative class-labels (such as entity, something) can be obtained easily by preferring nodes high up in the hierarchy, but since our goal in this work was to classify unknown words in an informative and accurate fashion, the functions f and g had to be chosen to give an appropriate balance. After a variety of heuristic tests, the function f was chosen to be

    f = 1 / dist(w, h)²,

where for the distance function dist(w, h) we chose the computationally simple method of counting the number of taxonomic levels between w and h (inclusively, to avoid dividing by zero). For the penalty function g we chose the constant g = 0.25.

The net effect of choosing the reciprocal of the squared distance and a small constant penalty function was that hypernyms close to the concept in question received magnified credit, but possible class-labels were not penalized too harshly for missing out a node. This made the algorithm simple and robust to noise, but with a strong preference for detailed, information-bearing class-labels. This configuration of the class-labelling algorithm was used in all the experiments described below.

4 Experiments and Evaluation
To test the success of our approach to placing unknown words into the WordNet taxonomy on a large and significant sample, we designed the following experiment. If the algorithm is successful at placing unknown words in the correct new place in a taxonomy, we would expect it to place already known words in their current position. The experiment to test this worked as follows.

• For a word w, find the neighbors N(w) of w in WordSpace. Remove w itself from this set.
• Find the best class-label h_max(N(w)) for this set (using Definition 1).
• Test to see if, according to WordNet, h_max is a hypernym of the original word w, and if so check how closely h_max subsumes w in the taxonomy.

Since our class-labelling algorithm gives a ranked list of possible hypernyms, credit was given for correct classifications in the top 4 places. The algorithm was tested on singular common nouns (PoS-tag nn1), proper nouns (PoS-tag np0) and finite present-tense verbs (PoS-tag vvb). For each of these classes, a random sample of words was selected with corpus frequencies ranging from 1000 down to 250. For the noun categories, 600 words were sampled, and for the finite verbs, 420. For each word w, we found semantic neighbors with and without using part-of-speech information. The same experiments were carried out using 3, 6 and 12 neighbors: we focus on the results for 3 and 12 neighbors, since those for 6 neighbors turned out to be reliably 'somewhere in between' these two.

Results for Common Nouns
The best results for reproducing WordNet classifications were obtained for common nouns; these are summarized in Table 2, which shows the percentage of test words w which were given a class-label h that is a correct hypernym according to WordNet (so for which h ∈ H(w)). For those words for which a correct classification was found, the 'Height' columns refer to the number of levels in the hierarchy between the target word w and the class-label h. If the algorithm failed to find a class-label h which is a hypernym of w, the result was counted as 'Wrong'. The 'Missing' column records the number of words in the sample which are not in WordNet at all.

The following trends are apparent. For finding any correct class-label, the best results were obtained by taking 12 neighbors and using part-of-speech information, which found a correct classification for 485/591 = 82% of the common nouns that were included in WordNet. This compares favorably with previous experiments, though as stated earlier it is difficult to be sure we are comparing like with like. Finding the hypernym which immediately subsumes w (with no intervening nodes) exactly reproduces a classification given by WordNet, and as such was taken to be a complete success. Taking fewer neighbors and using PoS-information both improved this success rate, the best accuracy obtained being 86/591 = 15%. However, this configuration actually gave the worst results at obtaining a correct classification overall.
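Restating the test procedure above as code (a sketch under our own assumptions: neighbors stands in for the WordSpace query, hyp_dists_for for a WordNet hypernym-distance lookup, and best_class_labels is the sketch given in Section 3; none of these names come from the paper):

    def evaluate(words, neighbors, hyp_dists_for, k=12, top_n=4):
        """Re-place known words and check them against WordNet (Section 4)."""
        f = lambda d: 1.0 / d ** 2          # credit function from Section 3.2
        g = lambda w, h: 0.25               # constant penalty from Section 3.2
        results = {}
        for w in words:
            nbrs = [n for n in neighbors(w, k) if n != w]   # N(w), minus w itself
            ranked = best_class_labels(hyp_dists_for(nbrs), f, g, top_n)
            w_dists = hyp_dists_for([w])[w]  # hypernyms of w, with distances
            heights = [w_dists[h] for h, _ in ranked if h in w_dists]
            results[w] = min(heights) if heights else 'Wrong'
        return results

A word counts as correctly classified if any of its top four labels subsumes it in WordNet, the height of the closest such label being the quantity recorded in Table 2.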
    Height:          1     2     3     4     5     6     7     8     9    10   Wrong  Missing

    Common Nouns (sample size 600)
    3 neighbors
      With PoS      14.3  26.1  33.1  37.8  39.8  40.6  41.5  42.0  42.0  42.0   56.5     1.5
      Strings only  11.8  23.3  31.3  36.6  39.6  41.1  42.1  42.3  42.3  42.3   56.1     1.5
    12 neighbors
      With PoS      10.0  21.8  36.5  48.5  59.3  70.0  76.6  78.8  79.8  80.8   17.6     1.5
      Strings only   8.5  21.5  33.6  46.8  57.1  66.5  72.8  74.6  75.3  75.8   22.6     1.5

    Proper Nouns (sample size 600)
    3 neighbors
      With PoS      10.6  13.8  15.5  16.5  18.0  18.6  18.8  18.8  19.1  19.3   25.0    55.6
      Strings only   9.8  14.3  16.1  18.6  19.5  20.1  20.8  21.1  21.5  21.6   22.1    55.6
    12 neighbors
      With PoS      10.5  14.5  16.3  18.1  22.0  23.8  25.5  28.0  28.5  29.3   15.0    55.6
      Strings only   9.5  13.8  17.5  20.8  22.3  24.6  26.6  30.7  32.5  34.3   10.0    55.6

    Verbs (sample size 420)
    3 neighbors
      With PoS      17.6  30.2  36.1  40.4  42.6  43.0  44.0  44.0  44.0  44.0   52.6     3.3
      Strings only  24.7  39.7  43.3  45.4  47.1  48.0  48.3  48.8  49.0  49.0   47.6     3.3
    12 neighbors
      With PoS      19.0  36.4  43.5  48.8  52.8  54.2  55.2  55.4  55.7  55.9   40.7     3.3
      Strings only  28.0  48.3  55.9  60.2  63.3  64.2  64.5  65.0  65.0  65.0   31.7     3.3

Table 2: Percentage of words which were automatically assigned class-labels that subsume them in the WordNet taxonomy, by the number of taxonomic levels (Height) between the target word and the class-label. Percentages are cumulative from left to right.
    Height:          1      2      3      4      5      6      Wrong

    Common Nouns    0.799  0.905  0.785  0.858  0.671  0.671  0.569
    Proper Nouns    1.625  0.688  0.350  0.581  0.683  0.430  0.529
    Verbs           1.062  1.248  1.095  1.103  1.143  0.750  0.669

Table 3: Average affinity score of class-labels for successful and unsuccessful classifications
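Table 3 already suggests the confidence filter examined under 'Filtering using Affinity scores' below: labels scoring under a fixed threshold (the experiments reported there use 0.75) are discarded as 'unknown'. A minimal sketch, with ranked_labels as returned by the earlier best_class_labels sketch:

    def filter_labels(ranked_labels, threshold=0.75):
        """Keep class-labels whose affinity score clears the threshold;
        if none do, the word is marked 'unknown' rather than mislabelled."""
        kept = [(h, score) for h, score in ranked_labels if score > threshold]
        return kept if kept else 'unknown'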
In conclusion, taking more neighbors makes the chances of obtaining some correct classification for a word w greater, but taking fewer neighbors increases the chances of 'hitting the nail on the head'. The use of part-of-speech information reliably increases the chances of correctly obtaining both exact and broadly correct classifications, though careful tuning is still necessary to obtain optimal results for either.

Results for Proper Nouns and Verbs
The results for proper nouns and verbs (also in Table 2) demonstrate some interesting problems. On the whole, the mapping is less reliable than for common nouns, at least when it comes to reconstructing WordNet as it currently stands.

Proper nouns are rightly recognized as one of the categories where automatic methods for lexical acquisition are most important (Hearst and Schütze, 1993, §4). It is impossible for a single knowledge base to keep up to date with all possible meanings of proper names, and this would be undesirable without considerable filtering abilities because proper names are often domain-specific.

In our experiments, the best results for proper nouns were those obtained using 12 neighbors and no part-of-speech information, where a correct classification was found for 206/266 = 77% of the proper nouns that were included in WordNet. Part-of-speech information still helps for mapping proper nouns into exactly the right place, but in general degrades performance.

Several of the proper names tested are geographical, and in the BNC they often refer to regions of the British Isles which are not in WordNet. For example, hampshire is labelled as a territorial division, which as an English county it certainly is, but in WordNet hampshire is instead a hyponym of domestic sheep. For many of the proper names which our evaluation labelled as 'wrongly classified', the classification was in fact correct but reflected a different meaning from those given in WordNet. The challenge in these situations is to recognize when corpus methods give a correct meaning which differs from the meaning already listed in a knowledge base. Many of these meanings will be systematically related (such as the way a region is used to name an item or product from that region, as with the hampshire example above) by generative processes which are becoming well understood by theoretical linguists (Pustejovsky, 1995), and linguistic theory may help our statistical algorithms considerably by predicting what sort of new meanings we might expect a known word to assume through metonymy and systematic polysemy.

Typical first names of people such as lisa and ralph almost always have neighbors which are also first names (usually of the same gender), but these words are not represented in WordNet. This lexical category is ripe for automatic discovery: preliminary experiments using the two names above as 'seed-words' (Roark and Charniak, 1998; Widdows and Dorow, 2002) show that by taking a few known examples, finding neighbors and removing words which are already in WordNet, we can collect first names of the same gender with at least 90% accuracy.
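A sketch of that bootstrapping loop (our own illustration, not the paper's code; wordspace_neighbors and in_wordnet are assumed interfaces):

    def collect_first_names(seeds, wordspace_neighbors, in_wordnet, rounds=2, k=12):
        """Grow a set of first names from a few seed examples by repeatedly
        taking semantic neighbors and keeping words unknown to WordNet."""
        found = set(seeds)
        for _ in range(rounds):
            for word in list(found):
                for n in wordspace_neighbors(word, k):
                    if not in_wordnet(n):    # unknown words are candidate names
                        found.add(n)
        return found - set(seeds)

    # e.g. collect_first_names({'lisa', 'ralph'}, ...); the preliminary
    # experiments cited above report at least 90% accuracy for this method.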
Verbs pose special problems for knowledge bases. The usefulness of an IS A hierarchy for pinpointing information and enabling inference is much less clear-cut than for nouns. For example, sleeping does entail breathing and arriving does imply moving, but the aspectual properties, argument structure and case roles may all be different. The more restrictive definition of troponymy is used in WordNet to describe those properties of verbs that are inherited through the taxonomy (Fellbaum, 1998, Ch. 3). In practice, the taxonomy of verbs in WordNet tends to have fewer levels and many more branches than the noun taxonomy. This led to problems for our class-labelling algorithm — class-labels obtained for the verb play included exhaust, deploy, move and behave, all of which are 'correct' hypernyms according to WordNet, while possible class-labels obtained for the verb appeal included keep, defend, reassert and examine, all of which were marked 'wrong'. For our methods, the WordNet taxonomy as it stands appears to give much less reliable evaluation criteria for verbs than for common nouns. It is also plausible that similarity measures based upon simple co-occurrence are better for modelling similarity between nominals than between verbs, an observation which is compatible with psychological experiments on word-association (Fellbaum, 1998, p. 90).

In our experiments, the best results for verbs were clearly those obtained using 12 neighbors and no part-of-speech information, for which some correct classification was found for 273/406 = 59% of the verbs that were included in WordNet, and which achieved better results than those using part-of-speech information even for finding exact classifications. The shallowness of the taxonomy for verbs means that most classifications which were successful at all were quite close to the word in question, which should be taken into account when interpreting the results in Table 2.

As we have seen, part-of-speech information degraded performance overall for proper nouns and verbs. This may be because combining all uses of a particular word-form into a single vector is less prone to problems of data sparseness, especially if these word-forms are semantically related in spite of part-of-speech differences.² It is also plausible that discarding part-of-speech information
should improve the classification of verbs for the following reason. Classification using corpus-derived neighbors is markedly better for common nouns than for verbs, and most of the verbs in our sample (57%) also occur as common nouns in WordSpace. (In contrast, only 13% of our common nouns also occur as verbs, a reliable asymmetry for English.) Most of these noun senses are semantically related in some way to the corresponding verbs. Since using neighboring words for classification is demonstrably more reliable for nouns than for verbs, putting these parts-of-speech together in a single vector in WordSpace might be expected to improve performance for verbs but degrade it for nouns.

² This issue is reminiscent of the question of whether stemming improves or harms information retrieval (Baeza-Yates and Ribeiro-Neto, 1999) — the received wisdom is that stemming (at best) improves recall at the expense of precision, and our findings for proper nouns are consistent with this.

Filtering using Affinity scores
One of the benefits of the class-labelling algorithm (Definition 1) presented in this paper is that it returns not just class-labels but an affinity score measuring how well each class-label describes the class of objects in question. The affinity score turns out to be significantly correlated with the likelihood of obtaining a successful classification. This can be seen very clearly in Table 3, which shows the average affinity score for correct class-labels of different heights above the target word, and for incorrect class-labels — as a rule, correct and informative class-labels have significantly higher affinity scores than incorrect class-labels. It follows that the affinity score can be used as an indicator of success, and so filtering out class-labels with poor scores can be used as a technique for improving accuracy.

To test this, we repeated our experiments using 3 neighbors, this time only accepting class-labels with an affinity score greater than 0.75, the rest being marked 'unknown'. Without filtering, there were 1143 successful and 1380 unsuccessful outcomes; with filtering, these numbers changed to 660 and 184 respectively. Filtering discarded some 87% of the incorrect labels and kept more than half of the correct ones, which amounts to at least a fourfold improvement in accuracy. The improvement was particularly dramatic for proper nouns, where filtering removed 270 out of 283 incorrect results and still retained half of the correct ones.

Conclusions
For common nouns, where WordNet is most reliable, our mapping algorithm performs comparatively well, accurately classifying several words and finding some correct information about most others. The optimum number of neighbors is smaller if we want to try for an exact classification and larger if we want information that is broadly reliable. Part-of-speech information noticeably improves the process of both broad and narrow classification. For proper names, many classifications are correct, and many which are absent or incorrect according to WordNet are in fact correct meanings which should be added to the knowledge base for (at least) the domain in question. Results for verbs are more difficult to interpret: reasons for this might include the shallowness and breadth of the WordNet verb hierarchy, the suitability of our WordSpace similarity measure, and many theoretical issues which should be taken into account for a successful approach to the classification of verbs.

Filtering using the affinity score from the class-labelling algorithm can be used to dramatically increase performance.

5 Related work and future directions
The experiments in this paper describe one combination of algorithms for lexical acquisition: both the finding of semantic neighbors and the process of class-labelling could take many alternative forms, and an exhaustive evaluation of such combinations is far beyond the scope of this paper. Various mathematical models and distance measures are available for modelling semantic proximity, and more detailed linguistic preprocessing (such as chunking, parsing and morphology) could be used in a variety of ways. As an initial step, the way the granularity of part-of-speech classification affects our results for lexical acquisition will be investigated. The class-labelling algorithm could be adapted to use more sensitive measures of distance (Budanitsky and Hirst, 2001), and correlations between taxonomic distance and WordSpace similarity could be used as a filter.

The coverage and accuracy of the initial taxonomy we are hoping to enrich have a great influence on success rates for our methods as they stand. Since these are precisely the aspects of the taxonomy we are hoping to improve, this raises the question of whether we can use automatically obtained hypernyms as well as the hand-built ones to help classification. This could be tested by randomly removing many nodes from WordNet before we begin, and measuring the effect of using automatically derived classifications for some of these words (possibly those with high confidence scores) to help with the subsequent classification of others.

The use of semantic neighbors and class-labelling for computing with meaning goes far beyond the experimental set-up for lexical acquisition described in this paper — for example, Resnik (1999) used the idea of a most informative subsuming node (which can be regarded as a kind of class-label) for disambiguation, as did Agirre and Rigau (1996) with the conceptual density algorithm. Taking a whole domain as a 'context', this approach to disambiguation can be used for lexical tuning. For example, using the Ohsumed corpus of medical abstracts, the top few neighbors of operation are amputation, disease, therapy and resection. Our algorithm gives medical care, medical aid and therapy as possible class-labels for this set, which successfully picks out the sense
of operation which is most important for the medical domain.

The level of detail which is appropriate for defining and grouping terms depends very much on the domain in question. For example, the immediate hypernyms offered by WordNet for the word trout include

    fish, foodstuff, salmonid, malacopterygian, teleost fish, food fish, saltwater fish

Many of these classifications are inappropriately fine-grained for many circumstances. To find a degree of abstraction which is suitable for the way trout is used in the BNC, we found its semantic neighbors, which include herring, swordfish, turbot, salmon and tuna. The highest-scoring class-labels for this set are

    2.911  saltwater fish
    2.600  food fish
    1.580  fish
    1.400  scombroid, scombroid fish
    0.972  teleost fish

The preferred labels are the ones most humans would answer if asked what a trout is. This process can be used to select the concepts from an ontology which are appropriate to a particular domain in a completely unsupervised fashion, using only the documents from that domain whose meanings we wish to describe.
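In terms of the earlier sketches (best_class_labels with the Section 3.2 scoring functions; hyp_dists_for again stands in for a WordNet lookup — both names are hypothetical), the trout example amounts to labelling the word's corpus-derived neighbor set:

    # Semantic neighbors of 'trout' found in the BNC (from the text above).
    neighbors_of_trout = ['herring', 'swordfish', 'turbot', 'salmon', 'tuna']

    labels = best_class_labels(hyp_dists_for(neighbors_of_trout),
                               f=lambda d: 1.0 / d ** 2,
                               g=lambda w, h: 0.25,
                               top_n=5)
    # Ranking reported above:
    # [('saltwater fish', 2.911), ('food fish', 2.600), ('fish', 1.580),
    #  ('scombroid, scombroid fish', 1.400), ('teleost fish', 0.972)]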
Demonstration
Interactive demonstrations of the class-labelling algorithm and WordSpace are available on the web at http://infomap.stanford.edu/classes and http://infomap.stanford.edu/webdemo. An interface to WordSpace incorporating the part-of-speech information is currently under consideration.
Acknowledgements
This research was supported in part by the Research Collaboration between the NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation and CSLI, Stanford University, and by EC/NSF grant IST-1999-11438 for the MUCHMORE project.

References
Eneko Agirre and German Rigau. 1996. Word sense disambiguation using conceptual density. In COLING 1996, Copenhagen, Denmark.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. Addison Wesley / ACM Press.

A. Budanitsky and G. Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA. NAACL.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

J. Firth. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis, Philological Society, Oxford. Reprinted in Palmer, F. (ed. 1968), Selected Papers of J. R. Firth, Longman, Harlow.

Marti Hearst and Hinrich Schütze. 1993. Customizing a lexicon to better suit a computational task. In ACL SIGLEX Workshop, Columbus, Ohio.

T. Landauer and S. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211–240.

Hang Li and Naoki Abe. 1998. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In ACL 1999, pages 317–324.

James Pustejovsky. 1995. The Generative Lexicon. MIT Press, Cambridge, MA.

Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:93–130.

Ellen Riloff and Jessica Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Claire Cardie and Ralph Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124. Association for Computational Linguistics, Somerset, New Jersey.

Brian Roark and Eugene Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In COLING-ACL, pages 1110–1116.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In COLING 2002, Taipei, Taiwan.