
Definition Extraction using Linguistic and Structural Features
Eline Westerhout
Utrecht University
[email protected]

Abstract

In this paper a combination of linguistic and structural information is used for the extraction of Dutch definitions. The corpus used is a collection of Dutch texts on computing and elearning containing 603 definitions. The extraction process consists of two steps. In the first step a parser using a grammar defined on the basis of the patterns observed in the definitions is applied to the complete corpus. Machine learning is thereafter applied to improve the results obtained with the grammar. The experiments show that using a combination of linguistic (n-grams, type of article, type of noun) and structural information (layout, position) is a promising approach to the definition extraction task.

Keywords

definition extraction, machine learning, grammar, linguistic features, text structure

1 Introduction

Definition extraction is a relevant task in different areas. It is most often used in the domain of question answering to answer 'What is'-questions, but it is also used for dictionary building, ontology development and glossary creation. The context in which we apply definition extraction is the automatic creation of glossaries within elearning. Glossaries can play an important role within this domain, since they support the learner in decoding the learning object he is confronted with and in understanding the central concepts which are being conveyed in the learning material.

The glossary creation context imposes its own requirements on the task. The most relevant one is constituted by the corpus of learning objects, which includes a variety of text genres (such as manuals, scientific texts, descriptive documents) and also a variety of writing styles that pose a real challenge to computational techniques for the automatic identification and extraction of definitions together with their headwords. Our texts are not as structured as those employed for the extraction of definitions in question-answering tasks, which most times include encyclopedias and Wikipedia. Furthermore, some of our learning objects are relatively small in size, so our approach has to favor not only precision but also recall. That is, we want to make sure that as many definitions as possible present in a text are proposed to the user for the creation of the relevant glossary. Therefore, the extraction of definitions cannot be limited to sentences consisting of a subject, a copular verb and a predicative phrase, as is often the case in question-answering tasks; a much richer typology of patterns needs to be identified than in current research on definition extraction.

Different approaches for the extraction of definitions can be distinguished. We use a sequential combination of a rule-based approach and machine learning to extract them. As a first step a grammar is used to match sentences with a definition pattern; thereafter, machine learning techniques are applied to filter out those sentences that – although they have a definition pattern – do not qualify as definitions.

Our work has several innovative aspects compared to other work in this area. First, we address less common definition types in addition to 'to be' definitions. Second, we apply a machine learning algorithm designed specifically to deal with imbalanced datasets, which seems more appropriate for us because we have data sets in which the proportion of 'yes'-cases is extremely low. The third innovative aspect, on which this paper focuses, has to do with the combination of different types of information for the extraction of definitions. Not only linguistic information (n-grams, type of article, type of noun) has been used, but experiments with structural and textual information (position, layout) have also been carried out.

The paper is organized as follows. Section 2 introduces relevant work in definition extraction, focusing on the work done within the glossary creation context. Section 3 describes the data used in the experiments and the definition categories we distinguish. In section 4 the way in which grammars have been applied to extract definitions and the results obtained with them are discussed. Section 5 describes the machine learning approach, covering issues such as the classifier, the features and the experiments. Section 6 reports and discusses the results obtained in the experiments. Section 7 provides conclusions and presents some future work.

2 Related research

Research on definition extraction has been pursued mainly in the context of automatic dictionary building from text, question answering and ontology development. Initially, mainly pattern-based methods were used to extract definitions (cf. [12, 15, 16, 19]), but recently several researchers have started to apply machine learning techniques and combinations of pattern-based methods and machine learning in this area as well (cf. [2, 9, 11]). [20] provides an overview of the work done in the different areas and compares it to the task within the glossary creation context.

Definition detection approaches developed in the context of question-answering tasks are often definiendum-centered, that is, they search for definitions containing a given term. Our approach, in contrast, is connector-centered, which means that we search for verbs or phrases that typically appear in definitions, with the aim of finding the complete list of all definitions in a corpus independently of the defined terms. Despite the challenges that the eLearning application involves, we believe that the techniques for the extraction of definitions developed within the Natural Language Processing and Information Extraction communities can be adapted and extended for our purposes.

Our work on definition extraction started within the European LT4eL project. Within the scope of this project experiments for different languages have been carried out. [13] describe experiments on definition extraction in Slavic languages and present the results obtained with Bulgarian, Czech and Polish grammars. The three grammars show varying degrees of sophistication. The more sophisticated the grammar, the more patterns are covered. Although the recall improves when more rules are added, the precision does not drop and is comparable for the three languages (22.3-22.5%).

For Polish, [10, 14, 7] put effort into outperforming the pattern-based approach using machine learning techniques. To this end, [10] describe an approach in which the Balanced Random Forest classifier is used to extract definitions from Polish texts. They compare the results obtained with this approach to results obtained with experiments on the same data in which grammars were used [14] and to results of experiments with standard classifiers [7]. The best results are obtained with the approach designed for dealing with imbalanced datasets. The differences with our approach are that (1) they used either only machine learning or only a grammar and not a combination of the two, (2) they did not distinguish different definition types and (3) they only used relatively simple features, such as n-grams.

[3] applies Genetic Algorithms to the extraction of English 'to be' definitions. Her experiments focus on assigning weights to a set of features for the identification of such definitions. These weights act as a ranking mechanism for the classification of sentences, providing a level of certainty as to whether a sentence is actually a definition or a non-definition. She obtains a precision of 62% and a recall of 52% on the extraction of is definitions by using a set of features such as 'has keyword' and 'contains "is a"'.

[8] focus on the extraction of Portuguese 'to be' definitions. First, a simple grammar is used to extract all sentences in which the verb 'to be' is used as main verb. Because their corpus is heavily imbalanced and only 10 percent of the sentences are definitions, they investigate which sampling technique gives the best results and present results from experiments that seek to obtain optimal solutions for this problem.

Previous experiments for Dutch focused on using a grammar [22], and on using several combinations of machine learning and a grammar to extract definitions [21, 23, 20]. A comparison of a standard classifier (naive Bayes) and the Balanced Random Forest (BRF) classifier showed that, especially for the more imbalanced data sets, the BRF classifier outperforms the naive Bayes classifier [20]. In all these previous experiments the features used were either only n-grams or a combination of n-grams and linguistic features.

3 Data

Definitions are expected to contain at least three parts. The definiendum is the element that is defined (Latin: 'that which is to be defined'). The definiens provides the meaning of the definiendum (Latin: 'that which is doing the defining'). Definiendum and definiens are connected by a verb or punctuation mark, the connector, which indicates the relation between definiendum and definiens [19].

Based on the connectors used in the 603 manually annotated patterns, four common definition types were distinguished. The first type are the definitions in which a form of the verb 'to be' is used as connector (called 'is definitions'). The second group consists of definitions in which a verb (or verbal phrase) other than 'to be' is used as connector (e.g. to mean, to comprise). It also happens that a punctuation character (most times the colon) is used as connector; such patterns are contained in the third type. The fourth category contains the definitory contexts in which relative or demonstrative pronouns are used to point back to a defined term that is mentioned in a preceding sentence. The definition of the term then follows after the pronoun. Table 1 shows an example for each of the four types.

Type: is
  Gnuplot is een programma om grafieken te maken.
  'Gnuplot is a program for drawing graphs.'

Type: verb
  E-learning omvat hulpmiddelen en toepassingen die via het internet beschikbaar zijn en creatieve mogelijkheden bieden om de leerervaring te verbeteren.
  'eLearning comprises resources and applications that are available via the Internet and provide creative possibilities to improve the learning experience.'

Type: punctuation
  Passen: plastic kaarten voorzien van een magnetische strip, die door een gleuf gehaald worden, waardoor de gebruiker zich kan identificeren en toegang krijgt tot bepaalde faciliteiten.
  'Passes: plastic cards equipped with a magnetic strip, that can be swiped through a card reader, by means of which the identity of the user can be verified and the user gets access to certain facilities.'

Type: pronoun
  Dedicated readers. Dit zijn speciale apparaten, ontwikkeld met het exclusieve doel e-boeken te kunnen lezen.
  'Dedicated readers. These are special devices, developed with the exclusive goal to make it possible to read e-books.'

Table 1: Examples for each of the definition types
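To make the three-part anatomy concrete, the sketch below encodes the 'is' example from Table 1 as a small data structure. This is an illustrative sketch of our own; the class and field names are not part of the LT4eL tooling.

    from dataclasses import dataclass

    @dataclass
    class DefinitoryContext:
        """The three parts a definition is expected to contain (section 3)."""
        definiendum: str  # the element that is defined
        connector: str    # verb or punctuation mark linking the two parts
        definiens: str    # the phrase providing the meaning
        def_type: str     # one of: 'is', 'verb', 'punctuation', 'pronoun'

    # The 'is' example from Table 1, split into its three parts.
    gnuplot = DefinitoryContext(
        definiendum="Gnuplot",
        connector="is",
        definiens="een programma om grafieken te maken",
        def_type="is",
    )
    print(gnuplot.connector)  # a form of 'to be' marks the 'is' type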

4 Grammar

The first part of the extraction process is rule-based in our approach. Based on the part-of-speech tag patterns observed in the development part of the corpus, a grammar was written to detect the four types of definitions. For a proper extraction of both sentences of multi-sentence pronoun definitions, anaphora resolution would have to be included in the system. As this is a completely different topic, we decided to restrict ourselves to the part of the definition containing the pronoun and the connector verb (phrase). When the tool is integrated into the Learning Management System, it shows for each definition candidate one sentence to the left and one sentence to the right, so that the context in which the candidate is used can be seen. For the multi-sentence pronoun definitions this makes it possible to see which term is defined in the previous sentence and to select it manually.

The XML transducer LXTransduce, developed by [18], has been used to match the grammars against files in XML format. LXTransduce supplies a format for the development of grammars which are matched against either plain text or XML documents. The grammars are represented in XML using the lxtransduce.dtd DTD, which is part of the software. A sentence is classified as a definition sentence if the parsing algorithm finds a match in the sentence of at least one token (not necessarily spanning the whole sentence).
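The actual LXTransduce rules are XML pattern files and are not reproduced in the paper. As a rough Python analogue (our simplification, using the tagset of section 5.2), a rule can be viewed as a regular expression over a sentence's PoS-tag sequence, and a sentence becomes a candidate as soon as any rule matches a span of that sequence:

    import re

    # Toy stand-ins for two of the grammar rules: regular expressions over
    # the space-separated PoS tags of a sentence. The real rules are richer
    # (they can inspect word forms such as 'is' as well as tags).
    RULES = {
        "is": re.compile(r"noun verb art( adj)* noun"),
        "punctuation": re.compile(r"noun punc art( adj)* noun"),
    }

    def is_candidate(tagged_sentence):
        """A sentence is classified as a definition candidate if at least
        one rule matches a span of it (not necessarily the whole sentence)."""
        tags = " ".join(tag for _, tag in tagged_sentence)
        return any(rule.search(tags) for rule in RULES.values())

    # 'Gnuplot is een programma ...' tagged with the section 5.2 tagset.
    sentence = [("Gnuplot", "noun"), ("is", "verb"),
                ("een", "art"), ("programma", "noun")]
    print(is_candidate(sentence))  # True: the 'is' rule matches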
type         R     P     F     F2
is           0.83  0.36  0.50  0.58
verb         0.75  0.45  0.56  0.61
punctuation  0.93  0.07  0.13  0.18
pronoun      0.64  0.09  0.16  0.21
all          0.79  0.16  0.27  0.34

Table 2: Results with the grammar
Table 2 shows the results obtained with the grammar. As can be seen from this table, the precision is quite low for all types, especially for the punctuation and pronoun types. The grammar rules were thus not specific enough to filter out the incorrect sentences. To improve these low precision scores, machine learning has been applied to the grammar results.
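The paper does not spell out the F and F2 measures. On the assumption that the standard weighted F-measure is meant,

    F_beta = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R}

the F column corresponds to beta^2 = 1 (the usual harmonic mean) and the F2 column is reproduced by beta^2 = 2, i.e. recall weighted twice as heavily as precision. For example, for the is type: 3 · 0.36 · 0.83 / (2 · 0.36 + 0.83) ≈ 0.58, as reported in Table 2.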
5 Machine learning

The datasets obtained with the grammar are imbalanced, especially for the punctuation and pronoun definitions. Our interest leans towards correct classification of the smaller class (the 'positive' class), that is, the class containing the definitions. Therefore, a classifier specifically designed to deal with imbalanced datasets has been used, namely the Balanced Random Forest classifier. After describing how this classifier works, the features and feature settings are set out.

5.1 Balanced Random Forest Classifier

The Random Forest classifier is a decision tree algorithm, which aims at finding a tree that best fits the training data. Whereas normally the underlying tree is a CART tree, in the Weka package it is a modified variant of REPTree. The Weka algorithm follows the same methods of introducing randomness and voting of models. At the root node of the tree the feature that best divides the training data is used; in the Random Forest classifier [5] the Gini index is used as splitting measure. In the Random Forest classifier there is not just one tree used for classification but an ensemble of trees [4]. The 'forest' is created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating the predictions of the ensemble. This idea behind Random Forest can be used in other classifiers as well and is called bagging (bootstrap aggregating).

A disadvantage of the Random Forest approach is that when data are extremely imbalanced, there is a significant probability that a bootstrap sample contains few or even none of the minority class. As a consequence, the resulting tree will perform poorly when predicting the minority class. To solve this problem, [6] proposed the Balanced Random Forest classifier. This is a modification of the Random Forest method specifically designed to deal with imbalanced data sets using down-sampling. In this method an adapted version of the bagging procedure is used, the difference being that trees are induced from balanced, down-sampled data. The procedure of the Balanced Random Forest (BRF) algorithm is described by [6] as follows (a sketch in code is given after the steps):

1. For each iteration in random forest, draw a bootstrap sample from the minority class. Randomly draw the same number of cases, with replacement, from the majority class.

2. Induce a classification tree from the data to maximum size, without pruning. The tree is induced with the CART (Classification and Regression Trees) algorithm [5], with the following modification: at each node, instead of searching through all variables for the optimal split, only search through a set of m randomly selected variables.¹

3. Repeat the two steps above the desired number of times. Aggregate the predictions of the ensemble and make the final prediction.

¹ [4] experimented with m = 1 and a higher value of m and concluded that the procedure is not very sensitive to the value of m: the average absolute difference between the error rate using m = 1 and the higher value of m is less than 1%.
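The balanced bootstrap of step 1 is the part that differs from standard bagging. Below is a minimal sketch, assuming NumPy arrays for the data; this is our own illustration, not the Weka implementation used in the experiments.

    import numpy as np

    def balanced_bootstrap(X, y, rng):
        """Step 1 of BRF: bootstrap the minority class and draw an equally
        large sample, with replacement, from the majority class."""
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y != minority)
        boot_min = rng.choice(min_idx, size=min_idx.size, replace=True)
        boot_maj = rng.choice(maj_idx, size=min_idx.size, replace=True)
        idx = np.concatenate([boot_min, boot_maj])
        return X[idx], y[idx]

    # Steps 2 and 3 then induce an unpruned tree on each balanced sample
    # and aggregate the predictions of the ensemble.
    rng = np.random.default_rng(0)
    X = np.arange(20).reshape(10, 2)               # 10 toy sentences, 2 features
    y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # 'yes'-cases extremely rare
    Xb, yb = balanced_bootstrap(X, y, rng)
    print(yb.mean())  # 0.5: each tree sees a balanced training set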

5.2 Features

The features that have been used can be divided into five categories. Several combinations of these features resulted in 16 settings.

1. Text properties: these include various types of n-grams with different values for n.

2. Syntactic properties: features of this category give information on syntactic properties of the sentences; in these experiments the type of article used in definiens and definiendum is considered.

3. Word properties: in this category information on specific words is included; in these experiments, whether the noun in the definiens is a proper or a common noun.

4. Position properties: these include several features which give information on the place in the document where the definition is used.

5. Lay-out properties: this category contains features on layout information used in definitions.

N-grams

In many text classification tasks n-grams are used for predicting the correct class (cf. [1] and [17]). For the classification of definitions n-grams have been used with n being 1, 2 or 3. We used Part-of-Speech tag (PoS-tag) n-grams. The tagger used distinguished ten parts of speech: adjective, adverb, article, conjunction, interjection, noun, numeral, preposition, pronoun, verb. In addition it used the tag 'Misc' for unknown words and 'Punc' for punctuation marks.
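As an illustration of how a sentence's PoS-tag sequence can be turned into n-gram features (a sketch of our own; the exact feature encoding used in the experiments is not specified in the paper):

    from collections import Counter

    def pos_ngrams(tags, max_n=3):
        """Collect PoS-tag n-grams with n = 1, 2 or 3 from one sentence."""
        grams = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tags) - n + 1):
                grams["_".join(tags[i:i + n])] += 1
        return grams

    # Tags taken from the tagset described above ('noun', 'verb', 'art', ...).
    print(pos_ngrams(["noun", "verb", "art", "noun"]))
    # Counter({'noun': 2, 'verb': 1, 'art': 1, 'noun_verb': 1,
    #          'verb_art': 1, 'art_noun': 1, 'noun_verb_art': 1,
    #          'verb_art_noun': 1})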
Articles

[9] investigated whether there is a connection between the type of article used in the definiendum (definite, indefinite, other) and the class of sentences (definition or non-definition). Although our definition corpus contains less structured texts than the data used by [9] (Wikipedia texts), part of the figures are quite similar for our data (Table 3). In the Wikipedia sentences, the majority of subjects in definition sentences did not have an article (63%), which is the same in our corpus (62%). A difference with their data is the proportion of indefinite articles, which is 25% in our data and 13% in the data from [9].

            definition  non-definition
definite    12.8%       44.4%
indefinite  25.0%       8.3%
no article  62.2%       43.7%
other       0%          3.6%
total       100%        100%

Table 3: Proportions of article types used in definiendum of is-definitions

The difference in distribution observed for the is-definitions is not seen to the same extent for the verb and punctuation definitions. In the verb definition candidates, for instance, definite articles tend to be used both in definitions and in non-definitions. However, also for these types there is a difference between definitions and non-definitions with respect to this feature.

The article used in the predicate complement has also been included. Again, we observe similarities and differences between our data and the data from [9]. In both data sets the vast majority of articles tends to be indefinite at the start of the definiens (72% and 64%), which is quite different from the proportions for the non-definitions (30% and 29%). Differences between the two data sets are the proportion of definite articles in the definitions group (15% and 23%) and the proportion of no articles in the non-definitions (18% and 1%), which is much higher in the LT4eL data set.

            definitions  non-definitions
definite    14.7%        30.0%
indefinite  71.8%        30.0%
no article  9.0%         18.7%
other       4.5%         21.3%
total       100%         100%

Table 4: Proportions of article types used at start of definiens in is-definitions

Nouns

Nouns can be divided into two types, namely proper nouns and common nouns. Unfortunately, with our linguistic annotation tools it was not possible to get more detailed information about the type of proper noun (e.g. person, location), so we can only distinguish between proper and common nouns. The distribution of these types is different for definitions and non-definitions, especially for is-definitions. In the is-definitions the proportion of proper nouns in the definiendum is considerably higher for the definitions than for the non-definitions (53% versus 31%). For the other definition types the difference observed is much smaller.

Layout

Because definitions contain important information, one might expect special layout features (e.g. bold, italics, underlining) to occur more often in definitions than in non-definitions. Because in our data information on the original layout of the documents has been stored per word, it was possible to check whether this was the case. As far as we know, no other research on definitions has included this property. For each of the sentences it was indicated whether a specific layout feature was used in the definiendum. Because of the small numbers for some of the properties we decided to combine all layout features into one group. A comparison shows that is, verb and punctuation definition sentences contain significantly more layout information in the definiendum than non-definition sentences.² For each of the definition types the proportion of layout information is about twice as high in definitions as in non-definitions.

² The pronoun definitions were not included in this investigation, because the definiendum of these sentences is often not in the same sentence as the definiens.

Position

[9], in their research on definition extraction from Wikipedia texts, reduced the set of definition candidates extracted with the grammar by selecting only the first sentences of each document as possible candidates. It seems that Google's define query feature also relies heavily on this feature to answer definition queries. However, as [9] also state, the first position sentence is likely to be a weaker predictor of definition versus non-definition sentences for documents from other sources, which are not as structured as Wikipedia. The texts from the LT4eL corpus are such less structured texts and therefore using this restriction would not be a good decision when dealing with these documents. In addition to being less structured, they are also often longer and contain on average 10.6 definitions, so applying the first sentence restriction would cause a dramatic decrease of recall and make it impossible to fulfil our aim of extracting as many definitions as possible, because at most one sentence per document would be extracted using this method.

Although we thus cannot use the same restriction, it is nevertheless possible to include information on the position of the definition candidate in a document as a feature in the machine learning experiments, to see whether it helps the classifier in predicting the correct class. To this end, three types of positional information were included in the features, namely information on the position of the sentence within the paragraph, information on the position of the definition within the sentence and information on the (relative and absolute) position of the definiendum compared to other occurrences of the term in the document.

Position in paragraph  Each document is divided into paragraphs which are again divided into sentences. It is thus possible to see where in the paragraph a definition is used. When we consider each paragraph as a separate block of information, we would expect definitions to appear at the beginning of such a block. The fact that sentence position is such a strong predictor in Wikipedia articles supports this idea.

The first property related to position in paragraph is the absolute position of the definition sentence within the paragraph. When we compare definitions and non-definitions with respect to this feature, we see that for three of the four definition types the absolute position is lower for the definitions. Only for the pronoun definitions there is no significant difference; the pronoun definitions tend to be used later on in the paragraph compared to the non-definitions of this type. This might be caused by the fact that they are often used at the second position of the paragraph, where the term is mentioned in the first sentence.

In addition to the absolute position of a sentence, we also included a score on the relative position, taking into account the number of sentences in a paragraph, because the beginning of a paragraph is a relative property. When we compare the scores on this property for definitions and non-definitions, for three of the four types there is a significant difference; only the result for the punctuation definitions is not significant.

Position in sentence  When we look at the four definition types, one of the differences observed is the place in the sentence where a definition can start and end. Whereas is and verb definitions tend to span a complete sentence, the rules for punctuation definitions are less strict on this feature. On the basis of this observation we investigated whether information on this could be used to distinguish definitions from non-definitions. In addition to this, a second reason has to do with the conversion from original document to XML document. During this process sentences were split automatically and marked as <s>. However, not all sentences were split correctly, because the sentence splitter tool sometimes made errors which were not corrected manually. Therefore, an extra rule had to be used to detect the beginning of a sentence, saying that each word starting with a capital could indicate the start of a sentence.

The position is given by indicating the number of tokens in the <s> before the definition starts. For all definition types, the absolute position of the definition candidate within the sentence is significantly lower for definitions than for non-definitions.

Position of definiendum  When a term is defined, one would expect that it has not been used many times before it is explained in the definition. Although it is possible that it has been used two or three times before already (e.g. in the title of the document, the table of contents or a heading), intuitively one would expect it to be used more after it has been explained. Based on this intuition three measures have been included. The first two are the absolute number of occurrences of the term before and after it is used in the definition candidate. For all types the average number of occurrences before is lower for definitions. This difference is significant for all types except the is-definitions. The number of occurrences of the term after it has been defined seems to be a weaker predictor and is only significantly lower for the is-definitions. When we look at the relative position of the definiendum, the score is significantly lower for the definition sentences for all types except the is-definitions, for which there is no difference observed.

5.3 Feature settings

The first setting is the n-grams of part-of-speech tags. This setting is the baseline to which all other settings are compared. The four types of features – articles, nouns, position and layout – have been combined in all possible ways, resulting in 16 settings in total. In the second group the four types of feature settings were tried separately (settings 2 to 5). Settings 6 to 11 are all possible combinations of two of the four settings. Then there are four settings (12 to 15) in each of which three types were combined, and in the last setting all four types are integrated. Table 5 shows the settings.

#   setting
1.  n-grams
2.  article
3.  noun
4.  position
5.  layout
6.  article + noun
7.  article + position
8.  article + layout
9.  noun + position
10. noun + layout
11. position + layout
12. article + noun + position
13. article + noun + layout
14. article + position + layout
15. noun + position + layout
16. article + noun + position + layout

Table 5: The sixteen feature settings
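The sixteen settings of Table 5 are simply the n-gram baseline plus every non-empty subset of the four feature groups, so they can be enumerated mechanically (a sketch of our own):

    from itertools import combinations

    GROUPS = ["article", "noun", "position", "layout"]

    # Setting 1 is the n-gram baseline; settings 2-16 are the fifteen
    # non-empty subsets of the four feature groups, in the order of Table 5.
    settings = [("n-grams",)]
    for size in range(1, len(GROUPS) + 1):
        settings.extend(combinations(GROUPS, size))

    for number, setting in enumerate(settings, start=1):
        print(f"{number}. {' + '.join(setting)}")
    # 1. n-grams
    # 2. article
    # ...
    # 16. article + noun + position + layout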

6 Results

The final results after applying both the grammar and machine learning are shown in Table 6. The sentences not detected with the grammar rules could of course not be retrieved anymore, and as a consequence the recall after applying machine learning is always lower than the recall obtained in the first step. For each experiment four measures are reported. The first three are the recall, precision and f-score of the definition class. The fourth score is the overall classification accuracy. The separate results for the non-definition class are not shown. As the aim of the experiments is to improve the precision obtained with the grammar, this is the most important measure. However, recall and accuracy may not become too low, and therefore recall, f-score and accuracy are also reported. For each of the types it is described in this section how the results should be interpreted and to what extent the settings can compete with setting 1 (n-grams).
IS VERB PUNCTUATION PRONOUN
setting R P F A R P F A R P F A R P F A
1. 0.57 0.49 0.53 0.60 0.58 0.54 0.56 0.56 0.51 0.16 0.24 0.74 0.40 0.15 0.22 0.64
2. 0.74 0.56 0.64 0.66 0.49 0.53 0.51 0.54 0.50 0.13 0.21 0.70 0.55 0.17 0.26 0.61
3. 0.49 0.47 0.48 0.58 0.49 0.43 0.46 0.43 0.43 0.11 0.18 0.68 0.49 0.21 0.29 0.70
4. 0.57 0.50 0.54 0.61 0.52 0.53 0.53 0.54 0.47 0.13 0.20 0.70 0.47 0.19 0.27 0.67
5. 0.17 0.52 0.26 0.61 0.15 0.56 0.24 0.53 0.39 0.14 0.21 0.76 0.57 0.09 0.15 0.21
6. 0.70 0.56 0.62 0.66 0.49 0.61 0.55 0.60 0.60 0.11 0.19 0.58 0.56 0.18 0.27 0.62
7. 0.64 0.63 0.64 0.71 0.56 0.56 0.56 0.58 0.53 0.15 0.24 0.73 0.47 0.22 0.30 0.72
8. 0.74 0.57 0.64 0.68 0.44 0.54 0.49 0.54 0.52 0.13 0.21 0.68 0.53 0.17 0.26 0.62
9. 0.54 0.52 0.53 0.62 0.56 0.56 0.56 0.57 0.50 0.15 0.23 0.73 0.45 0.18 0.26 0.68
10. 0.53 0.47 0.50 0.58 0.22 0.46 0.30 0.49 0.42 0.16 0.24 0.78 0.53 0.20 0.29 0.67
11. 0.57 0.52 0.54 0.62 0.52 0.51 0.51 0.52 0.48 0.14 0.21 0.72 0.44 0.19 0.26 0.68
12. 0.63 0.62 0.62 0.70 0.56 0.58 0.57 0.59 0.53 0.17 0.26 0.75 0.47 0.23 0.31 0.73
13. 0.69 0.57 0.62 0.67 0.51 0.64 0.57 0.62 0.53 0.14 0.22 0.70 0.57 0.19 0.29 0.64
14. 0.66 0.64 0.65 0.72 0.59 0.57 0.58 0.58 0.54 0.16 0.24 0.73 0.46 0.22 0.30 0.72
15. 0.57 0.53 0.54 0.63 0.53 0.52 0.52 0.53 0.45 0.14 0.21 0.73 0.47 0.20 0.28 0.70
16. 0.63 0.63 0.63 0.71 0.56 0.57 0.56 0.58 0.47 0.15 0.23 0.75 0.42 0.22 0.29 0.73

Table 6: Final results after applying grammar and machine learning

6.1 Results per type

Is definitions  The first block of information in Table 6 shows the results for the is definitions. We see that for this type the article is the best feature for classification. Using only this feature gives better results than those obtained with the n-grams. The second best individual feature is the information on position, although for this type the results with the n-grams are almost the same. A combination of article, position and layout (setting 14) gives the best result, which is equally good as the result obtained with a combination of article and position (setting 7) and a combination of all feature settings (setting 16). For the layout setting the recall is very low, which is not strange given the fact that special layout was used in only a small subset of the definitions. Although there is a slight improvement when it is used in combination with other features, the added value is not big. Adding the noun to other settings generally leads to either lower or similar classification results. The maximum improvement of precision compared to the precision obtained with the grammar is 77.8% (setting 14).

Verb definitions  The second group of definitions in Table 6 are the verb definitions. For this type none of the individual settings outperforms the baseline set by the n-grams. The best feature here is position. Using a combination of features makes it possible to perform better than the n-grams. The highest precision is obtained with setting 13, which is a combination of article, noun and layout. The results with the layout setting are comparable to the results for the is definitions. The grammar precision for this type was 0.45, so the maximum improvement is 42.2% (setting 13).

Punctuation definitions  For the punctuation definitions the accuracy is highly determined by the non-definitions, as these constitute over 90% of the data set. For the individual feature settings the best precision and accuracy are obtained with the layout setting; however, the recall is quite low for this type. Only one of the settings gives better results than the n-grams, namely setting 12 (article, noun and position). The maximum improvement of precision compared to the precision obtained with the grammar is 142.9% with this setting.

Pronoun definitions  Just as for the punctuation definitions, the pronoun definitions data set is highly imbalanced. The noun is the most important individual feature setting, which is surprising as many of these definitions do not have a definiendum. In most settings the recall improves compared to the n-gram result on this score, but this often goes with a drop in precision. An overall improvement compared to the baseline is observed in most of the settings, especially in setting 7 (article and position) and setting 10 (noun and layout), and the best result is obtained with setting 12 (article, noun and position), which is considerably better than the result of setting 1. With setting 12 the increase of the precision score compared to the precision obtained with the grammar is 155.6%.
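The quoted percentages are relative precision gains over the grammar baseline of Table 2. As a quick check (our own arithmetic, combining the figures from Tables 2 and 6):

    def relative_gain(grammar_p, best_p):
        """Relative precision improvement over the grammar, in percent."""
        return 100 * (best_p - grammar_p) / grammar_p

    # Grammar precision (Table 2) and best machine-learning precision (Table 6).
    for def_type, grammar_p, best_p in [
        ("is", 0.36, 0.64),           # setting 14
        ("verb", 0.45, 0.64),         # setting 13
        ("punctuation", 0.07, 0.17),  # setting 12
        ("pronoun", 0.09, 0.23),      # setting 12
    ]:
        print(f"{def_type}: +{relative_gain(grammar_p, best_p):.1f}%")
    # is: +77.8%, verb: +42.2%, punctuation: +142.9%, pronoun: +155.6%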

6.2 General observations

When looking from the perspective of the settings, we see that article and position are in general the best features. The main problem with the layout feature setting is that the recall obtained with it is quite low. Also, adding it as an extra feature to other settings does not lead to much improvement of the results. A second general observation is that for none of the types are the best results obtained when a combination of all features is used. It is thus not the case that the more information is included, the better the results. For all types one or more feature settings outperform the n-gram results.

7 Conclusions and future work

The influence of the inclusion of linguistic and structural features on classification accuracy differs per type and per combination of settings. Except for the layout setting, all individual settings perform well on at least one of the definition types. Combining the different feature settings generally improves the results.

The precision improved in all cases. The two types on which the grammar performed best (is and verb) showed a substantial improvement of 77.8% and 42.2%. And even though precision was still low for punctuation and pronoun patterns after applying machine learning, the percentage improvement was huge for these types (142.9% and 155.6% respectively).

The fact that it is possible to obtain better results with linguistic and structural features than with part-of-speech n-grams is encouraging for several reasons. First, it shows that it makes sense to use other information in addition to linguistic information (position and lay-out settings) and to structure the linguistic information (article and noun settings). A second issue is that those features provide more insight into how definitions are used, which is relevant for research on definitions.

As the results are promising, future work will proceed in this direction. We plan to conduct experiments in which other feature settings that go beyond the use of linguistic information are used in addition to the settings discussed in this paper. An example of such a setting is the importance of words in a text ('keywordiness'). Another future experiment will investigate whether the number of included n-grams (in these experiments we included all n-grams) can be decreased to lower the computational load while keeping the same results. Initial experiments with 100 n-grams for the is definitions did not show much decrease in performance.

References

[1] R. Bekkerman and J. Allan. Using bigrams in text categorization. Technical Report IR, 2003.

[2] S. Blair-Goldensohn, K. R. McKeown, and A. Hazen Schlaikjer. Answering definitional questions: A hybrid approach. In New Directions in Question Answering. AAAI Press, 2004.

[3] C. Borg. Automatic definition extraction using evolutionary algorithms. PhD thesis, University of Malta, 2009.

[4] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.

[6] C. Chen, A. Liaw, and L. Breiman. Using Random Forest to learn imbalanced data. Technical Report 666, University of California, Berkeley, 2004.

[7] L. Degórski, M. Marcińczuk, and A. Przepiórkowski. Definition extraction using a sequential combination of baseline grammars and machine learning classifiers. In Proceedings of LREC 2008, 2008.

[8] R. Del Gaudio and A. Branco. Extraction of definitions in Portuguese: An imbalanced data set problem. In Proceedings of Text Mining and Applications at EPIA 2009, 2009.

[9] I. Fahmi and G. Bouma. Learning to identify definitions using syntactic features. In R. Basili and A. Moschitti, editors, Proceedings of the EACL Workshop on Learning Structured Information in Natural Language Applications, 2006.

[10] L. Kobyliński and A. Przepiórkowski. Definition extraction with balanced random forests. In B. Nordström and A. Ranta, editors, Advances in Natural Language Processing: Proceedings of the 6th International Conference on Natural Language Processing, GoTAL 2008, pages 237–247. Springer Verlag, LNAI series 5221, 2008.

[11] S. Miliaraki and I. Androutsopoulos. Learning to identify single-snippet answers to definition questions. In Proceedings of COLING 2004, pages 1360–1366, 2004.

[12] S. Muresan and J. Klavans. A method for automatically building and evaluating dictionary resources. In Proceedings of the Language Resources and Evaluation Conference, 2002.

[13] A. Przepiórkowski, L. Degórski, M. Spousta, K. Simov, P. Osenova, L. Lemnitzer, V. Kubon, and B. Wójtowicz. Towards the automatic extraction of definitions in Slavic. In Proceedings of the BSNLP Workshop at ACL 2007, 2007.

[14] A. Przepiórkowski, M. Marcińczuk, and L. Degórski. Dealing with small, noisy and imbalanced data: Machine learning or manual grammars? In Proceedings of TSD 2008, 2008.

[15] H. Saggion. Identifying definitions in text collections for question answering. In Proceedings of the Language Resources and Evaluation Conference, 2004.

[16] A. Storrer and S. Wellinghof. Automated detection and annotation of term definitions in German text corpora. In Proceedings of LREC 2006, 2006.

[17] C. Tan, Y. Wang, and C. Lee. The use of bigrams to enhance text categorization. Information Processing and Management, 38(4):529–546, 2002.

[18] R. Tobin. Lxtransduce, a replacement for fsgmatch, 2005. http://www.ltg.ed.ac.uk/~richard/ltxml2/lxtransduce-manual.html.

[19] S. Walter and M. Pinkal. Automatic extraction of definitions from German court decisions. In Proceedings of the Workshop on Information Extraction Beyond the Document, pages 20–28, 2006.

[20] E. Westerhout. Extraction of definitions using grammar-enhanced machine learning. In Proceedings of the Student Research Workshop at EACL 2009, pages 88–96, Athens, Greece, 2009. Association for Computational Linguistics.

[21] E. Westerhout and P. Monachesi. Combining pattern-based and machine learning methods to detect definitions for elearning purposes. In Proceedings of the RANLP 2007 Workshop "Natural Language Processing and Knowledge Representation for eLearning Environments", 2007.

[22] E. Westerhout and P. Monachesi. Extraction of Dutch definitory contexts for elearning purposes. In Proceedings of CLIN 2006, 2007.

[23] E. Westerhout and P. Monachesi. Creating glossaries using pattern-based and machine learning techniques. In Proceedings of LREC 2008, 2008.
