Mcsethesis Tanmoy Chakraborty
Expressions
Thesis submitted to the Faculty of Engineering & Technology, Jadavpur University
In partial fulfillment of the requirements for the Degree Of
By
Tanmoy Chakraborty
Exam Roll No. – M4CSE11-23
Class Roll No. – 000910502033
Registration No. – 108432 of 2009-2010
_______________________________________________
(Prof. Sivaji Bandyopadhyay)
Thesis Supervisor
Department of Computer Science and Engineering,
Jadavpur University, Kolkata- 700032
Countersigned:
________________________________________________
(Prof. Chandan Mazumdar)
Head of the Department,
Department of Computer Science and Engineering,
Jadavpur University, Kolkata- 700032
________________________________________________
(Prof. Niladri Chakraborty)
Dean,
Faculty of Engineering and Technology,
Jadavpur University, Kolkata- 700032
FACULTY OF ENGINEERING & TECHNOLOGY
JADAVPUR UNIVERSITY
KOLKATA-700032
CERTIFICATE OF APPROVAL
2. ____________________________
(Signature of Examiners)
Declaration of Originality and Compliance of
Academic Ethics
I hereby declare that this thesis contains literature survey and original research
work by the undersigned candidate, as part of his “Multiword Expressions”
studies.
I also declare that, as required by these rules and conduct, I have fully cited and
referenced all material and results that are not original to this work.
My lovely
Parents
Who show me this world.
Acknowledgement
10 Conclusion
10.1 Summary and Findings of this Thesis
10.2 Future Road Map
10.2.1 Future Research on MWEs
10.2.2 Future Research on the Applications of MWEs
Multiword Expressions
Abstract
In natural languages, words can occur in single units called simplex words or in a group of
simplex words that function as a single unit, called Multiword Expressions (MWEs). Although
MWEs are similar to simplex words in their syntax and semantics, they pose their own sets of
challenges. MWEs are arguably one of the biggest roadblocks in computational linguistics due to
their high productivity and due to the bewildering range of syntactic, semantic, pragmatic and
statistical idiomaticity they are associated with. In addition, the large numbers in which they
occur in a text demand specialized handling. Moreover, dealing with MWEs has a broad range of
applications, from syntactic disambiguation to semantic analysis in Natural Language Processing
(NLP).
In this research, the main goal is to use computational techniques to shed light on the
underlying linguistic processes that generate MWEs across constructions and languages; to
generalize existing techniques by abstracting away from individual MWE types; and finally to
exemplify the utility of MWE interpretation within general NLP tasks such as Machine
Translation, Authorship identification and Stylometry Analysis.
In this thesis, we mainly target Bengali MWEs and additionally consider English
MWEs as a parallel line of work. In particular, we focus on reduplicated phrases,
noun compounds (NCs), verbal phrases (VPs) (complex predicates: compound verbs and
conjunct verbs) in Bengali along with some other classes (Adjective-noun, verb-subject and
verb-object combinations) of English MWEs due to their high productivity and frequency.
Beyond the resource constraints and the lack of sophisticated processing tools for the Bengali language, the
challenges in dealing with the above-mentioned phrases are manifold. For NCs, the challenges are:
(1) identifying them from the corpus; (2) interpreting the semantic relation (SR) that represents
the underlying connection between the head noun and modifier(s); (3) resolving syntactic
ambiguity in NCs comprising three or more terms; and (4) analyzing the impact of word sense on
noun compound interpretation. Our basic approach is to identify Bengali noun-noun (N-N)
bigram MWEs from the Bengali corpus using simple statistical approaches (Chapter 7). We also deal
with the reduplicated phrases and try to explore the semantics using some traditional resources
like the English WordNet and Bengali monolingual and bilingual dictionaries (Chapter 6). Finally,
the identification task has been modified beyond the conventional treatment of MWEs because of the
insufficiency of resources (e.g. corpora, WordNet) for Bengali. This idea is strongly
motivated by the traditional definition of an MWE, namely that the semantics of the composite phrase is
unlikely to be predictable from the meanings of its parts. We call this the ‘Semantic
Clustering Approach’ (Chapter 7). We have also proposed taxonomies for both
NCs and reduplications in Bengali based on linguistic evidence (Chapter 7 and
Chapter 6).
We also took part in two shared tasks that required handling large amounts of data within a short,
fixed time frame. The first was ‘Semantic Evaluation (SemEval
2010): Task 5 – Automatic Keyphrase Extraction from Scientific Articles’, organized as part of
the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010). The
training and test data came from the computer science domain and
were unformatted and noisy. For every article, we had to identify the first fifteen relevant keyphrases
along with their stemmed forms. We tackled two major issues in this regard: candidate selection
and feature engineering. To develop an efficient candidate selection method, we first take a
supervised approach and analyze various properties of keyphrases that can serve as
features of a CRF-based system, such as term frequency, inverse term frequency, collective frequency,
length, position, part of speech, chunk and dependency, and collect the outputs of the CRF as
candidates. Second, we re-examine the existing features broadly and capture the nature and
variation of keyphrases using regular expressions (Chapter 5).
The second challenge was posed by the shared task ‘Distributional Semantics
and Compositionality (DISCo)’, held in conjunction with ACL 2011 (Chapter 8). This task focuses on
identifying the compositionality of three types of English phrases (adjective-noun, verb-object
and verb-subject combinations) in a large corpus. We gathered statistical evidence
for each phrase using traditional statistical measures for MWEs such as frequency, Point-wise
Mutual Information (PMI), T-score, Chi-square coefficient and perplexity (root and surface
level). Each feature is cross-validated to check its individual impact, and finally the features are
aggregated using average and weighted combination methods.
This thesis draws its conclusions showing some of the impacts of MWEs in Natural
Language Processing applications. We have experimented with MWEs in two major application
domains: (i) Stylometry and Authorship Identification (Chapter 9.1) and (ii) Textual alignment
and Machine Translation (Chapter 9.2). To the best of our knowledge, the stylometry experiment is the first
of its kind for any Indian language. This domain is challenging because we
want to analyze the impact of MWEs on a writer’s style and to identify the other
influencing factors in Bengali writing by which the system may be able to identify the
prospective author. We have experimented with two kinds of approaches: (i) rule-based and (ii) machine
learning. Second, we have shown the impact of MWEs on textual alignment. Treating MWEs on the source and
target sides, and considering them as single tokens on either or both sides, has
led to an increase in the BLEU score of a phrase-based Statistical Machine
Translation (SMT) system for English-Bengali translation.
Finally, we conclude the thesis with a chapter-by-chapter summary and outline of the
findings of our work, suggestions for potential NLP applications and a presentation of further
research directions (Chapter 10).
List of Tables
2.1 Examples of statistical idiomaticity (“+” = strong lexical affinity, “?” =
marginal lexical affinity, “−” = negative lexical affinity) (Cruse 1986)
2.2 Classification of MWEs in terms of different forms of idiomaticity
2.3 Classification of the compositionality of VPCs (Bannard et al.
2003 vs. McCarthy et al. 2003)
2.4 A semantic classification of D-PPs
and ConjVs from EILMT Travel and Tourism Corpus
7.12 Recall, Precision and F-Score of the system for acquiring the CompVs
and ConjVs from Rabindra Rachanabali corpus
List of Figures
1.1 The semantic classification task
1.2 The semantic interpretation task
3.1 Distributional semantics of the German idiom den Löffel abgeben (Katz
and Giesbrecht 2006)
Lexemes are the basic units of natural (i.e., human) language. In a sentence, they combine
and interact to form structures and meaning. Lexemes can occur as single units called
simplex words, which are the smallest lexical units that carry meaning, or as multiple simplex
words that function as a single lexical unit, called Multiword Expressions (MWEs). (1.1) – (1.4)
show a number of MWEs (in bold) in context.
(1.1) The marketing manager can learn how to take advantage of the growing
database...
(1.2) Most of the time it failed to make it out of the pit lane...
(1.4) You should also make a note of the serial number of your television video...
Both simplex words and MWEs function as structural and conceptual units of language.
However, MWEs often require deeper syntactic and semantic reasoning due to subtle
interactions with the syntax and semantics of their component simplex words, or alternatively
behavior which is completely at odds with their parts. In the following examples, the relationship
between the MWEs and their component simplex words is relatively transparent.
In (1.5)–(1.7), the MWEs are relatively easy to detect as their components occur
continuously. The semantics of the MWEs in these examples is also predictable. The meaning of
bus driver as “one who drives a bus” is easily accessible despite bus having meanings including
“an electrical conductor that makes a common connection between several circuits” and “a car
that is old and unreliable”, and driver having meanings including “a golfer who hits the golf ball
with a driver” and “a program that determines how a computer will communicate with a
peripheral device”1. The process for disambiguating the semantics in context here is identical to
that of determining the word sense, e.g., from among its many senses, based on analysis of the
combinatory interaction between possible word senses of the lexemes in the sentence. However,
while at this simplistic level MWEs are similar to simplex words in terms of their function within
a sentence, they pose bigger challenges due to their syntactically and semantically unexpected
behavior (Sag et al. 2002). (1.8) – (1.10) show more complicated MWEs where knowledge of
the components alone is insufficient to predict the observed linguistic behavior.
(1.8) – (1.10) are MWE examples which are hard to recognize as a single unit due to their
length or the fact that they are discontinuous. For example, although take out is an MWE, it is
not immediately apparent that (1.8) includes a token instance of it since out is separated from the
verb take. Also, due to the internal modification by long, take a bath is not easily recognizable
as a unit, or analogously, it is not immediately apparent that long is not a component of the
MWE. In addition, MWEs are often confused with non-MWEs, e.g. the MWE vs. non-MWE
usages of put on in “put the coat on” vs. “put the coat on the table”, respectively. As a result of
such variations in the context of usage of MWEs, it is sometimes difficult to distinguish MWEs
from compositional usages of the individual simplex words. Though often understated,
understanding and processing language are overwhelmingly difficult without the means to
syntactically recognize MWEs. MWEs are problematic semantically as well. The meaning(s) of
an MWE cannot always be directly predicted from its component words. The contribution of the
MWE components to its semantics can vary widely from no contribution from any single word
1 Glosses taken from WordNet 2.1.
Multiword Expressions
Chapter 1: Introduction 3
(e.g., kick the bucket), to a single component making the most significant contribution (e.g.,
finish up), to all words in an MWE contributing equally (e.g., bus driver). The meaning of kick
the bucket as an MWE is “pass from physical life and lose all bodily attributes and functions
necessary to sustain life”. However, neither kick nor bucket carries this meaning.
Hence, estimating the exact meaning of kick the bucket from its parts is futile. It is also
impossible to estimate the meaning of by and large as ‘mostly’ from the components by and
large. Hence, semantically, some MWEs need a different treatment compared to simplex words.
The number of MWEs is estimated to be of the same order of magnitude as the number of
simplex words in a speaker’s lexicon (Wood 1964; Gates 1988; Jackendoff 1973). To add to this,
new types of MWE are continuously created as languages evolve (e.g. shock and awe, cell
phone, ring tone) (Dias et al. 1999). Regionally, MWEs vary considerably. For example, take
away and take out have an identical meaning in the context of fast food outlets, but the former is
the preferred expression in Australian English, while the latter is the preferred expression in
American English. Another example is mail box and post box in the context of postal service,
where the former is the preferred form in American English and the latter is the preferred form in
Australian English. MWEs can also be used to represent information concisely (Levi 1978). For
example, winter school is a compact way of expressing “a school which is held in the winter”.
MWEs can also lend emphasis to language (Brinton 1985; Side 1990). For example, up in finish
up the food adds the meaning of “completion”. That is, finish up has the meaning of finish, but it
also contains the entailment that the food is completely consumed and emphasizes the
completeness of the eating action.
There is a modest body of research on modeling MWEs which has been integrated into NLP
applications, e.g., for the purpose of fluency, robustness or better understanding of natural
language. Understanding MWEs has broad utility in tasks ranging from syntactic disambiguation
to conceptual (semantic) comprehension. Explicit lexicalized MWE data helps to simplify the
syntactic structure of sentences that include MWEs, and conversely, a lack of MWE lexical items
in a precision grammar is a significant source of parse errors (Baldwin et al. 2004). Additionally,
it has been shown that accurate recognition of MWEs influences the accuracy of semantic
tagging (Piao et al. 2003), and word alignment in machine translation (MT) can be improved
through a specific handling of the syntax and semantics of MWEs (Venkatapathy and Joshi
2006).
Syntactically, one of the major issues with MWEs is recognition, due to idiomatic and
syntactically-flexible expressions. MWEs are often found in the form of semi- or non-fixed
expressions. The components often inflect for number or tense (e.g., family cars, The plane has
taken off). The occurrence of the components also varies with context. For example, modifiers
can internally modify the components of MWEs (e.g. make a big mistake).
Semantically also, MWEs can cause difficulties for comprehension. MWEs can be
semantically idiomatic, i.e., the meaning can be explicitly or implicitly derived from the
components of MWEs or be completely unrelated to the semantics of the parts. It is also
relatively common for the components of MWEs to combine compositionally to form competing
analyses. For example, a piece of cake can be an MWE with meaning “any undertaking that is
easy to do”, or alternatively it can be a simple compositional expression referring to a portion of
cake. Moreover, MWEs are highly productive and their components are often used to generate
novel MWEs. The verb take, for example, combines with a number of prepositions to form
verb-particle constructions including take away, take off and take up, each of which has distinctive
semantics.
To add to these difficulties, MWEs occur in a bewildering array of syntactic and semantic
types which are interrelated to varying degrees, such that neither is it possible to come up with a
genuinely general-purpose analysis of all MWEs, nor is it adequate to try to document each
individual MWE type independently. For example, while syntactically identifying instances of
noun compounds such as paper submission and chocolate bar is relatively easy, it is much harder
with other types of MWEs such as in one’s shoes and break the ice. Semantically, predicting the
meaning of MWEs is relatively easy with some types of MWEs such as take a walk and make a
note (of ), whereas with other MWEs such as make out and kick the bucket it is considerably
more difficult.
MWEs pose significant challenges for NLP, and developing a framework for modeling
MWEs both syntactically and semantically is vital to the furtherance of NLP.
(Venkatapathy and Joshi 2006). Depending on the type of MWE, the relative importance of these
syntactic and semantic tasks varies. For example, with noun compounds, the identification and
extraction tasks are relatively trivial, whereas interpretation is considerably more difficult.
Prior to detailing the computational tasks relating to MWEs, let us briefly define a number of
MWE types which will recur in later discussions. Full details of the MWE types are described in
Section 2.1.3. A noun compound (NC, e.g., golf club or paper submission) is a compound noun
made up of two or more nouns. A verb-particle construction (VPC, e.g., hand over or battle on)
is a verbal MWE made up of a verb and obligatory particle(s). A light-verb construction (LVC,
e.g. take a walk or make a mistake) is a verbal MWE made up of a verb and (usually indefinite
singular) object NP, where the verb has bleached semantics and the noun complement
determines the semantics of the MWE to a large degree. A determinerless prepositional phrase
(D-PP, e.g. at school or on air) is an adverbial MWE made up of a preposition and a singular
noun without a determiner. Finally, an idiom (e.g., kick the bucket or take a turn for the worse) is
an amalgam of words in a construction other than those explicitly identified above, which has
different semantics to that of the combination of the individual components.
In the following sections, we discuss the primary research issues relating to MWEs, and prior
work done in each area. In doing so, we offer our perspective on why these issues continue to
pose a challenge for NLP.
1.2.1 Identification
Identification is the task of determining individual occurrences of MWEs in running text. The
task is at the token (instance) level, such that we may identify 50 distinct occurrences of pick up
in a given corpus. To give an example of an identification task, given the corpus fragment in (2)
(taken from “The Frog Prince”, a children’s story), we might identify the MWEs in (2):
(2) One fine evening a young princess put on her bonnet and clogs, and went out to take
a walk by herself in a wood; ... she ran to pick it up; ...
sentence ‘Kim signed in the room’, there is ambiguity between a VPC interpretation (sign in =
“check in/announce arrival”) and an intransitive verb + PP interpretation (“Kim performed the
act of signing in the room”).
MWE identification has tended to take the form of customized methods for particular MWE
construction types and languages (e.g. English VPCs, LVCs), but attempts have been made to
develop generalized techniques, as outlined below. Perhaps the most obvious method of
identifying MWEs is via a part-of-speech (POS) tagger, chunker or parser, in the case that lexical
information required to identify MWEs is contained within the parser output. For example, in the
case of VPCs, there is a dedicated tag for (prepositional) particles in the Penn POS tagset, such
that VPC identification can be performed simply by POS tagging a text, identifying all particle
tags, and further identifying the head verb associated with each particle (e.g. by looking left for
the first main verb, within a word window of fixed size) (Baldwin and Villavicencio 2002;
Baldwin 2005a). Similarly, a chunker or phrase structure parser can be used to identify
constructions such as noun compounds or VPCs (McCarthy, Keller, and Carroll 2003; Lapata
and Lascarides 2003). This style of approach is generally not able to distinguish MWE and literal
usages of a given word combination, however, as they are not differentiated in their surface
syntax. Deep parsers which have lexical entries for MWEs and disambiguate to the level of
lexical items are able to make this distinction, however, via supertagging or full parsing
(Baldwin et al. 2004).
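The particle-tag heuristic described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the cited work: the function name `find_vpcs`, the window size and the hand-tagged example sentence are assumptions, and any Penn-style verb tag (VB*) is accepted as the head verb rather than the "first main verb" proper.

```python
from typing import List, Tuple

# Penn Treebank tags: RP marks particles, VB* marks verb forms.
PARTICLE_TAG = "RP"

def find_vpcs(tagged: List[Tuple[str, str]], window: int = 4) -> List[Tuple[str, str]]:
    """Return (verb, particle) pairs found by scanning left from each particle."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != PARTICLE_TAG:
            continue
        # Look left for the nearest verb within a fixed-size word window.
        for j in range(i - 1, max(i - 1 - window, -1), -1):
            if tagged[j][1].startswith("VB"):
                pairs.append((tagged[j][0].lower(), word.lower()))
                break
    return pairs

# Hand-assigned tags (an assumption; a real system would run a POS tagger).
sentence = [("she", "PRP"), ("ran", "VBD"), ("to", "TO"),
            ("pick", "VB"), ("it", "PRP"), ("up", "RP")]
print(find_vpcs(sentence))
```

As the surrounding text notes, a sketch of this kind cannot separate idiomatic from literal usages; it only pairs each tagged particle with a nearby verb.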
Another general approach to MWE identification is to treat literal and MWE usages as
different senses of a given word combination. This then allows for the application of word sense
disambiguation (WSD) techniques to the identification problem. As with WSD research, both
supervised (Patrick and Fletcher 2005) and unsupervised (Birke and Sarkar 2006; Katz and
Giesbrecht 2006; Sporleder and Li 2009) approaches have been applied to the identification task.
The key assumption in unsupervised approaches has been that literal usages will be contextually
similar to simplex usages of the component words (e.g. kick and bucket in the case of kick the
bucket). Mirroring the findings from WSD research, supervised methods tend to be more
accurate, but have the obvious drawback that they require large numbers of annotated literal and
idiomatic instances of a given MWE to work. Unsupervised techniques are therefore more
generally applicable.
1.2.2 Extraction
MWE extraction is a type-level task, wherein the MWE lexical items attested in a
predetermined corpus are extracted out into a lexicon or other lexical listing. For example, with a
given verb take and preposition off, we wish to know whether the two words combine together to
form a VPC (i.e. take off) in a given corpus. This contrasts with MWE identification, where the
focus is on individual token instances of MWEs, although obviously extraction can be seen to be
a natural consequence of identification (in compiling out the list of those attested MWEs). The
underlying assumption in MWE extraction is that there is evidence in the given corpus for each
extracted MWE to form an MWE in some context, without making any claims about whether
there also exist simple compositional combinations of those same words. The motivation for
MWE extraction is generally lexicon development or expansion, e.g., in recognizing newly-
formed MWEs (e.g. ring tone or shock and awe) or domain-specific MWEs (e.g. bus speed or
boot up in an IT domain). In general, MWE extraction pulls MWEs out of context as standalone
lexical items, although this generally involves analysis of the context of a given combination of
words. However, as stated above, extraction often takes advantage of the results of MWE
identification. For example, Baldwin (2005a) extracted English VPCs based on identifying VPC
candidates using resources including a parser and chunker. Extracting MWEs is relevant to any
lexically-driven application, such as grammar development or information extraction. In
addition, it is particularly important for productive MWEs or domains that have distinctive MWE
content. MWE extraction is as difficult as MWE identification in terms of syntactic flexibility
and ambiguity. The bulk of research on MWE extraction has focused on extracting English verb-
particle constructions, light-verb constructions and idioms (Baldwin and Villavicencio 2002).
Despite a healthy body of research on MWE extraction, however, the results have not been as
compelling as for MWE identification. Baldwin (2005a) achieved high accuracy on an English
VPC extraction task, whereas others such as verb-noun pair extraction (Venkatapathy and Joshi
2005; Fazly and Stevenson 2007) still have considerable room for improvement. Part of the
complexity here is that the target lexical resource for the MWE extraction often introduces its
own constraints or requirements for extra lexical properties.
The motivation for MWE extraction is generally lexicon development and expansion, e.g.,
recognizing newly-formed MWEs (e.g., ring tone or shock and awe) or domain-specific MWEs.
Extracting MWEs is relevant to any lexically-driven application, such as grammar engineering or
information extraction. Depending on the particular application, it may be necessary to
additionally predict lexical properties of a given MWE, e.g., its syntactic or semantic class. In
addition, it is particularly important for productive MWEs or domains which are rich in technical
terms (e.g., bus speed or boot up in the IT domain). There has been a strong focus on the
development of general-purpose techniques for MWE extraction, particularly in the guise of
collocation extraction. The dominating view here is that extraction can be carried out via
association measures such as Point-wise Mutual Information (PMI) or the T-test, based on
analysis of the frequency of occurrence of a given word combination, often in comparison with
the frequency of occurrence of the component words (Church and Hanks 1989; Smadja 1993;
Frantzi, Ananiadou, and Mima 2000). Association measures provide a score for each word
combination, which forms the basis of a ranking of MWE candidates. Final extraction, therefore,
consists of determining an appropriate cut-off in the ranking, although evaluation is often carried
out over the full ranking. Collocation extraction techniques have been applied to a wide range of
extraction tasks over a number of languages, with the general finding that it is often
unpredictable which association measure will work best for a given task. As a result, recent
research has focused on building supervised classifiers to combine the predictions of a number of
association measures, and has shown that this leads to consistently better results than any single
association measure (Pecina 2008). It has also been shown that this style of approach works most
effectively when combined with POS tagging or parsing, and strict filters on the type of MWE
that is being extracted (e.g., adjective–noun or verb–noun: Justeson and Katz (1995)). It is worth
noting that association measures have generally been applied to continuous word n-grams, or less
frequently, pre-determined dependency types in the output of a parser. Additionally, collocation
extraction techniques tend to require a reasonable number of token occurrences of a given word
combination to operate reliably, which we cannot always assume (Fazly 2007).
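To make the association-measure idea concrete, the following minimal sketch scores adjacent word pairs with PMI and a t-score and ranks them. It rests on several assumptions that are not part of the cited work: the function name `association_scores`, the use of the token count as the sample size for both unigrams and bigrams, the common approximation t = (O − E) / √O, and an invented toy token sequence.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def association_scores(tokens: List[str], min_count: int = 2) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return {(w1, w2): (pmi, t_score)} for adjacent word pairs."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        if observed < min_count:  # rare pairs give unreliable scores
            continue
        # Count expected if w1 and w2 occurred independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (pmi, t_score)
    return scores

# Invented toy data, for illustration only.
tokens = "ring tone is loud and the ring tone is new".split()
scores = association_scores(tokens)
# Rank candidates by PMI; a cut-off on this ranking yields the extracted list.
ranking = sorted(scores, key=lambda pair: scores[pair][0], reverse=True)
print(ranking)
```

The `min_count` filter mirrors the frequency thresholds discussed in the text: below a handful of occurrences, both measures become too noisy to rank candidates meaningfully.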
A second approach to MWE extraction, targeted specifically at semantically and statistically
idiomatic MWEs, is to extend the general association measure approach to include substitution
(Lin 1999; Schone and Jurafsky 2001; Pearce 2001). For example, in assessing the idiomaticity
of red tape, explicit comparison is made with lexically-related candidates generated by
component word substitution, such as yellow tape or red strip. Common approaches to
determining substitution candidates for a given component word are (near-) synonymy—e.g.
based on resources such as WordNet—and distributional similarity. Substitution can also be used
to generate MWE candidates, and then check for their occurrence in corpus data. For example, if
clear up is a known (compositional) VPC, it is reasonable to expect that VPCs such as
clean/tidy/unclutter/... up are also VPCs (Villavicencio 2005). That is not to say that all of these
occur as MWEs, so an additional check for corpus attestation is usually used in this style of
approach.
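The substitution test can be illustrated with a small sketch. It is not the procedure of any of the cited papers: the helper names, the frequency-ratio criterion and its threshold, the substitute lists (which in practice might come from WordNet synonyms or distributional similarity) and all counts are assumptions made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def substitution_variants(pair: Pair, substitutes: Dict[str, List[str]]) -> List[Pair]:
    """Generate candidates by replacing one component word at a time."""
    w1, w2 = pair
    variants = [(s, w2) for s in substitutes.get(w1, [])]
    variants += [(w1, s) for s in substitutes.get(w2, [])]
    return variants

def resists_substitution(pair: Pair, counts: Counter,
                         substitutes: Dict[str, List[str]], ratio: float = 5.0) -> bool:
    """True if the pair is at least `ratio` times as frequent as every variant."""
    base = counts[pair]
    if base == 0:
        return False
    return all(base >= ratio * counts[v]
               for v in substitution_variants(pair, substitutes))

# Invented counts and substitute lists, for illustration only.
counts = Counter({("red", "tape"): 40, ("yellow", "tape"): 3, ("red", "strip"): 2,
                  ("big", "house"): 30, ("large", "house"): 25})
substitutes = {"red": ["yellow"], "tape": ["strip"], "big": ["large"]}

print(resists_substitution(("red", "tape"), counts, substitutes))
print(resists_substitution(("big", "house"), counts, substitutes))
```

In this toy setting red tape is far more frequent than its substituted variants and is flagged as a candidate, whereas big house is not, since large house is almost as frequent.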
A third approach, also targeted at semantically idiomatic MWEs, is to analyze the relative
similarity between the context of use of a given word combination and its component words
(Schone and Jurafsky 2001; Stevenson, Fazly, and North 2004; Widdows and Dorow 2005).
Similar to the unsupervised WSD-style approach to MWE identification, the underlying
hypothesis is that semantically idiomatic MWEs will occur in markedly different lexical contexts
to their component words. A bag of words representation is commonly used to model the
combined lexical context of all usages of a given word or word combination. By interpreting this
context model as a vector, it is possible to compare lexical contexts, e.g., via simple cosine-
similarity (Widdows 2005). In order to reduce the effects of data sparseness, dimensionality
reduction is often carried out over the word space prior to comparison. The same approach has
also been applied to extract LVCs, based on the assumption that the noun complements in LVCs
are often deverbal (e.g. bath, proposal, walk), and that the distribution of nouns in PPs post-
modifying noun complements in genuine LVCs (e.g. (make a) proposal of marriage) will be
similar to that of the object of the underlying verb (e.g. propose marriage) (Grefenstette and
Teufel 1995). Here, therefore, the assumption is that LVCs will be distributionally similar to the
base verb form of the noun complement, whereas with the original extraction method, the
assumption was that semantically idiomatic MWEs are dissimilar to their component words.
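The context-comparison idea can be sketched with plain bag-of-words vectors and cosine similarity. The sketch below makes several simplifying assumptions: the function names are hypothetical, the corpus is an invented toy example with lemmatized tokens, no dimensionality reduction is applied, and occurrences of a component word inside the MWE itself are not excluded from that word's context vector.

```python
import math
from collections import Counter
from typing import List

def context_vector(sentences: List[List[str]], target: List[str], window: int = 3) -> Counter:
    """Bag of words within `window` tokens of each occurrence of `target`."""
    vec = Counter()
    k = len(target)
    for sent in sentences:
        for i in range(len(sent) - k + 1):
            if sent[i:i + k] == target:
                vec.update(sent[max(0, i - window):i])   # left context
                vec.update(sent[i + k:i + k + window])   # right context
    return vec

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented, lemmatized toy corpus, for illustration only.
corpus = [
    "the old man may kick the bucket soon".split(),
    "the old man may die soon".split(),
    "children kick the ball outside".split(),
]
idiom = context_vector(corpus, ["kick", "the", "bucket"])
kick = context_vector(corpus, ["kick"])
die = context_vector(corpus, ["die"])
print(cosine(idiom, kick), cosine(idiom, die))
```

In this toy corpus the contexts of kick the bucket are closer to those of die than to those of kick, which is the kind of dissimilarity to the component words that the approach treats as evidence of semantic idiomaticity.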
A fourth approach is to perform extraction on the basis of implicit identification. That is,
(possibly noisy) token-level statistics can be fed into a type-level classifier to predict whether
there have been genuine instances of a given MWE in the corpus. An example of this style of
approach is to use POS taggers, chunkers and parsers to identify English VPCs in different
syntactic configurations, and feed the predictions of the various preprocessors into the final
extraction classifier. Alternatively, a parser can be used to identify PPs with singular nouns, and
semantically idiomatic D-PPs can be extracted from them based on distributional (dis)similarity
of occurrences with and without determiners across a range of prepositions (van der Beek 2005).
A fifth approach is to use syntactic fixedness as a means of extracting MWEs, based on the
assumption that semantically idiomatic MWEs undergo syntactic variation (e.g. passivization or
internal modification) less readily than simple verb–noun combinations (Bannard 2007; Fazly,
Cook, and Stevenson 2009).
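One way to quantify syntactic fixedness is to compare how a given verb–noun pair distributes over syntactic patterns with how verb–noun pairs distribute in general. The sketch below uses a smoothed Kullback-Leibler divergence for this comparison; it is loosely inspired by the idea described above and is not the formulation of the cited papers. The pattern labels, the smoothing constant, the function names and all counts are assumptions.

```python
import math
from typing import Dict, Iterable

def smoothed(counts: Dict[str, int], patterns: Iterable[str], alpha: float = 0.5) -> Dict[str, float]:
    """Additively smoothed probability of each syntactic pattern."""
    patterns = list(patterns)
    total = sum(counts.get(p, 0) for p in patterns) + alpha * len(patterns)
    return {p: (counts.get(p, 0) + alpha) / total for p in patterns}

def fixedness(pair_counts: Dict[str, int], background_counts: Dict[str, int],
              alpha: float = 0.5) -> float:
    """KL divergence (in bits) of the pair's pattern distribution from the background."""
    patterns = sorted(set(pair_counts) | set(background_counts))
    p = smoothed(pair_counts, patterns, alpha)
    q = smoothed(background_counts, patterns, alpha)
    return sum(p[k] * math.log2(p[k] / q[k]) for k in patterns)

# Invented pattern counts, for illustration only.
background = {"active": 600, "passive": 200, "modified": 200}
idiom_like = {"active": 98, "passive": 1, "modified": 1}
literal_like = {"active": 60, "passive": 20, "modified": 20}

print(fixedness(idiom_like, background), fixedness(literal_like, background))
```

A pair that almost never passivizes or takes internal modification diverges strongly from the background distribution and receives a high score, while a pair that varies as freely as verb–noun combinations in general scores close to zero.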
In addition to general-purpose extraction techniques, linguistic properties of particular MWE
construction types have been used in extraction. For example, the fact that a given verb–
preposition combination occurs as a noun or adjective (e.g. takeoff, clip-on) is a strong predictor of the fact
that the combination is occurring as a VPC. One bottleneck in MWE extraction is the token
frequency of the MWE candidate. With a few notable exceptions (e.g. (Baldwin 2005a; Fazly,
Cook, and Stevenson 2009)), MWE research has tended to ignore low-frequency MWEs, e.g., by
applying a method only to word combinations which occur at least N times in a corpus.
constructions (Venkatapathy and Joshi 2005; Piao et al. 2006; Kim and Baldwin 2007a).
Recently, MWE compositionality has been studied not only to detect or measure the degree of
the compositionality, but also to utilize this in NLP applications. Venkatapathy and Joshi (2006)
successfully showed the utility of MWE compositionality in a word alignment task between
English and Hindi. However, since the task setup was supervised, large amounts of training data
were necessary. There is a gap in the research literature on measuring the degree of MWE
compositionality and also on the utility of compositionality in NLP applications.
Figure 1.1: The semantic classification task (senses shown for take off: departure, rise, send up, parody)
In Figure 1.1, the target is to determine the semantics of a given MWE. Often the meaning of
the components is employed to specify the semantics of the whole. Hence, compositionality is a
very useful clue in estimating the meaning of compositional MWEs. In our example, the target is
to determine the different senses of take off (i.e., “departure, rise, send up, parody”). This can be
performed based on individual analysis of take and off to some degree. WordNet is commonly
used as a sense inventory for semantic classification tasks, although there are instances of
user-defined sense inventories (e.g., particle semantics in Bannard (2003) and Cook and Stevenson
(2006)).
Semantic classification in the context of MWEs is non-trivial due to the varying degrees of
opacity in MWEs. The contribution of the individual components can vary (e.g., eat up and start
over, where the verb is the primary determinant of the semantics). Sometimes none of the parts
contribute to the semantics of the MWE (i.e., in fully non-compositional VPCs such as make
out). Prior work related to the semantic classification of MWEs has been undertaken from both
the linguistic and computational perspectives (Fraser 1976; Bame 1999; Gries 1999; Bannard
2003; O’Hara and Wiebe 2003; Patrick and Fletcher 2004; Cook and Stevenson 2006). Most of
the research on the semantic classification of MWEs has focused on English VPCs. The
relatedness between semantic classification and measuring the compositionality of MWEs is not
well understood, warranting further study.
In Figure 1.2, the target is to interpret the semantic relation between the components. For
example, apple pie can be interpreted as “pie made from apple”. The semantic relation between
apple and pie is specified as MAKE, where the head noun is made from the modifier.
Semantic relations (or associations) are most commonly used to interpret noun compounds
and determiner-less prepositional phrases. The semantic relation used to interpret a given MWE
varies with the components. For example, the semantic relation in morning juice is ‘TIME’
(“juice in the morning”) whereas that in orange juice is ‘MAKE’ (“juice made from orange(s)”).
Another example with D-PPs is by car/bus/plane/..., where a mode of transportation combined
with the method/manner preposition by leads to the semantic relation ‘MANNER’, whereas other
nouns such as day lead to specific temporal interpretations. The majority of past research on
semantic interpretation has focused on interpreting noun compounds (Vanderwende 1994;
Copestake and Lascarides 1997; Lapata 2002; Moldovan et al. 2004; Kim and Baldwin 2005;
Nastase et al. 2006) and D-PPs (van der Beek 2005; Baldwin et al. 2006). This research,
particularly that on NC interpretation, has been suggested to be relevant to NLP applications
such as QA and IR (Moldovan et al. 2004), although there is no definitive empirical evidence to
support this claim.
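The idea of interpreting a noun compound through a semantic relation can be sketched as a small lookup. The example below covers only the two relations and three compounds named in the text; the paraphrase templates, the dictionary layout and the function name `interpret` are assumptions made for illustration, not an established relation inventory.

```python
# Paraphrase templates for the two relations named in the text.
# The template wording is an assumption made for this sketch.
TEMPLATES = {
    "MAKE": "{head} made from {modifier}",
    "TIME": "{head} in the {modifier}",
}

# Hand-assigned relations for the noun compounds discussed above.
RELATIONS = {
    ("apple", "pie"): "MAKE",
    ("orange", "juice"): "MAKE",
    ("morning", "juice"): "TIME",
}

def interpret(modifier, head):
    """Return (relation, paraphrase) for a known modifier-head noun compound."""
    relation = RELATIONS[(modifier, head)]
    return relation, TEMPLATES[relation].format(head=head, modifier=modifier)

examples = [interpret(m, h) for (m, h) in RELATIONS]
```

The same head (juice) receives different relations depending on the modifier, which is why the lookup is keyed on the pair rather than on the head alone.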
In all prior work, however, a major difficulty in semantic interpretation has been the design
of a standard set of semantic relations with which to perform the interpretation. For interpreting
noun compounds, scalability and portability to novel domains/NC types are questionable, as
methods make specific assumptions about the domain or range of NC interpretation. The current
level of accuracy of NC interpretation over open-domain data is not high enough to utilize the
acquired data in NLP applications. Also, the lack of agreement on the semantic relations used for
MWE interpretation makes it hard to incorporate NC interpretation into applications.
Another point is that much of the work on semantic interpretation is based on supervised
methods, which raises questions about the amount of training data required and the most
effective learning algorithm for a particular method or set of semantic relations.
2
More generally, for an n-item noun compound, the number of possible interpretations is defined by the Catalan
number: $C_n = \frac{1}{n+1}\binom{2n}{n}$
method in which bilingual MWEs were used to modify the word alignment so as to improve
SMT quality. In their work, each bilingual MWE in the training corpus was grouped as one unique
token before training the alignment models. They reported that both alignment quality and
translation accuracy improved on a small corpus. However, in a further study, they
reported lower BLEU scores after grouping MWEs based on part-of-speech on a large
corpus (Lambert and Banchs, 2006). Nonetheless, since MWEs represent linguistic knowledge,
their role in full-scale SMT is intuitively expected to be positive. The difficulty lies in
how to integrate bilingual MWEs into an existing SMT system to improve SMT performance,
especially when translating domain-specific texts.
3
http://ltrc.iiit.ac.in/analyzer/bengali
Chapter 2
The Linguistics of Multiword Expressions
Multiword expressions can be syntactically and semantically categorized into various types,
including noun compounds and idioms. Each type of MWE has distinctive linguistic features
which we will describe in Section 2.1.1. Due to these differences, for distinct MWEs, we have
specific objectives for knowledge acquisition and different obstacles to overcome. For example,
interpreting semantic relations in noun compounds is a hard task while extracting or identifying
them is relatively trivial. On the other hand, extracting or identifying verb-particle constructions
is challenging since there is often ambiguity with a verb-PP analysis. Also, measuring
compositionality is an important task for VPCs as there is a more uniform distribution of VPCs
across the spectrum of compositionality, whereas it is less of an issue for noun compounds as
they are mostly compositional.¹ In this chapter, we will survey the linguistics of the major types
of English MWEs.
1
That is, noun compound types are mostly compositional; noun compound tokens are arguably not.
• Idiomaticity
are anomalous in other contexts (e.g., good morning when finishing a meal, or all aboard when
watching a soccer match). Statistical idiomaticity occurs with MWEs such as black and white
where they occur with uncommonly high frequency in contrast to alternative forms of the same
expression. It is perfectly acceptable to say white and black, but the skew towards the first form
is sufficiently great that white and black photograph, e.g., is marked in English.
• Lexical Idiomaticity
Lexical idiomaticity occurs when one or more components of an MWE are not part of the
conventional English lexicon. For example, ad hoc is lexically marked in that neither of its
components (ad and hoc) is a standalone English word. Lexical idiomaticity inevitably results in
syntactic and semantic idiomaticity because there is no lexical knowledge associated directly
with the parts from which to predict the behavior of the MWE. As such, it is one of the most
clear-cut and predictive properties of MWEhood.
• Syntactic Idiomaticity
Syntactic idiomaticity occurs when the syntax of the MWE is not derived directly from that
of its components (Katz and Postal 2004; Chafe 1968). For example, by and large is
syntactically idiomatic in that it is adverbial in nature, but made up of the anomalous
coordination of a preposition (by) and an adjective (large). On the other hand, take a walk is not
syntactically marked as it is a simple verb–object combination which is derived transparently
from a transitive verb (take) and a countable noun (walk). Syntactic idiomaticity can also occur
at the constructional level, in classes of MWEs having syntactic properties which are
differentiated from their component words, e.g., verb-particle constructions and determinerless
prepositional phrases, described in a later section.
Figure 2.1: Examples of syntactic non-markedness vs. markedness
• Semantic Idiomaticity
Non-identifiability (Nunberg et al. 1994) is the notion of the meaning of an MWE not being
easily predictable from the surface form (components), much like our definition of semantic
idiomaticity. For example, the meaning of kick the bucket (“die”) cannot be derived from either
kick or bucket. Another example is make out, where the parts (i.e., make and out) do not
semantically contribute to the meaning of the whole. This property relates closely to
compositionality. That is, when MWEs are compositional, the meaning of MWEs can be
predicted from the parts. Hence, non-identifiability coincides with non-compositionality (other
examples of non-identifiable and non-compositional MWEs are on ice, cock up, chicken out and
by and large).
Figuration (Fillmore et al. 1988; Nunberg et al. 1994) is an attribute of encoded expressions
such as metaphors (e.g., take the bull by the horns), metonymies (e.g., lend a hand) and
hyperboles (e.g., not worth the paper it’s printed on). It is defined as the property of the
components of an MWE having some metaphoric or hyperbolic meaning in addition to their
literal meaning. That is, the semantics of the MWE is derived from the components through a
process of metaphor, hyperbole or metonymy, although the precise nature of the figuration may
be more or less obvious. Hence, figuration involves subtle interactions between idiomatic and
literal meaning. We return to touch on the relationship between figuration and semantic
idiomaticity below.
• Pragmatic Idiomaticity
Pragmatic idiomaticity is the condition of an MWE being associated with a fixed set of
situations or a particular context (Kastovsky 1982; Jackendoff 1997; Sag, Baldwin, Bond,
Copestake, and Flickinger 2002). Good morning and all aboard are examples of pragmatic
MWEs: the first is a greeting associated specifically with mornings² and the second is a
command associated with the specific situation of a train station or dock, and the imminent
departure of a train or ship. Pragmatically idiomatic MWEs are often ambiguous with
(non-situated) literal translations; e.g., good morning can mean “pleasant morning” (cf. Kim had a
good morning).
• Statistical Idiomaticity
Statistical idiomaticity occurs when a combination of words occurs with surprising
frequency, relative to the component words or alternative phrasings of the same expression
(Pawley and Syder 1983; Cruse 1986; Sag et al. 2002). Cruse (1986:p281) provides some nice
examples of statistical idiomaticity in the matrix of adjectives and nouns presented in Table 2.1.
2
Which is not to say that it can’t be used ironically at other times of the day!
Table 2.1: Examples of statistical idiomaticity (“+” = strong lexical affinity, “?” = marginal
lexical affinity, “−” = negative lexical affinity) (Cruse 1986)
The adjectives are largely synonymous, and yet different nouns have particular preferences
for certain subsets of the adjectives as modifiers, as indicated by the cells in the matrix (“+”
indicates a strong lexical affinity, “?” indicates a marginal lexical affinity, and “−” indicates a
negative lexical affinity). Note that the statistical idiomaticity (i.e., the alternative phrasing) can
be in terms of alternative orderings of the components. For example, black and white is much
more common in English than white and black, while the reverse holds in the case of other
languages such as Japanese and Spanish (see Table 2.1). For the purposes of this thesis, we will
follow Sag et al. (2002) in referring to MWEs which are only statistically idiomatic (i.e., not also
lexico-syntactically, semantically or pragmatically idiomatic) as collocations.
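The ordering preference can be checked directly on a corpus by counting both orders of a coordinated pair. The following is a minimal sketch, assuming a pre-tokenized corpus; the function name `order_preference` and the toy text are made up for illustration, and the counts say nothing about real English usage.

```python
from collections import Counter

def order_preference(tokens, a, b, conj="and"):
    """Return (count of 'a conj b', count of 'b conj a') in a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return trigrams[(a, conj, b)], trigrams[(b, conj, a)]

# Toy text with two occurrences of one order and one of the other.
text = ("a black and white photograph , a black and white film , "
        "a white and black dress").split()
forward, reverse = order_preference(text, "black", "white")
```

On a realistically sized corpus, a large imbalance between the two counts would be the kind of skew that marks the rarer order as statistically unexpected.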
Statistical idiomaticity relates to the notion of institutionalization/conventionalization, i.e. a
particular word combination coming to be used to refer to a given object (Fernando and Flavell
1981; Bauer 1983; Nunberg et al. 1994; Sag et al. 2002). For example, traffic light is the
conventionalized descriptor for “a visual signal to control the flow of traffic at intersections”.
There is no reason why it shouldn’t instead be called a traffic director or intersection regulator,
but the simple fact of the matter is that it is not referred to using either of those expressions;
instead, traffic light was settled on as the canonical term for referring to the object. Similarly, it
is an arbitrary fact of the English language that we say many thanks and not *several thanks, and
salt and pepper in preference to pepper and salt.³
Nunberg et al. (1994) consider collocation (conventionality in their terms) to be a mandatory
property of MWEs. We consider conventionality to relate to semantic, pragmatic and statistical
idiomaticity, but consider that MWEs do not have to have any one of these three forms of
markedness (e.g., MWEs which are strictly lexico-syntactically idiomatic are classified as
MWEs in this research). Collocations are most apparent when observed in contrast with anti-
collocations. Anti-collocations are lexico-syntactic variants of collocations which have
unexpectedly low frequency (Pearce 2001). For example, pepper and salt is an anti-collocation
for salt and pepper, and traffic director is an anti-collocation for traffic light.
It is important to acknowledge that our use of the term collocation differs from the
mainstream usage in computational linguistics, where a collocation is often defined as an
arbitrary and recurrent word combination that co-occurs more often than would be expected by
chance (Choueka 1988; Lin 1998b; Evert 2004).
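The mainstream notion of a combination that co-occurs more often than would be expected by chance is often operationalised with an association measure. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs as one such measure. It is an illustration only, not the method of the works cited above, and the function name `pmi` and the toy sentence are assumptions.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information (in bits) of the adjacent pair (w1, w2)."""
    n_uni, n_bi = len(tokens), len(tokens) - 1
    if n_bi < 1:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / n_bi
    if p_pair == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p_pair / (p_w1 * p_w2))

# Toy token sequence; real estimates need a large corpus.
tokens = "pass the salt and pepper please and the salt and pepper too".split()
attested = pmi(tokens, "salt", "and")
unattested = pmi(tokens, "pepper", "salt")
```

A positive score means the pair occurs more often than its parts would predict if they were independent; a pair that never occurs receives negative infinity here.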
Above, we described five different forms of idiomaticity. We bring these together in
categorizing a selection of MWEs in Table 2.2. In Table 2.2, some examples such as kick the
3
Which is not to say there weren’t grounds for the selection of the canonical form at its genesis, e.g., for historical,
crosslingual or phonological reasons.
bucket, make out and traffic light are marked with only one form of idiomaticity, which is
sufficient for them to be classified as MWEs. On the other hand, other MWEs such as shock and
awe and to and fro are idiosyncratic in more ways than one. We analyze shock and awe as being
pragmatically idiomatic because of its particular association with the bombardment of Baghdad
at the commencement of the Iraq War, and to and fro as being lexicosyntactically idiomatic
because of the relative syntactic opacity of the antiquated fro.
Other Properties
Other common properties of MWEs are: single-word paraphrasability, proverbiality and
prosody. Unlike idiomaticity, where some form of idiomaticity is a necessary feature of MWEs,
these other properties are neither necessary nor sufficient. Prosody relates to semantic
idiomaticity, while the other properties are independent of idiomaticity as described above.
• Crosslingual Variation
There is remarkable variation in MWEs across languages (Villavicencio, Baldwin, and
Waldron 2004). In some cases, there is direct lexicosyntactic correspondence for a cross-lingual
MWE pair with similar semantics. For example, in the red has a direct lexico-syntactic correlate
in Portuguese with the same semantics: no vermelho, where no is the contraction of in and the,
vermelho means red, and both idioms are prepositional phrases (PPs). Others have identical
syntax but differ lexically. For example, in the black corresponds to no azul (“in the blue”) in
Portuguese, with a different choice of colour term (blue instead of black). More obtusely, Bring
the curtain down corresponds to the Portuguese botar um ponto final em (lit. “put the final dot
in”), with similar syntactic make-up but radically different lexical composition. Other MWEs
again are lexically similar but syntactically differentiated. For example, in a corner (e.g., The
media has him in a corner) and encurralado (“cornered”) are semantically equivalent but
realized by different constructions – a PP in English and an adjective in Portuguese.
There are of course many MWEs which have no direct translation equivalent in a second
language. For example, the Japanese MWE zoku-giiN, meaning “legislators championing the
causes of selected industries” has no direct translation in English (Tanaka and Baldwin 2003).
Equally, there are terms which are realised as MWEs in one language but single-word lexemes in
another, such as interest rate and its Japanese equivalent riritsu.
• Single-word paraphrasability
Single-word paraphrasability is the observation that significant numbers of MWEs can be
paraphrased with a single word (Chafe 1968; Gibbs 1980; Fillmore et al. 1988; Liberman and
Sproat 1992; Nunberg et al. 1994). While some MWEs are single-word paraphrasable (e.g.,
leave out = “omit”), some are not (e.g., look up = ?). Also, MWEs with arguments can
sometimes be paraphrasable (e.g., take off clothes = “undress”), just as multi-word non-MWEs
can be single-word paraphrasable (e.g., not sufficient = “insufficient”).
• Proverbiality
Proverbiality is the ability of an MWE to “describe and implicitly to explain a recurrent
situation of particular social interest in virtue of its resemblance or relation to a scenario
involving homely, concrete things and relations” (Nunberg et al. 1994). For example, VPCs and
idioms are often indicators of more informal situations (e.g., piss off is an informal form of
annoy, and kick the bucket is an informal form of die, demise). Nunberg et al. (1994) treat
informality as a separate category, whereas we combine it with proverbiality.
• Prosody
MWEs can have distinct prosody, i.e., stress patterns, from compositional language (Fillmore
et al. 1988; Liberman and Sproat 1992; Nunberg et al. 1994). For example, when the
components do not make an equal contribution to the semantics of the whole, MWEs can be
prosodically marked, e.g., soft spot is prosodically marked (due to the stress on soft rather than
spot), although first aid and red herring are not. Note that prosodic marking can equally occur
with non-MWEs, such as dental operation.
A common term in NLP which relates closely to our discussion of MWEs is collocation. A
widely-used definition for collocation is “an arbitrary and recurrent word combination” (Benson
1990), or in our terms, a statistically idiomatic MWE (esp. of high frequency). While there is
considerable variation between individual researchers, collocations are often distinguished from
“idioms” or “non-compositional phrases” on the grounds that they are not syntactically
idiomatic, and if they are semantically idiomatic, it is through a relatively transparent process of
figuration or metaphor (Choueka 1988; Lin 1998; McKeown and Radev 2000; Evert 2004).
Additionally, much work on collocations focuses exclusively on predetermined constructional
templates (e.g. adjective-noun or verb-noun collocations). In Table 2.2, e.g., social butterfly is an
uncontroversial instance of a collocation, but to and fro would tend not to be classified as a
collocation. As such, collocations form a proper subset of MWEs.
English MWEs can be syntactically and semantically categorized in various ways. In this
thesis, we adopt the classification and terminology of Bauer (1983) and Sag et al. (2002), as
outlined in Figure 2.3. The classification of MWEs into lexicalized phrases and institutionalized
phrases hinges on whether the MWE is lexicalized (i.e., explicitly encoded in the lexicon) on the
grounds of lexico-syntactic or semantic idiomaticity, or a simple collocation (i.e., only
statistically idiosyncratic). Note that we will largely ignore pragmatic idiomaticity for the
remainder of this thesis. Lexicalized phrases are MWEs in which the components have
idiosyncratic syntax or semantics in part or in combination. Lexicalized phrases can be further
split into: fixed expressions (e.g., by train, at first), semi-fixed expressions (e.g., spill the beans,
car dealer, Chicago White Sox) and syntactically-flexible expressions (e.g., add up, give a
demo).
• Fixed expressions are fixed strings that undergo neither morphosyntactic variation nor
internal modification. For example, by and large is not morpho-syntactically modifiable
(e.g., *by and larger) or internally modifiable (e.g., *by and very large). Non-modifiable
determinerless prepositional phrases such as on air and by car are also fixed expressions.
• Semi-fixed expressions are lexically-variable MWEs that have hard restrictions on word
order and composition, but undergo some degree of lexical variation such as inflection (e.g.,
kick/kicks/kicked/kicking the bucket vs. *the bucket was kicked), variation in reflexive
pronouns (e.g., in her/his/their shoes) and determiner selection (e.g., The Beatles vs. a Beatles
album). Non-decomposable VNICs (e.g., kick the bucket, shoot the breeze) and nominal
MWEs (e.g., attorney general, part of speech) are also classified as semi-fixed expressions.
• Syntactically flexible expressions are MWEs which undergo syntactic variation, such as
verb-particle constructions, light-verb constructions and decomposable idioms. The nature of
the flexibility varies significantly across construction types. Verb-particle constructions, for
example, are syntactically flexible with respect to the word order of the particle and NP in
transitive usages: hand in the paper vs. hand the paper in. They are also usually compatible
with internal modification, even for intransitive VPCs: the plane took right off. Light-verb
constructions (e.g., give a demo) undergo full syntactic variation, including passivization
(e.g., a demo was given), extraction (e.g., how many demos did he give?) and internal
modification (e.g., give a clear demo). Decomposable idioms are also syntactically flexible to
some degree, although the exact form of syntactic variation is hard to predict (Riehemann
2001).
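The three-way split of lexicalized phrases described in the list above can be summarised as a small decision rule. The sketch below is a simplification that reduces each expression to two hand-assigned judgements; the names `LexicalizedType` and `classify` are hypothetical, and real expressions may not fit the rule as cleanly as these examples do.

```python
from enum import Enum

class LexicalizedType(Enum):
    FIXED = "fixed expression"
    SEMI_FIXED = "semi-fixed expression"
    FLEXIBLE = "syntactically flexible expression"

def classify(lexical_variation, syntactic_variation):
    """Map two hand-assigned judgements onto the three-way split."""
    if syntactic_variation:
        return LexicalizedType.FLEXIBLE
    if lexical_variation:
        return LexicalizedType.SEMI_FIXED
    return LexicalizedType.FIXED

# Judgements follow the examples given in the text.
labels = {
    "by and large": classify(lexical_variation=False, syntactic_variation=False),
    "kick the bucket": classify(lexical_variation=True, syntactic_variation=False),
    "hand in": classify(lexical_variation=True, syntactic_variation=True),
}
```

The order of the checks encodes the assumption that any syntactic variation places an expression in the flexible class, regardless of whether it also inflects.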
[Figure 2.3: classification of MWEs into lexicalized phrases and institutionalized phrases]
As described in Section 2.1.1, collocations (or institutionalized phrases) are MWEs that
occur with surprising frequency, relative to the component words or alternative phrasings of the
same expression (i.e., they are strictly statistically idiosyncratic), but which are otherwise
unmarked. Examples include peanut butter and jam, salt and pepper, telephone booth, many
thanks and traffic light.
where the head is deverbal (e.g., investor hesitation or stress avoidance). There is also the
broader class of nominal MWEs where the modifiers are not restricted to be nominal, but can
also be verbs (usually present or past participles, such as connecting flight or hired help) or
adjectives (e.g., open secret). To avoid confusion, we will term this broader set of nominal
MWEs nominal compounds. In Romance languages such as Italian, there is the additional class
of complex nominals which include a preposition or other marker between the nouns, such as
succo di limone “lemon juice” and porta a vetri “glass door”. One property of noun compounds
which has put them in the spotlight of NLP research is their underspecified semantics. For
example, while sharing the same head, there is little semantic commonality between nut tree,
clothes tree and family tree: a nut tree is a tree which bears edible nuts; a clothes tree is a piece
of furniture shaped somewhat like a tree, for hanging clothes on; and a family tree is a graphical
depiction of the genealogical history of a family (which can be shaped like a tree). In each case,
the meaning of the compound relates (if at times obtusely!) to a sense of both the head and the
modifier, but the precise relationship is highly varied and not represented explicitly in any way.
Furthermore, while it may be possible to argue that these are all lexicalised noun compounds
with explicit semantic representations in the mental lexicon, native speakers generally have
reasonably sharp intuitions about the semantics of novel compounds. For example, a bed tree is
most plausibly a tree that beds are made from or perhaps for sleeping in, and a reflection tree
could be a tree for reflecting in/near or perhaps the reflected image of a tree. Similarly, context
can evoke irregular interpretations of high-frequency compounds (Downing 1977; Spärck Jones
1983; Copestake and Lascarides 1997; Gagné, Spalding, and Gorrie 2005). This suggests that
there is a dynamic interpretation process that takes place, which complements encyclopedic
information about lexicalised compounds.
One popular approach to capturing the semantics of compound nouns is via a finite set of
relations. For example, orange juice, steel bridge and paper hat could all be analysed as
belonging to the MAKE relation, where the head is made from the modifier. This observation has
led to the development of a bewildering range of semantic relation sets of varying sizes, based on
abstract relations (Vanderwende 1994; Barker and Szpakowicz 1998; Rosario and Hearst 2001;
Moldovan et al. 2004), direct paraphrases, e.g. using prepositions or verbs (Lauer 1995; Lapata 2002;
Grover, Lapata, and Lascarides 2004; Nakov 2008), or various hybrids of the two (Levi 1978;
Vanderwende 1994). This style of approach has been hampered by issues including low
inter-annotator agreement (especially for larger semantic relation sets), coverage over data from
different domains, the impact of context on interpretation, how to deal with “fringe” instances
which do not quite fit any of the relations, and how to deal with interpretational ambiguity
(Downing 1977; Spärck Jones 1983). An additional area of interest with nominal MWEs
(especially noun compounds) is the syntactic disambiguation of MWEs with 3 or more terms.
For example, glass window cleaner can be syntactically analyzed as either (glass (window
cleaner)) (i.e., “a window cleaner made of glass”, or similar) or ((glass window) cleaner) (i.e., “a
cleaner of glass windows”). Syntactic ambiguity impacts on both the semantic interpretation and
prosody of the MWE. The task of disambiguating syntactic ambiguity in nominal MWEs is
called bracketing.
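The bracketing task can be made concrete by enumerating every binary bracketing of a compound. The sketch below is illustrative only; the function name `bracketings` is hypothetical, and choosing among the candidates (the actual disambiguation step) is not attempted.

```python
def bracketings(words):
    """Enumerate all binary bracketings of a word sequence as nested tuples."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point and combine the bracketings of both halves.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results

analyses = bracketings(["glass", "window", "cleaner"])
```

For glass window cleaner this yields exactly the two analyses discussed above. A four-term compound already has five, so the number of candidates grows quickly with length.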
Generally, VPCs are both idiosyncratic and semi-idiosyncratic combinations although some
are adverbial and/or non-lexical particle cases (Dehe et al. 2001). VPCs often involve subtle
interactions between the verb and particle (Bolinger 1976b; Jackendoff 1973; Fraser 1976;
Lindner 1983; Kayne 1985; Svenonius 1994; Dehe et al. 2001; Dehe 2002). For example, the
particle can impact on various properties of the verb, including: aspect (e.g. eat vs. eat up),
reciprocity (e.g. ring vs. ring back) and repetition (e.g. start vs. start over).
Note that VPCs are termed phrasal verbs by some researchers (Bolinger 1976b; Side 1990;
Dirven 2001; McCarthy et al. 2003) and verb-particle constructions by others (Dehe et al. 2001;
Bannard et al. 2003; Bannard 2003; Baldwin et al. 2003a; Cook and Stevenson 2006; Kim and
Baldwin 2007a). In this thesis, we will refer to them exclusively as VPCs. One MWE type which
relates closely to VPCs is prepositional verbs (Jackendoff 1973; O’Dowd 1998; Huddleston and
Pullum 2002; Baldwin 2005b), which are similarly made up of a verb and preposition, but the
preposition is transitive and selected by the verb (e.g., refer to, look for). It is possible to
differentiate transitive VPCs⁴ from prepositional verbs via their respective linguistic properties
(Bolinger 1976b; Jackendoff 1973; Fraser 1976; Lindner 1983; O’Dowd 1998; Dehe et al. 2001;
Jackendoff 2002; Huddleston and Pullum 2002; Baldwin 2005b):
• in the case that the object NP is not pronominal, transitive VPCs can occur in either the
joined or split word order (cf. (2.4)), while prepositional verbs must always occur in the
joined form (cf. (2.7));
• in the case that the object NP is pronominal, transitive VPCs must occur in the split
word order (cf. (2.5)), while prepositional verbs must occur in the joined form (cf.
(2.8));
• manner adverbs cannot occur between the verb and particle in VPCs (cf. (2.6)), while
they can occur with prepositional verbs (cf. (2.9)).
In this thesis, we will focus exclusively on VPCs where the particle is prepositional.
Verb-particle constructions
(2.4) Non-pronominal object: optional joined/split word order
• Put on the sweater.
• Put the sweater on.
(2.5) Pronominal object: obligatory split word order
• Finish it up.
• *Finish up it.
(2.6) With manner adverb
• Quickly eat up the food.
• *Eat quickly up the food.
Prepositional verbs
(2.7) Non-pronominal object
• Look for a word.
• *Look a word for.
(2.8) Pronominal object
• Look for it.
4
Prepositional verbs are obligatorily transitive, so there is no ambiguity with intransitive VPCs.
• *Look it for.
(2.9) With manner adverb
• Come with me quickly.
• Come quickly with me.
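The three diagnostics illustrated in (2.4)–(2.9) can be combined into a simple decision procedure over acceptability judgements. The sketch below assumes the judgements are supplied by hand (True meaning acceptable); the function name, the parameter names and the "undetermined" fallback are illustrative choices, not part of the cited analyses.

```python
def classify_verb_preposition(split_ok_with_np, joined_ok_with_pronoun, adverb_ok_between):
    """Combine three acceptability judgements (True = acceptable) into a label."""
    vpc_evidence = [
        split_ok_with_np,            # VPCs allow "put the sweater on"
        not joined_ok_with_pronoun,  # VPCs reject "*finish up it"
        not adverb_ok_between,       # VPCs reject "*eat quickly up the food"
    ]
    if all(vpc_evidence):
        return "VPC"
    if not any(vpc_evidence):
        return "prepositional verb"
    return "undetermined"

# Judgements for "look for" follow (2.7) and (2.8); the adverb judgement is an
# assumption based on the general statement about prepositional verbs.
look_for = classify_verb_preposition(False, True, True)
# Judgements in the VPC pattern shown in (2.4)-(2.6).
vpc_pattern = classify_verb_preposition(True, False, False)
```

Requiring all three tests to agree is a conservative choice: conflicting judgements are reported as undetermined rather than forced into either class.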
VPCs undergo morphological, syntactic and semantic variation. Morphologically, VPCs
inflect for tense and number (e.g., take/takes/took/have taken/is taken/... off). Syntactically, VPCs
undergo word order variation, and are internally modifiable by a small set of adverbs (e.g., right,
back, way and all the way). Semantically, VPCs populate the spectrum of compositionality
relative to their components (Lindner 1983; Brinton 1985; Ishikawa 1999; Olsen 2000; Jackendoff
2002; Bannard et al. 2003; Cook and Stevenson 2006). According to the view of Bannard et al.
(2003), VPCs can be sub-classified into four compositionality classes based on the independent
semantic contribution of the verb and particle: (1) both the verb and particle contribute
semantically, (2) only the verb contributes semantically, (3) only the particle contributes
semantically, and (4) neither the verb nor the particle contributes semantically. Other researchers
such as McCarthy et al. (2003) employ a one-dimensional classification of VPC
compositionality (over a cline or a number of discrete sub-classes): compositional vs. non-
compositional. Table 2.3 details the two classification systems, with examples.
Table 2.3: Classification of the compositionality of VPCs (Bannard et al. 2003 vs. McCarthy et
al. 2003)
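The four-way scheme attributed to Bannard et al. (2003) amounts to a mapping from two binary judgements to a class. A minimal sketch follows; the function name `bannard_class` is hypothetical, the class numbers simply follow the order in which the four classes are listed above, and the judgements themselves would have to come from annotators.

```python
def bannard_class(verb_contributes, particle_contributes):
    """Return the class number (1-4) in the order the four classes are listed."""
    if verb_contributes and particle_contributes:
        return 1  # both verb and particle contribute
    if verb_contributes:
        return 2  # only the verb contributes
    if particle_contributes:
        return 3  # only the particle contributes
    return 4      # neither contributes

all_classes = [
    bannard_class(v, p)
    for v in (True, False)
    for p in (True, False)
]
```

A one-dimensional scheme such as that of McCarthy et al. (2003) would replace the two flags with a single compositionality judgement or score.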
“light”, in the sense that their contribution to the meaning of the LVC is relatively small in
comparison with that of the noun complement. LVCs are also sometimes termed verb-
complement pairs (Kan and Cui 2006) or support verb constructions (Calzolari et al. 2002). Our
definition of light-verb constructions is in line with that of Huddleston and Pullum (2002). The
principal light verbs are do, give, have, make, put and take, for each of which we provide a
selection of LVCs in (2.10)–(2.15). English LVCs generally take the form verb+a/an+object,
although there is some variation here.
(2.10) do: do a demo, do a drawing, do a report
(2.11) give: give a wave, give a sigh, give a kiss
(2.12) have: have a rest, have a drink, have (a) pity (on)
(2.13) make: make an offer, make an attempt, make a call
(2.14) put: put the blame (on), put an end (to), put a stop (to)
(2.15) take: take a walk, take a bath, take a photograph (of )
There is some disagreement over the scope of the term LVC, most notably in the membership
of verbs which can be considered “light”. Calzolari et al. (2002), for example, argued that the definition
of LVCs (or support verb constructions in their terms) should be extended as follows: (1) when
the verbs combine with an event noun (deverbal or otherwise) and the subject is a participant in
the event most closely identified with the noun (e.g., take an exam, ask a question, make a
promise); and (2) when the subject of these verbs belongs to some scenario associated with the
full understanding of the event type designated by the object noun (e.g., pass an exam, survive
an operation, answer a question, keep a promise).5
Morphologically, LVCs inflect but the noun complement tends to have fixed number and a
preference for determiner type (Wierzbicka 1982; Alba-Salas 2002; Kearns 2002; Butt 2003;
Folli et al. 2003; Stevenson et al. 2004). For example, put an end (to) undergoes full verbal
inflection (put/puts/putting an end (to)), but the noun complement cannot be pluralized or
modified derivationally (e.g. *put an ending (to), *put ends to).6
As described above, there is little constraint on the syntax of LVCs. Semantically, although
the meaning of the verb in LVCs is bleached, a given noun will usually have strong constraints
on which light verb(s) it combines with to form an LVC (e.g., put blame (on) vs.
*do/give/have/make blame), and different light verbs can lead to LVCs with different semantics
(Butt 2003). For example, put blame (on) and take blame are both LVCs but have very
different semantics: the subject of put blame (on) is the Agent of the blaming and the object of
the PP headed by on is the Patient, while the subject of take blame is the Theme. Also, which light
verb a given noun will combine with to form an LVC is often consistent across semantically-
related noun clusters (e.g., give a cry/moan/howl vs. *take a cry/moan/howl7).
5
All examples are taken from Calzolari et al. (2002).
6
But also note other examples where the noun complement can be pluralized, e.g. take a bath vs. take baths.
2.5 Idioms
An idiom is an MWE whose meaning is fully or partially unpredictable from the meanings of
its components (e.g., kick the bucket, blow hot and cold) (Nunberg et al. 1994; Potter et al. 2000;
Sag et al. 2002; Huddleston and Pullum 2002). Huddleston and Pullum (2002) identified
subtypes of idioms such as verbal idioms (e.g., jump off, get out, run ahead) and prepositional
idioms (e.g., for example, in person, under the weather) which we classify as VPCs/prepositional
verbs and determinerless PPs, respectively. In our terms, therefore, idioms are those non-
compositional MWEs not included in the named construction types of VPCs, prepositional verbs,
noun compounds and determinerless PPs. While all idioms are non-compositional (to varying
degrees), we further categorize them into two groups: decomposable and non-decomposable
(Nunberg et al. 1994). With decomposable idioms, given the interpretation of the idiom, it is
possible to associate components of the idiom with distinct elements of the idiom interpretation
based on semantics not immediately accessible from the components in isolation. Assuming an
interpretation of spill the beans such as reveal’(X, secret’), e.g., we could analyze spill as having
the semantics of reveal’ and beans having the semantics of secret’, and hence arrive at a post hoc
explanation for the interpretation of the idiom via the reverse-engineered semantics of the
components (through figuration of some description). Note that the interpretations of the
components (spill as reveal’ and beans as secret’) are removed from those for the simplex words,
and it is on this basis that we consider the idiom non-compositional. Other examples of
decomposable idioms are pull one’s leg and pull strings. Examples of non-decomposable idioms
where a post hoc semantic decomposition is not accessible are break a leg and kick the bucket.
Decomposable idioms tend to be syntactically flexible, as defined by the nature of the semantic
decomposition, whereas non-decomposable idioms tend not to be syntactically flexible (Katz and
Postal 2004; Wood 1964; Chafe 1968; Kastovsky 1982; Pawley and Syder 1983; Cruse 1986;
Jackendoff 1997; Sag et al. 2002). For example, spill the beans can be passivized (It’s a shame
the beans were spilled) and internally modified (AT&T spilled the Starbucks beans).
7
Examples are from Stevenson et al. (2004).
Class            Examples
institutional    at school, in church, on campus, in gaol
media            on TV, on record, off screen, in radio
metaphor         on ice, at large, at hand, at liberty
temporal         at breakfast, on holiday, on break, by day
means/manner     by car, by hammer, by computer, via radio
Note that some D-PPs with in combine with countable nouns such as pub and hospital but they
do not refer to a social institution. In general, D-PPs have been categorized into five semantic
groups by Stvan (1998). These classes often correlate with a particular compositionality, e.g.,
metaphorical D-PPs are non-compositional while the other classes are compositional.
MWE types. We have further sub-classified idioms into decomposable and non-decomposable
idioms. Finally, determinerless PPs are made up of a preposition and a singular noun without a
determiner.
Chapter 3
Statistical Frameworks of MWEs and Related Work
3.1 Introduction
In this chapter, we will look at the underlying methods commonly used in statistical approaches
to MWE extraction: co-occurrence properties, substitutability, distributional similarity, semantic
similarity and linguistic properties. We will take a look at how these methods are used for
computational tasks relating to MWE extraction, and weigh up the advantages and disadvantages
of each approach. We will also look at prior approaches, and provide an overview and
comparison of the methods used in this thesis.
extracting statistically-marked MWEs such as shock and awe as their co-occurrence tends to
have abnormally high frequency relative to the alternative ordering. Example (3.1) is a sample of
binomials with high frequency relative to their alternative ordering, while example (3.2) is
a sample of binomials where both orderings have approximately the same
frequency. This method can also be paired with analysis of alternative wordings for a given
phrase in the form of substitutability (see Section 3.3). Note that when we say co-occurrence we
refer to the co-occurrence of the parts rather than co-occurrence with any specific context, which
is the basis of distributional similarity in Section 3.4.
(3.1) MWEs: black and white, by and large, salt and pepper, shock and awe
(3.2) Non-MWEs: blue and red, small and large, salt and sugar
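The ordering comparison behind (3.1) and (3.2) can be illustrated with a minimal sketch. It counts how often a binomial occurs in each order and reports the share taken by the first order. The function name ordering_ratio and the toy corpus are assumptions made for this illustration only; they are not taken from any of the systems discussed in this chapter.

```python
from collections import Counter


def ordering_ratio(tokens, a, b, conj="and"):
    """Count 'a conj b' and 'b conj a' and return the share of the a-first order."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    ab = trigrams[(a, conj, b)]
    ba = trigrams[(b, conj, a)]
    total = ab + ba
    share = ab / total if total else 0.0
    return ab, ba, share


# Toy corpus, invented purely for illustration.
text = ("salt and pepper salt and pepper pepper and salt "
        "blue and red red and blue salt and pepper")
tokens = text.split()

print(ordering_ratio(tokens, "salt", "pepper"))
print(ordering_ratio(tokens, "blue", "red"))
```

A share close to 1 indicates a strongly preferred ordering, as expected for the binomials in (3.1), whereas a share near 0.5 matches the pattern in (3.2). On real data the counts would of course come from a large corpus rather than a single string.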
Note that the underlying mechanism driving co-occurrence is statistical idiomaticity, as most
MWEs are statistically idiomatic to some degree. In (3.1), for example, the method can be seen
to have extracted statistically-marked MWEs (by and large) as well as semantically- (black and
white) and pragmatically-marked MWEs (shock and awe).
Co-occurrence properties are often measured by association measures such as pointwise/
specific mutual information (Church and Hanks 1989), the Dice coefficient (Church and Hanks
1989), Student’s t-test, Pearson’s chi-square (Dunning 1993) and log likelihood (Dunning
1993). The measurement of co-occurrence properties is useful when the components combine
together with markedly high frequency relative to the components, or alternatively relative to an
alternative form of the same MWE. However, quantitatively measuring co-occurrence properties
via a given association measure has its limitations. As most of the measures rely on lexicalized
corpus frequencies, they are vulnerable to the effects of data sparseness.
Furthermore, it is often difficult to predict which association method will perform best over a
given MWE type and corpus (Pecina 2005). Co-occurrence properties have been used widely in
tasks such as extracting collocations and MWEs (Smadja 1993; Grefenstette and Teufel 1995;
Villavicencio et al. 2004; Baldwin 2005b; Fazly et al. 2005; Villada Moiron 2005; Pecina 2005;
Widdows and Dorow 2005; Kan and Cui 2006), modeling the compositionality of MWEs
(Bannard 2003; McCarthy et al. 2003; Venkatapathy and Joshi 2005; Fazly and Stevenson 2007;
Kim and Baldwin 2007a), and classifying MWE semantics (Fraser 1976; Lapata and Keller
2004). Below, we outline a representative selection of papers on the co-occurrence properties of
MWEs, in the context of extraction, compositionality modeling and semantic classification tasks,
respectively. Note that in some instances, the original research uses the term collocation in the
broader sense of the term to mean MWE. In our description of the research, we will use the terms
MWE and collocation as outlined in Chapter 2.
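To make the association measures named above concrete, the following sketch computes them for a single word pair from raw counts. It uses the standard textbook formulations over a 2×2 contingency table, with the t-score in its common approximate form (observed minus expected, divided by the square root of observed). The function name association_scores and the counts are illustrative assumptions, not values from any cited study.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a word pair (x, y) from raw counts.

    f_xy: frequency of the pair; f_x, f_y: frequencies of x as first word
    and y as second word; n: total number of pairs in the corpus.
    Assumes f_xy > 0 and that no expected cell count is zero.
    """
    expected = f_x * f_y / n
    pmi = math.log2(f_xy / expected)
    dice = 2 * f_xy / (f_x + f_y)
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Observed 2x2 table: rows = x / not x, columns = y / not y.
    observed = [[f_xy, f_x - f_xy],
                [f_y - f_xy, n - f_x - f_y + f_xy]]
    row_totals = [f_x, n - f_x]
    col_totals = [f_y, n - f_y]
    chi_square = 0.0
    log_likelihood = 0.0
    for i in range(2):
        for j in range(2):
            exp_ij = row_totals[i] * col_totals[j] / n
            obs_ij = observed[i][j]
            chi_square += (obs_ij - exp_ij) ** 2 / exp_ij
            if obs_ij > 0:  # an empty cell contributes nothing to the sum
                log_likelihood += 2 * obs_ij * math.log(obs_ij / exp_ij)
    return {"pmi": pmi, "dice": dice, "t_score": t_score,
            "chi_square": chi_square, "log_likelihood": log_likelihood}


# Invented counts: the pair occurs 20 times, x 50 times, y 40 times, in 1000 pairs.
scores = association_scores(f_xy=20, f_x=50, f_y=40, n=1000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because every measure here is derived from lexicalized counts, small values of f_xy make the scores unstable, which is the data sparseness problem noted above.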
automatic MWE extraction methods using precision–recall curves, and to propose a new
approach for combining individual extraction methods using supervised learning methods.
Pecina used a total of 84 association measures based on occurrence frequencies (i.e., co-
occurrence properties) over binary MWEs. As association measures, he used simple
probabilities, mutual information and derived measures, statistical tests of independence,
likelihood measures, and various heuristic association measures and coefficients. He also used
context association measures based on syntactic and semantic units, with a more sound linguistic
foundation. The final conclusion of this work was that the combination of multiple independent
measures is superior to any one individual extraction method at MWE extraction.
Grefenstette and Teufel (1995) developed a method for extracting light verbs and their
complements (i.e. LVCs) using co-occurrence properties. The basic idea behind this work is that
the noun complements in LVCs are often deverbal (e.g., proposal), and that the distribution of
nouns in PPs post-modifying noun complements in genuine LVCs (e.g., (make a) proposal of
marriage) will be similar to that of the object of the underlying verb (e.g., propose marriage).
Grefenstette and Teufel collected verbs and their nominalized forms, along with verb–object
relations for the verbs and verb–noun–PP relations for the nouns, based on a low-level parser and
heuristics.1 From this, they selected the most common verb supporting the structure NP PP where
the given nominalization heads the NP and the prepositional head of the PP is most similar to
that of the underlying verb of the nominalization. In this case, therefore, multiple co-occurrences
are considered (verb–noun and noun–preposition) to predict the light verb associated with a
given nominalization.
Baldwin (2005b) employed several statistical tests to extract prepositional
verbs (see Section 2.3). The main idea in this work is that the verb and preposition in
prepositional verbs co-occur more frequently than for simple verb–preposition combinations.
Baldwin proposed a number of unsupervised methods to extract prepositional verbs based on
statistical tests such as chi-square and Dice’s coefficient, as well as substitutability with highly
frequent verbs and transitive prepositions (see Section 3.3). The method also adopted linguistic
features of prepositional verbs, and demonstrated that co-occurrence properties were effective in
the extraction task, but that the combination of all extraction method strategies was superior
overall.
1
Note that a parser has been employed in several MWE extraction methods, including Baldwin (2005a) in the
context of English VPC extraction. However, in Baldwin (2005a), the parser(s) are used extensively not only to
extract VPC candidates but also to analyze the argument structure of the VPC.
of verb–noun pairs than distributional similarity, and that the correlation between the combined
features and the human ranking was much better than that using individual features.
3.3 Substitutability
3.3.1 Overview of Substitutability
Substitutability is the ability to replace parts of MWEs with alternative lexical items, and
involves comparison of the target MWE with anti-collocations. Also, this method is directly
related to single-word paraphrasability described in Section 2.1.1. This approach is effective
when parts of an MWE occur with unusually high frequency relative to lexical alternatives, i.e.
their collocational association is high. In this thesis, we consider substitutability to be a subset of
co-occurrence properties.
Substitutability can be applied to either compositional or non-compositional MWEs.
Substitutability is closely related to anti-collocation, as when parts of the MWE are replaced, the
new lexical items are generally no longer MWEs. Note that in substitutability, we always
consider the whole MWE (in the form of the original or the anti-collocation), while in co-
occurrence properties, we sometimes compare the whole to a variant word order, and sometimes
compare the whole to its parts. Analysis of substitutability tends to be based on the same
inventory of statistical tests as for co-occurrence, as outlined in Section 3.2.1.
In generating substitution candidates, we often replace components of the original MWE
with synonyms, sister words or antonyms, depending on the task and approach. This is based on
the assumption of institutionalization, i.e. that a particular word combination has been
established as an MWE to the exclusion of other plausible possibilities based on related words.
Table 3.1 gives examples where substitution leads to syntactically and/or semantically
anomalous word combinations.
MWE                Non-MWE
frying pan         frying pot
salt and pepper    salt and sugar
many thanks        several thanks
red tape           yellow tape
In Table 3.1, when parts such as pan and many are replaced with related words, the newly-
formed word combinations (i.e. frying pot and several thanks, respectively) are no longer
MWEs. Similarly, yellow tape, formed by substituting red with yellow in red tape, does not
preserve the original meaning of “bureaucracy” (non-elective government officials).
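As a rough sketch of how substitutability can be quantified, the code below compares the frequency of a candidate with the frequencies of its substitution variants. The scoring function (the candidate's share of the combined frequency), the name substitution_score and all counts are assumptions made for illustration; they do not reproduce the scoring used in any of the cited work.

```python
def substitution_score(candidate, substitutes, counts):
    """Share of the combined frequency held by the candidate itself.

    Values near 1 mean the substitution variants are (almost) unattested,
    which is the pattern expected of an institutionalized MWE.
    """
    cand = counts.get(candidate, 0)
    total = cand + sum(counts.get(s, 0) for s in substitutes)
    return cand / total if total else 0.0


# Invented corpus counts, for illustration only.
counts = {
    "frying pan": 120, "frying pot": 2,
    "many thanks": 300, "several thanks": 1,
    "blue shirt": 40, "red shirt": 35, "green shirt": 30,
}

print(substitution_score("frying pan", ["frying pot"], counts))
print(substitution_score("many thanks", ["several thanks"], counts))
print(substitution_score("blue shirt", ["red shirt", "green shirt"], counts))
```

In practice the substitutes would be generated from a lexical resource (synonyms, sister words or antonyms), and the raw share would typically be replaced by one of the statistical tests from Section 3.2.1.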
Substitutability can also be used to investigate the limits of productivity of MWEs such as VPCs
and NCs. Despite various semantic restrictions, certain MWEs are highly productive. Hence,
substitutability can be employed in order to construct new MWEs while maintaining the original
“semantic collocation” (e.g. the same verb synset combined with the same particle).
In (3.3), call up is the basis for generating the VPCs phone up and ring up, but anomalously
not telephone up, despite telephone being a lexical variant of phone. Starting with lemon juice in
(3.4), we form the three NCs lime juice, fruit juice and orange juice, based on substituting lemon
with a synonym, hypernym and sister word, respectively.
In a computational context, substitutability is broadly used to classify word combinations as
MWEs or non-MWEs (Lin 1998b; Lin 1998d; Lin 1999; Pearce 2001). Substitutability is also
applicable to the modeling of MWE compositionality (Bannard et al. 2003; Bannard 2003;
McCarthy et al. 2003; Kim and Baldwin 2007a), the generation of MWEs with related semantics
or compositionality (Stevenson et al. 2004; Baldwin 2005b; Turney 2005; Kim and Baldwin
2007b; Kim and Baldwin 2007c), and semantic classification (Villavicencio et al. 2004;
Villavicencio 2005; Uchiyama et al. 2005).
MWE                  Anti-collocations
emotional baggage    emotional luggage
many thanks          several thanks
strong coffee        powerful coffee
In Table 3.2, emotional baggage is an MWE whereas emotional luggage is not, despite
baggage and luggage being synonyms. That is, in terms of the MWE properties described in
Section 2.1.1, the MWEs in Table 3.2 are institutionalized, as indicated by their unusually high
frequency relative to their anti-collocations. In evaluation, Pearce classified the test instances
into three classes: MWE, potential and unknown. The experimental results were promising, and
demonstrated the power of the rich hierarchical structure of WordNet.
top 10 related words using synonymy, hypernymy and sister words. He then filters generated
word pairs based on frequency, and measures the similarity of phrases based on clustering to
confirm that they preserve the same relational semantics.
Two notable aspects of this research are that: (1) it is based on substitutability; and (2) it
makes use of clustering and not classification, and as such does not attempt to resolve the exact
relation between the nouns in a given pair.
Uchiyama et al. (2005) used the co-occurrence properties of Japanese compound verbs to
predict their semantics. Japanese compound verbs are made up of a verb in the continuative form
(V1) and an auxiliary verb (V2), as in tabe-sugiru/eat too much. Japanese compound verbs are
highly productive and semantically ambiguous, and are subject to semantic constraints between
the first verb and the second verb. (3.6)–(3.8) show examples of Japanese compound verbs and a
classification according to the semantics of the V2 (i.e. spatial, aspectual and adverbial), which
also correspond to distinct translation strategies into English (as indicated). Note that the
translation between Japanese and English has been carried out based on the fact that the two
languages are only partially similar, owing to their loose connection.
Uchiyama et al. (2005) proposed a novel machine learning method to disambiguate the
semantics of V2, based on the co-occurrence of V1 and V2. The method is based on a matrix
analysis of V1–V2 combinations. That is, the features used to classify a given combination of V1
and V2 are based on the semantic classes of each V2′ which co-occurs with V1, and each V1′
which co-occurs with V2, based on the row containing V1 and column containing V2.
(3.9) The old man requested, “When I kick the bucket, bury me on top of that mountain.”
(3.10) When we were about to enter the room, Kim accidentally kicked the bucket next to
the door.
Comparing distributional similarity with the previous two methods, it is similar to co-
occurrence properties in that it compares word combinations, with the big distinction that
distributional similarity analyses the context of token occurrences of a given lexical item,
whereas co-occurrence properties analyse the frequencies of components.
Distributional similarity is a more powerful method in that there is greater scope for
parameterization/reformulating in terms of: how the context window is defined, how token
counts are translated into feature vectors, and how context vectors are compared. In the context
of translating token counts into feature vectors, e.g., a considerable amount of work has been
done on dimensionality reduction, such as with latent semantic analysis (LSA) (Landauer et al.
1998), to overcome data sparseness.
Co-occurrence properties, on the other hand, are based fundamentally on token counts of
components/re-orderings of the original lexical item, with the only place for innovation in the
numeric interpretation of those numbers.
One way in which researchers have extended the basic
distributional similarity method is by redefining the context window to look at the second-order
co-occurrence of words. Here, rather than using the neighboring words of the target lexical
item across multiple contexts as a direct representation of the target
expression, the neighboring words of a specific token occurrence of the target expression are in
turn modeled via their neighboring words. For example, let’s assume that the target word bank
has neighboring words money, stock and savings in a given context window. Rather than
represent these directly as a 3-term (sparse) vector, we look to see what words each of them co-
occurs with across the sum total of their usages. For example, money might co-occur with terms
such as banking and market across all of its token occurrences, giving us a rich vector with
which to represent that one context term. We similarly generate individual vectors for the other
two context terms and use the combination of the three to represent the original context. If we
were then to compare the original token instance of bank with a single token instance of financial
institution, say, although the immediate context words may not overlap, there is a good chance
that the context vectors for each of the context words will. Second-order co-occurrence therefore
provides a powerful mechanism for performing token-level analysis of context, e.g. in
disambiguating individual occurrences of word sequences (such as kick the bucket) as either
MWEs or simple compositional combinations.
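The bank example can be worked through in a short sketch. It builds first-order co-occurrence vectors from a toy corpus, represents each token context by the sum of the vectors of its context words (summation is only one possible way of combining them, chosen here as an assumption), and compares two contexts by cosine similarity. All function names and the corpus are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def first_order_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within the window."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[word][sent[j]] += 1
    return vecs


def second_order_vector(context_words, vecs):
    """Represent a token context by summing the vectors of its context words."""
    total = Counter()
    for word in context_words:
        total.update(vecs.get(word, Counter()))
    return total


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for illustration.
corpus = [
    ["money", "banking", "market"],
    ["stock", "market", "trading"],
    ["savings", "banking", "account"],
    ["loan", "banking", "market"],
    ["deposit", "account", "banking"],
]
vecs = first_order_vectors(corpus)

bank_context = ["money", "stock", "savings"]      # neighbours of one token of "bank"
institution_context = ["loan", "deposit"]          # neighbours of "financial institution"

first_order = cosine(Counter(bank_context), Counter(institution_context))
second_order = cosine(second_order_vector(bank_context, vecs),
                      second_order_vector(institution_context, vecs))
print(first_order, second_order)
```

The two contexts share no words directly, so their first-order similarity is zero, but their second-order vectors overlap on terms such as banking and market.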
The main weakness of distributional similarity is that it relies on large amounts of corpus
data to operate effectively. Distributional similarity has been employed to model the
compositionality of MWEs (Schone and Jurafsky 2001; Bannard 2003; Baldwin et al. 2003a;
Venkatapathy and Joshi 2005; McCarthy et al. 2007), to identify MWEs (Katz and Giesbrecht
2006), and to classify the semantics of MWEs (Stevenson et al. 2004).
Bannard et al. (2003) used the distributional semantics of English VPCs to measure their
compositionality and to model the contribution of the verb and particle in the overall semantics
of the VPC. The basic idea behind this work is that if an MWE is compositional, then it will
occur in the same lexical context as its components. The authors assumed that VPCs populate a
continuum between fully compositional and fully non-compositional structures. Bannard et al.
used four different classification methods: the method of Lin (1999), the context space model of
Schütze (1998), a substitution method, and distributional similarity between each of the
components and the overall VPC. The authors found that the mixed methods performed best, and
the third and fourth methods outperformed the first and second methods. Significantly, this paper
showed that distributional semantics can be applied to the analysis of particles and MWEs,
where previous work had tended to focus exclusively on simplex content words.
Baldwin et al. (2003a) used distributional similarity to compare MWEs with their
components, focusing on NCs and VPCs. The proposed method was based on the context space
model of Schütze (1998), which incorporates LSA2. (3.11) illustrates the outputs of the method
for the VPCs cut out and cut off with the component verb cut. Based on the similarity values, the
model is predicting that cut out is more compositional than cut off.
(3.11) similarity (cut, cut out) = .433 vs. similarity (cut, cut off) = .183
To evaluate their method, the authors compared the predicted similarity between VPCs and
their component verbs, and NCs and their component nouns, with similarities generated from
WordNet. They found a weak correlation between the two, and once again demonstrated the
potential for distributional semantics to model the compositionality of MWEs.
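The comparison in (3.11) can be sketched as a cosine similarity between the context vector of a VPC and that of its component verb. The context counts below are invented and the vectors are raw counts without LSA, so the sketch is a simplification and does not reproduce the similarity values reported by Baldwin et al. (2003a); the names cosine and contexts are likewise assumptions for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented context counts; each VPC is treated as a single lexical item.
contexts = {
    "cut":     {"knife": 5, "paper": 4, "scissors": 3, "hair": 2},
    "cut_out": {"paper": 4, "scissors": 3, "shape": 2, "knife": 1},
    "cut_off": {"power": 4, "supply": 3, "phone": 3, "knife": 1},
}

sim_out = cosine(contexts["cut"], contexts["cut_out"])
sim_off = cosine(contexts["cut"], contexts["cut_off"])
print(f"similarity(cut, cut out) = {sim_out:.3f}")
print(f"similarity(cut, cut off) = {sim_off:.3f}")
```

With these toy counts the VPC whose contexts resemble those of the verb receives the higher score, which is the signal used to predict that it is the more compositional of the two.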
German MWEs and their components, they once again employed the context space model of
Schütze (1998).
Figure 3.2 shows the context vector associated with an idiomatic usage of den Löffel abgeben
(the German counterpart of kick the bucket, literally meaning “to hand in the spoon”), compared
to each of its component words vs. a paraphrase for the MWE (sterben, meaning die). Here,
therefore, the prediction would be that the usage is idiomatic rather than literal. The authors
concluded that it is possible to identify MWEs in context using distributional similarity.
Figure 3.1: Distributional semantics of the German idiom den Löffel abgeben (Katz and
Giesbrecht 2006); vector labels: ESSEN, LÖFFEL, STERBEN
above), it is generally the case that they have a similar interpretation, e.g. via a semantic relation.
This gives rise to a method for interpreting MWE semantics (Rosario and Marti 2001; Moldovan
et al. 2004; Kim and Baldwin 2005; Nastase et al. 2006; Girju 2007; Kim and Baldwin 2007c).
(3.12) and (3.13) show how to interpret the semantic relations in NCs using semantic similarity.
(3.12) modifier = fruit, head noun = liquid -> SR = make
e.g. apple juice, orange juice, grape nectar, chocolate milk
(3.13) modifier = location, head noun = liquid -> SR = location
e.g. Fuji apple, California orange, Bordeaux wine
In (3.12) and (3.13), despite different combinations of lexical items, NCs such as apple juice
and chocolate milk are predicted to have the same SR of make, as the modifier and head noun,
respectively, have similar semantics.
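A minimal sketch of this idea is a nearest-neighbour interpreter: an unseen NC receives the SR of the labelled NC whose modifier and head are most similar to its own. The hand-written class lookup below stands in for a real similarity measure (e.g. one based on WordNet), and the names SEM_CLASS, TRAINING, word_sim and interpret, as well as the similarity values, are assumptions made for illustration.

```python
# Hypothetical semantic classes, standing in for a lexical resource.
SEM_CLASS = {
    "apple": "fruit", "orange": "fruit", "grape": "fruit",
    "juice": "liquid", "nectar": "liquid", "wine": "liquid",
    "Bordeaux": "location", "California": "location",
}

# Labelled NCs: ((modifier, head noun), semantic relation).
TRAINING = [
    (("apple", "juice"), "make"),
    (("Bordeaux", "wine"), "location"),
]


def word_sim(a, b):
    """Toy similarity: 1.0 for identical words, 0.5 for the same class, else 0."""
    if a == b:
        return 1.0
    if a in SEM_CLASS and SEM_CLASS.get(a) == SEM_CLASS.get(b):
        return 0.5
    return 0.0


def interpret(modifier, head):
    """Return the SR of the most similar labelled NC and its similarity score."""
    best_relation, best_score = None, 0.0
    for (train_mod, train_head), relation in TRAINING:
        score = word_sim(modifier, train_mod) * word_sim(head, train_head)
        if score > best_score:
            best_relation, best_score = relation, score
    return best_relation, best_score


print(interpret("orange", "nectar"))
print(interpret("California", "wine"))
```

An NC whose words are missing from the lookup receives no interpretation at all.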
The advantage of this method comes from the ability to use existing similarity measures for
simplex words (e.g. based on lexical resources such as WordNet or CoreLex) to accurately
interpret MWEs, although such methods are limited by the coverage of the underlying similarity
measures (and hence the coverage of any base lexical resources).
This method is employed in computational tasks such as interpreting NCs (Rosario and Marti
2001; Moldovan et al. 2004; Kim and Baldwin 2005; Girju 2007), and modeling the
compositionality of MWEs (Piao et al. 2006; Kim and Baldwin 2007a).
The paper proposes a novel method for measuring the semantic distance between an MWE
and its component words based on hand-tagged hierarchical semantic information. Piao et al.
evaluated the proposed method over 89 MWEs, scoring each from 0 (least compositional) to 10
(completely compositional). They used Spearman’s correlation coefficient to measure the
correlation between the automatic and manual rankings, and claimed results comparable to
human performance.
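The evaluation step can be sketched as follows: Spearman's coefficient is the Pearson correlation between the ranks of the two score lists, with tied values given their average rank. The scores below are invented and are not the data of Piao et al.; the function names are assumptions for this illustration.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho; assumes equal lengths and that neither list is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Invented compositionality scores on the 0-10 scale for five MWEs.
human = [0, 3, 5, 8, 10]
automatic = [1, 2, 6, 5, 9]
print(spearman(human, automatic))
```

Only the relative ordering of the MWEs matters for this coefficient, which is why it suits a comparison between an automatic ranking and a manual one.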
\[ P(r \mid f_{ij}) = \frac{P(r, f_{ij})}{P(f_{ij})} \qquad (3.16) \]
where fij is a simplified feature pair fi fj (i.e. the word senses of the modifier and head noun in an
NC) and r is the semantic relation. The preferred SR r* for the given word sense combination is
that which maximizes the probability:
\[ r^{*} = \arg\max_{r \in R} P(r \mid f_{ij}) \]
In evaluation, the authors found that their method performed at about 43% accuracy over
open domain NCs.
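The selection of the preferred SR can be sketched with relative-frequency estimates: for a given feature pair, the probability of each relation is estimated from labelled examples and the most probable relation is returned. Estimating the probabilities by simple counting is an assumption of this sketch, as are the function names and the training examples.

```python
from collections import Counter


def train(examples):
    """Count (f_i, f_j, r) triples and (f_i, f_j) pairs from labelled NCs."""
    pair_relation = Counter()
    pair = Counter()
    for f_i, f_j, relation in examples:
        pair_relation[(f_i, f_j, relation)] += 1
        pair[(f_i, f_j)] += 1
    return pair_relation, pair


def best_relation(f_i, f_j, pair_relation, pair):
    """Return the relation with the highest estimated probability for the pair."""
    total = pair[(f_i, f_j)]
    if not total:
        return None, 0.0
    probs = {r: count / total
             for (a, b, r), count in pair_relation.items()
             if (a, b) == (f_i, f_j)}
    r_star = max(probs, key=probs.get)
    return r_star, probs[r_star]


# Invented labelled examples: (modifier sense class, head sense class, SR).
examples = [
    ("fruit", "liquid", "make"),
    ("fruit", "liquid", "make"),
    ("fruit", "liquid", "make"),
    ("fruit", "liquid", "location"),
    ("location", "liquid", "location"),
    ("location", "liquid", "location"),
]
pair_relation, pair = train(examples)
print(best_relation("fruit", "liquid", pair_relation, pair))
```

A feature pair that was never observed in training yields no prediction, which hints at why coverage is a limiting factor for open domain NCs.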
that it is not a VPC (although in this last case, it is a sufficient but not necessary condition on
VPCs).
Linguistic properties often take the form of highly-specific syntactic features of MWEs,
either in context or at the type-level. While linguistic properties can rely on context, they differ
from distributional similarity in that they are very selective and fine-tuned to particular
construction types. As shown in the examples above, linguistic properties can provide very
reliable features for identifying or otherwise classifying MWEs. Their main drawbacks are that
they do not easily generalize, and rely on the occurrence of very particular usages/contexts.
Example tasks where linguistic properties have been employed are MWE extraction (Baldwin
and Villavicencio 2002; Baldwin 2005a; Nakov and Hearst 2005), identification (Patrick and
Fletcher 2004; Van Der Beek 2005; Kim and Baldwin 2006a) and semantic classification
(O’Hara and Wiebe 2003; Stevenson et al. 2004; Cook and Stevenson 2006).
• Hand it in promptly.
• ?*Hand it promptly in.
The task in Baldwin (2005a) was undertaken with no assumptions about corpus annotation,
using only information from pre-processors such as a part-of-speech tagger, chunker and RASP.
It also evaluated VPC extraction as both shallow and deep lexical acquisition tasks, that is either
as the simple task of determining what combinations of verb and preposition can form a VPC, or
as the harder task of determining what combinations of verb and preposition can form an
intransitive and transitive VPC (e.g. for the purposes of a deep grammar lexicon).
The proposed method was tested over three corpora (Brown Corpus, Wall Street Journal and
British National Corpus), and linguistic properties were shown to provide valuable evidence in
the extraction task, especially over low-frequency VPCs.
Nakov and Hearst (2005) employed
linguistic properties in a probabilistic model for bracketing NCs with 3 or more terms in the
medical domain, building on the work of Marcus (1980) and Lauer (1995). Nakov and Hearst
extended this earlier work by integrating linguistic features into their model, based on analysis of
surface features in web data as illustrated in (3.28)–(3.30).
Based on these and other features, Nakov and Hearst (2005) developed an unsupervised
method for NC bracketing using chi-square, and achieved 89.34% bracketing accuracy.
combine with a similar range of target particles; and (2) what verbs can combine with a given
particle is an indicator of the semantics of the target particle. Based on these observations, the
authors used slot features to encode the relative frequencies of the syntactic slots (i.e. subject,
direct and indirect object, and object of a preposition), and particle features to encode the relative
frequency of the verb co-occurring with high frequency particles. Cook and Stevenson classified
the particle up into four different semantic classes, as illustrated in (3.31)–(3.34).
The paper also used co-occurrence features to classify particle semantics. In their
experiments, the authors found that the method based on linguistic properties outperformed that
using word co-occurrence features in the task of classifying particle semantics.
3.7 Collocations
3.7.1 Overview of Collocation
A collocation is an expression consisting of two or more words that corresponds to some
conventional way of saying things. Or in the words of Firth (1957: 181): “Collocations of a given
word are statements of the habitual or customary places of that word.” Collocations include noun
phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other
stock phrases like the rich and powerful. Particularly interesting are the subtle and not-easily-
explainable patterns of word usage that native speakers all know: why we say a stiff breeze but
not *a stiff wind (while either a strong breeze or a strong wind is okay), or why we speak of
broad daylight (but not *bright daylight or *narrow darkness).
Collocations are characterized by limited compositionality. We call a natural language
expression compositional if the meaning of the expression can be predicted from the meaning of
the parts. Collocations are not fully compositional in that there is usually an element of meaning
added to the combination. In the case of strong tea, strong has acquired the meaning rich in some
active agent which is closely related, but slightly different from the basic sense having great
physical strength. Idioms are the most extreme examples of non-compositionality. Idioms like to
kick the bucket or to hear it through the grapevine only have an indirect historical relationship to
the meanings of the parts of the expression. We are not talking about buckets or grapevines
literally when we use these idioms. Most collocations exhibit milder forms of non-
compositionality, like the expression international best practice that we used as an example
earlier in this book. It is very nearly a systematic composition of its parts, but still has an element
of added meaning. It usually refers to administrative efficiency and would, for example, not be
used to describe a cooking technique although that meaning would be compatible with its literal
meaning.
Collocations are important for a number of applications: natural language generation (to
make sure that the output sounds natural and mistakes like powerful tea or to take a decision are
avoided), computational lexicography (to automatically identify the important collocations to be
listed in a dictionary entry), parsing (so that preference can be given to parses with natural
collocations), and corpus linguistic research (for instance, the study of social phenomena like the
reinforcement of cultural stereotypes through language (Stubbs 1996)).
There is much interest in collocations partly because this is an area that has been neglected in
structural linguistic traditions that follow Saussure and Chomsky. There is, however, a tradition
in British linguistics, associated with the names of Firth, Halliday, and Sinclair, which pays close
attention to phenomena like collocations. Structural linguistics concentrates on general
abstractions about the properties of phrases and sentences. In contrast, Firth’s Contextual Theory
of Meaning emphasizes the importance of context: the context of the social setting (as opposed to
the idealized speaker), the context of spoken and textual discourse (as opposed to the isolated
sentence), and, important for collocations, the context of surrounding words (hence Firth’s
famous dictum that a word is characterized by the company it keeps). These contextual features
easily get lost in the abstract treatment that is typical of structural linguistics. A good example of
the type of problem that is seen as important in this contextual view of language is Halliday’s
example of strong vs. powerful tea (Halliday 1966: 150). It is a convention in English to talk
about strong tea, not powerful tea, although any speaker of English would also understand the
latter unconventional expression. Arguably, there are no interesting structural properties of
English that can be gleaned from this contrast. However, the contrast may tell us something
interesting about attitudes towards different types of substances in our culture (why do we use
powerful for drugs like heroin, but not for cigarettes, tea and coffee) and it is obviously important
to teach this contrast to students who want to learn idiomatically correct English. Social
implications of language use and language teaching are just the type of problem that British
linguists following a Firthian approach are interested in.
In this chapter, we will describe the principal approaches to finding collocations: selection of
collocations by frequency, selection based on mean and variance of the distance between focal
word and collocating word, hypothesis testing, and mutual information.
• Frequency
Surely the simplest method for finding collocations in a text corpus is counting. If two words
occur together a lot, then that is evidence that they have a special function that is not simply
explained as the function that results from their combination. Since MWEs generally get
institutionalized, the frequency of the collocation is a good first indicator of MWEness, given a
large enough corpus. Hence candidate collocations are ranked by their frequency of occurrence in
the corpus. The drawback of relying on raw frequency alone is that it requires a large corpus:
a phrase that occurs only a few times provides too little evidence to draw any reliable
conclusion about whether it behaves as an MWE.
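The frequency-based ranking described above can be sketched in a few lines of Python. This is only a minimal illustration: the toy token list is hypothetical, and the function name `bigram_frequencies` is our own choice rather than part of any tool used in this thesis.

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical toy corpus; a real experiment needs a far larger one.
tokens = "new york is a big city and new york has new companies".split()

counts = bigram_frequencies(tokens)
# Candidate collocations ranked by descending frequency.
ranked = counts.most_common()
```

On such a small sample almost every bigram occurs once, which illustrates why raw frequency is informative only on a large corpus.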
• Mean and Variance
Frequency-based search works well for fixed phrases. But many collocations consist of two
words that stand in a more flexible relationship to one another. Consider the verb knock and one
of its most frequent arguments, door. Here are some examples of knocking on or at a door:
(3.35) she knocked on his door.
(3.36) they knocked at the door.
(3.37) 100 women knocked on Donaldson’s door.
(3.38) a man knocked on the metal front door.
The words that appear between knocked and door vary and the distance between the two
words is not constant so a fixed phrase approach would not work here. But there is enough
regularity in the patterns to allow us to determine that knock is the right verb to use in English for
this situation, not hit, beat or rap.
A short note is in order here on collocations that occur as a fixed phrase versus those that are
more variable. To simplify matters we only look at fixed phrase collocations in most of the cases,
and usually at just bigrams. But it is easy to see how to extend techniques applicable to bigrams
to bigrams at a distance. We define a collocational window (usually a window of 3 to 4 words on
each side of a word), and we enter every word pair in there as a collocational bigram. We then
proceed to do the calculations as usual on this larger pool of bigrams. However, the mean and
variance based methods described in this section by definition look at the pattern of varying
distance between two words. If that pattern of distances is relatively predictable, then we have
evidence for a collocation like knock . . . door that is not necessarily a fixed phrase.
The mean is simply the average offset. The variance measures how much the individual
offsets deviate from the mean. We estimate it as follows.
$$\sigma^2 = \frac{\sum_{i=1}^{n} (d_i - \mu)^2}{n - 1} \tag{3.18}$$
where n is the number of times the two words co-occur, $d_i$ is the offset for co-occurrence i, and µ
is the mean. If the offset is the same in all cases, then the variance is zero. If the offsets are
randomly distributed (which will be the case for two words which occur together by chance, but
not in a particular relationship), then the variance will be high. As is customary, we use the
standard deviation $\sigma = \sqrt{\sigma^2}$, the square root of the variance, to assess how variable the offset
between two words is.
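As a rough sketch of the mean and variance method, the following Python code collects the offsets between knocked and door in examples (3.35) to (3.38) and computes their mean and sample standard deviation. The helper names and the window size of 5 are assumptions made for this illustration only.

```python
import math

def offsets(tokens, w1, w2, window=5):
    """Signed distances from each occurrence of w1 to every w2 within the window."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        found.extend(j - i for j in range(lo, hi) if tokens[j] == w2)
    return found

def mean_and_std(ds):
    """Mean offset and sample standard deviation (n - 1 in the denominator)."""
    n = len(ds)
    mu = sum(ds) / n
    variance = sum((d - mu) ** 2 for d in ds) / (n - 1)
    return mu, math.sqrt(variance)

sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on Donaldson's door",
    "a man knocked on the metal front door",
]

ds = []
for s in sentences:
    ds.extend(offsets(s.split(), "knocked", "door"))

mu, sd = mean_and_std(ds)
```

For these four sentences the offsets are 3, 3, 3 and 5, so the mean is 3.5 and the standard deviation is 1.0. A low deviation of this kind suggests a fairly regular relationship between the two words. The estimate needs at least two co-occurrences, since the denominator is n - 1.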
• Hypothesis Testing
One difficulty that we have glossed over so far is that high frequency and low variance can
be accidental. If the two constituent words of a frequent bigram are frequently occurring words,
then we expect the two words to co-occur a lot just by chance, even if they do not form a
collocation. What we really want to know is whether two words occur together more often than
chance. Assessing whether or not something is a chance event is one of the classical problems of
statistics. It is usually couched in terms of hypothesis testing. We formulate a null hypothesis H0
that there is no association between the words beyond chance occurrences, compute the
probability p that the event would occur if H0 were true, and then reject H0 if p is too low
(typically if beneath a significance level of p < 0.05, 0.01, 0.005, or 0.001) and retain H0 as
possible otherwise (Significance at a level of 0.05 is the weakest evidence that is normally
accepted in the experimental sciences. The large amount of data commonly available for
statistical NLP tasks means that we can often expect to achieve greater levels of significance).
• T-test
We need a statistical test that tells us how probable or improbable it is that a certain
constellation will occur. A test that has been widely used for collocation discovery is the t test.
The t test looks at the mean and variance of a sample of measurements, where the null hypothesis
is that the sample is drawn from a distribution with mean µ. The test looks at the difference
between the observed and expected means, scaled by the variance of the data, and tells us how
likely one is to get a sample of that mean and variance (or a more extreme mean and variance)
assuming that the sample is drawn from a normal distribution with mean µ. To determine the
probability of getting our sample (or a more extreme sample), we compute the t statistic:
$$t \approx \frac{C(X,Y) - \dfrac{C(X)\,C(Y)}{N}}{\sqrt{C(X,Y)}} \tag{3.19}$$
Here C(X) and C(Y) are the frequencies of word X and word Y in the corpus, respectively,
C(X,Y) is the frequency of the bigram <X Y>, and N is the total number of tokens in the corpus. The
bigram count can be extended to the frequency of word X when it is followed or preceded by Y
within a window of K words (here K=1). If the t statistic is large enough, we can reject the null
hypothesis.
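The t statistic of equation (3.19) is easy to compute once the counts are available. The sketch below uses purely illustrative counts for the bigram new companies; they are not taken from the corpora used in this thesis. The threshold 2.576 is the usual large-sample critical value for a significance level of 0.005.

```python
import math

def t_score(c_xy, c_x, c_y, n):
    """Approximate t statistic for a bigram, following equation (3.19)."""
    expected = c_x * c_y / n          # bigram count expected under independence
    return (c_xy - expected) / math.sqrt(c_xy)

# Illustrative counts (assumed): C(new), C(companies), C(new companies), N.
t = t_score(c_xy=8, c_x=15828, c_y=4675, n=14307668)

# Reject the null hypothesis of independence only if t exceeds the critical value.
is_collocation = t > 2.576
```

With these counts t is just under 1, well below the critical value, so the null hypothesis of independence cannot be rejected for this bigram.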
Use of the t-test has been criticized because it assumes that probabilities are approximately
normally distributed, which is not true in general. An alternative test for dependence which does
not assume normally distributed probabilities is the χ2-test (pronounced “chi-square test”). In the
simplest case, this χ2 test is applied to a 2-by-2 table as shown in Table 3.3. The essence of the
test is to compare the observed frequencies in the table with the frequencies expected for
independence. If the difference between observed and expected frequencies is large, then we can
reject the null hypothesis of independence.
                   X = new                      X ≠ new
Y = companies      n11 (new companies)          n12 (e.g., old companies)
Y ≠ companies      n21 (e.g., new machines)     n22 (e.g., old machines)
Table 3.3: A 2-by-2 table showing the dependence of occurrences of new and companies
Each cell of Table 3.3 holds a frequency; e.g., n11 denotes the frequency
of the phrase “new companies”. The χ2 statistic sums the squared differences between observed and
expected values over all cells of the table, scaled by the magnitude of the expected values, as
shown in equation (3.20):
$$\chi^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}} \tag{3.20}$$
where $m_{ij} = \frac{\sum_{j} n_{ij}}{N} \times \frac{\sum_{i} n_{ij}}{N} \times N$ is the expected frequency of cell (i, j) and N is the number of tokens in the corpus.
This result is the same as we got with the t statistic. In general, for the problem of finding
collocations, the differences between the t statistic and the chi square statistic do not seem to be
large. However, this test is also appropriate for large probabilities, for which the normality
assumption of the T-test fails. This is perhaps the reason that the χ2 test has been applied to a
wider range of problems in collocation discovery.
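A minimal sketch of the χ2 computation for a 2-by-2 table is given below. The expected value of each cell is obtained from the row and column totals, as in the independence assumption described above. The cell counts are illustrative values for new and companies, not results from this thesis, and 3.841 is the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2-by-2 contingency table."""
    observed = [[n11, n12], [n21, n22]]
    n = n11 + n12 + n21 + n22
    row_totals = [n11 + n12, n21 + n22]
    col_totals = [n11 + n21, n12 + n22]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count of cell (i, j) if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Illustrative counts (assumed), laid out as in Table 3.3.
chi2 = chi_square_2x2(n11=8, n12=4667, n21=15820, n22=14287181)

# One degree of freedom: reject independence at the 0.05 level if chi2 > 3.841.
reject_independence = chi2 > 3.841
```

For these counts the statistic is about 1.55, so independence is not rejected, which agrees with the observation that the t statistic and the χ2 statistic often lead to similar decisions.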
The explanation of the variables of the above equation is described later. PMI represents the
amount of information provided by the occurrence of the event represented by X about the
occurrence of the event represented by Y.
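As a hedged illustration, PMI can be estimated from corpus counts as follows. The sketch assumes the commonly used definition PMI(X, Y) = log2( P(X, Y) / (P(X) P(Y)) ) with maximum-likelihood probability estimates; the counts are hypothetical.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (in bits) from raw counts.

    Assumes the usual definition log2(P(x, y) / (P(x) * P(y))), with
    probabilities estimated as relative frequencies over n tokens.
    """
    p_xy = c_xy / n
    p_x = c_x / n
    p_y = c_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: the pair occurs ten times more often than chance predicts.
score = pmi(c_xy=10, c_x=100, c_y=100, n=10000)
```

A score near zero indicates that the two words co-occur about as often as expected by chance, while a large positive score indicates a strong association. Estimates based on very small counts are unreliable.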
of the MWE. It has been used to model compositionality, classify semantics and to identify
MWEs.
Semantic similarity is the process of modeling the semantics of the whole via the semantics
of the parts, notably in comparing corresponding components of a pairing of MWEs and
inferring that the MWEs are similar in the instance that the components are similar. This
approach is effective for interpreting the semantics of MWEs, and has been applied to the tasks
of semantic interpretation and compositionality modeling.
Linguistic properties of MWEs can be used to model MWEs, e.g. based on the output of a
parser. They tend to be highly construction-specific, and are high-precision but low-recall. As a
result, they tend to be combined with other approaches rather than form standalone
methodologies. This approach has been applied to MWE extraction and semantic classification.
Finally, we discuss collocations as a subset of MWEs and analyze different statistical
methodologies used to identify the co-occurrence properties of a phrase in a corpus, based on the
frequency of occurrence of the individual words as well as of the entire phrase.
Chapter 4
This chapter describes the resources used in our research, including corpora, lexical
resources, dictionaries and software. The resources vary in coverage, usage and language, and
not only provide fundamental knowledge to understand context, but are also used in some cases
to evaluate the proposed models.
4.1 Corpus
1 http://www.rabindra-rachanabali.nltr.org
2 http://www.becs.ac.in
3 http://www.iitkgp.ac.in
Chapter 4: Resources 68
motivation to spread the Bengali writings of the great Indian Nobel laureate
Rabindranath Tagore.
This site contains a huge number of poems, short and long stories, novels and dramas that are
part of Rabindrarachanaboli, the collected works of Rabindranath Tagore. It also contains
hundreds of lyrics of Bengali songs written by Rabindranath Tagore, which are well known as
Rabindra Sangit in Bengali. The developers also attempt to include the unpublished articles and
letters of Rabindranath Tagore. The corpus can potentially be used to identify the writing style of
Rabindranath Tagore using NLP techniques.
This site does not provide any direct link to download the articles of Rabindranath Tagore.
Moreover, the site is dynamic in nature, so the articles cannot be crawled by a crawler. We
therefore extracted the articles by copy and paste and created a separate file for each article. When
the documents were pasted, words and letters were scattered, and sometimes a few letters were not
supported in the Microsoft Word document file. So, using them properly in our experiments was
itself a challenging task.
4 http://wacky.sslmit.unibo.it/
• itWaC: a 2 billion word corpus constructed from the Web limiting the crawl to the .it
domain and using medium-frequency words from the Repubblica corpus and basic Italian
vocabulary lists as seeds. The corpus was POS-tagged with the TreeTagger, and
lemmatized using the Morph-it! Lexicons.
• ukWaC: a 2 billion word corpus constructed from the Web limiting the crawl to the .uk
domain and using medium-frequency words from the BNC as seeds. The corpus was
POS-tagged and lemmatized with the TreeTagger.
5 http://wordnet.princeton.edu/
4.3 Tools
4.3.1 Bengali Shallow Parser
The Bengali Shallow Parser 6 is the first shallow parser for the Bengali language, developed by the
Language Technologies Research Center, International Institute of Information Technology (IIIT),
Hyderabad. It analyzes a Bengali sentence at various levels. The analysis begins at the
morphological level and accumulates the results of the POS tagger and the chunker. The final output
combines the results of all these levels and shows them in a single representation (called Shakti
Standard Format). The details of the tool are given in the documentation that accompanies the
downloaded software.
6 http://ltrc.iiit.ac.in/analyzer/bengali
knowledge about the entities. But modeling the joint distribution can lead to difficulties when
using the rich local features that can occur in relational data, because it requires modeling the
distribution p(x), which can include complex dependencies. Modeling these dependencies among
inputs can lead to intractable models, but ignoring them can lead to reduced performance.
A solution to this problem is to directly model the conditional distribution p(y|x), which is
sufficient for classification. This is the approach taken by conditional random fields (Lafferty et
al. 2001). A conditional random field is simply a conditional distribution p(y|x) with an
associated graphical structure. Because the model is conditional, dependencies among the input
variables x do not need to be explicitly represented, affording the use of rich, global features of
the input. For example, in natural language tasks, useful features include neighboring words and
word bigrams, prefixes and suffixes, capitalization, membership in domain-specific lexicons, and
semantic information from sources such as WordNet. Recently there has been an explosion of
interest in CRFs, with successful applications including text processing (Taskar et al. 2002),
bioinformatics (Sato and Sakakibara 2005), and computer vision (Kumar and Hebert 2003).
In this section, we define CRFs with general graphical structure, as they were introduced
originally (Lafferty et al. 2001). Although initial applications of CRFs used linear chains, there
have been many later applications of CRFs with more general graphical structures. Also,
although CRFs have typically been used for across-network classification, in which the training
and testing data are assumed to be independent, we will see that CRFs can be used for within-
network classification as well, in which we model probabilistic dependencies between the
training and testing data. The generalization from linear-chain CRFs to general CRFs is fairly
straightforward. We simply move from using a linear-chain factor graph to a more general factor
graph, and from forward-backward to more general (perhaps approximate) inference algorithms.
First we present the general definition of a conditional random field. Let G be a factor graph
over Y. Then p(y|x) is a conditional random field if for any fixed x, the distribution p(y|x)
factorizes according to G. Thus, every conditional distribution p(y|x) is a CRF for some, perhaps
trivial, factor graph. If F = { ψA} is the set of factors in G, and each factor takes the exponential
family form, then the conditional distribution can be written as
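$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{\psi_A \in G} \exp\left\{ \sum_{k=1}^{K(A)} \lambda_{Ak} f_{Ak}(\mathbf{y}_A, \mathbf{x}_A) \right\} \tag{4.1}$$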
In addition, practical models rely extensively on parameter tying. For example, in the linear-
chain case, often the same weights are used for the factors ψt(yt, yt−1, xt) at each time step. To
denote this, we partition the factors of G into C = {C1,C2, . . .CP }, where each Cp is a clique
template whose parameters are tied. This notion of clique template generalizes that in Taskar et
al. (2002). Each clique template Cp is a set of factors which has a corresponding set of sufficient
statistics {fpk(xp, yp)}. Then the CRF can be written as
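$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{C_p \in \mathcal{C}} \prod_{\psi_c \in C_p} \psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p) \tag{4.2}$$
where the normalization function is
$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{C_p \in \mathcal{C}} \prod_{\psi_c \in C_p} \psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p) \tag{4.4}$$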
For example, in a linear-chain conditional random field, typically one clique template
C = {ψt(yt,yt−1, xt)} is used for the entire network. Several special cases of conditional random
fields are of particular interest. First, dynamic conditional random fields (Sutton et al. 2004) are
sequence models which allow multiple labels at each time step, rather than single labels as in
linear-chain CRFs. Second, relational Markov networks (Taskar et al. 2002) are a type of general
CRF in which the graphical structure and parameter tying are determined by an SQL-like syntax.
Finally, Markov logic networks (Richardson and Domingos 2005; Singla and Domingos 2005)
are a type of probabilistic logic in which there are parameters for each first-order rule in a
knowledge base.
4.3.3 WordNet::Similarity
We used the open-source WordNet::Similarity package (Patwardhan et al. 2003)7 to compute
word similarities. WordNet::Similarity is developed at the University of Minnesota, and provides
various methods to measure the similarity or relatedness between a pair of concepts or word
senses. It contains implementations of a variety of comparison methods, of three basic types:
similarity, relatedness and random.
The similarity methods are categorized into two groups: path-based (LCH (Leacock and
Chodorow 1998) and WUP (Wu and Palmer 1994)) and information-content based (RES (Resnik
1995), JCN (Jiang and Conrath 1997), and LIN (Lin 1998c)), as summarized in Figure 4.1. Path-
based methods compute lexical similarity based on the shortest path between two target synsets
based on the WordNet is-a hierarchy. The difference between LCH and WUP is in the
calculation of path length. LCH calculates the path length between two target concepts (c1 and
c2) based on Equation 4.5:
$$\mathrm{Similarity}_{LCH}(c_1, c_2) = -\log \frac{p}{2 \times depth} \tag{4.5}$$
where p is the number of nodes in the shortest path connecting c1 and c2, and depth is the
maximum depth of the WordNet hierarchy.
WUP, on the other hand, is based on the path length to the root node from the least
common subsumer (LCS) of the two target concepts (c1 and c2). The LCS is defined as that
concept at greatest depth in the WordNet hierarchy that subsumes both c1 and c2. The
calculation of similarity is based on Equation 4.6.
$$\mathrm{Similarity}_{WUP}(c_1, c_2) = \frac{2 \times p_3}{p_1 + p_2 + 2 \times p_3} \tag{4.6}$$
where p1 and p2 are the numbers of nodes on the paths from c1 and c2, respectively, to the LCS, and p3 is the
number of nodes on the path between the LCS and the root.
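The two path-based measures can be sketched directly from path lengths. The code below assumes the usual LCH form, -log(p / (2 × depth)) with a natural logarithm, and the WUP form 2 × p3 / (p1 + p2 + 2 × p3). The path lengths are hypothetical and are not read from WordNet; in practice the WordNet::Similarity package computes them from the is-a hierarchy.

```python
import math

def lch_similarity(p, depth):
    """LCH: negative log of the path length scaled by twice the taxonomy depth."""
    return -math.log(p / (2 * depth))

def wup_similarity(p1, p2, p3):
    """WUP: p1, p2 are path lengths from the concepts to their LCS,
    p3 is the path length from the LCS to the root."""
    return 2 * p3 / (p1 + p2 + 2 * p3)

# Hypothetical path lengths for two concepts in a taxonomy of depth 16.
lch = lch_similarity(p=4, depth=16)
wup = wup_similarity(p1=2, p2=2, p3=4)
```

In both measures a shorter path between the concepts gives a higher score. WUP additionally rewards a deep LCS, i.e. a shared ancestor that is itself specific.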
RES, JCN and LIN augment the calculation of path length with the information content (IC)
of the LCS, calculated as follows:
7 www.d.umn.edu/~tpederse/similarity.html
$$IC(c) = -\log \frac{freq(c)}{freq(root)} \tag{4.7}$$
where freq(c) is the frequency of a given concept c, and freq(root) is the frequency of the root of
the hierarchy.
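RES calculates the similarity of two concepts by the information content of their LCS:
$$\mathrm{Similarity}_{RES}(c_1, c_2) = IC(LCS(c_1, c_2)) \tag{4.8}$$
JCN is an extension of RES, where the path length between the two concepts is included in
the calculation, based on:
$$\mathrm{Similarity}_{JCN}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \times IC(LCS(c_1, c_2))} \tag{4.9}$$
LIN instead takes the ratio of the information content of the LCS to that of the two concepts:
$$\mathrm{Similarity}_{LIN}(c_1, c_2) = \frac{2 \times IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)} \tag{4.10}$$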
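The information-content measures can be sketched in the same way. The forms of RES, JCN and LIN used below are the ones commonly associated with these measures (RES as IC of the LCS, JCN as the inverse of IC(c1) + IC(c2) - 2 × IC(LCS), LIN as 2 × IC(LCS) / (IC(c1) + IC(c2))) and are stated here as assumptions. The concept frequencies are hypothetical.

```python
import math

def information_content(freq_c, freq_root):
    """IC(c) = -log(freq(c) / freq(root)), as in equation (4.7)."""
    return -math.log(freq_c / freq_root)

def res_similarity(ic_lcs):
    return ic_lcs

def jcn_similarity(ic_c1, ic_c2, ic_lcs):
    # Undefined when the two concepts and their LCS carry identical information.
    return 1.0 / (ic_c1 + ic_c2 - 2 * ic_lcs)

def lin_similarity(ic_c1, ic_c2, ic_lcs):
    return 2 * ic_lcs / (ic_c1 + ic_c2)

# Hypothetical corpus frequencies of two concepts, their LCS and the root.
ic_c1 = information_content(10, 1000)
ic_c2 = information_content(20, 1000)
ic_lcs = information_content(100, 1000)

res = res_similarity(ic_lcs)
jcn = jcn_similarity(ic_c1, ic_c2, ic_lcs)
lin = lin_similarity(ic_c1, ic_c2, ic_lcs)
```

Rare concepts have high information content, so two concepts whose LCS is itself rare (specific) are judged more similar by all three measures.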
The relatedness measures use additional relations such as has-part, is-made-of and is-an-
attribute-of in addition to the is-a relation. There are three relatedness measures: HSO (Hirst and
St-Onge 1998), LESK (Banerjee and Pedersen 2003) and VECTOR (Patwardhan 2003). HSO is
based on path similarity, and takes into consideration sequences of lexical relations connecting
synsets in the WordNet hierarchy that are likely to be indicative of word-level (rather than sense)
relatedness. LESK is based on the weighted word overlap of different pairings of synset glosses,
over a variety of relation types.
VECTOR is a corpus-based measure. Each word is represented as a multi-dimensional vector
of co-occurring words. The similarity of a word pair is measured by the cosine similarity of the
two vectors. In Equation 4.11, $\vec{v}_1$ and $\vec{v}_2$ are the vectors of the two target words:
$$\mathrm{Similarity}_{VECTOR}(c_1, c_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \times \lVert \vec{v}_2 \rVert} \tag{4.11}$$
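The cosine similarity of two co-occurrence vectors can be computed as in the following sketch. The vectors are hypothetical co-occurrence counts over a shared vocabulary; the function name is our own.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical co-occurrence counts of two target words over four context words.
v_first = [3, 0, 1, 2]
v_second = [6, 0, 2, 4]   # same distribution, twice the counts
v_third = [0, 5, 0, 0]    # shares no context word with v_first

same_direction = cosine_similarity(v_first, v_second)
orthogonal = cosine_similarity(v_first, v_third)
```

Because the measure depends only on the direction of the vectors, a frequent word and a rare word with the same distribution of contexts receive a similarity of 1.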
Chapter 5
Automatic Extraction of Keyphrases
This chapter presents in detail our approach to the automatic extraction of keyphrases using
Conditional Random Fields (CRF). A keyphrase is a word or a set of words that describes the close
relationship between content and context in a document. Keyphrases sometimes show multi-word
characteristics in that they act as special meaning-bearing units. Named entities also belong
to the class of MWEs, and they definitely act as keywords in the document. The system is trained
on 144 scientific articles and tested on 100 scientific articles. Different combinations of
features have been used. With reader- and author-assigned keyphrases, the system achieves a
precision of 17.80%, a recall of 18.21% and an F-measure of 18.00% with the top 15 candidates.
Automatic keyphrase extraction from a document can be used in summarization when keywords
or query words are not available: the extracted keyphrases can serve as keywords to generate
the summary.
5.1 Introduction
A keyphrase is a word or set of words that describes the close relationship between content and
context in the document. Keyphrases are simplex nouns or noun phrases (NPs) that represent the
key ideas of the document. They can serve as a representative summary of the document
and as high-quality index terms (Kim and Kan, 2009). Keyphrases can be used in
various natural language processing (NLP) applications such as summarization, information
retrieval (IR) and question answering (QA). Keyphrase extraction also plays an important role in
search engines. With the advancement of research, many attempts at automatic keyphrase
extraction have been made, and this work is one of them. The need for CRF is discussed in
Section 5.2, followed by the system design, experimental results and conclusion in Sections 5.3,
5.4 and 5.5, respectively.
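$$p(z_{1:N} \mid x_{1:N}) = \frac{1}{Z} \exp\left( \sum_{n=1}^{N} \sum_{i=1}^{F} \lambda_i f_i(z_{n-1}, z_n, x_{1:N}, n) \right) \tag{5.1}$$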
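Let us walk through the model in detail. The scalar Z is the normalization factor, or partition
function, which makes the expression a valid probability. Z is defined as a sum over an exponential
number of tag sequences,
$$Z = \sum_{z_{1:N}} \exp\left( \sum_{n=1}^{N} \sum_{i=1}^{F} \lambda_i f_i(z_{n-1}, z_n, x_{1:N}, n) \right) \tag{5.2}$$
and is therefore difficult to compute in general. It may be noted that Z implicitly depends on x1:N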
and the parameters λ. The big exponential function is there for historical reasons, with
connection to the exponential family distribution. For now, it is sufficient to note that λ and f()
can take arbitrary real values, and the whole exp function will be non-negative. Within the exp()
function, we sum over n = 1, . . . ,N word positions in the sequence. For each position, we sum
over i = 1, . . . , F weighted features. The scalar λi is the weight for feature fi(). The λi’s are the
parameters of the CRF model, and must be learned, similar to θ = {π, ϕ, A} in HMMs.
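$$f_1(z_{n-1}, z_n, x_{1:N}, n) = \begin{cases} 1 & \text{if } z_n = \text{PERSON and } x_n = \text{John} \\ 0 & \text{otherwise} \end{cases} \tag{5.3}$$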
How is this feature used? It depends on its corresponding weight λ1. If λ1 > 0, whenever f1 is
active (i.e. we see the word John in the sentence and we assign it tag PERSON), it increases the
probability of the tag sequence z1:N. This is another way of saying the CRF model should prefer
the tag PERSON for the word John. If on the other hand λ1 < 0, the CRF model will try to avoid
the tag PERSON for John. Which way is correct? One may set λ1 by domain knowledge (we
know it should probably be positive), or learn λ1 from corpus (let the data tell us), or both
(treating domain knowledge as prior on λ1). It may be noted that λ1, f1() together is equivalent to
(the log of) HMM’s ϕ parameter p(x = John|z = PERSON).
As another example, consider
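$$f_2(z_{n-1}, z_n, x_{1:N}, n) = \begin{cases} 1 & \text{if } z_n = \text{PERSON and } x_{n+1} = \text{said} \\ 0 & \text{otherwise} \end{cases} \tag{5.4}$$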
This feature is active if the current tag is PERSON and the next word is ‘said’. One would
therefore expect a positive λ2 to go with the feature. Furthermore, functions f1 and f2 can be both
active for a sentence like “John said so.” where z1 = PERSON. This is an example of overlapping
features. It boosts up the belief of z1 = PERSON to λ1+ λ2. This is something HMMs cannot do:
HMMs cannot look at the next word, nor can they use overlapping features. The next feature
example is rather like the transition matrix A in HMMs. We can define
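$$f_3(z_{n-1}, z_n, x_{1:N}, n) = \begin{cases} 1 & \text{if } z_{n-1} = \text{OTHER and } z_n = \text{PERSON} \\ 0 & \text{otherwise} \end{cases} \tag{5.5}$$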
This feature is active if we see the particular tag transition (OTHER, PERSON). Note that it is the
value of λ3 that actually specifies the equivalent of the (log) transition probability from OTHER to
PERSON, or $A_{\mathrm{OTHER},\mathrm{PERSON}}$ in HMM notation. In a similar fashion, we can define all $K^2$
transition features, where K is the size of the tag set. Of course, the features are not limited to binary
functions. Any real-valued function is allowed.
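The three example features and the resulting linear-chain CRF probability can be sketched as follows. This is an illustrative toy only: the tag set, the weights and the START symbol used for the tag before the first word are assumptions, and the partition function Z is computed by brute-force enumeration of all tag sequences, which is feasible only for very short sentences.

```python
import math
from itertools import product

TAGS = ["PERSON", "OTHER"]

def f1(z_prev, z, x, n):
    """Active when the current word is 'John' and it is tagged PERSON."""
    return 1.0 if z == "PERSON" and x[n] == "John" else 0.0

def f2(z_prev, z, x, n):
    """Active when the current tag is PERSON and the next word is 'said'."""
    return 1.0 if z == "PERSON" and n + 1 < len(x) and x[n + 1] == "said" else 0.0

def f3(z_prev, z, x, n):
    """Active on the tag transition (OTHER, PERSON)."""
    return 1.0 if z_prev == "OTHER" and z == "PERSON" else 0.0

FEATURES = [f1, f2, f3]

def score(z, x, lambdas):
    """Weighted feature sum over all positions (the exponent of the model)."""
    total = 0.0
    for n in range(len(x)):
        z_prev = z[n - 1] if n > 0 else "START"   # assumed start symbol
        for lam, f in zip(lambdas, FEATURES):
            total += lam * f(z_prev, z[n], x, n)
    return total

def probability(z, x, lambdas):
    """p(z | x): exponentiated score divided by Z, with Z found by enumeration."""
    Z = sum(math.exp(score(zs, x, lambdas)) for zs in product(TAGS, repeat=len(x)))
    return math.exp(score(z, x, lambdas)) / Z

x = ["John", "said", "so"]
lambdas = [1.0, 1.0, 0.5]   # hypothetical weights
p_person = probability(("PERSON", "OTHER", "OTHER"), x, lambdas)
p_other = probability(("OTHER", "OTHER", "OTHER"), x, lambdas)
```

For the tagging (PERSON, OTHER, OTHER) both f1 and f2 fire at the first position, so its score is λ1 + λ2 = 2.0, which is the overlapping-feature effect described above.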
where Z is the normalization factor. In the special case of linear-chain CRFs, the cliques
correspond to a pair of states zn−1, zn as well as the corresponding x nodes, with
This is indeed the direct connection to factor graph representation as well. Each clique can be
represented by a factor node with the factor ψ(Xc), and the factor node connects to every node in
Xc. There is one additional special factor node which represents Z. A welcome consequence is that
the sum-product algorithm and max-sum algorithm immediately apply to Markov Random Fields
(and CRFs in particular). The factor corresponding to Z can be ignored during message passing.
$$\sum_{j=1}^{m} \log p(z^{(j)} \mid x^{(j)}) \tag{5.8}$$
Often one can also put a Gaussian prior on the λ’s to regularize the training (i.e., smoothing).
If $\lambda \sim N(0, \sigma^2)$, the objective becomes
$$\sum_{j=1}^{m} \log p(z^{(j)} \mid x^{(j)}) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2} \tag{5.9}$$
The good news is that the objective is concave, so the λ’s have a unique set of optimal
values. The bad news is that there is no closed form solution2.
The standard parameter learning approach is to compute the gradient of the objective
function, and use the gradient in an optimization algorithm like L-BFGS. The gradient of the
objective function is computed as follows:
1 Unlike HMMs, which can use the Baum-Welch (EM) algorithm to train on unlabeled data x only, training CRFs on
unlabeled data is difficult.
2 If this reminds you of logistic regression, you are right: logistic regression is a special case of CRF where there are
no edges among hidden states. In contrast, HMMs when trained on fully labeled data have simple and intuitive
closed form solutions.
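$$\frac{\partial}{\partial \lambda_k} \left( \sum_{j=1}^{m} \log p(z^{(j)} \mid x^{(j)}) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2} \right)$$
$$= \frac{\partial}{\partial \lambda_k} \left( \sum_{j=1}^{m} \left( \sum_{n=1}^{N} \sum_{i=1}^{F} \lambda_i f_i(z_{n-1}^{(j)}, z_n^{(j)}, x^{(j)}, n) - \log Z^{(j)} \right) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2} \right) \tag{5.10}$$
$$= \sum_{j=1}^{m} \sum_{n=1}^{N} f_k(z_{n-1}^{(j)}, z_n^{(j)}, x^{(j)}, n) - \sum_{j=1}^{m} \sum_{n=1}^{N} E_{z'_{n-1}, z'_n}\left[ f_k(z'_{n-1}, z'_n, x^{(j)}, n) \right] - \frac{\lambda_k}{\sigma^2} \tag{5.11}$$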
where we used the fact
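$$\frac{\partial}{\partial \lambda_k} \log Z = E_{z'}\left[ \sum_{n} f_k(z'_{n-1}, z'_n, x, n) \right] = \sum_{n} E_{z'_{n-1}, z'_n}\left[ f_k(z'_{n-1}, z'_n, x, n) \right] \tag{5.12}$$
$$= \sum_{n} \sum_{z'_{n-1}, z'_n} p(z'_{n-1}, z'_n \mid x) \, f_k(z'_{n-1}, z'_n, x, n) \tag{5.13}$$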
Note that the edge marginal probability $p(z'_{n-1}, z'_n \mid x)$ is computed under the current parameters, and this is
exactly what the sum-product algorithm can compute.
The partial derivative in (5.11) has an intuitive explanation. Let us ignore the term $\lambda_k/\sigma^2$ from
the prior. The derivative has the form of (observed counts of feature fk) minus (expected counts
of feature fk). When the two are the same, the derivative is zero, and there is no longer an
incentive to change λk. Therefore we see that training can be thought of as finding λ’s that match
the two counts.
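The "observed minus expected counts" form of the gradient can be checked on a toy example. The sketch below is an assumption-laden illustration: it uses two hypothetical features, a single labeled sentence, and brute-force enumeration of all tag sequences in place of the sum-product algorithm, which would be used in practice.

```python
import math
from itertools import product

TAGS = ["PERSON", "OTHER"]

def features(z_prev, z, x, n):
    """Two toy binary features f_k(z_{n-1}, z_n, x, n)."""
    return [
        1.0 if z == "PERSON" and x[n] == "John" else 0.0,
        1.0 if z_prev == "OTHER" and z == "PERSON" else 0.0,
    ]

def feature_counts(z, x):
    """Sum of each feature over all positions of one tagged sequence."""
    totals = [0.0, 0.0]
    for n in range(len(x)):
        z_prev = z[n - 1] if n > 0 else "START"   # assumed start symbol
        for k, value in enumerate(features(z_prev, z[n], x, n)):
            totals[k] += value
    return totals

def gradient(data, lambdas, sigma2=None):
    """Gradient of the (optionally regularized) conditional log-likelihood."""
    grad = [0.0] * len(lambdas)
    for x, z in data:
        observed = feature_counts(z, x)
        # Expected counts under the current model, by enumerating every tagging.
        all_counts = [feature_counts(s, x) for s in product(TAGS, repeat=len(x))]
        weights = [math.exp(sum(l * c for l, c in zip(lambdas, cs))) for cs in all_counts]
        Z = sum(weights)
        for k in range(len(lambdas)):
            expected = sum(w * cs[k] for w, cs in zip(weights, all_counts)) / Z
            grad[k] += observed[k] - expected
    if sigma2 is not None:
        # Contribution of the Gaussian prior: -lambda_k / sigma^2.
        grad = [g - l / sigma2 for g, l in zip(grad, lambdas)]
    return grad

data = [(["John", "said", "so"], ("PERSON", "OTHER", "OTHER"))]
grad = gradient(data, lambdas=[0.0, 0.0])
```

With all weights at zero every tagging is equally likely, so each feature has an expected count of 0.5. The first feature is observed once and the second never, giving a gradient of (0.5, -0.5): training would raise λ1 and lower λ2 until observed and expected counts match.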
understood as in combination with each tag value. Similarly one can test whether the
word is capitalized, the identity of the neighboring words, the part of speech of the word,
and so on. The state transition features are also atomic. From the large number of atomic
candidate features, a small number of features are selected by how much they improve
the CRF model (e.g., increase in the training set likelihood).
2. “Grow” candidate features. It is natural to combine features to form more complex
features. For example, one can test for current word being capitalized, the next word
being “Inc.”, and both tags being ORGANIZATION. However, the number of complex
features grows exponentially. A compromise is to only grow candidate features on
selected features so far, by extending them with one atomic addition, or other simple
Boolean operations.
Often any remaining atomic candidate features are added to the grown set. A small number
of features are selected, and added to the existing feature set. This stage is repeated until enough
features have been added.
Multiword Expressions
Chapter 5: Automatic Extraction of Keyphrases 82
iv) Term frequency (TF) range: The maximum value of the term frequency (max_TF) is
divided into five ranges of equal size (size_of_range), and each term frequency value is mapped to
the appropriate range (0 to 4). The term frequency range value is used as a feature, i.e.
$$\text{size\_of\_range} = \frac{\text{max\_TF}}{5}$$
Table 5.1 shows the range representation:
Term frequency range                          Class
0 to size_of_range                            0
size_of_range + 1 to 2*size_of_range          1
2*size_of_range + 1 to 3*size_of_range        2
3*size_of_range + 1 to 4*size_of_range        3
4*size_of_range + 1 to 5*size_of_range        4
This is done to have uniform values for the term frequency feature instead of random and
scattered values.
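The range mapping can be illustrated with a minimal sketch. It is not the code used in the thesis: the function name tf_range_class and the sample frequencies are hypothetical, and the class boundaries follow Table 5.1 under the assumption that term frequencies are integers.

```python
def tf_range_class(tf, max_tf, n_ranges=5):
    """Map a raw term frequency to a range class 0 .. n_ranges - 1."""
    if max_tf <= 0 or tf <= 0:
        return 0
    # Equivalent to ceil(tf / size_of_range) - 1 with
    # size_of_range = max_tf / n_ranges; integer arithmetic avoids
    # floating-point edge cases at the range boundaries.
    cls = (n_ranges * tf + max_tf - 1) // max_tf - 1
    return min(n_ranges - 1, max(0, cls))


# Hypothetical term frequencies for one document.
term_freqs = {"keyphrase": 10, "extraction": 4, "crf": 3, "model": 2, "the": 1}
max_tf = max(term_freqs.values())
tf_classes = {w: tf_range_class(tf, max_tf) for w, tf in term_freqs.items()}
print(tf_classes)
```

With max_TF = 10 the range size is 2, so frequencies 1 and 2 fall in class 0, 3 and 4 in class 1, and 10 in class 4.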
v) Word in Title: Every word is marked with T if found in the title else O to mark other. The
title word feature is useful because the words in title have a high chance to be a keyphrase.
vi) Word in Abstract: Every word is marked with A (Abstract) if found in the abstracts else
O to mark other. The abstract word feature is useful because the words in abstracts have a high
chance to be a keyphrase.
vii) Word in Body: Every word is marked with B (Body) if found in the body of the text else
O to mark other. It is a useful feature because words present in the body of the text are
distinguished from other words in the document.
viii) Word in Reference: Every word is marked with R if found in the references else O to
mark other. The reference word feature is useful because the words in references have a high
chance to be a keyphrase.
ix) Stemming: The Porter Stemmer algorithm is used to stem every word and the output
stem for each word is used as a feature. This is because words in keyphrases can appear in
different inflected forms.
x) Context word feature: The preceding and the following word of the current word are
considered as context feature since keyphrases can be a group of words.
Automatic identification of keyphrases is our main task. To perform this task, the data
provided for SemEval-2 Task #5³ is used for both training and testing. In total, 144
scientific articles are provided for training and another 100 documents have been
marked for testing. All the files are cleaned by placing spaces before and after every punctuation
mark and removing the citations in the text. The author names appearing after the paper title were
removed. In the reference section, only the paper or book title was kept and all other details were
deleted.
One algorithm has been defined to extract the title from a document. Another algorithm has
been defined to extract the positional feature of a word, i.e., whether the word is present in title,
abstracts, body or in references.
• Algorithm 1: Algorithm to extract the title.
Step 1: Read the lines one by one from the beginning of the article until a '.' (dot) or '@' is found
in a line ('.' occurs in an author's name and '@' occurs in an author's e-mail address).
Step 2: If '.' is found first, every line before it is extracted as the title and returned.
Step 3: If '@' is found first, extract all the lines before it.
Step 4: Check the extracted lines one by one from the beginning.
Step 5: Take a line and extract all of its words. Check whether these words are repeated in the
article (excluding the references). If none of them is repeated, stop and return all the
previous lines as the title.
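The steps of Algorithm 1 can be sketched as follows. This is only one possible reading of the steps, not the thesis implementation: the function names are hypothetical, "found first" is taken to mean whichever of '.' or '@' appears first in the first line containing either, and a word counts as "repeated" when it occurs more than once in the article text passed in (which is assumed to exclude the references).

```python
import re
from collections import Counter


def words(text):
    """Lower-cased alphabetic tokens of a string."""
    return re.findall(r"[A-Za-z]+", text.lower())


def extract_title(lines, article_text):
    head, marker = [], None
    for line in lines:                      # Step 1
        dot, at = line.find("."), line.find("@")
        if dot == -1 and at == -1:
            head.append(line)
            continue
        marker = "@" if dot == -1 or (at != -1 and at < dot) else "."
        break
    if marker != "@":                       # Step 2
        return head
    counts = Counter(words(article_text))   # Steps 3-5
    title = []
    for line in head:
        ws = words(line)
        # A line whose words never recur (e.g. author names) ends the title.
        if ws and all(counts[w] <= 1 for w in ws):
            break
        title.append(line)
    return title


lines = [
    "Keyphrase Extraction with Conditional Random Fields",
    "Jane Roe",
    "jroe@example.org",
    "We study keyphrase extraction with conditional random fields",
]
title = extract_title(lines, " ".join(lines))
print(title)
```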
3
http://semeval2.fbk.eu/semeval2.php?location=data
The features used in the keyphrase extraction system are identified in the following ways.
Step 1: Dependency parsing is done with the Stanford Parser⁴. The output of the parser is
reformatted so that each line contains one word together with its associated tags.
Step 2: The same output is used for chunking; for every word it identifies whether the
word is part of a noun phrase or a verb phrase.
Step 3: The Stanford POS Tagger⁵ is used for POS tagging of the documents.
Step 4: The term frequency (TF) range is identified as defined before.
Step 5: Using the algorithms described in Section 5.3.1, every word is marked as T or O for
the title word feature, as A or O for the abstract word feature, as B or O for the
body word feature and as R or O for the reference word feature.
Step 6: The Porter Stemming Algorithm⁶ is used to identify the stem of every word, which is
used as another feature.
Step 7: In the training data with the combined keyphrases, the words that begin a keyphrase
are marked with B-KP and the words that occur inside a keyphrase are marked with I-KP.
All other words are marked as O. For the test data, only O is marked in this column.
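The labelling in Step 7 can be sketched as below. The function name bio_tag and the sample sentence are hypothetical, and exact case-insensitive token matching is an assumption; the thesis does not state how keyphrases were aligned with the text.

```python
def bio_tag(tokens, keyphrases):
    """Label tokens with B-KP / I-KP / O given a list of gold keyphrases."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in keyphrases:
        target = phrase.lower().split()
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Only label spans that match and are not already labelled.
            if lowered[i:i + n] == target and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-KP"
                tags[i + 1:i + n] = ["I-KP"] * (n - 1)
    return tags


tokens = "we use conditional random fields for keyphrase extraction".split()
tags = bio_tag(tokens, ["conditional random fields", "keyphrase extraction"])
print(list(zip(tokens, tags)))
```

For the test data the same column would simply be filled with "O" for every token.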
4
http://nlp.stanford.edu/software/lex-parser.shtml
5
http://nlp.stanford.edu/software/tagger.shtml
6
http://tartarus.org/~martin/PorterStemmer/
A template file is created in order to train the system using the feature file generated from the
training set by the procedure described in the previous section. Training is carried out with the
C++ based CRF++ 0.53 package⁷, which is readily available as open source for segmenting or
labeling sequential data, and produces a model file. The model file is required to run the test files.
The feature file is created again from the test set using the steps outlined in Section
5.3.2, except Step 7. For the test set, the last feature column, i.e. the keyphrase column, is marked with
‘O’. This feature file is used with the CRF++ 0.53 package. After the test
files are run through the system, it produces an output file with the keyphrases marked with B-KP
and I-KP. All the keyphrases are extracted from the output file and stemmed using the Porter
Stemmer.
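Collecting the keyphrases from the labelled output can be sketched as follows. The sketch assumes the output has already been read into (token, label) pairs; the function name and the sample rows are hypothetical, and the final stemming step is omitted.

```python
def extract_keyphrases(rows):
    """Collect phrases from (token, label) pairs labelled B-KP / I-KP / O."""
    phrases, current = [], []
    for token, label in rows:
        if label == "B-KP":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I-KP" and current:
            current.append(token)
        else:
            # An O label (or a stray I-KP with no open phrase) closes the phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


rows = [("we", "O"), ("use", "O"), ("conditional", "B-KP"), ("random", "I-KP"),
        ("fields", "I-KP"), ("for", "O"), ("keyphrase", "B-KP"), ("extraction", "I-KP")]
phrases = extract_keyphrases(rows)
print(phrases)
```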
In all tables, P, R and F denote micro-averaged precision, recall and F-score. For the baselines,
1-, 2- and 3-grams were used as candidates and TF·IDF as features. In Table 5.2, TF·IDF is an
unsupervised method that ranks the candidates based on their TF·IDF scores. NB and ME are supervised
methods using Naïve Bayes and maximum entropy in WEKA. In the second column, R denotes the
use of the reader-assigned keyword set as gold-standard data and C denotes the use of the combined
keywords, i.e. both the author-assigned and reader-assigned keyword sets, as answers. There are three
sets of scores. The first set, i.e. Top 5 candidates, is obtained by evaluating only the top 5
7
http://crfpp.sourceforge.net/
keyphrases from the evaluated data. Similarly, the Top 10 candidates set is obtained by evaluating the top 10
keyphrases, and the Top 15 candidates set is obtained by evaluating all 15 keyphrases. The
evaluation results of the CRF based keyphrase extraction system are shown in Table 5.3.
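The micro-averaged scores at a cut-off of k candidates can be sketched as below. This is an illustration under stated assumptions, not the official scorer of the shared task: the function name, the document ids and the phrases are hypothetical, and phrases are compared by exact string match (the thesis compares stemmed forms).

```python
def micro_prf(predicted, gold, k):
    """Micro-averaged precision, recall and F-score over the top-k candidates."""
    matched = n_pred = n_gold = 0
    for doc_id, answers in gold.items():
        top_k = predicted.get(doc_id, [])[:k]
        matched += sum(1 for phrase in top_k if phrase in answers)
        n_pred += len(top_k)
        n_gold += len(answers)
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical ranked system output and gold-standard answers.
predicted = {"d1": ["a", "b", "c"], "d2": ["x", "y"]}
gold = {"d1": {"a", "c", "z"}, "d2": {"y"}}
scores_at_2 = micro_prf(predicted, gold, k=2)
print(scores_at_2)
```

Counts are pooled over all documents before the ratios are taken, which is what distinguishes micro-averaging from averaging per-document scores.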
The scores for the top 5 and top 10 extracted candidates show
better precision, since the keyphrases are generally concentrated in the title and the abstract.
Recall improves markedly, from 9.69% to 18.21%, as the number of candidates
increases, since the coverage of the text increases.
The F-score is 18.00% when the top 15 candidates are considered, which is 2.90% better than the
best baseline model. Different features have been tried and the best feature we have used in the
system is:
Here, POSi-1, POSi and POSi+1 are the POS tags of the previous word, the current word and
the following word respectively. Similarly, Wi-1, Wi and Wi+1 denote the previous word, the
current word and the following word respectively. Using only POSi and Wi, i.e. only the current
word and its POS tag, gives a contrasting result.
A better result could have been obtained if the range of the product of Term Frequency and Inverse
Document Frequency (TF*IDF) had been included [50, 51]. TF*IDF measures the document
cohesion. The maximum value of TF*IDF (max_TF_IDF) can be divided into five ranges of equal size
(size_of_range), and each TF*IDF value is mapped to the appropriate range (0 to 4), i.e.
$$\text{size\_of\_range} = \frac{\text{max\_TF\_IDF}}{5}$$
We have used the unigram template in the template file of the CRF++ 0.53 package, but the use of
the bigram template could have improved the score.
Unigram template:
The number of feature functions generated by a template amounts to (L * N), where L is the
number of output classes and N is the number of unique strings expanded from the given
template.
Bigram template:
This is a template to describe bigram features. With this template, a combination of the
current output token and previous output token (bigram) is automatically generated. Note that
this type of template generates a total of (L * L * N) distinct features, where L is the number of
output classes and N is the number of unique features generated by the templates.
Chapter 6
Identification of Reduplications in Bengali
6.1 Introduction
In linguistic studies, the term reduplication is generally used to mean repetition of any
linguistic unit such as a phoneme, morpheme, word, phrase, clause or the utterance as a
whole. The study of reduplication at all these levels is very significant both from the
grammatical as well as the semantic point of view. The identification of reduplication is a
part of general task of identification of multiword expressions (MWE). In the present work,
reduplications have been identified from the Bengali corpus of the articles of Rabindranath
Tagore. The rule-based approach is divided into two phases. In the first phase, identification
of reduplications has been done mainly at general expression level and in second phase, their
structural and semantics classifications are analyzed. This system has been evaluated with
average Precision, Recall and F-Score values of 92.82%, 91.50% and 92.15% respectively.
In all languages, the repetition of nouns, pronouns, adjectives and verbs is broadly classified
under two coarse-grained categories: repetition at the (a) expression level, and repetition at the
(b) contents or semantic level. This chapter deals with the identification of reduplications at both
levels in Bengali. Reduplication is a common feature of Bengali. Bengali is the richest Indian
language in this respect, with 2400 words (Chaudhuri et al. 2005) in the onomatopoeic and ideophonic category
of reduplication. In Telugu and Marathi, this number is less than 500. In Hindi, the number is
large, but the words do not take as many suffix-like extensions as in Bengali (Apte 1968; Bhaskara Rao
1977). The repetition at both levels is mainly used for emphasis, generality, intensity, or to
show continuation of an act. In certain cases, the repetition of a particular linguistic unit is
obligatory. For example,
s । ( Rastay Chalte Chalte Lokti Hatat Theme Gelo.)
, chup-chap, silence), sometime only consonant in the first position is changed (e.g. -
, rakom-sakam, various) or the consonant in the first position and matra (modified vowel)
attached with that consonant are changed (e.g - , kapar-chopar, clothes) leaving
other letters unchanged. Some correlative words are used in Bengali to express the
possessiveness, relative or descriptiveness. They are called ‘secondary descriptive compounds’.
For example,
This example shows that before reduplication, a matra (‘-’) at the beginning and a matra (‘-
i’) at the end are attached with the root verb (‘r’) (mar, to fight) to make a correspondence
with the main verb ‘ ’ (kara, to do). This kind of partial reduplication forms a verb
compound with the second light verb and is aligned with a single verb word in English.
The study of reduplication is a general subtask of multiword expression identification.
Multiword Expressions (MWEs) are those whose structure and meaning cannot be derived from
their component words as they occur independently. A typical natural language system assumes
each word to be a lexical unit, but this assumption does not hold in the case of MWEs (Fillmore
2003; Becker 1995). They have idiosyncratic interpretations that cross word boundaries and
hence are a ‘pain in the neck for NLP’ (Sag et al. 2002). Reduplicated words are usually
collocated MWEs which are fixed expressions and cannot be written separately. Some
proverbs and quotations can also be considered fixed expressions.
ব যo, bari-bari jaoa, moving door to door), incompleteness (- , sheet-
sheet bhab, feeling cold), hesitation (- , mane-mane kara, uttering for
meaning), softness ( - !, hasi-hasi mukh, smiling face), similarity (-
", lal-lal phul, deep red rose) or onomatopoeic expression ( , khat-khat
kora, knocking).
• Synonym or antonym word of first word: The second word is generally the synonym
( -я i, lok-jon nai, no sign of life) or antonym ( - $, pap-punya
bichar, vice and virtue) of the first word to express completeness of the meaning or
sometime used idiomatically.
• Imitation or partial copying of first word: A word followed by its partially changed
form is used to express the similarity (বi-i, boI-toI, books), unpleasant (-,
saram, soften).
? Ar Luchi-Tuchi Lagbe, Can you have more luchis?), it expresses the softness of the
speaker. But if the same word is changed into ‘’ ( - !o ।, Beshi Luchi-Tuchi
kheo na, Do not eat many luchis.), it expresses speaker’s disregardness or hardness.
Sometime, reduplication is used for sentiment marking to identify whether the speaker uses it
in positive or negative sense. For example,
(i) e ব ব & ? (Eto Bara Bara Asha Kisher? Why are you thinking so high?)
(ii) ব ব ( e! ?(Ki Bara Bara Bari Ekhane? Here, the buildings are very large.)
The first example expresses negative senses of the speaker, but second one shows positive
sense for the same reduplication (‘ব ব’, bara-bara, big-big).
• Complete Reduplication: The individual words carry certain meaning, and they are
repeated. e.g. ব- ব (bara-bara, big big), - (dheere dheere, slowly). In some
cases, both the speaker and the listener repeat certain clauses or phrases in long
utterances or narrations. The repetition of such utterances breaks the monotony of the
narration, allows a pause for the listener to comprehend the situation, and also provides
an opportunity to the speaker to change the style of narration. For example:
! । (Tarpar! Tarpar Ki Hala .)
thukur, God), ব - (boka-soka, Foolish), я- я (seje-guje, dressed up)
etc.
• Semantic Reduplication: The paired members are semantically related. The most
common forms of relation between the words are synonymy (!-n, matha-mundu,
head), antonym (#- , din-rat, day and night), class representative (- , cha-
paani, snacks)).
word and ‘-i’ is added for the second and then both are agglutinated to make a single
word. For example, ‘’ (maramari, fighting). Here, the above specified affixes
are added with the root ‘r’ (mar, to fight) to form a single token.
(i) ব ব * । (Ram Saradin Bari Bari Ghurchhe.)
Mainly sound words are onomatopoeic expressions. The constituent words imitate a sound,
and the unit as a whole refers to that sound. For example:
(i) 'l 'l я প । (Chal Chal Kore Jal Porche, sound of water falling on a surface)
Sometime, besides sound expression, reduplication can also be used to express feelings. For
example:
Reduplication at the sense level is an important feature of Bengali as well as of some other
Indian languages such as Tamil, Manipuri (Becker 1975) and Hindi. Different types of reduplication
at the sense level and their corresponding expression-level classifications are described as follows:
(i) Sense of repetition: It expresses a sense of repetitiveness. Complete reduplications are
mainly grouped into this class. For example:
ব' ব' e я । ( Bachar Bachar Ek Kaj Kara .)
The sentence without the reduplication means only ‘red rose’, but after reduplication, the
emphasis on the meaning ‘red’ becomes stronger.
(iv) Sense of completion: Mainly partial or semantic reduplication belongs to this class. For
example:
* #* & 0 য । ( Kheye Deye Ami Shute Jaba .)
ব ব % $প । (Katha Bolte Bolte Hatat Se Chup Kore Gelo.)
goods). The matra is removed from w2 before comparison. The following algorithm is followed:
In partial reduplication, three cases are possible- (i) change of the first vowel or the matra
attached with first consonant, (ii) change of consonant itself in first position or (iii) change of
both matra and consonant. Exception is reported where vowel in first position is changed to
consonant and its corresponding matra is added. For example, & - (abal-tabal,
incoherent or irrelevant). Here, none of the individual words has any dictionary entry. This
special case is handled by checking the change of the vowel to its corresponding matra attached to
the new consonant (here, vowel ‘’ is changed to its corresponding matra (‘-’) attached with
consonant (‘’). These are due to the orthographic rules applied in Bengali. Partial reduplication
is handled using Algorithm 2, discussed below. Here also, inflection is removed before
comparison. A linguistic study (Chattopadhyay 1992) reveals that when only the consonant is
changed, it can be changed to any of the following four consonants: ‘%’, ‘"’, ‘’, ‘’. But any consonant can be
If 1st char in both words are consonants and length are same
If 1st char in both words are similar then
If 2nd char in both words are matra and dissimilar then
Check for all chars of both from position 3 to end;
If all similar then reduplication; //matra change
Else if 1st char of w1 is , , ", % then
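The three cases of partial reduplication can be illustrated with a minimal sketch. It is not the thesis implementation: the function names are hypothetical, the Unicode ranges used to recognise Bengali consonants and matras are assumptions of this sketch, the restriction to four specific replacement consonants is not encoded, and both words are assumed to be already stripped of inflection.

```python
# Assumed Unicode ranges: dependent vowel signs (matras) U+09BE..U+09CC,
# consonants U+0995..U+09B9 plus U+09DC, U+09DD and U+09DF.
MATRAS = {chr(c) for c in range(0x09BE, 0x09CD)}


def is_consonant(ch):
    return "\u0995" <= ch <= "\u09B9" or ch in "\u09DC\u09DD\u09DF"


def partial_reduplication(w1, w2):
    """Return the kind of partial reduplication, or None if there is none."""
    if len(w1) != len(w2) or len(w1) < 2 or w1 == w2:
        return None
    if not (is_consonant(w1[0]) and is_consonant(w2[0])):
        return None
    matra_differs = w1[1] in MATRAS and w2[1] in MATRAS and w1[1] != w2[1]
    if w1[0] == w2[0]:
        if matra_differs and w1[2:] == w2[2:]:
            return "matra change"
        return None
    if w1[1:] == w2[1:]:
        return "consonant change"
    if matra_differs and w1[2:] == w2[2:]:
        return "consonant and matra change"
    return None


chup, chap = "\u099A\u09C1\u09AA", "\u099A\u09BE\u09AA"              # chup-chap
rakom, sakam = "\u09B0\u0995\u09AE", "\u09B8\u0995\u09AE"            # rakom-sakam
kapar, chopar = "\u0995\u09BE\u09AA\u09DC", "\u099A\u09CB\u09AA\u09DC"  # kapar-chopar
results = [partial_reduplication(a, b)
           for a, b in [(chup, chap), (rakom, sakam), (kapar, chopar)]]
print(results)
```

Pairs in which only one word carries a matra have different lengths and are rejected by this sketch, so the vowel-to-matra exception described above would need separate handling.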
w2 are agglutinated to make a single word. In this case, after removing inflection, words are
divided equally and then the comparison is done. Sometime onomatopoeic expression is used to
express feelings ( , kankane shit, extremely cold) or to emphasise the adjectives
related to the noun (e.g. %% , taktake lal, Deep red). After collecting the words tagged as
adjective and removing matra attached with the last letter, the algorithm 3 is applied.
Word is divided into two parts and assigned to two strings S1 and S2 separately;
Check all the chars of both S1 and S2 sequentially;
If all is equal then reduplication;
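The halving test described in these steps can be sketched as below. The function name is hypothetical, and the word is assumed to have already had its inflection and final matra removed, as the text describes.

```python
def is_closed_reduplication(word):
    """True if the word consists of two identical halves written together."""
    half, remainder = divmod(len(word), 2)
    if half == 0 or remainder != 0:
        return False
    s1, s2 = word[:half], word[half:]
    return s1 == s2


kankan = "\u0995\u09A8\u0995\u09A8"   # kan + kan, as in 'kankane' without its ending
closed = is_closed_reduplication(kankan)
print(closed)
```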
For correlative reduplication, approaches are more or less same with the previous
algorithm 3. Here, naturally matra is not added with w2 and before reduplication, the formative
affixes ‘–&’ and ‘-i’ are added with the root to form the first word and second words
respectively and agglutinated together to make a single word. The algorithm is described below.
For semantic reduplication, a dictionary based approach has been taken. List of inflections
identified for the semantic reduplication is shown in Table 6.1.
This system identifies consecutive words that have the same part of speech (mainly noun,
adjective and adverb). Then, morphological analysis is done to identify the roots of both
w1 and w2. In synonymous reduplication, w2 is the synonym of w1. So, at first, the entry of w1
in the Bengali monolingual dictionary is searched for any occurrence of w2. If a match is
found, the pair is considered a reduplicated word. For antonym words, the opposite word of w1 is
difficult to identify. These opposite words are mainly gradable opposites (পপ-প, pap-purna,
Vice and Virtue) where the word and its antonyms are entirely different wordforms. The
productive opposites (я, gargaji, disagree is the opposite of я, raji, agree) are easy to
identify because the opposite word is generated by adding prefix or suffix with the original. In
dictionary based approach, English meaning of both w1 and w2 are extracted and opposite of w1
is searched in English WordNet for any entry of w2.
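The dictionary-based check can be sketched with toy lookup tables. Everything in the sketch is hypothetical: the function name, the transliterated keys and the three small dictionaries merely stand in for the Bengali monolingual dictionary, the Bengali-to-English dictionary and the antonym relation of the English WordNet.

```python
def semantic_reduplication(w1, w2, synonyms, bn_to_en, en_antonyms):
    """Classify a root pair as 'synonym', 'antonym' or None."""
    # Synonym case: w2 appears in the monolingual entry of w1.
    if w2 in synonyms.get(w1, ()):
        return "synonym"
    # Antonym case: some English opposite of w1 is a meaning of w2.
    meanings_w2 = bn_to_en.get(w2, ())
    for meaning in bn_to_en.get(w1, ()):
        for opposite in en_antonyms.get(meaning, ()):
            if opposite in meanings_w2:
                return "antonym"
    return None


# Toy stand-ins for the real lexical resources.
synonyms = {"matha": {"mundu"}}
bn_to_en = {"din": {"day"}, "rat": {"night"}, "pap": {"vice", "sin"}, "punya": {"virtue"}}
en_antonyms = {"day": {"night"}, "vice": {"virtue"}}

labels = [semantic_reduplication(a, b, synonyms, bn_to_en, en_antonyms)
          for a, b in [("matha", "mundu"), ("din", "rat"), ("pap", "punya"), ("din", "punya")]]
print(labels)
```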
The first model for identifying the five types of reduplications is shown in Figure 1. The
functions performed by the different parts of the proposed architecture are:
• Tokenizer: It separates the words based on blank space or special symbols (like hyphen,
exclamation mark etc.) to identify two consecutive words Wi and Wi+1.
• Reduplication Identifier: It is the main component of the first model. Consecutive tokens are
passed to it to verify whether they are reduplicated words and to find the class they belong to.
The dictionary is used for semantic reduplication.
• Dictionary: It includes the lexicon and the associated semantics. The system uses both
Bengali-to-Bengali and Bengali-to-English dictionaries.
Mainly eight types of semantic classifications are identified in Section 6.2.7 and their
correspondence with expression level classifications has been mentioned. For example, if the
Hyphen and closed form count: Orthographic representation of a collocation may provide
clues about the collocation being reduplication. Words joined with hyphens (-, hat-Tat,
hands) or occurring in closed form ( я я, sejeguje, dressed up) are likely to denote a single
The collected corpus includes 14,810 tokens for 3675 distinct word forms at the root level.
Precision, Recall, F-score are calculated for each class as well as for the reduplication
identification system and are shown in Table 6.2 and Figure 6.2.
The scores of partial and semantic evaluation are not satisfactory because of some wrong
tagging by the shallow parser (adjective, adverb and noun are mainly interchanged). Some
synonymous reduplication ( - ,, dhire-susthe, slowly and steadily, leisurely) implies
some sense of the previous word but not its exact synonym. These words are not identified
properly.
[Figure 6.2: bar chart of Precision, Recall and F-Score (vertical axis from 0 to 120) for the onomatopoeic, complete, partial, semantic and correlative reduplication classes and for the overall system]
Figure 6.2 Bar graph of five types of reduplications and the system performances
Frequency is an important indication of whether a compound is an MWE. Figure 6.3 shows that
complete reduplication is used more frequently in this corpus, and hence a useful statistic has been
developed for this corpus and for the writing style of the author.
In this corpus, most of the identified hyphenated words are not reduplicated words, and only
8.52% of the reduplications are hyphenated. This result shows that the usual way of writing reduplications
is with a space as separator. The percentage of closed reduplications is 33.09%,
most of which are onomatopoeic, correlative and semantic reduplications. All of the
correlative reduplications and most of the onomatopoeic reduplications are closed.
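[Figure 6.3: pie chart of the share of each reduplication type (onomatopoeic, complete, partial, semantic, correlative) in the corpus; the slice values recovered from the figure are 8.51, 18.08, 12.7, 51.06 and 26.6]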
by only rule based approach. Sometimes it is observed in the present system that some word
combinations are wrongly identified as reduplicated MWEs. These issues need to be studied
further. Apart from this, more work needs to be done on identifying semantic reduplication
using statistical and morphological approaches. By gathering all these statistics, future research is
planned in the field of stylometric analysis or plagiarism detection to identify the writing style of
an author.
Chapter 7
Identification of MWEs in Bengali
7.1 Introduction
One of the key issues in both natural language understanding and generation is appropriate
processing of multiword expressions (MWEs). Automatic extraction of MWEs from text corpora
using linguistic and statistical tools is an alternative to manual creation of such databases. The
existing techniques for automatic extraction of MWE rely upon accurate parts-of-speech (POS)
taggers, shallow parsers and rich lexical resources such as WordNets. However, most of the
Indian languages including Bengali cannot boast of even a good size representative corpus, let
alone such sophisticated NLP tools. The chapter describes an approach for automatic extraction
of certain categories of MWEs for Bengali in such a miserly resource scenario. The technique
uses a morphological analyzer and a moderate size untagged text corpus. The results for Bengali
are encouraging and the generic nature of the approach makes us believe that similar results can
be expected for other languages as well.
West Bengal) etc., where inflections can be added to the last word only.
(1.c) Idiomatic Compound Nouns: These are noun-noun MWEs that are idiomatic or
unproductive in nature and inflection can be added only to the last word. The formation of such
compounds may be due to hidden conjunctions e.g. - (mA bAbA, parents) meaning mA
(mother) and bAbA (father) or hidden inflections e.g. ! " (lalATa lekhana) meaning
# (tAsera ghar, house of cards - fragile), #$ ড (gho.DAra Dima, horse’s egg,
absurd).
(2.b) Kin Terms: Bengali kin terms are normally two word MWEs such as *i
(2.c) Productive Compound Nouns: same as 1.c, except for the fact that the meaning is not
idiomatic. These are also called institutionalized phrases. E.g. , *я (mACha bhAjA, fish
(2.d) Noun – Noun collocations with inflections: same as 1.d, except for the fact that these
are semi-productive. E.g. / 0 (mATira mAnuSha, man of earth, down-to-earth).
(2.e) Conjunct Verbs: These are a pair of similar verbs used together to denote some other
action. When used in inflected form, the same inflection (normally e or te) is added to both the
verbs. E.g. khAoYA dAoYA (take food), pa.DA shonA (“read-hear” – study) etc.
(3.b) Light Verb Constructions: Some verbs like deoYA (to give) or kATA can have different
senses in different context. They are often referred to as light verbs (Stevenson et al. 2004). E.g.
( ! (chula kATA, to dress hair).
(3.c) Adjective-Verb and Adverb-Verb Collocations: Might be idiomatic or compositional,
but statistically marked. E.g. jя o (lajjAYe lAla haoYA, blush), 0 4 56
substitution is allowed. E.g. u k ,$ (ulu bane mukto Cha.DAno, useless
attempt) etc.
1
Figuration is the property of the components of an MWE having some metaphoric, hyperbolic or metonymic meaning in addition to
their literal meaning.
between the components or extinction of inflection from the first component (maa-baba, mother
and father).
(iii) Idioms: They are also compound nouns with idiosyncratic meaning, but the first noun is
generally in possessive form (taser ghar, fragile). Sometimes, individual components may not
carry any significant meaning and cannot be part of a dictionary (gadai laskari chal, indolent
habit). For them, no inflection is allowed even on the last word.
(iv) Numbers: They are highly productive, impenetrable and allow slight syntactic variations
like inflections. Inflections can be added only to the last component (soya sat ghanta, seven
hours and fifteen minutes).
(v) Relational Noun Compounds: They are mainly kin terms and bigrams in nature. Inflection
can be added to the last word (pistuto bhai, maternal cousin).
(vi) Conventionalized Phrases: Sometimes they are called ‘institutionalized phrases’. They
are not idiomatic; a particular word combination comes to be used to refer to a given object.
They are productive and have unexpectedly low frequency and, in doing so, contrastively
highlight the statistical idiomaticity of the target expression (bibhha barshiki, marriage
anniversary).
(vii) Simile Terms: They are analogy terms in Bengali and are sometimes similar to idioms,
except that they are semi-productive (hater panch, remaining resource).
(viii) Reduplicated Terms: Reduplications are non-productive and tagged as noun phrases. They
are further classified as onomatopoeic expressions (khat khat, knock knock), complete
reduplication (bara-bara, big big), partial reduplication (thakur-thukur, God), semantic
reduplication (matha-mundu, head) and correlative reduplication (maramari, fighting).
A number of research activities in Bengali Named Entity detection have been carried out
(Ekbal et al. 2008), but there is no such standard tool to detect this. Here we have manually
identified NE. Though numbers and kin terms can be captured by some lexicons, the use of
lexicons during development phase is not at all a very acceptable way. Our work mainly focuses
on the extraction of productive and semi-productive bigram MWEs like idioms, idiomatic
compound nouns, simile terms, numbers, relational terms, and conventionalized phrases.
Resource acquisition is one of the most challenging obstacles when working with electronically
resource-constrained languages like Bengali. However, this system has used a large number of
Bengali articles written by the noted Indian Nobel laureate Rabindranath Tagore (discussed in
Section 4.1.1). Since we are primarily interested in single-document term affinity, document
information need not be maintained and manipulated by the experiment, and document length
normalization need not be considered. The order of the documents within the sequence is not of
major importance. After merging all the articles, a medium-size raw corpus has been created. It
consists of 393,985 tokens and 283,533 types. The actual motivation for choosing this domain is to
develop useful statistics and to support further work on stylometric analysis.
Mainly bigram collocations within the same chunk have been extracted as candidates. In the second
phase, feature engineering consisting of various statistical co-occurrence parameters is applied
to those candidates. In the final phase, each candidate is classified as MWE or non-MWE, and
Precision, Recall and F-score are computed for each measure.
The crawled corpus is so scattered and unformatted that basic semi-automatic pre-processing
was needed, such as sentence boundary detection, to make the
corpus suitable for parsing. Parsing with the Bengali shallow parser has been done to identify
the POS, chunk, root and inflection of each token. Some of the tokens are misspelled due to
typographic or phonetic error. For example, the token ‘boi’ (book) is written as ‘i’ or sometime
as ‘;’. Shallow parser is not able to detect their actual root and inflection and the number of
After pre-processing, bigram noun sequences within the same chunk are extracted using their
POS and chunk categories. The shallow parser confuses the two noun tags, i.e. common noun
(‘NN’) and proper noun (‘NNP’), because of the continuous need for coinage of new terms
for describing new concepts. To identify all N-N MWEs, we have taken both of them, and
manual deletion of NEs has been done afterwards. Although chunking information helps to
identify phrase boundaries, some of the candidates belong to a chunk formed by more
than two nouns. Their frequency is also identified during the evaluation phase. Bigram candidates
can be thought of as <w1w2>. The whole candidate selection phase rests on the heuristics
described in Table 7.1. After the first phase, a list of possible candidates is prepared. The ‘NN’ and
‘NNP’ tags are mixed up, and some consecutive nouns not belonging to a single chunk are
also extracted by the parser. These parsing errors and NEs have been detected and filtered
manually. Statistics on the parsing errors are calculated during the evaluation phase.
Heuristics
1. POS:   The POS of each word of the bigram must be either ‘NN’ or ‘NNP’.
2. Chunk: w1 and w2 must be in the same ‘NP’ chunk.
We have said earlier that frequency information is not a reliable basis for statistics on
MWEs, because each MWE occurs too rarely in a medium-sized corpus.
We support this assumption by evaluating measures that depend directly on frequency, such as
PMI and LLR. The following are the association measures that we have used in our
analysis. Though these measures are discussed in Section 7.3.2, they are briefly described here:
• Point-wise Mutual Information (PMI): The PMI of a pair of outcomes x and y belonging to
discrete random variables quantifies the discrepancy between the probability of their coincidence
given their joint distribution versus the probability of their coincidence given only their
individual distributions and assuming independence (Church et al. 1990). Mathematically,
\[ \mathrm{PMI}(x, y) = \log \frac{P(xy)}{P(x)\,P(y)} \tag{7.1} \]
where, P(xy) = probability of the word x and y occurring together, P(x) = probability of x
occurring in the corpus and P(y) = probability of y occurring in the corpus.
These probabilities can be estimated from the relative bigram and unigram frequencies.
PMI is prone to highly overestimating the association of rare events, because it
does not incorporate the notion of support of the collocation (Kunchukuttan and Damani 2008).
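The tendency of PMI to favour rare events can be seen in a small sketch. The code below is only an illustration of Equation 7.1 under stated assumptions: the function name `pmi_scores` is hypothetical, the logarithm is taken to base 2, and the unigram probabilities are estimated from the left and right slots of the bigram list rather than from a separate unigram count.

```python
import math
from collections import Counter

def pmi_scores(bigrams):
    """Return {(w1, w2): PMI} for a list of (w1, w2) bigram tokens.

    Probabilities are relative frequencies, as in Equation 7.1.
    Assumption of this sketch: log base 2, unigram counts taken
    from the bigram slots.
    """
    bigram_counts = Counter(bigrams)
    left_counts = Counter(w1 for w1, _ in bigrams)
    right_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)
    scores = {}
    for (w1, w2), n12 in bigram_counts.items():
        p_xy = n12 / total
        p_x = left_counts[w1] / total
        p_y = right_counts[w2] / total
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy data (hypothetical): one frequent pair, one pair seen only once.
toy = [("c", "d")] * 4 + [("a", "b")] + [("e", "f")] * 3
scores = pmi_scores(toy)
```

In this toy list the pair seen once receives a higher PMI than the pair seen four times, which is the overestimation effect described above.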
2 Linguistic study (Chattopadhyay, 1992) reveals that, for compound noun MWEs, the only inflections of the first noun that need
to be considered are those mentioned above.
• Log-Likelihood Ratio (LLR): The LLR is the ratio of the likelihood of the observations
given the null-hypothesis to that of the alternate hypothesis (Dunning 1993). Generally, it is the
ratio between the probability of observing one component of a collocation given the other is
present and the probability of observing the same component of a collocation in the absence of
other.
Here the order of the words in the candidate collocation is irrelevant. We have computed the first
probability using Bayes' theorem, averaging the probability of w1 given w2 and the probability
of w2 given w1.
• Phi-Coefficient: In statistics, the Phi coefficient Φ is a measure of association for two binary
variables. The Phi coefficient is also related to the chi-square statistic as:
\[ \Phi = \sqrt{\frac{\chi^2}{n}} \tag{7.3} \]
where n is the total number of observations and χ² is the chi-square statistic. Two binary
variables are considered positively associated if most of the data falls along the diagonal cells. In
contrast, two binary variables are considered negatively associated if most of the data falls off
the diagonal. Here, the binary distinction denotes the positional information of the words. If we
have a 2×2 table for two random variables x and y which denotes the presence of w1 and w2
respectively, we have following matrix:
where n11 = actual count of the bigram <w1w2>, n10 = frequency of bigrams containing w1 but not w2,
n01 = frequency of bigrams containing w2 but not w1, and n00 = frequency of bigrams containing neither
w1 nor w2. nx1 and nx0 are the sums of their respective rows, and ny1 and ny0 are the
sums of their respective columns. Any word that takes the place of an absent w1 or w2 must be
a noun. The phi coefficient that describes the association of x and y is
\[ \phi = \frac{n_{11}\,n_{00} - n_{01}\,n_{10}}{\sqrt{n_{x1}\,n_{x0}\,n_{y1}\,n_{y0}}} \tag{7.4} \]
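As a minimal sketch of Equation 7.4, the phi coefficient can be computed directly from the four cell counts of the 2×2 table. The function name `phi_coefficient` and the convention of returning 0.0 for an empty row or column are assumptions of this illustration, not part of the original system.

```python
import math

def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 contingency table (Equation 7.4).

    n11: count of <w1 w2>; n10: w1 without w2;
    n01: w2 without w1;    n00: neither w1 nor w2.
    """
    row1, row0 = n11 + n10, n01 + n00   # w1 present / w1 absent
    col1, col0 = n11 + n01, n10 + n00   # w2 present / w2 absent
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        # Assumption of this sketch: undefined cases score 0.
        return 0.0
    return (n11 * n00 - n01 * n10) / denom
```

When all observations fall on the diagonal (for example n11 = n00 = 10 and n10 = n01 = 0) the value is 1, and when they all fall off the diagonal it is -1, which matches the description of positive and negative association given above.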
\[ co(w_1, w_2) = \sum_{s \in S(w_1, w_2)} e^{-d(s, w_1, w_2)} \tag{7.5} \]
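The co-occurrence measure of Equation 7.5 can be sketched as follows. This is only an illustration under assumptions that the text does not spell out: S(w1, w2) is taken to be the set of sentences containing both words, and d(s, w1, w2) is taken to be the smallest token distance between w1 and w2 in sentence s. The function name `co_occurrence` is hypothetical.

```python
import math

def co_occurrence(sentences, w1, w2):
    """Sum of exp(-d) over sentences that contain both w1 and w2.

    sentences: list of token lists.
    Assumption: d is the minimum token distance between w1 and w2.
    """
    score = 0.0
    for tokens in sentences:
        pos1 = [i for i, t in enumerate(tokens) if t == w1]
        pos2 = [i for i, t in enumerate(tokens) if t == w2]
        if pos1 and pos2:
            d = min(abs(i - j) for i in pos1 for j in pos2)
            score += math.exp(-d)
    return score

# Toy sentences (hypothetical tokens).
toy_sentences = [["a", "b"], ["a", "x", "b"], ["c"]]
toy_score = co_occurrence(toy_sentences, "a", "b")
```

Adjacent occurrences contribute most to the score, and the contribution decays exponentially as the two words move apart within a sentence.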
where sigw1(w2) = significance of w2 with respect to w1. Here a slight modification has been
made to the original by interchanging the roles of w1 and w2 in the first equation and
averaging the two values. The same modification has been made for fw1(w2), which denotes the
number of w1 with which w2 has occurred. In the second equation, these modified values are used
in their respective places. sig(w1,w2) denotes the general significance of w1 and w2. σ(x) is the
sigmoid function defined as [exp(-x)/(1+exp(-x))]. Two constants κ1 and κ2 define the stiffness
of the sigmoid curve, and for simplicity we have set both of them to 5.0 (Agarwal et al. 2004).
λ is defined as the average number of noun-noun co-occurrences. The value of the significance
function lies between 0 and 1.
• Weighted Combination: Final evaluation has been carried out by combining all the
above-mentioned features. Experimental results show that the Phi-coefficient, co-occurrence and
significance functions, which are based on the co-occurrence distribution, have given more accurate
results than frequency-based measures like LLR and PMI in the higher ranks. So
these three measures are considered, and their weights have been fixed after experimenting with
various weights. The final results are reported for the weighted triple <0.45, 0.35, 0.20> for
co-occurrence, Phi and significance function respectively. The individual scores are normalized
to the range 0 to 1 before the weights are assigned.
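The weighted combination can be sketched as below. The weights are the triple reported above; the normalization method is not specified in the text, so min-max scaling is an assumption of this sketch, and the names `min_max_normalize`, `weighted_score` and `WEIGHTS` are hypothetical.

```python
# Weights reported in the text for co-occurrence, Phi and significance.
WEIGHTS = {"co_occurrence": 0.45, "phi": 0.35, "significance": 0.20}

def min_max_normalize(scores):
    """Scale a {candidate: score} map to the range 0..1.

    Assumption: min-max scaling; the text only states that the
    scores are normalized to lie between 0 and 1.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cand: 0.0 for cand in scores}
    return {cand: (s - lo) / (hi - lo) for cand, s in scores.items()}

def weighted_score(measures):
    """Combine per-measure score maps into one weighted score per candidate.

    measures: {measure_name: {candidate: raw_score}} with the keys of WEIGHTS.
    """
    normalized = {name: min_max_normalize(measures[name]) for name in WEIGHTS}
    candidates = normalized["co_occurrence"].keys()
    return {
        cand: sum(WEIGHTS[name] * normalized[name][cand] for name in WEIGHTS)
        for cand in candidates
    }

# Toy scores for two hypothetical candidates.
toy_measures = {
    "co_occurrence": {"x": 10.0, "y": 0.0},
    "phi": {"x": 0.2, "y": 0.8},
    "significance": {"x": 0.5, "y": 0.1},
}
combined = weighted_score(toy_measures)
```

Candidates can then be sorted by the combined score in descending order to obtain the ranked list that is evaluated below.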
We have used standard IR metrics like Precision, Recall and F-score for evaluating our final
weighted measure as well as all the association measures. MWEs are identified manually
for evaluation purposes. The candidates are divided into four classes: (1) valid N-N MWEs
(M), (2) valid N-N semantic collocations that are not MWEs (S), (3) invalid collocations that
arise from taking a bigram out of an n-gram chunk where n>2 (B), and (4) invalid candidates due
to parsing errors in POS, chunk or inflection (E). For N candidates, three measures, expressed
in percentages, are calculated for each association measure.
Actual Precision (V) = M/N
Overall Precision (I) = (M+S)/N
Error rate due to B-type (O) =B/N
Precision for every measure is calculated as:
Precision (P) = (MWEs in top 1000 ranked candidates/ 1000)
Recall is defined as:
Recall (R) = (MWEs in top 1000 candidates/total N-N MWEs in the documents)
F-score (F) = (2*P*R) / (P+R)
Top 1000 ranked candidates are taken to evaluate each measure in the higher ranking.
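The definitions above translate directly into code. The following is a minimal sketch, assuming that each candidate carries one of the manual class labels 'M', 'S' or 'B' (E-type candidates already removed) and that the candidate list is sorted by score; the function names are hypothetical.

```python
def class_rates(labels):
    """V, I and O (in %) for a list of class labels 'M', 'S', 'B'."""
    n = len(labels)
    m, s, b = labels.count("M"), labels.count("S"), labels.count("B")
    return {"V": 100 * m / n, "I": 100 * (m + s) / n, "O": 100 * b / n}

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def precision_recall_f(ranked_labels, total_mwes, k=1000):
    """Precision, Recall and F-score over the top-k ranked candidates.

    ranked_labels: class labels in descending order of score.
    total_mwes: number of N-N MWEs in the documents (gold standard).
    """
    top = ranked_labels[:k]
    hits = top.count("M")
    p = hits / len(top)
    r = hits / total_mwes
    return p, r, f_score(p, r)
```

As a consistency check on the formula, `f_score(0.3964, 0.9129)` evaluates to approximately 0.5528, which agrees with the Precision, Recall and F-score values reported for the weighted approach.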
Four classes as discussed in Section 7.4.1 are identified manually and their frequencies are
plotted in Figure 7.2.
[Figure 7.2: frequencies (in %) of the four candidate classes. Legend: MWE, Valid N-N, Others, POS error. Plotted values: 17.54, 28.56, 24.37, 29.53.]
The largest share of the candidates is erroneous due to parsing errors. E-type candidates are
filtered manually, as they would produce erroneous statistics and could bias the result. For each
measure, the scores have been sorted in descending order and the total range is divided into
five ranks so that approximately equal scores fall within the same rank. For every rank, the three
measures discussed in Section 5.1 are calculated and plotted in a graph. Table 7.2 reports those
results and Figure 7.3 gives a comparative view of those measures.
The slope of each measure in Figure 7.3 is important here, because a monotonically
decreasing graph indicates that more MWEs lie in the upper ranks than in the lower ranks.
PMI and LLR prove to be poor measures: their graphs do not follow any clear trend, and a
slight upward slope is noticeable. This shows the presence of a higher number of MWEs in
the lower ranks.
Rank   V     I     O     V     I     O     V     I     O     V     I     O     V     I     O
1     17.5  50.0  50.0  18.0  45.6  54.0  34.0  84.0  15.1  35.7  78.5  21.4  38.5  88.3  11.7
2     16.0  51.0  49.0  16.0  56.8  43.0  22.6  52.9  47.0  21.9  68.2  31.8  21.6  64.5  33.5
3     20.3  55.5  44.0  18.8  64.6  35.3  18.5  62.6  37.0  15.9  62.9  37.1  16.1  45.4  54.6
4     19.0  64.8  35.1  22.2  69.4  30.5  11.2  61.0  39.0  17.8  52.2  47.7  12.3  44.6  55.3
5     20.7  66.3  33.7  23.9  64.9  35.0  10.6  67.2  32.0  15.7  46.6  53.4   9.7  37.7  62.3
Table 7.2 Performance metrics of three different measures (in %) for each association approach.
Another important observation is that most of the lower-ranked MWEs are reduplicated
MWEs, and they are filtered out when the top 1000 ranked candidates are chosen. In the weighted
approach, most of the valid MWEs are listed in the top ranks. The V, I and O measures for this
approach are shown in Table 7.3. Column V clearly shows that the corresponding graph for
valid N-N MWEs is decreasing in nature. The weighted combination approach improves upon
each of the individual methods.

Weighted Measures
Rank   V      I      O
1     46.54  89.23  10.77
2     30.27  72.28  27.72
3     13.43  59.98  40.02
4      7.66  36.62  63.38
5      5.09  23.39  76.61

If these association measures are combined using a ranking approach, no empirical setting of
weights is required. The problem is that there is no standard ranking methodology for these
association measures. The top three candidates for each measure and their corresponding tags
are shown in Table 7.4. Borda's positional ranking, which does an approximate aggregation of
the ranked collocation lists, has been used as a standard ranking function in previous studies
(Kunchukuttan and Damani 2008).
Measure           Candidate            Tag
PMI               ghore dhuke          S
                  payer dhula          M
                  mukher pratibimba    S
LLR               ratdin moner         B
                  barsar chata         S
                  premer sagar         M
Co-occurrence     mathai hat           M
                  maa-r mone           S
                  mukher bhab          S
Phi-coefficient   khabarer kagaj       S
                  haater panch         M
                  dhaner haater        B
Significance      bayer din            S
                  maran dasa           M
                  haater panch         M
Weighted          galai dori           M
                  garer maath          M
                  nacher tale          S
Figure 7.4 Precision, Recall and F-score (in %) for different measurements.
However, the results obtained with this ranking were not satisfactory, and it did not serve as an effective
MWE extraction technique. Precision, Recall and F-score are computed for all association measures as well
as for the weighted approach. These are measured among the top 1000 candidates after manually deleting the
parsing errors. The performance metrics for the different measures are shown in Figure 7.4.
Precision, Recall and F-score for the weighted approach are 39.64%, 91.29% and 55.28% respectively,
which is quite satisfactory for a first attempt. The present work does not focus on increasing
Precision. Our goal is to make a comparative study of the existing association measures against our
own weighted measure, and to capture the maximum number of N-N collocated MWEs within
the top 1000 ranked candidates.
As an effort to develop a lexicon of N-N MWEs has been made simultaneously, we have observed
the use of MWEs by the author across the documents. For this, we have randomly chosen 10 novels of
Rabindranath Tagore and made a study using the following equation:
where i = document id, varied from 1 to 10, Ci = combined list of N-N MWEs after the ith document is
processed, and CLi = extracted list of N-N MWEs in the ith document.
The value of (CLi - Ci-1) is plotted against the document id in Figure 7.5, which indicates the use of new
MWEs in the documents. The graph slopes downward, which indicates that the use of new MWEs
saturates as the number of documents increases. Besides reduplications, some of the highly frequent
MWEs used throughout the documents are shown in Table 5 according to their frequencies.
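The quantity plotted in Figure 7.5 can be sketched as follows. The sketch assumes that the combined list is updated as the union of the previous combined list and the list extracted from the current document; this update rule is an assumption drawn from the definitions of Ci and CLi above, and the function name is hypothetical.

```python
def new_mwes_per_document(extracted_lists):
    """Number of previously unseen MWEs contributed by each document.

    extracted_lists: one list of MWEs per document (CL_i), in reading order.
    Returns |CL_i - C_(i-1)| for every i, where C_i is the running union
    (assumed update rule).
    """
    combined = set()          # C_0 is empty
    new_counts = []
    for extracted in extracted_lists:
        new = set(extracted) - combined
        new_counts.append(len(new))
        combined |= new       # C_i = C_(i-1) union CL_i
    return new_counts
```

A sequence of counts that falls towards zero corresponds to the downward, saturating curve described above.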
The presence of named entities in the candidate list also affects the performance. While
conceptually all named entities are MWEs, we do not include them in our research. We have
manually filtered them at the beginning of the second phase.
Another important reason for taking the overall Precision (I) into consideration is that our basic
goal is to build statistics on the different uses of MWEs and compounds by the writer in the
articles, in order to identify the writing style or to support plagiarism detection. From this
point of view, these semantically collocated compounds sometimes behave as institutionalized
phrases in different text positions.
The significance function and the co-occurrence measure used in this work have been
modified according to our needs. Here, the two binary variables used in the Phi-coefficient are
related to the positional information of the constituent words w1 and w2. The weighted approach
is basically a trial-and-error approach to find the best triple.
Apart from being the first work of its kind for the Bengali language, the contributions of this work
are as follows: (i) a morpho-syntactic classification of Bengali compound noun MWEs
beyond the conventional classification of MWEs in English (Baldwin and Kim,
2010), (ii) a new weighted approach for measuring MWEs, which may be used for other types of
collocation measurements, (iii) a list of Bengali N-N compound MWEs that serves as a lexical
resource for developing synsets of MWEs, (iv) the development of a formatted corpus in Bengali
for further study, (v) an initial study of the stylometry of Rabindranath Tagore.
Intersection of two synsets indicates the commonality of the two sets of the components. These
common elements act as the dimensions of a vector space and the similarity based algorithm is
applied to measure the semantic similarity between two components considering the component
as vector in that semantic space.
This phase deals with the building of Bengali synsets, which aims not only to identify the
meaning of Multi-Word Expressions but also to take a step towards the development of a Bengali
WordNet. The input monolingual dictionary (Samsada Bengali Abhidhana)3 contains each word
present with its parts-of-speech (. Noun, =. Adjective, >. Pronoun, a . Indeclinables, k.
Verb), phonetics and synonymous sets. Synonymous sets are separated using distinguishable
notations based on similar sense and differential sense. Then synonyms of different sense with
3 http://dsal.uchicago.edu/dictionaries/biswas-bangala/
respect to a word entry are separated by a semicolon (;) and synonyms having the same sense are
separated by a comma (,). An automatic technique is devised to identify the synsets for a particular
word entry based on these clues (, and ;) of similar and differential senses. The tilde symbol (~)
indicates that the suffix string following it forms a new word when concatenated with the
original entry word. A snapshot of the synsets for the Bengali word
“aA<” (Angshu) is shown in Figure 7.6. Each entry and each of its synsets has been given an
identification number. Though these identification numbers are not used directly in
this experiment, they can help to separate the entries and track them for further operations.
Dictionary Entry:
aA< [aṃśu] . 1 =, I, p*; 2 LM), n, .k aA) । [A. aPQ+u]। ~ . st ,
.k st ; ) প! i ps st ((A<)। ~ я . = ), = । ~ 4 .
aA< 4 ; .> । ~ =. (st) =, я > । ~ (-t) 1 .;> 2 .>A)
я পXt । ~ . Iя , = । ~ (- n) . .> । ~ =. = ,
=)6 ।
Synsets:
aA< =/ I/p*_.#25_1_1 LM)/n/ _.k_aA)_[_A._aPQ+u]_.#25_2_2
aA< st/.k_st_.#26_1_1 )_প!_i _ps_st_((A<)_.#26_2_2
aA<я = )/ = _.#27_1_1
aA<4 aA< _4 _.#28_1_1 ._> .#28_2_2
aA< (st)_ =/я >_=.#29_1_1
aA< ._> #30_1_1 __2_.>A)_ _ я _পXt#30_2_2
aA< Iя / = _.#31_1_1
aA< ._> .#32_1_1
aA< =/ =)6_=.#33_1_1
Figure 7.6 Monolingual Dictionary Entry and built synsets for the word “aA<” (Angshu)
The synonymous entries are separated by a slash (“/”), and spaces are replaced by an underscore
(“_”) within the same synonymous set, to avoid confusion between the separation of words within
one sense and the separation of two senses. Table 7.6 shows the frequencies of the different
synsets according to their part-of-speech.
Word Entries 33619
Synsets 63403
Noun 28485
Adjective 11023
Pronoun 235
Indeclinables 497
Verb 1709
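The separator conventions described for the dictionary can be illustrated with a small sketch. It covers only the comma and semicolon clues and the slash and underscore output format; the handling of the tilde notation, part-of-speech markers and identification numbers is left out. The function name `build_synsets` and the English gloss used in the example are hypothetical.

```python
def build_synsets(gloss):
    """Split a dictionary gloss into synsets.

    Senses are separated by ';' and synonyms of one sense by ','.
    In the output, synonyms are joined by '/' and spaces inside a
    synonym are replaced by '_'.
    """
    synsets = []
    for sense in gloss.split(";"):
        synonyms = [s.strip().replace(" ", "_") for s in sense.split(",") if s.strip()]
        if synonyms:
            synsets.append("/".join(synonyms))
    return synsets

# Hypothetical English gloss with two senses.
example = build_synsets("ray, beam of light; thread, fibre")
```

Here `example` holds one string per sense, with the multi-word synonym kept together by underscores.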
This phase is the beginning of the main clustering approach. Here, we have tried to generate
the synonymous sets for the nouns present in the corpus using the dictionary. As our main goal is
to intersect the synsets of two consecutive nouns, we have used the
dictionary as our standard resource and the dictionary entries as the members of the synsets.
Another reason for using dictionary entries as the closed set of entries is that Bengali
is a resource-constrained language. The lack of a standard lemmatizer or stemmer for Bengali is an
obstacle to directly comparing the synsets of the two components of a candidate
bigram present in the documents. We can imagine a dictionary as a closed set of words W1, W2,
W3,……,Wn, where
Here, W1, W2, ….,Wm are the dictionary entries and i, j, k,…., p are numbers that can vary
from 1 to the number of synsets of the corresponding entries and that make closed sets of synonyms.
Now, each noun identified by the shallow parser in the document is searched for in the
dictionary, with or without inflection. Suppose N is a noun in the
corpus and it is present in the synsets of W1, W3 and W5. These entries then become the members
of the synset of N. Mathematically, the synset of the noun N can be defined as:
SynSet (N) = {Wl} (7.8)
where l ∈ 1 to m, such that N ∈ {nla}. Here, a ∈ {i, j, k, …, p} and m is the number of
dictionary entries.
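Equation 7.8 amounts to a reverse lookup in the dictionary. The following sketch assumes that the dictionary is held as a mapping from each entry to the list of its synonym sets and that nouns are matched exactly; matching inflected forms is not covered. The function name `synset_of` and the toy entries are hypothetical.

```python
def synset_of(noun, dictionary):
    """Return the set of dictionary entries whose synsets contain the noun.

    dictionary: {entry: [set_of_synonyms, ...]}
    Assumption: exact string match, no handling of inflection.
    """
    return {
        entry
        for entry, synsets in dictionary.items()
        if any(noun in synonyms for synonyms in synsets)
    }

# Toy dictionary mirroring the W1, W3, W5 example in the text.
toy_dictionary = {
    "W1": [{"N", "a"}],
    "W2": [{"b"}],
    "W3": [{"c"}, {"N", "d"}],
    "W4": [{"e"}],
    "W5": [{"N"}],
}
toy_synset = synset_of("N", toy_dictionary)
```

For the toy dictionary, `toy_synset` contains W1, W3 and W5, as in the example given above.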
After generating the synsets for all the nouns in the document, the main task is to
identify the semantic similarity between two nouns. This has been done by intersecting the
synsets of the two words, and a score has been assigned that shows the semantic affinity between
them. Suppose Ni and Nj are two nouns in the document and Wi and Wj are their
corresponding synsets. The semantic similarity of the two words can then be defined as:
From the above equation, it is clear that this value is maximum when a word is compared
with itself (i.e. Similarity (Ni, Nj) is maximum when i=j). This measure gives an approximate
similarity score according to the commonality of the
synonymous sets. However, this is not the only procedure for
measuring the similarity between two words. The final similarity measurement has been made
using two vector space measures discussed in Section 7.5.2.6.
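The idea of scoring two nouns by the commonality of their synsets can be illustrated with a very small sketch. The exact scoring formula is not reproduced here, so the sketch simply assumes that the score is the number of shared entries; this choice, and the function name `overlap_similarity`, are assumptions made only for illustration.

```python
def overlap_similarity(synset_a, synset_b):
    """Number of dictionary entries shared by two synsets.

    Assumption of this sketch: raw overlap count, which is largest
    when a synset is compared with itself.
    """
    return len(set(synset_a) & set(synset_b))

# Toy synsets built from hypothetical dictionary entries.
w_i = {"W1", "W3", "W5"}
w_j = {"W3", "W5", "W7"}
```

With these toy sets, comparing `w_i` with `w_j` gives 2, while comparing `w_i` with itself gives 3, in line with the observation that the score peaks for i=j.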
The shallow parser identifies all the nouns present in the document and tags them as ‘NN’
(common noun), ‘NST’ (noun denoting spatial and temporal expression) and ‘NNP’ (proper
noun). For this experiment, we have mainly used the common nouns. Based on the score
obtained from the semantic similarity measurement, for each noun we have clustered all the other
nouns in the document and assigned a score to each similarity. For example,
suppose the nouns identified by the shallow parser in the document are W1, W2, …,Wi, Wj, Wk,
Wl, Wm, Wn,….etc. Then, for each noun M belonging to that set, the semantic similarities of M
with the other nouns are shown in Figure 7.7. In this figure, the distances from the word M to
the other nouns are the weights associated with the edges (i.e. a, b, c, etc).
In this phase, the actual identification of candidates as MWE is done using the output
obtained from the previous phase. If we consider noun-noun bigram as <M1 M2>, then the
algorithm to identify the bigram as MWE is discussed below with proper example shown in
Figure 7.8.
ALGORITHM: MWE-CHECKING
INPUT: noun-noun bigram <M1 M2>
OUTPUT: True if MWE, false otherwise.
1. Extract the semantic clusters of M1 and M2 using the procedure described in Sections 7.5.2.3,
7.5.2.4 and 7.5.2.5.
2. Intersect the synsets of M1 and M2 (Figure 7.8 shows only the similar words
common to both M1 and M2).
3. For measuring the similarity between M1 and M2:
3.1. Identify the common elements of the similarity sets of M1 and M2 (here the number of
common elements is 8) with their scores (wij).
3.2. In an n-dimensional vector space (here n=8), put M1 and M2 as two vectors with the
associated weights as their co-ordinates.
3.3. Calculate the cosine-similarity and the Euclidean distance.
3.4. Take the decision individually for the two measurements:
3.4.1 If cosine-similarity > m, return false; else return true; or
3.4.2 If Euclidean distance > n, return false; else return true;
(where m and n are the pre-defined cut-offs)
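The steps of the algorithm can be sketched in a few lines. This is only an illustration, not the original implementation: each cluster is assumed to be a mapping from a similar word to its similarity score, the decision rules follow steps 3.4.1 and 3.4.2 literally, and a pair with no common element is treated as an MWE candidate, which is an assumption of this sketch. All function names are hypothetical.

```python
import math

def common_vectors(cluster_m1, cluster_m2):
    """Steps 3.1 and 3.2: vectors over the words common to both clusters."""
    common = sorted(set(cluster_m1) & set(cluster_m2))
    v1 = [cluster_m1[word] for word in common]
    v2 = [cluster_m2[word] for word in common]
    return v1, v2

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def euclidean_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def is_mwe_by_cosine(cluster_m1, cluster_m2, m):
    """Step 3.4.1: false if cosine-similarity > m, true otherwise."""
    v1, v2 = common_vectors(cluster_m1, cluster_m2)
    if not v1:
        return True  # assumption: no common element means no similarity
    return not cosine_similarity(v1, v2) > m

def is_mwe_by_euclidean(cluster_m1, cluster_m2, n):
    """Step 3.4.2 as stated: false if Euclidean distance > n, true otherwise."""
    v1, v2 = common_vectors(cluster_m1, cluster_m2)
    if not v1:
        return True  # same assumption as above
    return not euclidean_distance(v1, v2) > n

# Toy clusters (hypothetical words and scores).
c1 = {"a": 0.9, "b": 0.1, "z": 0.5}
c2 = {"a": 0.1, "b": 0.9, "y": 0.3}
```

For the toy clusters, the common words are `a` and `b`, the cosine-similarity is about 0.22, and `is_mwe_by_cosine(c1, c2, m=0.5)` therefore returns true.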
Step 3 of this algorithm needs some explanation. After identifying the terms common to
both sets, a vector space model is used to measure the similarity between the two components. In
the n-dimensional vector space, these common elements become the axes and each candidate acts as
a vector in that space. The co-ordinate of the vector along each axis is the
similarity between the candidate and the common term on that axis. The binary
decision regarding the classification of a given candidate as MWE or not (step 3.4) may seem
counter-intuitive. In the experiment, we have seen that bigram MWEs, mainly
idioms, show low similarity values between their constituents.
Figure 7.8.1 Intersection of the clusters of the constituents; Fig 7.8.2 Similarity between two
constituents
If we take the example of the Bengali idiom hater panch (remaining resource), WordNet
defines the two components of the idiom as follows: hat (hand) is the part of a limb that is
farthest from the torso, and panch (five) is a number which is one more than four. From these two
glosses it is quite clear that they are not semantically related in any sense. The synonymous
sets for these two components extracted from the formatted dictionary are shown below:
Synset () = {s, , প=, `, * я, X) , s kপ, 4 =, ", ", sk , sn , я}
Synset (পM() = {পc, A" , >, d, , , =, X$, nt, >, পct, প , প.=>, পc)}
It is clear from the above synonymous sets that the two sets have no element in
common, so their similarity score is zero. In this case, the vector space model cannot be
constructed, since it would have zero dimensions. Such candidates are assigned a marginal weight
to mark them as completely non-compositional phrases. To establish their non-compositionality,
we have to show that their co-occurrence is not a one-off case; rather, they occur side by side on
several occasions. This statistical evidence is better determined using a large corpus. Here, for
those candidate phrases that show zero similarity, we have checked that they occur more than once
in the corpus. Taking a decision based on a single occurrence may give an incorrect result, because
the author may have used the phrase unintentionally. In short, the higher the
similarity between the two components of a bigram, the lower the probability that it is an MWE.
As mentioned earlier, there is no lexical resource like WordNet for Bengali. We have therefore
tried to use the English WordNet (discussed in Section 4.2.1) in this research to measure the
semantic distance between two Bengali words after translating them into English.
WordNet::Similarity is an open-source package for calculating the lexical similarity between word
(or sense) pairs based on a variety of similarity measures. Basically, it measures the relative
distance between the two nodes denoted by the two words in the WordNet tree, which can vary from
-1 to 1, where -1 indicates total dissimilarity between the two nodes.
In this experiment, we have first translated the roots of the two Bengali components of a phrase
into their English forms using a Bengali to English bilingual dictionary. These two words are then
passed to the WordNet-based similarity module for measuring the distance. A predefined
cut-off value is used to distinguish between an MWE and a simple compositional term. If the
measured distance is less than that threshold, the similarity between the two words is low. If the
candidate phrase consisting of these two words nevertheless has a reasonable number of occurrences
in the corpus, the phrase is concluded to be an MWE. Evaluation results are taken after varying
the cut-off value.
The list of noun-noun collocations is extracted from the output of the parser for manual
checking. It is observed that 39.39% error occurs due to wrong POS tagging or extracting invalid
collocations due to considering a bigram in an n-gram chunk where n > 2. We have separated these
phrases from the final list.
We have used standard IR metrics like Precision (P), Recall (R) and F-score (F) to
evaluate the final results obtained from the three modules. A human-annotated list is used as the
gold standard for the evaluation. The results are shown in Table 7.8. The predefined threshold has
been varied to obtain individual results in each case. The increase of recall with the
increase of the cut-off implies that the maximum number of MWEs is identified over a wide range
of thresholds. Precision, however, does not increase monotonically, which shows that a higher
cut-off degrades the performance. Reasonable results for precision and recall have been achieved
for cosine-similarity at the cut-off value of 0.5, whereas Euclidean distance and WordNet Similarity
give maximum precision at cut-off values of 0.4 and 0.5 respectively.
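The effect of varying the cut-off can be sketched with a small threshold sweep. The sketch assumes that a candidate is predicted as an MWE when its score does not exceed the cut-off, as in step 3.4 of the algorithm; the function name `sweep_cutoffs`, the candidate names and the scores are hypothetical and are not results of the experiment.

```python
def sweep_cutoffs(scores, gold, cutoffs):
    """Precision, Recall and F-score for each cut-off value.

    scores: {candidate: similarity score}
    gold:   set of candidates annotated as MWEs
    A candidate is predicted as MWE when score <= cut-off.
    """
    results = []
    for cutoff in cutoffs:
        predicted = {cand for cand, score in scores.items() if score <= cutoff}
        true_pos = len(predicted & gold)
        p = true_pos / len(predicted) if predicted else 0.0
        r = true_pos / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        results.append((cutoff, p, r, f))
    return results

# Toy input (hypothetical).
toy_scores = {"x": 0.1, "y": 0.3, "z": 0.7}
toy_gold = {"x", "z"}
toy_results = sweep_cutoffs(toy_scores, toy_gold, [0.2, 0.5, 0.8])
```

In the toy run, recall never decreases as the cut-off grows, while precision first drops and then recovers, which illustrates why the two metrics have to be inspected together when the cut-off is chosen.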
Baldwin et al. (2003) suggested that WordNet::Similarity is effective for identifying the
decomposability of Multiword Expressions. Somewhat surprisingly, we reach the same conclusion
for the Bengali language. There are also candidates with a very low similarity value between their
constituents (e.g. ganer gajat, ‘earth of songs’), and they are discarded from this experiment
because of their low frequency of occurrence in the corpus.
7.5.5 Conclusion
We hypothesized that sense induction by analyzing synonymous sets can assist the
identification of Multiword Expressions. We have introduced an unsupervised approach to
explore the hypothesis and have shown that a clustering technique along with similarity measures
can be successfully employed to perform the task. This experiment additionally contributes to the
following: (i) Clustering of words having similar sense, (ii) Identification of MWEs for
4 http://www.ethnologue.com/ethno_docs/distribution.asp?by=size
(CPs) in Bengali consists of two types, compound verbs (CompVs) and conjunct verbs (ConjVs).
The compound verbs (CompVs) (e.g. mere phela ‘kill’, bolte laglo
‘started saying’) consist of two verbs. The first verb is termed the Full Verb (FV) and is present
at surface level either in the conjunctive participial form -e or in the infinitive form -te. The
second verb bears the inflection based on Tense, Aspect and Person. The second verbs, which are
termed Light Verbs (LVs), are polysemous, semantically bleached and confined to a
definite set of candidate seeds (Paul, 2010).
On the other hand, each of the Bengali conjunct verbs (ConjVs) (e.g. bharsha
kara ‘to depend’, jhakjhak kara ‘to glow’) consists of a noun or adjective followed by
a Light Verb (LV). The Light Verbs (LVs) bear the appropriate inflections based on Tense, Aspect
and Person. According to the definition of multi-word expressions (MWEs) (Baldwin and Kim
2010), the absence of the conventional meaning of the Light Verbs in Complex Predicates (CPs)
leads us to consider the Complex Predicates (CPs) as MWEs (Sinha 2009). But there are some
typical examples of Complex Predicates (CPs), e.g. dekha kara ‘see-do’, that bear the
same lexical pattern, Full Verb (FV) + Light Verb (LV), in which both the Full Verb (FV) and the
Light Verb (LV) lose their conventional meanings and generate a completely different meaning
(‘to meet’ in this case).
In addition, other types of predicates such as niye gelo ‘take-go’ (took and
went) and diye gelo ‘give-go’ (gave and went) follow the same FV+LV lexical pattern
as Complex Predicates (CPs), but they are not mono-clausal. Both the Full Verb (FV) and the
Light Verb (LV) behave like independent syntactic entities and they belong to non-Complex
Predicates (non-CPs). Such verbs are also termed Serial Verbs (SV) (Mukherjee et al. 2006).
Butt (1993) and Paul (2004) have also mentioned the following criteria that are used to check the
validity of complex predicates (CPs) in Bengali. The following constructions are not valid
complex predicates (CPs).
1. Control Construction (CC): " likhte bollo ‘asked to write’, " 4
3. Passives (Pass): dhora porlo ‘was caught’, mara holo ‘was beaten’
4. Auxiliary Construction (AC): bose ache ‘is sitting’, niye chilo ‘had taken’.
Sometimes, a successive sequence of Complex Predicates (CPs) poses the problem of
deciding the scope of the individual Complex Predicates (CPs) present in that sequence. For
example, the sequence uthe pore dekhlam ‘rise-wear-see’ (rose and saw)
seems to contain two Complex Predicates (CPs) (uthe pore ‘rose’ and
pore dekhlam ‘wore and see’). But there is actually one Complex Predicate (CP). The first one,
uthe pore ‘rose’, is a compound verb (CompV) as well as a Complex Predicate (CP).
The other one, dekhlam ‘saw’, is a simple verb. As the sequence is not mono-clausal,
the Complex Predicate (CP) uthe pore ‘rose’ associated with dekhlam
also the augmentation of argument structure agreement (Das 2009), the analysis of Non-
MonoClausal Verb (NMCV) or Serial Verb, Control Construction (CC), Modal Control
Construction (MCC), Passives (Pass) and Auxiliary Construction (AC) (Butt, 1993; Paul, 2004)
are also necessary to identify the Complex Predicates (CPs). The error analysis shows that the
system suffers in distinguishing the Complex Predicates (CPs) from the above constraint
constructions.
Two types of Bengali corpus have been considered to carry out the present task. One corpus
is collected from the travel and tourism domain and the other from an online web archive of
Rabindranath Rachanabali (discussed in Section 4.1.1). The former, the EILMT travel and tourism
corpus, is obtained from the consortium mode project “Development of English to Indian
Languages Machine Translation (EILMT)5 System”. The second corpus is retrieved from
the web archive and pre-processed accordingly. Each of the Bengali corpora contains 400
development and 500 test sentences. The sentences are passed through an open
source Bengali shallow parser. The shallow parser gives different morphological information
(root, lexical category of the root, gender, number, person, case, vibhakti, tam, suffixes etc.) that
helps in identifying the lexical patterns of Complex Predicates (CPs).
5 The EILMT project is funded by the Department of Information Technology (DIT), Ministry of
Communications and Information Technology (MCIT), Government of India.
processed to generate the simplified patterns. An example of a similar lexical pattern in the
shallow parsed result and its simplified output is shown in Figure 7.9.
The lexical categories of the root words adhyan ‘study’ (‘n’ for noun) and kar ‘do’ (‘v’ for
verb) are shown in bold face in Figure 7.9. The following example is of a conjunct verb
(ConjV). The extraction of Bengali compound verbs (CompVs) is more straightforward than
that of conjunct verbs (ConjVs). The lexical pattern of a compound verb is
{[XXX](v) [YYY](v)}, where the lexical or basic POS categories of the root words of “XXX”
and “YYY” are both verb. Only if the basic POS tags of the root forms of “XXX” and “YYY” are
verbs (v) in the shallow parsed sentences are the corresponding lexical patterns considered as
probable candidates for compound verbs (CompVs).
Example 1:
<i |verb| <i/VM/VGNF/) ^v^*^*^i^*^i^i)
Bengali, like other Indian languages, is morphologically very rich. Different suffixes may be
attached to a Light Verb (LV) (in this case [YYY]) depending on features such as Tense,
Aspect, and Person. For extracting compound verbs (CompVs), the Light Verbs are identified
from a seed list (Paul, 2004). The list of Light Verbs is specified in Table 7.9. The dictionary
forms of the Light Verbs are stored in this list. As the Light Verbs carry different suffixes, the
primary task is to identify the root forms of the Light Verbs (LVs) from the shallow parsed
result. Another table that stores the root forms and the corresponding dictionary forms of the
verbs is used in the present task. This table contains a total of 378 verb entries, including Full
Verbs (FVs) and Light Verbs (LVs). The dictionary forms of the Light Verbs (LVs) are
retrieved from this table.
On the other hand, the conjunctive participial form -e/-iya or the infinitive form -te/-ite is
attached to the Full Verb (FV) (in this case [XXX]) in compound verbs (CompVs). In the
literature, -iya and -ite are also used for the conjunctive participial form -e and the infinitive
form -te respectively. The participial and infinitive forms are checked based on the
morphological information (e.g. suffixes of the verb) given in the shallow parsed results. In
Example 1, the Full Verb (FV) contains the -iya suffix. If the dictionary form of the Light Verb
(LV) is present in the list of Light Verbs and the Full Verb (FV) carries one of the suffixes
-e/-iya or -te/-ite, the two verbs are combined to frame the pattern of a compound verb
(CompV).
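The pair test described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the system's actual implementation: the `Token` record, the romanised dictionary forms and the five-entry Light Verb set are hypothetical stand-ins (the real seed list is the one referred to as Table 7.9), and the suffix assigned to dekhlam is assumed.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the Light Verb seed list (not the contents of Table 7.9).
COMPOUND_LIGHT_VERBS = {"jaoya", "pOra", "otha", "deoya", "phela"}

# Suffixes named in the text: conjunctive participial -e/-iya, infinitive -te/-ite.
FULL_VERB_SUFFIXES = {"e", "iya", "te", "ite"}


@dataclass
class Token:
    surface: str    # romanised word form
    pos: str        # basic POS of the root: "v", "n" or "adj"
    dict_form: str  # dictionary form of the root
    suffix: str     # suffix reported by the shallow parser ("" if none)


def is_compound_verb(full: Token, light: Token) -> bool:
    """Return True if [full][light] matches the {[XXX](v) [YYY](v)} pattern."""
    return (
        full.pos == "v"
        and light.pos == "v"
        and full.suffix in FULL_VERB_SUFFIXES
        and light.dict_form in COMPOUND_LIGHT_VERBS
    )


uthe = Token("uthe", "v", "otha", "e")
pore = Token("pore", "v", "pOra", "e")
dekhlam = Token("dekhlam", "v", "dekha", "lam")  # suffix value is an assumption

print(is_compound_verb(uthe, pore))     # True
print(is_compound_verb(pore, dekhlam))  # False: dekha is not in the Light Verb set
```

The check deliberately looks only at the basic POS, the suffix of the first verb and the dictionary form of the second verb, which mirrors the three conditions stated in the text.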
The identification of conjunct verbs (ConjVs) requires the lexical pattern (Noun / Adjective +
Light Verb), where a noun or an adjective is followed by a Light Verb (LV). The list of
dictionary forms of the Light Verbs (LVs) that are frequently used in conjunct verbs (ConjVs)
is prepared manually and is given in Table 7.10. The detection of Light Verbs (LVs) for
conjunct verbs (ConjVs) is similar to the detection of Light Verbs (LVs) for compound verbs
(CompVs) described earlier in this section. If the basic POS of the root of the first word
([XXX]) is either “noun” or “adj” (n/adj) and the basic POS of the following word ([YYY]) is
“verb” (v), the pattern is considered a conjunct verb (ConjV). Example 2 is an example of a
conjunct verb (ConjV).
For example, jhakjhak kara ‘to glow’, taktak ‘to glow’ and chupchap kara ‘to be silent’ are
identified as conjunct verbs (ConjVs), where the former word has the basic POS adjective (adj)
and is followed by kara ‘to do’, a common Light Verb.
Example 3:
hhk | adj | hhk /JJ/JJP/(hhk ^adj) #
| verb | /VM/VGF/(r^v^*^*^5^*^r^r)
However, the extraction of conjunct verbs (ConjVs) that have a “noun+verb” construction is
descriptively and theoretically puzzling (Das 2009). The identification of lexical patterns is not
sufficient to recognize such conjunct verbs (ConjVs). For example, boi deoya ‘give book’ and
bharsa deoya ‘to assure’ both contain the same lexical pattern (noun+verb) and the same Light
Verb deoya. But bharsa deoya ‘to assure’ is a conjunct verb (ConjV), whereas boi deoya ‘give
book’ is not. Linguistic observation shows that the inclusion of this typical category into
conjunct verbs (ConjVs) requires additional knowledge of syntax and semantics. In connection
with conjunct verbs
(ConjVs), Mohanty (2010) defines two types, synthetic and analytic. A synthetic conjunct verb
is one in which both constituents form an inseparable whole from the semantic point of view,
i.e., it is semantically non-compositional. An analytic conjunct verb, on the other hand, is
semantically compositional. Hence, the identification of conjunct verbs requires knowledge of
semantics rather than only the lexical patterns. It should also be mentioned that the negative
markers (no, nai) are sometimes attached to the Light Verbs, as in uthona ‘do not get up’ and
phelona ‘do not throw’. Negative attachments are also considered in the present task while
checking the suffixes of Light Verbs (LVs).
The identification of lexical scopes of the Complex Predicates (CPs) from their successive
sequences shows that multiple Complex Predicates (CPs) can occur in a long sequence. An
automatic method is employed to identify the Complex Predicates (CPs) along with their lexical
scopes. The lexical category or basic POS tags are obtained from the parsed sentences. If
compound and conjunct verbs occur successively in a sequence, the leftmost two successive
tokens are chosen to construct the Complex Predicate (CP). If successive verbs are present in a
sequence and the dictionary form of the second verb is present in the list of compound Light
Verbs (LVs), then that Light Verb (LV) may be part of a compound verb (CompV). For that
reason, the immediately preceding word token is chosen and tested for its basic POS in the
parsed result. If the basic POS of the previous word is “verb (v)” and a suffix of either the
conjunctive participial form -e/-iya or the infinitive form -te/-ite is attached to it, the two
successive verbs are grouped together to form a compound verb (CompV) and the lexical scope
is fixed for the Complex Predicate (CP).
If the previous verb does not carry the -e/-iya or -te/-ite inflections, no compound verb
(CompV) is framed with these two verbs. However, the second verb, the Light Verb (LV), may
be part of another Complex Predicate (CP). This Light Verb (LV) is now considered as the Full
Verb (FV), its immediate next verb is searched for in the list of compound Light Verbs (LVs),
and the formation of compound verbs (CompVs) proceeds similarly. If the verb is not in the list
of compound Light Verbs, the search begins again by considering the present verb as the Full
Verb (FV) and proceeds in a similar way.
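The left-to-right scope procedure for successive verbs can be sketched as follows. This is a simplified illustration under assumptions, not the thesis implementation: the `Verb` record, the romanised dictionary forms and the small Light Verb set are hypothetical, and the function name `compound_verb_scopes` is introduced only for this sketch.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical stand-ins; the actual Light Verb list is the one in Table 7.9.
COMPOUND_LIGHT_VERBS = {"jaoya", "pOra", "otha", "deoya", "phela"}
FULL_VERB_SUFFIXES = {"e", "iya", "te", "ite"}


class Verb(NamedTuple):
    surface: str    # romanised word form
    dict_form: str  # dictionary form of the root
    suffix: str     # suffix reported by the shallow parser


def compound_verb_scopes(verbs: List[Verb]) -> List[Tuple[str, str]]:
    """Scan successive verbs left to right and fix the scope of each CompV."""
    scopes = []
    i = 0
    while i + 1 < len(verbs):
        full, light = verbs[i], verbs[i + 1]
        if light.dict_form in COMPOUND_LIGHT_VERBS and full.suffix in FULL_VERB_SUFFIXES:
            scopes.append((full.surface, light.surface))
            i += 2  # scope is fixed: neither verb is reused
        else:
            i += 1  # the second verb becomes the next candidate Full Verb
    return scopes


seq1 = [Verb("uthe", "otha", "e"), Verb("pore", "pOra", "e"), Verb("dekhlam", "dekha", "lam")]
seq2 = [Verb("chalte", "chala", "te"), Verb("giye", "jaoya", "e"),
        Verb("pore", "pOra", "e"), Verb("gelam", "jaoya", "lam")]

print(compound_verb_scopes(seq1))  # [('uthe', 'pore')]
print(compound_verb_scopes(seq2))  # [('chalte', 'giye'), ('pore', 'gelam')]
```

Advancing by two positions after a match is what prevents an overlapping reading such as giye pore in the second sequence; a verb left without a partner at the end, such as dekhlam, is simply treated as a simple verb.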
The following examples are given to illustrate the formation of compound verbs (CompVs)
and the identification of their lexical scopes.
L ( প $ ।
compound Light Verbs (as shown in Table 7.9), the immediately preceding verb chalte is
checked for the inflections -e/-iya or -te/-ite. As the verb chalte contains the inflection -te,
the verb group chalte giye is a compound verb (CompV) where giye is a Light Verb and chalte
is the Full Verb with inflection -te. The next verb group, pore gelam, is identified as a
compound verb (CompV) in a similar way.
of uthe is pore, which is chosen, and its dictionary form is searched in the list of compound
Light Verbs (LVs) similarly. As the dictionary form (pOra) of the verb pore is present in the
list of Light Verbs and the verb uthe contains the inflection -e, the consecutive verbs frame a
compound verb (CompV) uthe pore, where uthe is a Full Verb with inflection -e and pore is a
Light Verb. The final verb dekhlam is then chosen and, as no other verb is present, dekhlam is
excluded from any compound verb (CompV) formation and considered a simple verb. A similar
technique is adopted for identifying the lexical scopes of conjunct verbs (ConjVs). The method
is a simple left-to-right pattern matching technique, but it helps in the case of conjunct verbs
(ConjVs). As the noun or
adjective occurs in the first slot of a conjunct verb (ConjV) construction, the search starts
from the noun or adjective. If the basic POS of the current token is either “noun” or
“adjective” and the dictionary form of the next token, with the basic POS “verb (v)”, is in the
list of conjunct Light Verbs (LVs), then the two consecutive tokens are combined to frame the
pattern of a conjunct verb (ConjV). For example, identifying the lexical scope of a conjunct
verb (ConjV) in a sequence such as uparjon korte gelam ‘earn-do-go’ (went to earn) yields the
conjunct verb (ConjV) uparjon korte. The other verb group, korte gelam, seems to be a
compound verb (CompV) but is excluded by considering gelam as a simple verb.
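The same left-to-right idea for conjunct verbs might look like the sketch below. It is an illustration only: the `Token` record, the two-entry Light Verb set (standing in for Table 7.10) and the function name `conjunct_verb_scopes` are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical stand-in for the manually prepared list of Table 7.10.
CONJUNCT_LIGHT_VERBS = {"kara", "deoya"}


class Token(NamedTuple):
    surface: str    # romanised word form
    pos: str        # basic POS of the root: "n", "adj" or "v"
    dict_form: str  # dictionary form of the root


def conjunct_verb_scopes(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Pair a noun/adjective with an immediately following conjunct Light Verb."""
    scopes = []
    i = 0
    while i + 1 < len(tokens):
        head, verb = tokens[i], tokens[i + 1]
        if (head.pos in ("n", "adj") and verb.pos == "v"
                and verb.dict_form in CONJUNCT_LIGHT_VERBS):
            scopes.append((head.surface, verb.surface))
            i += 2  # the Light Verb is consumed, so it cannot start a CompV
        else:
            i += 1
    return scopes


sequence = [Token("uparjon", "n", "uparjon"),
            Token("korte", "v", "kara"),
            Token("gelam", "v", "jaoya")]
print(conjunct_verb_scopes(sequence))  # [('uparjon', 'korte')]

# The pattern alone also accepts boi deoya 'give book', which is not a ConjV.
print(conjunct_verb_scopes([Token("boi", "n", "boi"), Token("deoya", "v", "deoya")]))
```

The second call shows the over-generation discussed earlier: a purely lexical pattern cannot separate bharsa deoya from boi deoya, which is why semantic knowledge is said to be required.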
Table 7.11. Recall, Precision and F-Score of the system for acquiring the CompVs and ConjVs
from the EILMT Travel and Tourism Corpus.
Table 7.12. Recall, Precision and F-Score of the system for acquiring the CompVs and ConjVs
from the Rabindra Rachanabali corpus.
The results show that the system identifies the Complex Predicates (CPs) satisfactorily from
both corpora. In the case of Compound Verbs (CompVs), the precision is higher than the
recall. The lower recall for Compound Verbs (CompVs) signifies that the system fails to
capture the other instances from overlapping sequences as well as from non-Complex
Predicates (non-CPs). However, it is observed that identifying the lexical scopes of compound
verbs (CompVs) and conjunct verbs (ConjVs) in long sequences of successive Complex
Predicates (CPs) increases the number of Complex Predicate (CP) entries, for compound verbs
(CompVs) and conjunct verbs (ConjVs) alike. The figures shown in bold face in Table 7.11 and
Table 7.12, for the Travel and Tourism corpus and the Short Story corpus of Rabindranath
Tagore respectively, indicate the improvement obtained by identifying the lexical scopes of the
Complex Predicates (CPs). In comparison with a similar language such as Hindi (Mukerjee et
al. 2006), for which the reported precision and recall are 83% and 46% respectively, our results
(84.66% precision and 83.67% recall) are higher for extracting Complex Predicates (CPs). The
reason may be the resolution of lexical scope and the handling of morpho-syntactic features
using the shallow parser. In addition to Non-MonoClausal Verb (NMCV) or Serial Verb, the
other criteria (Butt 1993; Paul 2004) are used in our present diagnostic tests to identify the
Complex Predicates (CPs). The frequencies of Compound Verb (CompV), Conjunct Verb
(ConjV) and the instances of other constraints of non-Complex Predicates (non-CPs) are shown
in Figure 7.10. It is observed that the numbers of instances of Conjunct Verb (ConjV), Passives
(Pass), Auxiliary Construction (AC) and Non-MonoClausal Verb (NMCV) or Serial Verb are
comparatively higher than those of the other instances in both corpora.
The error analysis is conducted on both corpora. Treating both corpora as a single corpus, the
confusion matrix is developed and shown in Table 7.13. The bold face figures in Table 7.13
indicate the percentages of non-Complex Predicates (non-CPs), such as Non-MonoClausal
Verbs (NMCV), Passives (Pass) and Auxiliary Constructions (AC), that are identified as
compound verbs (CompVs). The reason is that the frequencies of these non-Complex
Predicates (non-CPs) are reasonably high in the corpus. In the case of conjunct verbs (ConjVs),
Non-MonoClausal Verbs (NMCV) and Auxiliary Constructions (AC) are misidentified as
conjunct verbs (ConjVs). The system also suffers from the lack of clause detection, which is
not attempted in the present task. Passives (Pass) and Auxiliary Constructions (AC) require
knowledge of semantics together with argument structure.
Table 7.13 Confusion Matrix for CPs and constraints of non-CPs (in %).
Figure 7.10 The frequencies of Complex Predicates (CPs) and different constraints of
non-Complex Predicates (non-CPs).
7.6.5 Conclusion
In this section, we have presented a study of Bengali Complex Predicates (CPs) with a special
focus on compound verbs, and proposed automatic methods for their extraction from a corpus
along with diagnostic tests for their evaluation. Problems arise in distinguishing Complex
Predicates (CPs) from Non-MonoClausal verbs, as lexical patterns alone are insufficient to
identify the verbs. In future work, the subcategorization frames or argument structures of the
sentences are to be identified to resolve the errors of the present system.
Chapter 8
8.1 Introduction
The measurement of relative compositionality of bigrams is crucial to identify Multi-word
Expressions (MWEs) in Natural Language Processing (NLP) tasks. The article presents the
experiments carried out as part of the participation in the shared task ‘Distributional Semantics
and Compositionality (DiSCo)’ organized as part of the DiSCo workshop in ACL-HLT 2011.
The experiments deal with various collocation based statistical approaches to compute the
relative compositionality of three types of bigram phrases (Adjective-Noun, Verb-subject and
Verb-object combinations). The experimental results in terms of both fine-grained and coarse-
grained compositionality scores have been evaluated with the human annotated gold standard
data. Reasonable results have been obtained in terms of average point difference and coarse
precision.
Chapter 8: Measuring the Compositionality of Bigrams in English 146
The present work examines the relative compositionality of Adjective-Noun (ADJ-NN; e.g.,
blue chip), Verb-subject (V-SUBJ; where the noun acts as the subject of the verb, e.g., name
imply) and Verb-object (V-OBJ; where the noun acts as the object of the verb, e.g., beg
question) combinations using collocation based statistical approaches. Measuring the relative
compositionality is useful in applications such as machine translation, where highly
non-compositional collocations can be handled in a special way (Hwang and Sasaki 2005).
Multi-word expressions (MWEs) are sequences of words that tend to co-occur more frequently
than chance and are either idiosyncratic or decomposable into multiple simple words (Baldwin
2006). Deciding the idiomaticity of MWEs is highly important for machine translation,
information retrieval, question answering, lexical acquisition, parsing and language generation.
Compositionality refers to the degree to which the meaning of an MWE can be predicted by
combining the meanings of its components. Unlike syntactic compositionality (e.g. by and
large), semantic compositionality is continuous (Baldwin 2006).
Several studies have been carried out for detecting compositionality of noun-noun MWEs
using WordNet hypothesis (Baldwin et al. 2003), verb-particle constructions using statistical
similarities (Bannard et al. 2003; McCarthy et al. 2003) and verb-noun pairs using Latent
Semantic Analysis (Katz and Giesbrecht 2006).
Our contributions are two-fold: firstly, we experimentally show that collocation based
statistical compositionality measurement can assist in identifying the continuum of
compositionality of MWEs. Secondly, we show that supervised weighted parameter tuning
results in accuracy that is comparable to the best manually selected combination of parameters.
originally defined as the mutual information between particular events X and Y and in our case
the occurrence of particular words, as follows:
$$\mathrm{PMI}(X,Y) = \log \frac{P(X,Y)}{P(X)\,P(Y)} \approx \log \frac{N \cdot C(X,Y)}{C(X)\,C(Y)} \tag{8.1}$$
PMI represents the amount of information provided by the occurrence of the event represented
by X about the occurrence of the event represented by Y.
T-test: The t-test has been widely used for collocation discovery. This statistical test tells us
the probability of a certain constellation (Nugues, 2006). It looks at the mean and variance of a
sample of measurements. The null hypothesis is that the sample is drawn from a distribution
with mean μ. The t-score is computed using equation (8.2):
$$t(X,Y) = \frac{\overline{P(X,Y)} - \overline{P(X)}\,\overline{P(Y)}}{\sqrt{\sigma^2(P(X,Y)) + \sigma^2(P(X)\,P(Y))}} \approx \frac{C(X,Y) - \frac{1}{N}\,C(X)\,C(Y)}{\sqrt{C(X,Y)}} \tag{8.2}$$
In both equations (8.1) and (8.2), C(X) and C(Y) are respectively the frequencies of word X
and word Y in the corpus, C(X,Y) is the combined frequency of the bigram <X Y> and N is the
total number of tokens in the corpus. The mean value of P(X,Y) represents the average
probability of the bigram <X Y>. The bigram count can be extended to the frequency of word X
when it is followed or preceded by Y within a window of K words (here K=1).
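As a small illustration, the count-based approximations of PMI and the t-score can be computed directly from C(X), C(Y), C(X,Y) and N as defined above. The sketch below is not the code used in the experiments; the function names are hypothetical, the logarithm base (2) is an assumption because the text does not state it, and the counts in the usage lines are invented.

```python
import math


def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Approximate PMI from counts: log( N * C(X,Y) / (C(X) * C(Y)) ), base 2 assumed."""
    return math.log2(n * c_xy / (c_x * c_y))


def t_score(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Approximate t-score from counts: (C(X,Y) - C(X)C(Y)/N) / sqrt(C(X,Y))."""
    return (c_xy - c_x * c_y / n) / math.sqrt(c_xy)


# Invented counts: a bigram seen 10 times, its words 20 and 30 times, in 1000 tokens.
print(pmi(10, 20, 30, 1000))
print(t_score(10, 20, 30, 1000))

# When the bigram occurs exactly as often as independence predicts, both scores are 0.
print(pmi(1, 20, 50, 1000), t_score(1, 20, 50, 1000))
```

Both functions require C(X,Y) > 0; unseen bigrams would need separate handling (for example smoothing), which is outside what the text describes.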
where H(X) is the cross-entropy of X. Here, X is the candidate bigram whose value is measured
throughout the corpus. Perplexity is interpreted as the average “branching factor” of a word: the
statistically weighted number of words that follow a given word. As we see from equation (4),
Perplexity is equivalent to entropy. The only advantage of perplexity is that it results in numbers
more comprehensible for human beings. Here, perplexity is measured at both root level and
surface level.
Chi-square test: The t-test assumes that probabilities are approximately normally distributed,
which may not be true in general (Manning and Schütze, 2003). An alternative test for
dependence which does not assume normally distributed probabilities is the χ2-test (pronounced
“chi-square test”). In the simplest case, this χ2 test is applied to a 2-by-2 table as shown below:
                  X = new                 X ≠ new
Y = companies     n11                     n12
                  (new companies)         (e.g., old companies)
Y ≠ companies     n21                     n22
                  (e.g., new machines)    (e.g., old machines)
Table 8.1: A 2-by-2 table showing the dependence of occurrences of new and companies
Each variable in the above table depicts its individual frequency, e.g., n11 denotes the
frequency of the phrase “new companies”.
The idea is to compare the observed frequencies in the table with the expected frequencies
when the words occur independently. If the difference between observed and expected
frequencies is large, then we can reject the null hypothesis of independence. The equation for
this test is defined below:
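$$\chi^2 = \frac{N\,(n_{11}\,n_{22} - n_{12}\,n_{21})^2}{(n_{11}+n_{12})\,(n_{11}+n_{21})\,(n_{12}+n_{22})\,(n_{21}+n_{22})} \tag{8.4}$$
where the expected frequency of cell (i, j) under independence is
$$E_{ij} = \frac{\sum_{k} n_{ik}}{N} \times \frac{\sum_{k} n_{kj}}{N} \times N$$
and N is the number of tokens in the corpus.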
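The 2-by-2 form of the χ2 statistic is easy to evaluate from the four cell counts. The following is a minimal sketch rather than the experimental code: the function name is hypothetical, N is taken to be the sum of the four cells (an assumption; the text describes N as the number of tokens in the corpus), and the counts in the usage lines are purely illustrative.

```python
def chi_square_2x2(n11: int, n12: int, n21: int, n22: int) -> float:
    """Chi-square statistic for a 2-by-2 contingency table of bigram counts."""
    n = n11 + n12 + n21 + n22  # assumption: N is the table total
    numerator = n * (n11 * n22 - n12 * n21) ** 2
    denominator = (n11 + n12) * (n11 + n21) * (n12 + n22) * (n21 + n22)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return numerator / denominator


# Illustrative counts for a rare bigram in a large corpus.
print(round(chi_square_2x2(8, 4667, 15820, 14287181), 2))  # approximately 1.55

# A table whose rows are proportional (perfect independence) gives 0.
print(chi_square_2x2(10, 20, 30, 60))  # 0.0
```

A value this small would not allow the null hypothesis of independence to be rejected at the usual significance levels, so the pair would not be treated as a collocation by this test.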
1 http://wacky.sslmit.unibo.it/
per phrase and normalized between 0 and 100. These numerical scores are used for the Average Point
Difference score. For the coarse-grained score, phrases with numerical judgments between 0 and 33 are
labelled “low”, those from 34 to 66 “medium”, and those of 66 and over “high”.
minimum error score is assigned the higher weight. For each co-occurrence feature score i, if the
error on the training data is ei, the weight Wi assigned to the co-occurrence feature score i is
defined as:
$$W_i = \frac{100 - e_i}{\sum_{i} (100 - e_i)} \tag{8.5}$$
The individual co-occurrence feature scores are normalized to be in the range of 0 to 1 before
calculating the weighted sum.
Note that, when measuring coarse-precision, the fine-grained scores are bucketed into three
bins as explained in Section 8.3.
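The weighting scheme and the bucketing step can be sketched as below. This is a hedged illustration, not the submitted system: the function names and the training errors are hypothetical, and the treatment of the bin boundaries (values up to 33 as low, up to 66 as medium) is an assumption, since the stated ranges leave the boundary cases open.

```python
from typing import Dict


def feature_weights(errors: Dict[str, float]) -> Dict[str, float]:
    """W_i = (100 - e_i) / sum_i (100 - e_i): a smaller training error gives a larger weight."""
    margins = {name: 100.0 - e for name, e in errors.items()}
    total = sum(margins.values())
    return {name: margin / total for name, margin in margins.items()}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature scores that were already normalized to [0, 1]."""
    return sum(weights[name] * scores[name] for name in weights)


def coarse_label(score: float) -> str:
    """Bucket a 0-100 score into three bins (boundary handling is an assumption)."""
    if score <= 33:
        return "low"
    if score <= 66:
        return "medium"
    return "high"


# Hypothetical training errors for the three co-occurrence measures.
weights = feature_weights({"pmi": 20.0, "chi_square": 25.0, "t_test": 15.0})
combined = weighted_score({"pmi": 0.5, "chi_square": 0.4, "t_test": 0.6}, weights)
print(weights)
print(combined, coarse_label(100 * combined))
```

Scaling the combined score by 100 before bucketing is also an assumption, made only so that the normalized sum and the 0-100 judgments share a range.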
RUN-1 (Weighted Combination): These results are obtained from the weighted combination
of individual scores. Neither of the perplexity measures yields a significant gain in the
compositionality measure. For the rank combination experiments, the best co-occurrence
measures, i.e., PMI, Chi-square and T-test, are considered. For the weighted combination, the
results are reported for the weight triple (0.329, 0.309, 0.364) for PMI, Chi-square and T-test
respectively.
RUN-2 (Average Combination): These results are obtained by simply averaging the values
of the five measures.
RUN-3 (Best Scoring Measure: T-test): The T-test is observed to be the best scoring single
measure in this experiment.
When calculating the coarse-grained score, the compositionality of each phrase is tagged as
‘high’, ‘medium’ or ‘low’, as discussed in Section 8.3.
The final test data set has been evaluated on the gold standard data developed by the
organizers and the results of the three submitted runs are described in Table 8.3. The positive
value of Spearman’s rho coefficient implies that the system results are in the same direction as
the gold standard results, while Kendall’s tau indicates the independence of the system values
from the gold standard data. As expected, Table 8.3 shows that the weighted average score
(Run 1) gives better accuracy for all phrases based on the APD scores. On the other hand, the
T-test results (Run 3) give high accuracy for the coarse precision calculation, although they
rank last for the APD scores.
8.8 Conclusion
We have demonstrated that statistical evidence can be useful to indicate the continuum of
compositionality of bigrams, i.e., adjective-noun, verb-subject and verb-object combinations.
We are extremely confident in these empirical approaches to a semantic measure, as
compositionality directly relates to the semantics of a phrase. The coarse precision can be
improved if the three ranges of numerical values are tuned properly and the sizes of the three
bins are varied. As future work, we can use other statistical collocation-based methods (e.g.
log-likelihood ratio, relative frequency ratios etc.). Furthermore, we plan to incorporate
standard lexical resources like WordNet and VerbNet and use their lexical ontology to enhance
the compositionality judgment of the collocations.
Chapter 9
Applications of Multiword Expressions
9.1.1 Introduction
Stylometry, the science of inferring characteristics of the author from the characteristics of
documents written by that author, is a problem with a long history and belongs to the core task
of text categorization, which involves authorship identification, plagiarism detection, forensic
investigation, computer security, copyright and estate disputes etc. In this work, we present a
strategy for Stylometry detection of documents written in Bengali. We adopt a set of
fine-grained attribute features with a set of lexical markers for the analysis of the text and use
three semi-supervised measures for making decisions. Finally, a majority voting approach is
used for the final classification. We also repeat the experiment using Conditional Random Field
(CRF). The system is fully automatic and language-independent. Evaluation results of our
attempt for
Bengali author’s Stylometry detection show reasonably promising accuracy in comparison to the
baseline model.
length of words and sentences etc.), phrase level measures (noun chunk, verb chunk, etc.) and
context level measures (number of dialogs, length of dialog, sentence structure analysis etc.).
Additionally, we propose a baseline system for Bengali Stylometry analysis using a vocabulary
richness function. The present attempt deals with a microscopic observation of the stylistic
behaviour of the articles written many years ago by the famous Nobel laureate Rabindranath
Tagore and tries to distinguish them from the anonymous articles written by other authors in
that period.
2. Stylistic Features Extraction: Stylistic features have been proposed as more reliable style
markers than, for example, word-level features, since they are not under the conscious control
of the author. To allow the selection of linguistic features rather than n-gram terms, robust and
accurate text analysis tools such as lemmatizers, part-of-speech (POS) taggers, chunkers etc. are
needed. We have used the Shallow parser, which gives a parsed output of a raw input corpus. It
tokenizes the input, performs a part-of-speech analysis, and looks for chunks, inflections and a
number of other grammatical relations. The stylistic markers selected in this experiment are
coarsely classified into three categories and discussed in Table 9.1. Sentences are detected
using the sentence boundary markers, mainly ‘dari’ or ‘viram’ (‘।’), question marks (‘?’) and
exclamation marks (‘!’) in Bengali. Sentence length and word count are traditional and
well-defined measures in authorship attribution studies, and punctuation count is another
interesting characteristic of the personal style of a writer. Chunk or phrase level markers are
indications of various stylistic aspects, e.g., syntactic complexity, formality etc. Out of all
detected chunk sets, nine chunk types have mainly been considered in this experiment: noun
chunk (NP), verb-finite chunk (VGF), verb-non-finite chunk (VGNF), gerunds (VGNN),
adjective chunk (JJP), adverb chunk (RBP), conjunct phrase (CCP), chunk fragment (FRAGP)
and others (OTHERS). The Shallow parser identifies 25 Part Of Speech (POS) categories.
Among them, all 24 except UNK have been taken into consideration. Words tagged with UNK
are unknown words and are verified with a Bengali monolingual dictionary. Since the Shallow
parser is an automated text-processing tool, the style markers of the above levels are measured
approximately. Depending on the complexity of the text, the provided measures may differ
from the real values, which can only be measured with manual intervention. As the system is
fully automated, its performance depends on the performance of the parser. As can be seen in
Table 9.1, each marker is defined as a percentage measure of the ratio of two relevant
measures; this approach was followed in order to work with style markers that are as
independent of text length as possible. However, it is worth noting that we do not claim that
the proposed set of 76 markers is the final one. It would be possible to split them into more
fine-grained measures, e.g., F21 can be split into separate measures for the individual
occurrences of the punctuation symbols (comma per word, colon per word, dari per word etc.).
Here, our goal is to make an attempt towards the investigation of Bengali authors’ writing style
and to show that an appropriately defined set of such style markers performs better than the
traditional lexical based approaches.
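To make the idea of ratio-based, length-independent markers concrete, the sketch below splits a text on the sentence boundary markers named above and computes two simple ratios. It is only an illustration: the marker names and the regular expressions are assumptions, the example text is romanised, and none of this reproduces the 76 markers of Table 9.1 or the Shallow parser output.

```python
import re
from typing import Dict, List

SENTENCE_END = re.compile(r"[।?!]")   # dari, question mark, exclamation mark
PUNCTUATION = re.compile(r"[।?!,;:]")  # assumed punctuation inventory
WORD = re.compile(r"[^\s।?!,;:]+")     # any run of non-space, non-punctuation characters


def split_sentences(text: str) -> List[str]:
    """Split on the sentence boundary markers and drop empty pieces."""
    return [piece.strip() for piece in SENTENCE_END.split(text) if piece.strip()]


def simple_markers(text: str) -> Dict[str, float]:
    """Two illustrative markers expressed as ratios rather than raw counts."""
    sentences = split_sentences(text)
    words = WORD.findall(text)
    punctuation = PUNCTUATION.findall(text)
    return {
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "punctuation_per_word_pct": 100.0 * len(punctuation) / len(words) if words else 0.0,
    }


sample = "ami jai। tumi ki jabe? ha!"
print(split_sentences(sample))  # ['ami jai', 'tumi ki jabe', 'ha']
print(simple_markers(sample))   # {'words_per_sentence': 2.0, 'punctuation_per_word_pct': 50.0}
```

Expressing each marker as a ratio is what keeps it comparable between a short story and a long one, which is the motivation given for the percentage-based definitions.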
$$\text{Similarity} = \cos(\theta) = \frac{R \cdot T}{|R|\,|T|} = \frac{\sum_{i=1}^{n} r_i\, t_i}{\sqrt{\sum_{i=1}^{n} r_i^{2}} \times \sqrt{\sum_{i=1}^{n} t_i^{2}}} \tag{9.1}$$
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the
same, with 0 usually indicating independence and in-between values indicating intermediate
similarity or dissimilarity. Here, n is the number of features (i.e., 76) that act as dimensions of
the vectors and ri and ti are the features of reference and test vectors respectively.
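A direct transcription of the cosine similarity between a reference vector R and a test vector T is sketched below. The function name is hypothetical and the short vectors in the usage lines are toy values, not feature vectors from the experiments (which have n = 76 dimensions).

```python
import math
from typing import Sequence


def cosine_similarity(r: Sequence[float], t: Sequence[float]) -> float:
    """cos(theta) = sum(r_i * t_i) / (sqrt(sum r_i^2) * sqrt(sum t_i^2))."""
    if len(r) != len(t):
        raise ValueError("vectors must have the same number of features")
    dot = sum(ri * ti for ri, ti in zip(r, t))
    norm_r = math.sqrt(sum(ri * ri for ri in r))
    norm_t = math.sqrt(sum(ti * ti for ti in t))
    if norm_r == 0.0 or norm_t == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_r * norm_t)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))          # close to -1: opposite
```

Because percentage-valued style markers are non-negative, similarities between such vectors fall between 0 and 1 in practice; the negative case is shown only to cover the full range mentioned in the text.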
value in the cell with our calculated χ2 value, if the χ2 value is greater than the 0.05, 0.01 or 0.001
column, then the goodness-of-fit null hypothesis can be rejected, otherwise accepted.
Euclidean distance: The Euclidean distance between two points p and q is the length of the
line segment connecting them. In Cartesian coordinates, if p = (p1, p2, ..., pn) and q = (q1, q2,
..., qn) are two points in Euclidean n-space, then the distance from p to q is given by:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{9.3}$$
where n is the number of features, i.e., the dimension of a point, p is the reference point (i.e.
mean vector) of each cluster and q is the test vector. For every test vector, three distances from
the three reference points are calculated and the smallest distance determines the probable
cluster.
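The nearest-reference decision and the final majority vote over the three measures can be sketched as follows. This is an illustrative outline under assumptions: the function names and the two-dimensional mean vectors are invented, and the tie-breaking rule in `majority_vote` (the first label encountered wins) is an assumption because the text does not specify one.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def nearest_cluster(references: Dict[str, Sequence[float]], q: Sequence[float]) -> str:
    """Return the label of the reference (mean) vector closest to the test vector."""
    return min(references, key=lambda label: euclidean(references[label], q))


def majority_vote(labels: List[str]) -> str:
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


# Toy mean vectors for the three clusters R, A and O.
references = {"R": [1.0, 1.0], "A": [4.0, 4.0], "O": [8.0, 1.0]}
test_vector = [1.5, 0.5]

print(nearest_cluster(references, test_vector))  # R
print(majority_vote(["R", "A", "R"]))            # R
```

In the setting described here, the list passed to `majority_vote` would hold the three labels proposed by the cosine, chi-square and Euclidean measures for one test document.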
• Corpus Acquisition
As mentioned earlier, resource acquisition is one of the most important challenges in working
with resource constrained languages like Bengali. The system has used thirty stories in Bengali
written by the noted Indian Nobel laureate Rabindranath Tagore (discussed in Section 4.1.1).
Among them, we have selected twenty stories for training and the rest for testing. We chose
this domain because in such writings the idiosyncratic style of the author is not likely to be
overshadowed by the characteristics of the corresponding text genre. To differentiate them
from other authors’ articles, we have selected 30 articles by author A and 30 articles by other
authors1. In this way, we have three clustered sets of documents, identified as articles of
Author R (Tagore’s articles), Author A and others (O). This study focuses on two topics: (i) the
effect of earlier works on feature selection and learning and (ii) the effect of limited data in
authorship detection.
• Baseline System
In order to set up a baseline system, we propose a traditional lexical based methodology
called vocabulary richness. Among the various measures, like Yule’s K measure and Honore’s
R measure, we have taken the most typical one, the type-token ratio (V/N), where V is the size
of the
1 http://banglalibrary.evergreenbangla.com
vocabulary of the sample text and N is the number of tokens which form the sample text. We
have gathered the dimensional features of the articles of each cluster and averaged them to
make a mean vector for every cluster. These three mean vectors thus serve as the references of
the three clusters. For every test document, similar features have been extracted and a test
vector has been built. Then, using the Nearest-neighbour algorithm, we have tried to identify
the author of the test documents. The results of the baseline system are shown as a confusion
matrix in Table 9.2. Each row shows the classification of the ten texts of the corresponding
author. The diagonal contains the correct classifications. The baseline system achieves 37%
average accuracy. Approximately 60% of the test texts of authors A and O are wrongly
attributed to Author R.
Baseline System
        R    A    O    e (Error)
R       6    0    4    0.40
A       7    2    1    0.80
O       5    2    3    0.70
Average error          0.63
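The type-token ratio and the error figures of the baseline confusion matrix can be reproduced with a few lines. The sketch below is illustrative: the helper names are hypothetical, the token list is a toy example, and only the confusion matrix values are taken from the table above.

```python
from typing import Dict, List


def type_token_ratio(tokens: List[str]) -> float:
    """V/N: number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


# Rows are true authors, columns are predicted authors (values from the baseline table).
confusion: Dict[str, Dict[str, int]] = {
    "R": {"R": 6, "A": 0, "O": 4},
    "A": {"R": 7, "A": 2, "O": 1},
    "O": {"R": 5, "A": 2, "O": 3},
}


def row_error(author: str) -> float:
    """Fraction of an author's test texts that were assigned to another author."""
    row = confusion[author]
    return 1.0 - row[author] / sum(row.values())


errors = {author: row_error(author) for author in confusion}
average_error = sum(errors.values()) / len(errors)

# Share of the 20 texts by A and O that were attributed to R.
to_r = (confusion["A"]["R"] + confusion["O"]["R"]) / (
    sum(confusion["A"].values()) + sum(confusion["O"].values()))

print(type_token_ratio(["a", "b", "a", "c"]))  # 0.75
print({a: round(e, 2) for a, e in errors.items()})
print(round(average_error, 2), round(1 - average_error, 2), to_r)
```

The average error of about 0.63 corresponds to the 37% average accuracy reported for the baseline, and 12 of the 20 texts by A and O (60%) end up attributed to Author R.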
Our System
Cosine-similarity    Chi-square measure    Euclidean distance    Combined voting
R A O e              R A O e               R A O e               R A O e
9.1.4.3 Discussion
From the experimental results, it is clear that the statistical approaches show nearly similar
performance, with accuracies of around 50% for all of them. Also, the major source of errors is
the incorrect identification of the author as Author R. From Figure 9.2, we can see that the
system looks somewhat biased towards identifying Rabindranath Tagore as the author of the
test documents. In all cases, the bars for Author R are higher than the others. The reason
behind this is the acquisition of resources. Developing an appropriate corpus for this study is
itself a separate research area and takes a huge amount of time. Furthermore, the collected
articles from other authors are heterogeneous and not domain constrained. Our future studies
are planned to focus on the identification of the unpublished articles of Rabindranath Tagore.
For this, more microscopic observation of various fields of his writings will be needed. Here,
we only run our experiments on the stories of the writer.
[Figure 9.2: bar chart for Authors R, A and O under the Cosine, Chi-square, Euclidean and
Combined measures; vertical axis from 0 to 80]
The success of the system lies not on the correct mapping of the articles
to their three corresponding authors but in filtering all the works of Rabindranath Tagore out of
a bag of documents: the more accurate the filtering, the more accurate the system. Apart from
being the first work of its kind for the Bengali language, the contributions of this experiment
can be identified as: (i) the application of statistical approaches in the field of Stylometry,
(ii) the development of a classification algorithm in n-dimensional vector space, (iii) the
development of a baseline system in this field and (iv) more importantly, working with the great
writings of Rabindranath Tagore to reveal his swings of thought and dexterity of pen when
writing articles.
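The three measures compared in this section and their combined vote could be sketched as below. This is only an illustration under stated assumptions: the symmetric chi-square form, the majority rule and the tie-breaking choice (fall back to the cosine decision) are not specified in the text, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (higher means closer)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def chi_square_distance(x, y):
    """Symmetric chi-square distance (an assumed form; lower means closer)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def euclidean_distance(x, y):
    """Straight-line distance in n-dimensional space (lower means closer)."""
    return math.dist(x, y)


def combined_vote(test, refs):
    """Majority vote over the three measures; ties fall back to the cosine decision."""
    votes = [
        max(refs, key=lambda a: cosine_similarity(test, refs[a])),
        min(refs, key=lambda a: chi_square_distance(test, refs[a])),
        min(refs, key=lambda a: euclidean_distance(test, refs[a])),
    ]
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count >= 2 else votes[0]
```

As in the baseline, `refs` is assumed to map each author label to a reference vector for that author's cluster.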
9.1.4.5 Conclusion
This experiment introduced the use of a large number of fine-grained features for Stylometry
detection. The presented methodology can also be used in the author verification task, i.e. the
verification of the hypothesis whether or not a given person is the author of the text under
study. The methodology can be adopted for other languages, since most of the features are
language independent. The classification is very fast, since it is based on the calculation of
some simple statistical measurements. In particular, it appears from our experiments that texts
with fewer words are less likely to be classified correctly. Moreover, our system is slightly
biased towards the stylometry of Rabindranath Tagore, owing to the lack of a large number of
resources from the other authors under study. Building on this preliminary study, future work is
planned to enlarge the database with more fine-grained features and to identify more
context-dependent attributes for further improvement.
interchangeably to express their anonymous senses. Experimental results indicate that the CRF
model can enhance the task of identifying the authors.
• Preprocessing and Parsing
The documents are raw and so unformatted that an initial cleaning is required before CRF model
training. We must convert each document into formatted sequences, i.e. a bag of words or phrases
of the document. From the pre-processed document, token-level and some of the context-level
features are extracted. For a new document, we perform sentence segmentation and POS tagging
using the Shallow Parser so that the stylistic features are easily visible to the system. From
this, chunk-level and context-level markers are identified.
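The sentence segmentation step can be illustrated with a small sketch. It assumes that sentence boundaries are marked by the dari (‘।’), the question mark or the exclamation mark, and it uses whitespace tokenization; the actual Shallow Parser is not reproduced here and the function names are hypothetical.

```python
import re

# Sentence-final marks assumed for Bengali text: dari, question mark, exclamation mark.
SENTENCE_END = re.compile(r"[।?!]+")


def split_sentences(text):
    """Split raw text into sentences at the assumed sentence-final marks."""
    parts = SENTENCE_END.split(text)
    return [p.strip() for p in parts if p.strip()]


def average_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

For a transliterated input such as `"ami jai। tumi ki jabe? na!"`, `split_sentences` yields three sentences, from which length-based style markers can then be computed.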
• CRF Model
The input is the feature vector discussed in Table 9.4. There are three kinds of features:
(1) token-level features, (2) phrase-level features and (3) context-level features. Token-level
features include the length of the word, the number of keywords, the word that most frequently
starts a dialogue, and the count of hapax legomena. Phrase-level features include the counts of
the POSs and chunks that we have considered here (not all the POSs or chunks that the Shallow
Parser generally gives are
considered). The average length of the paragraph and the length of the dialogue are included in
the context-level features. Detected sentences are those whose boundary ends mainly with the
‘dari’ (‘।’), a question mark (‘?’) or an exclamation mark (‘!’) in Bengali. Sentence length and
word count are traditional, well-defined measures in authorship attribution studies, and
punctuation count is a very interesting characteristic of the personal style of a writer. A
problem arises in identifying keywords, as there is no standard tool to extract keywords from
Bengali documents. For this, we have identified the ten most frequent words (excluding Bengali
stop-words) of every cluster using the TF*IDF method; these act as the list of keywords of the
cluster corresponding to that author. Similarly, we have extracted a list of the ten most
frequent words from every test document and intersected it with the keywords of cluster 1,
cluster 2 and cluster 3; the sizes of these intersections are the values of the features KW1,
KW2 and KW3 respectively. Since the Shallow Parser is an automated text-processing tool, the
style markers of the above levels are measured approximately. Depending on the complexity of the
text, the provided measures may vary from the real values, which can only be measured through
manual intervention. To keep the system fully automated, it relies entirely on the performance
of the parser for the extraction of all POS and chunk-level features. The last column of the
training feature file is labeled R, A or O, indicating the three authors; for testing, every row
is labeled X, an arbitrary word indicating an unknown author. CRF adds an extra column at the
last position which indicates the label of the author for that document (R, A or O). As most of
these features are ratios of two relevant measures, this approach was followed in order to
achieve style markers that are as independent of text length as possible. However, it is worth
noting that we do not claim that the proposed set of features is the final one; it could be
possible to split them into more fine-grained measures. Here, our goal is to make a pioneering
approach towards the investigation of Bengali authors’ writing style and to show that an
appropriately defined set of such style markers performs better than the traditional
lexically-based approaches.
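The keyword features KW1, KW2 and KW3 described above might be computed along the following lines. This is a hedged sketch: the smoothed IDF formula, the alphabetical tie-breaking and the function names are assumptions made for illustration, since the text only names the TF*IDF method and the top-ten cut-off.

```python
import math
from collections import Counter


def top_keywords(clusters, stopwords, k=10):
    """For each cluster (a list of tokens), return its k highest TF*IDF words as a set."""
    n = len(clusters)
    counts = [Counter(t for t in toks if t not in stopwords) for toks in clusters]
    # Document frequency: in how many clusters each word occurs.
    df = Counter(w for c in counts for w in c)
    keywords = []
    for c in counts:
        # Smoothed IDF (an assumption), so words shared by all clusters keep a small weight.
        score = {w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in c.items()}
        ranked = sorted(score, key=lambda w: (-score[w], w))
        keywords.append(set(ranked[:k]))
    return keywords


def keyword_overlap(test_tokens, cluster_keywords, stopwords, k=10):
    """Return [KW1, KW2, ...]: overlap of the test document's top-k frequent
    words with the keyword set of each cluster."""
    freq = Counter(t for t in test_tokens if t not in stopwords)
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    test_top = set(ranked[:k])
    return [len(test_top & kw) for kw in cluster_keywords]
```

With three author clusters, the returned list has three entries that would fill the KW1, KW2 and KW3 columns of the feature file for that document.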
• Used Corpus
The corpus used in this experiment is discussed in the ‘Corpus acquisition’ subsection of
Section 9.4.1.2.
• Baseline System
The same baseline system as discussed in the ‘Baseline System’ subsection of Section 9.4.1.2 is
also used here.
         Cosine-similarity            CRF
         R    A    O    e             R    A    O    e
R        6    2    2    0.4           7    1    2    0.3
A        2    5    3    0.5           3    5    2    0.5
O        4    2    5    0.5           3    1    6    0.4
Average error           0.46          Average error  0.40
The performance of the CRF model for authorship identification is shown in the second half of
Table 9.4, under “CRF”. The average accuracy of this system is 60%, which shows a
substantial improvement in comparison with the baseline system. The identification of the
documents of Author A is more or less the same as with the previous statistical approach, and
for both Author A and Author O, 30% of the documents (3 of 10) are wrongly identified as written
by Author R. This shows a slight bias of the system towards the Stylometry of Tagore’s writing.
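The per-author and average error values in the CRF half of Table 9.4 follow directly from the confusion matrix. The short sketch below recomputes them; the dictionary layout and the function name are illustrative choices, while the counts are those reported in the table.

```python
def error_rates(confusion):
    """Return (per-author error, average error) from a confusion matrix.

    confusion[true_author][predicted_author] is a document count.
    """
    per_author = {}
    for author, row in confusion.items():
        total = sum(row.values())
        per_author[author] = 1 - row[author] / total
    average = sum(per_author.values()) / len(per_author)
    return per_author, average


# CRF counts as reported in Table 9.4 (rows: true author, columns: predicted author).
crf = {
    "R": {"R": 7, "A": 1, "O": 2},
    "A": {"R": 3, "A": 5, "O": 2},
    "O": {"R": 3, "A": 1, "O": 6},
}
per_author, average = error_rates(crf)
# Up to floating-point rounding: errors 0.3, 0.5 and 0.4, average error 0.4, accuracy 60%.
```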
9.1.5.3 Conclusion
Conditional Random Field is a state-of-the-art sequence modeling approach, which can use the
features of the documents more fully and effectively. In this experiment, we have studied a
Bengali corpus to detect the stylistic features of anonymous writings and tried to map them to
their possible authors. The presented methodology can also be used in the author verification
task, i.e. the verification of the hypothesis whether or not a given person is the author of the
text under study, even in other languages, since most of the features are language independent.
In particular, it appears from our experiments that texts with fewer words are less likely to be
classified correctly. In future, we would like to apply this system to other languages.
Furthermore, we plan a hybrid approach that can take advantage of both the unsupervised and the
machine learning approaches to improve performance. For this, more textual analysis and relevant
corpus collection will be needed. Above all, we would like to apply this system to other fields
of text mining, e.g. e-mail identification, forensic investigation, and copyright and estate
disputes, to make it more robust and general.
9.2.1 Introduction
Preprocessing of the parallel corpus plays an important role in improving the performance of
a phrase-based statistical machine translation (PB-SMT) system. In this experiment, we propose a
framework in which predefined information about Multiword Expressions (MWEs) can boost the
performance of PB-SMT. Here, we preprocess the parallel corpus to identify noun-noun MWEs,
reduplicated phrases, complex predicates and phrasal prepositions. Single-tokenization of
noun-noun MWEs, phrasal prepositions (source side only) and reduplicated phrases (target side only)
provide significant gains over our previous best PB-SMT model. Automatic alignment of
complex predicates substantially improves the overall MT performance and the word alignment
quality as well. For establishing NE alignments, we transliterate source NEs into the target
language and then compare them with the target NEs. Target language NEs are first converted
into a canonical form before the comparison takes place. The proposed system achieves
significant improvements (6.38 BLEU points absolute, 73% relative improvement) over the
baseline system on an English-Bengali translation task.
statistical transliteration technique. We rely on these automatically aligned NEs and treat them as
translation examples (Pal et al. 2010). Adding bilingual dictionaries, which in effect are instances
of atomic translation pairs, to the parallel corpus is a well-known practice in domain adaptation
in SMT (Eck et al. 2004; Wu et al. 2008). We modify the parallel corpus by converting the
MWEs into single tokens and adding the aligned NEs and complex predicates in the parallel
corpus to improve the word alignment and hence the phrase alignment quality. The
preprocessing of the parallel corpus results in improved MT quality in terms of automatic MT
evaluation metrics.
since PB-SMT (or any other approaches to SMT) does not generally treat MWEs as special
tokens. Another problem with SMT systems is the wrong translation of verb phrases. Sometimes
verb phrases are deleted in the output sentence. Moreover, the words inside verb phrases are
generally not aligned one-to-one; the alignments of the words inside source and target verb
phrases are mostly many-to-many, particularly so for the English-Bengali language pair. These
are the motivations behind considering MWEs like NEs, reduplicated phrases, prepositional
phrases and compound verbs for special treatment in this work. By converting the MWEs into
single tokens, we make sure that PB-SMT also treats them as a whole. The first objective of the
present work is to see how single tokenization and alignment of NEs on both sides, single
tokenization of phrasal verbs and phrasal prepositions on the source side, and single
tokenization of reduplicated phrases and noun-noun compounds on the target side affect the
overall MT quality. The second objective is to see whether prior automatic alignment of complex
predicates and single-tokenized MWEs can bring any further improvement in the overall
performance of the MT system. We carried out the experiments on an English-Bengali translation
task. Bengali shows high morphological richness at the lexical level, and language resources in
Bengali are not widely available. Furthermore, this is the first time that the identification of
MWEs in Bengali has been used to enhance the performance of an English-Bengali machine
translation system.
$$\log P(e_1^I \mid f_1^J) = \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, s_1^K) + \lambda_{LM} \log P(e_1^I) \qquad (9.5)$$
where $s_1^K = s_1 \ldots s_K$ denotes a segmentation of the source and target sentences respectively into the
sequences of phrases $(\hat{e}_1, \ldots, \hat{e}_K)$ and $(\hat{f}_1, \ldots, \hat{f}_K)$ such that (we set $i_0 = 0$):
$$\forall\, 1 \le k \le K: \quad s_k = (i_k, b_k, j_k), \quad \hat{e}_k = e_{i_{k-1}+1} \ldots e_{i_k}, \quad \hat{f}_k = f_{b_k} \ldots f_{j_k} \qquad (9.6)$$
and each feature $h_m$ in equation (9.5) can be rewritten as in equation (9.7):
$$h_m(f_1^J, e_1^I, s_1^K) = \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k, s_k) \qquad (9.7)$$
where $\hat{h} = \sum_{m=1}^{M} \lambda_m \hat{h}_m$.
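As a worked step, substituting equation (9.7) into the first term of equation (9.5) and exchanging the order of the two finite sums shows why the combined feature $\hat{h}$ is useful: the weighted feature sum decomposes into one score per phrase pair. The derivation below only uses the symbols already defined in equations (9.5) to (9.7).

```latex
\begin{align*}
\sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, s_1^K)
  &= \sum_{m=1}^{M} \lambda_m \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k, s_k) \\
  &= \sum_{k=1}^{K} \sum_{m=1}^{M} \lambda_m \hat{h}_m(\hat{f}_k, \hat{e}_k, s_k) \\
  &= \sum_{k=1}^{K} \hat{h}(\hat{f}_k, \hat{e}_k, s_k)
\end{align*}
```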
a chance to make these reduplicated words a single token, which addresses the many-to-one
alignment problem, because these kinds of reduplicated words should map to a single word on the
source side. Phrasal prepositions and phrasal verbs may carry a different meaning on the target
side, so we treat these kinds of words as single tokens to inform the translation model that the
whole expression carries a meaning different from that of the individual occurrences of its
words. Once the compound verbs and the NEs are identified on both sides of the parallel corpus,
they are converted into and replaced by single tokens. When converting these MWEs into single
tokens, we replace the spaces with underscores (‘_’). Since there are already some hyphenated
words in the corpus, we do not use hyphenation for this purpose; besides, the use of a special
word separator (underscore in our case) makes it easy to decide which single-token (target
language) MWEs to detokenize into their component words before evaluation.
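The underscore-based single-tokenization and the reverse step before evaluation can be sketched as follows. The function names and the longest-match-first strategy are assumptions for illustration; the MWE list itself would come from the identification modules described in this chapter.

```python
def single_tokenize(tokens, mwes):
    """Replace each occurrence of a listed MWE by one underscore-joined token.

    Longer MWEs are tried first so that overlapping candidates resolve to the longest match.
    """
    patterns = sorted(
        (tuple(m.split()) for m in mwes if m.split()), key=len, reverse=True
    )
    out, i = [], 0
    while i < len(tokens):
        for p in patterns:
            if tuple(tokens[i:i + len(p)]) == p:
                out.append("_".join(p))
                i += len(p)
                break
        else:
            # No MWE starts at this position: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def detokenize(tokens):
    """Expand underscore-joined MWE tokens back into their component words."""
    out = []
    for t in tokens:
        out.extend(t.split("_"))
    return out
```

Because the underscore does not otherwise occur as a word separator in the corpus, `detokenize(single_tokenize(tokens, mwes))` restores the original token sequence, which is what the evaluation step relies on.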
We adopt the UCREL 2 Semantic Analysis System developed by Lancaster University (Rayson et al.
2004). The UCREL Semantic Analysis System (USAS) is a software tool for undertaking the
automatic semantic analysis of English spoken and written data. It
uses a hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained
semantic field tags. The semantic tags show semantic fields which group together word senses
that are related by virtue of their being connected at some level of generality with the same
mental concept. The groups include not only synonyms and antonyms but also hypernyms and
hyponyms. Currently, the lexicon contains nearly 37,000 words and the template list contains
over 16,000 multi-word units. Each template consists of a pattern of words and syntactic tags,
some using wildcards to enable tagging with inflectional variants and less strictly defined
patterns. The semantic tags for each template are arranged in rank frequency order in the same
way as the lexicon. Various types of MWUs are included: phrasal verbs (e.g. stubbed out), noun
phrases (e.g. riding boots), proper names (e.g. United States of America), true idioms (e.g. living
the life of Riley) etc.
Currently, the USAS system consists of the CLAWS POS tagger (Garside and Smith 1997), a
lemmatiser, a semantic tagger and some auxiliary format-manipulating components. For POS
tagging, this system employs the C7 tagset 3. Subsequent semantic disambiguation depends, to a
large extent, on the POS information encoded in this tagset. The developers report an evaluation
of the accuracy of the system against a manually tagged test corpus of about 124,900 words, on
which the USAS software obtained a precision of 91%.
2
http://www.comp.lancs.ac.uk/ucrel
• Identification of Reduplication
In all languages, the repetition of nouns, pronouns, adjectives and verbs is broadly classified
under two coarse-grained categories: repetition at (a) the expression level and (b) the content
or semantic level. Repetition at both levels is mainly used for emphasis, generality, intensity
or to show continuation of an act. Work on MWE identification and extraction has been continuing
in English (Fillmore 2003). In this experiment, we have used a simple rule-based approach
(Chakraborty and Bandyopadhyay 2010) (discussed in Section 6) to identify reduplication in the
Bengali side of the corpus. In that paper, the authors classified expression-level Bengali
reduplication into five fine-grained subcategories: (i) Onomatopoeic expressions (khat khat,
knock knock), (ii) Complete Reduplication (bara-bara, big big), (iii) Partial Reduplication
(thakur-thukur, God), (iv) Semantic Reduplication (matha-mundu, head) and (v) Correlative
Reduplication (maramari, fighting). We have tried to cover almost all of the above types, using
simple rules and morphological properties at the lexical level and a Bengali monolingual
dictionary for semantic reduplications.
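A toy version of such surface rules is sketched below on transliterated forms. It is not the rule set of Chakraborty and Bandyopadhyay (2010): the two rules (identical halves, and halves differing in exactly one character) are simplifying assumptions that only approximate the surface patterns of types (i) to (iii), and semantic or correlative reduplication would need a dictionary and morphological analysis.

```python
def classify_pair(first, second):
    """Classify two adjacent word forms with two illustrative surface rules."""
    if not first or not second:
        return None
    if first == second:
        return "full_repeat"          # e.g. khat khat, bara-bara
    if len(first) == len(second):
        diffs = sum(1 for a, b in zip(first, second) if a != b)
        if diffs == 1:
            return "one_char_variation"  # e.g. thakur-thukur
    return None


def find_reduplications(tokens):
    """Return (expression, rule) pairs for hyphenated tokens and adjacent token pairs."""
    found = []
    for tok in tokens:
        if tok.count("-") == 1:
            a, b = tok.split("-")
            kind = classify_pair(a, b)
            if kind:
                found.append((tok, kind))
    for a, b in zip(tokens, tokens[1:]):
        kind = classify_pair(a, b)
        if kind:
            found.append((a + " " + b, kind))
    return found
```

The one-character rule will over-generate on unrelated word pairs of equal length, so in practice it would have to be combined with further lexical checks.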
3
See http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
In the past few years, noun compounds have received increasing attention as researchers work
towards the goal of full text understanding. Compound nouns are nominal compounds in which two
or more nouns are combined to form a single phrase, such as ‘golf club’ or ‘computer science
department’ (Baldwin and Kim 2010). A compound noun MWE can be defined as a lexical unit made up
of two or more elements, each of which can function as a lexeme independent of the other(s) in
other contexts, and which shows some phonological and/or grammatical isolation from normal
syntactic usage. In English, Noun-Noun (NN) compounds occur with high frequency and high lexical
and semantic variability (Tanaka and Baldwin 2003). In this experiment, we have used a simple
statistical methodology for identifying noun-noun MWEs. For that, the system uses Point-wise
Mutual Information (PMI), Log-likelihood Ratio (LLR), Phi-coefficient, Co-occurrence measurement
and Significance function (Agarwal et al. 2004). The final evaluation has been carried out by
combining all the above-mentioned features. A predefined cut-off has been chosen, and the
candidates scoring above this threshold have been considered as MWEs.
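Of the association measures listed above, PMI is the simplest to illustrate. The sketch below scores adjacent word pairs and applies a cut-off; it assumes the input sentences have already been reduced to noun sequences, uses maximum-likelihood probability estimates, and leaves out the other four measures and their combination. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Return PMI (base 2) for every adjacent word pair in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for toks in sentences:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_xy = c / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def select_candidates(scores, threshold):
    """Keep the pairs whose score lies above the predefined cut-off."""
    return {pair for pair, s in scores.items() if s > threshold}
```

In a fuller version, each candidate would carry one score per measure, and the cut-off would be applied to the combined score rather than to PMI alone.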
We first create an NE parallel corpus by extracting the source and target (single-token) NEs
from the NE-tagged parallel translations and align them following the strategies applied by
Pal et al. (2010).
For the extraction of Complex Predicates (CPs) in Bengali, focusing especially on compound
verbs (Verb + Verb) and conjunct verbs (Noun/Adjective + Verb), we have adopted the method
applied by Das et al. (2010). Here we have also considered serial verbs (SVs) (Verb + Verb, or
patterns like Verb + Verb + Verb). To extract serial verbs, we have taken those patterns that
occur serially in a sentence and do not have to be considered further in compound verb
extraction. Below we give some examples of complex predicates and serial verbs on the Bengali
side, together with their extracted forms on the source English side.
The analysis and extraction procedure is mainly carried out on the target Bengali side. At
first, we have extracted and listed all serial verbs and complex predicates with their sentence
ids from the target side. Using those sentence ids from the target list, we have extracted and
listed the entire verb chunks associated with them on the source English side.
• Verb Chunk Aligner
From the extracted lists, we have paired both sides in all possible combinations of complex
predicates and produced a rough unaligned list with sentence ids, as in the following example:
1069 ||| designed ||| _/NCP
This module produces an alignment list from the unaligned list using a statistical method. From
a single English verb chunk, we generate further corresponding verb chunks by extracting the
synset of the main verb of that chunk from WordNet and substituting its members for the main
verb. When we check the frequency of a source-target combination in the entire unaligned list,
we also check the same for the synset-generated chunks. If a combination occurs frequently in
the unaligned list, we consider that it should be aligned. With this strategy, we have prepared
an alignment list of source-target complex predicates and serial verbs. After obtaining the
alignment list, we remove its entries from the initial unaligned list and proceed to the next
steps. To find the frequency of occurrence, we analyze the morphology of the words on both sides
and match root words only.
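The frequency-based step of the aligner can be sketched as below. The `|||`-separated line format follows the example in the text; everything else is an assumption made for illustration: the minimum count of 2, the function names, and the omission of WordNet synset expansion and root-word matching. Target-side strings in any usage are placeholders.

```python
from collections import Counter


def parse_unaligned(lines):
    """Parse 'sentence_id ||| source_chunk ||| target_chunk' lines into triples."""
    triples = []
    for line in lines:
        parts = [p.strip() for p in line.split("|||")]
        if len(parts) == 3:
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def frequent_pairs(triples, min_count=2):
    """Return source-target pairs seen at least min_count times (assumed cut-off)."""
    counts = Counter((src, tgt) for _, src, tgt in triples)
    return {pair for pair, c in counts.items() if c >= min_count}


def remove_aligned(triples, aligned):
    """Drop entries whose source-target pair is already in the alignment list."""
    return [t for t in triples if (t[1], t[2]) not in aligned]
```

The pairs returned by `frequent_pairs` would form the alignment list, and `remove_aligned` gives the reduced unaligned list that is passed on to the next step.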
The pattern Generator extracts patterns for both source and target from the alignment list and
produces a source target pattern list. The extracted patterns are as follows:
Root form of main verb of source side_ suffix target side pattern
MV_ed MV_ /r/e (pattern)
In the target side pattern we have considered the root word and the inflection only. We generate
a pattern for each verb chunk in the unaligned list and produce a list of patterns for the
unaligned list. We then match the two lists: if both the source and the target pattern match
jointly in the unaligned pattern list, we align those chunks together. The chunks aligned by
this module are added to the alignment list.
If modules I and II increase the size of the alignment list, this module decides that the
process starts again; otherwise it stops the iteration.
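The control flow described above, repeating the alignment modules until the alignment list stops growing, can be sketched as a simple fixed-point loop. The module interface used here (a function taking the unaligned list and the current alignment set and returning newly aligned entries) is an assumption for illustration, not the interface of the actual system.

```python
def iterate_alignment(unaligned, modules):
    """Run the alignment modules repeatedly until no new entries are aligned.

    Each module is a function (unaligned, aligned) -> set of newly aligned entries.
    Returns the final alignment set and the remaining unaligned list.
    """
    aligned = set()
    while True:
        size_before = len(aligned)
        for module in modules:
            new = module(unaligned, aligned)
            aligned |= new
            unaligned = [e for e in unaligned if e not in new]
        if len(aligned) == size_before:
            # Neither module added anything in this pass: stop the iteration.
            return aligned, unaligned
```

The loop terminates because each pass either moves at least one entry out of the finite unaligned list or stops.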
4
The EILMT and ILILMT projects are funded by the Department of Information Technology (DIT), Ministry of
Communications and Information Technology (MCIT), Government of India.
5
http://nlp.stanford.edu/software/lex-parser.shtml
6
http://crfchunker.sourceforge.net/
7
http://nlp.stanford.edu/software/CRF-NER.shtml
phrase length of 4 produced the optimum baseline result. We therefore carried out the rest of the
experiments using these settings.
In training set            English              Bengali
                           T        U           T        U
CPs                        4874     2289        14174    7154
reduplicated word          -        -           85       50
Noun-noun compound         892      711         489      300
Phrasal preposition        982      779         -        -
Phrasal verb               549      532         -        -
Total NE words             22931    8273        17107    9106
The system continues with the various preprocessing steps of the corpus, and we observe the
improvement achieved by the identification of MWEs at the phrase level. In this experiment, our
intuition is that the more MWEs are identified and aligned properly, the more the system
improves in translation. On the source side, the system treats the phrasal prepositions and
noun-noun compounds as single tokens, since they show n:m alignment in the bilingual context.
After identifying them as single tokens and aligning them using GIZA++, the system achieves a
BLEU score of 13.99. But when noun-noun compounds are identified separately, the system shows
relatively degraded results with respect to the other identifications. The reasons behind these
results are twofold. Firstly, the accuracy of the UCREL semantic toolkit is not satisfactory,
especially for the tourism domain. Secondly, it has been observed that noun-noun compounds are
translated on the target side on an n:n alignment basis; for them, single tokenization is not
desirable at all. However, the overall combined result supports our intuition.
On the target side, reduplications have been identified and aligned with the source side. After
aligning reduplications, the system shows only a modest improvement of 0.51 BLEU points, as
reduplications on the target side may not have any significant counterpart on the source side.
On the target side, reduplications, noun-noun compounds, and both together have given
satisfactory results with an improvement of 0.50 BLEU points, which again supports our
intuition.
Experiments Exp BLEU NIST
Table 9.7 Evaluation results for different experimental setups. (The ‘†’ marked systems produce
statistically significant improvements on BLEU over the baseline system)
Finally, we have treated both the source and target side corpora by combining the previously
identified phrases. Table 9.7 shows that when we treat the prepositional phrases, verb-object
combinations, reduplicated words and noun-noun compounds together as single tokens, the system
achieves the best result, with a BLEU score of 14.58. The table also reports the results for the
other combinations, which also support our intuition with respect to the baseline system.
Table 9.6 shows the MWE statistics of the parallel corpus as identified by the NERs. The
average NE length in the training corpus is 2.16 for English and 1.61 for Bangla. As can be seen
from Table 9.5, 44.5% and 47.8% of the NEs are single-word NEs in English and Bangla
respectively, which suggests that prior alignment of the single-word NEs, in addition to multi-
word NE alignment, should also be beneficial to word and phrase alignment.
Of all the NEs in the training and development sets, the transliteration-based alignment
process was able to establish alignments of 4,711 single-word NEs, 4,669 two-word NEs and
1,745 NEs having length more than two. It is to be noted that, some of the single-word NE
alignments, as well as two-word NE alignments, result from multi-word NE alignment.
We analyzed the output of the NE alignment module and observed that longer NEs were
aligned better than the shorter ones, which is quite intuitive, as longer NEs have more tokens to
be considered for intra-NE alignment. Since the NE alignment process is based on transliteration,
the alignment method does not work where NEs involve translation or acronyms. We also
observed that English MW NEs are sometimes fused together into single-word NEs.
We performed three sets of experiments: treating compound verbs as single tokens, treating
NEs as single tokens, and the combination thereof. Again for NEs, we carried out three types of
preprocessing: single-tokenization of (i) two-word NEs, (ii) more than two-word NEs, and (iii)
NEs of any length. We make distinctions among these three to see their relative effects. The
development and test sets, as well as the target language monolingual corpus (for language
modeling), are also subjected to the same preprocessing of single-tokenizing the MWEs. For NE
alignment, we performed experiments using 4 different settings: alignment of (i) NEs of length
up to two, (ii) NEs of length two, (iii) NEs of length greater than two, and (iv) NEs of any length.
Before evaluation, the single-token (target language) underscored MWEs are expanded back to
words comprising the MWEs.
Since we did not have the gold-standard word alignment, we could not perform intrinsic
evaluation of the word alignment. Instead we carry out extrinsic evaluation on the MT quality
using the well known automatic MT evaluation metrics: BLEU (Papineni et al. 2002), METEOR
(Banerjee and Lavie 2005), NIST (Doddington 2002), WER, PER and TER (Snover et al. 2006).
As can be seen from the evaluation results reported in Table 9.6, baseline Moses without any
preprocessing of the dataset produces a BLEU score of 8.74. The low score can be attributed to
the fact that Bangla, a morphologically rich language, is hard to translate into. Moreover, Bangla
being a relatively free phrase order language (Ekbal and Bandyopadhyay 2009) ideally requires
multiple sets of references for proper evaluation. Hence using a single reference set does not
do justice to the evaluation of translations in Bangla. Also, the training set was not
sufficiently large for SMT. Treating only NEs longer than two words as single tokens does not
improve the overall performance much, while single-tokenization of two-word NEs produces
some improvement (0.39 BLEU points absolute, 4.5% relative). Considering compound verbs as
single tokens (CVaST) produces a .82 BLEU point improvement (9.4% relative) over the
baseline. Strangely, when both compound verbs and NEs together are counted as single tokens,
there is hardly any improvement. By contrast, automatic NE alignment (NEA) gives a huge
impetus to system performance, the best of them (4.59 BLEU points absolute, 52.5% relative
improvement) being the alignment of NEs of any length that produces the best scores across all
metrics. When NEA is combined with CVaST, the improvements are substantial, but they cannot
beat the individual improvement from NEA alone. The (†) marked systems produce statistically
significant improvements as measured by bootstrap resampling method (Koehn 2004) on BLEU
over the baseline system. Metric-wise individual best scores are shown in bold in Table 9.6.
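The relative improvements quoted above follow from dividing each absolute BLEU gain by the baseline score of 8.74. The short check below reproduces the arithmetic; the function name is illustrative and the gains are the figures reported in the text.

```python
BASELINE_BLEU = 8.74  # baseline Moses score reported in the text


def relative_improvement(absolute_gain, baseline=BASELINE_BLEU):
    """Relative improvement in percent for a given absolute BLEU gain."""
    return 100.0 * absolute_gain / baseline


for gain in (0.39, 0.82, 4.59):
    print(gain, round(relative_improvement(gain), 1))
# prints 4.5, 9.4 and 52.5 as the relative improvements, matching the reported figures
```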
In this experiment, we have successfully shown how the simple yet effective preprocessing of
treating various types of MWEs, namely NEs, reduplications and compound verbs, as single
tokens, in conjunction with prior NE alignment, can boost the performance of a PB-SMT system
on an English-Bengali translation task. Treating compound verbs as single-tokens provides
significant gains over the baseline PB-SMT system. Amongst the MWEs, NEs perhaps play the
most important role in MT, as we have clearly demonstrated through experiments that automatic
alignment of NEs by means of transliteration improves the overall MT performance substantially
across all automatic MT evaluation metrics. Our best system yields 4.59 BLEU points
improvement over the baseline, a 52.5% relative increase. We compared a subset of the output of
our best system with that of the baseline system, and the output of our best system almost always
looks better in terms of either lexical choice or word ordering. The fact that only 28.5% of the
test set NEs appear in the training set, yet prior automatic alignment of the NEs brings about so
much improvement in terms of MT quality, suggests that it not only improves the NE alignment
quality in the phrase table, but word alignment and phrase alignment quality must have also been
improved significantly. At the same time, single-tokenization of MWEs makes the dataset
sparser, but yet improves the quality of MT output to some extent. Data-driven approaches to
MT, specifically for scarce-resource language pairs for which very little parallel text is
available, should benefit from these preprocessing methods. Data sparseness is perhaps the
reason why single-tokenization of NEs and compound verbs, both individually and in
combination, did not add significantly to the scores. However, a significantly large parallel
corpus can take care of the data sparseness problem introduced by the single-tokenization of
MWEs.
The present work offers several avenues for further work. In future, we will investigate how
these automatically aligned NEs can be used as anchor words to directly influence the word
alignment process. We will look into whether similar kinds of improvements can be achieved for
larger datasets, corpora from different domains and for other language pairs. We will also
investigate how NE alignment quality can be improved, especially where NEs involve translation
and acronyms. We will also try to perform morphological analysis or stemming on the Bangla
side before NE alignment.
Chapter 10
Conclusion
An attempt has been made in this thesis to model the syntax and semantics of Bengali Multiword
Expressions (MWEs) using the following statistical approaches: substitutability,
co-occurrence properties, semantic clustering and linguistic properties. A detailed discussion of the
experiments to automatically acquire the syntax and semantics of MWEs has been presented in
the thesis. We have experimented with different approaches to identify the statistical approaches
that are suited to specific MWE types and tasks.
The findings of the various experiments carried out in the thesis are summarized below:
1. The experiments mainly focus on Bengali Multiword Expressions, although an
experiment on identifying the compositionality of English bigram MWEs has also been
carried out.
2. Identification and extraction of MWEs have been carried out with various statistical
methodologies.
3. All types of Bengali bigram noun compounds and complex predicates are handled in
these experiments.
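As an illustration of what a co-occurrence property can look like in practice, the sketch below scores adjacent word pairs with pointwise mutual information (PMI; Church and Hanks 1989). It is a generic example over a toy token list. The function name bigram_pmi, the toy corpus and the choice of PMI are assumptions made here for illustration and do not restate the exact measures used in the experiments.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_pmi(tokens: List[str],
               min_count: int = 1) -> Dict[Tuple[str, str], float]:
    """Return PMI(w1, w2) = log2(P(w1, w2) / (P(w1) * P(w2))) for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores: Dict[Tuple[str, str], float] = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            # Skip rare pairs, whose PMI estimates are unreliable.
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus, chosen only for illustration.
toy = "hot dog stand sells hot dog and hot tea near dog park".split()
pmi = bigram_pmi(toy)
for pair, score in sorted(pmi.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(score, 3))
```

PMI tends to overrate pairs of rare words, so a frequency threshold such as min_count is commonly applied before candidate bigrams are ranked.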
In conclusion, we acknowledge that complete identification of Bengali MWEs has not yet been
achieved. The systems developed open a new direction for work on MWEs in the Bengali language.
Appendix
Research Publications
Publications
1. Tanmoy Chakraborty and Sivaji Bandyopadhyay, Identification of Reduplication in
Bengali Corpus and their Semantic Analysis: A Rule Based Approach. In Proceedings of
Multiword Expressions: from Theory to Applications (MWE 2010), The 23rd
International Conference on Computational Linguistics (COLING 2010), Beijing, China,
August 28, 2010, pp. 73-76.
2. Dipankar Das, Santanu Pal, Tapabrata Mondal, Tanmoy Chakraborty and Sivaji
Bandyopadhyay, Automatic Extraction of Complex Predicates in Bengali. In
Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), The
23rd International Conference on Computational Linguistics (COLING 2010), Beijing,
China, August 28, 2010, pp. 37-45.
3. Tanmoy Chakraborty and Sivaji Bandyopadhyay, Authorship Identification Using
Stylometry Analysis: A CRF Based Approach, In Proceedings of IEEE Cascom
Postgraduate Student paper Conference, Jadavpur University, Kolkata, November 27,
2010, pp. 66-69.
4. Tanmoy Chakraborty, Identification of Noun-Noun (N-N) Collocations as Multi-
Word Expressions in Bengali Corpus, The 8th International Conference on Natural
Language Processing (ICON 2010), IIT Kharagpur, India, December 8-11, 2010.
5. Tanmoy Chakraborty and Sivaji Bandyopadhyay, Inference of Fine-grained
Attributes of Bengali Corpus for Stylometry Detection, The 12th International Conference
on Intelligent Text Processing and Computational Linguistics (CICLING 2011), Tokyo,
Japan, February 20-26, 2011.
6. Tanmoy Chakraborty, Santanu Pal, Tapabrata Mondal, Tanik Saikh and Sivaji
Bandyopadhyay, Shared task system description: Measuring the Compositionality of
Bigrams using Statistical Methodologies, In Proceedings of Distributional Semantics and
Compositionality (DiSCo), The 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA,
June 24, 2011. (Accepted)
7. Tanmoy Chakraborty, Dipankar Das, Sivaji Bandyopadhyay, Semantic Clustering:
an Attempt to Extract Multiword Expressions in Bengali, In Proceedings of Multiword
Expressions: from Parsing and Generation to the Real World (MWE 2011), The 49th
Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies (ACL-HLT 2011) Portland, Oregon, USA, June 23, 2011. (Accepted).
Bibliography
Abeillé, Anne. 1988. Light verb constructions and extraction out of NP in a tree
adjoining grammar. In Papers of the 24th Regional Meeting of the Chicago
Linguistics Society.
Alsina, Alex. 1996. Complex Predicates: Structure and Theory. Center for the Study of
Language and Information Publications, Stanford, CA.
Agirre, Eneko, and Philip Edmonds. 2006. Word Sense Disambiguation: Algorithms and
Applications. Dordrecht, Netherlands: Springer.
Agirre, Eneko, and David Martinez. 2000. Exploring automatic word sense
disambiguation with decision lists and the web. In Proceedings of COLING workshop
on Semantic Annotation and Intelligent Content, 11–19, Saarbrücken, Germany.
Agarwal, Aswini, Biswajit Ray, Monojit Choudhury, Sudeshna Sarkar and Anupam
Basu. 2004. Automatic Extraction of Multiword Expressions in Bengali: An
Approach for Miserly Resource Scenario. In Proceedings of International Conference
on Natural Language Processing (ICON), pp. 165-174.
Argamon, Shlomo, Marin Saric, and Sterling S. Stein. 2003. Style mining of electronic
messages for multiple authorship discrimination: First results. In Proceedings of the
2003 Association for Computing Machinery Conference on Knowledge Discovery
and Data Mining (ACM SIGKDD), pp. 475-480.
Brown, Peter F., Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer.
1993. The mathematics of statistical machine translation: Parameter estimation.
Computational Linguistics, 19(2):263–311.
Bashir, Elena. 1993. Causal chains and compound verbs. In M. K. Verma ed. (1993)
Complex Predicates in South Asian Languages, Manohar Publishers and Distributors,
New Delhi.
Burton-Page, John. 1957. Compound and conjunct verbs in Hindi. Bulletin of the School
of Oriental and African Studies, 19: 469-78.
Baldwin, Timothy. 2005b. Looking for prepositional verbs in corpus data. In Proceedings
of the 2nd ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and
their Use in Computational Linguistics Formalisms and Applications, 115–126,
Colchester, UK.
Baldwin, Timothy, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003a. An
empirical model of multiword expression decomposability. In Proceedings of the
ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment,
pp. 89–96, Sapporo, Japan.
Baldwin, Timothy, John Beavers, Leonoor van der Beek, Francis Bond, Dan Flickinger,
and Ivan A. Sag. 2003b. In search of a systematic treatment of determinerless PPs. In
Proceedings of the ACL-SIGSEM Workshop on the Linguistic Dimensions of
Prepositions and their Use in Computational Linguistics Formalisms and
Applications, 145–156, Toulouse, France.
Baldwin, Timothy, John Beavers, Leonoor van der Beek, Francis Bond, Dan Flickinger,
and Ivan A. Sag. 2006. In search of a systematic treatment of determinerless PPs. In
Syntax and Semantics of Prepositions, ed. by Patrick Saint-Dizier. Springer.
Baldwin, Timothy, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen.
2004. Road-testing the English Resource Grammar over the British National Corpus.
In Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC-2004), 2047–2050, Lisbon, Portugal.
Baldwin, Timothy, and Aline Villavicencio. 2002. Extracting the unextractable: A case
study on verb-particles. In Proceedings of the 6th Conference on Natural Language
Learning (CoNLL-2002), 98–104, Taipei, Taiwan.
Bame, Ken, 1999. Aspectual and resultative verb-particle constructions with up. Handout
for talk presented at the Ohio State University Linguistics Graduate Student
Colloquium.
Banerjee, Satanjeev, and Ted Pedersen. 2003. Extended gloss overlaps as a measure of
semantic relatedness. In Proceedings of the 18th International Joint Conference on
Artificial Intelligence (IJCAI-2003), 805–810, Acapulco, Mexico.
Bannard, Colin, 2003. Statistical techniques for automatically inferring the semantics of
verb-particle constructions. Master’s thesis, University of Edinburgh.
Bannard, Colin, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to
the semantics of verb-particles. In Proceedings of the ACL2003 Workshop on
Multiword Expressions: analysis, acquisition and treatment, 65–72, Sapporo, Japan.
Barker, Ken, and Stan Szpakowicz. 1998. Semi-automatic recognition of noun modifier
relationships. In Proceedings of the 17th International Conference on Computational
Linguistics (COLING-1998), 96–102, Montreal, Canada.
Taskar, Ben, Pieter Abbeel, and Daphne Koller. 2002. Discriminative probabilistic models
for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence
(UAI02).
Bolinger, Dwight. 1976b. The Phrasal Verb in English. Boston, USA: Harvard
University Press.
Bond, Francis, 2001. Determiners and number in English, contrasted with Japanese, as
exemplified in machine translation. Brisbane, Australia: University of Queensland
dissertation.
Borthen, Kaja, 2003. Norwegian bare singulars. Norwegian University of Science and
Technology dissertation.
Brinton, Laurel. 1985. Verb particles in English: Aspect or aktionsart. Studia Linguistica
39. pp. 157–168.
Briscoe, Ted, and John Carroll. 2002. Accurate statistical annotation of general text. In
Proceedings of the 3rd International Conference on Language Resources and
Evaluation (LREC-2002), 1499–1504, Las Palmas, Canary Islands.
Butt, Miriam. 1995. The Structure of Complex Predicates in Urdu. Doctoral Dissertation,
Stanford University.
Burnard, Lou. 1995. User guide for the British National Corpus.
Butt, Miriam. 2003. The light verb jungle. In Proceedings of the Workshop on Multi-verb
Constructions, 1–49, Trondheim, Norway.
Croft, D. J. 1981. Book of Mormon word prints reexamined. Sun Stone Publish., 6, pp. 15-22.
Calzolari, Nicoletta, Charles Fillmore, Ralph Grishman, Nancy Ide, Alessandro Lenci,
Catherine MacLeod, and Antonio Zampolli. 2002. Towards best practice for
multiword expressions in computational lexicons. In Proceedings of the 3rd
International Conference on Language Resources and Evaluation (LREC-2002),
1934–1940, Las Palmas, Canary Islands.
Cao, Yunbo, and Hang Li. 2002. Base noun phrase translation using web data and the
EM algorithm. In Proceedings of the 19th International Conference on
Computational Linguistics (COLING-2002), 37–40, Taipei, Taiwan.
Carpuat, Marine, and Dekai Wu. 2005. Word sense disambiguation vs. statistical machine
translation. In Proceedings of 43rd Annual Meeting of the Association for
Computational Linguistics, 387–394, Ann Arbor, USA.
Church, Kenneth W., and Patrick Hanks. 1989. Word association norms, mutual
information and lexicography. In Proceedings of the 27th Annual Meeting of the
Association of Computational Linguistics (ACL-1989), 76–83, Vancouver, Canada.
Chakrabarti, Debasri, Hemang Mandalia, Ritwik Priya, Vaijayanthi Sarma, and Pushpak
Bhattacharyya. 2008. Hindi compound verbs and their automatic extraction. In
Proceedings of the 22nd International Conference on Computational Linguistics
(Coling 2008), Posters and demonstrations, Manchester, UK, pp. 27-30.
Cook, Paul, and Suzanne Stevenson. 2006. Classifying particle semantics in English
verb-particle constructions. In Proceedings of the ACL-2006 Workshop on Multiword
Expressions: Identifying and Exploiting Underlying Properties, 45–53, Sydney,
Australia.
Copestake, Ann, and Alex Lascarides. 1997. Integrating symbolic and statistical
representations: The lexicon pragmatics interface. In Proceedings of the 35th Annual
Meeting of the Association of Computational Linguistics and 8th Conference of the
European Chapter of Association of Computational Linguistics (ACL/EACL-1997),
136–143, Madrid, Spain.
Cruse, Alan D. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch, 2004.
TiMBL: Tilburg memory based learner, version 5.1, reference guide.
Das, Pradeep Kumar. 2009. The form and function of Conjunct verb construction in
Hindi. Global Association of Indo-ASEAN Studies, Daejeon, South Korea.
Dasgupta, Sajib, Naira Khan, Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel and
Mumit Khan. 2005. Morphological Analysis of Inflecting Compound Words in Bengali. In
Proceedings of the 8th International Conference on Computer and Information
Technology (ICCIT), Bangladesh.
Dehé, Nicole. 2002. Particle Verbs in English: syntax, information structure and
intonation. Amsterdam/Philadelphia: John Benjamins Publishing.
Dempster, A.P., N.M. Laird, and D.B. Rubin. 1977. Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society,
Series B (Methodological) 39 (1): 1–38.
Dias, Gaël Harry. 2003. Multiword Unit Hybrid Extraction. In proceedings of the First
Association for Computational Linguistics, Workshop on Multiword Expressions:
Analysis, Acquisition and Treatment, pp. 41-48.
Dehé, Nicole, Ray Jackendoff, Andrew McIntyre, and Silke Urban (eds.) 2001.
Verb-Particle Explorations. Berlin/New York: Mouton de Gruyter.
Dias, Gaël, S. Guilloré, and J. G. Pereira Lopes. 1999. Multilingual aspects of multiword
lexical units. In Workshop on Language Technologies in the Framework of the 32nd
Annual Meeting of the Societas Linguistica Europaea, Ljubljana, Slovenia.
Diab, M. T., and P. Bhutada. 2009. Verb Noun Construction MWE Token Supervised
Classification. In Proceedings of the Joint conference of Association for Computational
Linguistics and International Joint Conference on Natural Language Processing of
the Asian Federation of Natural Language Processing 2009 (ACL-IJCNLP 2009),
Workshop on Multiword Expressions, Singapore, pp. 17-22.
Dirven, René. 2001. The metaphoric in recent cognitive approaches to English phrasal
verbs. metaphorik.de. pp. 39–54.
Downing, Pamela. 1977. On the creation and use of English compound nouns. Language
53. pp. 810–842.
Dras, Mark, and Mike Johnson. 1996. Death and lightness: Using a demographic model
to find support verbs. In Proceedings of the 5th International Conference on the
Cognitive Science of Natural Language Processing, 165–172, Dublin, Ireland.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence.
Computational Linguistics 19. pp. 61–74.
Eck, Matthias, Stephan Vogel, and Alex Waibel. 2004. Improving statistical machine
translation in the medical domain using the Unified Medical Language System. In
Proceedings of the 20th International Conference on Computational Linguistics
(COLING 2004), Geneva, Switzerland, pp. 792-798.
Ekbal, Asif, and Sivaji Bandyopadhyay. 2008. Maximum Entropy Approach for Named
Entity Recognition in Indian Languages. International Journal for Computer
Processing of Languages (IJCPOL), Vol. 21 (3), pp. 205-237.
Ekbal, Asif, and Sivaji Bandyopadhyay. 2009. Voted NER system using appropriate
unlabeled data. In proceedings of the ACL-IJCNLP-2009 Named Entities Workshop
(NEWS 2009), Suntec, Singapore, pp.202-210.
Evert, Stefan. 2004. The Statistics of Word Cooccurrences: Word Pairs and
Collocations. University of Stuttgart dissertation.
Ekbal, Asif, Rejwanul Haque and Sivaji Bandyopadhyay. 2008. Maximum Entropy
Based Bengali Part of Speech Tagging. In proceedings of Advances in Natural
Language Processing and Applications Research in Computing Science, 2008, pp.
67-78.
Feng, Donghui, Yajuan Lv, and Ming Zhou. 2004. A new approach for English-Chinese
named entity alignment. In Proceedings of the 2004 Conference on Empirical
Methods in Natural Language Processing (EMNLP-2004), Barcelona, Spain, pp. 372-
379.
Fan, James, Ken Barker, and Bruce W. Porter. 2003. The knowledge required to interpret
noun compounds. In Proceedings of the 18th International Joint Conference on
Artificial Intelligence, 1483–1485, Acapulco, Mexico.
Fazly, Afsaneh, Ryan North, and Suzanne Stevenson. 2005. Automatically distinguishing
literal and figurative usages of highly polysemous verbs. In Proceedings of the ACL-
SIGLEX Workshop on Deep Lexical Acquisition, 38–47, Ann Arbor, USA.
Association for Computational Linguistics.
Fernando, Chitra, and Roger Flavell. 1981. On idioms. Exeter: University of Exeter.
Fillmore, Charles, Paul Kay, and Mary C. O’Connor. 1988. Regularity and idiomaticity
in grammatical constructions. Language 64. pp. 501–538.
Folli, Raffaella, Heidi Harley, and Simin Karimi. 2003. Determinants of event type in
Persian complex predicates. Cambridge Working Papers in Linguistics.
Fraser, Bruce. 1976. The Verb-Particle Combination in English. The Hague: Mouton.
Gates, Edward. 1988. The treatment of multiword lexemes in some current
dictionaries of English. Snell-Hornby.
Gibbs, Raymond W. 1980. Spilling the beans on understanding and memory for idioms in
conversation. Memory and Cognition 8. pp. 149–156.
Girju, Roxana. 2007. Improving the interpretation of noun phrases with cross-linguistic
information. In Proceedings of the 45th Annual Meeting of the Association of
Computational Linguistics, 568–575, Prague, Czech Republic.
Girju, Roxana, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics
of noun compounds. Computer Speech and Language 19. pp. 479–496.
Girju, Roxana, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz
Yuret. 2007. Semeval-2007 task 04: Classification of semantic relations between
nominals. In Proceedings of the 4th Semantic Evaluation Workshop (SemEval-2007),
13–18, Prague, Czech Republic.
Grefenstette, Gregory, and Pasi Tapanainen. 1994. What is a word, what is a sentence?
problems of tokenization. In Proceedings of the 3rd Conference on Computational
Lexicography and Text Research, 79–87, Budapest, Hungary.
Grefenstette, Gregory, and Simone Teufel. 1995. A corpus-based method for automatic
identification of support verbs for nominalizations. In Proceedings of the 7th
European Chapter of Association of Computational Linguistics (EACL-1995), pp.
98–103, Dublin, Ireland.
Grishman, Ralph, Catherine Macleod, and Adam Meyers. 1998. COMLEX Syntax reference
manual.
Grover, Claire, Maria Lapata, and Alex Lascarides. 2004. A comparison of parsing
technologies for the biomedical domain. Journal of Natural Language Engineering 1.
pp. 1–38.
Huang, Fei, Stephan Vogel, and Alex Waibel. 2003. Automatic extraction of named
entity translingual equivalence based on multi-feature cost minimization. In
Proceedings of the ACL-2003 Workshop on Multilingual and Mixed-language Named
Entity Recognition, 2003, Sapporo, Japan, pp. 9-16.
Haspelmath, Martin. 1997. From Space to Time in The World’s Languages. Munich,
Germany: Lincom Europa.
Hearst, Marti. 1992. Automatic acquisition of hyponyms from large text corpora. In Proc.
of the 14th International Conference on Computational Linguistics (COLING ’92),
Nantes, France.
Hermjakob, Ulf, Eduard Hovy, and Chin-Yew Lin. 2002. Automated question answering
in Webclopedia. In Proceedings of the ACL-02 Demonstrations Session, 98–99,
Philadelphia, USA.
Hirst, Graeme, and David St-Onge. 1998. Lexical chains as representations of context for
the detection and correction of malapropisms. In (Fellbaum 1998), pp. 305–332.
Hook, Peter. 1974. The Compound Verbs in Hindi. The Michigan Series in South and
South-east Asian Language and Linguistics. The University of Michigan.
Halteren, V. H. 2005. Linguistic profiling for author recognition and verification. In
Proceedings of the 2005 Meeting of the Association for Computational Linguistics
(ACL).
Holmes, D. 1994. Authorship Attribution. Computers and the Humanities, 28, pp. 87-106.
Hoshi, H. 1994. Passive, Causative, and Light Verbs: A Study of Theta Role Assignment.
University of Connecticut dissertation.
Huddleston, Rodney, and Geoffrey K. Pullum. 2002. The Cambridge Grammar of the
English Language. Cambridge, UK: Cambridge University Press.
Humphreys, L., D. Lindberg, H. Schoolman, and G.O. Barnett. 1998. The unified
medical language system: An informatics research collaboration. Journal of the
American Medical Informatics Association 5. pp. 1–13.
Ide, Nancy, and Jean Veronis. 1998. Word sense disambiguation: The state of the art.
Computational Linguistics 24. pp. 1–40.
Isabelle, Pierre. 1984. Another look at nominal compounds. In Proceedings of the 10th
International Conference on Computational Linguistics (COLING-1984), pp. 509–
516, San Francisco, USA.
Jackendoff, Ray. 1973. The base rules for prepositional phrases. In A Festschrift for
Morris Halle, 345–356. New York: Holt, Rinehart and Winston.
Jackendoff, Ray. 1997. The Architecture of the Language Faculty. Cambridge, USA:
MIT Press.
Jackendoff, Ray. 2002. Foundations of Language. Oxford, UK: Oxford University Press.
Jespersen, Otto. 1965. A Modern English Grammar on Historical Principles, Part VI,
Morphology. London, UK: George Allen and Unwin Ltd.
Jiang, Jay, and David Conrath. 1997. Semantic similarity based on corpus statistics and
lexical taxonomy. In Proceedings on International Conference on Research in
Computational Linguistics, 19–33, Taipei, Taiwan.
Johnston, Michael, and Frederica Busa. 1996. Qualia structure and the compositional
interpretation of compounds. In Proceedings of the ACL SIGLEX Workshop on
Breadth and Depth of Semantic Lexicons, 77–88, Santa Cruz, USA.
Tan, Yee Fan, Min-Yen Kan, and Hang Cui. 2006. Extending corpus-based identification
of light verb constructions using a supervised learning framework. In Proceedings of
the EACL 2006 Workshop on Multi-word-expressions in a multilingual context
(MWEmc), Trento, Italy.
Kilgarriff, Adam and Joseph Rosenzweig. 2000. Framework and results for English
SENSEVAL. Computers and the Humanities. Senseval Special Issue, 34(1-2). pp. 15-48.
Katz, Jerrold J., and Paul M. Postal. 2004. Semantic interpretation of idioms and
sentences containing them. In Quarterly Progress Report (70), MIT Research
Laboratory of Electronics, 275–282. MIT Press.
Keane, E. 2001. Echo Words in Tamil. PhD thesis, Merton College, Oxford.
Kneser, Reinhard, and Hermann Ney. 1995. Improved backing-off for m-gram language
modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), vol. 1, pp. 181–184. Detroit, MI.
Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based
translation. In Proceedings of HLT-NAACL 2003: conference combining Human
Language Technology conference series and the North American Chapter of the
Association for Computational Linguistics conference series, Edmonton, Canada, pp.
48-54.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris
Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open
source toolkit for statistical machine translation. In Proceedings of the 45th Annual
meeting of the Association for Computational Linguistics (ACL 2007): Proceedings of
demo and poster sessions, Prague, Czech Republic, pp. 177-180.
Koehn, Philipp. 2004. Statistical significance tests for machine translation evaluation. In
EMNLP-2004: Proceedings of the Conference on Empirical Methods in Natural
Language Processing, 25-26 July 2004, Barcelona, Spain, pp. 388-395.
Kaul, Vijay Kumar. 1985. The Compound Verb in Kashmiri. Unpublished Ph.D.
dissertation. Kurukshetra University.
Kearns, Kate. 2002. Light verbs in English.
Sato, Kengo, and Yasubumi Sakakibara. 2005. RNA secondary structural alignment with
conditional random fields. Bioinformatics, 21: pp. 237–242.
Kim, Su Nam, and Timothy Baldwin. 2005. Automatic interpretation of compound nouns
using WordNet similarity. In Proceedings of 2nd International Joint Conference on
Natural Language Processing (IJCNLP-2005), 945–956, Jeju, Korea.
Kim, Su Nam, and Timothy Baldwin. 2006a. Automatic extraction of verb-particles using
linguistic features. In Proceedings of the Third ACL-SIGSEM Workshop on
Prepositions, 65–72, Trento, Italy.
Kim, Su Nam and Timothy Baldwin. 2006b. Interpreting semantic relations in noun
compounds via verb semantics. In Proceedings of the 44th Annual
Meeting of the Association for Computational Linguistics and 21st International
Conference on Computational Linguistics (COLING/ACL-2006), 491–498, Sydney,
Australia.
Kim, Su Nam, and Timothy Baldwin. 2007a. Detecting compositionality of English verb-
particle constructions using semantic similarity. In Proceedings of Conference of the
Pacific Association for Computational Linguistics, 40–48, Melbourne, Australia.
Kim, Su Nam, and Timothy Baldwin. 2007c. Interpreting noun compounds using
bootstrapping and sense collocation. In Proceedings of Conference of the Pacific
Association for Computational Linguistics, 129–136, Melbourne, Australia.
Kim, Su Nam, and Timothy Baldwin. 2007d. Melb-kb: Nominal classification as noun
compound interpretation. In Proceedings of the 4th International Workshop on
Semantic Evaluations, pp. 231–236, Prague, Czech Republic.
Kim, Su Nam, Meladel Mistica, and Timothy Baldwin. 2007. Extending sense
collocation on interpreting noun compounds. In Proceedings of Australasian
Language Technology Workshop, 49–56, Melbourne, Australia.
Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. 1998. Introduction to latent
semantic analysis. Discourse Processes. pp. 259–284.
Lapata, Mirella, and Frank Keller. 2004. The web as a baseline: Evaluating the
performance of unsupervised web-based models for a range of NLP tasks. In
Proceedings of the Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing (HLT/NAACL-2004), pp. 121–128,
Boston, USA.
Leacock, Claudia, and Nancy Chodorow. 1998. Combining local context and WordNet
similarity for word sense identification. Cambridge, USA: MIT Press.
Levi, Judith. 1978. The Syntax and Semantics of Complex Nominals. New York, New
York, USA: Academic Press.
Li, Wei, Xiuhong Zhang, Cheng Niu, Yuankai Jiang, and Rohini K. Srihari. 2003. An
expert lexicon approach to identifying English phrasal verbs. In Proceedings of the
ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment,
513–520, Sapporo, Japan.
Liberman, Mark, and Richard Sproat. 1992. The stress and structure of modified noun
phrases in English. In Lexical Matters – CSLI Lecture Notes No. 24, ed. by Ivan A.
Sag and A. Szabolcsi. Stanford, USA: CSLI Publications.
Lindner, Sue. 1983. A lexico-semantic analysis of English verb particle constructions with
OUT and UP. University of Indiana at Bloomington dissertation.
Lin, Dekang. 1998a. Automatic retrieval and clustering of similar words. In Proceedings
of the 36th Annual Meeting of the Association for Computational Linguistics and 17th
International Conference on Computational Linguistics (COLING/ACL-1998), pp.
768–774, Montreal, Canada.
Lin, Dekang. 1998b. Extracting collocations from text corpora. In Proceedings of the 1st
Workshop on Computational Terminology, Montreal, Canada.
Lohse, Barbara, John A. Hawkins, and Thomas Wasow. 2004. Domain minimization in
English verb-particle constructions. Language 80. pp. 238–261.
Lynott, Dermot, and Mark T. Keane. 2004. A model of novel compound production. In
Proceedings of the 26th Annual Conference of the Cognitive Science Society,
Chicago, Illinois, USA.
Mustafa, T. K., Mustapha, N., Azmi, M. A., Sulaiman, N. B. 2010. Dropping down the
Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in
the Text Mining for Authorship Investigation. Journal of Computer Science 6 (3), pp.
235-243.
Moore, Robert C. 2003. Learning translations of named entity phrases from parallel
corpora. In Proceedings of 10th Conference of the European Chapter of the Association
for Computational Linguistics (EACL 2003), Budapest, Hungary; pp. 259-266.
MacQueen, James B. 1967. Some methods for classification and analysis of multivariate
observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics
and Probability, 281–297, Berkeley, USA. University of California at Berkeley Press.
M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. 2009. The WaCky Wide Web: A
Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language
Resources and Evaluation 43 (3): 209-226.
Richardson, Matthew, and Pedro Domingos. 2005. Markov logic networks. Machine
Learning.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a
large annotated corpus of English: the Penn treebank. Computational Linguistics
19. pp. 313–330.
McCarthy, Diana, Bill Keller, and John Carroll. 2003. Detecting a continuum of
compositionality in phrasal verbs. In Proceedings of the ACL2003 Workshop on
Multiword Expressions: analysis, acquisition and treatment, 73–80, Sapporo, Japan.
McCarthy, Diana, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding
predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of
the Association of Computational Linguistics, 280–287, Barcelona, Spain.
Mendes, A., Antunes, S., Nascimento, M., Casteleiro, J., Pereira, L. and Sá, T. 2006. COMBINA-
PT: A Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword
Expressions. In Proceedings of the 5th International Conference on Language Resources and
Evaluation, 24-26.5. 2006, Genova, pp. 1900-05.
Mihalcea, Rada, and Ehsanul Faruque. 2004. Senselearner: Minimally supervised word
sense disambiguation for all words in open text. In Proceedings of the ACL/SIGLEX
Senseval-3, 155–158, Barcelona, Spain.
Mihalcea, Rada, and Dan Moldovan. 1999. An automatic method for generating sense
tagged corpora. In Proceedings of the 16th Conference of the American Association
of Artificial Intelligence (AAAI-1999), pp. 461–466, Orlando, USA.
Minnen, Guido, John Carroll, and Darren Pearce. 2001. Applied morphological
processing of English. Natural Language Engineering. pp. 207–223.
Miyagawa, Shigeru. 1989. Light verbs and the ergative hypothesis. Linguistic Inquiry
20. pp. 659–668.
Mohanty, Gopabandhu. 1992. The Compound Verbs in Oriya. Ph. D.
dissertation, Deccan College Post-Graduate and Research Institute, Pune.
Mohanty, Panchanan. 2010. WordNets for Indian Languages: Some Issues. Global
WordNet Conference-2010, pp. 57-64.
Moldovan, Dan, Adriana Badulescu, Marta Tatu, Daniel Antohe, and Roxana Girju.
2004. Models for the semantic classification of noun phrases. In Proceedings of HLT-
NAACL 2004: Workshop on Computational Lexical Semantics, 60–67, Boston, USA.
Mukherjee, Amitabha, Soni Ankit and Raina Achla M. 2006. Detecting Complex
Predicates in Hindi using POS Projection across Parallel Corpora. Multiword
Expressions: Identifying and Exploiting Underlying Properties. Association for
Computational Linguistics.
Nakov, Preslav, and Marti Hearst. 2005. Search engine statistics beyond the n-gram:
Application to noun compound bracketing. In Proceedings of the 9th Conference on
Computational Natural Language Learning (CoNLL-2005), pp. 17–24, Ann Arbor,
USA.
Nakov, Preslav, and Marti Hearst. 2006. Using verbs to characterize noun-noun relations.
In Proceedings of the 12th International Conference on Artificial Intelligence:
Methodology, Systems, Applications (AIMSA), 233–244, Bulgaria.
Nastase, Vivi, Jelber Sayyad-Shirabad, Marina Sokolova, and Stan Szpakowicz. 2006.
Learning noun-modifier semantic relations with corpus-based and WordNet-based
features. In Proceedings of the 21st National Conference on Artificial Intelligence
(AAAI), 781–787, Boston, USA.
Ney, H. and Popovic, M. 2004. Improving Word Alignment Quality using Morphosyntactic
Information. Proceedings of the 20th International Conference on Computational Linguistics,
Geneva 23-27.8. 2004, pp. 310-314.
Ngai, Grace, and Radu Florian. 2001. Transformation-based learning in the fast lane. In
Proceedings of the 2nd Annual Meeting of the North American Chapter of the
Association for Computational Linguistics (NAACL), 40–47, Pittsburgh, USA.
Nulty, Paul. 2007. Semantic classification of noun phrases using web counts and learning
algorithms. In Proceedings of the Association of Computational Linguistics 2007
Student Research Workshop, 79–84, Prague, Czech Republic.
Nunberg, Geoffrey, Ivan A. Sag, and Tom Wasow. 1994. Idioms. Language, pp. 491–538.
Och, Franz J. 2003. Minimum error rate training in statistical machine translation. In
Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics (ACL-2003), Sapporo, Japan, pp. 160-167.
O'Hara, Tom, and Janyce Wiebe. 2003. Preposition semantic classification via Treebank
and FrameNet. In Proceedings of the 7th Conference on Natural Language Learning,
pp. 79–86, Edmonton, Canada.
Ó Séaghdha, Diarmuid, and Ann Copestake. 2007. Co-occurrence contexts for noun
compound interpretation. In Proceedings of the ACL-2007 Workshop on A Broader
Perspective on Multiword Expressions, 57–64, Prague, Czech Republic.
Paul, Soma. 2010. Representing Compound Verbs in Indo WordNet. Global WordNet
Conference-2010, pp. 84-91.
Paul, Soma. 2004. An HPSG Account of Bangla Compound Verbs with LKB
Implementation. Ph.D. dissertation, University of Hyderabad, Hyderabad.
Pavelec, D., Justino, E., and Oliveira, L. S. 2007. Author Identification using Stylometric
Features. Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial.
Valencia, España, Vol. 11, pp. 59–65.
Patrick, Jon, and Jeremy Fletcher. 2004. Differentiating types of verb particle
constructions. In Proceedings of Australian Language Technology Workshop, 163–
170, Sydney, Australia.
Patwardhan, Siddharth, Satanjeev Banerjee, and Ted Pedersen. 2003. Using measures of
semantic relatedness for word sense disambiguation. In Proceedings of the 4th
International Conference on Intelligent Text Processing and Computational
Linguistics (CICLing-2003), 17–21, Mexico City, Mexico.
Pawley, Andrew, and Frances Hodgetts Syder, 1983. Two puzzles for linguistic theory:
nativelike selection and nativelike fluency.
Piao, Scott, Paul Rayson, Dawn Archer, Andrew Wilson, and Tony McEnery. 2003.
Extracting multiword expressions with a semantic tagger. In Proceedings of the
ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment,
49–56, Sapporo, Japan.
Piao, Scott, Paul Rayson, Olga Mudraya, Andrew Wilson, and Roger Garside. 2006.
Measuring MWE compositionality using semantic annotation. In Proceedings of the
ACL-2006 Workshop on Multiword Expressions: Identifying and Exploiting
Underlying Properties, 2–11, Sydney, Australia.
Potter, Elizabeth, Jenny Watson, Michael Lax, and Miranda Timewell. 2000. Collins
Cobuild Dictionary of Idioms. Cambridge, UK: Harper Collins Publishers.
Pal, Santanu, Sudip Kumar Naskar, Pavel Pecina, Sivaji Bandyopadhyay, and Andy
Way. 2010. Handling Named Entities and Compound Verbs in Phrase-Based
Statistical Machine Translation. In Proceedings of the Workshop on Multiword
Expressions: From Theory to Applications (MWE-2010), the 23rd International
Conference on Computational Linguistics (Coling 2010), Beijing, China, pp. 46-54.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method
for automatic evaluation of machine translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia,
PA, pp. 311-318.
Prager, J., and J. Chu-Carroll, 2001. Use of WordNet hypernyms for answering what-is
questions.
Pustejovsky, James. 1995. The Generative Lexicon. Cambridge, USA: MIT Press.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. 1985. A
Comprehensive Grammar of the English Language. London, UK: Longman.
Ramchand, Gillian, and Peer Svenonius. 2002. The lexical syntax and lexical semantics
of the verb-particle construction. In Proceedings of WCCFL, pp. 387–400,
Somerville, USA.
Resnik, Philip. 1995. Disambiguating noun groupings with respect to WordNet senses. In
Proceedings of the 3rd Workshop on Very Large Corpora, 77–98, Cambridge, USA.
Rayson, Paul, Dawn Archer, Scott Piao, and Tony McEnery. 2004. The UCREL
Semantic Analysis System. In Proceedings of the LREC-04 Workshop: Beyond Named Entity
Recognition Semantic Labeling for NLP Tasks, pp. 7-12, Lisbon, Portugal.
Ren, Zhixiang, Yajuan Lü, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical
machine translation using domain bilingual multiword expressions. In Proceedings of
the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, Suntec,
Singapore, pp. 47-54.
Rosario, Barbara, and Marti Hearst. 2001. Classifying the semantic relations in noun
compounds via a domain-specific lexical hierarchy. In Proceedings of the 6th
Conference on Empirical Methods in Natural Language Processing (EMNLP-2001),
82–90, Pittsburgh, Pennsylvania, USA.
Ross, Haj. 1995. Defective noun phrases. In Papers of the 31st Regional Meeting of
the Chicago Linguistic Society, 398–440, Chicago, Illinois, USA.
Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger.
2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd
International Conference on Intelligent Text Processing and Computational
Linguistics (CICLing-2002), 1–15, Mexico City, Mexico.
Salton, Gerard, Allan Wong, and C.S. Yang. 1975. A vector space model for automatic
indexing. Communications of the ACM 18.613–620.
Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul.
2006. A study of translation edit rate with targeted human annotation.
Sinha, R. Mahesh K. 2009. Mining Complex Predicates In Hindi Using A Parallel Hindi-
English Corpus. Multiword Expression Workshop, Association of Computational
Linguistics - International Joint Conference on Natural Language Processing-2009,
pp. 40-46, Singapore.
Kumar, Sanjiv, and Martial Hebert. 2003. Discriminative fields for modeling spatial
dependencies in natural images. In Sebastian Thrun, Lawrence Saul, and Bernhard
Schölkopf, editors, Advances in Neural Information Processing Systems 16.
Cambridge, MA: MIT Press.
Side, Richard. 1990. Phrasal verbs: sorting them out. ELT Journal, pp. 144–152.
Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for
automatic hypernym discovery. In Advances in Neural Information Processing
Systems, pp. 1297–1304, Vancouver, Canada.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. 1999. Automatic authorship attribution.
In Proceedings of the 9th Conference of the European Chapter of the ACL, pp. 158–165,
June 8-12.
Sparck Jones, Karen. 1983. Compound noun interpretation problems. Englewood Cliffs,
USA: Prentice-Hall.
Stevenson, Suzanne, Afsaneh Fazly, and Ryan North. 2004. Statistical measures of the
semi-productivity of light verb constructions. In Proceedings of the 2nd ACL
Workshop on Multiword Expressions: Integrating Processing, pp. 1–8, Barcelona,
Spain.
Su Nam Kim and Min-Yen Kan. 2009. Re-examining Automatic Keyphrase Extraction
Approaches in Scientific Articles. In Proceedings of the 2009 Workshop on Multiword
Expressions, ACL-IJCNLP 2009. Suntec, Singapore. pp. 9-16.
Stvan, Laurel Smith, 1998. The Semantics and Pragmatics of Bare Singular Noun
Phrases. Northwestern University dissertation.
Thoudam, D. S., and S. Bandyopadhyay. 2008. Morphology Driven Manipuri POS Tagger.
In Workshop on NLP for Less Privileged Languages, Hyderabad, pp. 91-98.
Utsuro, Takehito, Takao Shime, Masatoshi Tsuchiya, Suguru Matsuyoshi, and Satoshi
Sato. 2007. Learning dependency relations of Japanese compound functional
expressions. In Proceedings of the ACL-2007 Workshop on a Broader Perspective on
Multiword Expressions, 65–72, Prague, Czech Republic.
Van de Cruys, Tim, and Begoña Villada Moirón. 2007. Semantics-based multiword
expression extraction. In Proceedings of the ACL-2007 Workshop on a Broader
Perspective on Multiword Expressions, 25–32, Prague, Czech Republic.
Van Der Beek, Leonoor. 2005. Topics in Corpus-Based Dutch Syntax.
Rijksuniversiteit Groningen dissertation.
Venkatapathy, Sriram, and Aravind Joshi. 2005. Measuring the relative compositionality
of verb-noun (V-N) collocations by integrating features. In Proceedings of the Human
Language Technology Conference and Conference on Empirical Methods in Natural
Language Processing (HLT/EMNLP-2005), pp. 899–906, Vancouver, Canada.
Venkatapathy, Sriram, and Aravind K. Joshi. 2006. Using information about multi-word
expressions for the word-alignment task. In Proceedings of Coling-ACL 2006:
Workshop on Multiword Expressions: Identifying and Exploiting Underlying
Properties, Sydney, pp. 20-27.
Vogel, Stephan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word
alignment in statistical translation. In Proceedings of the 16th International
Conference on Computational Linguistics (COLING 1996), Copenhagen, pp. 836-
841.
Wacholder, Nina, and Peng Song. 2003. Toward a task-based gold standard for
evaluation of NP chunks and technical terms. In Proceedings of the 3rd International
Conference on Human Language Technology Research and 4th Annual Meeting of
the NAACL (HLT/NAACL-2003), 189–196, Edmonton, Canada.
Widdows, Dominic, and Beate Dorow. 2005. Automatic extraction of idioms using graph
analysis and asymmetric lexicosyntactic patterns. In Proceedings of ACL2005
Workshop on Deep Lexical Acquisition, 48–56, Ann Arbor, USA.
Wierzbicka, Anna. 1982. Why can you have a drink when you can’t *have an eat?
Language 58.753–799.
Wu, Zhibiao, and Martha Palmer. 1994. Verb semantics and lexical selection. In
Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics (ACL-1994), 133–138, Las Cruces, New Mexico, USA.
Yarowsky, David. 1993. One sense per collocation. In Proceedings of the ARPA Human
Language Technology Workshop, 266–271, Plainsboro, New Jersey, USA.
Yoon, Juntae, Key-Sun Choi, and Mansuk Song. 2001. A corpus-based approach for
Korean nominal compound analysis based on linguistic and statistical information.
Natural Language Engineering 7.251–270.
Zhao, Jinglei, Hui Liu, and Ruzhan Lu. 2007. Semantic labeling of compound
nominalization in Chinese. In Proceedings of the ACL-2007 Workshop on a Broader
Perspective on Multiword Expressions, 73–80, Prague, Czech Republic.
Zhang, T., Damerau, F., and Johnson, D. 2002. Text chunking using regularized Winnow. In
Proceedings of the 39th Annual Meeting of the ACL, pp. 539–546, July 6-11, 2001.