Unsupervised Compositionality Prediction of Nominal Compounds
Silvio Cordeiro
Federal University of Rio Grande do Sul
and Aix Marseille University, CNRS, LIS
[email protected]
Aline Villavicencio
University of Essex and
Federal University of Rio Grande do Sul
[email protected]
Marco Idiart
Federal University of Rio Grande do Sul
[email protected]
Carlos Ramisch
Aix Marseille University, CNRS, LIS
[email protected]
Nominal compounds such as red wine and nut case display a continuum of compositionality,
with varying contributions from the components of the compound to its semantics. This article
proposes a framework for compound compositionality prediction using distributional semantic
models, evaluating to what extent they capture idiomaticity compared to human judgments. For
evaluation, we introduce data sets containing human judgments in three languages: English,
French, and Portuguese. The results obtained reveal a high agreement between the models and human judgments, suggesting that the models are able to incorporate information about idiomaticity.
We also present an in-depth evaluation of various factors that can affect prediction, such as
model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the models to predict compositionality, and show that a uniform combination of the components yields the best results.
1. Introduction
Submission received: 4 December 2017; revised version received: 22 June 2018; accepted for publication:
8 August 2018.
doi:10.1162/COLI_a_00341
and machine translation (Ren et al. 2009; Carpuat and Diab 2010; Cap et al. 2015;
Salehi et al. 2015). Moreover, the evaluation of DSMs on tasks involving MWEs, such
as compositionality prediction, has the potential to drive their development towards
new directions.
The main hypothesis of our work is that, if the meaning of a compositional nominal compound can be derived from a combination of its parts, this translates in DSMs into similar vectors for the compositional nominal compound and for the combination of the vectors of its parts using some vector operation, which we refer to as a composition function. Conversely, we can use the lack of similarity between the nominal compound’s vector representation and a combination of its parts to detect idiomaticity. Further-
more, we hypothesize that accuracy in predicting compositionality depends both on
the characteristics of the DSMs used to represent expressions and their components
and on the composition function adopted. Therefore, we have built 684 DSMs and
performed an extensive evaluation, involving over 9,072 analyses, investigating various
types of DSMs, their configurations, the corpora used to train them, and the composition
function used to build vectors for expressions.2
This article is structured as follows. Section 2 presents related work on distributional
semantics, compositionality prediction, and nominal compounds. Section 3 presents the
data sets created for our evaluation. Section 4 describes the compositionality prediction
framework, along with the composition functions which we evaluate. Section 5 spec-
ifies the experimental setup (corpora, DSMs, parameters, and evaluation measures).
Section 6 presents the overall results of the evaluated models. Sections 7 and 8 evaluate
the impact of DSM and corpus parameters, and of composition functions on composi-
tionality prediction. Section 9 discusses system predictions through an error analysis.
Section 10 summarizes our conclusions. Appendix A contains a glossary, Appendix B
presents extra sanity-check experiments, Appendix C contains the questionnaire used
for data collection, and Appendices D, E, and F list the compounds in the data sets.
2. Related Work
The literature on distributional semantics is extensive (Lin 1998; Turney and Pantel 2010;
Baroni and Lenci 2010; Mohammad and Hirst 2012), so we provide only a brief introduction here, underlining the characteristics most relevant to our framework (Section 2.1).
Then, we define compositionality prediction and discuss existing approaches, focusing
on distributional techniques for multiword expressions (Section 2.2). Our framework is
evaluated on nominal compounds, and we discuss their relevant properties (Section 2.3)
along with existing data sets for evaluating compositionality prediction (Section 2.4).
1. We consolidate the description of the data sets introduced in Ramisch et al. (2016) and
Ramisch, Cordeiro, and Villavicencio (2016) by adding details about data collection, filtering,
and results of a thorough analysis studying the correlation between compositionality and
related variables.
2. We extend the compositionality prediction framework described in Cordeiro, Ramisch, and
Villavicencio (2016) by adding and evaluating new composition functions and DSMs.
3. We extend the evaluation reported in Cordeiro et al. (2016) not only by adding Portuguese,
but also by evaluating additional parameters: corpus size, composition functions, and new
DSMs.
2.1 Distributional Semantic Models

Distributional semantic models (DSMs) use context information to represent the mean-
ing of lexical units as vectors. These vectors are built assuming the distributional
hypothesis, whose central idea is that the meaning of a word can be learned based
on the contexts where it appears—or, as popularized by Firth (1957), “you shall know a
word by the company it keeps.”
Formally, a DSM attempts to encode the meaning of each target word wi of a
vocabulary V as a vector of real numbers v(wi ) in R|V| . Each component of v(wi ) is a
function of the co-occurrence between wi and the other words in the vocabulary (its
contexts wc ). This function can be simply a co-occurrence count c(wi , wc ), or some mea-
sure of the association between wi and each wc , such as pointwise mutual information
(PMI, Church and Hanks [1990], Lin [1999]) or positive PMI (PPMI, Baroni, Dinu, and
Kruszewski [2014]; Levy, Goldberg, and Dagan [2015]).
In DSMs, co-occurrence can be defined as two words co-occurring in the same
document, sentence, or sentence fragment in a corpus. Intrasentential models are often
based on a sliding window; that is, a context word wc co-occurs within a certain window
of W words around the target wi . Alternatively, co-occurrence can also be based on
syntactic relations obtained from parsed corpora, where a context word wc appears
within specific syntactic relations with wi (Lin 1998; Padó and Lapata 2007; Lapesa and
Evert 2017).
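To make the windowed co-occurrence definition concrete, the following is a minimal Python sketch that counts linearly weighted target–context pairs in a symmetric sliding window. The function name, the toy corpus, and the exact decay scheme are illustrative only, not the settings used in the experiments reported here.

```python
# Sketch of window-based co-occurrence counting over a lemmatized corpus,
# given as an iterable of token lists. Closer contexts receive larger weights.
from collections import Counter

def count_cooccurrences(sentences, window=4):
    """Count weighted target-context pairs in a symmetric window of size W."""
    counts = Counter()
    for tokens in sentences:
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j == i:
                    continue
                distance = abs(i - j)
                # Illustrative linear decay: weight (W - D + 1) / W.
                counts[(target, tokens[j])] += (window - distance + 1) / window
    return counts

corpus = [["red", "wine", "be", "serve", "with", "red", "meat"]]
print(count_cooccurrences(corpus, window=2).most_common(3))
```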
The set of all vectors v(wi ), ∀wi ∈ V can be represented as a sparse co-occurrence
matrix V × V → R. Given that most word pairs in this matrix co-occur rarely (if ever),
a threshold on the number of co-occurrences is often applied to discard irrelevant pairs.
Additionally, co-occurrence vectors can be transformed to have a significantly smaller number of dimensions, converting vectors in R|V| into vectors in Rd , with d ≪ |V|.3
Two solutions are commonly employed in the literature. The first one consists in using
context thresholds, where all target–context pairs that do not belong to the top-d most
relevant pairs are discarded (Salehi, Cook, and Baldwin 2014; Padró et al. 2014b). The
second solution consists in applying a dimensionality reduction technique such as
singular value decomposition on the co-occurrence matrix where only the d largest
singular values are retained (Deerwester et al. 1990). Similar techniques focus on the
factorization of the logarithm of the co-occurrence matrix (Pennington, Socher, and
Manning 2014) and on alternative factorizations of the PPMI matrix (Salle, Villavicencio,
and Idiart 2016).
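The PPMI weighting and SVD-based reduction described above can be sketched in a few lines of Python. This toy version uses dense matrices and invented counts; realistic implementations operate on large sparse matrices, and the helper names are ours.

```python
# Sketch: PPMI matrix from raw co-occurrence counts, then truncated SVD.
import numpy as np

def ppmi(counts):
    """counts: dense |V| x |V| co-occurrence matrix."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # target marginals
    p_c = counts.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)                       # keep positive PMI only

def svd_reduce(matrix, d):
    """Keep the d largest singular values (cf. Deerwester et al. 1990)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :d] * s[:d]                           # d-dimensional word vectors

counts = np.array([[0, 3, 1], [3, 0, 2], [1, 2, 0]], dtype=float)
vectors = svd_reduce(ppmi(counts), d=2)
print(vectors.shape)  # (3, 2)
```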
Alternatively, DSMs can be constructed by training a neural network to predict
target–context relationships. For instance, a network can be trained to predict a target
word wi among all possible words in V given as input a window of surrounding
context words. This is known as the continuous bag-of-words model. Conversely, the
network can try to predict context words for a target word given as input, and this is
known as the skip-gram model (Mikolov et al. 2013). In both cases, the network training
procedure allows encoding in the hidden layer semantic information about words as a
side effect of trying to solve the prediction task. The weight parameters that connect
the unit representing wi with the d-dimensional hidden layer are taken as its vector
representation v(wi ).
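As an illustration, both architectures are available in off-the-shelf implementations such as gensim. The sketch below assumes gensim >= 4.0 (earlier versions name the dimensionality parameter size rather than vector_size) and uses a toy corpus far too small to learn meaningful vectors.

```python
# Hedged sketch: skip-gram and CBOW vectors via gensim's Word2Vec, one common
# implementation of the models of Mikolov et al. (2013).
from gensim.models import Word2Vec

corpus = [["red", "wine", "be", "serve", "cold"],
          ["he", "be", "a", "nut", "case"]]

sg_model = Word2Vec(corpus, vector_size=50, window=4, sg=1,    # skip-gram
                    negative=5, min_count=1)
cbow_model = Word2Vec(corpus, vector_size=50, window=4, sg=0,  # CBOW
                      negative=5, min_count=1)

print(sg_model.wv["wine"].shape)  # (50,) -- hidden-layer weights as v(w)
```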
There are a number of factors that may influence the ability of a DSM to accurately
learn a semantic representation. These include characteristics of the training corpus such
3 After dimensionality reduction, word vectors are nowadays often called word embeddings.
as size (Mikolov, Yih, and Zweig 2013) as well as frequency thresholds and filters (Ferret
2013; Padró et al. 2014b), genre (Lapesa and Evert 2014), preprocessing (Padó and
Lapata 2003, 2007), and type of context (window vs. syntactic dependencies) (Agirre
et al. 2009; Lapesa and Evert 2017). Characteristics of the model include the choice of
association and similarity measures (Curran and Moens 2002), dimensionality reduction
strategies (Van de Cruys et al. 2012), and the use of subsampling and negative sampling
techniques (Mikolov, Yih, and Zweig 2013). However, the particular impact of these
factors on the quality of the resulting DSM may be heterogeneous and depends on the
task and model (Lapesa and Evert 2014). Because there is no consensus about a single
optimal model that works for all tasks, we compare a variety of models (Section 5) to
determine which are best suited for our compositionality prediction framework.
4 The task of determining whether a phrase is compositional is closely related to MWE discovery (Constant
et al. 2017), which aims to automatically extract MWE lists from corpora.
further modified to learn polynomial projections of higher degree, with quadratic pro-
jections yielding particularly promising results (Yazdani, Farahmand, and Henderson
2015). These models come with the caveat of being supervised, requiring some amount
of pre-annotated data in the target language. Because of these requirements, our study
focuses on unsupervised compositionality prediction methods only, based exclusively
on automatically POS-tagged and lemmatized monolingual corpora.
Alternatives to the additive model include the multiplicative model and its vari-
ants (Mitchell and Lapata 2008). However, results suggest that this representation
is inferior to the one obtained through the additive model (Reddy, McCarthy, and
Manandhar 2011; Salehi, Cook, and Baldwin 2015). Recent work on predicting intra-
compound semantics also supports that additive models tend to yield better results
than multiplicative models (Hartung et al. 2017).
The third ingredient is the measure of similarity between the compositionally
constructed vector and its actual corpus-based representation. Cosine similarity is the
most commonly used measure for compositionality prediction in the literature (Schone
and Jurafsky 2001; Reddy, McCarthy, and Manandhar 2011; Schulte im Walde, Müller,
and Roller 2013; Salehi, Cook, and Baldwin 2015). Alternatively, one can calculate the
overlap between the distributional neighbors of the whole phrase and those of the
component words (McCarthy, Keller, and Carroll 2003), or the number of single-word
distributional neighbors of the whole phrase (Riedl and Biemann 2015).
5 The terms noun compound and compound noun are usually reserved for nominal compounds formed by
sequences of nouns only, typical of Germanic languages but not frequent in Romance languages.
6 In this article, examples are preceded by their language codes: EN for English, FR for French, and PT for
Brazilian Portuguese. In the absence of a language code, English is implied.
in this article. Hence, we focus on 2-word nominal compounds of the form noun1 –noun2
(in English), and noun–adjective and adjective–noun (in the three languages).
Regarding the meaning of nominal compounds, the implicit relation between the
components of compositional compounds can be described in terms of free paraphrases
involving verbs, such as flu virus as virus that causes/creates flu (Nakov 2008),7 or prepo-
sitions, such as olive oil as oil from olives (Lauer 1995). These implicit relations can often
be seen explicitly in the equivalent expressions in other languages (e.g., FR huile d’olive
and PT azeite de oliva for EN olive oil).
Alternatively, the meaning of compositional nominal compounds can be described
using a closed inventory of relations which make the role of the modifier explicit with
respect to the head noun, including syntactic tags such as subject and object, and seman-
tic tags such as instrument and location (Girju et al. 2005). The degree of compositionality
of a nominal compound can also be represented using numerical scores (Section 2.4)
to indicate to what extent the component words allow predicting the meaning of the
whole (Reddy, McCarthy, and Manandhar 2011; Roller, Schulte im Walde, and Scheible
2013; Salehi et al. 2015). The latter is the representation that we adopted in this article.
7 Nakov (2008) also proposes a method for automatically extracting paraphrases from the web to classify
nominal compounds. This was extended in a SemEval 2013 task, where participants had to rank free
paraphrases according to the semantic relations in the compounds (Hendrickx et al. 2013).
• Farahmand, Smith, and Nivre (2015) collected judgments for 1,042 English
noun–noun compounds. Each compound has binary judgments regarding
non-compositionality and conventionalization given by four expert
annotators (both native and non-native speakers). A hard threshold is
applied so that compounds are considered as noncompositional if at least
two annotators say so (Yazdani, Farahmand, and Henderson 2015), and
the total compositionality score is given by the sum of the four binary
judgments. This data set will be referred to as Farahmand in our
experiments.
• Kruszewski and Baroni (2014) built the Norwegian Blue Parrot data set,
containing judgments for modifier-head phrases in English. The
judgments consider whether the phrase is (1) an instance of the concept
denoted by the head (e.g., dead parrot and parrot) and (2) a member of the
more general concept that includes the head (e.g., dead parrot and pet),
along with typicality ratings, with 5,849 judgments in total.
• Roller, Schulte im Walde, and Scheible (2013) collected judgments for a set
of 244 German noun–noun compounds, each compound with an average
of around 30 judgments on a compositionality scale from 1 to 7, obtained
through crowdsourcing. The resource was later enriched with feature
norms (Roller and Schulte im Walde 2014).
• Schulte im Walde et al. (2016) collected judgments for a set of 868 German
noun–noun compounds, including human judgments of compositionality
on a scale of 1 to 6. Compounds are judged by multiple annotators, and
the final compositionality score is the average across annotators. The data
set is also annotated for in-corpus frequency, productivity, and ambiguity,
and a subset of 180 compounds has been selected for balancing these
variables. The annotations were performed by the authors, linguists, and
through crowdsourcing. For the balanced subset of 180 compounds,
compositionality annotations were performed by experts only, excluding
the authors.
3. Compositionality Data Sets

For a multilingual evaluation, in this work, we construct two data sets, one for
French and one for Portuguese compounds, and extend the Reddy data set for English
using the same protocol as Reddy, McCarthy, and Manandhar (2011).
In Section 3.1, we describe the construction of data sets of 180 compounds for French
(FR-comp) and Portuguese (PT-comp). For English, the complete data set contains 280
compounds, of which 190 are new and 90 come from the Reddy data set. We use 180
of these (EN-comp) for cross-lingual comparisons (90 from the original Reddy data set
combined with 90 new ones from EN-comp90 ), and 100 new compounds as held-out data
(EN-compExt ), to evaluate the robustness of the results obtained (Section 6.3). These data
sets containing compositionality scores for 2–word nominal compounds are used to
evaluate our framework (Section 4), and we discuss their characteristics in Section 3.2.8
8 For English, only EN-comp90 and EN-compExt (90 and 100 new compounds, respectively) are considered.
Reddy (included in EN-comp) is analyzed in Reddy, McCarthy, and Manandhar (2011).
3.1 Data Collection

For each of the target languages, we collected, via crowdsourcing, a set of numerical
scores corresponding to the level of compositionality of the target nominal compounds.
We asked non-expert participants to judge each compound considering three sentences
where the compound occurred. After reading the sentences, participants assess the
degree to which the meaning of the compound is related to the meanings of its parts.
This follows from the assumption that a fully compositional compound will have an
interpretation whose meaning stems from both words (e.g., lime tree as a tree of limes),
while a fully idiomatic compound will have a meaning that is unrelated to its compo-
nents (e.g., nut case as an eccentric person).
Our work follows the protocol proposed by Reddy, McCarthy, and Manandhar
(2011), where compositionality is explained in terms of the literality of the individual
parts. This type of indirect annotation does not require expert linguistic knowledge,
and still provides reliable data, as we show later. For each language, data collection
involved four steps: compound selection, sentence selection, questionnaire design, and
data aggregation.
Compound Selection. For each data set, we manually selected nominal compounds from
dictionaries, corpus searches, and by linguistic introspection, maintaining an equal pro-
portion of compounds that are compositional, partly compositional, and idiomatic.9 We
considered them to be compositional if their semantics are related to both components
(e.g., benign tumor), partly compositional if their semantics are related to only one of
the components (e.g., grandfather clock), and idiomatic if they are not directly related to
either (e.g., old flame). This preclassification was used only to select a balanced set of
compounds and was not shown to the participants nor used at any later stage. For
all languages, all compounds are required to have a head that is unambiguously a
noun, and additionally for French and Portuguese, all compounds have an adjective
as modifier.
Sentence Selection. Compounds may be polysemous (e.g., FR bras droit may mean most
reliable helper or literally right arm). To avoid any potential sense uncertainty, each
compound was presented to the participants with the same sense in three sentences.
These sentences were manually selected from the WaC corpora: ukWaC (Baroni et al.
2009), frWaC, and brWaC (Boos, Prestes, and Villavicencio 2014), presented in detail in
Section 5.
Questionnaire Design. For each compound, after reading three sentences, participants are asked to:

• provide synonyms for the compound as used in the sentences;
• judge how literal the meaning of the modifier is within the compound;
• judge how literal the meaning of the head is within the compound;
• judge how literal the meaning of the compound is as a whole.
9 We have not attempted to select compounds that are translations of each other, as a compound in a given
language may be realized differently in the other languages.
9
Computational Linguistics Volume 45, Number 1
Participants answer the last three items using a Likert scale from 0 (idiomatic/non-
literal) to 5 (compositional/literal), following Reddy, McCarthy, and Manandhar (2011).
To qualify for the task, participants had to submit demographic information confirming
that they are native speakers, and to undergo training in the form of four example
questions with annotated answers in an external form (see Appendix C for details).
Data Aggregation. For English and French, we collected answers using Amazon Mechan-
ical Turk (AMT), manually removing answers that were not from native speakers or
where the synonyms provided were unrelated to the target compound sense. Because
AMT has few Brazilian Portuguese native speakers, we developed an in-house web inter-
face for the questionnaire, which was sent out to Portuguese-speaking NLP mailing
lists.
For a given compound and question, we calculate aggregated scores as the arithmetic average of all answers across participants. We will refer to these averaged scores as human compositionality scores (hc). We average the answers to the three
questions independently, generating three scores: hcH for the head noun, hcM for the
modifier, and hcHM for the whole compound. In our framework, we try to predict hcHM
automatically (Section 5). To assess the variability of the answers (Section 3.2.1), we also
calculate the standard deviation across participants for each question (σH , σM , and σHM ).
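A minimal sketch of this aggregation step, assuming the answers are held in a per-compound, per-question dictionary (the layout and the scores below are invented for illustration):

```python
# Averaging participant answers into hc scores and per-question sigma.
import statistics

answers = {  # compound -> question -> list of participant Likert scores (0-5)
    "nut case": {"H": [1, 0, 2, 1], "M": [3, 2, 2, 4], "HM": [0, 1, 0, 1]},
}

for compound, questions in answers.items():
    hc = {q: statistics.mean(scores) for q, scores in questions.items()}
    sigma = {q: statistics.stdev(scores) for q, scores in questions.items()}  # sample std. dev.
    print(compound, hc["HM"], sigma["HM"])
```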
The list of compounds, their translations, glosses, and compositionality scores are
given in Appendices D (EN-comp90 and EN-compExt ), E (FR-comp), and F (PT-comp).10
3.2.1 Measuring Data Set Quality. To assess the quality of the collected human composi-
tionality scores, we use standard deviation and inter-annotator agreement scores.
Standard Deviation (σ and Pσ>1.5 ). The standard deviation (σ) of the participants’ answers can be used as an indication of their agreement: for each compound and for each of the three questions, small σ values suggest greater agreement. In addition, if the instructions are clear, σ can also be seen as an indication of the level of difficulty of the task. In other words, all other things being equal, compounds with larger σ can be considered intrinsically harder for the participants to analyze. For each data set, we consider two aggregated metrics based on σ: the average standard deviation σ over all compounds, and the proportion Pσ>1.5 of compounds whose standard deviation exceeds 1.5.
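Under this reading of the two metrics, their computation is immediate; the σ values below are invented:

```python
# Aggregate metrics over per-compound standard deviations.
import numpy as np

def aggregate_sigma(sigmas):
    sigmas = np.asarray(sigmas)
    return sigmas.mean(), (sigmas > 1.5).mean()  # average sigma, P(sigma > 1.5)

avg_sigma, p_high = aggregate_sigma([1.1, 0.7, 1.8, 1.4, 2.0])
print(f"avg sigma = {avg_sigma:.2f}, P(sigma > 1.5) = {p_high:.2f}")
```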
Table 1
Average number of answers per compound n, average standard deviation σ, proportion of high
standard deviation Pσ>1.5 , for the compound (HM), head (H), and modifier (M).
Table 1 presents the results of these metrics when applied to our in-house data sets,
as well as to the original Reddy data set. The column n indicates the average number of
answers per compound, while the other six columns present the values of σ and Pσ>1.5
for compound (HM), head-only (H), and modifier-only (M) scores.
These values are below what would be expected for random decisions (σrand ≈ 1.71 for the Likert scale). Although our data sets exhibit higher variability than Reddy,
this may be partly due to the application of filters done by Reddy, McCarthy, and
Manandhar (2011) to remove outliers.11 These values could also be due to the collec-
tion of fewer answers per compound for some of the data sets. However, there is no
clear relation between the standard deviation of the answers and the number of participants n. The values of σ are quite homogeneous, ranging from 1.05 for EN-comp90 (head) to 1.27 for EN-compExt (head). The low agreement for modifiers may be
EN-comp90 (head) to 1.27 for EN-compExt (head). The low agreement for modifiers may be
related to a greater variability in semantic relations between modifiers and compounds:
these include material (e.g., brass ring), attribute (e.g., black cherry), and time (e.g., night
owl).
Figure 1(a) shows standard deviation (σHM , σH , and σM ) for each compound of
FR-comp as a function of its average compound score hcHM .12 For all three languages,
greater agreement was found for compounds at the extremes of the compositionality
scale (fully compositional or fully idiomatic) for all scores. These findings can be partly
explained by end-of-scale effects, that result in greater variability for the intermedi-
ate scores in the Likert scale (from 1 to 4) that correspond to the partly composi-
tional cases. Hence, we expect that it will be easier to predict the compositionality of
idiomatic/compositional compounds than of partly compositional ones.
11 Participants with negative correlation with the mean, and answers farther than ±1.5 from the mean.
12 Only FR-comp is shown as the other data sets display similar patterns.
11
Computational Linguistics Volume 45, Number 1
Figure 1
Left: Standard deviations (σH , σM , and σHM ) as a function of hcHM in FR-comp. Right: Average
compositionality (hcH , hcM , and hcHM ) as a function of hcHM in FR-comp.
Inter-Annotator Agreement (α). Since the α score assumes that each participant rates all the items, we focus on the answers provided by three of the participants, who rated the whole set of 180 compounds in PT-comp.
Using a linear distance schema between the answers,13 we obtain an agreement of
α = .58 for head-only, α = .44 for modifier-only, and α = .44 for the whole compound.
To further assess the difficulty of this task, we also calculate α for a single expert
annotator, judging the same set of compounds after an interval of one month. The scores
were α = .69 for the head and α = .59 for both the compound and for the modifier. The
Spearman correlation between these two annotations performed by the same expert
is ρ = 0.77 for hcHM . This can be seen as a qualitative upper bound for automatic
compositionality prediction on PT-comp.
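Such correlations can be computed directly with scipy; the sketch below illustrates the intra-annotator Spearman check with made-up score vectors:

```python
# Spearman correlation between two annotation rounds of the same expert.
from scipy.stats import spearmanr

round1 = [0.5, 2.0, 4.5, 3.0, 1.0]   # hcHM, first pass (invented)
round2 = [1.0, 2.5, 4.0, 3.5, 0.5]   # hcHM, one month later (invented)
rho, pvalue = spearmanr(round1, round2)
print(f"rho = {rho:.2f} (p = {pvalue:.3f})")
```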
3.2.2 Compositionality, Familiarity, and Conventionalization. Figure 1(b) shows the average
scores (hcHM , hcH , and hcM ) for the compounds ranked according to the average com-
pound score hcHM . Although this figure is for FR-comp, similar patterns were found for
the other data sets. For all three languages, the human compositionality scores provide
additional confirmation that the data sets are balanced, with the compound scores
(hcHM ) being distributed linearly along the scale. Furthermore, we have calculated the
average hcHM values separately for the compounds in each of the three compositionality
classes used for compound selection: idiomatic, partly compositional and compositional
(Section 3.1). These averages are, respectively, 1.0, 2.4, and 4.0 for EN-comp90 ; 1.1, 2.4,
and 4.2 for EN-compExt ; 1.3, 2.7, and 4.3 for FR-comp; and 1.3, 2.5, and 3.9 for PT-comp,
indicating that our attempt to select a balanced number of compounds from each class
is visible in the collected hcHM scores.
Additionally, the human scores also suggest an asymmetric impact of the non-literal
parts over the compound: whenever participants judged an element of the compound
as non-literal, the whole compound was also rated as idiomatic. Thus, most head and
modifier scores (hcH and hcM ) are close to or above the diagonal line in Figure 1(b).
In other words, a component of the compound is seldom rated as less literal than the
compositionality of the whole compound hcHM , although the opposite is more common.
Figure 2
Relation between hcH ⊗ hcM and hcHM in FR-comp, using arithmetic and geometric means.
Table 2
Spearman ρ correlation between compositionality, frequency, and PMI for the three data sets.
To evaluate whether it is possible to predict hcHM from hcH and hcM , we calculate the
arithmetic and geometric means between hcH and hcM for each compound. Figure 2
shows the linear regression of both measures for FR-comp. The goodness of fit is r²arith = .93 for the arithmetic mean, and r²geom = .96 for the geometric mean, confirming
that they are good predictors of hcHM .14 Thus, we assume that hcHM summarizes hcH
and hcM , and focus on predicting hcHM instead of hcH and hcM separately. These find-
ings also inspired the pcarith and pcgeom compositionality prediction functions (Section 4).
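This goodness-of-fit check can be reproduced with scipy's linregress, whose rvalue squared corresponds to the r² reported above; the hc values below are invented for illustration:

```python
# Regressing hcHM on the arithmetic and geometric means of hcH and hcM.
import numpy as np
from scipy.stats import linregress

hc_h = np.array([4.8, 1.2, 3.9, 0.6])
hc_m = np.array([4.5, 0.8, 2.1, 0.9])
hc_hm = np.array([4.6, 0.7, 2.8, 0.5])

arith = (hc_h + hc_m) / 2
geom = np.sqrt(hc_h * hc_m)

for name, pred in [("arith", arith), ("geom", geom)]:
    fit = linregress(pred, hc_hm)
    print(f"r2_{name} = {fit.rvalue ** 2:.2f}")
```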
To examine whether there is an effect of the familiarity of a compound on hc
scores, in particular if more idiomatic compounds need to be more familiar, we also
calculated the correlation between the compositionality score for a compound hcHM
and its frequency in a corpus, as a proxy for familiarity. In this case we used the WaC
corpora and calculated the frequencies based on the lemmas. The results, in Table 2,
show a statistically significant positive Spearman correlation of ρ = 0.305 for EN-comp90 ,
ρ = 0.384 for EN-compExt , and ρ = 0.598 for FR-comp, indicating that, contrary to our
expectations, compounds that are more frequent tend to be assigned higher composi-
tionality scores. However, frequency alone is not enough to predict compositionality,
and further investigation is needed to determine if compositionality and frequency
are also correlated with other factors.
14 r²arith and r²geom are .91 and .96 in PT-comp, .90 and .96 in EN-comp90 , and .92 and .95 in EN-compExt .
4. Compositionality Prediction Framework
For vβ (w1 , w2 ), we use the additive model (Mitchell and Lapata 2008), in which the
composition function is a weighted linear combination:
$$v_\beta(w_1w_2) \;=\; \beta\,\frac{v(w_{\mathrm{head}})}{\|v(w_{\mathrm{head}})\|} \;+\; (1-\beta)\,\frac{v(w_{\mathrm{mod}})}{\|v(w_{\mathrm{mod}})\|},$$
where whead (or wmod ) indicates the head (or modifier) of the compound w1 w2 , || · || is the
Euclidean norm, and β ∈ [0, 1] is a parameter that controls the relative importance of
the head to the compound’s compositionally constructed vector. The normalization of
both vectors allows taking only their directions into account, regardless of their norms,
which are usually proportional to their frequency and irrelevant to meaning.
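A direct numpy rendering of vβ, normalizing both component vectors before the weighted combination (the vector values below are invented):

```python
# Weighted additive composition function v_beta.
import numpy as np

def compose(v_head, v_mod, beta=0.5):
    """Weighted additive composition of unit-normalized component vectors."""
    v_head = v_head / np.linalg.norm(v_head)
    v_mod = v_mod / np.linalg.norm(v_mod)
    return beta * v_head + (1 - beta) * v_mod

v_wine, v_red = np.array([0.2, 0.9, 0.1]), np.array([0.7, 0.3, 0.4])
print(compose(v_wine, v_red, beta=0.5))
```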
We define six compositionality scores based on pcβ . Three of them, pchead (w1 w2 ), pcmod (w1 w2 ), and pcuniform (w1 w2 ), correspond to different assumptions about how we model compositionality: dependent on the head (β = 1, e.g., crocodile tears), on the modifier (β = 0, e.g., busy bee), or in equal measure on the head and modifier (β = 1/2, e.g., graduate student). The fourth score, pcmaxsim , is based on the assumption that compositionality may be distributed differently between head and modifier for different compounds. We implement this idea by setting individually for each compound the
[Figure 3 diagram: processed corpus + DSM parameters → DSM → corpus-derived vectors v(w1), v(w2), and v(w1w2); composition function → compositionally constructed vector vβ(w1,w2); similarity function → predicted compositionality scores (pc), evaluated via Spearman ρ against the human compositionality scores (hc) collected with the questionnaires.]
Figure 3
Schema of a compositionality prediction configuration based on a composition function. Thick
arrows indicate corpus-based vectors of two-word compounds treated as a single token. The
schema also covers the evaluation of the compositionality prediction configuration (top right).
value for β that yields maximal similarity in the predicted compositionality score, that is:17

$$pc_{\mathrm{maxsim}}(w_1w_2) \;=\; \max_{\beta \in [0,1]} \cos\big(v(w_1w_2),\, v_\beta(w_1, w_2)\big)$$
Two other scores are not based on the additive model and do not require a composition function. Instead, they are based on the intuitive notion that compositionality is related to the average similarity between the compound and its components:

$$pc_{\mathrm{arith}}(w_1w_2) = \frac{\cos\big(v(w_1w_2), v(w_1)\big) + \cos\big(v(w_1w_2), v(w_2)\big)}{2}$$

$$pc_{\mathrm{geom}}(w_1w_2) = \sqrt{\cos\big(v(w_1w_2), v(w_1)\big) \cdot \cos\big(v(w_1w_2), v(w_2)\big)}$$
We test two possibilities: the arithmetic mean pcarith (w1 w2 ) considers that composition-
ality is linearly related to the similarity of each component of the compound, whereas
the geometric mean pcgeom (w1 w2 ) reflects the tendency found in human annotations to
assign compound scores hcHM closer to the lowest score between that for the head hcH
and for the modifier hcM (Section 3.2).
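The six scores can be sketched compactly, assuming pcβ(w1w2) = cos(v(w1w2), vβ(w1,w2)) and using the closed-form β of footnote 17. Clipping β to [0, 1] and guarding the geometric mean against negative cosine products are our own safeguards, not necessarily those of the original implementation.

```python
# Sketch of the six pc scores; w1 is the head and w2 the modifier here.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pc_beta(v_c, v_head, v_mod, beta):
    """pc_beta(w1w2) = cos(v(w1w2), v_beta(w1, w2)), with normalized components."""
    h = v_head / np.linalg.norm(v_head)
    m = v_mod / np.linalg.norm(v_mod)
    return cos(v_c, beta * h + (1 - beta) * m)

def pc_scores(v_c, v_head, v_mod):
    ch, cm = cos(v_c, v_head), cos(v_c, v_mod)  # compound-component similarities
    chm = cos(v_head, v_mod)                    # head-modifier similarity
    # Closed-form argmax of pc_beta (footnote 17); clipping is our safeguard.
    beta = float(np.clip((ch - cm * chm) / ((ch + cm) * (1 - chm)), 0.0, 1.0))
    return {
        "pc_head": pc_beta(v_c, v_head, v_mod, 1.0),
        "pc_mod": pc_beta(v_c, v_head, v_mod, 0.0),
        "pc_uniform": pc_beta(v_c, v_head, v_mod, 0.5),
        "pc_maxsim": pc_beta(v_c, v_head, v_mod, beta),
        "pc_arith": (ch + cm) / 2,
        "pc_geom": float(np.sqrt(max(ch * cm, 0.0))),  # guard negative products
    }

v_compound = np.array([0.5, 0.5, 0.2])
print(pc_scores(v_compound, np.array([0.2, 0.9, 0.1]), np.array([0.7, 0.3, 0.4])))
```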
5. Experimental Setup
This section describes the common setup used for evaluating compositionality pre-
diction, such as corpora (Section 5.1), DSMs (Section 5.2), and evaluation metrics
(Section 5.3).
17 In practice, for the special case of two words, we do not need to perform parameter search for β, which has a closed form obtained by solving the equation ∂pcβ (w1 w2 )/∂β = 0:

$$\beta = \frac{\cos(w_1w_2,\, w_1) - \cos(w_1w_2,\, w_2)\,\cos(w_1,\, w_2)}{\big(\cos(w_1w_2,\, w_1) + \cos(w_1w_2,\, w_2)\big)\,\big(1 - \cos(w_1,\, w_2)\big)}.$$
5.1 Corpora
In this work we used the lemmatized and POS-tagged versions of the WaC corpora not
only for building DSMs, but also as sources of information about the target compounds
for the analyses performed (e.g., in Sections 3.2.2, 9.1, and 9.2):
• for English, the ukWaC (Baroni et al. 2009), with 2.25 billion tokens, parsed
with MaltParser (Nivre, Hall, and Nilsson 2006);
• for French, the frWaC with 1.61 billion tokens preprocessed with
TreeTagger (Schmid 1995); and
• for Brazilian Portuguese, a combination of brWaC (Boos, Prestes, and
Villavicencio 2014), Corpus Brasileiro,18 and all Wikipedia entries,19 with a
total of 1.91 billion tokens, all parsed with PALAVRAS (Bick 2000).
For all compounds contained in our data sets, we transformed their occurrences into single tokens by joining their component words with an underscore (e.g., EN monkey business → monkey_business and FR belle-mère → belle_mère).20,21 To han-
dle POS-tagging and lemmatization irregularities, we retagged the compounds’ com-
ponents using the gold POS and lemma in our data sets (e.g., for EN sitting duck,
sit/verb duck/noun→sitting/adjective duck/noun). We also simplified all POS tags using
coarse-grained labels (e.g., verb instead of vvz). All forms are then lowercased (surface
forms, lemmas, and POS tags); and noisy tokens, with special characters, numbers, or
punctuation, are removed. Additionally, ligatures are normalized for French (e.g., œ →
oe) and a spellchecker22 is applied to normalize words across English spelling variants
(e.g., color → colour).
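A simplified sketch of the compound-joining and lowercasing steps; the compound list, hyphen handling, and tokenization are heavily abridged here:

```python
# Merge known two-word compounds into single underscore-joined tokens so that
# DSMs treat them as one vocabulary entry; all forms are lowercased.
COMPOUNDS = {("monkey", "business"), ("nut", "case")}

def join_compounds(tokens):
    tokens = [t.lower() for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in COMPOUNDS:
            out.append(tokens[i] + "_" + tokens[i + 1])  # single token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_compounds(["This", "is", "monkey", "business", "!"]))
```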
To evaluate the influence of preprocessing on compositionality prediction (Sec-
tion 7.3), we generated four versions of each corpus, with different levels of linguistic
information. We expect lemmatization to reduce data sparseness by merging morpho-
logically inflected variants of the same lemma:
18 http://corpusbrasileiro.pucsp.br/cb/Inicial.html
19 Wikipedia articles downloaded on June 2016.
20 Hyphenated compounds are also re-tokenized with an underscore separator.
21 Therefore, in Section 5.2, the terms target/context words may actually refer to compounds.
22 https://hunspell.github.io
23 In the lemmatized corpora, the lemmas of proper names are replaced by placeholders.
5.2 DSMs
Positive Pointwise Mutual Information (PPMI). In the models based on the PPMI matrix,
the representation of a target word is a vector containing the PPMI association scores
between the target and its contexts (Bullinaria and Levy 2012). The contexts are nouns
and verbs, selected in a symmetric sliding window of W words to the left/right and
weighted linearly according to their distance D to the target (Levy, Goldberg, and Dagan
2015).24 We consider three models that differ in how the contexts are selected:
• In PPMI–thresh, the vectors are |V |-dimensional, but only the top d contexts with the highest PPMI scores for each target word are kept, while the others are set to zero (Padró et al. 2014a)25 (see the sketch after this list).
• In PPMI–TopK, the vectors are d-dimensional, and each of the d
dimensions corresponds to a context word taken from a fixed list of k
contexts, identical for all target words. We chose k as the 1,000 most
frequent words in the corpus after removing the top 50 most frequent
words (Salehi, Cook, and Baldwin 2015).
• In PPMI–SVD, singular value decomposition is used to factorize the PPMI
matrix and reduce its dimensionality from |V | to d.26 We set the value of
the context distribution smoothing factor to 0.75, and the negative
sampling factor to 5 (Levy, Goldberg, and Dagan 2015). We use the default
minimum word count threshold of 5.
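The per-target thresholding in PPMI–thresh can be sketched as follows, given one row of a PPMI matrix such as the one outlined in Section 2.1; the helper name is ours:

```python
# Keep only the d contexts with the highest PPMI scores per target word.
import numpy as np

def threshold_row(ppmi_row, d):
    row = ppmi_row.copy()
    if d < len(row):
        cutoff = np.partition(row, -d)[-d]  # d-th largest value in the row
        row[row < cutoff] = 0.0             # discard weaker target-context pairs
    return row

row = np.array([0.1, 2.3, 0.0, 1.7, 0.4])
print(threshold_row(row, d=2))  # keeps only the 2.3 and 1.7 entries
```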
Global Vectors (glove). GloVe28 implements a factorization of the logarithm of the posi-
tional co-occurrence count matrix (Pennington, Socher, and Manning 2014). We adopt
the default configurations from the documentation, except for: internal cutoff parameter
xmax = 75 and processing of the corpus in 15 iterations. For the corpora versions lemma
and lemmaPoS (Section 5.1), we use the minimum word count threshold of 5. For surface
and surface+ , due to the larger vocabulary sizes, we use thresholds of 15 and 20.29
24 In previous work adjectives and adverbs were also included as contexts, but the results obtained with
only verbs and nouns were better (Padró et al. 2014a).
25 Vectors still have |V | dimensions but we use d as a shortcut to represent the fact that we only retain the
most relevant target-context pairs for each target word.
26 https://bitbucket.org/omerlevy/hyperwords
27 https://code.google.com/archive/p/word2vec/
28 https://nlp.stanford.edu/projects/glove/
29 Thresholds were selected so as to not use more than 128 GB of RAM.
17
Computational Linguistics Volume 45, Number 1
Table 3
Summary of DSMs, their parameters, and evaluated parameter values. The combination of
these DSMs and their parameter values leads to 228 DSM configurations evaluated per language
(1 × 1 × 4 × 3 = 12 for PPMI–TopK, plus 6 × 3 × 4 × 3 = 216 for the other models).
DSM                       DIMENSION            WORD FORM                              WINDOW SIZE
PPMI–TopK                 d = 1000             surface+ , surface, lemma, lemmaPoS    W = 1+1, 4+4, 8+8
PPMI–thresh, PPMI–SVD,    d = 250, 500, 750    surface+ , surface, lemma, lemmaPoS    W = 1+1, 4+4, 8+8
w2v–cbow, w2v–sg,
glove, lexvec
Lexical Vectors (lexvec). The LexVec model30 factorizes the PPMI matrix in a way that
penalizes errors on frequent words (Salle, Villavicencio, and Idiart 2016). We adopt
the default configurations in the documentation, except for: 25 negative samples, sub-
sampling rate of 10−6 , and processing of the corpus in 15 iterations. Due to the vocab-
ulary sizes, we use a word count threshold of 10 for lemma and lemmaPoS , and 100 for
surface and surface+ .31
30 https://github.com/alexandres/lexvec
31 This is in line with the authors’ threshold suggestions (Salle, Villavicencio, and Idiart 2016).
32 Common window sizes are between 1+1 and 10+10, but a few works adopt larger sizes like 16+16 or
20+20 (Kiela and Clark 2014; Lapesa and Evert 2014).
6. Overall Results
In this section, we present the overall results obtained on the Reddy, Farahmand, EN-
comp, FR-comp, and PT-comp data sets, comparing all possible configurations (Sec-
tion 6.1). To determine their robustness we also report evaluation for all languages
using cross-validation (Section 6.2) and for English using the held-out data set
EN-compExt (Section 6.3). All results reported in this section use the pcuniform function.
Table 4 shows the highest overall values obtained for each DSM (columns) on each
data set (rows). For English (Reddy, EN-comp, and Farahmand), the highest results for
the compounds found in the corpus were obtained with w2v and PPMI–thresh, shown
as the first value in each pair in Table 4. Not all compounds in the English data sets are
present in our corpus. Therefore, we also report results adopting a fallback strategy (the
second value). Because its impact depends on the data set, and the relative performance
of the models is similar with or without it, for the remainder of the article we discuss
only the results without fallback.34
The best w2v–cbow and w2v–sg configurations are not significantly different from
each other, but both are different from PPMI–thresh (p < 0.05). In a direct comparison
Table 4
Highest results for each DSM, using BF1 for Farahmand data set, Pearson r for Reddy (r), and
Spearman ρ for all the other data sets. For English, in each pair of values, the first is for the
compounds found in the corpus, and the second uses fallback for missing compounds.
with related work, our best result for the Reddy data set (Spearman ρ = .812, Pearson
r = .814) improves upon the best correlation reported by Reddy, McCarthy, and
Manandhar (2011) (ρ = .714), and by Salehi, Cook, and Baldwin (2015) (r = .796). For
Farahmand, these results are comparable to those reported by Yazdani, Farahmand,
and Henderson (2015) (BF1 = .487), but our work adopts an unsupervised approach
for compositionality prediction. For both FR-comp and PT-comp, the w2v models are
outperformed by PPMI–thresh, whose predictions are significantly different from the
predictions of other models (p < 0.05).
In short, these results suggest language-dependent trends for DSMs, by which
w2v models perform better for the English data sets, and PPMI–thresh for French and
Portuguese. While this may be due to the level of morphological inflection in these lan-
guages, it may also be due to differences in corpus size or to particular DSM parameters
used in each case. In Section 7, we analyze the impact of individual DSM and corpus
parameters to better understand this language dependency.
6.2 Cross-Validation
Table 4 reports the best configurations for the EN-comp, FR-comp, and PT-comp data
sets. However, to determine whether the Spearman scores obtained are robust and
generalizable, in this section we report evaluation using cross-validation. For each data
set, we partition the 180 compounds into 5 folds of 36 compounds (f1 , f2 , . . . , f5 ). Then,
for each fold fi , we exhaustively look for the best configuration (values of WINDOWSIZE, DIMENSION, and WORDFORM) for the union of the other folds (∪j≠i fj ), and predict the
36 compositionality scores for fi using this configuration. The predicted scores for the
5 folds are then grouped into a single set of predictions, which is evaluated against
the 180 human judgments.
The partition of compounds into folds is performed automatically, based on random
shuffling.35 To avoid relying on a single arbitrary fold partition, we run cross-validation
10 times, with different fold partitions each time. This process generates 10 Spearman
correlations, for which we calculate the average value and a 95% confidence interval.
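The protocol can be sketched as follows, assuming the human scores hc and per-configuration predictions pc_by_config are available as dictionaries; selecting a configuration is reduced here to its Spearman correlation on the training folds, and all names are illustrative.

```python
# 5-fold cross-validation over configurations, repeated over random shuffles.
import random
import numpy as np
from scipy.stats import spearmanr

def cross_validate(hc, pc_by_config, n_folds=5, runs=10, seed=0):
    """hc: {compound: human score}; pc_by_config: {config: {compound: prediction}}."""
    compounds, rhos = sorted(hc), []
    for run in range(runs):
        items = compounds[:]
        random.Random(seed + run).shuffle(items)
        folds = [items[i::n_folds] for i in range(n_folds)]
        pred = {}
        for i, fold in enumerate(folds):
            train = [c for j, f in enumerate(folds) if j != i for c in f]
            # Pick the configuration with the highest Spearman on the other folds...
            best = max(pc_by_config, key=lambda cfg: spearmanr(
                [hc[c] for c in train],
                [pc_by_config[cfg][c] for c in train]).correlation)
            # ...and use it to predict the held-out fold.
            pred.update({c: pc_by_config[best][c] for c in fold})
        rhos.append(spearmanr([hc[c] for c in compounds],
                              [pred[c] for c in compounds]).correlation)
    return float(np.mean(rhos)), float(np.std(rhos))
```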
35 We have also considered separating folds so as to be balanced regarding their compositionality scores.
The results were similar to the ones reported here.
Figure 4
Results with highest Spearman for oracle and cross-validation, the latter with a confidence
interval of 95%; (a) top left: overall Spearman correlations per DSM and language, (b) top right:
different WORDFORM values and DSMs for English, (c) bottom left: different DIMENSION values and DSMs for French, and (d) bottom right: different WINDOWSIZE values and DSMs for
Portuguese.
21
Computational Linguistics Volume 45, Number 1
For PT-comp, the confidence intervals are quite wide, meaning that prediction
quality is sensitive to the choice of compounds used to estimate the best configura-
tions. Probably a larger data set would be required to stabilize cross-validation results.
Nonetheless, the other two data sets seem representative enough, so that the small
confidence intervals show that, even if we fix the value of a given parameter (e.g.,
d = 750), the results using cross-validation are stable and very similar to the oracle.
The confidence intervals overlapping with oracle data points also indicate that most
cross-validation results are not statistically different from the oracle. This suggests that
the highest-Spearman oracle configurations can be trusted as reasonable approximations of the best configurations for other data sets in the same language constructed using similar guidelines.
Table 5
Configurations with best performances on EN-comp and on EN-compExt . Best performances are
measured on EN-comp and the corresponding configurations are applied to EN-compExt .
Figure 5
Best results for each DSM and WINDOWSIZE (1+1, 4+4, and 8+8), using BF1 for Farahmand, and
Spearman ρ for other data sets. Thin bars indicate the use of fallback in English. Differences
between the two highest Spearman correlations for each model are statistically significant
(p < 0.05), except for PPMI–SVD, according to Wilcoxon’s sign-rank test.
7. Impact of DSM and Corpus Parameters

7.1 Window Size

DSMs build the representation of every word based on the frequency of other words
that appear in its context. Our hypothesis is that larger window sizes result in higher
scores, as the additional data allows a better representation of word-level semantics.
However, as some of these models adopt different weight decays for larger windows,36
variation in their behavior related to window size is to be expected.
Contrary to our expectations, for the best models in each language, large windows
did not lead to better compositionality prediction. Figure 5 shows the best results
obtained for each window size.37 For English, w2v is the best model, and its performance
does not seem to depend much on the size of the window, but with a small trend for
smaller sizes to be better. For French and Portuguese, PPMI–thresh is only the best model
for the minimal window size, and there is a large gap in performance for PPMI–thresh as
window size increases, such that for larger windows it is outperformed by other models.
36 For PPMI–SVD with WINDOWSIZE=8+8, a context word at distance D from its target word is weighted (8 − D)/8. For glove, the decay happens much faster, with a weight of 1/D, which allows the model to look farther away without being affected by potential noise introduced by distant contexts.
37 Henceforth, we omit results for EN-comp90 and Reddy, as they are included in EN-comp.
7.2 Dimension
When creating corpus-derived vectors with a DSM, the question is whether additional
dimensions can be informative in compositionality prediction. Our hypothesis is that
the larger the number of dimensions, the more precise the representations, and the more
accurate the compositionality prediction.
The results shown in Figure 6 for each of the comparable data sets confirm this trend
in the case of the best DSMs: w2v and PPMI–thresh. Moreover, the effect of changing the
vector dimensions for the best models seems to be consistent across these languages.
The results for PPMI–SVD, lexvec, and glove are more varied, but they are never among
Figure 6
Best results for each DSM and DIMENSION, using BF1 for Farahmand data set, and Spearman ρ
for all the other data sets. For English, the thin bars indicate results using fallback. Differences
between two highest Spearman correlations for each model are statistically significant (p < 0.05),
except for PPMI–SVD for FR-comp, according to Wilcoxon’s sign-rank test.
the best models for compositionality prediction in any of the languages.38 All differences
between the two highest Spearman correlations are statistically significant (p < 0.05),
with the exception of PPMI–SVD for FR-comp, according to Wilcoxon’s sign-rank test.
7.3 Corpus Preprocessing

In related work, DSMs are constructed from corpora with various levels of preprocessing (Bullinaria and Levy 2012; Mikolov et al. 2013; Pennington, Socher, and Manning 2014; Kiela and Clark 2014; Levy, Goldberg, and Dagan 2015; Salle, Villavicencio, and Idiart 2016). In this work, we compare four levels, WORDFORM = surface+ , surface, lemmaPoS , and lemma, described in Section 5.1, corresponding to decreasing amounts of information. Testing different varieties of corpus preprocessing allows
us to explore the trade-off between informational content and the statistical significance
related to data sparsity for compositionality prediction.
Figure 7 presents the impact of different types of corpus preprocessing on the
quality of compositionality prediction. In EN-comp, all differences between the two
highest Spearman values for each DSM were significant, according to Wilcoxon’s sign-
rank test, except for PPMI–thresh, whereas in FR-comp and PT-comp they were significant
only for PPMI–TopK and lexvec. However, note that the top two results are often both
obtained on representations based on lemmas. If we compare the highest lemma-based
result with the highest surface-based result for the same DSM, we find a statistically
significant difference in every single case (p < 0.05).
When considering the results themselves, although the results for English are het-
erogeneous, for French and Portuguese, the lemma-based representations consistently
allow a better prediction of compositionality scores. This may be explained by the fact
that these two languages are morphologically richer than English, and lemma-based
representations reduce the sparsity in the data, allowing more information to be gath-
ered from the same amount of data. Moreover, adding POS information (lemmaPoS vs.
lemma) does not seem to bring consistent improvements that are statistically significant.
This suggests that words sharing the same lemma are semantically close enough that any gains from disambiguation are masked by the sparsity induced by a larger vocabulary. Finally, the impact of stopword removal is also inconclusive (surface vs. surface+ ),
considering the best models for each language.
7.4 Corpus Size

If we assume that the bigger the corpus, the better the DSM, this could explain why the
results for English are better than those for French and Portuguese, although it does not
explain why Portuguese is behind French.39 In this section, we examine the impact of
corpus size on prediction quality by incrementally increasing the amount of data used
to generate the DSMs while monitoring the Spearman correlation (ρ) with the human
annotations. We use only the best DSMs for these languages, PPMI–thresh and w2v–sg,
with the configurations that produced highest Spearman scores for each full corpus.
As expected, the results in Figure 8 show a smooth, roughly monotonic increase
of the ρ values with corpus size, for PPMI–thresh and w2v–sg for each language and
38 For PPMI–SVD and lexvec, this behavior might be related to the fact that both methods perform a
factorization of the PPMI matrix.
39 As the characteristics of Farahmand are different from the other data sets, in this analysis we only use the
other more comparable data sets.
Figure 7
Best results for each DSM and WORDFORM, using BF1 for Farahmand data set, and Spearman ρ
for all the other data sets. For English, the thin bars indicate results using fallback. In EN-comp all
differences between the two highest Spearman values for each DSM were significant, according
to Wilcoxon’s sign-rank test, except for PPMI–thresh, while in FR-comp and PT-comp they were
only significant for PPMI–TopK and lexvec.
data set.40 In all cases there is a clear saturation behavior, so that we can safely say that
after one billion tokens, the quality of the predictions reaches a plateau and additional
corpus fragments do not bring improvements. This suggests that differences in compo-
sitionality prediction performance for these languages cannot be totally explained by
differences in corpus sizes.
8. Composition Functions

Up to this point, the predicted compositionality scores for the compounds were calcu-
lated using a uniform function that assumes that each component contributes 50% to
40 For PPMI–thresh, eight different samplings of corpus fragments were performed (for a total of 800 DSMs
per language), with each y-axis data point presenting the average and standard deviation of the ρ
obtained from those samplings. For w2v–sg, since it is much more time-consuming, a single sampling
was used, and thus only one execution was performed for each datapoint (for a total of 100 DSMs per
language).
[Figure 8: Average Spearman ρ (±σ) as a function of corpus size, for PPMI–thresh (left) and w2v–sg (right).]
the meaning of the compound (pcuniform ). However, this might not accurately capture
a faithful representation of compounds whose meaning is more semantically related
to one of the components (e.g., crocodile tears, which is semantically closer to the
head tears; and night owl, which is semantically closer to the modifier night). As this
may have an impact on the success of compositionality prediction, in this section we
evaluate how different compositionality prediction functions model these compounds.
In particular, we proposed pcmaxsim (Section 4) for dynamically determining weights
that assign maximal similarity between the compound and each of its components.
We have also proposed pcgeom , which favors idiomatic readings through the geometric
mean of the similarities between a compound and its components. Our hypotheses are
that pcmaxsim will be better correlated with human scores for compositional and partly
compositional compounds, while pcgeom can better capture the semantics of idiomatic
ones (Section 8.1).
First, to verify whether other prediction functions improve results obtained for the
best pcuniform configurations reported up to now, we have evaluated every strategy on
all DSM configurations. Table 6 shows that the functions that combine both components
(columns pcuniform to pcarith ) generate better compositionality predictions than functions
that ignore one of the individual components (columns pchead and pcmod ). There is some
variation among the combined scores, with the best score indicated in bold. Every best
score is statistically different from all other scores in its row (p < 0.05). The results for
pcarith and pcuniform are very similar, reflecting their similar formulations.41
Here we focus on the issue of adjusting β in the compositionally constructed vector;
that is, we consider the use of pcmaxsim instead of pcuniform . This score seems to be
beneficial in the case of English (EN-comp), but not in the case of French or Portuguese.
41 The Pearson correlations (averaged across 7 DSMs) between pcarith and pcuniform are r = .972 for EN-comp,
r = .991 for FR-comp, and r = .969 for PT-comp, confirming their similar results.
Table 6
Spearman ρ for the proposed compositionality prediction scores, using the best DSM
configuration for each score.
Table 7 presents the best pcmaxsim model for each data set, along with the average
weights assigned to head and modifier for every compound in the data set. Before
analyzing the results in Table 7, we have to verify whether the data sets are balanced
for the influence of each component to the meaning of the whole, or if there is any
bias towards heads/modifiers. The influence of the head, estimated as the average of
hcH /(hcH + hcM ) over all compounds of a data set, is 0.50 for EN-comp, 0.52 for FR-
comp, and 0.52 for PT-comp. This indicates that the data sets are balanced in terms of
the influence of each component, and neither head nor modifier predominates as more
compositional or idiomatic than the other.
As for the average β weights in pcmaxsim, while the weights that maximize compositionality are fairly similar for EN-comp, they strongly favor the head for both FR-comp and PT-comp. This may be explained by the fact that, for the latter two, the modifiers are all adjectives, while EN-comp has mostly nouns as modifiers. Surprisingly, this seemingly more realistic weighting of the compound components for French and Portuguese is not reflected in better compositionality scores, and does not correspond to the average influence of modifiers in these data sets, estimated as 0.48 on average. One possible explanation could be that, in these cases, the adjectives contribute a specific, more idiomatic meaning that is not found in isolated occurrences of the adjective itself, as with FR beau (lit. beautiful), which is used in the translation of most in-law family members, for example FR beau-frère (lit. beautiful-brother ‘brother-in-law’). In the next section, we investigate which compounds are affected the most by these different scores.
Table 7
DSM and Spearman ρ of pcmaxsim, as well as the average weights for the head (β) and for the modifier (1 − β) on each data set.
[Figure 9: panels for PPMI-thresh and w2v-sg; y-axis: Improvement from uniform to maxsim.]
To better evaluate the effect of adjusting β for the individual compounds with respect
to the pcuniform score, we define the rank improvement as:
improvf(w1w2) = |rkuniform(w1w2) − rkhuman(w1w2)| − |rkf(w1w2) − rkhuman(w1w2)|,
where rk indicates the rank of the compound w1 w2 in the data set when ordered accord-
ing to pcuniform , human annotations hcHM , or the compositionality prediction function f .
For instance, when f = maxsim, positive improvmaxsim values indicate that pcmaxsim yields
a better approximation of the ranks assigned by hcHM than pcuniform , whereas negative
values indicate that pcuniform provides a better ranking.
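The rank-improvement measure can be computed directly from the three lists of scores. A minimal sketch, assuming parallel lists over the same compounds and using SciPy's rankdata to obtain the ranks:

    from scipy.stats import rankdata

    def rank_improvement(pc_uniform_scores, pc_f_scores, human_scores):
        # improv_f for each compound: how much closer the ranks under f are
        # to the human ranks than the ranks under pc_uniform.
        rk_uniform = rankdata(pc_uniform_scores)
        rk_f = rankdata(pc_f_scores)
        rk_human = rankdata(human_scores)
        return [abs(u - h) - abs(f - h)
                for u, f, h in zip(rk_uniform, rk_f, rk_human)]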
We perform a cross-lingual analysis, grouping the hcHM scores of the EN-comp,
FR-comp, and PT-comp into a unique data set (henceforth ALL-comp), containing 540
compounds. Figure 9 presents the values of rank improvement for the best PPMI–thresh
and w2v–sg configurations, ranked according to hcHM (rkhuman ): compounds that are
better predicted by pcmaxsim have positive rank movements (above the 0 line).42 The density of movement on either side of the 0 (no movement) line appears to be similar for both models, with pcmaxsim performing as well as pcuniform.
Figure 9 also marks the outlier compounds with the highest improvements (num-
bers from 1 to 8) and those with the lowest improvements (letters from A to H), and
Table 8 shows their improvement scores. In the case of these outliers, the adjustment
seems to be more beneficial to compositional compounds than to idiomatic cases. This
is confirmed by a linear regression of the movement of the 8+8 outliers as a function
of the compositionality scores hcHM , where we obtain a positive coefficient of r = 0.73
and r = 0.72 for PPMI–thresh and w2v–sg, respectively. There are more outlier com-
pounds for Portuguese and French (particularly the former), suggesting that pcmaxsim
has a stronger impact on those languages than on English. Moreover, some compounds
had a similar improvement under both DSMs, with, for example, high improvement
Table 8
Outlier compounds with extreme positive/negative improvmaxsim values. Example identifiers
correspond to numbers/letters shown in Figure 9.
ID improvmaxsim hcHM Compound
H −68 0.40 FR bras droit ‘most important helper/assistant’ (lit. arm right)
G −70 1.52 PT alta costura ‘haute couture’ (lit. high sewing)
F −71 3.66 PT carne vermelha ‘red meat’ (lit. meat red)
E −82 1.35 PT alto mar ‘high seas’ (lit. high sea)
D −85 1.10 PT mesa redonda ‘round table’ (lit. table round)
C −86 2.84 EN half sister
B −109 1.43 PT febre amarela ‘yellow fever’ (lit. fever yellow)
A −128 1.06 PT coração partido ‘broken heart’ (lit. heart broken)
for PT caixa forte literally box strong ‘safe’ and low improvement for PT coração partido
‘broken heart’. In addition, pcmaxsim also affected some equivalent compounds in differ-
ent languages, as in the case of PT caixa forte and FR coffre fort. Overall, pcmaxsim does not
present a considerable impact on the predictions, obtaining an average improvement of
improvmaxsim = +0.41 across all compounds in ALL-comp.
Figure 10 shows the same analysis for f = geom, showing the improvement score
of pcgeom over pcuniform . We hypothesized that pcgeom should more accurately represent
[Figure 10: panels for PPMI-thresh and w2v-sg; y-axis: Improvement from uniform to geom.]
idiomatic compounds. From the previous sections, we know that pcgeom has lower
performance than pcuniform when used to estimate the compositionality of the entire
data sets (cf. Table 6). This is confirmed by an average score of improvgeom = −7.87.
As in Figure 9, Figure 10 shows a random distribution of improvements. However, the
outliers have the opposite pattern, indicating that large reclassifications due to pcgeom
tend to favor idiomatic instead of compositional compounds. The linear regression of
the movement of the outliers as a function of the compositionality scores results in
r = −0.73 and r = −0.82 for PPMI–thresh and w2v–sg, respectively. These results confirm our hypothesis about the behavior of pcgeom.
Table 9 lists the outlier compounds indicated in Figure 10 along with their improve-
ment values. Here again, the majority of the outliers belong to PT-comp. Some of the
compounds that were found as outliers in pcmaxsim re-appear as outliers for pcgeom with
inverted polarity in the improvement score, such as the ranks predicted by PPMI–
thresh for PT prato feito literally plate made ‘blue-plate special’ (improvmaxsim = +58,
improvgeom = −234) and by w2v–sg for FR bras droit literally arm right ‘assistant’
(improvmaxsim = −68, improvgeom = +228). This suggests that, as future work, we should
consider combining both approaches into a single prediction that decides which score
to use for each compound as a function of pcuniform .
Table 9
Outlier compounds with extreme positive/negative improvgeom values. Example identifiers
correspond to numbers/letters shown in Figure 10.
ID improvgeom hcHM Compound
1 +228 0.40 FR bras droit ‘most important helper/assistant’ (lit. arm right)
2 +158 1.40 PT lua nova ‘new moon’ (lit. moon new)
3 +127 1.35 PT alto mar ‘high seas’ (lit. high sea)
4 +104 0.10 PT pé direito ‘ceiling height’ (lit. foot right)
5 +89 1.24 EN carpet bombing
6 +75 1.60 PT lista negra ‘black list’ (lit. list black)
7 +73 0.65 PT arma branca ‘cold weapon’ (lit. weapon white)
8 +72 3.32 EN search engine
Results from Section 3.2.2 show that the familiarity of compounds measured as fre-
quency in large corpora is associated with the compositionality scores assigned by
humans. We would like to know whether this correlation also holds true to system
predictions: Are the most frequent compounds being predicted as more compositional?
As expected, the rank correlation between frequency and pcuniform shows medium to
Table 10
Spearman ρ correlations between different variables. We consider the set of predicted scores
(pc), the set of human–prediction differences (diff), the compound frequencies (freq), and the
compound PMI. The predicted scores are the ones from the best configurations of each sub–data
set in ALL-comp. Correlations are indicated only when significant (p < 0.05).
strong correlation (see Table 10, column ρ[pc,freq]), though the level of correlation is somewhat DSM-dependent. These results are in line with the correlation observed between frequency and human scores, and with the high correlation between predicted and human scores.
Another hypothesis we test is whether frequent compounds are easier to model. A
first intuition would be that this hypothesis is true, as a higher number of occurrences is
associated with a larger amount of data, from which more representative vectors can
be built. To test this hypothesis, we define a compound’s difficulty as the difference
between the predicted score and the normalized human score, diff = |pc − (hcHM /5)|,
where high values indicate a compound whose compositionality is harder to predict.43
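The sketch below shows how this difficulty measure and its rank correlation with frequency can be computed, assuming predicted scores already normalized to [0, 1] and human scores on the 0–5 scale of the data sets.

    import numpy as np
    from scipy.stats import spearmanr

    def difficulty(pc_scores, human_scores):
        # High values = compounds whose compositionality is harder to predict.
        return np.abs(np.asarray(pc_scores) - np.asarray(human_scores) / 5.0)

    def freq_difficulty_correlation(pc_scores, human_scores, frequencies):
        rho, p = spearmanr(difficulty(pc_scores, human_scores), frequencies)
        return rho, p  # reported only when p < 0.05, as in Table 10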
We found a weak (though statistically significant) correlation between frequency
and difficulty for some of the DSMs (Table 10, column ρ[diff,freq]). They are mostly
positive, indicating that frequency is correlated with difficulty, which is a surprising
result, as it implies that the compositionality of rarer compounds was mildly easier to
predict for these systems, disproving the hypothesis above. These results either point to
an overall lack of correlation between frequency and difficulty, or indicate mild DSM-
specific behavior, which should be investigated in further research.
43 We linearly normalize predicted scores to be between 0 and 1. However, given that negative scores are rare in practice, unreported correlations with non-normalized pc are similar to the ones reported.
(Section 3.2.2) and as DSM predictions are strongly correlated with human judgments, these results indicate that our models capture more than conventionalization. They may also be a feature of this particular set of compounds, as even the compositional cases are conventional to some extent (e.g., white/?yellow wine). Therefore, further investigation of possible links between idiomaticity and conventionalization is needed.
We also calculated the correlation between PMI and the human–prediction differ-
ence (diff), to determine if DSMs build less precise vectors for less conventionalized
compounds (approximated as those with lower PMI). However, no statistically signifi-
cant correlation was found for most DSMs (Table 10, column ρ[diff, PMI]).
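For reference, the PMI estimate used as a proxy for conventionalization can be sketched as follows, assuming raw corpus counts for the compound and its components; the article's exact counting setup is described earlier.

    import math

    def pmi(count_compound, count_head, count_mod, n_tokens):
        # Pointwise mutual information (Church and Hanks 1990): log-ratio of
        # the observed co-occurrence probability to the probability expected
        # under independence of the two components.
        p_joint = count_compound / n_tokens
        p_indep = (count_head / n_tokens) * (count_mod / n_tokens)
        return math.log2(p_joint / p_indep)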
Table 11
Spearman’s ρ of best pcuniform models, separated into 3 ranges according to σHM and according to
hcHM , all with p < 0.05.
(from 0.63 to 0.66), suggesting that it may be harder to make fine-grained distinctions
(e.g., between two compositional compounds like access road and subway system) than to
make inter-range distinctions (e.g., between idiomatic and compositional compounds
like ivory tower and access road). However, further investigation would be needed to
verify this hypothesis.
10. Conclusions
44 The resulting data sets and framework implementation are freely available to the community.
45 As Farahmand is considerably different from the other data sets, a direct comparison is not possible.
for all languages. Other functions like pcmaxsim and pcgeom , which modify these scores to
account for different contributions of each component, produced at best similar results.
A deeper analysis of the predicted compositionality scores revealed that, similarly
to human-rated scores, familiarity measured as frequency was positively correlated
with predicted compositionality. In the case of conventionalization measured as PMI,
no correlation was found with human-rated scores and only a mild correlation was
found with some predicted scores, suggesting that our models capture more than
compound conventionalization, as they have a strong agreement with human scores.
Intra-compound standard deviation of human scores was also found to be related to predicted scores, indicating that DSMs have difficulties with the same compounds that humans found difficult. Moreover, predictions were found to be more accurate for compositional compounds.
Although many questions regarding compositionality remain open, we believe that the results presented here significantly advance its understanding and computational modeling. Furthermore, the proposed framework opens important avenues of research that are ready to be pursued. First, the role of morphological inflection could be clarified by extending this investigation to more highly inflected languages, such as Turkish. Moreover, other categories of MWEs, such as verb+noun expressions, should be evaluated to determine the interplay between compositionality prediction and the syntactic flexibility of MWEs. The ultimate test would be to use predicted compositionality scores in downstream applications and tasks involving some degree of semantic processing, ranging from MWE identification to parsing and word-sense disambiguation. In particular, it would be interesting to predict compositionality in context, in order to distinguish idiomatic from literal usages in sentences.
Appendix A. Glossary
The number of possible DSM configurations grows exponentially with the number of internal variables in a DSM, precluding an exhaustive search over every possible parameter. In this article, we have evaluated the set of variables that are most often manually tuned in the literature, but a reasonable question is whether these results can be further improved through the modification of other, often-ignored model-specific parameters. We thus perform some sanity checks through a local search of such parameters around the highest-Spearman configuration of each DSM.
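The sketch below outlines this sanity check: starting from the highest-Spearman configuration, one normally ignored parameter is varied at a time and the model is re-evaluated. The functions train_dsm and evaluate_spearman are placeholders for the training and evaluation pipeline, not functions from the article.

    def local_search(best_config, extra_params, train_dsm, evaluate_spearman):
        # Vary one parameter at a time around the best known configuration.
        results = {}
        for name, values in extra_params.items():
            for value in values:
                config = dict(best_config, **{name: value})
                model = train_dsm(config)
                results[(name, value)] = evaluate_spearman(model)
        return results

    # Example probe: iteration counts and minimum word-count thresholds.
    # local_search(best, {"iterations": [15, 100], "min_count": [5, 50]}, ...)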
Some of the DSMs considered in this article are iterative: they re-read and re-process the same corpus multiple times. For those DSMs, we present the results of running their best configuration with a higher number of iterations. This higher
Table 12
Results using a higher number of iterations.
number of iterations is inspired by models in the literature, where, for example, the number of glove iterations can be as high as 50 (Salle, Villavicencio, and Idiart 2016) or even 100 (Pennington, Socher, and Manning 2014). The intuition is that
most models will lose some information (due to their probabilistic sampling), which
could be regained at the cost of a higher number of iterations.
Table 12 presents a comparison between the baseline ρ for 15 iterations and the
ρ obtained when 100 iterations are performed. For all DSMs, we see that the increase
in the number of iterations does not improve the quality of the vectors, with the
relatively small number of 15 iterations yielding better results. This may suggest that
a small number of iterations can already sample enough distributional information,
with further iterations accruing additional noise from low-frequency words. The extra iterations could also cause the DSM to overfit particularities of the corpus, which would reduce the quality of the underlying vectors. Given the extra cost of running more iterations,46 we refrained from building further models with as many iterations in the rest of the article.
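For illustration only (the toolkits actually used are described in Section 5), a comparable iteration experiment can be reproduced with gensim's word2vec implementation (version 4.x; sg=1 selects skip-gram), assuming a tokenized corpus:

    from gensim.models import Word2Vec

    corpus = [["red", "wine", "with", "dinner"], ["nut", "case"]]  # toy stand-in

    base = Word2Vec(corpus, sg=1, vector_size=100, window=4, min_count=1, epochs=15)
    more = Word2Vec(corpus, sg=1, vector_size=100, window=4, min_count=1, epochs=100)
    # Compare the two models' compositionality predictions (Spearman rho),
    # as in Table 12.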
Minimum-count thresholds are often neglected in the literature, where a default value of 0, 1, or 5 is presumably used by most authors. An exception to this trend is the threshold of 100 occurrences used by Levy, Goldberg, and Dagan (2015), whose toolkit we use in PPMI–SVD. We found no explicit justification for this higher word-count threshold. A reasonable hypothesis would be that higher thresholds improve the quality of the data, as they filter rare words more aggressively.
Table 13
Results for a higher minimum threshold of word count.
Table 13 presents the result from the highest-Spearman configurations along with
the results for an identical configuration with a higher occurrence threshold of 50.47 The
results unanimously agree that a higher threshold does not contribute to the removal of
any extra noise. In particular, for PPMI–SVD, it seems to discard enough useful informa-
tion to considerably reduce the quality of the compositionality prediction measure. The
results strongly contradict the default configuration used for PPMI–SVD, suggesting
that a lower word-count threshold might yield better results for this task.
For many models, the best window size found was either WINDOWSIZE = 1+1 or WINDOWSIZE = 4+4 (see Section 7.1). It is possible that a higher score could be obtained by a configuration in between. While a full exhaustive search would be the ideal solution, an initial approximation of the best 2+2 configuration can be obtained by re-running the experiments on the highest-Spearman configurations, with the window size replaced by 2+2.
Results shown in Table 14 for a window size of 2+2 are consistently worse than the base model, indicating that the optimal configuration is likely the one obtained with a window size of 1+1 or 4+4. This is further confirmed by the fact that most DSMs had their best configuration with a window size of 1+1 or 8+8, with few cases of 4+4 as the best model, which suggests that the quality of most configurations in the space of models is either monotonically increasing or decreasing with regard to window size, thus favoring the configurations with more extreme WINDOWSIZE values.
47 The threshold used for ρbase depends on the DSM, and is described in Section 5.2.
Table 14
Results using a window of size 2+2.
As seen in Section 7.2, some DSMs obtain better results when moving from 250 to 500
dimensions, and this trend continues when moving to 750 dimensions. This behavior
is notably stronger for PPMI–thresh, which suggests that an even higher number of
dimensions could have better predictive power.
Table 15 presents the results of running PPMI–thresh for increasing values of the DIMENSION parameter. The baseline configuration (indicated as ? in Table 15) was the highest-scoring configuration found in Section 7.2: lemmaPoS.W1.d750 for PT-comp and FR-comp, and surface.W8.d750 for Reddy. As seen in Section 7.2, results for 250 and 500 dimensions have lower scores than the results for 750 dimensions. Results for 1,000 dimensions were mixed: they are slightly worse for FR-comp and EN-comp, and slightly better for PT-comp. Increasing the number of dimensions further generates models that are progressively worse. These results suggest that the maximum vector quality is achieved between 750 and 1,000 dimensions.
The word vectors generated by the glove and w2v models have some level of non-
determinism caused by random initialization and random sampling techniques. A
reasonable concern would be whether the results presented for different parameter
variations are close enough to the scores obtained by an average model. To assess the
variability of these models, we evaluated three different runs of every DSM configu-
ration (the original execution ρ1 , used elsewhere in this article, along with two other
executions ρ2 and ρ3 ) for glove, w2v–cbow, and w2v–sg. We then calculate the average
ρavg of these three executions for every model.
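A minimal sketch of this stability check, assuming the human scores and the three runs' predictions as parallel lists:

    import statistics
    from scipy.stats import spearmanr

    def run_stability(human_scores, predictions_per_run):
        # One Spearman rho per run (rho_1, rho_2, rho_3), then their mean
        # and sample standard deviation.
        rhos = []
        for preds in predictions_per_run:
            rho, _ = spearmanr(human_scores, preds)
            rhos.append(rho)
        return statistics.mean(rhos), statistics.stdev(rhos)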
Table 15
Results for higher numbers of dimensions (PPMI–thresh).
Table 16 reports the highest-Spearman configurations of ρavg for the Reddy and
EN-comp data sets. When comparing ρavg to the results of the original execution ρ1 , we
see that the variability in the different executions of the same configuration is minimal.
This is further confirmed by the low sample standard deviation48 obtained from the
scores of the three executions. Given the high stability of these models, results in the
rest of the article were calculated and reported as ρ1 for all data sets.
Along with the verification of parameters, we also evaluate whether data set variations
could yield better results. In particular, we consider the use of filtering techniques,
which are used in the literature as a method of guaranteeing data set quality. As
per Roller, Schulte im Walde, and Scheible (2013), we consider two strategies of data
removal: (1) removing individual outlier compositionality judgments through z-score
48 The low standard deviation is not a unique property of high-ranking configurations: The average of
deviations for all models was .004 for EN-comp and .006 for Reddy.
Table 16
Configurations with highest ρavg for nondeterministic models.
Data set  DSM       Configuration        ρ1    ρ2    ρ3    ρavg  σ
Reddy     glove     lemmaPoS.W8.d250    .759  .760  .753  .757  .004
Reddy     w2v–cbow  surface.W1.d500     .796  .807  .799  .801  .006
Reddy     w2v–sg    surface.W1.d750     .812  .788  .812  .804  .014
EN-comp   glove     lemmaPoS.W8.d500    .651  .646  .650  .649  .003
EN-comp   w2v–cbow  surface+.W1.d750    .730  .732  .728  .730  .002
EN-comp   w2v–sg    surface+.W1.d750    .741  .732  .721  .731  .010
Table 17
Intrinsic quality measures for the raw and filtered data sets.
filtering; and (2) removing all annotations from outlier human judges. A compositionality judgment is considered an outlier if it stands more than z standard deviations away from the mean; a human judge is deemed an outlier if their Spearman correlation with the average of the other judges (ρoth) is lower than a given threshold R.49 These methods allow us to remove accidentally erroneous annotations, as well as annotators whose responses deviated too much from the mean (in particular, spammers and non-native speakers).
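A minimal sketch of the two strategies, assuming a score matrix with one row per judge and one column per compound (no missing values, for simplicity); the thresholds follow footnote 49.

    import statistics
    from scipy.stats import spearmanr

    def zscore_filter(scores, z=2.2):
        # Drop individual judgments more than z standard deviations from the
        # mean of the judgments for the same compound.
        mean, sd = statistics.mean(scores), statistics.stdev(scores)
        return [s for s in scores if sd == 0 or abs(s - mean) <= z * sd]

    def outlier_judges(matrix, r_threshold=0.5):
        # A judge is an outlier if their Spearman correlation with the average
        # of the other judges (rho_oth) falls below r_threshold.
        n_judges, n_compounds = len(matrix), len(matrix[0])
        outliers = []
        for j in range(n_judges):
            avg_others = [statistics.mean(matrix[k][c]
                                          for k in range(n_judges) if k != j)
                          for c in range(n_compounds)]
            rho, _ = spearmanr(matrix[j], avg_others)
            if rho < r_threshold:
                outliers.append(j)
        return outliers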
Table 17 presents the evaluation of raw and filtered data sets regarding two quality measures: the average of the standard deviations for all NCs (σ), and the proportion of NCs in the data set whose standard deviation is higher than 1.5 (Pσ>1.5), as per Reddy, McCarthy, and Manandhar (2011). The results suggest that filtering techniques can
improve the overall quality of the data sets, as seen in the reduction of the proportion of
NCs with high standard deviation, as well as in the reduction of the average standard
deviation itself. We additionally present the data retention rate (DRR), which is the
proportion of NCs that remained in the data set after filtering. While the DRR does
indicate a reduction in the amount of data, this reduction may be considered acceptable
in light of the improvement suggested by the quality measures.
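The intrinsic measures and the DRR are straightforward to compute; a sketch assuming a mapping from each NC to its list of human scores:

    import statistics

    def quality_measures(judgments):
        sds = [statistics.stdev(scores) for scores in judgments.values()]
        avg_sd = statistics.mean(sds)                    # sigma
        p_high = sum(sd > 1.5 for sd in sds) / len(sds)  # P(sigma > 1.5)
        return avg_sd, p_high

    def data_retention_rate(n_before, n_after):
        # DRR: proportion of NCs that remain in the data set after filtering.
        return n_after / n_before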
On a more detailed analysis, we verified that the improvement in these quality measures is heavily tied to the use of z-score filtering, with similar results obtained when it is applied alone. The application of R-filtering by itself, on the other hand, did not show any noticeable improvement in the quality measures for reasonable amounts of DRR. This is the opposite of what was found by Roller, Schulte im Walde, and
49 The judgment threshold we adopted was z = 2.2 for EN-comp90 , z = 2.2 for PT-comp, and z = 2.5 for
FR-comp. The human judge threshold was R = 0.5.
Table 18
Extrinsic quality measures for the raw and filtered data sets.
Scheible (2013) on their German data set, where only R-filtering was found to improve
results under these quality measures. We present our findings in more detail in Ramisch,
Cordeiro, and Villavicencio (2016).
We then consider whether filtering can have an impact on the performance of
predicted compositionality scores. For each of the 228 model configurations that were
constructed for each language, we launched an evaluation on the filtered EN-comp90 , FR-
comp, and PT-comp data sets (using z-score filtering only, as it was responsible for most
of the improvement in quality measures). Overall, no improvement was observed in the
results of the prediction (values of Spearman ρ) when we compare raw and filtered data
sets. Looking more specifically at the best configurations for each DSM (see Table 18), we
can see that most results do not significantly change when the evaluation is performed
on the raw or the filtered data sets. This suggests that the number of judgments collected for each compound largely offsets any irregularity caused by outliers, making the use of filtering techniques superfluous.
Appendix C. Questionnaire
The questionnaire was structured in five subtasks, presented to the annotators through
these instructions:
Figure 11
Evaluating compositionality of a compound regarding its head.
We present below the 90 nominal compounds in EN-comp90 and the 100 nominal compounds in EN-compExt, along with their human-rated compositionality scores. We refer to Reddy, McCarthy, and Manandhar (2011) for the remaining 90 compounds belonging to Reddy, which, together with the former two sets, amount to 280 nominal compounds in total.
We present below the 180 nominal compounds in FR-comp, along with their human-
rated compositionality scores.
We present below the 180 nominal compounds in PT-comp, along with their human-
rated compositionality scores.
References

Bride, Antoine, Tim Van de Cruys, and Nicholas Asher. 2015. A generalisation of lexical functions for composition in distributional semantics. In Association for Computational Linguistics (1), pages 281–291.

Bullinaria, John A., and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907.

Camacho-Collados, José, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A framework for the construction of monolingual and cross-lingual word similarity datasets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 1–7, Beijing.

Cap, Fabienne, Manju Nirmal, Marion Weller, and Sabine Schulte im Walde. 2015. How to account for idiomatic German support verb constructions in statistical machine translation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 19–28, Association for Computational Linguistics, Denver.

Carpuat, Marine, and Mona Diab. 2010. Task-based evaluation of multiword expressions: A pilot study in statistical machine translation. In Proceedings of NAACL/HLT 2010, pages 242–245, Los Angeles.

Church, Kenneth Ward, and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Cohen, Jacob. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Constant, Mathieu, Gülşen Eryiğit, Johanna Monti, Lonneke Van Der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.

Cordeiro, Silvio, Carlos Ramisch, Marco Idiart, and Aline Villavicencio. 2016. Predicting the compositionality of nominal compounds: Giving word embeddings a hard time. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1986–1997, Berlin.

Cordeiro, Silvio, Carlos Ramisch, and Aline Villavicencio. 2016. mwetoolkit+sem: Integrating word embeddings in the mwetoolkit for semantic MWE processing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1221–1225, European Language Resources Association (ELRA), Paris.

Curran, James R., and Marc Moens. 2002. Scaling context space. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 231–238.

Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Evert, Stefan. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany.

Farahmand, Meghdad, Aaron Smith, and Joakim Nivre. 2015. A multiword expression data set: Annotating non-compositionality and conventionalization for English noun compounds. In Proceedings of the 11th Workshop on Multiword Expressions, pages 29–33, Association for Computational Linguistics, Denver.

Fazly, Afsaneh, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.

Ferret, Olivier. 2013. Identifying bad semantic neighbors for improving distributional thesauri. In Association for Computational Linguistics (1), pages 561–571.

Finlayson, Mark, and Nidhi Kulkarni. 2011. Detecting multi-word expressions improves word sense disambiguation. In Proceedings of the Association for Computational Linguistics 2011 Workshop on MWEs, pages 20–24, Portland, OR.

Firth, John R. 1957. A synopsis of linguistic theory, 1930–1955. In F. R. Palmer, ed., Selected Papers of J. R. Firth, pages 168–205, Longman, London.

Fleiss, Joseph L., and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613–619.

Frege, Gottlob. 1892/1960. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50. Translated as ‘On Sense and Reference’ by Max Black.

Freitag, Dayne, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, and Zhiqiang Wang. 2005. New experiments in distributional representations of synonymy. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 25–32.

Girju, Roxana, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics of noun compounds. Computer Speech & Language, 19(4):479–496.

Goldberg, Adele E. 2015. Compositionality, Chapter 24. Routledge, Amsterdam.

Guevara, Emiliano. 2011. Computing semantic compositionality in distributional semantics. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS ’11, pages 135–144, Association for Computational Linguistics, Stroudsburg, PA.

Harris, Zellig. 1954. Distributional structure. Word, 10:146–162.

Hartung, Matthias, Fabian Kaupmann, Soufian Jebbara, and Philipp Cimiano. 2017. Learning compositionality functions on word embeddings for modelling attribute meaning in adjective-noun phrases. In Proceedings of the 15th Meeting of the European Chapter of the Association for Computational Linguistics (Volume 1), pages 54–64.

Hendrickx, Iris, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Proceedings of *SEM 2013 (Volume 2 — SemEval), pages 138–143, Association for Computational Linguistics.

Hwang, Jena D., Archna Bhatia, Clare Bonial, Aous Mansouri, Ashwini Vaidya, Nianwen Xue, and Martha Palmer. 2010. PropBank annotation of multilingual light verb constructions. In Proceedings of the LAW 2010, pages 82–90, Association for Computational Linguistics.

Jagfeld, Glorianna, and Lonneke van der Plas. 2015. Towards a better semantic role labelling of complex predicates. In Proceedings of the NAACL Student Research Workshop, pages 33–39, Denver.

Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing, 2nd Edition. Prentice-Hall, Inc., Upper Saddle River, NJ.

Kiela, Douwe, and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, pages 21–30.

Köper, Maximilian, and Sabine Schulte im Walde. 2016. Distinguishing literal and non-literal usage of German particle verbs. In HLT-NAACL, pages 353–362.

Kruszewski, Germán, and Marco Baroni. 2014. Dead parrots make bad pets: Exploring modifier effects in noun phrases. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics, *SEM@COLING 2014, pages 171–181, The *SEM 2014 Organizing Committee, Dublin.

Landauer, Thomas K., Peter W. Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284.

Lapesa, Gabriella, and Stefan Evert. 2014. A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2:531–545.

Lapesa, Gabriella, and Stefan Evert. 2017. Large-scale evaluation of dependency-based DSMs: Are they worth the effort? In EACL 2017, pages 394–400.

Lauer, Mark. 1995. How much is enough?: Data requirements for statistical NLP. CoRR, abs/cmp-lg/9509001.

Levy, Omer, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Lin, Dekang. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (Volume 2), pages 768–774.

Lin, Dekang. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324.

McCarthy, Diana, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the Association for Computational Linguistics 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 73–80, Association for Computational Linguistics, Sapporo, Japan.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Salehi, Bahar, Paul Cook, and Timothy Baldwin. 2014. Using distributional similarity of multi-way translations to predict multiword expression compositionality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 472–481, Gothenburg, Sweden.

Salehi, Bahar, Paul Cook, and Timothy Baldwin. 2015. A word embedding approach to predicting the compositionality of multiword expressions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–983, Denver.

Salehi, Bahar, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015. The impact of multiword expression compositionality on machine translation evaluation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 54–59, Association for Computational Linguistics, Denver.

Salle, Alexandre, Aline Villavicencio, and Marco Idiart. 2016. Matrix factorization using window sampling and negative sampling for improved word representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 419–424, Berlin.

Schmid, Helmut. 1995. TreeTagger—a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43:28.

Schneider, Nathan, Dirk Hovy, Anders Johannsen, and Marine Carpuat. 2016. SemEval 2016 task 10: Detecting minimal semantic units and their meanings (DiMSUM). In Proceedings of SemEval, pages 546–559, San Diego.

Schone, Patrick, and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of Empirical Methods in Natural Language Processing, pages 100–108, Pittsburgh.

Schulte im Walde, Sabine, Anna Hätty, Stefan Bott, and Nana Khvtisavrishvili. 2016. GhoSt-NN: A representative gold standard of German noun-noun compounds. In Proceedings of the Conference on Language Resources and Evaluation, pages 2285–2292.

Schulte im Walde, Sabine, Stefan Müller, and Stefan Roller. 2013. Exploring vector space models to predict the compositionality of German noun-noun compounds. In Proceedings of *SEM 2013 (Volume 1), pages 255–265, Association for Computational Linguistics.

Socher, Richard, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Stymne, Sara, Nicola Cancedda, and Lars Ahrenberg. 2013. Generation of compound words in statistical machine translation into compounding languages. Computational Linguistics, 39(4):1067–1108.

Tsvetkov, Yulia, and Shuly Wintner. 2012. Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering, 18(04):549–573.

Turney, Peter D., and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Van de Cruys, Tim, Laura Rimell, Thierry Poibeau, and Anna Korhonen. 2012. Multiway tensor factorization for unsupervised lexical acquisition. In COLING 2012, pages 2703–2720.

Yazdani, Majid, Meghdad Farahmand, and James Henderson. 2015. Learning semantic composition to detect non-compositionality of multiword expressions. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1733–1742, Association for Computational Linguistics, Lisbon.