Unsupervised Learning of Sentence Embeddings Using Compositional N-Gram Features
While very useful semantic representations are available for words, it remains challenging to produce and learn such semantic embeddings for longer pieces of text, such as sentences, paragraphs or entire documents. Even more so, it remains a key goal to learn such general-purpose representations in an unsupervised way.

Currently, two contrary research trends have emerged in text understanding: On one hand, a strong trend in deep-learning

• Model. We propose Sent2Vec, a simple unsupervised model that composes sentence embeddings from word vectors along with n-gram embeddings, simultaneously training the composition and the embedding vectors themselves.

• Scalability. The computational complexity of our embeddings is only O(1) vector operations per word processed, both during training and inference of the sentence embeddings. This strongly contrasts all neural network based approaches, and allows our model to learn from extremely large datasets, which is a crucial advantage in the unsupervised setting.

• Performance. Our method shows significant performance improvements compared to the current state-of-the-art unsupervised and even semi-supervised models. The resulting general-purpose embeddings show strong robustness when transferred to a wide range of prediction benchmarks.

* Equal contribution. ¹Iprova SA, Switzerland. ²Computer and Communication Sciences, EPFL, Switzerland. Correspondence to: Martin Jaggi <[email protected]>.

¹ All our code and pre-trained models are publicly available on http://github.com/epfml/sent2vec.
2. Model

Our model is inspired by simple matrix factor models (bilinear models) such as recently very successfully used in unsupervised learning of word embeddings (Mikolov et al., 2013b;a; Pennington et al., 2014; Bojanowski et al., 2017) as well as in supervised sentence classification (Joulin et al., 2017). More precisely, these models are formalized as an optimization problem of the form

\min_{U,V} \sum_{S \in C} f_S(U V \iota_S)    (1)

for two parameter matrices U ∈ R^{k×h} and V ∈ R^{h×|V|}, where V denotes the vocabulary. In all models studied, the columns of the matrix V will collect the learned word vectors, having h dimensions. For a given sentence S, which can be of arbitrary length, the indicator vector \iota_S ∈ {0,1}^{|V|} is a binary vector encoding S (bag-of-words encoding).

Fixed-length context windows S running over the corpus are used in word embedding methods as in C-BOW (Mikolov et al., 2013b;a) and GloVe (Pennington et al., 2014). Here we have k = |V| and each cost function f_S : R^k → R only depends on a single row of its input, describing the observed target word for the given fixed-length context S. In contrast, for sentence embeddings, which are the focus of our paper here, S will be entire sentences or documents (therefore of variable length). This property is shared with the supervised FastText classifier (Joulin et al., 2017), which however uses a softmax with k ≪ |V| being the number of class labels.

2.1. Proposed Unsupervised Model

We propose a new unsupervised model, Sent2Vec, for learning universal sentence embeddings. Conceptually, the model can be interpreted as a natural extension of the word-contexts from C-BOW (Mikolov et al., 2013b;a) to a larger sentence context, with the sentence words being specifically optimized towards additive combination over the sentence, by means of the unsupervised objective function.

Formally, we learn a source embedding v_w and a target embedding u_w for each word w in the vocabulary, with embedding dimension h and k = |V| as in (1). The sentence embedding is defined as the average of the source word embeddings of its constituent words, as in (2). We augment this model furthermore by also learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with the words, i.e., the sentence embedding v_S for S is modeled as

v_S := \frac{1}{|R(S)|} V \iota_{R(S)} = \frac{1}{|R(S)|} \sum_{w \in R(S)} v_w    (2)

where R(S) is the list of n-grams (including unigrams) present in sentence S.
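To make the composition in (2) concrete, the following Python sketch extracts R(S) and averages the corresponding source vectors. It is illustrative only: the dictionary lookup src_emb and the choice of bigrams as the highest-order n-grams are our own assumptions, not the behaviour of the reference C++ implementation.

```python
import numpy as np

def ngrams(tokens, n_max=2):
    """Return R(S): all unigrams plus the n-grams (here up to bigrams) of a tokenized sentence."""
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def sentence_embedding(tokens, src_emb, dim, n_max=2):
    """Average the source embeddings of all n-grams in R(S), as in Eq. (2)."""
    grams = [g for g in ngrams(tokens, n_max) if g in src_emb]
    if not grams:
        return np.zeros(dim)
    return np.mean([src_emb[g] for g in grams], axis=0)

# toy usage with random 5-dimensional source vectors
rng = np.random.default_rng(0)
src_emb = {g: rng.normal(size=5) for g in ["the", "cat", "sat", "the cat", "cat sat"]}
v_S = sentence_embedding(["the", "cat", "sat"], src_emb, dim=5)
```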
In order to predict a missing word from the context, our objective models the softmax output approximated by negative sampling, following (Mikolov et al., 2013b). For the large number of output classes |V| to be predicted, negative sampling is known to significantly improve training efficiency, see also (Goldberg & Levy, 2014). Given the binary logistic loss function \ell : x ↦ \log(1 + e^{-x}) coupled with negative sampling, our unsupervised training objective is formulated as follows:

\min_{U,V} \sum_{S \in C} \sum_{w_t \in S} \Big( \ell\big(u_{w_t}^\top v_{S \setminus \{w_t\}}\big) + \sum_{w' \in N_{w_t}} \ell\big(-u_{w'}^\top v_{S \setminus \{w_t\}}\big) \Big)

where S corresponds to the current sentence and N_{w_t} is the set of words sampled negatively for the word w_t ∈ S. The negatives are sampled² following a multinomial distribution where each word w is associated with the probability q_n(w) := \sqrt{f_w} / \sum_{w_i \in V} \sqrt{f_{w_i}}, where f_w is the normalized frequency of w in the corpus.

To select the possible target unigrams (positives), we use subsampling as in (Joulin et al., 2017; Bojanowski et al., 2017), each word w being discarded with probability 1 - q_p(w), where q_p(w) := \min\{1, \sqrt{t/f_w} + t/f_w\} and t is the subsampling hyper-parameter. Subsampling prevents very frequent words from having too much influence in the learning, as they would introduce strong biases in the prediction task.
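The two sampling distributions translate directly into code. A minimal sketch, assuming freq maps each word to its normalized corpus frequency f_w (the function names and the default t = 1e-5 are ours):

```python
import numpy as np

def negative_sampling_probs(freq):
    """q_n(w): proportional to sqrt(f_w), normalized over the vocabulary."""
    words = list(freq)
    weights = np.sqrt(np.array([freq[w] for w in words]))
    return dict(zip(words, weights / weights.sum()))

def keep_prob(f_w, t=1e-5):
    """q_p(w) = min(1, sqrt(t / f_w) + t / f_w); a word is kept as a target with this probability."""
    return min(1.0, np.sqrt(t / f_w) + t / f_w)

freq = {"the": 0.05, "cat": 0.001, "sat": 0.0005}
q_n = negative_sampling_probs(freq)   # negative-sampling distribution over the toy vocabulary
p_keep = keep_prob(freq["the"])       # frequent words are discarded more often as targets
```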
With positives subsampling and respecting the negative sampling distribution, the precise training objective function becomes

\min_{U,V} \sum_{S \in C} \sum_{w_t \in S} \Big( q_p(w_t)\, \ell\big(u_{w_t}^\top v_{S \setminus \{w_t\}}\big) + |N_{w_t}| \sum_{w' \in V} q_n(w')\, \ell\big(-u_{w'}^\top v_{S \setminus \{w_t\}}\big) \Big)    (3)

² To efficiently sample negatives, a pre-processing table is constructed, containing each word in proportion to the square root of its corpus frequency. The negatives N_{w_t} are then sampled uniformly at random from this table, excluding the target w_t itself, following (Joulin et al., 2017; Bojanowski et al., 2017).
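To illustrate why one update is so cheap, here is a deliberately simplified single SGD step on the sampled version of (3). It is a sketch under our own assumptions: the q_p/q_n weighting, the linearly decaying learning rate and the negatives pre-processing table are omitted, and U, V are plain dictionaries of NumPy vectors rather than the parameter matrices of the C++ implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(U, V, grams, target, negatives, lr):
    """One SGD update for a single (sentence, target word) pair.
    grams: keys of R(S \\ {w_t}); target: w_t; negatives: sampled N_{w_t}."""
    v_ctx = np.mean([V[g] for g in grams], axis=0)   # v_{S \ {w_t}}
    grad_ctx = np.zeros_like(v_ctx)
    # positive term: l(u_t . v)
    g_pos = -sigmoid(-U[target] @ v_ctx)
    grad_ctx += g_pos * U[target]
    U[target] = U[target] - lr * g_pos * v_ctx
    # negative terms: l(-u_n . v)
    for n in negatives:
        g_neg = sigmoid(U[n] @ v_ctx)
        grad_ctx += g_neg * U[n]
        U[n] = U[n] - lr * g_neg * v_ctx
    # the context gradient is shared equally by the averaged source vectors
    for g in grams:
        V[g] = V[g] - lr * grad_ctx / len(grams)
```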
2.2. Computational Efficiency

In contrast to more complex neural network based models, one of the core advantages of the proposed technique is the low computational cost for both inference and training. Given a sentence S and a trained model, computing the sentence representation v_S only requires |S| · h floating point operations (or |R(S)| · h to be precise for the n-gram case, see (2)), where h is the embedding dimension. The same holds for the cost of training with SGD on the objective (3), per sentence seen in the training corpus. Due to the simplicity of the model, parallel training is straightforward using parallelized or distributed SGD.

2.3. Comparison to C-BOW

C-BOW (Mikolov et al., 2013b;a) tries to predict a chosen target word given its fixed-size context window, the context being defined by the average of the vectors associated with the words at a distance less than the window size hyper-parameter ws. While our system, when restricted to unigram features, can be seen as an extension of C-BOW in which the context window includes the entire sentence, in practice there are a few important differences, as C-BOW uses important tricks to facilitate the learning of word embeddings. C-BOW first uses frequent word subsampling on the sentences, deciding to discard each token w with probability q_p(w) or similar (small variations exist across implementations). Subsampling prevents the generation of n-gram features, and deprives the sentence of an important part of its syntactical features. It also shortens the distance between subsampled words, implicitly increasing the span of the context window. A second trick consists of using dynamic context windows: for each subsampled word w, the size of its associated context window is sampled uniformly between 1 and ws. Using dynamic context windows is equivalent to weighing by the distance from the focus word w divided by the window size (Levy et al., 2015). This makes the prediction task local, and goes against our objective of creating sentence embeddings, as we want to learn how to compose all n-gram features present in a sentence. In the results section, we report a significant improvement of our method over C-BOW.

2.4. Model Training

Three different datasets have been used to train our models: the Toronto book corpus³, Wikipedia sentences and tweets. The Wikipedia and Toronto books sentences have been tokenized using the Stanford NLP library (Manning et al., 2014), while for tweets we used the NLTK tweets tokenizer (Bird et al., 2009). For training, we select a sentence randomly from the dataset and then proceed to select all the possible target unigrams using subsampling. We update the weights using SGD with a linearly decaying learning rate.

³ http://www.cs.toronto.edu/~mbweb/

Also, to prevent overfitting, for each sentence we use dropout on its list of n-grams R(S) \ {U(S)}, where U(S) is the set of all unigrams contained in sentence S. After empirically trying multiple dropout schemes, we find that dropping K n-grams (n > 1) for each sentence gives superior results compared to dropping each token with some fixed probability. This dropout mechanism would negatively impact shorter sentences. The regularization can be pushed further by applying L1 regularization to the word vectors. Encouraging sparsity in the embedding vectors is particularly beneficial for high dimension h. The additional soft thresholding in every SGD step adds negligible computational cost. See also Appendix B.
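A small sketch of the n-gram dropout just described, i.e. dropping K of the higher-order n-grams while always keeping the unigrams. It is our own illustration; the integration with the FastText-based training loop is not shown.

```python
import random

def drop_ngrams(grams, unigrams, k):
    """Drop k elements of R(S) \\ U(S) uniformly at random; unigrams are always kept."""
    higher_order = [g for g in grams if g not in unigrams]
    dropped = set(random.sample(higher_order, min(k, len(higher_order))))
    return [g for g in grams if g not in dropped]

grams = ["the", "cat", "sat", "the cat", "cat sat"]
kept = drop_ngrams(grams, unigrams={"the", "cat", "sat"}, k=1)
```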
We train two models on each dataset, one with unigrams only and one with unigrams and bigrams. All training parameters for the models are provided in Table 5 in the supplementary material. Our C++ implementation builds upon the FastText library (Joulin et al., 2017; Bojanowski et al., 2017). We will make our code and pre-trained models available open-source.

3. Related Work

We discuss existing models which have been proposed to construct sentence embeddings. While there is a large body of work in this direction – several among these using e.g. labelled datasets of paraphrase pairs to obtain sentence embeddings in a supervised manner (Wieting et al., 2016b;a) – we here focus on unsupervised, task-independent models. While some methods require ordered raw text, i.e., a coherent corpus where the next sentence is a logical continuation of the previous sentence, others rely only on raw text, i.e., an unordered collection of sentences. Finally, we also discuss alternative models built from structured data sources.

3.1. Unsupervised Models Independent of Sentence Ordering

The ParagraphVector DBOW model (Le & Mikolov, 2014) is a log-linear model which is trained to learn sentence as well as word embeddings and then use a softmax distribution to predict words contained in the sentence given the sentence vector representation. They also propose a different model, ParagraphVector DM, where they use n-grams of consecutive words along with the sentence vector representation to predict the next word.
(Hill et al., 2016a) propose a Sequential (Denoising) Autoencoder, S(D)AE. This model first introduces noise in the input data: firstly each word is deleted with probability p_0, then for each non-overlapping bigram, words are swapped with probability p_x. The model then uses an LSTM-based architecture to retrieve the original sentence from the corrupted version. The model can then be used to encode new sentences into vector representations. In the case of p_0 = p_x = 0, the model simply becomes a Sequential Autoencoder. (Hill et al., 2016a) also propose a variant (S(D)AE + embs.) in which the words are represented by fixed pre-trained word vector embeddings.

(Arora et al., 2017) propose a model in which sentences are represented as a weighted average of fixed (pre-trained) word vectors, followed by a post-processing step of subtracting the first principal component. Using the generative model of (Arora et al., 2016), words are generated conditioned on a sentence "discourse" vector c_s:

Pr[w | c_s] = \alpha f_w + (1 - \alpha) \frac{\exp(\tilde{c}_s^\top v_w)}{Z_{\tilde{c}_s}},

where Z_{\tilde{c}_s} := \sum_{w \in V} \exp(\tilde{c}_s^\top v_w) and \tilde{c}_s := \beta c_0 + (1 - \beta) c_s, with \alpha, \beta scalars. c_0 is the common discourse vector, representing a shared component among all discourses, mainly related to syntax. It allows the model to better generate syntactical features. The \alpha f_w term is there to enable the model to generate some frequent words even if their match with the discourse vector \tilde{c}_s is low.

Therefore, this model tries to generate sentences as a mixture of three types of words: words matching the sentence discourse vector c_s, syntactical words matching c_0, and words with high f_w. (Arora et al., 2017) demonstrated that for this model, the MLE of \tilde{c}_s can be approximated by \sum_{w \in S} \frac{a}{f_w + a} v_w, where a is a scalar. The sentence discourse vector can hence be obtained by subtracting c_0, estimated by the first principal component of the \tilde{c}_s's on a set of sentences. In other words, the sentence embeddings are obtained by a weighted average of the word vectors, stripping away the syntax by subtracting the common discourse vector and down-weighting frequent tokens. They generate sentence embeddings from diverse pre-trained word embeddings, among which are unsupervised word embeddings such as GloVe (Pennington et al., 2014) as well as supervised word embeddings such as paragram-SL999 (PSL) (Wieting et al., 2015) trained on the Paraphrase Database (Ganitkevitch et al., 2013).
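For reference, this two-step recipe of (Arora et al., 2017), weighted averaging followed by removal of the first principal component, can be sketched in a few lines. This is our own paraphrase of their method (parameter names, and the use of a plain SVD without preprocessing, are assumptions), not their released code.

```python
import numpy as np

def sif_embeddings(sentences, word_vec, freq, a=1e-3):
    """Each sentence: (1/|s|) * sum_w a / (f_w + a) * v_w, then remove the projection
    onto the first singular vector (the common discourse direction c_0)."""
    X = np.stack([
        np.mean([(a / (freq[w] + a)) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]                       # first principal component
    return X - np.outer(X @ u, u)   # strip the shared, mostly syntactic, component
```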
In a very different line of work, C-PHRASE (Pham et al., 2015) relies on additional information from the syntactic parse tree of each sentence, which is incorporated into the C-BOW training objective.

(Huang & Anandkumar, 2016) show that single-layer CNNs can be modeled using a tensor decomposition approach. While building on an unsupervised objective, the employed dictionary learning step for obtaining phrase templates is task-specific (for each use-case), and does not result in general-purpose embeddings.

3.2. Unsupervised Models Depending on Sentence Ordering

The SkipThought model (Kiros et al., 2015) combines sentence-level models with recurrent neural networks. Given a sentence S_i from an ordered corpus, the model is trained to predict S_{i-1} and S_{i+1}.

FastSent (Hill et al., 2016a) is a sentence-level log-linear bag-of-words model. Like SkipThought, it uses adjacent sentences as the prediction target and is trained in an unsupervised fashion. Using word sequences allows the model to improve over the earlier work of paragraph2vec (Le & Mikolov, 2014). (Hill et al., 2016a) augment FastSent further by training it to predict the constituent words of the sentence as well. This model is named FastSent + AE in our comparisons.

Compared to our approach, Siamese C-BOW (Kenter et al., 2016) shares the idea of learning to average word embeddings over a sentence. However, it relies on a Siamese neural network architecture to predict surrounding sentences, contrasting with our simpler unsupervised objective.

Note that on the character sequence level instead of word sequences, FastText (Bojanowski et al., 2017) uses the same conceptual model to obtain better word embeddings. This is most similar to our proposed model, with two key differences: firstly, we predict from source word sequences to target words, as opposed to character sequences to target words, and secondly, our model averages the source embeddings instead of summing them.

3.3. Models requiring structured data

DictRep (Hill et al., 2016b) is trained to map dictionary definitions of words to the pre-trained word embeddings of these words. They use two different architectures, namely BOW and RNN (LSTM), with the choice of learning the input word embeddings or using them pre-trained. A similar architecture is used by the CaptionRep variant, but here the task is the mapping of given image captions to a pre-trained vector representation of these images.

4. Evaluation Tasks

We use a standard set of supervised as well as unsupervised benchmark tasks from the literature to evaluate our trained models, following (Hill et al., 2016a). The breadth of tasks allows us to fairly measure generalization to a wide area of different domains, testing the general-purpose quality (universality) of all competing sentence embeddings. For downstream supervised evaluations, sentence embeddings are combined with logistic regression to predict target labels. In the unsupervised evaluation for sentence similarity, the correlation of the cosine similarity between two embeddings is compared to human annotators.
Downstream Supervised Evaluation. Sentence embeddings are evaluated for various supervised classification tasks as follows. We evaluate paraphrase identification (MSRP) (Dolan et al., 2004), classification of movie review sentiment (MR) (Pang & Lee, 2005), product reviews (CR) (Hu & Liu, 2004), subjectivity classification (SUBJ) (Pang & Lee, 2004), opinion polarity (MPQA) (Wiebe et al., 2005) and question type classification (TREC) (Voorhees, 2002). To classify, we use the code provided by (Kiros et al., 2015) in the same manner as in (Hill et al., 2016a). For the MSRP dataset, containing pairs of sentences (S_1, S_2) with an associated paraphrase label, we generate feature vectors by concatenating their Sent2Vec representations |v_{S_1} - v_{S_2}| with the component-wise product v_{S_1} ⊙ v_{S_2}. The predefined training split is used to tune the L2 penalty parameter using cross-validation, and the accuracy and F1 scores are computed on the test set. For the remaining 5 datasets, Sent2Vec embeddings are inferred from input sentences and directly fed to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the MR, CR, SUBJ and MPQA datasets. For those datasets, nested cross-validation is used to tune the L2 penalty. For the TREC dataset, as for the MSRP dataset, the L2 penalty is tuned on the predefined train split using 10-fold cross-validation, and the accuracy is computed on the test set.
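As an illustration of the MSRP pipeline just described, a sketch using scikit-learn (our choice of library; (Kiros et al., 2015) provide the actual evaluation code) that builds the pair features and tunes the L2-regularized logistic regression by cross-validation on the training split:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def pair_features(v1, v2):
    """|v_S1 - v_S2| concatenated with the component-wise product v_S1 * v_S2."""
    return np.concatenate([np.abs(v1 - v2), v1 * v2])

def fit_msrp_classifier(X_train, y_train):
    """Tune the inverse L2 penalty C by 10-fold cross-validation on the predefined train split."""
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        {"C": [2.0 ** k for k in range(-4, 5)]}, cv=10)
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```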
Unsupervised Similarity Evaluation. We perform unsupervised evaluation of the learnt sentence embeddings using the sentence cosine similarity, on the STS 2014 (Agirre et al., 2014) and SICK 2014 (Marelli et al., 2014) datasets. These similarity scores are compared to the gold-standard human judgements using Pearson's r (Pearson, 1895) and Spearman's ρ (Spearman, 1904) correlation scores. The SICK dataset consists of about 10,000 sentence pairs along with relatedness scores of the pairs. The STS 2014 dataset contains 3,770 pairs, divided into six different categories on the basis of the origin of sentences/phrases, namely Twitter, headlines, news, forum, WordNet and images. See (Agirre et al., 2014) for more precise information on how the pairs have been created.
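The unsupervised evaluation itself is a one-liner per pair: cosine similarity of the two sentence vectors, then correlation with the gold relatedness scores. A minimal sketch (SciPy's correlation functions are our choice; the benchmark scripts may differ):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine_similarities(emb_a, emb_b):
    """Row-wise cosine similarity between the embeddings of the two sides of each pair."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def evaluate_similarity(emb_a, emb_b, gold_scores):
    sims = cosine_similarities(emb_a, emb_b)
    return pearsonr(sims, gold_scores)[0], spearmanr(sims, gold_scores)[0]
```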
5. Results and Discussion

In Tables 1 and 2, we compare our results with those obtained by (Hill et al., 2016a) on different models. Along with the models discussed in Section 3, this also includes the sentence embedding baselines obtained by simple averaging of word embeddings over the sentence, in both the C-BOW and skip-gram variants. TF-IDF BOW is a representation consisting of the counts of the 200,000 most common feature-words, weighted by their TF-IDF frequencies. To ensure coherence, we only include unsupervised models in the main paper. Performance of supervised and semi-supervised models on these evaluations can be observed in Tables 6 and 7 in the supplementary material.

Table 1: Comparison of the performance of different models on different supervised evaluation tasks. An underline indicates the best performance for the dataset. Top 3 performances in each data category are shown in bold. The average is calculated as the average of accuracy for each category (for MSRP, we take the average of the two entries).

Table 2: Unsupervised Evaluation Tasks: Comparison of the performance of different models on Spearman/Pearson correlation measures. An underline indicates the best performance for the dataset. Top 3 performances in each data category are shown in bold. The average is calculated as the average of entries for each correlation measure.

⁴ For the Siamese C-BOW model trained on the Toronto corpus, supervised evaluation as well as similarity evaluation results on the SICK 2014 dataset are unavailable.

Downstream Supervised Evaluation Results. On running supervised evaluations and observing the results in Table 1, we find that on average our models are second only to SkipThought vectors. Also, both our models achieve state-of-the-art results on the CR task. We also observe that on half of the supervised tasks, our unigrams + bigrams model is the best model after SkipThought. Our models are weaker on the MSRP task (which consists of the identification of labelled paraphrases) compared to state-of-the-art methods. However, we observe that the models which perform extremely well on this task end up faring very poorly on the other tasks, indicating a lack of generalizability.

On the rest of the tasks, our models perform extremely well. The SkipThought model is able to outperform our models on most of the tasks, as it is trained to predict the previous and next sentences, and a lot of tasks are able to make use of this contextual information, which is missing in our Sent2Vec models. For example, the TREC task is a poor measure of how well one predicts the content of the sentence (the question), but a good measure of how well the next sentence in the sequence (the answer) is predicted.

Unsupervised Similarity Evaluation Results. In Table 2, we see that our Sent2Vec models are state-of-the-art on the majority of tasks when comparing to all the unsupervised models trained on the Toronto corpus, and clearly achieve the best averaged performance. Our Sent2Vec models also on average outperform or are at par with the C-PHRASE model, despite significantly lagging behind on the STS 2014 WordNet and News subtasks. This observation can be attributed to the fact that a big chunk of the data that the C-PHRASE model is trained on comes from English Wikipedia, helping it to perform well on datasets involving definitions and news items. Also, C-PHRASE uses data three times the size of the Toronto book corpus. Interestingly, our model outperforms C-PHRASE when trained on Wikipedia, as shown in Table 3, despite the fact that we use no parse tree information.

In the official results of the more recent edition of the STS 2017 benchmark (Cer et al., 2017), our model also significantly outperforms C-PHRASE, and delivers the best unsupervised baseline method.

Macro Average. To summarize our contributions on both supervised and unsupervised tasks, in Table 3 we present the results in terms of the macro average over the averages of both supervised and unsupervised tasks, along with the training times of the models⁵. For unsupervised tasks, averages are taken over both Spearman and Pearson scores.
The comparison includes the best performing unsupervised and semi-supervised methods described in Section 3. For models trained on the Toronto books dataset, we report a 3.8 percentage-point improvement over the state of the art. Considering all supervised, semi-supervised methods and all datasets compared in (Hill et al., 2016a), we report a 2.2 percentage-point improvement.

We also see a noticeable improvement in accuracy as we use larger datasets like Twitter and the Wikipedia dump. We can also see that the Sent2Vec models are faster to train when compared to methods like SkipThought and DictRep, owing to the SGD step allowing a high degree of parallelizability.

We can clearly see Sent2Vec outperforming other unsupervised and even semi-supervised methods. This can be attributed to the superior generalizability of our model across supervised and unsupervised tasks.

Comparison with Arora et al. (2017). In Table 4, we report an experimental comparison to the model of Arora et al. (2017), which is particularly tailored to sentence similarity tasks. In the table, the suffix W indicates that their down-weighting scheme has been used, while the suffix R indicates the removal of the first principal component. They report values of a ∈ [10^{-4}, 10^{-3}] as giving the best results and used a = 10^{-3} for all their experiments. Their down-weighting scheme suggests reducing the importance of syntactical features. To do so, we use a simple blacklist containing the 25 most frequent tokens in the Twitter corpus and discard them before averaging. Results are also reported in Table 4.
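The blacklist variant amounts to nothing more than filtering the 25 most frequent corpus tokens out of each sentence before the embeddings are averaged; a small sketch (function names are ours):

```python
from collections import Counter

def build_blacklist(tokenized_corpus, k=25):
    """The k most frequent tokens of the corpus."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return {tok for tok, _ in counts.most_common(k)}

def filter_tokens(tokens, blacklist):
    """Discard blacklisted tokens before averaging their embeddings."""
    return [t for t in tokens if t not in blacklist]
```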
We observe that our results are competitive with the embeddings of Arora et al. (2017) for purely unsupervised methods. We confirm their empirical finding that reducing the influence of the syntax helps performance on semantic similarity tasks.

⁵ The time taken to train C-PHRASE models is unavailable.
Table 3: Best unsupervised and semi-supervised methods ranked by macro average along with their training times. ** indicates trained on GPU. * indicates trained on a single node using 30 cores. Training times for non-Sent2Vec models are due to (Hill et al., 2016a).

Table 4: Comparison of the performance of the unsupervised and semi-supervised sentence embeddings by (Arora et al., 2017) with our models, in terms of Pearson's correlation.

is state-of-the-art for these evaluations on average. Future work could focus on augmenting the model to exploit data with ordered sentences. Furthermore, we would like to further investigate the model's ability to provide pre-trained embeddings for downstream transfer learning tasks.

Acknowledgments. We are indebted to Piotr Bojanowski and Armand Joulin for helpful discussions.

References

Agirre, Eneko, Banea, Carmen, Cardie, Claire, Cer, Daniel, Diab, Mona, Gonzalez-Agirre, Aitor, Guo, Weiwei, Mihalcea, Rada, Rigau, German, and Wiebe, Janyce. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 81–91. Association for Computational Linguistics, Dublin, Ireland, 2014.

Arora, Sanjeev, Li, Yuanzhi, Liang, Yingyu, Ma, Tengyu, and Risteski, Andrej. A Latent Variable Model Approach to PMI-based Word Embeddings. In Transactions of the Association for Computational Linguistics, pp. 385–399, July 2016.

Arora, Sanjeev, Liang, Yingyu, and Ma, Tengyu. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR), 2017.

Bird, Steven, Klein, Ewan, and Loper, Edward. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

Bojanowski, Piotr, Grave, Edouard, Joulin, Armand, and Mikolov, Tomas. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

Cer, Daniel, Diab, Mona, Agirre, Eneko, Lopez-Gazpio, Inigo, and Specia, Lucia. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation. In SemEval-2017 - Proceedings of the 11th International Workshop on Semantic Evaluations, pp. 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics.

Dolan, Bill, Quirk, Chris, and Brockett, Chris. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, pp. 350. Association for Computational Linguistics, 2004.

Ganitkevitch, Juri, Van Durme, Benjamin, and Callison-Burch, Chris. PPDB: The Paraphrase Database. In HLT-NAACL, pp. 758–764, 2013.

Goldberg, Yoav and Levy, Omer. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv, February 2014.

Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT, February 2016a.

Hill, Felix, Cho, KyungHyun, Korhonen, Anna, and Bengio, Yoshua. Learning to understand phrases by embedding the dictionary. TACL, 4:17–30, 2016b. URL https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/711.

Hu, Minqing and Liu, Bing. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM, 2004.

Huang, Furong and Anandkumar, Animashree. Unsupervised Learning of Word-Sequence Representations from Scratch via Convolutional Tensor Decomposition. arXiv, 2016.

Joulin, Armand, Grave, Edouard, Bojanowski, Piotr, and Mikolov, Tomas. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Short Papers, pp. 427–431, Valencia, Spain, 2017.

Kenter, Tom, Borisov, Alexey, and de Rijke, Maarten. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. In ACL - Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 941–951, Berlin, Germany, 2016.

Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R, Zemel, Richard, Urtasun, Raquel, Torralba, Antonio, and Fidler, Sanja. Skip-Thought Vectors. In NIPS 2015 - Advances in Neural Information Processing Systems 28, pp. 3294–3302, 2015.

Le, Quoc V and Mikolov, Tomas. Distributed Representations of Sentences and Documents. In ICML 2014 - Proceedings of the 31st International Conference on Machine Learning, volume 14, pp. 1188–1196, 2014.

Levy, Omer, Goldberg, Yoav, and Dagan, Ido. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.

Luhn, Hans Peter. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.

Manning, Christopher D, Surdeanu, Mihai, Bauer, John, Finkel, Jenny Rose, Bethard, Steven, and McClosky, David. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pp. 55–60, 2014.

Marelli, Marco, Menini, Stefano, Baroni, Marco, Bentivogli, Luisa, Bernardi, Raffaella, and Zamparelli, Roberto. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pp. 216–223, 2014.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In NIPS - Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013b.

Pang, Bo and Lee, Lillian. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Pang, Bo and Lee, Lillian. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124. Association for Computational Linguistics, 2005.
Supplementary Material
A. Parameters for training models
Table 5: Training parameters for the Sent2Vec models.

Model | Embedding Dimensions | Minimum Word Count | Minimum Target Word Count | Initial Learning Rate | Epochs | Subsampling Hyper-parameter | Bigrams Dropped per Sentence | Number of Negatives Sampled
Book corpus Sent2Vec unigrams | 700 | 5 | 8 | 0.2 | 13 | 1 x 10^-5 | - | 10
Book corpus Sent2Vec unigrams + bigrams | 700 | 5 | 5 | 0.2 | 12 | 5 x 10^-6 | 7 | 10
Wiki Sent2Vec unigrams | 600 | 8 | 20 | 0.2 | 9 | 1 x 10^-5 | - | 10
Wiki Sent2Vec unigrams + bigrams | 700 | 8 | 20 | 0.2 | 9 | 5 x 10^-6 | 4 | 10
Twitter Sent2Vec unigrams | 700 | 20 | 20 | 0.2 | 3 | 1 x 10^-6 | - | 10
Twitter Sent2Vec unigrams + bigrams | 700 | 20 | 20 | 0.2 | 3 | 1 x 10^-6 | 3 | 10
B. L1 regularization of models

Optionally, our model can be additionally improved by adding an L1 regularizer term to the objective function, leading to slightly better generalization performance. Additionally, encouraging sparsity in the embedding vectors is beneficial for memory reasons, allowing higher embedding dimensions h.

We propose to apply L1 regularization individually to each word (and n-gram) vector (both source and target vectors). Formally, the training objective function (3) then becomes

\min_{U,V} \sum_{S \in C} \sum_{w_t \in S} \Big( q_p(w_t) \big[ \ell\big(u_{w_t}^\top v_{S \setminus \{w_t\}}\big) + \tau \big( \|u_{w_t}\|_1 + \|v_{S \setminus \{w_t\}}\|_1 \big) \big] + |N_{w_t}| \sum_{w' \in V} q_n(w') \big[ \ell\big(-u_{w'}^\top v_{S \setminus \{w_t\}}\big) + \tau \|u_{w'}\|_1 \big] \Big)    (4)

where \tau is the regularization parameter.

We observe that L1 regularization using the proximal step gives our models a small boost in performance. Also, applying the thresholding operator takes only |R(S \ {w_t})| · h floating point operations for updating the word vectors corresponding to the sentence, and (|N| + 1) · h for updating the target as well as the negative word vectors, where |N| is the number of negatives sampled and h is the embedding dimension. Thus, performing L1 regularization using the soft-thresholding operator comes with only a small computational overhead.

We set \tau to 0.0005 for both the Wikipedia and the Toronto Book Corpus unigrams + bigrams models.
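The soft-thresholding mentioned above is the standard proximal operator of the L1 norm, applied after the plain SGD step only to the vectors touched by the update. A minimal sketch; coupling the threshold to the current learning rate (thr = lr * tau) is our assumption about how the proximal step would typically be scaled:

```python
import numpy as np

def soft_threshold(vec, thr):
    """Proximal operator of the L1 norm: shrink every coordinate towards zero by thr."""
    return np.sign(vec) * np.maximum(np.abs(vec) - thr, 0.0)

# after the SGD step for one (sentence, target) pair:
#   for g in R(S \ {w_t}):        V[g] = soft_threshold(V[g], lr * tau)
#   for w in {w_t} and N_{w_t}:   U[w] = soft_threshold(U[w], lr * tau)
# which costs |R(S \ {w_t})| * h + (|N| + 1) * h extra floating point operations.
```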
Table 6: Comparison of the performance of different Sent2Vec models with different semi-supervised/supervised models
on different downstream supervised evaluation tasks. An underline indicates the best performance for the dataset and
Sent2Vec model performances are bold if they perform as well or better than all other non-Sent2Vec models, including
those presented in Table 1.
Table 7: Unsupervised Evaluation: Comparison of the performance of different Sent2Vec models with semi-
supervised/supervised models on Spearman/Pearson correlation measures. An underline indicates the best performance
for the dataset and Sent2Vec model performances are bold if they perform as well or better than all other non-Sent2Vec
models, including those presented in Table 2.
D. Dataset Description
Table 8: Average sentence lengths for the datasets used in the comparison.

Sentence Length | STS 2014 News | STS 2014 Forum | STS 2014 WordNet | STS 2014 Twitter | STS 2014 Images | STS 2014 Headlines | SICK 2014 Test + Train | Wikipedia Dataset | Twitter Dataset | Book Corpus Dataset
Average | 17.23 | 10.12 | 8.85 | 11.64 | 10.17 | 7.82 | 9.67 | 25.25 | 16.31 | 13.32
Standard Deviation | 8.66 | 3.30 | 3.10 | 5.28 | 2.77 | 2.21 | 3.75 | 12.56 | 7.22 | 8.94