
Notes on Noise Contrastive Estimation and Negative Sampling

Chris Dyer
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA, 15213
[email protected]
arXiv:1410.8251v1 [cs.LG] 30 Oct 2014

Abstract

Estimating the parameters of probabilistic models of language such as maxent models and probabilistic neural models is computationally difficult since it involves evaluating partition functions by summing over an entire vocabulary, which may be millions of word types in size. Two closely related strategies, noise contrastive estimation (NCE; Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013) and negative sampling (Mikolov et al., 2013; Goldberg and Levy, 2014), have emerged as popular solutions to this computational problem, but some confusion remains as to which is more appropriate and when. This document explicates their relationships to each other and to other estimation techniques. The analysis shows that, although they are superficially similar, NCE is a general parameter estimation technique that is asymptotically unbiased, while negative sampling is best understood as a family of binary classification models that are useful for learning word representations but not as a general-purpose estimator.

1 Introduction
Let us assume the following model of language which predicts a word w in a vocabulary V based on some
given context c:¹

p_\theta(w \mid c) = \frac{u_\theta(w, c)}{\sum_{w' \in V} u_\theta(w', c)} = \frac{u_\theta(w, c)}{Z_\theta(c)}, \qquad (1)

where uθ(w, c) = exp sθ(w, c) assigns a score to a word in context, Zθ(c) is the partition function that
normalizes this into a probability distribution, and sθ (w, c) is differentiable with respect to θ. The standard
learning procedure is to maximize the likelihood of a sample of training data. Unfortunately, computing
this probability (and its derivatives) is expensive since this requires summing over all words in V , which is
generally very large.
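To make the cost concrete, here is a minimal Python sketch of Eq. 1 as written; the score function `score` and the vocabulary list `vocab` are hypothetical stand-ins, not part of these notes.

    import math

    def full_softmax_prob(score, w, c, vocab):
        # Eq. 1: p(w | c) = u(w, c) / Z(c), with u(w, c) = exp(score(w, c)).
        # The partition function Z(c) requires scoring every word in the
        # vocabulary, which is the expensive step discussed above.
        u_w = math.exp(score(w, c))
        Z_c = sum(math.exp(score(w_prime, c)) for w_prime in vocab)  # O(|V|) work per context
        return u_w / Z_c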
What can be done? Since the derivatives of the log likelihood include terms that are expectations
under the model distribution, the classic strategy has been to use importance sampling
and related Monte Carlo techniques to approximate these expectations (Bengio and Senécal, 2003). Noise
contrastive estimation and negative sampling represent an evolution of these techniques. These work by
transforming the computationally expensive learning problem into a binary classification proxy problem that
uses the same parameters but requires statistics that are easier to compute.
¹ By language model I mean a model that generates one word at a time, conditional on any other ambient context such as
previously generated or surrounding words, a topic label, text in another language, etc. Excluded are so-called “whole-sentence”
or “globally normalized” language models. While these can also, in principle, be learned using the techniques described in these
notes, this exposition focuses on models that predict a single word at a time.
1.1 Empirical distributions, Noise distributions, and Model distributions
I will refer to p̃(w | c) and p̃(c) as empirical distributions. Our task is to find the parameters θ of a model
pθ(w | c) that approximates the empirical distribution as closely as possible, in terms of minimal cross-entropy.
To avoid costly summations, a “noise” distribution, q(w), is used. In practice q is a uniform,
empirical unigram, or “flattened” empirical unigram distribution (obtained by raising each unigram probability
to a power 0 < α < 1 and renormalizing).
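As a concrete illustration of the “flattened” unigram option, here is a minimal Python sketch; the variable names are mine, and alpha = 0.75 is merely a common choice in practice, not something prescribed by these notes.

    def flattened_unigram(counts, alpha=0.75):
        # Raise each empirical unigram count to a power 0 < alpha < 1 and renormalize.
        # Smaller alpha flattens the distribution: rare words gain probability mass.
        flattened = [c ** alpha for c in counts]
        total = sum(flattened)
        return [f / total for f in flattened]

    # q = flattened_unigram(unigram_counts)  # counts (or probabilities) indexed by word id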

2 Noise contrastive estimation (NCE)


NCE reduces the language model estimation problem to the problem of estimating the parameters of a
probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution
from samples generated by the noise distribution (Gutmann and Hyvärinen, 2010). The two-class training
data is generated as follows: sample a c from p̃(c), then sample one “true” sample from p̃(w | c), with
auxiliary label D = 1 indicating the datapoint is drawn from the true distribution, and k “noise” samples
from q, with auxiliary label D = 0 indicating these data points are noise. Thus, given c, the joint probability
of (d, w) in the two-class data has the form of the mixture of two distributions:
p(d, w \mid c) =
\begin{cases}
\frac{k}{1+k} \times q(w) & \text{if } d = 0 \\[4pt]
\frac{1}{1+k} \times \tilde{p}(w \mid c) & \text{if } d = 1 .
\end{cases}

Using the definition of conditional probability, this can be turned into a conditional probability of d having
observed w and c:
p(D = 0 \mid c, w) = \frac{\frac{k}{1+k} \times q(w)}{\frac{1}{1+k} \times \tilde{p}(w \mid c) + \frac{k}{1+k} \times q(w)}
                   = \frac{k \times q(w)}{\tilde{p}(w \mid c) + k \times q(w)}

p(D = 1 \mid c, w) = \frac{\tilde{p}(w \mid c)}{\tilde{p}(w \mid c) + k \times q(w)} .

Note that these probabilities are written in terms of the empirical distribution.
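Before replacing the empirical distribution with the model, it may help to see the data-generation step in code. A rough Python sketch with hypothetical names: `data` is a list of observed (w, c) pairs standing in for draws from p̃(c) p̃(w | c), and `sample_noise` draws a word from q.

    import random

    def make_proxy_corpus(data, sample_noise, k):
        # For each observed (w, c): one true example with label D = 1
        # and k noise examples with label D = 0, as described above.
        examples = []
        for w, c in data:
            examples.append((c, w, 1))                    # true sample from the data
            for _ in range(k):
                examples.append((c, sample_noise(), 0))   # noise sample drawn from q
        random.shuffle(examples)
        return examples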
NCE replaces the empirical distribution p̃(w | c) with the model distribution pθ(w | c), and θ is chosen
to maximize the conditional likelihood of the “proxy corpus” created as described above. But, thus far, we have
not solved any computational problem: pθ(w | c) still requires evaluating the partition function; all we have
done is transform the objective by adding some noise. To avoid the expense of evaluating the partition function,
NCE makes two further assumptions. First, it proposes that the partition function value Z(c) be estimated as a
parameter zc (thus, for every empirical c, classic NCE introduces one parameter). Second, for neural networks
with lots of parameters, it turns out that fixing zc = 1 for all c is effective (Mnih and Teh, 2012). This latter
assumption both reduces the number of parameters and encourages the model to have “self-normalized” outputs
(i.e., Z(c) ≈ 1). Making these assumptions, we can now write the conditional probabilities of being a noise
sample or true distribution sample in terms of θ as

p(D = 0 \mid c, w) = \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)}

p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)} .
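In code, these classification probabilities need only the unnormalized score sθ(w, c) and the noise probability q(w); no sum over the vocabulary appears anywhere. A minimal Python sketch with hypothetical `s_theta` and `q` callables, and zc fixed to 1 as discussed:

    import math

    def nce_p_true(s_theta, q, w, c, k):
        # p(D = 1 | c, w) = u / (u + k q(w)), with u = exp(s_theta(w, c)) and z_c = 1.
        u = math.exp(s_theta(w, c))
        return u / (u + k * q(w))

    def nce_p_noise(s_theta, q, w, c, k):
        # p(D = 0 | c, w) = k q(w) / (u + k q(w)).
        u = math.exp(s_theta(w, c))
        return (k * q(w)) / (u + k * q(w))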

We now have a binary classification problem with parameters θ that can be trained to maximize the conditional
log-likelihood of D, using k negative samples per observation:
\mathcal{L}_{\mathrm{NCE}_k} = \sum_{(w,c) \in \mathcal{D}} \left( \log p(D = 1 \mid c, w) + k \, \mathbb{E}_{\bar{w} \sim q} \log p(D = 0 \mid c, \bar{w}) \right) .

Unfortunately, the expectation in the second term is still a difficult summation: it is k times the expected
log probability (according to the current model) of assigning the noise label D = 0, taken under the noise
distribution over all words in V in a context c. We still have a loop over the entire vocabulary. The final step
is therefore to replace this expectation with its Monte Carlo approximation:

\mathcal{L}^{\mathrm{MC}}_{\mathrm{NCE}_k} = \sum_{(w,c) \in \mathcal{D}} \left[ \log p(D = 1 \mid c, w) + k \times \sum_{i=1, \bar{w}_i \sim q}^{k} \frac{1}{k} \times \log p(D = 0 \mid c, \bar{w}_i) \right]

                                           = \sum_{(w,c) \in \mathcal{D}} \left[ \log p(D = 1 \mid c, w) + \sum_{i=1, \bar{w}_i \sim q}^{k} \log p(D = 0 \mid c, \bar{w}_i) \right] .
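Putting this together, here is a sketch of one (w, c) pair's contribution to the Monte Carlo objective, reusing the same hypothetical `s_theta`, `q`, and `sample_noise` as in the earlier sketches; a practical implementation would compute the log-probabilities in a numerically stable way.

    import math

    def nce_example_loss(s_theta, q, sample_noise, w, c, k):
        # log p(D = 1 | c, w) for the observed word, plus log p(D = 0 | c, w_bar)
        # for k noise words w_bar ~ q. Maximize the sum of this over the corpus.
        u = math.exp(s_theta(w, c))                 # u_theta(w, c), with z_c fixed to 1
        loss = math.log(u / (u + k * q(w)))         # log p(D = 1 | c, w)
        for _ in range(k):
            w_bar = sample_noise()                  # w_bar ~ q
            u_bar = math.exp(s_theta(w_bar, c))
            loss += math.log(k * q(w_bar) / (u_bar + k * q(w_bar)))  # log p(D = 0 | c, w_bar)
        return loss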

2.1 Asymptotic analysis


Although the objective L_{NCE_k} is intractable, its derivative sheds light on why NCE works. This quantity may
be written as

\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{NCE}_k} = \sum_{(w', c) \in \mathcal{D}} \sum_{w \in V} \left( \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)} \times \big( \tilde{p}(w \mid c) - u_\theta(w, c) \big) \frac{\partial}{\partial \theta} \log u_\theta(w, c) \right) .

It is easy to see that in the limiting case as k → ∞, this derivative tends to the gradient of the log likelihood
of the data D under pθ (and furthermore, L^{MC}_{NCE_k} → L_{NCE_k}). That is, the gradient is 0 when the model distribution
matches the empirical distribution.
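A one-line sketch of the limiting step: because q has full support, the weighting factor in front of the difference tends to 1 as k grows, so

\lim_{k \to \infty} \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)} = 1
\quad\Longrightarrow\quad
\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{NCE}_k} \to \sum_{(w', c) \in \mathcal{D}} \sum_{w \in V} \big( \tilde{p}(w \mid c) - u_\theta(w, c) \big) \frac{\partial}{\partial \theta} \log u_\theta(w, c) ,

which is the maximum likelihood gradient of a self-normalized model (uθ ≈ pθ) and vanishes exactly when uθ(·, c) = p̃(· | c).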

3 Negative sampling
Negative sampling is a variation of NCE used by the popular word2vec tool. It likewise generates a proxy
corpus and learns θ as a binary classification problem, but it defines the conditional probabilities of D given
(w, c) differently:
p(D = 0 \mid c, w) = \frac{1}{u_\theta(w, c) + 1}

p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + 1} .
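Since uθ(w, c) = exp sθ(w, c), these two probabilities are simply the logistic sigmoid of the score and of its negation, which is how word2vec-style code typically computes them. A minimal Python sketch with a hypothetical `s_theta`:

    import math

    def neg_sampling_p_true(s_theta, w, c):
        # p(D = 1 | c, w) = u / (u + 1) = sigmoid(s_theta(w, c)), since u = exp(s_theta(w, c)).
        return 1.0 / (1.0 + math.exp(-s_theta(w, c)))

    def neg_sampling_p_noise(s_theta, w, c):
        # p(D = 0 | c, w) = 1 / (u + 1) = sigmoid(-s_theta(w, c)).
        return 1.0 / (1.0 + math.exp(s_theta(w, c)))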

This objective can be understood in several ways. First, it is equivalent to NCE when k = |V | and q
is uniform. Second, it can be understood as the hinge objective of Collobert et al. (2011) where the max
function has been replaced with a softmax. As a result, aside from the k = |V | and uniform q case,
the conditional probabilities of D given (w, c) are not consistent with the language model probabilities of
(w, c) and therefore the θ estimated using this as an objective will not optimize the likelihood of the language
model in Eq. 1. Thus, while negative sampling may be appropriate for word representation learning, it does
not have the same asymptotic consistency guarantees that NCE has.
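To see the equivalence claimed above, note that with q uniform over V and k = |V| the NCE factor k × q(w) collapses to 1:

k \times q(w) = |V| \times \frac{1}{|V|} = 1
\quad\Longrightarrow\quad
p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)} = \frac{u_\theta(w, c)}{u_\theta(w, c) + 1} ,

recovering the negative sampling form; for any other k or non-uniform q the two parameterizations differ.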

4 Conclusion
NCE is an effective way of learning parameters for an arbitrary locally normalized language model. However,
negative sampling should be thought of as an alternative task for generating representations of words for use
in other tasks, but not, itself, as a method for learning parameters in a generative model of language.
Thus, if your goal is language modeling, you should use NCE; if your goal is word representation learning,
you should consider both NCE and negative sampling.

References
[Bengio and Senécal2003] Yoshua Bengio and Jean-Sébastien Senécal. 2003. Quick training of probabilistic neural
nets by importance sampling. In Proc. AISTATS.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel
Kuksa. 2011. Natural language processing (almost) from scratch. JMLR.
[Goldberg and Levy2014] Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s
negative-sampling word-embedding method.
[Gutmann and Hyvärinen2010] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In Proc. AISTATS.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In Proc. NIPS.
[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently
with noise-contrastive estimation. In Proc. NIPS.
[Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural proba-
bilistic language models. In Proc. ICML.
[Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with
large-scale neural language models improves translation. In Proc. EMNLP.
