
Notes on Noise Contrastive Estimation and Negative Sampling

Chris Dyer
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA, 15213
[email protected]
arXiv:1410.8251v1 [cs.LG] 30 Oct 2014

Abstract

Estimating the parameters of probabilistic models of language such as maxent models and probabilistic neural models is computationally difficult since it involves evaluating partition functions by summing over an entire vocabulary, which may be millions of word types in size. Two closely related strategies, noise contrastive estimation (NCE; Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013; Vaswani et al., 2013) and negative sampling (Mikolov et al., 2013; Goldberg and Levy, 2014), have emerged as popular solutions to this computational problem, but some confusion remains as to which is more appropriate and when. This document explicates their relationships to each other and to other estimation techniques. The analysis shows that, although they are superficially similar, NCE is a general parameter estimation technique that is asymptotically unbiased, while negative sampling is best understood as a family of binary classification models that are useful for learning word representations but not as a general-purpose estimator.

1 Introduction
Let us assume the following model of language which predicts a word w in a vocabulary V based on some
given context c:¹

p_\theta(w \mid c) = \frac{u_\theta(w, c)}{\sum_{w' \in V} u_\theta(w', c)} = \frac{u_\theta(w, c)}{Z_\theta(c)}, \qquad (1)

where uθ(w, c) = exp sθ(w, c) assigns a score to a word in context, Zθ(c) is the partition function that
normalizes this into a probability distribution, and sθ (w, c) is differentiable with respect to θ. The standard
learning procedure is to maximize the likelihood of a sample of training data. Unfortunately, computing
this probability (and its derivatives) is expensive since this requires summing over all words in V , which is
generally very large.
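To make the cost concrete, here is a minimal Python sketch of Eq. 1 as written; the score function `score` and the vocabulary list `vocab` are hypothetical stand-ins, not part of these notes.

    import math

    def full_softmax_prob(score, w, c, vocab):
        # Eq. 1: p(w | c) = u(w, c) / Z(c), with u(w, c) = exp(score(w, c)).
        # The partition function Z(c) requires scoring every word in the
        # vocabulary, which is the expensive step discussed above.
        u_w = math.exp(score(w, c))
        Z_c = sum(math.exp(score(w_prime, c)) for w_prime in vocab)  # O(|V|) work per context
        return u_w / Z_c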
What can be done? Since the derivatives of the log likelihood include terms that are expectations
under the model distribution, the classic strategy has been to use importance sampling
and related Monte Carlo techniques to approximate these expectations (Bengio and Senécal, 2003). Noise
contrastive estimation and negative sampling represent an evolution of these techniques. These work by
transforming the computationally expensive learning problem into a binary classification proxy problem that
uses the same parameters but requires statistics that are easier to compute.
¹ By language model I mean a model that generates one word at a time, conditional on any other ambient context such as
previously generated or surrounding words, a topic label, text in another language, etc. Excluded are so-called “whole-sentence”
or “globally normalized” language models. While these can also, in principle, be learned using the techniques described in these
notes, this exposition focuses on models that predict a single word at a time.
1.1 Empirical distributions, Noise distributions, and Model distributions
I will refer to p̃(w | c) and p̃(c) as empirical distributions. Our task is to find the parameters θ of a model
pθ(w | c) that approximates the empirical distribution as closely as possible, in terms of minimal cross-entropy.
To avoid costly summations, a “noise” distribution, q(w), is used. In practice q is a uniform,
empirical unigram, or “flattened” empirical unigram distribution (obtained by raising each unigram probability
to a power 0 < α < 1 and renormalizing).
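As a concrete illustration of the “flattened” unigram option, here is a minimal Python sketch; the variable names are mine, and alpha = 0.75 is merely a common choice in practice, not something prescribed by these notes.

    def flattened_unigram(counts, alpha=0.75):
        # Raise each empirical unigram count to a power 0 < alpha < 1 and renormalize.
        # Smaller alpha flattens the distribution: rare words gain probability mass.
        flattened = [c ** alpha for c in counts]
        total = sum(flattened)
        return [f / total for f in flattened]

    # q = flattened_unigram(unigram_counts)  # counts (or probabilities) indexed by word id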

2 Noise contrastive estimation (NCE)


NCE reduces the language model estimation problem to the problem of estimating the parameters of a
probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution
from samples generated by the noise distribution (Gutmann and Hyvärinen, 2010). The two-class training
data is generated as follows: sample a c from p̃(c), then sample one “true” sample from p̃(w | c), with
auxiliary label D = 1 indicating the datapoint is drawn from the true distribution, and k “noise” samples
from q, with auxiliary label D = 0 indicating these data points are noise. Thus, given c, the joint probability
of (d, w) in the two-class data has the form of the mixture of two distributions:
p(d, w \mid c) =
\begin{cases}
\frac{k}{1+k} \times q(w) & \text{if } d = 0 \\[4pt]
\frac{1}{1+k} \times \tilde{p}(w \mid c) & \text{if } d = 1 .
\end{cases}

Using the definition of conditional probability, this can be turned into a conditional probability of d having
observed w and c:
p(D = 0 \mid c, w) = \frac{\frac{k}{1+k} \times q(w)}{\frac{1}{1+k} \times \tilde{p}(w \mid c) + \frac{k}{1+k} \times q(w)}
                   = \frac{k \times q(w)}{\tilde{p}(w \mid c) + k \times q(w)}

p(D = 1 \mid c, w) = \frac{\tilde{p}(w \mid c)}{\tilde{p}(w \mid c) + k \times q(w)} .

Note that these probabilities are written in terms of the empirical distribution.
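Before replacing the empirical distribution with the model, it may help to see the data-generation step in code. A rough Python sketch with hypothetical names: `data` is a list of observed (w, c) pairs standing in for draws from p̃(c) p̃(w | c), and `sample_noise` draws a word from q.

    import random

    def make_proxy_corpus(data, sample_noise, k):
        # For each observed (w, c): one true example with label D = 1
        # and k noise examples with label D = 0, as described above.
        examples = []
        for w, c in data:
            examples.append((c, w, 1))                    # true sample from the data
            for _ in range(k):
                examples.append((c, sample_noise(), 0))   # noise sample drawn from q
        random.shuffle(examples)
        return examples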
NCE replaces the empirical distribution p̃(w | c) with the model distribution pθ(w | c), and θ is chosen
to maximize the conditional likelihood of the “proxy corpus” created as described above. But, thus far, we have
not solved any computational problem: pθ(w | c) still requires evaluating the partition function; all we have
done is transform the objective by adding some noise. To avoid the expense of evaluating the partition function,
NCE makes two further assumptions. First, it proposes that the partition function value Z(c) be estimated as a
parameter zc (thus, for every empirical c, classic NCE introduces one parameter). Second, for neural networks
with lots of parameters, it turns out that fixing zc = 1 for all c is effective (Mnih and Teh, 2012). This latter
assumption both reduces the number of parameters and encourages the model to have “self-normalized” outputs
(i.e., Z(c) ≈ 1). Making these assumptions, we can now write the conditional probabilities of being a noise
sample or true distribution sample in terms of θ as

p(D = 0 \mid c, w) = \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)}

p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)} .
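In code, these classification probabilities need only the unnormalized score sθ(w, c) and the noise probability q(w); no sum over the vocabulary appears anywhere. A minimal Python sketch with hypothetical `s_theta` and `q` callables, and zc fixed to 1 as discussed:

    import math

    def nce_p_true(s_theta, q, w, c, k):
        # p(D = 1 | c, w) = u / (u + k q(w)), with u = exp(s_theta(w, c)) and z_c = 1.
        u = math.exp(s_theta(w, c))
        return u / (u + k * q(w))

    def nce_p_noise(s_theta, q, w, c, k):
        # p(D = 0 | c, w) = k q(w) / (u + k q(w)).
        u = math.exp(s_theta(w, c))
        return (k * q(w)) / (u + k * q(w))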

We now have a binary classification problem with parameters θ that can be trained to maximize the conditional
log-likelihood of D, using k negative samples per observation:
\mathcal{L}_{\mathrm{NCE}_k} = \sum_{(w,c) \in \mathcal{D}} \left( \log p(D = 1 \mid c, w) + k \, \mathbb{E}_{\bar{w} \sim q} \log p(D = 0 \mid c, \bar{w}) \right) .

Unfortunately, the expectation in the second term is still a difficult summation: it is k times the expected
log probability (according to the current model) of assigning the noise label D = 0, taken under the noise
distribution over all words in V in a context c. We still have a loop over the entire vocabulary. The final step
is therefore to replace this expectation with its Monte Carlo approximation:

\mathcal{L}^{\mathrm{MC}}_{\mathrm{NCE}_k} = \sum_{(w,c) \in \mathcal{D}} \left[ \log p(D = 1 \mid c, w) + k \times \sum_{i=1, \bar{w}_i \sim q}^{k} \frac{1}{k} \times \log p(D = 0 \mid c, \bar{w}_i) \right]

                                           = \sum_{(w,c) \in \mathcal{D}} \left[ \log p(D = 1 \mid c, w) + \sum_{i=1, \bar{w}_i \sim q}^{k} \log p(D = 0 \mid c, \bar{w}_i) \right] .
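Putting this together, here is a sketch of one (w, c) pair's contribution to the Monte Carlo objective, reusing the same hypothetical `s_theta`, `q`, and `sample_noise` as in the earlier sketches; a practical implementation would compute the log-probabilities in a numerically stable way.

    import math

    def nce_example_loss(s_theta, q, sample_noise, w, c, k):
        # log p(D = 1 | c, w) for the observed word, plus log p(D = 0 | c, w_bar)
        # for k noise words w_bar ~ q. Maximize the sum of this over the corpus.
        u = math.exp(s_theta(w, c))                 # u_theta(w, c), with z_c fixed to 1
        loss = math.log(u / (u + k * q(w)))         # log p(D = 1 | c, w)
        for _ in range(k):
            w_bar = sample_noise()                  # w_bar ~ q
            u_bar = math.exp(s_theta(w_bar, c))
            loss += math.log(k * q(w_bar) / (u_bar + k * q(w_bar)))  # log p(D = 0 | c, w_bar)
        return loss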

2.1 Asymptotic analysis


Although the objective L_{NCE_k} is intractable, its derivative sheds light on why NCE works. This quantity may
be written as

\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{NCE}_k} = \sum_{(w', c) \in \mathcal{D}} \sum_{w \in V} \left( \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)} \times \big( \tilde{p}(w \mid c) - u_\theta(w, c) \big) \frac{\partial}{\partial \theta} \log u_\theta(w, c) \right) .

It is easy to see that in the limiting case as k → ∞, this derivative tends to the gradient of the log likelihood
of the data D under pθ (and furthermore, L^{MC}_{NCE_k} → L_{NCE_k}). That is, the gradient is 0 when the model distribution
matches the empirical distribution.
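A one-line sketch of the limiting step: because q has full support, the weighting factor in front of the difference tends to 1 as k grows, so

\lim_{k \to \infty} \frac{k \times q(w)}{u_\theta(w, c) + k \times q(w)} = 1
\quad\Longrightarrow\quad
\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{NCE}_k} \to \sum_{(w', c) \in \mathcal{D}} \sum_{w \in V} \big( \tilde{p}(w \mid c) - u_\theta(w, c) \big) \frac{\partial}{\partial \theta} \log u_\theta(w, c) ,

which is the maximum likelihood gradient of a self-normalized model (uθ ≈ pθ) and vanishes exactly when uθ(·, c) = p̃(· | c).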

3 Negative sampling
Negative sampling is a variation of NCE used by the popular word2vec tool. It likewise generates a proxy
corpus and learns θ as a binary classification problem, but it defines the conditional probabilities of D given
(w, c) differently:
p(D = 0 \mid c, w) = \frac{1}{u_\theta(w, c) + 1}

p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + 1} .
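Since uθ(w, c) = exp sθ(w, c), these two probabilities are simply the logistic sigmoid of the score and of its negation, which is how word2vec-style code typically computes them. A minimal Python sketch with a hypothetical `s_theta`:

    import math

    def neg_sampling_p_true(s_theta, w, c):
        # p(D = 1 | c, w) = u / (u + 1) = sigmoid(s_theta(w, c)), since u = exp(s_theta(w, c)).
        return 1.0 / (1.0 + math.exp(-s_theta(w, c)))

    def neg_sampling_p_noise(s_theta, w, c):
        # p(D = 0 | c, w) = 1 / (u + 1) = sigmoid(-s_theta(w, c)).
        return 1.0 / (1.0 + math.exp(s_theta(w, c)))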

This objective can be understood in several ways. First, it is equivalent to NCE when k = |V | and q
is uniform. Second, it can be understood as the hinge objective of Collobert et al. (2011) where the max
function has been replaced with a softmax. As a result, aside from the k = |V | and uniform q case,
the conditional probabilities of D given (w, c) are not consistent with the language model probabilities of
(w, c) and therefore the θ estimated using this as an objective will not optimize the likelihood of the language
model in Eq. 1. Thus, while negative sampling may be appropriate for word representation learning, it does
not have the same asymptotic consistency guarantees that NCE has.
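To see the equivalence claimed above, note that with q uniform over V and k = |V| the NCE factor k × q(w) collapses to 1:

k \times q(w) = |V| \times \frac{1}{|V|} = 1
\quad\Longrightarrow\quad
p(D = 1 \mid c, w) = \frac{u_\theta(w, c)}{u_\theta(w, c) + k \times q(w)} = \frac{u_\theta(w, c)}{u_\theta(w, c) + 1} ,

recovering the negative sampling form; for any other k or non-uniform q the two parameterizations differ.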

4 Conclusion
NCE is an effective way of learning parameters for an arbitrary locally normalized language model. However,
negative sampling should be thought of as an alternative task for generating representations of words for use
in other tasks, but not, itself, as a method for learning parameters in a generative model of language.
Thus, if your goal is language modeling, you should use NCE; if your goal is word representation learning,
you should consider both NCE and negative sampling.

References
[Bengio and Senécal2003] Yoshua Bengio and Jean-Sébastien Senécal. 2003. Quick training of probabilistic neural
nets by importance sampling. In Proc. AISTATS.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel
Kuksa. 2011. Natural language processing (almost) from scratch. JMLR.
[Goldberg and Levy2014] Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.’s
negative-sampling word-embedding method.
[Gutmann and Hyvärinen2010] Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In Proc. AISTATS.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In Proc. NIPS.
[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently
with noise-contrastive estimation. In Proc. NIPS.
[Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural proba-
bilistic language models. In Proc. ICML.
[Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with
large-scale neural language models improves translation. In Proc. EMNLP.
