Bayesian Analysis in
Natural Language Processing
Second Edition
Synthesis Lectures on Human
Language Technologies
Editor
Graeme Hirst, University of Toronto
Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University
of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural
language processing, computational linguistics, information retrieval, and spoken language
understanding. Emphasis is on important new techniques, on new applications, and on topics that
combine two or more HLT subfields.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00905ED2V01Y201903HLT041
Lecture #41
Series Editor: Graeme Hirst, University of Toronto
Series ISSN
Print 1947-4040 Electronic 1947-4059
Bayesian Analysis
in Natural Language Processing
Second Edition
Shay Cohen
University of Edinburgh
Morgan & Claypool Publishers
ABSTRACT
Natural language processing (NLP) went through a profound transformation in the mid-1980s
when it shifted to make heavy use of corpora and data-driven techniques to analyze language.
Since then, the use of statistical techniques in NLP has evolved in several ways. One such exam-
ple of evolution took place in the late 1990s or early 2000s, when full-fledged Bayesian machin-
ery was introduced to NLP. This Bayesian approach to NLP has come to accommodate various
shortcomings in the frequentist approach and to enrich it, especially in the unsupervised setting,
where statistical learning is done without target prediction examples.
In this book, we cover the methods and algorithms that are needed to fluently read
Bayesian learning papers in NLP and to do research in the area. These methods and algorithms
are partially borrowed from both machine learning and statistics and are partially developed
“in-house” in NLP. We cover inference techniques such as Markov chain Monte Carlo sam-
pling and variational inference, Bayesian estimation, and nonparametric modeling. In response
to rapid changes in the field, this second edition of the book includes a new chapter on rep-
resentation learning and neural networks in the Bayesian context. We also cover fundamental
concepts in Bayesian statistics such as prior distributions, conjugacy, and generative modeling.
Finally, we review some of the fundamental modeling techniques in NLP, such as grammar
modeling, neural networks and representation learning, and their use with Bayesian analysis.
KEYWORDS
natural language processing, computational linguistics, Bayesian statistics, Bayesian
NLP, statistical learning, inference in NLP, grammar modeling in NLP, neural
networks, representation learning
Dedicated to Mia
Contents
List of Figures

1 Preliminaries
1.1 Probability Measures
1.2 Random Variables
1.2.1 Continuous and Discrete Random Variables
1.2.2 Joint Distribution over Multiple Random Variables
1.3 Conditional Distributions
1.3.1 Bayes' Rule
1.3.2 Independent and Conditionally Independent Random Variables
1.3.3 Exchangeable Random Variables
1.4 Expectations of Random Variables
1.5 Models
1.5.1 Parametric vs. Nonparametric Models
1.5.2 Inference with Models
1.5.3 Generative Models
1.5.4 Independence Assumptions in Models
1.5.5 Directed Graphical Models
1.6 Learning from Data Scenarios
1.7 Bayesian and Frequentist Philosophy (Tip of the Iceberg)
1.8 Summary
1.9 Exercises

2 Introduction
2.1 Overview: Where Bayesian Statistics and NLP Meet
2.2 First Example: The Latent Dirichlet Allocation Model
2.2.1 The Dirichlet Distribution
2.2.2 Inference
2.2.3 Summary
2.3 Second Example: Bayesian Text Regression
2.4 Conclusion and Summary
2.5 Exercises

3 Priors
3.1 Conjugate Priors
3.1.1 Conjugate Priors and Normalization Constants
3.1.2 The Use of Conjugate Priors with Latent Variable Models
3.1.3 Mixture of Conjugate Priors
3.1.4 Renormalized Conjugate Distributions
3.1.5 Discussion: To Be or not to Be Conjugate?
3.1.6 Summary
3.2 Priors Over Multinomial and Categorical Distributions
3.2.1 The Dirichlet Distribution Re-Visited
3.2.2 The Logistic Normal Distribution
3.2.3 Discussion
3.2.4 Summary
3.3 Non-Informative Priors
3.3.1 Uniform and Improper Priors
3.3.2 Jeffreys Prior
3.3.3 Discussion
3.4 Conjugacy and Exponential Models
3.5 Multiple Parameter Draws in Models
3.6 Structural Priors
3.7 Conclusion and Summary
3.8 Exercises

4 Bayesian Estimation
4.1 Learning with Latent Variables: Two Views
4.2 Bayesian Point Estimation
4.2.1 Maximum a Posteriori Estimation
4.2.2 Posterior Approximations Based on the MAP Solution
4.2.3 Decision-Theoretic Point Estimation
4.2.4 Discussion and Summary
4.3 Empirical Bayes
4.4 Asymptotic Behavior of the Posterior
4.5 Summary
4.6 Exercises

5 Sampling Methods
5.1 MCMC Algorithms: Overview
5.2 NLP Model Structure for MCMC Inference
5.2.1 Partitioning the Latent Variables
5.3 Gibbs Sampling
5.3.1 Collapsed Gibbs Sampling
5.3.2 Operator View
5.3.3 Parallelizing the Gibbs Sampler
5.3.4 Summary
5.4 The Metropolis–Hastings Algorithm
5.4.1 Variants of Metropolis–Hastings
5.5 Slice Sampling
5.5.1 Auxiliary Variable Sampling
5.5.2 The Use of Slice Sampling and Auxiliary Variable Sampling in NLP
5.6 Simulated Annealing
5.7 Convergence of MCMC Algorithms
5.8 Markov Chain: Basic Theory
5.9 Sampling Algorithms Not in the MCMC Realm
5.10 Monte Carlo Integration
5.11 Discussion
5.11.1 Computability of Distribution vs. Sampling
5.11.2 Nested MCMC Sampling
5.11.3 Runtime of MCMC Samplers
5.11.4 Particle Filtering
5.12 Conclusion and Summary
5.13 Exercises

6 Variational Inference
6.1 Variational Bound on Marginal Log-Likelihood
6.2 Mean-Field Approximation
6.3 Mean-Field Variational Inference Algorithm
6.3.1 Dirichlet-Multinomial Variational Inference
6.3.2 Connection to the Expectation-Maximization Algorithm
6.4 Empirical Bayes with Variational Inference
6.5 Discussion
6.5.1 Initialization of the Inference Algorithms
6.5.2 Convergence Diagnosis
6.5.3 The Use of Variational Inference for Decoding
6.5.4 Variational Inference as KL Divergence Minimization
6.5.5 Online Variational Inference
6.6 Summary
6.7 Exercises

Bibliography
Index
List of Figures
1.1 Graphical model for the latent Dirichlet allocation model

List of Algorithms
5.1 The Gibbs sampling algorithm
5.2 The collapsed Gibbs sampling algorithm
5.3 The operator Gibbs sampling algorithm
5.4 The parallel Gibbs sampling algorithm
5.5 The Metropolis–Hastings algorithm
5.6 The rejection sampling algorithm
CHAPTER 1
Preliminaries
This chapter is mainly intended to be used as a refresher on basic concepts in Probability and
Statistics required for the full comprehension of this book. Occasionally, it also provides notation
that will be used in subsequent chapters in this book.
Keeping this in mind, this chapter is written somewhat differently than typical introduc-
tions to basic concepts in Probability and Statistics. For example, this chapter defines concepts
directly for random variables, such as conditional distributions, independence and conditional
independence, the chain rule and Bayes’ rule rather than giving preliminary definitions for these
constructs for events in a sample space. For a deeper introductory investigation of probability
theory, see Bertsekas and Tsitsiklis (2002).
Sections 1.1–1.2 (probability measures and random variables) are given for completeness
in a rather formal way. If the reader is familiar with these basic notions and their constructions,
(s)he can skip to Section 1.3 where mechanisms essential to Bayesian learning, such as the chain
rule, are introduced.
$$p_X(A) = p\left(X^{-1}(A)\right),$$
where $p_X$ is the probability measure induced by the random variable $X$ and $p$ is a probability measure originally defined for $\Omega$. The sample space for $p_X$ is $\mathbb{R}$. The set of events for this sample space includes all $A \subseteq \mathbb{R}$ such that $X^{-1}(A)$ is an event in the original sample space of $p$.
It is common to define a statistical model directly in terms of random variables, instead of explicitly defining a sample space and its corresponding real-valued functions. In this case, random variables do not have to be interpreted as real-valued functions and the sample space is understood to be the range of the random variable function. For example, if one wants to define a probability distribution over a language vocabulary, then one can define a random variable $X(\omega) = \omega$ with $\omega$ ranging over words in the vocabulary. Following this, the probability of a word in the vocabulary is denoted by $p(X \in \{\omega\}) = p(X = \omega)$.
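To make this last point concrete, the following is a minimal Python sketch (with a hypothetical toy vocabulary and illustrative probabilities, not taken from the book) of a discrete random variable $X$ ranging over words, represented directly through its PMF:

```python
import random

# A hypothetical toy vocabulary with an illustrative PMF p(X = word).
# The probabilities are non-negative and sum to 1.
pmf = {"the": 0.4, "dog": 0.2, "barks": 0.2, "loudly": 0.2}

def prob(word):
    """Return p(X = word); words outside the vocabulary get probability 0."""
    return pmf.get(word, 0.0)

def sample():
    """Draw a single word according to the PMF."""
    words = list(pmf)
    weights = [pmf[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

print(prob("dog"))   # 0.2
print(sample())      # e.g., "the"
```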
1 These axioms are: (1) $\Omega$ needs to be a measurable set; (2) The complement of a measurable set is a measurable set in the collection; (3) Any countable union of measurable sets is also measurable.
Random variables can also be multivariate. In that case, they would map elements of the sample space to a subset of $\mathbb{R}^d$ for some fixed $d$ (or a tuple in some other space).2
For a discrete random variable $X$, the induced measure satisfies
$$p(X \in A) = \sum_{x \in A} p(X = x),$$
where $A$ is a subset of the possible values $X$ can take. Note that this equation is the result of the axiom of probability measures, where the probability of an event equals the sum of probabilities of disjoint events that precisely cover that event (singletons, in our case).
The most common discrete distribution we will be making use of is the multinomial distribution, which serves as a building block for many NLP models (see Chapter 3 and also Section B.1). With the multinomial, the sample space $\Omega$ is a finite set of events, for example, a finite vocabulary of words. The PMF attaches a probability to each word in the vocabulary.
The continuous variables discussed in this book, on the other hand, are assumed to have a probability density function (PDF). Similarly to a PMF, this is a function that attaches a weight to each element in the sample space, $p(\theta)$. The PDF is assumed to be integrable over the sample space $\Omega$. (Here integration refers to Lebesgue integration.) This probability density function induces a probability measure $p(\theta \in A)$, which is defined as:
$$p(\theta \in A) = \int_{\theta \in A} p(\theta)\, d\theta.$$
2 The more abstract measure-theoretic definition of a random variable is a function from a sample space (with a given
probability measure) to a measurable space E such that the preimage of this function, for any measurable set in E , is also
measurable in the probability space. In most NLP applications, it is sufficient to treat random variables as real functions or
functions which induce probability measures as described in this section.
The parallelism between PMFs and PDFs is not incidental. Both of these concepts can
be captured using a unified mathematical framework based on measure theory. As mentioned
earlier, this is beyond the scope of this book.
For notation, we use $p(X = x)$, with an explicit equal sign, to denote the PMF value of a discrete variable $X$. When the random variable discussed is obvious from context, we will just use notation such as $p(x)$ to denote $p(X = x)$. We denote the PMF itself (as a function) by $p(X)$ (without grounding $X$ in a certain element in the sample space). We use $p(\theta)$ to denote both a specific PDF value of the random variable $\theta$ and also the PDF itself (as a function).
With real-valued random variables, there is a special distribution function called the cumulative distribution function (CDF). For a real-valued random variable $\theta$, the CDF is a function $F \colon \mathbb{R} \to [0, 1]$ such that $F(y) = p(\theta \le y)$. The CDF is also generalized to the multivariate case, where for a $\theta$ that represents a random variable with a range in $\mathbb{R}^d$, the CDF $F \colon \mathbb{R}^d \to [0, 1]$ is such that $F(y) = p(\theta_1 \le y_1, \ldots, \theta_d \le y_d)$. CDFs have a central role in statistical analysis, but are used less frequently in Bayesian NLP.
$$p(X \in A, Y \in B) = p\left(X^{-1}(A) \cap Y^{-1}(B)\right).$$
It is often the case that we take several sets $\{\Omega_1, \ldots, \Omega_m\}$ and combine them into a single sample space $\Omega = \Omega_1 \times \cdots \times \Omega_m$. Each of the $\Omega_i$ is associated with a random variable. Based on this, a joint probability distribution can be defined for all of these random variables together. For example, consider $\Omega = V \times P$ where $V$ is a vocabulary of words and $P$ is a set of part-of-speech tags. This sample space enables us to define probabilities $p(x, y)$ where $X$ denotes a word associated with a part of speech $Y$. In this case, $x \in V$ and $y \in P$.
With any joint distribution, we can marginalize some of the random variables to get a
distribution which is defined over a subset of the original random variables (so it could still be
a joint distribution, only over a subset of the random variables). Marginalization is done using
integration (for continuous variables) or summing (for discrete random variables). This operation
of summation or integration eliminates the random variable from the joint distribution. The
result is a joint distribution over the non-marginalized random variables.
For the simple part-of-speech example above, we could either get the marginal $p(x) = \sum_{y \in P} p(x, y)$ or $p(y) = \sum_{x \in V} p(x, y)$. The marginals $p(X)$ and $p(Y)$ do not uniquely determine the joint distribution value $p(X, Y)$. Only the reverse is true. However, whenever $X$ and $Y$ are independent, then the joint distribution can be determined using the marginals. More about this in Section 1.3.2.
$$p(X \in A \mid Y = y) = \frac{p(X \in A, Y = y)}{p(Y = y)} \qquad (1.1)$$
is to be interpreted as a conditional distribution that determines the probability of $X \in A$ conditioned on $Y$ obtaining the value $y$. The bar denotes that we are clamping $Y$ to the value $y$ and identifying the distribution induced on $X$ in the restricted sample space. Informally, the conditional distribution takes the part of the sample space where $Y = y$ and re-normalizes the joint distribution such that the result is a probability distribution defined only over that part of the sample space.
When we consider the joint distribution in Equation 1.1 to be a function that maps events to probabilities in the space of $X$, with $y$ being fixed, we note that the value of $p(Y = y)$ is actually a normalization constant that can be determined from the numerator $p(X \in A, Y = y)$. For example, if $X$ is discrete (using a PMF), then:
$$p(Y = y) = \sum_{x} p(X = x, Y = y).$$
Since $p(Y = y)$ is a constant with respect to the values that $X$ takes, we will often use the notation:
$$p(X \in A \mid Y = y) \propto p(X \in A, Y = y)$$
to denote that the conditional distribution over $X$ given $Y$ is proportional to the joint distribution, and that normalization of this joint distribution is required in order to get the conditional distribution.
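The following Python sketch (reusing the made-up joint PMF from the earlier marginalization example) illustrates this renormalization view of conditioning: clamp $Y$ to a value, treat the joint values as unnormalized weights, and divide by their sum, which is exactly $p(Y = y)$:

```python
# A minimal sketch of computing p(X = x | Y = y) by renormalizing the joint,
# i.e., p(x | y) is proportional to p(x, y). The numbers are illustrative only.
joint = {
    ("walk", "NOUN"): 0.2, ("walk", "VERB"): 0.2,
    ("run",  "NOUN"): 0.1, ("run",  "VERB"): 0.5,
}

def conditional_given_tag(tag):
    """Return the distribution p(X | Y = tag) as a dictionary over words."""
    # Unnormalized weights: the joint values with Y clamped to `tag`.
    weights = {w: p for (w, t), p in joint.items() if t == tag}
    z = sum(weights.values())  # this sum is p(Y = tag), the normalization constant
    return {w: p / z for w, p in weights.items()}

print(conditional_given_tag("VERB"))  # {'walk': 0.286..., 'run': 0.714...}
```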
In their most general form, conditional distributions (Equation 1.1) can include more than a single random variable on both sides of the bar. The two sets of random variables for each side of the bar also do not have to be disjoint. In addition, we do not have to clamp the conditioned random variables to a single value—they can be clamped to any event. All of this leads to the following general form of conditional distributions. Let $X_1, \ldots, X_n$ be a set of random variables. Let $I = \{a_1, \ldots, a_m\}$ and $J = \{b_1, \ldots, b_\ell\}$ be subsets of $\{1, \ldots, n\}$. In addition, let $A_i$ for $i \in I$ be an event for the sample space of $X_{a_i}$ and $B_j$ for $j \in J$ be an event for the sample space of $X_{b_j}$. Based on this, we can define the following conditional distribution:
$$p\left(X_{a_1} \in A_1, \ldots, X_{a_m} \in A_m \mid X_{b_1} \in B_1, \ldots, X_{b_\ell} \in B_\ell\right) = \frac{p\left(X_{a_1} \in A_1, \ldots, X_{a_m} \in A_m, X_{b_1} \in B_1, \ldots, X_{b_\ell} \in B_\ell\right)}{p\left(X_{b_1} \in B_1, \ldots, X_{b_\ell} \in B_\ell\right)}.$$
The Chain Rule The "chain rule" is a direct result of the definition of conditional probability distributions. It permits us to express a joint distribution in terms of a sequence of multiplications of conditional distributions. The simplest version of the chain rule states that for any two random variables $X$ and $Y$, it holds that $p(X, Y) = p(X)\,p(Y \mid X)$ (assuming $p(Y \mid X)$ is always defined). In the more general case, it states that we can decompose the joint distribution over a sequence of random variables $X^{(1)}, \ldots, X^{(n)}$ to be:
$$p\left(X^{(1)}, \ldots, X^{(n)}\right) = p\left(X^{(1)}\right) \prod_{i=2}^{n} p\left(X^{(i)} \mid X^{(1)}, \ldots, X^{(i-1)}\right).$$
With the chain rule, we can also treat a subset of random variables as a single unit, so for example, it is true that:
$$p\left(X^{(1)}, X^{(2)}, X^{(3)}\right) = p\left(X^{(1)}\right) p\left(X^{(2)}, X^{(3)} \mid X^{(1)}\right),$$
or alternatively:
$$p\left(X^{(1)}, X^{(2)}, X^{(3)}\right) = p\left(X^{(1)}, X^{(2)}\right) p\left(X^{(3)} \mid X^{(1)}, X^{(2)}\right).$$
$$p(X = x, Y = y) = p(X = x)\,p(Y = y \mid X = x) = p(Y = y)\,p(X = x \mid Y = y).$$
Taking the last equality above, $p(X = x)\,p(Y = y \mid X = x) = p(Y = y)\,p(X = x \mid Y = y)$, and dividing both sides by $p(X = x)$ results in Bayes' rule as described in Equation 1.2.
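As a small numerical illustration of this inversion, the Python sketch below (with made-up probabilities) turns a hypothetical $p(\text{word} \mid \text{tag})$ and a prior $p(\text{tag})$ into the posterior $p(\text{tag} \mid \text{word})$:

```python
# A sketch of Bayes' rule with illustrative numbers: invert p(word | tag)
# into p(tag | word) using a prior p(tag).
p_tag = {"NOUN": 0.6, "VERB": 0.4}
p_word_given_tag = {
    "NOUN": {"run": 0.1, "walk": 0.9},
    "VERB": {"run": 0.7, "walk": 0.3},
}

def posterior_tag(word):
    """p(tag | word) = p(tag) p(word | tag) / sum over t of p(t) p(word | t)."""
    unnorm = {t: p_tag[t] * p_word_given_tag[t][word] for t in p_tag}
    z = sum(unnorm.values())  # p(word), obtained by marginalizing over tags
    return {t: v / z for t, v in unnorm.items()}

print(posterior_tag("run"))  # NOUN: 0.06/0.34 ~ 0.18, VERB: 0.28/0.34 ~ 0.82
```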
Bayes’ rule is the main pillar in Bayesian statistics for reasoning and learning from data.
Bayes’ rule can invert the relationship between “observations” (the data) and the random vari-
ables we are interested in predicting. This makes it possible to infer target predictions from
such observations. A more detailed description of these ideas is provided in Section 1.5, where
statistical modeling is discussed.
or alternatively $p(Y \in B \mid X \in A) = p(Y \in B)$ (these two definitions are correct and equivalent under very mild conditions that prevent ill-formed conditioning on an event that has zero probability).
Using the chain rule, it can also be shown that the above two definitions are equivalent to the requirement that $p(X \in A, Y \in B) = p(X \in A)\,p(Y \in B)$ for all $A$ and $B$.
Independence between random variables implies that the random variables do not provide
information about each other. This means that knowing the value of X does not help us infer
anything about the value of Y —in other words, it does not change the probability of Y . (Or
vice-versa—Y does not tell us anything about X .) While independence is an important concept
in probability and statistics, in this book we will more frequently make use of a more refined
notion of independence, called “conditional independence”—which is a generalization of the
notion of independence described in the beginning of this section. A pair of random variables $(X, Y)$ is conditionally independent given a third random variable $Z$ if, for any $A$, $B$ and $z$, it holds that $p(X \in A \mid Y \in B, Z = z) = p(X \in A \mid Z = z)$.
Conditional independence between two random variables (given a third one) implies that
the two variables are not informative about each other, if the value of the third one is known.3
Conditional independence (and independence) can be generalized to multiple random
variables as well. We say that a set of random variables $X_1, \ldots, X_n$ is mutually conditionally independent given another set of random variables $Z_1, \ldots, Z_m$ if the following applies for any $A_1, \ldots, A_n$ and $z_1, \ldots, z_m$:
$$p(X_1 \in A_1, \ldots, X_n \in A_n \mid Z_1 = z_1, \ldots, Z_m = z_m) = \prod_{i=1}^{n} p(X_i \in A_i \mid Z_1 = z_1, \ldots, Z_m = z_m).$$
This type of independence is stronger than pairwise independence for a set of random variables, in which only pairs of random variables are required to be independent. (Also see exercises.)
for any set of $m$ integers, $\{a_1, \ldots, a_m\}$. The interpretation of this is that exchangeable random variables can be represented as a (potentially infinite) mixture distribution. This theorem is also called the "representation theorem."
The frequentist approach assumes the existence of a fixed set of parameters from which the
data were generated, while the Bayesian approach assumes that there is some prior distribution
over the set of parameters that generated the data. (This will become clearer as the book pro-
gresses.) De Finetti’s theorem provides another connection between the Bayesian approach and
3 To show that conditional independence is a generalized notion of independence, consider a Z that is a constant value.
4 A permutation on $S = \{1, \ldots, n\}$ is a bijection $\pi \colon S \to S$.
the frequentist one. The standard “independent and identically distributed” (i.i.d.) assumption
in the frequentist setup can be asserted as a setup of exchangeability where $p(\theta)$ is a point-mass distribution over the unknown (but single) parameter from which the data are sampled. This leads to the observations being unconditionally independent and identically distributed. In the Bayesian setup, however, the observations are correlated, because $p(\theta)$ is not a point-mass distribution. The prior distribution plays the role of $p(\theta)$. For a detailed discussion of this similarity,
see O’Neill (2009).
The exchangeability assumption, when used in the frequentist setup, is weaker than the
i.i.d. assumption, and fixes an important conceptual flaw (in the eye of Bayesians) in the i.i.d.
assumption. In the i.i.d. setup the observed random variables are independent of each other,
and as such, do not provide information about each other when the parameters are fixed. The
probability of the $n$th observation ($X_n$), conditioned on the first $n - 1$ observations, is identical to the marginal distribution over the $n$th observation, no matter what the first $n - 1$ observations were. The exchangeability assumption, on the other hand, introduces correlation between the different observations, and as such, the distribution $p(X_n \mid X_1, \ldots, X_{n-1})$ will not be just $p(X_n)$.
Exchangeability appears in several contexts in Bayesian NLP. For example, in the LDA
model (Chapter 2), the words in each document are exchangeable, meaning that they are condi-
tionally independent given the topic distribution. The Chinese restaurant process (Chapter 7)
is also an exchangeable model, which makes it possible to derive its posterior distribution.
For the discrete random variables that we consider in this book, we usually consider expec-
tations of functions over these random variables. As mentioned in Section 1.2, discrete random
variable values often range over a set which is not numeric. In these cases, there is no “mean
value" for the values that these random variables accept. Instead, we will compute the mean value of a real-valued function of these random variables.
With $f$ being such a function, the expectation $E[f(X)]$ is defined as:
$$E[f(X)] = \sum_{x} p(x) f(x).$$
For the linguistic structures that are used in this book, we will often use a function f that
indicates whether a certain property holds for the structure. For example, if the sample space of
$X$ is a set of sentences, $f(x)$ can be an indicator function that states whether the word "spring" appears in the sentence $x$ or not: $f(x) = 1$ if the word "spring" appears in $x$ and 0 otherwise. In that case, $f(X)$ itself can be thought of as a Bernoulli random variable, i.e., a binary random variable that has a certain probability of being 1 and the complementary probability of being 0. The expectation $E[f(X)]$ gives the probability that this random variable is 1. Alternatively, $f(x)$ can count how many times the word "spring" appears in the sentence $x$. In that case, it can be viewed as a sum of Bernoulli variables, each indicating whether a certain word in the sentence $x$ is "spring" or not.
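The following Python sketch makes the indicator case concrete, using a hypothetical three-sentence sample space with made-up probabilities:

```python
# E[f(X)] for an indicator function f over a toy distribution on "sentences".
p = {
    "spring is here": 0.5,
    "winter is coming": 0.3,
    "spring rain in spring": 0.2,
}

def f(sentence):
    """Indicator: 1 if the word 'spring' appears in the sentence, else 0."""
    return 1 if "spring" in sentence.split() else 0

expectation = sum(p[x] * f(x) for x in p)
print(expectation)  # 0.7 -- also the probability that f(X) = 1
```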
Expectations are linear operators. This means that if $\theta_1$ and $\theta_2$ are two random variables and $a$, $b$ and $c$ are real values, then
$$E[a\theta_1 + b\theta_2 + c] = a E[\theta_1] + b E[\theta_2] + c. \qquad (1.3)$$
Equation 1.3 holds even if the random variables are not independent. Expectations are linear for both continuous and discrete random variables, even when such random variables are mixed together in the linear expression.
As with conditional distributions, one can define conditional expectation. For example, $E[f(X) \mid Y = y]$ would be the expectation of $f(X)$ under the conditional distribution $p(X \mid Y = y)$. The function $g(y) = E[f(X) \mid Y = y]$ can be thought of as a random variable. In that case, it can be shown that $E[g(Y)] = E[E[f(X) \mid Y]] = E[f(X)]$. (This is a direct result of Fubini's theorem (Ash and Doléans-Dade, 2000). This theorem roughly states that under some mild conditions, any order of integration or summation over several random variables gives the same result.)
It is common practice to denote by subscript the underlying distribution which is used to
compute the expectation, when it cannot be uniquely determined from context. For example,
$E_q[f(X)]$ denotes the expectation of $f(X)$ with respect to a distribution $q$, i.e.:
$$E_q[f(X)] = \sum_{x} q(x) f(x).$$
There are several types of expectations for real-valued random variables which are deemed important in various applications or when we are interested in summarizing the random variable. One such type of expectation is "moments": the $n$-th order moment of a random variable $X$ around point $c$ is defined to be $E[(X - c)^n]$. With $n = 1$ and $c = 0$, we get the mean of the random variable. With $c = E[X]$ and $n = 2$, we get the variance of the random variable, which also equals $E[X^2] - E[X]^2$.
The idea of moments can be generalized to several random variables. The most commonly used generalization is covariance. The covariance between two random variables $X$ and $Y$ is $E[XY] - E[X]E[Y]$. Note that if $Y = X$, then the covariance is reduced to the variance of $X$. If two random variables are independent, then their covariance is 0. The opposite is not necessarily true—two random variables can be dependent, while their covariance is still 0. In that case, the random variables are only uncorrelated, but not independent.
A handful of moments sometimes uniquely define a probability distribution. For example,
a coin toss distribution (i.e., a Bernoulli distribution) is uniquely defined by the first moment,
which gives the probability of the coin toss giving the outcome 1. A Gaussian distribution is
uniquely defined by its first and second order moments (or mean and variance).
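As an illustration of these summary quantities, here is a short Python sketch that computes the mean, variance and covariance from a toy joint PMF over two binary random variables (the numbers are made up):

```python
# Moments from a toy joint PMF over two real-valued random variables X and Y.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)                        # first moment of X
var_x = expect(lambda x, y: x ** 2) - mean_x ** 2      # E[X^2] - E[X]^2
mean_y = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y  # E[XY] - E[X]E[Y]

print(mean_x, var_x, cov_xy)  # 0.6, 0.24, 0.1 -- nonzero covariance: X and Y are correlated
```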
1.5 MODELS
The major goal of statistical modeling is to analyze data in order to make predictions, or to
help understand the properties of a process in “nature” that exhibits randomness, through its
modeling in mathematical terms. One way to define a statistical model is to represent it as a
family of probability distribution functions over a set of random variables. Statistical models
can also be described in terms of indices for probability distributions. In that case, a statistical
model M is a set, such that each member of the set identifies a specific probability distribution.
For example, let $I$ denote the segment $[0, 1]$ to define probability distributions over a random variable $X$ that takes the values 0 or 1 (a Bernoulli variable, or a "coin flip" variable). Each $\theta \in I$ is a number between 0 and 1. The distribution associated with that $\theta$ will be denoted by $p(X \mid \theta)$, such that $p(X = 0 \mid \theta) = \theta$ and $p(X = 1 \mid \theta) = 1 - \theta$. This set of distributions, $\mathcal{M}$, is an example of a parametric model, as described below in Section 1.5.1.
The term "model" often refers, especially in colloquial discussions, to either a specific $p \in \mathcal{M}$ (such as "the estimated model"—i.e., a specific member of the model family that is identified through data), a non-specific $p \in \mathcal{M}$ or the set of all distributions in $\mathcal{M}$. We follow this norm in the book, and use the word "model" in all of these cases, where it is clear from context what the word refers to.
Models are often composed of well-studied distributions, such as the Gaussian distribu-
tion, Bernoulli distribution or the multinomial distribution. This means that there is a way to
write the joint distribution as a product of conditional distributions, such that these conditional
distributions are well-known distributions. This is especially true with generative models (Sec-
tion 1.5.3). We assume some basic familiarity with the important distributions that are used in
NLP, and also give in Appendix B a catalog of the especially common ones.
$$L\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right) = \prod_{i=1}^{n} p\left(x^{(i)} \mid \theta\right) = \prod_{i=1}^{n} \theta^{x^{(i)}} (1 - \theta)^{1 - x^{(i)}} = \theta^{\sum_{i=1}^{n} x^{(i)}} (1 - \theta)^{n - \sum_{i=1}^{n} x^{(i)}}.$$
The log-likelihood is just the logarithm of the likelihood. In the above example it is $\log L\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right) = \left(\sum_{i=1}^{n} x^{(i)}\right) \log \theta + \left(n - \sum_{i=1}^{n} x^{(i)}\right) \log(1 - \theta)$.
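As a quick check of these formulas, the following Python sketch evaluates the likelihood and the log-likelihood for a made-up sample of binary observations and a made-up parameter value:

```python
import math

# Bernoulli likelihood and log-likelihood for a toy sample and a candidate theta.
x = [1, 0, 1, 1, 0, 1]  # observed data x^(1), ..., x^(n), each 0 or 1
theta = 0.7

n, s = len(x), sum(x)
likelihood = theta ** s * (1 - theta) ** (n - s)
log_likelihood = s * math.log(theta) + (n - s) * math.log(1 - theta)

print(likelihood)                            # ~0.0216
print(log_likelihood, math.log(likelihood))  # the two values agree
```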
• Given observed data $x$ (a realization of the random variable $X$), use Bayes' rule to obtain a probability distribution over $\mathcal{M}$, the posterior distribution:
$$p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int_{\theta'} p(\theta')\, p(x \mid \theta')\, d\theta'}. \qquad (1.4)$$
Note that all quantities on the right-hand side are known (because $x$ is grounded in a specific value), and therefore, from a mathematical point of view, the posterior $p(\theta \mid x)$ can be fully identified.
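For intuition only, Equation 1.4 can be approximated numerically for a one-parameter coin-flip style model by replacing the integral in the denominator with a sum over a grid of parameter values. The Python sketch below does this with a uniform prior and made-up data; it is not one of the inference algorithms discussed later in the book, just an illustration that the posterior is fully determined once $x$ is observed:

```python
# Grid approximation of the posterior p(theta | x) in Equation 1.4 for a
# Bernoulli likelihood with a uniform prior (illustrative data and prior).
x = [1, 0, 1, 1, 0, 1]
n, s = len(x), sum(x)

grid = [i / 1000 for i in range(1, 1000)]          # candidate theta values in (0, 1)
prior = {t: 1.0 for t in grid}                     # uniform prior (up to a constant)
unnorm = {t: prior[t] * t ** s * (1 - t) ** (n - s) for t in grid}
z = sum(unnorm.values())                           # grid stand-in for the denominator
posterior = {t: u / z for t, u in unnorm.items()}  # discrete approximation of p(theta | x)

post_mean = sum(t * p for t, p in posterior.items())
print(round(post_mean, 3))  # close to (s + 1) / (n + 2) = 0.625 for this prior
```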
In the above steps, our goal was to infer a distribution over the set of parameters, which
essentially integrates information from the prior (this information tells us how a priori likely
each parameter is) with the information we get about the parameters through the observed data.
Instead of identifying a single distribution in $\mathcal{M}$ (as we would do in the frequentist setting), we now have a distribution over $\mathcal{M}$. There are many variants to this idea of inference in the
Bayesian setting, and a major goal of this book is to cover a significant part of these variants—
those which are necessary for understanding Bayesian NLP papers. In addition, the steps above
5 The regularity conditions require the model to be identifiable; the parameter space to be compact; the continuity of the
log-likelihood; and a bound on the log-likelihood function with respect to the data with an integrable function that does not
depend on the parameters. See Casella and Berger (2002) for more details.
just give a mathematical formulation for Bayesian inference. In practice, much care is needed
when applying Bayes’ rule in Equation 1.4. Chapter 2 goes into more detail on how Bayesian
inference is done for a simple model, and the rest of the book focuses on the harder cases where
more care is required.
6 Any generative model can be transformed into a discriminative model by conditioning on a specific input instance. The
reverse transformation from a discriminative model to a generative model is not possible without introducing an underspecified
factor, a probability distribution over the input space.
• For $j \in \{1, \ldots, n\}$:
    Generate $z_j \in \{1, \ldots, K\}$ from Multinomial($\theta$).
    Generate $x_j$ from Normal($\mu_{z_j}$, $\Sigma_{z_j}$).

Generative Story 1.1: The generative story for a Gaussian mixture model.
can be applied to the joint distribution, with a certain ordering over the random variables, so that
the joint distribution is written in a more compact way as a product of factors—each describing
a conditional distribution of a random variable or a set of random variables given a small, local
set of random variables.
It is often the case that describing a joint distribution based only on the form of its factors
can be confusing, or not sufficiently detailed. Representations such as graphical models (see
Section 1.5.4) can be even more restrictive and less revealing, since they mostly describe the
independence assumptions that exist in the model.
In these cases, this book uses a verbal generative story description, as a set of bullet points
that procedurally describe how each of the variables is generated in the model. For example,
generative story 1.1 describes a mixture of Gaussian models.
Behind lines 2 and 3 lurk simple statistical models with well-known distributions. The first line assumes a probability distribution $p(Z_j \mid \theta)$. The second assumes a Gaussian probability distribution $p(X_j \mid \mu_{z_j}, \Sigma_{z_j})$. Combined together, and taking into account the loop in lines 1–3, they yield a joint probability distribution:
$$p(X_1, \ldots, X_n, Z_1, \ldots, Z_n \mid \theta, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K) = \prod_{j=1}^{n} p(X_j, Z_j \mid \theta, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K).$$
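The generative story can be run directly as a sampling program. Here is a Python sketch of Generative Story 1.1 with illustrative parameter values; for simplicity the Gaussians are one-dimensional, so each $\Sigma_k$ reduces to a standard deviation:

```python
import random

# Sampling from the Gaussian mixture model of Generative Story 1.1
# (toy, one-dimensional parameter values).
K = 2
theta = [0.3, 0.7]    # mixture proportions, one per component
mu = [-2.0, 4.0]      # component means mu_1, ..., mu_K
sigma = [1.0, 0.5]    # component standard deviations (stand-ins for Sigma_k)

def generate(n):
    data = []
    for _ in range(n):
        # Generate z_j from Multinomial(theta).
        z = random.choices(range(K), weights=theta, k=1)[0]
        # Generate x_j from Normal(mu_{z_j}, Sigma_{z_j}).
        x = random.gauss(mu[z], sigma[z])
        data.append((z, x))
    return data

for z, x in generate(5):
    print(z, round(x, 2))
```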
In this book, generative story boxes are included with some information about the vari-
ables, constants and parameters that exist in the model. This can be thought of as the signature
of the generative story. This signature often also tells, as in the above case, which variables are
assumed to be observed in the data, and which are assumed to be latent. Clearly, this is not a
property of the joint distribution itself, but rather depends on the context and the use of this
joint distribution as a statistical model.
As mentioned above, generative stories identify a joint distribution over the parameters
in the model, where this joint distribution is a product of several factors. This is related to the
chain rule (Section 1.3). The generative story picks an ordering for the random variables, and the
chain rule is applied using that order to yield the joint distribution. Each factor can theoretically
condition on all possible random variables that were generated before, but the independence
assumptions in the model make some of these variables unnecessary to condition on.
Figure 1.1: Graphical model for the latent Dirichlet allocation model, drawn in plate notation with nodes $\alpha$, $\beta$ (in a plate of size $K$), $\theta$, $Z$ and $W$. Number of topics denoted by $K$, number of documents denoted by $M$ and number of words per document denoted by $N$.
if $X$ and $Y \cup Z$ are conditionally independent given $W$, then $X$ and $Y$ are conditionally independent given $Z \cup W$. Here, $X$, $Y$, $Z$ and $W$ are subsets of random variables in a probability distribution. See Pearl (1988) for more information.
Bayesian networks also include a graphical mechanism to describe a variable, or unfixed,
number of random variables, using so-called “plate notation.” With plate notation, a set of ran-
dom variables is placed inside plates. A plate represents a set of random variables with some
count. For example, a plate could be used to describe a set of words in a document. Figure 1.1
provides an example of a use of the graphical plate language. Random variables, denoted by
circles, are the basic building blocks in this graphical language; the plates are composed of such
random variables (or other plates) and denote a “larger object.” For example, the random variable
W stands for a word in a document, and the random variable Z is a topic variable associated
with that word. As such, the plate as a whole denotes a document, which is indeed a larger object
composed from N random variables from the type of Z and W .
In this notation, edges determine the conditional independence assumptions in the model.
The joint distribution over all random variables can be read by topologically sorting the elements
in the graphical representation, and then multiplying factors, starting at the root, such that each
random variable (or a group of random variables in a plate) conditions on its parents, determined
by the edges in the graph. For example, the graphical model in Figure 1.1 describes the following
joint distribution:
M
Y
p W1.j / ; : : : ; Wn.j / ; Z1.j / ; : : : ; ZN
.j /
; .j / jˇ1 ; : : : ; ˇK ; ˛
j D1
!
M
Y n
Y
D p. .j / j˛/ p Zi.j / j p Wi.j / jZi.j / ; .j / ; ˇ1 ; : : : ; ˇK : (1.5)
j D1 i D1
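Reading Equation 1.5 as a sampling procedure gives the usual LDA generative process. The Python sketch below implements it for a tiny made-up vocabulary and two hand-specified topics; the Dirichlet draw is obtained by normalizing independent Gamma draws, and all numbers are illustrative:

```python
import random

# A sketch of the LDA generative process of Figure 1.1 / Equation 1.5.
vocab = ["goal", "player", "senator", "bill"]
K, M, N = 2, 3, 6           # topics, documents, words per document
alpha = [0.5] * K           # Dirichlet hyperparameter for theta^(j)
beta = [                    # one word distribution per topic (toy values)
    [0.45, 0.45, 0.05, 0.05],   # a "sports"-like topic
    [0.05, 0.05, 0.45, 0.45],   # a "politics"-like topic
]

def dirichlet(a):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [random.gammavariate(ai, 1.0) for ai in a]
    total = sum(draws)
    return [d / total for d in draws]

for j in range(M):
    theta_j = dirichlet(alpha)                                  # theta^(j) | alpha
    doc = []
    for i in range(N):
        z = random.choices(range(K), weights=theta_j, k=1)[0]   # Z_i^(j) | theta^(j)
        w = random.choices(vocab, weights=beta[z], k=1)[0]      # W_i^(j) | Z_i^(j), beta
        doc.append(w)
    print(j, [round(t, 2) for t in theta_j], doc)
```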
We are now equipped with the basic concepts from probability theory and from statistics so that
we can learn from data. We can now create a statistical model, where the data we have is mapped
to “observed” random variables. The question remains: how is our data represented? In NLP,
researchers usually rely on fixed datasets from annotated or unannotated corpora. Each item in a
fixed corpus is assumed to be drawn from some distribution. The data can be annotated (labeled)
or unannotated (unlabeled). Learning can be supervised, semi-supervised or unsupervised. The
various scenarios and prediction goals for each are described in Table 1.1.
An important concept common to all of these learning settings is that of marginal like-
lihood. The marginal likelihood is a quantity that denotes the likelihood of the observed data
according to the model. In the Bayesian setting, marginalization is done over the parameters
(taking into account the prior) and the latent variables.
Here are some learning cases along with their marginal likelihood.
$$L\left(x^{(1)}, \ldots, x^{(n)}\right) = \int \prod_{i=1}^{n} \left( \sum_{z^{(i)}} p\left(z^{(i)} \mid \theta\right) p\left(x^{(i)} \mid z^{(i)}, \theta\right) \right) p(\theta)\, d\theta.$$
8 The scenario in transductive learning is such that training data in the form of inputs and outputs are available. In addition,
the inputs for which we are interested in making predictions are also available during training.
is (where $Z'^{(i)}$ are the predicted sequences for $X'^{(i)}$):
$$L\left(x^{(1)}, \ldots, x^{(n)}, z^{(1)}, \ldots, z^{(n)}, x'^{(1)}, \ldots, x'^{(m)}\right) = \int \left( \prod_{i=1}^{n} p\left(z^{(i)} \mid \theta\right) p\left(x^{(i)} \mid z^{(i)}, \theta\right) \right) \left( \prod_{i=1}^{m} \sum_{z'^{(i)}} p\left(z'^{(i)} \mid \theta\right) p\left(x'^{(i)} \mid z'^{(i)}, \theta\right) \right) p(\theta)\, d\theta.$$
$$L\left(x^{(1)}, \ldots, x^{(n)}, z^{(1)}, \ldots, z^{(n)}\right) = \int \left( \prod_{i=1}^{n} p\left(z^{(i)} \mid \theta\right) p\left(x^{(i)} \mid z^{(i)}, \theta\right) \right) p(\theta)\, d\theta.$$
Note that here, the marginal likelihood is not used to predict any value directly. But if the prior, for example, is parametrized as $p(\theta \mid \alpha)$, this marginal likelihood could be maximized with respect to $\alpha$. See Chapter 4.
• We are interested in doing inductive supervised part-of-speech tagging, where the statisti-
cal model for the POS sequences assumes an additional latent variable for each sequence.
Therefore, the likelihood is defined over X , Z and H , the latent variable. For example,
H could be a refinement of the POS labels—adding an additional latent category that de-
scribes types of coarse POS tags such as nouns, verbs and prepositions. The observed data
points are .x .1/ ; z .1/ /; : : : ; .x .n/ ; z .n/ /. The marginal likelihood is:
L x .1/ ; : : : ; x .n/ ; z .1/ ; : : : ; z .n/
0 1
Z Y n X
D @ p h.i / ; z .i / j p x .i/ jh.i/ ; z .i/ ; A p. /d:
i D1 h.i/
The concept of likelihood is further explored in this book in various contexts, most promi-
nently in Chapters 4 and 6.
Log-likelihood and marginal log-likelihood can also be used as an “intrinsic” measure
of evaluation for a given model. This evaluation is done by holding out part of the observed
data, and then evaluating the marginal log-likelihood on this held-out dataset. The higher this
log-likelihood is for a given model, the better “fit” it has to the data.
The reason for doing this kind of evaluation on a held-out dataset is to ensure that it is the generalization power of the model that is being tested. Otherwise, if we were to evaluate the log-likelihood on the observed data used during learning and initial inference, we could always create a dummy, not-so-useful model that would give higher log-likelihood than any other model. This dummy model would be created by defining a probability distribution (unconstrained to a specific model family) which assigns probability $1/n$ to each instance in the observed data. Such a distribution, however, would not generalize well for complex sample spaces. Still, evaluation of marginal log-likelihood on the training data could be used for other purposes, such as hyperparameter tuning.
$$p(A \cup B \cup C) = p(A) + p(B) + p(C) - p(A \cap B) - p(A \cap C) - p(B \cap C) + p(A \cap B \cap C).$$
CHAPTER 2
Introduction
Broadly interpreted, Natural Language Processing (NLP) refers to the area in Computer Sci-
ence that develops tools for processing human languages using a computer. As such, it borrows
ideas from Artificial Intelligence, Linguistics, Machine Learning, Formal Language Theory and
Statistics. In NLP, natural language is usually represented as written text (as opposed to speech
signals, which are more common in the area of Speech Processing).
There is another side to the exploration of natural language using computational means—
through the field of Computational Linguistics. The goal of exploring language from this angle is slightly different from that of NLP: it is to use computational means to scientifically understand language, its evolution, acquisition process, history and influence on society.
A computational linguist will sometimes find herself trying to answer questions that linguists
are attempting to answer, only using automated methods and computational modeling. To a
large extent, the study of language can be treated from a computational perspective, since it
involves the manipulation of symbols, such as words or characters, similarly to the ways other
computational processes work.
Computational Linguistics and NLP overlap to a large extent, especially in regards to the
techniques they use to learn and perform inference with data. This is also true with Bayesian
methods in these areas. Consequently, we will refer mostly to NLP in this book, though most
of the technical descriptions in this book are also relevant to topics explored in Computational
Linguistics.
Many of the efforts in modern Natural Language Processing address written text at the
sentential level. Machine translation, syntactic parsing (the process of associating a natural lan-
guage sentence with a grammatical structure), morphological analysis (the process of analyzing the structure of a word and often decomposing it into its basic units such as morphemes)
and semantic parsing (the process of associating a natural language sentence with a meaning-
representation structure) all analyze sentences to return a linguistic structure. Such predicted
structures can be used further in a larger natural language application. This book has a simi-
lar focus. Most of the Bayesian statistical models and applications we discuss are developed at
the sentential level. Still, we will keep the statistical models used more abstract, and will not
necessarily commit to defining distributions over sentences or even natural language elements.
Indeed, to give a more diverse view for the reader, this introductory chapter actually dis-
cusses a simple model over whole documents, called the “latent Dirichlet allocation” (LDA)
model. Not only is this model not defined at the sentential level, it originally was not framed in
a Bayesian context. Yet there is a strong motivation behind choosing the LDA model in this in-
troductory chapter. From a technical point of view, the LDA model is simple, but demonstrates
most of the fundamental points that repeatedly appear in Bayesian statistical modeling in NLP.
We also discuss the version of LDA used now, which is Bayesian.
Before we begin our journey, it is a good idea to include some historical context about the
development of Bayesian statistics in NLP and introduce some motivation behind its use. This
is the topic of the next section, followed by a section about the LDA model.
2 Here, “overlapping features” refers to features that are not easy to describe in a clean generative story—because they
contain overlapping information, and therefore together “generate” various parts of the structure or the data multiple times.
In principle, there is no issue with specifying generative models with overlapping features or models with complex interaction
such as log-linear models that define distribution both over the input and output spaces. However, such generative models are
often intractable to work with, because of the normalization constant which requires summing an exponential function both
over the input space and the output space. Therefore, general log-linear models are left to the discriminative setting, in which
the normalization constant is only required to be computed by summing over the output space for a given point in the input
space.
Computationally, we may treat each case differently (because of limited computational power),
but the basic principle remains the same.
One of the greatest advantages of using Bayesian statistics in NLP is the ability to in-
troduce a prior distribution that can bias inference to better solutions. For example, it has been
shown in various circumstances that different prior distributions can model various properties of
natural language, such as the sparsity of word occurrence (i.e., most words in a dictionary occur
very few times or not at all in a given corpus), the correlation between refined syntactic cate-
gories and the exponential decay of sentence length frequency. However, as is described below
and in Chapter 3, the current state of the art in Bayesian NLP does not exploit this degree of
freedom to its fullest extent. In addition, nonparametric methods in Bayesian statistics give a
principled way to identify appropriate model complexity supported by the data available.
Bayesian statistics also serves as a basis for modeling cognition, and there is some overlap
between research in Cognitive Science and NLP, most notably through the proxy of language
acquisition research (Doyle and Levy, 2013, Elsner et al., 2013, Frank et al., 2013, 2014, Full-
wood and O’Donnell, 2013, Pajak et al., 2013). For example, Bayesian nonparametric models for
word segmentation were introduced to the NLP community (Börschinger and Johnson, 2014,
Johnson, 2008, Johnson et al., 2010, 2014, Synnaeve et al., 2014) and were also used as models
for exploring language acquisition in infants (Goldwater et al., 2006, 2009). For reviews of the
use of the Bayesian framework in Cognitive Science see, for example, Griffiths et al. (2008,
2010), Perfors et al. (2011), Tenenbaum et al. (2011).
Bayesian NLP has been flourishing, and its future continues to hold promise. The rich
domain of natural language offers a great opportunity to exploit the basic Bayesian principle of
encoding into the model prior beliefs about the domain and parameters. Much knowledge about
language has been gathered in the linguistics and NLP communities, and using it in a Bayesian
context could potentially improve our understanding of natural language and its processing using
computers. Still, the exploitation of this knowledge through Bayesian principles in the natural
language community has been somewhat limited, and presents a great opportunity to develop
this area in that direction. In addition, many problems in machine learning are now approached
with a Bayesian perspective in mind; some of this knowledge is being transferred to NLP, with
statistical models and inference algorithms being tailored and adapted to its specific problems.
There are various promising directions for Bayesian NLP such as understanding better the
nature of prior linguistic knowledge about language that we have, and incorporating it into prior
distributions in Bayesian models; using advanced Bayesian inference techniques to scale the use
of Bayesian inference in NLP and make it more efficient; and expanding the use of advanced
Bayesian nonparametric models (see Section 7.5) to NLP.
2.2 FIRST EXAMPLE: THE LATENT DIRICHLET
ALLOCATION MODEL
We begin the technical discussion with an example model for topic modeling: the latent Dirich-
let allocation (LDA) model. It demonstrates several technical points that are useful to know
when approaching a problem in NLP with Bayesian analysis in mind. The original LDA pa-
per by Blei et al. (2003) also greatly popularized variational inference techniques in Machine
Learning and Bayesian NLP, to which Chapter 6 is devoted.
LDA elegantly extends the simplest computational representation for documents—the
bag-of-words representation. With the bag-of-words representation, we treat a document as a
multiset of words (or potentially as a set as well). This means that we dispose of the order of
the words in the document and focus on just their isolated appearance in the text. The words
are assumed to originate in a fixed vocabulary that includes all words in all documents (see Zhai
and Boyd-Graber (2013) on how to avoid this assumption).
The bag-of-words representation is related to the “unigram language model,” which also
models sentences by ignoring the order of the words in these sentences.
As mentioned above, with the bag-of-words model, documents can be mathematically
represented as multisets. For example, assume there is a set V of words (the vocabulary) with a
special symbol ˘, and a text such as3 :
Goldman Sachs said Thursday it has adopted all 39 initiatives it proposed to
strengthen its business practices in the wake of the 2008 financial crisis, a step de-
signed to help both employees and clients move past one of most challenging chap-
ters in the company’s history. ˘
The symbol ˘ terminates the text. All other words in the document must be members
of V \ {˘}. The mathematical object that describes this document, d, is the multiset {w : c},
where the notation w : c is used to denote that the word w appears in the document c times.
For example, for the above document, business : 1 belongs to its corresponding multiset, and so
does both : 1. A bag of words can have an even more extreme representation, in which counts are
ignored and c = 1 is used for all words. From a practical perspective, the documents are often
preprocessed, so that, for example, function words or extremely common words are removed.
To define a probabilistic model over these multisets, we first assume a probability distribution
over V, p(W | β). This means that β is a set of parameters for a multinomial distribution
such that p(w | β) = β_w. This vocabulary distribution induces a distribution over documents,
denoted by the random variable D (a random multiset), as follows:

p(D = d | β) = ∏_{(w:c) ∈ d} p(w | β)^c = ∏_{(w:c) ∈ d} (β_w)^c.   (2.1)
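As a concrete illustration of Equation 2.1, here is a minimal Python sketch. The toy vocabulary distribution beta and the document d are hypothetical, and the dictionary-based encoding of a multiset is just one possible choice; none of these specifics comes from the text.

import math

# hypothetical vocabulary distribution beta_w (must sum to 1 over V)
beta = {"goldman": 0.2, "sachs": 0.2, "business": 0.3, "crisis": 0.3}

# the document as a multiset {w : c}
d = {"business": 1, "crisis": 2, "goldman": 1}

# Equation 2.1: p(D = d | beta) = prod over (w:c) in d of beta_w^c, computed in log space
log_p = sum(c * math.log(beta[w]) for w, c in d.items())
print(log_p)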
3 The text is taken from The Wall Street Journal Risk and Compliance Journal and was written by Justin Baer (May 23, 2013).
This bag-of-words model appears in many applications in NLP, but is usually too weak
to be used on its own for the purpose of modeling language or documents. The model makes an
extreme independence assumption—the occurrences of all the words in the document are independent
of each other. Clearly this assumption is not satisfied by text, because there are words that tend
to co-occur. A document about soccer will tend to use the word “goal” in tandem with “player”
and “ball,” while a document about U.S. politics will tend to use words such as “presidential,”
“senator” and “bill.” This means that the appearance of the word “presidential” in a document
gives us a lot of information about what other words may appear in the document, and there-
fore the independence assumption that bag-of-words models make fails. In fact, this extreme
independence assumption does not even capture the most intuitive notion of word repetition in
a document—actual content words—especially words denoting entities, which are more likely
to appear later in a document if they already appeared earlier in the document.
There is no single remedy to this strict independence assumption that the bag-of-words
model makes. The rich literature of document modeling is not the focus of this book, but many of
the current models for document modeling are subject to the following principle, devised in the
late 1990s. A set of “topics” is defined. Each word is associated with a topic, and this association
can be made probabilistic or crisp. It is also not mutually exclusive—words can belong to various
topics with different degrees of association. In an inference step, given a set of documents, each
document is associated with a set of topics, which are being learned from the data. Given that
the topics are learned automatically, they are not labeled by the model (though they could be in a
post-processing step after inference with the topic model), but the hope is to discover topics such
as “soccer” (to which the word “goal” would have strong association, for example) or “politics.”
Topics are discovered by assembling a collection of words for each topic, with a corresponding
likelihood for being associated with that topic.
This area of topic modeling has flourished in the recent decade with the introduction of
the latent Dirichlet allocation model (Blei et al., 2003). The idea behind the latent Dirichlet al-
location model is extremely intuitive and appealing, and it builds on previous work for modeling
documents, such as the work by Hofmann (1999b). There are K topics in the model. Each topic
z 2 f1; : : : ; Kg is associated with a conditional probability distribution over V , p.w j z/ D ˇz;w .
(ˇz is the multinomial distribution over V for topic z ). LDA then draws a document d in three
phases (conditioned on a fixed number of words to be sampled for the document, denoted by
N ) using generative story 2.1.4
LDA weakens the independence assumption made by the bag-of-words model in Equa-
tion 2.1: the words are not completely independent of each other, but are independent of each
other given their topic. First, a distribution over topics (for the entire document) is generated.
4 To make the LDA model description complete, there is a need to draw the number of words N in the document, so that the
LDA model can generate documents of variable length. In Blei et al.'s paper, the number of words in the document is drawn
from a Poisson distribution with some rate λ. It is common to omit this in LDA's description, since the document is assumed
to be observed, and as a result, the number of words is known during inference. Therefore, it is not necessary to model N
probabilistically.
Constants: K, N integers
Parameters: β
Latent variables: θ, z_i for i ∈ {1, ..., N}
Observed variables: w_1, ..., w_N (d)

• Draw a topic distribution θ = (θ_1, ..., θ_K) for the document.
• For each i ∈ {1, ..., N}, draw a topic z_i from Multinomial(θ).
• For each i ∈ {1, ..., N}, draw a word w_i from Multinomial(β_{z_i}).
• Set

d = { w : c | w ∈ V, c = ∑_{j=1}^N I(w_j = w) }.

Generative Story 2.1: The generative story for the latent Dirichlet allocation model. Figure 1.1
gives a graphical model description for LDA. The generative model here assumes M = 1
when compared to the graphical model in Figure 1.1 (i.e., the model here generates a single
document). An outer loop is required to generate multiple documents. The distribution over θ
is not specified in the above, but with LDA it is drawn from the Dirichlet distribution.
Second, a list of topics for each word in the document is generated. Third, each word is generated
according to a multinomial distribution associated with the topic of that word index.
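These three phases translate directly into a short sampling routine. The following Python sketch is only an illustration under assumed settings: the number of topics K, the vocabulary size V, the number of words N, the hyperparameter used to draw θ, and the topic distributions β are arbitrary choices, and numpy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 20               # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)           # Dirichlet hyperparameters for the topic distribution
beta = rng.dirichlet(np.ones(V), size=K)   # one distribution over V per topic

# Phase 1: draw the document-level topic distribution theta
theta = rng.dirichlet(alpha)
# Phase 2: draw a topic z_i for every word position
z = rng.choice(K, size=N, p=theta)
# Phase 3: draw each word from the topic-specific distribution beta_{z_i}
words = np.array([rng.choice(V, p=beta[zi]) for zi in z])

# the bag-of-words representation d = {w : c}
d = dict(zip(*np.unique(words, return_counts=True)))
print(d)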
Consider line 1 in the description of the LDA generative model (generative story 2.1).
The distribution from which θ is drawn is not specified. In order to complete the description of
the LDA, we need to choose a “distribution over multinomial distributions” such that we can
draw from it a topic distribution. Each instance θ is a multinomial distribution with θ_z ≥ 0 and
∑_{z=1}^K θ_z = 1. Therefore, we need to find a distribution over the set

{ θ | ∀z, θ_z ≥ 0, ∑_{z=1}^K θ_z = 1 }.
LDA uses the Dirichlet distribution for defining a distribution over this probability simplex.5
This means that the distribution over θ is defined as follows:

p(θ_1, ..., θ_K | α_1, ..., α_K) = C(α) ∏_{k=1}^K θ_k^{α_k − 1},   (2.2)

where the function C(α) is defined in Section 2.2.1 (Equation 2.3, see also Appendix B), and
serves as the normalization constant of the Dirichlet distribution.
The Dirichlet distribution depends on K hyperparameters, α_1, ..., α_K, which can be denoted
using a vector α ∈ R^K. It is notationally convenient to denote the Dirichlet distribution
then by p(θ_1, ..., θ_K | α) to make this dependence explicit. This does not imply that α itself is a
random variable or event that we condition on, but instead is a set of parameters that determines
the behavior of the specific instance of the Dirichlet distribution (see note about this notation
in Section 1.5.5).
The reason for preferring the Dirichlet distribution is detailed in Chapter 3, when the
notion of conjugacy is introduced. For now, it is sufficient to say that the choice of the Dirichlet
distribution is natural because it makes inference with LDA much easier—drawing a multinomial θ
from a Dirichlet distribution and then subsequently drawing a topic from this multinomial is
mathematically and computationally convenient. The Dirichlet distribution has other desirable
properties which are a good fit for modeling language, such as encouraging sparse multinomial
distributions with a specific choice of hyperparameters (see Section 3.2.1).
Natural language processing models are often constructed using multinomial distributions
as the basic building blocks for the generated structure. This includes parse trees, alignments,
dependency trees and others. These multinomial distributions compose the parameters of the
model. For example, the parameters of a probabilistic context-free grammar (see Chapter 8) are
a set of multinomial distributions for generating the right-hand side of a rule conditioned on
the left-hand side.
One of the core technical ideas in the Bayesian approach is that the parameters are con-
sidered to be random variables as well, and therefore the generative process draws values for
these model parameters. It is not surprising, therefore, that the combination of convenience in
inference with Dirichlet-multinomial models and the prevalence of multinomial distributions
in generative NLP models yields a common and focused use of the Dirichlet distribution in
Bayesian NLP.
There is one subtle dissimilarity between the way the Dirichlet distribution is used in
Bayesian NLP and the way it was originally defined in the LDA. The topic distribution θ in
LDA does not represent the parameters of the LDA model. The only parameters in LDA are
the topic multinomials β_k for k ∈ {1, ..., K}. The topic distribution θ is an integral part of
5 A K-dimensional simplex is a K-dimensional polytope, i.e., the convex hull of K + 1 vertices. The vertices that define
a probability simplex are the basis vectors e_i for i ∈ {1, ..., K}, where e_i ∈ R^K is a vector that is 0 everywhere but
1 in the i-th coordinate.
Figure 2.1: The fully Bayesian version of the latent Dirichlet allocation model. A prior over β is
added (and the β are now random variables). Most commonly, this prior is a (symmetric) Dirichlet
distribution with hyperparameter ψ.
the model, and θ is drawn separately for each document. To turn LDA into a Bayesian model,
one should draw β from a Dirichlet distribution (or another distribution). This is indeed now
a common practice with LDA (Steyvers and Griffiths, 2007). A graphical model for this fully
Bayesian LDA model is given in Figure 2.1.
Since the Dirichlet distribution is so central to Bayesian NLP, the next section is dedi-
cated to an exploration of its basic properties. We will also re-visit the Dirichlet distribution in
Chapter 3.
2.2.1 THE DIRICHLET DISTRIBUTION
The probability density of the Dirichlet distribution depends on K positive real values, α_1, ..., α_K.
The PDF appears in Equation 2.2, where C(α) is a normalization constant defined as:

C(α) = Γ(∑_{k=1}^K α_k) / (Γ(α_1) ⋯ Γ(α_K)),   (2.3)

with Γ(x) being the Gamma function for x > 0 (see also Appendix B)—a generalization of the
factorial function such that whenever x is a natural number it holds that Γ(x) = (x − 1)!.
Vectors in the “probability simplex,” as the name implies, can be treated as probability
distributions over a finite set of size K. This happens above with LDA, where θ is treated as a
probability distribution over the K topics (each topic is associated with one of the K dimensions
of the probability simplex), and used to draw topics for each word in the document.
Naturally, the first and second moments of the Dirichlet distribution depend on α. Denoting
α• = ∑_{k=1}^K α_k, it holds that:

E[θ_k] = α_k / α•,
var(θ_k) = α_k (α• − α_k) / ((α•)² (α• + 1)),
Cov(θ_j, θ_k) = −α_j α_k / ((α•)² (α• + 1)),
mode(θ_k) = (α_k − 1) / (α• − K).

The mode is not defined if any α_k < 1, since in that case, the density of the Dirichlet
distribution is potentially unbounded.
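These closed-form moments are easy to verify numerically. The sketch below is only an illustration: the specific α is an arbitrary choice and numpy is assumed. It compares the analytic mean and variance with Monte Carlo estimates from sampled Dirichlet vectors.

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
alpha0 = alpha.sum()

# analytic mean and variance of theta_k
mean = alpha / alpha0
var = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1))

# Monte Carlo estimates from samples
samples = rng.dirichlet(alpha, size=100_000)
print(mean, samples.mean(axis=0))   # should agree closely
print(var, samples.var(axis=0))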
The Beta Distribution   In the special case where K = 2, the Dirichlet distribution is also
called the Beta distribution, and has the following density:

p(θ_1, θ_2 | α_1, α_2) = (Γ(α_1 + α_2) / (Γ(α_1) Γ(α_2))) θ_1^{α_1 − 1} θ_2^{α_2 − 1}.

Since θ_1 + θ_2 = 1, the Beta distribution can be described as a univariate distribution over
θ′ ∈ [0, 1]:

p(θ′ | α_1, α_2) = (Γ(α_1 + α_2) / (Γ(α_1) Γ(α_2))) (θ′)^{α_1 − 1} (1 − θ′)^{α_2 − 1}.   (2.4)
Symmetric Dirichlet   It is often the case that instead of using K different parameters, the
Dirichlet distribution is used with α such that α_1 = α_2 = ... = α_K = α′ ∈ R₊. In this case, the
Dirichlet distribution is also called a symmetric Dirichlet distribution. The α′ hyperparameter
is called the concentration hyperparameter.
The reason for collapsing α_1, ..., α_K into a single parameter is two-fold: (i) this considerably
simplifies the distribution, and makes it easier to tackle learning, and (ii) a priori, if latent
variables exist in the model, and they are drawn from a multinomial (which is itself drawn from a
Dirichlet distribution), it is often the case that the roles of the various events in the multinomial are
interchangeable. This is the case, for example, with the LDA model. Since only the text is observed
in the data (without either the topic distribution or the topics associated with each word),
the roles of the K topics can be permuted. If α_1, ..., α_K are not estimated by the learning algorithm,
but instead clamped at a certain value, it makes sense to keep the roles of the α_k symmetric
by using a symmetric Dirichlet.
6 The mode of a probability distribution is the most likely value according to that distribution. It is the value(s) at which
the PMF or PDF obtain their maximal value.
Parameters: θ
Latent variables: Z (and also θ)
Observed variables: X

• Draw the parameters θ from a prior distribution p(θ).
• Draw the latent structure z from p(z | θ).
• Draw the observed data x from p(x | z, θ).

Generative Story 2.2: The generative story for a generic Bayesian model with parameters θ, latent
structure Z, and observed data X.
Figure 2.2 plots the density in Equation 2.4 when α_1 = α_2 = α′ for several values
of α′, and demonstrates the choice of the name “concentration parameter” for α′. The closer α′
is to 0, the sparser the distribution is, with most of the mass concentrated on near-zero
probability values. The larger α′ is, the more concentrated the distribution is around its mean value. Since
the figure describes a symmetric Beta distribution, the mean value is 0.5 for all values of α′.
When α′ = 1, the distribution is uniform.
The fact that small values of α′ make the Dirichlet distribution sparse is frequently exploited
in the Bayesian NLP literature. This point is discussed at greater length in Section 3.2.1.
It is also demonstrated in Figure 2.3.
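The behavior shown in Figures 2.2 and 2.3 can be reproduced with a few lines of code. In the sketch below (an illustration only; the values of α′ mirror those in Figure 2.3 and numpy is assumed), the average mass of the largest coordinate of each draw moves from roughly 1/K toward 1 as α′ shrinks, which is one way of quantifying sparsity.

import numpy as np

rng = np.random.default_rng(0)
K = 3
for alpha_prime in (10.0, 1.0, 0.1, 0.01):
    theta = rng.dirichlet(np.full(K, alpha_prime), size=10_000)
    # average mass of the largest coordinate: near 1/K for large alpha',
    # near 1 (sparse draws) for small alpha'
    print(alpha_prime, theta.max(axis=1).mean())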
2.2.2 INFERENCE
As was briefly mentioned earlier, in topic modeling the topics are considered to be latent. While
datasets exist in which documents are associated with various human-annotated topics, the vast
majority of document collections do not have such an annotation—certainly not in the style
of the LDA model, where each word has some degree of association with each topic. In fact,
asking an annotator to annotate topics the way they are defined in an LDA-style topic model
is probably an ill-defined task because these topics are often not crisp or fully interpretable in
their word association (see Chang et al. (2009) for a study on human interpretation of topic
models; also see Mimno et al. (2011) and Newman et al. (2010) for automatic topic coherence
evaluation). For LDA, this means that the distribution over topics, θ, and the topic identity for
each word are latent variables—they are never observed in the data, which are just pure text.
This is typical in Bayesian NLP as well. Usually, there is a random variable X (a document
or a sentence, for example) which is associated with a predicted structure, denoted by a random
variable Z. The generation of X and Z is governed by some distribution parametrized by θ. The
parameters θ themselves are a random variable that is governed, for example, by the Dirichlet
Figure 2.2: The Beta distribution density function when α_1 = α_2 = α′ for α′ ∈ {0.1, 0.5, 1, 2, 3}.
distribution, or more generally by a distribution p(θ). This distribution is also called the prior
distribution. Generative story 2.2 describes this process.
There is a striking similarity to the LDA generative process, where the topic distribution θ
plays the role of the set of parameters, the topic assignments play the role of the latent structure,
and the words in the document play the role of the observed data.
The generative process above dictates the following joint probability distribution p(X, Z, θ):

p(X = x, Z = z, θ) = p(θ) p(z | θ) p(x | z, θ).

The goal of Bayesian inference is naturally to either infer the latent structure z (or a distribution
over it), or to infer the parameters θ (or a distribution over them). More generally, the
goal of Bayesian inference is to obtain the posterior distribution over the non-observed random
Figure 2.3: A plot of sampled data from the symmetric Dirichlet distribution with K = 3, with
various α. Top-left: α = 10, top-right: α = 1, bottom-left: α = 0.1, bottom-right: α = 0.01.
The plot demonstrates that for α < 1, the Dirichlet distribution is centered on points in the
probability simplex which are sparse. For the value α = 1, the probability distribution should be
uniform.
variables in the model, given the observed data x. For the general Bayesian model above, the
posterior is p(Z, θ | x).
Note that the predictions are managed through distributions (such as the posterior) and
not through fixed values. Bayesian inference, at its basic level, does not commit to a single z or
θ. (However, it is often the case that we are interested in a point estimate for the parameters,
see Chapter 4, and even more often—a fixed value for the predicted structure.)
To identify this posterior, Bayesian statistics exploits the treatment of θ as a random variable
through a basic application of Bayes' rule. More specifically, the posterior is identified as:

p(z, θ | x) = p(θ) p(z | θ) p(x | z, θ) / p(x).   (2.5)

The quantity p(x) acts as a marginalization constant that ensures that the probability in
Equation 2.5 integrates (and sums) to one. Therefore, assuming θ is continuous, as is usually the
case, the following holds:

p(x) = ∫ p(θ) ( ∑_z p(z | θ) p(x | z, θ) ) dθ.   (2.6)
Mathematically, Bayesian inference is easy and elegant. It requires inverting the conditional
distributions of p(x, z, θ) using Bayes' rule so that the posterior is computed. All of the
quantities in Equation 2.5 are theoretically known. There is only the need to rely on the simplest,
most basic results from probability theory.
Still, Bayesian inference is not always trivial to implement or to computationally execute.
The main challenge is computing the marginalization constant (Equation 2.6), which is required
to make the predictions. The marginalization constant requires summing over a discrete (possibly
infinite) set and integrating over a continuous set. This is often intractable, but prior conjugacy
can alleviate this problem (Chapter 3). Even when only one of these marginalizations is required
(such as the case with variational inference, see Chapter 6), inference can still be complex.
This intractability is often overcome by using approximate inference methods, such as
Markov chain Monte Carlo methods or variational inference. These are discussed in Chapters 5
and 6, respectively.
2.2.3 SUMMARY
The core ideas in LDA modeling and inference have striking similarities to some of the principles
used in Bayesian NLP. Most notably, the use of the Dirichlet distribution to define multinomial
distributions is common in both.
LDA requires inferring a posterior over topic assignments for each word in each document
and the distribution over topics for each document. These are the two latent variables in the LDA
model. Analogously, in Bayesian NLP, we often require inferring a latent structure (such as a
parse tree or a sequence) and the parameters of the model.
Inference of the kind described in this chapter, with LDA, with Bayesian models and more
generally, with generative models, can be thought of as the reverse-engineering of a generative
device, that is the underlying model. The model is a device that continuously generates samples
of data, of which we see only a subset of the final output that is generated by this device. In
the case of LDA, the device generates raw text as output. Inference works backward, trying
to identify the missing values (the topic distribution and the topics themselves) that were used to
generate this text.
2.3 SECOND EXAMPLE: BAYESIAN TEXT REGRESSION
Even though Bayesian NLP has focused mostly on unsupervised learning, Bayesian inference
in general is not limited to learning from incomplete data. It is also often used for prediction
problems such as classification and regression where the training examples include both the
inputs and the outputs of the model.
In this section, we demonstrate Bayesian learning in the case of text regression, predicting
a continuous value based on a body of text. We will continue to use the notation from Section 2.2
and denote a document by d , as a set of words and word count pairs. In addition, we will assume
some continuous value that needs to be predicted, denoted by the random variable Y. To ground
the example, D can be a movie review, and Y can be the average number of stars the movie
received from critics, or its revenue (Joshi et al., 2010). The prediction problem is therefore
to predict the number of stars a movie receives from the movie review text.
One possible way to frame this prediction problem is as a Bayesian linear regression problem.
This means we assume that we receive as input for the inference algorithm a set of examples
(d^(i), y^(i)) for i ∈ {1, ..., n}. We assume a function f(d) that maps a document to a vector in
R^K. This is the feature function that summarizes the information in the document as a vector,
and on which the final predictions are based. For example, K could be the size of the vocabulary
that the documents span, and [f(d)]_j could be the count of the j-th word of the vocabulary in
document d.
A linear regression model typically assumes that there is a stochastic relationship between
Y and d:

Y = θᵀ f(d) + ε,

where θ ∈ R^K is a set of parameters for the linear regression model and ε is a noise term (with
zero mean), most often framed as a Gaussian variable with variance σ². For the sake of simplicity,
we assume for now that σ² is known, and we need not make any inference about it. As a result,
the learning problem becomes an inference problem about θ.
As mentioned above, ε is assumed to be a Gaussian variable under the model, and as such
Y itself is a Gaussian with mean value θᵀ f(d) for any fixed θ and document d. The variance of
Y is σ².
In Bayesian linear regression, we assume a prior on θ, a distribution p(θ | α). Consequently,
the joint distribution over θ and Y^(1), ..., Y^(n) is:

p(θ, Y^(1) = y^(1), ..., Y^(n) = y^(n) | d^(1), ..., d^(n), α) = p(θ | α) ∏_{i=1}^n p(Y^(i) = y^(i) | θ, d^(i)).

In this case, Bayesian inference will use Bayes' rule to find a probability distribution over θ
conditioned on the data, y^(i) and d^(i) for i ∈ {1, ..., n}. It can be shown that if we choose a
conjugate prior to the likelihood over Y^(i), which in this case is also a normal distribution, then
the distribution p(θ | y^(1), ..., y^(n), d^(1), ..., d^(n)) is also a normal distribution.
This conjugacy between the normal distribution and itself is demonstrated in more detail
in Chapter 3. Also see exercise 2.6 in this chapter. There are two natural extensions to the scenario
described above. In one, the Y^(i) are multivariate, i.e., y^(i) ∈ R^M; in that case, θ is a matrix in
R^{M×K}. In the other extension, the variance σ² (or, in the multivariate case, the covariance matrices
controlling the likelihood and the prior) is unknown. Full derivations of Bayesian
linear regression for these two cases are given by Minka (2000).
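The conjugate computation for the simple case sketched above can be written in a few lines. The following is only an illustrative sketch under stated assumptions: a zero-mean isotropic Gaussian prior on θ with variance τ², a known noise variance σ², word counts standing in for the feature function f, and synthetic data in place of real documents; none of these specific choices comes from the text.

import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 10                    # number of documents, feature dimension
sigma2, tau2 = 1.0, 10.0         # known noise variance; prior variance of theta

F = rng.poisson(1.0, size=(n, K)).astype(float)   # rows play the role of f(d^(i))
theta_true = rng.normal(size=K)
y = F @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# With a Normal(0, tau2 * I) prior and a Gaussian likelihood, the posterior over
# theta is Gaussian with the following covariance and mean (standard closed form).
post_cov = np.linalg.inv(F.T @ F / sigma2 + np.eye(K) / tau2)
post_mean = post_cov @ (F.T @ y) / sigma2
print(post_mean)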
2.5 EXERCISES
2.1. Consider the following generative story:

Constants: n integer
Hyperparameters: α > 0
Latent variables: Z^(1), ..., Z^(n)
Observed variables: X^(1), ..., X^(n)

• Draw a multinomial θ of size two from a symmetric Beta distribution with hyperparameter α > 0.
• Draw z^(1), ..., z^(n) from the multinomial θ, where z^(i) ∈ {0, 1}.
• Set n − 1 binary random variables x^(1), ..., x^(n−1) such that x^(i) = z^(i) z^(i+1).
2.2. Consider the graphical model for Bayesian LDA (Figure 2.1). Write down an expression
for the joint distribution over the observed variables (the words in the document). (Use the
Dirichlet distribution for the topic distributions.)
2.3. Alice has a biased coin that lands more often on “tails” than “heads.” She is interested in
placing a symmetric Beta prior over this coin with hyperparameter α. A draw θ from this
Beta distribution will denote the probability of tails (1 − θ is the probability of heads).
What range of values for α complies with Alice's knowledge of the coin's unfairness?
2.4. As mentioned in this chapter, choosing a hyperparameter α < 1 with the symmetric
Dirichlet encourages sparse draws from the Dirichlet. What are some properties of natural
language that you believe would be useful to model mathematically using such a sparse prior
distribution?
2.5. The LDA model assumes independence between the topics being drawn for each word in
the document. Can you describe the generative story and the joint probability distribu-
tion for a model that assumes a bigram-like distribution of topics? This means that each
topic Z^(i) depends on Z^(i−1). In what scenarios is this model more sensible for document
modeling? Why?
2.6. Complete the details of the example in Section 2.3. More specifically, find the posterior
over θ given y^(i) and d^(i) for i ∈ {1, ..., n}, assuming the prior over θ, p(θ | μ, σ²), is a
normal distribution with mean μ and variance σ².
CHAPTER 3
Priors
Priors are a basic component in Bayesian modeling. The concept of priors and some of their
mechanics must be introduced quite early in order to introduce the machinery used in Bayesian
NLP. At their core, priors are distributions over a set of hypotheses, or when dealing with para-
metric model families, over a set of parameters. In essence, the prior distribution represents
the prior beliefs that the modeler has about the identity of the parameters from which data is
generated, before observing any data.
One of the criticisms of Bayesian statistics is that it lacks objectivity, in the sense that different
prior families will lead to different inferences based on the available data, especially when
only small amounts of data are available. There are circumstances in which such a view is justified
(for example, the Food and Drug Administration held this criticism against the Bayesian
approach for some time (Feinberg, 2011); the FDA has since made use of Bayesian statistics
in certain circumstances), but this lack of objectivity is less of a problem when solving engineering
problems such as those in natural language processing. In NLP, the final “test” for the quality of a model
(or decoder, to be more precise) that predicts a linguistic structure given some input such as a
sentence, often uses an evaluation metric that is not directly encoded in the statistical model.
The existence of a precise evaluation metric, coupled with the use of unseen data (in the su-
pervised case; in the unsupervised case, either the unseen data or the data on which inference is
performed can be used) to calculate this evaluation metric eliminates the concern of subjectivity.
In fact, the additional degree of freedom in Bayesian modeling, the prior distribution,
can be a great advantage in NLP. The modeler can choose a prior that biases the inference and
learning in such a way that the evaluation metric is maximized. This is not necessarily done
directly as a mathematical optimization problem, but through experimentation.
This point has been exploited consistently in the Bayesian NLP literature, where priors
are chosen because they exhibit certain useful properties that are found in natural language. In
the right setting, the Dirichlet distribution is often shown to lead to sparse solutions (see Chap-
ter 2). The logistic normal distribution, on the other hand, can capture relationships between
the various parameters of a multinomial distribution. Other hand-crafted priors have also been
used, mirroring a specific property of language.
This chapter covers the main types of priors that are used in Bayesian NLP. As such,
it discusses conjugate priors at length (Section 3.1) and specifically focuses on the Dirichlet
distribution (Section 3.2.1). The discussion of the Dirichlet distribution is done in the context
of priors over multinomials (Section 3.2), as the multinomial distribution is the main modeling
workhorse in Bayesian NLP.
3.1 CONJUGATE PRIORS
A family of prior distributions {p(θ | α) | α ∈ A} is conjugate to a likelihood p(x | θ) if, for every
observation x and hyperparameter α ∈ A, the posterior remains in the same family, i.e.,

p(θ | x, α) = p(θ | α′),

for some α′ = α′(x, α) ∈ A. Note that α′ is a function of the observation x and α, the hyperparameter
with which we begin the inference. (This means that in order to compute the posterior,
we need to be able to compute the function α′(x, α).)
The mathematical definition of conjugate priors does not immediately shed light on why
they make Bayesian inference more tractable. In fact, according to the definition above, the use
of a conjugate prior does not guarantee computational tractability. Conjugate priors are useful
when the function α′(x, α) can be efficiently computed, and indeed this is often the case when
conjugate priors are used in practice.
When α′(x, α) can be efficiently computed, inference with the Bayesian approach is considerably
simplified. As mentioned above, to compute the posterior over the parameters, all we
need to do is compute α′(x, α), and this introduces a new set of hyperparameters that define the
posterior.
The following example demonstrates this idea about conjugate priors for normal variables.
In this example, contrary to our persistent treatment of the variable X as a discrete variable up to
this point, X is set to be a continuous variable. This is done to demonstrate the idea of conjugacy
with a relatively simple, well-known example—the conjugacy of the normal distribution to itself
(with respect to the mean value parameters).
Example 3.1
Let X be drawn from a normal distribution with expected value θ and fixed, known variance σ²
(it is neither a parameter nor a hyperparameter), i.e., the density of X at a point x is:

p(x | θ) = (1 / (√(2π) σ)) exp( −(1/2) ((x − θ)/σ)² ).

In addition, θ is drawn from a prior family that is also Gaussian, controlled by the hyperparameter
set A = R × R₊, where every α ∈ A is a pair (μ, σ₀²). The value of μ denotes the
expected value of the prior and the value of σ₀² denotes its variance. Assume we begin inference
with a prior such that μ ∈ R and σ₀² = σ²—i.e., we assume the variance of the prior is identical
to the variance in the likelihood. (This is assumed for the simplicity of the posterior derivation,
but it is not necessary to follow this assumption to get a similar derivation when the variances
are not identical.) The prior, therefore, is:

p(θ | α) = (1 / (√(2π) σ)) exp( −(1/2) ((θ − μ)/σ)² ).

The posterior is obtained through Bayes' rule:

p(θ | x, α) = p(θ | α) p(x | θ) / ∫ p(θ | α) p(x | θ) dθ.   (3.1)
The numerator equals:

p(θ | α) p(x | θ) = (1/(√(2π) σ)) exp(−(1/2)((x − θ)/σ)²) × (1/(√(2π) σ)) exp(−(1/2)((θ − μ)/σ)²)
                  = (1/(2πσ²)) exp( −[(x − θ)² + (θ − μ)²] / (2σ²) ).   (3.2)

Completing the square in θ gives:

(x − θ)² + (θ − μ)² = (θ − (x + μ)/2)² / (1/2) + (x − μ)²/2.

The term (x − μ)²/2 does not depend on θ, and therefore it will cancel from both the
numerator and the denominator: it can be taken out of the integral in the denominator. (We
therefore do not include it in the equation below.) Then, Equation 3.1 can be rewritten as:

p(θ | x, α) = p(θ | α) p(x | θ) / ∫ p(θ | α) p(x | θ) dθ = exp( −(θ − (x + μ)/2)² / σ² ) / C(x, α),   (3.3)

where

C(x, α) = ∫ exp( −(θ − (x + μ)/2)² / σ² ) dθ

is a normalization constant that ensures that p(θ | x, α) integrates to 1 over θ. Since the numerator
of Equation 3.3 has the form of a normal density with mean value (x + μ)/2 and variance
σ²/2, this means that Equation 3.3 is actually the density of a normal distribution such that

α′(x, α) = ( (x + μ)/2 , σ/√2 ),

where μ and σ² are defined by α. See Appendix A for further detail.
The normalization constant can easily be derived from the density of the normal distribution
with these specific hyperparameters.
The conclusion from this example is that the family of prior distributions {p(θ | α) | α =
(μ, σ₀²) ∈ R × (0, ∞)} is conjugate to the normally distributed likelihood with a fixed variance
(i.e., the likelihood is parametrized only by the mean value of the normal distribution).
In the general case, with α = (μ, σ₀) (i.e., θ ~ Normal(μ, σ₀²)) and n observations being
identically distributed (and independent given θ) with X^(i) ~ Normal(θ, σ²), it holds that:

α′(x^(1), ..., x^(n), α) = ( (μ + ∑_{i=1}^n x^(i)) / (n + 1) , √( (1/σ₀² + n/σ²)^{−1} ) ),   (3.4)

i.e., the posterior distribution is a normal distribution with a mean and a variance as specified in
Equation 3.4. Note that full knowledge of the likelihood variance is assumed both in the example
and in the extension above. This means that the prior is defined only over θ and not over σ². When
the variances are not known (or, more generally, when the covariance matrices of a multivariate
normal variable are not known) and are actually drawn from a prior as well, more care is required
in defining a conjugate prior for the variance (more specifically, the Inverse-Wishart distribution
would be the conjugate prior in this case, defined over the covariance matrix space).
Example 3.1 and Equation 3.4 demonstrate a recurring point with conjugate priors and
their corresponding posteriors. In many cases, the hyperparameters α take the role of “pseudo-observations”
in the function α′(x, α). As described in Equation 3.4, μ is added to the sum of
the rest of the observations and then averaged together with them. Hence, this hyperparameter
functions as an additional observation with value μ which is taken into account in the posterior.
To avoid confusion, it is worth noting that both the prior and the likelihood were normal
in this example of conjugacy, but usually a conjugate prior and the likelihood are not members of the
same family of distributions. With a Gaussian likelihood, the conjugate prior is also Gaussian
(with respect to the mean parameter) because of the specific algebraic properties of the normal
density function.
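As a quick numerical illustration of this pseudo-observation view, here is a minimal sketch. It assumes, as in Example 3.1, that the prior variance equals the likelihood variance (σ₀² = σ²); the data are simulated and numpy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 1.0          # prior mean; shared variance (sigma_0^2 = sigma^2)
theta = rng.normal(mu, np.sqrt(sigma2))            # draw the latent mean
x = rng.normal(theta, np.sqrt(sigma2), size=20)    # n observations
n = len(x)

# mu acts as one extra "pseudo-observation" averaged together with the data
post_mean = (mu + x.sum()) / (n + 1)
post_var = 1.0 / (1.0 / sigma2 + n / sigma2)       # equals sigma2 / (n + 1) here
print(post_mean, post_var)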
When the model includes latent variables z in addition to the observations x and the parameters θ,
the normalization constant (the marginal likelihood) becomes:

p(x | α) = ∑_z ∫ p(θ) p(z | θ) p(x | z, θ) dθ = ∑_z D(z),   (3.6)

where D(z) is defined to be the term inside the sum above. Equation 3.6 demonstrates that
conjugate priors are useful even when the normalization constant requires summing over latent
variables. If the prior family is conjugate to the distribution p(X, Z | θ), then the function D(z)
will be mathematically easy to compute for any z. However, it is not true that ∑_z D(z) is always
tractable, since the form of D(z) can be quite complex.
On a related note, if the order of summation and integration is switched, then it holds
that:

p(x | α) = ∫ p(θ) ( ∑_z p(z | θ) p(x | z, θ) ) dθ = ∫ p(θ) D′(θ) dθ,

where D′(θ) is defined as the term that sums over z. Then, it is often the case that for every θ,
D′(θ) can be computed using dynamic programming algorithms or other algorithms that sum
over a discrete space (for example, if the latent variable space includes parse trees with an underlying
PCFG grammar, then D′(θ) can be computed using a variant of the CKY algorithm, the
inside algorithm; see Chapter 8 for further discussion). Switching integration and summation
does not by itself make the problem of computing the marginalization constant p(x | α) tractable: the
outer integral over the function D′(θ) is often still infeasible.
Still, the fact that a tractable solution for the inner term results from switching the integration
and the sum is very useful for approximate inference, especially for variational inference.
This is exactly where the conjugacy of the prior comes in handy, even with an intractable posterior
with latent variables. This is discussed further in Chapter 6.
3.1.3 MIXTURE OF CONJUGATE PRIORS
Mixture models are a simple way to extend a family of distributions into a more expressive
family. If we have a set of distributions p_1(X), ..., p_M(X), then a mixture model over this set
of distributions is parametrized by an M-dimensional probability vector (π_1, ..., π_M) (π_i ≥ 0,
∑_i π_i = 1) and defines a distribution over X such that:

p(X | π) = ∑_{i=1}^M π_i p_i(X).

In the same way, a mixture of priors p(θ | α^1), ..., p(θ | α^M) has the form:

p(θ | α^1, ..., α^M, π_1, ..., π_M) = ∑_{i=1}^M π_i p(θ | α^i),

where π_i ≥ 0 and ∑_{i=1}^M π_i = 1 (i.e., π is a point in the (M − 1)-dimensional probability simplex).
This new prior family, which is hyperparametrized by α^i ∈ A and π_i for i ∈ {1, ..., M}, will actually
be conjugate to a likelihood p(x | θ) if the original prior family p(θ | α) for α ∈ A is also
conjugate to this likelihood.
To see this, consider that when using a mixture prior, the posterior has the form:

p(θ | x, α^1, ..., α^M, π) = p(x | θ) p(θ | α^1, ..., α^M, π) / ∫ p(x | θ) p(θ | α^1, ..., α^M, π) dθ
                           = ∑_{i=1}^M π_i p(x | θ) p(θ | α^i) / ∑_{i=1}^M π_i Z_i,

where

Z_i = ∫ p(x | θ) p(θ | α^i) dθ.

The posterior can therefore be written as:

p(θ | x, α^1, ..., α^M, π) = ∑_{i=1}^M (π_i Z_i) p(θ | x, α^i) / ∑_{i=1}^M π_i Z_i,
because p(x | θ) p(θ | α^i) = Z_i p(θ | x, α^i). Because of conjugacy, each p(θ | x, α^i) is equal to
p(θ | β^i) for some β^i ∈ A (i ∈ {1, ..., M}). The hyperparameters β^i are the updated hyperparameters
following posterior inference. Therefore, it holds that:

p(θ | x, α^1, ..., α^M, π) = ∑_{i=1}^M π′_i p(θ | β^i),

for π′_i = π_i Z_i / ∑_{i=1}^M π_i Z_i.
When the prior family parametrized by α is a K-dimensional Dirichlet (see Section 2.2.1
and Equation 2.2) and x is a vector of observed event counts, then, up to a factor that does not
depend on i:

Z_i = C(α^i) ∏_{j=1}^K Γ(α_j^i + x_j) / Γ( ∑_{j=1}^K (α_j^i + x_j) ).
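The posterior mixture weights π′_i = π_i Z_i / ∑_i π_i Z_i can be computed in a numerically stable way in log space. The sketch below is an illustration only: the two Dirichlet components, the mixture weights, and the count vector are arbitrary toy values, and scipy is assumed to be available.

import numpy as np
from scipy.special import gammaln

def log_Z(alpha, x):
    # log of the Dirichlet marginal likelihood for counts x (up to the
    # multinomial coefficient, which is shared by all mixture components)
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + gammaln(alpha + x).sum() - gammaln((alpha + x).sum()))

alphas = [np.array([1.0, 1.0, 1.0]), np.array([10.0, 1.0, 1.0])]  # two components
pis = np.array([0.5, 0.5])                                        # prior mixture weights
x = np.array([8.0, 1.0, 1.0])                                     # observed counts

log_w = np.log(pis) + np.array([log_Z(a, x) for a in alphas])
post_pis = np.exp(log_w - log_w.max())
post_pis /= post_pis.sum()
print(post_pis)   # updated mixture weights pi'_i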
We conclude this section about mixtures of conjugate priors with an example of using such
a mixture prior for text analysis. Yamamoto and Sadamitsu (2005) define a topic model, where a
mixture of Dirichlet distributions is defined as the prior distribution over the vocabulary. Each
draw from this mixture provides a multinomial distribution over the vocabulary. Following that
draw, the words in the document are drawn independently in the generative process.
Yamamoto and Sadamitsu describe the mixture components of their Dirichlet mixture dis-
tribution as corresponding to topics. In this sense, there is a large distinction between their model
and LDA, which samples a topic distribution for each document separately. When measuring
performance on a held-out dataset using perplexity (see Appendix A and Section 1.6), their
model consistently scored better than LDA on a set of 100,000 newspaper articles in Japanese.
Their model's performance also saturated at 20 topics, while the perplexity of LDA continued
to decrease for a much larger number of topics. This perhaps points to a better fit—their
model uses fewer topics (and is therefore simpler), but still achieves lower perplexity than LDA.
Another way to derive a new prior from a conjugate family is to renormalize it over a subset of
the parameter space. Let Θ′ ⊆ Θ be such a subset, and define:

p′(θ | α) = p(θ | α) / ∫_{θ′ ∈ Θ′} p(θ′ | α) dθ′.   (3.7)

This new distribution retains the same ratio between probabilities of elements in Θ′ as p,
but essentially allocates probability 0 to any element in Θ \ Θ′.
It can be shown that if p is a conjugate family to some likelihood, then p′ is conjugate to
the same likelihood as well. This example actually demonstrates that conjugacy, in its pure form,
does not by itself guarantee tractability when the conjugate prior is used together with the corresponding
likelihood. More specifically, the integral over Θ′ in the denominator of Equation 3.7 can often
be difficult to compute, and approximate inference is required.
The renormalization of conjugate distributions arises when considering probabilistic
context-free grammars with Dirichlet priors on the parameters. In this case, in order for the
prior to allocate zero probability to parameters that define non-tight PCFGs, certain multi-
nomial distributions need to be removed from the prior. Here, tightness refers to a desirable
property of a PCFG so that the total measure of all finite parse trees generated by the underly-
ing context-free grammar is 1. For a thorough discussion of this issue, see Cohen and Johnson
(2013).
3.1.6 SUMMARY
Conjugate priors are defined in the context of a prior family and a distribution for the variables
in the model over the observations and latent variables. In many cases, conjugate priors ensure
the tractability of the computation of the normalization constant of the posterior. For example,
conjugate priors often lead to closed-form analytic solutions for the posterior given a set of
observations (and values for the latent variables, if they exist in the model).
Conjugate priors are often argued to be too simplistic, but they are very helpful in NLP
because of the computational complexity of NLP models. Alternatives to conjugate priors are
often less efficient, but with the advent of new hardware and approximation algorithms, these
alternatives have become more viable.
3.2 PRIORS OVER MULTINOMIAL AND CATEGORICAL DISTRIBUTIONS
The nature of the structures that are predicted in natural language processing makes them an excellent fit for
modeling using the categorical distribution. The categorical distribution is a generalization of
the Bernoulli distribution, which specifies how K outcomes (such as topics in a document or
the right-hand sides of context-free rules headed by a non-terminal) are distributed. The
categorical distribution is specified by a parameter vector θ ∈ R^K, where θ satisfies the following
two properties:
∀k ∈ {1, ..., K}:  θ_k ≥ 0,   (3.8)
∑_{k=1}^K θ_k = 1.   (3.9)

The space of allowed parameters for the categorical distribution over K outcomes,

Θ = { θ ∈ R^K | θ satisfies Equations 3.8–3.9 },

is also called “the probability simplex of dimension K − 1”—there is one fewer degree of freedom
because of the requirement that all probabilities sum to 1. The set Θ defines a simplex,
i.e., geometrically it is a (K − 1)-dimensional polytope, which is the convex hull of K vertices;
the vertices are the points at which all probability mass is placed on a single event. All other
probability distributions can be viewed as combinations of these vertices. Each point in this
simplex defines a categorical distribution.
If X ~ Categorical(θ), then the probability distribution over X is defined as2:

p(X = i | θ) = θ_i,

where i ∈ {1, ..., K}. While the categorical distribution generalizes the Bernoulli distribution,
the multinomial distribution is actually a generalization of the binomial distribution and describes
a distribution for a random variable X ∈ N^K such that ∑_{i=1}^K X_i = n for some fixed n, a natural
number, which is a parameter of the multinomial distribution. The parameter n plays a similar
role to the “experiments count” parameter in the binomial distribution. Then, given θ ∈ Θ and
n as described above, the multinomial distribution is defined as:

p(X_1 = i_1, ..., X_K = i_K) = (n! / ∏_{j=1}^K i_j!) ∏_{j=1}^K θ_j^{i_j},

where ∑_{j=1}^K i_j = n.
Even though the categorical distribution and the multinomial distribution differ, there is
a strong relationship between them. More specifically, if X is distributed according to a categorical
distribution with parameters θ, then the random variable Y ∈ {0, 1}^K defined as:

Y_i = I(X = i),   (3.10)
2 The Bayesian NLP literature often refers to categorical distributions as “multinomial distributions,” but this is actually
a misnomer (a misnomer that is nonetheless used in this book, in order to be consistent with the literature).
is distributed according to the multinomial distribution with parameters θ and n = 1. It is often
the case that it is mathematically convenient to represent the categorical distribution as a
multinomial distribution of the above form, using binary indicators. In this case, the probability
function p(Y | θ) can be written as ∏_{i=1}^K θ_i^{y_i}.
There have been various generalizations and extensions to the Dirichlet distribution. One
such example is the generalized Dirichlet distribution, which provides a richer covariance struc-
ture compared to the Dirichlet distribution (see also Section 3.2.2 about the covariance structure
of the Dirichlet distribution). Another example is the Dirichlet-tree distribution (Minka, 1999)
which gives a prior over distributions that generate leaf nodes in a tree-like stochastic process.
In the rest of this section, the categorical distribution will be referred to as a multinomial
distribution to be consistent with the NLP literature. This is not a major issue, since most of the
discussion points in this section are valid for both distributions.
3.2.1 THE DIRICHLET DISTRIBUTION RE-VISITED
Recall from Section 2.2.1 that the Dirichlet distribution has the density

p(θ_1, ..., θ_K | α_1, ..., α_K) = C(α) ∏_{k=1}^K θ_k^{α_k − 1},

where (θ_1, ..., θ_K) is a vector such that θ_i ≥ 0 and ∑_{i=1}^K θ_i = 1.
Here, we continue to provide a more complete description of the Dirichlet distribution
and its properties.
Consider a multinomial likelihood in which x = (x_1, ..., x_K) denotes the observed counts of
the K events. Using Bayes' rule, the posterior satisfies:

p(θ | x, α) ∝ p(θ | α) p(x | θ) ∝ ( ∏_{i=1}^K θ_i^{α_i − 1} ) ( ∏_{i=1}^K θ_i^{x_i} ) = ∏_{i=1}^K θ_i^{α_i + x_i − 1}.   (3.11)
Note the use of ∝ (i.e., “proportional to”) instead of “equal to.” Two normalization constants
in Equation 3.11 were omitted to simplify the identification of the posterior. The first
constant is p(x | α), the marginalization constant. The second constant is the normalization
constant of the Dirichlet distribution (Equation 2.3). The reason we can omit them is that these
constants do not change with θ, and the distribution we are interested in is defined over θ.
Equation 3.11 has the algebraic form (without the normalization constant) of a Dirichlet
distribution with α′(x, α) = α + x. This means that the posterior distributes according to a
Dirichlet distribution with hyperparameters α + x.
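In code, this conjugate update is a single vector addition. A minimal sketch (the prior hyperparameters and the count vector are toy values, and numpy is assumed):

import numpy as np

alpha = np.array([0.5, 0.5, 0.5])     # symmetric Dirichlet prior
x = np.array([12, 3, 0])              # observed event counts
alpha_post = alpha + x                # posterior hyperparameters alpha'(x, alpha)

rng = np.random.default_rng(0)
print(rng.dirichlet(alpha_post, size=3))   # draws from the posterior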
The Dirichlet distribution can also be represented using a collection of independent Gamma-distributed
variables. If λ_1, ..., λ_K are independent random variables with λ_i ~ Gamma(α_i, 1), then setting

θ_i = λ_i / ∑_{j=1}^K λ_j,   (3.12)

for i ∈ {1, ..., K} yields a random vector θ from the probability simplex of dimension K − 1,
such that θ distributes according to the Dirichlet distribution with hyperparameters α =
(α_1, ..., α_K).
The representation of the Dirichlet as independent, normalized Gamma variables explains
a limitation inherent to the Dirichlet distribution. There is no explicit parametrization of the
rich structure of relationships between the coordinates of θ. For example, given i ≠ j, the ratio
θ_i/θ_j, when treated as a random variable, is independent of any other ratio θ_k/θ_ℓ calculated
from two other coordinates, k ≠ ℓ. (This is evident from Equation 3.12: the ratio θ_i/θ_j is
λ_i/λ_j, where all λ_i for i ∈ {1, ..., K} are independent.) Therefore, the Dirichlet distribution
is not a good modeling choice when the parameters are better modeled with even a weak degree
of dependence.
Natural language elements, however, exhibit strong dependence on one another. For
example, consider a Bayesian unigram language model (i.e., a language model that treats a sentence
as a bag of words), where the model is parametrized through θ, a distribution
over the K words in a vocabulary. When estimated from data, these parameters exhibit great
dependence on each other, in a way that varies with the domain of the data. Words that are semantically
related to each other are likely to increase or decrease in frequency together when moving from
one data domain to another (where these changes are measured relative to a language model
learned from an unrelated text). A text about veterinary science will have a simultaneous increase
in the probability of words such as “dog,” “cat” and “fur” compared to a text about religion—
though each word separately can have a unique probability, high or low. A prior distribution
on this unigram model encapsulates our prior beliefs about the parameters, which vary across text
domains, for example. Using the Dirichlet distribution is unsatisfactory, because it is unable to
capture a dependency structure between the words in the vocabulary.
The next section explains how this independence property of the Dirichlet distribution can
be partially remedied, through the use of a different distribution as a prior over the multinomial
distribution.
Summary
The Dirichlet distribution is often used as a conjugate prior to the categorical and multinomial
distributions. This conjugacy makes it extremely useful in Bayesian NLP, because the categori-
cal distribution is ubiquitous in NLP modeling. The Dirichlet distribution has the advantage of
potentially encouraging sparse solutions, when its hyperparameters are set properly. This prop-
erty has been repeatedly exploited in the NLP literature, because distributions over linguistic
elements such as part-of-speech tags or words typically tend to be sparse. The Dirichlet distri-
bution also comes with limitations. For example, this distribution assumes a nearly independent
structure between the coordinates of points in the probability simplex that are drawn from it.
3.2.2 THE LOGISTIC NORMAL DISTRIBUTION
The (additive) logistic normal distribution defines a distribution over the probability simplex
by transforming a multivariate Gaussian variable:

θ_i = exp(η_i) / (1 + ∑_{j=1}^{K−1} exp(η_j)),   ∀i ∈ {1, ..., K − 1},   (3.13)
θ_K = 1 / (1 + ∑_{j=1}^{K−1} exp(η_j)),   (3.14)

for some random vector η ∈ R^{K−1} that distributes according to the multivariate normal distribution
with mean value μ and covariance matrix Σ.
Therefore, the logistic normal distribution, as its name implies, is a multivariate normal
variable that has been transformed using the logistic transformation. The reason that this multivariate
normal variable needs to be (K − 1)-dimensional instead of K-dimensional is to eliminate
a redundant degree of freedom: had the logistic normal distribution used a K-dimensional
multivariate normal variable, one degree of freedom could have been canceled by choosing one of the
coordinates and subtracting it from all the others (the resulting vector of differences would still
be a multivariate normal variable).
An additional dependence structure that does not exist in the Dirichlet distribution appears
in the logistic normal distribution because of the explicit dependence structure represented
through the covariance matrix Σ. Therefore, in light of the discussion in Section 3.2.1, the logistic
normal distribution is an alternative to the Dirichlet distribution. Unlike the Dirichlet
distribution, the logistic normal distribution is not conjugate to the multinomial distribution.
Figure 3.1 provides a plot of the logistic normal distribution with various hyperparameters
for K = 3 and n = 5,000 (i.e., the number of samples drawn is 5,000). With independence
between the dimensions, and variance 1 for each dimension, the distribution is spread over the
whole probability simplex. When the correlation is negative and close to −1, we see a narrower
spread. With large variance (and independence between the coordinates), the logistic
normal behaves almost like a sparse distribution.
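Sampling from the additive logistic normal only requires a multivariate normal draw followed by the transformation in Equations 3.13–3.14. The sketch below is a toy setting, loosely mirroring Figure 3.1, with K = 3 and a negative off-diagonal covariance; numpy is assumed.

import numpy as np

rng = np.random.default_rng(0)
K = 3
mu = np.zeros(K - 1)
Sigma = np.array([[1.0, -0.7],
                  [-0.7, 1.0]])        # negative correlation between the two coordinates

eta = rng.multivariate_normal(mu, Sigma, size=5)
denom = 1.0 + np.exp(eta).sum(axis=1, keepdims=True)
theta = np.hstack([np.exp(eta) / denom, 1.0 / denom])   # Equations 3.13-3.14
print(theta)               # each row lies on the probability simplex
print(theta.sum(axis=1))   # rows sum to 1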
Figure 3.1: A plot of sampled data from the logistic normal distribution with K = 3, with
various Σ. The hyperparameter μ is always (0, 0). Top-left: Σ = (1, 0; 0, 1); top-right:
Σ = (1, 0.7; 0.7, 1); bottom-left: Σ = (1, −0.7; −0.7, 1); bottom-right: Σ = (5, 0; 0, 5).
Properties of the Additive Logistic Normal Distribution
Using the Jacobian transformation method (see Appendix A), Aitchison (1986) shows that the
density of the additive logistic normal distribution is:

p(θ | μ, Σ) = (1 / √((2π)^{K−1} det(Σ))) ( ∏_{i=1}^K θ_i )^{−1}
              exp( −(1/2) (log(θ_{−K}/θ_K) − μ)ᵀ Σ^{−1} (log(θ_{−K}/θ_K) − μ) ),

where θ_{−K} = (θ_1, ..., θ_{K−1}) and log(θ_{−K}/θ_K) ∈ R^{K−1} is the vector with coordinates
[log(θ_{−K}/θ_K)]_i = log(θ_i/θ_K).
This density is only defined on the probability simplex. Both the moments and the logarithmic
moments of the logistic normal distribution are well-defined for all positive orders.
These moments are E[∏_{i=1}^K θ_i^{a_i}] and E[∏_{i=1}^K (log θ_i)^{a_i}] for a_i > 0. Unfortunately, even though
these moments exist, there are no closed-form expressions for them.
Since the log-ratio between θ_i and θ_j is distributed according to the normal distribution,
the following holds (Aitchison, 1986):

E[log(θ_i/θ_j)] = μ_i − μ_j,
Cov(log(θ_i/θ_j), log(θ_k/θ_ℓ)) = Σ_{ik} + Σ_{jℓ} − Σ_{iℓ} − Σ_{jk}.

In addition,

E[θ_i/θ_j] = exp( μ_i − μ_j + (1/2)(Σ_{ii} − 2Σ_{ij} + Σ_{jj}) ).
In NLP models, the parameters θ usually consist of a collection of multinomials θ^1, ..., θ^K
rather than a single one, where θ^k is a distribution over N_k events. A prior over the full set of
parameters is then most often defined as a product of independent priors,

p(θ) = ∏_{k=1}^K p^k(θ^k),
• Generate a multivariate normal variable η ∈ R^{(∑_{k=1}^K N_k) − K}. The multivariate normal
variable has mean μ and covariance matrix Σ of size ((∑_{k=1}^K N_k) − K) × ((∑_{k=1}^K N_k) − K).
• For each k ∈ {1, ..., K}, with η^k denoting the block of η associated with the k-th multinomial, set

θ_i^k = exp(η_i^k) / ∏_{j=1}^{i} (1 + exp(η_j^k)),   ∀i ∈ {1, ..., N_k − 1},
θ_{N_k}^k = 1 / ∏_{j=1}^{N_k−1} (1 + exp(η_j^k)).
Generative Story 3.1: The generative story for the partitioned logistic normal distribution.
where each distribution p^k can be, for example, a Dirichlet or a logistic normal. However,
this decomposition does not introduce a covariance structure between events in different multinomials;
there is a clear independence assumption in the prior between multinomials of different
index k.
One way to overcome this issue is by using the partitioned logistic normal distribution
(Aitchison, 1986). The partitioned logistic normal distribution is similar to the logistic
normal distribution, only it is defined as a prior on a whole collection of multinomial distributions,
such as θ^1, ..., θ^K. To generate such a collection of multinomials, the generative process
is given in generative story 3.1.
The covariance matrix Σ now permits correlations between all components of the vector
η.
The partitioned logistic normal is related to the shared logistic normal distribution, intro-
duced by Cohen and Smith (2009). Both incorporate a covariance structure that exists outside
of multinomial boundaries, as defined by the natural factorization of a set of parameters into
multinomials. The shared logistic normal encodes such covariance more implicitly, by averaging
several Gaussian variables (“normal experts”) that are then exponentiated and normalized. See
also Cohen and Smith (2010b) for discussion.
The Multiplicative Logistic Normal Distribution   The multiplicative logistic normal distribution
transforms a (K − 1)-dimensional multivariate normal variable η into a point on the probability
simplex as follows:

θ_i = exp(η_i) / ∏_{j=1}^{i} (1 + exp(η_j)),   ∀i ∈ {1, ..., K − 1},
θ_K = 1 / ∏_{j=1}^{K−1} (1 + exp(η_j)).

The multiplicative logistic normal distribution is not often used in Bayesian NLP models,
and is given here mostly for completeness.
where ψ is the digamma function and ψ′ is its derivative (see Appendix B).
Summary
Despite its lack of conjugacy to the categorical distribution, the logistic normal distribution is a
useful prior for the categorical distribution. While the Dirichlet exhibits an independence structure
in the probability simplex, the logistic normal distribution introduces an explicit dependence
structure originating in a multivariate normal distribution.
There are two variants of the logistic normal distribution: the additive and the multiplica-
tive. Usually, if a distribution is referred to as being a logistic normal, but no additional reference
is made to the type of logistic normal used, the additive is the one being referred to.
3.2.3 DISCUSSION
As Aitchison (1986) points out, several studies have been conducted in an attempt to gener-
alize the Dirichlet distribution to a family that has more dependence structure in it (see Sec-
tion 3.2.1), and that subsumes the family of Dirichlet distributions. Two such attempts are the
scaled Dirichlet distribution and the Connor-Mosimann distribution.
The scaled Dirichlet distribution is parametrized by two vectors α and β, all positive, of
the same length d. Its density is then the following:

p(θ | α, β) = (Γ(∑_{i=1}^d α_i) / ∏_{i=1}^d Γ(α_i)) · ( ∏_{i=1}^d β_i^{α_i} θ_i^{α_i − 1} ) / ( ∑_{i=1}^d β_i θ_i )^{∑_{i=1}^d α_i},
where Γ(x) is the Gamma function. The Connor-Mosimann distribution, on the other hand,
has the following density (it is also parametrized by α, β ∈ R^d with all positive values):

p(θ | α, β) = ∏_{i=1}^d θ_i^{α_i − 1} (1 − θ_1 − ⋯ − θ_i)^{γ_i} / B(α_i, β_i),

where γ_i = β_i − α_{i+1} − β_{i+1} for i < d, γ_d = β_d − 1, and

B(α_i, β_i) = Γ(α_i) Γ(β_i) / Γ(α_i + β_i).
Both of these attempts only slightly improve the dependence structure of the Dirichlet
distribution, and as Aitchison points out, the problem of finding a class of distributions that
generalizes the Dirichlet distribution and enriches it with more dependence structure is still
open.
3.2.4 SUMMARY
Multinomial distributions are an important building block in NLP models, especially when
considering generative modeling techniques. Most linguistic structures in NLP can be described
in terms of parts that originate in multinomial distributions. For example, each rule in a phrase-
structure tree, with a probabilistic context-free grammar model, originates in a multinomial
distribution that generates the right-hand sides of rules from a multinomial associated with the
left-hand side nonterminal.
As such, there has been an extensive use of priors over multinomials in Bayesian NLP, most
notably, with the Dirichlet distribution. The choice of the Dirichlet originates in its conjugacy
to the multinomial distribution, which leads to tractability, but it can also be used to encourage
the model to manifest properties such as sparsity.
The second multinomial prior family discussed in this section is the family of logistic nor-
mal distributions. Unlike the Dirichlet, the logistic normal family of distributions incorporates
explicit covariance structure between the various parameters of the multinomial distribution. It
has additive and multiplicative versions.
Z
p./d D 1:
For example, there is no uniform distribution on the real line R (or moreR 1 generally, any
d
unbounded set in R for d 2 N ), simply because for any c > 0, the integral 1 cd diverges
to infinity. As a consequence, any attempt to define a uniform prior on the real line will lead to
an improper prior.
Even when the prior is improper, it is still technically (or algebraically) possible to use
Bayes’ rule to calculate the posterior,
\[
p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int p(\theta)\, p(x \mid \theta)\, d\theta},
\]
and get a proper posterior distribution—if the integral $\int p(\theta)\, p(x \mid \theta)\, d\theta$ converges. For this
reason, improper priors are sometimes used by Bayesians, as long as the posterior is well-defined.
When using $p(\theta) = c$ for some $c > 0$, i.e., a uniform (possibly improper) prior, there is an over-
lap between Bayesian statistics and maximum likelihood estimation, which is a purely frequentist
method. This is discussed in Section 4.2.1.
When a flat uniform prior becomes an improper prior, it is possible to instead use a vague
prior. Such a prior is not improper, but it is also not uniform. Instead, it has a large spread, such
that its tail goes to 0 to avoid divergence of the prior when integrating over the parameter space.
The tail is usually “heavy” in order to retain a distribution that is as close as possible to uniform.
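To make the improper-uniform case concrete, the sketch below (with made-up Gaussian data and a known variance, so all names and values are illustrative) normalizes the likelihood numerically under a flat prior $p(\theta) = c$ and confirms that the resulting posterior is proper; the constant $c$ cancels.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: n observations from a Gaussian with a known standard deviation.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)
sigma = 1.0

# Improper uniform prior p(theta) = c: the constant cancels after normalization,
# so the posterior is the normalized likelihood.
grid = np.linspace(-5.0, 10.0, 20001)
dx = grid[1] - grid[0]
log_lik = norm.logpdf(x[:, None], loc=grid[None, :], scale=sigma).sum(axis=0)
post = np.exp(log_lik - log_lik.max())
post /= post.sum() * dx          # the integral is finite, so the posterior is proper

# Analytically, the posterior is N(mean(x), sigma^2 / n); compare the means.
print((grid * post).sum() * dx, x.mean())
```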
3.3.2 JEFFREYS PRIOR
One criticism of non-informative priors, as described above, is that they are not invariant to
re-parameterization. This means that if $\theta$ is transformed into a new representation for the pa-
rameters, using a one-to-one (perhaps even smooth) mapping, the resulting non-informative
prior will have different properties, and will not remain a uniform prior, for example.
Intuitively, if a prior is not informative about the parameters, the prior should stay consis-
tent under re-parametrization. This means that the probability mass assigned to a set of parame-
ters should remain the same as the probability mass assigned to the set of these parameters after
being re-parametrized. For this reason, statisticians have sought out sets of priors that could still
be considered to be non-informative, but remain invariant to transformations of the parameters.
One example of such a prior is Jeffreys prior (Jeffreys, 1961).
Jeffreys priors are defined based on the Fisher information in the parameters. In the case of a multivariate parameter vector $\theta$, coupled with a likelihood function $p(x \mid \theta)$, the Fisher information $i(\theta)$ is a function of the parameter values which returns a matrix:
\[
(i(\theta))_{ij} = -E\!\left[\left.\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(x \mid \theta)\, \right|\, \theta\right].
\]
When $\theta$ is univariate, the Fisher information reduces to the variance of the score function, which is the derivative of the log-likelihood function. Jeffreys (1961) proposed to define the following prior:
\[
p(\theta) \propto \sqrt{\det(i(\theta))}.
\]
Eisenstein et al. (2011) used a Jeffreys prior in their Sparse Additive Generative (SAGE)
model on a parameter that serves as a variance value for a draw from the Gaussian distribution.
Several draws from the Gaussian distribution, each with their own variance, are combined with
a “background distribution” to produce a distribution over a set of words, representing a topic.
Eisenstein et al. claim that the use of the normal-Jeffreys combination (compared to a normal-
exponential combination they also tried) encourages sparsity, and also alleviates the need to
choose a hyperparameter for the prior (the Jeffreys prior they use is not parametrized).
Still, the use of Jeffreys priors in current Bayesian NLP work is uncommon. It is more
common to use uniform non-informative priors or vague priors using the Gamma distribution.
However, the strong relationship between the Dirichlet and multinomial distributions appears
again in the context of Jeffreys priors. The symmetric Dirichlet with hyperparameter 1=2 (see
Section 3.2.1) is the Jeffreys prior for the multinomial distribution.
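As a quick illustration of the multinomial case in its simplest (two-event) form, the snippet below (illustrative, not from the text) computes the Fisher information of a Bernoulli likelihood and checks that $\sqrt{i(\theta)}$ is proportional to the Beta(1/2, 1/2) density, the two-event instance of the symmetric Dirichlet with hyperparameter 1/2.

```python
import numpy as np
from scipy.stats import beta

def fisher_info_bernoulli(theta):
    # i(theta) = -E[d^2/dtheta^2 log p(x | theta)]
    # log p(x | theta) = x log(theta) + (1 - x) log(1 - theta), so the second
    # derivative is -x / theta^2 - (1 - x) / (1 - theta)^2; taking -E[.] with
    # E[x] = theta gives 1/theta + 1/(1 - theta) = 1 / (theta (1 - theta)).
    return theta / theta**2 + (1 - theta) / (1 - theta)**2

thetas = np.linspace(0.05, 0.95, 19)
jeffreys = np.sqrt(fisher_info_bernoulli(thetas))   # p(theta) proportional to sqrt(i(theta))
beta_half = beta.pdf(thetas, 0.5, 0.5)               # Beta(1/2, 1/2) density
print(np.allclose(jeffreys / jeffreys[0], beta_half / beta_half[0]))  # True
```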
In cases where a hierarchical prior is of interest (see Section 3.5), and the structure of the model is Dirichlet-multinomial, one could choose to use a Jeffreys prior for the Dirichlet distribution hyperparameters (Yang and Berger, 1998). If the Dirichlet distribution is parametrized by $\alpha_1, \ldots, \alpha_K > 0$, then the Jeffreys prior over $\alpha$ is $p(\alpha) \propto \sqrt{\det(i(\alpha_1, \ldots, \alpha_K))}$, where:
\[
[i(\alpha_1, \ldots, \alpha_K)]_{ii} = \Psi'(\alpha_i) - \Psi'\!\left(\sum_{i'=1}^K \alpha_{i'}\right) \qquad i \in \{1, \ldots, K\},
\]
\[
[i(\alpha_1, \ldots, \alpha_K)]_{ij} = -\Psi'\!\left(\sum_{i'=1}^K \alpha_{i'}\right) \qquad i \ne j;\ i, j \in \{1, \ldots, K\}.
\]
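A small sketch of how this prior could be evaluated numerically, assuming the expressions above and using the trigamma function $\Psi'$ from SciPy (the function name and test values are illustrative):

```python
import numpy as np
from scipy.special import polygamma

def dirichlet_fisher_information(alpha):
    """Fisher information matrix of the Dirichlet w.r.t. its hyperparameters.

    Diagonal entries are trigamma(alpha_i) - trigamma(sum(alpha)); off-diagonal
    entries are -trigamma(sum(alpha)), matching the expressions above.
    """
    alpha = np.asarray(alpha, dtype=float)
    trigamma_sum = polygamma(1, alpha.sum())
    info = -trigamma_sum * np.ones((len(alpha), len(alpha)))
    np.fill_diagonal(info, polygamma(1, alpha) - trigamma_sum)
    return info

alpha = np.array([0.5, 1.0, 2.0])
info = dirichlet_fisher_information(alpha)
jeffreys_unnormalized = np.sqrt(np.linalg.det(info))  # p(alpha), up to a constant
print(jeffreys_unnormalized)
```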
3.3.3 DISCUSSION
There is no clear agreement in the statistics community on what it means for a prior to be non-
informative. Some argue that uniform priors are non-informative because they assign an equal
probability to all parameters in the parameter space. However, when they exist, uniform priors
are not invariant to re-parametrization and therefore are not considered to be non-informative
by many statisticians. Jeffreys priors, on the other hand, are invariant to re-parametrization, but
they can have a preference bias for certain parts of the parameter space.
where $A(\theta)$ is used as a normalization constant (and is also called "the log-partition function"), defined as:
\[
A(\theta) = \log\left(\sum_x h(x) \exp\left(\eta(\theta) \cdot t(x)\right)\right).
\]
Many well-known distributions fall into this category of the exponential family. For example, the categorical distribution for a space with $d$ events, $\Omega = \{1, \ldots, d\}$, and parameters $\theta_1, \ldots, \theta_d$, can be represented as an exponential model with:
\[
\eta_i(\theta) = \log(\theta_i), \qquad (3.15)
\]
\[
t_i(x) = \mathbb{I}(x = i), \qquad (3.16)
\]
\[
h(x) = 1. \qquad (3.17)
\]
Many other distributions fall into that category, such as the Gaussian distribution, the
Dirichlet distribution, the Gamma distribution and others.
An exponential model can be reparametrized, where the new set of parameters is $\eta(\theta)$, and $\eta$ is replaced with the identity function. In this case, we say that the exponential model is in natural form (with "natural parameters"). The rest of the discussion focuses on exponential models in natural form, such that:
\[
\frac{\partial A(\theta)}{\partial \theta_i} = E[t_i(X)].
\]
This fact is used in NLP quite often when we are required to compute the gradient of
a log-linear model to optimize its parameters. In the case of $\Omega$ being a combinatorial discrete
space, such as a set of parse trees or labeled sequences, dynamic programming algorithms can
be used to compute these expectations. See Chapter 8 for more detail.
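The identity can be checked numerically for the categorical model in natural form; the sketch below (with arbitrary natural parameters, so the values are illustrative) compares a finite-difference gradient of the log-partition function with the expected sufficient statistics.

```python
import numpy as np

def log_partition(eta):
    # A(eta) = log sum_x exp(eta . t(x)) for the categorical model (h(x) = 1).
    return np.log(np.sum(np.exp(eta)))

eta = np.array([0.3, -1.2, 0.8, 0.0])     # hypothetical natural parameters
probs = np.exp(eta - log_partition(eta))  # model probabilities, i.e., E[t_i(X)]

# Finite-difference gradient of A(eta); it should match E[t_i(X)].
eps = 1e-6
grad = np.array([
    (log_partition(eta + eps * np.eye(len(eta))[i]) - log_partition(eta)) / eps
    for i in range(len(eta))
])
print(np.allclose(grad, probs, atol=1e-5))  # True
```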
Since we discuss the Bayesian setting, it is natural to inquire about what a conjugate prior is
for an exponential model. A conjugate prior for the model in Equation 3.18 is also an exponential
model of the following general form:
where $\chi \in \mathbb{R}^d$, $\nu_0 \in \mathbb{R}$ and $f : \mathbb{R}^d \to (\mathbb{R}^+ \cup \{0\})$. This general result can be used to prove many
of the conjugacy relationships between pairs of well-known distributions. (See the exercises
at the end of this chapter.) There is also a strong relationship between exponential models in
their natural form and log-linear models. More information about log-linear models is found in
Section 4.2.1.
Figure 3.2: A graphical depiction of the two levels at which a prior can be placed for a model with
three observations. (a) Parameters for all of the observations are being drawn once; (b) multiple
parameters are re-drawn for each observation, in an empirical Bayes style. Shaded nodes are
observed.
a random variable $X$, but it is often the case that the observed data is composed of multiple observations $x^{(1)}, \ldots, x^{(n)}$, all drawn from the distribution $p(X \mid \theta)$.
In this case, one way to write the joint distribution over the data and parameters is:
\[
p\left(\theta, x^{(1)}, \ldots, x^{(n)}\right) = p(\theta) \prod_{i=1}^n p\left(x^{(i)} \mid \theta\right).
\]
A more careful look into this equation reveals that there is an additional degree of freedom
in the placement of the prior. Instead of drawing the parameters once for all data points, one
could draw a set of parameters for each data point. In such a case, the joint distribution is:
\[
p\left(\theta^{(1)}, \ldots, \theta^{(n)}, x^{(1)}, \ldots, x^{(n)}\right) = \prod_{i=1}^n p\left(\theta^{(i)}\right) p\left(x^{(i)} \mid \theta^{(i)}\right).
\]
This type of model is also called a compound sampling model. With this approach to prior
modeling, the distribution
\[
p\left(\theta^{(1)}, \ldots, \theta^{(n)}\right) = \prod_{i=1}^n p\left(\theta^{(i)}\right) \qquad (3.19)
\]
can be considered to be a single prior over the joint set of parameters $(\theta^{(1)}, \ldots, \theta^{(n)})$. These two approaches are graphically depicted in Figure 3.2.
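The two placements correspond to two different generative processes. The following sketch (a toy Dirichlet-multinomial setup with made-up sizes, so everything in it is illustrative) contrasts a single shared draw of $\theta$ with the compound sampling model of Equation 3.19.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = np.array([0.2, 0.2, 0.2]), 5   # hypothetical Dirichlet prior and corpus size

# (a) Single draw: one theta shared by all observations.
theta = rng.dirichlet(alpha)
x_shared = [rng.multinomial(20, theta) for _ in range(n)]

# (b) Compound sampling: a fresh theta^(i) for every observation (Equation 3.19).
thetas = [rng.dirichlet(alpha) for _ in range(n)]
x_per_datum = [rng.multinomial(20, t) for t in thetas]

print(np.array(x_shared))     # counts tend to share the same skew
print(np.array(x_per_datum))  # counts can have a different skew per observation
```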
Conceptually, both approaches to prior placement have advantages and disadvantages
when modeling natural language. Drawing the parameters for each observation permits more
flexibility across the observations (or the predicted structures, in the case of latent variable models), allowing the model to capture variation across the corpus that arises, for example, because of differences in authors or genres. Generating the parameters at the top level (only once) suggests
that inference needs to be done in a smaller space: there is a need to find the posterior over a
single set of parameters. This reduces the complexity of the model.
When parameters are drawn separately for each datum (i.e., we use the prior in Equa-
tion 3.19), it is often useful to assume some kind of dependencies between these parameters.
We are also interested in inferring the hyperparameters, in order to infer dependencies between
the parameters. There are two dominating approaches to conducting this type of inference. Both
approaches assume that the prior $p(\theta)$ from which the $\theta^{(i)}$ are drawn is controlled by hyperparameters, so that we have $p(\theta \mid \alpha)$ for some $\alpha$.
The first approach is called empirical Bayes (Berger, 1985). Empirical Bayes here means
that the hyperparameters are estimated as well, usually by using a maximum likelihood criterion.
For more information about empirical Bayes, see Section 4.3.
The second approach is hierarchical Bayesian modeling. Hierarchical Bayesian models are
models in which the hyperparameters (which parametrize the prior) themselves are associated
with a (hyper)prior as well. A hierarchical Bayesian model would add an additional level of
priors, usually parametric (parametrized by $\lambda \in \Lambda$), $p(\alpha \mid \lambda)$, such that the joint distribution is:
\[
p\left(\alpha, \theta^{(1)}, \ldots, \theta^{(n)}, x^{(1)}, \ldots, x^{(n)} \mid \lambda\right) = p(\alpha \mid \lambda) \prod_{i=1}^n p\left(\theta^{(i)} \mid \alpha\right) p\left(x^{(i)} \mid \theta^{(i)}\right).
\]
The choice of a second stage prior (or hyperprior) has a less noticeable effect than the choice
of a first stage prior on the predictions the model makes. Therefore, vague priors and priors that
are mathematically convenient are more common as priors over hyperparameters, even though
they might not be the best fit for our beliefs about the model.
Unfortunately, many Bayesian NLP papers do not make it explicit in their model de-
scription whether parameters are drawn for each example or whether there is a single set of
parameters for all observations and latent structures. This ambiguity mostly arises because typ-
ically, the generative process of a Bayesian NLP model is described for a single observation (or
latent structure). The “loop” over all observations is not made explicit in the description of the
model.
With hidden Markov models, for example (Chapter 8), with a single draw of the param-
eters for the HMM, there is an implicit treatment in many papers of the issue with multiple
sequences as datapoints. One can assume that all sequences in the data are concatenated to-
gether into a single sequence, with a separator symbol between them. Then, we can proceed
with inference for this single sequence. This leads to an equivalent scenario as having multi-
ple sequences, where the probabilities for transitioning from the separator symbol to any other
symbol are treated as initial probabilities.
As a general rule, readers of Bayesian NLP papers should usually assume that there is
a single set of parameters drawn for all observations unless the paper is set in the empirical
Bayesian setting or in a hierarchical Bayesian setting. This general rule is overridden by other
clues about the placement of the prior, such as the derivation of the inference algorithm or other
Bayesian baselines that the method in the paper is compared against.
This book tries to be as explicit as possible about the placement of the prior. The prior is
abstractly defined at the top level, but the reader should keep in mind that in many cases $\theta$ actually represents multiple draws of parameters, with the prior defined in Equation 3.19.
Both hierarchical Bayesian modeling and empirical Bayes can be used in the case of a
single parameter draw for all parameters. However, they are most often used in NLP in the
context of multiple parameter draws, especially in the empirical Bayes setting.
CHAPTER 4
Bayesian Estimation
The main goal of Bayesian inference is to derive (from data) a posterior distribution over the
latent variables in the model, most notably the parameters of the model. This posterior can be
subsequently used to probabilistically infer the range of parameters (through Bayesian interval
estimates, in which we make predictive statements such as “the parameter is in the interval
Œ0:5; 0:56 with probability 0.95”), compute the parameters’ mean or mode, or compute other
expectations over quantities of interest. All of these are ways to summarize the posterior, instead
of retaining the posterior in its fullest form as a distribution, as described in the previous two
chapters.
In the traditional use of Bayesian statistics, this posterior summarization is done in order
to generate interpretable conclusions about the nature of the problem or data at hand. Differing
from the traditional use of Bayesian statistics, natural language processing is not usually focused
on summarizing the posterior for this kind of interpretation, but is instead focused on improving
the predictive power of the model for unseen data points. Examples for such predictions are the
syntactic tree of a sentence, the alignment of two sentences or the morphological segmentation
of a word.1
The most basic way to summarize the posterior for the use of an NLP problem is to
compute a point estimate from the posterior. This means that we identify a single point in the
parameter space (and therefore, a distribution from the model family) to be used for further pre-
dictions. At first it may seem that such an approach misses the point behind Bayesian inference,
which aims to manage uncertainty about the parameters of the model using full distributions;
however, posterior summarization, such as the posterior mean, often integrates over many values
of the parameters, and therefore, future predictions using this posterior summary rely heavily on
the prior (the posterior takes into account the prior), especially when small amounts of data are
available.
Posterior summarization is usually contrasted with the “fully Bayesian” approach, in which
predictions use the full posterior for any prediction. An example of a fully Bayesian approach is
the integration of the posterior over the parameters against the likelihood function to find the
highest scoring structure, averaging over all possible parameters. To understand the difference
between this fully Bayesian approach and identifying a point estimate, see Section 4.1. While
1 There are cases in which the actual values of the parameters are of interest in NLP problems. The parameters can be used
during the development phase, to determine which features in a model assist the most in improving its predictive power. The
actual values can also be used to interpret the model and understand the patterns that it has learned. This information can be
used iteratively to improve the expressive power of the model.
the Bayesian approach is more “correct” probabilistically, from a Bayesian point of view, it is
often intractable to follow. On the other hand, posterior summarization, like frequentist point
estimates, leads to lightweight models that can be easily used in future predictions.
This chapter includes a discussion of several ways in which the posterior can be summa-
rized, and also relates some of these approaches to frequentist estimation. The chapter includes
two main parts: the first part appears in Section 4.2, and details the core ways to summarize a
posterior in Bayesian NLP; the second part appears in Section 4.3, and explains the empirical
Bayes approach, in which point estimates are obtained for the hyperparameters, which can often
be used as a substitute for a point estimate for the parameters themselves.
The techniques described in this chapter stand in contrast to the techniques described in
Chapter 5 (sampling methods) and Chapter 6 (variational inference). The techniques in these
latter chapters describe ways to fully identify the posterior, or at least identify a means to draw
samples from it. While the techniques in these chapters are often used for fully-Bayesian infer-
ence, they can also be used to identify the posterior and then summarize it using the approaches
described in this chapter.
\[
p\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right) = \sum_{z^{(1)}, \ldots, z^{(n)}} p\left(\theta, z^{(1)}, \ldots, z^{(n)} \mid x^{(1)}, \ldots, x^{(n)}\right),
\]
\[
p\left(z' \mid x^{(1)}, \ldots, x^{(n)}, x'\right) \approx \int p\left(z' \mid \theta, x'\right) p\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right) d\theta. \qquad (4.1)
\]
This approximation is especially accurate when $n$ is large. In order for the posterior to be exact, we would need to condition the parameter distribution on $x'$ as well, i.e.:
\[
p\left(z' \mid x^{(1)}, \ldots, x^{(n)}, x'\right) = \int p\left(z' \mid \theta, x'\right) p\left(\theta \mid x^{(1)}, \ldots, x^{(n)}, x'\right) d\theta. \qquad (4.2)
\]
But with a large $n$, the effect of $x'$ on the posterior is negligible, in which case Equation 4.1 is a good approximation of the right-hand side of Equation 4.2. The derivation of Equation 4.2 is a direct result of the independence between input (and output) instances conditioned on the set of parameters $\theta$.
Integrating the likelihood in this manner against the posterior is quite complex and com-
putationally inefficient. For this reason, we often resort to approximation methods when in-
ferring this posterior (such as sampling or variational inference), or alternatively, use Bayesian
point estimation.
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta}\ p\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right) = \arg\max_{\theta}\ \frac{p(\theta)\, p\left(x^{(1)}, \ldots, x^{(n)} \mid \theta\right)}{p\left(x^{(1)}, \ldots, x^{(n)}\right)}.
\]
The motivation behind MAP estimation is simple and intuitive: choose the set of parameters that are most likely according to the posterior, which takes into account both the prior and the observed data. Note that $p\left(x^{(1)}, \ldots, x^{(n)}\right)$ does not depend on $\theta$, which is maximized over, and therefore:
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta}\ p(\theta)\, p\left(x^{(1)}, \ldots, x^{(n)} \mid \theta\right).
\]
In addition, since the $X^{(i)}$ for $i \in \{1, \ldots, n\}$ are independent given $\theta$, and since the log function is monotone, the MAP estimator corresponds to:
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta}\ \left( L(\theta) + \log p(\theta) \right), \qquad (4.3)
\]
where
\[
L(\theta) = \sum_{i=1}^n \log p\left(x^{(i)} \mid \theta\right).
\]
If $p(\theta)$ is constant (for example, when $p(\theta)$ denotes a uniform distribution over the probability simplex, a symmetric Dirichlet with hyperparameter 1), then Equation 4.3 recovers the maximum likelihood (ML) solution. The function $L(\theta)$ equals the log-likelihood function used with ML estimation. A uniform prior $p(\theta)$ is considered to be a non-informative prior, and is discussed more in Section 3.3.
More generally, the term $\log p(\theta)$ serves as a penalty term in the objective in Equation 4.3. This penalty term makes the objective smaller when $\theta$ is highly unlikely according to the prior.
2 For this reason, certain language modeling toolkits, such as the SRI language modeling toolkit (Stolcke, 2002), have the
option to ignore unseen words when computing perplexity on unseen data.
\[
\theta_j = \frac{(\alpha - 1) + \sum_{i=1}^n x_j^{(i)}}{K(\alpha - 1) + \sum_{j'=1}^K \sum_{i=1}^n x_{j'}^{(i)}}. \qquad (4.4)
\]
When $\alpha = 1$, the $\alpha - 1$ terms in the numerator and the denominator disappear, and we recover the maximum likelihood estimate—the estimate for $\theta$ is just composed of the relative frequency of each event. Indeed, when $\alpha = 1$, the prior $p(\theta \mid \alpha)$ with the Dirichlet is just a uniform non-informative prior, and therefore we recover the MLE (see the previous section and Equation 4.3).
When $\alpha > 1$, the MAP estimation in Equation 4.4 with the Dirichlet-multinomial corresponds to a smoothed maximum likelihood estimation. A pseudo-count, $\alpha - 1$, is added to each observation. This type of smoothing, also called additive smoothing, or Laplace-Lidstone smoothing, has been regularly used in data-driven NLP since its early days, because it helps to alleviate the problem of sparse counts in language data. (With $\alpha < 1$, there is a discounting effect, because $\alpha - 1 < 0$.)
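A minimal sketch of Equation 4.4 for a symmetric Dirichlet prior (the event counts are made up):

```python
import numpy as np

def map_multinomial(counts, alpha):
    """MAP estimate for a multinomial under a symmetric Dirichlet(alpha) prior.

    Implements Equation 4.4: add a pseudo-count of (alpha - 1) to every event.
    With alpha = 1 this is the maximum likelihood (relative frequency) estimate.
    """
    counts = np.asarray(counts, dtype=float)
    smoothed = counts + (alpha - 1.0)
    return smoothed / smoothed.sum()

counts = np.array([7, 2, 0, 1])            # hypothetical event counts
print(map_multinomial(counts, alpha=1.0))  # MLE: [0.7, 0.2, 0.0, 0.1]
print(map_multinomial(counts, alpha=2.0))  # add-one (Laplace) smoothing
```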
Additive smoothing is especially compelling because it is easy to implement. The inter-
pretation as a MAP solution has its added value, but it is not the origin of additive smoothing.
Indeed, additive smoothing has been used for n-gram models in the NLP community since the
late 80s, without necessarily referring to its Bayesian interpretation.
Chen and Goodman (1996) describe a thorough investigation of smoothing techniques
for language modeling, and compare additive smoothing to other smoothing techniques. Their
findings were that additive smoothing is far from being the optimal solution for an accurate
estimation of n-gram models. Katz smoothing (Katz, 1987) and interpolation with lower order
estimation of n-gram language models (Jelinek and Mercer, 1980) performed considerably better
on a held-out data set (the reported performance measure is cross-entropy; see Appendix A).
Not surprisingly, smoothing the counts by adding 1 to all of them (as has been argued to be
a “morally correct” choice by Lidstone (1920) and Jeffreys (1961)) did not perform as well as
smoothing by varying the pseudo-counts added to the n-gram counts.3
In spite of its lack of optimality, additive smoothing still remains a basic tool in NLP
which is often tried out after vanilla maximum likelihood estimation. This is probably due to ad-
ditive smoothing’s efficiency and straightforward implementation as an extension of maximum
likelihood estimation. However, it is often the case that in order to achieve state-of-the-art per-
formance with maximum likelihood estimation, a more complex smoothing scheme is required,
such as interpolation or incorporation of lower order models.
3 Modern language models more often use smoothing techniques such as the one by Kneser and Ney (1995), for which a
Bayesian interpretation was discovered relatively recently (under a nonparametric model). See Chapter 7.4.1.
MAP Estimation and Regularization
There is a strong connection between maximum a posteriori estimation with certain Bayesian
priors and a frequentist type of regularization, in which an objective function, such as the log-
likelihood, is augmented with a regularization term to avoid overfitting. We now describe this
connection with log-linear models.
Log-linear models are a common type of model for supervised problems in NLP. In the
generative case, a model is defined on pairs .x; z/ where x is the input to the decoding problem
and z is the structure to be predicted. The model form is the following:
\[
p(X, Z \mid \theta) = \frac{\exp\left(\sum_{j=1}^K \theta_j f_j(X, Z)\right)}{A(\theta)},
\]
where $f(x, z) = (f_1(x, z), \ldots, f_K(x, z))$ is a feature vector that extracts information about the pair $(x, z)$ in order to decide on its probability according to the model.
Each function $f_i(x, z)$ maps $(x, z)$ to $\mathbb{R}$, and is often just a binary function, taking values in $\{0, 1\}$ (to indicate the existence or absence of a sub-structure in $x$ and $z$), or an integer-valued function, taking values in $\mathbb{N}$ (to count the number of times a certain sub-structure appears in $x$ and $z$).
The function $A(\theta)$ is the partition function, defined in order to normalize the distribution:
\[
A(\theta) = \sum_x A(\theta, x), \qquad (4.5)
\]
where
\[
A(\theta, x) = \sum_z \exp\left(\sum_{j=1}^K \theta_j f_j(x, z)\right).
\]
Alternatively, in a discriminative setting, only the predicted structures are modeled, and
the log-linear model is defined as a conditional model:
\[
p(Z \mid X, \theta) = \frac{\exp\left(\sum_{j=1}^K \theta_j f_j(X, Z)\right)}{A(\theta, X)}.
\]
In this case, $A(\theta)$ in Equation 4.5 is not needed. This is important, because $A(\theta)$ is often intractable to compute due to the summation over all possible $x$. On the other hand, the function $A(\theta, x)$ is often tractable to compute for a specific $x$, using algorithms such as dynamic programming algorithms (Chapter 8).
When $Z \in \{-1, 1\}$, log-linear models are framed as "logistic regression" (binary) classifiers. In that case, we define (note that there is no need for a feature function that depends on the label, because of the sum-to-1 constraint on the label probabilities):
\[
p(Z \mid X, \theta) =
\begin{cases}
\dfrac{1}{1 + \exp\left(\sum_{j=1}^K \theta_j f_j(X)\right)} & \text{if } Z = 1 \\[2.5ex]
\dfrac{\exp\left(\sum_{j=1}^K \theta_j f_j(X)\right)}{1 + \exp\left(\sum_{j=1}^K \theta_j f_j(X)\right)} & \text{if } Z = -1.
\end{cases}
\]
Either way, both with generative and discriminative log-linear models or with logistic regression, the classical approach in NLP for estimating the parameters is to maximize a log-likelihood objective with respect to $\theta$. In the generative case, the objective is:
\[
\sum_{i=1}^n \log p\left(x^{(i)}, z^{(i)} \mid \theta\right).
\]
A naïve maximization of the likelihood often leads to overfitting the model to the training
data. The parameter values are not constrained or penalized for being too large; therefore, the
log-likelihood function tends to fit the parameters in such a way that even patterns in the data
that are due to noise or do not represent a general case are taken into account. This makes
the model not generalize as well to unseen data. With certain low-frequency features that are
associated with a single output, the feature weights might even diverge to infinity.
Regularization is one solution to alleviate this problem. With $L_2$-regularization,⁵ for example, the new objective function being optimized is (in the generative case):
\[
\sum_{i=1}^n \log p\left(x^{(i)}, z^{(i)} \mid \theta\right) + R(\theta), \qquad (4.6)
\]
where
4 Modern machine learning makes use of other discriminative learning algorithms for learning linear models similar to
that, most notably max-margin algorithms and the perceptron algorithm.
⁵ The $L_2$ norm of a vector $x \in \mathbb{R}^d$ is its Euclidean length: $\sqrt{\sum_{i=1}^d x_i^2}$.
\[
R(\theta) = -\frac{1}{2\sigma^2} \left(\sum_{j=1}^K \theta_j^2\right),
\]
for some fixed $\sigma \in \mathbb{R}$. The regularized discriminative objective function is defined analogously,
replacing the log-likelihood with the conditional log-likelihood.
The intuition behind this kind of regularization is simple. When the parameters become too large (which happens when the objective function also fits the noise in the training data, leading to overfitting), the regularization term becomes large in absolute value and makes the entire objective much smaller. Therefore, depending on the value of $\sigma$, the regularization term $R(\theta)$ encourages solutions in which the feature weights are closer to 0.
Even though this type of regularization is based on a frequentist approach to the problem of estimation, there is a connection between this regularization and Bayesian analysis. When exponentiating the regularization term and multiplying by a constant (which does not depend on $\theta$), this regularization term becomes the value of the density function of the multivariate normal distribution, defined over $\theta$, with zero mean and a diagonal covariance matrix with $\sigma^2$ on the diagonal.
This means that maximizing Equation 4.6 corresponds to maximizing:
\[
\sum_{i=1}^n \log p\left(x^{(i)}, z^{(i)} \mid \theta\right) + \log p\left(\theta \mid \sigma^2\right), \qquad (4.7)
\]
with $p(\theta \mid \sigma^2)$ being a multivariate normal prior over the parameters $\theta$. The mean value of this multivariate normal distribution is 0, and its covariance is $\sigma^2 I_{K \times K}$. Equation 4.7 has exactly the same structure as Equation 4.3. Therefore, $L_2$-regularization corresponds to MAP estimation with a Gaussian prior over the parameters.
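The correspondence can be sketched in code. The snippet below performs MAP estimation for a binary logistic regression model with a zero-mean Gaussian prior; it uses the common sign convention $p(z \mid x, \theta) = 1/(1 + \exp(-z\,\theta^\top f(x)))$ rather than the display earlier in this section, and the data, feature dimension and variance are made up, but the objective is exactly the $L_2$-regularized log-likelihood of Equation 4.6.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(theta, X, z, sigma2):
    """Negative of (conditional log-likelihood + log Gaussian prior), up to constants.

    Minimizing this is MAP estimation with a N(0, sigma2 * I) prior, i.e. the
    same as maximizing the L2-regularized log-likelihood in Equation 4.6.
    """
    margins = z * (X @ theta)                     # labels z in {-1, +1}
    log_lik = -np.logaddexp(0.0, -margins).sum()  # sum of log sigmoid(z * theta . x)
    log_prior = -np.sum(theta ** 2) / (2.0 * sigma2)
    return -(log_lik + log_prior)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # hypothetical feature vectors
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
z = np.where(X @ true_theta + 0.1 * rng.normal(size=200) > 0, 1, -1)
theta_map = minimize(neg_log_posterior, np.zeros(5), args=(X, z, 1.0)).x
print(theta_map)
```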
There are other alternatives to $L_2$ regularization. Consider, for example, the following prior on $\theta$:
\[
p(\theta \mid \lambda) = \prod_{j=1}^K p(\theta_j \mid \lambda),
\]
\[
p(\theta_j \mid \lambda) = \frac{1}{2\lambda} \exp\left(-\frac{|\theta_j|}{\lambda}\right). \qquad (4.8)
\]
The distribution over each $\theta_j$ in Equation 4.8 is also called the Laplace distribution (its mean is 0 and its variance is $2\lambda^2$; see Appendix B). The prior $p(\theta \mid \lambda)$, coupled with MAP estimation, leads to a maximization problem of the form (ignoring constants):
\[
\sum_{i=1}^n \log p\left(x^{(i)}, z^{(i)} \mid \theta\right) - \frac{1}{\lambda} \left(\sum_{j=1}^K |\theta_j|\right).
\]
\[
p\left(x^{(1)}, \ldots, x^{(n)}, z^{(1)}, \ldots, z^{(n)}, \theta \mid \alpha\right) = p(\theta \mid \alpha) \prod_{i=1}^n p\left(z^{(i)} \mid \theta, \alpha\right) p\left(x^{(i)} \mid z^{(i)}, \theta, \alpha\right).
\]
The latent structures are denoted by the random variables $Z^{(i)}$, and the observations are denoted by the random variables $X^{(i)}$. The posterior has the form:
\[
p\left(\theta, z^{(1)}, \ldots, z^{(n)} \mid x^{(1)}, \ldots, x^{(n)}, \alpha\right).
\]
The most comprehensive way to get a point estimate from this posterior through MAP estimation is to marginalize out the $Z^{(i)}$ and then find $\theta$ as follows:
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta} \sum_{z^{(1)}, \ldots, z^{(n)}} p\left(\theta, z^{(1)}, \ldots, z^{(n)} \mid x^{(1)}, \ldots, x^{(n)}, \alpha\right). \qquad (4.9)
\]
However, such an estimate often does not have an analytic form, and even computing it
numerically can be inefficient. One possible way to avoid this challenge is to change the opti-
mization problem in Equation 4.9 to:
\[
\theta_{\mathrm{MAP}} = \arg\max_{\theta} \max_{z^{(1)}, \ldots, z^{(n)}} p\left(\theta, z^{(1)}, \ldots, z^{(n)} \mid x^{(1)}, \ldots, x^{(n)}, \alpha\right). \qquad (4.10)
\]
This optimization problem, which identifies the mode of the posterior both with respect
to the parameters and the latent variables, is often more manageable to solve. For example,
simulation methods such as MCMC algorithms can be used together with simulated annealing
to find the mode of this posterior. The idea behind simulated annealing is to draw samples,
through MCMC inference, from the posterior after a transformation so that it puts most of its
probability mass on its mode. The transformation is gradual, and determined by a “temperature
schedule,” which slowly decreases a temperature parameter. That particular temperature param-
eter determines how peaked the distribution is—the lower the temperature is, the more peaked
the distribution. Simulated annealing is discussed in detail in Section 5.6. The replacement of
the marginal maximization problem (Equation 4.9) with the optimization problem in Equa-
tion 4.10 is a significant approximation. It tends to work best when the posterior has a peaked
form—i.e., most of the probability mass of the posterior is concentrated on a few elements in
the set of possible structures to predict.
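The effect of the temperature parameter can be illustrated with a toy unnormalized posterior over a handful of structures (the numbers are made up): raising it to the power $1/T$ and renormalizing makes the distribution increasingly peaked around its mode as $T$ decreases.

```python
import numpy as np

# Unnormalized posterior over a small discrete set of structures (hypothetical values).
unnormalized = np.array([0.2, 1.0, 0.5, 0.9, 0.1])

for temperature in [4.0, 1.0, 0.25, 0.05]:
    # Raising the (unnormalized) posterior to the power 1/T and renormalizing
    # concentrates the distribution around its mode as T decreases.
    tempered = unnormalized ** (1.0 / temperature)
    tempered /= tempered.sum()
    print(temperature, np.round(tempered, 3))
```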
Another approximation to the optimization problem in Equation 4.9 can be based on
variational approximations. In this case, the posterior is approximated using a distribution q ,
which often has a factorized form:
\[
p\left(\theta, z^{(1)}, \ldots, z^{(n)}\right) \approx q(\theta) \left(\prod_{i=1}^n q\left(z^{(i)}\right)\right).
\]
The distribution $q(\theta)$ is also assumed to have a parametric form, and identifying each constituent of the distribution $q$ (for each predicted structure and the parameters) is done iteratively using approximate inference methods such as mean-field variational inference. Then, the approximate MAP estimate is simply:
\[
\theta_{\mathrm{MAP}} \approx \arg\max_{\theta}\ q(\theta),
\]
where $q(\theta)$ is the marginal approximate posterior distribution over the parameters. A thorough
discussion of variational approximation methods in Bayesian NLP is found in Chapter 6.
\[
p(\theta \mid x) \approx f\left(\theta \mid \theta^*, \Sigma^*\right), \qquad (4.11)
\]
where
\[
f\left(\theta \mid \theta^*, \Sigma^*\right) = \frac{1}{(2\pi)^{K/2} \sqrt{|\det(\Sigma^*)|}} \exp\left(-\frac{1}{2} (\theta - \theta^*)^\top (\Sigma^*)^{-1} (\theta - \theta^*)\right)
\]
is the density of the multivariate normal distribution with mean $\theta^*$ (the mode of the posterior) and covariance matrix $\Sigma^*$ defined as the inverse of the Hessian of the negated log-posterior at the point $\theta^*$:
\[
\left(\Sigma^*\right)^{-1}_{i,j} = -\frac{\partial^2 h}{\partial \theta_i \partial \theta_j}(\theta^*),
\]
with $h(\theta) = \log p(\theta \mid X = x)$. Note that $\Sigma^*$ must be a positive definite matrix to serve as the covariance matrix of the distribution in Equation 4.11. This requires, in particular, that the Hessian be a symmetric matrix, which holds when the second derivatives of the log-posterior are continuous near the mode.
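A minimal one-dimensional sketch of this construction, using a Gamma density as a stand-in posterior and finite differences for the Hessian (all names and values here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma, norm

# A one-dimensional example: approximate a Gamma(a, rate=b) "posterior" with a
# Gaussian centered at its mode (the Laplace approximation).
a, b = 20.0, 4.0
log_post = lambda t: gamma.logpdf(t, a, scale=1.0 / b)

# Find the mode theta* numerically.
theta_star = minimize_scalar(lambda t: -log_post(t), bounds=(1e-6, 20.0), method="bounded").x

# Second derivative of the log-posterior at the mode, by finite differences.
eps = 1e-4
hess = (log_post(theta_star + eps) - 2 * log_post(theta_star) + log_post(theta_star - eps)) / eps**2
sigma2_star = -1.0 / hess                 # inverse of the negated Hessian

grid = np.linspace(2.0, 8.0, 5)
print(np.exp(log_post(grid)))                                       # exact density
print(norm.pdf(grid, loc=theta_star, scale=np.sqrt(sigma2_star)))   # Laplace approximation
```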
The Laplace approximation is based on a second-order Taylor approximation of the log-posterior. A second-order Taylor approximation of the log-posterior around the point $\theta^*$ yields:
\[
h(\theta) \approx h(\theta^*) + \frac{1}{2} (\theta - \theta^*)^\top \nabla^2 h(\theta^*) (\theta - \theta^*),
\]
where the first-order term vanishes because $\theta^*$ is a mode of the posterior. When the parameter space is constrained, it is common to first map the parameters to an unconstrained space; for a parameter in $(0, 1)$, for example, one can use the logit transformation:
\[
\mathrm{logit}(u) = \log\left(\frac{u}{1 - u}\right) \quad \forall u \in (0, 1). \qquad (4.13)
\]
In Appendix A, there is a discussion of how to re-parametrize distributions using the
Jacobian transformation.
\[
R\left(\hat{\theta}\right) = \int \sum_x L\left(\hat{\theta}(x), \theta\right) p(X = x \mid \theta)\, p(\theta)\, d\theta.
\]
This analysis computes the average loss of estimating the parameters $\theta$ using the estimator function $\hat{\theta}(x)$, where the average is taken with respect to both the likelihood function and prior information about the parameters. The Bayes risk is a natural candidate for minimization in
order to find an optimal set of parameters, in which the lowest average loss is incurred. For a
complete discussion of the use of decision theory with Bayesian analysis, see Berger (1985).
Minimizing the Bayes risk can be done by choosing $\hat{\theta}(x)$ to minimize the posterior loss:
\[
E\left[L\left(\hat{\theta}(x), \theta\right) \mid X = x\right] = \int L\left(\hat{\theta}(x), \theta\right) p(\theta \mid X = x)\, d\theta
\propto \int L\left(\hat{\theta}(x), \theta\right) p(X = x \mid \theta)\, p(\theta)\, d\theta,
\]
i.e., $\hat{\theta}(x) = \arg\min_{\theta'} E\left[L(\theta', \theta) \mid X = x\right]$.
Minimizing this expectation is not necessarily tractable in the general case, but choosing specific loss functions often makes it possible to carry out the minimization analytically.
For example, if the parameter space is a subset of $\mathbb{R}^K$, and
\[
L\left(\hat{\theta}(x), \theta\right) = \left\| \hat{\theta}(x) - \theta \right\|_2^2, \qquad (4.14)
\]
then the posterior loss minimizer is the mean value of the parameters under the posterior, i.e.:⁶
⁶ To see this, consider that for any random variable $T$, the quantity $E[(T - a)^2]$ is minimized with respect to $a$ when $a = E[T]$.
\[
\hat{\theta}(x) = \arg\min_{\theta'} E\left[L(\theta', \theta) \mid X = x\right] = E_{p(\theta \mid X = x)}[\theta], \qquad (4.15)
\]
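This property is easy to verify by simulation; the sketch below (with Beta-distributed stand-in posterior draws, all illustrative) shows that the candidate minimizing a Monte Carlo estimate of the posterior squared loss is the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
posterior_samples = rng.beta(3.0, 7.0, size=100000)  # stand-in draws from p(theta | x)

candidates = np.linspace(0.05, 0.95, 19)
expected_loss = [np.mean((c - posterior_samples) ** 2) for c in candidates]
best = candidates[int(np.argmin(expected_loss))]
print(best, posterior_samples.mean())  # minimizer of the squared loss is about the posterior mean
```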
\[
p(\theta \mid X = x) = \frac{\int_{\alpha} p(X = x \mid \theta)\, p(\theta \mid \alpha)\, p(\alpha)\, d\alpha}{\int_{\theta} \int_{\alpha} p(X = x \mid \theta)\, p(\theta \mid \alpha)\, p(\alpha)\, d\alpha\, d\theta},
\]
with $p(\alpha)$ being a distribution over the hyperparameters. This fully Bayesian approach places a second-stage prior on the hyperparameters and, when inferring the posterior, integrates out
$\alpha$. Empirical Bayes takes a different approach to the problem of encoding information into the hyperparameters. Instead of using a prior $p(\alpha)$, in the empirical Bayes setting a fixed value for $\alpha$ is learned from the observed data $x$. This hyperparameter, $\hat{\alpha}(x)$, can either be learned by maximizing the marginal likelihood $p(X = x \mid \alpha)$ or estimated through other estimation techniques. Then, predictions are made using a posterior of the form $p(\theta \mid X = x, \hat{\alpha}(x))$. One can also learn $\hat{\alpha}(x)$ from a dataset different from the one on which final inference is performed.
This idea of identifying a hyperparameter $\hat{\alpha}(x)$ based on the observed data is related to hyperparameter identification with conjugate priors (Section 3.1). Still, there are several key differences between hyperparameter identification with conjugate priors and empirical Bayes as it is described in this section. First, empirical Bayes does not have to be done with conjugate priors. Second, identifying $\hat{\alpha}(x)$ is usually done using a different statistical technique than regular Bayesian inference (application of Bayes' rule), as is done with conjugate priors (see below, for example, about type II maximum likelihood). Third, empirical Bayes is usually just a preliminary stage to identify a set of hyperparameters, which can then be followed by Bayesian inference, potentially on a new set of data.
potentially on a new set of data.
Similarly to the case with hierarchical priors, empirical Bayes is often used when the
parameters of the model are not drawn once for the whole corpus, but are instead drawn multiple times, once for each instance in the corpus (see Section 3.5).
In this case, there are multiple observations $x^{(1)}, \ldots, x^{(n)}$, each associated with a set of parameters $\theta^{(i)}$ for $i \in \{1, \ldots, n\}$. Empirical Bayes is similar to Bayesian point estimation, only
instead of identifying a single parameter from the observed data, a single hyperparameter is
identified.
This hyperparameter, $\hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)$, summarizes the information in the learned prior:
\[
p\left(\theta^{(1)}, \ldots, \theta^{(n)} \mid \hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)\right) = \prod_{i=1}^n p\left(\theta^{(i)} \mid \hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)\right).
\]
Although the traditional empirical Bayesian setting proceeds at this point to perform inference (after estimating $\hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)$, using the posterior $p\left(\theta \mid \hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)\right)$), it is sometimes preferable in NLP to apply a simple function to $\hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)$ in order to identify a point estimate for the model. For example, one can find the mode of the estimated prior, $p\left(\theta^{(1)}, \ldots, \theta^{(n)} \mid \hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right)\right)$, or its mean.
Maximizing the marginal likelihood is the most common approach to empirical
Bayes in NLP. In that case, the following optimization problem—or an approximation of it—is
solved:
\[
\hat{\alpha}\left(x^{(1)}, \ldots, x^{(n)}\right) = \arg\max_{\alpha}\ p\left(x^{(1)}, \ldots, x^{(n)} \mid \alpha\right). \qquad (4.16)
\]
There is an implicit marginalization of the parameters $\theta^{(i)}$ (or of a single $\theta$, if each observed datum is not associated with its own parameters) in the formulation in Equation 4.16. In addition, the random variables $Z^{(i)}$, the latent structures, are also marginalized out if they are part of
the model. In this setting, empirical Bayes is also referred to as type II maximum likelihood
estimation. With latent variables, maximizing likelihood in this way is often computationally
challenging, and algorithms such as variational EM are used. Variational approximations are
discussed at length in Chapter 6.
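A small sketch of type II maximum likelihood for a beta-binomial model, in which the per-instance parameters $\theta^{(i)}$ can be integrated out analytically; the data, the hyperparameter names and the optimizer choice are illustrative, not from the text.

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize

# Hypothetical data: coins flipped 20 times each, with per-coin biases theta^(i)
# drawn from a shared Beta(a, b) prior. Empirical Bayes estimates (a, b) by
# maximizing the marginal likelihood, with the theta^(i) integrated out.
rng = np.random.default_rng(0)
n_flips = 20
thetas = rng.beta(2.0, 5.0, size=200)
heads = rng.binomial(n_flips, thetas)

def neg_marginal_log_likelihood(log_ab):
    a, b = np.exp(log_ab)  # optimize in log-space to keep a, b positive
    # Beta-binomial marginal likelihood, dropping the binomial coefficient
    # (it does not depend on a or b).
    return -np.sum(betaln(a + heads, b + n_flips - heads) - betaln(a, b))

result = minimize(neg_marginal_log_likelihood, x0=np.log([1.0, 1.0]))
print(np.exp(result.x))  # estimated (a, b); roughly recovers (2, 5) with enough data
```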
Finkel and Manning (2009) describe a simple example of the use of empirical Bayes in
NLP and an exploration of its advantages. They define a hierarchical prior for the purpose of
domain adaptation. Their model is a log-linear model, in which there is a Gaussian prior with
a varying mean over the feature weights (see discussion in Section 4.2.1)—instead of a regular
L2 regularization that assumes a zero-mean Gaussian prior. Each domain (among K domains)
corresponds to a different mean for the Gaussian prior. In addition, they have a zero-mean
Gaussian prior that is placed on the mean of all of the domain Gaussian priors. Such a hierar-
chical prior encourages the domain-specific models to share information when the available statistics are sparse; if there is enough data for a specific domain, that data will override this kind of information sharing.
The space of parameters is $\mathbb{R}^K$, and it parametrizes a conditional random field model. The hierarchical prior of Finkel and Manning is defined as follows:
\[
p\left(\theta^*, \theta^{(1)}, \ldots, \theta^{(J)} \mid \sigma_1, \ldots, \sigma_J, \sigma^*\right) = p\left(\theta^* \mid \sigma^*\right) \prod_{j=1}^J p\left(\theta^{(j)} \mid \theta^*, \sigma_j\right),
\]
with each $p\left(\theta^{(j)} \mid \theta^*, \sigma_j\right)$ being a multivariate normal distribution with covariance matrix $\sigma_j^2 I$ and mean $\theta^*$, and $p\left(\theta^* \mid \sigma^*\right)$ being a multivariate normal distribution with mean zero and covariance matrix $(\sigma^*)^2 I$.
In an empirical evaluation of their approach, Finkel and Manning tried their prior with
named entity recognition (NER) and dependency parsing. For NER, each domain was repre-
sented by a different NER dataset from the CoNLL 2003 (Tjong Kim Sang and De Meul-
der, 2003), MUC-6 (Chinchor and Sundheim, 2003) and MUC-7 (Chinchor, 2001) shared
tasks datasets. For this problem, their model performed better than just concatenating all of
the datasets and training a single conditional random field with that large set of data. The per-
formance gain (in $F_1$-measure) ranged from 0.43% to 2.66%, depending on the data set being
tested.
For the parsing problem, Finkel and Manning used the OntoNotes data (Hovy et al.,
2006), which includes parse trees from seven different domains. For this problem, the results
were more mixed: in four of the cases the hierarchical model performed better than the rest of
the tested methods, and in three cases, concatenating all of the domains into a single domain
performed better than the rest of the methods being tested.
Finkel and Manning also show that their model is equivalent to the domain adaptation
model of Daume III (2007). In Daume’s model, the features in the base conditional random
field are duplicated for each domain. Then, for each datum in each domain, two sets of features
are used: one feature set associated with the specific domain the datum came from, and one
feature set that is used for all of the domains.
4.5 SUMMARY
Bayesian point estimation is especially useful when a summary of the posterior is needed. In
NLP, the most common reason for such a need is to maintain a lightweight model with a fixed
set of parameters. Such a fixed set of parameters enables computationally efficient solutions for
decoding.
Several common smoothing and regularization techniques can be interpreted as Bayesian
point estimation with a specific prior. Additive smoothing, for example, can be interpreted as
the mean of a posterior seeded by a Dirichlet prior. $L_2$ regularization can be interpreted as a maximum a posteriori solution with a Gaussian prior, and $L_1$ regularization can be interpreted as a MAP solution with a Laplace prior.
Empirical Bayes estimation is another technique that is related to Bayesian point esti-
mation. With empirical Bayes, a point estimate for the hyperparameters is identified. This point
estimate can be subsequently followed with regular Bayesian inference (potentially on a new set of
data), or used to summarize the posterior over the parameters to identify a final point estimate
for the parameters.
4.6 EXERCISES
4.1. Show that Equation 4.4 is true. (Hint: the maximizer of the log-posterior is also the
maximizer of the posterior.)
4.2. Let $\theta$ be a value in $[0, 1]$ drawn from the Beta distribution parametrized by $(\alpha, \beta)$. Use the Jacobian transformation (Appendix A) to transform the distribution over $\theta$ to the real line with a new random variable $\mu = \mathrm{logit}(\theta)$. The logit transformation is defined in Equation 4.13.
4.3. Show that Equation 4.15 is true for the choice of $L\left(\hat{\theta}(x), \theta\right)$ as it appears in Equation 4.14.
4.4. Let $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathbb{R}$ for $i \in \{1, \ldots, n\}$. With least squares ridge regression, our goal is to find a weight vector $\theta \in \mathbb{R}^d$ such that:
\[
\theta^* = \arg\min_{\theta}\ \left(\sum_{i=1}^n \left(y^{(i)} - \theta \cdot x^{(i)}\right)^2\right) + \lambda \left(\sum_{j=1}^d \theta_j^2\right). \qquad (4.17)
\]
CHAPTER 5
Sampling Methods
When the posterior cannot be analytically represented, or efficiently computed, we often have
to resort to approximate inference methods. One main thread of approximate inference relies
on the ability to simulate from the posterior in order to draw structures or parameters from the
underlying distribution represented by the posterior. The samples drawn from this posterior can
be averaged to approximate expectations (or normalization constants). If these samples are close
to the posterior mode, they can be used as the final output. In this case, the samples replace the
need to find the highest scoring structure according to the model, which is often computationally
difficult to do if one is interested in averaging predictions with respect to the inferred distribution
over the parameters (see Section 4.1).
Monte Carlo (MC) methods provide a general framework ideal for drawing samples from
a target distribution that satisfies certain conditions. While not specific to Bayesian statistics,
an especially useful family of MC methods in the Bayesian context is that of Markov chain Monte Carlo (MCMC) methods. In general, these methods have the advantage of allowing sampling
from a family of distributions that satisfy certain conditions (usually, that a distribution is com-
putable up to a normalization constant). In Bayesian statistics, they are often used for poste-
rior inference, because posterior distributions for various Bayesian models naturally meet these
conditions. MCMC algorithms are especially useful in the Bayesian context for finding the
normalization constant of the posterior, marginalizing out variables, computing expectations of
summary statistics and finding the posterior mode.
It is important to keep in mind that Bayesian inference, at its core, manages uncertainty
regarding the parameters and the remaining latent variables through the use of distributions. This
means that the goal of Bayesian inference is to eventually find the posterior distribution in one
form or another. Monte Carlo methods treat this problem slightly differently. Instead of directly
representing the posterior distribution as a member of some (possibly approximate) family of
distributions (such as with variational inference, see Chapter 6), MC methods instead permit
indirect access to this posterior. Access to the posterior comes in the form of being able to draw
from the posterior distribution, without necessarily needing a complete analytic representation
for it.
The focus of this chapter is to provide an account of the way that Monte Carlo methods
are used in Bayesian NLP. We cover some of the principal Markov chain Monte Carlo methods,
and detail the design choices and the advantages and disadvantages for using them in the con-
text of Bayesian NLP. We also cover some techniques used to assess the convergence of MCMC
methods to the target distribution. Convergence here means that the MCMC sampler, which
is iterative and outputs a sequence of samples, has finished its “burn-in” period, during which
time it outputs samples that are not necessarily drawn from the target distribution. When the
MCMC sampler has reached convergence, its output represents samples from the target distri-
bution. It is often the case that poor assessment of an MCMC method implies that the output
returned is invalid, and does not represent the underlying Bayesian model used.
This chapter is organized as follows. We begin by providing an overview of MCMC meth-
ods in Section 5.1 and then follow with an account of MCMC in NLP in Section 5.2. We then
start covering several important MCMC sampling algorithms, such as Gibbs sampling (Sec-
tion 5.3), Metropolis–Hastings (Section 5.4) and slice sampling (Section 5.5). We then cover
other topics such as simulated annealing (Section 5.6); the convergence of MCMC algorithms
(Section 5.7); the basic theory behind MCMC algorithms (Section 5.8); non-MCMC sampling
algorithms such as importance sampling (Section 5.9); and finally, Monte Carlo integration
(Section 5.10). We conclude with a discussion (Section 5.11) and a summary (Section 5.12).
\[
p\left(\theta, x^{(1)}, \ldots, x^{(n)}, z^{(1)}, \ldots, z^{(n)} \mid \alpha\right) = p(\theta \mid \alpha) \prod_{i=1}^n p\left(z^{(i)} \mid \theta\right) p\left(x^{(i)} \mid z^{(i)}, \theta\right).
\]
Given that only the $x^{(i)}$ for $i \in \{1, \ldots, n\}$ are observed, the posterior then has the form:
\[
p\left(z^{(1)}, \ldots, z^{(n)}, \theta \mid x^{(1)}, \ldots, x^{(n)}\right).
\]
MCMC sampling yields a stream of samples from this posterior, which can be used in
various ways, including to find a point estimate for the parameters, to draw predicted structures
from the posterior or even to find the maximum value of the posterior using simulated annealing
(see Section 5.6). Often, the collapsed setting is of interest, and then samples are drawn from the
posterior over the predicted structures only, integrating out the model parameters:
\[
p\left(z^{(1)}, \ldots, z^{(n)} \mid x^{(1)}, \ldots, x^{(n)}\right) = \int p\left(z^{(1)}, \ldots, z^{(n)}, \theta \mid x^{(1)}, \ldots, x^{(n)}\right) d\theta.
\]
In this case, the parameters are nuisance variables, because the inference procedure is not
focused on them. Still, it is often the case that a summary of the parameters (in the style of
Chapter 4) can be inferred from the samples drawn for the latent variables in the model.
Algorithm 5.1: The Gibbs sampling algorithm, in its “systematic sweep” form. Part of the input
to the Gibbs algorithm is samplers for the conditional distributions derived from the target
distribution. Such samplers are treated as black-box functions that draw samples from these
conditionals (in line 4).
For example, a sentence-blocked sampler for part-of-speech tagging could sample the tags for a
whole sentence using the dynamic programming forward-backward algorithm. A blocked sam-
pler can also sample smaller constituents—for example, five part-of-speech tags at a time—again
using the forward-backward algorithm applied to that window of tags.
In order to use a Gibbs sampler, one has to be able to draw samples for one variable in the
model conditioned on a fixed value for all others. To achieve this, another MCMC algorithm
can be used to sample from conditional distributions (see Section 5.11); however, it is often the
case in Bayesian NLP models that these conditionals have an analytic form, and therefore are
easy to sample from (while the whole posterior is intractable, and requires MCMC or some
other approximate method).
Example 5.1 Consider the latent Dirichlet allocation model from Chapter 2. We denote the
number of documents it models by N , the size of the vocabulary by V , and the number of words
per document by M (in general, the number of words in a document varies, but for the sake of
simplicity we assume all documents are of the same length). The full graphical model for LDA
appears in Figure 2.1.
We will use the index $i$ to range over documents, $j$ to range over words in a specific document, $k$ to range over the possible topics and $v$ to range over the vocabulary. In addition, there are random variables $\theta^{(i)} \in \mathbb{R}^K$, which are the document topic distributions, $Z_j^{(i)}$, which denotes the topic of the $j$th word in the $i$th document, $W_j^{(i)}$, which denotes the $j$th word in the $i$th document, and $\beta_k \in \mathbb{R}^V$, which denotes the distribution over the vocabulary for the $k$th topic.
The joint distribution is factorized as follows:
\[
p(\theta, \beta, Z, W \mid \gamma, \alpha) = \left(\prod_{k=1}^K p(\beta_k \mid \gamma)\right) \left(\prod_{i=1}^N p\left(\theta^{(i)} \mid \alpha\right) \prod_{j=1}^M p\left(Z_j^{(i)} \mid \theta^{(i)}\right) p\left(W_j^{(i)} \mid \beta, Z_j^{(i)}\right)\right). \qquad (5.1)
\]
The random variables we need to infer are $\theta$, $\beta$ and $Z$. One way to break the random variables into constituents (in conditional distributions) for a Gibbs sampler is the following:
• $p(\beta_k \mid \theta, \beta_{-k}, z, w, \gamma, \alpha)$ for $k \in \{1, \ldots, K\}$. This is the distribution over the parameters of the LDA model that denote the probabilities over the vocabulary for each topic (conditioned on all other random variables). We denote by $\beta_{-k}$ the set $\{\beta_{k'} \mid k' \ne k\}$.
• $p\left(\theta^{(i)} \mid \theta^{(-i)}, \beta, z, w, \gamma, \alpha\right)$ for $i \in \{1, \ldots, N\}$. This is the distribution over the topic distribution for the $i$th document, conditioned on all other random variables in the model. We denote by $\theta^{(-i)}$ the topic distributions for all documents other than the $i$th document.
• $p\left(Z_j^{(i)} \mid \theta, \beta, z^{-(i,j)}, w, \gamma, \alpha\right)$ for $i \in \{1, \ldots, N\}$ and $j \in \{1, \ldots, M\}$. This is the distribution over a topic assignment for a specific word (the $j$th word) in a specific document (the $i$th document), conditioned on all other random variables in the model. We denote by $z^{-(i,j)}$ the set of all topic assignment variables other than $Z_j^{(i)}$.
At this point, the question remains as to what the form is of each of these distributions, and how we can sample from them. We begin with $p(\beta_k \mid \theta, \beta_{-k}, z, w, \gamma, \alpha)$. According to Equation 5.1, the only factors that interact with $\beta_k$ appear on the right-hand side of the following equation:
\[
p(\beta_k \mid \theta, \beta_{-k}, z, w, \gamma, \alpha) \propto p(\beta_k \mid \gamma) \left(\prod_{i=1}^N \prod_{j=1}^M p\left(w_j^{(i)} \mid \beta, z_j^{(i)}\right)^{\mathbb{I}\left(z_j^{(i)} = k\right)}\right)
\]
\[
= \left(\prod_{v=1}^V \beta_{k,v}^{\gamma - 1}\right) \left(\prod_{i=1}^N \prod_{j=1}^M \prod_{v=1}^V \beta_{k,v}^{\mathbb{I}\left(w_j^{(i)} = v \wedge z_j^{(i)} = k\right)}\right)
\]
\[
= \prod_{v=1}^V \beta_{k,v}^{\gamma - 1 + \sum_{i=1}^N \sum_{j=1}^M \mathbb{I}\left(w_j^{(i)} = v \wedge z_j^{(i)} = k\right)}. \qquad (5.2)
\]
Denote by $n_{k,v}$ the quantity $\sum_{i=1}^N \sum_{j=1}^M \mathbb{I}\left(w_j^{(i)} = v \wedge z_j^{(i)} = k\right)$. In this case, $n_{k,v}$ denotes the number of times the word $v$ is assigned to topic $k$ in any of the documents, based on the current state of the sampler. The form of Equation 5.2 is exactly the form of a Dirichlet distribution with hyperparameter $\gamma + n_k$, where $n_k$ is the vector of $n_{k,v}$ ranging over $v$. This concludes the derivation of the conditional distribution, which is required to sample a new set of topic distributions $\beta$ given the state of the sampler.
Consider $p\left(\theta^{(i)} \mid \theta^{(-i)}, \beta, z, w, \gamma, \alpha\right)$. Following a similar derivation, we have:
\[
p\left(\theta^{(i)} \mid \theta^{(-i)}, \beta, z, w, \gamma, \alpha\right) \propto p\left(\theta^{(i)} \mid \alpha\right) \prod_{j=1}^M p\left(z_j^{(i)} \mid \theta^{(i)}\right)
\]
\[
= \left(\prod_{k=1}^K \left(\theta_k^{(i)}\right)^{\alpha - 1}\right) \left(\prod_{j=1}^M \prod_{k=1}^K \left(\theta_k^{(i)}\right)^{\mathbb{I}\left(z_j^{(i)} = k\right)}\right)
\]
\[
= \prod_{k=1}^K \left(\theta_k^{(i)}\right)^{\alpha - 1 + \sum_{j=1}^M \mathbb{I}\left(z_j^{(i)} = k\right)}. \qquad (5.3)
\]
Denote by $m_k^{(i)}$ the quantity $\sum_{j=1}^M \mathbb{I}\left(z_j^{(i)} = k\right)$, i.e., the number of times a word in the $i$th document was assigned to the $k$th topic. Then, it follows from Equation 5.3 that $\theta^{(i)}$, conditioned on all other random variables in the model, is distributed according to the Dirichlet distribution with hyperparameters $\alpha + m^{(i)}$, where $m^{(i)}$ is the vector ranging over $m_k^{(i)}$ for all $k$.
The last distribution we have to consider is $p\left(Z_j^{(i)} \mid \theta, \beta, z^{-(i,j)}, w, \gamma, \alpha\right)$. Again using the joint distribution from Equation 5.1, it holds that:
\[
p\left(Z_j^{(i)} = k \mid \theta, \beta, z^{-(i,j)}, w, \gamma, \alpha\right) \propto p\left(Z_j^{(i)} = k \mid \theta^{(i)}\right) p\left(w_j^{(i)} \mid \beta, z_j^{(i)} = k\right) = \theta_k^{(i)} \beta_{k, w_j^{(i)}}. \qquad (5.4)
\]
Note that this sampler is a pointwise sampler—it samples each coordinate of $Z^{(i)}$ sepa-
rately. This example demonstrates a Gibbs sampler from a Dirichlet-multinomial family with
non-trivial relationships between the multinomials. Such is the relationship in other more com-
plex models in NLP (PCFGs, HMMs and so on). The structure of this Gibbs sampler will be
similar in these more complex cases as well. This is especially true regarding the draw of $\theta$. In
more complex models, statistics will be collected from the current set of samples for the latent
structures, and combined into hyperparameters for the posterior of the Dirichlet distribution.
In more complex cases, the draws of $Z^{(i)}$ could potentially rely on dynamic programming
algorithms, for example, to draw a phrase-structure tree conditioned on the parameters (with
PCFGs), or to draw a latent sequence (with HMMs)—see Chapter 8.
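Before turning to the collapsed setting, here is a minimal sketch of the explicit Gibbs sampler just derived, written for a toy corpus; the function and variable names are illustrative, and the hyperparameter names $\alpha$ and $\gamma$ follow the reconstruction above.

```python
import numpy as np

def lda_gibbs_explicit(docs, K, V, alpha, gamma, n_iters, seed=0):
    """Gibbs sampler for LDA that samples beta, theta and z in turn (Example 5.1).

    `docs` is a list of N lists of word ids in {0, ..., V-1}; `alpha` and `gamma`
    are the symmetric Dirichlet hyperparameters for documents and topics.
    """
    rng = np.random.default_rng(seed)
    N = len(docs)
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]

    for _ in range(n_iters):
        # Draw beta_k ~ Dirichlet(gamma + n_k), where n_k counts word/topic assignments.
        n_kv = np.zeros((K, V))
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                n_kv[z[i][j], w] += 1
        beta = np.array([rng.dirichlet(gamma + n_kv[k]) for k in range(K)])

        # Draw theta^(i) ~ Dirichlet(alpha + m^(i)), where m^(i) counts topics in doc i.
        theta = np.array([
            rng.dirichlet(alpha + np.bincount(z[i], minlength=K)) for i in range(N)
        ])

        # Draw each z_j^(i) from a categorical proportional to theta_k^(i) * beta_{k, w}.
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                p = theta[i] * beta[:, w]
                z[i][j] = rng.choice(K, p=p / p.sum())
    return z, theta, beta

docs = [[0, 1, 1, 2], [2, 3, 3, 3], [0, 0, 1, 2]]   # toy corpus, V = 4
z, theta, beta = lda_gibbs_explicit(docs, K=2, V=4, alpha=0.5, gamma=0.5, n_iters=50)
print(theta.round(2))
```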
1: Initialize $z^{(1)}, \ldots, z^{(n)}$ with some value from their space of allowed values
2: repeat
3: for all $i \in \{1, \ldots, n\}$ do
4: Sample $z^{(i)}$ from $p\left(Z^{(i)} \mid z^{(1)}, \ldots, z^{(i-1)}, z^{(i+1)}, \ldots, z^{(n)}, X\right)$
5: end for
6: until Markov chain converged
7: return $z^{(1)}, \ldots, z^{(n)}$
Algorithm 5.2: The collapsed observation-blocked Gibbs sampling algorithm, in its “systematic
sweep” form.
\[
p\left(z^{(1)}, \ldots, z^{(n)}, \theta \mid \alpha\right) = p(\theta \mid \alpha) \prod_{i=1}^n p\left(z^{(i)} \mid \theta\right).
\]
To derive a collapsed Gibbs sampler, the conditional distributions $p\left(Z^{(i)} \mid Z^{(-i)}\right)$ are required. Let $e_k$ be the binary vector with 0 in all coordinates, except for the $k$th coordinate, where it is 1. The following holds:
\[
p\left(Z^{(i)} = e_k \mid z^{(-i)}, \alpha\right) = \int p\left(Z^{(i)} = e_k, \theta \mid z^{(-i)}, \alpha\right) d\theta
\]
\[
= \int p\left(\theta \mid z^{(-i)}, \alpha\right) p\left(Z^{(i)} = e_k \mid z^{(-i)}, \theta, \alpha\right) d\theta
\]
\[
= \int p\left(\theta \mid z^{(-i)}, \alpha\right) p\left(Z^{(i)} = e_k \mid \theta, \alpha\right) d\theta \qquad (5.5)
\]
\[
= \int p\left(\theta \mid z^{(-i)}, \alpha\right) \theta_k\, d\theta
\]
\[
= \frac{\sum_{j \ne i} z_k^{(j)} + \alpha_k}{\sum_{k'=1}^K \left(\sum_{j \ne i} z_{k'}^{(j)} + \alpha_{k'}\right)} \qquad (5.6)
\]
\[
= \frac{\sum_{j \ne i} z_k^{(j)} + \alpha_k}{n - 1 + \sum_{k'=1}^K \alpha_{k'}}. \qquad (5.7)
\]
Equation 5.5 is true because of the conditional independence of $Z^{(i)}$ and $Z^{(-i)}$ given the parameters $\theta$. Note that in Equation 5.5 the term $p\left(\theta \mid z^{(-i)}, \alpha\right)$ is a Dirichlet distribution with hyperparameter $\alpha + \sum_{j \ne i} z^{(j)}$ (see Section 3.2.1), and that in the equation below it, $p\left(Z^{(i)} = e_k \mid \theta\right) = \theta_k$. Therefore, the integral is the mean value of $\theta_k$ according to a Dirichlet distribution with hyperparameters $\alpha + \sum_{j \ne i} z^{(j)}$, leading to Equation 5.6 (see Appendix B).
Example 5.2 exposes an interesting structure to a Gibbs sampler when the distribution
being sampled is a multinomial with a Dirichlet prior. According to Equation 5.7:
\[
p\left(Z^{(i)} = e_k \mid Z^{(-i)} = z^{(-i)}\right) = \frac{n_k + \alpha_k}{(n - 1) + \sum_{k'=1}^K \alpha_{k'}}, \qquad (5.8)
\]
with $n_k = \sum_{j \ne i} z_k^{(j)}$, which is the total count of event $k$ appearing in $z^{(-i)}$. Intuitively, the
probability of a latent variable $Z^{(i)}$ taking a particular value is proportional to the number of times this value has been assigned to the rest of the latent variables, $Z^{(-i)}$. Equation 5.8 is essentially an additively smoothed (see Section 4.2.1) version of the maximum likelihood estimate of the parameters, when the estimation is based on the values of the variables being conditioned on. This kind of structure arises frequently when designing Gibbs samplers for models that have a Dirichlet-multinomial structure.
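A short sketch of Equation 5.8 in code, assuming a symmetric Dirichlet hyperparameter (all names and values are made up):

```python
import numpy as np

def resample_z_collapsed(z, i, alpha, K, rng):
    """Resample the i-th assignment from Equation 5.8, with theta integrated out."""
    counts = np.bincount(np.delete(z, i), minlength=K)   # n_k over z^(-i)
    probs = (counts + alpha) / (len(z) - 1 + K * alpha)
    z[i] = rng.choice(K, p=probs)

rng = np.random.default_rng(0)
K, alpha = 3, 0.5
z = rng.integers(0, K, size=30)
for sweep in range(100):
    for i in range(len(z)):
        resample_z_collapsed(z, i, alpha, K, rng)
print(np.bincount(z, minlength=K))
```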
The following example, which is less trivial than the example above, demonstrates this
point.
Example 5.3 Consider the LDA Example 5.1. It is often the case that the topic assignments
Z are the only random variables that are of interest to do inference for.1 The parameters ˇ and
the topic distributions can therefore be marginalized in this case when performing Gibbs
sampling.
Therefore, we can place our focus on the random variables Z and W . We would like to
draw a value for Zj.i/ conditioned on z .i;j / and w. This means that we are interested in the
distribution p Zj.i/ j z .i;j / ; w . By Bayes’ rule, we have the following:
p Zj.i/ D k j z .i;j / ; w / p wj.i/ j Zj.i/ ; z .i;j / ; w .i;j / p Zj.i/ D k j z .i;j / : (5.9)
\[
p\left(w_j^{(i)} \mid Z_j^{(i)} = k, z^{-(i,j)}, w^{-(i,j)}\right) = \int_{\beta_k} p\left(\beta_k, w_j^{(i)} \mid Z_j^{(i)} = k, z^{-(i,j)}, w^{-(i,j)}\right) d\beta_k
\]
\[
= \int_{\beta_k} p\left(w_j^{(i)} \mid Z_j^{(i)} = k, \beta_k\right) p\left(\beta_k \mid Z_j^{(i)} = k, z^{-(i,j)}, w^{-(i,j)}\right) d\beta_k
\]
\[
= \int_{\beta_k} p\left(w_j^{(i)} \mid Z_j^{(i)} = k, \beta_k\right) p\left(\beta_k \mid z^{-(i,j)}, w^{-(i,j)}\right) d\beta_k. \qquad (5.10)
\]
The last equality holds because $Z_j^{(i)}$ and $\beta_k$ are conditionally independent when $W_j^{(i)}$ is not observed. Note also that $\beta_k$ and $Z^{-(i,j)}$ are a priori independent of each other, and therefore:
\[
p\left(\beta_k \mid z^{-(i,j)}, w^{-(i,j)}\right) \propto p\left(w^{-(i,j)} \mid \beta_k, z^{-(i,j)}\right) p\left(\beta_k \mid z^{-(i,j)}\right)
\]
\[
= p\left(w^{-(i,j)} \mid \beta_k, z^{-(i,j)}\right) p(\beta_k)
\]
\[
= \left(\prod_{v=1}^V \prod_{i'=1}^N \prod_{\substack{j'=1 \\ (i',j') \ne (i,j)}}^M \beta_{k,v}^{\mathbb{I}\left(z_{j'}^{(i')} = k \wedge w_{j'}^{(i')} = v\right)}\right) p(\beta_k)
\]
\[
= \left(\prod_{v=1}^V \beta_{k,v}^{\sum_{i'=1}^N \sum_{j'=1, (i',j') \ne (i,j)}^M \mathbb{I}\left(z_{j'}^{(i')} = k \wedge w_{j'}^{(i')} = v\right)}\right) p(\beta_k)
\]
\[
= \prod_{v=1}^V \beta_{k,v}^{\sum_{i'=1}^N \sum_{j'=1, (i',j') \ne (i,j)}^M \mathbb{I}\left(z_{j'}^{(i')} = k \wedge w_{j'}^{(i')} = v\right) + \gamma - 1}.
\]
The above means that the distribution $p\left(\beta_k \mid z^{-(i,j)}, w^{-(i,j)}\right)$ has the form of a Dirichlet with parameters $\gamma + n_{-(i,j),k}$, such that $n_{-(i,j),k}$ is a vector of length $V$, and each coordinate $v$ equals the number of instances in $z^{-(i,j)}$ and $w^{-(i,j)}$ in which the $v$th word in the vocabulary was assigned to topic $k$ (note that we exclude from this count the $j$th word in the $i$th document).
Note that the term $p\left(w_j^{(i)} \mid Z_j^{(i)} = k, \beta_k\right)$ in Equation 5.10 is just $\beta_{k, w_j^{(i)}}$, and taking it together with the above means that Equation 5.10 is the mean value of a Dirichlet distribution with parameters $\gamma + n_{-(i,j),k}$. This means that:
\[
p\left(w_j^{(i)} = v \mid Z_j^{(i)} = k, z^{-(i,j)}, w^{-(i,j)}\right) = \frac{\gamma + \left[n_{-(i,j),k}\right]_v}{V\gamma + \sum_{v'} \left[n_{-(i,j),k}\right]_{v'}}. \qquad (5.11)
\]
We finished tackling the first term in Equation 5.9. We must still tackle $p(Z^{(i)}_j = k \mid z^{-(i,j)})$. First note that by Bayes' rule and the conditional independence assumptions in the model, it holds that:
\begin{align*}
p\left(Z^{(i)}_j = k \mid z^{-(i,j)}\right) &= \int_{\theta^{(i)}} p\left(\theta^{(i)}, Z^{(i)}_j = k \mid z^{-(i,j)}\right) d\theta^{(i)} \\
&= \int_{\theta^{(i)}} p\left(Z^{(i)}_j = k \mid \theta^{(i)}\right) p\left(\theta^{(i)} \mid z^{-(i,j)}\right) d\theta^{(i)}. \tag{5.12}
\end{align*}
A similar derivation as before shows that $p(\theta^{(i)} \mid z^{-(i,j)})$ is a Dirichlet distribution with parameters $\alpha + m_{-(i,j)}$, where $m_{-(i,j)}$ is a $K$-length vector such that $[m_{-(i,j)}]_{k'}$ is the number of times in $z^{-(i,j)}$ that words in the $i$th document (other than the $j$th word) were assigned to topic $k'$. The term $p(Z^{(i)}_j = k \mid \theta^{(i)})$ is just $\theta^{(i)}_k$. Equation 5.12 is therefore again the mean value of a Dirichlet distribution: it is the $k$th coordinate of the mean value of a Dirichlet with parameters $\alpha + m_{-(i,j)}$. This means that:
\[
p\left(Z^{(i)}_j = k \mid z^{-(i,j)}\right) = \frac{\alpha + \left[m_{-(i,j)}\right]_k}{K\alpha + \sum_{k'} \left[m_{-(i,j)}\right]_{k'}}. \tag{5.13}
\]
Taking Equations 5.9, 5.11 and 5.13 together, we get that in order to apply Gibbs sampling to LDA in the collapsed setting, we must sample at each point a single topic assignment for a single word in a document based on the distribution:
\[
p\left(Z^{(i)}_j = k \mid z^{-(i,j)}, w\right) \propto \frac{\gamma + \left[n_{-(i,j),k}\right]_{w^{(i)}_j}}{V\gamma + \sum_{v'} \left[n_{-(i,j),k}\right]_{v'}} \cdot \frac{\alpha + \left[m_{-(i,j)}\right]_k}{K\alpha + \sum_{k'} \left[m_{-(i,j)}\right]_{k'}}.
\]
It should not come as a surprise that the collapsed Gibbs sampler in the example above
has a closed analytic form. This is a direct result of the Dirichlet distribution being conjugate
to the multinomial distributions that govern the production of $Z^{(i)}$ and $W^{(i)}$. Even though
the coupling of latent variables with the integration of the parameters may lead to non-analytic
solutions for the posterior, the use of conjugate priors still often leads to analytic solutions for
the conditional distributions (and for the Gibbs sampler, as a consequence).
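The following is a minimal sketch of one sweep of such a collapsed Gibbs sampler for LDA, based on Equations 5.11 and 5.13. The symmetric hyperparameters alpha and gamma, the toy documents, and the count arrays are assumptions made for illustration; this is not the book's Algorithm 5.2.

import numpy as np

def collapsed_lda_sweep(docs, z, ndk, nkv, nk, alpha, gamma, rng):
    """One sweep of collapsed Gibbs sampling for LDA.
    docs[i][j] is the vocabulary index of the j-th word in document i; z[i][j] is its topic.
    ndk: document-topic counts, nkv: topic-word counts, nk: topic totals."""
    K, V = nkv.shape
    for i, doc in enumerate(docs):
        for j, v in enumerate(doc):
            k = z[i][j]
            # remove the current assignment from all counts
            ndk[i, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
            # word term (Eq. 5.11) times the numerator of the topic term (Eq. 5.13);
            # the topic-term denominator is constant in k and cancels after normalization
            p = (gamma + nkv[:, v]) / (V * gamma + nk) * (alpha + ndk[i])
            p /= p.sum()
            k = rng.choice(K, p=p)
            z[i][j] = k
            ndk[i, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return z

rng = np.random.default_rng(0)
V, K, alpha, gamma = 5, 2, 0.1, 0.1
docs = [[0, 1, 2, 1], [3, 4, 0, 2, 2]]                       # hypothetical toy corpus
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]    # random initialization
ndk = np.zeros((len(docs), K)); nkv = np.zeros((K, V)); nk = np.zeros(K)
for i, doc in enumerate(docs):
    for j, v in enumerate(doc):
        ndk[i, z[i][j]] += 1; nkv[z[i][j], v] += 1; nk[z[i][j]] += 1
for _ in range(50):                                          # a few collapsed sweeps
    z = collapsed_lda_sweep(docs, z, ndk, nkv, nk, alpha, gamma, rng)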
To summarize this section, Algorithm 5.2 gives a collapsed example-blocked Gibbs sam-
pler. It repeatedly draws a latent structure for each example until convergence.
The algorithm could also randomly move between the operators, instead of systematically
choosing an operator at each step.
5: Set u ← u′
6: end for
7: until Markov chain converged
8: return u
Algorithm 5.3: The operator Gibbs sampling algorithm, in its "systematic sweep" form. The function $Z(f_i, u)$ denotes a normalization constant for the distribution $q(u')$. Note that it could be the case for the distribution $p(U \mid X)$ that its normalization constant is unknown.
The operators usually make local changes to the current state in the state space. For ex-
ample, if the latent structure of interest is a part-of-speech tagging sequence, then an operator
can make a local change to one of the part-of-speech tags. If the latent structure of interest is a
phrase structure tree, then an operator can locally change a given node in the tree and its neigh-
borhood. In these cases, the neighborhood returned by the operator is the set of states that are
identical to the input state, other than some local changes.
In order to ensure that the operators do indeed induce a valid Gibbs sampler, there is a
need for the operators to satisfy the following properties:
• Detailed balance—A sufficient condition for detailed balance is the following: for each operator $f_i \in O$ and each $\omega$, if $\omega' \in f_i(\omega)$ then $f_i(\omega) = f_i(\omega')$. This ensures the detailed balance condition, meaning that for any $u$ and $u'$, it holds that:
\[
p(u \mid X)\, q(u' \mid u) = p(u' \mid X)\, q(u \mid u'), \tag{5.15}
\]
where $q(u \mid u')$ and $q(u' \mid u)$ are defined in Equation 5.14. The reason Equation 5.15 holds is that the normalization constants for $q(u \mid u')$ and $q(u' \mid u)$ satisfy $Z(f_i, u) = Z(f_i, u')$ when the sufficient condition above is satisfied.
• Recurrence—this implies that it is possible to get from any state in the search space to any other state. More formally, it means that for any $\omega, \omega' \in \Omega$, there is a sequence of operators $f_{a_1}, \ldots, f_{a_\ell}$ with $a_i \in \{1, \ldots, M\}$ such that $\omega' \in f_{a_\ell}(f_{a_{\ell-1}}(\cdots f_{a_1}(\omega) \cdots))$, and this chain has non-zero probability according to Equation 5.14.
Note that the symmetry condition implies that for every $\omega$ and $f_i$, it holds that $\omega \in f_i(\omega)$, i.e., an operator may leave a given state unchanged. More generally, detailed balance and recurrence are formal properties of Markov chains that are important for ensuring the correctness of a sampler (i.e., that it actually samples from the desired target distribution), and the above two requirements on the operators are one way to satisfy them for the underlying Markov chain created by the Gibbs sampler.
There are several examples in the Bayesian NLP literature that make use of this operator
view for Gibbs sampling, most notably for translation. For example, DeNero et al. (2008) used
Gibbs sampling in this manner to sample phrase alignments for machine translation. They had
several operators for making local changes to alignments: SWAP, which switches between the
alignments of two pairs of phrases; FLIP, which changes the phrase boundaries; TOGGLE, which
adds alignment links; FLIPTWO, which changes phrase boundaries in both the source and tar-
get language; and MOVE, which moves an aligned boundary either to the left or to the right.
Similar operators have been used by Nakazawa and Kurohashi (2012) to handle the alignment
of function words in machine translation. Ravi and Knight (2011) also used Gibbs operators to
estimate the parameters of IBM Model 3 for translation. For a discussion of the issue of detailed
balance with Gibbs sampling for a synchronous grammar model, see Levenberg et al. (2012).
It is often the case that an operator view of Gibbs sampling yields a sampler that is close
to being a pointwise sampler, because the operators, as mentioned above, make local changes to
the state space. These operators typically operate on latent structures, and not on the parameters.
If the sampler is explicit, that is, if the parameters are not marginalized out, then the parameters can be sampled in an additional Gibbs step that samples from the conditional distribution of the parameters given the latent structure.
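As an illustration of this operator view for a tagging-like latent structure, the sketch below samples the next state from the neighborhood returned by an operator that changes a single tag, normalizing by $Z(f_i, u)$. The unnormalized scoring function and tag set are hypothetical; nothing here is taken from the cited papers.

import numpy as np

def operator_gibbs_step(tags, pos, score, num_tags, rng):
    """One operator step: the operator at `pos` returns the set of tag sequences
    identical to `tags` except possibly at position `pos` (a local change).
    The next state is drawn from this neighborhood with probability proportional
    to an unnormalized target score; Z(f_i, u) is the sum of these scores."""
    neighborhood, weights = [], []
    for t in range(num_tags):
        cand = list(tags)
        cand[pos] = t
        neighborhood.append(cand)
        weights.append(score(cand))          # unnormalized p(u | x)
    weights = np.array(weights)
    probs = weights / weights.sum()
    return neighborhood[rng.choice(len(neighborhood), p=probs)]

# hypothetical unnormalized score: prefer adjacent tags that agree
score = lambda seq: np.exp(-sum(a != b for a, b in zip(seq, seq[1:])))
rng = np.random.default_rng(0)
tags = [0, 1, 2, 1]
for pos in range(len(tags)):                 # systematic sweep over the operators
    tags = operator_gibbs_step(tags, pos, score, num_tags=3, rng=rng)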
Input: Samplers for the conditionals $p(U_i \mid U_{-i}, X)$ for the distribution $p(U_1, \ldots, U_p \mid X)$.
Output: $u = (u_1, \ldots, u_p)$, approximately drawn from the above distribution.
breaks the Gibbs algorithm. The stationary distribution (if it exists) for this sampler is not necessarily $p(U \mid X)$.
However, samplers similar in form have been used in practice (e.g., for the LDA model).
For more information about parallelizing MCMC sampling for LDA, see Newman et al. (2009).
The difficulty of parallelizing sampling can be a reason to choose an inference algorithm
which is more amenable to parallelization, such as variational inference. This is discussed further
in Chapter 6.
Neiswanger et al. (2014) describe another method for parallelizing the Gibbs sampling al-
gorithm (or any MCMC algorithm, for that matter). With their approach, the data that require
inference are split into subsets, and Gibbs sampling is run separately on each subset to draw sam-
ples from the target distribution. Then, once all of the parallel MCMC chains have completed
drawing samples, the samples are re-combined to get asymptotically exact samples.
5.3.4 SUMMARY
Gibbs sampling assumes a partition of the set of random variables of interest. It is an MCMC
method that draws samples from the target distribution by alternating between steps, each of which draws from the conditional distribution of one block of the partition given the rest. Among MCMC methods, Gibbs sampling is the most common one in Bayesian
NLP.
5.4 THE METROPOLIS–HASTINGS ALGORITHM
The Metropolis–Hastings algorithm (MH) is an MCMC sampling algorithm that uses a proposal distribution to draw samples from the target distribution. Let $\Omega$ be the sample space of the target distribution $p(U \mid X)$ (for this example, $U$ can represent the latent variables in the model). The proposal distribution is then a function $q(U' \mid U)\colon \Omega \times \Omega \to [0, 1]$ such that each $u \in \Omega$ defines a distribution $q(U' \mid u)$. It is also assumed that sampling from $q(U' \mid u)$ for any $u \in \Omega$ is computationally efficient. The target distribution is assumed to be computable up to its normalization constant.
The Metropolis–Hastings sampler is given in Algorithm 5.5. It begins by initializing the
state of interest with a random value, and then repeatedly samples from the underlying proposal
distribution. Since the proposal distribution can be quite different from the target distribution,
p.U jX /, there is a correction step following Equation 5.16 that determines whether or not to
accept the sample from the proposal distribution.
Just like the Gibbs sampler, the MH algorithm streams samples. Once the chain has converged, one can continuously produce samples (which are not necessarily independent) by repeating the loop statements in the algorithm. At each step, $u$ can be considered to be a sample from the underlying distribution.
We mentioned earlier that the distribution $p$ only needs to be computable up to its normalization constant. This is true even though Equation 5.16 makes explicit use of the values of $p$; the acceptance ratio always computes ratios between different values of $p$, and therefore the normalization constant cancels.
Accepting the proposed samples using the acceptance ratio pushes the sampler to explore
parts of the space that tend to have a higher probability according to the target distribution. The
acceptance ratio is proportional to the ratio between the probability of the next state and the
probability of the current state. The larger the next state probability is, the larger the acceptance
ratio is, and the more likely the proposed state is to be accepted by the sampler. However, there is an important correction ratio that is multiplied in: the ratio between the value of the proposal distribution at the current state and its value at the next state. This correction ratio controls for the bias that the proposal distribution introduces by placing higher probability mass on certain parts of the state space than on others (different from those favored by the target distribution).
It is important to note that the support of the proposal distribution should subsume (or be equal to) the support of the target distribution. This ensures that the underlying Markov chain is recurrent, and that the entire sample space will be explored if the sampler is run long enough. The additional property of detailed balance (see Section 5.3.2) is also important for ensuring the correctness of a given MCMC sampler; in the MH algorithm it is satisfied through the correction step that uses the acceptance ratio.
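A minimal sketch of the MH loop described above, assuming a hypothetical univariate Gaussian target and a Gaussian random-walk proposal (the book's Algorithm 5.5 is not reproduced here):

import numpy as np

def metropolis_hastings(log_p, propose, log_q, u0, num_steps, rng):
    """Metropolis-Hastings with a generic proposal. log_p is the unnormalized log
    target; propose(u) draws u' ~ q(.|u); log_q(u_new, u_old) = log q(u_new | u_old)."""
    u = u0
    samples = []
    for _ in range(num_steps):
        u_new = propose(u)
        # acceptance ratio (Equation 5.16) in log space; the normalization
        # constant of p cancels in the difference
        log_ratio = (log_p(u_new) - log_p(u)) + (log_q(u, u_new) - log_q(u_new, u))
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            u = u_new                      # accept the proposed state
        samples.append(u)                  # otherwise keep the current state
    return samples

# hypothetical example: standard normal target, Gaussian random-walk proposal
rng = np.random.default_rng(0)
log_p = lambda u: -0.5 * u ** 2
propose = lambda u: u + rng.normal(scale=1.0)
log_q = lambda u_new, u_old: -0.5 * (u_new - u_old) ** 2   # symmetric, so it cancels
samples = metropolis_hastings(log_p, propose, log_q, 0.0, 5000, rng)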
\[
\alpha_i = \min\left\{1,\; \frac{p(u' \mid X)\, q_i(u \mid u')}{p(u \mid X)\, q_i(u' \mid u)}\right\}, \tag{5.17}
\]
to reject or accept the new sample. Each acceptance changes only a single coordinate in U .
The Gibbs algorithm can be viewed as a special case of the component-wise MH algorithm, in which
\[
q_i(u' \mid u) = \begin{cases} p(u'_i \mid u_{-i}, X), & \text{if } u'_{-i} = u_{-i} \\ 0, & \text{otherwise.} \end{cases}
\]
In this case, it holds that $\alpha_i$ from Equation 5.17 satisfies:
\begin{align*}
\alpha_i &= \min\left\{1,\; \frac{p(u' \mid X)\, q_i(u \mid u')}{p(u \mid X)\, q_i(u' \mid u)}\right\} \\
&= \min\left\{1,\; \frac{p(u' \mid X)\, p(u_i \mid u'_{-i}, X)}{p(u \mid X)\, p(u'_i \mid u_{-i}, X)}\right\} \\
&= \min\left\{1,\; \frac{p(u'_{-i} \mid X)\, p(u'_i \mid u'_{-i}, X)\, p(u_i \mid u'_{-i}, X)}{p(u_{-i} \mid X)\, p(u_i \mid u_{-i}, X)\, p(u'_i \mid u_{-i}, X)}\right\} \\
&= \min\left\{1,\; \frac{p(u_{-i} \mid X)}{p(u_{-i} \mid X)}\right\} \\
&= 1,
\end{align*}
where the simplification in the fourth line comes from the fact that $q_i$ changes only coordinate $i$ in the state $u$ (so $u'_{-i} = u_{-i}$), and the transition between the second and third equality comes from the chain rule applied to $p(u \mid X)$ and $p(u' \mid X)$. Since $\alpha_i = 1$ for all $i \in \{1, \ldots, p\}$, the MH sampler with the Gibbs proposal distributions never rejects a change to the state. For this reason, no correction step is needed for the Gibbs sampler, and it is removed.
Figure 5.1: A demonstration of slice sampling for a univariate variable $U$. The density of $U$ is given. The purple line (1) denotes the first sample we begin with, for which $u = -1$. We then sample a point across line 1 from a uniform distribution. Once we choose that point, we consider the black line (2), which intersects line 1 at that point. We uniformly sample a point on line 2, which yields the second sample, $u = 2$. Then we sample a random point on the blue line (3) that intersects line 2 at the new point, $u = 2$. We continue with that process of choosing points along vertically and horizontally intersecting lines (the red line, 4, intersects line 3). This amounts to a random walk on the graph of the density function of $U$.
The univariate slice sampler relies on the observation that sampling from a distribution
can be done by uniformly sampling a point from the graph of the underlying distribution, and
then projecting this point to the x-axis. Here, “graph” refers to the area that is bounded by the
curve of the density function. The x-axis ranges over the values that ˛ can receive, and the y-axis
ranges over the actual density values. Figure 5.1 demonstrates that idea.
This idea of uniformly sampling the graph is where MCMC comes into play: instead of
directly sampling from the graph, which can be computationally difficult, the slice sampler is an
MCMC sampler for which the stationary distribution is a uniform distribution over the area (or
volume) under the graph of the distribution of interest. Intuitively, the slice sampler is a Gibbs
sampler that moves in straight lines along the x-axis and y-axis in a random walk.
More formally, the slice sampler introduces an auxiliary variable $V \in \mathbb{R}$ to $q(\alpha)$, and then defines two Gibbs sampling steps for changing the state $(v, \alpha)$. In the first step, $\alpha$ (given $V = v$) is drawn uniformly from the set $\{\alpha' \mid v \le q(\alpha')\}$. This in essence corresponds to a move along the x-axis. The second step is perhaps more intuitive: we draw $v$ (given $\alpha$) from a uniform distribution over the set $\{v \mid v \le q(\alpha)\}$, corresponding to a move along the y-axis.
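The sketch below implements a univariate slice sampler. Since the set $\{\alpha' \mid v \le q(\alpha')\}$ is rarely available in closed form, it uses the standard stepping-out and shrinkage procedure to realize the uniform move along the slice; the target density and bracket width are hypothetical, and this variant is not prescribed by the text.

import numpy as np

def slice_sample_step(x, q, w, rng, max_steps=100):
    """One univariate slice-sampling step with stepping-out and shrinkage.
    q is an unnormalized density; w is an initial bracket width."""
    v = rng.uniform(0.0, q(x))                 # vertical move: v ~ U(0, q(x))
    # step out: find an interval [left, right] covering the slice {x': q(x') > v}
    left = x - w * rng.uniform()
    right = left + w
    while q(left) > v and max_steps > 0:
        left -= w; max_steps -= 1
    while q(right) > v and max_steps > 0:
        right += w; max_steps -= 1
    # shrinkage: sample uniformly on [left, right], shrinking the bracket on rejection
    while True:
        x_new = rng.uniform(left, right)
        if q(x_new) > v:
            return x_new                       # horizontal move along the slice
        if x_new < x:
            left = x_new
        else:
            right = x_new

rng = np.random.default_rng(0)
q = lambda x: np.exp(-0.5 * x ** 2) + 0.5 * np.exp(-0.5 * (x - 3.0) ** 2)  # hypothetical bimodal density
xs = [0.0]
for _ in range(2000):
    xs.append(slice_sample_step(xs[-1], q, w=1.0, rng=rng))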
\[
\frac{p^{1/T_t}(U \mid X)}{Z(T_t)}, \tag{5.18}
\]
where $Z(T_t)$ is a normalization constant, integrating or summing $p^{1/T_t}(U \mid X)$ over $U$. The value $T_t$ corresponds to the temperature, which starts at a high value and slowly decreases to 1 as the iteration of the sampler, denoted $t$, increases. If one is interested in using simulated annealing for optimization, i.e., finding the posterior mode, then one can even decrease the temperature toward 0, at which point the re-normalized posterior in Equation 5.18 concentrates most of its probability mass on a single point in the sample space.
For example, with Gibbs sampling, simulated annealing can be done by exponentiating the conditional distributions by $1/T_t$ and renormalizing, while increasing $t$ at each iteration of the Gibbs sampler. For the Metropolis–Hastings algorithm, one needs to change the acceptance ratio so that it exponentiates $p(U \mid X)$ by $1/T_t$. There is no need to compute $Z(T_t)$ in this case, since it cancels in the acceptance ratio that appears in Equation 5.16.
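As a sketch of the annealed MH variant, assuming a symmetric proposal and a simple linear cooling schedule (both of which are illustrative assumptions, not prescribed by the text):

import numpy as np

def annealed_mh(log_p, propose, u0, num_steps, rng, t0=10.0):
    """Metropolis-Hastings with simulated annealing and a symmetric proposal:
    the target is exponentiated by 1/T_t, with T_t decreasing toward 1.
    Z(T_t) is never needed, since it cancels in the acceptance ratio."""
    u = u0
    for step in range(num_steps):
        temp = max(1.0, t0 * (1.0 - step / num_steps))   # hypothetical cooling schedule
        u_new = propose(u)
        log_ratio = (log_p(u_new) - log_p(u)) / temp
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            u = u_new
    return u

rng = np.random.default_rng(0)
log_p = lambda u: -0.5 * (u - 2.0) ** 2          # hypothetical unnormalized target
u_final = annealed_mh(log_p, lambda u: u + rng.normal(scale=0.5), 0.0, 5000, rng)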
• Visual inspection. In the case of a single univariate parameter being sampled, one can
manually inspect a traceplot that plots the value of the sampled parameter vs. the iteration
number of the sampler. If we observe that the chain gets “stuck” in a certain range of the
parameters, and then moves to another range, stays there for a while, and continuously
moves between these ranges, this is an indication that the sampler has not mixed.
In NLP, the parameters are clearly multidimensional, and it is not always the case that we are sampling a continuous variable. In this case, a scalar function of the structure or multidimensional vector being sampled can be calculated at each iteration and plotted instead. This provides a one-sided indication about the mixing of the chain: if the pattern mentioned above is observed, this is an indication that the sampler has not mixed.
One can also plot the running mean of some scalar function, $\frac{1}{t}\sum_{i=1}^{t} f(u^{(i)})$, against $t$, where $t$ is the sampler iteration and $u^{(i)}$ is the sample drawn at iteration $i$. This mean value should eventually plateau (by the law of large numbers), and if it has not, that is an indication that the sampler has not mixed. The scalar function can be, for example, the log-likelihood.
• Validation on a development set. In parallel to running the MCMC sampler, one could make predictions on a small annotated development set, if such a set exists for this purpose. In the explicit MCMC sampling setting, the parameters can be used to make predictions on this development set, and the sampler can be stopped when performance stops improving (performance is measured here according to the evaluation metric relevant to the problem at hand). If a collapsed setting is used, then one can extract a point estimate based on the current state of the sampler, and again use it to make predictions on a development set. Note that with this approach, we are not checking for the convergence of the sampler to the "true posterior," but instead tuning the sampler so that it operates in a part of the state space that works well with the final evaluation metric.
• Testing autocorrelation. When the target distribution is defined over real values, one can use autocorrelation for the diagnosis of an MCMC algorithm (see the sketch after this list). Denote the target distribution by $p(\theta)$. The autocorrelation with lag $k$ is defined as:
\[
\rho_k = \frac{\sum_{t=1}^{T-k} (\theta_t - \bar{\theta})(\theta_{t+k} - \bar{\theta})}{\sum_{t=1}^{T-k} (\theta_t - \bar{\theta})^2}, \tag{5.19}
\]
where $\bar{\theta} = \frac{1}{T}\sum_{t=1}^{T} \theta_t$, $\theta_t$ is the $t$th sample in the chain and $T$ is the total number of samples. The numerator in Equation 5.19 corresponds to an estimated covariance term, and the denominator to an estimated variance term. Autocorrelation tests rely on the idea that if the MCMC sampler has reached the stationary distribution, the autocorrelation should become small as $k$ increases. Thus, large autocorrelation values even for relatively large $k$ are an indication of slow mixing or lack of convergence.
• Other tests. The Geweke test (Geweke, 1992) is a well-known method for checking
whether an MCMC algorithm has converged. It works by splitting a chain of samples
into two parts after an assumed burn-in period; these two parts are then tested to see
whether they are similar to each other. If indeed the chain has reached the stationary
state, then these two parts should be similar. The test is performed using a modification of
the so-called z-test (Upton and Cook, 2014), and the score used to compare the two parts
of the chain is called the Geweke z-score. Another test that is widely used for MCMC
convergence diagnosis is the Raftery-Lewis test. It is a good fit in cases where the target
distribution is defined over real values. It works by thresholding all elements in the chain
against a certain quantile q , thus binarizing the chain into a sequence of 1s and 0s. The
test then proceeds by estimating transition probabilities between these binary values, and
uses these transition probabilities to assess convergence. For more details, see Raftery and
Lewis (1992).
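The following sketch, referenced in the autocorrelation item above, computes Equation 5.19 for a chain of scalar samples; the synthetic chain is only for illustration.

import numpy as np

def autocorrelation(theta, k):
    """Estimate the lag-k autocorrelation of a chain of scalar samples, following
    Equation 5.19 (estimated covariance at lag k over the estimated variance)."""
    theta = np.asarray(theta, dtype=float)
    mean = theta.mean()
    num = np.sum((theta[:-k] - mean) * (theta[k:] - mean))
    den = np.sum((theta[:-k] - mean) ** 2)
    return num / den

# a slowly mixing chain should show large values even at large lags
rng = np.random.default_rng(0)
chain = np.cumsum(rng.normal(size=5000)) * 0.01 + rng.normal(size=5000)
print([round(autocorrelation(chain, k), 3) for k in (1, 10, 50, 100)])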
A valid criticism of the use of MCMC algorithms in NLP is that the algorithms are often
used without a good verification of chain convergence. Since MCMC algorithms can be quite
expensive in the context of Bayesian NLP, they are often run for a fixed number of iterations
that are limited by the amount of time allotted for an empirical evaluation. This leads to high
sensitivity of the reported results to starting conditions of the Markov chain, or even worse—a
result that is based on samples that are not drawn from the true posterior. In this case, the use
of MCMC algorithms is similar to a random search.
Practitioners are therefore encouraged to be more careful with Bayesian models, and to monitor the convergence of MCMC algorithms. When MCMC algorithms are too slow to converge (an issue
that can arise often with Bayesian NLP models), one should consider switching to a different
approximate inference algorithm such as variational inference (Chapter 6), where convergence
can be more easily assessed.
\[
\sum_{j=1}^{N} T_{ij} = 1 \qquad \forall i \in \{1, \ldots, N\}. \tag{5.20}
\]
A non-homogeneous Markov chain would have a transition kernel per time step instead
of a single T ; all samplers and algorithms in this chapter are considered to be homogeneous
chains.
There is an important algebraic property of Markov chains with respect to $T$. Let $\pi$ be some distribution over $\Omega$, i.e., a vector such that $\pi_i \ge 0$ and $\sum_i \pi_i = 1$. In this case, multiplying $T$ by $\pi$ (on the left), i.e., computing $\pi T$, yields a distribution as well. To see this, consider that if $w = \pi T$ then:
\[
\sum_{j=1}^{N} w_j = \sum_{j=1}^{N} \sum_{i=1}^{N} \pi_i T_{ij} = \sum_{i=1}^{N} \pi_i \underbrace{\left(\sum_{j=1}^{N} T_{ij}\right)}_{=1} = \sum_{i=1}^{N} \pi_i = 1,
\]
because of Equation 5.20. Therefore, $w$ is a distribution over the state space $\Omega$. The distribution $w$ is not an arbitrary distribution. It is the distribution that results from taking a single step using the Markov chain starting with the initial distribution $\pi$ over states. More generally, $\pi T^k$ for an integer $k \ge 0$ gives the distribution over $\Omega$ after $k$ steps in the Markov chain. If we let $\pi^{(t)}$ be the distribution over $\Omega$ at time step $t$, then this means that:
\[
\pi^{(t+1)} = \pi^{(t)} T.
\]
A stationary distribution of the Markov chain is a distribution $\pi$ over $\Omega$ satisfying
\[
\pi T = \pi, \tag{5.22}
\]
i.e., $\pi$ is a left eigenvector of $T$ associated with eigenvalue 1. (This implicitly means that if $T$ has a stationary distribution, then $T$ has an eigenvalue of 1 to begin with, which is also the largest eigenvalue.)
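A small numerical illustration of these properties (the transition kernel below is hypothetical): repeatedly multiplying a distribution by $T$ converges, for this well-behaved chain, to a vector satisfying $\pi T = \pi$.

import numpy as np

# A small transition kernel T whose rows sum to one (Equation 5.20).
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Repeatedly applying T (on the left) to any initial distribution converges here
# to the stationary distribution satisfying pi T = pi (Equation 5.22).
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ T
print(pi, np.allclose(pi @ T, pi))   # pi is (approximately) a left eigenvector with eigenvalue 1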
This basic theory serves as the foundation for the proofs of samplers such as the Gibbs
sampler and the Metropolis–Hastings algorithm. In these proofs, two key steps are taken:
• Proving that the chain induced by the sampler is such that it converges to a stationary
distribution (i.e., satisfies the basic regularity conditions).
• Proving that the target distribution we are interested in sampling from is the stationary dis-
tribution of the Markov chain induced by the sampler (i.e., that it satisfies Equation 5.22).
\[
p(U \mid X) \le M q(U),
\]
for some known $M > 0$, since $M$ is directly used and evaluated in the rejection sampling algorithm. It is also assumed that for any $u$, one can calculate $p(u \mid X)$ (including its normalization constant). Then, in order to sample from $p(U \mid X)$, the procedure in Algorithm 5.6 is followed.
It can be shown that the probability of accepting y in each iteration is $1/M$. Therefore,
rejection sampling is not practical when M is very large. This problem is especially severe with
distributions that are defined over high dimensional data.2
The intuition behind rejection sampling is best explained graphically for the univariate case. Consider Figure 5.2. We see that $M q(u)$ serves as an envelope surrounding the target distribution $p(u)$.
To proceed with drawing a sample from $p(u)$, we can repeatedly sample points from the envelope defined by $M q(u)$ until we happen to hit a point that is within the graph of $p(u)$. Indeed, to sample points from the envelope, we sample a point from $q(u)$; this limits us to a point on the x-axis that corresponds to a point in the sample space. Now, we can proceed by inspecting the line that stretches from the y-axis coordinate 0 up to $M q(u)$, and drawing a uniform point on that line. If that point falls below the graph of $p(u)$, then the sampler has managed to draw a sample from $p(u)$. If it does not fall below the graph of $p(u)$, the process needs to be repeated.
2 This problem is partially tackled by adaptive rejection sampling. Adaptive rejection sampling is used when multiple
samples u.1/ ; : : : ; u.m/ are needed from a distribution p.u/. It works by progressively tightening the proposal distribution
q.u/ around the target distribution p.u/. See Robert and Casella (2005) for more details. Examples of using adaptive rejection
sampling in Bayesian NLP are quite rare, but see for example Carter et al. (2012) and Dymetman et al. (2012) for its use in
NLP.
Input: Distributions $p(U \mid X)$ and $q(U)$, a sampler for the distribution $q(U)$, and a constant $M$.
Output: $u$, a sample from $p(U \mid X)$.
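A minimal sketch of this accept/reject loop, with a hypothetical univariate target and a wider Gaussian envelope (this is not the book's Algorithm 5.6):

import numpy as np

def rejection_sample(p, q_sample, q_density, m, rng):
    """Draw one sample from p using proposal q, assuming p(u) <= m * q(u) for all u."""
    while True:
        u = q_sample()                                # a point on the x-axis, drawn from q
        height = rng.uniform(0.0, m * q_density(u))   # a uniform point under the envelope
        if height <= p(u):                            # it also falls under the graph of p: accept
            return u

# hypothetical example: standard normal target with a wider normal as the envelope
rng = np.random.default_rng(0)
p = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
q_density = lambda u: np.exp(-0.5 * (u / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
q_sample = lambda: rng.normal(scale=2.0)
m = 2.5                                               # here p(u) <= 2 q(u), so m = 2.5 is safe
samples = [rejection_sample(p, q_sample, q_density, m, rng) for _ in range(1000)]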
In its general form, the main challenge behind the use of a rejection sampler is finding a bounding distribution $q(U)$ with its constant $M$. However, there is a specific case of rejection sampling in which these quantities can be easily identified. Consider the case in which one is interested in sampling from a more restricted subspace of the sample space $\Omega$. Assume the existence of $p(u)$, and that the target distribution has the following form:
\[
p'(u) = \frac{p(u)\, I(u \in A)}{Z},
\]
where $A \subseteq \Omega$, $I(u \in A)$ is the indicator function that equals 1 if $u \in A$ and 0 otherwise, and $Z$ is a normalization constant that integrates or sums $p(u)$ over $A$. If $p(u)$ can be computed for each $u \in \Omega$ and efficiently sampled from, and the membership query $u \in A$ can be answered efficiently for any $u \in \Omega$, then rejection sampling can be used to sample from $p'(u)$ with proposal distribution $p(u)$. In this case, $M = 1$, and in order to sample from $p'(u)$, one simply samples from $p(u)$ until $u \in A$.
Cohen and Johnson (2013) use rejection sampling in this way in order to restrict Bayesian estimation of PCFGs to tight PCFGs, i.e., PCFGs whose normalization constant (the sum over all possible trees according to the grammar) equals 1. Rejection sampling was used in conjunction with a Dirichlet prior. The Dirichlet prior was sampled, and then the resulting rule probabilities were inspected to see whether they were tight. If they were tight, these rule probabilities were accepted.
Figure 5.2: The plots of $M q(U)$ (the envelope) and $p(U \mid X)$ (the distribution of interest) for rejection sampling. Plot adapted from Andrieu et al. (2003).
Inverse Transform Sampling: Another non-MCMC method for sampling is inverse transform sampling (ITS). For a real-valued random variable $X$, the ITS method assumes that for a given $u \in \mathbb{R}$, we can identify the largest $x$ such that $F(x) \le u$, where $F$ is the CDF of $X$ (see Section 1.2.1). Then, ITS works by sampling $u$ from a uniform distribution over $[0, 1]$, and then returning that largest $x$.
This inverse transform sampling method applies to sampling from multinomial distributions. In order to sample from such a distribution, we can apply the inverse transform sampling method on a random variable $X$ that maps each event in the multinomial distribution to a unique integer between 1 and $n$, where $n$ is the number of events for the multinomial.
A naïve implementation of ITS for the multinomial distribution will require linear time in $n$ for each sample, where $n$ is the number of events in the multinomial. There is actually a simple way to speed this up to $O(\log n)$ per sample from that multinomial, with a preprocessing step of asymptotic complexity $O(n)$. Once we map each event to an integer between 1 and $n$ with a random variable $X$, we calculate the vector of numbers $\alpha_j = F(j) = \sum_{i=1}^{j} p(X = i)$ for $j \in \{1, \ldots, n\}$ and set $\alpha_0 = 0$. Now, in order to use the ITS, we draw a uniform variable $u$, and then apply a logarithmic-time binary search on the array represented by the vector $(\alpha_0, \ldots, \alpha_n)$ to find a $j$ such that $u \in [\alpha_{j-1}, \alpha_j]$. We then return the multinomial event associated with index $j$.
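A sketch of this construction, using Python's bisect module for the logarithmic-time search (the multinomial probabilities below are hypothetical):

import bisect
import numpy as np

def make_multinomial_sampler(probs, rng):
    """Inverse transform sampling for a multinomial: O(n) preprocessing to build
    the CDF array (alpha_1, ..., alpha_n), then O(log n) binary search per draw."""
    cdf = np.cumsum(probs)              # alpha_j = F(j) = sum_{i <= j} p(X = i)
    def sample():
        u = rng.uniform()
        # smallest j with u <= alpha_j; clamp guards against floating-point round-off
        return min(bisect.bisect_left(cdf, u), len(probs) - 1)
    return sample

rng = np.random.default_rng(0)
draw = make_multinomial_sampler([0.1, 0.2, 0.4, 0.3], rng)
counts = np.bincount([draw() for _ in range(10000)], minlength=4)
print(counts / counts.sum())            # should be close to the input probabilities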
\[
I(f) = E_{p(U)}\left[f(U)\right] \approx \frac{1}{M} \sum_{i=1}^{M} f\left(u^{(i)}\right). \tag{5.23}
\]
This approximation is valid because of the law of large numbers, which states that as $M \to \infty$, the sum on the right-hand side of Equation 5.23 converges to the desired expectation on the left-hand side.
Importance Sampling: Importance sampling takes the idea in Equation 5.23 and suggests a way to approximate the expectation $I(f)$ by sampling from a proposal distribution $q(U)$. Therefore, importance sampling can be used when it is not easy to sample from $p$ (but it is possible to calculate its value). Additionally, with a suitable choice of proposal distribution, importance sampling can be more efficient than perfect Monte Carlo integration (i.e., than estimating $I(f)$ using samples from $p(U)$); this means the approximate integral tends to converge to $I(f)$ with fewer samples.
Importance sampling relies on the following simple identity, which holds for any distribution $q(U)$ such that $q(u) = 0$ only if $p(u) = 0$:
\[
I(f) = E_{p(U)}\left[f(U)\right] = E_{q(U)}\left[f(U)\, \frac{p(U)}{q(U)}\right]. \tag{5.24}
\]
Equation 5.24 is true because the expectation operator folds in a weighted sum/integration operation using $q(U)$, which in conjunction with the term $\frac{p(U)}{q(U)}$ leads to a re-weighting of the sum/integration operation using $p(U)$.
The implication of this is that $I(f)$ can be approximated as:
\[
\hat{I}(f) = \frac{1}{M} \sum_{i=1}^{M} f\left(u^{(i)}\right) \frac{p\left(u^{(i)}\right)}{q\left(u^{(i)}\right)},
\]
where the samples $u^{(1)}, \ldots, u^{(M)}$ are drawn from $q(U)$.
The optimal proposal distribution (the one that minimizes the variance of $\hat{I}(f)$) is
\[
q^*(u) = \frac{|f(u)|\, p(u)}{\int_{u'} |f(u')|\, p(u')\, du'},
\]
where the integration can be replaced by a sum if $u$ is discrete. The distribution $q^*$ itself is often hard to calculate or sample from. The reason it is optimal is related to the fact that it places large probability mass where both the magnitude of $f$ and the mass/density of $p$ are large (as opposed to just selecting a region with high probability mass according to $p$, but with potentially insignificant values of $f$). We want to sample points in the space that balance between being highly probable and contributing dominant values to $f$.
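A minimal sketch of the importance sampling estimator $\hat{I}(f)$, with a hypothetical Gaussian target and a wider Gaussian proposal:

import numpy as np

def importance_sampling_estimate(f, p, q_sample, q_density, num_samples, rng):
    """Approximate I(f) = E_p[f(U)] with samples from a proposal q, reweighting
    each f(u^(i)) by p(u^(i)) / q(u^(i)) as in the estimator above."""
    us = np.array([q_sample() for _ in range(num_samples)])
    weights = p(us) / q_density(us)
    return np.mean(f(us) * weights)

# hypothetical example: E[U^2] under a standard normal, proposed from N(0, 2^2)
rng = np.random.default_rng(0)
p = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
q_density = lambda u: np.exp(-0.5 * (u / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
estimate = importance_sampling_estimate(lambda u: u ** 2, p,
                                         lambda: rng.normal(scale=2.0),
                                         q_density, 20000, rng)
print(estimate)   # should be close to 1.0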
Going back to Equation 5.23, the samples drawn from p to estimate the integral can be
generated using any MC method presented in this chapter, including MCMC methods. This
means, for example, that we can repeatedly draw samples $u^{(i)}$ from the state space using a Gibbs
sampler, and use them to estimate an integral of a specific function that we are interested in.
5.11 DISCUSSION
We will now briefly discuss some additional topics about sampling methods and their use in
NLP.
\[
p(Z_1, \ldots, Z_m, X_1, \ldots, X_m) = p(Z_1)\, p(X_1 \mid Z_1) \prod_{i=2}^{m} p(Z_i \mid Z_{i-1})\, p(X_i \mid Z_i). \tag{5.25}
\]
The goal of particle filtering is to approximate the distribution $p(Z_i \mid X_1 = x_1, \ldots, X_i = x_i)$, i.e., to predict the latent state at position $i$ in the sequence, conditioned on the observations up to that point. This means that we are interested in sampling $Z_i$ from $p(Z_i \mid X_1 = x_1, \ldots, X_i = x_i)$ for $i \in \{1, \ldots, m\}$. Particle filtering approaches this problem through a sequence of importance sampling steps. The distribution $p$ is assumed to be known.
First, particle filtering samples $M$ "particles" from the distribution $p(Z_1 \mid X_1)$. This distribution can be derived using a simple application of Bayes' rule on $p(X_1 \mid Z_1)$ in tandem with the distribution $p(Z_1)$, both of which are components of the model in Equation 5.25. This leads to a set of particles $z_1^{(i)}$ for $i \in \{1, \ldots, M\}$.
In the general case, particle filtering samples $M$ particles corresponding to $Z_j$ in the $j$th step. These $M$ particles are used to approximate the distribution $p(Z_j \mid X_1, \ldots, X_j)$: each particle $z_j^{(i)}$ for $i \in \{1, \ldots, M\}$ is assigned a weight $\beta_{j,i}$, and the distribution $p(Z_j \mid X_1, \ldots, X_j)$ is approximated as:
\[
p\left(Z_j = z \mid X_1 = x_1, \ldots, X_j = x_j\right) \approx \sum_{i=1}^{M} I\left(z_j^{(i)} = z\right) \beta_{j,i}. \tag{5.26}
\]
To see how the particles and weights are constructed, note that
\[
p\left(Z_j \mid X_1, \ldots, X_j\right) \propto \sum_{z} p\left(Z_{j-1} = z \mid X_1, \ldots, X_{j-1}\right) p\left(Z_j \mid Z_{j-1} = z\right) p\left(X_j \mid Z_j\right).
\]
The right-hand side above exactly equals $E_{p(Z_{j-1} \mid X_1, \ldots, X_{j-1})}\left[p(Z_j \mid Z_{j-1})\, p(X_j \mid Z_j)\right]$. We can therefore use importance sampling (see Section 5.10) to sample $M$ particles from $p(Z_{j-1} \mid X_1, \ldots, X_{j-1})$, and for each such draw $z$, draw $z_j^{(i)}$ from $p(Z_j \mid Z_{j-1} = z)$, and set its weight $\beta_{j,i}$ to be proportional to $p(X_j = x_j \mid Z_j = z_j^{(i)})$. This leads to the approximation in Equation 5.26, which can be used in the next step of the particle filtering algorithm.
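The following sketch implements these steps for a small HMM of the form in Equation 5.25. The transition and emission tables are hypothetical, and the particles are resampled from the previous weighted approximation at every step, which is one common choice rather than something prescribed by the text.

import numpy as np

def particle_filter(xs, init, trans, emit, num_particles, rng):
    """Sequential importance sampling for the model in Equation 5.25.
    init[k] = p(Z_1 = k), trans[k, k'] = p(Z_j = k' | Z_{j-1} = k),
    emit[k, x] = p(X_j = x | Z_j = k). Returns, for each position j, the
    weighted particle approximation of p(Z_j | X_1, ..., X_j) (Equation 5.26)."""
    K = len(init)
    # position 1: sample particles from p(Z_1 | X_1), obtained via Bayes' rule
    post1 = init * emit[:, xs[0]]
    post1 /= post1.sum()
    particles = rng.choice(K, size=num_particles, p=post1)
    weights = np.full(num_particles, 1.0 / num_particles)
    approximations = [np.bincount(particles, weights=weights, minlength=K)]
    for x in xs[1:]:
        # draw from the previous approximation, propagate through the transition
        # distribution, and reweight by the emission probability of the observation
        ancestors = rng.choice(num_particles, size=num_particles, p=weights)
        particles = np.array([rng.choice(K, p=trans[particles[a]]) for a in ancestors])
        weights = emit[particles, x]
        weights /= weights.sum()
        approximations.append(np.bincount(particles, weights=weights, minlength=K))
    return approximations

# hypothetical two-state HMM with three observation symbols
rng = np.random.default_rng(0)
init = np.array([0.6, 0.4])
trans = np.array([[0.8, 0.2], [0.3, 0.7]])
emit = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(particle_filter([0, 2, 2, 1], init, trans, emit, num_particles=500, rng=rng))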
Particle filtering was used by Levy et al. (2009) to describe a model for incremental pars-
ing. The motivation is psycholinguistic: the authors were interested in modeling human com-
prehension of language. The authors claim, based on prior work, that there is much evidence
that shows that humans process language incrementally, and therefore it is beneficial to model
the probability of a partial syntactic derivation conditioned on a prefix of a sentence. In the
notation above, partial derivations are modeled by the latent random variables $Z_i$, the $X_i$ denote the words in a sentence, and the integer $m$ denotes the length of the sentence. Levy et al.'s incremental parser was especially good at modeling the effect of human memory limitations
in sentence comprehension.
Table 5.1: A list of Monte Carlo methods and the components they require in order to operate and sample from the posterior $p(U \mid X)$. The $\propto$ symbol denotes that we need to be able to calculate a quantity only up to a normalization constant.
Yang and Eisenstein (2013) also developed a sequential Monte Carlo method for nor-
malizing tweets into English. The particles they maintain correspond to normalized tweets in
English. The (non-Bayesian) model they used was composed of a conditional log-linear model
that models the distribution over tweets given an English sentence, and a model over English
sentences. These two models are multiplied together to get a joint distribution over tweets and
English sentences.
\[
\pi_i T_{ij} = \pi_j T_{ji},
\]
for all $i$ and $j$. Show that when the detailed balance condition is satisfied, $\pi$ is the stationary distribution. Is the reverse true as well?
5.4. The following two questions prove the correctness of a Gibbs sampler for a simple model. (The exercise is based on Section 3 in Casella and George (1992).) Consider a probability distribution $p(X, Y)$ over two binary random variables $X, Y$, with the following probability table:

            X = 0    X = 1
   Y = 0     p_1      p_2
   Y = 1     p_3      p_4

such that $\sum_{i=1}^{4} p_i = 1$ and $p_i \ge 0$ for $i \in \{1, \ldots, 4\}$.
Write down two matrices $A_{y|x}$ and $A_{x|y}$ (in terms of the $p_i$), both of size $2 \times 2$, such that:
CHAPTER 6
Variational Inference
In the previous chapter, we described some of the core algorithms used for drawing samples from
the posterior, or more generally, from a probability distribution. In this chapter, we consider
another approach to approximate inference—variational inference.
Variational inference treats the problem of identifying the posterior as an optimization
problem. When this optimization problem is solved, the output is an approximate version of the
posterior distribution. This means that the objective function that variational inference aims to
optimize is a function over a family of distributions. The reason this is an approximate inference
is that this family of distributions is usually not inclusive of the true posterior, and makes strong
assumptions about the form of the posterior distribution.
The term “variational” here refers to concepts from mathematical analysis (such as the
calculus of variations) which focus on the maximization and minimization of functionals (map-
pings from a set of functions to real numbers). This kind of analysis has been used frequently
in physics (e.g., quantum mechanics). Very commonly, it is used in the context of minimizing
energy through a functional that describes the state of physical elements.
Section 6.1 begins the discussion of variational inference in this chapter, by describing
the basic variational bound used in variational inference. We then discuss mean-field variational
inference, the main type of variational inference used in Bayesian NLP (Sections 6.2–6.3). We
continue with a discussion of empirical Bayes estimation with variational approximations (Sec-
tion 6.4). In the next section (Section 6.5), we discuss various topics related to variational infer-
ence in Bayesian NLP, covering topics such as initialization of variational inference algorithms,
convergence diagnosis, variational inference decoding, the relationship between variational in-
ference and KL minimization, and finally, online variational inference. We conclude with a
summary (Section 6.6). There is also a treatment of variational inference in the context of neural
networks and representation learning in Chapter 9.
\[
p(\theta, Z, X \mid \alpha) = p\left(\theta, Z^{(1)}, \ldots, Z^{(n)}, X^{(1)}, \ldots, X^{(n)} \mid \alpha\right) = p(\theta \mid \alpha) \prod_{i=1}^{n} p\left(Z^{(i)} \mid \theta\right) p\left(X^{(i)} \mid Z^{(i)}, \theta\right).
\]
As mentioned in Section 3.1.2, in order to compute the posterior, one needs to compute the following marginalization constant:
\[
p\left(x^{(1)}, \ldots, x^{(n)} \mid \alpha\right) = \int_{\theta} \sum_{z^{(1)}, \ldots, z^{(n)}} p(\theta \mid \alpha) \left(\prod_{i=1}^{n} p\left(z^{(i)} \mid \theta\right) p\left(x^{(i)} \mid z^{(i)}, \theta\right)\right) d\theta.
\]
likelihood. Because finding the posterior for Bayesian NLP problems, in the general case, is intractable, it follows that optimizing $F(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha)$ with respect to $q$ is, in the general case, intractable as well (if we could do it, we would have been able to find the true posterior).
This is where variational inference plays an important role in removing the intractability of this optimization problem. Variational inference implies that this optimization problem is still being solved, but with a compromise: we maximize the bound with respect to a certain family of distributions $Q$. The family $Q$ is chosen so that, at the very least, finding a local maximum of the following maximization problem is tractable:
\[
\max_{q \in Q} F\left(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha\right). \tag{6.4}
\]
Since the true posterior usually does not belong to Q, this is an approximate method.
Clearly, the closer that one of the distributions in Q is to the true posterior (or, the more ex-
pressive the distributions in Q are), the more accurate this approximate solution is.
The approximate nature of the solution is not only due to the restriction inherent in choosing $Q$. Another reason is that even with a "tractable" $Q$, the above optimization problem is non-convex, and therefore there is inherent difficulty in finding its global maximum.
Mean-field variational inference is one algorithm that treats this issue by applying coordinate
ascent on a factorized approximate posterior family (Section 6.3).1 The maximization problem,
however, stays non-convex.
1 Coordinate ascent is a method for maximizing the value of a real-valued function $f(y_1, \ldots, y_n)$. It works by iterating through the different arguments of $f$, at each point maximizing $f$ with respect to a specific variable $y_i$ while holding $y_j$ for $j \neq i$ fixed with values from the previous steps. The value of the assignment to $y_i$ at each step is guaranteed to increase the value of $f$, and under some circumstances, converge to the maximizer of $f$.
6.2 MEAN-FIELD APPROXIMATION
Mean-field approximation defines an approximate posterior family which has a factorized form.
Just as in the case of Gibbs sampling (Chapter 5), mean-field variational inference requires a
partition of the latent variables in the model (the most common choice for such a partition is a
separation of the parameters from the latent structures being predicted).
Once the latent variables $Z^{(1)}, \ldots, Z^{(n)}$ and $\theta$ are carved up into $p$ random variables, $U_1, \ldots, U_p$, the factorized form of the approximate posterior assumes independence between each of the $U_i$. More specifically, each member $q$ of the family $Q$ is assumed to have the form:
\[
q\left(U_1, \ldots, U_p\right) = \prod_{i=1}^{p} q\left(U_i\right). \tag{6.5}
\]
See Section 5.2.1 for more discussion about the ways to partition the latent variables of a
Bayesian model.
One of the most natural approximate posterior families $Q$ is one that decouples the parameters $\theta$ from the random variables of each of the latent structures $Z = (Z^{(1)}, \ldots, Z^{(n)})$. With a top-level prior placement, a typical approximate posterior has the form:
\[
q\left(\theta, Z^{(1)}, \ldots, Z^{(n)}\right) = q(\theta) \prod_{i=1}^{n} q\left(Z^{(i)}\right).
\]
Therefore, $Q$ is the set of all distributions such that $\theta$ and the latent structures $Z^{(i)}$ (for $i = 1, \ldots, n$) are all independent of each other. This approximation for the posterior family belongs to the family of mean-field approximations.
When a new set of parameters is drawn for each example (Section 3.5), leading to a set of parameters $\theta^{(1)}, \ldots, \theta^{(n)}$, it is more typical to use a mean-field approximation that assumes independence between each of $\theta^{(i)}$ and $Z^{(i)}$, such that the approximate posterior has the form:
\[
\left(\prod_{i=1}^{n} q\left(z^{(i)}\right)\right) \left(\prod_{i=1}^{n} q\left(\theta^{(i)}\right)\right). \tag{6.6}
\]
The approximation in Equation 6.6 is the most naïve mean-field approximation, where all variables in the model are assumed to be independent of each other (parameters and latent structures).
It is important to note, however, that this naïve mean-field approximation in this case is
not very useful by itself. When considering the bound in Equation 6.3, with this factorization
of the posterior family, we actually end up with n separate optimization sub-problems for each
observation, and these sub-problems do not interact with each other. Interaction between these
sub-problems, however, can be introduced by following a variational expectation-maximization algorithm that estimates joint hyperparameters for all $\theta^{(i)}$, in the style of empirical Bayes. For a discussion of empirical Bayes, see Section 4.3. For more details about variational EM, see Section 6.4.
The factorized posterior family $q$ does not have to follow the above naïve factorization, and more structured approximations can certainly be used. The way the posterior is factorized depends on the modeler's choice for carving up the variables on which the inference is done. In addition, each factor in $q$ can either be parametric (by defining "variational parameters" that control this factor) or can be left nonparametric (see Section 1.5.1 for the difference between a parametric model family and a nonparametric one). In many cases it can be shown that even when a factor is left nonparametric (to obtain a tighter approximation, without confining the factor to a special form), the factor that gives the tightest approximation actually has a parametric form. See Section 6.3.1.
Mean-field methods actually originate in statistical physics. The main motivation behind
them in statistical physics (or statistical mechanics) is to reduce the complexity of stochasti-
cally modeling interactions between many physical elements by considering much simpler mod-
els. Mean-field methods are now often used in machine learning, especially with inference for
graphical models (Wainwright and Jordan, 2008).
Input: Observed data $x^{(1)}, \ldots, x^{(n)}$, a partition of the latent variables into $U_1, \ldots, U_p$, and a set of possible distributions for $U_1, \ldots, U_p$: $Q_1, \ldots, Q_p$.
Output: Factorized approximate posterior $q(U_1, \ldots, U_p)$.
6: end for
7: $q^*(U_1, \ldots, U_p) \leftarrow \prod_{i=1}^{p} q^*(U_i)$
8: until the bound $F(q^*, x^{(1)}, \ldots, x^{(n)} \mid \alpha)$ converged
9: return $q^*$
Algorithm 6.1: The mean-field variational inference algorithm. Its inputs are observations, a partition of the random variables on which the inference is done, and a set of distribution families, one per element of the partition. The algorithm then iterates (with iterator $i$) through the different elements of the partition, each time maximizing the variational bound for the observations with respect to $Q_i$, while holding $q^*(U_j)$ for $j \neq i$ fixed.
the factorization in Equation 6.5. Algorithm 6.1 provides the skeleton for the coordinate ascent
mean-field variational inference algorithm.
The optimization problem in Equation 6.7 is not always easy to solve, but fortunately, the solution has a rather general formula. It can be shown that the $q^*(U_i)$ maximizing Equation 6.7 equals:
\[
q^*(U_i) = \frac{\exp\left(E_{q_{-i}}\left[\log p\left(X, U_1, \ldots, U_p\right)\right]\right)}{Z_i}, \tag{6.8}
\]
where $q_{-i}$ is a distribution over $U_{-i}$ defined as:
\[
q_{-i}\left(U_1, \ldots, U_{i-1}, U_{i+1}, \ldots, U_p\right) = \prod_{j \neq i} q\left(U_j\right),
\]
and $Z_i$ is a normalization constant that integrates or sums the numerator in Equation 6.8 with respect to $U_i$. For example, if $U_i$ is discrete, then $Z_i = \sum_{u} \exp\left(E_{q_{-i}}\left[\log p\left(X, U_1, \ldots, U_i = u, \ldots, U_p\right)\right]\right)$. This general derivation appears in detail in Bishop (2006), but in Section 6.3.1 we derive a specific case of this formulation for the Dirichlet-multinomial family.
The following decisions need to be made by the modeler when actually implementing
Algorithm 6.1.
• Partitioning of the latent variables. This issue is discussed at length in Chapter 5 and in
Section 6.2. The decision that the modeler has to make with respect to this issue is how to
carve up the random variables into a set of random variables that have minimal interaction,
or that offer some computational tractability for maximizing the variational bound with
respect to each of members of this set of random variables.
• Choosing the parametrization of each factor ($Q_i$). Determining the parametrization of
each of the factors requires a balance between richness of the parametrization (so we are
able to get a tighter bound) and tractability (see below). It is often the case that even when
Qi is left nonparametric (or when it includes the set of all possible distributions over the
sample space of Ui ), the solution to the coordinate ascent step is actually a distribution
from a parametric family. Identifying this parametric family can be done as part of the
derivation of the variational EM algorithm, so that the variational distributions can be
represented computationally (and optimized with respect to their parameters, also referred
to as “variational parameters”).
• Optimizing the bound at each step of the coordinate ascent. At each step of the mean-
field variational inference algorithm, we have to find the factor in q that maximizes the
variational bound while maintaining all other factors fixed according to their values from
the previous iterations. If the parametrization of each factor is chosen carefully, then some-
times closed-form solutions for these mini-maximization problems are available (this is
especially true when the prior is conjugate to the likelihood). It is also often the case that
a nested optimization problem needs to be solved using optimization techniques such as
gradient descent or Newton’s method. Unfortunately, sometimes the nested optimization
problem itself is a non-convex optimization problem.
Recent work, such as that of Kucukelbir et al. (2016), tries to minimize the decision making that is not strictly related to modeling and data collection. The work by Kucukelbir et al. proposes an automatic variational inference algorithm, which uses automatic differentiation, integrated into the Stan programming language (Carpenter et al., 2015).
For example, in the case of PCFGs, $K$ would be the number of nonterminals in the grammar, $N_k$ would be the number of rules for each nonterminal and $\theta^k_i$ would correspond to the probability of the $i$th rule for the $k$th nonterminal. See Section 8.2 for more details about this formulation. Let $f_{ik}(x, z)$ be a function that counts the times that event $i$ from multinomial $k$ fires in $(x, z)$.
The most common choice of a conjugate prior for this model is a product-of-Dirichlets distribution, such that:
\[
p(\theta \mid \alpha) \propto \prod_{k=1}^{K} \prod_{i=1}^{N_k} \left(\theta^k_i\right)^{\alpha^k_i - 1},
\]
with $\alpha = (\alpha^1, \ldots, \alpha^K)$ and $\alpha^k \in \mathbb{R}^{N_k}$ such that $\alpha^k_i \ge 0$ for all $i$ and $k$.
Assuming a top-level prior, and with $X^{(1)}, \ldots, X^{(n)}$ being the observed random variables and $Z^{(1)}, \ldots, Z^{(n)}$ being the latent structures, the likelihood is:
\begin{align*}
\prod_{j=1}^{n} p\left(x^{(j)}, z^{(j)} \mid \theta\right) &= \prod_{j=1}^{n} \prod_{k=1}^{K} \prod_{i=1}^{N_k} \left(\theta^k_i\right)^{f_{ik}\left(x^{(j)}, z^{(j)}\right)} \\
&= \prod_{k=1}^{K} \prod_{i=1}^{N_k} \left(\theta^k_i\right)^{\sum_{j=1}^{n} f_{ik}\left(x^{(j)}, z^{(j)}\right)},
\end{align*}
with $f_{ik}(x, z)$ being the count of event $i$ from multinomial $k$ in the pair $(x, z)$. We denote in short:
\[
f_{k,i} = \sum_{j=1}^{n} f_{ik}\left(x^{(j)}, z^{(j)}\right).
\]
gives the bound on the marginal log-likelihood), looks like the following:
\begin{align*}
F\left(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha\right) &= E_q\left[\log\left(p(\theta \mid \alpha) \prod_{k=1}^{K} \prod_{i=1}^{N_k} \left(\theta^k_i\right)^{f_{k,i}}\right)\right] - E_q\left[\log q(\theta)\right] - E_q\left[\log q(Z)\right] \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} E_q\left[\left(f_{k,i} + \alpha^k_i - 1\right) \log \theta^k_i\right] + H(q(\theta)) + H(q(Z)),
\end{align*}
where $H(q(\theta))$ denotes the entropy of the distribution $q(\theta)$, and $H(q(Z))$ denotes the entropy of the distribution $q(Z)$ (for the definition of entropy, see Appendix A).
If we consider Algorithm 6.1 for this case, then we iterate between two stages: (a) assuming $q(\theta)$ is fixed, we optimize the bound in the above equation with respect to $q(Z)$; and (b) assuming $q(Z)$ is fixed, we optimize the bound in the above equation with respect to $q(\theta)$.
Consider first the case where $q(\theta)$ is fixed. In that case, $f_{k,i}$ depends only on the latent assignments $z^{(1)}, \ldots, z^{(n)}$ and not on the parameters, and therefore it holds that:
\begin{align*}
F\left(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha\right) &= \sum_{k=1}^{K} \sum_{i=1}^{N_k} E_q\left[\left(f_{k,i} + \alpha^k_i - 1\right) \log \theta^k_i\right] + H(q(Z)) + \text{const} \\
&= \sum_{k=1}^{K} \sum_{i=1}^{N_k} E_q\left[\psi^k_i f_{k,i}\right] - \log A(\psi) + H(q(Z)) + \text{const}, \tag{6.9}
\end{align*}
with $\psi$ having the same vector structure as $\theta$ and $\alpha$, such that
\[
\psi^k_i = E_{q(\theta)}\left[\log \theta^k_i\right],
\]
and
\[
A(\psi) = \sum_{z^{(1)}} \cdots \sum_{z^{(n)}} \exp\left(\sum_{k=1}^{K} \sum_{i=1}^{N_k} \psi^k_i f_{k,i}\right).
\]
Note that the term $\log A(\psi)$ can be added to Equation 6.9 because it does not depend on the latent structures, since we sum them out in this term. It does, however, depend on $q(\theta)$, but $q(\theta)$ is assumed to be fixed. If we carefully consider Equation 6.9, we note that it is, up to a constant, the negated KL-divergence (Appendix A) between $q(Z)$ and a log-linear model over $Z$ with sufficient statistics $f_{k,i}$ and parameters $\psi^k_i$. Therefore, when $q(\theta)$ is fixed, the functional $F$ is maximized when we choose $q(Z)$ to be the log-linear distribution with sufficient statistics $f_{k,i}$ and parameters $\psi^k_i = E_{q(\theta)}\left[\log \theta^k_i\right]$.
The meaning of this is that even though we a priori left $q(Z)$ in a nonparametric family, we discovered that the tightest solution for it resides in a parametric family, and this family has a very similar form to the likelihood (the main difference between the approximate posterior family and the likelihood is that with the approximate posterior family we also require normalization through $\log A(\psi)$, because $\psi$ does not necessarily represent a collection of multinomial distributions).
What about the opposite case, i.e., when $q(Z)$ is fixed and $q(\theta)$ needs to be inferred? In that case, it holds that:
\[
F\left(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha\right) \propto \sum_{k=1}^{K} \sum_{i=1}^{N_k} \left(E_q\left[f_{k,i}\right] + \alpha^k_i - 1\right) E_{q(\theta)}\left[\log \theta^k_i\right] + H(q(\theta)). \tag{6.10}
\]
If we carefully consider the equation above, we see that it is, up to a constant, the negated KL-divergence between $q(\theta)$ and a product of Dirichlet distributions (of the same form as the prior family) with hyperparameters $\beta = (\beta^1, \ldots, \beta^K)$ such that $\beta^k_i = E_q\left[f_{k,i}\right] + \alpha^k_i$. This is again a case where we leave $q(\theta)$ nonparametric, and we discover that the tightest solution has a parametric form. In fact, not only is it parametric, it also has the same form as the prior family.
The final variational inference algorithm looks like this:
• Initialize in some way $\beta = (\beta^1, \ldots, \beta^K)$.
• Repeat until convergence:
  – Compute $q(z^{(1)}, \ldots, z^{(n)})$ as the log-linear model mentioned above with parameters $\psi^k_i = E_{q(\theta)}\left[\log \theta^k_i \mid \beta\right]$.
  – Compute $q(\theta)$ as a product of Dirichlet distributions with hyperparameters $\beta^k_i = E_q\left[f_{k,i} \mid \psi\right] + \alpha^k_i$.
Consider the computation of $E_{q(\theta)}\left[\log \theta^k_i \mid \beta\right]$ and $E_q\left[f_{k,i} \mid \psi\right]$. It is known that for a given Dirichlet distribution, the expected log value of a single parameter can be expressed using the digamma function, meaning that:
\[
E_{q(\theta)}\left[\log \theta^k_i \mid \beta\right] = \Psi\left(\beta^k_i\right) - \Psi\left(\sum_{i'=1}^{N_k} \beta^k_{i'}\right),
\]
with $\Psi$ representing the digamma function. The digamma function cannot be expressed analytically, but there are numerical recipes for finding its value for a given argument. See Appendix B for more details about the digamma function and its relationship to the Dirichlet distribution.
On the other hand, computing $E_q\left[f_{k,i} \mid \psi\right]$ can be done using an algorithm that heavily depends on the structure of the likelihood function. For PCFGs, for example, this expectation can be computed using the inside-outside algorithm. For HMMs, it can be done using the forward-backward algorithm. See Chapter 8 for more details. Note that this expectation is computed for each observed example separately, i.e., we calculate $E_q\left[f_{ik}\left(x^{(j)}, z^{(j)}\right) \mid \psi\right]$ for $j \in \{1, \ldots, n\}$ and then aggregate all of these counts to get $E_q\left[f_{k,i} \mid \psi\right]$.
Whenever we are simply interested in the posterior over $q(Z)$, the above two update steps collapse into the following update rule for the variational parameters of $q(Z)$:
\[
\left(\psi^k_i\right)^{\text{new}} \leftarrow \Psi\left(E_q\left[f_{k,i} \mid \psi^{\text{old}}\right] + \alpha^k_i\right) - \Psi\left(\sum_{i'=1}^{N_k} \left(E_q\left[f_{k,i'} \mid \psi^{\text{old}}\right] + \alpha^k_{i'}\right)\right). \tag{6.11}
\]
Note that for these updates, the variational parameters $\psi^k_i$ need to be initialized first. Most often in the Bayesian NLP literature, when variational inference is used, final update rules such as the one above are the ones described. The log-linear model parametrized by $\psi^k_i$ can be re-parameterized using a new set of weights $\tilde{\theta}^k_i = \exp(\psi^k_i)$ for all $k$ and $i$. In this case, the update becomes:
\[
\left(\tilde{\theta}^k_i\right)^{\text{new}} \leftarrow \frac{\exp\left(\Psi\left(E_q\left[f_{k,i} \mid \tilde{\theta}^{\text{old}}\right] + \alpha^k_i\right)\right)}{\exp\left(\Psi\left(\sum_{i'=1}^{N_k} \left(E_q\left[f_{k,i'} \mid \tilde{\theta}^{\text{old}}\right] + \alpha^k_{i'}\right)\right)\right)}. \tag{6.12}
\]
Note that we now have an update similar to the EM algorithm, in which we compute expected counts and, in the M-step, normalize them. The main difference is that the counts are passed through the filter of the exp-digamma function, $\exp(\Psi(x))$. Figure 6.1 plots the exp-digamma function and compares it against the function $x - 0.5$. We can see that as $x$ becomes larger, the two functions get closer to each other. The main difference between the two functions is at values smaller than $0.5$, for which the exp-digamma function returns positive values that are very close to 0, while $x - 0.5$ returns negative values. Therefore, one way to interpret the update in Equation 6.12 is as a truncation of low expected counts during the E-step (those lower than $0.5$). Higher counts are also reduced by a value of around $0.5$, but the higher the count is in the E-step, the less influential this decrease is on the corresponding parameter.
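A sketch of the update in Equation 6.12 for a single multinomial block, using SciPy's digamma function; the expected counts and hyperparameters below are hypothetical.

import numpy as np
from scipy.special import digamma

def variational_update(expected_counts, alpha):
    """Update of Equation 6.12 for one multinomial block k: pass the expected counts
    plus the prior hyperparameters through exp(digamma(.)) and divide by the
    exp-digamma of their sum."""
    posterior = expected_counts + alpha
    return np.exp(digamma(posterior)) / np.exp(digamma(posterior.sum()))

# hypothetical expected counts E_q[f_{k,i}] for one multinomial with four events
expected_counts = np.array([3.2, 0.1, 7.5, 0.4])
alpha = np.full(4, 0.5)
print(variational_update(expected_counts, alpha))   # low counts are effectively truncated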
Figure 6.1: A plot of the function $f(x) = \exp(\Psi(x))$, the exp-digamma function (in the middle, in black), compared to the functions $f(x) = x$ (at the top, in blue) and $f(x) = x - 0.5$ (at the bottom, in red). Adapted from Johnson (2007b).
The EM algorithm iterates between two steps: the E-step, in which the posterior over
the latent structures is computed, and the M-step, in which a new set of parameters is com-
puted, until the marginal log-likelihood converges. It can be shown that the EM algorithm
finds a local maximum of the marginal log-likelihood function. The M-step is performed by
maximizing the expected log-likelihood of all variables in the model. The expectation is taken
with respect to a product distribution: the product of the empirical distribution over the ob-
served data and the posterior induced in the E-step. For more detailed information about the
expectation-maximization algorithm, see Appendix A.
There is actually a deeper connection between the EM algorithm and the variational in-
ference algorithm presented in Algorithm 6.1. The variational inference algorithm reduces to
the EM algorithm when the inputs to the variational inference algorithm and the prior in the
model are chosen carefully.
Consider the case where the set of latent variables is partitioned into two random variables: in terms of Algorithm 6.1, $U_1$ corresponds to a random variable over the parameters $\theta$, and $U_2$ corresponds to a variable over the set of all latent structures in the model (usually, it would be $Z^{(1)}, \ldots, Z^{(n)}$). Hence, the posterior has the form $q(\theta, Z) = q(\theta)q(Z)$.
Consider also $Q_1$ to represent the set of all distributions that place their whole probability mass on a single point in the parameter space. This means that $Q_1$ includes the set of all distributions $q(\theta \mid \theta^*)$ (parameterized by $\theta^* \in \Theta$) such that
\[
q(\theta \mid \theta^*) = \begin{cases} 1, & \text{if } \theta = \theta^* \\ 0, & \text{otherwise.} \end{cases}
\]
$Q_2$, on the other hand, remains nonparametric, and just includes the set of all possible distributions over the latent structures. Last, the prior in the model is chosen to be $p(\theta) = c$ (for some constant $c$) for all $\theta \in \Theta$, i.e., a uniform non-informative prior (possibly improper).
The functional $F$ now in essence depends on the assignment $\theta^*$ (selecting $q(\theta \mid \theta^*)$) and on $q(Z)$. We will express this functional as:
\[
F\left(q(Z), \theta^*, x^{(1)}, \ldots, x^{(n)}\right) = E_{q(Z)}\left[\log\left(\frac{p(\theta^* \mid \alpha)\, p\left(Z, X = (x^{(1)}, \ldots, x^{(n)}) \mid \theta^*\right)}{q(Z)}\right)\right].
\]
If we assume a non-informative constant prior, then maximizing the bound with respect to $q(Z)$ and $\theta^*$ can be done while ignoring the prior:
\[
F\left(q(Z), \theta^*, x^{(1)}, \ldots, x^{(n)}\right) \propto E_{q(Z)}\left[\log\left(\frac{p\left(Z, X = (x^{(1)}, \ldots, x^{(n)}) \mid \theta^*\right)}{q(Z)}\right)\right].
\]
This functional is exactly the same bound that the expectation-maximization algorithm maximizes. Maximizing the right-hand side with respect to $q(Z)$ while keeping $\theta^*$ fixed yields the posterior $q(Z) = p(Z \mid X = (x^{(1)}, \ldots, x^{(n)}), \theta^*)$, which in turn yields the E-step of the EM algorithm. On the other hand, maximizing the right-hand side with respect to $\theta^*$ while keeping $q(Z)$ fixed yields the M-step, since doing so maximizes the bound with respect to the parameters. See Appendix A for a derivation of the EM algorithm.
Input: Observed data $x^{(1)}, \ldots, x^{(n)}$, the bound $F(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha)$.
Output: Factorized approximate posteriors $q(\theta^{(i)})$ and $q(Z^{(i)})$ for $i \in \{1, \ldots, n\}$ and an estimated hyperparameter $\alpha'$.
1: Initialize $\alpha'$
2: repeat
3:   Maximize $F(q, x^{(1)}, \ldots, x^{(n)} \mid \alpha')$ with respect to $q$,
4:     using Algorithm 6.1 with the factorization in Equation 6.6
5:   $\alpha' \leftarrow \arg\max_{\alpha'} F(q^*, x^{(1)}, \ldots, x^{(n)} \mid \alpha')$
6: until the bound $F(q^*, x^{(1)}, \ldots, x^{(n)} \mid \alpha')$ converges
7: return $(\alpha', q^*)$
When using this kind of mean-field approximation, we require an additional estimation step, which integrates all the solutions for these sub-problems into a re-estimation step for the prior.
This is the main idea behind the variational EM algorithm. Variational EM is actually
an expectation-maximization algorithm, in which the hyperparameters for a prior family are
estimated based on data, and in which the E-step is an approximate E-step that finds a posterior
based on a variational inference algorithm, such as the one introduced in Algorithm 6.1. The
approximate posterior is identified over $Z$ and $\theta$, while the M-step maximizes the marginal
log-likelihood with respect to the hyperparameters.
The variational EM algorithm, with mean-field variational inference for the E-step, is
given in Algorithm 6.2.
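To make the structure of Algorithm 6.2 concrete, the following is a minimal sketch of variational EM for a toy mixture model with $K$ fixed component distributions over a finite vocabulary and a symmetric Dirichlet($\alpha$) prior over the mixing weights (in the spirit of Exercises 6.4–6.5). The E-step runs mean-field coordinate ascent over $q(\theta)q(z)$, and the M-step re-estimates the hyperparameter by a crude grid search; all names (components, alpha_grid, and so on) are illustrative assumptions, not taken from the book.

```python
# A minimal sketch of variational EM (cf. Algorithm 6.2) for a toy mixture model
# with K *fixed* components over a finite vocabulary and a symmetric
# Dirichlet(alpha) prior on the mixing weights.  Not the book's code.
import numpy as np
from scipy.special import digamma, gammaln

rng = np.random.default_rng(0)
components = np.array([[0.7, 0.2, 0.1],       # p(x | mu_1)
                       [0.1, 0.2, 0.7]])      # p(x | mu_2)
K, V = components.shape
x = rng.choice(V, size=200, p=0.5 * components[0] + 0.5 * components[1])

alpha = 1.0                                   # current hyperparameter estimate
gamma = np.ones(K)                            # q(theta) = Dirichlet(gamma)
alpha_grid = np.linspace(0.01, 5.0, 500)      # crude M-step by grid search

for _ in range(50):
    # E-step: coordinate ascent on the mean-field factors q(z) and q(theta)
    for _ in range(20):
        Elog_theta = digamma(gamma) - digamma(gamma.sum())
        phi = components[:, x].T * np.exp(Elog_theta)   # n x K, unnormalized
        phi /= phi.sum(axis=1, keepdims=True)           # q(z_i = k)
        gamma = alpha + phi.sum(axis=0)                 # q(theta) update
    # M-step: maximize E_q[log p(theta | alpha)] over the symmetric hyperparameter
    Elog_theta = digamma(gamma) - digamma(gamma.sum())
    score = (gammaln(K * alpha_grid) - K * gammaln(alpha_grid)
             + (alpha_grid - 1.0) * Elog_theta.sum())
    alpha = alpha_grid[np.argmax(score)]

print("estimated alpha:", alpha, "posterior mean of theta:", gamma / gamma.sum())
```

Only the term $E_q[\log p(\theta \mid \alpha)]$ of the bound depends on the hyperparameter, which is why the grid search above evaluates that term alone.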
6.5 DISCUSSION
We turn now to a discussion of important issues regarding the variational inference algorithms presented in this chapter—issues that have a crucial effect on the performance of these algorithms, but for which there is no well-formed theory.
A similar route can be followed in the empirical Bayesian setting, decoding $z^{(i)}$ by computing $\arg\max_z q\left(Z^{(i)} = z\right)$.
With variational EM, the hyperparameters $\alpha^*$ that are eventually estimated can be used to get a summary of the parameters' point estimate. For example, given these hyperparameters $\alpha^*$, one can use the mean value of the posterior over the parameters as a point estimate,
$$\theta^* = E[\theta \mid \alpha^*] = \int \theta\, p(\theta \mid \alpha^*)\, d\theta,$$
or alternatively, $\theta^* = \arg\max_\theta p(\theta \mid \alpha^*)$ (corresponding to the maximum a posteriori estimate). See Chapter 4 for a discussion. If the hyperparameters $\alpha^*$ have the same structure as the parameters (i.e., each hyperparameter in the $i$th coordinate, $\alpha_i$, maps directly to a parameter $\theta_i$), then the hyperparameters themselves can be used as a point estimate. The hyperparameters may not adhere to constraints on the parameter space (i.e., it could be the case that $\alpha^* \notin \Theta$), but they often do yield weights, which can be used in decoding the underlying model.
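As a small illustration of this summarization step, the snippet below (a sketch with hypothetical hyperparameter values, not the book's code) turns Dirichlet hyperparameters estimated by variational EM into a point estimate, using either the posterior mean or the MAP value when the mode exists.

```python
# A small sketch of turning estimated Dirichlet hyperparameters into a point
# estimate for decoding: the mean of p(theta | alpha), or the mode (MAP) when it
# exists.  The values of alpha are hypothetical.
import numpy as np

alpha = np.array([2.5, 0.8, 1.7])            # hypothetical estimated hyperparameters
theta_mean = alpha / alpha.sum()             # E[theta | alpha]
theta_map = None
if np.all(alpha > 1.0):                      # the Dirichlet mode exists only if all alpha_k > 1
    theta_map = (alpha - 1.0) / (alpha.sum() - len(alpha))
print(theta_mean, theta_map)
```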
Cohen and Smith (2010b) used this technique, and estimated the hyperparameters of a
collection of logistic normal distributions for grammar induction. The Gaussian means were
eventually used as parameters for a weighted grammar they used in decoding.
The above approach is especially useful when there is a clear distinction between a training
set and a test set, and the final performance measures are reported on the test set, as opposed to
a setting in which inference is done on all of the observed data.
When this split between a training and test set exists, one can use a different approach to
the problem of decoding with variational EM. Using the hyperparameters estimated from the
training data, an extra variational inference step can be taken on the test set, thus identifying the
posterior over latent structures for each of the test examples (using mean-field variational
inference). Based on these results, it is possible to follow the same route mentioned in the be-
ginning of this section, finding the highest scoring structure according to each of the posteriors
and using these as the predicted structure.
The bound F actually denotes the Kullback-Leibler (KL) divergence (see Appendix A)
between q and the posterior. As mentioned in the beginning of this chapter, finding an approx-
imate posterior is done by minimizing F . Therefore, minimization of the bound F corresponds
to finding a posterior $q$ from the family of posteriors $Q$ which minimizes $\mathrm{KL}(q, p)$.
KL divergence is not a symmetric function, and unfortunately, this minimization of the
KL divergence is done in the “reverse direction” from what is desirable. In most, “more correct,”
KL divergence minimization problems (such as maximum likelihood estimation), the free dis-
tribution that is optimized should represent the second argument to the KL divergence, while
the “true” distribution (the true posterior, in the case of variational inference), should represent
the first argument. In the reverse direction, $\min_q \mathrm{KL}(q, p)$, one could find solutions that are not necessarily meaningful. Still, with this approach, the KL divergence attains its minimum of 0 when $p = q$, which is a desirable property.
A discussion regarding KL divergence minimization direction for variational inference,
with graphical models, is given by Koller and Friedman (2009).
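The following short numerical example (a sketch using arbitrary discrete distributions, not taken from the book) illustrates this asymmetry: $\mathrm{KL}(q, p)$, the direction minimized by mean-field variational inference, generally differs from $\mathrm{KL}(p, q)$.

```python
# A short numerical illustration of the asymmetry of the KL divergence.
# p and q below are arbitrary discrete example distributions.
import numpy as np

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.5, 0.4, 0.1])   # "true" posterior
q = np.array([0.8, 0.1, 0.1])   # approximate posterior
print("KL(q, p) =", kl(q, p))   # the direction variational inference minimizes
print("KL(p, q) =", kl(p, q))   # the "more correct" direction, e.g., in MLE
```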
6.6 SUMMARY
Variational inference is just one of the workhorses used in Bayesian NLP inference. The most
common variant of variational inference used in NLP is that of mean-field variational inference,
the main variant discussed in this chapter.
Variational expectation-maximization can be used in the empirical Bayes setting, using a
variational inference sub-routine for the E-step, while maximizing the variational bound with
respect to the hyperparameters in the M-step.
6.7 EXERCISES
6.1. Consider the model in Example 5.1. Write down a mean-field variational inference algorithm to infer $p\left(\theta \mid x^{(1)}, \ldots, x^{(n)}\right)$.
6.2. Consider again the model in Example 5.1, only now change it such that there are multiple parameter draws, with $\theta^{(1)}, \ldots, \theta^{(n)}$ drawn, one for each example. Write down a mean-field variational inference algorithm to infer $p\left(\theta^{(1)}, \ldots, \theta^{(n)} \mid x^{(1)}, \ldots, x^{(n)}\right)$.
6.3. Show that Equation 6.8 is true. Also, show that Equation 6.10 is true.
6.4. Let $\mu_1, \ldots, \mu_K$ represent a set of $K$ parameters, where $K$ is fixed. In addition, let $p(X \mid \mu_i)$ represent a fixed distribution for a random variable $X$ over sample space $\Omega$ such that $|\Omega| < \infty$. Assume that $p(x \mid \mu_i) \neq p(x \mid \mu_j)$ for any $x \in \Omega$ and $i \neq j$. Define the following model, parametrized by $\theta$:
$$p(X \mid \theta, \mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \theta_k\, p(X \mid \mu_k),$$
where $\sum_{k=1}^{K} \theta_k = 1$ and $\theta_k \geq 0$. This is a mixture model, with mixture components that are fixed.
What is the log-likelihood for $n$ observations, $x^{(1)}, \ldots, x^{(n)}$, with respect to the parameters $\theta$? Under what conditions is the log-likelihood convex, if at all (with respect to $\theta$)?
6.5. Now assume that there is a symmetric Dirichlet prior over $\theta$, hyperparametrized by $\alpha > 0$. Compute the marginal log-likelihood of $n$ observations, $x^{(1)}, \ldots, x^{(n)}$, integrating out $\theta$. Is this a convex function with respect to $\alpha$?
CHAPTER 7
Nonparametric Priors
Consider a simple mixture model that defines a distribution over a fixed set of words. Each draw
from that mixture model corresponds to a draw of a cluster index (corresponding to a mixture
component) followed by a draw from a cluster-specific distribution over words. Each distri-
bution associated with a given cluster can be defined so that it captures specific distributional
properties of the words in the vocabulary, or identifies a specific category for the word. If the
categories are not pre-defined, then the modeler is confronted with the problem of choosing
the number of clusters in the mixture model. On one hand, if there is not a sufficiently large
number of components, it will be difficult to represent the range of possible categories; indeed,
words that are quite dissimilar according to the desired categorization may end up in the same
cluster. The opposite will happen if there are too many components in the model: much of the
slack in the number of clusters will be used to represent the noise in the data, and create overly
fine-grained clusters that should otherwise be merged together.
Ideally, we would like the number of clusters to grow as we increase the size of the vo-
cabulary and the observed text in the data. This flexibility can be attained with nonparametric
Bayesian modeling. The size of a nonparametric Bayesian model can potentially grow to be quite
large (it is unbounded), as a function of the number of data points n; but for any set of n data
points, the number of components inferred will always be finite.
The more general approach to nonparametric Bayesian modeling is to use a nonparametric
prior, often a stochastic process (roughly referring to a set of random variables indexed by an
infinite, linearly ordered set, such as the integers), which provides a direct distribution over a
set of functions or a set of distributions, instead of a distribution over a set of parameters. A
typical example of this in Bayesian NLP would be the Dirichlet process. The Dirichlet process
is a stochastic process that is often used as a nonparametric prior to define a distribution over
distributions. Each distribution drawn from the Dirichlet process can later be used to eventually
draw the observed data. For those who are completely unfamiliar with nonparametric Bayesian
modeling this may seem a bit vague, but a much more thorough discussion of the Dirichlet
process follows later in this chapter.1
Nonparametric priors are often generalizations of parametric priors, in which the number
of parameters is taken to infinity for the specific parametric family. For example, the nonpara-
metric Griffiths-Engen-McCloskey distribution can be thought of as a multinomial with an
infinite number of components. The Dirichlet process is the limit of a series of Dirichlet distributions.
1 For a description of a very direct relation between the k-means clustering algorithm and the Dirichlet process, see Kulis and Jordan (2011).
A Gaussian process is a generalization of the multivariate Gaussian distribution, only
instead of draws from the Gaussian process being vectors indexed by a finite set of coordinates,
they are indexed by a continuous value (which can be interpreted, for example, as a temporal
axis).
The area of Bayesian nonparametrics in the Statistics and Machine Learning literature is
an evolving and highly active area of research. New models, inference algorithms and applica-
tions are frequently being developed in this area. Traditional parametric modeling is in a more
stable state compared to Bayesian nonparametrics, especially in natural language processing. It
is therefore difficult to comprehensively review the rich, cutting-edge literature on this topic;
instead, the goal of this chapter is to serve as a preliminary peek into the core technical ideas
behind Bayesian nonparametrics in NLP.
The Dirichlet process plays a pivotal role in Bayesian nonparametrics for NLP, similar
to the pivotal role that the Dirichlet distribution plays in parametric Bayesian NLP modeling.
Therefore, this chapter focuses on the Dirichlet process, and as such has the following organiza-
tion. We begin by introducing the Dirichlet process and its various representations in Section 7.1.
We then show how the Dirichlet process can be used in a nonparametric mixture model in
Section 7.2. We next show in Section 7.3 how hierarchical models can be constructed with the
Dirichlet process as the foundation for solving issues such as the selection of the number of topics
for the latent Dirichlet allocation model. Finally, we discuss the Pitman–Yor process, of which
the Dirichlet process is a special case (Section 7.4), and then follow with a brief discussion of
other stochastic processes that are used in Bayesian nonparametrics.
Figure 7.1: A graphical depiction of the stick-breaking process. The rectangles on the right (in blue) are the actual probabilities associated with each element in the infinite multinomial. The process of breaking the stick repeats ad infinitum, leading to an infinite vector of $\beta_i$ variables. At each step, the left part of the stick is broken into two pieces.
as the “mean,” where the expected value $E[G(A)]$ (where the expectation is taken with respect to $G$) equals $G_0(A)$ for any measurable set $A$. The concentration parameter, on the other hand, controls the variance in the Dirichlet process as follows:
$$\mathrm{Var}(G(A)) = \frac{G_0(A)\,\bigl(1 - G_0(A)\bigr)}{s + 1}.$$
The larger $s$ is, the closer draws $G$ from the Dirichlet process are to $G_0$.
The mathematical definition of the Dirichlet process above, as fundamental as it may be,
is not constructive.2 In the next two sections we provide two other perspectives on the Dirichlet
process that are more constructive, and therefore are also more amenable to use in Bayesian NLP
with approximate inference algorithms.
$$\beta_k = \pi_k \prod_{j=1}^{k-1} (1 - \pi_j). \qquad (7.1)$$
In this case, the infinite vector $(\beta_1, \beta_2, \ldots)$ is said to be drawn from the GEM distribution with concentration parameter $s$.
Draws from the GEM distribution can be thought of as draws of “infinite multinomials,” because the following is satisfied for every draw $\beta$ from the GEM distribution:
$$\beta_k \geq 0 \quad \text{for all } k, \qquad (7.2)$$
$$\sum_{k=1}^{\infty} \beta_k = 1. \qquad (7.3)$$
2 By that we mean that it does not describe the Dirichlet process in a way amenable to model specification, or inference.
Hyperparameters: $G_0$, $s$.
Variables: $\beta_i \geq 0$, $\theta_i$ for $i \in \{1, 2, \ldots\}$.
Output: Distribution $G$ over a discrete subset of the sample space of $G_0$.
• Draw $\beta \sim \mathrm{GEM}(s)$.
• Draw $\theta_1, \theta_2, \ldots \sim G_0$.
• The distribution $G$ is defined as:
$$G(\theta) = \sum_{k=1}^{\infty} \beta_k\, \mathbb{I}(\theta = \theta_k). \qquad (7.4)$$
Generative Story 7.1: The generative story for drawing a distribution from the Dirichlet process.
Since the components of this infinite multinomial sum to one, the components must decay quickly, so that the tail $\sum_{i=m}^{\infty} \beta_i$ goes to 0 as $m$ goes to infinity. This is guaranteed by the iterative process through which a unit “stick” is broken into pieces, each time further breaking the residual part of the stick (Equation 7.1).
Based on the stick-breaking representation, a draw of a distribution $G \sim \mathrm{DP}(G_0, s)$ from the Dirichlet process can be represented using Generative Story 7.1. First, an infinite non-negative vector that sums to 1 is drawn from the GEM distribution (line 1). This corresponds to an infinite multinomial over “atoms,” which are drawn next from the base distribution (line 2). Each atom is associated with an index in the infinite vector (line 3). This means that every draw from the Dirichlet process has the structure in Equation 7.4 for some atoms $\theta_k$ and coefficients $\beta$.
The stick-breaking process also demonstrates again the role of the $s$ parameter. The larger $s$ is, the less rapidly the parts of the stick decay to zero (in their length, taken with respect to the unit length). This is evident in Equation 7.1, when considering the fact that $\pi_k \sim \mathrm{Beta}(1, s)$. The larger $s$ is, the smaller each $\pi_k$ is compared to $1 - \pi_k$, and therefore, more of the probability mass is preserved for the other pieces of the stick.
The stick-breaking process also demonstrates that any draw from the Dirichlet process is
a distribution defined over a discrete (or finite) support. The distribution in Equation 7.4 assigns
positive weight to a discrete subset of the sample space of G0 .
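The stick-breaking construction is straightforward to simulate. The sketch below (assuming numpy; the truncation level $K$ is an approximation, since a true draw is an infinite vector) draws a truncated sample from $\mathrm{GEM}(s)$ via Equation 7.1.

```python
# A minimal sketch of a truncated draw from GEM(s) via stick breaking (Eq. 7.1).
import numpy as np

def gem(s, K, rng):
    pi = rng.beta(1.0, s, size=K)                           # pi_k ~ Beta(1, s)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - pi)[:-1]))
    return pi * remaining                                   # beta_k = pi_k * prod_{j<k}(1 - pi_j)

rng = np.random.default_rng(0)
beta = gem(s=5.0, K=1000, rng=rng)
print(beta[:5], beta.sum())                                 # the sum approaches 1 as K grows
```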
Hyperparameters: $G_0$, $s$.
Variables: $Y^{(i)}$, $\theta_i$ for $i \in \{1, \ldots, n\}$.
Output: $\theta^{(i)}$ for $i \in \{1, \ldots, n\}$ drawn from $G \sim \mathrm{DP}(G_0, s)$.
Generative Story 7.2: The Dirichlet process represented using the Chinese restaurant process.
Figure 7.2: A graphical depiction of the posterior distribution of the Chinese restaurant process. Each black circle is a “customer” that sat next to one of the tables. In the picture, 3 tables are open in the restaurant with 10 customers. For $\alpha = 1.5$, there is a probability of $\frac{5}{5 + 2 + 3 + 1.5} = \frac{5}{11.5}$ for a new customer to go to the first table, a probability of $\frac{2}{11.5}$ to go to the second table, $\frac{3}{11.5}$ for the third table, and $\frac{1.5}{11.5}$ to go to a new table.
means that the joint distribution over $Y^{(1)}, \ldots, Y^{(n)}$ is the same as the distribution over a permutation of these random variables. In the CRP metaphor, the order in which the customers
enter the restaurant does not matter.
This view of the Dirichlet process demonstrates the way that the number of parameters
grows as more data is available (see beginning of this chapter). The larger the number of samples
is, the more “tables” are open, and therefore, the larger the number of parameters used when
performing inference.
The Chinese restaurant process induces a distribution over partitions of the $Y^{(i)}$, and therefore it is only a function of the counts of customers that are seated next to a certain table. More formally, the CRP distribution is a function of the integer count vector $N$ of length $m$ with $m = y_n$ (the total number of tables used for $n$ customers) and $N_k = \sum_{i=1}^{n} \mathbb{I}(Y^{(i)} = k)$ (the count of customers for each table $k \in \{1, \ldots, m\}$), and is defined as follows:
$$p(N \mid s) = \frac{s^m \prod_{k=1}^{m} (N_k - 1)!}{\prod_{i=0}^{n-1} (i + s)}.$$
With the CRP, one can use Generative Story 7.2 to define a representation for the Dirichlet process. An equivalent procedure for generating $\theta^{(1)}, \ldots, \theta^{(n)}$ from $G \sim \mathrm{DP}(G_0, s)$ is to draw a partition from the CRP, assign each “table” in the partition a draw from $G_0$, and then set $\theta^{(i)}$ to be the draw from $G_0$ associated with the table the $i$th customer is seated at.
The CRP has a strong connection to the GEM distribution. Let $\beta$ be a draw from the GEM distribution with concentration parameter $s$. Let $U_1, \ldots, U_N$ be a set of random variables that take integer values such that $p(U_i = k) = \beta_k$. These random variables induce a partition over $\{1, \ldots, N\}$, where each set in the partition consists of all $U_i$ that take the same value. As such, for a given $N$, the GEM distribution induces a partition over $\{1, \ldots, N\}$. The distribution over partitions that the GEM distribution induces is identical to the distribution over partitions that the CRP induces (with $N$ customers) with concentration parameter $s$.
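The sequential CRP view is equally easy to simulate. The following sketch (assuming numpy; not the book's code) seats $n$ customers one at a time, with each customer joining an existing table with probability proportional to its occupancy, or opening a new table with probability proportional to $s$.

```python
# A sketch of sequential sampling from the Chinese restaurant process with
# concentration s: customer i joins table k with probability proportional to
# N_k, or opens a new table with probability proportional to s.
import numpy as np

def crp(n, s, rng):
    tables = []                              # N_k: customer counts per table
    seating = []
    for i in range(n):
        probs = np.array(tables + [s], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(0)                 # open a new table
        tables[k] += 1
        seating.append(k)
    return seating, tables

rng = np.random.default_rng(0)
seating, tables = crp(n=100, s=1.5, rng=rng)
print(len(tables), "tables opened for 100 customers")
```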
Hyperparameters: $G_0$, $s$.
Latent variables: $\theta^{(i)}$ for $i \in \{1, \ldots, n\}$.
Observed variables: $X^{(i)}$ for $i \in \{1, \ldots, n\}$.
Output: $x^{(1)}, \ldots, x^{(n)}$ drawn from the Dirichlet process mixture model.
Generative Story 7.3: The generative story for the Dirichlet process mixture model. The distribution $G_0$ is the base distribution such that its sample space is a set of parameters for each of the mixture components.
$$p\left(\theta^{(i)} \mid \theta^{(-i)}, x^{(i)}, G_0, s\right) \propto p\left(\theta^{(i)}, x^{(i)} \mid \theta^{(-i)}, G_0, s\right)$$
$$= \frac{1}{n - 1 + s} \sum_{j \neq i} \mathbb{I}\left(\theta^{(j)} = \theta^{(i)}\right) p\left(x^{(i)} \mid \theta^{(j)}\right) + \frac{s}{n - 1 + s}\, G_0\left(\theta^{(i)}\right) p\left(x^{(i)} \mid \theta^{(i)}\right)$$
$$= \frac{1}{n - 1 + s} \sum_{j \neq i} \mathbb{I}\left(\theta^{(j)} = \theta^{(i)}\right) p\left(x^{(i)} \mid \theta^{(j)}\right) + \frac{s}{n - 1 + s} \left(\int G_0(\theta)\, p\left(x^{(i)} \mid \theta\right) d\theta\right) p\left(\theta^{(i)} \mid x^{(i)}\right), \qquad (7.6)$$
where $p(\theta \mid X)$ is the posterior distribution over the parameter space for the distribution $p(\theta, X) = G_0(\theta)\, p(X \mid \theta)$, i.e., $p(\theta \mid X) \propto G_0(\theta)\, p(X \mid \theta)$. The transition to Equation 7.6 can be justified by the following:
$$G_0\left(\theta^{(i)}\right) p\left(x^{(i)} \mid \theta^{(i)}\right) = G_0\left(\theta^{(i)}\right) \frac{p\left(\theta^{(i)} \mid x^{(i)}\right) p\left(x^{(i)}\right)}{G_0\left(\theta^{(i)}\right)} = p\left(\theta^{(i)} \mid x^{(i)}\right) p\left(x^{(i)}\right),$$
3 This can be shown by exchangeability, and assuming that the $i$th sample is the last one that was drawn.
and
$$p\left(x^{(i)}\right) = \int G_0(\theta)\, p\left(x^{(i)} \mid \theta\right) d\theta. \qquad (7.7)$$
This is where the importance of the conjugacy of G0 to the likelihood becomes apparent.
If indeed there is such a conjugacy, the constant in Equation 7.7 can be easily computed, and the
posterior is easy to calculate as well. (This conjugacy is not necessary for the use of the Dirichlet
process mixture model, but it makes it more tractable.)
These observations yield a simple posterior inference mechanism for $\theta^{(1)}, \ldots, \theta^{(n)}$: for each $i$, given $\theta^{(-i)}$, do the following:
• With probability proportional to $p\left(x^{(i)} \mid \theta^{(j)}\right)$ for $j \neq i$, set $\theta^{(i)}$ to $\theta^{(j)}$.
• With probability proportional to $s \int G_0(\theta)\, p\left(x^{(i)} \mid \theta\right) d\theta$, set $\theta^{(i)}$ to a draw from the distribution $p\left(\theta \mid x^{(i)}\right)$.
This sampler is a direct result of Equation 7.6. The probability of $\theta^{(i)}$ given the relevant information can be viewed as a composition of two types of events: an event that assigns $\theta^{(i)}$ an existing $\theta^{(j)}$ for $j \neq i$, and an event that draws a new $\theta^{(i)}$.
The above two steps are iterated until convergence. While the sampler above is a perfectly correct Gibbs sampler (Escobar, 1994, Escobar and West, 1995), it tends to mix quite slowly. The reason is that each $\theta^{(i)}$ is sampled separately—local changes are made to the seating arrangements, and therefore it is hard to find a global seating arrangement with high probability. Neal (2000) describes another Gibbs algorithm that solves this issue. This algorithm samples the assignments to tables, and changes atoms for all customers sitting at a given table at once. He also suggests a specialization of this algorithm to the conjugate DPM case, where $G_0$ is conjugate to the likelihood from which the $x^{(i)}$ are sampled. For a thorough investigation of other scenarios in which Gibbs sampling or other MCMC algorithms can be used for Dirichlet process mixture models, see Neal (2000).
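As an illustration of the conjugate case, the sketch below implements a collapsed, table-based Gibbs sampler for a Dirichlet process mixture in which $G_0$ is a symmetric Dirichlet over categorical emission parameters, so each table's parameter can be integrated out analytically (in the spirit of the samplers discussed by Neal, 2000). This is a simplified illustration, not the book's exact algorithm; the toy data and all names are hypothetical.

```python
# A sketch of a collapsed, table-based Gibbs sampler for a DP mixture in the
# conjugate case: G0 is a symmetric Dirichlet(eta) over categorical emission
# parameters, so table parameters are integrated out analytically.
import numpy as np

def dpmm_gibbs(x, V, s=1.0, eta=0.5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(len(x), dtype=int)                         # table assignment per customer
    counts = [np.bincount(x, minlength=V).astype(float)]    # word counts per table
    for _ in range(iters):
        for i in range(len(x)):
            counts[z[i]][x[i]] -= 1.0                       # remove customer i from its table
            if counts[z[i]].sum() == 0:
                counts.pop(z[i])
                z[z > z[i]] -= 1
            # predictive probability of x[i] under each existing table, and a new one
            probs = [c.sum() * (eta + c[x[i]]) / (V * eta + c.sum()) for c in counts]
            probs.append(s * (1.0 / V))
            probs = np.array(probs) / np.sum(probs)
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):
                counts.append(np.zeros(V))
            counts[k][x[i]] += 1.0
            z[i] = k
    return z, len(counts)

x = np.array([0, 0, 1, 1, 1, 4, 4, 5, 5, 5, 5])             # toy "word" observations, V = 6
z, num_tables = dpmm_gibbs(x, V=6)
print(num_tables, "clusters inferred")
```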
Hyperparameters: $G_0$, $s$.
Latent variables: $\beta$, $\theta_j$ for $j \in \{1, \ldots\}$, $Z^{(i)}$ for $i \in \{1, \ldots, n\}$.
Observed variables: $X^{(i)}$ for $i \in \{1, \ldots, n\}$.
Output: $x^{(1)}, \ldots, x^{(n)}$ generated from the Dirichlet process mixture model.
• Draw $\beta \sim \mathrm{GEM}(s)$.
• Draw $\theta_1, \theta_2, \ldots \sim G_0$.
• Draw $z^{(i)} \sim \beta$ for $i \in \{1, \ldots, n\}$ such that $z^{(i)}$ is an integer.
• Draw $x^{(i)} \sim p\left(x^{(i)} \mid \theta_{z^{(i)}}\right)$ for $i \in \{1, \ldots, n\}$.
Generative Story 7.4: The generative story for the DPMM using the stick-breaking process. The distribution $G_0$ is the base distribution such that its sample space is a set of parameters for each of the mixture components.
from the Beta distribution, as explained in Section 7.1.1. Therefore, to define a variational distribution over $\beta$, it is sufficient to define a variational distribution over $\pi_1, \pi_2, \ldots$.
Blei and Jordan suggest using a mean-field approximation for the $\pi_i$, such that $q(\pi_i)$ for $i \in \{1, \ldots, K - 1\}$ is a Beta distribution (just like the true distribution for $\pi_i$) and:
$$q(\pi_K) = \begin{cases} 1 & \pi_K = 1 \\ 0 & \pi_K \neq 1. \end{cases}$$
Since $q(\pi_K)$ puts its whole probability mass on $\pi_K$ being 1, there is no need to define variational distributions for $\pi_i$ for $i > K$. Such a $q(\pi_K)$ implies that $\beta_i = 0$ for any $i > K$ according to the variational distribution. The variational stick is truncated at point $K$.
Constants: $K$ and $n$.
Hyperparameters: $G_0$, a base measure defined over $\Theta$; $s$; a model family $F(X \mid \theta)$ for $\theta \in \Theta$.
Latent variables: $Z^{(1)}, \ldots, Z^{(n)}$.
Observed variables: $X^{(1)}, \ldots, X^{(n)}$.
Output: A set of $n$ points drawn from a mixture model.
• Generate $\theta_1, \ldots, \theta_K \sim G_0$.
• Draw $\beta \sim \mathrm{Dirichlet}(\alpha/K)$ (from a symmetric Dirichlet).
• For $j \in \{1, \ldots, n\}$, ranging over examples:
  – Draw $z^{(j)}$ from $\beta$.
  – Draw $x^{(j)}$ from $F\left(X \mid \theta_{z^{(j)}}\right)$.
Generative Story 7.5: An approximation of the Dirichlet process mixture model using a finite mixture model.
scoring functions; he tested his algorithm on a character recognition problem and the clustering
of documents from the NIPS conference.
Inference with the HDP The original paper by Teh et al. (2006) suggested to perform infer-
ence with the HDP using MCMC sampling (Chapter 5). The authors suggest three sampling
schemes for the hierarchical Dirichlet process:
• Sampling using the Chinese restaurant franchise representation: the HDP can be de-
scribed in terms similar to that of the Chinese restaurant process. Instead of having a single
restaurant, there would be multiple restaurants, each representing a draw from a base dis-
tribution, which is by itself a draw from a Dirichlet process. With this sampling scheme,
the state space consists of indices for tables assigned to each customer $i$ in restaurant $j$, and indices for dishes served at table $t$ in restaurant $j$.
Generative Story 7.6: The generative story for an LDA-style model using the Dirichlet process.
• Sampling using direct assignment: in the previous two samplers, the state space is such
that assignments of customers to dishes are indirect, through table–dish assignments. This can make bookkeeping complicated as well. Teh suggests another sampler, in which customers
are assigned to dishes directly from the first draw of the HDP, G0 . This means that the
state space is now assignments to a dish (from the draw of G0 ) for each customer i in
restaurant j and a count of the number of customers with a certain dish k in restaurant j .
Alternatives to MCMC sampling for the HDP have also been suggested. For example,
Wang et al. (2011) developed an online variational inference algorithm for the HDP. Bryant
and Sudderth (2012) developed another variational inference algorithm for the HDP, based on
a split-merge technique. Teh et al. (2008) extended a collapsed variational inference algorithm,
originally designed for the LDA model, to the HDP model.
7.4 THE PITMAN–YOR PROCESS
The Pitman–Yor process (Pitman and Yor, 1997), sometimes also called the “two-parameter Poisson–Dirichlet process,” is closely related to the Dirichlet process, and it also defines a distribution over distributions. The Pitman–Yor process uses two real-valued parameters—a strength parameter $s$ that plays the same role as in the CRP, and a discount parameter $d \in [0, 1]$. In addition, it also makes use of a base distribution $G_0$.
The generative process for generating a random distribution from the Pitman–Yor process
is almost identical to the generative process described in Section 7.1.2. This means that for n
observations, one draws a partition over the integers between 1 to n, and each cluster in the
partition is assigned an atom from G0 .
The difference between the Dirichlet process and the Pitman–Yor process is that the
Pitman–Yor process makes use of a generalization of the Chinese restaurant process, and it
modifies Equation 7.5 such that:
p Y .i/ D rjy .1/ ; : : : ; y .i 1/ ; s; d D
8 Pi 1
ˆ
ˆ j D1 I.y .j / D r/ d
< ; if r yi
i 1 C s :
ˆ s C yd
:̂ i
;
if r D yi C 1
i 1Cs
The discount parameter $d$ plays a role in controlling the expected number of tables that will be generated for $n$ “customers.” With $d = 0$, the Pitman–Yor process reduces to the Dirichlet process. With larger $d$, more tables are expected to be used for $n$ customers. For a more detailed
discussion, see Section 7.4.2.
This modified Chinese restaurant process again induces a distribution that is a function of the integer count vector $N$ of length $m$ with $m = y_n$ and $N_k = \sum_{i=1}^{n} \mathbb{I}(Y^{(i)} = k)$. It is defined as follows:
$$p(N \mid s, d) = \frac{\prod_{k=1}^{m} \bigl(d(k - 1) + s\bigr) \prod_{j=1}^{N_k - 1} (j - d)}{\prod_{i=0}^{n-1} (i + s)}. \qquad (7.8)$$
The Pitman–Yor process also has a stick-breaking representation, which is very similar to the stick-breaking representation of the Dirichlet process. More specifically, a Pitman–Yor process with a base distribution $G_0$, a strength parameter $s$ and a discount parameter $d$ will follow the same generative process as in Section 7.1.1 for the Dirichlet process, only the $\pi_k$ are now drawn from $\mathrm{Beta}(1 - d, s + kd)$ (Pitman and Yor, 1997). This again shows that when $d = 0$, the Pitman–Yor process reduces to the Dirichlet process.
The Pitman–Yor process can be used to construct a hierarchical Pitman–Yor process, sim-
ilar to the way that the Dirichlet process can be used to create an HDP (Section 7.3). The
hierarchy is constructed from Pitman–Yor processes, instead of Dirichlet processes—with an additional discount parameter. Such a hierarchical PY process was used, for example, for dependency parsing by Wallach et al. (2008).
$$p(x_1 \cdots x_m) = p(x_1) \prod_{i=2}^{m} p(x_i \mid x_1 \cdots x_{i-1}). \qquad (7.9)$$
The most common and successful language models are $n$-gram models, which simply make a Markovian assumption that the context necessary to generate $x_i$ is the $n - 1$ words that preceded $x_i$. For example, for a bigram model, where $n = 2$, the probability in Equation 7.9 would be formulated as:
$$p(x_1 \cdots x_m) = p(x_1) \prod_{i=2}^{m} p(x_i \mid x_{i-1}).$$
Naturally, a Bayesian language model would place a prior over the probability distributions that generate a new word given its context. Teh (2006b) uses the Pitman–Yor process as a prior over these distributions. In the terminology of the paper, let an $n$-gram distribution such as $p(w \mid w_1 \cdots w_{n-1})$ be $G_{w_1 \cdots w_{n-1}}(w)$, and let $\pi(w_1 \cdots w_r) = w_2 \cdots w_r$, i.e., $\pi$ is a function that takes a sequence of words and removes the “earliest” word. Then, Teh defines a hierarchical prior over the $n$-gram distributions $G_{w_1 \cdots w_r}$ ($r \leq n - 1$) as:
Figure 7.3: Left plot: average of $\beta_i$ as a function of $i$ according to 5,000 samples from a Pitman–Yor process prior. The value of the concentration parameter $s$ is 10 in all cases. The values of the discount parameter are $d = 0$ (black; Dirichlet process), $d = 0.1$ (red), $d = 0.5$ (blue) and $d = 0.8$ (purple). Right plot: the average number of tables obtained as a function of the number of customers $n$ over 5,000 samples from the Chinese restaurant process. The concentration and discount parameters are identical to the left plot.
This power-law behavior fits natural language modeling very well. For example, with a Pitman–Yor unigram language model, each table corresponds to a word type, and $n$ corresponds
to word tokens. Therefore, the expected number of tables gives the expected number of word
types in a corpus of size n. As Zipf (1932) argued, this setting fits a power-law behavior.
Figure 7.3 demonstrates the behavior of the Dirichlet process vs. the Pitman–Yor process when $d \neq 0$. First, it shows that $\beta_i$, the length of the $i$th part of the stick (in the stick-breaking representation of the DP and the PYP), has an exponential decay when $d = 0$ (i.e., when we use the Dirichlet process) vs. a heavier tail when we use the Pitman–Yor process. With $d \neq 0$, it can be shown that $\beta_i \propto (i)^{-\frac{1}{d}}$. The figure also shows the average number of tables opened when sampling from the Chinese restaurant process for the DP and the PYP. The number of tables grows logarithmically when using $d = 0$, but grows faster when $d > 0$.
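The growth behavior in the right plot of Figure 7.3 is easy to reproduce by simulation. The sketch below (assuming numpy; not from the book) counts the tables opened by the Pitman–Yor Chinese restaurant process for $d = 0$ and $d > 0$.

```python
# A small simulation comparing the growth of the number of tables under the CRP
# (d = 0) and the Pitman-Yor CRP (d > 0): logarithmic versus power-law growth.
import numpy as np

def num_tables(n, s, d, rng):
    counts = []                                   # customers per table
    for i in range(n):
        weights = [c - d for c in counts] + [s + d * len(counts)]
        probs = np.array(weights) / (i + s)       # sums to 1 analytically
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
for d in (0.0, 0.5, 0.8):
    avg = np.mean([num_tables(1000, s=10.0, d=d, rng=rng) for _ in range(20)])
    print(f"d = {d}: about {avg:.1f} tables for 1000 customers")
```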
7.5 DISCUSSION
This chapter has a large focus on the Dirichlet process, its derivatives and its different represen-
tations. While the Dirichlet process plays an important role in Bayesian nonparametric NLP,
there are other stochastic processes used as nonparametric Bayesian priors.
The machine learning literature has produced many other nonparametric Bayesian priors
for solving various problems. In this section we provide an overview of some of these other
nonparametric models. We do not describe the posterior inference algorithms with these priors,
since they depend on the specifics of the model in which they are used. The reader should consult
with the literature in order to understand how to do statistical inference with any of these priors.
Indeed, the goal of this section is to inspire the reader to find a good use for the described priors
for natural language processing problems, as these priors have not often been used in NLP.
$$p\left(Y^{(i)} = j \mid D, s\right) \propto \begin{cases} f(D_{ij}), & \text{if } i \neq j \\ s, & \text{if } i = j, \end{cases}$$
where $s$ plays a similar role to a concentration parameter.
The seating arrangement of each customer next to another customer induces a seating
arrangement next to tables, similar to the CRP, by assigning a table to all customers who are
connected to each other through some path.
Note that the process that Blei and Frazier suggest is not sequential. This means that customer 1 can pick customer 2 and customer 2 can pick customer 1, and more generally, that
there can be cycles in the arrangement. The distance-based CRP is not sequential in the sense
of the original CRP. In addition, the order in which the customers enter the restaurant does
matter, and the property of exchangeability is broken with the distance-dependent CRP (see
also Section 7.1.2).
However, it is possible to recover the traditional sequential CRP (with concentration parameter $s$) by assuming that $D_{ij} = \infty$ for $j > i$, $D_{ij} = 1$ for $j < i$, and $f(d) = 1/d$ with $f(\infty) = 0$.
In this case, customers can only join customers that have a lower index. The total probability of
joining a certain table will be proportional to the sum of the distances between a customer and
all other customers sitting at that table, which in the above formulation will just be the total
number of customers sitting next to that table.
There has been relatively little use of distance-dependent CRPs in NLP, but examples of
such use include the use of distance-dependent CRP for the induction of part-of-speech classes
(Sirts et al., 2014). Sirts et al. use a model that treats words as customers entering a restaurant,
and at each point, the seating of a word at a table depends on its similarity to the other words
at the table with respect to distributional and morphological features. The tables represent the
part-of-speech classes.
Another example for the use of distance-dependent CRPs in NLP is by Titov and Kle-
mentiev (2012). In this example, the distance-dependent CRP is used to induce semantic roles
in an unsupervised manner.
7.5.5 SEQUENCE MEMOIZERS
Sequence memoizers (Wood et al., 2009) are hierarchical nonparametric Bayesian models that
define non-Markovian sequence models. Their idea is similar to the one that appears in Section 7.4.1. A sequence of distributions over predicted tokens at each step is drawn from the Pitman–Yor process. To predict the $k$th token given context $x_1 \cdots x_{k-1}$, the distribution $G_{x_1 \cdots x_{k-1}}$ is drawn from a Pitman–Yor process with base distribution $G_{x_2 \cdots x_{k-1}}$. The distribution $G_{x_2 \cdots x_{k-1}}$, in turn, is drawn from a Pitman–Yor process with base distribution $G_{x_3 \cdots x_{k-1}}$, and so on.
Wood et al. describe a technique to make posterior inference with sequence memoizers
efficient. To do so, they use a specific sub-family of the Pitman–Yor process, in which the con-
centration parameter is 0. This enables marginalizing out the distributions $G_{x_i \cdots x_{k-1}}$ for $i > 1$, and constructing the final predictive distribution $G_{x_1 \cdots x_{k-1}}$. It also enables an efficient represen-
tation of the sequence memoizer, which grows linearly with the sequence length.
Sequence memoizers were mostly used for language modeling, as a replacement for the
usual n-gram Markovian models. They were further improved by Gasthaus and Teh (2010),
introducing a richer hyperparameter setting, and a representation that is memory efficient. In-
ference algorithms for this improved model are also described by Gasthaus and Teh (2010). The
richer hyperparameter setting allows the modeler to use a non-zero concentration value for the
underlying Pitman–Yor process. They were also used by Shareghi et al. (2015) for structured
prediction.
7.6 SUMMARY
This chapter gave a glimpse into the use of Bayesian nonparametrics in the NLP literature. Some
of the important concepts that were covered include the following.
• The Dirichlet process, which serves as an important building block in many Bayesian non-
parametric models for NLP. Three equivalent constructions of the Dirichlet process were
given.
• Dirichlet process mixtures, which are a generalization of a finite mixture model.
• The Hierarchical Dirichlet process, in which several Dirichlet processes are nested together
in the model, in order to be able to “share atoms” between the various parts of the model.
• The Pitman–Yor process, which is a generalization of the Dirichlet process.
• Some discussion of other Bayesian nonparametric models and priors.
In Chapter 8, other examples of models and uses of Bayesian nonparametrics are given,
such as the HDP probabilistic context-free grammar and adaptor grammars.
7.7 EXERCISES
7.1. Show that the Chinese restaurant process describes a joint distribution over an ex-
changeable set of random variables.
7.2. Let $Z_n$ be a random variable denoting the total number of tables open after $n$ customers were seated in a Chinese restaurant process arrangement (with concentration $\alpha$). Show that as $n$ becomes large, $E[Z_n] \approx \alpha \log n$.
7.3. Consider Equations 7.2–7.3. Show that indeed a draw $\beta$ from the GEM distribution satisfies them. You will need to use the definition of $\beta$ from Equation 7.1.
7.4. In Section 7.2.1, a Gibbs sampler is given to sample $\theta^{(1)}, \ldots, \theta^{(n)}$ from the Dirichlet process mixture model. Write down the sampler for the case in which $G_0$ is a Dirichlet distribution over the $(p - 1)$-dimensional probability simplex, and $p(X \mid \theta)$ is a multinomial distribution over $p$ elements.
7.5. Alice derived the sampler in the previous question, but discovered later that a better choice of $G_0$ is a renormalized Dirichlet distribution, in which none of the coordinates drawn from $G_0$ can be larger than $1/2$ (assume $p > 2$). Based on rejection sampling (Section 5.9), derive a Gibbs sampler for such a $G_0$. Consult Section 3.1.4 as needed.
CHAPTER 8
Bayesian Grammar Models
1 If a grammar has a “context-free backbone,” it does not necessarily mean that the language it generates is context-free.
It just means that its production rules do not require context on the left-hand side.
• There is a large body of work on probabilistic context-free grammars and their derivatives.
This means that being well-acquainted with this grammar formalism is important for any
NLP researcher.
• PCFGs offer a generic way to derive and communicate about statistical models in NLP.
They offer a sweet spot between tractability and expressivity, and as such, their use is not
limited to just syntactic parsing. Many models in NLP can be captured in terms of PCFGs.
In Section 8.1 we cover the use of hidden Markov models, a fundamental sequence label-
ing model that is used in NLP and outside of it. In Section 8.2 we provide a general overview of
PCFGs and set up notation for the rest of this chapter. In Section 8.3 we begin the discussion
of the use of PCFGs with the Bayesian approach. We then move to nonparametric modeling
of grammars, covering adaptor grammars in Section 8.4, and the hierarchical Dirichlet process
PCFGs in Section 8.5. We then discuss dependency grammars in Section 8.6, synchronous
grammars in Section 8.7 and multilingual learning in Section 8.8, and conclude with some sug-
gestions for further reading in Section 8.9.
• A special state symbol $\Diamond \in N$ called “the stop state” or “the sink state.”
• $\theta$ is a vector of parameters such that it defines for every $s, s' \in N$ and every $o \in T$ the following non-negative parameters:
  – Initial state probabilities: $\pi_s$. It holds that $\sum_{s \in N} \pi_s = 1$.
  – Emission probabilities: $\theta_{o \mid s}$ for all $s \in N \setminus \{\Diamond\}$ and $o \in T$. It holds that $\sum_{o \in T} \theta_{o \mid s} = 1$.
  – Transition probabilities: $\theta_{s \mid s'}$ for all $s' \in N \setminus \{\Diamond\}$ and $s \in N$. It holds that $\sum_{s \in N} \theta_{s \mid s'} = 1$.
Hidden Markov models define a probability distribution over pairs $(x, z)$ such that $x = x_1 \cdots x_m$ is a string over $T$, and $z = z_1 \cdots z_m$ is a sequence of states from $N$ such that $z_i \neq \Diamond$. This distribution is defined as follows:
Figure 8.1: A hidden Markov model depicted graphically as a chain structure. The shaded nodes correspond to observations and the unshaded nodes correspond to the latent states. The sequence is of length $m$, with latent nodes and observations indexed 3 through $m - 2$ represented by the dashed line.
$$p(x, z \mid \theta) = \pi_{z_1}\, \theta_{x_1 \mid z_1} \left(\prod_{i=2}^{m} \theta_{z_i \mid z_{i-1}}\, \theta_{x_i \mid z_i}\right) \theta_{\Diamond \mid z_m}.$$
HMMs have basic inference algorithms called the forward and backward algorithms.
These are dynamic programming algorithms that can be used to compute feature expectations
given an observation sequence. For example, they can compute the expected number of times that a certain emission $\theta_{o \mid s}$ fires in the sequence, or alternatively, this pair of algorithms can compute the expected number of times a certain transition $\theta_{s \mid s'}$ fires. For a complete description of
these algorithms, see Rabiner (1989). These algorithms are an analog to the inside and outside
algorithms for PCFGs (see Section 8.2.2).
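As an illustration of the forward computation mentioned above, the following is a minimal sketch (assuming numpy) of the forward algorithm under the parameterization used here: initial probabilities, emission and transition parameters, and a stop probability per state. The array names and the toy parameter values are illustrative assumptions.

```python
# A minimal sketch of the forward algorithm for an HMM with initial
# probabilities pi, emissions emit[s, o], transitions trans[s, s'], and a
# per-state stop probability handled by a final vector.
import numpy as np

def forward(x, pi, emit, trans, stop):
    # alpha[i, s] = p(x_1 .. x_i, z_i = s)
    alpha = np.zeros((len(x), len(pi)))
    alpha[0] = pi * emit[:, x[0]]
    for i in range(1, len(x)):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, x[i]]
    return float(alpha[-1] @ stop)            # p(x) = sum_z p(x, z)

pi = np.array([0.6, 0.4])
emit = np.array([[0.7, 0.3],                  # p(o | s) for two states, two symbols
                 [0.2, 0.8]])
trans = np.array([[0.8, 0.1],                 # p(s' | s), excluding the stop state
                  [0.3, 0.6]])
stop = np.array([0.1, 0.1])                   # p(stop | s); rows of trans + stop sum to 1
print(forward([0, 1, 1, 0], pi, emit, trans, stop))
```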
A graphical depiction of a hidden Markov model is given in Figure 8.1. The chain structure
graphically denotes the independence assumptions in the HMM. Given a fixed set of parameters for the HMM and a state $z_i$, the observation $x_i$ is conditionally independent of the rest of the nodes in the chain. In addition, given a fixed set of parameters for the HMM and a state $z_{i-1}$, the state $z_i$ is conditionally independent of all previous states $z_j$, $j < i - 1$.
Hidden Markov models can be defined with higher order—in which case the probability
distribution over a given state depends on more states than just the previous one. A trigram
HMM, for example, has the probability over a given state depend on the two previous states.
The algorithms for inference with trigram HMMs (or higher order HMMs) are similar to the
inference algorithms that are used with vanilla bigram HMMs.
Constants: $\ell$, $s_0$, $s_1$.
Latent variables: $\beta$, $\pi_0$, $\pi_i$, $\theta_i$ for $i \in \{1, \ldots\}$, $Z_j$ for $j \in \{1, \ldots, \ell\}$.
Observed variables: $X_j$ for $j \in \{1, \ldots, \ell\}$.
• Generate $x_j \sim F(\cdot \mid \theta_{z_j})$.
Generative Story 8.1: The generative story of the infinite hidden Markov model.
held-out data validation, one can increasingly add more states, each time performing inference
on the data, and checking the behavior of the log-likelihood function on the held-out data.
Another way to overcome the requirement to predefine the number of latent states in an
HMM is the use of nonparametric modeling, and more specifically, the hierarchical Dirichlet
process (Section 7.3). This allows the modeler to define an infinite state space (with a countable
number of latent states). During inference with sequence observations, the underlying number
of states that is used for explaining the data will grow with the amount of data available.
Hierarchical Dirichlet process hidden Markov models (HDP-HMM) are a rather intu-
itive extension, which combines HMMs with the HDP. Here we provide a specification of the
HDP-HMM in the general form that Teh (2006b) gave, but also see the infinite HMM model
by Beal et al. (2002).
The first model component that Teh assumes is a parametric family $F(\cdot \mid \theta)$ (for $\theta \in \Theta$) for generating observations, and a base distribution $G_0$ used as a prior over the space of parameters $\Theta$. For example, $\theta$ could be a multinomial distribution over a set of observation symbols $T$, and $G_0$ could be a Dirichlet distribution over the $(|T| - 1)$th probability simplex. In order to generate a sequence of observations of length $\ell$, we use Generative Story 8.1.
The generative process assumes a discrete state space, and works by first generating a base
distribution ˇ , which is an infinite multinomial distribution over the state space. It then draws
another infinite multinomial distribution for each state. That infinite multinomial corresponds
to a transition distribution from the state it is indexed by to other states (so that each event in
the infinite multinomial is also mapped to one of the states). Then the emission parameters are
drawn for each state using G0 , which is a base distribution over parameter space ‚. In order to
actually generate the sequences, the generative process proceeds by following the usual Marko-
vian process while using the transition and emission distributions. Inference with this infinite
HMM model is akin to inference with the Hierarchical Dirichlet process (Section 7.3). For the
full details, see Teh (2006b). See also Van Gael et al. (2008).
Figure 8.2: An example of a phrase-structure tree in English inspired by the Penn treebank (Marcus et al., 1993).
• $N$ is a finite set of nonterminal symbols, which label nodes in the phrase-structure tree. We require that $T \cap N = \emptyset$.
• $R$ is a finite set of production rules. Each element $r \in R$ has the form $a \rightarrow \alpha$, with $a \in N$ and $\alpha \in (T \cup N)^*$. We denote by $R_a$ the set $\{r \in R \mid r = a \rightarrow \alpha\}$, i.e., the rules associated with nonterminal $a \in N$ on the left-hand side of the rule.
• S is a designated start symbol which always appears at the top of the phrase-structure
trees.
In their most general form, CFGs can also have rules of the form $a \rightarrow \varepsilon$, where $\varepsilon$ is the “empty word.” Of special interest in natural language processing are CFGs which are in Chomsky normal form. With Chomsky normal form grammars, only productions of the form $a \rightarrow t$ (for $a \in N$ and $t \in T$) or $a \rightarrow b\ c$ (for $a, b, c \in N$) are allowed in the grammar. The original definition of Chomsky normal form also permits a rule $S \rightarrow \varepsilon$, but in most uses of CFGs in Bayesian NLP, this rule is not added to the grammar. For simplification, we will not introduce $\varepsilon$ rules to the grammars we discuss. (Still, $\varepsilon$ rules are useful for modeling specific components of linguistic theories, such as empty categories or “gaps.”)
Chomsky normal form is useful in NLP because it usually has simple algorithms for basic
inference, such as the CKY algorithm (Cocke and Schwartz, 1970, Kasami, 1965, Younger,
1967). The simplicity of CNF does not come at the expense of its expressive power—it can be
shown that any context-free grammar can be reduced to a CNF form that generates equivalent
grammar derivations and the same string language. Therefore, our focus is on CFGs in CNF
form.2
A probabilistic context-free grammar attaches a set of parameters $\theta$ to a CFG $G$ (these are also called rule probabilities). Here, $\theta$ is a set of $|N|$ vectors. Each $\theta_a$ for $a \in N$ is a vector of length $|R_a|$ in the probability simplex. Each coordinate in $\theta_a$ corresponds to a parameter $\theta_{a \rightarrow b\, c}$ or a parameter $\theta_{a \rightarrow t}$. With probabilistic context-free grammars, the following needs to be satisfied:
$$\theta_{a \rightarrow b\, c} \geq 0 \qquad \forall\, a \rightarrow b\ c \in R \qquad (8.1)$$
$$\theta_{a \rightarrow t} \geq 0 \qquad \forall\, a \rightarrow t \in R \qquad (8.2)$$
$$\sum_{b,c\,:\, a \rightarrow b\, c \in R} \theta_{a \rightarrow b\, c} + \sum_{t\,:\, a \rightarrow t \in R} \theta_{a \rightarrow t} = 1 \qquad \forall\, a \in N. \qquad (8.3)$$
This means that each nonterminal is associated with a multinomial distribution over the possible rules for that nonterminal. The parameters $\theta$ define a probability distribution $p(Z \mid \theta)$ over phrase-structure trees. Assume $z$ is composed of the sequence of rules $r_1, \ldots, r_m$, where $r_i = a_i \rightarrow b_i\ c_i$ or $r_i = a_i \rightarrow x_i$. The distribution $p(Z \mid \theta)$ is defined as:
2 Note that it is not always possible to convert a PCFG to a PCFG in CNF form while retaining the same distribution over derivations and strings, and approximations could be required. See also Abney et al. (1999).
$$p\bigl(Z = (r_1, \ldots, r_m) \mid \theta\bigr) = \prod_{i=1}^{m} \theta_{r_i}. \qquad (8.4)$$
Not every assignment of rule probabilities that satisfies the above constraints (Equations 8.1–8.3) yields a valid probability distribution over the space of trees. Certain rule probabilities lead to PCFGs that are “inconsistent”—this means that these PCFGs allocate non-zero probability to trees of infinite length. For example, for the grammar with rules $S \rightarrow S\ S$ (with probability 0.8) and $S \rightarrow \textit{buffalo}$ (with probability 0.2), there is a non-zero probability to generate trees with an infinite yield. For a comprehensive discussion of this issue, see Chi (1999).
In most cases in the Bayesian NLP literature, the underlying symbolic grammar in a
PCFG is assumed to be known. It can be a hand-crafted grammar that is compatible with a
specific NLP problem, or it can be a grammar that is learned by reading the rules that appear in
the parse trees in a treebank.
With PCFGs, the observations are typically the yield of the sentence appearing in a derivation $z$. This means that PCFGs define an additional random variable, $X$, which ranges over strings over $T$. It holds that $X = \mathrm{yield}(Z)$, where $\mathrm{yield}(z)$ is a deterministic function that returns the string in the yield of $z$. For example, letting the derivation in Figure 8.2 be $z$, it holds that $\mathrm{yield}(z)$ is the string “The governor could n’t make it, so the lieutenant welcomed the guests.”
The distribution $p(Z \mid x, \theta)$ is the conditional distribution over all possible derivation trees that have the yield $x$. Since $X$ is a deterministic function of $Z$, Equation 8.4 defines the distribution $p(X, Z \mid \theta)$ as follows:
$$p(x, z \mid \theta) = \begin{cases} p(z \mid \theta) & \mathrm{yield}(z) = x \\ 0 & \mathrm{yield}(z) \neq x. \end{cases}$$
In this chapter, the words in a string $x$ will often be denoted as $x_1 \cdots x_m$, where $m$ is the length of $x$, and each $x_i$ is a symbol from the terminal symbols $T$.
Another important class of models related to PCFGs is that of weighted context-free grammars. With weighted CFGs, $\theta_{a \rightarrow b\, c}$ and $\theta_{a \rightarrow t}$ have no sum-to-1 constraint (Equation 8.3). They can be of arbitrary non-negative weight. Weighted CFGs again induce a distribution $p(Z \mid \theta)$, by defining:
$$p(z \mid \theta) = \frac{\prod_{i=1}^{m} \theta_{r_i}}{A(\theta)}, \qquad (8.5)$$
$$A(\theta) = \sum_{z} \prod_{i=1}^{m} \theta_{r_i}.$$
For consistent PCFGs, $A(\theta) = 1$. For any assignment of rule probabilities (i.e., an assignment of weights that satisfies Equations 8.1–8.3), it holds that $A(\theta) \leq 1$. For weighted CFGs, it is not always the case that $A(\theta)$ is a finite real number. The function $A(\theta)$ can also diverge to infinity for certain grammars with certain weight settings, since the sum is potentially over an infinite number of derivations with different yields.
With PCFGs, $K$ equals the number of nonterminals in the grammar, and $N_k$ is the size of $R_a$ for the nonterminal $a$ associated with index $k \in \{1, \ldots, K\}$. We denote by $\theta_{k,k'}$ the probability of event $k'$ in the $k$th multinomial.
With this kind of abstract model, each multinomial event corresponds to some piece of structure. With PCFGs, these pieces of structure are production rules. Let $f_{k,k'}(x, z)$ be the number of times that production rule $k'$ for nonterminal $k \in \{1, \ldots, K\}$ fires in the pair of string and phrase-structure tree $(x, z)$. The probabilistic model defined over pairs of strings and phrase-structure trees is:
$$p(x, z \mid \theta) = \prod_{k=1}^{K} \prod_{k'=1}^{N_k} \theta_{k,k'}^{f_{k,k'}(x, z)}. \qquad (8.6)$$
If we consider $\boldsymbol{x} = x^{(1)}, \ldots, x^{(n)}$ and $\boldsymbol{z} = z^{(1)}, \ldots, z^{(n)}$ as being generated from the likelihood in Equation 8.6, independently (given the parameters), then it can be shown that the likelihood of these data is:
$$p(\boldsymbol{x}, \boldsymbol{z} \mid \theta) = \prod_{k=1}^{K} \prod_{k'=1}^{N_k} \theta_{k,k'}^{\sum_{i=1}^{n} f_{k,k'}(x^{(i)}, z^{(i)})}.$$
1: if $j = j'$ then
2:   $z \leftarrow$ a tree with root $a$ and the word $x_j$ below $a$
3:   return $z$
4: end if
5: for all rules $a \rightarrow b\ c \in R$, ranging over $b, c \in N$ do
6:   for $q \leftarrow j$ to $j' - 1$ do
7:     Let $\mathrm{sampleMult}(a \rightarrow b\ c, q)$ be $\theta_{a \rightarrow b\, c}\, \mathrm{in}(b, j, q)\, \mathrm{in}(c, q + 1, j')$
8:   end for
9: end for
10: Normalize $\mathrm{sampleMult}$: Let $\mathrm{sampleMult}(a \rightarrow b\ c, q)$ be $\dfrac{\mathrm{sampleMult}(a \rightarrow b\ c, q)}{\sum_{a \rightarrow b\, c,\, q} \mathrm{sampleMult}(a \rightarrow b\ c, q)}$
11: Sample from the multinomial $\mathrm{sampleMult}$ an event $(a \rightarrow b\ c, q)$.
12: $z_{\mathrm{left}} \leftarrow$ SamplePCFG($G$, $\theta$, $x$, $b$, $(j, q)$, in)
13: $z_{\mathrm{right}} \leftarrow$ SamplePCFG($G$, $\theta$, $x$, $c$, $(q + 1, j')$, in)
14: $z \leftarrow$ the tree rooted at $a$ with children $z_{\mathrm{left}}$ and $z_{\mathrm{right}}$
15: return $z$
Algorithm 8.1: A recursive algorithm SamplePCFG for sampling from a probabilistic context-free grammar (in Chomsky normal form) conditioned on a fixed string and fixed parameters.
$$\mathrm{in}(S, 1, m) = \sum_{z\,:\, \mathrm{yield}(z) = x_1 \cdots x_m} p(z \mid \theta).$$
Here, $p(Z \mid \theta)$ is a PCFG distribution parametrized by weights $\theta$ with start symbol $S$. The use of the notation $\mathrm{in}(S, 1, m)$, with the arguments 1 and $m$, is deliberate: this quantity can be calculated for other nonterminals and other spans in the sentence. Generally, it holds that:
$$\mathrm{in}(a, i, j) = \sum_{z \in \mathcal{A}(a, i, j)} p(z \mid \theta),$$
where $\mathcal{A}(a, i, j) = \{z \mid \mathrm{yield}(z) = x_i \cdots x_j,\ h(z) = a\}$. This means that the inside probability of a nonterminal $a$ for span $(i, j)$ is the total probability of generating the string $x_i \cdots x_j$, starting from nonterminal $a$. The function $h(z)$ returns the root of the derivation $z$. Note that in this formulation we allow the root to be an arbitrary nonterminal, not just $S$. The probability $p(z \mid \theta)$ is defined as usual, as the product of all rules in the derivation.
The inside quantities can be computed through a recursive procedure, using quantities of the same form. The following is the recursive definition:
$$\mathrm{in}(a, i, i) = \theta_{a \rightarrow x_i} \qquad a \in N,\ a \rightarrow x_i \in R,\ 1 \leq i \leq m$$
$$\mathrm{in}(a, i, j) = \sum_{k=i}^{j-1} \sum_{a \rightarrow b\, c \in R} \theta_{a \rightarrow b\, c}\, \mathrm{in}(b, i, k)\, \mathrm{in}(c, k + 1, j) \qquad a \in N,\ 1 \leq i < j \leq m.$$
The intermediate quantity $\mathrm{in}(a, i, j)$ is to be interpreted as the total weight of all trees in which the nonterminal $a$ spans the words $x_i \cdots x_j$ at positions $i$ through $j$. There are various execution models for computing the recursive equations above. One simple way is to use bottom-up dynamic programming, in which the chart elements $\mathrm{in}(a, i, j)$ are computed starting with those that have a small width of $j - i + 1$ and ending with the final element $\mathrm{in}(S, 1, n)$, which has a width of $n$. One can also use an agenda algorithm, such as the one that was developed by Eisner et al. (2005)—see also Smith (2011).
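The bottom-up computation described above can be written compactly. The following sketch (using a small dictionary-based grammar representation that is not from the book) computes the inside chart for a PCFG in Chomsky normal form by increasing span width, and returns the sentence probability for a toy grammar.

```python
# A sketch of the inside algorithm for a CNF PCFG, computed with bottom-up
# dynamic programming over span widths (0-indexed spans).
from collections import defaultdict

def inside(words, lexical, binary, start="S"):
    # lexical: {(a, word): prob}, binary: {(a, b, c): prob}
    n = len(words)
    chart = defaultdict(float)                     # chart[(a, i, j)] = in(a, i, j)
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                chart[(a, i, i)] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width - 1
            for (a, b, c), p in binary.items():
                for k in range(i, j):
                    chart[(a, i, j)] += p * chart[(b, i, k)] * chart[(c, k + 1, j)]
    return chart[(start, 0, n - 1)], chart

# toy grammar: S -> NP VP, VP -> V NP, NP -> "dogs" | "cats", V -> "chase"
lexical = {("NP", "dogs"): 0.5, ("NP", "cats"): 0.5, ("V", "chase"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
prob, chart = inside("dogs chase cats".split(), lexical, binary)
print(prob)                                        # corresponds to in(S, 1, m) = 0.25
```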
Another important quantity of interest is the outside probability, which is computed using the outside algorithm. The outside quantity calculates the probability of generating an “outer” part of the derivation. More formally, we define $\mathrm{out}(a, i, j)$ for any indices $i < j$ in a given string and a nonterminal $a$ as:
$$\mathrm{out}(a, i, j) = \sum_{z \in \mathcal{B}(a, i, j)} p(z \mid \theta),$$
$$\mathrm{out}(S, 1, n) = 1$$
$$\mathrm{out}(a, 1, n) = 0 \qquad a \in N,\ a \neq S$$
$$\mathrm{out}(a, i, j) = \sum_{k=1}^{i-1} \sum_{b \rightarrow c\, a \in R} \theta_{b \rightarrow c\, a}\, \mathrm{in}(c, k, i - 1)\, \mathrm{out}(b, k, j) + \sum_{k=j+1}^{n} \sum_{b \rightarrow a\, c \in R} \theta_{b \rightarrow a\, c}\, \mathrm{in}(c, j + 1, k)\, \mathrm{out}(b, i, k) \qquad a \in N,\ 1 \leq i < j \leq n.$$
The most important use of the inside and outside algorithms is to compute feature expectations of nonterminals spanning certain positions for a given sentence. More formally, if the following indicator is defined:
$$\mathbb{I}(\langle a, i, j \rangle \in z) = \begin{cases} 1 & \text{if } a \text{ spans words } i \text{ through } j \text{ in } z \\ 0 & \text{otherwise,} \end{cases}$$
then the inside and outside probabilities assist in computing:
$$E_\theta\bigl[\mathbb{I}(\langle a, i, j \rangle \in Z) \mid x_1 \cdots x_m\bigr] = \frac{\sum_{z\,:\, \mathrm{yield}(z) = x_1 \cdots x_m} p(z \mid \theta)\, \mathbb{I}(\langle a, i, j \rangle \in z)}{p(x_1 \cdots x_m \mid \theta)},$$
$$E_\theta\bigl[\mathbb{I}(\langle a, i, j \rangle \in Z) \mid x_1 \cdots x_m\bigr] = \frac{\mathrm{in}(a, i, j)\, \mathrm{out}(a, i, j)}{\mathrm{in}(S, 1, n)}.$$
Similarly, expectations of the form $E_\theta[\mathbb{I}(\langle a \rightarrow b\ c, i, k, j \rangle \in Z) \mid x_1 \cdots x_m]$ can also be computed. Here, $\mathbb{I}(\langle a \rightarrow b\ c, i, k, j \rangle \in z)$ is 1 if the rule $a \rightarrow b\ c$ is used in $z$ such that $a$ spans words $i$ to $j$ and, below it, $b$ spans words $i$ to $k$ and $c$ spans words $k + 1$ to $j$. It can be shown that:
$$E_\theta\bigl[\mathbb{I}(\langle a \rightarrow b\ c, i, k, j \rangle \in Z) \mid x_1 \cdots x_m\bigr] = \frac{\theta_{a \rightarrow b\, c}\, \mathrm{in}(b, i, k)\, \mathrm{in}(c, k + 1, j)\, \mathrm{out}(a, i, j)}{\mathrm{in}(S, 1, n)}, \qquad (8.7)$$
$$E_\theta\bigl[\mathbb{I}(\langle a \rightarrow x_i, i \rangle \in Z) \mid x_1 \cdots x_m\bigr] = \frac{\theta_{a \rightarrow x_i}\, \mathrm{out}(a, i, i)}{\mathrm{in}(S, 1, n)}. \qquad (8.8)$$
The inside probabilities have an important role in PCFG sampling algorithms as well. The inside probabilities are used in a sampling algorithm that samples a single tree $z$ from the distribution $p(Z \mid x_1 \cdots x_m, \theta)$. The sampling algorithm is given in Algorithm 8.1. The algorithm assumes that the computation (for a fixed $\theta$) of the inside chart of the relevant sentence is available. It then proceeds by recursively sampling a left child and a right child for a given node, based on the inside chart.
The above pair of inside and outside algorithms is used with PCFGs in Chomsky normal form. There are generalizations of these algorithms that work for arbitrary grammars (without $\varepsilon$ rules). For example, the Earley algorithm can be used to compute feature expectations of the form above (Earley, 1970).
• For each pair of states $s, s' \in N \setminus \{\Diamond\}$ there is a rule $(s, 0) \rightarrow (s, 1)\ (s', 0)$ with probability $\theta_{s' \mid s}$.
• For each state $s \in N \setminus \{\Diamond\}$ there is a rule $(s, 0) \rightarrow (s, 1)$ with probability $\theta_{\Diamond \mid s}$.
$$p(\theta \mid \alpha) \propto \prod_{a \in N} \left(\prod_{a \rightarrow b\, c \in R(a)} \theta_{a \rightarrow b\, c}^{(\alpha_{a \rightarrow b\, c} - 1)}\right) \left(\prod_{a \rightarrow t \in R(a)} \theta_{a \rightarrow t}^{(\alpha_{a \rightarrow t} - 1)}\right). \qquad (8.9)$$
Here, $\alpha$ is a vector of hyperparameters that decomposes the same way that $\theta$ does. Each $\alpha_{a \rightarrow b\, c}$ is non-negative. See Chapter 3 for a definition of the missing normalization constant in Equation 8.9.
There is also room for exploring different granularities for the hyperparameters $\alpha$. For example, instead of having a hyperparameter associated with each rule in the grammar, there can be a symmetric Dirichlet defined per nonterminal $a \in N$ by using a single hyperparameter $\alpha_a$ for each $a \in N$, or perhaps even a single hyperparameter $\alpha$ for all rules in the grammar.
The distribution in Equation 8.9 is conjugate to the distribution defined by a PCFG in Equation 8.4. Let us assume the complete data scenario, in which derivations from the grammar, $z^{(1)}, \ldots, z^{(n)}$, are observed. Denote the yields of these derivations by $x^{(1)}, \ldots, x^{(n)}$. The posterior,
$$p\left(\theta \mid \alpha, z^{(1)}, \ldots, z^{(n)}, x^{(1)}, \ldots, x^{(n)}\right),$$
is a product-of-Dirichlet distribution with hyperparameters $\alpha + \sum_{j=1}^{n} f\left(x^{(j)}, z^{(j)}\right)$, where $f$ is a function that returns a vector indexed by grammar rules, such that $f_{a \rightarrow b\, c}(x, z)$ counts the number of times the rule $a \rightarrow b\ c$ appears in $(x, z)$ and $f_{a \rightarrow t}(x, z)$ counts the number of times the rule $a \rightarrow t$ appears in $(x, z)$.
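In the complete-data case this conjugate update amounts to adding observed rule counts to the prior hyperparameters, as in the following small sketch (the rules and counts are hypothetical, not taken from the book).

```python
# A small sketch of the conjugate update for a PCFG with a product-of-Dirichlets
# prior: posterior hyperparameters = prior hyperparameters + observed rule counts.
from collections import Counter

alpha = {("S", ("NP", "VP")): 1.0, ("NP", ("dogs",)): 1.0, ("NP", ("cats",)): 1.0}
observed_rules = [("S", ("NP", "VP")), ("NP", ("dogs",)), ("NP", ("dogs",))]

counts = Counter(observed_rules)
posterior = {rule: alpha[rule] + counts[rule] for rule in alpha}
print(posterior)   # hyperparameters of the product-of-Dirichlet posterior
```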
As mentioned earlier, not all assignments of rule probabilities to a PCFG (or more gen-
erally, a multinomial generative distribution) lead to a consistent PCFG. This means that the
Dirichlet prior in Equation 8.9 potentially assigns a non-zero probability mass to inconsistent
PCFGs. This issue is largely ignored in the Bayesian NLP literature, perhaps because it makes
little empirical difference.
$$p\left(z^{(i)} \mid x^{(1)}, \ldots, x^{(n)}, z^{(-i)}, \alpha\right) = \frac{p\left(x^{(i)} \mid z^{(i)}\right) p\left(z^{(i)} \mid z^{(-i)}, \alpha\right)}{p\left(x^{(i)} \mid z^{(-i)}, \alpha\right)}. \qquad (8.10)$$
The distribution $p\left(X^{(i)} \mid Z^{(i)}\right)$ is just a deterministic distribution that places its whole probability mass on the string $\mathrm{yield}\left(Z^{(i)}\right)$. The quantity $p\left(z^{(i)} \mid z^{(-i)}, \alpha\right)$ can also be computed by relying on the conjugacy of the prior to the PCFG likelihood (see the exercises). However, there is no known efficient way to compute $p\left(x^{(i)} \mid z^{(-i)}, \alpha\right)$. This means that the conditional distribution can only be computed up to a normalization constant, making it a perfect candidate for MCMC sampling.
Therefore, Johnson et al. approach the problem of sampling from the conditional distribu-
tion in Equation 8.10 by sampling from a proposal distribution and then making a Metropolis-
Hastings correction step. Their algorithm is given in Algorithm 8.2.
There is no need to re-calculate $\theta'_{a \rightarrow \beta}$ from scratch after each tree draw. One can keep one global count, together with the current state of the sampler (which consists of a tree per sentence in the corpus), and then just subtract the counts of the current tree and add back the counts of the newly drawn tree.
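A sketch of this bookkeeping, reusing the illustrative `rule_counts` helper from the sketch above: the sampler keeps one global count table and adjusts it by the counts of the tree being replaced and the newly drawn tree.

```python
def swap_tree_counts(global_counts, old_tree, new_tree):
    """Update global rule counts in place when the sampler replaces one tree."""
    global_counts.subtract(rule_counts(old_tree))   # remove the current tree's counts
    global_counts.update(rule_counts(new_tree))     # add the newly drawn tree's counts
    return global_counts
```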
Inferring sparse grammars Johnson et al. report that Bayesian inference with PCFGs and
a Dirichlet prior does not give a radically different result than a plain EM algorithm without
Bayesian inference. They tested their Bayesian inference with a simple grammar for analyzing
the morphology of one of the Bantu languages, Sesotho.
They discovered that their MCMC inference was not very sensitive to the hyperparameters of the Dirichlet distribution (in terms of the F$_1$-measure of morphological segmentation and exact segmentation), except for when $\alpha < 0.01$, in which case performance was low. On the other hand, small $\alpha$ values (but larger than 0.01) lead to a relatively sparse posterior over $\theta$. Therefore, small values of $\alpha$ can be used to estimate a sparse $\theta$, leading to an interpretable model, which has a small number of grammar rules that are actually active. The performance of their model sharply peaks as $\alpha$ is decreased to a value around 0.01. This peak is followed by a slow decrease in performance, as $\alpha$ is decreased further to significantly smaller values.
Input: A PCFG, a vector $\alpha$ ranging over rules in the PCFG, a set of strings from the language of the grammar $x^{(1)}, \ldots, x^{(n)}$.
Output: $z = (z^{(1)}, \ldots, z^{(n)})$, trees from the posterior defined by the grammar with a Dirichlet prior with hyperparameters $\alpha$.
Algorithm 8.2: An algorithm for sampling from the posterior of a PCFG with Dirichlet prior: $p(z \mid x, \alpha) = \int_{\theta} p(z, \theta \mid x, \alpha)\, d\theta$.
$$\forall a \in N: \quad G_a \ \text{such that} \ G_a(z) = \theta_{a \rightarrow \beta} \prod_{i=1}^{m} H_{h(z'_i)}(z'_i) \quad (8.11)$$
where $h(z) = a$, $r(z) = a \rightarrow \beta$ and $\mathrm{Subtrees}(z) = (z'_1, \ldots, z'_m)$,
$$\forall a \in N: \quad H_a \ \text{such that} \ H_a \sim C_a(G_a). \quad (8.12)$$
Here, $C_a$ is an adaptor, which defines a distribution over a set of distributions. Each distribution in this set is defined over phrase-structure trees. The distribution $G_a$ serves as the "base distribution" for the adaptor. As such, each distribution in the set of distributions mentioned above, on average, bears some similarity to $G_a$. In the most general form of adaptor grammars, the actual adaptor is left unspecified. This means that any distribution over phrase-structure tree distributions (that is based on $G_a$) can be used there. If $C_a$ places all of its probability mass on $G_a$
(this means that Equation 8.12 is replaced with $H_a = G_a$), then what remains from the above
statistical relationships is the definition of a regular PCFG.
The final distribution over phrase-structure trees from which we draw full trees is $H_S$. The key idea with adaptor grammars is to choose $C_a$ so that we break the independence assumptions that PCFGs have, which can be too strong for modeling language. Using a Pitman–Yor process for the adaptors $C_a$ is one example of breaking these independence assumptions; this use of the Pitman–Yor process is described in Section 8.4.1.
An adaptor grammar makes a distinction between the set of "adapted nonterminals" (denoted $A$) and the set of non-adapted nonterminals ($N \setminus A$). For the non-adapted nonterminals, $C_a$ is simply the probabilistic identity mapping, which maps $G_a$ to a distribution that places all of its mass on $G_a$.
• Generate PCFG parameters $\theta \sim \mathrm{Dirichlet}(\alpha)$ for the underlying CFG (see Equation 8.9).
• Generate $H_S$ from the PYAG Equations 8.11–8.12 with $C_a$ being a Pitman-Yor process with strength parameter $s_a$ and discount parameter $d_a$ for $a \in A$ (for $a \in N \setminus A$, $C_a$ is the probabilistic identity mapping).
• For $i \in \{1, \ldots, n\}$, generate $z^{(i)} \sim H_S$.
• For $i \in \{1, \ldots, n\}$, set $x^{(i)} = \mathrm{yield}(z^{(i)})$.
Generative Story 8.2: The generative story of the Pitman-Yor adaptor grammar.
Trees from Equation 8.13 are generated top-down beginning with the start symbol $S$. Any non-adapted nonterminal $a \in N \setminus A$ is expanded by drawing a rule from $R_a$. There are two ways to expand $a \in A$.
The counts $n_z$, $n_a$ and $k_a$ are all functions of the previously generated phrase-structure trees $z^{(1)}, \ldots, z^{(n)}$.
The state of an adaptor grammar, i.e., an assignment to all latent structures, can be described using a set of analyses. Assume that we use an adaptor grammar to draw $x^{(1)}, \ldots, x^{(n)}$ and their corresponding phrase-structure trees $z^{(1)}, \ldots, z^{(n)}$. In addition, denote by $z(a)$ the list of subtrees that were generated by the adaptor grammar and are headed by nonterminal $a$. This means that $z(a) = \left((z(a))^{(1)}, \ldots, (z(a))^{(k_a)}\right)$ with:
$$\sum_{i=1}^{k_a} n_{(z(a))^{(i)}} = n_a.$$
$$p(u \mid s, d, \alpha) = \prod_{a \in N} \mathrm{PY}\left(m(a) \mid s_a, d_a\right)\, \frac{B\left(\alpha_a + f(z(a))\right)}{B\left(\alpha_a\right)}, \quad (8.14)$$
with $f(z(a))$ being a vector indexed by the rules in the grammar that have $a$ on the left-hand side, and $f_{a \rightarrow \beta}(z(a))$ denoting the total count of $a \rightarrow \beta$ in all subtrees in the list $z(a)$. In addition, $m(a)$ is a vector of the same length as $z(a)$, such that $m_i(a) = n_{(z(a))^{(i)}}$. Therefore, $m(a)$ is a vector of integers, and the term $\mathrm{PY}(m(a) \mid s_a, d_a)$ is computed according to the distribution of the Pitman-Yor process, defined in Equation 7.8 and repeated here:
$$\mathrm{PY}\left(m(a) \mid s_a, d_a\right) = \frac{\prod_{k=1}^{k_a}\left[\left(d_a (k-1) + s_a\right) \prod_{j=1}^{m_k(a) - 1}\left(j - d_a\right)\right]}{\prod_{i=0}^{n_a - 1}\left(i + s_a\right)}.$$
The function $B(y)$ is defined for a vector of integers $y$ as (see also Equation 2.3):
$$B(y) = \frac{\Gamma\left(\sum_{i=1}^{|y|} y_i\right)}{\prod_{i=1}^{|y|} \Gamma(y_i)}.$$
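The two quantities above transcribe directly into Python. The following sketch works in log space for numerical stability and assumes the reconstruction above: `m_a` is the list of counts $m_i(a)$, `s_a` the strength and `d_a` the discount; the names are illustrative.

```python
import math

def log_B(y):
    """log B(y) = log Gamma(sum_i y_i) - sum_i log Gamma(y_i)."""
    return math.lgamma(sum(y)) - sum(math.lgamma(v) for v in y)

def log_PY(m_a, s_a, d_a):
    """log PY(m(a) | s_a, d_a) for a vector of integer counts m_a."""
    k_a = len(m_a)
    n_a = sum(m_a)
    log_num = sum(math.log(d_a * (k - 1) + s_a) for k in range(1, k_a + 1))
    log_num += sum(math.log(j - d_a)
                   for m_k in m_a
                   for j in range(1, m_k))           # j = 1, ..., m_k - 1
    log_den = sum(math.log(i + s_a) for i in range(0, n_a))
    return log_num - log_den
```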
$b_1 \cdots b_n$
While $\mathrm{yield}(z_{a,i})$ has nonterminals:
Choose an unexpanded nonterminal $b$ from the yield of $z_{a,i}$.
If $b \in A$, expand $b$ according to $G_b$ (defined on previous iterations of step 2).
If $b \in N \setminus A$, expand $b$ with a rule from $R_b$ according to $\mathrm{Multinomial}(\theta_b)$.
For $i \in \{1, \ldots\}$, define $G_a(z_{a,i}) = \pi_{a,i}$
Generative Story 8.3: The generative story for adaptor grammars with the stick-breaking rep-
resentation.
8.4.3 INFERENCE WITH PYAG
This section discusses the two main inference approaches used with adaptor grammars: sampling and variational inference.
MCMC Inference Consider the distribution over analyses defined in Equation 8.14. The inference usually considered with adaptor grammars is one that infers parse trees (phrase-structure trees) for a given set of strings $x = (x^{(1)}, \ldots, x^{(n)})$.
The distribution over phrase-structure trees can be derived from the distribution over analyses by marginalizing out the remaining parts of the analyses. More specifically, we are interested in $p(Z \mid x, s, d, \alpha)$. However, computing this posterior is intractable.
In order to perform inference, Johnson et al. (2007b) suggest using a component-wise Metropolis–Hastings algorithm. They first specify how to create a static PCFG, a snapshot of the adaptor grammar, based on a specific state of the adaptor grammar. This snapshot grammar includes all rules in the underlying context-free grammar and rules that rewrite a nonterminal directly to a string, corresponding to subtrees that appear in the history derivation vectors $z(a)$ for $a \in N$. All of these rules are assigned probabilities according to the following estimate:
$$\theta'_{a \rightarrow \beta} = \frac{k_a d_a + s_a}{n_a + s_a}\left(\frac{f_{a \rightarrow \beta}(z(a)) + \alpha_{a \rightarrow \beta}}{k_a + \sum_{a \rightarrow \beta \in R_a} \alpha_{a \rightarrow \beta}}\right) + \sum_{i\colon \mathrm{yield}\left((z(a))^{(i)}\right) = \beta} \frac{n_{(z(a))^{(i)}} - d_a}{n_a + s_a}. \quad (8.15)$$
The first two multiplied terms are responsible for selecting a grammar rule from the underlying context-free grammar. The summation term to the right is the MAP estimate for the rules added to the snapshot grammar, in the form of a nonterminal rewriting directly to a string.
This snapshot grammar is created and then used with a Metropolis–Hastings algorithm, such that the proposal distribution is based on the snapshot grammar with the $\theta'$ estimates from Equation 8.15 (i.e., we use the snapshot grammar to define a distribution over the analyses, conditioning on the strings in the corpus). The target distribution is the one specified in Equation 8.14. Note that the real target distribution needs a normalization constant, corresponding to the probability of the strings themselves according to the adaptor grammar, but this constant is canceled in the MH algorithm when calculating the probability of rejecting or accepting the update.
Sampling with their MH algorithm is component-wise—each analysis $u^{(i)}$ is sampled based on the snapshot grammar. At this point, the acceptance ratio is calculated, and if the MH sampler decides to accept this sample for $u^{(i)}$, the state of the sampler is updated, and the snapshot grammar is re-calculated.
Variational Inference Cohen et al. (2010) describe a variational inference algorithm based
on the stick-breaking representation for adaptor grammars described in Section 8.4.2. The main
idea in their variational inference algorithm is similar to the idea that is used for variational infer-
ence algorithms with the Dirichlet process mixture (see Section 7.2). Each nonparametric stick
for each adapted nonterminal (and the corresponding strength and discount parameters) is associated with a variational distribution, which is a truncated stick—i.e., it is a distribution that follows a finite version of the GEM distribution (see Equation 7.1).

• Draw $\beta$, an infinite column vector, from the GEM distribution with hyperparameter $\alpha$. The indices of the infinite vector $\beta$ correspond to nonterminal symbols in the grammar.

Generative Story 8.4: The generative story for the HDP-PCFG model.
One major advantage of the truncated stick-breaking variational inference algorithm for adaptor grammars over MCMC inference is that its E-step can be parallelized. On the other hand, its major disadvantage is the need to select, for each adapted nonterminal, a fixed subset of strings to which the variational distributions may assign non-zero probability. A specific subset of strings for a specific adapted nonterminal is selected from the set of strings that can be the yield of a subtree dominated by that nonterminal. Cohen et al. use heuristics in order to select such a subset for each adapted nonterminal.
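As a rough illustration of what a truncated stick looks like, the following sketch (illustrative, not Cohen et al.'s actual variational update) converts $T$ Beta-distributed stick proportions into $T$ weights that sum to one, the finite analogue of a GEM draw used for such truncated variational distributions.

```python
import numpy as np

def truncated_stick_weights(v):
    """Map stick proportions v_1, ..., v_T (each in (0, 1)) to mixture weights.
    The last proportion is forced to 1 so the T weights sum to exactly one."""
    v = np.asarray(v, dtype=float).copy()
    v[-1] = 1.0                                    # truncation: use up the remaining stick
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

# Example: proportions drawn from Beta(1, alpha) give a truncated GEM(alpha)-style draw.
rng = np.random.default_rng(0)
alpha, T = 1.0, 10
weights = truncated_stick_weights(rng.beta(1.0, alpha, size=T))
assert abs(weights.sum() - 1.0) < 1e-12
```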
Online and Hybrid Methods Zhai et al. (2014) have developed an inference algorithm for
adaptor grammars that combines MCMC inference and variational inference. The inference
algorithm in this case is an online algorithm. The training data is processed in mini-batches.
At each mini-batch processing step, the algorithm updates the posterior so that it reflects the
information in the new data. During this update, MCMC inference is used to estimate the
sufficient statistics that are needed in order to make an update to the current posterior.
The main motivation behind such an inference algorithm, and more generally online algo-
rithms, is the ability to process an ongoing stream of data, without the need to either iteratively
go through the data several times, or keep it all in memory.
The parameters above are very similar to the parameters drawn for the vanilla HDP-
PCFG, only they are indexed with nonterminals from the fixed set, or a rule from the fixed set
of rules. In addition, the authors also add parameters for unary rules.
Figure 8.3: A dependency tree with latent states. The first line denotes the latent states (here, 5 1 7 3 2 1) and the second line denotes the words generated from these latent states ("The King welcomed the guests .").
vertices and edges that denote syntactic relationships. For the use of dependency grammar and
dependency parsing in natural language processing, see Kübler et al. (2009).
$$p\left(z_{c(i)} \mid z_i\right) = \prod_{j \in c(i)} p\left(z_j \mid z_i\right). \quad (8.17)$$
The independent children model is not realistic for modeling natural language, because it makes independence assumptions that are too strong (all siblings are conditionally independent given the parent). For this reason, Finkel et al. suggest two additional models. In their second model, "Markov children," a child node is assumed to be conditionally independent of the rest of the children given the parent and its sibling. More specifically, if $c(i) = (j_1, \ldots, j_r)$, then:
$$p\left(z_{c(i)} \mid z_i\right) = p\left(z_{j_1} \mid z_i\right)\left(\prod_{k=2}^{r} p\left(z_{j_k} \mid z_{j_{k-1}}, z_i\right)\right). \quad (8.18)$$
Finkel et al. do not specify in which order they generate the children, which is necessary to know in order to complete the model. Their last model, "simultaneous children," assumes that all children are generated as one block of nodes. This means that $p(z_{c(i)} \mid z_i)$ is not decomposed.
The main idea in Finkel et al.'s model is to use nonparametric distributions for $Z_i$. The latent states are assumed to take an integer value in $\{1, 2, \ldots\}$. The prior over the latent state distributions is constructed using the hierarchical Dirichlet process (see Section 7.3).
For the independent children model, the generative process is the following. First, a basic distribution over the integers is drawn from $\mathrm{GEM}(s_0)$, where $s_0$ is a concentration parameter. Then, for each $k \in \{1, 2, \ldots\}$, a distribution $\sigma_k$ is drawn from a Dirichlet process with concentration parameter $s_1$ and this basic distribution as its base distribution. The distributions $\sigma_k$ (for $k \geq 2$) are used for the conditional distributions $p(Z_j \mid Z_i)$ in Equation 8.17—i.e., $p(z_j \mid Z_i = k) = \sigma_{k, z_j}$. (Note that $\sigma_1$ is used for the distribution $p(Z_1)$ as it appears in Equation 8.16.) In addition, for generating the observations $X$, a multinomial distribution $\tau_k$ is generated from a Dirichlet distribution. Then the observation distributions in Equation 8.16 are set such that $p(x_i \mid Z_i = k) = \tau_{k, x_i}$.
For their simultaneous children model, Finkel et al. draw a distribution over the latent states for all children from $\mathrm{DP}(s_2, G_0)$, where $G_0$ is defined to be the independent children distribution from Equation 8.17 drawn from the prior described above. This draw defines $p(Z_{c(i)} \mid Z_i)$ for each node $i$ in the dependency tree. According to Finkel et al., the use of the independent children distribution as a base distribution for the simultaneous children distribution promotes consistency—if a certain sequence of children states has high probability, then similar sequences of latent states (i.e., sequences that overlap with the high probability sequence) will also have high probability.
The prior over the latent state distributions for the Markov children model (Equation 8.18) makes similar use of the hierarchical Dirichlet process. With this model, distributions $\sigma_{k\ell}$ are generated, corresponding to distributions over latent states conditioning on a pair of latent states, a parent and a sibling, with the parent being assigned to latent state $k$ and the sibling being
assigned to latent state $\ell$. The observations are handled identically to the independent children
model.
• (First language) Draw a transition multinomial distribution over tags, $\theta_t$, for each $t \in T$, from a Dirichlet distribution with hyperparameter $\alpha_0$.
• (First language) Draw an emission multinomial distribution over the vocabulary, $\omega_t$, for each $t \in T$, from a Dirichlet distribution with hyperparameter $\alpha_1$.
• (Second language) Draw a transition multinomial distribution over tags, $\theta'_t$, for each $t \in T'$, from a Dirichlet distribution with hyperparameter $\alpha'_0$.
$$p\left(y_1, \ldots, y_N, y'_1, \ldots, y'_{N'}\right) = \prod_{i \in a_0} \theta_{y_{i-1}, y_i} \prod_{j \in a'_0} \theta'_{y'_{j-1}, y'_j} \prod_{(i,j) \in a} p\left(y_i, y'_j \mid y_{i-1}, y'_{j-1}\right).$$
• (First language) For each $i \in [N]$, emit a word $x_i$ from the multinomial $\omega_{y_i}$.
• (Second language) For each $j \in [N']$, emit a word $x'_j$ from the multinomial $\omega'_{y'_j}$.
The inference mechanism that Snyder et al. used to sample POS tags and words based on
the POS tags is reminiscent of hidden Markov models (HMMs). The model of Snyder et al.
relies on an alignment to generate pairs of tags together, but the information these tags condition
on is the same type of information bigram HMMs use—the previous tags before them in the
sequence.
Snyder et al. (2009b) further extended their model to introduce a fully multilingual model:
a model for POS tagging that models more than two languages. They do so by introducing a
new ingredient into their model—superlingual tags which are coarse tags that are presupposed
to be common to all languages, but are latent. These superlingual tags are generated from a
nonparametric model (Chapter 7).
8.10 SUMMARY
Probabilistic grammars are among the most commonly used generic model families in NLP. Since most probabilistic grammars are generative models, a large part of Bayesian NLP has focused on the development of Bayesian models and inference algorithms for these grammars. The focus of this chapter was on probabilistic context-free grammars and their use with Bayesian analysis, both parametric and nonparametric. Consequently, we covered the basics of inference with PCFGs, their use with Dirichlet priors, and nonparametric models such as adaptor grammars and the hierarchical Dirichlet process PCFG.
8.11 EXERCISES
8.1. Consider the exponential model family which is discussed in Section 3.4. Show that
the model defined by Equation 8.5 is an exponential model, and define the different
components of the exponential model based on the notation from Section 8.2.
8.2. Consider the context-free grammar with rules $S \rightarrow S\, a$, $S \rightarrow S\, b$ and $S \rightarrow c$, where $S$ is a nonterminal and $a, b, c$ are terminal symbols. Can you find a probability assignment to its rules such that the grammar is not consistent (or not "tight")? If not, show that such an assignment does not exist. (See also Section 8.3.1.)
8.3. In Equations 8.7–8.8 we show how to compute feature expectations for nonterminals spanning certain positions in a string, and similarly for a rule. These two can be thought of as features of "height" 1 and height 2, respectively. Write the equations for computing expectations for features of height 3.
8.4. Show that the prior in Equation 8.9 is conjugate to the PCFG distribution in Equa-
tion 8.5. Identify the missing normalization constant of the posterior.
8.5. Consider the context-free grammar with rules $S \rightarrow S\, a$ and $S \rightarrow a$. In Section 8.4.1 we mentioned that a PYAG could not adapt the nonterminal $S$ with this CFG as the base grammar. Can you explain why a PYAG, in the form of Generative Story 8.2, could be ill-defined if we were allowed to adapt $S$?
CHAPTER 9
Representation Learning and Neural Networks
1 The three-wave historical perspective described here is based on the description in the book by Goodfellow et al. (2016).
image classification (and other problems in computer vision), speech recognition and language
modeling. In this third wave, neural network modeling is often called “deep learning,” which
refers to the need to train a significant number of hidden layers placed between the input and
output layers of a neural network to achieve state-of-the-art performance for several problems.
One of the reasons neural networks are being revisited in their current form in machine learning is the scale of the data that is now being collected and the discovered importance of that scale for precise modeling—both in academia and in industry. Large amounts of data are
now being collected. While the performance of linear models plateaued on these data (meaning,
there were more data than needed to exploit the full power of such models), this has not been the
case for neural networks. Instead, neural networks continued to improve modeling performance
as more data arrived. This is not a surprise, as the decision surfaces and decision rules that neural
networks yield are complex, especially the deep ones that consist of multiple hidden layers and
therefore can exploit these data for better generalization.2 As mentioned above, this quality of
neural networks has improved the state of computer vision, language modeling and other areas
of machine learning.
The ability to use large-scale data would have not been possible without advancements
in high-performance computing, which also contributed to the rise of neural networks in their
current form. The use of neural networks was found to be especially suitable with Graphics
Processing Units (GPUs) as opposed to the standard Central Processing Units (CPUs) found
in computers. While GPUs were originally developed to more efficiently render graphics, they
have gradually taken on a more general form to perform matrix and vector operations more
efficiently and in a parallel manner, which is a good match for deep learning calculations and other scientific applications, especially as improvements in CPU speed have slowed compared to the past.
The Third Wave in NLP This third wave of popularity of neural networks did not overlook the
NLP community. Furthermore, the NLP community has recently rediscovered the usefulness
of learning representations for words for various NLP problems (Turian et al., 2010). Within
this line of research, words are no longer represented as symbolic units, but instead projected to
a Euclidean space, usually based on co-occurrence statistics with other words. These projections
are often referred to as "word embeddings." This discovery has marked a significant shift in the NLP community, such that most recent NLP work represents words as vectors (see also Section 9.2). This shift also comes hand-in-hand with the use of neural networks, which are
a very good fit for working with continuous representations (such as word embeddings). The
2 There are a variety of ways to discuss the complexity of a decision rule, such as the one entailed by a neural network.
For example, if a “nonlinear” neural network performs a binary classification over a two-dimensional space, the curve that
separates the positive examples from negative examples may have a complex pattern. This is in opposition to a linear classifier,
where there would be a single straight line in the plane separating the two types of examples. One of the theoretical properties
of neural networks that indicate the complexity of their decision surfaces is that a neural network with one hidden layer can
approximate a large set of functions, as described by Funahashi (1989), Cybenko (1989) and Hornik et al. (1989) (this property
is often referred to as “universal approximation”).
concept of word embeddings is rooted in older work that grounds text in vector space models
(for example, Hofmann 1999a, Mitchell and Lapata 2008, Turney and Pantel 2010).
NLP work with neural networks had already been seeded in preliminary stages of the third wave. Bengio et al. (2003), for example, created a neural language model, and Henderson and Lane (1998) carried out earlier work that featured the use of neural networks for syntactic parsing (see also Henderson 2003 and Titov and Henderson 2010). Collobert et al. (2011) pushed neural networks more prominently into the awareness of the NLP community when they showed how a variety of NLP problems (including part-of-speech tagging, chunking, named entity recognition and semantic role labeling) can all be framed as classification problems with neural networks. The researchers argued that their framework did not require much feature engineering, in contrast to the techniques available at that time. Arguably, this is indeed
one of the great advantages of neural networks and representation-learning algorithms.
In the early days of neural networks in modern NLP, there was extensive use of off-the-shelf models such as seq2seq (Section 9.4.3) and pre-trained word embeddings (Section 9.2) that did not make much use of intermediate structures. Even when the prediction was
a complex structure for a specific NLP problem, seq2seq models were used with a process of
“stringification”—the reduction of such a structure to a string so that off-the-shelf neural tools
could be used to learn string mappings. Since then, intermediate structures have become more
important in neural NLP literature and seq2seq models have been further extended to process
trees and even graphs. The exact balance between the use of intermediate structures and the use
of shallow stringification remains a moving target.
The Bayesian approach, as presented in this book, focuses on discrete structures. To some extent, there is a mismatch between the approach presented this way and neural networks. The Bayesian approach has been, as mentioned in Chapter 2, mostly
used in the context of generative models, while neural network architectures are usually used to
define discriminative models (though, in principle, they can also be used to describe generative
models). In addition, learning with neural networks can be quite complex, and becomes even
more complicated when placing a prior on their parameters and finding the posterior. Still, this
has not deterred many researchers from exploring and interpreting neural networks in a Bayesian
context, as we discuss in this chapter.
Since the expansion of their use in the past decade, neural networks have become synonymous with both the usual generic architectures that have been used before (such as feed-
forward networks and recurrent neural networks), and also with complex computation graphs
that propagate and handle vectors, matrices and even tensors as their values and weights. In some
sense, the engineering of a good model with neural networks for NLP has become the design
and search for the architecture, in the form of a computation graph, that behaves best.3 This
search is done both manually (by careful thinking about the problem and trial and error with
3 Some argue that this need for “manual” architecture design has replaced the manual feature engineering that is required
with linear models. See also Section 9.5.2.
different architectures) as well as automatically (by tuning hyperparameters, and more recently,
by automatically searching for an architecture; see Section 9.5).
These computation graphs can be readily used to optimize an objective function, as is
standard in machine learning (Section 1.5.2). The most common objective function used is still
the log-likelihood, often referred to as “cross entropy” in the neural network literature (in an
equivalent form). There are generic algorithms, such as backpropagation (see Section 9.3.1)
that can optimize these objective functions for arbitrary computation graphs while relying on
principles such as automatic differentiation. Indeed, with the advent of automatic differentiation, which allows us to specify a function and then automatically compute its derivatives and gradients, it is not surprising that several software packages were introduced, such as Torch (Collobert et al., 2002), TensorFlow (Abadi et al., 2016), DyNet (Neubig et al., 2017) and Theano (Al-Rfou et al.,
2016), that allow the user to define a computation graph and the way data is fed into it. After
that, learning is done as a blackbox using automatic differentiation and the backpropagation
algorithm (Section 9.3.1).
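For instance, the following minimal PyTorch snippet (a generic illustration, not tied to any specific model in this book) builds a tiny computation graph for a logistic-regression-style cross-entropy objective and lets automatic differentiation produce the gradients used for one gradient step; the data and parameter names are made up for the example.

```python
import torch

# Toy data: 4 examples with 3 features each, binary labels.
x = torch.randn(4, 3)
z = torch.tensor([1., 0., 1., 1.])

# Parameters of the computation graph.
w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# Forward pass defines the graph; backward pass runs automatic differentiation.
y = torch.sigmoid(x @ w + b)
loss = -(z * torch.log(y) + (1 - z) * torch.log(1 - y)).mean()   # cross entropy
loss.backward()

# Gradient-following update (one step of gradient descent).
with torch.no_grad():
    w -= 0.1 * w.grad
    b -= 0.1 * b.grad
```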
4 The cosine similarity between two vectors $u \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$ is calculated as $\frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2}\,\sqrt{\sum_{i=1}^{d} v_i^2}}$. This quantity provides the cosine of the angle between $u$ and $v$.
by the company it keeps;" Firth 1957⁵). Since words that tend to "behave similarly" end up close
to one another in the embedding space, word embeddings can greatly alleviate the problem of
unseen words by exploiting information from the words that do appear in the training data.
Instead of using the word symbol as a feature in the model, we can use its vector, which exploits
such similarities.
In this section, we mostly cover a specific type of word embedding model, the skip-gram
model, which has served as the basis for many other models. We also cover this model because
it has a Bayesian version (Section 9.2.2).
$$\prod_{j=-c,\, j \neq 0}^{c} p\left(w_j \mid w_0\right), \quad (9.1)$$
where $w_0 \in V$ is the pivot word such that the sequence of words in the sample is $(w_{-c}, w_{-c+1}, \ldots, w_0, w_1, \ldots, w_c) \in V^{2c+1}$. Let $w_1, \ldots, w_N$ be a sequence of words in a corpus. If we take a frequentist approach in skip-gram model estimation, for example, through maximum likelihood estimation (see Chapter 1), then we would aim to maximize the following objective function:⁶
$$\frac{1}{N}\sum_{i=1}^{N}\ \sum_{j=-c,\, j \neq 0}^{c} \log p\left(w_{i+j} \mid w_i\right). \quad (9.2)$$
While the skip-gram model has strong similarities to the n-gram model, in which a word
is generated conditioned on a fixed number of previous words in the text, there is a significant
difference in the underlying generative process. For a given corpus with skip-gram modeling, we
will generate each word multiple times, since we generate the context (and not the pivot word),
as a word appears in several contexts of various pivot words. Therefore, the skip-gram model is
not a generative model of a corpus—instead words are generated multiple times.
5 While this quote is widely used in NLP in the context of justifying the use of co-occurrence statistics to derive word
embeddings across a dictionary of different words, it is arguable whether this was Firth’s original intention—he may have been
referring to identifying the specific sense of an ambiguous word compared to its other senses by its “habitual collocation.”
6 The indices in Equation 9.2 may be negative or may exceed the number of words in the text. We assume the text is
padded by c symbols in its beginning and in its end to accommodate for this.
Another point of departure of the skip-gram model from the common n-gram model is
the way that we model the factors of the form p.wj j w0 / in Equation 9.1. With representation
learning models such as skip-gram (Mikolov et al., 2013a), this probability is modeled as
$$p\left(w_j \mid w_0, u, v\right) = \frac{\exp\left(u(w_0)^\top v(w_j)\right)}{\sum_{w \in V} \exp\left(u(w_0)^\top v(w)\right)}, \quad (9.3)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (the sigmoid function) and $\mathrm{neg}(w_0)$ is a set of "negative" word samples—i.e., words that are unlikely to be in the context of $w_0$. These words are sampled from a unigram distribution (estimated using the frequency count from a corpus) raised to a power $\alpha = \frac{3}{4}$ ($\alpha < 1$ is a hyperparameter of the method, which results in the unigram distribution "flattening" compared to its original form).
sampling is often referred to as one of the word2vec models (Mikolov et al., 2013a,b). A second
proposed model of word2vec is the continuous bag-of-words model (CBOW), which predicts
a word from the context—in reverse from the skip-gram model. The model gets its name from
the way context is represented—as an average of word embeddings which appear in the context.
In this case, word order in the text is lost, similar to the case with the bag-of-words model in the
Latent Dirichlet Allocation model (Section 2.2). An alternative to the use of negative sampling
with the CBOW model is hierarchical softmax (Mikolov et al., 2013a), for which the sum over
the vocabulary becomes logarithmic in the size of the vocabulary. With the hierarchical softmax,
the probability of a word is modeled as a path in a binary tree with the leaves being the words
in the vocabulary.
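To make the negative-sampling variant concrete, here is a small NumPy sketch; the variable names are hypothetical and the objective is the standard word2vec negative-sampling form from the literature (the book's own display of it is not reproduced here). It scores one pivot-context pair against a handful of negative samples drawn from the flattened unigram distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_logprob(U, V, pivot, context, negatives):
    """Negative-sampling score for one (pivot, context) pair.

    U[w] is the pivot ("input") embedding of word w, V[w] its context ("output")
    embedding; `negatives` are indices of negative word samples."""
    score = np.log(sigmoid(U[pivot] @ V[context]))
    score += sum(np.log(sigmoid(-U[pivot] @ V[w])) for w in negatives)
    return score

def sample_negatives(unigram_counts, k, rng, alpha=0.75):
    """Draw k negatives from the unigram distribution raised to alpha < 1."""
    probs = unigram_counts ** alpha
    probs /= probs.sum()
    return rng.choice(len(probs), size=k, p=probs)

rng = np.random.default_rng(0)
vocab, dim = 1000, 50
U, V = rng.normal(size=(vocab, dim)), rng.normal(size=(vocab, dim))
counts = rng.integers(1, 100, size=vocab).astype(float)
negs = sample_negatives(counts, k=5, rng=rng)
print(neg_sampling_logprob(U, V, pivot=3, context=7, negatives=negs))
```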
where both $p(u(w))$ and $p(v(w))$ are multivariate Gaussian distributions with mean 0 and covariance matrix $\frac{1}{\lambda} I$, where $\lambda$ is a scalar (precision) hyperparameter and $I$ is the $k \times k$ identity matrix.
Let $C(i)$ be a multiset of the words that appear in the context of $w_i$ in a corpus $(w_1, \ldots, w_N)$. We define the random variable $D_{iw}$, where $i \in [N]$ ($[N] = \{1, \ldots, N\}$) and $w \in V$, that receives the value $r$ if $w$ appears in $C(i)$ $r \geq 1$ times and $-1$ if $w \notin C(i)$. We then define the likelihood of the model as:
$$\prod_{i=1}^{N} \prod_{w \in V} p\left(D_{iw} \mid u, v\right), \quad (9.5)$$
where
$$p\left(D_{iw} = d \mid u, v\right) = \sigma\left(d\, u(w_i)^\top v(w)\right),$$
with $\sigma(z) = \frac{1}{1 + e^{-z}}$ (the sigmoid function). The full joint model is defined as the product of Equations 9.4 and 9.5. The product in Equation 9.5 can be split over terms $w \in C(i)$ and
$w \notin C(i)$. Clearly, there are many more terms of the latter kind, leading to intractability of
computing their product (or the sum of their logarithms). Barkan (2017) uses negative sampling
for that set of terms, similar to word2vec.
Until now our discussion has focused on defining a word embedding function that maps a
word to a vector (or potentially a distribution over vectors). However, words do not stand alone,
and can be interpreted differently depending on their context, both syntactically (“I can can the
can”) or semantically (compare the use of the word “bank” in “the bank of the river is green and
flowery” vs. “the bank increased its deposit fees”). A subsequent question that arises is whether
we can define a word embedding function that also takes as an argument the context of the word.
Recent work (Devlin et al., 2018, Peters et al., 2018) suggests that this is possible, and can lead to state-of-the-art results in a variety of NLP problems.
There is an earlier Bayesian connection to contextual embeddings. Bražinskas et al. (2017)
propose a Bayesian skip-gram contextualized model, through which word embeddings are rep-
resented as distributions. In this model, the context of a word c is generated while conditioning
on the word itself $w$, with a latent variable $z$ (the embedding) in between. The distribution that defines the model is:
$$p(c \mid w) = \int_z p(z \mid w)\, p(c \mid z)\, dz.$$
While p.z j w/ is modeled using a Gaussian, p.c j z/ is modeled using a neural network.
The latent variable z can indicate, for example, a word sense or a syntactic category the word
belongs to in that context.
Inference in this model is intractable, and the authors use variational inference with a mean-field approximation such that each factor represents a distribution over $z$ for a given word and its context in the data. See also Chapter 6 and Section 9.6.1. This work is inspired by the
Gaussian embeddings of Vilnis and McCallum (2015) (where a word embedding is represented
using a Gaussian distribution), with two key differences. First, the work of Vilnis and McCallum
(2015) does not provide contextualized embeddings (i.e., we eventually get a single distribution
per word independent of its context as an output). In addition, the original work about Gaussian
embeddings does not define a generative model through which posterior inference can be done in
a Bayesian setting. Instead, finding the distribution for each word is done by directly optimizing
an objective function that is based on KL-divergence terms.
9.2.3 DISCUSSION
Whereas our discussion in the previous sections focused on pre-trained embeddings, word em-
beddings can be either pre-trained or trained (generic vs. task-specific). With pre-trained em-
beddings, we use a large corpus to estimate the word embeddings with techniques such as
word2vec and others. Trained embeddings, on the other hand, are estimated in conjunction with
the specific NLP problem we are trying to solve, such as machine translation or summarization.
While pre-trained embeddings have the advantage that they can be used in conjunction with
large corpora—it is usually easy to find large amounts of unlabeled text data—trained embed-
dings have the advantage that the learned vectors are tuned specifically to the problem at hand,
often with less data, because annotated corpora for specific tasks tend to be smaller. Trained embeddings may be initialized with pre-trained embeddings at the beginning of the training procedure of a neural network (which often uses the backpropagation algorithm; see Section 9.3.1).
The idea of embedding words in a Euclidean space can be generalized further, and sen-
tences, paragraphs and even whole documents can be embedded as vectors. Indeed, this is one
of the main ideas behind encoder-decoder models, which are discussed in Section 9.4.3. In
addition, like in the word2vec model, models for embedding larger chunks of text have been
developed (Le and Mikolov, 2014).
Figure 9.1: An example of a neural network representing a logistic regression model: the inputs $x_1, \ldots, x_5$ feed into a single output unit computing $\sigma(w^\top x + b)$. The function $\sigma$ denotes the sigmoid function. The weights of the neural network are $w \in \mathbb{R}^5$ and the bias term is $b \in \mathbb{R}$.
Figure 9.2: An example of a feed-forward neural network with one hidden layer, mapping the inputs $x_1, \ldots, x_4$ to a single output. The dimension of the input is $d_0 = 4$, and $L = 2$ such that $d_1 = 5$ and $d_2 = 1$.
Table 9.1: Examples of activation functions commonly used with neural networks in NLP. The function $I(\cdot)$ applied to a statement is the indicator function (1 if the statement is true and 0 otherwise). Functions $g\colon \mathbb{R} \rightarrow \mathbb{R}$ map the linear combination of the values from the current layer ($x$) to a new value.

$1 \leq i \leq L$. These functions usually operate coordinate-wise on their input (but do not have to, such as the case with pooling or normalization). For example, $g^{(i)}$ could be returning the value of the sigmoid function on each coordinate. Table 9.1 gives examples of prototypical activation functions used with neural networks.
Now we can recursively define the outputs of the neural network. We begin with $y^{(0)} = x$ and then define, for $i \in \{1, \ldots, L\}$,
$$y^{(i)} = g^{(i)}\left(W^{(i)} y^{(i-1)} + b^{(i)}\right). \quad (9.6)$$
The final output $y^{(L)}$ of the neural network is often a scalar, and in the case of classification, might denote a probability distribution over a discrete space of outputs that is calculated by using a logistic function for $g^{(L)}$. For example, in the case of binary classification, the neural network defines a conditional probability model over outcomes $\{0, 1\}$ with $y^{(L)} = p\left(1 \mid x; W^{(i)}, b^{(i)}, i \in [L]\right)$, where $[L]$ denotes the set $\{1, \ldots, L\}$. The parameters of this model are naturally the weight matrices and the bias vectors.
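A literal rendering of this recursion in NumPy; the layer sizes and the choice of activations below are arbitrary illustrations rather than anything prescribed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases, activations):
    """Compute y^(L) via y^(i) = g^(i)(W^(i) y^(i-1) + b^(i)), as in Equation 9.6."""
    y = x
    for W, b, g in zip(weights, biases, activations):
        y = g(W @ y + b)
    return y

# A network matching Figure 9.2: d0 = 4, d1 = 5, d2 = 1.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(1, 5))]
biases = [np.zeros(5), np.zeros(1)]
activations = [np.tanh, sigmoid]          # hidden layer, output layer

x = rng.normal(size=4)
print(feed_forward(x, weights, biases, activations))   # p(1 | x) for binary classification
```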
8 A variant of ReLU, called "leaky ReLU," is sometimes used instead of the vanilla ReLU activation function (Maas et al., 2013). With leaky ReLU, $g(x) = x I(x \geq 0) + \alpha x I(x < 0)$ for a small value of $\alpha$ (such as 0.001). This solves the problem of "dead neurons," which are never updated because they always have a negative pre-activation value; as a result, their gradient update is 0.
9.3.1 FREQUENTIST ESTIMATION AND THE BACKPROPAGATION
ALGORITHM
Given that now we have defined a conditional probability model over the space of y conditioned
on x , we can proceed with the frequentist approach to estimate the weight matrices and the bias
vectors by maximizing an objective function of the parameters of this model and the data. This
can be done, for example, by maximizing the log-likelihood of the data (Section 1.5.2). In this
case, we assume that we receive a set $(x^{(k)}, z^{(k)})$, where $k \in \{1, \ldots, N\}$, as input to the estimation algorithm; we then aim to find (see Section 1.5.2):
$$\left(W^{(i)}, b^{(i)}\right)_{i=1}^{L} = \arg\max_{\left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}} \underbrace{\sum_{k=1}^{N} \log p\left(z^{(k)} \mid x^{(k)}; W^{(i)}, b^{(i)}, i \in [L]\right)}_{L\left(\left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}\right)}. \quad (9.7)$$
This maximization problem usually does not have a closed-form solution which is just a simple function of the training data. Therefore, we need to use optimization techniques which require the calculation of the gradient of $L\left(\left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}\right)$. This optimization technique will "follow" the gradient of the function with respect to the parameters, which provides the direction in which the function is increasing. This way it will find a local maximum of $L\left(\left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}\right)$ with respect to the parameters of the neural network (see Section A.3). The computation of the gradient leads to "update rules" that iteratively update the parameters until convergence to a local maximum.⁹
To compute the gradient, we first note that when $g^{(i)}$ operates coordinate-wise on its inputs, the $j$th output coordinate $y^{(i)}_j$ for $i \in \{1, \ldots, L\}$ has the following form:
$$y^{(i)}_j = g^{(i)}_j\left(a^{(i)}_j\right), \qquad \text{activation values} \quad (9.8)$$
$$a^{(i)}_j = \sum_{\ell=1}^{d_{i-1}} W^{(i)}_{j\ell}\, y^{(i-1)}_\ell + b^{(i)}_j. \qquad \text{pre-activation values} \quad (9.9)$$
The definition of $a^{(i)}$ is such that $a^{(i)} \in \mathbb{R}^{d_i}$. These vectors are also referred to as the "pre-activation" values. We would like to compute the gradient of $y^{(L)}$ with respect to $W^{(r)}_{st}$ and $b^{(r)}_s$, where $r \in [L]$, $s \in [d_r]$ and $t \in [d_{r-1}]$. For simplicity, we will assume that $d_L = 1$ (i.e., the neural network has a single output).
To compute this derivative, we will use the chain rule several times. We begin by defining the terms $\delta^{(r)}_s$ for $r \in [L]$ and $s \in [d_r]$:
$$\delta^{(r)}_s = \frac{\partial y^{(L)}_1}{\partial a^{(r)}_s}. \quad (9.10)$$
9 While most neural network packages now provide automatic ways to optimize objective functions and calculate their
gradient, we give in this section a derivation of the way to calculate these gradients for completeness.
Using the chain rule (in Equation 9.11), we get a recursive formula for computing $\delta^{(r)}_s$:
$$\delta^{(L)}_1 = \left(g^{(L)}_1\right)'\left(a^{(L)}_1\right) \qquad \text{base case}$$
$$\delta^{(r)}_s = \sum_{\ell=1}^{d_{r+1}} \frac{\partial y^{(L)}_1}{\partial a^{(r+1)}_\ell}\,\frac{\partial a^{(r+1)}_\ell}{\partial a^{(r)}_s} \quad (9.11)$$
$$= \sum_{\ell=1}^{d_{r+1}} \delta^{(r+1)}_\ell\,\frac{\partial a^{(r+1)}_\ell}{\partial a^{(r)}_s} \qquad r \in \{L-1, \ldots, 1\},\ s \in [d_r], \quad (9.12)$$
where $\left(g^{(i)}_j\right)'(z)$ denotes the value of the derivative of $g^{(i)}_j(z)$ for $z \in \mathbb{R}$.
To fully compute $\delta^{(r)}_s$, we need to be able to compute $\frac{\partial a^{(r+1)}_\ell}{\partial a^{(r)}_s}$. Based on Equations 9.8–9.9 the following relationship holds:
$$a^{(r+1)}_\ell = \sum_{k=1}^{d_r} W^{(r+1)}_{\ell k}\, g^{(r)}_k\left(a^{(r)}_k\right) + b^{(r+1)}_\ell.$$
As such,
$$\frac{\partial a^{(r+1)}_\ell}{\partial a^{(r)}_s} = W^{(r+1)}_{\ell s}\left(g^{(r)}_s\right)'\left(a^{(r)}_s\right).$$
$$\delta^{(r)}_s = \sum_{\ell=1}^{d_{r+1}} \delta^{(r+1)}_\ell\left(g^{(r)}_s\right)'\left(a^{(r)}_s\right) W^{(r+1)}_{\ell s} \qquad r \in \{L-1, \ldots, 1\},\ s \in [d_r]. \quad (9.13)$$
Now, we can calculate the derivatives for $y^{(L)}_1$ with respect to the weights and bias terms by using Equation 9.10:
then each summand in the log-likelihood objective function can be expressed as the following function of the outputs:
$$\log p\left(z^{(k)} \mid x^{(k)}; W^{(i)}, b^{(i)}, i \in [L]\right) = \log\left(\left(y^{(L)}_1\right)^{z^{(k)}}\left(1 - y^{(L)}_1\right)^{1 - z^{(k)}}\right)$$
$$= \underbrace{z^{(k)} \log y^{(L)}_1 + \left(1 - z^{(k)}\right)\log\left(1 - y^{(L)}_1\right)}_{L\left(k, \left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}\right)}. \quad (9.16)$$
By the chain rule, the derivative of each of these terms as in Equation 9.16 with respect to $W^{(r)}_{st}$ is:
$$\frac{\partial L\left(k, \left(W^{(i)}, b^{(i)}\right)_{i=1}^{L}\right)}{\partial W^{(r)}_{st}} = \begin{cases} \dfrac{1}{y^{(L)}_1}\,\dfrac{\partial y^{(L)}_1}{\partial W^{(r)}_{st}} & \text{if } z^{(k)} = 1, \\[2ex] -\dfrac{1}{1 - y^{(L)}_1}\,\dfrac{\partial y^{(L)}_1}{\partial W^{(r)}_{st}} & \text{if } z^{(k)} = 0, \end{cases}$$
where $\frac{\partial y^{(L)}_1}{\partial W^{(r)}_{st}}$ is taken from Equation 9.14. We can similarly use Equation 9.15 to calculate the gradient of the objective function with respect to the bias terms.
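The recursion above translates almost line by line into code. The sketch below is an assumption-laden reconstruction rather than the book's own algorithm: it uses a single-output network with sigmoid activations (so $g' = g(1-g)$), and computes the gradient of the output with respect to each weight as $\delta^{(r)}_s\, y^{(r-1)}_t$, the standard chain-rule identity (the book's Equations 9.14–9.15 are not reproduced in this excerpt).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass storing pre-activations a^(i) and activations y^(i)."""
    ys, activ = [x], []
    for W, b in zip(weights, biases):
        a = W @ ys[-1] + b
        activ.append(a)
        ys.append(sigmoid(a))
    return ys, activ

def backprop(x, weights, biases):
    """Return dy^(L)/dW^(r) and dy^(L)/db^(r) for a single-output sigmoid network."""
    ys, activ = forward(x, weights, biases)
    L = len(weights)
    sig_prime = [sigmoid(a) * (1.0 - sigmoid(a)) for a in activ]   # g'(a^(i))
    deltas = [None] * L
    deltas[L - 1] = sig_prime[L - 1]                               # base case (d_L = 1)
    for r in range(L - 2, -1, -1):                                 # Equation 9.13
        deltas[r] = (weights[r + 1].T @ deltas[r + 1]) * sig_prime[r]
    grads_W = [np.outer(deltas[r], ys[r]) for r in range(L)]       # delta_s^(r) * y_t^(r-1)
    grads_b = deltas
    return grads_W, grads_b
```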
Intuition Behind the Backpropagation Algorithm As mentioned in the beginning of this section, the goal of the backpropagation algorithm is to compute the gradient of the neural network output as a function of the weights and the bias terms of the network (the parameters), or the gradient of the objective function as a function of the outputs, to be more precise. Equation 9.10 defines the $\delta^{(r)}_s$ term, which provides the "amount of change" in the network output as a function of the change in the pre-activation of neuron $s$ in layer $r$ ($a^{(r)}_s$). The derivation of Equation 9.10, which is based on the chain rule, shows that this amount of change can be expressed as the weighted sum of the amount of change of the output with respect to each pre-activation $\ell$ in layer $r+1$ times the amount of change to the pre-activation of neuron $\ell$ as a function of the amount of change to $a^{(r)}_s$. Therefore, the pre-activations in layer
$r+1$ are used as intermediate variables; the change in upper layers of the network as a function of changes in pre-activations in lower layers can be expressed through these intermediate
pre-activations. For more intuition on the chain rule, see Appendix A.2.2. It is important to
note that while the chain rule was our main tool to derive the backpropagation algorithm in
this section, automatic differentiation for complex functions requires more than just a simple
application of the chain rule. These implementation details are now often hidden when using
off-the-shelf packages (such as PyTorch and Tensorflow) for neural network modeling. There
are also strong connections between the backpropagation algorithm and the inside-outside al-
gorithm described in Chapter 8. See Eisner (2016) for more details.
Initialization with the Backpropagation Algorithm Most objective functions used in conjunction with neural networks are non-convex and have multiple extrema. This non-convexity
arises when there are latent layers in the neural network, as commonly happens with unsuper-
vised learning (this happens because the latent layers introduce higher-order multiplicative terms
of interaction between different parameters of the neural networks). Therefore, initialization of
neural network weights with the backpropagation algorithm, which essentially optimizes the
objective function in a gradient-descent style, is crucial. In addition, if the weights are not cho-
sen in a suitable way at the beginning of the algorithm, the gradients may quickly explode or
vanish (Section 9.4.2).
Various examples of initialization techniques exist. For example, one can initialize the
weights of the neural network by sampling from a Gaussian variable with mean zero and vari-
ance that is inversely proportional to the square-root of the number of connections in the net-
work (Glorot and Bengio, 2010). Alternatively, instead of initializing each weight separately
through sampling, Saxe et al. (2014) proposed jointly initializing all weights in a given layer by
using an orthonormal matrix that preserves the size of the vector being input to the layer to
prevent the gradient from exploding or vanishing. For more details, see also Eisenstein (2019).
$$F(\theta) = \nabla \log p(\theta) + \frac{1}{n}\sum_{i=1}^{n} \nabla \log p\left(x^{(i)} \mid \theta\right),$$
where $x^{(1)}, \ldots, x^{(n)}$ are the observed data points and $\theta$ denotes the parameters of the model.
The gradient of the log-likelihood function can be computed using the backpropagation algorithm if it is indeed modeled by a neural network. The update made to the parameters is in the direction of this gradient together with additional noise sampled from a multidimensional Gaussian distribution:
$$\theta_{t+1} \leftarrow \theta_t + \eta_t F(\theta_t) + \epsilon_t,$$
where $\epsilon_t$ is a sample from a multivariate Gaussian with mean zero, $\theta_t$ is the set of parameters at timestep $t$ in the sequence of updates to the parameters and $\eta_t$ is a learning rate.
It can be shown that when running these gradient updates, as $t$ increases, the distribution over the parameters (given the Gaussian noise, we have a distribution over $\theta_t$, so it can be treated as a random variable) eventually converges to the true posterior over the parameters (Teh et al., 2016). This convergence requires the learning rate to become closer to 0 as $t$ increases (see also Section 4.4). In NLP, for example, Shareghi et al. (2019) used SGLD for the problem of dependency parsing. See more about stochastic gradient descent in Section A.3.1.
10 It is possible also to calculate this gradient in “batches.” See more in Section A.3.1.
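A minimal sketch of SGLD on a toy model follows. It is illustrative only: it uses the common SGLD scaling in which the injected noise has variance equal to the step size (which may differ in constants from the display above), and `grad_log_joint`, standing in for the gradient of the log joint, would in practice be computed with backpropagation over a mini-batch.

```python
import numpy as np

def sgld_step(theta, grad_log_joint, eta_t, rng):
    """One SGLD update: follow the gradient of the log joint and inject
    Gaussian noise whose variance is tied to the learning rate."""
    noise = rng.normal(scale=np.sqrt(eta_t), size=theta.shape)
    return theta + 0.5 * eta_t * grad_log_joint(theta) + noise

# Toy example: posterior over the mean of a Gaussian with a standard normal prior.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)
grad = lambda th: -th + np.sum(data - th)   # d/dth [log p(th) + sum_i log p(x_i | th)]
theta = np.zeros(1)
for t in range(1, 2001):
    theta = sgld_step(theta, grad, eta_t=1e-3 / t ** 0.55, rng=rng)
```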
9.4 MODERN USE OF NEURAL NETWORKS IN NLP
In their current form in NLP, neural networks have become complex, and they use more ad-
vanced architectures, particularly convolutional neural networks (CNNs) and recurrent neural
networks (RNNs).
Figure 9.3: (a) A representation of a recurrent neural network; (b) an unrolled representation of
the network.
the chain rule, which states that for an objective $L$ and a parameter $\theta$ in the RNN it holds that:
$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L}{\partial \theta^{(t)}}\,\frac{\partial \theta^{(t)}}{\partial \theta},$$
where $T$ is the number of timesteps unrolled and $\theta^{(t)}$ is the corresponding unrolled parameter in timestep $t$. As $\frac{\partial \theta^{(t)}}{\partial \theta} = 1$ (because $\theta^{(t)} = \theta$), it holds that the derivative of $L$ with respect to $\theta$ is the sum of all derivatives in the unrolled network with respect to $\theta^{(t)}$. This variant of gradient computation, which is based on backpropagation, is also referred to as "backpropagation through time" (Werbos, 1990).
While RNNs focus on cases in which the network represents a sequence (the input to the
recurrent unit is the output of the recurrent unit from the previous step), we can extend this idea
further and have several “history” vectors that are fed into a unit at an upper level. This can be
done, for example, using a tree structure. Networks with this form of generalization of RNNs
are referred to as Recursive Neural Networks (Pollack, 1990).
The computation of Recursive Neural Networks, with the sharing of parameters at dif-
ferent nodes in the network, can be unrolled similarly to the way it is done with RNNs using
a directed acyclic graph representing the full computation. This unrolled form can be used to
compute the gradient for the optimization of a training objective. This technique of backprop-
agation calculation is called “backpropagation through structure” (Goller and Kuchler, 1996),
and it is a generalization of backpropagation through time.
RNNs offer a solution to a basic problem with natural language inputs: they are generally of varying length, and in many applications (e.g., classification), we need a fixed-size vector to apply the final classification step. An RNN can be used to "read" the input, token by token, and
the internal state that it maintains can be used in the last time step as a representation for the
whole input, which is fed to a classifier, for example. CNNs (Section 9.4.4) also tackle this issue
of fixed-size vectors for natural language inputs.
One can adapt RNN training to the Bayesian setting. For example, Fortunato et al. (2017)
used “Bayes by backprop” in a way that is similar to the work of Graves (2011), mentioned in
Section 9.3.2. They set a prior on the weights of the RNN, and then proceed with variational
Bayes to infer a posterior over the weights. The backpropagation algorithm is used as a subroutine
in their variational inference procedure. They also introduce the idea of “posterior shattering,”
in which the approximate posterior at each step of optimization is conditioned on the mini-
batch of datapoints used in that step. This reduces the variance in the learning process. Gal
and Ghahramani (2016c) framed a regularization technique for neural networks, dropout, in a
Bayesian manner. For more details see Section 9.5.1.
where $\frac{\partial^{+} h^{(k)}}{\partial \theta}$ refers to the "immediate derivative" (Pascanu et al., 2013) of $h^{(k)}$ with respect to $\theta$, in which $h^{(k-1)}$ is taken to be a constant with respect to $\theta$ (so we do not apply the chain rule further to $h^{(k-1)}$ when taking its derivative with respect to $\theta$). The derivative $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ can be shown to be:
$$\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}} = \prod_{i=k+1}^{t} W_{\mathrm{rec}}^{\top}\, \mathrm{diag}\left(g'\left(h^{(i-1)}\right)\right), \quad (9.18)$$
where diag is a function that takes as an input a vector (coordinate-wise derivatives of the activation function, in this case) and turns it into a diagonal matrix with the values of the vector on the diagonal. Note that $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ is shorthand for a matrix that takes the derivative of a specific coordinate of $h^{(t)}$ with respect to a specific coordinate of $h^{(k)}$. This is a gradient formulation that unravels the backpropagation equations in full.
The vanishing gradient problem arises, for example, when the largest singular value of $W_{\mathrm{rec}}$ is small, and as such the norm of the right-hand side of Equation 9.18 reaches 0 quickly. This will happen if that singular value is smaller than $\frac{1}{\gamma}$, where $\gamma$ is an upper bound on $\left\|\mathrm{diag}\left(g'\left(h^{(i-1)}\right)\right)\right\|$. In this case, following Equation 9.18, it holds that:¹²
$$\left\|\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| \leq \left\|W_{\mathrm{rec}}^{\top}\right\|\,\left\|\mathrm{diag}\left(g'\left(h^{(i-1)}\right)\right)\right\| < \frac{1}{\gamma}\,\gamma = 1. \quad (9.19)$$
Equation 9.19 indicates that it could be the case that there exists $\eta < 1$ such that $\left\|\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| \leq \eta$ for all timesteps $i$. In this case, it holds that
$$\left\|\frac{\partial L_t}{\partial h^{(t)}}\,\frac{\partial h^{(t)}}{\partial h^{(k)}}\right\| = \left\|\frac{\partial L_t}{\partial h^{(t)}}\prod_{i=k+1}^{t}\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| \leq \eta^{\,t-k}\left\|\frac{\partial L_t}{\partial h^{(t)}}\right\|.$$
The above equation shows that the contribution to the gradient of terms from distant timesteps reaches 0 exponentially fast as the timestep difference increases. While the vanishing gradient happens whenever $\eta < 1$, the opposite problem may happen if the largest singular value of $W_{\mathrm{rec}}$ is too large. In this case, the gradient becomes larger and larger until it "explodes" and convergence becomes erratic, or there is an overflow. "Gradient clipping" (in which the gradient is clipped if it is larger than a certain threshold value) is often used to overcome this problem. The intuition behind both the vanishing gradient and the exploding gradient problems is the repeated multiplication of gradients through the backpropagated terms: multiplying too many of these terms together may drive the product toward zero or toward infinity.
Long Short-Term Memory Units and Gated Recurrent Units One way to solve the problem
of the vanishing gradient in recurrent and deep networks is to use an abstraction of a neuron
called a Long Short-Term Memory (LSTM) cell, developed by Hochreiter and Schmidhuber
(1997) and depicted in Figure 9.4. An LSTM cell maintains an internal cell state at time step $t$, $c^{(t)}$, as well as an output state $\alpha^{(t)}$. These two states are vectors determined by the following two equations:
$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot z^{(t)} \qquad \text{cell memory} \quad (9.20)$$
$$\alpha^{(t)} = o^{(t)} \odot \tanh\left(c^{(t)}\right) \qquad \text{hidden state output}, \quad (9.21)$$
12 The matrix norms are taken with respect to the spectral norm, which corresponds to the largest singular value of the
matrix.
Figure 9.4: A diagram depicting an LSTM cell, with an input $x^{(t)}$ at timestep $t$ (all $x^{(t)}$ in the diagram refer to the same vector). There are three gates: output gate, input gate and forget gate. The cell also maintains an internal state ($c^{(t)}$) and outputs a state $\alpha^{(t)}$. Different operations are used, such as the sigmoid (denoted by the curve operator) and Hadamard product (denoted by $\odot$).¹³
where $\odot$ denotes vector coordinate-wise multiplication (Hadamard product) and the values of $f^{(t)}$, $i^{(t)}$, $z^{(t)}$ and $o^{(t)}$ are determined by the following additional equations:
$$i^{(t)} = \sigma\left(W^{i} x^{(t)} + U^{i} \alpha^{(t-1)} + b^{i}\right) \qquad \text{input gate}$$
$$f^{(t)} = \sigma\left(W^{f} x^{(t)} + U^{f} \alpha^{(t-1)} + b^{f}\right) \qquad \text{forget gate}$$
$$z^{(t)} = \tanh\left(W^{z} x^{(t)} + U^{z} \alpha^{(t-1)} + b^{z}\right)$$
$$o^{(t)} = \sigma\left(W^{o} x^{(t)} + U^{o} \alpha^{(t-1)} + b^{o}\right) \qquad \text{output gate},$$
where $c^{(t)}, \alpha^{(t)}, i^{(t)}, f^{(t)}, z^{(t)}, o^{(t)} \in \mathbb{R}^{d}$ are vectors maintained by the LSTM cell (as such, $\sigma$ above denotes a coordinate-wise application of the sigmoid function as described in Table 9.1), $W^{i}, W^{f}, W^{z}, W^{o} \in \mathbb{R}^{d \times k}$ are the weight matrices of the LSTM and $b^{i}, b^{f}, b^{z}, b^{o}$ are the bias
parameters. When considering the LSTM, we see the following structure. The vector $c^{(t)}$, which represents the state of the LSTM cell, is an interpolation between the previous state and $z^{(t)}$, which depends directly on the input. This means that the internal state $c^{(t)}$ can be updated by "forgetting" some of the previous information in the state (as the previous state is multiplied against a forget gate $f^{(t)}$) and by including a certain level of information from the input (as $z^{(t)}$
is multiplied against an "input gate" $i^{(t)}$). Finally, $\alpha^{(t)}$ is the output of the LSTM cell, which multiplies a transformation of the internal state against an "output gate" $o^{(t)}$.
The reason that the LSTM is less prone to the vanishing gradient problem is that its internal state $c^{(t)}$ in Equation 9.20 does not include a nonlinear, possibly squashing, function applied on the state itself, causing the "squashing" to compound. Information is additively combined into a new cell state from the previous timestep. Some information is potentially forgotten (through the forget gate which is multiplied against $c^{(t-1)}$) and some information is added by incorporating $z^{(t)}$. The LSTM is constructed in a way that allows long range dependencies to propagate through the updates of the state of the cell.
LSTMs are part of a family of neural network units that make use of gates. Another example of such a unit is the Gated Recurrent Unit (GRU), originally developed for machine translation (Cho et al., 2014). GRUs maintain the gates and states that are defined by the following equations:
The reset and update gates, $r^{(t)}$ and $z^{(t)}$, control how much of the information from the previous hidden state is maintained in the next state $\alpha^{(t)}$. In principle, the number of parameters used by a GRU is lower than the number of parameters used by an LSTM.
LSTMs and GRUs are an abstract form of a cell that may take part in a RNN, and one can
stack them up (with the output of one LSTM cell being fed as an input to another) or combine
them in other ways. Indeed, neural encoder-decoders (in the next section) build on this idea.
See also Graves (2012) for early work on using recurrent neural models for sequence modeling.
Figure 9.5: An unrolled diagram for an encoder-decoder model. The left part is the encoder, the right part is the decoder. The encoder also returns an output state $\alpha^{(t)}$ at each position $t$. The decoder, on the other hand, outputs a symbol at each step, until it reaches the end-of-sentence symbol. The output in the previous step is used as an input for the next step. The output symbol is determined by the output state in each position. The blocks themselves "contain" the "cell memory" (in LSTM, this is $c^{(t)}$; see Equation 9.20). The markers bos and eos are beginning-of-sequence and end-of-sequence markers.
specifically, the encoder part is an LSTM^{14} cell that receives the sequential input and maintains
an internal state, and the decoder part also consists of a cell that (a) receives as an input the
encoder’s last state after reading the sequence; (b) outputs a symbol; (c) receives as an input its
output from the previous position in the sequence and continues the process. Figure 9.5 gives a
diagram of an encoder-decoder model.
In their modern version, neural encoder-decoder models were first introduced for machine
translation (Cho et al., 2014, Sutskever et al., 2014), rooted in a two-decade-old endeavour to
treat machine translation in a connectionist framework (Castano and Casacuberta 1997, Neco
and Forcada 1997; see also Kalchbrenner and Blunsom 2013). Since then, these models have
been widely used for many other types of problems that require transduction of a sequence of
symbols to another sequence, such as summarization (or more generally, generation), question
answering and syntactic and semantic parsing.
The neural encoder-decoder model of Cho et al. (2014), which has become the base for
other variants, models a distribution of the form p(y_1, ..., y_n | x_1, ..., x_m), mapping an input
sequence of symbols x_1 ... x_m to an output sequence y_1 ... y_n. The model works by first com-
puting a (global) vector c of the input sequence by running it through a recurrent unit (GRU
or LSTM, for example). That vector c is the cell memory of the last step in the input scan (see
Equation 9.20 for the LSTM). The decoder, in turn, also uses a recurrent unit (for example, an LSTM)
14 While we generally make reference to LSTM cells composing the encoder-decoder model, a natural variant is one in
which we use a GRU cell. This variant is common in NLP.
to define the following probability of generating an output symbol:

p(y_t | y_1, ..., y_{t-1}, x_1, ..., x_m) = g(α^{(t)}, y_{t-1}, c),        (9.23)

d^{(t)} = Σ_{s=1}^{n} β_t(s) α^{(s)},

and use this context vector as an input to the decoder cell at index t, in addition to the output
from the decoder at index t − 1, which is also an input at index t. The context vector d^{(t)} allows
the decoder to focus on a specific index in the encoder states when predicting the output. Since
the β_t(s) are potentially parameterized, the neural network can learn how to set these coefficients
to focus on relevant parts of the encoder states.
15 The encoder-decoder model of Sutskever et al. (2014) has some differences, for example, in the fact that c is not used
as the input for all decoding steps, but just as the input for the first decoder step.
Figure 9.6: An unrolled diagram for a neural encoder-decoder model with an attention mech-
anism. The context vector d^{(t)} is created at each time step as a weighted average of the states
of the encoder, and is fed into the decoder at each position. There is a connection through d^{(t)}
between each element of the encoder and each element of the decoder.
There is a variety of similarity scores that can be used, such as the dot product between the two
states α and α′ being compared, or a parameterized dot product of the form α^⊤ W α′, where W are additional parameters
in the neural network. One can also concatenate the two input vectors to the similarity func-
tion, for example as v^⊤ tanh(W[α; α′]) (with v and W being parameters of the neural network).
See Luong et al. (2015) for a thorough investigation of scoring functions for attention-based
models.
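The following is a small NumPy sketch of the attention computation described above: scores between the current decoder state and each encoder state are turned into weights β_t(s) (here with a softmax normalization, a common but assumed choice), which are then used to form the context vector d^{(t)}. All names are illustrative.

```python
import numpy as np

def softmax(v):
    v = v - v.max()                 # numerical stability
    e = np.exp(v)
    return e / e.sum()

def attention_context(decoder_state, encoder_states, W=None):
    """Compute attention weights beta_t(s) and the context vector d^(t).

    decoder_state: current decoder state alpha^(t), shape (h,).
    encoder_states: matrix of encoder states alpha^(s), shape (n, h).
    W: optional parameter matrix for the bilinear score alpha^T W alpha'.
    """
    if W is None:
        scores = encoder_states @ decoder_state            # plain dot-product scores
    else:
        scores = encoder_states @ (W @ decoder_state)      # parameterized (bilinear) scores
    beta = softmax(scores)                                  # attention weights, sum to 1
    d_t = beta @ encoder_states                             # weighted average of encoder states
    return beta, d_t

# toy usage
rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))       # 6 encoder positions, hidden size 8
dec = rng.normal(size=8)
beta, d_t = attention_context(dec, enc)
```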
The encoder component in the encoder-decoder model usually scans an input sequence
token by token. It is sometimes advisable to scan the input in the reverse direction, to bias the
encoder to encode the beginning of the sentence as more salient. To help overcome the problem
of long-range dependencies, one may also use a BiLSTM encoder (Huang et al., 2015), in which
the input sequence is scanned both left-to-right and right-to-left. A diagram of a BiLSTM
encoder is given in Figure 9.7.
In addition, it may be preferable for the model to copy a word from the input sequence to
the output sequence. This is especially true in the context of machine translation, in which named
entities in one language may be unseen words that might be written in an identical manner in the
target language. To that end, Gu et al. (2016) introduced the notion of a "copying mechanism,"
which copies words from the input sequence based on the attention vectors. The total probability
of outputting a specific word becomes a mixture between the probability according to a softmax
Figure 9.7: Diagrams for a bidirectional encoder. (a) With bidirectional encoders, there are
two LSTM encoders that read the input left-to-right and right-to-left. As such, the final rep-
resentation at each position is a combination of the states of the two encoders at the
relevant position, reading from both directions; (b) a common schematic representation used to
describe (a).
over the output vocabulary and a softmax (based on the attention weights) over the words in the
input.
From a technical perspective, neural encoder-decoder models are relatively easy to imple-
ment with current computation graph software packages, which makes them even more attrac-
tive as a modeling option. There are also quite a few existing off-the-shelf packages for seq2seq
models that are widely popular, such as OpenNMT (Klein et al., 2017) and Nematus (Sennrich
et al., 2017).
Figure 9.8: An example of CNN for encoding a sentence representation. The first layer is con-
volution, the second layer is max-pooling. There are two convolutions: one of size 4 and one of
size 2. See detailed explanation in the text. Figure adapted from Narayan et al. (2018b).
among the different coordinates of the word embeddings (meaning, coordinates 1 and 2 in the
embedding are not a priori related to each other more than coordinates 1 and 3 are). As such, the entire
word embedding is taken together when applying the sliding window.
The result of applying this convolution is a vector of dimension αM (for some α ≤ 1), depending
on the stride size. It is often the case that many filters are applied on the sentence, each with
a different sliding window, and then max-pooling or another type of pooling is applied on the
results of all filters to get a fixed vector size. Similar to encoding with RNNs, this is an approach
that can be used to reduce a variable-length sentence to a vector representing it with a fixed
dimension. This fixed-dimension vector can now be readily used for a classification step.
Consider, for example, Figure 9.8. Two types of convolutions are described: the top one
with a window size of four words and the bottom one with a window size of two words. Each
word is represented by a word embedding of dimension 4. The windows slide over the sentence
“The king welcomed the guests yesterday evening,” and as such leads to 4 window applications
for the window of size 4 and 6 window applications for the window size of 2. Note that one can
also pad the sentence with beginning and end markers, in which case we would have 7 window
applications for both windows (padding this way is often referred to as a “wide” convolution as
opposed to a "narrow" convolution). The number of filters used for each convolution is 3, which
is why we get a matrix of dimensions 4 × 3 for the upper convolution and a matrix of dimensions
6 × 3 for the bottom convolution. Finally, max-pooling is applied, taking the maximum over the
window positions for each filter in each matrix, leading to a 6-dimensional vector representing the entire sentence. Note
that this dimension of the vector is a function of the number of filters only, and not the length
of the sentence. As such, we get fixed-dimension representations for varying-length sentences.
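A minimal sketch of this convolution-plus-max-pooling encoder, mirroring the worked example of Figure 9.8 (7 words, embeddings of dimension 4, window sizes 4 and 2, three filters each), might look as follows; the tanh nonlinearity and all names are illustrative assumptions.

```python
import numpy as np

def conv_max_pool(embeddings, filters):
    """Apply a 1-D (narrow) convolution over word embeddings and max-pool.

    embeddings: (sentence_length, emb_dim) matrix of word embeddings.
    filters: (num_filters, window_size, emb_dim) filter bank.
    Returns a vector of length num_filters (one max per filter).
    """
    num_filters, window, emb_dim = filters.shape
    n = embeddings.shape[0] - window + 1        # number of window applications
    responses = np.empty((n, num_filters))
    for s in range(n):
        patch = embeddings[s:s + window]        # the entire embeddings in the window
        responses[s] = np.tanh(np.tensordot(filters, patch, axes=([1, 2], [0, 1])))
    return responses.max(axis=0)                # max-pooling over window positions

# the worked example: 7 words, embeddings of dimension 4,
# one convolution with window 4 and one with window 2, 3 filters each
rng = np.random.default_rng(2)
sent = rng.normal(size=(7, 4))
rep = np.concatenate([
    conv_max_pool(sent, rng.normal(size=(3, 4, 4))),   # 4 window applications
    conv_max_pool(sent, rng.normal(size=(3, 2, 4))),   # 6 window applications
])
# rep has dimension 6, independent of the sentence length
```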
More recently, CNNs have been used for sequence-to-sequence modeling, such as ma-
chine translation (Gehring et al., 2017). These models—nicknamed convseq2seq—are, like seq2seq
models that are based on recurrent neural networks (Section 9.4.3), also composed of an
encoder and a decoder component, both of which make use of convolutions.
9.5.1 REGULARIZATION
Regularization is used as a way to prevent overfitting when estimating a model. Overfitting
refers to the case in which the learning algorithm sets the weights of a model in such a way that
it handles idiosyncrasies that appear in the training data, but do not necessarily represent the
general rule. This leads to a significantly higher performance on the training data than on the
held-out test data. Overfitting happens when the estimated model family is highly complex, and
thus has the capacity to retain the idiosyncrasies mentioned above.
To avoid overfitting with model estimation, regularization is often used. With regular-
ization, we add an additional term to the objective function (such as the log-likelihood) that
we optimize during training, and this term is used as a penalty term for models deemed too
complex. The most common regularization terms used with neural network training are the L2
squared-norm of the weights in the neural network, or alternatively the L1 norm (which leads
to sparsity in the set of weights used—many weights are set to 0 during training). Both of these
regularization terms have a Bayesian interpretation, as described in Section 4.2.1.
Another common way to perform regularization is dropout (Srivastava et al., 2014), in
which certain neurons in the neural network are “dropped” during training. This means that
their contribution to the activation and to the gradient computation is 0, which is achieved by
dropping their connections to the rest of the network. Each unit in the hidden layer or in the
input has a certain probability of being dropped at each phase in training (for example, in a batch
of gradient optimization). This prevents the network's dependence on a small number of inputs (or
hidden neurons), or alternatively, the co-dependence of neurons on each other to give the signal for
predicting the output (Eisenstein, 2019).
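As a small illustration, the following sketch shows an (inverted) dropout mask applied to a hidden layer and an L2 penalty term that could be added to the training objective; the names and the inverted-dropout rescaling convention are illustrative assumptions.

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale so the expected activation is unchanged at test time."""
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

def l2_penalty(weights, lam):
    """L2 regularization term: lambda times the sum of squared weights."""
    return lam * sum(np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(3)
hidden = rng.normal(size=10)
hidden_dropped = dropout(hidden, p_drop=0.5, rng=rng)
penalty = l2_penalty([rng.normal(size=(10, 10))], lam=1e-4)
```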
Dropout has been interpreted in a Bayesian manner by Gal and Ghahramani (2016b). In
their model, there is a Gaussian prior on the weights of a neural network. Furthermore, since
inference is intractable, the authors use variational inference (Chapter 6) to find the posterior.
The key idea they introduce is the form of the variational distribution, which is a neural network
with a set of weights that are zeroed out with a certain probability. This is similar to the dropout
approach mentioned in Section 9.4.4. The dropout formulation of Gal and Ghahramani makes
a neural network equivalent to an approximation of deep Gaussian processes—a hierarchical
model where Gaussian processes (see Section 7.5.1) are applied on each other (Damianou and
Lawrence, 2013). Gal and Ghahramani (2016a) further extended their work to the use of CNNs
(Section 9.4.4). The goal of the authors was to use the Bayesian approach on CNNs so that they
also work on small amounts of data without overfitting. Gal and Ghahramani (2016a) tested
their Bayesian neural networks on digit recognition datasets.
More recent work in the NLP community by Chirkova et al. (2018) sets a prior on the
weights of an RNN. This prior is inversely proportional to the absolute value of the weights.
Inference then proceeds variationally, finding the approximate posterior. This prior is used to
find a sparse set of parameters for the RNN.
16 This is less intensive than cross-validation, in which we repeat several experiments, at each time testing the model on
one part of the training set that is left out.
an example, consider the BLEU metric in machine translation, which measures n-gram overlap
between the output of a system and the correct target reference.
In recent years, several proposals have been made to develop “blackbox hyperparameter
optimizers” that do not require gradient computations or an explicit form for the objective func-
tion they optimize. Derivative-free optimization (Rios and Sahinidis, 2013) is one example, and
there are related ideas that date back to the 1950s (Chernoff, 1959). The black-box optimization
algorithms usually take as an input a set of hyperparameters and the performance of a model
with them, and return as an output the next hyperparameter to try out or, alternatively, the final
decision on the hyperparameters to be used.
There is a Bayesian connection to hyperparameter tuning through the notion of Bayesian
optimization (Močkus, 1975, 2012) that has recently been re-discovered in the machine learning
community (Snoek et al., 2012). Say the performance function we are trying to maximize is f(α),
where α is a set of hyperparameters such as the neural network size. Bayesian optimization works
by defining a prior over this function f (usually the prior is a Gaussian process; see more about
Gaussian processes in Section 7.5.1). It then also defines an acquisition function g , a cheap-to-
optimize proxy for f that chooses a point to evaluate using f given the previous points that
were already evaluated. Bayesian optimization works by iterating between the optimization of
the acquisition function, evaluating f at the point the acquisition function chooses and then
updating the posterior over f with the new result the algorithm received.
The acquisition function must take into account the current state of beliefs about f in
order to choose which point to evaluate next. Indeed, there are various types of acquisition
functions, for example, those that choose a point to maximize expected improvement over the
current distribution over f. In that case, we have:

g(α′ | D) = E_{p(f′|D)}[max{0, f′(α′) − f(α_max)}],

where α_max is the point for which f was evaluated to be the largest so far. An alternative to max-
imizing expected improvement is instead maximizing the probability of improvement through
the acquisition function:

g(α′ | D) = E_{p(f′|D)}[I(f′(α′) ≥ f(α_max))].
This acquisition function is often less preferable than the expected improvement, as it
ignores the size of the improvement and just focuses on maximizing the probability of any im-
provement.
The key idea is that the acquisition function has to balance between exploitation (in which
it explores points that are likely to have a high value of f , and low variance according to the cur-
rent maintained distribution over f ) and exploration (in which it explores points that might have
higher variance). A sketch of the Bayesian optimization algorithm is given in Algorithm 9.1. It
can be seen from line 2 why this type of optimization is referred to as Bayesian. The algorithm
takes as input the prior and the evaluations of f , and then finds the posterior over the optimized
function. Following that, in lines 4–5, it chooses the next evaluation points for f .
Input: A function f(α) evaluated as a blackbox, a prior p(f′) over functions of the type
of f, an acquisition function g.
Output: α*, a proposed maximizer for f.
1: Set n = 0, D = ∅
2: Set the current posterior over f, p(f′ | D) = p(f′)
3: repeat
4:   Set α_{n+1} = arg max_{α′} g(α′ | D)
5:   Let y_{n+1} = f(α_{n+1}) (evaluate f)
6:   D ← D ∪ {(α_{n+1}, y_{n+1})}
7:   Update the posterior p(f′ | D) based on the new D
8:   n ← n + 1
9: until a stopping criterion is met (such as n > T for a fixed T)
10: return α* = α_n
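The following is a minimal sketch of this loop, assuming a Gaussian process prior with an RBF kernel, an expected-improvement acquisition function, and a finite grid of candidate hyperparameters; unlike line 10 above, it returns the best point evaluated so far, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, length=1.0):
    """Squared-exponential kernel between two sets of points, shapes (n, d) and (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xstar, noise=1e-6):
    """Posterior mean/std of a zero-mean GP with RBF kernel at the points Xstar."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar)
    Kss = rbf_kernel(Xstar, Xstar)
    Kinv = np.linalg.inv(K)
    mean = Ks.T @ Kinv @ y
    var = np.diag(Kss - Ks.T @ Kinv @ Ks)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, f_max):
    """EI acquisition: E[max{0, f'(alpha') - f(alpha_max)}] under the GP posterior."""
    z = (mean - f_max) / std
    return (mean - f_max) * norm.cdf(z) + std * norm.pdf(z)

def bayesian_optimize(f, candidates, n_iter=10, seed=0):
    """Sketch of Algorithm 9.1 over a finite candidate grid of hyperparameters."""
    rng = np.random.default_rng(seed)
    X = candidates[rng.choice(len(candidates), size=1)]       # initial evaluation point
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        mean, std = gp_posterior(X, y, candidates)            # posterior over f (line 7)
        acq = expected_improvement(mean, std, y.max())        # acquisition (line 4)
        x_next = candidates[int(np.argmax(acq))]
        y_next = f(x_next)                                    # evaluate f (line 5)
        X = np.vstack([X, x_next])
        y = np.append(y, y_next)
    return X[int(np.argmax(y))]                               # best hyperparameter evaluated

# toy usage: maximize a black-box function of one hyperparameter
grid = np.linspace(-3, 3, 61)[:, None]
best = bayesian_optimize(lambda a: float(-(a[0] - 1.0) ** 2), grid)
```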
VAEs as Autoencoders
Autoencoders in general provide an approach to modeling an input distribution and extracting
its salient features into a representation. These encoders work by passing a random variable
Y through a “compression” layer to obtain a hidden representation z for each y . The hidden
representation is then used to recover y in another step. The first step is referred to as “encoding,”
and the second step as “decoding.” Note that we assume that samples from the random variable
Y are observed, while Z is not observed.
Variational autoencoders draw on this basic approach of autoencoders by maintaining a
probabilistic encoder, meaning that there is a distribution q.Z j y/ that probabilistically encodes
every y into a z . When optimizing VAEs, this encoder is pushed to remain close to a prior
distribution over Z (for example, a Gaussian with mean 0). The decoder, in turn, includes a
distribution that takes the encoded z as input and tries to recover y . A key idea in VAEs is
that both the encoders and decoders are represented using neural networks. More precisely, the
encoder q.Z j y/ is often a Gaussian parametrized by, for example in NLP models, an RNN
(that encodes a sentence y ). The decoder, in turn, is a neural network (for example, again, an
RNN) which is parametrized as a function of z .
The objective function we aim to minimize with VAEs is:

KL(q(Z | y) || p(Z)) − E_{q(Z|y)}[log p(y | Z)].

This objective demonstrates the intuition mentioned above. The first term is a KL-
divergence term that keeps q close to the prior p(Z) (see Section A.1.2). The second term
ensures that we can recover y well under the distribution of the possible latent representations
that we get from the encoder.
In practice, a VAE is implemented by specifying encoder and decoder networks. The en-
coder network takes a y as input and, together with a sample from a prior distribution p(Z),
provides a sample from the encoder q(Z | y). Then the decoder network takes that sample as
input and tries to recover the input y.
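The sketch below illustrates this setup with toy feed-forward encoder and decoder networks (in NLP these would typically be RNNs over a sentence, as discussed above): the encoder produces the mean and log-variance of a diagonal Gaussian q(Z | y), a latent sample is drawn via the reparameterization z = μ + σ · ε, and the objective combines the closed-form KL term with a Monte Carlo reconstruction term. All layer sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def encoder(y, params):
    """Encoder network q(Z | y): a single affine layer producing the mean
    and log-variance of a diagonal Gaussian over the latent z."""
    h = np.tanh(params["W_enc"] @ y + params["b_enc"])
    return params["W_mu"] @ h + params["b_mu"], params["W_lv"] @ h + params["b_lv"]

def decoder(z, params):
    """Decoder network: maps z to the mean f(z, theta) of a Gaussian over y."""
    h = np.tanh(params["W_dec"] @ z + params["b_dec"])
    return params["W_out"] @ h + params["b_out"]

def neg_elbo(y, params, sigma2=1.0):
    """VAE objective: KL(q(Z|y) || p(Z)) - E_q[log p(y | Z)], estimated with one
    reparameterized sample z = mu + sigma * eps, eps ~ N(0, I)."""
    mu, logvar = encoder(y, params)
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)   # closed-form KL to N(0, I)
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps                          # reparameterization
    y_hat = decoder(z, params)
    log_lik = -0.5 * np.sum((y - y_hat) ** 2) / sigma2           # Gaussian log p(y|z), up to a constant
    return kl - log_lik

# toy usage with random parameters (d_y = 6, hidden = 5, d_z = 2)
sizes = {"W_enc": (5, 6), "b_enc": 5, "W_mu": (2, 5), "b_mu": 2, "W_lv": (2, 5), "b_lv": 2,
         "W_dec": (5, 2), "b_dec": 5, "W_out": (6, 5), "b_out": 6}
params = {k: rng.normal(scale=0.1, size=v) for k, v in sizes.items()}
loss = neg_elbo(rng.normal(size=6), params)
```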
VAEs can also be set up in a “conditional” setting, in which there is an additional random
variable X such that both the prior p.Z j X/ and the encoder q.Z j Y; X / condition on it. In
terms of the neural network architectures, this means that the instances from X , provided as
part of the training set and the decoding process, are used as additional input to the encoder
and decoder networks. For example, as mentioned below, VAEs in that form can be used for
Machine Translation—X would be the input sentence and Y would be the sentence in the target
language.
Bowman et al. (2016) constructed a VAE model for sentences, where the latent repre-
sentation aims at modeling the semantics of the sentence in a continuous manner. In contrast
to recurrent language models (Mikolov et al., 2011), their model is aimed at creating a global
representation of the sentence through the latent state Z . Both the encoder and decoder models
are RNNs (with a single layer of LSTM cells). Bowman et al. experimented using their mod-
els with the problem of language modeling, where their model performed slightly worse than
an RNN. They also tested their model on a missing-word imputation task, where their model
significantly outperforms the RNN language model baseline.
The idea of using a global semantic space was later used for machine translation. Zhang
et al. (2016) introduced a conditional VAE model (see section below), in which the latent ran-
dom variable Z is used to represent a global semantic space. The posterior that is approximated by
variational means is over the semantic space variables, conditioned on both the source
and target sentences. During decoding, however, access to the sentence in the target language
is eliminated, and instead a model that generates the target sentence is used. The model gener-
ates word y_j (the j-th word) in the target sentence conditioned on all previously generated words, z,
and the source sentence x. The authors use a Monte Carlo approximation to evaluate the
objective function.
In the next section, we describe a full mathematical derivation of VAEs which draws on
connections to Chapter 6. Our derivation is based on that of Doersch (2016).
Let y = y^{(i)} for a specific i. To maximize the above we will use variational inference (see
Section 6.1 for more detail). We define a variational distribution q(z | y) (each training instance
will have a corresponding variational distribution). Then, we aim to minimize the KL-divergence
between q(z | y) and p(z | y, θ, f). Using the definition of KL-divergence and Bayes' rule,
we get:

By rearranging the terms and using the definition of KL-divergence again, we get
that:
VAEs provide a means for drawing samples of Y from the estimated neural network.
This is done first by sampling z from a Gaussian(0, I_{k×k}) and then decoding it by sampling y
from a Gaussian(f(z, θ), σ² I_{d×d}).^{17} This simply follows the generative process described earlier.
In doing so, with VAEs we learn a generative distribution over a sample space. However, it is
also possible to construct VAE models that make predictions. Such VAEs are called conditional
VAEs, and they assume the existence of another observed random variable X. Our goal in this
case is to learn a conditional model p(Y | X) which makes use of a latent variable Z:

p(Y | X) = ∫_z p(Y | X, z) p(z | X) dz.

The main difference between this model and the model which
does not include X is that in this case the neural network f is a function of both z and x. Most
commonly, Y is a Gaussian with mean f(z, x, θ) and a diagonal covariance matrix with
σ² on the diagonal.
In the case of z being a discrete random variable, a problem exists with this reparameter-
ization, because there is no g that would be differentiable with respect to θ = (θ_1, ..., θ_K) over
K events, as required by Equation 9.26. This is where the Gumbel-softmax "trick" comes
into play. Let U_1, ..., U_K be a sequence of random variables drawn from a uniform distribution
on the interval [0, 1]. It holds that if G_k = −log(−log U_k) and we define the random variable
X as:

X = arg max_k (log θ_k + G_k),        (9.28)

then X is distributed according to the categorical distribution (Appendix B.1) with parameters
(θ_1, ..., θ_K). The random variables G_k are said to have a Gumbel distribution. Note that this way
we managed to reparameterize the categorical distribution with respect to θ. More specifically, if
we wish q(· | θ) to be a multinomial distribution, we can define ε = (G_1, ..., G_K). The function
g(ε, θ), in turn, outputs a one-hot vector of length K (i.e., a vector which is all zeros except for
one coordinate, where it is one) with 1 in the coordinate specified by Equation 9.28. We
then have a reparameterization, where expectations with respect to q can be calculated using
samples of ε (as described in Equation 9.27).
However, there is still a remaining problem: the function g specified above is not dif-
ferentiable. This is where we relax the use of one-hot vectors to any vector in the probability
simplex of dimension K − 1 (see also Chapter 2 regarding the probability simplex). We define
g: R^K → R^K to be:

[g(ε, θ)]_k = exp((log θ_k + ε_k)/τ) / Σ_{j=1}^{K} exp((log θ_j + ε_j)/τ),

where τ is a hyperparameter (the smaller it is, the more peaked g is). Using this reparameterization
trick followed by the application of the softmax distribution, we can use a latent variable z
over the probability simplex, a better fit for discrete data. See Maddison et al. (2017) and also
Section B.9.
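A short sketch of both the exact Gumbel-max sample of Equation 9.28 and its softmax relaxation follows; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_gumbel(size):
    """G = -log(-log U) with U uniform on [0, 1]: standard Gumbel noise."""
    u = rng.uniform(1e-12, 1.0, size)
    return -np.log(-np.log(u))

def gumbel_max_sample(theta):
    """Exact categorical sample via Equation 9.28: argmax_k (log theta_k + G_k)."""
    return int(np.argmax(np.log(theta) + sample_gumbel(len(theta))))

def gumbel_softmax(theta, tau=0.5):
    """Relaxed (differentiable) sample on the probability simplex:
    [g(eps, theta)]_k = exp((log theta_k + eps_k)/tau) / sum_j exp((log theta_j + eps_j)/tau)."""
    logits = (np.log(theta) + sample_gumbel(len(theta))) / tau
    logits -= logits.max()                     # numerical stability
    e = np.exp(logits)
    return e / e.sum()

theta = np.array([0.1, 0.6, 0.3])
hard = gumbel_max_sample(theta)                # an exact (index) sample
soft = gumbel_softmax(theta, tau=0.1)          # close to one-hot for small tau
```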
min_G max_D (1/n) Σ_{i=1}^{n} log D(y^{(i)}) + E_{p(Z)}[log(1 − D(G(z)))].
The first term is just a reference to the data, and is there to tune D so that it indeed
matches the distribution over the data. The second term represents the main idea behind the
minimax objective. On the one hand, this term is maximized with respect to D , in an attempt
to give low probability to elements that are generated by G . On the other hand, we also attempt
to minimize this term, and thus “fool” D to identify examples generated by G as having high
probability according to p.Y /.
We now specify a generative model over Y that is learned from the data by using G (once
it is learned). The model over Y is specified using the following generative model to draw a y :
z ∼ p(Z),
y = G(z).
In practice, D and G are parameterized through some parameters θ_D and θ_G, and in this
manner the minimax objective is optimized with respect to these parameters. In line with neural
network training, this objective can be optimized using optimization algorithms similar to those
used in stochastic gradient descent.
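The sketch below evaluates the minimax objective by Monte Carlo and draws samples from the generative model z ∼ p(Z), y = G(z); the discriminator and generator here are fixed toy functions, and the alternating gradient updates of θ_D and θ_G are omitted. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def gan_objective(D, G, data, z_samples):
    """The minimax value: (1/n) sum_i log D(y^(i)) + E_p(Z)[log(1 - D(G(z)))],
    with the expectation estimated by Monte Carlo over z_samples."""
    data_term = np.mean([np.log(D(y)) for y in data])
    gen_term = np.mean([np.log(1.0 - D(G(z))) for z in z_samples])
    return data_term + gen_term

def sample_from_generator(G, num, z_dim=2):
    """Drawing from the learned model: z ~ p(Z) (a standard normal here), y = G(z)."""
    return np.array([G(rng.normal(size=z_dim)) for _ in range(num)])

# toy discriminator and generator with fixed parameters, for illustration only
w_d, w_g = rng.normal(size=3), rng.normal(size=(3, 2))
D = lambda y: 1.0 / (1.0 + np.exp(-w_d @ y))     # logistic discriminator
G = lambda z: w_g @ z                            # linear generator
data = rng.normal(size=(10, 3))
value = gan_objective(D, G, data, rng.normal(size=(10, 2)))
samples = sample_from_generator(G, num=5)
```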
While GANs in their vanilla form have been successfully applied to learn distributions
over continuous domains (such as images and audio), their use in this form is more limited for
NLP. GANs work better with observed data that is continuous, but not as well with discrete
(observed) data, which are common in NLP. This is because it is a non-trivial task to build
a generator (for which the gradients with respect to the parameters can be calculated) based
on a noise distribution. Unlike with images, for example, where there is a spectrum of images
according to a continuous measure of distance between them, it is less clear how to create such
a spectrum for text and measure the distance between, say, a pair of sentences. Still, there has
been some recent work in NLP that discusses GANs, for example, work by Caccia et al. (2018)
and Tevet et al. (2018).
While GANs are intended to learn the distribution over the sample space represented by
the given datapoints, they often just tend to memorize the training examples that they are trained
with (known as “mode collapse”). In addition, because we also find a point estimate for a GAN
and do not perform full posterior inference, we are unable to represent any multimodality over
the parameter setting. In order to alleviate some of these problems, Saatci and Wilson (2017)
introduced the idea of Bayesian GANs. In their formulation, priors are placed on the parameters
of the generator and the discriminator (p(θ_G | α_G) and p(θ_D | α_D)). Inference
is performed by iteratively sampling from the following posteriors:

p(θ_G | z, θ_D) ∝ ( ∏_{j=1}^{R} D(G(z^{(a_j)})) ) p(θ_G | α_G),

p(θ_D | z, y^{(1)}, ..., y^{(n)}, θ_G) ∝ ( ∏_{i=1}^{N} D(y^{(b_i)}) ) ( ∏_{j=1}^{R} (1 − D(G(z^{(a_j)}))) ) p(θ_D | α_D),

where z^{(1)}, ..., z^{(R)} are samples drawn from p(Z) during each step of sampling. In addition,
{a_j}_{j=1}^{R} indexes a subset of the samples from p(Z) (used in a mini-batch) and {b_i}_{i=1}^{N} indexes
a subset of the observed data points y^{(i)}, i ∈ [n] (also used in a mini-batch). Saatci and Wilson
propose using a stochastic gradient Hamiltonian Monte Carlo (SGHMC; Chen et al. 2014) to
sample from these posteriors.
The formulation of Goodfellow et al. (2014) for GANs is a specific case of Bayesian GANs
in which we use uniform priors over the discriminators and generators. In addition, with Good-
fellow et al.’s original formulation, we follow MAP estimation (see Section 4.2.1) rather than
iterating between the posteriors over the generators and discriminators.
9.7 CONCLUSION
Representation learning and neural networks have become important tools in the NLP toolkit.
They have replaced the need to manually construct “features” in linear models with automatic
feature and representation construction. This requires architecture engineering of the neural net-
work. It also marks a shift in NLP, which began with representing words, sentences, paragraphs
and even whole documents as dense continuous vectors. The two most common neural network
architectures used in NLP are recurrent and convolutional.
This chapter gives an overview of the main modeling neural network techniques prevalent
in NLP, but covers just the tip of the iceberg. The use of neural networks in NLP is a moving
target, and develops quite rapidly. For example, there is some evidence that models such as the
“Transformer” model (Vaswani et al., 2017) or convolutional seq2seq models (Gehring et al.,
2017) give high performance on certain NLP tasks, higher than that of recurrent seq2seq models.
Their advantage is also computational, as their functions can be more easily parallelized on a
GPU than recurrent neural network functions. The architectural details of these models and
others are not covered in this chapter.
While neural networks have transformed much of NLP and have permeated most of the
modeling currently done in the area (with this chapter covering a small part in this regard), it
remains to be seen whether this research will be able to break a “barrier” that still exists in NLP:
communicating and reasoning about natural language at more than just a shallow level, perhaps
coming closer to human-level communication.
Reasoning at such a level requires a significant understanding of the world that is difficult
to encode purely by looking at specific input-output pairs of a given problem as used in training
a neural network model. Furthermore, neural networks have yet to overcome the problem of
learning from smaller amounts of data. For example, with Machine Translation, it is claimed
that for language pairs with small amounts of parallel data (low-resource setting), older statistical
machine translation techniques outperform neural machine translation (Koehn and Knowles
2017; but see also Artetxe et al. 2018, Sennrich et al. 2016). Other important applications, such
as complex question answering, summarization and dialog (for example, “chatbots”) are also
far from being solved. For example, many commercial dialog systems are still largely reliant on
systems with manually crafted rules and scripts.
Bayesian learning can make a small, but significant contribution in that respect. Insights
from latent-variable inference in the Bayesian setting (such as variational inference) have been
used and can be further used to develop estimation and learning algorithms for neural networks.
In addition, traditional Bayesian modeling requires construction of a generative model (for ex-
ample, in the form of a graphical model), and therefore leads to more interpretable models. Neu-
ral networks, on the other hand, often yield “blackboxes,” which make it difficult to understand
how a specific neural network performs a task at hand. This is again a case where concepts from
Bayesian learning, such as generation of interpretable latent structures, can help bring progress
to the area of neural networks in NLP through understanding about neural model decisions.
This direction in understanding neural networks as an interpretable model, not necessarily in a
Bayesian context, has spurred much current interest in the NLP community (Lei et al., 2016,
Li et al., 2015, Linzen, 2018), including a recent series of workshops devoted to it.19
There is also a flipside to interpreting neural networks in a Bayesian context, in which rep-
resentations originating in Bayesian modeling are used to augment the representations that neu-
ral networks obtain. For example, Mikolov and Zweig (2012), Ghosh et al. (2016), and Narayan
et al. (2018a) used the latent Dirichlet allocation model (LDA; Chapter 2) in conjunction with
neural networks to provide networks with additional contextual topical information.
For more information about the use of neural networks in NLP, see Goldberg (2017)
and Eisenstein (2019).
19 See https://blackboxnlp.github.io/.
9.8 EXERCISES
9.1. Discuss: What are the desired properties of word embeddings and their relationship to
each other in the embedding space? Does it depend on the problem at hand, and if so,
how?
9.2. Show that the XOR problem cannot be solved using a linear classifier. More specifically,
show that if you are given a dataset with examples (x_1, x_2) and labels y(x_1, x_2):

(x_1, x_2)   y
(0, 0)       0
(0, 1)       1
(1, 0)       1
(1, 1)       0

then there are no weights w_1, w_2 and bias b such that

w_1 x_1 + w_2 x_2 + b ≥ 0

if and only if y(x_1, x_2) = 1. Derive a two-layered neural network for the same input
with the step activation function such that the network perfectly classifies the XOR
data above (you may choose the number of units for the middle layer).
9.3. For the above neural network, change its activation function to the sigmoid function
and derive the backpropagation update rules for it when using the log likelihood of the
data as an objective function. The optimizer you need to create the update rules for is
gradient descent (see Section A.3).
9.4. Show that the log-likelihood function objective for the neural network above has mul-
tiple maxima.
9.5. What update rules (for the log-likelihood objective function) do you get when following
the backpropagation derivation recipe for the network in Figure 9.1? Will the update
rules lead to convergence to the global maximum of the corresponding log-likelihood
function?
9.6. Calculate the derivative of each activation function in Table 9.1 and show that
tanh(x) = 2σ(2x) − 1, where σ is the sigmoid function.
CLOSING REMARKS
Bayesian NLP is a relatively new area of NLP that emerged in the early 2000s and has
only recently matured to its current state. Its future still remains to be seen. Dennis Gabor, a
Nobel prize-winning physicist, once said (in a paraphrase) "we cannot predict the future but we
can invent it." This applies to Bayesian NLP too, I believe. There are a few key areas in which
Bayesian NLP could be further strengthened.
• Confluence with other machine learning techniques—In recent years, neural networks
have become an important tool in the NLP machine learning toolkit. Yet, very little work
has been done in NLP to connect Bayesian learning with these neural networks, although
previous work connecting the two exists in the machine learning literature. Bayesian learn-
ing can be used to control the complexity of the structure of a neural network, and it can
also be used for placing a prior on the parameter weights.
• Scaling up Bayesian inference in NLP—In the past decade, the scale of text resources that
NLP researchers have been working with has grown tremendously. One of the criticisms
of Bayesian analysis in machine learning and NLP is that Bayesian inference does not
scale (computationally) to large datasets in the "Big Data" age. Methods such as MCMC
inference are slow to converge and operate at a much smaller scale than we are now used
to. Still, in recent years, researchers in the statistics and machine learning communities
have made progress in scalable Bayesian inference algorithms, for example, by creating
stochastic versions of MCMC methods and variational inference methods. This knowl-
edge has not yet transferred to the NLP community in full form, and in order to conduct
inference with large datasets in a Bayesian context in NLP, this might be necessary. For a
discussion of Bayesian inference in the Big Data age, see Jordan (2011) and Welling et al.
(2014).
APPENDIX A
Basic Concepts
A.1 BASIC CONCEPTS IN INFORMATION THEORY
This section defines some basic concepts in information theory, such as entropy, cross entropy
and KL divergence. For a full introduction to information theory, see Cover and Thomas (2012).
Entropy is always non-negative. If the entropy is 0, then the random variable is a constant
value with probability 1. The larger the entropy is, the larger the uncertainty in the random
variable; in a sense, the distribution of the random variable is closer to the uniform distribution. (The
convention is to use 0 for 0 log 0 terms, which are otherwise undefined. The limit of p log p as
p → 0 is indeed 0.)
When log_2 is used instead of log, the entropy provides the expected number of bits re-
quired to encode the random variable as follows. Each value x is assigned a code that
consists of −log_2 p(X = x) bits. The motivation behind this can be demonstrated through the
notion of cross-entropy. The cross-entropy H(p, q) between two distributions for a given ran-
dom variable is defined as:

H(p, q) = −Σ_x p(X = x) log q(X = x).

When log_2 is used, the cross-entropy gives the expected number of bits used to encode
random samples from p when using for each x ∈ Ω a code of length −log_2 q(X = x). The cross-
entropy is minimized, min_q H(p, q), when p = q. In this case, H(p, q) = H(p). Therefore, en-
coding random samples from p using a code that assigns each x ∈ Ω a code of −log_2 p(X = x) bits is
optimal in this sense.
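A small sketch computing entropy and cross-entropy (in bits) for discrete distributions, using the 0 log 0 = 0 convention, is given below; the names are illustrative.

```python
import numpy as np

def entropy(p, base=2):
    """H(p) = -sum_x p(x) log p(x), with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum_x p(x) log q(x); equals H(p) when p = q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

p = [0.5, 0.25, 0.25]
print(entropy(p))                          # 1.5 bits
print(cross_entropy(p, p))                 # also 1.5: cross-entropy is minimized at q = p
print(cross_entropy(p, [1/3, 1/3, 1/3]))   # about 1.585 bits, larger than H(p)
```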
When using the natural logarithm for calculating entropy, entropy is measured in "nats."
Entropies calculated with a different base for the logarithm change by a multiplicative
factor, as log_a b = log_c b / log_c a for any a, b, c > 0.
The notion of entropy can be naturally extended to continuous random variables as well
(as can the notion of cross-entropy). If θ is a random variable with density p(θ) taking values
in R, then the entropy H(θ) is defined as:

H(θ) = −∫_{−∞}^{∞} p(θ) log p(θ) dθ.
The entropy of a continuous random variable is also called “differential entropy.” There are
several differences between the entropy of discrete and continuous random variables: the entropy
of a continuous random variable may be negative or diverge to infinity, and it also does not stay
invariant under a change of variable, unlike discrete random variable entropy.
As such, it immediately holds that for a concave function g, g(E[X]) ≥ E[g(X)] (i.e.,
the negation of a convex function is concave). Jensen's inequality is used to derive the evidence
lower bound for variational inference (Chapter 6). The function g that is used is g(x) = log x,
with Jensen's inequality being applied to the marginal log-likelihood. See Section 6.1 for more
details.
where s_i: Ω → R is defined as s_i(θ) = [s(θ)]_i (the i-th coordinate of s). J is also called "the Jaco-
bian" (in this case, of s).
This transformation is also often used to compute integrals (whether in probability theory
or outside of it), following a change of variables in the integral. It is often the case that following
such a change, the integrals are easier to compute, or are reduced to well-known integrals that
have analytic solutions.
Generally, this function is not convex, and has multiple global maxima. It is often also
computationally difficult to find its global maximum. The EM algorithm is a coordinate ascent
algorithm that iteratively creates a sequence of parameters θ_1, θ_2, ... such that L(θ_i) ≥ L(θ_{i−1})
and that eventually converges to a local maximum of L(θ).
Note first that L(θ) can also be expressed in the following manner:

L(θ) = Σ_{i=1}^{n} log ( E_{q_i(Z)} [ p(x^{(i)}, Z | θ) / q_i(Z) ] ),

for any set of fixed distributions q_1(Z), ..., q_n(Z) over the latent variables with a support that
subsumes the support of p for Z (to see this, just unfold the expectation under q_i(Z), and con-
sider that the term q_i(z) in the numerator and the denominator will cancel). Jensen's inequality
tells us we can define the following bound B(θ | q_1, ..., q_n) ≤ L(θ) for any θ and q_i as above:

B(θ | q_1, ..., q_n) = Σ_{i=1}^{n} E_{q_i(Z)} [ log ( p(x^{(i)}, Z | θ) / q_i(Z) ) ].
It can be shown that for any θ, B(θ | q_1, ..., q_n) = L(θ) when q_i(Z) = p(Z | x^{(i)}, θ). The
EM algorithm capitalizes on this observation, and iteratively maximizes the lower bound B by
alternating between maximizing B with respect to θ and maximizing the bound with respect to
the q_i. Therefore, the EM algorithm works as follows.

• Initialize θ_1 with some value.

• Repeat until B(θ | q_1, ..., q_n) converges (or for a fixed number of iterations):

  (E-Step:) Compute q_i(Z) = p(Z | x^{(i)}, θ_t) for i ∈ {1, ..., n}, where θ_t is the current parameter estimate, and identify the bound B(θ | q_1, ..., q_n).

  (M-Step:) Set θ_{t+1} ← arg max_θ B(θ | q_1, ..., q_n).
Note that the EM algorithm is not the only option for maximizing the lower bound
B(θ | q_1, ..., q_n). Other optimization techniques can also be used, usually reaching a local max-
imum as well.
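As a concrete illustration of the E-step/M-step alternation, the sketch below runs EM for a two-component one-dimensional Gaussian mixture with unit variances; this particular model and all names are illustrative assumptions, not an algorithm taken from the text.

```python
import numpy as np

def em_gaussian_mixture(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture with unit variances.
    E-step: q_i(Z) = p(Z | x^(i), theta); M-step: maximize the bound B in theta."""
    rng = np.random.default_rng(7)
    pi, mu = 0.5, rng.normal(size=2)                    # initialize theta_1
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        log_w = np.stack([np.log(pi) - 0.5 * (x - mu[0]) ** 2,
                          np.log(1 - pi) - 0.5 * (x - mu[1]) ** 2])
        log_w -= log_w.max(axis=0)
        q = np.exp(log_w)
        q /= q.sum(axis=0)                              # q[k, i] = q_i(Z = k)
        # M-step: closed-form maximizer of the lower bound
        pi = q[0].mean()
        mu = np.array([np.sum(q[k] * x) / np.sum(q[k]) for k in range(2)])
    return pi, mu

# toy usage: data drawn from two well-separated components
rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
pi_hat, mu_hat = em_gaussian_mixture(x)
```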
For example, D can be a set of training examples and ℓ can be the log-probability of x^{(i)}
under a model p(· | θ). We also assume that ℓ is differentiable with respect to θ. Our goal is to
identify

θ* = arg max_θ f(θ | D).

The most basic procedure is "gradient ascent," which operates by computing the gradient
of ℓ with respect to θ and then using the following update rule:

θ_{t+1} ← θ_t + η Σ_{i=1}^{n} ∇_θ ℓ(θ_t | x^{(i)}).        (A.2)
^1 It is easy to focus this discussion on the optimization of a general differentiable function f(θ) with respect to θ, as is
commonly described in the optimization literature.
that starts with a pre-initialized θ_0 ∈ R^d and then creates a sequence θ_0, θ_1, ..., θ_t, ... of up-
dated parameters. The key idea behind this update rule is to "take a step" with the current θ_t in
the direction of the gradient, as the gradient gives the direction in which the function increases
in value. When the goal is to minimize the function, the update rule changes to

θ_{t+1} ← θ_t − η Σ_{i=1}^{n} ∇_θ ℓ(θ_t | x^{(i)}),

this time taking a step against the direction of the gradient. The value η is a real number that is
referred to as the "step size" or the "learning rate." It controls the magnitude of the step in the
direction of the gradient. A small value may lead to slow convergence on θ*, while too large
a value may lead to alternating between points that could perhaps be close to θ*, but "miss" it by
some significant distance.
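A minimal sketch of the gradient-ascent update of Equation A.2, applied to the toy problem of maximizing a Gaussian log-likelihood in its mean, is given below; the names are illustrative.

```python
import numpy as np

def gradient_ascent(grad, theta0, eta=0.1, n_steps=100):
    """theta_{t+1} <- theta_t + eta * sum_i grad_theta l(theta_t | x^(i)) (Equation A.2).
    Here `grad` returns the full gradient summed over the data."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta + eta * grad(theta)
    return theta

# toy usage: maximize the log-likelihood of a Gaussian mean given data x
x = np.array([1.0, 2.0, 4.0])
grad_loglik = lambda mu: np.sum(x - mu)      # d/dmu of sum_i -(x_i - mu)^2 / 2
mu_hat = gradient_ascent(grad_loglik, theta0=0.0)
# converges to the sample mean (about 2.33) for a suitably small step size
```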
There are also second-order optimization methods that make use of the Hessian of the
optimized function for faster convergence. (The Hessian of a real-valued function f: R^d → R at
a point θ is the matrix H ∈ R^{d×d} such that H_ij = ∂²f/(∂θ_i ∂θ_j)(θ).) The second-order Newton's method
uses an update rule that looks like:

θ_{t+1} ← θ_t + H(θ_t)^{-1} ∇_θ f(θ_t).

This update is based on a second-order Taylor expansion of f around θ_t. Since calculating
the inverse of the Hessian is expensive (as is even just calculating the Hessian itself, which is
quadratic in d), methods that make use of an approximation of the Hessian are used. These are
referred to as quasi-Newton methods.
APPENDIX B
Distribution Catalog
This appendix gives some basic information about the various distributions that are mentioned
in this book.
Notes.
• When n = 1, the distribution is a "categorical distribution" over binary vectors that sum
to 1. With the categorical distribution, Ω can be any set of k objects. Sometimes the
categorical distribution is referred to as a "multinomial distribution," since the categorical
distribution is a specific case of the multinomial distribution when Ω consists of the binary vectors
mentioned above.
• A distribution over a finite set A = {a_1, ..., a_d} with a probability θ_i associated with each
a_i may also often be referred to as a multinomial distribution.
B.2 THE DIRICHLET DISTRIBUTION
Parameters: an integer d ≥ 2 and positive values α_1, ..., α_d.
Sample space: Ω ⊆ R^d; θ ∈ Ω is such that θ_i ≥ 0 and Σ_{i=1}^{d} θ_i = 1.
PDF: f(θ) = (1/B(α)) ∏_{i=1}^{d} θ_i^{α_i − 1}, where B(α) is the Beta function.
Mode: (μ_1, ..., μ_d) such that μ_i = (α_i − 1) / (Σ_{i=1}^{d} α_i − d), if α_i > 1.
Mean: E[θ_i] = α_i / Σ_{i=1}^{d} α_i.
Mean: E[log θ_i] = ψ(α_i) − ψ(Σ_{i=1}^{d} α_i), where ψ is the digamma function.
Variance: Var(θ_i) = α_i(α* − α_i) / ((α*)²(α* + 1)), where α* = Σ_i α_i.
Notes.
• B(α) is the Beta function, defined as:

B(α) = ∏_{i=1}^{d} Γ(α_i) / Γ(Σ_{i=1}^{d} α_i).

• ψ is the digamma function, defined as:

ψ(x) = (d/dx) log Γ(x).

It does not have an analytic form, and can be approximated using numerical recipes, or
through a series expansion approximation.^1 (Chapter 3).

• When d = 2, the Dirichlet distribution can be viewed as defining a distribution over
[0, 1] (since θ_2 = 1 − θ_1). In that case, it is called the Beta distribution (see Section 2.2.1).
Notes.
• If X_1, ..., X_n are independent Poisson random variables with rates λ_1, ..., λ_n, then
p(X_1, ..., X_n | Σ_{i=1}^{n} X_i = K) is a multinomial distribution with parameters K and
θ_i = λ_i / Σ_{i=1}^{n} λ_i (Section B.1).
Notes.
• Another common parametrization for it uses two parameters, a shape α and a "rate" β,
where β = 1/θ.
• If X_i ∼ Gamma(α_i, 1) for α_1, ..., α_K are independently distributed, then
(X_1 / Σ_{i=1}^{K} X_i, ..., X_K / Σ_{i=1}^{K} X_i) is distributed according to the Dirichlet distribution
with parameters (α_1, ..., α_K). See also Section 3.2.1.
Notes.
• The multivariate normal distribution is conjugate to itself, when considering the mean
parameters.
• Often referred to as the multivariate Gaussian distribution, or just the Gaussian distribu-
tion, named after Carl Friedrich Gauss (1777-1855).
B.6 THE LAPLACE DISTRIBUTION
Parameters: μ ∈ R, λ > 0.
Sample space: Ω = R.
PDF: f(θ) = (1/(2λ)) exp(−|θ − μ|/λ).
Mean: E[θ] = μ.
Mode: μ.
Variance: Var(θ) = 2λ².
Entropy: 1 + log(2λ).
Notes.
• Can be used as a Bayesian interpretation for L1 regularization (Section 4.2.1).
Notes.
• The PDF is defined as:

f(θ) = (1 / √((2π)^d det(Σ))) (∏_{i=1}^{d} θ_i)^{-1} exp( −(1/2) (log(θ_{−d}/θ_d) − μ)^⊤ Σ^{-1} (log(θ_{−d}/θ_d) − μ) ),

where θ_{−d} denotes the vector (θ_1, ..., θ_{d−1}) and the logarithm and division are applied coordinate-wise.

• If η is drawn from a multivariate normal distribution with mean μ and covariance Σ, then θ defined by

θ_i = exp(η_i) / (1 + Σ_{j=1}^{d−1} exp(η_j))   for all i ∈ {1, ..., d−1},
θ_d = 1 / (1 + Σ_{j=1}^{d−1} exp(η_j)),

is distributed according to the logistic normal distribution.
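As a small illustration of this transformation, the sketch below (assuming NumPy; names are illustrative) draws a Gaussian vector and maps it onto the probability simplex:

```python
import numpy as np

def logistic_normal_sample(mu, Sigma, rng):
    """Draw theta on the d-simplex by sampling eta ~ N(mu, Sigma) in R^{d-1}
    and applying the logistic transformation given above."""
    eta = rng.multivariate_normal(mu, Sigma)
    denom = 1.0 + np.sum(np.exp(eta))
    theta = np.append(np.exp(eta) / denom, 1.0 / denom)
    return theta            # nonnegative and sums to 1

rng = np.random.default_rng(9)
theta = logistic_normal_sample(np.zeros(2), np.eye(2), rng)   # a distribution over d = 3 categories
```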
Mean: E[T] = Ψ / (m − p − 1).
Mode: Ψ / (m + p + 1).
Variance: Var(T_ij) = ((m − p + 1) Ψ_ij² + (m − p − 1) Ψ_ii Ψ_jj) / ((m − p)(m − p − 1)²(m − p − 3)).
Notes.
• The function tr(A) for a matrix A ∈ R^{p×p} is defined as the trace: the sum of all diagonal
elements of A, Σ_{i=1}^{p} A_ii.

• If A is drawn from the Wishart distribution, then A^{-1} is drawn from the inverse Wishart
distribution.

• The inverse Wishart is a conjugate prior for the covariance matrix parameter of a multi-
variate normal distribution.
B.9 THE GUMBEL DISTRIBUTION
Parameters: μ ∈ R, β > 0.
Sample space: Ω = R.
PDF: f(x) = (1/β) exp( −(x − μ)/β − exp(−(x − μ)/β) ).
Mode: μ.
Mean: μ + βγ, where γ ≈ 0.5772156649 is the Euler-Mascheroni constant.
Variance: π²β²/6.
Entropy: log β + γ + 1.
Notes.
• If U is a uniform random variable on [0, 1], then G = μ − β log(−log U) has a Gumbel
distribution as above.

• For a sequence of K independent Gumbel random variables G_1, ..., G_K with μ = 0 and
β = 1, and for θ = (θ_1, ..., θ_K) (denoting a categorical distribution), we may define the concrete
distribution (Maddison et al., 2017) over the probability simplex of degree K − 1 with a
random variable X = (X_1, ..., X_K):
Bibliography
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S.,
Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In
Proc. of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI),
vol. 16, pages 265–283. 217
Abney, S., McAllester, D., and Pereira, F. (1999). Relating probabilistic grammars and au-
tomata. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics,
pages 542–549, College Park, MD. DOI: 10.3115/1034678.1034759. 182
Ahmed, A. and Xing, E. P. (2007). On tight approximate inference of the logistic normal
topic admixture model. In Proc. of the 11th International Conference on Artifical Intelligence and
Statistics. Omnipress. 88
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall, London.
DOI: 10.1007/978-94-009-4109-0. 59, 61, 63, 64, 65
Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien,
F., Bayer, J., Belikov, A., Belopolsky, A., et al. (2016). Theano: A python framework for fast
computation of mathematical expressions. ArXiv Preprint ArXiv:1605.02688, 472:473. 217
Altun, Y., Hofmann, T., and Smola, A. J. (2004). Gaussian process classification for segment-
ing and annotating sequences. In Proc. of the 21st International Conference on Machine Learn-
ing (ICML 2004), pages 25–32, New York, Max-Planck-Gesellschaft, ACM Press. DOI:
10.1145/1015330.1015433. 172
Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. (2003). An introduction to MCMC
for machine learning. Machine Learning, 50(1-2), pages 5–43. 125
Artetxe, M., Labaka, G., Agirre, E., and Cho, K. (2018). Unsupervised neural machine trans-
lation. 254
Ash, R. B. and Doléans-Dade, C. A. (2000). Probability and measure theory. Access online via
Elsevier. 2, 10
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning
to align and translate. In Proc. of the 3rd International Conference on Learning Representations
(ICLR). 237
Barkan, O. (2017). Bayesian neural word embedding. In Proc. of the 31st Conference on Artificial
Intelligence (AAAI), pages 3135–3143. 220, 221
Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden Markov model.
In Machine Learning, pages 29–245. MIT Press. 180
Bejan, C., Titsworth, M., Hickl, A., and Harabagiu, S. (2009). Nonparametric Bayesian models
for unsupervised event coreference resolution. In Bengio, Y., Schuurmans, D., Lafferty, J.,
Williams, C., and Culotta, A., Eds., Advances in Neural Information Processing Systems 22,
pages 73–81. Curran Associates, Inc. 27
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic lan-
guage model. Journal of Machine Learning Research, 3(Feb):1137–1155. DOI: 10.1007/3-
540-33486-6_6 216
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer. DOI:
10.1007/978-1-4757-4286-2. 41, 72, 89
Berger, A. L., Pietra, V. J. D., and Pietra, S. A. D. (1996). A maximum entropy approach to
natural language processing. Computational Linguistics, 22(1), pages 39–71. 27
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 86, 141
Bisk, Y. and Hockenmaier, J. (2013). An HDP model for inducing combinatory categorial
grammars. Transactions of the Association for Computational Linguistics, 1, pages 75–88. 210
Black, E., Abney, S., Flickenger, D., Gdaniec, C., Grishman, R., Harrison, P., Hindle, D.,
Ingria, R., Jelinek, F., Klavans, J., Liberman, M., Marcus, M., Roukos, S., Santorini, B., and
Strzalkowski, T. (1991). A procedure for quantitatively comparing the syntactic coverage
of English grammars. In Proc. of DARPA Workshop on Speech and Natural Language. DOI:
10.3115/112405.112467. 149
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research, 3, pages 993–1022. 27, 30, 31
Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested chinese restaurant process and
Bayesian nonparametric inference of topic hierarchies. Journal of the ACM ( JACM), 57(2),
page 7. DOI: 10.1145/1667053.1667056. 173
Blei, D. M. and Frazier, P. I. (2011). Distance dependent chinese restaurant processes. Journal
of Machine Learning Research, 12, pages 2461–2488. 174
Blei, D. M. and Jordan, M. I. (2004). Variational methods for the Dirichlet process. In Proc. of
the 21st International Conference on Machine Learning. DOI: 10.1145/1015330.1015439. 163,
196, 201
Blei, D. M. and Lafferty, J. D. (2006). Correlated topic models. In Weiss, Y., Schölkopf, B.,
and Platt, J., Eds., Advances in Neural Information Processing Systems 18, pages 147–154. MIT
Press. 61, 62
Blunsom, P. and Cohn, T. (2010a). Inducing synchronous grammars with slice sampling. In
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the
Association for Computational Linguistics, pages 238–241, Los Angeles, CA. 27, 117, 118
Blunsom, P. and Cohn, T. (2010b). Unsupervised induction of tree substitution grammars for
dependency parsing. In Proc. of the 2010 Conference on Empirical Methods in Natural Language
Processing, pages 1204–1213, Cambridge, MA. Association for Computational Linguistics.
27
Blunsom, P., Cohn, T., Dyer, C., and Osborne, M. (2009a). A Gibbs sampler for phrasal
synchronous grammar induction. In Proc. of the Joint Conference of the 47th Annual Meet-
ing of the ACL and the 4th International Joint Conference on Natural Language Processing of
the AFNLP, pages 782–790, Suntec, Singapore. Association for Computational Linguistics.
DOI: 10.3115/1690219.1690256. 118, 205
Blunsom, P., Cohn, T., and Osborne, M. (2009b). Bayesian synchronous grammar induction. In
Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., Eds., Advances in Neural Information
Processing Systems 21, pages 161–168. Curran Associates, Inc. 205
Börschinger, B. and Johnson, M. (2014). Exploring the role of stress in Bayesian word segmen-
tation using adaptor grammars. Transactions of the Association for Computational Linguistics,
2(1), pages 93–104. 29
Bouchard-côté, A., Petrov, S., and Klein, D. (2009). Randomized pruning: Efficiently calcu-
lating expectations in large dynamic programs. In Bengio, Y., Schuurmans, D., Lafferty, J.,
Williams, C., and Culotta, A., Eds., Advances in Neural Information Processing Systems 22,
pages 144–152. Curran Associates, Inc. 118
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. (2016).
Generating sentences from a continuous space. Proc. of the 20th SIGNLL Conference on Com-
putational Natural Language Learning (CoNLL). DOI: 10.18653/v1/k16-1002 247
Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
DOI: 10.1017/cbo9780511804441 265
Bražinskas, A., Havrylov, S., and Titov, I. (2017). Embedding words as distributions with a
Bayesian skip-gram model. ArXiv Preprint ArXiv:1711.11027. 221
Bryant, M. and Sudderth, E. B. (2012). Truly nonparametric online variational inference for
hierarchical Dirichlet processes. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K.,
Eds., Advances in Neural Information Processing Systems 25, pages 2699–2707. Curran Asso-
ciates, Inc. 167
Burstall, R. M. and Darlington, J. (1977). A transformation system for developing recursive
programs. Journal of the ACM, 24(1), pages 44–67. DOI: 10.1145/321992.321996. 188
Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. (2018). Language
GANs falling short. ArXiv Preprint ArXiv:1811.02549. 253
Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent
data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3),
pages 593–613. DOI: 10.1111/j.1467-9868.2009.00698.x. 152
Carlin, B. P. and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis. CRC
Press. DOI: 10.1201/9781420057669. 52
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker,
M. A., Guo, J., Li, P., and Riddell, A. (2015). Stan: a probabilistic programming language.
Journal of Statistical Software. 141
Carter, S., Dymetman, M., and Bouchard, G. (2012). Exact sampling and decoding in high-
order hidden Markov models. In Proc. of the 2012 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning, pages 1125–1134,
Jeju Island, Korea. Association for Computational Linguistics. 123
Casella, G. and Berger, R. L. (2002). Statistical Inference. Duxbury Pacific Grove, CA. DOI:
10.2307/2532634. 13
Casella, G. and George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician,
46(3), pages 167–174. DOI: 10.2307/2685208. 132
Castano, A. and Casacuberta, F. (1997). A connectionist approach to machine translation. In
Proc. of the 5th European Conference on Speech Communication and Technology. 236
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., and Blei, D. M. (2009). Reading tea
leaves: How humans interpret topic models. In Bengio, Y., Schuurmans, D., Lafferty, J.,
Williams, C., and Culotta, A., Eds., Advances in Neural Information Processing Systems 22,
pages 288–296. Curran Associates, Inc. 36
Chen, H., Branavan, S., Barzilay, R., Karger, D. R., et al. (2009). Content modeling using
latent permutations. Journal of Artificial Intelligence Research, 36(1), pages 129–163. 27
Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo.
In Proc. of the 31st International Conference on Machine Learning (ICML), pages 1683–1691.
DOI: 10.24963/ijcai.2018/419 253
Chen, S. F. and Goodman, J. (1996). An empirical study of smoothing techniques for language
modeling. In Proc. of the 34th Annual Meeting of the Association of Computational Linguistics,
pages 310–318, Stroudsburg, PA. DOI: 10.3115/981863.981904. 82, 170
Chirkova, N., Lobacheva, E., and Vetrov, D. (2018). Bayesian compression for natural language
processing. In Proc. of the Conference on Empirical Methods in Natural Language Processing
(EMNLP). DOI: 10.1162/coli_r_00310 243
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statis-
tical machine translation. Proc. of the Conference on Empirical Methods in Natural Language
Processing (EMNLP). DOI: 10.3115/v1/d14-1179 235, 236
Cocke, J. and Schwartz, J. T. (1970). Programming languages and their compilers: Preliminary
notes. Technical report, Courant Institute of Mathematical Sciences, New York University.
182
Cohen, S. B. (2017). Latent-variable PCFGs: Background and applications. In Proc. of the 15th
Meeting on the Mathematics of Language (MOL). DOI: 10.18653/v1/w17-3405 201
Cohen, S. B. and Collins, M. (2014). A provably correct learning algorithm for latent-variable
PCFGs. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1052–1061, Baltimore, MD. DOI: 10.3115/v1/p14-1099. 202
Cohen, S. B., Gimpel, K., and Smith, N. A. (2009). Logistic normal priors for unsupervised
probabilistic grammar induction. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L.,
Eds., Advances in Neural Information Processing Systems 21, pages 321–328. Curran Associates,
Inc. 62, 149, 193
Cohen, S. B., Blei, D. M., and Smith, N. A. (2010). Variational inference for adaptor grammars.
In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of
the Association for Computational Linguistics, pages 564–572, Los Angeles, CA. 196, 198
Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2013). Experiments
with spectral learning of latent-variable PCFGs. In Proc. of the 2013 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies, pages 148–157, Atlanta, GA. 202
Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2014). Spectral learning
of latent-variable PCFGs: Algorithms and sample complexity. Journal of Machine Learning
Research, 15, pages 2399–2449. 202
Cohen, S. and Smith, N. A. (2009). Shared logistic normal distributions for soft parameter
tying in unsupervised grammar induction. In Proc. of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of the Association for Computational Linguistics,
pages 74–82, Boulder, CO. DOI: 10.3115/1620754.1620766. 63
Cohen, S. and Smith, N. A. (2010a). Viterbi training for PCFGs: Hardness results and com-
petitiveness of uniform initialization. In Proc. of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 1502–1511, Uppsala, Sweden. 139
Cohn, T., Blunsom, P., and Goldwater, S. (2010). Inducing tree-substitution grammars. The
Journal of Machine Learning Research, 11, pages 3053–3096. 210
Collobert, R., Bengio, S., and Mariéthoz, J. (2002). Torch: A modular machine learning soft-
ware library. Technical Report, Idiap. 217
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning Research,
12(Aug):2493–2537. 216
Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.
259
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal of Physics,
14(1), pages 1–13. DOI: 10.1119/1.1990764. 66
Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Proc. of the 16th Inter-
national Conference on Artificial Intelligence and Statistics (AISTATS), pages 207–215. 243
Daume, H. (2007). Fast search for Dirichlet process mixture models. In Meila, M. and Shen, X.,
Eds., Proc. of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS-
07), vol. 2, pages 83–90. Journal of Machine Learning Research—Proceedings Track. 164
Daume III, H. (2007). Frustratingly easy domain adaptation. In Proc. of the 45th Annual Meeting
of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. 93
Daume III, H. (2009). Non-parametric Bayesian areal linguistics. In Proc. of Human Language
Technologies: The 2009 Annual Conference of the North American Chapter of the Association for
Computational Linguistics, pages 593–601, Boulder, CO. DOI: 10.3115/1620754.1620841.
27
Daume III, H. and Campbell, L. (2007). A Bayesian model for discovering typological im-
plications. In Proc. of the 45th Annual Meeting of the Association of Computational Linguistics,
pages 65–72, Prague, Czech Republic. 27
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), pages 1–38.
145
DeNero, J., Bouchard-Côté, A., and Klein, D. (2008). Sampling alignment structure under a
Bayesian translation model. In Proc. of the 2008 Conference on Empirical Methods in Natural
Language Processing, pages 314–323, Honolulu, HI. Association for Computational Linguis-
tics. DOI: 10.3115/1613715.1613758. 111
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional transformers for language understanding. ArXiv Preprint ArXiv:1810.04805.
221
Eisner, J. (2016). Inside-outside and forward-backward algorithms are just backprop (tutorial
paper). In Proc. of the Workshop on Structured Prediction for NLP, pages 1–17. 228
Elsner, M., Goldwater, S., Feldman, N., and Wood, F. (2013). A joint learning model of word
segmentation, lexical acquisition, and phonetic variability. In Proc. of the 2013 Conference on
Empirical Methods in Natural Language Processing, pages 42–54, Seattle, WA. Association for
Computational Linguistics. 29
Escobar, M. D. (1994). Estimating normal means with a Dirichlet process prior. Journal of the
American Statistical Association, 89(425), pages 268–277. DOI: 10.2307/2291223. 163
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using
mixtures. Journal of the American Statistical Association, 90(430), pages 577–588. DOI:
10.1080/01621459.1995.10476550. 163
Fienberg, S. E. (2011). Bayesian models and methods in public policy and government settings.
Statistical Science, 26(2), pages 212–226. DOI: 10.1214/10-sts331. 43
Finetti, B. d. (1980). Foresight: Its logical laws, its subjective sources. In Kyburg, H. E. and
Smokler, H. E., Eds., Studies in Subjective Probability, pages 99–158. 8
Finkel, J. R., Grenager, T., and Manning, C. D. (2007). The infinite tree. In Proc. of the 45th
Annual Meeting of the Association of Computational Linguistics, pages 272–279, Prague, Czech
Republic. 203
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their appli-
cations. Biometrika, 57(1), pages 97–109. DOI: 10.2307/2334940. 114
Henderson, J. (2003). Inducing history representations for broad coverage statistical parsing.
In Human Language Technologies: The 2003 Annual Conference of the North American Chapter of
the Association for Computational Linguistics. DOI: 10.3115/1073445.1073459 216
Hoffman, M., Bach, F. R., and Blei, D. M. (2010). Online learning for latent Dirichlet al-
location. In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., and Culotta, A., Eds.,
Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc.
152
Hofmann, T. (1999b). Probabilistic latent semantic indexing. In Proc. of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR’99, pages 50–57, New York. DOI: 10.1145/312624.312649. 31
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are uni-
versal approximators. Neural Networks, 2(5):359–366. DOI: 10.1016/0893-6080(89)90020-8
215
Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The
90% solution. In Proc. of the Human Language Technology Conference of the NAACL, Companion
Volume: Short Papers, pages 57–60, New York. Association for Computational Linguistics. 92
Hu, Z., Yang, Z., Salakhutdinov, R., and Xing, E. P. (2018). On unifying deep generative
models. 246
Huang, Y., Zhang, M., and Tan, C. L. (2011). Nonparametric Bayesian machine transliteration
with synchronous adaptor grammars. In Proc. of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, pages 534–539, Portland, OR. 205
Huang, Y., Zhang, M., and Tan, C.-L. (2012). Improved combinatory categorial grammar in-
duction with boundary words and Bayesian inference. In Proc. of COLING 2012, pages 1257–
1274, Mumbai, India. The COLING 2012 Organizing Committee. 210
Huang, Z., Xu, W., and Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging.
ArXiv Preprint ArXiv:1508.01991. 238
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. DOI:
10.1017/cbo9780511790423. xxvi, 22, 66
Jiang, T., Wang, L., and Zhang, K. (1995). Alignment of trees—an alternative to tree edit.
Theoretical Computer Science, 143(1), pages 137–148. DOI: 10.1016/0304-3975(95)80029-9.
208
Johnson, M. (2007b). Why doesn’t EM find good HMM POS-taggers? In Proc. of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL), pages 296–305, Prague, Czech Republic. Association
for Computational Linguistics. 146
Johnson, M. (2008). Using adaptor grammars to identify synergies in the unsupervised ac-
quisition of linguistic structure. In Proc. of ACL-08: HLT, pages 398–406, Columbus, OH.
Association for Computational Linguistics. 29
Johnson, M., Griffiths, T., and Goldwater, S. (2007a). Bayesian inference for PCFGs via
Markov chain Monte Carlo. In Human Language Technologies 2007: The Conference of the
North American Chapter of the Association for Computational Linguistics; Proceedings of the Main
Conference, pages 139–146, Rochester, NY. 26, 190
Johnson, M., Griffiths, T. L., and Goldwater, S. (2007b). Adaptor grammars: A framework
for specifying compositional nonparametric Bayesian models. In Schölkopf, B., Platt, J., and
Hoffman, T., Eds., Advances in Neural Information Processing Systems 19, pages 641–648. MIT
Press. 27, 128, 194, 198
Johnson, M., Demuth, K., Jones, B., and Black, M. J. (2010). Synergies in learning words and
their referents. In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., and Culotta, A.,
Eds., Advances in Neural Information Processing Systems 23, pages 1018–1026. Curran Asso-
ciates, Inc. 29
Johnson, M., Christophe, A., Dupoux, E., and Demuth, K. (2014). Modelling function words
improves unsupervised word segmentation. In Proc. of the 52nd Annual Meeting of the Associ-
ation for Computational Linguistics (Volume 1: Long Papers), pages 282–292, Baltimore, MD.
DOI: 10.3115/v1/p14-1027. 29
Jones, B., Johnson, M., and Goldwater, S. (2012). Semantic parsing with Bayesian tree trans-
ducers. In Proc. of the 50th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 488–496, Jeju Island, Korea. 210
Jordan, M. I. (2011). Message from the president: The era of big data. International Society for
Bayesian Analysis (ISBA) Bulletin, 18(2), pages 1–3. 258
Joshi, M., Das, D., Gimpel, K., and Smith, N. A. (2010). Movie reviews and revenues: An
experiment in text regression. In Human Language Technologies: The 2010 Annual Conference
of the North American Chapter of the Association for Computational Linguistics, pages 293–296,
Los Angeles, CA. 40
Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for
modelling sentences. In Proc. of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), vol. 1, pages 655–665. DOI: 10.3115/v1/p14-1062 240
Kallmeyer, L. and Maier, W. (2010). Data-driven parsing with probabilistic linear context-free
rewriting systems. In Proc. of the 23rd International Conference on Computational Linguistics
(Coling 2010), pages 537–545, Beijing, China. Coling 2010 Organizing Committee. DOI:
10.1162/coli_a_00136. 177
Kasami, T. (1965). An efficient recognition and syntax-analysis algorithm for context-free lan-
guages. Technical Report AFCRL-65-758, Air Force Cambridge Research Lab. 182
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), pages 400–401. DOI: 10.1109/tassp.1987.1165125. 82
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. ArXiv Preprint
ArXiv:1412.6980. 265
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. of the 2nd
International Conference on Learning Representations (ICLR). 245
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. (2017). OpenNMT: Open-source
toolkit for neural machine translation. In Proc. of the System Demonstrations of the 55th Annual
Meeting of the Association for Computational Linguistics, pages 67–72. DOI: 10.18653/v1/p17-
4012 239
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In Proc.
of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. I, pages 181–
184, Detroit, MI. IEEE Inc. DOI: 10.1109/icassp.1995.479394. 82, 170
Koehn, P. and Knowles, R. (2017). Six challenges for neural machine translation. In Proc. of
the 1st Workshop on Neural Machine Translation, pages 28–39, Association for Computational
Linguistics. DOI: 10.18653/v1/w17-3204 254
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
MIT Press. 19, 151
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems 25,
pages 1097–1105. DOI: 10.1145/3065386 240
Kübler, S., McDonald, R., and Nivre, J. (2009). Dependency Parsing. Synthe-
sis Lectures on Human Language Technologies. Morgan & Claypool. DOI:
10.2200/s00169ed1v01y200901hlt002. 203
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic
differentiation variational inference. ArXiv Preprint ArXiv:1603.00788. 141
Kulis, B. and Jordan, M. I. (2011). Revisiting k-means: New algorithms via Bayesian nonpara-
metrics. ArXiv Preprint ArXiv:1111.0352. 155
Kumar, S. and Byrne, W. (2004). Minimum Bayes-risk decoding for statistical machine translation. In Dumais, S., Marcu, D., and Roukos, S., Eds., HLT-NAACL 2004: Main Proceedings, pages 169–176, Boston, MA. Association for Computational Linguistics. 90
Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., and Steedman, M. (2012a). A probabilistic
model of syntactic and semantic acquisition from child-directed utterances and their mean-
ings. In Proc. of the 13th Conference of the European Chapter of the Association for Computational
Linguistics, pages 234–244, Avignon, France. 152
Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., and Steedman, M. (2012b). A probabilistic
model of syntactic and semantic acquisition from child-directed utterances and their mean-
ings. In Proc. of the 13th Conference of the European Chapter of the Association for Computational
Linguistics, pages 234–244, Avignon, France. 210
Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In
Proc. of the 31st International Conference on Machine Learning (ICML), pages 1188–1196. 222
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel,
L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Compu-
tation, 1(4):541–551. DOI: 10.1162/neco.1989.1.4.541 214
Lei, T., Barzilay, R., and Jaakkola, T. (2016). Rationalizing neural predictions. In Proc.
of the Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI:
10.18653/v1/d16-1011 255
Levenberg, A., Dyer, C., and Blunsom, P. (2012). A Bayesian model for learning SCFGs with
discontiguous rules. In Proc. of the 2012 Joint Conference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural Language Learning, pages 223–232, Jeju Island,
Korea. Association for Computational Linguistics. 111, 205
Levy, R. P., Reali, F., and Griffiths, T. L. (2009). Modeling the effects of memory on human
online sentence processing with particle filters. In Koller, D., Schuurmans, D., Bengio, Y.,
and Bottou, L., Eds., Advances in Neural Information Processing Systems 21, pages 937–944.
Curran Associates, Inc. 129
Li, J., Chen, X., Hovy, E., and Jurafsky, D. (2015). Visualizing and understanding neural models
in NLP. ArXiv Preprint ArXiv:1506.01066. DOI: 10.18653/v1/n16-1082 255
Liang, P., Petrov, S., Jordan, M., and Klein, D. (2007). The infinite PCFG using hierarchi-
cal Dirichlet processes. In Proc. of the 2007 Joint Conference on Empirical Methods in Nat-
ural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL),
pages 688–697, Prague, Czech Republic. Association for Computational Linguistics. 200
Liang, P. and Klein, D. (2009). Online EM for unsupervised models. In Proc. of
Human Language Technologies: The 2009 Annual Conference of the North American Chap-
ter of the Association for Computational Linguistics, pages 611–619, Boulder, CO. DOI:
10.3115/1620754.1620843. 152
Lidstone, G. J. (1920). Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8(182). 82
Lin, C.-C., Wang, Y.-C., and Tsai, R. T.-H. (2009). Modeling the relationship among lin-
guistic typological features with hierarchical Dirichlet process. In Proc. of the 23rd Pacific
Asia Conference on Language, Information and Computation, pages 741–747, Hong Kong. City
University of Hong Kong. 27
Lindsey, R., Headden, W., and Stipicevic, M. (2012). A phrase-discovering topic model using
hierarchical Pitman-Yor processes. In Proc. of the 2012 Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning, pages 214–222,
Jeju Island, Korea. Association for Computational Linguistics. 117
Linzen, T. (2018). What can linguistics and deep learning contribute to each other? ArXiv
Preprint ArXiv:1809.04179. DOI: 10.1353/lan.2019.0001 255
Liu, H., Simonyan, K., and Yang, Y. (2018). DARTS: Differentiable architecture search. ArXiv
Preprint ArXiv:1806.09055. 243
Luong, T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based
neural machine translation. In Proc. of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1412–1421. DOI: 10.18653/v1/d15-1166 238
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural
network acoustic models. In Proc. of the 30th International Conference on Machine Learning
(ICML), page 3. 224
MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. Neural
Computation, 4(3):448–472. DOI: 10.1162/neco.1992.4.3.448 228
Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous
relaxation of discrete random variables. In Proc. of the 5th International Conference on Learning
Representations (ICLR). 252, 273
Mandt, S., Hoffman, M., and Blei, D. (2016). A variational analysis of stochastic gradient
algorithms. In Proc. of 33rd International Conference on Machine Learning (ICML), pages 354–
363. 265
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19(2), pages 313–330. 181
Matsuzaki, T., Miyao, Y., and Tsujii, J. (2005). Probabilistic CFG with latent annotations.
In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05),
pages 75–82, Ann Arbor, MI. DOI: 10.3115/1219840.1219850. 201, 211
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Biophysics, 5(4):115–133. DOI: 10.1007/bf02478259
214
McGrayne, S. B. (2011). The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy. Yale University Press. xxvi
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21,
pages 1087–1092. DOI: 10.1063/1.1699114. 114
Mikolov, T., Kombrink, S., Burget, L., Černocky, J., and Khudanpur, S. (2011). Exten-
sions of recurrent neural network language model. In Proc. of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. DOI:
10.1109/icassp.2011.5947611 247
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word repre-
sentations in vector space. ArXiv Preprint ArXiv:1301.3781. 218, 219, 220
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed repre-
sentations of words and phrases and their compositionality. In Burges, C. J. C., Bottou, L.,
Welling, M., Ghahramani, Z., and Weinberger, K. Q., Eds., Advances in Neural Information
Processing Systems 26, pages 3111–3119, Curran Associates, Inc. 219
Mikolov, T. and Zweig, G. (2012). Context dependent recurrent neural network language model. In Proc. of the IEEE Spoken Language Technology Workshop (SLT), pages 234–239. DOI: 10.1109/slt.2012.6424228 255
Mimno, D., Wallach, H., and McCallum, A. (2008). Gibbs sampling for logistic normal topic
models with graph-based priors. In NIPS Workshop on Analyzing Graphs. 61
Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011). Optimizing
semantic coherence in topic models. In Proc. of the 2011 Conference on Empirical Methods
in Natural Language Processing, pages 262–272, Edinburgh, Scotland, UK. Association for
Computational Linguistics. 36
Minka, T. (1999). The Dirichlet-tree distribution. Technical report, Justsystem Pittsburgh
Research Center. 55
Minka, T. (2000). Bayesian linear regression. Technical report, Massachusetts Institute of
Technology. 41
Minsky, M. and Papert, S. (1969). Perceptrons. MIT Press. DOI: 10.7551/mitpress/11301.001.0001 214
Mitchell, J. and Lapata, M. (2008). Vector-based models of semantic composition. In Proc. of ACL-08: HLT, pages 236–244, Columbus, OH. 216
Močkus, J. (1975). On Bayesian methods for seeking the extremum. In Proc. of the IFIP Technical
Conference on Optimization Techniques, pages 400–404, Springer. DOI: 10.1007/978-3-662-
38527-2_55 244
Močkus, J. (2012). Bayesian Approach to Global Optimization: Theory and Applications, vol. 37,
Springer Science & Business Media. DOI: 10.2307/2008419 244
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. 17, 222
Narayan, S., Cohen, S. B., and Lapata, M. (2018a). Don’t give me the details, just the sum-
mary! Topic-aware convolutional neural networks for extreme summarization. In Proc. of the
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1797–1807.
255
Narayan, S., Cohen, S. B., and Lapata, M. (2018b). Ranking sentences for extractive summa-
rization with reinforcement learning. In Proc. of the Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 1747–1759.
DOI: 10.18653/v1/n18-1158 241
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture
models. Journal of Computational and Graphical Statistics, 9(2), pages 249–265. DOI:
10.2307/1390653. 161, 163
Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31, pages 705–767. DOI:
10.1214/aos/1056562461. 115
Neal, R. M. (2012). Bayesian Learning for Neural Networks, vol. 118, Springer Science &
Business Media. DOI: 10.1007/978-1-4612-0745-0 229
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental,
sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer. DOI:
10.1007/978-94-011-5014-9_12. 152
Neco, R. P. and Forcada, M. L. (1997). Asynchronous translations with recurrent neural nets.
In Proc. of the International Conference on Neural Networks, vol. 4, pages 2535–2540, IEEE.
DOI: 10.1109/icnn.1997.614693 214, 236
Neiswanger, W., Wang, C., and Xing, E. P. (2014). Asymptotically exact, embarrassingly par-
allel MCMC. In Proc. of the 30th Conference on Uncertainty in Artificial Intelligence, UAI,
pages 623–632, Quebec City, Quebec, Canada. AUAI Press. 112
Neubig, G., Watanabe, T., Sumita, E., Mori, S., and Kawahara, T. (2011). An unsupervised
model for joint phrase alignment and extraction. In Proc. of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies, pages 632–641, Port-
land, OR. 205
Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Balles-
teros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. (2017). DyNet: The dynamic neural
network toolkit. ArXiv Preprint ArXiv:1701.03980. 217
Newman, D., Asuncion, A., Smyth, P., and Welling, M. (2009). Distributed algorithms for
topic models. Journal of Machine Learning Research, 10, pages 1801–1828. 112
Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic
coherence. In Human Language Technologies: The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pages 100–108, Los Angeles, CA. 36
Noji, H., Mochihashi, D., and Miyao, Y. (2013). Improvements to the Bayesian topic n-gram
models. In Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing,
pages 1180–1190, Seattle, WA. Association for Computational Linguistics. 170
O’Neill, B. (2009). Exchangeability, correlation, and Bayes’ effect. International Statistical Re-
view, 77(2), pages 241–250. DOI: 10.1111/j.1751-5823.2008.00059.x. 9
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models.
Computational Linguistics, 29(1), pages 19–51. DOI: 10.1162/089120103321337421. 206,
209
Omohundro, S. M. (1992). Best-first Model Merging for Dynamic Learning and Recognition.
International Computer Science Institute. 211
Pajak, B., Bicknell, K., and Levy, R. (2013). A model of generalization in distributional learn-
ing of phonetic categories. In Demberg, V. and Levy, R., Eds., Proc. of the 4th Workshop on
Cognitive Modeling and Computational Linguistics, pages 11–20, Sofia, Bulgaria. Association
for Computational Linguistics. 29
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neu-
ral networks. In Proc. of the 30th International Conference on Machine Learning (ICML),
pages 1310–1318. 230, 232
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, San Mateo, CA. 18
Perfors, A., Tenenbaum, J. B., Griffiths, T. L., and Xu, F. (2011). A tutorial introduction
to Bayesian models of cognitive development. Cognition, 120(3), pages 302–321. DOI:
10.1016/j.cognition.2010.11.015. 29
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L.
(2018). Deep contextualized word representations. In Proc. of the Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies, (Volume 1: Long Papers), vol. 1, pages 2227–2237. DOI: 10.18653/v1/n18-1202 221
Petrov, S., Barrett, L., Thibaux, R., and Klein, D. (2006). Learning accurate, compact, and
interpretable tree annotation. In Proc. of the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 433–
440, Sydney, Australia. DOI: 10.3115/1220175.1220230. 202, 211
Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution de-
rived from a stable subordinator. The Annals of Probability, 25(2), pages 855–900. DOI:
10.1214/aop/1024404422. 168
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1–2):77–
105. DOI: 10.1016/0004-3702(90)90005-k 231
Post, M. and Gildea, D. (2009). Bayesian learning of a tree substitution grammar. In Proc. of
the ACL-IJCNLP 2009 Conference Short Papers, pages 45–48, Suntec, Singapore. Association
for Computational Linguistics. DOI: 10.3115/1667583.1667599. 211
Post, M. and Gildea, D. (2013). Bayesian tree substitution grammars as a usage-based approach.
Language and Speech, 56, pages 291–308. DOI: 10.1177/0023830913484901. 211
Preoţiuc-Pietro, D. and Cohn, T. (2013). A temporal model of text periodicities using Gaussian
processes. In Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing,
pages 977–988, Seattle, WA. Association for Computational Linguistics. 172
Prescher, D. (2005). Head-driven PCFGs with latent-head statistics. In Proc. of the 9th Inter-
national Workshop on Parsing Technology, pages 115–124, Vancouver, British Columbia. Asso-
ciation for Computational Linguistics. DOI: 10.3115/1654494.1654506. 201, 211
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proc. of the IEEE, 77(2), pages 257–286. DOI: 10.1109/5.18626. 179
Raftery, A. E. and Lewis, S. M. (1992). Practical Markov chain Monte Carlo: Comment: One
long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statis-
tical Science, 7(4), pages 493–497. 121
Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Wiley-Interscience. 52,
257
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT
Press. DOI: 10.1007/978-3-540-28650-9_4. 172
Ravi, S. and Knight, K. (2011). Deciphering foreign language. In Proc. of the 49th Annual Meet-
ing of the Association for Computational Linguistics: Human Language Technologies, pages 12–21,
Portland, OR. 111
Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. (2018). Regularized evolution for image
classifier architecture search. ArXiv Preprint ArXiv:1802.01548. 243
Rios, L. M. and Sahinidis, N. V. (2013). Derivative-free optimization: A review of algorithms
and comparison of software implementations. Journal of Global Optimization, 56(3):1247–
1293. DOI: 10.1007/s10898-012-9951-y 244
Robert, C. P. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer. DOI:
10.1007/978-1-4757-3071-5. 119, 121, 122, 123, 128
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and orga-
nization in the brain. Psychological Review, 65(6):386. DOI: 10.1037/h0042519 214
Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from
here? Proc. of the IEEE, 88(8), pages 1270–1278. DOI: 10.1109/5.880083. 170
Roth, D. and Yih, W.-t. (2005). Integer linear programming inference for conditional random
fields. In Proc. of the 22nd International Conference on Machine Learning (ICML), pages 736–
743, ACM. DOI: 10.1145/1102351.1102444 265
Rozenberg, G. and Ehrig, H. (1999). Handbook of Graph Grammars and Computing by Graph
Transformation, vol. 1. World Scientific, Singapore. DOI: 10.1142/9789812384720. 177
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by
back-propagating errors. Cognitive Modeling, 5(3):1. DOI: 10.1038/323533a0 214
Saatci, Y. and Wilson, A. G. (2017). Bayesian GAN. In Advances in Neural Information Pro-
cessing Systems 30, pages 3622–3631. 253
Sankaran, B., Haffari, G., and Sarkar, A. (2011). Bayesian extraction of minimal SCFG rules for
hierarchical phrase-based translation. In Proc. of the 6th Workshop on Statistical Machine Trans-
lation, pages 533–541, Edinburgh, Scotland. Association for Computational Linguistics. 205
Sato, M.-A. and Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network.
Neural Computation, 12(2), pages 407–432. DOI: 10.1162/089976600300015853. 152
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dy-
namics of learning in deep linear neural networks. ArXiv Preprint ArXiv:1312.6120. 228
Sennrich, R., Haddow, B., and Birch, A. (2016). Improving neural machine translation models
with monolingual data. In Proc. of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). DOI: 10.18653/v1/p16-1009 254
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M.,
Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017). Nematus: A toolkit
for neural machine translation. In Proc. of the Software Demonstrations of the 15th Conference
of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia,
Spain. DOI: 10.18653/v1/e17-3017 239
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4,
pages 639–650. 157
Shareghi, E., Haffari, G., Cohn, T., and Nicholson, A. (2015). Structured prediction of se-
quences and trees using infinite contexts. In Machine Learning and Knowledge Discovery in
Databases, pages 373–389. Springer. DOI: 10.1007/978-3-319-23525-7_23. 175
Shareghi, E., Li, Y., Zhu, Y., Reichart, R., and Korhonen, A. (2019). Bayesian learning for
neural dependency parsing. Proc. of the Annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL). 229
Shindo, H., Miyao, Y., Fujino, A., and Nagata, M. (2012). Bayesian symbol-refined tree sub-
stitution grammars for syntactic parsing. In Proc. of the 50th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 440–448, Jeju Island, Korea. 27,
211
Sirts, K., Eisenstein, J., Elsner, M., and Goldwater, S. (2014). POS induction with distributional
and morphological information using a distance-dependent Chinese restaurant process. In
Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers), pages 265–271, Baltimore, MD. DOI: 10.3115/v1/p14-2044. 174
Smith, N. A. (2011). Linguistic Structure Prediction. Synthesis Lectures on Human Language
Technologies. Morgan & Claypool. DOI: 10.2200/s00361ed1v01y201105hlt013. 186
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine
learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–
2959. 244
Snyder, B. and Barzilay, R. (2008). Unsupervised multilingual learning for morphological seg-
mentation. In Proc. of ACL-08: HLT, pages 737–745, Columbus, OH. Association for Com-
putational Linguistics. 27
Snyder, B., Naseem, T., Eisenstein, J., and Barzilay, R. (2008). Unsupervised multilingual
learning for POS tagging. In Proc. of the 2008 Conference on Empirical Methods in Natural
Language Processing, pages 1041–1050, Honolulu, HI. Association for Computational Lin-
guistics. DOI: 10.3115/1613715.1613851. 27, 206
Snyder, B., Naseem, T., and Barzilay, R. (2009a). Unsupervised multilingual grammar induc-
tion. In Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th In-
ternational Joint Conference on Natural Language Processing of the AFNLP, pages 73–81, Sun-
tec, Singapore. Association for Computational Linguistics. DOI: 10.3115/1687878.1687890.
208
Snyder, B., Naseem, T., Eisenstein, J., and Barzilay, R. (2009b). Adding more languages
improves unsupervised multilingual part-of-speech tagging: a Bayesian non-parametric ap-
proach. In Proc. of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pages 83–91, Boulder, CO.
DOI: 10.3115/1620754.1620767. 208
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: How “less
is more” in unsupervised dependency parsing. In Human Language Technologies: The 2010
Annual Conference of the North American Chapter of the Association for Computational Linguistics,
pages 751–759, Los Angeles, CA. 149
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learn-
ing Research, 15(1):1929–1958. 242
Steedman, M. (2000). The Syntactic Process, vol. 35. MIT Press. 210
Steedman, M. and Baldridge, J. (2011). Combinatory categorial grammar. In Borsley, R. and Borjars, K., Eds., Non-Transformational Syntax, pages 181–224, Oxford. 177
Steyvers, M. and Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic
Analysis, 427(7), pages 424–440. DOI: 10.4324/9780203936399.ch21. 34
Stolcke, A. (2002). SRILM: An extensible language modeling toolkit. In Proc. of the International
Conference on Spoken Language Processing, pages 901–904, Denver, CO. International Speech
Communication Association (ISCA). 81
Stolcke, A. and Omohundro, S. (1994). Inducing probabilistic grammars by Bayesian
model merging. In Grammatical Inference and Applications, pages 106–118. Springer. DOI:
10.1007/3-540-58473-0_141. 73, 211
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger,
K. Q., Eds., Advances in Neural Information Processing Systems 27, pages 3104–3112, Curran
Associates, Inc. 236, 237
Synnaeve, G., Dautriche, I., Börschinger, B., Johnson, M., and Dupoux, E. (2014). Unsu-
pervised word segmentation in context. In Proc. of COLING 2014, the 25th International
Conference on Computational Linguistics: Technical Papers, pages 2326–2334, Dublin, Ireland.
Dublin City University and Association for Computational Linguistics. 29
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet pro-
cesses. Journal of the American Statistical Association, 101(476), pages 1566–1581. DOI:
10.1198/016214506000000302. 166
Teh, Y. W., Kurihara, K., and Welling, M. (2008). Collapsed variational inference for HDP. In
Platt, J., Koller, D., Singer, Y., and Roweis, S., Eds., Advances in Neural Information Processing
Systems 20, pages 1481–1488. Curran Associates, Inc. 167
Teh, Y. W., Thiery, A. H., and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic
gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225. 229
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow
a mind: Statistics, structure, and abstraction. Science, 331(6022), pages 1279–1285. DOI:
10.1126/science.1192788. 29
Tesnière, L., Osborne, T. J., and Kahane, S. (2015). Elements of Structural Syntax. John Ben-
jamins Publishing Company. DOI: 10.1075/z.185. 202
Tevet, G., Habib, G., Shwartz, V., and Berant, J. (2018). Evaluating text GANs as language
models. ArXiv Preprint ArXiv:1810.12686. 253
Titov, I. and Henderson, J. (2010). A latent variable model for generative dependency parsing.
In Trends in Parsing Technology, pages 35–55, Springer. DOI: 10.3115/1621410.1621428 216
Titov, I. and Klementiev, A. (2012). A Bayesian approach to unsupervised semantic role induc-
tion. In Proc. of the 13th Conference of the European Chapter of the Association for Computational
Linguistics, pages 12–22, Avignon, France. 174
Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared
task: Language-independent named entity recognition. In Daelemans, W. and Osborne,
M., Eds., Proc. of the 7th Conference on Natural Language Learning at HLT-NAACL 2003,
pages 142–147. DOI: 10.3115/1119176. 92
Tromble, R., Kumar, S., Och, F., and Macherey, W. (2008). Lattice Minimum Bayes-Risk de-
coding for statistical machine translation. In Proc. of the 2008 Conference on Empirical Methods
in Natural Language Processing, pages 620–629, Honolulu, HI. Association for Computational
Linguistics. DOI: 10.3115/1613715.1613792. 90
Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and general
method for semi-supervised learning. In Proc. of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 384–394. 215
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of se-
mantics. Journal of Artificial Intelligence Research, 37:141–188. DOI: 10.1613/jair.2934 216
Upton, G. and Cook, I. (2014). A Dictionary of Statistics, 3rd ed., Oxford University Press.
DOI: 10.1093/acref/9780199679188.001.0001. 121
Van Gael, J., Saatci, Y., Teh, Y. W., and Ghahramani, Z. (2008). Beam sampling for the infinite
hidden Markov model. In Proc. of the 25th International Conference on Machine Learning,
pages 1088–1095. ACM Press. DOI: 10.1145/1390156.1390293. 118, 181
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing
Systems 30, pages 5998–6008. 254
Vijay-Shanker, K., Weir, D. J., and Joshi, A. K. (1987). Characterizing structural descrip-
tions produced by various grammatical formalisms. In Proc. of the 25th Annual Meet-
ing of the Association for Computational Linguistics, pages 104–111, Stanford, CA. DOI:
10.3115/981175.981190. 177
Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding. In Proc.
of the 3rd International Conference on Learning Representations (ICLR). 221
Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and varia-
tional inference. Foundations and Trends in Machine Learning, 1(1–2), pages 1–305. DOI:
10.1561/2200000001. 139
Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. In Proc. of the 23rd Interna-
tional Conference on Machine Learning, pages 977–984, Pittsburgh, PA. ACM Press. DOI:
10.1145/1143844.1143967. 170
Wallach, H., Sutton, C., and McCallum, A. (2008). Bayesian modeling of dependency trees
using hierarchical Pitman-Yor priors. In ICML Workshop on Prior Knowledge for Text and
Language Processing, pages 15–20, Helsinki, Finland. ACM. 169
Wang, C., Paisley, J. W., and Blei, D. M. (2011). Online variational inference for the hierarchical
Dirichlet process. In International Conference on Artificial Intelligence and Statistics, pages 752–
760. 152, 167
Weisstein, E. W. (2014). Gamma function. From MathWorld, a Wolfram web resource. http://mathworld.wolfram.com/GammaFunction.html, last visited on 11/11/2014. 268
Welling, M., Teh, Y. W., Andrieu, C., Kominiarczuk, J., Meeds, T., Shahbaba, B., and Vollmer,
S. (2014). Bayesian inference with big data: a snapshot from a workshop. International Society
for Bayesian Analysis (ISBA) Bulletin, 21(4), pages 8–11. 258
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynam-
ics. In Proc. of the 28th International Conference on Machine Learning (ICML), pages 681–688.
229
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proc. of
the IEEE, 78(10):1550–1560. DOI: 10.1109/5.58337 231
Williams, P., Sennrich, R., Koehn, P., and Post, M. (2016). Syntax-based Statistical Machine
Translation. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.
205
Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y. W. (2009). A stochastic
memoizer for sequence data. In Proc. of the 26th Annual International Conference on Machine
Learning, pages 1129–1136. ACM. DOI: 10.1145/1553374.1553518. 175
Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel
corpora. Computational Linguistics, 23(3), pages 377–403. 205
Author’s Biography
SHAY COHEN
Shay Cohen is a Lecturer at the Institute for Language, Cognition and Computation at the
School of Informatics at the University of Edinburgh. He received his Ph.D. in Language
Technologies from Carnegie Mellon University (2011), his M.Sc. in Computer Science from
Tel-Aviv University (2004) and his B.Sc. in Mathematics and Computer Science from Tel-
Aviv University (2000). He was awarded a Computing Innovation Fellowship for his postdoc-
toral studies at Columbia University (2011–2013) and a Chancellor’s Fellowship in Edinburgh
(2013–2018). His research interests are in natural language processing and machine learning,
with a focus on problems in structured prediction, such as syntactic and semantic parsing.
Index
adaptor grammars, 193
    MCMC inference, 198
    Pitman-Yor prior, 194
    stick-breaking view, 196
    variational inference, 198
autocorrelation, 120
autoencoders, 246
backpropagation, 225
    exploding gradient, 232
    through structure, 231
    through time, 231
    vanishing gradient, 232
Bayes by backprop, 231
Bayes risk, 89
Bayes' rule, 6
Bayesian
    objectivism, 22
    subjectivism, 22
Bayesian optimization, 244
Bayesian philosophy, 22
chain rule, 6, 261
Chinese restaurant process, 159, 176
Chomsky normal form, 182
computational conjugacy, 257
conditional distribution, 5
conjugacy, 33
context-free grammar, 212
    synchronous, 205
context-sensitive grammars, 210
convolutional neural networks, 239
cross entropy, 259
cumulative distribution function, 4
de Finetti's theorem, 8
deep generative models, 245
deep learning, 213
dependency grammars, 202
digamma function, 144
directed graphical models, 17
    plate notation, 18
Dirichlet process, 156
    hierarchical, 165
    mixture, 161
        MCMC inference, 161
        search inference, 164
        variational inference, 163
    PCFGs, 200
distance-dependent Chinese restaurant process, 173
distribution
    Bernoulli, 11, 54
    Beta, 35
    categorical, 53, 267
    concrete, 250, 273
    conditional, 5
    Dirichlet, 32, 55, 268
    Gamma, 269
    Gamma and Dirichlet, 57
    GEM, 157
    Gumbel, 250, 273
    inverse Wishart, 272
    joint, 4
    Laplace, 271
    logistic normal, 61, 271
    marginal, 4
    multinomial, 33, 53, 267
    multivariate normal, 45, 270
    normal, 40
    Poisson, 269
    prior, 12, 28, 43
    symmetric Dirichlet, 35
    target, 97
    uniform, 67
entropy, 259
estimation
    decision-theoretic, 89
    Dirichlet, 81
    empirical Bayes, 90, 147
    latent variables, 86
    maximum a posteriori, 26, 79
    maximum likelihood, 12
    regularization, 83
    smoothing, 81
exchangeability, 8
exp-digamma function, 145
expectation-maximization, 145, 262
factorized posterior, 137
frequentist, 12, 22, 28
Fubini's theorem, 10
gated recurrent neural networks, 234
Gaussian mixture model, 15
Gaussian process, 172
GEM distribution, 157
generative adversarial networks, 252
generative story, 15
Gibbs sampling, 101
    collapsed, 105
    operators, 109
grammar induction, 205, 208
hidden Markov models, 178, 206
    infinite state space, 179
    PCFG representation, 189
hyperparameters, 33
inclusion-exclusion principle, 24
independence assumptions, 16
Indian buffet process, 172
inference, 12, 36
    approximate, 97
    Markov chain Monte Carlo, 39, 97
    variational, 39, 135, 137
        convergence diagnosis, 149
        decoding, 150
        Dirichlet process mixtures, 163
        Dirichlet-Multinomial, 141
        mean-field, 138, 153
        online, 151
inside algorithm, 187
inverse transform sampling, 124
joint distribution, 4
Kullback-Leibler divergence, 151, 260
Langevin dynamics, 229
language modeling, 82, 169
large-scale MCMC, 258
latent Dirichlet allocation, 18, 25, 30
    collapsed Gibbs sampler, 107
    Gibbs sampler, 102
    independence assumptions, 31
latent-variable learning, 21
latent-variable PCFG, 201
learning
    semi-supervised, 19
    supervised, 19, 27
    unsupervised, 19, 27
likelihood, 21
linear regression, 40
log-likelihood, 21
    marginal, 19
logistic regression, 83
long short-term memory units, 235
machine translation, 205
marginal distribution, 4
marginalization of random variables, 4
Markov chains, 121
max pooling, 240
maximum likelihood estimate, 12
MCMC
    detailed balance, 110
    recurrence, 110
    search space, 98
mean-field variational inference, 138
Metropolis-Hastings sampling, 113
mildly CFGs, 209
minimum description length, 80
model
    bag of words, 30
    discriminative, 14, 27
    exponential, 69, 212
    generative, 14, 27
    grammar, 177
    graphical, 17
    hierarchical, 72
    latent Dirichlet allocation, 132
    latent variables, 48, 78
    log-linear, 83
    logistic regression, 83
    mixture, 50, 155
    mixture of Gaussians, 15
    nonparametric, 11
    parametric, 11
    probabilistic context-free grammars, 181
    skip-gram, 218
    structure, 99
    unigram, 30
model merging, 211
Monte Carlo integration, 126
multilingual learning, 206
    grammar induction, 208
    part-of-speech tagging, 206
multinomial collection, 184
nested Chinese restaurant process, 173
neural networks, 213
    attention mechanism, 237
    convolutional, 239
    dropout, 242
    encoder-decoder, 235
    exploding gradient, 232
    feed-forward, 222
    gated recurrent units, 235
    generative, 245
    generative adversarial networks, 252
    GRUs, 235
    hyperparameter tuning, 243
    long short-term memory units, 235
    LSTMs, 235
    machine translation, 235
    prior, 228
    recurrent, 230
    regularization, 242
    softmax, 237
    vanishing gradient, 232
    variational autoencoders, 247
    variational inference, 229, 231
normalization constant, 5, 47
observations, 7, 14, 19
online variational inference, 151
outside algorithm, 187
particle filtering, 128
PCFG inference, 185
    feature expectations, 186
    inside algorithm, 186
    outside algorithm, 186, 188
PCFGs
    MCMC inference, 190
    tightness, 52, 190
phrase-structure tree, 181
Pitman-Yor process, 168
    language modeling, 169
    power-law behavior, 170
plate notation for graphical models, 18
pooling, 240
posterior
    asymptotics, 93
    Laplace's approximation, 87
    point estimate, 77
    summarization, 77
power-law behavior, 170
prior
    conjugate, 44, 51
    improper, 67
    Jeffreys, 68
    non-informative, 66
    nonparametric, 155
    PCFGs, 189
    sparsity, 36
    structural, 73
probability
    distribution, 1
    event, 1
    measure, 1
probability density function, 3
probability mass function, 3
probability simplex, 34, 54
random variable
    transformation, 261
random variables, 2
    conditional independence, 8
    continuous, 3
    covariance, 10
    discrete, 3
    exchangeable, 159
    expectation, 9
    independence, 5, 7
    latent, 14, 16, 19, 28, 100
    moments, 10
    multivariate, 3
    variance, 10
recurrent neural networks, 230
regression
    Ridge, 95
regularization, 242
rejection sampling, 123
reparameterization trick, 250
representation learning, 213
sample space, 1
    discrete, 2, 4
sampling
    auxiliary variable, 117
    blocked, 100
    component-wise, 114
    convergence, 119
    Gibbs, 101
        collapsed, 105
        operators, 109
        parallelization, 111
    inverse transform, 124
    Metropolis-Hastings, 113
    nested MCMC, 128
    pointwise, 101
    rejection, 123
    slice, 115
sequence memoizers, 175
sequential Monte Carlo, 128
simulated annealing, 118
slice sampling, 115
smoothing, 26
sparse grammars, 191
sparsity, 36, 56
statistical model, 11
stick-breaking process, 157
sufficient statistics, 69
supervised learning, 21
synchronous grammars, 205
text regression, 40
topic modeling, 30, 102
transductive learning, 20
tree substitution grammars, 210
treebank, 181
unsupervised learning, 19