Speech And Language Processing An Introduction To Natural Language Processing Computational Linguistics And Speech Recognition 3rd Edition Daniel Jurafsky download
Speech And Language Processing An Introduction To Natural Language Processing Computational Linguistics And Speech Recognition 3rd Edition Daniel Jurafsky download
https://ebookbell.com/product/speech-and-language-processing-an-
introduction-to-natural-language-processing-computational-linguistics-
and-speech-recognition-2nd-edition-daniel-jurafsky-2253230
https://ebookbell.com/product/speech-and-language-
processing-2e-slp2e-itebooks-23837520
https://ebookbell.com/product/speech-and-language-processing-for-
humanmachine-communications-proceedings-of-csi-2015-1st-edition-s-s-
agrawal-6793380
https://ebookbell.com/product/speech-and-language-processing-
draft-2nd-daniel-jurafsky-james-h-martin-7121290
Speech And Language Processing 2nd Edition Jurafsky D Martin Jh
https://ebookbell.com/product/speech-and-language-processing-2nd-
edition-jurafsky-d-martin-jh-51224892
https://ebookbell.com/product/bayesian-speech-and-language-processing-
watanabe-schien-jt-10447808
https://ebookbell.com/product/bayesian-speech-and-language-
processing-1-publ-watanabe-shinji-chien-6661768
https://ebookbell.com/product/introducing-speech-and-language-
processing-201008-john-coleman-230868990
https://ebookbell.com/product/bayesian-speech-and-language-processing-
watanabe-shinji-chien-232080296
Speech and Language Processing
An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition
Daniel Jurafsky
Stanford University
James H. Martin
University of Colorado at Boulder
2
Contents
I Fundamental Algorithms for NLP 1
1 Introduction 3
2 Regular Expressions, Text Normalization, Edit Distance 4
2.1 Regular Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Text Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Minimum Edit Distance . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 29
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 N-gram Language Models 31
3.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Evaluating Language Models . . . . . . . . . . . . . . . . . . . . 37
3.3 Sampling sentences from a language model . . . . . . . . . . . . . 40
3.4 Generalization and Zeros . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Huge Language Models and Stupid Backoff . . . . . . . . . . . . 48
3.7 Advanced: Kneser-Ney Smoothing . . . . . . . . . . . . . . . . . 49
3.8 Advanced: Perplexity’s Relation to Entropy . . . . . . . . . . . . 52
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 55
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Naive Bayes, Text Classification, and Sentiment 58
4.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Training the Naive Bayes Classifier . . . . . . . . . . . . . . . . . 62
4.3 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Optimizing for Sentiment Analysis . . . . . . . . . . . . . . . . . 64
4.5 Naive Bayes for other text classification tasks . . . . . . . . . . . 66
4.6 Naive Bayes as a Language Model . . . . . . . . . . . . . . . . . 67
4.7 Evaluation: Precision, Recall, F-measure . . . . . . . . . . . . . . 68
4.8 Test sets and Cross-validation . . . . . . . . . . . . . . . . . . . . 70
4.9 Statistical Significance Testing . . . . . . . . . . . . . . . . . . . 71
4.10 Avoiding Harms in Classification . . . . . . . . . . . . . . . . . . 75
4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 76
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Logistic Regression 79
5.1 The sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Classification with Logistic Regression . . . . . . . . . . . . . . . 82
5.3 Multinomial logistic regression . . . . . . . . . . . . . . . . . . . 86
5.4 Learning in Logistic Regression . . . . . . . . . . . . . . . . . . . 89
5.5 The cross-entropy loss function . . . . . . . . . . . . . . . . . . . 90
5.6 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3
4 C ONTENTS
5.7 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.8 Learning in Multinomial Logistic Regression . . . . . . . . . . . . 98
5.9 Interpreting models . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.10 Advanced: Deriving the Gradient Equation . . . . . . . . . . . . . 100
5.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Bibliographical and Historical Notes . . . . . . . . . . . . . . . . . . . . 101
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
In the first part of the book we introduce the fundamental suite of algorithmic
tools that make up the modern neural language model that is the heart of end-to-end
NLP systems. We begin with tokenization and preprocessing, as well as useful algo-
rithms like computing edit distance, and then proceed to the tasks of classification,
logistic regression, neural networks, proceeding through feedforward networks, re-
current networks, and then transformers. We’ll also see the role of embeddings as a
model of word meaning.
CHAPTER
1 Introduction
La dernière chose qu’on trouve en faisant un ouvrage est de savoir celle qu’il faut
mettre la première.
[The last thing you figure out in writing a book is what to put first.]
Pascal
3
4 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
CHAPTER
Some languages, like Japanese, don’t have spaces between words, so word tokeniza-
tion becomes more difficult.
lemmatization Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example,
the words sang, sung, and sings are forms of the verb sing. The word sing is the
common lemma of these words, and a lemmatizer maps from all of these to sing.
Lemmatization is essential for processing morphologically complex languages like
stemming Arabic. Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word. Text normalization also includes sen-
sentence
segmentation tence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called edit distance that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language process-
ing, from spelling correction to speech recognition to coreference resolution.
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the
braces specifies a disjunction of characters to match. For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.
The regular expression /[1234567890]/ specifies any single digit. While such
classes of characters as digits or letters are important building blocks in expressions,
they can get awkward (e.g., it’s inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean “any capital letter”). In cases where there is a well-defined sequence asso-
ciated with a set of characters, the brackets can be used with the dash (-) to specify
range any one character in a range. The pattern /[2-5]/ specifies any one of the charac-
ters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or
g. Some other examples are shown in Fig. 2.3.
The square braces can also be used to specify what a single character cannot be,
by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [,
the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single
character (including special characters) except a. This is only true when the caret
is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a caret; Fig. 2.4 shows some examples.
How can we talk about optional elements, like an optional s in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
2.1 • R EGULAR E XPRESSIONS 7
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”, as shown in Fig. 2.5.
We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that
we want, something that is very important in regular expressions. For example,
consider the language of certain sheep, which consists of strings that look like the
following:
baa!
baaa!
baaaa!
baaaaa!
...
This language consists of strings with a b, followed by at least two a’s, followed
by an exclamation point. The set of operators that allows us to say things like “some
Kleene * number of as” are based on the asterisk or *, commonly called the Kleene * (gen-
erally pronounced “cleany star”). The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So /a*/ means “any
string of zero or more as”. This will match a or aaaaaa, but it will also match the
empty string at the start of Off Minor since the string Off Minor starts with zero a’s.
So the regular expression for matching one or more a is /aa*/, meaning one a fol-
lowed by zero or more as. More complex patterns can also be repeated. So /[ab]*/
means “zero or more a’s or b’s” (not “zero or more right square braces”). This will
match strings like aaaa or ababab or bbbb.
For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus
/[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)
Sometimes it’s annoying to have to write the regular expression for digits twice,
so there is a shorter way to specify “at least one” of some character. This is the
Kleene + Kleene +, which means “one or more occurrences of the immediately preceding
character or regular expression”. Thus, the expression /[0-9]+/ is the normal way
to specify “a sequence of digits”. There are thus two ways to specify the sheep
language: /baaa*!/ or /baa+!/.
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return), as shown in Fig. 2.6.
The wildcard is often used together with the Kleene star to mean “any string of
characters”. For example, suppose we want to find any line in which a particular
word, for example, aardvark, appears twice. We can specify this with the regular
expression /aardvark.*aardvark/.
anchors Anchors are special characters that anchor regular expressions to particular places
8 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret
ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the
start of a line. Thus, the caret ˆ has three uses: to match the start of a line, to in-
dicate a negation inside of square brackets, and just to mean a caret. (What are the
contexts that allow grep or Python to know which function a given caret is supposed
to have?) The dollar sign $ matches the end of a line. So the pattern $ is a useful
pattern for matching a space at the end of a line, and /ˆThe dog\.$/ matches a
line that contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean “period” and not the wildcard.)
Regex Match
ˆ start of line
$ end of line
\b word boundary
\B non-word boundary
Figure 2.7 Anchors in regular expressions.
There are also two other anchors: \b matches a word boundary, and \B matches
a non-boundary. Thus, /\bthe\b/ matches the word the but not the word other.
More technically, a “word” for the purposes of a regular expression is defined as any
sequence of digits, underscores, or letters; this is based on the definition of “words”
in programming languages. For example, /\b99\b/ will match the string 99 in
There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in
There are 299 bottles of beer on the wall (since 99 follows a number). But it will
match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore,
or letter).
any number of spaces! The star here applies only to the space that precedes it,
not to the whole sequence. With the parentheses, we could write the expression
/(Column [0-9]+ *)*/ to match the word Column, followed by a number and
optional spaces, the whole pattern repeated zero or more times.
This idea that one operator may take precedence over another, requiring us to
sometimes use parentheses to specify what we mean, is formalized by the operator
operator
precedence precedence hierarchy for regular expressions. The following table gives the order
of RE operator precedence, from highest precedence to lowest precedence.
Parenthesis ()
Counters * + ? {}
Sequences and anchors the ˆmy end$
Disjunction |
/the/
One problem is that this pattern will miss the word when it begins a sentence and
hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g.,
other or theology). So we need to specify that we want instances with a word bound-
ary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/. We might want this since
/\b/ won’t treat underscores and numbers as word boundaries; but we might want
to find the in some context where it might also have underlines or numbers nearby
(the or the25). We need to specify that we want instances in which there are no
alphabetic letters on either side of the the:
/[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
But there is still one more problem with this pattern: it won’t find the word the
when it begins a line. This is because the regular expression [ˆa-zA-Z], which
10 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
we used to avoid embedded instances of the, implies that there must be some single
(although non-alphabetic) character before the the. We can avoid this by specify-
ing that before the the we require either the beginning-of-line or a non-alphabetic
character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
The process we just went through was based on fixing two kinds of errors: false
false positives positives, strings that we incorrectly matched like other or there, and false nega-
false negatives tives, strings that we incorrectly missed, like The. Addressing these two kinds of
errors comes up again and again in implementing speech and language processing
systems. Reducing the overall error rate for an application thus involves two antag-
onistic efforts:
• Increasing precision (minimizing false positives)
• Increasing recall (minimizing false negatives)
We’ll come back to precision and recall with more precise definitions in Chapter 4.
Regex Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? zero or one occurrence of the previous char or expression
{n} exactly n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
{,m} up to m occurrences of the previous char or expression
Figure 2.9 Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the
newline backslash (\) (see Fig. 2.10). The most common of these are the newline character
2.1 • R EGULAR E XPRESSIONS 11
\n and the tab character \t. To refer to characters that are special themselves (like
., *, [, and \), precede them with a backslash, (i.e., /\./, /\*/, /\[/, and /\\/).
/$[0-9]+/
Note that the $ character has a different function here than the end-of-line function
we discussed earlier. Most regular expression parsers are smart enough to realize
that $ here doesn’t mean end-of-line. (As a thought experiment, think about how
regex parsers might figure out the function of $ from the context.)
Now we just need to deal with fractions of dollars. We’ll add a decimal point
and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents
optional and to make sure we’re at a word boundary:
/(ˆ|\W)$[0-9]+(\.[0-9][0-9])?\b/
One last catch! This pattern allows prices like $199999.99 which would be far
too expensive! We need to limit the dollars:
/(ˆ|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/
How about disk space? We’ll need to allow for optional fractions again (5.5 GB);
note the use of ? for making the final s optional, and the use of / */ to mean “zero
or more spaces” since there might always be extra spaces lying around:
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
Modifying this regular expression so that it only matches more than 500 GB is
left as an exercise for the reader.
12 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
s/colour/color/
s/([0-9]+)/<\1>/
The parenthesis and number operators can also specify that a certain string or
expression must occur twice in the text. For example, suppose we are looking for
the pattern “the Xer they were, the Xer they will be”, where we want to constrain
the two X’s to be the same string. We do this by surrounding the first X with the
parenthesis operator, and replacing the second X with the number operator \1, as
follows:
Here the \1 will be replaced by whatever string matched the first item in paren-
theses. So this will match the bigger they were, the bigger they will be but not the
bigger they were, the faster they will be.
capture group This use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the re-
register sulting match is stored in a numbered register. If you match two different sets of
parentheses, \2 means whatever matched the second capture group. Thus
will match the faster they ran, the faster we ran but not the faster they ran, the faster
we ate. Similarly, the third capture group is stored in \3, the fourth is \4, and so on.
Parentheses thus have a double function in regular expressions; they are used
to group terms for specifying the order in which operators should apply, and they
are used to capture something in a register. Occasionally we might want to use
parentheses for grouping, but don’t want to capture the resulting pattern in a register.
non-capturing
group In that case we use a non-capturing group, which is specified by putting the special
commands ?: after the open parenthesis, in the form (?: pattern ).
will match some cats like some cats but not some cats like some some.
Substitutions and capture groups are very useful in implementing simple chat-
bots like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulates a Rogerian
psychologist by carrying on conversations like the following:
2.2 • W ORDS 13
Since multiple substitutions can apply to a given input, substitutions are assigned
a rank and applied in order. Creating patterns is the topic of Exercise 2.3, and we
return to the details of the ELIZA architecture in Chapter 15.
2.2 Words
Before we talk about processing words, we need to decide what counts as a word.
corpus Let’s start by looking at one particular corpus (plural corpora), a computer-readable
corpora collection of text or speech. For example the Brown corpus is a million-word col-
lection of samples from 500 written English texts from different genres (newspa-
per, fiction, non-fiction, academic, etc.), assembled at Brown University in 1963–64
(Kučera and Francis, 1967). How many words are in the following Brown sentence?
He stepped out into the hall, was delighted to encounter a water brother.
14 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
|V | = kN β (2.1)
2.3 • C ORPORA 15
The value of β depends on the corpus size and the genre, but at least for the large
corpora in Fig. 2.11, β ranges from .67 to .75. Roughly then we can say that the
vocabulary size for a text goes up significantly faster than the square root of its
length in words.
Another measure of the number of words in the language is the number of lem-
mas instead of wordform types. Dictionaries can help in giving lemma counts; dic-
tionary entries or boldface forms are a very rough upper bound on the number of
lemmas (since some lemmas have multiple boldface forms). The 1989 edition of the
Oxford English Dictionary had 615,000 entries.
2.3 Corpora
Words don’t appear out of nowhere. Any particular piece of text that we study
is produced by one or more specific speakers or writers, in a specific dialect of a
specific language, at a specific time, in a specific place, for a specific function.
Perhaps the most important dimension of variation is the language. NLP algo-
rithms are most useful when they apply across many languages. The world has 7097
languages at the time of this writing, according to the online Ethnologue catalog
(Simons and Fennig, 2018). It is important to test algorithms on more than one lan-
guage, and particularly on languages with different properties; by contrast there is
an unfortunate current tendency for NLP algorithms to be developed or tested just
on English (Bender, 2019). Even when algorithms are developed beyond English,
they tend to be developed for the official languages of large industrialized nations
(Chinese, Spanish, Japanese, German etc.), but we don’t want to limit tools to just
these few languages. Furthermore, most languages also have multiple varieties, of-
ten spoken in different regions or by different social groups. Thus, for example,
AAE if we’re processing text that uses features of African American English (AAE) or
African American Vernacular English (AAVE)—the variations of English used by
millions of people in African American communities (King 2020)—we must use
NLP tools that function with features of those varieties. Twitter posts might use fea-
tures often used by speakers of African American English, such as constructions like
MAE iont (I don’t in Mainstream American English (MAE)), or talmbout corresponding
to MAE talking about, both examples that influence word segmentation (Blodgett
et al. 2016, Jones 2015).
It’s also quite common for speakers or writers to use multiple languages in a
code switching single communicative act, a phenomenon called code switching. Code switching
is enormously common across the world; here are examples showing Spanish and
(transliterated) Hindi code switching with English (Solorio et al. 2014, Jurgens et al.
2017):
16 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
(2.2) Por primera vez veo a @username actually being hateful! it was beautiful:)
[For the first time I get to see @username actually being hateful! it was
beautiful:) ]
(2.3) dost tha or ra- hega ... dont wory ... but dherya rakhe
[“he was and will remain a friend ... don’t worry ... but have faith”]
Another dimension of variation is the genre. The text that our algorithms must
process might come from newswire, fiction or non-fiction books, scientific articles,
Wikipedia, or religious texts. It might come from spoken genres like telephone
conversations, business meetings, police body-worn cameras, medical interviews,
or transcripts of television shows or movies. It might come from work situations
like doctors’ notes, legal text, or parliamentary or congressional proceedings.
Text also reflects the demographic characteristics of the writer (or speaker): their
age, gender, race, socioeconomic class can all influence the linguistic properties of
the text we are processing.
And finally, time matters too. Language changes over time, and for some lan-
guages we have good corpora of texts from different historical periods.
Because language is so situated, when developing computational models for lan-
guage processing from a corpus, it’s important to consider who produced the lan-
guage, in what context, for what purpose. How can a user of a dataset know all these
datasheet details? The best way is for the corpus creator to build a datasheet (Gebru et al.,
2020) or data statement (Bender and Friedman, 2018) for each corpus. A datasheet
specifies properties of a dataset like:
Motivation: Why was the corpus collected, by whom, and who funded it?
Situation: When and in what situation was the text written/spoken? For example,
was there a task? Was the language originally spoken conversation, edited
text, social media communication, monologue vs. dialogue?
Language variety: What language (including dialect/region) was the corpus in?
Speaker demographics: What was, e.g., the age or gender of the text’s authors?
Collection process: How big is the data? If it is a subsample how was it sampled?
Was the data collected with consent? How was the data pre-processed, and
what metadata is available?
Annotation process: What are the annotations, what are the demographics of the
annotators, how were they trained, how was the data annotated?
Distribution: Are there copyright or other intellectual property restrictions?
2 abase
1 abash
14 abate
3 abated
3 abatement
...
Now we can sort again to find the frequent words. The -n option to sort means
to sort numerically rather than alphabetically, and the -r option means to sort in
reverse order (highest-to-lowest):
tr -sc ’A-Za-z’ ’\n’ < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r
The results show that the most frequent words in Shakespeare, as in any other
corpus, are the short function words like articles, pronouns, prepositions:
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
...
Unix tools of this sort can be very handy in building quick word count statistics
for any corpus in English. While in some versions of Unix these command-line tools
also correctly handle Unicode characters and so can be used for many languages,
in general for handling most languages outside English we use more sophisticated
tokenization algorithms.
clitic A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example, converting what’re to the two tokens what are, and
we’re to we are. A clitic is a part of a word that can’t stand on its own, and can only
occur when it is attached to another word. Some such contractions occur in other
alphabetic languages, including articles and pronouns in French (j’ai, l’homme).
Depending on the application, tokenization algorithms may also tokenize mul-
tiword expressions like New York or rock ’n’ roll as a single token, which re-
quires a multiword expression dictionary of some sort. Tokenization is thus inti-
mately tied up with named entity recognition, the task of detecting names, dates,
and organizations (Chapter 8).
One commonly used tokenization standard is known as the Penn Treebank to-
Penn Treebank kenization standard, used for the parsed corpora (treebanks) released by the Lin-
tokenization
guistic Data Consortium (LDC), the source of many useful datasets. This standard
separates out clitics (doesn’t becomes does plus n’t), keeps hyphenated words to-
gether, and separates out all punctuation (to save space we’re showing visible spaces
‘ ’ between tokens, although newlines is a more common output):
Input: "The San Francisco-based restaurant," they said,
"doesn’t charge $10".
Output: " The San Francisco-based restaurant , " they said ,
" does n’t charge $ 10 " .
In practice, since tokenization needs to be run before any other language process-
ing, it needs to be very fast. The standard method for tokenization is therefore to use
deterministic algorithms based on regular expressions compiled into very efficient
finite state automata. For example, Fig. 2.12 shows an example of a basic regular
expression that can be used to tokenize English with the nltk.regexp tokenize
function of the Python-based Natural Language Toolkit (NLTK) (Bird et al. 2009;
https://www.nltk.org).
Carefully designed deterministic algorithms can deal with the ambiguities that
arise, such as the fact that the apostrophe needs to be tokenized differently when used
as a genitive marker (as in the book’s cover), a quotative as in ‘The other class’, she
said, or in clitics like they’re.
Word tokenization is more complex in languages like written Chinese, Japanese,
and Thai, which do not use spaces to mark potential word-boundaries. In Chinese,
hanzi for example, words are composed of characters (called hanzi in Chinese). Each
20 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
Most tokenization schemes have two parts: a token learner, and a token seg-
menter. The token learner takes a raw training corpus (sometimes roughly pre-
separated into words, for example by whitespace) and induces a vocabulary, a set
of tokens. The token segmenter takes a raw test sentence and segments it into the
tokens in the vocabulary. Three algorithms are widely used: byte-pair encoding
(Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece
(Schuster and Nakajima, 2012); there is also a SentencePiece library that includes
implementations of the first two of the three (Kudo and Richardson, 2018a).
In this section we introduce the simplest of the three, the byte-pair encoding or
BPE BPE algorithm (Sennrich et al., 2016); see Fig. 2.13. The BPE token learner begins
with a vocabulary that is just the set of all individual characters. It then examines the
training corpus, chooses the two symbols that are most frequently adjacent (say ‘A’,
‘B’), adds a new merged symbol ‘AB’ to the vocabulary, and replaces every adjacent
’A’ ’B’ in the corpus with the new ‘AB’. It continues to count and merge, creating
new longer and longer character strings, until k merges have been done creating
k novel tokens; k is thus a parameter of the algorithm. The resulting vocabulary
consists of the original set of characters plus k new symbols.
The algorithm is usually run inside words (not merging across word boundaries),
so the input corpus is first white-space-separated to give a set of strings, each corre-
sponding to the characters of a word, plus a special end-of-word symbol , and its
counts. Let’s see its operation on the following tiny input corpus of 18 word tokens
with counts for each word (the word low appears 5 times, the word newer 6 times,
and so on), which would have a starting vocabulary of 11 letters:
corpus vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w
2 l o w e s t
6 n e w e r
3 w i d e r
2 n e w
The BPE algorithm first counts all pairs of adjacent symbols: the most frequent
is the pair e r because it occurs in newer (frequency of 6) and wider (frequency of
3) for a total of 9 occurrences.1 We then merge these symbols, treating er as one
symbol, and count again:
corpus vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w, er
2 l o w e s t
6 n e w er
3 w i d er
2 n e w
Now the most frequent pair is er , which we merge; our system has learned
that there should be a token for word-final er, represented as er :
corpus vocabulary
5 l o w , d, e, i, l, n, o, r, s, t, w, er, er
2 l o w e s t
6 n e w er
3 w i d er
2 n e w
1 Note that there can be ties; we could have instead chosen to merge r first, since that also has a
frequency of 9.
22 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
Figure 2.13 The token learner part of the BPE algorithm for taking a corpus broken up
into individual characters or bytes, and learning a vocabulary by iteratively merging tokens.
Figure adapted from Bostrom and Durrett (2020).
Once we’ve learned our vocabulary, the token segmenter is used to tokenize a
test sentence. The token segmenter just runs on the test data the merges we have
learned from the training data, greedily, in the order we learned them. (Thus the
frequencies in the test data don’t play a role, just the frequencies in the training
data). So first we segment each test sentence word into characters. Then we apply
the first rule: replace every instance of e r in the test corpus with er, and then the
second rule: replace every instance of er in the test corpus with er , and so on.
By the end, if the test corpus contained the character sequence n e w e r , it
would be tokenized as a full word. But the characters of a new (unknown) word like
l o w e r would be merged into the two tokens low er .
Of course in real settings BPE is run with many thousands of merges on a very
large input corpus. The result is that most words will be represented as full symbols,
and only the very rare words (and unknown words) will have to be represented by
their parts.
extraction about the US, we might want to see information from documents whether
they mention the US or the USA.
case folding Case folding is another kind of normalization. Mapping everything to lower
case means that Woodchuck and woodchuck are represented identically, which is
very helpful for generalization in many tasks, such as information retrieval or speech
recognition. For sentiment analysis and other text classification tasks, information
extraction, and machine translation, by contrast, case can be quite helpful and case
folding is generally not done. This is because maintaining the difference between,
for example, US the country and us the pronoun can outweigh the advantage in
generalization that case folding would have provided for other words.
For many natural language processing situations we also want two morpholog-
ically different forms of a word to behave similarly. For example in web search,
someone may type the string woodchucks but a useful system might want to also
return pages that mention woodchuck with no s. This is especially common in mor-
phologically complex languages like Polish, where for example the word Warsaw
has different endings when it is the subject (Warszawa), or after a preposition like
“in Warsaw” (w Warszawie), or “to Warsaw” (do Warszawy), and so on.
Lemmatization is the task of determining that two words have the same root,
despite their surface differences. The words am, are, and is have the shared lemma
be; the words dinner and dinners both have the lemma dinner. Lemmatizing each
of these forms to the same lemma will let us find all mentions of words in Polish
like Warsaw. The lemmatized form of a sentence like He is reading detective stories
would thus be He be read detective story.
How is lemmatization done? The most sophisticated methods for lemmatization
involve complete morphological parsing of the word. Morphology is the study of
morpheme the way words are built up from smaller meaning-bearing units called morphemes.
stem Two broad classes of morphemes can be distinguished: stems—the central mor-
affix pheme of the word, supplying the main meaning—and affixes—adding “additional”
meanings of various kinds. So, for example, the word fox consists of one morpheme
(the morpheme fox) and the word cats consists of two: the morpheme cat and the
morpheme -s. A morphological parser takes a word like cats and parses it into the
two morphemes cat and s, or parses a Spanish word like amaren (‘if in the future
they would love’) into the morpheme amar ‘to love’, and the morphological features
3PL and future subjunctive.
The algorithm is based on series of rewrite rules run in series: the output of each
pass is fed as input to the next pass. Here are some sample rules (more details can
be found at https://tartarus.org/martin/PorterStemmer/):
Simple stemmers can be useful in cases where we need to collapse across differ-
ent variants of the same lemma. Nonetheless, they do tend to commit errors of both
over- and under-generalizing, as shown in the table below (Krovetz, 1993):
example comes from coreference, the task of deciding whether two strings such as
the following refer to the same entity:
Stanford President Marc Tessier-Lavigne
Stanford University President Marc Tessier-Lavigne
Again, the fact that these two strings are very similar (differing by only one word)
seems like useful evidence for deciding that they might be coreferent.
Edit distance gives us a way to quantify both of these intuitions about string sim-
minimum edit ilarity. More formally, the minimum edit distance between two strings is defined
distance
as the minimum number of editing operations (operations like insertion, deletion,
substitution) needed to transform one string into another.
The gap between intention and execution, for example, is 5 (delete an i, substi-
tute e for n, substitute x for t, insert c, substitute u for n). It’s much easier to see
alignment this by looking at the most important visualization for string distances, an alignment
between the two strings, shown in Fig. 2.14. Given two sequences, an alignment is
a correspondence between substrings of the two sequences. Thus, we say I aligns
with the empty string, N with E, and so on. Beneath the aligned strings is another
representation; a series of symbols expressing an operation list for converting the
top string into the bottom string: d for deletion, s for substitution, i for insertion.
INTE*NTION
| | | | | | | | | |
*EXECUTION
d s s i s
Figure 2.14 Representing the minimum edit distance between two strings as an alignment.
The final row gives the operation list for converting the top string into the bottom string: d for
deletion, s for substitution, i for insertion.
We can also assign a particular cost or weight to each of these operations. The
Levenshtein distance between two sequences is the simplest weighting factor in
which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume
that the substitution of a letter for itself, for example, t for t, has zero cost. The Lev-
enshtein distance between intention and execution is 5. Levenshtein also proposed
an alternative version of his metric in which each insertion or deletion has a cost of
1 and substitutions are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be represented by one
insertion and one deletion). Using this version, the Levenshtein distance between
intention and execution is 8.
i n t e n t i o n
n t e n t i o n i n t e c n t i o n i n x e n t i o n
Figure 2.15 Finding the edit distance viewed as a search problem
i n t e n t i o n
delete i
n t e n t i o n
substitute n by e
e t e n t i o n
substitute t by x
e x e n t i o n
insert u
e x e n u t i o n
substitute n by c
e x e c u t i o n
Figure 2.16 Path from intention to execution.
Imagine some string (perhaps it is exention) that is in this optimal path (whatever
it is). The intuition of dynamic programming is that if exention is in the optimal
operation list, then the optimal sequence must also include the optimal path from
intention to exention. Why? If there were a shorter path from intention to exention,
then we could use it instead, resulting in a shorter overall path, and the optimal
minimum edit
sequence wouldn’t be optimal, thus leading to a contradiction.
distance The minimum edit distance algorithm was named by Wagner and Fischer
algorithm
(1974) but independently discovered by many people (see the Historical Notes sec-
tion of Chapter 8).
Let’s first define the minimum edit distance between two strings. Given two
strings, the source string X of length n, and target string Y of length m, we’ll define
D[i, j] as the edit distance between X[1..i] and Y [1.. j], i.e., the first i characters of X
and the first j characters of Y . The edit distance between X and Y is thus D[n, m].
We’ll use dynamic programming to compute D[n, m] bottom up, combining so-
lutions to subproblems. In the base case, with a source substring of length i but an
empty target string, going from i characters to 0 requires i deletes. With a target
substring of length j but an empty source going from 0 characters to j characters
requires j inserts. Having computed D[i, j] for small i, j we then compute larger
D[i, j] based on previously computed smaller values. The value of D[i, j] is com-
puted by taking the minimum of the three possible paths through the matrix which
arrive there:
D[i − 1, j] + del-cost(source[i])
D[i, j] = min D[i, j − 1] + ins-cost(target[ j]) (2.8)
D[i − 1, j − 1] + sub-cost(source[i], target[ j])
2.5 • M INIMUM E DIT D ISTANCE 27
If we assume the version of Levenshtein distance in which the insertions and dele-
tions each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have a
cost of 2 (except substitution of identical letters have zero cost), the computation for
D[i, j] becomes:
D[i − 1, j] + 1
D[i, j − 1] + 1
D[i, j] = min (2.9)
2; if source[i] 6= target[ j]
D[i − 1, j − 1] +
0; if source[i] = target[ j]
The algorithm is summarized in Fig. 2.17; Fig. 2.18 shows the results of applying
the algorithm to the distance between intention and execution with the version of
Levenshtein in Eq. 2.9.
n ← L ENGTH(source)
m ← L ENGTH(target)
Create a distance matrix D[n+1,m+1]
# Initialization: the zeroth row and column is the distance from the empty string
D[0,0] = 0
for each row i from 1 to n do
D[i,0] ← D[i-1,0] + del-cost(source[i])
for each column j from 1 to m do
D[0,j] ← D[0, j-1] + ins-cost(target[j])
# Recurrence relation:
for each row i from 1 to n do
for each column j from 1 to m do
D[i, j] ← M IN( D[i−1, j] + del-cost(source[i]),
D[i−1, j−1] + sub-cost(source[i], target[j]),
D[i, j−1] + ins-cost(target[j]))
# Termination
return D[n,m]
Figure 2.17 The minimum edit distance algorithm, an example of the class of dynamic
programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1)
or can be specific to the letter (to model the fact that some letters are more likely to be in-
serted than others). We assume that there is no cost for substituting a letter for itself (i.e.,
sub-cost(x, x) = 0).
Alignment Knowing the minimum edit distance is useful for algorithms like find-
ing potential spelling error corrections. But the edit distance algorithm is important
in another way; with a small change, it can also provide the minimum cost align-
ment between two strings. Aligning two strings is useful throughout speech and
language processing. In speech recognition, minimum edit distance alignment is
used to compute the word error rate (Chapter 16). Alignment plays a role in ma-
chine translation, in which sentences in a parallel corpus (a corpus with a text in two
languages) need to be matched to each other.
To extend the edit distance algorithm to produce an alignment, we can start by
visualizing an alignment as a path through the edit distance matrix. Figure 2.19
shows this path with boldfaced cells. Each boldfaced cell represents an alignment
28 C HAPTER 2 • R EGULAR E XPRESSIONS , T EXT N ORMALIZATION , E DIT D ISTANCE
Src\Tar # e x e c u t i o n
# 0 1 2 3 4 5 6 7 8 9
i 1 2 3 4 5 6 7 6 7 8
n 2 3 4 5 6 7 8 7 8 7
t 3 4 5 6 7 8 7 8 9 8
e 4 3 4 5 6 7 8 9 10 9
n 5 4 5 6 7 8 9 10 11 10
t 6 5 6 7 8 9 8 9 10 11
i 7 6 7 8 9 10 9 8 9 10
o 8 7 8 9 10 11 10 9 8 9
n 9 8 9 10 11 12 11 10 9 8
Figure 2.18 Computation of minimum edit distance between intention and execution with
the algorithm of Fig. 2.17, using Levenshtein distance with cost of 1 for insertions or dele-
tions, 2 for substitutions.
of a pair of letters in the two strings. If two boldfaced cells occur in the same row,
there will be an insertion in going from the source to the target; two boldfaced cells
in the same column indicate a deletion.
Figure 2.19 also shows the intuition of how to compute this alignment path. The
computation proceeds in two steps. In the first step, we augment the minimum edit
distance algorithm to store backpointers in each cell. The backpointer from a cell
points to the previous cell (or cells) that we came from in entering the current cell.
We’ve shown a schematic of these backpointers in Fig. 2.19. Some cells have mul-
tiple backpointers because the minimum extension could have come from multiple
backtrace previous cells. In the second step, we perform a backtrace. In a backtrace, we start
from the last cell (at the final row and column), and follow the pointers back through
the dynamic programming matrix. Each complete path between the final cell and the
initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the
minimum edit distance algorithm to store the pointers and compute the backtrace to
output an alignment.
# e x e c u t i o n
# 0 ← 1 ← 2 3
← 4
← ←5 ← 6 ← 7 ← 8 ← 9
i ↑1 -←↑ 2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -6 ←7 ←8
n ↑2 -←↑ 3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 ↑7 -←↑ 8 -7
t ↑3 -←↑ 4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -7 ←↑ 8 -←↑ 9 ↑8
e ↑4 -3 ←4 -← 5 ←6 ←7 ←↑ 8 -←↑ 9 -←↑ 10 ↑9
n ↑5 ↑4 -←↑ 5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 -↑ 10
t ↑6 ↑5 -←↑ 6 -←↑ 7 -←↑ 8 -←↑ 9 -8 ←9 ← 10 ←↑ 11
i ↑7 ↑6 -←↑ 7 -←↑ 8 -←↑ 9 -←↑ 10 ↑9 -8 ←9 ← 10
o ↑8 ↑7 -←↑ 8 -←↑ 9 -←↑ 10 -←↑ 11 ↑ 10 ↑9 -8 ←9
n ↑9 ↑8 -←↑ 9 -←↑ 10 -←↑ 11 -←↑ 12 ↑ 11 ↑ 10 ↑9 -8
Figure 2.19 When entering a value in each cell, we mark which of the three neighboring
cells we came from with up to three arrows. After the table is full we compute an alignment
(minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and
following the arrows back. The sequence of bold cells represents one possible minimum cost
alignment between the two strings. Diagram design after Gusfield (1997).
While we worked our example with simple Levenshtein distance, the algorithm
in Fig. 2.17 allows arbitrary weights on the operations. For spelling correction, for
example, substitutions are more likely to happen between letters that are next to
2.6 • S UMMARY 29
2.6 Summary
This chapter introduced a fundamental tool in language processing, the regular ex-
pression, and showed how to perform basic text normalization tasks including
word segmentation and normalization, sentence segmentation, and stemming.
We also introduced the important minimum edit distance algorithm for comparing
strings. Here’s a summary of the main points we covered about these ideas:
• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols,
disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors
(ˆ, $) and precedence operators ((,)).
• Word tokenization and normalization are generally done by cascades of
simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping
off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of
operations it takes to edit one into the other. Minimum edit distance can be
computed by dynamic programming, which also results in an alignment of
the two strings.
to ‘execution’ was adapted from Kruskal (1983). There are various publicly avail-
able packages to compute edit distance, including Unix diff and the NIST sclite
program (NIST, 2005).
In his autobiography Bellman (1984) explains how he originally came up with
the term dynamic programming:
“...The 1950s were not good years for mathematical research. [the]
Secretary of Defense ...had a pathological fear and hatred of the word,
research... I decided therefore to use the word, “programming”. I
wanted to get across the idea that this was dynamic, this was multi-
stage... I thought, let’s ... take a word that has an absolutely precise
meaning, namely dynamic... it’s impossible to use the word, dynamic,
in a pejorative sense. Try thinking of some combination that will pos-
sibly give it a pejorative meaning. It’s impossible. Thus, I thought
dynamic programming was a good name. It was something not even a
Congressman could object to.”
Exercises
2.1 Write regular expressions for the following languages.
1. the set of all alphabetic strings;
2. the set of all lower case alphabetic strings ending in a b;
3. the set of all strings from the alphabet a, b such that each a is immedi-
ately preceded by and immediately followed by a b;
2.2 Write regular expressions for the following languages. By “word”, we mean
an alphabetic string separated from other words by whitespace, any relevant
punctuation, line breaks, and so forth.
1. the set of all strings with two consecutive repeated words (e.g., “Hum-
bert Humbert” and “the the” but not “the bug” or “the big bug”);
2. all strings that start at the beginning of the line with an integer and that
end at the end of the line with a word;
3. all strings that have both the word grotto and the word raven in them
(but not, e.g., words like grottos that merely contain the word grotto);
4. write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
2.3 Implement an ELIZA-like program, using substitutions such as those described
on page 13. You might want to choose a different domain than a Rogerian psy-
chologist, although keep in mind that you would need a domain in which your
program can legitimately engage in a lot of simple repetition.
2.4 Compute the edit distance (using insertion cost 1, deletion cost 1, substitution
cost 1) of “leda” to “deal”. Show your work (using the edit distance grid).
2.5 Figure out whether drive is closer to brief or to divers and what the edit dis-
tance is to each. You may use any version of distance that you like.
2.6 Now implement a minimum edit distance algorithm and use your hand-computed
results to check your code.
2.7 Augment the minimum edit distance algorithm to output an alignment; you
will need to store pointers and add a stage to compute the backtrace.
CHAPTER
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:
Why would you want to predict upcoming words, or assign probabilities to sen-
tences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine trans-
lation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
32 C HAPTER 3 • N- GRAM L ANGUAGE M ODELS
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:
One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
the word w”, as follows:
With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
3.1 • N-G RAMS 33
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).
Similarly, if we wanted to know the joint probability of an entire sequence of
words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We
would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the prob-
ability of a word w given a history h, or the probability of an entire word sequence
W . Let’s start with a little formalizing of notation. To represent the probability of a
particular random variable Xi taking on the value “the”, or P(Xi = “the”), we will use
the simplification P(the). We’ll represent a sequence of n words either as w1 . . . wn
or w1:n (so the expression w1:n−1 means the string w1 , w2 , ..., wn−1 ). For the joint
probability of each word in a sequence having a particular value P(X1 = w1 , X2 =
w2 , X3 = w3 , ..., Xn = wn ) we’ll use P(w1 , w2 , ..., wn ).
Now, how can we compute probabilities of entire sequences like P(w1 , w2 , ..., wn )?
One thing we can do is decompose this probability using the chain rule of proba-
bility:
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, P(wn |w1:n−1 ).
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string, because language is creative and any particular
context might have never occurred before!
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
bigram The bigram model, for example, approximates the probability of a word given
all the previous words P(wn |w1:n−1 ) by using only the conditional probability of the
preceding word P(wn |wn−1 ). In other words, instead of computing the probability
P(the|that) (3.6)
34 C HAPTER 3 • N- GRAM L ANGUAGE M ODELS
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:
P(wn |w1:n−1 ) ≈ P(wn |wn−1 ) (3.7)
The assumption that the probability of a word depends only on the previous word is
Markov called a Markov assumption. Markov models are the class of probabilistic models
that assume we can predict the probability of some future unit without looking too
far into the past. We can generalize the bigram (which looks one word into the past)
n-gram to the trigram (which looks two words into the past) and thus to the n-gram (which
looks n − 1 words into the past).
Let’s see a general equation for this n-gram approximation to the conditional
probability of the next word in a sequence. We’ll use N here to mean the n-gram
size, so N = 2 means bigrams and N = 3 means trigrams. Then we approximate the
probability of a word given its entire context as follows:
P(wn |w1:n−1 ) ≈ P(wn |wn−N+1:n−1 ) (3.8)
Given the bigram assumption for the probability of an individual word, we can com-
pute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
n
Y
P(w1:n ) ≈ P(wk |wk−1 ) (3.9)
k=1
maximum
How do we estimate these bigram or n-gram probabilities? An intuitive way to
likelihood estimate probabilities is called maximum likelihood estimation or MLE. We get
estimation
the MLE estimate for the parameters of an n-gram model by getting counts from a
normalize corpus, and normalizing the counts so that they lie between 0 and 1.1
For example, to compute a particular bigram probability of a word wn given a
previous word wn−1 , we’ll compute the count of the bigram C(wn−1 wn ) and normal-
ize by the sum of all the bigrams that share the same first word wn−1 :
C(wn−1 wn )
P(wn |wn−1 ) = P (3.10)
w C(wn−1 w)
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):
C(wn−1 wn )
P(wn |wn−1 ) = (3.11)
C(wn−1 )
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol. </s>2
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
1 For probabilistic models, normalizing means dividing by some total count so that the resulting proba-
bilities fall between 0 and 1.
2 We need the end-symbol to make the bigram grammar a true probability distribution. Without an end-
symbol, instead of the sentence probabilities of all sentences summing to one, the sentence probabilities
for all sentences of a given length would sum to one. This model would define an infinite set of probability
distributions, with one distribution per sentence length. See Exercise 3.5.
3.1 • N-G RAMS 35
Here are the calculations for some of the bigram probabilities from this corpus
2 1 2
P(I|<s>) = 3 = .67 P(Sam|<s>) = 3 = .33 P(am|I) = 3 = .67
1 1 1
P(</s>|Sam) = 2 = 0.5 P(Sam|am) = 2 = .5 P(do|I) = 3 = .33
For the general case of MLE n-gram parameter estimation:
C(wn−N+1:n−1 wn )
P(wn |wn−N+1:n−1 ) = (3.12)
C(wn−N+1:n−1 )
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
relative
frequency This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood
of the training set T given the model M (i.e., P(T |M)). For example, suppose the
word Chinese occurs 400 times in a corpus of a million words like the Brown corpus.
What is the probability that a random word selected from some other text of, say,
400
a million words will be the word Chinese? The MLE of its probability is 1000000
or .0004. Now .0004 is not the best possible estimate of the probability of Chinese
occurring in all situations; it might turn out that in some other corpus or context
Chinese is a very unlikely word. But it is the probability that makes it most likely
that Chinese will occur 400 times in a million-word corpus. We present ways to
modify the MLE estimates slightly to get better probability estimates in Section 3.5.
Let’s move on to some examples from a slightly larger corpus than our 14-word
example above. We’ll use data from the now-defunct Berkeley Restaurant Project,
a dialogue system from the last century that answered questions about a database
of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-
normalized sample user queries (a sample of 9332 sentences is on the website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the
Berkeley Restaurant Project. Note that the majority of the values are zero. In fact,
we have chosen the sample words to cohere with each other; a matrix selected from
a random set of eight words would be even more sparse.
Figure 3.2 shows the bigram probabilities after normalization (dividing each cell
in Fig. 3.1 by the appropriate unigram for its row, taken from the following set of
unigram probabilities):
1, the more probabilities we multiply together, the smaller the product becomes.
Multiplying enough n-grams together would result in numerical underflow. By using
log probabilities instead of raw probabilities, we get numbers that are not as small.
Adding in log space is equivalent to multiplying in linear space, so we combine log
probabilities by adding them. The result of doing all computation and storage in log
space is that we only need to convert back into probabilities if we need to report
them at the end; then we can just take the exp of the logprob:
want as much training data as possible. At the minimum, we would want to pick
the smallest test set that gives us enough statistical power to measure a statistically
significant difference between two potential models. In practice, we often just divide
our data into 80% training, 10% development, and 10% test. Given a large corpus
that we want to divide into training and test, test data can either be taken from some
continuous sequence of text inside the corpus, or we can remove smaller “stripes”
of text from randomly selected parts of our corpus and combine them into a test set.
3.2.1 Perplexity
In practice we don’t use raw probability as our metric for evaluating language mod-
perplexity els, but a variant called perplexity. The perplexity (sometimes called PPL for short)
of a language model on a test set is the inverse probability of the test set, normalized
by the number of words. For a test set W = w1 w2 . . . wN ,:
1
perplexity(W ) = P(w1 w2 . . . wN )− N (3.14)
s
1
= N
P(w1 w2 . . . wN )
v
uN
uY 1
perplexity(W ) = t
N
(3.15)
P(wi |w1 . . . wi−1 )
i=1
The perplexity of a test set W depends on which language model we use. Here’s
the perplexity of W with a unigram language model (just the geometric mean of the
unigram probabilities):
v
uN
uY 1
perplexity(W ) = t N
(3.16)
P(wi )
i=1
Note that because of the inverse in Eq. 3.15, the higher the conditional probabil-
ity of the word sequence, the lower the perplexity. Thus, minimizing perplexity is
equivalent to maximizing the test set probability according to the language model.
What we generally use for word sequence in Eq. 3.15 or Eq. 3.17 is the entire se-
quence of words in some test set. Since this sequence will cross many sentence
boundaries, we need to include the begin- and end-sentence markers <s> and </s>
in the probability computation. We also need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>) in the total count of word to-
kens N.
There is another way to think about perplexity: as the weighted average branch-
ing factor of a language. The branching factor of a language is the number of possi-
ble next words that can follow any word. Consider the task of recognizing the digits
3.2 • E VALUATING L ANGUAGE M ODELS 39
in English (zero, one, two,..., nine), given that (both in some training set and in some
1
test set) each of the 10 digits occurs with equal probability P = 10 . The perplexity of
this mini-language is in fact 10. To see that, imagine a test string of digits of length
N, and assume that in the training set all the digits occurred with equal probability.
By Eq. 3.15, the perplexity will be
1
perplexity(W ) = P(w1 w2 . . . wN )− N
1 N −1
= ( ) N
10
1 −1
=
10
= 10 (3.18)
But suppose that the number zero is really frequent and occurs far more often
than other numbers. Let’s say that 0 occur 91 times in the training set, and each of the
other digits occurred 1 time each. Now we see the following test set: 0 0 0 0 0 3 0 0 0
0. We should expect the perplexity of this test set to be lower since most of the time
the next number will be zero, which is very predictable, i.e. has a high probability.
Thus, although the branching factor is still 10, the perplexity or weighted branching
factor is smaller. We leave this exact calculation as exercise 3.12.
We see in Section 3.8 that perplexity is also closely related to the information-
theoretic notion of entropy.
We mentioned above that perplexity is a function of both the text and the lan-
guage model: given a text W , different language models will have different perplex-
ities. Because of this, perplexity can be used to compare different n-gram models.
Let’s look at an example, in which we trained unigram, bigram, and trigram gram-
mars on 38 million words (including start-of-sentence tokens) from the Wall Street
Journal, using a 19,979 word vocabulary. We then computed the perplexity of each
of these models on a test set of 1.5 million words, using Eq. 3.16 for unigrams,
Eq. 3.17 for bigrams, and the corresponding equation for trigrams. The table below
shows the perplexity of a 1.5 million word WSJ test set according to each of these
grammars.
Unigram Bigram Trigram
Perplexity 962 170 109
As we see above, the more information the n-gram gives us about the word
sequence, the higher the probability the n-gram will assign to the string. A trigram
model is less surprised than a unigram model because it has a better idea of what
words might come next, and so it assigns them a higher probability. And the higher
the probability, the lower the perplexity (since as Eq. 3.15 showed, perplexity is
related inversely to the likelihood of the test sequence according to the model). So a
lower perplexity can tell us that a language model is a better predictor of the words
in the test set.
Note that in computing perplexities, the n-gram model P must be constructed
without any knowledge of the test set or any prior knowledge of the vocabulary of
the test set. Any kind of knowledge of the test set can cause the perplexity to be
artificially low. The perplexity of two language models is only comparable if they
use identical vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) im-
provement in the performance of a language processing task like speech recognition
40 C HAPTER 3 • N- GRAM L ANGUAGE M ODELS
polyphonic
p=.0000018
however
the of a to in (p=.0003)
Figure 3.3 A visualization of the sampling distribution for sampling sentences by repeat-
edly sampling unigrams. The blue bar represents the relative frequency of each word (we’ve
ordered them from most frequent to least frequent, but the choice of order is arbitrary). The
number line shows the cumulative probabilities. If we choose a random number between 0
and 1, it will fall in an interval corresponding to some word. The expectation for the random
number to fall in the larger intervals of one of the frequent words (the, of, a) is much higher
than in the smaller interval of one of the rare words (polyphonic).
We can use the same technique to generate bigrams by first generating a ran-
dom bigram that starts with <s> (according to its bigram probability). Let’s say the
second word of that bigram is w. We next choose a random bigram starting with w
(again, drawn according to its bigram probability), and so on.
given training corpus. Another implication is that n-grams do a better and better job
of modeling the training corpus as we increase the value of N.
We can use the sampling method from the prior section to visualize both of
these facts! To give an intuition for the increasing power of higher-order n-grams,
Fig. 3.4 shows random sentences generated from unigram, bigram, trigram, and 4-
gram models trained on Shakespeare’s works.
–To him swallowed confess hear both. Which. Of save on trail for are ay device and
1
gram
rote life have
–Hill he late speaks; or! a more to leg less first you enter
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live
2
gram
king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say,
3
gram
’tis done.
–This shall forbid it should be branded, if renown made it empty.
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A
4
gram
great banquet serv’d in;
–It cannot be but so.
Figure 3.4 Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All
characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected
for capitalization to improve readability.
The longer the context on which we train the model, the more coherent the sen-
tences. In the unigram sentences, there is no coherent relation between words or any
sentence-final punctuation. The bigram sentences have some local word-to-word
coherence (especially if we consider that punctuation counts as a word). The tri-
gram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a
careful investigation of the 4-gram sentences shows that they look a little too much
like Shakespeare. The words It cannot be but so are directly from King John. This is
because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora
go (N = 884, 647,V = 29, 066), and our n-gram probability matrices are ridiculously
sparse. There are V 2 = 844, 000, 000 possible bigrams alone, and the number of pos-
sible 4-grams is V 4 = 7 × 1017 . Thus, once the generator has chosen the first 4-gram
(It cannot be but), there are only five possible continuations (that, I, he, thou, and
so); indeed, for many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an
n-gram grammar trained on a completely different corpus: the Wall Street Journal
(WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so
we might expect some overlap between our n-grams for the two genres. Fig. 3.5
shows sentences generated by unigram, bigram, and trigram grammars trained on
40 million words from WSJ.
Compare these examples to the pseudo-Shakespeare in Fig. 3.4. While they both
model “English-like sentences”, there is clearly no overlap in generated sentences,
and little overlap even in small phrases. Statistical models are likely to be pretty use-
less as predictors if the training sets and the test sets are as different as Shakespeare
and WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
42 C HAPTER 3 • N- GRAM L ANGUAGE M ODELS
1
gram
Months the my and issue of year foreign new exchange’s september
were recession exchange new endorsed a acquire to six executives
Last December through the way to preserve the Hudson corporation N.
2
gram
B. E. C. Taylor would seem to complete the major central planners one
point five percent of U. S. E. has already old M. X. corporation of living
on information such as more frequently fishing to keep her
They also point to ninety nine point six billion dollars from two hundred
3
gram
four oh six three percent of the rates of interest stores as Mexico and
Brazil on market conditions
Figure 3.5 Three sentences randomly generated from three n-gram models computed from
40 million words of the Wall Street Journal, lower-casing all characters and treating punctua-
tion as words. Output was then hand-corrected for capitalization to improve readability.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
ebookbell.com