Lecture 4
10 April 2024
Language Modeling
Introduction to n-grams and statistical language models
N-grams and Language Models
• N-gram: an important concept in NLP and the basis for language modeling
• N-grams: contiguous sequences of tokens from a given text
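As a quick illustration (a minimal sketch of ours, not from the slides), extracting n-grams from a tokenized sentence:

```python
# Minimal sketch: contiguous n-grams from a list of tokens.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I want to eat Chinese food".split(), 2))
# [('I', 'want'), ('want', 'to'), ('to', 'eat'), ('eat', 'Chinese'), ('Chinese', 'food')]
```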
Probabilistic Language Models based on N-grams
Application example: machine translation (e.g., http://www.deepl.com)
Application example: Answer/Query Completion/Suggestion
Application example: Dialogue generation
• More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
Markov Assumption
Andrei Markov (1856–1922)
• Approximate each conditioning history by its last k words:
P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)
Example of random text generated by an n-gram model trained on the Wall Street Journal:
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
Estimating bigram probabilities with the Maximum Likelihood Estimate:
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
or, abbreviating count as c:
P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
Maximum Likelihood Estimate (MLE)
• The maximum likelihood estimate
• of some parameter of a model M from a training set T
• maximizes the likelihood of the training set T given the model M
• Suppose the word “bagel” occurs 400 times in a corpus of a million words
• What is the probability that a random word from some other text will be “bagel”?
• MLE estimate is 400/1,000,000 = .0004
• This may be a bad estimate for some other corpus…
• But it is the estimate that makes it most likely that “bagel” will occur 400 times in our million-word corpus
Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
https://web.stanford.edu/~jurafsky/icslp-red.pdf
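A small sketch (ours, not from the lecture) reproducing the MLE bigram estimates on this toy corpus:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    # P(w | prev) = c(prev, w) / c(prev)
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3: "I" starts 2 of the 3 sentences
print(p_mle("Sam", "am"))  # 1/2: "am" occurs twice, once followed by "Sam"
```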
Raw bigram counts
• Out of 9,222 sentences of the Berkeley Restaurant Corpus
[Table of raw bigram counts shown on slide]
Raw bigram probabilities
• Normalize by unigram counts:
[Table of resulting bigram probabilities shown on slide]
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I | <s>)
× P(want | I)
× P(english | want)
× P(food | english)
× P(</s> | food)
= .000031

P(<s> I want chinese food </s>) = P(I | <s>)
× P(want | I)
× P(chinese | want)
× P(food | chinese)
× P(</s> | food)
= .25 × .33 × .0065 × .52 × .68 = .00019
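The last line is just the product of the bigram probabilities; a one-line check:

```python
# Reproducing the arithmetic from the slide.
print(0.25 * 0.33 * 0.0065 * 0.52 * 0.68)  # ≈ 0.00019
```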
What kind of knowledge is captured by an LM?
• P(english | want) = .0011 ← domain knowledge?
• P(chinese | want) = .0065 ← domain knowledge?
• P(to | want) = .66 ← grammar
• P(eat | to) = .28 ← grammar
• P(i | <s>) = .25 ← grammar
• P(food | to) = 0 ← grammar (or a gap in the data?)
Practical Issues
• Better to do everything in log space:
• avoids numerical underflow
• adding is also often faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
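A minimal sketch (ours) of the same sentence-probability computation done in log space, reusing the bigram probabilities from the “chinese food” example above:

```python
import math

probs = [0.25, 0.33, 0.0065, 0.52, 0.68]
log_p = sum(math.log(p) for p in probs)  # add logs instead of multiplying
print(log_p)            # ≈ -8.57
print(math.exp(log_p))  # ≈ 0.00019 (recovering the raw probability)
```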
Dataset example: Google N-Gram Release,
August 2006
https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dataset example: Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
Another dataset example: Google Books N-grams
• Google Books datasets:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
https://books.google.com/ngrams
Research example: Timestamping Documents using Google Ngram Books Data
[Figure: estimated document age for a sample document]
A. Jatowt, R. Campos: Interactive System for Reasoning about Document Age. CIKM 2017: 2471-2474
Estimating Text Age using Ngrams
Language Modeling Toolkit Examples
• KenLM
• https://kheafield.com/code/kenlm/
• SRILM
• http://www.speech.sri.com/projects/srilm/
Language Modeling
Evaluation and Perplexity
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assigns higher probability to “real” or “frequently observed” sentences than to
“ungrammatical” or “rarely observed” sentences?
• We train parameters of the model on a training set
• We test the model’s performance on data we haven’t seen
• (A test set is an unseen dataset that is different from the training set)
• An evaluation metric tells us how well our model does on the test set
Training on the test set
• We cannot allow test sentences into the training set
• We would assign them an artificially high probability when we see them in the
test set
• “Training on the test set”
• Bad science
• And it violates research ethics
Training and Test Sets
• Ideally, the training (and test) corpus should be representative of the actual
application data
• Sometimes we may need to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• Spelling corrector, speech recognizer, MT system, etc.
• Run the task, get an accuracy for system A and for B, e.g.:
• How many misspelled words corrected properly?
• How many words translated correctly?
• Compare the accuracy for A and B
Example of extrinsic evaluation
• Instead of perplexity (to be described soon), which is easier to compute
• We may want a more credible measure, such as improvement in a real-life scenario, e.g., automatic speech recognition
• where the quality of the recognized text (as in OCR) can be measured by Word Error Rate or Character Error Rate
Difficulty of extrinsic evaluation of N-gram models
• Extrinsic evaluation
• Time-consuming: can take days or weeks, and is costly
• So
• We use intrinsic evaluation: perplexity
• A bad approximation of task performance
• unless the test data looks just like the training data
• so generally only useful in pilot experiments
• But helpful to think about
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
• Candidate continuations for the pizza example, with probabilities:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100
• Unigrams are terrible at this game (why?)
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity PP(W) is the inverse probability of the test set, normalized by the number of words N:
PP(W) = P(w1 w2 … wN)^(−1/N) = the N-th root of 1 / P(w1 w2 … wN)
Expanding with the chain rule: PP(W) = (∏i 1 / P(wi | w1 … wi−1))^(1/N)
For bigrams: PP(W) = (∏i 1 / P(wi | wi−1))^(1/N)
Example: training on 38 million words and testing on 1.5 million words of the Wall Street Journal [perplexity table shown on slide]
The more information the n-gram gives about the word sequence, the lower the perplexity. Perplexity is related inversely to the likelihood of the test sequence according to the model.
Evaluation of Language Models (summary)
• Ideally, evaluate use of model in end application (extrinsic, in vivo)
• Realistic
• Expensive
• Evaluate on ability to model test corpus (intrinsic)
• Less realistic
• Cheaper
• Verify at least once that the intrinsic evaluation correlates with an extrinsic one
Evaluation of Language Models (summary)
• Perplexity - measure of how well a model “fits” the test data
• Uses the probability that the model assigns to the test corpus
• Normalizes for the number of words in the test corpus and takes the inverse
PP(W) = P(w1 w2 … wN)^(−1/N) = the N-th root of 1 / P(w1 w2 … wN)
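A direct translation of this formula into Python (our sketch), taking the total log probability of the test set as input to avoid underflow:

```python
import math

def perplexity(total_log_prob, n_tokens):
    # PP(W) = P(w1..wN)^(-1/N) = exp(-(1/N) * log P(w1..wN))
    return math.exp(-total_log_prob / n_tokens)

# e.g., the 5 bigram predictions of the "chinese food" example sum to ≈ -8.57
print(perplexity(-8.57, 5))  # ≈ 5.55
```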
Language Modeling
Generalization and zeros
Generation Method
• Choose a random bigram (<s>, w) according to its probability
• Next choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together
Example:
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
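A sketch of this generation loop (ours), assuming a `bigram_counts` mapping of (w, x) pairs to counts, including the <s> and </s> markers, like the Counter built earlier:

```python
import random
from collections import defaultdict

def generate(bigram_counts):
    # Group counts by history word: w -> list of (successor, count).
    successors = defaultdict(list)
    for (w, x), c in bigram_counts.items():
        successors[w].append((x, c))
    word, words = "<s>", []
    while True:
        candidates, weights = zip(*successors[word])
        # Sampling proportionally to counts == sampling by bigram probability.
        word = random.choices(candidates, weights=weights)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)
```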
Out of Vocabulary (OOV) words
• The problem of words whose n-gram probability is 0 needs to be solved (discussed soon, in the next section)
• But what about words we have simply never seen before?
• Unknown words, or out-of-vocabulary (OOV) words
• OOV rate: the percentage of OOV words in the test set
• We sometimes model potential unknown words in the test set by adding
a pseudo-word called <UNK> (explained in next slide)
Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance
• Vocabulary V is fixed
• A closed vocabulary task
• Often we don’t know them in advance…
• Out Of Vocabulary = OOV words
• An open vocabulary task
• MLE estimate: PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)
Adding probability mass to unseen events requires removing it from seen ones
(discounting) in order to maintain a joint distribution that sums to 1
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
[Table of add-one counts shown on slide]
Laplace-smoothed bigrams
PAdd-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V), where V is the vocabulary size
[Table shown on slide]
Reconstituted counts
It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts
[Table shown on slide]
Compare with raw bigram counts
Add-one smoothing has made a very big change to the counts (much probability mass moved to all the zeros)
Add-1 estimation is a blunt instrument…
• The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros
• Straightforward solution: since too much mass is given to unseen events, add a smaller value δ with 0 < δ < 1 instead (hence normalizing by δV instead of V)
• But add-1 isn’t used for N-grams:
• We’ll see better methods
• But add-1 is used to smooth other NLP models
• For text classification
• In domains where the number of zeros isn’t so large
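For completeness, a hedged sketch (ours) of add-δ smoothing, with δ = 1 giving Laplace (add-one); the function and parameter names are illustrative:

```python
def p_add_delta(w, prev, bigram_counts, unigram_counts, vocab_size, delta=1.0):
    # (c(prev, w) + delta) / (c(prev) + delta * V); delta = 1 is add-one.
    return (bigram_counts.get((prev, w), 0) + delta) / \
           (unigram_counts.get(prev, 0) + delta * vocab_size)

# e.g. p_add_delta("Sam", "am", bigram_counts, unigram_counts, len(unigram_counts))
```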
Language Modeling
Interpolation, Backoff
Example
• Corpus split: Training Data | Held-Out Data | Test Data
• (smoothing/interpolation parameters are typically tuned on the held-out data)
• Unigram score: S(wi) = count(wi) / N
• Span notation: count(wi−1..i) = count(wi−1, wi); count(wi−2..i) = count(wi−2, wi−1, wi)
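A minimal sketch (ours) of linear interpolation of unigram, bigram, and trigram estimates; `p_uni`, `p_bi`, and `p_tri` are assumed to be estimators like those above, and the λ weights would be tuned on the held-out data:

```python
def p_interpolated(w, prev1, prev2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # P(w | prev2 prev1) = l1*P(w) + l2*P(w | prev1) + l3*P(w | prev2 prev1)
    l1, l2, l3 = lambdas  # must sum to 1
    return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev1, prev2)
```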
Selected Advanced Language Modelling Concepts
• Caching Models
• Recently used words are more likely to appear
PCACHE(w | history) = λ P(wi | wi−2 wi−1) + (1 − λ) · c(w ∈ history) / |history|
• Bias-vs-Variance trade-off: choice of n
• To choose a value for n in an n-gram model, find the right trade-off between the stability of the estimate and its appropriateness.
• For example, trigram is a common choice with large training corpora (millions of words), while a
bigram is often used with smaller ones…
• Skip-grams
• A generalization of n-grams in which words need not be consecutive in the text but may leave gaps that are skipped over (another way of overcoming the data sparsity problem); see the sketch below
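An illustrative sketch (ours) of extracting k-skip bigrams, i.e., word pairs separated by at most k skipped words:

```python
def skip_bigrams(tokens, k=1):
    # Pairs (tokens[i], tokens[j]) with at most k words skipped between them.
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

print(skip_bigrams("I want to eat".split(), k=1))
# [('I', 'want'), ('I', 'to'), ('want', 'to'), ('want', 'eat'), ('to', 'eat')]
```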
Selected Advanced Language Modelling Concepts
• Fertility
• Number of distinct types a word occurs with (e.g., compare “delay” and “Francisco”: which is more likely in an arbitrary new context?)
• POS n-grams
• Integer encodings of n-grams
Summary
• Language models assign a probability that a sentence is a “legal” string in a
language
• They can also predict a word from preceding words
• They are useful as a component of many NLP systems, such as ASR, OCR, and MT
• Simple N-gram models are easy to train on raw, unannotated corpora and can provide useful estimates of sentence likelihood
• N-gram LMs can be evaluated extrinsically in a task or intrinsically using perplexity
• N-gram LMs = Markov models estimating words from a fixed window of
previous words, with probabilities estimated from normalized corpus
frequencies (MLE)
Summary (cont.)
• MLE gives inaccurate parameters for models trained on sparse data
• Smoothing algorithms make it possible to estimate the probabilities of unseen (but not impossible) N-grams, using lower-order n-grams via backoff or interpolation
A Problem for N-Grams: Long-Distance Dependencies
• Many times local context does not provide the most useful predictive
clues, which instead are provided by long-distance dependencies
• Syntactic dependencies
• “The man next to the large oak tree near the grocery store on the corner is tall.”
• “The men next to the large oak tree near the grocery store on the corner are tall.”
• Semantic dependencies
• “The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on the corner talks rapidly.”
• Hence, the Markov assumption may be questioned.
• More complex models of language are needed to handle such
dependencies…
Task for next week (deadline 4/17, 14:15)
Construct a letter-based language model for detecting the type of English used in
documents
1. Download and unpack the file from OLAT (ngram_task.zip)
2. Build a 3-gram LM based on letters separately for each variant of English: British, American and
Australian using the training data files
3. Generate 5 random sentences from each model based on the Shannon method shown during the
lecture (no need to care about sentence segmentation and sentence markers, unless you want to..)
4. Take the test examples and estimate the perplexity of each one under your LMs
5. Calculate the accuracy of identifying the English variant over all the test examples
• (Optionally, you might experiment with values of n other than 3 for the character n-grams)
6. Include code and results in the report and discuss any decisions or assumptions made
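A possible starting point for step 2: counting character trigrams per variant. This is just a hedged sketch, not the required solution; the file names are placeholders for whatever is in ngram_task.zip:

```python
from collections import Counter

def char_trigram_counts(path):
    # Count overlapping character 3-grams in a training file.
    text = open(path, encoding="utf-8").read().lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# counts = {v: char_trigram_counts(f"{v}_train.txt")   # placeholder file names
#           for v in ("british", "american", "australian")}
```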
Paper 1
https://www.aclweb.org/anthology/P05-1065.pdf
Paper 2
https://csaws.cs.technion.ac.il/~yahave/papers/pldi14-statistical.pdf
Paper 3
https://arxiv.org/pdf/2103.10918.pdf