Lecture 4

The document discusses statistical language modeling and n-grams. It explains how n-grams are used to build probabilistic language models and compute the probability of word sequences. Various applications of language models are also covered like machine translation, spell checking, speech recognition, and dialogue generation.

Advanced Data Engineering & Analytics:

Statistical Language Modelling

10 April 2024
Language Modeling
Introduction to n-grams and statistical
language models
N-grams and Language Models
• N-gram: important concept in NLP which is the basis for language
modelling
• N-grams – contiguous sequences of tokens from a given text
Probabilistic Language Models based on N-
grams

• Goal: assign a probability to word sequence


• Machine Translation:
• P(high winds tonight) > P(large winds tonight)
• Spell Correction
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• Summarization, question-answering, etc.
Application example: OCR
• to fee great Pompey paffe the Areets of Rome:
• to see great Pompey passe the streets of Rome:
Application example: Machine Translation
• Fidelity (to source text)
• Fluency (of the translation)

http://www.deepl.com
Application example: Answer/Query
Completions/Suggestion
Application example: Dialogue generation

Li et al., "Deep Reinforcement Learning for Dialogue Generation" (EMNLP 2016)


Other Uses
• Augmentative & Alternative Communication (AAC)
systems
• For users who are physically unable to write/sign but can
for example use eye gaze
• Effective prediction of the word to be chosen is important
• Predictive text input systems can guess what the user is
typing and offer choices on how to complete it
https://pdos.csail.mit.edu/archive/scigen/
https://pdos.csail.mit.edu/archive/scigen/#about
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5,…,wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model
• Technically, “grammar” might be a better name, but “language model” (LM) is the standard term
How to compute P(W)
• How to compute this joint probability:

P(its, water, is, so, transparent, that)

• Intuition: rely on the Chain Rule of Probability


The Chain Rule
• Definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)

• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

• The Chain Rule in general:


P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence

P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi-1)

P(“its water is so transparent”) =


P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities?
• Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

• No, too many possible sentences!


• We’ll never see enough data for estimating these..
Sparsity
• New words (Heaps' law: open vocabulary)
• Old words in “new” contexts (Zipf's law: most word types are rare)
Markov Assumption
• Simplifying assumption (condition only on the last word):
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe (condition on the last two words):
P(the | its water is so transparent that) ≈ P(the | transparent that)

Andrei Markov
(1856 – 1922)
Markov Assumption
P(w1 w2 … wn) ≈ ∏_i P(wi | wi-k … wi-1)

• In other words, we approximate each component in the product:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)


Markov Assumption (general definition)
• The Markov assumption is the assumption that the future behavior of a dynamic system
depends only on its recent history
• In particular, in a kth-order Markov model, the next state depends only on the k most
recent states; therefore, an N-gram model is an (N−1)-order Markov model

• 1st-order Markov model = bigram model
• 2nd-order Markov model = trigram model
Simplest case: Unigram model

P(wi | w1 w2 … wi-1) ≈ P(wi)

Some automatically generated sentences from a unigram model:


fifth, an, of, futures, the, an, incorporated, a, a, the, inflation,
most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the


Bigram model
Condition on the previous word:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr.,
gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred,
fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november


N-gram models
• We can extend to trigrams (3-grams), 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”

• Yet we can often get away with N-gram models…


Estimating bigram probabilities
• The Maximum Likelihood Estimate

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

or, writing c(·) for count(·):

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Maximum Likelihood Estimate (MLE)
• The maximum likelihood estimate
• of some parameter of a model M from a training set T
• maximizes the likelihood of the training set T given the model M

• Suppose the word “bagel” occurs 400 times in a corpus of a million words
• What is the probability that a random word from some other text will be “bagel”?
• MLE estimate is 400/1,000,000 = .0004
• This may be a bad estimate for some other corpus..
• But it is the estimate that makes it most likely that “bagel” will occur 400 times in our million word
corpus
Example
Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

To have a consistent probabilistic model, we append a unique start (<s>) and end (</s>)
symbol to every sentence and treat these as additional words.
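A minimal Python sketch (not part of the original slides) of how these MLE bigram estimates can be computed from the toy corpus above; the variable names and printed examples are illustrative:

```python
# Minimal sketch: MLE bigram estimates from the toy corpus above.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                 # c(w)
    bigram_counts.update(zip(tokens, tokens[1:])) # c(w_{i-1}, w_i)

def p_mle(word, prev):
    """P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2
```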
More Examples: Berkeley Restaurant Project
sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
• …

https://web.stanford.edu/~jurafsky/icslp-red.pdf
Raw bigram counts
• Out of 9,222 sentences
Raw bigram probabilities
• Normalize by unigram counts:

• Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
P(<s> I want chinese food </s>) = P(I | <s>)
* P(want | I)
* P(chinese | want)
* P(food | chinese)
* P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019
What kind of knowledge is captured by LM?

• P(english | want) = .0011
• P(chinese | want) = .0065    (domain knowledge?)
• P(to | want) = .66
• P(eat | to) = .28
• P(i | <s>) = .25             (grammar)
• P(food | to) = 0
Practical Issues
• Better to do everything in log space:
• avoids numerical underflow
• adding is also often faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
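As a small illustration (reusing the bigram values from the restaurant example above), multiplying probabilities and summing log probabilities give the same result:

```python
import math

# Bigram probabilities from the Berkeley Restaurant example above:
# P(I|<s>), P(want|I), P(chinese|want), P(food|chinese), P(</s>|food)
probs = [0.25, 0.33, 0.0065, 0.52, 0.68]

product = math.prod(probs)                  # multiply in probability space
log_sum = sum(math.log(p) for p in probs)   # add in log space

print(product)            # ≈ 0.00019
print(math.exp(log_sum))  # same value, recovered from log space
```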
Dataset example: Google N-Gram Release,
August 2006

https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dataset example: Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Another dataset example: Google Book N-
grams
• Google Books datasets:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html

https://books.google.com/ngrams
Research example: Timestamping Documents using
Google Ngram Books Data
(Screenshot: estimated document age, with supporting evidence from the dates of first appearance of the text's n-grams over time)
A. Jatowt, R. Campos: Interactive System for Reasoning about Document Age. CIKM 2017: 2471-2474
Estimating Text Age using Ngrams
Language Modeling Toolkit Examples
• KenLM
• https://kheafield.com/code/kenlm/
• SRILM
• http://www.speech.sri.com/projects/srilm/
Language Modeling
Evaluation and Perplexity
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assigns higher probability to “real” or “frequently observed” sentences than to
“ungrammatical” or “rarely observed” sentences?
• We train parameters of the model on a training set
• We test the model’s performance on data we haven’t seen
• (A test set is an unseen dataset that is different from training set)
• An evaluation metric tells us how well our model does on the test set
Training on the test set
• We cannot allow test sentences into the training set
• We would assign them an artificially high probability when we see them in the
test set
• “Training on the test set”
• Bad science
• And violates ethics..
Training and Test Sets
• Ideally, the training (and test) corpus should be representative of the actual
application data
• Sometimes we may need to adapt a general model to a small amount of new
(in-domain) data by adding highly weighted small corpus to original training
data
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• Spelling corrector, speech recognizer, MT system, etc.
• Run the task, get an accuracy for system A and for B, e.g.:
• How many misspelled words corrected properly?
• How many words translated correctly?
• Compare the accuracy for A and B
Example of extrinsic evaluation
• Instead of perplexity (to be described soon), which is easier to evaluate
• We may want a more credible measure, such as improvement in a real-life
scenario, e.g. automatic speech recognition
• where the quality of the recognized output (same as in OCR) can be measured by
Word Error Rate or Character Error Rate
Difficulty of extrinsic evaluation of N-gram
models
• Extrinsic evaluation
• Time-consuming; can take days or weeks, and costly
• So
• We use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• so generally only useful in pilot experiments
• But is helpful to think about
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
• Example distribution over the blank in the first sentence:
  mushrooms 0.1
  pepperoni 0.1
  anchovies 0.01
  …
  fried rice 0.0001
  …
  and 1e-100
• Unigrams are terrible at this game (why?)

• A better model of a text is one which assigns a higher probability to the word that
actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 ... wN)^(-1/N)
      = (1 / P(w1 w2 ... wN))^(1/N)

Expanding with the chain rule:

PP(W) = ( ∏_i 1 / P(wi | w1 ... wi-1) )^(1/N)

For bigrams:

PP(W) = ( ∏_i 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

Lower perplexity = better model

Training 38 million words, test 1.5 million words, Wall Street Journal

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

The more information the n-gram gives about the word sequence,
the lower the perplexity. Perplexity is related inversely to the
likelihood of the test sequence according to the model.
Evaluation of Language Models (summary)
• Ideally, evaluate use of model in end application (extrinsic, in vivo)
• Realistic
• Expensive
• Evaluate on ability to model test corpus (intrinsic)
• Less realistic
• Cheaper
• Verify at least once that intrinsic evaluation correlates with an extrinsic one..
Evaluation of Language Models (summary)
• Perplexity - measure of how well a model “fits” the test data
• Uses the probability that the model assigns to the test corpus
• Normalizes for the number of words in the test corpus and takes the inverse

PP(W) = (1 / P(w1 w2 ... wN))^(1/N)
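A minimal sketch of computing perplexity from per-word conditional probabilities; the cond_prob function is a stand-in for whatever (smoothed) model has been trained, not code from the lecture:

```python
import math

def perplexity(test_tokens, cond_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space to avoid underflow.

    cond_prob(word, prev) must return P(word | prev) under your model.
    If it ever returns 0, math.log fails: this is exactly the
    zero-probability problem discussed later, which smoothing addresses.
    """
    n = len(test_tokens) - 1          # predicted words (skip the initial <s>)
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(cond_prob(word, prev))
    return math.exp(-log_prob / n)
```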
Language Modeling
Generalization and zeros
Generation Method
• Choose a random bigram (<s>, w) according to its probability
• Next choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together

Example:
  <s> I
      I want
        want to
             to eat
                eat Chinese
                    Chinese food
                            food </s>

  I want to eat Chinese food

Shannon 1951; Miller & Selfridge 1950
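A possible sketch of this Shannon-style generation procedure for a bigram model; the bigram_probs structure is an assumed representation, not the lecture's code:

```python
import random

def generate_sentence(bigram_probs):
    """Shannon-style generation from a bigram model.

    bigram_probs is assumed to map a context word to a dict of
    {next_word: probability}, e.g. bigram_probs["<s>"]["I"] = 0.25.
    Assumes every context seen in training has at least one continuation.
    """
    word = "<s>"
    out = []
    while True:
        next_words = list(bigram_probs[word].keys())
        weights = list(bigram_probs[word].values())
        # sample the next word according to P(next | word)
        word = random.choices(next_words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)
```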


Approximating Shakespeare
Shakespeare texts as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible
bigrams
• So 99.96% of the possible bigrams were never seen (have zero entries in the
table)
• 4-grams worse: what's coming out looks like Shakespeare because it is
Shakespeare..
The Wall Street Journal
Can you guess the source of these random 3-
gram sentences?
• They also point to ninety nine point six billion dollars from two hundred
four oh six three percent of the rates of interest stores as Mexico and
gram Brazil on market conditions
• This shall forbid it should be branded, if renown made it empty.
• “You are uniformly charming!” cried he, with a smile of associating and
now and then I bowed and they perceived a chaise and four to wish for.

Shakespeare WSJ Jane Austen


The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
• In real life, it often doesn’t
• We need to train robust models that generalize..
• One kind of generalization: “zeros”
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros
Training set: Test set:
… denied the allegations … denied the offer
… denied the reports … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0

The MLE underestimates the probability of words that could occur but were not seen in
training, and overestimates the probability of those that did occur in the training set
Zero probability bigrams
• Bigrams with zero probability
• mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)..

PP(W) = P(w1 w2 ... wN)^(-1/N) = (1 / P(w1 w2 ... wN))^(1/N)
Out of Vocabulary (OOV) words
• The problem of words whose n-gram probability is 0 needs to be solved
(will be discussed soon in next section..)
• But what about words we simply have never seen before?
• Unknown words, or out of vocabulary (OOV) words
• OOV rate - the percentage of OOV words that appear in the test set
• We sometimes model potential unknown words in the test set by adding
a pseudo-word called <UNK> (explained in next slide)
Unknown words: Open versus closed
vocabulary tasks
• If we know all the words in advance
• Vocabulary V is fixed
• A closed vocabulary task
• Often we don’t know them..
• Out Of Vocabulary = OOV words
• An open vocabulary task

• Instead: create an unknown word token <UNK>


• Training of <UNK> probabilities
• Create a fixed lexicon L of size V
• At the text normalization phase, any training word not in L is changed to <UNK>
• Now we train its probabilities like for any word
• At decoding time
• For text input: use <UNK> probabilities for any word not in training set
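A rough sketch of the <UNK> normalization step described above, assuming sentences are already tokenized; the helper name and the fixed-size-lexicon choice are illustrative:

```python
from collections import Counter

def replace_rare_with_unk(sentences, vocab_size):
    """Keep the vocab_size most frequent words as the lexicon L;
    map every other training token to <UNK>."""
    counts = Counter(tok for sent in sentences for tok in sent)
    lexicon = {w for w, _ in counts.most_common(vocab_size)}
    return [[tok if tok in lexicon else "<UNK>" for tok in sent]
            for sent in sentences]

# At decoding time the same lexicon is used: any test word not in L
# is likewise mapped to <UNK> before looking up its probability.
```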
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Zeros
Training set: Test set:
… denied the allegations … denied the offer
… denied the reports … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0

The MLE underestimates the probability of words that could occur but were not seen in
training, and overestimates the probability of those that did occur in the training set
The intuition behind smoothing
• When we have sparse statistics
• Borrow probability mass to generalize better
P(w | denied the), MLE counts:
  3 allegations
  2 reports
  1 claims
  1 request
  (7 total)

P(w | denied the), after borrowing probability mass:
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2   other
  (7 total)
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts

• MLE estimate:    P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

• Add-1 estimate:  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

Adding probability mass to unseen events requires removing it from seen ones
(discounting) in order to maintain a joint distribution that sums to 1
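A one-function sketch of the add-1 estimate, reusing Counter-style count tables as in the earlier MLE sketch; the names are illustrative:

```python
def p_add_one(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts.get((prev, word), 0) + 1) / \
           (unigram_counts.get(prev, 0) + vocab_size)

# Example: p_add_one("offer", "the", bigram_counts, unigram_counts, V)
# is nonzero even if "the offer" never occurred in training.
```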
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts

It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts
Compare with raw bigram counts

Add-one smoothing has made a very big change to the counts (much probability mass
moved to all the zeros)
Add-1 estimation is an inaccurate instrument..
• The sharp change in counts and probabilities occurs because too much probability mass
is moved to all the zeros
• Straightforward adjustment: since too much mass is given to unseen events, add a
fractional count k with 0 < k < 1 instead of 1 (so the denominator becomes c(wi-1) + kV
instead of c(wi-1) + V)
• But add-1 isn’t used for N-grams:
• We’ll see better methods
• But add-1 is used to smooth other NLP models
• For text classification
• In domains where the number of zeros isn’t so large
Language Modeling
Interpolation, Backoff
Example (figure): comparing a model with higher precision & higher variability of its
estimates against one with lower precision & less variability
Backoff
• Sometimes it helps to use less context
• Conditioning on less context for contexts you haven’t learned much about
• Backoff:
• use trigram if you have good evidence
• otherwise bigram, otherwise unigram
• Effectively, backing off to lower n-gram model when 0 evidence
• Discounting: distribute probability mass to maintain the probability distribution
• Interpolation:
• mix unigram, bigram, trigram, etc.

• Interpolation however works better


Interpolation
• A linear interpolation of two language models p and q is also a valid language model:

  λ·p + (1 - λ)·q,   with λ ∈ (0, 1)

  e.g., p = web text and q = political speeches
Linear interpolation
• Simple interpolation:
  P_interp(wn | wn-2 wn-1) = λ1·P(wn) + λ2·P(wn | wn-1) + λ3·P(wn | wn-2 wn-1),
  with λ1 + λ2 + λ3 = 1
• Lambdas conditional on context: the λs can themselves depend on the preceding words

How to set the lambdas?
• Use a held-out corpus to learn both simple and conditional λs

Training Data | Held-Out Data | Test Data

• Choose λs to maximize the probability of held-out data:


• Fix the N-gram probabilities (on the training data)
• Then search for such λs that give the largest probability of held-out set:
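A rough sketch of simple linear interpolation plus a crude grid search for the λs on held-out data; the p_uni/p_bi/p_tri callables are assumed, already-trained component models, not code from the lecture:

```python
import itertools
import math

def interp_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas):
    """P_interp(w | w1 w2) = l1*P(w) + l2*P(w|w2) + l3*P(w|w1 w2)."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

def pick_lambdas(held_out_trigrams, p_uni, p_bi, p_tri, step=0.1):
    """Crude grid search: keep the lambdas (summing to 1) that maximize
    the log-probability of the held-out (w1, w2, w) triples."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll, ok = 0.0, True
        for w1, w2, w in held_out_trigrams:
            p = interp_prob(w, w1, w2, p_uni, p_bi, p_tri, (l1, l2, l3))
            if p <= 0.0:          # this lambda setting zeroes out a held-out trigram
                ok = False
                break
            ll += math.log(p)
        if ok and ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```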
Smoothing for Web-scale N-grams
• “Stupid backoff” (Brants et al. 2007), designed for very large LMs
• No discounting
• Does not produce a probability distribution!

S(wi | wi-2 wi-1) = count(wi-2, wi-1, wi) / count(wi-2, wi-1)   if count(wi-2, wi-1, wi) > 0
                  = 0.4 · S(wi | wi-1)                          otherwise
S(wi | wi-1)      = count(wi-1, wi) / count(wi-1)               if count(wi-1, wi) > 0
                  = 0.4 · S(wi)                                 otherwise
S(wi)             = count(wi) / N
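A small sketch of stupid-backoff scoring for a trigram query using raw count tables; the 0.4 back-off factor follows Brants et al., while the table names are illustrative:

```python
def stupid_backoff(w, w1, w2, tri_counts, bi_counts, uni_counts,
                   total_tokens, alpha=0.4):
    """Stupid-backoff value S(w | w1 w2): a score, not a probability."""
    if tri_counts.get((w1, w2, w), 0) > 0:
        return tri_counts[(w1, w2, w)] / bi_counts[(w1, w2)]
    if bi_counts.get((w2, w), 0) > 0:
        return alpha * bi_counts[(w2, w)] / uni_counts[w2]
    return alpha * alpha * uni_counts.get(w, 0) / total_tokens
```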
Selected Advanced Language Modelling
Concepts
• Caching Models
• Recently used words are more likely to appear
P_CACHE(w | history) = λ · P(wi | wi-2 wi-1) + (1 - λ) · c(w ∈ history) / |history|
• Bias-vs-Variance trade-off: choice of n
• To choose a value for n in an n-gram model, find the right trade-off between the stability
of the estimate and its appropriateness.
• For example, trigram is a common choice with large training corpora (millions of words), while a
bigram is often used with smaller ones…
• Skip-grams
• A generalization of n-grams in which words need not be consecutive in the text, but may
leave gaps that are skipped over (another way of overcoming data sparsity problem)
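A small sketch of extracting k-skip bigrams from a token list (illustrative, not from the lecture):

```python
def skip_bigrams(tokens, max_skip=2):
    """All (w_i, w_j) pairs with at most max_skip words skipped between them."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

# skip_bigrams("insurgents killed in ongoing fighting".split(), max_skip=2)
# includes ("insurgents", "killed"), ("insurgents", "in"), ("insurgents", "ongoing"), ...
```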
Selected Advanced Language Modelling
Concepts
• Fertility
• Number of distinct types a word occurs with (e.g., compare “delay” and
“Francisco”: which is more likely in an arbitrary new context?)
• POS n-grams
• Integer encodings of n-grams
Summary
• Language models assign a probability that a sentence is a “legal” string in a
language
• They can also predict a word from preceding words
• They are useful as a component of many NLP systems, such as ASR, OCR, and MT
• Simple N-gram models are easy to train from raw (unannotated) corpora and can provide useful
estimates of sentence likelihood
• N-gram LMs can be evaluated extrinsically in a task or intrinsically using perplexity
• N-gram LMs = Markov models estimating words from a fixed window of
previous words, with probabilities estimated from normalized corpus
frequencies (MLE)
Summary (cont.)
• MLE gives inaccurate parameters for models trained on sparse data
• Smoothing algorithms make it possible to estimate the probabilities of unseen (but
not impossible) N-grams, for example by using lower-order n-grams via backoff or
interpolation
A Problem for N-Grams: Long Distance
Dependencies
• Many times local context does not provide the most useful predictive
clues, which instead are provided by long-distance dependencies
• Syntactic dependencies
• “The man next to the large oak tree near the grocery store on the corner is tall.”
• “The men next to the large oak tree near the grocery store on the corner are tall.”
• Semantic dependencies
• “The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on the corner talks rapidly.”
• Hence, the Markov assumption may be questioned.
• More complex models of language are needed to handle such
dependencies…
Task for next week (deadline 4/17, 14:15)
Construct a letter-based language model for detecting the type of English used in
documents
1. Download and unpack the file from OLAT (ngram_task.zip)
2. Build a 3-gram LM based on letters separately for each variant of English: British, American and
Australian using the training data files
3. Generate 5 random sentences from each model based on the Shannon method shown during the
lecture (no need to care about sentence segmentation and sentence markers, unless you want to..)
4. Take the test examples and estimate the perplexity of each one under each of your LMs
5. Calculate the accuracy of identifying the English variant over all the test examples
• (Optionally you might experiment with different n values for character n-grams than 3)
6. Include code and results in report and discuss any decisions or assumptions made
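A possible starting point (only a sketch, not a full solution) for counting letter trigrams; the padding symbol and lower-casing are assumptions you may change:

```python
from collections import Counter

def char_trigram_counts(text):
    """Count letter trigrams, padding each line with two '#' symbols:
    a possible starting point for the task, not a full solution."""
    trigrams = Counter()
    bigrams = Counter()
    for line in text.splitlines():
        chars = ["#", "#"] + list(line.lower())
        for i in range(2, len(chars)):
            trigrams[(chars[i - 2], chars[i - 1], chars[i])] += 1
            bigrams[(chars[i - 2], chars[i - 1])] += 1
    return trigrams, bigrams

# P(c3 | c1 c2) ≈ trigrams[(c1, c2, c3)] / bigrams[(c1, c2)], ideally with
# smoothing (add-1 or interpolation) so that test perplexities stay finite.
```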
Paper 1

https://www.aclweb.org/anthology/P05-1065.pdf
Paper 2

https://csaws.cs.technion.ac.il/~yahave/papers/pldi14-statistical.pdf
Paper 3

https://arxiv.org/pdf/2103.10918.pdf
