Lecture 4

The document discusses statistical language modeling and n-grams. It explains how n-grams are used to build probabilistic language models and compute the probability of word sequences. Various applications of language models are also covered like machine translation, spell checking, speech recognition, and dialogue generation.

Advanced Data Engineering & Analytics:

Statistical Language Modelling

10 April 2024
Language Modeling
Introduction to n-grams and statistical
language models
N-grams and Language Models
• N-gram: important concept in NLP which is the basis for language
modelling
• N-grams – contiguous sequences of tokens from a given text
Probabilistic Language Models based on N-
grams

• Goal: assign a probability to word sequence


• Machine Translation:
• P(high winds tonight) > P(large winds tonight)
• Spell Correction
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• Summarization, question-answering, etc.
Application example: OCR
• to fee great Pompey paffe the Areets of Rome:
• to see great Pompey passe the streets of Rome:
Application example: Machine Translation
• Fidelity (to source text)
• Fluency (of the translation)

http://www.deepl.com
Application example: Answer/Query
Completions/Suggestion
Application example: Dialogue generation

Li et al., "Deep Reinforcement Learning for Dialogue Generation" (EMNLP 2016)


Other Uses
• Augmentative & Alternative Communication (AAC)
systems
• For users who are physically unable to write/sign but can
for example use eye gaze
• Effective prediction of the word to be chosen is important
• Predictive text input systems can guess what the user is
typing and offer choices on how to complete it
https://pdos.csail.mit.edu/archive/scigen/
https://pdos.csail.mit.edu/archive/scigen/#about
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5,…,wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model
• Technically, “grammar” might be a better name, but “language model” (LM) is the standard term
How to compute P(W)
• How to compute this joint probability:

P(its, water, is, so, transparent, that)

• Intuition: rely on the Chain Rule of Probability


The Chain Rule
• Definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)

• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)

• The Chain Rule in general:


P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence

P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi-1)

P(“its water is so transparent”) =


P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities?
• Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

• No, too many possible sentences!


• We’ll never see enough data for estimating these..
Sparsity
• New words (Heaps' law: open vocabulary)
• Old words in “new” contexts (Zipf's law: most word types are rare)
Markov Assumption
• Simplifying assumption (condition only on the last word):
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe (condition on the last two words):
P(the | its water is so transparent that) ≈ P(the | transparent that)

Andrei Markov
(1856 – 1922)
Markov Assumption
P(w1 w2 … wn) ≈ ∏_i P(wi | wi-k … wi-1)

• In other words, we approximate each component in the product:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)


Markov Assumption (general definition)
• The Markov assumption is the assumption that the future behavior of a dynamic system
depends only on its recent history
• In particular, in a kth-order Markov model, the next state depends only on the k most
recent states; therefore, an N-gram model is an (N−1)-order Markov model

• 1st-order Markov model = bigram model
• 2nd-order Markov model = trigram model
Simplest case: Unigram model

P(wi | w1 w2 … wi-1) ≈ P(wi)

Some automatically generated sentences from a unigram model:


fifth, an, of, futures, the, an, incorporated, a, a, the, inflation,
most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the


Bigram model
Condition on the previous word:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr.,
gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred,
fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november


N-gram models
• We can extend to trigrams (3-grams), 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer(s) which I had just put into the machine room on the fifth floor is (are) crashing.”

• Yet we can often get away with N-gram models…


Estimating bigram probabilities
• The Maximum Likelihood Estimate

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

or, writing c(·) for count(·):

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Maximum Likelihood Estimate (MLE)
• The maximum likelihood estimate
• of some parameter of a model M from a training set T
• maximizes the likelihood of the training set T given the model M

• Suppose the word “bagel” occurs 400 times in a corpus of a million words
• What is the probability that a random word from some other text will be “bagel”?
• MLE estimate is 400/1,000,000 = .0004
• This may be a bad estimate for some other corpus..
• But it is the estimate that makes it most likely that “bagel” will occur 400 times in our million word
corpus
Example
Training corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

To have a consistent probabilistic model, we append a unique start (<s>) and end (</s>)
symbol to every sentence and treat these as additional words.
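A minimal Python sketch (not part of the original slides) of how these MLE bigram estimates can be computed from the toy corpus above; the variable names and printed examples are illustrative:

```python
# Minimal sketch: MLE bigram estimates from the toy corpus above.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                 # c(w)
    bigram_counts.update(zip(tokens, tokens[1:])) # c(w_{i-1}, w_i)

def p_mle(word, prev):
    """P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("am", "I"))      # 2/3
print(p_mle("Sam", "am"))    # 1/2
print(p_mle("</s>", "Sam"))  # 1/2
```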
More Examples: Berkeley Restaurant Project
sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
• …

https://web.stanford.edu/~jurafsky/icslp-red.pdf
Raw bigram counts
• Out of 9,222 sentences
Raw bigram probabilities
• Normalize by unigram counts:

• Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
P(<s> I want chinese food </s>) = P(I | <s>)
* P(want | I)
* P(chinese | want)
* P(food | chinese)
* P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019
What kind of knowledge is captured by LM?

• P(english | want) = .0011
• P(chinese | want) = .0065    (domain knowledge?)
• P(to | want) = .66
• P(eat | to) = .28
• P(i | <s>) = .25             (grammar)
• P(food | to) = 0
Practical Issues
• Better to do everything in log space:
• avoids numerical underflow
• adding is also often faster than multiplying
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
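As a small illustration (reusing the bigram values from the restaurant example above), multiplying probabilities and summing log probabilities give the same result:

```python
import math

# Bigram probabilities from the Berkeley Restaurant example above:
# P(I|<s>), P(want|I), P(chinese|want), P(food|chinese), P(</s>|food)
probs = [0.25, 0.33, 0.0065, 0.52, 0.68]

product = math.prod(probs)                  # multiply in probability space
log_sum = sum(math.log(p) for p in probs)   # add in log space

print(product)            # ≈ 0.00019
print(math.exp(log_sum))  # same value, recovered from log space
```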
Dataset example: Google N-Gram Release,
August 2006

https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dataset example: Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Another dataset example: Google Book N-
grams
• Google Books datasets:
https://storage.googleapis.com/books/ngrams/books/datasetsv3.html

https://books.google.com/ngrams
Research example: Timestamping Documents using
Google Ngram Books Data
(Screenshot: estimated document age, with supporting evidence from the dates of first appearance of the text's n-grams over time)
A. Jatowt, R. Campos: Interactive System for Reasoning about Document Age. CIKM 2017: 2471-2474
Estimating Text Age using Ngrams
Language Modeling Toolkit Examples
• KenLM
• https://kheafield.com/code/kenlm/
• SRILM
• http://www.speech.sri.com/projects/srilm/
Language Modeling
Evaluation and Perplexity
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assigns higher probability to “real” or “frequently observed” sentences than to
“ungrammatical” or “rarely observed” sentences?
• We train parameters of the model on a training set
• We test the model’s performance on data we haven’t seen
• (A test set is an unseen dataset that is different from training set)
• An evaluation metric tells us how well our model does on the test set
Training on the test set
• We cannot allow test sentences into the training set
• We would assign them an artificially high probability when we see them in the
test set
• “Training on the test set”
• Bad science
• And violates ethics..
Training and Test Sets
• Ideally, the training (and test) corpus should be representative of the actual
application data
• Sometimes we may need to adapt a general model to a small amount of new
(in-domain) data by adding highly weighted small corpus to original training
data
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• Spelling corrector, speech recognizer, MT system, etc.
• Run the task, get an accuracy for system A and for B, e.g.:
• How many misspelled words corrected properly?
• How many words translated correctly?
• Compare the accuracy for A and B
Example of extrinsic evaluation
• Instead of perplexity (to be described soon), which is easier to evaluate
• We may want a more credible measure, such as improvement in a real-life
scenario, e.g. automatic speech recognition
• where the quality of the recognized output (same as in OCR) can be measured by
Word Error Rate or Character Error Rate
Difficulty of extrinsic evaluation of N-gram
models
• Extrinsic evaluation
• Time-consuming; can take days or weeks, and costly
• So
• We use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• so generally only useful in pilot experiments
• But is helpful to think about
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
• Example distribution over the blank in the first sentence:
  mushrooms 0.1
  pepperoni 0.1
  anchovies 0.01
  …
  fried rice 0.0001
  …
  and 1e-100
• Unigrams are terrible at this game (why?)

• A better model of a text is one which assigns a higher probability to the word that
actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 ... wN)^(-1/N)
      = (1 / P(w1 w2 ... wN))^(1/N)

Expanding with the chain rule:

PP(W) = ( ∏_i 1 / P(wi | w1 ... wi-1) )^(1/N)

For bigrams:

PP(W) = ( ∏_i 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

Lower perplexity = better model

Training 38 million words, test 1.5 million words, Wall Street Journal

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

The more information the n-gram gives about the word sequence,
the lower the perplexity. Perplexity is related inversely to the
likelihood of the test sequence according to the model.
Evaluation of Language Models (summary)
• Ideally, evaluate use of model in end application (extrinsic, in vivo)
• Realistic
• Expensive
• Evaluate on ability to model test corpus (intrinsic)
• Less realistic
• Cheaper
• Verify at least once that intrinsic evaluation correlates with an extrinsic one..
Evaluation of Language Models (summary)
• Perplexity - measure of how well a model “fits” the test data
• Uses the probability that the model assigns to the test corpus
• Normalizes for the number of words in the test corpus and takes the inverse

PP(W) = (1 / P(w1 w2 ... wN))^(1/N)
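A minimal sketch of computing perplexity from per-word conditional probabilities; the cond_prob function is a stand-in for whatever (smoothed) model has been trained, not code from the lecture:

```python
import math

def perplexity(test_tokens, cond_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space to avoid underflow.

    cond_prob(word, prev) must return P(word | prev) under your model.
    If it ever returns 0, math.log fails: this is exactly the
    zero-probability problem discussed later, which smoothing addresses.
    """
    n = len(test_tokens) - 1          # predicted words (skip the initial <s>)
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(cond_prob(word, prev))
    return math.exp(-log_prob / n)
```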
Language Modeling
Generalization and zeros
Generation Method
• Choose a random bigram (<s>, w) according to its probability
• Next choose a random bigram (w, x) according to its probability
• And so on until we choose </s>
• Then string the words together

Example:
  <s> I
      I want
        want to
             to eat
                eat Chinese
                    Chinese food
                            food </s>

  I want to eat Chinese food

Shannon 1951; Miller & Selfridge 1950
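A possible sketch of this Shannon-style generation procedure for a bigram model; the bigram_probs structure is an assumed representation, not the lecture's code:

```python
import random

def generate_sentence(bigram_probs):
    """Shannon-style generation from a bigram model.

    bigram_probs is assumed to map a context word to a dict of
    {next_word: probability}, e.g. bigram_probs["<s>"]["I"] = 0.25.
    Assumes every context seen in training has at least one continuation.
    """
    word = "<s>"
    out = []
    while True:
        next_words = list(bigram_probs[word].keys())
        weights = list(bigram_probs[word].values())
        # sample the next word according to P(next | word)
        word = random.choices(next_words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)
```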


Approximating Shakespeare
Shakespeare texts as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible
bigrams
• So 99.96% of the possible bigrams were never seen (have zero entries in the
table)
• 4-grams worse: what's coming out looks like Shakespeare because it is
Shakespeare..
The Wall Street Journal
Can you guess the source of these random 3-
gram sentences?
• They also point to ninety nine point six billion dollars from two hundred
four oh six three percent of the rates of interest stores as Mexico and
gram Brazil on market conditions
• This shall forbid it should be branded, if renown made it empty.
• “You are uniformly charming!” cried he, with a smile of associating and
now and then I bowed and they perceived a chaise and four to wish for.

Shakespeare WSJ Jane Austen


The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
• In real life, it often doesn’t
• We need to train robust models that generalize..
• One kind of generalization: “zeros”
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros
Training set: Test set:
… denied the allegations … denied the offer
… denied the reports … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0

The MLE underestimates the probability of words that could occur but were not seen in
training, and overestimates the probability of those that did occur in the training set
Zero probability bigrams
• Bigrams with zero probability
• mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (can’t divide by 0)..

PP(W) = P(w1 w2 ... wN)^(-1/N) = (1 / P(w1 w2 ... wN))^(1/N)
Out of Vocabulary (OOV) words
• The problem of words whose n-gram probability is 0 needs to be solved
(will be discussed soon in next section..)
• But what about words we simply have never seen before?
• Unknown words, or out of vocabulary (OOV) words
• OOV rate - the percentage of OOV words that appear in the test set
• We sometimes model potential unknown words in the test set by adding
a pseudo-word called <UNK> (explained in next slide)
Unknown words: Open versus closed
vocabulary tasks
• If we know all the words in advance
• Vocabulary V is fixed
• A closed vocabulary task
• Often we don’t know them..
• Out Of Vocabulary = OOV words
• An open vocabulary task

• Instead: create an unknown word token <UNK>


• Training of <UNK> probabilities
• Create a fixed lexicon L of size V
• At the text normalization phase, any training word not in L is changed to <UNK>
• Now we train its probabilities like for any word
• At decoding time
• For text input: use <UNK> probabilities for any word not in training set
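A rough sketch of the <UNK> normalization step described above, assuming sentences are already tokenized; the helper name and the fixed-size-lexicon choice are illustrative:

```python
from collections import Counter

def replace_rare_with_unk(sentences, vocab_size):
    """Keep the vocab_size most frequent words as the lexicon L;
    map every other training token to <UNK>."""
    counts = Counter(tok for sent in sentences for tok in sent)
    lexicon = {w for w, _ in counts.most_common(vocab_size)}
    return [[tok if tok in lexicon else "<UNK>" for tok in sent]
            for sent in sentences]

# At decoding time the same lexicon is used: any test word not in L
# is likewise mapped to <UNK> before looking up its probability.
```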
Language Modeling
Smoothing: Add-one (Laplace) smoothing
Zeros
Training set: Test set:
… denied the allegations … denied the offer
… denied the reports … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0

The MLE underestimates the probability of words that could occur but were not seen in
training, and overestimates the probability of those that did occur in the training set
The intuition behind smoothing
• When we have sparse statistics
• Borrow probability mass to generalize better
P(w | denied the), MLE counts:
  3 allegations
  2 reports
  1 claims
  1 request
  (7 total)

P(w | denied the), after borrowing probability mass:
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2   other
  (7 total)
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts

• MLE estimate:    P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)

• Add-1 estimate:  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

Adding probability mass to unseen events requires removing it from seen ones
(discounting) in order to maintain a joint distribution that sums to 1
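A one-function sketch of the add-1 estimate, reusing Counter-style count tables as in the earlier MLE sketch; the names are illustrative:

```python
def p_add_one(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts.get((prev, word), 0) + 1) / \
           (unigram_counts.get(prev, 0) + vocab_size)

# Example: p_add_one("offer", "the", bigram_counts, unigram_counts, V)
# is nonzero even if "the offer" never occurred in training.
```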
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts

It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts
Compare with raw bigram counts

Add-one smoothing has made a very big change to the counts (much probability mass
moved to all the zeros)
Add-1 estimation is an inaccurate instrument..
• The sharp change in counts and probabilities occurs because too much probability mass
is moved to all the zeros
• Straightforward adjustment: since too much mass is given to unseen events, add a
fractional count k with 0 < k < 1 instead of 1 (so the denominator becomes c(wi-1) + kV
instead of c(wi-1) + V)
• But add-1 isn’t used for N-grams:
• We’ll see better methods
• But add-1 is used to smooth other NLP models
• For text classification
• In domains where the number of zeros isn’t so large
Language Modeling
Interpolation, Backoff
Example (figure): comparing a model with higher precision & higher variability of its
estimates against one with lower precision & less variability
Backoff
• Sometimes it helps to use less context
• Conditioning on less context for contexts you haven’t learned much about
• Backoff:
• use trigram if you have good evidence
• otherwise bigram, otherwise unigram
• Effectively, backing off to lower n-gram model when 0 evidence
• Discounting: distribute probability mass to maintain the probability distribution
• Interpolation:
• mix unigram, bigram, trigram, etc.

• Interpolation however works better


Interpolation
• A linear interpolation of two language models p and q is also a valid language model:

  λ·p + (1 - λ)·q,   with λ ∈ (0, 1)

  e.g., p = web text and q = political speeches
Linear interpolation
• Simple interpolation:
  P_interp(wn | wn-2 wn-1) = λ1·P(wn) + λ2·P(wn | wn-1) + λ3·P(wn | wn-2 wn-1),
  with λ1 + λ2 + λ3 = 1
• Lambdas conditional on context: the λs can themselves depend on the preceding words

How to set the lambdas?
• Use a held-out corpus to learn both simple and conditional λs

Training Data | Held-Out Data | Test Data

• Choose λs to maximize the probability of held-out data:


• Fix the N-gram probabilities (on the training data)
• Then search for such λs that give the largest probability of held-out set:
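A rough sketch of simple linear interpolation plus a crude grid search for the λs on held-out data; the p_uni/p_bi/p_tri callables are assumed, already-trained component models, not code from the lecture:

```python
import itertools
import math

def interp_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas):
    """P_interp(w | w1 w2) = l1*P(w) + l2*P(w|w2) + l3*P(w|w1 w2)."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, w2) + l3 * p_tri(w, w1, w2)

def pick_lambdas(held_out_trigrams, p_uni, p_bi, p_tri, step=0.1):
    """Crude grid search: keep the lambdas (summing to 1) that maximize
    the log-probability of the held-out (w1, w2, w) triples."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll, ok = 0.0, True
        for w1, w2, w in held_out_trigrams:
            p = interp_prob(w, w1, w2, p_uni, p_bi, p_tri, (l1, l2, l3))
            if p <= 0.0:          # this lambda setting zeroes out a held-out trigram
                ok = False
                break
            ll += math.log(p)
        if ok and ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```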
Smoothing for Web-scale N-grams
• “Stupid backoff” (Brants et al. 2007), designed for very large LMs
• No discounting
• Does not produce a probability distribution!

S(wi | wi-2 wi-1) = count(wi-2, wi-1, wi) / count(wi-2, wi-1)   if count(wi-2, wi-1, wi) > 0
                  = 0.4 · S(wi | wi-1)                          otherwise
S(wi | wi-1)      = count(wi-1, wi) / count(wi-1)               if count(wi-1, wi) > 0
                  = 0.4 · S(wi)                                 otherwise
S(wi)             = count(wi) / N
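A small sketch of stupid-backoff scoring for a trigram query using raw count tables; the 0.4 back-off factor follows Brants et al., while the table names are illustrative:

```python
def stupid_backoff(w, w1, w2, tri_counts, bi_counts, uni_counts,
                   total_tokens, alpha=0.4):
    """Stupid-backoff value S(w | w1 w2): a score, not a probability."""
    if tri_counts.get((w1, w2, w), 0) > 0:
        return tri_counts[(w1, w2, w)] / bi_counts[(w1, w2)]
    if bi_counts.get((w2, w), 0) > 0:
        return alpha * bi_counts[(w2, w)] / uni_counts[w2]
    return alpha * alpha * uni_counts.get(w, 0) / total_tokens
```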
Selected Advanced Language Modelling
Concepts
• Caching Models
• Recently used words are more likely to appear
P_CACHE(w | history) = λ · P(wi | wi-2 wi-1) + (1 - λ) · c(w ∈ history) / |history|
• Bias-vs-Variance trade-off: choice of n
• To choose a value for n in an n-gram model, find the right trade-off between the stability
of the estimate and its appropriateness.
• For example, trigram is a common choice with large training corpora (millions of words), while a
bigram is often used with smaller ones…
• Skip-grams
• A generalization of n-grams in which words need not be consecutive in the text, but may
leave gaps that are skipped over (another way of overcoming data sparsity problem)
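A small sketch of extracting k-skip bigrams from a token list (illustrative, not from the lecture):

```python
def skip_bigrams(tokens, max_skip=2):
    """All (w_i, w_j) pairs with at most max_skip words skipped between them."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

# skip_bigrams("insurgents killed in ongoing fighting".split(), max_skip=2)
# includes ("insurgents", "killed"), ("insurgents", "in"), ("insurgents", "ongoing"), ...
```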
Selected Advanced Language Modelling
Concepts
• Fertility
• Number of distinct types a word occurs with (e.g., compare “delay” and
“Francisco”: which is more likely in an arbitrary new context?)
• POS n-grams
• Integer encodings of n-grams
Summary
• Language models assign a probability that a sentence is a “legal” string in a
language
• They can also predict a word from preceding words
• They are useful as a component of many NLP systems, such as ASR, OCR, and MT
• Simple N-gram models are easy to train from raw (unannotated) corpora and can provide useful
estimates of sentence likelihood
• N-gram LMs can be evaluated extrinsically in a task or intrinsically using perplexity
• N-gram LMs = Markov models estimating words from a fixed window of
previous words, with probabilities estimated from normalized corpus
frequencies (MLE)
Summary (cont.)
• MLE gives inaccurate parameters for models trained on sparse data
• Smoothing algorithms make it possible to estimate the probabilities of unseen (but
not impossible) N-grams, for example by using lower-order n-grams via backoff or
interpolation
A Problem for N-Grams: Long Distance
Dependencies
• Many times local context does not provide the most useful predictive
clues, which instead are provided by long-distance dependencies
• Syntactic dependencies
• “The man next to the large oak tree near the grocery store on the corner is tall.”
• “The men next to the large oak tree near the grocery store on the corner are tall.”
• Semantic dependencies
• “The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on the corner talks rapidly.”
• Hence, the Markov assumption may be questioned.
• More complex models of language are needed to handle such
dependencies…
Task for next week (deadline 4/17, 14:15)
Construct a letter-based language model for detecting the type of English used in
documents
1. Download and unpack the file from OLAT (ngram_task.zip)
2. Build a 3-gram LM based on letters separately for each variant of English: British, American and
Australian using the training data files
3. Generate 5 random sentences from each model based on the Shannon method shown during the
lecture (no need to care about sentence segmentation and sentence markers, unless you want to..)
4. Take the test examples and estimate the perplexity of each one under each of your LMs
5. Calculate the accuracy of identifying the English variant over all the test examples
• (Optionally you might experiment with different n values for character n-grams than 3)
6. Include code and results in report and discuss any decisions or assumptions made
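A possible starting point (only a sketch, not a full solution) for counting letter trigrams; the padding symbol and lower-casing are assumptions you may change:

```python
from collections import Counter

def char_trigram_counts(text):
    """Count letter trigrams, padding each line with two '#' symbols:
    a possible starting point for the task, not a full solution."""
    trigrams = Counter()
    bigrams = Counter()
    for line in text.splitlines():
        chars = ["#", "#"] + list(line.lower())
        for i in range(2, len(chars)):
            trigrams[(chars[i - 2], chars[i - 1], chars[i])] += 1
            bigrams[(chars[i - 2], chars[i - 1])] += 1
    return trigrams, bigrams

# P(c3 | c1 c2) ≈ trigrams[(c1, c2, c3)] / bigrams[(c1, c2)], ideally with
# smoothing (add-1 or interpolation) so that test perplexities stay finite.
```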
Paper 1

https://www.aclweb.org/anthology/P05-1065.pdf
Paper 2

https://csaws.cs.technion.ac.il/~yahave/papers/pldi14-statistical.pdf
Paper 3

https://arxiv.org/pdf/2103.10918.pdf
