
COMP 3361 Natural Language Processing

Lecture 2: Language Modeling


n-gram Language Models

Spring 2024

Many materials are adapted from CSE517@UW, COMS W4705@Columbia, 11-711@CMU, and COS484@Princeton, with special thanks!
Announcements
● Join the course Slack workspace
https://join.slack.com/t/slack-fdv4728/shared_invite/zt-2asgddr0h-6wIXbRndwKhBw2IX2~ZrJQ

● Assignment 1 will be out this weekend


Lecture plan
● Introduction to language models
● N-gram language models
● Language model evaluation
● Smoothing methods
ChatGPT is a powerful language model!
Let's play a game! Which word is most likely to come next?

"This year, I am going to do an internship in ___"
Candidates: Queen Mary Hospital, HSBC, Google, Amazon

"Majoring in computer science, this year, I am going to do an internship in ___"
Candidates: Queen Mary Hospital, HSBC, Google, Amazon


ChatGPT auto-completes your prompt
Generative language model

I am going to do an internship in Google


Making the dice
Each face of the die is a word from the vocabulary (a bag of words):

1 Belief
2 Evidence
3 Reason
4 Claim
5 Think
6 Justify
7 Also
...
99 Therefore
100 Google
Generative language model
Roll the die once per position, generating the sentence one word at a time:

I → I am → I am going → ... → I am going to do an internship in Google
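To make the "dice" concrete, here is a minimal sketch (not from the slides; the word list and probabilities are invented) of sampling words from a toy bag-of-words distribution until an end-of-sentence symbol comes up:

```python
import random

# Toy "bag of words": an invented vocabulary with invented face probabilities.
vocab = ["I", "am", "going", "to", "do", "an", "internship", "in", "Google", "</s>"]
probs = [0.12, 0.12, 0.12, 0.12, 0.10, 0.10, 0.10, 0.10, 0.07, 0.05]

def roll_die():
    """Roll the word die once: sample a single word according to its probability."""
    return random.choices(vocab, weights=probs, k=1)[0]

# Generate a "sentence" by rolling repeatedly until the </s> face comes up.
words = []
for _ in range(30):          # cap the length so the demo always terminates
    w = roll_die()
    if w == "</s>":
        break
    words.append(w)
print(" ".join(words))
```

Because every roll ignores the words generated so far, the output is word salad; the rest of the lecture is about conditioning each roll on the preceding context.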
Neuralize the dice! Use neural networks (e.g., Transformers) as the dice.

Neural network language models generate the same way, one word at a time:

I am going to do an internship in ___ → Google
Language models, and how to build them

● The language model is the dice, and how we roll them: a probabilistic model.
● We parameterize it with powerful functions (Transformers, neural networks, and many others) and learn the parameters from data.
First problem: the language modeling problem
Given a finite vocabulary V, we have a set of example sentences:

<s> I am going to an internship in Google </s>
<s> an internship in Google </s>
<s> I am going going </s>
<s> Google is am </s>
<s> internship is going </s>

Can we learn a "model" of this "generative process"? We need to "learn" a probability distribution p over sentences, learned from what we've seen.


The language modeling problem
Given a training sample of example sentences, we need to “learn” a probabilistic model that
assigns probabilities to every possible string:
p(<s> I am going to an internship in Google </s>) = 10^{-12}
p(<s> an internship in Google </s>) = 10^{-8}
p(<s> I am going going </s>) = 10^{-15}

What is a language model?
● A probabilistic model of a sequence of words

A language model consists of:
● A finite vocabulary V
● A probability distribution over sequences of words

A sentence in the language is a sequence of words x_1 x_2 ... x_n, with each x_i ∈ V. For example:

<s> I am going to an internship in Google </s>

Define V† to be the set of all sentences built with the vocabulary V. The probability distribution p over V† must satisfy:

p(x) ≥ 0 for every x ∈ V†, and ∑_{x ∈ V†} p(x) = 1
Assign a probability to a sentence
Applications of language models:

● Grammar checking: P("I am going to school") > P("I are going to school")

● Machine translation: when translating "I had some coffee this morning.", P("我今早喝了一些咖啡" [drank some coffee this morning]) > P("我今早吃了一些咖啡" [ate some coffee this morning])

● Question answering: P("Can we put an elephant into the refrigerator? No, we can't.") > P("Can we put an elephant into the refrigerator? Yes, we can.")
N-gram language models

(Recall the picture: the language model is the dice and how we roll them, i.e. a probabilistic model; we parameterize it with powerful functions such as Transformers and other neural networks, and learn the parameters from data.)
A (very bad) language model

p(x_1, ..., x_n) = (number of times the sentence is seen in the training corpus) / (total number of sentences in the training corpus)

Why is this very bad? Any sentence that never appears verbatim in the training corpus gets probability zero.
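A minimal sketch of this estimator (the tiny corpus is invented) makes the failure mode concrete:

```python
from collections import Counter

# Invented toy training corpus of whole sentences.
corpus = [
    "<s> I am going to an internship in Google </s>",
    "<s> an internship in Google </s>",
    "<s> I am going to an internship in Google </s>",
]

counts = Counter(corpus)
total = len(corpus)

def p_sentence(sentence: str) -> float:
    """count(sentence) / total number of training sentences."""
    return counts[sentence] / total

print(p_sentence("<s> I am going to an internship in Google </s>"))  # 2/3
print(p_sentence("<s> I am going to an internship in Amazon </s>"))  # 0.0 -- unseen sentence
```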


Markov models
Consider a sequence of random variables X_1, X_2, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is, by the chain rule:

P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

First-order Markov assumption:

P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) ≈ P(X_i = x_i | X_{i-1} = x_{i-1})

● Use only the recent past to predict the next word
● Reduces the number of estimated parameters in exchange for modeling capacity
Trigram language models
A trigram language model consists of a finite set V, and a parameter q(w | u, v) for each trigram (u, v, w) such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}.
q(w | u, v) can be interpreted as the probability of seeing the word w immediately after the bigram (u, v).

For any sentence x_1, ..., x_n, where x_i ∈ V for i = 1, ..., n−1 and x_n = STOP, the probability of the sentence under the trigram model is

p(x_1, ..., x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

where we define x_0 = x_{−1} = *.
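A minimal sketch of this definition, assuming the parameters q(w | u, v) are already stored in a Python dict (the numbers below are invented):

```python
# Hypothetical trigram parameters q(w | u, v); "*" pads the start, "STOP" ends a sentence.
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "barks"): 0.3,
    ("dog", "barks", "STOP"): 0.6,
}

def trigram_sentence_prob(words):
    """p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = '*'."""
    padded = ["*", "*"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q.get((u, v, w), 0.0)  # unseen trigram -> probability 0 (fixed by smoothing later)
    return prob

print(trigram_sentence_prob(["the", "dog", "barks", "STOP"]))  # 0.5 * 0.4 * 0.3 * 0.6 = 0.036
```

Note how any trigram missing from the table drives the whole product to zero, which is exactly the sparse-data problem discussed below.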
Trigram language models
For example, for the sentence:
the dog barks STOP

p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)

Problem solved? How can we find the parameters q(w | u, v)?

Parameters (of the model): the trigram probabilities q(w | u, v).

● How many parameters are there?
● How do we "estimate" them from training data?


Generating from a trigram language model

The trigram model generates one word at a time. Given the context "I am", sample the next word w with probability q(w | "I", "am"); here we draw "going", append it to get "I am going", and repeat until STOP is generated.
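A minimal sketch of this generation loop, assuming a table of conditional distributions q(· | u, v) (all values invented for illustration):

```python
import random

# Hypothetical conditional distributions q(. | u, v); all values invented for illustration.
q_next = {
    ("*", "*"):       {"I": 0.7, "the": 0.3},
    ("*", "I"):       {"am": 1.0},
    ("I", "am"):      {"going": 0.8, "STOP": 0.2},
    ("am", "going"):  {"STOP": 1.0},
    ("*", "the"):     {"dog": 1.0},
    ("the", "dog"):   {"barks": 1.0},
    ("dog", "barks"): {"STOP": 1.0},
}

def generate(max_len=20):
    """Sample the next word from q(. | previous two words) until STOP is drawn."""
    u, v = "*", "*"
    words = []
    for _ in range(max_len):
        dist = q_next[(u, v)]
        w = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if w == "STOP":
            break
        words.append(w)
        u, v = v, w
    return " ".join(words)

print(generate())  # e.g. "I am going"
```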


Trigram language models
How do we "estimate" the parameters from training data? N-gram counts!

(Figure: frequency of "pasta" vs. "hamburger" over time, from the Google Books Ngram Viewer.)


Sparse data problems
Maximum likelihood estimate:

q_{ML}(w | u, v) = count(u, v, w) / count(u, v)

Say the vocabulary size is 20,000. Then there are 20,000³ = 8 × 10¹² trigram parameters! Most of these trigrams will never be observed in any realistic training corpus.
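A minimal sketch of the maximum-likelihood estimate count(u, v, w) / count(u, v), computed from a tiny invented corpus:

```python
from collections import defaultdict

# Tiny invented corpus; each sentence is padded with "*" "*" and ends with STOP.
corpus = [
    ["the", "dog", "barks", "STOP"],
    ["the", "cat", "laughs", "STOP"],
]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    padded = ["*", "*"] + sentence
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        trigram_counts[(u, v, w)] += 1
        bigram_counts[(u, v)] += 1

def q_ml(w, u, v):
    """Maximum-likelihood estimate count(u, v, w) / count(u, v); 0 if the bigram is unseen."""
    if bigram_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(q_ml("dog", "*", "the"))  # 0.5: the bigram ("*", "the") occurs twice, the trigram once
```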


Evaluating language models

(Figure: the data is split into a training set, used to fit the (fancy) trigram model; a development (dev) set of held-out data; and a test set, standing in for the real product.)
Evaluating language models
One option: evaluate directly on downstream applications
● Higher task accuracy → better model
● But: expensive, time consuming
● Hard to optimize the downstream objective (indirect feedback)
Evaluating language models: perplexity

Given a development (dev) set of held-out sentences, for example:

the cat laughs STOP
the dog laughs at the cat STOP

we can compute the probability the model assigns to the entire set of test sentences:

∏_{i=1}^{m} p(x^{(i)})

The higher this quantity is, the better the language model is at modeling unseen sentences.
Evaluating language models: perplexity

Perplexity on the test corpus is a direct transformation of this quantity:

l = (1/M) ∑_{i=1}^{m} log₂ p(x^{(i)}),    perplexity = 2^{−l}

where M is the total length (in words) of the sentences in the test corpus. Lower perplexity means the model assigns higher probability to the held-out data.
What if the model estimates q(w | u, v) = 0 and that trigram appears in the test data?

Wait, why do we love this number in the first place? Suppose the model predicts a uniform distribution, q(w | u, v) = 1/N for every word, where N is the vocabulary size. Then the perplexity is exactly N: a uniform probability model has perplexity equal to the vocabulary size!

Perplexity can be thought of as the effective vocabulary size under the model. For example, if the perplexity of the model is 120 (even though the vocabulary size is, say, 10,000), then this is roughly equivalent to having an effective vocabulary of 120.
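A minimal sketch of the perplexity computation, with the sentence probability stubbed out by a hypothetical uniform model so the result is exactly the vocabulary size, as claimed above:

```python
import math

VOCAB_SIZE = 10_000  # assumed vocabulary size, including STOP

def sentence_log2_prob(sentence):
    """Stub: a uniform model assigns probability 1/VOCAB_SIZE to every word."""
    return sum(math.log2(1.0 / VOCAB_SIZE) for _ in sentence)

def perplexity(test_sentences):
    """2^{-l}, where l = (1/M) * sum_i log2 p(x^{(i)}) and M is the total number of words."""
    M = sum(len(s) for s in test_sentences)
    l = sum(sentence_log2_prob(s) for s in test_sentences) / M
    return 2 ** (-l)

test_set = [
    ["the", "cat", "laughs", "STOP"],
    ["the", "dog", "laughs", "at", "the", "cat", "STOP"],
]
print(perplexity(test_set))  # 10000.0 for the uniform model, i.e. the vocabulary size
```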
Generalization of n-gram language models
● Not all n-grams in the test set will be observed in training data
● The test corpus might contain n-grams that have zero probability under our model
Smoothing for language models
If the model estimates q(w | u, v) = 0 and the trigram (u, v, w) appears in the test data,
the perplexity goes up to infinity.
When we have sparse statistics:
P(w | denied the)
3 allegations
2 reports
1 claims
1 request

7 total

Steal probability mass to generalize better:


P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other

7 total

(Example from Dan Klein)


Add-one (Laplace) smoothing
Considering a bigram model here: pretend we saw each bigram one more time than we did.

MLE estimate:

q_{ML}(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Add-one smoothing:

q_{add-1}(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + |V|)
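A minimal sketch of both estimates for a bigram model over a tiny invented corpus, showing how add-one smoothing moves unseen bigrams away from zero:

```python
from collections import defaultdict

corpus = [["the", "dog", "barks", "STOP"], ["the", "cat", "laughs", "STOP"]]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
vocab = set()
for sentence in corpus:
    padded = ["<s>"] + sentence
    vocab.update(sentence)
    for prev, w in zip(padded, padded[1:]):
        bigram_counts[(prev, w)] += 1
        unigram_counts[prev] += 1

V = len(vocab)  # vocabulary size (including STOP here)

def q_mle(w, prev):
    """count(prev, w) / count(prev); zero if the bigram was never observed."""
    return bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

def q_add1(w, prev):
    """(count(prev, w) + 1) / (count(prev) + V): every bigram gets a pseudo-count of 1."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(q_mle("barks", "dog"), q_add1("barks", "dog"))    # 1.0 vs. a value below 1
print(q_mle("laughs", "dog"), q_add1("laughs", "dog"))  # 0.0 vs. a small nonzero value
```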
Linear interpolation (stupid backoff)

Combine the trigram, bigram, and unigram models.

Trigram maximum-likelihood estimate:
q_{ML}(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})

Bigram maximum-likelihood estimate:
q_{ML}(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Unigram maximum-likelihood estimate:
q_{ML}(w_i) = count(w_i) / (total number of tokens)

Linearly interpolated estimate:
q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_{ML}(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_{ML}(w_i | w_{i-1}) + λ_3 · q_{ML}(w_i), with λ_1 + λ_2 + λ_3 = 1 and λ_i ≥ 0.

Which estimate suffers from the data sparsity problem the most?
Which one is more accurate when its counts are reliable?
Linear interpolation (stupid backoff)

How do we choose the values of λ_1, λ_2, λ_3? They are hyperparameters: use the held-out (dev) corpus and pick the λ values that
maximize the probability of held-out data.
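A minimal sketch of the interpolated estimate together with a crude grid search for the λ values on held-out data; the three maximum-likelihood estimators are stubbed with invented numbers:

```python
import math

# Hypothetical ML estimates for a single context, invented to illustrate the combination.
def q_tri(w, u, v): return {"barks": 0.0}.get(w, 0.0)    # the trigram (u, v, w) was never seen
def q_bi(w, v):     return {"barks": 0.25}.get(w, 0.0)
def q_uni(w):       return {"barks": 0.05}.get(w, 0.01)

def q_interp(w, u, v, lambdas):
    """lambda1*q_ML(w|u,v) + lambda2*q_ML(w|v) + lambda3*q_ML(w), with the lambdas summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * q_tri(w, u, v) + l2 * q_bi(w, v) + l3 * q_uni(w)

# Crude grid search: keep the lambdas that maximize the log-probability of held-out trigrams.
heldout = [("the", "dog", "barks")]        # invented (u, v, w) triples from a dev set
best_ll, best_lambdas = float("-inf"), None
for l1 in (0.1, 0.3, 0.5, 0.7):
    for l2 in (0.1, 0.2, 0.3):
        l3 = 1.0 - l1 - l2
        if l3 <= 0:
            continue
        ll = sum(math.log(q_interp(w, u, v, (l1, l2, l3)) + 1e-12) for u, v, w in heldout)
        if ll > best_ll:
            best_ll, best_lambdas = ll, (l1, l2, l3)
print("best lambdas:", best_lambdas)
```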


Markov models in retrospect
Consider a sequence of random variables X_1, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is

P(X_1 = x_1, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

N-gram language models approximate each conditional with a (first-order or higher-order) Markov assumption, conditioning only on the previous n−1 words.
Limitations of n-gram language models
They are not sufficient to handle long-range dependencies:

"Alice/Bob could not go to work that day because she/he had a doctor's appointment"

The Markov assumption discards exactly the distant context (Alice vs. Bob) that determines the correct pronoun (she vs. he).
Markov models in retrospect
Consider a sequence of random variables X_1, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is

P(X_1 = x_1, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

Is it possible to directly model this probability?


Neural network language models
Consider the same sequence of random variables and the same joint probability. Is it possible to directly model

P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

without a Markov assumption? Yes: parameterize the conditional with Transformers, neural networks, and many other powerful functions, and train on a much larger corpus (e.g., ChatGPT).
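For contrast, a minimal sketch (assuming PyTorch is available) of a small fixed-window neural language model: an embedding layer plus a feed-forward network that outputs a distribution over the next word. Models like ChatGPT use Transformers over the full history and far larger corpora, but the interface, a distribution over the next word given the context, is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardLM(nn.Module):
    """Predict the next word from the previous `context_size` words (a neuralized n-gram)."""
    def __init__(self, vocab_size, context_size=2, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(context_size * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context_ids):          # context_ids: (batch, context_size)
        emb = self.embed(context_ids)        # (batch, context_size, embed_dim)
        logits = self.ff(emb.flatten(1))     # (batch, vocab_size)
        return logits

# Toy usage with made-up word ids; a real setup would map words to ids from a corpus.
model = FeedForwardLM(vocab_size=100)
context = torch.tensor([[11, 42]])                          # previous two words
probs = F.softmax(model(context), dim=-1)                   # distribution over the next word
loss = F.cross_entropy(model(context), torch.tensor([7]))   # training loss if the true next word is 7
print(probs.shape, loss.item())
```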
Perplexity: n-gram vs. neural language models

https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word
