
COMP 3361 Natural Language Processing

Lecture 2: Language Modeling


n-gram Language Models

Spring 2024

Many materials are adapted from CSE517@UW, COMS W4705@Columbia, 11-711@CMU, and COS484@Princeton, with special thanks!
Announcements
● Join the course Slack workspace
https://join.slack.com/t/slack-fdv4728/shared_invite/zt-2asgddr0h-6wIXbRndwKhBw2IX2~ZrJQ

● Assignment 1 will be out this weekend


Lecture plan
● Introduction to language models
● N-gram language models
● Language model evaluation
● Smoothing methods
ChatGPT is a powerful language model!
Let's play a game! Which word is most likely to come next?

"This year, I am going to do an internship in ___"
Candidates: Queen Mary Hospital, HSBC, Google, Amazon

"Majoring in computer science, this year, I am going to do an internship in ___"
Candidates: Queen Mary Hospital, HSBC, Google, Amazon


ChatGPT auto-completes your prompt
Generative language model

I am going to do an internship in Google


Making the dice
Each face of the die is a word from the vocabulary (a bag of words):

1 Belief
2 Evidence
3 Reason
4 Claim
5 Think
6 Justify
7 Also
...
99 Therefore
100 Google
Generative language model
Roll the die once per position, generating the sentence one word at a time:

I → I am → I am going → ... → I am going to do an internship in Google
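To make the "dice" concrete, here is a minimal sketch (not from the slides; the word list and probabilities are invented) of sampling words from a toy bag-of-words distribution until an end-of-sentence symbol comes up:

```python
import random

# Toy "bag of words": an invented vocabulary with invented face probabilities.
vocab = ["I", "am", "going", "to", "do", "an", "internship", "in", "Google", "</s>"]
probs = [0.12, 0.12, 0.12, 0.12, 0.10, 0.10, 0.10, 0.10, 0.07, 0.05]

def roll_die():
    """Roll the word die once: sample a single word according to its probability."""
    return random.choices(vocab, weights=probs, k=1)[0]

# Generate a "sentence" by rolling repeatedly until the </s> face comes up.
words = []
for _ in range(30):          # cap the length so the demo always terminates
    w = roll_die()
    if w == "</s>":
        break
    words.append(w)
print(" ".join(words))
```

Because every roll ignores the words generated so far, the output is word salad; the rest of the lecture is about conditioning each roll on the preceding context.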
Neuralize the dice! Use neural networks (e.g., Transformers) as the dice.

Neural network language models generate the same way, one word at a time:

I am going to do an internship in ___ → Google
Language models, and how to build them

● The language model is the dice, and how we roll them: a probabilistic model.
● We parameterize it with powerful functions (Transformers, neural networks, and many others) and learn the parameters from data.
First problem: the language modeling problem
Given a finite vocabulary V, we have a set of example sentences:

<s> I am going to an internship in Google </s>
<s> an internship in Google </s>
<s> I am going going </s>
<s> Google is am </s>
<s> internship is going </s>

Can we learn a "model" of this "generative process"? We need to "learn" a probability distribution p over sentences, learned from what we've seen.


The language modeling problem
Given a training sample of example sentences, we need to “learn” a probabilistic model that
assigns probabilities to every possible string:
p(<s> I am going to an internship in Google </s>) = 10^{-12}
p(<s> an internship in Google </s>) = 10^{-8}
p(<s> I am going going </s>) = 10^{-15}

What is a language model?
● A probabilistic model of a sequence of words

A language model consists of:
● A finite vocabulary V
● A probability distribution over sequences of words

A sentence in the language is a sequence of words x_1 x_2 ... x_n, with each x_i ∈ V. For example:

<s> I am going to an internship in Google </s>

Define V† to be the set of all sentences built with the vocabulary V. The probability distribution p over V† must satisfy:

p(x) ≥ 0 for every x ∈ V†, and ∑_{x ∈ V†} p(x) = 1
Assign a probability to a sentence
Applications of language models:

● Grammar checking: P("I am going to school") > P("I are going to school")

● Machine translation: when translating "I had some coffee this morning.", P("我今早喝了一些咖啡" [drank some coffee this morning]) > P("我今早吃了一些咖啡" [ate some coffee this morning])

● Question answering: P("Can we put an elephant into the refrigerator? No, we can't.") > P("Can we put an elephant into the refrigerator? Yes, we can.")
N-gram language models

(Recall the picture: the language model is the dice and how we roll them, i.e. a probabilistic model; we parameterize it with powerful functions such as Transformers and other neural networks, and learn the parameters from data.)
A (very bad) language model

p(x_1, ..., x_n) = (number of times the sentence is seen in the training corpus) / (total number of sentences in the training corpus)

Why is this very bad? Any sentence that never appears verbatim in the training corpus gets probability zero.
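A minimal sketch of this estimator (the tiny corpus is invented) makes the failure mode concrete:

```python
from collections import Counter

# Invented toy training corpus of whole sentences.
corpus = [
    "<s> I am going to an internship in Google </s>",
    "<s> an internship in Google </s>",
    "<s> I am going to an internship in Google </s>",
]

counts = Counter(corpus)
total = len(corpus)

def p_sentence(sentence: str) -> float:
    """count(sentence) / total number of training sentences."""
    return counts[sentence] / total

print(p_sentence("<s> I am going to an internship in Google </s>"))  # 2/3
print(p_sentence("<s> I am going to an internship in Amazon </s>"))  # 0.0 -- unseen sentence
```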


Markov models
Consider a sequence of random variables X_1, X_2, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is, by the chain rule:

P(X_1 = x_1, ..., X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

First-order Markov assumption:

P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) ≈ P(X_i = x_i | X_{i-1} = x_{i-1})

● Use only the recent past to predict the next word
● Reduces the number of estimated parameters in exchange for modeling capacity
Trigram language models
A trigram language model consists of a finite set V, and a parameter q(w | u, v) for each trigram (u, v, w) such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}.
q(w | u, v) can be interpreted as the probability of seeing the word w immediately after the bigram (u, v).

For any sentence x_1, ..., x_n, where x_i ∈ V for i = 1, ..., n−1 and x_n = STOP, the probability of the sentence under the trigram model is

p(x_1, ..., x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

where we define x_0 = x_{−1} = *.
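A minimal sketch of this definition, assuming the parameters q(w | u, v) are already stored in a Python dict (the numbers below are invented):

```python
# Hypothetical trigram parameters q(w | u, v); "*" pads the start, "STOP" ends a sentence.
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "barks"): 0.3,
    ("dog", "barks", "STOP"): 0.6,
}

def trigram_sentence_prob(words):
    """p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = '*'."""
    padded = ["*", "*"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q.get((u, v, w), 0.0)  # unseen trigram -> probability 0 (fixed by smoothing later)
    return prob

print(trigram_sentence_prob(["the", "dog", "barks", "STOP"]))  # 0.5 * 0.4 * 0.3 * 0.6 = 0.036
```

Note how any trigram missing from the table drives the whole product to zero, which is exactly the sparse-data problem discussed below.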
Trigram language models
For example, for the sentence:
the dog barks STOP

p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)

Problem solved? How can we find the parameters q(w | u, v)?

Parameters (of the model): the trigram probabilities q(w | u, v).

● How many parameters are there?
● How do we "estimate" them from training data?


Generating from a trigram language model

The trigram model generates one word at a time. Given the context "I am", sample the next word w with probability q(w | "I", "am"); here we draw "going", append it to get "I am going", and repeat until STOP is generated.
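A minimal sketch of this generation loop, assuming a table of conditional distributions q(· | u, v) (all values invented for illustration):

```python
import random

# Hypothetical conditional distributions q(. | u, v); all values invented for illustration.
q_next = {
    ("*", "*"):       {"I": 0.7, "the": 0.3},
    ("*", "I"):       {"am": 1.0},
    ("I", "am"):      {"going": 0.8, "STOP": 0.2},
    ("am", "going"):  {"STOP": 1.0},
    ("*", "the"):     {"dog": 1.0},
    ("the", "dog"):   {"barks": 1.0},
    ("dog", "barks"): {"STOP": 1.0},
}

def generate(max_len=20):
    """Sample the next word from q(. | previous two words) until STOP is drawn."""
    u, v = "*", "*"
    words = []
    for _ in range(max_len):
        dist = q_next[(u, v)]
        w = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if w == "STOP":
            break
        words.append(w)
        u, v = v, w
    return " ".join(words)

print(generate())  # e.g. "I am going"
```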


Trigram language models
How do we "estimate" the parameters from training data? N-gram counts!

(Figure: frequency of "pasta" vs. "hamburger" over time, from the Google Books Ngram Viewer.)


Sparse data problems
Maximum likelihood estimate:

q_{ML}(w | u, v) = count(u, v, w) / count(u, v)

Say the vocabulary size is 20,000. Then there are 20,000³ = 8 × 10¹² trigram parameters! Most of these trigrams will never be observed in any realistic training corpus.
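A minimal sketch of the maximum-likelihood estimate count(u, v, w) / count(u, v), computed from a tiny invented corpus:

```python
from collections import defaultdict

# Tiny invented corpus; each sentence is padded with "*" "*" and ends with STOP.
corpus = [
    ["the", "dog", "barks", "STOP"],
    ["the", "cat", "laughs", "STOP"],
]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    padded = ["*", "*"] + sentence
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        trigram_counts[(u, v, w)] += 1
        bigram_counts[(u, v)] += 1

def q_ml(w, u, v):
    """Maximum-likelihood estimate count(u, v, w) / count(u, v); 0 if the bigram is unseen."""
    if bigram_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(q_ml("dog", "*", "the"))  # 0.5: the bigram ("*", "the") occurs twice, the trigram once
```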


Evaluating language models

(Figure: the data is split into a training set, used to fit the (fancy) trigram model; a development (dev) set of held-out data; and a test set, standing in for the real product.)
Evaluating language models
One option: evaluate directly on downstream applications
● Higher task accuracy → better model
● But: expensive, time consuming
● Hard to optimize the downstream objective (indirect feedback)
Evaluating language models: perplexity

Given a development (dev) set of held-out sentences, for example:

the cat laughs STOP
the dog laughs at the cat STOP

we can compute the probability the model assigns to the entire set of test sentences:

∏_{i=1}^{m} p(x^{(i)})

The higher this quantity is, the better the language model is at modeling unseen sentences.
Evaluating language models: perplexity

Perplexity on the test corpus is a direct transformation of this quantity:

l = (1/M) ∑_{i=1}^{m} log₂ p(x^{(i)}),    perplexity = 2^{−l}

where M is the total length (in words) of the sentences in the test corpus. Lower perplexity means the model assigns higher probability to the held-out data.
What if the model estimates q(w | u, v) = 0 and that trigram appears in the test data?

Wait, why do we love this number in the first place? Suppose the model predicts a uniform distribution, q(w | u, v) = 1/N for every word, where N is the vocabulary size. Then the perplexity is exactly N: a uniform probability model has perplexity equal to the vocabulary size!

Perplexity can be thought of as the effective vocabulary size under the model. For example, if the perplexity of the model is 120 (even though the vocabulary size is, say, 10,000), then this is roughly equivalent to having an effective vocabulary of 120.
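A minimal sketch of the perplexity computation, with the sentence probability stubbed out by a hypothetical uniform model so the result is exactly the vocabulary size, as claimed above:

```python
import math

VOCAB_SIZE = 10_000  # assumed vocabulary size, including STOP

def sentence_log2_prob(sentence):
    """Stub: a uniform model assigns probability 1/VOCAB_SIZE to every word."""
    return sum(math.log2(1.0 / VOCAB_SIZE) for _ in sentence)

def perplexity(test_sentences):
    """2^{-l}, where l = (1/M) * sum_i log2 p(x^{(i)}) and M is the total number of words."""
    M = sum(len(s) for s in test_sentences)
    l = sum(sentence_log2_prob(s) for s in test_sentences) / M
    return 2 ** (-l)

test_set = [
    ["the", "cat", "laughs", "STOP"],
    ["the", "dog", "laughs", "at", "the", "cat", "STOP"],
]
print(perplexity(test_set))  # 10000.0 for the uniform model, i.e. the vocabulary size
```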
Generalization of n-gram language models
● Not all n-grams in the test set will be observed in training data
● The test corpus might contain n-grams that have zero probability under our model
Smoothing for language models
If the model estimates q(w | u, v) = 0 and the trigram (u, v, w) appears in the test data,
the perplexity goes up to infinity.
When we have sparse statistics:
P(w | denied the)
3 allegations
2 reports
1 claims
1 request

7 total

Steal probability mass to generalize better:


P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other

7 total

(Example from Dan Klein)


Add-one (Laplace) smoothing
Considering a bigram model here: pretend we saw each bigram one more time than we did.

MLE estimate:

q_{ML}(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Add-one smoothing:

q_{add-1}(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + |V|)
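A minimal sketch of both estimates for a bigram model over a tiny invented corpus, showing how add-one smoothing moves unseen bigrams away from zero:

```python
from collections import defaultdict

corpus = [["the", "dog", "barks", "STOP"], ["the", "cat", "laughs", "STOP"]]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
vocab = set()
for sentence in corpus:
    padded = ["<s>"] + sentence
    vocab.update(sentence)
    for prev, w in zip(padded, padded[1:]):
        bigram_counts[(prev, w)] += 1
        unigram_counts[prev] += 1

V = len(vocab)  # vocabulary size (including STOP here)

def q_mle(w, prev):
    """count(prev, w) / count(prev); zero if the bigram was never observed."""
    return bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

def q_add1(w, prev):
    """(count(prev, w) + 1) / (count(prev) + V): every bigram gets a pseudo-count of 1."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(q_mle("barks", "dog"), q_add1("barks", "dog"))    # 1.0 vs. a value below 1
print(q_mle("laughs", "dog"), q_add1("laughs", "dog"))  # 0.0 vs. a small nonzero value
```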
Linear interpolation (stupid backoff)

Combine the trigram, bigram, and unigram models.

Trigram maximum-likelihood estimate:
q_{ML}(w_i | w_{i-2}, w_{i-1}) = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})

Bigram maximum-likelihood estimate:
q_{ML}(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

Unigram maximum-likelihood estimate:
q_{ML}(w_i) = count(w_i) / (total number of tokens)

Linearly interpolated estimate:
q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_{ML}(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_{ML}(w_i | w_{i-1}) + λ_3 · q_{ML}(w_i), with λ_1 + λ_2 + λ_3 = 1 and λ_i ≥ 0.

Which estimate suffers from the data sparsity problem the most?
Which one is more accurate when its counts are reliable?
Linear interpolation (stupid backoff)

How do we choose the values of λ_1, λ_2, λ_3? They are hyperparameters: use the held-out (dev) corpus and pick the λ values that
maximize the probability of held-out data.
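A minimal sketch of the interpolated estimate together with a crude grid search for the λ values on held-out data; the three maximum-likelihood estimators are stubbed with invented numbers:

```python
import math

# Hypothetical ML estimates for a single context, invented to illustrate the combination.
def q_tri(w, u, v): return {"barks": 0.0}.get(w, 0.0)    # the trigram (u, v, w) was never seen
def q_bi(w, v):     return {"barks": 0.25}.get(w, 0.0)
def q_uni(w):       return {"barks": 0.05}.get(w, 0.01)

def q_interp(w, u, v, lambdas):
    """lambda1*q_ML(w|u,v) + lambda2*q_ML(w|v) + lambda3*q_ML(w), with the lambdas summing to 1."""
    l1, l2, l3 = lambdas
    return l1 * q_tri(w, u, v) + l2 * q_bi(w, v) + l3 * q_uni(w)

# Crude grid search: keep the lambdas that maximize the log-probability of held-out trigrams.
heldout = [("the", "dog", "barks")]        # invented (u, v, w) triples from a dev set
best_ll, best_lambdas = float("-inf"), None
for l1 in (0.1, 0.3, 0.5, 0.7):
    for l2 in (0.1, 0.2, 0.3):
        l3 = 1.0 - l1 - l2
        if l3 <= 0:
            continue
        ll = sum(math.log(q_interp(w, u, v, (l1, l2, l3)) + 1e-12) for u, v, w in heldout)
        if ll > best_ll:
            best_ll, best_lambdas = ll, (l1, l2, l3)
print("best lambdas:", best_lambdas)
```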


Markov models in retrospect
Consider a sequence of random variables X_1, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is

P(X_1 = x_1, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

N-gram language models approximate each conditional with a (first-order or higher-order) Markov assumption, conditioning only on the previous n−1 words.
Limitations of n-gram language models
They are not sufficient to handle long-range dependencies:

"Alice/Bob could not go to work that day because she/he had a doctor's appointment"

The Markov assumption discards exactly the distant context (Alice vs. Bob) that determines the correct pronoun (she vs. he).
Markov models in retrospect
Consider a sequence of random variables X_1, ..., X_n, each taking any value in a finite vocabulary V. The joint probability of a sentence is

P(X_1 = x_1, ..., X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

Is it possible to directly model this probability?


Neural network language models
Consider the same sequence of random variables and the same joint probability. Is it possible to directly model

P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

without a Markov assumption? Yes: parameterize the conditional with Transformers, neural networks, and many other powerful functions, and train on a much larger corpus (e.g., ChatGPT).
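For contrast, a minimal sketch (assuming PyTorch is available) of a small fixed-window neural language model: an embedding layer plus a feed-forward network that outputs a distribution over the next word. Models like ChatGPT use Transformers over the full history and far larger corpora, but the interface, a distribution over the next word given the context, is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardLM(nn.Module):
    """Predict the next word from the previous `context_size` words (a neuralized n-gram)."""
    def __init__(self, vocab_size, context_size=2, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(context_size * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context_ids):          # context_ids: (batch, context_size)
        emb = self.embed(context_ids)        # (batch, context_size, embed_dim)
        logits = self.ff(emb.flatten(1))     # (batch, vocab_size)
        return logits

# Toy usage with made-up word ids; a real setup would map words to ids from a corpus.
model = FeedForwardLM(vocab_size=100)
context = torch.tensor([[11, 42]])                          # previous two words
probs = F.softmax(model(context), dim=-1)                   # distribution over the next word
loss = F.cross_entropy(model(context), torch.tensor([7]))   # training loss if the true next word is 7
print(probs.shape, loss.item())
```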
Perplexity: n-gram vs. neural language models

https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word
