2. Language Modeling
Spring 2024
Many materials from CSE517@UW, COMS W4705@Columbia, 11-711@CMU, COS484@Princeton with special thanks!
Announcements
● Join the course Slack workspace
https://join.slack.com/t/slack-fdv4728/shared_invite/zt-2asgddr0h-6wIXbRndwKhBw2IX2~ZrJQ
Vocabulary
[Diagram: a generative language model repeatedly predicts the next word from the prefix so far, e.g. “I” → “am”, “I am” → “going”, “I am going” → …]
Neuralize the dice!
Neural Networks (e.g. Transformers)
Neural network language models
Language models, and how to build them
[Diagram: the two ingredients of a language model: “Learn” and “Parameterize”.]
First problem — the language modeling problem
Given a finite vocabulary $V$, a sentence is a finite sequence of words $x_1, \ldots, x_n$ with each $x_i \in V$.
Can we learn a “model” for this “generative process”? We need to “learn” a probability distribution $p$ over sentences:
$$p(x_1, \ldots, x_n) \geq 0 \quad \text{and} \quad \sum_{x_1 \ldots x_n} p(x_1, \ldots, x_n) = 1$$
For example:
P(“I am going to school”) > P(“I are going to school”)  (grammar checking)
A (very bad) language model
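A minimal sketch of such a model, assuming the usual count-based baseline: given $N$ training sentences, set
$$p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}$$
where $c(x_1, \ldots, x_n)$ is the number of times the exact sentence occurs in training. Any sentence not seen verbatim gets probability zero, which is why this model generalizes so badly.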
Chain rule
Markov models
Consider a sequence of random variables $X_1, X_2, \ldots, X_n$, each taking a value in a finite set $V$.
Chain rule:
$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$$
Second-order Markov assumption: condition only on the previous two words,
$$P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) \approx P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$
where we define $x_0 = x_{-1} = *$ (a special start symbol).
Trigram language models
Writing $q(w \mid u, v)$ for the probability of word $w$ after the two preceding words $u, v$, the probability of a sentence $x_1, \ldots, x_n$ (with $x_n = \text{STOP}$) is
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$$
For example, for the sentence:
the dog barks STOP
$$p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \times q(\text{dog} \mid *, \text{the}) \times q(\text{barks} \mid \text{the}, \text{dog}) \times q(\text{STOP} \mid \text{dog}, \text{barks})$$
[Diagram: given the prefix “I am”, the model predicts “going” with probability q(“going” | “I am”).]
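A small runnable sketch of this computation (the $q$ values below are made-up placeholders, not estimates from data):

```python
# Sketch: scoring a sentence with a trigram language model.
# The parameters q(w | u, v) here are illustrative, made-up values.
from math import log, exp

START, STOP = "*", "STOP"

# Hypothetical trigram parameters q(w | u, v), stored as {(u, v): {w: prob}}.
q = {
    ("*", "*"): {"the": 0.5, "I": 0.3},
    ("*", "the"): {"dog": 0.4},
    ("the", "dog"): {"barks": 0.6},
    ("dog", "barks"): {"STOP": 0.9},
}

def sentence_logprob(words, q):
    """log p(x_1 ... x_n) = sum_i log q(x_i | x_{i-2}, x_{i-1}),
    with x_0 = x_{-1} = * and x_n = STOP."""
    context = (START, START)
    logp = 0.0
    for w in list(words) + [STOP]:
        prob = q.get(context, {}).get(w, 0.0)
        if prob == 0.0:
            return float("-inf")  # unseen trigram => zero probability
        logp += log(prob)
        context = (context[1], w)
    return logp

print(exp(sentence_logprob(["the", "dog", "barks"], q)))
# 0.5 * 0.4 * 0.6 * 0.9 = 0.108
```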
N-gram counts!
[Diagram: a (fancy) trigram model is estimated on the training set, tuned on a development (dev) set of held-out data, and evaluated on the test set (the “real product”).]
We can compute the probability it assigns to the entire set of test sentences.
The higher this quantity is, the better the language model is at modeling unseen sentences.
Evaluating language models: perplexity
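One standard form of the definition: given $m$ test sentences $s_1, \ldots, s_m$ containing $M$ words in total,
$$\text{perplexity} = 2^{-l}, \qquad l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)$$
Lower perplexity is better: it means the model assigns higher probability to the unseen test sentences.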
MLE estimate:
$$q_{\text{ML}}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}$$
Add-one smoothing:
$$q(w \mid u, v) = \frac{c(u, v, w) + 1}{c(u, v) + |V|}$$
Linear interpolation (stupid backoff):
$$q(w \mid u, v) = \lambda_1\, q_{\text{ML}}(w \mid u, v) + \lambda_2\, q_{\text{ML}}(w \mid v) + \lambda_3\, q_{\text{ML}}(w), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
Which one suffers from the data sparsity problem the most?
Which one is more accurate?
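A compact sketch of how the three estimates can be computed from trigram counts (the toy corpus and the interpolation weights $\lambda$ are illustrative assumptions, not values from the slides):

```python
# Sketch: estimating q(w | u, v) with MLE, add-one smoothing,
# and linear interpolation. Corpus and lambda weights are illustrative only.
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "dog", "runs"], ["a", "cat", "runs"]]
START, STOP = "*", "STOP"

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
total_words = 0
for sent in corpus:
    words = [START, START] + sent + [STOP]
    for i in range(2, len(words)):
        u, v, w = words[i - 2], words[i - 1], words[i]
        trigrams[(u, v, w)] += 1
        bigrams[(v, w)] += 1          # used for q_ML(w | v)
        unigrams[w] += 1
        total_words += 1

vocab = set(unigrams)                  # includes STOP
V = len(vocab)

def q_mle(w, u, v):
    ctx = sum(trigrams[(u, v, x)] for x in vocab)      # c(u, v)
    return trigrams[(u, v, w)] / ctx if ctx else 0.0

def q_addone(w, u, v):
    ctx = sum(trigrams[(u, v, x)] for x in vocab)
    return (trigrams[(u, v, w)] + 1) / (ctx + V)

def q_interp(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    l1, l2, l3 = lambdas               # tuned on dev data in practice
    bigram_ctx = sum(bigrams[(v, x)] for x in vocab)    # c(v)
    p_bi = bigrams[(v, w)] / bigram_ctx if bigram_ctx else 0.0
    return l1 * q_mle(w, u, v) + l2 * p_bi + l3 * unigrams[w] / total_words

for est in (q_mle, q_addone, q_interp):
    print(est.__name__, est("runs", "the", "dog"))
```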
Linear interpolation (stupid backoff)
Train on a much larger corpus!
Transformers, neural networks, and many others
e.g., ChatGPT
Perplexity: n-gram vs. neural language models
https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word