Lecture 3: n-gram language models
10/29/2020
COMS W4705: Natural Language Processing
Yassine Benajiba
Probability of a Sentence
• Domain knowledge
• Syntactic knowledge
• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and
lexical knowledge perfectly.
• Spelling correction: P(“my cat eats fish”) > P(“my xat eats fish”)
• Other uses
• OCR
• Summarization
• Document classification
• Essay scoring
Language Models
$$P(w_1, \ldots, w_n) = P(w_n \mid w_1, \ldots, w_{n-1}) \, P(w_1, \ldots, w_{n-1})$$
$$= P(w_n \mid w_1, \ldots, w_{n-1}) \, P(w_{n-1} \mid w_1, \ldots, w_{n-2}) \, P(w_1, \ldots, w_{n-2})$$
$$= \ldots = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
Markov Assumption
• $P(w_n \mid w_1, \ldots, w_{n-1})$ is difficult to estimate: most long histories are never observed in a corpus.
• Markov assumption: condition on only the previous word or two, e.g. $P(w_n \mid w_{n-1})$ (bigram) or $P(w_n \mid w_{n-2}, w_{n-1})$ (trigram).
bigram example
P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example
P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)
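A minimal sketch (not from the slides) of the bigram decomposition above in Python; the probability values are made-up placeholders standing in for estimates from a corpus such as BeRP:

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}); the values are
# placeholders, not the BeRP estimates.
bigram_prob = {
    ("START", "i"): 0.25, ("i", "want"): 0.33, ("want", "to"): 0.66,
    ("to", "eat"): 0.28, ("eat", "chinese"): 0.02,
    ("chinese", "food"): 0.52, ("food", "END"): 0.40,
}

def bigram_sentence_prob(words):
    """P(w_1 ... w_n) ~ prod_i P(w_i | w_{i-1}), with START/END padding."""
    padded = ["START"] + words + ["END"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        # An unseen bigram gets probability 0 here; smoothing (later) fixes this.
        prob *= bigram_prob.get((prev, cur), 0.0)
    return prob

print(bigram_sentence_prob("i want to eat chinese food".split()))
```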
Bigram example from the Berkeley
Restaurant Project (BeRP)
w0 = START
What do ngrams capture?
• Bigrams condition each word on the previous one: $P(w_i \mid w_{i-1})$. A helper for extracting them is sketched below.
• Or for trigrams: condition on the previous two words, $P(w_i \mid w_{i-2}, w_{i-1})$.
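A small helper (not from the slides) for extracting bigrams or trigrams from a tokenized sentence, using the same START/END padding as the examples above:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence,
    padded with n-1 START symbols and one END symbol."""
    padded = ["START"] * (n - 1) + tokens + ["END"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(ngrams("i want to eat chinese food".split(), 2))  # bigrams
print(ngrams("i want to eat chinese food".split(), 3))  # trigrams
```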
Bigram Counts from BeRP

          I     Want   To    Eat   Chinese  Food  Lunch
I         8     1087   0     13    0        0     0
Want      3     0      786   0     6        8     6
To        3     0      10    860   3        0     12
Eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
Food      19    0      17    0     0        0     0
Lunch     4     0      0     0     0        1     0

(row = first word, column = second word; e.g. count(I, want) = 1087)
Counts to Probabilities

• Maximum likelihood estimate: divide each bigram count by the unigram count of the first word:

$$P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$

• Unigram counts:

I      Want   To     Eat   Chinese  Food   Lunch
3437   1215   3256   938   213      1506   459

• Example: P(want | I) = 1087 / 3437 ≈ 0.32.
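As a sketch of how the counting works in code (not from the slides); the toy corpus is a placeholder for the BeRP transcripts:

```python
from collections import Counter

# Toy corpus standing in for the BeRP transcripts.
corpus = [
    "i want to eat chinese food",
    "i want to eat lunch",
    "tell me about chinese food",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["START"] + sentence.split() + ["END"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("want", "to"))  # 2/2 = 1.0 in this toy corpus
```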
Corpora
• Large digital collections of text or speech, in different languages, domains, and modalities. Annotated or unannotated.
• English:
• Brown Corpus
• BNC, ANC
• AP newswire
• Practical approach:
• Other techniques:
• Most words occur very rarely. Many are seen only once.
Zipf’s Law

• A word’s frequency is roughly inversely proportional to its frequency rank.

[Figure: word frequency vs. word rank, Wikipedia, 10m words per language]
https://commons.wikimedia.org/wiki/File:Zipf_30wiki_en_labels.png
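Zipf’s law can be checked on any corpus with a few lines of Python; a rough sketch (the file name is a placeholder for whatever plain-text corpus is at hand):

```python
from collections import Counter

# "corpus.txt" is a placeholder path; the lecture's plot uses Wikipedia text.
words = open("corpus.txt", encoding="utf-8").read().lower().split()
freqs = Counter(words)

# Zipf's law predicts that rank * frequency stays roughly constant.
for rank, (word, freq) in enumerate(freqs.most_common(20), start=1):
    print(f"{rank:>4}  {word:<15}  freq={freq:<8}  rank*freq={rank * freq}")
```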
Smoothing
• Smoothing flattens spiky distributions, reserving some probability mass for unseen n-grams.
• Simple approaches (e.g. add-one smoothing) are inaccurate in practice.
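A minimal sketch of add-one (Laplace) smoothing, the simplest such method: add 1 to every bigram count before normalizing, so unseen bigrams get a small non-zero probability (this example is not from the slides; the toy corpus is a placeholder):

```python
from collections import Counter

corpus = ["i want to eat chinese food", "i want to eat lunch"]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["START"] + sentence.split() + ["END"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size (including START/END in this sketch)

def add_one_bigram_prob(prev, word):
    """Laplace-smoothed estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(add_one_bigram_prob("eat", "lunch"))   # seen bigram
print(add_one_bigram_prob("eat", "pizza"))   # unseen bigram, still > 0
```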
Linear Interpolation
• Use denser distributions of shorter ngrams to “fill in”
sparse ngram distributions.
$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)$$

• where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_j \geq 0$.
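A sketch of linear interpolation in Python (not from the slides); the lambda weights and probability values below are placeholders, and in practice the lambdas are tuned on held-out data:

```python
def interpolated_trigram_prob(w2, w1, w, trigram_p, bigram_p, unigram_p,
                              lambdas=(0.6, 0.3, 0.1)):
    """lambda1 * P(w|w2,w1) + lambda2 * P(w|w1) + lambda3 * P(w); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((w2, w1, w), 0.0)
            + l2 * bigram_p.get((w1, w), 0.0)
            + l3 * unigram_p.get(w, 0.0))

# Hypothetical estimates: the trigram was never seen, but the bigram and unigram were.
trigram_p = {}
bigram_p = {("chinese", "food"): 0.52}
unigram_p = {"food": 0.01}
print(interpolated_trigram_prob("eat", "chinese", "food",
                                trigram_p, bigram_p, unigram_p))  # 0.157
```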
• Given a corpus of $m$ sentences $s_i$, where $M$ is the total number of tokens in the corpus