
Natural Language Processing
Lecture 3: n-gram language models

10/29/2020

COMS W4705
Yassine Benajiba
Probability of a Sentence

“But it must be recognized that the notion of ‘probability of a


sentence’ is an entirely useless one, under any known
interpretation of this term.”
Noam Chomsky (1969)
Language Modeling

• Task: predict the next word given the context.

• Used in speech recognition, handwritten character


recognition, spelling correction, text entry UI, machine
translation,…
Language Modeling

• Stocks plunged this …

• Let’s meet in Times …

• I took the subway to …


From a NYT story
• Stocks plunged this ....

• Stocks plunged this morning, despite a cut in interest rates by the …

• Stocks plunged this morning, despite a cut in interest


rates by the Federal Reserve, as Wall …

• Stocks plunged this morning, despite a cut in interest


rates by the Federal Reserve, as Wall Street began
Human Word Prediction
• Clearly at least some of us have the ability to predict the
future.

• How does this work?

• Domain knowledge

• Syntactic knowledge (guess correct part of speech)

• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and
lexical knowledge perfectly.

• Instead, we can rely on the notion of probability of a


sequence (letters, words…).
Applications
• Speech recognition: P(“recognize speech”) > P(“wreck a nice beach”)

• Text generation: P(“three houses”) > P(“three house”)

• Spelling correction P(“my cat eats fish”) > P(“my xat eats fish”)

• Machine Translation P(“the blue house”) > P(“the house blue”)

• Other uses

• OCR

• Summarization

• Document classification

• Essay scoring
Language Models

• This model can also be used to describe the probability of


an entire sentence, not just the last word.

• Use the chain rule:

P(w_1, \dots, w_n) = P(w_n \mid w_1, \dots, w_{n-1}) \, P(w_1, \dots, w_{n-1})
                   = P(w_n \mid w_1, \dots, w_{n-1}) \, P(w_{n-1} \mid w_{n-2}, \dots, w_1) \, P(w_{n-2}, \dots, w_1) = \dots

Markov Assumption
• P(w_n \mid w_1, \dots, w_{n-1}) is difficult to estimate.

• The longer the sequence becomes, the less likely it is that w_1 w_2 w_3 … w_{n-1} appears in the training data.

• Instead, we make the following simple independence


assumption (Markov assumption):

• The probability of seeing w_n depends only on the previous k−1 words:

  P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-k+1}, \dots, w_{n-1})
bi-gram language model
• Using the Markov assumption and the chain rule:

  P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})

• More consistent to use only bigrams (with the convention w_0 = START):

  P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
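As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the bigram factorization, assuming a hypothetical dictionary bigram_prob that maps (previous word, word) pairs to conditional probabilities:

```python
# Minimal sketch: score a sentence under a bigram model.
# `bigram_prob` is a hypothetical lookup of P(w_i | w_{i-1}) values,
# e.g. bigram_prob[("I", "want")] = 0.32.

def bigram_sentence_prob(tokens, bigram_prob):
    """Return P(w_1, ..., w_n) under the bigram Markov assumption."""
    prob = 1.0
    prev = "START"                                   # w_0 = START
    for w in tokens + ["END"]:
        prob *= bigram_prob.get((prev, w), 0.0)      # unseen bigram -> 0 (before smoothing)
        prev = w
    return prob
```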


n-grams

• The sequence w_n is a unigram.

• The sequence w_{n-1}, w_n is a bigram.

• The sequence w_{n-2}, w_{n-1}, w_n is a trigram.

• The sequence w_{n-3}, w_{n-2}, w_{n-1}, w_n is a quadrigram (4-gram), …


Variable-Length Language
Models
• We typically don’t know what the length of the sentence is.

• Instead, we use a special marker STOP (written END in the examples below) that indicates the end of a sentence.

• We typically just augment the sentence with START and END markers to provide the appropriate context.
START i want to eat Chinese food END

P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example

P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)
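A corresponding sketch for the trigram factorization, again with a hypothetical trigram_prob lookup; note the two START symbols of padding:

```python
# Minimal sketch of the trigram factorization above: pad with two START
# symbols and one END symbol, then multiply conditional probabilities.
# `trigram_prob` is a hypothetical dict mapping (w_{i-2}, w_{i-1}, w_i) -> probability.

def trigram_sentence_prob(tokens, trigram_prob):
    padded = ["START", "START"] + tokens + ["END"]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return prob
```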
Bigram example from the Berkeley
Restaurant Project (BeRP)

Eat on 0.16 Eat Thai 0.03

Eat some 0.06 Eat breakfast 0.03

Eat lunch 0.06 Eat in 0.02

Eat dinner 0.05 Eat Chinese 0.02

Eat at 0.04 Eat Mexican 0.02

Eat a 0.04 Eat tomorrow 0.01

Eat Indian 0.04 Eat dessert 0.007

http://www1.icsi.berkeley.edu/Speech/berp.html
Bigram example from the Berkeley
Restaurant Project (BeRP)

START I 0.25 Want some 0.04

START I’d 0.06 Want Thai 0.01

START Tell 0.04 To eat 0.26

START I’m 0.02 To have 0.14

I want 0.32 To spend 0.09

I would 0.29 To be 0.02

I don’t 0.08 British food 0.60


Bigram example from the Berkeley
Restaurant Project (BeRP)
• Assume P(END | food) = 0.2

P(I want to eat British food) =


P(I | START) · P(want | I) · P(to | want) · P(eat | to) ·
P(British | eat) · P(food | British) · P(END | food) =
.25 · .32 · .65 · .26 · .001 · .60 · .2 = .0000016

P(I want to eat Chinese food) =


P(I | START) · P(want | I) · P(to | want) · P(eat | to) ·
P(Chinese | eat) · P(food | Chinese) · P(END | food) =
.25 · .32 · .65 · .26 · .02 · .60 · .2 = .000032
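The two products can be checked directly, for example with Python (math.prod is in the standard library):

```python
# Reproducing the two products above (factors taken from the BeRP tables).
from math import prod

p_british = prod([0.25, 0.32, 0.65, 0.26, 0.001, 0.60, 0.2])  # ~1.6e-06
p_chinese = prod([0.25, 0.32, 0.65, 0.26, 0.02, 0.60, 0.2])   # ~3.2e-05
print(p_british, p_chinese)
```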
log probabilities
• Probabilities become very small quickly: each additional token multiplies in a factor well below 1, so the product shrinks by a few orders of magnitude per token.

• We often work with log probabilities in practice, summing instead of multiplying:

  \log P(w_1, \dots, w_n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1}), \qquad w_0 = \text{START}
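A minimal sketch of the log-probability version of the bigram scorer above (bigram_prob is again a hypothetical lookup table):

```python
# Minimal sketch: sum log probabilities instead of multiplying raw
# probabilities, avoiding numerical underflow on long sentences.
import math

def bigram_sentence_logprob(tokens, bigram_prob):
    logp = 0.0
    prev = "START"                      # w_0 = START
    for w in tokens + ["END"]:
        p = bigram_prob.get((prev, w), 0.0)
        if p == 0.0:
            return float("-inf")        # unseen bigram (before smoothing)
        logp += math.log(p)
        prev = w
    return logp
```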
What do ngrams capture?

• Probabilities seem to capture syntactic facts and


world knowledge.

• eat is often followed by an NP.

• British food is not too popular, but Chinese is.


Estimating n-gram
probabilities
• We can estimate n-gram probabilities using maximum likelihood estimates. For bigrams:

  P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}

• Or for trigrams:

  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{count(w_{i-2}, w_{i-1}, w_i)}{count(w_{i-2}, w_{i-1})}
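A minimal sketch of the bigram MLE in code, assuming the corpus is already tokenized into lists of words (the function name mle_bigram_probs is ours, not from the lecture):

```python
# Minimal sketch of the bigram MLE: count bigrams and their contexts in
# a tokenized corpus, then divide.
from collections import Counter

def mle_bigram_probs(sentences):
    """`sentences` is a list of token lists, e.g. [["i", "want", "to", "eat"], ...]."""
    bigram_counts, context_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["START"] + sent + ["END"]
        for prev, w in zip(padded, padded[1:]):
            bigram_counts[(prev, w)] += 1
            context_counts[prev] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}
```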
Bigram Counts from BeRP

I Want To Eat Chinese Food lunch

I 8 1087 0 13 0 0 0

Want 3 0 786 0 6 8 6

To 3 0 10 860 3 0 12

Eat 0 0 2 0 19 2 52

Chinese 2 0 0 0 0 120 1

Food 19 0 17 0 0 0 0
Counts to Probabilities
          I     Want   To    Eat   Chinese  Food   lunch
I         8     1087   0     13    0        0      0
Want      3     0      786   0     6        8      6
To        3     0      10    860   3        0      12
Eat       0     0      2     0     19       2      52
Chinese   2     0      0     0     0        120    1
Food      19    0      17    0     0        0      0
Lunch     4     0      0     0     0        1      0

• Unigram counts:

  I 3437   Want 1215   To 3256   Eat 938   Chinese 213   Food 1506   Lunch 459

• Divide each bigram count by the unigram count of its context, e.g.
  P(want | I) = 1087 / 3437 ≈ 0.32.
Corpora
• Large digital collections of text or speech, covering different languages, domains, and modalities. Annotated or unannotated.

• English:

• Brown Corpus

• BNC, ANC

• Wall Street Journal

• AP newswire

• DARPA/NIST text/speech corpora


(CallHome, ATIS, Switchboard, Broadcast News, …)

• MT: Hansards, Europarl


Google Web 1T 5-gram
Corpus
File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229

Number of sentences: 95,119,665,584

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663


Google Web 1T 5-gram
Corpus
• 3-gram examples:

ceramics collectables collectibles 55


ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
Google Web 1T 5-gram
Corpus
• 4-gram examples:

serve as the incoming 92


serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
Data sparsity in n-gram
models
• Sparsity is a problem all over NLP: Test data contains
language phenomena not encountered during training.

• For n-gram models there are two issues:

• We may not have seen all tokens.

• We may not have seen all n-grams (even though the individual tokens are known).

• A token has not been encountered in this context before, e.g. P(lunch | I) = 0.0.
Unseen Tokens
• Typical approach to unseen tokens:

• Start with a specific lexicon of known tokens.

• Replace all tokens in the training and testing corpus that


are not in the lexicon with an UNK token.

• Practical approach:

• Lexicon contains all words that appear more than k


times in the training corpus.

• Replace all other tokens with UNK.
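A minimal sketch of this preprocessing step, assuming tokenized training sentences and a frequency threshold k (names are illustrative):

```python
# Minimal sketch of the UNK preprocessing described above: keep tokens
# that occur more than k times in the training corpus, map the rest to "UNK".
from collections import Counter

def build_lexicon(train_sentences, k=1):
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > k}

def replace_unk(sentences, lexicon):
    return [[w if w in lexicon else "UNK" for w in sent] for sent in sentences]
```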


Unseen Contexts
• Two basic approaches:

• Smoothing / Discounting: Move some probability mass


from seen trigrams to unseen trigrams.

• Back-off: use (n−1)-gram, (n−2)-gram, … probabilities to compute the n-gram probability.

• Other techniques:

• Class-based back-off: use the back-off probability of a specific word class / part of speech.
Zipf’s Law
• Problem: n-grams (and most other linguistic phenomena)
follow a Zipfian distribution.

• A few words occur very frequently.

• Most words occur very rarely. Many are seen only once.

• Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the frequency-ordered word list.
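Stated as a formula (a common textbook form of the law, with an exponent s close to 1 for natural language):

```latex
f(r) \;\propto\; \frac{1}{r^{s}}, \qquad s \approx 1
```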
Zipf’s Law

[Figure: word frequency plotted against word rank]
Zipf’s Law
[Figure: word frequency vs. rank for Wikipedia corpora, 10M words per language]
https://commons.wikimedia.org/wiki/File:Zipf_30wiki_en_labels.png
Smoothing
• Smoothing flattens spiky distributions.

• before P(w | We denied the)


3 allegations
2 reports
1 claims
1 request
7 total

• after P(w | We denied the)


2.5 allegations
1.5 reports
0.5 claims
0.4 request
2 UNK
7 total
Smoothing is like Robin Hood: Steal from the rich, give to the poor.
Example from Dan Klein.
Additive Smoothing
• Classic approach: Laplacian, a.k.a. additive smoothing. For a unigram model:

  P_{add}(w) = \frac{count(w) + 1}{N + V}

• N is the number of tokens, V is the number of types (i.e., the size of the vocabulary).

• Inaccurate in practice.
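A minimal sketch of the bigram analogue of additive smoothing (the count dictionaries follow the hypothetical format used in the earlier MLE sketch):

```python
# Minimal sketch of additive (Laplace) smoothing for bigrams: add 1 to
# every bigram count and add the vocabulary size V to the denominator.
def laplace_bigram_prob(prev, w, bigram_counts, context_counts, vocab_size):
    return (bigram_counts.get((prev, w), 0) + 1) / (context_counts.get(prev, 0) + vocab_size)
```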
Linear Interpolation
• Use denser distributions of shorter n-grams to “fill in” sparse n-gram distributions:

  P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)

• where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_j \geq 0.

• Works well in practice (though without much theoretical justification for why).

• Parameters can be estimated on development data (for


example, using Expectation Maximization).
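A minimal sketch of trigram/bigram/unigram interpolation with fixed weights; in practice the lambda values would be tuned on development data (e.g. with EM) rather than hard-coded:

```python
# Minimal sketch of linear interpolation of trigram, bigram, and unigram
# MLE estimates. The fixed weights are purely illustrative; in practice
# they would be estimated on development data (e.g. with EM).
def interpolated_prob(u, v, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas    # must satisfy l1 + l2 + l3 = 1, all >= 0
    return (l1 * p_tri.get((u, v, w), 0.0)
            + l2 * p_bi.get((v, w), 0.0)
            + l3 * p_uni.get(w, 0.0))
```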
Discounting
• Idea: set aside some probability mass, then fill in the missing mass using back-off.

• Define discounted counts count*(v, w) = count(v, w) − β, where 0 < β < 1.

• Then for all seen bigrams:

  P(w \mid v) = \frac{count^*(v, w)}{count(v)}

• For each context v the missing probability mass is

  \alpha(v) = 1 - \sum_{w : count(v, w) > 0} \frac{count^*(v, w)}{count(v)}

• We can now divide this held-out mass between the unseen words (evenly or using back-off).
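A minimal sketch of the discounting step for a single context v, assuming count dictionaries as in the earlier sketches and an illustrative discount β = 0.5:

```python
# Minimal sketch of discounting for a single context v: subtract beta
# from every seen bigram count, then compute the held-out mass alpha(v).
def discounted_probs_and_missing_mass(v, bigram_counts, context_counts, beta=0.5):
    seen = {w: (c - beta) / context_counts[v]
            for (ctx, w), c in bigram_counts.items() if ctx == v}
    alpha = 1.0 - sum(seen.values())    # probability mass reserved for unseen words
    return seen, alpha
```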
Katz’ Backoff
• Divide the held-out probability mass proportionally to the
unigram probability of the unseen words in context v.
Katz’ Backoff for Trigrams
• For trigrams: recursively compute the back-off probability for unseen bigrams, then distribute the held-out probability mass proportionally to that bigram back-off probability:

  P_{Katz}(w \mid u, v) =
    \begin{cases}
      \dfrac{count^*(u, v, w)}{count(u, v)} & \text{if } count(u, v, w) > 0 \\[6pt]
      \alpha(u, v) \, \dfrac{P_{Katz}(w \mid v)}{\sum_{w' : count(u, v, w') = 0} P_{Katz}(w' \mid v)} & \text{otherwise}
    \end{cases}

• where \alpha(u, v) is the probability mass held out from the seen trigrams with context (u, v).

• Often combined with Good-Turing smoothing.
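A minimal sketch of Katz-style back-off for bigrams, building on the discounting sketch above (seen and alpha are its outputs; unigram_prob and vocab are assumed to be available):

```python
# Minimal sketch of Katz-style back-off for bigrams: seen bigrams keep
# their discounted estimate; unseen bigrams share the held-out mass
# alpha(v) in proportion to the unigram probabilities of the unseen words.
def katz_bigram_prob(v, w, seen, alpha, unigram_prob, vocab):
    if w in seen:
        return seen[w]
    unseen = [u for u in vocab if u not in seen]
    denom = sum(unigram_prob[u] for u in unseen)
    return alpha * unigram_prob[w] / denom
```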


Evaluating n-gram models
• Extrinsic evaluation: Apply the model in an application (for
example language classification). Evaluate the application.

• Intrinsic evaluation: measure how well the model


approximates unseen language data.

• Can compute the probability of each sentence


according to the model. Higher probability -> better
model.

• Typically we compute Perplexity instead.


Perplexity
• Perplexity (per word) measures how well the ngram model predicts the
sample.

• Given a corpus of m sentences s_i, where M is the total number of tokens in the corpus.

• Perplexity is defined as 2^{-l}, where

  l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)

• Lower perplexity = better model. Intuition:

• Assume we are predicting one word at a time.

• With uniform distribution, all successor words are equally likely.


Perplexity is equal to vocabulary size.

• Perplexity can be thought of as “effective vocabulary size”.
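A minimal sketch of the perplexity computation, assuming base-2 log probabilities for each sentence have already been computed (e.g. with the log-probability scorer sketched earlier):

```python
# Minimal sketch of perplexity: average per-token log2 probability over
# the corpus (M tokens), then take 2 to the power of its negative.
def perplexity(sentence_logprobs_base2, num_tokens_M):
    l = sum(sentence_logprobs_base2) / num_tokens_M
    return 2 ** (-l)
```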
