Lecture 5
Today
• Parts of speech (POS)
• Tagsets
• POS Tagging
– Rule-based tagging
– HMMs and Viterbi algorithm
2
Parts of Speech
• 8 (ish) traditional parts of speech
– Noun, verb, adjective, preposition, adverb,
article, interjection, pronoun, conjunction, etc
– Called: parts-of-speech, lexical categories,
word classes, morphological classes, lexical
tags...
– Lots of debate within linguistics about the
number, nature, and universality of these
• We’ll completely ignore this debate.
3
POS examples
• N noun chair, bandwidth, pacing
• V verb study, debate, munch
• ADJ adjective purple, tall, ridiculous
• ADV adverb unfortunately, slowly
• P preposition of, by, to
• PRO pronoun I, me, mine
• DET determiner the, a, that, those
4
POS Tagging
• The process of assigning a part-of-speech or
lexical class marker to each word in a
collection.
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
5
Why is POS Tagging Useful?
• First step of a vast number of practical tasks
• Speech synthesis
– How to pronounce “lead”?
– INsult inSULT
– OBject obJECT
– OVERflow overFLOW
– DIScount disCOUNT
– CONtent conTENT
• Parsing
– Need to know if a word is an N or V before you can parse
• Information extraction
– Finding names, relations, etc.
• Machine Translation
6
Open and Closed Classes
• Closed class: a small fixed membership
– Prepositions: of, in, by, …
– Auxiliaries: may, can, will, had, been, …
– Pronouns: I, you, she, mine, his, them, …
– Usually function words (short common words which
play a role in grammar)
• Open class: new ones can be created all the time
– English has 4: Nouns, Verbs, Adjectives, Adverbs
– Many languages have these 4, but not all!
7
Open Class Words
• Nouns
– Proper nouns (Boulder, Granby, Eli Manning)
• English capitalizes these.
– Common nouns (the rest).
– Count nouns and mass nouns
• Count: have plurals, get counted: goat/goats, one goat, two goats
• Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
– Unfortunately, John walked home extremely slowly yesterday
– Directional/locative adverbs (here, home, downhill)
– Degree adverbs (extremely, very, somewhat)
– Manner adverbs (slowly, slinkily, delicately)
• Verbs
– In English, have morphological affixes (eat/eats/eaten)
8
Closed Class Words
Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, ..
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …
9
Prepositions from CELEX
10
English Particles
11
Conjunctions
12
POS Tagging
Choosing a Tagset
• There are many parts of speech and many potential
distinctions we could draw
• To do POS tagging, we need to choose a standard set of tags to
work with
• Could pick very coarse tagsets
– N, V, Adj, Adv.
• A more commonly used set is finer grained: the “Penn
TreeBank tagset”, with 45 tags
– PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
13
Penn TreeBank POS Tagset
14
Using the Penn Tagset
• The/DT grand/JJ jury/NN
commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions
marked IN (“although/IN I/PRP..”)
• Except the preposition/complementizer “to”
is just marked “TO”.
15
POS Tagging
• Words often have more than one POS: back
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• The POS tagging problem is to determine
the POS tag for a particular instance of a
word.
16
How Hard is POS Tagging? Measuring
Ambiguity
17
Two Methods for POS Tagging
1. Rule-based tagging
– (ENGTWOL)
2. Stochastic: probabilistic sequence models
• HMM (Hidden Markov Model) tagging
• MEMMs (Maximum Entropy Markov Models)
18
Rule-Based Tagging
• Start with a dictionary
• Assign all possible tags to words from the
dictionary
• Write rules by hand to selectively remove
tags, leaving the correct tag for each word.
19
Start With a Dictionary
• she: PRP
• promised: VBN,VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
20
Assign Every Possible Tag
                     NN
                     RB
     VBN             JJ         VB
PRP  VBD       TO    VB    DT   NN
She  promised  to    back  the  bill
21
Write Rules to Eliminate Tags
22
Stage 1 of ENGTWOL Tagging
• First Stage: Run words through FST
morphological analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …
23
Stage 2 of ENGTWOL Tagging
• Second Stage: Apply NEGATIVE constraints.
• Example: Adverbial “that” rule
– Eliminates all readings of “that” except the one in
• “It isn’t that odd”
Given input: “that”
If
(+1 A/ADV/QUANT) ;if next word is adj/adv/quantifier
(+2 SENT-LIM) ;following which is E-O-S
(NOT -1 SVOC/A) ; and the previous word is not a
; verb like “consider” which
; allows adjective complements
; in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
24
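One way to picture this negative constraint in code, as a minimal sketch: the tag names, lexicon entries, and verb list below are illustrative assumptions, not the actual ENGTWOL rules.

# Sketch: apply an ENGTWOL-style negative constraint for adverbial "that".
ADJ_ADV_QUANT = {"JJ", "RB", "QUANT"}   # tags counted as adj/adv/quantifier (assumed names)
SVOC_A_VERBS = {"consider", "find"}     # verbs allowing adjective complements (illustrative)

def apply_that_rule(words, candidate_tags):
    """candidate_tags[i] is the set of tags still possible for words[i]."""
    for i, word in enumerate(words):
        if word.lower() != "that":
            continue
        next_is_adj = i + 1 < len(words) and candidate_tags[i + 1] & ADJ_ADV_QUANT
        next_next_is_eos = i + 2 >= len(words)              # following word ends the sentence
        prev_is_svoc = i > 0 and words[i - 1].lower() in SVOC_A_VERBS
        if next_is_adj and next_next_is_eos and not prev_is_svoc:
            candidate_tags[i] = {"RB"}                        # eliminate non-ADV readings
        else:
            candidate_tags[i].discard("RB")                   # eliminate the ADV reading
    return candidate_tags

words = ["it", "isn't", "that", "odd"]
tags = [{"PRP"}, {"VBZ"}, {"RB", "IN", "DT", "WDT"}, {"JJ"}]
print(apply_that_rule(words, tags)[2])   # -> {'RB'}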
Hidden Markov Model Tagging
• Using an HMM to do POS tagging is a special
case of Bayesian inference
– Foundational work in computational linguistics
– Bledsoe 1959: OCR
– Mosteller and Wallace 1964: authorship
identification
• It is also related to the “noisy channel” model
that’s the basis for ASR, OCR and MT
25
POS Tagging as Sequence Classification
• We are given a sentence (an “observation” or
“sequence of observations”)
– Secretariat is expected to race tomorrow
• What is the best sequence of tags that
corresponds to this sequence of observations?
• Probabilistic view:
– Consider all possible sequences of tags
– Out of this universe of sequences, choose the tag
sequence which is most probable given the
observation sequence of n words w1…wn.
26
Getting to HMMs
• We want, out of all sequences of n tags t1…tn, the single
tag sequence such that P(t1…tn|w1…wn) is highest.
27
Getting to HMMs
• Applying Bayes’ rule rewrites this in terms of a
likelihood and a prior; the resulting equation is still
guaranteed to give us the best tag sequence (see the
derivation written out below)
29
Likelihood and Prior
30
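The equations behind these two slides, written out in standard form (Bayes' rule, then the usual bigram and word-given-tag independence assumptions):

\begin{align*}
\hat{t}_1^n
  &= \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
   = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}
   = \operatorname*{argmax}_{t_1^n}
     \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\;
     \underbrace{P(t_1^n)}_{\text{prior}} \\[4pt]
P(w_1^n \mid t_1^n) &\approx \prod_{i=1}^{n} P(w_i \mid t_i),
\qquad
P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \\[4pt]
\hat{t}_1^n &\approx \operatorname*{argmax}_{t_1^n}
  \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{align*}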
Two Kinds of Probabilities
• Tag transition probabilities P(ti|ti-1)
– Determiners likely to precede adjectives and nouns
• That/DT flight/NN
• The/DT yellow/JJ hat/NN
• So we expect P(NN|DT) and P(JJ|DT) to be high
• But P(DT|JJ) to be low
– Compute P(NN|DT) by counting in a labeled corpus:
P(NN|DT) = C(DT, NN) / C(DT)
31
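A minimal sketch of that counting estimate; the tiny tagged corpus below is invented for illustration.

# Estimate tag transition probabilities by counting:
#   P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
from collections import Counter

tagged = [("the", "DT"), ("grand", "JJ"), ("jury", "NN"),
          ("commented", "VBD"), ("on", "IN"), ("a", "DT"),
          ("number", "NN"), ("of", "IN"), ("other", "JJ"),
          ("topics", "NNS")]

tags = [t for _, t in tagged]
unigrams = Counter(tags[:-1])                      # counts of the "previous" tag
bigrams = Counter(zip(tags[:-1], tags[1:]))        # counts of adjacent tag pairs

def p_trans(prev, cur):
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_trans("DT", "NN"))   # C(DT,NN)/C(DT) = 1/2 = 0.5 on this toy corpus
print(p_trans("DT", "JJ"))   # 1/2 = 0.5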
Two Kinds of Probabilities
32
Example: The Verb “race”
33
Disambiguating “race”
34
Example
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading.
35
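A quick check of the two products, using the same numbers as above:

# Compare the two readings of "race" in "to race tomorrow/NR".
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_vb)   # ~2.7e-07
print(p_nn)   # ~3.2e-10  -> the verb reading wins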
Hidden Markov Models
• What we’ve described with these two kinds
of probabilities is a Hidden Markov Model
(HMM)
36
Definitions
• A weighted finite-state automaton adds
probabilities to the arcs
– The probabilities on the arcs leaving any state
must sum to one
• A Markov chain is a special case of a WFSA in
which the input sequence uniquely determines
which states the automaton will go through
• Markov chains can’t represent inherently
ambiguous problems
– Useful for assigning probabilities to unambiguous
sequences
37
Markov Chain for Weather
38
Markov Chain for Words
39
Markov Chain: “First-order observable
Markov Model”
• A set of states
– Q = q1, q2…qN; the state at time t is qt
• Transition probabilities:
– a set of probabilities A = a01, a02, …, an1, …, ann
– Each aij represents the probability of transitioning from
state i to state j
– The set of these is the transition probability matrix A
40
Markov Chain for Weather
• What is the probability of 4 consecutive
rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• P(3,3,3,3) =
– π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
41
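The same computation in code, as a minimal sketch; only the two probabilities used above are filled in.

# P(rainy, rainy, rainy, rainy) = pi[rainy] * a[rainy][rainy]^3
pi = {"rainy": 0.2}                  # initial probability of rainy (value from the slide)
a = {("rainy", "rainy"): 0.6}        # rainy-to-rainy transition probability (from the slide)

states = ["rainy"] * 4
p = pi[states[0]]
for prev, cur in zip(states, states[1:]):
    p *= a[(prev, cur)]
print(p)   # 0.2 * 0.6**3 = 0.0432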
HMM for Ice Cream
• You are a climatologist in the year 2799
• Studying global warming
• You can’t find any records of the weather in
Baltimore, MD for summer of 2007
• But you find Jason Eisner’s diary
• Which lists how many ice-creams Jason ate
every day that summer
• Our job: figure out how hot it was
42
Hidden Markov Model
• For Markov chains, the output symbols are the same as
the states.
– See hot weather: we’re in state hot
• But in part-of-speech tagging (and other things)
– The output symbols are words
– But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov
chain in which the observed symbols are not the same
as the states.
• This means we don’t know which state we are in.
43
Hidden Markov Models
• States Q = q1, q2, …, qN
• Observations O = o1, o2, …, oN
– Each observation is a symbol from a vocabulary V =
{v1, v2, …, vV}
• Transition probabilities
– Transition probability matrix A = {aij}
aij = P(qt = j | qt-1 = i),  1 ≤ i, j ≤ N
• Observation likelihoods
– Output probability matrix B = {bi(k)}
bi(k) = P(Xt = ok | qt = i)
• Special initial probability vector π
πi = P(q1 = i),  1 ≤ i ≤ N
44
Eisner Task
• Given
– Ice Cream Observation Sequence:
1,2,3,2,2,2,3…
• Produce:
– Weather Sequence: H,C,H,H,H,C…
45
HMM for Ice Cream
46
Transition Probabilities
47
Observation Likelihoods
48
Decoding
• OK, now we have a complete model that can give
us what we need. Recall that we need to find the most
probable tag sequence given the word sequence:
argmax over t1…tn of P(t1…tn|w1…wn)
49
The Viterbi Algorithm
50
Viterbi Example
51
Viterbi Summary
• Create an array
– With columns corresponding to inputs
– Rows corresponding to possible states
• Sweep through the array in one pass, filling
the columns left to right using our transition
probs and observation probs
• The dynamic-programming key is that we need
only store the MAX-prob path to each cell
(not all paths); see the sketch below.
52
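A compact sketch of Viterbi decoding on the ice-cream HMM. The start, transition, and emission values below are illustrative assumptions, not necessarily the numbers on the slides.

# Viterbi decoding: columns = observations, rows = states.
# Each cell stores only the best (max-probability) path reaching that state.

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            col[s] = (prob, prev)
        V.append(col)
    # Backtrace from the best final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path)), V[-1][best][0]

# Hidden states: Hot/Cold days; observations: ice creams eaten (1-3).
# Parameter values are assumptions for illustration only.
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))
# -> (['H', 'H', 'H'], 0.012544)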
Evaluation
• So once you have your POS tagger running,
how do you evaluate it?
– Overall error rate with respect to a gold-
standard test set.
– Error rates on particular tags
– Error rates on particular words
– Tag confusions...
53
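A minimal sketch of these measures, given gold and predicted tag sequences (a hypothetical helper, not from the lecture):

# Overall accuracy, per-tag error counts, and a tag confusion matrix.
from collections import Counter

def evaluate(gold, pred):
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    accuracy = correct / len(gold)
    per_tag_errors = Counter(g for g, p in zip(gold, pred) if g != p)
    confusion = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return accuracy, per_tag_errors, confusion

gold = ["DT", "NN", "VBD", "IN", "DT", "NN"]
pred = ["DT", "NN", "VBN", "IN", "DT", "NN"]
print(evaluate(gold, pred))
# (0.833..., Counter({'VBD': 1}), Counter({('VBD', 'VBN'): 1}))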
Error Analysis
• Look at a confusion matrix
54
Evaluation
• The result is compared with a manually
coded “Gold Standard”
– Typically accuracy reaches 96-97%
– This may be compared with the result for a
baseline tagger (one that uses no context).
• Important: 100% is impossible even for
human annotators.
55
Summary
• Parts of speech
• Tagsets
• Part of speech tagging
• HMM Tagging
– Markov Chains
– Hidden Markov Models
56