Session 6 - Part-Of-Speech Tagging, Sequence Labeling
Part Of Speech Tagging
English POS Tagsets
• Original Brown corpus used a large set of 87 POS
tags.
• Most common in NLP today is the Penn Treebank set
of 45 tags.
– Tagset used in these slides.
– Reduced from the Brown set for use in the context of a
parsed corpus (i.e. treebank).
• The C5 tagset used for the British National Corpus
(BNC) has 61 tags.
English Parts of Speech
Closed vs. Open Class
• Closed class categories are composed of a small,
fixed set of grammatical function words for a given
language.
– Pronouns, Prepositions, Modals, Determiners, Particles,
Conjunctions
• Open class categories have a large number of words, and new ones are easily invented.
– Nouns (Googler, textlish), Verbs (Google), Adjectives (geeky), Adverbs (automagically)
Ambiguity in POS Tagging
• “Like” can be a verb or a preposition
– I like/VBP candy.
– Time flies like/IN an arrow.
• “Around” can be a preposition, particle, or adverb
– I bought it at the shop around/IN the corner.
– I never got around/RP to getting a car.
– A new Prius costs around/RB $25K.
POS Tagging Process
• Usually assume a separate initial tokenization process that
separates and/or disambiguates punctuation, including
detecting sentence boundaries.
• Degree of ambiguity in English (based on Brown corpus)
– 11.5% of word types are ambiguous.
– 40% of word tokens are ambiguous.
• Average POS tagging disagreement amongst expert human judges for the Penn Treebank was 3.5%.
– Based on correcting the output of an initial automated tagger, which was deemed to be more accurate than tagging from scratch.
• Baseline: Picking the most frequent tag for each specific word type gives about 90% accuracy.
– 93.7% if a model for unknown words is used (Penn Treebank tagset).
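A minimal sketch of this most-frequent-tag baseline in Python, assuming the training data is available as (word, tag) pairs; the function names and the NN fallback for unknown words are illustrative choices, not part of the slides:

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # tagged_corpus: iterable of (word, tag) pairs from a labeled corpus.
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word type, remember only its single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, word_to_tag, default_tag="NN"):
    # Unknown words fall back to a default tag (a crude stand-in for an unknown-word model).
    return [word_to_tag.get(w, default_tag) for w in words]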
POS Tagging Approaches
Classification Learning
• Typical machine learning addresses the problem of
classifying a feature-vector description into a fixed
number of classes.
• There are many standard learning methods for this
task:
– Decision Trees and Rule Learning
– Naïve Bayes and Bayesian Networks
– Logistic Regression / Maximum Entropy (MaxEnt)
– Perceptron and Neural Networks
– Support Vector Machines (SVMs)
– Nearest-Neighbor / Instance-Based
Beyond Classification Learning
• Standard classification problem assumes individual
cases are disconnected and independent (i.i.d.:
independently and identically distributed).
• Many NLP problems do not satisfy this assumption
and involve making many connected decisions, each
resolving a different ambiguity, but which are
mutually dependent.
• More sophisticated learning and inference techniques
are needed to handle such situations in general.
Sequence Labeling Problem
Information Extraction
Semantic Role Labeling
• For each clause, determine the semantic role played
by each noun phrase that is an argument to the verb.
– John (agent) drove Mary (patient) from Austin (source) to Dallas (destination) in his Toyota Prius (instrument).
– The hammer (instrument) broke the window (patient).
• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing.”
Bioinformatics
• Sequence labeling is also valuable for labeling genetic sequences in genome analysis, e.g. marking spans of a DNA sequence as exon or intron:
– AGCTAACGTTCGATACGGATTACAGCCT
Problems with Sequence Labeling as Classification
Probabilistic Sequence Models
• Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determining the most likely global assignment.
• Two standard models
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
Markov Model / Markov Chain
• A finite state machine with probabilistic state
transitions.
• Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history.
Sample Markov Model for POS
– Figure: a state-transition diagram over the tags Det, Noun, Verb, and PropNoun, plus distinguished start and stop states, with a probability on each transition (e.g. start→PropNoun 0.4, PropNoun→Verb 0.8, Verb→Det 0.25, Det→Noun 0.95, Noun→stop 0.1).
Sample Markov Model for POS
– Same diagram, tracing one path through the model:
P(PropNoun Verb Det Noun) = 0.4 * 0.8 * 0.25 * 0.95 * 0.1 = 0.0076
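A minimal sketch of this computation in Python, assuming a transition table keyed by (previous state, next state); only the transitions needed for this particular path are filled in, with values taken from the slide:

# Transition probabilities for the path traced on the slide.
trans = {
    ("start", "PropNoun"): 0.4,
    ("PropNoun", "Verb"): 0.8,
    ("Verb", "Det"): 0.25,
    ("Det", "Noun"): 0.95,
    ("Noun", "stop"): 0.1,
}

def markov_chain_prob(states, trans):
    # Multiply the transition probabilities along the path start -> ... -> stop.
    path = ["start"] + states + ["stop"]
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= trans[(prev, nxt)]
    return p

print(markov_chain_prob(["PropNoun", "Verb", "Det", "Noun"], trans))  # ≈ 0.0076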
Hidden Markov Model
Sample HMM for POS
– Figure: the same transition structure (start, Det, Noun, Verb, PropNoun, stop), but each state now also emits words: Det emits "the", "a", "that"; Noun emits "cat", "dog", "car", "bed", "pen", "apple"; Verb emits "bit", "ate", "saw", "played", "hit", "gave"; PropNoun emits "John", "Mary", "Tom", "Alice", "Jerry".
Sample HMM Generation
– Figure (animated over several slides): the HMM above generates a sentence by following a random path through its states and emitting one word from each state visited, building up "John" → "John bit" → "John bit the" → "John bit the apple" before reaching the stop state.
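A minimal sketch of this generation process in Python. The transition and emission tables below are small illustrative stand-ins loosely based on the diagram, not the full distributions from the figure:

import random

def generate(transitions, emissions, start="start", stop="stop"):
    # Walk the Markov chain from start to stop, emitting one word per state visited.
    state, words = start, []
    while True:
        states, probs = zip(*transitions[state].items())
        state = random.choices(states, weights=probs)[0]
        if state == stop:
            return words
        symbols, eprobs = zip(*emissions[state].items())
        words.append(random.choices(symbols, weights=eprobs)[0])

# Illustrative fragments of the slide's model (the exact probabilities are assumptions).
transitions = {
    "start":    {"PropNoun": 0.4, "Det": 0.5, "Verb": 0.1},
    "PropNoun": {"Verb": 0.8, "Noun": 0.1, "stop": 0.1},
    "Det":      {"Noun": 0.95, "PropNoun": 0.05},
    "Noun":     {"Verb": 0.5, "Noun": 0.4, "stop": 0.1},
    "Verb":     {"Det": 0.25, "PropNoun": 0.25, "Noun": 0.25, "stop": 0.25},
}
emissions = {
    "Det":      {"the": 0.6, "a": 0.3, "that": 0.1},
    "Noun":     {"cat": 0.2, "dog": 0.2, "apple": 0.2, "bed": 0.2, "pen": 0.2},
    "Verb":     {"bit": 0.25, "ate": 0.25, "saw": 0.25, "played": 0.25},
    "PropNoun": {"John": 0.4, "Mary": 0.3, "Tom": 0.3},
}
print(" ".join(generate(transitions, emissions)))  # e.g. "John bit the apple"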
Formal Definition of an HMM
• A set of N +2 states S={s0,s1,s2, … sN, sF}
– Distinguished start state: s0
– Distinguished final state: sF
• A set of M possible observations V={v1,v2…vM}
• A state transition probability distribution A = {aij}
– Constraint: Σj=1..N aij + aiF = 1 for all 0 ≤ i ≤ N
• Observation probability distribution for each state j
B={bj(k)}
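A minimal sketch of this parameterization as a Python container, assuming states are indexed so that 0 is the start state s0 and N+1 plays the role of sF; the class and field names are illustrative:

class HMM:
    # A[i][j] is the transition probability a_ij over an (N+2) x (N+2) matrix;
    # B[j][k] is the probability b_j(k) of emitting observation v_k in state s_j.
    def __init__(self, n_states, vocab, A, B):
        self.N = n_states      # the "real" states s1..sN
        self.vocab = vocab     # observation symbols v1..vM
        self.A = A
        self.B = B

    def check(self):
        # Each state's outgoing probabilities, including the jump to sF,
        # must sum to one: sum_j a_ij + a_iF = 1 for 0 <= i <= N.
        F = self.N + 1
        for i in range(self.N + 1):
            assert abs(sum(self.A[i][1:F]) + self.A[i][F] - 1.0) < 1e-9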
Three Useful HMM Tasks
• Observation likelihood: how likely is a given observation sequence under the model?
• Most likely state sequence (decoding): which hidden state sequence best explains a given observation sequence?
• Maximum likelihood training: which model parameters best fit a set of training observation sequences?
HMM: Observation Likelihood
• Given a sequence of observations, O, and a model with a set of parameters, λ, what is the probability that this observation sequence was generated by the model: P(O | λ)?
• Allows HMM to be used as a language model: A
formal probabilistic model of a language that assigns
a probability to each string saying how likely that
string was to have been generated by the language.
• Useful for two tasks:
– Sequence Classification
– Most Likely Sequence
Sequence Classification
– Figure: an observation sequence O is scored by two trained models (one for “Austin”, one for “Boston”) and assigned to the higher-scoring one: choose Austin if P(O | Austin) > P(O | Boston).
Most Likely Sequence
– Figure: given a single trained model, candidate observation sequences (one shown is O1: “dice precedent core”) are compared, and the sequence to which the model assigns the highest probability is chosen.
HMM: Observation Likelihood
Efficient Solution
• Due to the Markov assumption, the probability of
being in any state at any given time t only relies on
the probability of being in each of the possible states
at time t−1.
• Forward Algorithm: Uses dynamic programming to
exploit this fact to efficiently compute observation
likelihood in O(TN²) time.
– Compute a forward trellis that compactly and implicitly
encodes information about all possible state paths.
Forward Trellis
– Figure: a trellis with one column of nodes per time step t1…tT and one row per state s1…sN; edges run from s0 into the first column, between adjacent columns, and from the last column into sF, so every path through the trellis is a possible state sequence.
αt(j) = P(o1, o2, …, ot, qt = sj | λ)
Forward Step
Computing the Forward Probabilities
• Initialization
α1(j) = a0j bj(o1),  1 ≤ j ≤ N
• Recursion
αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot),  1 ≤ j ≤ N, 1 < t ≤ T
• Termination
P(O | λ) = αT+1(sF) = Σi=1..N αT(i) aiF
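A minimal Python sketch of the forward algorithm over the indexing scheme used above (state 0 is s0, state N+1 is sF); the matrix layout and names are assumptions for illustration:

def forward_likelihood(A, B, obs, N):
    # obs: list of observation indices o_1..o_T into the vocabulary.
    # alpha[t][j] = P(o_1..o_t, q_t = s_j | lambda), filled in column by column.
    F, T = N + 1, len(obs)
    alpha = [[0.0] * (N + 2) for _ in range(T + 1)]
    for j in range(1, N + 1):                          # initialization (t = 1)
        alpha[1][j] = A[0][j] * B[j][obs[0]]
    for t in range(2, T + 1):                          # recursion
        for j in range(1, N + 1):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(1, N + 1)) * B[j][obs[t - 1]]
    # termination: every state can transition into the final state sF
    return sum(alpha[T][i] * A[i][F] for i in range(1, N + 1))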
Forward Computational Complexity
• Requires only O(TN²) time to compute the probability of an observed sequence given a model.
• Exploits the fact that all state sequences must merge into one of the N possible states at any point in time, and the Markov assumption that only the last state affects the next one.
Most Likely State Sequence (Decoding)
HMM: Most Likely State Sequence
Efficient Solution
• Viterbi Algorithm: again uses dynamic programming to exploit the Markov assumption, finding the most likely state sequence in O(TN²) time.
– Computes a trellis of Viterbi scores, plus backpointers used to recover the best path.
Viterbi Scores
– Figure: the same trellis layout; each node holds the Viterbi score vt(j), the probability of the most likely path that ends in state sj at time t, filled in left to right.
Viterbi Backtrace
– Figure: the same trellis with backpointers; starting from the best final state and following the stored backpointers right to left recovers the most likely state sequence.
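A minimal Viterbi sketch in the same setup as the forward algorithm above: the sums are replaced by maxima, and backpointers record which predecessor achieved each maximum (names and layout are illustrative):

def viterbi(A, B, obs, N):
    # Returns the most likely state sequence (indices 1..N) and its probability.
    F, T = N + 1, len(obs)
    vt = [[0.0] * (N + 2) for _ in range(T + 1)]
    back = [[0] * (N + 2) for _ in range(T + 1)]
    for j in range(1, N + 1):                                  # initialization
        vt[1][j] = A[0][j] * B[j][obs[0]]
    for t in range(2, T + 1):                                  # recursion: max instead of sum
        for j in range(1, N + 1):
            best = max(range(1, N + 1), key=lambda i: vt[t - 1][i] * A[i][j])
            vt[t][j] = vt[t - 1][best] * A[best][j] * B[j][obs[t - 1]]
            back[t][j] = best
    last = max(range(1, N + 1), key=lambda i: vt[T][i] * A[i][F])  # termination
    prob = vt[T][last] * A[last][F]
    path = [last]
    for t in range(T, 1, -1):                                  # backtrace
        path.append(back[t][path[-1]])
    return list(reversed(path)), prob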
HMM Learning
• Supervised Learning: All training sequences are
completely labeled (tagged).
• Unsupervised Learning: All training sequences are unlabeled (though the number of tags, i.e. states, is generally known).
• Semisupervised Learning: Some training sequences
are labeled, most are unlabeled.
Supervised HMM Training
Supervised Parameter Estimation
• Estimate the transition and observation (emission) probabilities from relative-frequency counts over the labeled training sequences.
Learning and Using HMM Taggers
• Use a corpus of labeled sequence data to easily
construct an HMM using supervised training.
• Given a novel unlabeled test sequence to tag, use the
Viterbi algorithm to predict the most likely (globally
optimal) tag sequence.
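A minimal sketch of this supervised, count-based estimation in Python, assuming the corpus is a list of sentences of (word, tag) pairs; the "<s>"/"</s>" boundary tags stand in for the start and final states, and no smoothing is applied:

from collections import Counter

def estimate_hmm(tagged_sentences):
    # Count tag-to-tag transitions and tag-to-word emissions, then normalize.
    trans, emit = Counter(), Counter()
    prev_count, tag_count = Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
        for a, b in zip(tags, tags[1:]):
            trans[(a, b)] += 1
            prev_count[a] += 1
        for w, t in sent:
            emit[(t, w)] += 1
            tag_count[t] += 1
    # Relative-frequency (maximum likelihood) estimates.
    A = {pair: c / prev_count[pair[0]] for pair, c in trans.items()}
    B = {pair: c / tag_count[pair[0]] for pair, c in emit.items()}
    return A, B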
Evaluating Taggers
• Train on training set of labeled sequences.
• Possibly tune parameters based on performance on a
development set.
• Measure accuracy on a disjoint test set.
• Generally measure tagging accuracy, i.e. the
percentage of tokens tagged correctly.
• Accuracy of most modern POS taggers, including HMMs, is 96−97% (for the Penn tagset, trained on about 800K words).
– Generally matching human agreement level.
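The token-level accuracy measure itself is a one-liner; a quick sketch:

def tagging_accuracy(predicted_tags, gold_tags):
    # Fraction of tokens whose predicted tag matches the gold-standard tag.
    assert len(predicted_tags) == len(gold_tags)
    return sum(p == g for p, g in zip(predicted_tags, gold_tags)) / len(gold_tags)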
Unsupervised
Maximum Likelihood Training
– Figure: a set of unannotated training sequences (e.g. “ah s t e n”, “a s t i n”, “oh s t u n”, “eh z t en”, …) is fed to HMM training to induce a model for “Austin”.
Maximum Likelihood Training
• Given an observation sequence, O, what set of
parameters, λ, for a given model maximizes the
probability that this data was generated from this
model (P(O| λ))?
• Used to train an HMM and properly induce its parameters from a set of training data.
• Only an unannotated observation sequence (or set of sequences) generated from the model is needed; the correct state sequence(s) for the observation sequence(s) need not be known. In this sense, it is unsupervised.
Bayes Theorem
P(H | E) = P(E | H) P(H) / P(E)
Simple proof from the definition of conditional probability:
P(H | E) = P(H ∧ E) / P(E)        (definition of conditional probability)
P(E | H) = P(H ∧ E) / P(H)        (definition of conditional probability)
P(H ∧ E) = P(E | H) P(H)          (rearranging the second line)
QED: P(H | E) = P(E | H) P(H) / P(E)   (substituting into the first line)
Maximum Likelihood vs.
Maximum A Posteriori (MAP)
EM Algorithm
EM
• Initialize: Assign random probabilistic ("soft") labels to the unlabeled data.
• Initialize: Give the soft-labeled training data to a probabilistic learner.
• Initialize: The learner produces a probabilistic classifier.
• E Step: Relabel the unlabeled data using the trained classifier.
• M Step: Retrain the classifier on the relabeled data, and repeat the E and M steps.
– Figure (repeated over several slides): unlabeled examples with soft +/− labels feed a “Prob. Learner”, which outputs a “Prob. Classifier” that in turn relabels the examples.
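The slides illustrate EM with a generic probabilistic learner over soft +/− labels. A minimal concrete instance of the same loop, chosen here purely for brevity, is a two-component one-dimensional Gaussian mixture (an assumption, not the HMM case): the E step computes soft labels (responsibilities) and the M step re-estimates the parameters from them:

import math

def em_gaussian_mixture(xs, iters=50):
    # Two Gaussian components with fixed unit variance (a simplifying assumption).
    mu = [min(xs), max(xs)]      # crude initialization of the two means
    pi = [0.5, 0.5]              # mixing weights
    for _ in range(iters):
        # E step: soft-label each point with its posterior probability of
        # belonging to each component (the "probabilistic labels" of the slides).
        resp = []
        for x in xs:
            w = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            z = sum(w)
            resp.append([wk / z for wk in w])
        # M step: re-estimate mixing weights and means from the soft labels.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
    return pi, mu

print(em_gaussian_mixture([-2.1, -1.9, -2.3, 1.8, 2.2, 2.0]))  # means end up near -2 and +2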
Semi-Supervised EM
– Figure (repeated over several slides): a set of labeled training examples (+/−) and a pool of unlabeled examples are both given to the probabilistic learner; the resulting classifier soft-labels the unlabeled examples, and the E and M steps then repeat as in standard EM.