L4 Tagging
Introduction to NLP
221. Hidden Markov Models
Markov Models
• Definition
– defined in terms of a transition matrix A and initial state probabilities π
Example
[Figure: example Markov chain with states a, b, c, d, e, f plus a start state, with transition probabilities labeling the arcs]
Visible MM
• Uses
– Part of speech tagging
– Speech recognition
– Gene sequencing
Hidden Markov Models
[Figure: a chain of hidden states S0 S1 S2 S3 … Sn, where each state emits an observed word W1 W2 W3 … Wn]
Generative Algorithm
[Figure: two-state HMM with hidden states G and H; the start state leads to G, and the transition probabilities are 0.8 (G to G), 0.2 (G to H), 0.6 (H to G), 0.4 (H to H)]
Emission Probabilities
• P(Ot = k | Xt = si, Xt+1 = sj) = bijk
x y z
G 0.7 0.2 0.1
H 0.3 0.5 0.2
All Parameters of the Model
• Initial
– P(G|start) = 1.0, P(H|start) = 0.0
• Transition
– P(G|G) = 0.8, P(G|H) = 0.6, P(H|G) = 0.2, P(H|H) = 0.4
• Emission
– P(x|G) = 0.7, P(y|G) = 0.2, P(z|G) = 0.1
– P(x|H) = 0.3, P(y|H) = 0.5, P(z|H) = 0.2
Observation sequence “yz”
• Starting in state G (or H), P(yz) = ?
• Possible sequences of states:
– GG
– GH
– HG
– HH
• P(yz) = P(yz,GG) + P(yz,GH) + P(yz,HG) + P(yz,HH)
= .8 x .2 x .8 x .1 (GG)
+ .8 x .2 x .2 x .2 (GH)
+ .2 x .5 x .6 x .1 (HG)
+ .2 x .5 x .4 x .2 (HH)
= .0128 + .0064 + .0060 + .0080 = .0332
• Each term is transition x emission x transition x emission, starting from G (the sketch below reproduces this computation)
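A minimal sketch of this brute-force computation, transcribing the transition and emission probabilities above into Python dictionaries (the helper name prob_obs is just illustrative):

```python
from itertools import product

# Parameters from the slides; the process starts in state G.
trans = {("G", "G"): 0.8, ("G", "H"): 0.2,
         ("H", "G"): 0.6, ("H", "H"): 0.4}
emit = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
        "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

def prob_obs(obs, start="G"):
    """Sum the joint probability of obs over every hidden state sequence."""
    total = 0.0
    for states in product("GH", repeat=len(obs)):
        p, prev = 1.0, start
        for s, o in zip(states, obs):
            p *= trans[(prev, s)] * emit[s][o]  # transition, then emission
            prev = s
        total += p
    return total

print(prob_obs("yz"))  # 0.0332
```

Enumerating all state sequences is exponential in the sequence length; the Forward Algorithm later in this lecture computes the same quantity efficiently.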
Hidden Markov Model
States and Transitions
• An HMM is essentially a weighted finite-state
transducer
– The states encode the most recent history
– The transitions encode likely sequences of states
• e.g., Adj-Noun or Noun-Verb
• or perhaps Art-Adj-Noun
– Use MLE to estimate the probabilities
• Another way to think of an HMM
– It’s a natural extension of Naïve Bayes to sequences
Emissions
• Estimating the emission probabilities
– Harder than transition probabilities (why?)
– There may be novel uses of word/POS combinations
• Suggestions
– It is possible to use standard smoothing
– As well as heuristics (e.g., based on the spelling of the words)
Sequence of Observations
• The observer can only see the emitted symbols
• Observation likelihood
– Given the observation sequence S and the model = (A,B,), what
is the probability P(S|) that the sequence was generated by that
model.
• Being able to compute the probability of the observation sequence turns the HMM into a language model
Tasks with HMM
• Given µ = (A, B, π), find P(O|µ)
– Uses the Forward Algorithm
• Given O and µ, find the most likely state sequence (X1, …, XT+1)
– Uses the Viterbi Algorithm
• Given O and a space of possible models µ1, …, µm, find the model µi that best describes the observations
– Uses Expectation-Maximization
Inference
• Find the most likely sequence of tags, given the sequence of words
– t* = argmax_t P(t|w)
• Given the model µ, it is possible to compute P(t|w) for all values of t
– In practice, there are way too many combinations
• Greedy Search
• Beam Search
– One possible solution
– Uses partial hypotheses
– At each state, only keep the k best hypotheses so far
– May not find the best overall sequence, since the correct path can fall off the beam (see the sketch below)
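A minimal sketch of beam search over tag sequences, assuming bigram-HMM scores stored in the same dictionary format as the G/H example (trans[(prev_tag, tag)] and emit[tag][word]); the "<s>" start marker and the function name are illustrative:

```python
import heapq

def beam_search_tag(words, tagset, trans, emit, k=3):
    """Keep only the k best partial tag sequences (hypotheses) at each word."""
    beams = [(1.0, ["<s>"])]  # (score, partial tag sequence); "<s>" marks the start
    for w in words:
        candidates = []
        for score, seq in beams:
            for t in tagset:
                s = score * trans.get((seq[-1], t), 0.0) * emit.get(t, {}).get(w, 0.0)
                candidates.append((s, seq + [t]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])  # prune to k
    best_score, best_seq = max(beams, key=lambda c: c[0])
    return best_seq[1:], best_score  # drop the start marker
```

With k equal to the number of possible tag sequences this reduces to exhaustive search; with k = 1 it is greedy search.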
Viterbi Algorithm
[Figure: Viterbi trellis for the observation sequence "y z ." over states G and H, built up across several slides; arcs carry emission probabilities such as P(y|G) and transition probabilities such as P(H|G) and P(H|H), and the cells hold the best-path probabilities P(G,t) and P(H,t) at each time step]
HMM Trellis
• The final step of the recursion:
– P(end,t=3) = max( P(G,t=2) x P(end|G), P(H,t=2) x P(end|H) )
• A code sketch of the full algorithm is given below
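A minimal sketch of the Viterbi algorithm, reusing the trans and emit dictionaries from the earlier P(yz) sketch and following the slide convention that each step is a transition out of the previous state followed by an emission:

```python
def viterbi(obs, states, trans, emit, start="G"):
    """Most likely hidden state sequence for obs under the bigram HMM above."""
    best = {start: 1.0}  # best[s] = probability of the best path currently in state s
    backpointers = []
    for o in obs:
        new_best, pointers = {}, {}
        for s in states:
            p, prev = max(((best[r] * trans[(r, s)] * emit[s][o], r) for r in best),
                          key=lambda x: x[0])
            new_best[s], pointers[s] = p, prev
        best = new_best
        backpointers.append(pointers)
    # trace back from the best final state
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))[1:], best[last]  # drop the start state

print(viterbi("yz", ["G", "H"], trans, emit))  # (['G', 'G'], 0.0128)
```

The returned 0.0128 is the probability of the single best path GG, whereas the earlier brute-force sum over all paths gave 0.0332.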
Beam Search
Some Observations
• Advantages of HMMs
– Relatively high accuracy
– Easy to train
• Higher-Order HMM
– The previous example was about bigram HMMs
– How can you modify it to work with trigrams?
How to compute P(O)
• Viterbi was used to find the most likely sequence of
states that matches the observation
• What if we want to find all state sequences that match the observation?
• We can add their probabilities (because they are
mutually exclusive) to form the probability of the
observation
• This is done using the Forward Algorithm
The Forward Algorithm
• Used to compute the probability of a sequence
• Very similar to Viterbi
• Instead of max, it uses sum (see the sketch below)
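A minimal sketch, reusing the trans and emit dictionaries from the P(yz) example; it is the Viterbi sketch above with max replaced by sum:

```python
def forward(obs, states, trans, emit, start="G"):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {start: 1.0}  # alpha[s] = total probability of paths ending in state s
    for o in obs:
        alpha = {s: sum(alpha[r] * trans[(r, s)] * emit[s][o] for r in alpha)
                 for s in states}
    return sum(alpha.values())

print(forward("yz", ["G", "H"], trans, emit))  # 0.0332, matching the earlier sum
```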
Introduction to NLP
222. Learning in Hidden Markov Models
HMM Learning
• Supervised
– Training sequences are labeled
• Unsupervised
– Training sequences are unlabeled
– Known number of states
• Semi-supervised
– Some training sequences are labeled
Supervised HMM Learning
231. Statistical POS Tagging
HMM Tagging
• T = argmax P(T|W)
– where T=t1,t2,…,tn
• By Bayes’ theorem
– P(T|W) = P(T)P(W|T)/P(W)
• Thus we are attempting to choose the sequence of
tags that maximizes the RHS of the equation
– P(W) can be ignored
– P(T) is called the prior, P(W|T) is called the likelihood.
HMM Tagging
• Complete formula
– P(T)P(W|T) = Π P(wi | w1 t1 … wi-1 ti-1 ti) P(ti | t1 … ti-2 ti-1)
• Simplification 1:
– P(W|T) = ΠP(wi|ti)
• Simplification 2:
– P(T)= ΠP(ti|ti-1)
• Bigram approximation
– T = argmax P(T|W) = argmax ΠP(wi|ti) P(ti|ti-1)
Example
DT NN VBP TO NN .
DT NN VBP TO VB .
• Transition probabilities
P(NN|JJ) = C(JJ,NN)/C(JJ) = 22301/89401 = .249
• Emission probabilities
P(this|DT) = C(DT,this)/C(DT) = 7037/103687 = .068
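These are relative-frequency (MLE) estimates from corpus counts. A minimal sketch of how they could be computed from a tagged corpus; the input format (a list of sentences, each a list of (word, tag) pairs) and the "<s>" start tag are assumptions:

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """Estimate P(tag|prev_tag) = C(prev_tag, tag) / C(prev_tag)
    and P(word|tag) = C(tag, word) / C(tag) by counting."""
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            tag_count[prev] += 1
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            prev = tag
        tag_count[prev] += 1  # count the final tag of the sentence as well
    trans = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return trans, emit

# e.g. trans[("JJ", "NN")] corresponds to P(NN|JJ) = C(JJ,NN)/C(JJ)
#      emit[("DT", "this")] corresponds to P(this|DT) = C(DT,this)/C(DT)
```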
Evaluating Taggers
• Data set
– Training set
– Development set
– Test set
• Tagging accuracy
– the percentage of tags that are correct
HMM POS Results
[Manning 2011]
Confusion Matrix
232. Information Extraction
Information Extraction
• Times
– Absolute expressions
– Relative expressions (e.g., “last night”)
• Events
– E.g., a plane went past the end of the runway
Named Entity Recognition (NER)
• Segmentation
– Which words belong to a named entity?
– Brazilian football legend Pele's condition has improved, according to
a Thursday evening statement from a Sao Paulo hospital.
• Classification
– What type of named entity is it?
– Use gazetteers, spelling, adjacent words, etc.
– Brazilian football legend [PERSON Pele]'s condition has improved,
according to a [TIME Thursday evening] statement from a [LOCATION
Sao Paulo] hospital.
NER, Time, and Event extraction
• Brazilian football legend [PERSON Pele]'s condition has
improved, according to a [TIME Thursday evening]
statement from a [LOCATION Sao Paulo] hospital.
• There had been earlier concerns about Pele's health after
[ORG Albert Einstein Hospital] issued a release that said
his condition was "unstable."
• [TIME Thursday night]'s release said [EVENT Pele was
relocated] to the intensive care unit because a kidney
dialysis machine he needed was in ICU.
Event Extraction
Named Entities
Named Entity Recognition (NER)
Sample Input for NER
( (S
(NP-SBJ-1
(NP (NNP Rudolph) (NNP Agnew) )
(, ,)
(UCP
(ADJP
(NP (CD 55) (NNS years) )
(JJ old) )
(CC and)
(NP
(NP (JJ former) (NN chairman) )
(PP (IN of)
(NP (NNP Consolidated) (NNP Gold) (NNP Fields) (NNP PLC) ))))
(, ,) )
(VP (VBD was)
(VP (VBN named)
(S
(NP-SBJ (-NONE- *-1) )
(NP-PRD
(NP (DT a) (JJ nonexecutive) (NN director) )
(PP (IN of)
(NP (DT this) (JJ British) (JJ industrial) (NN conglomerate) ))))))
(. .) ))
Sample Output for NER (IOB format)
file_id sent_id word_id iob_inner pos word
0002 1 0 B-PER NNP Rudolph
0002 1 1 I-PER NNP Agnew
0002 1 2 O COMMA COMMA
0002 1 3 B-NP CD 55
0002 1 4 I-NP NNS years
0002 1 5 B-ADJP JJ old
0002 1 6 O CC and
0002 1 7 B-NP JJ former
0002 1 8 I-NP NN chairman
0002 1 9 B-PP IN of
0002 1 10 B-ORG NNP Consolidated
0002 1 11 I-ORG NNP Gold
0002 1 12 I-ORG NNP Fields
0002 1 13 I-ORG NNP PLC
0002 1 14 O COMMA COMMA
0002 1 15 B-VP VBD was
0002 1 16 I-VP VBN named
0002 1 17 B-NP DT a
0002 1 18 I-NP JJ nonexecutive
0002 1 19 I-NP NN director
0002 1 20 B-PP IN of
0002 1 21 B-NP DT this
0002 1 22 I-NP JJ British
0002 1 23 I-NP JJ industrial
0002 1 24 I-NP NN conglomerate
0002 1 25 O . .
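A minimal sketch of turning a column of IOB tags like the one above into labeled entity spans (the function name and the four-token excerpt are illustrative):

```python
def iob_to_entities(words, tags):
    """Collect (label, text) pairs from B-/I-/O tags."""
    entities, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:
                entities.append((label, " ".join(words[start:i])))
            start, label = i, tag[2:]  # open a new span
        elif tag == "O":
            if label is not None:
                entities.append((label, " ".join(words[start:i])))
            start, label = None, None  # close the current span
    if label is not None:
        entities.append((label, " ".join(words[start:])))
    return entities

words = ["Rudolph", "Agnew", "COMMA", "55"]
tags = ["B-PER", "I-PER", "O", "B-NP"]
print(iob_to_entities(words, tags))  # [('PER', 'Rudolph Agnew'), ('NP', '55')]
```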
NER Demos
• http://nlp.stanford.edu:8080/ner/
• http://cogcomp.org/page/demo_view/ner
• http://demo.allennlp.org/named-entity-recognition
NER Extraction Features
Feature Encoding in NER
NER as Sequence Labeling
• Many NLP problems can be cast as sequence labeling problems
– POS – part of speech tagging
– NER – named entity recognition
– SRL – semantic role labeling
• Input
– A sequence of words w1 w2 w3 …
• Output
– Labeled words
• Classification methods
– Can use the categories of the previous tokens as features in classifying the
next one
– Direction matters (a left-to-right sketch follows this list)
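A minimal sketch of left-to-right greedy labeling in which the category predicted for the previous token is fed back in as a feature; classify is a hypothetical trained classifier (e.g. a logistic-regression model over these features):

```python
def greedy_label(words, classify):
    """Label words left to right, using the previous predicted tag as a feature."""
    tags = []
    for i, w in enumerate(words):
        features = {
            "word": w.lower(),
            "is_capitalized": w[:1].isupper(),
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
            "prev_tag": tags[-1] if tags else "<s>",  # category of the previous token
        }
        tags.append(classify(features))
    return tags
```

Running right to left instead would condition on the following tag rather than the preceding one, which is why direction matters.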
NER as Sequence Labeling
Temporal Expressions
Temporal Lexical Triggers
TempEx Example
TimeML
TimeBank
Biomedical example
• Gene labeling
• Sentence:
– [GENE BRCA1] and [GENE BRCA2] are human genes
that produce tumor suppressor proteins
Other Examples
• Job announcements
– Location, title, starting date, qualifications, salary
• Seminar announcements
– Time, title, location, speaker
• Medical papers
– Drug, disease, gene/protein, cell line, species, substance
Filling the Templates
233. Relation Extraction
Relation Extraction
• Person-person
– ParentOf, MarriedTo, Manages
• Person-organization
– WorksFor
• Organization-organization
– IsPartOf
• Organization-location
– IsHeadquarteredAt
Relation Extraction
• Core NLP task
– Used for building knowledge bases, question answering
• Input
– Mazda North American Operations is headquartered in Irvine,
Calif., and oversees the sales, marketing, parts and customer
service support of Mazda vehicles in the United States and
Mexico through nearly 700 dealers.
• Output
– IsHeadquarteredIn (Mazda North American Operations, Irvine)
Relation Extraction
• Using patterns
– Regular expressions (a sketch follows this list)
– Gazetteers
• Supervised learning
• Semi-supervised learning
– Using seeds
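A minimal sketch of the pattern-based approach for one relation, using a single hand-written regular expression for IsHeadquarteredIn (the pattern and function name are illustrative; real systems combine many patterns, gazetteers, or learned models):

```python
import re

HEADQUARTERED = re.compile(
    r"(?P<org>[A-Z][\w&. ]+?) is headquartered in (?P<loc>[A-Z][\w.]*)")

def extract_headquarters(text):
    """Return IsHeadquarteredIn(org, loc) pairs matched by the pattern."""
    return [(m.group("org").strip(), m.group("loc"))
            for m in HEADQUARTERED.finditer(text)]

print(extract_headquarters(
    "Mazda North American Operations is headquartered in Irvine, Calif."))
# [('Mazda North American Operations', 'Irvine')]
```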
Relation Extraction
The ACE Evaluation
• Newspaper data
• Entities:
– Person, Organization, Facility, Location, Geopolitical Entity
• Relations:
– Role, Part, Located, Near, Social
The ACE Evaluation
Semantic Relations
Extracting IS-A Relations
• Hearst’s patterns
– X and other Y
– X or other Y
– Y such as X
– Y, including X
– Y, especially X
• Example
– Evolutionary relationships between the platypus and other
mammals
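A minimal sketch of applying two of these patterns with regular expressions, using naive single-word "noun phrases" (a real extractor would match NP chunks instead):

```python
import re

def hearst_is_a(text):
    """Extract (hyponym, hypernym) pairs using two Hearst patterns."""
    pairs = []
    # Pattern "X and other Y": X IS-A Y
    for x, y in re.findall(r"(\w+) and other (\w+)", text):
        pairs.append((x, y))
    # Pattern "Y such as X1, X2 and X3": each Xi IS-A Y
    for y, xs in re.findall(r"(\w+) such as ([\w ,]+)", text):
        for x in re.split(r",|\band\b", xs):
            if x.strip():
                pairs.append((x.strip(), y))
    return pairs

print(hearst_is_a("Evolutionary relationships between the platypus and other mammals"))
# [('platypus', 'mammals')]
```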
Hypernym Extraction (Hearst)
Supervised Relation Extraction