L4 Tagging

The document provides an introduction to hidden Markov models (HMMs) and their applications. It discusses key concepts such as the Markov property, visible and hidden states, generative processes, parameter estimation, and algorithms for inference, including the Viterbi and Forward algorithms. Learning in HMMs can be supervised, unsupervised, or semi-supervised, depending on whether the training data is labeled. Maximum likelihood estimation is commonly used to learn the transition and emission probabilities.

NLP

Introduction to NLP

221.
Hidden Markov Models
Markov Models

• Sequence of random variables that aren’t independent


• Examples
– Weather reports
– Text
– Stock market numbers
Definition
Properties
• Limited horizon (the Markov property):
P(Xt+1 = sk|X1,…,Xt) = P(Xt+1 = sk|Xt)

• Time invariant (stationary):
P(Xt+1 = sk|Xt) = P(X2 = sk|X1)

[Portrait: Andrey Markov]

• Definition
– in terms of a transition matrix A and initial state probabilities π
Example

[Figure: Markov chain over states a–f with a start state; the arcs used in the next slide are start → d = 1.0, d → a = 0.7, a → b = 0.8.]
Visible MM

P(X1,…XT) = P(X1) P(X2|X1) P(X3|X1,X2) … P(XT|X1,…,XT-1)

= P(X1) P(X2|X1) P(X3|X2) … P(XT|XT-1)


= π_X1 Π_{t=1..T-1} a_{Xt,Xt+1}

P(d, a, b) = P(X1=d) P(X2=a|X1=d) P(X3=b|X2=a)


= 1.0 x 0.7 x 0.8
= 0.56
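
As a sanity check, here is a minimal Python sketch of this chain-probability computation; pi and A hold only the (assumed) arcs read off the diagram above.

# Sketch: probability of a state sequence in a visible Markov model.
pi = {"d": 1.0}                          # initial state probabilities
A = {("d", "a"): 0.7, ("a", "b"): 0.8}   # transition probabilities

def sequence_prob(states):
    p = pi.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        p *= A.get((prev, nxt), 0.0)
    return p

print(sequence_prob(["d", "a", "b"]))    # 1.0 x 0.7 x 0.8 = 0.56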
Hidden Markov Models
• Motivation
– Observing a sequence of symbols
– The sequence of states that led to the generation of the symbols is hidden
– The states correspond to hidden (latent) variables
• Definition
– Q = states
– O = observations, drawn from a vocabulary
– q0,qf = special (start, final) states
– A = state transition probabilities
– B = symbol emission probabilities
– π = initial state probabilities
– µ = (A, B, π) = complete probabilistic model
Hidden Markov Models

• Uses
– Part of speech tagging
– Speech recognition
– Gene sequencing
Hidden Markov Models

• Can be used to model state sequences and observation sequences
• Example:
– P(s,w) = Π_i P(si|si-1) P(wi|si)

S0 S1 S2 S3 … Sn

W1 W2 W3 … Wn
Generative Algorithm

• Pick start state from π


• For t = 1..T
– Move to another state based on A
– Emit an observation based on B
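
A minimal Python sketch of this generative process, using the toy G/H model defined on the following slides (transition first, then emission, matching the steps above):

import random

# Sketch of the generative process for the toy G/H model below.
pi = {"G": 1.0, "H": 0.0}
A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
     "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(T):
    state, obs = draw(pi), []
    for _ in range(T):
        state = draw(A[state])       # move to another state based on A
        obs.append(draw(B[state]))   # emit an observation based on B
    return obs

print(generate(5))   # e.g. ['x', 'x', 'y', 'x', 'z']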
State Transition Probabilities

[Figure: two-state transition diagram with start → G = 1.0, G → G = 0.8, G → H = 0.2, H → G = 0.6, H → H = 0.4.]
Emission Probabilities

• P(Ot=k|Xt=si,Xt+1=sj) = bijk

x y z
G 0.7 0.2 0.1
H 0.3 0.5 0.2
All Parameters of the Model
• Initial
– P(G|start) = 1.0, P(H|start) = 0.0
• Transition
– P(G|G) = 0.8, P(G|H) = 0.6, P(H|G) = 0.2, P(H|H) = 0.4
• Emission
– P(x|G) = 0.7, P(y|G) = 0.2, P(z|G) = 0.1
– P(x|H) = 0.3, P(y|H) = 0.5, P(z|H) = 0.2
Observation sequence “yz”
• Starting in state G (or H), P(yz) = ?
• Possible sequences of states:
– GG
– GH
– HG
– HH
• P(yz) = P(yz, GG) + P(yz, GH) + P(yz, HG) + P(yz, HH)
= .8 x .2 x .8 x .1 (GG)
+ .8 x .2 x .2 x .2 (GH)
+ .2 x .5 x .4 x .2 (HH)
+ .2 x .5 x .6 x .1 (HG)
= .0128 + .0064 + .0080 + .0060 = .0332
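
The same number can be checked by brute force, summing the joint probability of "yz" with every state sequence (a sketch; transition precedes emission, as in the generative algorithm):

from itertools import product

A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
     "H": {"x": 0.3, "y": 0.5, "z": 0.2}}

def joint(states, symbols, start="G"):
    # P(symbols, states | start): multiply transition x emission per step
    p, prev = 1.0, start
    for s, o in zip(states, symbols):
        p *= A[prev][s] * B[s][o]
        prev = s
    return p

print(sum(joint(seq, "yz") for seq in product("GH", repeat=2)))  # 0.0332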
Hidden Markov Model
States and Transitions
• An HMM is essentially a weighted finite-state
transducer
– The states encode the most recent history
– The transitions encode likely sequences of states
• e.g., Adj-Noun or Noun-Verb
• or perhaps Art-Adj-Noun
– Use MLE to estimate the probabilities
• Another way to think of an HMM
– It’s a natural extension of Naïve Bayes to sequences
Emissions
• Estimating the emission probabilities
– Harder than transition probabilities (why?)
– There may be novel uses of word/POS combinations
• Suggestions
– It is possible to use standard smoothing
– As well as heuristics (e.g., based on the spelling of the words)
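
For instance, a minimal add-one (Laplace) smoothing sketch for emissions; the counts and vocabulary size below are hypothetical:

count_tw = {("NN", "travel"): 3}   # C(tag, word)
count_t = {"NN": 10}               # C(tag)
V = 1000                           # vocabulary size

def emission(tag, word):
    # unseen (tag, word) pairs get a small, nonzero probability
    return (count_tw.get((tag, word), 0) + 1) / (count_t.get(tag, 0) + V)

print(emission("NN", "travel"), emission("NN", "unseenword"))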
Sequence of Observations
• The observer can only see the emitted symbols
• Observation likelihood
– Given the observation sequence S and the model µ = (A, B, π), what
is the probability P(S|µ) that the sequence was generated by that
model?
• Being able to compute the probability of an observation
sequence turns the HMM into a language model
Tasks with HMM
• Given µ = (A, B, π), find P(O|µ)
– Uses the Forward Algorithm
• Given O and µ, find the best state sequence (X1,…,XT+1)
– Uses the Viterbi Algorithm
• Given O and a space of all possible models µ1,…,µm, find the
model µi that best describes the observations
– Uses Expectation-Maximization
Inference
• Find the most likely sequence of tags, given the sequence of words
– t* = argmax_t P(t|w)
• Given the model µ, it is possible to compute P(t|w) for all values of t
– In practice, there are way too many combinations to enumerate
• Greedy Search
– Keep only the single best hypothesis at each step
• Beam Search (sketched below)
– One possible solution
– Uses partial hypotheses
– At each state, only keep the k best hypotheses so far
– May miss the globally best sequence
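
A minimal beam-search sketch for tagging; p_trans and p_emit are stand-ins for the HMM parameters, and k = 1 reduces it to greedy search:

def beam_search(words, tags, p_trans, p_emit, k=3):
    beam = [(1.0, [])]               # (score, partial tag sequence)
    for w in words:
        candidates = []
        for score, seq in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append(
                    (score * p_trans(prev, t) * p_emit(t, w), seq + [t]))
        # keep only the k best partial hypotheses
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beam[0]

# toy usage with uniform stand-in probabilities
print(beam_search("yz", "GH", lambda p, t: 0.5, lambda t, w: 0.25, k=2))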
Viterbi Algorithm

• Find the best path up to observation i and state s


• Characteristics
– Uses dynamic programming
– Memoization
– Backpointers
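
A minimal log-space Viterbi sketch for the toy G/H model (assumptions: fixed start state, no explicit end state):

import math

def viterbi(obs, states, A, B, start="G"):
    # delta[s] = best log score ending in s; psi holds backpointers
    delta = {s: math.log(A[start][s] * B[s][obs[0]]) for s in states}
    psi = []
    for o in obs[1:]:
        prev, delta, back = delta, {}, {}
        for s in states:
            best = max(prev, key=lambda r: prev[r] + math.log(A[r][s]))
            delta[s] = prev[best] + math.log(A[best][s] * B[s][o])
            back[s] = best
        psi.append(back)
    last = max(delta, key=delta.get)
    path = [last]
    for back in reversed(psi):       # follow the backpointers
        path.append(back[path[-1]])
    return list(reversed(path)), delta[last]

A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
     "H": {"x": 0.3, "y": 0.5, "z": 0.2}}
print(viterbi("yz", "GH", A, B))     # (['G', 'G'], log 0.0128)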
HMM Trellis

[Figure: trellis over the observation sequence "y z ." with one copy of the states (start, G, H, end) per time step; arcs carry transition and emission probabilities such as P(y|G), P(H|G), P(H|H).]

• Initialization (t = 1):
P(G,t=1) = P(start) x P(G|start) x P(y|G)
P(H,t=1) = P(start) x P(H|start) x P(y|H)

• Recurrence (t = 2), e.g. for state H:
P(H,t=2) = max (P(G,t=1) x P(H|G) x P(z|H),
P(H,t=1) x P(H|H) x P(z|H))

• Termination (t = 3):
P(end,t=3) = max (P(G,t=2) x P(end|G),
P(H,t=2) x P(end|H))

• P(end,t=3) is the best score for the sequence. Use the backpointers to find the corresponding sequence of states.
Beam Search
Some Observations

• Advantages of HMMs
– Relatively high accuracy
– Easy to train
• Higher-Order HMM
– The previous example was about bigram HMMs
– How can you modify it to work with trigrams?
How to compute P(O)
• Viterbi was used to find the most likely sequence of states that
matches the observations
• What if we want to find all state sequences that match the
observations?
• We can add their probabilities (because the state sequences are
mutually exclusive) to form the probability of the observation
sequence
• This is done using the Forward Algorithm
The Forward Algorithm
• Used to compute the probability of a sequence
• Very similar to Viterbi
• Instead of max we use sum
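
A sketch, reusing the toy model from above: the Viterbi recurrence with max replaced by sum gives P(O|µ).

def forward(obs, states, A, B, start="G"):
    # alpha[s] = probability of the observations so far, ending in s
    alpha = {s: A[start][s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

A = {"G": {"G": 0.8, "H": 0.2}, "H": {"G": 0.6, "H": 0.4}}
B = {"G": {"x": 0.7, "y": 0.2, "z": 0.1},
     "H": {"x": 0.3, "y": 0.5, "z": 0.2}}
print(forward("yz", "GH", A, B))   # 0.0332, matching the earlier sum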
NLP
Introduction to NLP

222.
Learning in Hidden Markov Models
HMM Learning
• Supervised
– Training sequences are labeled
• Unsupervised
– Training sequences are unlabeled
– Known number of states
• Semi-supervised
– Some training sequences are labeled
Supervised HMM Learning

• Estimate the state transition probabilities using MLE

a_ij = Count(q_t = s_i, q_{t+1} = s_j) / Count(q_t = s_i)

• Estimate the observation probabilities using MLE

b_j(k) = Count(q_t = s_j, o_t = v_k) / Count(q_t = s_j)

• Use smoothing
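
A sketch of these counts in code, on a hypothetical two-sentence tagged corpus:

from collections import Counter

tagged = [[("the", "DT"), ("rich", "NN"), ("travel", "VB")],
          [("the", "DT"), ("travel", "NN")]]   # hypothetical toy data

trans, emit, tags = Counter(), Counter(), Counter()
for sentence in tagged:
    prev = "<s>"
    for word, tag in sentence:
        trans[(prev, tag)] += 1      # Count(q_t = prev, q_t+1 = tag)
        emit[(tag, word)] += 1       # Count(q_t = tag, o_t = word)
        tags[tag] += 1
        prev = tag

def a(i, j):   # P(s_j | s_i)
    return trans[(i, j)] / sum(c for (p, _), c in trans.items() if p == i)

def b(j, k):   # P(v_k | s_j)
    return emit[(j, k)] / tags[j]

print(a("<s>", "DT"), b("NN", "travel"))   # 1.0 0.5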
Unsupervised HMM Training
• Given:
– observation sequences
• Goal:
– build the HMM
• Use EM (Expectation Maximization) methods
– forward-backward (Baum-Welch) algorithm
– Baum-Welch finds an approximate solution for P(O|µ)
Outline of Baum-Welch
• Algorithm
– Randomly set the parameters of the HMM
– Repeat until the parameters converge:
• E step – determine the probability of the various state sequences for
generating the observations
• M step – reestimate the parameters based on these probabilities
• Notes
– the algorithm guarantees that at each iteration the likelihood of the
data P(O|µ) increases
– it can be stopped at any point and give a partial solution
– it converges to a local maximum
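
A compact, unscaled Baum-Welch sketch (illustration only: state-emission HMM, a single observation sequence of symbol indices, no log-space scaling, so it is suitable only for short toy inputs):

import numpy as np

def baum_welch(obs, n_states, n_symbols, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # randomly set the parameters of the HMM
    pi = rng.random(n_states); pi /= pi.sum()
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    T = len(obs)
    for _ in range(iters):
        # E step: forward/backward passes (unscaled)
        alpha = np.zeros((T, n_states)); beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()           # P(O | current model)
        gamma = alpha * beta / likelihood      # state posteriors
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
        # M step: reestimate parameters from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for t in range(T):
            B[:, obs[t]] += gamma[t]
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

pi, A, B = baum_welch([0, 1, 0, 0, 2, 1, 0], n_states=2, n_symbols=3)
print(np.round(A, 3))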
NLP
Introduction to NLP

231.
Statistical POS Tagging
HMM Tagging
• T = argmax P(T|W)
– where T=t1,t2,…,tn
• By Bayes’ theorem
– P(T|W) = P(T)P(W|T)/P(W)
• Thus we are attempting to choose the sequence of
tags that maximizes the RHS of the equation
– P(W) can be ignored
– P(T) is called the prior, P(W|T) is called the likelihood.
HMM Tagging
• Complete formula
– P(T)P(W|T) = ΠP(wi|w1t1…wi-1ti-1ti)P(ti|t1…ti-2ti-1)
• Simplification 1:
– P(W|T) = ΠP(wi|ti)
• Simplification 2:
– P(T)= ΠP(ti|ti-1)
• Bigram approximation
– T = argmax P(T|W) = argmax ΠP(wi|ti) P(ti|ti-1)
Example

• The/DT rich/JJ like/VBP to/TO travel/VB ./.


Example

DT NN VBP TO NN .

The rich like to travel .

DT NN VBP TO VB .

The rich like to travel .


Maximum Likelihood Estimates

• Transition probabilities
P(NN|JJ) = C(JJ,NN)/C(JJ) = 22301/89401 = .249

• Emission probabilities
P(this|DT) = C(DT,this)/C(DT) = 7037/103687 = .068
Evaluating Taggers
• Data set
– Training set
– Development set
– Test set
• Tagging accuracy
– the percentage of words that receive the correct tag
HMM POS Results

• Assigning each word its most likely tag: 90%


• Trigram HMM 95% (55% on unknown words)
• Tuned HMM (Brants 1998): 96.2% (86.0%)
• SOTA (Bi-LSTM CRF): 97.5% (89+%)
• Numbers thanks to Dan Klein and Greg Durrett
Remaining Errors
• Words not seen with that tag in training: 4.5%
• Unknown word: 4.5%
• Could get right: 16% (needs parsing)
• Difficult decision: 20% (“set” = VBP or VBD?)
• Underspecified/unclear, gold standard inconsistent/wrong:
58% (e.g., is "discontinued" JJ or VBN?)

[Manning 2011]
Confusion Matrix

[Example from Toutanova+Manning’00 via Dan Klein]


Notes on POS
• New domains
– Lower performance
• New languages
– Morphology matters! Also availability of training data
• Distributional clustering
– Combine statistics about semantically related words
– Example: names of companies
– Example: days of the week
– Example: animals
Notes on POS
• British National Corpus
– http://www.natcorp.ox.ac.uk/
• Tagset sizes
– PTB 45, Brown 85, Universal 12, Twitter 25
• Dealing with unknown words
– Look at features like twoDigitNum, allCaps, initCaps,
containsDigitAndSlash (Bikel et al. 1999)
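
A sketch of such word-shape features (the feature names follow the slide; the exact tests are assumptions):

import re

def word_feature(word):
    # map an unknown word to a coarse shape class (Bikel-style)
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if "/" in word and any(c.isdigit() for c in word):
        return "containsDigitAndSlash"
    if word.isupper():
        return "allCaps"
    if word[:1].isupper():
        return "initCaps"
    return "other"

print([word_feature(w) for w in ["90", "3/21/91", "IBM", "Sally", "the"]])
# ['twoDigitNum', 'containsDigitAndSlash', 'allCaps', 'initCaps', 'other']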
HMM Spreadsheet

• Jason Eisner’s awesome interactive spreadsheet about learning HMMs
– http://cs.jhu.edu/~jason/papers/#eisner-2002-tnlp
– http://cs.jhu.edu/~jason/papers/eisner.hmm.xls
NLP
Introduction to NLP

232.
Information Extraction
Information Extraction

• Usually from unstructured or semi-structured data


• Examples
– News stories
– Scientific papers
– Resumes
• Entities
– Who did what, when, where, why
• Build knowledge base (KBP Task)
Named Entities
• Types:
– People
– Locations
– Organizations
• Teams, Newspapers, Companies
– Geo-political entities
• Ambiguity:
– London can be a person, a city, a country (by metonymy), etc.
• Useful for interfaces to databases, question answering,
etc.
Times and Events

• Times
– Absolute expressions
– Relative expressions (e.g., “last night”)
• Events
– E.g., a plane went past the end of the runway
Named Entity Recognition (NER)
• Segmentation
– Which words belong to a named entity?
– Brazilian football legend Pele's condition has improved, according to
a Thursday evening statement from a Sao Paulo hospital.
• Classification
– What type of named entity is it?
– Use gazetteers, spelling, adjacent words, etc.
– Brazilian football legend [PERSON Pele]'s condition has improved,
according to a [TIME Thursday evening] statement from a [LOCATION
Sao Paulo] hospital.
NER, Time, and Event extraction
• Brazilian football legend [PERSON Pele]'s condition has
improved, according to a [TIME Thursday evening]
statement from a [LOCATION Sao Paulo] hospital.
• There had been earlier concerns about Pele's health after
[ORG Albert Einstein Hospital] issued a release that said
his condition was "unstable."
• [TIME Thursday night]'s release said [EVENT Pele was
relocated] to the intensive care unit because a kidney
dialysis machine he needed was in ICU.
Event Extraction
Named Entities
Named Entity Recognition (NER)
Sample Input for NER
( (S
(NP-SBJ-1
(NP (NNP Rudolph) (NNP Agnew) )
(, ,)
(UCP
(ADJP
(NP (CD 55) (NNS years) )
(JJ old) )
(CC and)
(NP
(NP (JJ former) (NN chairman) )
(PP (IN of)
(NP (NNP Consolidated) (NNP Gold) (NNP Fields) (NNP PLC) ))))
(, ,) )
(VP (VBD was)
(VP (VBN named)
(S
(NP-SBJ (-NONE- *-1) )
(NP-PRD
(NP (DT a) (JJ nonexecutive) (NN director) )
(PP (IN of)
(NP (DT this) (JJ British) (JJ industrial) (NN conglomerate) ))))))
(. .) ))
Sample Output for NER (IOB format)
file_id sent_id word_id iob_inner pos word
0002 1 0 B-PER NNP Rudolph
0002 1 1 I-PER NNP Agnew
0002 1 2 O COMMA COMMA
0002 1 3 B-NP CD 55
0002 1 4 I-NP NNS years
0002 1 5 B-ADJP JJ old
0002 1 6 O CC and
0002 1 7 B-NP JJ former
0002 1 8 I-NP NN chairman
0002 1 9 B-PP IN of
0002 1 10 B-ORG NNP Consolidated
0002 1 11 I-ORG NNP Gold
0002 1 12 I-ORG NNP Fields
0002 1 13 I-ORG NNP PLC
0002 1 14 O COMMA COMMA
0002 1 15 B-VP VBD was
0002 1 16 I-VP VBN named
0002 1 17 B-NP DT a
0002 1 18 I-NP JJ nonexecutive
0002 1 19 I-NP NN director
0002 1 20 B-PP IN of
0002 1 21 B-NP DT this
0002 1 22 I-NP JJ British
0002 1 23 I-NP JJ industrial
0002 1 24 I-NP NN conglomerate
0002 1 25 O . .
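
A common post-processing step is to collapse IOB tags into typed spans; a minimal sketch:

def iob_to_spans(tags):
    # returns (entity type, start index, end index) triples
    spans, start, etype = [], 0, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        starts_new = tag.startswith("B-") or tag == "O" or \
                     (tag.startswith("I-") and etype != tag[2:])
        if starts_new:
            if etype is not None:
                spans.append((etype, start, i))
            etype = tag[2:] if tag != "O" else None
            start = i
    return spans

tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(iob_to_spans(tags))   # [('PER', 0, 2), ('ORG', 3, 7)]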
NER Demos

• http://nlp.stanford.edu:8080/ner/
• http://cogcomp.org/page/demo_view/ner
• http://demo.allennlp.org/named-entity-recognition
NER Extraction Features
Feature Encoding in NER
NER as Sequence Labeling
• Many NLP problems can be cast as sequence labeling problems
– POS – part of speech tagging
– NER – named entity recognition
– SRL – semantic role labeling
• Input
– Sequence w1 w2 w3 …
• Output
– Labeled words
• Classification methods
– Can use the categories of the previous tokens as features in classifying the
next one
– Direction matters
NER as Sequence Labeling
Temporal Expressions
Temporal Lexical Triggers
TempEx Example
TimeML
TimeBank
Biomedical example

• Gene labeling
• Sentence:
– [GENE BRCA1] and [GENE BRCA2] are human genes
that produce tumor suppressor proteins
Other Examples
• Job announcements
– Location, title, starting date, qualifications, salary
• Seminar announcements
– Time, title, location, speaker
• Medical papers
– Drug, disease, gene/protein, cell line, species, substance
Filling the Templates

• Some fields get filled by text from the document


– E.g., the names of people
• Others can be pre-defined values
– E.g., successful/unsuccessful merger
• Some fields allow for multiple values
Evaluating Template-Based NER

• For each test document


– Number of correct template extractions
– Number of slot/value pairs extracted
– Number of extracted slot/value pairs that are correct
NLP
Introduction to NLP

233.
Relation Extraction
Relation Extraction
• Person-person
– ParentOf, MarriedTo, Manages
• Person-organization
– WorksFor
• Organization-organization
– IsPartOf
• Organization-location
– IsHeadquarteredAt
Relation Extraction
• Core NLP task
– Used for building knowledge bases, question answering
• Input
– Mazda North American Operations is headquartered in Irvine,
Calif., and oversees the sales, marketing, parts and customer
service support of Mazda vehicles in the United States and
Mexico through nearly 700 dealers.
• Output
– IsHeadquarteredIn (Mazda North American Operations, Irvine)
Relation extraction

• Using patterns
– Regular expressions
– Gazetteers
• Supervised learning
• Semi-supervised learning
– Using seeds
Relation Extraction
The ACE Evaluation

• Newspaper data
• Entities:
– Person, Organization, Facility, Location, Geopolitical Entity
• Relations:
– Role, Part, Located, Near, Social
The ACE Evaluation
Semantic Relations
Extracting IS-A Relations
• Hearst’s patterns
– X and other Y
– X or other Y
– Y such as X
– Y, including X
– Y, especially X
• Example
– Evolutionary relationships between the platypus and other
mammals
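
As an illustration, one Hearst pattern ("X and other Y") written as a plain regex; real extractors typically match over POS-tagged or parsed text:

import re

pattern = re.compile(r"the (\w+) and other (\w+)")
text = "Evolutionary relationships between the platypus and other mammals"
for x, y in pattern.findall(text):
    print(f"IS-A({x}, {y})")   # IS-A(platypus, mammals)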
Hypernym Extraction (Hearst)
Supervised Relation Extraction

• Look for sentences that have two entities that we know
are part of the target relation
• Look at the other words in the sentence, especially the
ones between the two entities
• Use a classifier to determine whether the relation
exists
Semi-supervised Relation Extraction

• Start with some seeds, e.g.,


– Beethoven was born in December 1770 in Bonn
• Look for other sentences with the same words
• Look for expressions that appear nearby
• Look for other sentences with the same expressions
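
A minimal runnable sketch of this loop for a BornIn(person, place) relation; the two-sentence corpus and the crude generalization step (capitalized words and numbers become wildcards) are toy assumptions:

import re

corpus = [
    "Beethoven was born in December 1770 in Bonn",
    "Mozart was born in January 1756 in Salzburg",
]
pairs = {("Beethoven", "Bonn")}
patterns = set()

# learn expressions from sentences that contain a known seed pair
for sent in corpus:
    for person, place in set(pairs):
        if person in sent and sent.endswith(place):
            middle = sent[len(person):sent.rindex(place)]
            # generalize capitalized words and numbers to wildcards
            patterns.add(re.sub(r"[A-Z]\w*|\d+", r"\\w+", middle))

# apply the learned expressions to find new pairs
for sent in corpus:
    for pat in patterns:
        m = re.fullmatch(r"(\w+)" + pat + r"(\w+)", sent)
        if m:
            pairs.add((m.group(1), m.group(2)))

print(pairs)   # adds ('Mozart', 'Salzburg')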
Bootstrapping
Evaluating Relation Extraction
• Precision P
– correctly extracted relations/all extracted relations
• Recall R
– correctly extracted relations/all existing relations
• F1 measure
– F1 = 2PR/(P+R)
• If there is no annotated data
– only measure precision
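
A minimal sketch of these measures over sets of (arg1, relation, arg2) triples; the second extracted triple is a hypothetical false positive:

def prf(extracted, gold):
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

extracted = {("Mazda North American Operations", "IsHeadquarteredIn", "Irvine"),
             ("Mazda North American Operations", "IsHeadquarteredIn", "Calif.")}
gold = {("Mazda North American Operations", "IsHeadquarteredIn", "Irvine")}
print(prf(extracted, gold))   # (0.5, 1.0, 0.666...)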