Sequence Model: Hidden Markov Models
…
Part-of-Speech Tagging
What is Part of Speech?
The part of speech explains how a word is used in a sentence:
nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, …
How does POS Tagging work?
Source: https://spacy.io/usage/processing-pipelines
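As an illustration, a minimal sketch of POS tagging with spaCy's processing pipeline, assuming the pretrained en_core_web_sm model has been downloaded (the example sentence is arbitrary):

```python
import spacy

# Load a pretrained English pipeline (assumes: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog")
for token in doc:
    # token.pos_ is the coarse universal POS tag, token.tag_ the fine-grained tag
    print(token.text, token.pos_, token.tag_)
```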
Outline
Transition probabilities
$a_{ij} \equiv P(q_{t+1} = S_j \mid q_t = S_i)$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
Initial probabilities
$\pi_i \equiv P(q_1 = S_i)$, with $\sum_{i=1}^{N} \pi_i = 1$
Markov random processes
Markov property
$P(q_i \mid q_{i-1}, \ldots, q_1) = P(q_i \mid q_{i-1})$ for $i > 1$
$P(q_t, q_{t-1}, \ldots, q_1) = P(q_1) \prod_{i=2}^{t} P(q_i \mid q_{i-1}) = P(q_1)\, P(q_2 \mid q_1) \cdots P(q_t \mid q_{t-1})$
Stochastic Automaton
$P(O = Q = (q_1 q_2 q_3) \mid A, \pi) = P(q_1) \prod_{t=2}^{3} P(q_t \mid q_{t-1}) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3}$
Example: Balls and Urns
A Markov process with a non-hidden observation process: a stochastic automaton.
Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green
$\pi = (0.5,\ 0.2,\ 0.3)^T$
$A = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ \cdots & \cdots & \cdots \end{pmatrix}$
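A small Python sketch of this stochastic automaton; the third row of A is not given above, so [0.1, 0.1, 0.8] below is only an assumed placeholder that makes the row sum to 1:

```python
import numpy as np

# Initial distribution and transition matrix from the urns example
pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])   # third row is an assumption, not from the slides

def sequence_probability(states, pi, A):
    """P(q1..qT) = pi[q1] * prod_t a[q_{t-1}, q_t] for a fully observed state sequence."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

# Colors map directly to states: S1 = red (0), S2 = blue (1), S3 = green (2)
print(sequence_probability([0, 0, 2], pi, A))  # P(red, red, green) = 0.5 * 0.4 * 0.3 = 0.06
```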
Problem 1: Probability of an Observation Sequence
What is $P(O \mid \lambda)$?
The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive. Given T observations and N states, there are $N^T$ possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and to Problem 2 is to use dynamic programming.
Examples
Source: https://web.stanford.edu/~jurafsky/slp3/A.pdf
Example (cont.)
$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$
Forward Algorithm
Induction:
$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad 2 \le t \le T,\ 1 \le j \le N$
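A minimal NumPy sketch of the forward pass under these definitions. Only the induction step is shown above; the initialization $\alpha_1(i) = \pi_i\, b_i(o_1)$ and termination $P(O \mid \lambda) = \sum_i \alpha_T(i)$ used below are the standard ones. Observations are assumed to be integer symbol indices, A is the N×N transition matrix, and B[i, k] = $b_i(v_k)$:

```python
import numpy as np

def forward(obs, pi, A, B):
    """alpha[t, j] = P(o_1..o_t, q_t = S_j | lambda); returns alpha and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]           # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):
        # induction: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()          # termination: P(O | lambda) = sum_i alpha_T(i)
```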
Backward Algorithm
Initialization: $\beta_T(i) = 1, \qquad 1 \le i \le N$
Induction:
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1,\ 1 \le i \le N$
Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
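A matching sketch of the backward pass, following the initialization, induction, and termination above (same array conventions as the forward sketch; the returned probability is only a consistency check against the forward result):

```python
import numpy as np

def backward(obs, pi, A, B):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = S_i, lambda); returns beta and P(O | lambda)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                  # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()   # termination: P(O | lambda)
```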
Problem 2: Decoding
The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.
For Problem 2, we want to find the single path with the highest probability.
We want to find the state sequence $Q = q_1 \ldots q_T$ such that
$Q = \operatorname*{argmax}_{Q'} P(Q' \mid O, \lambda)$
Viterbi Algorithm
$\delta_t(j) = \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\, b_j(o_t)$
$\psi_t(j) = \operatorname*{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \qquad 2 \le t \le T,\ 1 \le j \le N$
Termination: $p^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \operatorname*{argmax}_{1 \le i \le N} \delta_T(i)$
$v_3(2) = 0.012544$
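A compact sketch of the Viterbi recursion and backtrace under the same conventions; the initialization $\delta_1(i) = \pi_i\, b_i(o_1)$ is the standard step assumed here:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most likely state sequence Q and its probability p*."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta_t(j): best path probability ending in j at time t
    psi = np.zeros((T, N), dtype=int)   # psi_t(j): backpointer to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # termination and backtrace
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```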
Problem 3: Learning
Parameter Re-estimation
Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm.
Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters.
Parameter Re-estimation
Three parameters need to be re-estimated:
Initial state distribution: $\pi_i$
Transition probabilities: $a_{ij}$
Emission probabilities: $b_i(k)$
Re-estimating Transition Probabilities
What’s the probability of being in state $s_i$ at time t and going to state $s_j$, given the current model and parameters?
$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$
Re-estimating Transition Probabilities
$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$
Re-estimating Transition Probabilities
Formally:
$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i,j')}$
Re-estimating Transition Probabilities
Defining $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$, the estimate can be written as
$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
Review of Probabilities
Forward probability: $\alpha_t(i)$
The probability of the partial observation $o_1, \ldots, o_t$, ending in state $s_i$ at time t
Backward probability: $\beta_t(i)$
The probability of the partial observation $o_{t+1}, \ldots, o_T$, given state $s_i$ at time t
Transition probability: $\xi_t(i,j)$
The probability of going from state $s_i$ to state $s_j$ at time t, given the complete observation $o_1, \ldots, o_T$
State probability: $\gamma_t(i)$
The probability of being in state $s_i$ at time t, given the complete observation $o_1, \ldots, o_T$
Re-estimating Initial State Probabilities
Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state
Re-estimation is easy:
$\hat{\pi}_i$ = expected number of times in state $s_i$ at time 1
Formally: $\hat{\pi}_i = \gamma_1(i)$
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$
Formally:
$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
where $\delta(o_t, v_k) = 1$ if $o_t = v_k$ and 0 otherwise.
The three re-estimation formulas together:
$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat{\pi}_i = \gamma_1(i)$
The inner loop of the forward-backward algorithm
Given an input sequence and $\lambda = (S, A, B, \pi)$:
1. Calculate forward probabilities:
   Base case: $\alpha_i(1) = \pi_i$
   Recursive case: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_j(o_t)$
2. Calculate backward probabilities:
   Base case: $\beta_i(T+1) = 1$
   Recursive case: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_j(o_t)\, \beta_j(t+1)$
3. Calculate expected counts:
   $\xi_t(i,j) = \dfrac{\alpha_i(t)\, a_{ij}\, b_j(o_t)\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$
4. Update the parameters:
   $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T} \xi_t(i,j)}{\sum_{t=1}^{T} \sum_{j'=1}^{N} \xi_t(i,j')}$
   $\hat{b}_j(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \sum_{i=1}^{N} \xi_t(i,j)}{\sum_{t=1}^{T} \sum_{i=1}^{N} \xi_t(i,j)}$
   $\hat{\pi}_i = \sum_{j=1}^{N} \xi_1(i,j)$
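A sketch of one re-estimation iteration in NumPy, reusing the forward and backward sketches above. It follows the indexing used in Problems 1 and 2 ($\alpha_1(i) = \pi_i b_i(o_1)$, $\beta_T(i) = 1$) rather than the shifted T+1 indexing of this slide, assumes observations are integer symbol indices, and omits the scaling normally used to avoid underflow on long sequences:

```python
import numpy as np

def baum_welch_step(obs, pi, A, B):
    """One forward-backward (E-step + M-step) iteration; returns re-estimated (pi, A, B)."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    alpha, prob = forward(obs, pi, A, B)    # from the forward sketch above
    beta, _ = backward(obs, pi, A, B)       # from the backward sketch above

    # E-step: expected counts xi_t(i,j) and state probabilities gamma_t(i)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / prob
    gamma = alpha * beta / prob

    # M-step: re-estimate the parameters
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```

Iterating baum_welch_step until $P(O \mid \lambda)$ stops improving gives the hill-climbing behaviour described in Problem 3.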
Iterations
Each iteration provides values for all the parameters.
The new model always improves the likelihood of the training data:
$P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$