Hidden Markov and Graphical Models presentation

Introduction to Hidden Markov Models

• Set of states: $\{s_1, s_2, \ldots, s_N\}$
• The process moves from one state to another, generating a
sequence of states: $s_{i_1}, s_{i_2}, \ldots, s_{i_k}, \ldots$
• Markov chain property: the probability of each subsequent state
depends only on the previous state:
  $P(s_{i_k} \mid s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}})$
• To define a Markov model, the following probabilities have to be
specified: transition probabilities $a_{ij} = P(s_j \mid s_i)$ (the
probability of moving from state $s_i$ to state $s_j$) and initial
probabilities $\pi_i = P(s_i)$.
Markov Models

[Figure: state transition diagram for the two states ‘Rain’ and ‘Dry’, with arrows labeled 0.3, 0.7, 0.2 and 0.8.]
• Two states : ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3 ,
P(‘Dry’|‘Rain’)=0.7 , P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .
Example of Markov Model

• By the Markov chain property, the probability of a state sequence can be
found by the formula:
  $P(s_{i_1}, s_{i_2}, \ldots, s_{i_k}) = P(s_{i_k} \mid s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}})$
  $= P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = \ldots$
  $= P(s_{i_k} \mid s_{i_{k-1}}) \, P(s_{i_{k-1}} \mid s_{i_{k-2}}) \cdots P(s_{i_2} \mid s_{i_1}) \, P(s_{i_1})$
• Suppose we want to calculate the probability of a sequence of
states in our example, {‘Dry’,‘Dry’,‘Rain’,‘Rain’}.
P({‘Dry’,‘Dry’,‘Rain’,‘Rain’}) =
P(‘Rain’|‘Rain’) P(‘Rain’|‘Dry’) P(‘Dry’|‘Dry’) P(‘Dry’) =
= 0.3*0.2*0.8*0.6 = 0.0288
Calculation of sequence probability
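The chain-rule calculation above can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the function name `sequence_probability` are illustrative choices, not part of the slides.

```python
# Markov chain from the 'Rain'/'Dry' example.
# trans[prev][nxt] = P(nxt | prev); initial[s] = P(first state is s).
initial = {"Rain": 0.4, "Dry": 0.6}
trans = {
    "Rain": {"Rain": 0.3, "Dry": 0.7},
    "Dry": {"Rain": 0.2, "Dry": 0.8},
}


def sequence_probability(states):
    """Initial probability times the product of one-step transitions."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= trans[prev][nxt]
    return prob


p = sequence_probability(["Dry", "Dry", "Rain", "Rain"])
print(p)  # approximately 0.0288 (0.6 * 0.8 * 0.2 * 0.3)
```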


















Hidden Markov models.
• Set of states: $\{s_1, s_2, \ldots, s_N\}$
• The process moves from one state to another, generating a
sequence of states: $s_{i_1}, s_{i_2}, \ldots, s_{i_k}, \ldots$
• Markov chain property: the probability of each subsequent state
depends only on the previous state:
  $P(s_{i_k} \mid s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}) = P(s_{i_k} \mid s_{i_{k-1}})$
• States are not visible, but each state randomly generates one of
M observations (or visible states): $\{v_1, v_2, \ldots, v_M\}$
• To define a hidden Markov model, the following probabilities
have to be specified: the matrix of transition probabilities $A = (a_{ij})$,
$a_{ij} = P(s_j \mid s_i)$; the matrix of observation probabilities $B = (b_i(v_m))$,
$b_i(v_m) = P(v_m \mid s_i)$; and the vector of initial probabilities $\pi = (\pi_i)$,
$\pi_i = P(s_i)$. The model is represented by $M = (A, B, \pi)$.

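[Figure: state diagram of the example HMM: hidden states ‘Low’ and ‘High’ with transition probabilities 0.3, 0.7, 0.2 and 0.8, each emitting ‘Dry’ or ‘Rain’ with probabilities 0.6 and 0.4.]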
Example of Hidden Markov Model

• Two states : ‘Low’ and ‘High’ atmospheric pressure.
• Two observations : ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Low’|‘Low’)=0.3 ,
P(‘High’|‘Low’)=0.7 , P(‘Low’|‘High’)=0.2,
P(‘High’|‘High’)=0.8
• Observation probabilities: P(‘Rain’|‘Low’)=0.6 ,
P(‘Dry’|‘Low’)=0.4 , P(‘Rain’|‘High’)=0.4 ,
P(‘Dry’|‘High’)=0.6 .
• Initial probabilities: say P(‘Low’)=0.4 , P(‘High’)=0.6 .
Example of Hidden Markov Model
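To make the generative view concrete, here is a minimal sketch that samples hidden states and observations from the example model. The function name `sample` and the dictionary layout are illustrative; P(‘Dry’|‘High’) is taken as 0.6 here, an assumption that makes the observation probabilities of ‘High’ sum to 1.

```python
import random

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
# trans[prev][nxt] = P(nxt | prev)
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
# emit[state][obs] = P(obs | state); the 0.6 for ('High', 'Dry') is an assumption
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def sample(length, rng):
    """Draw a hidden state path and the observations it emits."""
    hidden, observed = [], []
    state = rng.choices(states, weights=[initial[s] for s in states])[0]
    for _ in range(length):
        hidden.append(state)
        symbols = list(emit[state])
        weights = [emit[state][v] for v in symbols]
        observed.append(rng.choices(symbols, weights=weights)[0])
        # Move to the next hidden state according to the transition row.
        state = rng.choices(states, weights=[trans[state][s] for s in states])[0]
    return hidden, observed


hidden, observed = sample(5, random.Random(0))
print(hidden)    # the part an observer never sees
print(observed)  # the part an observer sees
```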

• Suppose we want to calculate the probability of a sequence of
observations in our example, {‘Dry’,‘Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,‘Rain’}) = P({‘Dry’,‘Rain’} , {‘Low’,‘Low’}) +
P({‘Dry’,‘Rain’} , {‘Low’,‘High’}) + P({‘Dry’,‘Rain’} ,
{‘High’,‘Low’}) + P({‘Dry’,‘Rain’} , {‘High’,‘High’})
where the first term is:
P({‘Dry’,‘Rain’} , {‘Low’,‘Low’}) =
P({‘Dry’,‘Rain’} | {‘Low’,‘Low’}) P({‘Low’,‘Low’}) =
P(‘Dry’|‘Low’) P(‘Rain’|‘Low’) P(‘Low’) P(‘Low’|‘Low’)
= 0.4*0.6*0.4*0.3 = 0.0288
Calculation of observation sequence probability
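The sum over all hidden state sequences can be written out directly. The sketch below enumerates every path with `itertools.product`; the function names are illustrative, and P(‘Dry’|‘High’) is taken as 0.6 (an assumption, so that the observation probabilities of ‘High’ sum to 1).

```python
from itertools import product

states = ["Low", "High"]
initial = {"Low": 0.4, "High": 0.6}
trans = {"Low": {"Low": 0.3, "High": 0.7}, "High": {"Low": 0.2, "High": 0.8}}
# emit[state][obs] = P(obs | state); the 0.6 for ('High', 'Dry') is an assumption
emit = {"Low": {"Rain": 0.6, "Dry": 0.4}, "High": {"Rain": 0.4, "Dry": 0.6}}


def joint_probability(obs, path):
    """P(observations, hidden path) for one specific hidden path."""
    prob = initial[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        prob *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return prob


def observation_probability(obs):
    """Sum the joint probability over all N**K hidden paths."""
    return sum(joint_probability(obs, path)
               for path in product(states, repeat=len(obs)))


first_term = joint_probability(["Dry", "Rain"], ["Low", "Low"])
total = observation_probability(["Dry", "Rain"])
print(first_term)  # approximately 0.0288
print(total)       # approximately 0.232 under the assumption above
```

Enumeration is fine for two observations, but the number of paths grows as N^K, which is why longer sequences need a more efficient method.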

• Evaluation problem. Given the HMM $M = (A, B, \pi)$ and the
observation sequence $O = o_1 o_2 \ldots o_K$, calculate the probability that
model M has generated sequence O.
• Decoding problem. Given the HMM $M = (A, B, \pi)$ and the
observation sequence $O = o_1 o_2 \ldots o_K$, calculate the most likely
sequence of hidden states $s_i$ that produced this observation sequence
O.
• Learning problem. Given some training observation sequences
$O = o_1 o_2 \ldots o_K$ and the general structure of the HMM (numbers of hidden
and visible states), determine the HMM parameters $M = (A, B, \pi)$
that best fit the training data.
$O = o_1 \ldots o_K$ denotes a sequence of observations $o_k \in \{v_1, \ldots, v_M\}$.
Main issues using HMMs :

• Typed word recognition, assume all characters are separated.
• Character recognizer outputs probability of the image being
particular character, P(image|character).
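[Figure: a character image (the observation) with recognizer scores for candidate hidden states a, b, c, ..., z; the scores shown are 0.5, 0.03, 0.005 and 0.31.]
Word recognition example(1).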

• Hidden states of HMM = characters.
• Observations = typed images of characters segmented from the
word image, denoted $v$. Note that there is an infinite number of
observations.
• Observation probabilities = character recognizer scores:
  $B = \{b_i(v)\} = \{P(v \mid s_i)\}$
• Transition probabilities will be defined differently in the two
subsequent models.


• If a lexicon is given, we can construct a separate HMM
for each lexicon word.
[Figure: one HMM per lexicon word, with one hidden state per character: Amherst = a m h e r s t, Buffalo = b u f f a l o. The figure also shows the scores 0.5, 0.03, 0.4 and 0.6.]
• Here recognition of a word image is equivalent to the problem
of evaluating a few HMMs.
• This is an application of the Evaluation problem.

• We can construct a single HMM for all words.
• Hidden states = all characters in the alphabet.
• Transition probabilities and initial probabilities are calculated
from a language model.
• Observations and observation probabilities are as before.
[Figure: a single HMM whose hidden states are characters (a, m, h, e, r, s, t, b, v, f, o are shown), connected by transitions.]
• Here we have to determine the best sequence of hidden states,
the one that most likely produced the word image.
• This is an application of the Decoding problem.

• The structure of hidden states is chosen.
• Observations are feature vectors extracted from vertical slices.
• Probabilistic mapping from hidden state to feature vectors:
1. Use a mixture of Gaussian models.
2. Quantize the feature vector space.
Character recognition with HMM example.

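• The structure of hidden states:
[Figure: chain of three hidden states s1, s2, s3]
• Observation = number of islands in the vertical slice.
• HMM for character ‘A’:
Transition probabilities: $\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$
Observation probabilities: $\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ .1 & .8 & .1 \\ .9 & .1 & 0 \end{pmatrix}$
• HMM for character ‘B’:
Transition probabilities: $\{a_{ij}\} = \begin{pmatrix} .8 & .2 & 0 \\ 0 & .8 & .2 \\ 0 & 0 & 1 \end{pmatrix}$
Observation probabilities: $\{b_{jk}\} = \begin{pmatrix} .9 & .1 & 0 \\ 0 & .2 & .8 \\ .6 & .4 & 0 \end{pmatrix}$
In $\{b_{jk}\}$, row j is hidden state $s_j$ and column k is the observation of k islands (k = 1, 2, 3).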
Exercise: character recognition with HMM(1)

• Suppose that after character image segmentation the following
sequence of island numbers in 4 slices was observed:
{ 1, 3, 2, 1 }
• Which HMM is more likely to generate this observation
sequence, the HMM for ‘A’ or the HMM for ‘B’?

Consider the likelihood of generating the given observation sequence for each
possible sequence of hidden states:
• HMM for character ‘A’:
Hidden state sequence    Transition probabilities    Observation probabilities
s1 → s1 → s2 → s3        .8 × .2 × .2              × .9 × 0 × .8 × .9 = 0
s1 → s2 → s2 → s3        .2 × .8 × .2              × .9 × .1 × .8 × .9 = 0.0020736
s1 → s2 → s3 → s3        .2 × .2 × 1               × .9 × .1 × .1 × .9 = 0.000324
Total = 0.0023976
• HMM for character ‘B’:
Hidden state sequence    Transition probabilities    Observation probabilities
s1 → s1 → s2 → s3        .8 × .2 × .2              × .9 × 0 × .2 × .6 = 0
s1 → s2 → s2 → s3        .2 × .8 × .2              × .9 × .8 × .2 × .6 = 0.0027648
s1 → s2 → s3 → s3        .2 × .2 × 1               × .9 × .8 × .4 × .6 = 0.006912
Total = 0.0096768
Since 0.0096768 > 0.0023976, the HMM for ‘B’ is more likely to have generated the sequence.
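The two totals can be reproduced by brute-force enumeration. This sketch assumes, as the listed paths suggest, that every path starts in s1 and ends in s3; the function name `likelihood` and the `start`/`end` parameters are illustrative.

```python
from itertools import product

# Shared left-to-right transition matrix: A[i][j] = P(s_{j+1} | s_{i+1}).
A = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
# Observation matrices: row = state, column = number of islands minus 1.
B_A = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]
B_B = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.6, 0.4, 0.0]]
obs = [1, 3, 2, 1]  # island counts in the 4 slices


def likelihood(trans, emit, obs, start=0, end=2):
    """Sum over all paths that start in `start` and end in `end` (assumption)."""
    n = len(trans)
    total = 0.0
    for middle in product(range(n), repeat=len(obs) - 2):
        path = (start, *middle, end)
        p = emit[path[0]][obs[0] - 1]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t] - 1]
        total += p
    return total


lik_A = likelihood(A, B_A, obs)
lik_B = likelihood(A, B_B, obs)
print(lik_A, lik_B)  # approximately 0.0023976 and 0.0096768
print("B" if lik_B > lik_A else "A")
```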

• Evaluation problem. Given the HMM $M = (A, B, \pi)$ and the
observation sequence $O = o_1 o_2 \ldots o_K$, calculate the probability that
model M has generated sequence O.
• Finding the probability of observations $O = o_1 o_2 \ldots o_K$ by
considering all hidden state sequences (as was done in the
example) is impractical: there are $N^K$ hidden state sequences,
which means exponential complexity.
• Use the Forward-Backward HMM algorithms for efficient
calculations.
• Define the forward variable $\alpha_k(i)$ as the joint probability of the
partial observation sequence $o_1 o_2 \ldots o_k$ and that the hidden state at
time k is $s_i$: $\alpha_k(i) = P(o_1 o_2 \ldots o_k, q_k = s_i)$
Evaluation Problem.

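[Figure: trellis with the states s1, s2, ..., si, ..., sN drawn at each time step 1, ..., k, k+1, ..., K; transitions a1j, a2j, aij, aNj lead from the states at time k into state sj at time k+1; the observations o1, ..., ok, ok+1, ..., oK run along the bottom.]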
Trellis representation of an HMM

• Initialization:
  $\alpha_1(i) = P(o_1, q_1 = s_i) = \pi_i \, b_i(o_1)$, $1 \le i \le N$.
• Forward recursion:
  $\alpha_{k+1}(j) = P(o_1 o_2 \ldots o_{k+1}, q_{k+1} = s_j)$
  $= \sum_i P(o_1 o_2 \ldots o_{k+1}, q_k = s_i, q_{k+1} = s_j)$
  $= \sum_i P(o_1 o_2 \ldots o_k, q_k = s_i) \, a_{ij} \, b_j(o_{k+1})$
  $= \left[ \sum_i \alpha_k(i) \, a_{ij} \right] b_j(o_{k+1})$, $1 \le j \le N$, $1 \le k \le K-1$.
• Termination:
  $P(o_1 o_2 \ldots o_K) = \sum_i P(o_1 o_2 \ldots o_K, q_K = s_i) = \sum_i \alpha_K(i)$
• Complexity: $N^2 K$ operations.
Forward recursion for HMM
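A minimal pure-Python sketch of the forward recursion follows. It is illustrative only: states and observations are encoded as list indices, the function name `forward` is an arbitrary choice, and the example parameters assume P(‘Dry’|‘High’) = 0.6 so that each row of B sums to 1. No scaling is applied, so very long sequences would underflow.

```python
def forward(pi, A, B, obs):
    """Return (alpha, P(O)); alpha[k][i] holds alpha_{k+1}(i) (0-based k)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Recursion: alpha_{k+1}(j) = [sum_i alpha_k(i) a_ij] * b_j(o_{k+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    # Termination: P(O) = sum_i alpha_K(i)
    return alpha, sum(alpha[-1])


# 'Low'/'High' example: states 0='Low', 1='High'; observations 0='Rain', 1='Dry'.
pi = [0.4, 0.6]
A = [[0.3, 0.7], [0.2, 0.8]]   # A[i][j] = P(next = j | current = i)
B = [[0.6, 0.4], [0.4, 0.6]]   # B[i][v] = P(v | i); B[1][1] = 0.6 is an assumption
alpha, prob = forward(pi, A, B, [1, 0])  # observations 'Dry', 'Rain'
print(prob)  # approximately 0.232
```

Each time step does N sums of N terms, which gives the N²K operation count stated above.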

• Define the backward variable $\beta_k(i)$ as the conditional probability of the
partial observation sequence $o_{k+1} o_{k+2} \ldots o_K$ given that the hidden
state at time k is $s_i$: $\beta_k(i) = P(o_{k+1} o_{k+2} \ldots o_K \mid q_k = s_i)$
• Initialization:
  $\beta_K(i) = 1$, $1 \le i \le N$.
• Backward recursion:
  $\beta_k(j) = P(o_{k+1} o_{k+2} \ldots o_K \mid q_k = s_j)$
  $= \sum_i P(o_{k+1} o_{k+2} \ldots o_K, q_{k+1} = s_i \mid q_k = s_j)$
  $= \sum_i P(o_{k+2} o_{k+3} \ldots o_K \mid q_{k+1} = s_i) \, a_{ji} \, b_i(o_{k+1})$
  $= \sum_i \beta_{k+1}(i) \, a_{ji} \, b_i(o_{k+1})$, $1 \le j \le N$, $1 \le k \le K-1$.
• Termination:
  $P(o_1 o_2 \ldots o_K) = \sum_i P(o_1 o_2 \ldots o_K, q_1 = s_i)$
  $= \sum_i P(o_1 o_2 \ldots o_K \mid q_1 = s_i) \, P(q_1 = s_i) = \sum_i \beta_1(i) \, b_i(o_1) \, \pi_i$
Backward recursion for HMM
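The backward recursion can be sketched in the same style. Again this is illustrative: the function name `backward` and the index encoding are assumptions, and the example uses P(‘Dry’|‘High’) = 0.6 so that each row of B sums to 1.

```python
def backward(pi, A, B, obs):
    """Return (beta, P(O)); beta[k][i] holds beta_{k+1}(i) (0-based k)."""
    n = len(pi)
    # Initialization: beta_K(i) = 1
    beta = [[1.0] * n]
    # Recursion, from k = K-1 down to 1:
    # beta_k(j) = sum_i beta_{k+1}(i) * a_ji * b_i(o_{k+1})
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[j][i] * B[i][o] * nxt[i] for i in range(n))
                        for j in range(n)])
    # Termination: P(O) = sum_i beta_1(i) * b_i(o_1) * pi_i
    prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, prob


# 'Low'/'High' example: states 0='Low', 1='High'; observations 0='Rain', 1='Dry'.
pi = [0.4, 0.6]
A = [[0.3, 0.7], [0.2, 0.8]]   # A[i][j] = P(next = j | current = i)
B = [[0.6, 0.4], [0.4, 0.6]]   # B[1][1] = 0.6 is an assumption
beta, prob = backward(pi, A, B, [1, 0])  # observations 'Dry', 'Rain'
print(beta[0])  # approximately [0.46, 0.44]
print(prob)     # approximately 0.232, the same value the forward pass gives
```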

• Decoding problem. Given the HMM $M = (A, B, \pi)$ and the
observation sequence $O = o_1 o_2 \ldots o_K$, calculate the most likely
sequence of hidden states $s_i$ that produced this observation
sequence.
• We want to find the state sequence $Q = q_1 \ldots q_K$ which maximizes
$P(Q \mid o_1 o_2 \ldots o_K)$, or equivalently $P(Q, o_1 o_2 \ldots o_K)$.
• Brute-force consideration of all paths takes exponential time. Use
the efficient Viterbi algorithm instead.
• Define the variable $\delta_k(i)$ as the maximum probability of producing
the observation sequence $o_1 o_2 \ldots o_k$ when moving along any hidden
state sequence $q_1 \ldots q_{k-1}$ and getting into $q_k = s_i$:
  $\delta_k(i) = \max P(q_1 \ldots q_{k-1}, q_k = s_i, o_1 o_2 \ldots o_k)$
where the max is taken over all possible paths $q_1 \ldots q_{k-1}$.
Decoding problem

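• General idea:
if the best path ending in $q_k = s_j$ goes through $q_{k-1} = s_i$, then it
should coincide with the best path ending in $q_{k-1} = s_i$.
[Figure: states s1, ..., si, ..., sN at time k-1 connected to state sj at time k by transitions a1j, aij, aNj.]
• $\delta_k(j) = \max P(q_1 \ldots q_{k-1}, q_k = s_j, o_1 o_2 \ldots o_k)$
  $= \max_i \left[ a_{ij} \, b_j(o_k) \, \max P(q_1 \ldots q_{k-1} = s_i, o_1 o_2 \ldots o_{k-1}) \right]$
• To backtrack the best path, keep the information that the predecessor of $s_j$ was $s_i$.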
Viterbi algorithm (1)

• Initialization:
  $\delta_1(i) = \max P(q_1 = s_i, o_1) = \pi_i \, b_i(o_1)$, $1 \le i \le N$.
• Forward recursion:
  $\delta_k(j) = \max P(q_1 \ldots q_{k-1}, q_k = s_j, o_1 o_2 \ldots o_k)$
  $= \max_i \left[ a_{ij} \, b_j(o_k) \, \max P(q_1 \ldots q_{k-1} = s_i, o_1 o_2 \ldots o_{k-1}) \right]$
  $= \max_i \left[ a_{ij} \, b_j(o_k) \, \delta_{k-1}(i) \right]$, $1 \le j \le N$, $2 \le k \le K$.
• Termination: choose the best path ending at time K:
  $\max_i \left[ \delta_K(i) \right]$
• Backtrack the best path.
This algorithm is similar to the forward recursion of the evaluation
problem, with $\sum$ replaced by max and additional backtracking.
Viterbi algorithm (2)
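The recursion and backtracking can be sketched as follows. The function name `viterbi`, the index encoding of states and observations, and the value P(‘Dry’|‘High’) = 0.6 in the example are assumptions made for illustration.

```python
def viterbi(pi, A, B, obs):
    """Return (best hidden path as state indices, its joint probability)."""
    n = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backptr = []  # backptr[k][j] = best predecessor of state j at step k+1
    for o in obs[1:]:
        new_delta, ptr = [], []
        for j in range(n):
            # delta_k(j) = max_i [a_ij * b_j(o_k) * delta_{k-1}(i)]
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backptr.append(ptr)
    # Termination: pick the best final state, then follow the back pointers.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptr):
        path.insert(0, ptr[path[0]])
    return path, delta[last]


# 'Low'/'High' example: states 0='Low', 1='High'; observations 0='Rain', 1='Dry'.
pi = [0.4, 0.6]
A = [[0.3, 0.7], [0.2, 0.8]]   # A[i][j] = P(next = j | current = i)
B = [[0.6, 0.4], [0.4, 0.6]]   # B[1][1] = 0.6 is an assumption
path, best = viterbi(pi, A, B, [1, 0])  # observations 'Dry', 'Rain'
print(path, best)  # [1, 1] ('High', 'High') with probability about 0.1152
```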

• Learning problem. Given some training observation sequences
$O = o_1 o_2 \ldots o_K$ and the general structure of the HMM (numbers of
hidden and visible states), determine the HMM parameters $M = (A, B, \pi)$
that best fit the training data, that is, maximize $P(O \mid M)$.
• No known algorithm is guaranteed to find the globally optimal parameter values.
• Use the iterative expectation-maximization algorithm to find a local
maximum of $P(O \mid M)$: the Baum-Welch algorithm.
Learning problem (1)

• If the training data includes the sequence of hidden states
(as in the word recognition example), use maximum likelihood
estimation of the parameters:
  $a_{ij} = P(s_j \mid s_i) = \frac{\text{number of transitions from state } s_i \text{ to state } s_j}{\text{number of transitions out of state } s_i}$
  $b_i(v_m) = P(v_m \mid s_i) = \frac{\text{number of times observation } v_m \text{ occurs in state } s_i}{\text{number of times in state } s_i}$
Learning problem (2)
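When hidden states are labeled, the estimates are plain count ratios. The sketch below assumes training data given as lists of (state, observation) pairs; the function name `estimate_parameters` and the tiny training sequence are hypothetical and only illustrate the counting.

```python
from collections import Counter, defaultdict


def estimate_parameters(sequences):
    """Count-based estimates of transition (a) and observation (b) probabilities."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in sequences:
        for state, obs in seq:
            emit_counts[state][obs] += 1
        for (s_from, _), (s_to, _) in zip(seq, seq[1:]):
            trans_counts[s_from][s_to] += 1
    # Normalize each row of counts so that it sums to 1.
    a = {s: {t: c / sum(cnt.values()) for t, c in cnt.items()}
         for s, cnt in trans_counts.items()}
    b = {s: {v: c / sum(cnt.values()) for v, c in cnt.items()}
         for s, cnt in emit_counts.items()}
    return a, b


# Hypothetical labeled sequence of (hidden state, observation) pairs.
training = [[("Low", "Rain"), ("Low", "Rain"), ("High", "Dry"),
             ("High", "Dry"), ("High", "Rain")]]
a, b = estimate_parameters(training)
print(a)  # 'Low' -> 'Low' and 'Low' -> 'High' each 0.5; 'High' -> 'High' 1.0
print(b)  # 'High' emits 'Dry' with 2/3 and 'Rain' with 1/3
```

Transitions or observations that never occur in the training data simply get no entry here; in practice some smoothing is often added.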

General idea:
  $a_{ij} = P(s_j \mid s_i) = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions out of state } s_i}$
  $b_i(v_m) = P(v_m \mid s_i) = \frac{\text{expected number of times observation } v_m \text{ occurs in state } s_i}{\text{expected number of times in state } s_i}$
  $\pi_i = P(s_i)$ = expected frequency in state $s_i$ at time k=1.
Baum-Welch algorithm

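• Define the variable $\xi_k(i,j)$ as the probability of being in state $s_i$ at
time k and in state $s_j$ at time k+1, given the observation
sequence $o_1 o_2 \ldots o_K$:
  $\xi_k(i,j) = P(q_k = s_i, q_{k+1} = s_j \mid o_1 o_2 \ldots o_K)$
  $\xi_k(i,j) = \frac{P(q_k = s_i, q_{k+1} = s_j, o_1 o_2 \ldots o_K)}{P(o_1 o_2 \ldots o_K)}$
  $= \frac{P(q_k = s_i, o_1 o_2 \ldots o_k) \, a_{ij} \, b_j(o_{k+1}) \, P(o_{k+2} \ldots o_K \mid q_{k+1} = s_j)}{P(o_1 o_2 \ldots o_K)}$
  $= \frac{\alpha_k(i) \, a_{ij} \, b_j(o_{k+1}) \, \beta_{k+1}(j)}{\sum_i \sum_j \alpha_k(i) \, a_{ij} \, b_j(o_{k+1}) \, \beta_{k+1}(j)}$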
Baum-Welch algorithm: expectation step(1)

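• Define the variable $\gamma_k(i)$ as the probability of being in state $s_i$ at
time k, given the observation sequence $o_1 o_2 \ldots o_K$:
  $\gamma_k(i) = P(q_k = s_i \mid o_1 o_2 \ldots o_K)$
  $\gamma_k(i) = \frac{P(q_k = s_i, o_1 o_2 \ldots o_K)}{P(o_1 o_2 \ldots o_K)} = \frac{\alpha_k(i) \, \beta_k(i)}{\sum_i \alpha_k(i) \, \beta_k(i)}$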

• We calculated $\xi_k(i,j) = P(q_k = s_i, q_{k+1} = s_j \mid o_1 o_2 \ldots o_K)$
and $\gamma_k(i) = P(q_k = s_i \mid o_1 o_2 \ldots o_K)$.
• Expected number of transitions from state $s_i$ to state $s_j$
= $\sum_{k=1}^{K-1} \xi_k(i,j)$
• Expected number of transitions out of state $s_i$ = $\sum_{k=1}^{K-1} \gamma_k(i)$
• Expected number of times observation $v_m$ occurs in state $s_i$
= $\sum_k \gamma_k(i)$, where the sum is over those k such that $o_k = v_m$
• Expected frequency in state $s_i$ at time k=1: $\gamma_1(i)$.

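  $a_{ij} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions out of state } s_i} = \frac{\sum_{k=1}^{K-1} \xi_k(i,j)}{\sum_{k=1}^{K-1} \gamma_k(i)}$
  $b_i(v_m) = \frac{\text{expected number of times observation } v_m \text{ occurs in state } s_i}{\text{expected number of times in state } s_i} = \frac{\sum_{k:\, o_k = v_m} \gamma_k(i)}{\sum_{k=1}^{K} \gamma_k(i)}$
  $\pi_i$ = (expected frequency in state $s_i$ at time k=1) = $\gamma_1(i)$.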
Baum-Welch algorithm: maximization step
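One full expectation-maximization iteration can be sketched in pure Python as below. This is a simplified illustration for a single observation sequence: the function names, the index encoding, the starting parameters (with P(‘Dry’|‘High’) assumed to be 0.6) and the observation sequence are all assumptions, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[j][i] * B[i][o] * nxt[i] for i in range(n))
                        for j in range(n)])
    return beta


def baum_welch_step(pi, A, B, obs):
    """One EM iteration; returns (new_pi, new_A, new_B, likelihood before update)."""
    n, m, K = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # Expectation step: gamma_k(i) and xi_k(i, j)
    gamma = [[alpha[k][i] * beta[k][i] / likelihood for i in range(n)]
             for k in range(K)]
    xi = [[[alpha[k][i] * A[i][j] * B[j][obs[k + 1]] * beta[k + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for k in range(K - 1)]
    # Maximization step: ratios of expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[k][i][j] for k in range(K - 1)) /
              sum(gamma[k][i] for k in range(K - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[k][i] for k in range(K) if obs[k] == v) /
              sum(gamma[k][i] for k in range(K))
              for v in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Starting point: the 'Low'/'High' example (0='Low', 1='High'; 0='Rain', 1='Dry').
pi = [0.4, 0.6]
A = [[0.3, 0.7], [0.2, 0.8]]
B = [[0.6, 0.4], [0.4, 0.6]]      # B[1][1] = 0.6 is an assumption
obs = [1, 0, 0, 1, 1, 0]          # hypothetical training sequence
new_pi, new_A, new_B, old_lik = baum_welch_step(pi, A, B, obs)
new_lik = sum(forward(new_pi, new_A, new_B, obs)[-1])
print(old_lik, new_lik)  # the likelihood should not decrease after the update
```

Repeating `baum_welch_step` until the likelihood stops improving gives a local maximum of P(O | M), which depends on the starting parameters.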
