the root of the tree. The result of the algorithm does not depend on which node is taken as the root (Bishop, 2006). Usually, the first variable or the last one is considered as the root. Then, some messages from the leaf (or leaves) are sent up on the tree to the root (Bishop, 2006). The messages can be initialized to any fixed constant values (see Eq. (8)). The message from the variable node $x_i$ to its neighbor factor node $f_j$ is:

$$m_{x_i \rightarrow f_j} = \prod_{f \in N(x_i) \setminus f_j} m_{f \rightarrow x_i}, \quad (6)$$

where $f \in N(x_i) \setminus f_j$ means all the neighbor factor nodes of the variable $x_i$ except $f_j$. Also, $m_{x_i \rightarrow f_j}$ and $m_{f \rightarrow x_i}$ denote the message from the variable $x_i$ to the factor $f_j$ and the message from the factor $f$ to the variable $x_i$, respectively. The message from the factor node $f_j$ to its neighbor variable $x_i$ is:

$$m_{f_j \rightarrow x_i} = \sum_{\text{values of } x \in N(f_j) \setminus x_i} f_j \prod_{x \in N(f_j) \setminus x_i} m_{x \rightarrow f_j}. \quad (7)$$

Eq. (7) includes both a sum and a product; this is why the procedure is named the sum-product algorithm (Kschischang et al., 2001; Bishop, 2006). In the factor graph, if the variable $x_i$ has degree one, the message is:

$$m_{x_i \rightarrow f_j} = [1, 1, \dots, 1]^\top \in \mathbb{R}^k, \quad (8)$$

where $k$ is the number of possible values that the variables can take. Moreover, if the variable $x_i$ has degree two, connected to $f_j$ and $f_\ell$, the message is:

$$m_{x_i \rightarrow f_j} = m_{f_\ell \rightarrow x_i}. \quad (9)$$

A good example for better understanding of Eqs. (6) and (7) exists in chapter 8 of (Bishop, 2006).
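As a concrete illustration of Eqs. (6)-(8) (not from the original text), the following minimal Python sketch passes sum-product messages through a single pairwise factor connecting two binary variables; the factor table values are made up:

```python
import numpy as np

# A tiny factor graph: x1 -- f -- x2, where both variables take k = 2 values.
# The pairwise factor f is given as a k x k table (made-up values).
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Eq. (8): a degree-one (leaf) variable sends the all-ones message.
m_x1_to_f = np.ones(2)

# Eq. (7): the factor multiplies the incoming message and sums out x1.
m_f_to_x2 = (f * m_x1_to_f[:, None]).sum(axis=0)

# The belief over x2 is the product of its incoming messages (here only one),
# normalized to a probability distribution.
belief_x2 = m_f_to_x2 / m_f_to_x2.sum()
print(belief_x2)  # the marginal of x2 implied by the factor f
```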
For the exact convergence (and inference) of belief propagation, the graph should be a tree, i.e., it should be cycle-free (Kschischang et al., 2001). If the factor graph has cycles, the inference is approximate and the algorithm is stopped manually after a while. Belief propagation in such graphs is called loopy belief propagation (Murphy et al., 1999; Bishop, 2006). Note that Markov chains are cycle-free, so exact belief propagation can be applied to them.

It is also noteworthy that, as every variable which is connected to only two factor nodes merely passes the message through (see Eq. (9)), a new graphical model, named the normal graph (Forney, 2001), was proposed, which states every variable as an edge rather than a node (vertex). More details about the factor graph and the sum-product algorithm can be found in the references (Kschischang et al., 2001; Loeliger, 2004) and chapter 8 of (Bishop, 2006). Also note that some alternatives to the sum-product algorithm are min-product, max-product, and min-sum (Bishop, 2006).

2.3.3. The Max-Product Algorithm

The max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) is similar to the sum-product algorithm, where the summation operator is replaced by the maximum operator. In this algorithm, the messages from the variable nodes to the factor nodes and vice versa are:

$$m_{x_i \rightarrow f_j} = \prod_{f \in N(x_i) \setminus f_j} m_{f \rightarrow x_i}, \quad (10)$$

$$m_{f_j \rightarrow x_i} = \max_{\text{values of } x \in N(f_j) \setminus x_i} f_j \prod_{x \in N(f_j) \setminus x_i} m_{x \rightarrow f_j}. \quad (11)$$
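Under the same assumed toy factor as above, the max-product message of Eq. (11) is obtained by simply swapping the sum for a max:

```python
import numpy as np

f = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # the same made-up pairwise factor
m_x1_to_f = np.ones(2)       # leaf message, Eq. (8)

# Eq. (11): maximize (rather than sum) over the eliminated variable x1.
m_f_to_x2 = (f * m_x1_to_f[:, None]).max(axis=0)
print(m_f_to_x2)  # un-normalized max-marginal of x2
```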
2.3.4. Belief Propagation with Forward-Backward Procedure

In order to learn the beliefs over the variables and factor nodes, belief propagation can be applied using a forward-backward procedure (see chapter 8 in (Bishop, 2006)). The forward-backward algorithm using the sum-product sub-algorithm is shown in Algorithm 1. In the forward-backward procedure, the belief over a random variable $x_i$ is the product of the forward and backward messages.

1 Initialize the messages to $[1, 1, \dots, 1]^\top$
2 for time $t$ from 1 to $\tau$ do
3   Forward pass: Do the sum-product algorithm
4   Backward pass: Do the sum-product algorithm
5   belief = forward message $\times$ backward message
6   if max(belief change) is small then
7     break the loop
8 Return beliefs

Algorithm 1: The forward-backward algorithm using the sum-product sub-algorithm

3. Probabilistic Graphical Models

3.1. Markov and Bayesian Networks

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high-dimensional space (Koller & Friedman, 2009). In other words, a PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. There exist edges between two variables which have interaction with one another in terms of probability. Different conditional probabilities can be represented by a PGM.

There exist two types of PGM, which are the Markov network (also called the Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and the Bayesian network, the edges of the graph are undirected and directed, respectively.
3.2. The Markov Property

Consider a time series of random variables $X_1, X_2, \dots, X_n$. In general, the joint probability of these random variables can be written as:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \,|\, X_1)\, P(X_3 \,|\, X_2, X_1) \dots P(X_n \,|\, X_{n-1}, \dots, X_2, X_1), \quad (12)$$

according to the chain (or multiplication) rule in probability. The first-order Markov property is an assumption which states that, in a time series of random variables $X_1, X_2, \dots, X_n$, every random variable is merely dependent on the latest previous random variable and not the others. In other words:

$$P(X_i \,|\, X_{i-1}, X_{i-2}, \dots, X_2, X_1) = P(X_i \,|\, X_{i-1}). \quad (13)$$

Hence, with the Markov property, the chain rule is simplified to:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \,|\, X_1)\, P(X_3 \,|\, X_2) \dots P(X_n \,|\, X_{n-1}). \quad (14)$$

The Markov property can be of any order. For example, in a second-order Markov property, a random variable is dependent on the latest and one-to-latest variables. Usually, the default Markov property is of order one. A stochastic process which has the Markov property is called a Markovian process (or Markov process).
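As a quick numerical illustration of Eq. (14) (assumed toy numbers, not from the paper), the joint probability of a short first-order chain is the initial probability times the transition probabilities:

```python
import numpy as np

# Made-up two-state chain: initial distribution and transition matrix.
p0 = np.array([0.6, 0.4])      # P(X1)
T = np.array([[0.7, 0.3],
              [0.1, 0.9]])     # T[i, j] = P(X_{t+1} = j | X_t = i)

# Eq. (14): P(X1=0, X2=1, X3=1) = P(X1=0) P(X2=1 | X1=0) P(X3=1 | X2=1)
joint = p0[0] * T[0, 1] * T[1, 1]
print(joint)  # 0.6 * 0.3 * 0.9 = 0.162
```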
3.3. Discrete Time Markov Chain

A Markov chain is a PGM which has the Markov property. The Markov chain can be either directed or undirected. Usually, a Markov chain is a Bayesian network where the edges are directed. It is important not to confuse the Markov chain with the Markov network.

There are two types of Markov chain, which are the Discrete Time Markov Chain (DTMC) (Ross, 2014) and the Continuous Time Markov Chain (CTMC) (Lawler, 2018). As is obvious from their names, in the DTMC and CTMC, the time of transitions from one random variable to another is and is not partitioned into discrete slots, respectively.

If the variables in a DTMC are considered as states, the DTMC can be viewed as a Finite-State Machine (FSM) or a Finite-State Automaton (FSA) (Booth, 1967). Also, note that the DTMC can be viewed as a Sub-Shift of Finite Type in modeling dynamic systems (Brin & Stuck, 2002).

3.4. Hidden Markov Model (HMM)

An HMM is a DTMC which contains a sequence of hidden variables (named states) in addition to a sequence of emitted observation symbols (outputs).

We have an observation sequence of length $\tau$, which is the number of clock times, $t \in \{1, \dots, \tau\}$. Let $n$ and $m$ denote the number of states and observation symbols, respectively. We show the sets of states and possible observation symbols by $S = \{s_1, \dots, s_n\}$ and $O = \{o_1, \dots, o_m\}$, respectively. We show being in state $s_i$ and in observation symbol $o_i$ at time $t$ by $s_i(t)$ and $o_i(t)$, respectively. Let $\mathbb{R}^{n \times n} \ni A = [a_{i,j}]$ be the state Transition Probability Matrix (TPM), where:

$$a_{i,j} := P\big(s_j(t+1) \,|\, s_i(t)\big). \quad (15)$$

We have:

$$\sum_{j=1}^{n} a_{i,j} = 1. \quad (16)$$

The Emission Probability Matrix (EPM) is denoted by $\mathbb{R}^{n \times m} \ni B = [b_{i,j}]$, where:

$$b_{i,j} := P\big(o_j(t) \,|\, s_i(t)\big), \quad (17)$$

which is the probability of emission of the observation symbols from the states. We have:

$$\sum_{j=1}^{m} b_{i,j} = 1. \quad (18)$$

Let the initial state distribution be denoted by the vector $\mathbb{R}^n \ni \pi = [\pi_1, \dots, \pi_n]^\top$, where:

$$\pi_i := P\big(s_i(1)\big), \quad (19)$$

and:

$$\sum_{i=1}^{n} \pi_i = 1, \quad (20)$$

to satisfy the probability properties. An HMM model is denoted by the tuple $\lambda = (\pi, A, B)$.

Assume that a sequence of states is generated by the HMM according to the TPM. We denote this generated sequence of states by $S^g := s^g(1), \dots, s^g(\tau)$ where $s^g(t) \in S, \forall t$. Likewise, a sequence of outputs (observations) is generated by the HMM according to the EPM. We denote this generated sequence of output symbols by $O^g := o^g(1), \dots, o^g(\tau)$ where $o^g(t) \in O, \forall t$.

We denote the probability of transition from state $s^g(t)$ to $s^g(t+1)$ by $a_{s^g(t), s^g(t+1)}$. So:

$$a_{s^g(t), s^g(t+1)} := P\big(s^g(t+1) \,|\, s^g(t)\big). \quad (21)$$

Note that $s^g(t) \in S$ and $s^g(t+1) \in S$. We also denote:

$$\pi_{s^g(1)} := P\big(s^g(1)\big). \quad (22)$$

Likewise, we denote the probability of state $s^g(t)$ emitting the observation $o^g(t)$ by $b_{s^g(t), o^g(t)}$. So:

$$b_{s^g(t), o^g(t)} := P\big(o^g(t) \,|\, s^g(t)\big). \quad (23)$$

Note that $s^g(t) \in S$ and $o^g(t) \in O$. Figure 2 depicts the structure of an HMM model.

Figure 2. A Hidden Markov Model
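The following sketch encodes a toy $\lambda = (\pi, A, B)$ with $n = 2$ states and $m = 3$ symbols and samples a pair of sequences $(S^g, O^g)$ as described above; all names and probability values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy HMM, lambda = (pi, A, B), with n = 2 states and m = 3 symbols.
pi = np.array([0.5, 0.5])           # initial state distribution, Eq. (19)
A = np.array([[0.8, 0.2],           # TPM, each row sums to 1, Eq. (16)
              [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1],      # EPM, each row sums to 1, Eq. (18)
              [0.1, 0.3, 0.6]])

def sample_hmm(pi, A, B, tau):
    """Sample a state sequence S^g and an observation sequence O^g of length tau."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)                    # s^g(1) ~ pi
    for _ in range(tau):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))   # o^g(t) ~ B[s^g(t), :]
        s = rng.choice(len(pi), p=A[s])              # s^g(t+1) ~ A[s^g(t), :]
    return np.array(states), np.array(obs)

S_g, O_g = sample_hmm(pi, A, B, tau=5)
print(S_g, O_g)
```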
4. Likelihood and Expectation Maximization in HMM

The EM algorithm can be used for analysis and training in HMM (Moon, 1996). In the following, we explain the details of EM for HMM.

4.1. Likelihood

According to Fig. 2, the likelihood of occurrence of the state sequence $S^g$ and the observation sequence $O^g$ is (Ghahramani, 2001):

$$L = P(S^g, O^g) = P\big(s^g(1)\big)\, P(\text{next states} \,|\, \text{previous states})\, P(O^g \,|\, S^g) = P\big(s^g(1)\big) \prod_{t=1}^{\tau-1} P\big(s^g(t+1) \,|\, s^g(t)\big) \prod_{t=1}^{\tau} P\big(o^g(t) \,|\, s^g(t)\big) = \pi_{s^g(1)} \prod_{t=1}^{\tau-1} a_{s^g(t), s^g(t+1)} \prod_{t=1}^{\tau} b_{s^g(t), o^g(t)}. \quad (24)$$

The log-likelihood is:

$$\ell = \log(L) = \log \pi_{s^g(1)} + \sum_{t=1}^{\tau-1} \log\big(a_{s^g(t), s^g(t+1)}\big) + \sum_{t=1}^{\tau} \log\big(b_{s^g(t), o^g(t)}\big). \quad (25)$$
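Eqs. (24) and (25) translate directly into code; a sketch reusing the toy parameters from the previous snippet, with made-up example sequences:

```python
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])

def joint_log_likelihood(pi, A, B, states, obs):
    """Eq. (25): log P(S^g, O^g | lambda) for known state/observation sequences."""
    ll = np.log(pi[states[0]])                      # log pi_{s^g(1)}
    ll += np.log(A[states[:-1], states[1:]]).sum()  # sum of log a_{s^g(t), s^g(t+1)}
    ll += np.log(B[states, obs]).sum()              # sum of log b_{s^g(t), o^g(t)}
    return ll

states = np.array([0, 0, 1, 1])   # an example S^g
obs = np.array([0, 1, 2, 2])      # an example O^g
print(joint_log_likelihood(pi, A, B, states, obs))
```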
Let $\mathbb{1}_i$ be a vector with entry one at index $i$, i.e., $\mathbb{1}_i := [0, 0, \dots, 0, 1, 0, \dots, 0]^\top$. Also, $\mathbb{1}_{s^g(1)}$ means the vector with entry one at the index of the first state in the sequence $S^g$. For example, if there are three possible states and a sequence of length three, $s^g(1) = 2$, $s^g(2) = 1$, $s^g(3) = 3$, we have $\mathbb{1}_{s^g(1)} = [0, 1, 0]^\top$.

The terms in this log-likelihood are:

$$\log \pi_{s^g(1)} = \mathbb{1}_{s^g(1)}^\top \log \pi, \quad (26)$$

$$a_{s^g(t-1), s^g(t)} = \prod_{i=1}^{n} \prod_{j=1}^{n} (a_{i,j})^{\mathbb{1}_i[i]\, \mathbb{1}_j[j]} \implies \log\big(a_{s^g(t-1), s^g(t)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}_i[i]\, \mathbb{1}_j[j] \log(a_{i,j}) = \mathbb{1}_{s^g(t-1)}^\top (\log A)\, \mathbb{1}_{s^g(t)}, \quad (27)$$

$$b_{s^g(t), o^g(t)} = \prod_{i=1}^{n} \prod_{j=1}^{m} (b_{i,j})^{\mathbb{1}_i[i]\, \mathbb{1}_j[j]} \implies \log\big(b_{s^g(t), o^g(t)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbb{1}_i[i]\, \mathbb{1}_j[j] \log(b_{i,j}) = \mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}, \quad (28)$$

where $\mathbb{1}_i[i] = \mathbb{1}_j[j] = 1$. Hence, we can write the log-likelihood as:

$$\ell = \mathbb{1}_{s^g(1)}^\top \log \pi + \sum_{t=1}^{\tau-1} \mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)} + \sum_{t=1}^{\tau} \mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}. \quad (29)$$
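To see that the indicator (one-hot) rewriting in Eqs. (26)-(29) is pure bookkeeping, a short check (assumed toy values) that one transition term of Eq. (29) equals the corresponding term of Eq. (25):

```python
import numpy as np

A = np.array([[0.8, 0.2], [0.3, 0.7]])
states = np.array([0, 1, 1])

def one_hot(i, k):
    e = np.zeros(k)
    e[i] = 1.0
    return e

# One transition term of Eq. (29): 1_{s(t)}^T (log A) 1_{s(t+1)} ...
term_onehot = one_hot(states[0], 2) @ np.log(A) @ one_hot(states[1], 2)
# ... equals the corresponding term of Eq. (25): log a_{s(t), s(t+1)}.
term_direct = np.log(A[states[0], states[1]])
assert np.isclose(term_onehot, term_direct)
```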
4.2. E-step in EM

The missing variables in the log-likelihood are $\mathbb{1}_{s^g(1)}$, $\mathbb{1}_{s^g(t)}$, $\mathbb{1}_{s^g(t+1)}$, and $\mathbb{1}_{o^g(t)}$. The expectation of the log-likelihood with respect to the missing variables is:

$$Q(\pi, A, B) = \mathbb{E}(\ell) = \mathbb{E}\big(\mathbb{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}\big). \quad (30)$$

4.3. M-step in EM

We maximize $Q(\pi, A, B)$ with respect to the parameters $\pi_i$, $a_{i,j}$, and $b_{i,j}$:

$$\begin{aligned} &\underset{\pi, A, B}{\text{maximize}} \quad Q(\pi, A, B) \\ &\text{subject to} \quad \sum_{i=1}^{n} \pi_i = 1, \\ &\qquad\qquad \sum_{j=1}^{n} a_{i,j} = 1, \quad \forall i \in \{1, \dots, n\}, \\ &\qquad\qquad \sum_{j=1}^{m} b_{i,j} = 1, \quad \forall i \in \{1, \dots, n\}, \end{aligned} \quad (31)$$

where the constraints ensure that the probabilities in the initial states, the transition matrix, and the emission matrix add to one.
The Lagrangian (Boyd & Vandenberghe, 2004) for this optimization problem is:

$$\mathcal{L} = \mathbb{E}\big(\mathbb{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}\big) - \eta_1 \Big(\sum_{i=1}^{n} \pi_i - 1\Big) - \eta_2 \Big(\sum_{j=1}^{n} a_{i,j} - 1\Big) - \eta_3 \Big(\sum_{j=1}^{m} b_{i,j} - 1\Big), \quad (32)$$

where $\eta_1$, $\eta_2$, and $\eta_3$ are the Lagrange multipliers.

The first term in the Lagrangian is simplified to $\mathbb{E}(\mathbb{1}_{s^g(1)}[1] \log \pi_1 + \dots + \mathbb{1}_{s^g(1)}[n] \log \pi_n) = \mathbb{E}(\mathbb{1}_{s^g(1)}[1] \log \pi_1) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[n] \log \pi_n)$; therefore, we have:

$$\frac{\partial \mathcal{L}}{\partial \pi_i} = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big) - \eta_1 \pi_i \overset{\text{set}}{=} 0 \implies \pi_i = \frac{1}{\eta_1} \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big). \quad (33)$$

$$\sum_{i=1}^{n} \pi_i = 1 \overset{(33)}{\implies} \frac{1}{\eta_1} \big(\mathbb{E}(\mathbb{1}_{s^g(1)}[1]) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[i]) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[n])\big) = \frac{1}{\eta_1}\, \mathbb{E}(0 + \dots + 1 + \dots + 0) \overset{\text{set}}{=} 1 \implies \eta_1 = 1. \quad (34)$$

$$\therefore \quad \pi_i = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big). \quad (35)$$

Similarly, we have:

$$\frac{\partial \mathcal{L}}{\partial a_{i,j}} = \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big) - \eta_2 a_{i,j} \overset{\text{set}}{=} 0 \implies a_{i,j} = \frac{1}{\eta_2} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big). \quad (36)$$

By the same normalization argument as in Eq. (34), the constraint $\sum_{j=1}^{n} a_{i,j} = 1$ gives:

$$\eta_2 = \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big), \quad (37)$$

$$\therefore \quad a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}. \quad (38)$$

Likewise, we have:

$$\frac{\partial \mathcal{L}}{\partial b_{i,j}} = \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big) - \eta_3 b_{i,j} \overset{\text{set}}{=} 0 \implies b_{i,j} = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big). \quad (39)$$

$$\sum_{j=1}^{m} b_{i,j} = 1 \overset{(39)}{\implies} \frac{1}{\eta_3} \sum_{j=1}^{m} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big) = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \Big( \mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 0) + \dots + \underbrace{\mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 1)}_{j\text{-th element}} + \dots + \mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 0) \Big) = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big) \overset{\text{set}}{=} 1 \implies \eta_3 = \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big). \quad (40)$$

$$\therefore \quad b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}. \quad (41)$$

In Section 7.1, we will simplify the statements for $\pi_i$, $a_{i,j}$, and $b_{i,j}$.

5. Evaluation in HMM

Evaluation in HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): Given the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$ and the HMM model $\lambda = (\pi, A, B)$, we want to compute $P(O^g \,|\, \lambda)$, i.e., the probability of the generated observation sequence. In summary:

$$O^g, \text{ given: } \lambda \implies P(O^g \,|\, \lambda) = \,? \quad (42)$$

Note that $P(O^g \,|\, \lambda)$ can also be denoted by $P(O^g; \lambda)$. The $P(O^g \,|\, \lambda)$ is sometimes referred to as the likelihood.
$$\dots\, a_{s^g(\tau-1), s^g(\tau)}\, b_{s^g(\tau), o^g(\tau)}, \quad (47)$$

which means that we start with the first state, then output the first observation, and then go to the next state. This procedure is repeated until the last state. The summations are over all possible states in the state sequence.

The time complexity of this direct calculation of $P(O^g \,|\, \lambda)$ is in the order of $O(2\tau n^\tau)$ because at every clock time $t \in \{1, \dots, \tau\}$, there are $n$ possible states to go through (Rabiner & Juang, 1986). Because of $n^\tau$, this is very inefficient, especially for long sequences (large $\tau$).

5.2. The Forward-Backward Procedure

A more efficient algorithm for evaluation in HMM is the forward-backward procedure (Rabiner & Juang, 1986). The forward-backward procedure includes two stages, i.e., the forward and backward belief propagation stages.

5.2.1. The Forward Belief Propagation

Similar to the belief propagation procedure, we define the forward message until time $t$ as:

$$\alpha_i(t) := P\big(o^g(1), o^g(2), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big), \quad (48)$$

which is the probability of the partial observation sequence until time $t$ and being in state $s_i$ at time $t$.

Algorithm 2 shows the forward belief propagation from state one to state $\tau$. In this algorithm, $\alpha_i(t)$ is solved inductively. The initial forward message is:

$$\alpha_i(1) = \pi_i\, b_{i, o^g(1)}, \quad \forall i \in \{1, \dots, n\}, \quad (49)$$

which is the probability of occurrence of the initial state $s_i$ and the observation symbol $o^g(1)$. The next forward messages are calculated as:

$$\alpha_j(t+1) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (50)$$

and the probability of the whole observation sequence is the sum of the final forward messages:

$$P(O^g \,|\, \lambda) = \sum_{i=1}^{n} \alpha_i(\tau), \quad (55)$$

which is the desired probability in the evaluation for HMM. Hence, the forward belief propagation suffices for evaluation.

Proposition 1. The Eq. (50) can be interpreted as the sum-product algorithm (see Section 2.3.2).

Proof. The Algorithm 2 has iterations over the states indexed by $j$. Also, the Eq. (50) has a sum over the states indexed by $i$. Consider all states indexed by $i$ and a specific state $s_j$ (see Fig. 3). We can consider every two successive states as a factor node in the factor graph. Hence, a state indexed by $i$ and the state $s_j$ form a factor node which we denote by $f^{i,j}$. The observation symbol $o^g(t+1)$ emitted from $s_j$ is considered as the variable node in the factor graph.

The message $\alpha_i(t)$ is the message received by the factor node $f^{i,j}$ so far. Therefore, in the sum-product algorithm (see Eq. (6)), we have:

$$m_{o^g(t) \rightarrow f^{i,j}} = \alpha_i(t). \quad (51)$$

The message $a_{i,j}\, b_{j, o^g(t+1)}$ is the message received from the factor nodes $f^{i,j}, \forall i$ to the variable node $o^g(t+1)$. Therefore, in the sum-product algorithm (see Eq. (7)), we have:

$$m_{f^{i,j} \rightarrow o^g(t+1)} = a_{i,j}\, b_{j, o^g(t+1)}. \quad (52)$$

Hence:

$$m_{f_{\text{next}} \rightarrow o^g(t+1)} = \sum_{i=1}^{n} m_{f^{i,j} \rightarrow o^g(t+1)}\, m_{o^g(t) \rightarrow f^{i,j}} \quad (53)$$

$$= \sum_{i=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \alpha_i(t) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (54)$$

where $f$ is the set of all factor nodes, $\{f^{i,j}\}_{i=1}^{n}$, and $f^{i,j}_{\text{next}}$ is the factor $f^{i,j}$ in the next time slot.
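A sketch of the forward recursion of Eqs. (49), (50), and (55), vectorized over the states; the toy parameters are the same assumptions as before, and numerical-underflow handling (e.g., scaling) is omitted for brevity:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward belief propagation: returns alpha (tau x n) and P(O^g | lambda)."""
    tau, n = len(obs), len(pi)
    alpha = np.zeros((tau, n))
    alpha[0] = pi * B[:, obs[0]]                            # Eq. (49)
    for t in range(tau - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]    # Eq. (50)
    return alpha, alpha[-1].sum()                           # Eq. (55)

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
alpha, likelihood = forward(pi, A, B, obs=np.array([0, 1, 2]))
print(likelihood)  # P(O^g | lambda), computed in O(tau * n^2) time
```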
In the following, we explain the backward belief propagation, which is required for other sections of the paper.

5.2.2. The Backward Belief Propagation

Again, similar to the belief propagation procedure, we define the backward message from time $\tau$ down to $t+1$ as:

$$\beta_i(t) := P\big(o^g(t+1), o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t) = s_i, \lambda\big), \quad (56)$$

which is the probability of the rest of the observation sequence given being in state $s_i$ at time $t$. The backward messages are computed recursively as follows:

1 Input: $\lambda = (\pi, A, B)$
2 $\beta_i(\tau) = 1, \forall i \in \{1, \dots, n\}$
3 for time $t$ from $(\tau - 1)$ to 1 do
4   for state $i$ from 1 to $n$ do
5     $\beta_i(t) = \sum_{j=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)$

Algorithm 3: The backward belief propagation
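A matching sketch of Algorithm 3 (the backward recursion), under the same assumptions:

```python
import numpy as np

def backward(A, B, obs):
    """Backward belief propagation: returns beta (tau x n)."""
    tau, n = len(obs), A.shape[0]
    beta = np.ones((tau, n))                                 # beta_i(tau) = 1
    for t in range(tau - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])       # line 5 of Algorithm 3
    return beta

A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
beta = backward(A, B, obs=np.array([0, 1, 2]))
print(beta)
```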
5.2.3. Time Complexity

The time complexity of the forward belief propagation is in the order of $O(\tau n^2)$ because we have $O(n\tau)$ loops, each of which includes a summation over $n$ states (Rabiner & Juang, 1986). This is much more efficient than the complexity of the direct calculation of $P(O^g \,|\, \lambda)$, which was $O(2\tau n^\tau)$. Similarly, the time complexity of the backward belief propagation is in the order of $O(\tau n^2)$ (Rabiner & Juang, 1986).

In the Viterbi algorithm, a message $\delta_j(t)$ is defined which is similar to $\alpha_j(t)$ defined in Eq. (50), except that $\alpha_j(t)$ in the forward belief propagation uses the sum-product algorithm (Kschischang et al., 2001) while $\delta_j(t)$ in the Viterbi algorithm uses the max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) (see Sections 2.3.2 and 2.3.3). Note that if we consider Fig. 4 for all states indexed by $j$, a lattice network is formed, which is common in the Viterbi algorithm (see (Rabiner & Juang, 1986)).
$$\psi_j(t) = \arg\max_{1 \le i \le n}\, \delta_i(t-1)\, a_{i,j}. \quad (72)$$

Then, a backward analysis is done, starting from the end of the state sequence:

$$p^* = \max_{1 \le i \le n}\, \delta_i(\tau), \quad (73)$$

$$s^*(\tau) = \arg\max_{1 \le i \le n}\, \delta_i(\tau), \quad (74)$$

and the other states in the sequence are backtracked as:

$$s^*(t) = \psi_{s^*(t+1)}(t+1). \quad (75)$$

The states $S^g = s^*(1), s^*(2), \dots, s^*(\tau)$ are the desired state sequence in the estimation. Therefore, the states in the state sequence maximize the forward belief propagation in a max-product setting.

The probability of this path of states with maximum probability of occurrence is:

$$P(O^g, S^g \,|\, \lambda) = p^*. \quad (76)$$

Note that the Viterbi algorithm can be visualized using a trellis structure (see Appendix A in (Jurafsky & Martin, 2019)).
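A compact sketch of the Viterbi recursion and the backtracking of Eqs. (72)-(75); the initialization of $\delta$ mirrors Eq. (49), which is an assumption here since the defining equations for $\delta_j(t)$ fall outside this excerpt:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state path: returns (p_star, s_star) as in Eqs. (73)-(76)."""
    tau, n = len(obs), len(pi)
    delta = np.zeros((tau, n))
    psi = np.zeros((tau, n), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # initialization, mirroring Eq. (49)
    for t in range(1, tau):
        trans = delta[t - 1][:, None] * A        # delta_i(t-1) * a_{i,j}
        psi[t] = trans.argmax(axis=0)            # Eq. (72)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = np.zeros(tau, dtype=int)
    path[-1] = delta[-1].argmax()                # Eq. (74)
    for t in range(tau - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]        # Eq. (75)
    return delta[-1].max(), path                 # p*, S^g

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
p_star, path = viterbi(pi, A, B, obs=np.array([0, 1, 2]))
print(p_star, path)
```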
$$\implies \xi_{i,j}(t) = \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{P(O^g \,|\, \lambda)} \quad (79)$$

$$= \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{\sum_{r=1}^{n} \sum_{\ell=1}^{n} \alpha_r(t)\, a_{r,\ell}\, b_{\ell, o^g(t+1)}\, \beta_\ell(t+1)}, \quad (80)$$

where (a) is because of the chain rule in probability and (b) is because of the following: According to Eqs. (48), (15), (17), and (56), we have:

$$\alpha_i(t) = P\big(o^g(1), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big),$$
$$a_{i,j} = P\big(s_j(t+1) \,|\, s_i(t)\big),$$
$$b_{j, o^g(t+1)} = P\big(o^g(t+1) \,|\, s_j(t+1)\big),$$
$$\beta_j(t+1) = P\big(o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t+1) = s_j, \lambda\big).$$

Therefore, we have:

$$\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1) = P\big(s^g(t) = s_i,\, s^g(t+1) = s_j,\, O^g \,|\, \lambda\big).$$

Note that we have $\sum_{i=1}^{n} \sum_{j=1}^{n} \xi_{i,j}(t) = 1$. Also, note that $P(O^g \,|\, \lambda)$ in Eq. (79) can be obtained from either the denominator of Eq. (80) or line 6 in Algorithm 2 (i.e., Eq. (55)).
In Eq. (79), the terms $\alpha_i(t)$, $a_{i,j}$, $b_{j, o^g(t+1)}$, and $\beta_j(t+1)$ stand for the probability of the first $t$ observations ending in state $s_i$ at time $t$, the probability of transitioning from state $s_i$ (at time $t$) to state $s_j$ (at time $t+1$), the probability of observing $o^g(t+1)$ from state $s_j$ at time $t+1$, and the probability of the remainder of the observation sequence, respectively.

Now, recall Eqs. (35), (38), and (41) from the EM algorithm for HMM. We write these equations again for the convenience of the reader:

$$\pi_i = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big),$$
$$a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)},$$
$$b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}.$$

On the other hand, recall Eqs. (62) and (78), which are repeated here:

$$\gamma_i(t) = P\big(s^g(t) = s_i \,|\, O^g, \lambda\big),$$
$$\xi_{i,j}(t) = P\big(s^g(t) = s_i,\, s^g(t+1) = s_j \,|\, O^g, \lambda\big).$$

1 Input: $\gamma_i(t), \xi_{i,j}(t), \forall i, \forall j, \forall t$
2 $\pi_i = \gamma_i(1), \forall i \in \{1, \dots, n\}$
3 for state $i$ from 1 to $n$ do
4   for state $j$ from 1 to $n$ do
5     $a_{i,j} = \sum_{t=1}^{\tau-1} \xi_{i,j}(t) \,/\, \sum_{t=1}^{\tau-1} \gamma_i(t)$
6 for state $j$ from 1 to $n$ do
7   for observation $k$ from 1 to $m$ do
8     $b_{j,k} = \sum_{t=1,\, o^g(t)=k}^{\tau} \gamma_j(t) \,/\, \sum_{t=1}^{\tau} \gamma_j(t)$
9 // Normalization, for computer error corrections:
10 $\pi_i \leftarrow \pi_i / \sum_{j=1}^{n} \pi_j$
11 $a_{i,j} \leftarrow a_{i,j} / \sum_{\ell=1}^{n} a_{i,\ell}$
12 $b_{j,k} \leftarrow b_{j,k} / \sum_{\ell=1}^{m} b_{j,\ell}$
13 $\pi = [\pi_1, \dots, \pi_n]^\top$, $A = [a_{i,j}]$, $B = [b_{j,k}]$, $\forall i, j \in \{1, \dots, n\}$, $\forall k \in \{1, \dots, m\}$
14 Return $\lambda = (\pi, A, B)$

Algorithm 5: The Baum-Welch algorithm for training the HMM
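Putting Algorithm 5 together with the forward and backward sketches above, one EM (Baum-Welch) iteration could look as follows; this is an illustrative sketch for a single observation sequence without underflow scaling, not the paper's reference implementation:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch update of lambda = (pi, A, B) from one sequence."""
    tau, n = len(obs), len(pi)
    # Forward and backward messages (Eqs. (49)-(50) and Algorithm 3).
    alpha = np.zeros((tau, n)); beta = np.ones((tau, n))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(tau - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(tau - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                      # Eq. (55)
    gamma = alpha * beta / likelihood                 # gamma_i(t), Eq. (62)
    # xi_{i,j}(t), Eq. (80), stacked over t = 1..tau-1:
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # Lines 2, 5, and 8 of Algorithm 5:
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = np.array([0, 0, 1, 2, 2])
pi, A, B = baum_welch_step(pi, A, B, obs)
print(pi, A.round(3), B.round(3))
```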
Table 1. An example training dataset for action recognition; each action has five training instances, and × indicates that the sequence has ended.

Action  Seq.  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10   t=11   t=12
Sit     1     stand  stand  stand  sit    sit    sit    ×      ×      ×      ×      ×      ×
Sit     2     stand  stand  sit    sit    sit    sit    sit    sit    ×      ×      ×      ×
Sit     3     stand  stand  stand  stand  stand  sit    sit    ×      ×      ×      ×      ×
Sit     4     stand  stand  stand  stand  stand  sit    sit    sit    sit    sit    ×      ×
Sit     5     stand  stand  stand  stand  stand  sit    sit    sit    sit    stand  sit    sit
Stand   1     sit    sit    sit    stand  stand  stand  ×      ×      ×      ×      ×      ×
Stand   2     sit    sit    sit    sit    sit    sit    stand  stand  ×      ×      ×      ×
Stand   3     sit    sit    stand  stand  stand  stand  stand  ×      ×      ×      ×      ×
Stand   4     sit    sit    sit    sit    stand  stand  stand  stand  stand  ×      ×      ×
Stand   5     sit    sit    sit    sit    stand  stand  stand  sit    stand  stand  stand  ×
Turn    1     stand  stand  stand  tilt   tilt   tilt   ×      ×      ×      ×      ×      ×
Turn    2     stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×      ×      ×
Turn    3     stand  stand  stand  stand  stand  tilt   tilt   ×      ×      ×      ×      ×
Turn    4     stand  stand  stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×
Turn    5     stand  stand  stand  stand  tilt   tilt   tilt   stand  tilt   tilt   tilt   ×

Table 2. An example test dataset for action recognition.

Action  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10   t=11
Sit     stand  stand  stand  sit    sit    sit    sit    ×      ×      ×      ×
Stand   sit    sit    sit    sit    stand  stand  stand  ×      ×      ×      ×
Turn    stand  stand  stand  stand  tilt   tilt   stand  stand  tilt   tilt   tilt
Assume we have a set of actions denoted by $W$, where the actions are indexed by $w \in \{1, \dots, |W|\}$. We have $|Q_w|$ training instances for every action, where the training instances are indexed by $q \in \{1, \dots, |Q_w|\}$. Every training instance is a sequence of observation symbols where the symbols are the poses, e.g., sitting, standing, etc. An HMM is trained for every action (Ghojogh et al., 2017; Mokari et al., 2018). Training and testing HMMs for action recognition is the same as the training and test phases explained for speech recognition.

The actions are performed with different sequence lengths (fast or slowly) by different people. As HMM is robust to different repetitions of states, the recognition of actions with different pacing is possible.

In action recognition, we have a dataset of actions consisting of several defined poses (Ghojogh et al., 2017; Mokari et al., 2018). For example, if the dataset includes the three actions sit, stand, and turn, the format of the actions is as follows:

• Action sit: stand, stand, ..., stand, sit, sit, ..., sit
• Action stand: sit, sit, ..., sit, stand, stand, ..., stand
• Action turn: stand, stand, ..., stand, tilt, tilt, ..., tilt

where the actions are modeled as sequences of some poses, i.e., stand, sit, and tilt. The actions can have different lengths or pacing. An example training dataset with its instances is shown in Table 1. In some sequences of the dataset, there are some noisy poses in the middle of the sequences of correct poses, for making a difficult instance. An example test dataset is also shown in Table 2. The three test sequences are different from the training sequences in order to check the generalizability of the HMM models. Three HMM models can be trained for the three actions in this dataset, and then the test action sequences can be fed to the HMM models to be recognized.
9. Conclusion

In this paper, we explained the theory of HMM for evaluation, estimation, and training. We started with some required background, i.e., EM, factor graphs, the sum-product and max-product algorithms, forward-backward propagation, Markov and Bayesian networks, the Markov property, and the DTMC. We then introduced HMM and detailed EM in HMM. Evaluation in HMM was explained in both the direct calculation and the forward-backward procedure. We introduced estimation in HMM using the greedy approach and the Viterbi algorithm. Training the HMM was also covered using the Baum-Welch algorithm, based on the EM algorithm. We also introduced speech and action recognition as two popular applications of HMM.
Blunsom, Phil. Hidden Markov models. Technical report, 2004.

Booth, Taylor L. Sequential machines and automata theory. New York, NY: Wiley, 1967.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Brin, Michael and Stuck, Garrett. Introduction to dynamical systems. Cambridge University Press, 2002.

Eddy, Sean R. What is a hidden Markov model? Nature Biotechnology, 22(10):1315, 2004.

Forney, G. David. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.

Forney, G. David. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.

Gales, Mark, Young, Steve, et al. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304, 2008.

Ghahramani, Zoubin. An introduction to hidden Markov models and Bayesian networks. In Hidden Markov Models: Applications in Computer Vision, pp. 9–41. World Scientific, 2001.

Ghojogh, Benyamin, Mohammadzade, Hoda, and Mokari, Mozhgan. Fisherposes for human action recognition using Kinect sensor data. IEEE Sensors Journal, 18(4):1612–1627, 2017.

Lawler, Gregory F. Introduction to stochastic processes. Chapman and Hall/CRC, 2018.

Loeliger, H.-A. An introduction to factor graphs. IEEE Signal Processing Magazine, 21(1):28–41, 2004.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3D skeleton data using body states. Scientia Iranica, 2018.

Moon, Todd K. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

Murphy, Kevin P., Weiss, Yair, and Jordan, Michael I. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc., 1999.

Nefian, Ara V. and Hayes, Monson H. Hidden Markov models for face recognition. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98, volume 5, pp. 2721–2724. IEEE, 1998.

Pearl, Judea. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Elsevier, 2014.

Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Rabiner, Lawrence R. and Juang, Biing-Hwang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.