the root of the tree. The result of the algorithm does not depend on which node is taken as the root (Bishop, 2006). Usually, the first variable or the last one is considered as the root. Then, some messages from the leaf (or leaves) are sent up on the tree to the root (Bishop, 2006). The messages can be initialized to any fixed constant values (see Eq. (8)). The message from the variable node $x_i$ to its neighbor factor node $f_j$ is:

$$m_{x_i \rightarrow f_j} = \prod_{f \in N(x_i) \setminus f_j} m_{f \rightarrow x_i}, \quad (6)$$

where $f \in N(x_i) \setminus f_j$ means all the neighbor factor nodes of the variable $x_i$ except $f_j$. Also, $m_{x_i \rightarrow f_j}$ and $m_{f \rightarrow x_i}$ denote the message from the variable $x_i$ to the factor $f_j$ and the message from the factor $f$ to the variable $x_i$, respectively. The message from the factor node $f_j$ to its neighbor variable $x_i$ is:

$$m_{f_j \rightarrow x_i} = \sum_{\text{values of } x \in N(f_j) \setminus x_i} f_j \prod_{x \in N(f_j) \setminus x_i} m_{x \rightarrow f_j}. \quad (7)$$

Eq. (7) includes both a sum and a product; this is why the procedure is named the sum-product algorithm (Kschischang et al., 2001; Bishop, 2006). In the factor graph, if the variable $x_i$ has degree one, the message is:

$$m_{x_i \rightarrow f_j} = [1, 1, \dots, 1]^\top \in \mathbb{R}^k, \quad (8)$$

where $k$ is the number of possible values that the variables can take. Moreover, if the variable $x_i$ has degree two, connected to $f_j$ and $f_\ell$, the message is:

$$m_{x_i \rightarrow f_j} = m_{f_\ell \rightarrow x_i}. \quad (9)$$

A good example for better understanding of Eqs. (6) and (7) exists in chapter 8 of (Bishop, 2006).
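As a concrete illustration of Eqs. (6)-(8) (not from the original text), the following minimal Python sketch passes sum-product messages through a single pairwise factor connecting two binary variables; the factor table values are made up:

```python
import numpy as np

# A tiny factor graph: x1 -- f -- x2, where both variables take k = 2 values.
# The pairwise factor f is given as a k x k table (made-up values).
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Eq. (8): a degree-one (leaf) variable sends the all-ones message.
m_x1_to_f = np.ones(2)

# Eq. (7): the factor multiplies the incoming message and sums out x1.
m_f_to_x2 = (f * m_x1_to_f[:, None]).sum(axis=0)

# The belief over x2 is the product of its incoming messages (here only one),
# normalized to a probability distribution.
belief_x2 = m_f_to_x2 / m_f_to_x2.sum()
print(belief_x2)  # the marginal of x2 implied by the factor f
```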
For the exact convergence (and inference) of belief propagation, the graph should be a tree, i.e., it should be cycle-free (Kschischang et al., 2001). If the factor graph has cycles, the inference is approximate and the algorithm is stopped manually after a while. Belief propagation in such graphs is called loopy belief propagation (Murphy et al., 1999; Bishop, 2006). Note that Markov chains are cycle-free, so exact belief propagation can be applied to them.

It is also noteworthy that, as every variable which is connected to only two factor nodes merely passes the message through (see Eq. (9)), a new graphical model, named the normal graph (Forney, 2001), was proposed, which states every variable as an edge rather than a node (vertex). More details about the factor graph and the sum-product algorithm can be found in the references (Kschischang et al., 2001; Loeliger, 2004) and chapter 8 of (Bishop, 2006). Also note that some alternatives to the sum-product algorithm are min-product, max-product, and min-sum (Bishop, 2006).

2.3.3. The Max-Product Algorithm

The max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) is similar to the sum-product algorithm, where the summation operator is replaced by the maximum operator. In this algorithm, the messages from the variable nodes to the factor nodes and vice versa are:

$$m_{x_i \rightarrow f_j} = \prod_{f \in N(x_i) \setminus f_j} m_{f \rightarrow x_i}, \quad (10)$$

$$m_{f_j \rightarrow x_i} = \max_{\text{values of } x \in N(f_j) \setminus x_i} f_j \prod_{x \in N(f_j) \setminus x_i} m_{x \rightarrow f_j}. \quad (11)$$
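Under the same assumed toy factor as above, the max-product message of Eq. (11) is obtained by simply swapping the sum for a max:

```python
import numpy as np

f = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # the same made-up pairwise factor
m_x1_to_f = np.ones(2)       # leaf message, Eq. (8)

# Eq. (11): maximize (rather than sum) over the eliminated variable x1.
m_f_to_x2 = (f * m_x1_to_f[:, None]).max(axis=0)
print(m_f_to_x2)  # un-normalized max-marginal of x2
```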
2.3.4. Belief Propagation with Forward-Backward Procedure

In order to learn the beliefs over the variables and factor nodes, belief propagation can be applied using a forward-backward procedure (see chapter 8 in (Bishop, 2006)). The forward-backward algorithm using the sum-product sub-algorithm is shown in Algorithm 1. In the forward-backward procedure, the belief over a random variable $x_i$ is the product of the forward and backward messages.

1 Initialize the messages to $[1, 1, \dots, 1]^\top$
2 for time $t$ from 1 to $\tau$ do
3   Forward pass: Do the sum-product algorithm
4   Backward pass: Do the sum-product algorithm
5   belief = forward message $\times$ backward message
6   if max(belief change) is small then
7     break the loop
8 Return beliefs

Algorithm 1: The forward-backward algorithm using the sum-product sub-algorithm

3. Probabilistic Graphical Models

3.1. Markov and Bayesian Networks

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high-dimensional space (Koller & Friedman, 2009). In other words, a PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. There exist edges between two variables which have interaction with one another in terms of probability. Different conditional probabilities can be represented by a PGM.

There exist two types of PGM, which are the Markov network (also called the Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and the Bayesian network, the edges of the graph are undirected and directed, respectively.
3.2. The Markov Property

Consider a time series of random variables $X_1, X_2, \dots, X_n$. In general, the joint probability of these random variables can be written as:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \,|\, X_1)\, P(X_3 \,|\, X_2, X_1) \dots P(X_n \,|\, X_{n-1}, \dots, X_2, X_1), \quad (12)$$

according to the chain (or multiplication) rule in probability. The first-order Markov property is an assumption which states that, in a time series of random variables $X_1, X_2, \dots, X_n$, every random variable is merely dependent on the latest previous random variable and not the others. In other words:

$$P(X_i \,|\, X_{i-1}, X_{i-2}, \dots, X_2, X_1) = P(X_i \,|\, X_{i-1}). \quad (13)$$

Hence, with the Markov property, the chain rule is simplified to:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \,|\, X_1)\, P(X_3 \,|\, X_2) \dots P(X_n \,|\, X_{n-1}). \quad (14)$$

The Markov property can be of any order. For example, in a second-order Markov property, a random variable is dependent on the latest and one-to-latest variables. Usually, the default Markov property is of order one. A stochastic process which has the Markov property is called a Markovian process (or Markov process).
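As a quick numerical illustration of Eq. (14) (assumed toy numbers, not from the paper), the joint probability of a short first-order chain is the initial probability times the transition probabilities:

```python
import numpy as np

# Made-up two-state chain: initial distribution and transition matrix.
p0 = np.array([0.6, 0.4])      # P(X1)
T = np.array([[0.7, 0.3],
              [0.1, 0.9]])     # T[i, j] = P(X_{t+1} = j | X_t = i)

# Eq. (14): P(X1=0, X2=1, X3=1) = P(X1=0) P(X2=1 | X1=0) P(X3=1 | X2=1)
joint = p0[0] * T[0, 1] * T[1, 1]
print(joint)  # 0.6 * 0.3 * 0.9 = 0.162
```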
3.3. Discrete Time Markov Chain

A Markov chain is a PGM which has the Markov property. The Markov chain can be either directed or undirected. Usually, a Markov chain is a Bayesian network where the edges are directed. It is important not to confuse the Markov chain with the Markov network.

There are two types of Markov chain, which are the Discrete Time Markov Chain (DTMC) (Ross, 2014) and the Continuous Time Markov Chain (CTMC) (Lawler, 2018). As is obvious from their names, in the DTMC and CTMC, the time of transitions from one random variable to another is and is not partitioned into discrete slots, respectively.

If the variables in a DTMC are considered as states, the DTMC can be viewed as a Finite-State Machine (FSM) or a Finite-State Automaton (FSA) (Booth, 1967). Also, note that the DTMC can be viewed as a Sub-Shift of Finite Type in modeling dynamic systems (Brin & Stuck, 2002).

3.4. Hidden Markov Model (HMM)

An HMM is a DTMC which contains a sequence of hidden variables (named states) in addition to a sequence of emitted observation symbols (outputs).

We have an observation sequence of length $\tau$, which is the number of clock times, $t \in \{1, \dots, \tau\}$. Let $n$ and $m$ denote the number of states and observation symbols, respectively. We show the sets of states and possible observation symbols by $S = \{s_1, \dots, s_n\}$ and $O = \{o_1, \dots, o_m\}$, respectively. We show being in state $s_i$ and in observation symbol $o_i$ at time $t$ by $s_i(t)$ and $o_i(t)$, respectively. Let $\mathbb{R}^{n \times n} \ni A = [a_{i,j}]$ be the state Transition Probability Matrix (TPM), where:

$$a_{i,j} := P\big(s_j(t+1) \,|\, s_i(t)\big). \quad (15)$$

We have:

$$\sum_{j=1}^{n} a_{i,j} = 1. \quad (16)$$

The Emission Probability Matrix (EPM) is denoted by $\mathbb{R}^{n \times m} \ni B = [b_{i,j}]$, where:

$$b_{i,j} := P\big(o_j(t) \,|\, s_i(t)\big), \quad (17)$$

which is the probability of emission of the observation symbols from the states. We have:

$$\sum_{j=1}^{m} b_{i,j} = 1. \quad (18)$$

Let the initial state distribution be denoted by the vector $\mathbb{R}^n \ni \pi = [\pi_1, \dots, \pi_n]^\top$, where:

$$\pi_i := P\big(s_i(1)\big), \quad (19)$$

and:

$$\sum_{i=1}^{n} \pi_i = 1, \quad (20)$$

to satisfy the probability properties. An HMM model is denoted by the tuple $\lambda = (\pi, A, B)$.

Assume that a sequence of states is generated by the HMM according to the TPM. We denote this generated sequence of states by $S^g := s^g(1), \dots, s^g(\tau)$ where $s^g(t) \in S, \forall t$. Likewise, a sequence of outputs (observations) is generated by the HMM according to the EPM. We denote this generated sequence of output symbols by $O^g := o^g(1), \dots, o^g(\tau)$ where $o^g(t) \in O, \forall t$.

We denote the probability of transition from state $s^g(t)$ to $s^g(t+1)$ by $a_{s^g(t), s^g(t+1)}$. So:

$$a_{s^g(t), s^g(t+1)} := P\big(s^g(t+1) \,|\, s^g(t)\big). \quad (21)$$

Note that $s^g(t) \in S$ and $s^g(t+1) \in S$. We also denote:

$$\pi_{s^g(1)} := P\big(s^g(1)\big). \quad (22)$$

Likewise, we denote the probability of state $s^g(t)$ emitting the observation $o^g(t)$ by $b_{s^g(t), o^g(t)}$. So:

$$b_{s^g(t), o^g(t)} := P\big(o^g(t) \,|\, s^g(t)\big). \quad (23)$$

Note that $s^g(t) \in S$ and $o^g(t) \in O$. Figure 2 depicts the structure of an HMM model.

Figure 2. A Hidden Markov Model
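The following sketch encodes a toy $\lambda = (\pi, A, B)$ with $n = 2$ states and $m = 3$ symbols and samples a pair of sequences $(S^g, O^g)$ as described above; all names and probability values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy HMM, lambda = (pi, A, B), with n = 2 states and m = 3 symbols.
pi = np.array([0.5, 0.5])           # initial state distribution, Eq. (19)
A = np.array([[0.8, 0.2],           # TPM, each row sums to 1, Eq. (16)
              [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1],      # EPM, each row sums to 1, Eq. (18)
              [0.1, 0.3, 0.6]])

def sample_hmm(pi, A, B, tau):
    """Sample a state sequence S^g and an observation sequence O^g of length tau."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)                    # s^g(1) ~ pi
    for _ in range(tau):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))   # o^g(t) ~ B[s^g(t), :]
        s = rng.choice(len(pi), p=A[s])              # s^g(t+1) ~ A[s^g(t), :]
    return np.array(states), np.array(obs)

S_g, O_g = sample_hmm(pi, A, B, tau=5)
print(S_g, O_g)
```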
4. Likelihood and Expectation Maximization in HMM

The EM algorithm can be used for analysis and training in HMM (Moon, 1996). In the following, we explain the details of EM for HMM.

4.1. Likelihood

According to Fig. 2, the likelihood of occurrence of the state sequence $S^g$ and the observation sequence $O^g$ is (Ghahramani, 2001):

$$L = P(S^g, O^g) = P\big(s^g(1)\big)\, P(\text{next states} \,|\, \text{previous states})\, P(O^g \,|\, S^g) = P\big(s^g(1)\big) \prod_{t=1}^{\tau-1} P\big(s^g(t+1) \,|\, s^g(t)\big) \prod_{t=1}^{\tau} P\big(o^g(t) \,|\, s^g(t)\big) = \pi_{s^g(1)} \prod_{t=1}^{\tau-1} a_{s^g(t), s^g(t+1)} \prod_{t=1}^{\tau} b_{s^g(t), o^g(t)}. \quad (24)$$

The log-likelihood is:

$$\ell = \log(L) = \log \pi_{s^g(1)} + \sum_{t=1}^{\tau-1} \log\big(a_{s^g(t), s^g(t+1)}\big) + \sum_{t=1}^{\tau} \log\big(b_{s^g(t), o^g(t)}\big). \quad (25)$$
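Eqs. (24) and (25) translate directly into code; a sketch reusing the toy parameters from the previous snippet, with made-up example sequences:

```python
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])

def joint_log_likelihood(pi, A, B, states, obs):
    """Eq. (25): log P(S^g, O^g | lambda) for known state/observation sequences."""
    ll = np.log(pi[states[0]])                      # log pi_{s^g(1)}
    ll += np.log(A[states[:-1], states[1:]]).sum()  # sum of log a_{s^g(t), s^g(t+1)}
    ll += np.log(B[states, obs]).sum()              # sum of log b_{s^g(t), o^g(t)}
    return ll

states = np.array([0, 0, 1, 1])   # an example S^g
obs = np.array([0, 1, 2, 2])      # an example O^g
print(joint_log_likelihood(pi, A, B, states, obs))
```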
Let $\mathbb{1}_i$ be a vector with entry one at index $i$, i.e., $\mathbb{1}_i := [0, 0, \dots, 0, 1, 0, \dots, 0]^\top$. Also, $\mathbb{1}_{s^g(1)}$ means the vector with entry one at the index of the first state in the sequence $S^g$. For example, if there are three possible states and a sequence of length three, $s^g(1) = 2$, $s^g(2) = 1$, $s^g(3) = 3$, we have $\mathbb{1}_{s^g(1)} = [0, 1, 0]^\top$.

The terms in this log-likelihood are:

$$\log \pi_{s^g(1)} = \mathbb{1}_{s^g(1)}^\top \log \pi, \quad (26)$$

$$a_{s^g(t-1), s^g(t)} = \prod_{i=1}^{n} \prod_{j=1}^{n} (a_{i,j})^{\mathbb{1}_i[i]\, \mathbb{1}_j[j]} \implies \log\big(a_{s^g(t-1), s^g(t)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}_i[i]\, \mathbb{1}_j[j] \log(a_{i,j}) = \mathbb{1}_{s^g(t-1)}^\top (\log A)\, \mathbb{1}_{s^g(t)}, \quad (27)$$

$$b_{s^g(t), o^g(t)} = \prod_{i=1}^{n} \prod_{j=1}^{m} (b_{i,j})^{\mathbb{1}_i[i]\, \mathbb{1}_j[j]} \implies \log\big(b_{s^g(t), o^g(t)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbb{1}_i[i]\, \mathbb{1}_j[j] \log(b_{i,j}) = \mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}, \quad (28)$$

where $\mathbb{1}_i[i] = \mathbb{1}_j[j] = 1$. Hence, we can write the log-likelihood as:

$$\ell = \mathbb{1}_{s^g(1)}^\top \log \pi + \sum_{t=1}^{\tau-1} \mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)} + \sum_{t=1}^{\tau} \mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}. \quad (29)$$
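To see that the indicator (one-hot) rewriting in Eqs. (26)-(29) is pure bookkeeping, a short check (assumed toy values) that one transition term of Eq. (29) equals the corresponding term of Eq. (25):

```python
import numpy as np

A = np.array([[0.8, 0.2], [0.3, 0.7]])
states = np.array([0, 1, 1])

def one_hot(i, k):
    e = np.zeros(k)
    e[i] = 1.0
    return e

# One transition term of Eq. (29): 1_{s(t)}^T (log A) 1_{s(t+1)} ...
term_onehot = one_hot(states[0], 2) @ np.log(A) @ one_hot(states[1], 2)
# ... equals the corresponding term of Eq. (25): log a_{s(t), s(t+1)}.
term_direct = np.log(A[states[0], states[1]])
assert np.isclose(term_onehot, term_direct)
```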
4.2. E-step in EM

The missing variables in the log-likelihood are $\mathbb{1}_{s^g(1)}$, $\mathbb{1}_{s^g(t)}$, $\mathbb{1}_{s^g(t+1)}$, and $\mathbb{1}_{o^g(t)}$. The expectation of the log-likelihood with respect to the missing variables is:

$$Q(\pi, A, B) = \mathbb{E}(\ell) = \mathbb{E}\big(\mathbb{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}\big). \quad (30)$$

4.3. M-step in EM

We maximize $Q(\pi, A, B)$ with respect to the parameters $\pi_i$, $a_{i,j}$, and $b_{i,j}$:

$$\begin{aligned} &\underset{\pi, A, B}{\text{maximize}} \quad Q(\pi, A, B) \\ &\text{subject to} \quad \sum_{i=1}^{n} \pi_i = 1, \\ &\qquad\qquad \sum_{j=1}^{n} a_{i,j} = 1, \quad \forall i \in \{1, \dots, n\}, \\ &\qquad\qquad \sum_{j=1}^{m} b_{i,j} = 1, \quad \forall i \in \{1, \dots, n\}, \end{aligned} \quad (31)$$

where the constraints ensure that the probabilities in the initial states, the transition matrix, and the emission matrix add to one.
The Lagrangian (Boyd & Vandenberghe, 2004) for this optimization problem is:

$$\mathcal{L} = \mathbb{E}\big(\mathbb{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log A)\, \mathbb{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}^\top (\log B)\, \mathbb{1}_{o^g(t)}\big) - \eta_1 \Big(\sum_{i=1}^{n} \pi_i - 1\Big) - \eta_2 \Big(\sum_{j=1}^{n} a_{i,j} - 1\Big) - \eta_3 \Big(\sum_{j=1}^{m} b_{i,j} - 1\Big), \quad (32)$$

where $\eta_1$, $\eta_2$, and $\eta_3$ are the Lagrange multipliers.

The first term in the Lagrangian is simplified to $\mathbb{E}(\mathbb{1}_{s^g(1)}[1] \log \pi_1 + \dots + \mathbb{1}_{s^g(1)}[n] \log \pi_n) = \mathbb{E}(\mathbb{1}_{s^g(1)}[1] \log \pi_1) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[n] \log \pi_n)$; therefore, we have:

$$\frac{\partial \mathcal{L}}{\partial \pi_i} = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big) - \eta_1 \pi_i \overset{\text{set}}{=} 0 \implies \pi_i = \frac{1}{\eta_1} \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big). \quad (33)$$

$$\sum_{i=1}^{n} \pi_i = 1 \overset{(33)}{\implies} \frac{1}{\eta_1} \big(\mathbb{E}(\mathbb{1}_{s^g(1)}[1]) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[i]) + \dots + \mathbb{E}(\mathbb{1}_{s^g(1)}[n])\big) = \frac{1}{\eta_1}\, \mathbb{E}(0 + \dots + 1 + \dots + 0) \overset{\text{set}}{=} 1 \implies \eta_1 = 1. \quad (34)$$

$$\therefore \quad \pi_i = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big). \quad (35)$$

Similarly, we have:

$$\frac{\partial \mathcal{L}}{\partial a_{i,j}} = \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big) - \eta_2 a_{i,j} \overset{\text{set}}{=} 0 \implies a_{i,j} = \frac{1}{\eta_2} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big). \quad (36)$$

By the same normalization argument as in Eq. (34), the constraint $\sum_{j=1}^{n} a_{i,j} = 1$ gives:

$$\eta_2 = \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big), \quad (37)$$

$$\therefore \quad a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}. \quad (38)$$

Likewise, we have:

$$\frac{\partial \mathcal{L}}{\partial b_{i,j}} = \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big) - \eta_3 b_{i,j} \overset{\text{set}}{=} 0 \implies b_{i,j} = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big). \quad (39)$$

$$\sum_{j=1}^{m} b_{i,j} = 1 \overset{(39)}{\implies} \frac{1}{\eta_3} \sum_{j=1}^{m} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big) = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \Big( \mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 0) + \dots + \underbrace{\mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 1)}_{j\text{-th element}} + \dots + \mathbb{E}(\mathbb{1}_{s^g(t)}[i] \times 0) \Big) = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big) \overset{\text{set}}{=} 1 \implies \eta_3 = \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big). \quad (40)$$

$$\therefore \quad b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}. \quad (41)$$

In Section 7.1, we will simplify the statements for $\pi_i$, $a_{i,j}$, and $b_{i,j}$.

5. Evaluation in HMM

Evaluation in HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): Given the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$ and the HMM model $\lambda = (\pi, A, B)$, we want to compute $P(O^g \,|\, \lambda)$, i.e., the probability of the generated observation sequence. In summary:

$$O^g, \text{ given: } \lambda \implies P(O^g \,|\, \lambda) = \,? \quad (42)$$

Note that $P(O^g \,|\, \lambda)$ can also be denoted by $P(O^g; \lambda)$. The $P(O^g \,|\, \lambda)$ is sometimes referred to as the likelihood.
$$\dots\, a_{s^g(\tau-1), s^g(\tau)}\, b_{s^g(\tau), o^g(\tau)}, \quad (47)$$

which means that we start with the first state, then output the first observation, and then go to the next state. This procedure is repeated until the last state. The summations are over all possible states in the state sequence.

The time complexity of this direct calculation of $P(O^g \,|\, \lambda)$ is in the order of $O(2\tau n^\tau)$ because at every clock time $t \in \{1, \dots, \tau\}$, there are $n$ possible states to go through (Rabiner & Juang, 1986). Because of $n^\tau$, this is very inefficient, especially for long sequences (large $\tau$).

5.2. The Forward-Backward Procedure

A more efficient algorithm for evaluation in HMM is the forward-backward procedure (Rabiner & Juang, 1986). The forward-backward procedure includes two stages, i.e., the forward and backward belief propagation stages.

5.2.1. The Forward Belief Propagation

Similar to the belief propagation procedure, we define the forward message until time $t$ as:

$$\alpha_i(t) := P\big(o^g(1), o^g(2), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big), \quad (48)$$

which is the probability of the partial observation sequence until time $t$ and being in state $s_i$ at time $t$.

Algorithm 2 shows the forward belief propagation from state one to state $\tau$. In this algorithm, $\alpha_i(t)$ is solved inductively. The initial forward message is:

$$\alpha_i(1) = \pi_i\, b_{i, o^g(1)}, \quad \forall i \in \{1, \dots, n\}, \quad (49)$$

which is the probability of occurrence of the initial state $s_i$ and the observation symbol $o^g(1)$. The next forward messages are calculated as:

$$\alpha_j(t+1) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (50)$$

and the probability of the whole observation sequence is the sum of the final forward messages:

$$P(O^g \,|\, \lambda) = \sum_{i=1}^{n} \alpha_i(\tau), \quad (55)$$

which is the desired probability in the evaluation for HMM. Hence, the forward belief propagation suffices for evaluation.

Proposition 1. The Eq. (50) can be interpreted as the sum-product algorithm (see Section 2.3.2).

Proof. The Algorithm 2 has iterations over the states indexed by $j$. Also, the Eq. (50) has a sum over the states indexed by $i$. Consider all states indexed by $i$ and a specific state $s_j$ (see Fig. 3). We can consider every two successive states as a factor node in the factor graph. Hence, a state indexed by $i$ and the state $s_j$ form a factor node which we denote by $f^{i,j}$. The observation symbol $o^g(t+1)$ emitted from $s_j$ is considered as the variable node in the factor graph.

The message $\alpha_i(t)$ is the message received by the factor node $f^{i,j}$ so far. Therefore, in the sum-product algorithm (see Eq. (6)), we have:

$$m_{o^g(t) \rightarrow f^{i,j}} = \alpha_i(t). \quad (51)$$

The message $a_{i,j}\, b_{j, o^g(t+1)}$ is the message received from the factor nodes $f^{i,j}, \forall i$ to the variable node $o^g(t+1)$. Therefore, in the sum-product algorithm (see Eq. (7)), we have:

$$m_{f^{i,j} \rightarrow o^g(t+1)} = a_{i,j}\, b_{j, o^g(t+1)}. \quad (52)$$

Hence:

$$m_{f_{\text{next}} \rightarrow o^g(t+1)} = \sum_{i=1}^{n} m_{f^{i,j} \rightarrow o^g(t+1)}\, m_{o^g(t) \rightarrow f^{i,j}} \quad (53)$$

$$= \sum_{i=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \alpha_i(t) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (54)$$

where $f$ is the set of all factor nodes, $\{f^{i,j}\}_{i=1}^{n}$, and $f^{i,j}_{\text{next}}$ is the factor $f^{i,j}$ in the next time slot.
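A sketch of the forward recursion of Eqs. (49), (50), and (55), vectorized over the states; the toy parameters are the same assumptions as before, and numerical-underflow handling (e.g., scaling) is omitted for brevity:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward belief propagation: returns alpha (tau x n) and P(O^g | lambda)."""
    tau, n = len(obs), len(pi)
    alpha = np.zeros((tau, n))
    alpha[0] = pi * B[:, obs[0]]                            # Eq. (49)
    for t in range(tau - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]    # Eq. (50)
    return alpha, alpha[-1].sum()                           # Eq. (55)

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
alpha, likelihood = forward(pi, A, B, obs=np.array([0, 1, 2]))
print(likelihood)  # P(O^g | lambda), computed in O(tau * n^2) time
```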
In the following, we explain the backward belief propagation, which is required for other sections of the paper.

5.2.2. The Backward Belief Propagation

Again, similar to the belief propagation procedure, we define the backward message from time $\tau$ down to $t+1$ as:

$$\beta_i(t) := P\big(o^g(t+1), o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t) = s_i, \lambda\big), \quad (56)$$

which is the probability of the rest of the observation sequence given being in state $s_i$ at time $t$. The backward messages are computed recursively as follows:

1 Input: $\lambda = (\pi, A, B)$
2 $\beta_i(\tau) = 1, \forall i \in \{1, \dots, n\}$
3 for time $t$ from $(\tau - 1)$ to 1 do
4   for state $i$ from 1 to $n$ do
5     $\beta_i(t) = \sum_{j=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)$

Algorithm 3: The backward belief propagation
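A matching sketch of Algorithm 3 (the backward recursion), under the same assumptions:

```python
import numpy as np

def backward(A, B, obs):
    """Backward belief propagation: returns beta (tau x n)."""
    tau, n = len(obs), A.shape[0]
    beta = np.ones((tau, n))                                 # beta_i(tau) = 1
    for t in range(tau - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])       # line 5 of Algorithm 3
    return beta

A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
beta = backward(A, B, obs=np.array([0, 1, 2]))
print(beta)
```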
5.2.3. Time Complexity

The time complexity of the forward belief propagation is in the order of $O(\tau n^2)$ because we have $O(n\tau)$ loops, each of which includes a summation over $n$ states (Rabiner & Juang, 1986). This is much more efficient than the complexity of the direct calculation of $P(O^g \,|\, \lambda)$, which was $O(2\tau n^\tau)$. Similarly, the time complexity of the backward belief propagation is in the order of $O(\tau n^2)$ (Rabiner & Juang, 1986).

In the Viterbi algorithm, a message $\delta_j(t)$ is defined which is similar to $\alpha_j(t)$ defined in Eq. (50), except that $\alpha_j(t)$ in the forward belief propagation uses the sum-product algorithm (Kschischang et al., 2001) while $\delta_j(t)$ in the Viterbi algorithm uses the max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) (see Sections 2.3.2 and 2.3.3). Note that if we consider Fig. 4 for all states indexed by $j$, a lattice network is formed, which is common in the Viterbi algorithm (see (Rabiner & Juang, 1986)).
$$\psi_j(t) = \arg\max_{1 \le i \le n}\, \delta_i(t-1)\, a_{i,j}. \quad (72)$$

Then, a backward analysis is done, starting from the end of the state sequence:

$$p^* = \max_{1 \le i \le n}\, \delta_i(\tau), \quad (73)$$

$$s^*(\tau) = \arg\max_{1 \le i \le n}\, \delta_i(\tau), \quad (74)$$

and the other states in the sequence are backtracked as:

$$s^*(t) = \psi_{s^*(t+1)}(t+1). \quad (75)$$

The states $S^g = s^*(1), s^*(2), \dots, s^*(\tau)$ are the desired state sequence in the estimation. Therefore, the states in the state sequence maximize the forward belief propagation in a max-product setting.

The probability of this path of states with maximum probability of occurrence is:

$$P(O^g, S^g \,|\, \lambda) = p^*. \quad (76)$$

Note that the Viterbi algorithm can be visualized using a trellis structure (see Appendix A in (Jurafsky & Martin, 2019)).
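A compact sketch of the Viterbi recursion and the backtracking of Eqs. (72)-(75); the initialization of $\delta$ mirrors Eq. (49), which is an assumption here since the defining equations for $\delta_j(t)$ fall outside this excerpt:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state path: returns (p_star, s_star) as in Eqs. (73)-(76)."""
    tau, n = len(obs), len(pi)
    delta = np.zeros((tau, n))
    psi = np.zeros((tau, n), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # initialization, mirroring Eq. (49)
    for t in range(1, tau):
        trans = delta[t - 1][:, None] * A        # delta_i(t-1) * a_{i,j}
        psi[t] = trans.argmax(axis=0)            # Eq. (72)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = np.zeros(tau, dtype=int)
    path[-1] = delta[-1].argmax()                # Eq. (74)
    for t in range(tau - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]        # Eq. (75)
    return delta[-1].max(), path                 # p*, S^g

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
p_star, path = viterbi(pi, A, B, obs=np.array([0, 1, 2]))
print(p_star, path)
```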
$$\implies \xi_{i,j}(t) = \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{P(O^g \,|\, \lambda)} \quad (79)$$

$$= \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{\sum_{r=1}^{n} \sum_{\ell=1}^{n} \alpha_r(t)\, a_{r,\ell}\, b_{\ell, o^g(t+1)}\, \beta_\ell(t+1)}, \quad (80)$$

where (a) is because of the chain rule in probability and (b) is because of the following: According to Eqs. (48), (15), (17), and (56), we have:

$$\alpha_i(t) = P\big(o^g(1), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big),$$
$$a_{i,j} = P\big(s_j(t+1) \,|\, s_i(t)\big),$$
$$b_{j, o^g(t+1)} = P\big(o^g(t+1) \,|\, s_j(t+1)\big),$$
$$\beta_j(t+1) = P\big(o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t+1) = s_j, \lambda\big).$$

Therefore, we have:

$$\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1) = P\big(s^g(t) = s_i,\, s^g(t+1) = s_j,\, O^g \,|\, \lambda\big).$$

Note that we have $\sum_{i=1}^{n} \sum_{j=1}^{n} \xi_{i,j}(t) = 1$. Also, note that $P(O^g \,|\, \lambda)$ in Eq. (79) can be obtained from either the denominator of Eq. (80) or line 6 in Algorithm 2 (i.e., Eq. (55)).
In Eq. (79), the terms $\alpha_i(t)$, $a_{i,j}$, $b_{j, o^g(t+1)}$, and $\beta_j(t+1)$ stand for the probability of the first $t$ observations ending in state $s_i$ at time $t$, the probability of transitioning from state $s_i$ (at time $t$) to state $s_j$ (at time $t+1$), the probability of observing $o^g(t+1)$ from state $s_j$ at time $t+1$, and the probability of the remainder of the observation sequence, respectively.

Now, recall Eqs. (35), (38), and (41) from the EM algorithm for HMM. We write these equations again for the convenience of the reader:

$$\pi_i = \mathbb{E}\big(\mathbb{1}_{s^g(1)}[i]\big),$$
$$a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)},$$
$$b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\, \mathbb{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbb{1}_{s^g(t)}[i]\big)}.$$

On the other hand, recall Eqs. (62) and (78), which are repeated here:

$$\gamma_i(t) = P\big(s^g(t) = s_i \,|\, O^g, \lambda\big),$$
$$\xi_{i,j}(t) = P\big(s^g(t) = s_i,\, s^g(t+1) = s_j \,|\, O^g, \lambda\big).$$

1 Input: $\gamma_i(t), \xi_{i,j}(t), \forall i, \forall j, \forall t$
2 $\pi_i = \gamma_i(1), \forall i \in \{1, \dots, n\}$
3 for state $i$ from 1 to $n$ do
4   for state $j$ from 1 to $n$ do
5     $a_{i,j} = \sum_{t=1}^{\tau-1} \xi_{i,j}(t) \,/\, \sum_{t=1}^{\tau-1} \gamma_i(t)$
6 for state $j$ from 1 to $n$ do
7   for observation $k$ from 1 to $m$ do
8     $b_{j,k} = \sum_{t=1,\, o^g(t)=k}^{\tau} \gamma_j(t) \,/\, \sum_{t=1}^{\tau} \gamma_j(t)$
9 // Normalization, for computer error corrections:
10 $\pi_i \leftarrow \pi_i / \sum_{j=1}^{n} \pi_j$
11 $a_{i,j} \leftarrow a_{i,j} / \sum_{\ell=1}^{n} a_{i,\ell}$
12 $b_{j,k} \leftarrow b_{j,k} / \sum_{\ell=1}^{m} b_{j,\ell}$
13 $\pi = [\pi_1, \dots, \pi_n]^\top$, $A = [a_{i,j}]$, $B = [b_{j,k}]$, $\forall i, j \in \{1, \dots, n\}$, $\forall k \in \{1, \dots, m\}$
14 Return $\lambda = (\pi, A, B)$

Algorithm 5: The Baum-Welch algorithm for training the HMM
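Putting Algorithm 5 together with the forward and backward sketches above, one EM (Baum-Welch) iteration could look as follows; this is an illustrative sketch for a single observation sequence without underflow scaling, not the paper's reference implementation:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch update of lambda = (pi, A, B) from one sequence."""
    tau, n = len(obs), len(pi)
    # Forward and backward messages (Eqs. (49)-(50) and Algorithm 3).
    alpha = np.zeros((tau, n)); beta = np.ones((tau, n))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(tau - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    for t in range(tau - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()                      # Eq. (55)
    gamma = alpha * beta / likelihood                 # gamma_i(t), Eq. (62)
    # xi_{i,j}(t), Eq. (80), stacked over t = 1..tau-1:
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    # Lines 2, 5, and 8 of Algorithm 5:
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = np.array([0, 0, 1, 2, 2])
pi, A, B = baum_welch_step(pi, A, B, obs)
print(pi, A.round(3), B.round(3))
```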
Table 1. An example training dataset for action recognition; each action has five training instances, and × indicates that the sequence has ended.

Action  Seq.  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10   t=11   t=12
Sit     1     stand  stand  stand  sit    sit    sit    ×      ×      ×      ×      ×      ×
Sit     2     stand  stand  sit    sit    sit    sit    sit    sit    ×      ×      ×      ×
Sit     3     stand  stand  stand  stand  stand  sit    sit    ×      ×      ×      ×      ×
Sit     4     stand  stand  stand  stand  stand  sit    sit    sit    sit    sit    ×      ×
Sit     5     stand  stand  stand  stand  stand  sit    sit    sit    sit    stand  sit    sit
Stand   1     sit    sit    sit    stand  stand  stand  ×      ×      ×      ×      ×      ×
Stand   2     sit    sit    sit    sit    sit    sit    stand  stand  ×      ×      ×      ×
Stand   3     sit    sit    stand  stand  stand  stand  stand  ×      ×      ×      ×      ×
Stand   4     sit    sit    sit    sit    stand  stand  stand  stand  stand  ×      ×      ×
Stand   5     sit    sit    sit    sit    stand  stand  stand  sit    stand  stand  stand  ×
Turn    1     stand  stand  stand  tilt   tilt   tilt   ×      ×      ×      ×      ×      ×
Turn    2     stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×      ×      ×
Turn    3     stand  stand  stand  stand  stand  tilt   tilt   ×      ×      ×      ×      ×
Turn    4     stand  stand  stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×
Turn    5     stand  stand  stand  stand  tilt   tilt   tilt   stand  tilt   tilt   tilt   ×

Table 2. An example test dataset for action recognition.

Action  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10   t=11
Sit     stand  stand  stand  sit    sit    sit    sit    ×      ×      ×      ×
Stand   sit    sit    sit    sit    stand  stand  stand  ×      ×      ×      ×
Turn    stand  stand  stand  stand  tilt   tilt   stand  stand  tilt   tilt   tilt
Assume we have a set of actions denoted by $W$, where the actions are indexed by $w \in \{1, \dots, |W|\}$. We have $|Q_w|$ training instances for every action, where the training instances are indexed by $q \in \{1, \dots, |Q_w|\}$. Every training instance is a sequence of observation symbols where the symbols are the poses, e.g., sitting, standing, etc. An HMM is trained for every action (Ghojogh et al., 2017; Mokari et al., 2018). Training and testing HMMs for action recognition is the same as the training and test phases explained for speech recognition.

The actions are performed with different sequence lengths (fast or slowly) by different people. As HMM is robust to different repetitions of states, the recognition of actions with different pacing is possible.

In action recognition, we have a dataset of actions consisting of several defined poses (Ghojogh et al., 2017; Mokari et al., 2018). For example, if the dataset includes the three actions sit, stand, and turn, the format of the actions is as follows:

• Action sit: stand, stand, ..., stand, sit, sit, ..., sit
• Action stand: sit, sit, ..., sit, stand, stand, ..., stand
• Action turn: stand, stand, ..., stand, tilt, tilt, ..., tilt

where the actions are modeled as sequences of some poses, i.e., stand, sit, and tilt. The actions can have different lengths or pacing. An example training dataset with its instances is shown in Table 1. In some sequences of the dataset, there are some noisy poses in the middle of the sequences of correct poses, for making a difficult instance. An example test dataset is also shown in Table 2. The three test sequences are different from the training sequences in order to check the generalizability of the HMM models. Three HMM models can be trained for the three actions in this dataset, and then the test action sequences can be fed to the HMM models to be recognized.
9. Conclusion

In this paper, we explained the theory of HMM for evaluation, estimation, and training. We started with some required background, i.e., EM, factor graphs, the sum-product and max-product algorithms, forward-backward propagation, Markov and Bayesian networks, the Markov property, and the DTMC. We then introduced HMM and detailed EM in HMM. Evaluation in HMM was explained in both the direct calculation and the forward-backward procedure. We introduced estimation in HMM using the greedy approach and the Viterbi algorithm. Training the HMM was also covered using the Baum-Welch algorithm, based on the EM algorithm. We also introduced speech and action recognition as two popular applications of HMM.
Blunsom, Phil. Hidden Markov models. Technical report, 2004.

Booth, Taylor L. Sequential machines and automata theory. New York, NY: Wiley, 1967.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Brin, Michael and Stuck, Garrett. Introduction to dynamical systems. Cambridge University Press, 2002.

Eddy, Sean R. What is a hidden Markov model? Nature Biotechnology, 22(10):1315, 2004.

Forney, G. David. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.

Forney, G. David. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.

Gales, Mark, Young, Steve, et al. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304, 2008.

Ghahramani, Zoubin. An introduction to hidden Markov models and Bayesian networks. In Hidden Markov Models: Applications in Computer Vision, pp. 9–41. World Scientific, 2001.

Ghojogh, Benyamin, Mohammadzade, Hoda, and Mokari, Mozhgan. Fisherposes for human action recognition using Kinect sensor data. IEEE Sensors Journal, 18(4):1612–1627, 2017.

Lawler, Gregory F. Introduction to stochastic processes. Chapman and Hall/CRC, 2018.

Loeliger, H.-A. An introduction to factor graphs. IEEE Signal Processing Magazine, 21(1):28–41, 2004.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3D skeleton data using body states. Scientia Iranica, 2018.

Moon, Todd K. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

Murphy, Kevin P., Weiss, Yair, and Jordan, Michael I. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc., 1999.

Nefian, Ara V. and Hayes, Monson H. Hidden Markov models for face recognition. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98, volume 5, pp. 2721–2724. IEEE, 1998.

Pearl, Judea. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Elsevier, 2014.

Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Rabiner, Lawrence R. and Juang, Biing-Hwang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.