matrix X), which in turn implies range(O) ⊆ range(P_{2,1}). Thus rank(P_{2,1}) = rank(O) = m, and also range(U) = range(P_{2,1}) = range(O).

2. (Relation to hidden states) We prove by induction that b⃗_t = (U^⊤O) h⃗_t(x_{1:t−1}). For the inductive step,

$$\begin{aligned}
\vec{b}_{t+1} &= \frac{B_{x_t}\vec{b}_t}{\vec{b}_\infty^\top B_{x_t}\vec{b}_t}
 = \frac{B_{x_t}(U^\top O)\vec{h}_t}{\Pr[x_t \mid x_{1:t-1}]}
 = \frac{(U^\top O)A_{x_t}\vec{h}_t}{\Pr[x_t \mid x_{1:t-1}]} \\
&= (U^\top O)\,\frac{\Pr[h_{t+1}=\cdot,\,x_t \mid x_{1:t-1}]}{\Pr[x_t \mid x_{1:t-1}]}
 = (U^\top O)\Pr[h_{t+1}=\cdot \mid x_{1:t}]
 = (U^\top O)\,\vec{h}_{t+1}(x_{1:t})
\end{aligned}$$

(the first three equalities follow from the first claim, the inductive hypothesis, and Lemma 3), and

$$\vec{b}_\infty^\top B_{x_{t+1}}\vec{b}_{t+1} \;=\; \vec{1}_m^\top A_{x_{t+1}}\vec{h}_{t+1} \;=\; \Pr[x_{t+1}\mid x_{1:t}]$$

(again, using Lemma 3).

4 Spectral Learning of Hidden Markov Models

4.1 Algorithm

The representation in the previous section suggests the algorithm LEARNHMM(m, N) detailed in Figure 1, which simply uses random samples to estimate the model parameters. Note that in practice, knowing m is not essential because the method presented here tolerates models that are not exactly HMMs, and the parameter m may be tuned using cross-validation. As we discussed earlier, the requirement for independent samples is only for the convenience of our sample complexity analysis.

The model returned by LEARNHMM(m, N) can be used as follows:

• To predict the probability of a sequence:
$$\widehat{\Pr}[x_1,\ldots,x_t] \;=\; \hat{b}_\infty^\top \hat{B}_{x_t}\cdots\hat{B}_{x_1}\hat{b}_1.$$

• Given an observation x_t, the 'internal state' update is:
$$\hat{b}_{t+1} \;=\; \frac{\hat{B}_{x_t}\hat{b}_t}{\hat{b}_\infty^\top \hat{B}_{x_t}\hat{b}_t}.$$

• To predict the conditional probability of x_t given x_{1:t−1}:
$$\widehat{\Pr}[x_t \mid x_{1:t-1}] \;=\; \frac{\hat{b}_\infty^\top \hat{B}_{x_t}\hat{b}_t}{\sum_x \hat{b}_\infty^\top \hat{B}_x\hat{b}_t}.$$

Aside from the random sampling, the running time of the learning algorithm is dominated by the SVD computation of an n × n matrix. The time required to compute joint probabilities is O(tm²) for length-t sequences, the same as if one used the ordinary HMM parameters (O and T). For conditional probabilities, we require some extra work (proportional to n) to compute the normalization factor. However, our analysis shows that this normalization factor is always close to 1 (see Lemma 13), so it can be safely omitted in many applications.
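The following NumPy sketch is our illustration of these operations; LEARNHMM itself is specified in Figure 1 and is not reproduced here. It forms the empirical marginals P̂_1, P̂_{2,1}, P̂_{3,x,1} from sampled triples, takes Û from the SVD of P̂_{2,1}, and computes the observable parameters as b̂_∞ = (P̂_{2,1}^⊤ Û)^+ P̂_1 and B̂_x = (Û^⊤ P̂_{3,x,1})(Û^⊤ P̂_{2,1})^+ (the empirical analogues of the true parameters given in Section 5), together with b̂_1 = Û^⊤ P̂_1, which we assume as the natural empirical choice. All function and variable names are ours.

```python
import numpy as np

def estimate_marginals(triples, n):
    """Empirical P1_hat, P21_hat, P3x1_hat from i.i.d. observation triples
    (x1, x2, x3), each symbol in {0, ..., n-1}."""
    N = len(triples)
    P1 = np.zeros(n)
    P21 = np.zeros((n, n))          # [P21]_{ij} = Pr[x2 = i, x1 = j]
    P3x1 = np.zeros((n, n, n))      # [P3x1[x]]_{ij} = Pr[x3 = i, x2 = x, x1 = j]
    for (x1, x2, x3) in triples:
        P1[x1] += 1.0 / N
        P21[x2, x1] += 1.0 / N
        P3x1[x2][x3, x1] += 1.0 / N
    return P1, P21, P3x1

def learn_observable_model(P1, P21, P3x1, m):
    """Form (b1_hat, binf_hat, {Bx_hat}) using the top-m left singular
    vectors of P21_hat as U_hat."""
    U = np.linalg.svd(P21)[0][:, :m]               # n x m
    b1 = U.T @ P1                                  # assumed: b1_hat = U_hat^T P1_hat
    binf = np.linalg.pinv(P21.T @ U) @ P1          # binf_hat = (P21_hat^T U_hat)^+ P1_hat
    Bx = [(U.T @ P3x1[x]) @ np.linalg.pinv(U.T @ P21) for x in range(P1.shape[0])]
    return b1, binf, Bx

def joint_probability(b1, binf, Bx, seq):
    """Pr_hat[x_1, ..., x_t] = binf^T B_{x_t} ... B_{x_1} b1."""
    b = b1
    for x in seq:
        b = Bx[x] @ b
    return float(binf @ b)

def filter_and_predict(b1, binf, Bx, seq):
    """Run the internal-state updates along `seq` and return the conditional
    distribution of the next observation, normalized as in Section 4.1."""
    b = b1
    for x in seq:
        b = Bx[x] @ b / (binf @ Bx[x] @ b)         # b_{t+1} = Bx b_t / (binf^T Bx b_t)
    scores = np.array([binf @ Bx[x] @ b for x in range(len(Bx))])
    return scores / scores.sum()                   # normalization factor is ~1 (Lemma 13)
```

With the exact population marginals in place of the empirical estimates, these formulas reproduce the true probabilities exactly (this is the content of the observable representation above); Theorems 6 and 7 control the error introduced by sampling.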
4.2 Main Results

We now present our main results. The first result is a guarantee on the accuracy of our joint probability estimates for observation sequences. The second result concerns the accuracy of conditional probability estimates, a much more delicate quantity to bound due to conditioning on unlikely events. We also remark that if the probability distribution is only approximately modeled as an HMM, then our results degrade gracefully based on this approximation quality.

4.2.1 Joint Probability Accuracy

Let σ_m(M) denote the m-th largest singular value of a matrix M. Our sample complexity bound will depend polynomially on 1/σ_m(P_{2,1}) and 1/σ_m(O).
Also, define

$$\epsilon(k) \;=\; \min\Big\{ \sum_{j\in S}\Pr[x_2 = j] \;:\; S\subseteq [n],\; |S| = n-k \Big\}, \qquad (1)$$

and let

$$n_0(\epsilon) \;=\; \min\{k : \epsilon(k)\le \epsilon/4\}.$$

In other words, n_0(ε) is the minimum number of observations that account for about 1 − ε/4 of the total probability mass. Clearly n_0(ε) ≤ n, but it can often be much smaller in real applications. For example, in many practical applications the frequencies of observation symbols follow a power law (Zipf's law) of the form f(k) ∝ 1/k^s, where f(k) is the frequency of the k-th most frequently observed symbol. If s > 1, then ε(k) = O(k^{1−s}), and n_0(ε) = O(ε^{1/(1−s)}) becomes independent of the number of observations n. This means that for such problems, our analysis below leads to a sample complexity bound for the cumulative distribution Pr[x_{1:t}] that can be independent of n. This is useful in domains with large n such as natural language processing.
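To make the definition concrete, the small NumPy check below (ours, not part of the paper's algorithm) computes ε(k) and n_0(ε) for a Zipf-distributed observation alphabet; for s > 1, n_0(ε) stabilizes as n grows, as claimed above.

```python
import numpy as np

def eps_k(p, k):
    """epsilon(k): the smallest total mass of any n-k symbols, i.e. the mass
    left over after keeping the k most probable symbols of Pr[x2 = .]."""
    return np.sort(p)[::-1][k:].sum()

def n0(p, eps):
    """n_0(eps) = min{ k : epsilon(k) <= eps/4 }."""
    for k in range(len(p) + 1):
        if eps_k(p, k) <= eps / 4:
            return k
    return len(p)

for n in (100, 1000, 10000):
    p = 1.0 / np.arange(1, n + 1) ** 2.0   # Zipf distribution with s = 2
    p /= p.sum()
    print(n, n0(p, eps=0.2))               # n_0(0.2) is essentially constant in n
```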
Theorem 6. There exists a constant C > 0 such that the following holds. Pick any 0 < ε, η < 1 and t ≥ 1. Assume the HMM obeys Condition 1, and

$$N \;\ge\; C\cdot\frac{t^2}{\epsilon^2}\cdot\left( \frac{m\cdot\log(1/\eta)}{\sigma_m(O)^2\,\sigma_m(P_{2,1})^4} + \frac{m\cdot n_0(\epsilon)\cdot\log(1/\eta)}{\sigma_m(O)^2\,\sigma_m(P_{2,1})^2} \right).$$

With probability at least 1 − η, the model returned by the algorithm LEARNHMM(m, N) satisfies

$$\sum_{x_1,\ldots,x_t} \big|\Pr[x_1,\ldots,x_t] - \widehat{\Pr}[x_1,\ldots,x_t]\big| \;\le\; \epsilon.$$

The main challenge in proving Theorem 6 is understanding how the estimation errors accumulate in the algorithm's probability calculation. This would have been less problematic if we had estimates of the usual HMM parameters T and O; the fully observable representation forces us to deal with more cumbersome matrix and vector products.

4.2.2 Conditional Probability Accuracy

In this section, we analyze the accuracy of our conditional predictions P̂r[x_t | x_1, ..., x_{t−1}]. Intuitively, we might hope that these predictive distributions do not become arbitrarily bad over time (as t → ∞). The reason is that while estimation errors propagate into long-term probability predictions (as evident in Theorem 6), the history of observations constantly provides feedback about the underlying hidden state, and this information is incorporated using Bayes' rule (implicitly via our internal state updates).

This intuition was confirmed by [EDKM07], who showed that if one has an approximate model of T and O for the HMM, then under certain conditions, the conditional prediction does not diverge. This condition is the positivity of the 'value of observation' γ, defined as

$$\gamma \;=\; \inf_{\vec{v}\,:\,\|\vec{v}\|_1 = 1} \|O\vec{v}\|_1.$$

Note that γ ≥ σ_m(O)/√n, so it is guaranteed to be positive by Condition 1. However, γ can be much larger than what this crude lower bound suggests.
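One way to verify the crude lower bound: for any v⃗ ∈ R^m with ||v⃗||_1 = 1,

$$\|O\vec{v}\|_1 \;\ge\; \|O\vec{v}\|_2 \;\ge\; \sigma_m(O)\,\|\vec{v}\|_2 \;\ge\; \sigma_m(O)\,\frac{\|\vec{v}\|_1}{\sqrt{m}} \;=\; \frac{\sigma_m(O)}{\sqrt{m}} \;\ge\; \frac{\sigma_m(O)}{\sqrt{n}},$$

using ||u||_1 ≥ ||u||_2, the fact that σ_m(O) is the smallest singular value of the full-column-rank matrix O, the inequality ||v⃗||_2 ≥ ||v⃗||_1/√m in R^m, and m ≤ n.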
To interpret this quantity γ, consider any two distributions over hidden states h⃗, ĥ ∈ R^m. Then ||O(h⃗ − ĥ)||_1 ≥ γ ||h⃗ − ĥ||_1. Regarding h⃗ as the true hidden state distribution and ĥ as the estimated hidden state distribution, this inequality gives a lower bound on the error of the estimated observation distributions under O. In other words, the observation process, on average, reveals errors in our hidden state estimation. [EDKM07] uses this as a contraction property to show how prediction errors (due to using an approximate model) do not diverge. In our setting, this is more difficult as we do not explicitly estimate O, nor do we explicitly maintain distributions over hidden states.

We also need the following assumption, which we discuss further following the theorem statement.

Condition 3 (Stochasticity Condition). For all observations x and all states i and j, [A_x]_{ij} ≥ α > 0.

Theorem 7. There exists a constant C > 0 such that the following holds. Pick any 0 < ε, η < 1. Assume the HMM obeys Conditions 1 and 3, and

$$N \;\ge\; C\cdot\left( \frac{m\cdot n_0(\epsilon)}{\epsilon^2\,\sigma_m(O)^2\,\sigma_m(P_{2,1})^2} + \frac{m}{\epsilon^2\,\sigma_m(O)^2\,\sigma_m(P_{2,1})^4} + \frac{m}{\alpha^2\epsilon^4} + \frac{(\log(2/\alpha))^4}{\alpha^4\epsilon^4\gamma^4} + \frac{(\log(2/\alpha))^4}{\alpha^{10}\gamma^4} \right)\cdot\log\frac{1}{\eta}.$$

Then, with probability at least 1 − η, the model returned by LEARNHMM(m, N) satisfies, for any time t,

$$\mathrm{KL}\big(\Pr[x_t \mid x_1,\ldots,x_{t-1}]\,\big\|\,\widehat{\Pr}[x_t \mid x_1,\ldots,x_{t-1}]\big) \;=\; \mathbb{E}_{x_{1:t}}\left[\ln\frac{\Pr[x_t \mid x_{1:t-1}]}{\widehat{\Pr}[x_t \mid x_{1:t-1}]}\right] \;\le\; \epsilon.$$

To justify our choice of error measure, note that the problem of bounding the errors of conditional probabilities is complicated by the fact that, over the long run, we may have to condition on a very low probability event. Thus we need to control the relative accuracy of our predictions. This makes the KL-divergence a natural choice for the error measure. Unfortunately, because our HMM conditions are more naturally interpreted in terms of spectral and normed quantities, we end up switching back and forth between KL and L1 errors via Pinsker-style inequalities (as in [EDKM07]). It is not clear to us if a significantly better guarantee could be obtained with a pure L1 error analysis (nor is it clear how to do such an analysis).

The analysis in [EDKM07] (which assumed that approximations to T and O were provided) dealt with this problem of dividing by zero (during a Bayes' rule update) by explicitly modifying the approximate model so that it never assigns zero probability to any event (since if such an event occurred, the conditional probability would no longer be defined). In our setting, Condition 3 ensures that the true model never assigns zero probability to any event. We can relax this condition somewhat (so that we need not quantify over all observations), though we do not discuss this here.

We should also remark that while our sample complexity bound is significantly larger than in Theorem 6, we are also bounding the more stringent KL-error measure on conditional distributions.
4.2.3 Learning Distributions ε-close to HMMs

Our L1 error guarantee for predicting joint probabilities still holds if the sample used to estimate P̂_1, P̂_{2,1}, P̂_{3,x,1} comes from a probability distribution Pr[·] that is merely close to an HMM. Specifically, all we need is that there exists some t_max ≥ 3 and some m-state HMM with distribution Pr^HMM[·] such that:

1. Pr^HMM satisfies Condition 1 (HMM Rank Condition),
2. for all t ≤ t_max, Σ_{x_{1:t}} | Pr[x_{1:t}] − Pr^HMM[x_{1:t}] | ≤ ε^HMM(t),
3. ε^HMM(2) ≤ (1/2) σ_m(P^HMM_{2,1}).

The resulting error of our learned model P̂r is

$$\sum_{x_{1:t}} \big|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big| \;\le\; \epsilon^{\mathrm{HMM}}(t) + \sum_{x_{1:t}} \big|\Pr^{\mathrm{HMM}}[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\big|$$

for all t ≤ t_max. The second term is now bounded as in Theorem 6, with spectral parameters corresponding to Pr^HMM.

5 Proof ideas

We outline the main ideas for proving Theorems 6 and 7. Full proofs can be found in a technical report available from arXiv (http://arxiv.org/abs/0811.4413). Throughout this section, we assume the HMM obeys Condition 1. Table 1 summarizes the notation that will be used throughout the analysis in this section.

m, n: Number of states and observations
n_0(ε): Number of significant observations
O, T, A_x: HMM parameters
P_1, P_{2,1}, P_{3,x,1}: Marginal probabilities
P̂_1, P̂_{2,1}, P̂_{3,x,1}: Empirical marginal probabilities
ε_1, ε_{2,1}, ε_{3,x,1}: Sampling errors [Section 5.1]
Û: Matrix of m left singular vectors of P̂_{2,1}
b̃_∞, B̃_x, b̃_1: True observable parameters using Û [Section 5.1]
b̂_∞, B̂_x, b̂_1: Estimated observable parameters using Û
δ_∞, Δ_x, δ_1: Parameter errors [Section 5.1]
Δ: Σ_x Δ_x [Section 5.1]
σ_m(M): m-th largest singular value of matrix M
b⃗_t, b̂_t: True and estimated states [Section 5.3]
h⃗_t, ĥ_t, ĝ_t: (Û^⊤O)^{−1} b⃗_t, (Û^⊤O)^{−1} b̂_t, ĥ_t/(1⃗_m^⊤ ĥ_t) [Section 5.3]
Â_x: (Û^⊤O)^{−1} B̂_x (Û^⊤O) [Section 5.3]
γ, α: inf{ ||Ov⃗||_1 : ||v⃗||_1 = 1 }, min{ [A_x]_{i,j} }

Table 1: Summary of notation.

The following lemma bounds these errors with high probability as a function of the number of observation samples used to form the estimates.

Lemma 8. If the algorithm independently samples N observation triples from the HMM, then with probability at least 1 − η:

$$\epsilon_1 \;\le\; \sqrt{\frac{1}{N}\ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}}, \qquad \epsilon_{2,1} \;\le\; \sqrt{\frac{1}{N}\ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$

$$\sum_x \epsilon_{3,x,1} \;\le\; \min_k\left( \sqrt{\frac{k}{N}\ln\frac{3}{\eta}} + \sqrt{\frac{k}{N}} + 2\epsilon(k) \right) + \sqrt{\frac{1}{N}\ln\frac{3}{\eta}} + \sqrt{\frac{1}{N}},$$

where ε(k) is defined in (1).

The rest of the analysis estimates how the sampling errors affect the accuracies of the model parameters (which in turn affect the prediction quality).

Let U ∈ R^{n×m} be the matrix of left singular vectors of P_{2,1}. The first lemma implies that if P̂_{2,1} is sufficiently close to P_{2,1}, i.e. ε_{2,1} is small enough, then the difference between projecting to range(Û) and to range(U) is small. In particular, Û^⊤O will be invertible and nearly as well-conditioned as U^⊤O.

Lemma 9. Suppose ε_{2,1} ≤ ε · σ_m(P_{2,1}) for some ε < 1/2. Let ε_0 = ε_{2,1}² / ((1 − ε) σ_m(P_{2,1}))². Then:

1. ε_0 < 1,
2. σ_m(Û^⊤ P̂_{2,1}) ≥ (1 − ε) σ_m(P_{2,1}),
3. σ_m(Û^⊤ P_{2,1}) ≥ √(1 − ε_0) σ_m(P_{2,1}),
4. σ_m(Û^⊤ O) ≥ √(1 − ε_0) σ_m(O).

Now we will argue that the estimated parameters b̂_∞, B̂_x, b̂_1 are close to the following true parameters from the observable representation when Û is used for U:

$$\tilde{b}_\infty \;=\; (P_{2,1}^\top \hat{U})^{+} P_1 \;=\; (\hat{U}^\top O)^{-\top}\vec{1}_m, \qquad \tilde{B}_x \;=\; (\hat{U}^\top P_{3,x,1})(\hat{U}^\top P_{2,1})^{+}.$$
5.2 Proof of Theorem 6

We need to quantify how estimation errors propagate in the probability calculation. Because the joint probability of a length-t sequence is computed by multiplying together t matrices, there is a danger of magnifying the estimation errors exponentially. Fortunately, this is not the case: the following lemma (readily proved by induction) shows that these errors accumulate roughly additively.

Lemma 11. Assume Û^⊤O is invertible. For any time t:

$$\sum_{x_{1:t}} \Big\| (\hat{U}^\top O)^{-1}\big( \hat{B}_{x_{t:1}}\hat{b}_1 - \tilde{B}_{x_{t:1}}\tilde{b}_1 \big) \Big\|_1 \;\le\; (1+\Delta)^t\,\delta_1 + (1+\Delta)^t - 1.$$

All that remains is to bound the effect of errors in b̂_∞. Theorem 6 will follow from the following lemma combined with the sampling error bounds of Lemma 8.

Lemma 12. Assume ε_{2,1} ≤ σ_m(P_{2,1})/3. Then for any t,

$$\sum_{x_{1:t}} \big| \Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}] \big| \;\le\; \delta_\infty + (1+\delta_\infty)\big( (1+\Delta)^t\,\delta_1 + (1+\Delta)^t - 1 \big).$$

5.3 Proof of Theorem 7

In this subsection, we assume the HMM obeys Condition 3 (in addition to Condition 1).

Lemma 13. [...] for all i = 1, ..., m. Moreover, for any non-zero vector w⃗ ∈ R^m,

$$\frac{\vec{1}_m^\top \hat{A}_x \vec{w}}{\hat{b}_\infty^\top(\hat{U}^\top O)\hat{A}_x\vec{w}} \;\le\; \frac{1}{1-\delta_\infty}.$$

A consequence of Lemma 13 is that if the estimated parameters are sufficiently accurate, then the state updates never allow predictions of very small hidden state probabilities.

Corollary 14. Assume δ_∞ ≤ 1/2, max_x Δ_x ≤ α/3, δ_1 ≤ α/8, and max_x δ_∞ + δ_∞Δ_x + Δ_x ≤ 1/3. Then [ĝ_t]_i ≥ α/2 for all t and i.

Lemma 13 and Corollary 14 can now be used to prove the contraction property of the KL-divergence between the true hidden states and the estimated hidden states. The analysis shares ideas from [EDKM07], though the added difficulty is due to the fact that the state maintained by our algorithm is not a probability distribution.

Lemma 15. Let ε_0 = max_x 2Δ_x/α + (δ_∞ + δ_∞Δ_x + Δ_x)/α + 2δ_∞. Assume δ_∞ ≤ 1/2, max_x Δ_x ≤ α/3, and max_x δ_∞ + δ_∞Δ_x + Δ_x ≤ 1/3. For all t, if ĝ_t ∈ R^m is a probability vector, then

$$\mathrm{KL}(\vec{h}_{t+1}\,\|\,\hat{g}_{t+1}) \;\le\; \mathrm{KL}(\vec{h}_t\,\|\,\hat{g}_t) - \frac{\gamma^2}{2\left(\ln\frac{2}{\alpha}\right)^2}\,\mathrm{KL}(\vec{h}_t\,\|\,\hat{g}_t)^2 + \epsilon_0.$$

Finally, the recurrence from Lemma 15 easily gives the following lemma.
Lemma 16. Let ε_0 = max_x 2Δ_x/α + (δ_∞ + δ_∞Δ_x + Δ_x)/α + 2δ_∞ and ε_1 = max_x (δ_∞ + √m δ_∞Δ_x + √m Δ_x)/α. Assume δ_∞ ≤ 1/2, max_x Δ_x ≤ α/3, and max_x δ_∞ + δ_∞Δ_x + Δ_x ≤ 1/3. Also assume

$$\delta_1 \;\le\; \frac{\alpha}{8}, \qquad \epsilon_0 \;\le\; \frac{\alpha^4\gamma^2}{128\left(\ln\frac{2}{\alpha}\right)^2}, \qquad\text{and}\qquad \epsilon_1 \;<\; \frac{1}{2}.$$
Then for all t,

$$\mathrm{KL}(\vec{h}_t\,\|\,\hat{g}_t) \;\le\; \sqrt{\frac{2\left(\ln\frac{2}{\alpha}\right)^2\epsilon_0}{\gamma^2}} \qquad\text{and}$$

$$\mathrm{KL}\big(\Pr[x_t \mid x_{1:t-1}]\,\big\|\,\widehat{\Pr}[x_t \mid x_{1:t-1}]\big) \;\le\; \sqrt{\frac{2\left(\ln\frac{2}{\alpha}\right)^2\epsilon_0}{\gamma^2}} + \delta_\infty + \delta_\infty\Delta + \Delta + 2\epsilon_1.$$

Theorem 7 follows by combining the previous lemma and the sampling error bounds of Lemma 8.
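To see how the first bound of Lemma 16 emerges from Lemma 15, one can iterate the scalar map k ↦ k − c·k² + ε_0 with c = γ²/(2(ln(2/α))²); it settles near the fixed point √(ε_0/c), which is exactly the bound above. The short numerical check below is ours, with arbitrary illustrative values for γ, α, ε_0.

```python
import math

gamma, alpha, eps0 = 0.5, 0.1, 1e-4        # illustrative values only
c = gamma**2 / (2 * math.log(2 / alpha)**2)

k = 1.0                                     # pessimistic initial KL value
for t in range(5000):
    k = max(k - c * k**2 + eps0, 0.0)       # Lemma 15-style recurrence, taken with equality
    
print(k, math.sqrt(eps0 / c))               # k approaches sqrt(2 (ln(2/alpha))^2 eps0 / gamma^2)
```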
Acknowledgments

The authors would like to thank John Langford and Ruslan Salakhutdinov for earlier discussions on using bottleneck methods to learn nonlinear dynamic systems. The linearization of the bottleneck idea was the basis of this paper. We also thank Yishay Mansour for pointing out hardness results for learning HMMs. This work was completed while the first author was an intern at TTI-C in 2008.

References

[ARJ03] S. Andersson, T. Ryden, and R. Johansson. Linear optimal prediction and innovations representations of hidden Markov models. Stochastic Processes and their Applications, 108:131–149, 2003.

[BE67] Leonard E. Baum and J. A. Eagon. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc., 73(3):360–363, 1967.

[BPSW70] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1):164–171, 1970.

[BV08] S. Charles Brubaker and Santosh Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, 2008.

[CC08] G. Cybenko and V. Crespi. Learning hidden Markov models using non-negative matrix factorization. Technical report, 2008. arXiv:0809.4086.

[CP71] J. W. Carlyle and A. Paz. Realizations by stochastic finite automata. J. Comput. Syst. Sci., 5:26–40, 1971.

[CR08] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In COLT, 2008.

[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, 1999.

[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[DS07] Sanjoy Dasgupta and Leonard Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. JMLR, 8(Feb):203–226, 2007.

[EDKM05] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Planning in POMDPs using multiplicity automata. In UAI, 2005.

[EDKM07] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. The value of observation for monitoring dynamic systems. In IJCAI, 2007.

[Fli74] M. Fliess. Matrices de Hankel. J. Math. Pures Appl., 53:197–222, 1974.

[Hot35] H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 1935.

[Jae00] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Comput., 12(6), 2000.

[Kat05] T. Katayama. Subspace Methods for System Identification. Springer, 2005.

[KMR+94] M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, and L. Sellie. On the learnability of discrete distributions. In STOC, pages 273–282, 1994.

[Lju87] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[LSS01] Michael Littman, Richard Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1555–1561, 2001.

[MR06] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. Annals of Applied Probability, 16(2):583–614, 2006.

[OM96] P. Van Overschee and B. De Moor. Subspace Identification of Linear Systems. Kluwer Academic Publishers, 1996.

[Sch61] M. P. Schützenberger. On the definition of a family of automata. Inf. Control, 4:245–270, 1961.
[Ter02] Sebastiaan Terwijn. On the learnability of hidden Markov models. In International Colloquium on Grammatical Inference, 2002.

[VW02] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. In FOCS, 2002.

[VWM07] B. Vanluyten, J. Willems, and B. De Moor. A new approach for the identification of hidden Markov models. In Conference on Decision and Control, 2007.

[ZJ07] MingJie Zhao and Herbert Jaeger. The error controlling algorithm for learning OOMs. Technical Report 6, International University Bremen, 2007.

A Recovering the Observation and Transition Matrices

We sketch how to use the technique of [MR06] to recover the observation and transition matrices explicitly. This is an extra step that can be used in conjunction with our algorithm.

Define the n × n matrix [P_{3,1}]_{i,j} = Pr[x_3 = i, x_1 = j]. Let O_x = diag(O_{x,1}, ..., O_{x,m}), so A_x = T O_x. Since P_{3,x,1} = O A_x T diag(π⃗) O^⊤, we have P_{3,1} = Σ_x P_{3,x,1} = O T T diag(π⃗) O^⊤. Therefore

$$U^\top P_{3,x,1} \;=\; U^\top O\,T\,O_x\,T\,\mathrm{diag}(\vec{\pi})\,O^\top \;=\; (U^\top O T)\,O_x\,(U^\top O T)^{-1}\,(U^\top O T)\,T\,\mathrm{diag}(\vec{\pi})\,O^\top \;=\; (U^\top O T)\,O_x\,(U^\top O T)^{-1}\,(U^\top P_{3,1}).$$

The matrix U^⊤ P_{3,1} has full row rank, so it follows that

$$(U^\top P_{3,x,1})\,(U^\top P_{3,1})^{+} \;=\; (U^\top O T)\,O_x\,(U^\top O T)^{-1}.$$

Now we can form O from the diagonals of O_x. Since O has full column rank, O^+O = I_m, so it is now easy to also recover π⃗ and T from P_1 and P_{2,1}:

$$O^{+} P_1 \;=\; O^{+} O\,\vec{\pi} \;=\; \vec{\pi}$$

and

$$O^{+} P_{2,1}\,(O^{+})^\top\,\mathrm{diag}(\vec{\pi})^{-1} \;=\; O^{+}\,\big(O\,T\,\mathrm{diag}(\vec{\pi})\,O^\top\big)\,(O^{+})^\top\,\mathrm{diag}(\vec{\pi})^{-1} \;=\; T.$$

Note that because [MR06] do not allow more observations than states, they do not need to work in a lower-dimensional subspace such as range(U). Thus, they perform an eigen-decomposition of the matrix

$$\left( \sum_x g_x P_{3,x,1} \right) P_{3,1}^{-1} \;=\; (OT)\left(\sum_x g_x O_x\right)(OT)^{-1},$$

and then use the eigenvectors to form the matrix OT. Thus they rely on the stability of the eigenvectors, which depends heavily on the spacing of the eigenvalues.
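As a sanity check under the stated assumptions, the recovery can be carried out numerically from exact population quantities. The sketch below is our code: the random-combination eigendecomposition is modeled on the [MR06]-style construction described in the last paragraph, applied in the range(U) subspace, and it recovers O only up to a permutation of the hidden states; π⃗ and T then follow from the pseudoinverse identities above. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 6

# An arbitrary ground-truth HMM (columns of T and O sum to 1).
T = rng.dirichlet(np.ones(m), size=m).T          # T[i, j] = Pr[h' = i | h = j]
O = rng.dirichlet(np.ones(n), size=m).T          # O[x, j] = Pr[obs = x | h = j]
pi = rng.dirichlet(np.ones(m))

# Population quantities used in the appendix.
P1   = O @ pi
P21  = O @ T @ np.diag(pi) @ O.T
Ax   = [T @ np.diag(O[x, :]) for x in range(n)]  # A_x = T O_x
P3x1 = [O @ Ax[x] @ T @ np.diag(pi) @ O.T for x in range(n)]
P31  = sum(P3x1)                                 # = O T T diag(pi) O^T

U = np.linalg.svd(P21)[0][:, :m]                 # top-m left singular vectors of P21

# C_x = (U^T P_{3,x,1})(U^T P_{3,1})^+ = (U^T O T) O_x (U^T O T)^{-1}
Cx = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P31) for x in range(n)]

# Random combination: sum_x g_x C_x = (U^T O T)(sum_x g_x O_x)(U^T O T)^{-1};
# its eigenvectors recover the columns of U^T O T up to order and scale.
g = rng.normal(size=n)
_, R = np.linalg.eig(sum(g[x] * Cx[x] for x in range(n)))
Rinv = np.linalg.inv(R)

# Each R^{-1} C_x R is diagonal; its diagonal holds row x of O (columns permuted).
O_rec = np.real(np.vstack([np.diag(Rinv @ Cx[x] @ R) for x in range(n)]))

pi_rec = np.linalg.pinv(O_rec) @ P1
T_rec  = np.linalg.pinv(O_rec) @ P21 @ np.linalg.pinv(O_rec).T @ np.linalg.inv(np.diag(pi_rec))

# O_rec, pi_rec, T_rec match O, pi, T up to a common permutation of hidden states.
print(np.sort(pi_rec), np.sort(pi))
```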