Bayesian Nonparametric Hidden Semi-Markov Models
Abstract
There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM)
as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning
from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian
constraints are undesirable, particularly if we wish to learn or encode non-geometric state dura-
tions. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration
semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting,
to allow construction of highly interpretable models that admit natural prior information on state
durations.
In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semi-
Markov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference.
These methods also provide new approaches to sampling inference in the finite Bayesian
HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical
Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference
toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both
synthetic and real data.
Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchi-
cal Dirichlet Process Hidden Markov Model
1. Introduction
Given a set of sequential data in an unsupervised setting, we often aim to infer meaningful states,
or “topics,” present in the data along with characteristics that describe and distinguish those states.
For example, in a speaker diarization (or who-spoke-when) problem, we are given a single audio
recording of a meeting and wish to infer the number of speakers present, when they speak, and some
characteristics governing their speech patterns (Tranter and Reynolds, 2006; Fox et al., 2008). Or
in separating a home power signal into the power signals of individual devices, we would be able
to perform the task much better if we were able to exploit our prior knowledge about the levels
and durations of each device’s power modes (Kolter and Johnson, 2011). Such learning problems
for sequential data are pervasive, and so we would like to build general models that are both flex-
ible enough to be applicable to many domains and expressive enough to encode the appropriate
information.
© 2013 Matthew J. Johnson and Alan S. Willsky.
J OHNSON AND W ILLSKY
Hidden Markov Models (HMMs) have proven to be excellent general models for approaching
learning problems in sequential data, but they have two significant disadvantages: (1) state duration
distributions are necessarily restricted to a geometric form that is not appropriate for many real-
world data, and (2) the number of hidden states must be set a priori so that model complexity is not
inferred from data in a Bayesian way.
Recent work in Bayesian nonparametrics has addressed the latter issue. In particular, the Hi-
erarchical Dirichlet Process HMM (HDP-HMM) has provided a powerful framework for inferring
arbitrarily large state complexity from data (Teh et al., 2006; Beal et al., 2002). However, the HDP-
HMM does not address the issue of non-Markovianity in real data. The Markovian disadvantage
is even compounded in the nonparametric setting, since non-Markovian behavior in data can lead
to the creation of unnecessary extra states and unrealistically rapid switching dynamics (Fox et al.,
2008).
One approach to avoiding the rapid-switching problem is the Sticky HDP-HMM (Fox et al.,
2008), which introduces a learned global self-transition bias to discourage rapid switching. Indeed,
the Sticky model has demonstrated significant performance improvements over the HDP-HMM for
several applications. However, it shares the HDP-HMM’s restriction to geometric state durations,
thus limiting the model’s expressiveness regarding duration structure. Moreover, its global self-
transition bias is shared among all states, and so it does not allow for learning state-specific duration
information. The infinite Hierarchical HMM (Heller et al., 2009) induces non-Markovian state du-
rations at the coarser levels of its state hierarchy, but even the coarser levels are constrained to have
a sum-of-geometrics form, and hence it can be difficult to incorporate prior information. Further-
more, constructing posterior samples from any of these models can be computationally expensive,
and finding efficient algorithms to exploit problem structure is an important area of research.
These potential limitations and needed improvements to the HDP-HMM motivate this investiga-
tion into explicit-duration semi-Markov modeling, which has a history of success in the parametric
(and usually non-Bayesian) setting. We combine semi-Markovian ideas with the HDP-HMM to
construct a general class of models that allow for both Bayesian nonparametric inference of state
complexity as well as general duration distributions. In addition, the sampling techniques we de-
velop for the Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) provide
new approaches to inference in HDP-HMMs that can avoid some of the difficulties which result in
slow mixing rates. We demonstrate the applicability of our models and algorithms on both synthetic
and real data sets.
The remainder of this paper is organized as follows. In Section 2, we describe explicit-duration
HSMMs and existing HSMM message-passing algorithms, which we use to build efficient Bayesian
inference algorithms. We also provide a brief treatment of the Bayesian nonparametric HDP-HMM
and sampling inference algorithms. In Section 3 we develop the HDP-HSMM and related models. In
Section 4 we develop extensions of the weak-limit and direct assignment samplers (Teh et al., 2006)
for the HDP-HMM to our models and describe some techniques for improving the computational
efficiency in some settings.
Section 5 demonstrates the effectiveness of the HDP-HSMM on both synthetic and real data. In
synthetic experiments, we demonstrate that our sampler mixes very quickly on data generated by
both HMMs and HSMMs and accurately learns parameter values and state cardinality. We also show
that while an HDP-HMM is unable to capture the statistics of an HSMM-generated sequence, we
can build HDP-HSMMs that efficiently learn whether data were generated by an HMM or HSMM.
As a real-data experiment, we apply the HDP-HSMM to a problem in power signal disaggregation.
Figure 1: Basic graphical model for the Bayesian HMM. Parameters for the transition, emission,
and initial state distributions are random variables. The symbol α represents the hyper-
parameter for the prior distributions on state-transition parameters. The shaded nodes
indicate observations on which we condition to form the posterior distribution over the
unshaded latent components.
2.1 HMMs
The core of the HMM consists of two layers: a layer of hidden state variables and a layer of
observation or emission variables, as shown in Figure 1. The hidden state sequence, $x = (x_t)_{t=1}^T$,
is a sequence of random variables on a finite alphabet, $x_t \in \{1, 2, \ldots, N\}$, that form a Markov
chain. In this paper, we focus on time-homogeneous models, in which the transition distribution
does not depend on $t$. The transition parameters are collected into a row-stochastic transition matrix
$\pi = (\pi_{ij})_{i,j=1}^N$ where $\pi_{ij} = p(x_{t+1} = j \mid x_t = i)$. We also use $\{\pi_i\}$ to refer to the set of rows of the
transition matrix. We use $p(y_t \mid x_t, \{\theta_i\})$ to denote the emission distribution, where $\{\theta_i\}$ represents
parameters.
The Bayesian approach allows us to model uncertainty over the parameters and perform model
averaging (for example, forming a prediction of an observation $y_{T+1}$ by integrating out all possible
parameters and state sequences), generally at the cost of somewhat more computationally expensive
algorithms. This paper is concerned with the Bayesian approach and so the model parameters are treated as
random variables, with their priors denoted $p(\pi \mid \alpha)$ and $p(\{\theta_i\} \mid H)$.
2.2 HSMMs
There are several approaches to hidden semi-Markov models (Murphy, 2002; Yu, 2010). We fo-
cus on explicit duration semi-Markov modeling; that is, we are interested in the setting where
each state’s duration is given an explicit distribution. Such HSMMs are generally treated from a
non-Bayesian perspective in the literature, where parameters are estimated and fixed via an approxi-
mate maximum-likelihood procedure (particularly the natural Expectation-Maximization algorithm,
which constitutes a local search).
The basic idea underlying this HSMM formalism is to augment the generative process of a
standard HMM with a random state duration time, drawn from some state-specific distribution when
the state is entered. The state remains constant until the duration expires, at which point there is
a Markov transition to a new state. We use the random variable $D_t$ to denote the duration of a
state that is entered at time $t$, and we write the probability mass function for the random variable as
$p(d_t \mid x_t = i)$.
Figure 2: HSMM interpreted as a Markov chain on a set of super-states, $(z_s)_{s=1}^S$. The number
of shaded nodes associated with each $z_s$, denoted by $D_s$, is drawn from a state-specific
duration distribution.
A graphical model for the explicit-duration HSMM is shown in Figure 2 (from Murphy, 2002),
though the number of nodes in the graphical model is itself random. In this picture, we see there
is a Markov chain (without self-transitions) on $S$ “super-state” nodes, $(z_s)_{s=1}^S$, and these
super-states in turn emit random-length segments of observations, of which we observe the first $T$.
Here, the symbol $D_s$ is used to denote the random length of the observation segment of super-state $s$
for $s = 1, \ldots, S$. The “super-state” picture separates the Markovian transitions from the segment
durations.
When defining an HSMM model, one must also choose whether the observation sequence ends
exactly on a segment boundary or whether the observations are censored at the end, so that the final
segment may possibly be cut off in the observations. We focus on the right-censored formulation in
this paper, but our models and algorithms can easily be modified to the uncensored or left-censored
cases. For a further discussion, see Guédon (2007).
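The super-state picture and the right-censoring convention can be illustrated with a short generative sketch. This is an illustration only, not the authors' code: the function name `sample_hsmm_labels`, the callable `dur_sampler`, and the list-of-lists layout of `trans` are assumptions made for this example.

```python
import random

def sample_hsmm_labels(trans, dur_sampler, T, init_state=0, rng=random):
    """Draw a length-T label sequence from an explicit-duration HSMM.

    trans[i][j] : super-state transition probability, with trans[i][i] == 0
    dur_sampler : hypothetical callable (state, rng) -> duration >= 1
    Returns the censored labels and the list of (z_s, D_s) segments.
    """
    labels = []
    segments = []
    state = init_state
    while len(labels) < T:
        d = dur_sampler(state, rng)            # duration drawn when the state is entered
        segments.append((state, d))
        labels.extend([state] * d)             # the state stays constant for d steps
        # Markov transition between super-states (no self-transitions).
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
    # Right-censoring: the final segment may be cut off at time T.
    return labels[:T], segments
```

For example, calling it with `trans = [[0.0, 1.0], [1.0, 0.0]]` and a duration sampler such as `lambda i, r: r.randint(2, 4)` produces alternating segments whose final segment may extend past `T` before truncation.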
It is possible to perform efficient message-passing inference along an HSMM state chain (con-
ditioned on parameters and observations) in a way similar to the standard alpha-beta dynamic pro-
gramming algorithm for standard HMMs. The “backwards” messages are crucial in the devel-
opment of efficient sampling inference in Section 4 because the message values can be used to
efficiently compute the posterior information necessary to block-sample the hidden state sequence
(xt ), and so we briefly describe the relevant part of the existing HSMM message-passing algorithm.
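As derived in Murphy (2002), we can define and compute the backwards messages $B$ and $B^*$ as
follows (Murphy (2002) and others use the symbols $\beta$ and $\beta^*$ for the messages; to avoid
confusion with our HDP parameter $\beta$, we use the symbols $B$ and $B^*$):
\begin{align*}
B_t(i) &:= p(y_{t+1:T} \mid x_t = i, F_t = 1) = \sum_j B_t^*(j)\, p(x_{t+1} = j \mid x_t = i), \\
B_t^*(i) &:= p(y_{t+1:T} \mid x_{t+1} = i, F_t = 1) \\
&= \sum_{d=1}^{T-t} B_{t+d}(i)\, p(D_{t+1} = d \mid x_{t+1} = i)\, p(y_{t+1:t+d} \mid x_{t+1} = i, D_{t+1} = d) \\
&\quad + p(D_{t+1} > T - t \mid x_{t+1} = i)\, p(y_{t+1:T} \mid x_{t+1} = i, D_{t+1} > T - t), \\
B_T(i) &:= 1,
\end{align*}
where we have split the messages into $B$ and $B^*$ components for convenience and used $y_{k_1:k_2}$ to
denote $(y_{k_1}, \ldots, y_{k_2})$. $D_{t+1}$ represents the duration of the segment beginning at time $t + 1$. The
conditioning on the parameters of the distributions, namely the observation, duration, and transition
parameters, is suppressed from the notation.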
We write $F_t = 1$ to indicate a new segment begins at $t + 1$ (Murphy, 2002), and so to compute
the message from $t + 1$ to $t$ we sum over all possible lengths $d$ for the segment beginning at $t + 1$,
using the backwards message at $t + d$ to provide aggregate future information given a boundary just
after $t + d$. The final additive term in the expression for $B_t^*(i)$ is described in Guédon (2007); it
constitutes the contribution of state segments that run off the end of the provided observations, as
per the censoring assumption, and depends on the survival function of the duration distribution.
Though a very similar message-passing subroutine is used in HMM Gibbs samplers, there are
significant differences in computational cost between the HMM and HSMM message computations.
The greater expressive power of the HSMM model necessarily increases the computational cost: the
above message passing requires $O(T^2 N + T N^2)$ basic operations for a chain of length $T$ and state
cardinality $N$, while the corresponding HMM message passing algorithm requires only $O(T N^2)$.
However, if the support of the duration distribution is limited, or if we truncate possible segment
lengths included in the inference messages to some maximum $d_{\max}$, we can instead express the
asymptotic message passing cost as $O(T d_{\max} N + T N^2)$. Such truncations are often natural because the
duration prior typically causes the message contributions to decay rapidly for sufficiently large $d$.
Though the increased complexity of message-passing over an HMM significantly increases the cost
per iteration of sampling inference for a global model, the cost is offset because HSMM samplers
need far fewer total iterations to converge. See the experiments in Section 5.
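The backward recursion and the effect of truncating at a maximum segment length can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `hsmm_backward_messages`, the list-based layout of `trans`, `dur_pmf` and `lik`, and the decision to drop the censoring term whenever `dmax` cuts the sum short (that is, treating durations beyond `dmax` as having zero probability) are assumptions made for this example. It uses plain probabilities for readability; a practical implementation would work in the log domain to avoid underflow on long sequences.

```python
def hsmm_backward_messages(trans, dur_pmf, lik, dmax=None):
    """Backward messages for a right-censored explicit-duration HSMM.

    trans[i][j]   : p(x_{t+1} = j | x_t = i) for the super-state chain
    dur_pmf[i][d] : p(D = d | state i) for d = 1, 2, ... (index 0 unused)
    lik[t][i]     : p(y_{t+1} | state i), where t is a 0-based index
    dmax          : optional cap on the segment lengths included in the sum
    Returns (B, Bstar), each indexed as [t][i] for t = 0..T.
    """
    T, N = len(lik), len(trans)
    B = [[1.0] * N for _ in range(T + 1)]        # B[T][i] = 1 by definition
    Bstar = [[0.0] * N for _ in range(T + 1)]    # Bstar[T] is never used
    for t in range(T - 1, -1, -1):
        remaining = T - t
        limit = remaining if dmax is None else min(dmax, remaining)
        for i in range(N):
            total, seg_lik, cdf = 0.0, 1.0, 0.0
            for d in range(1, limit + 1):
                seg_lik *= lik[t + d - 1][i]     # likelihood of y_{t+1:t+d} under state i
                p_d = dur_pmf[i][d] if d < len(dur_pmf[i]) else 0.0
                cdf += p_d
                total += B[t + d][i] * p_d * seg_lik
            if limit == remaining:
                # Censoring term: survival function times likelihood of y_{t+1:T}.
                total += max(0.0, 1.0 - cdf) * seg_lik
            Bstar[t][i] = total
        for i in range(N):
            B[t][i] = sum(trans[i][j] * Bstar[t][j] for j in range(N))
    return B, Bstar
```

The inner loop over `d` runs at most `dmax` times per time step and state, which is where the $T d_{\max} N$ term in the cost comes from; the final sum over `j` contributes the $T N^2$ term.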
Figure 3: Graphical model for the HDP-HMM.
The generative process HDP-HMM(γ, α, H) given concentration parameters γ, α > 0 and base
measure (observation prior) H can be summarized as:
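\begin{align*}
\beta &\sim \mathrm{GEM}(\gamma), \\
\pi_i &\overset{\text{iid}}{\sim} \mathrm{DP}(\alpha, \beta), \quad \theta_i \overset{\text{iid}}{\sim} H, & i &= 1, 2, \ldots, \\
x_t &\sim \pi_{x_{t-1}}, \\
y_t &\sim f(\theta_{x_t}), & t &= 1, 2, \ldots, T,
\end{align*}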
where GEM denotes a stick breaking process (Sethuraman, 1994) and f denotes an observation
distribution parameterized by draws from H. We set x1 := 1. We have also suppressed explicit
conditioning from the notation. See Figure 3 for a graphical model.
The HDP plays the role of a prior over infinite transition matrices: each πj is a DP draw and
is interpreted as the transition distribution from state j. The πj are linked by being DP draws
parameterized by the same discrete measure β, thus E[πj ] = β and the transition distributions tend
to have their mass concentrated around a typical set of states, providing the desired bias towards
re-entering and re-using a consistent set of states.
The Chinese Restaurant Franchise and direct-assignment collapsed sampling methods described
in Teh et al. (2006) and Fox (2009) are approximate inference algorithms for the full
infinite-dimensional HDP, but they have a particular weakness in the sequential-data context of the HDP-HMM:
each state transition must be re-sampled individually, and strong correlations within the state se-
quence significantly reduce mixing rates (Fox, 2009). As a result, finite approximations to the HDP
have been studied for the purpose of providing faster mixing. Of particular note is the popular
weak limit approximation, used in Fox et al. (2008), which has been shown to reduce mixing times
for HDP-HMM inference while sacrificing little of the “tail” of the infinite transition matrix. In
this paper, we describe how the HDP-HSMM with geometric durations can provide an HDP-HMM
sampling inference algorithm that maintains the “full” infinite-dimensional sampling process while
mitigating the detrimental mixing effects due to the strong correlations in the state sequence, thus
providing a new alternative to existing HDP-HMM sampling methods.
The Sticky HDP-HMM augments the HDP-HMM with an extra parameter κ > 0 that biases
the process towards self-transitions and thus provides a method to encourage longer state durations.
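The Sticky HDP-HMM generative process can be written as
\begin{align*}
\beta &\sim \mathrm{GEM}(\gamma), \\
\pi_i &\sim \mathrm{DP}(\alpha + \kappa, \beta + \kappa\delta_i), \quad \theta_i \overset{\text{iid}}{\sim} H, & i &= 1, 2, \ldots, \\
x_t &\sim \pi_{x_{t-1}}, \\
y_t &\sim f(\theta_{x_t}), & t &= 1, 2, \ldots, T,
\end{align*}
where $\delta_i$ denotes an indicator function that takes value 1 at index $i$ and 0 elsewhere. While the
Sticky HDP-HMM allows some control over duration statistics, the state duration distributions remain
geometric; the goal of this work is to provide a model in which any duration distributions may
be used.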
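To make the geometric restriction concrete, consider any HMM (sticky or not) in which state $i$ has self-transition probability $\pi_{ii}$. Once the state is entered, the chain stays for exactly $d$ steps by making $d - 1$ self-transitions and then leaving, so the duration $D$ satisfies the standard identity below (the symbol $D$ for a generic duration is notation introduced for this example):

```latex
p(D = d \mid x = i) = \pi_{ii}^{\,d-1}\,(1 - \pi_{ii}), \quad d = 1, 2, \ldots,
\qquad
\mathbb{E}[D \mid x = i] = \frac{1}{1 - \pi_{ii}}.
```

A self-transition bias can therefore lengthen the mean duration by pushing $\pi_{ii}$ toward one, but the shape of the distribution stays geometric, with its mode at $d = 1$.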
3. Models
In this section, we introduce the explicit-duration HSMM-based models that we use in the remainder
of the paper. We define the finite Bayesian HSMM and the HDP-HSMM and show how they can be
used as components in more complex models, such as in a factorial structure. We describe generative
processes that do not allow self-transitions in the state sequence, but we emphasize that we can also
allow self-transitions and still employ the inference algorithms we describe; in fact, allowing self-
transitions simplifies inference in the HDP-HSMM, since complications arise as a result of the
hierarchical prior and an elimination of self-transitions. However, there is a clear modeling gain by
eliminating self-transitions: when self-transitions are allowed, the “explicit duration distributions”
do not model the state duration statistics directly. To allow direct modeling of state durations, we
must consider the case where self-transitions do not occur.
We do not investigate here the problem of selecting particular observation and duration distribu-
tion classes; model selection is a fundamental challenge in generative modeling, and models must
be chosen to capture structure in any particular data. Instead, we provide the HDP-HSMM and re-
lated models as tools in which modeling choices (such as the selection of observation and duration
distribution classes to fit particular data) can be made flexibly and naturally.
The finite Bayesian HSMM is a combination of the Bayesian HMM approach with semi-Markov
state durations and is the model we generalize to the HDP-HSMM. Some forms of finite Bayesian
HSMMs have been described previously, such as in Hashimoto et al. (2009) which treats observation
parameters as Bayesian latent variables, but to the best of our knowledge the first fully Bayesian
treatment of all latent components of the HSMM was given in Johnson and Willsky (2010) and later
independently in Dewar et al. (2012), which allows self-transitions.
It is instructive to compare this construction with that of the finite model used in the weak-limit
HDP-HSMM sampler that will be described in Section 4.2, since in that case the hierarchical ties
between rows of the transition matrix require particular care.
The generative process for a Bayesian HSMM with N states and observation and duration pa-
rameter prior distributions of H and G, respectively, can be summarized as
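\begin{align*}
\pi_i &\overset{\text{iid}}{\sim} \mathrm{Dir}(\alpha(1 - \delta_i)), \quad (\theta_i, \omega_i) \overset{\text{iid}}{\sim} H \times G, & i &= 1, 2, \ldots, N, \\
z_s &\sim \pi_{z_{s-1}}, \\
D_s &\sim g(\omega_{z_s}), & s &= 1, 2, \ldots, \\
x_{t_s^1:t_s^2} &= z_s, \\
y_{t_s^1:t_s^2} &\overset{\text{iid}}{\sim} f(\theta_{z_s}), & t_s^1 &= \sum_{\bar{s} < s} D_{\bar{s}}, \quad t_s^2 = t_s^1 + D_s - 1,
\end{align*}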
where $f$ and $g$ denote observation and duration distributions parameterized by draws from $H$ and
$G$, respectively. The indices $t_s^1$ and $t_s^2$ denote the first and last index of segment $s$, respectively, and
$x_{t_s^1:t_s^2} := (x_{t_s^1}, x_{t_s^1+1}, \ldots, x_{t_s^2})$. We use $\mathrm{Dir}(\alpha(1 - \delta_i))$ to denote a symmetric Dirichlet distribution
with parameter $\alpha$ except with the $i$th component of the hyperparameter vector set to zero, hence
fixing $\pi_{ii} = 0$ and ensuring there will be no self-transitions sampled in the super-state sequence
$(z_s)$. We also define the label sequence $(x_t)$ for convenience; the pair $(z_s, D_s)$ is the run-length
encoding of $(x_t)$. The process as written generates an infinite sequence of observations; we observe
a finite prefix of size $T$.
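The correspondence between the label sequence and the super-state and duration pairs is an ordinary run-length encoding, which the following sketch illustrates. The function names `run_length_encode` and `run_length_decode` are chosen for this example and do not come from the paper.

```python
from itertools import groupby

def run_length_encode(labels):
    """Return [(z_s, D_s)]: each maximal run of equal labels with its length."""
    return [(state, sum(1 for _ in run)) for state, run in groupby(labels)]

def run_length_decode(segments):
    """Expand [(z_s, D_s)] back into the per-time-step label sequence."""
    return [state for state, d in segments for _ in range(d)]
```

Because consecutive runs always carry different labels, the encoding is unique exactly when the super-state sequence has no self-transitions, which matches the modeling choice made above.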
Note, crucially, that in this definition the πi are not tied across various i. In the HDP-HSMM,
as well as the weak limit model used for approximate inference in the HDP-HSMM, the πi will
be tied through the hierarchical prior (specifically via β), and that connection is necessary to pe-
nalize the total number of states and encourage a small, consistent set of states to be visited in
the state sequence. However, the interaction between the hierarchical prior and the elimination of
self-transitions presents an inference challenge.
3.2 HDP-HSMM
The generative process of the HDP-HSMM is similar to that of the HDP-HMM (as described in,
for example, Fox et al. (2008)), with some extra work to include duration distributions. The process
HDP-HSMM(γ, α, H, G), illustrated in Figure 4, can be written
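\begin{align*}
\beta &\sim \mathrm{GEM}(\gamma), \\
\pi_i &\overset{\text{iid}}{\sim} \mathrm{DP}(\alpha, \beta), \quad (\theta_i, \omega_i) \overset{\text{iid}}{\sim} H \times G, & i &= 1, 2, \ldots, \\
z_s &\sim \bar{\pi}_{z_{s-1}}, \\
D_s &\sim g(\omega_{z_s}), & s &= 1, 2, \ldots, \\
x_{t_s^1:t_s^2} &= z_s, \\
y_{t_s^1:t_s^2} &\overset{\text{iid}}{\sim} f(\theta_{x_t}), & t_s^1 &= \sum_{\bar{s} < s} D_{\bar{s}}, \quad t_s^2 = t_s^1 + D_s - 1,
\end{align*}
where we use $\bar{\pi}_{ij} := \frac{\pi_{ij}}{1 - \pi_{ii}}(1 - \delta_{ij})$ to eliminate self-transitions in the super-state sequence $(z_s)$. As
with the finite HSMM, we define the label sequence $(x_t)$ for convenience. We observe a finite prefix
of size $T$ of the observation sequence.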
Note that the atoms we edit to eliminate self-transitions are the same atoms that are affected by
the global sticky bias in the Sticky HDP-HMM.
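The deterministic map from each row of the transition matrix to its version without self-transitions can be written out directly. The sketch below assumes a finite truncation of the transition matrix stored as a list of rows; the function name `remove_self_transitions` is introduced only for this example.

```python
def remove_self_transitions(pi):
    """Zero the diagonal of each row and renormalize by 1 - pi[i][i]."""
    pibar = []
    for i, row in enumerate(pi):
        rest = 1.0 - row[i]
        if rest <= 0.0:
            raise ValueError(f"row {i} puts all mass on the self-transition")
        pibar.append([0.0 if j == i else p / rest for j, p in enumerate(row)])
    return pibar
```

Each output row is again a probability vector, and the only entries that change in relative terms are those on the diagonal, the same atoms that a sticky bias would inflate.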
Figure 4: A graphical model for the HDP-HSMM in which the number of segments S, and hence
the number of nodes, is random.
\begin{align*}
y^{(k)} &\sim \text{HDP-HSMM}(\alpha_k, \gamma_k, H_k, G_k), & k &= 1, 2, \ldots, K, \\
\bar{y}_t &:= \sum_{k=1}^{K} y_t^{(k)} + w_t, & t &= 1, 2, \ldots, T,
\end{align*}
where $w_t$ is a noise process independent of the other components of the model states.
A graphical model for a factorial HMM can be seen in Figure 5, and a factorial HSMM or fac-
torial HDP-HSMM simply replaces the hidden state chains with semi-Markov chains. Each chain,
indexed by superscripts, evolves with independent dynamics and produces independent emissions,
but the observations are combinations of the independent emissions. Note that each component
HSMM is not restricted to any fixed number of states.
Such factorial models are natural ways to frame source separation or disaggregation problems,
which require identifying component emissions and component states. With the Bayesian frame-
work, we also model uncertainty and ambiguity in such a separation. In Section 5.2 we demonstrate
the use of a factorial HDP-HSMM for the task of disaggregating home power signals.
Figure 5: A graphical model for the factorial HMM, which can naturally be extended to factorial
structures involving the HSMM or HDP-HSMM.
Problems in source separation or disaggregation are often ill-conditioned, and so one relies on
prior information in addition to the source independence structure to solve the separation problem.
Furthermore, representation of uncertainty is often important, since there may be several good ex-
planations for the data. These considerations motivate Bayesian inference as well as direct modeling
of state duration statistics.
4. Inference Algorithms
We describe three Gibbs sampling inference algorithms, beginning with a sampling algorithm for
the finite Bayesian HSMM, which is built upon in developing algorithms for the HDP-HSMM in
the sequel. Next, we develop a weak-limit Gibbs sampling algorithm for the HDP-HSMM, which
parallels the popular weak-limit sampler for the HDP-HMM and its sticky extension. Finally, we
introduce a collapsed sampler which parallels the direct assignment sampler of Teh et al. (2006).
For both of the HDP-HSMM samplers there is a loss of conjugacy with the HDP prior due to
the fact that self-transitions in the super-state sequence are not permitted (see Section 4.2.1). We
develop auxiliary variables to form an augmented representation that effectively recovers conjugacy
and hence enables fast Gibbs steps.
In comparing the weak limit and direct assignment sampler, the most important trade-offs are
that the direct assignment sampler works with the infinite model by integrating out the transition
matrix π while simplifying bookkeeping by maintaining part of β; it also collapses the observation
and duration parameters. However, the variables in the label sequence (xt ) are coupled by the inte-
gration, and hence each element of the label sequence must be resampled sequentially. In contrast,
the weak limit sampler represents all latent components of the model (up to an adjustable finite
approximation for the HDP) and thus allows block resampling of the label sequence by exploiting
HSMM message passing.
We end the section with a discussion of leveraging changepoint side-information to greatly
accelerate inference.
by drawing samples from the distribution, where G represents the prior over duration parameters.
We can construct these samples by following a Gibbs sampling algorithm in which we iteratively
sample from the appropriate conditional distributions of (xt ), {πi }, {ωi }, and {θi }.
Sampling {θi } or {ωi } from their respective conditional distributions can be easily reduced to
standard problems depending on the particular priors chosen. Sampling the transition matrix rows
{πi } is straightforward if the prior on each row is Dirichlet over the off-diagonal entries and so
we do not discuss it in this section, but we note that when the rows are tied together hierarchically
(as in the weak-limit approximation to the HDP-HSMM), resampling the {πi } correctly requires
particular care (see Section 4.2.1).
Sampling (xt )|{θi }, {πi }, (yt ) in a finite Bayesian Hidden semi-Markov Model was first intro-
duced in Johnson and Willsky (2010) and, in independent work, later in Dewar et al. (2012). In
the following section we develop the algorithm for block-sampling the state sequence (xt ) from its
conditional distribution by employing the HSMM message-passing scheme.
where we have used the assumption that the observation sequence begins on a segment boundary
(F0 = 1) and suppressed notation for conditioning on parameters.
We can also use the messages to efficiently draw a sample from the posterior duration distribu-
tion for the sampled initial state. Conditioning on the initial state draw, x̄1 , the posterior duration of
the first state is:
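\begin{align*}
p(D_1 = d \mid y_{1:T}, x_1 = \bar{x}_1, F_0 = 1)
&= \frac{p(D_1 = d, y_{1:T} \mid x_1 = \bar{x}_1, F_0)}{p(y_{1:T} \mid x_1 = \bar{x}_1, F_0)} \\
&= \frac{p(D_1 = d \mid x_1 = \bar{x}_1, F_0)\, p(y_{1:d} \mid D_1 = d, x_1 = \bar{x}_1, F_0)\, p(y_{d+1:T} \mid D_1 = d, x_1 = \bar{x}_1, F_0)}{p(y_{1:T} \mid x_1 = \bar{x}_1, F_0)} \\
&= \frac{p(D_1 = d)\, p(y_{1:d} \mid D_1 = d, x_1 = \bar{x}_1, F_0 = 1)\, B_d(\bar{x}_1)}{B_0^*(\bar{x}_1)}.
\end{align*}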
We repeat the process by using $x_{D_1+1}$ as our new initial state with initial distribution
$p(x_{D_1+1} = i \mid x_1 = \bar{x}_1)$, and thus draw a block sample for the entire label sequence.
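The block-sampling pass described above can be sketched as follows: compute the backward messages once, then alternate between drawing a state and drawing its duration until the sequence is filled. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `backward_messages` and `sample_labels`, the list-based inputs, the use of plain probabilities instead of log probabilities, and the handling of the censored final segment as one extra outcome in the duration draw are all choices made for this example.

```python
import random

def backward_messages(trans, dur_pmf, lik):
    """B[t][i] and Bstar[t][i] for a right-censored HSMM (plain probabilities)."""
    T, N = len(lik), len(trans)
    B = [[1.0] * N for _ in range(T + 1)]
    Bstar = [[0.0] * N for _ in range(T + 1)]
    for t in range(T - 1, -1, -1):
        for i in range(N):
            total, seg_lik, cdf = 0.0, 1.0, 0.0
            for d in range(1, T - t + 1):
                seg_lik *= lik[t + d - 1][i]
                p_d = dur_pmf[i][d] if d < len(dur_pmf[i]) else 0.0
                cdf += p_d
                total += B[t + d][i] * p_d * seg_lik
            Bstar[t][i] = total + max(0.0, 1.0 - cdf) * seg_lik
        for i in range(N):
            B[t][i] = sum(trans[i][j] * Bstar[t][j] for j in range(N))
    return B, Bstar

def sample_labels(init, trans, dur_pmf, lik, rng=random):
    """Block-sample the label sequence given parameters and likelihoods lik[t][i]."""
    T, N = len(lik), len(trans)
    B, Bstar = backward_messages(trans, dur_pmf, lik)
    labels, t, state_dist = [], 0, init
    while t < T:
        # State posterior: (initial or transition probability) times Bstar_t(i).
        w = [state_dist[i] * Bstar[t][i] for i in range(N)]
        state = rng.choices(range(N), weights=w)[0]
        # Duration posterior over d = 1..T-t, plus one outcome for a censored segment.
        weights, seg_lik, cdf = [], 1.0, 0.0
        for d in range(1, T - t + 1):
            seg_lik *= lik[t + d - 1][state]
            p_d = dur_pmf[state][d] if d < len(dur_pmf[state]) else 0.0
            cdf += p_d
            weights.append(p_d * seg_lik * B[t + d][state])
        weights.append(max(0.0, 1.0 - cdf) * seg_lik)
        d = rng.choices(range(1, T - t + 2), weights=weights)[0]
        d = min(d, T - t)              # a censored segment fills the remaining steps
        labels.extend([state] * d)
        t += d
        state_dist = trans[state]      # next state is drawn from the transition row
    return labels
```

The unnormalized duration weights for a given state sum to the corresponding `Bstar` entry, which mirrors the normalization by $B_0^*(\bar{x}_1)$ in the posterior duration expression.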
The weak-limit sampler for an HDP-HMM (Fox et al., 2008) constructs a finite approximation to
the HDP transitions prior with finite L-dimensional Dirichlet distributions, motivated by the fact
that the infinite limit of such a construction converges in distribution to a true HDP:
where we again interpret πi as the transition distribution for state i and β as the distribution which
ties state distributions together and encourages shared sparsity. Practically, the weak limit approxi-
mation enables the complete representation of the transition matrix in a finite form, and thus, when
we also represent all parameters, allows block sampling of the entire label sequence at once, result-
ing in greatly accelerated mixing in many circumstances. The parameter L gives us control over
the approximation, with the guarantee that the approximation will become exact as L grows; see
Ishwaran and Zarepour (2000), especially Theorem 1, for a discussion of theoretical guarantees.
Note that the weak limit approximation is more convenient for us than the truncated stick-breaking
approximation because it directly models the state transition probabilities, while stick lengths in the
HDP do not directly represent state transition probabilities because multiple sticks in constructing
πi can be sampled at the same atom of β.
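As a concrete illustration of a weak-limit prior, the sketch below draws a finite transition matrix using the commonly stated form of the approximation, $\beta \sim \mathrm{Dir}(\gamma/L, \ldots, \gamma/L)$ and $\pi_i \mid \beta \sim \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_L)$ for $i = 1, \ldots, L$. That form is assumed here rather than quoted from the text above, and the function names and the small numerical floor on the Dirichlet parameters are choices made for this example.

```python
import random

def sample_dirichlet(alphas, rng=random):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def sample_weak_limit_transitions(gamma, alpha, L, rng=random):
    """Draw (beta, pi) from the assumed L-dimensional weak-limit prior."""
    beta = sample_dirichlet([gamma / L] * L, rng)
    # Floor the shape parameters so gammavariate never receives zero (assumption).
    floor = 1e-12
    pi = [sample_dirichlet([max(alpha * b, floor) for b in beta], rng)
          for _ in range(L)]
    return beta, pi
```

Every row of `pi` is centered on the same `beta`, so components of `beta` that are close to zero receive little mass in all rows, which is how the shared sparsity described above arises in the finite model.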
We can employ the weak limit approximation to create a finite HSMM that approximates infer-
ence in the HDP-HSMM. This approximation technique often results in greatly accelerated mixing,
and hence it is the technique we employ for the experiments in the sequel. However, the infer-
ence algorithm of Section 4.1 must be modified to incorporate the fact that the {πi } are no longer
mutually independent and are instead tied through the shared β. This dependence between the tran-
sition rows introduces potential conjugacy issues with the hierarchical Dirichlet prior; the following
section explains the difficulty as well as a clean solution via auxiliary variables.
The beam sampling technique (Van Gael et al., 2008) can be applied here with little modifica-
tion, as in Dewar et al. (2012), to sample over the approximation parameter L, thus avoiding the
need to set L a priori while still allowing instantiation of the transition matrix and block sampling
of the state sequence. This technique is especially useful if the number of states could be very large
and is difficult to bound a priori. We do not explore beam sampling here.
To construct our overall Gibbs sampler, we need to be able to easily resample the transition matrix π
given the other components of the model. However, by ruling out self-transitions while maintaining
a hierarchical link between the transition rows, the model is no longer fully conjugate, and hence
resampling is not necessarily easy. To observe the loss of conjugacy using the hierarchical prior
required in the weak-limit approximation, note that we can summarize the relevant portion of the
generative model as
where $\bar{\pi}_j$ represents $\pi_j$ with the $j$th component removed and renormalized appropriately:
\[
\bar{\pi}_{ji} = \frac{\pi_{ji}(1 - \delta_{ij})}{1 - \pi_{jj}}
\]
with $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. The deterministic transformation from $\pi_j$ to $\bar{\pi}_j$ eliminates
self-transitions. Note that we have suppressed the observation parameter set, duration parameter set,
and observation sequence sampling for simplicity.
Consider the distribution of π1 |(xt ), β, the first row of the transition matrix:
where $n_{ij}$ is the number of transitions from state $i$ to state $j$ in the state sequence $(x_t)$. Essentially,
because of the extra $\frac{1}{1 - \pi_{11}}$ terms from the likelihood without self-transitions, we cannot reduce this
expression to the Dirichlet form over the components of $\pi_1$, and therefore we cannot proceed with
sampling $m$ and resampling $\beta$ and $\pi$ as in Teh et al. (2006).
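The obstruction can be made explicit with a short derivation. The display below is a reconstruction from the surrounding definitions, not a quotation: it assumes a finite $L$-dimensional prior $\pi_1 \mid \beta \sim \mathrm{Dir}(\alpha\beta_1, \ldots, \alpha\beta_L)$ and transitions out of state 1 drawn from the renormalized row $\bar{\pi}_1$.

```latex
p(\pi_1 \mid (x_t), \beta)
\;\propto\; p(\pi_1 \mid \beta)\, p((x_t) \mid \pi_1)
\;\propto\; \prod_{j=1}^{L} \pi_{1j}^{\alpha\beta_j - 1}
            \prod_{j \neq 1} \left( \frac{\pi_{1j}}{1 - \pi_{11}} \right)^{n_{1j}}
\;=\; \pi_{11}^{\alpha\beta_1 - 1}\,
      (1 - \pi_{11})^{-\sum_{j \neq 1} n_{1j}}
      \prod_{j \neq 1} \pi_{1j}^{\alpha\beta_j + n_{1j} - 1}.
```

Without the factor $(1 - \pi_{11})^{-\sum_{j \neq 1} n_{1j}}$ this would be a Dirichlet density with parameters $\alpha\beta_j + n_{1j}$; with it, the density no longer has the form $\prod_j \pi_{1j}^{a_j - 1}$ for any choice of $a_j$.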
However, we can introduce auxiliary variables to recover conjugacy, following the general data
augmentation technique described in Van Dyk and Meng (2001). We define an extended generative
model with extra random variables, and then show through simple manipulations that conditional
distributions simplify with the additional variables, hence allowing us to cycle simple Gibbs updates
to produce a sampler.
For simplicity, we focus on the first row of the transition matrix, namely π1 , and the draws that
depend on it; the reasoning easily extends to the other rows. We also drop the parameter α for
convenience. First, we write the relevant portion of the generative process as
π1 |β ∼ Dir(β),
zi |π̄1 ∼ π̄1 i = 1, . . . , n,
yi |zi ∼ f (zi ) i = 1, . . . , n.
Here, n counts the total number of transitions out of state 1 and the {zi } represent the transitions
out of state 1 to a specific state: sampling zi = k represents a transition from state 1 to state k. The
{yi } represent the observations on which we condition; in particular, if we have zi = k then the yi
corresponds to an emission from state k in the HSMM. See the graphical model in Figure 6(a) for a
depiction of the relationship between the variables.
We can introduce auxiliary variables $\{\rho_i\}_{i=1}^n$, where each $\rho_i$ is independently drawn from a
geometric distribution supported on $\{0, 1, \ldots\}$ with success parameter $1 - \pi_{11}$: $\rho_i \mid \pi_{11} \sim \mathrm{Geo}(1 - \pi_{11})$.
Figure 6: Simplified depiction of the relationship between the auxiliary variables and the rest of
the model; 6(a) depicts the nonconjugate setting and 6(b) shows the introduced auxiliary
variables {ρi }.
Figure 7: Empirical sample chain autocorrelation for the first component of π for both the proposed
auxiliary variable sampler and a Metropolis-Hastings sampler. The rapidly diminishing
autocorrelation for the auxiliary variable sampler is indicative of fast mixing.
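For readers who want to reproduce this kind of diagnostic, the empirical autocorrelation of a scalar sample chain can be computed as in the sketch below. The function name and the normalization by the lag-zero sum of squares are assumptions; the exact estimator used for the figure is not specified here.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Empirical autocorrelation of a 1-D chain at lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)  # lag-0 sum of squares, so the lag-0 value is 1
    return np.array([np.dot(x[:len(x) - lag], x[lag:]) / denom
                     for lag in range(max_lag + 1)])

# A strongly anti-correlated toy chain: values alternate between +1 and -1.
toy_chain = [1.0, -1.0] * 50
acf = autocorrelation(toy_chain, max_lag=2)
```

A sampler that mixes quickly produces values that fall toward zero within a few lags.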
Figure 8: Multivariate Potential Scale Reduction Factors for both the proposed auxiliary variable
sampler and a Metropolis-Hastings sampler. The auxiliary variable sampler rapidly
achieves the statistic’s asymptotic value of unity. Note that the auxiliary variable sam-
pler is also much more efficient to execute, as shown in 8(b).
Figure 9: Graphical model for the weak-limit approximation including auxiliary variables.
in cases where there is insufficient data to inform some latent parameters so that marginalization
is necessary for mixing or estimating marginal likelihoods (such as in some topic models). As
mentioned previously, in the direct assignment sampler for the HDP-HMM the infinite transition
matrix π is analytically marginalized out along with the observation parameters (if conjugate priors
are used). The sampler represents explicit instantiations of the state sequence (xt ) and the “used”
prefix of the infinite vector β: β1:K where K = #{xt : t = 1, . . . , T }. There are also auxiliary
variables m used to resample β, but for simplicity we do not discuss them here; see Teh et al.
(2006) for details.
Our DA sampler additionally represents the auxiliary variables necessary to recover HDP con-
jugacy (as introduced in the previous section). Note that the requirement for, and correctness of, the
auxiliary variables described in the finite setting in Section 4.2.1 immediately extends to the infinite
setting as a consequence of the Dirichlet Process’s definition in terms of the finite Dirichlet distri-
bution and the Kolmogorov extension theorem (Çinlar, 2010, Chapter 4); for a detailed discussion,
see Orbanz (2009). The connection to the finite case can also be seen in the sampling steps of the
direct assignment sampler for the HDP-HMM, in which the global weights β over K instantiated
components are resampled according to (β1:K , βrest )|α, (xt ) ∼ Dir(α + n1 , . . . , α + nK , α) where ni
is the number of transitions into state i and Dir is the finite Dirichlet distribution.
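As a small illustration of the finite Dirichlet draw just described, the sketch below samples (β1:K , βrest ) using the parameterization exactly as written in the text. The function name and the example counts are hypothetical.

```python
import numpy as np

def resample_global_weights(alpha, transitions_into, rng):
    """Draw (beta_1..beta_K, beta_rest) ~ Dir(alpha + n_1, ..., alpha + n_K, alpha)."""
    n = np.asarray(transitions_into, dtype=float)
    params = np.append(alpha + n, alpha)  # last entry is the mass for unused states
    return rng.dirichlet(params)

rng = np.random.default_rng(1)
weights = resample_global_weights(alpha=1.0, transitions_into=[5, 0, 2], rng=rng)
```

The returned vector has K + 1 entries, with the last entry holding the weight of all uninstantiated components.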
As described in Fox (2009), the basic HDP-HMM DA sampling step for each element xt of the
label sequence is to sample a new label k with probability proportional (over k) to
We can derive this step by writing the complete joint probability p((xt ), (yt )|β, H) leveraging
exchangeability; this joint probability value is proportional to the desired posterior probability
p(xt |(x\t ), (yt ), β, H). When we consider each possible assignment xt = k, we can cancel all the
terms that are invariant over k, namely all the transition probabilities other than those to and from
xt and all data likelihoods other than that for yt . However, this cancellation process relies on the
fact that for the HDP-HMM there is no distinction between self-transitions and new transitions: the
term for each t in the complete posterior simply involves transition scores no matter the labels of
xt+1 and xt−1 . In the HDP-HSMM case, we must consider segments and their durations separately
from transitions.
To derive an expression for resampling xt in the case of the HDP-HSMM, we can similarly
consider writing out an expression for the joint probability p((xt ), (yt )|β, H, G), but we notice that
as we vary our assignment of xt over k, the terms in the expression must change: if xt−1 = k̄ or
xt+1 = k̄, the probability expression includes a segment term for the entire contiguous run of label k̄.
Hence, since we can only cancel terms that are invariant over k, our score expression must include
terms for the adjacent segments into which xt may merge. See Figure 10 for an illustration.
The final expression for the probability of sampling the new value of xt to be k then consists of
between 1 and 3 segment score terms, depending on merges with adjacent segments, each of which
has the form
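$$p(x_t = k \mid (x_{\setminus t}), \beta, H, G) \propto \underbrace{\frac{\alpha\beta_k + n_{x_{\text{prev}},k}}{\alpha(1-\beta_{x_{\text{prev}}}) + n_{x_{\text{prev}},\cdot}}}_{\text{left-transition}} \cdot \underbrace{\frac{\alpha\beta_{x_{\text{next}}} + n_{k,x_{\text{next}}}}{\alpha(1-\beta_k) + n_{k,\cdot}}}_{\text{right-transition}} \cdot \underbrace{f_{\text{dur}}(t^2 - t^1 + 1)}_{\text{duration}} \cdot \underbrace{f_{\text{obs}}(y_{t^1:t^2} \mid k)}_{\text{observation}},$$
where we have used $t^1$ and $t^2$ to denote the first and last indices of the segment, respectively.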
Transition scores at the start and end of the chain are not included.
The function fdur (d|k) is the corresponding duration predictive likelihood evaluated on a dura-
tion d, which depends on the durations of other segments with label k and any duration hyperpa-
rameters. The function fobs now represents a block or joint predictive likelihood over all the data in
a segment (see, for example, Murphy (2007) for a thorough discussion of the Gaussian case). Note
that the denominators in the transition terms are affected by the elimination of self-transitions by a
rescaling of the “total mass.” The resulting chain is ergodic if the duration predictive score fdur has a
support that includes {1, 2, . . . , dmax }, so that segments can be split and merged in any combination.
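The structure of a single segment score can be sketched as follows. This is an illustrative sketch only: the function names, the callables `f_dur` and `f_obs` (placeholders for the duration and block observation predictive likelihoods), and the toy counts are assumptions, and the count matrix is taken to exclude self-transitions.

```python
import numpy as np

def transition_score(alpha, beta, counts, src, dst):
    """(alpha*beta[dst] + n[src,dst]) / (alpha*(1 - beta[src]) + n[src,.])."""
    num = alpha * beta[dst] + counts[src, dst]
    den = alpha * (1.0 - beta[src]) + counts[src].sum()
    return num / den

def segment_score(alpha, beta, counts, prev, k, nxt, duration, f_dur, f_obs):
    """Unnormalized score of one segment with label k.

    prev / nxt are the neighboring segment labels, or None at the chain ends,
    in which case the corresponding transition term is omitted.
    """
    score = f_dur(duration, k) * f_obs(k)
    if prev is not None:
        score *= transition_score(alpha, beta, counts, prev, k)
    if nxt is not None:
        score *= transition_score(alpha, beta, counts, k, nxt)
    return score

# Toy example with three states and hypothetical transition counts.
beta = np.array([0.5, 0.3, 0.2])
counts = np.array([[0, 3, 1], [2, 0, 2], [1, 1, 0]])
s = segment_score(2.0, beta, counts, prev=0, k=1, nxt=2, duration=4,
                  f_dur=lambda d, k: 1.0, f_obs=lambda k: 1.0)
```

In a full sampler the score for a candidate label would be the product of between one and three such terms, and in practice the computation would be carried out in log space to avoid underflow on long segments.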
2. The indicator variables are present because the two transition probabilities are not independent but rather exchange-
able.
Figure 10: Illustration of the Gibbs step to resample xt for the DA sampler for the HDP-HSMM.
The red dashed boxes indicate the elements of the label sequence that contribute to the
score computation for k = 1, 2, 3 which produce two, three, and two segment terms,
respectively. The label sequence element being resampled is emphasized in bold.
eration. For example, in the power disaggregation application in Section 5, we can run inexpensive
changepoint detection on the observations to get a list of possible changepoints, ruling out many
obvious non-changepoints. The possible changepoints divide the label sequence into state blocks,
where within each block the label sequence must be constant, though sequential blocks may have
the same label. By only allowing super-state switching to occur at these detected changepoints, we
can greatly reduce the computation of all the samplers considered.
In the case of the weak-limit sampler, the complexity of the bottleneck message-passing step is
reduced to a function of the number of possible changepoints (instead of total sequence length): the
asymptotic complexity becomes O(Tchange^2 N + N^2 Tchange ), where Tchange , the number of possible
changepoints, may be dramatically smaller than the sequence length T . We simply modify the
backwards message-passing procedure to sum only over the possible durations:
where p̃ represents the duration distribution restricted to the set of possible durations D ⊂ N+ and
re-normalized. We similarly modify the forward-sampling procedure to only consider possible du-
rations. It is also clear how to adapt the DA sampler: instead of re-sampling each element of the
label sequence (xt ) we simply consider the block label sequence, resampling each block’s label
(allowing merging with adjacent blocks).
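The restriction and renormalization that defines p̃ can be sketched as below. The function name and the array layout (entry d − 1 holds the probability of duration d) are assumptions for illustration.

```python
import numpy as np

def restrict_durations(pmf, allowed):
    """Restrict a duration pmf to a set of allowed durations and renormalize.

    pmf[d - 1] is taken to be p(duration = d) for d = 1, ..., len(pmf).
    """
    pmf = np.asarray(pmf, dtype=float)
    idx = np.array([d - 1 for d in allowed if 1 <= d <= len(pmf)], dtype=int)
    mask = np.zeros(len(pmf), dtype=bool)
    mask[idx] = True
    restricted = np.where(mask, pmf, 0.0)
    total = restricted.sum()
    if total <= 0.0:
        raise ValueError("no probability mass on the allowed durations")
    return restricted / total

p_tilde = restrict_durations([0.1, 0.2, 0.3, 0.4], allowed={2, 4})
```

Only the entries of the result that correspond to possible durations are nonzero, so message passing needs to sum over those entries alone.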
5. Experiments
In this section, we evaluate the proposed HDP-HSMM sampling algorithms on both synthetic and
real data. First, we compare the HDP-HSMM direct assignment sampler to the weak limit sampler
as well as the Sticky HDP-HMM direct assignment sampler, showing that the HDP-HSMM direct
assignment sampler has similar performance to that for the Sticky HDP-HMM and that the weak
limit sampler is much faster. Next, we evaluate the HDP-HSMM weak limit sampler on synthetic
data generated from finite HSMMs and HMMs. We show that the HDP-HSMM applied to HSMM
data can efficiently learn the correct model, including the correct number of states and state labels,
while the HDP-HMM is unable to capture non-geometric duration statistics. We also apply the
HDP-HSMM to data generated by an HMM and demonstrate that, when equipped with a duration
distribution class that includes geometric durations, the HDP-HSMM can also efficiently learn an
HMM model when appropriate with little loss in efficiency. Next, we use the HDP-HSMM in
a factorial (Ghahramani and Jordan, 1997) structure for the purpose of disaggregating a whole-
home power signal into the power draws of individual devices. We show that encoding simple
duration prior information when modeling individual devices can greatly improve performance,
and further that a Bayesian treatment of the parameters is advantageous. We also demonstrate
how changepoint side-information can be leveraged to significantly speed up computation. The
Python code used to perform these experiments, as well as Matlab code, is available online at
http://github.com/mattjj/pyhsmm.
Figure 11: 11(a) compares the Geometric-HDP-HSMM direct assignment sampler with that of the
Sticky HDP-HMM, both applied to HMM data. The sticky parameter κ was chosen to
maximize mixing. 11(b) compares the HDP-HSMM direct assignment sampler with the
weak limit sampler. In all plots, solid lines are the median error at each time over 25
independent chains; dashed lines are 25th and 75th percentile errors.
Figure 12: State-sequence Hamming error of the HDP-HMM and Poisson-HDP-HSMM applied to
data from a Poisson-HSMM. In each plot, the blue line indicates the error of the chain
with the median error across 25 independent Gibbs chains, while the red dashed lines
indicate the chains with the 10th and 90th percentile errors at each iteration. The jumps
in the plot correspond to a change in the ranking of the 25 chains.
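A state-sequence Hamming error of this kind can be computed as in the following sketch. It assumes the error is the fraction of mismatched time steps after the best relabeling of the estimated states, and it searches relabelings by brute force, which is only practical for a handful of labels; both choices are assumptions for illustration and not necessarily the procedure used for the figure.

```python
from itertools import permutations
import numpy as np

def hamming_error(true_labels, est_labels):
    """Minimum normalized Hamming distance over relabelings of est_labels."""
    true_labels = np.asarray(true_labels)
    est_list = np.asarray(est_labels).tolist()
    labels = np.unique(np.concatenate([true_labels, est_list])).tolist()
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        relabeled = np.array([mapping[l] for l in est_list])
        best = min(best, float(np.mean(relabeled != true_labels)))
    return best

err_perfect = hamming_error([0, 0, 1, 1, 2], [1, 1, 0, 0, 2])
err_one_off = hamming_error([0, 0, 1, 1, 2], [1, 1, 0, 2, 2])
```

For larger label sets, an assignment solver would replace the brute-force search over permutations.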
Figure 13: Number of states inferred by the HDP-HMM and Poisson-HDP-HSMM applied to data
from a four-state Poisson-HSMM. In each plot, the blue line indicates the error of the
chain with the median error across 25 independent Gibbs chains, while the red dashed
lines indicate the chains with the 10th and 90th percentile errors at each iteration.
Figure 14: The HDP-HSMM and HDP-HMM applied to data from an HMM. In each plot, the blue
line indicates the error of the chain with the median error across 25 independent Gibbs
chains, while the red dashed lines indicate the chains with the 10th and 90th percentile
errors at each iteration.
model to learn geometric durations as well as significantly non-geometric distributions with modes
away from zero. Figure 14 shows a negative binomial HDP-HSMM learning an HMM model from
data generated from an HMM with four states. The observation distribution for each state is a 10-
dimensional Gaussian, again with parameters sampled i.i.d. from a NIW prior. The prior over r was
set to be uniform on {1, 2, . . . , 6}, and all other priors were chosen to be similarly non-informative.
The sampler chains quickly concentrated at r = 1 for all state duration distributions. There is only
a slight loss in mixing time for the HDP-HSMM compared to the HDP-HMM. This experiment
demonstrates that with the appropriate choice of duration distribution the HDP-HSMM can effec-
tively learn an HMM model.
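The role of r = 1 can be checked numerically. The sketch below assumes a negative binomial shifted to have support {1, 2, . . .} with pmf C(d + r − 2, d − 1)(1 − p)^r p^(d−1); this parameterization and the function names are assumptions for illustration and may differ from the one used in the experiments.

```python
from math import comb

def shifted_negbin_pmf(d, r, p):
    """P(duration = d) for d = 1, 2, ... under the assumed parameterization."""
    return comb(d + r - 2, d - 1) * (1 - p) ** r * p ** (d - 1)

def geometric_pmf(d, p):
    """Geometric pmf on {1, 2, ...}: the duration law of an HMM state with self-transition probability p."""
    return (1 - p) * p ** (d - 1)

p = 0.8
# With r = 1 the two pmfs agree for every duration.
max_gap = max(abs(shifted_negbin_pmf(d, 1, p) - geometric_pmf(d, p))
              for d in range(1, 50))
# With r = 3 the pmf initially increases, so the mode lies away from d = 1.
rises_at_start = shifted_negbin_pmf(2, 3, p) > shifted_negbin_pmf(1, 3, p)
```

Under this parameterization, a posterior over r that concentrates at r = 1 corresponds to recovering HMM-style geometric durations.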
other aspects of the problem, such as additional data features, and builds a very compelling complete
solution to the disaggregation problem, while we focus on the factorial time series modeling itself.
For our experiments, we used the REDD data set (Kolter and Johnson, 2011), which monitors
many homes at high frequency and for extended periods of time. We chose the top 5 power-drawing
devices (refrigerator, lighting, dishwasher, microwave, furnace) across several houses and identified
18 24-hour segments across 4 houses for which many (but not always all) of the devices switched on
at least once. We applied a 20-second median filter to the data, and each sequence is approximately
5000 samples long.
We constructed simple priors that set the rough power draw levels and duration statistics of the
modes for several devices. For example, the power draw from home lighting changes infrequently
and can have many different levels, so an HDP-HSMM with a bias towards longer negative-binomial
durations is appropriate. On the other hand, a refrigerator’s power draw cycle is very regular and
usually exhibits only three modes, so our priors biased the refrigerator HDP-HSMM to have fewer
modes and set the power levels accordingly. For details on our prior specification, see Appendix A.
We did not truncate the duration distributions during inference, and we set the weak limit approxi-
mation parameter L to be twice the number of expected modes for each device; for example, for the
refrigerator device we set L = 6 and for lighting we set L = 20. We performed sampling inference
independently on each observation sequence.
As a baseline for comparison, we also constructed a factorial sticky HDP-HMM (Fox et al.,
2008) with the same observation priors and with duration biases that induced the same average
mode durations as the corresponding HDP-HSMM priors. We also compare to the factorial HMM
performance presented in Kolter and Johnson (2011), which fit device models using an EM al-
gorithm on training data. For the Bayesian models, we performed inference separately on each
aggregate data signal.
The set of possible changepoints is easily identifiable in these data, and a primary task of the
model is to organize the jumps observed in the observations into an explanation in terms of the
individual device models. By simply computing first differences and thresholding, we are able to
reduce the number of potential changepoints we need to consider from 5000 to 100-200, and hence
we are able to speed up state sequence resampling by orders of magnitude. See Figure 15 for an
illustration.
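The first-differences-and-thresholding step can be sketched as follows. The function names, the threshold value, and the toy signal are hypothetical; the threshold actually used in the experiments is not stated here.

```python
import numpy as np

def detect_changepoints(signal, threshold):
    """Indices t at which |signal[t] - signal[t-1]| exceeds the threshold."""
    signal = np.asarray(signal, dtype=float)
    jumps = np.abs(np.diff(signal))
    return np.flatnonzero(jumps > threshold) + 1

def blocks_from_changepoints(changepoints, length):
    """Half-open (start, end) index pairs of the blocks between changepoints."""
    edges = np.concatenate(([0], changepoints, [length]))
    return list(zip(edges[:-1].tolist(), edges[1:].tolist()))

power = [0, 0, 0, 100, 100, 100, 20, 20]  # toy aggregate power signal in watts
cps = detect_changepoints(power, threshold=50)
blocks = blocks_from_changepoints(cps, len(power))
```

Within each returned block the label sequence is held constant, so the samplers only need to consider label switches at block boundaries.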
To measure performance, we used the error metric of Kolter and Johnson (2011):
$$\text{Acc.} = 1 - \frac{\sum_{t=1}^{T} \sum_{i=1}^{K} \left| \hat{y}_t^{(i)} - y_t^{(i)} \right|}{2 \sum_{t=1}^{T} \bar{y}_t}$$
where $\bar{y}_t$ refers to the observed total power consumption at time t, $y_t^{(i)}$ is the true power consumed at
time t by device i, and $\hat{y}_t^{(i)}$ is the estimated power consumption. We produced 20 posterior samples
for each model and report the median accuracy of the component emission means compared to
the ground truth provided in REDD. We ran our experiments on standard desktop machines (Intel
Core i7-920 CPUs, released Q4 2008), and a sequence with about 200 detected changepoints would
resample each component chain in 0.1 seconds, including block sampling the state sequence and
resampling all observation, duration, and transition parameters. We collected samples after every
50 such iterations.
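The accuracy metric is straightforward to compute from arrays of per-device power. The sketch below assumes arrays of shape (T, K) for the true and estimated device signals and a length-T array for the observed total; the function and variable names are hypothetical.

```python
import numpy as np

def disaggregation_accuracy(y_true, y_est, y_total):
    """1 - (sum of absolute per-device errors) / (2 * sum of observed total power)."""
    y_true = np.asarray(y_true, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    y_total = np.asarray(y_total, dtype=float)
    abs_err = np.abs(y_est - y_true).sum()
    return 1.0 - abs_err / (2.0 * y_total.sum())

y_true = [[1, 2], [3, 4]]   # T = 2 time steps, K = 2 devices
y_total = [3, 7]            # observed aggregate power at each time step
acc_exact = disaggregation_accuracy(y_true, y_true, y_total)
acc_off = disaggregation_accuracy(y_true, [[2, 2], [3, 3]], y_total)
```

A perfect reconstruction scores 1, and the factor of 2 in the denominator accounts for the fact that power misassigned between two devices is counted once for each device.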
Our overall results are summarized in Figure 16 and Table 1. Both Bayesian approaches im-
proved upon the EM-based approach because they allowed flexibility in the device models that
Figure 15: A total power observation sequence from the power disaggregation data set. Vertical
dotted red lines indicate changepoints detected with simple first differences. By using
the changepoint-based algorithms described in Section 4.4 we can greatly accelerate
inference speed for this application.
Figure 16: Overall accuracy comparison between the EM-trained FHMM of Kolter and Johnson
(2011), the factorial sticky HDP-HMM, and the factorial HDP-HSMM.
could be fit during inference, while the EM-based approach fixed device model parameters that
may not be consistent across homes. Furthermore, the incorporation of duration structure and prior
information provided a significant performance increase for the HDP-HSMM approach. Detailed
performance comparisons between the HDP-HMM and HDP-HSMM approaches can be seen in
Figure 17. Finally, Figures 18 and 19 show total power consumption estimates for the two models
on two selected data sequences.
We note that the nonparametric prior was very important for modeling the power consumption
due to lighting. Power modes arise from combinations of lights switched on in the user’s home, and
hence the number of levels that are observed is highly uncertain a priori. For the other devices the
number of power modes (and hence states) is not so uncertain, but duration statistics can provide a
Figure 17: Performance comparison between the HDP-HMM and HDP-HSMM approaches broken
down by data sequence.
Figure 18: Estimated total power consumption for a data sequence where the HDP-HSMM signifi-
cantly outperformed the HDP-HMM due to its modeling of duration regularities.
Figure 19: Estimated total power consumption for a data sequence where both the HDP-HMM and
HDP-HSMM approaches performed well.
strong clue for disaggregation; for these, the main advantage of our model is in providing Bayesian
inference and duration modeling.
6. Conclusion
We have developed the HDP-HSMM and two Gibbs sampling inference algorithms, the weak
limit and direct assignment samplers, uniting explicit-duration semi-Markov modeling with new
Bayesian nonparametric techniques. These models and algorithms not only allow learning from
complex sequential data with non-Markov duration statistics in supervised and unsupervised set-
tings, but also can be used as tools in constructing and performing inference in larger hierarchical
models. We have demonstrated the utility of the HDP-HSMM and the effectiveness of our inference
algorithms with real and synthetic experiments, and we believe these methods can be built upon to
provide new tools for many sequential learning problems.
Acknowledgments
The authors thank J. Zico Kolter, Emily Fox, and Ruslan Salakhutdinov for invaluable discussions
and advice. We also thank the anonymous reviewers for helpful fixes and suggestions. This work
was supported in part by a MURI through ARO Grant W911NF-06-1-0076, in part through a MURI
through AFOSR Grant FA9550-06-1-303, and in part by the National Science Foundation Graduate
Research Fellowship under Grant No. 1122374.
by first sampling one of the three sets of hyperparameters uniformly at random and then sampling
observation parameters using those hyperparameters.
A comprehensive summary of our prior settings for the Factorial HDP-HSMM is in Table 2.
Observation distributions were all Gaussian with state-specific latent means and fixed variances. We
use Gauss(µ0 , σ02 ; σ 2 ) to denote a Gaussian observation distribution prior with a fixed variance of
σ 2 and a prior over its mean parameter that is Gaussian distributed with mean µ0 and variance σ02 ;
that is, it denotes that a state’s mean parameter µ is sampled according to µ ∼ N (µ0 , σ02 ) and an
observation from that state is sampled from N (µ, σ 2 ). Similarly, we use NegBin(α, β; r) to denote
Negative Binomial duration distribution priors where a latent state-specific “success” parameter p
is drawn from p ∼ Beta(α, β) and the parameter r is fixed, so that state durations for that state are
then drawn from NegBin(p, r). (Note that choosing r = 1 yields the geometric duration class.)
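The two prior notations can be read as small sampling procedures, sketched below. The function names and hyperparameter values are hypothetical, and the negative binomial parameterization (durations shifted to start at 1, with NumPy's success probability set to 1 − p) is an assumption that may not match the convention used for Table 2.

```python
import numpy as np

def sample_gauss_state(mu0, sigma0_sq, sigma_sq, n_obs, rng):
    """Gauss(mu0, sigma0^2; sigma^2): mu ~ N(mu0, sigma0^2), then y ~ N(mu, sigma^2)."""
    mu = rng.normal(mu0, np.sqrt(sigma0_sq))
    y = rng.normal(mu, np.sqrt(sigma_sq), size=n_obs)
    return mu, y

def sample_negbin_durations(a, b, r, n_dur, rng):
    """NegBin(a, b; r): p ~ Beta(a, b), then durations from a shifted negative binomial."""
    p = rng.beta(a, b)
    # NumPy counts failures before r successes (support {0, 1, ...}); adding 1
    # shifts the support to {1, 2, ...} so that every duration is at least 1.
    durations = 1 + rng.negative_binomial(r, 1.0 - p, size=n_dur)
    return p, durations

rng = np.random.default_rng(0)
mu, y = sample_gauss_state(mu0=100.0, sigma0_sq=10.0 ** 2, sigma_sq=5.0 ** 2, n_obs=5, rng=rng)
p, durations = sample_negbin_durations(a=5.0, b=1.0, r=2, n_dur=5, rng=rng)
```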
We set the priors for the Factorial Sticky HDP-HMM by using the same set of observation prior
parameters as for the HDP-HSMM and setting state-specific sticky bias parameters so as to match
the expected durations encoded in the HDP-HSMM duration priors. For an example of real data
observation sequences, see Figure 20.
A natural extension of this model would be a more elaborate hierarchical model which learns
the hyperparameter mixtures automatically from training data. As our experiment is meant to em-
phasize the merits of the HDP-HSMM and sampling inference, we leave this extension to future
work.
References
M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden markov model. Advances in
Neural Information Processing Systems, 14:577–584, 2002.
S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations.
Journal of Computational and Graphical Statistics, pages 434–455, 1998.
E. Çinlar. Probability and Stochastics. Springer Verlag, 2010.
M. Dewar, C. Wiggins, and F. Wood. Inference in hidden markov models with explicit state duration
distributions. Signal Processing Letters, IEEE, (99):1–1, 2012.
E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Ph.D. thesis,
MIT, Cambridge, MA, 2009.
E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. An HDP-HMM for systems with state
persistence. In Proceedings of the International Conference on Machine Learning, July 2008.
Z. Ghahramani and M. I. Jordan. Factorial hidden markov models. Machine Learning, 29(2):
245–273, 1997.
Y. Guédon. Exploring the state sequence space for hidden markov and semi-markov chains.
Computational Statistics and Data Analysis, 51(5):2379–2409, 2007. ISSN 0167-9473. doi:
http://dx.doi.org/10.1016/j.csda.2006.03.015.
K. Hashimoto, Y. Nankaku, and K. Tokuda. A bayesian approach to hidden semi-markov model
based speech synthesis. In Tenth Annual Conference of the International Speech Communication
Association, 2009.
[Figure 20 panels, top to bottom: Lighting, Refrigerator, Dishwasher, Furnace, Microwave, Total Power. Vertical axes: Power (Watts). Horizontal axis: Time (sample index).]
Figure 20: Example real data observation sequences for the power disaggregation experiments.
Table 2: Power disaggregation prior parameters for each device. Observation priors encode rough
power levels that are expected from devices. Duration priors encode duration statistics that
are expected from devices.
K. A. Heller, Y. W. Teh, and D. Görür. Infinite hierarchical hidden Markov models. In Proceedings
of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
H. Ishwaran and M. Zarepour. Markov chain monte carlo in approximate dirichlet and beta two-
parameter process hierarchical models. Biometrika, 87(2):371–390, 2000.
M. J. Johnson and A. S. Willsky. The Hierarchical Dirichlet Process Hidden Semi-Markov Model. In
Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Corvallis,
Oregon, USA, 2010. AUAI Press.
M. J. Johnson and A. S. Willsky. Dirichlet posterior sampling with truncated multinomial likeli-
hoods. 2012. arXiv:1208.6537v2.
H. Kim, M. Marwah, M. Arlitt, G. Lyon, and J. Han. Unsupervised disaggregation of low frequency
power measurements. Technical report, HP Labs Tech. Report, 2010.
J. Z. Kolter and M. J. Johnson. REDD: A Public Data Set for Energy Disaggregation Research. In
SustKDD Workshop on Data Mining Applications in Sustainability, 2011.
K. Murphy. Hidden semi-markov models (segment models). Technical Report, November 2002.
URL http://www.cs.ubc.ca/~murphyk/Papers/segment.pdf.
K. P. Murphy. Conjugate bayesian analysis of the gaussian distribution. Technical report, 2007.
P. Orbanz. Construction of nonparametric bayesian models from parametric bayes equations. Ad-
vances in Neural Information Processing Systems, 2009.
D. A. Van Dyk and X. L. Meng. The art of data augmentation. Journal of Computational and
Graphical Statistics, 10(1):1–50, 2001.
J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden
markov model. In Proceedings of the 25th International Conference on Machine Learning, pages
1088–1095. ACM, 2008.
M. Zeifman and K. Roth. Nonintrusive appliance load monitoring: Review and outlook. Consumer
Electronics, IEEE Transactions on, 57(1):76–84, 2011.