
Tree of Latent Mixtures for Bayesian Modelling and Classification of High Dimensional Data


Hagai T. Attias ([email protected])
Golden Metallic, Inc., P.O. Box 475608, San Francisco, CA 94147, USA

Matthew J. Beal ([email protected])
Dept. of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260-2000, USA

March 2004 (updated January, 2005)
Technical Report No. 2005-06, Department of Computer Science and Engineering, University at Buffalo, SUNY
Abstract
Many domains of interest to machine learning, such as audio and video, computational biology, climate modelling, and quantitative finance, involve very high dimensional data. Such data are often characterized by statistical structure that includes correlations on multiple scales. Attempts to model those high dimensional data raise problems that are much less significant in low dimensions. This paper presents a novel approach to modelling and classification in high dimensions. The approach is based on a graphical model with a tree structure, which performs density estimation on multiple spatial scales. Exact inference in this model is computationally intractable; two algorithms for approximate inference using variational techniques are derived and analyzed. We discuss learning and classification using these algorithms, and demonstrate their performance on a handwritten digit recognition task.

http://www.goldenmetallic.com

http://www.cse.buffalo.edu/faculty/mbeal
Figure 1: Left: a mixture model. x denotes the data vector and s the component label. Right: a mixture of factor analyzers (MFA) model, which adds a hidden node for the factors.
1 Introduction
Many domains of interest to machine learning, such as audio and video, computational biology, climate modelling, and quantitative finance, involve very high dimensional data. Such data are often characterized by a rich statistical structure, including long range spatial and temporal correlations. These correlations make the graphical model builder's job quite challenging. Consider, for instance, fitting a Gaussian mixture model (Fig. 1 left) to K-dimensional data. Using diagonal precision matrices would require just K parameters per matrix, but could lead to highly inaccurate results for strongly correlated data. Using general precision matrices could in principle lead to an accurate model, but at the price of escalating the number of parameters to K(K+1)/2 per matrix. As K increases, the O(K^2) parameters could sharply increase the required computational resources, slow down the learning algorithm, and complicate the likelihood surface, increasing the number of local maxima and the algorithm's chance of getting stuck in one of them.
Many existing techniques attempt to tackle this problem [13, 11]. One group of techniques focuses on controlling the number of model parameters. This may be achieved by constraining the precision matrices, e.g., by tying parameters across components or applying symmetry considerations. A more sophisticated technique uses a factor analysis model with an appropriately chosen number of factors in each component (Fig. 1 right). Using L factors reduces the number of parameters to K(L + 1) per component. However, such techniques do not take into account the actual correlation structure in the data, which may lead to inaccurate modelling and unsatisfying performance.
Another group of techniques focuses on controlling the number of data dimensions. This may be done by a variety of methods for feature extraction, projection, and dimensionality reduction. This pre-processing results in lower dimensional data, on which modelling is performed. However, the success of such methods relies strongly on accurate prior knowledge about which features are relevant for a given task. Applying them in the absence of such knowledge may again lead to inaccurate modelling.
This paper presents a novel approach to modelling and classification in high dimensions. The approach is based on a graphical model which performs density estimation on multiple scales. The model is termed the tree of latent mixtures (TLM). All the nodes in the tree but the leaves are hidden, and each node consists of a mixture distribution (see Fig. 2). The TLM has a generative interpretation that is simple and intuitive: the variable at a given node draws its mean from the parent node, which is associated with a coarser scale, and the distribution about the mean is described by a mixture model. That variable, in turn, controls the mean of the children nodes, which are associated with a finer scale. Hence, the TLM models correlations on multiple scales, where short range correlations are modelled by nodes lower in the tree, and progressively longer range correlations are modelled by higher nodes.
The TLM model belongs to the class of hybrid graphical models, which contain both discrete and continuous variables. Like many models in that class [10], exact inference in the TLM is computationally intractable. We present two algorithms for approximate inference, using variational techniques [6], and discuss learning and classification using these algorithms. Performance is demonstrated on a handwritten digit recognition task.
2 Model
Here we define the TLM model and discuss its different interpretations.

2.1 Mathematical Definition
The TLM model is defined as follows. Let T denote a tree structure with M nodes. Let m = 1 : M index the nodes, where the root node is labelled by M. For node m, let pa(m) denote its parent node, and let ch(m) denote the set of its child nodes. The number of children of node m is denoted by C_m = |ch(m)|.
Each node in the tree is now expanded into a mixture model that contains two nodes, denoted by s_m and x_m for model m. Node s_m is termed the component label, and is a discrete variable with S_m possible values, s_m = 1 : S_m. Node x_m is termed the feature vector, and is a continuous vector of dimension K_m. (Note that we use the term feature just to refer to the continuous nodes; no actual feature extraction is involved.) We will refer to the nodes of the original tree graph as models, and reserve the term nodes for the label and feature variables.
Next we define the probabilistic dependencies of the nodes, shown graphically in Fig. 2. The component label s_m of model m depends on the label s_{pa(m)} of the parent model pa(m), via a probability table w_{mss'},

  p(s_m = s | s_{pa(m)} = s') = w_{mss'} .   (1)

Hence, the discrete nodes s_{1:M} themselves form a tree, whose structure is identical to the original tree structure T.
Figure 2: The TLM model. For model m, s_m is the component label and x_m is the feature vector. The models form a tree, which here is symmetric with C_m = 2 children for each model. During learning, the leaf nodes x_1, x_2, x_3, x_4 are visible and all other nodes are hidden.
The continuous nodes x_{1:M}, however, do not form a tree, as each of them has two parents, one discrete and one continuous. The feature vector x_m of model m depends on the component label s_m of that model, and on the feature x_{pa(m)} of the parent model. The dependence draws on the mixture of factor analyzers (MFA) model (Fig. 1 right), where x_m is a linear function of x_{pa(m)} plus Gaussian noise, with the linear coefficients A_{ms}, a_{ms} and the noise precisions B_{ms} depending on s_m,

  p(x_m = x | s_m = s, x_{pa(m)} = x') = N(x | A_{ms} x' + a_{ms}, B_{ms}) .   (2)

Like in MFA, the precision matrices B_{ms} are diagonal.^1
The full joint distribution of the TLM is given by a product over models,

  p(x_{1:M}, s_{1:M} | θ) = ∏_{m=1}^{M} p(x_m | s_m, x_{pa(m)}) p(s_m | s_{pa(m)}) ,   (3)

where for the root node we set s_{pa(M)} = 1 and x_{pa(M)} = 0, and

  θ = {A_{ms}, a_{ms}, B_{ms}, w_{mss'}}   (4)

denotes the model parameters. To help with overfitting protection and model selection, we impose standard (conjugate, see [1]) independent priors over the parameters: Gaussian over A_{ms}, a_{ms}, Wishart over the precisions B_{ms}, and Dirichlet over the label parameters w_{mss'},

  p(A_{ms}) = N(A_{ms} | 0, ν I) ,  p(a_{ms}) = N(a_{ms} | 0, ν I) ,
  p(B_{ms}) = W(B_{ms} | η, Φ) ,  p(w_{mss'}) = D(w_{mss'} | λ) .
^1 Notation: the Gaussian distribution over a vector x with mean a and precision matrix B is N(x | a, B) = |B/(2π)|^{1/2} exp[-(x - a)^T B (x - a)/2].
Figure 3: The feature vectors of the TLM of Fig. 2. The data are 16-dimensional and are organized in a spatial 4×4 array. The feature vectors at the leaves x_1, x_2, x_3, x_4 correspond to data variables. The features x_5, x_6, x_7 are hidden. In this example, all features have dimensionality K_m = 4.
Let V ⊂ {1 : M} denote the models at the leaves, and let H = {1 : M} \ V denote the rest of the models. Only the feature vectors of the leaf models, denoted by x_V = {x_m, m ∈ V}, correspond to observed data variables, and are therefore visible during learning. The rest of the feature vectors, denoted by x_H = {x_m, m ∈ H}, as well as all the component labels s_{1:M}, are hidden.
To construct a TLM model for a given dataset, one divides the data variables into subsets, and associates a feature vector x_m, m ∈ V with each subset. The dimensionality K_m of the feature vectors is the number of variables in the corresponding subset. The subsets may or may not overlap; overlap could help reduce edge effects. Fig. 3 shows an example where 16-dimensional data, organized in a spatial 4×4 array, are modelled by the tree in Fig. 2. The data variables are divided into 4 non-overlapping subsets V = {1, 2, 3, 4} with 4 variables each. Hence, the visible feature vectors x_m, m ∈ V have dimension K_m = 4. The hidden features x_m, m ∈ H, where H = {5, 6, 7}, in this example also have K_m = 4. In principle, one may fit to a given dataset a tree of an arbitrary structure T, not necessarily symmetric, and arbitrary parameters K_m and S_m. Estimating them from data is discussed briefly in Section 8.
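As an illustration of this construction, the following sketch (our own encoding, not from the report) writes out the tree of Figs. 2-3 as plain data: 16 data variables on a 4×4 grid, split into 4 leaf subsets, with hidden models 5, 6 and the root model 7. The choice of the four 2×2 quadrants as leaf subsets is an assumption for the example.

# Sketch (ours): the TLM structure of Figs. 2-3 as plain data.
# parent[m] gives pa(m); the root (model 7) has parent None.
# Leaf subsets are assumed here to be the four 2x2 quadrants of the 4x4 grid.
import numpy as np

parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: None}
children = {m: [c for c, p in parent.items() if p == m] for m in parent}

grid = np.arange(16).reshape(4, 4)           # indices of the 16 data variables
subsets = {
    1: grid[:2, :2].ravel(), 2: grid[:2, 2:].ravel(),
    3: grid[2:, :2].ravel(), 4: grid[2:, 2:].ravel(),
}
K = {m: 4 for m in parent}                   # feature dimensions K_m
S = {m: 3 for m in parent}                   # e.g. 3 mixture components per model
V = [1, 2, 3, 4]                             # visible (leaf) models
H = [5, 6, 7]                                # hidden models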
2.2 Interpretation
Any probabilistic DAG models the process that generates the observed data, and therefore has a generative interpretation. To generate an observed data point from the TLM, start at the root and select a value s ∈ {1 : S_M} for the label s_M with probability p(s_M = s). Then generate the feature via x_M = u_M, where u_M is sampled from the Gaussian p(x_M | s_M = s), which has mean a_{Ms} and precision B_{Ms}. Next, proceed to the children models m ∈ ch(M). For each child, select a value s for its label s_m with probabilities p(s_m = s | s_{pa(m)}). Then generate its feature via x_m = A_{ms} x_{pa(m)} + u_m, where u_m is sampled from the Gaussian with mean a_{ms} and precision B_{ms}. Repeat until reaching the leaves.
The TLM may also be viewed as performing density estimation at multiple spatial scales, where the scale is finest at the leaves and coarsest at the root. Each feature x_m is described by a mixture distribution where the mean of each component, apart from a constant, is proportional to the parent x_{pa(m)}. Hence, in effect, at a given spatial scale (or level in the tree), the mean is determined by a node at a coarser scale (the parent node, one level up), whereas the nodes at the given scale add the finer statistical details.
There is another attractive interpretation of the TLM, as a bottom-up process of successive clustering and dimensionality reduction. We will discuss it in Section 5 and exploit it for initialization.
3 Inference
Bayesian classification with the TLM is performed by training a separate TLM model on data from each class, using an expectation maximization (EM) algorithm derived for the model. A new unlabelled data point is then classified by comparing its likelihood under the models for the different classes, combined with the class prior, as instructed by Bayes' rule.
Both performing the E-step of EM and computing the data likelihood require inference, i.e., computing the posterior distribution over the hidden nodes x_H and s_{1:M}, conditioned on the observed data x_V,

  p(x_H, s_{1:M} | x_V) = p(x_{1:M}, s_{1:M}) / p(x_V) .   (5)

However, this posterior is computationally intractable, since computing the normalization constant p(x_V) requires summing over all possible state configurations (s_1, ..., s_M) of the labels, whose number ∏_{m=1}^{M} S_m = e^{ᾱM} is exponential in the number of models M (ᾱ = ⟨log S_m⟩ is the average log-number of states).
3.1 Variational Inference
In the following, we present two techniques for approximate inference in the TLM model. Both techniques are based on variational methods [6], which we now review briefly. In variational inference, one approximates the exact posterior p(x_H, s_{1:M} | x_V) by another distribution q(x_H, s_{1:M} | x_V), termed the variational posterior. Unlike the exact posterior, q is chosen to be tractable. This is usually achieved by giving it a factorized structure, i.e., grouping the hidden variables into separate sets which are mutually independent given the data; in the exact posterior, those sets are correlated. q(·) is typically given in a parametric form, which either has to be specified in advance or (as is the case here) emerges once the structure has been specified. Its parameters are then optimized to minimize its Kullback-Leibler (KL) distance KL[q || p] to the exact posterior p. Interestingly, the optimization requires knowledge of the exact posterior only up to normalization. The optimization is performed by an iterative loop nested within the E-step of each EM iteration, and is guaranteed to converge since the KL distance is lower-bounded by zero.
Learning (M-step) in the TLM needs the following posterior moments (a.k.a. sufficient statistics): the feature conditional means ρ_{ms} and child-parent correlations ρ_{ms,m's'}, m' = pa(m), as well as the label probabilities γ_{ms} and joint child-parent probabilities γ_{ms,m's'}. They are defined via marginals of the variational posterior q(x_H, s_{1:M} | x_V) by

  ρ_{ms} = ∫ dx_m q(x_m | s_m = s) x_m ,
  ρ_{ms,m's'} = ∫ dx_m dx_{m'} q(x_m, x_{m'} | s_m = s, s_{m'} = s') x_m x_{m'}^T ,
  γ_{ms} = q(s_m = s | x_V) ,
  γ_{ms,m's'} = q(s_m = s, s_{m'} = s' | x_V) .   (6)

These posterior moments are computed below by each inference technique separately.
At the M-step of each EM iteration, one considers the averaged complete data likelihood E_q log p(x_{1:M}, s_{1:M} | θ), where E_q averages w.r.t. the variational posterior computed at the E-step. The learning rule is derived, as is standard, by maximizing this quantity w.r.t. the model parameters θ. Once a model has been learned, the likelihood L = log p(x_V) of a new data point x_V is approximated by the variational likelihood F,

  F = E_q [ log p(x_{1:M}, s_{1:M}) - log q(x_H, s_{1:M} | x_V) ] .   (7)

In fact, it can be shown that F = L - KL[q || p] ≤ L (the argument p refers to the exact posterior); hence the variational E-step that minimizes the KL distance also maximizes the variational likelihood.
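For completeness, the decomposition behind the claim F = L - KL[q || p] follows directly from (7) and Bayes' rule (a standard identity, written here in LaTeX):

\mathcal{F} = \mathbb{E}_q\!\left[\log \frac{p(x_{1:M}, s_{1:M})}{q(x_H, s_{1:M} \mid x_V)}\right]
            = \mathbb{E}_q\!\left[\log \frac{p(x_H, s_{1:M} \mid x_V)\, p(x_V)}{q(x_H, s_{1:M} \mid x_V)}\right]
            = \log p(x_V) - \mathrm{KL}\!\left[\, q \,\|\, p(x_H, s_{1:M} \mid x_V) \,\right] \le \log p(x_V),

with equality if and only if q equals the exact posterior.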
3.2 Technique I: Tree of Factorized Features
Generally, the posterior may be written as the posterior over labels times the posterior over features conditioned on those labels,

  q(x_H, s_{1:M} | x_V) = q(x_H | s_H, x_V) q(s_{1:M} | x_V) .   (8)

We now require the features to be conditionally independent of each other,

  q(x_H | s_H, x_V) = ∏_{m∈H} q_m(x_m | s_m) .   (9)

This is the only restriction we impose on the structure of the posterior. Notice that this structure maintains three important properties of the exact posterior: (1) the labels are correlated, (2) the features depend on the labels, and (3) the features themselves are correlated, since q(x_H | x_V) = Σ_{s_H} q(x_H | s_H, x_V) q(s_H | x_V) ≠ ∏_{m∈H} q(x_m | x_V). However, the details of the different dependencies differ from the exact posterior. Similar ideas have been used in [2, 9] in the context of processing multiple speech/sound signals.
The functional form of the variational posterior falls out of free-form optimization of F under the structural restriction (9); no further assumptions are required. First, the label posterior has a tree structure, just like the prior (1), given by a product over potentials

  q(s_{1:M} | x_V) = (1/Z) ∏_{m=1}^{M} exp[ Φ_m(s_m | s_{pa(m)}) ] ,   (10)

where Z is a normalization constant. Since the label posterior is a tree, any required statistics, including Z and the posterior moments γ_{ms} and γ_{ms,m's'} in (6), can be computed efficiently using any of the standard message passing algorithms [8]. We exploit this useful property for computing the feature posterior below.
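To illustrate why the tree structure makes Z cheap, here is a minimal upward-pass sketch (ours, not the authors' code). It assumes the log-potentials are stored as phi[m][s, s'] = Φ_m(s_m = s | s_{pa(m)} = s'), with a single dummy parent state for the root; marginals would follow from an additional downward pass, as in standard sum-product [8].

# Sketch (ours): log Z of q(s_{1:M} | x_V) in (10) by an upward sum-product pass.
# phi[m] is an (S_m x S_parent) array of log-potentials; for the root, S_parent = 1.
import numpy as np
from scipy.special import logsumexp

def log_partition(phi, children, root):
    def upward(m):
        # message to the parent: lam[s'] = log sum_s exp(phi[m][s, s'] + child messages at s)
        child_sum = sum(upward(c) for c in children[m]) if children[m] else 0.0
        return logsumexp(phi[m] + np.atleast_1d(child_sum)[:, None], axis=0)
    return upward(root)[0]   # root message has length 1: its entry is log Z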
The potentials Φ_m are given by the expression

  Φ_m(s_m = s | s_{pa(m)} = s') = log w_{mss'}
      + log p(x_m = ρ_{ms} | s_m = s, x_{pa(m)} = ρ_{pa(m),s'})
      - (1/2) Tr[ Γ_{ms}^{-1} + A_{ms} Γ_{pa(m),s'}^{-1} A_{ms}^T ] + (1/2) log |Γ_{ms}| ,   (11)

which basically shows how inferring the feature x_m modifies the prior w_{mss'} to yield the posterior. Here, ρ_{ms} and Γ_{ms} denote the mean and precision, respectively, of the feature posterior q(x_m | s_m = s, x_V) in (9), to be computed below. For the leaf nodes m ∈ V we substitute ρ_{ms} = x_m, Γ_{ms}^{-1} = 0, log |Γ_{ms}| = 0, and for the root node m = M, ρ_{pa(m),s'} = Γ_{pa(m),s'}^{-1} = 0. In addition, p(x_m = ...) in the second line refers to p in (2) with the appropriate substitutions.
Next, the feature posteriors turn out to be Gaussian,

  q(x_m | s_m = s) = N(x_m | ρ_{ms}, Γ_{ms}) ,   (12)

whose precision matrix is given by

  Γ_{ms} = B_{ms} + Σ_{m'∈ch(m)} E_{m'|ms} [ A_{m's'}^T B_{m's'} A_{m's'} ] .   (13)

E_{m'|ms} denotes posterior averaging over s_{m'} conditioned on s_m = s, e.g.,

  E_{m'|ms} A_{m's'} = Σ_{s'} q(s_{m'} = s' | s_m = s, x_V) A_{m's'} ,

where q(s_{m'} = s' | s_m = s, x_V) = γ_{m's',ms} / γ_{ms}.
The means of (12) are given by a linear equation that expresses ρ_{ms} in terms of the parent and children of node m,

  Γ_{ms} ρ_{ms} = B_{ms} [ A_{ms} E_{pa(m),s'|ms} ρ_{pa(m),s'} + a_{ms} ]
      + Σ_{m'∈ch(m)} E_{m'|ms} [ A_{m's'}^T B_{m's'} ( ρ_{m's'} - a_{m's'} ) ] .   (14)

Direct solution of this linear system may be inefficient, since its dimension Σ_m K_m S_m may be quite large. However, most of the coefficients are zero, since node m is coupled only to its parent and children; hence sparse matrix techniques can solve it efficiently. Instead, we implemented an iterative solution which, starting with the initialization described in Section 5, makes repeated passes through the tree and updates the means using (14). For the experiments described below, convergence was almost always achieved in O(10) passes.
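To illustrate the flavour of this iterative solution, here is a much-simplified sketch (ours, not the report's implementation): it assumes a single component per model (S_m = 1), so the label expectations in (13)-(14) disappear and the updates reduce to Jacobi-style passes over the tree, with leaf means clamped to the data.

# Sketch (ours): fixed-point passes for the posterior means, specialised to S_m = 1.
# A[m], a[m], B[m] are the single-component parameters of model m (B[m] given as a
# full matrix here for simplicity); mu must hold initial values for hidden models.
import numpy as np

def update_means(mu, A, a, B, parent, children, H, V, x_obs, n_passes=10):
    for m in V:
        mu[m] = x_obs[m]                                   # leaves are observed
    for _ in range(n_passes):
        for m in H:                                        # hidden models only
            Gamma = B[m] + sum(A[c].T @ B[c] @ A[c] for c in children[m])    # cf. (13)
            p = parent[m]
            rhs = B[m] @ (a[m] if p is None else A[m] @ mu[p] + a[m])
            rhs = rhs + sum(A[c].T @ B[c] @ (mu[c] - a[c]) for c in children[m])  # cf. (14)
            mu[m] = np.linalg.solve(Gamma, rhs)
    return mu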
The conditional feature correlations ρ_{ms,m's'} in (6) can now be computed via

  ρ_{ms,m's'} = ρ_{ms} ρ_{m's'}^T + δ_{mm'} Γ_{ms}^{-1} .   (15)

Finally, as usual with variational techniques, the equations for q(s_{1:M} | x_V) and q(x_H | s_H, x_V) are mutually dependent and must be solved iteratively.
3.3 Technique II: Factorized Trees
Here, we require the features to be independent of the labels,

  q(x_H, s_{1:M} | x_V) = q(x_H | x_V) q(s_{1:M} | x_V) .   (16)

This is a stronger restriction than the one proposed in the previous section (9), and has the disadvantage that the posterior over the features is unimodal (above it was a mixture distribution, q(x_H) = Σ_{s_H} q(x_H | s_H) q(s_H)). In fact, it can be shown to be a single Gaussian, which is a fairly simplistic approximation to the exact posterior. However, it has the advantage of preserving direct correlations among the features, and it is also simpler to compute.
Just as above, the functional form of the variational posterior falls out of a free-form optimization of F under the restriction (16); no further assumptions are required. First, the label posterior again has a tree structure, given by a product over potentials

  q(s_{1:M} | x_V) = (1/Z) ∏_{m=1}^{M} exp[ Φ_m(s_m | s_{pa(m)}) ] ,   (17)

where Z is a normalization constant. Any required statistics, including Z and the posterior moments γ_{ms} and γ_{ms,m's'} in (6), can be computed efficiently via message passing [8].
The potentials Φ_m are obtained via

  Φ_m(s_m = s | s_{pa(m)} = s') = log w_{mss'}
      + log p(x_m = ρ_m | s_m = s, x_{pa(m)} = ρ_{pa(m)})
      - (1/2) Tr[ Σ_{m,m} + A_{ms} Σ_{pa(m),pa(m)} A_{ms}^T - 2 Σ_{m,pa(m)} A_{ms}^T ] ,   (18)

where ρ_m is the posterior mean of x_m and Σ_{m,m'} the posterior covariance of x_m, x_{m'} (related to the moments in (6) by Σ_{m,m'} = ρ_{mm'} - ρ_m ρ_{m'}^T); both are discussed below. For the leaf nodes m ∈ V we substitute ρ_m = x_m, Σ_{m,m} = 0, and for the root node m = M, ρ_{pa(m)} = Σ_{pa(m),pa(m)} = 0. In addition, p(x_m = ...) in the second line refers to p in (2) with the appropriate substitutions.
Next, the feature posterior is shown to be a Gaussian tree, given by a product over potentials

  q(x_H | x_V) = (1/Z') ∏_{m∈H} exp[ Ψ_m(x_m | x_{pa(m)}) ] ,   (19)

where Z' is a normalization constant. The potentials Ψ_m are given by the quadratic expressions

  Ψ_m(x_m = x | x_{pa(m)} = x') = -(1/2) [ x^T (E_{ms} B_{ms}) x + x'^T (E_{ms} A_{ms}^T B_{ms} A_{ms}) x'
      - 2 x'^T (E_{ms} A_{ms}^T B_{ms}) x - 2 (E_{ms} a_{ms}^T B_{ms}) x + 2 (E_{ms} a_{ms}^T B_{ms} A_{ms}) x' ] ,   (20)

where E_{ms} denotes posterior averaging over s_m, e.g., E_{ms} A_{ms} = Σ_s γ_{ms} A_{ms}.
Finally, the conditional feature means and correlations in (6) are in this case not conditioned on the label, due to the variational factorization (16). Hence

  ρ_{ms} = ρ_m = ∫ dx_m q(x_m) x_m ,
  ρ_{ms,m's'} = ρ_{mm'} = ∫ dx_m dx_{m'} q(x_m, x_{m'}) x_m x_{m'}^T ,   (21)

and both ρ_m and ρ_{mm'} are computed via message passing, due to the tree structure of the feature posterior.
4 Learning
Given the posterior moments (6), the update rules (M-step) for the TLM model parameters (4) are straightforward to derive. We define, as is standard, the extended feature vector x̃_m^T = (x_m^T, 1) by appending a 1, and the extended matrix Ã_{ms} = (A_{ms}, a_{ms}) by appending the column a_{ms} on the right. We extend ρ_{ms,m's'} of (6) to the extended vectors: R_{ms,m's'} denotes the correlation of x̃_m and x̃_{m'}, and R'_{ms,m's'} the correlation of x_m and x̃_{m'}. We also define the matrices

  F_{ms} = E_{pa(m),s'|ms} R_{pa(m)s', pa(m)s'} ,   F'_{ms} = E_{pa(m),s'|ms} R'_{ms, pa(m)s'} ,   (22)

where the averages E_{·|ms} are defined as in (13)-(14).
The learning algorithm presented here runs in batch mode on an N-point dataset. Averaging over the data is denoted by ⟨·⟩.
For Ã_{ms} we obtain

  Ã_{ms} = ⟨F'_{ms}⟩ ( ⟨F_{ms}⟩ + (ν/N) I )^{-1} .   (23)

For B_{ms} we obtain

  B_{ms}^{-1} = (1 + η/N)^{-1} [ ⟨ρ_{ms,ms}⟩ + Ã_{ms} ⟨F_{ms}⟩ Ã_{ms}^T - 2 ⟨F'_{ms}⟩ Ã_{ms}^T + Φ/N ] .   (24)

For w_{mss'} we obtain

  w_{mss'} = ( ⟨γ_{ms,pa(m)s'}⟩ + λ/N ) / ( ⟨γ_{pa(m)s'}⟩ + S_m λ/N ) .   (25)

Note how the prior parameters ν, η, Φ, and λ regularize the update rules by preventing ill-conditioning.
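As a small sanity check on the form of (25), here is a sketch of that single update in isolation (ours; gamma_joint_m holds the data-averaged joint probabilities ⟨γ_{ms,pa(m)s'}⟩ as an S_m × S_pa array, and lam stands in for the Dirichlet hyperparameter λ).

# Sketch (ours): the Dirichlet-regularized update (25) for one model m.
import numpy as np

def update_w(gamma_joint_m, S_m, lam, N):
    # gamma_joint_m[s, s'] = <gamma_{ms, pa(m)s'}> averaged over the N training points
    num = gamma_joint_m + lam / N
    den = gamma_joint_m.sum(axis=0, keepdims=True) + S_m * lam / N   # <gamma_{pa(m)s'}>
    return num / den   # columns sum to 1: w[:, s'] = p(s_m = . | s_pa(m) = s')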
5 Initialization
As EM is not guaranteed to converge to the global maximum of the likelihood, using an effective initialization procedure is important. Here we discuss a heuristic that emerges from an interesting, though not rigorous, interpretation of what the TLM actually does. Taking a bottom-up view, the leaf models m ∈ V perform clustering on their data x_m. The parent feature x_{pa(m)} then performs linear dimensionality reduction of the clustering errors from all its children, i.e., of the difference between each data point and the center of its assigned cluster. At the next level up, the models cluster the features x_{pa(m)}, and their parent features reduce the dimensionality of the errors of that clustering. The process continues until the root is reached.
Based on this view, we define the following procedure. Consider each leaf m ∈ V in isolation, setting A_{ms} = 0, and run EM-GM (EM for a Gaussian mixture) to determine the cluster means and precisions a_{ms}, B_{ms}. Set the weight w_{mss'} to the responsibility (posterior probability) of cluster s for data x_m. Vector quantization (VQ) may be used to initialize EM-GM. Next, for each model m ∈ V and data point x_m, define the clustering error u_m = x_m - a_{m,s(m)}, where s(m) = arg min_s | x_m - a_{ms} |. Now, consider each parent m' of a leaf, and run EM-FA (EM for factor analysis) using x_{m'} as factors and the errors from all its children u_m, pa(m) = m', as data. This will determine the A_{ms} (notice that s(m) must be used to label the u_m). Principal component analysis (PCA) may be used to initialize EM-FA.
Next, set the features x_{m'} to their MAP values obtained by EM-FA. Regard those values as observed data, and perform the above procedure of EM-GM followed by EM-FA. Repeat until the root is reached.
This completes the initialization of all model parameters θ. However, for variational EM one must also initialize either the moments of the label posterior or the moments of the feature posterior (6). We did the latter, by setting γ_{ms} to the component responsibility in each model m, computed from EM-GM for that model, and approximating γ_{ms,m's'} = γ_{ms} γ_{m's'}.
We point out that several variants of our initialization procedure exist. In one variant the tree is built and initialized section by section. One stops the procedure after only part of the tree has been initialized, and runs variational EM only on that part. After convergence, one proceeds upward with initialization, and possibly stops again to run variational EM before reaching the root.
6 Bayesian Classification
Here we consider the TLM in the context of supervised classification. In that task, a training set consisting of N pairs (x_V, c) of data and class label is given. These data are used to train a different TLM model for each class. Let θ_c denote the parameters learned for class c, and let p(x_{1:M}, s_{1:M} | θ_c) denote the TLM joint distribution (3) for that class. Let p(c) denote the prior over classes. The exact Bayes classifier for a new data point x_V is given by ĉ(x_V) = arg max_c p(x_V | θ_c) p(c).
However, the marginal p(x_V | θ_c) is computationally intractable, and in the variational framework it is approximated via the variational likelihood (7). Let F_c denote the variational likelihood for the TLM of class c. Then F_c(x_V) ≈ log p(x_V | θ_c), and the approximate Bayes classifier is given by

  ĉ(x_V) = arg max_c [ F_c(x_V) + log p(c) ] .   (26)

Computing F_c can be shown to be remarkably simple in either of the variational techniques discussed above. Recall that the label posterior in either case was a tree, given as a product over potentials normalized by Z (10), (17), which in turn is computed by message passing. Denoting it by Z_c(x_V) for the TLM of class c, we have

  F_c(x_V) = log Z_c(x_V)   (27)

within a constant independent of x_V.

Figure 4: Handwritten digits from the Buffalo dataset.
7 Experiments
We tested the TLM on the Buffalo post office dataset, which contains 1100 examples of each of the digits 0-9. Each digit is a gray-level 8×8 pixel array (see examples in Fig. 4). We used 5 random 800-digit batches for training, and a separate 200-digit batch for testing.
We considered 3 different TLM structures. The structures differed by the number of models M, the parameter values K_m, S_m, C_m, and the degree of overlap between the pixels associated with the data variables x_m, m ∈ V. In all cases we tied the parameters of A_{ms} across components, such that A_{ms} = A_m, since the available number of training examples may be too small to estimate all A_{ms} reliably (the NIST digits dataset should alleviate this problem; see below).

Table 1: Misclassification rate for three different TLM structures using two variational inference techniques. MOG denotes a Gaussian mixture benchmark.

  Structure   Technique I   Technique II
  (1)         1.8%          2.2%
  (2)         1.6%          1.9%
  (3)         2.2%          2.7%
  MOG         2.5%          2.5%
Structure (1) included 2 levels with M = 5 models. Each of the 4 leaf models covered a 4×4 pixel array with no overlap. The leaf models created a 2×2 model array, and all had the root as parent, hence C_m = 4. We used K_m = 16 and S_m = 6 for all models. Structure (2) included 3 levels with M = 14 models. Each of the 9 leaf models again covered a 4×4 pixel array, but it overlapped its horizontal neighbors by a 4×2 and its vertical neighbors by a 2×4 pixel array. The leaf models created a 3×3 model array. Next, the middle level had 4 models, each parenting a 2×2 model array of the leaf level, with an overlap as before. The middle level models created a 2×2 model array, and all had the root as parent. Hence, this structure also had C_m = 4. We used K_m = 16 for all models, and set S_m = 2 to keep the total number of parameters approximately equal to structure (1). Structure (3) included 3 levels with M = 21 models. Each of the 16 leaf models covered a 2×2 pixel array with no overlap. The leaf models created a 4×4 model array, and the tree was constructed upward with C_m = 4 similarly to structure (1). We used K_m = 4 for all models, with S_m = 4, leading to about half the number of parameters of structures (1) and (2).
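For reference, structure (1) written out as plain configuration data (our own encoding; the block assignments follow the description above):

# Sketch (ours): TLM structure (1) for the 8x8 digit images.
import numpy as np

grid = np.arange(64).reshape(8, 8)
structure_1 = {
    "M": 5,
    "parent": {1: 5, 2: 5, 3: 5, 4: 5, 5: None},   # 4 leaves, one root, C_m = 4
    "pixels": {                                     # non-overlapping 4x4 blocks
        1: grid[:4, :4].ravel(), 2: grid[:4, 4:].ravel(),
        3: grid[4:, :4].ravel(), 4: grid[4:, 4:].ravel(),
    },
    "K": {m: 16 for m in range(1, 6)},
    "S": {m: 6 for m in range(1, 6)},
}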
For each structure we ran variational EM with both inference techniques. The results are shown in Table 1. As a benchmark, denoted MOG, we used a standard Gaussian mixture model with 30 components. All TLM structures, except (3) with inference technique II, outperformed the benchmark. Structure (2) with technique I outperformed the rest.
While our experiments are not yet exhaustive, they seem to indicate that (1) overlap at the leaf level may enhance performance, possibly by reducing edge effects; and (2) technique I may be superior, possibly due to the multimodality of its feature posterior. We are currently performing experiments with additional TLM structures, and will report the results in the final version of this paper. We are also working on a second set of experiments using the NIST digits dataset, which contains 60,000 training examples on a 20×20 pixel array.
8 Extensions
The work presented here may be extended in several interesting directions. We are currently pursuing new techniques for approximate inference in the TLM. A Rao-Blackwellized Monte Carlo approach [4, 7], which combines sampling the labels with exact inference on the features given the label samples, has given promising preliminary results. Suitable versions of other methods, including loopy belief propagation [14], expectation propagation [12], and the algorithm of [3] (originally designed for time series), may also prove effective.
An important direction we plan to pursue is learning the structure T of the TLM, including the parameters C_m, K_m, S_m, from data. In classification, for instance, different classes may require models with different structures. Structure search, accelerated using our initialization method and combined with scoring using a variational Bayesian technique [1, 5], could produce a powerful extension to the current TLM algorithms.
Another direction involves extending the TLM into a dynamic graphical model, by constructing a Markov chain of TLMs. This graph would model complex time series on multiple spatiotemporal scales, and could produce a novel and effective forecasting tool.
References
[1] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12, pages 209-215. MIT Press, 2000.
[2] H. Attias. Source separation with a sensor array using graphical models and subband filtering. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.
[3] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, pages 33-42. Morgan Kaufmann, 1998.
[4] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[5] Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 13, pages 507-513. MIT Press, 2001.
[6] M. I. Jordan, Z. Ghahramani, and T. S. Jaakkola. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1998.
[7] T. Kristjansson, H. Attias, and J. R. Hershey. Stereo based 3D tracking and learning using EM and particle filtering. In Proceedings of the 18th European Conference on Computer Vision, in press, 2004.
[8] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47:498-519, 2001.
[9] L. J. Lee, H. Attias, and L. Deng. A multimodal variational approach to learning and inference in switching state space models. In Proceedings of the 2004 International Conference on Acoustics, Speech, and Signal Processing, in press, 2004.
[10] U. Lerner and R. Parr. Inference in hybrid networks: theoretical limits and practical algorithms. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence, pages 310-318. Morgan Kaufmann, 2001.
[11] M. Meila and M. I. Jordan. Learning with mixtures of trees. Journal of Machine Learning Research, 1:1-48, 2000.
[12] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the 18th Annual Conference on Uncertainty in Artificial Intelligence, pages 352-359. Morgan Kaufmann, 2002.
[13] A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11, pages 543-549. MIT Press, 1999.
[14] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems 13, pages 689-695. MIT Press, 2001.