Experiments With Non Parametric Topic Models
Wray Buntine
Monash University
Clayton, VIC, Australia
[email protected]
Swapnil Mishra
RSISE, The Australian National University
Canberra, ACT, Australia
[email protected]
ABSTRACT
In topic modelling, various alternative priors have been developed, for instance asymmetric and symmetric priors for the document-topic and topic-word matrices respectively, the hierarchical Dirichlet process prior for the document-topic matrix and the hierarchical Pitman-Yor process prior for the topic-word matrix. For information retrieval, language models exhibiting word burstiness are important. Indeed, this burstiness effect has been shown to help topic models as well, and this requires additional word probability vectors for each document. Here we show how to combine these ideas to develop high-performing non-parametric topic models exhibiting burstiness based on standard Gibbs sampling. Experiments are done to explore the behaviour of the models under different conditions and to compare the algorithms with those previously published. The full non-parametric topic models with burstiness are only a small factor slower than standard Gibbs sampling for LDA and require double the memory, making them very competitive. We look at the comparative behaviour of different models and present some experimental insights.
Categories and Subject Descriptors
I.7 [Document and Text Processing]: Miscellaneous;
I.2.6 [Artificial Intelligence]: Learning
Keywords
topic modelling; experimental results; non-parametric prior;
text
1. INTRODUCTION
Topic models are now a recognised genre in the suite of exploratory software available for text data mining and other
semi-structured tasks. Moreover, it is also recognised that ... The Pitman-Yor process, when used in this way, can be used hierarchically to form distributions on a network of probability vectors.
The inference on a network of probability vectors is based
on a basic property of species sampling schemes [14] that
is best understood using the framework of message passing
over networks. Figure 1a shows the context of a probability vector $\vec{p}$ having a Pitman-Yor process with base distribution $\vec{\theta}$ (assuming a concentration at the node of $b_p$). In the Dirichlet case, marginalising out $\vec{p}$ gives the likelihood of the data (the count messages $\vec{n}$ and $\vec{m}$ from the children)
\[
l^{\mathrm{Dir}}_{p}(\vec{\theta}) \;=\; \frac{\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k \frac{\Gamma(b_p\theta_k + n_k + m_k)}{\Gamma(b_p\theta_k)}\,, \qquad (1)
\]
where the total statistics are $N = \sum_k n_k$ and $M = \sum_k m_k$. This functional complexity in $\vec{\theta}$ prevents any further network inference.
Figure 1b shows the alternative after marginalising out the
vector p using Pitman-Yor process theory [3]. One however
must introduce a new latent count vector $\vec{t}$ that represents the fraction of the data passed up as a message to the parent node $\vec{\theta}$. The marginalised form is then
\[
\frac{(b_p|a_p)_T\,\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k S^{\,n_k+m_k}_{\,t_k,\,a_p}\;\theta_k^{t_k}\,, \qquad (2)
\]
where the total $T = \sum_k t_k$ and $(x|y)_T$ denotes the Pochhammer symbol, $(x|y)_T = x(x+y)\cdots(x+(T-1)y)$. $S^{c}_{t,a}$ is a generalized second-order Stirling number [23]. Libraries² exist for efficiently dealing with these, making it in most cases an $O(1)$ computation.
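These numbers satisfy the simple recursion $S^{n+1}_{t,a} = S^{n}_{t-1,a} + (n - t a)\,S^{n}_{t,a}$ with $S^{0}_{0,a} = 1$ and $S^{n}_{t,a} = 0$ for $t > n$. The following is only a minimal sketch of tabulating them in log space for later use; the libraries cited in the footnote do this far more efficiently, and the function name is ours rather than theirs.

    import numpy as np

    def log_stirling_table(nmax, a):
        # logS[n, t] = log S^n_{t,a}; -inf marks entries that are zero.
        logS = np.full((nmax + 1, nmax + 1), -np.inf)
        logS[0, 0] = 0.0
        for n in range(nmax):
            for t in range(n + 2):
                left = logS[n, t - 1] if t >= 1 else -np.inf
                right = (np.log(n - t * a) + logS[n, t]
                         if t <= n and n - t * a > 0 else -np.inf)
                logS[n + 1, t] = np.logaddexp(left, right)
        return logS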
²See https://mloss.org/software/view/424/ and https://mloss.org/software/view/528

Figure 1: Computation with species sampling models. (a) Embedded probability vector with messages. (b) Embedded count vectors after marginalising.

The counts of $\vec{t}$ then contribute to the data at the parent node $\vec{\theta}$ and thus its posterior probability. Thus network inference is feasible, and moreover no dynamic memory is required, unlike CRP methods, because $\vec{t}$ is the same dimension as $\vec{n}+\vec{m}$. Sampling the $\vec{t}$ directly, however, leads
to a poor algorithm because they may have a large range (above, $0 \le t_k \le n_k + m_k$) and the impact of data at the node $\vec{p}$ on the node $\vec{\theta}$ is buffered by $\vec{t}$. Attaching a Boolean table indicator to each data item instead allows the change in $\vec{t}$ to be done incrementally and thus allows more rapid mixing and simpler sampling. Note the table indicators are Boolean values indicating if the current data item increments the table multiplicity at its node, and thus whether the data item contributes to the message to the parent. The assignment of indicators to data can be done because of the above constraint $t_k \le n_k + m_k$. So the data item contributes a +1 to $n_k + m_k$ and the matched table indicator contributes either 0 or +1 to $t_k$, which is the change in the message to the parent. If there is a grandparent node, then a corresponding table indicator in the parent node might also propagate a +1 up to the grandparent.
For inference on a network of such vectors, each probability vector node contributes a factor to the posterior probability. For the above example with table indicators this is given by Formula (3) [3]:
\[
\frac{(b_p|a_p)_T\,\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k S^{\,n_k+m_k}_{\,t_k,\,a_p} \binom{n_k+m_k}{t_k}^{-1}\,, \qquad (3)
\]
where the addition of the $\binom{n_k+m_k}{t_k}$ term over Equation (2) simply divides by the number of choices there are for picking the $t_k$ Boolean table indicators to be on out of a possible $n_k+m_k$.
In sampling, a data point coming from the node source for $\vec{n}$ contributes a +1 to $n_k$ (for some $k$), and either contributes a +1 or a 0 to $t_k$ depending on the value of the table indicator. If $n_k = t_k = 0$ initially, then it must contribute a +1 to $t_k$, so there is no choice. The change in posterior probability of Formula (3) due to the new data point at this node is, given the Boolean indicator $r_l$,
\[
\frac{\Big((t_k + 1)\,(b_p + T a_p)\; S^{\,n_k+m_k+1}_{\,t_k+1,\,a_p}\Big)^{1_{r_l=1}}\;\Big((n_k + m_k - t_k + 1)\; S^{\,n_k+m_k+1}_{\,t_k,\,a_p}\Big)^{1_{r_l=0}}}{(n_k + m_k + 1)\,(b_p + N + M)\; S^{\,n_k+m_k}_{\,t_k,\,a_p}}\,, \qquad (4)
\]
depending on the value of the table indicator $r_l$ for the data point. In a network, one has to jointly sample table indicators for all reachable ancestor nodes in the network, and standard discrete graphical model inference is done in closed form. Examples are given by [7, 8].
For estimation, one requires the expected probabilities
\[
\mathbb{E}_{\vec{n},\vec{m},\vec{t},\vec{\theta}}\left[\,\vec{p}\,\right] \;=\; \frac{b_p + T a_p}{b_p + N + M}\,\vec{\theta} \;+\; \frac{\vec{n} + \vec{m} - a_p\vec{t}}{b_p + N + M}\,. \qquad (5)
\]
Since this does not involve knowing the table occupancies for the CRP, no additional sampling is needed to compute the formula; just the existing counts (i.e., $\vec{n}$, $\vec{m}$, $\vec{t}$) are used. Moreover, we know the estimates normalise correctly.
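As a worked illustration, the following sketch computes the posterior mean of Equation (5) at a node and the two unnormalised weights of Equation (4) used when sampling a table indicator. It assumes the log-space Stirling table sketched above; the function and variable names are ours, not those of the hca code.

    import numpy as np

    def posterior_mean(theta, n, m, t, a_p, b_p):
        # Equation (5): expected probability vector at a node given its
        # counts n, m, table counts t and the parent mean theta.
        N, M, T = n.sum(), m.sum(), t.sum()
        return ((b_p + T * a_p) * theta + n + m - a_p * t) / (b_p + N + M)

    def indicator_weights(k, n, m, t, a_p, b_p, logS):
        # Equation (4): unnormalised weights for the Boolean table indicator
        # r_l of a new data point landing in component k.  r_l = 1 opens a new
        # table and sends a +1 message to the parent node (whose own factor
        # then multiplies w1 when indicators are sampled jointly up the
        # network); r_l = 0 does not.  logS is log_stirling_table(nmax, a_p).
        N, M, T = n.sum(), m.sum(), t.sum()
        c, tk = n[k] + m[k], t[k]
        denom = (c + 1) * (b_p + N + M)
        w1 = (tk + 1) * (b_p + T * a_p) * np.exp(logS[c + 1, tk + 1] - logS[c, tk]) / denom
        w0 = (c - tk + 1) * np.exp(logS[c + 1, tk] - logS[c, tk]) / denom
        return w1, w0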
3. MODELS
3.1 Basic Models
The basic non-parametric topic model we consider is given in Figure 2. Here, the document-topic proportions $\vec\theta_i$ (for $i$ running over documents) have a PYP with mean $\vec\alpha$, and the topic-word vectors $\vec\phi_k$ (for $k$ running over topics) have a PYP with mean $\vec\gamma$. The mean vectors $\vec\alpha$ and $\vec\gamma$ correspond to the asymmetric priors of [25].
While we show $\vec\alpha$ and $\vec\gamma$ having a GEM prior [15, 24] in the figure, allowing different priors covers a range of LDA styles, as shown in Table 1. For instance, when $\vec\alpha$ is finite and the discount for the PYP on $\vec\theta_i$, $a_\theta$, is zero, then $\vec\theta_i \sim \mathrm{Dirichlet}(b_\theta\,\vec\alpha)$. The GEM is equivalent to the stick-breaking prior that is at the core of a DP or PYP, so when the GEM is truncated, or $\vec\alpha$ is simply set up to be Dirichlet distributed as just shown, we have truncated HDP-LDA. Notice there are different ways of providing a truncated prior to ensure a fixed dimensional $\vec\alpha$. The truncated GEM is used in various versions of truncated HDP-LDA [22, 27], and the simpler truncation, just using a Dirichlet, is implicit in the asymmetric priors of [25]. That is, the asymmetric-symmetric (AS) variant of LDA [25] is equivalent to a truncated HDP-LDA. This means that Mallet [16] has implemented a truncated HDP-LDA (via AS-LDA) since 2008, and it is indeed both one of the fastest and the best performing.

Table 1: Family of LDA Models. Here "tr" abbreviates truncated and "symm" abbreviates symmetric.

Thus we reproduce several alternative variants of LDA [25], as well as truncated versions of HDP-LDA, HPYP-LDA and a fully non-parametric asymmetric version (with the truncated GEM prior on both $\vec\alpha$ and $\vec\gamma$) we refer to as NP-LDA. Sampling algorithms for dealing with the HPYP-LDA case are from earlier work [4], and the other cases are similar.
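For concreteness, the generative process of NP-LDA under the notation above can be written out as below. This is a sketch: the hyperparameter names ($a_\alpha$, $b_\alpha$, and so on) are illustrative, and the alternative priors of Table 1 replace the GEM priors on $\vec\alpha$ and $\vec\gamma$ accordingly.
\begin{align*}
& \vec\alpha \sim \mathrm{GEM}(a_\alpha, b_\alpha), \qquad \vec\gamma \sim \mathrm{GEM}(a_\gamma, b_\gamma), \\
& \vec\theta_i \sim \mathrm{PY}(a_\theta, b_\theta, \vec\alpha) \ \text{ for each document } i, \qquad
  \vec\phi_k \sim \mathrm{PY}(a_\phi, b_\phi, \vec\gamma) \ \text{ for each topic } k, \\
& z_{i,l} \sim \mathrm{Discrete}(\vec\theta_i), \qquad
  x_{i,l} \sim \mathrm{Discrete}(\vec\phi_{z_{i,l}}) \ \text{ for each word position } l.
\end{align*}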
3.2 Bursty Models
The extension with burstiness we consider [5] is given in Figure 3. Here, each topic $\vec\phi_k$ is specialised to a variant $\vec\nu_{k,i}$ for each document $i$. Thus $\vec\nu_{k,i} \sim \mathrm{PY}\big(a_\nu, b_{\nu,k}, \vec\phi_k\big)$.
On the surface one would think introducing potentially $K \times W$ (number of topics by number of words) new parameters for each document, for the $\vec\nu_{k,i}$, seems statistically impractical. In practice, the $\vec\nu_{k,i}$ are marginalised out during inference and book-keeping only requires a small number of additional latent variables. Note that each topic $k$ has its own concentration parameter $b_{\nu,k}$. This feature will be illustrated in Subsection 5.5.
3.3 Inference with Burstiness
In LDA style topic modelling using our approach, we get a formula for sampling a new topic $z$ for a word $w$ in position $l$ in a document $d$. Suppose all the other data and the rest of the document is $D^{-(d,l)}$ and this is some model $M$ (maybe NP-LDA or LDA, etc.) with hyperparameters. Then denote this Gibbs sampling formula as $p\big(z \mid w, D^{-(d,l)}, M\big)$. For LDA, this is just the standard collapsed Gibbs sampling formula [11]. It also forms the first step of the block Gibbs sampler we use for HDP-LDA [4]: first we sample the topic $z$, and then we sample the various table indicators given $z$ in the model.
The burstiness model built on $M$, denote it $M$-B, is sampled using $p\big(z \mid w, D^{-(d,l)}, M\text{-B}\big)$, which is computed using $p\big(z \mid w, D^{-(d,l)}, M\big)$. Thus we say the burstiness model $M$-B is a front end to the Gibbs sampler. At the position $l$ in a document we have a word of type $w$ and wish to resample its topic $z = k$. Let $n_{w,k}$ be the number of other existing words of the type $w$ already in topic $k$ for the current document, and let $s_{w,k}$ be the corresponding table multiplicities. They are statistics for the parameters $\vec\nu_k$ in the burstiness model. Note by keeping track of which words in a document are unique, one knows that $n_{w,k} = 0$ for those words, thus computation can be substantially simplified. Let $N_{\cdot,k}$ and $S_{\cdot,k}$ be the corresponding totals for the topic $k$ in the document (i.e., summed over words). The matrices of counts $n_{w,k}$ and $s_{w,k}$ and vectors $N_{\cdot,k}$ and $S_{\cdot,k}$ can be recomputed as each document is processed in time proportional to the length of the document.
The Gibbs sampling probability for choosing $z = k$ at position $l$ for the burstiness model is obtained using Equation (4):
\[
p\big(z = k \mid w, D^{-(d,l)}, M\text{-B}\big) \;\propto\;
p\big(z = k \mid w, D^{-(d,l)}, M\big)\,
\frac{b_{\nu,k} + a_\nu S_{\cdot,k}}{b_{\nu,k} + N_{\cdot,k}}\,
\frac{s_{w,k}+1}{n_{w,k}+1}\,
\frac{S^{\,n_{w,k}+1}_{\,s_{w,k}+1,\,a_\nu}}{S^{\,n_{w,k}}_{\,s_{w,k},\,a_\nu}}
\;+\;
\frac{1}{b_{\nu,k} + N_{\cdot,k}}\,
\frac{n_{w,k}-s_{w,k}+1}{n_{w,k}+1}\,
\frac{S^{\,n_{w,k}+1}_{\,s_{w,k},\,a_\nu}}{S^{\,n_{w,k}}_{\,s_{w,k},\,a_\nu}}\,. \qquad (6)
\]
This has a special case when $s_{w,k} = n_{w,k} = 0$ of
\[
p\big(z = k \mid w, D^{-(d,l)}, M\big)\,
\frac{b_{\nu,k} + a_\nu S_{\cdot,k}}{b_{\nu,k} + N_{\cdot,k}}\,. \qquad (7)
\]
Once topic $z = k$ is sampled, the second term of Equation (6) is proportional to the probability that the table indicator for word $w$ in the $\vec\nu_k$ PYP is zero, i.e., it does not contribute data to the parent node $\vec\phi_k$, so the original model $M$ will ignore this data point. The first term of Equation (6) is proportional to the probability that the table indicator is one, so it does contribute data to the parent node $\vec\phi_k$, i.e., back to the original model $M$. This table indicator is sampled according to the two terms, and the $n_{w,k}$, $s_{w,k}$, $N_{\cdot,k}$, $S_{\cdot,k}$ are all updated. If the table indicator is one then the original model $M$ processes the data point in the manner it usually would.

Thus Equation (6) is used to filter words, so we refer to it as the burstiness front-end. Only words with table indicators of one are allowed to pass through to the regular model $M$ and contribute to its statistics for $\vec\phi_k$ and, for instance, any further PYP vectors in the model.
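A minimal sketch of one front-end step for a single word position is given below; p_z_given_M stands for the vector of base-model probabilities $p(z = k \mid w, D^{-(d,l)}, M)$, logS is a log-space Stirling table built with discount $a_\nu$, and all names are ours rather than those used in the hca implementation.

    import numpy as np

    def sample_topic_bursty(w, p_z_given_M, n, s, N, S, a_nu, b_nu, logS, rng):
        # One burstiness front-end Gibbs step, Equations (6) and (7).
        # n[w,k], s[w,k]: per-document counts and table multiplicities for word w;
        # N[k], S[k]: their totals over words; b_nu[k]: per-topic concentrations.
        K = len(N)
        term1 = np.empty(K)
        term0 = np.empty(K)
        for k in range(K):
            nwk, swk = n[w, k], s[w, k]
            ratio1 = np.exp(logS[nwk + 1, swk + 1] - logS[nwk, swk])
            ratio0 = np.exp(logS[nwk + 1, swk] - logS[nwk, swk])
            term1[k] = (p_z_given_M[k] * (b_nu[k] + a_nu * S[k]) / (b_nu[k] + N[k])
                        * (swk + 1) / (nwk + 1) * ratio1)
            term0[k] = (1.0 / (b_nu[k] + N[k])
                        * (nwk - swk + 1) / (nwk + 1) * ratio0)
        p = term1 + term0                     # Equation (6), unnormalised
        k = rng.choice(K, p=p / p.sum())      # sample the topic
        r = rng.random() < term1[k] / p[k]    # sample the table indicator
        n[w, k] += 1
        N[k] += 1
        if r:                                 # word passes through to model M
            s[w, k] += 1
            S[k] += 1
        return k, r

When the returned indicator is one, the base model M then processes the word exactly as it normally would; otherwise the word is absorbed by the document-specific $\vec\nu_k$ and M never sees it.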
4. EXPERIMENTAL SETUP
4.1 Implementation
The publicly available hca suite used in these experiments is coded in C using 16 and 32 bit integers where needed for saving space. All data preparation is done using the DCA-Bags³ package, a set of scripts, and input data can be handled in a number of formats including the LDA-C format. All algorithms are run on a desktop with an Intel(R) Core(TM) i7 8-core CPU (3.4GHz) using a single core.
The algorithms have no dynamic memory, so we set the
maximum number of topics K ahead of time. This is like
the truncation level in variational implementations of HDP-
LDA. Moreover, initialisation is done by setting the number
of topics to this maximum and randomly assigning words to
topics. Other authors [22] report that initialising to the maximum number of topics, rather than 1, leads to substantially better results, an experimental finding with which we agree.
Note, inference and learning for burstiness requires the word by topic counts $n_{w,k}$ and word by topic multiplicities $s_{w,k}$ to be maintained for each document, as well as their totals. There is an implementation trick used to achieve space efficiency here. First, one computes, for each document, which words appear more than once in the document (i.e., those for which $n_{w,k}$ can become greater than 1). These words require special handling, the full Equation (6), and lists of these are stored in preset variable length arrays. Words that occur only once in a document are easy to deal with since their sampling is governed by Equation (7) and no sampling of the table indicator is needed. Second, the count and multiplicity statistics (the $n_{w,k}$ and $s_{w,k}$, which are statistics for $\vec\nu_{i,k}$) are not stored but recomputed as each document is about to be processed. Moreover, this only needs to be done for words appearing more than once in the document (hence why lists of these are prestored). All one needs to recompute these statistics is the Boolean table indicators and the topic assignments. The statistics $n_{w,k}$, $s_{w,k}$ can be recomputed in time proportional to the length of the document.
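A sketch of that per-document recomputation, assuming each word occurrence stores its topic assignment and Boolean table indicator (the layout and names are illustrative, not those of hca):

    from collections import defaultdict

    def rebuild_doc_stats(doc):
        # doc: list of (word, topic, indicator) triples for one document.
        # Rebuilds n[w,k], s[w,k] and the per-topic totals N[k], S[k]
        # in time proportional to the document length.
        n, s = defaultdict(int), defaultdict(int)
        N, S = defaultdict(int), defaultdict(int)
        for w, k, r in doc:
            n[w, k] += 1
            N[k] += 1
            if r:               # indicator one: the word was passed to model M
                s[w, k] += 1
                S[k] += 1
        return n, s, N, S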
4.2 Data
We have used several datasets for our experiments, the
PN, MLT, RML, TNG, NIPS and LAT datasets. Not all
data sets were used in all comparisons.
The PN dataset is taken from 805K news articles (Reuters RCV1) using the query "person", excluding stop words and words appearing <5 times. The MLT dataset is abstracts from the JMLR volumes 1-11, the ICML years 2007-2011, and IEEE Trans. on PAMI 2006-2011. Stop words were discarded along with words appearing <5 or >2900 times. The RML dataset is the Reuters-21578 collection, made up using the standard ModLewis split. The TNG dataset is the 20-newsgroup dataset using the standard split. For both, stop words were discarded along with words appearing <5 times. The LAT dataset is the LA Times articles from TREC disk 4. Stop words were discarded along with words appearing <10 times. Only words made up entirely of alphabetic characters or dashes were allowed. Roweis NIPS dataset⁴ was left as is.

³http://mloss.org/software/view/522/

Table 2: Characteristic Sizes of Datasets

     PN      MLT    RML     TNG     LAT      NIPS
W    26037   4662   16994   35287   78953    13649
D    8616    2691   19813   18846   131896   1740
T    1000    306    6188    7532    0        348
N    1.76M   224k   1.27M   1.87M   34.5M    23.0M
Characteristics of these six datasets are given in Table 2,
where dictionary size is W, number of documents (including
test) is D, number of test documents is T and total number
of words in the collection is N.
4.3 Evaluation
The algorithms are evaluated on two different measures, test sample perplexity and point-wise mutual information (PMI). Perplexity is calculated over test data and is done using document completion [26], known to be unbiased and easy to implement for a broad class of models. The document completion estimate is averaged over 40 cycles per document done at the end of the training run and uses an 80-20% split, so every fifth word is used to evaluate perplexity and the remaining words to estimate latent variables for the document. Topic comprehensibility can be measured in terms of PMI [17]. It is done by measuring average word association between all pairs of words in the top-10 topic words (using the English Wikipedia articles). Here the PMI reported is the average across all topics. PMI files are prepared with the DCA-Bags package using linkCoco and projected onto the specific datasets using cooc2pmi.pl in the hca suite.
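For concreteness, a sketch of the document completion estimate given point estimates of the per-document topic proportions and the topic-word probabilities (hca instead averages this over 40 cycles as described; names here are illustrative):

    import numpy as np

    def doc_completion_perplexity(test_docs, theta, phi):
        # test_docs: list of word-id lists; theta[d, k]: topic proportions
        # estimated from the 80% portion of document d; phi[k, w]: topic-word
        # probabilities.  Every fifth word is held out and scored.
        log_prob, n_held = 0.0, 0
        for d, words in enumerate(test_docs):
            for w in words[4::5]:             # the 20% evaluation split
                log_prob += np.log(theta[d] @ phi[:, w])
                n_held += 1
        return float(np.exp(-log_prob / n_held))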
We also compare results with two other systems: onlinehdp [27], a stochastic variational algorithm for HDP-LDA coded in Python from C. Wang⁵, and HDP, a Matlab+C combination doing Gibbs sampling from Y.W. Teh. To do the comparisons, at various timepoints we take a snapshot of the $\vec\alpha$ vector and the $\vec\phi_k$ vectors. This is already supported in onlinehdp, and C. Chen provided the support for this task with HDP. We then load these values along with the hyperparameter settings into hca and use its document completion and PMI evaluation options -V -p -hdoc,5. In this way, all algorithms are compared using identical software.
5. EXPERIMENTS
5.1 Runtime Comparisons
To see how the algorithms work at scale, we consider the
cycle times and memory requirements of the different versions running on the full LAT data set. These are given in
Table 3. Cycle times in minutes are for a full pass through all
documents and memory requirements are given in megabytes.
LDA, HDP-LDA (where a … 1 is used. For onlinehdp we did a large number of runs varying $\tau_0$ = 1, 4, 16, 64, $\kappa$ = 0.5, 0.8 and K = 150, 300 and batchsize = 250, 1000. Note $\tau_0$ = 64, $\kappa$ = 0.8 are recommended in [27]. Only the fastest and best converging result is given for onlinehdp. We did one run of both hca and HDP with these settings, noting that the differences are way outside of the range of typical statistical variation between individual runs. Plots of the runs over time are given in Figures 6 and 7, and the final PMI scores for the 3 algorithms are given in Table 5.

Table 4: Document completion perplexity and PMI for hca variants. Data is presented as Perplexity/PMI. HDP is short for HDP-LDA.

Data (K)   LDA            Burst LDA      HDP            Burst HDP      NP-LDA         Burst NP-LDA
MLT(10)    1493.62/2.33   915.46/2.47    1480.85/2.61   904.29/2.59    1480.20/2.38   907.74/2.70
MLT(50)    1504.63/2.94   1008.68/3.26   1389.29/3.70   940.69/3.63    1375/3.47      932.88/3.93
RML(50)    1472.87/2.07   915.65/2.61    1427.28/2.25   891.07/2.73    1431.29/2.10   882.06/2.89
RML(110)   1441.55/2.43   965.56/2.99    1308.83/3.05   889.42/3.31    1297.08/2.96   880.22/3.32
PN(160)    4232.08/3.69   2988.69/4.18   3801.42/4.50   2689.19/4.62   3785.05/4.39   2657.78/4.70
PN(240)    4306.63/4.07   3081.19/4.45   3726.05/4.75   2720.98/4.76   3676.35/4.72   2734.66/4.78
Figure 6: Comparative perplexity for one run on the RML data (perplexity versus seconds elapsed for OnlineHDP, NPLDA and HDP(Teh)).
Table 5: PMI scores for the comparative runs.
onlinehdp hca HDP
RML 2.607 3.47 4.452
TNG 4.042 4.017 4.887
Table 6: Effective Number of Topics for the comparative runs.
onlinehdp hca HDP
RML 37.0 155 149
TNG 7.1 92.7 89.6
Figure 7: Comparative perplexity for one run on the TNG data (perplexity versus seconds elapsed for OnlineHDP, NPLDA and HDP(Teh)).

The improvement in perplexity of hca over HDP is not that surprising because comparative experiments on even simple
models show the significant improvement of table indicator methods over CRP methods [6], and Sato et al. [20] also report substantial differences between different formulations for variational HDP-LDA. However, the poor performance of onlinehdp needs some explanation. On looking at the topics discovered by onlinehdp, we see there are many duplicates. Moreover, the topic proportions given by the $\vec\alpha$ vector show extreme bias towards earlier topics. It is known, for instance, that variational methods applied to the Dirichlet make the probability estimates more extreme. In this model one is working with a tree of Betas, so it seems the effect is confounded. A useful diagnostic here is the Effective Number of Topics, which is given by exponentiating the entropy of the estimated $\vec\alpha$ vector, shown in Table 6. One can see hca and HDP are similar here but onlinehdp has a dramatically reduced number of topics. The non-duplicated topics in the onlinehdp result, however, look good in terms of comprehensibility, so the online stochastic variational method is clearly a good way to get a smaller number of topics from a very large data set.
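This diagnostic is just the exponentiated entropy of the estimated topic proportion vector; a one-function sketch:

    import numpy as np

    def effective_topics(alpha):
        # Effective number of topics (cf. Table 6): exp of the entropy of the
        # normalised topic proportion vector.
        p = np.asarray(alpha, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return float(np.exp(-np.sum(p * np.log(p))))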
5.3.2 Comparison with Mallet
Mallet supports asymmetric-symmetric LDA, which is a form of truncated HDP-LDA using a finite symmetric Dirichlet to truncate a GEM. We compare the implementation of HDP-LDA in Mallet and hca. Results are reported for the RML and TNG datasets with 300 topics as previously, and also some from Table 4. As suggested in [16] we run Mallet for 2000 iterations, and optimise the hyperparameters every 10 major Gibbs cycles after an initial burn-in period of 50 cycles, to get the best results. Table 7 presents the comparative results. We can see that hca generally produces better results. Note that results produced by the full asymmetric version NP-LDA are even better, an option not implemented in Mallet.

Table 7: Comparative Results for Mallet.

                          hca
Dataset(K)   Mallet       (HDP-LDA)    (NP-LDA)
RML(300)     1404 ± 8     1280 ± 2     1145 ± 2
TNG(300)     4081 ± 27    3999 ± 10    3586 ± 8
MLT(50)      1357 ± 14    1389 (na)    1375 (na)
PN(240)      3844 ± 24    3726 (na)    3676 (na)

Table 8: Comparative Results for PCVB0.

                      hca
K     PCVB0        (HDP-LDA)   (NP-LDA)
200   1285 ± 10    1267 ± 5    1193 ± 5
300   1275 ± 10    1223 ± 5    1151 ± 5
5.3.3 Comparison with PCVB0
We also sought to compare hca with the variants of PCVB0 reported in [20]. These are a family of simplified variational algorithms, though the different variants seem to perform similarly. Without details of the document pre-processing, it was difficult to reproduce comparable datasets. Thus only results for their KOS blog corpus, available preprocessed from the UCI Machine Learning Repository, were used in producing the comparisons presented in Table 8. We note the smaller difference here in perplexity is such that better hyper-parameter estimation with PCVB0 could well make the algorithms more equal. Interestingly, Sato et al. report little difference between the symmetric or asymmetric priors on the Dirichlet on $\vec\phi_k$. In contrast, our corresponding asymmetric version NP-LDA shows significant improvements.
5.3.4 Comparison on NIPS 1988-2000 Dataset
A split-merge variant of HDP-LDA has been developed [2] that was compared with online and batch variational algorithms. For the NIPS data they have made runs with K = 300 and they estimate all hyperparameters. They use an 80-20% split for document completion and we replicated the experiment with the same dataset, parameter settings and sampling. The results are shown in Figure 8 and should be compared with [2, Figure 2(b)]. Their results show plots for 40 hours whereas we ran for 4.5 hours, so our algorithm is approximately 4 times faster per document. Our Gibbs implementation of HDP-LDA substantially beats all other non-split-merge algorithms. Not surprisingly, the sophisticated split-merge sampler eventually reaches the performance of ours. Note the NP-LDA model is superior to HDP-LDA on this data, and the bursty versions are clearly superior to all others.
Figure 8: Convergence on Roweis NIPS data for K = 300.
5.4 Effect of Hyperparameters on the Number of Topics
Standard reporting of experiments using HDP-LDA usually sets the parameter which governs the symmetric prior for the $\vec\phi_k$. For instance, some authors [13] call this $\eta$ and it is set to 0.01. Here we explore what happens when we vary this parameter (denoted $\beta$ in Figure 9) for the RML data. Note we have done this experiment on most of the data sets and the results are comparable. We train HDP-LDA for 1000 Gibbs cycles and then record the evaluation measures. This takes 60 minutes on the desktop for each value of $\beta$. We also do a run where $\beta$ is sampled. For each of the curves, the stopping point on the right gives the number of full topics used by the algorithms (ignoring trivially populated topics with 1-2 words). So the lowest perplexity is achieved by HDP-LDA with $\beta$ = 0.001, where roughly K = 2,400 topics are used. Sampling $\beta$ roughly tracks the lowest achieved for each number of topics.

Figure 9: Perplexity and PMI for the RML data when varying $\beta$ in the symmetric prior for HDP-LDA (test perplexity and PMI of topics versus the number of topics, for $\beta$ = 0.001, 0.01, 0.1, 0.5 and sampled $\beta$).
The PMI results also indicate that for larger $\beta$ one obtains more comprehensible topics, though fewer of them. Thus there is a trade-off: if you want fewer but more comprehensible topics, for instance a coarser summary of the content, then make $\beta$ larger. If you want a better fit to the data, or more finely grained topics, then estimate $\beta$ properly.
Table 9: Low proportion topics (proportion below 0.001) with lower variance factor for LAT data when K = 500.

PMI    topic words
0.31   zsa gabor capos slapping avert anhalt enright rolls-royce cop-slapping hensley judgeship leona
2.32   herald tribune examiner dailies gannet batten numeric press-telegram petersburg sentinel
4.02   baker PT evangelist bakers tammy faye swagged evangelists televangelists defrocked
Thus we can see that the number of topics found by HDP-LDA is significantly affected by the hyperparameter $\beta$, and thus it is probably inadvisable to fix it without careful experimentation, consideration or sampling. Moreover, the number of topics on RML, with roughly 20,000 documents, is up to 2,000. Inspection shows a good number of these are comprehensible. With larger collections we claim it would be impractical to attempt to estimate the right number of topics. For larger collections, one could be estimating tens of thousands of topics. Is this large number of topics even useful?
5.5 Topic Specific Concentrations
For the topic burstiness model of Figure 3 we had topic specific concentrations for the PYP, $b_{\nu,k}$. Now the concentration and discount together control the variance. So for document $i$ and topic $k$, the variance of a word probability $\nu_{i,k,w}$ from its mean $\phi_{k,w}$ will be $\left(\frac{1-a_\nu}{1+b_{\nu,k}}\right)\phi_{k,w}$ [3]. We call the ratio the variance factor. If it is close to one then the word proportions $\vec\nu_{i,k}$ for the topic have little relationship to their mean $\vec\phi_k$. If close to zero they are similar. Figure 10 considers 500 topics from a model built on the LAT data with K = 500 using PYP-LDA and topic burstiness.
Figure 10: Topic proportions versus the variance factor (log scale) for LAT data when K = 500.
About 15% of the topics have low values for the concentration that make the topics effectively random, and thus not properly used. Examples of topics with low proportions but variance factor below 0.4, so the topics are still usable, are given in Table 9. The first topic is actually about two issues: the first is the Zsa Zsa Gabor slapping incident, and the second is about Orange County Dist. Attys. Avert and Enright.
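A sketch of this diagnostic, computing the variance factor per topic and flagging topics whose bursty word vectors remain usably close to their mean (the 0.4 threshold is the illustrative value used above):

    import numpy as np

    def variance_factor(a_nu, b_nu):
        # (1 - a_nu) / (1 + b_nu[k]) per topic; values near 1 mean the
        # per-document word vectors bear little relation to the topic mean.
        return (1.0 - a_nu) / (1.0 + np.asarray(b_nu, dtype=float))

    def usable_topics(a_nu, b_nu, threshold=0.4):
        # Indices of topics whose variance factor stays below the threshold.
        return np.where(variance_factor(a_nu, b_nu) < threshold)[0]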
6. CONCLUSION
We have shown that an implementation of the HDP-LDA non-parametric topic model and the related non-parametric extension NP-LDA using block table indicator Gibbs sampling methods [4] are significantly superior to recent variational methods in terms of perplexity of results. The NP-LDA is also significantly superior in perplexity to the Mallet implementation of truncated HDP-LDA (masquerading as asymmetric-symmetric LDA). Taking account of the different implementation languages, the newer Gibbs samplers and variational methods also have the same memory footprint. Mallet is substantially faster, however, and performs well for HDP-LDA.
We note that these non-parametric methods have two goals: (A) better estimating prior topic or word proportions, and (B) estimating the "right" number of topics. The non-parametric methods seem superior at the first goal (A) over the parametric equivalents. Given that the estimated number of topics grows substantially with the collection size, it is not clear how important goal (B) can be. Arguably, goal (A) is the more important one.
Moreover, we have developed a Gibbs theory of burstiness that:
- is implemented as a front-end, so can in principle readily be applied to most variants of a topic model that use a Gibbs sampler;
- is a factor of 1.5-2 slower per major Gibbs cycle.
This will allow the wide variety of topic-model variants to easily take advantage of the burstiness model.
Through the experiments, we have illustrated some characterizations of the models, for instance:
- Our asymmetric-asymmetric NP-LDA model is about 75% slower than HDP-LDA but generally performs better than HDP-LDA, a different result to published results [25, 20] due to the different algorithms.
- The topic comprehensibility (as measured using PMI) is substantially improved by the burstiness version, as reported in the original work [5].
- The topic concentration parameter in the burstiness model goes very low when the topic is insignificant. We can use this to estimate which topics have become inactive in the model.
- The concentration parameter for the topic-word vectors significantly affects results, so care should be taken in experiments using these models.
7. ACKNOWLEDGEMENTS
Both authors were funded partly by NICTA. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. Thanks to Changyou Chen and Kar Wai Lim for their feedback and Changyou for running the HDP experiments.
8. REFERENCES
[1] J. Boyd-Graber, D. Blei, and X. Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024-1033, 2007.
[2] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2708-2716. 2012.
[3] W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296 [math.ST], arXiv, Feb. 2012.
[4] C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, pages 296-311. Springer, 2011.
[5] G. Doyle and C. Elkan. Accounting for burstiness in topic models. In Proc. of the 26th Annual Int. Conf. on Machine Learning, ICML '09, pages 281-288, 2009.
[6] L. Du. Non-parametric Bayesian Methods for Structured Topic Models: A Mixture Distribution Approach. PhD thesis, School of Computer Science, the Australian National University, Canberra, Australia, 2011.
[7] L. Du, W. Buntine, and H. Jin. Modelling sequential text with an adaptive topic model. In Proc. of the 2012 Joint Conf. on EMNLP and CoNLL, pages 535-545. ACM, 2012.
[8] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In HLT-NAACL, pages 190-200. The Association for Computational Linguistics, 2013.
[9] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In Proceedings of NAACL-HLT, pages 190-200, 2013.
[10] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337-348, 1992.
[11] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS Colloquium, 2004.
[12] S. Harter. A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Jnl. of the American Society for Information Science, 26(5):280-289, 1975.
[13] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.
[14] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211-1235, 2003.
[15] H. Ishwaran and L. James. Gibbs sampling methods for stick-breaking priors. Journal of ASA, 96(453):161-173, 2001.
[16] A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[17] D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Proc. of the 2010 Annual Conf. of the NAACL, pages 100-108, 2010.
[18] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333-389, Apr. 2009.
[19] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of the 20th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-04), pages 487-494, 2004.
[20] I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD international conf. on Knowledge discovery and data mining, pages 105-113. ACM, 2012.
[21] I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. KDD '10, pages 673-682. ACM, 2010.
[22] Y. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS '07. 2007.
[23] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore, 2006.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the ASA, 101(476):1566-1581, 2006.
[25] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 19, 2009.
[26] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML '09, pages 672-679. 2009.
[27] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS '11. 2011.