Experiments With Non Parametric Topic Models
Wray Buntine
Monash University
Clayton, VIC, Australia
[email protected]
Swapnil Mishra
RSISE, The Australian National University
Canberra, ACT, Australia
[email protected]
ABSTRACT
In topic modelling, various alternative priors have been developed, for instance asymmetric and symmetric priors for the document-topic and topic-word matrices respectively, the hierarchical Dirichlet process prior for the document-topic matrix and the hierarchical Pitman-Yor process prior for the topic-word matrix. For information retrieval, language models exhibiting word burstiness are important. Indeed, this burstiness effect has been shown to help topic models as well, and this requires additional word probability vectors for each document. Here we show how to combine these ideas to develop high-performing non-parametric topic models exhibiting burstiness based on standard Gibbs sampling. Experiments are done to explore the behaviour of the models under different conditions and to compare the algorithms with those previously published. The full non-parametric topic models with burstiness are only a small factor slower than standard Gibbs sampling for LDA and require double the memory, making them very competitive. We look at the comparative behaviour of different models and present some experimental insights.
Categories and Subject Descriptors
I.7 [Document and Text Processing]: Miscellaneous;
I.2.6 [Artificial Intelligence]: Learning
Keywords
topic modelling; experimental results; non-parametric prior;
text
1. INTRODUCTION
Topic models are now a recognised genre in the suite of exploratory software available for text data mining and other
semi-structured tasks. Moreover, it is also recognised that ... The Pitman-Yor process, when used in this way, can be used hierarchically to form distributions on a network of probability vectors.
The inference on a network of probability vectors is based
on a basic property of species sampling schemes [14] that
is best understood using the framework of message passing
over networks. Figure 1a shows the context of a probability vector $\vec{p}$ having a Pitman-Yor process with base distribution $\vec{\theta}$ (assuming a concentration at the node of $b_p$). In the Dirichlet case, marginalising out $\vec{p}$ gives the likelihood of the data (the count messages $\vec{n}$ and $\vec{m}$ from the children)
\[
l^{\mathrm{Dir}}_{p}(\vec{\theta}) \;=\; \frac{\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k \frac{\Gamma(b_p\theta_k + n_k + m_k)}{\Gamma(b_p\theta_k)}\,, \qquad (1)
\]
where the total statistics are $N = \sum_k n_k$ and $M = \sum_k m_k$. This functional complexity in $\vec{\theta}$ prevents any further network inference.
Figure 1b shows the alternative after marginalising out the
vector p using Pitman-Yor process theory [3]. One however
must introduce a new latent count vector $\vec{t}$ that represents the fraction of the data passed up as a message to the parent node $\vec{\theta}$. The marginalised form is then
\[
\frac{(b_p|a_p)_T\,\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k S^{\,n_k+m_k}_{\,t_k,\,a_p}\;\theta_k^{t_k}\,, \qquad (2)
\]
where the total $T = \sum_k t_k$ and $(x|y)_T$ denotes the Pochhammer symbol, $(x|y)_T = x(x+y)\cdots(x+(T-1)y)$. $S^{c}_{t,a}$ is a generalized second-order Stirling number [23]. Libraries² exist for efficiently dealing with these, making it in most cases an $O(1)$ computation.
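These numbers satisfy the simple recursion $S^{n+1}_{t,a} = S^{n}_{t-1,a} + (n - t a)\,S^{n}_{t,a}$ with $S^{0}_{0,a} = 1$ and $S^{n}_{t,a} = 0$ for $t > n$. The following is only a minimal sketch of tabulating them in log space for later use; the libraries cited in the footnote do this far more efficiently, and the function name is ours rather than theirs.

    import numpy as np

    def log_stirling_table(nmax, a):
        # logS[n, t] = log S^n_{t,a}; -inf marks entries that are zero.
        logS = np.full((nmax + 1, nmax + 1), -np.inf)
        logS[0, 0] = 0.0
        for n in range(nmax):
            for t in range(n + 2):
                left = logS[n, t - 1] if t >= 1 else -np.inf
                right = (np.log(n - t * a) + logS[n, t]
                         if t <= n and n - t * a > 0 else -np.inf)
                logS[n + 1, t] = np.logaddexp(left, right)
        return logS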
²See https://mloss.org/software/view/424/ and https://mloss.org/software/view/528

Figure 1: Computation with species sampling models. (a) Embedded probability vector with messages. (b) Embedded count vectors after marginalising.

The counts of $\vec{t}$ then contribute to the data at the parent node $\vec{\theta}$ and thus its posterior probability. Thus network inference is feasible, and moreover no dynamic memory is required, unlike CRP methods, because $\vec{t}$ is the same dimension as $\vec{n}+\vec{m}$. Sampling the $\vec{t}$ directly, however, leads
to a poor algorithm because they may have a large range (above, $0 \le t_k \le n_k + m_k$) and the impact of data at the node $\vec{p}$ on the node $\vec{\theta}$ is buffered by $\vec{t}$. Attaching a Boolean table indicator to each data item instead allows the change in $\vec{t}$ to be done incrementally and thus allows more rapid mixing and simpler sampling. Note the table indicators are Boolean values indicating if the current data item increments the table multiplicity at its node, and thus whether the data item contributes to the message to the parent. The assignment of indicators to data can be done because of the above constraint $t_k \le n_k + m_k$. So the data item contributes a +1 to $n_k + m_k$ and the matched table indicator contributes either 0 or +1 to $t_k$, which is the change in the message to the parent. If there is a grandparent node, then a corresponding table indicator in the parent node might also propagate a +1 up to the grandparent.
For inference on a network of such vectors, each probability vector node contributes a factor to the posterior probability. For the above example with table indicators this is given by Formula (3) [3]:
\[
\frac{(b_p|a_p)_T\,\Gamma(b_p)}{\Gamma(b_p + N + M)} \prod_k S^{\,n_k+m_k}_{\,t_k,\,a_p} \binom{n_k+m_k}{t_k}^{-1}\,, \qquad (3)
\]
where the addition of the $\binom{n_k+m_k}{t_k}$ term over Equation (2) simply divides by the number of choices there are for picking the $t_k$ Boolean table indicators to be on out of a possible $n_k+m_k$.
In sampling, a data point coming from the node source for $\vec{n}$ contributes a +1 to $n_k$ (for some $k$), and either contributes a +1 or a 0 to $t_k$ depending on the value of the table indicator. If $n_k = t_k = 0$ initially, then it must contribute a +1 to $t_k$, so there is no choice. The change in posterior probability of Formula (3) due to the new data point at this node is, given the Boolean indicator $r_l$,
\[
\frac{\Big((t_k + 1)\,(b_p + T a_p)\; S^{\,n_k+m_k+1}_{\,t_k+1,\,a_p}\Big)^{1_{r_l=1}}\;\Big((n_k + m_k - t_k + 1)\; S^{\,n_k+m_k+1}_{\,t_k,\,a_p}\Big)^{1_{r_l=0}}}{(n_k + m_k + 1)\,(b_p + N + M)\; S^{\,n_k+m_k}_{\,t_k,\,a_p}}\,, \qquad (4)
\]
depending on the value of the table indicator $r_l$ for the data point. In a network, one has to jointly sample table indicators for all reachable ancestor nodes in the network, and standard discrete graphical model inference is done in closed form. Examples are given by [7, 8].
For estimation, one requires the expected probabilities
\[
\mathbb{E}_{\vec{n},\vec{m},\vec{t},\vec{\theta}}\left[\,\vec{p}\,\right] \;=\; \frac{b_p + T a_p}{b_p + N + M}\,\vec{\theta} \;+\; \frac{\vec{n} + \vec{m} - a_p\vec{t}}{b_p + N + M}\,. \qquad (5)
\]
Since this does not involve knowing the table occupancies for the CRP, no additional sampling is needed to compute the formula; just the existing counts (i.e., $\vec{n}$, $\vec{m}$, $\vec{t}$) are used. Moreover, we know the estimates normalise correctly.
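As a worked illustration, the following sketch computes the posterior mean of Equation (5) at a node and the two unnormalised weights of Equation (4) used when sampling a table indicator. It assumes the log-space Stirling table sketched above; the function and variable names are ours, not those of the hca code.

    import numpy as np

    def posterior_mean(theta, n, m, t, a_p, b_p):
        # Equation (5): expected probability vector at a node given its
        # counts n, m, table counts t and the parent mean theta.
        N, M, T = n.sum(), m.sum(), t.sum()
        return ((b_p + T * a_p) * theta + n + m - a_p * t) / (b_p + N + M)

    def indicator_weights(k, n, m, t, a_p, b_p, logS):
        # Equation (4): unnormalised weights for the Boolean table indicator
        # r_l of a new data point landing in component k.  r_l = 1 opens a new
        # table and sends a +1 message to the parent node (whose own factor
        # then multiplies w1 when indicators are sampled jointly up the
        # network); r_l = 0 does not.  logS is log_stirling_table(nmax, a_p).
        N, M, T = n.sum(), m.sum(), t.sum()
        c, tk = n[k] + m[k], t[k]
        denom = (c + 1) * (b_p + N + M)
        w1 = (tk + 1) * (b_p + T * a_p) * np.exp(logS[c + 1, tk + 1] - logS[c, tk]) / denom
        w0 = (c - tk + 1) * np.exp(logS[c + 1, tk] - logS[c, tk]) / denom
        return w1, w0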
3. MODELS
3.1 Basic Models
The basic non-parametric topic model we consider is given in Figure 2. Here, the document-topic proportions $\vec\theta_i$ (for $i$ running over documents) have a PYP with mean $\vec\alpha$, and the topic-word vectors $\vec\phi_k$ (for $k$ running over topics) have a PYP with mean $\vec\gamma$. The mean vectors $\vec\alpha$ and $\vec\gamma$ correspond to the asymmetric priors of [25].
While we show $\vec\alpha$ and $\vec\gamma$ having a GEM prior [15, 24] in the figure, allowing different priors covers a range of LDA styles, as shown in Table 1. For instance, when $\vec\alpha$ is finite and the discount for the PYP on $\vec\theta_i$, $a_\theta$, is zero, then $\vec\theta_i \sim \mathrm{Dirichlet}(b_\theta\,\vec\alpha)$. The GEM is equivalent to the stick-breaking prior that is at the core of a DP or PYP, so when the GEM is truncated, or $\vec\alpha$ is simply set up to be Dirichlet distributed as just shown, we have truncated HDP-LDA. Notice there are different ways of providing a truncated prior to ensure a fixed dimensional $\vec\alpha$. The truncated GEM is used in various versions of truncated HDP-LDA [22, 27], and the simpler truncation, just using a Dirichlet, is implicit in the asymmetric priors of [25]. That is, the asymmetric-symmetric (AS) variant of LDA [25] is equivalent to a truncated HDP-LDA. This means that Mallet [16] has implemented a truncated HDP-LDA (via AS-LDA) since 2008, and it is indeed both one of the fastest and the best performing.

Table 1: Family of LDA Models. Here "tr" abbreviates truncated and "symm" abbreviates symmetric.

Thus we reproduce several alternative variants of LDA [25], as well as truncated versions of HDP-LDA, HPYP-LDA and a fully non-parametric asymmetric version (with the truncated GEM prior on both $\vec\alpha$ and $\vec\gamma$) we refer to as NP-LDA. Sampling algorithms for dealing with the HPYP-LDA case are from earlier work [4], and the other cases are similar.
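For concreteness, the generative process of NP-LDA under the notation above can be written out as below. This is a sketch: the hyperparameter names ($a_\alpha$, $b_\alpha$, and so on) are illustrative, and the alternative priors of Table 1 replace the GEM priors on $\vec\alpha$ and $\vec\gamma$ accordingly.
\begin{align*}
& \vec\alpha \sim \mathrm{GEM}(a_\alpha, b_\alpha), \qquad \vec\gamma \sim \mathrm{GEM}(a_\gamma, b_\gamma), \\
& \vec\theta_i \sim \mathrm{PY}(a_\theta, b_\theta, \vec\alpha) \ \text{ for each document } i, \qquad
  \vec\phi_k \sim \mathrm{PY}(a_\phi, b_\phi, \vec\gamma) \ \text{ for each topic } k, \\
& z_{i,l} \sim \mathrm{Discrete}(\vec\theta_i), \qquad
  x_{i,l} \sim \mathrm{Discrete}(\vec\phi_{z_{i,l}}) \ \text{ for each word position } l.
\end{align*}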
3.2 Bursty Models
The extension with burstiness we consider [5] is given in Figure 3. Here, each topic $\vec\phi_k$ is specialised to a variant $\vec\nu_{k,i}$ for each document $i$. Thus $\vec\nu_{k,i} \sim \mathrm{PY}\big(a_\nu, b_{\nu,k}, \vec\phi_k\big)$.
On the surface one would think introducing potentially $K \times W$ (number of topics by number of words) new parameters for each document, for the $\vec\nu_{k,i}$, seems statistically impractical. In practice, the $\vec\nu_{k,i}$ are marginalised out during inference and book-keeping only requires a small number of additional latent variables. Note that each topic $k$ has its own concentration parameter $b_{\nu,k}$. This feature will be illustrated in Subsection 5.5.
3.3 Inference with Burstiness
In LDA style topic modelling using our approach, we get a formula for sampling a new topic $z$ for a word $w$ in position $l$ in a document $d$. Suppose all the other data and the rest of the document is $D^{-(d,l)}$ and this is some model $M$ (maybe NP-LDA or LDA, etc.) with hyperparameters. Then denote this Gibbs sampling formula as $p\big(z \mid w, D^{-(d,l)}, M\big)$. For LDA, this is just the standard collapsed Gibbs sampling formula [11]. It also forms the first step of the block Gibbs sampler we use for HDP-LDA [4]: first we sample the topic $z$, and then we sample the various table indicators given $z$ in the model.
The burstiness model built on $M$, denote it $M$-B, is sampled using $p\big(z \mid w, D^{-(d,l)}, M\text{-B}\big)$, which is computed using $p\big(z \mid w, D^{-(d,l)}, M\big)$. Thus we say the burstiness model $M$-B is a front end to the Gibbs sampler. At the position $l$ in a document we have a word of type $w$ and wish to resample its topic $z = k$. Let $n_{w,k}$ be the number of other existing words of the type $w$ already in topic $k$ for the current document, and let $s_{w,k}$ be the corresponding table multiplicities. They are statistics for the parameters $\vec\nu_k$ in the burstiness model. Note by keeping track of which words in a document are unique, one knows that $n_{w,k} = 0$ for those words, thus computation can be substantially simplified. Let $N_{\cdot,k}$ and $S_{\cdot,k}$ be the corresponding totals for the topic $k$ in the document (i.e., summed over words). The matrices of counts $n_{w,k}$ and $s_{w,k}$ and vectors $N_{\cdot,k}$ and $S_{\cdot,k}$ can be recomputed as each document is processed in time proportional to the length of the document.
The Gibbs sampling probability for choosing $z = k$ at position $l$ for the burstiness model is obtained using Equation (4):
\[
p\big(z = k \mid w, D^{-(d,l)}, M\text{-B}\big) \;\propto\;
p\big(z = k \mid w, D^{-(d,l)}, M\big)\,
\frac{b_{\nu,k} + a_\nu S_{\cdot,k}}{b_{\nu,k} + N_{\cdot,k}}\,
\frac{s_{w,k}+1}{n_{w,k}+1}\,
\frac{S^{\,n_{w,k}+1}_{\,s_{w,k}+1,\,a_\nu}}{S^{\,n_{w,k}}_{\,s_{w,k},\,a_\nu}}
\;+\;
\frac{1}{b_{\nu,k} + N_{\cdot,k}}\,
\frac{n_{w,k}-s_{w,k}+1}{n_{w,k}+1}\,
\frac{S^{\,n_{w,k}+1}_{\,s_{w,k},\,a_\nu}}{S^{\,n_{w,k}}_{\,s_{w,k},\,a_\nu}}\,. \qquad (6)
\]
This has a special case when $s_{w,k} = n_{w,k} = 0$ of
\[
p\big(z = k \mid w, D^{-(d,l)}, M\big)\,
\frac{b_{\nu,k} + a_\nu S_{\cdot,k}}{b_{\nu,k} + N_{\cdot,k}}\,. \qquad (7)
\]
Once topic $z = k$ is sampled, the second term of Equation (6) is proportional to the probability that the table indicator for word $w$ in the $\vec\nu_k$ PYP is zero, i.e., it does not contribute data to the parent node $\vec\phi_k$, so the original model $M$ will ignore this data point. The first term of Equation (6) is proportional to the probability that the table indicator is one, so it does contribute data to the parent node $\vec\phi_k$, i.e., back to the original model $M$. This table indicator is sampled according to the two terms, and the $n_{w,k}$, $s_{w,k}$, $N_{\cdot,k}$, $S_{\cdot,k}$ are all updated. If the table indicator is one then the original model $M$ processes the data point in the manner it usually would.

Thus Equation (6) is used to filter words, so we refer to it as the burstiness front-end. Only words with table indicators of one are allowed to pass through to the regular model $M$ and contribute to its statistics for $\vec\phi_k$ and, for instance, any further PYP vectors in the model.
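A minimal sketch of one front-end step for a single word position is given below; p_z_given_M stands for the vector of base-model probabilities $p(z = k \mid w, D^{-(d,l)}, M)$, logS is a log-space Stirling table built with discount $a_\nu$, and all names are ours rather than those used in the hca implementation.

    import numpy as np

    def sample_topic_bursty(w, p_z_given_M, n, s, N, S, a_nu, b_nu, logS, rng):
        # One burstiness front-end Gibbs step, Equations (6) and (7).
        # n[w,k], s[w,k]: per-document counts and table multiplicities for word w;
        # N[k], S[k]: their totals over words; b_nu[k]: per-topic concentrations.
        K = len(N)
        term1 = np.empty(K)
        term0 = np.empty(K)
        for k in range(K):
            nwk, swk = n[w, k], s[w, k]
            ratio1 = np.exp(logS[nwk + 1, swk + 1] - logS[nwk, swk])
            ratio0 = np.exp(logS[nwk + 1, swk] - logS[nwk, swk])
            term1[k] = (p_z_given_M[k] * (b_nu[k] + a_nu * S[k]) / (b_nu[k] + N[k])
                        * (swk + 1) / (nwk + 1) * ratio1)
            term0[k] = (1.0 / (b_nu[k] + N[k])
                        * (nwk - swk + 1) / (nwk + 1) * ratio0)
        p = term1 + term0                     # Equation (6), unnormalised
        k = rng.choice(K, p=p / p.sum())      # sample the topic
        r = rng.random() < term1[k] / p[k]    # sample the table indicator
        n[w, k] += 1
        N[k] += 1
        if r:                                 # word passes through to model M
            s[w, k] += 1
            S[k] += 1
        return k, r

When the returned indicator is one, the base model M then processes the word exactly as it normally would; otherwise the word is absorbed by the document-specific $\vec\nu_k$ and M never sees it.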
4. EXPERIMENTAL SETUP
4.1 Implementation
The publicly available hca suite used in these experiments is coded in C using 16 and 32 bit integers where needed for saving space. All data preparation is done using the DCA-Bags³ package, a set of scripts, and input data can be handled in a number of formats including the LDA-C format. All algorithms are run on a desktop with an Intel(R) Core(TM) i7 8-core CPU (3.4GHz) using a single core.
The algorithms have no dynamic memory, so we set the
maximum number of topics K ahead of time. This is like
the truncation level in variational implementations of HDP-
LDA. Moreover, initialisation is done by setting the number
of topics to this maximum and randomly assigning words to
topics. Other authors [22] report that initialising to the maximum number of topics, rather than 1, leads to substantially better results, an experimental finding with which we agree.
Note, inference and learning for burstiness requires the word by topic counts $n_{w,k}$ and word by topic multiplicities $s_{w,k}$ to be maintained for each document, as well as their totals. There is an implementation trick used to achieve space efficiency here. First, one computes, for each document, which words appear more than once in the document (i.e., those for which $n_{w,k}$ can become greater than 1). These words require special handling, the full Equation (6), and lists of these are stored in preset variable length arrays. Words that occur only once in a document are easy to deal with since their sampling is governed by Equation (7) and no sampling of the table indicator is needed. Second, the count and multiplicity statistics (the $n_{w,k}$ and $s_{w,k}$, which are statistics for $\vec\nu_{i,k}$) are not stored but recomputed as each document is about to be processed. Moreover, this only needs to be done for words appearing more than once in the document (hence why lists of these are prestored). All one needs to recompute these statistics is the Boolean table indicators and the topic assignments. The statistics $n_{w,k}$, $s_{w,k}$ can be recomputed in time proportional to the length of the document.
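A sketch of that per-document recomputation, assuming each word occurrence stores its topic assignment and Boolean table indicator (the layout and names are illustrative, not those of hca):

    from collections import defaultdict

    def rebuild_doc_stats(doc):
        # doc: list of (word, topic, indicator) triples for one document.
        # Rebuilds n[w,k], s[w,k] and the per-topic totals N[k], S[k]
        # in time proportional to the document length.
        n, s = defaultdict(int), defaultdict(int)
        N, S = defaultdict(int), defaultdict(int)
        for w, k, r in doc:
            n[w, k] += 1
            N[k] += 1
            if r:               # indicator one: the word was passed to model M
                s[w, k] += 1
                S[k] += 1
        return n, s, N, S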
4.2 Data
We have used several datasets for our experiments, the
PN, MLT, RML, TNG, NIPS and LAT datasets. Not all
data sets were used in all comparisons.
The PN dataset is taken from 805K news articles (Reuters RCV1) using the query "person", excluding stop words and words appearing <5 times. The MLT dataset is abstracts from the JMLR volumes 1-11, the ICML years 2007-2011, and IEEE Trans. on PAMI 2006-2011. Stop words were discarded along with words appearing <5 or >2900 times. The RML dataset is the Reuters-21578 collection, made up using the standard ModLewis split. The TNG dataset is the 20-newsgroup dataset using the standard split. For both, stop words were discarded along with words appearing <5 times. The LAT dataset is the LA Times articles from TREC disk 4. Stop words were discarded along with words appearing <10 times. Only words made up entirely of alphabetic characters or dashes were allowed. Roweis NIPS dataset⁴ was left as is.

³http://mloss.org/software/view/522/

Table 2: Characteristic Sizes of Datasets

     PN      MLT    RML     TNG     LAT      NIPS
W    26037   4662   16994   35287   78953    13649
D    8616    2691   19813   18846   131896   1740
T    1000    306    6188    7532    0        348
N    1.76M   224k   1.27M   1.87M   34.5M    23.0M
Characteristics of these six datasets are given in Table 2,
where dictionary size is W, number of documents (including
test) is D, number of test documents is T and total number
of words in the collection is N.
4.3 Evaluation
The algorithms are evaluated on two different measures, test sample perplexity and point-wise mutual information (PMI). Perplexity is calculated over test data and is done using document completion [26], known to be unbiased and easy to implement for a broad class of models. The document completion estimate is averaged over 40 cycles per document done at the end of the training run and uses an 80-20% split, so every fifth word is used to evaluate perplexity and the remaining words to estimate latent variables for the document. Topic comprehensibility can be measured in terms of PMI [17]. It is done by measuring average word association between all pairs of words in the top-10 topic words (using the English Wikipedia articles). Here the PMI reported is the average across all topics. PMI files are prepared with the DCA-Bags package using linkCoco and projected onto the specific datasets using cooc2pmi.pl in the hca suite.
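For concreteness, a sketch of the document completion estimate given point estimates of the per-document topic proportions and the topic-word probabilities (hca instead averages this over 40 cycles as described; names here are illustrative):

    import numpy as np

    def doc_completion_perplexity(test_docs, theta, phi):
        # test_docs: list of word-id lists; theta[d, k]: topic proportions
        # estimated from the 80% portion of document d; phi[k, w]: topic-word
        # probabilities.  Every fifth word is held out and scored.
        log_prob, n_held = 0.0, 0
        for d, words in enumerate(test_docs):
            for w in words[4::5]:             # the 20% evaluation split
                log_prob += np.log(theta[d] @ phi[:, w])
                n_held += 1
        return float(np.exp(-log_prob / n_held))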
We also compare results with two other systems: onlinehdp [27], a stochastic variational algorithm for HDP-LDA coded in Python from C. Wang⁵, and HDP, a Matlab+C combination doing Gibbs sampling from Y.W. Teh. To do the comparisons, at various timepoints we take a snapshot of the $\vec\alpha$ vector and the $\vec\phi_k$ vectors. This is already supported in onlinehdp, and C. Chen provided the support for this task with HDP. We then load these values along with the hyperparameter settings into hca and use its document completion and PMI evaluation options -V -p -hdoc,5. In this way, all algorithms are compared using identical software.
5. EXPERIMENTS
5.1 Runtime Comparisons
To see how the algorithms work at scale, we consider the
cycle times and memory requirements of the different versions running on the full LAT data set. These are given in
Table 3. Cycle times in minutes are for a full pass through all
documents and memory requirements are given in megabytes.
LDA, HDP-LDA (where a … 1 is used. For onlinehdp we did a large number of runs varying $\tau_0$ = 1, 4, 16, 64, $\kappa$ = 0.5, 0.8 and K = 150, 300 and batchsize = 250, 1000. Note $\tau_0$ = 64, $\kappa$ = 0.8 are recommended in [27]. Only the fastest and best converging result is given for onlinehdp. We did one run of both hca and HDP with these settings, noting that the differences are way outside of the range of typical statistical variation between individual runs. Plots of the runs over time are given in Figures 6 and 7, and the final PMI scores for the 3 algorithms are given in Table 5.

Table 4: Document completion perplexity and PMI for hca variants. Data is presented as Perplexity/PMI. HDP is short for HDP-LDA.

Data (K)   LDA            Burst LDA      HDP            Burst HDP      NP-LDA         Burst NP-LDA
MLT(10)    1493.62/2.33   915.46/2.47    1480.85/2.61   904.29/2.59    1480.20/2.38   907.74/2.70
MLT(50)    1504.63/2.94   1008.68/3.26   1389.29/3.70   940.69/3.63    1375/3.47      932.88/3.93
RML(50)    1472.87/2.07   915.65/2.61    1427.28/2.25   891.07/2.73    1431.29/2.10   882.06/2.89
RML(110)   1441.55/2.43   965.56/2.99    1308.83/3.05   889.42/3.31    1297.08/2.96   880.22/3.32
PN(160)    4232.08/3.69   2988.69/4.18   3801.42/4.50   2689.19/4.62   3785.05/4.39   2657.78/4.70
PN(240)    4306.63/4.07   3081.19/4.45   3726.05/4.75   2720.98/4.76   3676.35/4.72   2734.66/4.78
Figure 6: Comparative perplexity for one run on the RML data (perplexity versus seconds elapsed for OnlineHDP, NPLDA and HDP(Teh)).
Table 5: PMI scores for the comparative runs.
onlinehdp hca HDP
RML 2.607 3.47 4.452
TNG 4.042 4.017 4.887
Table 6: Effective Number of Topics for the comparative runs.
onlinehdp hca HDP
RML 37.0 155 149
TNG 7.1 92.7 89.6
Figure 7: Comparative perplexity for one run on the TNG data (perplexity versus seconds elapsed for OnlineHDP, NPLDA and HDP(Teh)).

The improvement in perplexity of hca over HDP is not that surprising because comparative experiments on even simple
models show the significant improvement of table indicator methods over CRP methods [6], and Sato et al. [20] also report substantial differences between different formulations for variational HDP-LDA. However, the poor performance of onlinehdp needs some explanation. On looking at the topics discovered by onlinehdp, we see there are many duplicates. Moreover, the topic proportions given by the $\vec\alpha$ vector show extreme bias towards earlier topics. It is known, for instance, that variational methods applied to the Dirichlet make the probability estimates more extreme. In this model one is working with a tree of Betas, so it seems the effect is confounded. A useful diagnostic here is the Effective Number of Topics, which is given by exponentiating the entropy of the estimated $\vec\alpha$ vector, shown in Table 6. One can see hca and HDP are similar here but onlinehdp has a dramatically reduced number of topics. The non-duplicated topics in the onlinehdp result, however, look good in terms of comprehensibility, so the online stochastic variational method is clearly a good way to get a smaller number of topics from a very large data set.
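This diagnostic is just the exponentiated entropy of the estimated topic proportion vector; a one-function sketch:

    import numpy as np

    def effective_topics(alpha):
        # Effective number of topics (cf. Table 6): exp of the entropy of the
        # normalised topic proportion vector.
        p = np.asarray(alpha, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return float(np.exp(-np.sum(p * np.log(p))))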
5.3.2 Comparison with Mallet
Mallet supports asymmetric-symmetric LDA, which is a form of truncated HDP-LDA using a finite symmetric Dirichlet to truncate a GEM. We compare the implementation of HDP-LDA in Mallet and hca. Results are reported for the RML and TNG datasets with 300 topics as previously, and also some from Table 4. As suggested in [16] we run Mallet for 2000 iterations, and optimise the hyperparameters every 10 major Gibbs cycles after an initial burn-in period of 50 cycles, to get the best results. Table 7 presents the comparative results. We can see that hca generally produces better results. Note that results produced by the full asymmetric version NP-LDA are even better, an option not implemented in Mallet.

Table 7: Comparative Results for Mallet.

                          hca
Dataset(K)   Mallet       (HDP-LDA)    (NP-LDA)
RML(300)     1404 ± 8     1280 ± 2     1145 ± 2
TNG(300)     4081 ± 27    3999 ± 10    3586 ± 8
MLT(50)      1357 ± 14    1389 (na)    1375 (na)
PN(240)      3844 ± 24    3726 (na)    3676 (na)

Table 8: Comparative Results for PCVB0.

                      hca
K     PCVB0        (HDP-LDA)   (NP-LDA)
200   1285 ± 10    1267 ± 5    1193 ± 5
300   1275 ± 10    1223 ± 5    1151 ± 5
5.3.3 Comparison with PCVB0
We also sought to compare hca with the variants of PCVB0 reported in [20]. These are a family of simplified variational algorithms, though the different variants seem to perform similarly. Without details of the document pre-processing, it was difficult to reproduce comparable datasets. Thus only results for their KOS blog corpus, available preprocessed from the UCI Machine Learning Repository, were used in producing the comparisons presented in Table 8. We note the smaller difference here in perplexity is such that better hyper-parameter estimation with PCVB0 could well make the algorithms more equal. Interestingly, Sato et al. report little difference between the symmetric or asymmetric priors on the Dirichlet on $\vec\phi_k$. In contrast, our corresponding asymmetric version NP-LDA shows significant improvements.
5.3.4 Comparison on NIPS 1988-2000 Dataset
A split-merge variant of HDP-LDA has been developed [2] that was compared with online and batch variational algorithms. For the NIPS data they have made runs with K = 300 and they estimate all hyperparameters. They use an 80-20% split for document completion and we replicated the experiment with the same dataset, parameter settings and sampling. The results are shown in Figure 8 and should be compared with [2, Figure 2(b)]. Their results show plots for 40 hours whereas we ran for 4.5 hours, so our algorithm is approximately 4 times faster per document. Our Gibbs implementation of HDP-LDA substantially beats all other non-split-merge algorithms. Not surprisingly, the sophisticated split-merge sampler eventually reaches the performance of ours. Note the NP-LDA model is superior to HDP-LDA on this data, and the bursty versions are clearly superior to all others.
Figure 8: Convergence on Roweis NIPS data for K = 300.
5.4 Effect of Hyperparameters on the Number of Topics
Standard reporting of experiments using HDP-LDA usually sets the parameter which governs the symmetric prior for the $\vec\phi_k$. For instance, some authors [13] call this $\eta$ and it is set to 0.01. Here we explore what happens when we vary this parameter (denoted $\beta$ in Figure 9) for the RML data. Note we have done this experiment on most of the data sets and the results are comparable. We train HDP-LDA for 1000 Gibbs cycles and then record the evaluation measures. This takes 60 minutes on the desktop for each value of $\beta$. We also do a run where $\beta$ is sampled. For each of the curves, the stopping point on the right gives the number of full topics used by the algorithms (ignoring trivially populated topics with 1-2 words). So the lowest perplexity is achieved by HDP-LDA with $\beta$ = 0.001, where roughly K = 2,400 topics are used. Sampling $\beta$ roughly tracks the lowest achieved for each number of topics.

Figure 9: Perplexity and PMI for the RML data when varying $\beta$ in the symmetric prior for HDP-LDA (test perplexity and PMI of topics versus the number of topics, for $\beta$ = 0.001, 0.01, 0.1, 0.5 and sampled $\beta$).
The PMI results also indicate that for larger $\beta$ one obtains more comprehensible topics, though fewer of them. Thus there is a trade-off: if you want fewer but more comprehensible topics, for instance a coarser summary of the content, then make $\beta$ larger. If you want a better fit to the data, or more finely grained topics, then estimate $\beta$ properly.
Table 9: Low proportion topics (proportion below 0.001) with lower variance factor for LAT data when K = 500.

PMI    topic words
0.31   zsa gabor capos slapping avert anhalt enright rolls-royce cop-slapping hensley judgeship leona
2.32   herald tribune examiner dailies gannet batten numeric press-telegram petersburg sentinel
4.02   baker PT evangelist bakers tammy faye swagged evangelists televangelists defrocked
Thus we can see that the number of topics found by HDP-LDA is significantly affected by the hyperparameter $\beta$, and thus it is probably inadvisable to fix it without careful experimentation, consideration or sampling. Moreover, the number of topics on RML, with roughly 20,000 documents, is up to 2,000. Inspection shows a good number of these are comprehensible. With larger collections we claim it would be impractical to attempt to estimate the right number of topics. For larger collections, one could be estimating tens of thousands of topics. Is this large number of topics even useful?
5.5 Topic Specific Concentrations
For the topic burstiness model of Figure 3 we had topic specific concentrations for the PYP, $b_{\nu,k}$. Now the concentration and discount together control the variance. So for document $i$ and topic $k$, the variance of a word probability $\nu_{i,k,w}$ from its mean $\phi_{k,w}$ will be $\left(\frac{1-a_\nu}{1+b_{\nu,k}}\right)\phi_{k,w}$ [3]. We call the ratio the variance factor. If it is close to one then the word proportions $\vec\nu_{i,k}$ for the topic have little relationship to their mean $\vec\phi_k$. If close to zero they are similar. Figure 10 considers 500 topics from a model built on the LAT data with K = 500 using PYP-LDA and topic burstiness.
Figure 10: Topic proportions versus the variance factor (log scale) for LAT data when K = 500.
About 15% of the topics have low values for the concentration that make the topics effectively random, and thus not properly used. Examples of topics with low proportions but variance factor below 0.4, so the topics are still usable, are given in Table 9. The first topic is actually about two issues: the first is the Zsa Zsa Gabor slapping incident, and the second is about Orange County Dist. Attys. Avert and Enright.
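A sketch of this diagnostic, computing the variance factor per topic and flagging topics whose bursty word vectors remain usably close to their mean (the 0.4 threshold is the illustrative value used above):

    import numpy as np

    def variance_factor(a_nu, b_nu):
        # (1 - a_nu) / (1 + b_nu[k]) per topic; values near 1 mean the
        # per-document word vectors bear little relation to the topic mean.
        return (1.0 - a_nu) / (1.0 + np.asarray(b_nu, dtype=float))

    def usable_topics(a_nu, b_nu, threshold=0.4):
        # Indices of topics whose variance factor stays below the threshold.
        return np.where(variance_factor(a_nu, b_nu) < threshold)[0]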
6. CONCLUSION
We have shown that an implementation of the HDP-LDA non-parametric topic model and the related non-parametric extension NP-LDA using block table indicator Gibbs sampling methods [4] are significantly superior to recent variational methods in terms of perplexity of results. The NP-LDA is also significantly superior in perplexity to the Mallet implementation of truncated HDP-LDA (masquerading as asymmetric-symmetric LDA). Taking account of the different implementation languages, the newer Gibbs samplers and variational methods also have the same memory footprint. Mallet is substantially faster, however, and performs well for HDP-LDA.
We note that these non-parametric methods have two goals: (A) better estimating prior topic or word proportions, and (B) estimating the "right" number of topics. The non-parametric methods seem superior at the first goal (A) over the parametric equivalents. Given that the estimated number of topics grows substantially with the collection size, it is not clear how important goal (B) can be. Arguably, goal (A) is the more important one.
Moreover, we have developed a Gibbs theory of burstiness that:
- is implemented as a front-end, so can in principle readily be applied to most variants of a topic model that use a Gibbs sampler;
- is a factor of 1.5-2 slower per major Gibbs cycle.
This will allow the wide variety of topic-model variants to easily take advantage of the burstiness model.
Through the experiments, we have illustrated some characterizations of the models, for instance:
- Our asymmetric-asymmetric NP-LDA model is about 75% slower than HDP-LDA but generally performs better than HDP-LDA, a different result to published results [25, 20] due to the different algorithms.
- The topic comprehensibility (as measured using PMI) is substantially improved by the burstiness version, as reported in the original work [5].
- The topic concentration parameter in the burstiness model goes very low when the topic is insignificant. We can use this to estimate which topics have become inactive in the model.
- The concentration parameter for the topic-word vectors significantly affects results, so care should be taken in experiments using these models.
7. ACKNOWLEDGEMENTS
Both authors were funded partly by NICTA. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. Thanks to Changyou Chen and Kar Wai Lim for their feedback and Changyou for running the HDP experiments.
8. REFERENCES
[1] J. Boyd-Graber, D. Blei, and X. Zhu. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024-1033, 2007.
[2] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2708-2716. 2012.
[3] W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296 [math.ST], arXiv, Feb. 2012.
[4] C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD, pages 296-311. Springer, 2011.
[5] G. Doyle and C. Elkan. Accounting for burstiness in topic models. In Proc. of the 26th Annual Int. Conf. on Machine Learning, ICML '09, pages 281-288, 2009.
[6] L. Du. Non-parametric Bayesian Methods for Structured Topic Models: A Mixture Distribution Approach. PhD thesis, School of Computer Science, the Australian National University, Canberra, Australia, 2011.
[7] L. Du, W. Buntine, and H. Jin. Modelling sequential text with an adaptive topic model. In Proc. of the 2012 Joint Conf. on EMNLP and CoNLL, pages 535-545. ACM, 2012.
[8] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In HLT-NAACL, pages 190-200. The Association for Computational Linguistics, 2013.
[9] L. Du, W. Buntine, and M. Johnson. Topic segmentation with a structured topic model. In Proceedings of NAACL-HLT, pages 190-200, 2013.
[10] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337-348, 1992.
[11] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS Colloquium, 2004.
[12] S. Harter. A probabilistic approach to automatic keyword indexing. Part II. An algorithm for probabilistic indexing. Jnl. of the American Society for Information Science, 26(5):280-289, 1975.
[13] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.
[14] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211-1235, 2003.
[15] H. Ishwaran and L. James. Gibbs sampling methods for stick-breaking priors. Journal of ASA, 96(453):161-173, 2001.
[16] A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[17] D. Newman, J. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Proc. of the 2010 Annual Conf. of the NAACL, pages 100-108, 2010.
[18] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333-389, Apr. 2009.
[19] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of the 20th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-04), pages 487-494, 2004.
[20] I. Sato, K. Kurihara, and H. Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proc. of the 18th ACM SIGKDD international conf. on Knowledge discovery and data mining, pages 105-113. ACM, 2012.
[21] I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. KDD '10, pages 673-682. ACM, 2010.
[22] Y. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS '07. 2007.
[23] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore, 2006.
[24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the ASA, 101(476):1566-1581, 2006.
[25] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems 19, 2009.
[26] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML '09, pages 672-679. 2009.
[27] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS '11. 2011.