A Dual Markov Chain Topic Model for Dynamic Environments
Ayan Acharya, Joydeep Ghosh, Mingyuan Zhou
ABSTRACT
The abundance of digital text has led to extensive research on topic models that reason about documents using latent representations. Since for many online or streaming textual sources, such as news outlets, the number and nature of topics change over time, there have been several efforts that attempt to address such situations using dynamic versions of topic models. Unfortunately, existing approaches encounter more complex inference when their model parameters are varied over time, resulting in high computational complexity and performance degradation. This paper introduces DM-DTM, a dual Markov chain dynamic topic model, for characterizing a corpus that evolves over time. This model uses a gamma Markov chain and a Dirichlet Markov chain to allow the topic popularities and word-topic assignments, respectively, to vary smoothly over time. Novel applications of the negative binomial augmentation trick result in simple, efficient, closed-form updates of all the required conditional posteriors, resulting in far lower computational requirements as well as less sensitivity to initial conditions, compared to existing approaches. Moreover, via a gamma process prior, the number of desired topics is inferred directly from the data rather than being pre-specified, and can vary as the data changes. Empirical comparisons using multiple real-world corpora demonstrate a clear superiority of DM-DTM over strong baselines for both static and dynamic topic models.

CCS CONCEPTS
• Mathematics of computing → Bayesian networks; Bayesian nonparametric models; Time series analysis; • Information systems → Document topic models;

KEYWORDS
dynamic topic model; CRT augmentation; Gibbs sampling

ACM Reference Format:
Ayan Acharya, Joydeep Ghosh, and Mingyuan Zhou. 2018. A Dual Markov Chain Topic Model for Dynamic Environments. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3219995

1 INTRODUCTION
Analysis of dyadic data, which represents the relationships between two different sets of entities, such as documents and words or users and items, has been a prolific domain of research over the past decade, driven largely by applications in diverse areas such as topic modeling [5, 9, 17], recommender systems [13, 16, 29], e-commerce [38] and bio-informatics [36]. Successful as these analysis techniques are, a major limitation of most of them is that they are static models that ignore the temporal correlation and evolution of the relationships between entities – an attribute present in most real-world dyadic data. Text mining researchers have developed a handful of techniques for analyzing corpora that evolve over time by modeling them as a sequence of document-by-word count matrices. Some of these techniques employ Kalman-filtering-based inference and a nonlinear transformation of the latent states to the discrete observations [8, 31, 45], while others [5, 6] use a temporal Dirichlet process and make arguably simplistic assumptions to calculate an intractable posterior. Since the inference techniques for linear dynamical systems are well developed, one is usually tempted to connect a count-valued observation to a latent Gaussian random variable. However, such approaches often incur heavy computational cost, fail to exploit the natural sparsity of the data, and lack interpretability of the latent states, as the components of these states may take negative values. This is also true for models in recommender systems that exploit temporal correlation [28, 48] but hypothesize that the observation is generated from an interaction of latent factors that follow a normal distribution. Clearly, such an assumption is restrictive for count-valued dyadic data unless some nonlinear transformation is used, which again makes the inference intractable [12]. This critical problem of non-conjugacy, arising from latent Gaussian variables and their subsequent nonlinear transformation to model count-valued observations, can be mitigated using the Pólya-Gamma augmentation trick [19, 31]. However, such augmentation does not necessarily improve the empirical performance, as evidenced in Section 4.

The objective of this paper is to model a set of documents that evolves over time and to provide an inference mechanism that does not make crude approximations. To that end, we introduce DM-DTM, a novel dual Markov chain based dynamic topic model. A critical aspect of DM-DTM is that, unlike the standard techniques adopted in both text mining and recommender system problems, the observations are modeled using a Poisson distribution and the latent factors/topics are allowed to vary smoothly over time using the gamma and Dirichlet distributions. To be more specific, two separate Markov chains are introduced – a gamma Markov chain and a Dirichlet Markov chain. The gamma Markov chain models the temporal evolution of the popularities of the topics. The Dirichlet Markov chain, on the other hand, is employed to adapt the topic-word assignments with time.
Gibbs sampling is adopted for inference, where the conditional posteriors are all available in closed form. This is made possible by the use of an augmentation trick associated with the negative binomial distribution, together with a forward-backward sampling algorithm, each step of which assumes a posterior that is easy to sample from [1]. Using the gamma process [20] to generate a countably infinite number of weighted latent factors in the prior, the model can infer a parsimonious set of topics from the data in the posterior. Empirical comparisons in terms of held-out perplexity indicate the clear superiority of DM-DTM over two of the most widely used temporal topic models [8, 31].

The remainder of the paper is organized as follows. Section 2 provides a detailed description of the modeling assumptions and the inference techniques of DM-DTM. Related work is outlined in Section 3. Empirical results with real-world data are reported in Section 4. Finally, conclusions and future work are listed in Section 5.

The parameter $\beta_{tk}$ models the distribution of the words within the $k$-th latent factor at time $t$. Additionally, each atom $\beta_{tk}$ is associated with an atom $(\theta_{dk})_{d \in D_t}$, which is a $D_t$-dimensional random vector distributed as $(\theta_{dk})_{d \in D_t} \sim \prod_{d=1}^{D_t} \mathrm{Gam}(r_{tk}, 1/c_d)$. The $(d, w)$-th entry of $X_t$ is assumed to be generated from a sum of latent counts as $x_{tdw} \sim \mathrm{Pois}\big(\sum_k \lambda_{tdwk}\big)$, where $\lambda_{tdwk} = \theta_{dk}\beta_{twk}$. One may consider $\lambda_{tdwk}$ as the strength of the $k$-th latent factor that dictates the relation between the $d$-th document and the $w$-th word at time $t$. Each of these latent counts is composed of two parts: $\theta_{dk}$ models the affinity of the $d$-th document to the $k$-th latent factor, and $\beta_{twk}$ models the popularity of the $w$-th word within the $k$-th latent factor at time $t$. Each latent factor contributes such a count, and the total count is the aggregate over the countably infinite latent factors.
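To make the generative structure above concrete, here is a minimal NumPy sketch of a truncated version of the process: topic popularities evolve through a gamma Markov chain, topic-word distributions through a Dirichlet Markov chain, and counts are drawn from the Poisson superposition $x_{tdw} \sim \mathrm{Pois}(\sum_k \theta_{dk}\beta_{twk})$. The truncation level, the hyperparameter values, and the exact coupling of the gamma chain (taken here as $r_{tk} \sim \mathrm{Gam}(r_{(t-1)k}, 1/c)$) are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 5, 10, 200                        # time slices, truncation level, vocabulary size
D = [50] * T                                # documents per time slice
eta, gamma0, c, c_d = 0.1, 10.0, 1.0, 1.0   # illustrative hyperparameter values

# Gamma Markov chain over topic popularities r_{tk}
# (assumed coupling r_{tk} ~ Gam(r_{(t-1)k}, 1/c); the paper's exact form may differ).
r = np.empty((T, K))
r[0] = rng.gamma(gamma0 / K, 1.0 / c, size=K)
for t in range(1, T):
    r[t] = np.maximum(rng.gamma(r[t - 1], 1.0 / c), 1e-6)   # floor keeps the toy chain away from underflow

# Dirichlet Markov chain over topic-word distributions: beta_{(t+1)k} ~ Dir(eta * V * beta_{tk}).
beta = np.empty((T, K, V))
beta[0] = rng.dirichlet(np.full(V, eta), size=K)
for t in range(1, T):
    for k in range(K):
        beta[t, k] = rng.dirichlet(eta * V * beta[t - 1, k] + 1e-12)  # tiny floor for numerical stability

# Document loadings theta_{dk} ~ Gam(r_{tk}, 1/c_d) and counts x_{tdw} ~ Pois(sum_k theta_{dk} beta_{twk}).
X = []
for t in range(T):
    theta = rng.gamma(np.tile(r[t], (D[t], 1)), 1.0 / c_d)
    X.append(rng.poisson(theta @ beta[t]))
```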
Lemma 2.3. ([53]) If $\beta \sim \mathrm{Dir}(\eta)$, $\eta \sim \mathrm{Gam}(s_0, 1/t_0)$, and $(x_w)_{w=1}^{V} \sim \mathrm{mult}\big((\beta_w)_{w=1}^{V}; \sum_w x_w\big)$, then
$$(\eta \mid -) \sim \mathrm{Gam}\Big(s_0 + \sum_w \xi_w,\; 1\big/\big(t_0 - V \log(1 - \zeta)\big)\Big),$$
where $\xi_w \sim \mathrm{CRT}(x_w, \eta)$ and $\zeta \sim \mathrm{Beta}\big(\sum_w x_w, \eta V\big)$.
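Because CRT (Chinese restaurant table) augmentation drives most of the closed-form updates below, the following minimal sketch shows how a CRT draw and the resulting gamma conditional of Lemma 2.3 can be simulated; the data and hyperparameter values are arbitrary toy choices, not the paper's setup.

```python
import numpy as np

def sample_crt(x, r, rng):
    """Draw l ~ CRT(x, r): the number of occupied tables after x customers
    enter a Chinese restaurant with concentration parameter r."""
    if x == 0:
        return 0
    return int(rng.binomial(1, r / (r + np.arange(x))).sum())

rng = np.random.default_rng(0)
V, s0, t0, eta = 50, 1.0, 1.0, 0.5           # toy dimensions and hyperparameters
beta = rng.dirichlet(np.full(V, eta))
x = rng.multinomial(1000, beta)              # (x_w) ~ mult(beta; sum_w x_w)

# Augmentations of Lemma 2.3 (drawn given the current value of eta),
# followed by the closed-form gamma conditional for eta.
xi = np.array([sample_crt(int(xw), eta, rng) for xw in x])
zeta = rng.beta(x.sum(), eta * V)
eta_new = rng.gamma(s0 + xi.sum(), 1.0 / (t0 - V * np.log(1.0 - zeta)))
print(eta_new)
```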
The sampling for $t = T$ is easy and follows as:
$$\big((\beta_{Twk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta V \beta_{(T-1)wk} + x_{T\cdot wk}\big)_{w=1}^{V}\Big).$$
For $2 \le t \le (T-1)$, the sampling is non-trivial due to the Dirichlet Markov chain. However, from the relation between the Poisson and multinomial distributions, it follows that
$$\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V} \sim \mathrm{mult}\Big(\big(\beta_{(t+1)wk}\big)_{w=1}^{V};\; x_{(t+1)\cdot\cdot k}\Big).$$
Since $(\beta_{(t+1)wk})_{w=1}^{V} \sim \mathrm{Dir}\big(\eta V (\beta_{twk})_{w=1}^{V}\big)$, we may integrate out $\beta_{(t+1)wk}$ and, according to the definition of the Dirichlet-multinomial distribution, we have
$$\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V} \sim \mathrm{DirMult}\Big(\eta V \big(\beta_{twk}\big)_{w=1}^{V}\Big).$$
The Dirichlet-multinomial likelihood is further augmented with $\zeta_{(t+1)k} \sim \mathrm{Beta}\big(x_{(t+1)\cdot\cdot k}, \eta V\big)$ and, according to Lemma 2.3, the joint distribution takes the following form:
$$f\Big(\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V}, \zeta_{(t+1)k}\Big) \propto \prod_{w=1}^{V} \mathrm{NB}\big(x_{(t+1)\cdot wk};\, \eta V,\, \zeta_{(t+1)k}\big).$$
We now augment $\xi_{(t+1)wk} \sim \mathrm{CRT}\big(x_{(t+1)\cdot wk}, \eta \beta_{twk}\big)$ and, using the results of Lemma 2.3, sample $\beta_{twk}$ as:
$$\big((\beta_{twk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta V \beta_{(t-1)wk} + x_{t\cdot wk} + \xi_{(t+1)wk}\big)_{w=1}^{V}\Big).$$
This augmentation trick is illustrated in further detail in the proof of Lemma 2.3 [53]. For $t = 1$, the sampling follows almost the same pattern except that the prior is changed:
$$\big((\beta_{1wk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta + x_{1\cdot wk} + \xi_{2wk}\big)_{w=1}^{V}\Big).$$

Sampling of $\gamma_0$: We augment $\ell_{0k} \sim \mathrm{CRT}(\ell_{1k}, \gamma_0/K)$ and use Lemma 2.2 to derive:
$$(\gamma_0 \mid -) \sim \mathrm{Gam}\Big(e_0 + \sum_k \ell_{0k},\; 1\big/\big(f_0 - \tfrac{1}{K}\sum_k \log(1 - p_{0k})\big)\Big), \qquad p_{0k} = \frac{\log(1 - p_{1k})}{\log(1 - p_{1k}) - c}.$$

Sampling of $\eta$: Sampling of $\eta$ follows from an application of Lemma 2.2 and Bayes' rule as:
$$(\eta \mid -) \sim \mathrm{Gam}\Big(s_0 + \sum_{t,w,k} \xi_{twk},\; 1\big/\big(t_0 - \sum_{t,k} \log(1 - \zeta_{tk})\big)\Big).$$

Algorithm 1: Forward-Backward Gibbs Sampling
Result: $\{r_{tk}^{(s)}\}_{s=1}^{S}$, $\{\theta_{dk}^{(s)}\}_{s=1}^{S}$, $\{\beta_{twk}^{(s)}\}_{s=1}^{S}$
1:  for $s \in \{1, 2, \cdots, S\}$ do
2:    for $d \in \{D_t\}_{t=1}^{T}$ do
3:      sample $\{x_{tdwk}^{(s)}\}$ and $c_d^{(s)}$
4:    end
5:    backward sampling: initialize $t = T$
6:    while $t > 0$ do
7:      sample $\{\ell_{tdk}^{(s)}\}$, $\{L_{tk}^{(s)}\}$, $\{\zeta_{tk}^{(s)}\}$, and $\{\xi_{twk}^{(s)}\}$
8:      cache $\{q_{tk}^{(s)}\}$ to use in forward sampling
9:      $t \leftarrow t - 1$
10:   end
11:   forward sampling: initialize $t = 1$
12:   while $t \le T$ do
13:     sample $\{r_{tk}^{(s)}\}$, $\{\theta_{dk}^{(s)}\}_{d \in D_t}$, and $\{\beta_{twk}^{(s)}\}$
14:     $t \leftarrow t + 1$
15:   end
16:   sample $c^{(s)}$, $\gamma_0^{(s)}$, $\eta^{(s)}$
17: end
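To illustrate how the backward (augmentation) and forward (Dirichlet draw) passes of Algorithm 1 fit together for the topic-word chain, here is a self-contained sketch of one Gibbs sweep for a single topic, assuming the aggregated latent counts $x_{t\cdot wk}$ have already been allocated; the function and variable names are ours and the numerical floor is only for stability of the toy example, so this is a sketch of the update structure rather than the paper's implementation.

```python
import numpy as np

def sample_crt(x, r, rng):
    """Draw l ~ CRT(x, r)."""
    if x == 0:
        return 0
    return int(rng.binomial(1, r / (r + np.arange(x))).sum())

def resample_beta_chain(x, beta, eta, rng):
    """One Gibbs sweep over the Dirichlet Markov chain for a single topic k.

    x    : (T, V) aggregated latent counts x_{t.wk}
    beta : (T, V) current topic-word distributions beta_{twk}
    eta  : Dirichlet concentration hyperparameter
    """
    T, V = x.shape
    xi = np.zeros((T, V))
    zeta = np.zeros(T)
    # Backward pass (t = T, ..., 2): draw the Beta/CRT augmentations given the current beta.
    for t in range(T - 1, 0, -1):
        zeta[t] = rng.beta(max(x[t].sum(), 1e-12), eta * V)   # zeta_{tk}; later feeds the eta update
        xi[t] = [sample_crt(int(x[t, w]), eta * beta[t - 1, w], rng) for w in range(V)]
    # Forward pass (t = 1, ..., T): closed-form Dirichlet conditionals.
    eps = 1e-12                                               # numerical floor for the toy example
    beta[0] = rng.dirichlet(eta + x[0] + xi[1] + eps)
    for t in range(1, T - 1):
        beta[t] = rng.dirichlet(eta * V * beta[t - 1] + x[t] + xi[t + 1] + eps)
    beta[T - 1] = rng.dirichlet(eta * V * beta[T - 2] + x[T - 1] + eps)
    return beta, xi, zeta

# Toy usage with random counts standing in for the allocated latent counts.
rng = np.random.default_rng(0)
T, V, eta = 4, 30, 0.1
counts = rng.poisson(2.0, size=(T, V)).astype(float)
beta = rng.dirichlet(np.full(V, 1.0), size=T)
beta, xi, zeta = resample_beta_chain(counts, beta, eta, rng)
```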
The sequence in which the sampling is performed is concisely presented in Algorithm 1. For the temporal correlation in the latent variables, the sampling needs to follow a backward step and a forward step in every epoch, which is designated by $s \in \{1, 2, \cdots, S\}$ in Algorithm 1. The variables are all indexed by an additional superscript $(s)$ just to highlight the specific epoch. Note that the run-time complexity of Algorithm 1 is dictated by the number of non-zero entries in the observed corpus $\{X_t \in \mathbb{Z}^{|D_t| \times V}\}_{t=1}^{T}$.

We would like to emphasize further that both the model and the inference are novel contributions of this paper. That the NB augmentation trick can be utilized for an efficient inference procedure in hierarchical graphical models was first proposed in Zhou and Carin [54]; however, it was first utilized for modeling time-evolving count vectors in Acharya et al. [1]. Such adoption of the NB trick was non-trivial, as is the case with the current paper, which further uses it for modeling two separate Markov chains – the gamma Markov chain and the Dirichlet Markov chain – to yield closed-form updates for Gibbs sampling. These samples converge to a meaningful representation only when a precise order of sampling is followed, as suggested in Algorithm 1. Due to the introduction of the CRT-distributed random variables, the backward sampling step must precede the forward sampling step, the precise explanation of which can be found in Acharya et al. [1]. We also strongly believe that the simplicity of the final form of the updates leads to superior empirical results. Note that none of the existing works on temporal topic models has assumptions that naturally fit the overdispersed count data, facilitates interpretability of the latent states, has closed-form and straightforward updates in inference, and exhibits such superior empirical performance. Moreover, as mentioned in Section 4, the performance of DM-DTM is least sensitive to the initialization of the parameters, a flexibility absent in any existing implementation of a temporal topic model.

3 RELATED WORK
Poisson Factor Analysis: Since the document-by-word observation matrices in DM-DTM are modeled using Poisson factorization, a brief discussion of Poisson factor analysis is necessary. A large number of discrete latent variable models for count matrix factorization can be united under Poisson factor analysis (PFA) [1–3, 55, 56], which factorizes a count matrix $Y \in \mathbb{Z}^{D \times V}$ under the Poisson likelihood as $Y \sim \mathrm{Pois}(\Theta\beta)$, where $\Theta \in \mathbb{R}_+^{D \times K}$ is the factor loading matrix or dictionary and $\beta \in \mathbb{R}_+^{K \times V}$ is the factor score matrix.
For example, non-negative matrix factorization [11, 30], with the objective of minimizing the Kullback-Leibler divergence between the count matrix and its factorization $\Theta\beta$, is essentially PFA solved with maximum likelihood estimation. LDA [9] is equivalent to PFA, in terms of both block Gibbs sampling and variational inference [55, 56], if Dirichlet distribution priors are imposed on both $\theta_k \in \mathbb{R}_+^{D}$, the columns of $\Theta$, and $\beta_k \in \mathbb{R}_+^{V}$, the columns of $\beta$.
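As a concrete illustration of the PFA likelihood $Y \sim \mathrm{Pois}(\Theta\beta)$, the snippet below draws a toy count matrix from the model and evaluates its Poisson log-likelihood; the dimensions and priors are arbitrary illustrative choices, not a recommended configuration.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
D, K, V = 100, 5, 300                          # documents, factors, vocabulary (illustrative)

Theta = rng.gamma(1.0, 1.0, size=(D, K))       # non-negative factor loadings (dictionary)
Beta = rng.dirichlet(np.full(V, 1.0), size=K)  # rows of the factor score matrix
Y = rng.poisson(Theta @ Beta)                  # Y ~ Pois(Theta Beta)

# Poisson log-likelihood of the observed counts under the factorization.
loglik = poisson.logpmf(Y, Theta @ Beta).sum()
print(loglik)
```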
Temporal Topic Models: One of the notable contributions towards a dynamic topic model leverages the well-known concept of Gaussian state space evolution. In Blei and Lafferty [8], a Kalman filter is used to infer temporal updates to the state space parameters, which are then mapped to the topic simplex. Wang et al. [45] allow continuous-time state space sampling, but still employ a Gaussian distribution and a mapping to the topic space thereafter using a logistic-normal distribution. These models also require the number of topics to be specified in advance. Elibol et al. [19] and Linderman et al. [31] employ the Pólya-Gamma augmentation trick [37] to conquer the non-conjugacy that arises from the Gaussian state space evolution and likelihood for modeling count-valued observations.

Ahmed and Xing [5, 6] use a temporal Dirichlet process and make arguably simplistic assumptions to calculate an intractable posterior. In particular, the framework of the temporal Dirichlet process, first introduced in Ahmed and Xing [5], is combined with the Hierarchical Dirichlet Process (HDP) [42] to facilitate smooth temporal evolution and admixture modeling. In such a formulation, the base measures of the HDPs for different time slices are modeled using a temporal Dirichlet process, and the documents for a given time slice are assumed to be generated following an HDP with the corresponding base measure. The non-conjugacy that arises in such a modeling assumption requires one to use a Metropolis-Hastings sampler for inferring the word-topic assignments. However, to their credit, Ahmed and Xing [6] model both the genesis and death of topics, and Wang et al. [46] further model nonlinear evolutionary traces in temporal data, which we avoid in this paper but plan to incorporate in a later submission. Iwata et al. [24] and Nallapati et al. [35] emphasize the problem of modeling topics spread on a timeline with multiple resolutions, namely how topics are organized in a hierarchy and how they evolve over time. Similarly, Srebro and Roweis [41] use the framework of the Dependent Dirichlet Process (DDP) [34] to model more flexible, non-Markovian variation in topic probabilities, but inference in all such models scales very poorly. Bhadury et al. [7] adopt the framework of stochastic gradient Langevin dynamics [32] to accelerate the inference based on Gibbs sampling in the original formulation of the dynamic topic model [8]. Some of the other online algorithms [4, 23, 51] explicitly model temporal evolution by making Markovian assumptions.

Different from the works mentioned above, the topics over time (TOT) model [47] assumes that the topics define a distribution over words as well as time slices. Though TOT and some of its extensions [18, 44] can model non-Markovian variations in the topic probabilities and enjoy inference that is computationally tractable, they do not explicitly evolve the parameters of the model with time. Though the modeling assumptions are interesting, there has not been much empirical comparison between these two different sets of algorithms.

Relevant Temporal Models for Count Data: Time-evolving dyadic data is also prevalent in applications of recommender systems and social network analysis. Though such applications are not the focus of the current paper, we discuss a few algorithms for completeness. Both Bayesian Probabilistic Tensor Factorization (BPTF) [48] and Dynamic Poisson Factorization (DPF) [12] model the temporal evolution using the normal distribution. While BPTF models the count data using the normal distribution itself, DPF uses an exponential function to convert the latent rates to nonnegative values, a transformation that makes the inference intractable. To impose temporal smoothness in the frequency domain for audio processing, Virtanen et al. [43] consider chaining latent variables across successive time frames via the gamma scale parameters. Jerfel et al. [25] model the evolution of the latent factors in the context of recommender systems via the gamma scale parameters. Similarly, Févotte et al. [21] propose a gamma Markov chain using the scale parameters for applications in audio and speech. Most of the works in dynamic social network analysis [22, 27, 49] employ similar temporal evolution using a normal distribution to model time-varying binary matrices. This paper borrows some of the technical ideas from Acharya et al. [1, 2] and Schein et al. [39], which introduce gamma Markov chains for analyzing count and binary data with temporal correlation.

4 EMPIRICAL EVALUATION
4.1 Experiments with Synthetic Data
To illustrate the working principles of DM-DTM, we created a synthetic corpus that has three different time slices. The document-by-word matrices corresponding to each of these time slices are presented in the first column of Fig. 2. Note that each document-by-word matrix, denoted by $X_1$, $X_2$, and $X_3$, has a clearly defined structure where some documents have the exact same words and some words only appear in a given set of documents. The appearance of the words is varied smoothly from one time slice to the next, replicating the temporal evolution that we may see in a real-world corpus. The reconstructed matrices, denoted by $\hat{X}_1$, $\hat{X}_2$, and $\hat{X}_3$, are presented in the second column and precisely reflect the original observations. The third, fourth and fifth columns display the derived parameters of the model corresponding to different time slices. Note that the $r_{tk}$'s, which represent the popularities of the topics, have only a few components that are dominant and span across time slices, implying the temporal smoothness discovered by the model, which is a consequence of using both the gamma Markov chain and the Dirichlet Markov chain. Similarly, the $\theta_{dk}$'s (document-topic assignments) and the $\beta_{twk}$'s (topic-word assignments) also have a temporal correlation, as is evident from the heat-maps.
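A corpus with the kind of block structure and smooth drift described above can be simulated in a few lines; the construction below is only a rough stand-in for the synthetic data of Fig. 2, with sizes and drift amounts chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, K = 3, 60, 90, 3            # three time slices, illustrative sizes
shift = 5                            # how far each topic's word block drifts per slice

corpus = []
for t in range(T):
    X = np.zeros((D, V), dtype=int)
    for k in range(K):
        doc_lo, doc_hi = k * D // K, (k + 1) * D // K            # documents sharing topic k
        w_lo = min(k * V // K + shift * t, V - 10)                # word block drifts with t
        w_hi = min(w_lo + V // K, V)
        X[doc_lo:doc_hi, w_lo:w_hi] = rng.poisson(
            5.0, size=(doc_hi - doc_lo, w_hi - w_lo))
    corpus.append(X)                 # corpus[t] plays the role of X_{t+1} above
```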
4.2 Experiments with Real-world Data
4.2.1 Description of Datasets.
• NIPS Corpus: The NIPS corpus consists of papers that appeared in the NIPS conference from the years 1987 to 1999. After standard pre-processing and removal of most frequent and least frequent words, the size of the corpus is reduced to 1383 documents and 1636 words. Documents were divided into 13 epochs based on the publication year.
• Business News Corpora: We create three additional corpora for our experimental analysis by crawling the Bloomberg News portal. In particular, we are only interested in news articles that mention companies belonging to the Financial Times Stock Exchange (FTSE) 250 list. Each of these corpora consists of news articles from 9 successive days. The three corpora, termed Business News Corpus 1, 2, and 3, contain news articles starting from November 1st 2015, November 12th 2015, and November 22nd 2015, respectively. After standard pre-processing and removal of most frequent and least frequent words, the first news corpus consists of 3271 documents and 1636 words, the second corpus consists of 2935 documents and 1570 words, and the third corpus consists of 2234 documents and 1352 words. The datasets used in these experiments are listed here (https://goo.gl/uVB1f7). (We are indebted to Matt Sanchez, CTO of CognitiveScale, for curating these datasets.)

[Figure 3: Evolution of r_k]

4.2.2 Qualitative Evaluation of Topics. For a qualitative understanding of how DM-DTM works with real-world data, we consider the Business News Corpus 2, whose documents span from Nov 12th, 2015 to Nov 20th, 2015. The significance of this corpus is due to the unfortunate terrorist attacks in Paris, France, late in the evening on Nov 13th, 2015, which triggered massive socio-economic impact worldwide. The news articles published Nov 14th onwards convey information about the incidents and their impacts on the global economy.
In Fig. 3, we show the temporal evolution of the strength of one of the 50 topics that are used to model this corpus in one of the experiments. In Fig. 1, the top 10 words corresponding to this topic for all the time slices are also displayed. One can clearly see how the semantics of this topic change over time. Note that, before Nov 14th (day 3), the composition of the topic is significantly different. However, the terrorist attacks on the evening of the 13th (day 2) enhance the strength of the topic and change its composition. As time advances, the topic incorporates words like "Syria" and "Abbaaoud", linking the origin and the perpetrators of the terrorist attack. Interestingly, the former French President François Hollande also appears in the topic, as he is quoted condemning the genocide in the news articles.

4.2.3 Quantitative Results. We randomly hold out a fraction $p$ of the data ($p \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$), train a model with the rest, and then predict on the held-out set. For comparing multiple models with different assumptions and inference mechanisms, the per-word perplexity on the held-out words is considered, which is defined as:
$$\mathrm{Perplexity} = \exp\left(-\frac{1}{y_{\cdot\cdot}} \sum_{d=1}^{D} \sum_{w=1}^{V} y_{dw} \log f_{dw}\right),$$
where $y_{\cdot\cdot} = \sum_{d,w} y_{dw}$. For models where inference is carried out using Gibbs sampling, $f_{dw}$ is defined as:
$$f_{dw} = \sum_{s,k} \theta_{dk}^{(s)} \phi_{wk}^{(s)} \Big/ \sum_{s,w,k} \theta_{dk}^{(s)} \phi_{wk}^{(s)},$$
where $s \in \{1, \cdots, S\}$ are the indices of collected samples. For models that employ variational methods for inference,
$$f_{dw} = \sum_{k} \bar{\theta}_{dk} \bar{\phi}_{wk} \Big/ \sum_{w,k} \bar{\theta}_{dk} \bar{\phi}_{wk},$$
where $\bar{\theta}_{dk}$ and $\bar{\phi}_{wk}$ are the point estimates of the respective parameters obtained from the variational inference. Note that the per-word perplexity is equal to $V$ if $f_{dw} = 1.0/V$; thus it should be no greater than $V$ for a topic model that works appropriately. The final results are averaged over five random training/testing partitions.
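Given collected posterior samples, the held-out per-word perplexity defined above is straightforward to compute; the sketch below uses random stand-ins for $\theta^{(s)}$ and $\phi^{(s)}$ purely to show the computation, not actual inference output.

```python
import numpy as np

def perplexity(Y_heldout, theta_samples, phi_samples):
    """Per-word held-out perplexity.

    Y_heldout     : (D, V) held-out count matrix
    theta_samples : (S, D, K) collected samples of document-topic loadings
    phi_samples   : (S, K, V) collected samples of topic-word weights
    """
    # f_dw = sum_{s,k} theta_dk^(s) phi_wk^(s) / sum_{s,w,k} theta_dk^(s) phi_wk^(s)
    scores = np.einsum('sdk,skv->dv', theta_samples, phi_samples)
    f = scores / scores.sum(axis=1, keepdims=True)
    return float(np.exp(-(Y_heldout * np.log(f)).sum() / Y_heldout.sum()))

# Toy usage with random stand-ins for the collected samples.
rng = np.random.default_rng(0)
S, D, K, V = 4, 20, 5, 50
theta = rng.gamma(1.0, 1.0, size=(S, D, K))
phi = rng.dirichlet(np.full(V, 1.0), size=(S, K))
Y = rng.poisson(1.0, size=(D, V))
print(perplexity(Y, theta, phi))
```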
For concrete empirical comparison, we use several models as baselines, the first of which is the original LDA model [9] learned using variational EM; we refer to this model as LDA-VEM. The second and third models are the γ-NB Process (γNBP) and Dir-PFA [55], both of which are learned using Gibbs sampling. Note that both Dir-PFA and LDA [9] have the same block Gibbs sampling and variational Bayes inference equations; hence, we use Dir-PFA to facilitate inference with Gibbs sampling in the original LDA model. All of these models, however, ignore the temporal information in the corpus. Note that the γNBP model used as a baseline is as strong as the HDP (inferred with the Chinese restaurant franchise representation), as shown in Zhou and Carin [55]. In fact, one can show that a normalized γNBP can be reduced to the HDP.
The dynamic topic model (DTM) [8] and the Pólya-Gamma multinomial dynamic topic model (PGMult) [31], which are capable of incorporating the time-stamps associated with each document, are mentioned above. For inference with the variational Kalman filtering (VKF) in DTM and the Pólya-Gamma augmentation trick in PGMult, we use 50 iterations. For LDA-VEM we use 50 iterations of variational EM. For all other models that use Gibbs sampling, we use 500 iterations for burn-in and 500 for collection. Note that, unlike DTM and PGMult, where initialization with LDA must be done to achieve meaningful learning of the representation of the model, DM-DTM and its ablations do not require any special initialization, thereby bringing an additional advantage to the table.

As expected, being parametric models, LDA-VEM, Dir-PFA, DTM, and PGMult all suffer from severe overfitting as K is increased. In particular, with higher values of η, the overfitting is more prominent. With small values of η, especially with η = 0.10, the topics discovered are very sparse and hence the perplexity does not increase with increasing value of K for the parametric models. Note that DM-DTM outperforms all the other models by a large margin. The significant gap between DM-DTM and γNBP or Dir-PFA shows that the performance difference is not due to the adoption of Gibbs sampling for inference, but due to the congruence between the modeling assumptions and the statistical characteristics of the corpora. Similarly, the performance gap between γNBP or Dir-PFA and LDA-VEM illustrates that the adoption of Gibbs sampling, instead of variational methods, for inference makes a difference. The gap between LDA-VEM and DM-DTM demonstrates that the performance difference is due both to better modeling assumptions and to a better inference algorithm based on Gibbs sampling. The performance difference between DM-DTM and DM-DTM-r/DM-DTM-ϕ also justifies the use of both chains – the gamma Markov chain and the Dirichlet Markov chain. Indeed, the performance gap among DM-DTM, PGMult, and DTM justifies the modeling assumptions and the efficacy of the proposed inference. Note that the performance gap between DM-DTM-r and DM-DTM-ϕ is mostly negligible, possibly because both chains are equally good at capturing the temporal dependencies in the data. Additionally, to illustrate the run-time complexity of DM-DTM, we present the variation of the total compute time, as measured on a MacBook with a 2.5 GHz Intel Core i7 processor and 16 GB of RAM, as a function of the held-out fraction p for all the corpora in Fig. 8. Note that a higher fraction of held-out data implies a smaller training set and lower compute time.

5 CONCLUSIONS AND FUTURE WORK
This paper introduced DM-DTM, a novel nonparametric Bayesian dynamic topic model that allows the topic popularities and word-topic assignments to vary smoothly over time using a gamma Markov chain and a Dirichlet Markov chain, respectively. DM-DTM is equipped with a nonparametric Bayesian construction and a tractable inference mechanism. The experiments with several real-world corpora clearly demonstrate its supremacy over many of the existing baselines. In the future, the inference can be further accelerated using the formulations of stochastic gradient Langevin dynamics [32, 33] and the sampling tricks proposed in Cong et al. [14, 15]. Additionally, the gamma Markov chain and the Dirichlet Markov chain can be used to model the temporal evolution of other types of dyadic count data, for example, those prevalent in recommender systems [12, 25]. Interestingly, the models can be further enriched with split-merge techniques [10] so that the genesis and termination of topics can be explicitly accounted for in the generative assumptions. Finally, the performance of the model can potentially be improved further using ideas from adversarial training [40] and advanced variational methods [50, 52].

REFERENCES
[1] A. Acharya, J. Ghosh, and M. Zhou. 2015. Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices. In Proc. of AISTATS. 1–9.
[2] A. Acharya, A. Saha, M. Zhou, D. Teffer, and J. Ghosh. 2015. Nonparametric Dynamic Network Modeling. In KDD Workshop on Mining and Learning from Time Series.
[3] A. Acharya, D. Teffer, J. Henderson, M. Tyler, M. Zhou, and J. Ghosh. 2015. Gamma Process Poisson Factorization for Joint Modeling of Network and Documents. In Proc. of ECML. 283–299.
[4] A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. Smola, and E. Xing. 2011. Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text. In Proc. of AISTATS. 101–109.
[5] A. Ahmed and E. Xing. 2008. Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant Process: with Applications to Evolutionary Clustering. In Proc. of SDM.
[6] A. Ahmed and E. Xing. 2010. Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream. In Proc. of UAI.
[7] A. Bhadury, J. Chen, J. Zhu, and S. Liu. 2016. Scaling Up Dynamic Topic Models. In Proc. of WWW. 381–390.
[8] D. M. Blei and J. D. Lafferty. 2006. Dynamic Topic Models. In Proc. of ICML. 113–120.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. JMLR 3 (2003), 993–1022.
[10] M. Bryant and E. B. Sudderth. 2012. Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes. In Proc. of NIPS. 2699–2707.
[11] A. T. Cemgil. 2009. Bayesian Inference for Nonnegative Matrix Factorisation Models. Intell. Neuroscience (2009).
[12] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei. 2015. Dynamic Poisson Factorization. In Proc. of RecSys. 155–162.
[13] K. Christakopoulou and A. Banerjee. 2015. Collaborative Ranking with a Push at the Top. In Proc. of WWW. 205–215.
[14] Y. Cong, B. Chen, H. Liu, and M. Zhou. 2017. Deep Latent Dirichlet Allocation with Topic-Layer-Adaptive Stochastic Gradient Riemannian MCMC. In Proc. of ICML. 864–873.
[15] Y. Cong, B. Chen, and M. Zhou. 2017. Fast Simulation of Hyperplane-Truncated Multivariate Normal Distributions. Bayesian Analysis (2017).
[16] M. Deodhar and J. Ghosh. 2009. Mining for Most Certain Predictions from Dyadic Data. In Proc. of KDD.
[17] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. 2015. Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams. In Proc. of KDD.
[18] A. Dubey, A. Hefny, S. Williamson, and E. P. Xing. 2013. A Nonparametric Mixture Model for Topic Modeling over Time. In Proc. of SDM. 530–538.
[19] H. Elibol, V. Nguyen, S. Linderman, M. Johnson, A. Hashmi, and F. Doshi-Velez. 2016. Cross-corpora Unsupervised Learning of Trajectories in Autism Spectrum Disorders. JMLR 17, 1 (2016), 4597–4634.
[20] T. S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. Ann. Statist. (1973).
[21] C. Févotte, J. L. Roux, and J. R. Hershey. 2013. Non-negative Dynamical System with Application to Speech and Audio. In Proc. of ICASSP. 3158–3162.
[22] Q. Ho, L. Song, and E. P. Xing. 2011. Evolving Cluster Mixed-Membership Blockmodel for Time-Varying Networks. In Proc. of AISTATS.
[23] L. Hong, B. Dom, S. Gurumurthy, and K. Tsioutsiouliklis. 2011. A Time-dependent Topic Model for Multiple Text Streams. In Proc. of KDD. 832–840.
[24] T. Iwata, T. Yamada, Y. Sakurai, and N. Ueda. 2010. Online Multiscale Dynamic Topic Models. In Proc. of KDD. 663–672.
[25] G. Jerfel, M. Basbug, and B. Engelhardt. 2017. Dynamic Collaborative Filtering with Compound Poisson Factorization. In Proc. of AISTATS. 738–747.
[26] N. L. Johnson, A. W. Kemp, and S. Kotz. 2005. Univariate Discrete Distributions. John Wiley & Sons.
[27] M. Kim and J. Leskovec. 2013. Nonparametric Multi-group Membership Model for Dynamic Networks. In Proc. of NIPS. 1385–1393.
[28] Y. Koren. 2009. Collaborative Filtering with Temporal Dynamics. In Proc. of KDD. 447–456.
[29] Y. Koren, R. Bell, and C. Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer (2009).
[30] D. D. Lee and H. S. Seung. 2001. Algorithms for Non-negative Matrix Factorization. In Proc. of NIPS.
[31] S. Linderman, M. Johnson, and R. P. Adams. 2015. Dependent Multinomial Models Made Easy: Stick-Breaking with the Pólya-Gamma Augmentation. In Proc. of NIPS. 3438–3446.
[32] Y. Ma, T. Chen, and E. B. Fox. 2015. A Complete Recipe for Stochastic Gradient MCMC. In Proc. of NIPS. 2917–2925.
[33] Y. Ma, N. J. Foti, and E. B. Fox. 2017. Stochastic Gradient MCMC Methods for Hidden Markov Models. In Proc. of ICML. 2265–2274.
[34] S. N. MacEachern. 2000. Dependent Dirichlet Process. Technical Report. Department of Statistics, The Ohio State University.
[35] R. M. Nallapati, S. Ditmore, J. D. Lafferty, and K. Ung. 2007. Multiscale Topic Tomography. In Proc. of KDD. 520–529.
[36] N. Natarajan and I. S. Dhillon. 2014. Inductive Matrix Completion for Predicting Gene-Disease Associations. Bioinformatics 30, 12 (2014), 60–68.
[37] N. G. Polson, J. G. Scott, and J. Windle. 2013. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. J. Amer. Statist. Assoc. 108, 504 (2013), 1339–1349.
[38] S. Raghavan, S. Gunasekar, and J. Ghosh. 2012. Review Quality Aware Collaborative Filtering. In Proc. of RecSys. 123–130.
[39] A. Schein, H. Wallach, and M. Zhou. 2016. Poisson-Gamma Dynamical Systems. In Proc. of NIPS. 5005–5013.
[40] J. Song, S. Zhao, and S. Ermon. 2017. A-NICE-MC: Adversarial Training for MCMC. In Proc. of NIPS. 5146–5156.
[41] N. Srebro and S. Roweis. 2005. Time-Varying Topic Models using Dependent Dirichlet Processes.
[42] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet Processes. J. Amer. Statist. Assoc. 101 (December 2006), 1566–1581.
[43] T. Virtanen, A. T. Cemgil, and S. Godsill. 2008. Bayesian Extensions to Non-negative Matrix Factorisation for Audio Signal Modelling. In Proc. of ICASSP. 1825–1828.
[44] D. D. Walker, K. Seppi, and E. K. Ringger. 2012. Topics over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation. In Proc. of UAI. 74–83.
[45] C. Wang, D. Blei, and D. Heckerman. 2008. Continuous Time Dynamic Topic Models. In Proc. of UAI.
[46] P. Wang, P. Zhang, C. Zhou, Z. Li, and H. Yang. 2017. Hierarchical Evolving Dirichlet Processes for Modeling Nonlinear Evolutionary Traces in Temporal Data. Data Min. Knowl. Discov. 31, 1 (Jan. 2017), 32–64.
[47] X. Wang and A. McCallum. 2006. Topics over Time: A non-Markov Continuous-time Model of Topical Trends. In Proc. of KDD. 424–433.
[48] L. Xiong, X. Chen, T. Huang, J. Schneider, and J. G. Carbonell. 2010. Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proc. of SDM.
[49] K. S. Xu and A. O. Hero. 2014. Dynamic Stochastic Blockmodels for Time-Evolving Social Networks. J. Sel. Topics Signal Processing 8, 4 (2014), 552–562.
[50] M. Yin and M. Zhou. 2018. Semi-implicit Variational Inference. In Proc. of ICML.
[51] K. Zhai and J. L. Boyd-graber. 2013. Online Latent Dirichlet Allocation with Infinite Vocabulary. In Proc. of ICML. 561–569.
[52] H. Zhang, D. Guo, B. Chen, and M. Zhou. 2018. WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling. In Proc. of ICLR.
[53] M. Zhou. 2016. Nonparametric Bayesian Negative Binomial Factor Analysis. (Oct 2016).
[54] M. Zhou and L. Carin. 2012. Augment-and-Conquer Negative Binomial Processes. In Proc. of NIPS.
[55] M. Zhou and L. Carin. 2015. Negative Binomial Process Count and Mixture Modeling. IEEE Trans. Pattern Analysis and Machine Intelligence (2015).
[56] M. Zhou, L. Hannah, D. Dunson, and L. Carin. 2012. Beta-Negative Binomial Process and Poisson Factor Analysis. In Proc. of AISTATS. 1462–1471.