A Dual Markov Chain Topic Model for Dynamic Environments
Ayan Acharya, Joydeep Ghosh, Mingyuan Zhou
ABSTRACT
The abundance of digital text has led to extensive research on topic models that reason about documents using latent representations. Since for many online or streaming textual sources, such as news outlets, the number and nature of topics change over time, there have been several efforts that attempt to address such situations using dynamic versions of topic models. Unfortunately, existing approaches encounter more complex inference when their model parameters are varied over time, resulting in high computational complexity and performance degradation. This paper introduces DM-DTM, a dual Markov chain dynamic topic model, for characterizing a corpus that evolves over time. This model uses a gamma Markov chain and a Dirichlet Markov chain to allow the topic popularities and word-topic assignments, respectively, to vary smoothly over time. Novel applications of the negative binomial augmentation trick result in simple, efficient, closed-form updates of all the required conditional posteriors, resulting in far lower computational requirements as well as less sensitivity to initial conditions, compared to existing approaches. Moreover, via a gamma process prior, the number of desired topics is inferred directly from the data rather than being pre-specified, and can vary as the data changes. Empirical comparisons using multiple real-world corpora demonstrate a clear superiority of DM-DTM over strong baselines for both static and dynamic topic models.

CCS CONCEPTS
• Mathematics of computing → Bayesian networks; Bayesian nonparametric models; Time series analysis; • Information systems → Document topic models;

KEYWORDS
dynamic topic model; CRT augmentation; Gibbs sampling

ACM Reference Format:
Ayan Acharya, Joydeep Ghosh, and Mingyuan Zhou. 2018. A Dual Markov Chain Topic Model for Dynamic Environments. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3219819.3219995

1 INTRODUCTION
Analysis of dyadic data, which represents the relationships between two different sets of entities, such as documents and words or users and items, has been a prolific domain of research over the past decade, driven largely by applications in diverse areas such as topic modeling [5, 9, 17], recommender systems [13, 16, 29], e-commerce [38] and bio-informatics [36]. Successful as these analysis techniques are, a major limitation of most of them is that they are static models that ignore the temporal correlation and evolution of the relationships between entities – an attribute present in most real-world dyadic data. Text mining researchers have developed a handful of techniques for analyzing corpora that evolve over time by modeling them as a sequence of document-by-word count matrices. Some of these techniques employ Kalman-filtering-based inference and a nonlinear transformation of the latent states to the discrete observations [8, 31, 45], while others [5, 6] use a temporal Dirichlet process and make arguably simplistic assumptions to calculate an intractable posterior. Since the inference techniques for linear dynamical systems are well developed, one is usually tempted to connect a count-valued observation to a latent Gaussian random variable. However, such approaches often incur heavy computational cost, fail to exploit the natural sparsity of the data, and lack interpretability of the latent states, as the components of these states may take negative values. This is also true for models in recommender systems that exploit temporal correlation [28, 48] but hypothesize that the observation is generated from an interaction of latent factors that follow a normal distribution. Clearly, such an assumption is restrictive for count-valued dyadic data unless some nonlinear transformation is used, which again makes the inference intractable [12]. This critical problem of non-conjugacy, arising from latent Gaussian variables and their subsequent nonlinear transformation to model count-valued observations, can be mitigated using the Pólya-Gamma augmentation trick [19, 31]. However, such augmentation does not necessarily improve the empirical performance, as evidenced in Section 4.

The objective of this paper is to model a set of documents that evolves over time and to provide an inference mechanism that does not make crude approximations. To that end, we introduce DM-DTM, a novel dual Markov chain based dynamic topic model. A critical aspect of DM-DTM is that, unlike the standard techniques adopted in both text mining and recommender system problems, the observations are modeled using a Poisson distribution and the latent factors/topics are allowed to vary smoothly over time using the gamma and Dirichlet distributions. To be more specific, two separate Markov chains are introduced – a gamma Markov chain and a Dirichlet Markov chain. The gamma Markov chain models the temporal evolution of the popularities of the topics. The Dirichlet Markov chain, on the other hand, is employed to adapt the topic-word assignments with time.
Gibbs sampling is adopted for inference, where the conditional posteriors are all available in closed form. This is made possible by the use of an augmentation trick associated with the negative binomial distribution, together with a forward-backward sampling algorithm, each step of which assumes a posterior that is easy to sample from [1]. Using the gamma process [20] to generate a countably infinite number of weighted latent factors in the prior, the model can infer a parsimonious set of topics from the data in the posterior. Empirical comparisons in terms of held-out perplexity indicate the clear superiority of DM-DTM over two of the most widely used temporal topic models [8, 31].

The remainder of the paper is organized as follows. Section 2 provides a detailed description of the modeling assumptions and the inference techniques of DM-DTM. Related work is outlined in Section 3. Empirical results with real-world data are reported in Section 4. Finally, conclusions and future work are listed in Section 5.

The parameter $\beta_{tk}$ models the distribution of the words within the $k$-th latent factor at time $t$. Additionally, each atom $\beta_{tk}$ is associated with an atom $(\theta_{dk})_{d \in D_t}$, which is a $D_t$-dimensional random vector distributed as $(\theta_{dk})_{d \in D_t} \sim \prod_{d=1}^{D_t} \mathrm{Gam}(r_{tk}, 1/c_d)$. The $(d, w)$-th entry of $X_t$ is assumed to be generated from a sum of latent counts as $x_{tdw} \sim \mathrm{Pois}\big(\sum_k \lambda_{tdwk}\big)$, where $\lambda_{tdwk} = \theta_{dk}\beta_{twk}$. One may consider $\lambda_{tdwk}$ as the strength of the $k$-th latent factor that dictates the relation between the $d$-th document and the $w$-th word at time $t$. Each of these latent counts is composed of two parts: $\theta_{dk}$ models the affinity of the $d$-th document to the $k$-th latent factor, and $\beta_{twk}$ models the popularity of the $w$-th word within the $k$-th latent factor at time $t$. Each latent factor contributes such a count, and the total count is the aggregate over the countably infinite latent factors.
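To make the generative structure above concrete, here is a minimal NumPy sketch of a truncated version of the process: topic popularities evolve through a gamma Markov chain, topic-word distributions through a Dirichlet Markov chain, and counts are drawn from the Poisson superposition $x_{tdw} \sim \mathrm{Pois}(\sum_k \theta_{dk}\beta_{twk})$. The truncation level, the hyperparameter values, and the exact coupling of the gamma chain (taken here as $r_{tk} \sim \mathrm{Gam}(r_{(t-1)k}, 1/c)$) are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, V = 5, 10, 200                        # time slices, truncation level, vocabulary size
D = [50] * T                                # documents per time slice
eta, gamma0, c, c_d = 0.1, 10.0, 1.0, 1.0   # illustrative hyperparameter values

# Gamma Markov chain over topic popularities r_{tk}
# (assumed coupling r_{tk} ~ Gam(r_{(t-1)k}, 1/c); the paper's exact form may differ).
r = np.empty((T, K))
r[0] = rng.gamma(gamma0 / K, 1.0 / c, size=K)
for t in range(1, T):
    r[t] = np.maximum(rng.gamma(r[t - 1], 1.0 / c), 1e-6)   # floor keeps the toy chain away from underflow

# Dirichlet Markov chain over topic-word distributions: beta_{(t+1)k} ~ Dir(eta * V * beta_{tk}).
beta = np.empty((T, K, V))
beta[0] = rng.dirichlet(np.full(V, eta), size=K)
for t in range(1, T):
    for k in range(K):
        beta[t, k] = rng.dirichlet(eta * V * beta[t - 1, k] + 1e-12)  # tiny floor for numerical stability

# Document loadings theta_{dk} ~ Gam(r_{tk}, 1/c_d) and counts x_{tdw} ~ Pois(sum_k theta_{dk} beta_{twk}).
X = []
for t in range(T):
    theta = rng.gamma(np.tile(r[t], (D[t], 1)), 1.0 / c_d)
    X.append(rng.poisson(theta @ beta[t]))
```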
Lemma 2.3. ([53]) If $\beta \sim \mathrm{Dir}(\eta)$, $\eta \sim \mathrm{Gam}(s_0, 1/t_0)$, and $(x_w)_{w=1}^{V} \sim \mathrm{mult}\big((\beta_w)_{w=1}^{V}; \sum_w x_w\big)$, then
$$(\eta \mid -) \sim \mathrm{Gam}\Big(s_0 + \sum_w \xi_w,\; 1\big/\big(t_0 - V \log(1 - \zeta)\big)\Big),$$
where $\xi_w \sim \mathrm{CRT}(x_w, \eta)$ and $\zeta \sim \mathrm{Beta}\big(\sum_w x_w, \eta V\big)$.
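Because CRT (Chinese restaurant table) augmentation drives most of the closed-form updates below, the following minimal sketch shows how a CRT draw and the resulting gamma conditional of Lemma 2.3 can be simulated; the data and hyperparameter values are arbitrary toy choices, not the paper's setup.

```python
import numpy as np

def sample_crt(x, r, rng):
    """Draw l ~ CRT(x, r): the number of occupied tables after x customers
    enter a Chinese restaurant with concentration parameter r."""
    if x == 0:
        return 0
    return int(rng.binomial(1, r / (r + np.arange(x))).sum())

rng = np.random.default_rng(0)
V, s0, t0, eta = 50, 1.0, 1.0, 0.5           # toy dimensions and hyperparameters
beta = rng.dirichlet(np.full(V, eta))
x = rng.multinomial(1000, beta)              # (x_w) ~ mult(beta; sum_w x_w)

# Augmentations of Lemma 2.3 (drawn given the current value of eta),
# followed by the closed-form gamma conditional for eta.
xi = np.array([sample_crt(int(xw), eta, rng) for xw in x])
zeta = rng.beta(x.sum(), eta * V)
eta_new = rng.gamma(s0 + xi.sum(), 1.0 / (t0 - V * np.log(1.0 - zeta)))
print(eta_new)
```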
The sampling for $t = T$ is easy and follows as:
$$\big((\beta_{Twk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta V \beta_{(T-1)wk} + x_{T\cdot wk}\big)_{w=1}^{V}\Big).$$
For $2 \le t \le (T-1)$, the sampling is non-trivial due to the Dirichlet Markov chain. However, from the relation between the Poisson and multinomial distributions, it follows that
$$\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V} \sim \mathrm{mult}\Big(\big(\beta_{(t+1)wk}\big)_{w=1}^{V};\; x_{(t+1)\cdot\cdot k}\Big).$$
Since $(\beta_{(t+1)wk})_{w=1}^{V} \sim \mathrm{Dir}\big(\eta V (\beta_{twk})_{w=1}^{V}\big)$, we may integrate out $\beta_{(t+1)wk}$ and, according to the definition of the Dirichlet-multinomial distribution, we have
$$\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V} \sim \mathrm{DirMult}\Big(\eta V \big(\beta_{twk}\big)_{w=1}^{V}\Big).$$
The Dirichlet-multinomial likelihood is further augmented with $\zeta_{(t+1)k} \sim \mathrm{Beta}\big(x_{(t+1)\cdot\cdot k}, \eta V\big)$ and, according to Lemma 2.3, the joint distribution takes the following form:
$$f\Big(\big(x_{(t+1)\cdot wk}\big)_{w=1}^{V}, \zeta_{(t+1)k}\Big) \propto \prod_{w=1}^{V} \mathrm{NB}\big(x_{(t+1)\cdot wk};\, \eta V,\, \zeta_{(t+1)k}\big).$$
We now augment $\xi_{(t+1)wk} \sim \mathrm{CRT}\big(x_{(t+1)\cdot wk}, \eta \beta_{twk}\big)$ and, using the results of Lemma 2.3, sample $\beta_{twk}$ as:
$$\big((\beta_{twk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta V \beta_{(t-1)wk} + x_{t\cdot wk} + \xi_{(t+1)wk}\big)_{w=1}^{V}\Big).$$
This augmentation trick is illustrated in further detail in the proof of Lemma 2.3 [53]. For $t = 1$, the sampling follows almost the same pattern except that the prior is changed:
$$\big((\beta_{1wk})_{w=1}^{V} \mid -\big) \sim \mathrm{Dir}\Big(\big(\eta + x_{1\cdot wk} + \xi_{2wk}\big)_{w=1}^{V}\Big).$$

Sampling of $\gamma_0$: We augment $\ell_{0k} \sim \mathrm{CRT}(\ell_{1k}, \gamma_0/K)$ and use Lemma 2.2 to derive:
$$(\gamma_0 \mid -) \sim \mathrm{Gam}\Big(e_0 + \sum_k \ell_{0k},\; 1\big/\big(f_0 - \tfrac{1}{K}\sum_k \log(1 - p_{0k})\big)\Big), \qquad p_{0k} = \frac{\log(1 - p_{1k})}{\log(1 - p_{1k}) - c}.$$

Sampling of $\eta$: Sampling of $\eta$ follows from an application of Lemma 2.2 and Bayes' rule as:
$$(\eta \mid -) \sim \mathrm{Gam}\Big(s_0 + \sum_{t,w,k} \xi_{twk},\; 1\big/\big(t_0 - \sum_{t,k} \log(1 - \zeta_{tk})\big)\Big).$$

Algorithm 1: Forward-Backward Gibbs Sampling
Result: $\{r_{tk}^{(s)}\}_{s=1}^{S}$, $\{\theta_{dk}^{(s)}\}_{s=1}^{S}$, $\{\beta_{twk}^{(s)}\}_{s=1}^{S}$
1:  for $s \in \{1, 2, \cdots, S\}$ do
2:    for $d \in \{D_t\}_{t=1}^{T}$ do
3:      sample $\{x_{tdwk}^{(s)}\}$ and $c_d^{(s)}$
4:    end
5:    backward sampling: initialize $t = T$
6:    while $t > 0$ do
7:      sample $\{\ell_{tdk}^{(s)}\}$, $\{L_{tk}^{(s)}\}$, $\{\zeta_{tk}^{(s)}\}$, and $\{\xi_{twk}^{(s)}\}$
8:      cache $\{q_{tk}^{(s)}\}$ to use in forward sampling
9:      $t \leftarrow t - 1$
10:   end
11:   forward sampling: initialize $t = 1$
12:   while $t \le T$ do
13:     sample $\{r_{tk}^{(s)}\}$, $\{\theta_{dk}^{(s)}\}_{d \in D_t}$, and $\{\beta_{twk}^{(s)}\}$
14:     $t \leftarrow t + 1$
15:   end
16:   sample $c^{(s)}$, $\gamma_0^{(s)}$, $\eta^{(s)}$
17: end
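To illustrate how the backward (augmentation) and forward (Dirichlet draw) passes of Algorithm 1 fit together for the topic-word chain, here is a self-contained sketch of one Gibbs sweep for a single topic, assuming the aggregated latent counts $x_{t\cdot wk}$ have already been allocated; the function and variable names are ours and the numerical floor is only for stability of the toy example, so this is a sketch of the update structure rather than the paper's implementation.

```python
import numpy as np

def sample_crt(x, r, rng):
    """Draw l ~ CRT(x, r)."""
    if x == 0:
        return 0
    return int(rng.binomial(1, r / (r + np.arange(x))).sum())

def resample_beta_chain(x, beta, eta, rng):
    """One Gibbs sweep over the Dirichlet Markov chain for a single topic k.

    x    : (T, V) aggregated latent counts x_{t.wk}
    beta : (T, V) current topic-word distributions beta_{twk}
    eta  : Dirichlet concentration hyperparameter
    """
    T, V = x.shape
    xi = np.zeros((T, V))
    zeta = np.zeros(T)
    # Backward pass (t = T, ..., 2): draw the Beta/CRT augmentations given the current beta.
    for t in range(T - 1, 0, -1):
        zeta[t] = rng.beta(max(x[t].sum(), 1e-12), eta * V)   # zeta_{tk}; later feeds the eta update
        xi[t] = [sample_crt(int(x[t, w]), eta * beta[t - 1, w], rng) for w in range(V)]
    # Forward pass (t = 1, ..., T): closed-form Dirichlet conditionals.
    eps = 1e-12                                               # numerical floor for the toy example
    beta[0] = rng.dirichlet(eta + x[0] + xi[1] + eps)
    for t in range(1, T - 1):
        beta[t] = rng.dirichlet(eta * V * beta[t - 1] + x[t] + xi[t + 1] + eps)
    beta[T - 1] = rng.dirichlet(eta * V * beta[T - 2] + x[T - 1] + eps)
    return beta, xi, zeta

# Toy usage with random counts standing in for the allocated latent counts.
rng = np.random.default_rng(0)
T, V, eta = 4, 30, 0.1
counts = rng.poisson(2.0, size=(T, V)).astype(float)
beta = rng.dirichlet(np.full(V, 1.0), size=T)
beta, xi, zeta = resample_beta_chain(counts, beta, eta, rng)
```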
The sequence in which the sampling is performed is concisely presented in Algorithm 1. For the temporal correlation in the latent variables, the sampling needs to follow a backward step and a forward step in every epoch, which is designated by $s \in \{1, 2, \cdots, S\}$ in Algorithm 1. The variables are all indexed by an additional superscript $(s)$ just to highlight the specific epoch. Note that the run-time complexity of Algorithm 1 is dictated by the number of non-zero entries in the observed corpus $\{X_t \in \mathbb{Z}^{|D_t| \times V}\}_{t=1}^{T}$.

We would like to emphasize further that both the model and the inference are novel contributions of this paper. That the NB augmentation trick can be utilized for an efficient inference procedure in hierarchical graphical models was first proposed in Zhou and Carin [54]; however, it was first utilized for modeling time-evolving count vectors in Acharya et al. [1]. Such adoption of the NB trick was non-trivial, as is the case with the current paper, which further uses it for modeling two separate Markov chains – the gamma Markov chain and the Dirichlet Markov chain – to yield closed-form updates for Gibbs sampling. These samples converge to a meaningful representation only when a precise order of sampling is followed, as suggested in Algorithm 1. Due to the introduction of the CRT-distributed random variables, the backward sampling step must precede the forward sampling step, the precise explanation of which can be found in Acharya et al. [1]. We also strongly believe that the simplicity of the final form of the updates leads to superior empirical results. Note that none of the existing works on temporal topic models has assumptions that naturally fit the overdispersed count data, facilitates interpretability of the latent states, has closed-form and straightforward updates in inference, and exhibits such superior empirical performance. Moreover, as mentioned in Section 4, the performance of DM-DTM is least sensitive to the initialization of the parameters, a flexibility absent in any existing implementation of a temporal topic model.

3 RELATED WORK
Poisson Factor Analysis: Since the document-by-word observation matrices in DM-DTM are modeled using Poisson factorization, a brief discussion of Poisson factor analysis is necessary. A large number of discrete latent variable models for count matrix factorization can be united under Poisson factor analysis (PFA) [1–3, 55, 56], which factorizes a count matrix $Y \in \mathbb{Z}^{D \times V}$ under the Poisson likelihood as $Y \sim \mathrm{Pois}(\Theta\beta)$, where $\Theta \in \mathbb{R}_+^{D \times K}$ is the factor loading matrix or dictionary and $\beta \in \mathbb{R}_+^{K \times V}$ is the factor score matrix.
For example, non-negative matrix factorization [11, 30], with the objective of minimizing the Kullback-Leibler divergence between the count matrix and its factorization $\Theta\beta$, is essentially PFA solved with maximum likelihood estimation. LDA [9] is equivalent to PFA, in terms of both block Gibbs sampling and variational inference [55, 56], if Dirichlet distribution priors are imposed on both $\theta_k \in \mathbb{R}_+^{D}$, the columns of $\Theta$, and $\beta_k \in \mathbb{R}_+^{V}$, the columns of $\beta$.
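As a concrete illustration of the PFA likelihood $Y \sim \mathrm{Pois}(\Theta\beta)$, the snippet below draws a toy count matrix from the model and evaluates its Poisson log-likelihood; the dimensions and priors are arbitrary illustrative choices, not a recommended configuration.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
D, K, V = 100, 5, 300                          # documents, factors, vocabulary (illustrative)

Theta = rng.gamma(1.0, 1.0, size=(D, K))       # non-negative factor loadings (dictionary)
Beta = rng.dirichlet(np.full(V, 1.0), size=K)  # rows of the factor score matrix
Y = rng.poisson(Theta @ Beta)                  # Y ~ Pois(Theta Beta)

# Poisson log-likelihood of the observed counts under the factorization.
loglik = poisson.logpmf(Y, Theta @ Beta).sum()
print(loglik)
```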
Temporal Topic Models: One of the notable contributions towards a dynamic topic model leverages the well-known concept of Gaussian state space evolution. In Blei and Lafferty [8], a Kalman filter is used to infer temporal updates to the state space parameters, which are then mapped to the topic simplex. Wang et al. [45] allow continuous-time state space sampling, but still employ a Gaussian distribution and a mapping to the topic space thereafter using a logistic-normal distribution. These models also require the number of topics to be specified in advance. Elibol et al. [19] and Linderman et al. [31] employ the Pólya-Gamma augmentation trick [37] to conquer the non-conjugacy that arises from the Gaussian state space evolution and likelihood for modeling count-valued observations.

Ahmed and Xing [5, 6] use a temporal Dirichlet process and make arguably simplistic assumptions to calculate an intractable posterior. In particular, the framework of the temporal Dirichlet process, first introduced in Ahmed and Xing [5], is combined with the Hierarchical Dirichlet Process (HDP) [42] to facilitate smooth temporal evolution and admixture modeling. In such a formulation, the base measures of the HDPs for different time slices are modeled using a temporal Dirichlet process, and the documents for a given time slice are assumed to be generated following an HDP with the corresponding base measure. The non-conjugacy that arises in such a modeling assumption requires one to use a Metropolis-Hastings sampler for inferring the word-topic assignments. However, to their credit, Ahmed and Xing [6] model both the genesis and death of topics, and Wang et al. [46] further model nonlinear evolutionary traces in temporal data, which we avoid in this paper but plan to incorporate in a later submission. Iwata et al. [24] and Nallapati et al. [35] emphasize the problem of modeling topics spread on a timeline with multiple resolutions, namely how topics are organized in a hierarchy and how they evolve over time. Similarly, Srebro and Roweis [41] use the framework of the Dependent Dirichlet Process (DDP) [34] to model more flexible, non-Markovian variation in topic probabilities, but inference in all such models scales very poorly. Bhadury et al. [7] adopt the framework of stochastic gradient Langevin dynamics [32] to accelerate the inference based on Gibbs sampling in the original formulation of the dynamic topic model [8]. Some of the other online algorithms [4, 23, 51] explicitly model temporal evolution by making Markovian assumptions.

Different from the works mentioned above, the topics over time (TOT) model [47] assumes that the topics define a distribution over words as well as time slices. Though TOT and some of its extensions [18, 44] can model non-Markovian variations in the topic probabilities and enjoy inference that is computationally tractable, they do not explicitly evolve the parameters of the model with time. Though the modeling assumptions are interesting, there has not been much empirical comparison between these two different sets of algorithms.

Relevant Temporal Models for Count Data: Time-evolving dyadic data is also prevalent in applications of recommender systems and social network analysis. Though such applications are not the focus of the current paper, we discuss a few algorithms for completeness. Both Bayesian Probabilistic Tensor Factorization (BPTF) [48] and Dynamic Poisson Factorization (DPF) [12] model the temporal evolution using the normal distribution. While BPTF models the count data using the normal distribution itself, DPF uses an exponential function to convert the latent rates to nonnegative values, a transformation that makes the inference intractable. To impose temporal smoothness in the frequency domain for audio processing, Virtanen et al. [43] consider chaining latent variables across successive time frames via the gamma scale parameters. Jerfel et al. [25] model the evolution of the latent factors in the context of recommender systems via the gamma scale parameters. Similarly, Févotte et al. [21] propose a gamma Markov chain using the scale parameters for applications in audio and speech. Most of the works in dynamic social network analysis [22, 27, 49] employ similar temporal evolution using a normal distribution to model time-varying binary matrices. This paper borrows some of the technical ideas from Acharya et al. [1, 2] and Schein et al. [39], which introduce gamma Markov chains for analyzing count and binary data with temporal correlation.

4 EMPIRICAL EVALUATION
4.1 Experiments with Synthetic Data
To illustrate the working principles of DM-DTM, we created a synthetic corpus that has three different time slices. The document-by-word matrices corresponding to each of these time slices are presented in the first column of Fig. 2. Note that each document-by-word matrix, denoted by $X_1$, $X_2$, and $X_3$, has a clearly defined structure where some documents have the exact same words and some words only appear in a given set of documents. The appearance of the words is varied smoothly from one time slice to the next, replicating the temporal evolution that we may see in a real-world corpus. The reconstructed matrices, denoted by $\hat{X}_1$, $\hat{X}_2$, and $\hat{X}_3$, are presented in the second column and precisely reflect the original observations. The third, fourth and fifth columns display the derived parameters of the model corresponding to different time slices. Note that the $r_{tk}$'s, which represent the popularities of the topics, have only a few components that are dominant and span across time slices, implying the temporal smoothness discovered by the model, which is a consequence of using both the gamma Markov chain and the Dirichlet Markov chain. Similarly, the $\theta_{dk}$'s (document-topic assignments) and the $\beta_{twk}$'s (topic-word assignments) also have a temporal correlation, as is evident from the heat-maps.
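A corpus with the kind of block structure and smooth drift described above can be simulated in a few lines; the construction below is only a rough stand-in for the synthetic data of Fig. 2, with sizes and drift amounts chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, K = 3, 60, 90, 3            # three time slices, illustrative sizes
shift = 5                            # how far each topic's word block drifts per slice

corpus = []
for t in range(T):
    X = np.zeros((D, V), dtype=int)
    for k in range(K):
        doc_lo, doc_hi = k * D // K, (k + 1) * D // K            # documents sharing topic k
        w_lo = min(k * V // K + shift * t, V - 10)                # word block drifts with t
        w_hi = min(w_lo + V // K, V)
        X[doc_lo:doc_hi, w_lo:w_hi] = rng.poisson(
            5.0, size=(doc_hi - doc_lo, w_hi - w_lo))
    corpus.append(X)                 # corpus[t] plays the role of X_{t+1} above
```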
4.2 Experiments with Real-world Data
4.2.1 Description of Datasets.
• NIPS Corpus: The NIPS corpus consists of papers that appeared in the NIPS conference from the years 1987 to 1999. After standard pre-processing and removal of most frequent and least frequent words, the size of the corpus is reduced to 1383 documents and 1636 words. Documents were divided into 13 epochs based on the publication year.
• Business News Corpora: We create three additional corpora for our experimental analysis by crawling the Bloomberg News portal. In particular, we are only interested in news articles that mention companies belonging to the Financial Times Stock Exchange (FTSE) 250 list. Each of these corpora consists of news articles from 9 successive days. The three corpora, termed Business News Corpus 1, 2, and 3, contain news articles starting from November 1st 2015, November 12th 2015, and November 22nd 2015, respectively. After standard pre-processing and removal of most frequent and least frequent words, the first news corpus consists of 3271 documents and 1636 words, the second corpus consists of 2935 documents and 1570 words, and the third corpus consists of 2234 documents and 1352 words. The datasets used in these experiments are listed here (https://goo.gl/uVB1f7). (We are indebted to Matt Sanchez, CTO of CognitiveScale, for curating these datasets.)

[Figure 3: Evolution of r_k]

4.2.2 Qualitative Evaluation of Topics. For a qualitative understanding of how DM-DTM works with real-world data, we consider the Business News Corpus 2, whose documents span from Nov 12th, 2015 to Nov 20th, 2015. The significance of this corpus is due to the unfortunate terrorist attacks in Paris, France, late in the evening on Nov 13th, 2015, which triggered massive socio-economic impact worldwide. The news articles published Nov 14th onwards convey information about the incidents and their impacts on the global economy.
In Fig. 3, we show the temporal evolution of the strength of one of the 50 topics that are used to model this corpus in one of the experiments. In Fig. 1, the top 10 words corresponding to this topic for all the time slices are also displayed. One can clearly see how the semantics of this topic change over time. Note that, before Nov 14th (day 3), the composition of the topic is significantly different. However, the terrorist attacks on the evening of the 13th (day 2) enhance the strength of the topic and change its composition. As time advances, the topic incorporates words like "Syria" and "Abbaaoud", linking the origin and the perpetrators of the terrorist attack. Interestingly, the former French President François Hollande also appears in the topic, as he is quoted condemning the genocide in the news articles.

4.2.3 Quantitative Results. We randomly hold out a fraction $p$ of the data ($p \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$), train a model with the rest, and then predict on the held-out set. For comparing multiple models with different assumptions and inference mechanisms, the per-word perplexity on the held-out words is considered, which is defined as:
$$\mathrm{Perplexity} = \exp\left(-\frac{1}{y_{\cdot\cdot}} \sum_{d=1}^{D} \sum_{w=1}^{V} y_{dw} \log f_{dw}\right),$$
where $y_{\cdot\cdot} = \sum_{d,w} y_{dw}$. For models where inference is carried out using Gibbs sampling, $f_{dw}$ is defined as:
$$f_{dw} = \sum_{s,k} \theta_{dk}^{(s)} \phi_{wk}^{(s)} \Big/ \sum_{s,w,k} \theta_{dk}^{(s)} \phi_{wk}^{(s)},$$
where $s \in \{1, \cdots, S\}$ are the indices of collected samples. For models that employ variational methods for inference,
$$f_{dw} = \sum_{k} \bar{\theta}_{dk} \bar{\phi}_{wk} \Big/ \sum_{w,k} \bar{\theta}_{dk} \bar{\phi}_{wk},$$
where $\bar{\theta}_{dk}$ and $\bar{\phi}_{wk}$ are the point estimates of the respective parameters obtained from the variational inference. Note that the per-word perplexity is equal to $V$ if $f_{dw} = 1.0/V$; thus it should be no greater than $V$ for a topic model that works appropriately. The final results are averaged over five random training/testing partitions.
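Given collected posterior samples, the held-out per-word perplexity defined above is straightforward to compute; the sketch below uses random stand-ins for $\theta^{(s)}$ and $\phi^{(s)}$ purely to show the computation, not actual inference output.

```python
import numpy as np

def perplexity(Y_heldout, theta_samples, phi_samples):
    """Per-word held-out perplexity.

    Y_heldout     : (D, V) held-out count matrix
    theta_samples : (S, D, K) collected samples of document-topic loadings
    phi_samples   : (S, K, V) collected samples of topic-word weights
    """
    # f_dw = sum_{s,k} theta_dk^(s) phi_wk^(s) / sum_{s,w,k} theta_dk^(s) phi_wk^(s)
    scores = np.einsum('sdk,skv->dv', theta_samples, phi_samples)
    f = scores / scores.sum(axis=1, keepdims=True)
    return float(np.exp(-(Y_heldout * np.log(f)).sum() / Y_heldout.sum()))

# Toy usage with random stand-ins for the collected samples.
rng = np.random.default_rng(0)
S, D, K, V = 4, 20, 5, 50
theta = rng.gamma(1.0, 1.0, size=(S, D, K))
phi = rng.dirichlet(np.full(V, 1.0), size=(S, K))
Y = rng.poisson(1.0, size=(D, V))
print(perplexity(Y, theta, phi))
```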
For concrete empirical comparison, we use several models as baselines, the first of which is the original LDA model [9] learned using variational EM; we refer to this model as LDA-VEM. The second and third models are the γ-NB Process (γNBP) and Dir-PFA [55], both of which are learned using Gibbs sampling. Note that both Dir-PFA and LDA [9] have the same block Gibbs sampling and variational Bayes inference equations; hence, we use Dir-PFA to facilitate inference with Gibbs sampling in the original LDA model. All of these models, however, ignore the temporal information in the corpus. Note that the γNBP model used as a baseline is as strong as the HDP (inferred with the Chinese restaurant franchise representation), as shown in Zhou and Carin [55]. In fact, one can show that a normalized γNBP can be reduced to the HDP.
The dynamic topic model (DTM) [8] and the Pólya-Gamma multinomial dynamic topic model (PGMult) [31], which are capable of incorporating the time-stamps associated with each document, are mentioned above. For inference with the variational Kalman filtering (VKF) in DTM and the Pólya-Gamma augmentation trick in PGMult, we use 50 iterations. For LDA-VEM we use 50 iterations of variational EM. For all other models that use Gibbs sampling, we use 500 iterations for burn-in and 500 for collection. Note that, unlike DTM and PGMult, where initialization with LDA must be done to achieve meaningful learning of the representation of the model, DM-DTM and its ablations do not require any special initialization, thereby bringing an additional advantage to the table.

As expected, being parametric models, LDA-VEM, Dir-PFA, DTM, and PGMult all suffer from severe overfitting as K is increased. In particular, with higher values of η, the overfitting is more prominent. With small values of η, especially with η = 0.10, the topics discovered are very sparse and hence the perplexity does not increase with increasing value of K for the parametric models. Note that DM-DTM outperforms all the other models by a large margin. The significant gap between DM-DTM and γNBP or Dir-PFA shows that the performance difference is not due to the adoption of Gibbs sampling for inference, but due to the congruence between the modeling assumptions and the statistical characteristics of the corpora. Similarly, the performance gap between γNBP or Dir-PFA and LDA-VEM illustrates that the adoption of Gibbs sampling, instead of variational methods, for inference makes a difference. The gap between LDA-VEM and DM-DTM demonstrates that the performance difference is due both to better modeling assumptions and to a better inference algorithm based on Gibbs sampling. The performance difference between DM-DTM and DM-DTM-r/DM-DTM-ϕ also justifies the use of both chains – the gamma Markov chain and the Dirichlet Markov chain. Indeed, the performance gap among DM-DTM, PGMult, and DTM justifies the modeling assumptions and the efficacy of the proposed inference. Note that the performance gap between DM-DTM-r and DM-DTM-ϕ is mostly negligible, possibly because both chains are equally good at capturing the temporal dependencies in the data. Additionally, to illustrate the run-time complexity of DM-DTM, we present the variation of the total compute time, as measured on a MacBook with a 2.5 GHz Intel Core i7 processor and 16 GB of RAM, as a function of the held-out fraction p for all the corpora in Fig. 8. Note that a higher fraction of held-out data implies a smaller training set and lower compute time.

5 CONCLUSIONS AND FUTURE WORK
This paper introduced DM-DTM, a novel nonparametric Bayesian dynamic topic model that allows the topic popularities and word-topic assignments to vary smoothly over time using a gamma Markov chain and a Dirichlet Markov chain, respectively. DM-DTM is equipped with a nonparametric Bayesian construction and a tractable inference mechanism. The experiments with several real-world corpora clearly demonstrate its supremacy over many of the existing baselines. In the future, the inference can be further accelerated using the formulations of stochastic gradient Langevin dynamics [32, 33] and the sampling tricks proposed in Cong et al. [14, 15]. Additionally, the gamma Markov chain and the Dirichlet Markov chain can be used to model the temporal evolution of other types of dyadic count data, for example, those prevalent in recommender systems [12, 25]. Interestingly, the models can be further enriched with split-merge techniques [10] so that the genesis and termination of topics can be explicitly accounted for in the generative assumptions. Finally, the performance of the model can potentially be improved further using ideas from adversarial training [40] and advanced variational methods [50, 52].

REFERENCES
[1] A. Acharya, J. Ghosh, and M. Zhou. 2015. Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices. In Proc. of AISTATS. 1–9.
[2] A. Acharya, A. Saha, M. Zhou, D. Teffer, and J. Ghosh. 2015. Nonparametric Dynamic Network Modeling. In KDD Workshop on Mining and Learning from Time Series.
[3] A. Acharya, D. Teffer, J. Henderson, M. Tyler, M. Zhou, and J. Ghosh. 2015. Gamma Process Poisson Factorization for Joint Modeling of Network and Documents. In Proc. of ECML. 283–299.
[4] A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. Smola, and E. Xing. 2011. Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text. In Proc. of AISTATS. 101–109.
[5] A. Ahmed and E. Xing. 2008. Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant Process: with Applications to Evolutionary Clustering. In Proc. of SDM.
[6] A. Ahmed and E. Xing. 2010. Timeline: A Dynamic Hierarchical Dirichlet Process Model for Recovering Birth/Death and Evolution of Topics in Text Stream. In Proc. of UAI.
[7] A. Bhadury, J. Chen, J. Zhu, and S. Liu. 2016. Scaling Up Dynamic Topic Models. In Proc. of WWW. 381–390.
[8] D. M. Blei and J. D. Lafferty. 2006. Dynamic Topic Models. In Proc. of ICML. 113–120.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. JMLR 3 (2003), 993–1022.
[10] M. Bryant and E. B. Sudderth. 2012. Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes. In Proc. of NIPS. 2699–2707.
[11] A. T. Cemgil. 2009. Bayesian Inference for Nonnegative Matrix Factorisation Models. Intell. Neuroscience (2009).
[12] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei. 2015. Dynamic Poisson Factorization. In Proc. of RecSys. 155–162.
[13] K. Christakopoulou and A. Banerjee. 2015. Collaborative Ranking with a Push at the Top. In Proc. of WWW. 205–215.
[14] Y. Cong, B. Chen, H. Liu, and M. Zhou. 2017. Deep Latent Dirichlet Allocation with Topic-Layer-Adaptive Stochastic Gradient Riemannian MCMC. In Proc. of ICML. 864–873.
[15] Y. Cong, B. Chen, and M. Zhou. 2017. Fast Simulation of Hyperplane-Truncated Multivariate Normal Distributions. Bayesian Analysis (2017).
[16] M. Deodhar and J. Ghosh. 2009. Mining for Most Certain Predictions from Dyadic Data. In Proc. of KDD.
[17] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. 2015. Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams. In Proc. of KDD.
[18] A. Dubey, A. Hefny, S. Williamson, and E. P. Xing. 2013. A Nonparametric Mixture Model for Topic Modeling over Time. In Proc. of SDM. 530–538.
[19] H. Elibol, V. Nguyen, S. Linderman, M. Johnson, A. Hashmi, and F. Doshi-Velez. 2016. Cross-corpora Unsupervised Learning of Trajectories in Autism Spectrum Disorders. JMLR 17, 1 (2016), 4597–4634.
[20] T. S. Ferguson. 1973. A Bayesian Analysis of Some Nonparametric Problems. Ann. Statist. (1973).
[21] C. Févotte, J. L. Roux, and J. R. Hershey. 2013. Non-negative Dynamical System with Application to Speech and Audio. In Proc. of ICASSP. 3158–3162.
[22] Q. Ho, L. Song, and E. P. Xing. 2011. Evolving Cluster Mixed-Membership Blockmodel for Time-Varying Networks. In Proc. of AISTATS.
[23] L. Hong, B. Dom, S. Gurumurthy, and K. Tsioutsiouliklis. 2011. A Time-dependent Topic Model for Multiple Text Streams. In Proc. of KDD. 832–840.
[24] T. Iwata, T. Yamada, Y. Sakurai, and N. Ueda. 2010. Online Multiscale Dynamic Topic Models. In Proc. of KDD. 663–672.
[25] G. Jerfel, M. Basbug, and B. Engelhardt. 2017. Dynamic Collaborative Filtering with Compound Poisson Factorization. In Proc. of AISTATS. 738–747.
[26] N. L. Johnson, A. W. Kemp, and S. Kotz. 2005. Univariate Discrete Distributions. John Wiley & Sons.
[27] M. Kim and J. Leskovec. 2013. Nonparametric Multi-group Membership Model for Dynamic Networks. In Proc. of NIPS. 1385–1393.
[28] Y. Koren. 2009. Collaborative Filtering with Temporal Dynamics. In Proc. of KDD. 447–456.
[29] Y. Koren, R. Bell, and C. Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer (2009).
[30] D. D. Lee and H. S. Seung. 2001. Algorithms for Non-negative Matrix Factorization. In Proc. of NIPS.
[31] S. Linderman, M. Johnson, and R. P. Adams. 2015. Dependent Multinomial Models Made Easy: Stick-Breaking with the Pólya-Gamma Augmentation. In Proc. of NIPS. 3438–3446.
[32] Y. Ma, T. Chen, and E. B. Fox. 2015. A Complete Recipe for Stochastic Gradient MCMC. In Proc. of NIPS. 2917–2925.
[33] Y. Ma, N. J. Foti, and E. B. Fox. 2017. Stochastic Gradient MCMC Methods for Hidden Markov Models. In Proc. of ICML. 2265–2274.
[34] S. N. MacEachern. 2000. Dependent Dirichlet Process. Technical Report. Department of Statistics, The Ohio State University.
[35] R. M. Nallapati, S. Ditmore, J. D. Lafferty, and K. Ung. 2007. Multiscale Topic Tomography. In Proc. of KDD. 520–529.
[36] N. Natarajan and I. S. Dhillon. 2014. Inductive Matrix Completion for Predicting Gene-Disease Associations. Bioinformatics 30, 12 (2014), 60–68.
[37] N. G. Polson, J. G. Scott, and J. Windle. 2013. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. J. Amer. Statist. Assoc. 108, 504 (2013), 1339–1349.
[38] S. Raghavan, S. Gunasekar, and J. Ghosh. 2012. Review Quality Aware Collaborative Filtering. In Proc. of RecSys. 123–130.
[39] A. Schein, H. Wallach, and M. Zhou. 2016. Poisson-Gamma Dynamical Systems. In Proc. of NIPS. 5005–5013.
[40] J. Song, S. Zhao, and S. Ermon. 2017. A-NICE-MC: Adversarial Training for MCMC. In Proc. of NIPS. 5146–5156.
[41] N. Srebro and S. Roweis. 2005. Time-Varying Topic Models using Dependent Dirichlet Processes.
[42] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet Processes. J. Amer. Statist. Assoc. 101 (December 2006), 1566–1581.
[43] T. Virtanen, A. T. Cemgil, and S. Godsill. 2008. Bayesian Extensions to Non-negative Matrix Factorisation for Audio Signal Modelling. In Proc. of ICASSP. 1825–1828.
[44] D. D. Walker, K. Seppi, and E. K. Ringger. 2012. Topics over Nonparametric Time: A Supervised Topic Model Using Bayesian Nonparametric Density Estimation. In Proc. of UAI. 74–83.
[45] C. Wang, D. Blei, and D. Heckerman. 2008. Continuous Time Dynamic Topic Models. In Proc. of UAI.
[46] P. Wang, P. Zhang, C. Zhou, Z. Li, and H. Yang. 2017. Hierarchical Evolving Dirichlet Processes for Modeling Nonlinear Evolutionary Traces in Temporal Data. Data Min. Knowl. Discov. 31, 1 (Jan. 2017), 32–64.
[47] X. Wang and A. McCallum. 2006. Topics over Time: A non-Markov Continuous-time Model of Topical Trends. In Proc. of KDD. 424–433.
[48] L. Xiong, X. Chen, T. Huang, J. Schneider, and J. G. Carbonell. 2010. Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization. In Proc. of SDM.
[49] K. S. Xu and A. O. Hero. 2014. Dynamic Stochastic Blockmodels for Time-Evolving Social Networks. J. Sel. Topics Signal Processing 8, 4 (2014), 552–562.
[50] M. Yin and M. Zhou. 2018. Semi-implicit Variational Inference. In Proc. of ICML.
[51] K. Zhai and J. L. Boyd-graber. 2013. Online Latent Dirichlet Allocation with Infinite Vocabulary. In Proc. of ICML. 561–569.
[52] H. Zhang, D. Guo, B. Chen, and M. Zhou. 2018. WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling. In Proc. of ICLR.
[53] M. Zhou. 2016. Nonparametric Bayesian Negative Binomial Factor Analysis. (Oct 2016).
[54] M. Zhou and L. Carin. 2012. Augment-and-Conquer Negative Binomial Processes. In Proc. of NIPS.
[55] M. Zhou and L. Carin. 2015. Negative Binomial Process Count and Mixture Modeling. IEEE Trans. Pattern Analysis and Machine Intelligence (2015).
[56] M. Zhou, L. Hannah, D. Dunson, and L. Carin. 2012. Beta-Negative Binomial Process and Poisson Factor Analysis. In Proc. of AISTATS. 1462–1471.