Hierarchical Topic Modeling of Twitter Data for OLAP
February 6, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2891902
ABSTRACT Social platforms, such as Twitter, reveal much about the tastes of the public. Many studies focus
on the content analysis of social platforms, which assists in product promotion and sentiment investigation.
On the other hand, online analytical processing (OLAP) has been proven to be very effective for analyzing
multidimensional structured data. The key purpose of applying OLAP to text messages (e.g., tweets), called
text OLAP, is to mine and construct the hierarchical dimension based on the unstructured text content.
In contrast to the plain texts which text OLAP usually handles, the social media content includes a wealth
of social relationship information which can be employed to extract a more effective dimensional hierarchy.
In this paper, we propose a topic model called twitter hierarchical latent Dirichlet allocation (thLDA). Based
on hierarchical latent Dirichlet allocation, thLDA aims to automatically mine the hierarchical dimension
of tweets’ topics, which can be further employed for text OLAP on the tweets. Furthermore, thLDA uses
word2vec to analyze the semantic relationships of words in tweets to obtain a more effective dimension.
We conduct extensive experiments on huge quantities of Twitter data and evaluate the effectiveness of
thLDA. The experimental results demonstrate that it outperforms other current topic models in mining and
constructing the hierarchical dimension of tweets' topics.
INDEX TERMS Twitter, online analytical processing, topic modeling, hierarchical latent Dirichlet
allocation, social media analysis.
I. INTRODUCTION
During the past few years, Twitter has become increasingly popular as an emerging social platform for messaging and communication among individuals. The huge quantities of Twitter data accumulated so far make it possible to discover the distribution and drift of mass tastes and opinions, which greatly assists in product recommendation, target marketing and so on. On the other hand, OLAP, or online analytical processing, enables analysts to interactively view data from all aspects in layered granularities, which has already proven especially useful for business intelligence. Unfortunately, OLAP techniques are successful in dealing with cube data, which are structured and formalized, but face difficulties in processing textual content such as Twitter data. To successfully apply OLAP techniques to Twitter, it is critical to mine the hidden representative dimensions from its extensive content.

As a typical unsupervised topic model, the Latent Dirichlet Allocation (LDA) model is efficient at statistically analyzing textual data for the underlying topics. In [1] and [2], we proposed an LDA-based model, called MS-LDA, to detect the hidden layered interests from the Twitter data. As an extension of LDA, MS-LDA integrates tweets and the social relationships among tweeters. Nevertheless, the primitive LDA model can only mine monolayer topics, rather than the hierarchical ones which OLAP requires. On the other hand, as an unsupervised hierarchical topic model, hLDA can obtain the sibling-sibling relationships between topics and can organize the topics into a hierarchical tree automatically. In fact, Twitter data contain abundant social behavioral information about tweeters, such as mentioning, retweeting and following. In addition, there exist some semantic relationships among the words in tweets, which may affect the effectiveness of the modeling process. In other words, to effectively discover the hidden layers of topics from Twitter data for constructing the hierarchical dimension for OLAP, we need to propose a new topic model which leverages the characteristics of Twitter in its modeling process.
In this paper, we focus on how to discover the underlying topics of tweets from tweeters' social behaviors and from their published tweets. Such topics can then be organized into one very important hierarchical dimension, or topic dimension, for applying OLAP to Twitter data. We present a model called thLDA to extract the hidden-layer topics from Twitter data for the multidimensional analysis of tweets' topics. The process is briefly described as follows. Firstly, we collect a primitive corpus through Twitter's APIs. Then, we preprocess the Twitter data by removing stop words and irrelevant data such as short links, short tweets and junk information. Subsequently, we analyze the social relationships of tweeters and the semantic relationships between words in tweets. Finally, we mine the topics from the Twitter data and organize them into a hierarchical structure based on thLDA.

The main contribution of this paper is threefold. (1) We introduce a novel hierarchical model called thLDA to construct a dimension hierarchy of tweets' topics, incorporating social relationships and semantic relationships into the modeling process. (2) We make use of word2vec, which is a two-layer neural network model, to obtain the semantic relationships between words in tweets, to improve the mining of the topics. (3) We conduct extensive experiments on our model with large quantities of Twitter data and find that the results demonstrate its effectiveness.

The remainder of this paper is organized as follows. After introducing the state-of-the-art related works in Section 2, we present some preliminaries necessary for understanding the paper in Section 3. Section 4 elaborates our proposed thLDA model and demonstrates its mathematical derivation, and Section 5 presents the experimental results and the comparison with other models, undertaken to verify the effectiveness of thLDA. Finally, we draw conclusions about our model and outline future work in Section 6.

II. RELATED WORKS
OLAP is an approach to answering multidimensional analytical queries over the cube data. It provides operations such as rolling up, drilling down and slicing [3]. The goal of OLAP is to provide decision support or ad-hoc reporting. Its core technology is the concept of ''dimensions,'' which are usually multiple and hierarchical. Based on dimensions, OLAP aggregates the ''measured'' data by averaging, counting, totaling and so on.

Traditional OLAP can effectively analyze structured multidimensional data. However, it cannot handle unstructured data such as tweets [4]. In order to apply OLAP technology to the analysis of unstructured textual data, the concept of text OLAP has been proposed [5]. Based on traditional OLAP technology, text OLAP aims to provide aggregative functions that summarize unstructured text data [6], [7]. For instance, Azabou et al. [8] present a novel model which serves as a basis for semantic OLAP over documents.

How to accurately and effectively mine tweets' topics from social data has long been a focus of research in the field of natural language processing. For example, Michelson and Macskassy [9] present a topic profile to characterize tweets' topics. Cuzzocrea et al. [10] introduce an aggregation operator for tweets' content by using formal concept analysis theory. Liu et al. [11] propose a text cube approach to learning various types of social, human and cultural behaviors embedded in the Twitter data. Rehman et al. [12] focus on incorporating extensive natural language processing technology in OLAP, to analyze multidimensional social data.

In addition, many researchers employ machine learning techniques to analyze social media. Siswanto et al. [13] propose a model that utilizes supervised learning-based classification based on tweeters' labels and specific accounts. Pennacchiotti and Popescu [14] propose a generic machine learning framework for tweeter classification, based on four general feature types: tweeter profile, tweeting behavior, linguistic content of the tweeter's messages and tweeter social network features. Pu et al. [15] present a mixed method which combines text mining and Wikipedia to mine tweeters' topics in Twitter data. Vathi et al. [16] propose a model based on a topic model to mine tweeters' clustered discussion topics and design a method for excluding trivial topics. Furthermore, combining a topic model with analysis of Twitter data, Zhao et al. [17] propose a method called Twitter-LDA which aims to mine tweeters' topics from a typical sample of Twitter as a whole. However, it can only mine flat topics from the Twitter data and does not take into consideration the hierarchical aspects of the topics. Based on LDA, Blei and McAuliffe [18] propose sLDA (supervised Latent Dirichlet Allocation). In sLDA, Blei et al. add to LDA a response variable associated with each document. In order to find latent topics that will best predict the response variables for future unlabeled documents, sLDA jointly models the documents and the responses.

In order to obtain the topic hierarchy from textual data, some researchers have focused on how to extend the traditional topic modeling techniques to obtain hierarchical information on the topics. The technique of hLDA [19], based on the notion of the nCRP (nested Chinese restaurant process) [20], can simultaneously mine topics and construct the topic hierarchy by analyzing the relationships of topics without supervision. On the basis of hLDA, Mao et al. [21] propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA). Wang et al. [22] also propose a semi-supervised hierarchical topic model, which aims to explore more reasonable topics in the data space by incorporating some automatically extracted constraints into the modeling process, denoted as constrained hierarchical Latent Dirichlet Allocation (constrained-hLDA). Dai and Storkey [23] propose the sHDP (supervised hierarchical Dirichlet process), which is a nonparametric generative model for the joint distribution of a group of observations and a response.
FIGURE 3. The overall process of exploring Twitter data based on the technique.

• Hierarchical topic modeling: Extract the topics (or interests) from the Twitter data, and construct the hierarchical topic dimension based on the probability distribution of the various topics and subtopics.
• Data exploring: Analyze tweeters from multiple dimensions using OLAP.

Although OLAP technology provides an intuitive query form that is consistent with human habits, it can only handle structured data and fails to deal with scenarios involving unstructured text data like tweets. Therefore, the key to applying OLAP technology to Twitter data is to identify and construct the dimension hierarchy from the Twitter data automatically. However, this still remains a difficult problem. The main issue this paper tries to resolve can be described as follows: how to automatically mine and construct the hierarchical dimension of tweets' topics (or tweeters' interests) from the unstructured tweet data to achieve effective multidimensional analysis.

C. WORD2VEC
Word2vec [30], [31] is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words sharing common contexts in the corpus are located in close proximity to one another. Word2vec was created by a team of researchers led by Tomas Mikolov at Google, and has subsequently been analyzed and explained by other researchers. Embedding vectors created using the word2vec algorithm have many advantages compared with earlier algorithms such as latent semantic analysis. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words; the order of the context words does not influence the prediction (the bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.
FIGURE 4. The models CBOW and skip-gram [30].

Figure 4(a) shows the CBOW model, where w_t represents the central word and w_{t±i} represents a context word of w_t. Figure 4(b) shows the skip-gram model, which weighs nearby context words more heavily than more distant context words. CBOW is considered to be faster than skip-gram, while skip-gram is more suitable for infrequent words.

During the calculation process of word2vec, we usually express the semantic correlation between two words by calculating the cosine similarity of their word vectors: the greater the cosine similarity, the stronger the correlation between the two words. In addition, as the dimension increases, the model effectiveness tends to become steady. To ensure both high efficiency and good effectiveness, we choose 300 as the number of dimensions of the word vectors in our approach.
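As a minimal sketch of how such a model can be trained and queried, the following assumes the gensim library (an illustrative choice; the paper itself uses Google's pretrained 300-dimensional GoogleNews vectors rather than training from scratch, and the toy corpus below is invented):

    # Train a small word2vec model and query cosine similarity (illustrative only).
    from gensim.models import Word2Vec

    # Each "sentence" stands in for the tokenized tweets of one user.
    corpus = [
        ["football", "match", "goal", "league"],
        ["election", "vote", "policy", "senate"],
    ]

    # vector_size=300 matches the dimensionality chosen in the paper;
    # sg=0 selects CBOW, sg=1 would select skip-gram.
    model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, sg=0)

    # Cosine similarity between two word vectors (cf. Equation (8) later).
    print(model.wv.similarity("football", "match"))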
D. CRP AND HLDA
The current topic models can be employed to mine the tweets' topics from large quantities of Twitter data. As a classical topic model, the standard LDA model considers that each word in an article is obtained by the following process: choose a topic with a certain probability in the article, and then choose a word from the chosen topic. In the framework of the LDA model, all words in all articles represent observable data, and the topics of articles are implicit random variables which can only be obtained through several iterations of sampling.

However, one of the disadvantages of the standard LDA model is that we must specify the number of topics in advance of the modeling process. In fact, the number of topics varies between articles, and a fixed topic number may adversely affect the modeling process. In addition, the standard LDA model is unable to analyze the relationships between topics. In other words, by leveraging standard LDA, we can only retrieve topics in one single layer rather than in a topic hierarchy.

Fortunately, a probability distribution model based on the partition of integers, CRP (Chinese restaurant process), and its extension, nCRP (nested Chinese restaurant process), can organize topics into a hierarchical structure, and allow the data to continue to change and accumulate, by creating a hierarchical division of the sampling process.

CRP can be illustrated as follows. Supposing a Chinese restaurant has an infinite number of tables and m − 1 customers are already seated, the corresponding number of customers at each table is N = {n_j | 0 < j < m}. The probability that the next customer m chooses an occupied, or unoccupied, table is given by the following distributions:

P(\text{occupied table } r_j \mid \text{previous } m-1 \text{ customers}, \gamma) = \frac{n_j}{\gamma + m - 1} \qquad (1)

P(\text{unoccupied table} \mid \text{previous } m-1 \text{ customers}, \gamma) = \frac{\gamma}{\gamma + m - 1} \qquad (2)

Here, γ is the parameter which controls the probability of the customer selecting a new table.

The nCRP model is derived from CRP, and is a distribution over hierarchical partitions. The nCRP model can be illustrated by the following situation. Supposing in a city there is an infinite number of Chinese restaurants, each of which has an infinite number of tables. The first restaurant is regarded as the root restaurant, and each table in this restaurant corresponds to a card which refers to another restaurant. In other words, each restaurant is associated with other restaurants. Consequently, all the restaurants can be organized into a tree with an infinite number of branches, while every level of the tree is associated with an infinite number of restaurants.

Consider a certain number of customers coming to the city for L days of holiday. On the first day, a customer comes into the root restaurant and chooses a table according to Equation (1). On the second day, he goes to the second restaurant, which is associated with the table he chose previously, and then chooses a table according to Equations (1) and (2). All customers choose restaurants according to Equations (1) and (2), repeatedly, for L days. In other words, each customer follows a path which starts from the root restaurant and ends at level L. After all customers have finished their L-day holiday, the paths followed by the customers constitute a collection which can be regarded as an L-level tree. As an extension of CRP, nCRP can thus be applied to illustrate the uncertainty in the hierarchical structure (see Figure 5 for an example of such a tree).

The hLDA model mines the topics in the same way as LDA, but applies nCRP to organize the topics into a hierarchical structure rather than a flat structure. During the modeling process of hLDA, a certain document first chooses, by nCRP, a path which starts from the root node and ends at a leaf node, and then samples topics at every node in the chosen path.

FIGURE 5. The paths of four tourists through the infinite tree of Chinese restaurants (L = 3).
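To make the path drawing concrete, the following is a minimal simulation sketch of Equations (1) and (2) and of the nCRP tree of Figure 5 (the function and variable names are illustrative, not from the paper):

    import random

    def crp_choose_table(counts, gamma):
        """CRP step of Equations (1) and (2): an occupied table j is chosen
        with probability n_j / (gamma + m - 1), a new table with
        probability gamma / (gamma + m - 1)."""
        m_minus_1 = sum(counts)
        r = random.uniform(0, gamma + m_minus_1)
        acc = 0.0
        for j, n_j in enumerate(counts):
            acc += n_j
            if r < acc:
                return j              # occupied table
        return len(counts)            # unoccupied (new) table

    def ncrp_draw_path(tree, gamma, L):
        """Draw an L-level path through the infinite restaurant tree.
        `tree` maps a path prefix (tuple) to that restaurant's table counts."""
        path = []
        for _level in range(L):
            counts = tree.setdefault(tuple(path), [])
            j = crp_choose_table(counts, gamma)
            if j == len(counts):
                counts.append(0)      # open a new table, i.e. a new branch
            counts[j] += 1
            path.append(j)
        return path

    # Four "tourists" each follow a 3-level path, as in Figure 5.
    tree = {}
    print([ncrp_draw_path(tree, gamma=1.0, L=3) for _ in range(4)])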
IV. THLDA
A. OVERVIEW
In contrast with hLDA, thLDA integrates tweets and social
relationships among tweeters into the modeling process.
In addition, it considers semantic relationships between
words in tweets. Figure 6 shows the Bayesian process of
thLDA. During the modeling process, we first sample the
path c_m for each tweeter, and then sample z_{m,w}, which denotes the topic allocation of each word, associated with the level in the path.
Table 1 presents the symbols used throughout the paper.
For simplicity, we do not distinguish between topics and interests in the Twitter data.

B. DATA PREPROCESSING
Before the actual topic modeling, we need to preprocess the text by transforming the disordered text into an easy-to-handle text-word matrix.

The traditional LDA and hLDA require documents with clear structure and rigorous style. Unfortunately, tweet texts tend to be short and simple. When a single tweet text is treated as a document and modeled as an input to LDA or hLDA, we often cannot obtain good results. Therefore, in this paper, we treat all the tweet texts of a Twitter user as one input document to thLDA.

As shown in Figure 7, we combine all the tweet data of the Twitter user Twitter_m into a tweet document, and then obtain the tweet document collection TDC = {TweetDoc_m | m ∈ M}, in which TweetDoc_m = {w_{m,1}, w_{m,2}, ..., w_{m,n}}.
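A rough sketch of this preprocessing step follows, under stated assumptions: the stop-word list, tokenizer, and regular expression below are placeholders, while the rules themselves (remove links and stop words, drop duplicates and tweets shorter than 6 words, merge each user's tweets into one TweetDoc_m) come from this section and from Section V:

    import re

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset

    def clean_tweet(text):
        """Strip (shortened) URLs, lower-case, tokenize, and drop stop words."""
        text = re.sub(r"https?://\S+", " ", text)
        tokens = re.findall(r"[a-z']+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    def build_tweet_documents(tweets_by_user):
        """Combine all tweets of each user into one document (TweetDoc_m)."""
        tdc = {}
        for user, tweets in tweets_by_user.items():
            doc, seen = [], set()
            for tweet in tweets:
                if tweet in seen:
                    continue                  # duplicated tweet
                seen.add(tweet)
                tokens = clean_tweet(tweet)
                if len(tokens) < 6:
                    continue                  # too short: no clear semantics
                doc.extend(tokens)
            tdc[user] = doc
        return tdc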
D. PATH SAMPLING
The distribution of the path c_m conditioned on all the observed words is expressed as follows:
Here, to calculate sim(q_{k_l,i}, q_{k_{l+1},j}), we employ word2vec, an efficient tool for training words as x-dimensional vectors. Supposing there are two words w_1 and w_2, we can obtain the similarity between w_1 and w_2 using the following expression:

\mathrm{Sim}(w_1, w_2) = \cos(V_1, V_2) = \frac{\sum_{i=1}^{x} V_{1,i} \times V_{2,i}}{\sqrt{\sum_{i=1}^{x} V_{1,i}^2} \times \sqrt{\sum_{i=1}^{x} V_{2,i}^2}} \qquad (8)

Here, V_1 and V_2 are the vectors of w_1 and w_2 obtained using word2vec, and x is the number of dimensions.
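Equation (8) translates directly into code; a minimal numpy version (the random vectors are stand-ins for word2vec outputs) is:

    import numpy as np

    def sim(v1, v2):
        """Cosine similarity of two word vectors, Equation (8)."""
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    v1, v2 = np.random.rand(300), np.random.rand(300)  # stand-ins for word vectors
    print(sim(v1, v2))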
Further, we introduce H_{m,k}, or the social impact, to represent the degree to which social relationships affect tweeter m in choosing topic k.

Supposing S_m = {u_1, u_2, u_3, ..., u_{N_m}} represents the social list of tweeter m, where u_i represents the ith tweeter in the social list S_m and N_m represents the number of tweeters in the list. The social impact is calculated using the following equation, where P_{u_i,k} represents the probability that tweeter u_i selected topic k in the previous iteration:

H_{m,k} = \frac{\sum_{i=1}^{N_m} P_{u_i,k}}{N_m} \qquad (9)
On the other hand, the second factor of Equation (4), or the probabilistic distribution P(W_m | c, W_{−m}, z, β), represents the probability of obtaining the words for tweeter m with a certain choice of path, which can be calculated as follows:

P(W_m \mid \mathbf{c}, W_{-m}, \mathbf{z}, \beta) = \prod_{l=1}^{L} \frac{\Gamma\left(\sum_{w \in W} (n_{c_{m,l},-m}^{w} + \beta)\right)}{\prod_{w \in W} \Gamma(n_{c_{m,l},-m}^{w} + \beta)} \times \frac{\prod_{w \in W} \Gamma(n_{c_{m,l},-m}^{w} + n_{c_{m,l},m}^{w} + \beta)}{\Gamma\left(\sum_{w \in W} (n_{c_{m,l},-m}^{w} + n_{c_{m,l},m}^{w} + \beta)\right)} \qquad (10)

where n_{c_{m,l},−m}^{w} represents the number of words assigned to c_{m,l}, excluding those in the tweet document TweetDoc_m.

E. TOPICS SAMPLING
After path sampling, we sample the words of each tweeter, i.e., we allocate the topic, or the level of the topic tree, to each word.

The joint probability of the whole corpus of tweets is calculated as follows:

P(z_{m,w} \mid z_{-(m,w)}, W_m, Y', \alpha, \beta) = p(W_m \mid \mathbf{z}, \beta, Y') \times p(\mathbf{z} \mid \alpha) = p(W_m, \mathbf{z} \mid \alpha, \beta, Y') \qquad (11)

(3) Iteration. Repeat Step (2) until the result converges to a steady value.

We assume that N_M tweeters are associated with N_M independent Dirichlet-multinomial conjugated structures, and N_K topics are likewise associated with N_K independent Dirichlet-multinomial conjugated structures. The main process of assigning a topic to each tweet word of tweeter m is presented as follows:

(1) α → θ_m → z: When generating the tweets of tweeter m, we first obtain θ_m, which is the probability distribution of topics over tweeter m, according to the hyper-parameter α. Afterwards, we generate z_m, the collection of z_{m,w} for all words of tweeter m. Here, α → θ_m is associated with a Dirichlet process, and θ_m → z is associated with a multinomial distribution. On the whole, α → θ_m → z is associated with a Dirichlet-multinomial conjugated structure.

(2) β, Y′_{k,w}, z_m → ϕ_k → W_m: Given z_m, we first obtain ϕ_k, which is the probability distribution of words over topic k, according to the hyper-parameter β and the semantic impact between topic k and word w. Afterwards, we generate W_m, the collection of all words of tweeter m. Here, β, Y′_{k,w}, z_m → ϕ_k is associated with a Dirichlet process, and ϕ_k → W_m is associated with a multinomial distribution. As a whole, β, Y′_{k,w}, z_m → ϕ_k → W_m is associated with a Dirichlet-multinomial conjugated structure.

We obtain the probability distribution of topics as follows:

p(\mathbf{z} \mid \alpha) = \int p(\mathbf{z} \mid \theta) \times p(\theta \mid \alpha)\, d\theta = \prod_{m \in M} \frac{\Delta(v_m + \alpha)}{\Delta(\alpha)} = \prod_{m \in M} \frac{\Gamma\left(\sum_{k \in K} \alpha_k\right)}{\prod_{k \in K} \Gamma(\alpha_k)} \times \frac{\prod_{k \in K} \Gamma(v_{m,k} + \alpha_k)}{\Gamma\left(\sum_{k \in K} (v_{m,k} + \alpha_k)\right)} \qquad (12)

Furthermore, the probability distribution of words is obtained as follows:

p(W_m \mid \beta, Y'_{k,w}, \mathbf{z}) = \int p(W_m \mid \mathbf{z}, Y'_{k,w}, \varphi) \times p(\varphi \mid \beta)\, d\varphi = \prod_{k \in K} \frac{\Delta(Y'_{k,w}(v_k + \beta))}{\Delta(\beta)} = \prod_{k \in K} \frac{\Gamma\left(\sum_{w \in W_m} \beta_w\right)}{\prod_{w \in W_m} \Gamma(\beta_w)} \times \frac{\prod_{w \in W_m} \Gamma(Y'_{k,w} \times (v_{k,w} + \beta_w))}{\Gamma\left(\sum_{w \in W_m} Y'_{k,w}(v_{k,w} + \beta_w)\right)} \qquad (13)
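The Dirichlet-multinomial structure of step (1) can be illustrated with a short numpy sketch (the dimensions and hyper-parameter value are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    K, n_words, alpha = 5, 12, 0.1

    # alpha -> theta_m: draw the per-tweeter topic distribution (Dirichlet).
    theta_m = rng.dirichlet([alpha] * K)

    # theta_m -> z: draw a topic for each word position (multinomial).
    z_m = rng.choice(K, size=n_words, p=theta_m)
    print(theta_m.round(3), z_m)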
Supposing the representative words of topic k constitute the set Q_k = {q_{k,i} | 1 ≤ i ≤ n}, the word-topic semantic impact can thus be obtained as follows:

Y'_{k,w} = \frac{\sum_{i=1}^{n} (f_{k,i} \times \mathrm{Sim}(w, q_{k,i}))}{\sum_{i=1}^{n} f_{k,i}} \qquad (14)

Here, f_{k,i} denotes the frequency of occurrence of word i with respect to topic k.

According to Equations (12) and (13), we obtain the joint probability distribution of W and z as follows:

p(W_m, \mathbf{z} \mid \alpha, \beta, Y'_{k,w}) = \prod_{k \in K} \frac{\Delta(Y'_{k,w}(v_k + \beta))}{\Delta(\beta)} \times \prod_{m \in M} \frac{\Delta(v_m + \alpha)}{\Delta(\alpha)} \qquad (15)

According to the Gibbs sampling method, we iterate over Equation (15) and sample the topics of all words until the sampling result becomes stable. Finally, we obtain the document-topic probability distribution θ_m of the tweets and the topic-word probability distribution φ_k of the tweets. The results are as follows:

\theta_m = \frac{v_{m,k} + \alpha}{v_{m,\cdot} + K\alpha} \qquad (16)

\varphi_k = \frac{Y'_{k,w}(v_{k,w} + \beta)}{Y'_{k,\cdot}(v_{k,\cdot} + V\beta)} \qquad (17)

Combining c_m, θ_m, and φ_k, we know the distribution of the various topics of TweetDoc along the path c, and the probability distribution of the various words of TweetDoc in each topic. In this way, we obtain a complete topic tree. Algorithm 1 describes the formal modeling process of thLDA.

Algorithm 1 Formalized Modeling Process of thLDA
Input: TDC - the set of tweet documents;
       α, β, γ - hyperparameters;
       L - the height of the topic tree;
       I - the iteration number of Gibbs sampling;
Output: TopicTree;
1:  // Associate a topic with each node based on a Dirichlet distribution
2:  for each t ∈ TopicTree do
3:      draw a Dirichlet process ϕ ∼ Dir(β);
4:  end for
5:  // Generate a path for TweetDoc_m based on nCRP
6:  for each TweetDoc_m ∈ TDC do
7:      let c_1 be the root node;
8:      for each level l ∈ {1, 2, ..., L} do
9:          draw the current level for each Tweet_{m,s};
10:         draw an occupied path c_l using Eq. (5);
11:         draw an unoccupied path c_l using Eq. (6);
12:     end for
13:     obtain c_m;
14:     draw an L-dim. topic proportion vector θ_m from Dir(α);
15:     for i = 1 to I do
16:         for each word w ∈ W do
17:             draw topic z ∈ {1, 2, ..., L} from Mult(θ);
18:             draw w from the topic z;
19:         end for
20:     end for
21: end for
22: return TopicTree;
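A compact sketch of the quantities in Equations (14), (16) and (17), assuming count matrices collected during Gibbs sampling (all array names are hypothetical, and the row normalization in phi stands in for the denominator of Equation (17)):

    import numpy as np

    def semantic_impact(freqs, sims):
        """Word-topic semantic impact Y'_{k,w} of Equation (14):
        frequency-weighted average similarity between word w and the
        representative words q_{k,i} of topic k."""
        return float(np.dot(freqs, sims) / np.sum(freqs))

    def estimate_theta_phi(v_mk, v_kw, y, alpha, beta):
        """Posterior estimates of Equations (16) and (17).
        v_mk: tweeter-topic count matrix (M x K);
        v_kw: topic-word count matrix (K x V);
        y:    semantic impacts Y'_{k,w} (K x V)."""
        K = v_mk.shape[1]
        theta = (v_mk + alpha) / (v_mk.sum(axis=1, keepdims=True) + K * alpha)
        num = y * (v_kw + beta)                      # numerator of Equation (17)
        phi = num / num.sum(axis=1, keepdims=True)   # normalize over the vocabulary
        return theta, phi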
V. EXPERIMENT
A. DATA AND ENVIRONMENT
To verify the effectiveness and efficiency of our model, we conducted extensive experiments on large quantities of Twitter data collected through the Twitter REST API. We first chose the 15 Twitter users attracting the most attention as the seeds, and then obtained all tweeters who followed the seeds, retrieving their profiles, tweets, and social relationships (including following lists and followed lists). The number of tweets reached a total of 21,213,000. Subsequently, we removed the short tweets of less than 6 words, because such tweets generally have no clear semantics. In addition, we also removed the duplicated tweets. Finally, we obtained 10,160,317 tweets from 6,907 tweeters. Figure 8 shows the distribution of tweeters and tweets. Due to the limitations of the Twitter REST API, we could only acquire at most 3,200 tweets for each tweeter. The experimental data and results are published on the website for reference (http://dbsi.hdu.edu.cn/twitter_data/).

FIGURE 8. The distribution of tweeters over different numbers of tweets.

The word2vec model we employed in our paper was downloaded from https://code.google.com/archive/p/word2vec/. This repository hosts the word2vec model (three million 300-dimensional English word vectors) trained on the Google News dataset. The experiments were executed on a computer with eight E5-2620 2.10GHz cores, 16GB memory, and Windows 7.

B. EVALUATING EFFECTIVENESS BASED ON PMI
We used the PMI-score (pointwise mutual information score) to evaluate the effectiveness of our model. To check whether a topic was reasonable, we judged the number of odd words which were irrelevant to the specific topic.

We calculated the PMI values for pairs of the top 20 frequent words relevant to each topic. The larger the PMI value between two words, the stronger the relationship between them. If two words are completely unrelated, their PMI value
is set to zero. We set the PMI-score of topic k to the median of all the PMI values of its word pairs, as shown in Equation (18):

\mathrm{PMI\text{-}SCORE}_k = \mathrm{median}\{\mathrm{PMI}(w_{ki}, w_{kj})\}, \quad i, j \in [1, 20] \qquad (18)

in which

\mathrm{PMI}(w_{ki}, w_{kj}) = \log \frac{p(w_{ki}, w_{kj})}{p(w_{ki})\, p(w_{kj})} \qquad (19)

As we know, when applying LDA the number of topics must be assigned in advance. However, the number of topics can be determined during the modeling process when applying either our model or hLDA. To ensure a fair comparison, we first conducted the experiments on thLDA and obtained the specific topic numbers for different heights of the topic trees, and based on these we then ran the experiments on LDA. The relation between the heights of the topic trees (used by thLDA) and the corresponding topic numbers (used by LDA) is shown in Table 3.

TABLE 3. The height of topic tree and its corresponding topic number.

The PMI score of our model is close to those of the other two models for a height value of two, and slightly lower than those of the other two models for a height value of three. When the height is too small, the corresponding number of topics is small and unrelated words are assigned to the same topic; consequently, the PMI score of our model is similar to that of the other two models at height values of two and three. However, our model outperforms the other two models for height values of four, five and six.

C. EVALUATING EFFECTIVENESS BASED ON PERPLEXITY
As a conventional evaluation index of topic models, perplexity is normally used to evaluate the ability of a topic model to generate texts. For a set of tweets, a lower perplexity denotes better effectiveness of the topic model and a stronger ability to predict texts. For a set of tweets D, the perplexity is calculated as follows:

P(D) = \exp\left(-\frac{\sum_{m=1}^{M} \log(p(w_m))}{\sum_{m=1}^{M} N_m}\right) \qquad (20)

in which w_m denotes the words of tweeter m, N_m denotes the number of words of tweeter m, and M denotes the number of tweets in the set D.

TABLE 2. The perplexity of thLDA, LDA, and hLDA over different heights of topics with different numbers of iterations.

FIGURE 10. The perplexity of thLDA over different heights with different numbers of iterations.
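Equations (18)-(20) can be computed directly from empirical word statistics and model log-likelihoods; a small sketch (the probability tables and inputs are assumed to come from a reference corpus and a trained model) is:

    import math
    from itertools import combinations

    def pmi(p_ij, p_i, p_j):
        """Pointwise mutual information of Equation (19); zero if unrelated."""
        return math.log(p_ij / (p_i * p_j)) if p_ij > 0 else 0.0

    def pmi_score(top_words, p_word, p_pair):
        """PMI-score of a topic, Equation (18): median PMI over all pairs of
        its top-20 words. `p_pair` is assumed symmetric in its key order."""
        vals = sorted(pmi(p_pair.get((wi, wj), 0.0), p_word[wi], p_word[wj])
                      for wi, wj in combinations(top_words, 2))
        n = len(vals)
        return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

    def perplexity(log_probs, word_counts):
        """Perplexity of Equation (20): exp of the negative total log
        likelihood divided by the total number of words."""
        return math.exp(-sum(log_probs) / sum(word_counts))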
This demonstrates that thLDA outperforms LDA and hLDA as far as perplexity is concerned.

D. OVERALL EFFECT
Table 4 shows part of the word distribution of the discovered topics, and the hierarchical relationships between them over different levels, when the height is set to four.

TABLE 4. Part of distribution of topic numbers at different levels over different heights for thLDA.

One advantage of applying OLAP to Twitter data is that we can conduct multidimensional analysis using operations such as rolling up and drilling down. As shown in Figure 11, with regard to topic-1 of level-2, when we drill down into it, the distributions of the tweets' topics differ from city to city. However, in all cases, topic-1 of level-3, which may be described as ''sports'' in accordance with the hot words given in Table 4, attracts the most attention.
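As an illustration of such roll-up and drill-down operations on the mined topic dimension (the fact table below is hypothetical, and pandas merely stands in for an OLAP engine):

    import pandas as pd

    # Hypothetical fact table: one row per (city, level-2 topic, level-3
    # topic) with an aggregated tweet count as the measure.
    facts = pd.DataFrame({
        "city":   ["NYC", "NYC", "London", "London"],
        "topic2": ["topic-1", "topic-1", "topic-1", "topic-1"],
        "topic3": ["sports", "music", "sports", "politics"],
        "tweets": [120, 40, 90, 30],
    })

    # Roll up: total attention to topic-1 of level-2 per city.
    print(facts.groupby(["city", "topic2"])["tweets"].sum())

    # Drill down: distribution over level-3 subtopics within topic-1.
    print(facts.groupby(["city", "topic2", "topic3"])["tweets"].sum())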
DONGJIN YU is currently a Professor with Hangzhou Dianzi University, China, where he is also the Director of the Institute of Big Data and the Institute of Computer Software. His research efforts include big data, business process management, and software engineering. He is a member of IEEE, a member of ACM, and a Senior Member of the China Computer Federation (CCF). He is also a member of the Technical Committee of Software Engineering, CCF, and the Technical Committee of Service Computing, CCF.

DONGJING WANG received the B.S. and Ph.D. degrees in computer science from Zhejiang University, Hangzhou, China, in 2012 and 2018, respectively. He is currently a Lecturer with Hangzhou Dianzi University, China. His current research interests include recommender systems, machine learning, and business process management.

DENGWEI XU is currently pursuing the degree with Hangzhou Dianzi University, China. His research interests include machine learning and information retrieval.

ZHIYONG NI received the bachelor's and master's degrees in computer science from Hangzhou Dianzi University, China. He has participated in several government-funded projects related to data mining. His current research interests mainly include online analytical processing and information retrieval.