
Received November 20, 2018, accepted December 9, 2018, date of publication January 10, 2019, date of current version February 6, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2891902

Hierarchical Topic Modeling of Twitter Data for Online Analytical Processing

DONGJIN YU, (Member, IEEE), DENGWEI XU, DONGJING WANG, AND ZHIYONG NI
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Corresponding author: Dongjin Yu ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 61472112 and Grant 61702144, in part
by the Key Science and Technology Project of Zhejiang Province under Grant 2017C01010, and in part by the Natural Science Foundation
of Zhejiang Province under Grant LY12F02003.

ABSTRACT Social platforms, such as Twitter, reveal much about the tastes of the public. Many studies focus on the content analysis of social platforms, which assists in product promotion and sentiment investigation. On the other hand, online analytical processing (OLAP) has been proven to be very effective for analyzing multidimensional structured data. The key purpose of applying OLAP to text messages (e.g., tweets), called text OLAP, is to mine and construct the hierarchical dimension based on the unstructured text content. In contrast to the plain texts which text OLAP usually handles, social media content includes a wealth of social relationship information which can be employed to extract a more effective dimensional hierarchy. In this paper, we propose a topic model called Twitter hierarchical latent Dirichlet allocation (thLDA). Based on hierarchical latent Dirichlet allocation, thLDA aims to automatically mine the hierarchical dimension of tweets' topics, which can be further employed for text OLAP on the tweets. Furthermore, thLDA uses word2vec to analyze the semantic relationships of words in tweets to obtain a more effective dimension. We conduct extensive experiments on huge quantities of Twitter data and evaluate the effectiveness of thLDA. The experimental results demonstrate that it outperforms other current topic models in mining and constructing the hierarchical dimension of tweeters' topics.

INDEX TERMS Twitter, online analytical processing, topic modeling, hierarchical latent Dirichlet allocation, social media analysis.

I. INTRODUCTION
During the past few years, Twitter has become increasingly popular as an emerging social platform for messaging and communication among individuals. The huge quantities of Twitter data accumulated so far make it possible to discover the distribution and drift of mass tastes and opinions, which greatly assists in product recommendation, target marketing and so on. On the other hand, OLAP, or online analytical processing, enables analysts to interactively view data from all aspects in layered granularities, which has already been proven especially useful for business intelligence. Unfortunately, OLAP techniques are successful in dealing with cube data which are structured and formalized, but face difficulties in processing textual content such as Twitter data. To successfully apply OLAP techniques to Twitter, it is critical to mine the hidden representative dimensions from its extensive content.

As a typical unsupervised topic model, the latent Dirichlet allocation (LDA) model is efficient at statistically analyzing textual data for the underlying topics. In [1] and [2], we proposed an LDA-based model, called MS-LDA, to detect the hidden layered interests from the Twitter data. As an extension of LDA, MS-LDA integrated tweets and the social relationships among tweeters. Nevertheless, the primitive LDA model can only mine monolayer topics, rather than the hierarchical ones which OLAP requires. On the other hand, as an unsupervised hierarchical topic model, hLDA can obtain the sibling-sibling relationships between topics and can organize the topics into a hierarchical tree automatically. In fact, Twitter data contain abundant social behavioral information about tweeters, such as mentioning, retweeting and following. In addition, there exist some semantic relationships among the words in tweets, which may affect the effectiveness of the modeling process. In other words, to effectively discover the hidden layers of topics from Twitter data for constructing the hierarchical dimension for OLAP, we need to propose a new topic model which leverages the characteristics of Twitter in its modeling process.

2169-3536 © 2019 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In this paper, we focus on how to discover the underlying topics of tweets from tweeters' social behaviors and from their published tweets. Such topics can then be organized into one very important hierarchical dimension, or topic dimension, for applying OLAP to Twitter data. We present a model called thLDA to extract the hidden-layer topics from Twitter data for the multidimensional analysis of tweets' topics. The process is briefly described as follows. Firstly, we collect a primitive corpus through Twitter's APIs. Then, we preprocess the Twitter data by removing stop words and irrelevant data such as short links, short tweets and junk information. Subsequently, we analyze the social relationships of tweeters and the semantic relationships between words in tweets. Finally, we mine the topics from the Twitter data and organize them into a hierarchical structure based on thLDA.

The main contribution of this paper is threefold. (1) We introduce a novel hierarchical model called thLDA to construct a dimension hierarchy of tweets' topics, incorporating social relationships and semantic relationships into the modeling process. (2) We make use of word2vec, which is a two-layer neural network model, to obtain the semantic relationships between words in tweets, to improve the mining of the topics. (3) We conduct extensive experiments on our model with large quantities of Twitter data, and the results demonstrate its effectiveness.

The remainder of this paper is organized as follows. After introducing the state-of-the-art related works in Section 2, we present some preliminaries necessary for understanding the paper in Section 3. Section 4 elaborates our proposed thLDA model and demonstrates its mathematical derivation, and Section 5 presents the experimental results and the comparison with other models, undertaken to verify the effectiveness of thLDA. Finally, we draw conclusions about our model and outline future work in Section 6.

II. RELATED WORKS
OLAP is an approach to answering multidimensional analytical queries over the cube data. It provides operations such as rolling up, drilling down and slicing [3]. The goal of OLAP is to provide decision support or ad-hoc reporting. Its core concept is that of ''dimensions,'' which are usually multiple and hierarchical. Based on dimensions, OLAP aggregates the ''measured'' data by averaging, counting, totaling and so on.

Traditional OLAP can effectively analyze structured multidimensional data. However, it cannot handle unstructured data such as tweets [4]. In order to apply OLAP technology to the analysis of unstructured textual data, the concept of text OLAP has been proposed [5]. Based on traditional OLAP technology, text OLAP aims to provide aggregative functions that summarize unstructured text data [6], [7]. For instance, Azabou et al. [8] present a novel model which serves as a basis for semantic OLAP for documents.

How to accurately and effectively mine tweets' topics from social data has long been the focus of research in the field of natural language processing. For example, Michelson and Macskassy [9] present a topic profile to characterize tweets' topics. Cuzzocrea et al. [10] introduce an aggregation operator for tweets' content by using formal concept analysis theory. Liu et al. [11] propose a text cube approach to learning various types of social, human and cultural behaviors embedded in the Twitter data. Rehman et al. [12] focus on incorporating extensive natural language processing technology in OLAP, to analyze multidimensional social data.

In addition, many researchers employ machine learning techniques to analyze social media. Siswanto et al. [13] propose a model that utilizes supervised learning-based classification based on tweeters' labels and specific accounts. Pennacchiotti and Popescu [14] propose a generic machine learning framework for tweeter classification, based on four general feature types: tweeter profile, tweeting behavior, linguistic content of the tweeter's messages and tweeter social network features. Pu et al. [15] present a mixed method which combines text mining and Wikipedia to mine tweeters' topics in Twitter data. Vathi et al. [16] propose a topic-model-based approach to mine tweeters' clustered discussion topics, and design a method for excluding trivial topics. Furthermore, combining a topic model with analysis of Twitter data, Zhao et al. [17] propose a method called Twitter-LDA which aims to mine tweeters' topics from a typical sample of Twitter as a whole. However, it can only mine flat topics from the Twitter data and does not take into consideration the hierarchical aspects of the topics. Based on LDA, Blei and Mcauliffe [18] propose sLDA (supervised latent Dirichlet allocation), which adds to LDA a response variable associated with each document. In order to find latent topics that will best predict the response variables for future unlabeled documents, sLDA jointly models the documents and the responses.

In order to obtain the topic hierarchy from the textual data, some researchers have focused on how to extend the traditional topic modeling techniques to obtain hierarchical information on the topics. The technique of hLDA [19], based on the notion of the nCRP (nested Chinese restaurant process) [20], can simultaneously mine topics and construct the topic hierarchy by analyzing the relationships of topics without supervision. On the basis of hLDA, Mao et al. [21] propose a semi-supervised hierarchical topic model, called Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA), which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process. Wang et al. [22] also propose a semi-supervised hierarchical topic model, denoted constrained hierarchical latent Dirichlet allocation (constrained-hLDA), which aims to explore more reasonable topics in the data space by incorporating some automatically extracted constraints into the modeling process. Dai and Storkey [23] propose the sHDP (supervised hierarchical Dirichlet process), a nonparametric generative model for the joint distribution of a group of observations and a response


variable directly associated with that whole group. Chien [24] presents an HPYD (hierarchical Pitman-Yor-Dirichlet) process as the nonparametric prior to infer the predictive probabilities of the smoothed n-grams with the integrated topic information. Teh [25] proposes a new hierarchical Bayesian n-gram model of natural languages, which makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes, which produce power-law distributions more closely resembling those in natural languages.

A further challenge is that many classification tasks on short text, such as tweets, fail to achieve high accuracy due to data sparseness. Up to now, several works have tried to solve this problem by finding more effective word embedding models. Li et al. [26] present several tweet topic classification methods by exploiting different types of data: tweet text, tweet text plus entity knowledge base, word embeddings derived from tweet text, distributed representations of tweets, and topical word embeddings. A follow-up study by Ganguly et al. [27] focuses on the use of word embeddings for enhancing retrieval effectiveness. In particular, they construct a generalized language model. Enríquez et al. [28] show how a vector-based word representation obtained via word2vec helps to improve the results of a document classifier based on bags of words. They have also performed cross-domain experiments in which word2vec has shown much more stable behavior than bag-of-words models. Zhang et al. [29] propose a method for sentiment classification based on word2vec and SVMperf in order to obtain the semantic features. The experimental results show the superior performance of their method in sentiment classification.

In contrast to the studies mentioned above, the thLDA model proposed in this paper can mine tweets' topic hierarchy automatically from Twitter data while considering the semantic relationships between words in tweets and the social relationships between tweeters. The final hierarchy of topics has proven to be suitable for the multidimensional analysis of Twitter data.

III. PRELIMINARIES
A. TWITTER DATA
Twitter involves two entities, i.e., tweets and tweeters. Here, the term ''tweets'' refers to the content published by tweeters together with properties such as ''id,'' ''place'' and ''FavoriteCount,'' whereas ''tweeters'' have their own properties like ''uid,'' ''location'' and ''name'' and a set of behaviors including ''retweeting,'' ''mentioning'' and ''following.'' On the other hand, Twitter data can also be divided into two parts, i.e., the structured and unstructured parts. The structured data, such as ''id'' and ''location,'' do not require additional preprocessing for OLAP. However, the unstructured data, including text messages, emoticons, short links, etc., require special treatment for OLAP. In particular, the topics that tweeters discuss must be extracted as one of the dimensions which OLAP may employ to explore the Twitter data.

FIGURE 1. An example of social relationships among tweeters.

Figure 1 shows the social relationships between tweeters. In fact, tweeters possess abundant social behaviors, including following, mentioning and retweeting. These social behaviors are of great significance in mining the topics of tweeters. As shown in Figure 1, when the Institute of Physics sends a tweet, Tweeters 3 and 5, who follow it, will receive a notification, and may retweet the tweet if they are interested in it. Meanwhile, Tweeter 5 can also mention his friend, Tweeter 4, when sending or retweeting tweets.

B. APPLYING OLAP TO TWITTER DATA
Online analytical processing, or OLAP, provides an intuitive form that is suitable for exploring Twitter data from multiple dimensions. As shown in Figure 2, from the perspective of the conceptual model of OLAP, the fact table ''UserFact'' includes measures such as ''FriendsCount'' and ''FollowerCount,'' which can be obtained directly by attribute mapping from the tweeter entity. Similarly, the dimension tables ''UserDIM,'' ''LocationDIM'' and ''TimeDIM'' can be obtained by attribute mapping from the tweeter entity. However, the tweets' topics, or the tweeters' interests, are implicitly embedded in the tweets. Such topics or interests establish a dimension hierarchy for OLAP, which must be extracted from the Twitter data.

OLAP provides users with operations such as roll-up, drill-down, slicing and dicing, which can analyze Twitter data from multiple perspectives. The overall process of exploring Twitter data based on the OLAP technique can be described as follows (Figure 3):
• Data acquisition: Obtain tweeters' profiles, tweets and social relationships through the REST APIs provided by Twitter.
• Data preprocessing: Remove the stop words (the most common, short function words, such as the, is, at, which, and on) and the web links, and carry out a part-of-speech analysis to leave only nouns and verbs in the unstructured tweets.
• Text modeling: Identify the relationship between tweeters and tweets based on text modeling.


FIGURE 2. The galaxy schema for twitter data.

FIGURE 3. The overall process of exploring Twitter data based on the OLAP technique.

• Hierarchical topic modeling: Extract the topics (or interests) from the Twitter data, and construct the hierarchical topic dimension based on the probability distribution of various topics and subtopics.
• Data exploring: Analyze tweeters from multiple dimensions using OLAP.

Although the OLAP technology provides an intuitive inquiry form that is consistent with human custom, it can only handle structured data, and fails to deal with scenarios related to unstructured text data like tweets. Therefore, the key to applying the OLAP technology to Twitter data is to identify and construct the dimension hierarchy from the Twitter data automatically. However, this still remains a difficult problem. The main issue this paper tries to resolve can be described as follows: how to automatically mine and construct the hierarchical dimension of tweets' topics (or tweeters' interests) from the unstructured tweet data to achieve effective multidimensional analysis.

C. WORD2VEC
Word2vec [30], [31] is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. Word2vec was created by a team of researchers led by Tomas Mikolov at Google, and has been subsequently analyzed and explained by other researchers. Embedding vectors created using the word2vec algorithm have many advantages compared to earlier algorithms such as latent semantic analysis. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words; the order of the context words does not influence prediction (the bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.


FIGURE 4. The models CBOW and skip-gram [30].

Figure 4(a) shows the CBOW model, where wt represents the central word and wt±i represents the context words of wt. Figure 4(b) shows the skip-gram model, which weighs nearby context words more heavily than more distant context words. CBOW is considered to be faster than skip-gram, whereas skip-gram is more suitable for infrequent words.

During the calculation process of word2vec, we usually express the semantic correlation between words by calculating the cosine similarity of two word vectors. The greater the cosine similarity, the stronger the correlation between the two words. In addition, as the dimension increases, the model effectiveness tends to become steady. To ensure high efficiency and good effectiveness, we choose 300 as the number of dimensions of the word vector in our approach.

D. CRP AND HLDA
The current topic models can be employed to mine the tweets' topics from large quantities of Twitter data. As a classical topic model, the standard LDA model considers that each word in an article is obtained by the following process: choose a topic with a certain probability in the article, and choose a word from the chosen topic. In the framework of the LDA model, all words in all articles represent observable data, and the topics of articles are implicit random variables which can only be obtained through a process of several iterations of sampling.

However, one of the disadvantages of the standard LDA model is that we must specify the number of topics in advance in the modeling process. In fact, the number of topics is unknown in different articles, and a fixed topic number may cause malign effects on the modeling process. In addition, the standard LDA model is unable to analyze the relationships between topics. In other words, by leveraging standard LDA, we can only retrieve topics in one single layer rather than in a topic hierarchy.

Fortunately, a probability distribution model based on the partition of integers, the CRP (Chinese restaurant process) and its extension called the nCRP (nested Chinese restaurant process), can organize topics into a hierarchical structure, and allow the data to continue to change and accumulate, by creating a hierarchical division of the sampling process.

CRP is a discrete-time stochastic process, analogous to seating customers at tables in a Chinese restaurant. It assumes that there is a Chinese restaurant which owns an unlimited number of tables, each of which can take an infinite number of customers at the same time. All customers come into the restaurant and choose their own tables with a certain probability. Here, the customers are regarded as an infinite collection, i.e., customer = {m | 0 ≤ m ≤ NM}. Given that the first m − 1 customers have selected their tables, the collection of occupied tables is expressed as R = {rj | 0 < j < m}, and the corresponding number of customers at each table is N = {nj | 0 < j < m}. The probability that the next customer m chooses an occupied or unoccupied table is given by the following distributions:

P(occupied table rj | previous m − 1 customers, γ) = nj / (γ + m − 1)   (1)

P(unoccupied table | previous m − 1 customers, γ) = γ / (γ + m − 1)   (2)

Here, γ is the parameter which controls the probability of the customer selecting a new table.

The nCRP model is derived from CRP, and is a distribution over hierarchical partitions. The nCRP model can be illustrated by the following situation. Suppose that in a city there is an infinite number of Chinese restaurants, each of which has an infinite number of tables. The first restaurant is regarded as the root restaurant, and each table in this restaurant corresponds to a card which refers to another restaurant. In other words, each restaurant is associated with other restaurants. Consequently, all the restaurants can be organized into a tree with an infinite number of branches, while every level of the tree is associated with an infinite number of restaurants.

Consider a certain number of customers coming to the city for L days of holiday. On the first day, a customer comes into the root restaurant and chooses a table according to Equations (1) and (2). On the second day, he goes to the second restaurant, which is associated with the table he chose previously, and then chooses a table according to Equations (1) and (2). All customers choose restaurants in this way, repeatedly for L days. In other words, all customers follow a path which starts from the root restaurant and ends at level L. After all customers have finished their L-day holiday, the paths followed by the customers constitute a collection which can be regarded as an L-level tree. As an extension of CRP, nCRP can thus be applied to illustrate the uncertainty in the hierarchical structure (see Figure 5 for an example of such a tree).

FIGURE 5. The paths of four tourists through the infinite tree of Chinese restaurants (L = 3).

The hLDA model mines the topics in the same way as LDA, but applies nCRP to organize the topics into a hierarchical structure rather than a flat structure. During the modeling process of hLDA, a certain document first chooses a path which starts from the root node and ends on a leaf node by nCRP, then samples a topic at every node in the chosen path, and samples each word of the document from the chosen topics. In this way, hLDA obtains a hierarchical structure in which every node is related to a topic and each topic is regarded as a distribution over words, after a certain number of iterations. Consequently, a topic hierarchy is obtained, which contains the underlying relationships between topics and simultaneously reflects the universality and specificity of the words.
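The table-choice probabilities of Equations (1) and (2), nested over L levels to draw a path through the tree, can be sketched as follows. This is an illustrative simulation rather than the paper's implementation, and all function and variable names are assumptions.

```python
import random

def crp_choose_table(table_counts, gamma, rng):
    """CRP: customer m picks occupied table j with probability
    n_j / (gamma + m - 1), or a new table with probability
    gamma / (gamma + m - 1) -- Equations (1) and (2)."""
    m_minus_1 = sum(table_counts)              # customers already seated
    denom = gamma + m_minus_1
    weights = [n / denom for n in table_counts] + [gamma / denom]
    r, acc = rng.random(), 0.0
    for j, w in enumerate(weights):
        acc += w
        if r < acc:
            return j                           # j == len(table_counts) means "new table"
    return len(table_counts)

def ncrp_sample_path(tree, gamma, levels, rng):
    """Nested CRP: repeat the table choice for `levels` days, starting at the
    root restaurant. `tree` maps a path prefix to its per-child counts."""
    path = []
    for _ in range(levels):
        counts = tree.setdefault(tuple(path), [])
        j = crp_choose_table(counts, gamma, rng)
        if j == len(counts):
            counts.append(0)                   # open a new table/restaurant
        counts[j] += 1
        path.append(j)
    return path

rng = random.Random(7)
tree = {}
paths = [ncrp_sample_path(tree, gamma=1.0, levels=3, rng=rng) for _ in range(5)]
print(paths)  # each path has 3 levels; shared prefixes form the tree
```

The first customer always opens new tables, so the first sampled path is [0, 0, 0]; later customers reuse popular branches with probability proportional to their counts, which is exactly how hLDA grows its topic tree.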
FIGURE 6. The graphical description of thLDA.
Compared with the LDA model, hLDA generates the a priori distribution of Bayesian non-parametric models through nCRP. In addition, the number of topics generated by hLDA changes automatically according to changes in the corpus. Indeed, hLDA can adapt to the dynamic growth of the data set, and can distribute the topics over multiple abstraction levels. As a hierarchical topic model, hLDA is a purely data-driven approach that not only implements deep semantic analysis, but also identifies relationships between topics, namely, abstract and specific topics. In general, the topics that are close to the top are more abstract, whereas the topics that are close to the bottom are more specific. Consequently, the hierarchical organization of topics accords with human cognition of vocabulary and semantics.

IV. THLDA
A. OVERVIEW
In contrast with hLDA, thLDA integrates tweets and social relationships among tweeters into the modeling process. In addition, it considers semantic relationships between words in tweets. Figure 6 shows the Bayesian process of thLDA. During the modeling process, we first sample the path cm for each tweeter, and then sample zm,w, which denotes the topic allocation of each word associated with the level in the path.

FIGURE 6. The graphical description of thLDA.

TABLE 1. Symbols used throughout the paper.

Table 1 presents the symbols used throughout the paper. For simplicity, we do not distinguish between topics and interests in the Twitter data.

B. DATA PREPROCESSING
Before the actual topic modeling, we need to preprocess the text by transforming the disordered text into an easy-to-handle text-word matrix.

The traditional LDA and hLDA require documents with clear structure and rigorous style. Unfortunately, tweet texts tend to be short and simple. When a single tweet text is treated as a document that is modeled as an input to LDA or hLDA, we often cannot obtain good results. Therefore, in this paper, we treat all tweet texts of a Twitter user as the input document to thLDA.

As shown in Figure 7, we combine all the tweet data of the Twitter user Twitterm into a tweet document, and then obtain the tweet document collection TDC = {TweetDocm | m ∈ M}, in which TweetDocm = {wm,1, wm,2, . . . , wm,n}.
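The preprocessing and per-user aggregation described above can be sketched as follows. The stop-word list is a small illustrative subset, part-of-speech filtering is omitted (it would require a tagger such as NLTK's), and all names here are assumptions rather than the paper's actual pipeline.

```python
import re

# Assumed minimal stop-word list; the paper removes common function words,
# strips web links, and additionally keeps only nouns and verbs.
STOP_WORDS = {"the", "is", "at", "which", "and", "on", "a", "to", "of"}

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())  # drop short links
    words = re.findall(r"[a-z]+", tweet)
    return [w for w in words if w not in STOP_WORDS]

def build_tweet_document(tweets):
    """Combine all tweets of one user into a single document TweetDoc_m."""
    doc = []
    for t in tweets:
        doc.extend(preprocess(t))
    return doc

user_tweets = [
    "The Institute of Physics is at http://example.com",
    "New paper on quantum computing and topic models",
]
print(build_tweet_document(user_tweets))
# → ['institute', 'physics', 'new', 'paper', 'quantum', 'computing', 'topic', 'models']
```

Concatenating each user's tweets into one document mitigates the short-text sparseness problem noted above before the document is handed to thLDA.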


FIGURE 7. Schematic diagram of text modeling for Twitter data.

C. THE THLDA MODEL
In the thLDA model, we need to sample two parameters via Gibbs sampling: the path cm of each tweet document TweetDocm in the topic tree, and the topic number zm,w of all words in the tweet document collection. The joint probability distribution of the path cm and the topic number zm,w is shown in Equation (3):

P(cm, zm,w | α, β, γ, Y, Y′, H, Wm) = P(cm | Wm, c−m, z, γ, β, Y, H) × P(zm,w | z−(m,w), Wm, Y′, α, β)   (3)

Equation (3) describes the joint probability distribution between the observable words of a tweeter and the latent topics in the thLDA model, in which γ is the hyperparameter of the nCRP model, and α and β are hyperparameters defined in the topic sampling process. During the path sampling process, γ is used to control the probability of each tweet document's path selection; during the topic sampling process, β is used to control the probability of selecting a topic for each word. Together they determine the size of the topic tree. If γ is larger and β is smaller, more topics will be obtained, and a larger topic tree will eventually be generated. A smaller β value will result in fewer high-probability words for each topic, and more topics to describe the data. On the other hand, a larger γ will lead to a higher probability that the tweet document will select a new path.

We use a special MCMC (Markov chain Monte Carlo) method to infer the posterior probability distribution of thLDA. Since it accepts all the data in the sample space, its acceptance rate reaches 100%. The MCMC method needs to estimate multiple latent variables, but it considers only a single latent variable in each sampling step and treats the remaining variables as observable. When sampling the path cm and the topic number zm,w, the main work is as follows:
(1) According to P(cm | Wm, z, c−m, γ, β, Y, H), randomly sample the state of cm at the next moment, c′m.
(2) According to P(zm,w | z−(m,w), Wm, Y′, α, β), randomly sample the state of zm,w at the next moment, z′m,w.

In the following two sections, we detail the derivation of path sampling and topic sampling, respectively.

D. PATH SAMPLING
The distribution of the path cm conditioned on all observed words is expressed as follows:

P(cm | W, c−m, z, γ, β, Y, H) = P(cm | c−m, γ, Y, H) × P(Wm | c, W−m, z, β)   (4)

According to Equation (4), two factors affect the probability that a tweeter selects a certain path. The first factor, P(cm | c−m, γ, Y, H), is implied by the nCRP model, with the extra consideration of social relationships and the semantic impact of words.

The generation process of the nCRP model is described as follows:
(1) When the node kl,j of the topic tree has been selected, the probability that we choose a non-empty node kl+1,j′ is defined as follows:

P(cm | c−m, γ, Ykl,kl+1, H) = N(kl+1,j′) / (γ + m − 1) × Ykl,j,kl+1,j′ × H   (5)

(2) When the node kl,j of the topic tree has been selected, the probability that we choose an empty node kl+1,j′ is defined as follows:

P(cm | c−m, γ) = γ / (γ + m − 1)   (6)

Equations (5) and (6) describe the probability distribution of the tweet document TweetDocm when selecting the next layer of nodes in the topic tree, where N(kl+1,j′) represents the number of tweet documents selecting node kl+1,j′. Each node in the topic tree consists primarily of two pieces of data: the topic, and the tweet documents which select the node. In order to make full use of these two parts of data, we define H and Ykl,kl+1 in the equations, as explained in the following.

During the process of sampling a path, a tweeter at a given level will choose an index which is related to a node at the next level. As we know, each node at each level is associated with a topic. We hold the view that the semantic similarity of the topics affects the nCRP process. Therefore, Ykl,kl+1 is introduced to indicate the semantic impact between two topics (or nodes) kl and kl+1. The higher the value of Ykl,kl+1, the higher the probability that the topics kl and kl+1 will be assigned to the sampled path.

To calculate Ykl,kl+1, we extract the top n words of a topic as Qkl = {qkl,i | 1 ≤ i ≤ n}. We use Fkl = {fkl,i | 1 ≤ i ≤ n} to represent the collection of their frequencies, where each item gives the number of occurrences of the corresponding word. Thus, Ykl,kl+1 is calculated as follows:

Ykl,kl+1 = [ Σi=1..n fkl,i × ( Σj=1..n fkl+1,j × sim(qkl,i, qkl+1,j) / Σj=1..n fkl+1,j ) ] / Σi=1..n fkl,i   (7)

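To make the path-sampling step concrete, the following sketch scores the children of the current node according to Equations (5) and (6) and computes the topic-to-topic semantic impact of Equation (7). This is an illustrative reading, not the authors' code: the function names are our own, `sim` stands in for the word2vec similarity of Equation (8), and the scalar `h` stands in for the social impact factor H.

```python
import random

def semantic_impact(freqs_l, freqs_l1, words_l, words_l1, sim):
    """Equation (7): frequency-weighted average similarity between the
    top-n words of topic k_l and those of a candidate child k_{l+1}."""
    total_l1 = sum(freqs_l1)
    weighted = 0.0
    for qi, fi in zip(words_l, freqs_l):
        # Inner term: weighted mean similarity of q_{k_l,i} to the child's words.
        avg = sum(fj * sim(qi, qj) for qj, fj in zip(words_l1, freqs_l1)) / total_l1
        weighted += fi * avg
    return weighted / sum(freqs_l)

def child_scores(child_counts, y_values, h, gamma, m):
    """Unnormalized child probabilities for one tweet document:
    Eq. (5) for each occupied child, Eq. (6) for a new (empty) child."""
    scores = [n / (gamma + m - 1) * y * h          # Eq. (5)
              for n, y in zip(child_counts, y_values)]
    scores.append(gamma / (gamma + m - 1))         # Eq. (6): open a new node
    return scores

def sample_child(child_counts, y_values, h, gamma, m, rng=random):
    """Draw the next node on the path; the last index means 'new node'."""
    scores = child_scores(child_counts, y_values, h, gamma, m)
    return rng.choices(range(len(scores)), weights=scores)[0]
```

Note that a larger γ inflates the Eq. (6) score, which matches the observation above that a larger γ makes a tweet document more likely to select a new path.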
VOLUME 7, 2019 12379


D. Yu et al.: Hierarchical Topic Modeling of Twitter Data for OLAP

Here, to calculate sim(q_{k_l,i}, q_{k_{l+1},j}), we employ word2vec, an efficient tool for training words as x-dimensional vectors. Supposing there are two words w_1 and w_2, we can obtain the similarity between w_1 and w_2 using the following expression:

  Sim(w_1, w_2) = cos(V_1, V_2) = Σ_{i=1}^{x} (V_{1,i} × V_{2,i}) / ( √(Σ_{i=1}^{x} V_{1,i}²) × √(Σ_{i=1}^{x} V_{2,i}²) )    (8)

Here, V_1 and V_2 are the vectors of w_1 and w_2 obtained using word2vec, and x is the number of dimensions.

Further, we introduce H_{m,k}, or the social impact, to represent the degree to which social relationships affect tweeter m in choosing topic k. Suppose S_m = {u_1, u_2, u_3, ..., u_{N_m}} represents the social list of tweeter m, where u_i represents the i-th tweeter in the social list S_m and N_m represents the number of tweeters in the list. The social impact is calculated using the following equation, where P_{u_i,k} represents the probability that tweeter u_i selected topic k in the previous iteration:

  H_{m,k} = Σ_{i=1}^{N_m} P_{u_i,k} / N_m    (9)

On the other hand, the second factor of Equation (4), the probability distribution P(W_m | c, W_{−m}, z, β), represents the probability of obtaining the words of tweeter m with a certain choice of path, which can be calculated as follows:

  P(W_m | c, W_{−m}, z, β) = Π_{l=1}^{L} [ Γ( Σ_{w∈W} (n^w_{c_{m,l},−m} + β) ) / Π_{w∈W} Γ( n^w_{c_{m,l},−m} + β ) ] × [ Π_{w∈W} Γ( n^w_{c_{m,l},−m} + n^w_{c_{m,l},m} + β ) / Γ( Σ_{w∈W} (n^w_{c_{m,l},−m} + n^w_{c_{m,l},m} + β) ) ]    (10)

where n^w_{c_{m,l},−m} represents the number of instances of word w assigned to node c_{m,l}, excluding those in the tweet document TweetDoc_m.

E. TOPICS SAMPLING
After path sampling, we sample the words of each tweeter, i.e., we allocate a topic, or a level of the topic tree, to each word. The joint probability of the whole corpus of tweets is calculated as follows:

  P(z_{m,w} | z_{−(m,w)}, W_m, Y′, α, β) = p(W_m | z, β, Y′) × p(z | α) = p(W_m, z | α, β, Y′)    (11)

Here, we need to utilize collapsed Gibbs sampling to sample the variables W_m and z. The main sampling steps are described as follows:

(1) Initialization. We assign a topic to each word according to the multinomial distribution.
(2) Sampling. For each word, we utilize collapsed Gibbs sampling to assign a topic according to the semantic relationship between the word and the topic and the Dirichlet distribution between the word and the topic.
(3) Iteration. Repeat Step (2) until the result converges to a steady value.

We assume that N_M tweeters are associated with N_M independent Dirichlet-multinomial conjugated structures, and N_K topics are likewise associated with N_K independent Dirichlet-multinomial conjugated structures. The main process of assigning a topic to each tweet word of tweeter m is presented as follows:

(1) α → θ_m → z: When generating the tweets of tweeter m, we first obtain θ_m, the probability distribution of topics over tweeter m, according to the hyperparameter α. Afterwards, we generate z_m, the collection of z_{m,w} for all words of tweeter m. Here, α → θ_m is associated with a Dirichlet process, and θ_m → z is associated with a multinomial distribution. On the whole, α → θ_m → z is associated with a Dirichlet-multinomial conjugated structure.

(2) β, Y′_{k,w}, z_m → ϕ_k → W_m: Given z_m, we first obtain ϕ_k, the probability distribution of words over topic k, according to the hyperparameter β and the semantic impact between topic k and word w. Afterwards, we generate W_m, the collection of all words of tweeter m. Here, β, Y′_{k,w}, z_m → ϕ_k is associated with a Dirichlet process, and ϕ_k → W_m is associated with a multinomial distribution. As a whole, β, Y′_{k,w}, z_m → ϕ_k → W_m is associated with a Dirichlet-multinomial conjugated structure.

We obtain the probability distribution of topics as follows, where Δ(·) denotes the Dirichlet normalization factor:

  p(z | α) = ∫ p(z | θ) × p(θ | α) dθ = Π_{m∈M} Δ(v_m + α) / Δ(α)
           = Π_{m∈M} [ Γ( Σ_{k∈K} α_k ) / Π_{k∈K} Γ(α_k) ] × [ Π_{k∈K} Γ(v_{m,k} + α_k) / Γ( Σ_{k∈K} (v_{m,k} + α_k) ) ]    (12)

Furthermore, the probability distribution of words is obtained as follows:

  p(W_m | β, Y′_{k,w}, z) = ∫ p(W_m | z, Y′_{k,w}, ϕ) × p(ϕ | β) dϕ = Π_{k∈K} Δ( Y′_{k,w} (v_k + β) ) / Δ(β)
           = Π_{k∈K} [ Γ( Σ_{w∈W_m} β_w ) / Π_{w∈W_m} Γ(β_w) ] × [ Π_{w∈W_m} Γ( Y′_{k,w} × (v_{k,w} + β_w) ) / Γ( Σ_{w∈W_m} Y′_{k,w} (v_{k,w} + β_w) ) ]    (13)

We hold the view that the semantic similarity of the words and topics influences the topic sampling process. The higher the semantic similarity of a word and a topic, the greater the probability that the word will be assigned to the topic. We use Y′_{k,v} to represent the degree of word-topic semantic impact of word v belonging to topic k. To calculate the word-topic semantic impact, we pick out the top n words which belong to topic k to constitute a collection

Q_k = {q_{k,i} | 1 ≤ i ≤ n}. The word-topic semantic impact can thus be obtained as follows:

  Y′_{k,w} = Σ_{i=1}^{n} ( f_{k,i} × Sim(w, q_{k,i}) ) / Σ_{i=1}^{n} f_{k,i}    (14)

Here, f_{k,i} denotes the frequency of occurrence of word i with respect to topic k.

According to Equations (12) and (13), we obtain the joint probability distribution of W and z as follows:

  p(W_m, z | α, β, Y′_{k,w}) = Π_{k∈K} Δ( Y′_{k,w} (v_k + β) ) / Δ(β) × Π_{m∈M} Δ(v_m + α) / Δ(α)    (15)

According to the Gibbs sampling method, we iterate over Equation (15) and sample the topics of all words until the sampling result becomes stable. Finally, we obtain the document-topic probability distribution θ_m of the tweets and the topic-word probability distribution ϕ_k of the tweets. The results are as follows:

  θ_m = (v_{m,k} + α) / (v_{m,·} + Kα)    (16)

  ϕ_k = Y′_{k,w} (v_{k,w} + β) / ( Y′_{k,·} (v_{k,·} + Vβ) )    (17)

Combining c_m, θ_m, and ϕ_k, we know the distribution of the various topics of TweetDoc along the path c, and the probability distribution of the various words of TweetDoc within each topic. In this way, we obtain a complete topic tree. Algorithm 1 describes the formal modeling process for thLDA.

Algorithm 1 Formalized Modeling Process of thLDA
Input: TDC - the set of Twitter documents;
       α, β, γ - hyperparameters;
       L - the height of the topic tree;
       I - the iteration number of Gibbs sampling;
Output: TopicTree;
 1: // Associate a topic with each node based on a Dirichlet distribution
 2: for each t ∈ TopicTree do
 3:     draw a Dirichlet process ϕ ∼ Dir(β);
 4: end for
 5: // Generate a path for TweetDoc_m based on nCRP
 6: for each TweetDoc_m ∈ TDC do
 7:     let c_1 be the root node;
 8:     for each level l ∈ 1, 2, ..., L do
 9:         draw the current level for each Tweet_{m,s};
10:         draw an occupied path c_l using Eq. (5);
11:         draw an unoccupied path c_l using Eq. (6);
12:     end for
13:     obtain c_m;
14:     draw an L-dim. topic proportion vector θ_m from Dir(α);
15:     for i = 1 to I do
16:         for each word w ∈ W do
17:             draw topic z ∈ 1, 2, ..., L from Mult(θ);
18:             draw w from topic z;
19:         end for
20:     end for
21: end for
22: return TopicTree;
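As a sketch of the estimation step (the function names and toy vectors below are our own, not the authors'), the word-topic impact of Equation (14), the cosine similarity of Equation (8) it relies on, and the point estimates of Equations (16) and (17) reduce to a few lines:

```python
import math

def cosine(v1, v2):
    """Equation (8): cosine similarity of two word2vec vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) *
                  math.sqrt(sum(b * b for b in v2)))

def word_topic_impact(w_vec, top_words, freqs, vectors):
    """Equation (14): Y'_{k,w}, the frequency-weighted similarity of
    word w to the top-n words of topic k."""
    num = sum(f * cosine(w_vec, vectors[q]) for q, f in zip(top_words, freqs))
    return num / sum(freqs)

def theta(v_mk, v_m_total, alpha, K):
    """Equation (16): document-topic estimate from the Gibbs counts."""
    return (v_mk + alpha) / (v_m_total + K * alpha)

def phi(y_kw, v_kw, y_k_total, v_k_total, beta, V):
    """Equation (17): topic-word estimate, scaled by the semantic
    impact Y' so that semantically close words gain probability."""
    return y_kw * (v_kw + beta) / (y_k_total * (v_k_total + V * beta))
```

With Y′ fixed to 1 everywhere, `phi` falls back to the standard smoothed LDA estimate, which is the design choice that distinguishes thLDA's topic-word step.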

V. EXPERIMENT
A. DATA AND ENVIRONMENT
To verify the effectiveness and efficiency of our model, we conducted extensive experiments on large quantities of Twitter data collected through the Twitter REST API. We first chose the 15 Twitter users with the largest amount of attention as the seeds and then obtained all tweeters who followed the seeds, retrieving their profiles, tweets, and social relationships (including following lists and followed lists). The number of tweets reached a total of 21,213,000. Subsequently, we removed short tweets of fewer than 6 words, because we think such tweets generally have no clear semantics. In addition, we also removed duplicated tweets. Finally, we obtained 10,160,317 tweets from 6,907 tweeters. Figure 8 shows the distribution of tweeters and tweets. Due to the limitations of the Twitter REST API, we could acquire at most 3,200 tweets for each tweeter. The experimental data and results are published on the website for reference (http://dbsi.hdu.edu.cn/twitter_data/).

The word2vec model we employed in our paper was downloaded from https://code.google.com/archive/p/word2vec/. This repository hosts the word2vec model (three million 300-dimensional English word vectors) trained on the Google News dataset. The experiments were executed on a computer with eight E5-2620 2.10 GHz cores, 16 GB of memory, and Windows 7.

FIGURE 8. The distribution of tweeters over different numbers of tweets.

B. EVALUATING EFFECTIVENESS BASED ON PMI
We used the PMI-score (Pointwise Mutual Information score) to evaluate the effectiveness of our model. To check whether a topic was reasonable, we judged the number of odd words which were irrelevant to the specific topic. We calculated the PMI values for pairs of the top 20 frequent words relevant to each topic. The larger the PMI value between two words, the stronger the relationship between them. If two words are completely unrelated, their PMI value is set to zero.

TABLE 2. The perplexity of thLDA, LDA, and hLDA over different heights of topics with different numbers of iterations.

We set the PMI-score of topic k to the median value of all the PMI values of its word pairs, as shown in Equation (18):

  PMI-SCORE_k = median{ PMI(w_{ki}, w_{kj}) },  i, j ∈ [1, 20]    (18)

in which

  PMI(w_{ki}, w_{kj}) = log [ p(w_{ki}, w_{kj}) / ( p(w_{ki}) p(w_{kj}) ) ]    (19)

As we know, when applying LDA the number of topics must be assigned in advance. However, the number of topics can be determined during the modeling process when applying either our model or hLDA. To ensure a fair comparison, we first conducted the experiments on thLDA and obtained the specific topic number for different heights of the topic trees, and based on these we then ran the experiments on LDA. The relation between the heights of topic trees (used by thLDA) and the corresponding topic numbers (used by LDA) is shown in Table 3.

TABLE 3. The height of topic tree and its corresponding topic number.

FIGURE 9. The comparison of PMI-score of thLDA, hLDA and LDA over different heights and topic numbers.

As Figure 9 shows, the PMI score of our model is slightly higher than those of the other two models for a height value of two, and slightly lower than those of the other two models for a height value of three. When the height is too small, the corresponding number of topics is small and unrelated words are assigned to the same topic; consequently, the PMI score of our model is similar to those of the other two models at height values of two and three. However, our model outperforms the other two models for height values of four, five, and six.

C. EVALUATING EFFECTIVENESS BASED ON PERPLEXITY
As a conventional evaluation index of topic models, perplexity is normally used to evaluate the ability of a topic model to generate texts. For a set of tweets, a lower perplexity denotes better effectiveness of the topic model and a stronger ability to predict texts. For a set of tweets D, the perplexity is calculated as follows:

  P(D) = exp( − Σ_{m=1}^{M} log p(w_m) / Σ_{m=1}^{M} N_m )    (20)

in which w_m denotes the words of tweeter m, N_m denotes the number of words of tweeter m, and M denotes the number of tweets in the set D.

FIGURE 10. The perplexity of thLDA over different heights with different numbers of iterations.

Figure 10 shows that the perplexities of thLDA over all heights decrease with an increasing number of iterations and eventually converge to steady values. Table 2 compares the perplexities of thLDA with those of LDA and hLDA. It is clearly seen that in all cases the perplexities of thLDA are lower than those of LDA when the modeling process becomes stable. Similarly, in most cases (H = 3, 4, 6), the perplexities of thLDA are clearly lower than those of hLDA when the process becomes stable. Overall, the experiment demonstrates that thLDA outperforms LDA and hLDA as far as perplexity is concerned.
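Both evaluation measures can be sketched directly from Equations (18)-(20); the helper names below are ours, and the inputs are assumed to be precomputed word-pair probabilities and per-tweeter log-likelihoods:

```python
import math
from statistics import median

def pmi(p_joint, p_i, p_j):
    """Equation (19): pointwise mutual information of a word pair."""
    return math.log(p_joint / (p_i * p_j))

def pmi_score(pair_pmis):
    """Equation (18): the topic's PMI-score is the median over the PMI
    values of its top-word pairs."""
    return median(pair_pmis)

def perplexity(log_probs, word_counts):
    """Equation (20): exp of the negative total log-likelihood per word,
    over the tweet set D."""
    return math.exp(-sum(log_probs) / sum(word_counts))
```

For example, a pair of independent words (p_joint = p_i × p_j) gets a PMI of exactly zero, which is why such pairs contribute nothing to a topic's score.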

TABLE 4. Part of distribution of topic numbers at different levels over different heights for thLDA.

D. OVERALL EFFECT
Table 4 shows part of the word distribution of the discovered topics and the hierarchical relationships between them over different levels when the height is set to four.

One advantage of applying OLAP to Twitter data is that we can conduct multi-dimensional analysis using operations such as rolling up and drilling down. As shown in Figure 11, with regard to topic-1 of level-2, when we drill down into it, the distributions of the tweets' topics are different in different cities. However, in all cases, topic-1 of level-3, which may be described as "sports" in accordance with the hot words given in Table 4, attracts the most attention.

FIGURE 11. The distribution of child-topics of topic-1 of level-2 over different cities.

FIGURE 12. The distribution of child-topics of topic-1 of level-3 over different cities.

Figure 12 shows the results when we drill down into topic-1 of level-3, whereas Figure 13 shows the results when aggregating the number of tweeters by rolling up the "location" dimension from city to country. It indicates that in most cases tweeters in the "USA" are more active than tweeters in other countries.
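A minimal sketch of the roll-up described above, assuming a hypothetical city-to-country mapping and per-(city, topic) tweeter counts (the mapping and names are illustrative, not from the paper's dataset):

```python
from collections import defaultdict

# Hypothetical mapping for the "location" dimension hierarchy.
CITY_COUNTRY = {"New York": "USA", "Chicago": "USA", "London": "UK"}

def roll_up(city_topic_counts):
    """Roll the 'location' dimension up from city to country by summing
    tweeter counts per topic, as in the Figure 13 aggregation."""
    country_topic = defaultdict(int)
    for (city, topic), count in city_topic_counts.items():
        country_topic[(CITY_COUNTRY[city], topic)] += count
    return dict(country_topic)
```

The inverse operation, drilling down, simply regroups the same fact table by the finer (city, topic) key instead.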

FIGURE 13. The distribution of child-topics of topic-1 of level-3 over different countries.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, we put forward a novel hierarchical topic model, thLDA, which is applied to mine the dimension hierarchy of tweets' topics from a large quantity of unstructured Twitter data. We conducted extensive experiments on real Twitter data to evaluate the effectiveness of thLDA. The results show that thLDA has a better recognition effect than the other models.

When considering how social relationships impact the hierarchical topic model, we focus only on direct social relationships and ignore indirect ones. Furthermore, we ignore cases where two unrelated tweeters follow the same tweeters. In the future, we will analyze indirect social relationships among tweeters to enhance our current model. In addition, to improve the model's effectiveness, we will consider taking advantage of bicliques to calculate the semantic impact of the topics of two tweets. Last but not least, we will study how the social impact factor and word semantic similarity influence the experimental results separately, and whether it is possible to improve the model using hashtags.


DONGJIN YU is currently a Professor with Hangzhou Dianzi University, China, where he is also the Director of the Institute of Big Data and the Institute of Computer Software. His research efforts include big data, business process management, and software engineering. He is a member of IEEE, a member of ACM, and a Senior Member of the China Computer Federation (CCF). He is also a member of the Technical Committee of Software Engineering, CCF, and the Technical Committee of Service Computing, CCF.

DENGWEI XU is currently pursuing the degree with Hangzhou Dianzi University, China. His research interests include machine learning and information retrieval.

DONGJING WANG received the B.S. and Ph.D. degrees in computer science from Zhejiang University, Hangzhou, China, in 2012 and 2018, respectively. He is currently a Lecturer with Hangzhou Dianzi University, China. His current research interests include recommender systems, machine learning, and business process management.

ZHIYONG NI received the bachelor's and master's degrees in computer science from Hangzhou Dianzi University, China. He has participated in several government-funded projects related to data mining. His current research interests mainly include online analytical processing and information retrieval.