2017 Phrase Mining From Massive Text and Its Applications
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other, except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00759ED1V01Y201702DMK013
Lecture #13
Series Editors: Jiawei Han, University of Illinois at Urbana-Champaign
Lise Getoor, University of California, Santa Cruz
Wei Wang, University of California, Los Angeles
Johannes Gehrke, Cornell University
Robert Grossman, University of Chicago
Series ISSN
Print 2151-0067 Electronic 2151-0075
Phrase Mining from
Massive Text and Its Applications
Jialu Liu
Google
Jingbo Shang
University of Illinois at Urbana-Champaign
Jiawei Han
University of Illinois at Urbana-Champaign
Morgan & Claypool Publishers
ABSTRACT
A lot of digital ink has been spilled on “big data” over the past few years. Most of this surge owes its
origin to the various types of unstructured data in the wild, among which the proliferation of text-
heavy data is particularly overwhelming, attributed to the daily use of web documents, business
reviews, news, social posts, etc., by so many people worldwide. A core challenge presents itself:
How can one efficiently and effectively turn massive, unstructured text into a structured representation, so as to lay the foundation for many downstream text mining applications?
In this book, we investigate one promising paradigm for representing unstructured text:
automatically identifying high-quality phrases from innumerable documents. In
contrast to a list of frequent n-grams without proper filtering, users are often more interested
in results based on variable-length phrases with certain semantics such as scientific concepts,
organizations, slogans, and so on. We propose new principles and powerful methodologies to
achieve this goal, from the scenario where a user can provide meaningful guidance to a fully
automated setting through distant learning. This book also introduces applications enabled by
the mined phrases and points out some promising research directions.
KEYWORDS
phrase mining, phrase quality, phrasal segmentation, distant supervision, text min-
ing, real-world applications, efficient and scalable algorithms
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What is Phrase Mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Acknowledgments
The authors would like to acknowledge Xiang Ren, Fangbo Tao, and Huan Gui for their contribution to Chapter 4.
The research was supported in part by the U.S. Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation grants IIS-1320617 and IIS 16-18481, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The views and conclusions contained in our research publications are those of the authors and should not be interpreted as representing any funding agencies.
CHAPTER 1
Introduction
1.1 MOTIVATION
The past decade has witnessed a surge of interest in data mining, broadly construed as discovering knowledge from all kinds of data, be it in academia, industry, or daily life. The information explosion has brought the "big data" era into the spotlight. This overwhelming tide of information is largely composed of unstructured data such as images, speech, and videos. It is easy to distinguish them from typical structured data (e.g., relational data) in that the latter can be readily stored in fielded form in databases. Among the various kinds of unstructured data, a particularly prominent category comes in the form of text. Examples include news articles, social media messages, as well as web pages and query logs.
In the text mining literature, one fundamental problem in analyzing text is how to effectively represent it and model its topics, not only from the perspective of algorithm performance but also so that analysts can better interpret and present the results. A common approach is to use n-grams, i.e., contiguous sequences of n unigrams, as the basic units. Figure 1.1 shows an example sequence with the corresponding 1-gram, 2-gram, 3-gram, and consolidated representations. However, such a representation raises concerns about exponential growth of the dictionary as well as a lack of interpretability. One can reasonably expect an intelligent method that uses only a compact subset of n-grams yet generates an explainable representation of a given document.
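To make the representation concrete, here is a minimal Python sketch (the tokenizer and the example sentence are ours, purely for illustration) that collects a bag of n-grams:

```python
from collections import Counter

def bag_of_ngrams(tokens, max_n=3):
    """Collect all contiguous n-grams (1 <= n <= max_n) with their counts."""
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

tokens = "we study data mining and machine learning".split()
print(bag_of_ngrams(tokens, max_n=2).most_common(5))
```

Even for this short sentence, the dictionary already contains more 2-grams than words; on a large corpus, keeping all n-grams quickly becomes untenable, which motivates restricting attention to a compact, meaningful subset.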
Figure 1.1: Bag-of-n-grams representation of an example sentence.
Along this line of thought, in this book, we formulate such an explainable n-gram subset as
quality phrases (e.g., scientific terms such as “data mining” and “machine learning” outlined in the
figure) and phrase mining as the corresponding knowledge discovery process.
2 1. INTRODUCTION
Phrase mining has been studied in different communities. The natural language processing (NLP) community refers to it as "automatic term recognition" (i.e., extracting technical terms with the use of computers). The information retrieval (IR) community studies this topic to select the main concepts in a corpus in an effort to improve search engines. Among existing works published by these two communities, linguistic processors with heuristic rules are primarily used, and the most common approach is based on noun phrases. Supervised noun phrase chunking techniques have been proposed to leverage annotated documents to learn these rules. Other methods may utilize more sophisticated NLP features, such as dependency parsing, to further enhance precision. However, emerging textual data, such as social media messages, can deviate from rigorous language rules. The use of heavily (pre-)trained linguistic processing makes these approaches difficult to generalize.
In this regard, we believe that the community would welcome and benefit from a set of data-driven algorithms that work robustly on large-scale datasets involving irregular textual data, while minimizing the human labeling cost. We are also convinced by various studies and experiments that our proposed methods embody enough novelty and contribution to add solid building blocks for various text-related tasks, including document indexing, keyphrase extraction, topic modeling, knowledge base construction, and so on.
Problem 1.1 Phrase Mining Given a large document corpus $C$—which can be any collection of textual word sequences of arbitrary length, such as articles, titles, and queries—phrase mining tries to assign a value between 0 and 1 to indicate the quality of each phrase mentioned in $C$, and discovers a set of quality phrases $\mathcal{K} = \{K_1, \ldots, K_M\}$ whose quality scores are greater than 0.5. It also seeks to provide a segmenter for locating quality phrase mentions in any unseen text snippet.
Definition 1.2 Quality Phrase. A quality phrase is a sequence of words that appears contiguously in the corpus and serves as a complete (non-composable) semantic unit in certain contexts in the given documents.
There is no universally accepted definition of phrase quality. However, it is useful to quantify phrase quality based on certain criteria, as outlined below:
• Popularity: Quality phrases should occur with sufficient frequency in the given document
collection.
• Concordance: Concordance refers to the collocation of tokens with a frequency significantly higher than what is expected due to chance. A commonly used example of phraseological concordance is the pair of phrases "strong tea" and "powerful tea." One might assume that the two phrases appear with similar frequency, yet in English the phrase "strong tea" is considered more proper and appears with much higher frequency. Because a concordant phrase's frequency deviates from what is expected, we consider its tokens to belong to a whole semantic unit.
• Informativeness: A phrase is informative if it is indicative of a specific topic or concept.
The phrase "this paper" is popular and concordant, but is not considered informative in a bibliographic corpus.
• Completeness: Long frequent phrases and their subsequences may all satisfy the above three criteria, but apparently not all of them qualify. A quality phrase should be interpretable as a complete semantic unit in certain contexts. The phrase "vector machine" is not considered complete, as it mostly appears with the prefix word "support."
Because single-word phrases cannot be decomposed into multiple tokens, the concordance criterion is no longer definable for them. As an alternative, we propose the independence criterion, which we introduce in more detail in Chapter 3.
CHAPTER 2
Quality Phrase Mining with User Guidance
2.1 OVERVIEW
Identifying quality phrases has gained increasing attention due to its value in handling increasingly massive text datasets. The natural language processing (NLP) community originated this line of work with extensive studies known as automatic term recognition [Frantzi et al., 2000, Park et al., 2002, Zhang et al., 2008], the task of extracting technical terms with the use of computers. The topic has also attracted attention in the information retrieval (IR) community, since appropriate indexing term selection is critical to improving a search engine: the ideal indexing units should represent the main concepts in a corpus, beyond the bag-of-words.
Linguistic processors are commonly used to filter out stop words and restrict candidate terms to noun phrases. With pre-defined part-of-speech (POS) rules, one can generate noun phrase term candidates from each POS-tagged document. Supervised noun phrase chunking techniques [Chen and Chen, 1994, Punyakanok and Roth, 2001, Xun et al., 2000] leverage annotated documents to automatically learn these rules. Other methods may utilize more sophisticated NLP features, such as dependency parsing, to further enhance the precision [Koo et al., 2008, McDonald et al., 2005]. With candidate terms collected, the next step is to leverage certain statistical measures derived from the corpus to estimate phrase quality. Some methods further resort to a reference corpus for the calibration of "termhood" [Zhang et al., 2008]. The various kinds of linguistic processing, domain-dependent language rules, and expensive human labeling make it challenging to apply these phrase mining techniques to emerging big and unrestricted corpora, which may encompass many different domains and topics, such as query logs, social media messages, and textual transaction records. Therefore, researchers have sought more general data-driven approaches, primarily based on the frequent pattern mining principle [Ahonen, 1999, Simitsis et al., 2008]. Early work focused on efficiently retrieving recurring word sequences, but many such sequences do not form meaningful phrases. More recent work filters or ranks them according to
frequency-based statistics. However, the raw frequency from the data tends to produce mislead-
ing quality assessment, and the outcome is unsatisfactory, as the following example demonstrates.
Example 2.1 Raw Frequency-based Phrase Mining Consider a set of scientific publications
and the raw frequency counts of two phrases “relational database system” and “support vector
machine" and their subsequences in the frequency column of Table 2.1. The numbers are hypothetical but manifest several key observations: (i) the frequency generally decreases with the phrase length; (ii) both good and bad phrases can possess high frequency (e.g., "support vector" and "vector machine"); and (iii) the frequency of one sequence (e.g., "relational database system") and its subsequences can be on a similar scale to that of another sequence (e.g., "support vector machine") and its counterparts.
Obviously, a method that ranks the word sequences solely according to the frequency will
output many false phrases such as “vector machine.” In order to address this problem, differ-
ent heuristics have been proposed based on comparison of a sequence’s frequency and its sub- (or
super-) sequences, assuming that a good phrase should have high enough (normalized) frequency
compared with its sub-sequences and/or super-sequences [Danilevsky et al., 2014, Parameswaran
et al., 2010]. However, such heuristics can hardly differentiate the quality of, e.g., “support vec-
tor” and “vector machine” because their frequencies are so close. Finally, even if the heuristics can
indeed draw a line between "support vector" and "vector machine" by discriminating their frequencies (between 160 and 150), the same separation could fail for another case like "relational
database” and “database system.”
Using the frequencies in Table 2.1, all heuristics will produce identical predictions for "relational database" and "vector machine," guaranteeing that one of them is wrong. This example
suggests the intrinsic limitations of using raw frequency counts, especially in judging whether a
sequence is too long (longer than a minimum semantic unit), too short (broken and not informa-
tive), or right in length. It is a critical bottleneck for all frequency-based quality assessment.
Example 2.2 Rectification Consider the following occurrences of the six multi-word sequences
listed in Table 2.1.
6. ⌈Relevance vector machine⌋ has an identical ⌈functional form⌋ to the ⌈support vector machine⌋…
7. The basic goal of ⌈object-oriented relational database⌋ is to ⌈bridge the gap⌋ between…
The first four instances should provide positive counts to these sequences, while the last three
instances should not provide positive counts to “vector machine” or “relational database” because
they should not be interpreted as a whole phrase (instead, sequences like “feature vector” and “rel-
evance vector machine” can). Suppose one can correctly count true occurrences of the sequences,
and collect the rectified frequencies as shown in the rectified column of Table 2.1. The rectified frequency now clearly distinguishes "vector machine" from the other phrases, since "vector machine"
rarely occurs as a whole phrase.
The success of this approach relies on reasonably accurate rectification. Simple arithmetic on the raw frequencies, such as subtracting from one sequence's count the count of its quality super-sequence, is prone to error. First, which super-sequences are quality phrases is a question in and of itself.
Second, it is context-dependent to decide whether a sequence should be deemed a whole phrase.
For example, the fifth instance in Example 2.2 prefers “feature vector” and “machine learning” over
“vector machine,” even though neither “feature vector machine” nor “vector machine learning” is
a quality phrase. The context information is lost when we only collect the frequency counts.
In order to recover the true frequency with best effort, we ought to examine the context of
every occurrence of each word sequence and decide whether to count it as a phrase. The examination of one occurrence may involve enumerating alternative possibilities, such as extending or breaking the sequence, and comparing among them. The test for word sequence
occurrences could be expensive, losing the advantage in efficiency of the frequent pattern mining
approaches.
Facing the challenge of accuracy and efficiency, we propose a segmentation approach called
“phrasal segmentation,” and integrate it with the phrase quality assessment in a unified frame-
work with linear complexity (w.r.t. the corpus size). First, the segmentation assigns every word occurrence to only one phrase. In the first instance of Example 2.2, "relational database system" is bundled as a single phrase. Therefore, it automatically avoids double counting "relational
database” and “database system” within this instance. Similarly, the segmentation of the fifth in-
stance contributes to the count of “feature vector” and “machine learning” instead of “feature,”
"vector machine," and "learning." This strategy condenses the individual tests for each word sequence and reduces the overall complexity while ensuring correctness. Second, although there are
an exponential number of possible partitions of the documents, we are concerned with those rel-
evant to the phrase extraction task only. Therefore, we can integrate the segmentation with the
phrase quality assessment, such that: (i) only frequent phrases with reasonable quality are taken
into consideration when enumerating partitions; and (ii) the phrase quality guides the segmenta-
tion, and the segmentation rectifies the phrase quality estimation. Such an integrated framework
benefits from mutual enhancement, and achieves both high quality and high efficiency.
A phrasal segmentation defines a partition of a sequence into subsequences, such that every
subsequence corresponds to either a single word or a phrase. Example 2.2 shows instances of such
partitions, where all phrases with high quality are marked by the brackets ⌈⌋. The phrasal segmentation is distinct from word, sentence, or topic segmentation tasks in natural language processing. It
is also different from the syntactic or semantic parsing which relies on grammar to decompose the
sentences with rich structures like parse trees. Phrasal segmentation provides the necessary gran-
ularity we need to extract quality phrases. The total count of occurrences of a phrase in the segmented corpus is called its rectified frequency.
It is worth acknowledging that a sequence's segmentation may not be unique, for two reasons. First, as mentioned above, a word sequence may be regarded as a phrase or not, depending on adoption customs. Some phrases, like "bridge the gap" in the last instance of Example 2.2, are subject to a user's requirements. Therefore, we seek a segmentation that accommodates the phrase quality, which is learned from user-provided examples. Second, a sequence
could be ambiguous and have different interpretations. Nevertheless, in most cases, it does not
require a perfect segmentation, no matter whether such a segmentation exists, to extract quality phrases. In a large document collection, popularly adopted phrases appear many times in a variety of contexts. Even with a few mistakes or debatable partitions, a reasonably high-quality segmentation (e.g., yielding no partition like "support ⌈vector machine⌋") would retain sufficient support (i.e., rectified frequency) for these quality phrases, albeit not for false phrases with high raw frequency.
With the above discussions, we have the following formalization.
Example 2.4 Continuing our previous Example 2.2, and specifically for the first instance, the word sequence and marked segmentation are

C = "a relational database system for images"
S = / a / relational database system / for / images /

with a boundary index sequence B = {1, 2, 5, 6, 7} indicating the locations of the segmentation symbol "/".
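A tiny helper (ours, purely for illustration) recovers the boundary index sequence from a segmentation:

```python
def boundary_indices(segments):
    """Return the 1-based boundary index sequence B for a segmentation S."""
    boundaries = [1]
    for seg in segments:
        boundaries.append(boundaries[-1] + len(seg))
    return boundaries

S = [["a"], ["relational", "database", "system"], ["for"], ["images"]]
print(boundary_indices(S))  # [1, 2, 5, 6, 7]
```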
(Figure: the supervised phrase mining pipeline. Phrase candidates feed feature extraction; the resulting features, together with labels, train a phrase classifier for quality estimation.)
Concordance Features
This set of features is designed to measure concordance among the sub-units of a phrase. To make
phrases with different lengths comparable, we partition each phrase candidate into two disjoint
parts in all possible ways and derive effective features measuring their concordance.
Suppose that for each word or phrase $u \in U$ we have its raw frequency $f[u]$. Its probability $p(u)$ is defined as

$$p(u) = \frac{f[u]}{\sum_{u' \in U} f[u']}.$$
Given a phrase $v \in P$, we split it into the two most likely sub-units $\langle u_l, u_r \rangle$ such that their pointwise mutual information is minimized. Pointwise mutual information quantifies the discrepancy between the probability of the true collocation and the presumed collocation under the independence assumption. Mathematically,

$$\langle u_l, u_r \rangle = \arg\min_{u_l \oplus u_r = v} \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$
With $\langle u_l, u_r \rangle$, we directly use the pointwise mutual information as one of the concordance features:

$$\mathrm{PMI}(u_l, u_r) = \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$
Another feature, also from information theory, is the pointwise Kullback-Leibler divergence:

$$\mathrm{PKL}\big(v \,\|\, \langle u_l, u_r \rangle\big) = p(v) \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$

The additional factor $p(v)$ multiplying the pointwise mutual information leads to less bias toward rarely occurring phrases. Both features are supposed to be positively correlated with concordance.
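To make the two features concrete, the following minimal sketch (with hypothetical frequency counts of our own choosing) finds the PMI-minimizing split and computes both statistics:

```python
import math

# Hypothetical raw frequencies over words and phrases.
f = {"support": 500, "vector": 520, "machine": 480,
     "support vector": 160, "vector machine": 150,
     "support vector machine": 100}
total = sum(f.values())

def p(u):
    return f[u] / total

def best_split(v):
    """Split v into the two sub-units <u_l, u_r> minimizing the PMI."""
    words = v.split()
    candidates = [(" ".join(words[:i]), " ".join(words[i:]))
                  for i in range(1, len(words))]
    return min(candidates,
               key=lambda s: math.log(p(v) / (p(s[0]) * p(s[1]))))

def pmi(v):
    ul, ur = best_split(v)
    return math.log(p(v) / (p(ul) * p(ur)))

def pkl(v):
    return p(v) * pmi(v)  # pointwise Kullback-Leibler divergence

print(pmi("support vector machine"), pkl("support vector machine"))
```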
Informativeness Features
Some candidates are unlikely to be informative because they are functional or stop words. We incorporate the following stop-word-based features into the classification process.
• Whether stop words are located at the beginning or the end of the phrase candidate. This requires a dictionary of stop words; phrases that begin or end with stop words, such as "I am," are often functional rather than informative.
A more generic feature measures informativeness based on corpus statistics:
• Average inverse document frequency (IDF) computed over the words of the candidate, where the IDF of a word $w$ is

$$\mathrm{IDF}(w) = \log \frac{D}{|\{d \in [D] : w \in C_d\}|}.$$

IDF is a traditional information retrieval measure of how much information a word provides for retrieving a small subset of documents from a corpus. In general, quality phrases are expected to have a not-too-small average IDF.
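For illustration, a minimal sketch of the average-IDF feature over toy documents (each document is reduced to a set of tokens; the documents are ours, not from the book's corpora):

```python
import math

def avg_idf(phrase, docs):
    """Average IDF of the words in a phrase; docs is a list of token sets."""
    D = len(docs)
    idfs = [math.log(D / sum(1 for d in docs if w in d))
            for w in phrase.split()]
    return sum(idfs) / len(idfs)

docs = [{"support", "vector", "machine"}, {"support", "team"},
        {"machine", "shop"}, {"vector", "graphics"}]
print(avg_idf("support vector", docs))
```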
In addition to word-based features, punctuation is frequently used in text to aid the interpretation of specific concepts or ideas, and this information is helpful for our task. Specifically, we adopt the following feature.
• Punctuation: the probabilities of a phrase occurring in quotes, in brackets, or capitalized; a higher probability usually indicates a more informative phrase.
It is noteworthy that, to extract features efficiently, we designed an adapted Aho-Corasick automaton to rapidly locate the occurrences of phrases in the corpus. The Aho-Corasick automaton is similar to a trie, exploiting common prefixes to save memory and make matching more efficient; it additionally computes a "failed" field for each node, pointing to the node from which matching can continue after a mismatch. In this book, we adopt the standard Aho-Corasick automaton definition and construction process. Algorithm 2 introduces a "while" loop to fix the issues brought by prefixes (some phrase candidates are prefixes of others), which is slightly different from the traditional matching process and helps us find all occurrences of the phrase candidates in the corpus in linear time.
An alternative is to adopt a hash table. However, one should choose the hash function carefully, and the theoretical time complexity of a hash table is not exactly linear. For comparison, we implemented a hash-table approach using the unordered map in C++, with the Aho-Corasick automaton coded in C++ as well. The results can be found in Table 2.2. We can see that the Aho-Corasick automaton is slightly better because of its exactly linear complexity and lower memory overhead.

Table 2.2: Runtime of locating phrase candidate occurrences

Method                      Academia   Yelp
Aho-Corasick Automaton      154.25s    198.39s
Unordered Map (Hash Table)  192.71s    366.67s
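For reference, here is a compact sketch of a standard Aho-Corasick automaton over word tokens (the book's adapted version differs in implementation details; inheriting matches through failure links, as done below via the `out` sets, already reports candidates that are prefixes of longer ones):

```python
from collections import deque

class AhoCorasick:
    """Standard Aho-Corasick automaton over word tokens (a sketch)."""

    def __init__(self, phrases):
        self.goto = [{}]       # trie transitions
        self.fail = [0]        # failure links (the "failed" field)
        self.out = [set()]     # phrases ending at each node
        for phrase in phrases:
            self._insert(phrase.split())
        self._build_failure_links()

    def _insert(self, words):
        node = 0
        for w in words:
            if w not in self.goto[node]:
                self.goto.append({}); self.fail.append(0); self.out.append(set())
                self.goto[node][w] = len(self.goto) - 1
            node = self.goto[node][w]
        self.out[node].add(" ".join(words))

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())   # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for w, child in self.goto[node].items():
                f = self.fail[node]
                while f and w not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(w, 0)
                # Inherit matches so prefix candidates are also reported.
                self.out[child] |= self.out[self.fail[child]]
                queue.append(child)

    def occurrences(self, tokens):
        """Yield (end_index, phrase) for every phrase occurrence."""
        node = 0
        for i, w in enumerate(tokens):
            while node and w not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(w, 0)
            for phrase in self.out[node]:
                yield i, phrase

ac = AhoCorasick(["support vector", "vector machine",
                  "support vector machine"])
print(list(ac.occurrences("the support vector machine works".split())))
```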
Classifier
The framework can work with arbitrary classifiers that can be effectively trained with small la-
beled data and output a probabilistic score between 0 and 1. For instance, we can adopt random
forest [Breiman, 2001], which is efficient to train with a small number of labels. The ratio of positive predictions among all decision trees can be interpreted as a phrase's quality estimate. In
experiments we will see that 200–300 labels are enough to train a satisfactory classifier.
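A minimal sketch of this classification step follows. The feature matrices are synthetic stand-ins for the concordance and informativeness statistics above, not real data; scikit-learn's random forest exposes the tree-vote ratio through `predict_proba`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(300, 5))      # ~300 labeled phrase candidates
y_labeled = rng.integers(0, 2, size=300)   # quality (1) vs. inferior (0)
X_all = rng.normal(size=(10000, 5))        # all phrase candidates

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_labeled, y_labeled)
# Quality estimate: fraction of trees voting "quality", exposed by
# sklearn as the predicted probability of the positive class.
quality = forest.predict_proba(X_all)[:, 1]
print(quality[:5])
```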
Just as we have mentioned, both quality phrases and inferior ones are required as labels
for training. To further reduce the labeling effort, the next chapter introduces distant learning to
automatically retrieve both positive and negative labels.
where $p(b_{t+1}, \lceil w_{[b_t, b_{t+1})} \rfloor \mid b_t)$ is the probability of observing the word sequence $w_{[b_t, b_{t+1})}$ as the $t$-th quality segment. As segments of a word sequence usually have weak dependence on each other, we assume they are generated one by one, for the sake of both efficiency and simplicity.

We now describe the generative model for each segment. Given the start index $b_t$ of a segment $s_t$, we first generate the end index $b_{t+1}$ according to a prior distribution $p(|s_t| = b_{t+1} - b_t)$ over phrase lengths. Then we generate the word sequence $w_{[b_t, b_{t+1})}$ according to a multinomial distribution over all segments of length $(b_{t+1} - b_t)$. Finally, we generate an indicator of whether $w_{[b_t, b_{t+1})}$ forms a quality segment according to its quality $p(\lceil w_{[b_t, b_{t+1})} \rfloor \mid w_{[b_t, b_{t+1})}) = Q(w_{[b_t, b_{t+1})})$. We formulate the probabilistic factorization as follows:

$$p(b_{t+1}, \lceil w_{[b_t, b_{t+1})} \rfloor \mid b_t) = p(b_{t+1} \mid b_t)\; p(\lceil w_{[b_t, b_{t+1})} \rfloor \mid b_t, b_{t+1})$$
$$= p(|s_t| = b_{t+1} - b_t)\; p\big(w_{[b_t, b_{t+1})} \,\big|\, |s_t| = b_{t+1} - b_t\big)\; Q(w_{[b_t, b_{t+1})}).$$

The prior over segment lengths is

$$p(|s_t|) \propto \alpha^{1 - |s_t|}. \tag{2.2}$$
Here $\alpha \in \mathbb{R}^{+}$ is a factor called the segment length penalty. If $\alpha < 1$, longer phrases have a larger value of $p(|s_t|)$; if $\alpha > 1$, the mass of $p(|s_t|)$ moves toward shorter phrases. A smaller $\alpha$ thus favors longer phrases and results in fewer segments, and tuning its value turns out to be a trade-off between precision and recall for recognizing quality phrases. At the end of this subsection we discuss how to estimate its value by reusing the labels from Section 2.3.2. It is worth mentioning that such a segment length penalty is also discussed by Li et al. [2011]; our formulation differs from theirs by posing a weaker penalty on long phrases.
We denote $p(w_{[b_t, b_{t+1})} \mid |s_t|)$ by $\theta_{w_{[b_t, b_{t+1})}}$ for convenience. For a given corpus $C$ with $D$ documents, we need to estimate $\theta_u = p(u \mid |u|)$ for each frequent word and phrase $u \in U$, and infer the segmentation $S$. Since $p(C)$ does not depend on the segmentation $S$, one can maximize $\log p(S, C)$ instead. We employ the maximum a posteriori principle and maximize the joint probability of the corpus:

$$\sum_{d=1}^{D} \log p(S_d, C_d) = \sum_{d=1}^{D} \sum_{t=1}^{m_d} \log p\Big(b_{t+1}^{(d)}, \big\lceil w_{[b_t^{(d)}, b_{t+1}^{(d)})} \big\rfloor \,\Big|\, b_t^{(d)}\Big). \tag{2.3}$$
To find the best segmentation that maximizes Eq. (2.3), one can use efficient dynamic programming (DP) if $\theta$ is known. The algorithm is shown in Algorithm 3.
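The following Python sketch illustrates the dynamic program under a simplified interface of our own (not the authors' exact implementation): `theta` is assumed to contain every single word and every candidate phrase, `Q` is assumed strictly positive, and a segment from position i to j is scored by $\log p(|s|) + \log \theta_s + \log Q(s)$ with $p(|s|) \propto \alpha^{1-|s|}$:

```python
import math

def dp_segment(words, theta, Q, alpha, max_len=6):
    """Best phrasal segmentation by DP, in the spirit of Algorithm 3."""
    n = len(words)
    NEG_INF = float("-inf")
    h = [NEG_INF] * (n + 1)   # h[i]: best log-probability of words[:i]
    g = [0] * (n + 1)         # g[i]: previous boundary on the best path
    h[0] = 0.0
    for i in range(n):
        if h[i] == NEG_INF:
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            seg = " ".join(words[i:j])
            if seg not in theta:
                continue  # unseen multi-word segments are disallowed
            # log p(|s|) + log theta_s + log Q(s); single words default Q=1
            score = ((1 - (j - i)) * math.log(alpha)
                     + math.log(theta[seg])
                     + math.log(Q.get(seg, 1.0)))
            if h[i] + score > h[j]:
                h[j], g[j] = h[i] + score, i
    segments, i = [], n       # backtracking, as in Algorithm 3
    while i > 0:
        segments.append(" ".join(words[g[i]:i]))
        i = g[i]
    return segments[::-1]

theta = {"a": .1, "relational": .05, "database": .05, "system": .05,
         "for": .1, "images": .05, "relational database system": .2}
Q = {"relational database system": 0.9}
print(dp_segment("a relational database system for images".split(),
                 theta, Q, alpha=1.1))
```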
To learn $\theta$, we employ an optimization strategy called Viterbi Training (VT), or Hard-EM in the literature [Allahverdyan and Galstyan, 2011]. Generally speaking, VT is an efficient and iterative way of learning the parameters of probabilistic models with hidden variables. In our case, given the corpus $C$, it searches for a segmentation that maximizes $p(S, C \mid Q, \theta, \alpha)$, followed by coordinate ascent on the parameters $\theta$. This procedure is iterated until a stationary point is reached. The corresponding algorithm is given in Algorithm 4.
The hard E-step is performed by DP with $\theta$ fixed, and the M-step is based on the segmentation obtained from DP. Once the segmentation $S$ is fixed, the closed-form solution of $\theta_u$ can be derived as

$$\theta_u = \frac{\sum_{d=1}^{D} \sum_{t=1}^{m_d} \mathbb{1}_{s_t^{(d)} = u}}{\sum_{d=1}^{D} \sum_{t=1}^{m_d} \mathbb{1}_{|s_t^{(d)}| = |u|}}, \tag{2.4}$$

where $\mathbb{1}$ denotes the indicator function. We can see that $\theta_u$ is the rectified frequency of $u$ normalized by the total frequency of the segments with length $|u|$. For this reason, we name $\theta_u$ the normalized rectified frequency.
Note that Soft-EM (i.e., the Baum-Welch algorithm [Bishop, 2006]) can also be applied to find a maximum likelihood estimator of $\theta$. Nevertheless, VT is more suitable in our case (a minimal sketch of the whole loop follows the list below) because:
1. VT uses DP for the segmentation step, which is significantly faster than Baum-Welch using the forward-backward algorithm for the E-step; and
2. the majority of the phrases get removed as their $\theta$ approaches 0 during the iterations, which further speeds up our algorithm.
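Putting the pieces together, a minimal Hard-EM loop might look like the sketch below, reusing `dp_segment` from the earlier sketch. The tiny smoothing constant is our addition to keep the logarithms finite; the actual algorithm instead prunes phrases whose θ reaches 0:

```python
from collections import Counter

def viterbi_training(docs, theta, Q, alpha, iters=3, max_len=6):
    """Hard-EM sketch: alternate the DP segmentation (hard E-step)
    with the closed-form update of Eq. (2.4) (M-step)."""
    for _ in range(iters):
        seg_count, len_count = Counter(), Counter()
        for words in docs:                              # hard E-step
            for seg in dp_segment(words, theta, Q, alpha, max_len):
                seg_count[seg] += 1
                len_count[len(seg.split())] += 1
        theta = {u: (seg_count[u] + 1e-9)               # M-step, Eq. (2.4)
                    / (len_count[len(u.split())] + 1e-9)
                 for u in theta}
    return theta
```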
(The final step of Algorithm 3 recovers the segmentation by backtracking: starting from $i = n$, repeatedly emit the segment $s_m = w_{g_i+1} w_{g_i+2} \ldots w_i$ and set $i \leftarrow g_i$ until $i = 0$, then return $S = s_m s_{m-1} \ldots s_1$.)
It has also been reported in Allahverdyan and Galstyan [2011] that VT converges faster and
results in sparser and simpler models for Hidden Markov Model-like tasks. Meanwhile, VT is
capable of correctly recovering most of the parameters.
Previously, in Equation (2.2), we defined the segment length penalty. Its hyperparameter $\alpha$ needs to be determined outside the VT iterations. An overestimated $\alpha$ will segment quality phrases into shorter parts, while an underestimated $\alpha$ tends to keep low-quality phrases. Thus, an appropriate $\alpha$ reflects the user's trade-off between precision and recall. To judge what value of $\alpha$ is reasonable, we propose to reuse the labeled phrases from the phrase quality estimation. Specifically, we search for the maximum value of $\alpha$ such that VT does not segment positive phrases. A parameter $r_0$, named the non-segmented ratio, controls the trade-off mentioned above: it is the expected ratio of phrases in $L$ not partitioned by the dynamic programming. The detailed search process is described in Algorithm 5, where we initially set upper and lower bounds on $\alpha$ and then perform a binary search. In Algorithm 5, $|S|$ denotes the number of segments in $S$ and $|L|$ the number of positive labels.
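A minimal sketch of this search follows (our own simplification of Algorithm 5, reusing `dp_segment` from the earlier sketch; the bounds and the geometric midpoint are our choices, not the book's):

```python
def learn_alpha(positive_phrases, theta, Q, r0, rounds=50):
    """Binary search for the segment length penalty alpha: find the
    largest alpha whose segmentation keeps at least a ratio r0 of the
    positive label phrases un-partitioned."""
    low, up = 1e-8, 1e8
    for _ in range(rounds):
        alpha = (low * up) ** 0.5    # geometric midpoint over a wide range
        kept = sum(len(dp_segment(p.split(), theta, Q, alpha)) == 1
                   for p in positive_phrases)
        if kept / len(positive_phrases) >= r0:
            low = alpha              # feasible; try a stronger penalty
        else:
            up = alpha
    return low
```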
(Fragment of Algorithm 5: in each round, the positive label phrases are segmented under the current $\alpha$, the resulting segments are counted, and the achieved non-segmented ratio $r$ is compared against $r_0$ to tighten either the upper bound ($up \leftarrow \alpha$) or the lower bound ($low \leftarrow \alpha$).)
Table 2.3: Phrase quality before and after the segmentation feedback

Phrase                        | Quality Before | Quality After | Problem Fixed by Feedback
np hard in the strong sense   | 0.78           | 0.93          | slight underestimate
np hard in the strong         | 0.70           | 0.23          | overestimate
false pos. and false neg.     | 0.90           | 0.97          | N/A
pos. and false neg.           | 0.87           | 0.29          | overestimate
data base management system   | 0.60           | 0.82          | underestimate
data stream management system | 0.26           | 0.64          | underestimate
New segmentation-based features are introduced from the phrasal segmentation, which better model the concordance criterion. In addition, normalized rectified frequencies are used to compute these new features, which addresses the context-dependent completeness criterion. As a result, misclassified phrase candidates in the above example are mostly corrected after retraining the classifier, as shown in Table 2.3.
A better phrase quality estimator can guide a better segmentation as well. In this way, the
loop between the quality estimation and phrasal segmentation is closed and such an integrated
framework is expected to leverage mutual enhancement and address all four phrase quality
criteria organically.
Note that we do not need to run quality estimation and phrasal segmentation for many
iterations. In our experiments, the benefits brought by the rectified frequencies penetrate after the first iteration, leaving the performance curves over the next several iterations similar, as will be shown in the experiments.
Feature Extraction: When extracting features, the most challenging problem is how to efficiently locate the phrase candidates in the original corpus, because the original texts are crucial for finding the punctuation and capitalization information. Instead of using dictionaries to store all the occurrences, we take advantage of the Aho-Corasick automaton and tailor it to find all occurrences of the phrase candidates. The time complexity is O(|C| + |P|) and the space complexity O(|P|), where |P| is the total number of frequent phrase candidates. As the length of each candidate is limited by a constant ω, O(|P|) = O(|C|), so the complexity is O(|C|) in both time and space.
Phrase Quality Estimation: As we label only a very small set of phrase candidates, as long as the number and depth of the decision trees in the random forest are bounded by constants, the training time for the classifier is very small compared to the other parts. The prediction stage is proportional to the number of phrase candidates and the dimensionality of the features. Therefore, it is O(|C|) in both time and space, although the actual magnitude is typically smaller.
Viterbi Training: It is easy to observe that Algorithm 3 is O(nω), which is linear in the number of words. ω is treated as a constant, and thus the VT process is also O(|C|), considering that Algorithm 4 usually finishes in a few iterations.
Penalty Learning: Suppose we only require a constant precision ε to check the convergence of the binary search. Then the algorithm converges after about log₂((up − low)/ε) (roughly 200) rounds, so the number of loops can be treated as a constant. Because each round of VT takes O(|C|) time, penalty learning also takes O(|C|) time.
Summary. Because the time and space complexities of all components in our framework are O(|C|), our proposed framework has linear time and space complexity and is thus very efficient. Furthermore, the most time-consuming parts, including penalty learning and VT, can easily be parallelized because documents and sentences are processed independently of one another.
20 2. QUALITY PHRASE MINING WITH USER GUIDANCE
2.4 EXPERIMENTAL STUDY
In this section, experiments demonstrate the effectiveness and efficiency of the proposed methods
in mining quality phrases and generating accurate segmentation. We begin with the description
of datasets.
Two real-world data sets were used in the experiments and detailed statistics are summa-
rized in Table 2.4.
Table 2.4: Statistics about datasets
Dataset #Docs #Words #Labels
Academia 2.77M 91.6M 300
Yelp 4.75M 145.1M 300
• SegPhrase+ is similar to SegPhrase but adds segmentation features to refine quality estima-
tion. It contains the full procedures presented in Section 2.3.
The first two methods utilize NLP chunking to obtain phrase candidates. We use the JATE⁵ implementation of these two methods, i.e., TF-IDF and C-Value. Both rely on OpenNLP⁶ as the linguistic processor to detect phrase candidates in the corpus. The remaining methods are all based on frequent n-grams, and their runtime is dramatically lower. The last three methods are variations of our proposed method.
It is also worth mentioning that JATE implements several more methods, including Weirdness [Ahmad et al., 1999]. They are not reported here due to their unsatisfactory performance compared to the baselines listed above.
For the parameter setting, we set minimum phrase support as 30 and maximum phrase
length ! as 6, which are two parameters required by all methods. Other parameters required by
baselines were set according to the open source tools or the original papers.
For our proposed methods, training labels for phrases were collected by sampling representative phrase candidates from groups of phrases pre-clustered on the normalized feature space by k-means. We labeled research areas, tasks, algorithms, and other scientific terms in the Academia dataset as quality phrases. Some examples are "divide and conquer," "np complete," and "relational database." For the Yelp dataset, restaurants, dishes, cities, and other related concepts were labeled as positive. In contrast, phrases like "under certain assumptions," "many restaurants," and "last night" were labeled as negative. We downsample low-quality phrases because they dominate the quality phrases. The number of training labels in our experiments is reported in Table 2.4. To automatically learn the value of the segment length penalty, we set the non-segmented ratio r0 in Algorithm 5 as 1.0 for the Academia dataset and 0.95 for the Yelp dataset. The selection of this parameter is discussed in detail later in this section.
To make the outputs returned by different methods comparable, we converted all phrase candidates to lower case and merged plural with singular phrases. The phrase lists generated by these methods have different sizes, and the tails of the lists are of low quality. For simplicity of comparison, we discarded low-ranked phrases based on the minimum size among all phrase lists except ConExtr's; ConExtr returns all phrases without ranking, so we did not remove any of its phrases. The remaining size of each list is still reasonably large (> 40,000).
Wiki Phrases: The first set of experiments used Wikipedia phrases as ground-truth labels. Wiki phrases refer to popular mentions of entities obtained by crawling intra-Wiki citations within Wiki content. To compute precision, only the Wiki phrases are considered positive. For recall, we combine the Wiki phrases returned by the different methods and view them as the set of all quality phrases. Precision and recall are biased in this case because positive labels are restricted to Wiki phrases. However, we still expect to obtain meaningful insights regarding the performance difference between the proposed methods and the baselines.
Pooling: Besides Wiki phrases, we rely on human evaluators to judge whether the remaining candidates are good. We randomly sampled k Wiki-uncovered phrases from the returned candidates of each compared method. These sampled phrases formed a pool, and each of them was evaluated by three reviewers independently. The reviewers could use a popular search engine for the candidates (thus helping them judge the quality of phrases they were not familiar with). We took the majority opinion and used these results to evaluate how precise the returned quality phrases of each method are. Throughout the experiments we set k = 500.
Precision-recall curves of the different methods, evaluated by both Wiki phrases and pooling phrases, are shown in Figure 2.2. The trends on both datasets are similar.
Among the existing work, the chunking-based methods, such as TF-IDF and C-Value, have the best performance; ConExtr reduces to a dot in the figure since its output provides no ranking information. Our proposed method, SegPhrase+, outperforms them significantly. More specifically, SegPhrase+ can achieve a higher recall while its precision is maintained at a satisfactory level; that is, many more quality phrases can be found by SegPhrase+ than by the baselines. At a given recall, the precision of our method is higher most of the time.
Among the variant methods within our framework, it is surprising that ClassPhrase performs competitively with chunking-based methods like TF-IDF. Note that the latter requires large amounts of pre-training for good phrase chunking. However, ClassPhrase's precision at the tail is slightly worse than TF-IDF's on the Academia dataset when evaluated by Wiki phrases. We also observe a significant difference between SegPhrase and ClassPhrase, indicating that phrasal segmentation plays a crucial role in addressing the completeness criterion. In fact, SegPhrase already beats ClassPhrase and the baselines. Moreover, SegPhrase+ improves upon SegPhrase because it uses the phrasal segmentation results as additional features.
An interesting observation is that the advantage of our method is more significant on the
pooling evaluations. The phrases in the pool are not covered by Wiki, indicating that Wikipedia is
not a complete source of quality phrases. However, our proposed methods, including SegPhrase+,
SegPhrase, and ClassPhrase, can mine out most of them (more than 80%) and keep a very high
level of precision, especially on the Academia dataset. Therefore, the evaluation results on the pooling phrases suggest that our methods not only detect the well-known Wiki phrases, but also work properly for long-tail phrases that may occur less frequently.

Figure 2.2: Precision-recall curves in four groups of experiments: {Academia, Yelp} × {Wiki phrases, pooling}. Methods compared: TF-IDF, C-Value, ConExtr, KEA, ToPMine, ClassPhrase, SegPhrase, and SegPhrase+.
From the results on the Yelp dataset evaluated by pooling phrases, we notice that SegPhrase+ is a little weaker than SegPhrase at the head. As we know, SegPhrase+ tries to utilize the phrasal segmentation results from SegPhrase to refine the phrase quality estimator. However, segmentation features add no new information for bigrams. If there are not many quality phrases with more than two words, SegPhrase+ might not show significant improvement and can even perform slightly worse, due to overfitting from reusing the same set of labeled phrases. In fact, on the Academia dataset, the ratios of quality phrases with more than two words are 24% among all Wiki phrases and 17% among pooling phrases. In contrast, these statistics go down to 13% and 10% on the Yelp dataset, which verifies our conjecture and explains why SegPhrase+ has slightly lower precision than SegPhrase at the head.
• How many labels do we need to achieve good results of phrase quality estimation?
• How many iterations are needed to alternate between phrase quality estimation and phrasal
segmentation?
Number of Labels
To evaluate the impact of training data size on the phrase quality estimation, we focus on studying
the classification performance of ClassPhrase. Table 2.5 shows the results evaluated among phrases with positive predictions (i.e., $\{v \in P : Q_v \geq 0.5\}$). For different numbers of labels, we report the precision, recall, and F1 score judged by human evaluators (pooling). The number of correctly predicted Wiki phrases is also provided, together with the total number of positive phrases predicted by the classifier. From these results, we observe that the performance of the classifier improves as the number of labels increases. Specifically, on both datasets, the recall rises as the number of labels increases, while the precision goes down; the reason is the downsampling of low-quality phrases in the training data. Overall, the F1 score is monotonically increasing, which indicates that more labels may result in better performance. In practice, 300 labels are enough to train a satisfactory classifier.
Table 2.5: Impact of training data size on ClassPhrase (Top: Academia, Bottom: Yelp)
Academia
# Labels Precision Recall F1 # Wiki Phrases # Total
50 0.881 0.372 0.523 6,179 24,603
100 0.859 0.430 0.573 6,834 30,234
200 0.856 0.558 0.676 8,196 40,355
300 0.760 0.811 0.785 11,535 95,070
Yelp
# Labels Precision Recall F1 # Wiki Phrases # Total
50 0.491 0.948 0.647 6,985 79,091
100 0.540 0.948 0.688 6,692 57,018
200 0.554 0.948 0.700 6,786 53,613
300 0.559 0.944 0.702 6,777 53,442
Non-segmented Ratio
e non-segmented ratio r0 is designed for learning segment length penalty, which further con-
trols the precision and recall phrasal segmentation. Empirically, under higher r0 , the segmenta-
tion process will favor longer phrases, and vice versa. We show experimental results in Table 2.6
for models with different values of r0 . e evaluation measures are similar to the previous set-
ting but they are computed based on the results of SegPhrase. One can observe that the precision
increases with lower r0 , while the recall decreases. It is because phrases are more likely to be seg-
mented into words by lower r0 . High r0 is generally preferred because we should preserve most
positive phrases in training data. We select r0 D 1:00 and 0:95 for Academia and Yelp datasets
respectively, because quality phrases are shorter in Yelp dataset than in Academia dataset.
Iterations of SegPhrase+
SegPhrase+ involves only one iteration of re-estimating phrase quality using normalized recti-
fied frequency from phrasal segmentation. Here we show the performance of SegPhrase+ with
more iterations in Figure 2.3, based on human-labeled phrases. For comparison, we also report the performance of ClassPhrase+, which is similar to ClassPhrase but includes segmentation features generated from the phrasal segmentation results of the last iteration.

Table 2.6: Impact of non-segmented ratio r0 on SegPhrase (Top: Academia, Bottom: Yelp)

Academia
r0 Precision Recall F1 # Wiki Phrases # Total
1.00 0.816 0.756 0.785 10,607 57,668
0.95 0.909 0.625 0.741 9,226 43,554
0.90 0.949 0.457 0.617 7,262 30,550
0.85 0.948 0.422 0.584 7,107 29,826
0.80 0.944 0.364 0.525 6,208 25,374

Yelp
r0 Precision Recall F1 # Wiki Phrases # Total
1.00 0.606 0.948 0.739 7,155 48,684
0.95 0.631 0.921 0.749 6,916 42,933
0.90 0.673 0.846 0.749 6,467 34,632
0.85 0.714 0.766 0.739 5,947 28,462
0.80 0.725 0.728 0.727 5,729 26,245

Table 2.7: Objective function values of Viterbi Training for SegPhrase and SegPhrase+
We can see that the benefits brought by the rectified frequencies are fully digested within the first iteration, leaving the F1 scores over the next several iterations close. One can also observe a slight performance decline over the next two iterations, especially for the top-1000 phrases. Recall that we reuse the training labels for each iteration; this decline can thus be explained by overfitting, because the segmentation features added in later iterations become less meaningful, and more meaningless features undermine the classification power of the random forest. Based on this, we conclude that there is no need to perform the phrase quality re-estimation multiple times.
Figure 2.3: Performance variations of SegPhrase+ and ClassPhrase+ with increasing iterations (F1 scores by pooling on the Academia and Yelp datasets; curves shown for SegPhrase+@All and SegPhrase+@1000).
(Figure: running time, in seconds, of our framework on increasing proportions (0.2 to 1.0) of the Academia and Yelp datasets.)
Besides, the pie charts in Figure 2.5 show the runtime ratios of the different components of our framework. One can observe that Feature Extraction and Phrasal Segmentation occupy most of the runtime.
Figure 2.5: Runtime of different modules in our framework on the Academia and Yelp datasets (modules include Frequent Phrase Detection, Feature Extraction, Phrasal Segmentation, and Quality Estimation).
Fortunately, almost all components of our framework can be parallelized, including Feature Extraction, Phrasal Segmentation, and Quality Estimation, which are the most expensive parts of the execution. This is because sentences can be processed one by one without any impact on each other. Therefore, our methods could be made very efficient for massive corpora using parallel and distributed techniques. We do not compare runtime with the other baselines because they are implemented in different programming languages and some further rely on various third-party packages. Among existing implementations, our method is empirically one of the fastest.
Rank | SIGMOD: SegPhrase+ | SIGMOD: Chunking | SIGKDD: SegPhrase+ | SIGKDD: Chunking
1    | data base | data base | data mining | data mining
2    | database system | database system | data set | association rule
3    | relational database | query processing | association rule | knowledge discovery
4    | query optimization | query optimization | knowledge discovery | frequent itemset
5    | query processing | relational database | time series | decision tree
…
51   | sql server | database technology | assoc. rule mining | search space
52   | relational data | database server | rule set | domain knowledge
53   | data structure | large volume | concept drift | important problem
54   | join query | performance study | knowledge acquisition | concurrency control
55   | web service | web service | gene expression data | conceptual graph
…
201  | high dimensio. data | efficient impl. | web content | optimal solution
202  | location based serv. | sensor network | frequent subgraph | semantic relation
203  | xml schema | large collection | intrusion detection | effective way
204  | two phase locking | important issue | categorical attribute | space complexity
205  | deep web | frequent itemset | user preference | small set
…
2.5 SUMMARY
In this chapter, we introduced a data-driven model for extracting quality phrases from text corpora with user guidance. Requiring only limited training effort, the model achieves outstanding performance even for highly irregular textual data such as business reviews. The key idea is to rectify the raw frequency of phrases, which otherwise misleads quality estimation. A segmentation-integrated approach is therefore developed, and it finally addresses this fundamental limitation of phrase mining. However, we discover that, despite the outstanding performance, the reliance on manual effort from domain experts can still become an impediment to timely analysis of massive, emerging text corpora. A fully automated algorithm, instead, can be much more useful in this scenario. Meanwhile, this chapter focuses on multi-word phrase mining, leaving single-word phrases untreated. The integration of light-weight linguistic processors such as POS tagging is also worth studying. We reserve these topics for the next chapter.

Table 2.10: Top-5 similar phrases for representative queries (Top: Academia, Bottom: Yelp)
Table 2.11: Sampled quality phrases from the Academia and Yelp datasets

Academia
Rank  | ClassPhrase | SegPhrase | SegPhrase+
1     | virtual reality | virtual reality | self organization
2     | variable bit rate | variable bit rate | polynomial time approx.
3     | shortest path | shortest path | least squares
…
501   | finite state | frequency offset estimation | health care
502   | air traffic | collaborative filtering | gene expression
503   | long term | ultra wide band | finite state transducers
…
2001  | chemical reaction | ad hoc networks | quasi monte carlo
2002  | container terminals | hyperspectral remote sensing | integer programming
2003  | graceful degradation | piecewise affine | gray level
…
10001 | search terms | test plan | airline crew scheduling
10002 | high dimensional space | automatic text | integer programming
10003 | delay variation | adaptive bandwidth | web log data
…
20001 | test coverage | implementation costs | experience sampling
20002 | adaptive sliding mode control | virtual execution environments | error bounded
20003 | random graph models | free market | nonlinear time delay systems
…
50001 | svm method | harmony search algorithm | asymptotic theory
50002 | interface adaptation | integer variables | physical mapping
50003 | diagnostic fault simulation | nonlinear oscillators | distinct patterns
…

Yelp
Rank  | ClassPhrase | SegPhrase | SegPhrase+
1     | taco bell | taco bell | tour guide
2     | wet republic | wet republic | yellow tail
3     | pizzeria bianco | pizzeria bianco | vanilla bean
…
501   | panoramic view | art museum | rm seafood
502   | pretzel bun | ice cream parlor | pecan pie
503   | spa pedicure | pho kim long | master bedroom
…
2001  | buffalo chicken wrap | training sessions | smashed potatoes
2002  | salvation army | folding chairs | italian stallion
2003  | shortbread cookies | single bypass | ferris wheel
…
10001 | seated promptly | carrot soup | gary danko
10002 | leisurely stroll | veggie soup | benny benassi
10003 | flavored water | pork burrito | big eaters
…
20001 | buttery toast | late night specials | cilantro hummus
20002 | quick breakfast | older women | lv convention center
20003 | slightly higher | worth noting | iced vanilla
…
40001 | friday morning | conveniently placed | coupled with
40002 | start feeling | cant remember | way too high
40003 | immediately start | stereo system | almost guaranteed
…
CHAPTER 3
Automated Quality Phrase Mining

Figure 3.1: Motivation: automated phrase mining without human effort for multiple languages.
3.1 OVERVIEW
Toward the goal of making the framework fully automated, we summarize the following three
major challenges.
1. Can we completely remove the human effort for labeling phrases? In the previous chap-
ter, SegPhrase+ has shown that the quality of phrases generated by unsupervised meth-
ods [Deane, 2005, El-Kishky et al., 2015, Parameswaran et al., 2010] is acceptable but
much weaker than the supervised methods, and at least a few hundred labels are necessary
for training. Distant training is a popular methodology to reduce expensive human labor by
utilizing high-quality phrases in knowledge bases as positive phrase labels.
2. Can we achieve high performance of phrase mining in multiple languages? Complicated pre-
processing models, such as dependency parsing, heavily rely on human efforts and thus can-
not be smoothly applied to multiple languages, as shown in Figure 3.1. To achieve high per-
formance with minimum language dependency, we fully utilize the results of the following two techniques: (1) tokenization, which provides the building bricks of phrases—the boundaries of words; and (2) part-of-speech (POS) tagging, another elementary preprocessing step in NLP pipelines, which is available for most languages; there are even largely language-independent POS taggers, such as TreeTagger [Schmid, 2013]. Moreover, Observation 3.1 suggests that the context information from POS tags can be a strong signal for identifying phrase boundaries, complementing the frequency-based statistical signals.
On the other hand, purely considering POS tags may not be wise regardless of the tagging
performance. For example, in #4, “classifier SVM ” will be wrongly extracted if only POS
tags are considered. In this case, frequency-based signals can correct the error.
Figure 3.3: The illustration of each base classifier. For each base classifier, we first randomly sample K positive and K negative labels from the respective pools. There might be δ quality phrases among the K negative labels. An unpruned decision tree is trained on this perturbed training set.
negative pool) simply because they were not present in the knowledge base. Instead, we propose to utilize an ensemble classifier that averages the results of T independently trained base classifiers. As shown in Figure 3.3, for each base classifier we randomly draw K phrase candidates with replacement from the positive pool and from the negative pool (considering a canonical balanced classification scenario). This size-2K subset of the full set of phrase candidates is called a perturbed training set [Breiman, 2000], because the labels of some (δ in the figure) quality phrases are switched from positive to negative. For the ensemble classifier to alleviate the effect of such noise, we need base classifiers with the lowest possible training errors. We
fact, such decision tree will always reach 100% training accuracy when no two positive and negative
ı
phrases share identical feature values in the perturbed training set. In this case, its ideal error is 2K ,
which approximately equals to the proportion of switched labels among all phrase candidates (i.e.,
ı
2K
10%). erefore, the value of K is not sensitive to the accuracy of the unpruned decision
tree and is fixed as 100 in our implementation. Assuming the extracted features are distinguishable
between quality and inferior phrases, the empirical error evaluated on all phrase candidates, p ,
should be relatively small as well.
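A minimal sketch of this distant-training procedure follows. It uses scikit-learn's decision trees on hypothetical feature matrices; the pools and features are placeholders of our own, not the authors' actual data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distant_ensemble(pos_pool, neg_pool, X_all, T=1000, K=100, seed=0):
    """T independently trained base classifiers, each an unpruned decision
    tree fit on a perturbed training set: K positive and K negative feature
    rows sampled with replacement from the two pools. The returned score
    is the proportion of trees voting "quality"."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_all))
    for _ in range(T):
        X = np.vstack([pos_pool[rng.integers(0, len(pos_pool), K)],
                       neg_pool[rng.integers(0, len(neg_pool), K)]])
        y = np.array([1] * K + [0] * K)
        tree = DecisionTreeClassifier(random_state=0)  # unpruned by default
        votes += tree.fit(X, y).predict(X_all)
    return votes / T

rng = np.random.default_rng(1)
pos_pool = rng.normal(1.0, 1.0, size=(2000, 5))   # hypothetical features
neg_pool = rng.normal(-1.0, 1.0, size=(2000, 5))
X_all = rng.normal(0.0, 1.5, size=(500, 5))
print(distant_ensemble(pos_pool, neg_pool, X_all, T=50)[:5])
```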
An interesting property of this sampling procedure is that the random selection of phrase candidates for building perturbed training sets creates classifiers that have statistically independent errors and similar erring probabilities [Breiman, 2000, Martínez-Muñoz and Suárez, 2005]. Therefore, we naturally adopt random forests [Geurts et al., 2006], which are verified, in the statistics literature, to be robust and efficient. The phrase quality score of a particular phrase is computed as the proportion of all decision trees that predict it to be a quality phrase. Supposing there are T trees in the random forest, the ensemble error can be estimated as the probability that more than half of the classifiers misclassify a given phrase candidate:
$$\mathrm{ensemble\_error}(T) = \sum_{t=\lfloor 1+T/2 \rfloor}^{T} \binom{T}{t}\, p^{t} (1-p)^{T-t}.$$
[Figure 3.4: Ensemble error as a function of the number of base classifiers T (from 10^0 to 10^3, log scale), plotted for base error probabilities p = 0.05, 0.1, 0.2, and 0.4.]
From Figure 3.4, one can easily observe that the ensemble error approaches 0 as T
grows. In practice, T needs to be set larger due to the additional error brought by model bias.
Empirical studies can be found in Figure 3.8.
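The trend in Figure 3.4 is easy to reproduce. Below is a minimal sketch (standard library only; the function name is our own) that evaluates the ensemble error formula above:

```python
from math import comb, floor

def ensemble_error(T: int, p: float) -> float:
    """Probability that more than half of T independent base classifiers,
    each erring with probability p, misclassify a given phrase candidate."""
    return sum(comb(T, t) * p ** t * (1 - p) ** (T - t)
               for t in range(floor(1 + T / 2), T + 1))

# The error vanishes quickly as T grows, e.g., with p = 0.2:
for T in (1, 10, 100, 1000):
    print(T, ensemble_error(T, 0.2))
```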
Independence. A quality single-word phrase is more likely a complete semantic unit in the given
documents. For example, “UIUC ” is a quality single-word phrase. However, “united,” usually
occurring within quality multi-word phrases such as “United States,” “United Kingdom,”
“United Airlines,” and “United Parcel Service,” is not a quality single-word phrase, because it
rarely stands alone as an independent unit.
Informativeness Features. In information retrieval, stop words and inverse document frequency
(IDF) are two useful signals for measuring word informativeness.
In general, a quality single-word phrase is expected to be a non-stop word with a relatively large
IDF.
Punctuation patterns appear across different languages, especially quotes, brackets,
and capitalization. Therefore, we adopt (1) the probability that a single-word phrase is surrounded
by quotes or brackets and (2) the probability that the first character of a single-word phrase is in
uppercase. A higher probability usually indicates a more informative single-word phrase. A
good example is “support vector machines (SVM).” Note that, in some languages, such as Chinese,
there is no uppercase feature.
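For illustration, these probabilities can be estimated by simple counting over a phrase's mentions. A sketch under our own assumed input format (one (token, preceding character, following character) triple per mention):

```python
def single_word_informativeness(occurrences):
    """occurrences: list of (token, prev_char, next_char) triples,
    one per mention of the single-word phrase in the corpus."""
    n = len(occurrences)
    enclosed = sum(1 for _, prev, nxt in occurrences
                   if (prev, nxt) in {('"', '"'), ('(', ')'), ('[', ']')})
    capitalized = sum(1 for tok, _, _ in occurrences if tok[:1].isupper())
    return {
        "p_quoted_or_bracketed": enclosed / n,   # surrounded by quotes/brackets
        "p_first_char_uppercase": capitalized / n,
    }
```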
The features for multi-word phrases in the previous chapter are inherited, including con-
cordance features, such as pointwise mutual information and pointwise Kullback-Leibler divergence
after decomposing the phrase into two parts, and informativeness features involving IDF, stop
words, and punctuation.
In addition, we propose two new context-independent completeness features inspired
by Parameswaran et al. [2010]: (1) the ratio between the phrase frequency and the minimum
frequency among its sub-phrases; and (2) the ratio between the maximum frequency among its
super-phrases and the phrase frequency. A low sub-phrase ratio usually indicates the phrase can be
shortened, while a high super-phrase ratio implies the phrase is not complete. For instance, “NP-
complete in the strong ” tends to have a high super-phrase ratio because it almost always occurs in
“NP-complete in the strong sense;” “classifier SVM ” is expected to receive a low sub-phrase ratio because
both “classifier ” and “SVM ” are popular elsewhere.
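Both ratios can be read off an n-gram frequency table. A small sketch (for multi-word phrases; `freq` is assumed to map token tuples to raw counts and to cover all needed n-grams):

```python
def completeness_features(phrase, freq):
    """phrase: tuple of at least two tokens; freq: dict from token tuples to counts."""
    n = len(phrase)
    # All contiguous proper sub-phrases of the phrase.
    subs = [phrase[i:j] for i in range(n) for j in range(i + 1, n + 1)
            if (i, j) != (0, n)]
    # Immediate super-phrases: (n+1)-grams extending the phrase by one token.
    supers = [g for g in freq
              if len(g) == n + 1 and (g[:n] == phrase or g[1:] == phrase)]
    sub_ratio = freq[phrase] / min(freq[s] for s in subs)
    super_ratio = max((freq[g] for g in supers), default=0) / freq[phrase]
    # Low sub_ratio: the phrase could be shortened; high super_ratio: incomplete.
    return sub_ratio, super_ratio
```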
Example 3.3 Recall the example sentences in Observation 3.1. Ideal POS-guided phrasal seg-
mentation results are as follows.
#1: 〈Sophia Smith, NNP NNP〉, 〈was, VBD〉, 〈born, VBN〉, 〈in, IN〉, 〈England, NNP〉, 〈., .〉
#2: …, 〈the, DT〉, 〈Great Firewall, NNP NNP〉, 〈is, VBZ〉, …
#3: 〈This, DT〉, 〈is, VBZ〉, 〈a, DT〉, 〈great, JJ〉, 〈firewall software, NN NN〉, 〈., .〉
#4: 〈The, DT〉, 〈discriminative classifier, JJ NN〉, 〈SVM, NN〉, 〈is, VBZ〉, …
Definition 3.4 POS Sequence Quality. It is defined to be the probability of a word sequence
being a complete semantic unit given its corresponding POS tag sequence, according to the above
criteria. Given a length-k POS tag sequence t_1 t_2 … t_k, its POS sequence quality is:

$$T(t_1 t_2 \ldots t_k) = p\big(\lceil v_1 \ldots v_k \rceil \,\big|\, \mathrm{tag}(v_1 \ldots v_k) = t_1 t_2 \ldots t_k\big),$$

where tag(v_1 … v_k) is the corresponding POS tag sequence of the word sequence v_1 … v_k.
The estimator for POS sequence quality will also be learned, and it is expected to work as
follows.
Example 3.5 A good POS sequence quality estimator can return T(NN NN) ≈ 1,
T(NN VB) ≈ 0, and T(DT NN) ≈ 0, where NN refers to a singular or mass noun (e.g., database),
VB means a verb in the base form (e.g., is), and DT is a determiner (e.g., the).
The POS sequence quality score T(·) is designed to reward phrases with meaningful
POS patterns. The particular form we choose is:

$$T\big(t_{[b_i, b_{i+1})}\big) = \big(1 - \delta(t_{b_{i+1}-1}, t_{b_{i+1}})\big) \cdot \prod_{j=b_i+1}^{b_{i+1}-1} \delta(t_{j-1}, t_j),$$

where δ(t_1, t_2) is the probability that the POS tag t_2 directly follows the POS tag t_1 within a phrase
in the given document collection. In this formula, the first term represents that there is a phrase
boundary between positions b_{i+1} − 1 and b_{i+1}, while the product indicates that all POS tags within t_[b_i, b_{i+1})
belong to the same phrase. This POS quality score naturally counters the bias toward longer segments,
because for every pair of adjacent POS tags, exactly one of δ(t_1, t_2) and (1 − δ(t_1, t_2)) is multiplied in, no matter how the corpus
is segmented. Note that the length penalty model in SegPhrase+ is a special case in which δ(t_1, t_2)
always takes the same value.
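Given an estimate of δ, this score is a single pass over the tag pairs of a candidate segment. A sketch (`delta` is assumed to be a dict from tag pairs to probabilities):

```python
def pos_sequence_quality(tags, b, e, delta):
    """Score T(t[b:e)) of the candidate segment covering tags[b:e) inside a
    sentence with full tag list `tags`; delta[(t1, t2)] is the probability
    that t2 directly follows t1 within a phrase."""
    score = 1.0
    for j in range(b + 1, e):            # adjacent tag pairs inside the segment
        score *= delta[(tags[j - 1], tags[j])]
    if e < len(tags):                    # phrase boundary right after the segment
        score *= 1.0 - delta[(tags[e - 1], tags[e])]
    return score
```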
Mathematically, δ(t_1, t_2) is estimated from the segmented corpus during training; its closed-form update is given in Eq. (3.2) below. The joint probability of a segmentation S and the corpus C then factorizes over segments as

$$p(S, C \mid Q, \theta, \delta) = \prod_{i} p\big(b_{i+1}, \lceil w_{[b_i, b_{i+1})} \rceil \,\big|\, b_i, t_{[b_i, b_{i+1})}\big), \qquad (3.1)$$

where p(b_{i+1}, ⌈w_[b_i,b_{i+1})⌉ | b_i, t_[b_i,b_{i+1})) is the probability of observing a word sequence w_[b_i,b_{i+1})
as the i-th quality segment given the previous boundary index b_i and its corresponding POS tag
sequence t_[b_i,b_{i+1}).
Since phrase segments function as constituents in the syntax of a sentence [Finch,
2000], they usually have weak dependence on each other. As a result, we assume the segments
in the word sequence are generated one by one, for the sake of both efficiency and simplicity.
For each segment, given the POS tag sequence t and the start index b_i of a segment s_i, the
generative process is defined as follows.
1. Generate the end index b_{i+1}, according to its POS sequence quality T(t_[b_i, b_{i+1})).
2. Given the two ends b_i, b_{i+1}, generate the word sequence w_[b_i,b_{i+1}) according to a multi-
nomial distribution over all segments of length (b_{i+1} − b_i):

$$p\big(w_{[b_i, b_{i+1})} \mid b_i, b_{i+1}\big) = p\big(w_{[b_i, b_{i+1})} \,\big|\, |s_i| = b_{i+1} - b_i\big).$$

3. Finally, generate an indicator of whether w_[b_i,b_{i+1}) forms a quality segment according to
its quality:

$$p\big(\lceil w_{[b_i, b_{i+1})} \rceil \,\big|\, w_{[b_i, b_{i+1})}\big) = Q\big(w_{[b_i, b_{i+1})}\big).$$
Integrating the above three generative steps together, we have the following probabilistic
factorization:

$$p\big(b_{i+1}, \lceil w_{[b_i, b_{i+1})} \rceil \,\big|\, b_i, t\big) = T\big(t_{[b_i, b_{i+1})}\big) \cdot p\big(w_{[b_i, b_{i+1})} \,\big|\, |s_i| = b_{i+1} - b_i\big) \cdot Q\big(w_{[b_i, b_{i+1})}\big).$$
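In code, the per-segment probability is just the product of the three factors; a sketch reusing `pos_sequence_quality` from above (`theta` and `Q` are assumed to be dicts keyed by token tuples):

```python
def segment_prob(words, tags, b, e, theta, Q, delta):
    """p(b_{i+1}, ceil(w[b:e)) | b_i, t[b:e)) for the segment words[b:e)."""
    u = tuple(words[b:e])
    return (pos_sequence_quality(tags, b, e, delta)  # step 1: T(t[b:e))
            * theta.get(u, 0.0)                      # step 2: p(w | |s| = e - b)
            * Q.get(u, 0.0))                         # step 3: quality indicator
```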
Therefore, for a given corpus C with D documents, there are three subproblems:
• learn p(u | |u|) for each frequent word and phrase u ∈ P; we denote p(u | |u|) as θ_u for
convenience;
• learn δ(t_1, t_2) for every pair of POS tags; and
• infer the segmentation S that maximizes the joint probability.
Algorithm 6: POS-Guided Phrasal Segmentation (PGPS), backtracking step (lines 11–16)
11  j ← n + 1, m ← 0
12  while j > 1 do
13      m ← m + 1
14      s_m ← 〈w_[g_j, j), t_[g_j, j)〉
15      j ← g_j
16  return S ← s_m s_{m−1} … s_1
Given θ and δ(·, ·), to find the best segmentation that maximizes Equation (3.1), we
develop an efficient dynamic programming algorithm for POS-guided phrasal segmentation
(PGPS), as shown in Algorithm 6.
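A compact rendering of the whole procedure, including the backtracking fragment shown above, might look as follows (a sketch using 0-based indices; `segment_prob` is from the previous sketch, and in practice the scores should be kept in log space to avoid underflow):

```python
def pgps(words, tags, theta, Q, delta, max_len=6):
    """Viterbi-style DP: h[j] is the best score of segmenting words[0:j);
    g[j] records where the best segment ending at j starts (cf. Algorithm 6)."""
    n = len(words)
    h = [0.0] * (n + 1)
    h[0] = 1.0
    g = [0] * (n + 1)
    for i in range(n):                               # best prefix ending at i
        for j in range(i + 1, min(i + max_len, n) + 1):
            score = h[i] * segment_prob(words, tags, i, j, theta, Q, delta)
            if score > h[j]:
                h[j], g[j] = score, i
    segments, j = [], n                              # backtracking, lines 11-16
    while j > 0:
        segments.append((words[g[j]:j], tags[g[j]:j]))
        j = g[j]
    return segments[::-1]
```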
When the segmentation S and the parameter θ are fixed, the closed-form solution of
δ(t_1, t_2) is:

$$\delta(t_1, t_2) = \frac{\sum_{d=1}^{D} \sum_{i=1}^{m_d} \sum_{j=b_i^{(d)}}^{b_{i+1}^{(d)}-2} \mathbb{1}\big(t_j^{(d)} = t_1 \wedge t_{j+1}^{(d)} = t_2\big)}{\sum_{d=1}^{D} \sum_{i=1}^{n_d-1} \mathbb{1}\big(t_i^{(d)} = t_1 \wedge t_{i+1}^{(d)} = t_2\big)}, \qquad (3.2)$$
Algorithm 7: AutoPhrase+ Viterbi Training
1   Input: corpus C and phrase quality Q.
2   Output: θ and δ.
3   initialize θ with normalized raw frequencies in the corpus
4   while θ does not converge do
5       while δ does not converge do
6           for d ← 1 to D do
7               S_d ← PGPS(C_d, Q, θ, δ) via Algorithm 6
8           update δ using S_1, S_2, …, S_D according to Eq. (3.2)
9       for d ← 1 to D do
10          S_d ← PGPS(C_d, Q, θ, δ) via Algorithm 6
11      update θ using S_1, S_2, …, S_D according to Eq. (3.3)
12  return θ and δ
where 1(·) denotes the indicator function; δ(t_1, t_2) is thus the unsegmented ratio among all adjacent t_1 t_2
pairs in the given corpus.
Similarly, once the segmentation S and the parameter δ are fixed, the closed-form solution
of θ_u can be derived as:

$$\theta_u = \frac{\sum_{d=1}^{D} \sum_{i=1}^{m_d} \mathbb{1}\big(w_{[b_i, b_{i+1})}^{(d)} = u\big)}{\sum_{d=1}^{D} \sum_{i=1}^{m_d} \mathbb{1}\big(|s_i^{(d)}| = |u|\big)}. \qquad (3.3)$$
We can see that θ_u is the number of times that u forms a complete segment, normalized by the number
of length-|u| segments.
As shown in Algorithm 7, our optimization strategy for learning δ and θ is a nested iterative
optimization process similar to that of SegPhrase+. In our case, given corpus C, the inner loop first
fixes θ and keeps adjusting the parameters δ, using the segmentation that maximizes p(S, C | Q, θ, δ),
until convergence. Then, in the outer loop, δ is fixed and θ is updated. This procedure is
iterated until a stationary point has been reached.
As in SegPhrase+, efficiency is the major reason we choose Hard-EM
instead of finding a maximum likelihood estimator of θ and δ using Soft-EM (i.e., the Baum-Welch
algorithm [Bishop, 2006]).
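Concretely, both closed-form updates reduce to counting over the segmented corpus. A sketch consistent with the PGPS sketch above (the input format and helper names are our own):

```python
from collections import Counter

def update_delta_theta(segmentations, tag_seqs):
    """Re-estimate delta (Eq. 3.2) and theta (Eq. 3.3) from fixed segmentations.
    segmentations[d]: list of (words, tags) segments of document d;
    tag_seqs[d]: full POS tag sequence of document d."""
    within, total = Counter(), Counter()
    seg_count, len_count = Counter(), Counter()
    for segs, tags in zip(segmentations, tag_seqs):
        for pair in zip(tags, tags[1:]):             # all adjacent tag pairs
            total[pair] += 1
        for words, seg_tags in segs:
            for pair in zip(seg_tags, seg_tags[1:]): # pairs inside one segment
                within[pair] += 1
            seg_count[tuple(words)] += 1
            len_count[len(words)] += 1
    delta = {pair: within[pair] / total[pair] for pair in total}
    theta = {u: c / len_count[len(u)] for u, c in seg_count.items()}
    return delta, theta
```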
The reconstruction exploits the rectified frequency in a more thorough way and thus yields a
better performance gain.
In addition, an independence feature is added for single-word phrases. Formally, it is the
ratio of the rectified frequency of a single-word phrase, given the context-aware phrasal segmen-
tation, over its raw frequency. Quality single-word phrases are expected to have large values. For
example, “united ” is likely to have an almost-zero ratio.
Efficiency Testing Environment. The following execution time experiments were all conducted
on the same machine mentioned in the previous chapter. The algorithm is fully implemented in
C++. The preprocessing includes tokenizers from Lucene and Stanford NLP, as well as the POS
tagger from TreeTagger.
[Figure: precision-recall curves of the compared methods on the three datasets (x-axis: recall; y-axis: precision).]
AutoPhrase+ outperforms the second best method (SegPhrase+) in absolute value when evaluated by Wiki phrases. Moreover,
the recall differences between AutoPhrase+ and its variant AutoPhrase, ranging from 10% to 30%,
shed light on the importance of modeling single-word phrases. Meanwhile, one can also observe
that there is always a big precision gap between AutoPhrase+ and the best baseline on all three
datasets. Unsurprisingly, the phrase chunking-based methods TF-IDF and TextRank work
poorly, because extraction and ranking are performed separately instead of being unified.
Across the two Latin-alphabet datasets, English and Spanish, the precision-recall curves of the dif-
ferent methods have similar shapes. AutoPhrase+ and AutoPhrase overlap in the beginning, but
later the precision of AutoPhrase drops earlier and it attains a lower recall, due to the lack of single-word
phrases. Even so, AutoPhrase works better than the previous state-of-the-art method SegPhrase+.
TextRank starts with a higher precision than TF-IDF, but its recall is very low because of the spar-
sity of the constructed co-occurrence graph. TF-IDF achieves reasonable recall but unsatisfactory
precision.
On the Chinese dataset, AutoPhrase+ and AutoPhrase show a clear gap from the very begin-
ning, unlike the trends on the English and Spanish datasets; this reflects that
single-word phrases are more important in Chinese. The major reason is that a
considerable number of high-quality phrases (e.g., person names) in Chinese have only one token
after tokenization. The performance of the Chinese segmentation model AnsjSeg is very competitive:
it is slightly better than WrapSegPhrase, especially when evaluated by human annotation, and
shows performance comparable to AutoPhrase. This is because it not only leverages training data
for segmentation, but also benefits from extensive engineering work, including a huge dictionary of popu-
lar Chinese entity names and specific rules for certain types of entities. As a consequence, AnsjSeg
can easily extract large numbers of well-known terms and people/location names. Outperforming such a
strong baseline further confirms the effectiveness of AutoPhrase+. TF-IDF is slightly better than
another pre-trained Chinese segmentation method, JiebaPSeg, while TextRank again works worst.
In conclusion, our proposed AutoPhrase+ consistently works the best among all compared
methods, demonstrating its effectiveness on three datasets in different languages. The dif-
ference between AutoPhrase+ and AutoPhrase shows the necessity of modeling single-word phrases.
To be fair, all configurations of the classifiers are the same except for the label selection
process. More specifically, we consider four training pools:
1. EP means that domain experts provide the positive pool.
2. DP means that a subset sampled from an existing general knowledge base forms the positive pool.
3. EN means that domain experts provide the negative pool.
4. DN means that all unlabeled (i.e., not in the positive pool) phrase candidates form the negative
pool.
By combining any pair of the positive and negative pools, we have four variants, EPEN (in
SegPhrase+), DPDN (in AutoPhrase+), EPDN, and DPEN.
Figure 3.6: AUC curves of four variants when we have enough positive labels in the positive pool EP.
3.3. EXPERIMENTAL STUDY 51
First of all, we evaluate the performance difference between the two positive pools. Compared to
EPEN, DPEN adopts a positive pool sampled from knowledge bases instead of the well-designed
positive pool given by domain experts; the negative pool EN is shared. As shown in Figure 3.6, we
vary the size of the positive pool and plot the AUC curves. We find that EPEN outperforms
DPEN, and the trends of the curves on both datasets are similar. Therefore, we conclude that the positive
pool generated from knowledge bases has reasonable quality, although its corresponding quality
estimator works slightly worse.
Secondly, we verify whether the proposed noise reduction mechanism works properly.
Compared to EPEN, EPDN adopts a negative pool of all unlabeled phrase candidates instead of
the well-designed negative pool given by domain experts; the positive pool EP is shared. In
Figure 3.6, the clear gap between them and the similar trends on both datasets show that the noisy
negative pool is slightly worse than the well-designed negative pool, but it still works effectively.
As illustrated in Figure 3.6, DPDN has the worst performance when the size of the positive pool
is limited. However, distant training can generate much larger positive pools, which may sig-
nificantly exceed what domain experts can provide given the high expense of labeling. Conse-
quently, we are curious whether distant training can eventually beat domain experts once the positive
pool becomes large enough. We call the size at this tipping point the ideal number.
Figure 3.7: AUC curves of four variants after we exhaust positive labels in the positive pool EP.
We increase the positive pool size and plot the AUC curves of DPEN and DPDN, while EPEN and
EPDN degenerate into dashed lines due to the limited labeling capacity of domain experts. As shown in
Figure 3.7, with a large enough positive pool, distant training is able to beat expert labeling. On
the DBLP dataset, the ideal number is about 700, while on the Yelp dataset it is around
1,600. Our guess is that the ideal training size is proportional to the number of words (e.g., 91.6M
in DBLP and 145.1M in Yelp). We notice that, compared to the corpus size, the ideal number is
relatively small, which implies that distant training should be effective on many domain-specific
corpora, as long as they overlap with Wikipedia.
Besides, Figure 3.7 shows that as the positive pool size continues to grow, the AUC
score increases but with a diminishing slope. The performance of distant training eventually
becomes stable once a relatively large number of quality phrases has been fed in.
[Figure 3.8: AUC of DPDN on the DBLP and Yelp datasets as the number of trees T grows.]
We are curious how many trees (i.e., how large T) are enough for DPDN. We increase T and plot the AUC
curves of DPDN. As shown in Figure 3.8, on both datasets, as T grows, the AUC scores first
increase rapidly and then level off gradually, which is consistent with the theoretical
analysis in Section 3.2.1.
[Figure: precision-recall curves of AutoPhrase and AutoSegPhrase against SegPhrase, WrapSegPhrase, and JiebaSeg on the three datasets (x-axis: recall; y-axis: precision).]
[Figure: efficiency results—(a) running time and (b) peak memory versus the portion of data; (c) multi-threading speedup versus the number of threads.]
Besides, compared to the previous state-of-the-art phrase mining method SegPhrase+ and
its variant WrapSegPhrase on the three datasets, as shown in Table 3.3, AutoPhrase+ achieves about
an 8 to 11 times speedup and about a 5 to 7 times improvement in memory usage. These improvements
come from more efficient indexing and more thorough parallelization.
Table 3.4: The results of AutoPhrase+ on the EN and CN datasets, with translations and explanations
for the Chinese phrases. The whitespace on the CN dataset is inserted by the Chinese tokenizer.

Rank    | EN Phrase                | CN Phrase Translation (Explanation)
1       | Elf Aquitaine            | (the name of a soccer team)
2       | Arnold Sommerfeld        | Absinthe
3       | Eugene Wigner            | (the name of a novel or a TV series)
4       | Tarpon Springs           | notebook computer, laptop
5       | Sean Astin               | Secretary of Party Committee
…       | …                        | …
20,001  | ECAC Hockey              | African countries
20,002  | Sacramento Bee           | The Left (German: Die Linke)
20,003  | Bering Strait            | Fraser Valley
20,004  | Jacknife Lee             | Hippocampus
20,005  | WXYZ-TV                  | Mitsuki Saiga (a voice actress)
…       | …                        | …
99,994  | John Gregson             | Computer Science and Technology
99,995  | white-tailed eagle       | Fonterra (a company)
99,996  | rhombic dodecahedron     | the Vice President of the Writers Association of China
99,997  | great spotted woodpecker | Vitamin B
99,998  | David Manners            | controlled guidance of the media
…       | …                        | …
CHAPTER 4
Phrase Mining Applications
In the following sections, we introduce four applications to showcase the impact of the
phrase mining results, and discuss the research frontier.
Table 4.1: Representations for the query “DBSCAN is a method for clustering in process of knowledge
discovery,” returned by various categories of methods

Categories           | Representation
Words                | dbscan, method, clustering, process, ...
Topics               | [k-means, clustering, clusters, dbscan, ...]; [clusters, density, dbscan, clustering, ...]; [machine, learning, knowledge, mining, ...]
KB Concepts          | data mining, clustering analysis, dbscan, ...
Document Keyphrases  | dbscan: [dbscan, density, clustering, ...]; clustering: [clustering, clusters, partition, ...]; data mining: [data mining, knowledge, ...]
Figure 4.1: Overview of LAKI. White and grey nodes represent quality phrases and content units,
respectively.
The offline phase is critical in the sense that the online inference phase can be formulated
as its sub-process. Technically speaking, the learning is done by optimizing a statistical Bayesian
network, given the observed content units (i.e., words and phrases after phrasal segmentation) in a
training corpus. We use a DAG-like Bayesian network, shown in Figure 4.2. Content units are
located at the bottom layer and quality phrases form the rest. Both types of nodes act as binary
variables¹ and directional links between nodes depict their dependencies.
Before diving into the details, we motivate our Bayesian approach to the silhouetting prob-
lem. First, this approach enables our model to infer more than just explicitly mentioned document
keyphrases. For example, even if the text contains only “html” and “css,” the words “web page”
come to mind. But more than that, a multi-layered network will activate an ancestor quality
phrase like “world wide web” even when it is not directly linked to “html” or “css,” which are con-
tent units in the bottom layer.
¹For multiple mentions of a content unit, we can simply make several copies of that node together with its links.
[Figure 4.2: A DAG-like Bayesian network. Quality phrases (K1–K5) occupy the upper layers; content units (T1–T6) form the bottom layer.]
Meanwhile, we expect to identify document keyphrases with different relatedness scores.
Reflected in this Bayesian model from a top-down view, when a parent quality phrase is activated,
its children with stronger connections are more likely to be activated as well.
Furthermore, this formulation is flexible. We allow a content unit to be activated by each
connected quality phrase, as well as by a random noise factor (not shown in Figure 4.2), behaving
like a Noisy-OR, i.e., a logical OR gate with some probability of producing a “noisy” output. This
increases the robustness of the model, especially when the training documents are noisy.
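The Noisy-OR activation itself is a one-liner. A sketch (the leak parameter stands in for the random noise factor; all names are our own):

```python
def noisy_or(parent_states, weights, leak=0.01):
    """P(node = 1 | parents): each active parent i independently fails to
    activate the node with probability 1 - weights[i]; `leak` is the noise."""
    fail = 1.0 - leak
    for active, w in zip(parent_states, weights):
        if active:
            fail *= 1.0 - w
    return 1.0 - fail
```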
There are two challenges in the learning process: (1) how to learn the link weights given the
fixed Bayesian network structure; and (2) how to initialize, i.e., decide this structure
and set the initial link weights.
For the former, to effectively learn link weights, Maximum Likelihood Estimation (MLE)
is adopted. The intuition is to maximize the likelihood of observing content units together with
partially observed document keyphrases.² But this is extremely difficult to optimize directly due to
the latent states of the remaining quality phrases. In this case, one usually resorts to the Expectation-
Maximization (EM) algorithm, which is guaranteed to reach a local optimum. The EM
algorithm starts with some initial guess at the link weights and then proceeds to iteratively
generate successive estimates by repeatedly applying the E-step (Expectation step) and M-step
(Maximization step) until the MLE objective changes minimally.
Expectation Step: The E-step computes the conditional probability of the unobserved document
keyphrases, considering all their state combinations. It turns out that this step
is exactly what we conduct in the online inference phase; in other words, the online
inference phase is just a sub-process of the offline training phase.
²Explicit document keyphrases can be identified by applying existing keyphrase extraction methods like Witten et al. [1999].
Unfortunately, the E-step cannot be executed exactly. Since each latent quality phrase in
Figure 4.2 acts as a binary variable, the number of possible state combinations can be as large as O(2^n).
That is to say, accurately computing the probabilities required in the E-step is NP-hard for a Bayesian
network like ours [Cooper, 1990]. We therefore adopt two approaches to approximately collect
sufficient statistics. The first idea is to apply a sampling technique such as Markov Chain Monte
Carlo to search for the most likely state combinations. Among the Monte Carlo family, we apply
Gibbs sampling in this work to sample the quality phrase variables during each E-step. Given the content
unit vector representing a document, we proceed as follows.
1. Start with an initial setting: only observed content units and explicit document keyphrases are
set to true.
2. In each sampling step, sequentially sample each quality phrase node from its conditional
distribution given the fixed states of all other nodes.
The above Gibbs sampling process ensures that the samples approximate the joint probability distri-
bution over all phrase variables and content units.
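One sweep of such a sampler can be sketched as follows, with `conditional_prob` standing in for the model-specific conditional of a node given all the others:

```python
import random

def gibbs_sweep(states, latent_nodes, conditional_prob):
    """One Gibbs sampling sweep over the latent quality phrase nodes.
    states: dict node -> bool (observed nodes are kept fixed elsewhere);
    conditional_prob(node, states): P(node = 1 | all other nodes)."""
    for node in latent_nodes:
        states[node] = random.random() < conditional_prob(node, states)
    return states
```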
The second approach is applied as preprocessing right before the E-step. The idea is to
exclude quality phrases that we are confident are unrelated. Intuitively, only a small portion of
quality phrases are related to the observed text; there is no need to sample all phrase nodes since
most of them have no chance of being activated. That is to say, we can skip the majority of them
based on a reasonable relatedness prediction before conducting Gibbs sampling. We adopt a local
arborescence structure [Wang et al., 2012] to approximate the original Bayesian network, which
allows us to roughly approximate the score for each node efficiently. We omit the
technical details here; interested readers can refer to the original paper [Liu et al., 2016].
Maximization Step: The M-step updates the link weights based on the sufficient statistics
collected in the E-step. In this problem setting, we can obtain a closed-form
solution by taking the derivative of the MLE objective function.
The remaining challenge is to decide the Bayesian network structure and set the initial link
weights. A reasonable topological order of the DAG should be similar to that of a domain ontology:
the links among quality phrase nodes should reflect IS-A relationships [Yin and Shah, 2010].
Ideally, documents describing specific topics will first imply that some deep quality phrase
nodes are activated. The ontology-like topological order then ensures these content units have
the chance of being jointly activated by general phrase nodes via inter-phrase links. Many tech-
niques [Dahab et al., 2008, Sanderson and Croft, 1999, Yin and Shah, 2010] have previously
been developed to induce an ontological structure over quality phrases. It is out of the scope of this
work to specifically address these or evaluate their relative impact. We instead
use a simple data-driven approach, where quality phrases are sorted by their counts in the
corpus, assuming phrase generality is positively correlated with the number of mentions. Thus,
quality phrases mentioned more often are higher up in the graph. Links are added between qual-
ity phrases when they are closely related and frequently co-occur. The link weights between
nodes are simply set to their similarity scores computed from Word2Vec [Mikolov et al.,
2013].
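With gensim's Word2Vec, for instance, the initialization could look roughly like this (the corpus format and the non-negativity clipping are our own illustrative choices):

```python
from gensim.models import Word2Vec

# sentences: the corpus after phrasal segmentation, with multi-word phrases
# glued by underscores, e.g.,
# [["kernel_k-means", "is", "a", "clustering", "algorithm"], ...]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)

def initial_link_weight(parent_phrase, child_node):
    """Initial weight of the link parent -> child in the Bayesian network,
    set to the (clipped) embedding similarity of the two nodes."""
    return max(0.0, model.wv.similarity(parent_phrase, child_node))
```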
To verify the effectiveness of LAKI, we present several queries with their top-ranked docu-
ment keyphrases, generated by the online phase of LAKI, in Table 4.2. Overall, we see that the
method handles both short and long queries quite well. Most document keyphrases are suc-
cessfully identified in the list. The relatedness between keyphrases and the query generally drops
as the rank lowers. Meanwhile, both general and specific document keyphrases exist in the
ranked list. This provides LAKI with more discriminative power when applied to text
mining applications like document clustering and classification. Moreover, the method has the
ability to process ambiguous queries like “lda” based on contextual words such as “topic.” We attribute
this to the well-modeled quality phrase silhouettes, examples of which are shown in Ta-
ble 4.3. As a quality phrase silhouette might contain many content units, we only show the
ones with the most significant link weights. For ease of presentation, link weights are omitted in
the table.
³http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
Table 4.2: Examples of document representation by LAKI with top-ranked document keyphrases (relatedness scores are omitted due to space limits)

Academia: text analytics, text mining, patterns, text, textual data, topic, information, text documents, information extraction, machine learning, data mining, knowledge discovery, etc.
Yelp: kobe beef, fish lovers, steakhouse, sancerre, wine list, guests, perfectly cooked, lobster, staff, meat, fillet, fish, lover, seafood, ribeye, filet, sea bass, risotto, starter, scallops, steak, beef, etc.
Table 4.3: Examples of quality phrase silhouettes (from offline quality phrase silhouette learning), for the Academia and Yelp corpora. Link weights are omitted.
By inspecting the quality phrases mined from reviews of those hotels, one can determine whether they offer a good way to explore the hotels of a city. It would be
difficult to mathematically define what is interesting, yet easy for a human to know it when they see
it. The human can also come up with a clever name for a collection, which is also simple given the list of quality
phrases.
[Figure 4.3: Clusters of quality phrases mined from hotel reviews, e.g., “interesting_views,” “unique_view,” “awesome_view-of-times-square,” “fantastic_view_of_central_park;” “comfortable_beds_and_pillows,” “comfy_beds_and_pillows,” “beds_super_comfortable;” “free-wi-fi-in-lobby,” “free-wifi-worked_perfectly,” “continental_type_breakfast;” “friendly_accommodating_staff,” “friendly_assistance;” “felt_completely_safe,” “felt_safe_and_secure.”]
Some interesting collections are shown in Figure 4.4. The whole process provides insight
into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby
attractions.
To systematically analyze large numbers of textual documents, another approach is to man-
age documents (and their associated metadata) in a multi-dimensional fashion (e.g., document
category, date/time, location, author, etc.). Such structure provides the flexibility of understanding
local information at different granularities. Moreover, the contextualized analysis often yields
comparative insights: given a pair of document collections, how can one identify common and
distinct content units of various granularities (e.g., words, sentences)?
However, word-based summarization suffers from limited readability, as single words are
usually non-informative and the bag-of-words representation does not capture the semantics of the
original document well—it may not be easy for users to interpret the combined meaning of the
words. Sentence-based summarization, on the other hand, may be too verbose to highlight the
general commonalities and differences—users may be distracted by the irrelevant information
contained there. Recent studies [Ren et al., 2017a, Tao et al., 2016] leverage quality phrases,
i.e., minimal semantic units, to summarize the commonalities and differences. Figure 4.5 gives an
example where an analyst may pose multidimensional queries, and the system leverages quality phrases to summarize the queried cells.
Figure 4.4: Left: collection “catch a show;” Right: collection “near the high line.”
[Figure 4.5: A multi-dimensional text database over news documents, with dimensions Location, Topic, and Time. Query (q1) selects the cell 〈China, Economy〉; the cell 〈US, Gun Control〉 is summarized by representative phrases such as “massacre at sandy hook elementary,” “long island railroad,” “background check,” “senate armed services committee,” “adam lanza,” and “buyback program.”]
Example 4.1 Suppose a multi-dimensional text database is constructed from The New York
Times news repository with three meta attributes: Location, Topic, and Time, as shown in Fig-
ure 4.5. An analyst may pose multidimensional queries such as (q1): 〈China, Economy〉 and
(q2): 〈US, Gun Control〉. Each query asks for a summary of a cell defined by the two dimensions Lo-
cation and Topic. What kind of cell summary would she like to see? Frequent unigrams such as
debt or senate are not as informative as multi-word phrases, such as local government debt and
senate armed services committee. The phrases preserve better semantics as integral units rather
than as separate words.
Generally, three criteria should be considered when ranking representative phrases in a
selected multidimensional cell: (i) integrity: a phrase that forms an integral semantic unit should
be preferred over non-integral unigrams; (ii) popularity: the phrase is popular in the selected cell (i.e., the selected
subset of documents); and (iii) distinctiveness: the phrase distinguishes the selected cell from other cells.
Within the whole ranked phrase list, the top-k representative phrases normally carry the most
value for users in text analysis. Furthermore, the top-k query also enjoys computational
advantages, allowing users to conduct fast analysis.
Bearing all this in mind, the authors designed statistical measures for each criterion
and use the geometric mean of the three scores as the ranking signal; a simplified sketch follows the principles below. The specific design principles
are as follows.
1. Popularity and distinctiveness of a phrase depend on the target cell, while integrity does
not.
2. Popularity and distinctiveness can be measured from the frequency statistics of a phrase in each
cell, while integrity cannot. To measure integrity, one needs to investigate each occurrence
of the phrase and other phrases to determine whether that phrase is indeed an integral
semantic unit; the quality score provided by SegPhrase+ and AutoPhrase+ is a good indicator.
3. Popularity relies on statistics from documents only within the cell, while distinctiveness
relies on documents both in and out of the cell. The documents involved in the distinctive-
ness calculation are defined as the contrastive document set. In the particular algorithm
design in the paper, the sibling set of the query cell is used as the contrastive document set.
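The sketch below instantiates these principles with deliberately simple stand-ins for the three measures (the statistical measures in the paper are more refined):

```python
def top_k_representative(cell_freq, sibling_freq, quality, k=10):
    """cell_freq: phrase -> frequency in the selected cell;
    sibling_freq: phrase -> frequency in the contrastive (sibling) cells;
    quality: phrase -> integrity score from SegPhrase+/AutoPhrase+."""
    total = sum(cell_freq.values())

    def score(p):
        popularity = cell_freq[p] / total
        distinctiveness = cell_freq[p] / (cell_freq[p] + sibling_freq.get(p, 0))
        integrity = quality.get(p, 0.0)
        return (popularity * distinctiveness * integrity) ** (1 / 3)  # geo. mean

    return sorted(cell_freq, key=score, reverse=True)[:k]
```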
The algorithm was applied to The New York Times 2013–2016 dataset and the PubMed⁴ Cardiac
dataset; representative phrase lists are shown in Table 4.4 and Table 4.5. It is reported
that using phrases achieves the best trade-off among processing efficiency, storage cost, and
summarization interpretability.
⁴PubMed is a free full-text archive of biomedical and life sciences journal literature.
66 4. PHRASE MINING APPLICATIONS
Table 4.4: Top-10 representative phrases for The New York Times queries
Figure 4.6: Embedding document keyphrases with metadata into the same space.
Different from the previous contrastive analysis, the embedding approach largely relies on
data redundancy to automatically infer relatedness. By viewing the association between any
document and its metadata as a link, the algorithm pushes the embeddings of the link's constituent
nodes closer to each other. For instance, observing that the phrase “data mining” often appears together
with the venue “SIGKDD” rather than “SIGMOD” in a bibliographic corpus, the embedding
distance between “data mining” and “SIGKDD” should be smaller than that between “data
mining” and “SIGMOD.”
The underlying technique essentially minimizes the Kullback-Leibler divergence be-
tween the model distribution and the empirical distribution defined on a target node given the other
participating nodes on the same link as context. Practically, one can solve the optimization prob-
lem by requiring the likelihood of an observed link to be higher than that of a corresponding “fake”
link with one of the constituent nodes replaced by a randomly sampled node. For more
technical and experimental details, please refer to our recent paper [Gui et al., 2016].
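A minimal sketch of one training step under this idea, written as a hinge-style update over an observed link and a corrupted one (the actual objective in the paper differs in detail; names and hyperparameters are ours):

```python
import random
import numpy as np

def link_step(emb, doc, meta, all_nodes, lr=0.025, margin=1.0):
    """One SGD step on embeddings emb (dict node -> np.ndarray): pull the
    observed (doc, meta) link together, push a corrupted link apart."""
    fake = random.choice(all_nodes)          # replace meta by a random node
    d_pos = np.linalg.norm(emb[doc] - emb[meta])
    d_neg = np.linalg.norm(emb[doc] - emb[fake])
    if d_pos + margin > d_neg:               # margin violated: adjust vectors
        g_pos = (emb[doc] - emb[meta]) / (d_pos + 1e-9)
        g_neg = (emb[doc] - emb[fake]) / (d_neg + 1e-9)
        emb[doc] += lr * (g_neg - g_pos)     # toward meta, away from fake
        emb[meta] += lr * g_pos
        emb[fake] -= lr * g_neg
```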
[Figure 4.7: Overview of the joint entity/relation typing framework. From a text corpus, candidate entity mentions are detected and relation mentions are automatically labeled using a knowledge base—e.g., (“Barack Obama,” “US,” S1) with entity types {person, politician, artist, author} and {ORG, LOC} and relation types {president_of, born_in, citizen_of, travel_to}, or (“Obama,” “Dreams of My Father,” S2) with relation type {author_of}. Relation and entity mentions, text features (e.g., CONTEXT_president, BETWEEN_fly, LEFT_book), and type labels are then jointly embedded, and test mentions are typed against the target type hierarchy.]
By detecting entity mentions in text, labeling their structured types, and inferring their relations, it is possible to construct or enrich
a semantically rich knowledge base and provide conceptual summarization of such data.
Existing entity detection tools such as noun phrase chunkers are trained on general-domain
corpora (e.g., news articles), but they do not work effectively or efficiently on domain-specific
corpora such as Yelp reviews (e.g., “pulled pork sandwich” cannot be easily detected). Meanwhile,
the process of manually labeling a training set with a large number of entity and relation types is
too expensive and error-prone. Therefore, a major challenge is how to design a domain-independent
system that applies to text corpora from different domains in the absence of human-annotated,
domain-specific data. The rapid emergence of large, domain-specific text corpora (e.g., news, scientific
publications, social media content) calls for methods that can extract entities and relations of
target types with minimal or no human supervision.
Recently, Ren et al. [2017b] introduced a novel joint entity/relation typing algorithm in-
spired by our corpus-scope phrase mining framework. The work extends our phrasal segmen-
tation to detect entity mentions, and then jointly embeds entity mentions, relation mentions,
text features, and type labels into two low-dimensional spaces (for entity and relation mentions,
respectively), where, in each space, objects whose types are close also have similar represen-
tations.
Figure 4.7 illustrates the framework, which comprises the following four major steps.
1. Run the phrase mining algorithm on a text corpus, using positive examples obtained from an
existing knowledge base, to detect candidate entity mentions.
2. Generate candidate relation mentions (sentences mentioning two candidate entities), and ex-
tract text features for each relation mention and its entity mention arguments. Apply dis-
tant supervision to generate labeled training data.
3. Jointly embed relation and entity mentions, text features, and type labels into two low-
dimensional spaces (for entities and relations, respectively) where, in each space, close ob-
jects tend to share the same types.
4. Estimate type labels for each test relation mention and type-path for each test entity men-
tion from learned embeddings, by performing nearest neighbor search in the target type set
or the target type hierarchy.
Similar to AutoPhrase+, the first step utilizes POS tags and uses quality examples from the KB
as guidance to model segment quality (i.e., “how likely a candidate segment is an entity men-
tion”). But the detailed workflow is slightly different: (1) mine frequent contiguous patterns for
both word sequences and POS tag sequences up to a fixed length from the POS-tagged corpus; (2) ex-
tract features, including corpus-level concordance and sentence-level lexical signals, to train two
random forest classifiers that estimate the quality of candidate phrases and candidate POS patterns;
(3) find the best segmentation of the corpus using the estimated segment quality scores; and
(4) compute rectified features using the segmented corpus and repeat steps (2)–(4) until the result
converges. Figure 4.8 shows the high/low quality POS patterns learned using entity names found
in the corpus.
After generating the set of candidate relation mentions from the detected candidate entity
mentions, the authors propose to apply network embedding to help infer entity and relation types.
Intuitively, two relation mentions sharing many text features (i.e., with similar distributions over
the set of text features, including head tokens, POS tags, entity mention order, etc.) likely have sim-
ilar relation types; and text features co-occurring with many relation mentions in the corpus tend
to represent close type semantics. For example, in Figure 4.9, (“Barack Obama,” “US,” S1) and
(“Barack Obama,” “United States,” S3) share multiple features, including the context word “presi-
dent” and the first entity mention argument “Barack Obama,” and thus they are likely of the same
relation type (i.e., “president_of ”).
[Figure 4.9: A bipartite graph between entity mentions (e.g., S1_“Barack Obama,” S3_“Barack Obama,” S3_“United States,” S2_“Dream of My Father,” S1_“US,” S2_“Obama”), text features (e.g., HEAD_Obama, CONTEXT_president, CONTEXT_book, TOKEN_States), and entity types (person, politician, artist, book, ORG, LOC, none).]
By further modeling the noisy distant labels from the knowledge base and enforcing the addi-
tive operation⁵ in the vector space, a joint optimization objective is formulated to learn
embeddings for both relation/entity mentions and relation/entity types.
It is reported in the paper that the effectiveness of joint extraction of typed entities and
relations has been verified across different domains (e.g., news, biomedical), with an average
25% improvement in F1 score over the next best method. Table 4.6 shows the output of
the CoType algorithm, together with two other competitive algorithms, on two news sentences
from the Wiki-KBP⁶ dataset.
1. Multi-sense Phrase Mining. During the process of phrase mining, we typically assume a
phrase is represented as a contiguous word sequence. Our next challenge is to explore the
underlying concept behind each phrase mention and further rectify the phrase frequency. Such
refinement encounters two problems: (1) variety: many phrase mentions may refer to the
same concept (e.g., “page rank” and “PageRank,” cytosol and cytoplasm); and (2) ambiguity:
multiple concepts may share the same phrase mention (e.g., PCR can refer to polymerase
chain reaction or principal component regression). Such refinement is easier to achieve from
the perspective of phrase evolution and contextual topics. When relational databases were first
introduced in 1970, “data base” was a simple composition of two words; then, with its
gained popularity, people even invented a new word, “database,” clearly a whole seman-
tic unit. In the context of machine learning, PCR certainly refers to principal component
regression instead of polymerase chain reaction.
2. Phrase Mining For Users. This book mainly introduces techniques for extracting phrases
from documents. It can often be observed that unstructured textual data and users are in-
terconnected, particularly in the big data era where social networks and user-created content
have become popular. Together with mining massive unstructured data, one can expect to create
profiles for users in the form of salient phrases by analyzing their activities, which can
be utilized for future recommendation and behavior analysis. One promising solution is to
model user-content interactions as an information network in which links connect
different types of nodes such as documents, keyphrases, and users. Such a data model allows
information propagation, and many network-based algorithms can be applied.
3. Phrase Mining For Fresh Content. All the methods discussed in this book are data-driven
and rely on frequent phrase mentions to a certain extent. Accordingly, a large-scale dataset
is necessary to provide such data redundancy. On the other hand, the same philosophy is not
suitable for fresh content, where a new phrase has just formed. Instead of purely depending
on phrase mentions, contextual knowledge such as Hearst patterns is also useful. It
remains an open problem to learn such textual patterns automatically and effectively.
As time goes by, the statistics of a phrase will eventually become sufficient, allowing our proposed
methods to show their power. It would be interesting to see how much a hybrid method could benefit
in this scenario as well.
Bibliography
Khurshid Ahmad, Lee Gillam, and Lena Tostevin. University of Surrey participation in TREC8:
Weirdness indexing for logical document extrapolation and retrieval (WILDER). In TREC, 1999.
21
Armen Allahverdyan and Aram Galstyan. Comparative analysis of Viterbi training and maxi-
mum likelihood estimation for HMMs. In Advances in Neural Information Processing Systems 24,
pages 1674–1682, 2011. 15, 16
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. of
the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
Srikanta Bedathur, Klaus Berberich, Jens Dittrich, Nikos Mamoulis, and Gerhard Weikum.
Interesting-phrase mining for ad-hoc text analytics. Proc. of the VLDB Endowment, 3(1-2):
1348–1357, 2010. DOI: 10.14778/1920841.1921007. 29
Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York,
Inc., 2006. 15, 45
Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cy-
ganiak, and Sebastian Hellmann. Dbpedia-a crystallization point for the web of data. Web
Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009. DOI:
10.1016/j.websem.2009.07.002. 56
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003. 55, 60
Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3):
229–242, 2000. Springer. 39
Kuang-hua Chen and Hsin-Hsi Chen. Extracting noun phrases from large-scale texts: A hybrid
approach and its automatic evaluation. In Proc. of the 32nd Annual Meeting on Association for
Computational Linguistics, pages 234–241, 1994. DOI: 10.3115/981732.981764. 5
Gregory F Cooper. The computational complexity of probabilistic inference using Bayesian belief
networks. Artificial Intelligence, 42(2):393–405, 1990. DOI: 10.1016/0004-3702(90)90060-d.
59
Mohamed Yehia Dahab, Hesham A Hassan, and Ahmed Rafea. Textontoex: Automatic ontology
construction from natural english text. Expert Systems with Applications, 34(2):1474–1480,
2008. DOI: 10.1016/j.eswa.2007.01.043. 59
Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han. Automatic
construction and ranking of topical keyphrases on collections of short documents. In Proc. of the
SIAM International Conference on Data Mining, 2014. DOI: 10.1137/1.9781611973440.46. 6
Paul Deane. A nonparametric method for extraction of candidate phrasal terms. In Proc. of the
43rd Annual Meeting on Association for Computational Linguistics, pages 605–613, 2005. DOI:
10.3115/1219840.1219915. 35, 36
Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A.
Harshman. Indexing by latent semantic analysis. Journal of the American Society for Informa-
tion Science, 41(6):391–407, 1990. DOI: 10.1002/(sici)1097-4571(199009)41:6%3C391::aid-
asi1%3E3.0.co;2-9. 55
Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R Voss, and Jiawei Han. Scalable top-
ical phrase mining from text corpora. Proc. of the VLDB Endowment, 8(3), 2015. DOI:
10.14778/2735508.2735519. 10, 35, 36, 47, 60
Geoffrey Finch. Linguistic Terms and Concepts. Macmillan Press Limited, 2000. DOI:
10.1007/978-1-349-27748-3. 37, 43
Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word
terms: The C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130,
2000. DOI: 10.1007/s007999900023. 5, 10, 35
Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using wikipedia-
based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artificial
Intelligence, pages 1606–1611, 2007. 55
Chuancong Gao and Sebastian Michel. Top-k interesting phrase mining in ad-hoc collections
using sequence pattern indexing. In Proc. of the 15th International Conference on Extending
Database Technology, pages 264–275, 2012. DOI: 10.1145/2247596.2247628. 29
Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learn-
ing, 63(1):3–42, 2006. Springer. 39
Thomas Gottron, Maik Anderka, and Benno Stein. Insights into explicit semantic analysis. In
Proc. of the 20th ACM International Conference on Information and Knowledge Management,
pages 1961–1964, 2011. DOI: 10.1145/2063576.2063865. 55
Ziyu Guan, Long Chen, Wei Zhao, Yi Zheng, Shulong Tan, and Deng Cai. Weakly-supervised
deep learning for customer review sentiment classification. In Proc. of the 25th International
Joint Conference on Artificial Intelligence, pages 3719–3725, 2016. 60
Huan Gui, Jialu Liu, Fangbo Tao, Meng Jiang, Brandon Norick, and Jiawei Han. Large-scale
embedding learning in heterogeneous event data. In Proc. of the IEEE International Conference
on Data Mining, 2016. DOI: 10.1109/icdm.2016.0111. 67
John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979. DOI:
10.2307/2346830.
Samer Hassan and Rada Mihalcea. Semantic relatedness using salient semantic analysis. In Proc.
of the 25th AAAI Conference on Artificial Intelligence, pages 884–889, 2011. 55
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld.
Knowledge-based weak supervision for information extraction of overlapping relations. In
Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan-
guage Technologies, volume 1, pages 541–550, 2011.
Terry Koo, Xavier Carreras Pérez, and Michael Collins. Simple semi-supervised dependency
parsing. In 46th Annual Meeting of the Association for Computational Linguistics, pages 595–
603, 2008. 5
Roger Levy and Christopher Manning. Is it harder to parse Chinese, or the Chinese tree-
bank? In Proc. of the 41st Annual Meeting on Association for Computational Linguistics, volume 1,
pages 439–446, 2003. DOI: 10.3115/1075096.1075152. 47
Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. Unsupervised query seg-
mentation using clickthrough for information retrieval. In Proc. of the 34th International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages 285–294, 2011.
DOI: 10.1145/2009916.2009957. 15
Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. Mining quality phrases from
massive text corpora. In Proc. of the ACM SIGMOD International Conference on Management
of Data, pages 1729–1744, 2015. DOI: 10.1145/2723372.2751523. 36, 47, 55
Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare R. Voss, and Jiawei Han. Representing
documents via latent keyphrase inference. In Proc. of the 25th International Conference on World
Wide Web, pages 1057–1067, 2016. DOI: 10.1145/2872427.2883088. 55, 56, 59
Gonzalo Martínez-Muñoz and Alberto Suárez. Switching class labels to generate classification
ensembles. Pattern Recognition, 38(10):1483–1494, 2005. Elsevier. 39
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency
parsing using spanning tree algorithms. In Proc. of the Conference on Human Language Tech-
nology and Empirical Methods in Natural Language Processing, pages 523–530, 2005. DOI:
10.3115/1220575.1220641. 5
Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts. In Proc. of the Conference on
Empirical Methods in Natural Language Processing, 2004. 47
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in Neural Information
Processing Systems 26, pages 3111–3119, 2013. 29, 60
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation ex-
traction without labeled data. In Proc. of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP,
volume 2, pages 1003–1011, 2009. DOI: 10.3115/1690219.1690287.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christo-
pher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Uni-
versal dependencies v1: A multilingual treebank collection. In Proc. of the 10th International
Conference on Language Resources and Evaluation, pages 1659–1666, 2016. 47
Deepak Padmanabhan, Atreyee Dey, and Debapriyo Majumdar. Fast mining of interesting
phrases from subsets of text corpora. In Proc. of the 17th International Conference on Extending
Database Technology, pages 193–204, 2014. 29
Aditya Parameswaran, Hector Garcia-Molina, and Anand Rajaraman. Towards the web of con-
cepts: Extracting concepts from large datasets. Proc. of the VLDB Endowment, 3(1-2):566–577,
2010. DOI: 10.14778/1920841.1920914. 6, 35, 36, 41, 47
Youngja Park, Roy J Byrd, and Branimir K Boguraev. Automatic glossary extraction: Beyond
terminology identification. In Proc. of the 19th International Conference on Computational Lin-
guistics, volume 1, pages 1–7, 2002. DOI: 10.3115/1072228.1072370. 5, 10, 35
Vasin Punyakanok and Dan Roth. The use of classifiers in sequential inference. In Advances in
Neural Information Processing Systems 13, pages 995–1001, 2001. 5
Xiang Ren, Yuanhua Lv, Kuansan Wang, and Jiawei Han. Comparative document analysis for
large text corpora. In Proc. of the 10th ACM International Conference on Web Search and Data
Mining, 2017a. DOI: 10.1145/3018661.3018690. 63
Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare Voss, Heng Ji, Tarek Abdelzaher, and Jiawei
Han. CoType: Joint extraction of typed entities and relations with knowledge bases. In Proc.
of the 26th International Conference on World Wide Web, 2017b. 68
Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In Proc. of the 22nd
Annual International ACM SIGIR Conference on Research and Development in Information Re-
trieval, pages 206–213, 1999. DOI: 10.1145/312624.312679. 59
Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In New Methods in
Language Processing, page 154, 2013. 36
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare Voss, and Jiawei Han. Automated Phrase
Mining from Massive Text Corpora, arXiv preprint arXiv:1702.04457v1, 2017. 55
Alkis Simitsis, Akanksha Baid, Yannis Sismanis, and Berthold Reinwald. Multidimen-
sional content exploration. Proc. of the VLDB Endowment, 1(1):660–671, 2008. DOI:
10.14778/1453856.1453929. 5
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. Short text
conceptualization using a probabilistic knowledge base. In Proc. of the 22nd International Joint
Conference on Artificial Intelligence, pages 2330–2336, 2011. DOI: 10.5591/978-1-57735-516-
8/IJCAI11-388. 55
Fangbo Tao, Honglei Zhuang, Chi Wang Yu, Qi Wang, Taylor Cassidy, Lance Kaplan, Clare
Voss, and Jiawei Han. Multi-dimensional, phrase-based summarization in text cubes. Data
Engineering, 39(3):74–84, 2016. 63
Beidou Wang, Can Wang, Jiajun Bu, Chun Chen, Wei Vivian Zhang, Deng Cai, and Xiaofei He.
Whom to mention: Expand the diffusion of tweets by @ recommendation on micro-blogging
systems. In Proc. of the 22nd International Conference on World Wide Web, pages 1331–1340,
2013. DOI: 10.1145/2488388.2488505. 60
Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade
model in large-scale social networks. Data Mining and Knowledge Discovery, 25(3):545–576,
2012. DOI: 10.1007/s10618-012-0262-1. 59
Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-Manning.
Kea: Practical automatic keyphrase extraction. In Proc. of the 4th ACM Conference on Digital
Libraries, pages 254–255, 1999. DOI: 10.1145/313238.313437. 47, 58
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. Probase: A probabilistic taxonomy
for text understanding. In Proc. of the ACM SIGMOD International Conference on Management
of Data, pages 481–492, 2012. DOI: 10.1145/2213836.2213891. 56
Endong Xun, Changning Huang, and Ming Zhou. A unified statistical model for the identifi-
cation of English basenp. In Proc. of the 38th Annual Meeting on Association for Computational
Linguistics, pages 109–116, 2000. DOI: 10.3115/1075218.1075233. 5
Xiaoxin Yin and Sarthak Shah. Building taxonomy of web search intents for name entity queries.
In Proc. of the 19th International Conference on World Wide Web, pages 1001–1010, 2010. DOI:
10.1145/1772690.1772792. 59
Ziqi Zhang, José Iria, Christopher A Brewster, and Fabio Ciravegna. A comparative evaluation
of term recognition algorithms. Proc. of the 6th International Conference on Language Resources
and Evaluation, 2008. 5, 35
Authors’ Biographies
JIALU LIU
Jialu Liu, an engineer at Google Research in New York, is working on structured data for knowl-
edge exploration. He received his B.Sc. from Zhejiang University, China, in 2007 and his Ph.D.
in computer science from the University of Illinois at Urbana-Champaign in 2015. His
research has been focused on scalable data mining, text mining, and information extraction.
JINGBO SHANG
Jingbo Shang is a Ph.D. candidate in the Department of Computer Science at the University of
Illinois at Urbana-Champaign. He received a B.Sc. from Shanghai Jiao Tong University, China
in 2014. His research focuses on mining and constructing structured knowledge from massive
text corpora.
JIAWEI HAN
Jiawei Han, Abel Bliss Professor, Department of Computer Science, the University of Illinois,
has been researching data mining, information network analysis, and database systems, and has
been involved in over 700 publications. He served as the founding Editor-in-Chief of ACM
Transactions on Knowledge Discovery from Data (TKDD). Jiawei received the ACM SIGKDD
Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), and
IEEE Computer Society W. Wallace McDowell Award (2009). He is a Fellow of ACM and
a Fellow of IEEE. His co-authored textbook, Data Mining: Concepts and Techniques (Morgan
Kaufmann), has been adopted worldwide.