
Phrase Mining from Massive Text and Its Applications


Synthesis Lectures on Data Mining and Knowledge Discovery
Editors
Jiawei Han, University of Illinois at Urbana-Champaign
Lise Getoor, University of California, Santa Cruz
Wei Wang, University of California, Los Angeles
Johannes Gehrke, Cornell University
Robert Grossman, University of Chicago
Synthesis Lectures on Data Mining and Knowledge Discovery is edited by Jiawei Han, Lise Getoor, Wei Wang, Johannes Gehrke, and Robert Grossman. The series publishes 50- to 150-page publications on topics pertaining to data mining, web mining, text mining, and knowledge discovery, including tutorials and case studies. Potential topics include: data mining algorithms, innovative data mining applications, data mining systems, mining text, web and semi-structured data, high performance and parallel/distributed data mining, data mining standards, data mining and knowledge discovery framework and process, data mining foundations, mining data streams and sensor data, mining multi-media data, mining social networks and graph data, mining spatial and temporal data, pre-processing and post-processing in data mining, robust and scalable statistical methods, security, privacy, and adversarial data mining, visual data mining, visual analytics, and data visualization.

Phrase Mining from Massive Text and Its Applications


Jialu Liu, Jingbo Shang, and Jiawei Han
2017

Exploratory Causal Analysis with Time Series Data


James M. McCracken
2016

Mining Human Mobility in Location-Based Social Networks


Huiji Gao and Huan Liu
2015

Mining Latent Entity Structures


Chi Wang and Jiawei Han
2015
Probabilistic Approaches to Recommendations
Nicola Barbieri, Giuseppe Manco, and Ettore Ritacco
2014

Outlier Detection for Temporal Data


Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han
2014

Provenance Data in Social Media


Geoffrey Barbier, Zhuo Feng, Pritam Gundecha, and Huan Liu
2013

Graph Mining: Laws, Tools, and Case Studies


D. Chakrabarti and C. Faloutsos
2012

Mining Heterogeneous Information Networks: Principles and Methodologies


Yizhou Sun and Jiawei Han
2012

Privacy in Social Networks


Elena Zheleva, Evimaria Terzi, and Lise Getoor
2012

Community Detection and Mining in Social Media


Lei Tang and Huan Liu
2010

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions


Giovanni Seni and John F. Elder
2010

Modeling and Data Mining in Blogosphere


Nitin Agarwal and Huan Liu
2009
Copyright © 2017 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Phrase Mining from Massive Text and Its Applications


Jialu Liu, Jingbo Shang, and Jiawei Han
www.morganclaypool.com

ISBN: 9781627058988 paperback


ISBN: 9781627059183 ebook

DOI 10.2200/S00759ED1V01Y201702DMK013

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY

Lecture #13
Series Editors: Jiawei Han, University of Illinois at Urbana-Champaign
Lise Getoor, University of California, Santa Cruz
Wei Wang, University of California, Los Angeles
Johannes Gehrke, Cornell University
Robert Grossman, University of Chicago
Series ISSN
Print 2151-0067 Electronic 2151-0075
Phrase Mining from Massive Text and Its Applications

Jialu Liu
Google

Jingbo Shang
University of Illinois at Urbana-Champaign

Jiawei Han
University of Illinois at Urbana-Champaign

SYNTHESIS LECTURES ON DATA MINING AND KNOWLEDGE DISCOVERY #13

Morgan & Claypool Publishers
ABSTRACT
A lot of digital ink has been spilled on "big data" over the past few years. Most of this surge owes its origin to the various types of unstructured data in the wild, among which the proliferation of text-heavy data is particularly overwhelming, attributed to the daily use of web documents, business reviews, news, social posts, etc., by so many people worldwide. A core challenge presents itself: how can one efficiently and effectively turn massive, unstructured text into a structured representation, so as to lay the foundation for many downstream text mining applications?

In this book, we investigate one promising paradigm for representing unstructured text: automatically identifying high-quality phrases from innumerable documents. In contrast to a list of frequent n-grams without proper filtering, users are often more interested in results based on variable-length phrases with certain semantics, such as scientific concepts, organizations, slogans, and so on. We propose new principles and powerful methodologies to achieve this goal, ranging from the scenario where a user can provide meaningful guidance to a fully automated setting based on distant learning. The book also introduces applications enabled by the mined phrases and points out some promising research directions.

KEYWORDS
phrase mining, phrase quality, phrasal segmentation, distant supervision, text mining, real-world applications, efficient and scalable algorithms

Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What is Phrase Mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Quality Phrase Mining with User Guidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Phrasal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Supervised Phrase Mining Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Frequent Phrase Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Phrase Quality Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 Rectification through Phrasal Segmentation . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 Feedback as Segmentation Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Experimental Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Quantitative Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.3 Efficiency Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Automated Quality Phrase Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Automated Phrase Mining Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Phrase Label Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Phrase Quality Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 POS-guided Phrasal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Phrase Quality Re-estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.5 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Experimental Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Quantitative Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.3 Distant Training Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.4 POS-guided Phrasal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.5 Efficiency Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.6 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Phrase Mining Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


4.1 Latent Keyphrase Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Topic Exploration for Document Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Knowledge Base Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Research Frontier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Acknowledgments

The authors would like to acknowledge Xiang Ren, Fangbo Tao, and Huan Gui for their contribution to Chapter 4.

The research was supported in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1320617 and IIS 16-18481, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. The views and conclusions contained in our research publications are those of the authors and should not be interpreted as representing any funding agencies.

Jialu Liu, Jingbo Shang, and Jiawei Han


February 2017

CHAPTER 1

Introduction
1.1 MOTIVATION
The past decade has witnessed a surge of interest in data mining, broadly construed as discovering knowledge from all kinds of data, be it in academia, industry, or daily life. The information explosion has brought the "big data" era into the spotlight. This overwhelming tide of information is largely composed of unstructured data such as images, speech, and videos. Unstructured data is easy to distinguish from typical structured data (e.g., relational data) in that the latter can be readily stored in fielded form in databases. Among the various kinds of unstructured data, a particularly prominent category comes in the form of text. Examples include news articles, social media messages, web pages, and query logs.

In the text mining literature, one fundamental problem in analyzing text is how to effectively represent it and model its topics, not only for algorithm performance but also so that analysts can better interpret and present the results. A common approach is to use the n-gram, i.e., a contiguous sequence of n unigrams, as the basic unit. Figure 1.1 shows an example sentence with the corresponding 1-gram, 2-gram, 3-gram, and consolidated representations. However, such a representation raises concerns about exponential growth of the dictionary as well as a lack of interpretability. One can reasonably expect an intelligent method that uses only a compact subset of n-grams yet generates an explainable representation of a document.

[Figure 1.1 shows the sentence "Data Mining and Machine Learning" with its bag of n-grams: 1-grams (data, mining, and, machine, learning), 2-grams (data mining, mining and, and machine, machine learning), and 3-grams (data mining and, mining and machine, and machine learning).]

Figure 1.1: Example of n-gram representation.
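
To make the representation concrete, here is a minimal Python sketch (illustrative, not from the book) that collects the bag of n-grams for the example sentence; the whitespace tokenizer and the max_n cap are simplifying assumptions.

    from collections import Counter

    def bag_of_ngrams(tokens, max_n=3):
        """Collect all contiguous n-grams (1 <= n <= max_n) with their counts."""
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
        return counts

    print(bag_of_ngrams("data mining and machine learning".split()))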

Along this line of thought, in this book we formulate such an explainable n-gram subset as quality phrases (e.g., scientific terms such as "data mining" and "machine learning" outlined in the figure) and phrase mining as the corresponding knowledge discovery process.
Phrase mining has been studied in different communities. The natural language processing (NLP) community refers to it as "automatic term recognition" (i.e., extracting technical terms with the use of computers). The information retrieval (IR) community studies this topic to select the main concepts in a corpus, in an effort to improve search engines. Among existing works published by these two communities, linguistic processors with heuristic rules are primarily used, and the most common approach is based on noun phrases. Supervised noun phrase chunking techniques have been proposed to leverage annotated documents to learn these rules. Other methods may utilize more sophisticated NLP features, such as dependency parsers, to further enhance precision. However, emerging textual data, such as social media messages, can deviate from rigorous language rules. Relying on heavily (pre-)trained linguistic processing makes these approaches difficult to generalize.

In this regard, we believe that the community would welcome and benefit from a set of data-driven algorithms that work robustly on large-scale datasets involving irregular textual data, while minimizing the human labeling cost. We are also convinced by various studies and experiments that our proposed methods embody enough novelty and contribution to add a solid building block for various text-related tasks, including document indexing, keyphrase extraction, topic modeling, knowledge base construction, and so on.

1.2 WHAT IS PHRASE MINING?

Phrase mining is a text mining technique that discovers semantically meaningful phrases from massive text. Considering the challenge of heterogeneity in emerging textual data, the principles and methods discussed in this book do not assume particular lexical rules and are primarily driven by data. Formally, we define the task as follows.

Problem 1.1 Phrase Mining. Given a large document corpus C (which can be any collection of textual word sequences of arbitrary length, such as articles, titles, and queries), phrase mining assigns a value between 0 and 1 to indicate the quality of each phrase mentioned in C and discovers the set of quality phrases K = {K_1, ..., K_M} whose quality scores are greater than 0.5. It also seeks to provide a segmenter for locating quality phrase mentions in any unseen text snippet.

Definition 1.2 Quality Phrase. A quality phrase is a sequence of words that appears contiguously in the corpus and serves as a complete (non-composable) semantic unit in certain contexts among the given documents.

There is no universally accepted definition of phrase quality. However, it is useful to quantify phrase quality based on certain criteria, as outlined below.

• Popularity: Quality phrases should occur with sufficient frequency in the given document collection.
• Concordance: Concordance refers to tokens collocating with a frequency significantly higher than what is expected due to chance. A commonly used example of phraseological concordance is the pair of phrases "strong tea" and "powerful tea." One would assume that the two phrases appear with similar frequency, yet in the English language the phrase "strong tea" is considered more proper and appears with much higher frequency. Because a concordant phrase's frequency deviates from what is expected, we consider its tokens as belonging to a whole semantic unit.

• Informativeness: A phrase is informative if it is indicative of a specific topic or concept. The phrase "this paper" is popular and concordant, but is not considered informative in a bibliographic corpus.

• Completeness: A long frequent phrase and its subsequences may all satisfy the above three criteria, but apparently not all of them qualify. A quality phrase should be interpretable as a complete semantic unit in certain contexts. The phrase "vector machine" is not considered complete, as it mostly appears with the prefix word "support."

Because single-word phrases cannot be decomposed into multiple tokens, the concordance criterion is no longer definable for them. As an alternative, we propose the independence criterion, which is introduced in more detail in Chapter 3.

1.3 OUTLINE OF THE BOOK

The remaining chapters of the book are outlined as follows.
• Chapter 2: Quality Phrase Mining with User Guidance. In the phrase mining literature, earlier work focuses on efficiently retrieving recurring word sequences and ranking them according to frequency-based statistics. However, the raw frequency from the data tends to produce misleading quality assessments, and the outcome is therefore unsatisfactory. We attempt to rectify the decisive raw frequency, to help discover the true quality of a phrase, by examining the contexts of its mentions. With limited labeling effort from the user, the model iteratively segments the corpus into non-overlapping word and phrase sequences such that: (1) the phrase quality estimated in the previous iteration guides the segmentation; and (2) the segmentation results rectify the raw phrase frequencies and improve the phrase quality estimation. Such an integrated framework benefits from mutual enhancement, and achieves both high quality and high efficiency.
• Chapter 3: Automated Quality Phrase Mining. Almost all state-of-the-art methods in the NLP, IR, and text mining communities require human experts at some level. Such reliance on manual effort from domain experts becomes an impediment to timely analysis of massive, emerging text corpora. Moreover, an ideal automated phrase mining method should work smoothly for multiple languages, with high precision, recall, and efficiency. We attempt to make phrase mining automated by utilizing external knowledge bases, removing human effort and minimizing language dependency. Modeling single-word phrases at the same time also improves performance, especially recall.
Since phrase mining lays the foundation for many other downstream text mining applications, we devote one chapter to discussing its applications in the latest research developments.

• Chapter 4: Phrase Mining Applications. In particular, we introduce three representative applications using phrase mining results.

• The first is a statistical inference algorithm for detecting latent quality phrases topically relevant to a single document. The previously mentioned phrase mining methods are able to locate phrase mentions in a document, but they cannot provide the relatedness between the document and the phrase.

• The second application utilizes phrase mining results to systematically analyze large numbers of textual documents from the perspective of topic exploration. We discuss how to group phrases into clusters sharing the same topic, how to summarize commonalities and differences given multiple document collections, and how to incorporate document-associated metadata, like authors and tags, into the exploration process.

• The last application constructs a semantically rich knowledge base out of unstructured text. Identifying the phrases in text that constitute entity mentions, and assigning types to these spans as well as to the relations between entity mentions, are the keys to this process.

CHAPTER 2

Quality Phrase Mining with User Guidance

In large, dynamic collections of documents, analysts are often interested in variable-length phrases, including scientific concepts, events, organizations, products, slogans, and so on. Accurate estimation of phrase quality is critical for the extraction of quality phrases and will enable a large body of applications to transform from word granularity to phrase granularity. In this chapter, we study a segmentation-integrated framework to mine multi-word quality phrases with a small set of user-provided binary labels.

2.1 OVERVIEW

Identifying quality phrases has gained increasing attention due to its value in handling increasingly massive text datasets. Historically, the natural language processing (NLP) community has conducted extensive studies, mostly known as automatic term recognition [Frantzi et al., 2000, Park et al., 2002, Zhang et al., 2008], referring to the task of extracting technical terms with the use of computers. This topic also attracts attention in the information retrieval (IR) community, since appropriate indexing term selection is critical to the improvement of a search engine, where the ideal indexing units should represent the main concepts in a corpus, beyond the bag-of-words.

Linguistic processors are commonly used to filter out stop words and restrict candidate terms to noun phrases. With pre-defined part-of-speech (POS) rules, one can generate noun phrases as term candidates for each POS-tagged document. Supervised noun phrase chunking techniques [Chen and Chen, 1994, Punyakanok and Roth, 2001, Xun et al., 2000] leverage annotated documents to automatically learn these rules. Other methods may utilize more sophisticated NLP features, such as dependency parsers, to further enhance the precision [Koo et al., 2008, McDonald et al., 2005]. With candidate terms collected, the next step is to leverage certain statistical measures derived from the corpus to estimate phrase quality. Some methods further resort to a reference corpus for the calibration of "termhood" [Zhang et al., 2008]. The various kinds of linguistic processing, domain-dependent language rules, and expensive human labeling make it challenging to apply these phrase mining techniques to emerging big and unrestricted corpora, which may encompass many different domains and topics, such as query logs, social media messages, and textual transaction records. Therefore, researchers have sought more general data-driven approaches, primarily based on the frequent pattern mining principle [Ahonen, 1999, Simitsis et al., 2008]. Early work focuses on efficiently retrieving recurring word sequences, but many such sequences do not form meaningful phrases. More recent work filters or ranks them according to frequency-based statistics. However, the raw frequency from the data tends to produce misleading quality assessments, and the outcome is unsatisfactory, as the following example demonstrates.

Example 2.1 Raw Frequency-based Phrase Mining Consider a set of scientific publications
and the raw frequency counts of two phrases “relational database system” and “support vector
machine” and their subsequences in the frequency column of Table 2.1. The numbers are hypothetical but manifest several key observations: (i) the frequency generally decreases with the
phrase length; (ii) both good and bad phrases can possess high frequency (e.g., “support vector”
and “vector machine”); and (iii) the frequency of one sequence (e.g., “relational database system”)
and its subsequences can have a similar scale of another sequence (e.g., “support vector machine”)
and its counterparts.

Table 2.1: A hypothetical example of word sequence raw frequency

Sequence                       Raw Frequency   Quality Phrase?   Rectified Frequency
relational database system     100             yes               70
relational database            150             yes               40
database system                160             yes               35
relational                     500             N/A               20
database                       1000            N/A               200
system                         10000           N/A               1000
support vector machine         100             yes               80
support vector                 160             yes               50
vector machine                 150             no                6
support                        500             N/A               150
vector                         1000            N/A               200
machine                        10000           N/A               50

Obviously, a method that ranks the word sequences solely according to the frequency will
output many false phrases such as “vector machine.” In order to address this problem, differ-
ent heuristics have been proposed based on comparison of a sequence’s frequency and its sub- (or
super-) sequences, assuming that a good phrase should have high enough (normalized) frequency
compared with its sub-sequences and/or super-sequences [Danilevsky et al., 2014, Parameswaran
et al., 2010]. However, such heuristics can hardly differentiate the quality of, e.g., “support vec-
tor” and “vector machine” because their frequencies are so close. Finally, even if the heuristics can
indeed draw a line between “support vector” and “vector machine” by discriminating their frequencies (between 160 and 150), the same separation could fail for another case like “relational
database” and “database system.”
Using the frequencies in Table 2.1, all heuristics will produce identical predictions for “relational database” and “vector machine,” guaranteeing one of them to be wrong. This example suggests the intrinsic limitations of using raw frequency counts, especially in judging whether a sequence is too long (longer than a minimum semantic unit), too short (broken and not informative), or right in length. It is a critical bottleneck for all frequency-based quality assessment.

2.2 PHRASAL SEGMENTATION

In this chapter, we discuss how to address this bottleneck by rectifying the decisive raw frequency that hinders discovering the true quality of a phrase. The goal of the rectification is to estimate how many times each word sequence should be interpreted as a whole phrase in its occurrence contexts. The following example illustrates this idea.

Example 2.2 Rectification. Consider the following occurrences of the six multi-word sequences listed in Table 2.1.

1. A ⌈relational database system⌋ for images…

2. ⌈Database system⌋ empowers everyone in your organization…

3. More formally, a ⌈support vector machine⌋ constructs a hyperplane…

4. The ⌈support vector⌋ method is a new general method of ⌈function estimation⌋…

5. A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe…

6. ⌈Relevance vector machine⌋ has an identical ⌈functional form⌋ to the ⌈support vector machine⌋…

7. The basic goal for ⌈object-oriented relational database⌋ is to ⌈bridge the gap⌋ between…

The first four instances should provide positive counts to these sequences, while the last three instances should not provide positive counts to “vector machine” or “relational database,” because they should not be interpreted as whole phrases there (instead, sequences like “feature vector” and “relevance vector machine” can). Suppose one can correctly count the true occurrences of the sequences and collect the rectified frequencies shown in the rectified column of Table 2.1. The rectified frequency now clearly distinguishes “vector machine” from the other phrases, since “vector machine” rarely occurs as a whole phrase.
The success of this approach relies on reasonably accurate rectification. Simple arithmetic on the raw frequencies, such as subtracting from a sequence's count the count of its quality super-sequence, is prone to error. First, which super-sequences are quality phrases is a question in and of itself. Second, whether a sequence should be deemed a whole phrase is context-dependent. For example, the fifth instance in Example 2.2 prefers “feature vector” and “machine learning” over “vector machine,” even though neither “feature vector machine” nor “vector machine learning” is a quality phrase. This context information is lost when we only collect the frequency counts.
In order to recover the true frequency with best effort, we ought to examine the context of every occurrence of each word sequence and decide whether to count it as a phrase. The examination of one occurrence may involve enumerating alternative possibilities, such as extending or breaking the sequence, and comparing among them. Testing every word sequence occurrence could be expensive, losing the efficiency advantage of the frequent pattern mining approaches.
Facing this challenge of accuracy and efficiency, we propose a segmentation approach called “phrasal segmentation,” and integrate it with phrase quality assessment in a unified framework with linear complexity (w.r.t. the corpus size). First, the segmentation assigns every word occurrence to only one phrase. In the first instance of Example 2.2, “relational database system” is bundled as a single phrase, which automatically avoids double counting “relational database” and “database system” within this instance. Similarly, the segmentation of the fifth instance contributes to the counts of “feature vector” and “machine learning” instead of “feature,” “vector machine,” and “learning.” This strategy condenses the individual tests for each word sequence and reduces the overall complexity while ensuring correctness. Second, although there is an exponential number of possible partitions of the documents, we are concerned only with those relevant to the phrase extraction task. Therefore, we can integrate the segmentation with the phrase quality assessment, such that: (i) only frequent phrases with reasonable quality are taken into consideration when enumerating partitions; and (ii) the phrase quality guides the segmentation, and the segmentation rectifies the phrase quality estimation. Such an integrated framework benefits from mutual enhancement, and achieves both high quality and high efficiency.
A phrasal segmentation defines a partition of a sequence into subsequences, such that every subsequence corresponds to either a single word or a phrase. Example 2.2 shows instances of such partitions, where all phrases of high quality are marked by the brackets ⌈⌋. Phrasal segmentation is distinct from the word, sentence, or topic segmentation tasks in natural language processing. It is also different from syntactic or semantic parsing, which relies on grammar to decompose sentences into rich structures like parse trees. Phrasal segmentation provides the granularity we need to extract quality phrases. The total count of a phrase's appearances in the segmented corpus is called its rectified frequency.
It is worth acknowledging that a sequence's segmentation may not be unique, for two reasons. First, as mentioned above, a word sequence may be regarded as a phrase or not, depending on adoption customs. Some phrases, like “bridge the gap” in the last instance of Example 2.2, are subject to a user's requirements. Therefore, we seek a segmentation that accommodates the phrase quality learned from user-provided examples. Second, a sequence could be ambiguous and have different interpretations. Nevertheless, in most cases, extracting quality phrases does not require a perfect segmentation, even if such a segmentation exists. In a large document collection, popularly adopted phrases appear many times in a variety of contexts. Even with a few mistakes or debatable partitions, a reasonably high-quality segmentation (e.g., one yielding no partition like “support ⌈vector machine⌋”) retains sufficient support (i.e., rectified frequency) for these quality phrases, albeit not for false phrases with high raw frequency.
With the above discussions, we have the following formalization.

Definition 2.3 Phrasal Segmentation. Given a word sequence C = w_1 w_2 ... w_n of length n, a segmentation S = s_1 s_2 ... s_m for C is induced by a boundary index sequence B = {b_1, b_2, ..., b_{m+1}} satisfying 1 = b_1 < b_2 < ... < b_{m+1} = n + 1, where a segment s_t = w_{b_t} w_{b_t + 1} ... w_{b_t + |s_t| - 1}. Here |s_t| refers to the number of words in segment s_t. Since b_t + |s_t| = b_{t+1}, for clearness we use w_[b_t, b_{t+1}) to denote the word sequence w_{b_t} w_{b_t + 1} ... w_{b_{t+1} - 1}.

Example 2.4 Continuing our previous Example 2.2, and specifically its first instance, the word sequence and marked segmentation are

    C = a relational database system for images
    S = / a / relational database system / for / images /

with boundary index sequence B = {1, 2, 5, 6, 7} indicating the locations of the segmentation symbol /.

[Figure 2.1 sketches the supervised phrase mining pipeline: frequent pattern mining turns the corpus into phrase candidates; feature extraction over the candidates, together with user-provided phrase labels, feeds a classifier for phrase quality estimation; and phrasal segmentation both consumes the quality estimates and feeds segmentation features back into the estimation.]

Figure 2.1: The supervised phrase mining framework.

2.3 SUPERVISED PHRASE MINING FRAMEWORK

In this chapter, in addition to the input corpus introduced in Problem 1.1, users are required to provide a small set L of labeled quality phrases and a set L̄ of inferior ones, which serve as the training data guiding the phrasal segmentation. The supervised framework comprises the following five steps and mines quality phrases following the quality criteria described in Section 1.2.
1. Generate frequent phrase candidates according to the popularity criterion (Section 2.3.1).
2. Estimate phrase quality based on features designed for the concordance and informativeness criteria (Section 2.3.2).
3. Estimate rectified frequencies via phrasal segmentation (Section 2.3.3).
4. Add segmentation-based features derived from the rectified frequencies into the feature set of the phrase quality classifier (Section 2.3.4), and repeat steps 2 and 3.
5. Filter out phrases with low rectified frequency to satisfy the completeness criterion, as a post-processing step.

A complexity analysis for this framework is given in Section 2.3.5, showing that both its computation time and its required space grow linearly with the corpus size.

2.3.1 FREQUENT PHRASE DETECTION

The task of detecting frequent phrases can be defined as collecting aggregate counts for all phrases in a corpus that satisfy a certain minimum support threshold τ, according to the popularity criterion. In practice, one can also set a maximum phrase length ω to restrict the phrase length; even if no explicit restriction is imposed, ω is typically a small constant. To mine these frequent phrases efficiently, we draw upon two properties.

1. Downward closure property: if a phrase is not frequent, then none of its super-phrases can be frequent. Therefore, those longer phrases are filtered out and never expanded.
2. Prefix property: if a phrase is frequent, every prefix unit of it must be frequent too. In this way, all the frequent phrases can be generated by expanding their prefixes.

The algorithm for detecting frequent phrases is given in Algorithm 1. We use C[i] to index a word in the corpus string and |C| to denote the corpus size. The ⊕ operator concatenates two words or phrases. Algorithm 1 returns a key-value dictionary f: its keys form the vocabulary U, containing all frequent phrases P and words U \ P; its values are their raw frequencies.

2.3.2 PHRASE QUALITY ESTIMATION

Estimating phrase quality from only a few training labels is challenging, since a huge number of messy phrase candidates might be generated by the first step. Instead of using one or two statistical measures [El-Kishky et al., 2015, Frantzi et al., 2000, Park et al., 2002], we opt to compute multiple features for each candidate in P. A classifier is trained on these features to predict the quality Q for all unlabeled phrases. For phrases not in P, their quality is simply 0. We divide the features into two categories, according to the concordance and informativeness criteria, in the following two subsections; only representative features are introduced, for clarity. We then discuss the classifier at the end of this subsection.

Algorithm 1: Frequent Phrase Detection

Input: Document corpus C, minimum support threshold τ.
Output: Raw frequency dictionary f of frequent phrases and words.
f ← an empty dictionary
index ← an empty dictionary
for i ← 1 to |C| do
    index[C[i]] ← index[C[i]] ∪ {i}
while index is not empty do
    index′ ← an empty dictionary
    for u ∈ index.keys do
        if |index[u]| ≥ τ then
            f[u] ← |index[u]|
            for j ∈ index[u] do
                u′ ← u ⊕ C[j + 1]
                index′[u′] ← index′[u′] ∪ {j + 1}
    index ← index′
return f
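
For concreteness, the following Python sketch transcribes Algorithm 1 directly; it is a simplified rendering in which the corpus is one flat token list (sentence boundaries are ignored) and the maximum phrase length ω is enforced explicitly, as the text suggests.

    from collections import defaultdict

    def frequent_phrases(corpus, tau, omega=6):
        """Algorithm 1 sketch: raw frequencies of all phrases with support >= tau.

        corpus: list of tokens; tau: minimum support; omega: max phrase length.
        """
        f = {}
        index = defaultdict(list)
        for i, w in enumerate(corpus):           # start from unigram occurrences
            index[(w,)].append(i)
        length = 1
        while index and length <= omega:
            next_index = defaultdict(list)
            for u, positions in index.items():
                if len(positions) >= tau:        # prefix property: only frequent
                    f[u] = len(positions)        # prefixes are expanded
                    for j in positions:
                        if j + 1 < len(corpus):
                            next_index[u + (corpus[j + 1],)].append(j + 1)
            index = next_index
            length += 1
        return f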

Concordance Features
This set of features is designed to measure concordance among sub-units of a phrase. To make phrases of different lengths comparable, we partition each phrase candidate into two disjoint parts in all possible ways and derive effective features measuring their concordance.

Suppose for each word or phrase u ∈ U we have its raw frequency f[u]. Its probability p(u) is defined as

$$p(u) = \frac{f[u]}{\sum_{u' \in U} f[u']}.$$

Given a phrase v ∈ P, we split it into its two most-likely sub-units ⟨u_l, u_r⟩ such that the pointwise mutual information is minimized. Pointwise mutual information quantifies the discrepancy between the probability of the true collocation and the presumed collocation under the independence assumption. Mathematically,

$$\langle u_l, u_r \rangle = \arg\min_{u_l \oplus u_r = v} \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$

With ⟨u_l, u_r⟩, we directly use the pointwise mutual information as one of the concordance features:

$$\mathrm{PMI}(u_l, u_r) = \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$

Another feature, also from information theory, is the pointwise Kullback-Leibler divergence:

$$\mathrm{PKL}(v \,\|\, \langle u_l, u_r \rangle) = p(v) \log \frac{p(v)}{p(u_l)\, p(u_r)}.$$

The additional factor p(v) multiplied with the pointwise mutual information leads to less bias toward rarely occurring phrases.

Both features are supposed to be positively correlated with concordance.
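
Both features can be computed directly from the raw-frequency dictionary. Below is a minimal sketch, assuming f maps token tuples to counts as in the Algorithm 1 sketch above; by the downward closure property, every sub-unit of a frequent phrase has an entry in f.

    import math

    def concordance_features(v, f):
        """PMI and PKL of phrase v (a tuple of words), using the two-part split
        <u_l, u_r> that minimizes pointwise mutual information."""
        total = sum(f.values())
        p = lambda u: f[u] / total
        pmi = min(math.log(p(v) / (p(v[:k]) * p(v[k:])))
                  for k in range(1, len(v)))
        pkl = p(v) * pmi
        return pmi, pkl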

Informativeness Features
Some candidates are unlikely to be informative because they consist of functional or stop words. We incorporate the following stop-word-based feature into the classification process:

• whether stop words are located at the beginning or the end of the phrase candidate,

which requires a dictionary of stop words. Phrases that begin or end with stop words, such as "I am," are often functional rather than informative.

A more generic feature measures informativeness based on corpus statistics:

• average inverse document frequency (IDF) computed over words,

where the IDF of a word w is computed over the D documents as

$$\mathrm{IDF}(w) = \log \frac{D}{|\{d \in [D] : w \in C_d\}|}.$$

It is a traditional information retrieval measure of how much information a word provides for retrieving a small subset of documents from a corpus. In general, quality phrases are expected to have an average IDF that is not too small.

In addition to word-based features, punctuation is frequently used in text to aid the interpretation of a specific concept or idea, and this information is helpful for our task. Specifically, we adopt the following feature:

• the probabilities of a phrase occurring in quotes, in brackets, or capitalized,

where a higher probability usually indicates that a phrase is more likely to be informative.
It is noteworthy that, in order to extract features efficiently, we designed an adapted Aho-Corasick automaton to rapidly locate occurrences of phrase candidates in the corpus. The Aho-Corasick automaton is similar to the trie data structure, which exploits common prefixes to reduce memory usage and make matching more efficient; it additionally computes a "failed" link for each node, referring to the node from which the matching process can continue after a mismatch. In this book, we adopt the standard Aho-Corasick automaton definition and construction process. Algorithm 2 introduces an extra "while" loop to handle the issue brought by prefixes (i.e., some phrase candidates may be prefixes of others); this differs slightly from the traditional matching process and finds all occurrences of the phrase candidates in the corpus in linear time.
Algorithm 2: Locating Phrase Candidates with the Aho-Corasick Automaton

Input: The Aho-Corasick automaton root, the corpus string C.
Output: All occurrences O of phrases in the automaton.
O ← ∅
u ← root
for i ← 1 to |C| do
    while u ≠ root and C[i] ∉ u.next do
        u ← u.failed
    if C[i] ∈ u.next then
        u ← u.next[C[i]]
    p ← u
    while p ≠ root do
        if p.isEnd then
            O ← O ∪ {[i − p.depth + 1, i]}
        p ← p.failed
return O

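A compact Python rendering of the automaton is sketched below. The node fields (next, failed, isEnd, depth) mirror the listing above; the construction routine is the standard Aho-Corasick build, written here for illustration rather than taken from the authors' implementation.

    from collections import deque

    class Node:
        def __init__(self, depth=0):
            self.next = {}       # child transitions, keyed by token
            self.failed = None   # failure link: longest proper suffix node
            self.isEnd = False   # a phrase candidate ends at this node
            self.depth = depth   # number of tokens on the path from the root

    def build_automaton(phrases):
        root = Node()
        for phrase in phrases:                    # insert each phrase into the trie
            u = root
            for w in phrase:
                if w not in u.next:
                    u.next[w] = Node(u.depth + 1)
                u = u.next[w]
            u.isEnd = True
        root.failed = root
        queue = deque()
        for child in root.next.values():          # BFS to compute failure links
            child.failed = root
            queue.append(child)
        while queue:
            u = queue.popleft()
            for w, child in u.next.items():
                g = u.failed
                while g is not root and w not in g.next:
                    g = g.failed
                child.failed = g.next[w] if w in g.next and g.next[w] is not child else root
                queue.append(child)
        return root

    def locate(root, corpus):
        """Algorithm 2 sketch: all (start, end) spans, inclusive, of occurrences."""
        occurrences, u = [], root
        for i, w in enumerate(corpus):
            while u is not root and w not in u.next:
                u = u.failed
            if w in u.next:
                u = u.next[w]
            p = u
            while p is not root:                  # report every candidate ending at i
                if p.isEnd:
                    occurrences.append((i - p.depth + 1, i))
                p = p.failed
        return occurrences
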
An alternative is to adopt a hash table. However, one should carefully choose the hash function for the hash table, and the theoretical time complexity of a hash table is not exactly linear. For comparison, we implemented a hash table approach using unordered_map in C++, with the Aho-Corasick automaton coded in C++ as well. The results can be found in Table 2.2. We can see that the Aho-Corasick automaton is slightly better because of its exactly linear complexity and lower memory overhead.

Table 2.2: Runtime of locating phrase candidates

Academia Yelp
Aho-Corasick Automaton 154.25s 198.39s
Unordered Map (Hash Table) 192.71s 366.67s

Classifier
The framework can work with an arbitrary classifier that can be effectively trained with a small amount of labeled data and outputs a probabilistic score between 0 and 1. For instance, we can adopt a random forest [Breiman, 2001], which is efficient to train with a small number of labels. The ratio of positive predictions among all decision trees can be interpreted as a phrase's quality estimate. In the experiments we will see that 200-300 labels are enough to train a satisfactory classifier.

As mentioned earlier, both quality phrases and inferior ones are required as training labels. To further reduce the labeling effort, the next chapter introduces distant learning to automatically retrieve both positive and negative labels.
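
As an illustration, such a quality estimator can be set up with scikit-learn in a few lines; the feature matrices below (concordance and informativeness features of the candidates) are assumed to come from the previous steps, and the tree count is an arbitrary choice.

    from sklearn.ensemble import RandomForestClassifier

    def train_quality_classifier(X_labeled, y_labeled, X_candidates):
        """Fit a random forest on a few hundred labeled phrases, then score
        all candidates; y_labeled is 1 for quality phrases, 0 for inferior ones."""
        clf = RandomForestClassifier(n_estimators=200)
        clf.fit(X_labeled, y_labeled)
        # the averaged tree vote serves as the quality score Q(v) in [0, 1]
        return clf.predict_proba(X_candidates)[:, 1]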

2.3.3 RECTIFICATION THROUGH PHRASAL SEGMENTATION

The discussion in Example 2.1 points out the limitations of using only raw frequency counts. Instead, we ought to examine the context of every word sequence occurrence and decide whether to count it as a phrase, as introduced in Example 2.2. The segmentation directly addresses the completeness criterion, and indirectly helps with the concordance criterion via rectified frequencies. Here we propose an efficient phrasal segmentation method to compute the rectified frequency of each phrase. We will see that, combined with the aforementioned phrase quality estimation, some phrases with high raw frequency get removed as their rectified frequencies approach zero. Furthermore, rectified phrase frequencies can be fed back to generate additional features and improve the phrase quality estimation, as discussed in the next subsection.
The segmentation is realized through a statistical model. Given a word sequence C and a segmentation S = s_1 s_2 ... s_m induced by the boundary index sequence B = {b_1, ..., b_{m+1}}, where s_t = w_[b_t, b_{t+1}), the joint probability is factorized as

$$p(S, C) = \prod_{t=1}^{m} p\left(b_{t+1}, \lceil w_{[b_t, b_{t+1})} \rfloor \,\middle|\, b_t\right), \qquad (2.1)$$

where p(b_{t+1}, ⌈w_[b_t, b_{t+1})⌋ | b_t) is the probability of observing the word sequence w_[b_t, b_{t+1}) as the t-th quality segment. As the segments of a word sequence usually depend only weakly on each other, we assume they are generated one by one, for the sake of both efficiency and simplicity.

We now describe the generative model for each segment. Given the start index b_t of a segment s_t, we first generate the end index b_{t+1} according to a prior distribution p(|s_t| = b_{t+1} − b_t) over phrase lengths. We then generate the word sequence w_[b_t, b_{t+1}) according to a multinomial distribution over all segments of length (b_{t+1} − b_t). Finally, we generate an indicator of whether w_[b_t, b_{t+1}) forms a quality segment according to its quality, p(⌈w_[b_t, b_{t+1})⌋ | w_[b_t, b_{t+1})) = Q(w_[b_t, b_{t+1})). The probabilistic factorization is

$$p\left(b_{t+1}, \lceil w_{[b_t,b_{t+1})} \rfloor \,\middle|\, b_t\right) = p(b_{t+1} \mid b_t)\; p\left(\lceil w_{[b_t,b_{t+1})} \rfloor \,\middle|\, b_t, b_{t+1}\right) = p(|s_t| = b_{t+1} - b_t)\; p\left(w_{[b_t,b_{t+1})} \,\middle|\, |s_t| = b_{t+1} - b_t\right)\, Q\left(w_{[b_t,b_{t+1})}\right).$$

The length prior p(|s_t| = b_{t+1} − b_t) is explicitly modeled to counter the bias toward longer segments, as longer segments result in fewer segments overall. The particular form of p(|s_t|) we pick is

$$p(|s_t|) \propto \alpha^{\,1 - |s_t|}. \qquad (2.2)$$
Here α ∈ ℝ⁺ is a factor called the segment length penalty. If α < 1, phrases of longer length have a larger value of p(|s_t|); if α > 1, the mass of p(|s_t|) moves toward shorter phrases. A smaller α favors longer phrases and results in fewer segments. Tuning its value turns out to be a trade-off between precision and recall for recognizing quality phrases. At the end of this subsection we discuss how to estimate its value by reusing the labels from Section 2.3.2. It is worth mentioning that such a segment length penalty is also discussed by Li et al. [2011]; our formulation differs from theirs by posing a weaker penalty on long phrases.
We denote p(w_[b_t, b_{t+1}) | |s_t|) by θ_{w_[b_t, b_{t+1})} for convenience. For a given corpus C with D documents, we need to estimate θ_u = p(u | |u|) for each frequent word and phrase u ∈ U, and to infer the segmentation S. Since p(C) does not depend on the segmentation S, one can maximize log p(S, C) instead. We employ the maximum a posteriori principle and maximize the joint probability of the corpus:

$$\sum_{d=1}^{D} \log p(S_d, C_d) = \sum_{d=1}^{D} \sum_{t=1}^{m_d} \log p\left(b_{t+1}^{(d)}, \lceil w_{[b_t, b_{t+1})}^{(d)} \rfloor \,\middle|\, b_t^{(d)}\right). \qquad (2.3)$$

To find the best segmentation maximizing Eq. (2.3), one can use efficient dynamic programming (DP) if θ is known; the algorithm is shown in Algorithm 3.

To learn θ, we employ an optimization strategy called Viterbi Training (VT), or Hard-EM in the literature [Allahverdyan and Galstyan, 2011]. Generally speaking, VT is an efficient, iterative way of learning the parameters of probabilistic models with hidden variables. In our case, given the corpus C, it searches for a segmentation that maximizes p(S, C | Q, θ, α), followed by coordinate ascent on the parameters θ. This procedure is iterated until a stationary point is reached. The corresponding algorithm is given in Algorithm 4.
The hard E-step is performed by the DP with θ fixed, and the M-step is based on the segmentation obtained from the DP. Once the segmentation S is fixed, the closed-form solution for θ_u can be derived as

$$\theta_u = \frac{\sum_{d=1}^{D} \sum_{t=1}^{m_d} \mathbf{1}_{s_t^{(d)} = u}}{\sum_{d=1}^{D} \sum_{t=1}^{m_d} \mathbf{1}_{|s_t^{(d)}| = |u|}}, \qquad (2.4)$$

where 1 denotes the identity indicator. We can see that θ_u is the rectified frequency of u normalized by the total frequency of the segments of length |u|. For this reason, we call θ the normalized rectified frequency.

Note that Soft-EM (i.e., the Baum-Welch algorithm [Bishop, 2006]) can also be applied to find a maximum-likelihood estimator of θ. Nevertheless, VT is more suitable in our case because:

1. VT uses DP for the segmentation step, which is significantly faster than Baum-Welch, whose E-step uses the forward-backward algorithm; and
2. the majority of phrases get removed as their θ approaches 0 during the iterations, which further speeds up our algorithm.
Algorithm 3: Dynamic Programming (DP)

Input: Word sequence C = w_1 w_2 ... w_n, phrase quality Q, normalized rectified frequency θ, segment length penalty α.
Output: Optimal segmentation S.
h[0] ← 1; h[i] ← 0 for 0 < i ≤ n
// Denote ω as the maximum phrase length.
for i ← 0 to n − 1 do
    for δ ← 1 to ω do
        if h[i] · p(b_{t+1} = b_t + δ, ⌈w_[i+1, i+δ+1)⌋ | b_t) > h[i + δ] then
            h[i + δ] ← h[i] · p(b_{t+1} = b_t + δ, ⌈w_[i+1, i+δ+1)⌋ | b_t)
            g[i + δ] ← i
i ← n; m ← 0
while i > 0 do
    m ← m + 1
    s_m ← w_{g[i]+1} w_{g[i]+2} ... w_i
    i ← g[i]
return S ← s_m s_{m−1} ... s_1
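
A Python sketch of Algorithm 3 follows. It assumes theta maps phrase tuples to normalized rectified frequencies (with an entry for every single word), Q maps phrases to quality scores (single words default to quality 1, and unseen multi-word sequences to 0, per Section 2.3.2), and the length prior of Eq. (2.2) is used up to normalization.

    def viterbi_segment(words, Q, theta, alpha, omega=6):
        """Algorithm 3 sketch: best segmentation of `words` under Eq. (2.1)."""
        n = len(words)
        h = [0.0] * (n + 1)   # h[i]: best score for segmenting the first i words
        g = [0] * (n + 1)     # g[i]: start index of the last segment used for h[i]
        h[0] = 1.0
        for i in range(n):
            for delta in range(1, min(omega, n - i) + 1):
                u = tuple(words[i:i + delta])
                score = (h[i]
                         * alpha ** (1 - delta)                  # length prior, Eq. (2.2)
                         * theta.get(u, 0.0)                     # multinomial term
                         * Q.get(u, 1.0 if delta == 1 else 0.0)) # quality term
                if score > h[i + delta]:
                    h[i + delta] = score
                    g[i + delta] = i
        segments, i = [], n
        while i > 0:          # backtrack over the recorded segment starts
            segments.append(tuple(words[g[i]:i]))
            i = g[i]
        return segments[::-1]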

It has also been reported in Allahverdyan and Galstyan [2011] that VT converges faster and
results in sparser and simpler models for Hidden Markov Model-like tasks. Meanwhile, VT is
capable of correctly recovering most of the parameters.
Equation (2.2) above defines the segment length penalty with a hyper-parameter α that must be determined outside the VT iterations. An overestimated α will segment quality phrases into shorter parts, while an underestimated α tends to keep low-quality phrases. Thus, an appropriate α reflects the user's trade-off between precision and recall. To judge what value of α is reasonable, we propose to reuse the labeled phrases from the phrase quality estimation. Specifically, we search for the maximum value of α such that VT does not segment the positive phrases. A parameter r₀, named the non-segmented ratio, controls the trade-off mentioned above: it is the expected ratio of phrases in L that are not partitioned by the dynamic programming. The detailed search process is described in Algorithm 5, where we initially set upper and lower bounds on α and then perform a binary search. In Algorithm 5, |S| denotes the number of segments in S and |L| refers to the number of positive labels.
Algorithm 4: Viterbi Training (VT)

Input: Corpus C, phrase quality Q, segment length penalty α.
Output: θ.
initialize θ with the normalized raw frequencies in the corpus
while not converged do
    θ′_u ← 0 for all u ∈ U
    for d ← 1 to D do
        S_d ← DP(C_d, Q, θ, α) via Algorithm 3
        // assume S_d = s_1^(d) ... s_{m_d}^(d)
        for t ← 1 to m_d do
            u ← w_[b_t, b_{t+1})
            θ′_u ← θ′_u + 1
    normalize θ′ w.r.t. segment length as in Eq. (2.4)
    θ ← θ′
return θ
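
Putting the pieces together, the VT loop can be sketched as below, reusing the viterbi_segment sketch above; a fixed iteration count stands in for the convergence test, and the support threshold is ignored in the initialization for brevity.

    from collections import Counter

    def normalize_by_length(counts):
        """Eq. (2.4): normalize counts within each segment length."""
        totals = Counter()
        for u, c in counts.items():
            totals[len(u)] += c
        return {u: c / totals[len(u)] for u, c in counts.items()}

    def viterbi_training(docs, Q, alpha, n_iters=5, omega=6):
        """Algorithm 4 sketch: alternate DP segmentation and closed-form updates."""
        raw = Counter(tuple(d[i:i + n])            # init: normalized raw n-gram counts
                      for d in docs
                      for n in range(1, omega + 1)
                      for i in range(len(d) - n + 1))
        theta = normalize_by_length(raw)
        for _ in range(n_iters):
            counts = Counter()
            for d in docs:                         # hard E-step: segment each document
                for seg in viterbi_segment(d, Q, theta, alpha, omega):
                    counts[seg] += 1
            theta = normalize_by_length(counts)    # M-step: Eq. (2.4)
        return theta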

2.3.4 FEEDBACK AS SEGMENTATION FEATURES

Rectified frequencies can help refine the features and improve the quality assessment. The motivation behind this feedback idea is explained with the examples shown in Table 2.3. The "quality before feedback" listed in the table is computed based on the phrase quality estimation of the previous step. For example, the quality of "np hard in the strong" is significantly overestimated according to the raw frequency; once we correctly segment the documents, its frequency is largely reduced, and we can use that to guide the quality estimator. As another example, the quality of phrases like "data stream management system" was originally underestimated due to their relatively low frequencies and small concordance feature values. Suppose that, after segmentation, such a phrase is not broken into smaller units in most cases. Then we can feed that information back to the quality estimator and boost the score.

Based on this intuition, we design two new features, named segmentation features, and plug them into the feature set introduced in Section 2.3.2. Given a phrase v ∈ P, the two segmentation features are defined as

$$\log \frac{p(S, v)\big|_{|S| = 1}}{\max_{|S| > 1} p(S, v)} \qquad\text{and}\qquad p(S, v)\big|_{|S| = 1} \cdot \log \frac{p(S, v)\big|_{|S| = 1}}{\max_{|S| > 1} p(S, v)},$$

where p(S, v) is computed by Equation (2.1). Instead of splitting a phrase into two parts, as for the concordance features, we now find the best segmentation with the dynamic programming
Algorithm 5: Penalty Learning

Input: Corpus C, labeled quality phrases L, phrase quality Q, non-segmented ratio r₀.
Output: Desired segment length penalty α.
up ← 200; low ← 0
while not converged do
    α ← (up + low) / 2
    θ ← VT(C, Q, α) via Algorithm 4
    r ← r₀ · |L|
    for i ← 1 to |L| do
        S ← DP(L_i, Q, θ, α) via Algorithm 3
        if |S| = 1 then
            r ← r − 1
    if r > 0 then
        up ← α     // too few positive phrases stayed whole: decrease α
    else
        low ← α    // enough positive phrases stayed whole: try a larger α
return (up + low) / 2
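
The binary search itself is only a few lines. The sketch below reuses the hypothetical viterbi_training and viterbi_segment from the earlier sketches, and runs a fixed number of bisection rounds in place of an ε-convergence test.

    def learn_penalty(docs, labeled_phrases, Q, r0, rounds=20, omega=6):
        """Algorithm 5 sketch: largest alpha keeping >= r0 of L unsegmented."""
        up, low = 200.0, 0.0
        for _ in range(rounds):
            alpha = (up + low) / 2
            theta = viterbi_training(docs, Q, alpha, omega=omega)
            kept = sum(len(viterbi_segment(list(p), Q, theta, alpha, omega)) == 1
                       for p in labeled_phrases)
            if kept < r0 * len(labeled_phrases):
                up = alpha    # too many positives were split: decrease alpha
            else:
                low = alpha   # enough positives stayed whole: try a larger alpha
        return (up + low) / 2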

Table 2.3: Effects of segmentation feedback on phrase quality estimation

Phrase                          Quality Before   Quality After   Problem Fixed by Feedback
np hard in the strong sense     0.78             0.93            slight underestimate
np hard in the strong           0.70             0.23            overestimate
false pos. and false neg.       0.90             0.97            N/A
pos. and false neg.             0.87             0.29            overestimate
data base management system     0.60             0.82            underestimate
data stream management system   0.26             0.64            underestimate

introduced for the phrasal segmentation, which better models the concordance criterion. In addition, the normalized rectified frequencies are used to compute these new features, which addresses the context-dependent completeness criterion. As a result, misclassified phrase candidates such as those in the above example get mostly corrected after retraining the classifier, as shown in Table 2.3.
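
In code, the two features compare the probability of keeping v whole against its best proper segmentation. The sketch below is illustrative only: it reuses the scoring convention and viterbi_segment of the earlier sketches and omits guards against zero probabilities.

    import math

    def segmentation_features(v, Q, theta, alpha, omega=6):
        """Feedback features for phrase v: the log-ratio of the unsegmented score
        to the best multi-segment score, and that ratio weighted by the former."""
        def seg_score(segments):
            s = 1.0
            for u in segments:
                s *= (alpha ** (1 - len(u)) * theta.get(u, 0.0)
                      * Q.get(u, 1.0 if len(u) == 1 else 0.0))
            return s
        whole = seg_score([v])
        # best segmentation with |S| > 1: fix the first segment v[:k], then
        # segment the remainder optimally with the DP sketch
        best_multi = max(
            seg_score([v[:k]] + viterbi_segment(list(v[k:]), Q, theta, alpha, omega))
            for k in range(1, len(v))
        )
        ratio = math.log(whole / best_multi)
        return ratio, whole * ratio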
A better phrase quality estimator can guide a better segmentation as well. In this way, the loop between quality estimation and phrasal segmentation is closed, and the integrated framework is expected to leverage their mutual enhancement and address all four phrase quality criteria organically.
Note that we do not need to run quality estimation and phrasal segmentation for many iterations. In our experiments, the benefits brought by the rectified frequencies penetrate after the first iteration, leaving the performance curves over the next several iterations similar, as will be shown in the experiments.

2.3.5 COMPLEXITY ANALYSIS

Frequent Phrase Detection: Since each hash table operation is O(1), both the time and space complexities are O(ω|C|). As ω is a small constant indicating the maximum phrase length, this step is linear in the corpus size |C|.

Feature Extraction: When extracting features, the most challenging problem is how to efficiently locate the phrase candidates in the original corpus, because the original texts are crucial for finding the punctuation and capitalization information. Instead of using dictionaries to store all the occurrences, we take advantage of the Aho-Corasick automaton and tailor it to find all occurrences of phrase candidates. The time complexity is O(|C| + |P|) and the space complexity O(|P|), where |P| refers to the total number of frequent phrase candidates. As the length of each candidate is limited by the constant ω, O(|P|) = O(|C|), so the complexity is O(|C|) in both time and space.

Phrase Quality Estimation: As we label only a very small set of phrase candidates, as long as the number and depth of the decision trees in the random forest are bounded by constants, the training time of the classifier is small compared to the other parts. The prediction stage is proportional to the number of phrase candidates and the dimensionality of the features. Therefore, it is O(|C|) in both time and space, although the actual magnitude may be smaller.

Viterbi Training: It is easy to observe that Algorithm 3 is O(nω), which is linear in the number of words. ω is treated as a constant, and thus the VT process is also O(|C|), considering that Algorithm 4 usually finishes in a few iterations.

Penalty Learning: Suppose we require only a constant precision ε to check the convergence of the binary search. Then the algorithm converges after log₂(200/ε) rounds, so the number of loops can be treated as a constant. Because VT takes O(|C|) time, penalty learning also takes O(|C|) time.

Summary: Because the time and space complexities of all components in our framework are O(|C|), the proposed framework has linear time and space complexity and is thus very efficient. Furthermore, the most time-consuming parts, including penalty learning and VT, can easily be parallelized because documents and sentences are processed independently.
2.4 EXPERIMENTAL STUDY

In this section, experiments demonstrate the effectiveness and efficiency of the proposed methods in mining quality phrases and generating accurate segmentations. We begin with a description of the datasets. Two real-world datasets were used in the experiments; detailed statistics are summarized in Table 2.4.
Table 2.4: Statistics about datasets
Dataset #Docs #Words #Labels
Academia 2.77M 91.6M 300
Yelp 4.75M 145.1M 300

• The Academia dataset¹ is a collection of major computer science journals and proceedings. We use both titles and abstracts in our experiments.
• The Yelp dataset² provides reviews of 250 businesses. Each individual review is considered as a document.
To demonstrate the effectiveness of the proposed approach, we compared the following
phrase extraction methods.
• TF-IDF ranks phrases by the product of their raw frequencies and inverse document frequencies.

• C-Value proposes a ranking measure based on the frequencies of a phrase used as part of its
super-phrases, following a top-down scheme.

• ConExtr approaches phrase extraction as a market-baskets problem, based on an assumption
about the relationship between an n-gram and its prefix/suffix (n-1)-grams.

• KEA³ is a supervised keyphrase extraction method for long documents. To apply this
method in our setting, we consider the whole corpus as a single document.

• TopMine⁴ is a topical phrase extraction method. We use its phrase mining module for comparison.

• ClassPhrase ranks phrases based on their estimated qualities (removing steps 3–5 from our
framework).
¹http://aminer.org/billboard/AMinerNetwork
²https://www.yelp.com/academic_dataset
³https://code.google.com/p/kea-algorithm
⁴http://web.engr.illinois.edu/~elkishk2/
• SegPhrase combines ClassPhrase with phrasal segmentation to filter overestimated phrases
based on the normalized rectified frequency (removing step 4 from our framework).

• SegPhrase+ is similar to SegPhrase but adds segmentation features to refine the quality
estimation. It contains the full procedure presented in Section 2.3.

The first two methods utilize NLP chunking to obtain phrase candidates. We use the JATE⁵
implementation of the first two methods, i.e., TF-IDF and C-Value. Both of them rely on OpenNLP⁶
as the linguistic processor to detect phrase candidates in the corpus. The remaining methods are all
based on frequent n-grams, and their runtime is dramatically reduced. The last three methods are
variations of our proposed method.
It is also worth mentioning that JATE contains several more implemented methods, in-
cluding Weirdness [Ahmad et al., 1999]. They are not reported here due to their unsatisfactory
performance compared to the baselines listed above.
For the parameter setting, we set the minimum phrase support as 30 and the maximum phrase
length $\omega$ as 6, which are two parameters required by all methods. Other parameters required by
the baselines were set according to the open-source tools or the original papers.
For our proposed methods, training labels for phrases were collected by sampling represen-
tative phrase candidates from groups of phrases pre-clustered on the normalized feature space by
k-means. We labeled research areas, tasks, algorithms, and other scientific terms in the Academia
dataset as quality phrases. Some examples are “divide and conquer,” “np complete,” and “rela-
tional database.” For the Yelp dataset, restaurants, dishes, cities, and other related concepts are
labeled as positive. In contrast, phrases like “under certain assumptions,” “many restaurants,”
and “last night” were labeled as negative. We downsample low-quality phrases because they
dominate the quality phrases. The number of training labels in our experiments is reported in
Table 2.4. To automatically learn the value of the segment length penalty, we set the non-segmented
ratio r0 in Algorithm 5 as 1.0 for the Academia dataset and 0.95 for the Yelp dataset. The selection of
this parameter will be discussed in detail later in this section.
To make the outputs returned by different methods comparable, we converted all phrase
candidates to lower case and merged plural with singular phrases. The phrase lists generated by
these methods have different sizes, and the tails of the lists are of low quality. For simplicity of
comparison, we discarded low-ranked phrases based on the minimum size among all phrase lists
except ConExtr, which returns all phrases without ranking; thus, we did not remove its phrases.
The remaining size of each list is still reasonably large (> 40,000).

2.4.1 QUANTITATIVE EVALUATION AND RESULTS


The goal of our experiments is to study how well our methods perform in terms of “precision”
and “recall” and to compare them with the baselines. Precision is defined as the ratio of “true” quality phrases
among all predictions. Recall is defined as the ratio between the “true” quality phrases in the predic-
tions and the complete set of quality phrases.
⁵https://code.google.com/p/jatetoolkit
⁶http://opennlp.apache.org

Wiki Phrases: The first set of experiments was conducted using Wikipedia phrases as ground-truth
labels. Wiki phrases refer to popular mentions of entities, obtained by crawling intra-Wiki citations
within Wiki content. To compute precision, only the Wiki phrases are considered to be positive.
For recall, we combine the Wiki phrases returned by different methods altogether and view them
as the complete set of quality phrases. Precision and recall are biased in this case because positive labels are re-
stricted to Wiki phrases. However, we still expect to obtain meaningful insights regarding the
performance difference between the proposed methods and the baselines.

Pooling: Besides Wiki phrases, we rely on human evaluators to judge whether the rest of the
candidates are good. We randomly sampled k Wiki-uncovered phrases from the returned can-
didates of each compared method. These sampled phrases formed a pool, and each of them was
then evaluated by three reviewers independently. The reviewers could use a popular search engine
for the candidates (thus helping reviewers judge the quality of phrases that they were not familiar
with). We took the majority of the opinions and used these results to evaluate how precise the
returned quality phrases of each method are. Throughout the experiments we set k = 500.
Precision-recall curves of the different methods evaluated by both Wiki phrases and pooling
are shown in Figure 2.2. The trends on both datasets are similar.
Among the existing work, the chunking-based methods, such as TF-IDF and C-Value, have
the best performance; ConExtr reduces to a dot in the figure since its output does not provide
ranking information. Our proposed method, SegPhrase+, outperforms them significantly. More
specifically, SegPhrase+ can achieve a higher recall while its precision is maintained at a satisfactory
level. That is, many more quality phrases can be found by SegPhrase+ than by the baselines. For a given
recall, the precision of our method is higher most of the time.
Among the variant methods within our framework, it is surprising that ClassPhrase performs
competitively with the chunking-based methods like TF-IDF. Note that the latter requires large
amounts of pre-training for good phrase chunking. However, ClassPhrase's precision at the tail
is slightly worse than TF-IDF on the Academia dataset evaluated by Wiki phrases. We also observe a
significant difference between SegPhrase and ClassPhrase, indicating that phrasal segmentation plays
a crucial role in addressing the completeness criterion. In fact, SegPhrase already beats ClassPhrase
and the baselines. Moreover, SegPhrase+ improves the performance of SegPhrase because of the use
of phrasal segmentation results as additional features.
An interesting observation is that the advantage of our method is more significant in the
pooling evaluations. The phrases in the pool are not covered by Wikipedia, indicating that Wikipedia is
not a complete source of quality phrases. However, our proposed methods, including SegPhrase+,
SegPhrase, and ClassPhrase, can mine out most of them (more than 80%) and keep a very high
level of precision, especially on the Academia dataset. Therefore, the evaluation results on the
pooling phrases suggest that our methods not only detect the well-known Wiki phrases, but also
work properly for long-tail phrases that occur less frequently.

Figure 2.2: Precision-recall in four groups of experiments: (Academia, Yelp) × (Wiki phrases, pooling). [Figure omitted: four precision-recall panels comparing TF-IDF, C-Value, ConExtr, KEA, ToPMine, ClassPhrase, SegPhrase, and SegPhrase+.]
From the results on the Yelp dataset evaluated by pooling, we notice that SegPhrase+ is
a little weaker than SegPhrase at the head. As we know, SegPhrase+ utilizes the phrasal
segmentation results from SegPhrase to refine the phrase quality estimator. However, segmenta-
tion features do not add new information for bigrams. If there are not many quality phrases with
more than two words, SegPhrase+ might not achieve a significant improvement and can even perform
slightly worse due to overfitting caused by reusing the same set of labeled phrases. In fact,
on the Academia dataset, the ratios of quality phrases with more than two words are 24% among all
Wiki phrases and 17% among pooling phrases. In contrast, these statistics go down to 13%
and 10% on the Yelp dataset, which verifies our conjecture and explains why SegPhrase+ has slightly
lower precision than SegPhrase at the head.

2.4.2 MODEL SELECTION


The goal of model selection is to study how well our methods perform in terms of “precision” and
“recall” across candidate models with different parameters. We specifically want to study four
potentially interesting questions.

• How many labels do we need to achieve good phrase quality estimation?

• How should the non-segmented ratio r0 for deciding the segment length penalty be chosen?

• How many iterations are needed to alternate between phrase quality estimation and phrasal
segmentation?

• What is the convergence speed of Viterbi Training?

Number of Labels
To evaluate the impact of training data size on the phrase quality estimation, we focus on studying
the classification performance of ClassPhrase. Table 2.5 shows the results evaluated among phrases
with positive predictions (i.e., $\{v \in P : Q_v \ge 0.5\}$). With different numbers of labels, we report
the precision, recall, and F1 score judged by human evaluators (pooling). The number of cor-
rectly predicted Wiki phrases is also provided, together with the total number of positive phrases
predicted by the classifier. From these results, we observe that the performance of the classifier
improves as the number of labels increases. Specifically, on both datasets, the recall rises
as the number of labels increases, while the precision goes down; the reason is the downsampling
of low-quality phrases in the training data. Overall, the F1 score is monotonically increasing,
which indicates that more labels may result in better performance. 300 labels are enough to train
a satisfactory classifier.
Table 2.5: Impact of training data size on ClassPhrase (Top: Academia, Bottom: Yelp)

Academia
# Labels Precision Recall F1 # Wiki Phrases # Total
50 0.881 0.372 0.523 6,179 24,603
100 0.859 0.430 0.573 6,834 30,234
200 0.856 0.558 0.676 8,196 40,355
300 0.760 0.811 0.785 11,535 95,070
Yelp
# Labels Precision Recall F1 # Wiki Phrases # Total
50 0.491 0.948 0.647 6,985 79,091
100 0.540 0.948 0.688 6,692 57,018
200 0.554 0.948 0.700 6,786 53,613
300 0.559 0.944 0.702 6,777 53,442

Non-segmented Ratio
The non-segmented ratio r0 is designed for learning the segment length penalty, which in turn con-
trols the precision and recall of phrasal segmentation. Empirically, under a higher r0, the segmenta-
tion process favors longer phrases, and vice versa. We show experimental results in Table 2.6
for models with different values of r0. The evaluation measures are similar to the previous set-
ting but are computed based on the results of SegPhrase. One can observe that the precision
increases with lower r0, while the recall decreases, because phrases are more likely to be seg-
mented into words under a lower r0. A high r0 is generally preferred because we should preserve most
positive phrases in the training data. We select r0 = 1.00 and 0.95 for the Academia and Yelp datasets
respectively, because quality phrases are shorter in the Yelp dataset than in the Academia dataset.

Convergence Study of Viterbi Training


Our time complexity analysis in Section 2.3.5 assumes that Viterbi Training in Algorithm 4 converges
in a few iterations. Here we verify this through empirical studies. As shown in Table 2.7, VT converges
extremely fast on both datasets. This owes to the good initialization based on raw frequency as
well as the particular property of Viterbi Training discussed in Section 2.3.3.

Iterations of SegPhrase+
SegPhrase+ involves only one iteration of re-estimating phrase quality using normalized recti-
fied frequency from phrasal segmentation. Here we show the performance of SegPhrase+ with
more iterations in Figure 2.3 based on human-labeled phrases. For comparison, we also report
Table 2.6: Impact of non-segmented ratio r0 on SegPhrase (Top: Academia, Bottom: Yelp)

Academia
r0 Precision Recall F1 # Wiki Phrases # Total
1.00 0.816 0.756 0.785 10,607 57,668
0.95 0.909 0.625 0.741 9,226 43,554
0.90 0.949 0.457 0.617 7,262 30,550
0.85 0.948 0.422 0.584 7,107 29,826
0.80 0.944 0.364 0.525 6,208 25,374
Yelp
r0 Precision Recall F1 # Wiki Phrases # Total
1.00 0.606 0.948 0.739 7,155 48,684
0.95 0.631 0.921 0.749 6,916 42,933
0.90 0.673 0.846 0.749 6,467 34,632
0.85 0.714 0.766 0.739 5,947 28,462
0.80 0.725 0.728 0.727 5,729 26,245

Table 2.7: Objective function values of Viterbi Training for SegPhrase and SegPhrase+

Dataset Academia Yelp


Method SegPhrase SegPhrase+ SegPhrase SegPhrase+
Iter.1 -6.39453E+08 -6.33064E+08 -9.33899E+08 -9.27055E+08
Iter.2 -6.23699E+08 -6.17419E+08 -9.12082E+08 -9.06041E+08
Iter.3 -6.23383E+08 -6.17214E+08 -9.11835E+08 -9.05946E+08
Iter.4 -6.23354E+08 -6.17196E+08 -9.11819E+08 -9.05940E+08
Iter.5 -6.23351E+08 -6.17195E+08 -9.11818E+08 -9.05940E+08

the performance of ClassPhrase+, which is similar to ClassPhrase but includes segmentation features
generated from the phrasal segmentation results of the previous iteration.
We can see that the benefits brought by the rectified frequency are fully digested within the
first iteration, leaving the F1 scores over the next several iterations close. One can also observe a slight
performance decline over the next two iterations, especially for the top-1000 phrases. Recall that
we reuse the training labels in each iteration. This decline can thus be explained by overfit-
ting, because the segmentation features added in later iterations become less meaningful, and
more meaningless features undermine the classification power of the random forest. Based on
this, we conclude that there is no need to re-estimate phrase quality multiple times.
Figure 2.3: Performance variations of SegPhrase+ and ClassPhrase+ with increasing iterations. [Figure omitted: F1 scores on the Academia and Yelp datasets (pooling) for SegPhrase+@All and SegPhrase+@1000 over four iterations.]

2.4.3 EFFICIENCY STUDY


The following execution-time experiments were all conducted on a machine with two Intel(R)
Xeon(R) E5-2680 v2 CPUs @ 2.80 GHz. Our framework is mainly implemented in C++ while
a small part of the preprocessing is in Python. As shown in Figure 2.4, the linear curves of the total
runtime of SegPhrase+ on different proportions of the data verify the linear time complexity analyzed
in Section 2.3.5.

Figure 2.4: Runtime on different proportions of data. [Figure omitted: total running time in seconds versus the proportion of the corpus used, for the Academia and Yelp datasets; both curves grow close to linearly.]

Table 2.8: Running time

Dataset    File Size   # Words   Time
Academia   613 MB      91.6 M    0.595 h
Yelp       750 MB      145.1 M   0.917 h

Besides, the pies in Figure 2.5 show the ratios of different components of our framework.
One can observe that Feature Extraction and Phrasal Segmentation occupy most of the runtime.
Figure 2.5: Runtime of different modules in our framework on the Academia and Yelp datasets. [Figure omitted: two pie charts over the modules Frequent Phrase Detection, Feature Extraction, 1st Quality Estimation, 1st Phrasal Segmentation, Segmentation Feature Extraction, 2nd Quality Estimation, and 2nd Phrasal Segmentation.]

Fortunately, almost all components of our framework can be parallelized, including Feature
Extraction, Phrasal Segmentation, and Quality Estimation, which account for most of the execution
time. This is because sentences can be processed one by one without any impact on
each other. Therefore, our methods can be made very efficient for massive corpora using parallel and
distributed techniques. Here we do not compare the runtime with other baselines because they
are implemented in different programming languages and some of them further rely on various
third-party packages. Among existing implementations, our method is empirically one of the
fastest.

2.4.4 CASE STUDY


The previous experiments focused on evaluating phrase quality quantitatively. In this subsection,
we show two case studies based on applications that take segmented corpora as input. Note that the
segmented corpus can be obtained by applying the segmenter (i.e., the other output of the phrase
mining methods) to the training corpus.

Interesting Phrase Mining


The first application is to mine interesting phrases in a subset of the given corpus. Interesting phrases
are defined to be phrases frequent in the subset C′ but relatively infrequent in the overall corpus
C [Bedathur et al., 2010, Gao and Michel, 2012, P et al., 2014]. Given a phrase v, its interesting-
ness is measured by $\mathrm{freq}(v, C') \cdot \mathrm{purity}(v, C', C) = \mathrm{freq}(v, C')^2 / \mathrm{freq}(v, C)$, which considers both
the phrase frequency and the purity of the phrase in the subset.
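As a concrete illustration, here is a minimal sketch of this ranking, assuming phrase frequencies in the subset and in the full corpus have already been counted on the segmented text (all names are illustrative):

from collections import Counter

def interesting_phrases(sub_counts: Counter, all_counts: Counter, top_k=10):
    # interestingness(v) = freq(v, C')^2 / freq(v, C)
    # = subset frequency times the purity of the subset occurrences
    score = {v: sub_counts[v] ** 2 / all_counts[v]
             for v in sub_counts if all_counts[v] > 0}
    return sorted(score, key=score.get, reverse=True)[:top_k]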
We list a fraction of the interesting phrases mined from papers published in the SIGMOD and
SIGKDD conferences in Table 2.9. Each series of proceedings forms a subset of the whole
Academia corpus. Two segmentation methods are compared. The first relies on dynamic
programming using the phrase quality estimated by SegPhrase+. The other is based on the phrase
chunking method adopted in JATE, which is also used to detect phrase candidates for the TF-IDF
and C-Value methods. To be fair, we only show phrases extracted by the SegPhrase+, TF-IDF, and
C-Value methods in the table. Because TF-IDF and C-Value perform similarly and both rely
on the chunking method, we merge their phrases and report the mining results in one column named
“Chunking.” Phrases found by SegPhrase+ but missing in the chunking results are highlighted in purple
(and red vice versa). One can observe that the interesting phrases mined by SegPhrase+ based on the
segmentation results are more meaningful, and the improvement is significant. Relatively speaking,
phrases mined by the chunking method are of inferior quality; therefore, many of them are
not covered by SegPhrase+.

Word/Phrase Similarity Search


With a segmented corpus, one can train a model to learn distributed vector representations of
words and phrases [Mikolov et al., 2013]. Using this technique, words and phrases are mapped
into a vector space such that semantically similar words and phrases have similar vector represen-
tations. This helps other text mining algorithms achieve better performance by grouping similar
units. The quality of the learned vector representations is closely related to the quality of the input
segmented corpus: accurate segmentation results in good vector representations. This perfor-
mance gain is usually evaluated by comparing similarity scores between word/phrase pairs. To be
specific, one can compute the top-k similar words or phrases given a query and compare the ranked
lists. We use this to verify the utility of both quality phrase mining and quality segmentation.
We show the results from SegPhrase+ and the chunking method mentioned in the previous
interesting phrase mining application in Table 2.10. Queries were chosen to be capable of show-
ing the difference between the two methods on both the Academia and Yelp datasets. Distributed
representations were learned through an existing tool [Mikolov et al., 2013] and ranking scores
were computed based on cosine similarity.
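For reference, a minimal sketch of this pipeline using the gensim implementation of word2vec, assuming the segmented corpus writes each mined phrase as a single underscore-joined token; the file name and query below are illustrative:

from gensim.models import Word2Vec

# Each line of the file is one segmented sentence, with multi-word phrases
# joined by underscores (an assumption of this sketch).
sentences = [line.split() for line in open("segmented_corpus.txt", encoding="utf-8")]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=30, workers=8)

# Rank by cosine similarity, as done for the top-k similar phrase lists.
for phrase, score in model.wv.most_similar("data_mining", topn=5):
    print(phrase, round(score, 3))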
From the table, one can easily tell that the ranked lists from SegPhrase+ make more sense than
those from phrase chunking. One possible reason is that the chunking method only detects noun
phrases in the corpus, providing less accurate information about phrase occurrences than SegPhrase+
to the vector representation learning algorithm.
Table 2.9: Interesting phrases mined from papers published in SIGMOD and SIGKDD

SIGMOD SIGKDD
SegPhrase+ Chunking SegPhrase+ Chunking
1 data base data base data mining data mining
2 database system database system data set association rule
3 relational database query processing association rule knowledge discovery
4 query optimization query optimization knowledge discovery frequent itemset
5 query processing relational database time series decision tree
… … … … …
51 sql server database technology assoc. rule mining search space
52 relational data database server rule set domain knowledge
53 data structure large volume concept drift important problem
54 join query performance study knowledge acquisition concurrency control
55 web service web service gene expression data conceptual graph
… … … … …
201 high dimensio. data efficient impl. web content optimal solution
202 location based serv. sensor network frequent subgraph semantic relation
203 xml schema large collection intrusion detection effective way
204 two phase locking important issue categorical attribute space complexity
205 deep web frequent itemset user preference small set
… … … … …

Case Study of Quality Phrases


We show some phrases from the ranking lists generated by ClassPhrase, SegPhrase, and SegPhrase+ in
Table 2.11. In general, phrase quality drops as the rank number goes up. ClassPhrase always performs the
worst among the three methods. SegPhrase+ is slightly better than SegPhrase, which is noticeable
for phrases ranked after 20,000. It is worth mentioning that the minimum sizes of the phrase lists are
50,577 and 42,989 for the two datasets, respectively.

2.5 SUMMARY
In this chapter, we introduced a data-driven model for extracting quality phrases from text cor-
pora with user guidance. By requiring limited training effort, the model can achieve outstanding
performance even for highly irregular textual data such as business reviews. The key idea is to rectify the raw
frequency of phrases, which otherwise misleads quality estimation. A segmentation-integrated approach is
Table 2.10: Top-5 similar phrases for representative queries (Top: Academia, Bottom: Yelp)

Query Data Mining Olap


Method SegPhrase+ Chunking SegPhrase+ Chunking
1 knowledge discovery driven methodologies data warehouse warehouses
2 text mining text mining online analy. proc. clustcube
3 web mining financial investment data cube rolap
4 machine learning knowledge discovery olap queries online analy. proc.
5 data mining techniques building knowledge multidim. databases analytical processing

Query Blu-ray Noodle Valet Parking


Method SegPhrase+ Chunking SegPhrase+ Chunking SegPhrase+ Chunking
1 dvd microwave ramen noodle soup valet huge lot
2 vhs lifetime wty noodle soup asian noodle self-parking private lot
3 cd recliner rice noodle beef noodle valet service self-parking
4 new release battery egg noodle stir fry free valet parking valet
5 sony new battery pasta fish ball covered parking front lot

therefore developed, which finally addresses this fundamental limitation of phrase mining. How-
ever, we discover that, despite the outstanding performance, the reliance on manual effort from
domain experts can still become an impediment for timely analysis of massive, emerging text cor-
pora. A fully automated algorithm, instead, can be much more useful in this scenario. Meanwhile,
this chapter focuses on multi-word phrase mining, while single-word phrases are not taken care of.
The integration of light-weight linguistic processors such as POS tagging is also worth studying.
We reserve these topics for the next chapter.

Table 2.11: Sampled quality phrases from Academia and Yelp datasets (Continues.)

Academia
Method ClassPhrase SegPhrase SegPhrase+
1 virtual reality virtual reality self organization
2 variable bit rate variable bit rate polynomial time approx.
3 shortest path shortest path least squares
… … … …
501 finite state frequency offset estimation health care
502 air traffic collaborative filtering gene expresion
503 long term ultra wide band finite state transducers
… … … …
2001 chemical reaction ad hoc networks quasi monte carlo
2002 container terminals hyperspectral remote sensing integer programming
2003 graceful degradation piecewise affine gray level
… … … …
10001 search terms test plan airline crew scheduling
10002 high dimensional space automatic text integer programming
10003 delay variation adaptive bandwidth web log data
… … … …
20001 test coverage implementation costs experience sampling
20002 adaptive sliding mode control virtual execution environments error bounded
20003 random graph models free market nonlinear time delay systems
… … … …
50001 svm method harmony search algorithm asymptotic theory
50002 interface adaptation integer variables physical mapping
50003 diagnostic fault simulation nonlinear oscillators distince patterns
… … … …

Table 2.11: (Continued.) Sampled quality phrases from Academia and Yelp datasets

Yelp
Method ClassPhrase SegPhrase SegPhrase+
1 taco bell taco bell tour guide
2 wet republic wet republic yellow tail
3 pizzeria bianco pizzeria bianco vanilla bean
… … … …
501 panoramic view art museum rm seafood
502 pretzel bun ice cream parlor pecan pie
503 spa pedicure pho kim long master bedroom
… … … …
2001 buffalo chicken wrap training sessions smashed potatoes
2002 salvation army folding chairs italian stallion
2003 shortbread cookies single bypass ferris wheel
… … … …
10001 seated promptly carrot soup gary danko
10002 leisurely stroll veggie soup benny benassi
10003 flavored water pork burrito big eaters
… … … …
20001 buttery toast late night specials cilantro hummus
20002 quick breakfast older women lv convention center
20003 slightly higher worth noting iced vanilla
… … … …
40001 friday morning conveniently placed coupled with
40002 start feeling cant remember way too high
40003 immediately start stereo system almost guaranteed
… … … …

CHAPTER 3

Automated Quality Phrase Mining
Almost all state-of-the-art methods in the NLP, IR, and text mining communities require human
experts at certain levels. For example, NLP-based methods [Frantzi et al., 2000, Park et al., 2002,
Zhang et al., 2008] require language experts for rigorous language rules or sophisticated labels (e.g.,
parsing tree labels) to identify phrase mentions. SegPhrase+, introduced in the last chapter, does not rely
on linguistic processing and outperforms many other methods [Deane, 2005, El-Kishky et al.,
2015, Frantzi et al., 2000, Parameswaran et al., 2010, Park et al., 2002, Zhang et al., 2008], but
needs hundreds of binary labels telling whether a phrase is of high quality. Such reliance on manual
effort from domain experts becomes an impediment for timely analysis of massive, emerging text
corpora. Besides this issue, an ideal automated phrase mining method, as shown in Figure 3.1, is
supposed to work smoothly for multiple languages with high performance in terms of precision,
recall, and efficiency.

Figure 3.1: Motivation: automated phrase mining without human effort for multiple languages. [Figure omitted: a diagram arranging methods along two spectra. Human effort: no labeling (e.g., ToPMine), distant labeling (AutoPhrase+), weak human labeling (e.g., SegPhrase+), and heavy human labeling (NLP-based methods). Language dependency: minimum (e.g., ToPMine, SegPhrase, and AutoPhrase+) versus heavy (NLP-based methods). AutoPhrase+ targets the best performance with no human effort and minimum language dependency.]

3.1 OVERVIEW
Toward the goal of making the framework fully automated, we summarize the following three
major challenges.

1. Can we completely remove the human effort for labeling phrases? In the previous chap-
ter, SegPhrase+ has shown that the quality of phrases generated by unsupervised meth-
ods [Deane, 2005, El-Kishky et al., 2015, Parameswaran et al., 2010] is acceptable but
much weaker than that of supervised methods, and at least a few hundred labels are necessary
for training. Distant training is a popular methodology to reduce expensive human labor by
utilizing high-quality phrases in knowledge bases as positive phrase labels.

2. Can we achieve high performance of phrase mining in multiple languages? Complicated pre-
processing models, such as dependency parsing, heavily rely on human efforts and thus can-
not be smoothly applied to multiple languages, as shown in Figure 3.1. To achieve high per-
formance with minimum language dependency, we fully utilize the results of the following
two techniques: (1) tokenization, which provides the building bricks of phrases, namely
the boundaries of words; and (2) part-of-speech (POS) tagging, another elementary
preprocessing step in NLP pipelines, which is available in most languages; there are also
language-independent part-of-speech taggers, such as TreeTagger [Schmid, 2013]. More-
over, Observation 3.1 suggests that the context information from POS tags can be a strong
signal for identifying phrase boundaries, complementing the frequency-based statistical
signals.

Observation 3.1 Combining frequency-based signals with POS information is helpful.

#1 Sophia Smith was born in England .
NNP NNP VBD VBN IN NNP .
#2 … the Great Firewall is …
… DT NNP NNP VBZ …
#3 This is a great firewall software .
DT VBZ DT JJ NN NN .

The data-driven methods usually rely on frequency-based signals [Deane, 2005, El-Kishky
et al., 2015, Liu et al., 2015, Parameswaran et al., 2010] and can lead to two types of errors.
(1) Over-decomposition: combinations of individually popular words tend to be decom-
posed. In #1, since both Sophia and Smith are very popular names, for the full name
to be recognized as a complete phrase, Sophia Smith is also required to be popular, which may not be true.
(2) Under-decomposition: popular phrases tend to resist decomposition. For instance, “great
firewall ” is mentioned in both #2 and #3. Suppose this phrase is mentioned frequently in
our corpus; it may prevent the algorithm from extracting “firewall software” from #3 if the latter is less pop-
ular. However, POS information can be helpful to avoid such faults. In #1, two consecutive
proper nouns emit a strong indicator for a phrase; in #2, the transition from the noun “Firewall ”
to the verb “is” implies a phrase boundary; in #3, “great ” is an adjective while both “firewall ”
and “software” are nouns, making “firewall software” a more likely unit.

#4 The discriminative classifier SVM is …


DT JJ NN NN VBZ …

On the other hand, purely considering POS tags may not be wise regardless of the tagging
performance. For example, in #4, “classifier SVM ” will be wrongly extracted if only POS
tags are considered. In this case, frequency-based signals can correct the error.

3. Can we simultaneously model single-word and multi-word phrases? In linguistic analysis, a
phrase is not only a group of multiple words, but possibly also a single word, as long as
it functions as a constituent in the syntax of a sentence [Finch, 2000]. Since they form a great portion
(ranging from 10–30% based on our experiments) of high-quality phrases, we should take
single-word phrases (e.g., ⌈UIUC⌋, ⌈Illinois⌋, and ⌈USA⌋) into consideration as well
as multi-word phrases to achieve a high recall in phrase mining.
In this chapter, we introduce a novel automated phrase mining method, AutoPhrase+, to
address these three challenges simultaneously, mainly using the following three techniques.

1. Positive-only distant training. Distant training is utilized to generate clean positive but
noisy negative labels from a knowledge base for ensemble classifiers to estimate phrase qual-
ity scores. Thereafter, human labeling is no longer needed.

2. POS-guided phrasal segmentation. This technique utilizes the context information embedded
in POS tags and accurately locates the boundaries of phrases in the given corpus, which
improves precision.

3. Single-word phrase modeling. The mining framework designed for multi-word phrase mining
is extended to single-word phrases and gains about 10–30% more recall.

In addition, the language dependency is minimized: AutoPhrase+ only requires tokeniza-
tion and POS tagging in the preprocessing. Theoretically, it is compatible with any language as
long as a knowledge base (e.g., Wikipedia), a tokenizer, and a POS tagger in that language are
available. Moreover, as demonstrated in our experiments, AutoPhrase+ supports English, Spanish,
and Chinese. To the best of our knowledge, this is the first phrase mining method that can smoothly
work for multiple languages. More importantly, it is adaptable to other languages with minimal
engineering cost.

3.2 AUTOMATED PHRASE MINING FRAMEWORK


Figure 3.2 presents the automated phrase mining framework. Different from the previous phrase
mining approach SegPhrase+, which requires human-generated labels, this new framework takes
a knowledge base as a side input. After preprocessing with third-party tools, including tokenizers
from Lucene and Stanford NLP as well as the POS tagger from TreeTagger, AutoPhrase+ runs
five modules: frequent phrase mining, noisy label generation, robust positive-only distant training,
POS-guided phrasal segmentation, and phrase quality re-estimation.

Figure 3.2: The automated phrase mining framework.

3.2.1 PHRASE LABEL GENERATION


To assign a quality score to each candidate phrase, as introduced in the previous chapter, SegPhrase+
requires domain experts to first carefully select hundreds of varying-quality phrases from millions
of candidates, and then annotate them with binary labels. For example, for computer science pa-
pers, our domain experts provided hundreds of positive labels (e.g., “spanning tree” and “computer
science”) and negative labels (e.g., “paper focuses” and “important form of ”). However, creating
such a label set is expensive, especially in specialized domains like clinical reports and business
reviews, because this approach provides no clues for how to identify the phrase candidates to be
labeled. In this section, we introduce a method that only utilizes existing general knowledge bases
without any other human effort.
Public knowledge bases (e.g., Wikipedia) usually encode a considerable number of high-
quality phrases in the titles, keywords, and internal links of pages. For example, by analyzing the
internal links and synonyms¹ in English Wikipedia, more than a hundred thousand high-quality
phrases were discovered. As a result, we place these phrases in a positive pool.
Knowledge bases, however, rarely, if ever, identify phrases that fail to meet our criteria, which
we call inferior phrases. An important observation is that the number of phrase candidates, based
on n-grams (recall the leftmost box of Figure 3.2), is huge, and the majority of them are actually of
inferior quality (e.g., “speaks at”). In practice, based on our experiments, among millions of
phrase candidates, usually only about 10% are of good quality. Therefore, phrase candidates that
are derived from the given corpus but fail to match any high-quality phrase derived from the
given knowledge base are used to populate a large but noisy negative pool.
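A minimal sketch of this pool construction, with illustrative names (kb_phrases stands for the high-quality phrases harvested from the knowledge base):

def build_label_pools(candidates, kb_phrases):
    # candidates: frequent n-grams mined from the corpus;
    # kb_phrases: high-quality phrases from the knowledge base.
    positive_pool = [c for c in candidates if c in kb_phrases]
    # The rest is only *noisily* negative: quality phrases that are simply
    # absent from the knowledge base end up here as label noise.
    negative_pool = [c for c in candidates if c not in kb_phrases]
    return positive_pool, negative_pool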
Directly training a classifier based on the noisy label pools is not a wise choice: some phrases
of high quality from the given corpus may have been missed (i.e., inaccurately binned into the
negative pool) simply because they were not present in the knowledge base. Instead, we propose to
utilize an ensemble classifier that averages the results of T independently trained base classifiers.
¹https://github.com/kno10/WikipediaEntities

Figure 3.3: The illustration of each base classifier. In each base classifier, we first randomly sample K
positive and K negative labels from the pools, respectively. There might be δ quality phrases among the
K negative labels. An unpruned decision tree is trained based on this perturbed training set.
As shown in Figure 3.3, for each base classifier, we randomly draw K phrase candidates with
replacement from the positive pool and the negative pool, respectively (considering a canonical
balanced classification scenario). This size-2K subset of the full set of all phrase candidates is
called a perturbed training set [Breiman, 2000], because the labels of some (δ in the figure) quality
phrases are switched from positive to negative. In order for the ensemble classifier to alleviate the
effect of such noise, we need to use base classifiers with the lowest possible training errors. We
grow an unpruned decision tree to the point of separating all phrases to meet this requirement. In
fact, such a decision tree will always reach 100% training accuracy when no two positive and negative
phrases share identical feature values in the perturbed training set. In this case, its ideal error is $\frac{\delta}{2K}$,
which approximately equals the proportion of switched labels among all phrase candidates (i.e.,
$\frac{\delta}{2K} \approx 10\%$). Therefore, the value of K is not sensitive to the accuracy of the unpruned decision
tree and is fixed as 100 in our implementation. Assuming the extracted features are distinguishable
between quality and inferior phrases, the empirical error evaluated on all phrase candidates, p,
should be relatively small as well.
An interesting property of this sampling procedure is that the random selection of phrase
candidates for building perturbed training sets creates classifiers that have statistically indepen-
dent errors and similar erring probability [Breiman, 2000, Martínez-Muñoz and Suárez, 2005].
Therefore, we naturally adopt random forests [Geurts, Ernst, and Wehenkel, 2006], which are veri-
fied, in the statistics literature, to be robust and efficient. The phrase quality score of a particular
phrase is computed as the proportion of all decision trees that predict it to be a quality
phrase. Suppose there are T trees in the random forest; the ensemble error can be estimated as
the probability of having more than half of the classifiers misclassify a given phrase candidate
as follows.

$$\mathrm{ensemble\_error}(T) = \sum_{t=\lfloor 1+T/2 \rfloor}^{T} \binom{T}{t} \, p^t (1-p)^{T-t}.$$

Figure 3.4: Ensemble errors for different values of p, varying T. [Figure omitted: ensemble error versus T for p = 0.05, 0.1, 0.2, and 0.4; all curves decrease toward 0 as T grows.]

From Figure 3.4, one can easily observe that the ensemble error approaches 0 as T
grows. In practice, T needs to be set larger due to the additional error brought by model bias.
Empirical studies can be found in Figure 3.8.
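For concreteness, this estimate can be evaluated directly; a minimal sketch:

from math import comb

def ensemble_error(p: float, T: int) -> float:
    # Probability that more than half of T independent base classifiers,
    # each erring with probability p, misclassify a given candidate.
    return sum(comb(T, t) * p ** t * (1 - p) ** (T - t)
               for t in range(T // 2 + 1, T + 1))

# e.g., ensemble_error(0.1, 100) is vanishingly small, as in Figure 3.4.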

3.2.2 PHRASE QUALITY ESTIMATION


In the last chapter, we introduced in detail the four criteria for measuring multi-word
phrase quality, i.e., popularity, concordance, informativeness, and completeness. Are they also
applicable to single-word phrases? Not necessarily.
Because single-word phrases cannot be decomposed into two or more parts, concor-
dance is no longer definable. As a complement, we propose the independence requirement for
quality single-word phrases, as below.

Independence. A quality single-word phrase is more likely a complete semantic unit in the given
documents. For example, “UIUC ” is a quality single-word phrase. However, “united,” usually
occurring within other quality multi-word phrases such as “United States,” “United Kingdom,”
“United Airlines,” and “United Parcel Service,” is not a quality single-word phrase, because its
independence is not high enough.

Informativeness Features. In information retrieval, stop words and inverse document frequency
(IDF) are two useful approaches to measure the word informativeness:

• Stop word. Whether this word is a stop word; and

• IDF of this word.

In general, quality single-word phrases are expected to be non-stop words with relatively large
IDF.
Punctuation appears commonly across different languages, especially quotes, brackets,
and capitalization. Therefore, we adopt (1) the probability that a single-word phrase is surrounded
by quotes or brackets and (2) the probability that the first character of a single-word phrase is in
uppercase. A higher probability usually indicates a more informative single-word phrase. A
good example is “SVM ” in “support vector machines (SVM).” Note that, in some languages, such as Chinese,
there is no uppercase feature.
The features for multi-word phrases in the previous chapter are inherited, including con-
cordance features such as pointwise mutual information and pointwise Kullback-Leibler divergence
after decomposing the phrase into two parts, and informativeness features involving IDF, stop
words, and punctuation.
In addition, we propose two new context-independent completeness features inspired
by Parameswaran et al. [2010]: (1) the ratio between the phrase frequency and the minimum
frequency among its sub-phrases; and (2) the ratio between the maximum frequency among its
super-phrases and the phrase frequency. A low sub-phrase ratio usually indicates that the phrase can be
shortened, while a high super-phrase ratio implies that the phrase is not complete. For instance, “NP-
complete in the strong ” tends to have a high super-phrase ratio because it always occurs in “NP-
complete in the strong sense;” “classifier SVM ” is expected to receive a low sub-phrase ratio because
both “classifier ” and “SVM ” are popular elsewhere.
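A minimal sketch of these two ratio features for a multi-word phrase, assuming phrase frequencies are stored in a dictionary keyed by word tuples; all names are illustrative:

def completeness_features(phrase, freq, candidates):
    # phrase: tuple of words; freq: dict from word tuple to count;
    # candidates: iterable of all frequent n-grams (word tuples).
    n = len(phrase)
    # minimum frequency among contiguous proper sub-phrases seen in freq
    sub_min = min((freq[phrase[i:j]]
                   for i in range(n) for j in range(i + 1, n + 1)
                   if (i, j) != (0, n) and phrase[i:j] in freq),
                  default=freq[phrase])
    # maximum frequency among super-phrases that contain the phrase
    super_max = max((freq[c] for c in candidates if len(c) > n and
                     any(c[k:k + n] == phrase for k in range(len(c) - n + 1))),
                    default=0)
    sub_ratio = freq[phrase] / sub_min      # low => the phrase can be shortened
    super_ratio = super_max / freq[phrase]  # high => the phrase is incomplete
    return sub_ratio, super_ratio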

3.2.3 POS-GUIDED PHRASAL SEGMENTATION


POS-guided phrasal segmentation, the most crucial component in AutoPhrase+, is pro-
posed to tackle the challenge of measuring completeness and independence by locating
every phrase mention in the corpus and rectifying phrase mentions previously obtained via string
matching.

Definition 3.2 POS-guided Phrasal Segmentation. The “POS-guided ” emphasizes combin-
ing with POS tags, which is helpful as indicated in Observation 3.1. Given a corpus C (i.e., a
length-n POS-tagged word sequence $\langle w_1 w_2 \ldots w_n, t_1 t_2 \ldots t_n \rangle$), a segmentation $S = s_1 s_2 \ldots s_m$
is induced by a boundary index sequence $B = \{b_1, b_2, \ldots, b_{m+1}\}$ satisfying $1 = b_1 < b_2 < \ldots < b_{m+1} = n+1$, where the $i$-th segment $s_i = \langle w_{b_i} w_{b_i+1} \ldots w_{b_i+|s_i|-1}, \; t_{b_i} t_{b_i+1} \ldots t_{b_i+|s_i|-1} \rangle$.
Here $|s_i|$ refers to the number of words/tags in segment $s_i$. Since $b_i + |s_i| = b_{i+1}$, for clarity we
use $w_{[b_i, b_{i+1})}$ to denote the word sequence $w_{b_i} w_{b_i+1} \cdots w_{b_{i+1}-1}$ and $t_{[b_i, b_{i+1})}$ to denote the POS tag
sequence $t_{b_i} t_{b_i+1} \cdots t_{b_{i+1}-1}$. Therefore, the $i$-th segment $s_i = \langle w_{[b_i, b_{i+1})}, t_{[b_i, b_{i+1})} \rangle$.
For a better understanding of the POS-guided phrasal segmentation, we provide the following example.

Example 3.3 Recall the example sentences in Observation 3.1. Ideal POS-guided phrasal segmentation results are as follows.

#1: 〈Sophia Smith, NNP NNP〉, 〈was, VBD〉, 〈born, VBN〉, 〈in, IN〉,
〈England, NNP〉, 〈., .〉
#2: …, 〈the, DT〉, 〈Great Firewall, NNP NNP〉, 〈is, VBZ〉, …
#3: 〈This, DT〉, 〈is, VBZ〉, 〈a, DT〉,
〈great, JJ〉, 〈firewall software, NN NN〉, 〈., .〉
#4: 〈The, DT〉, 〈discriminative classifier, JJ NN〉, 〈SVM, NN〉,
〈is, VBZ〉, …

Definition 3.4 POS Sequence Quality. The POS sequence quality is defined to be the probability of a word sequence
being a complete semantic unit given its corresponding POS tag sequence, according to the above
criteria. Given a length-k POS tag sequence $t_1 t_2 \ldots t_k$, its POS sequence quality is

$$T(t_1 \ldots t_k) = p(\lceil v_1 \ldots v_k \rfloor \mid \mathrm{tag}(v_1 \ldots v_k) = t_1 \ldots t_k) \in [0, 1],$$

where $\mathrm{tag}(v_1 \ldots v_k)$ is the corresponding POS tag sequence of the word sequence $v_1 \ldots v_k$.
The estimator for POS sequence quality will also be learned, and it is expected to work as
follows.

Example 3.5 A good POS sequence quality estimator can return $T(\text{NN NN}) \approx 1$,
$T(\text{NN VB}) \approx 0$, and $T(\text{DT NN}) \approx 0$, where NN refers to a singular or mass noun (e.g., database),
VB means a verb in the base form (e.g., is), and DT is for a determiner (e.g., the).
The POS sequence quality score $T(\cdot)$ is designed to reward phrases with meaningful
POS patterns. The particular form we chose is

$$T(t_{[b_i, b_{i+1})}) = \left(1 - \delta(t_{b_{i+1}-1}, t_{b_{i+1}})\right) \cdot \prod_{j=b_i+1}^{b_{i+1}-1} \delta(t_{j-1}, t_j),$$

where $\delta(t_1, t_2)$ is the probability that the POS tag $t_2$ comes right after the POS tag $t_1$ within a phrase
in the given document collection. In this formula, the first term represents that there is a phrase
boundary between positions $b_{i+1}-1$ and $b_{i+1}$, while the product indicates that all POS tags among $t_{[b_i, b_{i+1})}$
are in the same phrase. This POS quality score can naturally counter the bias toward longer segments
because exactly one of $\delta(t_1, t_2)$ and $(1 - \delta(t_1, t_2))$ is always multiplied in for each consecutive tag pair, no matter how the corpus
is segmented. Note that the length penalty model in SegPhrase+ is a special case in which all $\delta(t_1, t_2)$
share the same value.
Mathematically, $\delta(t_1, t_2)$ is defined as

$$\delta(t_1, t_2) = p(\ldots \lceil \ldots w_1 w_2 \ldots \rfloor \ldots \mid C, \; \mathrm{tag}(w_1) = t_1 \wedge \mathrm{tag}(w_2) = t_2).$$


As it depends on how documents are segmented into phrases, $\delta(t_1, t_2)$ will be learned during the
context-aware phrasal segmentation.
Now that we have both the phrase quality Q and the POS sequence quality T ready, we can
formally define the POS-guided phrasal segmentation model. The joint probability of a corpus
C and a segmentation $S = s_1 \ldots s_m$ is factorized as

$$p(S, C) = \prod_{i=1}^{m} p\left(b_{i+1}, \lceil w_{[b_i, b_{i+1})} \rfloor \,\middle|\, b_i, t_{[b_i, b_{i+1})}\right),$$

where $p(b_{i+1}, \lceil w_{[b_i, b_{i+1})} \rfloor \mid b_i, t_{[b_i, b_{i+1})})$ is the probability of observing a word sequence $w_{[b_i, b_{i+1})}$
as the $i$-th quality segment given the previous boundary index $b_i$ and its corresponding POS tag
sequence $t_{[b_i, b_{i+1})}$.
Since the phrase segments function as constituents in the syntax of a sentence [Finch,
2000], they usually have weak dependence on each other. As a result, we assume these segments
in the word sequence are generated one by one, for the sake of both efficiency and simplicity.
For each segment, given the POS tag sequence t and the start index $b_i$ of a segment $s_i$, the
generative process is defined as follows.

1. Generate the end index $b_{i+1}$ according to its POS sequence quality:

$$p(b_{i+1} \mid b_i, t_{[b_i, b_{i+1})}) = p(\lceil w \rfloor \mid t_{[b_i, b_{i+1})}) = T(t_{[b_i, b_{i+1})}).$$

2. Given the two ends $b_i$ and $b_{i+1}$, generate the word sequence $w_{[b_i, b_{i+1})}$ according to a multi-
nomial distribution over all segments of length $(b_{i+1} - b_i)$:

$$p(w_{[b_i, b_{i+1})} \mid b_i, b_{i+1}) = p\left(w_{[b_i, b_{i+1})} \,\middle|\, |s_i| = b_{i+1} - b_i\right).$$

3. Finally, generate an indicator of whether $w_{[b_i, b_{i+1})}$ forms a quality segment according to
its quality:

$$p(\lceil w_{[b_i, b_{i+1})} \rfloor \mid w_{[b_i, b_{i+1})}) = Q(w_{[b_i, b_{i+1})}).$$

Integrating the above three generative steps together, we have the following probabilistic
factorization:

$$p(b_{i+1}, \lceil w_{[b_i, b_{i+1})} \rfloor \mid b_i, t_{[b_i, b_{i+1})}) = T(t_{[b_i, b_{i+1})}) \; p\left(w_{[b_i, b_{i+1})} \,\middle|\, |s_i| = b_{i+1} - b_i\right) Q(w_{[b_i, b_{i+1})}).$$

Therefore, for a given corpus C with D documents, there are three subproblems:

• learn $p(u \mid |u|)$ for each frequent word and phrase $u \in P$ (we denote $p(u \mid |u|)$ as $\theta_u$ for
convenience);

• learn $\delta(t_1, t_2)$ for every POS tag pair; and

• infer the segmentation S when $\theta$ and $\delta$ are fixed.

Algorithm 6: POS-guided Phrasal Segmentation (PGPS)

Input: Corpus $C = \langle w_1 w_2 \ldots w_n, t_1 t_2 \ldots t_n \rangle$, phrase quality Q, parameters $\theta$ and $\delta$.
Output: Optimal segmentation S.
// $h_i \equiv \max_S p(S, C = \langle w_{[1,i)}, t_{[1,i)} \rangle \mid Q, \theta, \delta)$
$h_1 \leftarrow 1$; $h_i \leftarrow 0$ for $1 < i \le n+1$
for $i = 1$ to $n$ do
    for $j = i+1$ to $n+1$ do
        if there is no phrase starting with $w_{[i,j)}$ then break  // efficiently implemented via a Trie
        // in practice, logs and additions are used to avoid underflow
        if $h_i \cdot p(j, \lceil w_{[i,j)} \rfloor \mid i, t_{[i,j)}) > h_j$ then
            $h_j \leftarrow h_i \cdot p(j, \lceil w_{[i,j)} \rfloor \mid i, t_{[i,j)})$
            $g_j \leftarrow i$
$j \leftarrow n+1$; $m \leftarrow 0$
while $j > 1$ do
    $m \leftarrow m+1$
    $s_m \leftarrow \langle w_{[g_j, j)}, t_{[g_j, j)} \rangle$
    $j \leftarrow g_j$
return $S \leftarrow s_m s_{m-1} \ldots s_1$
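The following is a minimal runnable sketch of Algorithm 6 in Python, working in log space as the comments in the pseudocode suggest. The dictionary defaults for unseen δ entries and for single-word qualities, and the continue in place of the Trie-based early break, are simplifying assumptions.

import math

def pos_seq_quality(tags, i, j, delta, n):
    # T(t_[i,j)): boundary term (1 - delta(t_{j-1}, t_j)) times the
    # within-segment product of delta over consecutive tag pairs.
    q = (1.0 - delta.get((tags[j - 1], tags[j]), 0.5)) if j < n else 1.0
    for k in range(i + 1, j):
        q *= delta.get((tags[k - 1], tags[k]), 0.5)
    return q

def pgps(words, tags, Q, theta, delta, max_len=6):
    # Assumes every single word appears in theta, so a full segmentation
    # always exists; Q defaults to 1.0 for phrases without a quality score.
    n = len(words)
    h = [float("-inf")] * (n + 1)  # h[i]: best log-prob of segmenting words[:i]
    g = [0] * (n + 1)              # back-pointers to segment start indices
    h[0] = 0.0
    for i in range(n):
        if h[i] == float("-inf"):
            continue
        for j in range(i + 1, min(i + max_len, n) + 1):
            u = tuple(words[i:j])
            if u not in theta:     # no frequent phrase spans [i, j)
                continue
            p = pos_seq_quality(tags, i, j, delta, n) * theta[u] * Q.get(u, 1.0)
            score = h[i] + math.log(max(p, 1e-300))
            if score > h[j]:
                h[j], g[j] = score, i
    segments, j = [], n
    while j > 0:
        segments.append((g[j], j))
        j = g[j]
    return segments[::-1]          # list of (start, end) word-index pairs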


We employ the maximum a posteriori principle and maximize the joint probability of the
corpus:

$$\sum_{d=1}^{D} \log p(S_d, C_d) = \sum_{d=1}^{D} \sum_{i=1}^{m_d} \log p\left(b_{i+1}^{(d)}, \lceil w^{(d)}_{[b_i, b_{i+1})} \rfloor \,\middle|\, b_i^{(d)}, t^{(d)}_{[b_i, b_{i+1})}\right). \qquad (3.1)$$

Given $\theta$ and $\delta(\cdot, \cdot)$, to find the best segmentation that maximizes Equation (3.1), we
develop an efficient dynamic programming algorithm for the POS-guided phrasal segmentation
(PGPS), as shown in Algorithm 6.
When the segmentation S and the parameter $\theta$ are fixed, the closed-form solution of
$\delta(t_1, t_2)$ is:

$$\delta(t_1, t_2) = \frac{\sum_{d=1}^{D} \sum_{i=1}^{m_d} \sum_{j=b_i^{(d)}}^{b_{i+1}^{(d)}-2} \mathbb{1}\left(t_j^{(d)} = t_1 \wedge t_{j+1}^{(d)} = t_2\right)}{\sum_{d=1}^{D} \sum_{i=1}^{n_d - 1} \mathbb{1}\left(t_i^{(d)} = t_1 \wedge t_{i+1}^{(d)} = t_2\right)}, \qquad (3.2)$$
Algorithm 7: AutoPhrase+ Viterbi Training

Input: Corpus C and phrase quality Q.
Output: $\theta$ and $\delta$.
initialize $\theta$ with normalized raw frequencies in the corpus
while $\theta$ does not converge do
    while $\delta$ does not converge do
        for $d = 1$ to $D$ do
            $S_d \leftarrow \mathrm{PGPS}(C_d, Q, \theta, \delta)$ via Algorithm 6
        update $\delta$ using $S_1 S_2 \ldots S_D$ according to Eq. (3.2)
    for $d = 1$ to $D$ do
        $S_d \leftarrow \mathrm{PGPS}(C_d, Q, \theta, \delta)$ via Algorithm 6
    update $\theta$ using $S_1 S_2 \ldots S_D$ according to Eq. (3.3)
return $\theta$ and $\delta$

where 1./ denotes the identity indicator and ı.t1 ; t2 / is the unsegmented ratio among all t1 t2
pairs in the given corpus.
Similarly, once the segmentation S and the parameter ı are fixed, the closed-form solution
of u can be derived as:
PD Pmd .d /
d D1 iD1 1.wŒbi ;bi C1 / D u/
u D PD Pm .d /
: (3.3)
d
d D1 iD1 1.jsi j D juj/
We can see that u is the times that u becomes a complete segment normalized by the number
of the length-juj segments.
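For concreteness, a minimal sketch of these two closed-form updates, assuming each document is given as parallel word/tag lists and each segmentation as a list of (start, end) index pairs, as returned by the PGPS sketch above:

from collections import Counter

def update_parameters(docs, segmentations):
    # docs[d] = (words, tags); segmentations[d] = list of (start, end) pairs.
    within, total = Counter(), Counter()
    seg_count, len_count = Counter(), Counter()
    for (words, tags), segs in zip(docs, segmentations):
        for a in range(len(tags) - 1):            # all adjacent tag pairs
            total[(tags[a], tags[a + 1])] += 1
        for start, end in segs:
            for a in range(start, end - 1):        # tag pairs inside one segment
                within[(tags[a], tags[a + 1])] += 1
            u = tuple(words[start:end])
            seg_count[u] += 1
            len_count[end - start] += 1
    # Eq. (3.2): fraction of t1-t2 pairs that stay inside a single segment.
    delta = {pair: within[pair] / total[pair] for pair in total}
    # Eq. (3.3): segment count of u over the count of equal-length segments.
    theta = {u: seg_count[u] / len_count[len(u)] for u in seg_count}
    return delta, theta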
As shown in Algorithm 7, our optimization strategy for learning $\delta$ and $\theta$ is a nested iterative
optimization process similar to that of SegPhrase+. In our case, given corpus C, the inner loop first
fixes $\theta$ and keeps adjusting the parameters $\delta$ using the segmentation that maximizes $p(S, C \mid Q, \theta, \delta)$
until convergence. Then, in the outer loop, $\delta$ is fixed and $\theta$ is updated. This procedure is
iterated until a stationary point is reached.
As in SegPhrase+, efficiency is the major reason that we choose Hard-EM
instead of finding a maximum likelihood estimator of $\theta$ and $\delta$ using Soft-EM (i.e., the Baum-Welch
algorithm [Bishop, 2006]).

3.2.4 PHRASE QUALITY RE-ESTIMATION


Different from SegPhrase+, instead of adding two more features computed based on the rectified
frequency of phrases, we reconstruct the whole feature space as follows.
• When generating frequency-related features, such as the concordance and completeness
features, the raw frequency is replaced by the rectified frequency.

• When calculating occurrence-related features, such as the informativeness features, only
occurrences matched as complete segments are considered.

The reconstruction exploits the rectified frequency in a more thorough way and thus yields a
better performance gain.
In addition, an independence feature is added for single-word phrases. Formally, it is the
ratio of the rectified frequency of a single-word phrase, given the context-aware phrasal segmen-
tation, over its raw frequency. Quality single-word phrases are expected to have large values. For
example, “united ” is likely to have an almost-zero ratio.

3.2.5 COMPLEXITY ANALYSIS


As with SegPhrase+, the complexity of AutoPhrase+ is theoretically linear in the corpus size and
thus very efficient and scalable. Meanwhile, every component can be parallelized in an almost
lock-free way by grouping either phrases or sentences. That is, supposing T threads are enabled, the
time complexity becomes $O(|C|/T)$.

3.3 EXPERIMENTAL STUDY


In this section, we apply the proposed method to mine quality phrases from three large text
corpora in three languages (English, Spanish, and Chinese) and validate whether it satisfies the
three requirements of automated phrase mining. After introducing the experimental settings, we
compare the proposed method with many other methods to demonstrate its high precision and
recall, as well as the importance of single-word phrase modeling. Then, we explore the robustness
of the proposed positive-only distant training and its performance against expert labeling. The
importance of incorporating POS tags in the POS-guided phrasal segmentation is also
verified. In the end, we present the case study and the efficiency study.
For the purpose of checking the language dependency of the proposed model, we have
prepared three large collections of text in different languages, cleaned from English, Spanish,
and Chinese Wikipedia articles. Table 3.1 shows the detailed statistics of these datasets. Since
the size of all English Wikipedia articles is too large for some baselines, in either time or memory,
we randomly sample 10 million documents as the English dataset; the Spanish dataset has more
than 11 million documents and about 1 billion tokens; the smallest dataset, Chinese, is still more
than 1.5 GB in file size.
Table 3.1: Dataset statistics
Language # Docs # Tokens File Size
English 10,000,000 808,013,246 3.94 GB
Spanish 11,032,323 791,230,831 4.06 GB
Chinese 241,291 371,899,237 1.56 GB
3.3. EXPERIMENTAL STUDY 47
We compare AutoPhrase+ with three types of methods, as follows. Every method returns a
ranked list of phrases.
Phrase Mining: There are many phrase extraction methods, such as NLP chunking methods,
ConExtr [Parameswaran et al., 2010], KEA [Witten et al., 1999], TopMine [El-Kishky et al., 2015],
and SegPhrase+. SegPhrase+ has shown its advantage over the others.
• SegPhrase+: SegPhrase+ is designed for English corpora and outperformed many other phrase min-
ing, keyphrase extraction, and noun phrase chunking methods. However, it requires human
experts to label hundreds of phrases with binary quality labels. To adapt this work to the
automated phrase mining setting in this chapter, we feed the binary phrase labels used by
AutoPhrase+ to SegPhrase+.

• WrapSegPhrase²: To make SegPhrase+ support different languages, we add an
encoding preprocessing step that first transforms a non-English corpus into English characters and
punctuation, as well as a decoding postprocessing step that translates the results back to the original language.
Parser-based Phrase Extraction: Using language-specific parsers, we can extract the minimal
phrase units (e.g., NP) from the parsing trees as phrase candidates. Parsers for all three languages
are available in the Stanford NLP tools [De Marneffe et al., 2006, Levy and Manning, 2003, Nivre
et al., 2016]. Two ranking heuristics are considered.
• TF-IDF: Rank the extracted phrase candidates by TF-IDF. It is more effective than C-
Value, as shown in Liu et al. [2015].
• TextRank: An unsupervised graph-based ranking model for keyword extraction [Mihalcea
and Tarau, 2004].
Chinese Segmentation Models: Different from English and Spanish, phrasal segmentation in
Chinese has been intensively studied because there is no whitespace in Chinese. The most effective
and popular segmentation methods are the following.
• AnsjSeg³ is a popular text segmentation algorithm for Chinese corpora. It ensembles the statis-
tical modeling methods of Conditional Random Fields (CRFs) and Hidden Markov Models
(HMMs) based on the n-gram setting.
• JiebaPSeg⁴ is a Chinese text segmentation method implemented in Python. It builds a di-
rected acyclic graph for all possible phrase combinations based on a prefix dictionary struc-
ture to achieve efficient phrase graph scanning. Then it uses dynamic programming to find
the most probable combination based on the phrase frequency. For unknown phrases, an
HMM-based model is used with the Viterbi algorithm.
²https://github.com/remenberl/SegPhrase-MultiLingual
³https://github.com/NLPchina/ansj_seg
⁴https://github.com/fxsjy/jieba
Note that all the parser-based phrase extraction and Chinese segmentation models are pre-trained
on general corpora, which should be similar to our Wikipedia datasets.
We denote our proposed method as AutoPhrase+. If only the sub-ranked list of multi-word
phrases in AutoPhrase+ is returned, it degenerates to AutoPhrase. If the context-aware phrasal
segmentation degenerates to the length penalty mode (i.e., all δ(t1, t2) share the same value), we
name it AutoSegPhrase.

3.3.1 EXPERIMENTAL SETTINGS


Default Parameters. We set the minimum support threshold to 30 and the maximum phrase
length to 6, which are the two parameters required by all methods. Other parameters required by
the compared methods were set according to the open-source tools or the original papers.

Efficiency Testing Environment. The following execution time experiments were all conducted
on the same machine mentioned in the previous chapter. The algorithm is fully implemented in
C++. The preprocessing includes tokenizers from Lucene and Stanford NLP as well as the POS
tagger from TreeTagger.

3.3.2 QUANTITATIVE EVALUATION AND RESULTS


Our experiments are designed to study how well our methods perform in terms of precision and
recall compared to other methods. For a list of phrases, precision is defined as the number of
quality phrases that occur in the list divided by the length of the list. Recall is defined as the number of
quality phrases that occur in the list divided by the total number of quality phrases. For a ranked list, each time
a quality phrase is encountered, we record the precision and recall of the ranked list up to that position
(i.e., the prefix list). In the end, we plot precision-recall curves from these records.
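This protocol can be made concrete with a minimal sketch (the function and variable names are hypothetical, not the book's evaluation code):

```python
def precision_recall_curve(ranked_phrases, quality_set):
    """Record (precision, recall) each time a quality phrase is encountered.

    ranked_phrases: list of phrases ordered by predicted quality.
    quality_set: set of known quality phrases (the evaluation pool).
    """
    hits, points = 0, []
    for i, phrase in enumerate(ranked_phrases, start=1):
        if phrase in quality_set:
            hits += 1
            precision = hits / i              # quality phrases so far / prefix length
            recall = hits / len(quality_set)  # quality phrases so far / total quality phrases
            points.append((precision, recall))
    return points

# toy usage
curve = precision_recall_curve(
    ["support vector machine", "of the", "data mining"],
    {"support vector machine", "data mining"},
)
print(curve)  # [(1.0, 0.5), (0.666..., 1.0)]
```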
Human Annotation. We rely on human evaluators to judge the quality of the phrases that
cannot be identified through any knowledge base. More specifically, on each dataset, we randomly
sample 500 such phrases from the predicted phrases of each method in the experiments. These
selected phrases are shuffled in a shared pool and evaluated by three reviewers independently. We
allow reviewers to use search engines when they encounter unfamiliar phrases. By majority
voting, phrases in this pool that receive at least two positive annotations are treated as quality phrases.
The intra-class correlations (ICCs) are above 0.9 on all datasets, which shows strong agreement
among reviewers. We focus on evaluating the ranked lists against this pool.
Precision-recall curves of all compared methods evaluated by human annotation on the three
datasets are presented in Figure 3.5. The trends on the English and Spanish datasets are similar,
while the trend on the Chinese dataset is slightly different.

Figure 3.5: Precision-recall curves evaluated by human annotation on the (a) English, (b) Spanish, and (c) Chinese datasets.

AutoPhrase+ is the best among all compared methods on all datasets, in terms of not only precision
but also recall. Significant recall advantages can always be observed on the English, Spanish,
and Chinese datasets, regardless of whether Wiki phrases or human annotation is used for evaluation.
For example, on the English dataset, the recall of AutoPhrase+ is more than 20% higher than
that of the second best method (SegPhrase+) in absolute value when evaluating by Wiki phrases. Moreover,
the recall differences between AutoPhrase+ and its variant AutoPhrase, ranging from 10% to 30%,
shed light on the importance of modeling single-word phrases. Meanwhile, one can also observe
a big precision gap between AutoPhrase+ and the best baseline on all three
datasets. Unsurprisingly, the phrase chunking-based methods TF-IDF and TextRank work
poorly, because their extraction and ranking are separated instead of unified.
Across the two Latin-alphabet datasets, English and Spanish, the precision-recall curves of the dif-
ferent methods have similar shapes. AutoPhrase+ and AutoPhrase overlap in the beginning, but
later the precision of AutoPhrase drops earlier and its recall is lower, due to the lack of single-word
phrases. Even so, AutoPhrase works better than the previous state-of-the-art method SegPhrase+.
TextRank starts with a higher precision than TF-IDF, but its recall is very low because of the spar-
sity of the constructed co-occurrence graph. TF-IDF achieves a reasonable recall but unsatisfactory
precision.
On the Chinese dataset, AutoPhrase+ and AutoPhrase show a clear gap from the very begin-
ning, different from the trends on the English and Spanish datasets; this reflects that
single-word phrases are more important in Chinese. The major reason is that a
considerable number of high-quality phrases in Chinese (e.g., person names) have only one token
after tokenization. The Chinese segmentation model AnsjSeg is very competitive:
it is slightly better than WrapSegPhrase, especially when evaluated by human annotation, and
shows comparable performance to AutoPhrase. This is because it not only leverages training data
for segmentation, but also benefits from extensive engineering work, including a huge dictionary of popu-
lar Chinese entity names and specific rules for certain types of entities. As a consequence, AnsjSeg
can easily extract many well-known terms and people/location names. Outperforming such a
strong baseline further confirms the effectiveness of AutoPhrase+. TF-IDF is slightly better than
the other pre-trained Chinese segmentation method, JiebaPSeg, while TextRank again works worst.
In conclusion, our proposed AutoPhrase+ consistently works the best among all compared
methods, demonstrating its effectiveness on three datasets in different languages. The dif-
ference between AutoPhrase+ and AutoPhrase shows the necessity of modeling single-word phrases.

3.3.3 DISTANT TRAINING EXPLORATION


To compare distant training with domain-expert labeling, we introduce two domain-specific
datasets in English, DBLP and Yelp, as shown in Table 3.2.

Table 3.2: Two domain-specific datasets in English

Dataset  Domain            # of Tokens  File Size  Positive Pool Size
DBLP     Scientific Paper  91.6 M       618 MB     29 K
Yelp     Business Review   145.1 M      749 MB     22 K

To be fair, all the configurations in the classifiers are the same except for the label selection
process. More specifically, we come up with four training pools:
1. EP means that domain experts give the positive pool.
2. DP means that a sampled subset from existing general knowledge forms the positive pool.
3. EN means that domain experts give the negative pool.
4. DN means that all unlabeled (i.e., not in the positive pool) phrase candidates form the negative
pool.
By combining any pair of the positive and negative pools, we have four variants, EPEN (in
SegPhrase+), DPDN (in AutoPhrase+), EPDN, and DPEN.
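The following sketch illustrates how the four pool combinations could be assembled; all names are hypothetical, and the actual label-selection code used in the experiments may differ:

```python
import random

def build_pools(candidates, expert_positive, expert_negative, kb_phrases, sample_size):
    """Assemble the four training-pool variants (all names hypothetical).

    candidates: set of all phrase candidates mined from the corpus.
    expert_positive / expert_negative: pools labeled by domain experts (EP / EN).
    kb_phrases: phrases found in a general knowledge base, e.g., Wikipedia.
    """
    EP, EN = set(expert_positive), set(expert_negative)
    # DP: a sample from existing general knowledge forms the positive pool.
    DP = set(random.sample(sorted(candidates & kb_phrases), sample_size))
    # DN: all unlabeled candidates (not in the positive pool) form the negative pool.
    return {
        "EPEN": (EP, EN),               # as in SegPhrase+
        "DPDN": (DP, candidates - DP),  # as in AutoPhrase+
        "EPDN": (EP, candidates - EP),
        "DPEN": (DP, EN),
    }
```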

Figure 3.6: AUC curves of the four variants on (a) DBLP and (b) Yelp, when we have enough positive labels in the positive pool EP.
First, we evaluate the performance difference between the two positive pools. Compared to
EPEN, DPEN adopts a positive pool sampled from knowledge bases instead of the well-designed
positive pool given by domain experts; the negative pool EN is shared. As shown in Figure 3.6, we
vary the size of the positive pool and plot the AUC curves. EPEN outperforms
DPEN, and the trends of the curves on both datasets are similar. Therefore, we conclude that the positive
pool generated from knowledge bases has reasonable quality, although its corresponding quality
estimator works slightly worse.
Second, we verify whether the proposed noise reduction mechanism works properly.
Compared to EPEN, EPDN adopts a negative pool of all unlabeled phrase candidates instead of
the well-designed negative pool given by domain experts; the positive pool EP is shared. In
Figure 3.6, the clear gap between them and the similar trends on both datasets show that the noisy
negative pool is slightly worse than the well-designed negative pool, but it still works effectively.
As illustrated in Figure 3.6, DPDN has the worst performance when the size of the positive pool
is limited. However, distant training can generate much larger positive pools, extending sig-
nificantly beyond the capacity of domain experts given the high expense of labeling. Conse-
quently, we are curious whether distant training can eventually beat domain experts once the positive
pool becomes large enough. We call the size at this tipping point the ideal number.

Figure 3.7: AUC curves of the four variants on (a) DBLP and (b) Yelp, after we exhaust the positive labels in the positive pool EP.

We increase the positive pool sizes and plot the AUC curves of DPEN and DPDN, while EPEN and
EPDN degenerate into dashed lines due to the limited labeling capacity of domain experts. As shown in
Figure 3.7, with a large enough positive pool, distant training is able to beat expert labeling. On
the DBLP dataset, the ideal number is about 700, while on the Yelp dataset it is around
1600. Our conjecture is that the ideal training size is proportional to the number of words (e.g., 91.6 M
in DBLP and 145.1 M in Yelp). We notice that, compared to the corpus size, the ideal number is
relatively small, which implies that distant training should be effective on many domain-specific
corpora, as long as they overlap with Wikipedia.
Besides, Figure 3.7 shows that as the positive pool size continues to grow, the AUC
score increases but the slope becomes smaller. The performance of distant training finally
stabilizes once a relatively large number of quality phrases has been fed in.

Figure 3.8: AUC curves of DPDN varying T on the DBLP and Yelp datasets.

We are also curious how many trees (i.e., what value of T) are enough for DPDN. We increase T and plot the AUC
curves of DPDN. As shown in Figure 3.8, on both datasets, as T grows, the AUC scores first
increase rapidly and then the improvement gradually slows down, which is consistent with the theoretical
analysis in Section 3.2.1.
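As an illustration of this kind of study (not the book's actual experiment), one could vary the ensemble size of a scikit-learn random forest on stand-in phrase features and watch the AUC saturate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(2000, 10)                                    # stand-in for phrase feature vectors
y = (X[:, 0] + 0.3 * rng.rand(2000) > 0.6).astype(int)    # stand-in noisy binary labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for T in [1, 10, 100, 1000]:
    clf = RandomForestClassifier(n_estimators=T, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"T={T:4d}  AUC={auc:.3f}")   # AUC typically rises fast, then plateaus
```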

3.3.4 POS-GUIDED PHRASAL SEGMENTATION


We are also interested in how much performance gain we can obtain from incorporating POS
tags in this segmentation model, especially for different languages. We select Wikipedia article
datasets in three different languages: English, Spanish, and Chinese. To be fair, since SegPhrase+
only models multi-word phrases, we only use AutoPhrase for the comparison.

Figure 3.9: Precision-recall curves of AutoPhrase and AutoSegPhrase on the (a) English, (b) Spanish, and (c) Chinese datasets.


Figure 3.9 compares the results of AutoPhrase and AutoSegPhrase, with the best baseline
methods as references. AutoPhrase outperforms AutoSegPhrase even on the English dataset, though
the previous chapter showed that the length penalty works reasonably well for English. A
similar observation holds on the Spanish dataset. Moreover, the advantage of AutoPhrase becomes more
significant on the Chinese dataset, indicating the poor generality of the length penalty.
In summary, thanks to the extra context and syntactic information for the
particular language, incorporating POS tags during phrasal segmentation works better
than penalizing all phrases of the same length equally.

3.3.5 EFFICIENCY STUDY


Figures 3.10a and 3.10b evaluate the running time and the peak memory usage of AutoPhrase+
using 10 threads on different proportions of the three datasets, respectively. Both time and memory
are linear in the size of the text corpora. Moreover, AutoPhrase+ can also be parallelized in an almost
lock-free way and shows a linear speedup in Figure 3.10c.
Figure 3.10: Efficiency of AutoPhrase+: (a) running time, (b) peak memory, and (c) multi-threading speedup.

Besides, compared to the previous state-of-the-art phrase mining method SegPhrase+ and
its variant WrapSegPhrase on the three datasets, as shown in Table 3.3, AutoPhrase+ achieves about
an 8 to 11 times speedup and about a 5 to 7 times reduction in memory usage. These improvements
come from more efficient indexing and more thorough parallelization.

3.3.6 CASE STUDY


We present a case study of the extracted phrases, as shown in Table 3.4. The top-ranked phrases
are mostly named entities, which makes sense for the Wikipedia article datasets. Even in the
long tail, there are still many high-quality phrases. For example, "great spotted
woodpecker" (a type of bird) and a Chinese phrase meaning "Computer Science and Technology" are
ranked at about 100,000. In fact, there are more than 345 K and 116 K phrases with a phrase quality
higher than 0.5 on the EN and CN datasets, respectively.
Table 3.3: Efficiency comparison between AutoPhrase+ and SegPhrase+/WrapSegPhrase utilizing 10 threads

                    English            Spanish            Chinese
                    Time     Memory    Time     Memory    Time     Memory
                    (mins)   (GB)      (mins)   (GB)      (mins)   (GB)
AutoPhrase+          32.77    13.77     54.05    16.42      9.43     5.75
(Wrap)SegPhrase+    369.53    97.72    452.85    92.47    108.58    35.38
Speedup/Saving       11.27x     86%      8.37x     82%     11.50x     83%

Table 3.4: e results of AutoPhrase+ on the EN and CN datasets, with translations and explanations
for Chinese phrases. e whitespaces on the CN dataset are inserted by the Chinese tokenizer.

EN CN
Rank Phrase Phrase Translation (Explanation)
1 Elf Aquitaine (the name of a soccer team)
2 Arnold Sommerfeld Absinthe
3 Eugene Wigner (the name of a novel or a TV-series)
4 Tarpon Springs notebook computer, laptop
5 Sean Astin Secretary of Party Committee
… … …
20,001 ECAC Hockey Aftican countries
20,002 Sacramento Bee The Left (German: Die Linke)
20,003 Bering Strait Fraser Valley
20,004 Jacknife Lee Hippocampus
20,005 WXYZ-TV Mitsuki Saiga (a voice actress)
… … …
99,994 John Gregson Computer Science and Technology
99,995 white-tailed eagle Fonterra (a company)
99,996 rhombic dodecahedron The Vice President of Writers
Association of China
99,997 great spotted woodpecker Vitamin B
99,998 David Manners controlled guidance of the media


CHAPTER 4

Phrase Mining Applications


This book investigates the problem of phrase mining and introduces a series of methodologies
to solve it. It first presents the limitations of relying on n-gram-based representations, and then
proposes to use quality phrases as the representation units due to their superior interpretability.
Both effective and scalable solutions are proposed and empirically validated on multiple
real-world datasets, such as scientific publications, business reviews, and online encyclopedias. The
corresponding source code has been released on GitHub, and we encourage open collaborative
development.

• SegPhrase+ Liu et al. [2015]: https://github.com/shangjingbo1226/SegPhrase

• AutoPhrase+ Shang et al. [2017]: https://github.com/shangjingbo1226/AutoPhrase

In the following sections, we introduce four applications to showcase the impact of the
phrase mining results, and discuss the research frontier.

4.1 LATENT KEYPHRASE INFERENCE


Quality phrases mined in the previous chapters are document-independent. That is to say, for
a particular document, it is difficult to tell which phrase is more salient than the rest. One solution,
studied in our recent publication [Liu et al., 2016], is to rank phrases by their topical relevance
to the document content. In particular, this work is motivated by the application of document
representation.
If one looks back at the literature, the most common document representation is the bag-
of-words, due to its simplicity and efficiency. This method, however, typically fails to capture
word-level synonymy (missing shared concepts across distinct words, such as "doctor" and "physi-
cian") and polysemy (missing distinct concepts within the same word: "Washington," for example, can be either
the city or the government). As a remedy, topic models [Blei et al., 2003, Deerwester et al.,
1990] try to overcome this limitation by positing a set of latent topics, which are distributions
over words, and assuming that each document can be described as a mixture of these topics.
Nevertheless, the interpretability of the latent space of topic models is not straightforward, and pur-
suing semantic meaning in inferred topics is difficult. Concept-based models [Gabrilovich and
Markovitch, 2007, Gottron et al., 2011, Hassan and Mihalcea, 2011, Song et al., 2011] were
proposed to overcome these barriers. The intuition is to link documents with concepts in a
general Knowledge Base (KB), like Wikipedia or Freebase, and assign relevance scores accordingly.
For example, the text sequence "DBSCAN for knowledge discovery" can be mapped to KB con-
cepts like "KB: data mining," "KB: density-based clustering," and "KB: dbscan" (relevance scores are
omitted). Such methods take advantage of a vast amount of highly organized human knowledge.
However, most existing knowledge bases are manually maintained, and are limited in cov-
erage and freshness. Researchers have therefore recently developed systems such as Probase [Wu
et al., 2012] and DBpedia [Bizer et al., 2009] to replace or enrich traditional KBs. Nevertheless,
the rapid emergence of large, domain-specific text corpora (e.g., business reviews) poses signif-
icant challenges to traditional concept-based techniques and calls for methods of representing
documents by interpretable units without requiring a KB.
One solution in Liu et al. [2016] is to instantiate the interpretable units in the document
representation as quality phrases. That is to say, a document is represented as a subset of quality
phrases that are informative for summarizing the document content. For ease of presentation, we
call these phrases document keyphrases.
However, not all document keyphrases are frequently mentioned in the text. In other words,
phrase frequency does not necessarily indicate saliency. To deal with this challenge, we propose
to associate each quality phrase with a silhouette, a cohesive set of topically related content units
(i.e., words and phrases) learned from the corpus itself, to help infer the topical relevance
between the document content and each quality phrase mined in the previous chapters.
These silhouettes also enhance the interpretability of the corresponding quality phrases. An
example of using document keyphrases to represent text is provided in Table 4.1, together with
the results of other approaches. The underlying technique, called Latent Keyphrase Inference (LAKI), is
shown in Figure 4.1. LAKI can be divided into two phases: (i) the offline phrase silhouette learning
phase, which extracts quality phrases from the in-domain corpus and learns their silhouettes,
and (ii) the online document keyphrase inference phase, which identifies keyphrases for each
query based on the quality phrase silhouettes, as outlined below.

Table 4.1: Representations for query “DBSCAN is a method for clustering in process of knowledge
discovery,” returned by various categories of methods

Categories Representation
Words dbscan, method, clustering, process, ...
Topics [k-means, clustering, clusters, dbscan, ...]
[clusters, density, dbscan, clustering, ...]
[machine, learning, knowledge, mining, ...]
KB Concepts data mining, clustering analysis, dbscan, ...
Document Keyphrases dbscan: [dbscan, density, clustering, ...]
clustering: [clustering, clusters, partition, ...]
data mining: [data mining, knowledge, ...]
Figure 4.1: Overview of LAKI. White and grey nodes represent quality phrases and content units, respectively.

• Offline Phrase Silhouette Learning:

1. Mine quality phrases from a textual corpus; and
2. learn quality phrase silhouettes by iteratively optimizing a Bayesian network with re-
spect to the unknown values, i.e., the latent document keyphrases, given the observed content
units in the training corpus.

• Online Document Keyphrase Inference:

1. Segment the input query into content units; and
2. infer document keyphrases given the observed content units, which quantifies
the relatedness between the input query and the corresponding keyphrases.

The offline phase is critical in the sense that the online inference phase can be formulated
as its sub-process. Technically speaking, the learning is done by optimizing a statistical Bayesian
network, given the observed content units (i.e., words and phrases after phrasal segmentation) in a
training corpus. We use the DAG-like Bayesian network shown in Figure 4.2. Content units are
located at the bottom layer, and quality phrases form the rest. Both types of nodes act as binary
variables,¹ and directional links between nodes depict their dependencies.
Before diving into the details, we motivate our Bayesian approach to the silhouetting prob-
lem. First, this approach enables our model to infer more than just explicitly mentioned document
keyphrases. For example, even if the text only contains "html" and "css," the words "web page"
come to mind. More than that, a multi-layered network will activate an ancestor quality
phrase like "world wide web" even when it is not directly linked to "html" or "css," which are con-
tent units in the bottom layer.

¹For multiple mentions of a content unit, we can simply make several copies of that node together with its links.

Figure 4.2: An illustrative Bayesian network for quality phrase silhouetting.
Meanwhile, we expect to identify document keyphrases with different relatedness scores.
Reflected in this Bayesian model from a top-down view, when a parent quality phrase is activated,
its children with stronger connections are more likely to be activated.
Furthermore, this formulation is flexible. We allow a content unit to be activated by each
connected quality phrase as well as by a random noise factor (not shown in Figure 4.2), behaving
like a Noisy-OR, i.e., a logical OR gate with some probability of producing a "noisy" output. This
increases the robustness of the model, especially when the training documents are noisy.
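A minimal sketch of the Noisy-OR gate follows (the function name is hypothetical, and the leak probability is an assumed parameter rather than a value from the book):

```python
import math  # math.prod requires Python 3.8+

def noisy_or(active_parent_weights, leak=0.01):
    """P(content unit = on) under a Noisy-OR gate.

    active_parent_weights: link weights w_i in [0, 1] of the parent quality
    phrases that are currently active; each parent independently fails to
    activate the child with probability (1 - w_i).
    leak: probability the unit turns on by random noise alone.
    """
    p_off = (1.0 - leak) * math.prod(1.0 - w for w in active_parent_weights)
    return 1.0 - p_off

print(noisy_or([0.8, 0.5]))  # 1 - 0.99 * 0.2 * 0.5 = 0.901
```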
There are two challenges in the learning process: (1) how to learn the link weights given the
fixed Bayesian network structure, and (2) how to perform the initialization that decides this structure
and sets the initial link weights.
For the former, to effectively learn the link weights, Maximum Likelihood Estimation (MLE)
is adopted. The intuition is to maximize the likelihood of observing the content units together with
the partially observed document keyphrases.² But this objective is extremely difficult to optimize directly, due to
the latent states of the remaining quality phrases. In this case, we resort to the Expectation-
Maximization (EM) algorithm, which is guaranteed to reach a local optimum. The EM
algorithm starts with some initial guess at the link weights and then proceeds to iteratively
generate successive estimates by repeatedly applying the E-step (Expectation step) and M-step
(Maximization step) until the MLE objective changes minimally.

Expectation Step: The E-step computes the conditional probability of the unobserved document
keyphrases, considering all their state combinations. This step turns out to be exactly what
we conduct in the online inference phase; in other words, the online
inference phase is a sub-process of the offline training phase.

²Explicit document keyphrases can be identified by applying existing keyphrase extraction methods like Witten et al. [1999].
Unfortunately, the E-step cannot be executed easily. Since each latent quality phrase in
Figure 4.2 acts as a binary variable, the number of possible state combinations can be as large as O(2^n).
That is, accurately computing the probabilities required in the E-step is NP-hard for a Bayesian
network like ours [Cooper, 1990]. We therefore adopt two approaches to approximately collect
the sufficient statistics. The first is to apply a sampling technique such as Markov Chain Monte
Carlo to search for the most likely state combinations. Among the Monte Carlo family, we apply
Gibbs sampling in this work to sample the quality phrase variables during each E-step. Given the content
unit vector representing a document, we proceed as follows.
1. Start with the initial setting: only the observed content units and explicit document keyphrases are
set to true.

2. In each sampling step, sequentially sample each quality phrase node from its conditional
distribution given the fixed states of all other nodes.

The above Gibbs sampling process ensures that the samples approximate the joint probability distri-
bution over all phrase variables and content units.
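In code, one sweep of this procedure could look like the following sketch, where `conditional_prob` stands in for the model-specific conditional derived from the Noisy-OR network (all names are hypothetical):

```python
import random

def gibbs_sweep(latent_nodes, state, conditional_prob, num_steps=100):
    """Gibbs sampling over binary quality-phrase variables (a sketch).

    state: dict node -> bool; observed content units and explicit document
           keyphrases are fixed to True beforehand and never resampled.
    conditional_prob(node, state): P(node = True | all other nodes), which in
           LAKI would be derived from the Noisy-OR Bayesian network.
    """
    samples = []
    for _ in range(num_steps):
        for node in latent_nodes:          # sequentially resample each latent node
            p = conditional_prob(node, state)
            state[node] = random.random() < p
        samples.append(dict(state))        # samples approximate the joint posterior
    return samples
```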
The second approach is applied as preprocessing right before the E-step. The idea is to
exclude the quality phrases that we are confident are unrelated. Intuitively, only a small portion of
the quality phrases are related to the observed text; there is no need to sample all phrase nodes, since
most of them have no chance of being activated. That is, we can skip the majority of them
based on a reasonable relatedness prediction before conducting Gibbs sampling. We adopt a local
arborescence structure [Wang et al., 2012] to approximate the original Bayesian network, which
allows us to roughly approximate the score of each node efficiently. We omit the
technical details here; interested readers can refer to the original paper [Liu et al., 2016].
Maximization Step: The M-step updates the link weights based on the sufficient statistics
collected in the Expectation step. In this problem setting, we can obtain a closed-form
solution by taking the derivative of the MLE objective function.
The remaining challenge is to decide the Bayesian network structure and to set the initial link
weights. A reasonable topological order of the DAG should be similar to that of a domain ontology:
the links among quality phrase nodes should reflect IS-A relationships [Yin and Shah, 2010].
Ideally, documents describing specific topics will first imply that some deep quality phrase
nodes are activated; the ontology-like topological order then ensures these content units have
the chance of being jointly activated by general phrase nodes via inter-phrase links. Many tech-
niques [Dahab et al., 2008, Sanderson and Croft, 1999, Yin and Shah, 2010] have previously
been developed to induce an ontological structure over quality phrases. It is out of the scope of our
work to specifically address these or evaluate their relative impact. We instead
use a simple data-driven approach, where quality phrases are sorted by their counts in the
corpus, assuming that phrase generality is positively correlated with the number of mentions. Thus,
quality phrases mentioned more often are higher up in the graph. Links are added between qual-
ity phrases when they are closely related and frequently co-occur. The link weights between
nodes are simply set to their similarity scores computed by Word2Vec [Mikolov et al., 2013].
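A sketch of this data-driven initialization, under the stated assumptions (the inputs and the `min_cooccur` threshold are hypothetical), could look like:

```python
import numpy as np

def init_structure(counts, cooccur, vectors, min_cooccur=50):
    """Initialize the Bayesian-network DAG (a sketch, not the paper's code).

    counts: dict phrase -> corpus frequency (proxy for generality).
    cooccur: dict (phrase_a, phrase_b) -> co-occurrence count.
    vectors: dict phrase -> Word2Vec embedding (numpy array).
    """
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    edges = {}
    for (a, b), n in cooccur.items():
        if n < min_cooccur:
            continue                 # keep only closely related, frequent pairs
        # the more frequent (more general) phrase becomes the parent
        parent, child = (a, b) if counts[a] >= counts[b] else (b, a)
        edges[(parent, child)] = cosine(vectors[parent], vectors[child])
    return edges  # (general phrase -> specific phrase) weighted by similarity
```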
To verify the effectiveness of LAKI, we present several queries with their top-ranked docu-
ment keyphrases, generated by the online phase of LAKI, in Table 4.2. Overall, we see that the
method can handle both short and long queries quite well. Most document keyphrases are suc-
cessfully identified in the list. The relatedness between keyphrases and queries generally drops as
the ranking position lowers. Meanwhile, both general and specific document keyphrases exist in the
ranked list. This provides LAKI with more discriminative power when applied to text
mining applications like document clustering and classification. Moreover, the method can
process ambiguous queries like "lda" based on contextual words such as "topic." We attribute
this to the well-modeled quality phrase silhouettes, and we show some examples of them in Ta-
ble 4.3. As a quality phrase silhouette may contain many content units, we only show the
ones with the most significant link weights. For ease of presentation, link weights are omitted in
the table.

4.2 TOPIC EXPLORATION FOR DOCUMENT COLLECTION

The previous application targets single-document analysis. How do we deal with a collection
of documents?
Most textbooks might tell you to build a topic model like Latent Dirichlet Allocation [Blei
et al., 2003]. It assumes that each document can be modeled as a mixture of latent topics and each
topic is represented as a mixture of unigrams. One can certainly replace the unigram-based text
input [El-Kishky et al., 2015, Guan et al., 2016, Wang et al., 2013] with our phrasal segmentation
results, such that topics become mixtures of phrases.
A more interesting approach for topic exploration is inspired by word embed-
ding [Mikolov et al., 2013]. We showed case studies in Section 2.4.4 in which, after applying
the algorithm, words and phrases are mapped into a vector space such that semantically similar
words and phrases have similar vector representations. On top of that, one can apply hierarchical
clustering over these embeddings, with the intuition that the depth of a cluster in the hierarchy
implies its topic granularity.
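For instance, with SciPy one could build such a hierarchy over phrase embeddings as follows (random vectors stand in for embeddings learned from a real corpus; the phrases are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

phrases = ["comfy_beds", "comfortable_beds", "free_wifi", "unique_view"]
vectors = np.random.rand(len(phrases), 100)  # stand-in for learned embeddings

# agglomerative (bottom-up) clustering with average linkage on cosine distance
Z = linkage(vectors, method="average", metric="cosine")
for depth in [0.3, 0.6, 0.9]:                # deeper cuts -> finer-grained topics
    labels = fcluster(Z, t=depth, criterion="distance")
    print(depth, dict(zip(phrases, labels)))
```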
Recently, TripAdvisor published a tech article³ on its blog introducing an appli-
cation built on hotel reviews. With the help of phrase mining and word embedding, they are able
to apply agglomerative clustering to obtain a gigantic hierarchy. Figure 4.3 shows a toy hierarchy
clustered over 18 phrases. Up to this point, the process is fully automatic. The remaining step relies
on human curation to pick out the interesting clusters and give each of them a snappy name.
The manual process is fairly easy: given a list of phrases for a cluster, together with the hotels
most related to the cluster, and even some example sentences from the review text for
those hotels, one can determine whether this would be a good way to explore the hotels of a city.
It would be difficult to mathematically define what is interesting, but easy for a human to know
it when they see it. The human can also come up with a clever name, which is simple given the
list of quality phrases.
³http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/

Table 4.2: Examples of document representation by LAKI with top-ranked document keyphrases (relatedness scores are omitted due to the space limit)

Academia query: LDA
Document keyphrases: linear discriminant analysis, latent dirichlet allocation, topic models, topic modeling, face recognition, latent dirichlet, generative model, topic, subspace models, ...
Yelp query: BOA
Document keyphrases: boa steakhouse, bank of america, stripsteak, agnolotti, credit card, santa monica, restaurants, wells fargo, steakhouse, prime rib, bank, vegas, las vegas, cash, cut, dinner, bank, money, ...

Academia query: LDA topic
Document keyphrases: latent dirichlet allocation, topic, topic models, topic modeling, probabilistic topic models, latent topics, topic discovery, generative model, mixture, text mining, topic distribution, etc.
Yelp query: BOA steak
Document keyphrases: steak, stripsteak, boa steakhouse, steakhouse, ribeye, craftsteak, santa monica, medium rare, prime, vegas, entrees, potatoes, french fries, filet mignon, mashed potatoes, texas roadhouse, etc.

Academia query: SVM
Document keyphrases: support vector machines, svm classifier, multi class, training set, margin, knn, classification problems, kernel function, multi class svm, multi class support vector machine, support vector, etc.
Yelp query: deep dish pizza
Document keyphrases: deep dish pizza, chicago, deep dish, amore taste of chicago, amore, pizza, oregano, chicago style, chicago style deep dish pizza, thin crust, windy city, slice, pan, oven, pepperoni, hot dog, etc.

Academia query: Mining Frequent Patterns without Candidate Generation
Document keyphrases: mining frequent patterns, candidate generation, frequent pattern mining, candidate, prune, fp growth, frequent pattern tree, apriori, subtrees, frequent patterns, candidate sets, etc.
Yelp query: I am a huge fan of the All You Can Eat Chinese food buffet
Document keyphrases: all you can eat, chinese food, buffet, chinese buffet, dim sum, orange chicken, chinese restaurant, asian food, asian buffet, crab legs, lunch buffet, fan, salad bar, all you can drink, etc.

Academia query: Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through means such as statistical pattern learning.
Document keyphrases: text analytics, text mining, patterns, text, textual data, topic, information, text documents, information extraction, machine learning, data mining, knowledge discovery, etc.
Yelp query: It's the perfect steakhouse for both meat and fish lovers. My table guest was completely delirious about his Kobe Beef and my lobster was perfectly cooked. Good wine list, they have a lovely Sancerre! Professional staff, quick and smooth.
Document keyphrases: kobe beef, fish lovers, steakhouse, sancerre, wine list, guests, perfectly cooked, lobster, staff, meat, fillet, fish, lover, seafood, ribeye, filet, sea bass, risotto, starter, scallops, steak, beef, etc.
Table 4.3: Examples of quality phrase silhouettes (from offline quality phrase silhouette learning). Link weights are omitted.

Academia quality phrase: linear discriminant analysis
Silhouette: linear discriminant analysis, lda, face recognition, feature extraction, principle component analysis, uncorrelated, between class scatter, etc.
Yelp quality phrase: boa steakhouse
Silhouette: boa steakhouse, boa, steakhouse, restaurant, dinner, strip steak, craftsteak, santa monica, vegas, filet, ribeye, new york strip, sushi roku, etc.

Academia quality phrase: latent dirichlet allocation
Silhouette: latent dirichlet allocation, lda, topics, perplexity, variants, subspace, mixture, baselines, topic models, text mining, bag of words, etc.
Yelp quality phrase: ribeye
Silhouette: ribeye, steak, medium rare, medium, oz, marbled, new york strip, well done, prime rib, fatty, juicy, top sirloin, filet mignon, fillet, etc.

Academia quality phrase: support vector machines
Silhouette: support vector machines, svm, classification, training, classifier, machine learning, prediction, hybrid, kernel, feature selection, etc.
Yelp quality phrase: deep dish
Silhouette: deep dish, pizza, crust, thin crust pizza, chicago, slice, pepperoni, deep dish pizza, pan style, pizza joints, oregano, stuffed crust, chicago style, etc.

Academia quality phrase: fp growth
Silhouette: fp growth, algorithm, apriori like, mining, apriori, frequent patterns, mining association rules, frequent pattern mining, fp tree, etc.
Yelp quality phrase: chinese food
Silhouette: chinese food, food, chinese, restaurants, americanized, asian, orange chicken, chow mein, wok, dim sum, panda express, chinese cuisine, etc.

Academia quality phrase: text mining
Silhouette: text mining, text, information retrieval, machine learning, topics, knowledge discovery, text data mining, text clustering, nlp, etc.
Yelp quality phrase: mcdonalds
Silhouette: mcdonalds, drive through, fast food, mcnugget, mcflurry, fast food chain, sausage mcmuffin, big bag, mcmuffin, burger king, etc.

Academia quality phrase: database
Silhouette: database, information, set, objects, storing, retrieval, queries, accessing, relational, indexing, record, tables, query processing, transactions, etc.
Yelp quality phrase: sushi
Silhouette: sushi, rolls, japanese, sushi joint, seafood, ayce, sushi rolls, salmon sushi, tuna sushi, california roll, sashimi, sushi lovers, sushi fish, etc.

Figure 4.3: Agglomerative clustering over 18 phrases.

Some interesting collections are shown in Figure 4.4. The whole process provides insight
into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby
attractions.
To systematically analyze large numbers of textual documents, another approach is to man-
age documents (and their associated metadata) in a multi-dimensional fashion (e.g., document
category, date/time, location, author, etc.). Such a structure provides the flexibility to understand
local information at different granularities. Moreover, contextualized analysis often yields
comparative insights: that is, given a pair of document collections, how can we identify common and
distinct content units of various granularity (e.g., words, sentences)?
However, word-based summarization suffers from limited readability, as single words are
usually non-informative and the bag-of-words representation does not capture the semantics of the
original document well: it may not be easy for users to interpret the combined meaning of the
words. Sentence-based summarization, on the other hand, may be too verbose to highlight the
general commonalities and differences, and users may be distracted by the irrelevant information
contained there. Recent studies [Ren et al., 2017a, Tao et al., 2016] leverage quality phrases,
i.e., minimal semantic units, to summarize the commonalities and differences. Figure 4.5 gives an
example where an analyst may pose multidimensional queries and the system is able to leverage

Figure 4.4: Left: collection “catch a show;” Right: collection “near the high line.”

Figure 4.5: Illustration of phrase-based summarization.


the relation between document subsets induced by query context and identify phrases that truly
distinguish the queried subset of documents from neighboring subsets.

Example 4.1 Suppose a multi-dimensional text database is constructed from The New York
Times news repository with three meta attributes: Location, Topic, and Time, as shown in Fig-
ure 4.5. An analyst may pose multidimensional queries such as (q1): ⟨China, Economy⟩ and
(q2): ⟨US, Gun Control⟩. Each query asks for a summary of a cell defined by the two dimensions Lo-
cation and Topic. What kind of cell summary would the analyst like to see? Frequent unigrams such as
"debt" or "senate" are not as informative as multi-word phrases such as "local government debt" and
"senate armed services committee." The phrases preserve better semantics as integral units rather
than as separate words.
Generally, three criteria should be considered when ranking representative phrases in a
selected multidimensional cell: (i) integrity: a phrase that forms an integral semantic unit should
be preferred over non-integral unigrams; (ii) popularity: the phrase should be popular in the selected cell (i.e., the selected
subset of documents); and (iii) distinctiveness: the phrase should distinguish the selected cell from other cells.
Within the whole ranked phrase list, the top-k representative phrases normally have the highest
value for users in text analysis. Furthermore, a top-k query also enjoys computational
advantages, allowing users to conduct fast analysis.
Bearing all this in mind, the authors designed statistical measures for each criterion
and use the geometric mean of the three scores as the ranking signal. The specific design principles
are as follows; a simplified sketch of the resulting ranking follows the list.

1. Popularity and distinctiveness of a phrase depend on the target cell, while integrity does
not.

2. Popularity and distinctiveness can be measured from the frequency statistics of a phrase in each
cell, while integrity cannot. To measure integrity, one needs to investigate each occurrence
of the phrase and other phrases to determine whether that phrase is indeed an integral
semantic unit. The quality score provided by SegPhrase+ and AutoPhrase+ is a good indicator.

3. Popularity relies on statistics from documents only within the cell, while distinctiveness
relies on documents both in and out of the cell. The documents involved in the distinctive-
ness calculation are defined as the contrastive document set. In the particular algorithm
design in the paper, the sibling cells of the query cell are used as the contrastive document set.
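The sketch below illustrates the geometric-mean ranking; the popularity and distinctiveness formulas here are simplified stand-ins for the paper's actual statistical measures, and all names are hypothetical:

```python
def rank_cell_phrases(cell_tf, sibling_tf, quality, k=10):
    """Rank representative phrases for one cube cell (a simplified sketch).

    cell_tf: dict phrase -> frequency within the queried cell.
    sibling_tf: dict phrase -> total frequency in the sibling cells
                (the contrastive document set).
    quality: dict phrase -> integrity score from SegPhrase+/AutoPhrase+.
    """
    total = sum(cell_tf.values())
    scored = []
    for p, tf in cell_tf.items():
        popularity = tf / total                                   # in-cell statistics only
        distinctiveness = tf / (tf + sibling_tf.get(p, 0) + 1.0)  # in-cell vs. out-of-cell
        integrity = quality.get(p, 0.0)                           # cell-independent
        score = (popularity * distinctiveness * integrity) ** (1 / 3)  # geometric mean
        scored.append((score, p))
    return [p for _, p in sorted(scored, reverse=True)[:k]]
```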

The algorithm is applied to The New York Times 2013–2016 dataset and the PubMed⁴ Cardiac
data, with representative phrase lists shown in Table 4.4 and Table 4.5. It is reported
that using phrases achieves the best trade-off among processing efficiency, storage cost, and
summarization interpretability.
⁴PubMed is a free full-text archive of biomedical and life sciences journal literature.

Table 4.4: Top-10 representative phrases for The New York Times queries

<U.S., Gun Control>: gun laws; the national rifle association; gun rights; background check; gun owners; assault weapons ban; mass shootings; high capacity magazines; gun legislation; gun control advocates

<U.S., Immigration>: immigration debate; border security; guest worker program; immigration legislation; undocumented immigrants; overhaul of the nation's immigration laws; legal status; path to citizenship; immigration status; immigration reform

<U.S., Domestic Politics>: gun laws; insurance plans; background check; health coverage; tax increases; the national rifle association; assault weapons ban; immigration debate; the federal exchange; medicaid program

<U.S., Law and Crime>: district attorney; shot and killed; federal court; life in prison; death row; grand jury; department of justice; child abuse; plea deal; second degree murder

<U.S., Military>: sexual assault in the military; military prosecutors; armed services committee; armed forces; defense secretary; military personnel; sexually assaulted; fort meade; private manning; pentagon officials

Table 4.5: Top representative phrases for five cardiac diseases

<Cardiomyopathy>: alpha-galactosidase a; apolipoprotein a-I; tissue-type activator; apolipoprotein e; tumor necrosis factor

<Cerebrovascular Accident>: cholesteryl ester transfer protein; brain neurotrophic factor; integrin alpha-iib; inward rectifier channel 2; neurogenic l.n.h.p. 3

<Ischemic Heart Disease>: interferon gamma; interleukin-4; interleukin-17a; adiponectin; beta-2-glycoprotein 1

<Arrhythmia>: mineralocorticoid receptor; tropomyosin alpha-1 chain; potassium v.g. h member 2; beta-2-glycoprotein 1; p2y purinoceptor 12

<Valve Dysfunction>: methionine synthase; ryanodine receptor 2; elastin; titin; myosin-binding protein c
Next, we introduce a topic exploration approach proposed from the perspective of graph
mining. In Gui et al. [2016], a gigantic heterogeneous information network is constructed by
incorporating metadata like authors, venues, and categories, as shown in Figure 4.6. The hetero-
geneity comes from the multiple types of nodes and links. This network is then modeled with an
embedding approach, i.e., nodes in the network, including the document phrases and metadata,
are all projected into a common low-dimensional space such that different types of nodes can
be compared in a homogeneous fashion. The embedding algorithm is designed to preserve the
semantic similarity among multi-typed nodes, such that nodes that are semantically similar will
be close in the space, with the distance measured by cosine similarity, for instance.

Figure 4.6: Embedding document keyphrases with metadata into the same space.

Different from the previous contrastive analysis, the embedding approach largely relies on
data redundancy to automatically infer relatedness. By viewing the association between any
document and its metadata as a link, the algorithm tries to push the embeddings of the link's constituent
nodes closer to each other. For instance, observing that the phrase "data mining" often appears together
with the venue "SIGKDD" rather than "SIGMOD" in the bibliographic corpus, the embedding
distance between "data mining" and "SIGKDD" should be smaller than that between "data
mining" and "SIGMOD."
The underlying technique essentially minimizes the Kullback-Leibler divergence be-
tween the model distribution and the empirical distribution defined on a target node, given the other
participating nodes on the same link as context. Practically, one can solve the optimization prob-
lem by requiring the likelihood of an observed link to be higher than that of a corresponding "fake"
link with one of the constituent nodes replaced by a randomly sampled node. For more
technical and experimental details, please refer to our recent paper [Gui et al., 2016].
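A simplified sketch of one such update (negative sampling with a logistic loss; the names and the sampling scheme are assumptions, not the paper's exact procedure):

```python
import numpy as np

def sgd_step(emb, context, target, all_nodes, lr=0.025, num_neg=5):
    """One negative-sampling update for an observed link (a simplified sketch).

    emb: dict node -> numpy embedding vector; `context` predicts `target`.
    The observed pair is pushed together, and num_neg randomly sampled "fake"
    pairs are pushed apart, approximating the KL objective described above.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    rng = np.random.default_rng()
    pairs = [(target, 1.0)] + [(rng.choice(all_nodes), 0.0) for _ in range(num_neg)]
    for node, label in pairs:
        v_node, v_ctx = emb[node], emb[context]
        g = lr * (label - sigmoid(v_node @ v_ctx))
        emb[node] = v_node + g * v_ctx      # move node toward/away from context
        emb[context] = v_ctx + g * v_node   # and vice versa
    return emb
```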

4.3 KNOWLEDGE BASE CONSTRUCTION


The above applications focus on document-related summarization, with phrase mentions as
the representation units. Beyond that, a big opportunity for phrase mining is to
recognize real-world entities, such as people, products, and organizations, from massive amounts
of interrelated unstructured data. By mining the token spans of phrase mentions in documents,
labeling their structured types, and inferring their relations, it is possible to construct or enrich a
semantically rich knowledge base and provide a conceptual summarization of such data.

Figure 4.7: Framework overview of entity/relation joint typing.
Existing entity detection tools such as noun phrase chunkers are trained on general-domain
corpora (e.g., news articles), but they work neither effectively nor efficiently on domain-specific
corpora such as Yelp reviews (e.g., "pulled pork sandwich" cannot be easily detected). Meanwhile,
the process of manually labeling a training set with a large number of entity and relation types is
too expensive and error-prone. Therefore, a major challenge is how to design a domain-independent
system that applies to text corpora from different domains in the absence of human-annotated,
domain-specific data. The rapid emergence of large, domain-specific text corpora (e.g., news, scientific
publications, social media content) calls for methods that can extract entities and relations of
target types with minimal or no human supervision.
Recently, Ren et al. [2017b] introduced a novel entity/relation joint typing algorithm in-
spired by our corpus-scope phrase mining framework. The work extends our phrasal segmen-
tation to detect entity mentions, and then jointly embeds entity mentions, relation mentions,
text features, and type labels into two low-dimensional spaces (for entity and relation mentions,
respectively), where, in each space, objects whose types are close will also have similar represen-
tations.
Figure 4.7 illustrates the framework, which comprises the following four major steps.

1. Run the phrase mining algorithm on a textual corpus, using positive examples obtained from an
existing knowledge base, to detect candidate entity mentions.

2. Generate candidate relation mentions (sentences mentioning two candidate entities), and ex-
tract text features for each relation mention and its entity mention arguments. Apply dis-
tant supervision to generate labeled training data.
3. Jointly embed relation and entity mentions, text features, and type labels into two low-
dimensional spaces (for entities and relations, respectively) where, in each space, close ob-
jects tend to share the same types.

4. Estimate type labels for each test relation mention and a type-path for each test entity men-
tion from the learned embeddings, by performing a nearest neighbor search in the target type set
or the target type hierarchy, as sketched below.
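A sketch of the nearest-neighbor inference in step 4 (flat type set; for the entity type hierarchy one would instead walk the tree top-down to form a type-path), with hypothetical inputs:

```python
import numpy as np

def infer_type(mention_vec, type_embs):
    """Nearest-neighbor type inference in the learned space (a sketch).

    mention_vec: embedding of a test mention (numpy array).
    type_embs: dict mapping each candidate type to its embedding.
    Returns the candidate type closest to the mention by cosine similarity.
    """
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(type_embs, key=lambda t: cosine(mention_vec, type_embs[t]))

# toy usage with 2-d embeddings
types = {"person": np.array([1.0, 0.1]), "location": np.array([0.1, 1.0])}
print(infer_type(np.array([0.9, 0.2]), types))  # -> "person"
```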

Similar to AutoPhrase+, the first step utilizes POS tags and uses quality examples from the KB
as guidance to model the segment quality (i.e., "how likely a candidate segment is an entity men-
tion"). But the detailed workflow is slightly different: (1) mine frequent contiguous patterns of
both word sequences and POS tag sequences, up to a fixed length, from the POS-tagged corpus; (2) ex-
tract features, including corpus-level concordance and sentence-level lexical signals, to train two
random forest classifiers that estimate the quality of candidate phrases and candidate POS patterns;
(3) find the best segmentation of the corpus using the estimated segment quality scores; and
(4) compute rectified features using the segmented corpus, and repeat steps (2)–(4) until the result
converges. Figure 4.8 shows high- and low-quality POS patterns learned using entity names found
in the corpus.

POS Tag Pattern     Example

Good (high score):
NNP NNP             San Francisco / Barack Obama / United States
NN NN               comedy drama / car accident / club captain
CD NN               seven network / seven dwarfs / 2001 census
JJ NN               crude oil / nucleic acid / baptist church

Bad (low score):
DT JJ NNS           a few miles / the early stages / the late 1980s
CD CD NN IN         2 : 0 victory over / 1 : 0 win over
NN IN NNP NNP       rating on rotten tomatoes
VBD RB IN           worked together on / spent much of

Figure 4.8: Example POS tag patterns.

After generating the set of candidate relation mentions from the detected candidate entity
mentions, the authors propose to apply network embedding to help infer entity and relation types.
Intuitively, two relation mentions sharing many text features (i.e., with similar distributions over
the set of text features, including the head token, POS tags, entity mention order, etc.) likely have sim-
ilar relation types; and text features co-occurring with many relation mentions in the corpus tend
to represent close type semantics. For example, in Figure 4.9, ("Barack Obama," "US," "S1") and
("Barack Obama," "United States," "S3") share multiple features, including the context word "presi-
dent" and the first entity mention argument "Barack Obama," and thus they are likely of the same
relation type (i.e., "president_of").

Figure 4.9: Network formed by entity/relation mentions, features, and types.

By further modeling the noisy distant labels from the knowledge base and enforcing the addi-
tive operation⁵ in the vector space, a joint optimization objective is formulated to learn
embeddings for both relation/entity mentions and relation/entity types.
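Under that additive operation, relation type inference can be sketched as ranking candidate relation embeddings by translation distance (hypothetical names; the paper's actual objective is richer):

```python
import numpy as np

def best_relation(m1, m2, relation_embs):
    """Rank candidate relation types by translation distance (a sketch).

    Per the additive operation in the footnote, m1 should be a nearest
    neighbor of m2 + z for the correct relation embedding z, so a smaller
    distance ||m1 - (m2 + z)|| indicates a more plausible relation type.
    """
    dists = {r: float(np.linalg.norm(m1 - (m2 + z)))
             for r, z in relation_embs.items()}
    return min(dists, key=dists.get)
```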
It is reported in the paper that the effectiveness of the joint extraction of typed entities and
relations has been verified across different domains (e.g., news, biomedical), with an average
improvement of 25% in F1 score compared to the next best method. Table 4.6 shows the output of
the algorithm CoType, together with two other competitive algorithms, on two news sentences
from the Wiki-KBP⁶ dataset.

4.4 RESEARCH FRONTIER


There are many unexplored territories and challenging research issues. Here we outline some of
the research directions stemming from our work. They share the common research goal that the
mined phrases become more task-relevant and carry more semantics.
⁵In a relation mention z = (m1, m2, s), the embedding vector of m1 should be a nearest neighbor of the embedding vector of
m2 plus the embedding vector of the relation mention z.
⁶It uses 1.5M sentences sampled from 780,000 Wikipedia articles as the training corpus and 14k manually annotated sentences
from the 2013 KBP slot filling assessment results as test data.
Table 4.6: Example output of CoType and the compared methods on two news sentences from the
Wiki-KBP dataset. r* stands for relation type and Y* stands for entity type; Y1* and Y2* denote the
types of the first and second entity arguments.

    Sentence 1: “Blake Edwards, a prolific filmmaker who kept alive the
                 tradition of slapstick comedy, died Wednesday of pneumonia
                 at a hospital in Santa Monica.”
    Sentence 2: “Anderson is survived by his wife Carol, sons Lee and
                 Albert, daughter Shirley Englebrecht and nine grandchildren.”

    MultiR (Hoffmann et al. [2011])
      Sentence 1: r*: person:country_of_birth, Y1*: {N/A}, Y2*: {N/A}
      Sentence 2: r*: None, Y1*: {N/A}, Y2*: {N/A}
    Logistic (Mintz et al. [2009])
      Sentence 1: r*: per:country_of_birth, Y1*: {person}, Y2*: {country}
      Sentence 2: r*: None, Y1*: {person}, Y2*: {person, politician}
    CoType
      Sentence 1: r*: person:place_of_death, Y1*: {person, artist, director},
                  Y2*: {location, city}
      Sentence 2: r*: person:children, Y1*: {person}, Y2*: {person}

4.4 RESEARCH FRONTIER

There are many unexplored territories and challenging research issues. Here we outline some of
the research directions stemming from our work. They share the common research goal that the
mined phrases will become more task-relevant and carry more semantics.

1. Multi-sense Phrase Mining. During the process of phrase mining, we typically assume
a phrase is represented as a contiguous word sequence. Our next challenge is to explore the
underlying concept for each phrase mention and further rectify the phrase frequency. Such
refinement encounters two problems: (1) variety: many phrase mentions may refer to the
same concept (e.g., “page rank” and “PageRank,” “cytosol” and “cytoplasm”); and (2) ambiguity:
multiple concepts may share the same phrase mention (e.g., “PCR” can refer to polymerase
chain reaction or principal component regression). Such refinement is easier to achieve from
the perspectives of phrase evolution and contextual topic (see the first sketch after this list).
When the relational database was first introduced in 1970, “data base” was a simple composition
of two words; as it gained popularity, people even invented a new word, “database,” clearly a
whole semantic unit. In the context of machine learning, PCR almost certainly refers to
principal component regression rather than polymerase chain reaction.

2. Phrase Mining for Users. This book mainly introduces techniques for extracting phrases
from documents. It can often be observed that unstructured textual data and users are
interconnected, particularly in the big data era, where social networks and user-created content
have become popular. Together with mining massive unstructured data, one can expect to create
profiles for users in the form of salient phrases by analyzing their activities, which can then
be utilized for recommendation and behavior analysis. One promising solution is to
model the user-content interaction as an information network in which links connect
different types of nodes, such as documents, keyphrases, and users. Such a data model allows
information propagation, and many network-based algorithms can be applied (see the second
sketch after this list).
3. Phrase Mining for Fresh Content. All the methods discussed in this book are data-driven
and rely on frequent phrase mentions to a certain extent. Accordingly, a large-scale dataset
is necessary to provide such data redundancy. The same philosophy, however, is not
suitable for fresh content, where a new phrase has only just been formed. Instead of depending
purely on phrase mentions, contextual knowledge such as “Hearst patterns” is also useful, and it
remains an open problem to learn these textual patterns automatically and effectively (see the
third sketch after this list). As time goes by, the statistics of a phrase will eventually become
sufficient, allowing our proposed methods to prove their power. It would be interesting to see how
much a hybrid method could benefit in this scenario as well.
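
For the ambiguity problem in direction 1, one plausible attack is to cluster the contexts in which a surface form such as “PCR” occurs and treat each cluster as a sense. The sketch below assumes pre-trained word vectors and given context windows; it is an illustration, not a method proposed in this book.

    import numpy as np
    from sklearn.cluster import KMeans

    def induce_senses(mention_contexts, word_vectors, n_senses=2):
        # Represent each mention by the mean vector of its context words
        # (each context is assumed to contain at least one known word).
        X = np.array([
            np.mean([word_vectors[w] for w in ctx if w in word_vectors], axis=0)
            for ctx in mention_contexts
        ])
        return KMeans(n_clusters=n_senses, n_init=10).fit_predict(X)

    # Toy vectors separating a biology sense from a statistics sense.
    vecs = {'gene': np.array([1.0, 0.0]), 'dna': np.array([1.0, 0.1]),
            'variance': np.array([0.0, 1.0]), 'regression': np.array([0.1, 1.0])}
    contexts = [['gene', 'dna'], ['dna'], ['variance', 'regression'], ['regression']]
    print(induce_senses(contexts, vecs))  # e.g., [0 0 1 1] (cluster ids may swap)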
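For direction 2, the data model can be pictured as a small heterogeneous network over users, documents, and keyphrases, with a propagation-style score producing a keyphrase profile. The nodes, edges, and the choice of personalized PageRank here are illustrative assumptions (the snippet relies on networkx >= 2.0, where unlisted personalization entries default to zero).

    import networkx as nx

    G = nx.Graph()
    G.add_edge(('user', 'alice'), ('doc', 'd1'))             # alice wrote/read d1
    G.add_edge(('user', 'alice'), ('doc', 'd2'))
    G.add_edge(('doc', 'd1'), ('phrase', 'phrase mining'))   # d1 mentions the phrase
    G.add_edge(('doc', 'd2'), ('phrase', 'phrase mining'))
    G.add_edge(('doc', 'd2'), ('phrase', 'social network'))

    # Salient phrases for a user = propagation scores seeded at the user node.
    scores = nx.pagerank(G, personalization={('user', 'alice'): 1.0})
    profile = sorted((n for n in G if n[0] == 'phrase'), key=lambda n: -scores[n])
    print(profile)  # keyphrase profile for alice, most salient first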
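For direction 3, a toy regex matcher for two classic Hearst patterns (“X such as Y” and “Y and other X”) illustrates the kind of contextual signal that could back off from frequency statistics for fresh phrases. The pattern list and the simple word regexes are illustrative, far from a learned solution, and splitting multi-item hyponym lists is omitted for brevity.

    import re

    HEARST = [
        (re.compile(r'(\w[\w ]*?) such as ([\w ]+)'), 'hypernym_first'),
        (re.compile(r'([\w ]+?),? and other (\w[\w ]*)'), 'hypernym_last'),
    ]

    def match_hearst(sentence):
        pairs = []
        for pattern, order in HEARST:
            for m in pattern.finditer(sentence):
                if order == 'hypernym_first':
                    hyper, hypo = m.group(1), m.group(2)
                else:
                    hypo, hyper = m.group(1), m.group(2)
                pairs.append((hypo.strip(), hyper.strip()))
        return pairs

    print(match_hearst('emerging diseases such as zika fever'))
    # [('zika fever', 'emerging diseases')]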

Bibliography
Khurshid Ahmad, Lee Gillam, and Lena Tostevin. University of Surrey participation in TREC8:
Weirdness indexing for logical document extrapolation and retrieval (WILDER). In TREC, 1999.
21

Helena Ahonen. Knowledge discovery in documents by extracting frequent word sequences.
Library Trends, 48(1), 1999. 5

Armen Allahverdyan and Aram Galstyan. Comparative analysis of Viterbi training and maximum
likelihood estimation for HMMs. In Advances in Neural Information Processing Systems 24,
pages 1674–1682, 2011. 15, 16

David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. of
the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.

Srikanta Bedathur, Klaus Berberich, Jens Dittrich, Nikos Mamoulis, and Gerhard Weikum.
Interesting-phrase mining for ad-hoc text analytics. Proc. of the VLDB Endowment, 3(1-2):
1348–1357, 2010. DOI: 10.14778/1920841.1921007. 29

Christopher M Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York,
Inc., 2006. 15, 45

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard
Cyganiak, and Sebastian Hellmann. DBpedia: A crystallization point for the web of data. Web
Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165, 2009. DOI:
10.1016/j.websem.2009.07.002. 56

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993–1022, 2003. 55, 60

Leo Breiman. Randomizing outputs to increase prediction accuracy. Machine Learning, 40(3):
229–242, 2000. Springer. 39

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. 13

Kuang-hua Chen and Hsin-Hsi Chen. Extracting noun phrases from large-scale texts: A hybrid
approach and its automatic evaluation. In Proc. of the 32nd Annual Meeting on Association for
Computational Linguistics, pages 234–241, 1994. DOI: 10.3115/981732.981764. 5
Gregory F Cooper. The computational complexity of probabilistic inference using Bayesian belief
networks. Artificial Intelligence, 42(2):393–405, 1990. DOI: 10.1016/0004-3702(90)90060-d.
59

Mohamed Yehia Dahab, Hesham A Hassan, and Ahmed Rafea. TextOntoEx: Automatic ontology
construction from natural English text. Expert Systems with Applications, 34(2):1474–1480,
2008. DOI: 10.1016/j.eswa.2007.01.043. 59

Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han. Automatic
construction and ranking of topical keyphrases on collections of short documents. In Proc. of the
SIAM International Conference on Data Mining, 2014. DOI: 10.1137/1.9781611973440.46. 6

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. Generating
typed dependency parses from phrase structure parses. In Proc. of the 5th International Confer-
ence on Language Resources and Evaluation, volume 6, pages 449–454, 2006. 47

Paul Deane. A nonparametric method for extraction of candidate phrasal terms. In Proc. of the
43rd Annual Meeting on Association for Computational Linguistics, pages 605–613, 2005. DOI:
10.3115/1219840.1219915. 35, 36

Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A.
Harshman. Indexing by latent semantic analysis. Journal of the American Society for Informa-
tion Science, 41(6):391–407, 1990. DOI: 10.1002/(sici)1097-4571(199009)41:6%3C391::aid-
asi1%3E3.0.co;2-9. 55

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R Voss, and Jiawei Han. Scalable top-
ical phrase mining from text corpora. Proc. of the VLDB Endowment, 8(3), 2015. DOI:
10.14778/2735508.2735519. 10, 35, 36, 47, 60

Geoffrey Finch. Linguistic Terms and Concepts. Macmillan Press Limited, 2000. DOI:
10.1007/978-1-349-27748-3. 37, 43

Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word
terms: The C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130,
2000. DOI: 10.1007/s007999900023. 5, 10, 35

Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-
based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artificial
Intelligence, pages 1606–1611, 2007. 55

Chuancong Gao and Sebastian Michel. Top-k interesting phrase mining in ad-hoc collections
using sequence pattern indexing. In Proc. of the 15th International Conference on Extending
Database Technology, pages 264–275, 2012. DOI: 10.1145/2247596.2247628. 29
Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learn-
ing, 63(1):3–42, 2006. Springer. 39

Thomas Gottron, Maik Anderka, and Benno Stein. Insights into explicit semantic analysis. In
Proc. of the 20th ACM International Conference on Information and Knowledge Management,
pages 1961–1964, 2011. DOI: 10.1145/2063576.2063865. 55

Ziyu Guan, Long Chen, Wei Zhao, Yi Zheng, Shulong Tan, and Deng Cai. Weakly-supervised
deep learning for customer review sentiment classification. In Proc. of the 25th International
Joint Conference on Artificial Intelligence, pages 3719–3725, 2016. 60

Huan Gui, Jialu Liu, Fangbo Tao, Meng Jiang, Brandon Norick, and Jiawei Han. Large-scale
embedding learning in heterogeneous event data. In Proc. of the IEEE International Conference
on Data Mining, 2016. DOI: 10.1109/icdm.2016.0111. 67

John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979. DOI:
10.2307/2346830.

Samer Hassan and Rada Mihalcea. Semantic relatedness using salient semantic analysis. In Proc.
of the 25th AAAI Conference on Artificial Intelligence, pages 884–889, 2011. 55

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld.
Knowledge-based weak supervision for information extraction of overlapping relations. In
Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Lan-
guage Technologies, volume 1, pages 541–550, 2011.

Terry Koo, Xavier Carreras Pérez, and Michael Collins. Simple semi-supervised dependency
parsing. In 46th Annual Meeting of the Association for Computational Linguistics, pages 595–
603, 2008. 5

Roger Levy and Christopher Manning. Is it harder to parse Chinese, or the Chinese Treebank?
In Proc. of the 41st Annual Meeting on Association for Computational Linguistics, volume 1,
pages 439–446, 2003. DOI: 10.3115/1075096.1075152. 47

Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. Unsupervised query seg-
mentation using clickthrough for information retrieval. In Proc. of the 34th International ACM
SIGIR Conference on Research and Development in Information Retrieval, pages 285–294, 2011.
DOI: 10.1145/2009916.2009957. 15

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. Mining quality phrases from
massive text corpora. In Proc. of the ACM SIGMOD International Conference on Management
of Data, pages 1729–1744, 2015. DOI: 10.1145/2723372.2751523. 36, 47, 55
Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare R. Voss, and Jiawei Han. Representing
documents via latent keyphrase inference. In Proc. of the 25th International Conference on World
Wide Web, pages 1057–1067, 2016. DOI: 10.1145/2872427.2883088. 55, 56, 59

Gonzalo Martínez-Muñoz and Alberto Suárez. Switching class labels to generate classification
ensembles. Pattern Recognition, 38(10):1483–1494, 2005. Elsevier. 39

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. Non-projective dependency
parsing using spanning tree algorithms. In Proc. of the Conference on Human Language Tech-
nology and Empirical Methods in Natural Language Processing, pages 523–530, 2005. DOI:
10.3115/1220575.1220641. 5

Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. In Proc. of the Conference on
Empirical Methods in Natural Language Processing, 2004. 47

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in Neural Information
Processing Systems 26, pages 3111–3119, 2013. 29, 60

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation ex-
traction without labeled data. In Proc. of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP,
volume 2, pages 1003–1011, 2009. DOI: 10.3115/1690219.1690287.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christo-
pher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Uni-
versal dependencies v1: A multilingual treebank collection. In Proc. of the 10th International
Conference on Language Resources and Evaluation, pages 1659–1666, 2016. 47

Deepak Padmanabhan, Atreyee Dey, and Debapriyo Majumdar. Fast mining of interesting
phrases from subsets of text corpora. In Proc. of the 17th International Conference on Extending
Database Technology, pages 193–204, 2014. 29

Aditya Parameswaran, Hector Garcia-Molina, and Anand Rajaraman. Towards the web of con-
cepts: Extracting concepts from large datasets. Proc. of the VLDB Endowment, 3(1-2):566–577,
2010. DOI: 10.14778/1920841.1920914. 6, 35, 36, 41, 47

Youngja Park, Roy J Byrd, and Branimir K Boguraev. Automatic glossary extraction: Beyond
terminology identification. In Proc. of the 19th International Conference on Computational Lin-
guistics, volume 1, pages 1–7, 2002. DOI: 10.3115/1072228.1072370. 5, 10, 35

Vasin Punyakanok and Dan Roth. The use of classifiers in sequential inference. In Advances in
Neural Information Processing Systems 13, pages 995–1001, 2001. 5
Xiang Ren, Yuanhua Lv, Kuansan Wang, and Jiawei Han. Comparative document analysis for
large text corpora. In Proc. of the 10th ACM International Conference on Web Search and Data
Mining, 2017a. DOI: 10.1145/3018661.3018690. 63

Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare Voss, Heng Ji, Tarek Abdelzaher, and Jiawei
Han. CoType: Joint extraction of typed entities and relations with knowledge bases. In Proc.
of the 26th International Conference on World Wide Web, 2017b. 68

Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In Proc. of the 22nd
Annual International ACM SIGIR Conference on Research and Development in Information Re-
trieval, pages 206–213, 1999. DOI: 10.1145/312624.312679. 59

Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In New Methods in
Language Processing, page 154, 2013. 36

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare Voss, and Jiawei Han. Automated Phrase
Mining from Massive Text Corpora, arXiv preprint arXiv:1702.04457v1, 2017. 55

Alkis Simitsis, Akanksha Baid, Yannis Sismanis, and Berthold Reinwald. Multidimen-
sional content exploration. Proc. of the VLDB Endowment, 1(1):660–671, 2008. DOI:
10.14778/1453856.1453929. 5

Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen. Short text
conceptualization using a probabilistic knowledge base. In Proc. of the 22nd International Joint
Conference on Artificial Intelligence, pages 2330–2336, 2011. DOI: 10.5591/978-1-57735-516-
8/IJCAI11-388. 55

Fangbo Tao, Honglei Zhuang, Chi Wang Yu, Qi Wang, Taylor Cassidy, Lance Kaplan, Clare
Voss, and Jiawei Han. Multi-dimensional, phrase-based summarization in text cubes. Data
Engineering, 39(3):74–84, 2016. 63

Beidou Wang, Can Wang, Jiajun Bu, Chun Chen, Wei Vivian Zhang, Deng Cai, and Xiaofei He.
Whom to mention: Expand the diffusion of tweets by @ recommendation on micro-blogging
systems. In Proc. of the 22nd International Conference on World Wide Web, pages 1331–1340,
2013. DOI: 10.1145/2488388.2488505. 60

Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade
model in large-scale social networks. Data Mining and Knowledge Discovery, 25(3):545–576,
2012. DOI: 10.1007/s10618-012-0262-1. 59

Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-Manning.
Kea: Practical automatic keyphrase extraction. In Proc. of the 4th ACM Conference on Digital
Libraries, pages 254–255, 1999. DOI: 10.1145/313238.313437. 47, 58
Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. Probase: A probabilistic taxonomy
for text understanding. In Proc. of the ACM SIGMOD International Conference on Management
of Data, pages 481–492, 2012. DOI: 10.1145/2213836.2213891. 56
Endong Xun, Changning Huang, and Ming Zhou. A unified statistical model for the identifi-
cation of English baseNP. In Proc. of the 38th Annual Meeting on Association for Computational
Linguistics, pages 109–116, 2000. DOI: 10.3115/1075218.1075233. 5
Xiaoxin Yin and Sarthak Shah. Building taxonomy of web search intents for name entity queries.
In Proc. of the 19th International Conference on World Wide Web, pages 1001–1010, 2010. DOI:
10.1145/1772690.1772792. 59
Ziqi Zhang, José Iria, Christopher A Brewster, and Fabio Ciravegna. A comparative evaluation
of term recognition algorithms. Proc. of the 6th International Conference on Language Resources
and Evaluation, 2008. 5, 35

Authors’ Biographies

JIALU LIU
Jialu Liu, an engineer at Google Research in New York, is working on structured data for knowl-
edge exploration. He received his B.Sc. from Zhejiang University, China, in 2007 and his Ph.D.
in computer science from the University of Illinois at Urbana-Champaign in 2015. His
research has focused on scalable data mining, text mining, and information extraction.

JINGBO SHANG
Jingbo Shang is a Ph.D. candidate in the Department of Computer Science at the University of
Illinois at Urbana-Champaign. He received a B.Sc. from Shanghai Jiao Tong University, China,
in 2014. His research focuses on mining and constructing structured knowledge from massive
text corpora.

JIAWEI HAN
Jiawei Han, Abel Bliss Professor, Department of Computer Science, the University of Illinois,
has been researching data mining, information network analysis, and database systems, and has
been involved in over 700 publications. He served as the founding Editor-in-Chief of ACM
Transactions on Knowledge Discovery from Data (TKDD). Jiawei received the ACM SIGKDD
Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), and
IEEE Computer Society W. Wallace McDowell Award (2009). He is a Fellow of ACM and
a Fellow of IEEE. His co-authored textbook, Data Mining: Concepts and Techniques (Morgan
Kaufmann), has been adopted worldwide.
