Data Mining and Knowledge Discovery for Big Data:
Methodologies, Challenge and Opportunities
Studies in Big Data
Volume 1
Series Editor
Janusz Kacprzyk, Warsaw, Poland
Editor
Wesley W. Chu
Department of Computer Science
University of California
Los Angeles
USA
The field of data mining has made significant and far-reaching advances over
the past three decades. Because of its potential power for solving complex
problems, data mining has been successfully applied to diverse areas such as
business, engineering, social media, and biological science. Many of these ap-
plications search for patterns in complex structural information. This trans-
disciplinary aspect of data mining addresses the rapidly expanding areas of
science and engineering which demand new methods for connecting results
across fields. In biomedicine for example, modeling complex biological sys-
tems requires linking knowledge across many levels of science, from genes
to disease. Further, the data characteristics of the problems have also evolved
from static to dynamic and spatiotemporal, from complete to incomplete, and
from centralized to distributed, and have grown in scope and size (this is known
as big data). The effective integration of big data for decision-making also
requires privacy preservation. Because these applications are broad-based and
often interdisciplinary, their published research results are scattered among
journals and conference proceedings in different fields and are not limited to
journals and conferences in knowledge discovery and data mining (KDD). It is
therefore difficult for researchers to locate results that are outside of their
own field. This motivated us to invite experts to contribute papers that
summarize the advances of data mining in their respective fields. Therefore, to
a large degree, the following chapters describe problem solving for specific
applications and developing innovative mining tools for knowledge discovery.
This volume consists of nine chapters that address subjects ranging from
opinion mining, spatiotemporal databases, discriminative subgraph patterns,
path knowledge discovery, social media, and privacy issues to computation
reduction via binary matrix factorization. The following provides a brief
description of these chapters.
Aspect extraction and entity extraction are two core tasks of aspect-based
opinion mining. In Chapter 1, Zhang and Liu present their studies on people’s
opinions, appraisals, attitudes, and emotions toward such things as entities,
products, services, and events.
when searching and gathering knowledge from published literature, and can
facilitate derivation of interpretable results.
Chapters 6, 7 and 8 present data mining in social media. In Chapter 6,
Bhattacharyya and Wu present "InfoSearch: A Social Search Engine," which
was developed using the Facebook platform. InfoSearch leverages the data
found in Facebook, where users share valuable information with friends. The
user-to-content link structure in the social network provides a wealth of data
in which to search for relevant information. Ranking factors are used to
encourage users to issue search queries through InfoSearch.
As social media became more integrated into the daily lives of people,
users began turning to it in times of distress. People use Twitter, Facebook,
YouTube, and other social media platforms to broadcast their needs, prop-
agate rumors and news, and stay abreast of evolving crisis situations. In
Chapter 7, Landwehr and Carley discuss social media mining and its novel
application to humanitarian assistance and disaster relief. An increasing num-
ber of organizations can now take advantage of the dynamic and rich infor-
mation conveyed in social media for humanitarian assistance and disaster
relief.
Social network analysis is very useful for discovering the embedded knowl-
edge in social network structures. This is applicable to many practical
domains such as homeland security, epidemiology, public health, electronic
commerce, marketing, and social science. However, privacy issues prevent
different users from effectively sharing information of common interest. In
Chapter 8, Yang and Thuraisingham propose to construct a generalized so-
cial network in which only insensitive and generalized information is shared.
Further, their proposed privacy-preserving method can satisfy a prescribed
level of privacy leakage tolerance that is measured independently of the
privacy-preserving techniques.
Binary matrix factorization (BMF) is an important tool in dimension re-
duction for high-dimensional data sets with binary attributes, and it has been
successfully employed in numerous applications. In Chapter 9, Jiang, Peng,
Heath and Yang propose a clustering approach to updating procedures for
constrained BMF where the matrix product is required to be binary. Numer-
ical experiments show that the proposed algorithm yields better results than
those of other algorithms reported in the research literature.
Finally, we want to thank our authors for contributing their work to this
volume, and also our reviewers for commenting on the readability and accu-
racy of the work. We hope that the new data mining methodologies and
challenges will stimulate further research and create new opportunities for
knowledge discovery.
1 Introduction
Opinion mining or sentiment analysis is the computational study of people’s
opinions, appraisals, attitudes, and emotions toward entities and their aspects. The
entities usually refer to products, services, organizations, individuals, events, etc.,
and the aspects are attributes or components of the entities (Liu, 2006). With the
growth of social media (i.e., reviews, forum discussions, and blogs) on the Web,
individuals and organizations are increasingly using the opinions in these media
for decision making. However, owing to their mental and physical limitations,
people have difficulty producing consistent results when the amount of such
information to be processed is large. Automated opinion mining is thus needed, as
subjective biases and mental limitations can be overcome with an objective
opinion mining system.
In the past decade, opinion mining has become a popular research topic due to
its wide range of applications and many challenging research problems. The topic
has been studied in many fields, including natural language processing, data
mining, Web mining, and information retrieval. The survey books of Pang and
Lee (2008) and Liu (2012) provide a comprehensive coverage of the research in
the area. Basically, researchers have studied opinion mining at three levels of
granularity, namely, document level, sentence level, and aspect level. Document
level sentiment classification is perhaps the most widely studied problem (Pang,
Lee and Vaithyanathan, 2002; Turney, 2002). It classifies an opinionated
document (e.g., a product review) as expressing an overall positive or negative
opinion. It considers the whole document as a basic information unit and it
assumes that the document is known to be opinionated. At the sentence level,
sentiment classification is applied to individual sentences in a document (Wiebe
and Riloff, 2005; Wiebe et al., 2004; Wilson et al., 2005). However, each sentence
cannot be assumed to be opinionated. Therefore, one often first classifies a
sentence as opinionated or not opinionated, which is called subjectivity
classification. The resulting opinionated sentences are then classified as
expressing positive or negative opinions.
Although opinion mining at the document level and the sentence level is useful
in many cases, it still leaves much to be desired. A positive evaluative text on a
particular entity does not mean that the author has positive opinions on every
aspect of the entity. Likewise, a negative evaluative text for an entity does not
mean that the author dislikes everything about the entity. For example, in a
product review, the reviewer usually writes both positive and negative aspects of
the product, although the general sentiment on the product could be positive or
negative. To obtain more fine-grained opinion analysis, we need to delve into the
aspect level. This idea leads to aspect-based opinion mining, which was first
called feature-based opinion mining in Hu and Liu (2004b). Its basic task is to
extract and summarize people’s opinions expressed on entities and aspects of
entities. It consists of three core sub-tasks.
Example: In the cellular phone domain, an aspect could be named voice quality.
There are many expressions that can indicate the aspect, e.g., “sound,” “voice,”
and “voice quality.”
Aspect expressions are usually nouns and noun phrases, but can also be verbs,
verb phrases, adjectives, and adverbs. We call aspect expressions in a sentence
that are nouns and noun phrases explicit aspect expressions. For example, “sound”
in “The sound of this phone is clear” is an explicit aspect expression. We call
aspect expressions of the other types, implicit aspect expressions, as they often
imply some aspects. For example, “large” is an implicit aspect expression in “This
phone is too large”. It implies the aspect size. Many implicit aspect expressions
are adjectives and adverbs, which imply some specific aspects, e.g., expensive
(price), and reliably (reliability). Implicit aspect expressions are not just adjectives
and adverbs. They can be quite complex, for example, “This phone will not easily
fit in pockets”. Here, “fit in pockets” indicates the aspect size (and/or shape).
Like aspects, an entity also has a name and many expressions that indicate the
entity. For example, the brand Motorola (entity name) can be expressed in several
ways, e.g., “Moto”, “Mot” and “Motorola” itself.
Definition (entity expression): An entity expression is an actual word or phrase
that has appeared in text indicating a particular entity.
Definition (opinion holder): The holder of an opinion is the person or
organization that expresses the opinion.
For product reviews and blogs, opinion holders are usually the authors of the
postings. Opinion holders are more important in news articles as they often
explicitly state the person or organization that holds an opinion. Opinion holders
are also called opinion sources. Some research has been done on identifying and
extracting opinion holders from opinion documents (Bethard et al., 2004; Choi et
al., 2005; Kim and Hovy, 2006; Stoyanov and Cardie, 2008).
We now turn to opinions. There are two main types of opinions: regular
opinions and comparative opinions (Liu, 2010; Liu, 2012). Regular opinions are
often referred to simply as opinions in the research literature. A comparative
opinion is a relation of similarity or difference between two or more entities,
which is often expressed using the comparative or superlative form of an adjective
or adverb (Jindal and Liu, 2006a and 2006b).
An opinion (or regular opinion) is simply a positive or negative view, attitude,
emotion or appraisal about an entity or an aspect of the entity from an opinion
holder. Positive, negative and neutral are called opinion orientations. Other names
for opinion orientation are sentiment orientation, semantic orientation, or polarity.
In practice, neutral is often interpreted as no opinion. We are now ready to
formally define an opinion.
Definition (opinion): An opinion (or regular opinion) is a quintuple,
(ei, aij, ooijkl, hk, tl),
where ei is the name of an entity, aij is an aspect of ei, ooijkl is the orientation of the
opinion about aspect aij of entity ei, hk is the opinion holder, and tl is the time when
the opinion is expressed by hk. The opinion orientation ooijkl can be positive,
negative or neutral, or be expressed with different strength/intensity levels. When
an opinion is on the entity itself as a whole, we use the special aspect GENERAL
to denote it.
We now put everything together to define a model of entity, a model of
opinionated document, and the mining objective, which are collectively called the
aspect-based opinion mining.
Model of Entity: An entity ei is represented by itself as a whole and a finite set of
aspects, Ai = {ai1, ai2, …, ain}. The entity itself can be expressed with any one of a
finite set of entity expressions OEi = {oei1, oei2, …, oeis}. Each aspect aij ∈ Ai of
the entity can be expressed by any one of a finite set of aspect expressions AEij =
{aeij1, aeij2, …, aeijm}.
Model of Opinionated Document: An opinionated document d contains opinions
on a set of entities {e1, e2, …, er} from a set of opinion holders {h1, h2, …, hp}.
The opinions on each entity ei are expressed on the entity itself and a subset Aid of
its aspects.
Objective of Opinion Mining: Given a collection of opinionated documents D,
discover all opinion quintuples (ei, aij, ooijkl, hk, tl) in D.
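To make the quintuple concrete, the following is a minimal sketch (not from the chapter) of how it could be represented as a simple data structure in Python; the field names simply mirror the definition above, and the example values are made up.

from dataclasses import dataclass
from datetime import date

@dataclass
class Opinion:
    # An opinion quintuple (ei, aij, ooijkl, hk, tl) as defined above.
    entity: str       # ei: the entity name, e.g., "iPad"
    aspect: str       # aij: an aspect of ei; "GENERAL" for the entity as a whole
    orientation: str  # ooijkl: "positive", "negative", or "neutral"
    holder: str       # hk: the opinion holder
    time: date        # tl: when the opinion was expressed

# A hypothetical positive opinion on the battery aspect of an iPad:
op = Opinion("iPad", "battery", "positive", "reviewer_42", date(2012, 5, 1))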
Fig. 1 Opinion summary based on product aspects of iPad (from Google Product1)
3 Aspect Extraction
Both aspect extraction and entity extraction fall into the broad class of information
extraction (Sarawagi, 2008), whose goal is to automatically extract structured
information (e.g., names of persons, organizations and locations) from
unstructured sources. However, traditional information extraction techniques are
often developed for formal genres (e.g., news, scientific papers) and are difficult
to apply effectively to opinion mining applications. We aim to
extract fine-grained information from opinion documents (e.g., reviews, blogs and
forum discussions), which are often very noisy and also have some distinct
characteristics that can be exploited for extraction. Therefore, it is beneficial to
design extraction methods that are specific to opinion documents. In this section,
we focus on the task of aspect extraction. Since aspect extraction and entity
extraction are closely related, some ideas or methods proposed for aspect
extraction can be applied to the task of entity extraction as well. In Section 4, we
will discuss a special problem of entity extraction for opinion mining and some
approaches for solving the problem.
Existing research on aspect extraction is mainly carried out on online reviews.
We thus focus on reviews here. There are two common review formats on the
Web.
Format 1 − Pros, Cons and the Detailed Review: The reviewer is asked to
describe some brief Pros and Cons separately and also write a detailed/full review.
Format 2 − Free Format: The reviewer can write freely, i.e., no separation of
pros and cons.
1 http://www.google.com/shopping
To extract aspects from Pros and Cons in reviews of Format 1 (not the detailed
review, which is the same as Format 2), many information extraction techniques
can be applied. An important observation about Pros and Cons is that they are
usually very brief, consisting of short phrases or sentence segments. Each sentence
segment typically contains only one aspect, and sentence segments are separated
by commas, periods, semi-colons, hyphens, &, and, but, etc. This observation
helps the extraction algorithm to perform more accurately (Liu, Hu and Cheng,
2005). Since aspect extraction from Pros and Cons is relatively simple, we will not
discuss it further.
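Although simple, this segment-splitting step is easy to overlook. The following is a minimal Python sketch of it, assuming the separator list given above; the example Pros string is made up.

import re

# Separators mentioned above: commas, periods, semi-colons, hyphens, "&", "and", "but".
SEPARATORS = re.compile(r"[,.;&-]|\band\b|\bbut\b", re.IGNORECASE)

def split_pros_cons(text):
    # Split a brief Pros/Cons string into sentence segments, each of which
    # typically mentions a single aspect.
    return [seg.strip() for seg in SEPARATORS.split(text) if seg.strip()]

print(split_pros_cons("great picture, poor battery life and heavy"))
# -> ['great picture', 'poor battery life', 'heavy']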
We now focus on the more general case, i.e., extracting aspects from reviews of
Format 2, which usually consist of full sentences.
Fig. 2 The dependency grammar graph of "This movie is not a masterpiece" (POS tags DT, NN, VBZ, RB; dependency relations det, nsubj, advmod, dobj)
The idea of using the modifying relationship of opinion words and aspects to
extract aspects can be generalized to using dependency relations. Zhuang et al.
(2006) employed dependency relations to extract aspect-opinion pairs from
movie reviews. After being parsed by a dependency parser (e.g., MINIPAR2
2 http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
(Lin, 1998)), words in a sentence are linked to each other by a certain dependency
relation. Figure 2 shows the dependency grammar graph of an example sentence,
“This movie is not a masterpiece”, where “movie” and “masterpiece” have been
labeled as aspect and opinion word respectively. A dependency relation template
can be found as the sequence “NN - nsubj - VB - dobj - NN”. NN and VB are POS
tags. nsubj and dobj are dependency tags. Zhuang et al. (2006) first identified
reliable dependency relation templates from training data, and then used them to
identify valid aspect-opinion pairs in test data.
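The following Python sketch illustrates this kind of dependency-template matching with spaCy and its en_core_web_sm model (an assumption; the original work used MINIPAR, whose relation labels differ). It approximates the "NN - nsubj - VB - dobj - NN" template by pairing a noun subject with a noun object of the same verb; the attr relation is included to cover copular sentences like the example above.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed

def match_subject_object_template(sentence):
    # Return (noun, noun) pairs connected through the same verb via
    # nsubj and dobj/attr, approximating the NN-nsubj-VB-dobj-NN template.
    pairs = []
    for token in nlp(sentence):
        if token.pos_ in ("VERB", "AUX"):
            subjects = [c for c in token.children
                        if c.dep_ == "nsubj" and c.tag_.startswith("NN")]
            objects = [c for c in token.children
                       if c.dep_ in ("dobj", "attr") and c.tag_.startswith("NN")]
            pairs += [(s.text, o.text) for s in subjects for o in objects]
    return pairs

# With a typical English model this should yield [('movie', 'masterpiece')]:
print(match_subject_object_template("This movie is not a masterpiece"))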
In Wu et al. (2009), a phrase dependency parser was used for extracting noun
phrases and verb phrases as aspect candidates. Unlike a normal dependency parser
that identifies dependency of individual words only, a phrase dependency parser
identifies dependency of phrases. Dependency relations have also been exploited
by Kessler and Nicolov (2009).
Wang and Wang (2008) proposed a method to identify product aspects and
opinion words simultaneously. Given a list of seed opinion words, a bootstrapping
method is employed to identify product aspects and opinion words in an
alternating fashion. Mutual information is utilized to measure the association
between potential aspects and opinion words and vice versa. In addition, linguistic
rules are extracted to identify infrequent aspects and opinion words. A similar
bootstrapping idea is also utilized in (Hai et al., 2012).
Double propagation (Qiu et al., 2011) further developed the aforementioned ideas.
Similar to Wang and Wang (2008), the method needs only an initial set of opinion
word seeds as input. It is based on the observation that opinions almost always
have targets, and there are natural relations connecting opinion words and targets
in a sentence because opinion words are used to modify targets. Furthermore,
opinion words have relations among themselves, and so do targets. The opinion
targets are usually aspects. Thus, opinion words can
be recognized by identified aspects, and aspects can be identified by known
opinion words. The extracted opinion words and aspects are utilized to identify
new opinion words and new aspects, which are used again to extract more opinion
words and aspects. This propagation process ends when no more opinion words or
aspects can be found. As the process involves propagation through both opinion
words and aspects, the method is called double propagation. Extraction rules are
designed based on different relations between opinion words and aspects, and also
opinion words and aspects themselves. Dependency grammar was adopted to
describe these relations.
The method only uses a simple type of dependencies called direct dependencies
to model useful relations. A direct dependency indicates that one word depends on
the other word without any additional words in their dependency path or they both
depend on a third word directly. Some constraints are also imposed. Opinion
words are considered to be adjectives and aspects are nouns or noun phrases.
Table 1 shows the rules for aspect and opinion word extraction. It uses OA-Rel to
denote the relations between opinion words and aspects, OO-Rel between opinion
words themselves and AA-Rel between aspects. Each relation in OA-Rel, OO-Rel
OA-Rels are used for tasks (1) and (3), AA-Rels are used for task (2) and OO-
Rels are used for task (4). Four types of rules are defined respectively for these
four subtasks and the details are given in Table 1. In the table, o (or a) stands for
the output (or extracted) opinion word (or aspect). {O} (or {A}) is the set of
known opinion words (or the set of aspects) either given or extracted. H means
any word. POS(O (or A)) and O (or A)-Dep stand for the POS tag and dependency
relation of the word O (or A), respectively. {JJ} and {NN} are sets of POS tags of
potential opinion words and aspects, respectively. {JJ} contains JJ, JJR and JJS;
{NN} contains NN and NNS. {MR} consists of dependency relations describing
relations between opinion words and aspects (mod, pnmod, subj, s, obj, obj2 and
desc). {CONJ} contains conj only. The arrows mean dependency. For example, O
→ O-Dep → A means O depends on A through a syntactic relation O-Dep.
Specifically, it employs R1i to extract aspects (a) using opinion words (O), R2i to
extract opinion words (o) using aspects (A), R3i to extract aspects (a) using
extracted aspects (Ai) and R4i to extract opinion words (o) using known opinion
words (Oi). Take R11 as an example. Given the opinion word O, the word with the
POS tag NN and satisfying the relation O-Dep is extracted as an aspect.
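As an illustration of this R1-type extraction, the following Python sketch extracts aspects from known opinion words using spaCy (an assumption; the original rules are stated over MINIPAR relations such as mod, so spaCy's amod and related labels are used here as rough substitutes). The seed opinion lexicon is made up.

import spacy

nlp = spacy.load("en_core_web_sm")

# Rough spaCy substitutes for the {MR} relations (an assumption).
MR = {"amod", "nsubj", "dobj", "attr"}
NN_TAGS = {"NN", "NNS"}
JJ_TAGS = {"JJ", "JJR", "JJS"}

def rule_r11(sentence, opinion_words):
    # R11-style rule: O -> O-Dep -> A with O-Dep in {MR} and POS(A) in {NN};
    # the noun that a known opinion word directly depends on is an aspect.
    aspects = set()
    for token in nlp(sentence):
        if token.text.lower() in opinion_words and token.tag_ in JJ_TAGS:
            if token.dep_ in MR and token.head.tag_ in NN_TAGS:
                aspects.add(token.head.text)
    return aspects

# Should extract {'screen'} with a typical English model:
print(rule_r11("The phone has an amazing screen", {"amazing", "great"}))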
The double propagation method works well for medium-sized corpora, but for
large and small corpora it may result in low precision and low recall. The reason
is that patterns based on direct dependencies have a large chance of
introducing noise for large corpora, and such patterns are too limited for small
corpora. To overcome these weaknesses, Zhang et al. (2010) proposed an approach
to extend double propagation. It consists of two steps: aspect extraction and
aspect ranking. For aspect extraction, it still adopts double propagation to
populate aspect candidates. However, some new linguistic patterns (e.g., part-
whole relation patterns) are introduced to increase recall. After extraction, it ranks
aspect candidates by aspect importance. That is, if an aspect candidate is genuine
and important, it will be ranked high. For an unimportant aspect or noise, it will be
ranked low. It observed that there are two major factors affecting the aspect
importance: aspect relevance and aspect frequency. The former describes how
likely an aspect candidate is a genuine aspect. There are three clues to indicate
aspect relevance in reviews. The first clue is that an aspect is often modified by
multiple opinion words. For example, in the mattress domain, "delivery" is
modified by "quick," "cumbersome," and "timely." This shows that reviewers put
emphasis on the word "delivery". Thus, "delivery" is a likely aspect. The second
clue is that an aspect can be extracted by multiple part-whole patterns. For
example, in the car domain, if we find the following two sentences, "the engine of
the car" and "the car has a big engine", we can infer that "engine" is an aspect of the
car, because both sentences contain part-whole relations to indicate “engine” is a
part of “car”. The third clue is that an aspect can be extracted by a combination of
opinion word modification relation, part-whole pattern or other linguistic patterns.
If an aspect candidate is not only modified by opinion words but also extracted by
part-whole pattern, we can infer that it is a genuine aspect with high confidence.
For example, for the sentence "there is a bad hole in the mattress", it strongly
“of scanner”, “scanner has”, “scanner comes with”, etc. The PMI measure is
calculated by searching the Web. The equation is as follows.
PMI(a, d) = \frac{hits(a \wedge d)}{hits(a)\, hits(d)} \qquad (1)
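A direct computation of Equation (1) could look like the sketch below; the hit counts are assumed to come from some external search interface (hypothetical here), so the function simply takes the three counts as arguments.

def pmi(hits_a_and_d, hits_a, hits_d):
    # Equation (1): association between a candidate aspect a and a
    # discriminator phrase d, from (hypothetical) web hit counts.
    if hits_a == 0 or hits_d == 0:
        return 0.0
    return hits_a_and_d / (hits_a * hits_d)

# Made-up counts for the aspect "scanner" and the discriminator "of scanner":
print(pmi(hits_a_and_d=120, hits_a=50000, hits_d=900))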
Hidden Markov Model (HMM) is a directed sequence model for a wide range of
state series data. It has been applied successfully to many sequence labeling
problems such as named entity recognition (NER) in information extraction and
POS tagging in natural language processing. A generic HMM model is illustrated
in Figure 3.
3 http://wordnet.princeton.edu
Fig. 3 A generic HMM model
We have
Y = < y0 , y1 , … yt > = hidden state sequence
X = < x0 , x1 , … xt > = observation sequence
HMM models a sequence of observations X by assuming that there is a hidden
sequence of states Y. Observations are dependent on states. Each state has a
probability distribution over the possible observations. To model the joint
distribution p(y, x) tractably, two independence assumptions are made. First, it
assumes that state yt only depends on its immediate predecessor state yt-1; yt is
independent of all its earlier ancestors y0, y1, …, yt-2. This is also called the Markov
property. Second, the observation xt only depends on the current state yt. With
these assumptions, we can specify HMM using three probability distributions: p
(y0) over initial state, state transition distribution p(yt | yt-1) and observation
distribution p(xt | yt). That is, the joint probability of a state sequence Y and an
observation sequence X factorizes as follows.
p(Y, X) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t) \qquad (2)
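For concreteness, the joint probability of Equation (2), together with the initial-state term p(y0)p(x0|y0), can be computed as in the following NumPy sketch; the two-state model and the index sequences are made up.

import numpy as np

def hmm_joint_prob(pi, A, B, states, observations):
    # p(Y, X) = p(y0) p(x0|y0) * prod_t p(yt|yt-1) p(xt|yt), cf. Equation (2).
    # pi: initial state distribution, A: transition matrix, B: emission matrix.
    p = pi[states[0]] * B[states[0], observations[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], observations[t]]
    return p

pi = np.array([0.6, 0.4])                # two hidden states
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transition distribution
B = np.array([[0.5, 0.5], [0.1, 0.9]])   # observation distribution (2 symbols)
print(hmm_joint_prob(pi, A, B, states=[0, 1, 1], observations=[0, 1, 1]))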
Fig. 4 A linear-chain CRF model
The conditional distribution p(Y|X) takes the form
p(Y \mid X) = \frac{1}{Z(X)} \exp\Big\{\sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t)\Big\} \qquad (3)
CRF introduces the concept of feature functions. Each feature function has the
form f_k(y_t, y_{t-1}, x_t) and λk is its corresponding weight. Figure 4 indicates that
CRF makes independence assumptions among Y, but not among X. Note that one
argument for feature function fk is the vector xt which means each feature function
can depend on observation X from any step. That is, all the components of the
global observations X are needed in computing feature function fk at step t. Thus,
CRF can introduce more features than HMM at each step.
Jakob and Gurevych (2010) utilized CRF to extract opinion targets (or aspects)
from sentences which contain an opinion expression. They employed the following
features as input for the CRF-based approach.
Token: This feature represents the string of the current token.
Part of Speech: This feature represents the POS tag of the current token. It can
provide some means of lexical disambiguation.
Short Dependency Path: Direct dependency relations show accurate
connections between a target and an opinion expression. Thus, all tokens which have
a direct dependency relation to an opinion expression in a sentence are labelled.
Word Distance: Noun phrases are good candidates for opinion targets in
product reviews. Thus token(s) in the closest noun phrase regarding word distance
to each opinion expression in a sentence are labelled.
Jakob and Gurevych represented the possible labels following the Inside-
Outside-Begin (IOB) labelling schema: B-Target, identifying the beginning of an
opinion target; I-Target, identifying the continuation of a target, and O for other
(non-target) tokens.
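A minimal sketch of such a CRF-based tagger is shown below, assuming the third-party sklearn-crfsuite package; the feature set is limited to the token and part-of-speech features described above, and the one-sentence training set is made up purely for illustration.

import sklearn_crfsuite  # assumed available (pip install sklearn-crfsuite)

def token_features(sent, i):
    # Token and part-of-speech features for position i;
    # sent is a list of (token, pos_tag) pairs.
    token, pos = sent[i]
    feats = {"token": token.lower(), "pos": pos}
    if i > 0:
        feats["prev_pos"] = sent[i - 1][1]
    return feats

# A made-up training sentence with IOB labels (B-Target / I-Target / O).
sent = [("The", "DT"), ("battery", "NN"), ("life", "NN"), ("is", "VBZ"), ("great", "JJ")]
labels = ["O", "B-Target", "I-Target", "O", "O"]

X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))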
Similar work has been done in (Li et al., 2010a). In order to model the long
distance dependency with conjunctions (e.g., “and”, “or”, “but”) at the sentence
level and deep syntactic dependencies for aspects, positive opinions and negative
opinions, they used the skip-tree CRF models to detect product aspects and
opinions.
Fig. 5 Graphical representations of (a) the pLSA model and (b) the LDA model
Obviously, the main parameters of the model are θ and φ. They can be
estimated by Expectation Maximization (EM) algorithm (Dempster et al., 1977),
which is used to calculate maximum likelihood estimates of the parameters.
For aspect extraction task, we can regard product aspects as latent topics in
opinion documents. Lu et al. (2009) proposed a method for aspect discovery and
grouping in short comments. They assume that each review can be parsed into
opinion phrases of the format < head term, modifier > and incorporate such
structure of phrases into the pLSA model, using the co-occurrence information of
head terms and their modifiers. Generally, the head term is an aspect, and the
modifier is an opinion word, which expresses some opinion towards the aspect. The
proposed approach defines k unigram language models, Θ = {θ1, θ2, …, θk}, as k
topic models, each of which is a multinomial distribution of head terms, capturing
one aspect. Note that each modifier can be represented by the set of head terms
that it modifies, as in the following equation:
d(w_m) = \{w_h \mid (w_m, w_h) \in T\} \qquad (8)
where wh is the head term and wm is the modifier.
Actually, a modifier can be regarded as a sample of the following mixture
model.
p_{d(w_m)}(w_h) = \sum_{j=1}^{k} \pi_{d(w_m),j}\, p(w_h \mid \theta_j) \qquad (9)
To address the limitation of pLSA, the Bayesian LDA model is proposed in (Blei
et al., 2003). It extends pLSA by adding priors to the parameters θ and φ. In LDA,
a prior Dirichlet distribution Dir (α) is added for θ and a prior Dirichlet
distribution Dir (β) is added for φ. The generation of a document collection is
started by sampling a word distribution φ from Dir (β) for each latent topic. Then
each document d in LDA is assumed to be generated as follows.
(1) choose distribution of topics θ ~ Dir (α)
(2) choose distribution of words φ ~ Dir (β)
(3) for each word wj in document d
- choose topic zi ~ θ
- choose word wj ~ φ
The model is represented in Figure 5 (b). LDA has only two parameters: α and
β, which prevent it from overfitting. Exact inference in such a model is intractable
and various approximations have been considered, such as the variational EM
method and the Markov Chain Monte Carlo (MCMC) algorithm (Gilks et
al.,1996). Note that, compared with pLSA, LDA has a stronger generative power,
as it describes how to generate topic distribution θ for an unseen document d.
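As a concrete (if simplistic) illustration of applying a topic model to reviews, the sketch below runs scikit-learn's LDA over a handful of made-up review sentences and prints the top words of each latent topic; on a realistic corpus the discovered topics would serve as aspect candidates.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "the battery life of this phone is amazing",
    "battery drains quickly and charging is slow",
    "the screen is bright and the display is sharp",
    "great display but the screen scratches easily",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)          # bag-of-words counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")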
LDA based topic models have been used for aspect extraction by several
researchers. Titov and McDonald (2008a) pointed out that global topic models such
as pLSA and LDA might not be suitable for detecting aspects. Both pLSA and LDA
use the bag-of-words representation of documents, which depends on topic
distribution differences and word co-occurrence among documents to identify
topics and the word probability distribution in each topic. However, opinion
documents such as reviews about a particular type of product are quite
homogenous. That is, every document talks about the same aspects, which makes
global topic models ineffective for detecting aspects; they are only effective for
discovering entities (e.g., brands or product names). In order to tackle this problem, they proposed
Multi-grain LDA (MG-LDA) to discover aspects, which models two distinct types
of topics: global topics and local topics. As in pLSA and LDA, the distribution of
global topics is fixed for a document (review). However, the distribution of local
topics is allowed to vary across documents. A word in a document is sampled
either from the mixture of global topics or from the mixture of local topics specific
for the local context of the word. It is assumed that aspects will be captured by
local topics and global topics will capture properties of reviewed items. For
example, a review of a London hotel: “… public transport in London is
straightforward, the tube station is about an 8 minute walk … or you can get a bus
for £1.50’’. The review can be regarded as a mixture of global topic London
(words: “London”, “tube”, “£”) and the local topic (aspect) location (words:
“transport”, “walk”, “bus”).
MG-LDA can distinguish local topics. But due to the many-to-one mapping
between local topics and ratable aspects, the correspondence is not explicit. It
lacks direct assignment from topics to aspects. To resolve the issue, Titov and
McDonald (2008b) extended the MG-LDA model and constructed a joint model
of text and aspect ratings, which is called the Multi-Aspect Sentiment model
(MAS). It consists of two parts. The first part is based on MG-LDA to build topics
that are representative of ratable aspects. The second part is a set of classifiers
(sentiment predictors) for each aspect, which attempt to infer the mapping
between local topics and aspects with the help of aspect-specific ratings provided
along with the review text. Their goal is to use the rating information to identify
more coherent aspects.
The idea of LDA has also been applied and extended in (Branavan et al., 2008;
Lin and He, 2009; Brody and Elhadad, 2010; Zhao et al., 2010; Wang et al., 2010;
Jo and Oh, 2011; Sauper et al., 2011; Moghaddam and Ester, 2011; Mukherjee and
Liu, 2012). Branavan et al. used the aspect descriptions as keyphrases in Pros and
Cons of review Format 1 to help find aspects in the detailed review text. Keyphrases
are clustered based on their distributional and orthographic properties, and a
hidden topic model is applied to the review text. Then, a final graphical model
integrates both of them. Lin and He (2009) proposed a joint topic-sentiment model
(JST), which extends LDA by adding a sentiment layer. It can detect aspect and
sentiment simultaneously from text. Brody and Elhadad (2010) proposed to
identify aspects using a local version of LDA, which operates on sentences, rather
than documents and employs a small number of topics that correspond directly to
aspects. Zhao et al. (2010) proposed a MaxEnt-LDA hybrid model to jointly
discover both aspect words and aspect-specific opinion words, which can leverage
syntactic features to help separate aspects and opinion words. Wang et al. (2010)
proposed a regression model to infer both aspect ratings and aspect weights at the
level of individual reviews based on learned latent aspects. Jo and Oh (2011)
proposed an Aspect and Sentiment Unification Model (ASUM) to model
sentiments toward different aspects. Sauper et al. (2011) proposed a joint model,
which worked only on short snippets already extracted from reviews. It combined
topic modeling with a HMM, where the HMM models the sequence of words with
Aspect and Entity Extraction for Opinion Mining 21
types (aspect, opinion word, or background word). Moghaddam and Ester (2011)
proposed a model called ILDA, which is based on LDA and jointly models latent
aspects and rating. ILDA can be viewed as a generative process that first generates
an aspect and subsequently generates its rating. In particular, for generating each
opinion phrase, ILDA first generates an aspect am from an LDA model. Then it
generates a rating rm conditioned on the sampled aspect am. Finally, a head term tm
and a sentiment sm are drawn conditioned on am and rm, respectively. Mukherjee
and Liu (2012) proposed two models (SAS and ME-SAS) to jointly model both
aspects and aspect-specific sentiments by using seeds to discover aspects in an
opinion corpus. The seeds reflect the user's need to discover specific aspects.
Other closely related work with topic models is the topic-sentiment model
(TSM). Mei et al. (2007) proposed it to perform joint topic and sentiment
modeling for blogs; it uses a positive sentiment model and a negative
sentiment model in addition to aspect models. They perform sentiment analysis at
the document level rather than at the aspect level. In (Su et al., 2008), the authors
also proposed a clustering based method with mutual reinforcement to identify
aspects. Similar work has been done in (Scaffidi et al., 2007), where the authors
proposed a language model approach for product aspect extraction with the
assumption that product aspects are mentioned more often in a product review
than they are in general English text. However, such statistics may not be reliable
when the corpus is small.
In summary, topic modeling is a powerful and flexible modeling tool. It is also
very elegant conceptually and mathematically. However, it is only able to find
general or rough aspects, and has difficulty in finding fine-grained or precise
aspects. We think it is too statistics-centric, which brings its own limitations. It
could be fruitful to shift more toward natural-language- and knowledge-centric
methods for a more balanced approach.
Yi et al. (2003) proposed a method for aspect extraction based on the likelihood-
ratio test. Bloom et al. (2007) manually built a taxonomy for aspects, which
indicates aspect type. They also constructed an aspect list by starting with a
sample of reviews that the list would apply to. They examined the seed list
manually and used WordNet to suggest additional terms to add to the list. Lu et al.
(2010) exploited the online ontology Freebase4 to obtain aspects to a topic and
used them to organize scattered opinions to generate structured opinion
summaries. Ma and Wan (2010) exploited Centering theory (Grosz et al., 1995) to
extract opinion targets from news comments. The approach uses global
information in news articles as well as contextual information in adjacent
sentences of comments. Ghani et al. (2006) formulated aspect extraction as a
classification problem and used both traditional supervised learning and semi-
supervised learning methods to extract product aspects. Yu et al. (2011) used a
4 http://www.freebase.com
again according to their latent topic structures produced from level 1 and context
snippets in reviews.
Zhai et al. (2010) proposed a semi-supervised learning method to group aspect
expressions into the user specified aspect groups or categories. Each group
represents a specific aspect. To reflect the user needs, they first manually label a
small number of seeds for each group. The system then assigns the rest of the
discovered aspect expressions to suitable groups using semi-supervised learning
based on labeled seeds and unlabeled examples. The method used the Expectation-
Maximization (EM) algorithm. Two pieces of prior knowledge were used to
provide a better initialization for EM, i.e., (1) aspect expressions sharing some
common words are likely to belong to the same group, and (2) aspect expressions
that are synonyms in a dictionary are likely to belong to the same group. Zhai et
al. (2011) further proposed an unsupervised method, which does not need any pre-
labeled examples. Besides, it is further enhanced by lexical (or WordNet)
similarity. The algorithm also exploited a piece of natural language knowledge to
extract more discriminative distributional context to help grouping.
Mauge et al. (2012) used a maximum entropy based clustering algorithm to
group aspects in a product category. It first trains a maximum-entropy classifier to
determine the probability p that two aspects are synonyms. Then, an undirected
weighted graph is constructed. Each vertex represents an aspect. Each edge weight
is proportional to the probability p between two vertices. Finally, approximate
graph partitioning methods are employed to group product aspects.
Closely related to aspect grouping, an aspect hierarchy presents product
aspects as a tree. The root of the tree is the name of the entity. Each
non-root node is a component or sub-component of the entity. Each link is a part-
of relation. Each node is associated with a set of product aspects. Yu et al. (2011b)
proposed a method to create aspect hierarchy. The method starts from an initial
hierarchy and inserts the aspects into it one-by-one until all the aspects are
allocated. Each aspect is inserted to the optimal position by semantic distance
learning. Wei and Gulla (2010) studied the sentiment analysis based on aspect
hierarchy trees.
aspect indicator. The final ranking score of a candidate aspect is the multiplication
of the aspect relevancy score (authority score) and logarithm of aspect frequency.
Yu et al. (2011a) identified the important aspects according to two
observations: the important aspects of a product are usually commented on by a
large number of consumers, and consumers' opinions on the important aspects
greatly influence their overall ratings of the product. Given the reviews of a product, they first
identify product aspects by a shallow dependency parser and determine opinions on
these aspects via a sentiment classifier. They then develop an aspect ranking
algorithm to identify the important aspects by considering the aspect frequency and
the influence of opinions given to each aspect on their overall opinions.
Liu et al. (2012) proposed a graph-based algorithm to compute the confidence
of each opinion target and its ranking. They argued that the ranking of a candidate
is determined by two factors: opinion relevancy and candidate importance. To
model these two factors, a bipartite graph (similar to that in Zhang et al., 2010) is
constructed. An iterative algorithm based on the graph is proposed to compute
candidate confidences. Then the candidates with high confidence scores are
extracted as opinion targets. Similar work has also been reported in (Li et al.,
2012a).
on the degree of association between aspects and sentiment words. The association
(or mutual reinforcement relationship) is modeled using a bipartite graph. An
aspect and an opinion word are linked if they have co-occurred in a sentence. The
links are also weighted based on the co-occurrence frequency. After the iterative
clustering, the strong links between aspects and sentiment word groups form the
mapping.
In Hai et al. (2011), a two-phase co-occurrence association rule mining
approach was proposed to match implicit aspects (which are also assumed to be
sentiment words) with explicit aspects. In the first phase, the approach generates
association rules involving each sentiment word as the condition and an explicit
aspect as the consequence, which co-occur frequently in sentences of a corpus. In
the second phase, it clusters the rule consequents (explicit aspects) to generate
more robust rules for each sentiment word mentioned above. For application or
testing, given a sentiment word with no explicit aspect, it finds the best rule cluster
and then assigns the representative word of the cluster as the final identified
aspect.
Fei et al. (2012) focused on finding implicit aspects (mainly nouns) indicated
by opinion adjectives, e.g., to identify price, cost, etc., for adjective expensive. A
dictionary-based method was proposed, which tries to identify attribute nouns
from the dictionary gloss of the adjective. They formulated the problem as a
collective classification problem, which can exploit lexical relations of words
(e.g., synonyms, antonyms, hyponym and hypernym) for classification.
Some other related work for implicit aspect mapping includes those in (Wang
and Wang, 2008; Yu et al., 2011b).
Step 2: Pruning: This step prunes the two lists. The idea is that when a noun
product aspect is directly modified by both positive and negative opinion
words, it is unlikely to be an opinionated product aspect.
these words resource terms (which cover both words and phrases). They are a
kind of special product aspects.
In terms of sentiments involving resources, the rules in Figure 6 are applicable
(Liu, 2010). Rules 1 and 3 represent normal sentences that involve resources and
imply sentiments, while rules 2 and 4 represent comparative sentences that involve
resources and also imply sentiments, e.g., “this washer uses much less water than
my old GE washer”.
Zhang and Liu (2011a) formulated the problem based on a bipartite graph and
proposed an iterative algorithm to solve the problem. The algorithm was based on
the following observation:
Observation: The sentiment or opinion expressed in a sentence about resource
usage is often determined by the following triple,
(verb, quantifier, noun_term),
where noun_term is a noun or a noun phrase representing a resource.
The proposed method used such triples to help identify resources in a domain
corpus. The model used a circular definition to reflect a special reinforcement
relationship between resource usage verbs (e.g., consume) and resource terms
(e.g., water) based on the bipartite graph. The quantifier was not used in
computation but was employed to identify candidate verbs and resource terms.
The algorithm assumes that a list of quantifiers is given, which is not numerous
and can be manually compiled. Based on the circular definition, the problem is
solved using an iterative algorithm similar to the HITS algorithm in (Kleinberg,
1999). To start the iterative computation, some global seed resources are
employed to find and to score some strong resource usage verbs. These scores are
then applied as the initialization for the iterative computation for any application
domain. When the algorithm converges, a ranked list of candidate resource terms
is identified.
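A small NumPy sketch of this HITS-style mutual reinforcement is given below; the verb-term co-occurrence matrix and seed verb scores are made up, and the exact normalization details of the original algorithm are not reproduced.

import numpy as np

def rank_resource_terms(M, verb_seed_scores, iters=50):
    # M[i, j]: how often verb i and candidate resource term j co-occur in
    # (verb, quantifier, noun_term) triples.  Verb scores and resource-term
    # scores reinforce each other, as in HITS.
    v = np.asarray(verb_seed_scores, dtype=float)
    for _ in range(iters):
        r = M.T @ v                       # resource scores from verb scores
        r /= np.linalg.norm(r) or 1.0
        v = M @ r                         # verb scores from resource scores
        v /= np.linalg.norm(v) or 1.0
    return r

# Rows: verbs ("consume", "use"); columns: candidates ("water", "movie").
M = np.array([[5.0, 0.0],
              [3.0, 1.0]])
print(rank_resource_terms(M, verb_seed_scores=[1.0, 0.5]))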
4 Entity Extraction
The task of entity extraction belongs to the traditional named entity recognition
(NER) problem, which has been studied extensively. Many supervised
information extraction approaches (e.g., HMM and CRF) can be adopted directly
(Putthividhya and Hu, 2011). However, opinion mining also presents some special
problems. One of them is the following: in a typical opinion mining application,
the user wants to find opinions about some competing entities, e.g., competing
products or brands (e.g., Canon, Sony, Samsung and many more). However, the
user often can only provide a few names because there are so many different
brands and models. Web users also write the names of the same product in various
ways in forums and blogs. It is thus important for a system to automatically
discover them from relevant corpora. The key requirement of this discovery is that
the discovered entities must be relevant, i.e., they must be of the same class/type
as the user provided entities, e.g., same brands or models.
4.1.1 PU Learning
In machine learning, there is a class of semi-supervised learning algorithms that
learns from positive and unlabeled examples (PU learning). Its key characteristic
(Liu et al., 2002) is that there is no negative training example available for
learning. As stated above, PU learning is a two-class classification model. Its
objective is to build a classifier using P and U to classify the data in U or future
test cases. The results can be either binary decisions (whether each test case
belongs to the positive class or not), or a ranking based on how likely each test
case belongs to the positive class represented by P. Clearly, the set expansion
problem is a special case of PU learning, where the set Q is P here and the set D is
U here.
There are several PU learning algorithms (Liu et al., 2002; Li and Liu, 2003; Li
et al., 2007; Yu et al., 2002). Li et al. (2010b) used the S-EM algorithm proposed
in (Liu et al., 2002) for entity extraction in opinion documents. The main idea of
S-EM is to use a spy technique to identify some reliable negatives (RN) from the
unlabeled set U, and then use an EM algorithm to learn from P, RN and U–RN. To
apply the S-EM algorithm, Li et al. (2010b) take the following basic steps.
Generating Candidate Entities: It selects single words or phrases as
candidate entities based on their part-of-speech (POS) tags. In particular, it
chooses the following POS tags as entity indicators — NNP (proper noun), NNPS
(plural proper noun), and CD (cardinal number).
Generating Positive and Unlabeled Sets: For each seed, each occurrence in
the corpus forms a vector as a positive example in P. The vector is formed based
on the surrounding word context of the seed mention. Similarly, for each
candidate d ∈ D (D denotes the set of all candidates), each occurrence also forms a
vector as an unlabeled example in U. Thus, each unique seed or candidate entity
may produce multiple feature vectors, depending on the number of times that the
seed appears in the corpus. The components in the feature vectors are term
frequencies.
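The following small Python sketch shows one way (an assumption, not the authors' exact implementation) to turn each seed or candidate mention into such a context-window term-frequency vector; the window size and the example sentence are made up.

from collections import Counter

def context_vectors(tokens, mention_positions, window=3):
    # For each occurrence (token index) of a seed or candidate entity,
    # build a term-frequency vector over the surrounding context window.
    vectors = []
    for i in mention_positions:
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        vectors.append(Counter(w.lower() for w in left + right))
    return vectors

tokens = "my doctor prescribed Lipitor for my cholesterol".split()
print(context_vectors(tokens, mention_positions=[3]))  # context around "Lipitor"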
Ranking Entity Candidates: With the positive and unlabeled data, S-EM is applied.
At convergence, S-EM produces a Bayesian classifier C, which is used to classify
each vector u ∈ U and to assign a probability p(+|u) to indicate the likelihood that
u belongs to the positive class. Note that each unique candidate entity may
generate multiple feature vectors, depending on the number of times that the
candidate entity occurs in the corpus. As such, the rankings produced by S-EM are
not the rankings of the entities, but rather the rankings of the entities’ occurrences.
Since different vectors representing the same candidate entity can have very
different probabilities, Li et al. (2010b) compute a single score for each unique
candidate entity for ranking based on Equation (11).
Let the probabilities (or scores) of a candidate entity d ∈ D be Vd = {v1 , v2 …,
vn} obtained from the feature vectors representing the entity. Let Md be the median
of Vd. The final score f for d is defined as follows:

f(d) = M_d \times \log(1 + n) \qquad (11)
The use of the median of Vd can be justified based on the statistical skewness
(Neter et al, 1993). Note that here n is the frequency count of candidate entity d in
the corpus. The constant 1 is added to smooth the value. The idea is to push the
frequent candidate entities up by multiplying the logarithm of frequency. log is
taken in order to reduce the effect of big frequency counts.
The final score f(d) indicates candidate d’s overall likelihood to be a relevant
entity. A high f(d) implies a high likelihood that d is in the expanded entity set.
The top-ranked candidates are most likely to be relevant entities to the user-
provided seeds.
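Equation (11) is straightforward to compute; a minimal Python sketch (with made-up per-occurrence probabilities) is:

import math
from statistics import median

def entity_score(occurrence_probs):
    # Equation (11): f(d) = Md * log(1 + n), where Md is the median of the
    # per-occurrence probabilities p(+|u) produced by S-EM and n is the
    # frequency of candidate d in the corpus.
    n = len(occurrence_probs)
    return median(occurrence_probs) * math.log(1 + n)

# A candidate seen 5 times with these classifier probabilities:
print(entity_score([0.9, 0.8, 0.85, 0.4, 0.95]))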
Algorithm: BayesianSets(Q, D)
Input: A small seed set Q of entities
       A set of candidate entities D (= {e1, e2, e3, …, en})
Output: A ranked list of the entities in D
1. for each entity ei in D
2.    compute: score(e_i) = \frac{p(e_i, Q)}{p(e_i)\, p(Q)}
3. end for
4. Rank the items in D based on their scores;
Let us first compute the integrals of Equation (14). Each seed entity qk ∈ Q is
represented as a binary feature vector (q_{k1}, q_{k2}, …, q_{kJ}). We assume each element of
the feature vector has an independent Bernoulli distribution:

p(q_k \mid \theta) = \prod_{j=1}^{J} \theta_j^{q_{kj}} (1 - \theta_j)^{1 - q_{kj}} \qquad (17)
The conjugate prior for the parameters of a Bernoulli distribution is the Beta
distribution:
p(\theta \mid \alpha, \beta) = \prod_{j=1}^{J} \frac{\Gamma(\alpha_j + \beta_j)}{\Gamma(\alpha_j)\Gamma(\beta_j)} \theta_j^{\alpha_j - 1} (1 - \theta_j)^{\beta_j - 1} \qquad (18)
where α and β are hyperparameters (which are also vectors). We set α and β
empirically from the data: α_j = k m_j and β_j = k(1 − m_j), where m_j is the mean
value of the j-th component of all possible entities, and k is a scaling factor. The
Gamma function is a generalization of the factorial function. For Q = {q1, q2, …, qN},
Equation (14) can be represented as follows:
~
Γ(a j + β j ) Γ(α~ j )Γ ( β j )
p (Q | α , β ) = ∏ ~ (19)
Γ (α )Γ( β ) Γ(α~ + β )
j j j j j
where \tilde{\alpha}_j = \alpha_j + \sum_{k=1}^{N} q_{kj} and \tilde{\beta}_j = \beta_j + N - \sum_{k=1}^{N} q_{kj}. With the same idea, we can
compute Equation (15) and Equation (16).
Overall, the score of ei, which is also represented as a feature vector (e_{i1}, e_{i2}, …,
e_{iJ}) in the data, is computed with:

score(e_i) = \prod_j \frac{\alpha_j + \beta_j}{\alpha_j + \beta_j + N} \left(\frac{\tilde{\alpha}_j}{\alpha_j}\right)^{e_{ij}} \left(\frac{\tilde{\beta}_j}{\beta_j}\right)^{1 - e_{ij}} \qquad (20)
The log of the score is linear in the feature vector:

\log score(e_i) = c + \sum_j w_j e_{ij} \qquad (21)

where

c = \sum_j \big[ \log(\alpha_j + \beta_j) - \log(\alpha_j + \beta_j + N) + \log\tilde{\beta}_j - \log\beta_j \big]

and

w_j = \log\tilde{\alpha}_j - \log\alpha_j - \log\tilde{\beta}_j + \log\beta_j \qquad (22)
All possible entities ei will be assigned a similarity score by Equation (21). Then
we can rank them accordingly. The top ranked entities should be highly related to
the seed set Q according to the Bayesian Sets algorithm.
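A compact NumPy sketch of this scoring scheme (Equations (17)-(22)) is shown below; the binary feature matrices and the scaling factor k are made up, and a small clipping of the candidate mean is added only to avoid division by zero in the toy example.

import numpy as np

def bayesian_sets_scores(seed, candidates, k=2.0):
    # seed: N x J binary matrix of seed feature vectors.
    # candidates: M x J binary matrix of candidate feature vectors.
    N = seed.shape[0]
    m = candidates.mean(axis=0).clip(1e-6, 1 - 1e-6)  # feature means
    alpha, beta = k * m, k * (1 - m)                   # empirical priors
    alpha_t = alpha + seed.sum(axis=0)                 # alpha~_j
    beta_t = beta + N - seed.sum(axis=0)               # beta~_j
    # log score(e_i) = c + sum_j w_j * e_ij, cf. Equations (21) and (22)
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    w = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    return c + candidates @ w

seed = np.array([[1, 0, 1], [1, 1, 1]])
candidates = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
print(bayesian_sets_scores(seed, candidates))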
However, Zhang and Liu (2011c) found that this direct application of Bayesian
Sets produces poor results. They believe there are two main reasons. First, since
Bayesian Sets uses binary features, multiple occurrences of an entity in the corpus,
which give rich contextual information, is not fully exploited. Second, since the
number of seeds is very small, the learned results from Bayesian Sets can be quite
unreliable.
They proposed a method to improve Bayesian Sets, which produces much
better results. The main improvements are as follows.
Raising Feature Weights: From Equation (21), we can see that the score of an
entity ei is determined only by its corresponding feature vector and the weight
vector w = (w1, w2, …, wJ). Equation (22) gives the value of each component of
the weight vector w. They rewrite Equation (22) as follows,
w_j = \log\frac{\tilde{\alpha}_j}{\alpha_j} - \log\frac{\tilde{\beta}_j}{\beta_j} = \log\Big(1 + \frac{\sum_{i=1}^{N} q_{ij}}{k m_j}\Big) - \log\Big(1 + \frac{N - \sum_{i=1}^{N} q_{ij}}{k(1 - m_j)}\Big) \qquad (23)
In Equation (23), N is the number of items in the seed set. As mentioned before,
mj is the mean of feature j of all possible entities and k is a scaling factor. mj can
be regarded as the prior information empirically set from the data.
In order to make a positive contribution to the final score of an entity e, wj must be
greater than zero. Under this circumstance, we can obtain the following inequality
based on Equation (23):
\sum_{i=1}^{N} q_{ij} > N m_j \qquad (24)
Equation (24) shows that if feature j is effective (wj > 0), the seed data mean must
be greater than the candidate data mean on feature j. Only such features
can be regarded as high-quality features in Bayesian Sets. Unfortunately, this is not
always the case due to the idiosyncrasy of the data. There are many high-quality
features, whose seed data mean may be even less than the candidate data mean.
For example, in the drug data set, "prescribe" can be the left first verb for an entity. It is
a very good entity feature. “Prescribe EN/NNP” (EN represents an entity, NNP is
its POS tag) strongly suggests that EN is a drug. However, the problem is that the
mean of this feature in the seed set is 0.024, which is less than its candidate set
mean of 0.025. So if we stick with Equation (24), the feature will make a negative
contribution, which means that it is worse than having no feature at all. However,
since all pattern features come from sentences containing seeds, a candidate entity
associated with a feature should still be better than one with no feature.
Zhang and Liu tackled this problem by fully utilizing all features found in the
corpus. They changed the original m_j to \tilde{m}_j by multiplying by a scaling factor t to
force all feature weights w_j > 0:

\tilde{m}_j = t\, m_j \quad (0 < t < 1) \qquad (25)
The idea is that they lower the candidate data mean intentionally so that all the
features found from the seed data can be utilized.
more times in the seed data, its corresponding wj will also be high. However,
Equation (23) may not be sufficient since it only considers the feature occurrence
but does not take feature quality into consideration. For example, consider two
different features A and B which have the same number of occurrences in the
seed data and thus the same mean. According to Equation (23), they should have
the same feature weight. However, for feature A, all feature counts may come
from only one entity in the seed set, while for feature B, the feature counts come
from four different entities in the seed set. Obviously, feature B is a better feature
than feature A simply because the feature is shared by, or associated with, more
entities. To detect such high-quality features and increase their weights, Zhang
and Liu used the following formula to change the original w_j to \tilde{w}_j:
\tilde{w}_j = r\, w_j \qquad (26)

r = 1 + \frac{\log h}{T} \qquad (27)
In Equations (26) and (27), r represents the quality of feature j: h is the number
of unique entities that have the j-th feature, and T is the total number of entities in
the seed set.
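The two adjustments can be summarized in a few lines of Python; the sketch below simply restates Equations (25)-(27), with made-up example values for t, w_j, h, and T.

import math

def scaled_mean(m_j, t=0.9):
    # Equation (25): lower the candidate data mean so that every feature
    # found in the seed data keeps a positive weight (0 < t < 1).
    return t * m_j

def adjusted_weight(w_j, h, T):
    # Equations (26)-(27): boost the weight of feature j by r = 1 + log(h)/T,
    # where h is the number of distinct seed entities carrying the feature
    # and T is the total number of entities in the seed set.
    r = 1 + math.log(h) / T
    return r * w_j

print(scaled_mean(0.025), adjusted_weight(w_j=0.8, h=4, T=10))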
In Zhang and Liu (2011c), different vectors representing the same candidate
entity are produced as in (Li et al., 2010b). Thus, the same ranking algorithm is
adopted, which is the multiplication of the median of the score vector obtained
from feature vectors representing the entity and the logarithm of entity frequency.
5 Summary
With the explosive growth of social media on the Web, organizations are
increasingly relying on opinion mining methods to analyze the content of these
media for their decision making. Aspect-based opinion mining, which aims to
obtain detailed information about opinions, has attracted a great deal of
attention from both the research community and industry. Aspect extraction and
entity extraction are two of its core tasks. In this chapter, we reviewed some
representative works for aspect extraction and entity extraction from opinion
documents.
For aspect extraction, existing solutions can be grouped into three main
categories:
(1) using language dependency rules, e.g., double propagation (Qiu et al., 2011).
These methods utilize the relationships between aspects and opinion words or
other terms to perform aspect extraction. The approaches are unsupervised
and domain-independent. Thus, they can be applied to any domain.
(2) using sequence learning algorithms such as HMM and CRF (Jin et al., 2009a;
Jakob and Gurevych, 2010). These supervised methods are the dominating
techniques for traditional information extraction. But they need a great deal of
manual labeling effort.
(3) using topic models, e.g., MG-LDA (Titov and McDonald, 2008a). This is a
popular research area for aspect extraction recently. The advantages of topic
models are that they can group similar aspect expressions together and that
they are unsupervised. However, their limitation is that the extracted aspects
are not fine-grained.
For entity extraction, supervised learning has also been the dominating
approach. However, semi-supervised methods have drawn attention recently. Since users in opinion mining often want to find competing entities for opinion analysis, they can provide some knowledge (e.g., entity instances) as seeds for
semi-supervised learning. In this chapter, we introduced PU learning and
Bayesian Sets based semi-supervised extraction methods.
For evaluation, the commonly used measures for information extraction such as
precision, recall and F-1 scores are also often used in aspect and entity extraction.
The current F-1 score results range from 0.60 to 0.85 depending on domains and
datasets. Thus, these problems, especially aspect extraction, remain highly challenging. We expect that future work will improve accuracy significantly. We also believe that semi-supervised and unsupervised methods will
play a larger role in these tasks.
References
Bethard, S., Yu, H., Thornton, A., Hatzivassiloglou, V., Jurafsky, D.: Automatic extraction
of opinion propositions and their holders. In: Proceedings of the AAAI Spring
Symposium on Exploring Attitude and Affect in Text (2004)
Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G.A., Reyna, J.:
Building a sentiment summarizer for local service reviews. In: Proceedings of
International Conference on World Wide Web Workshop of NLPIX, WWW-NLPIX-
2008 (2008)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research (2003)
Bloom, K., Garg, N., Argamon, S.: Extracting appraisal expressions. In: Proceedings of the
2007 Annual Conference of the North American Chapter of the ACL (NAACL 2007)
(2007)
Branavan, S.R.K., Chen, H., Eisenstein, J., Barzilay, R.: Learning document-level semantic
properties from free-text annotations. In: Proceedings of Annual Meeting of the
Association for Computational Linguistics, ACL 2008 (2008)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Computational Linguistics (1993)
Brody, S., Elhadad, N.: An unsupervised aspect-sentiment model for online reviews. In:
Proceedings of the 2010 Annual Conference of the North American Chapter of the ACL,
NAACL 2010 (2010)
Carenini, G., Ng, R., Pauls, A.: Multi-Document summarization of evaluative text. In:
Proceedings of the Conference of the European Chapter of the ACL, EACL 2006 (2006)
Carenini, G., Ng, R., Zwart, E.: Extracting knowledge from evaluative text. In: Proceedings
of Third International Conference on Knowledge Capture, K-CAP 2005 (2005)
Choi, Y., Cardie, C., Riloff, E., Patwardhan, S.: Identifying sources of opinions with
conditional random fields and extraction patterns. In: Proceedings of the Human
Language Technology Conference and the Conference on Empirical Methods in Natural
Language Processing, HLT/EMNLP 2005 (2005)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (1977)
Fei, G., Liu, B., Hsu, M., Castellanos, M., Ghosh, R.: A dictionary-based approach to
identifying aspects implied by adjectives for opinion mining. In: Proceedings of
International Conference on Computational Linguistics, COLING 2012 (2012)
Ghahramani, Z., Heller, K.A.: Bayesian sets. In: Proceedings of the Annual Conference on Neural Information
Processing Systems, NIPS 2005 (2005)
Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute
extraction. ACM SIGKDD Explorations Newsletter 8(1) (2006)
Gilks, W.R., Richardson, S., Spiegelhalter, D.: Markov Chain Monte Carlo in Practice.
Chapman and Hall (1996)
Grosz, B.J., Weinstein, S., Joshi, A.K.: Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21(2) (1995)
Guo, H., Zhu, H., Guo, Z., Zhang, X., Su, Z.: Product feature categorization with multilevel
latent semantic association. In: Proceedings of ACM International Conference on
Information and Knowledge Management, CIKM 2009 (2009)
Hai, Z., Chang, K., Kim, J.: Implicit feature identification via co-occurrence association
rule mining. Computational Linguistic and Intelligent Text Processing (2011)
Hai, Z., Chang, K., Cong, G.: One seed to find them all: mining opinion features via
association. In: Proceedings of ACM International Conference on Information and
Knowledge Management, CIKM 2012 (2012)
Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine
Learning (2001)
Hu, M., Liu, B.: Mining opinion features in customer reviews. In: Proceedings of National
Conference on Artificial Intelligence, AAAI 2004 (2004a)
Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
2004 (2004b)
Jakob, N., Gurevych, I.: Extracting opinion targets in a single and cross-domain setting
with conditional random fields. In: Proceedings of Conference on Empirical Methods in
Natural Language Processing, EMNLP 2010 (2010)
Jin, W., Ho, H.: A novel lexicalized HMM-based learning framework for web opinion
mining. In: Proceedings of International Conference on Machine Learning, ICML 2009
(2009a)
Jin, W., Ho, H., Srihari, R.K.: OpinionMiner: a novel machine learning system for web
opinion mining and extraction. In: Proceedings of ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD 2009 (2009b)
Jindal, N., Liu, B.: Mining comparative sentences and relations. In: Proceedings of National
Conference on Artificial Intelligence, AAAI 2006 (2006a)
Jindal, N., Liu, B.: Identifying comparative sentences in text documents. In: Proceedings of
ACM SIGIR International Conference on Information Retrieval, SIGIR 2006 (2006b)
Jo, Y., Oh, A.: Aspect and sentiment unification model for online review analysis. In:
Proceedings of the Conference on Web Search and Web Data Mining, WSDM 2011
(2011)
Kessler, J., Nicolov, N.: Targeting sentiment expressions through supervised ranking of
linguistic configurations. In: Proceedings of the International AAAI Conference on
Weblogs and Social Media, ICWSM 2009 (2009)
Kim, S.M., Hovy, E.: Extracting opinions, opinion holders, and topics expressed in online
news media text. In: Proceedings of the ACL Workshop on Sentiment and Subjectivity
in Text (2006)
Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the
ACM 46(5), 604–632 (1999)
Kobayashi, N., Inui, K., Matsumoto, Y.: Extracting aspect-evaluation and aspect-of
relations in opinion mining. In: Proceedings of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language
Learning, EMNLP 2007 (2007)
Ku, L., Liang, Y., Chen, H.: Opinion extraction, summarization and tracking in news and
blog corpora. In: Proceedings of AAAI-CAAW 2006 (2006)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for
segmenting and labeling sequence data. In: Proceedings of International Conference on
Machine Learning, ICML 2001 (2001)
Lee, L.: Measures of distributional similarity. In: Proceedings of Annual Meeting of the
Association for Computational Linguistics, ACL 1999 (1999)
Li, F., Han, C., Huang, M., Zhu, X., Xia, Y., Zhang, S., Yu, H.: Structure-aware review
mining and summarization. In: Proceedings of International Conference on
Computational Linguistics, COLING 2010 (2010a)
Li, F., Pan, S.J., Jin, Q., Yang, Q., Zhu, X.: Cross-Domain co-extraction of sentiment and
topic lexicons. In: Proceedings of Annual Meeting of the Association for Computational
Linguistics, ACL 2012 (2012a)
Li, S., Wang, R., Zhou, G.: Opinion target extraction using a shallow semantic parsing
framework. In: Proceedings of National Conference on Artificial Intelligence, AAAI
2012 (2012b)
Li, X., Liu, B.: Learning to classify texts using positive and unlabeled data. In: Proceedings
of International Joint Conferences on Artificial Intelligence, IJCAI 2003 (2003)
Li, X., Liu, B., Ng, S.: Learning to identify unexpected instances in the test set. In:
Proceedings of International Joint Conferences on Artificial Intelligence, IJCAI 2007
(2007)
Li, X., Zhang, L., Liu, B., Ng, S.: Distributional similarity vs. PU learning for entity set
expansion. In: Proceedings of Annual Meeting of the Association for Computational
Linguistics, ACL 2010 (2010b)
Lin, C., He, Y.: Joint sentiment/topic model for sentiment analysis. In: Proceedings of
ACM International Conference on Information and Knowledge Management, CIKM
2009 (2009)
Lin, D.: Dependency-based evaluation of MINIPAR. In: Proceedings of the Workshop on the Evaluation of Parsing Systems, LREC 1998 (1998)
Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 1st edn.
Springer (2006), 2nd edn. (2011)
Liu, B.: Sentiment analysis and subjectivity. In: Handbook of Natural Language Processing, 2nd edn. (2010)
Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers (2012)
Liu, B., Hu, M., Cheng, J.: Opinion observer: analyzing and comparing opinions on the
web. In: Proceedings of International Conference on World Wide Web, WWW 2005
(2005)
Liu, B., Lee, W.-S., Yu, P.S., Li, X.: Partially supervised text classification. In: Proceedings
of International Conference on Machine Learning, ICML 2002 (2002)
Liu, K., Xu, L., Zhao, J.: Opinion target extraction using word-based translation model. In:
Proceedings of Conference on Empirical Methods in Natural Language Processing,
EMNLP 2012 (2012)
Long, C., Zhang, J., Zhu, X.: A review selection approach for accurate feature rating
estimation. In: Proceedings of International Conference on Computational Linguistics,
COLING 2010 (2010)
Lu, Y., Duan, H., Wang, H., Zhai, C.: Exploiting structured ontology to organize scattered
online opinions. In: Proceedings of International Conference on Computational
Linguistics, COLING 2010 (2010)
Lu, Y., Zhai, C., Sundaresan, N.: Rated aspect summarization of short comments. In:
Proceedings of International Conference on World Wide Web, WWW 2009 (2009)
Ma, T., Wan, X.: Opinion target extraction in Chinese news comments. In: Proceedings of
International Conference on Computational Linguistics (COLING 2010) (2010)
Mauge, K., Rohanimanesh, K., Ruvini, J.D.: Structuring e-commerce inventory. In:
Proceedings of Annual Meeting of the Association for Computational Linguistics, ACL
2012 (2012)
Mukherjee, A., Liu, B.: Aspect extraction through semi-supervised modeling. In:
Proceedings of Annual Meeting of the Association for Computational Linguistics, ACL
2012 (2012)
Mei, Q., Ling, X., Wondra, M., Su, H., Zhai, C.: Topic sentiment mixture: modeling facets
and opinions in weblogs. In: Proceedings of International Conference on World Wide
Web, WWW 2007 (2007)
Moghaddam, S., Ester, M.: Opinion digger: an unsupervised opinion miner from
unstructured product reviews. In: Proceedings of ACM International Conference on
Information and Knowledge Management, CIKM 2010 (2010)
Moghaddam, S., Ester, M.: ILDA: interdependent LDA model for learning latent aspects
and their ratings from online product reviews. In: Proceedings of ACM SIGIR
International Conference on Information Retrieval, SIGIR 2011 (2011)
Neter, J., Wasserman, W., Whitmore, G.A.: Applied Statistics. Allyn and Bacon (1993)
Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in
Information Retrieval 2(1-2), 1–135 (2008)
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine
learning techniques. In: Proceedings of Conference on Empirical Methods in Natural
Language Processing, EMNLP 2002 (2002)
Pantel, P., Crestan, E., Borkovsky, A., Popescu, A.: Web-Scale distributional similarity and
entity set expansion. In: Proceedings of the 2009 Joint Conference on Empirical
Methods in Natural Language Processing and Computational Natural Language
Learning, EMNLP 2009 (2009)
Popescu, A., Etzioni, O.: Extracting product features and opinions from reviews. In:
Proceedings of Conference on Empirical Methods in Natural Language Processing,
EMNLP 2005 (2005)
Putthividhya, D., Hu, J.: Bootstrapped named entity recognition for product attribute
extraction. In: Proceedings of Conference on Empirical Methods in Natural Language
Processing, EMNLP 2011 (2011)
Qiu, G., Liu, B., Bu, J., Chen, C.: Opinion word expansion and target extraction through
double propagation. Computational Linguistics (2011)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of IEEE 77(2) (1989)
Sarawagi, S.: Information Extraction. Foundations and Trends in Databases (2008)
Sauper, C., Haghighi, A., Barzilay, R.: Content models with attitude. In: Proceedings of
Annual Meeting of the Association for Computational Linguistics, ACL 2011 (2011)
Scaffidi, C., Bierhoff, K., Chang, E., Felker, M., Ng, H., Jin, C.: Red opal: product-feature
scoring from reviews. In: Proceedings of the 9th International Conference on Electronic
Commerce, EC 2007 (2007)
Stoyanov, V., Cardie, C.: Topic identification for fine-grained opinion analysis. In:
Proceedings of International Conference on Computational Linguistics, COLING 2008
(2008)
Su, Q., Xu, X., Guo, H., Guo, Z., Wu, X., Zhang, X., Swen, B., Su, Z.: Hidden sentiment
association in Chinese web opinion mining. In: Proceedings of International Conference
on World Wide Web, WWW 2008 (2008)
Sutton, C., McCallum, A.: An introduction to conditional random fields for relational
learning. Introduction to Statistical Relational Learning. MIT Press (2006)
Titov, I., McDonald, R.: Modeling online reviews with multi-grain topic models. In:
Proceedings of International Conference on World Wide Web, WWW 2008 (2008a)
Titov, I., McDonald, R.: A joint model of text and aspect ratings for sentiment
summarization. In: Proceedings of Annual Meeting of the Association for
Computational Linguistics, ACL 2008 (2008b)
Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised
classification of reviews. In: Proceedings of Annual Meeting of the Association for
Computational Linguistics, ACL 2002 (2002)
Wang, B., Wang, H.: Bootstrapping both product features and opinion words from Chinese
customer reviews with cross-inducing. In: Proceedings of the International Joint
Conference on Natural Language Processing, IJCNLP 2008 (2008)
Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating
regression approach. In: Proceedings of ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD 2010 (2010)
Wei, W., Gulla, J.A.: Sentiment learning on product reviews via sentiment ontology tree.
In: Proceedings of Annual Meeting of the Association for Computational Linguistics
(ACL 2010) (2010)
Wiebe, J., Riloff, E.: Creating subjective and objective sentence classifiers from
unannotated texts. In: Proceedings of Computational Linguistics and Intelligent Text
Processing, CICLing 2005 (2005)
Wiebe, J., Wilson, T., Bruce, R., Bell, M., Martin, M.: Learning subjective language.
Computational Linguistics 30(3), 277–308 (2004)
Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of the Human Language Technology Conference
and the Conference on Empirical Methods in Natural Language Processing,
HLT/EMNLP 2005 (2005)
Wu, Y., Zhang, Q., Huang, X., Wu, L.: Phrase dependency parsing for opinion mining. In:
Proceedings of Conference on Empirical Methods in Natural Language Processing,
EMNLP 2009 (2009)
Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyzer: extracting sentiments
about a given topic using natural language processing techniques. In: Proceedings of
International Conference on Data Mining, ICDM 2003 (2003)
Yu, H., Han, J., Chang, K.: PEBL: Positive example based learning for Web page
classification using SVM. In: Proceedings of ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD 2002 (2002)
Yu, J., Zha, Z., Wang, M., Chua, T.: Aspect ranking: identifying important product aspects
from online consumer reviews. In: Proceedings of Annual Meeting of the Association
for Computational Linguistics, ACL 2011 (2011a)
Yu, J., Zha, Z., Wang, M., Wang, K., Chua, T.: Domain-Assisted product aspect hierarchy
generation: towards hierarchical organization of unstructured consumer reviews. In:
Proceedings of Conference on Empirical Methods in Natural Language Processing,
EMNLP 2011 (2011b)
Zhai, Z., Liu, B., Xu, H., Jia, P.: Clustering product features for opinion mining. In:
Proceedings of ACM International Conference on Web Search and Data Mining,
WSDM 2011 (2011)
Zhai, Z., Liu, B., Xu, H., Jia, P.: Grouping product features using semi-supervised learning
with soft-constraints. In: Proceedings of International Conference on Computational
Linguistics, COLING 2010 (2010)
Zhang, L., Liu, B., Lim, S., O’Brien-Strain, E.: Extracting and ranking product features in
opinion documents. In: Proceedings of International Conference on Computational
Linguistics, COLING 2010 (2010)
Zhang, L., Liu, B.: Identifying noun product features that imply opinions. In: Proceedings
of Annual Meeting of the Association for Computational Linguistics, ACL 2011 (2011a)
Zhang, L., Liu, B.: Extracting resource terms for sentiment analysis. In: Proceedings of the
International Joint Conference on Natural Language Processing, IJCNLP 2011 (2011b)
Zhang, L., Liu, B.: Entity set expansion in opinion documents. In: Proceedings of ACM Conference on Hypertext and Hypermedia, HT 2011 (2011c)
Zhao, W., Jiang, J., Yan, H., Li, X.: Jointly modeling aspects and opinions with a MaxEnt-
LDA hybrid. In: Proceedings of Conference on Empirical Methods in Natural Language
Processing, EMNLP 2010 (2010)
Zhu, J., Wang, H., Tsou, B.K., Zhu, M.: Multi-aspect opinion polling from textual reviews.
In: Proceedings of ACM International Conference on Information and Knowledge
Management, CIKM 2009 (2009)
Zhuang, L., Jing, F., Zhu, X.: Movie review mining and summarization. In: Proceedings of
ACM International Conference on Information and Knowledge Management, CIKM
2006 (2006)
Mining Periodicity from Dynamic and
Incomplete Spatiotemporal Data
Zhenhui Li
Pennsylvania State University, University Park, PA
e-mail: [email protected]
Jiawei Han
University of Illinois at Urbana-Champaign, Champaign, IL
e-mail: [email protected]
1 Introduction
With the rapid development of positioning technologies, sensor networks, and on-
line social media, spatiotemporal data is now widely collected from smartphones
carried by people, sensor tags attached to animals, GPS tracking systems on cars
and airplanes, RFID tags on merchandise, and location-based services offered by
social media. While such tracking systems act as real-time monitoring platforms,
analyzing spatiotemporal data generated from these systems frames many research
problems and high-impact applications. For example, understanding and model-
ing animal movement is important to addressing environmental challenges such
as climate and land use change, bio-diversity loss, invasive species, and infectious
diseases.
As spatiotemporal data becomes widely available, there are emergent needs in
many applications to understand the increasingly large collections of data. Among
all the patterns, one of the most common is periodic behavior. A periodic behavior can be loosely defined as repeating activities at certain locations with regular time intervals. For example, bald eagles start migrating to South America in late October and go back to Alaska around mid-March. People may show weekly periodicity in their office attendance.
Mining periodic behaviors can benefit us in many aspects. First, periodic behav-
iors provide an insightful and concise explanation over the long moving history. For
example, animal movements can be summarized using a mixture of multiple daily and yearly periodic behaviors. Second, periodic behaviors are also useful for compressing spatiotemporal data [17, 25, 4]. Spatiotemporal data usually has a huge volume because it keeps growing as time passes. Once we extract periodic patterns, however, recording the periodic behaviors rather than the original data saves a great deal of storage space without losing much information. Finally, periodicity is
extremely useful in future movement prediction [10], especially for a distant query-
ing time. At the same time, if an object fails to follow regular periodic behaviors, it
could be a signal of abnormal environment change or an accident.
More importantly, since spatiotemporal data is just a special class of temporal
data, namely two-dimensional temporal data, many ideas and techniques we discuss
in this chapter can actually be applied to other types of temporal data collected
in a broad range of fields such as bioinformatics, social networks, environmental
science, and so on. For example, the notion of probabilistic periodic behavior can
be very useful in understanding the social behaviors of people via analyzing the
social network data such as tweets. Also, the techniques we developed for period
detection from noisy and incomplete observations can be applied to any kind of
temporal event data, regardless of the type of the collecting sensor.
By manually examining the raw data, it is almost impossible to extract the hidden periodic behaviors. In fact, the periodic behaviors are quite complicated: there are multiple periods, and periodic behaviors may interleave with each other. Below we summarize the major challenges in mining
periodic behavior from movement data:
1. A real-life moving object never strictly follows a given periodic pattern.
For example, birds never follow exactly the same migration paths every year.
Their migration routes are strongly affected by weather conditions and thus could
be substantially different from previous years. Meanwhile, even though birds
generally stay in north in the summer, it is not the case that they stay at exactly
the same locations, on exactly the same days of the year, as in previous years. Therefore, “north” is a fairly vague geo-concept that is hard to model from raw trajectory data. Moreover, birds could have multiple interleaved periodic behav-
iors at different spatiotemporal granularities, as a result of daily periodic hunting
behaviors, combined with yearly migration behaviors.
2. We usually have incomplete observations, which are unevenly sampled and have a large portion of missing data. For example, a bird can only carry a small sensor that reports one or two locations every three to five days, and the locations of a person may only be recorded when he uses his cellphone. Moreover, if a sensor
is not functioning or a tracking facility is turned off, it could result in a large
portion of missing data.
3. With the periods detected, the corresponding periodic behaviors should be mined
to provide a semantic understanding of movement data, such as the hidden pe-
riodic behaviors shown in Figure 1. The challenge in this step lies in the inter-
leaving nature of multiple periodic behaviors. As we can see for a person’s movement shown in Figure 1, one periodic behavior can be associated with different locations; for example, periodic behavior #1 is associated with both the office and the dorm. Also, the same period (i.e., day) could be associated with two differ-
ent periodic behaviors, one from September to May and the other from June to
August.
$$\mathrm{ACF}(\tau) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\cdot x(n+\tau)$$
If ACF(τ ∗ ) is the maximum over autocorrelation values of all time lags, it means
that τ∗ is most likely to be the period of the sequence. Different from the Fourier transform, where k∗/N lies in the frequency domain, the time lag τ∗ lies in the time domain.
Note that these spectra of Z are functions of the frequency f , which is the recip-
rocal of duration, D (i.e., D = 1/ f ). It can be shown that Z( f ) provides an indication
of the trend of circular motion, and can also be used to distinguish clockwise from
counterclockwise patterns. Interested readers are referred to [2] for detailed illustra-
tions and results of CFT.
Meanwhile, it is important to distinguish the circular analysis from the aforemen-
tioned recursion analysis. Note that a closed path detected by recursion analysis is not
necessarily circular, and similarly a clockwise or counterclockwise movement does
not ensure a recursion. In this sense, these two methods are complementary to each
other. Consequently, one can combine these two methods to answer more complex
questions such as whether there is a circular path between recursions.
First, the performance of recursion analysis relies heavily on the resolution of land-
scape discretization, for which expert information about the moving objects’ typical
range of activity is crucial. For example, one will miss a lot of recursions when the
resolution is set too coarse, whereas when the resolution is set too fine a large num-
ber of false positives will occur. For the same reason, recursion analysis is
also very sensitive to noise in the movement data.
Second, while circular analysis does not have the same dependency issue as recursion analysis, its usage is strictly restricted to detecting circular paths in the movement data. Unfortunately, real-world spatiotemporal data often exhibit much more complex periodic patterns which are not necessarily circular (see Figure 2 for an example). Therefore, the development of a more flexible method is of great importance in practice.
Finally, as we mentioned before, the objects of interest (e.g., humans, animals)
often have multiple periodic behaviors with the same period, which is completely
ignored by existing methods. In order to achieve semantic understanding of the data,
it is important for our algorithm to be able to mine such multiple behaviors in move-
ment data.
With all of these considerations in mind, we now proceed to describe a new algorithm for periodic behavior mining in spatiotemporal data, which handles all the
aforementioned difficulties in a unified framework.
to be in the office at 9:00 every day.” One may argue that these frequent periodic patterns can be further summarized using a probabilistic modeling approach [26, 22].
But such models built on frequent periodic patterns do not truly reflect the real un-
derlying periodic behaviors from the original movement, because frequent patterns
are already a lossy summarization over the original data. Furthermore, if we can
directly mine periodic behaviors on the original movement using polynomial time
complexity, it is unnecessary to mine frequent periodic patterns and then summarize
over these patterns.
We formulate the periodic behavior mining problem and propose the assumption
that the observed movement is generated from several periodic behaviors associated
with some reference locations. We design a two-stage algorithm, Periodica, to detect
the periods and further find the periodic behaviors.
At the first stage, we focus on detecting all the periods in the movement. Given
the raw data as shown in Figure 1, we use the kernel method to discover those refer-
ence locations, namely reference spots. For each reference spot, the movement data
is transformed from a spatial sequence to a binary sequence, which facilitates the
detection of periods by filtering the spatial noise. Besides, based on our assumption,
every period will be associated with at least one reference spot. All periods in the
movement can be detected if we try to detect the periods in every reference spot.
At the second stage, we statistically model the periodic behavior using a generative
model. Based on this model, underlying periodic behaviors are generalized from
the movement using a hierarchical clustering method and the number of periodic
behaviors is automatically detected by measuring the representation error.
There are two subtasks in the periodic behavior mining problem: detecting the periods and mining the periodic behaviors. We therefore propose a two-stage algorithm, Periodica, where the overall procedure is developed in two stages and each stage targets one subtask.
Algorithm 1 shows the general framework of Periodica. At the first stage, we
first find all the reference spots (Line 2) and for each reference spot, the periods are
detected (Lines 3∼5). Then for every period T , we consider the reference spots with
period T and further mine the corresponding periodic behaviors (Lines 7∼10).
Algorithm 1. Periodica
INPUT: A movement sequence LOC = loc1 loc2 · · · locn .
OUTPUT: A set of periodic behaviors.
ALGORITHM:
1: /* Stage 1: Detect periods */
2: Find reference spots O = {o1 , o2 , · · · , od };
3: for each oi ∈ O do
4: Detect periods in oi and store the periods in Pi ;
5: Pset ← Pset ∪ Pi ;
6: end for
7: /* Stage 2: Mine periodic behaviors */
8: for each T ∈ Pset do
9: OT = {oi |T ∈ Pi };
10: Construct the symbolized sequence S using OT ;
11: Mine periodic behaviors in S.
12: end for
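A minimal Python skeleton mirroring the structure of Algorithm 1 might look as follows; the four helper callables (reference-spot detection, period detection, symbolization, and behavior mining) are hypothetical placeholders for the steps described in the rest of this chapter, so this is a sketch of the control flow rather than the authors' implementation.

```python
from collections import defaultdict

def periodica(locations,
              find_reference_spots,
              detect_periods,
              symbolize,
              mine_behaviors):
    """Sketch of the two-stage Periodica framework (Algorithm 1).

    locations: the movement sequence loc_1 ... loc_n.
    The four callables are hypothetical hooks for the steps discussed in
    this chapter."""
    # Stage 1: detect periods for each reference spot.
    spots = find_reference_spots(locations)
    periods_per_spot = {spot: detect_periods(locations, spot) for spot in spots}

    # Group reference spots by the periods they exhibit.
    spots_by_period = defaultdict(list)
    for spot, periods in periods_per_spot.items():
        for T in periods:
            spots_by_period[T].append(spot)

    # Stage 2: mine periodic behaviors for every detected period T.
    behaviors = []
    for T, spots_T in spots_by_period.items():
        symbolized = symbolize(locations, spots_T)   # spatial -> symbolic sequence
        behaviors.extend(mine_behaviors(symbolized, T))
    return behaviors
```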
One straightforward approach is to map the movement data onto a complex plane and use Fourier transform to detect the periods. However, as shown in Fig-
ure 2(b) and Figure 2(c), there is no strong signal corresponding to the correct period
because such method is sensitive to the spatial noise. If the object does not follow
more or less the same hunting route every day, the period can hardly be detected.
However, in real cases, few objects repeat exactly the same route in their periodic
movement.
Our key observation is that, if we view the data from the den, the period is easier to detect. In Figure 2(d), we transform the movement into a binary sequence, where 1 means the animal is at the den and 0 means it is out. It is easy to see
the regularity in this binary sequence. Our idea is to find some important reference
locations, namely reference spots, from which to view the movement. In this example, the den
serves as our reference spot.
The notion of reference spots has several merits. First, it filters out the spatial
noise and turns the period detection problem from a 2-dimensional space (i.e., spa-
tial) to a 1-dimensional space (i.e., binary). As shown in Figure 2(d), we do not care
where the animal goes when it is out of the den. As long as it follows a regular pat-
tern going out and coming back to the den, there is a period associated with the den.
Second, we can detect multiple periods in the movement. Consider the scenario where there is a daily period at one reference spot and a weekly period at another reference spot; if we analyze the movement as a whole, it is possible that only the period “day” is discovered because the shorter
period will repeat more times. But if we view the movement from two reference
spots separately, both periods can be individually detected. Third, based on the as-
sumption that each periodic behavior is associated with some reference locations,
all the periods can be found through reference spots.
The rest of this section discusses in detail how to find reference spots and
detect the periods on the binary sequence for each reference spot.
Finding Reference Spots. Since an object with periodic movement will repeatedly
visit some specific places, if we only consider the spatial information of the move-
ment, reference spots are those dense regions containing more points than the other
regions. Note that the reference spots are obtained for each individual object.
Many methods can be applied to detect the reference spots, such as density-based
clustering. The methods could vary according to different applications. We adapt a
popular kernel method [24], which is designed for the purpose of finding home
ranges of animals. For human movement, we may use important location detection
methods in [14, 31].
Since computing the density for each location in a continuous space is computa-
tionally expensive, we discretize the space into a regular w × h grid and compute the
density for each cell. The grid size is determined by the desired resolution to view
the spatial data. If an animal has frequent activities at one place, this place will have
higher probability to be its home. This actually aligns very well with our definition
of reference spots.
For each grid cell c, the density is estimated using the bivariate normal density kernel,

$$f(c) = \frac{1}{n\gamma^2}\sum_{i=1}^{n}\frac{1}{2\pi}\exp\left(-\frac{|c - loc_i|^2}{2\gamma^2}\right),$$

where |c − loc_i| is the distance between cell c and location loc_i. In addition, γ is a smoothing parameter determined by the following heuristic method [2],

$$\gamma = \frac{1}{2}\left(\sigma_x^2 + \sigma_y^2\right)^{\frac{1}{2}} n^{-\frac{1}{6}},$$

where σ_x and σ_y are the standard deviations of the whole sequence LOC in its x- and y-coordinates, respectively. The time complexity of this method is O(w · h · n).
After obtaining the density values, a reference spot can be defined by a contour
line on the map, which joins the cells of the equal density value, with some density
threshold. The threshold can be determined as the top-p% density value among all
the density values of all cells. The larger the value p is, the bigger the size of refer-
ence spot is. In practice, p can be chosen based on prior knowledge about the size
of the reference spots. In many real applications, we can assume that the reference
spots are usually very small on a large map (e.g., within 10% of the whole area). So, by setting p = 15, most parts of the reference spots should be detected with high
probability.
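As a concrete illustration of this step, the sketch below estimates the grid density with the bivariate normal kernel and smoothing heuristic above and keeps the top-p% densest cells; the grid scaling, function name, and use of NumPy are assumptions of this sketch, not part of the original method.

```python
import numpy as np

def reference_spot_mask(locations, w=100, h=100, p=0.15):
    """Estimate a kernel density on a w x h grid and mark the top-p% densest
    cells as candidate reference-spot cells.

    locations: array of shape (n, 2) with (x, y) positions scaled to [0, w) x [0, h).
    """
    locations = np.asarray(locations, dtype=float)
    n = len(locations)

    # Smoothing parameter: gamma = 0.5 * sqrt(sigma_x^2 + sigma_y^2) * n^(-1/6).
    sigma_x, sigma_y = locations[:, 0].std(), locations[:, 1].std()
    gamma = 0.5 * np.sqrt(sigma_x**2 + sigma_y**2) * n ** (-1.0 / 6.0)

    # Cell centers of the grid.
    xs, ys = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5, indexing="ij")
    cells = np.stack([xs.ravel(), ys.ravel()], axis=1)          # shape (w*h, 2)

    # Bivariate normal kernel density for every cell: O(w * h * n).
    sq_dist = ((cells[:, None, :] - locations[None, :, :]) ** 2).sum(axis=2)
    density = np.exp(-sq_dist / (2 * gamma**2)).sum(axis=1) / (2 * np.pi * n * gamma**2)
    density = density.reshape(w, h)

    # Keep cells whose density falls within the top p% of all cell densities.
    threshold = np.quantile(density, 1 - p)
    return density >= threshold
```

Connected regions of the returned mask then correspond to the reference spots delineated by the contour lines described above.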
To illustrate this idea, assume that a bird stays in a nest for half a year and moves
to another nest staying for another half year. At each nest, it has a daily periodic
behavior of going out for food during the daytime and coming back to the nest at
night, as shown in Figure 3. Note that the two small areas (spot #2 and spot #3)
are the two nests and the bigger region is the food resource (spot #1). Figure 3(a)
shows the density calculated using the kernel method. The grid size is 100 × 100.
The darker the color, the higher the density. Figure 3(b) shows the reference spots identified by contours using the top-15% density threshold.
Periods Detection on Binary Sequence. Given a set of reference spots, we fur-
ther propose a method to obtain the potential periods within each spot separately.
Viewed from a single reference spot, the movement sequence now can be trans-
formed into a binary sequence B = b1 b2 . . . bn , where bi = 1 when this object is
within the reference spot at timestamp i and 0 otherwise. In discrete signal processing, the most popular methods to detect periods in a sequence are Fourier trans-
form and autocorrelation, which essentially complement each other in the following
sense, as discussed in [21]. On one hand, Fourier transform often suffers from the
low resolution problem in the low frequency region, hence provides poor estimation
of large periods. Also, the well-known spectral leakage problem of Fourier trans-
form tends to generate a lot of false positives in the periodogram. On the other
hand, autocorrelation offers accurate estimation for both short and large periods, but it is more difficult to set a significance threshold for important periods. Con-
sequently, [21] proposed to combine Fourier transform and autocorrelation to find
periods. Here, we adapt this approach to find periods in the binary sequence B.
In Discrete Fourier Transform (DFT), the sequence B = b1 b2 . . . bn is transformed
into the sequence of n complex numbers X1 , X2 , . . . , Xn . Given coefficients X, the
periodogram is defined as the squared magnitude of each Fourier coefficient: $F_k = \|X_k\|^2$.
Here, $F_k$ is the power of frequency k. In order to determine which frequencies are important, we need to set a threshold and identify the frequencies whose power exceeds it.
The threshold is determined using the following method. Let B′ be a randomly permuted sequence obtained from B. Since B′ should not exhibit any periodicity, even its maximum power does not indicate a period in the sequence. Therefore, we record its maximum power as pmax, and only the frequencies in B that have higher power
than pmax may correspond to real periods. To provide a 99% confidence level on
what frequencies are important, we repeat the above random permutation experi-
ment 100 times and record the maximum power of each permuted sequence. The
99-th largest value of these 100 experiments will serve as a good estimator of the
power threshold.
Given that $F_k$ is larger than the power threshold, we still need to determine the exact period in the time domain, because a single value k in the frequency domain corresponds to a range of periods $[\frac{n}{k}, \frac{n}{k-1})$ in the time domain. In order to do this, we use circular autocorrelation, which examines how similar a sequence is to its previous values for different lags τ: $R(\tau) = \sum_{i=1}^{n} b_i\, b_{i+\tau}$.
Thus, for each period range [l, r) given by the periodogram, we test whether
there is a peak in {R(l), R(l + 1), . . . , R(r − 1)} by fitting the data with a quadratic
function. If the resulting function is concave in the period range, which indicates the
existence of a peak, we return t ∗ = argmaxl≤t<r R(t) as a detected period. Similarly,
we employ a 99% confidence level to eliminate false positives caused by noise.
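A compact Python sketch of this combined procedure is given below; the permutation count, the conservative choice of threshold, and the simplified peak check are assumptions of the sketch, not the exact procedure of [21].

```python
import numpy as np

def detect_periods(b, n_perm=100, rng=None):
    """Sketch: detect periods in a binary sequence b by combining a
    permutation-thresholded periodogram with circular autocorrelation."""
    rng = rng or np.random.default_rng(0)
    b = np.asarray(b, dtype=float)
    n = len(b)

    def periodogram(seq):
        return np.abs(np.fft.fft(seq)) ** 2

    # Power threshold from randomly permuted copies of b; we conservatively
    # take the maximum of the permutation maxima (the text uses the 99th largest).
    perm_max = [periodogram(rng.permutation(b))[1:n // 2].max() for _ in range(n_perm)]
    threshold = max(perm_max)

    power = periodogram(b)
    acf = np.array([np.dot(b, np.roll(b, -tau)) for tau in range(n)])  # circular R(tau)

    periods = set()
    for k in range(2, n // 2):                    # skip trivial frequencies
        if power[k] <= threshold:
            continue
        lo, hi = n // k, min(n // (k - 1), n)     # period range [n/k, n/(k-1))
        if hi <= lo:
            continue
        t_star = lo + int(np.argmax(acf[lo:hi]))
        # A full implementation would additionally verify a genuine ACF peak
        # here (e.g., by a concave quadratic fit at 99% confidence).
        periods.add(t_star)
    return sorted(periods)
```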
[Figure: (a) periodogram and (b) circular autocorrelation of the binary sequence; the detected peak corresponds to the period T = 24.]
Now, suppose $I^1, I^2, \ldots, I^l$ follow the same periodic behavior. The probability that the segment set $\mathcal{I} = \bigcup_{j=1}^{l} I^j$ is generated by some distribution matrix P is

$$P(\mathcal{I}\,|\,P) = \prod_{I^j \in \mathcal{I}} \prod_{k=1}^{T} p(x_k = I_k^j).$$

¹ If n is not a multiple of T, then the last (n mod T) positions are truncated.
When $KL(P\,\|\,Q)$ is small, it means that the two distribution matrices P and Q are similar, and vice versa.

Note that $KL(P\,\|\,Q)$ becomes infinite when $p(x_k = i)$ or $q(x_k = i)$ has zero probability. To avoid this situation, we add to $p(x_k = i)$ (and $q(x_k = i)$) a background variable u which is uniformly distributed among all reference spots,

$$p'(x_k = i) = (1 - \lambda)\, p(x_k = i) + \lambda u, \qquad (3)$$

where λ is a small smoothing parameter, 0 < λ < 1.

Now, suppose we have two periodic behaviors, $H_1 = \langle T, P\rangle$ and $H_2 = \langle T, Q\rangle$. We define the distance between these two behaviors as

$$dist(H_1, H_2) = KL(P \,\|\, Q).$$
Suppose there exist K underlying periodic behaviors. There are many ways to
group the segments into K clusters with the distance measure defined. However, the
number of underlying periodic behaviors (i.e., K) is usually unknown. So we pro-
pose a hierarchical agglomerative clustering method to group the segments while at the same time determining the optimal number of periodic behaviors. At each iteration
of the hierarchical clustering, two clusters with the minimum distance are merged.
In Algorithm 2, we first describe the clustering method assuming K is given. We
will return to the problem of selecting optimal K later.
$$P = \frac{|C_s|}{|C_s| + |C_t|}\, P_s + \frac{|C_t|}{|C_s| + |C_t|}\, P_t. \qquad (4)$$
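For illustration, a minimal version of the smoothed-KL distance and the greedy merging of Equation (4) might look as follows in Python, with segment distribution matrices of shape T x d (d reference spots plus the unknown spot) and K assumed to be given; it is a sketch under these assumptions, not the chapter's actual implementation.

```python
import numpy as np

def smoothed(P, lam=0.01):
    """Smooth a T x d distribution matrix with a uniform background (Equation 3)."""
    d = P.shape[1]
    return (1.0 - lam) * P + lam / d

def behavior_distance(P, Q, lam=0.01):
    """KL(P || Q), summed over the T rows of the smoothed distribution matrices."""
    Ps, Qs = smoothed(P, lam), smoothed(Q, lam)
    return float(np.sum(Ps * np.log(Ps / Qs)))

def agglomerate(segment_matrices, K):
    """Greedy hierarchical agglomerative clustering of segment distribution
    matrices down to K clusters, merging by the size-weighted average of
    Equation (4). Returns (cluster matrices, cluster sizes)."""
    clusters = [np.asarray(P, dtype=float) for P in segment_matrices]
    sizes = [1] * len(clusters)
    while len(clusters) > K:
        # Find the pair of clusters with minimum KL distance and merge them.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: behavior_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        ni, nj = sizes[i], sizes[j]
        merged = (ni * clusters[i] + nj * clusters[j]) / (ni + nj)   # Equation (4)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        sizes = [s for k, s in enumerate(sizes) if k not in (i, j)] + [ni + nj]
    return clusters, sizes
```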
[Figure: the two mined daily periodic behaviors, plotted as probabilities over the 24 hours of being at reference spots #1, #2, #3 or at an unknown location.]
place around 12:00. Such periodic behaviors well represent the bird’s movement
and truly reveal the mechanism we employed to generate this synthetic data.
Now, we discuss how to pick the appropriate parameter K. Ideally, during the
hierarchical agglomerative clustering, the segments generated from the same be-
havior should be merged first because they have smaller KL-divergence distance.
Thus, we judge a cluster to be good if all the segments in the cluster are concentrated in
one single reference spot at a particular timestamp. Hence, a natural representation
error measure to evaluate the representation quality of a cluster is as follows. Note
that here we exclude the reference spot o0 which essentially means the location is
unknown.
At each iteration, all the segments are partitioned into k clusters {C1 ,C2 , . . . ,Ck }.
The overall representation error at the current iteration is calculated as the mean over
all clusters,
$$E_k = \frac{1}{k}\sum_{i=1}^{k} E(C_i).$$
[Figure: representation error versus the number of clusters; the chosen number of periodic behaviors here is K = 2.]

[Figure 7: an event observed at timestamps 5, 18, 26, 29, 48, 50, 67, and 79.]
To illustrate the difficulties, let us first take a look at Figure 7. Suppose we have
observed the occurrences of an event at timestamps 5, 18, 26, 29, 48, 50, 67, and 79.
The observations of the event at other timestamps are not available. It is certainly
not an easy task to infer the period directly from these incomplete observations.
Even though some extensions of Fourier transform have been proposed to handle
uneven data samples [15, 19], they are still not applicable to the case with very low
sampling rate.
Besides, the periodic behaviors could be inherently complicated and noisy. A
periodic event does not necessarily happen at exactly the same timestamp in each
periodic cycle. For example, the time that a person goes to work in the morning
might oscillate between 8:00 to 10:00. Noises could also occur when the “in office”
event is expected to be observed on a weekday but fails to happen.
In this section, we take a completely different approach to the period detection
problem and handle all the aforementioned difficulties occurring in data collection
process and periodic behavior complexity in a unified framework. The basic idea of
our method is illustrated in Example 1.
[Figure (Example 1): the event has period 20 and its occurrences happen between 20k+5 and 20k+10; the observed timestamps are segmented and overlaid using length 20 versus length 16.]
We use a binary sequence {x(t)} to denote the observations. For example, if the event is “in the office”, x(t) = 1 means
this person is in the office at time t and x(t) = 0 means this person is not in the
office at time t. Later we will refer to x(t) = 1 as a positive observation and x(t) = 0
as a negative observation.
Definition 5 (Periodic Sequence). A sequence $\mathcal{X} = \{x(t)\}_{t=0}^{n-1}$ is said to be periodic if there exists some $T \in \mathbb{Z}$ such that $x(t + T) = x(t)$ for all values of t. We call T a period of $\mathcal{X}$.
A fundamental ambiguity with the above definition is that if T is a period of X ,
then mT is also a period of X for any m ∈ Z. A natural way to resolve this problem
is to use the so called prime period.
Definition 6 (Prime Period). The prime period of a periodic sequence is the small-
est T ∈ Z such that x(t + T ) = x(t) for all values of t.
For the rest of the section, unless otherwise stated, we always use the word “period” to mean “prime period”.
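As a quick illustration of Definitions 5 and 6, the following sketch checks candidate values of T and returns the smallest one satisfying x(t + T) = x(t) for all valid t; it is a brute-force check for intuition only, not the detection method developed below.

```python
def prime_period(x):
    """Return the smallest T such that x[t + T] == x[t] for all valid t,
    or None if the (finite) sequence has no such T < len(x)."""
    n = len(x)
    for T in range(1, n):
        if all(x[t + T] == x[t] for t in range(n - T)):
            return T
    return None

# Example: a perfectly periodic binary sequence with prime period 4.
print(prime_period([1, 0, 0, 1] * 5))  # -> 4
```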
As we mentioned before, in real applications the observed sequences always deviate from perfect periodicity due to oscillating behavior and noise. To model this, the binary sequence is assumed to be generated from a periodic distribution vector $p^T = (p^T_0, \ldots, p^T_{T-1})$, where each observation x(t) is drawn independently with probability $p^T_{t \bmod T}$ of being 1. Here we need to exclude the trivial cases where $p^T = 0^T$ or $1^T$. Also note that if we restrict the value of each $p^T_i$ to {0, 1} only, then the resulting $\mathcal{X}$ is strictly periodic according to Definition 5. We are now able to formulate our period detection problem as follows.

Problem 1 (Event Period Detection). Given a binary sequence $\mathcal{X}$ generated according to any periodic distribution vector $p^{T_0}$, find $T_0$.
Fig. 9 (Running Example) Periodic distribution vector of an event with daily periodicity T₀ = 24
Example 2 (Running Example). We will use a running example throughout the sec-
tion to illustrate our method. Assume that a person has a daily periodicity visiting
his office during 10am-11am and 2pm-4pm. His observation sequence is generated
from the periodic distribution vector with high probabilities at time interval [10:11]
and [14:16] and low but nonzero probabilities at other timestamps, as shown in
Figure 9.
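Under the model just described, the running example's data can be simulated in a few lines of Python; the probability values, sampling rate, and function name are illustrative assumptions, not taken from the authors' experiments.

```python
import numpy as np

def simulate_running_example(days=1000, sampling_rate=0.1, seed=0):
    """Simulate the running example: a daily (T0 = 24) periodic distribution
    vector with high probability at hours 10-11 and 14-16, low elsewhere,
    observed only with probability `sampling_rate` at each timestamp.
    Returns an array with values 1 (positive), 0 (negative), -1 (unobserved)."""
    rng = np.random.default_rng(seed)
    p = np.full(24, 0.05)               # low background probability
    p[[10, 11, 14, 15, 16]] = 0.9       # hours with high visit probability
    n = 24 * days
    x = (rng.random(n) < p[np.arange(n) % 24]).astype(int)
    observed = rng.random(n) < sampling_rate
    return np.where(observed, x, -1)
```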
where $F_T(t) = t \bmod T$, and further compute the ratios of 1's and 0's whose corresponding timestamps fall into I after overlay:

$$\mu^{+}_{\mathcal{X}}(I, T) = \frac{|S^{+}_{I}|}{|S^{+}|}, \qquad \mu^{-}_{\mathcal{X}}(I, T) = \frac{|S^{-}_{I}|}{|S^{-}|}. \qquad (5)$$
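For instance, the two ratios in Equation (5) can be computed directly from an observation sequence as in the sketch below, where timestamps without observations are encoded as −1 and ignored; the encoding and function name are assumptions of this illustration.

```python
import numpy as np

def overlay_ratios(x, I, T):
    """Compute mu_plus(I, T) and mu_minus(I, T) from Equation (5):
    the fractions of positive (1) and negative (0) observations whose
    timestamps fall into the residue set I after overlaying with period T.
    Unobserved timestamps encoded as -1 are ignored."""
    x = np.asarray(x)
    residues = np.arange(len(x)) % T
    in_I = np.isin(residues, list(I))
    pos, neg = (x == 1), (x == 0)
    mu_plus = in_I[pos].mean() if pos.any() else 0.0
    mu_minus = in_I[neg].mean() if neg.any() else 0.0
    return mu_plus, mu_minus
```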
The following lemma says that these ratios indeed reveal the true underlying probabilistic model parameters, given that the observation sequence is sufficiently long.

Lemma 1. Suppose a binary sequence $\mathcal{X} = \{x(t)\}_{t=0}^{n-1}$ is generated according to some periodic distribution vector $p^T$ of length T; write $q^T_i = 1 - p^T_i$. Then $\forall I \in \mathcal{I}_T$,

$$\lim_{n\to\infty}\mu^{+}_{\mathcal{X}}(I, T) = \frac{\sum_{i\in I} p^T_i}{\sum_{i=0}^{T-1} p^T_i}, \qquad \lim_{n\to\infty}\mu^{-}_{\mathcal{X}}(I, T) = \frac{\sum_{i\in I} q^T_i}{\sum_{i=0}^{T-1} q^T_i}.$$
Proof. Define

$$c_i = \frac{p_i^{T_0}}{\sum_{k=0}^{T_0-1} p_k^{T_0}} - \frac{q_i^{T_0}}{\sum_{k=0}^{T_0-1} q_k^{T_0}};$$

it is easy to see that the value $\lim_{n\to\infty}\gamma_{\mathcal{X}}(T_0)$ is achieved by $I^* = \{i \in [0, T_0-1] : c_i > 0\}$. So it suffices to show that for any $T \in \mathbb{Z}$ and $I \in \mathcal{I}_T$,

$$\lim_{n\to\infty}\Delta_{\mathcal{X}}(I, T) \le \lim_{n\to\infty}\Delta_{\mathcal{X}}(I^*, T_0) = \sum_{i\in I^*} c_i.$$

Therefore we have

$$\lim_{n\to\infty}\Delta_{\mathcal{X}}(I, T)
= \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1}\left(\frac{p^{T_0}_{F_{T_0}(i+jT)}}{\sum_{k=0}^{T_0-1} p_k^{T_0}} - \frac{q^{T_0}_{F_{T_0}(i+jT)}}{\sum_{k=0}^{T_0-1} q_k^{T_0}}\right)
= \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1} c_{F_{T_0}(i+jT)}$$
$$\le \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1} \max\!\left(c_{F_{T_0}(i+jT)}, 0\right)
\le \frac{1}{T}\sum_{j=0}^{T_0 T - 1} \max\!\left(c_{F_{T_0}(j)}, 0\right)
= \frac{1}{T}\times T\sum_{i\in I^*} c_i = \sum_{i\in I^*} c_i.$$
Fig. 10 (a) and (c): ratios of 1's and 0's at a single timestamp (i.e., $\mu^{+}_{\mathcal{X}}(\cdot, T)$ and $\mu^{-}_{\mathcal{X}}(\cdot, T)$) when T = 24 and T = 23, respectively. (b) and (d): discrepancy scores at a single timestamp (i.e., $\Delta_{\mathcal{X}}(\cdot, T)$) when T = 24 and T = 23.
[Figure 11: periodicity scores for all potential periods T in [1, 200]; the maximum is at T = 24 hours.]
When using the correct period T = 24, the positive observations have a high probability of falling into the set of timestamps {10, 11, 14, 15, 16}. However, when us-
ing the wrong period T = 23, the distribution is almost uniform over time, as shown
in Figure 10(c). Similarly, we see large discrepancy scores for T = 24 (Figure 10(b))
whereas the discrepancy scores are very small for T = 23 (Figure 10(d)). There-
fore, we will have γX (24) > γX (23). Figure 11 shows the periodicity scores for all
potential periods in [1 : 200]. We can see that the score is maximized at T = 24,
which is the true period of the sequence.
In general, we may assume that each dt is independently drawn from some fixed
but unknown distribution f over the interval [0, 1]. To avoid the trivial case where
dt ≡ 0 for all t, we further assume that it has nonzero mean: ρ f > 0. Although
this model seems to be very flexible, in this section we prove that our periodicity measure is still valid. In order to do so, we need the following lemma, which states that $\mu^{+}_{\mathcal{X}}(I, T)$ and $\mu^{-}_{\mathcal{X}}(I, T)$ remain the same as before, assuming an infinite-length observation sequence.
Lemma 3. Suppose $d = \{d_t\}_{t=0}^{n-1}$ are i.i.d. random variables in [0, 1] with nonzero mean, and a sequence $\mathcal{X}$ is generated according to $(p^T, d)$; write $q^T_i = 1 - p^T_i$. Then $\forall I \in \mathcal{I}_T$,

$$\lim_{n\to\infty}\mu^{+}_{\mathcal{X}}(I, T) = \frac{\sum_{i\in I} p^T_i}{\sum_{i=0}^{T-1} p^T_i}, \qquad \lim_{n\to\infty}\mu^{-}_{\mathcal{X}}(I, T) = \frac{\sum_{i\in I} q^T_i}{\sum_{i=0}^{T-1} q^T_i}.$$

Proof. We only prove the first equation. Let y(t) be a random variable distributed according to Bernoulli($d_t$) and z(t) = x(t)y(t). Then $\{z(t)\}_{t=0}^{n-1}$ are independent random variables taking values in {0, 1}, with mean E[z(t)] computed as follows:

where we use $\lim_{n\to\infty} |S_i|/n = 1/T$ for the last equality. Therefore,
[Figure: discrepancy scores over the day and periodicity scores over potential periods under the random observation model.]
$$\lim_{n\to\infty}\mu^{+}_{\mathcal{X}}(I, T)
= \lim_{n\to\infty}\frac{|S^{+}_I|/n}{|S^{+}|/n}
= \lim_{n\to\infty}\frac{\sum_{i\in I}|S^{+}_i|/n}{\sum_{i=0}^{T-1}|S^{+}_i|/n}
= \frac{\sum_{i\in I} p^T_i\,\rho_f / T}{\sum_{i=0}^{T-1} p^T_i\,\rho_f / T}
= \frac{\sum_{i\in I} p^T_i}{\sum_{i=0}^{T-1} p^T_i}.$$
Since our periodicity measure only depends on $\mu^{+}_{\mathcal{X}}(I, T)$ and $\mu^{-}_{\mathcal{X}}(I, T)$, it is now straightforward to prove its validity under the random observation model. We summarize our main result as the following theorem.
Theorem 1. Suppose $d = \{d_t\}_{t=0}^{n-1}$ are i.i.d. random variables in [0, 1] with nonzero mean, and a sequence $\mathcal{X}$ is generated according to any $(p^{T_0}, d)$ for some $T_0$; then

$$\lim_{n\to\infty}\gamma_{\mathcal{X}}(T) \le \lim_{n\to\infty}\gamma_{\mathcal{X}}(T_0), \qquad \forall T \in \mathbb{Z}.$$

The proof is exactly the same as that of Lemma 2 given the result of Lemma 3, and hence is omitted here.
Here we make two useful comments on this result. First, the assumption that
dt ’s are independent of each other plays an important role in the proof. In fact, if
this does not hold, the observation sequence could exhibit very different periodic
behavior from its underlying periodic distribution vector. But a thorough discussion
on this issue is beyond the scope of this book. Second, this result only holds exactly
with infinite length sequences. However, it provides a good estimate on the situation
with finite length sequences, assuming that the sequences are long enough. Note
that this length requirement is particularly important when a majority of samples
are missing (i.e., ρ f is close to 0).
Bernoulli(p) random variable for some fixed p > 0. It is easy to see that for any T and $I \in \mathcal{I}_T$, we have

$$\lim_{n\to\infty}\mu^{+}_{\mathcal{U}}(I, T) = \frac{|I|}{T}. \qquad (9)$$
This corresponds to the case where the positive samples are evenly distributed over
all entries after overlay. So we propose the new discrepancy score of I as follows:
$$\Delta^{+}_{\mathcal{X}}(I, T) = \mu^{+}_{\mathcal{X}}(I, T) - \frac{|I|}{T}, \qquad (10)$$

and define the periodicity measure as

$$\gamma^{+}_{\mathcal{X}}(T) = \max_{I\in\mathcal{I}_T} \Delta^{+}_{\mathcal{X}}(I, T). \qquad (11)$$
In fact, with some slight modification to the proof of Lemma 2, we can show
that it is a desired measure under our probabilistic model, resulting in the following
theorem.
Theorem 2. Suppose $d = \{d_t\}_{t=0}^{n-1}$ are i.i.d. random variables in [0, 1] with nonzero mean, and a sequence $\mathcal{X}$ is generated according to any $(p^{T_0}, d)$ for some $T_0$; then

$$\lim_{n\to\infty}\gamma^{+}_{\mathcal{X}}(T) \le \lim_{n\to\infty}\gamma^{+}_{\mathcal{X}}(T_0), \qquad \forall T \in \mathbb{Z}.$$
Proof. Define $c^{+}_i = \frac{p^{T_0}_i}{\sum_{k=0}^{T_0-1} p^{T_0}_k} - \frac{1}{T_0}$; it is easy to see that the value $\lim_{n\to\infty}\gamma^{+}_{\mathcal{X}}(T_0)$ is achieved by $I^* = \{i \in [0, T_0-1] : c^{+}_i > 0\}$. So it suffices to show that for any $T \in \mathbb{Z}$ and $I \in \mathcal{I}_T$,

$$\lim_{n\to\infty}\Delta^{+}_{\mathcal{X}}(I, T) \le \lim_{n\to\infty}\Delta^{+}_{\mathcal{X}}(I^*, T_0) = \sum_{i\in I^*} c^{+}_i.$$

Therefore we have

$$\lim_{n\to\infty}\Delta^{+}_{\mathcal{X}}(I, T)
= \frac{1}{T}\sum_{i\in I}\left\{\sum_{j=0}^{T_0-1}\left(\frac{p^{T_0}_{F_{T_0}(i+jT)}}{\sum_{k=0}^{T_0-1} p^{T_0}_k}\right) - 1\right\}
= \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1}\left(\frac{p^{T_0}_{F_{T_0}(i+jT)}}{\sum_{k=0}^{T_0-1} p^{T_0}_k} - \frac{1}{T_0}\right)
= \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1} c^{+}_{F_{T_0}(i+jT)}$$
$$\le \frac{1}{T}\sum_{i\in I}\sum_{j=0}^{T_0-1} \max\!\left(c^{+}_{F_{T_0}(i+jT)}, 0\right)
\le \frac{1}{T}\sum_{j=0}^{T_0 T - 1} \max\!\left(c^{+}_{F_{T_0}(j)}, 0\right)
= \frac{1}{T}\times T\sum_{i\in I^*} c^{+}_i = \sum_{i\in I^*} c^{+}_i.$$
[Figure 13: (a) the ratio of positive samples at each timestamp and (b) the periodicity scores over potential periods when only positive samples are available.]
Example 5 (Running Example (cont.)). In this example we further marked all the
negative samples in the sequence we used in Example 4 as unknown. When there are no negative samples, the portion of positive samples at a single timestamp i is expected to be 1/T, as shown in Figure 13(a). The discrepancy scores when T = 24 still
have large values at {10, 11, 14, 15, 16}. Thus the correct period can be successfully
detected as shown in Figure 13(b).
4 Algorithm: Periodo
In Section 3.2, we have introduced our periodicity measure for any potential period
T ∈ Z. Our period detection method simply computes the periodicity score for every T and reports the one with the highest score.
In this section, we first describe how to compute the periodicity score for a po-
tential period and then discuss a practical issue when applying our method to finite
length sequence. We will focus on the case with both positive and negative observa-
tions. The case without negative observations can be solved in the same way.
As we have seen in Section 3.2.1, the set of timestamps I ∗ that maximizes γX (T )
can be expressed as
$$I^* = \{i \in [0, T_0 - 1] : c_i > 0\}, \qquad (12)$$

where $c_i = \frac{p^{T_0}_i}{\sum_{k=0}^{T_0-1} p^{T_0}_k} - \frac{q^{T_0}_i}{\sum_{k=0}^{T_0-1} q^{T_0}_k}$. Therefore, to find $I^*$, it suffices to compute $c_i$ for each $i \in [0, T_0 - 1]$ and select the ones with $c_i > 0$.
Time Complexity Analysis. For every potential period T, it takes O(n) time to compute the discrepancy score for a single timestamp (i.e., $c_i$) and then O(T) time to compute the periodicity $\gamma_{\mathcal{X}}(T)$. Since a potential period can lie anywhere in [1, n], the time complexity of our method is O(n²). In practice, it is usually unnecessary to try all potential periods. For example, we may know from domain knowledge that the periods will be no larger than a certain value. So we only need to try potential periods up to $n_0$, where $n_0 \ll n$. This makes our method efficient in practice, with time complexity O(n × $n_0$).
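A compact sketch of this score computation in Python is given below; it follows the description above (overlay by residue, estimate $c_i$ from the data, sum the positive ones), with unobserved timestamps encoded as −1 as in the simulation sketch earlier. It is an illustration under these assumptions, not the authors' Periodo implementation.

```python
import numpy as np

def periodicity_score(x, T):
    """Periodicity score for a sequence x with entries 1 (positive),
    0 (negative), -1 (unobserved). Overlays the sequence by residue modulo T,
    estimates c_i for each residue, and sums the positive c_i."""
    x = np.asarray(x)
    residues = np.arange(len(x)) % T
    pos = np.bincount(residues[x == 1], minlength=T).astype(float)
    neg = np.bincount(residues[x == 0], minlength=T).astype(float)
    if pos.sum() == 0 or neg.sum() == 0:
        # Positive-only variant: compare against the uniform ratio |I|/T (Eqs. 10-11).
        ratio = pos / max(pos.sum(), 1.0)
        c = ratio - 1.0 / T
    else:
        c = pos / pos.sum() - neg / neg.sum()
    return float(c[c > 0].sum())

def detect_period(x, max_period):
    """Return the candidate period with the highest periodicity score."""
    scores = [periodicity_score(x, T) for T in range(1, max_period + 1)]
    return int(np.argmax(scores)) + 1
```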
[Figure 14: periodicity scores on a short observation sequence, showing the general increasing trend toward larger potential periods and the periodicity score on a randomized sequence used for comparison.]
Now we want to point out a practical issue when applying our method on finite
length sequence. As one may already notice in our running example, we usually see
a general increasing trend of periodicity scores γX (T ) and γX
+
(T ) for a larger poten-
tial period T . This trend becomes more dominating as the number of observations
decreases. For example, the original running example has observations for 1000
days. If the observations cover only 20 days, our method may produce an incorrect period detection result, as in the case shown in Figure 14(a). In fact, this phenomenon
is expected and can be understood in the following way. Let us take $\gamma^{+}_{\mathcal{X}}(T)$ as an example.
[Figure 15: accuracy of our method, FFT, Auto, and Histogram (and their positive-only variants) as a function of (a) sampling rate η, (b) ratio of observed segments α, (c) noise ratio β, and (d) number of repetitions TN.]
since it considers the distances between any two positive observations. Our method
is still the most robust one among all. For example, with β = 0.3, our method
achieves accuracy as high as 80%.
Performance w.r.t. Number of Repetitions T N. Figure 15(d) shows the accuracies
as a function of T N. As expected, the accuracies decrease as T N becomes smaller
for all the methods, but our method again significantly outperforms the other ones.
Performance w.r.t. Periodic Behavior. We also study the performance of all the
methods on randomly generated periodic behaviors. Given a period T and fixing the ratio of 1's in a segment SEG as r, we generate SEG by setting each element to 1 with prob-
ability r. Sequences generated in this way will have positive observations scattered
within a period, which will cause big problems for all the methods using Fourier
transform, as evidenced in Figure 16. This is because Fourier transform is very
likely to have high spectral power at short periods if the input values alternate be-
tween 1 and 0 frequently. In Figure 16(a) we set r = 0.4 and show the results w.r.t.
period length T . In Figure 16(b), we fix T = 24 and show the results with varying
r. As we can see, all the other methods fail miserably when the periodic behavior is
randomly generated. In addition, when the ratio of positive observations is low, i.e.
fewer observations, it is more difficult to detect the correct period in general.
Comparison with Lomb-Scargle Method. Lomb-Scargle periodogram (Lomb) [15,
19] was introduced as a variation of Fourier transform to detect periods in un-
evenly sampled data. The method takes the timestamps with observations and their
[Figure 16: accuracy of our method, FFT, Auto, and Histogram (and their positive-only variants) on randomly generated periodic behaviors: (a) varying the period length T with r = 0.4 and (b) varying the positive ratio r with T = 24.]
Table 2. Accuracy of our method, FFT, and Lomb on a smaller dataset (TN = 100)

Parameter   Our Method   FFT    Lomb
η = 0.5     1            0.7    0.09
η = 0.1     1            0.52   0.10
α = 0.5     1            1      0.01
α = 0.1     0.99         0.35   0
corresponding values as input. It does not work for the positive-sample-only case,
because all the input values will be the same and hence no period can be detected. The
reason we do not compare with this method systematically is that the method per-
forms poorly on the binary data and it is very slow. Here, we run it on a smaller
dataset by setting T N = 100. We can see from Table 2 that, when η = 0.5 or α = 0.5,
our method and FFT perform well whereas the accuracy of Lomb is already ap-
proaching 0. As pointed out in [20], Lomb does not work well in bi-modal periodic
signals and sinusoidal signals with non-Gaussian noises, hence not suitable for our
purpose.
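For reference, the comparison can be reproduced in spirit with an off-the-shelf implementation; the sketch below runs scipy.signal.lombscargle on a synthetic, unevenly sampled binary sequence. The helper name, candidate-period grid, and centering step are our own assumptions, not the experimental setup used here.

```python
import numpy as np
from scipy.signal import lombscargle

def lomb_best_period(times, values, candidate_periods):
    """Return the candidate period with the highest Lomb-Scargle power."""
    ang_freqs = 2 * np.pi / np.asarray(candidate_periods, dtype=float)
    # Centering the values avoids the degenerate case where all inputs are equal.
    power = lombscargle(np.asarray(times, dtype=float),
                        np.asarray(values, dtype=float) - np.mean(values),
                        ang_freqs)
    return candidate_periods[int(np.argmax(power))]

# Unevenly sampled observations of an event with a 24-hour period.
rng = np.random.default_rng(0)
t = np.sort(rng.choice(np.arange(2400), size=600, replace=False))
x = (t % 24 < 3).astype(float)              # positive for roughly 3 hours per day
print(lomb_best_period(t, x, np.arange(2, 200)))
```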
In summary, our method is extremely robust to uncertainties, noise, and missing entries in the input data obtained in real-world applications.
Figure 17(a) shows the original bald eagle data in Google Earth. It is an enlarged view of the northeastern United States and the Quebec area in Canada. As shown in Figure 17(b), three reference spots are detected in the areas of New York, the Great Lakes, and Quebec. By applying period detection to each reference spot, we obtain periods of 363, 363, and 364 days, respectively. These periods can be roughly interpreted as one year, a sign of yearly migration in the movement.
Fig. 18 Probability of the bald eagle being at reference spot #1, #2, #3, or an unknown location over days 0–360 of the detected period
Now we check the periodic behaviors mined from the movement. Ideally, we
want to consider three reference spots together because they all show yearly period.
However, we may discover that the periods are not exactly the same for all the reference spots. This is a very practical issue: in real cases, we can hardly obtain exactly the same period for different reference spots. So we should relax the constraint and consider reference spots with similar periods together. If the difference between the periods is within some tolerance threshold, we take their average and set it as the common period. Here, we take the period T as 363 days, and the probability matrix is summarized in Figure 18. Using this probability matrix, we can explain the yearly migration behavior as follows.
“This bald eagle stays in the New York area (i.e., reference spot #1) from December to March. In March, it flies to the Great Lakes area (i.e., reference spot #2) and stays there until the end of May. It flies to the Quebec area (i.e., reference spot #3) in the summer and stays there until late September. Then it flies back to the Great Lakes area, staying there from mid-October to mid-November, and returns to New York in December.”
This real example shows that the periodic behaviors mined from the movement provide an insightful explanation of the movement data.
Fig. 19 [Sampling rate: 20 min] Comparison of period detection methods on a person's movement data: (a) Event (b) Our Method (c) FFT (d) Auto (e) Histogram
(a) Event (b) Our Method (c) FFT (d) Auto (e) Histogram
Fig. 20 [Sampling rate: 1 hour] Comparison of period detection methods on a person’s move-
ment data
(a) Event (b) Our Method (c) FFT (d) Auto (e) Histogram
Fig. 21 Comparison of methods on detecting long period, i.e. one week (168 hours)
                   Sampling rate
Method             20 min    1 hour    2 hour    4 hour
Our Method (pos)   24        24        24        8
Our Method         24        24        24        8
FFT (pos)          9.3       9         8         8
FFT                24        195       372       372
Auto (pos)         24        9         42        8
Auto               24        193       372       780
Histogram          66.33     8         42        48
location is his office, which he only visits during weekdays. Our method correctly detects the 7-day period with the highest periodicity score, and the 1-day period has the second highest score. All the other methods, however, are dominated by the short 1-day period. Note that, in the figures of the other methods, the 1-week point is not even at a peak. This shows the strength of our method in detecting both long and short periods.
Acknowledgments. The work was supported in part by The Boeing Company, NASA NRA-NNH10ZDA001N, NSF IIS-0905215 and IIS-1017362, the U.S. Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 (NS-CTA), and startup funding provided by the Pennsylvania State University. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
References
1. Ahdesmäki, M., Lähdesmäki, H., Gracey, A., Yli-Harja, O.: Robust regression for pe-
riodicity detection in non-uniformly sampled time-course gene expression data. BMC
Bioinformatics 8(1), 233 (2007)
2. Bar-David, S., Bar-David, I., Cross, P.C., Ryan, S.J., Getz, W.M.: Methods for assessing movement path recursion with application to African buffalo in South Africa. Ecology 90 (2009)
3. Berberidis, C., Aref, W.G., Atallah, M.J., Vlahavas, I.P., Elmagarmid, A.K.: Multiple and
partial periodicity mining in time series databases. In: Proc. 2002 European Conference
on Artificial Intelligence, ECAI 2002 (2002)
4. Cao, H., Mamoulis, N., Cheung, D.W.: Discovery of periodic patterns in spatiotempo-
ral sequences. IEEE Transactions on Knowledge and Data Engineering 19(4), 453–467
(2007)
5. Elfeky, M.G., Aref, W.G., Elmagarmid, A.K.: Periodicity detection in time series
databases. IEEE Trans. Knowl. Data Eng. 17(7) (2005)
6. Elfeky, M.G., Aref, W.G., Elmagarmid, A.K.: Warp: Time warping for periodicity detec-
tion. In: Proc. 2005 Int. Conf. Data Mining, ICDM 2005 (2005)
7. Glynn, E.F., Chen, J., Mushegian, A.R.: Detecting periodic patterns in unevenly spaced
gene expression time series using lomb-scargle periodograms. Bioinformatics (2005)
8. Han, J., Dong, G., Yin, Y.: Efficient mining of partial periodic patterns in time series
database. In: Proc. 1999 Int. Conf. Data Engineering (ICDE 1999), Sydney, Australia,
pp. 106–115 (April 1999)
9. Han, J., Gong, W., Yin, Y.: Mining segment-wise periodic patterns in time-related
databases. In: Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD
1998), New York City, NY, pp. 214–218 (August 1998)
10. Jeung, H., Liu, Q., Shen, H.T., Zhou, X.: A hybrid prediction model for moving objects.
In: Proc. 2008 Int. Conf. Data Engineering, ICDE 2008 (2008)
11. Junier, I., Herisson, J., Kepes, F.: Periodic pattern detection in sparse boolean sequences.
Algorithms for Molecular Biology (2010)
12. Li, Z., Ding, B., Han, J., Kays, R., Nye, P.: Mining periodic behaviors for moving objects.
In: Proc. 2010 ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD
2010), Washington D.C. (July 2010)
13. Liang, K.-C., Wang, X., Li, T.-H.: Robust discovery of periodically expressed genes
using the laplace periodogram. BMC Bioinformatics 10(1), 15 (2009)
14. Liao, L., Fox, D., Kautz, H.: Location-based activity recognition using relational markov
networks. In: Proc. 2005 Int. Joint Conf. on Artificial Intelligence (IJCAI 2005), pp.
773–778 (2005)
15. Lomb, N.R.: Least-squares frequency analysis of unequally spaced data. Astrophysics
and Space Science (1976)
16. Ma, S., Hellerstein, J.L.: Mining partially periodic event patterns with unknown periods.
In: Proc. 2001 Int. Conf. Data Engineering (ICDE 2001), Heidelberg, Germany, pp. 205–
214 (April 2001)
17. Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., Cheung, D.: Mining,
indexing, and querying historical spatiotemporal data. In: Proc. 2004 ACM SIGKDD
Int. Conf. Knowledge Discovery in Databases (KDD 2004), Seattle, WA, pp. 236–245
(August 2004)
18. Priestley, M.B.: Spectral Analysis and Time Series. Academic Press, London (1981)
19. Scargle, J.D.: Studies in astronomical time series analysis. ii - statistical aspects of spec-
tral analysis of unevenly spaced data. Astrophysical Journal (1982)
20. Schimmel, M.: Emphasizing difficulties in the detection of rhythms with lomb-scargle
periodograms. Biological Rhythm Research (2001)
21. Vlachos, M., Yu, P.S., Castelli, V.: On periodicity detection and structural periodic simi-
larity. In: Proc. 2005 SIAM Int. Conf. on Data Mining, SDM 2005 (2005)
22. Wang, C., Parthasarathy, S.: Summarizing itemset patterns using probabilistic models. In:
Proc. 2006 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD 2006),
pp. 730–735. ACM (2006)
23. Wang, W., Yang, J., Yu, P.S.: Meta-patterns: Revealing hidden periodic patterns. In: Proc.
2001 Int. Conf. Data Mining (ICDM 2001), San Jose, CA (November 2001)
24. Worton, B.J.: Kernel methods for estimating the utilization distribution in home-range
studies. Ecology 70 (1989)
25. Xia, Y., Tu, Y., Atallah, M., Prabhakar, S.: Reducing data redundancy in location-based
services. In: GeoSensor (2006)
26. Yan, X., Cheng, H., Han, J., Xin, D.: Summarizing itemset patterns: A profile-based
approach. In: Proc. 2005 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases
(KDD 2005), Chicago, IL, pp. 314–323 (August 2005)
27. Yang, J., Wang, W., Yu, P.S.: Mining asynchronous periodic patterns in time series data.
In: Proc. 2000 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD
2000), Boston, MA, pp. 275–279 (August 2000)
28. Yang, J., Wang, W., Yu, P.S.: Infominer: mining surprising periodic patterns. In: Proc.
2001 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD 2001), San
Francisco, CA, pp. 395–400 (August 2001)
29. Yang, J., Wang, W., Yu, P.S.: Infominer+: Mining partial periodic patterns with gap
penalties. In: Proc. 2002 Int. Conf. Data Mining (ICDM 2002), Maebashi, Japan (De-
cember 2002)
30. Zhang, M., Kao, B., Cheung, D.W.-L., Yip, K.Y.: Mining periodic patterns with gap
requirement from sequences. In: Proc. 2005 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD 2005), pp. 623–633 (2005)
31. Zheng, V.W., Zheng, Y., Xie, X., Yang, Q.: Collaborative location and activity recom-
mendations with gps history data. In: Proceedings of the 19th International Conference
on World Wide Web (WWW 2010), pp. 1029–1038. ACM (2010)
Spatio-temporal Data Mining for Climate Data:
Advances, Challenges, and Opportunities
James H. Faghmous · Vipin Kumar
Department of Computer Science and Engineering, The University of Minnesota, Twin Cities
e-mail: {jfagh,kumar}@cs.umn.edu
1 Introduction
Our world is experiencing simultaneous changes in population, industrialization, and climate amongst other planetary-scale changes. These contemporaneous transformations, known as global change, raise pressing questions of significant
scientific and societal interest [39]. For example, how will the continued growth in
global population and persisting tropical deforestation, or global climate change, af-
fect our ability to access food and water? Coincidentally, these questions are emerg-
ing at a time when data, specifically spatio-temporal climate data, are more available
than ever before. In fact, climate science promises to be one of the largest sources of
data for data-driven research. A recent lower-bound estimate puts the size of climate data in 2010 at 10 petabytes (1 PB = 1,000 TB). This number is projected to grow exponentially to about 350 petabytes by 2030 [69].
The last decades have seen tremendous growth in data-driven learning algorithms and their broad range of applications [46]. This rapid growth was fueled by the Internet's democratization of data production, access, and sharing. Merely observing
these events unfold – the growth of climate data, a wide-range of challenging real-
world research questions, and the emergence of data mining and machine learning
in virtually every domain where data are reasonably available – one may assume
that data mining is ripe to make significant contributions to these challenges.
Unfortunately, this has not been the case – at least not at the scale we have come
to expect from the success of data mining in other domains, such as biology and
e-commerce. At a high level, this lack of progress is due to the inherent nature of
climate data as well as the types of research questions climate science attempts to
address.
Although the size of climate data is a serious challenge, there are major research
efforts to address the variety, velocity, and volume of climate data (commonly re-
ferred to as Big Data’s 3Vs). Research efforts to address the nature of climate data,
however, are severely lagging the rate of data growth. For instance, climate data tend
to be predominantly spatio-temporal, noisy, and heterogeneous. The spatio-temporal
nature of climate data emerges in the form of auto- and cross-correlation between
input variables. Therefore, existing learning methods that make implicit or explicit
independence assumptions about the input data will have limited applicability to the
climate domain.
It is also important to study the types of research questions that climate science
brings forth. Climate science is the study of the spatial and temporal variations of
the atmosphere-hydrosphere-land surface system over prolonged time periods. As a
result, climate-related questions are inexorably linked to space and time. This means
that climate scientists are interested in solutions that explain the evolution of phe-
nomena in space and time. Furthermore, the majority of climate phenomena occur
only within a specific region and time period. For example, hurricanes only take place in certain geographic regions and during a limited range of months. However, due
to the large datasets and the exponential number of space-time subsets within the
data, we must reduce the complexity of problems by finding significant space-time
subsets.
The combination of climate data's unique characteristics and associated research questions requires the emergence of a new generation of space-time algorithms.
Fortunately, climate data have intrinsic space and time information that, if in-
sightfully leveraged, can provide a powerful computational framework to address
many of the challenges listed above while significantly reducing the complexity of
Fig. 1 Climate science has numerous types of data, each with its own challenges
In-situ records of climate data date back to the mid- to late 1600s [69]. Today,
observational data are gathered from a plethora of in-situ instruments such as ships,
buoys, and weather balloons. Such data tend to be sparse measurements in space and
time since they are only available when measurements are gathered and where the
instrument is physically located. For example, a weather balloon records frequent
measurement only for a limited time duration and at its physical location. Addition-
ally, raw measurements can be noisy due to measurement error or other phenomena
temporarily impacting measurement (e.g. strong winds affecting temperature mea-
surements). A final caveat is such data are dependent on the geopolitical state of
where the instruments are deployed. For instance, the quality of sea surface tem-
peratures along the Atlantic ocean decreased during World War II due to reduced
reconnaissance.
Remote sensed satellite data became available in the late 1960s and are a great
source of relatively high quality data for large portions of the earth. Although they
are considered one of the best sources of global observational data, remote sensed
satellite data have notable limitations. First, satellite data are subject to measurement
noise and missing data due to obstructions from clouds or changes in orbit. Second,
due to their short life-span (∼ a decade) and evolving technology, satellite data can
be heterogeneous.
Currently, the biggest contributors to climate data volume are climate model sim-
ulations. Climate models are used to simulate future climate change under various
scenarios as well as reconstructing past climate (hindcasts). Such models run solely
based on the thermodynamics and physics that govern the atmosphere-hydrosphere-
land surface system, with observational data used for initialization. While these data
tend to be spatio-temporally continuous, they are highly variable due to the output’s
dependence on parameterization and initial conditions. Furthermore, all model outputs come with inherent uncertainties, given that not all the physics are resolved within models and that our understanding of many physical processes is incomplete. Therefore,
the climate science community often relies on multi-model ensembles where numer-
ous model outputs using various parameters and initial conditions are averaged to
mitigate the uncertainty any single model output might have. For instance, the No-
bel Peace Prize winning Intergovernmental Panel on Climate Change (IPCC) used
multi-model ensembles to present its assessment of future climate change [86]. Fi-
nally, there still exist several theoretical and computational limitations that cause
climate models to poorly simulate certain phenomena, such as precipitation.
To address the noisy and heterogeneous quality of in-situ and satellite observa-
tions, a new generation of simulation-observation hybrid data (or reanalyses) have
emerged. Reanalysis datasets assimilate remote and in-situ sensor measurements through a numerical climate model. Reanalyses are generated through an unchanging ("frozen") data assimilation scheme and model that take available observations from in-situ and remote sensed data every 6-12 hours over a pre-defined period being analyzed (e.g., 1948–2013)1. This unchanging framework provides a dynamically consistent estimate of the climate state at each time step. As a result, reanalysis datasets tend to be smoother than the raw observational records and have extended spatio-temporal coverage. While reanalyses are considered the best available proxy for global observations, their quality is still dependent on that of the observations, the (assimilation) model used, and the processing methods. More domain-specific quality issues for certain reanalysis data can be found at http://www.ecmwf.int/research/era/do/get/index/QualityIssues.
Finally, researchers have been reconstructing historical data using paleoclimatic
proxy records such as trees, dunes, shells, oxygen isotope content and other
sediments2 . Such data are used to study climate variability at the centennial and mil-
lennial scales. Given the relatively short record of observational data, paleoclimate
1 http://climatedataguide.ucar.edu/reanalysis/
atmospheric-reanalysis-overview-comparison-tables
2 http://www.ncdc.noaa.gov/paleoclimate-data
Climate data tend to also be highly variable. Sources of variability include: (i)
natural variability, where wide-range fluctuations within a single field exist between
different locations on the globe, as well as at the same location across time; (ii)
variability from measurement errors; (iii) variability from model parameterization;
and (iv) variability from our limited understanding of how the world functions (i.e.
model representation). Even if one accounts for such variability, it is not clear if
these biases are additive and there are limited approaches to de-convolute such bi-
ases a posteriori.
We refer to data diversity as heterogeneity in space and time. That is, data are available at various spatio-temporal resolutions, from different sources, and for different uses. Oftentimes, a researcher must rely on multiple sources of information, and adequately integrating such diverse data remains a challenge. For example, one may have access to three different sea surface temperature datasets: one reanalysis dataset at a 2.5◦ resolution, another reanalysis dataset at 0.75◦ resolution, and a satellite dataset at 0.25◦ resolution. Given that each dataset has its own biases, it is unclear what effect fusing these datasets would have on data mining tasks and the knowledge extracted therein.
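As a small illustration of one step of this integration problem, the sketch below re-grids a coarse (2.5°) field onto a finer (0.25°) grid with bilinear interpolation; the synthetic field and grid definitions are assumptions for the example, and re-gridding alone of course does not remove the individual datasets' biases.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Synthetic, zonally symmetric "SST" field on a coarse 2.5-degree grid.
coarse_lat = np.arange(-90, 90.1, 2.5)
coarse_lon = np.arange(0, 360, 2.5)
coarse_sst = np.tile(25 * np.cos(np.deg2rad(coarse_lat))[:, None], (1, coarse_lon.size))

# Bilinear interpolation onto a 0.25-degree grid (extrapolating at the edges).
interp = RegularGridInterpolator((coarse_lat, coarse_lon), coarse_sst,
                                 method="linear", bounds_error=False, fill_value=None)
fine_lat = np.arange(-89.875, 90, 0.25)
fine_lon = np.arange(0.125, 360, 0.25)
pts = np.array(np.meshgrid(fine_lat, fine_lon, indexing="ij")).reshape(2, -1).T
fine_sst = interp(pts).reshape(fine_lat.size, fine_lon.size)
print(fine_sst.shape)   # (720, 1440)
```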
Additionally, climate phenomena operate and interact on multiple spatio-temporal
scales. For example, changes in global atmospheric circulation patterns may have
significant impacts on local infrastructures that cannot be unearthed if studying cli-
mate only at a global scale (e.g., "will global warming cause a rainier winter in California in the year 2020?"). Understanding such multi-scale dependencies and inter-
actions is of significant societal interest as there is a need to provide meaningful risk
assessments about global climate’s impact on local communities.
Finally, many climate phenomena have effects that are delayed in space and time.
Although “long-range” relationships do exist in traditional data mining applications,
such as a purchase occurring due to a distant acquaintance recommending a product,
they are far more complex in a climate setting. Not only can relationships in climate datasets be long-range in both space and time as well as multivariate, but there are also exponentially many space-time-variable subsets where relationships may exist. As a
result, identifying significant spatio-temporal patterns depends on knowing what to
search for as much as where to search for such a pattern (i.e. which spatio-temporal
resolution).
In the next section, we will provide the reader with a concise review of the STDM
literature pertaining to climate data.
Fig. 2 A large amount of climate data is at a global spatial scale (∼250 km); however, many climate-related questions are at the regional (∼50 km) or local (km or sub-km) scale. This multi-scale discrepancy is a significant data mining challenge.
characteristics. Given the large size of climate data, early priorities were focused on
data exploration and collaborative analysis.
Mesrobian et al. [61] introduced CONQUEST, a parallel query processing sys-
tem for exploratory research using geoscience data. The tool allowed scientists to
formulate and mine queries in large datasets. This is one of the first works to track
distortions in a continuous field. One application demonstrated in their work was
the tracking of cyclones as local minima within a closed contour sea level pressure
(SLP) field [61, 83]. As an extension to CONQUEST, Stolorz and Dean [82] in-
troduced Quakefinder, an automatic application that detects and measures tectonic
activity from remote sensing imagery. Mesrobian et al. [62] introduced Oasis, an
extensible exploratory data mining tool for geophysical data. A similar application
is the algorithm development and mining framework (ADaM) [73] which was de-
veloped to mine geophysical events in spatio-temporal data. Finally, Baldocchi et al.
[4] introduced FLUXNET, a collaborative research tool to study the spatial and tem-
poral variability of carbon dioxide, water vapor, and energy flux densities.
The early emphasis of all these works was on scalable query matching as well as abstracting the data and their formats away from the researcher, so that one could focus more on exploratory research rather than data management. However, large-scale collaborative research
efforts are costly and require extensive infrastructures and management, effectively
increasing the risk associated with such endeavors. Furthermore, we often embark
on exploratory research without prior knowledge of the patterns of interest making
explicit query searches non-trivial. Finally, such exploratory efforts should capital-
ize on the recent advances in both spatial and temporal subsequence pattern mining
(e.g. [36, 72]).
Fig. 3 An ocean eddy moving in time as detected in ocean data. One of the challenges of
STDM is to identify significant patterns in continuous spatio-temporal climate data.
Fig. 4 An example of a spatio-temporal event (a) and anomaly (b). The time-series denote changes in vegetation over time. (a) A land-cover change event, seen in the decrease of vegetation due to agricultural expansion in 2003. (b) An abrupt drop in vegetation due to a forest fire in 2006; the vegetation gradually returned after the fire.
Fig. 5 Top: The NINO1+2 time-series which was constructed by averaging the sea surface
temperatures (SST) of the box highlighted in the map below. Bottom: the linear correlation
between the NINO1+2 index and global land surface temperature anomalies.
significance tests that would account for the limited number of reliable observations
within certain datasets.
major basins between 1970-2005 and concluded that the upward trend in Atlantic
TC seasonal counts cannot be attributed to the increased SST. This was because not
all basins that had an increase in SST, had a corresponding increase in TC counts. In
another study, Chen et al. [18] used the sea surface temperatures and found differ-
ent oceanic regions correlate with fire activity in different parts of Amazon. There
are numerous other studies like the ones mentioned above, however detecting rela-
tionships in large climate datasets remains extremely challenging. For example, the
data used in [18] only spanned 10 years (N=10). It is also impossible to isolate all
confounding factors in global climate studies since many conditions can affect any
given phenomenon.
One other limitation of linear correlation is its inability to capture nonlinear re-
lationships. While there are studies that use nonlinear measures such as mutual in-
formation (e.g., [49]), climate scientists use composite analysis as another way to quantify how well one variable explains another. Figure 6 shows an example of how
composites are constructed. For a given anomaly index, in this case NINO3.4 index,
we can identify extreme years as those that significantly deviate from the long-term
mean (e.g. less/greater than one or two standard deviations). The time-series in Fig-
ure 6’s upper panel highlights the extreme positive (red squares) and negative (blue
squares) years within the NINO3.4 index from 1979 to 2010. Using the extreme pos-
itive and negative years, one can comment on how a variable responds to the extreme
phases of a variable (in this case the NINO3.4 index). Take the June-October mean
vertical wind shear over the Atlantic basin (Figure 6 bottom panel). The composite
shows the difference between the mean June-October vertical wind shear during the
5 negative extreme years and the 5 positive extreme years. The bottom panel sug-
gests that extreme negative years in NINO3.4 tend to have low vertical wind shear
along the tropical Atlantic. One of the advantages of using composite analysis is
that it does not make specific assumptions about the relationship between the two variables; it could be linear or non-linear. One must also use caution when analyzing
composites. While we can test the significance in the difference in means between
the positive and negative years, traditional significance tests assume independent
observations which might not be the case for such data. Furthermore, the sample
size of extreme events might be too small to be significant. For example, Kim and
Han [54] constructed composites of Atlantic hurricane tracks based on the warm-
ing patterns in the Pacific ocean. One phase of their index had a sample size of 5
years (out of 39 years). To test the significance of the composite that summarized
hurricane tracks during those years, the authors used a bootstrapping technique [31]
to determine how significant the mean of the small sample was relative to random noise.
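The basic mechanics of a composite can be sketched in a few lines. The one-standard-deviation threshold, the synthetic index, and the field dimensions below are illustrative assumptions, not the data behind Figure 6.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1979, 2011)
index = rng.standard_normal(years.size)            # stand-in for a NINO3.4-like index
field = rng.standard_normal((years.size, 30, 60))  # stand-in for yearly wind-shear maps

anom = index - index.mean()
pos_years = anom > anom.std()                      # extreme positive phase
neg_years = anom < -anom.std()                     # extreme negative phase

# Composite: difference between the mean fields of the two extreme phases.
composite = field[neg_years].mean(axis=0) - field[pos_years].mean(axis=0)
print(pos_years.sum(), neg_years.sum(), composite.shape)
```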
Finally, given that one searches for potential relationships (linear or non-linear)
between a large number of observations, the likelihood of observing a strong re-
lationship by random chance is higher than normal (known as multiple hypothesis
testing or field significance). Figure 7 shows an example of the same dataset (geopo-
tential height) correlated with a real index (left) and random noise (right). The figure
shows how easily a random pattern can yield misleadingly high correlations with
smooth spatial patterns.
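The effect is easy to reproduce. The sketch below correlates pure noise with every cell of a random field and finds that roughly the nominal 5% of cells still cross a standard significance threshold; the grid size and the critical value for n = 30 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, nlat, nlon = 30, 40, 80
field = rng.standard_normal((n_years, nlat, nlon))
random_index = rng.standard_normal(n_years)

# Pearson correlation between the random index and each grid cell's time series.
z = (random_index - random_index.mean()) / random_index.std()
f = (field - field.mean(axis=0)) / field.std(axis=0)
corr = np.tensordot(z, f, axes=(0, 0)) / n_years

# |r| > ~0.36 is nominally significant at the 5% level for n = 30 (two-sided),
# yet about 5% of the 3200 cells exceed it purely by chance.
print((np.abs(corr) > 0.36).mean())
```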
Fig. 6 Top: the June-October NINO3.4 index (SST anomaly, in °C) from 1979 to 2010, with the extreme positive and negative years highlighted. Bottom: the composite June-October vertical wind shear (m/s) over the Atlantic for the extreme phases.
Fig. 7 Geopotential height correlated with the Southern Oscillation Index (SOI; left) and random noise (right). This is an example of how high and spatially coherent correlations can be the result of random chance.
not well resolved in physics-based models, such as precipitation. With the growth
of statistical machine learning there have been numerous works on predictive mod-
eling. In this section, we will mainly focus on some of the works that explicitly
addressed the spatio-temporal nature of the data.
Coe and Stern [23] used a first- and second-order Markov chain to model precip-
itation. However, scarce observations at the time almost certainly limited the generalization of such an approach. Cox and Isham [24] proposed a spatio-temporal model
of rainfall where storm cells obey a Poisson process in space and time with each
cell moving at random velocity and for a random duration. Additional reviews of
precipitation models can be found in [98, 78, 79]. Huang and Cressie [50] improved
on traditional spatial prediction models of water content in snow cover (also known
as snow water equivalent) using a Kalman filter-based spatio-temporal model. The
model effectively incorporated snow content from previous dates to make accu-
rate snow water equivalent predictions for locations where such data was missing.
Cressie et al. [26] designed a spatio-temporal prediction model to model precipita-
tion over North America. Their work employed random sets to leverage data from
multiple model realizations (i.e. multiple initial conditions, parameter settings etc.)
of a North American regional climate model.
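To illustrate the simplest of these ideas, the sketch below simulates daily rain occurrence with a first-order (two-state) Markov chain; the transition probabilities are made-up values, not parameters fitted in [23].

```python
import numpy as np

P_WET_GIVEN_DRY = 0.2   # P(rain tomorrow | dry today)
P_WET_GIVEN_WET = 0.6   # P(rain tomorrow | wet today)

def simulate_rain_occurrence(n_days, seed=None):
    """Simulate a wet/dry sequence from a first-order Markov chain."""
    rng = np.random.default_rng(seed)
    state, days = 0, np.empty(n_days, dtype=int)    # start from a dry day
    for d in range(n_days):
        p_wet = P_WET_GIVEN_WET if state == 1 else P_WET_GIVEN_DRY
        state = int(rng.random() < p_wet)
        days[d] = state
    return days

rain = simulate_rain_occurrence(365, seed=0)
print("wet-day fraction:", rain.mean())
```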
Van Leeuwen et al. [92] built a logistic regression-based model trained on land
surface temperatures to detect changes in tropical forest cover. Karpatne et al. [51]
extended the work in [92] by addressing the heterogeneous nature of land cover data.
Instead of training a single global model of land cover change based on a single
variable (e.g. land surface temperature), they built multiple models based on land
cover type to improve single-variable forest cover estimation models. A related ap-
plication within the field of land cover change is autonomously identifying the dif-
ferent types of land-cover (urban, grass, corn, etc.) based on the pixel intensity of
a remote sensed image. Traditional remote sensing techniques train a classifier to
classify each pixel in an image to belong to certain land-cover class [85]. However,
each pixel is classified independently of every other pixel without any regard for the
spatio-temporal context. This causes highly variable class labels for the same pixel
across time. Mithal et al. [66] improve the classification accuracy of existing models
by considering the temporal evolution of the class labels of each pixel.
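As a toy illustration of why temporal context helps (this is not the method of Mithal et al. [66]), a sliding majority vote over each pixel's label time series already removes isolated, physically implausible label flips:

```python
import numpy as np

def smooth_labels(label_series, window=3):
    """Sliding majority vote over one pixel's label time series (toy example)."""
    half = window // 2
    padded = np.pad(label_series, half, mode="edge")
    smoothed = np.empty_like(label_series)
    for t in range(label_series.size):
        win = padded[t:t + window]
        smoothed[t] = np.bincount(win).argmax()   # majority class in the window
    return smoothed

# Class labels for one pixel over time, with two isolated (implausible) flips.
noisy = np.array([1, 1, 2, 1, 1, 1, 3, 3, 1, 3, 3, 3])
print(smooth_labels(noisy))   # the isolated flips at t = 2 and t = 8 are voted away
```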
One of the major challenges in predictive modeling is that climate phenomena
tend to have spatial and temporal lags where distant events in space and time affect
seemingly unrelated phenomena far away (physically and temporally). Therefore
identifying meaningful predictors in the proper spatio-temporal range is difficult.
It is also important to note that certain extreme events that are of interest to the
community (e.g. hurricanes) are so rare that the number of observations is much
smaller than the data’s dimensionality (n << D). In this case, a minimum number
of predictors must be used to avoid overfitting and a poor generalized performance.
For instance, Chatterjee et al. [14] used a sparse regularized regression method to
identify the interplay between oceanic and land variables in several regions around
the globe (e.g. how does warming in the South Atlantic affect rainfall in Brazil?).
Their use of parsimony significantly improved the model’s performance. Finally,
model interpretability is crucial for spatio-temporal predictive modeling because the
Fig. 8 Gridded spatio-temporal climate data can be analyzed in a network format. Each grid
location is characterized by a time series. A network can be constructed between each location
with an edge weight being the relationship between the time-series of each location.
data. Kawale et al. [53] proposed a bootstrapping method to test the significance of
such long-range spatio-temporal patterns.
Inspired by complex networks, [88] were the first to propose the notion of a cli-
mate network and analyze its properties and how they relate to physical phenomena.
For example, several studies have found the network structure to correlate with the
dominant large-scale signals of global climate such as El-Niño [30, 102, 45]. Sim-
ilarly, Tsonis et al. [89] showed that some climate phenomena and datasets obey
a small-world network property [94]. Furthermore, several studies found distinct
structural differences between the networks around tropical and extra-tropical re-
gions [89, 29]. Berezin et al. [7] analyzed the evolution and stability of such net-
works over time and found that networks along the tropics tend to be more stable.
Other studies have linked regions with high in-bound edges, known as supernodes,
to be associated with major large-scale climate phenomena such as the North At-
lantic Oscillation [89, 90].
Others have built networks using non-gridded discrete climate data. Elsner et al.
[32] used seasonal hurricane time-series to construct a network to study interannual
hurricane count variability. Fogarty et al. [38] built a network to analyze coastal
locations (nodes) and their associated hurricane activity (edges) and found distinct
connectivity difference between active and inactive regions. Furthermore, the au-
thors connected various network topographies to phases of the El-Niño Southern
Oscillation.
While network-based methods within climate are increasingly popular, these ef-
forts are relatively young, and several questions remain, such as how to sparsify fully
connected networks, the notion of multi-variate climate networks, and the distinc-
tion between statistical and physical connectivity [70].
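As a concrete, highly simplified illustration of the construction in Figure 8, the sketch below builds a correlation-based network from a synthetic gridded dataset. The synthetic data, the use of plain Pearson correlation, and the 0.5 sparsification threshold are all assumptions, not the pipeline of any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, nlat, nlon = 120, 10, 20              # e.g. ten years of monthly data
data = rng.standard_normal((n_times, nlat, nlon))

series = data.reshape(n_times, -1)             # one time series per grid cell (node)
z = (series - series.mean(axis=0)) / series.std(axis=0)
corr = (z.T @ z) / n_times                     # node-by-node correlation matrix

adjacency = np.abs(corr) > 0.5                 # sparsify: keep only strong links
np.fill_diagonal(adjacency, False)
degree = adjacency.sum(axis=1)                 # e.g. to look for "supernodes"
print(corr.shape, int(degree.max()))
```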
We will spend the remainder of the chapter demonstrating a case study of spatio-
temporal pattern mining with an autonomous ocean eddy monitoring application.
This is because ocean eddies are a central part of ocean dynamics and impact marine
and terrestrial ecosystems. Furthermore, identifying and tracking eddies form a new
generation of data mining challenges where we are interested in tracking uncertain
features in a continuous field.
Fig. 9 Image from the NASA TERRA satellite showing an anti-cyclonic (counter-clockwise
in the Southern Hemisphere) eddy that likely peeled off from the Agulhas Current, which
flows along the southeastern coast of Africa and around the tip of South Africa. This eddy
(roughly 200 km wide) is an example of eddies transporting warm, salty water from the
Indian Ocean to the South Atlantic. We are able to see the eddy, which is submerged under the surface, because of the enhanced phytoplankton activity (reflected in the bright blue color).
This anti-cyclonic eddy would cause a depression in subsurface density surfaces in sea surface
height (SSH) data. Image courtesy of the NASA Earth Observatory. Best seen in color.
waves, the rotation of nonlinear eddies transports momentum, mass, heat, nutri-
ents, as well as salt and other seawater chemical elements, effectively impacting the
ocean’s circulation, large-scale water distribution, and biology. Therefore, under-
standing eddy variability and change over time is of critical importance for projected
marine biodiversity as well as atmospheric and land phenomena.
Eddies are ubiquitous in both space and time, yet autonomously identifying them
is challenging due to the fact that they are not objects moving within the environ-
ment, rather they are a distortion (rotation) evolving through a continuous field (see
Figure 10). To identify and track such features, climate scientists have resorted to
mining the spatial or temporal signature eddies have on a variety of ocean variables
such as sea surface temperatures (SST) and ocean color. The problem is accentuated
further given the lack of base-line data makes any learning algorithms unsupervised.
While there exists extensive literature in traditional object tracking algorithms (e.g.
see Yilmaz et al. [103] for a review), a comprehensive body of work tracking user-
defined features in continuous climate data is still lacking despite the exponential
increase in the volume of such data [69].
Fig. 10 An example of a cyclonic eddy traveling through a continuous sea surface height
(SSH) field (from left to right). Unlike common feature mining and tracking tasks, features
in physical sciences are often not self-defined with unambiguous contours and properties.
Instead, they tend to be dynamic user-defined features. In the case of eddies, they manifest
as a distortion traveling in space and time through the continuous field. A cyclonic eddy
manifests as a negative SSH anomaly.
Our understanding of ocean eddy dynamics has grown significantly with the ad-
vent of satellite altimetery. Prior to then, oceanographers relied primarily on case
studies using drifting floats in the open ocean to collect detailed information about
individual eddies such as rotational speeds, amplitude, and salinity profiles. With
the increased accessibility to satellite data, ocean surface temperatures and color
have been used to identify ocean eddies based on their signatures on such fields
[71, 37, 28]. While these fields are impacted by eddy activity, there are additional
phenomena, such as hurricanes or near-surface winds, that affect them as well; effec-
tively complicating eddy identification in such data fields. More recently, sea surface
height (SSH) observations from satellite radar altimeters have emerged as a better-
suited alternative for studying eddy dynamics on a global scale given SSH’s intimate
connection to ocean eddy activity. Eddies are generally classified by their rotational
direction. Cyclonic eddies rotate counter-clockwise (in the Northern Hemisphere),
while anti-cyclonic eddies rotate clockwise. As a result, cyclonic eddies cause a
decrease in SSH, while anti-cyclonic eddies cause an increase in SSH. Such im-
pact allows us to identify ocean eddies in SSH satellite data, where cyclonic eddies
manifest as closed contoured negative SSH anomalies and anti-cyclonic eddies as
positive SSH anomalies. In Figure 11, anti-cyclonic eddies can be seen in patches
of positive (dark red) SSH anomalies, while cyclonic eddies are reflected in closed
contoured negative (dark blue) SSH anomalies.
In section 2.2, we discussed some general challenges that arise when mining cli-
mate data. Here we briefly review considerations one must take when specifically
identifying and tracking eddies on a global scale. First, the large-scale natural variability in global SSH data (Figure 12) complicates the task of finding a universal set of parameters to analyze the data. For example, the mean and standard deviation of the data yield very little insight due to the high spatial and temporal natural variability. Second, unlike traditional data mining where objects are relatively well-defined, SSH data are prone to noise and uncertainty, making it difficult to distinguish meaningful eddy patterns from spurious events and measurement errors. Third, al-
though eddies generally have an ellipse-like shape, the shape’s manifestation in grid-
ded SSH data differs based on latitude. This is because of the stretch deformation
of projecting spherical coordinates into a two-dimensional plane. As a result, one
Fig. 11 Global sea surface height (SSH) anomaly for the week of October 10 1997 from the
Version 3 dataset of the Archiving, Validation, and Interpretation of Satellite Oceanographic
(AVISO) dataset. Eddies can be observed globally as closed contoured negative (dark blue;
for cyclonic) or positive (dark red; for anti-cyclonic) anomalies. Best seen in color.
Fig. 12 Global unfiltered SSH anomalies. The data is characterized with high spatial and tem-
poral variability, where values vary widely from one location to the next, as well as across
time for the same location. Therefore traditional measures such as mean and standard devia-
tions yield little insight in global patterns.
cannot restrict eddies by shape (e.g., circle, ellipse, etc.). Fourth, eddy heights and
sizes vary by latitude, which makes having a global “acceptable” eddy size un-
feasible [40]. Therefore, applying a single global threshold would wipe out many
relevant patterns in the presence of spatial heterogeneity. A more subtle challenge
is that eddies can manifest themselves as local minima (maxima) embedded in a
large-scale background of negative (positive) anomalies [15] making numerous fea-
tures unnoticeable. False positives are also an issue, as other phenomena such as
linear Rossby waves or fronts can masquerade as eddy-like features in SSH data
[59, 17]. Finally, given the global and ubiquitous nature of eddies, any learning
must be unsupervised. One way to verify the performance of eddy identification
and tracking algorithms is to use field-study data, where floats and ships physically sit on top of eddies. However, such datasets would only provide anecdotal evidence.
Despite these non-trivial challenges, a more vexing challenge is that the majority
of autonomous eddy identification schemes take the four-dimensional feature repre-
sentation of eddies (latitude, longitude, time, and value where “value” depends on
the field) and analyze that data orthogonally in either space or time only, effectively
introducing additional uncertainty.
Fig. 13 Two different but complementary views of eddies’ effect on SSH anomalies. Top: A
three-dimensional view of a cyclonic eddy in the SSH field. Bottom: an SSH time-series at a single location. In both cases, the presence of an eddy is indicated through a sustained SSH
depression.
Figure 13 shows two different yet complementary views of eddies and SSH. On
the top panel are two anti-cyclonic eddies in the SSH field. The bottom panel shows
the temporal profile of a single pixel in the SSH dataset. When taken alone, each method has notable limitations. In the spatial view, thresholding the data top-down would force the application to return artificially larger regions than the eddy actually occupies (since it favors the largest region possible). Furthermore, such a threshold-
ing approach is known to merge eddies in close proximity [16]. A temporal view
would allow us to identify eddy-like behavior by searching for segments of gradual
decrease and increase denoted by the green and red lines [34]. However, a tempo-
ral only approach is not enough as multiple pixels must exhibit similar temporal
behavior in space and time otherwise the approach would be vulnerable to noise
and spurious signals. Our method attempts to combine both approaches to address
each method’s limitations. We begin by discussing each approach in more detail.
Fig. 14 Eddy-like features are ubiquitous in global SSH data. The challenge is in identifying and tracking such features within a continuous SSH field.
Chelton et al. [15] were the first to track eddies globally using a unified set of parameters. They also introduced the notion of eddy non-linearity (the ratio of rotational and translational speeds) to differentiate between non-linear eddies and linear
Rossby waves. In the most comprehensive SSH-based eddy tracking study to date,
Chelton et al. [16] identified eddies globally as closed contoured smoothed SSH
anomalies using a thresholding and nearest neighbor search approach. A similar al-
gorithm was presented in [35] with a few modifications over [16] to improve the
runtime complexity and accuracy of the threshold-based method.
At a high level, threshold-based algorithms extract candidate connected com-
ponents from SSH data by gradually thresholding the data and finding connected
component features at each threshold. For each connected component, we applied
six criteria to determine if a feature is an eddy candidate: (i) A minimum eddy size
of 9 pixels; (ii) a maximum eddy size of 1000 pixels; (iii) a minimum amplitude of
1 cm; (iv) the connected component must contain at least a minimum/maximum; (v)
the distance between any two pixels along the contour of the feature must be less
than a fixed maximum; and (vi) each connected component must have a predefined
convex hull ratio as a function of the latitude of the eddy. The first five conditions are
similar to those proposed by [16]. The convexity criterion is to ensure that we select
the minimal set of points that can form a coherent eddy, and thus avoid mistakenly
grouping multiple eddies together. Once the eddies are detected, the pixels repre-
senting the eddy are removed from consideration for the next threshold level. Doing
so ensures that the algorithm does not over-count eddies. Removing the pixels will
not compromise the accuracy of the algorithm, given that the first instance at which an eddy is detected will most likely be at its largest size as a function of the threshold.
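The following is a simplified sketch of this thresholding step for anti-cyclonic (positive-anomaly) features, not the full EddyScan implementation: it applies only the size and amplitude criteria, uses the height above the current threshold as a proxy for the amplitude criterion, and runs on a synthetic SSH field.

```python
import numpy as np
from scipy import ndimage

def find_anticyclonic_candidates(ssh, thresholds, min_size=9, max_size=1000, min_amp=1.0):
    """Extract candidate eddy features by gradually thresholding an SSH field."""
    available = np.ones(ssh.shape, dtype=bool)
    eddies = []
    for thr in sorted(thresholds):                     # lower thresholds first, so a
        mask = (ssh >= thr) & available                # feature is first seen at its
        labels, n = ndimage.label(mask)                # largest extent
        for i in range(1, n + 1):
            component = labels == i
            size = int(component.sum())
            amplitude = ssh[component].max() - thr     # proxy for the amplitude criterion
            if min_size <= size <= max_size and amplitude >= min_amp:
                eddies.append(component)
                available &= ~component                # do not count these pixels again
    return eddies

rng = np.random.default_rng(0)
ssh = ndimage.gaussian_filter(rng.standard_normal((100, 100)), sigma=5) * 50  # synthetic field (cm)
print(len(find_anticyclonic_candidates(ssh, thresholds=np.arange(-2, 2.01, 0.5))))
```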
The main distinction between our implementation, EddyScan, and that of Chel-
ton et al. [16] are two-fold: First, we use unfiltered data while Chelton et al. [16]
pre-process the data. Second, to ensure the selection of compact rotating vortices,
Chelton et al. [16] required that the maximum distance between any pairs of points
within an eddy interior be less than a specified threshold, while EddyScan uses the
convexity criterion to ensure compactness. The primary motivation to use convexity
is to reduce the run time complexity of the algorithm from O(N 2 ) to O(N) in the
number of features identified.
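For completeness, a sketch of such a convexity test is given below. It compares the feature's pixel count to the area of its convex hull; the fixed 0.85 threshold is an illustrative stand-in, since the chapter defines the required ratio as a function of latitude.

```python
import numpy as np
from scipy.spatial import ConvexHull

def is_convex_enough(component_mask, min_ratio=0.85):
    """Compare a feature's pixel count to the area of its convex hull."""
    ys, xs = np.nonzero(component_mask)
    points = np.column_stack([xs, ys]).astype(float)
    if len(points) < 3:
        return True                                # too small for a hull; accept
    hull = ConvexHull(points, qhull_options="QJ")  # "QJ" handles degenerate inputs
    hull_area = hull.volume                        # in 2-D, .volume is the enclosed area
    return component_mask.sum() / max(hull_area, 1.0) >= min_ratio

blob = np.zeros((20, 20), dtype=bool)
blob[5:15, 5:15] = True                            # a compact square passes the test
print(is_convex_enough(blob))
```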
Fig. 15 Aggregate counts for eddy centroids that were observed through each 1◦ × 1◦ region
over the October 1992 - January 2011 period, as detected by EddyScan. These results show high
eddy activity along the major currents such as the Gulf Stream (North Atlantic) and Kuroshio
Current (North Pacific). Best seen in color.
There are instances, however, when the maximum distance criterion is unable to
avoid merging several smaller eddies together. Figure 16 shows an example where
the maximum distance criterion between any pair of pixels in the blob is met despite there being several eddies. As a result, CH11 (yellow cross) labels the entire feature as a
Fig. 16 An example where the maximum distance criterion of Chelton et al. [16] is met, yet
the large feature is in fact several eddies merged together. Top: a zoomed-in view on SSH
anomalies in the Southern Hemisphere showing at least four coherent structures with positive
SSH anomalies. Bottom: Chelton et al. [16] (yellow cross) identifies a single eddy in the
region, while our convexity parameter allows EddyScan to successfully break the larger blob
into four smaller eddies. The SSH data are in grayscale to improve visibility of the identified
eddies. Best seen in color.
single eddy. EddyScan, however, is able to break the large blob into coherent small
eddies.
Fig. 17 A sample time-series analyzed by PDELTA with gradually decreasing segments en-
closed between each pair of green and red lines. These segments were obtained after dis-
carding segments of very short length or insignificant drop that are atypical signatures of an
eddy.
Fig. 18 An illustration to show PDELTA’s spatial analysis component. At any given time ti
only a subset of all time-series are labeled as candidates for being part of an eddy (green
points). Only when a sufficient number of similarly behaving neighbors are detected (in this
case four) does PDELTA label them as an eddy (black circle). As time passes, some time-series
are removed from the eddy (red points) as they are no longer exhibiting a gradual change;
while others are added. If the number of similarly behaving time-series falls below (above)
the minimum (maximum) number of required time-series, the cluster is no longer an eddy
(e.g. top left corner at ti+2 frame).
Fig. 19 Monthly eddy counts (lifetime ≥ 16 weeks). Top: Monthly counts for cyclonic eddies
as detected by our automated algorithm PDELTA (blue) and CH11 (red). Bottom: Monthly
counts for anti-cyclonic eddies as detected by our automated algorithm PDELTA (blue) and
Chelton et al. [16] (red).
Fig. 20 Scalability comparison between our algorithm PDELTA (blue) and a connected com-
ponent algorithm (green) similar to CH11. Left: time required to track all eddies in the dataset
as a function of the grid resolution. Right: time required to track all eddies in the dataset
as a function of the time-series length (i.e. number of weekly observations). Our algorithm
PDELTA (blue) scales better than the connected component algorithm in both time and space.
a sufficient number of neighbors are also candidate time-series at time t then the
identified group is labeled as an eddy. Finally, as the eddy moves from one time-
step to the next, we keep adding new candidate time-series as their start time ts is reached and removing other time-series as their end time te is passed. We count the duration of each eddy
as the number of weeks the minimum number of clustered candidate time-series is
met.
Figure 17 demonstrates how our approach detects candidate time-series. The top
panel shows the SSH anomaly time-series for one grid point in the Nordic Sea. For
this particular location, our algorithm PDELTA, identified three segments where a
significant gradual decrease in SSH occurred over a long time period starting at
approximately weeks 60, 410, and 870, respectively. During each decreasing seg-
ment, we search this location’s neighborhood for time-series with similar gradual
decrease. Once the significant decreasing segment ends, either there will be other
neighbors that will continue to form a coherent eddy or the eddy has dissipated if
the minimum eddy size is no longer met.
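A minimal sketch of the temporal step alone is given below: it scans one SSH time series for sustained decreasing segments and discards those that are too short or too shallow. The length and drop thresholds are illustrative, and the spatial step (requiring enough similarly behaving neighbors) is omitted.

```python
import numpy as np

def decreasing_segments(ssh, min_len=5, min_drop=6.0):
    """Return (t_s, t_e) pairs of sustained, significant decreasing segments."""
    segments, start = [], 0
    for t in range(1, len(ssh) + 1):
        if t == len(ssh) or ssh[t] >= ssh[t - 1]:            # the decrease ends here
            if t - start >= min_len and ssh[start] - ssh[t - 1] >= min_drop:
                segments.append((start, t - 1))
            start = t
    return segments

rng = np.random.default_rng(0)
ssh = rng.normal(0, 0.1, 200)                # weekly SSH anomalies at one location
ssh[60:75] -= np.linspace(0, 12, 15)         # inject an eddy-like sustained drop
print(decreasing_segments(ssh))              # roughly [(60, 74)]
```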
PDELTA detected slightly more cyclonic (9.89 per month) than anti-cyclonic
(9.48 per month) eddies. These differences are consistent with the findings of Chel-
ton et al. [16]. Overall, we identified a total of 9.08 eddies per month versus 8.87
for Chelton et al. [16]5 . This could be due to the fact that eddies tend to be smaller
in the region analyzed, and thus could have been ignored by CH11’s algorithm once
the data were filtered. Figure 19 shows the monthly cyclonic (top) and anti-cyclonic
(bottom) counts for PDELTA (blue curve) and CH11 (red curve). We find that al-
though the counts match well, PDELTA detected fewer eddies than CH11 during
winter months, but more eddies during summer months.
One major advantage of considering the spatio-temporal context of the SSH data
is that such an approach scales well with respect to the data’s resolution and time-
series length (i.e. number of satellite snapshots). Figure 20 shows empirical results
comparing the computation time of PDELTA and the connected component algo-
rithm as the number of grid cells (M × N) and time-series length (K) are increased;
the figure shows quadratic increase in computation time for the connected compo-
nent algorithm as M × N is increased, while PDELTA’s computation time increases
linearly. This difference is particularly germane since data from future climate mod-
els and satellite observations will be of much higher resolution.
We might also have to re-think the definition of anomalies and extremes beyond that
of abnormal deviation from the mean. Climate extremes may be better analyzed in
a multi-variate fashion, where multiple relatively normal conditions may lead to a
“cumulative” extreme. For instance, while hurricane Katrina was a Category 5 hur-
ricane, it was the breaking of the levee that accentuated its horrific impact. Finally,
traditional evaluation metrics for learning algorithms may need to be extended for
STDM. A large number of climate problems have no reliable “ground truth” data
and therefore rely on unsupervised learning techniques. Hence, it is crucial to de-
velop objective performance measures and experiments that allow one to compare the performance of different unsupervised STDM algorithms. Furthermore, traditional performance measures such as root mean square error might need to be adjusted to
account for spatio-temporal variability.
There are also great opportunities for novel STDM applications within climate
science. Within the applications of user-defined pattern mining, the majority of fea-
tures of interest are usually defined by domain experts. Such an approach is not
always feasible since we have significant knowledge gaps in many domains where
such data exists. Therefore developing unsupervised feature extraction techniques
that autonomously identify significant features based on spatio-temporal variability
(i.e. how different is a pattern from random noise) might be preferable, especially
in large datasets. Additionally, given the large number of climate datasets, each at
a different spatio-temporal resolution, there is a high demand for spatio-temporal
relationship mining and predictive modeling techniques, that take data at a low,
global resolution and infer impact on a higher, local resolution (and vice versa).
Finally, a fundamental quantification of the relationship between uncertainty and risk might need to emerge. Data mining and machine learning have used probabilities as a measure of
uncertainty. However, numerous climate-related questions are interested in risk as
opposed to uncertainty. Providing decision-makers with tools to convert statistical
uncertainty to risk quantities based on available information has the potential to
be a major scientific and societal contribution.
Answers to some of these questions will emerge over time as we continue to see
new STDM applications to climate data. Others, such as significance tests, might
require diligent collaborations with adjacent fields such as statistics. Nonetheless,
there is an exciting (and challenging) road ahead for STDM researchers.
Acknowledgements. Part of the research presented in this chapter was funded by an NSF
Graduate Research Fellowship, an NSF Nordic Research Opportunity Fellowship, a Univer-
sity of Minnesota Doctoral Dissertation Fellowship, and an NSF Expeditions in Computing
Grant (IIS-1029711). Access to computing resources was provided by the University of Min-
nesota Supercomputing Institute. The authors thank Varun Mithal for generating Figure 4 and
Dr. Stefan Sobolowski for generating Figure 7. We also thank Dr. Stefan Liess for construc-
tive comments that improved the quality of the manuscript.
References
[1] Anbaroğlu, T.C.B.: Spatio-temporal outlier detection in environmental data. Spatial
and Temporal Reasoning for Ambient Intelligence Systems, 1–9 (2009)
[2] Arenas, A., Díaz-Guilera, A., Kurths, J., Moreno, Y., Zhou, C.: Synchronization in
complex networks. Physics Reports 469(3), 93–153 (2008)
[3] Bain, C.L., De Paz, J., Kramer, J., Magnusdottir, G., Smyth, P., Stern, H., Wang, C.-C.:
Detecting the itcz in instantaneous satellite data using spatiotemporal statistical mod-
eling: Itcz climatology in the east pacific. Journal of Climate 24(1), 216–230 (2011)
[4] Baldocchi, D., Falge, E., Gu, L., Olson, R., Hollinger, D., Running, S., Anthoni, P.,
Bernhofer, C., Davis, K., Evans, R., et al.: Fluxnet: a new tool to study the temporal
and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux
densities. Bulletin of the American Meteorological Society 82(11), 2415–2434 (2001)
[5] Barua, S., Alhajj, R.: Parallel wavelet transform for spatio-temporal outlier detection
in large meteorological data. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X.
(eds.) IDEAL 2007. LNCS, vol. 4881, pp. 684–694. Springer, Heidelberg (2007)
[6] Basak, J., Sudarshan, A., Trivedi, D., Santhanam, M.: Weather data mining using inde-
pendent component analysis. The Journal of Machine Learning Research 5, 239–253
(2004)
[7] Berezin, Y., Gozolchiani, A., Guez, O., Havlin, S.: Stability of climate networks with
time. Scientific Reports 2 (2012)
[8] Boriah, S., Kumar, V., Steinbach, M., Potter, C., Klooster, S.: Land cover change detec-
tion: a case study. In: Proceeding of the 14th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 857–865. ACM (2008)
[9] Braverman, A., Fetzer, E.: Mining massive earth science data sets for large scale struc-
ture. In: Proceedings of the Earth-Sun System Technology Conference (2005)
[10] Camargo, S.J., Robertson, A.W., Gaffney, S.J., Smyth, P., Ghil, M.: Cluster analysis
of typhoon tracks. part i: General properties. Journal of Climate 20(14), 3635–3653
(2007a)
[11] Camargo, S.J., Robertson, A.W., Gaffney, S.J., Smyth, P., Ghil, M.: Cluster analysis
of typhoon tracks. part ii: Large-scale circulation and enso. Journal of Climate 20(14),
3654–3676 (2007b)
[12] Chamber, Y., Garg, A., Mithal, V., Brugere, I., Lau, M., Krishna, V., Boriah, S., Stein-
bach, M., Kumar, V., Potter, C., Klooster, S.A.: A novel time series based approach
to detect gradual vegetation changes in forests. In: CIDU 2011: Proceedings of the
NASA Conference on Intelligent Data Understanding, pp. 248–262 (2011)
[13] Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences:
A survey. IEEE Transactions on Knowledge and Data Engineering 24(5), 823–839
(2012)
[14] Chatterjee, S., Steinhaeuser, K., Banerjee, A., Chatterjee, S., Ganguly, A.: Sparse
group lasso: Consistency and climate applications. In: SDM (2012)
[15] Chelton, D., Schlax, M., Samelson, R., de Szoeke, R.: Global observations of large
oceanic eddies. Geophysical Research Letters 34, L15606 (2007)
[16] Chelton, D., Schlax, M., Samelson, R.: Global observations of nonlinear mesoscale
eddies. Progress in Oceanography (2011a)
[17] Chelton, D.B., Gaube, P., Schlax, M.G., Early, J.J., Samelson, R.M.: The influence of
nonlinear mesoscale eddies on near-surface oceanic chlorophyll. Science 334(6054),
328–332 (2011b)
[18] Chen, Y., Randerson, J.T., Morton, D.C., DeFries, R.S., Collatz, G.J., Kasibhatla, P.S.,
Giglio, L., Jin, Y., Marlier, M.E.: Forecasting fire season severity in south america
using sea surface temperature anomalies. Science 334(6057), 787–791 (2011)
[19] Cheng, T., Li, Z.: A multiscale approach for spatio-temporal outlier detection. Trans-
actions in GIS 10(2), 253–263 (2006)
[20] Chou, P.A., Lookabaugh, T., Gray, R.M.: Entropy-constrained vector quantization.
IEEE Transactions on Acoustics, Speech and Signal Processing 37(1), 31–42 (1989)
[21] Clark, P., Matwin, S.: Using qualitative models to guide inductive learning. In: Pro-
ceedings of the Tenth International Conference on Machine Learning, pp. 49–56
(1993)
[22] Clearwater, S.H., Provost, F.J.: Rl4: A tool for knowledge-based induction. In: Pro-
ceedings of the 2nd International IEEE Conference on Tools for Artificial Intelligence,
pp. 24–30. IEEE (1990)
[23] Coe, R., Stern, R.: Fitting models to daily rainfall data. Journal of Applied Meteorol-
ogy 21(7), 1024–1031 (1982)
[24] Cox, D., Isham, V.: A simple spatial-temporal model of rainfall. Proceedings of the
Royal Society of London. A. Mathematical and Physical Sciences 415(1849), 317–
328 (1988)
[25] Cressie, N., Wikle, C.K.: Statistics for spatio-temporal data, vol. 465. Wiley (2011)
[26] Cressie, N., Assunçao, R., Holan, S.H., Levine, M., Zhang, J., Samsi, C.-N.: Dynami-
cal random-set modeling of concentrated precipitation in north america. Statistics and
its Interface (2011)
[27] Domingos, P.: Occam’s two razors: The sharp and the blunt. In: Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining, pp. 37–
43. AAAI Press (1998)
[28] Dong, C., Nencioli, F., Liu, Y., McWilliams, J.: An automated approach to detect
oceanic eddies from satellite remotely sensed sea surface temperature data. IEEE Geo-
science and Remote Sensing Letters (99), 1–5 (2011)
[29] Donges, J.F., Zou, Y., Marwan, N., Kurths, J.: The backbone of the climate network.
EPL (Europhysics Letters) 87(4), 48007 (2009a)
[30] Donges, J.F., Zou, Y., Marwan, N., Kurths, J.: Complex networks in climate dynamics.
The European Physical Journal-Special Topics 174(1), 157–179 (2009b)
[31] Efron, B., Tibshirani, R.: Statistical data analysis in the computer age. Sci-
ence 253(5018), 390–395 (1991)
[32] Elsner, J., Jagger, T., Fogarty, E.: Visibility network of united states hurricanes. Geo-
physical Research Letters 36(16), L16702 (2009)
[33] Emanuel, K.: The hurricane-climate connection. Bulletin of the American Meteoro-
logical Society 89(5) (2008)
[34] Faghmous, J., Chamber, Y., Vikebø, F., Boriah, S., Liess, S., dos Santos Mesquita, M.,
Kumar, V.: A novel and scalable spatio-temporal technique for ocean eddy monitoring.
In: Twenty-Sixth Conference on Artificial Intelligence, AAAI 2012 (2012a)
[35] Faghmous, J.H., Styles, L., Mithal, V., Boriah, S., Liess, S., Vikebo, F., dos Santos
Mesquita, M., Kumar, V.: Eddyscan: A physically consistent ocean eddy monitoring
application. In: 2012 Conference on Intelligent Data Understanding (CIDU), pp. 96–
103 (2012b)
[36] Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in
time-series databases, vol. 23. ACM (1994)
[37] Fernandes, A.M.: Identification of oceanic eddies in satellite images. In: Bebis, G., et
al. (eds.) ISVC 2008, Part II. LNCS, vol. 5359, pp. 65–74. Springer, Heidelberg (2008)
[38] Fogarty, E.A., Elsner, J.B., Jagger, T.H., Tsonis, A.A.: Network analysis of us hurri-
canes. Hurricanes and Climate Change, 153–167 (2009)
[39] Foley, J.A.: Can we feed the world & sustain the planet? Scientific American 305(5),
60–65 (2011)
[40] Fu, L., Chelton, D., Le Traon, P., Morrow, R.: Eddy dynamics from satellite altimetry.
Oceanography 23(4), 14–25 (2010)
[41] Fu, Q., Banerjee, A., Liess, S., Snyder, P.K.: Drought detection of the last century: An
mrf-based approach. In: Proceedings of the SIAM International Conference on Data
Mining (2012)
[42] Gaffney, S.J., Robertson, A.W., Smyth, P., Camargo, S.J., Ghil, M.: Probabilistic clus-
tering of extratropical cyclones using regression mixture models. Climate Dynam-
ics 29(4), 423–440 (2007)
[43] Ghosh, S., Das, D., Kao, S.-C., Ganguly, A.R.: Lack of uniform trends but increasing
spatial variability in observed indian rainfall extremes. Nature Climate Change (2011)
[44] Goldenberg, S., Shapiro, L.: Physical mechanisms for the association of el niño and
west african rainfall with atlantic major hurricane activity. Journal of Climate 9(6),
1169–1187 (1996)
[45] Guez, O., Gozolchiani, A., Berezin, Y., Brenner, S., Havlin, S.: Climate network struc-
ture evolves with north atlantic oscillation phases. EPL (Europhysics Letters) 98(3),
38006 (2012)
[46] Hastie, T., Tibshirani, R., Friedman, J., Franklin, J.: The elements of statistical learn-
ing: data mining, inference and prediction. The Mathematical Intelligencer 27(2), 83–
85 (2005)
[47] Henke, D., Smyth, P., Haffke, C., Magnusdottir, G.: Automated analysis of the tempo-
ral behavior of the double intertropical convergence zone over the east pacific. Remote
Sensing of Environment 123, 418–433 (2012)
[48] Hoffman, F.M., Hargrove Jr., W.W., Erickson III, D.J., Oglesby, R.J.: Using clustered
climate regimes to analyze and compare predictions from fully coupled general circu-
lation models. Earth Interactions 9(10), 1–27 (2005)
[49] Hoyos, C., Agudelo, P., Webster, P., Curry, J.: Deconvolution of the factors contribut-
ing to the increase in global hurricane intensity. Science 312(5770), 94 (2006)
[50] Huang, H.-C., Cressie, N.: Spatio-temporal prediction of snow water equivalent using
the kalman filter. Computational Statistics & Data Analysis 22(2), 159–175 (1996)
[51] Karpatne, A., Blank, M., Lau, M., Boriah, S., Steinhaeuser, K., Steinbach, M., Kumar,
V.: Importance of vegetation type in forest cover estimation. In: CIDU, pp. 71–78
(2012)
[52] Kawale, J., Steinbach, M., Kumar, V.: Discovering dynamic dipoles in climate data.
In: SIAM International Conference on Data mining, SDM. SIAM (2011)
[53] Kawale, J., Chatterjee, S., Ormsby, D., Steinhaeuser, K., Liess, S., Kumar, V.: Testing
the significance of spatio-temporal teleconnection patterns. In: Proceedings of the 18th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 642–650. ACM (2012)
[54] Kim, M., Han, J.: A particle-and-density based evolutionary clustering method for
dynamic networks. Proceedings of the VLDB Endowment 2(1), 622–633 (2009)
[55] Lamb, P.J., Peppler, R.A.: North atlantic oscillation: Concept and an application. Bul-
letin of the American Meteorological Society 68, 1218–1225 (1987)
[56] Laxman, S., Sastry, P.S.: A survey of temporal data mining. Sadhana 31(2), 173–198
(2006)
[57] Lee, Y., Buchanan, B.G., Aronis, J.M.: Knowledge-based learning in exploratory sci-
ence: Learning rules to predict rodent carcinogenicity. Machine Learning 30(2), 217–
240 (1998)
[58] Livezey, R., Chen, W.: Statistical field significance and its determination by monte
carlo techniques (in meteorology). Monthly Weather Review 111, 46–59 (1983)
[59] McGillicuddy Jr., D.: Eddies masquerade as planetary waves. Science 334(6054), 318–
319 (2011)
[60] McGuire, M.P., Janeja, V.P., Gangopadhyay, A.: Spatiotemporal neighborhood dis-
covery for sensor data. In: Gaber, M.M., Vatsavai, R.R., Omitaomu, O.A., Gama, J.,
Chawla, N.V., Ganguly, A.R. (eds.) Sensor-KDD 2008. LNCS, vol. 5840, pp. 203–225.
Springer, Heidelberg (2010)
[61] Mesrobian, E., Muntz, R., Shek, E., Santos, J., Yi, J., Ng, K., Chien, S.-Y., Mechoso,
C., Farrara, J., Stolorz, P., et al.: Exploratory data mining and analysis using conquest.
In: Proceedings of the IEEE Pacific Rim Conference on Communications, Computers,
and Signal Processing, pp. 281–286. IEEE (1995)
[62] Mesrobian, E., Muntz, R., Shek, E., Nittel, S., La Rouche, M., Kriguer, M., Mechoso,
C., Farrara, J., Stolorz, P., Nakamura, H.: Mining geophysical data for knowledge.
IEEE Expert 11(5), 34–44 (1996)
[63] Mestas-Nuñez, A.M., Enfield, D.B.: Rotated global modes of non-enso sea surface
temperature variability. Journal of Climate 12(9), 2734–2746 (1999)
[64] Mithal, V., Garg, A., Brugere, I., Boriah, S., Kumar, V., Steinbach, M., Potter, C.,
Klooster, S.: Incorporating natural variation into time-series based land cover change
identification. In: Proceeding of the 2011 NASA Conference on Intelligent Data Un-
derstanding, CIDU (2011a)
[65] Mithal, V., Garg, A., Boriah, S., Steinbach, M., Kumar, V., Potter, C., Klooster, S.,
Castilla-Rubio, J.C.: Monitoring global forest cover using data mining. ACM Trans-
actions on Intelligent Systems and Technology (TIST) 2(4), 36 (2011b)
[66] Mithal, V., Khandelwal, A., Boriah, S., Steinhauser, K., Kumar, V.: Change detection
from temporal sequences of class labels: Application to land cover change mapping.
In: SIAM International Conference on Data mining, SDM. SIAM (2013)
[67] Neill, D., Moore, A., Cooper, G.: A bayesian spatial scan statistic. In: Advances in
Neural Information Processing Systems 18, p. 1003 (2006)
[68] Neill, D.B., Moore, A.W., Sabhnani, M., Daniel, K.: Detection of emerging space-time
clusters. In: Proceedings of the Eleventh ACM SIGKDD International Conference on
Knowledge Discovery in Data Mining, pp. 218–227. ACM (2005)
[69] Overpeck, J., Meehl, G., Bony, S., Easterling, D.: Climate data challenges in the 21st
century. Science 331(6018), 700 (2011)
[70] Paluš, M., Hartman, D., Hlinka, J., Vejmelka, M.: Discerning connectivity from dy-
namics in climate networks. Nonlinear Processes Geophys. 18 (2011)
[71] Pegau, W., Boss, E., Martínez, A.: Ocean color observations of eddies during the sum-
mer in the gulf of california. Geophysical Research Letters 29(9), 1295 (2002)
[72] Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Za-
karia, J., Keogh, E.: Searching and mining trillions of time series subsequences under
dynamic time warping. In: Proceedings of the 18th ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, pp. 262–270. ACM (2012)
[73] Ramachandran, R., Rushing, J., Conover, H., Graves, S., Keiser, K.: Flexible frame-
work for mining meteorological data. In: Proceedings of the 19th Conference on In-
teractive Information and Processing Systems for Meteorology, Oceanography, and
Hydrology (2003)
[74] Richardson, P.: Eddy kinetic energy in the north atlantic from surface drifters. Journal
of Geophysical Research 88(C7), 4355–4367 (1983)
[75] Scheffer, M., Carpenter, S., Foley, J.A., Folke, C., Walker, B., et al.: Catastrophic shifts
in ecosystems. Nature 413(6856), 591–596 (2001)
[76] Sencan, H., Chen, Z., Hendrix, W., Pansombut, T., Semazzi, F.H.M., Choudhary, A.N.,
Kumar, V., Melechko, A.V., Samatova, N.F.: Classification of emerging extreme event
tracks in multivariate spatio-temporal physical systems using dynamic network struc-
tures: Application to hurricane track prediction. In: IJCAI, pp. 1478–1484 (2011)
[77] Shekhar, S., Vatsavai, R.R., Celik, M.: Spatial and spatiotemporal data mining: Recent
advances. Data Mining: Next Generation Challenges and Future Directions (2008)
[78] Smith, R., Robinson, P.: A bayesian approach to the modeling of spatial-temporal pre-
cipitation data. In: Case Studies in Bayesian Statistics, pp. 237–269. Springer (1997)
[79] Srikanthan, R., McMahon, T., et al.: Stochastic generation of annual, monthly and
daily climate data: A review. Hydrology and Earth System Sciences Discussions 5(4),
653–670 (2001)
[80] Steinbach, M., Tan, P.-N., Kumar, V., Klooster, S., Potter, C.: Discovery of climate
indices using clustering. In: Proceedings of the Ninth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 446–455. ACM (2003)
[81] Steinhaeuser, K., Chawla, N.V., Ganguly, A.R.: Complex networks in climate science:
progress, opportunities and challenges. In: NASA Conf. on Intelligent Data Under-
standing, Mountain View, CA (2010)
[82] Stolorz, P., Dean, C.: Quakefinder: A scalable data mining system for detecting earth-
quakes from space. In: Proceedings of the Second International Conference on Knowl-
edge Discovery and Data Mining (KDD 1996), pp. 208–213 (1996)
[83] Stolorz, P., Mesrobian, E., Muntz, R., Santos, J., Shek, E., Yi, J., Mechoso, C., Farrara,
J.: Fast spatio-temporal data mining from large geophysical datasets. In: Proceedings
of the International Conference on Knowledge Discovery and Data Mining, pp. 300–
305 (1995)
[84] Sugihara, G., May, R.: Nonlinear forecasting as a way of distinguishing chaos from
measurement error in time series. Nature 344(19), 734–741 (1990)
[85] Taubenböck, H., Esch, T., Felbier, A., Wiesner, M., Roth, A., Dech, S.: Monitoring
urbanization in mega cities from space. Remote Sensing of Environment (2011)
[86] IPCC Core Writing Team: Climate Change 2007: Synthesis Report. Contribution of Working Groups I, II and III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. IPCC, Geneva, Switzerland (2007)
[87] Tobler, W.R.: A computer movie simulating urban growth in the detroit region. Eco-
nomic Geography 46, 234–240 (1970)
[88] Tsonis, A., Roebber, P.: The architecture of the climate network. Physica A: Statistical
Mechanics and its Applications 333, 497–504 (2004)
[89] Tsonis, A.A., Swanson, K.L., Roebber, P.J.: What do networks have to do with cli-
mate? Bulletin of the American Meteorological Society 87(5), 585–596 (2006)
[90] Tsonis, A.A., Swanson, K.L., Wang, G.: On the role of atmospheric teleconnections in
climate. Journal of Climate 21(12), 2990–3001 (2008)
[91] Ulbrich, U., Leckebusch, G., Pinto, J.: Extra-tropical cyclones in the present and future
climate: a review. Theoretical and Applied Climatology 96(1), 117–131 (2009)
[92] Van Leeuwen, T.T., Frank, A.J., Jin, Y., Smyth, P., Goulden, M.L., van der Werf, G.R.,
Randerson, J.T.: Optimal use of land surface temperature data to detect changes in
tropical forest cover. Journal of Geophysical Research 116(G2), G02002 (2011)
[93] Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and varia-
tional inference. Foundations and Trends® in Machine Learning 1(1-2), 1–305 (2008)
[94] Watts, D., Strogatz, S.: Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998)
[95] Webster, P.J., Holland, G.J., Curry, A., Chang, H.: Changes in tropical cyclone num-
ber, duration, and intensity in a warming environment. Science 309(5742), 1844–1846
(2005)
[96] White, M.A., Hoffman, F., Hargrove, W.W., Nemani, R.R.: A global framework for
monitoring phenological responses to climate change. Geophysical Research Let-
ters 32(4), L04705 (2005)
[97] Wilks, D.S.: Statistical methods in the atmospheric sciences. Academic press (2006)
[98] Woolhiser, D.A.: Modeling daily precipitation progress and problems. In: Walden, A.,
Guttorp, P. (eds.) Statistics in the Environmental and Earth Sciences. Edward Arnold,
London (1992)
[99] Wu, E., Liu, W., Chawla, S.: Spatio-temporal outlier detection in precipitation data.
In: Gaber, M.M., Vatsavai, R.R., Omitaomu, O.A., Gama, J., Chawla, N.V., Ganguly,
A.R. (eds.) Sensor-KDD 2008. LNCS, vol. 5840, pp. 115–133. Springer, Heidelberg
(2010)
[100] Wu, E., Liu, W., Chawla, S.: Spatio-temporal outlier detection in precipitation data.
In: Gaber, M.M., Vatsavai, R.R., Omitaomu, O.A., Gama, J., Chawla, N.V., Ganguly,
A.R. (eds.) Sensor-KDD 2008. LNCS, vol. 5840, pp. 115–133. Springer, Heidelberg
(2010)
[101] Wyrtki, K., Magaard, L., Hager, J.: Eddy energy in the oceans. Journal of Geophysical
Research 81(15), 2641–2646 (1976)
[102] Yamasaki, K., Gozolchiani, A., Havlin, S.: Climate networks around the globe are
significantly affected by el nino. Physical Review Letters 100(22), 228501 (2008)
[103] Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys
(CSUR) 38(4), 13 (2006)
Mining Discriminative Subgraph Patterns from Structural Data
Ning Jin
Catalog Quality Department, Amazon, Seattle, WA 98109, USA
e-mail: [email protected]
Wei Wang
Computer Science Department, University of California, Los Angeles, USA
e-mail: [email protected]
1 Introduction
Program bugs are inevitable in software development and localizing program bugs
is a painstaking task. Therefore there have been many studies in automated bug
localization [10]. Automated bug localization usually takes two sets of program
execution traces as input: one set of traces for correct executions and another set
of traces for faulty executions. A program execution trace is generated by logging
method invocation and statement execution. The output of bug localization is
candidate locations of bugs. One way of performing automated bug localization is
to represent program execution traces by graphs and find discriminative subgraphs
that appear frequently in graphs of faulty executions but infrequently in graphs of
correct executions.
A program execution trace can be represented by a directed graph. In a graph
representation of a program execution, each node may correspond to a method, a
basic block or a statement and each edge describes how the control/data flow
moves from one node to another. Nodes are labeled with basic descriptions of the
corresponding methods, basic blocks or statements.
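A minimal sketch of how a method-level execution trace might be turned into such a labeled directed graph is shown below; the trace format, function name, and example method names are illustrative assumptions, not the chapter's implementation.

```python
from collections import defaultdict

def trace_to_graph(trace):
    """Convert a sequence of executed method names into a labeled directed
    graph: one node per distinct method, one edge per observed control
    transfer, with edge labels counting how often the transfer occurred.
    The trace format is an illustrative assumption."""
    ids = {}                                   # method name -> node ID
    nodes = {}                                 # node ID -> node label
    edges = defaultdict(int)                   # (src ID, dst ID) -> transfer count
    for name in trace:
        if name not in ids:
            ids[name] = len(ids)
            nodes[ids[name]] = name
    for src, dst in zip(trace, trace[1:]):
        edges[(ids[src], ids[dst])] += 1
    return nodes, dict(edges)

# A faulty run may pass through an error-handling method that correct runs never reach.
faulty_trace = ["main", "parse", "compute", "handle_error", "write"]
print(trace_to_graph(faulty_trace))
```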
1.2 Definitions
Definition 1 (Graph). A graph is denoted as g = (V, E) where V is a set of nodes
and E is a set of edges connecting the nodes. Both nodes and edges can have
labels.
For example, in Figure 1, there are two graphs in the graph database. The text
in each node is in the form of (node ID: node label). Two nodes in a graph may
have the same label but they cannot have the same node ID. Two nodes in two
different graphs can have the same node ID but they do not necessarily represent
the same entity and may have different labels.
2 Mining Techniques
Overall Framework
Algorithm: graphSig
Input:
Gp: a set of positive graphs
Gn: a set of negative graphs
minFreq: frequency threshold for frequent subgraph mining
maxPvalue: p-value threshold for selecting significant sub-feature vectors
radius: radius for extracting subgraphs around a given node
Output:
P: a set of discriminative subgraph patterns
1. P = ∅;
2. F = ∅;
3. for each g ∈ Gp ∪ Gn
4. R = {feature vectors generated by RWR in g with p-values <
maxPvalue};
5. F = F ∪ R;
6. for each node label a in Gp ∪ Gn
7. Fa = {f | f ∈ F, f was generated by RWR on nodes whose labels were a};
The similarity between two subgraph patterns p and q is measured by the ratio
of the maximum frequency difference that p and q can have to the sum of
frequencies of p and q. If the ratio is less than a user specified threshold σ, then
the two subgraph patterns are considered highly similar and there is no need to
explore the other if one is already explored. Let Δp(p, q) be the maximum positive
frequency difference that p and q can have and Δn(p, q) be the maximum negative
frequency difference that p and q can have. After one pattern is explored, the other
can be skipped if:
2Δp(p, q) / (pfreq(p) + pfreq(q)) ≤ σ  and  2Δn(p, q) / (nfreq(p) + nfreq(q)) ≤ σ
This subgraph pattern pruning can be further extended to prune a whole search
branch instead of an individual subgraph pattern.
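A small sketch of this pruning test is given below. The helpers pfreq, nfreq, delta_p, and delta_n are assumed to be supplied by the miner (they return a pattern's positive/negative frequency and the maximum positive/negative frequency difference of two patterns); the inequality is rearranged so that no division by a zero frequency can occur.

```python
def can_skip(p, q, sigma, pfreq, nfreq, delta_p, delta_n):
    """Structural leap pruning test (sketch). Pattern p can be skipped if a
    structurally similar pattern q has already been examined and both
    frequency-difference ratios fall below the user threshold sigma.
    All four helper functions are assumed to be provided by the miner."""
    pos_similar = 2 * delta_p(p, q) <= sigma * (pfreq(p) + pfreq(q))
    neg_similar = 2 * delta_n(p, q) <= sigma * (nfreq(p) + nfreq(q))
    return pos_similar and neg_similar
```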
The algorithm of structural leap search is described below.
Algorithm: Structural_Leap_Search
Input:
Gp: a set of positive graphs
Gn: a set of negative graphs
σ: difference threshold
Output:
p*: optimal subgraph pattern candidate
17. S = {1-edge subgraph patterns};
18. p* = ∅;
19. G-test(p*) = -∞;
20. while S ≠ ∅
21. p = next subgraph pattern in S;
22. S = S - {p};
23. if p was examined
24. continue;
25. if ∃ q, q was examined and
2Δp(p, q) / (pfreq(p) + pfreq(q)) ≤ σ and 2Δn(p, q) / (nfreq(p) + nfreq(q)) ≤ σ
26. continue;
27. if G-test(p) > G-test(p*)
28. p* = p;
29. if upper bound of G-test(p) ≤ G-test(p*)
30. continue;
31. S = S ∪ {supergraphs of p with one more edge};
32. return p*;
Figure 4 illustrates the relationship between frequency and G-test score for the
AIDS Anti-viral dataset3. It is a contour plot displaying isolines of G-test score in
two dimensions. The X axis is the frequency of a subgraph in the positive dataset,
while the Y axis is the frequency of the same subgraph in the negative dataset. The
curves depict G-test score (to avoid infinite G-test score, a default minimum
frequency is assumed for any pattern whose frequency is 0 in the data). The upper-left and lower-right corners have the highest G-test scores. The “circle” marks the subgraph with the highest G-test score discovered in this dataset. As one can see, its positive frequency is higher than that of most subgraphs. Similar results are also observed in other graph datasets.
To profit from this discovery, the authors proposed an iterative frequency
descending mining method.
Frequency descending mining begins the mining process with high frequency
threshold θ = 1.0 and it searches for the most discriminative subgraph pattern p*
whose frequency is at least θ. Then frequency descending mining repeatedly lowers the frequency threshold θ to check whether it can find a better p* whose frequency is at least θ. It terminates when θ reaches either 0 or a user-specified threshold.
3 http://pubchem.ncbi.nlm.nih.gov
Algorithm: Frequency_Descending_Mine
Input:
Gp: a set of positive graphs
Gn: a set of negative graphs
ε: converging threshold
Output:
p*: optimal subgraph pattern candidate
1. θ = 1;
2. p = ∅;
3. G-test(p) = -∞;
4. do
5. p* = p;
6. S = {subgraph patterns in Gp and Gn with frequency no less than θ};
7. p = argmax p'∈S G − test( p') ;
8. θ = θ / 2;
9. while (G-test(p) - G-test(p*) ≥ ε);
10. return p* = p;
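The sketch below illustrates the descending loop under stated assumptions: mine_at(θ) is a hypothetical helper returning the most discriminative pattern with frequency at least θ, and the scoring function uses one common form of the G-test statistic, which is an assumption rather than the chapter's exact definition.

```python
import math

def g_test_score(pos_freq, neg_freq, n_pos, eps=1e-6):
    """One common form of the G-test discrimination score (an assumption
    here, not necessarily the chapter's exact definition). A small eps
    keeps the logarithms finite when a frequency is 0 or 1."""
    p = min(max(pos_freq, eps), 1 - eps)
    q = min(max(neg_freq, eps), 1 - eps)
    return 2 * n_pos * (p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

def frequency_descending_mine(mine_at, score, eps=1e-3, min_theta=1e-4):
    """Sketch of the descending loop: start at theta = 1.0 and halve theta
    until the best score stops improving by at least eps. mine_at(theta) is
    a hypothetical helper returning the most discriminative pattern whose
    frequency is no less than theta; score(p) returns its score."""
    theta = 1.0
    best = mine_at(theta)
    best_score = score(best)
    while theta > min_theta:
        theta /= 2.0
        candidate = mine_at(theta)
        if score(candidate) - best_score < eps:
            break
        best, best_score = candidate, score(candidate)
    return best
```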
Overall Framework
The overall framework of LEAP is as follows:
Step 1: Use structural leap search to find the most discriminative subgraph pattern
p* with frequency threshold θ = 1.0,
Step 2: Repeat Step 1 with θ = θ / 2 until score(p*) converges,
Step 3: Take score(p*) as a seed score; use structural leap search to find the most
discriminative subgraph pattern without frequency threshold.
Almost all efficient subgraph pattern exploration methods, such as gSpan [23] and
FFSM [11], start with subgraphs having only one edge and extend them to larger
subgraphs by adding one edge at a time. Each large subgraph pattern can be directly extended from more than one smaller subgraph pattern. For example, in
Figure 5, subgraph pattern A-B-C can be extended from either A-B or B-C.
If pattern p is in the lineage of pattern q and q is a frequent pattern, then p must also be frequent and will thus be visited by the mining algorithm. However, in discriminative subgraph pattern
mining, the redundancy in multiple-lineage exploration becomes its advantage
over single-lineage exploration. Objective functions to measure discrimination
power of subgraphs are usually not antimonotonic. If pattern p is in the lineage of
pattern q and q is a discriminative pattern, p is not necessarily discriminative.
Under such circumstances, multiple-lineage exploration can be aggressive in
pruning patterns with low discrimination scores while single-lineage exploration
cannot afford to prune any pattern unless it is absolutely certain that the pattern
will not lead to any discriminative pattern.
For example, in Figure 5, A-B-C is a highly discriminative subgraph pattern in
the positive set while A-B is not discriminative as it appears in every positive and
negative graph. The single-lineage exploration shown in Figure 6 cannot prune A-
B because otherwise A-B-C will be missed. The multiple-lineage exploration in
Figure 6 can afford to prune A-B since A-B-C can also be reached from B-C.
In the proposed algorithm, LTS, Jin et al. adopted multiple-lineage exploration
to reduce the risk of missing the most discriminative subgraph patterns due to
pruning. Jin et al. used CCAM code to encode subgraph patterns and maintain a
lookup table for subgraph patterns that have been extended to avoid extending a
subgraph pattern repeatedly. Embeddings of subgraph patterns in the graph sets
are also maintained to facilitate subgraph extension and frequency calculation.
Algorithm: fast-probe
Input:
Gp: positive graph set
Gn: negative graph set
Output:
the optimal pattern for each positive graph
1. Put all single-edge subgraph patterns into candidate set C
2. while (C is not empty)
3. p ← get next pattern and remove it from C
4. updated ← false
5. for each graph g in Gp
6. if score(p) > optimal score for g so far
7. update the optimal pattern and optimal score for g
8. updated ← true
9. if (not updated)
10. continue
11. C′ ← all subgraph patterns with one more edge attached to p
12. for each pattern q in C′
13. if q has not been generated before
14. put q into C
15. return the optimal pattern for each g in Gp
As long as the assumption holds for at least one of its lineages, the optimal pattern
will be found. Using multiple-lineage exploration helps because the likelihood of
the assumption being true for at least one lineage is much larger in multiple-
lineage exploration than in single-lineage exploration. In addition, the most
discriminative subgraph pattern will not be missed as long as patterns in its
lineages are optimal patterns for one positive graph at the time they are visited.
This is very likely to be true: it is typical that some positive graphs are covered by
multiple highly discriminative subgraphs while others do not have highly
discriminative subgraphs. The former are called as “rich” graphs and the latter are
called as “poor” graphs. Ancestors for the highly discriminative subgraphs for
“rich” graphs may cover “poor” graphs when their positive frequencies are still
high. Let p be the most discriminative subgraph for a “rich” graph g and q be
another highly discriminative subgraph for g. Let q be visited before any ancestor
of p is visited. Patterns in the lineages of p may not be the optimal patterns for g
when they are visited because they may not be as discriminative as q. However
they may be the optimal patterns for some “poor” graphs and thus survive and
produce a lineage to p. The most discriminative subgraphs for “poor” graphs may
be missed when there are no “poorer” graphs for their ancestors to survive. In this
case, a subsequent (branch and bound) search may be needed to recover the most
discriminative subgraphs missed by fast-probe.
Fig. 7 An example of search records and the corresponding prediction tree and prediction
table
Each tree node is labeled with score and the root node is labeled with 0.0,
which is the score of an empty subgraph. In their implementation, Jin et al.
discretized scores evenly into 10 bins and used the discretized scores as labels. In
the example, the original scores are used as labels for the sake of intuitive
illustration. In addition to the score label, each tree node is also associated with the
maximum score in the sub-tree rooted at this node. The score records and the
corresponding prediction tree can be considered as a sample of the whole search
space. Therefore, the maximum score at each tree node is an estimated upper-
bound in the search space. For example, for a pattern p with score record (0.5, 0.7,
1.0), its maximum score in the prediction tree is 1.5 and thus its estimated upper-
bound in the search space is 1.5. LTS organizes the sample space by scores (rather
than by subgraph structures in the search space) because it is much easier to
compare scores than structures. Sometimes the score record of a pattern p is absent
in the tree, so LTS additionally generates a lookup table, named prediction table,
to aggregate the information in the tree. The key for each entry in the prediction
table is composed of the number of edges in the pattern and the score of the
pattern. The value stored at each entry is the maximum score of the descendants of
the patterns with the corresponding size and score in the sample space. For
example, if the score record of a pattern is (0.4, 0.8), which cannot be found in the
prediction tree, then LTS uses the key <2, 0.8> to look for an upper-bound
estimation in the prediction table, which returns 1.0. The search history H is
composed of the prediction tree and the prediction table. If neither the score
record nor the <size, score> pair of a pattern can be found in H, then LTS uses the loose upper-bound estimation discussed earlier in this section.
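The following sketch shows only the prediction-table part of this lookup under illustrative assumptions (scores normalized to [0, 1) and binned into ten bins); the prediction tree is omitted and all names are hypothetical.

```python
def estimate_upper_bound(num_edges, pattern_score, prediction_table,
                         loose_bound, n_bins=10):
    """History-based upper-bound estimate (sketch). The prediction table maps
    (number of edges, discretized score bin) to the maximum score observed
    among descendants of sampled patterns of that size and score; if the key
    is absent, the loose bound is used. Scores are assumed to be normalized
    to [0, 1) for binning, which is an illustrative simplification."""
    score_bin = min(int(pattern_score * n_bins), n_bins - 1)
    return prediction_table.get((num_edges, score_bin), loose_bound)

# Hypothetical usage mirroring the <2, 0.8> example above
table = {(2, 8): 1.0}
print(estimate_upper_bound(2, 0.8, table, loose_bound=2.5))   # -> 1.0
```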
Using search history to estimate upper-bound bears the risk of underestimating
upper-bound if the discriminative subgraph mining process, which provides the
score records, fails to capture a good sample of high discrimination scores. This
will result in inefficient pruning and thus prolonged execution time. However,
there is little impact to the mining process if the greedy sampling misses many
low discrimination scores because, although these score records may be absent in
the prediction tree, the prediction table can still provide a reasonably tight upper-
bound estimation and the algorithm always has the last resort to the loose
estimation.
LTS first uses fast-probe to collect score records and generates search history
H, which includes a prediction tree of score records and a prediction table
aggregating the score records. LTS utilizes a vector F to keep track of the optimal
pattern for each positive graph: F[i] stores the optimal pattern for positive graph
gi. Vector F is updated with the optimal patterns found by fast-probe, which
compose a better starting point than single-edge subgraphs, before the following
branch-and-bound search. Then LTS performs a branch-and-bound search in the
subgraph search space and uses a candidate list to keep track of candidate
subgraph patterns. Its goal is to find the most discriminative subgraph for each
positive graph. When the branch-and-bound search begins, the candidate list is
initialized with all subgraphs with one edge. LTS repeatedly pops one subgraph
from the candidate list at a time until the candidate list becomes empty. LTS uses
CCAM code [15] to encode subgraphs and maintains a lookup table to keep track
of processed subgraphs. For each subgraph p from the candidate list, LTS updates
F[i] if positive graph gi supports p and score(p) is greater than score(F[i]).
Meanwhile, LTS estimates the upper-bound of p based on search history H and
checks whether the upper-bound is greater than any score(F[i]) with gi supporting
p. If the upper-bound is not greater than the optimal score of any positive graph
supporting p, then p is discarded from further extension. Note that for each
pattern, the algorithm only considers the positive graphs supporting this pattern
when updating optimal scores and pruning with the estimated upper-bound
because the algorithm is looking for the optimal pattern for each positive graph. If
p is preserved, LTS computes all of its extensions with one more edge in the
positive set. The extensions that have not been visited before are put into the
candidate list.
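A compact sketch of this branch-and-bound loop is given below. The helpers supports, score, extend, and upper_bound are assumed to exist, and patterns are assumed to be represented by hashable canonical codes; this is an illustration of the loop described above, not the chapter's implementation.

```python
from collections import deque

def lts_branch_and_bound(one_edge_patterns, supports, score, extend, upper_bound):
    """Sketch of the branch-and-bound loop: supports(p) gives the indices of
    positive graphs containing p, score(p) its discrimination score,
    extend(p) its one-edge-larger extensions in the positive set, and
    upper_bound(p) the history-based estimate; all are assumed helpers."""
    F = {}                                    # graph index -> (best score, best pattern)
    seen = set()
    candidates = deque(one_edge_patterns)
    while candidates:
        p = candidates.popleft()
        if p in seen:
            continue
        seen.add(p)
        graphs = supports(p)
        for gi in graphs:                     # update the per-graph optimum
            if score(p) > F.get(gi, (float("-inf"), None))[0]:
                F[gi] = (score(p), p)
        # prune p if its upper bound cannot beat the optimum of any graph supporting it
        if all(upper_bound(p) <= F[gi][0] for gi in graphs):
            continue
        candidates.extend(q for q in extend(p) if q not in seen)
    return F
```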
For each graph gi in the positive graph set Gp, the algorithm stores a representative
subgraph pattern and a list of up to s candidate subgraph patterns, where s is
bounded (from above) by the available memory space divided by the number of
graphs. Figure 8 illustrates the organization of candidate patterns and
representative patterns. Only subgraphs of gi with discrimination scores greater
than 1 can be its representative or in its candidate list. The representative pattern
has the highest discrimination score among all patterns that are subgraphs of gi
found during pattern evolution. Although one pattern can be subgraphs of several
positive graphs, each pattern can only be in one candidate list at any time. The
candidate lists are initialized with one-edge patterns.
The total number of subgraph patterns that the candidate lists can hold at any
time is the product of s and | Gp|. The motivation of the design of this framework
is to cause selection pressure which can significantly speed up the convergence of
evolutionary search. When the total size of candidate lists is less than the total
number of patterns that can be found in positive graphs, not all patterns can be
held in the candidate lists at the same time. As a result, one resource that candidate
patterns need to compete for is a slot in candidate lists. In other words, patterns
have to compete for survival and not all patterns are considered in the search
process. Generally speaking, the larger the candidate lists are, the less selection
pressure there is and thereby the more patterns are considered in the search. When
the candidate lists are infinitely large, the search process becomes an exhaustive
search.
Pattern Extension
All candidate patterns currently in the candidate lists have a non-zero probability
of being selected for pattern extension. To perform pattern evolution, GAIA runs
for n iterations, where n is a parameter set by the user. During each iteration,
GAIA selects one pattern from each candidate list for extension. The probability
of pattern p in candidate list of gi to be selected for extension is proportional to the
log ratio score of p and is calculated as follows:
Prob(p) = score(p) / Σ score(p′), where the sum runs over all patterns p′ currently in the candidate list of gi.
The probability is always between 0 and 1 because only patterns with positive log
ratio scores are allowed in candidate lists as described in Subsection 3.2. This
selection method is commonly used in evolutionary algorithms. The intuition here
is that candidate patterns with higher scores are more likely to be extended to
patterns with high scores because structurally similar subgraph patterns have
similar discrimination power [24]. Note that when s = 1, each candidate list only
holds 1 pattern. The probability of this pattern being selected for extension is 1.
When s > 1, multiple patterns may be held in a candidate list. A random number
generator is used to determine which pattern is selected for extension according to
their probabilities.
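A minimal sketch of this roulette-wheel selection, assuming a non-empty candidate list and strictly positive scores:

```python
import random

def select_for_extension(candidate_list, score):
    """Roulette-wheel selection (sketch): pick one pattern from a candidate
    list with probability proportional to its log ratio score. The list is
    assumed non-empty and all admitted scores are positive, so the weights
    are valid."""
    weights = [score(p) for p in candidate_list]
    return random.choices(candidate_list, weights=weights, k=1)[0]
```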
For an extension operation of pattern p, GAIA generates a pattern set X(p) and
each pattern p’ in X(p) has one new edge attached to p. This new edge is not
present in p and it can be either between two existing nodes in p or between one
node in p and a new node. Unlike many previous subgraph pattern mining
algorithms that only extend patterns with certain types of edges in order to
efficiently maintain their canonical codes, GAIA considers all one-edge
extensions of pattern p that occur in the positive graphs. This difference in
extension operation is essential to GAIA because evolutionary computation is
essentially a heuristic search for optimal solution. This difference enables GAIA
to explore the candidate pattern space in any direction that appears promising.
Extensions of different patterns can produce the same pattern because a pattern
p with k edges can be directly extended from all of its subgraphs with k-1 edges.
Therefore, a lookup table is needed by GAIA to determine whether a pattern has
already been generated to avoid repetitive examination of the same pattern.
In most cases, an extension operation on one pattern generates many new patterns
and as a result the number of patterns found by the algorithm grows. Sooner or
later the number of patterns will exceed the number of available positions in the
candidate lists. It is also possible that the number of one-edge patterns already
exceeds the number of available positions in the candidate lists at the very
beginning if s is small. Therefore some rules are needed to determine which
patterns should survive in the candidate lists and which candidate list they should
dwell in.
First, a pattern that has already been extended should not “live” in the candidate
lists any longer because it has served its role in generating new patterns.
Second, some pattern in the candidate list may migrate to the candidate
list of another graph if such migration will increase its chance of survival.
Let p be the candidate pattern for migration and G(p) be the set of graphs
containing p. Let gi be the graph in G(p) which has the lowest value of
Σ score(p′), where the sum runs over the patterns p′ currently in its candidate list. p will migrate to the candidate list of gi.
The rationale for this pattern migration is that if a pattern wants to survive then it
should go to a candidate list with the least fierce competition. In GAIA, the
fierceness of competition of a candidate list is measured by the sum of scores of
patterns in the list.
If the candidate list of gi still has vacant positions, then p can move into one
vacant position directly. However, if the candidate list is already full, then p has to
compete with the “resident” patterns in the list. One straightforward approach to
let p compete with “resident” patterns is to compare the log ratio score of p and
the minimum log ratio score among “resident” patterns. If the score of p is greater
than the minimum score among “resident” patterns, then p takes the position of
pattern p’ with the minimum score and p’ no longer exists in any candidate list;
otherwise, p fails to survive and will not exist in any candidate list. The
disadvantage of this greedy approach is that it ignores the fact that patterns with
low log ratio scores may still have some potential to extend into patterns with high
log ratio scores and patterns with high log ratio scores at the time may have
reached their limits and will never extend to better patterns. Therefore, GAIA
adopts a randomized method for pattern competition which is commonly used by
evolutionary algorithms. The score of p is compared against the score of a pattern
p’, which is randomly selected with probability 1/s from the candidate list. If the
score of p is higher, then p’ is eliminated and p takes the position of p’; otherwise,
p is eliminated. By doing so, GAIA can at least have a chance to protect some of
the “weak” patterns and give them an opportunity to extend into “strong” patterns.
The benefit of this randomized approach is more evident when s is reasonably
large. Note that when s = 1 the randomized strategy is essentially the same as the
greedy strategy.
Again, the exhaustive extension operation is of great importance to allow
pattern competition and elimination. When GAIA eliminates a pattern p, the real
loss is not only this pattern but also the patterns generated by extending p. In
previous subgraph pattern mining algorithms, such as gSpan [23] and FFSM [11],
a pattern p can only be extended from one of its subpatterns, p’. If p’ is lost, then
the algorithms will never find p. As a result, for these algorithms, allowing pattern elimination would inevitably lose many patterns, some of which are discriminative
patterns. But in GAIA, eliminating p’ does not necessarily lead to the loss of p
because the exhaustive extension operation allows p to be extended from many
different patterns. As a result, the risk of missing discriminative patterns is much
lower than other subgraph mining algorithms.
Algorithm: Pattern_Migrate
Input:
p: a subgraph pattern
T: candidate lists
1. g ← the graph in G(p) whose candidate list has the smallest Σ score(p′)
2. if (the candidate list of g has vacant positions)
3. insert p into the candidate list of g
4. else
5. randomly select a pattern p’ in the candidate list of g
6. if (score (p) > score (p’))
7. replace p’ with p
Algorithm: Pattern_Evolution
Input:
Gp: positive graph set
Gn: negative graph set
s: maximum size of each candidate list, by default equal to available_space/|Gp|
n: maximum number of iterations, by default the maximum integer value in the system
Output:
representative patterns: the best pattern for each positive graph
D = {all edges that occur in Gp }
1. for each edge e in D
2. Pattern_Migrate (e, T)
3. for k = 1:n
4. if (all candidate lists are empty)
5. break
6. for each g in Gp
7. randomly select a pattern p in the candidate list of g
8. X (p) = {all patterns in Gp with one more edge attached to p}
9. for each pattern p’ in X (p)
10. if (CCAM code of p’ is in H)
11. continue
12. insert p’ into H
13. Pattern_Migrate (p’, T)
14. update representative patterns
Because GAIA is a randomized algorithm (when s > 1), each single run of pattern
evolution may generate different representative patterns and consume varying
amounts of CPU time. Some runs of pattern evolution may find better
representative patterns than others and thus lead to classifiers with higher
normalized accuracy. Therefore, if GAIA runs many instances of pattern evolution
in parallel and selects the best subgraph patterns from all representative patterns
found by these instances of pattern evolution, it is very likely that GAIA can get a
better set of discriminative subgraph patterns than using representative patterns
from one instance of pattern evolution alone. Therefore, by generating a consensus
model based on many parallel instances of pattern evolution and only using the
fastest instances of pattern evolution, GAIA can improve the discrimination power
of its results and achieve faster expected response by taking advantage of parallel
computing.
All patterns in a graph set can be organized in a tree structure. Each tree node
represents a pattern and is a supergraph of its parent node, with the root node
being an empty graph. Traversing this tree can enumerate all distinct patterns
without repetition. To facilitate this, a graph canonical code is often employed.
Several graph-coding methods have been proposed for this purpose. COM adopted
the CAM (Canonical Adjacency Matrix) code [11], but this method can be easily
applied to other graph coding strategies.
The code of a graph g is not unique because g may have up to (n!) different
adjacency matrices. So COM used standard lexicographic order on sequences to
define a total order on all possible codes. The matrix that produces the maximal
code for a graph g is called the Canonical Adjacency Matrix of g and the
corresponding code is the CAM code of g. The CAM code of a graph g is unique.
It has been proven that exploring a pattern tree with the CAM codes can enumerate all patterns without repetition.
For example, A-D-E is a pattern in graph P1 in Figure 9. Figure 10 shows two
different adjacency matrices of A-D-E. A “1” indicates the existence of an edge
between two nodes while a “0” indicates the absence of an edge. Adjacency
matrix M leads to code A1D01E and adjacency matrix N leads to code D1A10E.
Although both of them are correct codes of A-D-E, D1A10E is less than A1D01E
lexicographically. In fact, A1D01E is the largest code for A-D-E, so it is the CAM
code and adjacency matrix M is the canonical adjacency matrix.
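The brute-force sketch below reproduces this example: it enumerates node orderings, builds the row-by-row lower-triangular code, and keeps the code that is maximal under a supplied character order. The character order used here is an assumption chosen so that, as in the text, A1D01E outranks D1A10E; practical miners construct canonical forms incrementally rather than by enumeration.

```python
from itertools import permutations

def cam_code(node_labels, edges, char_rank):
    """Brute-force CAM code of a small labeled graph: enumerate node
    orderings, read the lower-triangular adjacency matrix row by row
    (diagonal entry = node label, below-diagonal entry = '1'/'0' for edge
    present/absent), and keep the code that is maximal under the supplied
    total order on code characters. Feasible only for tiny patterns."""
    edge_set = {frozenset(e) for e in edges}
    best_code, best_key = None, None
    for order in permutations(node_labels):
        code = ""
        for i, u in enumerate(order):
            for j in range(i):
                code += "1" if frozenset((u, order[j])) in edge_set else "0"
            code += node_labels[u]
        key = tuple(char_rank[c] for c in code)
        if best_key is None or key > best_key:
            best_code, best_key = code, key
    return best_code

# The A-D-E path of Figure 10: node IDs are lowercase, labels uppercase.
# The character order below is an assumption chosen so that A1D01E
# outranks D1A10E, as in the example.
rank = {"A": 3, "D": 2, "E": 1, "1": 0, "0": -1}
print(cam_code({"a": "A", "d": "D", "e": "E"}, [("a", "d"), ("d", "e")], rank))  # A1D01E
```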
With a given scoring function, COM can rank all patterns by their scores. COM
reorganizes the pattern tree to increase the probability that COM visits patterns
with higher score ranks earlier than those with lower score ranks. The need for a
more effective pattern exploration order is due to the fact that most pattern
enumeration algorithms tend to visit patterns with similar conformations together
since they usually have similar codes. This does not cause any side effect on
effectiveness of pattern enumeration, but it has a huge negative impact on finding
complementary discriminative patterns because patterns with similar
conformations are much more likely to have overlapping supporting sets.
COM takes advantage of the following observation: let p be a pattern in the
pattern tree and p' be the parent pattern of p, the score rank of p is correlated with
the value of ∆(p)=score(p) – 2score(p'). For patterns with two nodes, COM sets
their Δ values equal to their scores score(p).
Therefore, when COM explores the pattern space, it first enumerates all
patterns with 2 nodes as candidates and inserts them into a heap structure with the
candidate having the highest ∆ value at the top. Ties are broken by favoring higher
positive frequency and then by CAM code order. Then COM always takes the
pattern at the top of the heap and generates all of its super-patterns with one more
edge by performing the CAM extension operation. COM inserts new patterns into
the heap structure. In this way, COM is able to visit patterns with high score ranks
early and patterns with overlapping supporting sets late. The enumeration
algorithm is described as follows.
Algorithm: COM_enumerate_subgraphs
Input:
G: input graph dataset
1. P ← {all subgraphs with 2 nodes in G}
2. p ← argmax p’ ∈ P (∆(p’))
3. while (p ≠ NULL)
4. e ← {CAM_extension(p)}
5. for each p’ ∈ e
6. if p’ has not been visited
7. P ← P ∪ {p’}
8. P ← P – {p}
9. p ← argmax p’ ∈ P (∆(p’))
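A sketch of this heap-driven exploration order is shown below, assuming helper functions delta, pos_freq, cam_code, and cam_extend (all hypothetical names). Python's heapq is a min-heap, so priorities are negated to obtain a max-heap on ∆.

```python
import heapq
from itertools import count

def explore_by_delta(two_node_patterns, delta, pos_freq, cam_code, cam_extend):
    """Sketch of the heuristic exploration order: a max-heap on delta(p),
    with ties broken by higher positive frequency, then by CAM code, then
    by insertion order. All helper functions are assumed."""
    heap, visited, tie = [], set(), count()

    def push(p):
        heapq.heappush(heap, (-delta(p), -pos_freq(p), cam_code(p), next(tie), p))

    for p in two_node_patterns:
        push(p)
    while heap:
        _, _, code, _, p = heapq.heappop(heap)
        if code in visited:
            continue
        visited.add(code)
        yield p                                   # patterns come out in heuristic order
        for q in cam_extend(p):                   # supergraphs with one more edge
            if cam_code(q) not in visited:
                push(q)
```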
Generating Co-occurrences
Any set of subgraph patterns can form a co-occurrence, but not all of them have
high discrimination power. Ideally, the algorithm can find co-occurrences
consisting of subgraph patterns with high frequency in the positive graph set and
low frequency in the negative graph set. Therefore, COM used two user-specified
parameters tp and tn to quantify the quality of a co-occurrence, where tp is the
minimal positive frequency allowed for a resulting co-occurrence and tn is the
maximal negative frequency permitted. The goal of COM is to find a co-
occurrence set R such that each positive graph contains at least one co-occurrence,
where each co-occurrence in R has positive frequency no less than tp and negative
frequency no greater than tn.
This problem can be proved to be equivalent to the set cover problem and is
therefore NP-complete. It is intractable to find an optimal solution in the enormous
pattern space. Therefore, COM adopted a greedy approach for rule generation. Let
the candidate co-occurrence set be Rt and the resulting co-occurrence set be R. The
algorithm explores the pattern space using the heuristic order described above and
whenever it comes to a new pattern p that has not been processed before, if there
exists one positive graph that contains p but none of the existing co-occurrences,
the algorithm generates a new candidate co-occurrence containing only p and
examines the possibility of merging this new co-occurrence into existing candidate
co-occurrences. Given a new pattern p and a candidate co-occurrence rt, Δ(p, rt) =
score(rt ∪ {p}) – score(p). Pattern p is to be inserted into candidate co-occurrence
r’, r’ = argmax(Δ(p, rt)), Δ(p, rt) ≥ 0. If there are patterns in r’ whose supporting
sets are supersets of the supporting set of p, then inclusion of p into r’ will make
these patterns redundant. These patterns will be removed from r’ when p is
inserted. Then, for either the newly generated co-occurrence {p} or the updated r’,
if it has pfreq(p) ≥ tp and nfreq(p) ≤ tn and it is present in at least one positive
graph that does not contain any co-occurrence in R, it will be removed from Rt and
inserted into R. The algorithm terminates either when all patterns are explored or
when each positive graph contains at least one resulting co-occurrence. Although
in the worst case the algorithm is still exhaustive, experiments show that it is time
efficient in practice.
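The sketch below illustrates only the promotion step of this greedy procedure, under assumed helpers for support and frequency: a candidate co-occurrence is moved into the result set when it satisfies the tp/tn thresholds and covers a still-uncovered positive graph.

```python
def promote_cooccurrences(candidates, result, uncovered, support, pfreq, nfreq, tp, tn):
    """Sketch of the promotion step: move a candidate co-occurrence into the
    result set when its positive frequency is at least tp, its negative
    frequency is at most tn, and it occurs in at least one positive graph
    not yet covered. support(c) returns the positive graphs containing
    co-occurrence c; pfreq/nfreq return its frequencies. The helpers and
    the set-based representation are illustrative assumptions."""
    for c in list(candidates):
        newly_covered = support(c) & uncovered
        if pfreq(c) >= tp and nfreq(c) <= tn and newly_covered:
            candidates.remove(c)
            result.append(c)
            uncovered -= newly_covered
    return result, uncovered
```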
For example, let tp = 50% and tn = 0%, in Figure 9, the frequent subgraphs of 2
nodes in the positive set are A-B, B-C, D-E, D-G, and G-H. Only positive patterns
with frequency no less than tp need to be considered because (1) as mentioned
earlier the algorithm only considers positive patterns and (2) the frequency of a
co-occurrence with patterns less frequent than tp must be less than tp as well. The
algorithm initializes the rule sets to be empty: R’ = {} and R = {}.
According to the pattern exploration order introduced above, A-B is the
first pattern to process. For simplicity, the example is designed so that these edges
cannot extend to any larger patterns with positive frequency no less than tp. A new
candidate co-occurrence {A-B} is added into R’. Note that R’ was empty, so there was no existing candidate co-occurrence in R’ into which A-B could be inserted. Next, {B-C} is added into R’ and
B-C is added into candidate co-occurrence {A-B} because Δ({B-C}, {A-B}) is no
less than 0. The modified candidate co-occurrence {A-B, B-C} has pfreq ≥ tp and
nfreq ≤ tn, therefore it is removed from R’ and added into R. Next, D-E is at the top
of the heap, but there is no need to consider it because both of its supporting
graphs, P1 and P2, contain co-occurrence {A-B, B-C} and therefore considering D-
E cannot lead to a better classifier. Then, following a similar procedure, the
algorithm can generate co-occurrence {D-G, G-H} and add it into R. Now the
algorithm terminates because: 1) the heap structure for candidate patterns is empty
and 2) {A-B, B-C} and {D-G, G-H} are sufficient to cover all graphs in the
positive set. For each step, the initial status of R’, R, the pattern at the heap top and
the set of positive graphs that contain none of the co-occurrences in R are shown
in Figure 11.
3 Evaluation
4 http://www.rcsb.org/pdb
5 http://scop.mrc-lmb.cam.ac.uk/scop/
Bioassay ID   Tumor description     # of actives   # of inactives
1             Non-Small Cell Lung   2047           38410
33            Melanoma              1642           38456
41            Prostate              1568           25967
47            Central Nerv Sys      2018           38350
81            Colon                 2401           38236
83            Breast                2287           25510
109           Ovarian               2072           38551
123           Leukemia              3123           36741
145           Renal                 1948           38157
167           Yeast anticancer      9467           69998
330           Leukemia              2194           38799
6 http://pubchem.ncbi.nlm.nih.gov
3.2 Comparison
For chemical datasets, LEAP finds the most discriminative subgraph patterns
among the four algorithms, but it is almost two orders of magnitude slower than
the other three algorithms. Therefore, if optimizing pattern discrimination power is
crucial and dataset is relatively small, LEAP is the best choice for discriminative
subgraph pattern mining. However, when the dataset is large, LEAP cannot finish
in a reasonable amount of time. GAIA and LTS offer better trade-off between
pattern quality and runtime efficiency. Between the two, LTS finds better patterns
in less time than GAIA. COM is faster than LEAP, but it does not find
competitive subgraph patterns compared with LEAP.
For protein datasets, LEAP does not find the most discriminative subgraph
patterns among the four algorithms, even though it still takes a much longer time, because its structural leap search is less efficient when the candidate patterns are
less similar to each other. Both LTS and GAIA are significantly faster than LEAP
and find more discriminative subgraph patterns. LTS finds more discriminative
subgraph patterns than GAIA, but takes slightly longer than GAIA. COM is faster than LEAP, but its patterns are not as discriminative as those of LEAP.
In general, the strength of LEAP and CORK is to find subgraph patterns with
optimal discrimination scores and the cost is significantly longer runtime. In
addition, experiments show that LEAP is more effective at processing graphs
whose subgraphs are more similar to each other. We did not have access to the
original implementation of CORK and thus were unable to evaluate CORK in
experiments, but CORK can only use a specific measurement for discrimination
power and the measurement can be undesirable in certain applications.
The strength of GAIA, LTS, graphSig and COM is to provide better trade-off
between subgraph discrimination power and runtime efficiency. Among the four,
graphSig has the advantage of making good use of domain knowledge because
well studied substructures can be used as features to facilitate the mining process.
Experiments show that COM is faster than LEAP but its pattern quality is not
competitive, at least for the protein and chemical datasets we tested. The
advantage of LTS is its fast speed and highly competitive pattern quality. In fact, it
outperforms LEAP for the protein datasets and only trails LEAP slightly for the
chemical datasets in terms of pattern discrimination power. The advantage of
GAIA is that it can be run in parallel to further improve runtime efficiency.
4 Summary
References
1. Bandyopadhyay, D., Huan, J., Liu, J., Prins, J., Snoeyink, J., Wang, W., Tropsha, A.:
Structure-based function inference using protein family-specific fingerprints. Protein
Science 15, 1537–1543 (2006)
2. Bandyopadhyay, D., Huan, J., Prins, J., Snoeyink, J., Wang, W., Tropsha, A.:
Identification of family-specific residue packing motifs and their use for structure-
based protein function prediction: I. Method development. J. Comput. Aided Mol. Des.
(2009)
3. Bandyopadhyay, D., Huan, J., Prins, J., Snoeyink, J., Wang, W., Tropsha, A.:
Identification of family-specific residue packing motifs and their use for structure-
based protein function prediction: II. Case studies and applications. J. Comput. Aided
Mol. Des. (2009)
4. Chen, B.Y., et al.: Geometric sieving: Automated distributed optimization of 3D
motifs for protein function prediction. In: Apostolico, A., Guerra, C., Istrail, S.,
Pevzner, P.A., Waterman, M. (eds.) RECOMB 2006. LNCS (LNBI), vol. 3909, pp.
500–515. Springer, Heidelberg (2006)
5. Chen, W.-Y., Zhang, D., Chang, E.: Combinational Collaborative Filtering for
Personalized Community Recommendation. In: ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), pp. 115–123 (2008)
6. Fei, H., Huan, J.: Structure Feature Selection For Graph Classification. In: ACM 17th
International Conference of Knowledge Management 2008 (CIKM 2008), Napa
Valley, California (2008)
7. Fei, H., Huan, J.: Boosting with Structure Information in the Functional Space: an
Application to Graph Classification. In: Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, SIGKDD (2010)
8. Fröhlich, H., Wegner, J.K., Sieker, F., Zell, A.: Optimal Assignment Kernels for
Attributed Molecular Graphs. In: Proceedings of the 22nd International Conference on
Machine Learning (ICML), pp. 225–232 (2005)
9. Helma, C., Cramer, T., Kramer, S., Raedt, L.D.: Data mining and machine learning
techniques for the identification of mutagenicity inducing substructures and structure
activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci. 44,
1402–1411 (2004)
10. Hsu, H., Jones, J.A., Orso, A.: RAPID: Identifying bug signatures to support
debugging activities. In: ASE (Automated Software Engineering) (2008)
11. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraph in the presence of
isomorphism. In: Proceedings of the 3rd IEEE International Conference on Data
Mining (ICDM), pp. 549–552 (2003)
12. Huan, J., Wang, W., Bandyopadhyay, D., Snoeyink, J., Prins, J., Tropsha, A.: Mining
spatial motifs from protein structure graphs. In: RECOMB, pp. 308–315 (2004)
13. Huan, J., Bandyopadhyay, D., Prins, J., Snoeyink, J., Tropsha, A., Wang, W.:
Distance-based identification of spatial motifs in proteins using constrained frequent
subgraph mining. In: Proceedings of the LSS Computational Systems Bioinformatics
Conference (CSB), pp. 227–238 (2006)
14. Jin, N., Young, C., Wang, W.: Graph Classification Based on Pattern Co-occurrence.
In: Proceedings of the ACM 18th Conference on Information and Knowledge
Management (CIKM), pp. 573–582 (2009)
15. Jin, N., Young, C., Wang, W.: GAIA: graph classification using evolutionary
computation. In: Proceedings of the ACM SIGMOD International Conference on
management of Data, pp. 879–890 (2010)
16. Jin, N., Wang, W.: LTS: Discriminative subgraph mining by learning from search
history. In: ICDE 2011, pp. 207–218 (2011)
17. Khan, A., Yan, X., Wu, K.-L.: Towards Proximity Pattern Mining in Large Graphs. In:
SIGMOD 2010 (Proc. 2010 Int. Conf. on Management of Data) (June 2010)
18. Ranu, S., Singh, A.K.: GraphSig: A Scalable Approach to Mining Significant
Subgraphs in Large Graph Databases. In: Proceedings of the 25th International
Conference on Data Engineering (ICDE), pp. 844–855 (2009)
19. Smalter, A., Huan, J., Lushington, G.: A Graph Pattern Diffusion Kernel for Chemical
Compound Classification. In: Proceedings of the 8th IEEE International Conference on
Bioinformatics and BioEngineering, BIBE 2008 (2008)
20. Smalter, A., Huan, J., Lushington, G.: Graph Wavelet Alignment Kernels for Drug
Virtual Screening. Journal of Bioinformatics and Computational Biology 7(3), 473–
497 (2009)
21. Saigo, H., Kraemer, N., Tsuda, K.: Partial Least Squares Regression for Graph Mining.
In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD 2008), pp. 578–586 (2008)
22. Thoma, M., Cheng, H., Gretton, A., Han, J., Kriegel, H., Smola, A., Song, L., Yu, P.,
Yan, X., Borgwardt, K.: Near-optimal supervised feature selection among frequent
subgraphs. In: SDM 2009, Sparks, Nevada, USA (2009)
23. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proceedings of
the 2002 IEEE International Conference on Data Mining, pp. 721–724 (2002)
24. Yan, X., Cheng, H., Han, J., Yu, P.S.: Mining significant graph patterns by leap
search. In: Proceedings of the ACM SIGMOD International Conference on
Management of Data, pp. 433–444 (2008)
25. Yao, H., Kristensen, D.M., Mihalek, I., Sowa, M.E., Shaw, C., Kimmel, M., Kavraki,
L., Lichtarge, O.: An accurate, sensitive, and scalable method to identify functional
sites in protein structures. J. Mol. Biol. 326, 255–261 (2003)
26. Zhang, X., Wang, W., Huan, J.: On demand Phenotype Ranking through Subspace
Clustering. In: Proceedings of SIAM International Conference on Data Mining, SDM
(2007)
27. Zhang, S., Yang, J.: RAM: Randomized Approximate Graph Mining. In: Proceedings
of the 20th International Conference on Scientific and Statistical Database
Management, pp. 187–203 (2008)
Path Knowledge Discovery: Multilevel Text
Mining as a Methodology for Phenomics
Chen Liu, Wesley W. Chu, Fred Sabb, D. Stott Parker, and Robert Bilder
1 Introduction
Increasingly, scientific discovery requires the connection of concepts across disci-
plines, as well as systematizing their interrelationships. Doing this can require link-
ing vast amounts of knowledge from very different domains. Experts in different
fields still publish their discoveries in specialized journals, and even with the in-
creasing availability of scientific literature in electronic media, it remains difficult
to connect these discoveries. For example, an expert in cognitive assessment may
know little about signaling pathways or genes, while an expert in genetics may lack
knowledge about cognitive phenotypes. Although informatics tools such as search
engines are very successful when it comes to helping people search for and retrieve
information, these systems unfortunately lack the capability to connect the knowl-
edge. To overcome this basic problem, new methodologies are needed for scalable
and effective knowledge discovery and integration.
This work was motivated specifically by research on complex neuropsychi-
atric syndromes such as schizophrenia, ADHD and bipolar disorder. A multilevel
framework has been proposed by the Consortium for Neuropsychiatric Phenomics
at UCLA (www.phenomics.ucla.edu) to help systematize discovery [9]. Figure 1
presents a multilevel concept schema, with sample concepts at different levels. Un-
der such a multilevel framework, it is important to understand the relationships
among concepts at different levels, which form “paths” across the multilevel struc-
ture. For example, under the multilevel framework, one may be interested in a set
of related questions such as the following: What symptoms are related to schizophrenia?
Which parts of the brain would be affected? What signaling pathway is related?
And finally, which genes are related to this pathway? In recent studies, researchers
discovered that schizophrenia patients usually have deficits in their working mem-
ory function, and working memory is related to neuroanatomic concepts such as
prefrontal cortex. Further, genes such as COMT affect dopamine signaling and thus
affect working memory functionality [14]. We can describe such a sequence of re-
lationships with a path “schizophrenia → working memory → prefrontal cortex →
[Figure 1 graphic: concept levels including Symptoms, Neuroanatomy, Cognitive Tasks and Genes, with sample concepts such as cognitive and emotional deficits, prefrontal cortex, hippocampus, Stroop, Wisconsin Card Sorting, Digit Span, DRD1, DRD2 and COMT.]
Fig. 1 The multilevel schema proposed by the Consortium for Neuropsychiatric Phenomics,
at left, with a sample hierarchy of concepts at right that include phenotypes related to three
syndromes
dopamine → COMT.” Thus, paths are able to describe interactions among concepts
and associations across disciplines.
The path schema shown in Figure 1 is both a hierarchy of relevant concepts and a
hierarchy of phenotypes. Phenotypes are observable characteristics of organisms —
such as color, shape, and experimentally-measured quantities. Although the space
of human genetic variation is large, the space of human phenomic variation is much
larger and more diverse, ranging across many science disciplines. Furthermore, phe-
notypes are influenced by the environment. Phenomics — the systematic study of
phenotypes on a genome-wide scale [8] — requires consideration of experimen-
tal findings across a broad schema of phenotypes. By nature, then, phenomics is a
transdisciplinary undertaking that requires new methodologies.
Perhaps the most essential result of this work is that paths can serve as the basis
of a scalable methodology for multilevel knowledge discovery, so that in particu-
lar, when the path schema defines a hierarchy of phenotypes, the path knowledge
discovery methodology can be useful for phenomics.
The path knowledge discovery problem is challenging for the following reasons.
First, a path describes a sequence of associations across multiple levels of knowl-
edge. Although existing data mining methods such as Apriori [3] perform well when
identifying high-confidence pairwise associations, mining interrelated associations
still remains an open problem. Second, associations alone do not provide sufficient
information for knowledge discovery. It is important to understand how the concepts
are interrelated, and it is necessary to retrieve information that can support the associ-
ations. However, a traditional information retrieval system is unable to answer such
a specific query. As described above, the path knowledge discovery problem can be
decomposed into two integral parts: 1) identifying paths describing relations among
concepts at multiple concept levels, and 2) retrieving content corresponding to the
paths from the corpus to explain the interrelations.
We developed PhenoMining tools to solve the path knowledge discovery problem
in phenomics. The tools are built based on a multilevel phenotype lexicon that is
constructed using domain knowledge from experts, and on a corresponding corpus
of scientific literature selected by experts. Two tools have been developed to solve
the two problems in path knowledge discovery above: 1) the PathMining tool is able
to identify associations among concepts in the lexicon in order to construct a path
based on their co-occurrence in the corpus, and it provides a quantitative way to
measure the strength of associations. 2) The Document Content Explorer tool finds
relevant published information for a specific path at fine granularity, so as to explain
the interrelations.
These PhenoMining tools can aid in constructing a phenotype knowledge base
such as PhenoWiki [25] by providing efficient path knowledge discovery. A knowl-
edge base extension called PhenoWiki+ that integrates mining results with the
knowledge base has also been developed to facilitate storage, retrieval and update of
path knowledge discovered with PhenoMining tools. Figure 2 describes the process
of path knowledge discovery and management using our methodology.
Section 2 presents the infrastructure that facilitates path knowledge discovery,
including the multilevel lexicon and the corpus data set, and the index for association
analysis and content retrieval. In Sections 3 and 4 we introduce our approach for
path knowledge discovery, which consists of path mining with multiple associations
and relevant content retrieval. We demonstrate an application of path knowledge
discovery in examination of the heritability of cognitive control in Section 5. In
Section 6, we present PhenoWiki+, a knowledge repository that integrates mining
results with the hierarchical multilevel framework. We then review related work in
Section 7 and conclude our discussion in Section 8.
[Figure 2 graphic: the process of path knowledge discovery and management. Starting from an entry concept, related concepts and related paths (e.g., A → D → G, A → D → H) are identified, and relevant literature supporting the paths (e.g., "Function of A is influenced by D") is retrieved.]
<article
  authors="Robert M. Bilder, Fred W. Sabb, D. Stott Parker, Donald Kalar,
           Wesley W. Chu, Jared Fox, Nelson B. Freimer, and Russell A. Poldrack"
  date="July 2009" journal="Cognitive neuropsychiatry"
  pmcid="2752634" pmid="19634038"
  title="Cognitive Ontologies for Neuropsychiatric Phenomics Research">
  ...
  <section id="3" title="Managing complexity in the Human Phenome Project">
    ...
    <paragraph id="6">
      ...
      <sentence id="33" paragraphId="6" refs="F1">
        Within the Consortium for Neuropsychiatric Phenomics at UCLA
        (www.phenomics.ucla.edu), we have used a simple schematic
        scaffold for translational neuropsychiatric research from genome
        to syndrome, using seven levels (see Figure 1).
      </sentence>
    </paragraph>
    ...
  </section>
  ...
  <figure id="F1" refs="33"
    url="/pmc/articles/PMC2752634/figure/F1/"
    smallThumb="/pmc/articles/PMC2752634/bin/nihms-134130-f0001.gif"
    title="Figure 1">
    Simplified schematic of multileveled -omics domains for
    cognitive neuropsychiatry.
  </figure>
  ...
</article>
Fig. 4 The PMDoc presentation of document elements for the paper “Cognitive Ontologies
for Neuropsychiatric Phenomics Research” [9]
Fig. 5 The structure of the PMDoc file corresponding to the XML presented in Figure 4
Posing a path mining query requires some prior knowledge of the field; i.e., which
concepts should be specified in which levels. More importantly, the results are lim-
ited by the number of levels in the query. If the query misses a specific level of
concepts, then the information in that level will not be discovered.
To address this problem, we introduced the idea of “wildcard queries” in path
mining, where queries can leave multiple intermediate concept levels unspecified. If
wildcard levels are used, all concept elements from the corresponding levels of the
lexicon are considered. When specifying the query, the user may put the wildcard
connectors between concept levels. We support three types of wildcard connectors
in the query: “-” specifies no wildcard levels, “?” specifies zero or one wildcard
level, and “*” specifies arbitrarily many wildcard levels.
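As an illustration, the sketch below shows one way the three wildcard connectors could be expanded into candidate sequences of lexicon levels. This is a hypothetical Python sketch, not the PhenoMining implementation; the ordering of LEXICON_LEVELS and the helper names are assumptions made for the example.

# Hypothetical sketch of wildcard-level expansion; the level ordering and the
# query format are illustrative assumptions, not the PhenoMining API.
LEXICON_LEVELS = ["syndrome", "symptom", "cognitive concept",
                  "neuroanatomy", "cognitive task", "gene"]

def expand_wildcards(query_levels, connectors):
    """Expand a path query into candidate sequences of lexicon levels.

    query_levels: levels named explicitly in the query, e.g. ["syndrome", "neuroanatomy", "gene"]
    connectors:   one connector per gap between named levels:
                  "-" = no wildcard level, "?" = zero or one, "*" = any number
    """
    def gap_options(lo, hi, connector):
        between = LEXICON_LEVELS[lo + 1:hi]
        if connector == "-":
            return [[]]
        if connector == "?":
            return [[]] + [[lvl] for lvl in between]
        if connector == "*":
            # any contiguous run of intermediate levels, including none
            runs = {tuple(between[i:j])
                    for i in range(len(between) + 1)
                    for j in range(i, len(between) + 1)}
            return [list(run) for run in sorted(runs, key=len)]
        raise ValueError("unknown connector: " + connector)

    idx = [LEXICON_LEVELS.index(level) for level in query_levels]
    sequences = [[query_levels[0]]]
    for (lo, hi), connector in zip(zip(idx, idx[1:]), connectors):
        sequences = [seq + middle + [LEXICON_LEVELS[hi]]
                     for seq in sequences
                     for middle in gap_options(lo, hi, connector)]
    return sequences

# Example: a query of the form "syndrome -> * -> neuroanatomy -> * -> gene"
for seq in expand_wildcards(["syndrome", "neuroanatomy", "gene"], ["*", "*"]):
    print(" -> ".join(seq))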
The query interface for the PathMining tool is presented in Figure 6. Three levels
are specified in this example, where the first level includes syndromes, the second
level includes neuroanatomical regions, and the third level includes genes. Users can
use the drop-down menu to the right of the query concept for each level to indicate
whether or not to include the subconcepts in the level. In this example, all three
levels include all subconcepts of the query, and the query also includes any number
of wildcard levels between level 2 (neuroanatomy) and level 3 (genes).
Fig. 6 Using the PathMining query interface to specify a path query. The radio buttons on the
right are options for wildcard levels, where “-”, “?” and “*” indicate no wildcard levels, zero
or one wildcard levels, and any number of wildcard levels, respectively. This query searches
for paths that match the pattern “syndrome → * → neuroanatomy → * → genes”.
index records the co-occurrences of pairs of concepts in the document elements. The
association strength can be further computed from co-occurrence frequencies. The
data mining community uses support and confidence to measure the strength of an
association A → B between concepts A and B [3]:
support(A → B) = σ(A ∩ B)    (1)

confidence(A → B) = σ(A ∩ B) / σ(A)    (2)
where σ (A) stands for the proportion of the documents in the corpus containing the
concept A, and σ (A ∩ B) stands for the proportion of the documents in the corpus
containing both concepts A and B.
Support measures the proportion of documents in which two concepts co-occur,
and represents the probability of co-occurrence across the whole corpus. Confidence
estimates the conditional probability of occurrence of B given A’s occurrence. If
we consider the occurrence of A and B as random events, we can also measure the
strength of the association using the Pearson correlation ρA,B between the two events
ρA,B = (E(A, B) − E(A)E(B)) / √( E(A)(1 − E(A)) · E(B)(1 − E(B)) )    (3)
where E(A) is the expectation of the probability of occurrence for the concept A
(i.e., σ(A)). Tan et al. [32] pointed out that ρA,B can be approximated by IS(A, B):

ρA,B ≈ IS(A, B) = √( I(A, B) × σ(A, B) )    (4)
where I(A, B) = p(A, B) / (p(A)p(B)) is the interest factor [29]. The interest factor computes
the ratio of the probability of co-occurrence to the expected probability of co-
occurrence given that A and B are independent of one another. The above approxi-
mation holds when I(A, B) is high, and both p(A) and p(B) are very small, which in
general fits the case of occurrences of concepts in a large text corpus. We can regard
IS as an alternative interpretation of the association rule that does not indicate an in-
ference from antecedents to consequents, but rather a measure of closeness between
two concepts.
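For concreteness, a minimal sketch of these three measures computed from a document-element index follows; the index layout (concept mapped to a set of element ids) and the toy corpus size are assumptions made for illustration, not the PhenoMining index format.

import math

# Minimal sketch of support, confidence and IS over a document-element index;
# the index layout and corpus size are illustrative assumptions.
index = {
    "schizophrenia":  {1, 2, 3, 5, 8},
    "working memory": {2, 3, 5, 7, 9},
}
N = 10  # total number of document elements in the corpus (assumed)

def sigma(*concepts):
    """Proportion of document elements containing all given concepts."""
    elements = set.intersection(*(index[c] for c in concepts))
    return len(elements) / N

def support(a, b):
    return sigma(a, b)

def confidence(a, b):
    return sigma(a, b) / sigma(a)

def is_measure(a, b):
    """IS(A, B) = sqrt(I(A, B) * sigma(A, B)) = sigma(A, B) / sqrt(sigma(A) * sigma(B))."""
    return sigma(a, b) / math.sqrt(sigma(a) * sigma(b))

print(support("schizophrenia", "working memory"),
      confidence("schizophrenia", "working memory"),
      is_measure("schizophrenia", "working memory"))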
The conventional association rule mining problem is to find all associations
whose strength indicators, such as support, confidence, and IS measure, are above
given thresholds. Algorithms such as Apriori [3] solve the problem by generating
the frequent item sets and then counting the support for the candidates in a bottom-
up fashion. The FP-growth algorithm [12] solves the problem with the efficient data
structure, frequent pattern tree (FP-Tree). In order to address the path mining prob-
lem, instead of finding individual associations, we need to measure the strength of a
sequence of associations among the concepts in a path.
Local Strength
Global Strength
Confidence(AB → C) = σ(A ∩ B ∩ C) / σ(A ∩ B)    (5)
With this definition, the confidence is the conditional probability that C is part of
the path given that A → B is part of the path.
The correlation measure of the link can be derived by computing the correlation
between two random events: the co-occurrence of all previous antecedents as one
random event and the occurrence of the consequent as the other. According to this
definition, the correlation score of the second link of A → B → C → D can be
computed as IS(AB,C).
Figure 7 presents an example path measured by the two different approaches. The
support, confidence and IS measures are computed using the local strength measure
and the global strength measure in Figure 7(a) and Figure 7(b), respectively. For the
global strength measure, since the association takes all preceding concepts as an
antecedent, the support value of the association decreases when the path length in-
creases, and confidence and IS scores change correspondingly. This property makes
Fig. 7 Two different approaches for measuring the strength of associations for the path
“schizophrenia → working memory → PFC → dopamine”: (a) the local strength measure,
(b) the global strength measure. The thickness of the links in the path is proportional to the
IS score of the corresponding associations.
it more difficult to find a high-support path when more concepts are included in the
path.
The major difference between the two approaches lies in the different require-
ments for co-occurrence of concepts in the paths. In the global approach, all the
concepts are required to appear at least once in the same document element in order
to ensure a non-zero confidence. On the other hand, the local approach only requires
adjacent concepts in the path to appear in the same document elements. This differ-
ence leads to two different types of applications for path discovery. For the global
approach, since there is at least one item of literature that explicitly discusses all the
concepts in the paths, path mining reveals existing investigations in the literature
covering the path. The local approach forms a path by stitching high-strength pair-
wise associations together. The paths discovered with the local approach have not
necessarily been studied previously in literature, but may have a good potential for
future study since each pair of concepts in the paths is well related. Therefore, the
local approach can be applied to scenarios focusing on discovery of new paths and
generating new hypotheses.
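The two path-level measures can be sketched compactly as follows; the toy index, corpus size and function names are assumptions for illustration and do not reflect the PathMining implementation.

import math

# Sketch of the local vs. global strength measures for a path A -> B -> C;
# the toy index and corpus size are illustrative assumptions.
index = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 6}, "C": {3, 4, 7}}
N = 8

def sigma(concepts):
    return len(set.intersection(*(index[c] for c in concepts))) / N

def link_strength(antecedent, consequent):
    """Support, confidence and IS for one link, given its antecedent concept set."""
    both = antecedent + [consequent]
    sup = sigma(both)
    conf = sup / sigma(antecedent)
    corr = sup / math.sqrt(sigma(antecedent) * sigma([consequent]))
    return sup, conf, corr

def path_strength(path, global_measure=True):
    """Strengths of each link; the global measure accumulates all preceding concepts."""
    scores = []
    for i in range(1, len(path)):
        antecedent = path[:i] if global_measure else [path[i - 1]]
        scores.append(link_strength(antecedent, path[i]))
    return scores

print("local: ", path_strength(["A", "B", "C"], global_measure=False))
print("global:", path_strength(["A", "B", "C"], global_measure=True))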
equivalent to a graph search problem. For the global approach, the path mining
problem can be viewed as an extension of traditional association rule mining.
its associations fails to meet the threshold. This property makes the path equivalent
to the frequent item sets in the Apriori algorithm.
The difference between path mining and traditional association rule mining is
that a path has more than one association involved, and we need to check and main-
tain the strengths of all the associations in the path (such as confidence and IS).
Although path mining provides more information, the computation cost is the same
as traditional association rule mining. According to the definition, the computation
of confidence and correlation is only affected by preceding links in the paths. There-
fore, as the path grows, we only need to compute the strength of newly added links,
which makes the complexity equivalent to conventional association rule mining us-
ing the Apriori algorithm.
Fig. 8 The seven most relevant paths for path query “syndrome → * → cognitive concept →
* → genes” (as specified in Figure 6). Based on the lexicon, schizophrenia is a syndrome-level
concept, working memory and executive function are cognitive-level concepts, and dopamine,
DAR1 and COMT are gene-level concepts. These concepts match the pattern specified
explicitly in the path query. Meanwhile, cognitive deficits and PFC are symptom-level and
neuroanatomy-level concepts, which are introduced as wildcard levels in the paths. As a re-
sult, the returned paths have four or five concepts where one or two among them are wildcard
concepts.
a consensus of multiple paths can provide a more complete picture. Figure 9 presents
a graph structure aggregated by top paths returned in Figure 8.
Fig. 9 A PhenoGraph can be generated by combining the paths returned from a path query.
The above PhenoGraph is generated for the path query “syndrome → * → cognitive concept → * →
genes”, as shown in Figure 6. In the PhenoGraph, Attention Deficit Disorder with Hyperac-
tivity and Schizophrenia are syndrome concepts, executive function and working memory are
cognitive concepts, and COMT, DAR1, DRD2 and Dopamine are gene/signaling pathways
concepts. In addition, cognitive deficits, Wisconsin card sorting and PFC are symptom-level,
task-level and neuroanatomy-level concepts respectively. These three concepts were inserted
into the paths as wildcard intermediate levels.
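One simple way to aggregate returned paths into such a PhenoGraph is sketched below; this is an illustrative reading of the aggregation rather than the tool's implementation, and the example paths and scores are assumed rather than actual mining output.

from collections import defaultdict

# Illustrative sketch: merge the links of top-ranked paths into a single graph,
# keeping the strongest score seen for each (antecedent, consequent) edge.
# The example paths and scores are assumptions, not actual mining output.
top_paths = [
    (["schizophrenia", "cognitive deficits", "working memory", "COMT"], [0.41, 0.34, 0.22]),
    (["schizophrenia", "working memory", "PFC", "dopamine"],            [0.39, 0.28, 0.31]),
]

def build_phenograph(paths):
    edges = defaultdict(float)
    for concepts, link_scores in paths:
        for (a, b), score in zip(zip(concepts, concepts[1:]), link_scores):
            edges[(a, b)] = max(edges[(a, b)], score)  # keep the best score per edge
    return dict(edges)

for (a, b), score in build_phenograph(top_paths).items():
    print(f"{a} -> {b}: {score:.2f}")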
the path is converted to a query for document retrieval. Then, based on the docu-
ment element index, contents at various granularities are matched to the query and
the most relevant ones are returned to users. Finally, the results are classified and
presented to the user for further analysis. The Document Content Explorer works in
concert with path mining to conduct path knowledge discovery. For each path result
returned from the PathMining tool, a user can use the “Retrieve Relevant Content”
link to connect to the Document Content Explorer and retrieve path content. In this
section we will present our approach to completing these tasks. The preprocessing,
content matching and post-processing of retrieved content for paths are discussed in
Sections 4.1, 4.2 and 4.3, respectively.
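A minimal sketch of the path-to-query translation step is given below; the boolean form mirrors the pairing of adjacent concepts described for the pipeline, while the function names and toy index are assumptions for illustration.

# Illustrative sketch: translate a path A -> B -> C into a boolean retrieval
# query over document elements, pairing adjacent concepts.
def path_to_query(path):
    """Boolean query pairing adjacent concepts: "(A AND B) OR (B AND C)" for A -> B -> C."""
    return " OR ".join(f"({a} AND {b})" for a, b in zip(path, path[1:]))

def matching_elements(path, index):
    """Ids of document elements that satisfy at least one adjacent-concept clause.

    index maps each concept to the set of element ids mentioning it (assumed layout).
    """
    hits = set()
    for a, b in zip(path, path[1:]):
        hits |= index[a] & index[b]
    return hits

path = ["schizophrenia", "working memory", "PFC", "dopamine"]
index = {"schizophrenia": {1, 2}, "working memory": {2, 3}, "PFC": {3, 4}, "dopamine": {4}}
print(path_to_query(path))
print(matching_elements(path, index))   # -> {2, 3, 4}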
[Figure graphic: the path content retrieval pipeline. A path (e.g., A → B → C) is translated into a content retrieval query (e.g., "(A and B) or (B and C)") with query expansion, matched against relevant content (papers, assertions, figures, tables), and post-processed into classified results such as quants and sample characteristics.]
Queries in the Document Content Explorer are concept-based. Each query word
is translated to a concept and matches all concept synonyms. This approach helps
match more relevant content. However, in some situations, simple synonym-based
expansion is inadequate. Fortunately, using the lexicon hierarchy, we may be able
to further expand the query.
As described in Section 2.1, concepts are organized in a hierarchical structure.
From the structure, we will be able to obtain subconcepts for a given concept. For ex-
ample, “DRD1”, “DRD2” and “D5-like” are sub-concepts of “dopamine receptors”.
The hierarchical structure is strictly defined by an “is-a” relation. Users searching
for content about dopamine receptors could also be interested in content about spe-
cific types of receptors. We can use the hierarchy to rewrite the query to search this
larger scope, and this may obtain better results. For general queries, we can expand
the original query by including subconcepts. For instance, the query “dopamine
Fig. 11 User interface of the Document Content Explorer. User specifies the query in the
“input panel” on the right and the relevant papers are displayed in the “results panel” on the
left. The query shown includes four concepts: schizophrenia, working memory, prefrontal
cortex and dopamine, translated from the path “schizophrenia → working memory → PFC
→ dopamine”.
detailed view of a paper provides a quick summary that permits users to quickly
grasp the relevance of the results.
Fig. 12 Detailed view of Document Content Explorer. The paper “Cognitive control deficits
in schizophrenia: mechanisms and meaning” [16] has been retrieved with the path query
“schizophrenia → working memory → PFC → dopamine”. Fine-granularity content in the
paper is classified into different categories such as task description, sample characteristics
and quantitative results. Content of different categories is presented in different tabs. The
selected tab (task figures) shows figures in the paper that include task descriptions. Both the
task descriptor keywords (“Stroop”) and the query keywords (“prefrontal cortex, PFC”) are
highlighted.
In our PhenoMining tool, we classify results into three categories, “task descrip-
tion,” “sample characteristics” and “quantitative indicators” to facilitate different
research purposes. If no concepts corresponding to these categories are found in a
document element, it is classified as “general content.” With proper training data,
it is possible to extend such simple rule-based classifications to machine-learning-
based classifiers to improve the accuracy of classification.
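The rule-based classification can be pictured roughly as follows; the category cue lists below are illustrative assumptions inspired by the lexicon idea, not the tool's actual rules.

# Rough sketch of rule-based classification of a document element; the category
# keyword lists are illustrative assumptions, not the actual PhenoMining rules.
CATEGORY_CUES = {
    "task description":        {"stroop", "n-back", "digit span", "wisconsin card sorting"},
    "sample characteristics":  {"participants", "subjects", "age", "gender"},
    "quantitative indicators": {"accuracy", "reaction time", "heritability", "h2"},
}

def classify_element(text):
    text = text.lower()
    labels = [category for category, cues in CATEGORY_CUES.items()
              if any(cue in text for cue in cues)]
    return labels or ["general content"]

print(classify_element("Mean reaction time on the Stroop task was 512 ms."))
# -> ['task description', 'quantitative indicators']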
In the document detailed view of Document Content Explorer (Figure 12), we
can observe that the results are classified into different categories and are displayed
by the corresponding tabs to permit users to choose content of interest. In the list of
papers returned for a query, the numbers of results classified into different categories
in the paper are also presented; this helps users select the papers relevant to their
interests.
Table 1 Path content for the path “working memory → PFC → D1 Receptors” discovered
by the Document Content Explorer. The discovered content describes relationships among
the concepts working memory (WM), D1 receptors, and prefrontal cortex (PFC), which are
highlighted in the extracted assertions.
Fig. 13 Components of the construct “cognitive control”. This figure from [25] displays a
graphical representation of the construct “cognitive control” as defined by the literature and
expert review of behavioral tasks.
[Figure graphic: mining-derived graph of the cognitive control construct, linking cognitive control to tasks such as go/no-go, SST, Stroop, anti-saccade, digit span, working memory, spatial working memory, response mapping, delayed match, response selection, Sternberg and choice reaction; edge labels give (co-occurrence count, correlation score).]
performance. One important indicator for the n-back test is accuracy. The heritabil-
ity of cognitive control is associated with the heritability of the indicators of be-
havioral tasks (e.g., the heritability for accuracy in the n-back test). Formalizing the
nature of cognitive control requires studying relations among cognitive control, its
subprocesses, and phenotypes such as heritability scores and indicators of behav-
ioral tasks. This can be viewed as a path knowledge discovery problem. With the
pattern “cognitive control → subprocess → task → indicator” we can gather known
results about cognitive control. The results of path knowledge discovery provide a
basis for interdisciplinary analysis of the heritability of cognitive control.
Table 2 Subprocesses and their corresponding cognitive tests. The matching is based on the
correlation score of the associations between subprocesses and cognitive tasks. The associa-
tion with the highest correlation score for each task is selected. The names in parentheses
are the names of the tasks as they appear in [25].
links since some tasks match multiple subprocesses. False positives exist because
the co-occurrence of tasks and subprocesses in a document element (using para-
graph granularity in this example) does not necessarily indicate that the subpro-
cesses are measured by the task. Also, it is entirely possible that two subprocesses
are discussed in the same document element, and our system is unable to separate
them. On the other hand, some tasks are not included in the top paths because the
occurrences of those tasks in the corpus are so low that the correlation with subpro-
cesses is too low to be included in the results. By setting the threshold lower, the
missing tasks may appear but may also introduce more noise. Choosing the proper
threshold to trade off precision and recall would be a decision for users to make.
Overall, using our proposed tools, the time spent on collecting the relevant litera-
ture and deriving the knowledge structure is greatly reduced, and the results from
the tools are comparable to human-derived results.
Figure 15 extends the query to indicator-level concepts, which gives us the com-
plete structure of the cognitive control construct. From the figure, we can observe
the highly correlated indicators for the tasks. For clarity of presentation, we only
present the top paths in Figure 15. By lowering the association strength threshold, a
more complete compilation of findings on cognitive control can be obtained.
After retrieving the paths, the next step is to find the heritability values for indi-
cators. To retrieve information about heritability, we add a “heritability” concept to
the query in the Document Content Explorer. Table 3 shows some sample results of
relevant path content, including assertions, figures, and tables containing the heri-
tability result. These results are analyzed by domain experts to extract discoveries
concerning the heritability of cognitive control.
Compared to traditional approaches, which require a significant amount of man-
ual work by domain experts, our approach provides a much more efficient way to
find paths that match the tasks and indicators to subprocesses, as well as extract
heritability scores from the relevant literature. Our experience is that the mining
results are comparable to human-generated results. It takes seconds to retrieve the
content and a few minutes for a user to browse and select the relevant content. The
traditional manual approach may take several orders of magnitude longer to exe-
cute the same steps and becomes infeasible when the number of papers to examine
becomes large. This typically results in severe reductionist approaches by domain
experts when trying to identify a significant but manageable subset of the literature.
Our tools eliminate the need for drastic a priori approaches to reduce the scope of
literature for review. Thus, with the aid of mining tools, the scope of research can
be enlarged into a corpus of thousands of papers instead of the 150 papers used in
[25]. Our tool greatly improves the scalability of such a complex analysis. Mean-
while, human intelligence still plays an important role in the process; selecting the
best paths and the best content are quite subjective and different users may use the
results differently. It is unrealistic to automate the entire research process, but it is
clearly beneficial to use text mining and information retrieval techniques to replace
the mechanical aspects and speed up the process.
[Figure 15 graphic: cognitive control linked to tasks (digit span, Wisconsin card sorting, spatial working memory, Sternberg, Stroop, SST, choice reaction) and to indicators (verbal IQ, total error, number of categories, nonperseverative error, perseverative errors, RT, commission error); edge labels give (co-occurrence count, correlation score).]
Fig. 15 The PhenoGraph generated from the path query “cognitive control → subprocesses → cognitive tests → indicators”. The strengths of associations
are measured with the local strength measure. The numbers on the links in the graph indicate the support (as the absolute co-occurrence count) and the
correlation score of the association represented by the corresponding link. The thickness of the links is proportional to the correlation score.
Table 3 Samples of path content that contain heritability data extracted for paths with selected
tasks/indicators

Task / Indicator | Heritability | Paper | Relevant Content
Stroop / RT | 0.5 | Heritability of Stroop and flanker performance in 12-year old children [31] | "In the Stroop task we found high heritabilities of overall reaction time and - more important - Stroop interference (h2 = nearly 50%)."
Forward Digit Span | 0.542 (±0.08) | A Multimodal Assessment of the Genetic Control over Working Memory [14] | Table 2, working memory, gray matter and white matter tract measures.
knowledge base very difficult. Third, even when each phenotype concept has its
entry in the knowledge base, summarizing related knowledge for concepts is a man-
ual process.
To resolve the shortcomings of the preexisting PhenoWiki system, we developed
the PhenoWiki+ system for integrating our more advanced and automated mining
techniques in order to more efficiently construct and manage a large repository of
phenotype knowledge. Taking advantage of the knowledge discovery abilities of
PhenoMining tools, we are able to build the knowledge base content faster and
on a larger scale. The PhenoWiki+ system is implemented using the Resource De-
scription Framework (RDF) data model, which enables the storage and retrieval
of knowledge with the relationship information preserved, connects knowledge of
different concepts to complete the knowledge structure, and integrates with exter-
nal knowledge sources. Furthermore, by incorporating the annotations generated by
users, the knowledge quality can be further improved and will ease the management
of the knowledge base.
Fig. 16 User interface for inputting quantitative experimental results in PhenoWiki+. On the
left, users can specify the characteristics of the sample group in the experiment, the task and
indicators used and the quantitative results. Using our mining results, we are able to make
suggestions to the user with content extracted from the paper. The two screen shots on the
right present the content extracted from the paper.
2sec reaction time in an n-back test can be specified as quant1 :: has-task :: nback,
quant1 :: measured-by :: reaction time, and quant1 :: data :: "2 seconds," where
nback and reaction time are the concept ids of the lexicon concepts "nback" and
"reaction time," respectively. Each
piece of quantitative data is linked with a sample group. One sample group can be
shared by multiple quantitative data when there are multiple experiments performed
on the sample. A sample group can have multiple sample characteristics such as
age, gender, etc., which can also be conveniently represented with RDF. Each piece
of quantitative data is also linked with the document elements from which it was
obtained.
Query languages such as SPARQL [22] enable users to query an RDF data store
with patterns of triples. Quantitative data can be queried with different criteria, such
as the tasks or indicators used in the experiments, range limits on experimental
results, or certain sample characteristics. In our implementation of PhenoWiki+,
the quant search functionality enables users to search quantitative results by tasks,
indicators, and sample characteristics.
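As a sketch of how such a quant search might look with rdflib, consider the snippet below; the namespace, property names and data values are illustrative assumptions and do not reproduce the PhenoWiki+ schema.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

# Illustrative sketch of storing and querying quantitative data as RDF triples;
# the namespace, property names and data values are assumptions.
PW = Namespace("http://example.org/phenowiki/")

g = Graph()
g.add((PW.quant1, RDF.type, PW.QuantData))
g.add((PW.quant1, PW["has-task"], PW.nback))
g.add((PW.quant1, PW["measured-by"], PW.reaction_time))
g.add((PW.quant1, PW.data, Literal("2 seconds")))
g.add((PW.quant1, PW["has-sample"], PW.sample_group_1))
g.add((PW.sample_group_1, PW.age, Literal("18-25")))

query = """
PREFIX pw: <http://example.org/phenowiki/>
SELECT ?quant ?value WHERE {
    ?quant pw:has-task pw:nback ;
           pw:measured-by pw:reaction_time ;
           pw:data ?value .
}
"""
for row in g.query(query):
    print(row.quant, row.value)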
[Figure 17 graphic: entities QuantData, SampleGroup, Sample Characteristics, Task, Indicator and Document, connected by relations such as has-sample, has-task, measured-by and referenced-by.]
Fig. 17 Data model for quantitative data. The squares represent the entities in the knowledge
base (each entity is specified with an identifier). The arrows represent the relationships defined
among different types of entities. The ellipses linked by dashed arrows represent the attributes
of the specified entities. All data are represented as RDF triples with the pattern "entity1
:: relation :: entity2". An RDF database supporting SPARQL queries can efficiently
retrieve quantitative data by various criteria such as indicators, tasks or sample characteristics.
system, we record the top paths for each concept entry in the knowledge base. Con-
cepts appearing in the same paths are defined as “related concepts” in the knowledge
base. Users can conveniently navigate the knowledge from one concept to related
concepts in the knowledge base.
The Resource Description Framework (RDF) is a data format widely used in rep-
resenting knowledge involving relations among entities. In PhenoWiki+, we store
the relations of concepts in triples, e.g., “working memory :: is related to :: pre-
frontal cortex.” Some relations are defined in the lexicon. For example, the con-
cept hierarchy presented in Figure 3 can be represented as “PFC :: is-subconcept
:: Frontal Lobe.” Moreover, by using path knowledge discovery tools, association
paths among concepts can be identified. Such relationships can also be stored in the
knowledge base and represented as triples in RDF. In our implementation, each path
is defined as an entity in the RDF data model and has a sequence of associations. For
each association, the antecedent and consequent concepts and strength measures are
defined as properties. Furthermore, the path content, such as documents and quan-
titative data, can also be linked to the associations and the path. Figure 18 presents
the data model of a path.
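A possible RDF rendering of this path data model is sketched below, mirroring Figure 18; the namespace, property names and numeric values are assumptions made for illustration.

from rdflib import Graph, Namespace, Literal

# Sketch of a path entity with one association, mirroring the Figure 18 data model;
# the namespace, property names and scores are illustrative assumptions.
PW = Namespace("http://example.org/phenowiki/")
g = Graph()

g.add((PW.path1, PW["has-association"], PW.assoc1))
g.add((PW.assoc1, PW.antecedent, PW.schizophrenia))
g.add((PW.assoc1, PW.consequent, PW.working_memory))
g.add((PW.assoc1, PW.confidence, Literal(0.62)))   # assumed value
g.add((PW.assoc1, PW.correlation, Literal(0.41)))  # assumed value
g.add((PW.path1, PW["has-content"], PW.sentence_33))  # link to supporting content

print(g.serialize(format="turtle"))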
Furthermore, since RDF is a well-adopted standard in knowledge representation,
we can link our knowledge with many external knowledge bases — including the
Cognitive Atlas [21] and GO [5]. This capability can greatly enlarge our scope of
knowledge acquisition. For example, in the implementation of PhenoWiki+, by in-
cluding the knowledge graph of the Cognitive Atlas, we were able to include the
“is-part-of” relations and the “is-a-kind-of” relations among concepts, as well as the
concepts defined by domain experts in the Cognitive Atlas.
PhenoWiki+ summarizes various relations about a concept in its concept sum-
marization page (Figure 19). From this page, users can navigate to other related
concepts, look for related literature, and find related paths.
[Figure 18 graphic: a Path entity linked to an ordered sequence of Association entities; the association schizophrenia → working memory carries antecedent, consequent, confidence and correlation properties, and path content such as quants and sentences is attached via has-content.]
Fig. 18 Path data model. A path consists of an ordered sequence of associations. For each
association, the antecedent and consequent concepts, and the strength measures such as sup-
port, confidence and correlations, are defined as properties. As an example, the figure shows
the properties for the association schizophrenia → working memory in the path presented in
Figure 7. Path content is linked with a path via the “has-content” relationship referenced by
the identifiers of the corresponding content.
Fig. 19 Concept summarization page for working memory. Various types of data related to
the concept are presented, including concepts, content (documents and quants) and related
paths.
Fig. 20 Private and public annotation examples for different concepts. Private annotations
are like personal bookmarks on concepts. In this figure, a concept is privately annotated with
a project name. The public annotations are used for collaborative development of knowledge
bases. In this example, the user contributes the MeSH term link of the concept as a public
annotation.
7 Related Work
The path knowledge discovery problem comprises two integral parts — discovery
of the relations among concepts in different levels and retrieval of the path content
describing such relationships. To the best of our knowledge, there is no existing pub-
lished work covering both aspects of the problem. In the following, we will discuss
the related work in association rule mining and relation discovery, knowledge-based
content retrieval, and related studies from the Consortium for Neuropsychiatric
Phenomics.
Using the concept hierarchy provided in the lexicon, we are able to achieve bet-
ter information retrieval performance and preprocessing and post-processing of the
query to facilitate various research goals. The utilization of domain knowledge in
information retrieval systems has been studied by e.g., Liu et al. [17], who per-
formed scenario-based knowledge expansion using the Unified Medical Language
System (UMLS) — a well-defined medical ontology. Since our system relies on a
lightweight multilevel hierarchy lexicon, we use the knowledge hierarchy to perform
knowledge expansion, which is similar to approaches in ontology-based knowledge
expansion [7].
The novel contributions of our approach to content retrieval are: 1) The query
is translated based on a path and focuses on the content describing the relations
among concepts; 2) Our content retrieval focuses on finer-granularity content such
as sentences and figures, and the classification of the content provides a deeper
understanding of the content based on knowledge from the lexicon. As a result, such
mining greatly reduces the human labor involved in reading papers and digesting
content, and improves the scalability and quality of the findings.
8 Conclusion
Path knowledge discovery consists of two integral parts — path discovery and path
content retrieval — and focuses on the study of relations among concepts at multiple
levels. This is useful in many research fields where a vast number of concepts are
involved, and establishing relations among concepts across levels is important.
Our path discovery identifies and measures a path of knowledge, i.e., a se-
quence of associations among concepts at different levels. We have proposed two
approaches for measuring the strength of these path associations — the local
strength measure in which associations are considered independent, and the global
strength measure in which the preceding path associations are considered precon-
ditions of following associations. We have also extended support, confidence, and
correlation measures from traditional association rule mining to the context of path
discovery.
Path content retrieval is a process of searching for relevant content describing
the relations specified in the paths from a particular corpus. Path content reveals
the semantics of relations represented in the paths and provides a basis for deeper
analysis and further study. With the knowledge from the multilevel lexicon, we are
able to preprocess the query by expanding queries using the synonyms list and the
concept hierarchy, and post-process the query by classifying content according to
different research goals.
We presented an example of using path knowledge discovery to examine a rele-
vant research question in neuropsychiatric phenomics: What is the heritability of the
complex phenotype cognitive control? Compared to manual marathons by human
domain experts, path knowledge discovery can greatly reduce labor and achieve
results of comparable quality. Preliminary results show the benefit of using data
mining for path knowledge discovery in order to study complex problems. We also
applied our mining results to the construction of a knowledge base. By extending the
PhenoWiki system, the PhenoWiki+ system overcomes the difficulties of knowledge
acquisition for traditional knowledge base systems by accelerating the data populat-
ing process and connecting scattered knowledge with paths.
Although our work on path knowledge discovery represents a significant step
forward, further work is needed to improve the accuracy of our methodology. First,
path discovery is currently based on association strengths used to satisfy strength
constraints for the paths. The threshold setting can be difficult for users. If training
data with labeled paths is available, machine learning techniques may be used to
automatically set the thresholds. Second, currently the associations between con-
cepts are based on statistical co-occurrence, but where more complete ontologies
are available, more sophisticated computations based on the ontological structure
can expose relations between concepts. Third, advanced information retrieval tech-
niques, such as relevance feedback, may be used to improve search quality. We be-
lieve our approach to path knowledge discovery provides the framework for building
sophisticated discovery tools for complex knowledge areas such as phenomics.
Acknowledgements. This work was supported by USPHS grants, including the NIH
Roadmap Initiative Consortium for Neuropsychiatric Phenomics, including linked awards
UL1DE019580, RL1LM009833. We thank Jianming He, Ying Wang, Andrew Howe, Ji-
acheng Yang, Jianwen Zhou, Xiuming Chen and Jiajun Lu of the CoBase research group
in the UCLA Computer Science Department for their work on the PhenoMining tools and
PhenoWiki+ implementation. We would also like to thank Professors Carrie Bearden and
Joseph Ventura from the Consortium for Neuropsychiatric Phenomics for the initial testing
of the tools and for their stimulating discussions during the development of this work.
References
1. Phenomining lexicon,
http://phenominingbeta.cs.ucla.edu/static/new_lexicon.txt
2. Pubmed central web site, http://www.ncbi.nlm.nih.gov/pmc/
3. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases.
In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB
1994, pp. 487–499. Morgan Kaufmann Publishers Inc., San Francisco (1994)
4. Anokhin, A.P., Golosheykin, S., Grant, J.D., Heath, A.C.: Developmental and genetic
influences on prefrontal function in adolescents: a longitudinal twin study of wcst per-
formance. Neuroscience Letters 472(2), 119–122 (2010)
5. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of
biology. Nature Genetics 25(1), 25 (2000)
6. Baker, N.C., Hemminger, B.M.: Mining connections between chemicals, proteins, and
diseases extracted from medline annotations. Journal of Biomedical Informatics 43(4),
510 (2010)
7. Bhogal, J., Macfarlane, A., Smith, P.: A review of ontology based query expansion. In-
formation Processing & Management 43(4), 866–886 (2007)
8. Bilder, R.M., Sabb, F.W., Cannon, T.D., London, E.D., Jentsch, J.D., Parker, D.S., Pol-
drack, R.A., Evans, C., Freimer, N.B.: Phenomics: the systematic study of phenotypes
on a genome-wide scale. Neuroscience 164(1), 30–42 (2009)
9. Bilder, R.M., Sabb, F.W., Parker, D.S., Kalar, D., Chu, W.W., Fox, J., Freimer, N.B., Pol-
drack, R.A.: Cognitive ontologies for neuropsychiatric phenomics research. Cognitive
Neuropsychiatry 14(4-5), 419–450 (2009)
10. Creighton, C., Hanash, S.: Mining gene expression databases for association rules. Bioin-
formatics 19(1), 79–86 (2003)
11. Glausier, J.R., Khan, Z.U., Muly, E.C.: Dopamine D1 and D5 receptors are localized to
discrete populations of interneurons in primate prefrontal cortex. Cerebral Cortex 19(8),
1820–1834 (2009)
12. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Pro-
ceedings of the 2000 ACM SIGMOD International Conference on Management of Data,
SIGMOD 2000, pp. 1–12. ACM, New York (2000)
13. Hristovski, D., Friedman, C., Rindflesch, T.C., Peterlin, B.: Exploiting semantic relations
for literature-based discovery. In: AMIA Annual Symposium Proceedings, vol. 2006, p.
349. American Medical Informatics Association (2006)
14. Karlsgodt, K.H., Kochunov, P., Winkler, A.M., Laird, A.R., Almasy, L., Duggirala, R.,
Olvera, R.L., Fox, P.T., Blangero, J., Glahn, D.C.: A multimodal assessment of the ge-
netic control over working memory. The Journal of Neuroscience 30(24), 8197–8202
(2010)
15. Kremen, W.S., Xian, H., Jacobson, K.C., Eaves, L.J., Franz, C.E., Panizzon, M.S., Eisen,
S.A., Crider, A., Lyons, M.J.: Storage and executive components of working memory: in-
tegrating cognitive psychology and behavior genetics in the study of aging. The Journals
of Gerontology Series B: Psychological Sciences and Social Sciences 63(2), P84–P91
(2008)
16. Lesh, T.A., Niendam, T.A., Minzenberg, M.J., Carter, C.S.: Cognitive control deficits in
schizophrenia: mechanisms and meaning. Neuropsychopharmacology 36(1), 316–338
(2010)
17. Liu, Z., Chu, W.W.: Knowledge-based query expansion to support scenario-specific re-
trieval of medical free text. Information Retrieval 10(2), 173–202 (2007)
18. U.S. National Library of Medicine: Fact sheet: Medical Subject Headings (MeSH),
http://www.nlm.nih.gov/pubs/factsheets/mesh.html
19. Oyama, T., Kitano, K., Satou, K., Ito, T.: Extraction of knowledge on protein–protein
interaction by association rule discovery. Bioinformatics 18(5), 705–714 (2002)
20. Parker, D.S., Chu, W.W., Sabb, F.W., Toga, A.W., Bilder, R.M.: Literature mapping with
pubatlas extending pubmed with a blasting interface. Summit on Translational Bioinfor-
matics 2009, 90 (2009)
21. Poldrack, R.A., Kittur, A., Kalar, D., Miller, E., Seppa, C., Gil, Y., Parker, D.S., Sabb,
F.W., Bilder, R.M.: The cognitive atlas: toward a knowledge foundation for cognitive
neuroscience. Frontiers in Neuroinformatics 5 (2011)
22. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF,
http://www.w3.org/TR/rdf-sparql-query/
23. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond.
Found. Trends Inf. Retr. 3(4), 333–389 (2009)
24. Runyan, J.D., Moore, A.N., Dash, P.K.: A role for prefrontal calcium-sensitive protein
phosphatase and kinase activities in working memory. Learning & Memory 12(2), 103–
110 (2005)
25. Sabb, F.W., Bearden, C.E., Glahn, D.C., Parker, D.S., Freimer, N., Bilder, R.M.: A col-
laborative knowledge base for cognitive phenomics. Molecular Psychiatry 13(4), 350–
360 (2008)
26. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Com-
mun. ACM 18(11), 613–620 (1975)
27. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Pro-
cess. Manage. 24(5), 513–523 (1988)
28. Seamans, J.K., Durstewitz, D., Christie, B.R., Stevens, C.F., Sejnowski, T.J.: Dopamine
D1/D5 receptor modulation of excitatory synaptic inputs to layer V prefrontal cortex
neurons. Proceedings of the National Academy of Sciences 98(1), 301–306 (2001)
29. Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: Generalizing association
rules to dependence rules. Data Min. Knowl. Discov. 2(1), 39–68 (1998)
30. Smalheiser, N.R., Torvik, V.I., Zhou, W.: Arrowsmith two-node search interface: A tuto-
rial on finding meaningful links between two disparate sets of articles in medline. Com-
puter Methods and Programs in Biomedicine 94(2), 190 (2009)
31. Stins, J.F., van Baal, G.C.M., Polderman, T.J.C., Verhulst, F.C., Boomsma, D.I.: Her-
itability of stroop and flanker performance in 12-year old children. BMC Neuro-
science 5(1), 49 (2004)
32. Tan, P.-N., Kumar, V., Srivastava, J.: Indirect association: Mining higher order dependen-
cies in data. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS
(LNAI), vol. 1910, pp. 632–637. Springer, Heidelberg (2000)
33. Vinkhuyzen, A.A.E., Van Der Sluis, S., Boomsma, D.I., de Geus, E.J.C., Posthuma, D.:
Individual differences in processing speed and working memory speed as assessed with
the sternberg memory scanning task. Behavior Genetics 40(3), 315–326 (2010)
34. Von Huben, S.N., Davis, S.A., Lay, C.C., Katner, S.N., Crean, R.D., Taffe, M.A.: Differ-
ential contributions of dopaminergic D1-and D2-like receptors to cognitive function in
rhesus monkeys. Psychopharmacology 188(4), 586–596 (2006)
35. Voytek, J.B., Voytek, B.: Automated cognome construction and semi-automated hypoth-
esis generation. Journal of Neuroscience Methods (2012)
InfoSearch: A Social Search Engine
Abstract. The staggering growth of online social networking platforms has also
propelled information sharing among users in the network. This has helped develop
the user-to-content link structure in addition to the already present user-to-user link
structure. These two data structures have provided us with a wealth of data that
can be exploited to develop a social search engine and significantly improve our
search for relevant information. Every user in a social networking platform has their
own unique view of the network. Given this, the aim of a social search engine is
to analyze the relationships between an individual user and their friends, and the
information those friends have shared, to compute the most socially relevant result set for a search
query.
In this work, we present InfoSearch: a social search engine. We focus on how we
can retrieve and rank information shared by the direct friends of a user in a social
search engine. We ask the question: within the boundary of only one hop in the so-
cial network topology, how can we rank the results shared by friends? We develop
InfoSearch over the Facebook platform to leverage information shared by users in
Facebook. We provide a comprehensive study of factors that may have a potential
impact on social search engine results. We identify six different ranking factors and
invite users to carry out search queries through InfoSearch. The ranking factors are:
‘diversity’, ‘degree’, ‘betweenness centrality’, ‘closeness centrality’, ‘clustering co-
efficient’ and ‘time’. In addition to the InfoSearch interface, we also conduct user
studies to analyze the impact of ranking factors on the social value of result sets.
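For reference, the purely structural ranking factors named above (degree, betweenness centrality, closeness centrality, clustering coefficient) can be computed over a user's one-hop friendship graph with standard routines. The sketch below uses networkx on a toy graph; the graph and friend names are illustrative assumptions rather than Facebook data.

import networkx as nx

# Sketch: compute the structural ranking factors for each friend in a toy
# one-hop friendship graph (the graph and names are assumptions).
g = nx.Graph([("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
              ("alice", "dave"), ("dave", "erin")])

factors = {
    "degree": nx.degree_centrality(g),
    "betweenness centrality": nx.betweenness_centrality(g),
    "closeness centrality": nx.closeness_centrality(g),
    "clustering coefficient": nx.clustering(g),
}

for name, scores in factors.items():
    print(name, {friend: round(score, 2) for friend, score in scores.items()})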
1 Introduction
Users in online social networks have surpassed hundreds of millions in number.
With this staggering growth in the network size, social network platforms like
Facebook and Twitter have introduced various software tools to engage users. In
Prantik Bhattacharyya · Shyhtsun Felix Wu
University of California, Davis, 1 Shields Ave, Davis, CA
e-mail: {pbhattacharyya,sfwu}@ucdavis.edu
addition to connecting and exchanging messages with friends on a regular basis, so-
cial network platforms also provide a great place to share useful information. Con-
sequently, people have become very good at sharing the information that they value,
support, endorse and think their friends might benefit from. Users share their favorite
web-page(s) on current affairs, news, technology updates, programming, cooking,
music and so on by sharing Internet URLs with their friends through the social net-
work platform. Facebook has introduced ‘Like’, ‘Share’ and ‘Recommend’ buttons
that content providers of any website can include on their website to help visitors
share the URLs with their friends in a fast and easy way. Twitter has also intro-
duced similar technologies to let users ‘Tweet’ the URL in addition to their personal
comment about the URL.
Fig. 1 Example of Information Sharing over Online Social Network (Facebook in this
example)
The simplicity and ubiquitousness of this technology has propelled the integra-
tion of the web graph with the social graph. The additional information present
in each individual’s personal network can be utilized to develop search engines that
include social context in information retrieval and ranking. In typical web search en-
gines, users are restricted to search for information from the global web and retrieve
results that are ranked relevant by a search engine’s algorithm. For example, web
search engines like Google, Yahoo! and Bing traditionally analyzes the information
present in the form of hyper-link structures to rank results during a typical query.
The intuitive justification for utilizing the hyperlink structure to rank web-pages is
based on the idea that one web-page links to another web-page to indicate usefulness
and relevance. During the process of crawling, indexing and ranking, each search
engine formulates result set(s) for a set of keyword that are unique in nature and are
identical to every user visiting the search engine. For example, when users search
for queries related to ‘programming’ or ‘cooking recipes’, search results are similar
in nature to every individual performing a query on the engine.
A search engine result set, however, can be significantly updated to incorporate
social context as a factor during the ranking process. The social context in retrieving
results will allow users to identify results based on the way their friends have shared
and endorsed similar information. Each search query from a user will thus retrieve a
unique set of result. The exclusive nature of each result set will thus be based on the
large volume of information available in each individual user’s personal network.
The search process thus not only enables a user to access a set of information that
has a distinct social component attached to it but also to gain from the collective
InfoSearch: A Social Search Engine 195
Fig. 2 Screenshot of InfoSearch Application on Facebook: Results for the query ‘privacy’
appear for one of the authors
In this work, we develop a search engine to demonstrate how user shared in-
formation can be exploited to deliver search results. Our work can be described in
two parts. In the first part, we develop the social search engine system based on the
Facebook platform that leverages the information shared by users in Facebook as an
extension of our previous work [6]. The search engine is called InfoSearch and is
available at https://apps.facebook.com/infosearch. In the second
part of our work, we discuss key issues that influence result ranking. We explore
questions on how we can define the best result in a social context. In the absence
196 P. Bhattacharyya and S.F. Wu
of ground truth data about the relationship shared between two users (in real or on-
line life), we investigate different ranking factors to analyze the social relationship
between two users and rank search results. We provide a comprehensive study of
factors that impact social search engine results. The ranking factors are based on an
analysis of the structure of the social relationship between friends of a given user: so-
cial diversity, three different measures of centrality: degree, betweenness centrality
and closeness centrality, a measure of clustering: clustering coefficient and finally
a factor based on the time property of a shared information. We derive the social
relationship between two users (friends) of a given user based on the social group
structure shared between them in the user’s individual social network. We present
results based on the impact of the above ranking factors in retrieving information
through user studies.
In section 2, we discuss related work. We formally describe the problem state-
ment related to social search in section 3 and follow up with a discussion of social
network relationship semantics in section 4. In section 5, we discuss the ranking fac-
tors and corresponding algorithms and section 6 describes the system development
process. Section 7 presents statistics on usage. In section 8, we present our findings
obtained through user studies and section 9 concludes with a discussion of future
research directions.
2 Related Work
We discuss related work in this section. First, we discuss work in the area of search
in social networks. Second, we discuss research related to the study of social rela-
tionship semantics. We primarily focus on research related to group and community
formation in social networks.
Several projects have looked into the area of search in social networks. The re-
search problems have broadly fallen into the following categories. First, the identity
or profile search problem in which social network information is used to connect and
subsequently search for users. Dodds et. al. [14] conducted a global social-search ex-
periment to connect 60, 000 users to 18 target persons in 13 countries and validated
the claims of small-world theory. Adamic et. al. [1] conducted a similar project on
the email network inside an organization. More recently, Facebook has introduced
‘Graph Search’ [15] that aims to help user search for content linked by their friends.
Facebook defines a content as any object on the open graph api. Examples of ob-
ject in the open graph api include facebook-pages (e.g. a facebook account created
by a local business, musician, artist) , facebook-apps (e.g. social games), facebook
groups (e.g. university course groups, athletic group), photos shared by its users and
geographic locations shared by the users.
In the second category, social networks have been leveraged to search for ex-
perts in specific domains and find answer to user questions. Lappas et. al. [23] ad-
dressed the problem of searching a set of users suitable to perform a job based on
the information available about user abilities and compatibility with other users. The
work in [11] attempted at automated FAQ generation based on message routing in a
InfoSearch: A Social Search Engine 197
social network through users with knowledge in specific areas. Other works in sim-
ilar directions have also been presented, e.g. [10, 33]. Query models [2] based on
social network of users with different levels of expertize for the purpose of decen-
tralized search have also been developed. Horowitz et. al. [21] presented Aardvark,
a social network based system to route user questions into their extended network
to users most likely knowledgeable in the context of the question.
In the third category, social networks are considered to improve search result
relevancy. User connections are interpreted as a graph such that a user can be rep-
resented as a node and each friend connection can be treated as an edge between
two nodes. Haynes et. al. [20] studied the impact of social distance between users
to improve search result relevancy in a large social networking website, LinkedIn.
The author defined the social distance between users based on the tie structure of
the social graph and aims to provide improved relevance and order in profile iden-
tity entries. Link analysis algorithms, like PageRank [7, 9, 13], are also not suitable
for application since during the search process of an individual user, results from
members of their social circle should not be ranked based on a generalized analysis
of the relative importance of those members in the larger network but rather on their
local importance to the querying user [24, 38]. Mislove et. al. [26] considered the
problem of information search through social network analysis. They compare the
mechanisms for locating information through web and social networking platforms
and discuss the possibility of integrating web search with social network through a
HTTP proxy.
A primary way to understand social relationships is by analyzing social group
formation in social networks. Work in group detection in graphs are primarily asso-
ciated with community detection and graph partitioning problems. Past works [29]
describe the motivation and technical differences between the two approaches. De-
tailed discussions can also be found in the recent survey [16]. Here, we discuss
works related to community detection in social networks.
A common approach for finding sub-communities in networks uses a percolation
method [12, 32, 31]. Here, k-clique percolation is used to detect communities in the
graphs. Cliques in the graph are defined as complete and fully connected subgraphs
of k vertices. Individual vertices can belong to multiple cliques provided that the
overlapping subgroups don’t also share a (k − 1) clique. The work in [17] uses cen-
trality indices to find community boundaries in networks. The proposed algorithm
uses betweenness between all edges in the network to detect groups inside the graph.
The worst-case runtime of the algorithm is O(m2 n) for a graph of m edges and n ver-
tices and is unsuitable for large networks. Improvements in the runtime have been
suggested in later works [34, 36]. Impact of network centrality on egocentric and
socio-centric measures have also been studied [24].
The betweenness approach places nodes in such a way that they exist only in a
single community, restricting the possibility of overlapping communities and detect-
ing disjoint groups in the network. To overcome this shortfall, algorithms in [18, 19]
have proposed the duplication of nodes and local betweenness as a factor in detec-
tion of communities. Other approaches to identify overlapping communities have
also been proposed [5, 4]. The above works describe the community structure based
198 P. Bhattacharyya and S.F. Wu
on relative comparison with the graph segment not included in the community [28]
or based on comparisons with random graphs of similar number of nodes and ver-
tices but different topological structures. For example, the definition of modularity
[30] as an indicator of the community strength defines the measure as a fraction of
the edges in the community minus the edges in a community created by the same al-
gorithm on a random graph. Community definitions also include detection of groups
within the network such that the interconnection between the different groups are
sparse [18, 19]. In this work, we build the social search system on Facebook, uti-
lizing the existing social graph as well as the information database being built by
users. We discuss the details next.
Definition 2. Ego Network: For a user u, ego network is a graph G(u) = (V (u),
E(u)), where V (u) is a set of nodes that includes all friends of u, F(u) and the
node u itself. E(u) is a set of edges among (V (u) − u) such that ∀v ∈ (V (u) − u), v
and u are friends and share an edge in E. Additionally, all edges between nodes in
(V (u) − u) that existed in E are also included in E(u).
Definition 3. Mutual Friend Network: A mutual friend network of an user u is
defined as a subset of the ego network, represented as MF(u) = (F(u), E (u)). F(u)
is the set of all friends of user u and E (u) is a subset of the edges from E(u) with
the edges between user u and nodes in F(u) absent.
Definition 4. Shared Information: A shared information in a social network can
be identified as an URL or a document. An URL or document shared by a user u
is denoted by the tuple (u, d). Each shared URL or document is tagged by a set of
keywords K(d) = (k1d , k2d , ..., km
d ). Additionally, each information is also tagged by a
time-stamp, T (d), based on the time the information was shared by the user in the
social network platform.
Definition 5. Query: A query q by a user u is defined as Q(u, q). The query q can
be a single keyword or a set of keywords i.e. a key-phrase. We discuss details about
how we distinguish keywords and key-phrases during the search process later in
section 6.
Definition 6. Factor: The term ‘factor’ is used to define a ranking factor that orders
and ranks results in the search process. The factors used in this work are defined in
section 5.
Definition 7. Result Candidates: The result candidates, RC(Q(u, q)) for a query
Q(u, q) is defined as the set of shared document tuples (vi , d j ) such that vi ∈ F(u)
and ∀d j , q ∈ K(d j ).
Let the number of results in RC(Q(u, q)) be represented as λ such that λ =
|RC(Q(u, q))|. Lets also denote the number of users in result candidates tuple list
as λv and the number of documents by λd . Also, lets assume the number of unique
users in the above list as λv .
Definition 8. Result Set: A result set, RS(Q(u, q)), for a query Q(u, q) is defined
as a set of ρ document tuples (vi , d j ) such that vi ∈ F(u) and ∀d j , q ∈ K(d j ). Thus,
for a query Q(u, q) with result candidates, RC(Q(u, q)), the number of result sets
possible is given by α = |RC(Q(u,q))|
ρ .
Definition 9. Result Value: The result value of a result set for a given f actor is
defined as RV (RS(Q(u, q)), Factor). The method to compute the result value of a
result set will vary according to the factor and will be described in section 5 along
with each factor.
Definition 10. Result Final: The result final is a collection of result sets, or-
dered by decreasing result value. Thus, the result final for query Q(u, q) can be
defined as RF(Q(u, q)) = {RS1(Q(u, q)), RS2 (Q(u, q)), .., RSα (Q(u, q))} such that
RV (RS1 (Q(u, q)), Factor)≥RV(RS2 (Q(u, q)), Factor)≥..RV(RSα (Q(u, q)), Factor).
200 P. Bhattacharyya and S.F. Wu
In the next section, we will discuss and define the semantics of social relationship
to formalize contribution of each user as they impart social context to formulate the
final result set.
We base our analysis of the relationship between two users from the point of
view of the user, u, performing a search query through the search engine. Thus, we
analyze the relationship shared between users, v and w, through the mutual friend
network of the user u i.e. MF(u). For different users, u1 and u2 with respective mu-
tual friend networks, MF(u1 ) and MF(u2 ) such that the graphs are distinct either in
terms of topology or based on the number of users present in the network, the rela-
tionship shared between two users v and w where both v, w ∈ MF(u1 ) and MF(u2 )
may vary accordingly. We present example mutual friend network visualizations
in Figure 3. The visualizations represents the mutual friend networks of ‘userA’
and ‘userB’ (details about the users and the network properties are mentioned in
InfoSearch: A Social Search Engine 201
section 8.2), respectively, from their Facebook profile. The visualizations were cre-
ated using the Gephi platform [3].
We empirically determine the social groups of a user’s network by analyzing the
mutual friend network of the user. In centrality based methods, we use the factors
of degree, betweenness centrality and closeness centrality. We further explore clus-
tering based methods, namely local clustering coefficient property, to determine the
social relationship semantics between two users, v and w. We present an example
in Figure 4. Ego e is connected to all the other nodes in the graph and shown us-
ing a broken line between the vertices and ego e. The mutual friend network of the
ego e is shown by the connected lines between the other vertices of the figure. We
introduce the formal definition of each relationship characteristic and compare and
contrast the merits of each property next.
A social group in the ego-network of user u can be defined as a set of friends who
are connected among themselves, share a common identity and represents a dimen-
sion in the social life of the user u. A social group can be defined in multiple ways.
In this work, we base our definition on mutuality [38] and the formal definition is
presented next.
Fig. 5 Social Groups for an ego e at k = 1 Fig. 6 Social Groups for an ego e at k = 2
Definition 12. Social Group Distance: The distance between two social groups
is defined to be equal to the Jaccard distance between the groups. For two social
groups, sg(u)i and sg(u) j , from the set SG(u) of user u, distance is defined as:
|sg(u)i ∩ sg(u) j |
dist(sg(u)i , sg(u) j ) = 1 − (1)
|sg(u)i ∪ sg(u) j |
Definition 13. User Distance in Ego Network: User distance between two users, v
and w, in the ego network of user u is defined as the mean distance between the two
user’s associated group(s). For users v and w associated with ηv and ηw number of
social groups represented by giv and gwj such that 1 ≤ ηv and 1 ≤ ηw respectively,
user distance is defined as:
j
∑ dist(giv , gw )
1≤i≤ηv
1≤ j≤ηw
ω (v, w) = (2)
ηu × ηw
The social group distance and user distance formula as proposed above paves
the way for us to understand the social relationship between two users based on
mutuality and creates scope for us to distinguish how distant (or close) users are
to each other from the point of view of a single user. A high value in the user dis-
tance thus empirically suggests a separation (possibly to an extent of unfamiliarity)
and furthermore existence of multiple facets to an individual’s social life. For ex-
ample, a typical individual has friends from their place of employment (which can
InfoSearch: A Social Search Engine 203
4.2 Degree
In this factor, we consider the degree of user v in MF(u) i.e. the factor that indi-
cates the number of users in F(u) connect to v. Let, this value be represented as
deg(v, MF(u)), for all v ∈ F(u). In the example of Figure 4, users a, b and g has a
degree of 2, user c has a degree of 3 and users d, f and h has a value of 1. The num-
ber indicates the strength of connectivity of a particular vertex in the mutual friend
network. A high value can be interpreted as a signal of support for the friend and
reflects their relative importance in MF(u) and thus stands as an important signal to
represent the social relationship shared between users.
While the value of degree (indegree and outdegree values in directional graphs)
have been a signal of significant importance in graph based methodology develop-
ments, e.g. HITS, PageRank, in the context of social relationships and the mutual
friend network of a user, the degree property can often formulate results to indicate
biasness towards a few social relationships. For example, friends from a particular
group (say place of work) can all know each other and can form complete graph,
thus leading towards every user in the said group to have high and similar degree
values and constraining the result set to include results from only one group. Other
properties described next, e.g. betweenness, closeness centrality and clustering co-
efficient also tends to address these issues and thus, we believe ‘diversity’ offers a
certain level of contrast to other social relationship characteristics and hence has the
potential to offer interesting results in a social search engine result set.
an individual user connects through to connect to other users in the graph and is
another important signal to understand the semantics of social relationships.
The betweenness centrality of a node v is computed as [8]: CB (v, MF(u)) =
σ (v)
∑s=v=t∈V stσst where σst is total number of shortest paths from node s to node t
and σst (v) is the number of those paths that pass through v. In the example of Figure
4, users a, b, d, f and h has a betweenness centrality value of 0.0, user c has a value
of 0.13 and g has a value of 0.067.
and also describe methods to evaluate the result value of any result set for a given
ranking factor. In the final subsection, we talk about the ranking algorithm employed
to rank results and determine the final result set(s) from the result candidates. We
start by introducing the ‘diversity’ factor based on the definition of social groups as
discussed in section 4.1.
5.1 Diversity
The ‘diversity’ factor is based on the social group information of the querying user.
The purpose of this factor is to maximize group representation in a result set such
that the social diversity in a result set is maximized and a higher user distance be-
tween the users present in the result set can help user u to inspect results that mem-
bers from the various groups of the network share on the platform. The diversity
value is based on the user-distance method defined in section 4.1 and is defined
next.
Definition 14. Diversity. The diversity of a result set, RS(Q(u, q)), consisting of ρ
results is defined as the mean user distance(s) between each pair of users.
∑ ω (v, w)
v,w∈RS(Q(u,q))
(RS(Q(u, q))) = (3)
| ρ |2
Definition 15. Diversity Result Value. The result value of a result set for the ‘di-
versity’ factor is defined as equal to the diversity value of the result set itself. Thus,
∑ ω (v, w)
v,w∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Diversity’) = (u, RS(Q(u, q))) = (4)
| ρ |2
5.2 Degree
The ‘degree’ factor is based on the definition of ‘degree’ from section 4.2. The
purpose of this factor is to select friends of the user performing a query who have
the highest number of connections in the mutual friend network and define relevance
in a social context as related to each contributing user’s popularity in the network.
Definition 16. Degree Result Value. The result value of a result set for the ‘degree’
factor is defined as the average of the degree value of all users present in the result
set. Thus,
∑ deg(v, MF(u))
v∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Degeee’) = (5)
ρ
206 P. Bhattacharyya and S.F. Wu
Definition 17. Betweenness Centrality Result Value. The result value of a result
set for the ‘betweenness centrality’ factor is defined as the average of the between-
ness centrality value of all users present in the result set. Thus,
∑ CB (v, MF(u))
v∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Betweenness Centrality’) = (6)
ρ
Definition 18. Closeness Centrality Result Value. The result value of a result set
for the ‘closeness centrality’ factor is defined as the average of the closeness cen-
trality value of all users present in the result set. Thus,
∑ CC (v, MF(u))
v∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Closeness Centrality’) = (7)
ρ
Definition 19. Clustering Coefficient Result Value. The result value of a result
set for the ‘clustering coefficient’ factor is defined as the average of the clustering
coefficient value of all users present in the result set. Thus,
∑ CL (v, MF(u))
v∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Clustering Coefficient’) = (8)
ρ
InfoSearch: A Social Search Engine 207
5.6 Time
We introduce ‘time’ as the final factor to rank results. The time-stamp of each shared
information, T (d) is considered to rank the result candidates to compute the final
result. In contrast to the previous factors that were based on the social relationship
shared between users, the ‘time’ fator is established to reflect the most recent activity
by users in the context of the query. For example, in the context of a query related
to ‘budget’, the ‘time’ factor can successfully determine search results that link to
the most recently shared information related to ‘budget’.
Definition 20. Time Result Value. The result value of a result set for the ‘time’
factor is defined as the average time-stamp of the information set present in the
result set. Thus,
∑ T (d)
d∈RS(Q(u,q))
RV (RS(Q(u, q)), ‘Time’) = (9)
ρ
In addition to the above definition of a result value for the factor ‘time’, we also
measure the standard deviation in time-stamp values of the information set present
in the result set. The standard deviation value helps us understand the extent of
‘freshness’ or ‘real-time’ nature of the results. In the next section, we discuss the
algorithms employed to compute final result set for each ranking factor.
5.7.1 Diversity
The result value for the ‘diversity’ factor is based on the relationship shared between
two users (user distance property) present in the result set. The steps involved in the
ranking algorithm for ‘diversity’ are described next.
208 P. Bhattacharyya and S.F. Wu
1. If the number of result candidates is less than or equal to the size of a result set,
i.e. if λ ≤ ρ , then only one result set is possible and RF(Q(u, q)) = RC(Q(u, q)).
2. If the number of result candidates is greater than the result size set and the num-
ber of unique users is equal to the result set size, i.e. if λ > ρ and λv = ρ , then
RS(Q(u, q)) is constructed using the most recently shared post (using information
from T (d)) of λv users. This automatically ensures that maximum value of diver-
sity is achieved in the result set. If the starting condition of result candidates pro-
cessing is this step, then the result set becomes the first result set of the final result
set, i.e. RS1 (Q(u, q)). Now, RCnew (Q(u, q)) = RC(Q(u, q)) − RS1(Q(u, q))}. The
values related to λ and λv are updated accordingly and in the next iterations
to construct result set RS2 (Q(u, q)), ..., RSα (Q(u, q)), the applicable steps are
followed.
3. If the number of result candidates is greater than the result size set and the num-
ber of unique users is less than the result set size, i.e. if λ > ρ and λv < ρ , λρ
possible result sets are constructed and using the user information available in
each result set, result value for the ‘diversity’ factor is computed. The result set
with the highest value of diversity is selected and RC(Q(u, q)) is updated to re-
peat the steps to compute next set of results. A user may contribute multiple times
in the result set but the process ensures that the result set has the highest value
of diversity. In the case of multiple result sets with equal value of ‘diversity’,
knowledge about time-stamps of each shared information included in the result
set is used to break the tie and the result set with the highest value of time-stamp
(i.e. the result set with the most recently shared documents) is selected as the
result.
4. If the number of result candidates is greater than the size of a result set and
the number of unique users is also greater than the result set size, i.e. if λ >
ρ and λv > ρ , we start by first constructing λρv number of sets and compute
the diversity value of each set. The set with the highest value of diversity is
selected and documents associated with each user is selected to formulate the
result set. The most recently shared document by users are used and in case of tie
in diversity values, time-stamp values are used to break the tie and the set of most
recently shared documents are declared as winner. The set of result candidates,
RC(Q(u, q)), is updated and the steps are repeated till the set of result candidates
has no more entries.
Based on the relationship shared between two users in a result set, the algorithm
to rank results for the ‘diversity’ factor contrasts the corresponding ranking algo-
rithm of other factors. Algorithm for other factors are presented next.
1. If the number of result candidates is less than or equal to the size of a result set,
i.e. if λ ≤ ρ , then only one result set is possible and RF(Q(u, q)) = RC(Q(u, q)).
2. If the number of result candidates is greater than the result size set i.e. if λ > ρ ,
the results are ordered by the respective value (degree, betweenness central-
ity, closeness centrality or clustering coefficient value) of each user present in
RC(Q(u, q)) and the user with highest value is ranked first. Multiple entries by a
user of higher value are placed in the final result set before entries from a user
with lower degree value are considered.
5.7.3 Time
The algorithm to rank results based on the ‘time’ factor is the simplest among all
the factors. The set of information present in RC(Q(u, q)) is ordered according to
their time-stamp value. The document shared most recently is ranked first followed
by documents in decreasing value of time-stamp. The ordered set is finally used to
construct α results sets and the final result set, RF(Q(u, q)).
This concludes our discussion on the ranking factors and the associated algo-
rithms. In the next section, we discuss details about the implementation of the social
search engine.
6.1 Crawler
The purpose of the Crawler is to pull out information from the Facebook feed of
each signed-in user using the Facebook API. The Facebook feed of a user consists
of links, photos, and other updates from friends. In this work, the Crawler focuses
on crawling the shared links to connect the web graph with the social graph. The
Crawler is executed on a daily basis for each authorized user to retrieve the following
data from their feed.
In our work, the Crawler employs the ‘links’ API provided by Facebook to crawl
the various ‘links’ i.e. internet URLs shared by users on the Facebook platform.
When called by the Crawler, the ‘links’ API returns a set of fields related to each
link entry. Among the returned fields, we consider the following fields: a) ‘id’, b)
‘from’, c) ‘link’, d) ‘name’, e) ‘description’, f) ‘message’ and g) ‘created time’ for
the next component of our search engine. The Crawler also retrieves information
about a user’s friend list to build the ego and mutual friend network of a user. The
Crawler uses the ‘friends’ and ‘friends.getMutualFriends’ API to retrieve informa-
tion about the nodes and edges, respectively to build the ego network of a user. The
Crawler also provides scope to expand our architecture to include other social net-
work platforms by mapping the field lists of each returned link with fields used by
the next two components of the architecture.
6.2 Indexer
The Indexer has two primary tasks. First, it analyzes the information retrieved by the
Crawler to build an index of keywords for each shared URL. Second, the Indexer
also performs the task of analyzing the mutual friend network of each user and build
the corresponding user relationship data. Details of each task are described next.
Once the shared URLs are retrieved from the feed of each signed-in user, the
next step is to build a keyword table for each URL with keywords extracted from
the text retrieved from the URL. We use Yahoo!’s term extraction engine [27] for
this purpose. The term extraction engine takes a string as input and outputs a result
set of extracted terms. Additionally, we also use the Python-based topia.termextract
library [22] to expand the keyword table. This library is based on text term extraction
using the parts-of-speech tagging algorithm. We retrieve text from each URL and
interpret the text using the aforementioned methods to finalize the set of keywords
for each shared link. The second task of the Indexer is to analyze each signed-user’s
mutual friend network and determine the user property information (i.e. values of
degree, betweenness centrality, etc) of each friend in the network. We use the ‘R’
implementation of ‘kCliques’ to build the social group information set [37].
To understand the impact of k in social group formation and accurate construction
of social groups as users interact with InfoSearch, we built a Facebook application
and surveyed users response for different values of k. We varied the value of k be-
tween 1 and 5 and asked users for their thoughts on the accuracy of social groups
formed at different values of k. Conclusions from user responses were then used to
determine the appropriate value of k for final result formulation in InfoSearch. In
InfoSearch: A Social Search Engine 211
social search engines and leave exploration of efficient algorithms for future works.
However, as we will see during our user case studies in Section 8.2, Table 3, as the
number of unique users in result candidates for a query can be substantially high,
we resorted to using heuristic methods during the development process. In the final
result set construction step, if the number of results, λ , and number of unique users,
λv , are both greater than the size of a result set, ρ , we consider the most recent 12
results sorted by ‘time’ and originating from 12 different users as constituent of the
starting result candidates to construct the first final result set, introducing the next
8 results into result candidates list in addition to the remaining 4 results to generate
the
12second final result set and so on. This step ensures we only have to construct
8 = 495 possible result sets before we decide the final result set at each iteration
and users can enjoy the experience of receiving a quick result set for their query.
We also implement an additional feature to help users find information related to
a specific friend or set of friends. This feature is implemented at the query step and
the user has to specify the name of his/her friend(s) in conjunction with the query.
In this particular situation, the retrieval process is limited to the set of information
related to the specified user(s) only and the time factor is used to rank the results
at this step. In the following section, we discuss the deployment of InfoSearch and
present a few statistics on its current usage and performance.
7 User Statistics
We invited colleagues from our lab to use the application. InfoSearch was made
available in March 2011. We present the following statistics analyzing the usage be-
tween March and December 2011. InfoSearch gained 25 signed-in users and through
the signed-in user’s Facebook feed, it has access to regular updates of 5, 250 users.
Each user has an average of 210 users in their ego network and their mutual friend
graph has an average of 1414 edges.
During the time InfoSearch has been active, we have crawled links shared by
3, 159 users. This is a very significant number because it tells us that, among the
users InfoSearch has access to, 60% shared a web link with their friends in the
social network. It is evident that the integration of web and social network graphs is
taking place at a rapid pace and that the growth can have a significant impact on the
way users search for information on the Internet.
The number of links shared by the users during this period is 31, 075. The num-
ber of keywords extracted using the Yahoo! term extraction engine and the Python
topia.termextract library is 1, 065, 835, which amounts to an average of 34 terms for
each link. Additionally, we also consider the number of unique terms present in this
pool to form a picture about the uniqueness in the shared content. We observe that
the number of unique terms shared across all the links is 130, 900, which results
in an average of 4 terms per link. We next discuss case studies to understand the
performance of social search engine results under different ranking factors and al-
gorithms. We start by discussing results from our user study to determine the best
value of k to formulate social groups.
InfoSearch: A Social Search Engine 213
8 User Studies
8.1 Social Group Analysis
An interpretation of the number and qualitative properties of social groups proposed
by any method is a matter of subjective analysis to a particular user. In our definition
of social groups, we mention the permissible upper-bound geodesic distance of k for
two users to be a part a social group in the ego network of a user. A variation in the
values of k can thus determine different social groups and consequently can lead to
different (favorable or unfavorable) appreciation of the quantitative and qualitative
properties of the social groups. To understand the value of k at which the users feel
the social groups formed are best representative of their social network, we built a
Facebook application1 and sought out user feedback. We next describe the details.
A user must approve an application before the application can interact with the
user. Once a user u approves the application to read their respective social data,
information about their friends are fetched. In the second step, the fetched friend
information is used to construct the mutual friend graph, MF(u). Next, we con-
struct social groups, SG(u), starting with value of k equal to 1. We display the group
formed to the user and sought out their feedback on two questions. In the first ques-
tion, we asked users their opinion on the number of groups formed. The answer
scores and their corresponding labels were a) 5, ‘Too Many’ b) 4, ‘Many’ c) 3,
‘Perfect’ d) 2, ‘Less’ and e) 1, ‘Too Less’. In the second question, we asked partici-
pants of their feedback on the quality of the groups formed i.e. if the social groups
formed were accurate representation of their real life groups. To obtain feedback for
this question, we provide the participants the following scores along with the corre-
sponding labels: a) 5, ‘Yes, Perfectly’ b) 4, ‘To a good extent’ c) 3, ‘Average, could
be better’ d) 2, ‘Too many related friends in separate groups’ and e) 1, ‘Too many
unrelated friends in the same group’. We repeat the above step by incrementing the
value of k for an upper limit of k = 5.
Table 1 User feedback scores on number Table 2 User feedback scores on quality
of social groups detected of social groups detected
Thirty users with varying size of friend lists signed into the application. Mea-
surements from the logged-in user’s egocentric networks are presented in Figure 8.
1 The application is available at http://apps.facebook.com/group friends
214 P. Bhattacharyya and S.F. Wu
We present results on the number of groups formed along with the average size of
the groups for varying values of k for different user degrees in each of the figures.
At k = 1, the number of groups formed grows linearly with the degree of the user.
At higher values of k, we observe that the number of groups formed significantly
drops with larger average group sizes. For example, at k = 1, number of groups is
equal to 60 and average group size is equal to 5 for users with degree equal to 100.
However, at k = 2, for the same users, the average size of the groups have risen to
15 while the number of groups has dropped to only 20. This happens because as
we increase the value of k and correspondingly relax the requirements of member
inclusion into a group, higher number of members are included into a single group
including overlapping members. However, the more interesting observation comes
when we compare the values obtained for k = 4 and k = 5. Since, we allow overlaps
to exist across groups, if certain users exist over multiple groups for a given k, when
we would allow a larger k, this overlapping user would cause the groups to collapse
into a single group. Contrary to this assumption, we see only small changes in the
values observed for k = 4 and k = 5 than for changes in values observed for k = 3
and k = 4, indicating that members in the mutual friend graph exist in small clusters
that can be separated from each other at a certain cutoff level; k = 4 in this case.
Scores from the feedback analysis for the above two questions are presented in
Table 1 and Table 2, respectively. We see the feedbacks on the number of social
groups formed at k = 3 is approximately equal to 3, a score indicating a ‘Perfect’
division of the egocentric networks of the users into how they perceive their own
social relationships to be divided in real life. It is also interesting to note in this
section that the standard deviation at this instance is the least of all the feedbacks
received.
User feedbacks on quality of the social groups formed are presented in Table 2.
It is interesting to note that at values of k equal to 1, 2 and 3, feedbacks indicate
a score between ‘Average, could be better’ and ‘To a good extend’ indicating that
the social groups detected are indeed accurate representation of how users perceive
InfoSearch: A Social Search Engine 215
their friends to be members of different sections in the real life. We thus conclude
that a value of k equal to 3 is a good choice to compute social groups and form the
basis of providing diversity based results to users in InfoSearch during any query.
examples, the worst case scenario is to construct a result set of ρ results from a
possible result candidate
of 1008 results originating from 246 unique users where
we can construct 246 8 = 2.96 × 10 14 sets to select the best result set. Clearly, this
is a situation we want to avoid when we compute result for users on the fly. A
consideration of this issue motivated us to exploit methods that will help us scale
the computation and thus, finally in our result generation process, we consider only
the 12 most recent result in the result candidate set to construct each result set. Next,
we discuss the result values. We start by evaluating result values for the diversity
factor.
Diversity result value of a result set is given by RV (RS(Q(u, q)), ‘Diversity’) and
the values are plotted in Figure 9. The diversity values in the plot have been com-
puted for k = 3. It is expected that the result sets produced using the diversity factor
and it’s corresponding ranking algorithms that aims to select the result set with the
maximum value of diversity, has the highest values of diversity compared to the
values of result sets generated by other factors. The plots confirm this hypothesis,
however, it is interesting to note the difference in values of result sets computed
using other factors. The consistency in decreasing values is best exemplified in the
case of userB and query ‘Budget’. userB’s relatively large network (1129 friends)
helps in retrieving results from a vast section of the network with high values of
distance and corresponding diversity between the users. In contrast, diversity values
for result sets formulated using the clustering and centrality measures are lowest in
nature and shows signs of partiality in result formulation by contributions from only
a few segments in the network.
We also observe the lowest diversity value related to any result set in the case
for the result set computed by ‘time’ factor. In the context of a large number of
possible result candidates for query ‘privacy’ for userB, diversity value is only 0.03
compared to the diversity value of 0.12 for the result set determined by the diver-
sity factor itself. Similar patterns can also be observed for query ‘Budget’, values of
0.09 and 0.39 for results ranked by time and diversity respectively. We infer from
this observation that information once shared by a member in a social group, has
a tendency to flow between the members of the particular social group before it
is shared by members of other social groups. This leads us to believe that result
sets formed based on time of sharing can lead to information sources that origi-
nate within particular social groups and will have the lowest social diversity value.
While the diversity based algorithm tries to maximize the value of social diversity in
InfoSearch: A Social Search Engine 217
results, time factor, among the other factors mostly retrieve results that have the least
value of social context present. We next discuss the degree values of result sets.
The degree value of a result set is given by RV (RS(Q(u, q)), ‘Degree’) and the
values are plotted in Figure 10. Similar to results ranked by ‘diversity’ factor which
were expected to generate result sets with the highest values of diversity among any
of the factors, the ‘degree’ value is also expected to be the highest among all the
result sets for the result set generated by the ‘degree’ factor and it’s corresponding
ranking algorithm. The plots confirm the expectation. The values for queries ‘Bud-
get’ and ‘Privacy’ for userA are 22.25 and 23.87 respectively compared to the sec-
ond highest values generated by ‘diversity’ factor at 18.87 and 20.87, respectively.
Similar trends are also observed for userB in Figure 10b. However, it is surpris-
ing to notice the difference between values when compared to the values generated
by the ‘degree’ factor. The values for query ‘Budget’ for factors ‘time’, ‘clustering
coefficient’, ‘closeness centrality’ and ‘betweenness centrality’, 14.25, 15, 15, 15
for userA and 6.25, 3.25, 3.12, 3.12 for userB, respectively, are significantly lower
218 P. Bhattacharyya and S.F. Wu
while the ‘diversity’ factor, 18.87 for userA and 11.75 for userB, is able to rela-
tively match up with the values of the ‘degree’ factor, 22.25 for userA and 11.875
for userB. The relative matching in the results is significant because although de-
veloped for a different reason, the ‘diversity’ factor is successful in capturing the
essence of the ‘degree’ factor and provide comparable values for the ‘degree’ metric,
thus showcasing itself as a strong candidate to power social search engine ranking
algorithms.
We next analyze the result values for the ranking factors based on centrality mea-
sures, i.e. ‘betweenness centrality’ and ‘closeness centrality’. Analogous to the ‘di-
versity’ and ‘degree’ factors, result sets are also expected to have the maximum
value of betweenness centrality and closeness centrality when the result sets were
computed based on the respective factor and associated algorithm. We notice the
phenomenon in the plots in Figures 11 and 12. Furthermore, we observe that the
measures also generate similar result values for other factors. The highest value of
betweenness centrality is observed to be 0.0345 for userA and 0.0033 for userB
InfoSearch: A Social Search Engine 219
during analysis for query ‘Budget’ and 0.0301 for userA and 0.0082 for userB for
query ‘Privacy’ the betweenness centrality factor (among other factors with equal
values).
We see relatively low fluctuation in result values except for in the values gen-
erated by the ‘time’ factor based result set. The respective value for ‘time’ factor
is 0.0339, 0.0029, 0.0267 and 0.0082, a percentage difference of 1.74%, 12.12%,
11.30% and 0%, respectively. This strengthens our previous argument that informa-
tion has a tendency to flow between social groups before it spreads into a broader
section of the ego network and a social search engine based solely on the ‘time’
factor thus fails to offer any advantage in terms of exploiting the prevalent social
information. Next, we look at the ‘clustering coefficient’ result values. Unsurpris-
ingly, we find a repeat of the same behavior here too with the ‘time’ factor offering
the least value among all factors and failing to capture the social relationship based
information into the result set. Finally, we investigate the ‘time’ characteristic of
result sets.
220 P. Bhattacharyya and S.F. Wu
We analyze time value of result sets from a reference date such that we can un-
derstand the relative ‘freshness’ of the data shared in the network. For example, if
we observe two result set(s), we observe the average time-stamp value of shared
information is 10 and 100 days in the future from the reference date, we term the
result with the average time-stamp value of 100 days since the reference date to be
more relevant and fresh to the user. Moreover, we also look at the standard devi-
ation in the time-stamp values of the shared information and we term a result set
with minimum values of deviation as the relevant result. Result values for the ‘time’
factor is presented in Figure 14.
The reference point for ‘time’ value analysis is placed on January 1st , 2011 and
the plots showcase number of days since the reference point. Thus, expectedly we
observe the results generated based on ‘time’ factor has the maximum value com-
pared to the respective value of result sets built using other factors. In the example of
userA for query ‘Budget’, the value of result ranked using ‘time’ factor is 108 days
whereas in contrast the lowest value is offered by the result set ranked by the ‘de-
gree’ factor at 78 days. Furthermore, the corresponding deviation in the time-values
are 30 days and 45 days respectively. Similar trends can also be observed in other
cases. This happens because when results are ranked according to social relation-
ship based factors, results that were shared a significantly long time ago are ranked
higher in order to enrich the social value of the result set. Although not unexpected,
a time based ranking of results thus, fails to accommodate social relationship se-
mantics and provides a result set that is mostly partial to only a sub-section of the
user’s ego network. In the next section, we conclude our work with a discussion
about future work.
9 Concluding Remarks
In this chapter, we described our efforts to build InfoSearch over the Facebook
platform as a prototype social search engine and provide scope to users to search
through the posts shared by their friends. In the process, we identified six important
factors related to ranking search results for social search systems. Users can employ
either one of the factors to rank results as they search through InfoSearch. Based
on data collected through the Facebook feeds of two authors, we also performed
user studies to understand the impact of ranking factors in the formation of result
sets. We observed that ‘time’ based ranking of results, while providing the latest
posts, fails to include sufficient social information in the result based on the value
generated for both ‘degree’ and ‘diversity’ factors.
Among the factors based on semantics of social relationships between a user per-
forming a query and a user sharing a piece of information, ‘diversity’ based factor
provides sufficient social context into the result set as well as performs well in com-
parison to ‘degree’ factor to include time characteristics in the result set. We believe
the area of social search engines has an immense potential in the area of information
search and retrieval and we want to expand this work into multiple directions. First,
we want to grow the usage of InfoSearch by inviting more users to use our system on
InfoSearch: A Social Search Engine 221
a regular basis and provide us feedback on their opinion about the quality of results
formulated. Second, we want to extend the system architecture to include the scope
of distributed databases and develop the application into a distributed system capa-
ble of handling thousands of queries at any given time. Third, we want to extend
the factors involved in the ranking process to include other online social network
platform focused factors like ‘interaction intensity between users’. Finally, we aim
to develop methodologies and standards to objectively evaluate social search engine
results.
References
1. Adamic, L., Adar, E.: How to search a social network. Social Networks 27(3), 187–203
(2005), doi:10.1016/j.socnet.2005.01.007
2. Banerjee, A., Basu, S.: A social query model for decentralized search. In: Proceedings of
the 2nd Workshop on Social Network Mining and Analysiss, vol. 124. ACM, New York
(2008)
3. Bastian, M., Heymann, S., Jacomy, M.: Gephi: An open source software for exploring
and manipulating networks. In: International AAAI Conference on Weblogs and Social
Media (2009)
4. Baumes, J., Goldberg, M., Krishnamoorthy, M., Magdon-Ismail, M., Preston, N.: Find-
ing communities by clustering a graph into overlapping subgraphs. In: International Con-
ference on Applied Computing (2005)
5. Baumes, J., Goldberg, M., Magdon-Ismail, M.: Efficient identification of overlapping
communities. In: Kantor, P., Muresan, G., Roberts, F., Zeng, D.D., Wang, F.-Y., Chen,
H., Merkle, R.C. (eds.) ISI 2005. LNCS, vol. 3495, pp. 27–36. Springer, Heidelberg
(2005)
6. Bhattacharyya, P., Rowe, J., Wu, S.F., Haigh, K., Lavesson, N., Johnson, H.: Your best
might not be good enough: Ranking in collaborative social search engines. In: Proceed-
ings of the 7th International Conference on Collaborative Computing: Networking, Ap-
plications and Worksharing (2011)
7. Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Link analysis ranking: algo-
rithms, theory, and experiments. ACM Transactions on Internet Technology 5(1), 231–
297 (2005), doi:10.1145/1052934.1052942
8. Brandes, U.: A faster algorithm for betweenness centrality. Journal of Mathematical So-
ciology 25, 163–177 (2001)
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Com-
puter Networks and ISDN Systems 30(1-7), 107–117 (1998)
10. Cross, R., Parker, A., Borgatti, S.: A bird’s-eye view: Using social network analysis to
improve knowledge creation and sharing. IBM Institute for Business Value (2002)
11. Davitz, J., Yu, J., Basu, S., Gutelius, D., Harris, A.: iLink: search and routing in so-
cial networks. In: Proceedings of the 13th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 931–940. ACM (2007)
12. Derényi, I., Palla, G., Vicsek, T.: Clique percolation in random networks. Physical Re-
view Letters 94(16), 160, 202 (2005)
13. Dhyani, D., Ng, W.K., Bhowmick, S.S.: A survey of Web metrics. ACM Computing
Surveys 34(4), 469–503 (2002), doi:10.1145/592642.592645
14. Dodds, P.S., Muhamad, R., Watts, D.J.: An Experimental Study of Search in Global
Social Networks. Science 301, 827–829 (2003)
222 P. Bhattacharyya and S.F. Wu
36. Tyler, J., Wilkinson, D., Huberman, B.: Email as spectroscopy: Automated discovery of
community structure within organizations. In: First International Conference on Com-
munities and Technologies (2003)
37. Carey, V., Long, L., Gentleman, R.: Package rbgl (2011),
http://cran.r-project.org/web/packages/RBGL/RBGL.pdf
38. Wasserman, S., Faust, K.: Social network analysis: Methods and applications. Cambridge
university press (1994)
39. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Na-
ture 393(6684), 440–442 (1998), http://dx.doi.org/10.1038/30918
40. Wingfield, N.: Facebook, microsoft deepen search ties (May 16, 2011),
http://online.wsj.com/article/SB100014240527487034212045
76327600877796140.html
Social Media in Disaster Relief
Usage Patterns, Data Mining Tools, and Current
Research Directions
Abstract. As social media has become more integrated into peoples’ daily lives,
its users have begun turning to it in times of distress. People use Twitter,
Facebook, YouTube, and other social media platforms to broadcast their needs,
propagate rumors and news, and stay abreast of evolving crisis situations. Disaster
relief organizations have begun to craft their efforts around pulling data about
where aid is needed from social media and broadcasting their own needs and
perceptions of the situation. They have begun deploying new software platforms
to better analyze incoming data from social media, as well as to deploy new
technologies to specifically harvest messages from disaster situations.
1 Introduction
In this chapter, we review the ways in which individuals and organizations have
used social media in past disaster events and discuss ways in which the field will
progress. In the first section, we cover the how both individuals and organizations
have used social media in disaster situations. Our discussion emphasizes how both
types of groups focus on searching for new information and disseminating
information that they find to be useful. In general, facts about disasters collected
from the small number of individuals located near the scene of a disaster are the
most useful when dealing with specific disaster situations. Unfortunately, this data
is rare and difficult to locate within the greater sea of social media postings related
to the disaster.
We follow this by discussing a framework for considering how to analyze and
use social media. This framework consists of several different use cases and
W.W. Chu (ed.), Data Mining and Knowledge Discovery for Big Data, 225
Studies in Big Data 1,
DOI: 10.1007/978-3-642-40837-3_7, © Springer-Verlag Berlin Heidelberg 2014
226 P.M. Landwehr and K.M. Carley
analytic steps: collecting social media data; managing a workflow for analyzing
social media data; constructing a narrative from social media data; processing
social media data to find relevant information; working with geolocation data;
analyzing the text of social media postings; and broadcasting information using
social media. Along with each step we provide reference examples of tools and
libraries that can be used by analysts and first responders.
The chapter concludes with a section looking at current research into how we
can better analyze and understand social media. Our discussion centers on
methods for automatically classifying text and using visual analytics to gain new
insights.
The questions confronting people in a disaster are almost always the same: What
happened? Are my friends and possessions safe? How can we remain safe? Social
media is a new resource for addressing these old needs.
Locals at the site of the disaster who are posting information about what they
are witnessing are in many ways the gold of the social media world, providing
new, actionable information to their followers. They are few in number, and while
their messages are sometimes reposted they often don’t circulate broadly.
Locating their content is an ongoing challenge akin to finding a needle in a
haystack. Such local information can serve as an early alert system, leading
traditional news sources [1].
While non-local users cannot provide reportage on the disaster, they can
propagate local stories across the network and help them gain traction. By simply
discussing a disaster or using hashtags associated with it they can contribute to
other users’ perception that the disaster is relevant. They can also collate data from
other media sources, ferret out local users, and debunk false rumors as they begin
to propagate. They can also serve as a workforce to sort through postings for the
few that are cries for help, identify locations based on photos, find missing people
in scanned video, and create maps where there are none.
Organizations fill a different role in the social media ecosystem. While
individuals seek out information to preserve their own well-being, news media and
aid groups use social media to help carry out their missions. Reporters look to
social media to find stories and get feedback on their coverage. They and their
parent organizations also often post links to breaking stories hosted on their own
websites or being broadcast in the traditional media. Relief groups post requests
for resources, announcements about their activities, and monitor social media for
information they can use in their relief work.
Individuals within organizations are often charged with monitoring social
media for any and all content that might be relevant to understanding the disaster
as it relates to the organization’s mission. This is a free-form search for
information, conditioned only on the organization’s role. It’s similarly difficult to
2.1 Individuals
Disasters rarely end instantaneously. Aftermaths can drag on for days, weeks,
months, or years. (As we write this, three years after Haiti was struck by a
devastating earthquake, thousands of individuals remain in tent cities [3].) Disaster
researchers often divide disasters and disaster response into four phases:
preparedness; response to the event; recovery, including rebuilding after the
response; and mitigation, including enacting changes to minimize the impact of
future events [4].
When people are confronted with a disaster they don’t just seek to preserve
their lives at a single critical moment. Users actively seek out information that can
help them understand what's happening over a prolonged period of time. They try to connect with other members of local communities for support, aid, and understanding. Often, they will use technology to do so ([5] as cited by [6]).
Shklovski et al. documented this process for individuals in California who were
afflicted by wildfires in 2007 [6]. These fires dragged on for weeks, covering large
swathes of rural countryside. Californians in at-risk areas found the news media
unhelpful, citing a focus on stories about damage to celebrity homes. What locals
wanted was general information about where fires were occurring and who was in
danger. To combat this lack of knowledge, the Californians being studied had set
up two different online forums for posting news and warnings. At the end of
wildfire season, one of the subject forums was closed because it was no longer
useful. The other remained open as a community hub and remained part of its
users’ lives.
These researchers later witnessed a similar phenomenon among a community
of musicians in New Orleans in the aftermath of Hurricane Katrina [7]. The
musicians adopted SMS messaging, more regular cell phone use, and posting to
online forums in order to stay in touch during the disaster. Like the Californians
living in range of the wildfires, these New Orleans natives felt that the television
media focused on the most dramatic aspects of the disaster while ignoring the
majority of the afflicted. The victims used satellite images and message boards set
up by the local newspaper to find information that was relevant to them. They
turned to previously unused technologies to socialize in disaster, and in many
cases adopted these new practices into their regular lives.
In both California and New Orleans, individuals turned to technological resources to carry out established information-seeking patterns via new media. Since
these studies were carried out, we have seen the advent of Web 2.0 and the
plethora of social media platforms that exist today. It is easier than ever to search
the web for information about disaster, but filtering out rumor, falsehood, and off-
topic discussion from the ocean of online content remains difficult. The
outstanding research challenge remains helping people to find information they
need and to post information so that it can be found.
While people don’t intentionally confine themselves to a particular medium,
they naturally favor those with which they are comfortable and those from which
they believe they can gain more information. Since its introduction in 2007,
Twitter has benefitted from generally positive media coverage [8]. Thanks to both
this positive portrayal and its widespread adoption, the microblogging platform
has become seen as an important source for disaster information. Leading up to
Superstorm Sandy in 2012, blogs published guides for how to best search Twitter
for data [9]. In the storm’s wake, blogs and news media published stories about
how much Twitter had been used [10, 11].
Despite the press coverage, Twitter isn't the dominant means of electronic communication; its usage is the barest fraction of SMS and email [12]. While a personal email is often rich in meaningful content, Twitter's broadcast nature means that the relevant tweets sent during any event are buried in a sea of off-topic noise generated by third-party users. Nonetheless, the ready availability of its data, along with the perception that the service is the "new thing," has made it a popular choice for academic research. Twitter is by no means insignificant (its millions of users are real), but it is perhaps overvalued. Even as we focus heavily on it in this chapter, we advise that you consider the platform's relative position and situate your findings correspondingly. In particular, other platforms, especially SMS, should be given additional research more in keeping with their usage patterns.
The vast pool of research on how Twitter has been used outside of disaster is
generally beyond this chapter's scope. However, it is useful for understanding how the service has generally been used, so we provide a brief overview here.
Kwak et al. collected a very large corpus of Twitter users, tweets, trending
topics, and social relations between users, and provide a large collection of
summary statistics for each. The researchers make a variety of observations, not
least of which is that there is little overlap in their data between the most followed
users on Twitter and the users who are most retweeted. They also find that
following has a low degree of reciprocity, and that users who follow each other
tend to be in the same time zone [13]. Java et al. have used network methods to analyze Twitter and try to identify meaningful user communities. In the process, they categorized the bulk of Twitter interactions as consisting of "Daily Chatter" (descriptions of routine life), conversations, information sharing, and reporting news. They also characterized users primarily as information sources, information seekers, and friends [14]. Naaman et al. collected tweets
from approximately 125,000 users over a prolonged period, developed nine
overlapping categories for the tweets, and then identified two clusters of users:
meformers, who often broadcast personal information, and informers, who
generally shared different types of information [15]. Bakshy et al. tried to identify
how one could successfully inject a particular idea into Twitter by influencing a
particular user. The researchers consider a user to have "influence" when other users retweet a URL that the user has posted; they caution that requiring a retweet of the exact URL is a strict criterion for detecting influence, but one that is precisely measurable. While they identified certain users as possessing influence and causing cascades of information, they found it difficult to predict when a cascade would occur or which of these potential influencers might cause one. The researchers concluded that the most cost-effective approach for propagating a particular URL or idea on Twitter would be to seed many non-influential users. These users would have the potential to create many small information cascades which might then add up to one of the relatively rare large cascades [16].
Research on how Twitter is used in disaster often takes the form of collecting data from a particular subset of users commenting on a disaster and examining the particular features of their discussion. For example, Starbird et al.
attempted to understand usage patterns during the 2009 Red River flood by
qualitatively analyzing tweets collected during the flood period that used the terms
“red river” and “redriver”. The researchers identified two overlapping types of
useful tweets by users: generative and synthetic. Generative tweets introduce new
information via description of lived experience or factual commentary on an
extant tweet [17]. Synthetic tweets pull in a variety of outside information and
repackage it specifically for Twitter: a 140-character summary of a news story, for
example, as might be produced by a news organization. While the authors noted
other types of tweets, the generative and synthetic made up the kernel of the useful
data that arrived during the disaster. Original tweets are also hard to find. They
made up less than 10% of the sample used by the researchers, and more than 80%
of that small number were produced by individuals located within six hours' driving time of the afflicted area.
Similarly, Sinnappan et al. attempted to categorize tweets broadcast by Australians during the 2009 Black Saturday bush fires. Using another search-based approach, the authors coded the tweets using a version of Naaman et al.'s general tweet categorization scheme modified specifically for disasters. Of the 1,684 tweets captured, the researchers found that only 5% contained directly actionable information [18]. Similarly, only 4% of messages posted to the
Chinese microblog service Sina Weibo after the Yushu Earthquake in 2010 related
to actions that individuals could or needed to take [19]. Roughly 25% of the
messages were tied to situation updates about Yushu, but a large number of them
were from secondary sources, something also true for the data analyzed by
Sinnappan et al.
In her thesis research, Sarah Vieweg developed a new categorization system for
the subset of tweets that contain useful information. Synthesizing tweets from four
disasters and referencing the disaster research literature, she created three
overarching categories (social, built, and physical environment) for useful tweets.
These categories are themselves split into 35 subcategories that capture the
message’s content [20]. Sample categories include “Status – Hazard”, “Advice –
Information Space”, and “Evacuation”.
These phenomena (a small number of actionable tweets, a small number of tweets
from locals providing primary source data) play out repeatedly in analyses of
different disasters. The non-local tweets often play secondary roles that are
important in the broader context of the disaster. Sutton witnessed this when
researching Twitter discussions of the 2008 spill of 5.4 million cubic yards of coal
ash into the Tennessee River [21]. While many of the Twitterers were local, Sutton
describes them as using the medium as a “grassroots mechanism” for getting
national media attention aimed at the disaster. They are the non-influential users
trying to start local cascades.
While demanding that a retweet must include a particular URL is stringent, the
basic idea of using retweets as a measure of endorsement is natural and useful.
Starbird & Palen found this to be true in the tweets broadcast during the 2011
Egyptian uprising [22]. (Bear in mind that an uprising differs from conventional
disasters as it features two opposing forces, not simply people in distress.) The
researchers draw the same lines that they have before between locals and non-
locals and the relative importance of these tweets for knowing the condition on the
ground. However, they also note that retweets make up 58% of the corpus they
collected, and that the most circulated tweets were all variants of a particular
“progress bar” meme about uninstalling a dictator or installing democracy. The
meme originated with Twitterers outside of Cairo but eventually made its way into
the city proper, getting picked up by other Twitterers nearer the heart of the
protest. The researchers characterize the meme as the “complex contagion”
described by Centola & Macy, arguing that the remixing of the different meme
elements “show some degree of shared understanding of its purpose”. ([23] as
cited by [22].) Meme retweeting and remixing kept the protesters involved, and
can be seen as a way that even those outside a developing crisis situation can try to
connect themselves to it, possibly as a precursor to additional action.
In addition to trying to raise awareness of the disaster, the Twitterers
responding to the coal ash spill in Tennessee also tried to debunk false rumors
about the disaster’s scope. Indeed, the segment of the twitter community affiliated
with any particular disaster has taken on the job of suppressing rumors relating to
it. NPR Reporter Andy Carvin, who gained acclaim covering global news events
solely on Twitter, has likened his many followers to the staff of a news room:
“rather than having news staff fulfilling the roles of producers, editors,
researchers, etc., I have my Twitter followers playing all of those roles [24].”
Carvin relies on the platform to eventually provide him with access to domain
experts who can verify content or help him debunk it. For example, Carvin was
able to work with his followers to determine that a prominent blog ostensibly
written by a Syrian lesbian documenting the local unrest was actually a hoax [25].
Similarly, during Superstorm Sandy reporter Jack Stuef exposed user
@comfortablysmug as spreading false information about what was happening in
New York City. Many of @comfortablysmug’s tweets were identified as false by
other Twitter users, while Stuef matched images from @comfortablysmug's Twitter profile to the user's YouTube account and was able to determine his true identity [26, 27].
Mendoza et al. attempted to systematically analyze the practice of individual
Twitterers debunking and supporting the various rumors that can arise as a disaster
progresses [28]. The researchers identified tweets sent in the wake of the 2010
Chilean earthquake that had been retweeted at least one thousand times and that were
promulgating ideas externally verified as either true or false. They then looked at
the responses that these tweets had elicited. None of the verified truths were
substantially contested by Twitterers, while all of the falsehoods saw a number of
tweets denying their accuracy. Additionally, the falsehoods were generally
affirmed as true in other tweets more rarely than were the genuine truths. The
exception to this was the widespread reporting of looting in certain areas of
Santiago; tweets about this topic performed similarly to the other true tweets. This suggests that while rumors can generally be expected to be called out on Twitter, particular types of rumor will still fly under the radar and be hard to detect. The
study suggests that true reports of disasters will not be regarded as controversial,
which may be useful in automatically confirming their accuracy from social media
data.
Contra the Mendoza et al. study, however, we emphasize that even if eventually
corrected, falsehoods have been propagated on social networks for long enough to
enter the mass media. @comfortablysmug’s stories of flooding at the NYSE were
rebroadcast by several major news outlets before Stuef outed him. In the aftermath
of the 2013 Boston Marathon Bombings, Twitter users and Redditors incorrectly
identified a missing Brown University student and an individual mentioned on a police scanner as the bombing suspects [29, 30]. This caused a brief but potent online witch-hunt for which Reddit's administrators apologized [31]. The Boston
Police, which has made extensive use of Twitter before and after the bombing,
published the facts of the case to the platform to counter the rumors [32].
We have mentioned that Twitter has gotten a great deal of research attention
relative to other media used in disaster. While it remains the focus of this chapter,
it is important that we acknowledge the ways that individuals are leveraging other social media in these circumstances. It would be shortsighted to focus solely on Twitter when an individual equipped with a smartphone can already function on any number of social media platforms at once. Technologies for analyzing social
media will not remain confined to a single platform but will exploit as many as
possible for data. They will leverage not just Twitter, but also RSS feeds,
Facebook, SMS, Sina Weibo, Foursquare, and others from a variety of different
nations.
As mentioned at the start of this section, musicians in New Orleans adopted
SMS messaging in the aftermath of Katrina to stay in touch. SMS’s ability to
directly connect individuals and the widespread availability of the technology on
low-tech cellphones has made it critical in emergency situations. In the immediate
aftermath of the 2010 Haiti Earthquake, a small group of actors from relief
organizations and the US Government got Digicel, Haiti's main cellular service
provider, to reserve the SMS short code 4636 as a dedicated number for
processing distress messages. These messages were archived and translated into
English by Haitian expatriates mobilized by the organizers of “Mission 4636”.
Both expatriates and Haitians still on the island worked to promote the short code
as a useful resource. By Week 3, Mission 4636 was dealing with such a volume of
messages that it began working with the CrowdFlower and Samasource
crowdsourcing platforms to better coordinate message translation [33].
From our perspective on how individuals use social media, SMS was key in
this disaster because significant numbers of Haitians used low-tech cellphones that
could access SMS services in the wake of the earthquake, and because the SMS
infrastructure itself was still working. In that sense, it was the right medium for the
time. As has occurred with Twitter data in other disasters, the SMS data was rife
with falsehoods despite being sent to a dedicated help line. According to the
Harvard Humanitarian Initiative’s (HHI’s) study of relief organization responses
to the Haiti earthquake, perhaps as many as 70% of the 4636 messages contained
errors, such as requests to locate victims by people who knew the victims to be
dead [34].
Photo sharing during disaster has also seen some degree of academic study,
though more work is needed. In 2008, Liu et al. looked at how Flickr had been
used in response to seven different disasters [35]. They observed that individuals
were posting photos of damaged areas for a variety of different reasons, united by
an over-arching theme of documenting the crisis. The different photographs can
generally be categorized as depicting a particular event, capturing on-line social
convergence (e.g. screen shots of Facebook posts), listing the missing, and
showing personal belongings (taken for inventory purposes).
Flickr can be understood as fulfilling some of the same needs as text-based
services: individuals post representative images of disaster sharing information
about what they understand the situation to be. It’s used to help communities
organize and share information. It’s also used to serve practical, individual needs,
such as inventorying possessions. The authors tie this back to the medium itself: a
photograph is a richer data source than a 140-character message. Twitter isn't an
efficient tool for cataloging possessions.
Regardless of precise intent, when placed in disaster situations individuals
broadcast and examine data using social media. Academic research has tried to
categorize these usages, often noting that actionable information is hard to find,
that information can be false, and that primary source information from local users
can be rare. Additionally, platform-specific practices can potentially be subverted
for additional information: we can characterize true and false statements seen
during a disaster based on the number of debunking statements noticed in
response. We can infer that photographs taken in disasters of peoples’ possessions
are being used to inventory property.
Just as social media are being leveraged by individuals during disasters, so too
are they being used by relief organizations, both government affiliated and
independent. While the specific purposes behind the uses may be different and
complementary, the uses of the platforms are similar. We discuss these in the next
section.
2.2 Organizations
First responder organizations, which include government agencies, police, firemen, medical and public health organizations, military responders, and not-for-profits, play critical roles in disaster response. These groups generally have access
to data and analytical tools that are not available to the public, as well as the
resources needed to rescue and aid individuals who are in distress. While people
may distrust the accounts of unknown strangers reporting rumors, these
organizations have established brands that often temper or heighten critical
attitudes towards their own postings. First responder organizations are
increasingly turning to social media to identify actionable needs and orient the
response, gauge the scope of impact of the event, provide information to the
public, track and mitigate firestorms and counter false information, and to try to
identify potential secondary disasters before they occur [36].
Where social media has provided individuals with new spaces in which to
mingle and interact, it has provided organizations with new spaces in which to
research ongoing disasters and communicate with both victims and the general
public. St. Denis et al. explored this phenomenon by looking at how a Virtual
Operations Support Team (VOST) dealt with the 2011 Shadow Lake Fire [37].
The concept of the VOST was developed by emergency manager Jeff Phillips as a
way for an organization to coordinate its responses on and to social media
coverage of a disaster, and has been propagated by other emergency managers
[38–40]. According to Phillips, the VOST should "integrat[e] 'trusted agents'
into [emergency management] operations by creating a virtual team whose focus
Haitians. The maps produced by the project, however, were used by groups on the
ground, and the project is often mentioned in close connection with Mission 4636
despite functioning independently [34]. The Haiti Ushahidi project presented a
better public face than did Mission 4636 despite processing significantly less
information. Further, by choosing to release a large subset of the disaster messages
they were working with to the public, they helped put a face on the disaster in a
way that the directed channel of SMS generally does not.
During the 2011 East Japan earthquake, a group of computer scientists and
engineers formed a new, one-off aid group called ANPI_NLP to help get relevant
information from tweets [45]. The researchers sought to parse tweets to find
references to individuals who had gone missing or been found and then update records in Google Person Finder, a missing persons database.
While a one-time effort like Mission 4636, ANPI_NLP was in general a V&TC effort in the Ushahidi Haiti mold. The researchers did not present a public face to the Japanese populace, and the results that they produced were stored in a database maintained by another V&TC (Google), which then dealt with relief organizations. Where ANPI_NLP differed from Ushahidi Haiti was in using up-to-date natural language processing to speed up the task of extracting information from tweets. The researchers rapidly created a pipeline that morphologically analyzed tweets, extracted named entities and locations, and classified the nature of the information expressed. The researchers
had to perform some manual coding to create gold standard data and to vet results,
but in general this was an automated process. They also point out the existence of
problems similar to those described by Munro: translation is difficult, and human
resources are critical. To the members of ANPI_NLP, the solution lies in better
automated systems, and in tools that can more rapidly adapt to training data.
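To make the shape of such a pipeline concrete, the following is a minimal sketch in Python of the kind of processing ANPI_NLP describes: pull candidate names and places out of a tweet and assign it a coarse label. The keyword rules and the capitalization-based entity heuristic are simplifications of our own; the actual system relied on Japanese morphological analysis and trained classifiers.

```python
import re

# Toy keyword rules standing in for ANPI_NLP's trained classifiers; the real
# system used Japanese morphological analysis and supervised models.
MISSING_CUES = {"missing", "unreachable", "cannot reach", "looking for"}
FOUND_CUES = {"found", "safe", "confirmed alive"}

def extract_candidates(tweet: str):
    """Very rough named-entity stand-in: runs of capitalized tokens."""
    return re.findall(r"(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)", tweet)

def classify_report(tweet: str) -> str:
    text = tweet.lower()
    if any(cue in text for cue in FOUND_CUES):
        return "FOUND"
    if any(cue in text for cue in MISSING_CUES):
        return "MISSING"
    return "OTHER"

def process(tweet: str) -> dict:
    return {"entities": extract_candidates(tweet),
            "label": classify_report(tweet)}

if __name__ == "__main__":
    print(process("Looking for Taro Yamada, last seen in Sendai"))
    # {'entities': ['Looking', 'Taro Yamada', 'Sendai'], 'label': 'MISSING'}
```

As the sample output shows, the crude heuristic also picks up sentence-initial words such as "Looking"; this is exactly the kind of error that a trained morphological analyzer and named entity recognizer is meant to remove.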
Our discussion of how organizations used social media is framed by the
understanding that at some level they use social media the same way individuals
do: they search for information, and contribute in order to participate in the
conversation as fits their mission. Relief organizations must deal with the larger
challenge of managing their presence in particular social media spaces and must
understand the information that is coming to them via the different interfaces. One
way to deal with this is through a dedicated group such as a VOST. Further, a host
of small organizations are appearing to help work with social media data
in particular crises, leveraging local knowledge and deploying new technologies.
In its report, the HHI noted the importance of these small organizations in the Haitian earthquake's aftermath while also airing relief workers' concerns that in Haiti few of these tools had an established, dependable reputation.
It is impossible to review all of the tools that exist to help relief organizations
and analysts mine meaning from social media data. In this next section, we
approach this challenge by providing a useful framework for considering them, as well as descriptions of different tools that fit into the framework.
A variety of different tools have been and are being developed and deployed to
help people and institutions work with social media. In this section, our primary
focus is on those that are useful to analysts trying to mine social media data,
particularly from Twitter, in disaster response situations. They range from libraries
for programming languages to sophisticated GUI-based tools for responders who
need quick assessments of information to platforms for recruiting other workers to
help with tasks. Technically savvy responders and analysts chain the output
produced by multiple tools together in order to create meaningful results.
In this section we describe a mix of these tools, broken down into a rough
framework corresponding to different data mining tasks. Some of our distinctions
would not exist in a conventional data mining text, but speak to our particular
focus on social media in disaster. More specifically, in this section we discuss
tools that support data collection; that support workflow management by way of
third-party tool interoperability and enabling data retrieval; that support narrative
construction from fragments of social media data; that support data processing for
quantitative analysis and disaster response; that support pinning social media data
to maps based on geolocation data; and that support quantitative text analysis for
use with machine learning. We also cover an additional, slightly different
category: those used to broadcast on social media and to manage collection and
publication of data. These won’t be relevant to analysts or investigators looking at
specific data published on social media services, but can be important for
developing a holistic understanding of how different platforms are being used.
Such tools are of particular interest to organizations doing their best to manage all
aspects of their social media presence.
1 http://trec.nist.gov/data/tweets/
2 https://dev.twitter.com
3 https://secure.flickr.com/services/api/
4 http://spinn3r.com
5 http://gnip.com
6 http://www.casos.cs.cmu.edu/projects/ora/
7 http://www.ushahidi.com/products/swiftriver-platform
8 http://pipes.yahoo.com
so if all messages back to a particular date are needed, a database needs to be built
and maintained with the relevant data and all associated meta-data. No one tool
exists to address all these challenges. As we will see in subsequent parts of this
section, many different tools are emerging to handle pieces of these tasks.
Correspondingly, new tools are emerging to manage workflows between these
more focused tools and the larger process of cleaning and analyzing social media
data.
Social Radar, CRAFT, and SORASCS9 are three tools that address this
problem. Each is a web-based system that supports disaster response by helping
analysts and responders chain together third-party tools for sequential data
analyses. All three tools work by collecting social media data from a data
warehouse or via a particular third-party tool that accesses a social media platform's
API. The collected data is then archived and can be sent to different integrated
tools (or sequences of tools) for further processing. These tools often address text-
mining, network analysis, sentiment analysis, geo-spatial analysis, and
visualization. While some are used interactively, others process data in a silent and
opaque manner, converting data from one form to another.
Many of the tools incorporated in Social Radar, developed by MITRE, are
aimed at detecting sentiment in Twitter [50, 51]. It provides a web interface for
looking at trends in Twitter over time such as total sentiment (derived from the
presence of particular sentiment-charged terms), heavily retweeted users, and the
prevalence of particular keywords.
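As an illustration of the kind of trend such a dashboard surfaces, the sketch below tallies hits against a tiny hand-made sentiment lexicon per day. The lexicon, field layout, and scoring are our own simplifications for illustration, not MITRE's implementation.

```python
import re
from collections import defaultdict
from datetime import datetime

# Tiny illustrative lexicon; a deployed system would use a curated resource.
POSITIVE = {"safe", "relieved", "thank", "rescued"}
NEGATIVE = {"trapped", "scared", "flooded", "destroyed"}

def daily_sentiment(tweets):
    """tweets: iterable of (iso_timestamp, text); returns {date: net score}."""
    scores = defaultdict(int)
    for ts, text in tweets:
        day = datetime.fromisoformat(ts).date()
        words = set(re.findall(r"[a-z']+", text.lower()))
        scores[day] += len(words & POSITIVE) - len(words & NEGATIVE)
    return dict(scores)

print(daily_sentiment([
    ("2012-10-29T21:05:00", "Basement flooded, we are scared"),
    ("2012-10-30T09:12:00", "Everyone is safe, thank you responders"),
]))
# {datetime.date(2012, 10, 29): -2, datetime.date(2012, 10, 30): 2}
```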
CRAFT, developed by General Dynamics, is similar to these other workflow
management tools but also supports an associated environment for general
mashups. Files can be linked to Google Drive, and the platform supports a
“playback” mode that allows disaster response training exercises to be run with
archived social media data collected during prior disasters.
SORASCS, developed at the CASOS Center at Carnegie Mellon University,
supports workflow management and sharing [52, 53]. Unlike CRAFT and Social
Radar, which require outside tools to be integrated before deployment, SORASCS
is an open architecture to which analysts can independently attach their own tools.
It allows analysts to preserve, share, and modify particular workflows by saving
them to files. SORASCS’s open design would make it eligible to serve as a
coordinating under-structure behind CRAFT or Social Radar. While the latter
tools have stronger user interfaces from a crisis responder’s perspective, they
provide no facilities to preserve particular workflows for future use. Unlike
CRAFT and Social Radar, SORASCS does not necessarily convert all data into a
common database; the user is responsible for supplying a database component
themselves. In a sense, SORASCS sits at a different level of the application hierarchy
than CRAFT and Social Radar. It could serve as middleware using either platform
as a front end. This could provide some benefits to analysts because Social Radar
9 http://www.casos.cs.cmu.edu/projects/project.php?ID=20&Name=SORASCS
and CRAFT put the third-party tools in an open unstructured environment and
don’t support the development of automated and streamlined workflows as does
SORASCS.
10 http://d3js.org
11 http://storify.com
12 http://www.blogger.com
13 http://www.wordpress.com
14 http://www.tumblr.com
15 http://www.dipity.com
While keyword-based coding can be useful for culling data down to general
matches, the reduced data must often still be codified for relevance, actionability,
and accuracy. This can be partially accomplished by automated processing of the
data using trained machine learning algorithms, as in the ANPI_NLP project, but
is often handled manually. A human workforce with domain expertise can be used
to provide sophisticated labeling to disaster data.
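A minimal sketch of such a trained labeler follows, assuming scikit-learn is available and that a handful of hand-coded tweets already exist; the example tweets and the labels are invented for illustration.

```python
# Minimal sketch: train a relevance coder from hand-labeled tweets and apply
# it to new messages. Labels and examples are invented; a real workforce
# would supply far more (and far messier) training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = [
    ("Need water and insulin at 5th and Main", "actionable"),
    ("Praying for everyone affected tonight", "not_actionable"),
    ("Shelter at Lincoln High is full, send cots", "actionable"),
    ("This storm is unreal, stay safe all", "not_actionable"),
]
texts, labels = zip(*labeled)

coder = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
coder.fit(texts, labels)

print(coder.predict(["We need blankets at the church on Main"]))
```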
We’ve discussed the role played by Ushahidi in Haiti, but the platform bears
revisiting here. Individual Ushahidi deployments can be used to categorize
disaster reports and then post them to a map. This system provides a basic
architecture for splitting the coding task across a group of individuals in order to
streamline the completion of particular tasks. Analysts can also label messages
post-facto, making Ushahidi a useful system for individuals seeking to place
particular messages onto a map. The QuickNets platform16, built using Ushahidi’s
source code as a base, further subdivides the crowdsourcing process in order to
make coding tasks easier for individuals to complete.
When a crowdsourcing workforce for coding data must be raised quickly, the
fastest method is to use a dedicated crowdsourcing platform. Amazon Mechanical
Turk17 is the archetypal example of an online labor market but there are many
alternatives. As Mission 43636’s popularity increased during the Haiti
earthquake’s aftermath, it switched from its informal organization system over to
using CrowdFlower18 and Samasource19 to managing their many volunteer
workers who spoke Kreyol and could translate the text.
Volunteers will often feel motivated to contribute time and energy to
addressing disasters and working with disaster data, particularly for very large
disasters. Dedicated communities of “Crisis Mappers” have formed around the
idea of collecting geospatial data from afflicted regions and annotating it with
relevant information20. Similarly, sparked.com21 has focused on recruiting
volunteers interested in contributing to meaningful causes. The best annotators for
data may not be those obtained from a crowdsourcing marketplace but rather from
within these and other communities of skilled volunteers with a specific
investment in helping to resolve disasters.
16 http://www.quick-nets.org
17 http://www.mturk.com
18 http://crowdflower.com
19 http://samasource.org
20 http://crisismappers.net
21 http://sparked.com
Network data connects victims and responders to both locations and the needs
they mentioned. These expressions can be used to identify critical actors and
places that must be reached by responders. For ease of use, the ORA network
22 http://www.liwc.net
23 http://www.indiana.edu/~socpsy/ACT/data.html
24 http://sentiwordnet.isti.cnr.it
analysis tool bakes a variety of useful social network metrics into reports to
provide overall assessments of different situations. One of these reports has been
designed to work with data from TweetTracker. It transforms the data to extract
networks of retweets, hashtag co-occurrences, users and content, user and
locations, and popular keyword distributions. It then processes these networks to
identify influential Twitter users, core topics, and changing regions of concern. A
similar technology has been built with ORA using REA for analyzing Lexis-Nexis
data25. This technology has been used with respect to natural and man-initiated
crises [62]. Its simplicity lends itself to first response. Figure 1 shows a network
created from Twitter data using ORA.
For analysts who wish to go deeper and possibly conduct richer data mining on
network structures, ORA supports the extraction of a variety of different social
network metrics. Other GUI-based tools such as Gephi26 and Cytoscape27 also provide methods for analysts to approach the data, and a variety of others exist. In contrast,
if the analyst wants to take a programming approach and develop their own
network metrics they may want to work with the statnet package28 for R or the
NetworkX library29 for Python.
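For readers who take the programming route, the following sketch uses NetworkX to build a small retweet network and rank users by in-degree centrality, one of the simpler metrics an ORA-style report might compute. The tweet fields are assumptions for illustration, not the Twitter API's actual payload format.

```python
# Minimal sketch: build a retweet network with NetworkX and rank users by
# in-degree centrality. The fields "user" and "retweet_of" are assumed.
import networkx as nx

tweets = [
    {"user": "alice", "retweet_of": "redcross"},
    {"user": "bob",   "retweet_of": "redcross"},
    {"user": "carol", "retweet_of": "alice"},
    {"user": "dave",  "retweet_of": None},  # original tweet, no edge
]

g = nx.DiGraph()
for t in tweets:
    if t["retweet_of"]:
        g.add_edge(t["user"], t["retweet_of"])  # retweeter -> original author

centrality = nx.in_degree_centrality(g)
for user, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 2))
```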
3.5 Geolocation
The classic tool used for geo-spatial analysis in the crisis mapping area is ESRI
ArcGIS30. ArcGIS is widely used by a large number of response units, including many police departments and military units. It supports pinning a variety of
latitude/longitude data to maps, as well as visualizing changes in its distribution
over time. In addition, ArcGIS supports a full complement of spatial analytics, and
a layered visualization scheme. ArcGIS can import and export shapefiles,
demarcations of geographic shapes, and KML, the XML-based markup language
developed for use with Google Earth31. An increasing number of crisis-mapping
tools, particularly those used by the large first responders, are exporting data in
KML to support interoperability. Open source GIS tools are appearing that contain
many of the features inherent in ArcGIS.
However, since the advent of Google Maps32 eight years ago, an increasing
number of crisis response tools are making use of it as an alternative. Since then,
the quantity of data and tools available for working with geospatial data has only
25 Illustrative results generated using ORA with Sandy data can be seen at http://www.pfeffer.at/sandy/
26 http://gephi.org
27 http://www.cytoscape.org
28 https://statnet.csde.washington.edu/trac
29 http://networkx.lanl.gov
30 http://www.esri.com/software/arcgis
31 http://www.google.com/earth/index.html
32 http://maps.google.com
increased. According to the HHI’s report, the V&TC community active in the
Haiti earthquake particularly shone in its use of geospatial data. This is due to the
dedicated work of the crisis mapping community and the willing participation of
organizations with access to satellite imagery in crisis situations. In Haiti, a
partnership between Google and GeoEye provided high-resolution images of the
disaster area from above. With the right data, communities could annotate maps
and workers on the ground could plan their activities.
Even when corporate entities do not provide such useful material, the
community is able to rely on open platforms like the mapping site OpenStreetMap33, which has become a staple of the crisis community. All of the
mapping data on OpenStreetMap has been contributed by volunteers; individuals
upload GPS data to the site, and then annotate and edit it to keep it current. To
deal with situations where internet access is limited or where users don’t have
access to GPS equipment, Michal Migurski released first the Walking Papers34 and then the Field Papers35 tools. These allow users to download and print maps, annotate them by hand, and then upload the annotations to OpenStreetMap.
Google Maps has a growing presence in the crisis mapping community as well,
and Google has itself devoted resources to creating maps specifically of crisis
situations. They’ve provided crisis maps for specific incidents such as Superstorm
Sandy that have been annotated with a variety of user data culled from the web
[63]. Google also maintains a real-time crisis map36 that uses similar culling of
data to provide updates about potential and on-going crisis situations.
The TweetTracker tool developed at ASU visualizes extracted tweets on maps
and lets users set spatial bounding boxes for selecting tweets by placing squares
on maps. (See Figure 2 for an example of an exported map.) ORA also supports
visualizing networks and other data on maps. It can import and export shape files
and KML. In addition ORA allows users to cluster entities based on their
particular region and then use that clustering as an element of a social network
analysis.
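A minimal sketch of the two operations just described follows: filtering geotagged tweets with a latitude/longitude bounding box and exporting the survivors as KML placemarks. The coordinates, fields, and box are invented for illustration.

```python
# Minimal sketch: keep geotagged tweets inside a lat/lon bounding box and
# write them out as simple KML placemarks. Data and box are invented.
tweets = [
    {"text": "Water rising on Canal St", "lat": 29.95, "lon": -90.07},
    {"text": "All quiet uptown",          "lat": 40.78, "lon": -73.97},
]

BOX = {"south": 29.0, "north": 30.5, "west": -91.0, "east": -89.0}

def in_box(t):
    return (BOX["south"] <= t["lat"] <= BOX["north"] and
            BOX["west"] <= t["lon"] <= BOX["east"])

selected = [t for t in tweets if in_box(t)]

# KML expects coordinates as longitude,latitude.
placemarks = "\n".join(
    "  <Placemark><name>{text}</name>"
    "<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>".format(**t)
    for t in selected)

kml = ('<?xml version="1.0" encoding="UTF-8"?>\n'
       '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
       f"{placemarks}\n</Document></kml>")

with open("selected_tweets.kml", "w") as f:
    f.write(kml)
```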
While it would be incorrect to consider the challenge of properly representing
data that has been connected with physical locations a solved problem, at this
point there are a variety of tools that allow users to place information with specific
latitudes and longitudes on a map. The research challenges are no longer about
rendering these points in an informative manner. They are about developing new
algorithms for deriving data from geographical clusters, and analyzing and
forecasting the geographic distributions of social media postings in specific
disasters.
33 http://www.openstreetmap.org
34 http://walking-papers.org
35 http://fieldpapers.org
36 http://google.org/crisismap/
37 http://alias-i.com/lingpipe/index.html
38 http://mallet.cs.umass.edu
39 http://www.cs.waikato.ac.nz/ml/weka/
40 http://nltk.org
41 http://mlpy.sourceforge.net
A large variety of machine learning models for working with text have been implemented as packages for the statistical language R. The tm package42 bundles
together standard natural language processing features for working with
unstructured text. Once parsed, other packages oriented specifically towards data
mining can be used with the text.
GUI-based tools for working with text data also exist, and may be easier for
first responders to integrate into their workflows than a coding solution. One good
example is AutoMap43, a tool developed at the CASOS Center at Carnegie Mellon
University that supports both GUI-based cleaning and an XML-based scripting
language [65]. Like NLTK and other tools mentioned, AutoMap provides a
number of methods for cleaning text documents like stemming words to their base
forms, deleting stop words, and calculating the frequency of different multi-word
sequences. AutoMap’s scripting GUI makes it relatively easy to improvise and
modify cleaning processes on the fly. The program has also been significantly
integrated with ORA, allowing analysts to use network metrics to identify
prominent co-occurrences of particular words or entities mentioned in documents.
These networks of texts can also be visualized and, if referencing geospatial data,
can be pinned to maps. This approach was used by a team of Arizona State
University and Carnegie Mellon University researchers with data from Superstorm
Sandy to compare the difference in content between Twitter and the news media.
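The following sketch reproduces a few AutoMap-style cleaning steps (stop word removal, stemming, and multi-word frequency counts) with NLTK rather than AutoMap itself; it assumes the NLTK stopwords corpus has been downloaded, and the sample documents are invented.

```python
# Minimal sketch of AutoMap-style cleaning with NLTK: lowercase, drop stop
# words, stem, and count bigrams. Requires: nltk.download("stopwords").
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop]

docs = ["Flooding reported near the river bridge",
        "The river bridge is closed due to flooding"]

bigrams = Counter()
for doc in docs:
    toks = clean(doc)
    bigrams.update(zip(toks, toks[1:]))

print(bigrams.most_common(3))
# e.g. [(('river', 'bridg'), 2), (('flood', 'report'), 1), ...]
```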
One difficulty of working with text data posted to Twitter and other microblogs
is that it often doesn’t fit the conventions expected in ordinary text. When
ANPI_NLP developed their named entity recognizer, for example, they had to
first train a morphological analyzer to correctly split a tweet into names. Analysts
generally expect to have to train their own parsers when working with microblog
syntax. While not a general purpose named entity recognizer, Gimpel et al. have
developed a tokenizer and part-of-speech tagger for Twitter44 that has since been
improved by Owoputi et al. [66, 67]. The POS tagger correctly classifies
emoticons and the roles of various acronyms ("lol", "srsly"). While not critical for disaster response on its own, in combination with the methods used by ANPI_NLP this
could improve the speed and accuracy of other algorithms.
Translation of messages posted to social media in other countries remains a
pressing problem, as we have discussed when describing the SMS messages
translated by Mission 4636. This problem was also seen during the Egyptian
Revolution and in the Yushu earthquake in China. While crowdsourcing markets
are a proven solution for this problem, machine translation can also be used for
potentially faster results. Google, for example, provides access to an API for automatic translation.45 Machine translation will be less effective than native speakers of a particular language, but if it isn't possible to reasonably mobilize (or afford to mobilize) such a workforce, it is one possible alternative.
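A minimal sketch of calling Google's translation service from Python via its v2 REST endpoint appears below; the API key is a placeholder, and a real deployment would need to handle quotas, batching, and failures.

```python
# Minimal sketch: translate a short message to English via Google's Cloud
# Translation v2 REST API using the requests library. API_KEY is a
# placeholder; the service requires a valid key and is billed per character.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, not a real credential
ENDPOINT = "https://translation.googleapis.com/language/translate/v2"

def translate_to_english(text):
    resp = requests.post(ENDPOINT, params={"key": API_KEY},
                         data={"q": text, "target": "en"})
    resp.raise_for_status()
    return resp.json()["data"]["translations"][0]["translatedText"]

print(translate_to_english("Nou bezwen dlo ak manje"))  # Haitian Kreyol input
```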
42 http://tm.r-forge.r-project.org
43 http://www.casos.cs.cmu.edu/projects/automap/
44 http://www.ark.cs.cmu.edu/TweetNLP/
45 http://developers.google.com/translate/
3.7 Broadcasting
Broadcasting tools largely fall outside of the practical use case for analysts. They
are, however, relevant for first responders attempting to leverage social media, so
we mention them here briefly. One example of a broadcasting tool is HootSuite46,
which allows users to manage profiles on multiple social networks, time the
broadcasting of particular tweets, and perform some analytics similar to those
mentioned in our discussion of tools that can be used for data retrieval.
TweetDeck47, an application provided by Twitter, provides a few similar functions
but only for Twitter: users can use the software to control multiple Twitter
accounts, subdivide followers into different groups, and schedule particular tweets
to be posted at certain times.
Despite the availability of these relatively sophisticated tools, first responders will often interact with followers through the main interfaces of whatever particular social media service they are using. In the case of Twitter, this may simply be their organization's account accessed from the web, or the smartphone application of an organization member on the ground.
4 Research Directions
A common need felt by both people and organizations who turn to social media in
disaster is knowing what is happening on the ground as rapidly as possible.
Solving this problem has become the thrust of many ongoing research projects in
the field. That being said, it is important to recognize that there are two very
different audiences to whom this chapter is speaking: first responders and disaster
researchers. Each group needs different tools to pursue their own ends. First
responders need simple, easy-to-use tools with pre-defined workflows, specialized
interfaces, dashboards, and maps. The time constraints of disasters prevent them
from turning to powerful but less intuitive or rapid tools such as programming
languages. In contrast, disaster researchers need to be able to use and create new methods and new types of visualizations, with workflows that they develop as part of
the research. In this case, real-time performance is less important than the ability
to perform sophisticated analyses. A particular type of research, translational research, is needed in the disaster response area: research that supports moving the findings and tools discovered or invented by disaster researchers that are most valuable to first responders from the laboratory into the field [68].
We now discuss two families of approaches to this challenge. We will begin
with attempts to leverage machine learning and crowdsourcing to automatically
classify individuals based on whether they provide useful information. We will
then move on to discussing several different methods for visualizing social media
data to provide immediate, intuitive feedback.
46 http://hootsuite.com
47 http://tweetdeck.com
The research projects we have discussed so far have focused on trying to find
the useful tweets within the broader pool of data. Some researchers have taken an
alternate approach, opting to extract general information from the mass of
tweets. For example, Sakaki et al. have used Twitter data to detect earthquake
epicenters [74]. Using the small number of tweets that have location data for
references to earthquakes, they combine both support vector machines and particle
filters to account for the uncertainty of the reported physical locations and then
calculate the likely epicenter. Their system is effective but contingent on having a
large number of tweets tagged with particular locations.
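The sketch below is a deliberately crude stand-in for this approach: a linear SVM flags earthquake-related tweets, and the median latitude and longitude of the geotagged positives approximates the epicenter. Sakaki et al.'s actual system uses particle filtering, and all data here is invented.

```python
# Crude stand-in for Sakaki et al.'s approach: a linear SVM flags earthquake
# tweets, and the median of the geotagged positives approximates the
# epicenter (their system uses particle filtering; data here is invented).
from statistics import median
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["huge earthquake just now", "everything is shaking",
               "great concert tonight", "traffic is terrible today"]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)

geotagged = [("the ground is shaking hard", 35.1, 139.6),
             ("earthquake! stuff fell off shelves", 35.4, 139.9),
             ("nice sunset at the beach", 26.2, 127.7)]

hits = [(lat, lon) for text, lat, lon in geotagged if clf.predict([text])[0] == 1]
if hits:
    lat_est = median(lat for lat, lon in hits)
    lon_est = median(lon for lat, lon in hits)
    print("estimated epicenter:", lat_est, lon_est)
```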
Similarly, the Google Flu Trends project48 uses search queries made to Google
to identify outbreaks of influenza [75]. Flu Trends is a specialized version of
Google Trends in general, which tries to identify trending searches on Google just
as Twitter tries to identify trending topics discussed by its users. The tool’s
success depends on both the large number of searches and also a lack of bias in the
search data.
Going beyond microblog text, Fontugne et al. have investigated Flickr’s
potential for disaster detection [76]. The researchers have developed a prototype
system that tracks uploaded photographs, highlighting particular labels that are
being uploaded by multiple users at once. Their method captured large bursts of
activity in Miyagi prefecture in Japan after the Tohoku earthquake. While the
system shows promise as an alarm system, the researchers also point out that only 7% of the photographs taken within 24 hours of the earthquake were uploaded within that 24-hour window. This is a dramatically different usage pattern from Twitter,
and one that should impact proposed research to leverage Flickr data.
Visualization of social media data is another ongoing challenge in helping users comprehend the sea of social media information. While crowdsourcing and
machine learning can help us prepare data, it is often a visualization that helps
individuals understand what the data is saying.
Word clouds have become a staple of modern visualization, as websites such as
Wordle49 have made them easy to create from any readily available text.
Researchers have also looked into optimizing the patterns of words in word clouds
to make them easier to interpret [77]. One notable example of their practical use is
the Eddi system developed by Bernstein et al. [78]. Eddi assigns a set of topic
labels to particular tweets by treating them as web search queries and then
identifying prominent terms in the resultant searches. These topic labels are
displayed as tag clouds that can be used to identify prominent subjects of
discussion. Note that Eddi’s primary achievement is its insightful method of
finding categorizations for tweets. However, the system relies on simple tag cloud
systems as a key component of its visualization scheme.
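Generating such a cloud from tweet term frequencies takes only a few lines. The sketch below assumes the third-party wordcloud package, though any tag-cloud renderer would serve; the sample tweets are invented.

```python
# Minimal sketch: render prominent tweet terms as a word cloud. Assumes the
# third-party "wordcloud" package (pip install wordcloud).
import re
from collections import Counter
from wordcloud import WordCloud

tweets = ["power out across the east side",
          "east side shelters are open",
          "send generators to the east side"]

freqs = Counter(w for t in tweets for w in re.findall(r"[a-z]+", t.lower()))
cloud = WordCloud(width=600, height=300).generate_from_frequencies(freqs)
cloud.to_file("tweet_cloud.png")
```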
ORA also incorporates a word cloud visualization. When fed longitudinal data,
it allows the user to render a sequence of word clouds as networks that can be monitored as they change over time. This is then supplemented with the ability to track
48 http://www.google.org/flutrends/
49 http://www.wordle.net/
the criticality of topics (e.g., hashtags) and actors (e.g., Twitterers) in the different
clouds, tracking how different topics have come into or dropped out of
prominence over the course of an event.
Kas et al. have had success using tree maps to display tweets prominently
associated with particular topics [79]. The researchers calculate the co-occurrences
of all words in tweets collected on particular topics, filter words based on how
often they co-occur, and then calculate popularity within particular topics. The
most prominent topic keywords are then placed in a tree map, sized based on the
square roots of their overall frequencies. The researchers carried out a small user study comparing the effectiveness of word clouds and tree maps for displaying the ranked words from Twitter. They found that tree maps were in general significantly more useful; test subjects both identified data presented in tree maps more accurately than data presented in word clouds and significantly preferred the tree map visualization.
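A minimal sketch of this sizing scheme follows, assuming the third-party squarify package and matplotlib; the keyword frequencies are invented.

```python
# Minimal sketch of a keyword tree map sized by the square roots of term
# frequencies, as in Kas et al. Assumes the third-party "squarify" package
# and matplotlib; the frequencies are invented.
import math
import matplotlib.pyplot as plt
import squarify

freqs = {"flood": 400, "shelter": 225, "power": 100, "volunteer": 49}
labels = list(freqs)
sizes = [math.sqrt(v) for v in freqs.values()]

squarify.plot(sizes=sizes, label=labels, alpha=0.8)
plt.axis("off")
plt.savefig("keyword_treemap.png")
```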
Word clouds and tree maps are both relatively established forms of visualization.
Both methods are constrained by only displaying a static view of the world. Social
media, however, is often in flux. To understand a particular sequence of events it can
be useful to get back to the originator of a particular comment, tweet, or image in
order to understand how it has come to have significance. Shahaf et al. have
developed a new, alternative visualization, the metro map, that addresses this
problem for longer documents but has potential for being adapted to the Twitter
space [80]. The metro map visualization links together sequences of documents
based on shared features. Documents are represented as “stations”, like a traditional
metro map, arranged roughly chronologically. The documents are tied together by
directed “tracks” derived from the amount of overlap in coherence, coverage and
connectivity in the actual text of the documents. Coherence is measured based on the
overlapping content of articles, coverage as the number of topics mentioned across
the collection of documents, and connectivity as the number of connections that
exist.
The visualizations we have discussed have all focused on social media as a
general source of data. We cannot point to particular examples of visualizations of social media data that are disaster-specific. For example, there is no visualization scheme based on Vieweg's categories for social media messages posted in disaster. This is a notable gap, and one that research needs to address.
Visualizations that cater to a specific end can be much more effective than a
general tool. For example, Kamvar and Harris's "We Feel Fine", a set of visualizations of individuals' emotions harvested from blog posts, has caused users to engage in introspection and personal probing [81]. This is partly due to the text, which consists of personal statements, and partly due to the way in which the text has been represented. Visualizations designed to highlight the features of disaster could provoke similarly rich responses from users while also speaking to relief workers' and analysts' needs to understand the situation on the ground.
5 Conclusion
Acknowledgments. The authors would like to thank Dr. Huan Liu of Arizona State
University for his great help in bringing this chapter to fruition.
References
1. Pfeffer, J., Carley, K.M.: Social Networks, Social Media, Social Change. In:
Nicholson, D.M., Schmorrow, D.D. (eds.) Adv. Des. Cross-Cult. Act. Part II, pp. 273–
282. CRC Press (2012)
2. Pfeffer, J., Zorbach, T., Carley, K.M.: Understanding online firestorms: Negative word
of mouth dynamics in social media networks. J. Mark. Commun. (2013)
3. Moloney, A.: Haiti must act to address housing crisis - Oxfam. Thomson Reuters Foundation (2013)
4. Drabek, T.E.: Human System Responses to Disaster: An Inventory of Sociological
Findings. Springer, New York (1986)
5. Dynes, R.R.: Organized Behavior in Disaster. Heath (1970)
6. Shklovski, I., Palen, L., Sutton, J.: Finding community through information and
communication technology in disaster response. In: Proc. 2008 ACM Conf. Comput.
Support. Coop. Work, pp. 127–136. ACM, San Diego (2008)
7. Shklovski, I., Burke, M., Kiesler, S., Kraut, R.: Technology Adoption and Use in the
Aftermath of Hurricane Katrina in New Orleans. Am. Behav. Sci. 53, 1228–1246
(2010), doi:10.1177/0002764209356252
8. Arcenaux, N., Weiss, A.S.: Seems stupid until you try it: press coverage of Twitter,
2006-9. New Media Soc. 12, 1262–1279 (2010), doi:10.1177/1461444809360773
9. Sullivan, D.: Tracking Hurricane Sandy News Through Twitter. Mark. Land. (2012)
10. Carr, D.: How Hurricane Sandy Slapped the Sarcasm Out of Twitter, New York Media
Decod. (2012)
11. Laird, S.: Sandy Sparks 20 Million Tweets. Mashable (2012)
12. Munro, R., Manning, C.D.: Short message communications: users, topics, and in-
language processing. In: Proc. 2nd ACM Symp. Comput. Dev., pp. 1–10. ACM,
Atlanta (2012)
13. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news
media? In: Proc. 19th Int. Conf. World Wide Web, pp. 591–600. ACM, Raleigh (2010)
14. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging
usage and communities. In: Proc. 9th Webkdd 1st Sna-Kdd 2007 Work. Web Min.
Soc. Netw. Anal., pp. 56–65. ACM, San Jose (2007)
15. Naaman, M., Boase, J., Lai, C.-H.: Is it really about me?: message content in social
awareness streams. In: Proc. 2010 ACM Conf. Comput. Support. Coop. Work, Cscw,
pp. 189–192. ACM, Savannah (2010)
16. Bakshy, E., Hofman, J.M., Mason, W.A., Watts, D.J.: Everyone’s an influencer:
quantifying influence on twitter. In: Proc. Fourth ACM Int. Conf. Web Search Data
Min., pp. 65–74. ACM, Hong Kong (2011)
17. Starbird, K., Palen, L., Hughes, A.L., Vieweg, S.E.: Chatter on the red: what hazards
threat reveals about the social life of microblogged information. In: Proc. 2010 ACM
Conf. Comput. Support. Coop. Work, pp. 241–250. ACM, Savannah (2010)
18. Sinnappan, S., Farrell, C., Stewart, E.: Priceless Tweets! A Study on Twitter Messages
Posted During Crisis: Black Saturday. In: Proc. 2010 Australas. Conf. Inf. Syst. Acis
(2010)
19. Qu, Y., Huang, C., Zhang, P., Zhang, J.: Microblogging after a major disaster in
China: a case study of the 2010 Yushu earthquake. In: Proc. Acm 2011 Conf. Comput.
Support. Coop. Work, Cscw, pp. 25–34. ACM, Hangzhou (2011)
20. Vieweg, S.E.: Situational Awareness in Mass Emergency: A Behavioral and Linguistic
Analysis of Microblogged Communications. University of Colorado at Boulder (2012)
21. Sutton, J.: Twittering Tennessee: Distributed Networks and Collaboration Following a
Technological Disaster. In: Proc. 7th Int. Conf. Inf. Syst. Crisis Response Manag.
(2010)
22. Starbird, K., Palen, L. (How) will the revolution be retweeted?: information diffusion
and the 2011 Egyptian uprising. In: Proc. ACM 2012 Conf. Comput. Support. Coop.
Work, Cscw, pp. 7–16. ACM, Seattle (2012)
23. Centola, D., Macy, M.: Complex Contagions and the Weakness of Long Ties. Am. J.
Sociol. 113, 702–734 (2007)
24. NPR Staff, “Distant Witness”: Social Media’s “Journalism Revolution.” Talk of the Nation
(2013)
25. @TwitterMedia NPR’s Andy Carvin Uses Twitter to Debunk A Hoax.
#OnlyOnTwitter
26. Kaczynski, A.: How One Well-Connected Pseudonymous Twitter Spread Fake News
About Hurricane Sandy. Buzzfeed Polit (2012)
27. Stuef, J.: The Man Behind @ComfortablySmug, Hurricane Sandy’s Worst Twitter
Villain. Buzzfeed Fwd. (2012)
28. Mendoza, M., Poblete, B., Castillo, C.: Twitter under crisis: can we trust what we RT?
In: Proc. First Work. Soc. Media Anal. Soma, pp. 71–79. ACM, Washington D.C.
(2010)
29. Madrigal, A.C.: It Wasn’t Sunil Tripathi: The Anatomy of a Misinformation Disaster.
The Atlantic (2013)
30. Weinstein, A.: Everybody Named the Wrong Boston Suspects Last Night and
Promptly Forgot. Gawker (2013)
31. Martin, E.: Reflections on the Recent Boston Crisis. Reddit Blog (2013)
32. Keller, J.: How Boston Police Won the Twitter Wars During the Marathon Bomber
Hunt. Bloom. Businessweek (2013)
33. Mission 4636, Collaborating organizations and History. Mission 4636 (2010)
34. Harvard Humanitarian Initiative, Disaster Relief 2.0: The future of Information
Sharing in Humanitarian Emergencies. Harvard Humanitarian Initiative, UN Office for
the Coordination of Humanitarian Affairs, United Nations Foundation (2011)
35. Liu, S.B., Palen, L., Sutton, J., et al.: In search of the bigger picture: The emergent role
of online photo sharing in times of disaster. In: Proc. 5th Int. Conf. Inf. Syst. Crisis
Response Manag. (2008)
36. Cohen, S.E.: Sandy Marked a Shift for Social Media Use in Disasters. Emerg. Manag.
(2013)
37. St. Denis, L.A., Hughes, A.L., Palen, L.: Trial by Fire: The Deployment of Trusted
Digital Volunteers in the 2011 Shadow Lake Fire. In: Proc. 9th Int. Conf. Inf. Syst.
Crisis Response Manag. (2012)
38. Reuter, S.: What is a Virtual Operations Support Team? Idisaster 20 (2012)
39. Stephens, K.: Understanding VOSTs (Virtual Operations Support Teams) Hint: It’s All
About Trust. West. Mass Smem (2012)
40. VOSG.us, About. Virtual Oper. Support Group (2011)
41. Panagiotopoulos, P., Ziaee Bigdeli, A., Sams, S.: "5 Days in August" – How London
Local Authorities Used Twitter During the 2011 Riots. In: Scholl, H.J., Janssen, M.,
Wimmer, M.A., Moe, C.E., Flak, L.S. (eds.) EGOV 2012. LNCS, vol. 7443, pp. 102–
113. Springer, Heidelberg (2012)
42. Sarcevic, A., Palen, L., White, J., et al.: “Beacons of hope” in decentralized
coordination: learning from on-the-ground medical twitterers during the 2010 Haiti
earthquake. In: Proc. ACM 2012 Conf. Comput. Support. Coop. Work, Cscw, pp. 47–
56. ACM, Seattle (2012)
43. Munro, R.: Crowdsourcing and the crisis-affected community: Lessons learned and
looking forward from Mission 4636. Inf. Retr. 16, 210–266 (2013),
doi:10.1007/s10791-012-9203-2
44. Okolloh, O.: Ushahidi, or “testimony”: Web 2.0 tools for crowdsourcing crisis
information. Particip. Learn. Action 59, 65–70 (2009)
45. Neubig, G., Yuichiroh, M., Masato, H., Koji, M.: Safety Information Mining — What
can NLP do in a disaster—. In: Proc. 5th Int. Jt. Conf. Nat. Lang. Process. Asian
Federation of Natural Language Processing, Chiang Mai, Thailand, pp. 965–973
(2011)
46. Kumar, S., Barbier, G., Abbasi, M.A., Liu, H.: TweetTracker: An Analysis Tool for
Humanitarian and Disaster Relief. In: Proc. 2011 Int. Aaai Conf. Weblogs Soc. Media,
pp. 661–662. AAAI, Barcelona (2011)
47. Carley, K.M., Reminga, J., Storrick, J., Columbus, D.: ORA User’s Guide, Carnegie
Mellon University, School of Computer Science, Institute for Software Research,
Pittsburgh, Pennsylvania (2013)
48. Carley, K.M., Columbus, D.: Basic Lessons in ORA and AutoMap 2011. Carnegie
Mellon University, Pittsburgh (2011)
49. Carley, K.M., Pfeffer, J.: Dynamic Network Analysis (DNA) and ORA. Adv. Des.
Cross-Cult. Act. Part (2012)
50. Costa, B., Boiney, J.: Social Radar. MITRE, McLean, Virginia, USA (2012)
51. Mathieu, J., Fulk, M., Lorber, M., et al.: Social Radar Workflows, Dashboards, and
Environments. MITRE, Bedford (2012)
52. Schmerl, B., Garlan, D., Dwivedi, V., et al.: SORASCS: a case study in SOA-based
platform design for socio-cultural analysis. In: Proc. 33rd Int. Conf. Softw. Eng., pp.
643–652. ACM, Waikiki (2011)
53. Garlan, D., Schmerl, B., Dwivedi, V., et al.: Specifying Workflows in SORASCS to
Automate and Share Common HSCB Processes. In: Proc. Hscb Focus. Integrating Soc.
Sci. Theory Anal. Methods Oper. Use (2011), doi:10.1.1.190.7086
54. Bostock, M., Ogievetsky, V., Heer, J.: D3 Data-Driven Documents. IEEE Trans. Vis.
Comput. Graph. 17, 2301–2309 (2011), doi:10.1109/TVCG.2011.185
55. Bostock, M., Carter, S.: Wind Speeds Along Hurricane Sandy’s Path - Interactive
Feature, New York (2012)
56. Bostock, M., Ericson, M., Leonhardt, D., Marsh, B.: Across U.S. Companies, Tax
Rates Vary Greatly, New York (2013)
57. Bostock, M., Bradsher, K.: China Still Dominates, but Some Manufacturers Look
Elsewhere, New York (2013)
58. Carvin, A.: Andy Carvin’s Social Stories. Andy Carvins Soc. Stories
59. Lin, J., Snow, R., Morgan, W.: Smoothing techniques for adaptive online language
models: topic tracking in tweet streams. In: Proc. 17th Acm Sigkdd Int. Conf. Knowl.
Discov. Data Min., pp. 422–429. ACM, San Diego (2011)
60. Martin, M.K., Pfeffer, J., Carley, K.M.: Network text analysis of conceptual overlap in
interviews, newspaper articles and keywords. Soc. Netw. Anal. Min. (forthcoming)
61. Pang, B., Lee, L.: Opinion Mining and Sentiment Analysis. Now Publishers (2008)
62. Carley, K.M., Pfeffer, J., Morstatter, F., et al.: Near Real Time Assessment of Social
Media Using Geo-Temporal Network Analytics. In: Proc. 2013 Ieeeacm Int. Conf.
Adv. Soc. Networks Anal. Min. (2013)
63. Schroeder, S.: Google Launches Crisis Map for Hurricane Sandy. Mashable (2012)
64. Abbasi, M.-A., Kumar, S., Filho, J.A.A., Liu, H.: Lessons Learned in Using Social
Media for Disaster Relief - ASU Crisis Response Game (2012)
65. Carley, K.M., Columbus, D., Landwehr, P.: AutoMap User’s Guide, Carnegie-Mellon
University, School of Computer Science, Institute for Software Research, Pittsburgh,
Pennsylvania (2013)
66. Gimpel, K., Schneider, N., O’Connor, B., et al.: Part-of-Speech Tagging for Twitter:
Annotation, Features, and Experiments. In: Proc. 49th Annu. Meet. Assoc. Comput.
Linguist. Hum. Lang. Technol. (2011)
67. Owoputi, O., O’Connor, B., Dyer, C., et al.: Part-of-Speech Tagging for Twitter: Word
Clusters and Other Advances. Carnegie Mellon University, Machine Learning
Department, Pittsburgh, Pennsylvania, USA (2012)
68. Woolf, S.H.: The Meaning of Translational Research and Why It Matters. J. Am. Med.
Assoc. 299, 211–213 (2008), doi:10.1001/jama.2007.26
A Generalized Approach for Social Network Integration and Analysis

Chris Yang
Drexel University, Philadelphia, PA
e-mail: [email protected]

Bhavani Thuraisingham
The University of Texas at Dallas, Richardson, TX
e-mail: [email protected]

1 Introduction

Social networks have drawn substantial attention in recent years due to the
advance of Web 2.0 technologies. Aggregating social network data has become easier
through crawling user interactions on the Internet [77]. Social network analysis
discovers knowledge hidden in the structure of social networks, which is useful in
many domains such as marketing, epidemiology, homeland security, sociology,
psychology, and management. Social network data is usually owned by an
individual organization or government agency. However, each organization or
agency usually holds only a partial social network, aggregated from its own
sources. Knowledge cannot be extracted accurately if only partial information is
available. Sharing social networks between organizations enables knowledge
discovery from an integrated social network obtained from multiple sources.
However, information sharing between organizations is usually prohibited due
to concerns about privacy preservation, especially because a social network often
contains sensitive information about individuals. Early research on privacy
preservation focused on relational data, and some recent research extends it to
social network data. Techniques such as k-degree anonymity and k-anonymity,
achieved by edge or node perturbation, have been proposed. However, the
anonymized social network is designed for studying global network properties. It
is not suitable for integrating social networks or for other social network analysis
and mining tasks such as identifying leading persons or gateways. A recent study
has also shown that perturbation can cause substantial distortion to the network
structure. Such distortion may introduce errors into social network analysis and
mining. In this chapter we discuss how sharing insensitive and generalized
information can support social network analysis and mining while preserving
privacy at the same time.
We motivate the problem with the following scenario. Consider two local
law enforcement units A and B that own criminal social networks G_A and G_B,
respectively. Each of these networks is a partial view of the regional criminal
social network covering the areas policed by A and B. The criminal intelligence
officer of A may not be able to recognize the close connection between suspects i
and j by analyzing G_A, because i and j are connected through k in G_B but k does
not appear in G_A. Similarly, the criminal intelligence officers of B may not be
able to determine the significance of suspect k by conducting centrality analysis
on G_B, because k has little influence on the actors in G_B but substantial influence
on the actors in G_A. By integrating G_A and G_B, the criminal intelligence officers
of A and B are able to discover knowledge that they otherwise could not.
In this chapter, we discuss our generalization approach for integrating social
networks with privacy preservation. In Section 2 we first provide some background
on the application of social networks to terrorism analysis and the need for
privacy. Limitations of current approaches are discussed in Section 3. Our
approach is presented in Section 4, and future directions are discussed in Section 5.
investigators in determining the path between any two actors of interest. For
example, Figure 1 illustrates the two sub-groups led by Bin Laden and Fateh and
the connecting paths and gateways between the two leaders. Without this social
network analysis and these visualization tools, the huge volume of aggregated data
may not be meaningful to the agencies for their tasks.
registration records are not de-identified. If a hacker conducts a linking attack and
cross-checks the medical records against the voter registration records, he will be
able to find that only one person has the same values for the set of quasi-
identifiers, [Age = 29, Sex = M, Location = 35667], and therefore he is able to
conclude that Charles has HIV.
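Such a linking attack amounts to a join of the two tables on the quasi-identifier
attributes. The following Python sketch is purely illustrative: apart from the
Charles example above, the records, field names, and values are hypothetical and
are not taken from Table 1.

    # Hypothetical illustration of a linking attack on quasi-identifiers.
    medical = [
        {"Age": 29, "Sex": "M", "Location": "35667", "Disease": "HIV"},
        {"Age": 41, "Sex": "F", "Location": "35668", "Disease": "Flu"},
    ]
    voters = [
        {"Name": "Charles", "Age": 29, "Sex": "M", "Location": "35667"},
        {"Name": "Diane",   "Age": 41, "Sex": "F", "Location": "35668"},
    ]

    quasi_identifiers = ("Age", "Sex", "Location")

    def link(medical_records, voter_records, qi):
        """Join the two tables on the quasi-identifier attributes."""
        matches = []
        for m in medical_records:
            key = tuple(m[a] for a in qi)
            for v in voter_records:
                if tuple(v[a] for a in qi) == key:
                    matches.append((v["Name"], m["Disease"]))
        return matches

    print(link(medical, voters, quasi_identifiers))
    # A unique match such as ('Charles', 'HIV') re-identifies the individual.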
Table 1 Medical records and voter registration records shared after attribute
removal
person in the medical records, because there are k − 1 other persons whose age
falls in the same range. However, k-anonymity fails when there is a lack of
diversity in the sensitive attributes. For example, there is a lack of diversity in the
values of the attribute Disease in the quasi-identifier group with Age = [5, 20] and
Location = [00300, 02000]. One can see that all values of the attribute Sex are M
and all values of the attribute Disease are Viral Infection in this group. As a
result, a hacker is able to link Paul (Age = 14, Location = 00332) to quasi-identifier
group A and determine that Paul has a viral infection.
l-diversity [44] ensures that there are at least l well-represented values of the
sensitive attributes for every set of quasi-identifier attributes. The weakness is
that one can still estimate the probability of a particular sensitive value.
m-invariance [72] ensures that each set of quasi-identifier attributes has at least m
tuples, each with a unique set of sensitive values, so that there is at most 1/m
confidence in determining the sensitive values. Other techniques enhance
k-anonymity and l-diversity with personalization, such as personalized anonymity
[71] and (α,k)-anonymity [70], allowing users to specify the degree of privacy
protection or a threshold α on the relative frequency of the sensitive data.
Versatile publishing [28] anonymizes sub-tables to guarantee privacy rules.
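Both k-anonymity and l-diversity can be checked mechanically by grouping a
table on its quasi-identifier attributes. The sketch below uses hypothetical records
and attribute names; it is meant only to make the two definitions concrete.

    from collections import defaultdict

    def check_k_anonymity_and_l_diversity(records, quasi_ids, sensitive, k, l):
        """Return (is_k_anonymous, is_l_diverse) for a list of dict records.
        k-anonymity: every quasi-identifier group has at least k records.
        l-diversity: every group has at least l distinct sensitive values."""
        groups = defaultdict(list)
        for r in records:
            groups[tuple(r[a] for a in quasi_ids)].append(r[sensitive])
        k_ok = all(len(vals) >= k for vals in groups.values())
        l_ok = all(len(set(vals)) >= l for vals in groups.values())
        return k_ok, l_ok

    # Hypothetical anonymized table: Age has been generalized to ranges.
    table = [
        {"Age": "[5,20]", "Sex": "M", "Disease": "Viral Infection"},
        {"Age": "[5,20]", "Sex": "M", "Disease": "Viral Infection"},
        {"Age": "[21,40]", "Sex": "F", "Disease": "HIV"},
        {"Age": "[21,40]", "Sex": "F", "Disease": "Flu"},
    ]
    print(check_k_anonymity_and_l_diversity(table, ("Age", "Sex"), "Disease", k=2, l=2))
    # -> (True, False): the [5,20] group is 2-anonymous but not 2-diverse.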
Privacy preservation of relational data has also been applied to statistical
databases. Query restriction [29,48], output perturbation [8,14,16], and data
modification [2,47,73] are three major approaches. Query restriction [29,48]
rejects certain queries when a leak of sensitive values is possible by combining the
results of previous queries. Output perturbation [8,14,16] adds noise to the result
of a query to produce a perturbed version. Data modification [2,47,73] prepares an
adequately anonymized version of the relational data for answering queries. The
cryptographic approach to privacy preservation of relational data aims at
developing protocols for data exchange between multiple private parties that
minimize the information revealed by each party. For example, top-k search [66]
reports the top-k tuples in the union of the data held by several parties. However,
techniques for preserving the privacy of relational data cannot be directly applied
to social network data. In recent years, these techniques have been extended to
preserve the privacy of social network data.
4 Our Approach
Instead of using edge or node perturbation or secure multi-party computation
approaches, we propose to use a subgraph generalization approach to preserve the
sensitive data and yet share the insensitive data. The social network owned by
each party will be decomposed to multiple subgraphs. Each subgraph will be
generalized as a probabilistic model depending on the sensitive and insensitive
data available as well as the objective of the social network analysis and mining
tasks. The probabilistic models of the generalized subgraphs from multiple
sources will then be integrated for social network analysis and mining. The social
network analysis and mining will be conducted on the global and generalized
social network rather than the partial social network owned by each party. The
knowledge that cannot be captured in individual social networks will be
discovered in the integrated global social network.
This approach overcomes the limitation of the errors produced by the perturbation
approach and yet allows integration of multiple social networks. It also avoids
attacks on encrypted data in the SMC approach, because the shared data are
insensitive. The complexity of this approach is also reduced substantially.
Most attacks cannot discover the exact identity or adjacency given a reasonable
privacy preservation technique. However, many attacks are able to narrow down the
identity to a few possible known identities. Ideally, a privacy-preserving technique
should achieve ∞-tolerance, meaning that no attack can find any clue to the
possible identity of a sensitive node. In reality, it is almost impossible to achieve
∞-tolerance due to the background knowledge possessed by adversaries. However,
a good privacy-preserving technique should reduce privacy leakage as much as
possible, which means achieving a higher value of τ in privacy leakage.
The generalized information in this problem is the set of probabilistic models of
the generalized social networks, rather than a model perturbed using the
k-anonymity approach. As a result, the τ-tolerance of privacy leakage is
independent of the generalization technique. In addition, the approach preserves
both the identities and the network structure. By integrating the probabilistic
models of multiple generalized social networks, the objective is to achieve better
performance on social network analysis tasks. At the same time, neither the
probabilistic models nor the social network analysis results should release private
information that would violate the prescribed τ-tolerance of privacy leakage under
adversary attacks.
any privacy concern, one could integrate G_P and G_Q to generate an integrated G
and obtain a better social network analysis result. Due to privacy concerns, O_Q
cannot release G_Q to O_P but can only share generalized information about G_Q
with O_P. At the same time, O_P does not need all data from O_Q, only those data
that are critical for the social network analysis task. The objectives are to maximize
the information sharing that is useful for the social network analysis task while
preserving the sensitive information to satisfy the prescribed τ-tolerance of privacy
leakage, and to achieve more accurate social network analysis results.
shortest distance between nodes [19,24] have high utility. When information is
shared with another organization, some sensitive information must be withheld,
but generalized information can be released so that a more accurate estimate of
the information required for centrality measures can be obtained. Attributes to
consider in generalization include the number of nodes in a sub-group, the
diameter of a sub-group, the distribution of node degrees, and the eccentricity of
the insensitive nodes.
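These attributes are straightforward to compute for a connected subgraph. A
possible sketch using the networkx library is given below; the function name and
the example graph are ours, and only the returned summaries, not the subgraph
itself, would be shared.

    import networkx as nx
    from collections import Counter

    def generalized_attributes(subgraph, insensitive_nodes):
        """Summarize a connected subgraph by the attributes discussed above:
        size, diameter, degree distribution, and eccentricity of the
        insensitive nodes."""
        degree_distribution = Counter(d for _, d in subgraph.degree())
        return {
            "num_nodes": subgraph.number_of_nodes(),
            "diameter": nx.diameter(subgraph),
            "degree_distribution": dict(degree_distribution),
            "eccentricity": {v: nx.eccentricity(subgraph, v)
                             for v in insensitive_nodes if v in subgraph},
        }

    # Example: a small path graph with one insensitive node.
    print(generalized_attributes(nx.path_graph(4), insensitive_nodes=[0]))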
Generalization
A subgraph generalization generates a generalized version of a social network, in
which a connected subgraph is transformed into a generalized node and only
generalized information is presented in the generalized node. The generalized
information is the probabilistic model of the attributes. A subgraph of G = (V, E)
is denoted as G' = (V', E'), where V' ⊂ V, E' ⊂ E, and E' ⊆ V' × V'. G' is a
connected subgraph if there is a path between each pair of nodes in G'. We only
consider connected subgraphs when we conduct subgraph generalization. Any
edge that links a node elsewhere in the network to a node of the subgraph is
reconnected to the generalized node. The generalized social network protects all
sensitive information while releasing the crucial, non-sensitive information to the
information-requesting party for social network integration and the intended
social network analysis task. A mechanism is needed to (i) identify the subgraphs
for generalization, (ii) determine the connectivity between the set of generalized
nodes in the generalized social network, and (iii) construct the generalized
information to be shared.
The constructed subgraphs must be mutually exclusive and exhaustive. A node
v can be part of only one subgraph. The union of the nodes from all subgraphs,
V_1', V_2', …, V_n', should equal V, the original set of nodes in G. To construct
subgraphs for generalization, there are a few alternatives, including n-neighborhood
and k-connectivity.
n-neighborhood
k-connectivity
Subgraphs can further be generated if the subgraphs being created are also k-edge
connected.
We illustrate subgraph generalization using the K-nearest neighbor (KNN)
method. Let SPD(v, v_i^C) be the length of the shortest path between v and the
center v_i^C of subgraph G_i. When v is assigned to the subgraph G_i in subgraph
generation, SPD(v, v_i^C) must be shorter than or equal to SPD(v, v_j^C) for all
j = 1, 2, …, K with j ≠ i. Secondly, an edge exists between two generalized nodes
G_i and G_j in the generalized graph G' if and only if there is an edge in G between
two nodes, one from each of the generalized nodes G_i and G_j.
The KNN subgraph generation algorithm is presented below:

Step 1:  length = 1
Step 2:  V = V − {v_1^C, v_2^C, …, v_K^C}
Step 3:  While V ≠ Ø
Step 4:      For each v_j ∈ V
Step 5:          For each i = 1 to K
Step 6:              If SPD(v_j, v_i^C) == length
Step 7:                  V_i = V_i + {v_j}
Step 8:                  V = V − {v_j}
Step 9:          End For
Step 10:     End For
Step 11:     length++
Step 12: End While
Step 13: For each (v_i, v_j) ∈ E
Step 14:     If Subgraph(v_i) == Subgraph(v_j)
                 // Subgraph(v_i) is the subgraph such that v_i ∈ Subgraph(v_i)
Step 15:         G_k = Subgraph(v_i)
Step 16:         E_k = E_k + {(v_i, v_j)}
Step 17:     Else
Step 18:         Create an edge between Subgraph(v_i) and Subgraph(v_j) and add it to E'
Step 19: End For
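The algorithm translates directly into code. The sketch below (our own naming,
not the authors' implementation) uses networkx to compute the shortest-path
distances SPD and assigns every remaining node to its nearest seed center, which
matches the level-by-level growth above up to tie-breaking; it returns the
node-to-subgraph assignment, the intra-subgraph edges, and the generalized graph G'.

    import networkx as nx

    def knn_subgraph_generation(G, centers):
        """Assign every node of G to the nearest center and build the
        generalized graph whose nodes are the resulting subgraphs."""
        spd = dict(nx.all_pairs_shortest_path_length(G))
        membership = {c: i for i, c in enumerate(centers)}      # Subgraph(v)
        for v in G.nodes():
            if v in membership:
                continue
            # nearest center in shortest-path distance, ties broken by order
            membership[v] = min(range(len(centers)),
                                key=lambda i: spd[v].get(centers[i], float("inf")))
        subgraph_edges = {i: [] for i in range(len(centers))}   # E_k
        generalized = nx.Graph()                                # G'
        generalized.add_nodes_from(range(len(centers)))
        for u, v in G.edges():
            if membership[u] == membership[v]:
                subgraph_edges[membership[u]].append((u, v))
            else:
                generalized.add_edge(membership[u], membership[v])
        return membership, subgraph_edges, generalized

    # Example: a path of seven nodes with seeds at both ends (1NN, K = 2).
    membership, _, Gp = knn_subgraph_generation(nx.path_graph(7), centers=[0, 6])
    print(membership, list(Gp.edges()))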
Figure 3 illustrates subgraph generation by the KNN method. G has seven nodes,
including v_1 and v_2. If we take v_1 and v_2 as the insensitive nodes and generate
two subgraphs by the 1NN method, all other nodes will be assigned to one of the
two subgraphs depending on their shortest distances to v_1 and v_2. Two subgraphs,
G_1 and G_2, are generated as illustrated in Figure 3.
Fig. 3 Subgraph generation by the 1NN method: subgraphs G_1 and G_2 and the generalized graph
For each generalized node v_i' ∈ V', we determine the generalized information to
be shared. The generalized information should achieve the following objectives:
(i) it is useful for the social network analysis task after integration, (ii) it preserves
the sensitive information, and (iii) it is minimal, so that unnecessary information
is not released. The generalized information of V_i' can be the probabilistic model
of the distance between any two nodes v_j and v_k in V_i', P(Distance(v_j, v_k) = d),
v_j, v_k ∈ V_i' [82]. The construction of the subgraphs plays an important role in
determining the generalized information to be shared and the usefulness of the
generalized information.
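For instance, the probabilistic model P(Distance(v_j, v_k) = d) can be estimated as
the empirical distribution of pairwise shortest-path distances inside the subgraph;
only this distribution, not the subgraph structure, would be released. A minimal
sketch (function name ours):

    import networkx as nx
    from collections import Counter

    def distance_distribution(subgraph):
        """Empirical distribution of pairwise shortest-path distances,
        an estimate of P(Distance(vj, vk) = d) for the subgraph."""
        nodes = list(subgraph.nodes())
        lengths = dict(nx.all_pairs_shortest_path_length(subgraph))
        counts = Counter(lengths[u][v]
                         for i, u in enumerate(nodes)
                         for v in nodes[i + 1:])
        total = sum(counts.values())
        return {d: c / total for d, c in counts.items()}

    print(distance_distribution(nx.cycle_graph(5)))   # {1: 0.5, 2: 0.5}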
In addition to the utility of the generalized information, the development of the
subgraph construction algorithms must take privacy leakage into consideration.
Given the generalized subgraphs and the generalized information of each
subgraph, attacks can be designed to discover the identities and adjacencies of
sensitive and insensitive nodes.
and utilized closeness centrality as the social network analysis task. It was found
that the generalization approach can improve the performance of closeness
centrality by over 35%. This result shows that the proposed approach to
integrating social networks is promising.
5 Conclusion
Social network analysis is important for extracting hidden knowledge about a
community. It is particularly important for investigating terrorist and criminal
communication patterns and the structure of their organizations. Unfortunately,
most law enforcement and intelligence units own only a small piece of the social
network. Due to privacy concerns, these pieces of data cannot be shared among
the units, and therefore the utility of each piece of information is limited. In this
chapter, we have introduced a generalization approach for preserving privacy
while integrating multiple social networks. The integrated social network provides
better information for conducting social network analysis, such as computing
centrality. We have also discussed τ-tolerance, which specifies the level of privacy
leakage that can be tolerated. Our experimental results also show that the
generalization approach and social network integration produce promising
performance.
References
1. Adibi, J., Chalupsky, H., Melz, E., Valente, A.: The KOJAK Group Finder: Connecting
the Dots via Integrated Knowledge-based and Statistical Reasoning. In: Innovative
Applications of Artificial Intelligence Conference (2004)
2. Agrawal, R., Srikant, R., Thomas, D.: Privacy Preserving OLAP. In: ACM SIGMOD
2005 (2005)
3. Ahmad, M.A., Srivastava, J.: An Ant Colony Optimization Approach to Expert
Identification in Social Networks. In: Liu, H., Salerno, J.J., Young, M.J. (eds.) Social
Computing, Behavioral Modeling, and Prediction. Springer (2008)
4. Backstrom, L., Dwork, C., Kleinberg, J.: Wherefore Art Thou R3579X? Anonymized
Social Networks, Hidden Patterns, and Structural Steganography. In: WWW 2007,
Banff, Alberta, Canada (2007)
5. Bhatt, R., Chaoji, V., Parekh, R.: Predicting Product Adoption in Large-Scale Social
Networks. In: ACM CIKM, Toronto, Ontario (2010)
6. Bhattacharya, I., Getoor, L.: Iterative Record Linkage for Cleaning and Integration. In:
SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge
Discovery (2004)
7. Bhattacharya, I., Getoor, L.: Entity Resolution in Graphs. Technical Report 4758,
Computer Science Department, University of Maryland (2005)
8. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical Privacy: the Sulq
Framework. In: ACM PODS 2005 (2005)
9. Brickell, J., Shmatikov, V.: Privacy-Preserving Graph Algorithms in the Semi-honest
Model. In: Roy, B. (ed.) ASIACRYPT 2005. LNCS, vol. 3788, pp. 236–252. Springer,
Heidelberg (2005)
10. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced Hypertext Categorization using
Hyperlinks. In: ACM SIGMOD 1998 (1998)
11. Chau, A.Y.K., Yang, C.C.: The Shift towards Multi-Disciplinarily in Information
Science. Journal of the American Society for Information Science and Technology
(2008)
12. Chen, H., Yang, C.C.: Intelligence and Security Informatics: Techniques and
Applications. Springer (2008)
13. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K.,
Slattery, S.: Learning to Construct Knowledge Bases from the World Wide Web.
Artificial Intelligence 118, 69–114 (2000)
14. Dinur, I., Nissim, K.: Revealing Information While Preserving Privacy. In: ACM
PODS 2003 (2003)
15. Dong, X., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex
Information Spaces. In: ACM SIGMOD International Conference on Management of
Data (2005)
16. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating Noise to Sensitivity in
Private Data Analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp.
265–284. Springer, Heidelberg (2006)
17. Frantz, T., Carley, K.M.: A Formal Characterization of Cellular Networks. Technical
Report CMU-ISRI-05-109, Carnegie Mellon University (2005)
18. Frikken, K.B., Golle, P.: Private Social Network Analysis: How to Assemble Pieces of
a Graph Privately. In: The 5th ACM Workshop on Privacy in Electronic Society
(WPES 2006), Alexandria, VA (2006)
19. Gao, J., Qiu, H., Jiang, X., Wang, T., Yang, D.: Fast Top-K Simple Shortest Paths Discovery
in Graphs. In: ACM CIKM, Toronto, Ontario (2010)
20. Gartner, T.: Exponential and Geometric Kernels for Graphs. In: NIPS Workshop on
Unreal Data: Principles of Modeling Nonvectorial Data (2002)
21. Gartner, T.: A Survey of Kernels for Structured Data. ACM SIGKDD Explorations 5,
49–58 (2003)
22. Getoor, L., Diehl, C.P.: Link Mining: A Survey. ACM SIGKDD Explorations 7, 3–12
(2005)
23. Hay, M., Miklau, G., Jensen, D., Weis, P., Srivastava, S.: Anonymizing Social
Networks. Technical Report 07-19, University of Massachusetts, Amherst (2007)
24. Gubichev, A., Bedathur, S., Seufert, S., Weikum, G.: Fast and Accurate Estimation of
Shortest Paths in Large Graphs. In: ACM CIKM, Toronto, Ontario (2010)
25. Hummel, R., Zucker, S.: On the Foundations of Relaxation Labeling Processes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 267–287 (1983)
26. Huang, J., Sun, H., Han, J., Deng, H., Sun, Y., Liu, Y.: SHRINK: A Structural
Clustering Algorithm for Detecting Hierarchical Communities in Networks. In: ACM
CIKM, Toronto, Ontario (2010)
27. Huang, J., Zhuang, Z., Li, J., Giles, C.L.: Collaboration Over Time: Characterizing and
Modeling Network Evolution. In: ACM WSDM 2008 Palo Alto, CA (2008)
28. Jin, X., Zhang, M., Zhang, N., Das, G.: Versatile Publishing for Privacy Preservation.
In: ACM KDD, Washington, DC (2010)
29. Kenthapadi, K., Mishra, N., Nissim, K.: Simulatable Auditing. In: PODS 2005 (2005)
30. Kerschbaum, F., Schaad, A.: Privacy-Preserving Social Network Analysis for Criminal
Investigations. In: Proceedings of the ACM Workshop on Privacy in Electronic
Society, Alexandria, VA (2008)
31. Ketkar, N., Holder, L., Cook, D.: Comparison of Graph-based and Logic-based Multi-
relational Data Mining. In: ACM SIGKDD Explorations, vol. 7 (December 2005)
32. Kleinberg, J.: Authoritative Sources in a Hyperlinked Environment. Journal of the
ACM 46, 604–632 (1999)
33. Kubica, J., Moore, A., Schneider, J., Yang, Y.: Stochastic Link and Group Detection.
In: National Conference on Artificial Intelligence: American Association for Artificial
Intelligence (2002)
34. Kubica, J., Moore, A., Schneider, J.: Tractable Group Detection on Large Link Data
Sets. In: IEEE International Conference on Data Mining (2003)
35. Kuramochi, M., Karypis, G.: Frequent Subgraph Discovery. In: IEEE International
Conference on Data Mining (2001)
36. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data. In: International Conference on
Machine Learning (2001)
37. Leroy, V., Cambazoglu, B.B., Bonchi, F.: Cold Start Link Prediction. In: ACM
SIGKDD, Washington, DC (2010)
38. Leung, C.W., Lim, E., Lo, D., Weng, J.: Mining Interesting Link Formation Rules in
Social Networks. In: ACM CIKM, Toronto, Ontario (2010)
39. Li, N., Li, T.: t-closeness: Privacy Beyond k-anonymity and l-diversity. In: ICDE 2007
(2007)
40. Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks. In:
International Conference on Information and Knowledge Management, CIKM 2003
(2003)
41. Lindell, Y., Pinkas, B.: Secure Multiparty Computation for Privacy-Preserving Data
Mining. The Journal of Privacy and Confidentiality 1(1), 59–98 (2009)
42. Liu, K., Terzi, E.: Towards Identity Anonymization on Graphs. In: ACM SIGMOD
2008. ACM Press, Vancouver (2008)
43. Lu, Q., Getoor, L.: Link-based Classification. In: International Conference on Machine
Learning (2003)
44. Machanavajjhala, A., Gehrke, J., Kifer, D.: L-diversity: Privacy beyond k-anonymity.
In: ICDE 2006 (2006)
45. Merugu, S., Ghosh, J.: A Distributed Learning Framework for Heterogeneous Data
Sources. In: ACM KDD 2005, Chicago, Illinois, USA (2005)
46. Morris, M.: Network Epidemiology: A Handbook for Survey Design and Data
Collection. Oxford University Press, London (2004)
47. Muralidhar, K., Sarathy, R.: Security of Random Data Perturbation Methods. ACM
Transactions on Database Systems 24, 487–493 (1999)
48. Nabar, S.U., Marthi, B., Kenthapadi, K., Mishra, N., Motwani, R.: Towards
Robustness in Query Auditing. In: VLDB, pp. 151-162 (2006)
49. Nakashima, E.: “Cyber Attack Data-Sharing is Lacking, Congress Told,” the
Washington Post, p. D02 (September 19, 2008),
http://www.washingtonpost.com/wp-dyn/content/article/
2008/09/18/AR2008091803730.html
50. Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the Presence of Individuals from Shared
Database. In: SIGMOD 2007 (2007)
51. Newman, M.E.J.: Detecting Community Structure in Networks. European Physical
Journal B 38, 321–330 (2004)
52. Oh, H.J., Myeaeng, S.H., Lee, M.H.: A Practical Hypertext Categorization Method
using Links and Incrementally Available Class Information. In: International ACM
SIGIR Conference on Research and Development in Information Retrieval (2000)
53. O’Madadhain, J., Hutchins, J., Smyth, P.: Prediction and Ranking Algorithms for
Event-based Network Data. ACM SIGKDD Explorations 7 (December 2005)
54. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking:
Bringing Order to the Web. Technical Report, Stanford University (1998)
55. Sageman, M.: Understanding Terror Networks. University of Pennsylvania Press
(2004)
56. Sakuma, J., Kobayashi, S.: Link Analysis for Private Weighted Graphs. In:
Proceedings of ACM SIGIR 2009, Boston, MA, pp. 235–242 (2009)
57. Samarati, P.: Protecting Respondents’ Identities in Microdata Release. IEEE
Transactions on Knowledge and Data Engineering 13, 1010–1027 (2001)
58. Srivastava, J., Pathak, N., Mane, S., Ahmad, M.A.: Data Mining for Social Network
Analysis. Tutorial Notes in the 2006 IEEE International Conference on Data Mining,
Hong Kong, December 18-22 (2006)
59. Sweeney, L.: Uniqueness of Simple Demographics in the US Population. Technical
Report, Carnegie Mellon University (2000)
60. Sweeney, L.: K-Anonymity: A Model for Protecting Privacy. International Journal of
Uncertainty Fuzziness Knowledge-based Systems 10, 557–570 (2002)
61. Tai, C., Yu, P.S., Chen, M.: k-Support Anonymity Based on Pseudo Taxonomy for
Outsourcing of Frequent Itemset Mining. In: ACM SIGKDD, Washington, DC (2010)
62. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: ArnetMiner: Extraction and
Mining of Academic Social Networks. In: ACM KDD 2008. ACM Press, Las Vegas
(2008)
63. Thuraisingham, B.: Security Issues for Federated Database Systems. In: Computers
and Security. North Holland (1994)
64. Thuraisingham, B.: Assured Information Sharing: Technologies, Challenges and
Directions. In: Chen, H., Yang, C.C. (eds.) Intelligence and Security Informatics:
Techniques and Applications. SCI, vol. 135, pp. 1–15. Springer, Heidelberg (2008)
65. Tyler, J.R., Wilkinson, D.M., Huberman, B.A.: Email as Spectroscopy: Automated
Discovery of Community Structure within Organizations, The Netherlands (2003)
66. Vaidya, R.J., Clifton, C.: Privacy-preserving top-k queries. In: International
Conference of Data Engineering (2005)
67. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications.
Cambridge University Press, Cambridge (1994)
68. Watts, D.J., Strogatz, S.H.: Collective Dynamics of "Small-world" Networks.
Nature 393, 440–442 (1998)
69. Wolfe, A.P., Jensen, D.: Playing Multiple Roles: Discovering Overlapping Roles in
Social Networks. In: ICML 2004 Workshop on Statistical Relational Learning and its
Connections to Other Fields (2004)
70. Wong, R.C., Li, J., Fu, A., Wang, K.: (α,k)-Anonymity: An enhanced k-Anonymity
Model for Privacy-Preserving Data Publishing. In: SIGKDD, Philadelphia, PA (2006)
71. Xiao, X., Tao, Y.: Personalized Privacy Preservation. In: SIGMOD, Chicago, Illinois
(2006)
72. Xiao, X., Tao, Y.: m-invariance: Towards Privacy Preserving Republication of
Dynamic Datasets. In: ACM SIGMOD 2007. ACM Press (2007)
73. Xiao, X., Tao, Y.: Dynamic Anonymization: Accurate Statistical Analysis with
Privacy Preservation. In: ACM SIGMOD 2008. ACM Press, Vancouver (2008)
74. Xu, J., Chen, H.: CrimeNet Explorer: A Framework for Criminal Network Knowledge
Discovery. ACM Transactions on Information Systems 23, 201–226 (2005)
75. Yan, X., Han, J.: gSpan: Graph-based Substructure Pattern Mining. In: International
Conference on Data Mining (2002)
76. Yang, C.C., Liu, N., Sageman, M.: Analyzing the Terrorist Social Networks with
Visualization Tools. In: IEEE International Conference on Intelligence and Security
Informatics, San Diego, CA (2006)
77. Yang, C.C., Ng, T.D.: Terrorism and Crime Related Weblog Social Network: Link,
Content Analysis and Information Visualization. In: IEEE International Conference on
Intelligence and Security Informatics, New Brunswick, NJ (2007)
78. Yang, C.C., Ng, T.D., Wang, J.-H., Wei, C.-P., Chen, H.: Analyzing and Visualizing
Gray Web Forum Structure. In: Yang, C.C., et al. (eds.) PAISI 2007. LNCS, vol. 4430,
pp. 21–33. Springer, Heidelberg (2007)
79. Yang, C.C.: Information Sharing and Privacy Protection of Terrorist or Criminal
Social Networks. In: IEEE International Conference on Intelligence and Security
Informatics, Taipei, Taiwan, pp. 40–45 (2008)
80. Yang, C.C., Ng, T.D.: Analyzing Content Development and Visualizing Social
Interactions in Web Forum. In: IEEE International Conference on Intelligence and
Security Informatics Taipei, Taiwan (2008)
81. Yang, C.C., Sageman, M.: Analysis of Terrorist Social Networks with Fractal Views.
Journal of Information Science (2009)
82. Yang, C.C., Tang, X.: Social Networks Integration and Privacy Preservation using
Subgraph Generalization. In: Proceedings of ACM SIGKDD Workshop on
CyberSecurity and Intelligence Informatics, Paris, France (June 28, 2009)
83. Yang, C.C., Tang, X., Thuraisingham, B.: An Analysis of User Influence Ranking
Algorithms on Dark Web Forums. In: Proceedings of ACM SIGKDD Workshop on
Intelligence and Security Informatics (ISI-KDD), Washington, D.C. (July 25, 2010)
84. Yang, C.C., Thuraisingham, B.: Privacy-Preserved Social Network Integration and
Analysis for Security Informatics. IEEE Intelligent Systems 25(3), 88–90 (2010)
85. Yang, X., Asur, S., Parthasarathy, S., Mehta, S.: A Visual-Analytic Toolkit for
Dynamic Interaction Graphs. In: ACM KDD 2008, Las Vegas, Nevada (2008)
86. Yao, A.: Protocols for Secure Computations. In: Proceedings of the Annual IEEE
Symposium on Foundations of Computer Science, vol. 23 (1982)
87. Ying, X., Wu, X.: Randomizing Social Networks: A Spectrum Preserving Approach.
In: SIAM International Conference on Data Mining (SDM 2008), Atlanta, GA (2008)
88. Zheleva, E., Getoor, L.: Preserving the Privacy of Sensitive Relationships in Graph
Data. In: Bonchi, F., Malin, B., Saygın, Y. (eds.) PInKDD 2007. LNCS, vol. 4890, pp.
153–171. Springer, Heidelberg (2008)
89. Zhou, B., Pei, J.: Preserving Privacy in Social Networks against Neighborhood
Attacks. In: IEEE International Conference on Data Engineering (2008)
A Clustering Approach to Constrained
Binary Matrix Factorization
1 Introduction
Given a binary matrix G ∈ {0, 1}^{m×n}, the problem of binary matrix factorization
(BMF) is to find two binary matrices U ∈ {0, 1}^{m×k} and W ∈ {0, 1}^{k×n}
so that the distance between G and the matrix product UW is minimal. In the
existing literature, the distance is measured by the square of the Frobenius
norm, leading to the objective function ‖G − UW‖_F^2. BMF arises naturally in
applications involving binary data sets, such as association rule mining for
agaricus-lepiota mushroom data sets [11], biclustering structure identification
for gene expression data sets [28, 29], pattern discovery for gene expression
pattern images [24], digits reconstruction for USPS data sets [21], mining
high-dimensional discrete-attribute data [12, 13], market basket data cluster-
ing [16], and document clustering [29].
Binary data sets occupy a special place in data analysis [16], and it is of
great interest to discover underlying clusters and discrete patterns. Numerous
techniques such as Principal Component Analysis (PCA) [25] have been pro-
posed to deal with continuous data. For nonnegative matrices, nonnegative
matrix factorization (NMF) [14, 15, 17, 30] is used to discover meaningful
patterns in data sets. However, these methods cannot be directly applied to
analyze binary data sets. The presence of binary features poses a great chal-
lenge in the analysis of binary data sets, and it generally leads to NP-hard
problems.
In 2003, Koyutürk et al. [11] first proposed an algorithm called PROXIMUS
to solve BMF via recursive partitioning. Koyutürk et al. [12] further
showed that BMF is NP-hard because it can be formulated as an integer
programming problem with 2^{m+n} feasible solutions, even for rank-1 BMF.
They showed in [13] that there is no theoretical guarantee on the quality of
the solution produced by PROXIMUS. Lin et al. [18] proposed an algorithm
theoretically equivalent to PROXIMUS but with lower computation cost.
Shen et al. [24] proposed a 2-approximation algorithm for rank-1 BMF by
reformulating it as a 0-1 integer linear problem (ILP). Gillis and Glineur [7]
gave an upper bound for BMF by finding the maximum edge bicliques in the
bipartite graph whose adjacency matrix is G. They also proved that rank-1
BMF is NP-hard.
As discussed above, the matrix product U W is generally not required to
be binary for BMF. We call this unconstrained BMF (UBMF). Since the
matrix G is binary, it is often desirable to have a matrix product that is
also binary. We call the resulting problem constrained BMF (CBMF), where
the matrix product is restricted to the class of binary matrices. CBMF is
well suited for certain classes of applications. For example, given a collection
of text documents, one may be interested in classifying the documents into
groups or clusters based on similarity of content. When CBMF is used for
the classification, it is natural to stipulate that each document in the corpus
be assigned to only one cluster, in which case the resulting matrix product
must be binary.
Note that in the above model, the matrix product U W is not required to
be binary. As pointed out in the introduction, since the matrix G is binary,
it is often desirable to have a binary matrix product, which leads to the
constrained binary matrix factorization (CBMF) problem
If we replace the squared Frobenius norm in problem (2) by the l1 norm, then
we end up with the optimization problem

    min_{U,W} ‖G − UW‖_1   subject to   UW ∈ {0, 1}^{m×n},  U ∈ {0, 1}^{m×k},  W ∈ {0, 1}^{k×n}.    (3)
The quadratic constraints make problem (3) very hard to solve. To see
this, let us temporarily fix one matrix, say U , then we end up with a BLP
with linear constraints, which is still nontrivial to solve [4]. One way to reduce
the difficulty of problem (3) is to replace the hard quadratic constraints by
linear constraints that will ensure that the resulting matrix product remains
binary. For this purpose, we introduce the following two specific variants of
CBMF:

    min_{U,W} ‖G − UW‖_1   subject to   U e_k ≤ e_m,    (4)

    min_{U,W} ‖G − UW‖_1   subject to   W^T e_k ≤ e_n,    (5)

with U ∈ {0, 1}^{m×k} and W ∈ {0, 1}^{k×n} as before.
Here e_k ∈ R^{k×1} and e_m ∈ R^{m×1} are vectors of all ones. The constraint
U e_k ≤ e_m (or W^T e_k ≤ e_n) ensures that every row of U (or every column of
W) contains at most one nonzero element, and thus it guarantees that UW
is a binary matrix.
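This is easy to verify numerically. The following NumPy fragment, with small
illustrative matrices of our own choosing, checks that when every row of U has at
most one nonzero entry the product UW stays binary:

    import numpy as np

    # Illustrative binary factors: every row of U has at most one 1,
    # so the constraint U e_k <= e_m holds (an all-zero row is allowed).
    U = np.array([[1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 0]])
    W = np.array([[1, 1, 0],
                  [0, 1, 1]])

    assert np.all(U.sum(axis=1) <= 1)          # U e_k <= e_m
    product = U @ W
    assert set(np.unique(product)) <= {0, 1}   # UW is binary
    print(product)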
Another interesting observation is that for a binary matrix U, all of its
columns are orthogonal to each other if and only if the constraints
U e_k ≤ e_m hold. In other words, the orthogonality of a binary matrix can
be captured by imposing linear constraints on the matrix itself. This
is very different from the case of generic matrices. For example, so-called
nonnegative principal component analysis [27] also imposes an orthogonality
requirement on the involved matrix argument, and it leads to a challenging
optimization problem.
Note that the product matrix is guaranteed to be a binary matrix when
k = 1. Therefore, we immediately have the following result.
Proposition 2.1. If k = 1, then problems (1) and (3) are equivalent.
Our next result establishes the relationship between the variants of CBMF
and general CBMF when k = 2.
Proposition 2.2. If k = 2, then problem (3) is equivalent to either prob-
lem (4) or (5).
Proof. It suffices to prove that if (U, W) is a feasible pair for problem (3),
then it must satisfy either U e_k ≤ e_m or W^T e_k ≤ e_n. Suppose to the contrary
that both constraints U e_k ≤ e_m and W^T e_k ≤ e_n fail to hold, i.e., the i-th
row of U and the j-th column of W satisfy

contradicting the assumption that (U, W) is a feasible pair for problem (3).
Therefore, we have either U e_k ≤ e_m or W^T e_k ≤ e_n. This completes the proof
of the proposition.
Inspired by Propositions 2.1 and 2.2, one may conjecture that problems (1)
and (3) are equivalent when k = 2. The following example disproves such a
conjecture. Let
    G = [ 1 1 1 1 0 0
          1 1 1 1 0 0
          1 1 1 1 1 1
          0 0 0 1 1 1
          0 0 0 1 1 1
          0 0 0 1 1 1 ],   U = [ 1 0
                                 1 0
                                 1 1
                                 0 1
                                 0 1
                                 0 1 ],   W = [ 1 1 1 1 0 0
                                                0 0 0 1 1 1 ].
Then one can verify that the matrix pair (U, W ) is the unique optimal solution
to problem (1), but it is infeasible for problem (3).
We note that if k ≥ 3, then problem (3) is not equivalent to problem (4)
or (5). This can be seen from the following example. Consider the matrix pair
(U, W ) given by
    U = [ 1 0 1
          0 0 1 ],   W^T = [ 0 1 1
                             1 1 0 ].
One can easily see that (U, W ) is a feasible solution to problem (3) but not
a feasible solution to problem (4) or (5).
It is easy to see that the optimal solution to the above problem can be
obtained as follows:

    w_i(j) = 1  if u_j = arg min_{l=0,1,…,k} ‖g_i − u_l‖_1,   and   w_i(j) = 0  otherwise.
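For a fixed U, this assignment can be carried out column by column. The NumPy
sketch below (our own naming, not the authors' code) builds W by assigning each
column g_i of G to its nearest candidate among the zero vector u_0 and the columns
of U under the l1 distance:

    import numpy as np

    def assign_W(G, U):
        """Given binary G (m x n) and U (m x k), build W (k x n) by assigning
        each column g_i to the closest of u_0 = 0, u_1, ..., u_k in l1 distance;
        a column assigned to u_0 becomes an all-zero column of W."""
        m, n = G.shape
        k = U.shape[1]
        candidates = np.hstack([np.zeros((m, 1), dtype=int), U])  # u_0, u_1, ..., u_k
        W = np.zeros((k, n), dtype=int)
        for i in range(n):
            dists = np.abs(candidates - G[:, [i]]).sum(axis=0)    # l1 distances
            best = int(np.argmin(dists))
            if best > 0:                                          # index 0 means u_0
                W[best - 1, i] = 1
        return W

Note that every column of the resulting W contains at most one nonzero entry, so
the product UW is automatically binary.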
    min Σ_{i=1}^{p} ‖v_i − v_c‖_1,    (8)
Proof. The proof follows by observing that the optimal solution (k centers)
to BKMP can be used as the starting centers for problem (5), and that the optimal
solution (k centers) of problem (5), together with the origin of the input data
space, can be used as a starting solution for BKMP with k + 1 centers.

    f(U) ≤ 2 f_opt.
Proof. Let C_p = {g_{p1}, …, g_{pd}} denote the p-th cluster with the binary
centroid u_p^* at the optimal solution of CBMF for 1 ≤ p ≤ k, and let C_0 be the
optimal cluster aligned with u_0. Then we can rewrite the optimal objective value
of (7) as

    f_opt = Σ_{p=1}^{k} Σ_{g_i ∈ C_p} ‖g_i − u_p^*‖_1 + Σ_{g_i ∈ C_0} ‖g_i‖_1.
Let

    g_p^* = arg min_{i=1,…,d} ‖g_{pi} − u_p^*‖_1.    (9)
It follows that

    Σ_{i=1}^{d} ‖g_{pi} − g_p^*‖_1 = Σ_{i=1}^{d} Σ_{j=1}^{m} |g_{pi}(j) − g_p^*(j)|
        ≤ Σ_{i=1}^{d} Σ_{j=1}^{m} ( |g_{pi}(j) − u_p^*(j)| + |u_p^*(j) − g_p^*(j)| )
        = Σ_{i=1}^{d} ‖g_{pi} − u_p^*‖_1 + d ‖g_p^* − u_p^*‖_1
        ≤ 2 Σ_{i=1}^{d} ‖g_{pi} − u_p^*‖_1,
where the first inequality follows from the triangle inequality for the l1 distance,
and the last inequality follows from (9). Therefore, we have

    f(U) ≤ Σ_{p=1}^{k} Σ_{g_i ∈ C_p} ‖g_i − g_p^*‖_1 + Σ_{g_i ∈ C_0} ‖g_i‖_1
         ≤ Σ_{p=1}^{k} 2 Σ_{g_i ∈ C_p} ‖g_i − u_p^*‖_1 + Σ_{g_i ∈ C_0} ‖g_i‖_1
         ≤ 2 ( Σ_{p=1}^{k} Σ_{g_i ∈ C_p} ‖g_i − u_p^*‖_1 + Σ_{g_i ∈ C_0} ‖g_i‖_1 )
         = 2 f_opt,
where the first inequality is implied by the optimality of U^* and (9). The
second inequality holds due to (10). It is straightforward to verify the third
inequality, and the last equality follows from (9).
We remark that, as one can see from the proof of Theorem 4.1, a 2-
approximation solution can also be obtained even when we do not update
the cluster centers. This implies that we can obtain a 2-approximation to
problem (7) in O(mn^{k+1}) time. Similarly, we can modify Algorithm 1 slightly
to obtain a 2-approximation for CBMF (4) in O(nm^{k+1}) time. This implies
that the proposed algorithm can find a 2-approximation to CBMF effectively
for small k. Moreover, combining Theorem 4.1 and Proposition 2.1, we can
derive the following result for UBMF.
Corollary 4.1. A 2-approximation to UBMF with k = 1 can be obtained
in O(nm^2 + mn^2) time by applying Algorithm 1 to problems (4) and (5),
clustering both by columns and by rows, respectively, and taking the best result.
Proof. The lemma follows from (7) and the fact that u_0 = u_0^*.
The next result considers a cluster C_i, for some i ≥ 1, whose center is selected
uniformly at random from the set itself. Though we do not use such a strategy
to select the starting centers, the result is helpful in our later analysis.
Lemma 4.2. Let A be an arbitrary cluster in the final optimal clusters C_opt,
and let C be the clustering with the center selected uniformly at random from
A. Then

    E(f(A)) ≤ 2 f_opt(A).
Proof. The proof follows along the same lines as the proof of Lemma 3.1 in [1],
with the exception that the Euclidean distance is replaced by the l1 distance.
Let c(A) be the l1 center of the cluster in the optimal solution. It follows that

    E(f(A)) = (1/|A|) Σ_{a_0 ∈ A} Σ_{a ∈ A} ‖a − a_0‖_1
            ≤ (1/|A|) Σ_{a_0 ∈ A} ( Σ_{a ∈ A} ‖a − c(A)‖_1 + |A| · ‖a_0 − c(A)‖_1 )
            = 2 Σ_{a ∈ A} ‖a − c(A)‖_1.
It should be mentioned that the above lemma also holds for the cluster C_0 ∈ C_opt,
in which all the data points are aligned with u_0. In such a case, we need only
change the l1 center c(A) to u_0 in the proof of the lemma. We next extend
the above result to the remaining centers chosen with the D^1 weighting.
Lemma 4.3. Let A be an arbitrary cluster in the final optimal clusters C_opt,
and let C be an arbitrary clustering. If we add a random center to C from A,
chosen with the D^1 weighting, then

    E(f(A)) ≤ 4 f_opt(A).
Proof. Note that for any a_0 ∈ A, the probability that a_0 is selected as the
center is D(a_0) / Σ_{a ∈ A} D(a). It follows that

    E(f(A)) = Σ_{a_0 ∈ A} [ D(a_0) / Σ_{a ∈ A} D(a) ] Σ_{a ∈ A} min( D(a), ‖a − a_0‖_1 )
            ≤ (1/|A|) Σ_{a_0 ∈ A} [ Σ_{a ∈ A} ( D(a) + ‖a − a_0‖_1 ) / Σ_{a ∈ A} D(a) ]
                        Σ_{a ∈ A} min( D(a), ‖a − a_0‖_1 )
            ≤ (1/|A|) Σ_{a_0 ∈ A} [ Σ_{a ∈ A} D(a) / Σ_{a ∈ A} D(a) ] Σ_{a ∈ A} ‖a − a_0‖_1
              + (1/|A|) Σ_{a_0 ∈ A} [ Σ_{a ∈ A} ‖a − a_0‖_1 / Σ_{a ∈ A} D(a) ] Σ_{a ∈ A} D(a)
            = (2/|A|) Σ_{a_0 ∈ A} Σ_{a ∈ A} ‖a_0 − a‖_1 ≤ 4 f_opt(A),
where the first inequality follows from the triangle inequality for l1 distance,
and the last inequality follows from Lemma 4.2.
The following lemma resembles Lemma 3.3 in [1], with a minor difference
in the constant used in the estimate. For completeness, we include its proof
here.
Lemma 4.4. Let C be an arbitrary clustering. Choose T > 0 ‘uncovered’
clusters from C_opt, and let V_u denote the set of points in these clusters, with
Proof. We prove this by induction, showing that if the result holds for (t − 1, T)
and (t − 1, T − 1), then it also holds for (t, T). Thus, it suffices to check
the base cases t = 0, T > 0 and t = T = 1.
The case t = 0 follows easily from the fact that 1 + H_t = (T − t)/T = 1.
Suppose T = t = 1. We choose the new center from the one uncovered cluster
with probability f(V_u)/f(V). It follows from Lemma 4.3 that
Suppose that the first center is chosen from some uncovered cluster A, which
happens with probability f(A)/f(V). Let p_a be the conditional probability
that we choose a ∈ A as the center given that the center is from A,
and let f_a(A) denote the objective value when a is used as the center. Adding
A to the covered clusters (thus decreasing both T and t by 1) and applying
the inductive hypothesis again, we have

    E(f(C′)) ≤ (f(A)/f(V)) Σ_{a ∈ A} p_a [ ( f(V_c) + f_a(A) + 4 f_opt(V_u) − 4 f_opt(A) ) (1 + H_{t−1})
                                           + ((T − t)/(T − 1)) ( f(V_u) − f(A) ) ]
             ≤ (f(A)/f(V)) [ ( f(V_c) + 4 f_opt(V_u) ) (1 + H_{t−1})
                             + ((T − t)/(T − 1)) ( f(V_u) − f(A) ) ],
where the last inequality follows from Lemma 4.3. Recalling the power-mean
inequality, we have
    Σ_{A ∈ V_u} f(A)^2 ≥ (1/T) f(V_u)^2.
    E(f(C′)) ≤ ( f(V_c) + 4 f_opt(V_u) ) (1 + H_{t−1}) + ((T − t)/T) f(V_u) + f(V_c) f(V_u) / (T f(V))
             ≤ ( f(V_c) + 4 f_opt(V_u) ) (1 + H_{t−1} + 1/T) + ((T − t)/T) f(V_u)
             ≤ ( f(V_c) + 4 f_opt(V_u) ) (1 + H_{t−1} + 1/t) + ((T − t)/T) f(V_u).    (12)
This completes the proof of the lemma.
Now we are ready to state the main result in this subsection.
Theorem 4.2. If the starting centers are selected by the random initialization
Algorithm 2, then the expected objective function value E(f) = E(f(U)) satisfies

    E(f(U)) ≤ 4(log k + 2) f_opt.
Proof. Consider the clustering C after all the starting centers have been
selected. Let A denote the cluster in C_opt from which we choose u_1. Applying
Lemma 4.4 with t = T = k − 1, and with C_0 and A the only two possibly
covered clusters, we have

    E(f(C)) ≤ ( f(C_0) + f(A) + 4 f(C_opt) − 4 f_opt(C_0) − 4 f_opt(A) ) (1 + H_{k−1})
            ≤ 4(2 + log k) f(C_opt),
It is worth mentioning that compared with Theorem 3.1 in [1], the approxi-
mation ratio in the above theorem is sharper, due to the use of the l1 norm.
5 Extension of CBMF
In the previous sections, we have focused on two specific variants of CBMF,
(4) and (5). In this section we introduce several new variants of CBMF and
explore their relationships to UBMF. Note that if we use the l1 norm as the
objective function, then the optimization model for UBMF can be written as
    f(U, W) = ‖G − UW‖_1 = Σ_{i=1}^{n} ‖g_i − Σ_{j=1}^{k} w_i(j) u_j‖_1.    (13)

For a fixed U, the subproblem for each column g_i of G is

    min_{w_i} f(w_i) = ‖g_i − Σ_{j=1}^{k} w_i(j) u_j‖_1    (14)
Proof. The relation f_u^*(k) ≤ f_c^*(k) holds because the optimal solution of
rank-k CBMF is also a feasible solution for rank-k UBMF.
Now we proceed to prove the relation f_c^*(2^k − 1) ≤ f_u^*(k). Denote by U =
{u_1, u_2, …, u_k} the matrix in the optimal solution to rank-k UBMF, and by
S(u_1, …, u_k) the set of all possible combinations of the columns of U. It
follows immediately that the matrix W can be obtained from the assignment
process (15). Note that for every element s_l ∈ S(u_1, …, u_k), l = 1, …, 2^k,
we can construct another binary vector s̄_l by

    s̄_l(i) = 1  if s_l(i) > 1,   and   s̄_l(i) = s_l(i)  otherwise,   i = 1, …, n.    (17)

Accordingly, we obtain another set S̄ that contains all the elements s̄_l. Since
the matrix G is binary, for every column g_i of G, we have

Note that the set S̄ can be used as a starting matrix in CBMF with rank
2^k − 1. It follows from (7) and (16) that f_c^*(2^k − 1) ≤ f_u^*(k). This completes
the proof of the theorem.
    f_CBMF^*(1) ≥ f_CBMF^*(2) ≥ ··· ≥ f_CBMF^*(k) = f_UBMF^*(k).
This shows that UBMF can be approached via a series of CBMF models.
On the other hand, though problem (13) can be solved via the assignment (15)
for a fixed U, that procedure has complexity 2^k, which is still very
high for large k. In what follows we present a simple iterative procedure for
problem (13) that reduces the objective function value step by step. For this,
we first rewrite the objective in (13) as
where w_{i:} denotes the i-th row of W, and u_{:i} the i-th column of U. Note that
the matrix G̃ is independent of w_{i:}, since the terms involving w_{i:} cancel. Now
let us temporarily fix G̃ and consider the problem

Define w̃_{i:} = u_{:i}^T G̃ / (e_n^T u_{:i}). From Theorem 3.2 we can obtain the
optimal solution (of problem (18)) as follows:

    w_{ij}^* = 0  if w̃_{ij} < 1/2,   and   w_{ij}^* = 1  otherwise.    (19)
It should be pointed out that though the above procedure can reduce
the objective value of problem (13) and is easy to implement, the solution
provided might not be optimal.
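A compact sketch of this iterative procedure, sweeping the update (19) over the
rows of W, is given below. We assume here that G̃ is the residual of G after
removing the contribution of the other rows of W, and we divide by the number
of ones in u_{:i} (our reading of e_n^T u_{:i}); the function and variable names are
ours, not the authors' implementation.

    import numpy as np

    def update_W_rows(G, U, W):
        """One sweep of the thresholding update (19) over the rows of W."""
        W = W.copy()
        for i in range(U.shape[1]):
            ones_in_ui = U[:, i].sum()
            if ones_in_ui == 0:
                continue                          # empty column u_i: skip
            # residual with the contribution of row i of W removed
            G_tilde = G - (U @ W - np.outer(U[:, i], W[i, :]))
            w_tilde = U[:, i] @ G_tilde / ones_in_ui
            W[i, :] = (w_tilde >= 0.5).astype(int)
        return W

As noted above, such sweeps reduce the objective step by step but need not reach
an optimal solution.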
6 Numerical Results
In this section we report numerical results for our proposed algorithms for
both CBMF and UBMF on some test data sets, and compare them with other
existing algorithms for UBMF. For efficiency considerations, we implemented
only the randomized algorithm for CBMF analyzed in Section 4.2. Since the
solution from CBMF is also feasible for UBMF, the output from CBMF can
be used as initial matrices for UBMF. Then we apply Algorithm 3 to obtain
a solution for UBMF. Accordingly, we call such an algorithm hybrid UBMF.
We also compare the solutions of UBMF and CBMF. All numerical tests
were conducted using MATLAB R2012 and performed on a 64-bit Windows
7 system with Intel Core2 Quad 2.66 GHz CPU and 4 GB RAM.
For numerical comparison, we apply PROXIMUS to UBMF [11], which
splits a data set based on the entries of a binary vector and performs recur-
sive partitioning in the direction of such vectors. When the rank is fixed, we
apply the rule proposed in [18] to find the best solution among all possible so-
lutions of the desired rank. We also coded the 2-approximation algorithm [24],
denoted by ILP, for rank-1 UBMF. ILP reformulates UBMF as a 0-1 integer
program and finds an approximate solution by using its linear
programming relaxation. We also implemented a penalty function algorithm
given in [29]. We chose the penalty function algorithm over the thresholding
algorithm in [29] because initial testing showed the thresholding algorithm
to be very time-consuming.
Data sets from three different categories were tested. Synthetic data sets
are first used to test the effectiveness and efficiency of the proposed algo-
rithms. We also use gene expression data sets to find bicluster structures. In
the last part of this section we apply the proposed algorithms in this work
to document clustering and compare results with those from the standard
k-means algorithm and the PROXIMUS algorithm in [11].
For the gene expression data sets, the quality of a computed set of
biclusters M_1 with respect to a reference set M_2 is measured by the match
score

    S(M_1, M_2) = \frac{1}{|M_1|} \sum_{(G_1, C_1) \in M_1} \; \max_{(G_2, C_2) \in M_2} \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|} ,

where each bicluster is represented by its gene set G and condition set C.
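A small Python sketch of this score, with biclusters represented simply by
their gene index sets (the condition sets are not needed for this gene-based
score); the example values below are made up:

    def match_score(M1, M2):
        # M1, M2: lists of biclusters, each given as a set of gene indices.
        # S(M1, M2) = (1/|M1|) * sum_{G1 in M1} max_{G2 in M2} |G1 & G2| / |G1 | G2|
        total = 0.0
        for G1 in M1:
            total += max(len(G1 & G2) / len(G1 | G2) for G2 in M2)
        return total / len(M1)

    found = [{0, 1, 2, 3}, {4, 5, 6}]          # computed biclusters (toy example)
    reference = [{0, 1, 2}, {5, 6, 7}]         # reference biclusters (toy example)
    print(match_score(found, reference))       # (3/4 + 2/4) / 2 = 0.625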
The match scores obtained by the various algorithms are reported in Tables 3
and 4. As we can see from both tables, the hybrid UBMF reports the highest
match score among all algorithms tested. PROXIMUS is consistently worse than
CBMF. Sometimes the penalty function algorithm is able to produce results
similar to those of the hybrid UBMF, but with significantly longer
computational time. For all gene expression data sets, ILP fails to report
any reasonable outputs.
The clustering accuracy is computed from the quantities T(C_k, L_m), where
{C_k} is the set of clusters we obtain, {L_m} is the set of class labels, N is
the total number of documents, and T(C_k, L_m) is the number of entities
belonging to class m that are assigned to cluster k (a small sketch of this
computation follows the data set descriptions). All the data sets are from
[6], as summarized below.
• 20 newsgroups. The 20 Newsgroups data set is a collection of approxi-
mately 20000 newsgroup documents, partitioned evenly across 20 different
newsgroups. It was originally collected by Ken Lang. The 20 newsgroups
collection has become a popular data set for experiments in text applica-
tions of machine learning techniques, such as text classification and text
clustering. We adopt two subsets of this data set for our experiment.
• CNAE-9. This is a data set containing 1080 documents of free text busi-
ness descriptions of Brazilian companies categorized into a subset of nine
categories cataloged in a table called National Classification of Economic
Activities (CNAE). The number of attributes is 857. This data set is highly
sparse (99.22% of the matrix is filled with zeros).
• Internet Ads. This data set represents a set of possible advertisements on
Internet pages. The features encode the geometry of the image (if available)
as well as phrases occurring in the URL, the URL and alt text of the image,
the anchor text, and words occurring near the anchor text. The task is to
predict whether an image is an advertisement or not.
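As mentioned above, here is a short sketch of the accuracy computation
(Python; the rule of crediting each cluster with its majority class is an
assumption made for this illustration, chosen to match the description of
T(.) given earlier):

    from collections import Counter

    def clustering_accuracy(cluster_ids, true_labels):
        # Credit each cluster C_k with max_m T(C_k, L_m), the size of its
        # largest class, and divide the total by the number of documents N.
        counts = {}
        for c, label in zip(cluster_ids, true_labels):
            counts.setdefault(c, Counter())[label] += 1
        return sum(max(ctr.values()) for ctr in counts.values()) / len(true_labels)

    # Toy example: 6 documents, 2 clusters, 2 classes
    print(clustering_accuracy([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"]))  # 4/6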
A brief summary of the data sets is given in Table 5 and the results are
reported in Table 6. Each entry is the clustering accuracy of the column
method on the corresponding row data set, averaged over 10 runs. The ILP and
penalty function algorithms failed to report results on these data sets.
Again, the hybrid UBMF beats the other algorithms in terms of accuracy. It is
also clear that the CBMF and hybrid UBMF algorithms work well on highly
sparse data sets.
7 Conclusions
There are several ways to extend the results in this paper. One possible di-
rection is to develop more effective algorithms for both CBMF and UBMF,
in particular for reasonably large k. For example, in Algorithm 3 we present
a simple iterative procedure to reduce the objective function in UBMF. Since
such a procedure might not provide an optimal solution to problem (13), it
is of interest to incorporate some local search heuristics to further reduce the
objective function. Another possible direction is to consider the scenario of
two different types of mismatched entries: 0-to-1 and 1-to-0. In the current
CBMF model, we minimize the sum of the two types of mismatched entries
without any preference between them. However, in many practical applica-
tions, it might be helpful to include such a preference in the optimization
model. In such a case, we can extend the current CBMF model by using
different weights for each type of error and then design effective algorithms
for the new model. More study is needed to address these issues.
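As an illustration only (this particular form is a sketch and not a model
proposed in the chapter), such a weighted objective could be written as

    f_{\alpha,\beta}(U, W) = \alpha \,\big|\{(r, s) : G_{rs} = 1,\ (UW)_{rs} = 0\}\big|
                           + \beta \,\big|\{(r, s) : G_{rs} = 0,\ (UW)_{rs} \ge 1\}\big| ,

with weights \alpha, \beta > 0 reflecting the relative cost of 1-to-0 and
0-to-1 mismatches; \alpha = \beta = 1 recovers the total number of mismatched
entries minimized by the current model.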
References
[1] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding.
In: Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
pp. 1027–1035 (2007)
[2] Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems
of equations to sparse modeling of signals and images. SIAM Review 51(1),
34–81 (2009)
[3] Brunet, J., Tamayo, P., Golub, T.R., Mesirov, J.P., Lander, E.S.: Metagenes
and molecular pattern discovery using matrix factorization. Proc. National
Academy of Sciences (2004)
[4] Chaovalitwongse, W., Androulakis, I.P., Pardalos, P.M.: Quadratic integer pro-
gramming: Complexity and equivalent forms. In: Floudas, C.A., Pardalos, P.M.
(eds.) Encyclopedia of Optimization (2007)
[5] Crama, Y., Hansen, P., Jaumard, B.: The basic algorithm for pseudo-Boolean
programming revisited. Discrete Appl. Math. 29, 171–185 (1990)
[6] Frank, A., Asuncion, A.: UCI Machine Learning Repository, School of Infor-
mation and Computer Science, University of California, Irvine, CA (2010),
http://archive.ics.uci.edu/ml
[7] Gillis, N., Glineur, F.: Using underapproximations for sparse nonnegative ma-
trix factorization. Pattern Recognition 43(4), 1676–1687 (2010)
[8] Hammer, P.L., Rudeanu, S.: Boolean Methods in Operations Research and
Related Areas. Springer, New York (1968)
[9] Hasegawa, S., Imai, H., Inaba, M., Katoh, N., Nakano, J.: Efficient algorithms
for variance-based k-clustering. In: Proc. First Pacific Conf. Comput. Graphics
Appl., Seoul, Korea, pp. 75–89. World Scientific, Singapore (1993)
[10] Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput.
Surv. 31(3), 264–323 (1999)
[11] Koyutürk, M., Grama, A.: PROXIMUS: a framework for analyzing very high
dimensional discrete-attributed datasets. In: ACM SIGKDD, pp. 147–156
(2003)
[12] Koyutürk, M., Grama, A., Ramakrishnan, N.: Compression, clustering, and
pattern discovery in very high-dimensional discrete-attribute data sets. IEEE
TKDE 17(4), 447–461 (2005)
[13] Koyutürk, M., Grama, A., Ramakrishnan, N.: Nonorthogonal decomposition
of binary matrices for bounded-error data compression and analysis. ACM
Trans. Math. Softw. 32(1), 33–69 (2006)
[14] Lee, D., Seung, H.S.: Learning the parts of objects by non-negative matrix
factorization. Nature 401, 788–791 (1999)
[15] Lee, D., Seung, H.S.: Algorithms for non-negative matrix factorization. In:
Neural Information Processing Systems, NIPS (2001)
[16] Li, T.: A general model for clustering binary data. In: ACM SIGKDD, pp.
188–197 (2005)
[17] Li, T., Ding, C.: The relationships among various nonnegative matrix factor-
ization methods for clustering. In: ICDM, pp. 362–371 (2006)
[18] Lin, M.M., Dong, B., Chu, M.T.: Integer Matrix Factorization and Its Appli-
cation (2009) (preprint)
[19] Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inform. Theory,
129–137 (1982)
[20] MacQueen, J.: Some methods for classification and analysis of multivariate ob-
servations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and
Probability, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
[21] Meeds, E., Ghahramani, Z., Neal, R.M., Roweis, S.T.: Modeling dyadic data
with binary latent factors. In: Neural Information Processing Systems 19 (NIPS
2006), pp. 977–984 (2006)
[22] Miettinen, P., Mielikäinen, T., Gionis, A., Das, G., Mannila, H.: The discrete
basis problem. IEEE Trans. Knowledge Data Engineering 20(10), 1348–1362
(2008)
[23] Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem,
W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation
of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–
1129 (2006)
[24] Shen, B.H., Ji, S., Ye, J.: Mining discrete patterns via binary matrix factor-
ization. In: ACM SIGKDD, pp. 757–766 (2009)
[25] Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework
for nonlinear dimensionality reduction. Science 290, 2319–2323 (2000)
[26] van Uitert, M., Meuleman, W., Wessels, L.: Biclustering sparse binary genomic
data. J. Comput. Biol. 15(10), 1329–1345 (2008)
[27] Zass, R., Shashua, A.: Non-negative sparse PCA. In: Advances in Neural In-
formation Processing Systems (NIPS), vol. 19, pp. 1561–1568 (2007)
[28] Zhang, Z.Y., Li, T., Ding, C., Ren, X.W., Zhang, X.S.: Binary matrix factor-
ization for analyzing gene expression data. Data Min. Knowl. Discov. 20(1),
28–52 (2010)
[29] Zhang, Z.Y., Li, T., Ding, C., Zhang, X.S.: Binary matrix factorization with
applications. In: ICDM, pp. 391–400 (2007)
[30] Zdunek, R.: Data clustering with semi-binary nonnegative matrix factoriza-
tion. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.)
ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 705–716. Springer, Heidelberg
(2008)
Author Index