Data Classification
Algorithms and Applications
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
PUBLISHED TITLES
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE
DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY,
AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY,
SECOND EDITION
Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS
AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND
LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR
ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO
CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS
APPLICATIONS
Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS,
AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY,
ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX
DECOMPOSITIONS
David Skillicorn
Data Classification
Algorithms and Applications
Edited by
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, New York, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to
publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials
or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material repro-
duced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming,
and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400.
CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been
granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identifica-
tion and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Editor Biography
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in York-
town Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from
Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years was in
combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James
B. Orlin. He has since worked in the field of performance analysis, databases, and data mining. He
has published over 200 papers in refereed conferences and journals, and has applied for or been
granted over 80 patents. He is author or editor of ten books. Because of the commercial value of the
aforementioned patents, he has received several invention achievement awards and has thrice been
designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his
work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation
Award (2008) for his scientific contributions to privacy technology, a recipient of the IBM Outstand-
ing Technical Achievement Award (2009) for his work on data streams, and a recipient of an IBM
Research Division Award (2008) for his contributions to System S. He also received the EDBT 2014
Test of Time Award for his work on condensation-based privacy-preserving data mining.
He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering
from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery
and Data Mining, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-
chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information
Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a pub-
lication by Springer. He serves as the vice-president of the SIAM Activity Group on Data Mining,
which is responsible for all data mining activities organized by SIAM, including their main data
mining conference. He is a fellow of the IEEE and the ACM, for “contributions to knowledge dis-
covery and data mining algorithms.”
Contributors
Yixiang Fang
The University of Hong Kong
Hong Kong

Qi Li
State University of New York at Buffalo
Buffalo, New York
Preface
The problem of classification is perhaps one of the most widely studied in the data mining and ma-
chine learning communities. This problem has been studied by researchers from several disciplines
over several decades. Applications of classification include a wide variety of problem domains such
as text, multimedia, social networks, and biological data. Furthermore, the problem may be en-
countered in a number of different scenarios such as streaming or uncertain data. Classification is a
rather diverse topic, and the underlying algorithms depend greatly on the data domain and problem
scenario.
Therefore, this book will focus on three primary aspects of data classification. The first set of
chapters will focus on the core methods for data classification. These include methods such as prob-
abilistic classification, decision trees, rule-based methods, instance-based techniques, SVM meth-
ods, and neural networks. The second set of chapters will focus on different problem domains and
scenarios such as multimedia data, text data, time-series data, network data, data streams, and un-
certain data. The third set of chapters will focus on different variations of the classification problem
such as ensemble methods, visual methods, transfer learning, semi-supervised methods, and active
learning. These are advanced methods, which can be used to enhance the quality of the underlying
classification results.
The classification problem has been addressed by a number of different communities such as
pattern recognition, databases, data mining, and machine learning. In some cases, the work by the
different communities tends to be fragmented, and has not been addressed in a unified way. This
book will make a conscious effort to address the work of the different communities in a unified way.
The book will start off with an overview of the basic methods in data classification, and then discuss
progressively more refined and complex methods for data classification. Special attention will also
be paid to more recent problem domains such as graphs and social networks.
The chapters in the book will be divided into three types:
• Method Chapters: These chapters discuss the key techniques that are commonly used for
classification, such as probabilistic methods, decision trees, rule-based methods, instance-
based methods, SVM techniques, and neural networks.
• Domain Chapters: These chapters discuss the specific methods used for different domains
of data such as text data, multimedia data, time-series data, discrete sequence data, network
data, and uncertain data. Many of these chapters can also be considered application chap-
ters, because they explore the specific characteristics of the problem in a particular domain.
Dedicated chapters are also devoted to large data sets and data streams, because of the recent
importance of the big data paradigm.
• Variations and Insights: These chapters discuss the key variations on the classification pro-
cess such as classification ensembles, rare-class learning, distance function learning, active
learning, and visual learning. Many variations such as transfer learning and semi-supervised
learning use side-information in order to enhance the classification results. A separate chapter
is also devoted to evaluation aspects of classifiers.
This book is designed to be comprehensive in its coverage of the entire area of classification, and it
is hoped that it will serve as a knowledgeable compendium to students and researchers.
Chapter 1
An Introduction to Data Classification
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
1.1 Introduction 2
1.2 Common Techniques in Data Classification 4
1.2.1 Feature Selection Methods 4
1.2.2 Probabilistic Methods 6
1.2.3 Decision Trees 7
1.2.4 Rule-Based Methods 9
1.2.5 Instance-Based Learning 11
1.2.6 SVM Classifiers 11
1.2.7 Neural Networks 14
1.3 Handling Different Data Types 16
1.3.1 Large Scale Data: Big Data and Data Streams 16
1.3.1.1 Data Streams 16
1.3.1.2 The Big Data Framework 17
1.3.2 Text Classification 18
1.3.3 Multimedia Classification 20
1.3.4 Time Series and Sequence Data Classification 20
1.3.5 Network Data Classification 21
1.3.6 Uncertain Data Classification 21
1.4 Variations on Data Classification 22
1.4.1 Rare Class Learning 22
1.4.2 Distance Function Learning 22
1.4.3 Ensemble Learning for Data Classification 23
1.4.4 Enhancing Classification Methods with Additional Data 24
1.4.4.1 Semi-Supervised Learning 24
1.4.4.2 Transfer Learning 26
1.4.5 Incorporating Human Feedback 27
1.4.5.1 Active Learning 28
1.4.5.2 Visual Learning 29
1.4.6 Evaluating Classification Algorithms 30
1.5 Discussion and Conclusions 31
Bibliography 31
1.1 Introduction
The problem of data classification has numerous applications in a wide variety of mining ap-
plications. This is because the problem attempts to learn the relationship between a set of feature
variables and a target variable of interest. Since many practical problems can be expressed as as-
sociations between feature and target variables, this provides a broad range of applicability of this
model. The problem of classification may be stated as follows:
Given a set of training data points along with associated training labels, determine the class la-
bel for an unlabeled test instance.
Numerous variations of this problem can be defined over different settings. Excellent overviews
on data classification may be found in [39, 50, 63, 85]. Classification algorithms typically contain
two phases:
• Training Phase: In this phase, a model is constructed from the training instances.
• Testing Phase: In this phase, the model is used to assign a label to an unlabeled test instance.
In some cases, such as lazy learning, the training phase is omitted entirely, and the classification is
performed directly from the relationship of the training instances to the test instance. Instance-based
methods such as the nearest neighbor classifiers are examples of such a scenario. Even in such cases,
a pre-processing phase such as a nearest neighbor index construction may be performed in order to
ensure efficiency during the testing phase.
The output of a classification algorithm may be presented for a test instance in one of two ways:
1. Discrete Label: In this case, a label is returned for the test instance.
2. Numerical Score: In this case, a numerical score is returned for each class label and test in-
stance combination. Note that the numerical score can be converted to a discrete label for a
test instance, by picking the class with the highest score for that test instance. The advantage
of a numerical score is that it now becomes possible to compare the relative propensity of
different test instances to belong to a particular class of importance, and rank them if needed.
Such methods are used often in rare class detection problems, where the original class distri-
bution is highly imbalanced, and the discovery of some classes is more valuable than others.
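As a minimal illustration of these two output modes, the following Python sketch converts a matrix of per-class numerical scores into discrete labels, and ranks test instances by their propensity toward a rare class of interest; the scores and names are hypothetical.

import numpy as np

# scores[i, j] = numerical score of test instance i for class j (hypothetical values)
scores = np.array([[0.2, 0.8],
                   [0.9, 0.1],
                   [0.4, 0.6]])

# Discrete labels: pick the class with the highest score for each test instance.
labels = scores.argmax(axis=1)

# Ranking for a rare class of interest (say class 1): sort test instances by
# their propensity to belong to that class, highest first.
rare_class = 1
ranking = np.argsort(-scores[:, rare_class])

print(labels)    # [1 0 1]
print(ranking)   # [0 2 1]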
The classification problem thus segments the unseen test instances into groups, as defined by the
class label. While the segmentation of examples into groups is also done by clustering, there is
a key difference between the two problems. In the case of clustering, the segmentation is done
using similarities between the feature variables, with no prior understanding of the structure of the
groups. In the case of classification, the segmentation is done on the basis of a training data set,
which encodes knowledge about the structure of the groups in the form of a target variable. Thus,
while the segmentations of the data are usually related to notions of similarity, as in clustering,
significant deviations from the similarity-based segmentation may be achieved in practical settings.
As a result, the classification problem is referred to as supervised learning, just as clustering is
referred to as unsupervised learning. The supervision process often provides significant application-
specific utility, because the class labels may represent important properties of interest.
Some common application domains in which the classification problem arises are as follows:
• Customer Target Marketing: Since the classification problem relates feature variables to
target classes, this method is extremely popular for the problem of customer target marketing.
In such cases, feature variables describing the customer may be used to predict their buy-
ing interests on the basis of previous training examples. The target variable may encode the
buying interest of the customer.
• Medical Disease Diagnosis: In recent years, the use of data mining methods in medical
technology has gained increasing traction. The features may be extracted from the medical
records, and the class labels correspond to whether or not a patient may pick up a disease
in the future. In these cases, it is desirable to make disease predictions with the use of such
information.
• Supervised Event Detection: In many temporal scenarios, class labels may be associated
with time stamps corresponding to unusual events. For example, an intrusion activity may
be represented as a class label. In such cases, time-series classification methods can be very
useful.
• Multimedia Data Analysis: It is often desirable to perform classification of large volumes of
multimedia data such as photos, videos, audio or other more complex multimedia data. Mul-
timedia data analysis can often be challenging, because of the complexity of the underlying
feature space and the semantic gap between the feature values and corresponding inferences.
• Biological Data Analysis: Biological data is often represented as discrete sequences, in
which it is desirable to predict the properties of particular sequences. In some cases, the
biological data is also expressed in the form of networks. Therefore, classification methods
can be applied in a variety of different ways in this scenario.
• Document Categorization and Filtering: Many applications, such as newswire services,
require the classification of large numbers of documents in real time. This application is
referred to as document categorization, and is an important area of research in its own right.
• Social Network Analysis: Many forms of social network analysis, such as collective classi-
fication, associate labels with the underlying nodes. These are then used in order to predict
the labels of other nodes. Such applications are very useful for predicting useful properties of
actors in a social network.
The diversity of problems that can be addressed by classification algorithms is significant, and cov-
ers many domains. It is impossible to exhaustively discuss all such applications in either a single
chapter or book. Therefore, this book will organize the area of classification into key topics of in-
terest. The work in the data classification area typically falls into a number of broad categories:
• Technique-centered: The problem of data classification can be solved using numerous
classes of techniques such as decision trees, rule-based methods, neural networks, SVM meth-
ods, nearest neighbor methods, and probabilistic methods. This book will cover the most
popular classification methods in the literature comprehensively.
• Data-Type Centered: Many different data types are created by different applications. Some
examples of different data types include text, multimedia, uncertain data, time series, discrete
sequence, and network data. Each of these data types requires the design of dedicated techniques,
which can differ considerably from one another.
• Variations on Classification Analysis: Numerous variations on the standard classification
problem exist, which deal with more challenging scenarios such as rare class learning, transfer
learning, semi-supervised learning, or active learning. Alternatively, different variations of
classification, such as ensemble analysis, can be used in order to improve the effectiveness
of classification algorithms. These issues are of course closely related to issues of model
evaluation. All these issues will be discussed extensively in this book.
This chapter will discuss each of these issues in detail, and will also discuss how the organization of
the book relates to these different areas of data classification. The chapter is organized as follows.
The next section discusses the common techniques that are used for data classification. Section
1.3 explores the use of different data types in the classification process. Section 1.4 discusses the
different variations of data classification. Section 1.5 discusses the conclusions and summary.
1.2 Common Techniques in Data Classification

1.2.1 Feature Selection Methods

Feature selection methods attempt to identify the features that are most relevant to the classification process, and typically fall into two broad categories:

1. Filter Models: In these cases, a crisp criterion on a single feature, or a subset of features, is
used to evaluate their suitability for classification. This method is independent of the specific
algorithm being used.
2. Wrapper Models: In these cases, the feature selection process is embedded into a classifica-
tion algorithm, in order to make the feature selection process sensitive to the classification
algorithm. This approach recognizes the fact that different algorithms may work better with
different features.
In order to perform feature selection with filter models, a number of different measures are used
in order to quantify the relevance of a feature to the classification process. Typically, these measures
compute the imbalance of the feature values over different ranges of the attribute, which may either
be discrete or numerical. Some examples are as follows:
• Gini Index: Let p1 . . . pk be the fractions of training records belonging to the k different classes, among the records taking a particular value of the
discrete attribute. Then, the gini-index of that value of the discrete attribute is given by:
G = 1 − \sum_{i=1}^{k} p_i^2    (1.1)
The value of G ranges between 0 and 1 − 1/k. Smaller values are more indicative of class
imbalance. This indicates that the feature value is more discriminative for classification. The
overall gini-index for the attribute can be measured by weighted averaging over different
values of the discrete attribute, or by using the maximum gini-index over any of the different
discrete values. Different strategies may be more desirable for different scenarios, though the
weighted average is more commonly used.
• Entropy: The entropy of a particular value of the discrete attribute is measured as follows:
E = − \sum_{i=1}^{k} p_i \cdot \log(p_i)    (1.2)
The same notations are used above, as for the case of the gini-index. The value of the entropy
lies between 0 and log(k), with smaller values being more indicative of class skew.
• Fisher’s Index: The Fisher’s index measures the ratio of the between class scatter to the within
class scatter. Therefore, if p j is the fraction of training examples belonging to class j, µ j is
the mean of a particular feature for class j, µ is the global mean for that feature, and σ j is
the standard deviation of that feature for class j, then the Fisher score F can be computed as
follows:
F = \frac{\sum_{j=1}^{k} p_j \cdot (\mu_j − \mu)^2}{\sum_{j=1}^{k} p_j \cdot \sigma_j^2}    (1.3)
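As an illustration, the following Python sketch computes the gini-index and entropy for a discrete feature (aggregated by the weighted average mentioned above), and the Fisher score for a numeric feature; the function names and toy data are illustrative assumptions.

import numpy as np

def gini_index(labels):
    """1 - sum_i p_i^2 over the class fractions of the given records (Equation 1.1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """-sum_i p_i log(p_i) over the class fractions of the given records (Equation 1.2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def weighted_measure(feature, labels, measure):
    """Weighted average of a measure over the distinct values of a discrete feature."""
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * measure(labels[feature == v]) for v, w in zip(values, weights))

def fisher_score(feature, labels):
    """Ratio of between-class to within-class scatter for a numeric feature (Equation 1.3)."""
    classes, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    mu = feature.mean()
    mu_j = np.array([feature[labels == c].mean() for c in classes])
    sigma_j = np.array([feature[labels == c].std() for c in classes])
    return np.sum(p * (mu_j - mu) ** 2) / np.sum(p * sigma_j ** 2)

# Toy example (hypothetical data).
color = np.array(["red", "red", "blue", "blue", "blue"])
x = np.array([1.0, 1.2, 3.1, 2.9, 3.0])
y = np.array([0, 0, 1, 1, 0])
print(weighted_measure(color, y, gini_index), weighted_measure(color, y, entropy))
print(fisher_score(x, y))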
A wide variety of other measures such as the χ2 -statistic and mutual information are also available in
order to quantify the discriminative power of attributes. An approach known as the Fisher’s discrim-
inant [61] is also used in order to combine the different features into directions in the data that are
highly relevant to classification. Such methods are of course feature transformation methods, which
are also closely related to feature selection methods, just as unsupervised dimensionality reduction
methods are related to unsupervised feature selection methods.
The Fisher’s discriminant will be explained below for the two-class problem. Let µ0 and µ1 be
the d-dimensional row vectors representing the means of the records in the two classes, and let Σ0
and Σ1 be the corresponding d × d covariance matrices, in which the (i, j)th entry represents the
covariance between dimensions i and j for that class. Then, the equivalent Fisher score FS(V ) for a
d-dimensional row vector V may be written as follows:
FS(V) = \frac{(V \cdot (\mu_0 − \mu_1))^2}{V (p_0 \cdot \Sigma_0 + p_1 \cdot \Sigma_1) V^T}    (1.4)
This is a generalization of the axis-parallel score in Equation 1.3, to an arbitrary direction V . The
goal is to determine a direction V , which maximizes the Fisher score. It can be shown that the
optimal direction V ∗ may be determined by solving a generalized eigenvalue problem, and is given
by the following expression:

V^{*T} \propto (p_0 \cdot \Sigma_0 + p_1 \cdot \Sigma_1)^{−1} (\mu_0 − \mu_1)^T    (1.5)
If desired, successively orthogonal directions may be determined by iteratively projecting the data
onto the residual subspace, after determining the optimal directions one by one.
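A minimal Python sketch of this computation for the two-class case is shown below; it assumes the direction is obtained, up to scale, from the class means and the class-weighted covariance matrices described above, and all names and data are illustrative.

import numpy as np

def fisher_discriminant_direction(X, y):
    """Return the direction (up to scale) maximizing the two-class Fisher score."""
    X0, X1 = X[y == 0], X[y == 1]
    p0, p1 = len(X0) / len(X), len(X1) / len(X)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Class-weighted within-class scatter matrix p0*Sigma0 + p1*Sigma1.
    S = p0 * np.cov(X0, rowvar=False, bias=True) + p1 * np.cov(X1, rowvar=False, bias=True)
    # The optimal direction is proportional to S^{-1} (mu0 - mu1).
    v = np.linalg.solve(S, mu0 - mu1)
    return v / np.linalg.norm(v)

# Toy example with two Gaussian-like classes (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 1], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_discriminant_direction(X, y))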
More generally, it should be pointed out that many features are often closely correlated with
one another, and the additional utility of an attribute, once a certain set of features have already
been selected, is different from its standalone utility. In order to address this issue, the Minimum
Redundancy Maximum Relevance approach was proposed in [69], in which features are incremen-
tally selected on the basis of their incremental gain on adding them to the feature set. Note that this
method is also a filter model, since the evaluation is on a subset of features, and a crisp criterion is
used to evaluate the subset.
In wrapper models, the feature selection phase is embedded into an iterative approach with a
classification algorithm. In each iteration, the classification algorithm evaluates a particular set of
features. This set of features is then augmented using a particular (e.g., greedy) strategy, and tested
to see if the quality of the classification improves. Since the classification algorithm is used for
evaluation, this approach will generally create a feature set, which is sensitive to the classification
algorithm. This approach has been found to be useful in practice, because of the wide diversity of
models on data classification. For example, an SVM would tend to prefer features in which the two
classes separate out using a linear model, whereas a nearest neighbor classifier would prefer features
in which the different classes are clustered into spherical regions. A good survey on feature selection
methods may be found in [59]. Feature selection methods are discussed in detail in Chapter 2.
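A greedy forward-selection wrapper might be sketched as follows; the evaluation function (for example, the cross-validated accuracy of the chosen classifier) is assumed to be supplied by the user, and all names are illustrative.

def wrapper_forward_selection(evaluate, all_features, max_features=10):
    """Greedy wrapper: repeatedly add the single feature that most improves
    the classifier's evaluated quality (e.g., cross-validated accuracy)."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # Evaluate each augmented feature set with the classification algorithm.
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, best_f = max(scored, key=lambda s: s[0])
        if score <= best_score:      # stop when no candidate improves quality
            break
        selected.append(best_f)
        best_score = score
    return selected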
1.2.2 Probabilistic Methods

Probabilistic classifiers such as the naive Bayes classifier model the probability that a test instance T with feature values (x_1 . . . x_d) belongs to class i. The Bayes rule relates this posterior probability to the class prior and the class-conditional feature probabilities:

P(Y(T) = i | x_1 . . . x_d) = P(Y(T) = i) \cdot \frac{P(x_1 . . . x_d | Y(T) = i)}{P(x_1 . . . x_d)}    (1.6)
Since the denominator is constant across all classes, and one only needs to determine the class with
the maximum posterior probability, one can approximate the aforementioned expression as follows:

P(Y(T) = i | x_1 . . . x_d) \propto P(Y(T) = i) \cdot P(x_1 . . . x_d | Y(T) = i)    (1.7)
The key here is that the expression on the right can be evaluated more easily in a data-driven
way, as long as the naive Bayes assumption is used for simplification. Specifically, in Equation 1.7,
the expression P(x_1 . . . x_d | Y(T) = i) can be expressed as the product of the feature-wise conditional
probabilities:

P(x_1 . . . x_d | Y(T) = i) = \prod_{j=1}^{d} P(x_j | Y(T) = i)    (1.8)
This is referred to as conditional independence, and therefore the Bayes method is referred to as
“naive.” This simplification is crucial, because these individual probabilities can be estimated from
the training data in a more robust way. The naive Bayes assumption is crucial in providing the ability
to perform the product-wise simplification. The term P(x j |Y (T ) = i) is computed as the fraction of
the records in the portion of the training data corresponding to the ith class, which contains feature
value x j for the jth attribute. If desired, Laplacian smoothing can be used in cases when enough
data is not available to estimate these values robustly. This is quite often the case, when a small
amount of training data may contain few or no training records containing a particular feature value.
The Bayes rule has been used quite successfully in the context of a wide variety of applications,
and is particularly popular in the context of text classification. In spite of the naive independence
assumption, the Bayes model seems to be quite effective in practice. A detailed discussion of the
naive assumption in the context of the effectiveness of the Bayes classifier may be found in [38].
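A small illustrative Python sketch of such a classifier for discrete features, with the Laplacian smoothing mentioned above, might look as follows; the class and variable names are hypothetical and this is not a reference implementation.

import numpy as np
from collections import defaultdict

class NaiveBayes:
    def fit(self, X, y, alpha=1.0):
        """X: 2-D array of discrete feature values, y: class labels, alpha: Laplacian smoothing."""
        self.alpha = alpha
        self.classes, counts = np.unique(y, return_counts=True)
        self.priors = {c: n / len(y) for c, n in zip(self.classes, counts)}
        self.class_counts = dict(zip(self.classes, counts))
        self.values = [np.unique(X[:, j]) for j in range(X.shape[1])]
        # cond[(j, c)][v] = count of value v for feature j among records of class c
        self.cond = defaultdict(lambda: defaultdict(int))
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[(j, yi)][v] += 1
        return self

    def predict(self, x):
        """Return the class maximizing P(class) * prod_j P(x_j | class), via log-probabilities."""
        best, best_logp = None, -np.inf
        for c in self.classes:
            logp = np.log(self.priors[c])
            for j, v in enumerate(x):
                num = self.cond[(j, c)][v] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[j])
                logp += np.log(num / den)
            if logp > best_logp:
                best, best_logp = c, logp
        return best

# Toy usage (hypothetical data): weather features -> class label.
X = np.array([["sunny", "hot"], ["sunny", "mild"], ["rain", "mild"], ["rain", "hot"]])
y = np.array(["no", "no", "yes", "no"])
print(NaiveBayes().fit(X, y).predict(np.array(["rain", "mild"])))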
Another probabilistic approach is to directly model the posterior probability, by learning a dis-
criminative function that maps an input feature vector directly onto a class label. This approach is
often referred to as a discriminative model. Logistic regression is a popular discriminative classifier,
and its goal is to directly estimate the posterior probability P(Y (T ) = i|X) from the training data.
Formally, the logistic regression model is defined as
P(Y(T) = i | X) = \frac{1}{1 + e^{−\theta^T X}}    (1.9)
where θ is the vector of parameters to be estimated. In general, maximum likelihood is used to deter-
mine the parameters of the logistic regression. To handle overfitting problems in logistic regression,
regularization is introduced to penalize the log likelihood function for large values of θ. The logistic
regression model has been extensively used in numerous disciplines, including the Web, and the
medical and social science fields.
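A minimal sketch of fitting this model for the binary case by gradient ascent on an L2-regularized log-likelihood is shown below; labels are assumed to be in {0, 1}, and the step size, penalty, and data are illustrative assumptions.

import numpy as np

def train_logistic_regression(X, y, lam=0.1, lr=0.1, epochs=1000):
    """Maximize the L2-regularized log-likelihood of P(y=1|x) = 1 / (1 + exp(-theta^T x))."""
    X = np.hstack([X, np.ones((len(X), 1))])      # append a bias term
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ theta))      # predicted P(y=1|x)
        grad = X.T @ (y - p) - lam * theta        # gradient of the penalized log-likelihood
        theta += lr * grad / len(X)
    return theta

def predict_proba(theta, X):
    X = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-X @ theta))

# Toy usage with linearly separable data (hypothetical values).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(predict_proba(theta, X).round(2))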
A variety of other probabilistic models are known in the literature, such as probabilistic graphical
models, and conditional random fields. An overview of probabilistic methods for data classification
are found in [20, 64]. Probabilistic methods for data classification are discussed in Chapter 3.
1.2.3 Decision Trees

Decision trees create a hierarchical partitioning of the training data with the use of split criteria on the feature variables. The quality of a split at a node N is typically quantified from the fractions p_1 . . . p_k of training records belonging to the different classes at that node, for example with a gini-index G(N) defined exactly as in Equation 1.1. The value of G(N) lies between 0 and 1 − 1/k. The smaller the value of G(N), the greater the skew. In the cases where the classes are evenly balanced, the value is 1 − 1/k. An alternative measure is the entropy E(N), defined over the same class fractions as in Equation 1.2.
TABLE 1.1: Training Data Snapshot Relating Cardiovascular Risk Based on Previous Events to
Different Blood Parameters
Patient Name CRP Level Cholesterol High Risk? (Class Label)
Mary 3.2 170 Y
Joe 0.9 273 N
Jack 2.5 213 Y
Jane 1.7 229 N
Tom 1.1 160 N
Peter 1.9 205 N
Elizabeth 8.1 160 Y
Lata 1.3 171 N
Daniela 4.5 133 Y
Eric 11.4 122 N
Michael 1.8 280 Y
The value of the entropy lies between 0 and log(k), where the value of the expression at p_i = 0 is evaluated in the limit. The value is log(k), when the records are
perfectly balanced among the different classes. This corresponds to the scenario with maximum
entropy. The smaller the entropy, the greater the skew in the data. Thus, the gini-index and entropy
provide an effective way to evaluate the quality of a node in terms of its level of discrimination
between the different classes.
While constructing the training model, the split is performed, so as to minimize the weighted
sum of the gini-index or entropy of the two nodes. This step is performed recursively, until a ter-
mination criterion is satisfied. The most obvious termination criterion is one where all data records
in the node belong to the same class. More generally, the termination criterion requires either a
minimum level of skew or purity, or a minimum number of records in the node in order to avoid
overfitting. One problem in decision tree construction is that there is no way to predict the best
time to stop decision tree growth, in order to prevent overfitting. Therefore, in many variations, the
decision tree is pruned in order to remove nodes that may correspond to overfitting. There are differ-
ent ways of pruning the decision tree. One way of pruning is to use a minimum description length
principle in deciding when to prune a node from the tree. Another approach is to hold out a small
portion of the training data during the decision tree growth phase. It is then tested to see whether
replacing a subtree with a single node improves the classification accuracy on the hold out set. If
this is the case, then the pruning is performed. In the testing phase, a test instance is assigned to an
appropriate path in the decision tree, based on the evaluation of the split criteria in a hierarchical
decision process. The class label of the corresponding leaf node is reported as the relevant one.
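As an illustration of the split-selection step, the following Python sketch chooses a single univariate split on a numeric attribute by minimizing the weighted gini-index of the two children; recursion over child nodes and pruning are omitted, and all names are illustrative.

import numpy as np

def gini(labels):
    """Gini-index of a set of class labels: 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_univariate_split(X, y):
    """Return (feature index, threshold) minimizing the weighted gini of the two children."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            if weighted < best[2]:
                best = (j, t, weighted)
    return best[0], best[1]

# Toy usage on the first six rows of Table 1.1 (Y/N encoded as 1/0).
X = np.array([[3.2, 170], [0.9, 273], [2.5, 213], [1.7, 229], [1.1, 160], [1.9, 205]])
y = np.array([1, 0, 1, 0, 0, 0])
print(best_univariate_split(X, y))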
Figure 1.1 provides an example of how the decision tree is constructed. Here, we have illustrated
a case where the two measures (features) of the blood parameters of patients are used in order to
assess the level of cardiovascular risk in the patient. The two measures are the C-Reactive Protein
(CRP) level and Cholesterol level, which are well known parameters related to cardiovascular risk.
It is assumed that a training data set is available, which is already labeled into high risk and low
risk patients, based on previous cardiovascular events such as myocardial infarctions or strokes. At
the same time, it is assumed that the feature values of the blood parameters for these patients are
available. A snapshot of this data is illustrated in Table 1.1. It is evident from the training data that
higher CRP and Cholesterol levels correspond to greater risk, though it is possible to reach more
definitive conclusions by combining the two.

[Figure 1.1(a) depicts a decision tree with univariate splits: a first split on CRP (< 2 versus > 2), followed by splits on Cholesterol (at thresholds 250 and 200 in the two branches), leading to Normal and HighRisk leaf nodes. Figure 1.1(b) depicts a single multivariate split, CRP + Chol/100 < 4 versus > 4, separating Normal from HighRisk.]
FIGURE 1.1: Illustration of univariate and multivariate splits for decision tree construction.
An example of a decision tree that constructs the classification model on the basis of the two
features is illustrated in Figure 1.1(a). This decision tree uses univariate splits, by first partitioning
on the CRP level, and then using a split criterion on the Cholesterol level. Note that the Cholesterol
split criteria in the two CRP branches of the tree are different. In principle, different features can
be used to split different nodes at the same level of the tree. It is also sometimes possible to use
conditions on multiple attributes in order to create more powerful splits at a particular level of the
tree. An example is illustrated in Figure 1.1(b), where a linear combination of the two attributes
provides a much more powerful split than a single attribute. The split condition is as follows:
CRP + Cholesterol/100 ≤ 4
Note that a single condition such as this is able to partition the training data very well into the
two classes (with a few exceptions). Therefore, the split is more powerful in discriminating between
the two classes in a smaller number of levels of the decision tree. Where possible, it is desirable
to construct more compact decision trees in order to obtain the most accurate results. Such splits
are referred to as multivariate splits. Some of the earliest methods for decision tree construction
include C4.5 [72], ID3 [73], and CART [22]. A detailed discussion of decision trees may be found
in [22, 65, 72, 73]. Decision trees are discussed in Chapter 4.
1.2.4 Rule-Based Methods

Rule-based classifiers express the classification model as a set of rules, in which the left-hand side (antecedent) is a condition on the feature variables, and the right-hand side (consequent)
assigns a test instance to a particular label. For example, for the case of the decision tree illustrated
in Figure 1.1(a), the rightmost path corresponds to the following rule:

CRP > 2 AND Cholesterol > 200 ⇒ HighRisk
It is possible to create a set of disjoint rules from the different paths in the decision tree. In fact,
a number of methods such as C4.5, create related models for both decision tree construction and
rule construction. The corresponding rule-based classifier is referred to as C4.5Rules.
Rule-based classifiers can be viewed as more general models than decision tree models. While
decision trees require the induced rule sets to be non-overlapping, this is not the case for rule-based
classifiers. For example, consider the following rule:
CRP > 3 ⇒ HighRisk
Clearly, this rule overlaps with the previous rule, and is also quite relevant to the prediction of a
given test instance. In rule-based methods, a set of rules is mined from the training data in the first
phase (or training phase). During the testing phase, it is determined which rules are relevant to the
test instance and the final result is based on a combination of the class values predicted by the
different rules.
In many cases, it may be possible to create rules that possibly conflict with one another on the
right hand side for a particular test instance. Therefore, it is important to design methods that can
effectively determine a resolution to these conflicts. The method of resolution depends upon whether
the rule sets are ordered or unordered. If the rule sets are ordered, then the top matching rules can
be used to make the prediction. If the rule sets are unordered, then the rules can be used to vote on
the test instance. Numerous methods such as Classification based on Associations (CBA) [58], CN2
[31], and RIPPER [26] have been proposed in the literature, which use a variety of rule induction
methods, based on different ways of mining and prioritizing the rules.
Methods such as CN2 and RIPPER use the sequential covering paradigm, where rules with
high accuracy and coverage are sequentially mined from the training data. The idea is that a rule is
grown corresponding to a specific target class, and then all training instances matching (or covering)
the antecedent of that rule are removed. This approach is applied repeatedly, until only training
instances of a particular class remain in the data. This constitutes the default class, which is selected
for a test instance, when no rule is fired. The process of mining a rule for the training data is referred
to as rule growth. The growth of a rule involves the successive addition of conjuncts to the left-hand
side of the rule, after the selection of a particular consequent class. This can be viewed as growing a
single “best” path in a decision tree, by adding conditions (split criteria) to the left-hand side of the
rule. After the rule growth phase, a rule-pruning phase is used, which is analogous to decision tree
construction. In this sense, the rule-growth of rule-based classifiers shares a number of conceptual
similarities with decision tree classifiers. These rules are ranked in the same order as they are mined
from the training data. For a given test instance, the class variable in the consequent of the first
matching rule is reported. If no matching rule is found, then the default class is reported as the
relevant one.
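As a small illustration of prediction with an ordered rule set and a default class, consider the following Python sketch; the rules themselves are illustrative and not taken from a specific classifier.

# Each rule is (antecedent, consequent): the antecedent is a predicate over a record.
rules = [
    (lambda r: r["CRP"] > 2 and r["Cholesterol"] > 200, "HighRisk"),
    (lambda r: r["CRP"] > 3, "HighRisk"),
    (lambda r: r["Cholesterol"] < 180 and r["CRP"] < 2, "Normal"),
]
DEFAULT_CLASS = "Normal"   # class of the training instances left uncovered

def predict(record, rules, default=DEFAULT_CLASS):
    """Report the consequent of the first matching rule, or the default class."""
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default

print(predict({"CRP": 4.1, "Cholesterol": 150}, rules))   # fires the second rule -> HighRisk
print(predict({"CRP": 1.0, "Cholesterol": 250}, rules))   # no rule fires -> default class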
Methods such as CBA [58] use the traditional association rule framework, in which rules are
determined with the use of specific support and confidence measures. Therefore, these methods are
referred to as associative classifiers. It is also relatively easy to prioritize these rules with the use of
these parameters. The final classification can be performed by either using the majority vote from
the matching rules, or by picking the top ranked rule(s) for classification. Typically, the confidence
of the rule is used to prioritize them, and the support is used to prune for statistical significance.
A single catch-all rule is also created for test instances that are not covered by any rule. Typically,
this catch-all rule might correspond to the majority class among training instances not covered
by any rule. Rule-based methods tend to be more robust than decision trees, because they are not
restricted to a strict hierarchical partitioning of the data. This is most evident from the relative
performance of these methods in some sparse high dimensional domains such as text. For example,
while many rule-based methods such as RIPPER are frequently used for the text domain, decision
trees are used rarely for text. Another advantage of these methods is that they are relatively easy
to generalize to different data types such as sequences, XML or graph data [14, 93]. In such cases,
the left-hand side of the rule needs to be defined in a way that is specific for that data domain. For
example, for a sequence classification problem [14], the left-hand side of the rule corresponds to a
sequence of symbols. For a graph-classification problem, the left-hand side of the rule corresponds
to a frequent structure [93]. Therefore, while rule-based methods are related to decision trees, they
have significantly greater expressive power. Rule-based methods are discussed in detail in Chapter 5.
1.2.6 SVM Classifiers

[FIGURE 1.2: (a) Two possible separating hyperplanes with their support vectors and margins; (b) margin violation with penalty-based slack variables.]

Support Vector Machines (SVMs) classify the data with carefully chosen linear separating hyperplanes. Recall the multivariate split condition of Figure 1.1(b):

CRP + Cholesterol/100 ≤ 4
In such a case, the split condition in the multivariate case may also be used as stand-alone con-
dition for classification. Thus, an SVM classifier may be considered a single-level decision tree with
a very carefully chosen multivariate split condition. Clearly, since the effectiveness of the approach
depends only on a single separating hyperplane, it is critical to define this separation carefully.
Support vector machines are generally defined for binary classification problems. Therefore, the
class variable yi for the ith training instance Xi is assumed to be drawn from {−1, +1}. The most
important criterion, which is commonly used for SVM classification, is that of the maximum margin
hyperplane. In order to understand this point, consider the case of linearly separable data illustrated
in Figure 1.2(a). Two possible separating hyperplanes, with their corresponding support vectors and
margins have been illustrated in the figure. It is evident that one of the separating hyperplanes has a
much larger margin than the other, and is therefore more desirable because of its greater generality
for unseen test examples. Therefore, one of the important criteria for support vector machines is to
achieve maximum margin separation of the hyperplanes.
In general, it is assumed for d dimensional data that the separating hyperplane is of the form
W · X + b = 0. Here W is a d-dimensional vector representing the coefficients of the hyperplane of
separation, and b is a constant. Without loss of generality, it may be assumed (because of appropriate
coefficient scaling) that the two symmetric support vectors have the form W · X + b = 1 and W ·
X + b = −1. The coefficients W and b need to be learned from the training data D in order to
maximize the margin of separation between these two parallel hyperplanes. It can be shown from
elementary linear algebra that the distance between these two hyperplanes is 2/||W ||. Maximizing
this objective function is equivalent to minimizing ||W ||2 /2. The problem constraints are defined by
the fact that the training data points for each class are on one side of the support vector. Therefore,
these constraints are as follows:
W · Xi + b ≥ +1 ∀i : yi = +1 (1.12)
W · Xi + b ≤ −1 ∀i : yi = −1 (1.13)
This is a constrained convex quadratic optimization problem, which can be solved using Lagrangian
methods. In practice, an off-the-shelf optimization solver may be used to achieve the same goal.
In practice, the data may not be linearly separable. In such cases, soft-margin methods may
be used. A slack ξi ≥ 0 is introduced for each training instance, and a training instance is allowed to
violate the support vector constraint, for a penalty, which is dependent on the slack. This situation
is illustrated in Figure 1.2(b). Therefore, the new set of constraints is as follows:
W · Xi + b ≥ +1 − ξi ∀i : yi = +1 (1.14)
W · Xi + b ≤ −1 + ξi ∀i : yi = −1 (1.15)
ξi ≥ 0 (1.16)
Note that additional non-negativity constraints also need to be imposed in the slack variables. The
objective function is now ||W ||2 /2 + C · ∑ni=1 ξi . The constant C regulates the importance of the
margin and the slack requirements. In other words, small values of C tolerate larger margin violations
(a softer margin), whereas large values of C make the approach behave more like a hard-margin SVM. It
is also possible to solve this problem using off-the-shelf optimization solvers.
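The constrained problem above is equivalent to minimizing the unconstrained objective ||W||^2/2 + C · Σ_i max(0, 1 − y_i(W · X_i + b)); the following Python sketch applies subgradient descent to that form as a didactic stand-in for the off-the-shelf solvers mentioned above, with illustrative step sizes and data.

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on ||W||^2/2 + C * sum_i max(0, 1 - y_i (W.X_i + b)), with y_i in {-1, +1}."""
    n, d = X.shape
    W, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ W + b)
        violating = margins < 1                       # instances inside the margin or misclassified
        grad_W = W - C * (y[violating, None] * X[violating]).sum(axis=0)
        grad_b = -C * y[violating].sum()
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b

# Toy usage on linearly separable data (hypothetical values).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
W, b = train_linear_svm(X, y)
print(np.sign(X @ W + b))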
It is also possible to use transformations on the feature variables in order to design non-linear
SVM methods. In practice, non-linear SVM methods are learned using kernel methods. The key idea
here is that SVM formulations can be solved using only pairwise dot products (similarity values)
between objects. In other words, the optimal decision about the class label of a test instance, from
the solution to the quadratic optimization problem in this section, can be expressed in terms of the
following:
The reader is advised to refer to [84] for the specific details of the solution to the optimization
formulation. The dot product between a pair of instances can be viewed as notion of similarity
among them. Therefore, the aforementioned observations imply that it is possible to perform SVM
classification, with pairwise similarity information between training data pairs and training-test data
pairs. The actual feature values are not required.
This opens the door for using transformations, which are represented by their similarity values.
These similarities can be viewed as kernel functions K(X,Y ), which measure similarities between
the points X and Y . Conceptually, the kernel function may be viewed as dot product between the
pair of points in a newly transformed space (denoted by mapping function Φ(·)). However, this
transformation does not need to be explicitly computed, as long as the kernel function (dot product)
K(X,Y ) is already available:
K(X,Y ) = Φ(X) · Φ(Y ) (1.17)
Therefore, all computations can be performed in the original space using the dot products implied
by the kernel function. Some interesting examples of kernel functions include the Gaussian radial
basis function, polynomial kernel, and hyperbolic tangent, which are listed below in the same order.
K(X_i, X_j) = e^{−||X_i − X_j||^2 / 2\sigma^2}    (1.18)
K(X_i, X_j) = (X_i \cdot X_j + 1)^h    (1.19)
K(X_i, X_j) = \tanh(\kappa X_i \cdot X_j − \delta)    (1.20)
These different functions result in different kinds of nonlinear decision boundaries in the original
space, but they correspond to a linear separator in the transformed space. The performance of a
classifier can be sensitive to the choice of the kernel used for the transformation. One advantage
of kernel methods is that they can also be extended to arbitrary data types, as long as appropriate
pairwise similarities can be defined.
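For concreteness, the three kernels above may be computed as in the following Python sketch, where σ, h, κ, and δ are user-chosen parameters and the example vectors are illustrative.

import numpy as np

def rbf_kernel(Xi, Xj, sigma=1.0):
    """Gaussian radial basis function: exp(-||Xi - Xj||^2 / (2 sigma^2))  (Equation 1.18)."""
    return np.exp(-np.sum((Xi - Xj) ** 2) / (2 * sigma ** 2))

def polynomial_kernel(Xi, Xj, h=2):
    """Polynomial kernel: (Xi . Xj + 1)^h  (Equation 1.19)."""
    return (np.dot(Xi, Xj) + 1) ** h

def tanh_kernel(Xi, Xj, kappa=1.0, delta=0.0):
    """Hyperbolic tangent kernel: tanh(kappa Xi . Xj - delta)  (Equation 1.20)."""
    return np.tanh(kappa * np.dot(Xi, Xj) - delta)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, z), polynomial_kernel(x, z), tanh_kernel(x, z))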
The major downside of SVM methods is that they are slow. However, they are very popular and
tend to have high accuracy in many practical domains such as text. An introduction to SVM methods
may be found in [30, 46, 75, 76, 85]. Kernel methods for support vector machines are discussed
in [75]. SVM methods are discussed in detail in Chapter 7.
1.2.7 Neural Networks

Neural networks build a classification model from elementary computational units called neurons. The most basic unit, the perceptron, computes a linear function of the feature vector X_i with a weight vector W and predicts:

z_i = sign\{W \cdot X_i + b\}    (1.21)
The output is a predicted value of the binary class variable, which is assumed to be drawn from
{−1, +1}. The notation b denotes the bias. Thus, for a vector Xi drawn from a dimensionality of d,
the weight vector W should also contain d elements. Now consider a binary classification problem,
in which all labels are drawn from {+1, −1}. We assume that the class label of Xi is denoted by yi .
In that case, the sign of the predicted function zi yields the class label. An example of the perceptron
architecture is illustrated in Figure 1.3(a). Thus, the goal of the approach is to learn the set of
weights W with the use of the training data, so as to minimize the least squares error (yi − zi )2 . The
idea is that we start off with random weights and gradually update them, when a mistake is made
by applying the current function on the training example. The magnitude of the update is regulated
by a learning rate λ. This update is similar to the updates in gradient descent, which are made for
least-squares optimization. In the case of neural networks, the update function is as follows.
W^{t+1} = W^t + \lambda (y_i − z_i) X_i    (1.22)

Here, W^t is the value of the weight vector in the tth iteration. It is not difficult to show that the
incremental update vector is related to the negative gradient of (yi − zi )2 with respect to W . It is also
easy to see that updates are made to the weights, only when mistakes are made in classification.
When the outputs are correct, the incremental change to the weights is zero.
The similarity to support vector machines is quite striking, in the sense that a linear function
is also learned in this case, and the sign of the linear function predicts the class label. In fact, the
perceptron model and support vector machines are closely related, in that both are linear function
approximators. In the case of support vector machines, this is achieved with the use of maximum
margin optimization. In the case of neural networks, this is achieved with the use of an incremental
learning algorithm, which is approximately equivalent to least-squares error optimization of the prediction.

[FIGURE 1.3: (a) The single-layer perceptron: input nodes X_{i1} . . . X_{i4}, weights w_1 . . . w_4, and a single output node producing Z_i. (b) A multilayer neural network with an input layer, a hidden layer, and an output layer.]
The constant λ regulates the learning rate. The choice of learning rate is sometimes important,
because learning rates that are too small will result in very slow training. On the other hand, if the
learning rates are too fast, this will result in oscillation between suboptimal solutions. In practice,
the learning rates are fast initially, and then allowed to gradually slow down over time. The idea here
is that initially large steps are likely to be helpful, but are then reduced in size to prevent oscillation
between suboptimal solutions. For example, after t iterations, the learning rate may be chosen to be
proportional to 1/t.
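A Python sketch of this mistake-driven training procedure, implementing the update of Equation 1.22 with a learning rate decaying as 1/t, is shown below; the data and parameter values are illustrative.

import numpy as np

def train_perceptron(X, y, epochs=100, lam0=1.0):
    """Mistake-driven updates W <- W + lambda * (y_i - z_i) * X_i with a 1/t learning-rate decay."""
    W, b, t = np.zeros(X.shape[1]), 0.0, 1
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            zi = 1 if np.dot(W, Xi) + b >= 0 else -1   # current prediction in {-1, +1}
            if zi != yi:                               # weights change only when a mistake is made
                lam = lam0 / t                         # decaying learning rate
                W += lam * (yi - zi) * Xi
                b += lam * (yi - zi)
            t += 1
    return W, b

# Toy usage on linearly separable data (hypothetical values).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])
W, b = train_perceptron(X, y)
print(np.sign(X @ W + b))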
The aforementioned discussion was based on the simple perceptron architecture, which can
model only linear relationships. In practice, the neural network is arranged in three layers, referred
to as the input layer, hidden layer, and the output layer. The input layer only transmits the inputs
forward, and therefore, there are really only two layers to the neural network, which can perform
computations. Within the hidden layer, there can be any number of layers of neurons. In such cases,
there can be an arbitrary number of layers in the neural network. In practice, there is only one hidden
layer, which leads to a 2-layer network. An example of a multilayer network is illustrated in Figure
1.3(b). The perceptron can be viewed as a very special kind of neural network, which contains only
a single layer of neurons (corresponding to the output node). Multilayer neural networks allow the
approximation of nonlinear functions, and complex decision boundaries, by an appropriate choice
of the network topology, and non-linear functions at the nodes. In these cases, a logistic or sigmoid
function known as a squashing function is also applied to the inputs of neurons in order to model
non-linear characteristics. It is possible to use different non-linear functions at different nodes. Such
general architectures are very powerful in approximating arbitrary functions in a neural network,
given enough training data and training time. This is the reason that neural networks are sometimes
referred to as universal function approximators.
In the case of single-layer perceptron algorithms, the training process is easy to perform by using
a gradient descent approach. The major challenge in training multilayer networks is that it is no
longer known for intermediate (hidden layer) nodes, what their “expected” output should be. This is
only known for the final output node. Therefore, some kind of “error feedback” is required, in order
to determine the changes in the weights at the intermediate nodes. The training process proceeds in
two phases, one of which is in the forward direction, and the other is in the backward direction.
1. Forward Phase: In the forward phase, the activation function is repeatedly applied to prop-
agate the inputs through the neural network in the forward direction. Since the final output is
supposed to match the class label, the final output at the output layer provides an error value,
depending on the training label value. This error is then used to update the weights of the
output layer, and propagate the weight updates backwards in the next phase.
2. Backpropagation Phase: In the backward phase, the errors are propagated backwards through
the neural network layers. This leads to the updating of the weights in the neurons of the
different layers. The gradients at the previous layers are learned as a function of the errors
and weights in the layer ahead of it. The learning rate λ plays an important role in regulating
the rate of learning.
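As a compact illustration of these two phases, the following Python sketch trains a single-hidden-layer network with sigmoid units by gradient descent on the squared error; the architecture, constants, and data are illustrative assumptions rather than a prescribed setup.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, y, hidden=4, lam=1.0, epochs=10000, seed=0):
    """One hidden layer; forward pass, then backpropagation of the output error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, size=(X.shape[1], hidden))    # input -> hidden weights
    W2 = rng.normal(0, 0.5, size=(hidden, 1))             # hidden -> output weights
    for _ in range(epochs):
        # Forward phase: propagate the inputs through the network.
        H = sigmoid(X @ W1)
        out = sigmoid(H @ W2)
        # Backpropagation phase: output error, then the hidden-layer error derived from it.
        err_out = (out - y[:, None]) * out * (1 - out)
        err_hidden = (err_out @ W2.T) * H * (1 - H)
        W2 -= lam * H.T @ err_out
        W1 -= lam * X.T @ err_hidden
    return W1, W2

# Toy usage: the XOR function, which a single perceptron cannot represent (hypothetical setup).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, W2 = train_mlp(X, y)
print(sigmoid(sigmoid(X @ W1) @ W2).round(2).ravel())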
In practice, any arbitrary function can be approximated well by a neural network. The price of this
generality is that neural networks are often quite slow in practice. They are also sensitive to noise,
and can sometimes overfit the training data.
The previous discussion assumed only binary labels. It is possible to create a k-label neural net-
work, by either using a multiclass “one-versus-all” meta-algorithm, or by creating a neural network
architecture in which the number of output nodes is equal to the number of class labels. Each output
represents prediction to a particular label value. A number of implementations of neural network
methods have been studied in [35,57,66,77,88], and many of these implementations are designed in
the context of text data. It should be pointed out that both neural networks and SVM classifiers use a
linear model that is quite similar. The main difference between the two is in how the optimal linear
hyperplane is determined. Rather than using a direct optimization methodology, neural networks
use a mistake-driven approach to data classification [35]. Neural networks are described in detail
in [19, 51]. This topic is addressed in detail in Chapter 8.
• Concept Drift: The data streams are typically created by a generating process, which may
change over time. This results in concept drift, which corresponds to changes in the underly-
ing stream patterns over time. The presence of concept drift can be detrimental to classifica-
tion algorithms, because models become stale over time. Therefore, it is crucial to adjust the
model in an incremental way, so that it achieves high accuracy over current test instances.
• Massive Domain Constraint: The streaming scenario often contains discrete attributes that
take on millions of possible values. This is because streaming items are often associated with
discrete identifiers. Examples could be email addresses in an email stream, IP addresses
in a network packet stream, and URLs in a click stream extracted from proxy Web logs.
The massive domain problem is ubiquitous in streaming applications. In fact, many synopsis
data structures, such as the count-min sketch [33], and the Flajolet-Martin data structure [41],
have been designed with this issue in mind. While this issue has not been addressed very
extensively in the stream mining literature (beyond basic synopsis methods for counting),
recent work has made a number of advances in this direction [9].
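As a rough illustration of how such synopsis structures cope with massive discrete domains, the following is a small count-min sketch in the spirit of [33]: a fixed-size table of counters whose frequency estimates can over-count but never under-count an item. The width, depth, and hash construction below are illustrative assumptions, not the exact design from the original paper.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=5):   # illustrative table dimensions
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row; salting with the row index simulates
        # independent hash functions (an illustrative choice).
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over the rows bounds the true count from above.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]:  # toy stream
    cms.add(ip)
print(cms.estimate("10.0.0.1"))   # at least 3, and usually exactly 3
```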
optimize only a small working set of variables while keeping the others fixed. This working set is
selected by using a steepest descent criterion. This optimizes the advantage gained from using a
particular subset of attributes. Another strategy used is to discard training examples, which do not
have any impact on the margin of the classifiers. Training examples that are away from the decision
boundary, and on its “correct” side, have no impact on the margin of the classifier, even if they are
removed. Other methods such as SVMPerf [54] reformulate the SVM optimization to reduce the
number of slack variables, and increase the number of constraints. A cutting plane approach, which
works with a small subset of constraints at a time, is used in order to solve the resulting optimization
problem effectively.
Further challenges arise for extremely large data sets. This is because an increasing size of the
data implies that a distributed file system must be used in order to store it, and distributed processing
techniques are required in order to ensure sufficient scalability. The challenge here is that if large
segments of the data are available on different machines, it is often too expensive to shuffle the data
across different machines in order to extract integrated insights from it. Thus, as in all distributed
infrastructures, it is desirable to exchange intermediate insights, so as to minimize communication
costs. For an application programmer, this can sometimes create challenges in terms of keeping
track of where different parts of the data are stored, and the precise ordering of communications in
order to minimize the costs.
In this context, Google’s MapReduce framework [37] provides an effective method for analysis
of large amounts of data, especially when the nature of the computations involve linearly computable
statistical functions over the elements of the data streams. One desirable aspect of this framework is
that it abstracts out the precise details of where different parts of the data are stored to the applica-
tion programmer. As stated in [37]: “The run-time system takes care of the details of partitioning the
input data, scheduling the program’s execution across a set of machines, handling machine failures,
and managing the required inter-machine communication. This allows programmers without any
experience with parallel and distributed systems to easily utilize the resources of a large distributed
system.” Many algorithms, such as the k-means clustering method, are naturally linear in terms of their scala-
bility with the size of the data. A primer on the MapReduce framework implementation on Apache
Hadoop may be found in [87]. The key idea here is to use a Map function in order to distribute the
work across the different machines, and then provide an automated way to shuffle out much smaller
data in (key,value) pairs containing intermediate results. The Reduce function is then applied to the
aggregated results from the Map step in order to obtain the final results.
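The following is a single-process sketch of this Map/shuffle/Reduce flow, computing a linearly computable statistic (per-class record counts) over data imagined to be partitioned across machines. It only simulates the idea locally; a real deployment would use the Hadoop or MapReduce APIs, and the partitioning and records below are purely illustrative.

```python
from collections import defaultdict
from itertools import chain

partitions = [  # illustrative chunks, as if stored on different machines
    [("spam", 1), ("ham", 1), ("spam", 1)],
    [("ham", 1), ("ham", 1)],
    [("spam", 1)],
]

def map_phase(records):
    # Emit (key, value) pairs; here the class label serves as the key.
    return [(label, value) for label, value in records]

def shuffle(mapped_outputs):
    # Group intermediate pairs by key, as the run-time system would.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into the final statistic.
    return {key: sum(values) for key, values in groups.items()}

mapped = [map_phase(chunk) for chunk in partitions]
print(reduce_phase(shuffle(mapped)))   # {'spam': 3, 'ham': 3}
```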
Google’s original MapReduce framework was designed for analyzing large amounts of Web
logs, and more specifically deriving linearly computable statistics from the logs. It has been shown
[44] that a declarative framework is particularly useful in many MapReduce applications, and that
many existing classification algorithms can be generalized to the MapReduce framework. A proper
choice of the algorithm to adapt to the MapReduce framework is crucial, since the framework is
particularly effective for linear computations. It should be pointed out that the major attraction of
the MapReduce framework is its ability to provide application programmers with a cleaner abstrac-
tion, which is independent of very specific run-time details of the distributed system. It should not,
however, be assumed that such a system is somehow inherently superior to existing methods for dis-
tributed parallelization from an effectiveness or flexibility perspective, especially if an application
programmer is willing to design such details from scratch. A detailed discussion of classification
algorithms for big data is provided in Chapter 10.
text is much closer to multidimensional data. However, the standard methods for multidimensional
classification often need to be modified for text.
The main challenge with text classification is that the data is extremely high dimensional and
sparse. A typical text lexicon may be of a size of a hundred thousand words, but a document may
typically contain far fewer words. Thus, most of the attribute values are zero, and the frequencies are
relatively small. Many common words may be very noisy and not very discriminative for the clas-
sification process. Therefore, the problems of feature selection and representation are particularly
important in text classification.
Not all classification methods are equally popular for text data. For example, rule-based meth-
ods, the Bayes method, and SVM classifiers tend to be more popular than other classifiers. Some
rule-based classifiers such as RIPPER [26] were originally designed for text classification. Neural
methods and instance-based methods are also sometimes used. A popular instance-based method
used for text classification is Rocchio’s method [56, 74]. Instance-based methods are also some-
times used with centroid-based classification, where frequency-truncated centroids of class-specific
clusters are used, instead of the original documents for the k-nearest neighbor approach. This gen-
erally provides better accuracy, because the centroid of a small closely related set of documents is
often a more stable representation of that data locality than any single document. This is especially
true because of the sparse nature of text data, in which two related documents may often have only
a small number of words in common.
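A minimal sketch of the centroid-based scheme just described might look as follows: each class is represented by a frequency-truncated centroid of its documents, and a test document is assigned to the class whose centroid is most cosine-similar. The tiny term-frequency vectors and the truncation level are illustrative assumptions.

```python
import numpy as np

def l2_normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy term-frequency vectors (rows are documents), grouped by class; illustrative only.
docs_by_class = {
    "sports":   np.array([[3, 0, 1, 0], [2, 1, 0, 0]], dtype=float),
    "politics": np.array([[0, 2, 0, 3], [0, 1, 1, 4]], dtype=float),
}

def centroid(docs, keep=3):
    # Average the class documents and keep only the `keep` largest weights.
    c = docs.mean(axis=0)
    cutoff = np.sort(c)[-keep]
    c[c < cutoff] = 0.0
    return l2_normalize(c)

centroids = {label: centroid(d) for label, d in docs_by_class.items()}

def classify(doc):
    # Cosine similarity between the normalized document and each class centroid.
    doc = l2_normalize(doc.astype(float))
    return max(centroids, key=lambda label: float(doc @ centroids[label]))

print(classify(np.array([1, 0, 1, 0])))   # -> 'sports'
```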
Many classifiers such as decision trees, which are popularly used in other data domains, are
not quite as popular for text data. The reason for this is that decision trees use a strict hierarchical
partitioning of the data. Therefore, the features at the higher levels of the tree are implicitly given
greater importance than other features. In a text collection containing hundreds of thousands of
features (words), a single word usually tells us very little about the class label. Furthermore, a
decision tree will typically partition the data space with a very small number of splits. This is a
problem when this value is orders of magnitude less than the underlying data dimensionality. Of
course, decision trees in text are not very balanced either, because of the fact that a given word
is contained only in a small subset of the documents. Consider the case where a split corresponds
to presence or absence of a word. Because of the imbalanced nature of the tree, most paths from
the root to leaves will correspond to word-absence decisions, and only a very small number (fewer
than 5 to 10) of word-presence decisions. Clearly, this will lead to poor classification, especially in cases
where word-absence does not convey much information, and a modest number of word presence
decisions are required. Univariate decision trees do not work very well for very high dimensional
data sets, because of disproportionate importance to some features, and a corresponding inability to
effectively leverage all the available features. It is possible to improve the effectiveness of decision
trees for text classification by using multivariate splits, though this can be rather expensive.
The standard classification methods, which are used for the text domain, also need to be suitably
modified. This is because of the high dimensional and sparse nature of the text domain. For example,
text has a dedicated model, known as the multinomial Bayes model, which is different from the
standard Bernoulli model [12]. The Bernoulli model treats the presence and absence of a word in
a text document in a symmetric way. However, in a given text document, only a small fraction
of the lexicon size is present in it. The absence of a word is usually far less informative than the
presence of a word. The symmetric treatment of word presence and word absence can sometimes be
detrimental to the effectiveness of a Bayes classifier in the text domain. In order to avoid this problem,
the multinomial Bayes model is used, which uses the frequency of word presence in a document,
but ignores non-occurrence.
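The following is a minimal sketch of the multinomial Bayes model described above: class scores are built from the frequencies of the words that do occur in a document, while non-occurrences are ignored. Laplace smoothing and the toy term-frequency matrix are illustrative assumptions.

```python
import numpy as np

# Toy term-frequency matrix (documents x vocabulary) and class labels; illustrative only.
X = np.array([[2, 1, 0, 0],
              [3, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 2]], dtype=float)
y = np.array([0, 0, 1, 1])

def train_multinomial_nb(X, y, alpha=1.0):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    # Per-class word probabilities from summed term frequencies, with smoothing.
    counts = np.array([X[y == c].sum(axis=0) + alpha for c in classes])
    word_probs = counts / counts.sum(axis=1, keepdims=True)
    return classes, np.log(priors), np.log(word_probs)

def predict(doc, classes, log_priors, log_word_probs):
    # Only words that occur in the document contribute, weighted by their frequency.
    scores = log_priors + log_word_probs @ doc
    return classes[int(np.argmax(scores))]

model = train_multinomial_nb(X, y)
print(predict(np.array([1, 0, 0, 2], dtype=float), *model))   # -> 1
```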
In the context of SVM classifiers, scalability is important, because such classifiers scale poorly
both with number of training documents and data dimensionality (lexicon size). Furthermore, the
sparsity of text (i.e., few non-zero feature values) should be used to improve the training efficiency.
This is because the training model in an SVM classifier is constructed using a constrained quadratic
optimization problem, which has as many constraints as the number of data points. This is rather
large, and it directly results in an increased size of the corresponding Lagrangian relaxation. In the
case of kernel SVM, the space-requirements for the kernel matrix could also scale quadratically with
the number of data points. A few methods such as SVMLight [53] address this issue by carefully
breaking down the problem into smaller subproblems, and optimizing only a few variables at a time.
Other methods such as SVMPerf [54] also leverage the sparsity of the text domain. The SVMPerf
method scales as O(n · s), where s is proportional to the average number of non-zero feature values
per training document.
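To illustrate why sparsity matters, the following sketch trains a linear classifier with a simple stochastic sub-gradient method for the hinge loss, in which each update touches only the non-zero features of a document, so the cost per step is proportional to s rather than to the lexicon size. This is a generic sketch, not the cutting-plane algorithm of SVMPerf [54]; the documents, step size, and regularization constant are illustrative assumptions.

```python
import random

# Documents as sparse dicts {feature_index: weight}; labels in {-1, +1}. Toy data.
docs = [({0: 1.0, 3: 2.0}, +1), ({1: 1.5, 2: 1.0}, -1),
        ({0: 2.0, 2: 0.5}, +1), ({1: 1.0, 3: 0.5}, -1)]

def train_sparse_svm(docs, dim=4, lam=0.01, eta=0.1, epochs=200, seed=0):
    random.seed(seed)
    w = [0.0] * dim
    data = list(docs)
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * sum(w[j] * v for j, v in x.items())
            # Regularization shrinks all weights; a "lazy" scaling trick would make
            # this step sparse as well, but it is kept simple here.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1:                     # hinge loss is active
                for j, v in x.items():         # update touches only the s non-zeros
                    w[j] += eta * y * v
    return w

print([round(wj, 2) for wj in train_sparse_svm(docs)])
```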
Text classification often needs to be performed in scenarios, where it is accompanied by linked
data. The links between documents are typically inherited from domains such as the Web and social
networks. In such cases, the links contain useful information, which should be leveraged in the
classification process. A number of techniques have recently been designed to utilize such side
information in the classification process. Detailed surveys on text classification may be found in
[12, 78]. The problem of text classification is discussed in detail in Chapter 11 of this book.
• Classifying specific time-instants: These correspond to specific events that can be inferred at
particular instants of the data stream. In these cases, the labels are associated with instants in
time, and the behavior of one or more time series are used in order to classify these instants.
For example, the detection of significant events in real-time applications can be an important
application in this scenario.
• Classifying part or whole series: In these cases, the class labels are associated with portions
or all of the series, and these are used for classification. For example, an ECG time-series will
show characteristic shapes for specific diagnostic criteria for diseases.
Both of these scenarios are equally important from the perspective of analytical inferences in a wide
variety of scenarios. Furthermore, these scenarios are also relevant to the case of sequence data.
Sequence data arises frequently in biological, Web log mining, and system analysis applications.
The discrete nature of the underlying data necessitates the use of methods that are quite different
from the case of continuous time series data. For example, in the case of discrete sequences, the
nature of the distance functions and modeling methodologies are quite different than those in time-
series data.
A brief survey of time-series and sequence classification methods may be found in [91]. A
detailed discussion on time-series data classification is provided in Chapter 13, and that of sequence
data classification methods is provided in Chapter 14. While the two areas are clearly connected,
there are significant differences between these two topics, so as to merit separate topical treatment.
ther supervised or unsupervised methods [3]. For example, consider the case of an image collection,
in which the similarity is defined on the basis of a user-centered semantic criterion. In such a case,
the use of standard distance functions such as the Euclidean metric may not reflect the semantic
similarities between two images well, because such similarities are based on human perception, and may even vary
from collection to collection. Thus, the best way to address this issue is to explicitly incorporate
human feedback into the learning process. Typically, this feedback is incorporated either in terms of
pairs of images with explicit distance values, or in terms of rankings of different images to a given
target image. Such feedback constitutes the training data that is used for learning the distance function, and
the approach can be used for a variety of different data domains. A detailed survey of distance function learning methods
is provided in [92]. The topic of distance function learning is discussed in detail in Chapter 18.
• Boosting: Boosting [40] is a common technique used in classification. The idea is to focus
on successively difficult portions of the data set in order to create models that can classify
the data points in these portions more accurately, and then use the ensemble scores over all
the components. A hold-out approach is used in order to determine the incorrectly classified
instances for each portion of the data set. Thus, the idea is to sequentially determine better
classifiers for more difficult portions of the data, and then combine the results in order to
obtain a meta-classifier, which works well on all parts of the data.
• Bagging: Bagging [24] is an approach that works with random data samples, and combines
the results from the models constructed using different samples. The training examples for
each classifier are selected by sampling with replacement. These are referred to as bootstrap
samples. This approach has often been shown to provide superior results in certain scenarios,
though this is not always the case. This approach is not effective for reducing the bias, but can
reduce the variance, because of the specific random aspects of the training data.
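(A brief code sketch of the bagging procedure is given at the end of this section.)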
• Random Forests: Random forests [23] are an ensemble method that builds a set of decision trees,
using either splits based on randomly generated vectors or random subsets of the training data, and
computes the score as a function of these different components. Typically, the random vectors are
generated from a fixed probability distribution. Therefore, random forests can be created by
either random split selection, or random input selection. Random forests are closely related
to bagging, and in fact bagging with decision trees can be considered a special case of ran-
dom forests, in terms of how the sample is selected (bootstrapping). In the case of random
forests, it is also possible to create the trees in a lazy way, which is tailored to the particular
test instance at hand.
• Model Averaging and Combination: This is one of the most common models used in ensemble
analysis. In fact, the random forest method discussed above is a special case of this idea. In
the context of the classification problem, many Bayesian methods [34] exist for the model
combination process. The use of different models ensures that the error caused by the bias of
a particular classifier does not dominate the classification results.
• Stacking: Methods such as stacking [90] also combine different models in a variety of ways,
such as using a second-level classifier in order to perform the combination. The output of
different first-level classifiers is used to create a new feature representation for the second
level classifier. These first level classifiers may be chosen in a variety of ways, such as using
different bagged classifiers, or by using different training models. In order to avoid overfitting,
the training data needs to be divided into two subsets for the first and second level classifiers.
• Bucket of Models: In this approach [94] a “hold-out” portion of the data set is used in order to
decide the most appropriate model. The most appropriate model is one in which the highest
accuracy is achieved in the held out data set. In essence, this approach can be viewed as a
competition or bake-off contest between the different models.
The area of meta-algorithms in classification is very rich, and different variations may work better
in different scenarios. An overview of different meta-algorithms in classification is provided in
Chapter 19.
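As promised above, here is a concrete illustration of the bagging idea: several weak models (decision stumps in this sketch) are fit on bootstrap samples drawn with replacement, and their majority vote forms the ensemble prediction. The synthetic data, number of models, and choice of decision stumps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                 # synthetic, illustrative data
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # a simple linear concept

def fit_stump(X, y):
    # Pick the single-feature threshold split with the lowest training error.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            err = np.mean((X[:, j] > t).astype(int) != y)
            if best is None or err < best[0]:
                best = (err, j, t)
    _, j, t = best
    return lambda Z: (Z[:, j] > t).astype(int)

def bagging(X, y, n_models=25):
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        models.append(fit_stump(X[idx], y[idx]))
    # Majority vote over the ensemble members.
    return lambda Z: (np.mean([m(Z) for m in models], axis=0) >= 0.5).astype(int)

ensemble = bagging(X, y)
print("training accuracy:", np.mean(ensemble(X) == y))
```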
[Figure 1.4: (a) a single labeled example of class A and of class B, separated by a decision boundary; (b) the same labeled examples together with many unlabeled examples (marked ‘x’), with the old decision boundary shown.]
The motivation of semi-supervised learning is that knowledge of the dense regions in the space
and correlated regions of the space are helpful for classification. Consider the two-class example
illustrated in Figure 1.4(a), in which only a single training example is available for each class.
In such a case, the decision boundary between the two classes is the straight line perpendicular
to the one joining the two classes. However, suppose that some additional unsupervised examples
are available, as illustrated in Figure 1.4(b). These unsupervised examples are denoted by ‘x’. In
such a case, the decision boundary changes from Figure 1.4(a). The major assumption here is that
the classes vary less in dense regions of the training data, because of the smoothness assumption.
As a result, even though the added examples do not have labels, they contribute significantly to
improvements in classification accuracy.
In this example, the correlations between feature values were estimated with unlabeled training
data. This has an intuitive interpretation in the context of text data, where joint feature distributions
can be estimated with unlabeled data. For example, consider a scenario where training data is
available for predicting whether a document belongs to the “politics” category. It may be possible that the
word “Obama” (or some of the less common words) may not occur in any of the (small number
of) training documents. However, the word “Obama” may often co-occur with many features of the
“politics” category in the unlabeled instances. Thus, the unlabeled instances can be used to learn the
relevance of these less common features to the classification process, especially when the amount
of available training data is small.
Similarly, when the data is clustered, each cluster in the data is likely to predominantly contain
data records of one class or the other. The identification of these clusters only requires unsuper-
vised data rather than labeled data. Once the clusters have been identified from unlabeled data,
only a small number of labeled examples are required in order to determine confidently which label
corresponds to which cluster. Therefore, when a test example is classified, its clustering structure
provides critical information for its classification process, even when a smaller number of labeled
examples are available. It has been argued in [67] that the accuracy of the approach may increase ex-
ponentially with the number of labeled examples, as long as the assumption of smoothness in label
structure variation holds true. Of course, in real life, this may not be true. Nevertheless, it has been
shown repeatedly in many domains that the addition of unlabeled data provides significant advan-
tages for the classification process. An argument for the effectiveness of semi-supervised learning
that uses the spectral clustering structure of the data may be found in [18]. In some domains such
as graph data, semi-supervised learning is the only way in which classification may be performed.
This is because a given node may have very few neighbors of a specific class.
Semi-supervised methods are implemented in a wide variety of ways. Some of these methods
directly try to label the unlabeled data in order to increase the size of the training set. The idea is
to incrementally add the most confidently predicted label to the training data. This is referred to as
self training. Such methods have the downside that they run the risk of overfitting. For example,
when an unlabeled example is added to the training data with a specific label, the label might be
incorrect because of the specific characteristics of the feature space, or the classifier. This might
result in further propagation of the errors. The results can be quite severe in many scenarios.
Therefore, semi-supervised methods need to be carefully designed in order to avoid overfitting.
An example of such a method is co-training [21], which partitions the attribute set into two subsets,
on which classifier models are independently constructed. The top label predictions of one classifier
are used to augment the training data of the other, and vice-versa. Specifically, the steps of co-
training are as follows:
1. Divide the feature space into two disjoint subsets f1 and f2 .
2. Train two independent classifier models M1 and M2 , which use the disjoint feature sets f1
and f2 , respectively.
3. Add the unlabeled instance with the most confidently predicted label from M1 to the training
data for M2 and vice-versa.
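A minimal sketch of this co-training loop is given below. For brevity, the two classifiers are nearest-centroid models, newly labeled points are placed in a single shared training pool rather than in two separate ones, and the synthetic two-view data, seed labels, and number of rounds are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 50
y_true = np.array([0] * n_per_class + [1] * n_per_class)
# Two synthetic "views" (standing in for disjoint feature subsets f1 and f2).
X1 = rng.normal(loc=y_true[:, None] * 2.0, size=(2 * n_per_class, 2))
X2 = rng.normal(loc=y_true[:, None] * 2.0, size=(2 * n_per_class, 2))

class NearestCentroid:
    def fit(self, X, y):
        self.centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        return self

    def predict_with_confidence(self, x):
        dists = {c: np.linalg.norm(x - mu) for c, mu in self.centroids.items()}
        label = min(dists, key=dists.get)
        return label, -dists[label]      # a nearer centroid means higher confidence

labels = {0: 0, 99: 1}                   # a tiny seed of labeled instances (illustrative)
for _ in range(20):                      # a few co-training rounds
    idx = np.array(sorted(labels))
    y_idx = np.array([labels[i] for i in idx])
    m1 = NearestCentroid().fit(X1[idx], y_idx)   # model M1 trained on view f1
    m2 = NearestCentroid().fit(X2[idx], y_idx)   # model M2 trained on view f2
    for model, view in ((m1, X1), (m2, X2)):
        pool = [i for i in range(len(y_true)) if i not in labels]
        if not pool:
            break
        scored = [(model.predict_with_confidence(view[i]), i) for i in pool]
        (lab, _), best = max(scored, key=lambda s: s[0][1])   # most confident prediction
        labels[best] = lab               # augment the shared labeled pool

agreement = np.mean([labels[i] == y_true[i] for i in labels])
print(f"{len(labels)} labeled instances, agreement with ground truth: {agreement:.2f}")
```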
the learning process. For example, consider the case of learning the class label of Chinese docu-
ments, where enough training data is not available about the documents. However, similar English
documents may be available that contain training labels. In such cases, the knowledge in training
data for the English documents can be transferred to the Chinese document scenario for more ef-
fective classification. Typically, this process requires some kind of “bridge” in order to relate the
Chinese documents to the English documents. An example of such a “bridge” could be pairs of
similar Chinese and English documents though many other models are possible. In many cases,
a small amount of auxiliary training data in the form of labeled Chinese training documents may
also be available in order to further enhance the effectiveness of the transfer process. This general
principle can also be applied to cross-category or cross-domain scenarios where knowledge from
one classification category is used to enhance the learning of another category [71], or the knowl-
edge from one data domain (e.g., text) is used to enhance the learning of another data domain (e.g.,
images) [36, 70, 71, 95]. Broadly speaking, transfer learning methods fall into one of the following
four categories:
1. Instance-Based Transfer: In this case, the feature spaces of the two domains are highly over-
lapping; even the class labels may be the same. Therefore, it is possible to transfer knowledge
from one domain to the other by simply re-weighting the source-domain instances.
2. Feature-Based Transfer: In this case, there may be some overlaps among the features, but
a significant portion of the feature space may be different. Often, the goal is to perform a
transformation of each feature set into a new low dimensional space, which can be shared
across related tasks.
3. Parameter-Based Transfer: In this case, the motivation is that a good training model has
typically learned a lot of structure. Therefore, if two tasks are related, then the structure can
be transferred to learn the target task.
4. Relational-Transfer Learning: The idea here is that if two domains are related, they may share
some similarity relations among objects. These similarity relations can be used for transfer
learning across domains.
The major challenge in such transfer learning methods is that negative transfer can be caused in
some cases when the side information used is very noisy or irrelevant to the learning process. There-
fore, it is critical to use the transfer learning process in a careful and judicious way in order to truly
improve the quality of the results. A survey on transfer learning methods may be found in [68], and
a detailed discussion on this topic may be found in Chapter 21.
[Figure: (a) class separation; (b) random sample with SVM classifier; (c) active sample with SVM classifier.]
learning algorithms, most of which try to either reduce the uncertainty in classification or reduce the
error associated with the classification process. Some examples of criteria that are commonly used
in order to query the learner are as follows:
• Uncertainty Sampling: In this case, the learner queries the user for the labels of examples for
which the greatest level of uncertainty exists about their correct output [45].
• Query by Committee (QBC): In this case, the learner queries the user for labels of examples
in which a committee of classifiers has the greatest disagreement. Clearly, this is another
indirect way to ensure that examples with the greatest uncertainty are queried [81].
• Greatest Model Change: In this case, the learner queries the user for labels of examples,
which cause the greatest level of change from the current model. The goal here is to learn
new knowledge that is not currently incorporated in the model [27].
• Greatest Error Reduction: In this case, the learner queries the user for the labels of examples
which cause the greatest reduction of error for the current model [28].
• Greatest Variance Reduction: In this case, the learner queries the user for examples, which
result in greatest reduction in output variance [28]. This is actually similar to the previous
case, since the variance is a component of the total error.
• Representativeness: In this case, the learner queries the user for labels that are most represen-
tative of the underlying data. Typically, this approach combines one of the aforementioned
criteria (such as uncertainty sampling or QBC) with a representativeness model such as a
density-based method in order to perform the classification [80].
These different kinds of models may work well in different kinds of scenarios. Another form of
active learning queries the data vertically. In other words, instead of querying for the labels of examples, the learner determines which
attributes to collect, so as to minimize the error at a given cost level [62]. A survey on active learning
methods may be found in [79]. The topic of active learning is discussed in detail in Chapter 22.
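As a concrete illustration of the uncertainty sampling criterion listed above, the following pool-based sketch repeatedly queries the label of the point about which a simple logistic model is least certain (predicted probability closest to 0.5). The synthetic data, seed size, and query budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))                      # synthetic, illustrative pool
y = (X @ np.array([1.5, -1.0]) > 0).astype(int)    # hidden labeling rule

def fit_logistic(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)           # gradient of the log loss
    return w

labeled = list(rng.choice(len(X), size=5, replace=False))   # small seed set
for _ in range(20):                                # query budget of 20 labels
    w = fit_logistic(X[labeled], y[labeled])
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    uncertainty = -np.abs(p - 0.5)                 # closest to 0.5 = most uncertain
    uncertainty[labeled] = -np.inf                 # never re-query known labels
    query = int(np.argmax(uncertainty))
    labeled.append(query)                          # the "user" supplies y[query]

w = fit_logistic(X[labeled], y[labeled])
acc = np.mean(((X @ w) > 0).astype(int) == y)
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```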
A general discussion on visual data mining methods is found in [10, 47, 49, 55, 83]. A detailed
discussion of methods for visual classification is provided in Chapter 23.
• Methodology used for evaluation: Classification algorithms require a training phase and a
testing phase, in which the test examples are cleanly separated from the training data. How-
ever, in order to evaluate an algorithm, some of the labeled examples must be removed from
the training data; the model is constructed on the remaining examples and evaluated on the
removed ones. The problem here is that the removal of labeled examples implicitly underestimates
the power of the classifier, relative to the full set of labels already available. Therefore, how should
this removal of labeled examples be performed so as to not impact the learner accuracy too much?
Various strategies are possible, such as hold-out, bootstrapping, and cross-validation, of which
the first is the simplest to implement, and the last provides the most accurate estimate.
In the hold-out approach, a fixed percentage of the training examples are “held out,”
and not used in the training. These examples are then used for evaluation. Since only a subset
of the training data is used, the evaluation tends to be pessimistic with this approach. Some
variations use stratified sampling, in which each class is sampled independently in proportion to its relative frequency.
This ensures that random variations of class frequency between training and test examples are
removed.
In bootstrapping, sampling with replacement is used for creating the training examples. The
most typical scenario is that n examples are sampled with replacement, as a result of which
the fraction of examples not sampled is equal to (1 − 1/n)^n ≈ 1/e ≈ 0.368, where e is the base of
the natural logarithm. The class accuracy is then evaluated as a weighted combination of the
accuracy a1 on the unsampled (test) examples, and the accuracy a2 on the full labeled data.
The full accuracy A is given by:
A = (0.632) · a1 + (0.368) · a2.
This procedure is repeated over multiple bootstrap samples and the final accuracy is reported.
Note that the component a2 tends to be highly optimistic, as a result of which the bootstrap-
ping approach produces highly optimistic estimates. It is most appropriate for smaller data
sets.
In cross-validation, the training data is divided into a set of k disjoint subsets. One of the k
subsets is used for testing, whereas the other (k − 1) subsets are used for training. This process
is repeated by using each of the k subsets as the test set, and the error is averaged over all
possibilities. This has the advantage that all examples in the labeled data have an opportunity
to be treated as test examples. Furthermore, when k is large, the training data size approaches
the full labeled data. Therefore, such an approach approximates the accuracy of the model
using the entire labeled data well. A special case is “leave-one-out” cross-validation, where
k is chosen to be equal to the number of training examples, and therefore each test segment
contains exactly one example. This is, however, expensive to implement.
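(A brief code sketch of k-fold cross-validation is given at the end of this section.)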
• Quantification of accuracy: This issue deals with the problem of quantifying the error of
a classification algorithm. At first sight, it would seem that it is most beneficial to use a
measure such as the absolute classification accuracy, which directly computes the fraction
of examples that are correctly classified. However, this may not always be appropriate in
all cases. For example, some algorithms may have much lower variance across different data
sets, and may therefore be more desirable. In this context, an important issue that arises is that
of the statistical significance of the results, when a particular classifier performs better than
another on a data set. Another issue is that the output of a classification algorithm may either
be presented as a discrete label for the test instance, or a numerical score, which represents the
propensity of the test instance to belong to a specific class. For the case where it is presented
as a discrete label, the accuracy is the most appropriate score.
In some cases, the output is presented as a numerical score, especially when the class is rare.
In such cases, the Precision-Recall or ROC curves may need to be used for the purposes of
classification evaluation. This is particularly important in imbalanced and rare-class scenarios.
Even when the output is presented as a binary label, the evaluation methodology is different
for the rare class scenario. In the rare class scenario, the misclassification of the rare class
is typically much more costly than that of the normal class. In such cases, cost sensitive
variations of evaluation models may need to be used for greater robustness. For example, the
cost sensitive accuracy weights the rare class and normal class examples differently in the
evaluation.
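As a small illustration of the cost-sensitive accuracy just mentioned, the following sketch weights rare-class examples more heavily than normal-class examples when scoring predictions. The weight of 5 and the toy label and prediction vectors are illustrative assumptions.

```python
import numpy as np

# Toy labels and predictions; class 1 is the rare class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

def cost_sensitive_accuracy(y_true, y_pred, rare_class=1, rare_weight=5.0):
    # Rare-class examples count rare_weight times as much as normal ones.
    weights = np.where(y_true == rare_class, rare_weight, 1.0)
    return float(np.sum(weights * (y_true == y_pred)) / np.sum(weights))

plain = np.mean(y_true == y_pred)
weighted = cost_sensitive_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, cost-sensitive accuracy: {weighted:.2f}")
```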
An excellent review of evaluation of classification algorithms may be found in [52]. A discussion
of evaluation of classification algorithms is provided in Chapter 24.
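To make the cross-validation procedure described above concrete, the following is a minimal k-fold loop in which each of the k disjoint folds serves once as the test set while the remaining folds are used for training. The nearest-centroid stand-in classifier, the synthetic data, and k = 5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))                     # synthetic, illustrative data
y = (X[:, 0] - X[:, 1] > 0).astype(int)

def train_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, X):
    labels = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    return np.array(labels)[np.argmin(dists, axis=0)]

def cross_validate(X, y, k=5):
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)                # k disjoint subsets
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_centroids(X[train], y[train])
        accuracies.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accuracies))             # error averaged over all k folds

print(f"5-fold accuracy estimate: {cross_validate(X, y):.3f}")
```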
Bibliography
[1] C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
[3] C. Aggarwal. Towards Systematic Design of Distance Functions in Data Mining Applications,
ACM KDD Conference, 2003.
[4] C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.
[5] C. Aggarwal. On Density-based Transforms for Uncertain Data Mining, ICDE Conference,
2007.
[6] C. Aggarwal. Social Network Data Analytics, Springer, Chapter 5, 2011.
[7] C. Aggarwal and H. Wang. Managing and Mining Graph Data, Springer, 2010.
[8] C. Aggarwal and C. Zhai. Mining Text Data, Chapter 11, Springer, 2012.
[9] C. Aggarwal and P. Yu. On Classification of High Cardinality Data Streams. SDM Conference,
2010.
[10] C. Aggarwal. Towards Effective and Interpretable Data Mining by Visual Interaction, ACM
SIGKDD Explorations, 2002.
[13] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for classification of evolving data
streams. IEEE Transactions on Knowledge and Data Engineering, 2006.
[14] C. Aggarwal. On Effective Classification of Strings with Wavelets, ACM KDD Conference,
2002.
[15] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms, Machine Learning,
6(1):37–66, 1991.
[16] D. Aha. Lazy learning: Special issue editorial. Artificial Intelligence Review, 11:7–10, 1997.
[17] M. Ankerst, M. Ester, and H.-P. Kriegel. Towards an Effective Cooperation of the User and the
Computer for Classification, ACM KDD Conference, 2000.
[18] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds, Machine Learn-
ing, 56:209–239, 2004.
[19] C. Bishop. Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[26] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization, ACM
Transactions on Information Systems, 17(2):141–173, 1999.
[27] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning, Machine
Learning, 5(2):201–221, 1994.
[28] D. Cohn, Z. Ghahramani and M. Jordan. Active learning with statistical models, Journal of
Artificial Intelligence Research, 4:129–145, 1996.
[29] O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning. Vol. 2, Cambridge: MIT
Press, 2006.
[30] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods, Cambridge University Press, 2000.
[31] P. Clark and T. Niblett. The CN2 Induction algorithm, Machine Learning, 3(4):261–283, 1989.
[32] B. Clarke. Bayes model averaging and stacking when model approximation error cannot be
ignored, Journal of Machine Learning Research, pages 683–712, 2003.
[33] G. Cormode and S. Muthukrishnan. An improved data-stream summary: The count-min sketch
and its applications, Journal of Algorithms, 55(1):58–75, 2005.
[34] P. Domingos. Bayesian Averaging of Classifiers and the Overfitting Problem. ICML Confer-
ence, 2000.
[35] I. Dagan, Y. Karov, and D. Roth. Mistake-driven Learning in Text Categorization, Proceedings
of EMNLP, 1997.
[36] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across
different feature spaces. Proceedings of Advances in Neural Information Processing Systems,
2008.
[37] J. Dean and S. Ghemawat. MapReduce: A flexible data processing tool, Communications of the
ACM, 53:72–77, 2010.
[38] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under
zero-one loss. Machine Learning, 29(2–3):103–130, 1997.
[39] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley, 2001.
[42] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic Decision Tree Con-
struction, ACM SIGMOD Conference, 1999.
[43] J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest—a framework for fast decision tree
construction of large datasets, VLDB Conference, pages 416–427, 1998.
[45] D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning, ICML
Conference, 1994.
[46] L. Hamel. Knowledge Discovery with Support Vector Machines, Wiley, 2009.
[49] M. C. F. de Oliveira and H. Levkowitz. Visual Data Mining: A Survey, IEEE Transactions on
Visualization and Computer Graphics, 9(3):378–394, 2003.
[50] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer, 2013.
[51] S. Haykin. Neural Networks and Learning Machines, Prentice Hall, 2008.
[52] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective,
Cambridge University Press, 2011.
[53] T. Joachims. Making Large scale SVMs practical, Advances in Kernel Methods, Support Vector
Learning, pages 169–184, Cambridge: MIT Press, 1998.
[54] T. Joachims. Training Linear SVMs in Linear Time, KDD, pages 217–226, 2006.
[55] D. Keim. Information and visual data mining, IEEE Transactions on Visualization and Com-
puter Graphics, 8(1):1–8, 2002.
[56] W. Lam and C. Y. Ho. Using a Generalized Instance Set for Automatic Text Categorization.
ACM SIGIR Conference, 1998.
[57] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning, 2:285–318, 1988.
[58] B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, ACM
KDD Conference, 1998.
[59] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining, Springer,
1998.
[60] R. Mayer. Multimedia Learning, Cambridge University Press, 2009.
[61] G. J. McLachlan. Discriminant analysis and statistical pattern recognition, Wiley-
Interscience, Vol. 544, 2004.
[65] S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey,
Data Mining and Knowledge Discovery, 2(4):345–389, 1998.
[66] H. T. Ng, W. Goh and K. Low. Feature Selection, Perceptron Learning, and a Usability Case
Study for Text Categorization. ACM SIGIR Conference, 1997.
[67] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unla-
beled documents using EM, Machine Learning, 39(2–3):103–134, 2000.
[68] S. J. Pan and Q. Yang. A survey on transfer learning, IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359, 2010.
[69] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-
dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(8):1226–1238, 2005.
[70] G. Qi, C. Aggarwal, and T. Huang. Towards Semantic Knowledge Propagation from Text
Corpus to Web Images, WWW Conference, 2011.
[71] G. Qi, C. Aggarwal, Y. Rui, Q. Tian, S. Chang, and T. Huang. Towards Cross-Category Knowl-
edge Propagation for Learning Visual Concepts, CVPR Conference, 2011.
[72] J. Quinlan. C4.5: Programs for Machine Learning, Morgan-Kaufmann Publishers, 1993.
[73] J. R. Quinlan. Induction of decision trees, Machine Learning, 1(1):81–106, 1986.
[74] J. Rocchio. Relevance feedback information retrieval. In G. Salton, editor, The Smart Retrieval
System - Experiments in Automatic Document Processing, pages 313–323, Prentice Hall,
Englewood Cliffs, NJ, 1971.
[75] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regulariza-
tion, Optimization, and Beyond, Cambridge University Press, 2001.
[76] I. Steinwart and A. Christmann. Support Vector Machines, Springer, 2008.
[77] H. Schutze, D. Hull, and J. Pedersen. A Comparison of Classifiers and Document Representa-
tions for the Routing Problem. ACM SIGIR Conference, 1995.
[78] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys,
34(1):1–47, 2002.
[79] B. Settles. Active Learning, Morgan and Claypool, 2012.
[80] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling
tasks, Proceedings of the Conference on Empirical Methods in Natural Language Process-
ing (EMNLP), pages 1069–1078, 2008.
[81] H. Seung, M. Opper, and H. Sompolinsky. Query by Committee. Fifth Annual Workshop on
Computational Learning Theory, 1992.
[82] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining,
VLDB Conference, pages 544–555, 1996.
[83] T. Soukop and I. Davidson. Visual Data Mining: Techniques and Tools for Data Visualization,
Wiley, 2002.
[84] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2005.
[85] V. Vapnik. The Nature of Statistical Learning Theory, Springer, New York, 1995.
[86] H. Wang, W. Fan, P. Yu, and J. Han. Mining Concept-Drifting Data Streams with Ensemble
Classifiers, KDD Conference, 2003.
[87] T. White. Hadoop: The Definitive Guide. Yahoo! Press, 2011.
[88] E. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting.
SDAIR, pages 317–332, 1995.
[89] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weight-
ing methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–
5):273–314, 1997.
[91] Z. Xing and J. Pei, and E. Keogh. A brief survey on sequence classification. SIGKDD Explo-
rations, 12(1):40–48, 2010.
[92] L. Yang. Distance Metric Learning: A Comprehensive Survey, 2006. http://www.cs.cmu.
edu/~liuy/frame_survey_v2.pdf
[93] M. J. Zaki and C. Aggarwal. XRules: A Structural Classifier for XML Data, ACM KDD Con-
ference, 2003.
[94] B. Zenko. Is combining classifiers better than selecting the best one? Machine Learning,
54(3):255–273, 2004.
[95] Y. Zhu, S. J. Pan, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Heterogeneous Transfer Learning
for Image Classification. Special Track on AI and the Web, associated with The Twenty-Fourth
AAAI Conference on Artificial Intelligence, 2010.
[96] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning, Morgan and Claypool,
2009.
Chapter 2
Feature Selection for Classification: A Review
Jiliang Tang
Arizona State University
Tempe, AZ
[email protected]
Salem Alelyani
Arizona State University
Tempe, AZ
[email protected]
Huan Liu
Arizona State University
Tempe, AZ
[email protected]
2.1 Introduction ... 38
2.1.1 Data Classification ... 39
2.1.2 Feature Selection ... 40
2.1.3 Feature Selection for Classification ... 42
2.2 Algorithms for Flat Features ... 43
2.2.1 Filter Models ... 44
2.2.2 Wrapper Models ... 46
2.2.3 Embedded Models ... 47
2.3 Algorithms for Structured Features ... 49
2.3.1 Features with Group Structure ... 50
2.3.2 Features with Tree Structure ... 51
2.3.3 Features with Graph Structure ... 53
2.4 Algorithms for Streaming Features ... 55
2.4.1 The Grafting Algorithm ... 56
2.4.2 The Alpha-Investing Algorithm ... 56
2.4.3 The Online Streaming Feature Selection Algorithm ... 57
2.5 Discussions and Challenges ... 57
2.5.1 Scalability ... 57
2.5.2 Stability ... 58
2.5.3 Linked Data ... 58
Acknowledgments ... 59
Bibliography ... 60
2.1 Introduction
Nowadays, the growth of high-throughput technologies has resulted in exponential growth
in the harvested data with respect to both dimensionality and sample size. This growth trend
for the UCI machine learning repository is shown in Figure 2.1. Efficient and effective management
of such data becomes increasingly challenging, and manual management of these datasets
has long been impractical. Therefore, data mining and machine learning techniques were developed to
automatically discover knowledge and recognize patterns from these data.
However, the collected data are usually associated with a high level of noise. There are many
causes of noise in such data, among which imperfections in the technology that collected
the data and the nature of the data source itself are two major ones. For example, in the medical imaging
domain, any deficiency in the imaging device is reflected as noise in later processing. This
kind of noise is caused by the device itself. The development of social media has changed the role of
online users from traditional content consumers to both content creators and consumers. The quality
of social media data naturally ranges from excellent content to spam or abusive content. Meanwhile,
social media data are usually informally written and suffer from grammatical mistakes, misspellings,
and improper punctuation. Undoubtedly, extracting useful knowledge and patterns from such huge
and noisy data is a challenging task.
Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrele-
vant) and redundant features. Dimensionality reduction techniques can be categorized mainly into
feature extraction and feature selection. Feature extraction methods project features into a new fea-
ture space with lower dimensionality and the new constructed features are usually combinations of
original features. Examples of feature extraction techniques include Principal Component Analysis
(PCA), Linear Discriminant Analysis (LDA), and Canonical Correlation Analysis (CCA). On the
other hand, the feature selection approaches aim to select a small subset of features that minimize
redundancy and maximize relevance to the target such as the class labels in classification. Repre-
sentative feature selection techniques include Information Gain, Relief, Fisher Score, and Lasso.
Both feature extraction and feature selection are capable of improving learning performance,
lowering computational complexity, building better generalizable models, and decreasing required
storage. Feature extraction maps the original feature space to a new feature space with lower
dimensionality by combining the original features. It is difficult to link the new features back to the original
ones. Therefore, further analysis of the new features is problematic, since there
is no physical meaning for the transformed features obtained from feature extraction techniques.
Meanwhile, feature selection selects a subset of features from the original feature set without any
transformation, and maintains the physical meanings of the original features. In this sense, feature
selection is superior in terms of better readability and interpretability. This property has its signifi-
cance in many practical applications such as finding relevant genes to a specific disease and building
a sentiment lexicon for sentiment analysis. Typically feature selection and feature extraction are pre-
sented separately. Via sparse learning such as ℓ1 regularization, feature extraction (transformation)
methods can be converted into feature selection methods [48].
For the classification problem, feature selection aims to select a subset of highly discriminative
features. In other words, it selects features that are capable of discriminating samples that belong
to different classes. For the problem of feature selection for classification, due to the availability of
label information, the relevance of features is assessed as the capability of distinguishing different
classes. For example, a feature fi is said to be relevant to a class c j if fi and c j are highly correlated.
In the following subsections, we will review the literature of data classification in Section (2.1.1),
followed by general discussions about feature selection models in Section (2.1.2) and feature selec-
tion for classification in Section (2.1.3).
[Plots omitted: (a) “UCI ML Repository Number of Attributes Growth,” log number of attributes versus year; (b) “UCI ML Repository Sample Size Growth,” log sample size versus year, 1985–2010.]
FIGURE 2.1: Plot (a) shows the dimensionality growth trend in the UCI Machine Learning Repos-
itory from the mid 1980s to 2012 while (b) shows the growth in the sample size for the same period.
[Figure omitted: a flow diagram of the classification process, in which feature generation produces features from the training set, the features and label information feed the learning algorithm to train the classifier, and the trained classifier predicts labels for feature-represented test data.]
In the training phase, the learning algorithm will utilize the label information as well as the data itself to learn a map function f (or a classifier)
from features to labels, i.e., a function that maps the feature representation of an instance to its class label.
In the prediction phase, data is represented by the feature set extracted in the training process,
and then the map function (or the classifier) learned from the training phase will perform on the
feature represented data to predict the labels. Note that the feature set used in the training phase
should be the same as that in the prediction phase.
There are many classification methods in the literature. These methods can be categorized
broadly into linear classifiers, support vector machines, decision trees, and neural networks. A
linear classifier makes a classification decision based on the value of a linear combination of the
features. Examples of linear classifiers include Fisher’s linear discriminant, logistic regression, the
naive Bayes classifier, and so on. Intuitively, a good separation is achieved by the hyperplane that
has the largest distance to the nearest training data point of any class (so-called functional margin),
since in general the larger the margin the lower the generalization error of the classifier. Therefore,
the support vector machine constructs a hyperplane or set of hyperplanes by maximizing the mar-
gin. In decision trees, a tree can be learned by splitting the source set into subsets based on a feature
value test. This process is repeated on each derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a node has all the same values of the
target feature, or when splitting no longer adds value to the predictions.
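A compact sketch of this recursive partitioning is shown below: the data set is repeatedly split on a single feature-value test, and the recursion stops when a node is pure or a depth limit is reached. The misclassification-count splitting criterion, the depth limit, and the toy data are illustrative assumptions.

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3):
    # Stop when all labels at the node agree or the depth limit is reached.
    if len(set(y)) == 1 or depth == max_depth:
        return {"leaf": int(np.bincount(y).argmax())}
    best = None
    for j in range(X.shape[1]):                   # candidate feature-value tests
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            err = sum(np.sum(part != np.bincount(part).argmax())
                      for part in (y[left], y[~left]))
            if best is None or err < best[0]:
                best = (err, j, t, left)
    if best is None:
        return {"leaf": int(np.bincount(y).argmax())}
    _, j, t, left = best
    return {"feature": j, "threshold": float(t),
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}

def predict(tree, x):
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Toy data: one clean split on the first feature separates the classes.
X = np.array([[1, 5], [2, 4], [3, 7], [6, 2], [7, 3], [8, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
tree = build_tree(X, y)
print([predict(tree, row) for row in X])   # recovers the training labels here
```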
serious challenges to existing learning methods [39], i.e., the curse of dimensionality [21]. With
the presence of a large number of features, a learning model tends to overfit, resulting in the de-
generation of performance. To address the problem of the curse of dimensionality, dimensionality
reduction techniques have been studied, which is an important branch in the machine learning and
data mining research area. Feature selection is a widely employed technique for reducing dimen-
sionality among practitioners. It aims to choose a small subset of the relevant features from the
original ones according to certain relevance evaluation criterion, which usually leads to better learn-
ing performance (e.g., higher learning accuracy for classification), lower computational cost, and
better model interpretability.
According to whether the training set is labeled or not, feature selection algorithms can be
categorized into supervised [61, 68], unsupervised [13, 51], and semi-supervised feature selec-
tion [71, 77]. Supervised feature selection methods can further be broadly categorized into filter
models, wrapper models, and embedded models. The filter model separates feature selection from
classifier learning so that the bias of a learning algorithm does not interact with the bias of a feature
selection algorithm. It relies on measures of the general characteristics of the training data such as
distance, consistency, dependency, information, and correlation. Relief [60], Fisher score [11], and
Information Gain based methods [52] are among the most representative algorithms of the filter
model. The wrapper model uses the predictive accuracy of a predetermined learning algorithm to
determine the quality of selected features. These methods are prohibitively expensive to run for data
with a large number of features. Due to these shortcomings in each model, the embedded model
was proposed to bridge the gap between the filter and wrapper models. First, it incorporates the
statistical criteria, as the filter model does, to select several candidate feature subsets with a given
cardinality. Second, it chooses the subset with the highest classification accuracy [40]. Thus, the em-
bedded model usually achieves both comparable accuracy to the wrapper and comparable efficiency
to the filter model. The embedded model performs feature selection in the learning time. In other
words, it achieves model fitting and feature selection simultaneously [15,54]. Many researchers also
paid attention to developing unsupervised feature selection. Unsupervised feature selection is a less
constrained search problem without class labels, depending on clustering quality measures [12],
and can result in many equally valid feature subsets. With high-dimensional data, it is unlikely
to recover the relevant features without considering additional constraints. Another key difficulty
is how to objectively measure the results of feature selection [12]. A comprehensive review about
unsupervised feature selection can be found in [1]. Supervised feature selection assesses the
relevance of features guided by the label information, but a good selector needs enough labeled data,
which is time-consuming to obtain. While unsupervised feature selection works with unlabeled data, it is
difficult to evaluate the relevance of features. It is common to have a data set with huge dimension-
ality but small labeled-sample size. High-dimensional data with small labeled samples permits too
large a hypothesis space yet with too few constraints (labeled instances). The combination of the
two data characteristics manifests a new research challenge. Under the assumption that labeled and
unlabeled data are sampled from the same population generated by target concept, semi-supervised
feature selection makes use of both labeled and unlabeled data to estimate feature relevance [77].
Feature weighting is thought of as a generalization of feature selection [69]. In feature selec-
tion, a feature is assigned a binary weight, where 1 means the feature is selected and 0 otherwise.
However, feature weighting assigns a value, usually in the interval [0,1] or [-1,1], to each feature.
The greater this value is, the more salient the feature will be. Most feature weighting algorithms
assign a unified (global) weight to each feature over all instances. However, the relative importance,
relevance, and noise in the different dimensions may vary significantly with data locality. There are
local feature selection algorithms where the selection of features is done specifically for a test instance, which is common in lazy learning algorithms such as kNN [9, 22]. The idea is that feature
selection or weighting is done at classification time (rather than at training time), because knowledge
of the test instance sharpens the ability to select features.
[Figure: a general feature selection framework with components Label Information, Feature Selection, Learning Algorithm, and Classifier.]
Typically, a feature selection method consists of four basic steps [40], namely, subset generation,
subset evaluation, stopping criterion, and result validation. In the first step, a candidate feature subset
will be chosen based on a given search strategy, which is sent, in the second step, to be evaluated
according to a certain evaluation criterion. The subset that best fits the evaluation criterion will be
chosen from all the candidates that have been evaluated once the stopping criterion is met. In the
final step, the chosen subset will be validated using domain knowledge or a validation set.
FIGURE 2.4: A classification of algorithms of feature selection for classification.
• the resulting class distribution, given only the values for the selected features, is as close as
possible to the original class distribution, given all features.
Ideally, feature selection methods search through the subsets of features and try to find the best one among the competing 2^m candidate subsets according to some evaluation function [8]. However, this procedure is exhaustive, since it examines every subset to guarantee finding the best one; it may be too costly and practically prohibitive, even for a medium-sized feature set size (m). Other methods based on
heuristic or random search methods attempt to reduce computational complexity by compromising
performance. These methods need a stopping criterion to prevent an exhaustive search of subsets.
In this chapter, we divide feature selection for classification into three families according to
the feature structure — methods for flat features, methods for structured features, and methods for
streaming features as demonstrated in Figure 2.4. In the following sections, we will review these
three groups with representative algorithms in detail.
Before going to the next sections, we introduce notations we adopt in this book chapter. Assume
that F = { f1 , f2 , . . . , fm } and C = {c1 , c2 , . . . , cK } denote the feature set and the class label set where
m and K are the numbers of features and labels, respectively. X = {x_1, x_2, . . . , x_n} ∈ R^{m×n} is the data
where n is the number of instances and the label information of the i-th instance xi is denoted as yi .
Fisher Score [11] evaluates the i-th feature by

S_i = \frac{\sum_{j=1}^{K} n_j (\mu_{ij} - \mu_i)^2}{\sum_{j=1}^{K} n_j \rho_{ij}^2},    (2.2)
where µi j and ρi j are the mean and the variance of the i-th feature in the j-th class, respectively, n j
is the number of instances in the j-th class, and µi is the mean of the i-th feature.
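As a rough illustration of Equation (2.2), the following NumPy sketch scores each feature by the ratio of between-class to within-class scatter; the row-wise data layout (instances as rows) and the variable names are our own choices rather than the chapter's notation.

import numpy as np

def fisher_score(X, y):
    """Fisher score per Equation (2.2).

    X : (n_samples, n_features) data matrix (rows are instances here,
        unlike the chapter's m-by-n convention).
    y : (n_samples,) integer class labels.
    Returns one score per feature; higher means more discriminative.
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)              # mu_i for every feature i
    numer = np.zeros(X.shape[1])
    denom = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        n_j = Xc.shape[0]                      # instances in class j
        numer += n_j * (Xc.mean(axis=0) - overall_mean) ** 2
        denom += n_j * Xc.var(axis=0)          # within-class variance term
    return numer / np.maximum(denom, 1e-12)    # guard against zero variance

# Example: rank features on toy data
# X = np.random.randn(100, 5); y = np.random.randint(0, 3, 100)
# ranking = np.argsort(fisher_score(X, y))[::-1]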
Fisher Score evaluates features individually; therefore, it cannot handle feature redundancy. Re-
cently, Gu et al. [17] proposed a generalized Fisher score to jointly select features, which aims to
find a subset of features that maximize the lower bound of traditional Fisher score and solve the
following problem:
\min_{p, W} \; \|W^\top \mathrm{diag}(p) X - G\|_F^2 + \gamma \|W\|_F^2, \quad \text{s.t. } p \in \{0, 1\}^m, \ \|p\|_1 = d,    (2.3)
where p is the feature selection vector, d is the number of features to select, and G is a special label
indicator matrix, as follows:
G(i, j) = \begin{cases} \sqrt{n/n_j} - \sqrt{n_j/n} & \text{if } x_i \in c_j, \\ -\sqrt{n_j/n} & \text{otherwise}. \end{cases}    (2.4)
Mutual Information Based Methods [37, 52, 74]: Due to its computational efficiency and
simple interpretation, information gain is one of the most popular feature selection methods. It is
used to measure the dependence between features and labels and calculates the information gain
between the i-th feature fi and the class labels C as
IG(f_i, C) = H(f_i) - H(f_i | C),    (2.5)

where H(·) denotes entropy and H(f_i | C) is the conditional entropy of f_i given the class labels C.
In information gain, a feature is relevant if it has a high information gain. Features are selected
in a univariate way, therefore, information gain cannot handle redundant features. In [74], a fast
filter method FCBF based on mutual information was proposed to identify relevant features as well
as redundancy among relevant features and measure feature-class and feature-feature correlation.
Given a threshold ρ, FCBF first selects a set of features S that is highly correlated to the class with SU ≥ ρ, where SU is the symmetrical uncertainty defined as

SU(f_i, C) = 2 \frac{IG(f_i, C)}{H(f_i) + H(C)}.    (2.7)
A feature f_i is called predominant iff SU(f_i, c_k) ≥ ρ and there is no f_j (f_j ∈ S, j ≠ i) such that SU(j, i) ≥ SU(i, c_k). f_j is a redundant feature to f_i if SU(j, i) ≥ SU(i, c_k). The set of features redundant to f_i is denoted as S_{P_i}, which is further split into S_{P_i}^+ and S_{P_i}^-, containing the redundant features with SU(j, c_k) > SU(i, c_k) and SU(j, c_k) ≤ SU(i, c_k), respectively. Finally, FCBF applies three heuristics on S_{P_i}, S_{P_i}^+, and S_{P_i}^- to remove the redundant features and keep the features that are most relevant to the class. FCBF provides an effective way to handle feature redundancy in feature selection.
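For concreteness, the sketch below estimates the information gain of Equation (2.5) and the symmetrical uncertainty of Equation (2.7) from empirical frequencies of discrete features; the FCBF redundancy-elimination heuristics themselves are omitted, and the helper names are illustrative only.

import numpy as np

def entropy(values):
    """Empirical entropy H(.) of a discrete variable, in bits."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(f, y):
    """IG(f, C) = H(f) - H(f | C), as in Equation (2.5)."""
    cond = 0.0
    for c, n_c in zip(*np.unique(y, return_counts=True)):
        cond += (n_c / len(y)) * entropy(f[y == c])
    return entropy(f) - cond

def symmetrical_uncertainty(f, y):
    """SU(f, C) = 2 * IG(f, C) / (H(f) + H(C)), as in Equation (2.7)."""
    denom = entropy(f) + entropy(y)
    return 2.0 * information_gain(f, y) / denom if denom > 0 else 0.0

# FCBF-style relevance filtering: keep features whose SU with the class
# exceeds a threshold rho (redundancy elimination among them is omitted here).
# relevant = [j for j in range(X.shape[1])
#             if symmetrical_uncertainty(X[:, j], y) >= rho]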
Minimum-Redundancy-Maximum-Relevance (mRmR) is also a mutual information based
method and it selects features according to the maximal statistical dependency criterion [52]. Due
to the difficulty in directly implementing the maximal dependency condition, mRmR is an approx-
imation to maximizing the dependency between the joint distribution of the selected features and
the classification variable. Minimize Redundancy for discrete features and continuous features is
defined as:
for Discrete Features: \min W_I, \quad W_I = \frac{1}{|S|^2} \sum_{i,j \in S} I(i, j),
for Continuous Features: \min W_c, \quad W_c = \frac{1}{|S|^2} \sum_{i,j \in S} |C(i, j)|,    (2.8)
where I(i, j) and C(i, j) are mutual information and the correlation between fi and f j , respectively.
Meanwhile, Maximize Relevance for discrete features and continuous features is defined as:
for Discrete Features: \max V_I, \quad V_I = \frac{1}{|S|^2} \sum_{i \in S} I(h, i),
for Continuous Features: \max V_c, \quad V_c = \frac{1}{|S|^2} \sum_{i \in S} F(i, h),    (2.9)
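In practice, mRmR is usually implemented greedily: at each step the unselected feature with the largest relevance minus average redundancy to the already selected features is added. The sketch below assumes discrete features, treats the class variable as playing the role of h in Equation (2.9), and uses scikit-learn's mutual information estimators as one convenient (not prescribed) choice.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, num_features):
    """Greedy mRMR: maximize relevance I(f, C) minus mean redundancy I(f, f').

    Assumes X holds discrete features (rows are instances), so that
    mutual_info_score is a sensible pairwise estimator.
    """
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, discrete_features=True)
    selected, remaining = [], list(range(n_features))
    while len(selected) < num_features and remaining:
        best, best_score = None, -np.inf
        for j in remaining:
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected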
Relief [34] evaluates the score S_i of the i-th feature for a two-class problem as

S_i = \frac{1}{2} \sum_{k=1}^{n} \left( d(X_{ik} - X_{iM_k}) - d(X_{ik} - X_{iH_k}) \right),    (2.10)
where Mk denotes the values on the i-th feature of the nearest instances to xk with the same class
label, while Hk denotes the values on the i-th feature of the nearest instances to xk with different
class labels. d(·) is a distance measure. To handle multi-class problems, Equation (2.10) is extended as

S_i = \frac{1}{K} \sum_{k=1}^{n} \left( -\frac{1}{m_k} \sum_{x_j \in M_k} d(X_{ik} - X_{ij}) + \sum_{y \neq y_k} \frac{1}{h_{ky}} \frac{p(y)}{1 - p(y)} \sum_{x_j \in H_k^y} d(X_{ik} - X_{ij}) \right),    (2.11)
where M_k and H_k^y denote the sets of nearest points to x_k from the same class and from class y, with sizes m_k and h_{ky} respectively, and p(y) is the probability of an instance from the class
y. In [56], the authors related the relevance evaluation criterion of ReliefF to the hypothesis of
margin maximization, which explains why the algorithm provides superior performance in many
applications.
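A minimal two-class Relief-style sketch in the spirit of Equation (2.10) is shown below; using Euclidean distance to locate neighbors, visiting every instance rather than a random sample, and following the standard Relief sign convention (reward separation from the nearest miss) are all implementation choices of this example rather than requirements of the chapter.

import numpy as np

def relief_scores(X, y):
    """Two-class Relief-style feature scores (cf. Equation (2.10)).

    For each instance, the nearest neighbor of the same class (hit) and of
    the other class (miss) are found; features on which the instance differs
    more from its miss than from its hit receive higher scores.
    """
    n, m = X.shape
    scores = np.zeros(m)
    for k in range(n):
        dists = np.linalg.norm(X - X[k], axis=1)
        dists[k] = np.inf                      # exclude the instance itself
        same = np.where(y == y[k])[0]
        diff = np.where(y != y[k])[0]
        hit = same[np.argmin(dists[same])]
        miss = diff[np.argmin(dists[diff])]
        scores += np.abs(X[k] - X[miss]) - np.abs(X[k] - X[hit])
    return scores / (2.0 * n)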
FIGURE 2.5: A general framework for wrapper methods of feature selection for classification.
• Step 2: evaluating the selected subset of features by the performance of the classifier,
• Step 3: repeating Step 1 and Step 2 until the desired quality is reached.
A general framework for wrapper methods of feature selection for classification [36] is shown
in Figure 2.5, and it contains three major components:
• feature selection search — how to search the subset of features from all possible feature
subsets,
• feature evaluation — how to evaluate the performance of the chosen classifier, and
• induction algorithm.
In wrapper models, the predefined classifier works as a black box. The feature search component
will produce a set of features and the feature evaluation component will use the classifier to estimate
the performance, which will be returned to the feature search component for the next iteration
of feature subset selection. The feature set with the highest estimated value will be chosen as the
final set to learn the classifier. The resulting classifier is then evaluated on an independent testing
set that is not used during the training process [36].
The size of the search space for m features is O(2^m); thus an exhaustive search is impractical unless
m is small. Actually the problem is known to be NP-hard [18]. A wide range of search strategies
can be used, including hill-climbing, best-first, branch-and-bound, and genetic algorithms [18]. The
hill-climbing strategy expands the current set and moves to the subset with the highest accuracy,
terminating when no subset improves over the current set. The best-first strategy is to select the
most promising set that has not already been expanded and is a more robust method than hill-
climbing [36]. Greedy search strategies seem to be particularly computationally advantageous and
robust against overfitting. They come in two flavors — forward selection and backward elimination.
Forward selection refers to a search that begins at the empty set of features and features are progres-
sively incorporated into larger and larger subsets, whereas backward elimination begins with the full
set of features and progressively eliminates the least promising ones. The search component aims to
find a subset of features with the highest evaluation, using a heuristic function to guide it. Since we
do not know the actual accuracy of the classifier, we use accuracy estimation as both the heuristic function and the evaluation function in the feature evaluation phase. Performance assessments are
usually done using a validation set or by cross-validation.
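A hedged sketch of a forward-selection wrapper is given below, using cross-validated accuracy as the heuristic function; the choice of k-NN as the predefined classifier and the stop-on-no-improvement rule are illustrative assumptions of this example.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, estimator=None, cv=5):
    """Greedy forward wrapper: add the feature that most improves CV accuracy,
    stopping when no single addition improves the estimate."""
    if estimator is None:
        estimator = KNeighborsClassifier(n_neighbors=5)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        trial_scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        j_best = max(trial_scores, key=trial_scores.get)
        if trial_scores[j_best] <= best_score:
            break                              # no improvement: stop searching
        selected.append(j_best)
        remaining.remove(j_best)
        best_score = trial_scores[j_best]
    return selected, best_score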
Wrapper models obtain better predictive accuracy estimates than filter models [36,38]. However,
wrapper models are very computationally expensive compared to filter models. They produce better performance for the predefined classifier, since the features are selected to maximize its quality; as a result, the selected subset of features is inevitably biased toward the predefined classifier.
where c(·) is the classification objective function, penalty(w) is a regularization term, and α is the
regularization parameter controlling the trade-off between the c(·) and the penalty. Popular choices
of c(·) include quadratic loss such as least squares, hinge loss such as the ℓ1-SVM [5], and logistic loss
such as BlogReg [15].
• Quadratic loss:

  c(w, X) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2,    (2.13)

• Hinge loss:

  c(w, X) = \sum_{i=1}^{n} \max(0, 1 - y_i w^\top x_i),    (2.14)

• Logistic loss:

  c(w, X) = \sum_{i=1}^{n} \log(1 + \exp(-y_i (w^\top x_i + b))).    (2.15)
An important property of the ℓ1 regularization is that it can generate an estimate of w [64] with exactly zero coefficients. In other words, there are zero entries in w, which means that the corre-
sponding features are eliminated during the classifier learning process. Therefore, it can be used for
feature selection.
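As one concrete instance of this idea, the sketch below fits an ℓ1-penalized logistic regression with scikit-learn and keeps the features whose coefficients are non-zero; the solver and the value of the regularization strength C are assumptions of the example, not of the chapter.

import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_feature_selection(X, y, C=0.1):
    """Fit an l1-regularized logistic regression and return the indices of
    features with non-zero coefficients (smaller C means stronger sparsity)."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    return np.flatnonzero(np.any(clf.coef_ != 0, axis=0))

# selected = l1_feature_selection(X_train, y_train, C=0.05)
# X_train_reduced = X_train[:, selected]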
Adaptive Lasso [80]: The Lasso feature selection is consistent if the underlying model satisfies
a non-trivial condition, which may not be satisfied in practice [76]. Meanwhile the Lasso shrink-
age produces biased estimates for the large coefficients; thus, it could be suboptimal in terms of
estimation risk [14].
The adaptive Lasso was proposed to improve the performance of the Lasso as [80]

penalty(w) = \sum_{i=1}^{m} \frac{1}{b_i} |w_i|,    (2.17)
where the only difference between Lasso and adaptive Lasso is that the latter employs a weighted
adjustment bi for each coefficient wi . The article shows that the adaptive Lasso enjoys the oracle
properties and can be solved by the same efficient algorithm for solving the Lasso.
The article also proves that for linear models with n ≫ m, the adaptive Lasso estimate is selection consistent under very general conditions if b_i is a √n-consistent estimate of w_i. Complementary to this proof, [24] shows that when m ≫ n for linear models, the adaptive Lasso estimate is also selection consistent under a partial orthogonality condition, in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients.
with 0 < γ ≤ 1 and λ ≥ 1. The elastic net is a mixture of bridge regularization with different values
of γ. [81] proposes γ = 1 and λ = 1, which is extended to γ < 1 and λ = 1 by [43].
Through the loss function c(w, X), the above-mentioned methods control the size of residuals.
An alternative way to obtain a sparse estimation of w is Dantzig selector, which is based on the
normal score equations and controls the correlation of residuals with X as [6],
\min_{w} \|w\|_1, \quad \text{s.t. } \|X(y - X^\top w)\|_\infty \leq \lambda,    (2.20)

where \|\cdot\|_\infty is the ℓ∞-norm of a vector, and the Dantzig selector was designed for linear regression models.
Candes and Tao have provided strong theoretical justification for this performance by establishing
sharp non-asymptotic bounds on the ℓ2-error in the estimated coefficients, and showed that the error
is within a factor of log(p) of the error that would be achieved if the locations of the non-zero
coefficients were known [6, 28]. Strong theoretical results show that LASSO and Dantzig selector
are closely related [28].
where G denotes the structure of features, and α controls the trade-off between data fitting and
regularization. Equation (2.21) leads to sparse classifiers, which lend themselves particularly well to interpretation; this is often of primary importance in many applications such as biology or the social sciences [75].
where \|\cdot\|_q is the ℓq-norm with q > 1, and h_i is the weight for the i-th group. Lasso does not take
group structure information into account and does not support group selection, while group Lasso
can select or not select a group of features as a whole.
Once a group is selected by the group Lasso, all features in the group will be selected. For
certain applications, it is also desirable to select features from the selected groups, i.e., performing
simultaneous group selection and feature selection. The sparse group Lasso takes advantage of both Lasso and group Lasso, and it produces a solution with simultaneous between- and within-group sparsity. The sparse group Lasso regularization is based on a composition of the ℓq,1-norm and the ℓ1-norm,

penalty(w, G) = \alpha \|w\|_1 + (1 - \alpha) \sum_{i=1}^{k} h_i \|w_{G_i}\|_q,    (2.23)
where α ∈ [0, 1], the first term controls the sparsity in the feature level, and the second term controls
the group selection.
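To make Equation (2.23) concrete, the sketch below evaluates the sparse group Lasso penalty for a given coefficient vector and a disjoint grouping; setting the group weights h_i to the square root of the group size and using q = 2 are common but illustrative choices, not requirements of the chapter.

import numpy as np

def sparse_group_lasso_penalty(w, groups, alpha=0.5):
    """penalty(w, G) = alpha * ||w||_1 + (1 - alpha) * sum_i h_i * ||w_Gi||_2,
    following Equation (2.23). `groups` is a list of index arrays; the group
    weights h_i are set to sqrt(|G_i|) here as an illustrative choice."""
    l1_term = np.sum(np.abs(w))
    group_term = sum(np.sqrt(len(g)) * np.linalg.norm(w[g]) for g in groups)
    return alpha * l1_term + (1.0 - alpha) * group_term

# Example: 6 coefficients split into two groups of three
# w = np.array([0.0, 0.0, 0.0, 1.2, -0.4, 0.7])
# groups = [np.arange(0, 3), np.arange(3, 6)]
# sparse_group_lasso_penalty(w, groups, alpha=0.3)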
Figure 2.6 demonstrates the different solutions among Lasso, group Lasso, and sparse group
Lasso. In the figure, features form four groups {G1 , G2 , G3 , G4 }. Light color denotes the corre-
sponding feature of the cell with zero coefficients and dark color indicates non-zero coefficients.
From the figure, we observe that
• Lasso does not consider the group structure and selects a subset of features among all groups;
• group Lasso can perform group selection and select a subset of groups. Once the group is
selected, all features in this group are selected; and
• sparse group Lasso can select groups and features in the selected groups at the same time.
In some applications, the groups overlap. One motivating example is the use of biologically
meaningful gene/protein sets (groups) given by [73]. If the proteins/genes either appear in the same
pathway, or are semantically related in terms of gene ontology (GO) hierarchy, or are related from
FIGURE 2.6: Illustration of Lasso, group Lasso, and sparse group Lasso. Features are grouped into four disjoint groups {G1, G2, G3, G4}. Each cell denotes a feature; a light cell represents a zero coefficient and a dark cell a non-zero coefficient.
gene set enrichment analysis (GSEA), they are related and assigned to the same groups. For exam-
ple, the canonical pathway in MSigDB has provided 639 groups of genes. It has been shown that the
group (of proteins/genes) markers are more reproducible than individual protein/gene markers and
modeling such group information as prior knowledge can improve classification performance [7].
Groups may overlap: one protein/gene may belong to multiple groups. In these situations, group Lasso does not handle overlapping groups correctly, since it assumes that a given coefficient belongs to only one group. Algorithms investigating overlapping groups are proposed in [27, 30, 33, 42]. A general
overlapping group Lasso regularization is similar to that for group Lasso regularization in Equa-
tion (2.23)
penalty(w, G) = \alpha \|w\|_1 + (1 - \alpha) \sum_{i=1}^{k} h_i \|w_{G_i}\|_q,    (2.24)
however, groups for overlapping group Lasso regularization may overlap, while groups in group
Lasso are disjoint.
[FIGURE 2.7: A sample index tree of depth 3 over eight features f1, ..., f8.]
Figure 2.7 shows a sample index tree of depth 3 with 8 features, where the nodes G_j^i (the j-th node at depth i) are defined as
G_1^0 = {f1, f2, f3, f4, f5, f6, f7, f8},
G_1^1 = {f1, f2}, G_2^1 = {f3, f4, f5, f6, f7}, G_3^1 = {f8},
G_1^2 = {f1, f2}, G_2^2 = {f3, f4}, G_3^2 = {f5, f6, f7}.
• the nodes from the same depth level do not overlap; and
• the index set of a child node is a subset of that of its parent node.
With the definition of the index tree, the tree-guided group Lasso regularization is,
penalty(w, G) = \sum_{i=0}^{d} \sum_{j=1}^{n_i} h_j^i \|w_{G_j^i}\|_q,    (2.25)
Since any parent node overlaps with its child nodes, if a specific node is not selected (i.e., its
corresponding model coefficient is zero), then none of its child nodes will be selected. For example, in Figure 2.7, if G_2^1 is not selected, neither G_2^2 nor G_3^2 will be selected, indicating that features {f3, f4, f5, f6, f7} will not be selected. Note that the tree-structured group Lasso is a special case of
the overlapping group Lasso with a specific tree structure.
FIGURE 2.8: An illustration of the graph of 7 features { f1 , f2 , . . . , f7 } and its corresponding repre-
sentation A.
which is equivalent to
where L = D − A is the Laplacian matrix and D is a diagonal matrix with D_ii = \sum_{j=1}^{m} A_{ij}. The Laplacian matrix is positive semi-definite and captures the underlying local geometric structure of the data. When L is an identity matrix, w^\top L w = \|w\|_2^2 and Equation (2.27) reduces to the elastic net penalty [81]. Because w^\top L w is both convex and differentiable, existing efficient algorithms for solving the Lasso can be applied to solve Equation (2.27).
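A small sketch of this quadratic graph penalty is given below; it builds L = D − A from an unsigned adjacency matrix and (in the commented sanity check) verifies the edge-wise identity w^T L w = \sum_{i<j} A_{ij}(w_i − w_j)^2, which is why connected features are encouraged to take similar coefficients. The function name is ours.

import numpy as np

def laplacian_penalty(w, A):
    """Quadratic graph penalty w^T L w with L = D - A for a symmetric,
    unsigned adjacency matrix A. Equivalently, sum_{i<j} A_ij (w_i - w_j)^2,
    so connected features are pushed toward similar coefficients."""
    L = np.diag(A.sum(axis=1)) - A
    return float(w @ L @ w)

# Sanity check of the edge-wise identity on a small graph:
# A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
# w = np.array([0.5, -0.2, 0.1])
# assert np.isclose(laplacian_penalty(w, A),
#                   sum(A[i, j] * (w[i] - w[j]) ** 2
#                       for i in range(3) for j in range(i + 1, 3)))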
Equation (2.27) assumes that the feature graph is unsigned, and encourages positive correlation
between the values of coefficients for the features connected by an edge in the unsigned graph.
However, two features might be negatively correlated. In this situation, the feature graph is signed,
with both positive and negative edges. To perform feature selection with a signed feature graph,
GFlasso employs a different 1 regularization over a graph [32],
where r_{ij} is the correlation between the two features. When f_i and f_j are positively connected, r_{ij} > 0, i.e., with a positive edge, penalty(w, G) forces the coefficients w_i and w_j to be similar; when f_i and f_j are negatively connected, r_{ij} < 0, i.e., with a negative edge, the penalty forces w_i and w_j to be dissimilar. Due to possible graph misspecification, GFlasso may introduce additional estimation
bias in the learning process. For example, additional bias may occur when the sign of the edge
between fi and f j is inaccurately estimated.
In [72], the authors introduced several alternative formulations for graph Lasso. One of the
formulations is defined as
where a pairwise ℓ∞ regularization is used to force the coefficients to be equal and the grouping constraints are only put on connected nodes with A_{ij} = 1. The ℓ1-norm of w encourages sparseness,
and max (|wi |, |w j |) will penalize the larger coefficients, which can be decomposed as
\max(|w_i|, |w_j|) = \frac{1}{2} \left( |w_i + w_j| + |w_i - w_j| \right),    (2.30)

which can be further represented as |u^\top w| + |v^\top w|, where u and v are two vectors with only two non-zero entries, i.e., u_i = u_j = 1/2 and v_i = -v_j = 1/2.
The GOSCAR formulation is closely related to OSCAR in [4]. However, OSCAR assumes that
all features form a complete graph; that is, OSCAR works only when all entries of A are 1, while GOSCAR can work with an arbitrary undirected graph where A is any symmetric matrix. In this sense, GOSCAR is more general. Meanwhile, the formulation
for GOSCAR is much more challenging to solve than that of OSCAR.
The limitation of the Laplacian Lasso — that the different signs of coefficients can introduce ad-
ditional penalty — can be overcome by the grouping penalty of GOSCAR. However, GOSCAR can
easily overpenalize the coefficient wi or w j due to the property of the max operator. The additional
penalty would result in biased estimation, especially for large coefficients. As mentioned above,
GFlasso will introduce estimation bias when the sign between wi and w j is wrongly estimated. This
motivates the following non-convex formulation for graph features,
where the grouping penalty \sum_{i,j} A_{ij} \big| |w_i| - |w_j| \big| controls only the magnitudes of the differences of coefficients while ignoring their signs over the graph. Via the ℓ1 regularization and grouping penalty,
feature grouping and selection are performed simultaneously where only large coefficients as well
as pairwise difference are shrunk [72].
For features with graph structure, a subset of highly connected features in the graph is likely to
be selected or not selected as a whole. For example, in Figure 2.8, { f5 , f6 , f7 } are selected, while
{ f1 , f2 , f3 , f4 } are not selected.
• Step 3: Determining whether to remove features from the set of currently selected features.
Different algorithms may have different implementations for Step 2 and Step 3, and next we
will review some representative methods in this category. Note that Step 3 is optional and some
streaming feature selection algorithms only implement Step 2.
When all features are available, penalty(w) penalizes all weights in w uniformly to achieve feature selection, which can be applied to streaming features with the grafting technique. In Equation (2.33), every non-zero weight w_j added to the model incurs a regularizer penalty of α|w_j|. Therefore, a feature is added to the model only when the reduction in the loss c(·) outweighs the regularizer penalty. The grafting technique will only take w_j away from zero if
\left| \frac{\partial c}{\partial w_j} \right| > \alpha,    (2.34)
otherwise the grafting technique will set the weight to zero (or exclude the feature).
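A minimal sketch of the grafting test in Equation (2.34) is shown below, assuming the quadratic loss of Equation (2.13) so that the gradient with respect to a still-zero weight has a closed form; the function and variable names are our own.

import numpy as np

def grafting_accepts(new_feature, residual, alpha):
    """Grafting test (cf. Equation (2.34)) for a streaming feature under the
    quadratic loss of Equation (2.13): with w_j currently zero, the gradient of
    the loss w.r.t. w_j is -2 * <f_j, residual>, so the feature is admitted only
    if |dc/dw_j| exceeds the l1 penalty parameter alpha."""
    grad = -2.0 * float(new_feature @ residual)
    return abs(grad) > alpha

# Streaming loop (schematic): as each feature f_j arrives, test it against the
# residual of the current model; accepted features trigger re-optimization.
# residual = y - X_selected @ w_selected
# if grafting_accepts(f_j, residual, alpha):
#     ...add f_j to the model and re-fit all non-zero weights...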
• Step 4: i = i + 1
2.5.1 Scalability
With the tremendous growth of dataset sizes, the scalability of current algorithms may be in jeopardy, especially in domains that require an online classifier. For example, data that cannot be loaded into memory require a single data scan, where a second pass is either unavailable or very expensive. Using feature selection methods for classification may reduce the issue
of scalability for classification. However, some of the current methods that involve feature selection
in the classification process require keeping full dimensionality in the memory. Furthermore, other
methods require an iterative process where each sample is visited more than once until convergence.
On the other hand, the scalability of feature selection algorithms is a big problem. Usually, they require a sufficient number of samples to obtain statistically adequate results. It is very hard to estimate a feature relevance score without considering the density around each sample. Some methods try to overcome this issue by memorizing only the important samples or a summary of the data. In conclusion,
we believe that the scalability of classification and feature selection methods should be given more
attention to keep pace with the growth and fast streaming of the data.
2.5.2 Stability
Algorithms of feature selection for classification are often evaluated through classification ac-
curacy. However, the stability of algorithms is also an important consideration when developing
feature selection methods. A motivating example is from bioinformatics: domain experts would like to see the same, or at least a similar, set of genes (i.e., features) selected each time they obtain new samples in the presence of a small amount of perturbation. Otherwise they will not trust the algorithm when they get different sets of features from datasets drawn for the same
problem. Due to its importance, stability of feature selection has drawn the attention of the feature
selection community. It is defined as the sensitivity of the selection process to data perturbation in
the training set. It is found that well-known feature selection methods can select features with very
low stability after perturbation is introduced to the training samples. In [2] the authors found that
even the underlying characteristics of data can greatly affect the stability of an algorithm. These
characteristics include dimensionality m, sample size n, and different data distribution across dif-
ferent folds, and the stability issue tends to be data dependent. Developing algorithms of feature
selection for classification with high classification accuracy and stability is still challenging.
Two attempts to handle linked data w.r.t. feature selection for classification are LinkedFS [62]
and FSNet [16]. FSNet works with networked data and is supervised, while LinkedFS works with
1 http://www.twitter.com/
2 https://www.facebook.com/
FIGURE 2.10: Typical linked social media data and its two representations.
social media data with social context and is semi-supervised. In LinkedFS, various relations (coPost, coFollowing, coFollowed, and Following) are extracted following social correlation theories.
LinkedFS significantly improves the performance of feature selection by incorporating these rela-
tions into feature selection. There are many issues needing further investigation for linked data, such
as handling noise, and incomplete and unlabeled linked social media data.
Acknowledgments
We thank Daria Bazzi for useful comments on the representation, and helpful discussions from
Yun Li and Charu Aggarwal about the content of this book chapter. This work is, in part, supported
by the National Science Foundation under Grant No. IIS-1217466.
Bibliography
[1] S. Alelyani, J. Tang, and H. Liu. Feature selection for clustering: A review. Data Clustering:
Algorithms and Applications, Editors: Charu Aggarwal and Chandan Reddy, CRC Press, 2013.
[2] S. Alelyani, L. Wang, and H. Liu. The effect of the characteristics of the dataset on the
selection stability. In Proceedings of the 23rd IEEE International Conference on Tools with
Artificial Intelligence, 2011.
[3] F. R. Bach. Consistency of the group Lasso and multiple kernel learning. The Journal of
Machine Learning Research, 9:1179–1225, 2008.
[4] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and
supervised clustering of predictors with oscar. Biometrics, 64(1):115–123, 2008.
[5] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support
vector machines. Proceedings of the Fifteenth International Conference on Machine Learning
(ICML ’98), 82–90. Morgan Kaufmann, 1998.
[6] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than
n. The Annals of Statistics, 35(6):2313–2351, 2007.
[7] H. Y. Chuang, E. Lee, Y.T. Liu, D. Lee, and T. Ideker. Network-based classification of breast
cancer metastasis. Molecular Systems Biology, 3(1), 2007.
[8] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Analysis, 1(1-4):131–
156, 1997.
[9] C. Domeniconi and D. Gunopulos. Local feature selection for classification. Computational
Methods, page 211, 2008.
[10] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley-Interscience, 2012.
[11] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. New York: John Wiley & Sons,
2nd edition, 2001.
[12] J.G. Dy and C.E. Brodley. Feature subset selection and order identification for unsupervised
learning. In In Proc. 17th International Conference on Machine Learning, pages 247–254.
Morgan Kaufmann, 2000.
[13] J.G. Dy and C.E. Brodley. Feature selection for unsupervised learning. The Journal of Machine
Learning Research, 5:845–889, 2004.
[14] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle prop-
erties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[15] G. C. Cawley, N. L. C. Talbot, and M. Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. In Neural Information Processing Systems, 2006.
[16] Q. Gu, Z. Li, and J. Han. Towards feature selection in network. In International Conference
on Information and Knowledge Management, 2011.
[17] Q. Gu, Z. Li, and J. Han. Generalized Fisher score for feature selection. arXiv preprint
arXiv:1202.3725, 2012.
[18] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of
Machine Learning Research, 3:1157–1182, 2003.
[19] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using
support vector machines. Machine learning, 46(1-3):389–422, 2002.
[20] M.A. Hall and L.A. Smith. Feature selection for machine learning: Comparing a correlation-
based filter approach to the wrapper. In Proceedings of the Twelfth International Florida
Artificial Intelligence Research Society Conference, volume 235, page 239, 1999.
[21] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
2001.
[22] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, 1996.
[23] J. Huang, J. L. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse
high-dimensional regression models. The Annals of Statistics, 36(2):587–613, 2008.
[24] J. Huang, S. Ma, and C. Zhang. Adaptive Lasso for sparse high-dimensional regression mod-
els. Statistica Sinica, 18(4):1603, 2008.
[25] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the
26th Annual International Conference on Machine Learning, pages 417–424. ACM, 2009.
[26] I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene selection
approaches in dna microarray domains. Artificial intelligence in Medicine, 31(2):91–103,
2004.
[27] L. Jacob, G. Obozinski, and J. Vert. Group Lasso with overlap and graph Lasso. In Proceedings
of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM,
2009.
[28] G. James, P. Radchenko, and J. Lv. Dasso: Connections between the Dantzig selector and
Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(1):127–
142, 2009.
[29] R. Jenatton, J. Audibert, and F. Bach. Structured variable selection with sparsity-inducing
norms. Journal of Machine Learning Research, 12:2777–2824, 2011.
[30] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical
dictionary learning. Journal of Machine Learning Research, 12:2297–2334, 2011.
[31] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational
learning. In International Conference on Machine Learning, pages 259–266, 2002.
[32] S. Kim and E. Xing. Statistical estimation of correlated genome associations to a quantitative
trait network. PLoS Genetics, 5(8):e1000587, 2009.
[33] S. Kim and E. Xing. Tree-guided group Lasso for multi-task regression with structured spar-
sity. In Proceedings of the 27th International Conference on Machine Learing, Haifa, Israel,
2010.
[34] K. Kira and L. Rendell. A practical approach to feature selection. In Proceedings of the Ninth
International Workshop on Machine Learning, pages 249–256. Morgan Kaufmann Publishers
Inc., 1992.
[35] K. Knight and W. Fu. Asymptotics for Lasso-type estimators. Annals of Statistics, 78(5):1356–
1378, 2000.
[36] R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-
2):273–324, 1997.
[37] D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the Interna-
tional Conference on Machine Learning, pages 284–292, 1996.
[38] H. Liu and H. Motoda. Feature selection for knowledge discovery and data mining, volume
454. Springer, 1998.
[39] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman and Hall/CRC
Press, 2007.
[40] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clus-
tering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491, 2005.
[41] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l 2, 1-norm minimization. In
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages
339–348. AUAI Press, 2009.
[42] J. Liu and J. Ye. Moreau-Yosida regularization for grouped tree structure learning. Advances
in Neural Information Processing Systems, 187:195–207, 2010.
[43] Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S. Meltzer, and M. Tan. Sparse logistic regression
with lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular
Biology, 6(1), 2007.
[44] B. Long, Z.M. Zhang, X. Wu, and P.S. Yu. Spectral clustering for multi-type relational data.
In Proceedings of the 23rd International Conference on Machine Learning, pages 585–592.
ACM, 2006.
[45] B. Long, Z.M. Zhang, and P.S. Yu. A probabilistic framework for relational clustering. In
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 470–479. ACM, 2007.
[46] S. Ma and J. Huang. Penalized feature selection and classification in bioinformatics. Briefings
in Bioinformatics, 9(5):392–403, 2008.
[47] S.A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate
case study. The Journal of Machine Learning Research, 8:935–983, 2007.
[49] J. McAuley, J Ming, D. Stewart, and P. Hanna. Subband correlation and robust speech recog-
nition. IEEE Transactions on Speech and Audio Processing, 13(5):956–964, 2005.
[50] L. Meier, S. De Geer, and P. Bühlmann. The group Lasso for logistic regression. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
[51] P. Mitra, C. A. Murthy, and S. Pal. Unsupervised feature selection using feature similarity.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:301–312, 2002.
[52] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-
dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and
Machine Intelligence, pages 1226–1238, 2005.
[53] S. Perkins and J. Theiler. Online feature selection using grafting. In Proceedings of the
International Conference on Machine Learning, pages 592–599, 2003.
[54] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[57] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics.
Bioinformatics, 23(19):2507–2517, 2007.
[58] T. Sandler, P. Talukdar, L. Ungar, and J. Blitzer. Regularized learning with networks of fea-
tures. Neural Information Processing Systems, 2008.
[59] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classifi-
cation in network data. AI Magazine, 29(3):93, 2008.
[60] M. R. Sikonja and I. Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF.
Machine Learning, 53:23–69, 2003.
[61] L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection
via dependence estimation. In Proceedings of the 24th International Conference on Machine
Learning, pages 823–830, 2007.
[62] J. Tang and H. Liu. Feature selection with linked data in social media. In SDM, pages 118–128,
2012.
[63] B. Taskar, P. Abbeel, M.F. Wong, and D. Koller. Label and link prediction in relational data.
In Proceedings of the IJCAI Workshop on Learning Statistical Models from Relational Data.
Citeseer, 2003.
[64] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statisti-
cal Society. Series B (Methodological), pages 267–288, 1996.
[65] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via
the fused Lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
67(1):91–108, 2005.
[66] R. Tibshirani and P. Wang. Spatial smoothing and hot spot detection for cgh data using the
fused Lasso. Biostatistics, 9(1):18–29, 2008.
[67] J. Wang, P. Zhao, S. Hoi, and R. Jin. Online feature selection and its applications. IEEE
Transactions on Knowledge and Data Engineering, pages 1–14, 2013.
[68] J. Weston, A. Elisseff, B. Schoelkopf, and M. Tipping. Use of the zero norm with linear models
and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
[69] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weight-
ing methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11:273–314,
1997.
[70] X. Wu, K. Yu, H. Wang, and W. Ding. Online streaming feature selection. In Proceedings of
the 27th International Conference on Machine Learning, pages 1159–1166, 2010.
[71] Z. Xu, R. Jin, J. Ye, M. Lyu, and I. King. Discriminative semi-supervised feature selection via
manifold regularization. In IJCAI’09: Proceedings of the 21th International Joint Conference
on Artificial Intelligence, 2009.
[72] S. Yang, L. Yuan, Y. Lai, X. Shen, P. Wonka, and J. Ye. Feature grouping and selection over
an undirected graph. In Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 922–930. ACM, 2012.
[73] J. Ye and J. Liu. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter,
14(1):4–15, 2012.
[74] L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter
solution. In International Conference on Machine Learning, 20:856, 2003.
[75] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[76] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning
Research, 7:2541–2563, 2006.
[77] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In Proceedings
of SIAM International Conference on Data Mining, 2007.
[78] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed
graph. In International Conference on Machine Learning, pages 1036–1043. ACM, 2005.
[79] J. Zhou, J. Liu, V. Narayan, and J. Ye. Modeling disease progression via fused sparse group
Lasso. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 1095–1103. ACM, 2012.
[80] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101(476):1418–1429, 2006.
[81] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the
Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Chapter 3
Probabilistic Models for Classification
Hongbo Deng
Yahoo! Labs
Sunnyvale, CA
[email protected]
Yizhou Sun
College of Computer and Information Science
Northeastern University
Boston, MA
[email protected]
Yi Chang
Yahoo! Labs
Sunnyvale, CA
[email protected]
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Naive Bayes Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.1 Bayes’ Theorem and Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.3 Maximum-Likelihood Estimates for Naive Bayes Models . . . . . . . . . . . . . . . . . . 70
3.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3 Logistic Regression Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.2 Parameters Estimation for Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.3 Regularization in Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4 Probabilistic Graphical Models for Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1.1 Bayesian Network Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.4.1.2 Inference in a Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.1.3 Learning Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.2.1 The Inference and Learning Algorithms . . . . . . . . . . . . . . . . . . . . . 79
3.4.3 Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.4.3.1 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.4.3.2 Clique Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.1 Introduction
In machine learning, classification is considered an instance of the supervised learning meth-
ods, i.e., inferring a function from labeled training data. The training data consist of a set of
training examples, where each example is a pair consisting of an input object (typically a vector) x = (x_1, x_2, ..., x_d) and a desired output value (typically a class label) y ∈ {C_1, C_2, ..., C_K}. Given
such a set of training data, the task of a classification algorithm is to analyze the training data and
produce an inferred function, which can be used to classify new (so far unseen) examples by assign-
ing a correct class label to each of them. An example would be assigning a given email into “spam”
or “non-spam” classes.
A common subclass of classification is probabilistic classification, and in this chapter we will
focus on some probabilistic classification methods. Probabilistic classification algorithms use sta-
tistical inference to find the best class for a given example. In addition to simply assigning the best
class like other classification algorithms, probabilistic classification algorithms will output a cor-
responding probability of the example being a member of each of the possible classes. The class
with the highest probability is normally then selected as the best class. In general, probabilistic classification algorithms have a few advantages over non-probabilistic classifiers. First, they can output a confidence value (i.e., a probability) associated with the selected class label, and can therefore abstain if the confidence in any particular output is too low. Second, probabilistic classifiers can be more effectively incorporated into larger machine learning tasks, in a way that partially or completely avoids the problem of error propagation.
Within a probabilistic framework, the key point of probabilistic classification is to estimate
the posterior class probability p(Ck |x). After obtaining the posterior probabilities, we use decision
theory [5] to determine class membership for each new input x. Basically, there are two ways in
which we can estimate the posterior probabilities.
In the first case, we focus on determining the class-conditional probabilities p(x|Ck ) for each
class Ck individually, and infer the prior class p(Ck ) separately. Then using Bayes’ theorem, we can
obtain the posterior class probabilities p(Ck |x). Equivalently, we can model the joint distribution
p(x,Ck ) directly and then normalize to obtain the posterior probabilities. As the class-conditional
probabilities define the statistical process that generates the features we measure, these approaches
that explicitly or implicitly model the distribution of inputs as well as outputs are known as gen-
erative models. If the observed data are truly sampled from the generative model, then fitting the
parameters of the generative model to maximize the data likelihood is a common method. In this
chapter, we will introduce two common examples of probabilistic generative models for classifica-
tion: Naive Bayes classifier and Hidden Markov model.
Another class of models is to directly model the posterior probabilities p(Ck |x) by learning
a discriminative function f (x) = p(Ck |x) that maps input x directly onto a class label Ck . This
approach is often referred to as the discriminative model as all effort is placed on defining the overall
discriminative function with the class-conditional probabilities in consideration. For instance, in the
case of two-class problems, f(x) = p(C_k|x) might be a continuous value between 0 and 1, such that
f < 0.5 represents class C1 and f > 0.5 represents class C2 . In this chapter, we will introduce
where the first equation is the sum rule, and the second equation is the product rule. Here p(X,Y )
is a joint probability, the quantity p(Y |X ) is a conditional probability, and the quantity p(X) is a
marginal probability. These two simple rules form the basis for all of the probabilistic theory that
we use throughout this chapter.
Based on the product rule, together with the symmetry property p(X,Y ) = p(Y, X), it is easy to
obtain the following Bayes’ theorem,
p(Y|X) = \frac{p(X|Y) p(Y)}{p(X)},    (3.3)
which plays a central role in machine learning, especially classification. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator, p(X) = \sum_{Y} p(X|Y) p(Y). The denominator in Bayes' theorem can be regarded as the normalization constant required
to ensure that the sum of the conditional probability on the left-hand side of Equation (3.3) over all
values of Y equals one.
Let us consider a simple example to better understand the basic concepts of probability theory
and the Bayes’ theorem. Suppose we have two boxes, one red and one white, and in the red box
we have two apples, four lemons, and six oranges, and in the white box we have three apples, six
lemons, and one orange. Now suppose we randomly pick one of the boxes and from that box we
randomly select an item, and have observed which sort of item it is. In the process, we replace the
item in the box from which it came, and we could imagine repeating this process many times. Let
us suppose that we pick the red box 40% of the time and we pick the white box 60% of the time,
and that when we select an item from a box we are equally likely to select any of the items in the
box.
Let us define random variable Y to denote the box we choose, then we have
p(Y = r) = 4/10
and
p(Y = w) = 6/10,
where p(Y = r) is the marginal probability that we choose the red box, and p(Y = w) is the marginal
probability that we choose the white box.
Suppose that we pick a box at random; then the probability of selecting an item is the fraction of that item in the selected box, which can be written as the following conditional probabilities:

p(X = a|Y = r) = 2/12,    (3.4)
p(X = l|Y = r) = 4/12,    (3.5)
p(X = o|Y = r) = 6/12,    (3.6)
p(X = a|Y = w) = 3/10,    (3.7)
p(X = l|Y = w) = 6/10,    (3.8)
p(X = o|Y = w) = 1/10.    (3.9)

Note that these probabilities are normalized so that

p(X = a|Y = r) + p(X = l|Y = r) + p(X = o|Y = r) = 1

and

p(X = a|Y = w) + p(X = l|Y = w) + p(X = o|Y = w) = 1.
Now suppose an item has been selected and it is an orange, and we would like to know which
box it came from. This requires that we evaluate the probability distribution over boxes conditioned
on the identity of the item, whereas the probabilities in Equation (3.4)-(3.9) illustrate the distribution
of the item conditioned on the identity of the box. Based on Bayes’ theorem, we can calculate the
posterior probability by reversing the conditional probability
p(Y = r|X = o) = \frac{p(X = o|Y = r) p(Y = r)}{p(X = o)} = \frac{6/12 \times 4/10}{13/50} = \frac{10}{13},
where the overall probability of choosing an orange p(X = o) can be calculated by using the sum
and product rules
p(X = o) = p(X = o|Y = r)p(Y = r) + p(X = o|Y = w)p(Y = w) = \frac{6}{12} \times \frac{4}{10} + \frac{1}{10} \times \frac{6}{10} = \frac{13}{50}.
From the sum rule, it then follows that p(Y = w|X = o) = 1 − 10/13 = 3/13.
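The numbers in this example can be checked mechanically; the short script below hard-codes the box contents from the text and applies the sum and product rules with exact fractions.

from fractions import Fraction as F

prior = {"r": F(4, 10), "w": F(6, 10)}                    # p(Y)
likelihood = {                                            # p(X | Y), from the box contents
    "r": {"apple": F(2, 12), "lemon": F(4, 12), "orange": F(6, 12)},
    "w": {"apple": F(3, 10), "lemon": F(6, 10), "orange": F(1, 10)},
}

# Sum rule: p(X = orange) = sum_Y p(X = orange | Y) p(Y)
p_orange = sum(likelihood[b]["orange"] * prior[b] for b in prior)
print(p_orange)                                           # 13/50

# Bayes' theorem: p(Y = r | X = orange)
print(likelihood["r"]["orange"] * prior["r"] / p_orange)  # 10/13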
In general cases, we are interested in the probabilities of the classes given the data samples.
Suppose we use random variable Y to denote the class label for data samples, and random variable
X to represent the feature of data samples. We can interpret p(Y = Ck ) as the prior probability for
the class Ck , which represents the probability that the class label of a data sample is Ck before we
observe the data sample. Once we observe the feature X of a data sample, we can then use Bayes’
theorem to compute the corresponding posterior probability p(Y |X). The quantity p(X|Y ) can be
expressed as how probable the observed data X is for different classes, which is called the likelihood.
Note that the likelihood is not a probability distribution over Y , and its integral with respect to Y
does not necessarily equal one. Given this definition of likelihood, we can state Bayes’ theorem
as posterior ∝ likelihood × prior. Now that we have introduced the Bayes’ theorem, in the next
subsection, we will look at how Bayes’ theorem is used in the Naive Bayes classifier.
It is typically intractable to learn exact Bayesian classifiers. Consider the case where Y is boolean and X is a vector of d boolean features: we need to estimate approximately 2^d parameters p(X_1 = x_1, X_2 = x_2, ..., X_d = x_d | Y = C_k). The reason is that, for any particular value C_k, there are 2^d possible values of x, which require 2^d − 1 independent parameters. Given two possible values for Y, we need to estimate a total of 2(2^d − 1) such parameters. Moreover, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times, which is clearly unrealistic in most practical classification domains. For example, if X is a vector with 20 boolean features, then we will need to estimate more than 1 million parameters.
To handle the intractable sample complexity of learning the Bayesian classifier, the Naive Bayes classifier reduces this complexity by making a conditional independence assumption: the features X_1, ..., X_d are all conditionally independent of one another, given Y. For the previous case, this conditional independence assumption dramatically reduces the number of parameters to be estimated for modeling p(X|Y), from the original 2(2^d − 1) to just 2d. Consider the likelihood
p(X = x|Y = Ck ) of Equation (3.10), we have
p(X_1 = x_1, X_2 = x_2, ..., X_d = x_d | Y = C_k)
  = \prod_{j=1}^{d} p(X_j = x_j | X_1 = x_1, X_2 = x_2, ..., X_{j-1} = x_{j-1}, Y = C_k)
  = \prod_{j=1}^{d} p(X_j = x_j | Y = C_k).    (3.11)
The second line follows from the chain rule, a general property of probabilities, and the third line follows directly from the conditional independence assumption above: the value of the random variable X_j is independent of all other feature values X_{j'} for j' ≠ j, when conditioned on the identity of the label Y. This is the Naive Bayes assumption. It is a relatively strong and very useful assumption. When Y and the X_j are boolean variables, we only need 2d parameters to define p(X_j | Y = C_k).
After substituting Equation (3.11) in Equation (3.10), we can obtain the fundamental equation
for the Naive Bayes classifier
p(Y = C_k | X_1, ..., X_d) = \frac{p(Y = C_k) \prod_j p(X_j | Y = C_k)}{\sum_i p(Y = y_i) \prod_j p(X_j | Y = y_i)}.    (3.12)
If we are interested only in the most probable value of Y , then we have the Naive Bayes classification
rule:
Y \leftarrow \arg\max_{C_k} \frac{p(Y = C_k) \prod_j p(X_j | Y = C_k)}{\sum_i p(Y = y_i) \prod_j p(X_j | Y = y_i)},    (3.13)
Because the denominator does not depend on Ck , the above formulation can be simplified to the
following
Y \leftarrow \arg\max_{C_k} \; p(Y = C_k) \prod_j p(X_j | Y = C_k).    (3.14)
For the d input features X_i, suppose each can take on J possible discrete values, denoted X_i = x_{ij}. The second set of parameters is

θ_{ijk} ≡ p(X_i = x_{ij} | Y = C_k)

for each input feature X_i, each of its possible values x_{ij}, and each of the possible values C_k of Y. The value θ_{ijk} can be interpreted as the probability of feature X_i taking value x_{ij}, conditioned on the underlying label being C_k. Note that these must satisfy \sum_j θ_{ijk} = 1 for each pair of i, k values; there are dJK such parameters, of which only d(J − 1)K are independent.
These parameters can be estimated using maximum likelihood estimates based on calculating
the relative frequencies of the different events in the data. Maximum likelihood estimates for θi jk
given a set of training examples are
\hat{θ}_{ijk} = \hat{p}(X_i = x_{ij} | Y = C_k) = \frac{count(X_i = x_{ij} \wedge Y = C_k)}{count(Y = C_k)},    (3.15)

where count(x) returns the number of examples in the training set that satisfy property x, e.g., count(X_i = x_{ij} \wedge Y = C_k) = \sum_{n=1}^{N} 1\{X_i^{(n)} = x_{ij} \wedge Y^{(n)} = C_k\}, and count(Y = C_k) = \sum_{n=1}^{N} 1\{Y^{(n)} = C_k\}.
This is a very natural estimate: We simply count the number of times label C_k is seen in conjunction
with Xi taking value xi j , and count the number of times the label Ck is seen in total, and then take
the ratio of these two terms.
To avoid the case that the data does not happen to contain any training examples satisfying the
condition in the numerator, it is common to adopt a smoothed estimate that effectively adds in a
number of additional hallucinated examples equally over the possible values of Xi . The smoothed
estimate is given by
\hat{θ}_{ijk} = \hat{p}(X_i = x_{ij} | Y = C_k) = \frac{count(X_i = x_{ij} \wedge Y = C_k) + l}{count(Y = C_k) + lJ},    (3.16)
where J is the number of distinct values that Xi can take on, and l determines the strength of this
smoothing. If l is set to 1, this approach is called Laplace smoothing [6].
Maximum likelihood estimates for πk take the following form
\hat{π}_k = \hat{p}(Y = C_k) = \frac{count(Y = C_k)}{N},    (3.17)
where N = \sum_{k=1}^{K} count(Y = C_k) is the number of examples in the training set. Similarly, we can
obtain a smoothed estimate by using the following form
\hat{π}_k = \hat{p}(Y = C_k) = \frac{count(Y = C_k) + l}{N + lK},    (3.18)
where K is the number of distinct values that Y can take on, and l again determines the strength of
the prior assumptions relative to the observed data.
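A minimal sketch of these estimators for discrete features is given below; it implements the smoothed estimates of Equations (3.16) and (3.18) and the classification rule of Equation (3.14) in log space, with data structures and names chosen purely for illustration.

import numpy as np
from collections import defaultdict

def train_naive_bayes(X, y, l=1.0):
    """Estimate smoothed priors (Eq. 3.18) and conditionals (Eq. 3.16) for
    discrete features; X has shape (n_samples, d)."""
    classes, class_counts = np.unique(y, return_counts=True)
    N, d = X.shape
    K = len(classes)
    priors = {c: (n_c + l) / (N + l * K) for c, n_c in zip(classes, class_counts)}
    cond = defaultdict(dict)                   # cond[(i, value)][c] = theta_ijk
    for i in range(d):
        values = np.unique(X[:, i])
        J = len(values)
        for v in values:
            for c, n_c in zip(classes, class_counts):
                count = np.sum((X[:, i] == v) & (y == c))
                cond[(i, v)][c] = (count + l) / (n_c + l * J)
    return priors, cond

def predict(x, priors, cond):
    """Naive Bayes rule (Eq. 3.14), computed with log probabilities;
    feature values never seen in training are skipped for simplicity."""
    def log_post(c):
        return np.log(priors[c]) + sum(
            np.log(cond[(i, v)][c]) for i, v in enumerate(x) if (i, v) in cond)
    return max(priors, key=log_post)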
3.2.4 Applications
The Naive Bayes classifier has been widely used in many classification problems, especially when the dimensionality of the features is high, such as document classification and spam detection. In this subsection, we briefly introduce document classification using the Naive Bayes classifier.
Suppose we have a number of documents x, and each document contains occurrences of words w from a dictionary D. If we assume a simple bag-of-words document model, then each document can be modeled as |D| single draws from a binomial distribution. In that case, for each
word w, the probability of the word occurring in a document from class k is p_{kw} (i.e., p(w|C_k)), and the probability of it not occurring in a document from class k is obviously 1 − p_{kw}. If word w occurs in the document at least once, we assign the value 1 to the feature corresponding to the word, and if it does not occur in the document we assign the value 0 to the feature. Basically,
each document will be represented by a feature vector of ones and zeros with the same length as the
size of the dictionary. Therefore, we are dealing with high dimensionality for large dictionaries, and
Naive Bayes classifier is particularly suited for solving this problem.
We create a matrix D with the element Dxw to denote the feature (presence or absence) of the
word w in document x, where rows correspond to documents and columns represent the dictio-
nary terms. Based on Naive Bayes classifier, we can model the class-conditional probability of a
document x coming from class k as
p(D_x | C_k) = ∏_{w=1}^{|D|} p(D_xw | C_k) = ∏_{w=1}^{|D|} p_kw^{D_xw} (1 − p_kw)^{1 − D_xw}.
According to the maximum-likelihood estimation described in Section 3.2.3, we may easily obtain the parameter p_kw as

p̂_kw = ∑_{x∈C_k} D_xw / N_k,
where N_k is the number of documents from class k, and ∑_{x∈C_k} D_xw is the number of documents from class k that contain the term w. To handle the case in which a term does not occur in any document from class k (i.e., p̂_kw = 0), the smoothed estimate is given by

p̂_kw = (∑_{x∈C_k} D_xw + 1) / (N_k + 2).
Similarly, we can obtain an estimate of the prior probability p(C_k) based on Equation (3.18). Once the parameters p_kw and p(C_k) are estimated, the estimated class-conditional likelihood can be plugged into the classification rule of Equation (3.14) to make a prediction.
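As an illustration of this document model, here is a small sketch that assumes the binary document-term matrix D and the label vector are NumPy arrays; the smoothing with denominator N_k + 2 and the add-one prior follow the estimates given above, and all names are illustrative rather than a reference implementation.

```python
import numpy as np

def train_bernoulli_nb(D, labels, num_classes):
    """D: binary document-term matrix (documents x vocabulary),
    labels: integer class index per document (NumPy array)."""
    n_docs, vocab = D.shape
    p_kw = np.zeros((num_classes, vocab))
    prior = np.zeros(num_classes)
    for k in range(num_classes):
        Dk = D[labels == k]
        Nk = Dk.shape[0]
        p_kw[k] = (Dk.sum(axis=0) + 1.0) / (Nk + 2.0)      # smoothed p_kw
        prior[k] = (Nk + 1.0) / (n_docs + num_classes)     # smoothed prior (Eq. 3.18, l = 1)
    return p_kw, prior

def classify(doc, p_kw, prior):
    """Score each class with log p(C_k) + sum_w [d_w log p_kw + (1 - d_w) log(1 - p_kw)]."""
    log_lik = doc @ np.log(p_kw).T + (1 - doc) @ np.log(1 - p_kw).T
    return int(np.argmax(np.log(prior) + log_lik))
```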
Discussion: The Naive Bayes classifier has proven effective in text classification, medical diagnosis, and computer performance management, among many other applications [21, 25, 27, 36]. Various
empirical studies of this classifier in comparison to decision tree and neural network classifiers have
found it to be comparable in some domains. As the independence assumption on which Naive Bayes
classifiers are based almost never holds for natural data sets, it is important to understand why Naive
Bayes often works well even though its independence assumption is violated. It has been observed
that the optimality in terms of classification error is not necessarily related to the quality of the fit to
the probability distribution (i.e., the appropriateness of the independence assumption) [36]. More-
over, Domingos and Pazzani [10] found that an optimal classifier is obtained as long as both the actual and estimated distributions agree on the most probable class, and they proved Naive Bayes optimality for some problem classes that have a high degree of feature dependence, such as disjunctive and conjunctive concepts. In summary, considerable theoretical and experimental evidence
has been developed that a training procedure based on the Naive Bayes assumptions can yield an
optimal classifier in a variety of situations where the assumptions are wildly violated. However, in
practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such
as the conditional independence and the lack of available probability data.
FIGURE 3.1: Illustration of logistic function. In logistic regression, p(Y = 1|X) is defined to follow
this form.
Such a model directly estimates the posterior probability p(C_k | X), which represents a form of discriminative model. In this section, we will introduce a popular discriminative classifier, logistic regression, as well as its parameter estimation method and applications.
Here p(Y = 1|X) is modeled as g(θ^T X), where g(z) = 1/(1 + e^{−z}) is the logistic function and we adopt the convention of letting X_0 = 1, so that X = (X_0, X_1, ..., X_d). Notice that g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as z → −∞. Figure 3.1 shows a plot of the logistic function. As we can see from Figure 3.1, g(z), and hence g(θ^T X) and p(Y|X), is always bounded between 0 and 1.
As the sum of the probabilities must equal 1, p(Y = 0|X) can be obtained using the following equation:

p(Y = 0|X) = e^{−θ^T X} / (1 + e^{−θ^T X}) = 1 − g(θ^T X).   (3.20)

In other words, p(Y = y_k | X) is g(θ^T X) if y_k = 1 and 1 − g(θ^T X) if y_k = 0. Note that we can combine Equation (3.19) and Equation (3.20) into the more compact form

p(Y = y_k | X) = [g(θ^T X)]^{y_k} [1 − g(θ^T X)]^{1 − y_k}.
Assuming that the N training examples were generated independently, the likelihood of the parameters can be written as

L(θ) = p(Y⃗ | X; θ)
     = ∏_{n=1}^N p(Y^(n) = y_k | X^(n); θ)
     = ∏_{n=1}^N [p(Y^(n) = 1 | X^(n))]^{Y^(n)} [p(Y^(n) = 0 | X^(n))]^{1 − Y^(n)}
     = ∏_{n=1}^N [g(θ^T X^(n))]^{Y^(n)} [1 − g(θ^T X^(n))]^{1 − Y^(n)},

where θ is the vector of parameters to be estimated, Y^(n) denotes the observed value of Y in the nth training example, and X^(n) denotes the observed value of X in the nth training example. To classify any given X, we generally want to assign to Y the value y_k that maximizes the likelihood, as discussed in the following subsection.
Recall that θ = (θ_0, θ_1, ..., θ_d) is the vector of parameters to be estimated, so the model has d + 1 adjustable parameters for a d-dimensional feature space.
Maximizing the likelihood is equivalent to maximizing the log-likelihood:

l(θ) = log L(θ) = ∑_{n=1}^N log p(Y^(n) = y_k | X^(n); θ)
     = ∑_{n=1}^N Y^(n) log g(θ^T X^(n)) + (1 − Y^(n)) log[1 − g(θ^T X^(n))]   (3.23)
     = ∑_{n=1}^N Y^(n) (θ^T X^(n)) − log(1 + e^{θ^T X^(n)}).   (3.24)
To estimate the parameter vector θ, we maximize the log-likelihood, i.e., θ ← argmax_θ l(θ). Since there is no closed-form solution to maximizing l(θ) with respect to θ, we use gradient ascent [37, 38] to solve the problem. Written in vector notation, the updates are given by θ ← θ + α∇_θ l(θ). After taking partial derivatives, the ith component of the gradient has the form

∂l(θ)/∂θ_i = ∑_{n=1}^N (Y^(n) − g(θ^T X^(n))) X_i^(n),   (3.26)
where g(θT X (n) ) is the logistic regression prediction using Equation (3.19). The term inside the
parentheses can be interpreted as the prediction error, which is the difference between the observed
Y (n) and its predicted probability. Note that if Y (n) = 1 then we expect g(θT X (n) ) (p(Y = 1|X )) to be
1, whereas if Y (n) = 0 then we expect g(θT X (n) ) to be 0, which makes p(Y = 0|X ) equal to 1. This
error term is multiplied by the value of X_i^(n), so as to account for the magnitude of the θ_i X_i^(n) term in making this prediction.
According to standard gradient ascent and the derivative with respect to each θ_i, we can repeatedly update the weights in the direction of the gradient as follows:

θ_i ← θ_i + α ∑_{n=1}^N (Y^(n) − g(θ^T X^(n))) X_i^(n),   (3.27)

where α is a small constant (e.g., 0.01) called the learning rate or step size. Because the log-likelihood l(θ) is a concave function of θ, this gradient ascent procedure will converge to a global maximum rather than a local one. The above method examines every example in the entire training set on every step; it is called batch gradient ascent. An alternative method is stochastic gradient ascent: we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only, which can be expressed as

θ_i ← θ_i + α (Y^(n) − g(θ^T X^(n))) X_i^(n).   (3.28)

In many cases where computational efficiency is important, it is common to use stochastic gradient ascent to estimate the parameters.
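The two update rules translate directly into code. The sketch below is one possible implementation, assuming X is an N × (d + 1) NumPy array with the constant feature X_0 = 1 already prepended and Y is a vector of 0/1 labels; the function names, default step size, and iteration counts are illustrative choices, not prescriptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, Y, alpha=0.01, iters=1000):
    """Batch update of Eq. (3.27): use all N examples in every step."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = Y - sigmoid(X @ theta)      # Y^(n) - g(theta^T X^(n)) for all n
        theta += alpha * (X.T @ error)      # Eq. (3.27)
    return theta

def stochastic_gradient_ascent(X, Y, alpha=0.01, epochs=10):
    """Stochastic update of Eq. (3.28): one training example at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, Y):
            theta += alpha * (y_n - sigmoid(x_n @ theta)) * x_n   # Eq. (3.28)
    return theta
```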
Besides gradient ascent, Newton’s method [32, 39] is another algorithm for maximizing l(θ). Newton’s method typically enjoys faster convergence than (batch) gradient ascent, requiring far fewer iterations to get very close to the optimum. For more details about Newton’s method, please refer to [32].
A common refinement is to subtract from the log-likelihood a penalty term proportional to ||θ||², which constrains the norm of the weight vector to be small. Here λ is the constant that determines the strength of the penalty term.
By adding this penalty term, it is easy to show that maximizing the penalized objective corresponds to calculating the MAP estimate for θ under the assumption that the prior distribution p(θ) is a Normal distribution with mean zero and a variance related to 1/λ. In general, the MAP estimate for θ maximizes

∑_{n=1}^N log p(Y^(n) = y_k | X^(n); θ) + log p(θ),

and if p(θ) is a zero-mean Gaussian distribution, then log p(θ) yields a term proportional to ||θ||².
After taking partial derivatives, we can easily obtain the form

∂l(θ)/∂θ_i = ∑_{n=1}^N (Y^(n) − g(θ^T X^(n))) X_i^(n) − λθ_i.
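A corresponding sketch of the MAP (L2-regularized) update, under the same data layout as the gradient-ascent sketch above; whether to exempt the intercept θ_0 from the penalty is a design choice that this simplified version ignores.

```python
import numpy as np

def map_gradient_ascent(X, Y, alpha=0.01, lam=0.1, iters=1000):
    """Batch gradient ascent on the penalized log-likelihood: the update adds
    the -lambda * theta term arising from the zero-mean Gaussian prior."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        error = Y - 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += alpha * (X.T @ error - lam * theta)
    return theta
```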
3.3.4 Applications
Logistic regression is used extensively in numerous disciplines [26], including the Web, and
medical and social science fields. For example, logistic regression might be used to predict whether
a patient has a given disease (e.g., diabetes), based on observed characteristics of the patient (age,
gender, body mass index, results of various blood tests, etc.) [3, 22]. Another example [2] might be
to predict whether an American voter will vote Democratic or Republican, based on age, income,
gender, race, state of residence, votes in previous elections, etc. The technique can also be used
in engineering, especially for predicting the probability of failure of a given process, system, or
product. It is also used in marketing applications, such as predicting a customer’s propensity to purchase a product or cancel a subscription. In economics it can be used to predict the
likelihood of a person’s choosing to be in the labor force, and in a business application, it would be
used to predict the likelihood of a homeowner defaulting on a mortgage.
FIGURE 3.2: Examples of directed acyclic graphs describing the dependency relationships among
variables.
For a classification task, we can create a node for the random variable that is to be classified, and the goal is to infer the discrete value associated with that random variable. For example, in order to classify
whether a patient has lung cancer, a node representing “lung cancer” will be created together with
other factors that may directly or indirectly cause lung cancer and outcomes that will be caused by
lung cancer. Given the observations for other random variables in BN, we can infer the probability
of a patient having lung cancer.
Assuming that X1, X2, and X3 are all binary random variables, a possible conditional probability table for p(X2 | X1, X3) could be:

            X1=0, X3=0   X1=0, X3=1   X1=1, X3=0   X1=1, X3=1
  X2 = 0        0.1          0.4          0.8          0.3
  X2 = 1        0.9          0.6          0.2          0.7
Because a BN is a complete model for the variables and their relationships, a complete joint
probability distribution (JPD) over all the variables is specified for a model. Given the JPD, we
can answer all possible inference queries by summing out (marginalizing) over irrelevant variables.
However, the JPD has size O(2n ), where n is the number of nodes, and we have assumed each
node can have 2 states. Hence summing over the JPD takes exponential time. The most common
exact inference method is Variable Elimination [8]. The general idea is to perform the summation
to eliminate the non-observed non-query variables one by one by distributing the sum over the
product. The reader can refer to [8] for more details. As an alternative to exact inference, a useful approximate algorithm called belief propagation [30] is commonly used on general graphs, including Bayesian networks; it will be introduced in Section 3.4.3.
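To make the idea of summing out irrelevant variables concrete, the following toy sketch assumes a three-node network in which X1 and X3 are the parents of X2, matching the conditional probability table shown above; the priors for X1 and X3 are made-up numbers, since they are not given in the text.

```python
# Hypothetical priors for X1 and X3, plus the CPT for p(X2 | X1, X3) from the table above.
p_x1 = {0: 0.6, 1: 0.4}
p_x3 = {0: 0.7, 1: 0.3}
p_x2_given = {(0, 0): {0: 0.1, 1: 0.9}, (0, 1): {0: 0.4, 1: 0.6},
              (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.3, 1: 0.7}}

def joint(x1, x2, x3):
    """JPD factorized according to the assumed network: p(X1) p(X3) p(X2 | X1, X3)."""
    return p_x1[x1] * p_x3[x3] * p_x2_given[(x1, x3)][x2]

def query_x2_given_x1(x1_val):
    """Answer p(X2 | X1 = x1_val) by summing out (marginalizing) X3 and normalizing."""
    scores = {x2: sum(joint(x1_val, x2, x3) for x3 in (0, 1)) for x2 in (0, 1)}
    z = sum(scores.values())
    return {x2: s / z for x2, s in scores.items()}

print(query_x2_given_x1(1))   # the inferred distribution over X2 given X1 = 1
```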
FIGURE 3.3: Graphical structures for the regular and hidden Markov model.
Thus, if we use such a model to predict the next observation in a sequence, the distribution of pre-
dictions will depend on the value of the immediately preceding observation and will be independent
of all earlier observations, conditional on the preceding observation.
A hidden Markov model (HMM) can be considered as the simplest dynamic Bayesian network.
In a hidden Markov model, the state yi is not directly visible, and only the output xi , dependent on
the state, is visible. The hidden state space is discrete and is assumed to consist of N possible values; the hidden state is also referred to as a latent variable. The observations can be either discrete or continuous,
and are typically generated from a categorical distribution or a Gaussian distribution. Generally,
an HMM can be considered as a generalization of a mixture model where the hidden variables are
related through a Markov process rather than independent of each other.
Suppose the latent variables form a first-order Markov chain as shown in Figure 3.3(b). The
random variable yt is the hidden state at time t, and the random variable xt is the observation at time
t. The arrows in the figure denote conditional dependencies. From the diagram, it is clear that yt−1
and yt+1 are independent given yt , so that yt+1 ⊥⊥ yt−1 |yt . This is the key conditional independence
property, which is called the Markov property. Similarly, the value of the observed variable xt only
depends on the value of the hidden variable yt . Then, the joint distribution for this model is given by
p(x_1, ..., x_n, y_1, ..., y_n) = p(y_1) ∏_{t=2}^n p(y_t | y_{t−1}) ∏_{t=1}^n p(x_t | y_t),   (3.33)
where p(yt |yt−1 ) is the state transition probability, and p(xt |yt ) is the observation probability.
The forward and backward variables are defined as follows: α_t(i) = P(x_1 = o_1, ..., x_t = o_t, y_t = q_i | Λ) and β_t(i) = P(x_{t+1} = o_{t+1}, ..., x_T = o_T | y_t = q_i, Λ). Note that the forward values enable us to solve the problem through marginalization:

P(X_1^T | Λ) = ∑_{i=1}^N P(o_1, ..., o_T, y_T = q_i | Λ) = ∑_{i=1}^N α_T(i).

The forward values can be computed efficiently with the principle of dynamic programming:

α_1(i) = π_i b_i(o_1),
α_{t+1}(j) = [∑_{i=1}^N α_t(i) a_ij] b_j(o_{t+1}).

Similarly, the backward values can be computed recursively as

β_T(i) = 1,
β_t(i) = ∑_{j=1}^N a_ij b_j(o_{t+1}) β_{t+1}(j).
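The forward and backward recursions translate directly into code. The sketch below assumes the HMM parameters Λ = (π, A, B) are given as NumPy arrays, with observations encoded as integer symbol indices; it is an illustration of the recursions above, not code from any particular HMM library.

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha_1(i) = pi_i b_i(o_1); alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1}).
    pi: (N,), A: transition matrix a_ij (N x N), B: emission matrix b_j(o) (N x M)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha                      # P(X_1^T | Lambda) = alpha[-1].sum()

def backward(A, B, obs):
    """beta_T(i) = 1; beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```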
Here, V_t(j) is the probability of the most probable state sequence responsible for the first t observations that has q_j as its final state. The Viterbi path can be retrieved by saving back pointers that remember which state was used at each step of the recursion; let Ptr(y_t, q_i) be the function that returns the value of y_{t−1} used to compute V_t(i).
The complexity of this algorithm is O(T × N 2 ), where T is the length of observed sequence and N
is the number of possible states.
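For completeness, here is a matching sketch of the Viterbi recursion with back pointers, using the same parameter layout as the forward-backward sketch; it illustrates the O(T × N²) procedure described above and is not taken from any specific implementation.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path. V[t, j] is V_t(j); ptr plays the role of Ptr."""
    N, T = len(pi), len(obs)
    V = np.zeros((T, N))
    ptr = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A        # scores[i, j] = V_{t-1}(i) * a_ij
        ptr[t] = scores.argmax(axis=0)        # best predecessor for each state j
        V[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(V[-1].argmax())]
    for t in range(T - 1, 0, -1):             # follow the back pointers
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]
```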
Now we need a method of adjusting the parameters Λ to maximize the likelihood for a given
training set. The Baum-Welch algorithm [4] is used to find the unknown parameters of HMMs,
which is a particular case of a generalized EM algorithm [9]. We start by choosing arbitrary values
for the parameters, then compute the expected frequencies given the model and the observations.
The expected frequencies are obtained by weighting the observed transitions by the probabilities
specified in the current model. The expected frequencies obtained are then substituted for the old
parameters, and we iterate until there is no improvement. On each iteration we improve the probability of the observations being generated by the model, until some limiting probability is reached. This iterative procedure is guaranteed to converge to a local maximum [34].
The joint distribution above is defined as the product of potentials, and so the total energy is obtained
by adding the energies of each of the maximal cliques.
A log-linear model is a Markov random field with feature functions f_k such that the joint distribution can be written as

p(x_1, x_2, ..., x_n) = (1/Z) exp( ∑_{k=1}^K λ_k f_k(x_{C_k}) ),

where f_k(x_{C_k}) is the feature function defined on the clique C_k, and λ_k is the corresponding feature weight. The
log-linear model provides a much more compact representation for many distributions, especially
when variables have large domains such as text.
where f_ij(x_i, x_j) is the potential function of the pairwise clique. After enough iterations, this process is likely to converge to a consensus. Once the messages have converged, the marginal probabilities of all the variables can be determined from them. The reader can refer to [30] for more details. The main cost is the message update equation, which is O(N²) for each pair of variables (N is the number of possible states).
Recently, MRF has been widely used in many text mining tasks, such as text categorization [7]
and information retrieval [28]. In [28], MRF is used to model the term dependencies using the joint
distribution over queries and documents. The model allows for arbitrary text features to be incorpo-
rated as evidence. In this model, an MRF is constructed from a graph G, which consists of query
nodes qi and a document node D. The authors explore full independence, sequential dependence,
and full dependence variants of the model. Then, a novel approach is developed to train the model
that directly maximizes the mean average precision. The results show significant improvements are
possible by modeling dependencies, especially on the larger Web collections.
FIGURE 3.4: Graphical structure for the conditional random field model.
conditional nature, which relaxes the independence assumptions that HMMs require in order to ensure tractable inference.
Considering a linear-chain CRF with Y = {y1 , y2 , ..., yn } and X = {x1 , x2 , ..., xn } as shown in
Figure 3.4, an input sequence of observed variable X represents a sequence of observations and
Y represents a sequence of hidden state variables that needs to be inferred given the observations.
The yi ’s are structured to form a chain, with an edge between each yi and yi+1 . The distribution
represented by this network has the form

p(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n) = (1/Z(X)) exp( ∑_i ∑_{k=1}^K λ_k f_k(y_i, y_{i−1}, x_i) ),

where the partition function Z(X) = ∑_y exp( ∑_i ∑_{k=1}^K λ_k f_k(y_i, y_{i−1}, x_i) ) sums over all possible label sequences y.
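The following sketch shows how the linear-chain score is assembled from feature functions and weights. The feature functions, the weights, and the brute-force computation of Z(X) are illustrative assumptions only; practical CRF implementations compute Z(X) with a forward-style dynamic program rather than by enumeration.

```python
import itertools
import math

def crf_score(y, x, feature_fns, weights):
    """Unnormalized score exp(sum_i sum_k lambda_k f_k(y_i, y_{i-1}, x_i)).
    Each f_k(y_i, y_prev, x_i) must accept y_prev = None at the first position."""
    total = 0.0
    for i in range(len(y)):
        y_prev = y[i - 1] if i > 0 else None
        total += sum(w * f(y[i], y_prev, x[i]) for f, w in zip(feature_fns, weights))
    return math.exp(total)

def crf_prob(y, x, feature_fns, weights, label_set):
    """p(y | x): Z(X) is computed by enumerating label sequences, feasible only for tiny chains."""
    Z = sum(crf_score(list(y_cand), x, feature_fns, weights)
            for y_cand in itertools.product(label_set, repeat=len(x)))
    return crf_score(y, x, feature_fns, weights) / Z
```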
3.5 Summary
This chapter has introduced the most frequently used probabilistic models for classification,
which include generative probabilistic classifiers, such as naive Bayes classifiers and hidden Markov
models, and discriminative probabilistic classifiers, such as logistic regression and conditional ran-
dom fields. Some of the classifiers, like the naive Bayes classifier, may be more appropriate for
high-dimensional text data, whereas others such as HMM and CRF are used more commonly for
temporal or sequential data. The goal of the learning algorithms for these probabilistic models is to find MLE or MAP estimates of the model parameters. For the simple naive Bayes classifier, these estimates have closed-form solutions; for logistic regression and in other cases, there is no closed form, and iterative algorithms such as gradient-based methods are good options.
Bibliography
[1] C. Andrieu, N. De Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine
learning. Machine Learning, 50(1):5–43, 2003.
[2] J. Antonakis and O. Dalgas. Predicting elections: Child’s play! Science, 323(5918):1183–
1183, 2009.
[3] S. C. Bagley, H. White, and B. A. Golomb. Logistic regression in the medical literature:
Standards for use and reporting, with particular attention to one medical domain. Journal of
Clinical Epidemiology, 54(10):979–985, 2001.
[4] L. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the
statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical
Statistics, 41(1):164–171, 1970.
[5] C. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer New York, 2006.
[6] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language model-
ing. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics,
pages 310–318. Association for Computational Linguistics, 1996.
[7] S. Chhabra, W. Yerazunis, and C. Siefkes. Spam filtering using a Markov random field model
with variable weighting schemas. In Proceedings of the 4th IEEE International Conference
on Data Mining, 2004. ICDM’04, pages 347–350, 2004.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[10] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-
one loss. Machine Learning, 29(2-3):103–130, 1997.
[11] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis, volume 3. Wiley New
York, 1973.
[12] G. Forney Jr. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[14] R. Kindermann, J. Snell, and American Mathematical Society. Markov Random Fields and
Their Applications. American Mathematical Society Providence, RI, 1980.
[15] D. Kleinbaum and M. Klein. Maximum likelihood techniques: An overview. Logistic regres-
sion, pages 103–127, 2010.
[18] J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech &
Language, 6(3):225–242, 1992.
[19] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
[20] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In AAAI, volume 90,
pages 223–228, 1992.
[21] D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval.
In ECML, pages 4–15, 1998.
[22] J. Liao and K.-V. Chin. Logistic regression for disease classification using microarray data:
Model selection in a large p and small n case. Bioinformatics, 23(15):1945–1951, 2007.
[23] P. MacCullagh and J. A. Nelder. Generalized Linear Models, volume 37. CRC Press, 1989.
[24] A. McCallum and W. Li. Early results for named entity recognition with conditional random
fields, feature induction and web-enhanced lexicons. In Proceedings of the 7th Conference on
Natural Language Learning at HLT-NAACL 2003-Volume 4, pages 188–191, ACL, 2003.
[25] A. McCallum, K. Nigam, et al. A comparison of event models for Naive Bayes text classifi-
cation. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48.
1998.
[26] S. Menard. Applied Logistic Regression Analysis, volume 106. Sage, 2002.
[27] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with Naive Bayes—Which
Naive Bayes? In Third Conference on Email and Anti-Spam, pp. 27–28, 2006.
[28] D. Metzler and W. Croft. A Markov random field model for term dependencies. In Proceedings
of the 28th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 472–479. ACM, 2005.
[29] T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in
Artificial Intelligence, volume 17, pages 362–369. 2001.
[30] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An
empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in AI, volume 9,
pages 467–475, 1999.
[31] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of
Computational and Graphical Statistics, 9(2):249–265, 2000.
[33] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recogni-
tion. Proceedings of the IEEE, 77(2):257–286, 1989.
[34] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP
Magazine, 3(1): 4–15, January 1986.
[35] I. Rish. An empirical study of the Naive Bayes classifier. In IJCAI 2001 Workshop on Empirical
Methods in Artificial Intelligence, pages 41–46, 2001.
[36] I. Rish, J. Hellerstein, and J. Thathachar. An analysis of data characteristics that affect Naive
Bayes performance. In Proceedings of the Eighteenth Conference on Machine Learning, 2001.
[37] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of
the 2003 Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology, pages 134–141. 2003.
[40] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learn-
ing. In Introduction to Statistical Relational Learning, L. Getoor and B. Taskar(eds.) pages
95–130, MIT Press, 2006.
Chapter 4
Decision Trees: Theory and Algorithms
Victor E. Lee
John Carroll University
University Heights, OH
[email protected]
Lin Liu
Kent State University
Kent, OH
[email protected]
Ruoming Jin
Kent State University
Kent, OH
[email protected]
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Top-Down Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.1 Node Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2.2 Tree Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Case Studies with C4.5 and CART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.1 Splitting Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.3.2 Stopping Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.3.3 Pruning Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.4 Handling Unknown Values: Induction and Prediction . . . . . . . . . . . . . . . . . . . . . . 101
4.3.5 Other Issues: Windowing and Multivariate Criteria . . . . . . . . . . . . . . . . . . . . . . . . 102
4.4 Scalable Decision Tree Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.1 RainForest-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4.2 SPIES Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.3 Parallel Decision Tree Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5 Incremental Decision Tree Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.1 ID3 Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.2 VFDT Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5.3 Ensemble Method for Streaming Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.1 Introduction
One of the most intuitive tools for data classification is the decision tree. It hierarchically par-
titions the input space until it reaches a subspace associated with a class label. Decision trees are
appreciated for being easy to interpret and easy to use. They are enthusiastically used in a range of
business, scientific, and health care applications [12,15,71] because they provide an intuitive means
of solving complex decision-making tasks. For example, in business, decision trees are used for
everything from codifying how employees should deal with customer needs to making high-value
investments. In medicine, decision trees are used for diagnosing illnesses and making treatment
decisions for individuals or for communities.
A decision tree is a rooted, directed tree akin to a flowchart. Each internal node corresponds
to a partitioning decision, and each leaf node is mapped to a class label prediction. To classify a
data item, we imagine the data item to be traversing the tree, beginning at the root. Each internal
node is programmed with a splitting rule, which partitions the domain of one (or more) of the data’s
attributes. Based on the splitting rule, the data item is sent forward to one of the node’s children.
This testing and forwarding is repeated until the data item reaches a leaf node.
Decision trees are nonparametric in the statistical sense: they are not modeled on a probabil-
ity distribution for which parameters must be learned. Moreover, decision tree induction is almost
always nonparametric in the algorithmic sense: there are no weight parameters which affect the
results.
Each directed edge of the tree can be translated to a Boolean expression (e.g., x1 > 5); therefore,
a decision tree can easily be converted to a set of production rules. Each path from root to leaf
generates one rule as follows: form the conjunction (logical AND) of all the decisions from parent
to child.
Decision trees can be used with both numerical (ordered) and categorical (unordered) attributes.
There are also techniques to deal with missing or uncertain values. Typically, the decision rules are
univariate. That is, each partitioning rule considers a single attribute. Multivariate decision rules
have also been studied [8, 9]. They sometimes yield better results, but the added complexity is often
not justified. Many decision trees are binary, with each partitioning rule dividing its subspace into
two parts. Even binary trees can be used to choose among several class labels. Multiway splits
are also common, but if the partitioning is into more than a handful of subdivisions, then both the
interpretability and the stability of the tree suffers. Regression trees are a generalization of decision
trees, where the output is a real value over a continuous range, instead of a categorical value. For
the remainder of the chapter, we will assume binary, univariate trees, unless otherwise stated.
Table 4.1 shows a set of training data to answer the classification question, “What sort of contact
lenses are suitable for the patient?” This data was derived from a public dataset available from the
UCI Machine Learning Repository [3]. In the original data, the age attribute was categorical with
three age groups. We have modified it to be a numerical attribute with age in years. The next three
attributes are binary-valued. The last attribute is the class label. It is shown with three values (lenses
types): {hard, soft, no}. Some decision tree methods support only binary decisions. In this case, we
can combine hard and soft to be simply yes.
Next, we show four different decision trees, all induced from the same data. Figure 4.1(a) shows
the tree generated by using the Gini index [8] to select split rules when the classifier is targeting
all three class values. This tree classifies the training data exactly, with no errors. In the leaf nodes,
the number in parentheses indicates how many records from the training dataset were classified into
this bin. Some leaf nodes indicate a single data item. In real applications, it may be unwise to permit
the tree to branch based on a single training item because we expect the data to have some noise
or uncertainty. Figure 4.1(b) is the result of pruning the previous tree, in order to achieve a smaller
tree while maintaining nearly the same classification accuracy. Some leaf nodes now show a pair of numbers: (record count, classification errors).
Figure 4.2(a) shows a 2-class classifier (yes, no) and uses the C4.5 algorithm for selecting the
splits [66]. A very aggressively pruned tree is shown in Figure 4.2(b). It misclassifies 3 out of 24
training records.
FIGURE 4.1 (See color insert.): 3-class decision trees for contact lenses recommendation.
FIGURE 4.2 (See color insert.): 2-class decision trees for contact lenses recommendation.
Notation We will use the following notation (further summarized for data and partition in Table
4.2) to describe the data, its attributes, the class labels, and the tree structure. A data item x is
a vector of d attribute values with an optional class label y. We denote the set of attributes as A =
{A1 , A2 , . . . , Ad }. Thus, we can characterize x as {x1 , x2 , . . . , xd }, where x1 ∈ A1 , x2 ∈ A2 , . . . , xd ∈ Ad .
Let Y = {y1 , y2 , . . . , ym } be the set of class labels. Each training item x is mapped to a class value y
where y ∈ Y. Together they constitute a data tuple x, y. The complete set of training data is X.
¹ A training set is inconsistent if two items have different class values but are identical in all other attribute values.
A partitioning rule S subdivides data set X into a set of subsets collectively known as X_S; that is, X_S = {X_1, X_2, ..., X_k} where ∪_i X_i = X. A decision tree is a rooted tree in which each set of children of each parent node corresponds to a partitioning (X_S) of the parent's data set, with the full data set associated with the root. The number of items in X_i that belong to class y_j is |X_ij|. The probability that a randomly selected member of X_i is of class y_j is p_ij = |X_ij| / |X_i|.
The remainder of the chapter is organized as follows. Section 4.2 describes the operation of
classical top-down decision tree induction. We break the task down into several subtasks, examin-
ing and comparing specific splitting and pruning algorithms that have been proposed. Section 4.3
features case studies of two influential decision tree algorithms, C4.5 [66] and CART [8]. Here we
delve into the details of start-to-finish decision tree induction and prediction. In Section 4.4, we
describe how data summarization and parallelism can be used to achieve scalability with very large
datasets. Then, in Section 4.5, we introduce techniques and algorithms that enable incremental tree
induction, especially in the case of streaming data. We conclude with a review of the advantages
and disadvantages of decision trees compared to other classification methods.
As a final but vital step, for each of the data subsets generated by the splitting rule, we recursively call BuildSubTree (lines 13–15). Each call generates a subtree that is then attached as a child to the principal node. We now have a tree, which is returned as the output of the function.
• Numerical (ordered) attributes: If there are k different values, then we can make either
(k − 1) different binary splits or one single k-way split.
These rules are summarized in Table 4.3. For the data set in Table 4.1, the age attribute has 20
distinct values, so there are 19 possible splits. The other three attributes are binary, so they each
offer one split. There are a total of 22 ways to split the root node.
We now look at how each splitting rule is evaluated for its goodness. An ideal rule would form
subsets that exhibit class purity: each subset would contain members belonging to only one class y.
To optimize the decision tree, we seek the splitting rule S which minimizes the impurity function
F(XS ). Alternately, we can seek to maximize the amount that the impurity decreases due to the split:
ΔF(S) = F(X) − F(XS ). The authors of CART [8] provide an axiomatic definition of an impurity
function F:
Definition 4.1 An impurity function F for a m-state discrete variable Y is a function defined on the
set of all m-tuple discrete probability vectors (p1 , p2 , . . . , pm ) such that
1. F is maximum only at (1/m, 1/m, . . . , 1/m),
2. F is minimum only at the "purity points" (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1),
1. Error Rate
A simple measure is the percentage of misclassified items. If y_j is the class value that appears most frequently in partition X_i, then the error rate for X_i is E(X_i) = |{y ≠ y_j : (x, y) ∈ X_i}| / |X_i| = 1 − p_ij. The error rate for the entire split X_S is the weighted sum of the error rates for each subset. This equals the total number of misclassifications, normalized by the size of X:

ΔF_error(S) = E(X) − ∑_{i∈S} (|X_i| / |X|) E(X_i).   (4.1)
This measure does not have good discriminating power. Suppose we have a two-class system,
in which y is the majority class not only in X, but in every available partitioning of X . Error
rate will consider all such partitions equal in preference.
where p_y is the probability that a random selection would have state y. We add a subscript (e.g., X) when it is necessary to indicate the dataset being measured. Information entropy can be interpreted as the expected amount of information, measured in bits, needed to describe the state of a system. Pure systems require the least information. If all objects in the system are in the same state k, then p_k = 1 and log p_k = 0, so the entropy H = 0. There is no randomness in the system; no additional classification information is needed. At the other extreme is maximal uncertainty, when there are an equal number of objects in each of the |Y| states, so p_y = 1/|Y| for all y. Then H(Y) = −|Y| (1/|Y|) log(1/|Y|) = log |Y|. To describe the system we have to fully specify all the possible states using log |Y| bits. If the system is pre-partitioned into subsets
according to another variable (or splitting rule) S, then the information entropy of the overall system is the weighted sum of the entropies for each partition, H_{X_i}(Y). This is equivalent to the conditional entropy H_X(Y|S):

ΔF_infoGain(S) = −∑_{y∈Y} p_y log p_y + ∑_{i∈S} (|X_i|/|X|) ∑_{y∈Y} p_iy log p_iy   (4.3)
              = H_X(Y) − ∑_{i∈S} p_i H_{X_i}(Y)
              = H_X(Y) − H_X(Y|S).
We know lim p→0 p log p goes to 0, so if a particular class value y is not represented in a
dataset, then it does not contribute to the system’s entropy.
A shortcoming of the information gain criterion is that it is biased towards splits with larger
k. Given a candidate split, if subdividing any subset provides additional class differentiation,
then the information gain score will always be better. That is, there is no cost to making a
split. In practice, making splits into many small subsets increases the sensitivity to individual
training data items, leading to overfit. If the split’s cardinality k is greater than the number of
class values m, then we might be “overclassifying” the data.
For example, suppose we want to decide whether conditions are suitable for holding a sem-
inar, and one of the attributes is day of the week. The “correct” answer is Monday through
Friday are candidates for a seminar, while Saturday and Sunday are not. This is naturally a
binary split, but ID3 would select a 7-way split.
ΔF_Gini(S) = Gini(X) − ∑_{i∈S} (|X_i| / |X|) Gini(X_i).   (4.5)

splitInfo(S) = −∑_{i∈S} (|X_i| / |X|) log (|X_i| / |X|) = H(S).   (4.6)
SplitInfo considers only the number of subdivisions and their relative sizes, not their purity. It
is higher when there are more subdivisions and when they are more balanced in size. In fact,
splitInfo is the entropy of the split where S is the random variable of interest, not Y . Thus, the
gain ratio seeks to factor out the information gained from the type of partitioning as opposed
to what classes were contained in the partitions.
The gain ratio still has a drawback. A very imbalanced partitioning will yield a low value for
H(S) and thus a high value for ΔFgainRatio, even if the information gain is not very good. To
overcome this, C4.5 only considers splits whose information gain scores are better than the
average value [66].
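To tie the preceding criteria together, here is a small illustrative sketch that scores a candidate split with information gain, Gini decrease, and gain ratio, using a toy dataset loosely based on the seminar example; the function names and the data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def evaluate_split(parent_labels, child_label_lists):
    """Information gain (Eq. 4.3), Gini decrease (Eq. 4.5), and gain ratio
    (information gain divided by splitInfo, Eq. 4.6) for one candidate split."""
    n = len(parent_labels)
    weights = [len(c) / n for c in child_label_lists]
    info_gain = entropy(parent_labels) - sum(w * entropy(c)
                                             for w, c in zip(weights, child_label_lists))
    gini_gain = gini(parent_labels) - sum(w * gini(c)
                                          for w, c in zip(weights, child_label_lists))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return {"info_gain": info_gain, "gini_gain": gini_gain, "gain_ratio": gain_ratio}

# Toy example: seven days labeled by seminar suitability, split into weekday/weekend.
parent = ["yes"] * 5 + ["no"] * 2
print(evaluate_split(parent, [["yes"] * 5, ["no"] * 2]))
```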
5. Normalized Measures — Information Distance
López de Mántaras has proposed using normalized information distance as a splitting crite-
rion [51]. The distance between a target attribute Y and a splitting rule S is the sum of the two
conditional entropies, normalized by dividing by their joint entropy:

d_N(Y, S) = [H(Y|S) + H(S|Y)] / H(Y, S).

This function is a distance metric; that is, it meets the nonnegativity, symmetry, and triangle
inequality properties, and dN (Y, S) = 0 when Y = S. Moreover its range is normalized to [0, 1].
Due to its construction, it solves both the high-k bias and the imbalanced partition problems
of information gain (ID3) and gain ratio (C4.5).
not the other, then the two vectors are orthogonal, and the cosine is 0. The measure is
formulated as (1 − cosθ) so that we seek a maximum value.
ORT(S) = 1 − cos(P_0, P_1) = 1 − (∑_{y∈Y} p_y0 p_y1) / (||P_0|| · ||P_1||).   (4.11)
respectively. More generally, we can say the weight is w_{p,t}, where p is the predicted class and t is the true class; w_{t,t} = 0 because this represents a correct classification, hence no error.
Let us look at a few examples of weights being incorporated into an impurity function. Let Ti be
the class predicted for partition Xi , which would be the most populous class value in Xi .
Weighted Error Rate: Instead of simply counting all the misclassifications, we count how many of each type of classification occurs, multiplied by its weight:

F_error,wt = (∑_{i∈S} ∑_{y∈Y} w_{T_i,y} |X_iy|) / |X| = ∑_{i∈S} ∑_{y∈Y} w_{T_i,y} p_iy.   (4.12)
Weighted Entropy: The modified entropy can be incorporated into the information gain, gain
ratio, or information distance criterion.
E′(X_i) = (|X_i| · E(X_i) + m − 1) / (|X_i| + m),   (4.15)
where m is the number of different classes. Using this as an impurity criterion, this pruning
method works just like the tree growth step, except it merges instead of splits. Starting from
the parent of a leaf node, it compares its expected error rate with the size-weighted sum of the
error rates of its children. If the parent has a lower expected error rate, the subtree is converted
to a leaf. The process is repeated for all parents of leaves until the tree has been optimized.
Mingers [54] notes a few flaws with this approach: 1) the assumption of equally likely classes
is unreasonable, and 2) the number of classes strongly affects the degree of pruning.
Looking at the [age > 55?] node in Figure 4.2(a) again, the current subtree has a score E′(subtree) = (5/6) · (5(0) + 2 − 1)/(5 + 2) + (1/6) · (1(0) + 2 − 1)/(1 + 2) = (5/6)(1/7) + (1/6)(1/3) = 0.175. If we change the node to a leaf, we get E′(leaf) = (6(1/6) + 2 − 1)/(6 + 2) = 2/8 = 0.250. The pruned version has a higher expected error, so the subtree is not pruned.
if not pruned: E_pess(T(v)) = ∑_{l∈L(v)} E(l) + |L(v)|/2,   (4.16)
if pruned: E_pess(v) = E(v) + 1/2.   (4.17)
Because this adjustment alone might still be too optimistic, the actual rule is that a subtree
will be pruned if the decrease in error is larger than the Standard Error.
In C4.5, Quinlan modified pessimistic error pruning to be more pessimistic. The new esti-
mated error is the upper bound of the binomial distribution confidence interval, UCF (E , |Xi |).
C4.5 uses 25% confidence by default. Note that the binomial distribution should not be ap-
proximated by the normal distribution, because the approximation is not good for small error
rates.
For our [age > 55?] example in Figure 4.2(a), C4.5 would assign the subtree an error score of (5/6)U_CF(0, 5) + (1/6)U_CF(0, 1) = (0.833)(0.242) + (0.166)(0.750) = 0.327. If we prune, then the new root has a score of U_CF(1, 6) = 0.390. The original split has a better (lower) error score, so we retain the split.
Esposito et al. [20] have compared the five earlier pruning techniques. They find that cost-
complexity pruning and reduced error pruning tend to overprune, i.e., create smaller but less accurate
decision trees. Other methods (error-based pruning, pessimistic error pruning, and minimum error
pruning) tend to underprune. However, no method clearly outperforms others on all measures. The
wisest strategy for the user seems to be to try several methods, in order to have a choice.
C4.5: In default mode, C4.5 makes binary splits for numerical attributes and k-way splits for cate-
gorical attributes. ID3 used the Information Gain criterion. C4.5 normally uses the Gain Ratio, with
the caveat that the chosen splitting rule must also have an Information Gain that is stronger than
the average Information Gain. Numerical attributes are first sorted. However, instead of selecting
the midpoints, C4.5 considers each of the values themselves as the split points. If the sorted values
are (x1 , x2 , . . . , xn ), then the candidate rules are {x > x1 , x > x2 , . . . , x > xn−1 }. Optionally, instead
of splitting categorical attributes into k branches, one branch for each different attribute value, they
can be split into b branches, where b is a user-designated number. To implement this, C4.5 first per-
forms the k-way split and then greedily merges the most similar children until there are b children
remaining.
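As a small illustration of this treatment of numerical attributes, the sketch below lists the candidate binary thresholds that a C4.5-style procedure would consider (the sorted values themselves, all but the largest), together with the midpoint alternative mentioned above for comparison; the sample ages are made up.

```python
def candidate_numeric_splits(values):
    """Return candidate thresholds for rules of the form 'x > v'."""
    distinct = sorted(set(values))
    c45_thresholds = distinct[:-1]                                     # x > x_1, ..., x > x_{n-1}
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]  # midpoint variant
    return c45_thresholds, midpoints

ages = [23, 31, 31, 45, 52, 67]
print(candidate_numeric_splits(ages))
```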
CART: Earlier, the authors experimented with a minimum improvement rule: |ΔFGini | > β. How-
ever, this was abandoned because there was no right value for β. While the immediate benefit of
splitting a node may be small, the cumulative benefit from multiple levels of splitting might be
substantial. In fact, even if splitting the current node offers only a small reduction of impurity, its
children could offer a much larger reduction. Consequently, CART’s only stopping condition is a
minimum node size. Instead, it strives to perform high-quality pruning.
C4.5: In ID3, a Chi-squared test was used as a stopping condition. Seeing that this sometimes caused
overpruning, Quinlan removed this stopping condition in C4.5. Like CART, the tree is allowed to
grow unfettered, with only one size constraint: any split must have at least two children containing
at least nmin training items each, where nmin defaults to 2.
C4.5 uses pessimistic error pruning with the binomial confidence interval. Quinlan himself acknowl-
edges that C4.5 may be applying statistical concepts loosely [66]. As a heuristic method, however,
it works about as well as any other method. Its major advantage is that it does not require a separate
dataset for pruning. Moreover, it allows a subtree to be replaced not only by a single node but also
by the most commonly selected child.
2. Partitioning the Training Set: Once a splitting criterion is selected, to which child node will the incomplete data items be assigned?
3. Making Predictions: If making class predictions for items with missing attribute values, how
will they proceed down the tree?
Recent studies have compared different techniques for handling missing values in decision trees [18,
67]. CART and C4.5 take very different approaches for addressing these concerns.
CART: CART assumes that missing values are sparse. It calculates and compares splitting criteria using only the data that contain values for the relevant attributes. However, if the top-scoring splitting criterion is on an attribute with some missing values, then CART selects the best surrogate split that has no missing attribute values. For any splitting rule S, a surrogate rule S′ generates similar partitioning results, and the chosen surrogate is the S′ that is most strongly correlated with S. For each actual rule selected, CART computes and saves a small ordered list of top surrogate rules. Recall that CART performs binary splits. For dataset X_i, p_11 is the fraction of items that is classified by both S and S′ as state 1, and p_00 is the fraction that is classified by both as state 0. The probability that a random item is classified the same way by both S and S′ is p(S, S′) = p_11(S, S′) + p_00(S, S′). This measure is further refined in light of the discriminating power of S. The final predictive measure of association between S′ and S is

λ(S′ | S) = [min(p_0(S), p_1(S)) − (1 − p(S, S′))] / min(p_0(S), p_1(S)).   (4.18)
The scaling factor min(p0 (S), p1 (S)) estimates the probability that S correctly classifies an item.
Due to the use of surrogates, we need not worry about how to partition items with missing attribute
values.
When trying to predict the class of a new item, if a missing attribute is encountered, CART
looks for the best surrogate rule for which the data item does have an attribute value. This rule is
used instead. So, underneath the primary splitting rules in a CART tree are a set of backup rules.
This method seems to depend much on there being highly correlated attributes. In practice, decision
trees can have some robustness; even if an item is misdirected at one level, there is some probability
that it will be correctly classified in a later level.
C4.5: To compute the splitting criteria, C4.5 computes the information gain using only the items
with known attribute values, then weights this result by the fraction of total items that have known
values for A. Let X_A be the data subset of X that has known values for attribute A. Then

ΔF_infoGain(S) = (|X_A| / |X|) (H_{X_A}(Y) − H_{X_A}(Y|S)).   (4.19)
Additionally, splitInfo(S), the denominator in C4.5's Gain Ratio, is adjusted so that the set of items with unknown values is considered a separate partition. If S previously made a k-way split, splitInfo(S) is computed as though it were a (k + 1)-way split.
To partition the training set, C4.5 spreads the items with unknown values according to the
same distribution ratios as the items with known attribute values. In the example in Figure 4.3, we
have 25 items. Twenty of them have known colors and are partitioned as in Figure 4.3(a). The 5
remaining items are distributed in the same proportions as shown in Figure 4.3(b). This generates
fractional training items. In subsequent tree levels, we may make fractions of fractions. We now
have a probabilistic tree.
If such a node is encountered while classifying unlabeled items, then all children are selected,
not just one, and the probabilities are noted. The prediction process will end at several leaf nodes,
which collectively describe a probability distribution. The class with the highest probability can be
used for the prediction.
Windowing in C4.5: Windowing is the name that Quinlan uses for a sampling technique that was
originally intended to speed up C4.5’s tree induction process. In short, a small sample, the window,
of the training set is used to construct an initial decision tree. The initial tree is tested using the
remaining training data. A portion of the misclassified items are added to the window, a new tree is
inducted, and the non-window training data are again used for testing. This process is repeated until
the decision tree’s error rate falls below a target threshold or the error rate converges to a constant
level.
In early versions, the initial window was selected uniformly randomly. By the time of this 1993
book, Quinlan had discovered that selecting the window so that the different class values were repre-
sented about equally yielded better results. Also by that time, computer memory size and processor
speeds had improved enough so that the multiple rounds with windowed data were not always faster
than a single round with all the data. However, it was discovered that the multiple rounds improve
classification accuracy. This is logical, since the windowing algorithm is a form of boosting.
Multivariate Rules in CART: Breiman et al. investigated the use of multivariate splitting criteria,
decision rules that are a function of more than one variable. They considered three different forms:
linear combinations, Boolean combinations, and ad hoc combinations. CART considers combining
only numerical attributes. For this discussion, assume A = (A_1, . . . , A_d) are all numerical. In the univariable case, for A_i we search the |A_i| − 1 possible split points for the one that yields the best value of the splitting criterion. Using a geometric analogy, if d = 3, we have a 3-dimensional data space. A univariable
rule, such as xi < C, defines a half-space that is orthogonal to one of the axes. However, if we lift
the restriction that the plane is orthogonal to an axis, then we have the more general half-space
∑i ci xi < C. Note that a coefficient ci can be positive or negative. Thus, to find the best multivari-
able split, we want to find the values of C and c = (c1 , . . . , cd ), normalized to ∑i c2i = 1, such that
ΔF is optimized. This is clearly an expensive search. There are many search heuristics that could
accelerate the search, but they cannot guarantee to find the globally best rule. If a rule using all d
different attributes is found, it is likely that some of the attributes will not contribute much. The
weakest coefficients can be pruned out.
CART also offers to search for Boolean combinations of rules. It is limited to rules containing
only conjunction or disjuction. If Si is a rule on attribute Ai , then candidate rules have the form
S = S1 ∧ S2 ∧ · · · ∧ Sd or S = S1 ∨ S2 ∨ · · · ∨ Sd . A series of conjunctions is equivalent to walking
down a branch of the tree. A series of disjunctions is equivalent to merging children. Unlike linear
combinations of rules that offer possible splits that are unavailable with univariate splits, Boolean
combinations do not offer a new capability. They simply compress what would otherwise be a large,
bushy decision tree.
The ad hoc combination is a manual pre-processing to generate new attributes. Rather than a
specific computational technique, this is an acknowledgment that the given attributes might not
have good linear correlation with the class variable, but that humans sometimes can study a small
dataset and have helpful intuitions. We might see that a new intermediate function, say the log or
square of any existing parameter, might fit better.
None of these features has been widely adopted in modern decision trees. In the end, a
standard univariate decision tree induction algorithm can always create a tree to classify a training
set. The tree might not be as compact or as accurate on new data as we would like, but more often
than not, the results are competitive with those of other classification techniques.
One of the first decision tree construction methods for disk-resident datasets was SLIQ [53].
To find splitting points for a numerical attribute, SLIQ requires separation of the input dataset into
attribute lists and sorting of attribute lists associated with a numerical attribute. An attribute list in
SLIQ has a record-id and attribute value for each training record. To be able to determine the records
associated with a non-root node, a data-structure called a class list is also maintained. For each train-
ing record, the class list stores the class label and a pointer to the current node in the tree. The need
for maintaining the class list limits the scalability of this algorithm. Because the class list is ac-
cessed randomly and frequently, it must be maintained in main memory. Moreover, in parallelizing
the algorithm, it needs to be either replicated, or a high communication overhead is incurred.
A somewhat related approach is SPRINT [69]. SPRINT also requires separating the dataset into attribute lists and sorting the attribute lists associated with numerical attributes. The attribute
lists in SPRINT store the class label for the record, as well as the record-id and attribute value.
SPRINT does not require a class list data structure. However, the attribute lists must be partitioned
and written back when a node is partitioned. Thus, there may be a significant overhead for rewriting
a disk-resident data set. Efforts have been made to reduce the memory and I/O requirements of
SPRINT [41, 72]. However, they do not guarantee the same precision from the resulting decision
tree, and do not eliminate the need for writing-back the datasets.
In 1998, Gehrke proposed RainForest [31], a general framework for scaling decision tree con-
struction. It can be used with any splitting criteria. We provide a brief overview below.
Small AVC group: This is primarily comprised of AVC sets for all categorical attributes. Since the
number of distinct elements for a categorical attribute is usually not very large, the size of these
AVC sets is small. In addition, SPIES also adds the AVC sets for numerical attributes that only
have a small number of distinct elements. These are built and treated in the same fashion as in the
RainForest approach.
Concise AVC group: The range of numerical attributes that have a large number of distinct elements
in the dataset is divided into intervals. The number of intervals and how the intervals are constructed
are important parameters to the algorithm. The original SPIES implementation uses equal-width
intervals. The concise AVC group records the class histogram (i.e., the frequency of occurrence of
each class) for each interval.
Partial AVC group: Based upon the concise AVC group, the algorithm computes a subset of the
values in the range of the numerical attributes that are likely to contain the split point. The partial
AVC group stores the class histogram for the points in the range of a numerical attribute that has
been determined to be a candidate for being the split condition.
SPIES uses two passes to efficiently construct the above AVC groups. The first pass is a quick
Sampling Step. Here, a sample from the dataset is used to estimate the small AVC group and the concise AVC group for the
numerical attributes. Based on these, it obtains an estimate of the best (highest) gain, denoted as g′. Then, using g′, the intervals that do not appear likely to include the split point are pruned. The
second pass is the Completion Step. Here, the entire dataset is used to construct complete versions
of the three AVC subgroups. The partial AVC groups will record the class histogram for all of the
points in the unpruned intervals.
After that, the best gain g from these AVC groups can be obtained. Because the pruning is
based upon only an estimate of small and concise AVC groups, false pruning may occur. However,
false pruning can be detected using the updated values of small and concise AVC groups during the
completion step. If false pruning has occurred, SPIES can make another pass on the data to construct
partial AVC groups for points in falsely pruned intervals. The experimental evaluation shows SPIES
significantly reduces the memory requirements, typically by 85% to 95%, and that false pruning
rarely happens.
In Figure 4.4(c), we show the concise AVC set for the Age attribute, assuming 10-year ranges.
The table size depends on the selected range size. Compare its size to the RainForest AVC set in Figure 4.4(a). For discrete attributes and for numerical attributes with a small number of distinct values, RainForest and SPIES generate the same small AVC tables, as in Figure 4.4(b).
Other scalable decision tree construction algorithms have been developed over the years; the
representatives include BOAT [30] and CLOUDS [2]. BOAT uses a statistical technique called
bootstrapping to reduce decision tree construction to as few as two passes over the entire dataset.
In addition, BOAT can handle insertions and deletions of the data. CLOUDS is another algorithm
that uses intervals to speed up processing of numerical attributes [2]. However, CLOUDS’ method
does not guarantee the same level of accuracy as one would achieve by considering all possible
numerical splitting points (though in their experiments, the difference is usually small). Further,
CLOUDS always requires two scans over the dataset for partitioning the nodes at one level of
the tree. More recently, SURPASS [47] makes use of linear discriminants during the recursive
partitioning process. The summary statistics (like AVC tables) are obtained incrementally. Rather
than using summary statistics, [74] samples the training data, with confidence levels determined by
PAC learning theory.
One remaining challenge lies in combining bootstrapping, which implies sequential improvement, with distributed processing.
To define the E-score used by the ID3 family of incremental algorithms, let
• $n^p$: the number of positive records;
• $n^n$: the number of negative records;
• $n^p_{ij}$: the number of positive records with value $x_{ij}$ for attribute $A_i$;
• $n^n_{ij}$: the number of negative records with value $x_{ij}$ for attribute $A_i$.
Then
$$E(A_i) = \sum_{j=1}^{|A_i|} \frac{n^p_{ij} + n^n_{ij}}{n^p + n^n}\, I\!\left(n^p_{ij}, n^n_{ij}\right), \qquad (4.20)$$
with
$$I(x, y) = \begin{cases} 0 & \text{if } x = 0 \text{ or } y = 0, \\[4pt] -\dfrac{x}{x+y}\log\dfrac{x}{x+y} - \dfrac{y}{x+y}\log\dfrac{y}{x+y} & \text{otherwise.} \end{cases}$$
In Algorithm 4.3, we can see that whenever an erroneous splitting attribute is found at v (Line
10), ID4 simply removes all the subtrees rooted at v’s immediate children (Line 11), and computes
the correct splitting attribute A∗v (Line 12).
Clearly, ID4 is not efficient because it discards the entire subtree whenever a new A∗v is found; moreover, this behavior can render certain concepts unlearnable by ID4 even though they can be induced by ID3. Utgoff
introduced two improved algorithms: ID5 [77] and ID5R [78]. In particular, ID5R guarantees it will
produce the same decision tree that ID3 would have if presented with the same training items.
In Algorithm 4.4, when a splitting test is needed at node v, an arbitrary attribute Ao ∈ Av is chosen; then, according to the counts n_ijy(v), the optimal splitting attribute A∗v is calculated based on the E-score. If A∗v ≠ Ao, the splitting attribute A∗v is pulled up from all of v's subtrees to v (Line 10), and all of its subtrees are recursively updated in the same way (Lines 11 and 13).
The fundamental difference between ID4 (Algorithm 4.3) and ID5R (Algorithm 4.4) is that
when ID5R finds a wrong subtree, it restructures the subtree (Lines 11 and 13) instead of discarding
it and replacing it with a leaf node for the current splitting attribute. The restructuring process in
Algorithm 4.4 is called the pull-up procedure. The general pull-up procedure is as follows, and
illustrated in Figure 4.6. In Figure 4.6, left branches satisfy the splitting tests, and right ones do not.
(a) Recursively pull the attribute A∗v to the root of each immediate subtree of v. Convert any
leaves to internal nodes as necessary, choosing A∗v as splitting attribute.
(b) Transpose the subtree rooted at v, resulting in a new subtree with A∗v at the root, and the
old root attribute Ao at the root of each immediate subtree of v.
There are several other works that fall into the ID3 family. A variation for multivariate splits
appears in [81], and an improvement of this work appears in [79], which is able to handle numerical
attributes. Having achieved an arguably efficient technique for incrementally restructuring a tree,
Utgoff applies this technique to develop Direct Metric Tree Induction (DMTI). DMTI leverages fast
tree restructuring to fashion an algorithm that can explore more options than traditional greedy top-
down induction [80]. Kalles [42] speeds up ID5R by estimating the minimum number of training
items for a new attribute to be selected as the splitting attribute.
VFDT (Very Fast Decision Tree learner) is based on the Hoeffding tree, a decision tree learning
method. The intuition of the Hoeffding tree is that to find the best splitting attribute it is sufficient
to consider only a small portion of the training items available at a node. To achieve this goal, the
Hoeffding bound is utilized. Basically, given a real-valued random variable r having range R, if we
have observed n values for this random variable, and the sample mean is r̄, then the Hoeffding bound
states that, with probability 1 − δ, the true mean of r is at least r̄ − ε, where
$$\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}. \qquad (4.21)$$
Based on the above analysis, if at one node we find that F̄(Ai ) − F̄(A j ) ≥ ε, where F̄ is the
splitting criterion, and Ai and A j are the two attributes with the best and second best F̄ respectively,
then Ai is the correct choice with probability 1 − δ. Using this novel observation, the Hoeffding tree
algorithm is developed (Algorithm 4.5).
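The following is a small sketch of the split test implied by Equation (4.21): split on the currently best attribute once its observed advantage over the runner-up exceeds ε. The values of R, δ, and the gains in the demo are illustrative assumptions, not taken from the original VFDT experiments.

# Sketch of the Hoeffding-bound split test used by VFDT-style learners.
# R is the range of the split criterion (e.g., log2 of the number of classes
# for information gain); the gain values and delta below are illustrative.
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, R, delta, n):
    # Split once the observed advantage of the best attribute over the
    # runner-up exceeds epsilon, i.e., it is the true best with prob. 1 - delta.
    return (best_gain - second_gain) >= hoeffding_bound(R, delta, n)

print(hoeffding_bound(R=1.0, delta=1e-6, n=2000))        # ~0.059
print(should_split(0.32, 0.24, R=1.0, delta=1e-6, n=2000))  # True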
In Algorithm 4.5, the n_ijy counts are sufficient to calculate F̄. Initially, the decision tree T only contains a leaf node v1 (Line 1), and v1 is labeled with the most frequent class. Each incoming item ⟨x, y⟩ is first classified into a leaf node v through T (Line 5). If the items in v are from more
than one class, then v is split according to the Hoeffding bound (Line 8). The key property of the
Hoeffding tree is that under realistic assumptions (see [19] for details), it is possible to guarantee
that the generated tree is asymptotically close to the one produced by a batch learner.
When dealing with streaming data, one practical problem that needs considerable attention is
concept drift, which does not satisfy the assumption of VFDT: that the sequential data is a random
sample drawn from a stationary distribution. For example, the behavior of online shopping customers may change from weekdays to weekends and from season to season. CVFDT [33] has been
developed to deal with concept drift.
CVFDT utilizes two strategies: a sliding window W of training items, and alternate subtrees
ALT (v) for each internal node v. The decision tree records the statistics for the |W | most recent
unique training items. More specifically, instead of learning a new model from scratch when a new training item ⟨x, y⟩ arrives, CVFDT increments the sufficient statistics n_ijy at the corresponding nodes for the new item and decrements the counts for the oldest record ⟨x_o, y_o⟩ in the window.
Periodically, CVFDT reevaluates the classification quality and replaces a subtree with one of the
alternate subtrees if needed.
An outline of CVFDT is shown in Algorithm 4.6. When a new record ⟨x, y⟩ is received, we classify it according to the current tree. We record in a structure L every node in the tree T and in the alternate subtrees ALT that is encountered by ⟨x, y⟩ (Line 7). Lines 8 to 14 keep the sliding window up to date. If the tree's number of data items has now exceeded the maximum window size (Line 9), we remove the oldest data item from the statistics (Line 11) and from W (Line 12). ForgetExample traverses the decision tree and decrements the corresponding counts n_ijy for ⟨x_o, y_o⟩ in any node of T or ALT. We then add ⟨x, y⟩ to the tree, increasing the n_ijy statistics according to L (Line
14). Finally, once every f items, we invoke Procedure CheckSplitValidity, which scans T and ALT
looking for better splitting attributes for each internal node. It revises T and ALT as necessary.
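To make the window bookkeeping concrete, the following is a simplified sketch of how the n_ijy statistics can be kept consistent with the |W| most recent items. Routing items to tree nodes and to alternate subtrees is omitted, and all names and data are illustrative rather than CVFDT's actual code.

# Simplified sketch of CVFDT-style sliding-window bookkeeping: keep the n_ijy
# counts consistent with the |W| most recent items. Routing to tree nodes and
# alternate subtrees is omitted; all counts live at one hypothetical node.
from collections import deque, defaultdict

class WindowedCounts:
    def __init__(self, max_window):
        self.max_window = max_window
        self.window = deque()                  # the |W| most recent items
        self.n_ijy = defaultdict(int)          # (attribute, value, class) -> count

    def add(self, x, y):
        for attr, value in x.items():          # increment for the new item
            self.n_ijy[(attr, value, y)] += 1
        self.window.append((x, y))
        if len(self.window) > self.max_window: # forget the oldest item
            xo, yo = self.window.popleft()
            for attr, value in xo.items():
                self.n_ijy[(attr, value, yo)] -= 1

w = WindowedCounts(max_window=3)
for item in [({"color": "red"}, "+"), ({"color": "blue"}, "-"),
             ({"color": "red"}, "+"), ({"color": "red"}, "-")]:
    w.add(*item)
print(dict(w.n_ijy))    # reflects only the 3 most recent items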
More recent works continue to follow this family. Both VFDT and CVFDT
only consider discrete attributes; the VFDTc [26] system extends VFDT in two major directions:
1) VFDTc is equipped with the ability to deal with numerical attributes; and 2) a naı̈ve Bayesian
classifier is utilized in each leaf. Jin [40] presents a numerical interval pruning (NIP) approach to
efficiently handle numerical attributes, and speeds up the algorithm by reducing the sample size.
Further, Bifet [6] proposes a more efficient decision tree learning method than [26] by replacing
naïve Bayes with perceptron classifiers, while maintaining competitive accuracy. Hashemi [32] develops a flexible decision tree (FlexDT) based on fuzzy logic to deal with noise and missing values
in streaming data. Liang [48] builds a decision tree for uncertain streaming data.
Notice that there is also more general work on handling concept drift for streaming data. Gama [27, 28] detects drifts by tracking the classification errors on the training items, based on the PAC framework.
In Algorithm 4.7, we can see that when a new chunk S arrives, not only is a new classifier C trained on it, but the weights of the previously trained classifiers are also recomputed, in order to handle concept drift.
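The following is an illustrative sketch of chunk-based, accuracy-weighted ensemble maintenance in the spirit of [84]; it is not the book's Algorithm 4.7, and the train() helper, the weighting by raw accuracy on the newest chunk, and the toy data are assumptions made only for the example.

# Illustrative sketch of accuracy-weighted ensemble maintenance for streaming
# chunks (in the spirit of [84]); not the book's Algorithm 4.7. A "classifier"
# here is any callable x -> label; train() is supplied by the caller.
def update_ensemble(ensemble, chunk_X, chunk_y, train, max_members=10):
    ensemble.append([train(chunk_X, chunk_y), 0.0])      # learn on the new chunk
    for member in ensemble:                              # re-weight every member by
        clf, _ = member                                  # its accuracy on that chunk
        member[1] = sum(clf(x) == y for x, y in zip(chunk_X, chunk_y)) / len(chunk_y)
    ensemble.sort(key=lambda m: m[1], reverse=True)      # stale members sink
    del ensemble[max_members:]
    return ensemble

def predict(ensemble, x):
    votes = {}
    for clf, w in ensemble:
        votes[clf(x)] = votes.get(clf(x), 0.0) + w       # weighted voting
    return max(votes, key=votes.get)

# Toy usage: each base classifier simply predicts its own chunk's majority class.
train = lambda X, y: (lambda x, label=max(set(y), key=y.count): label)
ens = update_ensemble([], [[0], [1], [1]], ["a", "a", "b"], train)
ens = update_ensemble(ens, [[0], [1], [1]], ["b", "b", "b"], train)
print(predict(ens, [1]))                                 # "b" dominates after the drift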
Kolter et al. [45] propose another ensemble classifier to detect concept drift in streaming data.
Similar to [84], their method dynamically adjusts the weight of each base classifier according to its
accuracy. In contrast, their method has a weight threshold parameter to remove bad classifiers, and it trains a new classifier for the new item if the existing ensemble classifier fails to identify the correct class.
Fan [21] notices that the previous works did not answer the following questions: When would
the old data help detect concept drift and which old data would help? To answer these questions, the
author develops a method to sift the old data and proposes a simple cross-validation decision tree
ensemble method.
Gama [29] extends the Hoeffding-based Ultra Fast Forest of Trees (UFFT) [25] system to han-
dle concept drift in streaming data. In a similar vein, Abdulsalam [1] extends the random forests
ensemble method to run in amortized O(1) time, handles concept drift, and judges whether a suffi-
cient quantity of labeled data has been received to make reasonable predictions. This algorithm also
handles multiple class values. Bifet [7] provides a new experimental framework for detecting con-
cept drift and two new variants of bagging methods: ADWIN Bagging and Adaptive-Size Hoeffding
Tree (ASHT). In [5], Bifet et al. combine Hoeffding trees using stacking to classify streaming data,
in which each Hoeffding tree is built using a subset of the item attributes, and ADWIN is utilized both by the perceptron meta-classifier, to reset its learning rate, and by the ensemble members, to detect concept drift.
4.6 Summary
Compared to other classification methods [46], the following stand out as advantages of decision
trees:
• Easy to interpret. A small decision tree can be visualized, used, and understood by a layper-
son.
• Handling both numerical and categorical attributes. Classification methods that rely on
weights or distances (neural networks, k-nearest neighbor, and support vector machines) do
not directly handle categorical data.
• Fast. Training time is competitive with other classification methods.
The shortcomings tend to be less obvious and require a little more explanation. The following
are some weaknesses of decision trees:
• Not well-suited for multivariate partitions. Support vector machines and neural networks are
particularly good at making discriminations based on a weighted sum of all the attributes.
However, this very feature makes them harder to interpret.
• Not sensitive to relative spacing of numerical values. Earlier, we cited decision trees’ ability
to work with either categorical or numerical data as an advantage. However, most split criteria
do not use the numerical values directly to measure a split’s goodness. Instead, they use the
values to sort the items, which produces an ordered sequence. The ordering then determines
the candidate splits; a set of n ordered items has n − 1 splits.
• Greedy approach may focus too strongly on the training data, leading to overfit.
• Sensitivity of induction time to data diversity. To determine the next split, decision tree in-
duction needs to compare every possible split. As the number of different attribute values
increases, so does the number of possible splits.
Despite some shortcomings, the decision tree continues to be an attractive choice among classifi-
cation methods. Improvements continue to be made: more accurate and robust split criteria, ensem-
ble methods for even greater accuracy, incremental methods that handle streaming data and concept
drift, and scalability features to handle larger and distributed datasets. A simple concept that began
well before the invention of the computer, the decision tree remains a valuable tool in the machine
learning toolkit.
Bibliography
[1] Hanady Abdulsalam, David B. Skillicorn, and Patrick Martin. Classification using streaming
random forests. IEEE Transactions on Knowledge and Data Engineering, 23(1):22–36, 2011.
[2] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. Clouds: A decision tree classifier for large
datasets. In Proceedings of the Fourth International Conference on Knowledge Discovery and
Data Mining, KDD’98, pages 2–8. AAAI, 1998.
[3] K. Bache and M. Lichman. UCI machine learning repository. http://archive.ics.uci.
edu/ml, 2013.
[4] Amir Bar-Or, Assaf Schuster, Ran Wolff, and Daniel Keren. Decision tree induction in high
dimensional, hierarchically distributed databases. In Proceedings of the Fifth SIAM Interna-
tional Conference on Data Mining, SDM’05, pages 466–470. SIAM, 2005.
[5] Albert Bifet, Eibe Frank, Geoffrey Holmes, and Bernhard Pfahringer. Accurate ensembles
for data streams: Combining restricted hoeffding trees using stacking. Journal of Machine
Learning Research: Workshop and Conference Proceedings, 13:225–240, 2010.
[6] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Eibe Frank. Fast perceptron decision
tree learning from evolving data streams. Advances in Knowledge Discovery and Data Mining,
pages 299–310, 2010.
[7] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New
ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining, KDD’09, pages 139–148.
ACM, 2009.
[8] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and
Regression Trees. Chapman & Hall/CRC, 1984.
[9] Carla E. Brodley and Paul E. Utgoff. Multivariate decision trees. Machine Learning, 19(1):45–
77, 1995.
[10] Harry Buhrman and Ronald De Wolf. Complexity measures and decision tree complexity: A
survey. Theoretical Computer Science, 288(1):21–43, 2002.
[11] Doina Caragea, Adrian Silvescu, and Vasant Honavar. A framework for learning from dis-
tributed data using sufficient statistics and its application to learning decision trees. Interna-
tional Journal of Hybrid Intelligent Systems, 1(1):80–89, 2004.
[12] Xiang Chen, Minghui Wang, and Heping Zhang. The use of classification trees for bioinfor-
matics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):55–63,
2011.
[13] David A. Cieslak and Nitesh V. Chawla. Learning decision trees for unbalanced data. In Pro-
ceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery
in Databases - Part I, ECML PKDD’08, pages 241–256. Springer, 2008.
[14] S. L. Crawford. Extensions to the cart algorithm. International Journal of Man-Machine
Studies, 31(2):197–217, September 1989.
[15] Barry De Ville. Decision Trees for Business Intelligence and Data Mining: Using SAS Enter-
prise Miner. SAS Institute Inc., 2006.
[16] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1):107–113, 2008. Originally presented at OSDI ’04: 6th
Symposium on Operating Systems Design and Implementation.
[17] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework
to understand and improve C4.5. In Proceedings of the Thirteenth International Conference
on Machine Learning, ICML’96, pages 96–104. Morgan Kaufmann, 1996.
[18] Yufeng Ding and Jeffrey S. Simonoff. An investigation of missing data methods for classi-
fication trees applied to binary response data. The Journal of Machine Learning Research,
11:131–170, 2010.
[19] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD’00, pages 71–80. ACM, 2000.
[20] Floriana Esposito, Donato Malerba, Giovanni Semeraro, and J. Kay. A comparative analysis
of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(5):476–491, 1997.
[21] Wei Fan. Systematic data selection to mine concept-drifting data streams. In Proceedings of
the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 128–137. ACM, 2004.
[22] Usama M. Fayyad and Keki B. Irani. The attribute selection problem in decision tree gener-
ation. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI’92,
pages 104–110. AAAI Press, 1992.
[23] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy.
Advances in Knowledge Discovery and Data Mining. The MIT Press, February 1996.
[24] Jerome H. Friedman. A recursive partitioning decision rule for nonparametric classification.
IEEE Transactions on Computers, 100(4):404–408, 1977.
[25] João Gama, Pedro Medas, and Ricardo Rocha. Forest trees for on-line data. In Proceedings of
the 2004 ACM Symposium on Applied Computing, SAC’04, pages 632–636. ACM, 2004.
[26] João Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed
data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD’03, pages 523–528. ACM, 2003.
[27] João Gama and Gladys Castillo. Learning with local drift detection. In Advanced Data Mining
and Applications, volume 4093, pages 42–55. Springer-Verlag, 2006.
[28] João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. Learning with drift detection.
Advances in Artificial Intelligence–SBIA 2004, pages 66–112, 2004.
[29] João Gama, Pedro Medas, and Pedro Rodrigues. Learning decision trees from dynamic data
streams. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC’05, pages
573–577. ACM, 2005.
[30] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. Boat– optimistic
decision tree construction. ACM SIGMOD Record, 28(2):169–180, 1999.
[31] Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. Rainforest — A framework for
fast decision tree construction of large datasets. In Proceedings of the International Conference
on Very Large Data Bases, VLDB’98, pages 127–162, 1998.
[32] Sattar Hashemi and Ying Yang. Flexible decision tree for data stream classification in the pres-
ence of concept change, noise and missing values. Data Mining and Knowledge Discovery,
19(1):95–131, 2009.
[33] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD’01, pages 97–106. ACM, 2001.
[34] Earl Busby Hunt, Janet Marin, and Philip J. Stone. Experiments in Induction. Academic Press,
New York, London, 1966.
[35] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is np-
complete. Information Processing Letters, 5(1):15–17, 1976.
[36] Ruoming Jin and Gagan Agrawal. A middleware for developing parallel data mining im-
plementations. In Proceedings of the 2001 SIAM International Conference on Data Mining,
SDM’01, April 2001.
[37] Ruoming Jin and Gagan Agrawal. Shared memory parallelization of data mining algorithms:
Techniques, programming interface, and performance. In Proceedings of the Second SIAM
International Conference on Data Mining, SDM’02, pages 71–89, April 2002.
[38] Ruoming Jin and Gagan Agrawal. Shared memory parallelization of decision tree construc-
tion using a general middleware. In Proceedings of the 8th International Euro-Par Parallel
Processing Conference, Euro-Par’02, pages 346–354, Aug 2002.
[39] Ruoming Jin and Gagan Agrawal. Communication and memory efficient parallel decision tree
construction. In Proceedings of the Third SIAM International Conference on Data Mining,
SDM’03, pages 119–129, May 2003.
[40] Ruoming Jin and Gagan Agrawal. Efficient decision tree construction on streaming data. In
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD’03, pages 571–576. ACM, 2003.
[41] Mahesh V. Joshi, George Karypis, and Vipin Kumar. Scalparc: A new scalable and efficient
parallel classification algorithm for mining large datasets. In First Merged Symp. IPPS/SPDP
1998: 12th International Parallel Processing Symposium and 9th Symposium on Parallel and
Distributed Processing, pages 573–579. IEEE, 1998.
[42] Dimitrios Kalles and Tim Morris. Efficient incremental induction of decision trees. Machine
Learning, 24(3):231–242, 1996.
[43] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learn-
ing algorithms. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing,
STOC’96, pages 459–468. ACM, 1996.
[44] Michael Kearns and Yishay Mansour. A fast, bottom-up decision tree pruning algorithm with
near-optimal generalization. In Proceedings of the 15th International Conference on Machine
Learning, pages 269–277, 1998.
[45] Jeremy Z. Kolter and Marcus A. Maloof. Dynamic weighted majority: A new ensemble
method for tracking concept drift. In Proceedings of the Third IEEE International Confer-
ence on Data Mining, 2003., ICDM’03, pages 123–130. IEEE, 2003.
[46] S. B. Kotsiantis. Supervised machine learning: A review of classification techniques. In Pro-
ceedings of the 2007 Conference on Emerging Artificial Intelligence Applications in Computer
Engineering, pages 3–24. IOS Press, 2007.
[47] Xiao-Bai Li. A scalable decision tree system and its application in pattern recognition and
intrusion detection. Decision Support Systems, 41(1):112–130, 2005.
[48] Chunquan Liang, Yang Zhang, and Qun Song. Decision tree for dynamic and uncertain data
streams. In 2nd Asian Conference on Machine Learning, volume 3, pages 209–224, 2010.
[49] Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih. A comparison of prediction accuracy, com-
plexity, and training time of thirty-three old and new classification algorithms. Machine Learn-
ing, 40(3):203–228, 2000.
[50] Christiane Ferreira Lemos Lima, Francisco Marcos de Assis, and Cleonilson Protásio de
Souza. Decision tree based on Shannon, Rényi and Tsallis entropies for intrusion tolerant
systems. In Proceedings of the Fifth International Conference on Internet Monitoring and
Protection, ICIMP’10, pages 117–122. IEEE, 2010.
[51] R. López de Mántaras. A distance-based attribute selection measure for decision tree induc-
tion. Machine Learning, 6(1):81–92, 1991.
[52] Yishay Mansour. Pessimistic decision tree pruning based on tree size. In Proceedings of the
14th International Conference on Machine Learning, pages 195–201, 1997.
[53] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. Sliq: A fast scalable classifier for data
mining. In Proceedings of the Fifth International Conference on Extending Database Tech-
nology, pages 18–32, Avignon, France, 1996.
[54] John Mingers. Expert systems—Rule induction with statistical data. Journal of the Opera-
tional Research Society, 38(1): 39–47, 1987.
[55] John Mingers. An empirical comparison of pruning methods for decision tree induction. Ma-
chine Learning, 4(2):227–243, 1989.
[56] James N. Morgan and John A. Sonquist. Problems in the analysis of survey data, and a pro-
posal. Journal of the American Statistical Association, 58(302):415–434, 1963.
[57] G. J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98-
184, School of Computer Science, Carnegie Mellon University, 1998.
[58] Tim Niblett. Constructing decision trees in noisy domains. In I. Bratko and N. Lavrac, editors,
Progress in Machine Learning. Sigma, 1987.
[59] Jie Ouyang, Nilesh Patel, and Ishwar Sethi. Induction of multiclass multifeature split decision
trees from distributed data. Pattern Recognition, 42(9):1786–1794, 2009.
[60] Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In Eighth International
Workshop on Artificial Intelligence and Statistics, pages 105–112. Morgan Kaufmann, 2001.
[61] Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo. Planet: Massively
parallel learning of tree ensembles with mapreduce. Proceedings of the VLDB Endowment,
2(2):1426–1437, 2009.
[62] J. Ross Quinlan. Learning efficient classification procedures and their application to chess
end-games. In Machine Learning: An Artificial Intelligence Approach. Tioga Publishing
Company, 1983.
[63] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, March 1986.
[64] J. Ross Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies,
27(3):221–234, 1987.
[65] J. Ross Quinlan and Ronald L. Rivest. Inferring decision trees using the minimum description
length principle. Information and Computation, 80:227–248, 1989.
[66] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[67] Maytal Saar-Tsechansky and Foster Provost. Handling missing values when applying classifi-
cation models. Journal of Machine Learning Research, 8:1623–1657, 2007.
[68] Jeffrey C. Schlimmer and Douglas Fisher. A case study of incremental concept induction.
In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 495–501.
Morgan Kaufmann, 1986.
[69] John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for
data mining. In Proceedings of the 22nd International Conference on Very Large Databases
(VLDB), pages 544–555, September 1996.
[70] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal,
27(3):379–423, July/October 1948.
[71] Harold C. Sox and Michael C. Higgins. Medical Decision Making. Royal Society of Medicine,
1988.
[72] Anurag Srivastava, Eui-Hong Han, Vipin Kumar, and Vineet Singh. Parallel formulations of
decision-tree classification algorithms. In Proceedings of the 1998 International Conference
on Parallel Processing, ICPP’98, pages 237–261, 1998.
[73] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (sea) for large-scale clas-
sification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD’01, pages 377–382. ACM, 2001.
[74] Hyontai Sug. A comprehensively sized decision tree generation method for interactive data
mining of very large databases. In Advanced Data Mining and Applications, pages 141–148.
Springer, 2005.
[75] Umar Syed and Golan Yona. Using a mixture of probabilistic decision trees for direct predic-
tion of protein function. In Proceedings of the Seventh Annual International Conference on
Research in Computational Molecular Biology, RECOMB’03, pages 289–300. ACM, 2003.
[76] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classi-
fiers. Connection Science, 8(3-4):385–404, 1996.
[77] Paul E. Utgoff. Id5: An incremental id3. In Proceedings of the Fifth International Conference
on Machine Learning, ICML’88, pages 107–120, 1988.
[78] Paul E. Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161–186,
1989.
[79] Paul E. Utgoff. An improved algorithm for incremental induction of decision trees. In Pro-
ceedings of the Eleventh International Conference on Machine Learning, ICML’94, pages
318–325, 1994.
[80] Paul E Utgoff, Neil C Berkman, and Jeffery A Clouse. Decision tree induction based on
efficient tree restructuring. Machine Learning, 29(1):5–44, 1997.
[81] Paul E. Utgoff and Carla E. Brodley. An incremental method for finding multivariate splits for
decision trees. In Proceedings of the Seventh International Conference on Machine Learning,
ICML’90, pages 58–65, 1990.
[82] Paul A. J. Volf and Frans M.J. Willems. Context maximizing: Finding mdl decision trees. In
Symposium on Information Theory in the Benelux, volume 15, pages 192–200, 1994.
[83] Chris S. Wallace and J. D. Patrick. Coding decision trees. Machine Learning, 11(1):7–22,
1993.
[84] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams
using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, pages 226–235. ACM, 2003.
[85] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda,
Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Stein-
bach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and
Information Systems, 14(1):1–37, 2008.
[86] Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. Stochastic gradient boosted
distributed decision trees. In Proceedings of the 18th ACM Conference on Information and
Knowledge Management, CIKM’09, pages 2061–2064. ACM, 2009.
[87] Olcay Taner Yıldız and Onur Dikmen. Parallel univariate decision trees. Pattern Recognition
Letters, 28(7):825–832, 2007.
[88] M. J. Zaki, Ching-Tien Ho, and Rakesh Agrawal. Parallel classification for data mining on
shared-memory multiprocessors. In Proceedings of the Fifteenth International Conference on
Data Engineering, ICDE’99, pages 198–205, May 1999.
Chapter 5
Rule-Based Classification
Xiao-Li Li
Institute for Infocomm Research
Singapore
[email protected]
Bing Liu
University of Illinois at Chicago (UIC)
Chicago, IL
[email protected]
5.1 Introduction
Classification is an important problem in machine learning and data mining. It has been widely
applied in many real-world applications. Traditionally, to build a classifier, a user first needs to col-
lect a set of training examples/instances that are labeled with predefined classes. A classification
algorithm is then applied to the training data to build a classifier that is subsequently employed
to assign the predefined classes to test instances (for evaluation) or future instances (for applica-
tion) [1].
In the past three decades, many classification techniques, such as Support Vector Machines
(SVM) [2], Neural Network (NN) [3], Rule Learning [9], Naı̈ve Bayesian (NB) [5], K-Nearest
Neighbour (KNN) [6], and Decision Tree [4], have been proposed. In this chapter, we focus on
rule learning, also called rule-based classification. Rule learning is valuable due to the following
advantages.
1. Rules are very natural for knowledge representation, as people can understand and interpret
them easily.
2. Classification results are easy to explain. Based on a rule database and input data from the
user, we can explain which rule or set of rules is used to infer the class label so that the user
is clear about the logic behind the inference.
3. Rule-based classification models can be easily enhanced and complemented by adding new
rules from domain experts based on their domain knowledge. This has been successfully
implemented in many expert systems.
4. Once rules are learned and stored in a rule database, we can subsequently use them to classify new instances rapidly, by building index structures over the rules and searching for
relevant rules efficiently.
5. Rule-based classification systems are competitive with other classification algorithms, and in many cases even outperform them.
Now, let us discuss rules in more detail. Clearly, rules can represent information or knowledge in a simple and effective way, and they provide a data model that human beings can understand very well. Rules are represented in logical form as IF-THEN statements, e.g., a commonly used rule can be expressed as IF condition THEN conclusion, where the IF part is called the "antecedent" or "condition" and the THEN part is called the
“consequent” or “conclusion.” It basically means that if the condition of the rule is satisfied, we
can infer or deduce the conclusion. As such, we can also write the rule in the format condition → conclusion. The condition typically consists of one or more feature tests (e.g., feature_1 > value_2, feature_5 = value_3) connected using logic operators (i.e., "and," "or," "not").
For example, we can have a rule like: If sex=“female” and (35 < age <45) and (salary=“high” or
creditlimit=“high”), then potential customer =“yes”. In the context of classification, the conclusion
can be the class label, e.g., “yes” (potential customer =“yes”) or “no” (potential customer =“no”).
In other words, a rule can be used for classification if its consequent is one of the predefined classes and its antecedent (or precondition) contains conditions on various features and their corresponding values.
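To make the rule representation concrete, the following is a minimal sketch of a rule as a conjunction of feature tests plus a class conclusion, together with a check of whether an instance satisfies it. The attribute names mirror the running example (the disjunction over salary/creditlimit is simplified to a single test), and the encoding is illustrative rather than that of any particular rule learner.

# Minimal sketch of an IF-THEN rule: a conjunction of feature tests plus a
# class conclusion. Attribute names mirror the running example; the encoding
# itself is illustrative, not that of a specific rule learner.
rule = {
    "conditions": [("sex", "==", "female"), ("age", ">", 35),
                   ("age", "<", 45), ("salary", "==", "high")],
    "conclusion": ("potential_customer", "yes"),
}

OPS = {"==": lambda a, b: a == b, ">": lambda a, b: a > b, "<": lambda a, b: a < b}

def covers(rule, instance):
    # The rule fires only if every condition test passes on the instance.
    return all(OPS[op](instance[attr], val) for attr, op, val in rule["conditions"])

customer = {"sex": "female", "age": 40, "salary": "high", "creditlimit": "low"}
if covers(rule, customer):
    print("infer:", rule["conclusion"])     # ('potential_customer', 'yes')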
Many machine learning and data mining techniques have been proposed to automatically learn
rules from data. In the computer science domain, rule-based systems have been extensively used as an
effective way to store knowledge and to do logic inference. Furthermore, based on the given inputs
and the rule database, we can manipulate the stored knowledge for interpreting the generated outputs
as well as for decision making. In particular, rules and rule-based classification systems have been
widely applied in various expert systems, such as fault diagnosis for aerospace and manufacturing,
medical diagnosis, highly interactive or conversational Q&A systems, mortgage expert systems, etc.
In this chapter, we will introduce some representative techniques for rule-based classification,
which includes two key components, namely 1) rule induction, which learns rules from a given
training database/set automatically; and 2) classification, which makes use of the learned rule set
for classification. Particularly, we will study two popular rule-based classification approaches: (1)
rule induction and (2) classification based on association rule mining.
1. Rule induction. Many rule induction/learning algorithms, such as [9], [10], [11], [12], [13],
[14], have adopted the sequential covering strategy, whose basic idea is to learn a list of rules
from the training data sequentially, one by one. That is, once a new rule has been learned, the algorithm removes the training examples that the rule covers, i.e., those training
examples that satisfy the rule antecedent. This learning process, i.e., learn a new rule and
remove its covered training data, is repeated until rules can cover the whole training data or
no new rule can be learned from the remaining training data.
2. Classification based on association rule mining. Association rule mining [16] is perhaps
the most important model invented by data mining researchers. Many efficient algorithms
have been proposed to detect association rules from large amounts of data. One special type
of association rules is called class association rules (CARs). The consequent of a CAR must
be a class label, which makes it attractive for classification purposes. We will describe Clas-
sification Based on Associations (CBA) — the first system that uses association rules for
classification [30], as well as a number of more recent algorithms that perform classification
based on mining and applying association rules.
1. Let RULE LIST be the empty list; // initialize an empty rule set in the beginning
2. Repeat until Best CPX is nil or E is empty;
3. Let Best CPX be Find Best Complex(E)
4. If Best CPX is not nil
5. Then let E′ be the examples covered by Best CPX
6. Remove from E the examples E′ covered by Best CPX
7. Let C be the most common class of examples in E′
8. Add the rule “If Best CPX then the class is C” to the end of RULE LIST .
9. Output RULE LIST .
In this algorithm, we need two inputs, namely, E and SELECTORS. E is the training data and
SELECTORS is the set of all possible selectors that test each attribute and its corresponding values.
Set RULE LIST is the decision list storing the final output list of rules; it is initialized to the empty list in step 1. Best CPX records the best rule detected in each iteration. The function Find Best Complex(E) learns the Best CPX; we will elaborate on the details of this function later in Section 5.2.2. Steps 2 to 8 form a Repeat-loop, which learns the best rule and refines the training
data. In particular, in each Repeat-loop, once a non-empty rule is learned from the data (steps 3 and
4), all the training examples that are covered by the rule are removed from the data (steps 5 and
6). The rule discovered, consisting of the rule condition and the most common class label of the
examples covered by the rule, is added at the end of RULE LIST (steps 7 and 8).
The stopping criterion for the Repeat-loop (steps 2–8) is either E = ∅ (no training examples left for learning) or Best CPX is nil (no new rule can be learned from the training data). After the rule learning process completes (i.e., one of the two stopping criteria is satisfied), a
default class c is inserted at the end of RULE LIST . This step is performed because of the following
two reasons: 1) there may still be some training examples that are not covered by any rule as no
good rule can be mined from them, and 2) some test instances may not be covered by any rule in
the RULE LIST and thus we could not classify them without a default-class. Clearly, with this
default-class, we are able to classify any test instance. The default-class is typically the majority
class among all the classes in the training data, which will be used only if no rule learned from
the training data can be used to classify a test example. The final list of rules, together with the
default-class, is represented as: ⟨r1, r2, . . . , rk, default-class⟩.
Finally, using the list of rules for classification is rather straightforward. For a given test instance,
we simply try each rule in the list sequentially, starting from r1 , then r2 (if r1 cannot cover the test
instance), r3 (if both r1 and r2 cannot cover the test instance) and so on. The class consequent of the
first rule that covers this test instance is assigned as the class of the test instance. If no rule (from
r1 , r2 , . . . , rk ) applies to the test instance, the default-class is applied.
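The classification procedure just described can be sketched in a few lines: scan the ordered rule list and return the consequent of the first covering rule, falling back to the default class. The rule and instance encodings below are simplified and illustrative.

# Sketch of classifying with an ordered rule list plus a default class: the
# first rule that covers the test instance determines its class.
def classify(rule_list, default_class, instance, covers):
    for conditions, label in rule_list:
        if covers(conditions, instance):
            return label
    return default_class                     # no rule covers the instance

rules = [
    ([("outlook", "sunny"), ("humidity", "high")], "no"),
    ([("outlook", "overcast")], "yes"),
]
covers = lambda conds, inst: all(inst.get(a) == v for a, v in conds)
print(classify(rules, "yes", {"outlook": "sunny", "humidity": "high"}, covers))  # no
print(classify(rules, "yes", {"outlook": "rain"}, covers))                       # default: yes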
Different from the CN2 algorithm, which learns each rule without pre-fixing a class, RIPPER learns
all rules for each class individually. In particular, only after rule learning for one class is completed,
it moves on to the next class. As such, all rules for each class appear together in the rule deci-
sion list. The sequence of rules within each individual class is not important, but the rule subsets for different classes are ordered, and this ordering does matter. The algorithm usually mines rules for the least fre-
quent/minority/rare class first, then the second minority class, and so on. This process ensures that
some rules are learned for rare or minority classes. Otherwise, they may be dominated by frequent
or majority classes, and we would end up with no rules for the minority classes. The RIPPER rule induction algorithm, which is also based on sequential covering, proceeds as follows.
In this algorithm, the data set D is split into two subsets, namely, Pos and Neg, where Pos
contains all the examples of class c from D, and Neg the rest of the examples in D (see step 3),
i.e., in a one-vs.-others manner. Here c ∈ C is the current working class of the algorithm, which
is initialized as the least frequent class in the first iteration. As we can observe from the algorithm,
steps 2 to 12 form a For-loop, which goes through all the classes one by one, starting from the minority
class. That is why this method is called Ordered Classes, from the least frequent class to the most
frequent class. For each class c, we have an internal While-loop from steps 4 to 11 that includes
the rule learning procedure, i.e., perform the Learn-One-Rule() function to learn a rule Rule in step
5; insert the learned Rule at the end of RuleList in step 8; remove examples covered by Rule from
(Pos, Neg) in step 9. Note that two stopping conditions for the internal rule learning of each class c are given
in step 4 and step 6, respectively — we stop the while-loop for the internal rule learning process
for the class c when the Pos becomes empty or no new rule can be learned by function Learn-One-
Rule(Pos, Neg, c) from the remaining training data.
The other parts of the algorithm are very similar to those of the CN2 algorithm. The Learn-One-
Rule() function will be described later in Section 5.2.2.
Finally, applying the RuleList for classification is done in a similar way as for the CN2 algo-
rithm. The only difference is that the order of all rules within each class is not important anymore
since they share the same class label, which will lead to the same classification results. Since the
rules are now ranked by classes, given a test example, we try the rules of the least frequent class first, and so on, until we find a rule that covers the test example and use it to perform the classification; otherwise, we apply the default-class.
Note that the evaluation() function shown below employs the entropy function, the same as in decision tree learning, to evaluate how good BestCond is.
Function evaluation(BestCond, D)
1. D′ ← the subset of training examples in D covered by BestCond;
2. entropy(D′) = $-\sum_{j=1}^{|C|} \Pr(c_j) \times \log_2 \Pr(c_j)$;
3. Output −entropy(D′) // since entropy measures impurity.
Specifically, in the first step of the evaluation() function, it obtains an example set D′ that consists of the subset of training examples in D covered by BestCond. In the second step, it calculates the entropy entropy(D′) based on the class probability distribution: Pr(c_j) is the probability of class c_j in the data set D′, defined as the number of examples of class c_j in D′ divided by the total number of examples in D′. In the entropy computation, 0 × log 0 = 0. The unit of entropy is the bit. We now provide some examples to help understand the entropy measure.
Assume the data set D′ has only two classes, namely the positive class (c_1 = P) and the negative class (c_2 = N). For the following three different probability distributions, the entropy values are computed as follows:
1. The data set D′ has 50% positive examples (i.e., Pr(P) = 0.5) and 50% negative examples (i.e., Pr(N) = 0.5). Then, entropy(D′) = −0.5 × log_2 0.5 − 0.5 × log_2 0.5 = 1.
2. The data set D′ has 20% positive examples (i.e., Pr(P) = 0.2) and 80% negative examples (i.e., Pr(N) = 0.8). Then, entropy(D′) = −0.2 × log_2 0.2 − 0.8 × log_2 0.8 = 0.722.
3. The data set D′ has 100% positive examples (i.e., Pr(P) = 1) and no negative examples (i.e., Pr(N) = 0). Then, entropy(D′) = −1 × log_2 1 − 0 × log_2 0 = 0.
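A small sketch that reproduces these three entropy values (using the convention 0 × log 0 = 0) is given below.

# Sketch of the entropy measure used by evaluation(): entropy of the class
# distribution among the examples covered by a candidate condition.
import math

def entropy(class_probs):
    return -sum(p * math.log2(p) for p in class_probs if p > 0)   # 0 * log 0 = 0

print(entropy([0.5, 0.5]))   # 1.0 bit   (maximally impure)
print(entropy([0.2, 0.8]))   # ~0.722 bits
print(entropy([1.0, 0.0]))   # 0.0 bits  (pure)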
From the three scenarios above, we can observe that as the class distribution becomes purer (i.e., all or most of the examples belong to a single class), the entropy value becomes smaller. In fact, it can be shown that for this binary case (with only positive and negative classes), the entropy attains its maximum value of 1 bit when Pr(P) = 0.5 and Pr(N) = 0.5, and its minimum value of 0 bits when all the data in D′ belong to one class. The entropy thus measures the amount of impurity in the class distribution. Clearly, we would like a rule with low (or even zero) entropy, since this means that the rule leads predominantly to one class, and we can therefore apply it for classification with more confidence.
In addition to the entropy function, other evaluation functions can also be applied. Note that when BestCond = ∅, it covers every example in D, i.e., D′ = D.
Learn-One-Rule
In the Learn-One-Rule() function, a rule is first generated and then subjected to a pruning process.
This method starts by splitting the positive and negative training data Pos and Neg, into growing and
pruning sets. The growing sets, GrowPos and GrowNeg, are used to generate a rule, called BestRule.
The pruning sets, PrunePos and PruneNeg, are used to prune the rule because BestRule may overfit
the training data with too many conditions, which could lead to poor predictive performance on the
unseen test data. Note that PrunePos and PruneNeg are actually validation sets, which are used to assess the rule's generalization ability. If a rule has a 50% error rate on the validation sets, then it does not generalize well, and thus the function does not output it.
Suppose the rule grown so far is R : av_1, . . . , av_k → class, where each av_j (j = 1, 2, . . . , k) is a condition (an attribute-value pair). By adding a new condition av_{k+1}, we obtain the rule R+ : av_1, . . . , av_k, av_{k+1} → class. The evaluation function for
R+ is the following information gain criterion (which is different from the gain function used in
decision tree learning):
$$\text{gain}(R, R^+) = p_1 \times \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right) \qquad (5.1)$$
where p0 (respectively, n0 ) is the number of positive (or negative) examples covered by R in Pos
(or Neg), and p1 (or n1 ) is the number of positive (or negative) examples covered by R+ in Pos (or
Neg). R+ is better than R if R+ covers a higher proportion of positive examples than R. The
GrowRule() function simply returns the rule R+ that maximizes the gain.
PruneRule() function: To prune a rule, we consider deleting every subset of conditions from
the BestRule, and choose the deletion that maximizes:
$$v(\text{BestRule}, \text{PrunePos}, \text{PruneNeg}) = \frac{p - n}{p + n}, \qquad (5.2)$$
where p (respectively n) is the number of examples in PrunePos (or PruneNeg) covered by the
current rule (after a deletion).
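The following sketch computes the two measures directly from coverage counts: the information-gain criterion of Equation (5.1) used while growing a rule, and the pruning metric of Equation (5.2). The counts in the demo are made up for illustration and assume nonzero coverage.

# Sketch of the two rule-quality measures from Equations (5.1) and (5.2).
# The coverage counts in the demo are illustrative and assumed nonzero.
import math

def grow_gain(p0, n0, p1, n1):
    # p0/n0: positives/negatives covered by R; p1/n1: covered by R+.
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def prune_value(p, n):
    # p/n: positives/negatives in the pruning set covered by the pruned rule.
    return (p - n) / (p + n)

print(grow_gain(p0=40, n0=40, p1=30, n1=10))   # adding the condition helps (~17.55)
print(prune_value(p=25, n=5))                  # ~0.667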
The basic idea of classification based on association rule mining is to first find strong correlations or associations between frequent itemsets and class labels using association rule mining techniques. These rules can subsequently be used to classify test examples. Empirical evaluations have demonstrated that classification based on association rules is competitive with state-of-the-art classification models, such as decision trees, naïve Bayes, and rule induction algorithms.
In Section 5.3.1, we will present the concepts of association rule mining and an algorithm to
automatically detect rules from transaction data in an efficient way. Then, in Section 5.3.2, we
will introduce mining class association rules, where the class labels (or target attributes) are on the
right-hand side of the rules. Finally, in Section 5.3.3, we describe some techniques for performing
classification based on discovered association rules.
Consider, for example, the rule Bread → Milk [support = 10%, confidence = 80%]. The rule basically means that we can use Bread to infer Milk, or that customers who buy Bread also frequently buy Milk. However, it should be read together with two important quality metrics, namely
support and confidence. Particularly, the support of 10% for this rule means that 10% of customers
buy Bread and Milk together, or 10% of all the transactions under analysis show that Bread and
Milk are purchased together. In addition, a confidence of 80% means that those who buy Bread also
buy Milk 80% of the time. This rule indicates that item Bread and item Milk are closely associated.
Note in this rule, these two metrics are actually used to measure the rule strength, which will be
defined in Section 5.3.1.1. Typically, association rules are considered interesting or useful if they
satisfy two constraints, namely their support is larger than a minimum support threshold and their
confidence is larger than a minimum confidence threshold. Both thresholds are typically provided by users, and finding good thresholds may require users to inspect the mining results and vary the thresholds multiple times.
Clearly, this association rule mining model is very generic and can be used in many other ap-
plications. For example, in the context of the Web and text documents, it can be used to find word
co-occurrence relationships and Web usage patterns. It can also be used to find frequent substructures such as subgraphs, subtrees, or sublattices [19], as long as these substructures frequently occur together in the given dataset.
Note standard association rule mining, however, does not consider the sequence or temporal
order in which the items are purchased. Sequential pattern mining takes the sequential information
into consideration. An example of a sequential pattern is “5% of customers buy bed first, then
mattress and then pillows”. The items are not purchased at the same time, but one after another.
Such patterns are useful in Web usage mining for analyzing click streams in server logs [20].
Support of a rule: The support of a rule, X → Y, is the percentage of transactions in T that contain X ∪ Y. With n denoting the total number of transactions in T, it is computed as follows:
$$\text{support} = \frac{(X \cup Y).\text{count}}{n}. \qquad (5.3)$$
Note that support is a very important measure for filtering out those non-frequent rules that
have a very low support since they occur in a very small percentage of the transactions and their
occurrences could be simply due to chance.
Next, we define the confidence of a rule.
Confidence of a rule: The confidence of a rule, X → Y , is the percentage of transactions in T
that contain X that also contain Y, computed as follows:
$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}. \qquad (5.4)$$
Confidence thus determines the predictability and reliability of a rule. In other words, if the
confidence of a rule is too low, then one cannot reliably infer or predict Y given X. Clearly, a rule
with low predictability is not very useful in practice.
Given a transaction set T , the problem of mining association rules from T is to discover all
association rules in T that have support and confidence greater than or equal to the user-specified
minimum support (represented by minsup) and minimum confidence (represented by minconf).
Here we emphasize the keyword “all”, i.e., association rule mining requires the completeness
of rules. The mining algorithms should not miss any rule that satisfies both minsup and minconf
constraints.
Finally, we illustrate the concepts mentioned above using a concrete example, shown in Table
5.1.
We are given a small transaction database, which contains a set of seven transactions T =
(t1 ,t2 , . . . ,t7 ). Each transaction ti (i= 1, 2, . . ., 7) is a set of items purchased in a basket in a
supermarket by a customer. The set I is the set of all items sold in the supermarket, namely,
{Bee f , Boots,Cheese,Chicken,Clothes, Milk}.
Given two user-specified constraints, i.e. minsup = 30% and minconf = 80%, we aim to find all
the association rules from the transaction database T. The following is one of the association rules that we can obtain from T:
{Chicken, Clothes} → {Milk} [sup = 3/7, conf = 3/3],
where sup = 3/7 is the support of the rule and conf = 3/3 is its confidence.
Let us now explain how to calculate the support and confidence for this transaction database.
Out of the seven transactions (i.e. n = 7 in Equation 5.3), there are three of them, namely, t5 ,t6 ,t7
contain itemset {Chicken, Clothes} ∪ {Milk} (i.e., (X ∪Y ).count=3 in Equation 5.3). As such, the
support of the rule, sup=(X ∪Y ).count/n=3/7=42.86%, which is larger than the minsup =30% (i.e.
42.86% > 30%).
On the other hand, out of the three transactions t5 ,t6 ,t7 containing the condition item-
set {Chicken, Clothes} (i.e., X .count=3), they also contain the consequent item {Milk}, i.e.,
{Chicken, Clothes} ∪ {Milk}= (X ∪ Y ).count = 3. As such, the confidence of the rule, conf =
(X ∪Y ).count/X.count = 3/3 = 100%, which is larger than the minconf = 80% (100% > 80%). As
this rule satisfies both the given minsup and minconf, it is thus valid.
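The computation above is easy to verify programmatically. Since Table 5.1 is not reproduced here, the seven transactions below are a hypothetical stand-in chosen to be consistent with the counts quoted in the text (three of the seven transactions contain Chicken, Clothes, and Milk together).

# Sketch of Equations (5.3) and (5.4). The seven transactions are a
# hypothetical stand-in for Table 5.1, consistent with the counts in the text.
T = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Clothes", "Milk"},
]

def support(X, Y, T):
    return sum(1 for t in T if X | Y <= t) / len(T)

def confidence(X, Y, T):
    return sum(1 for t in T if X | Y <= t) / sum(1 for t in T if X <= t)

X, Y = {"Chicken", "Clothes"}, {"Milk"}
print(round(support(X, Y, T), 4))    # 0.4286  (>= minsup = 0.30)
print(confidence(X, Y, T))           # 1.0     (>= minconf = 0.80)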
We notice that there are potentially other valid rules; for example, a valid rule may have two items in its consequent.
Over the past 20 years, a large number of association rule mining algorithms have been pro-
posed. They mainly improve the mining efficiency since it is critical to have an efficient algorithm
to deal with large scale transaction databases in many real-world applications. Please refer to [49]
for detailed comparison across various algorithms in terms of their efficiencies.
Note that no matter which algorithm is applied, the final results, i.e., the association rules mined, are the same, based on the definition of association rules. In other words, given a transaction
data set T , as well as a minimum support minsup and a minimum confidence minconf, the set of
association rules occurring in T is uniquely determined. All the algorithms should find the same
set of rules although their computational efficiencies and memory requirements could be different.
In the next section, we introduce the best-known mining algorithm, namely the Apriori algorithm, proposed by Agrawal in [17].
1. It starts with the seed set of itemsets Fk−1 found to be frequent in the (k-1)-th pass. It then uses
this seed set to generate candidate itemsets Ck (step 4), which are potential frequent itemsets.
This step uses the candidate-gen() procedure, as shown in Algorithm 4.
2. The transaction database is then passed over again and the actual support of each candidate
itemset c in Ck is counted (steps 5-10). Note that it is not necessary to load the entire data into
memory before processing. Instead, at any time point, only one transaction needs to reside in
memory. This is a very important feature of the Apriori algorithm as it makes the algorithm
scalable to huge data sets that cannot be loaded into memory.
3. At the end of the pass, it determines which of the candidate itemsets are actually frequent
(step 11).
The final output of the algorithm is the set F of all frequent itemsets (step 13) where set F
contains frequent itemsets with different sizes, i.e., frequent 1-itemsets, frequent 2-itemsets, . . . ,
frequent k-itemsets (k is the highest order of the frequent itemsets).
Next, we elaborate the key candidate-gen() procedure that is called in step 4. Candidate-gen ()
generates candidate frequent itemsets in two steps, namely the join step and the pruning step. Join
step (steps 2-6 in Algorithm 4): This step joins two frequent (k-1)-itemsets to produce a possible
candidate c (step 6). The two frequent itemsets f1 and f2 have exactly the same first k − 2 items (i.e., i_1, . . . , i_{k−2}) and differ only in the last item (i_{k−1} ≠ i′_{k−1}, steps 3–5). The joined k-itemset c is added to the set
of candidates Ck (step 7). Pruning step (steps 8–11): A candidate c from the join step may not be a final frequent itemset. This step determines whether all of the (k−1)-subsets of c (there are k of them) are in Fk−1. If any one of them is not in Fk−1, then c cannot be frequent according
to the downward closure property, and is thus deleted from Ck .
Finally, we will provide an example to illustrate the candidate-gen() procedure.
Given a set of frequent itemsets at level 3, F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}},
the join step (which generates candidates C4 for level 4) produces two candidate itemsets, {1, 2, 3, 4}
and {1, 3, 4, 5}. {1, 2, 3, 4} is generated by joining the first and the second itemsets in F3 as their first
and second items are the same. {1, 3, 4, 5} is generated by joining the third and the fourth itemsets
in F3 , i.e., {1, 3, 4} and {1, 3, 5}. {1, 3, 4, 5} is pruned because {3, 4, 5} is not in F3 .
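To make the join and pruning steps concrete, here is a minimal Python sketch of a candidate generation procedure, reproducing the example above; the function name candidate_gen and the tuple-based itemset representation are illustrative choices, not part of the chapter's pseudocode.

from itertools import combinations

def candidate_gen(F_prev):
    # F_prev: set of frequent (k-1)-itemsets, each represented as a sorted tuple.
    F_prev = {tuple(sorted(f)) for f in F_prev}
    candidates = set()
    for f1 in F_prev:
        for f2 in F_prev:
            # Join step: identical first k-2 items, different last items
            # (the f1[-1] < f2[-1] test avoids generating the same candidate twice).
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = f1 + (f2[-1],)
                # Pruning step: every (k-1)-subset of c must itself be frequent.
                if all(s in F_prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

# The example from the text: only {1, 2, 3, 4} survives; {1, 3, 4, 5} is pruned.
F3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(candidate_gen(F3))   # {(1, 2, 3, 4)}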
We now provide a running example of the whole Apriori algorithm based on the transactions
shown in Table 5.1. In this example, we have used minsup = 30%.
The Apriori algorithm first scans the transaction data to count the supports of individual items. Those
items whose supports are greater than or equal to 30% are regarded as frequent and are stored in set
F1 , namely frequent 1-itemsets.
In F1 , the number after each frequent itemset is the support count of the corresponding itemset.
For example, {Beef}:4 means that the itemset {Beef} has occurred in four transactions, namely
t1 ,t2 ,t4 , and t5 . A minimum support count of 3 is sufficient for being frequent (all the itemsets in F1
have sufficient supports ≥ 3).
We then perform the Candidate-gen procedure using F1 , which generates the following candidate
frequent itemsets C2 :
For each itemset in C2 , we need to determine if it is frequent by scanning the database again and
storing the frequent 2-itemsets in set F2 :
We have now completed the level-2 search (for all 2-itemsets). Similarly, we generate the candidate
frequent itemsets C3 via Candidate-gen procedure:
Finally, we count the frequency of {Chicken, Clothes, Milk} in the database and store it in F3 ,
given that its support is greater than the minimum support.
Note that since we only have one itemset in F3 , the algorithm stops, as we need at least two
itemsets to generate a candidate itemset for C4 . The Apriori algorithm is just a representative of a large
number of association rule mining algorithms that have been developed over the last 20 years. For
more algorithms, please see [19].
STEP 2: Association Rule Generation
As we mentioned earlier, the Apriori algorithm can generate
all frequent itemsets as well as all confident association rules. Interestingly, generating association
rules is fairly straightforward compared with frequent itemset generation. In fact, we generate all
association rules from frequent itemsets. For each frequent itemset f , we use all its non-empty
subsets to generate association rules. In particular, for each such subset β, β ⊆ f , we output a rule of
form (5.5) if the confidence condition in Equation 5.6 is satisfied.
( f − β) → β,   (5.5)

confidence = f.count / ( f − β).count ≥ minconf.   (5.6)
Note that f.count and ( f − β).count are the support counts of itemset f and itemset ( f − β),
respectively. According to Equation 5.3, the support of the rule is f.count/n, where n is the total number
of transactions in the transaction set T . Clearly, if f is frequent, then any of its non-empty subsets is
also frequent according to the downward closure property. In addition, all the support counts needed
for confidence computation in Equation 5.6, i.e., f .count and ( f − β).count, are available as we have
recorded the supports for all the frequent itemsets during the mining process, e.g., using the Apriori
algorithm. As such, there is no additional database scan needed for association rule generation.
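The following Python sketch illustrates this rule-generation step, under the assumption that the mining phase has already recorded a support count for every frequent itemset (including all subsets of larger frequent itemsets); the function name generate_rules and the frozenset representation are illustrative choices.

from itertools import combinations

def generate_rules(freq_counts, n, minconf):
    # freq_counts: dict mapping each frequent itemset (a frozenset) to its support count.
    # n: total number of transactions; minconf: minimum confidence threshold.
    rules = []
    for f, f_count in freq_counts.items():
        if len(f) < 2:
            continue
        items = sorted(f)
        # Enumerate every non-empty proper subset beta of f as a consequent (rule form 5.5).
        for r in range(1, len(items)):
            for beta in combinations(items, r):
                antecedent = frozenset(items) - frozenset(beta)
                confidence = f_count / freq_counts[antecedent]      # Equation 5.6
                if confidence >= minconf:
                    rules.append((antecedent, frozenset(beta), f_count / n, confidence))
    return rules   # each entry: (f - beta, beta, rule support, confidence)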
items in I, and they do not overlap. A class association rule (CAR) is an implication of
the form X → y, where X ⊆ I is a set of items and y is a class label.
Algorithm CAR-Apriori(T)
1. C1 ← init-pass(T ); // the first pass over T
2. F1 ← { f | f ∈ C1 , f .rulesupCount/n ≥ minsup}; // n is the no. of transactions in T
3. CAR1 ← { f | f ∈ F1 , f .rulesupCount/ f .condsupCount ≥ minconf };
4. For (k = 2; Fk−1 ≠ ∅; k++) do
5.     Ck ← CARcandidate-gen(Fk−1 );
6.     For each transaction t ∈ T do
7.         For each candidate c ∈ Ck do
8.             If c.condset is contained in t then // c is a subset of t
9.                 c.condsupCount++;
10.                If t.class = c.class then
11.                    c.rulesupCount++;
12.        EndFor
13.    EndFor
14.    Fk ← {c ∈ Ck | c.rulesupCount/n ≥ minsup};
15.    CARk ← { f | f ∈ Fk , f .rulesupCount/ f .condsupCount ≥ minconf };
16. EndFor
17. Output CAR ← ∪k CARk .
One important observation regarding ruleitem generation is that if a ruleitem/rule has a confi-
dence of 100%, then extending the ruleitem with more conditions, i.e., adding items to its condset,
will also result in rules with 100% confidence although their supports may drop with additional
items. In some applications, we may consider these subsequent rules with more conditions redun-
dant because these additional conditions do not provide any more information for classification. As
such, we should not extend such ruleitems in candidate generation for the next level (from k − 1 to
k), which can reduce the number of generated rules significantly. Of course, if desired, a redundancy-handling
procedure can easily be added to the CAR-Apriori algorithm to stop this unnecessary expansion.
Finally, the CARcandidate-gen() function is very similar to the candidate-gen() function in the
Apriori algorithm, and it is thus not included here. The main difference lies in that in CARcandidate-
gen(), ruleitems with the same class label are combined together by joining their condsets.
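As an illustration of the counting pass (steps 6 to 13) and the selection steps (14 and 15) above, the following Python sketch uses a simple dictionary representation for ruleitems; the field and function names are illustrative, not part of the original pseudocode.

def count_ruleitems(transactions, candidates):
    # transactions: list of (itemset, class_label) pairs, where itemset is a set of items.
    # candidates: list of dicts with keys 'condset' (a frozenset) and 'class'.
    for c in candidates:
        c['condsupCount'] = 0
        c['rulesupCount'] = 0
    for items, label in transactions:
        for c in candidates:
            if c['condset'] <= items:            # c.condset is contained in t
                c['condsupCount'] += 1
                if label == c['class']:          # t has the same class as c
                    c['rulesupCount'] += 1
    return candidates

def select_frequent_and_confident(candidates, n, minsup, minconf):
    # Steps 14-15: frequent ruleitems form Fk; the confident ones among them form CARk.
    Fk = [c for c in candidates if c['rulesupCount'] / n >= minsup]
    CARk = [c for c in Fk if c['rulesupCount'] / c['condsupCount'] >= minconf]
    return Fk, CARk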
We now give an example to illustrate the usefulness of CARs. Table 5.2 shows a sample loan
application dataset from a bank, which has four attributes, namely Age, Has job, Own house, and
Credit rating. The first attribute Age has three possible values, i.e., young, middle, and old. The
second attribute Has Job indicates whether an applicant has a job, with binary values: true (has a
job) and false (does not have a job). The third attribute Own house shows whether an applicant owns
a house (similarly, it has two values denoted by true and false). The fourth attribute, Credit rating,
has three possible values: fair, good, and excellent. The last column is the class/target attribute,
which shows whether each loan application was approved (denoted by Yes) or not (denoted by No)
by the bank.
Assuming a user-specified minimum support minsup = 2/15 = 13.3% and a minimum confi-
dence minconf = 70%, we can mine the above dataset to find the following rules that satisfy the two
constraints:
Own house = f alse, Has job = true → Class = Yes [sup=3/15, conf=3/3]
Own house = true → Class = Yes [sup=6/15, conf=6/6]
the training data and potentially perform well on the unseen test data. Pruning is also called gener-
alization as it makes rules more general and more applicable to test instances. Of course, we still
need to maintain high confidences of CARs during the pruning process so that we can achieve more
reliable and accurate classification results once the confident rules are applied. Readers can refer to
papers [30, 31] for details of some pruning methods.
Multiple Minimum Class Supports: In many real-life classification problems, the datasets
could have uneven or imbalanced class distributions, where majority classes cover a large proportion
of the training data, while other minority classes (rare or infrequent classes) only cover a very small
portion of the training data. In such a scenario, a single minsup may be inadequate for mining CARs.
For example, suppose we have a fraud detection dataset with two classes, C1 (the "normal" class) and
C2 (the "fraud" class). In this dataset, 99% of the data belong to the majority class C1 , and
only 1% of the data belong to the minority class C2 , i.e., we do not have many instances from “fraud
class.” If we set minsup = 1.5%, we may not be able to find any rule for the minority class C2 as
this minsup is still too high for minority class C2 . To address the problem, we need to reduce the
minsup, say set minsup = 0.2% so that we can detect some rules for class C2 . However, we may find
a huge number of overfitting rules for the majority class C1 because minsup = 0.2% is too low for
class C1 . The solution for addressing this problem is to apply multiple minimum class supports for
different classes, depending on their sizes. More specifically, we could assign a different minimum
class support minsupi for each class Ci , i.e., all the rules of class Ci must satisfy corresponding
minsupi . Alternatively, we can provide one single total minsup, denoted by total minsup, which is
then distributed to each class according to the class distribution:
minsupi = total minsup × (Number of Transactions in Ci / Total Number of Transactions in the Database).   (5.9)
The equation sets higher minsups for those majority classes while it sets lower minsups for those
minority classes.
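A small Python sketch of this proportional allocation follows; the class labels and the total_minsup value below are hypothetical and only mirror the fraud-detection scenario described above.

from collections import Counter

def class_minsups(class_labels, total_minsup):
    # Distribute one total minsup across classes in proportion to their sizes (Equation 5.9).
    counts = Counter(class_labels)
    n = len(class_labels)
    return {c: total_minsup * cnt / n for c, cnt in counts.items()}

labels = ['normal'] * 99 + ['fraud'] * 1
print(class_minsups(labels, total_minsup=0.01))
# approximately {'normal': 0.0099, 'fraud': 0.0001}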
Parameter Selection: The two parameters used in CARs mining are the minimum support and
the minimum confidence. While different minimum confidences may also be used for each class,
they do not affect the classification results much because the final classifier tends to use high confi-
dence rules. As such, one minimum confidence is usually sufficient. We thus are mainly concerned
with how to determine the best support minsupi for each class Ci . Similar to other classification
algorithms, we can apply the standard cross-validation technique to partition the training data into
n folds where n − 1 folds are used for training and the remaining 1 fold is used for testing (we can
repeat this n times so that we have n different combinations of training and testing sets). Then, we
can try different values for minsupi in the training data to mine CARs and finally choose the value
for minsupi that gives the best average classification performance on the test sets.
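A sketch of this parameter-selection loop is given below; train_and_score is a hypothetical user-supplied callable that mines CARs on the training folds with a given minsup, builds a classifier, and returns its accuracy on the held-out fold.

import random

def choose_minsup(data, candidate_minsups, train_and_score, n_folds=5, seed=0):
    # data: labeled training examples; candidate_minsups: values to try, e.g., [0.01, 0.02, 0.05].
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    best_value, best_score = None, float('-inf')
    for minsup in candidate_minsups:
        scores = []
        for i in range(n_folds):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(train_and_score(train, test, minsup))
        avg = sum(scores) / n_folds
        if avg > best_score:
            best_value, best_score = minsup, avg
    return best_value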
rules to cover the training data D. The set of selected rules, including a default class, is then used
as the classifier. The selection of rules is based on a total order defined on the rules in S. Given two
rules, ri and r j , we say ri ≻ r j (ri precedes r j , or ri has a higher precedence than r j ) if (1) the
confidence of ri is greater than that of r j ; or (2) their confidences are the same, but the support of ri is
greater than that of r j ; or (3) their confidences and supports are the same, but ri is generated earlier
than r j . The classifier built is a list of the form
⟨r1 , r2 , . . . , rk , default-class⟩,   (5.10)
where ri ∈ S and ri ≻ r j if j > i. When classifying a test case, the first rule that satisfies the case will be
used to classify it. If there is not a single rule that can be applied to the test case, it takes the default
class, i.e., default-class in Equation 5.10. A simplified version of the algorithm for building such
a classifier is given in the following algorithm. The classifier is the RuleList.
Algorithm CBA(S, D) // S: the sorted rule set; D: the training data
1. S = sort(S); // sorting is done according to the precedence ≻
2. RuleList = ∅; // the rule list classifier is initialized as an empty list
3. For each rule r ∈ S in sequence do
4.     If (D ≠ ∅) AND r classifies at least one example in D correctly Then
5.         delete from D all training examples covered by r;
6.         add r at the end of RuleList
7.     EndIf
8. EndFor
9. add the majority class as the default class at the end of RuleList
In Algorithm CBA, we first sort all the rules in S according to their precedence defined above.
Then we go through the rules one by one, from the highest precedence to the lowest, in the for-loop.
In particular, for each rule, we perform sequential covering in steps 3 to 8. Finally, we construct our
RuleList by appending the majority class as the default class, so that we can classify any test instance.
Combine Multiple Rules: Like the first method Use the Strongest Rule, this method does not
take any additional step to build a classifier. Instead, at the classification time, for each test instance,
the system first searches a subset of rules that cover the instance.
1. If all the rules in the subset have the same class, then the class is assigned to the test instance.
2. If the rules have different classes, then the system divides the rules into a number of groups
according to their classes, i.e., all rules from the same class are in the same group. The
system then compares the aggregated effects of the rule groups and finds the strongest group.
Finally, the class label of the strongest group is assigned to the test instance.
To measure the strength of each rule group, there again can be many possible ways. For example,
the CMAR system uses a weighted χ2 measure [31].
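A minimal sketch of this grouped-combination scheme is shown below; the rule interface and the group_strength callable are assumptions, with group_strength standing in for a measure such as CMAR's weighted χ2.

from collections import defaultdict

def classify_by_rule_groups(test_instance, rules, group_strength):
    # rules: objects offering covers(x) and a class_label attribute, plus whatever
    # statistics group_strength needs (e.g., confidence and support).
    groups = defaultdict(list)
    for r in rules:
        if r.covers(test_instance):
            groups[r.class_label].append(r)
    if not groups:
        return None   # the caller falls back to a default class
    # Assign the class label of the strongest group to the test instance.
    return max(groups, key=lambda c: group_strength(groups[c]))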
Class Association Rules as Features: In this method, rules are used as features to augment the
original data or simply form a new data set, which is subsequently fed to a traditional classification
algorithm, e.g., Support Vector Machines (SVM), Decision Trees (DT), Naïve Bayesian (NB), K-
Nearest Neighbour (KNN), etc.
To make use of CARs as features, only the conditional part of each rule is needed. For each
training and test instance, we will construct a feature vector where each dimension corresponds to
a specific rule. Specifically, if a training or test instance in the original data satisfies the conditional
part of a rule, then the value of the feature/attribute in its vector will be assigned 1; 0 otherwise. The
reason that this method is helpful is that CARs capture multi-attribute or multi-item correlations
with class labels. Many classification algorithms, like Naïve Bayesian (which assumes the features
are independent), do not take such correlations into consideration for classifier building. Clearly,
the correlations among the features can provide additional insights on how different feature combi-
nations can better infer the class label and thus they can be quite useful for classification. Several
applications of this method have been reported [32–35].
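The feature-construction step can be sketched as follows; the attribute-value items and the two example rule conditions are hypothetical and only illustrate the encoding.

def rules_to_features(instances, rule_conditions):
    # instances: list of item sets (each a set of attribute-value items).
    # rule_conditions: the conditional parts of the CARs, one frozenset of items per rule.
    conditions = [frozenset(c) for c in rule_conditions]
    return [[1 if cond <= inst else 0 for cond in conditions] for inst in instances]

conditions = [{'Own_house=true'}, {'Own_house=false', 'Has_job=true'}]
instances = [{'Own_house=true', 'Has_job=false', 'Credit=good'},
             {'Own_house=false', 'Has_job=true', 'Credit=fair'}]
print(rules_to_features(instances, conditions))   # [[1, 0], [0, 1]]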
Classification Using Normal Association Rules
Not only can class association rules be used for classification, but so can normal association rules.
For example, normal association rules are regularly employed in e-commerce Web sites for product
recommendations, which work as follows: When a customer purchases some products, the system
will recommend him/her some other related products based on what he/she has already purchased
as well as the previous transactions from all the customers.
Recommendation is essentially a classification or prediction problem. It predicts what a cus-
tomer is likely to buy. Association rules are naturally applicable to such applications. The classifi-
cation process consists of the following two steps:
1. The system first mines normal association rules using previous purchase transactions (the
same as market basket transactions). Note that, in this case, there are no fixed classes in the data
or in the mined rules. Any item can appear on the left-hand side or the right-hand side of a rule.
For recommendation purposes, usually only one item appears on the right-hand side of a rule.
2. At the prediction (or recommendation) stage, given a transaction (e.g., a set of items already
purchased by a given customer), all the rules that cover the transaction are selected. The
strongest rule is chosen and the item on the right-hand side of the rule (i.e., the consequent)
is then the predicted item and is recommended to the user. If multiple rules are very strong,
multiple items can be recommended to the user simultaneously.
This method is basically the same as the “use the strongest rule” method described above. Again,
the rule strength can be measured in various ways, e.g., confidence, χ2 test, or a combination of both
support and confidence [42]. Clearly, the other methods, namely, Select a Subset of the Rules to Build
a Classifier, and Combine Multiple Rules, can be applied as well.
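A sketch of the prediction stage is given below; the (antecedent, consequent, strength) triple representation is an illustrative assumption, and strength may be confidence, a χ2 score, or a combined measure as discussed above.

def recommend(basket, rules, top_n=1):
    # basket: set of items already purchased; rules: (antecedent_set, consequent_item, strength).
    covering = [r for r in rules if r[0] <= basket and r[1] not in basket]
    covering.sort(key=lambda r: r[2], reverse=True)
    recommendations = []
    for _, item, _ in covering:
        if item not in recommendations:
            recommendations.append(item)
        if len(recommendations) == top_n:
            break
    return recommendations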
The key advantage of using association rules for recommendation is that they can predict any
item since any item can be the class item on the right-hand side.
Traditional classification algorithms, on the other hand, only work with a single fixed class
attribute, and are not easily applicable to recommendations.
Finally, in recommendation systems, multiple minimum supports can be of significant help.
Otherwise, rare items will never be recommended, which causes the coverage problem. It is shown
in [43] that using multiple minimum supports can dramatically increase the coverage. Note that
rules from rule induction cannot be used for this recommendation purpose because the rules are not
independent of each other.
generation phase, CMAR mines the complete set of rules in the form of R : P → c, where P is a
pattern in the transaction training data set, and c is a class label, i.e., R is a class association rule.
The support and confidence of the rule R, namely sup(R) and con f (R), satisfy the user pre-defined
minimal support and confidence thresholds, respectively.
CMAR used an effective and scalable association rule mining algorithm based on the FP-growth
method [21]. As we know, existing association rule mining algorithms typically consist of two steps:
1) detect all the frequent patterns and 2) mine association rules that satisfy the confidence threshold
based on the mined frequent patterns. CMAR, on the other hand, has no separated rule generation
step. It constructs a class distribution-associated FP-tree and for every pattern, it maintains the dis-
tribution of various class labels among the examples matching the pattern, without extra overhead for
counting in the database. As such, once a frequent pattern is detected, rules with regard to the
pattern can be generated straightaway. In addition, CMAR makes use of the class label distribution
to prune. Given a frequent pattern P, let us assume c is the most dominant/majority class in the set
of examples matching P. If the number of examples having class label c and matching P is less than
the support threshold, then there is no need to search any superpattern (superset) P′ of P. This is
very clear, as any rule of the form P′ → c cannot satisfy the support threshold either, since the superset
P′ can have no larger support than the pattern P.
Once rules are mined from the given transaction data, CMAR builds a CR-tree to save space
in storing rules as well as to search for rules efficiently. CMAR also performs a rule pruning step
to remove redundant and noise rules. In particular, three principles were used for rule pruning,
including 1) use more general and high-confidence rules to prune those more specific and lower
confidence ones; 2) select only positively correlated rules based on χ2 testing; 3) prune rules based
on database coverage.
Finally, in the classification phase, for a given test example, CMAR extracts a subset of rules
matching the test example and predicts its class label by analyzing this subset of rules. CMAR first
groups rules according to their class labels and then finds the strongest group to perform classifica-
tion. It uses a weighted χ2 measure [31] to integrate both intra-group rule correlation and popularity.
In other words, if the rules in a group are highly positively correlated and have
good support, then the group has higher strength.
2. XRules
Different from CBA and CMAR, which are applied to transaction data sets consisting of multi-
dimensional records, XRules [44] builds a structural rule-based classifier for semi-
structured data, e.g., XML. In the training stage, it constructs structural rules that indicate what
kind of structural patterns in an XML document are closely related to a particular class label. In the
testing stage, it employs these structural rules to perform the structural classification.
Based on the definition of structural rules, XRules performs the following three steps during
the training stage: 1) Mine frequent structural rules specific to each class using its proposed XMiner
(which extends TreeMiner to find all frequent trees related to some class), with sufficient support
and strength. Note that users need to provide a per-class minimum support π^i_min for each class ci . 2) Prioritize
or order the rules in decreasing level of precedence as well as removing unpredictive rules. 3) De-
termine a special class called default-class, which will be used to classify those test examples when
none of the mined structural rules are applicable. After training, the classification model consists of
an ordered rule set, and a default-class.
Finally, the testing stage performs classification on the given test examples without class labels.
Given a test example S, there are two main steps for its classification, namely the rule retrieval
step, which finds all matching rules (stored in set R(S)) for S, and the class prediction step, which
combines the statistics from each matching rule in R(S) to predict the most likely class for S.
In particular, if R(S) = ∅, i.e., there are no matching rules, then the default class is assigned to S.
Otherwise, R(S) ≠ ∅; assume Ri (S) represents the matching rules in R(S) with class ci as their con-
sequents. XRules uses an average confidence method, i.e., for each class ci , it computes the average
rule strength over all the rules in Ri (S). If the average rule strength for class ci is sufficiently high,
the algorithm assigns class ci to the test example S. If the average rule strengths for all the classes
are very small, then the default class is again assigned to S.
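The class-prediction step can be sketched as below; the (class, strength) pair representation and the single threshold parameter are simplifying assumptions about how the average-confidence decision is made.

from collections import defaultdict

def predict_by_average_strength(matching_rules, default_class, threshold):
    # matching_rules: (class_label, rule_strength) pairs for the rules in R(S).
    if not matching_rules:
        return default_class            # R(S) is empty: fall back to the default class
    by_class = defaultdict(list)
    for label, strength in matching_rules:
        by_class[label].append(strength)
    averages = {c: sum(v) / len(v) for c, v in by_class.items()}
    best_class = max(averages, key=averages.get)
    # If even the best average strength is too small, the default class is used again.
    return best_class if averages[best_class] >= threshold else default_class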
5.4 Applications
In this section, we briefly introduce some applications of rule-based classification meth-
ods in text categorization [51], intrusion detection [74], diagnostic data mining [25], as well as gene
expression data mining [50].
w1 ∈ d and w2 ∈ d . . . and wk ∈ d.
Note that the context of a word w1 consists of a number of other words w2 , . . ., and wk , that need
to co-occur with w1 , but they may occur in any order, and in any location in document d.
The standard RIPPER algorithm was extended in the following two ways so that it can be better
used for text categorization.
1. Allow users to specify a loss ratio [65]. A loss ratio is defined as the ratio of the cost of
a false negative to the cost of a false positive. The objective of the learning is to minimize
misclassification cost on the unseen or test data. RIPPER can balance the recall and precision
for a given class by setting a suitable loss ratio. Specifically, during the RIPPER’s pruning and
optimization stages, suitable weights are assigned to false positive errors and false negative
errors, respectively.
2. In text classification, while a large corpus or a document collection contains many different
words, a particular document usually contains only a limited number of words. To save space for
representation, a document is represented as a single attribute a, with its value as the set of
words that appear in the document or a word list of the document, i.e., a = {w1 , w2 , ..., wn }.
The primitive tests (conditions) on a set-valued attribute a are in the form of wi ∈ a.
For rule construction, RIPPER repeatedly adds conditions to a rule, starting from r0 , which is initialized
with an empty antecedent. Specifically, at each iteration i, a single condition is added to the rule ri ,
producing an expanded rule ri+1 . The condition added to ri+1 is the one that maximizes information
gain with regard to ri . Given the set-valued attributes, RIPPER carries out the following two
steps to find a best condition to add:
1. For the current rule ri , RIPPER will iterate over the set of examples/documents S that are
covered by ri and record a word list W where each word wi ∈ W appears as an element/value
of attribute a in S. For each wi ∈ W , RIPPER also computes two statistics, namely pi and ni ,
which represent the number of positive and negative examples in S that contain wi , respec-
tively.
2. RIPPER will go over all the words wi ∈ W , and use pi and ni to calculate the information gain
for its condition wi ∈ a. We can then choose the condition that yields the largest information
gain and add it to ri to form rule ri+1 .
The above process of adding new literals/conditions continues until the rule does not cover
negative examples or until no condition has a positive information gain. Note that the process only
requires time linear in the size of S, facilitating its application to large text corpora.
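The two steps above can be sketched as follows; the FOIL-style information gain used here is a common formulation for RIPPER-like rule growing, but the exact gain function and the helper name best_word_condition are assumptions made for illustration.

import math

def best_word_condition(covered_docs, current_pos, current_neg):
    # covered_docs: (word_set, is_positive) pairs for the examples covered by the current rule r_i.
    # current_pos / current_neg: positives and negatives covered by r_i.
    if current_pos == 0:
        return None, 0.0

    # Step 1: record p_i and n_i for every word occurring in the covered documents.
    stats = {}
    for words, is_pos in covered_docs:
        for w in words:
            p, n = stats.get(w, (0, 0))
            stats[w] = (p + 1, n) if is_pos else (p, n + 1)
    if not stats:
        return None, 0.0

    # Step 2: FOIL-style gain of adding the condition "w in a" to the rule.
    def gain(p1, n1):
        if p1 == 0:
            return float('-inf')
        old = math.log2(current_pos / (current_pos + current_neg))
        new = math.log2(p1 / (p1 + n1))
        return p1 * (new - old)

    best = max(stats, key=lambda w: gain(*stats[w]))
    best_gain = gain(*stats[best])
    return (best, best_gain) if best_gain > 0 else (None, 0.0)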
RIPPER has been used to classify or filter personal emails [69] based on a relatively small set
of labeled messages.
Sleeping-experts for phrases for text categorization
Sleeping-experts [56] is an ensemble framework that builds a master algorithm to integrate the
“advice” of different “experts” or classifiers [51, 76]. Given a test example, the master algorithm
uses a weighted combination of the predictions of the experts. One efficient weighted assignment
algorithm is the multiplicative update method, where the weight of each individual expert is updated
by multiplying it by a constant. In particular, "correct" experts that make the right classification
keep their weights unchanged (i.e., they are multiplied by 1), while "bad" experts that make a wrong
classification are multiplied by a constant less than 1, so that their weights become smaller.
In the context of text classification, the experts correspond to all length-k phrases that occur
in a corpus. Given a document that needs to be classified, those experts are “awake” and make
predictions if they appear in the document; the remaining experts are said to be “sleeping” on the
document. Different from the context information used in the RIPPER, the context information in
sleeping-experts (or sleeping-experts for phrases), is defined in the following phrase form:
wi1 , wi2 . . . wi j
Do for t = 1, 2, . . . , T
[Only fragments of the original listing survive here; the steps are described in the text below.
Step 7(c) re-normalizes the weight of each active mini-expert, p^{t+1}_{w̄k} ← (Zt / Zt+1) · p^{t+1}_{w̄k},
and step 8 updates the pool: Pool ← Pool ∪ E t .]
In this algorithm, the master algorithm maintains a pool, recording the sparse phrases that ap-
peared in the previous documents and a set p, containing one weight for each sparse phrase in the
pool.
This algorithm iterates over all the T labeled examples to update the weight set p. Particularly,
at each time step t, we have a document wt1 , wt2 , . . ., wtl with length l, and its classification label ct
(step 1). In step 2, we search for a set of active phrases, denoted by W t from the given document.
Step 3 defines two active mini-experts w̄1 and w̄0 for each phrase w̄ where w̄1 (w̄0 ) predicts the
document belongs to the class (does not belong to the class). Obviously, given the actual class label,
only one of them is correct. In step 4, this algorithm initializes the weights of new mini-experts
(not in the pool) to 1. Step 5 classifies the document by calculating the weighted sum of the mini-
experts and storing the sum in the variable yt — the document is classified as positive (class 1)
if yt > θC , and as negative (class 0) otherwise. θC = 1/2 has been set to minimize the errors and obtain a
balanced precision and recall. After performing classification, Step 6 updates the weights to reflect the
correlation between the classification results and the actual class label. It first computes the loss
l(w̄k ) of each mini-expert w̄k — if the predicted label is equal to the actual label, then the loss l(w̄k )
is zero, and 1 otherwise. The weight of each expert is then multiplied by a factor β^l(w̄k ), where β < 1 is
called the learning rate, which controls how quickly the weights are updated. The value for β is in
the range [0.1,0.5]. Basically, this algorithm keeps the weight of the correctly classified mini-expert
unchanged but lowers the weight of the wrongly classified mini-expert by multiplying β. Finally,
step 7 normalizes the active mini-experts so that the total weight of the active mini-experts does
not change. In effect, this re-normalization is to increase the weights of the mini-experts that were
correct in classification.
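The weight bookkeeping described above can be sketched as follows; the (phrase, vote) dictionary representation, the function name, and the default beta value are illustrative choices rather than the exact formulation of the original algorithm.

def sleeping_experts_step(weights, active_phrases, actual_label, beta=0.3, theta=0.5):
    # weights: dict mapping (phrase, vote) -> weight, where vote is 1 or 0;
    # mini-experts not seen before are initialized with weight 1.
    active = [(p, v) for p in active_phrases for v in (1, 0)]
    for key in active:
        weights.setdefault(key, 1.0)

    # Classify: weighted fraction of active mini-experts voting for class 1.
    total = sum(weights[k] for k in active)
    y = sum(weights[(p, v)] for p, v in active if v == 1) / total if total else 0.0
    predicted = 1 if y > theta else 0

    # Multiplicative update: correct mini-experts keep their weight (loss 0),
    # wrong ones are multiplied by beta (loss 1).
    old_mass = total
    for p, v in active:
        loss = 0 if v == actual_label else 1
        weights[(p, v)] *= beta ** loss
    # Re-normalize so that the total weight of the active mini-experts is unchanged,
    # which in effect boosts the relative weight of the correct mini-experts.
    new_mass = sum(weights[k] for k in active)
    if new_mass > 0:
        for k in active:
            weights[k] *= old_mass / new_mass
    return predicted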
calls (e.g., open, read, etc.), the audit data were segmented into a list of records, each of which
had 11 consecutive system calls. RIPPER has been employed to detect rules that serve as normal
(execution) profiles. In total, 252 rules are mined to characterize the normal co-occurrences of these
system calls and to identify the intrusions that deviate from the normal system calls.
In addition, another type of intrusions, where intruders aim to disrupt network services by at-
tacking the weakness in TCP/IP protocols, has also been identified [74]. By processing the raw
packet-level data, it is possible to create a time series of connection-level records that capture the
connection information such as duration, number of bytes transferred in each direction, and the flag
that specifies whether there is an error according to the protocol etc. Once again, RIPPER has been
applied to mine 20 rules that serve as normal network profile, characterizing the normal traffic pat-
terns for each network service. Given the temporal nature of activity sequences [75], the temporal
measures over features and the sequential correlation of features are particularly useful for accurate
identification. Note that the above anomaly detection methods need sufficient data that covers as
much variation of the normal behaviors as possible. Otherwise, given insufficient audit data, the
anomaly detection will not be successful as some normal activities will be flagged as intrusions.
1. Predictive data mining: The objective is to build predictive or classification models that can
be used to classify future cases or to predict the classes of future cases. This has been the
focus of research of the machine learning community.
2. Diagnostic data mining: The objective here is usually to understand the data and to find causes
of some problems in order to solve the problems. No prediction or classification is needed.
In the above example, the problems are calls that failed during setup and calls that were dropped while in progress. A
large number of data mining applications in engineering domains are of this type because product
improvement is the key task. The above application falls into the second type. The objective is not
prediction, but to better understand the data and to find causes of call failures or to identify situations
in which calls are more likely to fail. That is, the user wants interesting and actionable knowledge.
Clearly, the discovered knowledge has to be understandable. Class association rules are suitable for
this application.
It is easy to see that such kinds of rules can be produced by classification algorithms such as
decision trees and rule induction (e.g., CN2 and RIPPER), but they are not suitable for the task due
to three main reasons:
1. A typical classification algorithm only finds a very small subset of the rules that exist in data.
Most of the rules are not discovered because their objective is to find only enough rules for
classification. However, the subset of discovered rules may not be useful in the application.
Those useful rules are left undiscovered. We call this the completeness problem.
2. Due to the completeness problem, the context information of rules is lost, which makes rule
analysis later very difficult as the user does not see the complete information.
3. Since the rules are for classification purposes, they usually contain many conditions in order
to achieve high accuracy. Long rules are, however, of limited use according to our experi-
ence because the engineers can hardly take any action based on them. Furthermore, the data
coverage of long rules is often so small that it is not worth doing anything about them.
Class association rule mining [30] is found to be more suitable as it generates all rules. The
Opportunity Map system basically enables the user to visualize class association rules in all kinds
of ways through OLAP operations in order to find those interesting rules that meet the user needs.
To address the above challenging problems, RCBT discovers the most significant top-k covering rule groups (TopkRGS) for
each row of a gene expression dataset. Note that TopkRGS can provide a more complete description
for each row, which is different from existing interestingness measures that may fail to discover
any interesting rules to cover some of the rows if given a higher support threshold. As such, the
information in those rows that are not covered will not be captured in the set of rules. Given that
gene expression datasets have a small number of rows, RCBT will not lose important knowledge.
Particularly, the rule group conceptually clusters rules from the same set of rows. We use the
example in Table 5.3 to illustrate the concept of a rule group [50]. Note the gene expression data
in Table 5.3 have been discretized. They consist of 5 rows, namely, r1, r2, . . ., r5 where the first
three rows have class label C while the last two have label ¬C. Given an item set I, its Item Support
Set, denoted R(I), is defined as the largest set of rows that contain I. For example, given item set
I = {a, b}, its Item Support Set, R(I) = {r1, r2}. In fact, we observe that R(a) = R(b) = R(ab) =
R(ac) = R(bc) = R(abc) = {r1, r2}. As such, they make up a rule group {a → C, b → C, . . . ,
abc → C} of consequent C, with the upper bound abc → C and the lower bounds a → C, and b → C.
Obviously, all rules in the same rule group have exactly the same support and confidence, since
they are essentially derived from the same subset of rows [50], i.e., {r1, r2} in the above example.
We can easily identify the remaining rule members based on the upper bound and all the lower
bounds of a rule group. In addition, the significance of different rule groups can be evaluated based
on both their confidence and support scores.
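The concept of an item support set and of grouping rules by it can be sketched as follows; the toy rows below are hypothetical (they only mimic the flavor of the example), and the brute-force enumeration is purely illustrative, since RCBT relies on row enumeration and pruning instead.

from collections import defaultdict
from itertools import combinations

def item_support_set(rows, itemset):
    # R(I): the ids of all rows whose item sets contain itemset I.
    return frozenset(rid for rid, items in rows.items() if set(itemset) <= items)

def rule_groups(rows):
    # Group all itemsets by their item support set; every rule I -> C built from
    # the itemsets in one group has identical support and confidence.
    all_items = sorted(set().union(*rows.values()))
    groups = defaultdict(list)
    for r in range(1, len(all_items) + 1):
        for itemset in combinations(all_items, r):
            support_set = item_support_set(rows, itemset)
            if support_set:
                groups[support_set].append(itemset)
    return groups

rows = {'r1': {'a', 'b', 'c'}, 'r2': {'a', 'b', 'c'}, 'r3': {'c', 'd'}}
print(item_support_set(rows, {'a', 'b'}))   # frozenset({'r1', 'r2'})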
In addition, RCBT has designed a row enumeration technique as well as several pruning strate-
gies that make the rule mining process very efficient. A classifier has been constructed from the
top-k covering rule groups. Given a test instance, RCBT also aims to reduce the chance of classify-
ing it based on the default class by building additional stand-by classifiers. Specifically, given k sets
of rule groups RG1, . . . , RGk, k classifiers CL1, ...,CLk are built where CL1 is the main classifier and
CL2, . . . ,CLk are stand-by classifiers. It makes a final classification decision by aggregating voting
scores from all the classifiers.
A number of experiments have been carried out on real bioinformatics datasets, showing that
the RCBT algorithm is orders of magnitude faster than previous association rule mining algorithms.
of class association rules as the classifier. Other classifier building techniques include combining
multiple rules by Li et al. [31], using rules as features by Meretakis and Wuthrich [38], Antonie and
Zaiane [32], Deshpande and Karypis [33], and Lesh et al. [35], generating a subset of rules by Cong
et al. [39], Wang et al. [40], Yin and Han [41], and Zaki and Aggarwal [44]. Additional systems
include those by Li et al. [45], Yang et al. [46], etc.
Note that well-known decision tree methods [4], such as ID3 and C4.5, build a tree structure for
classification. The tree has two different types of nodes, namely decision nodes (internal nodes) and
leaf nodes. A decision node specifies a test based on a single attribute while a leaf node indicates
a class label. A decision tree can also be converted to a set of IF-THEN rules. Specifically, each
path from the root to a leaf forms a rule where all the decision nodes along the path form the condi-
tions of the rule and the leaf node forms the consequent of the rule. The main differences between
decision tree and rule induction are in their learning strategy and rule understandability. Decision
tree learning uses the divide-and-conquer strategy. In particular, at each step, all attributes are eval-
uated and one is selected to partition/divide the data into m disjoint subsets, where m is the number
of values of the attribute. Rule induction, however, uses the separate-and-conquer strategy, which
evaluates all attribute-value pairs (conditions) and selects only one. Thus, each step of divide-and-
conquer expands m rules, while each step of separate-and-conquer expands only one rule. On top of
that, the number of attribute-value pairs is much larger than the number of attributes. Due to these
two effects, the separate-and-conquer strategy is much slower than the divide-and-conquer strategy.
In terms of rule understandability, while if-then rules are easy to understand by human beings, we
should be cautious about rules generated by rule induction (e.g., using the sequential covering strat-
egy) since they are generated in order. Such rules can be misleading because the covered data are
removed after each rule is generated. Thus the rules in the rule list are not independent of each other.
In addition, a rule r may be of high quality in the context of the data D′ from which r was generated.
However, it may be a very weak rule with a very low accuracy (confidence) in the context of the
whole data set D (D′ ⊆ D), because many training examples that could be covered by r have already
been removed by rules generated before r. If you want to understand the rules generated by rule
induction and possibly use them in some real-world applications, you should be aware of this fact.
The rules from decision trees, on the other hand, are independent of each other and are also mu-
tually exclusive. The main differences between decision tree (or a rule induction system) and class
association rules (CARs) are in their mining algorithms and the final rule sets. CARs mining detects
all rules in data that satisfy the user-specified minimum support (minsup) and minimum confidence
(minconf) constraints while a decision tree or a rule induction system detects only a small subset of
the rules for classification. In many real-world applications, rules that are not found in the decision
tree (or a rule list) may be able to perform classification more accurately. Empirical comparisons
have demonstrated that in many cases, classification based on CARs performs more accurately than
decision trees and rule induction systems.
The complete set of rules from CARs mining could also be beneficial from a rule usage point
of view. For example, in a real-world application for finding causes of product problems (e.g., for
diagnostic purposes), more rules are preferred to fewer rules because with more rules, the user is
more likely to find rules that indicate the causes of the problems. Such rules may not be generated
by a decision tree or a rule induction system. A deployed data mining system based on CARs is
reported in [25]. Finally, CARs mining, like standard association rule mining, can only take discrete
attributes for its rule mining, while decision trees can deal with continuous attributes naturally.
Similarly, rule induction can also use continuous attributes. But for CARs mining, we first need to
apply an attribute discretization algorithm to automatically discretize the value range of a continuous
attribute into suitable intervals [47, 48], which are then considered as discrete values to be used
for CARs mining algorithms. This is not a problem as there are many discretization algorithms
available.
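As a simple illustration of such pre-processing, the sketch below performs equal-width binning; in practice an entropy-based method such as [48] is a more common choice for CARs mining, and the age values shown are hypothetical.

def equal_width_bins(values, k):
    # Discretize a continuous attribute into k equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant attribute
    def interval(v):
        return min(int((v - lo) / width), k - 1)
    return ['bin_' + str(interval(v)) for v in values]

ages = [21, 35, 29, 62, 45, 53]
print(equal_width_bins(ages, k=3))        # e.g., ['bin_0', 'bin_1', 'bin_0', 'bin_2', 'bin_1', 'bin_2']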
Bibliography
[1] Li, X. L., Liu, B., and Ng, S.K. Learning to identify unexpected instances in the test set.
In Proceedings of Twentieth International Joint Conference on Artificial Intelligence, pages
2802–2807, India, 2007.
[2] Cortes, Corinna and Vapnik, Vladimir N. Support-vector networks. Machine Learning, 20
(3):273–297, 1995.
[3] Hopfield, J. J. Neural networks and physical systems with emergent collective computational
abilities. In Proceedings of the National Academy of Sciences USA, 79 (8):2554–2558, 1982.
[4] Quinlan, J. C4.5: Programs for machine learning. Morgan Kaufmann Publishers, 1993.
[5] George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers.
In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages
338–345, San Mateo, 1995.
[6] Bremner, D., Demaine, E., Erickson, J., Iacono, J., Langerman, S., Morin, P., and Toussaint,
G. Output-sensitive algorithms for computing nearest-neighbor decision boundaries. Discrete
and Computational Geometry, 33 (4):593–604, 2005.
[7] Hosmer, David W. and Lemeshow, Stanley. Applied Logistic Regression. Wiley, 2000.
[8] Rivest, R. Learning decision lists. Machine Learning, 2(3):229–246, 1987.
[9] Clark, P. and Niblett, T. The CN2 induction algorithm. Machine Learning, 3(4):261–283,
1989.
[10] Quinlan, J. Learning logical definitions from relations. Machine Learning, 5(3):239–266,
1990.
[11] Furnkranz, J. and Widmer, G. Incremental reduced error pruning. In Proceedings of Interna-
tional Conference on Machine Learning (ICML-1994), pages 70–77, 1994.
[12] Brunk, C. and Pazzani, M. An investigation of noise-tolerant relational concept learning al-
gorithms. In Proceedings of International Workshop on Machine Learning, pages 389–393,
1991.
[13] Cohen, W. W. Fast effective rule induction. In Proceedings of the Twelfth International Con-
ference on Machine Learning, pages 115–123, 1995.
[16] Agrawal, R., Imieliski, T., and Swami, A. Mining association rules between sets of items in
large databases. In Proceedings of ACM SIGMOD International Conference on Management
of Data (SIGMOD-1993), pages 207–216, 1993.
[17] Agrawal, R. and Srikant, R. Fast algorithms for mining association rules in large databases.
In Proceedings of International Conference on Very Large Data Bases (VLDB-1994), pages
487–499, 1994.
[18] Michalski, R. S. On the quasi-minimal solution of the general covering problem. In Proceed-
ings of the Fifth International Symposium on Information Processing, pages 125–128, 1969.
[19] Han, J. W., Kamber, M., and Pei, J. Data Mining: Concepts and Techniques. 3rd edition,
Morgan Kaufmann, 2011.
[20] Liu, B. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
[21] Han, J. W., Pei, J., and Yin, Y. Mining frequent patterns without candidate generation. In
Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD-2000), pages
1–12, 2000.
[22] Bayardo, Jr., R. and Agrawal, R. Mining the most interesting rules. In Proceedings of ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-1999),
pages 145–154, 1999.
[23] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., and Verkamo, A. Finding interest-
ing rules from large sets of discovered association rules. In Proceedings of ACM International
Conference on Information and Knowledge Management (CIKM-1994), pages 401–407, 1994.
[24] Liu, B., Hsu, W., and Ma, Y. Pruning and summarizing the discovered associations. In Pro-
ceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing (KDD-1999), pages 125–134, 1999.
[25] Liu, B., Zhao, K., Benkler, J., and Xiao, W. Rule interestingness analysis using OLAP oper-
ations. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD-2006), pages 297–306, 2006.
[26] Padmanabhan, B. and Tuzhilin, A. Small is beautiful: discovering the minimal set of unex-
pected patterns. In Proceedings of ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2000), pages 54–63, 2000.
[29] Tan, P., Kumar, V., and Srivastava, J. Selecting the right interestingness measure for association
patterns. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD-2002), pages 32–41, 2002.
[30] Liu, B., Hsu, W., and Ma, Y. Integrating classification and association rule mining. In Proceed-
ings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD-1998), pages 80–86, 1998.
[31] Li, W., Han, J., and Pei, J. CMAR: Accurate and efficient classification based on multiple
class-association rules. In Proceedings of IEEE International Conference on Data Mining
(ICDM-2001), pages 369–376, 2001.
[32] Antonie, M. and Zaı̈ane, O. Text document categorization by term association. In Proceedings
of IEEE International Conference on Data Mining (ICDM-2002), pages 19–26, 2002.
[33] Deshpande, M. and Karypis, G. Using conjunction of attribute values for classification. In
Proceedings of ACM International Conference on Information and Knowledge Management
(CIKM-2002), pages 356–364, 2002.
[34] Jindal, N. and Liu, B. Identifying comparative sentences in text documents. In Proceedings
of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-
2006), pages 244–251, 2006.
[35] Lesh, N., Zaki, M., and Ogihara, M. Mining features for sequence classification. In Proceed-
ings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD-1999), pages 342–346, 1999.
[36] Michalski, R., Mozetic, I., Hong, J., and Lavrac, N. The multi-purpose incremental learning
system AQ15 and its testing application to three medical domains. In Proceedings of National
Conference on Artificial Intelligence (AAAI-86), pages 1041–1045, 1986.
[37] Pazzani, M., Brunk, C., and Silverstein, G. A knowledge-intensive approach to learning rela-
tional concepts. In Proceedings of International Workshop on Machine Learning (ML-1991),
pages 432–436, 1991.
[38] Meretakis, D. and Wuthrich, B. Extending naïve Bayes classifiers using long itemsets. In
Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-1999), pages 165–174, 1999.
[39] Cong, G., Tung, A.K.H., Xu, X., Pan, F., and Yang, J. Farmer: Finding interesting rule groups
in microarray datasets. In Proceedings of ACM SIGMOD Conference on Management of Data
(SIGMOD-2004), pages 143–154, 2004.
[40] Wang, K., Zhou, S., and He, Y. Growing decision trees on support-less association rules. In
Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-2000), pages 265–269, 2000.
[41] Yin, X. and Han, J. CPAR: Classification based on predictive association rules. In Proceedings
of SIAM International Conference on Data Mining (SDM-2003), pages 331–335, 2003.
[42] Lin, W., Alvarez, S., and Ruiz, C. Efficient adaptive-support association rule mining for rec-
ommender systems. Data Mining and Knowledge Discovery, 6(1):83–105, 2002.
[43] Mobasher, B., Dai, H., Luo, T., and Nakagawa, M. Effective personalization based on as-
sociation rule discovery from web usage data. In Proceedings of ACM Workshop on Web
Information and Data Management, pages 9–15, 2001.
[44] Zaki, M. and Aggarwal, C. XRules: an effective structural classifier for XML data. In Pro-
ceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing (KDD-2003), pages 316–325, 2003.
[45] Li, J., Dong, G., Ramamohanarao, K., and Wong, L. DeEPs: A new instance-based lazy
discovery and classification system. Machine Learning, 54(2):99–124, 2004.
[46] Yang, Q., Li, T., and Wang, K. Building association-rule based sequential classifiers for web-
document prediction. Data Mining and Knowledge Discovery, 8(3):253–273, 2004.
[47] Dougherty, J., Kohavi, R., and Sahami, M. Supervised and unsupervised discretization of con-
tinuous features. In Proceedings of International Conference on Machine Learning (ICML-
1995), pages 194–202, 1995.
[48] Fayyad, U. and Irani, K. Multi-interval discretization of continuous-valued attributes for clas-
sification learning. In Proceedings of the International Joint Conference on Artificial Intelli-
gence (IJCAI-1993), pages 1022–1028, 1993.
[49] Zheng, Z., Kohavi, R., and Mason, L. Real world performance of association rule algorithms.
In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-2001), pages 401–406, 2001.
[50] Cong, G., Tan, K.-L., Tung A.K.H., and Xu, X. Mining top-k covering rule groups for gene
expression data. In Proceedings of the 2005 ACM-SIGMOD International Conference on
Management of Data (SIGMOD–05), pages 670–681, 2005.
[51] Cohen, W. W. and Singer, Y. Context-sensitive learning methods for text categorization. ACM
Transactions on Information Systems, 17(2):141–173, 1999.
[52] Johnson, D., Oles. F., Zhang T., and Goetz, T. A decision tree-based symbolic rule induction
system for text categorization. IBM Systems Journal, 41(3):428–437, 2002.
[53] Apte, C., Damerau, F., and Weiss, S. Automated learning of decision rules for text categoriza-
tion. ACM Transactions on Information Systems, 12(3):233–251, 1994.
[54] Weiss, S. M., Apte C., Damerau, F., Johnson, D., Oles, F., Goetz, T., and Hampp, T. Maximiz-
ing text-mining performance. IEEE Intelligent Systems, 14(4):63–69, 1999.
[55] Weiss, S. M. and Indurkhya, N. Optimized rule induction. IEEE Expert, 8(6):61–69, 1993.
[56] Freund, Y., Schapire, R., Singer, Y., and Warmuth, M. Using and combining predictors that
specialize. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages
334–343, 1997.
[57] Joachims, T. Text categorization with support vector machines: learning with many relevant
features. In Proceedings of the European Conference on Machine Learning (ECML), pages
137–142, 1998.
[58] McCallum, A. and Nigam, K. A comparison of event models for Naïve Bayes text classification.
In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998.
[59] Liu, B., Lee, W. S., Yu, P. S. and Li, X. L. Partially supervised classification of text documents.
In Proceedings of the Nineteenth International Conference on Machine Learning (ICML-
2002), pages 387–394, Australia, 2002.
[60] Rocchio, J. Relevance feedback in information retrieval. In G. Salton (ed.). The Smart Re-
trieval System: Experiments in Automatic Document Processing, Prentice-Hall, Upper Saddle
River, NJ, 1971.
[61] Li, X. L. and Liu, B. Learning to classify texts using positive and unlabeled data. In Proceed-
ings of Eighteenth International Joint Conference on Artificial Intelligence, pages 587–592,
Mexico, 2003.
[62] Li, X. L., Liu, B., Yu, P. S., and Ng, S. K. Positive unlabeled learning for data stream classi-
fication. In Proceedings of the Ninth SIAM International Conference on Data Mining, pages
257–268, 2009.
[63] Li, X. L., Liu, B., Yu, P. S., and Ng, S. K. Negative training data can be harmful to text classi-
fication. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language
Processing, pages 218–228, USA, 2010.
[64] Liu, B., Dai, Y., Li, X. L., Lee, W. S., and Yu, P. S. Building text classifiers using positive and
unlabeled examples. In Proceedings of Third IEEE International Conference on Data Mining,
pages 179–186, 2003.
[65] Lewis, D. and Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In
Proceedings of the Eleventh Annual Conference on Machine Learning, pages 148–156, 1994.
[66] Li, X. L., Tan, Y. C., and Ng, S. K. Systematic gene function prediction from gene expression
data by using a fuzzy nearest-cluster method. BMC Bioinformatics, 7(Suppl 4):S23, 2006.
[67] Han, X.X. and Li, X.L. Multi-resolution independent component analysis for high-
performance tumor classification and biomarker discovery. BMC Bioinformatics, 12(Suppl 1):
S7, 2011.
[68] Yang, P., Li, X. L., Mei, J. P., Kwoh, C. K., and Ng, S. K. Positive-unlabeled learning for
disease gene identification. Bioinformatics, 28(20):2640–2647, 2012.
[69] Cohen, W.W. Learning rules that classify e-mail. In Proceedings of the AAAI Spring Sympo-
sium on Machine Learning in Information Access, pages 18–25, 1996.
[70] Liu, B., Dai, Y., Li, X. L., Lee, W. S., and Yu, P. S. Text classification by labeling words. In
Proceedings of the National Conference on Artificial Intelligence, pages 425–430, USA, 2004.
[71] Cohen, W.W. Learning trees and rules with set-valued features. In Proceedings of the Thir-
teenth National Conference on Artificial Intelligence, pages 709–716, 1996.
[72] Heady, R., Luger, G., Maccabe, A., and Servilla, M. The Architecture of a Network Level
Intrusion Detection System. Technical report, University of New Mexico, 1990.
[73] Lee, W., Stolfo, S. J., and Mok, K. W. Adaptive intrusion detection: A data mining approach.
Artificial Intelligence Review - Issues on the Application of Data Mining Archive, 14(6):533–
567, 2000.
[74] Lee, W. and Stolfo, S. J. Data mining approaches for intrusion detection. In Proceedings of
the 7th USENIX Security Symposium, San Antonio, TX, 1998.
[75] Mannila, H. and Toivonen, H. Discovering generalized episodes using minimal occurrences.
In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases
and Data Mining, Portland, Oregon, pages 146–151, 1996.
[76] Friedman, J.H. and Popescu, B.E. Predictive learning via rule ensembles. The Annals of
Applied Statistics, 2(3):916–954, 2008.
Chapter 6
Instance-Based Learning: A Survey
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
6.1 Introduction
Most classification methods are based on building a model in the training phase, and then using
this model for specific test instances, during the actual classification phase. Thus, the classification
process is usually a two-phase approach that is cleanly separated between processing training and
test instances. As discussed in the introduction chapter of this book, these two phases are as follows:
• Training Phase: In this phase, a model is constructed from the training instances.
• Testing Phase: In this phase, the model is used to assign a label to an unlabeled test instance.
Examples of models that are created during the first phase of training are decision trees, rule-based
methods, neural networks, and support vector machines. Thus, the first phase creates pre-compiled
abstractions or models for learning tasks. This is also referred to as eager learning, because the
models are constructed in an eager way, without waiting for the test instance. In instance-based
learning, this clean separation between the training and testing phase is usually not present. The
specific instance, which needs to be classified, is used to create a model that is local to a specific test
instance. The classical example of an instance-based learning algorithm is the k-nearest neighbor
classification algorithm, in which the k nearest neighbors of the test instance are used in order to create a
local model for the test instance. An example of a local model using the k nearest neighbors could
be that the majority class in this set of k instances is reported as the corresponding label, though
more complex models are also possible. Instance-based learning is also sometimes referred to as
lazy learning, since most of the computational work is not done upfront, and one waits to obtain the
test instance, before creating a model for it [9]. Clearly, instance-based learning has a different set
of tradeoffs, in that it requires very little or no processing for creating a global abstraction of the
training data, but can sometimes be expensive at classification time. This is because instance-based
learning typically has to determine the relevant local instances, and create a local model from these
instances at classification time. While the obvious way to create a local model is to use a k-nearest
neighbor classifier, numerous other kinds of lazy solutions are possible, which combine the power
of lazy learning with other models such as locally-weighted regression, decision trees, rule-based
methods, and SVM classifiers [15, 36, 40, 77]. This chapter will discuss all these different scenarios.
It is possible to use the traditional “eager” learning methods such as Bayes methods [38], SVM
methods [40], decision trees [62], or neural networks [64] in order to improve the effectiveness of
local learning algorithms, by applying them only on the local neighborhood of the test instance at
classification time.
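A minimal sketch of such a local model is given below, using a majority vote over the k nearest neighbors under the Euclidean distance; the toy two-dimensional data are hypothetical.

import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    # train: list of (feature_vector, label) pairs; the local model is built lazily,
    # at classification time, from the k training instances closest to test_point.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((5.2, 4.9), 'B')]
print(knn_predict(train, (1.1, 1.0), k=3))   # 'A'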
It should also be pointed out that many instance-based algorithms may require a pre-processing
phase in order to improve the efficiency of the approach. For example, the efficiency of a nearest
neighbor classifier can be improved by building a similarity index on the training instances. In spite
of this pre-processing phase, such an approach is still considered lazy learning or instance-based
learning since the pre-processing phase is not really a classification model, but a data structure that
enables efficient implementation of the run-time modeling for a given test instance.
Instance-based learning is related to but not quite the same as case-based reasoning [1, 60, 67],
in which previous examples may be used in order to make predictions about specific test instances.
Such systems can modify cases or use parts of cases in order to make predictions. Instance-based
methods can be viewed as a particular kind of case-based approach, which uses specific kinds of
algorithms for instance-based classification. The framework of instance-based algorithms is more
amenable to reducing computational and storage requirements, and to handling noise and irrelevant attributes.
However, these terminologies are not clearly distinct from one another, because many authors use
the term “case-based learning” in order to refer to instance-based learning algorithms. Instance-
specific learning can even be extended to distance function learning, where instance-specific dis-
tance functions are learned, which are local to the query instance [76].
Instance-based learning methods have several advantages and disadvantages over traditional
learning methods. The lazy aspect of instance-based learning is its greatest advantage. The global
pre-processing approach of eager learning algorithms is inherently myopic to the characteristics
of specific test instances, and may create a model, which is often not optimized towards specific
instances. The advantage of instance-based learning methods is that they can be used in order to
create models that are optimized to specific test instances. On the other hand, this can come at a
cost, since the computational load of performing the classification can be high. As a result, it may
often not be possible to create complex models because of the computational requirements. In some
cases, this may lead to oversimplification. Clearly, the usefulness of instance-based learning (as in
all other classes of methods) depends highly upon the data domain, the size of the data, data noisiness, and
dimensionality. These aspects will be covered in some detail in this chapter.
This chapter will provide an overview of the basic framework for instance-based learning, and
the many algorithms that are commonly used in this domain. Some of the important methods such as
nearest neighbor classification will be discussed in more detail, whereas others will be covered at a
much higher level. This chapter is organized as follows. Section 6.2 introduces the basic framework
for instance-based learning. The most well-known instance-based method is the nearest neighbor
classifier. This is discussed in Section 6.3. Lazy SVM classifiers are discussed in Section 6.4. Lo-
cally weighted methods for regression are discussed in Section 6.5. Locally weighted naive Bayes
methods are introduced in Section 6.6. Methods for constructing lazy decision trees are discussed
in Section 6.7. Lazy rule-based classifiers are discussed in Section 6.8. Methods for using neural
networks in the form of radial basis functions are discussed in Section 6.9. The advantages of lazy
learning for diagnostic classification are discussed in Section 6.10. The conclusions and summary
are discussed in Section 6.11.
However, a broader and more powerful principle to characterize such methods would be:
Similar instances are easier to model with a learning algorithm, because of the simplification of
the class distribution within the locality of a test instance.
Note that the latter principle is a bit more general than the former, in that the former principle
seems to advocate the use of a nearest neighbor classifier, whereas the latter principle seems to sug-
gest that locally optimized models to the test instance are usually more effective. Thus, according
to the latter philosophy, a vanilla nearest neighbor approach may not always obtain the most accu-
rate results, but a locally optimized regression classifier, Bayes method, SVM or decision tree may
sometimes obtain better results because of the simplified modeling process [18, 28, 38, 77, 79]. This
class of methods is often referred to as lazy learning, and often treated differently from traditional
instance-based learning methods, which correspond to nearest neighbor classifiers. Nevertheless,
the two classes of methods are closely related enough to merit a unified treatment. Therefore, this
chapter will study both the traditional instance-based learning methods and lazy learning methods
within a single generalized umbrella of instance-based learning methods.
The primary output of an instance-based algorithm is a concept description. As in the case of
a classification model, this is a function that maps instances to category values. However, unlike
traditional classifiers, which use intensional (abstracted) concept descriptions, instance-based concept descrip-
tions may typically contain a set of stored instances, and optionally some information about how the
stored instances may have performed in the past during classification. The set of stored instances
can change as more instances are classified over time. This, however, is dependent upon the under-
lying classification scenario being temporal in nature. There are three primary components in all
instance-based learning algorithms.
1. Similarity or Distance Function: This computes the similarities between the training in-
stances, or between the test instance and the training instances. This is used to identify a
locality around the test instance.
2. Classification Function: This yields a classification for a particular test instance with the use
of the locality identified with the use of the distance function. In the earliest descriptions
of instance-based learning, a nearest neighbor classifier was assumed, though this was later
expanded to the use of any kind of locally optimized model.
3. Concept Description Updater: This typically tracks the classification performance, and makes
decisions on the choice of instances to include in the concept description.
Traditional classification algorithms construct explicit abstractions and generalizations (e.g., de-
cision trees or rules), which are constructed in an eager way in a pre-processing phase, and are
independent of the choice of the test instance. These models are then used in order to classify test
instances. This is different from instance-based learning algorithms, where instances are used along
with the training data to construct the concept descriptions. Thus, the approach is lazy in the sense
that knowledge of the test instance is required before model construction. Clearly the tradeoffs
are different in the sense that “eager” algorithms avoid too much work at classification time, but
are myopic in their ability to create a specific model for a test instance in the most accurate way.
Instance-based algorithms face many challenges involving efficiency, attribute noise, and signifi-
cant storage requirements. A work that analyzes the last aspect of storage requirements is discussed
in [72].
While nearest neighbor methods are almost always used as an intermediate step for identifying
data locality, a variety of techniques have been explored in the literature beyond a majority vote
on the identified locality. Traditional modeling techniques such as decision trees, regression model-
ing, Bayes, or rule-based methods are commonly used to create an optimized classification model
around the test instance. It is the optimization inherent in this localization that provides the great-
est advantages of instance-based learning. In some cases, these methods are also combined with
some level of global pre-processing so as to create a combination of instance-based and model-
based algorithms [55]. In any case, many instance-based methods combine typical classification
generalizations such as regression-based methods [15], SVMs [54, 77], rule-based methods [36], or
decision trees [40] with instance-based methods. Even in the case of pure distance-based methods,
some amount of model building may be required at an early phase for learning the underlying dis-
tance functions [75]. This chapter will also discuss such techniques within the broader category of
instance-based methods.
point of view. It has been shown in [31] that the nearest neighbor rule provides at most twice the
error of the Bayes optimal rule.
Such an approach may sometimes not be appropriate for imbalanced data sets, in which the rare
class may not be present to a significant degree among the nearest neighbors, even when the test
instance belongs to the rare class. In the case of cost-sensitive classification or rare-class learning the
majority class is determined after weighting the instances with the relevant costs. These methods
will be discussed in detail in Chapter 17 on rare class learning. In cases where the class label is
continuous (regression modeling problem), one may use the weighted average numeric values of
the target class. Numerous variations on this broad approach are possible, both in terms of the
distance function used or the local model used for the classification process.
• The choice of the distance function clearly affects the behavior of the underlying classifier. In
fact, the problem of distance function learning [75] is closely related to that of instance-based
learning since nearest neighbor classifiers are often used to validate distance-function learning
algorithms. For example, for numerical data, the use of the Euclidean distance assumes a
spherical shape of the clusters created by different classes. On the other hand, the true clusters
may be ellipsoidal and arbitrarily oriented with respect to the axis system. Different distance
functions may work better in different scenarios. The use of feature-weighting [69] can also
change the distance function, since the weighting can change the contour of the distance
function to match the patterns in the underlying data more closely.
• The final step of selecting the model from the local test instances may vary with the appli-
cation. For example, one may use the majority class as the relevant one for classification, a
cost-weighted majority vote, or a more complex classifier within the locality such as a Bayes
technique [38, 78].
One of the nice characteristics of the nearest neighbor classification approach is that it can be used
for practically any data type, as long as a distance function is available to quantify the distances
between objects. Distance functions are often designed with a specific focus on the classification
task [21]. Distance function design is a widely studied topic in many domains such as time-series
data [42], categorical data [22], text data [56], and multimedia data [58] or biological data [14].
Entropy-based measures [29] are more appropriate for domains such as strings, in which the dis-
tances are measured in terms of the amount of effort required to transform one instance to the other.
Therefore, the simple nearest neighbor approach can be easily adapted to virtually every data do-
main. This is a clear advantage in terms of usability. A detailed discussion of different aspects of
distance function design may be found in [75].
A key issue with the use of nearest neighbor classifiers is the efficiency of the approach in the
classification process. This is because the retrieval of the k nearest neighbors may require a running
time that is linear in the size of the data set. With the increase in typical data sizes over the last few
years, this continues to be a significant problem [13]. Therefore, it is useful to create indexes, which
can efficiently retrieve the k nearest neighbors of the underlying data. This is generally possible
for many data domains, but may not be true of all data domains in general. Therefore, scalability
is often a challenge in the use of such algorithms. A common strategy is to use either indexing of
the underlying instances [57], sampling of the data, or aggregations of some of the data points into
smaller clustered pseudo-points in order to improve efficiency. While the indexing strategy seems
to be the most natural, it rarely works well in the high dimensional case, because of the curse of
dimensionality. Many data sets are also very high dimensional, in which case a nearest neighbor
index fails to prune out a significant fraction of the data points, and may in fact do worse than a
sequential scan, because of the additional overhead of indexing computations.
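As an illustration of the indexing strategy, the following sketch (an illustrative fragment, assuming the availability of SciPy) builds a k-d tree over the training instances during pre-processing, and uses it to retrieve the k nearest neighbors at classification time. As noted above, the benefit of such structures diminishes rapidly with increasing dimensionality.

import numpy as np
from scipy.spatial import cKDTree
from collections import Counter

def build_index(X_train):
    # One-time pre-processing: this is a data structure, not a classification model.
    return cKDTree(X_train)

def knn_predict(tree, y_train, x_test, k=5):
    # Retrieve the k nearest neighbors through the index and take a majority vote.
    _, idx = tree.query(x_test, k=k)
    votes = Counter(y_train[i] for i in np.atleast_1d(idx))
    return votes.most_common(1)[0][0]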
Such issues are particularly challenging in the streaming scenario. A common strategy is to use
very fine grained clustering [5, 7] in order to replace multiple local instances within a small cluster
(belonging to the same class) with a pseudo-point of that class. Typically, this pseudo-point is the
centroid of a small cluster. Then, it is possible to apply a nearest neighbor method on these pseudo-
points in order to obtain the results more efficiently. Such a method is desirable, when scalability is
of great concern and the data has very high volume. Such a method may also reduce the noise that is
associated with the use of individual instances for classification. An example of such an approach is
provided in [7], where classification is performed on a fast data stream, by summarizing the stream
into micro-clusters. Each micro-cluster is constrained to contain data points only belonging to a
particular class. The class label of the closest micro-cluster to a particular instance is reported as the
relevant label. Typically, the clustering is performed with respect to different time-horizons, and a
cross-validation approach is used in order to determine the time-horizon that is most relevant at a
given time. Thus, the model is instance-based in a dual sense, since it is not only a nearest neighbor
classifier, but it also determines the relevant time horizon in a lazy way, which is specific to the
time-stamp of the instance. Picking a smaller time horizon for selecting the training data may often
be desirable when the data evolves significantly over time. The streaming scenario also benefits
from laziness in the temporal dimension, since the most appropriate model to use for the same test
instance may vary with time, as the data evolves. It has been shown in [7], that such an “on demand”
approach to modeling provides more effective results than eager classifiers, because of its ability to
optimize for the test instance from a temporal perspective. Another method that is based on the
nearest neighbor approach is proposed in [17]. This approach detects the changes in the distribution
of the data stream on the past window of instances and accordingly re-adjusts the classifier. The
approach can handle symbolic attributes, and it uses the Value Distance Metric (VDM) [60] in order
to measure distances. This metric will be discussed in some detail in Section 6.3.1 on symbolic
attributes.
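The following sketch conveys the flavor of such class-specific micro-clustering (it is illustrative only, and omits the time-horizon machinery of [7]): each micro-cluster stores a centroid, a count, and a class label, points are absorbed incrementally, and a test instance receives the label of its closest micro-cluster.

import numpy as np

class MicroClusterClassifier:
    """Each micro-cluster holds only points of one class; stored as [centroid, count, label]."""
    def __init__(self, radius=1.0):
        self.radius = radius
        self.clusters = []

    def update(self, x, label):
        # Absorb the point into the nearest micro-cluster of the same class, if close enough.
        best, best_d = None, float("inf")
        for c in self.clusters:
            if c[2] == label:
                d = np.linalg.norm(x - c[0])
                if d < best_d:
                    best, best_d = c, d
        if best is not None and best_d <= self.radius:
            best[0] = (best[0] * best[1] + x) / (best[1] + 1)  # incremental centroid update
            best[1] += 1
        else:
            self.clusters.append([x.astype(float), 1, label])

    def predict(self, x):
        # Report the label of the closest micro-cluster (pseudo-point).
        dists = [np.linalg.norm(x - c[0]) for c in self.clusters]
        return self.clusters[int(np.argmin(dists))][2]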
A second approach that is commonly used to speed up the approach is the concept of instance
selection or prototype selection [27, 41, 72, 73, 81]. In these methods, a subset of instances may
be pre-selected from the data, and the model is constructed with the use of these pre-selected in-
stances. It has been shown that a good choice of pre-selected instances can often lead to improvement
in accuracy, in addition to better efficiency [72, 81]. This is because a careful pre-selection of
instances reduces the noise from the underlying training data, and therefore results in better clas-
sification. The pre-selection issue is an important research issue in its own right, and we refer the
reader to [41] for a detailed discussion of this important aspect of instance-based classification. An
empirical comparison of the different instance selection algorithms may be found in [47].
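As one representative member of this family (and not necessarily the specific algorithms of the works cited above), the following sketch performs a simple condensed selection: an instance is added to the prototype set only if the prototypes selected so far misclassify it under the 1-nearest neighbor rule.

import numpy as np

def condensed_selection(X, y):
    """Greedy prototype selection: keep an instance only if current prototypes misclassify it."""
    keep = [0]  # seed the prototype set with the first instance
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # 1-NN prediction using only the currently kept prototypes.
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            pred = y[keep][int(np.argmin(d))]
            if pred != y[i]:
                keep.append(i)
                changed = True
    return np.array(sorted(keep))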
In many rare class or cost-sensitive applications, the instances may need to be weighted differ-
ently corresponding to their importance. For example, consider an application in which it is desirable
to use medical data in order to diagnose a specific condition. The vast majority of results may be nor-
mal, and yet it may be costly to miss a case where an example is abnormal. Furthermore, a nearest
neighbor classifier (which does not weight instances) will be naturally biased towards identifying
instances as normal, especially when they lie at the border of the decision region. In such cases,
costs are associated with instances, where the cost (weight) of an abnormal instance reflects the relative cost of misclassifying it (a false negative), as compared to the cost of misclassifying a normal instance (a false positive). The weights on the instances are then used for the classification
process.
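A minimal sketch of such cost-weighted voting is shown below; the cost values are assumed to be supplied by the application, and the label whose neighbors carry the largest total cost is reported.

import numpy as np
from collections import defaultdict

def cost_weighted_knn(X_train, y_train, costs, x_test, k=5):
    """costs[i] is the (application-supplied) misclassification cost of training instance i."""
    d = np.linalg.norm(X_train - x_test, axis=1)
    neighbors = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in neighbors:
        votes[y_train[i]] += costs[i]   # weight the vote by cost, not by raw count
    return max(votes, key=votes.get)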
Another issue with the use of nearest neighbor methods is that they do not work very well when
the dimensionality of the underlying data increases. This is because the quality of the nearest neigh-
bor decreases with an increasing number of irrelevant attributes [45]. The noise effects associated
with the irrelevant attributes can clearly degrade the quality of the nearest neighbors found, espe-
cially when the dimensionality of the underlying data is high. This is because the cumulative effect
of irrelevant attributes often becomes more pronounced with increasing dimensionality. For the
case of numeric attributes, it has been shown [2], that the use of fractional norms (i.e. L p -norms for
p < 1) provides superior quality results for nearest neighbor classifiers, whereas L∞ norms provide
the poorest behavior. Greater improvements may be obtained by designing the distance function
more carefully, and weighting the more relevant features to a greater degree. This is an issue that will be discussed in detail in
later subsections.
In this context, the issue of distance-function design is an important one [50]. In fact, an entire
area of machine learning has been focussed on distance function design. Chapter 18 of this book
has been devoted entirely to distance function design, and an excellent survey on the topic may be
found in [75]. A discussion of the applications of different similarity methods for instance-based
classification may be found in [32]. In this section, we will discuss some of the key aspects of
distance-function design, which are important in the context of nearest neighbor classification.
Here, the parameter q can be chosen either on an ad hoc basis, or in a data-driven manner. This
choice of metric has been shown to be quite effective in a variety of instance-centered scenarios
[36, 60]. Detailed discussions of different kinds of similarity functions for symbolic attributes may
be found in [22, 30].
$$V(j) = \sum_{i:\, i \in S_k,\, c_i = j} f(d_i) \qquad (6.2)$$
Here f(·) is either an increasing or decreasing function of its argument, depending upon whether di represents a similarity or a distance, respectively. It should be pointed out that if the appropriate weight
is used, then it is not necessary to use the k nearest neighbors, but simply to perform this average
over the entire collection.
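A minimal sketch of this weighted vote is shown below. It is illustrative only, and uses a hypothetical inverse-distance choice of f(·); when the weights decay quickly with distance, the vote may be taken over the entire collection rather than over the k nearest neighbors only.

import numpy as np
from collections import defaultdict

def weighted_vote(X_train, y_train, x_test, k=None, eps=1e-8):
    """Implements V(j) = sum of f(d_i) over neighbors of class j, with f(d) = 1/(d + eps)."""
    d = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(d) if k is None else np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in order:
        votes[y_train[i]] += 1.0 / (d[i] + eps)  # decreasing function of the distance
    return max(votes, key=votes.get)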
FIGURE 6.1: Illustration of importance of feature weighting for nearest neighbor classification: (a) axis-parallel, (b) arbitrarily oriented.
The nearest neighbors are computed on this set of distances, and the majority vote among the k
nearest neighbors is reported as the class label. This approach tends to work well because it picks
the k nearest neighbor group in a noise-resistant way. For test instances that lie on the decision
boundaries, it tends to discount the instances that lie on the noisy parts of a decision boundary, and
instead picks the k nearest neighbors that lie away from the noisy boundary. For example, consider a
data point Xi that lies reasonably close to the test instance, but even closer to the decision boundary.
Furthermore, since Xi lies almost on the decision boundary, it lies extremely close to another training
example in a different class. In such a case the example Xi should not be included in the k-nearest
neighbors, because it is not very informative. The small value of ri will often ensure that such an
example is not picked among the k-nearest neighbors. Furthermore, such an approach also ensures
that the distances are scaled and normalized by the varying nature of the patterns of the different
classes in different regions. Because of these factors, it has been shown in [65] that this modification
often yields more robust results for the quality of classification.
to be zero, that attribute is eliminated completely. This can be considered an implicit form of fea-
ture selection. Thus, for two d-dimensional records X = (x1 . . . xd ) and Y = (y1 . . . yd ), the feature
weighted distance d(X,Y ,W ) with respect to a d-dimensional vector of weights W = (w1 . . . wd ) is
defined as follows:
$$d(X, Y, W) = \sum_{i=1}^{d} w_i \cdot (x_i - y_i)^2. \qquad (6.4)$$
For example, consider the data distribution, illustrated in Figure 6.1(a). In this case, it is evident
that the feature X is much more discriminative than feature Y , and should therefore be weighted to
a greater degree in the final classification. In this case, almost perfect classification can be obtained
with the use of feature X, though this will usually not be the case, since the decision boundary is
noisy in nature. It should be pointed out that the Euclidean metric has a spherical decision boundary,
whereas the decision boundary in this case is linear. This results in a bias in the classification pro-
cess because of the difference between the shape of the model boundary and the true boundaries in
the data. The importance of weighting the features becomes significantly greater, when the classes
are not cleanly separated by a decision boundary. In such cases, the natural noise at the decision
boundary may combine with the significant bias introduced by an unweighted Euclidean distance, and result in even more inaccurate classification. By weighting the features in the Euclidean dis-
tance, it is possible to elongate the model boundaries to a shape that aligns more closely with the
class-separation boundaries in the data. The simplest possible weight to use would be to normalize
each dimension by its standard deviation, though in practice, the class label is used in order to deter-
mine the best feature weighting [48]. A detailed discussion of different aspects of feature weighting
schemes is provided in [70].
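The following sketch computes the feature-weighted distance of Equation 6.4, using the simple (unsupervised) inverse-variance weighting mentioned above; it is illustrative only, and a supervised scheme such as [48] would replace the weight computation.

import numpy as np

def inverse_variance_weights(X_train):
    # Simple unsupervised weighting: normalize each dimension by its variance.
    var = np.var(X_train, axis=0)
    return 1.0 / np.maximum(var, 1e-12)

def weighted_distance(x, y, w):
    """Feature-weighted distance of Equation 6.4 (squared Euclidean form)."""
    return float(np.sum(w * (x - y) ** 2))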
In some cases, the class distribution may not be aligned neatly along the axis system, but may
be arbitrarily oriented along different directions in the data, as in Figure 6.1(b). A more general
distance metric is defined with respect to a d × d matrix A rather than a vector of weights W .
$$d(X, Y, A) = (X - Y)^T \cdot A \cdot (X - Y). \qquad (6.5)$$
The matrix A is also sometimes referred to as a metric. The value of A is assumed to be the inverse
of the covariance matrix of the data in the standard definition of the Mahalanobis distance for un-
supervised applications. Generally, the Mahalanobis distance is more sensitive to the global data distribution than the unweighted Euclidean metric, and therefore provides more effective results.
The Mahalanobis distance does not, however, take the class distribution into account. In su-
pervised applications, it makes much more sense to pick A based on the class distribution of the
underlying data. The core idea is to “elongate” the neighborhoods along less discriminative direc-
tions, and to shrink the neighborhoods along more discriminative dimensions. Thus, in the modified
metric, a small (unweighted) step along a discriminative direction, would result in relatively greater
distance. This naturally provides greater importance to more discriminative directions. Numerous
methods such as the linear discriminant [51] can be used in order to determine the most discrim-
inative dimensions in the underlying data. However, the key here is to use a soft weighting of the
different directions, rather than selecting specific dimensions in a hard way. The goal of the matrix A
is to accomplish this. How can A be determined by using the distribution of the classes? Clearly, the
matrix A should somehow depend on the within-class variance and between-class variance, in the
context of linear discriminant analysis. The matrix A defines the shape of the neighborhood within a
threshold distance, to a given test instance. The neighborhood directions with low ratio of inter-class
variance to intra-class variance should be elongated, whereas the directions with high ratio of the
inter-class to intra-class variance should be shrunk. Note that the “elongation” of a neighborhood
direction is achieved by scaling that component of the distance by a larger factor, and therefore
de-emphasizing that direction.
Let D be the full database, and Di be the portion of the data set belonging to class i. Let µ
represent the mean of the entire data set. Let pi = |Di |/|D | be the fraction of records belonging to
class i, µi be the d-dimensional row vector of means of Di , and Σi be the d × d covariance matrix of
Di. Then, the scaled¹ within-class scatter matrix Sw is defined as follows:

$$S_w = \sum_{i=1}^{k} p_i \cdot \Sigma_i. \qquad (6.6)$$
The between-class scatter matrix Sb is defined in terms of the class means as follows:

$$S_b = \sum_{i=1}^{k} p_i \cdot (\mu_i - \mu)^T \cdot (\mu_i - \mu). \qquad (6.7)$$

Note that the matrix Sb is a d × d matrix, since it results from the product of a d × 1 matrix with a 1 × d matrix. Then, the matrix A (of Equation 6.5), which provides the desired distortion of the distances on the basis of the class distribution, can be shown to be the following:

$$A = S_w^{-1/2} \cdot (S_w^{-1/2} \cdot S_b \cdot S_w^{-1/2}) \cdot S_w^{-1/2}. \qquad (6.8)$$
It can be shown that this choice of the metric A provides an excellent discrimination between the dif-
ferent classes, where the elongation in each direction depends inversely on the ratio of the between-class
variance to within-class variance along the different directions. The aforementioned description is
based on the discussion in [44]. The reader may find more details of implementing the approach in
an effective way in that work.
A few special cases of the metric of Equation 6.5 are noteworthy. Setting A to the identity
matrix corresponds to the use of the Euclidean distance. Setting the non-diagonal entries of A to zero results in a situation similar to using a d-dimensional vector of weights for the individual dimensions.
Therefore, the non-diagonal entries contribute to a rotation of the axis-system before the stretching
process. For example, in Figure 6.1(b), the optimal choice of the matrix A will result in greater
importance being shown to the direction illustrated by the arrow in the figure in the resulting metric.
In order to avoid ill-conditioned matrices, especially in the case when the number of training data
points is small, a parameter ε can be used in order to perform smoothing.
$$A = S_w^{-1/2} \cdot (S_w^{-1/2} \cdot S_b \cdot S_w^{-1/2} + \varepsilon \cdot I) \cdot S_w^{-1/2}. \qquad (6.9)$$
Here ε is a small parameter that can be tuned, and the identity matrix is represented by I. The
use of this modification ensures that no particular direction receives infinite weight, which can otherwise occur when the number of data points is small. The use of this parameter ε is analogous to
Laplacian smoothing methods, and is designed to avoid overfitting.
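A sketch of this computation is provided below. It is an illustrative rendering rather than the implementation of [44]: the scatter matrices are estimated from labeled data, the inverse square root of Sw is obtained through an eigendecomposition, and the smoothed metric of Equation 6.9 is then assembled.

import numpy as np

def inv_sqrt(M, tol=1e-10):
    # Symmetric inverse square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    vals = np.maximum(vals, tol)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def class_aware_metric(X, y, eps=0.1):
    """Computes A = Sw^{-1/2} (Sw^{-1/2} Sb Sw^{-1/2} + eps*I) Sw^{-1/2} (Equation 6.9)."""
    classes, d = np.unique(y), X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        p = len(Xc) / len(X)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter (Equation 6.6)
        diff = (Xc.mean(axis=0) - mu).reshape(-1, 1)
        Sb += p * (diff @ diff.T)                        # between-class scatter
    W = inv_sqrt(Sw)
    return W @ (W @ Sb @ W + eps * np.eye(d)) @ W

def metric_distance(x, yv, A):
    diff = x - yv
    return float(diff @ A @ diff)                        # distance of Equation 6.5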
Other heuristic methodologies are also used in the literature for learning feature relevance. One
common methodology is to use cross-validation in which the weights are trained using the original
instances in order to minimize the error rate. Details of such a methodology are provided in [10,48].
It is possible to also use different kinds of search methods such as Tabu search [61] in order to
improve the process of learning weights. This kind of approach has also been used in the context
of text classification, by learning the relative importance of the different words, for computing the
similarity function [43]. Feature relevance has also been shown to be important for other domains
such as image retrieval [53]. In cases where domain knowledge can be used, some features can
be eliminated very easily with tremendous performance gains. The importance of using domain
knowledge for feature weighting has been discussed in [26].
¹The unscaled version may be obtained by multiplying Sw with the number of data points. There is no difference in the final result, whether the scaled or unscaled version is used, within a constant of proportionality.
FIGURE 6.2: Illustration of importance of local adaptivity for nearest neighbor classification: (a) axis-parallel, (b) arbitrarily oriented.
The most effective distance function design may be performed at query-time, by using a weight-
ing that is specific to the test instance [34, 44]. This can be done by learning the weights only on the
basis of the instances that are near a specific test instance. While such an approach is more expen-
sive than global weighting, it is likely to be more effective because of local optimization specific to
the test instance, and is better aligned with the spirit and advantages of lazy learning. This approach
will be discussed in detail in later sections of this survey.
It should be pointed out that many of these algorithms can be considered rudimentary forms of
distance function learning. The problem of distance function learning [3, 21, 75] is fundamental to
the design of a wide variety of data mining algorithms including nearest neighbor classification [68],
and a significant amount of research has been performed in the literature in the context of the
classification task. A detailed survey on this topic may be found in [75]. The nearest neighbor
classifier is often used as the prototypical task in order to evaluate the quality of distance functions
[2, 3, 21, 45, 68, 75], which are constructed using distance function learning techniques.
in the form of different labels in order to make inferences about the relevance of the different di-
mensions. A recent method for distance function learning [76] also constructs instance-specific
distances with supervision, and shows that the use of locality provides superior results. However,
the supervision in this case is not specifically focussed on the traditional classification problem,
since it is defined in terms of similarity or dissimilarity constraints between instances, rather than
labels attached to instances. Nevertheless, such an approach can also be used in order to construct
instance-specific distances for classification, by transforming the class labels into similarity or dis-
similarity constraints. This general principle is used frequently in many works such as those dis-
cussed by [34, 39, 44] for locally adaptive nearest neighbor classification.
Labels provide a rich level of information about the relative importance of different dimensions
in different localities. For example, consider the illustration of Figure 6.2(a). In this case, the feature
Y is more important in locality A of the data, whereas feature X is more important in locality B
of the data. Correspondingly, the feature Y should be weighted more, when using test instances in
locality A of the data, whereas feature X should be weighted more in locality B of the data. In the
case of Figure 6.2(b), we have shown a similar scenario as in Figure 6.2(a), except that different
directions in the data should be considered more or less important. Thus, the case in Figure 6.2(b)
is simply an arbitrarily oriented generalization of the challenges faced in Figure 6.2(a).
One of the earliest methods for performing locally adaptive nearest neighbor classification was
proposed in [44], in which the matrix A (of Equation 6.5) and the neighborhood of the test data point
X are learned iteratively from one another in a local way. The major difference here is that the matrix
A will depend upon the choice of test instance X, since the metric is designed to be locally adaptive.
The algorithm starts off by setting the d × d matrix A to the identity matrix, and then determines the
k-nearest neighbors Nk , using the generalized distance described by Equation 6.5. Then, the value of
A is updated using Equation 6.8 only on the neighborhood points Nk found in the last iteration. This
procedure is repeated till convergence. Thus, the overall iterative procedure may be summarized as
follows:
1. Determine Nk as the set of the k nearest neighbors of the test instance, using Equation 6.5 in
conjunction with the current value of A.
2. Update A using Equation 6.8 on the between-class and within-class scatter matrices of Nk .
At completion, the matrix A is used in Equation 6.5 for k-nearest neighbor classification. In practice,
Equation 6.9 is used in order to avoid giving infinite weight to any particular direction. It should
be pointed out that the approach in [44] is quite similar in principle to an unpublished approach
proposed previously in [39], though it varies somewhat on the specific details of the implementation,
and is also somewhat more robust. Subsequently, a similar approach to [44] was proposed in [34],
which works well with limited data. It should be pointed out that linear discriminant analysis is
not the only method that may be used for “deforming” the shape of the metric in an instance-
specific way to conform better to the decision boundary. Any other model such as an SVM or neural
network that can find important directions of discrimination in the data may be used in order to
determine the most relevant metric. For example, an interesting work discussed in [63] determines
a small number of k-nearest neighbors for each class, in order to span a linear subspace for that
class. The classification is done based not on distances to prototypes, but the distances to subspaces.
The intuition is that the linear subspaces essentially generate pseudo training examples for that
class. A number of methods that use support-vector machines will be discussed in the section on
Support Vector Machines. A review and discussion of different kinds of feature selection and feature
weighting schemes are provided in [8]. It has also been suggested in [48] that feature weighting
may be superior to feature selection, in cases where the different features have different levels of
relevance.
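The two-step procedure described above may be sketched as follows (again, an illustrative rendering rather than the exact implementation of [44]); the metric of Equation 6.9 is recomputed from the scatter matrices of the current neighborhood, and alternated with the retrieval of the neighborhood itself.

import numpy as np

def local_metric(X, y, eps=0.1):
    # Metric of Equation 6.9, computed from the scatter matrices of a (local) labeled sample.
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        p = len(Xc) / len(X)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)
        diff = (Xc.mean(axis=0) - mu).reshape(-1, 1)
        Sb += p * (diff @ diff.T)
    vals, vecs = np.linalg.eigh(Sw)
    W = vecs @ np.diag(np.maximum(vals, 1e-10) ** -0.5) @ vecs.T   # Sw^{-1/2}
    return W @ (W @ Sb @ W + eps * np.eye(d)) @ W

def locally_adaptive_knn(X, y, x_test, k=50, iters=3):
    A = np.eye(X.shape[1])                      # start with the Euclidean metric
    for _ in range(iters):                      # alternate neighborhood retrieval and metric update
        diffs = X - x_test
        dist = np.einsum('ij,jk,ik->i', diffs, A, diffs)
        nbrs = np.argsort(dist)[:k]
        A = local_metric(X[nbrs], y[nbrs])      # update A from the local scatter matrices
    # Final classification: majority vote among the k neighbors under the learned metric.
    labels, counts = np.unique(y[nbrs], return_counts=True)
    return labels[int(np.argmax(counts))]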
the label i, and E^0(T, i) be the event that the test instance T does not contain the label i. Then, in
order to determine whether or not the label i is included in test instance T , the maximum posterior
principle is used:
$$b = \mathrm{argmax}_{b \in \{0,1\}} \, \{ P(E^b(T, i) \,|\, C) \}. \qquad (6.10)$$
In other words, we wish to determine which of the two events (label i being included or not) has the greater posterior probability. Therefore, the Bayes rule can be used in order to obtain the following:

$$b = \mathrm{argmax}_{b \in \{0,1\}} \, \frac{P(E^b(T, i)) \cdot P(C \,|\, E^b(T, i))}{P(C)}. \qquad (6.11)$$
Since the value of P(C) is independent of b, it is possible to remove it from the denominator, without
affecting the maximum argument. This is a standard approach used in all Bayes methods. Therefore,
the best matching label may be expressed as follows:
$$b = \mathrm{argmax}_{b \in \{0,1\}} \; P(E^b(T, i)) \cdot P(C \,|\, E^b(T, i)). \qquad (6.12)$$
The prior probability P(E b (T, i)) can be estimated as the fraction of the labels belonging to a par-
ticular class. The value of P(C|E b (T, i)) can be estimated by using the naive Bayes rule.
$$P(C \,|\, E^b(T, i)) = \prod_{j=1}^{n} P(C(j) \,|\, E^b(T, i)). \qquad (6.13)$$
Each of the terms P(C(j) | E^b(T, i)) can be estimated in a data-driven manner by examining, among the instances satisfying the value b for class label i, the fraction that contains exactly the count C(j) for the label j. Laplacian smoothing is also performed in order to avoid ill-conditioned probabilities. Thus, the correlation between the labels is accounted for by the use of this approach, since each of the terms P(C(j) | E^b(T, i)) indirectly measures the correlation between the labels i and j.
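The decision rule of Equations 6.10 through 6.13 can be sketched as follows. The code is illustrative; the prior and conditional probability tables are hypothetical containers that are assumed to have been estimated from the training data with Laplacian smoothing, as described above, and logarithms are used to avoid numerical underflow.

import numpy as np

def predict_label(i, C, prior, cond):
    """
    Decide whether label i should be assigned to the test instance.
    C[j]          : count of label j among the k nearest neighbors of the test instance.
    prior[i][b]   : assumed estimate of P(E^b(T, i)) for b in {0, 1}.
    cond[i][b][j] : dict mapping a count value to the assumed estimate of P(C(j) | E^b(T, i)).
    """
    scores = {}
    for b in (0, 1):
        log_score = np.log(prior[i][b])
        for j, cj in enumerate(C):
            # Unseen counts fall back to a small smoothed probability.
            log_score += np.log(cond[i][b][j].get(cj, 1e-6))
        scores[b] = log_score
    return max(scores, key=scores.get)   # b = argmax_b P(E^b) * prod_j P(C(j) | E^b)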
This approach is often popularly understood in the literature as a nearest neighbor approach,
and has therefore been discussed in the section on nearest neighbor methods. However, it is more
similar to a local naive Bayes approach (discussed in Section 6.6) rather than a distance-based
approach. This is because the statistical frequencies of the neighborhood labels are used for local
Bayes modeling. Such an approach can also be used for the standard version of the classification
problem (when each instance is associated with exactly one label) by using the statistical behavior
of the neighborhood features (rather than label frequencies). This yields a lazy Bayes approach for
classification [38]. However, the work in [38] also estimates the Bayes probabilities locally only over
the neighborhood in a data driven manner. Thus, the approach in [38] sharpens the use of locality
even further for classification. This is of course a tradeoff, depending upon the amount of training
data available. If more training data is available, then local sharpening is likely to be effective. On
the other hand, if less training data is available, then local sharpening is not advisable, because it
will lead to difficulties in robust estimations of conditional probabilities from a small amount of
data. This approach will be discussed in some detail in Section 6.6.
If desired, it is possible to combine the two methods discussed in [38] and [78] for multi-label
learning in order to learn the information in both the features and labels. This can be done by
using both feature and label frequencies for the modeling process, and the product of the label-
based and feature-based Bayes probabilities may be used for classification. The extension to that
case is straightforward, since it requires the multiplication of Bayes probabilities derived from two
different methods. An experimental study of several variations of nearest neighbor algorithms for
classification in the multi-label scenario is provided in [59].
A number of optimizations such as caching have been proposed in [77] in order to improve the
efficiency of the approach. Local SVM classifiers have been used quite successfully for a variety of
applications such as spam filtering [20].
Therefore, the weight decreases linearly with the distance, but cannot decrease below 0 once the k-th nearest neighbor has been reached. The naive Bayes method is applied in a standard way, except
that the instance weights are used in estimating all the probabilities for the Bayes classifier. Higher
values of k will result in models that do not fluctuate much with variations in the data, whereas very
small values of k will result in models that fit the noise in the data. It has been shown in [38] that
the approach is not too sensitive to the choice of k within a reasonably modest range of values of
k. Other schemes have been proposed in the literature, which use the advantages of lazy learning in
conjunction with the naive Bayes classifier. The work in [79] fuses a standard rule-based learner with
naive Bayes models. A technique discussed in [74] lazily learns multiple Naive Bayes classifiers,
and uses the classifier with the highest estimated accuracy in order to decide the label for the test
instance.
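A compact sketch of such a locally weighted naive Bayes scheme for categorical attributes is shown below. It follows the spirit of [38] rather than its exact implementation: the instance weights fall linearly to zero at the k-th nearest neighbor, and all Bayes counts are accumulated with these weights and Laplacian smoothing.

import numpy as np

def locally_weighted_nb(X_train, y_train, x_test, dist, k=50, alpha=1.0):
    """X_train holds categorical attributes; dist is a distance function between two records."""
    d = np.array([dist(row, x_test) for row in X_train])
    order = np.argsort(d)
    dk = d[order[k - 1]] if d[order[k - 1]] > 0 else 1e-12
    w = np.maximum(0.0, 1.0 - d / dk)      # linear decay, zero at the k-th neighbor and beyond
    classes = np.unique(y_train)
    scores = {}
    for c in classes:
        wc = w[y_train == c]
        prior = (wc.sum() + alpha) / (w.sum() + alpha * len(classes))
        log_score = np.log(prior)
        for j, value in enumerate(x_test):
            # Weighted, Laplace-smoothed estimate of P(attribute_j = value | class c).
            mask = (y_train == c) & (X_train[:, j] == value)
            num = w[mask].sum() + alpha
            den = wc.sum() + alpha * len(np.unique(X_train[:, j]))
            log_score += np.log(num / den)
        scores[c] = log_score
    return max(scores, key=scores.get)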
The work discussed in [78] can also be viewed as an unweighted local Bayes classifier, where
the weights used for all instances are 1, rather than the weighted approach discussed above. Further-
more, the work in [78] uses the frequencies of the labels for Bayes modeling, rather than the actual
features themselves. The idea in [78] is that the other labels themselves serve as features, since the
same instance may contain multiple labels, and sufficient correlation information is available for
learning. This approach is discussed in detail in Section 6.3.7.
many relevant features, only a small subset of them may be used for splitting. When a data set
contains N data points, a decision tree is allowed only O(log(N)) (approximately balanced) splits,
and this may be too small in order to use the best set of features for a particular test instance. Clearly,
the knowledge of the test instance allows the use of more relevant features for construction of the
appropriate decision path at a higher level of the tree construction.
The additional knowledge of the test instance helps in the recursive construction of a path in
which only relevant features are used. One method proposed in [40] is to use a split criterion, which
successively reduces the size of the training data associated with the test instance, until either all remaining instances have the same class label, or all have the same feature values. In both cases, the majority class
label is reported as the relevant class. In order to discard a set of irrelevant instances in a particular
iteration, any standard decision tree split criterion is used, and only those training instances that satisfy the predicate in the same way as the test instance are selected for the next level of the
decision path. The split criterion is decided using any standard decision tree methodology such as
the normalized entropy or the gini-index. The main difference from the split process in the tradi-
tional decision tree is that only the node containing the test instance is relevant in the split, and the
information gain or gini-index is computed on the basis of this node. One challenge with the use of
such an approach is that the information gain in a single node can actually be negative if the original
data is imbalanced. In order to avoid this problem, the training examples are re-weighted, so that the
aggregate weight of each class is the same. It is also relatively easy to deal with missing attributes
in test instances, since the split only needs to be performed on attributes that are present in the test
instance. It has been shown [40] that such an approach yields better classification results, because
of the additional knowledge associated with the test instance during the decision path construction
process.
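The following sketch conveys the overall flow of such a lazy decision path for categorical attributes. It is illustrative rather than the implementation of [40], and in particular omits the class re-weighting step: at each level the split with the highest information gain is chosen, only the branch matching the test instance is retained, and the majority label of the remaining instances is reported.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def lazy_decision_path(X, y, x_test):
    """Greedy construction of a decision path relevant only to x_test (categorical attributes)."""
    idx = np.arange(len(X))
    used = set()
    while len(np.unique(y[idx])) > 1:
        best_attr, best_gain = None, 0.0
        for j in range(X.shape[1]):
            if j in used:
                continue
            # Information gain of splitting the current subset on attribute j.
            values = X[idx, j]
            cond = sum((np.sum(values == v) / len(idx)) * entropy(y[idx][values == v])
                       for v in np.unique(values))
            gain = entropy(y[idx]) - cond
            if gain > best_gain:
                best_attr, best_gain = j, gain
        if best_attr is None:
            break                                   # no informative attribute left
        used.add(best_attr)
        # Follow only the branch that agrees with the test instance.
        idx = idx[X[idx, best_attr] == x_test[best_attr]]
        if len(idx) == 0:
            break
    return (Counter(y[idx]).most_common(1)[0][0] if len(idx) > 0
            else Counter(y).most_common(1)[0][0])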
A particular observation here is noteworthy, since such decision paths can also be used to con-
struct a robust any-time classifier, with the use of principles associated with a random forest ap-
proach [25]. It should be pointed out that a random forest translates to a random path created by
a random set of splits from the instance-centered perspective. Therefore a natural way to imple-
ment the instance-centered random forest approach would be to discretize the data into ranges. A
test instance will be relevant to exactly one range from each attribute. A random set of attributes is
selected, and the intersection of the ranges provides one possible classification of the test instance.
This approach can be repeated in order to provide a very efficient lazy ensemble, and the number
of samples provides the tradeoff between running time and accuracy. Such an approach can be used
in the context of an any-time approach in resource-constrained scenarios. It has been shown in [4]
how such an approach can be used for efficient any-time classification of data streams.
4 ≤ xi ≤ 4. If x1 is symbolic and its value is a, then the corresponding condition in the antecedent is
x1 = a.
As in the case of instance-centered methods, a distance is defined between a test instance and a
rule. Let R = (A1 . . . Am ,C) be a rule with the m conditions A1 . . . Am in the antecedent, and the class
C in the consequent. Let X = (x1 . . . xd ) be a d-dimensional example. Then, the distance Δ(X, R)
between the instance X and the rule R is defined as follows.
$$\Delta(X, R) = \sum_{i=1}^{m} \delta(i)^s. \qquad (6.16)$$
Here s is a real-valued parameter (such as 1, 2, or 3), and δ(i) represents the distance on the ith
conditional. The value of δ(i) is equal to the distance of the instance to the nearest end of the range
for the case of a numeric attribute and the value difference metric (VDM) of Equation 6.1 for the
case of a symbolic attribute. This value of δ(i) is zero, if the corresponding attribute value is a match
for the antecedent condition. The class label for a test instance is defined by the label of the nearest
rule to the test instance. If two or more rules are tied at the same distance, then the one with the greatest
accuracy on the training data is used.
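A sketch of the distance of Equation 6.16 is shown below for numeric antecedent conditions; it is illustrative only, and a symbolic attribute would use the value difference metric of Equation 6.1 in place of the range distance.

def rule_distance(x, antecedent, s=2):
    """
    antecedent: dict mapping an attribute index to a (low, high) range for a numeric condition.
    delta(i) is zero inside the range, otherwise the distance to the nearest end of the range.
    """
    total = 0.0
    for i, (low, high) in antecedent.items():
        if x[i] < low:
            delta = low - x[i]
        elif x[i] > high:
            delta = x[i] - high
        else:
            delta = 0.0
        total += delta ** s
    return total

def nearest_rule_label(x, rules):
    # rules: list of (antecedent, class_label) pairs; classify by the nearest rule.
    return min(rules, key=lambda r: rule_distance(x, r[0]))[1]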
The set of rules in the RISE system are constructed as follows. RISE constructs good rules by
using successive generalizations on the original set of instances in the data. Thus, the algorithm starts
off with the training set of examples. RISE examines each rule one by one, and finds the nearest
example of the same class that is not already covered by the rule. The rule is then generalized in order
to cover this example, by expanding the corresponding antecedent condition. For the case of numeric
attributes, the ranges of the attributes are increased minimally so as to include the new example, and
for the case of symbolic attributes, a corresponding condition on the symbolic attribute is included.
If the effect of this generalization on the global accuracy of the rule is non-negative, then the rule
is retained. Otherwise, the generalization is not used and the original rule is retained. It should be
pointed out that even when generalization does not improve accuracy, it is desirable to retain the
more general rule because of the desirable bias towards simplicity of the model. The procedure is
repeated until no rule can be generalized in a given iteration. It should be pointed out that some
instances may not be generalized at all, and may remain in their original state in the rule set. In the
worst case, no instance is generalized, and the resulting model is a nearest neighbor classifier.
A system called DeEPs has been proposed in [49], which combines the power of rules and lazy
learning for classification purposes. This approach examines how the frequency of an instance’s sub-
set of features varies among the training classes. In other words, patterns that sharply differentiate
between the different classes for a particular test instance are leveraged and used for classifica-
tion. Thus, the specificity to the instance plays an important role in this discovery process. Another
system, HARMONY, has been proposed in [66], which determines rules that are optimized to the
different training instances. Strictly speaking, this is not a lazy learning approach, since the rules are
optimized to training instances (rather than test instances) in a pre-processing phase. Nevertheless,
the effectiveness of the approach relies on the same general principle, and it can also be generalized
for lazy learning if required.
function is computed at the instance using these centers. The combination of functions computed
from each of these centers is then learned with the use of a neural network. Radial basis functions can
be considered three-layer feed-forward networks, in which each hidden unit computes a function of
the form:
$$f_i(x) = e^{-\|x - x_i\|^2 / (2 \sigma_i^2)}. \qquad (6.17)$$
Here σ_i^2 represents the local variance at the center x_i. Note that the function f_i(x) has a very similar form
to that commonly used in kernel density estimation. For ease in discussion, we assume that this is
a binary classification problem, with labels drawn from {+1, −1}, though this general approach
extends much further, even to the extent of regression modeling. The final function is a weighted
combination of these values with weights wi.
$$f^*(x) = \sum_{i=1}^{N} w_i \cdot f_i(x). \qquad (6.18)$$
Here x1 . . . xN represent the N different centers, and wi denotes the weight of center i, which is
learned in the neural network. In classical instance-based methods, each data point xi is an individual
training instance, and the weight wi is set to +1 or −1, depending upon its label. However, in RBF
methods, the weights are learned with a neural network approach, since the centers are derived from
the underlying training data, and do not have a label directly attached to them.
The N centers x1 . . . xN are typically constructed with the use of an unsupervised approach
[19, 37, 46], though some recent methods also use supervised techniques for constructing the cen-
ters [71]. The unsupervised methods [19, 37, 46] typically use a clustering algorithm in order to
generate the different centers. A smaller number of centers typically results in smaller complexity,
and greater efficiency of the classification process. Radial-basis function networks are related to
sigmoidal function networks (SGF), which have one unit for each instance in the training data. In
this sense sigmoidal networks are somewhat closer to classical instance-based methods, since they
do not have a first phase of cluster-based summarization. While radial-basis function networks are
generally more efficient, the points at the different cluster boundaries may often be misclassified.
It has been shown in [52] that RBF networks may sometimes require ten times as much training
data as SGF in order to achieve the same level of accuracy. Some recent work [71] has shown how
supervised methods even at the stage of center determination can significantly improve the accu-
racy of these classifiers. Radial-basis function networks can therefore be considered an evolution of
nearest neighbor classifiers, where more sophisticated (clustering) methods are used for prototype
selection (or re-construction in the form of cluster centers), and neural network methods are used
for combining the density values obtained from each of the centers.
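A minimal sketch along these lines is shown below. It is illustrative only: the centers are constructed with an unsupervised clustering (scikit-learn's k-means is assumed to be available), the bandwidths are taken as the local cluster variances, and the output weights of Equation 6.18 are fit by least squares for a binary problem with labels in {+1, -1}.

import numpy as np
from sklearn.cluster import KMeans

class SimpleRBFNetwork:
    def __init__(self, n_centers=20):
        self.n_centers = n_centers

    def _features(self, X):
        # f_i(x) = exp(-||x - x_i||^2 / (2 sigma_i^2)) for each center x_i (Equation 6.17)
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.sigma2[None, :]))

    def fit(self, X, y):
        """y is assumed to be coded as +1 / -1 for the binary case."""
        km = KMeans(n_clusters=self.n_centers, n_init=10).fit(X)
        self.centers = km.cluster_centers_
        # Local variance of each cluster, used as the bandwidth of its basis function.
        self.sigma2 = np.array([
            max(np.mean(((X[km.labels_ == i] - c) ** 2).sum(axis=1)), 1e-6)
            for i, c in enumerate(self.centers)])
        Phi = self._features(X)
        # Learn the weights w_i of Equation 6.18 by least squares.
        self.w, *_ = np.linalg.lstsq(Phi, y.astype(float), rcond=None)
        return self

    def predict(self, X):
        return np.sign(self._features(X) @ self.w)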
FIGURE 6.4: Visual profiles of the accuracy density in a 2-dimensional subspace (Attribute X and Attribute Y), with the test instance marked in each panel: (a) the accuracy density profile, and (b) an isolated data locality.
by X1 . . . XN . Let us further assume that the k classes in the data are denoted by C1 . . . Ck . The number
of points belonging to the class Ci is ni, so that ∑_{i=1}^{k} ni = N. The data set associated with the class i is denoted by Di. This means that ∪_{i=1}^{k} Di = D. The probability density at a given point is determined
by the sum of the smoothed values of the kernel functions Kh (·) associated with each point in the
data set. Thus, the density estimate of the data set D at the point x is defined as follows:
$$f(x, \mathcal{D}) = \frac{1}{n} \cdot \sum_{X_i \in \mathcal{D}} K_h(x - X_i). \qquad (6.19)$$

Here, n denotes the number of points in the data set over which the density is computed.
The kernel function is a smooth unimodal distribution such as the Gaussian function:
$$K_h(x - X_i) = \frac{1}{\sqrt{2\pi} \cdot h} \cdot e^{-\frac{\|x - X_i\|^2}{2 h^2}}. \qquad (6.20)$$
The kernel function is dependent on the use of a parameter h, which is the level of smoothing. The
accuracy of the density estimate depends upon this width h, and several heuristic rules have been
proposed for estimating the bandwidth [6].
The value of the density f (x, D ) may differ considerably from f (x, Di ) because of the difference
in distributions of the different classes. Correspondingly, the accuracy density A (x, Ci , D ) for the
class Ci is defined as follows:
$$\mathcal{A}(x, C_i, \mathcal{D}) = \frac{n_i \cdot f(x, \mathcal{D}_i)}{\sum_{j=1}^{k} n_j \cdot f(x, \mathcal{D}_j)}. \qquad (6.21)$$
The above expression always lies between 0 and 1, and is simply an estimation of the Bayesian
posterior probability of the test instance belonging to class Ci . It is assumed that the a-priori proba-
bility of the test instance belonging to class Ci , (without knowing any of its feature values) is simply
equal to its fractional presence ni /N in class Ci . The conditional probability density after knowing
the feature values x is equal to f (x, Di ). Then, by applying the Bayesian formula for posterior prob-
abilities, the expression of Equation 6.21 is obtained. The higher the value of the accuracy density,
the greater the relative density of Ci compared to the other classes.
Another related measure is the interest density I (x, Ci , D ), which is the ratio of the density of
the class Ci to the overall density of the data.
$$\mathcal{I}(x, C_i, \mathcal{D}) = \frac{f(x, \mathcal{D}_i)}{f(x, \mathcal{D})}. \qquad (6.22)$$
The class Ci is over-represented at x when the interest density is larger than one. The dominant class at the coordinate x is denoted by C_M(x, D), and is equal to argmax_{i∈{1,...,k}} I(x, Ci, D). Correspondingly, the maximum interest density at x is denoted by I_M(x, D) = max_{i∈{1,...,k}} I(x, Ci, D).
Both the interest and accuracy density are valuable quantifications of the level of dominance of the
different classes. The interest density is more effective at comparing among the different classes at
a given point, whereas the accuracy density is more effective at providing an idea of the absolute
accuracy at a given point.
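The quantities of Equations 6.19 through 6.22 can be computed directly at a query point, as in the following sketch (illustrative code, not the implementation of [6]); the optional dims argument restricts the computation to a chosen subspace E.

import numpy as np

def kernel_density(x, X, h=1.0):
    # Gaussian-kernel density estimate at x (Equations 6.19 and 6.20); the normalizing
    # constant is omitted because it cancels in the accuracy and interest ratios below.
    d2 = ((X - x) ** 2).sum(axis=1)
    return np.mean(np.exp(-d2 / (2.0 * h * h)))

def accuracy_and_interest(x, X, y, h=1.0, dims=None):
    """dims optionally restricts the computation to a subspace E (a list of attribute indices)."""
    if dims is not None:
        X, x = X[:, dims], x[dims]
    classes = np.unique(y)
    n = np.array([np.sum(y == c) for c in classes], dtype=float)
    f_class = np.array([kernel_density(x, X[y == c], h) for c in classes])
    f_all = kernel_density(x, X, h)
    accuracy = n * f_class / np.sum(n * f_class)          # Equation 6.21
    interest = f_class / max(f_all, 1e-12)                # Equation 6.22
    dominant = classes[int(np.argmax(interest))]          # dominant class C_M(x, D)
    return dict(zip(classes, accuracy)), dict(zip(classes, interest)), dominant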
So far, it has been assumed that all of the above computations are performed in the full dimen-
sional space. However, it is also possible to project the data onto the subspace E in order to perform
this computation. Such a calculation would quantify the discriminatory power of the subspace E at
x. In order to denote the use of the subspace E in any computation, the corresponding expression
will be superscripted with E. Thus the density in a given subspace E is denoted by f E (·, ·), the
accuracy density by A E (·, ·, ·), and the interest density by I E (·, ·, ·). Similarly, the dominant class
is defined using the subspace-specific interest density at that point, and the accuracy density profile
is defined for that particular subspace. An example of the accuracy density profile (of the dominant
class) in a 2-dimensional subspace is illustrated in Figure 6.4(a). The test instance is also labeled in
the same figure in order to illustrate the relationship between the density profile and test instance.
It is desired to find those projections of the data in which the interest density value I^E_M(t, D) is the maximum. It is quite possible that in some cases, different subspaces may provide different
information about the class behavior of the data; these are the difficult cases in which a test instance
may be difficult to classify accurately. In such cases, the user may need to isolate particular data
localities in which the class distribution is further examined by a hierarchical exploratory process.
While the density values are naturally defined over the continuous space of quantitative attributes, it
has been shown in [6] that intuitively analogous values can be defined for the interest and accuracy
densities even when categorical attributes are present.
For a given test example, the end user is provided with unique options in exploring various
characteristics that are indicative of its classification. A subspace determination process is used on
the basis of the highest interest densities at a given test instance. Thus, the subspace determination
process finds the appropriate local discriminative subspaces for a given test example. These are the
various possibilities (or branches) of the decision path, which can be utilized in order to explore the
regions in the locality of the test instance. In each of these subspaces, the user is provided with a
visual profile of the accuracy density. This profile provides the user with an idea of which branch is
likely to lead to a region of high accuracy for that test instance. This visual profile can also be utilized
in order to determine which of the various branches are most suitable for further exploration. Once
such a branch has been chosen, the user has the option to further explore into a particular region
of the data that has high accuracy density. This process of data localization can quickly isolate an
arbitrarily shaped region in the data containing the test instance. This sequence of data localizations
creates a path (and a locally discriminatory combination of dimensions) that reveals the underlying
classification causality to the user.
In the event that a decision path is chosen that is not strongly indicative of any class, the user has
the option to backtrack to a higher level node and explore a different path of the tree. In some cases,
different branches may be indicative of the test example belonging to different classes. These are
the “ambiguous cases” in which a test example could share characteristics from multiple classes.
Many standard modeling methods may classify such an example incorrectly, though the subspace
decision path method is much more effective at providing the user with an intensional knowledge
of the test example because of its exploratory approach. This can be used in order to understand the
causality behind the ambiguous classification behavior of that instance.
The overall algorithm for decision path construction is illustrated in Figure 6.5. The details of
the subroutines in the procedure are described in [6], though a summary description is provided
here. The input to the system is the data set D , the test instance t for which one wishes to find the
diagnostic characteristics, a maximum branch factor bmax , and a minimum interest density irmin . In
addition, the maximum dimensionality l of any subspace used in data exploration is provided as user input. The value of l = 2 is especially interesting because it allows for the use of a visual profile
of the accuracy density. Even though it is natural to use 2-dimensional projections because of their
visual interpretability, the data exploration process along a given path reveals a higher dimensional
combination of dimensions, which is most suitable for the test instance. The branch factor bmax is
the maximum number of possibilities presented to the user, whereas the value of irmin is the corre-
sponding minimum interest density of the test instance in any subspace presented to the user. The
value of irmin is chosen to be 1, which is the break-even value for interest density computation. This
break-even value is one at which the interest density neither under-estimates nor over-estimates the
accuracy of the test instance with respect to a class. The variable PATH consists of the pointers to the
sequence of successively reduced training data sets, which are obtained in the process of interactive
decision tree construction. The list PATH is initialized to a single element, which is the pointer to the
original data set D . At each point in the decision path construction process, the subspaces E1 . . . Eq
are determined, which have the greatest interest density (of the dominant class) in the locality of
the test instance t. This process is accomplished by the procedure ComputeClassificationSubspaces.
Once these subspaces have been determined, the density profile is constructed for each of them by
the procedure ConstructDensityProfile. Even though one subspace may have higher interest density
at the test instance than another, the true value of a subspace in separating the data locality around
the test instance is often a subjective judgement that depends both upon the interest density of the
test instance and the spatial separation of the classes. Such a judgement requires human intuition,
which can be harnessed with the use of the visual profile of the accuracy density profiles of the var-
ious possibilities. These profiles provide the user with an intuitive idea of the class behavior of the
data set in various projections. If the class behavior across different projections is not very consis-
tent (different projections are indicative of different classes), then such a node is not very revealing
of valuable information. In such a case, the user may choose to backtrack by specifying an earlier
node on PATH from which to start further exploration.
On the other hand, if the different projections provide a consistent idea of the class behavior, then
the user utilizes the density profile in order to isolate a small region of the data in which the accuracy
density of the data in the locality of the test instance is significantly higher for a particular class.
This is achieved by the procedure IsolateData. This isolated region may be a cluster of arbitrary
shape depending upon the region covered by the dominating class. However, the use of the visual
profile helps to maintain the interpretability of the isolation process in spite of the arbitrary contour
of separation. An example of such an isolation is illustrated in Figure 6.4(b), and is facilitated by
the construction of the visual density profile. The procedure returns the isolated data set L along
with a number called the accuracy significance p(L, C_i) of the class C_i. The pointer to this new
data set L is added to the end of PATH. At that point, the user decides whether further exploration
into that isolated data set is necessary. If so, the same process of subspace analysis is repeated on
this node. Otherwise, the process is terminated and the most relevant class label is reported. The
overall exploration process also provides the user with a good diagnostic idea of how the local data
distributions along different dimensions relate to the class label.
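The overall flow can be summarized in a short sketch. The following is a minimal Python skeleton of the interactive loop described above; the subroutines ComputeClassificationSubspaces, ConstructDensityProfile, and IsolateData are the ones summarized in the text (their actual definitions are given in [6]) and are passed in as callables here, while choose_branch is a hypothetical callback standing in for the user's visual exploration and backtracking decisions.

def subspace_decision_path(D, t, b_max, ir_min, l_dim,
                           compute_subspaces, build_profile, isolate_data,
                           choose_branch):
    PATH = [D]            # pointers to the successively reduced data sets
    current = D
    while True:
        # Subspaces with the highest interest density in the locality of t.
        subspaces = compute_subspaces(current, t, b_max, ir_min, l_dim)
        profiles = [build_profile(current, t, E) for E in subspaces]

        # The user inspects the accuracy-density profiles and either follows
        # a branch, backtracks to an earlier node on PATH, or terminates.
        action, value = choose_branch(subspaces, profiles, PATH)
        if action == "backtrack":
            current = PATH[value]                  # value: index into PATH
        elif action == "isolate":
            isolated, significance = isolate_data(current, t, value)
            PATH.append(isolated)                  # value: chosen subspace
            current = isolated
        else:                                      # "stop": report the class
            return current, PATH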
Bibliography
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological varia-
tions, and system approaches, AI communications, 7(1):39–59, 1994.
[2] C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in
high dimensional space, Data Theory—ICDT Conference, 2001; Lecture Notes in Computer
Science, 1973:420–434, 2001.
[3] C. Aggarwal. Towards systematic design of distance functions for data mining applications,
Proceedings of the KDD Conference, pages 9–18, 2003.
[4] C. Aggarwal and P. Yu. Locust: An online analytical processing framework for high dimen-
sional classification of data streams, Proceedings of the IEEE International Conference on
Data Engineering, pages 426–435, 2008.
[5] C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
[12] D. Aha. Lazy learning: Special issue editorial, Artificial Intelligence Review, 11:7–10, 1997.
[13] F. Angiulli. Fast nearest neighbor condensation for large data sets classification, IEEE Trans-
actions on Knowledge and Data Engineering, 19(11):1450–1464, 2007.
[14] M. Ankerst, G. Kastenmuller, H.-P. Kriegel, and T. Seidl. Nearest Neighbor Classification in
3D Protein Databases, ISMB-99 Proceedings, pages 34–43, 1999.
[15] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning, Artificial Intelligence Re-
view, 11(1–5):11–73, 1997.
[16] S. D. Bay. Combining nearest neighbor classifiers through multiple feature subsets. Proceed-
ings of ICML Conference, pages 37–45, 1998.
[17] J. Beringer and E. Hullermeier. Efficient instance-based learning on data streams, Intelligent
Data Analysis, 11(6):627–650, 2007.
[18] M. Birattari, G. Bontempi, and M. Bersini. Lazy learning meets the recursive least squares
algorithm, Advances in neural information processing systems, 11:375–381, 1999.
[19] C. M. Bishop. Improving the generalization properties of radial basis function neural networks,
Neural Computation, 3(4):579–588, 1991.
[20] E. Blanzieri and A. Bryl. Instance-based spam filtering using SVM nearest neighbor classifier,
FLAIRS Conference, pages 441–442, 2007.
[21] J. Blitzer, K. Weinberger, and L. Saul. Distance metric learning for large margin nearest neigh-
bor classification, Advances in neural information processing systems: 1473–1480, 2005.
[22] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative
evaluation. Proceedings of the SIAM Conference on Data Mining, pages 243–254, 2008.
[23] L. Bottou and V. Vapnik. Local learning algorithms, Neural Computation, 4(6):888–900,
1992.
[24] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees,
Wadsworth, 1984.
[25] L. Breiman. Random forests, Machine Learning, 48(1):5–32, 2001.
[26] T. Cain, M. Pazzani, and G. Silverstein. Using domain knowledge to influence similarity judge-
ments, Proceedings of the Case-Based Reasoning Workshop, pages 191–198, 1991.
[27] C. L. Chang. Finding prototypes for nearest neighbor classifiers, IEEE Transactions on Com-
puters, 100(11):1179–1184, 1974.
[28] W. Cheng and E. Hullermeier. Combining instance-based learning and logistic regression for
multilabel classification, Machine Learning, 76 (2–3):211–225, 2009.
[29] J. G. Cleary and L. E. Trigg. K ∗ : An instance-based learner using an entropic distance measure,
Proceedings of ICML Conference, pages 108–114, 1995.
[30] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic
features, Machine Learning, 10(1):57–78, 1993.
[31] T. Cover and P. Hart. Nearest neighbor pattern classification, IEEE Transactions on Informa-
tion Theory, 13(1):21–27, 1967.
[32] P. Cunningham, A taxonomy of similarity mechanisms for case-based reasoning, IEEE Trans-
actions on Knowledge and Data Engineering, 21(11):1532–1543, 2009.
[33] B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE
Computer Society Press, 1990.
[34] C. Domeniconi, J. Peng, and D. Gunopulos. Locally adaptive metric nearest-neighbor classi-
fication, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1281–1285,
2002.
[35] C. Domeniconi and D. Gunopulos. Adaptive nearest neighbor classification using support vec-
tor machines, Advances in Neural Information Processing Systems, 1: 665–672, 2002.
[37] G. W. Flake. Square unit augmented, radially extended, multilayer perceptrons, In Neural
Networks: Tricks of the Trade, pages 145–163, 1998.
[38] E. Frank, M. Hall, and B. Pfahringer. Locally weighted Naive Bayes, Proceedings of the Nine-
teenth Conference on Uncertainty in Artificial Intelligence, pages 249–256, 2002.
[39] J. Friedman. Flexible Nearest Neighbor Classification, Technical Report, Stanford University,
1994.
[40] J. Friedman, R. Kohavi, and Y. Yun. Lazy decision trees, Proceedings of the National Confer-
ence on Artificial Intelligence, pages 717–724, 1996.
[41] S. Garcia, J. Derrac, J. Cano, and F. Herrera. Prototype selection for nearest neighbor classi-
fication: Taxonomy and empirical study, IEEE Transactions on Pattern Analysis and Machine
Intelligence, 34(3):417–436, 2012.
[42] D. Gunopulos and G. Das. Time series similarity measures. In Tutorial notes of the sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 243–
307, 2000.
[43] E. Han, G. Karypis, and V. Kumar. Text categorization using weight adjusted k-nearest neigh-
bor classification, Proceedings of the Pacific-Asia Conference on Knowledge Discovery and
Data Mining, pages 53–65, 2001.
[44] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 18(6):607–616, 1996.
[45] A. Hinneburg, C. Aggarwal, and D. Keim. What is the nearest neighbor in high dimensional
spaces? Proceedings of VLDB Conference, pages 506–515, 2000.
[46] Y. S. Hwang and S. Y. Bang. An efficient method to construct a radial basis function neural
network classifier, Neural Networks, 10(8):1495–1503, 1997.
[47] N. Jankowski and M. Grochowski. Comparison of instance selection algorithms I: Algorithms
survey, Lecture Notes in Computer Science, 3070:598–603, 2004.
[48] R. Kohavi, P. Langley, and Y. Yun. The utility of feature weighting in nearest-neighbor algo-
rithms, Proceedings of the Ninth European Conference on Machine Learning, pages 85–92,
1997.
[49] J. Li, G. Dong, K. Ramamohanarao, and L. Wong. Deeps: A new instance-based lazy discovery
and classification system, Machine Learning, 54(2):99–124, 2004.
[50] D. G. Lowe. Similarity metric learning for a variable-kernel classifier, Neural computation,
7(1):72–85, 1995.
[51] G. J. McLachlan. Discriminant analysis and statistical pattern recognition, Wiley-
Interscience, Vol. 544, 2004.
[52] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units, Neural
computation, 1(2), 281–294, 1989.
[53] J. Peng, B. Bhanu, and S. Qing. Probabilistic feature relevance learning for content-based
image retrieval, Computer Vision and Image Understanding, 75(1):150–164, 1999.
[54] J. Peng, D. Heisterkamp, and H. Dai. LDA/SVM driven nearest neighbor classification, Com-
puter Vision and Pattern Recognition Conference, 1–58, 2001.
[55] J. R. Quinlan. Combining instance-based and model-based learning. Proceedings of ICML
Conference, pages 236–243, 1993.
[56] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[57] H. Samet. Foundations of multidimensional and metric data structures, Morgan Kaufmann,
2006.
[58] M. Sonka, H. Vaclav, and R. Boyle. Image Processing, Analysis, and Machine Vision. Thom-
son Learning, 1999.
[59] E. Spyromitros, G. Tsoumakas, and I. Vlahavas. An empirical study of lazy multilabel classi-
fication algorithms, In Artificial Intelligence: Theories, Models and Applications, pages 401–
406, 2008.
[60] C. Stanfill and D. Waltz. Toward memory-based reasoning, Communications of the ACM,
29(12):1213–1228, 1986.
[61] M. Tahir, A. Bouridane, and F. Kurugollu. Simultaneous feature selection and feature weight-
ing using Hybrid Tabu Search in K-nearest neighbor classifier, Pattern Recognition Letters,
28(4):438–446, 2007.
[62] P. Utgoff. Incremental induction of decision trees, Machine Learning, 4(2):161–186, 1989.
[63] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest algorithms, Neural
Information Processing Systems, pages 985–992, 2001.
[64] D. Volper and S. Hampson. Learning and using specific instances, Biological Cybernetics,
57(1–2):57–71, 1987.
[65] J. Wang, P. Neskovic, and L. Cooper. Improving nearest neighbor rule with a simple adaptive
distance measure, Pattern Recognition Letters, 28(2):207–213, 2007.
[66] J. Wang and G. Karypis. On mining instance-centric classification rules, IEEE Transactions
on Knowledge and Data Engineering, 18(11):1497–1511, 2006.
[67] I. Watson and F. Marir. Case-based reasoning: A review, Knowledge Engineering Review,
9(4):327–354, 1994.
[68] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neigh-
bor classification, NIPS Conference, MIT Press, 2006.
[69] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weight-
ing methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–
5):273–314, 1997.
[70] D. Wettschereck and D. Aha. Weighting features. Case-Based Reasoning Research and De-
velopment, pp. 347–358, Springer Berlin, Heidelberg, 1995.
[71] D. Wettschereck and T. Dietterich. Improving the performance of radial basis function net-
works by learning center locations, NIPS, Vol. 4:1133–1140, 1991.
[72] D. Wilson and T. Martinez. Reduction techniques for instance-based learning algorithms, Ma-
chine Learning 38(3):257–286, 2000.
[73] D. Wilson and T. Martinez. An integrated instance-based learning algorithm, Computational
Intelligence, 16(1):28–48, 2000.
[74] Z. Xie, W. Hsu, Z. Liu, and M. L. Lee. Snnb: A selective neighborhood based naive Bayes for
lazy learning, Advances in Knowledge Discovery and Data Mining, pages 104–114, Springer,
2002.
[77] H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor
classification for visual category recognition, Computer Vision and Pattern Recognition, pages
2126–2136, 2006.
[78] M. Zhang and Z. H. Zhou. ML-kNN: A lazy learning approach to multi-label learning, Pattern
Recognition, 40(7): 2038–2045, 2007.
[79] Z. Zheng and G. Webb. Lazy learning of Bayesian rules. Machine Learning, 41(1):53–84,
2000.
[80] Z. H. Zhou and Y. Yu. Ensembling local learners through multimodal perturbation, IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(4):725–735, 2005.
[81] X. Zhu and X. Wu. Scalable representative instance selection and ranking, Proceedings of
IEEE International Conference on Pattern Recognition, Vol 3:352–355, 2006.
Chapter 7
Support Vector Machines
Po-Wei Wang
National Taiwan University
Taipei, Taiwan
[email protected]
Chih-Jen Lin
National Taiwan University
Taipei, Taiwan
[email protected]
7.1 Introduction
Machine learning algorithms have a tendency to over-fit. It is possible to achieve an arbitrarily
low training error with some complex models, but the testing error may be high, because of poor
generalization to unseen test instances. This is problematic, because the goal of classification is not
to obtain good accuracy on known training data, but to predict unseen test instances correctly. Vap-
nik’s work [34] was motivated by this issue. His work started from a statistical derivation on linearly
separable scenarios, and found that classifiers with maximum margins are less likely to overfit. This
concept of maximum margin classifiers eventually evolved into support vector machines (SVMs).
SVM is a theoretically sound approach for controlling model complexity. It picks important
instances to construct the separating surface between data instances. When the data is not linearly
separable, it can either penalize violations with loss terms, or leverage kernel tricks to construct non-
linear separating surfaces. SVMs can also perform multiclass classifications in various ways, either
by an ensemble of binary classifiers or by extending margin concepts. The optimization techniques
of SVMs are mature, and SVMs have been used widely in many application domains.
This chapter will introduce SVMs from several perspectives. We introduce maximum margin
classifiers in Section 7.2 and the concept of regularization in Section 7.3. The dual problems of
SVMs are derived in Section 7.4. Then, to construct nonlinear separating surfaces, feature trans-
formations and kernel tricks are discussed in Section 7.5. To solve the kernelized SVM problem,
decomposition methods are introduced in Section 7.6. Further, we discuss the multiclass strategies
in Section 7.7 and summarize in Section 7.8.
FIGURE 7.1: Infinitely many classifiers could separate the positive and negative instances.
This chapter is not the only existing introduction to SVMs. A number of surveys and books may
be found in [6, 12, 33]. While this chapter does not discuss applications of SVMs, the reader is
referred to [16] for a guide to such applications.
Note that we may use different norms in the definition of distance. By [24, Theorem 2.2], there is a
closed-form solution when H is a hyperplane:

    Distance(x, H) = |w^T x + b| / ‖w‖_* ,

in which ‖·‖_* is the dual norm of the norm we choose in the definition of distance. The margin,
which measures the gap between the separating surface H and the nearest instance, is defined as
follows:

    M ≡ min_i |w^T x_i + b| / ‖w‖_* .

When the instance x_i lies on the correct side of the hyperplane, the numerator of the margin M could be
simplified as follows:

    |w^T x_i + b| = y_i (w^T x_i + b) > 0.

Thus, to find a maximum margin classifier, we can instead solve the following problem:

    max_{w, b, M}  M,    s.t.  y_i (w^T x_i + b) / ‖w‖_* ≥ M.

It could be verified that any non-zero multiple of the optimal solution (w̄, b̄) is still an optimal
solution. Therefore, we could set M ‖w‖_* = 1, so that the problem could be rewritten in a simpler form

    min_{w, b}  ‖w‖_* ,    s.t.  y_i (w^T x_i + b) ≥ 1,  ∀ i = 1, ..., l.

If the Euclidean norm ‖x‖_2 ≡ (∑_i x_i^2)^{1/2} is chosen, this formulation reverts to the original SVM formu-
lation in [1]. If the infinity norm ‖x‖_∞ ≡ max_i |x_i| is chosen, it instead leads to the L1-regularized
SVM. An illustration is provided in Figures 7.2(a) and 7.2(b).
The concept of maximum margin could also be explained by statistical learning theory. From
[35], we know that if training data and testing data are sampled i.i.d., then with probability 1 − η we
have

    testing error ≤ training error + √( [ VC · (ln(2l/VC) + 1) − ln(η/4) ] / l ),

in which VC is the Vapnik-Chervonenkis dimension of our model. In our separable scenario, the
training error is zero. Therefore, the minimization of the testing error could be achieved by con-
trolling the VC dimension of our model. However, optimizing the VC dimension is a difficult task.
Instead, we may optimize a simpler upper bound for VC. Assume that all data lie in an n-dimensional
sphere of radius R. It is proven in [35, Theorem 10.3] that the VC dimension of the linear classifier
is upper bounded,

    VC ≤ min(‖w‖_2^2 R^2, n) + 1.

As a result, ‖w‖_2^2 bounds the VC dimension, and the latter bounds the testing error. The maximum
margin classifier, which has the smallest ‖w‖_2^2, minimizes an upper bound of the testing error. Thus,
SVM is likely to have a smaller generalization error than other linear classifiers [28, 35].
The violation of each instance can be penalized with the L1 hinge loss,

    ξ_L1(w, b; y_i, x_i) = max(0, 1 − y_i (w^T x_i + b)).

If we square the violation, the loss function is referred to as the L2 hinge loss,

    ξ_L2(w, b; y_i, x_i) = max(0, 1 − y_i (w^T x_i + b))^2.
FIGURE 7.4: We can take different tradeoffs between margin size and violations.
On the other hand, when the infinity norm is chosen to define the distance, we have the L1-
regularized SVM problem in [5],

    min_{w, b}  ‖w‖_1 + C ∑_{i=1}^l ξ_L1(w, b; y_i, x_i).
As we can see, all these formulations are composed of two parts: a norm from the maximum margin
objective, and the loss terms from the violation of the models. The maximum margin part is also
called the Regularization Term, which is used to control the model complexity with respect to the
violation of the models. From the perspective of numeric stability, the regularization term also plays
a role in the balance between the accuracy of the model and the numeric range of w . It ensures that
the value of w does not take on extreme values.
Because of the tradeoffs between regularization and loss terms, the loss might not be zero even
if the data are linearly separable. However, for L2 regularized SVM with L1 hinge loss, it is proved
that when C is large enough, the solution (w̄, b̄) of the SVM with loss terms is identical to the
solution of the SVM without loss terms under the separable scenario. A proof of this is available
in [23].
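As a small numerical illustration of how the regularization term and the loss terms combine, the following NumPy sketch evaluates the L1 and L2 hinge losses and the L2-regularized objective with the L1 hinge loss on a toy data set; the data, the candidate (w, b), and C = 1 are arbitrary choices made only for the example.

import numpy as np

# Toy data: the rows of X are instances x_i, and y holds labels in {+1, -1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, C = np.array([0.5, 0.5]), 0.0, 1.0        # an arbitrary candidate model

margins = y * (X @ w + b)                       # y_i (w^T x_i + b)
xi_L1 = np.maximum(0.0, 1.0 - margins)          # L1 hinge loss per instance
xi_L2 = np.maximum(0.0, 1.0 - margins) ** 2     # L2 hinge loss per instance

# L2-regularized objective with the L1 hinge loss: 0.5 ||w||_2^2 + C * sum(xi).
objective = 0.5 * (w @ w) + C * xi_L1.sum()
print(xi_L1, xi_L2, objective)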
Different from the original non-differentiable problem, this formulation is a convex quadratic
programming problem with linear constraints. Since this problem is a constrained optimization
problem, a Lagrangian relaxation approach may be used to derive the optimality condition. The
Lagrangian is defined to be the sum of the objective function, and a constraint-based penalty term
with non-negative Lagrange multipliers α ≥ 0 and β ≥ 0:

    L(w, b, ξ, α, β) ≡ (1/2) w^T w + C ∑_{i=1}^l ξ_i − ∑_{i=1}^l α_i (y_i (w^T x_i + b) − 1 + ξ_i) − ∑_{i=1}^l β_i ξ_i.

Consider the problem min_{w,b,ξ} P(w, b, ξ) and the problem max_{α≥0, β≥0} D(α, β), in which

    P(w, b, ξ) ≡ max_{α≥0, β≥0} L(w, b, ξ, α, β)    and    D(α, β) ≡ min_{w,b,ξ} L(w, b, ξ, α, β).
We point out the following property of the function P. When (w, b, ξ) is a feasible solution of (7.2),
P(w, b, ξ) equals the objective of (7.2), obtained by setting the Lagrange multipliers to zero. Otherwise,
P(w, b, ξ) → ∞ when (w, b, ξ) is not feasible.
As a result, the problem min_{w,b,ξ} P(w, b, ξ) is equivalent to (7.2). Furthermore, because
L(w, b, ξ, α, β) is convex in (w, b, ξ) and concave in (α, β), we have the following equality from
the saddle point property of concave-convex functions:

    min_{w,b,ξ} P(w, b, ξ) = min_{w,b,ξ} max_{α≥0, β≥0} L(w, b, ξ, α, β) = max_{α≥0, β≥0} min_{w,b,ξ} L(w, b, ξ, α, β) = max_{α≥0, β≥0} D(α, β).
This property is popularly referred to as the Minimax Theorem [32] of concave-convex functions.
The left-hand side, min_{w,b,ξ} P(w, b, ξ), is called the Primal Problem of SVM, and the right-hand
side, max_{α≥0, β≥0} D(α, β), is the Dual Problem of SVM. Similar to the primal objective function,

    D(α, β) → −∞    when    ∑_{i=1}^l α_i y_i ≠ 0    or    Ce − α + β ≠ 0,

where e is the vector of ones. However, such an (α, β) does not need to be considered, because
the dual problem maximizes its objective function. Under any given (α, β) with D(α, β) ≠ −∞, the
minimum (ŵ, b̂, ξ̂) is achieved in the unconstrained problem min_{w,b,ξ} L(w, b, ξ, α, β) if and only if
∇_{w,b,ξ} L(w, b, ξ, α, β) = 0. In other words,

    ∂L/∂w = 0 ⇒ ŵ − ∑_{i=1}^l α_i y_i x_i = 0,    ∂L/∂b = 0 ⇒ ∑_{i=1}^l y_i α_i = 0,    ∂L/∂ξ = 0 ⇒ Ce − α + β = 0.   (7.3)

Thus, the dual objective, D(α, β) ≡ min_{w,b,ξ} L(w, b, ξ, α, β), could be expressed as follows:

    D(α, β) = L(ŵ, b̂, ξ̂, α, β) = −(1/2) ∑_{i=1}^l ∑_{j=1}^l α_i α_j y_i y_j x_i^T x_j + ∑_{i=1}^l α_i.

The dual problem therefore becomes

    max_{α, β}   −(1/2) ∑_{i=1}^l ∑_{j=1}^l α_i α_j y_i y_j x_i^T x_j + ∑_{i=1}^l α_i,
    s.t.   ∑_{i=1}^l y_i α_i = 0,   C − α_i − β_i = 0,   α_i ≥ 0,   β_i ≥ 0,   ∀ i = 1, ..., l.
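Because the dual is a convex quadratic program, a small instance can be solved with a generic QP solver. The sketch below is one way to do this, assuming the third-party cvxopt package is available; after eliminating β through C − α_i − β_i = 0, the constraints reduce to y^T α = 0 and 0 ≤ α_i ≤ C, which is the box-constrained form passed to the solver. The toy data set is arbitrary.

import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt package is installed

solvers.options['show_progress'] = False

def solve_dual_svm(X, y, C):
    """Solve min_a 0.5 a'Qa - e'a  s.t.  y'a = 0, 0 <= a <= C (linear kernel)."""
    l = len(y)
    K = X @ X.T                                   # linear kernel x_i^T x_j
    Q = np.outer(y, y) * K                        # Q_ij = y_i y_j x_i^T x_j
    P = matrix(Q)
    q = matrix(-np.ones(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))            # -a <= 0, a <= C
    h = matrix(np.hstack([np.zeros(l), C * np.ones(l)]))
    A = matrix(y.reshape(1, -1).astype(float))                # y' a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()

# Example usage on a tiny separable data set.
X = np.array([[2.0, 2.0], [1.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = solve_dual_svm(X, y, C=10.0)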
Because the primal optimal value P(w̄, b̄, ξ̄) equals the dual optimal value D(ᾱ, β̄), the chain

    L(w̄, b̄, ξ̄, ᾱ, β̄) ≤ max_{α≥0, β≥0} L(w̄, b̄, ξ̄, α, β) = P(w̄, b̄, ξ̄) = D(ᾱ, β̄) = min_{w,b,ξ} L(w, b, ξ, ᾱ, β̄) ≤ L(w̄, b̄, ξ̄, ᾱ, β̄)

holds with equality throughout, and we have

    ᾱ_i (y_i (w̄^T x_i + b̄) − 1 + ξ̄_i) = 0,    β̄_i ξ̄_i = 0,    ∀ i = 1, ..., l.   (7.7)
The conditions (7.5), (7.6), and (7.7) are referred to as the Karush-Kuhn-Tucker (KKT) optimal-
ity conditions [4]. We have shown that the optimal solution should satisfy the KKT conditions.
In fact, for a convex optimization problem with linear constraints, the KKT conditions are both
necessary and sufficient for optimality.
Further, the KKT conditions link the primal and dual solutions. We have w̄ = ∑_{i=1}^l ᾱ_i y_i x_i from
(7.5), and b̄ could be inferred from (7.7).¹ Once (ᾱ, b̄) has been calculated, we could evaluate the
decision value as follows:

    h(x) = w̄^T x + b̄ = ∑_{i=1}^l ᾱ_i y_i x_i^T x + b̄.   (7.8)
One can observe that only the non-zero ᾱ_i's affect the decision value. We call the instances with
non-zero ᾱ_i the support vectors. Further, from the KKT conditions,

    y_i (w̄^T x_i + b̄) < 1  ⇒  ᾱ_i = C,
    y_i (w̄^T x_i + b̄) > 1  ⇒  ᾱ_i = 0,
    y_i (w̄^T x_i + b̄) = 1  ⇒  0 ≤ ᾱ_i ≤ C.
The result shows that all the misclassified instances have ᾱi = C, and all correctly classified instances
off the margin have ᾱi = 0. Under the separable scenario, only a few instances are mis-classified or
fall within the margin, while all other ᾱi are zero. This means that the set of support vectors should
be sparse.
¹ If 0 < ᾱ_i < C, then (7.5), (7.6), and (7.7) imply that ξ̄_i = 0 and b̄ = y_i − w̄^T x_i. If all ᾱ_i are bounded (i.e., ᾱ_i = 0 or
ᾱ_i = C), the calculation of b̄ can be done by a slightly different setting; see, for example, [7].
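The following sketch shows how the primal quantities and the decision value (7.8) can be recovered once the dual variables are known; the ᾱ values are assumed to come from some dual solver, and b̄ is taken from a free support vector with 0 < ᾱ_i < C, as in the footnote.

import numpy as np

def primal_from_dual(alpha, X, y, C, tol=1e-8):
    """Recover w and b from dual variables (linear kernel), following (7.5)-(7.8)."""
    w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
    # A free support vector (0 < alpha_i < C) satisfies y_i (w^T x_i + b) = 1,
    # hence b = y_i - w^T x_i (see the footnote; if all alpha_i are bounded,
    # a slightly different strategy is needed, as in [7]).
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    b = y[free[0]] - w @ X[free[0]] if len(free) else 0.0
    return w, b

def decision_values(w, b, X_test):
    return X_test @ w + b                         # h(x) = w^T x + b

# Usage: alpha obtained from any dual solver for the same (X, y, C).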
(a) No linear classifier performs well on the data. (b) The transformed data are linearly separable.
However, the transformation has a serious problem. It is known that the dth-order Taylor expansion
for an n-dimensional input has (d+n)!/(d! n!) coefficients, because of all the combinations of coordinates of
x less than or equal to order d. Thus, when the dimensionality of features and the degree of the Taylor
expansion grow, the dimension of w, which is identical to the number of coefficients in the Taylor
expansion, may become very large. By replacing each x_i in (7.4) with the mapped vector φ(x_i), the
dual formulation with transformed instances is as follows:

    min_α  (1/2) α^T Q α − e^T α,    s.t.  0 = y^T α,  0 ≤ α ≤ Ce,    Q_ij = y_i y_j φ(x_i)^T φ(x_j).
The dual problem only has l variables, and the calculation of Q_ij may not even require expanding
φ(x_i) and φ(x_j). For example, suppose that we carefully scale the feature mapping of (7.10) as
follows:

    φ(x) = ( 1,  √2 x_1,  √2 x_2,  x_1 x_1,  √2 x_1 x_2,  x_2 x_2 )^T.

In this case, φ(x_i)^T φ(x_j) could be obtained as follows:

    φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^2.

As a result, it is not necessary to explicitly expand the feature mapping. In the same manner, one
could apply such a scaling to the dth-order Taylor expansion, and the dot product φ(x_i)^T φ(x_j) could
be simplified as follows:

    φ(x_i)^T φ(x_j) = (1 + x_i^T x_j)^d.
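The equivalence between the scaled degree-2 feature map and the kernel value (1 + x^T z)^2 is easy to check numerically, as in the following sketch; the two input points are arbitrary.

import numpy as np

def phi_deg2(x):
    """Scaled degree-2 feature map for x = (x1, x2), as in the text."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

explicit = phi_deg2(x) @ phi_deg2(z)      # dot product in the mapped space
implicit = (1.0 + x @ z) ** 2             # kernel evaluation, no mapping needed
assert np.isclose(explicit, implicit)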
In fact, such a method can even handle infinite dimensional expansions. Consider the following
feature mapping of a one-dimensional instance x,

    φ(x) = exp(−x^2) ( 1,  √(2^1/1!) x,  √(2^2/2!) x^2,  √(2^3/3!) x^3,  ... )^T.   (7.11)

The transformation is constructed by all degrees of polynomials of x, and w^T φ(x) can be seen as a
scaled Taylor expansion with respect to the scalar x. Such a feature transformation could not even
be performed explicitly, because of the infinite dimensionality. However, the dot product of the
transformed feature vectors has a closed form, as follows:

    φ(x_i)^T φ(x_j) = exp(−(x_i − x_j)^2).
As a result, we could implicitly apply this infinite dimensional feature mapping in the dual formu-
lation, which is expressed only in terms of the dot product. This means that the transformation does
not need to be performed explicitly, because the dot products in the dual formulation can be substi-
tuted with the kernel function. Such a method is referred to as the kernel trick, and the product is
referred to as the Kernel Function, k(x, z) ≡ φ(x)^T φ(z). Similar to (7.8), we can predict an input x
via

    h(x) = ∑_{i=1}^l ᾱ_i y_i φ(x_i)^T φ(x) + b̄ = ∑_{i=1}^l ᾱ_i y_i k(x_i, x) + b̄.   (7.12)
The matrix K, in which K_ij ≡ k(x_i, x_j) = φ(x_i)^T φ(x_j), is called the Kernel Matrix. With the kernel
matrix, the dual SVM could be written in the following form [9]:

    min_α  (1/2) α^T Q α − e^T α,    s.t.  0 = y^T α,  0 ≤ α ≤ Ce,    Q_ij = y_i y_j K_ij.
As the definition shows, the kernel matrix is simply a dot map between transformed instances.
Solving dual SVMs would only require the kernel matrix K, and we do not even need to know the
mapping function φ(·). In other words, classification could be done in an unknown high dimensional
space, if the labels and a kernel matrix of training instances are given.2 However, not all matrices
K are valid kernel matrices, because K must be generated from a dot map. The following theorem
discusses when K is a valid kernel matrix.
2 Dual SVM is not the only formulation to apply the kernel trick. In fact, we can directly modify the primal problem to
Theorem 1 A matrix K ∈ R^{l×l} is a valid kernel matrix if and only if it is symmetric positive semi-
definite (SPSD). In other words, the following needs to be true:

    K_ij = K_ji, ∀ i, j,    and    u^T K u ≥ 0, ∀ u ∈ R^l.

We will prove both directions of this statement. For the sufficiency of the SPSD condition, if K is SPSD, then there
exists an A such that K = A^T A by the Spectral Theorem. As a result, there is a corresponding feature
mapping φ(x_i) = A_i, where A_i is the ith column of A. Then, by definition, K is a valid kernel matrix.
For the necessity of the condition, if K is a valid kernel matrix, then it is generated by a corresponding
feature mapping φ(x). Because φ(x_i)^T φ(x_j) = φ(x_j)^T φ(x_i), K is symmetric. K is also positive semi-
definite because for an arbitrary u ∈ R^l,

    u^T K u = ∑_{i=1}^l ∑_{j=1}^l u_i u_j φ(x_i)^T φ(x_j) = ‖ ∑_{i=1}^l u_i φ(x_i) ‖^2 ≥ 0.
For the prediction of an arbitrary point, the kernel function k(·, ·) is needed, as shown in (7.12). In
the same way, not all functions are valid kernel functions. The following Mercer's Theorem
[26, 27] gives a condition for a valid kernel function.

Theorem 2 k : R^n × R^n → R is a valid kernel function if and only if the kernel matrix it generates
for any finite sequence x_1, x_2, ..., x_l is SPSD.
This theorem is slightly different from Theorem 1. It ensures a consistent feature mapping φ(xx ),
possibly in an infinite dimensional space, that generates all the kernel matrices from the kernel
function. In practice, we do not need to specify what the feature mapping is. For example, a Gaussian
kernel function, also called the Radial Basis Function (RBF), is a common choice for such an
implicit feature transform. It is a multivariate version of (7.11), with the following kernel function
    k(x, z) = exp(−γ ‖x − z‖^2),
where γ is the kernel parameter decided by users. Similar to (7.11), the feature mapping of the
RBF kernel corresponds to all polynomial expansions of the input instance vector. We conceptually
explain that the mapping function of the RBF kernel transforms the input data to become linearly
separable in a high-dimensional space. When the input data are separable by any smooth surface,
as illustrated in Figure 7.6, a smooth h(xx) corresponds to the separating surface. Note that smooth
functions could be analyzed by Taylor expansions, which are a series of polynomial terms. That
is, similar to (7.9), h(x) can be expressed as h(x) = w̄^T φ̄(x) for some w̄ and φ̄(x), which includes
polynomial terms. Because φ(x) of the RBF kernel is a scaled vector of φ̄(x), we can adjust w̄
to another vector w. Then h(x) = w^T φ(x) with the RBF's mapping function φ(x). Therefore, h(x) is a
linear separating hyperplane for φ(x_i), ∀ i, in the space of φ(x). However, although highly nonlinear
Note that the kernel trick works even if we do not know which separating surface it corresponds
to in the infinite dimensional space. The implicit transformation gives SVMs the power to construct
nonlinear separating surfaces, but its lack of explanation is also a disadvantage.
FIGURE 7.6: A smooth separating surface for the data corresponds to a linear separating hyper-
plane in the mapped space used by the RBF kernel. That is, the RBF kernel transforms data to be
linearly separable in a higher-dimensional space.
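As a small illustration of the RBF kernel and of Theorem 1, the sketch below builds the kernel matrix K for a few arbitrary points with γ = 0.5 and checks that it is symmetric positive semi-definite by inspecting its eigenvalues.

import numpy as np

def rbf_kernel_matrix(X, gamma):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [1.5, -1.0]])
K = rbf_kernel_matrix(X, gamma=0.5)

# Theorem 1: a valid kernel matrix is symmetric positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10   # allow for round-off error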
The kernelized dual SVM is a quadratic programming problem with linear constraints, which could
be solved by typical quadratic solvers. However, these solvers usually assume full access to the
matrix Q. This may be too large to be memory-resident, when the value of l is large. Thus, Dual
Decomposition Methods [18, 29] are proposed to overcome these challenges. These methods de-
compose the full-scale dual problem into subproblems with fewer variables, which could be solved
efficiently without full access to Q. The Sequential Minimal Optimization (SMO) approach, as
discussed in [30], is an extreme case of the dual decomposition method by using only two variables
in each subproblem.
We now demonstrate how a decomposition method works. We denote D̄(α) ≡ (1/2) α^T Q α − e^T α
as the dual objective function, and e_i as the unit vector with respect to the ith coordinate. At each
step of SMO, only two variables α_i and α_j in the current α are changed to obtain the next solution α′. To
make each step feasible, α′ must satisfy the equality constraint 0 = y^T α′. It can be validated that by
choosing

    α′ = α + d y_i e_i − d y_j e_j,

α′ satisfies the equality for any scalar d. By substituting the new α′ into the dual objective function,
the subproblem with respect to the variable d becomes the following:

    min_{d ∈ R}  (1/2)(K_ii + K_jj − 2K_ij) d^2 + (y_i ∇_i D̄(α) − y_j ∇_j D̄(α)) d + D̄(α)   (7.13)
    s.t.  0 ≤ α_i + y_i d ≤ C,   0 ≤ α_j − y_j d ≤ C.
This is a single-variable quadratic optimization problem, with box constraints. It is known to have
a closed-form solution. An illustration is provided in Figure 7.7, and more details are provided
in [7, Section 6]. In addition, ∇_i D̄(α) and ∇_j D̄(α) could be obtained by

    ∇_i D̄(α) = y_i ∑_{t=1}^l y_t α_t K_it − 1    and    ∇_j D̄(α) = y_j ∑_{t=1}^l y_t α_t K_jt − 1.
The operation evaluates 2l entries of the kernel. Alternatively, by the fact that we only change two
variables each time, we could maintain the full gradient at the same cost:
    ∇_t D̄(α′) = ∇_t D̄(α) + y_t (K_it − K_jt) d,    for t = 1, ..., l.
Then, at the next iteration, the two needed gradient elements are directly available. One reason to
maintain the full gradient is that it is useful for choosing coordinates i and j. As a result, solving the
subproblem costs O(l) kernel evaluations, and O(l) space for storing the full gradient.
The selection of coordinates i and j is an important issue. If they are not carefully chosen,
the objective value might not even decrease. One naive approach is to try all pairs of coordinates
and pick the one with the largest reduction. However, it is expensive to examine all l(l − 1)/2
pairs. Instead, [19] chooses the pair that has the maximum violation in the optimality condition. In
another approach, the second-order selection rule [14] used in LIBSVM [7], fixes the coordinate i
via the maximum violation, and determines the value of j that could produce the largest decrease
in objective value.3 Both methods require the gradient ∇D̄(α α ), which is available if we choose to
maintain the gradient. In addition, these two methods ensure that all other operations require no
more than O(l) cost. Therefore, if each kernel evaluation needs O(n) operations, an SMO step of
updating two variables requires O(l · n) cost. These types of decomposition methods effectively
solve the challenges associated with memory requirements, because they require only the space
for two kernel columns, rather than the full kernel matrix. A review of algorithms and solvers for
kernelized SVMs is provided in [3].
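The following sketch carries out a single SMO step: it solves the subproblem (7.13) in closed form for a chosen pair (i, j), clips the step to the box constraints, and then performs the O(l) gradient maintenance described above. The working-set selection itself (maximum violation, or the second-order rule of [14]) is not shown, and a precomputed kernel matrix K is assumed.

import numpy as np

def smo_pair_update(alpha, grad, y, K, i, j, C, eps=1e-12):
    """One SMO step on (alpha_i, alpha_j): closed-form solution of (7.13)
    followed by O(l) maintenance of the dual gradient."""
    # Unconstrained minimizer of the one-variable quadratic in d.
    quad = K[i, i] + K[j, j] - 2.0 * K[i, j]
    d = -(y[i] * grad[i] - y[j] * grad[j]) / max(quad, eps)

    # Clip d so that 0 <= alpha_i + y_i d <= C and 0 <= alpha_j - y_j d <= C.
    lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] > 0 else (alpha[i] - C, alpha[i])
    lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] > 0 else (-alpha[j], C - alpha[j])
    d = np.clip(d, max(lo_i, lo_j), min(hi_i, hi_j))

    # Update the two variables; the equality constraint y'alpha = 0 is preserved.
    alpha[i] += y[i] * d
    alpha[j] -= y[j] * d

    # Gradient maintenance: grad_t += y_t (K_it - K_jt) d for all t.
    # (At alpha = 0 the gradient is -1 in every coordinate, since grad = Q alpha - e.)
    grad += y * (K[i, :] - K[j, :]) * d
    return alpha, grad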
In the one-against-rest strategy, one binary classifier f_m is trained for each of the k classes, with
class c_m as the positive class and all remaining classes as the negative class. For k = 4 classes, the
strategy is visualized in the following table.

                   Classifiers
    Class    f1    f2    f3    f4
     c1      +     −     −     −
     c2      −     +     −     −
     c3      −     −     +     −
     c4      −     −     −     +
Each column of the table stands for a binary classifier trained on the specified positive and
negative classes. To predict an input x , we apply all the binary classifiers to obtain binary predictions.
If all of them are correct, there is exactly one positive value f_m(x), and we choose the corresponding
class c_m to be the multiclass prediction. However, the binary predictions could be wrong, and there
may be multiple positive values, or even no positive value. To handle such situations, we could
compare the decision values h_m(x) of the classifiers f_m, ∀m, among the k classes, and choose the largest
one.
    f(x) = arg max_{m=1,...,k} h_m(x).   (7.14)
If ‖w_m‖, ∀ m, are the same, a larger h_m(x) implies a larger distance of the instance from the decision
surface. In other words, we are more confident about the binary predictions. However, the setting in
(7.14) is an ad-hoc solution for resolving conflicting predictions, because ‖w_m‖, ∀ m, are likely to be
different.
On the other hand, it may not be necessary to train binary classifiers over all the training data. We
could construct one binary classifier for each pair of classes. In other words, for k classes, k(k − 1)/2
binary classifiers are trained on each subset of distinct labels cr and cs . This method is called the
One-Against-One strategy [20]. This is used for SVMs in [7, 15, 21]. The notation frs denotes the
classifier with respect to classes cr and cs , and the notation 0 indicates the irrelevant classes. The
one-against-one strategy is visualized in the following table.
                        Classifiers
    Class    f12   f13   f14   f23   f24   f34
     c1       +     +     +     0     0     0
     c2       −     0     0     +     +     0
     c3       0     −     0     −     0     +
     c4       0     0     −     0     −     −
In prediction, a voting procedure based on predictions of all k(k − 1)/2 binary classifiers is
conducted. A total of k bins are prepared for the k classes. For each classifier, with respect to classes
cr and cs , we cast a vote to the corresponding bin according to the predicted class. The output of
the voting process is the class with maximum votes. If all predictions are correct, the winner’s bin
should contain k − 1 votes, while all others should have less than k − 1 votes.
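A minimal sketch of the voting procedure follows; it assumes the k(k − 1)/2 pairwise decision functions are supplied as a dictionary mapping a class pair (r, s), with r < s, to a function whose positive sign votes for class r and whose negative sign votes for class s.

import numpy as np

def one_against_one_predict(x, pairwise, k):
    """Predict by voting over all k(k-1)/2 pairwise classifiers.

    'pairwise' maps a pair (r, s) with r < s to a decision function h_rs;
    h_rs(x) > 0 casts a vote for class r, otherwise for class s.
    """
    votes = np.zeros(k, dtype=int)
    for (r, s), h_rs in pairwise.items():
        votes[r if h_rs(x) > 0 else s] += 1
    return int(np.argmax(votes))     # ties broken by the smallest class index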
When kernelized SVMs are used as binary classifiers, the one-against-one strategy usually
spends less time in training than the one-against-rest strategy [17]. Although there are more subprob-
lems in the one-against-one strategy, each of the subproblems involves fewer instances and could be
trained faster. Assume it costs O(l^d) time to train a dual SVM. If each class has about the same num-
ber of instances, the time complexity for the one-against-one strategy is O((l/k)^d k^2) = O(l^d k^{2−d}).
In contrast, the one-against-rest strategy needs O(l^d k) cost. As a result, the one-against-one strategy
has a lower complexity when d > 1.
In kernelized SVMs, recall that we use the following equation to evaluate the decision value of
a binary classifier,
    h(x) = ∑_{i=1}^l ᾱ_i y_i k(x_i, x) + b̄.
The kernel evaluations k(x_i, x), ∀ ᾱ_i ≠ 0, account for most of the time in prediction. For multiclass
methods such as one-against-one and one-against-rest, the values k(x_i, x) are the same across dif-
ferent binary subproblems. Therefore, we only need to pre-calculate the kernel entries with nonzero
ᾱ_i in all the binary classifiers once. In practice, the number of “total” support vectors across binary
problems of the one-against-rest and one-against-one strategies does not differ much. Thus, the required
times for prediction are similar for both multiclass strategies.
To improve the one-against-one strategy, we could avoid applying all k(k − 1)/2 classifiers for
prediction. Instead of making decisions on all classifiers, we can apply one classifier at a time to
exclude one class, and after k − 1 runs there is only one class left. This policy is referred to as the
DAGSVM [31], because the procedure is similar to walking through a directed acyclic graph (DAG)
in Figure 7.8. Though such a policy may not be fair to all the classes, it has the merit of being fast.
The binary reduction strategies may ensemble inconsistent classifiers. If we would like to solve
multiclass classification as a whole, we must redefine the separability and the margin of multiclass
data. Recall that in the one-against-rest strategy, we predict the class with the largest decision value.
For the simplicity of our formulation, we consider a linear separating hyperplane without the bias
term, h_m(x) = w_m^T x. As a result, the decision function is

    f(x) = arg max_{m=1,...,k} w_m^T x.

When all instances are classified correctly under this setting, there are w_1, ..., w_k such that ∀ i = 1, ..., l,

    w_{y_i}^T x_i > w_m^T x_i,    ∀ m = 1, ..., k, m ≠ y_i.

With the same technique as in Section 7.2, the linearly separable condition could be formulated as ∀ i = 1, ..., l,

    w_{y_i}^T x_i − w_m^T x_i ≥ 1,    ∀ m = 1, ..., k, m ≠ y_i.
Similar to the binary case, the margin between classes r and s can be defined as

    M_{r,s} ≡ min_{i s.t. y_i ∈ {r,s}}  |(w_r − w_s)^T x_i| / ‖w_r − w_s‖_* .
An illustration is provided in Figure 7.9. To maximize the margin M_{r,s} for every pair r, s, we instead
minimize the sum of the inverses ∑_{r ≠ s} 1/M_{r,s}^2. The work [10] further shows that

    ∑_{r,s=1,...,k s.t. r ≠ s} 1/M_{r,s}^2  ≤  k ∑_{m=1,...,k} w_m^T w_m.
Therefore, a maximum margin classifier for multiclass problems could be formulated as the follow-
ing problem:

    min_{w_1, ..., w_k}  ∑_{m=1}^k w_m^T w_m
    s.t.  w_{y_i}^T x_i − w_m^T x_i ≥ 1,    ∀ m = 1, ..., k, m ≠ y_i,    ∀ i = 1, ..., l.
For the nonseparable scenario, the formulation with a loss term ξ is

    min_{w_1, ..., w_k, ξ}  (1/2) ∑_{m=1}^k w_m^T w_m + C ∑_{i=1}^l ξ_i
    s.t.  w_{y_i}^T x_i − w_m^T x_i ≥ (1 − δ_{y_i,m}) − ξ_i,    ∀ i = 1, ..., l,  ∀ m = 1, ..., k,

where δ_{y_i,m} is defined to be 1 when y_i equals m, and 0 otherwise. We could validate that when
y_i equals m, the corresponding inequality becomes ξ_i ≥ 0. This formulation is referred to as
Crammer and Singer's SVM (CSSVM) [10, 11]. Similarly, we can derive its dual problem and the
kernelized form. Compared to the binary reduction strategies, the work in [17] shows that CSSVM
usually gives similar classification accuracy, but needs more training time. Other similar, but slightly
different, strategies are provided in [22, 36].
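For fixed weight vectors, the smallest slack ξ_i satisfying all k constraints of the formulation above has a closed form, which the following sketch uses to evaluate the CSSVM objective for given weights; it only evaluates the objective rather than solving the optimization problem, and the integer class labels in y are assumed to lie in {0, ..., k − 1}.

import numpy as np

def cssvm_objective(W, X, y, C):
    """Evaluate Crammer and Singer's objective for a weight matrix W (k x n).

    The slack of instance i is the smallest xi_i satisfying
    w_{y_i}^T x_i - w_m^T x_i >= (1 - delta_{y_i,m}) - xi_i for all m.
    """
    scores = X @ W.T                                  # scores[i, m] = w_m^T x_i
    k = W.shape[0]
    delta = np.eye(k)[y]                              # delta[i, m] = 1 iff m == y_i
    # xi_i = max_m ( (1 - delta_{y_i,m}) - (w_{y_i}^T x_i - w_m^T x_i) )
    violations = (1.0 - delta) - (scores[np.arange(len(y)), y][:, None] - scores)
    xi = violations.max(axis=1)                       # >= 0, since m = y_i gives 0
    return 0.5 * (W * W).sum() + C * xi.sum()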
7.8 Conclusion
SVM is now considered a mature machine learning method. It has been widely applied in dif-
ferent applications. Furthermore, SVM has been well-studied, both from theoretical and practical
perspectives. In this chapter we focus on the kernelized SVMs, which construct nonlinear separating
surfaces. However, on sparse data with many features and instances, a linear SVM may achieve sim-
ilar accuracy to a kernelized SVM, but is significantly faster. Readers may refer to [13] and [37]
for details.
The topic of kernelized SVM is broader than one can hope to cover in a single chapter. For
example, we have not discussed topics such as kernelized support vector regression (SVR) and
kernel approximations to improve the training speed. Our aim was to introduce the basic concepts
of SVM. Readers can use the pointers provided in this chapter, to further study the rich literature of
SVM, and learn about recent advancements.
Bibliography
[1] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–
152. ACM Press, 1992.
[2] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. A. Müller,
E. Säckinger, P. Simard, and V. Vapnik. Comparison of classifier methods: A case study in
handwritten digit recognition. In International Conference on Pattern Recognition, pages 77–
87. IEEE Computer Society Press, 1994.
[3] L. Bottou and C.-J. Lin. Support vector machine solvers. In L. Bottou, O. Chapelle, D. De-
Coste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cam-
bridge, MA, 2007.
[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support
vector machines. In Proceedings of the Fifteenth International Conference on Machine
Learning (ICML), 1998.
[6] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining
and Knowledge Discovery, 2(2):121–167, 1998.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans-
actions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[8] O. Chapelle. Training a support vector machine in the primal. Neural Computation,
19(5):1155–1178, 2007.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[10] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass
problems. In Computational Learning Theory, pages 35–46, 2000.
[11] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[12] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge
University Press, Cambridge, UK, 2000.
[13] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for
large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[14] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information
for training SVM. Journal of Machine Learning Research, 6:1889–1918, 2005.
[15] J. H. Friedman. Another approach to polychotomous classification. Technical report, Depart-
ment of Statistics, Stanford University, 1996.
[16] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification.
Technical report, Department of Computer Science, National Taiwan University, 2003.
[17] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
[18] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and
A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184,
MIT Press, Cambridge, MA, 1998.
[19] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt’s
SMO algorithm for SVM classifier design. Neural Computation, 13:637–649, 2001.
[20] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure
for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algo-
rithms, Architectures and Applications. Springer-Verlag, 1990.
[21] U. H.-G. Kressel. Pairwise classification and support vector machines. In B. Schölkopf, C. J. C.
Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning,
pages 255–268, MIT Press, Cambridge, MA, 1998.
[22] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American
Statistical Association, 99(465):67–81, 2004.
[23] C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view.
Neural Computation, 13(2):307–317, 2001.
[24] O. L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24(1):15–
23, 1999.
[25] O. L. Mangasarian. A finite Newton method for classification. Optimization Methods and
Software, 17(5):913–929, 2002.
[26] J. Mercer. Functions of positive and negative type, and their connection with the theory of in-
tegral equations. Philosophical Transactions of the Royal Society of London, Series A, 209:415–
446, 1909.
[27] H. Q. Minh, P. Niyogi, and Y. Yao. Mercer’s theorem, feature maps, and smoothing. In
Proceedings of the Nineteenth Annual Workshop on Computational Learning Theory (COLT),
pages 154–168. 2006.
[28] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. AI
Memo 1602, Massachusetts Institute of Technology, 1997.
[29] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face
detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), pages 130–136, 1997.
[30] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support
Vector Learning, MIT Press, Cambridge, MA, 1998.
[31] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classifica-
tion. In Advances in Neural Information Processing Systems, volume 12, pages 547–553. MIT
Press, Cambridge, MA, 2000.
[32] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
[33] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[34] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Verlag, New York,
NY, 1982.
[35] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.
[36] J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor,
Proceedings of ESANN99, pages 219–224, Brussels, 1999. D. Facto Press.
[37] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. Recent advances of large-scale linear classification.
Proceedings of the IEEE, 100(9):2584–2603, 2012.
Chapter 8
Neural Networks: A Review
Alain Biem
IBM T. J. Watson Research Center
New York, NY
[email protected]
8.1 Introduction
Neural networks have recently been rediscovered as an important alternative to various standard
classification methods. This is due to a solid theoretical foundation underlying neural network re-
search, along with recently-achieved strong practical results on challenging real-world problems.
Early work established neural networks as universal functional approximators [33, 59, 60], able to
approximate any given vector space mapping. As classification is merely a mapping from a vector
space to a nominal space, in theory, a neural network is capable of performing any given classifi-
cation task, provided that a judicious choice of the model is made and an adequate training method
is implemented. In addition, neural networks are able to directly estimate posterior probabilities,
which provides a clear link to classification performance and makes them a reliable estimator of
the optimal Bayes classifier [14, 111, 131]. Neural networks are also referred to as connectionist
models.
Theories of neural networks have been developed over many years. Since the late 19th cen-
tury, there have been attempts to create mathematical models that mimic the functioning of the
human nervous system. The discovery by Cajal in 1892 [29] that the nervous system is comprised
of neurons communicating with each other by sending electrical signals down their axons, was a
breakthrough. A neuron receives various signals from other linked neurons, processes them, and
then decides to inhibit or send a signal based on some internal logic, thus acting as an electrical
gate. This distributed neuronal process is the mechanism the brain uses to perform its various tasks.
This finding led to an increase in research on mathematical models that attempt to simulate such be-
havior, with the hope, yet to be substantiated, of achieving human-like classification performance.
The first broadly-publicized computational model of the neuron was proposed by McCulloch
and Pitts in 1943 [91]. It was a binary threshold unit performing a weighted sum of binary input
values, and producing a 0 or 1 output depending on whether the weighted sum exceeded a given
threshold. This model generated great hope for neural networks, as it was shown that with a suffi-
cient number of such neurons, any arbitrary function could be simulated, provided that the weights
were accurately selected [96]. In 1962, Rosenblatt [114] proposed the perceptron learning rule, an
iterative learning procedure for single-layer linear threshold units in which the unit can take scalar
inputs to produce a binary output. The perceptron learning rule was guaranteed to converge to the
optimal set of weights, provided that the target function was computable by the network.
Regrettably, research into neural networks slowed down for almost 15 years, largely due to an
influential paper by Minsky and Papert [78] which proved that a single-layer perceptron is incapable
of simulating an XOR gate, which severely limits its capabilities. In the 1980s, interest in neural
networks was revived, based on a proposal that they may be seen as a memory encoding system.
Based on this view, associative networks were proposed utilizing some energy measure to optimize
the network. This framework led to the Hopfield network [58] and Boltzmann Machine [1]. In terms
of usage for classification, the development of the back-propagation algorithm [90, 116], as a way
to train a multi-layer perceptron, provided a fundamental breakthrough. The multi-layer perceptron
(MLP) and its variants do not have the limitation of the earlier single-layer perceptron models,
and were proven to be able to approximate any given vector space mapping. Consequently, MLPs
have been the most widely-used neural network architecture. The 1990s witnessed the emergence of
various neural network applications, mostly around small or medium-size architectures. Networks
with one or two hidden layers were common, as networks with more layers suffered training issues
such as slow convergence, local minima problems, and complex learning parameter tuning.
Since the mid-2000s, the focus has been on the ability of neural networks to discover internal
representations within their multi-layered architecture [15]. The deep learning concept [56] was
proposed as a way to exploit deep neural networks with many layers and large numbers of units
that maximize the extraction of unseen non-linear features within their layers, and thus be able to
accomplish complex classification tasks. With the use of high-performance computing hardware
and the utilization of modular and selective learning techniques, deep neural networks with millions
of weights have been able to achieve breakthrough classification performance.
In terms of practical use, neural networks have been successfully applied to a variety of
real world classification problems in various domains [132] including handwriting recognition
[32, 47, 73, 89], speech recognition [25, 87], fault detection [6, 61], medical diagnosis [9, 28], fi-
nancial markets [124, 126], and more.
In this chapter, we provide a review of the fundamental concepts of neural networks, with a
focus on their usage in classification tasks. The goal of this chapter is not to provide an exhaustive
coverage of all neural network architectures, but to provide the reader with an overview of widely
used architectures for classification. In particular, we focus on layered feedforward architectures.
The chapter is organized as follows. In Section 8.2, we introduce the fundamental concepts
underlying neural networks. In Section 8.3 we introduce single-layer architectures. We review Ra-
dial Basis Function network as an example of a kernel neural network in Section 8.4. We discuss
multi-layer architectures in Section 8.5, and deep neural networks in Section 8.6. We summarize in
Section 8.7.
• The architecture or the topology that outlines the connections between units, including a
well-defined set of input and output units.
• The data encoding policy describing how input data or class labels are represented in the
network.
• The training algorithm used to estimate the optimal set of weights associated with each unit.
In the remainder of this section, we will review each of these characteristics in more detail.
1. A net value function ξ, which utilizes the unit’s parameters or weights w to summarize input
data into a net value, ν, as
ν = ξ(x, w). (8.1)
The net value function mimics the behavior of a biological neuron as it aggregates signals
from linked neurons into an internal representation. Typically, it takes the form of a weighted
sum, a distance, or a kernel.
2. An activation function, or squashing function, φ, that transforms net value into the unit’s
output value o as
o = φ(ν). (8.2)
The activation function simulates the behaviour of a biological neuron as it decides to fire
or inhibit signals, depending on its internal logic. The output value is then dispatched to all
receiving units as determined by the underlying topology. Various activations have been pro-
posed. The most widely-used ones include the linear function, the step or threshold function,
and the sigmoid and hyperbolic tangent function, as shown in Figure 8.2.
FIGURE 8.1: Mathematical model of an ANN unit. The unit computes a net value as a scalar
function of the inputs using local parameters, and then passes the net value to an activation function
that produces the unit output.
ξ(x, w) = w^T x    (8.3)
        = Σ_{i=1}^{d} x_i w_i = ν    (8.4)
φ(ν) = 1(ν > θ)    (8.5)
where w = (w_1, ..., w_d) is the binary weight vector representing the unit's parameters, with w_i ∈ {0, 1}. The activation function is the step function 1(b), with value equal to 1 when the Boolean value b is true and zero otherwise.
The binary threshold unit performs a two-class classification on a space of binary inputs with a
set of binary weights. The net value, as a weighted sum, is similar to a linear regression (notwith-
standing the thresholding) and is able to accurately classify a two-class linear problem in this space.
It was demonstrated that the binary threshold unit can realize a variety of logic functions including
the AND, OR, and NAND gates [91].
However, its inability to perform the XOR gate [114] severely limits its classification capabil-
ities. Another major drawback of the binary threshold unit is the lack of learning algorithms to
estimate the weights, given the classification task.
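As a small, hypothetical illustration of the discussion above (not taken from the chapter), the following Python sketch implements a single threshold unit and shows weight and threshold choices that realize the AND, OR, and NAND gates; for readability, the binary-weight restriction is relaxed to real-valued weights and thresholds of our own choosing. No such choice exists for XOR, since it is not linearly separable.

# Minimal sketch of a binary threshold (McCulloch-Pitts style) unit.
# The weight values and thresholds below are illustrative choices, not from the chapter.

def threshold_unit(x, w, theta):
    """Net value is a weighted sum; the activation is the step function 1(nu > theta)."""
    nu = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if nu > theta else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: fires only when both inputs are 1 (weights 1,1 and threshold 1.5).
print([threshold_unit(x, (1, 1), 1.5) for x in inputs])     # [0, 0, 0, 1]
# OR: fires when at least one input is 1 (threshold 0.5).
print([threshold_unit(x, (1, 1), 0.5) for x in inputs])     # [0, 1, 1, 1]
# NAND: negative weights with a negative threshold.
print([threshold_unit(x, (-1, -1), -1.5) for x in inputs])  # [1, 1, 1, 0]
# XOR cannot be realized by any single unit of this form: it is not linearly separable.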
ξ(x, w) = w^T x    (8.6)
        = Σ_{i=1}^{d} x_i w_i + w_0    (8.7)
φ(ν) = αν
with a typical value of α = 1, where x = (x_1, ..., x_d) and w = (w_1, ..., w_d).
The distance unit, in contrast, computes the distance between the input x and some prototype vector w [75]. A commonly used distance is the Euclidean distance, i.e., ξ(x; w) = Σ_i (x_i − w_i)².
The kernel unit applies a kernel function to the input, for a parameter vector w characterizing the unit. In Equation 8.14, a Gaussian kernel is used as the activation function, which is the primary choice for classification tasks. The net value typically uses a scaled Euclidean distance or the Mahalanobis distance. The radial basis provides a local response activation, similar to local receptive patterns in brain cells, centered at the unit's parameters. Hence, the unit is activated when the input data is close in value to its parameters. The Gaussian kernel gives the unit non-linear classification capabilities while enabling better local pattern matching [18].
The associated activation function is typically a linear or a sigmoid function. Polynomial units are
infrequently used for classification tasks and we do not discuss them in more detail.
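The following Python sketch, with illustrative function names and parameter values of our own choosing, contrasts the net value / activation pairings discussed above: a linear unit, a distance unit, and a Gaussian kernel unit.

import numpy as np

# Illustrative sketch (not from the chapter) of three unit types:
# linear unit, distance unit, and Gaussian kernel unit.

def linear_unit(x, w, w0=0.0, alpha=1.0):
    # Net value: weighted sum plus bias; activation: linear, phi(nu) = alpha * nu.
    nu = np.dot(w, x) + w0
    return alpha * nu

def distance_unit(x, w):
    # Net value: squared Euclidean distance to a prototype vector w.
    return np.sum((x - w) ** 2)

def gaussian_kernel_unit(x, mu, sigma=1.0):
    # Net value: scaled Euclidean distance to the center mu;
    # activation: Gaussian kernel, giving a local response around mu.
    nu = np.sum((x - mu) ** 2) / (sigma ** 2)
    return np.exp(-0.5 * nu)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(linear_unit(x, w, w0=0.1))                        # weighted-sum response
print(distance_unit(x, np.array([1.0, 1.0])))           # distance to the prototype
print(gaussian_kernel_unit(x, np.array([1.0, 1.0])))    # local (radial) response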
• Feedforward networks or multi-layer perceptron (MLP), which features an input layer, one or
several hidden layers, and an output layer. The units in the feedforward networks are linear
or sigmoidal units. Data move from the input layer, through the hidden layers, and up to the
output layer. Feedback is not allowed.
• Radial Basis Function (RBF) networks [115], made of a single hidden layer with kernel units and an output layer with linear units.
• The Learning Vector Quantization (LVQ) network [75], a single-layer network where input units are connected directly to output units made of distance units.
8.2.5 Learning
Neural network learning for classification is the estimation of the set of weights that optimally
achieve a classification task. Learning is done at the network level to minimize a classification error
metric. This includes schemes for each unit to update its weight, to increase the overall performance
of the network. Each class of networks provides a learning scheme suited for its particular topology.
In principle, the update of the weight w_ij between neuron i and neuron j, where neuron j receives input x_i from neuron i and produces an output o_j, should reflect the degree to which the two neurons frequently interact in producing the network's output. The weight update Δw_ij is thus of the form
Δw_ij = η F(x_i, o_j)
where η is some learning rate parameter and F() is some multiplicative function. Various update rules have been proposed to reflect this broad principle.
The above form of the delta rule can be used on all types of units, provided that a desired output is available. However, this form of the delta rule does not use the unit's output, which results in a discrepancy between the system used for classification and the one used for training, especially for non-linear units. The generalized form of the delta rule provides a remedy to this issue by embedding the derivative of the activation function, producing an update of the form
Δw = η (t_n − o_n) φ'(ν_n) x_n
where φ'() is the derivative of the activation function. The generalized delta rule requires the use of a smooth activation function and is the form widely used for classification tasks in many network architectures. It is a supervised training scheme directly derived from a gradient descent on the squared error between a target function and the output. Training starts with weights initialized to small values.
E_p = − Σ_{n: o_n ≠ t_n} t_n w^T x_n    (8.20)
where the sum is performed over the misclassified examples (o_n ≠ t_n). This criterion attempts to find the linear hyperplane through the origin in the x-space that separates the two classes. To see this, observe that the dot product −t_n w^T x_n is always positive for a misclassified x_n. Therefore, minimizing the
perceptron criterion is equivalent to solving the linear inequality tn wt xn > 0 for all xn , which can be
done either incrementally, one sample at a time, or in batch fashion.
The SLP learning is shown in Algorithm 8.1. It is equivalent to stochastic gradient descent on the perceptron criterion. Some variants of the algorithm use a linearly decreasing learning rate η(n) as a function of n. For each misclassified sample x_n, the training process adds the input data to the weights when the data belongs to class C2, and subtracts the input data from the weights when the data belongs to C1. The effect of this label-based selective update is to incrementally re-align the separating hyperplane toward the direction that optimally separates the two classes [19, 35]. The batch version of the perceptron algorithm performs a similar update on the accumulated sum of misclassified samples:
for a linearly decreasing learning rate η(t) as a function of the training time t. The update is performed up to a specified number of epochs, or until all data have been correctly classified. It is a direct application of the delta rule, since weights are updated according to their contribution to the error.
The perceptron convergence theorem guarantees the optimality of the process: The learning
algorithm, as illustrated in Algorithm 8.1, will find the optimal set of weights within a finite number
of iterations, provided that the classes are linearly separable [35, 52, 114].
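A minimal Python sketch of single-sample perceptron learning in the spirit of Algorithm 8.1 is given below. The toy data, margin filter, learning rate, and epoch limit are illustrative assumptions; labels are taken in {−1, +1} and the separating hyperplane passes through the origin, as in the perceptron criterion.

import numpy as np

# Sketch of single-sample perceptron learning (in the spirit of Algorithm 8.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
w_true = np.array([2.0, -1.0])
keep = np.abs(X @ w_true) > 1.0           # keep a margin so the classes are cleanly separable
X = X[keep]
t = np.where(X @ w_true > 0, 1, -1)       # labels in {-1, +1}

w = np.zeros(2)
eta = 1.0
for epoch in range(100):
    mistakes = 0
    for xn, tn in zip(X, t):
        if tn * (w @ xn) <= 0:            # sample misclassified (or on the boundary)
            w = w + eta * tn * xn         # add the input for one class, subtract it for the other
            mistakes += 1
    if mistakes == 0:                     # stop once every sample is correctly classified
        break

print("epochs used:", epoch + 1, "learned weights:", w)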
For non-linearly separable problems, there is no guarantee of success. In that scenario, learning aims at achieving classification performance within a tolerable margin of error γ, such that
t_n w^T x_n > γ    (8.22)
for all n. The perceptron learning then minimizes the misclassification error relative to that margin.
For a γ > 0 satisfying the condition in Equation 8.22, the perceptron margin theorem provides a bound on the number of achievable errors. Suppose there exists a normalized set of weights w such that ||w|| = 1, and that the entire training set is contained in a ball of diameter D (||x_n|| < D for all n); then the number of mistakes M made on the training set X by the perceptron learning algorithm is at most D²/γ². However, a priori knowledge of the optimal γ is hardly available. In practice, γ can be set to
γ = min_n t_n w^T x_n    (8.24)
after completion of the learning scheme, and that value can be used to provide a bound on the error rate. Note that the bound does not depend on the dimensionality of the input data, making the perceptron applicable to high-dimensional data. To remove the dependency on D, it is common to normalize each training example as x_n/||x_n||, which produces a bound of the form 1/γ².
Then, for an input example x_n of class C_k that generates an output o_k, the multi-class perceptron performs an update whenever x_n is misclassified as a class C_q:
w_k ← w_k + η x_n,    w_q ← w_q − η x_n.
Similar to the two-class case, the update occurs only when the input data has been misclassified. The effect is to increase the weights of the output vector of the right category (w_k ← w_k + η x_n) and decrease the weights of the competing category (w_q ← w_q − η x_n). The multi-class perceptron learning is shown in Algorithm 8.2.
the good weight estimate kept in the pocket whenever the standard perceptron is suboptimal. The Adatron [3] proposes an efficient adaptive learning scheme, and the voted-perceptron [41] embeds margin maximization in the learning process. In fact, it has been demonstrated that the voted-perceptron can achieve performance similar to that of a kernel-based Support Vector Machine, while being a much simpler and easier model to train.
8.3.2 Adaline
The Adaptive Linear Neuron (ADALINE) was proposed by Widrow and Hoff [5] in 1960 as an adaptation of the binary threshold gate of McCulloch and Pitts [91]. Like the perceptron, the adaline is a single-layer neural network with an input layer made of multiple units and a single output unit, and that output unit is a linear threshold unit. However, while the perceptron focuses on optimizing the sign of the net value, the adaline focuses on minimizing the squared error between the net value and the target. We start by describing the original adaline network aimed at a two-class problem and then describe extensions for multi-class settings.
E_a = Σ_n (t_n − ν_n)²    (8.28)
with t_n ∈ {−1, 1} as the desired output of sample x_n and ν_n as the net value of the output unit. This error is defined using the net value instead of the output value. Minimization of the adaline criterion is an adaptive (iterative) process implemented as the Widrow-Hoff learning rule, also called the delta rule:
w ← w + η(t_n − ν_n) x_n    (8.29)
for incoming data x_n and with a learning rate η > 0, as shown in Algorithm 8.3.
The learning update in Equation 8.29 converges in the L2-norm sense to the plane that separates the data (for linearly separable data). Widrow and Hoff also proposed an alternative to Equation 8.29 that makes the update insensitive to the size of the input data:
w ← w + η(t_n − ν_n) x_n / ||x_n||    (8.30)
where ||·|| refers to the Euclidean norm. The Widrow-Hoff learning method has the following characteristics. The direction of the weight update, (t_n − ν_n) x_n, has the same effect as performing a gradient descent on E_a. The adaline error is defined at the net value level and not at the output level, which means that the Widrow-Hoff rule behaves like a linear regression estimation [19, 35]. If the unit is non-linear, the output value of the unit is not used during training, yielding a clear discrepancy between the optimization done during training and the system used for classification.
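The following Python sketch applies the Widrow-Hoff (delta) rule of Equation 8.29 to a two-class toy problem; the data, learning rate, and epoch count are illustrative assumptions, and the bias is folded into the weights by appending a constant input x_0 = 1.

import numpy as np

# Sketch of the Widrow-Hoff (delta) rule for a two-class adaline:
# the squared error between the net value and targets in {-1, +1} is minimized.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] - 0.5 * X[:, 1] > 0, 1.0, -1.0)
X = np.hstack([np.ones((X.shape[0], 1)), X])   # fold the bias into the weights (x0 = 1)

w = np.zeros(3)
eta = 0.01
for epoch in range(50):
    for xn, tn in zip(X, t):
        nu = w @ xn                            # error is measured at the net value, not the output
        w = w + eta * (tn - nu) * xn           # delta rule update

predictions = np.where(X @ w > 0, 1.0, -1.0)
print("training accuracy:", np.mean(predictions == t))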
The predicted class corresponds to the output unit with the highest output:
x ∈ C_j   if   j = arg max_{k=1,...,K} o_k    (8.32)
where o_k = ν_k for all k, given a sequence of training data {x_1, ..., x_N} and corresponding target vectors t_n = (t_{n1}, ..., t_{nK}), with t_{nj} encoded using the 1-of-K principle over the set {−1, 1}.
Multi-class learning applies the adaline learning rule to each weight vector w_j.
where the distance is typically the Euclidean distance. The output layer is made of K × Q output units. The classification decision selects the class label of the closest prototype:
x ∈ C_k   if   k = arg min_j min_q o_{jq}    (8.35)
That is, the classification decision is the class label of the prototype that yields the smallest output or, equivalently, of the prototype closest to the input data.
LVQ learning attempts to find the best prototypes that compress the data while achieving the best classification performance. Training falls in the family of discriminative training, since prototypes in the output layer are selectively updated based on the class label of the input data. Prototypes of the same class as the input data are moved closer to the data point, and prototypes of competing classes are moved further away. This is done selectively and iteratively. Three training algorithms are available: LVQ1, LVQ2, and LVQ3 [74]. We review LVQ1 and LVQ2 as the most widely used training schemes for LVQ; LVQ3 is very similar to LVQ2.
of the two selected winning prototypes, w_jq and w_kp, ensuring that the two winning prototypes are close to the boundary between the two competing categories. The LVQ2 update is as follows:
w_jq(t) = w_jq(t − 1) − η(t)(x − w_jq)   for the first winner, of the incorrect category    (8.37)
w_kp(t) = w_kp(t − 1) + η(t)(x − w_kp)   for the second winner, of the correct category    (8.38)
LVQ2 training selects and updates the two winning prototypes whenever misclassification occurs and these prototypes are directly competing. This is somewhat similar to the multi-class perceptron model, except that the goal of LVQ2 is to minimize the distance to the correct prototype, while the perceptron maximizes the dot product between the input and the weights of the correct class.
LVQ can be seen as a supervised extension of vector quantization (VQ), the Self-Organizing Map (SOM) [76], or k-means clustering. Those algorithms compress data into a set of representative prototypes and generate clusters of data centered at the prototypes. These unsupervised techniques are generally used as a starting point for LVQ training, which then readjusts the cluster boundaries based on the data labels. Clustering can generate initial prototypes using the entire data set, but it is more effective to perform clustering within each category, thus generating initial prototypes per category, and then use LVQ competitive training to readjust the boundaries [93].
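A minimal Python sketch of LVQ1 training is shown below: the winning prototype is pulled toward the input when their labels agree and pushed away otherwise. Prototype initialization from per-class means is used as a crude stand-in for the per-class clustering suggested above; the data, prototype count, and learning-rate schedule are illustrative assumptions.

import numpy as np

# Sketch of LVQ1: attract the winning prototype toward same-class inputs, repel otherwise.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([2, 2], 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

protos = np.array([X[y == c].mean(axis=0) for c in (0, 1)])   # one prototype per class
proto_labels = np.array([0, 1])

for it in range(1000):
    eta = 0.1 * (1 - it / 1000)                               # linearly decreasing learning rate
    n = rng.integers(len(X))
    x, label = X[n], y[n]
    k = np.argmin(np.sum((protos - x) ** 2, axis=1))          # winning (closest) prototype
    if proto_labels[k] == label:
        protos[k] += eta * (x - protos[k])                    # attract
    else:
        protos[k] -= eta * (x - protos[k])                    # repel

pred = proto_labels[np.argmin(((X[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)]
print("training accuracy:", np.mean(pred == y))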
FIGURE 8.3: An RBF network with radial basis units in the hidden layer and linear units in the output layer.
Originating from function approximation and regularization theory, with regression as the targeted application [98], RBFNs have been successfully adapted to classification problems [101, 102]. Theoretically, RBFNs are universal approximators, making them powerful neural network candidates for any data classification task [45]. In practice, a judicious choice of kernel must be combined with an adequate training scheme. For classification problems, a Gaussian activation is the most widely used kernel and is the one illustrated throughout this section.
An RBF network is illustrated in Figure 8.3. Let us assume a K-class classification problem with
an RBFN of Q hidden units (called radial basis functions) and K output units. Each q-th radial basis
unit in the hidden layer produces an output z_q from the input units:
ν_q = ||x − µ_q||    (8.39)
z_q = exp(−ν_q / 2)    (8.40)
where a common choice for the norm is the quadratic form
||x − µ_q|| = (x − µ_q)^T Σ_q (x − µ_q)    (8.41)
and where Σ_q is a positive definite matrix. Here, µ_q and Σ_q are the radial basis center and the width (or radii) of hidden unit q, for q ∈ {1, ..., Q}. The outputs of the hidden layer are passed to the output layer, where each k-th output unit generates
o_k = w_k^T z    (8.42)
    = Σ_{q=0}^{Q} w_qk z_q    (8.43)
with z = (1, z_1, ..., z_Q) and w_k = (w_0k, w_1k, ..., w_Qk) as the set of weights characterizing the k-th output unit. An RBFN is fully determined by the set of µ_q, Σ_q, and w_qk for q ∈ {0, 1, ..., Q} and k ∈ {1, ..., K}.
Similar to the output units in the multi-class single-layer perceptron, the output units have a set of dedicated weights and are linear units implementing a linear transform on the features generated by the hidden units. The use of a kernel in the hidden layer introduces the needed non-linearity and is equivalent to a projection into a high-dimensional feature space, a property shared by all kernel machines. Clearly, each o_k is a linear combination of kernels, in a manner similar to that of SVMs. A key difference with SVMs is the training scheme and the selection of the kernels' parameters.
The classification decision uses the maximum score rule. Training the output weights then amounts to solving
t_n = W^T z_n    (8.45)
for all n, where W = (w_1, ..., w_K) is the weight matrix (column k of W is the weight vector of the k-th output unit). In matrix form, Equation 8.45 can be rewritten as
T = W^T Z    (8.46)
where T and Z are the matrices made of the target vectors and the hidden activation vectors, respectively. The general solution to this equation is given by
W = Z† T    (8.47)
where Z† is the pseudo-inverse of Z, defined as Z† = (Z^T Z)^{-1} Z^T. The pseudo-inverse can be computed via linear solvers such as Cholesky decomposition, singular value decomposition, orthogonal least squares, quasi-Newton techniques, or Levenberg-Marquardt [46].
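The following Python sketch follows the training outline above under simplifying assumptions: Gaussian centers are taken as a random subset of the training data (a simple stand-in for clustering-based center estimation), the hidden activation matrix Z is formed with a fixed width, and the output weights are obtained as the least-squares solution W = Z†T. The toy data, the number of hidden units, and the width are illustrative.

import numpy as np

# Sketch of RBFN training: fixed Gaussian centers + least-squares output weights.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.7, (150, 2)), rng.normal([3, 1], 0.7, (150, 2))])
labels = np.array([0] * 150 + [1] * 150)
T = np.eye(2)[labels]                               # 1-of-K target vectors (N x K)

Q, sigma = 20, 1.0
centers = X[rng.choice(len(X), size=Q, replace=False)]   # random subset as centers

def hidden_activations(X, centers, sigma):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances (N x Q)
    Z = np.exp(-0.5 * d2 / sigma ** 2)
    return np.hstack([np.ones((len(X), 1)), Z])                 # prepend the bias unit z0 = 1

Z = hidden_activations(X, centers, sigma)
W, *_ = np.linalg.lstsq(Z, T, rcond=None)           # least-squares solution, W = Z† T

pred = np.argmax(hidden_activations(X, centers, sigma) @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))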
given a sequence of training data and corresponding target vectors t_n. The initial set of centers and weights could be derived from one of the center estimation methods outlined earlier. Weights are usually initialized to small random values with zero mean and unit variance.
Alternatively, the gradient descent can be performed selectively on the weights only, while keeping the centers and radii fixed. This has the advantage of avoiding the matrix inversion required by the linear solver techniques. It has been demonstrated that a wide class of RBFNs feature a unique minimum in the error surface, which leads to a single solution for most training techniques [98].
RBFs can be sensitive to noise and outliers due to the use of cluster centroids as centers. One way to alleviate this problem is to use a robust statistics learning scheme, as proposed by the mean and median RBF [22, 23].
is provided by the Cover theorem [51]. Some RBF variants use a sigmoid at the output layer, further
enhancing classification capabilities.
RBFNs are capable of achieving strong classification performance with a relatively small number of units. Consequently, RBFNs have been successfully applied to a variety of data classification tasks [103], including the recognition of handwritten numerals [83], image recognition [23, 107], speech recognition [105, 110, 128], process fault detection [85], and various pattern recognition tasks [24, 86].
FIGURE 8.4: An MLP network made of three layers: two hidden layers and one output layer. Units in the network are sigmoid units. Biases are represented as input units.
with the output o in the range of [0, 1] (or [−1, 1] if the tanh function is used and therefore the
decision threshold is zero). Clearly, the output unit implements a logistic regression on the space
of features generated by the last hidden layer. An MLP for a two-class problem is similar to the
perceptron or adaline in its use of a single output unit. However, its capabilities are greatly enhanced
by the use of hidden layers.
for k ∈ {1, ..., K}, with o_k ∈ [0, 1] when the sigmoid is used as the activation function. The use of a 1-of-K encoding scheme for the target function is standard. MLP training is equivalent to
approximating a mapping f : Rd → RK , where target values are the natural basis of RK . It has
been proven that with the use of the 1-of-K encoding scheme and an adequate error criterion, such
as the mean square error or cross-entropy, an MLP output can approximate posterior probabilities
[14, 111, 131]:
for all k. This is, however, an asymptotic result. In practice, the sigmoidal units' output values do not allow for a probabilistic interpretation, since they do not sum to one. For a probabilistic interpretation of the network's outputs, a softmax transformation of the output units' net values can be used [26]:
o_k = e^{ν_k} / Σ_{j=1}^{K} e^{ν_j}    (8.53)
where clearly Σ_{k=1}^{K} o_k = 1.
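As an illustration, the following Python sketch performs a forward pass through a small MLP whose output net values are converted to class posteriors with the softmax of Equation 8.53; the layer sizes and the random weights are assumptions made for the example, whereas a trained network would use learned weights.

import numpy as np

# Sketch of a forward pass with a sigmoid hidden layer and a softmax output layer.
rng = np.random.default_rng(4)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - np.max(v))              # subtract the max for numerical stability
    return e / e.sum()

d, H, K = 4, 8, 3                          # input, hidden, and output sizes (illustrative)
W1 = rng.normal(scale=0.5, size=(H, d + 1))   # +1 column for the bias input (x0 = 1)
W2 = rng.normal(scale=0.5, size=(K, H + 1))

x = rng.normal(size=d)
z = sigmoid(W1 @ np.append(1.0, x))        # hidden activations
o = softmax(W2 @ np.append(1.0, z))        # output posteriors, summing to one
print(o, o.sum())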
From input layer to hidden layer: A unit h in the first hidden layer computes
ν_h^{(1)} = Σ_{i=0}^{d} w_{ih}^{(1)} x_i    (8.54)
z_h^{(1)} = φ(ν_h^{(1)})    (8.55)
where w_{ih}^{(1)} denotes the weight of the synapse connecting the i-th input unit to the h-th hidden unit in the first layer; we assume that w_{0h}^{(1)} refers to the bias and that x_0 = 1. For simplicity, we also assume that all units in the network use the same activation function φ().
From hidden layer to hidden layer: A unit h in the l-th hidden layer, Hl , gets its data from
units in the previous layer, Hl−1 , to produce an internal output as
In practice, training uses an empirical approximation of the objective function defined on a training set X:
E(w) = (1/N) Σ_{x ∈ X} E(x; w)    (8.61)
and the minimization process is monitored on the values of that empirical approximation.
Various error metrics are available for supervised training of the MLP and are defined using the
target output function that describes the desirable output vector for a given input data example. A
1-of-K encoding scheme for the target output vector is the norm. We will review a few widely used
error metrics used to train MLPs.
MSE is a widely used metric, thanks to its quadratic properties. Variants can include more general
L p norms.
Training a network to minimize the CE objective function can be interpreted as minimizing the Kullback-Leibler information distance [131] or maximizing the mutual information. Faster learning has frequently been reported for information-theoretic error metrics than for the quadratic error. Learning with logarithmic error metrics has also been shown to be less prone to convergence to local minima.
with υ > 0. The term d_k is called the misclassification measure. A negative value indicates a correct classification decision (E_mce(x; w) ≈ 0) and a positive value indicates an incorrect classification decision (E_mce(x; w) ≈ 1).
The MCE objective function
E_mce(w) = (1/N) Σ_{x ∈ X} E_mce(x; w)    (8.66)
is a smoothed approximation of the error rate on the training set, and therefore minimizes misclassification errors in a more direct manner than MSE or CE [16].
∂E/∂w_{hj}^{(l)} = (∂E/∂ν_j^{(l)}) · (∂ν_j^{(l)}/∂w_{hj}^{(l)})    (8.71)
              = δ_j^{(l)} z_h^{(l−1)}    (8.72)
By referencing the output layer as H_L and the input layer as H_0, the weight update is simply
Δw_{ij}^{(l)} = −η δ_j^{(l)} z_i^{(l−1)}    (8.73)
for any given weight w_{ij}^{(l)} and l ∈ {1, ..., L}, where δ_j^{(l)} is typically called the "credit" of unit j in layer H_l. We remark that the MLP's weight update in Equation 8.73 implements the Hebbian learning rule.
Updating a unit's weights requires the computation of the unit's credit, which depends on the unit's position within the network. The credit δ_k that corresponds to the k-th output unit (without a softmax transformation, for simplicity) is
δ_k = (∂E/∂o_k) · (∂o_k/∂ν_k^{(L)})    (8.74)
    = φ'(ν_k^{(L)}) · ∂E/∂o_k    (8.75)
where φ'() is the derivative of the activation function. When using the sigmoid function with the MSE criterion, the credit of an output unit is
δ_k = o_k (1 − o_k)(o_k − t_k)    (8.77)
δ_h^{(l)} = (∂E/∂z_h^{(l)}) · (∂z_h^{(l)}/∂ν_h^{(l)})    (8.78)
         = φ'(ν_h^{(l)}) Σ_{h' ∈ H_{l+1}} (∂E/∂ν_{h'}^{(l+1)}) · (∂ν_{h'}^{(l+1)}/∂z_h^{(l)})    (8.79)
         = z_h^{(l)} (1 − z_h^{(l)}) Σ_{h' ∈ H_{l+1}} δ_{h'}^{(l+1)} w_{hh'}^{(l+1)}    (8.80)
where we assumed the use of the sigmoid function in deriving Equation 8.80.
The credit is computed from the output unit using Equation 8.77 and then propagated to the
hidden layer using Equation 8.80. These two equations are the foundation of the backpropagation
algorithm enabling the computation of the gradient descent for each weight by propagating the
credit from the output layer to hidden layers. The backpropagation algorithm is illustrated in Algo-
rithm 8.4.
13: end if
14:
15: for i in H_{l−1} do
16:     w_{ij}^{(l)} ← w_{ij}^{(l)} − η(t) z_i^{(l−1)} δ_j^{(l)}    # weight update
17: end for
18:
19: end for
20: end for
21: end for
22: end for
23: w_{ij}^{(l)}: weight from unit i to unit j in layer H_l
24: z_i^{(l)}: output of unit i in layer H_l
25: η(t): learning rate at iteration t
26: δ_j^{(l)}: error term (credit) of unit j in layer H_l
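To make the mechanics of Algorithm 8.4 concrete, the following Python sketch trains a one-hidden-layer sigmoid MLP with stochastic backpropagation under the MSE criterion, computing output and hidden credits as in Equations 8.77 and 8.80 and applying the update of Equation 8.73. The XOR toy problem, layer sizes, learning rate, and epoch count are illustrative assumptions.

import numpy as np

# Sketch of stochastic backpropagation for a one-hidden-layer sigmoid MLP (MSE criterion).
rng = np.random.default_rng(5)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])          # XOR targets

H, eta = 8, 0.5
W1 = rng.normal(scale=1.0, size=(H, 3))             # hidden weights (bias in column 0)
W2 = rng.normal(scale=1.0, size=(1, H + 1))         # output weights

for epoch in range(10000):
    for x, t in zip(X, T):
        x1 = np.append(1.0, x)
        z = sigmoid(W1 @ x1)                         # forward pass
        z1 = np.append(1.0, z)
        o = sigmoid(W2 @ z1)

        delta_out = o * (1 - o) * (o - t)            # output credit (sigmoid + MSE)
        delta_hid = z * (1 - z) * (W2[:, 1:].T @ delta_out)   # back-propagated hidden credit

        W2 -= eta * np.outer(delta_out, z1)          # update: w <- w - eta * delta * input
        W1 -= eta * np.outer(delta_hid, x1)

# With these settings the network usually learns XOR; results depend on the random init.
for x in X:
    z1 = np.append(1.0, sigmoid(W1 @ np.append(1.0, x)))
    print(x, sigmoid(W2 @ z1)[0])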
Enhancements to the gradient descent include, among other techniques, the use of a momentum
smoother [116], the use of a dynamic learning rate [7,129], the resilient backpropagation algorithm,
and the Quickprop algorithm.
8.5.4.2 Delta-Bar-Delta
The optimal choice of the learning rate depends on the complexity of the error surface and on the network architecture. To circumvent this problem, the delta-bar-delta algorithm [66] assigns an individual learning rate η_w to each weight w in the network. This rate is itself updated during learning, depending on the direction of the gradient. The delta-bar-delta algorithm operates on the objective function E as follows.
First, it estimates a moving average of the error derivative, \overline{∂E/∂w}(t), at each iteration t to reduce gradient fluctuation:
\overline{∂E/∂w}(t) = β ∂E/∂w(t) + (1 − β) \overline{∂E/∂w}(t − 1)    (8.82)
with 0 < β < 1. Then, the individual learning rate is updated as follows:
η_w(t + 1) = η_w(t) + κ        if \overline{∂E/∂w}(t) · ∂E/∂w(t − 1) > 0
           = (1 − ρ) η_w(t)    if \overline{∂E/∂w}(t) · ∂E/∂w(t − 1) < 0    (8.83)
           = η_w(t)            otherwise
The learning rate increases when consecutive gradients point in the same direction, and de-
creases otherwise. The choice of meta-learning parameters κ, ρ, β is driven by various considera-
tions, in particular the type of data. The use of a momentum has been suggested to reduce gradient
oscillations [95]. The update is in batch mode. An incremental version of the delta-bar-delta algo-
rithm that can be used in online settings is available [125].
where γ_w is the update step that is tracked at each training iteration t and updated as
γ_w(t + 1) = min{η⁺ γ_w(t), γ_max}    if ∂E/∂w(t) · ∂E/∂w(t − 1) > 0
           = max{η⁻ γ_w(t), γ_min}    if ∂E/∂w(t) · ∂E/∂w(t − 1) < 0    (8.85)
           = γ_w(t)                   otherwise
where 0 < η⁻ < 1 < η⁺, with typical values 0.5 and 1.2 respectively, and with the initial update step set to γ_w = 0.1.
The effect of Rprop is to speed up learning in flat regions of the error surface and in areas that are close to a local minimum. The size of the update is bounded by γ_min and γ_max to moderate the speed-up.
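A simplified Python sketch of the Rprop step-size adaptation of Equation 8.85, applied to a simple quadratic objective, is shown below; the objective, iteration count, and step-size bounds are illustrative assumptions, and the full algorithm's additional bookkeeping (such as resetting the stored gradient after a sign change) is omitted.

import numpy as np

# Simplified Rprop sketch: per-weight step sizes adapted from the signs of successive gradients.
def grad(w):
    return 2.0 * (w - np.array([3.0, -1.0]))        # gradient of ||w - w*||^2

w = np.zeros(2)
gamma = np.full(2, 0.1)                              # initial per-weight step size
gamma_min, gamma_max = 1e-6, 50.0
eta_plus, eta_minus = 1.2, 0.5
g_prev = np.zeros(2)

for _ in range(100):
    g = grad(w)
    same_sign = g * g_prev > 0
    opposite = g * g_prev < 0
    gamma = np.where(same_sign, np.minimum(eta_plus * gamma, gamma_max), gamma)   # grow step
    gamma = np.where(opposite, np.maximum(eta_minus * gamma, gamma_min), gamma)   # shrink step
    w -= np.sign(g) * gamma                          # move each weight by its own step size
    g_prev = g

print(w)   # approaches the minimizer [3, -1]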
8.5.4.4 Quick-Prop
Second-order algorithms consider more information about the shape of the error surface than
the mere value of its gradient. A more efficient optimization can be done if the curvature of the error
surface is considered at each step. However, those methods require the estimation of the Hessian
matrix, which is computationally expensive for a feedforward network [7]. Pseudo-Newton methods
can be used to simplify the form of the Hessian matrix [10], where only the diagonal components are computed, yielding
Δw(t) = (∂E/∂w(t)) / (∂²E/∂w²(t))    (8.86)
The Quickprop algorithm further approximates the second derivative ∂²E/∂w² as the difference between consecutive values of the first derivative, producing
Δw(t) = Δw(t − 1) · (∂E/∂w(t)) / (∂E/∂w(t) − ∂E/∂w(t − 1))    (8.87)
No matrix inversion is necessary, and the computational effort involved in finding the required second partial derivatives is limited. Quickprop relies on numerous assumptions about the error surface, including a locally quadratic shape and a Hessian matrix that is effectively diagonal (weights treated independently). Various heuristics are therefore necessary to make it work in practice, including limiting the update Δw(t) to some maximum value and switching to a simple gradient update whenever the sign of the approximated Hessian does not change for multiple consecutive iterations [36, 37].
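The following Python sketch illustrates a Quickprop-style secant step on a one-dimensional quadratic, using the standard Fahlman formulation Δw(t) = Δw(t−1) · S(t)/(S(t−1) − S(t)) with S = ∂E/∂w (sign conventions vary between presentations of Equation 8.87), a plain gradient step for the first iteration, and a cap on the step size, in line with the heuristics mentioned above. The objective and constants are illustrative assumptions.

# Quickprop-style secant step on a one-dimensional quadratic (illustrative sketch).
def grad(w):
    return 2.0 * (w - 5.0)                 # gradient of E(w) = (w - 5)^2

w, eta, max_step = 0.0, 0.1, 4.0
g_prev = grad(w)
dw = -eta * g_prev                         # bootstrap with a simple gradient step
w += dw

for _ in range(20):
    g = grad(w)
    denom = g_prev - g
    step = dw * g / denom if denom != 0.0 else -eta * g   # secant approximation of Newton's step
    step = max(-max_step, min(max_step, step))            # cap the update magnitude
    w += step
    dw, g_prev = step, g

print(w)   # converges to the minimizer w* = 5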
next layer, meaning that the network generates composed and unseen internal representations of the
data. The output layer can be viewed as a classification machine on the space spanned by the last
hidden layer’s outputs. More hidden layers and more units in those layers generate a more powerful
network with the ability to generate more meaningful internal representations of the data.
However, training large and deep neural networks (DNNs) with many hidden layers and units is a difficult enterprise. With a number of parameters on the order of millions, the error surface is rather complex and has numerous local minima, many of which lead to poor performance [4, 42]. Also, during backpropagation, there is the issue of the vanishing gradient, where the norm of the gradient gets smaller as the chain rule of the gradient computation is expanded across many layers [12, 57]. Lastly, a neural network with a large number of weights requires tremendous
computational power. Deep learning techniques come with various strategies and heuristics to over-
come these problems. Those strategies include the use of prior knowledge on the domain of applica-
tion, the application of a layer-wise selective training technique, and the use of a high-performance
computing system.
same weights, enabling spatial translation-invariance. Lastly, some layers are designed to perform max-pooling [80] by selecting the maximum value of non-overlapping regions generated by lower layers. The end result is a powerful system that generates patterns in a hierarchical manner
similar to that of the visual cortex. Weight-sharing constraints and local connectivity yield a sparse
weight set that significantly reduces the degrees of freedom in the system and enables relatively
faster training with increased chances of generalization [79].
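As a small illustration of the pooling operation described above, the following Python sketch performs non-overlapping 2 × 2 max-pooling on a toy feature map; the array contents and pool size are illustrative.

import numpy as np

# Non-overlapping max-pooling over pool x pool regions of a 2-D feature map.
def max_pool(feature_map, pool=2):
    h, w = feature_map.shape
    h, w = h - h % pool, w - w % pool              # drop rows/columns that do not fit
    fm = feature_map[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return fm.max(axis=(1, 3))                     # maximum of each pool x pool block

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))
# [[ 5.  7.]
#  [13. 15.]]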
The CNN has been applied very successfully to numerous classification tasks including hand-
written character recognition [32, 118] and face recognition [39, 79].
the first RBM’s hidden layer outputs as the second RBM’s visible layer, and then continuing up to
the last hidden layer.
and by doing so, the MLP encodes an internal representation in the hidden layers that can be used to regenerate the data in the output layer. Auto-encoding can be likened to a non-linear principal
component estimation. When used within a deep-learning system, each hidden layer of the DNN
is used as a hidden layer in an encoder and trained accordingly. After training is done, the weights
are fixed and used to generate data for the next layer, which is also trained within an auto-encoder
using the outputs of the previously trained layer. The process is repeated up to the last hidden layer.
A deep CNN (DCNN) design was proposed based on those principles and achieved an improved
performance compared to the original CNN version [81, 109]; it was a good showcase of the dual-
use of domain knowledge in the design of the CNN architecture and layer-wise training for network
optimization.
8.7 Summary
We have seen in this chapter that neural networks are powerful universal mappers for data clas-
sification endowed with certain properties such as non-linearity, the capacity to represent informa-
tion in a hierarchical fashion, and the use of parallelism in performing the classification tasks. The
resurgence of interest in neural networks coincides with the availability of high-performance computing
to perform computationally demanding tasks. The emergence of training methodologies to exploit
these computing resources and the concrete results achieved in selected classification tasks make
neural networks an attractive choice for classification problems. The choice of a particular network
is still linked to the task to be accomplished. Prior knowledge is an important element to consider
and the emerging dual use of unsupervised layer-wise training and supervised training has proven effective in providing a good initialization of the weights and avoiding poor local minima.
Acknowledgements
The author would like to thank Deepak Turaga of IBM T. J. Watson Research Center for his
support and encouragement for this review.
Bibliography
[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann ma-
chines. Cognitive Science, 9(1):147–169, 1985.
[2] G. An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8:643–674, 1996.
[3] J. K. Anlauf and M. Biehl. The AdaTron: an Adaptive Perceptron Algorithm. Europhysics
Letters, 10(7):687–692, 1989.
[4] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima for single neu-
rons. In M. Mozer, D. S. Touretzky, and M. Perrone, editors, Advances in Neural Information
Processing Systems, volume 8, pages 315–322. MIT Press, Cambridge, MA, 1996.
[5] B. Widrow and M. E. Hoff. Adaptive switching circuits. In IRE WESCON Convention Record, pages 96–104, 1960.
[6] E. B. Barlett and R. E. Uhrig. Nuclear power plant status diagnostics using artificial neural networks. Nuclear Technology, 97:272–281, 1992.
[7] T. Battiti. First- and second-order methods for learning: Between steepest descent and New-
ton’s method. Neural Computation, 4(2):141–166, 1992.
[8] E. B. Baum and F. Wilczek. Supervised learning of probability distribution by neural net-
work. In A. Andeson, editor, Neural Information Processing Systems, pages 52–61. American
Institute of Physics, 1988.
[9] W. G. Baxt. Use of an artificial neural network for data analysis in clinical decision-making:
The diagnosis of acute coronary occlusion. Neural Computation, 2(4):480–489, 1990.
[10] S. Becker and Y. Le Cun. Improving the convergence of back-propagation learning with
second order methods. In D. S. Touretzky, G. E. Hinton, and T. J. Sejnowski, editors, Pro-
ceedings of the 1988 Connectionist Models Summer School, pages 29–37. San Francisco,
CA: Morgan Kaufmann, 1989.
[11] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep
networks. In Proceedings of NIPS, pages 153–160, MIT Press, Cambridge, MA, 2006.
[12] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient de-
scent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
Neural Networks: A Review 237
[13] M. Biehl, A. Ghosh, and B. Hammer. Dynamics and generalization ability of LVQ algo-
rithms. Journal of Machine Learning Research, 8:323–360, 2007.
[14] A. Biem. Discriminative Feature Extraction Applied to Speech Recognition. PhD thesis,
Université Paris 6, 1997.
[15] A. Biem and S. Katagiri. Cepstrum liftering based on minimum classification error. In
Technical Meeting of Institute of Electrical Information Communcation Engineering of Japan
(IEICE), volume 92-126, pages 17–24, July 1992.
[16] A. Biem, S. Katagiri, and B.-H. Juang. Pattern recognition based on discriminative feature
extraction. IEEE Transactions on Signal Processing, 45(02):500–504, 1997.
[17] S. A. Billings and C. F. Fung. Recurrent radial basis function networks for adaptive noise
cancellation. Neural Networks, 8(2):273–290, 1995.
[18] C. Bishop. Improving the generalization properties of radial basis function neural networks.
Neural Computation, 3(4):579–588, 1991.
[19] C. M. Bishop. Neural Network for Pattern Recognition. Oxford University Press, 1995.
[20] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Compu-
tation, 7(1):108–116, 1995.
[21] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[22] A. G. Bors and I. Pitas. Median radial basis function neural network. IEEE Transactions on
Neural Networks, 7(6):1351–1364, 1996.
[23] A. G. Bors and I. Pitas. Object classification in 3-D images using alpha-trimmed mean radial
basis function network. IEEE Transactions on Image Processing, 8(12):1744–1756, 1999.
[24] D. Bounds, P. Lloyd, and B. Mathew. A comparison of neural network and other pattern
recognition approaches to the diagnosis of low back disorders. Neural Networks, 3(5):583–
91, 1990.
[25] H. Bourlard and N. Morgan. Continuous speech recognition by connectionist statistical meth-
ods. IEEE Transactions on Neural Networks, 4:893–909, 1993.
[26] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with
relationships to statistical pattern recognition. In Fogelman-Soulie and Herault, editors, Neu-
rocomputing: Algorithms, Architectures and Applications, NATO ASI Series, pages 227–236.
Springer, 1990.
[27] O. Buchtala, P. Neumann, and B. Sick. A strategy for an efficient training of radial basis
function networks for classification applications. In Proceedings of the International Joint
Conference on Neural Networks, volume 2, pages 1025–1030, 2003.
[28] H. B. Burke, P. H. Goodman, D. B. Rosen, D. E. Henson, J. N. Weinstein, F. E. Harrell,
J. R. Marks, D. P. Winchester, and D. G. Bostwick. Artificial neural networks improve the
accuracy of cancer survival prediction. Cancer, 79:857–862, 1997.
[29] S. Cajal. A new concept of the histology of the central nervous system. In Rottenberg and
Hochberg, editors, Neurological Classics in Modern Translation. New York: Hafner, 1977.
[30] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ
algorithm. In Advances in Neural Information Processing Systems, volume 15, pages 462–
469. MIT Press, Cambridge, MA, 2003.
238 Data Classification: Algorithms and Applications
[31] Y. Le Cun and Y. Bengio. Convolutional networks for images, speech, and time series. In
M.A. Arbib, editor, Handbook of Brain Theory and Neural Networks, pages 255–258. MIT
Press, Cambridge, MA, 1995.
[32] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubberd, and L. D.
Jackel. Handwritten digit recognition with a back-propagation network. Advances in Neural
Information Processing Systems, 2:396–404, 1990.
[33] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of
Control Signals and Systems, 2(303-314), 1989.
[34] D. Hebb. The Organization of Behavior. Science Editions, 1961.
[35] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley Interscience
Publications, 1973.
[36] S. Fahlman. Faster learning variations on back-propagation: An empirical study. In D. S.
Touretzky, G. E. Hinton, and T. J. Sejnowski, editors, Proceedings of the 1988 Connectionist
Models Summer School, pages 38–51. Morgan Kaufmann, San Francisco, CA, 1989.
[37] S. E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical report, Carnegie Mellon University, 1988.
[38] Scott E. Fahlman and Christian Lebiere. The cascade-correlation learning architecture. In
D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524–
532. Morgan Kaufmann, San Mateo, 1990.
[39] B. Fasel. Robust face analysis using convolutional neural networks. In Proceedings of the
International Conference on Pattern Recognition, 2002.
[40] J. A. Flanagan. Self-organisation in Kohonen’s SOM. Neural Networks, 9(7):1185–1197,
1996.
[41] Y. Freund and R Schapire. Large margin classification using the perceptron algorithm. Ma-
chine Learning, 37(3):277–296., 1999.
[42] K. Fukumizu and S. Amari. Local minima and plateaus in hierarchical structures of multi-
layer perceptrons. Neural Networks, 13(3):317–327, 2000.
[43] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193–202,
1980.
[44] S.I. Gallant. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks,
1(2):179–191, 1990.
[45] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures.
Neural Computation, 7(2):219–269, 1995.
[46] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press,
3rd edition, 1996.
[47] I. Guyon. Applications of neural networks to character recognition. International Journal of
Pattern Recognition Artificial Intelligence, 5:353–382, 1991.
[48] P. Haffner, M. Franzini, and A. Waibel. Integrating time alignment and neural networks
for high performance continuous speech recognition. In Proceedings of IEEE International
Conference of Acoustic, Speech, and Signal Processing (ICASSP), pages 105–108, May 1991.
Neural Networks: A Review 239
[49] P. Haffner and A. Waibel. Multi-state time delay neural networks for continuous speech
recognition. In NIPS, pages 135–142, 1992.
[50] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural
Networks, 15(8-9):1059–1068, 2002.
[51] S. Haykin. Neural Networks. A Comprehensive Foundation. Macmillan College Publishing,
New York, 1994.
[52] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation.
Addison-Wesley, 1991.
[53] G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1-3): 185–234,
1989.
[54] G. Hinton, L. Deng, D. Yu, and G. Dahl. Deep neural networks for acoustic modeling in
speech recognition, IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[55] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504– 507, 2006.
[56] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets.
Neural Computation, 18(7):1527–1554, 2006.
[57] S. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and
problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based
Systems, 06(02):107–116, 1998.
[58] J. Hopfield. Neural networks and physical systems with emergent collective computational
abilities. Reprinted in Anderson and Rosenfeld (1988), editors, In Proceedings of National
Academy of Sciences USA, volume 79, pages 2554–58, April 1982.
[59] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Net-
works, 4:251–257, 1991.
[60] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2:359–366, 1989.
[61] J. C. Hoskins, K. M. Kaliyur, and D. M. Himmelblau. Incipient fault detection and diagnosis
using artificial neural networks. Proceedings of the International Joint Conference on Neural
Networks, pages 81–86, 1990.
[62] D. Hubel and T. Wiesel. Receptive fields and functional architecture of monkey striate cortex.
Journal of Physiology, 195:215-243, 1968.
[63] C. Igel and M Husken. Empirical evaluation of the improved rprop learning algorithms.
Neurocomputing, 50:105–123, 2003.
[64] B. Igelnik and Y. H. Pao. Stochastic choice of basis functions in adaptive function approxi-
mation and the functional-link net. IEEE Transactions on Neural Networks, 6(6):1320–1329,
1995.
[65] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local
experts. Neural Computation, 3(1):79–87, 1991.
[66] R. A. Jacobs. Increased rate of convergence through learning rate adaptation. Neural Net-
works, 1(4):295–307, 1988.
240 Data Classification: Algorithms and Applications
[67] J. J. Hopfield. Learning algorithms and probability distributions in feed-forward and feed-
back networks. In National Academy of Science, USA, volume 84, pages 8429–8433, De-
cember 1987.
[68] B.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 40(12):3043–3054, December
1992.
[69] S. Katagiri, B.-H. Juang, and A. Biem. Discriminative Feature Extraction. In R. J. Mammone,
editor, Artificial Neural Networks for Speech and Vision. Chapman and Hall, 1993.
[70] S. Katagiri, C.-H. Lee, and B.-H. Juang. Discriminative multilayer feed-forward networks.
In Proceedings of the IEEE Worshop on Neural Networks for Signal Processing, pages 309–
318, 1991.
[71] M. Kikuchi and K. Fukushima. Neural network model of the visual system: Binding form
and motion. Neural Networks, 9(8):1417–1427, 1996.
[72] B. Kingsbury, T. Sainath, and H. Soltau. Scalable minimum Bayes risk training of DNN
acoustic models using distributed Hessian-free optimization. In Interspeech, 2012.
[73] S. Knerr, L. Personnaz, and G. Dreyfus. Handwritten digit recognition by neural networks
with single-layer training. IEEE Transactions on Neural Networks, 3:962–968, 1992.
[74] T. Kohonen. The Self-Organizing Map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[75] T. Kohonen, G. Barma, and T. Charles. Statistical pattern recognition with neural networks:
Benchmarking studies. In IEEE Proceedings of ICNN, volume 1, pages 66–68, 1988.
[76] T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third edition,
1989.
[77] T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
[78] M. L. Minsky and S. A. Papert. Perceptrons. MIT Press, Cambridge, MA, 1969.
[79] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back. Face recognition: A convolutional
neural network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.
[80] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[81] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable
unsupervised learning of hierarchical representations. In 26th International Conference on
Machine Learning, pages 609–616, 2009.
[82] H. Lee, Y. Largman, P. Pham, and A. Ng. Unsupervised feature learning for audio classifica-
tion using convolutional deep belief networks. In Advances in Neural Information Processing
Systems, 22:1096–1104, 2009.
[83] Y. Lee. Handwritten digit recognition using k nearest-neighbor, radial-basis function, and
backpropagation neural networks. Neural Computation, 3(3):440–449, 1991.
[84] R. Lengelle and T. Denoeux. Training MLPs layer by layer using an objective function for
internal representations. Neural Networks, 9:83–97, 1996.
[85] J. A. Leonard and M. A. Kramer. Radial basis function networks for classifying process
faults. IEEE Control Systems Magazine, 11(3):31–38, 1991.
Neural Networks: A Review 241
[86] R. P Lippmann. Pattern classification using neural networks. IEEE Communications Maga-
zine, 27(11):47–50, 59–64, 1989.
[87] R. P. Lippmann. Review of neural networks for speech recognition. Neural Computation,
1(1):1–38, 1989.
[88] D. Lowe. Adaptive radial basis function nonlinearities and the problem of generalisation. In
1st IEEE International Conference on Artificial Neural Networks, pages 171–175, 1989.
[89] G. L. Martin and J. A. Pittman. Recognizing hand-printed letters and digits using backprop-
agation learning. Neural Computation, 3(2):258–267, 1991.
[90] J. L. McClelland, D. E. Rumelhart, and the PDP Research Group. Parallel Distributed Pro-
cessing: Explorations in the Microstructure of Cognition, volume 2. MIT Press, Cambridge,
MA, 1986.
[91] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[92] E. McDermott and S. Katagiri. Shift-invariant, multi-category phoneme recognition using
Kohonen’s LVQ2. In Proceedings of IEEE ICASSP, 1:81–84, 1989.
[93] E. McDermott and S. Katagiri. LVQ-based shift-tolerant phoneme recognition. IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, 39:1398–1411, 1991.
[94] E. McDermott and S. Katagiri. Prototype-based minimum classification error/generalized
probabilistic descent for various speech units. Computer Speech and Language, 8(8):351–
368, Oct. 1994.
[95] A. Minai and R.D. Williams. Acceleration of backpropagation through learning rate and
momentum adaptation. In International Joint Conference on Neural Networks, pages 676–
679, 1990.
[96] M. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs,
NJ, 1967.
[97] M. Miyatake, H. Sawai, Y. Minami, and K. Shikano. Integrated training for spotting Japanese
phonemes using large phonemic time-delay neural networks. In Proceedings of IEEE
ICASSP, 1: pages 449–452, 1990.
[98] M. Bianchini, P. Frasconi, and M. Gori. Learning without local minima in radial basis function networks. IEEE Transactions on Neural Networks, 6(3):749–756, 1995.
[99] J. Moody. The effective number of parameters: An analysis of generalization and regulariza-
tion in nonlinear learning systems. In J. Moody, S. J. Hanson, and R. P Lippmann, editors,
Advances in Neural Information Processing Systems, 4:847–854, Morgan Kaufmann, San
Mateo, CA, 1992.
[100] J. Moody and C. Darken. Learning with localized receptive fields. In D. Touretzky, G. Hinton,
and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Summer School, pages 133–
143. San Mateo, CA: Morgan Kaufmann., 1988.
[101] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units.
Neural Computation, 1(2):281–294, 1989.
[102] M. Musavi, W. Ahmed, K. Chan, K. Faris, and D Hummels. On the training of radial basis
function classifiers. Neural Networks, 5:595–603, 1992.
242 Data Classification: Algorithms and Applications
[103] K. Ng. A Comparative Study of the Practical Characteristics of Neural Network and Con-
ventional Pattern Classifiers. M.S. Thesis, Massachusetts Institute of Technology. Dept. of
Electrical Engineering and Computer Science, 1990.
[104] N. J. Nilsson. Learning Machines. New York, NY: McGraw-Hill, 1965.
[105] M. Niranjan and F. Fallside. Neural networks and radial basis functions in classifying static
speech patterns. Computer Speech and Language, 4(3):275–289, 1990.
[106] R. Nopsuwanchai and A. Biem. Prototype-based minimum error classifier for handwritten
digits recognition. In IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP’04), volume 5, pages 845–848, 2004.
[107] T. Poggio and S. Edelman. A network that learns to recognize three dimensional objects.
Letters to Nature, 343:263–266, 1990.
[108] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE,
78:1481–1497, 1990.
[109] M.-A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse repre-
sentations with an energy-based model. In B. Scholkopf, J. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
[110] S. Renals and R. Rohwer. Phoneme classification experiments using radial basis function.
In Proceedings of the International Joint Conference on Neural Networks, volume 1, pages
461–467, 1989.
[111] M. D. Richard and R. P. Lippmann. Neural network classifiers estimate Bayesian a posteriori
probabilities. Neural Computation, 3(4):461–483, 1991.
[112] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning:
the rprop algorithm. In IEEE International Conference on Neural Networks, pages 586–591,
San Francisco, CA, 1993.
[113] A.J. Robinson. Application of recurrent nets to phone probability estimation. IEEE Transac-
tions on Neural Networks, 5(2):298–305, March 1994.
[117] David E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel Distributed Processing. MIT
Press, Cambridge, MA, 1986.
[118] E. Sackinger, B. Boser, J. Bromley, and Y. LeCun. Application of the ANNA neural network
chip to high-speed character recognition. IEEE Transactions on Neural Networks, 3:498–
505, 1992.
[119] A. Sato and K Yamada. Generalized learning vector quantization. In D. S. Touretzky and
M. E. Hasselmo, editors, Advances in neural information processing systems, volume 8. MIT
Press, Cambridge, MA., 1996.
Neural Networks: A Review 243
[120] H. Sawai. TDNN-LR continuous speech recognition system using adaptive incremental
TDNN training. In Proceedings of IEEE ICASSP, volume 1, pages 53–55, 1991.
[121] P. Schneider, M. Biehl, and B. Hammer. Adaptive relevance matrices in learning vector
quantization. Neural Computation, 21(12):3532–3561, December 2009.
[126] J. Utans and J. Moody. Selecting neural network architecture via the prediction risk: Applica-
tion to corporate bond rating prediction. In Proceedings of the 1st International Conference
on Artificial Intelligence Applications, pages 35–41, 1991.
[127] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[128] R. L K Venkateswarlu, R.V. Kumari, and G.V. Jayasri. Speech recognition using radial basis
function neural network. In Electronics Computer Technology (ICECT), 2011 3rd Interna-
tional Conference on, volume 3, pages 441–445, 2011.
[129] T. P. Vogl, J. K. Mangis, J. K. Rigler, W. T. Zink, and D. L. Alkon. Accelerating the conver-
gence of the backpropagation method. Biological Cybernetics, 59:257–263, 1988.
[130] D. Wettschereck and T. Dietterich. Improving the performance of radial basis function net-
works by learning center locations. In Advances in Neural Information Processing Systems,
volume 4, pages 1133–1140, Morgan Kaufmann, San Mateo, CA. 1992.
[131] H. White. Learning in artificial neural networks: A statistical perspective. Neural Computa-
tion, 1(4):425–464, 1989.
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
[email protected]
9.1 Introduction
Advances in hardware technology have led to the increasing popularity of data streams [1].
Many simple operations of everyday life, such as using a credit card or the phone, often lead to
automated creation of data. Since these operations often scale over large numbers of participants,
they lead to massive data streams. Similarly, telecommunications and social networks often contain
large amounts of network or text data streams. The problem of learning from such data streams
presents unprecedented challenges, especially in resource-constrained scenarios.
An important problem in the streaming scenario is that of data classification. In this problem,
the data instances are associated with labels, and it is desirable to determine the labels on the test
instances. Typically, it is assumed that both the training data and the test data may arrive in the form
of a stream. In the most general case, the two types of instances are mixed with one another. Since
the test instances are classified independently of one another, it is usually not difficult to perform
the real-time classification of the test stream. The problems arise because of the fact that the training
model is always based on the aggregate properties of multiple records. These aggregate properties
may change over time. This phenomenon is referred to as concept drift. All streaming models need
to account for concept drift in the model construction process. Therefore, the construction of a
training model in a streaming and evolving scenario can often be very challenging.
Aside from the issue of concept drift, many other issues arise in data stream classification, which
are specific to the problem at hand. These issues are dependent on the nature of the application in
which streaming classification is used. A single approach may not work well in all scenarios. This
diversity in problem scenarios is discussed below:
• In many scenarios, some of the classes may be rare, and may arrive only occasionally in the
data stream. In such cases, the classification of the stream becomes extremely challenging
because of the fact that it is often more important to detect the rare class rather than the nor-
mal class. This is typical of cost sensitive scenarios. Furthermore, in some cases, previously
unseen classes may be mixed with classes for which training data is available.
• Many other data domains such as text [6] and graphs [9] have been studied in the context of
classification. Such scenarios may require dedicated techniques, because of the difference in
the underlying data format.
• In many cases, the entire stream may not be available at a single processor or location. In such
cases, distributed mining of data streams becomes extremely important.
Clearly, different scenarios and data domains present different challenges for the stream classi-
fication process. This chapter will discuss the different types of stream classification algorithms that
are commonly used in the literature. We will also study different data-centric scenarios, correspond-
ing to different domains or levels of difficulty.
This chapter is organized as follows. The next section presents a number of general algorithms
for classification of quantitative data. This includes methods such as decision trees, nearest neighbor
classifiers, and ensemble methods. Most of these algorithms have been developed for quantitative
data, but can easily be extended to the categorical scenario. Section 9.3 discusses the problem of
rare class classification in data streams. In Section 9.4, we will study the problem of massive domain
stream classification. Different data domains such as text and graphs are studied in Section 9.5.
Section 9.6 contains the conclusions and summary.
The Hoeffding bound is used to show that a decision tree built on the sampled data would make the same split as one built on the full stream, with high probability. This approach can be used with a variety of split criteria, such as the Gini index or the information gain. For example, consider the case of the Gini index. For two attributes i and j, we would like to pick attribute i when its Gini index G_i is smaller than G_j. While dealing with sampled data, the problem is that an error could be caused by the sampling process, and the order of the Gini indices might be reversed. Therefore, for some threshold level ε, if G_i − G_j < −ε holds on the sampled data, it is desired that G_i − G_j < 0 holds on the original data with high probability. This would result in the same split at that node in the sampled and original data, if i and j correspond to the best and second-best attributes, respectively. The number of examples
required to produce the same split as the original data (with high probability) is determined. The
Hoeffding bound is used to determine the number of relevant examples, so that this probabilistic
guarantee may be achieved. If all splits in the decision tree are the same, then the same decision
tree will be created. The probabilistic guarantees on each split can be converted to probabilistic
guarantees on the construction of the entire decision tree, by aggregating the probabilities of error
over the individual nodes. The Hoeffding tree can also be applied to data streams, by building
the tree incrementally, as more examples stream in, from the higher levels to the lower levels. At
any given node, one needs to wait until enough tuples are available in order to make decisions
about lower levels. The memory requirements are modest, because only the counts of the different
discrete values of the attributes (over different classes) need to be maintained in order to make split
decisions. The VFDT algorithm is also based on the Hoeffding tree algorithm, though it makes a
number of modifications. Specifically, it is more aggressive about making choices in the tie breaking
of different attributes for splits. It also allows the deactivation of less promising leaf nodes. It is
generally more memory efficient, because of these optimizations.
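To make the split test concrete, the following sketch shows how the Hoeffding bound can be computed and used to compare the two leading attributes by gini-index. This is a minimal illustration under stated assumptions, not the VFDT implementation: the helper names, the tie-breaking threshold, and the assumption that per-attribute class counts are already maintained at the leaf are all hypothetical.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound for a statistic with the given range,
    confidence 1 - delta, after n observed examples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gini_best, gini_second_best, n, delta=1e-7,
                 value_range=1.0, tie_threshold=0.05):
    """Decide whether the best attribute can be chosen with confidence 1 - delta.

    gini_best and gini_second_best are the gini-indices of the two leading
    attributes (smaller is better). A split is made when the observed gap
    exceeds the Hoeffding bound, or when the bound has become so small that
    the attributes are effectively tied (VFDT-style tie breaking).
    """
    eps = hoeffding_bound(value_range, delta, n)
    gap = gini_second_best - gini_best   # positive when the best attribute leads
    return gap > eps or eps < tie_threshold
```

For example, with a gap of 0.02 after 50,000 examples and δ = 10⁻⁷, the bound falls below the gap and the split is made; with only a few hundred examples, the node would keep waiting for more data.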
The original VFDT method is not designed for cases where the stream is evolving over time.
The work in [47] extends this method to the case of concept-drifting data streams. This method
is referred to as CVFDT. CVFDT incorporates two main ideas in order to address the additional
challenges of drift:
1. A sliding window of training items is used to limit the impact of historical behavior.
2. Alternate subtrees are grown at internal nodes whose split decisions appear outdated because of drift; an alternate subtree replaces the original one when it becomes more accurate, a decision that can typically be made with a relatively small number of samples.
The work in [40] also extends the VFDT method for continuous data
and drift and applies Bayes classifiers at the leaf nodes.
It was pointed out in [27] that it is often assumed that old data is very valuable in improving the
accuracy of stream mining. While old data does provide greater robustness in cases where patterns
in the stream are stable, this is not always the case. In cases where the stream has evolved, the
old data may not represent the currently relevant patterns for classification. Therefore, the work in [27] proposes a method that is able to make this selection from the data in a principled way, at a small additional cost. The technique in [27] uses a decision tree ensemble method, in which
each component of the ensemble mixture is essentially a decision tree. While many models achieve
the same goal using a sliding window model, the merit of this approach is the ability to perform
systematic data selection.
Bifet et al. [19] proposed a method that shares similarities with the work in [40]. However, the work in [19] replaces the naive Bayes classifiers at the leaf nodes with perceptron classifiers. The idea is to gain greater efficiency, while maintaining competitive accuracy. Hashemi et al. [46] developed a flexible decision
tree, known as FlexDT, based on fuzzy logic. This is done in order to address noise and missing
values in streaming data. The problem of decision tree construction has also been extended to the
uncertain scenario by Liang [59]. A detailed discussion of some of the streaming decision-tree methods, together with pseudocodes, is provided in Chapter 4.
Among the aforementioned methods, the second (pattern-based) approach is easier to generalize to the streaming scenario, because online methods exist for frequent pattern mining in data streams. Once the frequent patterns
have been determined, rule-sets can be constructed from them using any offline algorithm. Since
frequent patterns can also be efficiently determined over sliding windows, such methods can also be
used for decremental learning. The reader is referred to [1] for a primer on the streaming methods
for frequent pattern mining.
Since decision trees can be extended to streaming data, the corresponding rule-based classifiers
can also be extended to the streaming scenario. As discussed in the introduction chapter, the rule-
growth phase of sequential covering algorithms shares a number of conceptual similarities with
decision tree construction. Therefore, many of the methods for streaming decision tree construction
can be extended to rule growth. The major problem is that the sequential covering algorithm assumes
the availability of all the training examples at one time. These issues can however be addressed by
sampling recent portions of the stream.
An interesting rule-based method, which is based on C4.5Rules, is proposed in [90]. This
method is able to adapt to concept drift. This work distinguishes between proactive and reactive
models. The idea in a proactive model is to try to anticipate the concept drift that will take place,
and to make adjustments to the model on this basis. A reactive model is one in which the concept drift that has already occurred is used in order to modify the model. The work in [90] uses C4.5Rules [73] as the base learner in order to create the triggers for the different scenarios. The sec-
tion on text stream classification in this chapter also discusses a number of other rule-based methods
for streaming classification.
Another recent method, known as LOCUST [4], uses a lazy learning approach in order to improve the effectiveness of the classification process in an evolving data stream. The reader is referred to Chapter 6 for background on instance-based learning. In this approach, the training phase is completely dispensed with, except for the fact that the training data is organized in the form of inverted lists on the
discretized attributes for efficient retrieval. These lists are maintained using an online approach, as
more data arrives. Furthermore, the approach is resource-adaptive, because it can adjust to varying
speeds of the underlying data stream. The way in which the algorithm is made resource-adaptive is
by structuring it as an any-time algorithm. The work in [4] defines an online analytical processing
framework for real-time classification. In this technique, the data is received continuously over time,
and the classification is performed in real time, by sampling local subsets of attributes, which are
relevant to a particular data record. This is achieved by sampling the inverted lists on the discretized
attributes. The intersection of these inverted lists represents a subspace local to the test instance.
Thus, each of the sampled subsets of attributes represents an instance-specific rule in the locality of
the test instance. The majority class label of the local instances in that subset of attributes is reported
as the relevant class label for a particular record for that sample. Since multiple attribute samples
are used, this provides greater robustness to the estimation process. Since the approach is structured as an any-time method, it can vary the number of samples used during periods of very high load.
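As a rough illustration of this lazy, inverted-list-based scheme, the following code samples small attribute subsets of a test record, intersects the corresponding inverted lists, and takes a majority vote over the matching training records. This is a simplified sketch rather than the LOCUST implementation; the discretization, the subset size, and the sampling policy shown here are assumptions.

```python
import random
from collections import Counter

def classify_lazily(test_record, inverted_lists, labels,
                    num_samples=10, subset_size=2):
    """inverted_lists[(attr, discretized_value)] -> set of training record ids;
    labels[record_id] -> class label. Each sample votes with the majority
    class of the training records matching a random attribute subset of the
    test record (an instance-specific rule in its locality)."""
    votes = Counter()
    attrs = list(test_record.keys())
    for _ in range(num_samples):
        subset = random.sample(attrs, min(subset_size, len(attrs)))
        matching = None
        for a in subset:
            ids = inverted_lists.get((a, test_record[a]), set())
            matching = ids if matching is None else matching & ids
        if matching:
            majority = Counter(labels[i] for i in matching).most_common(1)[0][0]
            votes[majority] += 1
    return votes.most_common(1)[0][0] if votes else None
```

During periods of high load, an any-time variant would simply reduce num_samples and report the vote accumulated so far.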
A micro-cluster is a statistical data structure that maintains the zero-th order, first order, and
second order moments from the data stream. It can be shown that these moments are sufficient
for maintaining most of the required cluster statistics. In this approach, the algorithm dynamically
maintains a set of micro-clusters [11], such that each micro-cluster is constrained to contain data
points of the same class. Furthermore, snapshots of the micro-clusters are maintained (indirectly)
over different historical horizons. This is done by maintaining micro-cluster statistics since the
beginning of the stream and additively updating them. Then, the statistics are stored either uniformly
or over pyramidally stored intervals. By storing the statistics over pyramidal intervals, better space
efficiency can be achieved. The statistics for a particular horizon (tc − h,tc ) of length h may be
inferred by subtracting the statistics at time tc − h from the statistics at time tc . Note that statistics
over multiple time horizons can be constructed using such an approach.
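The additive and subtractive properties of these moment statistics can be illustrated with a minimal class-specific micro-cluster sketch. This is only an illustration of the moment bookkeeping; the field names are assumptions and the snapshot management of [11, 14] is not reproduced.

```python
class MicroCluster:
    """Zero-th, first, and second order moments of the points absorbed so far."""
    def __init__(self, dim):
        self.n = 0                      # zero-th order moment (count)
        self.ls = [0.0] * dim           # first order moments (linear sums)
        self.ss = [0.0] * dim           # second order moments (squared sums)

    def add(self, point):
        """Additive update when a new stream point is absorbed."""
        self.n += 1
        for j, x in enumerate(point):
            self.ls[j] += x
            self.ss[j] += x * x

    def subtract(self, older):
        """Statistics over a horizon: snapshot at t_c minus snapshot at t_c - h."""
        diff = MicroCluster(len(self.ls))
        diff.n = self.n - older.n
        diff.ls = [a - b for a, b in zip(self.ls, older.ls)]
        diff.ss = [a - b for a, b in zip(self.ss, older.ss)]
        return diff

    def centroid(self):
        return [s / self.n for s in self.ls] if self.n else None
```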
In order to perform the classification, a key issue is the choice of the horizon to be used for a
particular test instance. For this purpose, the method in [14] separates out a portion of the training stream as a holdout stream. The classification accuracy is tested over different time horizons on the holdout stream, and the horizon with the best accuracy is used in each case. This approach is similar to the standard “bucket of models” approach used for parameter tuning in data classification [92]. The major difference in this case is that the parameter being tuned is the time horizon, because a data stream is being considered. This provides more effective results because smaller
horizons are automatically used in highly evolving streams, whereas larger horizons are used in less
evolving streams. It has been shown in [14] that such an approach is highly adaptive to different
levels of evolution of the data stream. The microclustering approach also has the advantage that it
can be naturally generalized to positive-unlabeled data stream classification [58].
3. In the context of concept drift detection, it is often difficult to estimate the correct window
size. During periods of fast drift, the window size should be small. During periods of slow
drift, the window size should be larger, in order to minimize generalization error. This ensures
that a larger number of training data points are used during stable periods. Therefore, the
window sizes should be adapted according to the varying trends in the data stream. This can
often be quite challenging in the streaming scenario.
The work on streaming SVM classification is quite significant, but less well known than the more popular techniques for streaming decision trees and ensemble analysis. In fact, some of the
earliest work [52, 82] in streaming SVM learning precedes the earliest streaming work in decision
tree construction. The work in [52] is particularly notable, because it is one of the earliest works
that uses a dynamic window-based framework for adjusting to concept drift. This precedes most of
the work on streaming concept drift performed subsequently by the data mining community. Other
significant works on incremental support vector machines are discussed in [30, 36, 38, 74, 76, 80].
The key idea in most of these methods is that SVM training is a constrained quadratic optimization problem, whose optimality conditions are the Karush-Kuhn-Tucker (KKT) conditions. Therefore, as long as it is possible to account for the impact of the addition or removal of training instances on these conditions, such an approach is likely to work well. These methods therefore show how to efficiently maintain
the optimality conditions while adding or removing instances. This is much more efficient than
retraining the SVM from scratch, and is also very useful for the streaming scenario. In terms of
popularity, the work in [30] is used most frequently as a representative of support vector learning.
The notable feature of this work is that it shows that decremental learning provides insights about
the relationship between the generalization performance and the geometry of the data.
Perceptron Algorithm
Inputs: Learning rate µ;
Training data (Xi, yi) ∀i ∈ {1 . . . n}
Initialize the weight vector A and the bias b to 0 or small random values;
repeat
  for each training instance (Xi, yi) do
    if the sign of A · Xi + b does not match yi then
      update A ⇐ A + µ · yi · Xi and b ⇐ b + µ · yi;
until the weights in A converge
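A runnable counterpart of the above pseudocode might look as follows. This is a minimal sketch under the assumption of labels in {−1, +1} and a simple epoch-based stopping rule; neither is prescribed by the pseudocode itself.

```python
import numpy as np

def train_perceptron(X, y, mu=0.1, max_epochs=100):
    """X: (n, d) array of training instances; y: array of labels in {-1, +1}."""
    n, d = X.shape
    A = np.zeros(d)          # weight vector
    b = 0.0                  # bias term
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if np.sign(A @ X[i] + b) != y[i]:
                A += mu * y[i] * X[i]   # move the hyperplane toward the mistake
                b += mu * y[i]
                mistakes += 1
        if mistakes == 0:    # converged: all training points correctly classified
            break
    return A, b
```

In the streaming setting, the inner loop would simply be applied once to each arriving training instance instead of sweeping repeatedly over a stored data set.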
subset of data on which to apply the method in order to achieve the most accurate results. It was
pointed out in [42] that the appropriate assumptions for mining concept drifting data streams are not
always easy to infer from the underlying data. It has been shown in [42] that a simple voting-based
ensemble framework can sometimes perform more effectively than relatively complex models.
Ensemble methods have also been extended to the rare class scenario. For example, the work
in [41] is able to achieve effective classification in the context of skewed distributions. This scenario
is discussed in detail in the next section.
but may not necessarily be restricted to a particular class. However, in practice, the problem of
limited labels arises mostly in binary classification problems where the rare class has a relatively
small number of labels, and the remaining records are unlabeled.
The different kinds of scenarios that could arise in these contexts are as follows:
• Rare Classes: The determination of such classes is similar to the static supervised scenario,
except that it needs to be done efficiently in the streaming scenario. In such cases, a small
fraction of the records may belong to a rare class, but these records are not necessarily distributed in a temporally non-homogeneous way. While some concept drift may also
need to be accounted for, the modification to standard stream classification algorithms re-
mains quite analogous to the modifications of the static classification problem to the rare-class
scenario.
• Novel Classes: These are classes that were not encountered before in the data stream. There-
fore, they may not be reflected in the training model at all. Eventually, such classes may
become a normal part of the data over time. This scenario is somewhat similar to semi-
supervised outlier detection in the static scenario, though the addition of the temporal com-
ponent brings a number of challenges associated with it.
• Infrequently Recurring Classes: These are classes that have not been encountered for a while, but may re-appear in the stream. Such classes are different from the first type of outliers, be-
cause they arrive in temporally rare bursts. Since most data stream classification algorithms
use some form of discounting in order to address concept drift, they may sometimes com-
pletely age out information about old classes. Such classes cannot be distinguished from novel
classes, if the infrequently recurring classes are not reflected in the training model. Therefore,
issues of model update and discounting are important in the detection of such classes. The
third kind of outlier was first proposed in [64].
Much of the traditional work on novel class detection [67] is focussed only on finding novel
classes that are different from the current models. However, this approach does not distinguish
between the different novel classes that may be encountered over time. A more general way of
understanding the novel class detection problem is to view it as a combination of supervised (clas-
sification) and unsupervised (clustering) models. Thus, as in unsupervised novel class detection
models such as first story detection [13, 97], the cohesion between the test instances of a novel class
is important in determining whether they belong to the same novel class or not. The work in [64,65]
combines both supervised and semi-supervised models by:
• Maintaining a supervised model of the classes available in the training data as an ensemble of
classification models.
• Maintaining an unsupervised model of the (unlabeled) novel classes received so far as cohe-
sive groups of tightly knit clusters.
When a new test instance is received, the classification model is first applied to it to test whether it
belongs to a currently existing (labeled) class. If this is not the case, it is tested whether it naturally
belongs to one of the novel classes. The relationship of the test instance to a statistical boundary of
the clusters representing the novel classes is used for this purpose. If neither of these conditions hold,
it is assumed that the new data point should be in a novel class of its own. Thus, this approach creates
a flexible scenario that combines supervised and unsupervised methods for novel class detection.
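The overall decision logic can be summarized in a short skeleton. This is only a schematic sketch: the confidence test and cluster-boundary test below stand in for the ensemble voting and cohesion criteria of [64, 65], and all of the method and attribute names on the model and cluster objects are hypothetical.

```python
def classify_or_detect_novel(x, supervised_model, novel_clusters, conf_threshold=0.7):
    """Return a known class label, an existing novel-class id, or a new novel-class id."""
    # 1. Try the supervised ensemble of known classes (hypothetical API).
    label, confidence = supervised_model.predict_with_confidence(x)
    if confidence >= conf_threshold:
        return ("known", label)
    # 2. Test cohesion with previously observed (unlabeled) novel classes.
    for cluster in novel_clusters:
        if cluster.contains_within_boundary(x):   # statistical boundary test (assumed)
            cluster.absorb(x)
            return ("novel", cluster.cluster_id)
    # 3. Otherwise, start a brand new novel class seeded by this point.
    new_cluster = novel_clusters.create(seed=x)
    return ("novel", new_cluster.cluster_id)
```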
A massive-domain stream is defined as one in which each attribute takes on an extremely large
number of possible values. Some examples are as follows:
1. In internet applications, the number of possible source and destination addresses can be very
large. For example, there may be well over $10^8$ possible IP-addresses.
2. The individual items in supermarket transactions are often drawn from millions of possibili-
ties.
3. In general, when the attributes contain very detailed information such as locations, addresses,
names, phone numbers or other such information, the domain size is very large. Recent years
have seen a considerable increase in such applications because of advances in data collection
techniques.
Many synopsis techniques such as sketches and distinct-element counting are motivated by the
massive-domain scenario [34]. Note that these data structures would not be required for streams
with a small number of possible values. Therefore, while the importance of this scenario is well
understood in the context of synopsis data structures, it is rarely studied in the context of core mining
problems. Recent work has also addressed this scenario in the context of core mining problems such
as clustering [10] and classification [5].
The one-pass nature of data stream computation creates further restrictions on the computational approach that may be used for discriminatory analysis. Thus, the massive-domain size
creates challenges in terms of space requirements, whereas the stream model further restricts the
classes of algorithms that may be used in order to create space-efficient methods. For example,
consider the following types of classification models:
• Techniques such as decision trees [20,73] require the computation of the discriminatory power
of each possible attribute value in order to determine how the splits should be constructed. In
order to compute the relative behavior of different attribute values, the discriminatory power
of different attribute values (or combinations of values) needs to be maintained. This becomes
difficult in the context of massive data streams.
• Techniques such as rule-based classifiers [24] require the determination of combinations of at-
tributes that are relevant to classification. With increasing domain size, it is no longer possible
to compute this efficiently either in terms of space or running time.
The stream scenario presents additional challenges for classifiers. This is because the one-pass
constraint dictates the choice of data structures and algorithms that can be used for the classification
problem. All stream classifiers such as that discussed in [26] implicitly assume that the underlying
domain size can be handled with modest main memory or storage limitations.
In order to facilitate further discussion, some notations and definitions will be introduced. The data stream D contains d-dimensional records that are denoted by $X_1 \ldots X_N \ldots$. Associated with each record is a class label that is drawn from the set $\{1 \ldots k\}$. The attributes of record $X_i$ are denoted by $(x_i^1 \ldots x_i^d)$. It is assumed that the attribute value $x_i^k$ is drawn from the unordered domain set $J_k = \{v_1^k \ldots v_{M_k}^k\}$, where $M_k$ denotes the domain size for the kth attribute. The value of $M_k$ can be very large, and may range on the order of millions or billions. When the discriminatory power is defined in terms of subspaces of higher dimensionality, this number multiplies rapidly to very large values.
Such intermediate computations will be difficult to perform on even high-end machines.
Even though an extremely large number of attribute-value combinations may be possible over
the different dimensions and domain sizes, only a limited number of these possibilities are usually
relevant for classification purposes. Unfortunately, the intermediate computations required to effec-
tively compare these combinations may not be easily feasible. The one-pass constraint of the data
stream model creates an additional challenge in the computation process. In order to perform the
classification, it is not necessary to explicitly determine the combinations of attributes that are re-
lated to a given class label. The more relevant question is the determination of whether some combi-
nations of attributes exist that are strongly related to some class label. As it turns out, a sketch-based
approach is very effective in such a scenario.
Sketch-based approaches [34] were designed for enumeration of different kinds of frequency
statistics of data sets. The work in [5] extends the well-known count-min sketch [34] to the problem
of classification of data streams. In this sketch, a total of w = ⌈ln(1/δ)⌉ pairwise independent hash functions are used, each of which maps onto uniformly random integers in the range [0, h − 1], where h = ⌈e/ε⌉ and e is the base of the natural logarithm. The data structure itself consists of a two-dimensional array with w · h cells, with a length of h and a width of w. Each hash function corresponds to one
of w 1-dimensional arrays with h cells each. In standard applications of the count-min sketch, the
hash functions are used in order to update the counts of the different cells in this 2-dimensional
data structure. For example, consider a 1-dimensional data stream with elements drawn from a
massive set of domain values. When a new element of the data stream is received, each of the w
hash functions are applied, in order to map onto a number in [0 . . . h − 1]. The count of each of
the set of w cells is incremented by 1. In order to estimate the count of an item, the set of w cells
to which each of the w hash-functions map are determined. The minimum value among all these
cells is determined. Let ct be the true value of the count being estimated. The estimated count is at
least equal to ct , since all counts are non-negative, and there may be an over-estimation because of
collisions among hash cells. As it turns out, a probabilistic upper bound to the estimate may also be
determined. It has been shown in [34] that for a data stream with T arrivals, the estimate is at most
ct + ε · T with probability at least 1 − δ.
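A minimal count-min sketch in Python illustrates the update and point-query operations described above. This is a generic sketch of the structure in [34]; the hash construction shown (seeding Python's built-in hash) is a common simplification and is not the exact pairwise-independent family used there.

```python
import math
import random

class CountMinSketch:
    def __init__(self, epsilon, delta, seed=0):
        self.w = int(math.ceil(math.log(1.0 / delta)))   # number of hash functions
        self.h = int(math.ceil(math.e / epsilon))         # cells per hash function
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 31) for _ in range(self.w)]
        self.table = [[0] * self.h for _ in range(self.w)]

    def _cell(self, i, item):
        # Simplified hash: mixes a per-row seed with the item (assumption).
        return hash((self.seeds[i], item)) % self.h

    def update(self, item, count=1):
        for i in range(self.w):
            self.table[i][self._cell(i, item)] += count

    def estimate(self, item):
        # Minimum over the w counters: never under-estimates the true count,
        # and over-estimates by at most eps * T with probability >= 1 - delta.
        return min(self.table[i][self._cell(i, item)] for i in range(self.w))
```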
In typical subspace classifiers such as rule-based classifiers, low dimensional projections such
as 2-dimensional or 3-dimensional combinations of attributes in the antecedents of the rule are
used. In the case of data sets with massive domain sizes, the number of possible combinations
of attributes (even for such low-dimensional combinations) can be so large that the corresponding
statistics cannot be maintained explicitly during intermediate computations. However, the sketch-
based method provides a unique technique for maintaining counts by creating super-items from
different combinations of attribute values. Each super-item V contains a concatenation of the
attribute value strings along with the dimension indices to which these strings belong. Let the actual
value-string corresponding to value ir be S(ir ), and let the dimension index corresponding to the
item ir be dim(ir ). In order to represent the dimension-value combinations corresponding to items
i1 . . . i p , a new string is created by concatenating the strings S(i1 ) . . . S(i p ) and the dimension indices
dim(i1 ) . . . dim(i p ).
This new super-string is then hashed into the sketch table as if it is the attribute value for the
special super-item V . For each of the k-classes, a separate sketch of size w · h is maintained, and
the sketch cells for a given class are updated only when a data stream item of the corresponding
class is received. It is important to note that the same set of w hash functions is used for updating
the sketch corresponding to each of the k classes in the data. Then, the sketch is updated once for
each 1-dimensional attribute value for the d different attributes, and once for each of the super-items
created by attribute combinations. For example, consider the case where it is desired to determine discriminatory combinations of 1 or 2 attributes. There are a total of d + d · (d − 1)/2 = d · (d + 1)/2 such combinations. Then, the sketch for the corresponding class is updated L = d · (d +
1)/2 times for each of the attribute-values or combinations of attribute-values. In general, L may be
larger if even higher dimensional combinations are used, though for cases of massive domain sizes,
even a low-dimensional subspace would have a high enough level of specificity for classification
purposes. This is because of the extremely large number of combinations of possibilities, most of
which would have very little frequency with respect to the data stream. For all practical purposes,
one can assume that the use of 2-dimensional or 3-dimensional combinations provides sufficient
discrimination in the massive-domain case. The value of L is dependent only on the dimensionality
and is independent of the domain size along any of the dimensions. For modest values of d, the
value of L is typically much lower than the number of possible combinations of attribute values.
The sketch-based classification algorithm has the advantage of being simple to implement. For each of the classes, a separate sketch table with w · h cells is maintained. Thus, there are a total of w · h · k cells that need to be maintained. When a new item of class i arrives from the data stream, a total of L · w cells of the ith sketch table are updated. Specifically, for each item or super-item, the count of the corresponding w cells in the sketch table is incremented by one unit. The overall approach for
updating the sketch table is illustrated in Figure 9.1. The input to the algorithm is the data stream
D , the maximum dimensionality of the subspace combinations that are tracked, and the number of
classes in the data set.
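Using the count-min structure sketched above, the per-class update with 1-dimensional values and 2-dimensional super-items could be written as follows. This is illustrative only; the string encoding of super-items (value strings concatenated with their dimension indices) is an assumption consistent with the description, not the exact encoding of [5].

```python
from itertools import combinations

def update_class_sketches(sketches, record, class_label):
    """sketches: dict mapping class label -> CountMinSketch.
    record: dict mapping dimension index -> attribute value string.
    Only the sketch of the record's own class is updated."""
    sk = sketches[class_label]
    # 1-dimensional attribute values
    for dim, value in record.items():
        sk.update(f"{dim}:{value}")
    # 2-dimensional super-items: concatenated value strings plus dimension indices
    for (d1, v1), (d2, v2) in combinations(sorted(record.items()), 2):
        sk.update(f"{d1}:{v1}|{d2}:{v2}")
```

Each record therefore triggers L = d · (d + 1)/2 sketch updates, independently of the domain size of the individual attributes.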
The key to using the sketch-based approach effectively is to be able to efficiently determine
discriminative combinations of attributes. While one does not need to determine such combinations
explicitly in closed form, it suffices to be able to test whether a given combination of attributes is
discriminative. This suffices to perform effective classification of a given test instance. Consider the
state of the data stream when N records have arrived so far. The numbers of data stream records received from the k different classes are denoted by $N_1 \ldots N_k$, so that we have $\sum_{i=1}^{k} N_i = N$.
Most combinations of attribute values have very low frequency of presence in the data stream.
Here, one is interested in those combinations of attribute values that have high relative presence in
one class compared to the other classes. Here we are referring to high relative presence for a given
class in combination with a moderate amount of absolute presence. For example, if a particular
combination of values occurs in 0.5% of the records corresponding to the class i, but it occurs in less
than 0.1% of the records belonging to the other classes, then the relative presence of the combination
in that particular class is high enough to be considered significant. Therefore, the discriminative
power of a given combination of values (or super-item) V will be defined. Let fi (V ) denote the
fractional presence of the super-item V in class i, and gi (V ) be the fractional presence of the super-
item V in all classes other than i. In order to identify classification behavior specific to class i, the
super-item V is of interest, if fi (V ) is significantly greater than gi (V ).
Definition 9.4.1 The discriminatory power θi (V ) of the super-item V is defined as the fractional
difference in the relative frequency of the attribute-value combination V in class i versus the relative
presence in classes other than i. Formally, the value of θi (V ) is defined as follows:
$$\theta_i(V) = \frac{f_i(V) - g_i(V)}{f_i(V)} \qquad (9.2)$$
Since one is interested only in items for which fi(V) is greater than gi(V), the value of θi(V)
in super-items of interest will lie between 0 and 1. The larger the value of θi (V ), the greater the
correlation between the attribute-combination V and the class i. A value of θi (V ) = 0 indicates no
correlation, whereas the value of θi (V ) = 1 indicates perfect correlation. In addition, it is interesting
to determine those combinations of attribute values that occur in at least a fraction s of the records belonging to a given class i. Attribute-value combinations that satisfy both the frequency requirement and the discrimination requirement are referred to as (θ, s)-discriminatory. Specifically, a combination of attribute values (super-item) V is (θ, s)-discriminatory with respect to class i, if fi(V) ≥ s and θi(V) ≥ θ.
Lemma 9.4.1 With probability at least (1 − δ), the values of fi (V ) and gi (V ) are respectively over-
estimated to within L · ε of their true values when we use sketch tables with size w = ln(1/δ) and
h = e/ε.
Next, the accuracy of the estimation of θi(V) is analyzed. Note that one is only interested in those
attribute-combinations V for which fi (V ) ≥ s and fi (V ) ≥ gi (V ), since such patterns have sufficient
statistical counts and are also discriminatory with θi (V ) ≥ 0.
Lemma 9.4.2 Let βi(V) be the estimated value of θi(V) for an attribute-combination V with fractional selectivity at least s and fi(V) ≥ gi(V). Let ε′ be defined as ε′ = ε · L/s ≪ 1. With probability at least 1 − δ, it is the case that βi(V) ≤ θi(V) + ε′.
Next, we will examine the case when the value of θi (V ) is under-estimated [5].
Lemma 9.4.3 Let βi(V) be the estimated value of θi(V) for an attribute-combination V with fractional selectivity at least s and fi(V) ≥ gi(V). Let ε′ be defined as ε′ = ε · L/s ≪ 1. With probability at least 1 − δ, it is the case that βi(V) ≥ θi(V) − ε′.
The results of Lemma 9.4.2 and 9.4.3 can be combined to conclude the following:
Lemma 9.4.4 Let βi(V) be the estimated value of θi(V) for an attribute-combination V with fractional selectivity at least s and fi(V) ≥ gi(V). Let ε′ be defined as ε′ = ε · L/s ≪ 1. With probability at least 1 − 2 · δ, it is the case that βi(V) ∈ (θi(V) − ε′, θi(V) + ε′).
This result follows from the fact that each of the inequalities in Lemmas 9.4.2 and 9.4.3 is true with
probability at least 1 − δ. Therefore, both inequalities are true with probability at least (1 − 2 · δ).
Another natural corollary of this result is that any pattern that is truly (θ, s)-discriminatory will be discovered with probability at least (1 − 2 · δ) by using the sketch-based approach in order to determine all patterns that are at least (θ − ε′, s · (1 − ε′))-discriminatory. Furthermore, the discriminatory power of such patterns will not be over- or under-estimated by an inaccuracy greater than ε′.
TABLE 9.1: Storage Requirement of Sketch Table for Different Data and Pattern Dimensionalities (ε = 0.01, δ = 0.01, s = 0.01)
The process of determining whether a super-item V is (θ, s)-discriminatory requires us to determine fi(V) and gi(V) only. The value of fi(V) may be determined in a straightforward way by using the sketch-
based technique of [34] in order to determine the estimated frequency of the item. We note that
fi (V ) can be determined quite accurately since we are only considering those patterns that have a
certain minimum support. The value of gi (V ) may be determined by adding up the sketch tables for
all the other different classes, and then using the same technique.
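The estimation of fi(V), gi(V), and θi(V) from the per-class sketches can then be written as a short routine. This is a sketch under the assumption that the per-class record counts N1 . . . Nk are tracked separately; the helper name and signature are illustrative, not from [5].

```python
def discriminatory_power(sketches, class_counts, super_item, target_class):
    """Estimate theta_i(V) = (f_i(V) - g_i(V)) / f_i(V) from per-class sketches.

    sketches: dict class -> CountMinSketch; class_counts: dict class -> N_i."""
    n_i = class_counts[target_class]
    n_rest = sum(c for cl, c in class_counts.items() if cl != target_class)
    f_i = sketches[target_class].estimate(super_item) / max(n_i, 1)
    # g_i: fractional presence over all classes other than the target class,
    # obtained by summing the estimates from the remaining sketch tables.
    rest_count = sum(sk.estimate(super_item)
                     for cl, sk in sketches.items() if cl != target_class)
    g_i = rest_count / max(n_rest, 1)
    return (f_i - g_i) / f_i if f_i > 0 else 0.0
```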
What are the space requirements of the approach for classifying large data streams under a
variety of practical parameter settings? Consider the case where we have a 10-dimensional massive-domain data set that has at least $10^7$ values over each attribute. Then, the numbers of possible 2-dimensional and 3-dimensional value combinations are $10^{14} \cdot 10 \cdot (10 - 1)/2$ and $10^{21} \cdot 10 \cdot (10 - 1) \cdot (10 - 2)/6$, respectively. We note that even the former requirement translates to an order of
magnitude of about 4500 tera-bytes. Clearly, the intermediate-space requirements for aggregation-
based computations of many standard classifiers are beyond most modern computers. On the other
hand, for the sketch-based technique, the requirements continue to be quite modest. For example, let
us consider the case where we use the sketch-based approach with ε = 0.01 and δ = 0.01. Also, let
us assume that we wish to have the ability to perform discriminatory classification on patterns with
specificity at least s = 0.01. We have illustrated the storage requirements for data sets of different
dimensionalities in Table 9.1. While the table is constructed for the case of ε = 0.01, the storage
numbers in the last column of the table are illustrated for the case of ε = 0.2. We will see later that the use of much larger values of ε can provide effective and accurate results. It is clear that the
storage requirements are quite modest, and can be held in main memory in most cases. For the case
of stringent accuracy requirements of ε = 0.01, the high dimensional data sets may require a modest
amount of disk storage in order to capture the sketch table. However, a practical choice of ε = 0.2
always results in a table that can be stored in main memory. These results are especially useful in
light of the fact that even data sets of small dimensionalities cannot be effectively processed with
the use of traditional methods.
The sketch table may be leveraged to perform the classification for a given test instance Y . The
first step is to extract all the L patterns in the test instance with dimensionality less than r. Then, we
determine those patterns that are (θ, s)-discriminatory with respect to at least one class. The process
of finding (θ, s)-discriminatory patterns has already been discussed in the last section. We use a
voting scheme in order to determine the final class label. Each pattern that is (θ, s)-discriminatory
constitutes a vote for that class. The class with the highest number of votes is reported as the relevant
class label. The overall procedure is reported in Figure 9.2.
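The voting step for a test instance can be expressed compactly, reusing the discriminatory_power helper sketched earlier. This is an illustrative sketch only; the pattern encoding and thresholds mirror the earlier assumptions and are not the exact procedure of Figure 9.2.

```python
from collections import Counter
from itertools import combinations

def classify_test_instance(sketches, class_counts, record, theta=0.5, s=0.01):
    """Vote over all low-dimensional patterns of the test record that are
    (theta, s)-discriminatory for some class; return the most voted class."""
    patterns = [f"{d}:{v}" for d, v in record.items()]
    patterns += [f"{d1}:{v1}|{d2}:{v2}"
                 for (d1, v1), (d2, v2) in combinations(sorted(record.items()), 2)]
    votes = Counter()
    for p in patterns:
        for cl in sketches:
            f_i = sketches[cl].estimate(p) / max(class_counts[cl], 1)
            if f_i >= s and discriminatory_power(sketches, class_counts, p, cl) >= theta:
                votes[cl] += 1                 # one vote per discriminatory pattern
    return votes.most_common(1)[0][0] if votes else None
```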
• In the first case, the training data may be available for batch learning, but the test data may
arrive in the form of a stream.
• In the second case, both the training and the test data may arrive in the form of a stream.
The patterns in the training data may continuously change over time, as a result of which the
models need to be updated dynamically.
The first scenario is usually easy to handle, because most classifier models are compact and
classify individual test instances efficiently. On the other hand, in the second scenario, the train-
ing model needs to be constantly updated in order to account for changes in the patterns of the
underlying training data. The easiest approach to such a problem is to incorporate temporal decay
factors into model construction algorithms, so as to age out the old data. This ensures that the new
(and more timely) data is weighted more significantly in the classification model. An interesting
technique along this direction has been proposed in [77], in which a temporal weighting factor is
introduced in order to modify the classification algorithms. Specifically, the approach has been ap-
plied to the Naive Bayes, Rocchio, and k-nearest neighbor classification algorithms. It has been
shown that the incorporation of temporal weighting factors is useful in improving the classification
accuracy, when the underlying data is evolving over time.
A number of methods have also been designed specifically for the case of text streams. In partic-
ular, the work in [39] studies methods for classifying text streams in which the classi-
fication model may evolve over time. This problem has been studied extensively in the literature in
the context of multi-dimensional data streams [14, 87]. For example, in a spam filtering application,
a user may generally delete the spam emails for a particular topic, such as those corresponding to
political advertisements. However, in a particular period such as the presidential elections, the user
may be interested in the emails for that topic, and so it may not be appropriate to continue to classify
that email as spam.
The work in [39] looks at the particular problem of classification in the context of user-interests.
In this problem, the label of a document is considered either interesting or non-interesting. In order
to achieve this goal, the work in [39] maintains the interesting and non-interesting topics in a text
stream together with the evolution of the theme of the interesting topics. A document collection
is classified into multiple topics, each of which is labeled either interesting or non-interesting at a
given point. A concept refers to the main theme of interesting topics. A concept drift refers to the
fact that the main theme of the interesting topic has changed.
The main goals of the work are to maximize the accuracy of classification and minimize the cost
of re-classification. In order to achieve this goal, the method in [39] designs methods for detecting
both concept drift as well as model adaptation. The former refers to the change in the theme of user-
interests, whereas the latter refers to the detection of brand new concepts. In order to detect concept
drifts, the method in [39] measures the classification error-rates in the data stream in terms of true
and false positives. When the stream evolves, these error rates will increase if the changes in the concepts are not detected. In order to determine the change in concepts, techniques from statistical
quality control are used, in which we determine the mean µ and standard deviation σ of the error
rates, and determine whether this error rate remains within a particular tolerance, which is [µ − k ·
σ, µ + k · σ]. Here the tolerance is regulated by the parameter k. When the error rate changes, we
determine when the concepts should be dropped or included. In addition, the drift rate is measured
in order to determine the rate at which the concepts should be changed for classification purposes.
In addition, methods for dynamic construction and removal of sketches are discussed in [39].
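The statistical quality control test described here reduces to checking whether the recent error rate leaves the tolerance band [µ − k·σ, µ + k·σ]. The following is a simplified sketch of that idea; the two-window policy and the window lengths are assumptions, not the monitoring scheme of [39].

```python
from collections import deque

class ErrorRateDriftDetector:
    """Flags drift when the recent error rate leaves [mu - k*sigma, mu + k*sigma]."""
    def __init__(self, k=3.0, baseline_window=1000, recent_window=100):
        self.k = k
        self.baseline = deque(maxlen=baseline_window)   # long window: estimates mu, sigma
        self.recent = deque(maxlen=recent_window)       # short window: current error rate

    def add(self, error):
        """error is 1 if the last classification was wrong, 0 otherwise."""
        self.baseline.append(error)
        self.recent.append(error)

    def drift_detected(self):
        if len(self.baseline) < self.baseline.maxlen or not self.recent:
            return False                                # not enough history yet
        mu = sum(self.baseline) / len(self.baseline)
        var = sum((e - mu) ** 2 for e in self.baseline) / len(self.baseline)
        sigma = var ** 0.5
        recent_rate = sum(self.recent) / len(self.recent)
        return abs(recent_rate - mu) > self.k * sigma
```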
Another related work is that of one-class classification of text streams [93], in which only train-
ing data for the positive class is available, but there is no training data available for the negative
class. This is quite common in many real applications in which it is easy to find representative docu-
ments for a particular topic, but it is hard to find the representative documents in order to model the
background collection. The method works by designing an ensemble of classifiers in which some of
the classifiers corresponds to a recent model, whereas others correspond to a long-term model. This
is done in order to incorporate the fact that the classification should be performed with a combina-
tion of short-term and long-term trends. Another method for positive-unlabeled stream classification
is discussed in [62].
A rule-based technique, which can learn classifiers incrementally from data streams, is the
sleeping-experts systems [29, 33]. One characteristic of this rule-based system is that it uses the
position of the words in generating the classification rules. Typically, the rules correspond to sets
of words that are placed close together in a given text document. These sets of words are related to
a class label. For a given test document, it is determined whether these sets of words occur in the
document, and are used for classification. This system essentially learns a set of rules (or sleeping
experts), which can be updated incrementally by the system. While the technique was proposed
prior to the advent of data stream technology, its online nature ensures that it can be effectively used
for the stream scenario.
One of the classes of methods that can be easily adapted to stream classification is the broad
category of neural networks [79, 89]. This is because neural networks are essentially designed as
a classification model with a network of perceptrons and corresponding weights associated with the term-class pairs. These weights are incrementally learned as new examples arrive, and such an incremental update process can be naturally adapted to the streaming context. The first neural network
methods for online learning were proposed in [79, 89]. In these methods, the classifier starts off
by setting all the weights in the neural network to the same value. The incoming training example
is classified with the neural network. In the event that the result of the classification process is
correct, then the weights are not modified. On the other hand, if the classification is incorrect,
then the weights for the terms are either increased or decreased depending upon which class the
training example belongs to. Specifically, if the class to which the training example belongs is a
positive instance, the weights of the corresponding terms (in the training document) are increased
by α. Otherwise, the weights of these terms are reduced by α. The value of α is also known as the
learning rate. Many other variations are possible in terms of how the weights may be modified. For
example, the method in [25] uses a multiplicative update rule, in which two multiplicative constants
α1 > 1 and α2 < 1 are used for the classification process. The weights are multiplied by α1 when the example belongs to the positive class, and by α2 otherwise. Another variation [56]
also allows the modification of weights, when the classification process is correct. A number of
other online neural network methods for text classification (together with background on the topic)
may be found in [21, 28, 70, 75]. A Bayesian method for classification of text streams is discussed
in [22]. The method in [22] constructs a Bayesian model of the text, which can be used for online
classification. The key components of this approach are the design of a Bayesian online perceptron
and a Bayesian online Gaussian process, which can be used effectively for online learning.
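The additive and multiplicative update rules discussed above differ essentially in one line, as the following sketch shows. It is illustrative only: the term-weight representation, the default initial weights, and the parameter names are assumptions rather than the formulations of [25, 56, 79, 89].

```python
def online_update(weights, doc_terms, is_positive, rule="additive",
                  alpha=0.1, alpha1=1.5, alpha2=0.5):
    """Update term weights after a misclassified document.

    weights: dict term -> weight; doc_terms: terms of the misclassified document.
    Additive weights start at 0.0, multiplicative (Winnow-style) weights at 1.0."""
    for term in doc_terms:
        if rule == "additive":
            w = weights.get(term, 0.0)
            # perceptron-style: raise weights for positive examples, lower otherwise
            weights[term] = w + alpha if is_positive else w - alpha
        else:
            w = weights.get(term, 1.0)
            # multiplicative: scale by alpha1 > 1 or alpha2 < 1
            weights[term] = w * alpha1 if is_positive else w * alpha2
    return weights
```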
complication is caused by the fact that the edges for a particular graph may arrive out of order
(i.e., not contiguously) in the data stream. Since re-processing is not possible in a data stream,
such out-of-order edges create challenges for algorithms that extract structural characteristics
from the graphs.
2. The massive size of the graph creates a challenge for effective extraction of information that is
relevant to classification. For example, it is difficult to even store summary information about
the large number of distinct edges in the data. In such cases, the determination of frequent
discriminative subgraphs may be computationally and space-inefficient to the point of being
impractical.
A discriminative subgraph mining approach was proposed in [9] for graph stream classification.
We will define discriminative subgraphs both in terms of edge pattern co-occurrence and the class
label distributions. The presence of such subgraphs in test instances can be used in order to infer
their class labels. Such discriminative subgraphs are difficult to determine with enumeration-based
algorithms because of the massive domain of the underlying data. A probabilistic algorithm was
proposed for determining the discriminative patterns. The probabilistic algorithm uses a hash-based
summarization approach to capture the discriminative behavior of the underlying massive graphs.
This compressed representation can be leveraged for effective classification of test (graph) instances.
The graph is defined over the node set N. This node set is assumed to be drawn from a massive
domain of identifiers. The individual graphs in the stream are denoted by G1 . . . Gn . . .. Each graph
Gi is associated with the class label Ci , which is drawn from {1 . . . m}. It is assumed that the data is
sparse. The sparsity property implies that even though the node and edge domain may be very large,
the number of edges in the individual graphs may be relatively modest. This is generally true over a
wide variety of real applications such as those discussed in the introduction section. In the streaming
case, the edges of each graph Gi may not be neatly received at a given moment in time. Rather, the
edges of different graphs may appear out of order in the data stream. This means that the edges
for a given graph may not appear contiguously in the stream. This is definitely the case for many
applications such as social networks and communication networks in which one cannot control the
ordering of messages and communications across different edges. This is a particularly difficult
case. It is assumed that the edges are received in the form < GraphId >< Edge >< ClassLabel >.
Note that the class label is the same across all edges for a particular graph identifier. For notational
convenience, we can assume that the class label is appended to the graph identifier, and therefore
we can assume (without loss of generality or class information) that the incoming stream is of the
form < GraphId >< Edge >. The value of the variable < Edge > is defined by its two constituent
nodes.
A rule-based approach was designed in [9], which associates discriminative subgraphs to spe-
cific class labels. Therefore, a way of quantifying the significance of a particular subset of edges P
for the purposes of classification is needed. Ideally, one would like the subgraph P to have signifi-
cant statistical presence in terms of the relative frequency of its constituent edges. At the same time,
one would like P to be discriminative in terms of the class distribution.
The first criterion retains subgraphs with high relative frequency of presence, whereas the second
criterion retains only subgraphs with high discriminative behavior. It is important to design effective
techniques for determining patterns that satisfy both of these characteristics. First, the concept of a
significant subgraph in the data will be defined. A significant subgraph P is defined as a subgraph
(set of edges), for which the edges are correlated with one another in terms of absolute presence.
This is also referred to as edge coherence. This concept is formally defined as follows:
Definition 9.5.1 Let f∩ (P) be the fraction of graphs in G1 . . . Gn in which all edges of P are present.
Let f∪ (P) be the fraction of graphs in which at least one or more of the edges of P are present.
Then, the edge coherence C(P) of the subgraph P is denoted by f∩ (P)/ f∪ (P).
We note that the above definition of edge coherence is focussed on relative presence of subgraph
patterns rather than the absolute presence. This ensures that only significant patterns are found.
Therefore, large numbers of irrelevant patterns with high frequency but low significance are not
considered. While the coherence definition is more effective, it is computationally quite challenging
because of the size of the search space that may need to be explored. The randomized scheme
discussed here is specifically designed in order to handle this challenge.
Next, the class discrimination power of the different subgraphs is defined. For this purpose, the
class confidence of the edge set P with respect to the class label r ∈ {1 . . . m} is defined as follows.
Definition 9.5.2 Among all graphs containing subgraph P, let s(P, r) be the fraction belonging to
class label r. We denote this fraction as the confidence of the pattern P with respect to the class r.
Correspondingly, the concept of the dominant class confidence for a particular subgraph is defined.
Definition 9.5.3 The dominant class confidence DI(P) of subgraph P is defined as the maximum
class confidence across all the different classes {1 . . . m}.
A significantly large value of DI(P) for a particular test instance indicates that the pattern P is very
relevant to classification, and the corresponding dominant class label may be an attractive candidate
for the test instance label.
In general, one would like to determine patterns that are interesting in terms of absolute presence,
and are also discriminative for a particular class. Therefore, the parameter-pair (α, θ) is defined,
which corresponds to threshold parameters on the edge coherence and class interest ratio.
Definition 9.5.4 A subgraph P is said to be (α, θ)-significant, if it satisfies the following two
edge-coherence and class discrimination constraints:
(a) Edge Coherence Constraint: The edge-coherence C(P) of subgraph P is at least α. In other
words, it is the case that C(P) ≥ α.
(b) Class Discrimination Constraint: The dominant class confidence DI(P) is at least θ. In other
words, it is the case that DI(P) ≥ θ.
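For a small in-memory collection of graphs, the two quantities in Definition 9.5.4 can be computed directly, as in the following sketch. This exhaustive computation is only for illustration; the streaming algorithm of [9] exists precisely to avoid this kind of enumeration by using hash-based synopses.

```python
from collections import Counter

def edge_coherence(pattern, graphs):
    """pattern: set of edges; graphs: list of (edge_set, class_label) pairs.
    C(P) = fraction containing all edges of P / fraction containing at least one."""
    all_edges = sum(1 for edges, _ in graphs if pattern <= edges)
    any_edge = sum(1 for edges, _ in graphs if pattern & edges)
    return all_edges / any_edge if any_edge else 0.0

def dominant_class_confidence(pattern, graphs):
    """DI(P): maximum fraction of any single class among graphs containing P."""
    labels = [label for edges, label in graphs if pattern <= edges]
    if not labels:
        return 0.0
    return Counter(labels).most_common(1)[0][1] / len(labels)

def is_significant(pattern, graphs, alpha, theta):
    return (edge_coherence(pattern, graphs) >= alpha and
            dominant_class_confidence(pattern, graphs) >= theta)
```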
The above constraints are quite challenging because of the size of the search space that needs to be
explored in order to determine relevant patterns. This approach is used, because it is well suited to
massive graphs in which it is important to prune out as many irrelevant patterns as possible. The
edge coherence constraint is designed to prune out many patterns that are abundantly present, but
are not very significant from a relative perspective. This helps in more effective classification.
A probabilistic min-hash approach is used for determining discriminative subgraphs. The min-
hash technique is an elegant probabilistic method, which has been used for the problem of finding
interesting 2-itemsets [32]. This technique cannot be easily adapted to the graph classification prob-
lem, because of the large number of distinct edges in the graph. Therefore, a 2-dimensional compression technique was used, in which a min-hash function is used in combination with a
more straightforward randomized hashing technique. We will see that this combination approach is
extremely effective for graph stream classification.
The aim of constructing this synopsis is to be able to design a continuously updatable data
structure, which can determine the most relevant discriminative subgraphs for classification. Since
the size of the synopsis is small, it can be maintained in main memory and be used at any point
during the arrival of the data stream. The ability to continuously update an in-memory data structure
is a natural and efficient design for the stream scenario. At the same time, this structural synopsis
maintains sufficient information, which is necessary for classification purposes.
For ease in further discussion, a tabular binary representation of the graph data set with N rows
and n columns will be utilized. This table is only conceptually used for description purposes, but it is
not explicitly maintained in the algorithms or synopsis construction process. The N rows correspond
to the N different graphs present in the data. While columns represent the different edges in the data,
this is not a one-to-one mapping. This is because the number of possible edges is so large that it
is necessary to use a uniform random hash function in order to map the edges onto the n columns.
The choice of n depends upon the space available to hold the synopsis effectively, and affects the
quality of the final results obtained. Since many edges are mapped onto the same column by the hash
function, the support counts of subgraph patterns are over-estimated with this approach. Since the
edge-coherence C(P) is represented as a ratio of supports, the edge coherence may either be over-
or under-estimated. The details of this approach are discussed in [9]. A number of other models and
algorithms for graph stream classification have recently been proposed [45,57,72]. The work in [57]
uses discriminative hash kernels for graph stream classification. A method that uses a combination
of hashing and factorization is proposed in [45]. Finally, the work in [72] addresses the problem of
graph stream classification in the presence of imbalanced distributions and noise.
Social streams can be viewed as graph streams that contain a combination of text and structural
information. An example is the Twitter stream, which contains both structural information (in the
form of sender-recipient information), and text in the form of the actual tweet. The problem of
event detection is closely related to that of classification. The main difference is that event labels are
associated with time-instants rather than individual records. A method for performing supervised
event detection in social streams is discussed in [15]. The work in [15] also discusses unsupervised
methods for event detection.
Bibliography
[1] C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.
[2] C. Aggarwal. On density-based transforms for uncertain data mining, Proceedings of the ICDE
Conference, pages 866–875, 2007.
[4] C. Aggarwal and P. Yu. LOCUST: An online analytical processing framework for high dimen-
sional classification of data streams, Proceedings of the ICDE Conference, pages 426–435,
2008.
[5] C. Aggarwal and P. Yu. On classification of high cardinality data streams. Proceedings of the
SDM Conference, pages 802–813, 2010.
[9] C. Aggarwal. On classification of graph streams, Proceedings of the SDM Conference, 2010.
[10] C. Aggarwal. A framework for clustering massive-domain data streams, Proceedings of the
ICDE Conference, pages 102–113, 2009.
[11] C. C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for clustering evolving data streams,
Proceedings of the VLDB Conference, pages 81–92, 2003.
[12] C. C. Aggarwal. On biased reservoir sampling in the presence of stream evolution, Proceedings
of the VLDB Conference, pages 607–618, 2006.
[13] C. C. Aggarwal and P. Yu. On clustering massive text and categorical data streams, Knowledge
and Information Systems, 24(2), pp. 171–196, 2010.
[14] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for classification of evolving data
streams. In IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
[15] C. Aggarwal and K. Subbian. Event detection in social streams, Proceedings of the SDM Con-
ference, 2012.
[18] A. Banerjee and J. Ghosh. Competitive learning mechanisms for scalable, balanced and incre-
mental clustering of streaming texts, Neural Networks, pages 2697–2702, 2003.
[19] A. Bifet, G. Holmes, B. Pfahringer, and E. Frank. Fast perceptron decision tree learning from
evolving data streams. Advances in Knowledge Discovery and Data Mining, pages 299–310,
2010.
[20] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees, Chapman and Hall,
1984.
[21] K. Crammer and Y. Singer. A new family of online algorithms for category ranking, ACM
SIGIR Conference, pages 151–158, 2002.
[22] K. Chai, H. Ng, and H. Chiu. Bayesian online classifiers for text classification and filtering,
ACM SIGIR Conference, pages 97–104, 2002.
[23] P. Clark and T. Niblett. The CN2 induction algorithm, Machine Learning, 3(4): 261–283,
1989.
[24] W. Cohen. Fast effective rule induction, ICML Conference, pages 115–123, 1995.
[25] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. Proceedings
of Conference Empirical Methods in Natural Language Processing, pages 55–63, 1997.
[26] P. Domingos and G. Hulten, Mining high-speed data streams, ACM KDD Conference, pages
71–80, 2000.
[27] W. Fan. Systematic data selection to mining concept drifting data streams, ACM KDD Confer-
ence, pages 128–137, 2004.
[28] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm, Ma-
chine Learning, 37(3):277–296, 1998.
[29] Y. Freund, R. Schapire, Y. Singer, M. Warmuth. Using and combining predictors that spe-
cialize. Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages
334–343, 1997.
[30] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learn-
ing, NIPS Conference, pages 409–415, 2000.
[31] N. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced
data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.
[32] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang,
Finding interesting associations without support pruning, IEEE TKDE, 13(1): 64–78, 2001.
[33] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM
Transactions on Information Systems, 17(2): 141–173, 1999.
[34] G. Cormode and S. Muthukrishnan. An improved data-stream summary: The count-min sketch
and its applications, Journal of Algorithms, 55(1):58–75, 2005.
[35] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization, Proceedings
of EMNLP, pages 55-63, 1997.
[36] C. Domeniconi and D. Gunopulos. Incremental support vector machine construction. ICDM
Conference, pages 589–592, 2001.
[37] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, Graph distances in the data
stream model. SIAM Jour. on Comp., 38(5):1709–1727, 2005.
270 Data Classification: Algorithms and Applications
[38] G. Fung and O. L. Mangasarian. Incremental support vector machine classification. SIAM
Conference on Data Mining, pages 77–86, 2002.
[39] G. P. C. Fung, J. X. Yu, and H. Lu. Classifying text streams in the presence of concept drifts.
Advances in Knowledge Discovery and Data Mining, 3056:373–383, 2004.
[40] J. Gama, R. Fernandes, and R. Rocha. Decision trees for mining data streams. Intelligent Data
Analysis, 10:23–45, 2006.
[41] J. Gao, W. Fan, J. Han, and P. Yu. A general framework for mining concept drifting data stream
with skewed distributions, SDM Conference, 2007.
[42] J. Gao, W. Fan, and J. Han. On appropriate assumptions to mine data streams: Analysis and
practice, ICDM Conference, pages 143–152, 2007.
[43] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree con-
struction, ACM SIGMOD Conference, pages 169–180, 1999.
[44] J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest—A framework for fast decision tree
construction of large datasets, VLDB Conference, pages 416–427, 1998.
[45] T. Guo, L. Chi, and X. Zhu. Graph hashing and factorization for fast graph stream classifica-
tion. ACM CIKM Conference, pages 1607–1612, 2013.
[46] S. Hashemi and Y. Yang. Flexible decision tree for data stream classification in the presence of
concept change, noise and missing values. Data Mining and Knowledge Discovery, 19(1):95–
131, 2009.
[47] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. ACM KDD
Conference, pages 97–106, 2001.
[48] R. Jin and G. Agrawal. Efficient Decision Tree Construction on Streaming Data, ACM KDD
Conference, pages 571–576, 2003.
[49] D. Kalles and T. Morris. Efficient incremental induction of decision trees. Machine Learning,
24(3):231–242, 1996.
[50] T. Joachims. Making large scale SVMs practical. Advances in Kernel Methods, Support Vector
Learning, pp. 169–184, MIT Press, Cambridge, 1998.
[51] T. Joachims. Training linear SVMs in linear time. KDD, pages 217–226, 2006.
[52] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. ICML
Conference, pages 487–494, 2000.
[53] J. Kolter and M. Maloof. Dynamic weighted majority: A new ensemble method for tracking
concept drift, ICDM Conference, pages 123–130, 2003.
[54] Y.-N. Law and C. Zaniolo. An adaptive nearest neighbor classification algorithm for data
streams, PKDD Conference, pages 108–120, 2005.
[55] D. Lewis. The TREC-4 filtering track: description and analysis. Proceedings of TREC-4, 4th
Text Retrieval Conference, pages 165–180, 1995.
[56] D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classi-
fiers. ACM SIGIR Conference, pages 298–306, 1996.
A Survey of Stream Classification Algorithms 271
[57] B. Li, X. Zhu, L. Chi, and C. Zhang. Nested subtree hash kernels for large-scale graph classi-
fication over streams. ICDM Conference, pages 399–408, 2012.
[58] X. Li, P. Yu, B. Liu, and S. K. Ng. Positive-unlabeled learning for data stream classification,
SDM Conference, pages 257–268, 2009.
[59] C. Liang, Y. Zhang, and Q. Song. Decision tree for dynamic and uncertain data streams. 2nd
Asian Conference on Machine Learning, volume 3, pages 209–224, 2010.
[60] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning, 2: pages 285–318, 1988.
[61] B. Liu, W. Hsu, and Y. Ma, Integrating classification and association rule mining, ACM KDD
Conference, pages 80–86, 1998.
[62] S. Pan, Y. Zhang, and X. Li. Dynamic classifier ensemble for positive unlabeled text stream
classification. Knowledge and Information Systems, 33(2):267–287, 2012.
[63] J. R. Quinlan. C4.5: Programs in Machine Learning, Morgan-Kaufmann Inc, 1993.
[65] M. Masud, Q. Chen, L. Khan, C. Aggarwal, J. Gao, J. Han, A. Srivastava, and N. Oza. Clas-
sification and adaptive novel class detection of feature-evolving data streams, IEEE Trans-
actions on Knowledge and Data Engineering, 25(7):1484–1487, 2013. Available at http:
//doi.ieeecomputersociety.org/10.1109/TKDE.2012.109
[66] M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham: A practical approach to classify
evolving data streams: Training with limited amount of labeled data. ICDM Conference, pages
929–934, 2008.
[67] M. Markou and S. Singh. Novelty detection: A review, Part 1: Statistical approaches, Signal
Processing, 83(12):2481–2497, 2003.
[68] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining, EDBT
Conference, pages 18–32, 1996.
[69] L. Minku, A. White, and X. Yao. The impact of diversity on online ensemble learning
in the presence of concept drift, IEEE Transactions on Knowledge and Data Engineering,
22(6):730–742, 2010.
[70] H. T. Ng, W. B. Goh, and K. L. Low. Feature selection, perceptron learning, and a usability
case study for text categorization. SIGIR Conference, pages 67–73, 1997.
[71] N. Oza and S. Russell. Online bagging and boosting. In Eighth Int. Workshop on Artificial
Intelligence and Statistics, pages 105–112. Morgan Kaufmann, 2001.
[72] S. Pan and X. Zhu. Graph classification with imbalanced class distributions and noise. AAAI
Conference, pages 1586–1592, 2013.
[73] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[74] L. Ralaivola and F. d’Alché-Buc. Incremental support vector machine learning: A local ap-
proach. Artificial Neural Network, pages 322–330, 2001.
272 Data Classification: Algorithms and Applications
[75] F. Rosenblatt. The perceptron: A probabilistic model for information and storage organization
in the brain, Psychological Review, 65: pages 386–407, 1958.
[76] S. Ruping. Incremental learning with support vector machines. IEEE ICDM Conference, pp.
641–642, 2001.
[77] T. Salles, L. Rocha, G. Pappa, G. Mourao, W. Meira Jr., and M. Goncalves. Temporally-aware
algorithms for document classification. ACM SIGIR Conference, pages 307–314, 2010.
[78] J. Schlimmer and D. Fisher. A case study of incremental concept induction. Proceedings of
the Fifth National Conference on Artificial Intelligence, pages 495–501. Morgan Kaufmann,
1986.
[79] H. Schutze, D. Hull, and J. Pedersen. A comparison of classifiers and document representations
for the routing problem. ACM SIGIR Conference, pages 229–237, 1995.
[80] A. Shilton, M. Palaniswami, D. Ralph, and A. Tsoi. Incremental training of support vector
machines. IEEE Transactions on Neural Networks, 16(1):114–131, 2005.
[81] W. N. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale classification.
ACM KDD Conference, pages 377–382, 2001.
[82] N. Syed, H. Liu, and K. Sung. Handling concept drifts in incremental learning with support
vector machines. ACM KDD Conference, pages 317–321, 1999.
[83] P. Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161–186, 1989.
[84] P. Utgoff and C. Brodley. An incremental method for finding multivariate splits for decision
trees. Proceedings of the Seventh International Conference on Machine Learning, pages 58–
65, 1990.
[85] P. Utgoff. An improved algorithm for incremental induction of decision trees. Proceedings of
the Eleventh International Conference on Machine Learning, pages 318–325, 1994.
[86] J. Vitter. Random sampling with a reservoir, ACM Transactions on Mathematical Software
(TOMS), 11(1):37–57, 1985.
[87] H. Wang, W. Fan, P. Yu, J. Han. Mining concept-drifting data streams with ensemble classi-
fiers, KDD Conference, pages 226–235, 2003.
[88] H. Wang, J. Yin, J. Pei, P. Yu, and J. X. Yu. Suppressing model overfitting in concept drifting
data streams, ACM KDD Conference, pages 736–741, 2006.
[89] E. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting,
SDAIR, pages 317–332, 1995.
[90] Y. Yang, X. Wu, and X. Zhu. Combining proactive and reactive predictions for data streams,
ACM KDD Conference, pages 710–715, 2005.
[91] K. L. Yu and W. Lam. A new on-line learning algorithm for adaptive text filtering, ACM CIKM
Conference, pages 156–160, 1998.
[92] S. Dzeroski and B. Zenko. Is combining classifiers better than selecting the best one? Machine
Learning, 54(3):255–273, 2004.
[93] Y. Zhang, X. Li, and M. Orlowska. One class classification of text streams with concept drift,
ICDMW Workshop, pages 116–125, 2008.
A Survey of Stream Classification Algorithms 273
[94] J. Zhang, Z. Ghahramani, and Y. Yang. A probabilistic model for online document clustering
with application to novelty detection, NIPS, 2005.
[95] P. Zhang, Y. Zhu, and Y. Shi. Categorizing and mining concept drifting data streams, ACM
KDD Conference, pages 812–820, 2008.
[96] X. Zhu, X. Wu, and Y. Zhang. Dynamic classifier selection for effective mining from noisy
data streams, ICDM Conference, pages 305–312, 2004.
[97] http://www.itl.nist.gov/iad/mig/tests/tdt/tasks/fsd.html
This page intentionally left blank
Chapter 10
Big Data Classification
Hanghang Tong
City College
City University of New York
New York, New York
[email protected]
10.1 Introduction
We are in the age of ‘big data.’ Big Data has the potential to revolutionize many scientific
disciplines, ranging from astronomy, biology, and education, to economics, social science, etc [16].
From an algorithmic point of view, the main challenges that ‘big data’ brings to classification can
be summarized by three characteristics, that are often referred to as the “three Vs,” namely volume,
variety, and velocity. The first characteristic is volume, which corresponds to the fact that the data
is being generated at unprecedented scale. For example, it is estimated that [16] there were more
than 13 exabytes of new data stored by enterprises and users in 2010. The second characteristic is
variety, according to which real data is often heterogeneous, comprising multiple different types
and/or coming from different sources. The third characteristic is velocity: the data is not only large and
complex, but is also generated at a very high rate.
These new characteristics of big data have brought new challenges and opportunities for clas-
sification algorithms. For example, to address the challenge of velocity, on-line streaming classifi-
cation algorithms have been proposed (please refer to the previous chapter for details); in order to
address the challenge of variety, so-called heterogeneous machine learning has emerged,
including multi-view classification for data heterogeneity, transfer learning and multi-task classifi-
cation for classification task heterogeneity, multi-instance learning for instance heterogeneity, and
classification with crowd-sourcing for oracle heterogeneity; in order to address the challenge of
volume, many efforts have been devoted to scaling up classification algorithms.
In this chapter, we will summarize some recent efforts in classification algorithms to address the
scalability issue in response to the volume challenge in big data. For discussion on the streaming
scenario, we refer to the previous chapter. We will start by introducing how to scale up classifica-
tion algorithms on a single machine. Then we will introduce how to further scale up classification
algorithms by parallelism.
10.2.1 Background
Given a training set (x_i, y_i), i = 1, ..., n, where x_i is a d-dimensional feature vector and y_i = ±1
is the class label, linear support vector machines (SVMs) aim to find a classifier of the form
h_{w,b}(x) = sign(w^T x + b), where w and b are the parameters.
By introducing a dummy feature of constant value 1, we can ignore the parameter b. Then, the
parameter w can be learnt from the training set by solving the following optimization problem:
    min_{w, ξ_i ≥ 0}  (1/2) w^T w + (C/n) ∑_{i=1}^{n} ξ_i
    s.t. ∀ i ∈ {1, ..., n}: y_i (w^T x_i) ≥ 1 − ξ_i.    (10.1)
Traditionally, this optimization problem can be solved in its dual form, which can be solved in
turn by quadratic programming, e.g., SMO [23], LIBSVM [3], SVMLight [18], etc. All of these
methods scale linearly with respect to the dimensionality d of the feature vector. However, they
usually require super-linear time in the number of training examples n. On the other hand, many
algorithms have been proposed in the past decade to speed up the training process of SVMs and to
achieve linear scalability with respect to the number of training examples n, such as the interior-
point based approach [7], the proximal-based method [9], the Newton-based approach [21], etc.
However, all these methods still scale quadratically with the dimensionality d of the feature space.
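To make the primal formulation in Equation (10.1) concrete, the following short Python/NumPy sketch evaluates its objective for a fixed w, using the fact that for a fixed w the smallest feasible slack is ξ_i = max{0, 1 − y_i w^T x_i}; the toy data, the value of C, and the candidate weight vector are made up purely for illustration.

import numpy as np

# Toy training set: each row of X is a feature vector x_i; a constant dummy
# feature of value 1 absorbs the bias term b. Labels y_i are in {-1, +1}.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  0.5, 1.0],
              [-1.0, -1.5, 1.0],
              [-2.0, -0.5, 1.0]])
y = np.array([1, 1, -1, -1])
C = 1.0

def primal_objective(w, X, y, C):
    # For a fixed w, the optimal slack of each example is its hinge loss.
    xi = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * (w @ w) + (C / len(y)) * xi.sum()

w = np.array([0.5, 0.5, 0.0])   # an arbitrary candidate weight vector
print(primal_objective(w, X, y, C))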
10.2.2 SVMPerf
For many big data applications, it is often the case that we have a large number of training ex-
amples n and a high dimensional feature space d. For example, for the text classification problem,
we could have millions of training examples and hundreds of thousands of features. For such appli-
cations, it is highly desirable to have a classification algorithm that scales well with respect to both
n and d.
To this end, SVMPerf was proposed in [20], whose time complexity is O(ns), where s is the average
number of non-zero feature values per training example. Note that such a time complexity is very attractive
for many real applications where the features are sparse (small s) even though the dimensionality d
is high.
The key idea of SVMPerf is to re-formulate the original SVM in Equation (10.1) as the following
structural SVM:
    min_{w, ξ ≥ 0}  (1/2) w^T w + C ξ
    s.t. ∀ c ∈ {0, 1}^n:  (1/n) w^T ∑_{i=1}^{n} (c_i y_i x_i)  ≥  (1/n) ∑_{i=1}^{n} c_i − ξ.    (10.2)
Compared with the original formulation in Equation (10.1), the above formulation has a much
simpler objective function, where we only have a single slack variable ξ. On the other hand, there
are 2n constraints in the structural SVM, each of which corresponds to the sum of a subset of the
constraints in the original formulation and such a subset is specified by the vector c = (c1 , ..., cn ) ∈
{0, 1}n. Thus, the reduction in the number of slack variables is achieved at the expense of having
many more constraints. So how is this better?
Despite the fact that we have an exponential number of constraints in Equation (10.2), it turns
out that we can resort to the cutting-plane algorithm to solve it efficiently. The basic idea is as follows.
Instead of working with the entire set of 2^n constraints directly, we keep an active working set of
constraints W. The algorithm starts with an empty constraint set W, and then iteratively
(1) solves Equation (10.2) by considering only the constraints in the current set W; and (2)
expands the current working set W of constraints, based on the current classifier, until a pre-defined
precision is reached.
The key component of this algorithm is the approach for expanding the current working set W.
The work in [20] suggests adding the constraint in Equation (10.2) that requires the biggest ξ to
make it feasible. To be specific, if w is the currently learnt weight vector, we define c as c_i = 1 if
y_i (w^T x_i) < 1, and c_i = 0 otherwise. Then, we add this constraint c to the current working set W.
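The working-set expansion can be sketched in a few lines of Python/NumPy. The routine below constructs the most violated constraint c for the current w; the surrounding cutting-plane loop is indicated only in comments, because the inner quadratic program over the working set W is not spelled out here (the solve_qp routine is a hypothetical placeholder, not part of SVMPerf's actual interface).

import numpy as np

def most_violated_constraint(w, X, y):
    # c_i = 1 exactly when example i violates the margin, i.e., y_i * w.x_i < 1.
    c = (y * (X @ w) < 1.0).astype(float)
    n = len(y)
    # Slack this constraint requires: (1/n) sum_i c_i  -  (1/n) w . sum_i c_i y_i x_i
    xi_needed = c.mean() - (w @ (X.T @ (c * y))) / n
    return c, max(0.0, xi_needed)

# Sketch of the outer cutting-plane loop:
# W = []                                   # working set of constraint vectors c
# while True:
#     w, xi = solve_qp(W, X, y, C)         # hypothetical QP solver over the current W
#     c, xi_needed = most_violated_constraint(w, X, y)
#     if xi_needed <= xi + epsilon:        # pre-defined precision reached
#         break
#     W.append(c)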
Note that the above algorithm is designed for discrete binary classification. The work in [20]
further generalizes it to ordinal regression for the task of ranking (as opposed to binary classifica-
tion), where y_i ∈ {1, ..., R} indicates an ordinal scale. The algorithm for this case has a complexity
of O(sn log(n)).
In SVMPerf, there is a parameter ε that controls the accuracy of the iterative process. The overall
time complexity is also related to this parameter. To be specific, the number of required iterations in
SVMPerf is O(1/ε^2). This yields an overall complexity of O(ns/ε^2).
10.2.3 Pegasos
Note that we still have a linear dependence on the number of training examples, O(n), in
SVMPerf. For big data applications, this term is often on the order of millions or even
billions. In order to further speed up the training process, Pegasos was proposed in [25], based on a
stochastic sub-gradient descent method.
Unlike SVMPerf and many other methods, which formulate the SVM as a constrained optimization
problem, Pegasos relies on the following unconstrained optimization formulation of the SVM:
    min_w  (λ/2) w^T w + (1/n) ∑_{i=1}^{n} l(w, (x_i, y_i))
    where l(w, (x_i, y_i)) = max{0, 1 − y_i w^T x_i}.    (10.3)
Basic Pegasos. Pegasos is an online algorithm. In other words, in each iteration, it aims to
update the weight vector w by using only one randomly sampled example x_{i_t}, through the following
approximate objective function:
    f(w; i_t) = (λ/2) w^T w + l(w, (x_{i_t}, y_{i_t}))    (10.4)
where t = 1, 2, ... is the iteration number and i_t ∈ {1, ..., n} is the index of the training example that
is sampled in the t-th iteration.
Note that f(w; i_t) has one non-differentiable point. Therefore, we use a sub-gradient instead to
update the weight vector w as
    w_{t+1} = w_t − η_t ∇_t.    (10.5)
In the above update, ∇_t is a sub-gradient of f(w; i_t) in the t-th iteration: ∇_t = λ w_t −
1[y_{i_t} w_t^T x_{i_t} < 1] y_{i_t} x_{i_t}, where 1[·] is the indicator function. In Equation (10.5), the learning rate η_t is
set as η_t = 1/(λt). Finally, the training pair (x_{i_t}, y_{i_t}) is chosen uniformly at random.
The authors in [25] also introduced an optional projection step to limit the admissible solution of
w to a ball of radius 1/√λ. In other words, after updating w using Equation (10.5), we further update
it as follows:
    w_{t+1} ← min{1, (1/√λ) / ‖w_{t+1}‖} · w_{t+1}.
It turns out that the time complexity of the above algorithm is O(s/(λε)). Compared with the
complexity of SVMPerf, we can see that it is independent of the number of training examples n.
This makes it especially suitable for large training data set sizes.
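The basic update of Equations (10.4)-(10.5), together with the optional projection step, can be sketched as follows in Python/NumPy; the dense data layout and the default values of λ and T are illustrative assumptions rather than part of the original algorithm description.

import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=0, project=True):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                  # sample i_t uniformly at random
        eta = 1.0 / (lam * t)                # learning rate eta_t = 1 / (lambda * t)
        if y[i] * (w @ X[i]) < 1:            # indicator term of the sub-gradient
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
        if project:                          # optional projection onto the ball of radius 1/sqrt(lambda)
            norm = np.linalg.norm(w)
            if norm > 0:
                w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)
    return w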
Mini-Batch Pegasos. The authors in [25] also proposed several variants of the above basic ver-
sion of the Pegasos algorithm. The first variant is the so-called Mini-Batch update, where in each
iteration we use k training examples (as opposed to a single example) in order to update the weight
vector w:
    f(w; A_t) = (λ/2) w^T w + (1/k) ∑_{i ∈ A_t} l(w, (x_i, y_i))    (10.6)
where 1 ≤ k ≤ n, A_t ⊂ {1, ..., n}, and |A_t| = k. In this Mini-Batch mode, in order to update the
weight vector w_t at each iteration t, we first need to find out which of the k selected examples
violate the margin rule based on the current weight vector w_t: A_t^+ = {i ∈ A_t : y_i w_t^T x_i < 1}. Then, we
leverage the training examples in A_t^+ to update the weight vector w:
    w_{t+1} ← (1 − 1/t) w_t + (1/(λtk)) ∑_{i ∈ A_t^+} y_i x_i.
Kernelized Pegasos. The second variant is kernelized Pegasos, which can address cases
where the decision boundary is nonlinear. This variant may be formulated as the following opti-
mization problem:
    min_w  (λ/2) w^T w + (1/n) ∑_{i=1}^{n} l(w, (x_i, y_i))
    where l(w, (x_i, y_i)) = max{0, 1 − y_i w^T φ(x_i)}    (10.7)
where the mapping φ(·) is implicitly specified by the kernel operator K(x_1, x_2) = φ(x_1)^T φ(x_2). The
authors in [25] show that we can still solve the above optimization problem in its primal form using
stochastic gradient descent, whose iteration count is still O(s/(λε)). Nonetheless, in the kernelized
case, we might need min(t, n) kernel evaluations in the t-th iteration. This makes the overall com-
plexity O(sn/(λε)).
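A sketch of the kernelized variant is given below. Instead of storing w explicitly, it maintains a count α_j of how many times example j was sampled while violating the margin, so that the implicit weight vector at step t is (1/(λt)) ∑_j α_j y_j φ(x_j). The RBF kernel in the usage comment and its bandwidth are arbitrary choices for illustration.

import numpy as np

def kernelized_pegasos(X, y, kernel, lam=0.1, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # Decision value of the implicit w on x_i; at most min(t, n) kernel evaluations.
        score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n) if alpha[j] > 0)
        if y[i] * score / (lam * t) < 1:
            alpha[i] += 1
    return alpha

# Example usage with an RBF kernel (bandwidth chosen arbitrarily):
# rbf = lambda a, b, gamma=0.5: np.exp(-gamma * np.sum((a - b) ** 2))
# alpha = kernelized_pegasos(X, y, rbf)
# A new point x is then classified as sign(sum_j alpha[j] * y[j] * rbf(X[j], x)).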
Many of the above formulations are instances of the general regularized risk minimization problem
min_w R_emp(w) + λ Ω(w), where the second term is the regularization term for the parameter w, which is often chosen to
be a smooth, convex function, e.g., Ω(w) = (1/2) w^T w in the case of SVMs and regression, and λ > 0
is the regularization weight. The first term R_emp can often be written as a summation
over all the training examples, R_emp = (1/n) ∑_{i=1}^{n} l(x_i, y_i, w), where l(x_i, y_i, w) is the loss function over
the i-th training pair (x_i, y_i) (e.g., the hinge loss in the SVM). The challenge of regularized risk
minimization problems mostly comes from R_emp, as (1) it involves all the training examples, so that
the computation of this term is costly for large-scale problems; and (2) R_emp itself is often not
smooth over the entire parameter space.
Convex Bundle Methods. To address these challenges, the basic idea of the bundle method [26] is
to iteratively approximate R_emp by its first-order Taylor expansion, and then update the parameter
w by solving such an approximation. Note that R_emp itself may not be differentiable (i.e., not smooth).
In this case, we use the so-called sub-differential to do the Taylor expansion. We summarize the
basic ideas below.
Recall that µ is called a subgradient of a convex, finite function F at a given point w if for any w̃,
we have F(w̃) ≥ F(w) + ⟨w̃ − w, µ⟩. The subdifferential is the set of all subgradients. Therefore,
if we approximate R_emp by the first-order Taylor expansion at the current estimate of the parameter,
w_t, using its subdifferential, each approximation by one of its subgradients provides a lower bound
of R_emp.
In the convex bundle method, the goal is to minimize the maximum of all such lower bounds
(plus the regularization term) to update the parameter w. The convergence analysis shows that this
method converges in O(1/ε) iterations for a general convex function. Its convergence rate can be
further improved to O(log(1/ε)) if R_emp is continuously differentiable.
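The under-estimator property that drives the bundle method can be verified numerically. The sketch below computes R_emp for the hinge loss on synthetic data, forms its first-order (subgradient) Taylor approximation at a point w, and checks that this approximation lower-bounds R_emp at other points; the data and the tolerance are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.sign(rng.normal(size=50))

def remp(w):
    # Empirical risk with the hinge loss: (1/n) sum_i max(0, 1 - y_i w.x_i).
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def remp_subgradient(w):
    # One subgradient of remp at w (the hinge loss is not differentiable at its kink).
    active = (y * (X @ w) < 1.0).astype(float)
    return -(X * (active * y)[:, None]).mean(axis=0)

w = rng.normal(size=5)
g = remp_subgradient(w)
for _ in range(5):
    w_tilde = rng.normal(size=5)
    taylor = remp(w) + (w_tilde - w) @ g     # first-order approximation built at w
    assert remp(w_tilde) >= taylor - 1e-12   # convexity => the approximation is a lower bound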
Non-Convex Bundle Methods. Note that in [26], we require that the function Remp be convex.
The reason is that if the function Remp is convex, its Taylor approximation at the current estimation
always provides a lower bound (i.e., under-estimator) of Remp . While empirical risk function Remp
for some classification problems (e.g., standard SVM, logistic regression, etc.) is indeed convex,
this is not the case for some other more complicated classification problems, including the so-called
transductive SVM (TSVM) [4, 19], ramp loss SVM [29], convolutional nets [17], etc.
A classic technique for such non-convex optimization problems is to use convex relaxation, so
that we can transform the original non-convex function into a convex one. Nonetheless, it is not
an easy task to find such a transformation. In [5], the authors generalize the standard bundle methods
to handle such non-convex functions. The overall procedure is similar to that in [26], in the sense
that we still repeatedly use the first-order Taylor expansion of R_emp at the current estimate
to approximate the true R_emp. The difference is that we need to offset the Taylor approximation
so that the approximation function is still an under-estimator of the true empirical loss function.
Note that in the non-convex case, it is not clear if we can still have O(1/ε) or a similar convergence
rate.
where Q(i, j) = K(x_i, x_j) and α ∈ R^n. The final classifier is defined by y = sign(∑_{i=1}^{n} y_i α_i K(x, x_i) + b).
Bayesian Committee Machine. In [27], the authors proposed first partitioning the training set
into a few subsets, and then training a separate SVM on each subset independently. In the test stage,
the final prediction is a specific combination of the predictions from the individual SVMs.
Here, the key lies in the method for combining these prediction results. In the Bayesian Committee
Machine, the authors suggested a combination scheme that weights the prediction of each individual SVM
based on its predictive variance on the test data.
Cascade SVM. In [13], the authors propose the following procedure to solve the QP in (10.9).
Its basic idea is to 'chunk' the data and eliminate non-support vectors as early as possible. To be
specific, we first partition the entire training set into small chunks, and then train a separate SVM
for each chunk (layer 1); the output of each chunk is a (hopefully) small number of support vectors.
These support vectors are treated as the new training set, and we repeat the above process: partition
the support vectors from layer 1 into chunks, train a separate SVM for each chunk, and output the
new support vectors from each chunk. This process is repeated until only a single chunk remains.
The output (i.e., the support vectors) from that chunk is fed back to layer 1 to
test for global convergence.
It can be seen that with this approach, each layer except the first layer will only operate on a
small number of training examples (i.e., the support vectors from the previous layers); and for each
chunk at a given layer, we can train an SVM by solving its corresponding QP independently and
thus it can be parallelized easily.
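A minimal sketch of a single cascade pass is shown below using scikit-learn's SVC (an implementation choice of this sketch, not of the original work [13]); it uses just two layers, one SVM per chunk followed by one SVM on the union of the surviving support vectors, and omits the feedback loop that tests global convergence.

import numpy as np
from sklearn.svm import SVC

def cascade_svm_one_pass(X, y, n_chunks=4, C=1.0):
    idx = np.array_split(np.random.permutation(len(y)), n_chunks)
    sv_X, sv_y = [], []
    for chunk in idx:                            # layer 1: independent, parallelizable SVMs
        clf = SVC(kernel="rbf", C=C).fit(X[chunk], y[chunk])
        sv_X.append(X[chunk][clf.support_])      # only the support vectors survive
        sv_y.append(y[chunk][clf.support_])
    X2, y2 = np.vstack(sv_X), np.concatenate(sv_y)
    return SVC(kernel="rbf", C=C).fit(X2, y2)    # final layer on the merged support vectors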
Parallel IPM-SVM. On a single machine, one of the most powerful methods for solving the QP in
Equation (10.9) is the primal-dual Interior-Point Method (IPM) [8]. Please refer to [1] for
the details of using IPM to solve the QP problem in SVMs. The key and most expensive step in IPM
involves the multiplication of a matrix inverse Σ with a vector q, i.e., Σ^{-1} q, where Σ = Q + D and
D is some diagonal matrix. It is very expensive (O(n^3)) to compute Σ^{-1} q directly.
However, if we can approximate Q by a low-rank factorization, we can largely speed up
this step. To be specific, if Q = H H^T, where H is a rank-k (k ≪ n) matrix (e.g., obtained by the incomplete
Cholesky decomposition), then by the Sherman-Morrison-Woodbury theorem, we have
    Σ^{-1} q = (D + H H^T)^{-1} q
             = D^{-1} q − D^{-1} H (I + H^T D^{-1} H)^{-1} H^T D^{-1} q    (10.10)
In other words, instead of computing the inverse of an n × n matrix, we only need to compute the
inverse of a k × k matrix. In [1], the authors further propose to parallelize this step, as well as the
factorization of the original Q matrix by the incomplete Cholesky decomposition.
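Equation (10.10) is easy to check numerically. In the NumPy sketch below, D is a random diagonal matrix, H is a random n × k factor standing in for an incomplete Cholesky factor, and the Sherman-Morrison-Woodbury route (which only solves a k × k system) is compared against the direct solve; the sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 20
H = rng.normal(size=(n, k))            # low-rank factor, with Q approximated by H H^T
d = rng.uniform(1.0, 2.0, size=n)      # diagonal entries of D
q = rng.normal(size=n)

# Direct route: solve with the full n x n matrix Sigma = D + H H^T.
direct = np.linalg.solve(np.diag(d) + H @ H.T, q)

# Sherman-Morrison-Woodbury route: only a k x k system is solved.
Dinv_q = q / d
Dinv_H = H / d[:, None]
small = np.eye(k) + H.T @ Dinv_H
smw = Dinv_q - Dinv_H @ np.linalg.solve(small, H.T @ Dinv_q)

assert np.allclose(direct, smw)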
10.3.3 MRM-ML
A recent, remarkable effort to parallelize classification algorithms (and machine learning algorithms
in general) was made in [2]. Unlike most of the previous work, which aims to speed up a single
machine learning algorithm, in that work the authors identified a family of machine learning algo-
rithms that can be easily sped up in the parallel computation setting. Specifically, they showed that
any algorithm that fits the so-called statistical query model can be parallelized easily in a
multi-core MapReduce environment.
The key is that such an algorithm can be re-formulated in a certain "summation form." The clas-
sification algorithms that fall into this category include linear regression, locally weighted linear re-
gression, logistic regression, naive Bayes, SVMs with the quadratic loss, etc. For all these algorithms
(and many more unsupervised learning algorithms), they achieve a linear speedup with respect to
the number of processors in the cluster.
Let us use linear regression to explain the basic idea. In linear regression, we look for a pa-
rameter vector θ that minimizes ∑_{i=1}^{n} (θ^T x_i − y_i)^2. If we stack all the feature vectors x_i (i = 1, ..., n)
into an n × d matrix X, and all the labels y_i (i = 1, ..., n) into an n × 1 vector y, then in the most sim-
ple case (i.e., without any regularization terms) we can solve for θ as θ = (X^T X)^{-1} X^T y. In the case where
we have a large number of training examples, the main computational bottleneck is to compute
A = X^T X and b = X^T y. Each of these two terms can be re-written in "summation" form as fol-
lows: A = ∑_{i=1}^{n} x_i x_i^T and b = ∑_{i=1}^{n} x_i y_i. Note that the summation is taken over the different training
examples. Therefore, the data can be divided into equal-size partitions, each of
which can be processed by a mapper job, and the final summation can be done by a reducer.
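The summation form lends itself to a very small simulation of the map and reduce phases. In the Python/NumPy sketch below, each "mapper" computes the partial sums of A and b over its partition and a "reducer" adds them and solves for θ; the synthetic data and the number of partitions are arbitrary.

import numpy as np

def mapper(X_part, y_part):
    # Partial sums over one partition: sum_i x_i x_i^T and sum_i x_i y_i.
    return X_part.T @ X_part, X_part.T @ y_part

def reducer(partials):
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)       # theta = (X^T X)^{-1} X^T y

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.01 * rng.normal(size=10000)
partials = [mapper(Xp, yp) for Xp, yp in zip(np.array_split(X, 4), np.array_split(y, 4))]
print(reducer(partials))               # close to theta_true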
Another example that fits in this family is logistic regression. Recall that in logistic regression
the classifier has the form h_θ(x) = g(θ^T x) = 1/(1 + exp(−θ^T x)). Here, the weight vector θ can be learnt
in a couple of different ways, one of which is the Newton-Raphson approach: after
some initialization, we iteratively update the weight vector θ by θ ← θ − H^{-1} ∇_θ. For a
large training data set, the computational bottleneck is to compute the Hessian matrix H and the
gradient ∇_θ. As in the linear regression case, both terms can be re-written in "summation"
form at each Newton-Raphson step: the Hessian is accumulated over the training examples as
H(i, j) := H(i, j) + h_θ(x^{(t)}) (h_θ(x^{(t)}) − 1) x_i^{(t)} x_j^{(t)}, and the gradient is
∇_θ = ∑_{t=1}^{n} (y^{(t)} − h_θ(x^{(t)})) x^{(t)}. Therefore, if we divide the training data
into equal-size partitions, both terms can be parallelized: each mapper performs the summation over
its partition and the final summation is done by a reducer.
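The same pattern applies to one Newton-Raphson step for logistic regression, sketched below with labels assumed to be in {0, 1}; each mapper returns the partial Hessian and gradient over its partition, and the reducer combines them and applies the update θ ← θ − H^{-1} ∇_θ.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mapper(X_part, y_part, theta):
    h = sigmoid(X_part @ theta)
    grad = X_part.T @ (y_part - h)                          # sum_t (y_t - h_theta(x_t)) x_t
    hess = (X_part * (h * (h - 1.0))[:, None]).T @ X_part   # sum_t h_theta(x_t)(h_theta(x_t) - 1) x_t x_t^T
    return hess, grad

def newton_step(theta, partials):
    hess = sum(p[0] for p in partials)
    grad = sum(p[1] for p in partials)
    return theta - np.linalg.solve(hess, grad)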
10.3.4 SystemML
MRM-ML answers the question of what kind of classification (or machine learning in general)
algorithms can be parallelized in a MapReduce environment. However, it is still costly and even
prohibitive to implement a large class of machine learning algorithms as low-level MapReduce
jobs. To be specific, in MRM-ML, we still need to hand-code each individual MapReduce job for a
given classification algorithm. What is more, in order to improve the performance, the programmer
still needs to manually schedule the execution plan based on the size of the input data set as well as
the size of the MapReduce cluster.
To address these issues, SystemML was proposed in [12]. The main advantage of SystemML is
that it provides a high-level language called Declarative Machine Learning Language (DML). DML
encodes two common and important building blocks shared by many machine learning algorithms,
namely linear algebra and iterative optimization. Thus, it frees the programmer from lower-level
implementation details. By automatic program analysis, SystemML further breaks an input DML
script into smaller so-called statement blocks. In this way, it also frees users from manually
scheduling the execution plan.
Let us illustrate this functionality with matrix-matrix multiplication, a common operation
in many machine learning algorithms. To be specific, if we partition the two input matrices (A
and B) into blocks, the multiplication between them can be written in block form as
C_{i,j} = ∑_k A_{i,k} B_{k,j}, where C_{i,j} is the corresponding block of the resulting matrix C. In a parallel
environment, we have different choices for implementing such a block-based matrix multiplication. In
the first strategy (referred to as "replication based matrix multiplication" in [12]), we only need one
single map-reduce job, but some matrix (like A) might be replicated and sent to multiple reducers.
In contrast, in the second strategy (referred to as "cross product based matrix multiplication"
in [12]), each block of the two input matrices is sent to only a single reducer, at the cost of
an additional map-reduce job. Which strategy we should choose is highly dependent on the
characteristics of the two input matrices. For example, if the number of rows in A is much smaller
than its number of columns (say, A is a topic-document matrix), the replication-based strategy might be much
more efficient than the cross-product-based strategy. As another example, in many iterative matrix-
based machine learning algorithms (e.g., NMF), we might need the product of three input
matrices, ABC. Here, the ordering of such a matrix multiplication (e.g., (AB)C vs. A(BC)) also
depends on the characteristics (i.e., the sizes) of the input matrices.
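The block formulation itself is independent of the execution strategy and can be written down directly. The small Python/NumPy sketch below (which is not SystemML's DML, just an illustration) computes C_{i,j} = ∑_k A_{i,k} B_{k,j} from dictionaries of blocks and checks the result against the unblocked product. In a MapReduce setting, each product A_{i,k} B_{k,j} could be handled by a different worker and the per-(i, j) partial sums combined by a reducer.

import numpy as np
from collections import defaultdict

def block_matmul(A_blocks, B_blocks):
    # A_blocks maps (i, k) -> sub-matrix A_{i,k}; B_blocks maps (k, j) -> B_{k,j}.
    C = defaultdict(lambda: 0)
    for (i, k), A_ik in A_blocks.items():
        for (k2, j), B_kj in B_blocks.items():
            if k == k2:
                C[(i, j)] = C[(i, j)] + A_ik @ B_kj
    return dict(C)

# Sanity check against the unblocked product on a small example.
A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)[::-1]
A_blocks = {(i, k): A[2*i:2*i+2, 2*k:2*k+2] for i in range(2) for k in range(2)}
B_blocks = {(k, j): B[2*k:2*k+2, 2*j:2*j+2] for k in range(2) for j in range(2)}
C = block_matmul(A_blocks, B_blocks)
assert np.allclose(np.block([[C[(0, 0)], C[(0, 1)]], [C[(1, 0)], C[(1, 1)]]]), A @ B)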
SystemML covers a large family of machine learning algorithms, including linear models, ma-
trix factorization, principal component analysis, PageRank, etc. Its empirical evaluations show that
it scales to very large data sets and that its performance is comparable to that of hand-coded implementa-
tions.
10.4 Conclusion
Given that (1) the data size keeps growing explosively and (2) the intrinsic complexity of a
classification algorithm is often high, scalability seems to be a 'never-ending' challenge in clas-
sification. In this chapter, we have briefly reviewed two basic approaches to speeding up and scaling up
a classification algorithm: (a) scaling up classification algorithms on a single
machine by carefully solving their associated optimization problems; and (b) further scaling up
classification algorithms by parallelism. A future trend in big data classification is to address
scalability in conjunction with the other challenges of big data (e.g., variety, velocity, etc.).
Bibliography
[1] Edward Y. Chang, Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, and Hang Cui.
PSVM: Parallelizing support vector machines on distributed computers. In NIPS, 2007.
[2] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng,
and Kunle Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages 281–288,
2006.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines, 2001.
[4] Ronan Collobert, Fabian H. Sinz, Jason Weston, and Léon Bottou. Trading convexity for
scalability. In ICML, pages 201–208, 2006.
[5] Trinh-Minh-Tri Do and Thierry Artieres. Regularized bundle methods for convex and non-
convex risks. Journal of Machine Learning Research, 13:3539–3583, 2012.
[6] Pedro Domingos and Geoff Hulten, Mining high-speed data streams, ACM KDD Conference,
pages 71–80, 2000.
[7] Michael C. Ferris and Todd S. Munson. Interior-point methods for massive support vector
machines. SIAM Journal on Optimization, 13(3):783–804, 2002.
[8] Katsuki Fujisawa. The implementation of the primal-dual interior-point method for the
semidefinite programs and its engineering applications (Unpublished manuscript), 1992.
[9] Glenn Fung and Olvi L. Mangasarian. Proximal support vector machine classifiers. In KDD,
pages 77–86, 2001.
[10] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT: Opti-
mistic Decision Tree Construction, ACM SIGMOD Conference, 1999.
[11] Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RainForest—A framework for
fast decision tree construction of large datasets, VLDB Conference, pages 416–427, 1998.
[12] Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas
Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. Systemml:
Declarative machine learning on Map-Reduce. In ICDE, pages 231–242, 2011.
[13] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Durdanovic, and Vladimir Vapnik. Parallel
support vector machines: The cascade svm. In NIPS, 2004.
[14] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. ACM
KDD Conference, 2001.
[15] Ruoming Jin and Gagan Agrawal. Efficient Decision Tree Construction on Streaming Data,
ACM KDD Conference, 2003.
[16] Community white paper. Challenges and opportunities with big data. Technical report, available
online at http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
[17] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best
multi-stage architecture for object recognition? In ICCV, pages 2146–2153, 2009.
[18] Thorsten Joachims. Making large-scale support vector machine learning practical, In Advances
in Kernel Methods, pages 169–184, MIT Press, Cambridge, MA, 1999.
[19] Thorsten Joachims. Transductive inference for text classification using support vector ma-
chines. In ICML, pages 200–209, 1999.
[20] Thorsten Joachims. Training linear SVMs in linear time. In KDD, pages 217–226, 2006.
[21] S. Sathiya Keerthi and Dennis DeCoste. A modified finite Newton method for fast solution of
large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.
[22] Manish Mehta, Rakesh Agrawal, and Jorma Rissanen. SLIQ: A fast scalable classifier for data
mining. In Extending Database Technology, pages 18–32, 1996.
[23] John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector
machines. In Advances in Kernel Methods – Support Vector Learning, 1998.
[24] John C. Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier
for data mining. In VLDB, pages 544–555, 1996.
[25] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal
estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[26] Choon Hui Teo, S. V. N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for
regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
[27] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
[28] Haixun Wang, Wei Fan, Philip Yu, and Jiawei Han, Mining concept-drifting data streams using
ensemble classifiers, KDD Conference, pages 226–235, 2003.
[29] Zhuang Wang and Slobodan Vucetic. Fast online training of ramp loss support vector ma-
chines. In ICDM, pages 569–577, 2009.
[30] Gongqing Wu, Haiguang Li, Xuegang Hu, Yuanjun Bi, Jing Zhang, and Xindong Wu.
MReC4.5: C4.5 ensemble classification with MapReduce. In Fourth ChinaGrid Annual Con-
ference, pages 249–255, 2009.
Chapter 11
Text Classification
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
ChengXiang Zhai
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
11.1 Introduction
The problem of classification has been widely studied in the database, data mining, and infor-
mation retrieval communities. The problem of classification is defined as follows. Given a set of
records D = {X1 , . . . , XN } and a set of k different discrete values indexed by {1 . . . k}, each repre-
senting a category, the task is to assign one category (equivalently the corresponding index value)
to each record Xi . The problem is usually solved by using a supervised learning approach where a
set of training data records (i.e., records with known category labels) are used to construct a clas-
sification model, which relates the features in the underlying record to one of the class labels. For
a given test instance for which the class is unknown, the training model is used to predict a class
label for this instance. The problem may also be solved by using unsupervised approaches that do
not require labeled training data, in which case keyword queries characterizing each class are often
manually created, and bootstrapping may be used to heuristically obtain pseudo training data. Our
review focuses on supervised learning approaches.
There are some variations of the basic problem formulation given above for text classification. In
the hard version of the classification problem, a particular label is explicitly assigned to the instance,
whereas in the soft version of the classification problem, a probability value is assigned to the test
instance. Other variations of the classification problem allow ranking of different class choices for
a test instance, or allow the assignment of multiple labels [60] to a test instance. The classification
problem assumes categorical values for the labels, though it is also possible to use continuous values
as labels. The latter is referred to as the regression modeling problem. The problem of text classi-
fication is closely related to that of classification of records with set-valued features [34]; however,
this model assumes that only information about the presence or absence of words is used in a doc-
ument. In reality, the frequency of words also plays a helpful role in the classification process, and
the typical domain-size of text data (the entire lexicon size) is much greater than a typical set-valued
classification problem.
A broad survey of a wide variety of classification methods may be found in [50,72], and a survey
that is specific to the text domain may be found in [127]. A relative evaluation of different kinds
of text classification methods may be found in [153]. A number of the techniques discussed in this
chapter have also been converted into software and are publicly available through multiple toolkits
such as the BOW toolkit [107], Mallet [110], WEKA,1 and LingPipe.2
The problem of text classification finds applications in a wide variety of domains in text mining.
Some examples of domains in which text classification is commonly used are as follows:
• News Filtering and Organization: Most news services today are electronic in nature, and a
large volume of news articles is created every single day by these organizations. In
such cases, it is difficult to organize the news articles manually. Therefore, automated methods
can be very useful for news categorization in a variety of Web portals [90]. In the special case
of binary categorization where the goal is to distinguish news articles interesting to a user
from those that are not, the application is also referred to as text filtering.
• Document Organization and Retrieval: The above application is generally useful for many
applications beyond news filtering and organization. A variety of supervised methods may be
used for document organization in many domains. These include large digital libraries of doc-
1 http://www.cs.waikato.ac.nz/ml/weka/
2 http://alias-i.com/lingpipe/
uments, Web collections, scientific literature, or even social feeds. Hierarchically organized
document collections can be particularly useful for browsing and retrieval [24].
• Opinion Mining: Customer reviews or opinions are often short text documents that can be
mined to determine useful information from the review. Details on how classification can be
used in order to perform opinion mining are discussed in [101]. A common classification
task is to classify an opinionated text object (e.g., a product review) into positive or negative
sentiment categories.
• Email Classification and Spam Filtering: It is often desirable to classify email [29, 33, 97]
in an automated way, either to determine its subject or to identify junk email [129]. The latter
task is also referred to as spam filtering or email filtering.
A wide variety of techniques have been designed for text classification. In this chapter, we will dis-
cuss the broad classes of techniques, and their uses for classification tasks. We note that these classes
of techniques also generally exist for other data domains such as quantitative or categorical data.
Since text may be modeled as quantitative data with frequencies on the word attributes, it is possible
to use most of the methods for quantitative data directly on text. However, text is a particular kind of
data in which the word attributes are sparse, and high dimensional, with low frequencies on most of
the words. Therefore, it is critical to design classification methods that effectively account for these
characteristics of text. In this chapter, we will focus on the specific changes that are applicable to
the text domain. Some key methods, that are commonly used for text classification are as follows:
• Decision Trees: Decision trees are designed with the use of a hierarchical division of the
underlying data space with the use of different text features. The hierarchical division of the
data space is designed in order to create class partitions that are more skewed in terms of their
class distribution. For a given text instance, we determine the partition that it is most likely to
belong to, and use it for the purposes of classification.
• Pattern (Rule)-Based Classifiers: In rule-based classifiers we determine the word patterns
that are most likely to be related to the different classes. We construct a set of rules, in which
the left-hand side corresponds to a word pattern, and the right-hand side corresponds to a
class label. These rules are used for the purposes of classification.
• SVM Classifiers: SVM Classifiers attempt to partition the data space with the use of linear
or non-linear delineations between the different classes. The key in such classifiers is to de-
termine the optimal boundaries between the different classes and use them for the purposes
of classification.
• Neural Network Classifiers: Neural networks are used in a wide variety of domains for the
purposes of classification. In the context of text data, the main difference for neural network
classifiers is to adapt these classifiers with the use of word features. We note that neural
network classifiers are related to SVM classifiers; indeed, they both are in the category of
discriminative classifiers, which are in contrast with the generative classifiers [116].
• Bayesian (Generative) Classifiers: In Bayesian classifiers (also called generative classifiers),
we attempt to build a probabilistic classifier based on modeling the underlying word features
in different classes. The idea is then to classify text based on the posterior probability of
the documents belonging to the different classes on the basis of the word presence in the
documents.
• Other Classifiers: Almost all classifiers can be adapted to the case of text data. Some of the
other classifiers include nearest neighbor classifiers, and genetic algorithm-based classifiers.
We will discuss some of these different classifiers in some detail and their use for the case of
text data.
The area of text categorization is so vast that it is impossible to cover all the different algorithms in
detail in a single chapter. Therefore, our goal is to provide the reader with an overview of the most
important techniques, and also the pointers to the different variations of these techniques.
Feature selection is an important problem for text classification. In feature selection, we attempt
to determine the features that are most relevant to the classification process. This is because some
of the words are much more likely to be correlated to the class distribution than others. Therefore,
a wide variety of methods have been proposed in the literature in order to determine the most
important features for the purpose of classification. These include measures such as the gini-index
or the entropy, which determine the level at which the presence of a particular feature skews the
class distribution in the underlying data. We will discuss the different feature selection methods that
are commonly used for text classification.
The rest of this chapter3 is organized as follows. In the next section, we will discuss methods for
feature selection in text classification. In Section 11.3, we will describe decision tree methods for
text classification. Rule-based classifiers are described in detail in Section 11.4. We discuss naive
Bayes classifiers in Section 11.5. The nearest neighbor classifier is discussed in Section 11.7. In
Section 11.6, we will discuss a number of linear classifiers, such as the SVM classifier, direct re-
gression modeling, and the neural network classifier. A discussion of how the classification methods
can be adapted to text and Web data containing hyperlinks is discussed in Section 11.8. In Section
11.9, we discuss a number of different meta-algorithms for classification such as boosting, bagging,
and ensemble learning. Methods for enhancing classification methods with additional training data
are discussed in Section 11.10. Section 11.11 contains the conclusions and summary.
to the case of the classification problem, and are often used in a variety of unsupervised applications
such as clustering and indexing. In the case of the classification problem, it makes sense to supervise
the feature selection process with the use of the class labels. This kind of selection process ensures
that those features that are highly skewed towards the presence of a particular class label are picked
for the learning process. A wide variety of feature selection methods are discussed in [154, 156].
Many of these feature selection methods have been compared with one another, and the experimen-
tal results are presented in [154]. We will discuss each of these feature selection methods in this
section.
Then, the gini-index for the word w, denoted by G(w), is defined4 as follows:
k
G(w) = ∑ pi (w)2 (11.2)
i=1
The value of the gini-index G(w) always lies in the range [1/k, 1]. Higher values of the gini-index
G(w) represent a greater discriminative power of the word w. For example, when all documents
that contain word w belong to a particular class, the value of G(w) is 1. On the other hand, when
documents containing word w are evenly distributed among the k different classes, the value of
G(w) is 1/k.
One criticism with this approach is that the global class distribution may be skewed to be-
gin with, and therefore the above measure may sometimes not accurately reflect the discriminative
power of the underlying attributes. Therefore, it is possible to construct a normalized gini-index in
order to reflect the discriminative power of the attributes more accurately. Let P1 . . . Pk represent the
global distributions of the documents in the different classes. Then, we determine the normalized
probability value p_i'(w) as follows:
    p_i'(w) = (p_i(w)/P_i) / ∑_{j=1}^{k} (p_j(w)/P_j).    (11.3)
The use of the global probabilities P_i ensures that the gini-index more accurately reflects class-
discrimination in the case of biased class distributions in the whole document collection. For a
document corpus containing n documents, d words, and k classes, the complexity of the gini-index
computation is O(n · d · k). This is because the computation of the term p_i(w) for all the different
words and classes requires O(n · d · k) time.
4 The gini-index is also sometimes defined as 1 − ∑_{i=1}^{k} p_i(w)^2, with lower values indicating greater discriminative
power of the feature w.
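For illustration, the following Python/NumPy sketch computes G(w) of Equation (11.2) and its normalized variant based on Equation (11.3), assuming that the per-class counts of documents containing the word w (and the global class fractions P_1 ... P_k) have already been extracted from the corpus; the numbers in the example are made up.

import numpy as np

def gini_index(counts_with_w):
    # counts_with_w[i] = number of documents of class i that contain w, so
    # p_i(w) is obtained by normalizing these counts.
    p = counts_with_w / counts_with_w.sum()
    return np.sum(p ** 2)

def normalized_gini_index(counts_with_w, P):
    # Normalized version: replace p_i(w) by p_i'(w) of Equation (11.3).
    p = counts_with_w / counts_with_w.sum()
    p_prime = (p / P) / np.sum(p / P)
    return np.sum(p_prime ** 2)

counts = np.array([80.0, 15.0, 5.0])          # a word concentrated in class 0 (k = 3)
print(gini_index(counts))                      # close to 1, i.e., highly discriminative
print(normalized_gini_index(counts, np.array([0.8, 0.1, 0.1])))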
The greater the value of the information gain I(w), the greater the discriminatory power of the word
w. For a document corpus containing n documents and d words, the complexity of the information
gain computation is O(n · d · k).
11.2.4 χ2 -Statistic
The χ^2 statistic is a different way to compute the lack of independence between the word w and a
particular class i. Let n be the total number of documents in the collection, p_i(w) be the conditional
probability of class i for documents that contain w, P_i be the global fraction of documents belonging
to class i, and F(w) be the global fraction of documents that contain the word w. The χ^2-statistic
between word w and class i is defined as follows:
    χ^2_i(w) = [n · F(w)^2 · (p_i(w) − P_i)^2] / [F(w) · (1 − F(w)) · P_i · (1 − P_i)]    (11.6)
As in the case of the mutual information, we can compute a global χ^2 statistic from the class-specific
values. We can use either the average or the maximum value in order to create the composite value:
    χ^2_avg(w) = ∑_{i=1}^{k} P_i · χ^2_i(w)
    χ^2_max(w) = max_i χ^2_i(w)
We note that the χ2 -statistic and mutual information are different ways of measuring the cor-
relation between terms and categories. One major advantage of the χ2 -statistic over the mutual in-
formation measure is that it is a normalized value, and therefore these values are more comparable
across terms in the same category.
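As an illustration, the sketch below evaluates Equation (11.6) and the two composite values for a word; all of the input statistics (n, F(w), p_i(w), P_i) are assumed to have been computed from the corpus beforehand, and the example numbers are made up.

import numpy as np

def chi_square(n, F_w, p_i_w, P_i):
    # chi^2_i(w) of Equation (11.6).
    num = n * (F_w ** 2) * (p_i_w - P_i) ** 2
    den = F_w * (1.0 - F_w) * P_i * (1.0 - P_i)
    return num / den

def chi_square_composites(n, F_w, p_w, P):
    vals = np.array([chi_square(n, F_w, p_i_w, P_i) for p_i_w, P_i in zip(p_w, P)])
    return np.sum(P * vals), np.max(vals)      # (chi^2_avg, chi^2_max)

print(chi_square_composites(n=1000, F_w=0.2, p_w=[0.7, 0.3], P=np.array([0.5, 0.5])))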
be computationally expensive. Similar ideas were also implemented in a probabilistic topic model,
called supervised LDA [16], where class labels are used to “supervise” the discovery of topics in
text data [16].
A method called sprinkling is proposed in [26], in which artificial terms are added to (or “sprin-
kled” into) the documents, which correspond to the class labels. In other words, we create a term
corresponding to the class label, and add it to the document. LSI is then performed on the document
collection with these added terms. The sprinkled terms can then be removed from the representa-
tion, once the eigenvectors have been determined. The sprinkled terms help in making the LSI more
sensitive to the class distribution during the reduction process. It has also been proposed in [26]
that it can be generally useful to make the sprinkling process adaptive, in which all classes are not
necessarily treated equally, but the relationships between the classes are used in order to regulate the
sprinkling process. Furthermore, methods have also been proposed in [26] to make the sprinkling
process adaptive to the use of a particular kind of classifier.
subspace, and the vector found in the next iteration is the optimal one from a smaller search space.
The process of finding linear discriminants is continued until the class separation, as measured by
an objective function, reduces below a given threshold for the vector determined in the current
iteration. The power of such a dimensionality reduction approach has been illustrated in [23], in
which it has been shown that a simple decision tree classifier can perform much more effectively on
this transformed data, as compared to more sophisticated classifiers.
Next, we discuss how the Fisher’s discriminant is actually constructed. First, we will set up the
objective function J(α), which determines the level of separation of the different classes along a
given direction (unit-vector) α. This sets up the crisp optimization problem of determining the value
of α, which maximizes J(α). For simplicity, let us assume the case of binary classes. Let D1 and
D2 be the two sets of documents belonging to the two classes. Then, the projection of a document
X ∈ D1 ∪ D2 along α is given by X · α. Therefore, the squared class separation S(D1 , D2 , α) along
the direction α is given by:
    S(D1, D2, α) = ( ∑_{X ∈ D1} α · X / |D1|  −  ∑_{X ∈ D2} α · X / |D2| )^2.    (11.7)
In addition, we need to normalize this absolute class separation with the use of the underlying class
variances. Let Var(D1 , α) and Var(D2 , α) be the individual class variances along the direction α. In
other words, we have:
    Var(D1, α) = ∑_{X ∈ D1} (X · α)^2 / |D1|  −  ( ∑_{X ∈ D1} X · α / |D1| )^2.    (11.8)
The value of Var(D2 , α) can be defined in a similar way. Then, the normalized class-separation J(α)
is defined as follows:
    J(α) = S(D1, D2, α) / ( Var(D1, α) + Var(D2, α) )    (11.9)
The optimal value of α needs to be determined subject to the constraint that α is a unit vector. Let µ1
and µ2 be the means of the two data sets D1 and D2 , and C1 and C2 be the corresponding covariance
matrices. It can be shown that the optimal (unscaled) direction α = α∗ can be expressed in closed
form, and is given by the following:
    α* = ( (C1 + C2) / 2 )^{-1} (µ1 − µ2).    (11.10)
The main difficulty with the above equation is that it requires the inversion
of the covariance matrix, which is computationally difficult in the sparse, high-dimensional text
domain. Therefore, a gradient descent approach can be used in order to determine the value of α in
a more computationally effective way. Details of the approach are presented in [23].
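For completeness, the closed form of Equation (11.10) can be written as a short Python/NumPy sketch. Because the averaged covariance matrix is typically singular for sparse, high-dimensional text data, the sketch adds a small ridge term before inversion; this ridge (and the toy low-dimensional data) is an assumption of the sketch, not part of the construction described above.

import numpy as np

def fisher_direction(D1, D2, ridge=1e-3):
    # D1, D2: document-term matrices of the two classes (one row per document).
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
    C = 0.5 * (np.cov(D1, rowvar=False) + np.cov(D2, rowvar=False))
    alpha = np.linalg.solve(C + ridge * np.eye(C.shape[0]), mu1 - mu2)
    return alpha / np.linalg.norm(alpha)        # return alpha as a unit vector

rng = np.random.default_rng(0)
D1 = rng.normal(loc=1.0, size=(30, 5))
D2 = rng.normal(loc=-1.0, size=(30, 5))
print(fisher_direction(D1, D2))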
Another related method that attempts to determine projection directions that maximize the topi-
cal differences between different classes is the Topical Difference Factor Analysis method proposed
in [84]. The problem has been shown to be solvable as a generalized eigenvalue problem. The
method was used in conjunction with a k-nearest neighbor classifier, and it was shown that the use
of this approach significantly improves the accuracy over a classifier that uses the original set of
features.
note that this method has really been proposed in [68, 69] as an unsupervised method that preserves
the underlying clustering structure, assuming the data has already been clustered in a pre-processing
phase. Thus, the generalized dimensionality reduction method has been proposed as a much more
aggressive dimensionality reduction technique, which preserves the underlying clustering structure
rather than the individual points. This method can however also be used as a supervised technique
in which the different classes are used as input to the dimensionality reduction algorithm, instead
of the clusters constructed in the pre-processing phase [152]. This method is known as the Optimal
Orthogonal Centroid Feature Selection Algorithm (OCFS), and it directly targets at the maximiza-
tion of inter-class scatter. The algorithm is shown to have promising results for supervised feature
selection in [152].
hierarchically. In the context of text data, such predicates are typically conditions on the presence
or absence of one or more words in the document. The division of the data space is performed
recursively in the decision tree, until the leaf nodes contain a certain minimum number of records, or satisfy certain conditions on class purity. The majority class label (or cost-weighted majority label) in the
leaf node is used for the purposes of classification. For a given test instance, we apply the sequence
of predicates at the nodes, in order to traverse a path of the tree in top-down fashion and determine
the relevant leaf node. In order to further reduce overfitting, some of the nodes may be pruned by holding out a part of the data, which is not used to construct the tree. The held-out portion of the data is used to determine whether a constructed leaf node should be pruned. In particular, if the class distribution in the training data (for decision tree construction) is very
different from the class distribution in the training data that is used for pruning, then it is assumed
that the node overfits the training data. Such a node can be pruned. A detailed discussion of decision
tree methods may be found in [20, 50, 72, 122].
In the particular case of text data, the predicates for the decision tree nodes are typically defined
in terms of the terms in the underlying text collection. For example, a node may be partitioned into
its children nodes depending upon the presence or absence of a particular term in the document. We
note that different nodes at the same level of the tree may use different terms for the partitioning
process.
Many other kinds of predicates are possible. It may not be necessary to use individual terms for
partitioning, but one may measure the similarity of documents to correlated sets of terms. These
correlated sets of terms may be used to further partition the document collection, based on the
similarity of the document to them. The different kinds of splits are as follows:
• Single Attribute Splits: In this case, we use the presence or absence of particular words (or
even phrases) at a particular node in the tree in order to perform the split. At any given level,
we pick the word that provides the maximum discrimination between the different classes. Measures such as the Gini index or information gain can be used in order to quantify the level of discrimination provided by a split (a small illustrative sketch follows this list). For example, the DT-min10 algorithm [93] is based on this approach.
• Discriminant-Based Multi-Attribute Split: For the multi-attribute case, a natural choice for
performing the split is to use discriminants such as the Fisher discriminant for performing the
split. Such discriminants provide the directions in the data along which the classes are best
separated. The documents are projected on this discriminant vector for rank ordering, and
then split at a particular coordinate. The choice of split point is picked in order to maximize
the discrimination between the different classes. The work in [23] uses a discriminant-based
split, though this is done indirectly because of the use of a feature transformation to the
discriminant representation, before building the classifier.
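To make the single-attribute case concrete, the following Python sketch computes the information gain of splitting a labeled collection on the presence or absence of a candidate word; the representation of each document as a set of terms and the helper names are assumptions made here purely for illustration. The word with the highest gain would then be chosen as the split predicate at the node.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    # docs: list of term sets; labels: parallel list of class labels.
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    split_entropy = (len(with_term) / n) * entropy(with_term) + \
                    (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - split_entropy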
Some of the earliest implementations of such classifiers may be found in [92, 93, 99, 147]. The last of these
is really a rule-based classifier, which can be interpreted either as a decision tree or a rule-based
classifier. Most of the decision tree implementations in the text literature tend to be small variations
on standard packages such as ID3 and C4.5, in order to adapt the model to text classification. Many
of these classifiers are typically designed as baselines for comparison with other learning models
[75].
A well known implementation of the decision tree classifier based on the C4.5 taxonomy of
algorithms [122] is presented in [99]. More specifically, the work in [99] uses the successor to the
C4.5 algorithm, which is also known as the C5 algorithm. This algorithm uses single-attribute splits
at each node, where the feature with the highest information gain [37] is used for the purpose of
the split. Decision trees have also been used in conjunction with boosting techniques. An adaptive
boosting technique [56] is used in order to improve the accuracy of classification. In this technique,
we use n different classifiers. The ith classifier is constructed by examining the errors of the (i − 1)th
classifier. A voting scheme is applied among these classifiers in order to report the final label. Other
boosting techniques for improving decision tree classification accuracy are proposed in [134].
The work in [51] presents a decision tree algorithm based on the Bayesian approach developed
in [28]. In this classifier, the decision tree is grown by recursive greedy splits, where the splits
are chosen using Bayesian posterior probability of model structure. The structural prior penalizes
additional model parameters at each node. The output of the process is a class probability rather
than a deterministic class label for the test instance.
by generating a set of targeted rules that are related to the different classes, and one default catch-all
rule, which can cover all the remaining instances.
A number of criteria can be used in order to generate the rules from the training data. Two of the
most common conditions that are used for rule generation are those of support and confidence. These
conditions are common to all rule-based pattern classifiers [100] and may be defined as follows:
• Support: This quantifies the absolute number of instances in the training data set that are
relevant to the rule. For example, in a corpus containing 100,000 documents, a rule in which
both the left-hand side and the right-hand side are satisfied by 50,000 documents is more important
than a rule that is satisfied by 20 documents. Essentially, this quantifies the statistical volume
that is associated with the rule. However, it does not encode the strength of the rule.
• Confidence: This quantifies the conditional probability that the right-hand side of the rule
is satisfied, if the left-hand side is satisfied. This is a more direct measure of the strength of
the underlying rule.
We note that the aforementioned measures are not the only measures that are possible, but are
widely used in the data mining and machine learning literature [100] for both textual and non-
textual data, because of their intuitive nature and simplicity of interpretation. One criticism of the
above measures is that they do not normalize for the a-priori presence of different terms and features,
and are therefore prone to misinterpretation, when the feature distribution or class-distribution in the
underlying data set is skewed.
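As a concrete illustration of these two measures, the following sketch computes the support and confidence of a rule whose left-hand side is a set of required terms and whose right-hand side is a class label; the set-based document representation and the function name are assumptions made for illustration.

def support_and_confidence(docs, labels, lhs_terms, rhs_class):
    # docs: list of term sets; labels: parallel list of class labels.
    firing = [y for d, y in zip(docs, labels) if lhs_terms <= d]   # left-hand side satisfied
    satisfied = sum(1 for y in firing if y == rhs_class)           # both sides satisfied
    support = satisfied                                            # absolute statistical volume
    confidence = satisfied / len(firing) if firing else 0.0        # strength of the rule
    return support, confidence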
The training phase constructs all the rules, which are based on measures such as the above. For
a given test instance, we determine all the rules that are relevant to the test instance. Since we allow
overlaps, it is possible that more than one rule may be relevant to the test instance. If the class labels
on the right-hand sides of all these rules are the same, then it is easy to pick this class as the relevant
label for the test instance. On the other hand, the problem becomes more challenging when there
are conflicts between these different rules. A variety of different methods are used to rank-order
the different rules [100], and report the most relevant rule as a function of these different rules. For
example, a common approach is to rank-order the rules by their confidence, and pick the top-k rules
as the most relevant. The class label on the right-hand side of the greatest number of these rules is
reported as the relevant one.
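A minimal sketch of this confidence-based conflict resolution strategy is shown below; the representation of a rule as a (left-hand term set, class label, confidence) triple and the default-class argument are assumptions made for illustration.

from collections import Counter

def classify_with_rules(doc_terms, rules, k=5, default_class=None):
    # rules: list of (lhs_terms, rhs_class, confidence) triples.
    fired = [r for r in rules if r[0] <= doc_terms]                 # rules relevant to the document
    top_k = sorted(fired, key=lambda r: r[2], reverse=True)[:k]     # rank-order by confidence
    if not top_k:
        return default_class                                        # the catch-all default rule
    votes = Counter(rhs for _, rhs, _ in top_k)                     # count labels among the top-k
    return votes.most_common(1)[0][0]                               # majority label is reported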
An interesting rule-based classifier for the case of text data has been proposed in [6]. This
technique uses an iterative methodology, which was first proposed in [148] for generating rules.
Specifically, the method determines the single best rule related to any particular class in the training
data. The best rule is defined in terms of the confidence of the rule, as defined above. This rule
along with its corresponding instances are removed from the training data set. This approach is
continuously repeated, until it is no longer possible to find strong rules in the training data, and
complete predictive value is achieved.
The transformation of decision trees to rule-based classifiers is discussed generally in [122], and
for the particular case of text data in [80]. For each path in the decision tree a rule can be generated,
which represents the conjunction of the predicates along that path. One advantage of the rule-based
classifier over a decision tree is that it is not restricted to a strict hierarchical partitioning of the
feature space, and it allows for overlaps and inconsistencies among the different rules. Therefore,
if a new set of training examples is encountered, which is related to a new class or a new part of
the feature space, then it is relatively easy to modify the rule set for these new examples. Further-
more, rule-based classifiers also allow for a tremendous interpretability of the underlying decision
space. In cases in which domain-specific expert knowledge is known, it is possible to encode this
into the classification process by manual addition of rules. In many practical scenarios, rule-based
techniques are more commonly used because of their ease of maintenance and interpretability.
One of the most common rule-based techniques is the RIPPER technique discussed in [32–34].
The RIPPER technique uses the sequential covering paradigm to determine the combinations of
words that are related to a particular class. The RIPPER method has been shown to be especially ef-
fective in scenarios where the number of training examples is relatively small [31]. Another method
called sleeping experts [32,57] generates rules that take the placement of the words in the documents
into account. Most of the classifiers such as RIPPER [32–34] treat documents as set-valued objects,
and generate rules based on the co-presence of the words in the documents. The rules in sleeping ex-
perts are different from most of the other classifiers in this respect. In this case [32,57], the left-hand
side of the rule consists of a sparse phrase, which is a group of words close to one another in the
document (though not necessarily completely sequential). Each such rule has a weight, which de-
pends upon its classification specificity in the training data. For a given test example, we determine
the sparse phrases that are present in it, and perform the classification by combining the weights
of the different rules that are fired. The sleeping experts and RIPPER systems have been compared
in [32], and have been shown to have excellent performance on a variety of text collections.
• Multinomial Model: In this model, we capture the frequencies of terms in a document by rep-
resenting a document with a bag of words. The documents in each class can then be modeled
as samples drawn from a multinomial word distribution. As a result, the conditional probabil-
ity of a document given a class is simply a product of the probability of each observed word
in the corresponding class.
No matter how we model the documents in each class (be it a multivariate Bernoulli model or
a multinomial model), the component class models (i.e., generative models for documents in each
class) can be used in conjunction with the Bayes rule to compute the posterior probability of the
class for a given document, and the class with the highest posterior probability can then be assigned
to the document.
There has been considerable confusion in the literature on the differences between the multivari-
ate Bernoulli model and the multinomial model. A good exposition of the differences between these
two models may be found in [108]. In the following, we describe these two models in more detail.
If we sampled a term set T of any size from the term distribution of one of the randomly chosen
classes, and the final outcome is the set Q, then what is the posterior probability that we had origi-
nally picked class i for sampling? The a-priori probability of picking class i is equal to its fractional
presence in the collection.
We denote the class of the sampled set T by CT and the corresponding posterior probability
by P(CT = i|T = Q). This is essentially what we are trying to find. It is important to note that
since we do not allow replacement, we are essentially picking a subset of terms from V with no
frequencies attached to the picked terms. Therefore, the set Q may not contain duplicate elements.
Under the naive Bayes assumption of independence between terms, this is essentially equivalent to
either selecting or not selecting each term with a probability that depends upon the underlying term
distribution. Furthermore, it is also important to note that this model has no restriction on the number
of terms picked. As we will see later, these assumptions are the key differences with the multinomial
Bayes model. The Bayes approach classifies a given set Q based on the posterior probability that Q
is a sample from the data distribution of class i, i.e., P(CT = i|T = Q), and it requires us to compute
the following two probabilities in order to achieve this:
1. What is the prior probability that a set T is a sample from the term distribution of class i? This
probability is denoted by P(CT = i).
2. If we sampled a set T of any size from the term distribution of class i, then what is the
probability that our sample is the set Q? This probability is denoted by P(T = Q|CT = i).
We will now provide a more mathematical description of Bayes modeling. In other words, we
wish to model P(CT = i|Q is sampled). We can use the Bayes rule in order to write this conditional
probability in a way that can be estimated more easily from the underlying corpus. In other words,
we can simplify as follows:

$$ P(C_T = i \mid T = Q) = \frac{P(C_T = i) \cdot P(T = Q \mid C_T = i)}{P(T = Q)} \approx \frac{P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i)}{P(T = Q)} $$

We note that the last condition of the above sequence uses the naive independence assumption,
because we are assuming that the probabilities of occurrence of the different terms are independent
of one another. This is practically necessary, in order to transform the probability equations to a
form that can be estimated from the underlying data.
The class assigned to Q is the one with the highest posterior probability given Q. It is easy to see
that this decision is not affected by the denominator, which is the marginal probability of observing
Q. That is, we will assign the following class to Q:

$$ \mbox{Class}(Q) = \operatorname{argmax}_{i} \; P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) $$

It is important to note that all terms in the right-hand side of the last equation can be estimated
from the training corpus. The value of P(CT = i) is estimated as the global fraction of documents
belonging to class i, the value of P(t j ∈ T |CT = i) is the fraction of documents in the ith class that
contain term t j . We note that all of the above are maximum likelihood estimates of the corresponding
probabilities. In practice, Laplacian smoothing [144] is used, in which small values are added to the
frequencies of terms in order to avoid zero probabilities of sparsely present terms.
In most applications of the Bayes classifier, we only care about the identity of the class with
the highest probability value, rather than the actual probability value associated with it, which is
why we do not need to compute the normalizer P(T = Q). In fact, in the case of binary classes, a
number of simplifications are possible in computing these Bayes “probability” values by using the
logarithm of the Bayes expression, and removing a number of terms that do not affect the ordering
of class probabilities. We refer the reader to [124] for details.
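A minimal Python sketch of this estimation and log-space scoring is given below, assuming that documents are represented as term sets; the smoothing constant and helper names are illustrative, and the normalizer P(T = Q) is omitted because it does not affect which class is chosen.

import math

def train_set_model(docs, labels, classes, vocab, mu=1.0):
    # Maximum likelihood estimates with Laplacian smoothing (mu is the smoothing constant).
    prior, cond = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)
        cond[c] = {t: (sum(1 for d in class_docs if t in d) + mu) /
                      (len(class_docs) + 2.0 * mu) for t in vocab}
    return prior, cond

def classify_set(query_terms, prior, cond):
    # Log-space scoring of P(C_T = i) * prod_{t_j in Q} P(t_j in T | C_T = i).
    scores = {c: math.log(prior[c]) +
                 sum(math.log(cond[c][t]) for t in query_terms if t in cond[c])
              for c in prior}
    return max(scores, key=scores.get)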
Although for classification, we do not need to compute P(T = Q), some applications necessitate
the exact computation of the posterior probability P(CT = i|T = Q). For example, in the case of
supervised anomaly detection (or rare class detection), the exact posterior probability value P(CT =
i|T = Q) is needed in order to fairly compare the probability value over different test instances, and
rank them for their anomalous nature. In such cases, we would need to compute P(T = Q). One
way to achieve this is simply to take a sum over all the classes:

$$ P(T = Q) = \sum_{i} P(C_T = i) \cdot \prod_{t_j \in Q} P(t_j \in T \mid C_T = i) $$

This is based on the conditional independence of features for each class. Since the parameter values
are estimated for each class separately, we may face the problem of data sparseness. An alterna-
tive way of computing it, which may alleviate the data sparseness problem, is to further make the
assumption of (global) independence of terms, and compute it as:

$$ P(T = Q) = \prod_{t_j \in Q} P(t_j \in T) $$

where the term probabilities are based on global term distributions in all the classes.
A natural question arises, as to whether it is possible to design a Bayes classifier that does not
use the naive assumption, and models the dependencies between the terms during the classification
process. Methods that generalize the naive Bayes classifier by not using the independence assump-
tion do not work well because of the higher computational costs and the inability to estimate the
parameters accurately and robustly in the presence of limited data. The most interesting line of
work in relaxing the independence assumption is provided in [128]. In this work, the tradeoffs across a spectrum of different levels of dependence among the terms have been explored. On the
one extreme, an assumption of complete dependence results in a Bayesian network model that turns
out to be computationally very expensive. On the other hand, it has been shown that allowing lim-
ited levels of dependence can provide good tradeoffs between accuracy and computational costs.
We note that while the independence assumption is a practical approximation, it has been shown
in [35, 47] that the approach does have some theoretical merit. Indeed, extensive experimental tests
have tended to show that the naive classifier works quite well in practice.
A number of papers [24,74,86,91,124,129] have used the naive Bayes approach for classification
in a number of different application domains. The classifier has also been extended to modeling
temporally aware training data, in which the importance of a document may decay with time [130].
As in the case of other statistical classifiers, the naive Bayes classifier [129] can easily incorporate
domain-specific knowledge into the classification process. The particular domain that the work in
[129] addresses is that of filtering junk email. Thus, for such a problem, we often have a lot of
additional domain knowledge that helps us determine whether a particular email message is junk or
not. For example, some common characteristics that make an email more or less likely to be junk are as follows:
• The domain of the sender, such as .edu or .com, can make an email more or less likely to be junk.
• Phrases such as “Free Money” or over-emphasized punctuation such as “!!!” can make an
email more likely to be junk.
• Whether the recipient of the message was a particular user, or a mailing list.
The Bayes method provides a natural way to incorporate such additional information into the classi-
fication process, by creating new features for each of these characteristics. The standard Bayes tech-
nique is then used in conjunction with this augmented representation for classification. The Bayes
technique has also been used in conjunction with the incorporation of other kinds of domain knowl-
edge, such as the incorporation of hyperlink information into the classification process [25, 118].
The Bayes method is also suited to hierarchical classification, when the training data is arranged
in a taxonomy of topics. For example, the Open Directory Project (ODP), Yahoo! Taxonomy, and a
variety of news sites have vast collections of documents that are arranged into hierarchical groups.
The hierarchical structure of the topics can be exploited to perform more effective classification
[24, 86], because it has been observed that context-sensitive feature selection can provide more
useful classification results. In hierarchical classification, a Bayes classifier is built at each node,
which then provides us with the next branch to follow for classification purposes. Two such methods
are proposed in [24, 86], in which node specific features are used for the classification process.
Clearly, far fewer features are required at a particular node in the hierarchy, because the features
that are picked are relevant to that branch. An example in [86] suggests that a branch of the taxonomy
that is related to Computer may have no relationship with the word “cow.” These node-specific
features are referred to as signatures in [24]. Furthermore, it has been observed in [24] that in a
given node, the most discriminative features for a given class may be different from their parent
nodes. For example, the word “health” may be discriminative for the Yahoo! category @Health,
but the word “baby” may be much more discriminative for the category @Health@Nursing. Thus,
it is critical to have an appropriate feature selection process at each node of the classification tree.
The methods in [24, 86] use different methods for this purpose.
• The work in [86] uses an information-theoretic approach [37] for feature selection, which
takes into account the dependencies between the attributes [128]. The algorithm greedily
eliminates the features one-by-one so as to least disrupt the conditional class distribution at
that node.
• The node-specific features are referred to as signatures in [24]. These node-specific signa-
tures are computed by calculating the ratio of intra-class variance to inter-class variance for
the different words at each node of the tree. We note that this measure is the same as that
optimized by the Fisher’s discriminant, except that it is applied to the original set of words,
rather than solved as a general optimization problem in which arbitrary directions in the data
are picked.
A Bayesian classifier is constructed at each node in order to determine the appropriate branch. The use of a small number of context-sensitive features provides one advantage to these methods, because Bayesian classifiers work much more effectively with a smaller number of features. Another major
difference between the two methods is that the work in [86] uses the Bernoulli model, whereas that
in [24] uses the multinomial model, which will be discussed in the next subsection. This approach in
[86] is referred to as the Pachinko Machine classifier and that in [24] is known as TAPER (Taxonomy
and Path Enhanced Retrieval System).
Other noteworthy methods for hierarchical classification are proposed in [13, 109, 151]. The
work [13] addresses two common problems associated with hierarchical text classification: (1) error
propagation; (2) non-linear decision surfaces. The problem of error propagation occurs when the
classification mistakes made at a parent node are propagated to its children nodes. This problem
was solved in [13] by using cross validation to obtain a training data set for a child node that is more
similar to the actual test data passed to the child node from its parent node than the training data set
normally used for training a classifier at the child node. The problem of non-linear decision surfaces
refers to the fact that the decision boundary of a category at a higher level is often non-linear (since
its members are the union of the members of its children nodes). This problem is addressed by using
the tentative class labels obtained at the children nodes as features for use at a parent node. These
are general strategies that can be applied to any base classifier, and the experimental results in [13]
show that both strategies are effective.
If we sampled L terms sequentially from the term distribution of one of the randomly cho-
sen classes (allowing repetitions) to create the term set T , and the final outcome for sampled set
T is the set Q with the corresponding frequencies F, then what is the posterior probability that we
had originally picked class i for sampling? The a-priori probability of picking class i is equal to its
fractional presence in the collection.
The aforementioned probability is denoted by P(CT = i|T = [Q, F]). An assumption that is
commonly used in these models is that the length of the document is independent of the class label.
While it is easily possible to generalize the method, so that the document length is used as a prior,
independence is usually assumed for simplicity. As in the previous case, we need to estimate two
values in order to compute the Bayes posterior.
1. What is the prior probability that a set T is a sample from the term distribution of class i? This
probability is denoted by P(CT = i).
2. If we sampled L terms from the term distribution of class i (with repetitions), then what is the
probability that our sampled set T is the set Q with associated frequencies F? This probability
is denoted by P(T = [Q, F]|CT = i).
We can substitute Equation 11.12 in Equation 11.11 to obtain the class with the highest Bayes pos-
terior probability, where the class priors are computed as in the previous case, and the probabilities
P(t j ∈ T |CT = i) can also be easily estimated as previously with Laplacian smoothing [144]. Note
that for the purpose of choosing the class with the highest posterior probability, we do not really
have to compute the multinomial coefficient $\frac{L!}{\prod_{i=1}^{m} F_i!}$, as it is a constant not depending on the class label (i.e., the same for all the classes). We also note that the probabilities of class absence are not present in the above equations
because of the way in which the sampling is performed.
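The corresponding multinomial scoring step can be sketched as follows, assuming that the class priors and smoothed term probabilities have already been estimated from the training corpus; as noted above, the multinomial coefficient is dropped because it is identical for all the classes. The class with the highest score is reported.

import math

def multinomial_score(term_freqs, prior_c, term_prob_c):
    # term_freqs: dict mapping each term in the test document to its frequency F.
    # prior_c: P(C_T = i); term_prob_c: smoothed P(t_j | C_T = i) over the vocabulary.
    score = math.log(prior_c)
    for t, f in term_freqs.items():
        if t in term_prob_c:                      # ignore out-of-vocabulary terms
            score += f * math.log(term_prob_c[t])
    return score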
A number of different variations of the multinomial model have been proposed in [61, 82, 96,
109, 111, 117]. In the work [109], it is shown that a category hierarchy can be leveraged to im-
prove the estimate of multinomial parameters in the naive Bayes classifier to significantly improve
classification accuracy. The key idea is to apply shrinkage techniques to smooth the parameters for
data-sparse child categories with their common parent nodes. As a result, the training data of related
categories are essentially “shared” with each other in a weighted manner, which helps improve the
robustness and accuracy of parameter estimation when there are insufficient training data for each
individual child category. The work in [108] has performed an extensive comparison between the
Bernoulli and the multinomial models on different corpora, and the following conclusions were
presented:
• The multi-variate Bernoulli model can sometimes perform better than the multinomial model
at small vocabulary sizes.
• The multinomial model outperforms the multi-variate Bernoulli model for large vocabulary
sizes, and almost always beats the multi-variate Bernoulli when vocabulary size is chosen
optimally for both. On average, a 27% reduction in error was reported in [108].
The afore-mentioned results seem to suggest that the two models may have different strengths, and
may therefore be useful in different scenarios.
in enhancing the effectiveness of probabilistic classifiers [98, 117]. These methods are particularly
useful in cases where the amount of training data is limited. In particular, clustering can help in the
following ways:
• The Bayes method implicitly estimates the word probabilities P(ti ∈ T |CT = i) of a large
number of terms in terms of their fractional presence in the corresponding component. This is
clearly noisy. By treating the clusters as separate entities from the classes, we now only need
to relate (a much smaller number of) cluster membership probabilities to class probabilities.
This reduces the number of parameters and greatly improves classification accuracy [98].
• The use of clustering can help in incorporating unlabeled documents into the training data
for classification. The premise is that unlabeled data is much more copiously available than
labeled data, and when labeled data is sparse, it should be used in order to assist the classi-
fication process. While such unlabeled documents do not contain class-specific information,
they do contain a lot of information about the clustering behavior of the underlying data. This
can be very useful for more robust modeling [117], when the amount of training data is low.
This general approach is also referred to as co-training [10, 17, 45].
The common characteristic of both methods [98, 117] is that they both use a form of supervised
clustering for the classification process. While the goal is quite similar (limited training data), the
approach used for this purpose is quite different. We will discuss both of these methods in this
section.
In the method discussed in [98], the document corpus is modeled with the use of supervised
word clusters. In this case, the k mixture components are clusters that are correlated to, but are
distinct from the k groups of documents belonging to the different classes. The main difference
from the Bayes method is that the term probabilities are computed indirectly by using clustering
as an intermediate step. For a sampled document T , we denote its class label by CT ∈ {1 . . . k},
and its mixture component by M T ∈ {1 . . . k}. The k different mixture components are essentially
word-clusters whose frequencies are generated by using the frequencies of the terms in the k dif-
ferent classes. This ensures that the word clusters for the mixture components are correlated to the
classes, but they are not assumed to be drawn from the same distribution. As in the previous case,
let us assume that the sampled document contains the set of words Q. Then, we would like to estimate the
probability P(T = Q|CT = i) for each class i. An interesting variation of the work in [98] from the
Bayes approach is that it does not attempt to determine the posterior probability P(CT = i|T = Q).
Rather, it simply reports the class with the highest likelihood P(T = Q|CT = i). This is essen-
tially equivalent to assuming, in the Bayes approach, that the prior distribution of each class is the
same.
The other difference of the approach is in terms of how the value of P(T = Q|CT = i) is com-
puted. As before, we need to estimate the value of P(t j ∈ T |CT = i), according to the naive Bayes
rule. However, unlike the standard Bayes classifier, this is done very indirectly with the use of mix-
ture modeling. Since the mixture components do not directly correspond to the class, this term can
only be estimated by summing up the expected value over all the mixture components:
$$ P(t_j \in T \mid C_T = i) = \sum_{s=1}^{k} P(t_j \in T \mid M_T = s) \cdot P(M_T = s \mid C_T = i) \qquad (11.13) $$
The value of P(t j ∈ T |M T = s) is easy to estimate by using the fractional presence of term t j in the
sth mixture component. The main unknown here is the set of model parameters P(M T = s|CT =
i). Since a total of k classes and k mixture-components are used, this requires the estimation of
only k2 model parameters, which is typically quite modest for a small number of classes. An EM-
approach has been used in [98] in order to estimate this small number of model parameters in a
robust way. It is important to understand that the work in [98] is an interesting combination of
supervised topic modeling (dimensionality reduction) and Bayes classification after reducing the
effective dimensionality of the feature space to a much smaller value by clustering. The scheme
works well because of the use of supervision in the topic modeling process, which ensures that the
use of an intermediate clustering approach does not lose information for classification. We also note
that in this model, the number of mixtures can be made to vary from the number of classes. While
the work in [98] does not explore this direction, there is really no reason to assume that the number
of mixture components is the same as the number of classes. Such an assumption can be particularly
useful for data sets in which the classes may not be contiguous in the feature space, and a natural
clustering may contain far more components than the number of classes.
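A direct transcription of Equation 11.13 is shown below; the dictionary-based representation of the parameters is an assumption made purely for illustration.

def term_prob_via_clusters(term, class_i, p_term_given_cluster, p_cluster_given_class):
    # p_term_given_cluster[s][term] corresponds to P(t_j in T | M_T = s).
    # p_cluster_given_class[class_i][s] corresponds to P(M_T = s | C_T = i),
    # i.e., the k^2 model parameters estimated with the EM approach of [98].
    return sum(p_term_given_cluster[s].get(term, 0.0) * p_s
               for s, p_s in p_cluster_given_class[class_i].items())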
Next, we will discuss the second method [117], which uses unlabeled data. The approach in
[117] uses the unlabeled data in order to improve the training model. Why should unlabeled data
help in classification at all? In order to understand this point, recall that the Bayes classification
process effectively uses k mixture components, which are assumed to be the k different classes. If
we had an infinite amount of unlabeled training data, it would be possible to create the mixture components, but it would not be possible to assign class labels to these components. However, the most data-intensive
part of modeling the mixture is that of determining the shape of the mixture components. The actual
assignment of mixture components to class labels can be achieved with a relatively small number
of class labels. It has been shown in [30] that the accuracy of assigning components to classes
increases exponentially with the number of labeled samples available. Therefore, the work in [117]
designs an EM-approach [44] to simultaneously determine the relevant mixture model and its class
assignment.
It turns out that the EM-approach, as applied to this problem, is quite simple to implement.
It has been shown in [117] that the EM-approach is equivalent to the following iterative method-
ology. First, a naive Bayes classifier is constructed by estimating the model parameters from the
labeled documents only. This is used in order to assign probabilistically weighted class labels to
the unlabeled documents. Then, the Bayes classifier is reconstructed, except that we also use the
newly labeled documents in the estimation of the underlying model parameters. We again use this
classifier to reclassify the (originally unlabeled) documents. The process is continually repeated
till convergence is achieved. This process is in many ways similar to pseudo-relevance feedback
in information retrieval where a portion of top-ranked documents returned from an initial round of
retrieval would be assumed to be relevant document examples (may be weighted), which can then
be used to learn an improved query representation for improving ranking of documents in the next
round of retrieval [150].
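The iterative methodology described above can be sketched as the following self-training loop; the helper functions train_nb (which fits a naive Bayes model from labeled and probabilistically weighted documents) and predict_proba (which returns class probabilities under the current model) are assumed here rather than prescribed, and a fixed iteration count stands in for a convergence check.

def em_style_training(labeled, unlabeled, train_nb, predict_proba, iters=10):
    # labeled: list of (document, label) pairs; unlabeled: list of documents.
    model = train_nb(labeled, [])                       # initial model from labeled data only
    for _ in range(iters):
        # E-step: assign probabilistically weighted class labels to the unlabeled documents.
        soft_labels = [(d, predict_proba(model, d)) for d in unlabeled]
        # M-step: re-estimate the model parameters from both labeled and soft-labeled documents.
        model = train_nb(labeled, soft_labels)
    return model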
The ability to significantly improve the quality of text classification with a small amount of
labeled data, and the use of clustering on a large amount of unlabeled data, has been a recurring
theme in the text mining literature. For example, the method in [141] performs purely unsupervised
clustering (with no knowledge of class labels), and then as a final step assigns all documents in
the cluster to the dominant class label of that cluster (as an evaluation step for the unsupervised
clustering process in terms of its ability in matching clusters to known topics).5 It has been shown
that this approach is able to achieve a comparable accuracy of matching clusters to topics as a
supervised naive Bayes classifier trained over a small data set of about 1000 documents. Similar
results were obtained in [55] where the quality of the unsupervised clustering process was shown to
be comparable to an SVM classifier that was trained over a small data set.
5 In a supervised application, the last step would require only a small number of class labels in the cluster to be known to
[Figure: (a) margin; (b) margin violation with penalty-based slack variables]
X + b = −1. The coefficients W and b need to be learned from the training data D in order to
maximize the margin of separation between these two parallel hyperplanes. It can be shown from
elementary linear algebra that the distance between these two hyperplanes is 2/||W ||. Maximizing
this objective function is equivalent to minimizing ||W ||2 /2. The constraints are defined by the fact
that the training data points for each class are on one side of the support vector. Therefore, these
constraints are as follows:
W · Xi + b ≥ +1 ∀i : yi = +1 (11.14)
W · Xi + b ≤ −1 ∀i : yi = −1. (11.15)
This is a constrained convex quadratic optimization problem, which can be solved using Lagrangian
methods. In practice, an off-the-shelf optimization solver may be used to achieve the same goal.
In practice, the data may not be linearly separable. In such cases, soft-margin methods may
be used. A slack variable ξi ≥ 0 is introduced for each training instance, and a training instance is allowed to violate the support vector constraint for a penalty that depends on the slack. This situation is illustrated in Figure 1.2(b). Therefore, the new set of constraints is now as follows:
W · Xi + b ≥ +1 − ξi ∀i : yi = +1 (11.16)
W · Xi + b ≤ −1 + ξi ∀i : yi = −1 (11.17)
ξi ≥ 0. (11.18)
Note that additional non-negativity constraints also need to be imposed on the slack variables. The objective function is now ||W||²/2 + C · ∑_{i=1}^{n} ξi. The constant C regulates the relative importance of the margin and the slack requirements. In other words, small values of C tolerate more margin violations (a softer margin), whereas large values of C make the approach behave more like a hard-margin SVM. It
is also possible to solve this problem using off-the-shelf optimization solvers.
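In practice, such a soft-margin linear SVM is rarely implemented from scratch, since off-the-shelf solvers are readily available. The toy sketch below uses scikit-learn, which is an assumed external dependency rather than part of the methods discussed here; the parameter C plays exactly the role described above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds online now", "meeting agenda attached",
        "win free money fast", "lunch tomorrow at noon?"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = non-spam (toy data)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
clf = LinearSVC(C=1.0)                      # C trades the margin off against the slack penalties
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free money today"])))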
It is also possible to use transformations on the feature variables in order to design non-linear
SVM methods. In practice, non-linear SVM methods are learned using kernel methods. The key idea
here is that SVM formulations can be solved using only pairwise dot products (similarity values)
between objects. In other words, the optimal decision about the class label of a test instance, obtained from the solution to the quadratic optimization problem in this section, can be expressed purely in terms of such pairwise dot products. These dot products can be replaced by a kernel similarity function K(Xi, Xj), which implicitly corresponds to a dot product in a transformed feature space. Commonly used kernel functions include the Gaussian radial basis function, the polynomial kernel, and the sigmoid kernel:
$$ K(X_i, X_j) = e^{-||X_i - X_j||^2 / (2\sigma^2)} \qquad (11.20) $$
$$ K(X_i, X_j) = (X_i \cdot X_j + 1)^{h} \qquad (11.21) $$
$$ K(X_i, X_j) = \tanh(\kappa \, X_i \cdot X_j - \delta) \qquad (11.22) $$
These different functions result in different kinds of nonlinear decision boundaries in the original
space, but they correspond to a linear separator in the transformed space. The performance of a
classifier can be sensitive to the choice of the kernel used for the transformation. One advantage
of kernel methods is that they can also be extended to arbitrary data types, as long as appropriate
pairwise similarities can be defined.
The first set of SVM classifiers, as adapted to the text domain, was proposed in [74–76]. A
deeper theoretical study of the SVM method has been provided in [77]. In particular, it has been
shown why the SVM classifier is expected to work well under a wide variety of circumstances. This
has also been demonstrated experimentally in a few different scenarios. For example, the work in
[49] applied the method to email data in order to classify it as spam or non-spam. It was shown that
the SVM method provides much more robust performance as compared to many other techniques
such as boosting decision trees, the rule based RIPPER method, and the Rocchio method. The SVM
method is flexible and can easily be combined with interactive user-feedback methods [123].
The major downside of SVM methods is that they are slow. Our discussion in this section shows
that the problem of finding the best separator is a Quadratic Programming problem. The number of
constraints is proportional to the number of data points. This translates directly into the number of
Lagrangian relaxation variables in the optimization problem. This can sometimes be slow, especially
for high dimensional domains such as text. It has been shown [51] that by breaking a large Quadratic
Programming problem (QP problem) into a set of smaller problems, an efficient solution can be
derived for the task.
A number of other methods have been proposed to scale up the SVM method for the special
structure of text. A key characteristic of text is that the data is high dimensional, but an individual
document contains very few features from the full lexicon. In other words, the number of non-zero
features is small. A number of different approaches have been proposed to address these issues.
The first approach [78], referred to as SVMLight, shares a number of similarities with [51]. It also breaks down the quadratic programming problem into smaller subproblems. This is achieved by using a working set of Lagrangian variables, which are optimized while keeping the
other variables fixed. The choice of variables to select is based on the gradient of the objective func-
tion with respect to these variables. This approach does not, however, fully leverage the sparsity
of text. A second approach, known as SVMPerf [79], also leverages the sparsity of text data. This
approach reduces the number of slack variables in the quadratic programming formulation, while
increasing the constraints. A cutting plane algorithm is used to solve the optimization problem effi-
ciently. The approach is shown to require O(n · s) time, where n is the number of training examples,
and s is the average number of non-zero features per training document. The reader is referred to
Chapter 10 on big-data classification, for a detailed discussion of this approach. The SVM approach
has also been used successfully [52] in the context of a hierarchical organization of the classes, as
often occurs in Web data. In this approach, a different classifier is built at different positions of the
hierarchy.
SVM methods are very popular and tend to have high accuracy in the text domain. Even the
linear SVM works rather well for text in general, though the specific accuracy is obviously data-set
dependent. An introduction to SVM methods may be found in [36, 40, 62, 132, 133, 145]. Kernel
methods for support vector machines are discussed in [132].
$$ p(C = y_i \mid X_i) = \frac{\exp(A \cdot X_i + b)}{1 + \exp(A \cdot X_i + b)} $$
This gives us a conditional generative model for yi given Xi . Putting it in another way, we assume
that the logit transformation of p(C = yi |Xi ) can be modeled by the linear combination of features
of the instance Xi , i.e.,
$$ \log \frac{p(C = y_i \mid X_i)}{1 - p(C = y_i \mid X_i)} = A \cdot X_i + b $$
Thus logistic regression is also a linear classifier as the decision boundary is determined by a linear
function of the features. In the case of binary classification, p(C = yi |Xi ) can be used to determine
the class label (e.g., using a threshold of 0.5). In the case of multi-class classification, we have
p(C = yi |Xi ) ∝ exp(A · Xi + b), and the class label with the highest value according to p(C = yi |Xi )
would be assigned to Xi. Given a set of training data points {(X1, y1), . . . , (Xn, yn)}, the logistic regression classifier can be trained by choosing parameters A to maximize the conditional likelihood $\prod_{i=1}^{n} p(y_i \mid X_i)$.
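A minimal sketch of this training procedure, using batch gradient ascent on the conditional log-likelihood, is shown below; NumPy is assumed, and the learning rate and iteration count are illustrative.

import numpy as np

def train_logistic(X, y, lr=0.1, iters=1000):
    # X: (n x d) feature matrix; y: array of labels in {0, 1}.
    n, d = X.shape
    A, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # p(C = 1 | X_i) for each training instance
        A += lr * (X.T @ (y - p)) / n            # gradient of the average conditional log-likelihood
        b += lr * np.mean(y - p)
    return A, b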
In some cases, the domain knowledge may be of the form where some sets of words are more
important than others for a classification problem. For example, in a classification application, we
may know that certain domain-words (Knowledge Words (KW)) may be more important to classifi-
cation of a particular target category than other words. In such cases, it has been shown [43] that it
may be possible to encode such domain knowledge into the logistic regression model in the form of
prior on the model parameters and use Bayesian estimation of model parameters.
It is clear that the regression classifiers are extremely similar to the SVM model for classifi-
cation. Indeed, since LLSF, Logistic Regression, and SVM are all linear classifiers, they are thus
identical at a conceptual level; the main difference among them lies in the details of the optimization
formulation and implementation. As in the case of SVM classifiers, training a regression classifier
also requires an expensive optimization process. For example, fitting LLSF requires expensive ma-
trix computations in the form of a singular value decomposition process.
The output is a predicted value of the binary class variable, which is assumed to be drawn from
{−1, +1}. The notation b denotes the bias. Thus, for a vector Xi drawn from a dimensionality of d,
the weight vector W should also contain d elements. Now consider a binary classification problem,
in which all labels are drawn from {+1, −1}. We assume that the class label of Xi is denoted by yi .
In that case, the sign of the predicted function zi yields the class label. Thus, the goal of the approach
is to learn the set of weights W with the use of the training data, so as to minimize the least squares
error (yi − zi)². The idea is that we start off with random weights and gradually update them whenever a mistake is made by applying the current function to a training example. The magnitude of the
update is regulated by a learning rate λ. This update is similar to the updates in gradient descent,
which are made for least-squares optimization. In the case of neural networks, the update function
is as follows.
$$ W^{t+1} = W^{t} + \lambda (y_i - z_i) X_i \qquad (11.24) $$
Here, W^t is the value of the weight vector in the tth iteration. It is not difficult to show that the
incremental update vector is related to the negative gradient of (yi − zi )2 with respect to W . It is
also easy to see that updates are made to the weights only when mistakes are made in classification.
When the outputs are correct, the incremental change to the weights is zero. The overall perceptron
algorithm is illustrated below.
Perceptron Algorithm
Inputs: Learning Rate: λ
Training Data (Xi , yi ) ∀i ∈ {1 . . . n}
Initialize weight vectors in W and b to small random numbers
repeat
Apply each training data to the neural network to check if the
sign of W · Xi + b matches yi ;
if sign of W · Xi + b does not match yi , then
update weights W based on learning rate λ
until weights in W converge
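A compact Python rendering of this procedure, based on the update of Equation 11.24, might look as follows; NumPy is assumed, and zero initialization and a fixed number of epochs are used here in place of random initialization and an explicit convergence check.

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=25):
    # X: (n x d) matrix of training instances; y: array of labels in {-1, +1}.
    n, d = X.shape
    W, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            z = 1.0 if (W @ X[i] + b) >= 0 else -1.0   # predicted label z_i
            if z != y[i]:                              # update only when a mistake is made
                W += lr * (y[i] - z) * X[i]            # update of Equation 11.24
                b += lr * (y[i] - z)
    return W, b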
The similarity to support vector machines is quite striking, in the sense that a linear function
is also learned in this case, and the sign of the linear function predicts the class label. In fact, the
perceptron model and support vector machines are closely related, in that both are linear function
approximators. In the case of support vector machines, this is achieved with the use of maximum
margin optimization. In the case of neural networks, this is achieved with the use of an incremental
learning algorithm, which is approximately equivalent to least squares error optimization of the
prediction.
The constant λ regulates the learning rate. The choice of learning rate is sometimes important, because learning rates that are too small will result in very slow training. On the other hand, learning rates that are too large will result in oscillation between suboptimal solutions. In practice, the learning rate is set to a large value initially, and is then allowed to gradually decrease over time. The idea here
is that initially large steps are likely to be helpful, but are then reduced in size to prevent oscillation
between suboptimal solutions. For example, after t iterations, the learning rate may be chosen to be
proportional to 1/t.
The aforementioned discussion was based on the simple perceptron architecture, which can
model only linear relationships. A natural question arises as to how a neural network may be used,
if all the classes may not be neatly separated from one another with a linear separator. For example,
in Figure 11.2, we have illustrated an example in which the classes may not be separated with the
use of a single linear separator. The use of multiple layers of neurons can be used in order to induce
such non-linear classification boundaries. The effect of such multiple layers is to induce multiple
piece-wise linear boundaries, which can be used to approximate enclosed regions belonging to a
particular class. In such a network, the outputs of the neurons in the earlier layers feed into the
neurons in the later layers. The training process of such networks is more complex, as the errors
need to be back-propagated over different layers. Some examples of such classifiers include those
discussed in [87, 126, 146, 153]. However, the general observation [135, 149] for text has been that linear classifiers generally provide results comparable to those of non-linear classifiers, and the improvements obtained from non-linear classification methods are relatively small. This suggests that the additional complexity
of building more involved non-linear models does not pay for itself in terms of significantly better
classification.
In practice, the neural network is arranged in three layers, referred to as the input layer, the hidden layer, and the output layer. The input layer only transmits the inputs forward, and therefore there are really only two layers of the neural network that can perform computations. Within the hidden layer, there can be any number of layers of neurons, in which case the neural network contains an arbitrary number of layers. In practice, only one hidden layer is used, which leads to a two-layer
network. The perceptron can be viewed as a very special kind of neural network, which contains only
a single layer of neurons (corresponding to the output node). Multilayer neural networks allow the
approximation of nonlinear functions, and complex decision boundaries, by an appropriate choice
of the network topology, and non-linear functions at the nodes. In these cases, a logistic or sigmoid
function, known as a squashing function, is also applied to the inputs of neurons in order to model
non-linear characteristics. It is possible to use different non-linear functions at different nodes. Such
general architectures are very powerful in approximating arbitrary functions in a neural network,
given enough training data and training time. This is the reason that neural networks are sometimes
referred to as universal function approximators.
In the case of single-layer perceptron algorithms, the training process is easy to perform by using
a gradient descent approach. The major challenge in training multilayer networks is that it is no
longer known for intermediate (hidden layer) nodes what their “expected” output should be. This is
only known for the final output node. Therefore, some kind of “error feedback” is required, in order
to determine the changes in the weights at the intermediate nodes. The training process proceeds in
two phases, one of which is in the forward direction, and the other is in the backward direction.
1. Forward Phase: In the forward phase, the activation function is repeatedly applied to prop-
agate the inputs from the neural network in the forward direction. Since the final output is
supposed to match the class label, the final output at the output layer provides an error value,
depending on the training label value. This error is then used to update the weights of the
output layer, and propagate the weight updates backwards in the next phase.
2. Backpropagation Phase: In the backward phase, the errors are propagated backwards through
the neural network layers. This leads to the updating of the weights in the neurons of the
different layers. The gradients at the previous layers are learned as a function of the errors
and weights in the layer ahead of it. The learning rate λ plays an important role in regulating
the rate of learning.
In practice, any arbitrary function can be approximated well by a neural network. The price of this
generality is that neural networks are often quite slow in practice. They are also sensitive to noise,
and can sometimes overfit the training data.
The previous discussion assumed only binary labels. It is possible to create a k-label neural
network, by either using a multiclass “one-versus-all” meta-algorithm, or by creating a neural net-
work architecture in which the number of output nodes is equal to the number of class labels. Each
output represents prediction to a particular label value. A number of implementations of neural net-
work methods have been studied in [41, 102, 115, 135, 149], and many of these implementations
are designed in the context of text data. It should be pointed out that both neural networks and
SVM classifiers use a linear model that is quite similar. The main difference between the two is in
how the optimal linear hyperplane is determined. Rather than using a direct optimization methodol-
ogy, neural networks use a mistake-driven approach to data classification [41]. Neural networks are
described in detail in [15, 66].
40 in most of the afore-mentioned work, depending upon the size of the underlying corpus.
In practice, it is often set empirically using cross validation.
• We perform training data aggregation during pre-processing, in which clusters or groups of
documents belonging to the same class are created. A representative meta-document is created
from each group. The same k-nearest neighbor approach is applied as discussed above, except
that it is applied to this new set of meta-documents (or generalized instances [88]) rather
than to the original documents in the collection. A pre-processing phase of summarization is
useful in improving the efficiency of the classifier, because it significantly reduces the number
of distance computations. In some cases, it may also boost the accuracy of the technique,
especially when the data set contains a large number of outliers. Some examples of such
methods are discussed in [64, 88, 125].
A method for performing nearest neighbor classification in text data is the WHIRL method dis-
cussed in [31]. The WHIRL method is essentially a method for performing soft similarity joins on
the basis of text attributes. By soft similarity joins, we refer to the fact that the two records may not
be exactly the same on the joined attribute, but may be approximately similar based on a pre-defined
notion of similarity. It has been observed in [31] that any method for performing a similarity-join
can be adapted as a nearest neighbor classifier, by using the relevant text documents as the joined
attributes.
One observation in [155] about nearest neighbor classifiers was that feature selection and doc-
ument representation play an important part in the effectiveness of the classification process. This
is because most terms in large corpora may not be related to the category of interest. Therefore, a
number of techniques were proposed in [155] in order to learn the associations between the words
and the categories. These are then used to create a feature representation of the document, so that
the nearest neighbor classifier is more sensitive to the classes in the document collection. A similar
observation has been made in [63], in which it has been shown that the addition of weights to the
terms (based on their class-sensitivity) significantly improves the underlying classifier performance.
The nearest neighbor classifier has also been extended to the temporally-aware scenario [130], in
which the timeliness of a training document plays a role in the model construction process. In or-
der to incorporate such factors, a temporal weighting function has been introduced in [130], which
allows the importance of a document to gracefully decay with time.
For the case of classifiers that use grouping techniques, the most basic among such methods
is that proposed by Rocchio in [125]. In this method, a single representative meta-document is
constructed from each of the representative classes. For a given class, the weight of the term t_k is the normalized frequency of the term t_k in documents belonging to that class, minus the normalized frequency of the term in documents which do not belong to that class. Specifically, let f_p^k be the expected weight of term t_k in a randomly picked document belonging to the positive class, and f_n^k be the expected weight of term t_k in a randomly picked document belonging to the negative class. Then, for weighting parameters α_p and α_n, the weight f_rocchio^k is defined as follows:

$$ f_{rocchio}^{k} = \alpha_p \cdot f_p^{k} - \alpha_n \cdot f_n^{k} \qquad (11.25) $$

The weighting parameters α_p and α_n are picked so that the positive class has much greater weight as compared to the negative class. For the relevant class, we now have a vector representation of the terms $(f_{rocchio}^{1}, f_{rocchio}^{2}, \ldots, f_{rocchio}^{n})$. This approach is applied separately to each of the classes,
in order to create a separate meta-document for each class. For a given test document, the closest
meta-document to the test document can be determined by using a vector-based dot product or
other similarity metric. The corresponding class is then reported as the relevant label. The main
distinguishing characteristic of the Rocchio method is that it creates a single profile of the entire
class. This class of methods is also referred to as the Rocchio framework. The main disadvantage
of this method is that if a single class occurs in multiple disjoint clusters that are not very well
connected in the data, then the centroid of these examples may not represent the class behavior
very well. This is likely to be a source of inaccuracy for the classifier. The main advantage of this
method is its extreme simplicity and efficiency; the training phase is linear in the corpus size, and
the number of computations in the testing phase is linear in the number of classes, since all the
documents have already been aggregated into a small number of classes. An analysis of the Rocchio
algorithm, along with a number of different variations may be found in [74].
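As a concrete illustration, the following minimal sketch builds one Rocchio profile per class from row-normalized document vectors and assigns a test document to the class with the most similar profile. The function names and the weighting values α_p = 16 and α_n = 4 are illustrative choices and are not values prescribed by [125] or [74].

```python
import numpy as np

def rocchio_train(X, y, alpha_p=16.0, alpha_n=4.0):
    """Build one meta-document (profile) per class.

    X: (n_docs, n_terms) row-normalized term-weight matrix
    y: (n_docs,) class labels
    """
    profiles = {}
    for c in np.unique(y):
        pos = X[y == c].mean(axis=0)   # expected term weights inside class c
        neg = X[y != c].mean(axis=0)   # expected term weights outside class c
        profiles[c] = alpha_p * pos - alpha_n * neg
    return profiles

def rocchio_predict(x, profiles):
    """Report the class whose profile has the largest dot-product similarity."""
    return max(profiles, key=lambda c: float(x @ profiles[c]))
```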
In order to handle the shortcomings of the Rocchio method, a number of classifiers have also
been proposed [2, 19, 64, 88], which explicitly perform the clustering of each of the classes in the
document collection. These clusters are used in order to generate class-specific profiles. These pro-
files are also referred to as generalized instances in [88]. For a given test instance, the label of the
closest generalized instance is reported by the algorithm. The method in [19] is also a centroid-based
classifier, but is specifically designed for the case of text documents. The work in [64] shows that
there are some advantages in designing schemes in which the similarity computations take account
of the dependencies between the terms of the different classes.
We note that the nearest neighbor classifier can be used in order to generate a ranked list of
categories for each document. In cases where a document is related to multiple categories, these
can be reported for the document, as long as a thresholding method is available. The work in [157]
studies a number of thresholding strategies for the k-nearest neighbor classifier. It has also been
suggested in [157] that these thresholding strategies can be used to understand the thresholding
behavior of other classifiers that produce category rankings.
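For multi-label assignment, a simple rank-and-threshold scheme over k-nearest neighbor scores can be sketched as follows; the similarity-weighted scoring rule and the per-category thresholds are illustrative stand-ins for the specific strategies analyzed in [157].

```python
import numpy as np

def knn_category_scores(sims, neighbor_labels, n_categories, k=10):
    """Score each category by the summed similarity of the k most similar
    training documents that carry that category label."""
    scores = np.zeros(n_categories)
    for idx in np.argsort(-sims)[:k]:
        for c in neighbor_labels[idx]:     # a training document may carry several labels
            scores[c] += sims[idx]
    return scores

def assign_categories(scores, thresholds):
    """Report every category whose score exceeds its (validation-tuned) threshold."""
    return [c for c, s in enumerate(scores) if s >= thresholds[c]]
```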
The work in [25] studies the classification of hypertext documents, in which the class labels of the
linked documents (i.e., the nearest neighbors in the link structure) may or may not be available. When
they are not available, a relaxation labeling method was proposed in
order to perform the classification. Two methods have been proposed in this work:
• Fully Supervised Case of Radius One Enhanced Linkage Analysis: In this case, it is as-
sumed that all the neighboring class labels are known. In such a case, a Bayesian approach
is utilized in order to treat the labels on the nearest neighbors as features for classification
purposes. In this case, the linkage information is the sole information that is used for classifi-
cation purposes.
• When the class labels of the nearest neighbors are not known: In this case, an iterative
approach is used for combining text and linkage based classification. Rather than using the
pre-defined labels (which are not available), we perform a first labeling of the neighboring
documents with the use of document content. These labels are then used to determine the label
of the target document, with the use of both the local text and the class labels of the neighbors.
This approach is applied iteratively to re-define the labels of both the target document and its
neighbors until convergence is achieved (a minimal sketch of such an iterative scheme follows this list).
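A minimal sketch of such an iterative scheme is shown below. It combines a text-only class distribution with the averaged class distributions of linked neighbors; the mixing weight lambda_link and the simple averaging rule are illustrative assumptions rather than the exact formulation of [25].

```python
import numpy as np

def iterative_link_labeling(text_probs, adjacency, known_labels, n_classes,
                            lambda_link=0.5, n_iters=10):
    """Iteratively refine document class distributions using link structure.

    text_probs:   (n_docs, n_classes) probabilities from a text-only classifier
    adjacency:    (n_docs, n_docs) 0/1 link matrix
    known_labels: dict {doc_index: class_index} for pre-labeled documents
    """
    probs = text_probs.copy()
    for doc, label in known_labels.items():           # clamp known labels
        probs[doc] = np.eye(n_classes)[label]
    for _ in range(n_iters):
        degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
        neighbor_probs = adjacency @ probs / degree   # average label distribution of neighbors
        probs = (1 - lambda_link) * text_probs + lambda_link * neighbor_probs
        probs /= probs.sum(axis=1, keepdims=True)
        for doc, label in known_labels.items():
            probs[doc] = np.eye(n_classes)[label]
    return probs.argmax(axis=1)
```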
The conclusion from the work in [25] is that a combination of text and linkage based classification
always improves the accuracy of a text classifier. Even when none of the neighbors of the document
have known classes, it always seemed to be beneficial to add link information to the classification
process. When the class labels of all the neighbors are known, the advantages of using the scheme
seem to be quite significant.
An additional idea in the paper is that of the use of bridges in order to further improve the
classification accuracy. The core idea in the use of a bridge is the use of 2-hop propagation for
link-based classification. The results with the use of such an approach are somewhat mixed, as the
accuracy seems to reduce with an increasing number of hops. The work in [25] shows results on a
number of different kinds of data sets such as the Reuters database, US patent database, and Yahoo!
Since the Reuters database contains the least amount of noise, pure text classifiers were able to do a
good job. On the other hand, the US patent database and the Yahoo! database contain an increasing
amount of noise, which reduces the accuracy of text classifiers. An interesting observation in [25]
was that a scheme that simply absorbed the neighbor text into the current document performed
significantly worse than a scheme that was based on pure text-based classification. This is because
there are often significant cross-boundary linkages between topics, and such linkages are able to
confuse the classifier. A publicly available implementation of this algorithm may be found in the
NetKit tool kit available in [106].
Another relaxation labeling method for graph-based document classification is proposed in [5].
In this technique, the probability that the end points of a link take on a particular pair of class
labels is quantified. We refer to this as the link-class pair probability. The posterior probability
of classification of a node T into class i is expressed as the sum of the probabilities of pairing
all possible class labels of the neighbors of T with class label i. We note that a significant percentage
of these (exponentially many) possibilities is pruned, since only the currently most probable
labelings are used in this approach: in the case of hard labeling, the single most likely labeling is
retained, whereas in the case of soft labeling, a small set of the most probable labelings is retained.
For this purpose, it is assumed that the class labels of the different
neighbors of T (while dependent on T) are independent of each other. This is similar to the naive
independence assumption, which is often used in Bayes classifiers. Therefore, the probability for a particular
combination of labels on the neighbors can be expressed as the product of the corresponding link-
class pair probabilities. The approach starts off with the use of a standard content-based Bayes or
SVM classifier in order to assign the initial labels to the nodes. Then, an iterative approach is used
to refine the labels, by using the most probable label estimations from the previous iteration in
order to refine the labels in the current iteration. We note that the link-class pair probabilities can
be estimated as the smoothed fraction of edges in the last iteration that contain a particular pair of
classes as the end points (hard labeling), or as the average product of node probabilities over all
edges that take on that particular class pair (soft labeling). This approach is repeated until convergence.

FIGURE 11.3: An illustration of the semi-bipartite content-structure transformation (structural nodes and word nodes).
Another method that uses a naive Bayes classifier to enhance link-based classification is pro-
posed in [118]. This method incrementally assigns class labels, starting off with a temporary as-
signment and then gradually making them permanent. The initial class assignment is based on a
simple Bayes expression based on both the terms and links in the document. In the final catego-
rization, the method changes the term weights for Bayesian classification of the target document
with the terms in the neighbor of the current document. This method uses a broad framework which
is similar to that in [25], except that it differentiates between the classes in the neighborhood of
a document in terms of their influence on the class label of the current document. For example,
documents for which the class label was either already available in the training data, or for which
the algorithm has performed a final assignment, have a different confidence weighting factor than
those documents for which the class label is currently temporarily assigned. Similarly, documents
that belong to a completely different subject (based on content) are also removed from consideration
from the assignment. Then, the Bayesian classification is performed with the re-computed weights,
so that the document can be assigned a final class label. By using this approach the technique is able
to compensate for the noise and inconsistencies in the link structures among different documents.
One major difference between the work in [25] and [118] is that the former is focused on using
link information in order to propagate the labels, whereas the latter attempts to use the content of the
neighboring pages. Another work along this direction, which uses the content of the neighboring
pages more explicitly, is proposed in [121]. In this case, the content of the neighboring pages is
broken up into different fields such as titles, anchor text, and general text. The different fields are
given different levels of importance, which is learned during the classification process. It was shown
in [121] that the use of title fields and anchor fields is much more relevant than the general text. This
accounts for much of the accuracy improvements demonstrated in [121].
The work in [3] proposes a method for dynamic classification in text networks with the use of
a random-walk method. The key idea in the work is to transform the combination of structure and
content in the network into an augmented network in which the content is represented structurally. Thus, we transform the original
network G = (N, A, C) into an augmented network G_A = (N ∪ N_c, A ∪ A_c), where N_c and A_c are
additional sets of nodes and edges added to the original network. Each node in N_c corresponds to
a distinct word in the lexicon. Thus, the augmented network contains the original structural nodes
N, and a new set of word nodes N_c. The added edges in A_c are undirected edges added between
the structural nodes N and the word nodes N_c. Specifically, an edge (i, j) is added to A_c if the
word i ∈ N_c occurs in the text content corresponding to the node j ∈ N. Thus, this network is
semi-bipartite, in that there are no edges between the different word nodes. An illustration of the
semi-bipartite content-structure transformation is provided in Figure 11.3.
It is important to note that once such a transformation has been performed, any of the collective
classification methods [14] can be applied to the structural nodes. In the work in [3], a random-walk
method has been used in order to perform the collective classification of the underlying nodes. In
this method, repeated random walks are performed starting at the unlabeled nodes that need to be
classified. The random walks are defined only on the structural nodes, and each hop may either be
a structural hop or a content hop. We perform l different random walks, each of which contains
h nodes. Thus, a total of l · h nodes are encountered over the different walks. The class label of the
starting node is predicted to be the label that occurs with the highest frequency among these l · h
nodes. The error of this random walk-based sampling process has been
bounded in [3]. In addition, the method in [3] can be adapted to dynamic content-based networks,
in which the nodes, edges, and their underlying content continuously evolve over time. The method
in [3] has been compared to that proposed in [25] (based on the implementation in [106]), and it has
been shown that the classification methods of [3] are significantly superior.
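The following sketch illustrates the flavor of this approach: build the semi-bipartite adjacency structure and classify an unlabeled node by the majority label seen on short random walks. The data structures, the handling of content hops, and the parameters l and h are illustrative assumptions, not the exact design of [3].

```python
import random
from collections import Counter, defaultdict

def build_semi_bipartite(structural_edges, node_text):
    """structural_edges: iterable of (u, v) pairs over structural node ids;
    node_text: dict {structural node id: list of words in its content}."""
    adj = defaultdict(set)        # structural node -> structural or word neighbors
    word_adj = defaultdict(set)   # word -> structural nodes containing it
    for u, v in structural_edges:
        adj[u].add(v); adj[v].add(u)
    for node, words in node_text.items():
        for w in words:
            adj[node].add(('word', w))
            word_adj[w].add(node)
    return adj, word_adj

def random_walk_label(start, adj, word_adj, labels, l=100, h=5):
    """Predict the label of `start` as the most frequent label among labeled
    structural nodes visited over l walks of h hops each."""
    votes = Counter()
    for _ in range(l):
        node = start
        for _ in range(h):
            if not adj[node]:
                break
            nxt = random.choice(list(adj[node]))
            if isinstance(nxt, tuple):       # content hop: step through a shared word
                nxt = random.choice(list(word_adj[nxt[1]]))
            node = nxt
            if node in labels:
                votes[labels[node]] += 1
    return votes.most_common(1)[0][0] if votes else None
```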
Another method for classification of linked text data is discussed in [160]. This method designs
two separate regularization conditions; one is for the text-only classifier (also referred to as the local
classifier), and the other is for the link information in the network structure. These regularizers are
expressed in terms of the underlying kernels; the link regularizer is related to the standard graph
regularizer used in the machine learning literature, and the text regularizer is expressed in terms of
the kernel gram matrix. These two regularization conditions are combined in two possible ways.
One can either use linear combinations of the regularizers, or linear combinations of the associ-
ated kernels. It was shown in [160] that both combination methods perform better than either pure
structure-based or pure text-based methods. The method using a linear combination of regularizers
was slightly more accurate and robust than the method that used a linear combination of the kernels.
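A sketch of the kernel-combination variant is shown below: a text kernel and a link kernel over the same set of documents are mixed with a convex weight and fed to a kernel classifier. The mixing weight mu and the use of scikit-learn's precomputed-kernel SVM are illustrative assumptions, not the regularization formulation of [160].

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(K_text, K_link, mu=0.5):
    """Convex combination of a text Gram matrix and a link Gram matrix."""
    return mu * K_text + (1.0 - mu) * K_link

def train_and_predict(K_text, K_link, y_train, n_train, mu=0.5):
    """The first n_train documents are labeled; the remaining rows are test documents."""
    K = combined_kernel(K_text, K_link, mu)
    clf = SVC(kernel="precomputed")
    clf.fit(K[:n_train, :n_train], y_train)
    return clf.predict(K[n_train:, :n_train])
```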
A method in [38] designs a classifier that combines a naive Bayes classifier (on the text domain),
and a rule-based classifier (on the structural domain). The idea is to invent a set of predicates, which
are defined in the space of links, pages, and words. A variety of predicates (or relations) are defined
depending upon the presence of the word in a page, linkages of pages to each other, the nature of
the anchor text of the hyperlink, and the neighborhood words of the hyperlink. These essentially
encode the graph structure of the documents in the form of boolean predicates, and can also be used
to construct relational learners. The main contribution in [38] is to combine the relational learners
on the structural domain with the naive Bayes approach in the text domain. We refer the reader
to [38, 39] for the details of the algorithm, and the general philosophy of such relational learners.
One of the interesting methods for collective classification in the context of email networks was
proposed in [29]. The technique in [29] is designed to classify speech acts in email. Speech acts
essentially characterize whether an email refers to a particular kind of action (such as scheduling
a meeting). It has been shown in [29] that the use of sequential thread-based information from the
email is very useful for the classification process. An email system can be modeled as a network
in several ways, one of which is to treat an email as a node, and the edges as the thread relation-
ships between the different emails. In this sense, the work in [29] devises a network-based mining
procedure that uses both the content and the structure of the email network. However, this work is
rather specific to the case of email networks, and it is not clear whether the technique can be adapted
(effectively) to more general networks.
A different line of solutions to such problems, which are defined on a heterogeneous feature
space, is to use latent space methods in order to simultaneously homogenize the feature space, and
also determine the latent factors in the underlying data. The resulting representation can be used
in conjunction with any of the text classifiers that are designed for latent space representations. A
method in [162] uses a matrix factorization approach in order to construct a latent space from the
underlying data. Both supervised and unsupervised methods were proposed for constructing the
latent space from the underlying data. It was then shown in [162] that this feature representation
provides more accurate results, when used in conjunction with an SVM-classifier.
Finally, a method for Web page classification is proposed in [138]. This method is designed
for using intelligent agents in Web page categorization. The overall approach relies on the design
of two functions that correspond to scoring Web pages and links respectively. An advice language
is created, and a method is proposed for mapping advice to neural networks. It has been shown
in [138] how this general purpose system may be used in order to find home pages on the Web.
• For a given test instance, a specific classifier is selected, depending upon the performance of
the classifiers that are closest to that test instance.
• A weighted combination of the results from the different classifiers is used, where the weight
is regulated by the performance of the classifier on validation instances that are most similar
to the current test instance.
The last two methods above try to select the final classification in a smarter way by discriminating
between the performances of the classifiers in different scenarios. The work in [89] used category-
averaged features in order to construct a different classifier for each category.
The major challenge in ensemble learning is to provide the appropriate combination of classifiers
for a particular scenario. Clearly, this combination can significantly vary with the scenario and
the data set. In order to achieve this goal, the work in [12] proposes a method for the probabilistic
combination of text classifiers. The work introduces a number of variables known as reliability
variables in order to regulate the importance of the different classifiers. These reliability variables
are learned dynamically for each situation, so as to provide the best classification.
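A minimal sketch of this kind of locally weighted combination is shown below: each classifier's vote is weighted by its accuracy on the validation documents most similar to the test document. The cosine-similarity neighborhood and the simple accuracy-based weights are illustrative stand-ins for the reliability variables of [12].

```python
import numpy as np

def local_weights(test_vec, val_vecs, val_labels, classifiers, k=20):
    """Weight each classifier by its accuracy on the k validation documents
    most similar (by cosine) to the test document."""
    sims = val_vecs @ test_vec / (np.linalg.norm(val_vecs, axis=1)
                                  * np.linalg.norm(test_vec) + 1e-12)
    nearest = np.argsort(-sims)[:k]
    weights = []
    for clf in classifiers:
        preds = clf.predict(val_vecs[nearest])
        weights.append((preds == val_labels[nearest]).mean() + 1e-6)
    return np.array(weights)

def weighted_vote(test_vec, classifiers, weights, n_classes):
    """Combine the individual predictions with the locally estimated weights.
    Assumes integer class labels 0..n_classes-1."""
    scores = np.zeros(n_classes)
    for clf, w in zip(classifiers, weights):
        scores[clf.predict(test_vec.reshape(1, -1))[0]] += w
    return int(scores.argmax())
```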
In many applications, the cost of misclassification is much higher for instances of one class than
another. For example, while it may be tolerable to misclassify a few spam emails
(thereby allowing them into the inbox), it is much more undesirable to incorrectly mark a legitimate
email as spam. Cost-sensitive classification problems also naturally arise in cases in which one
class is more rare than the other, and it is therefore more desirable to identify the rare examples.
In such cases, it is desirable to optimize the cost-weighted accuracy of the classification process.
We note that many of the broad techniques that have been designed for non-textual data [48, 50,
53] are also applicable to text data, because the specific feature representation is not material to
how standard algorithms are modified to the cost-sensitive case. A good understanding of cost-
sensitive classification both for the textual and non-textual case may be found in [4, 48, 53]. Some
examples of how classification algorithms may be modified in straightforward ways to incorporate
cost-sensitivity are as follows:
• In a decision-tree, the split condition at a given node tries to maximize the accuracy of its chil-
dren nodes. In the cost-sensitive case, the split is engineered to maximize the cost-sensitive
accuracy.
• In rule-based classifiers, the rules are typically quantified and ordered by measures corre-
sponding to their predictive accuracy. In the cost-sensitive case, the rules are quantified and
ordered by their cost-weighted accuracy.
• In Bayesian classifiers, the posterior probabilities are weighted by the cost of the class for
which the prediction is made.
• In linear classifiers, the optimum hyperplane separating the classes is determined in a cost-
weighted sense. Such costs can typically be incorporated in the underlying objective function.
For example, the least-square error in the objective function of the LLSF method can be
weighted by the underlying costs of the different classes.
• In a k-nearest neighbor classifier, we report the cost-weighted majority class among the k
nearest neighbors of the test instance (a small sketch of this variant follows this list).
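A small sketch of the cost-weighted k-nearest neighbor variant follows; the explicit misclassification-cost matrix is one common way to encode the costs and is an assumed input here.

```python
import numpy as np

def cost_weighted_knn(test_vec, train_vecs, train_labels, misclass_cost, k=10):
    """Report the class minimizing expected misclassification cost over the
    k nearest neighbors. misclass_cost[i, j] is the cost of predicting class j
    when the true class is i; labels are assumed to be integers 0..n_classes-1."""
    dists = np.linalg.norm(train_vecs - test_vec, axis=1)
    neighbors = train_labels[np.argsort(dists)[:k]]
    counts = np.bincount(neighbors, minlength=misclass_cost.shape[0])
    expected_cost = counts @ misclass_cost   # expected cost of each candidate prediction
    return int(expected_cost.argmin())
```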
We note that the use of a cost-sensitive approach is essentially a change of the objective func-
tion of classification, which can also be formulated as an optimization problem. While the standard
classification problem generally tries to optimize accuracy, the cost-sensitive version tries to opti-
mize a cost-weighted objective function. A more general approach was proposed in [58] in which a
meta-algorithm was proposed for optimizing a specific figure of merit such as the accuracy, preci-
sion, recall, or F1-measure. Thus, this approach generalizes this class of methods to any arbitrary
objective function, making it essentially an objective-centered classification method. A generalized
probabilistic descent algorithm (with the desired objective function) is used in conjunction with the
classifier of interest in order to derive the class labels of the test instance. The work in [58] shows
the advantages of using the technique over a standard SVM-based classifier.
[Figure: Examples of Class A and Class B (points marked X), together with the old decision boundary.]
In many scenarios, additional sources of data, such as unlabeled documents or labeled data from
related domains, are used to enhance the classification process. These methods are briefly described in this section.
A survey on transfer learning methods may be found in [105].
“politics” category in the unlabeled instances. Thus, the unlabeled instances can be used to learn the
relevance of these less common features to the classification process, especially when the amount
of available training data is small.
Similarly, when the data are clustered, each cluster in the data is likely to predominantly contain
data records of one class or the other. The identification of these clusters only requires unsuper-
vised data rather than labeled data. Once the clusters have been identified from unlabeled data,
only a small number of labeled examples are required in order to determine confidently which label
corresponds to which cluster. Therefore, when a test example is classified, its clustering structure
provides critical information for its classification process, even when a smaller number of labeled
examples are available. It has been argued in [117] that the accuracy of the approach may increase
exponentially with the number of labeled examples, as long as the assumption of smoothness in
label structure variation holds true. Of course, in real life, this may not be true. Nevertheless, it
has been shown repeatedly in many domains that the addition of unlabeled data provides signifi-
cant advantages for the classification process. An argument for the effectiveness of semi-supervised
learning, which uses the spectral clustering structure of the data, may be found in [18]. In some
domains such as graph data, semisupervised learning is the only way in which classification may be
performed. This is because a given node may have very few neighbors of a specific class.
Text classification from labeled and unlabeled documents can, for example, be performed with the
EM algorithm [117]. Semi-supervised methods
are implemented in a wide variety of ways. Some of these methods directly try to label the unlabeled
data in order to increase the size of the training set. The idea is to incrementally add the most
confidently predicted label to the training data. This is referred to as self training. Such methods
have the downside that they run the risk of overfitting. For example, when an unlabeled example is
added to the training data with a specific label, the label might be incorrect because of the specific
characteristics of the feature space, or the classifier. This might result in further propagation of the
errors. The results can be quite severe in many scenarios.
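A minimal self-training loop is sketched below; the logistic regression base classifier, the confidence threshold, and the stopping rule are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Iteratively move the most confidently predicted unlabeled documents
    into the training set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        new_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~confident]
    return clf
```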
Therefore, semi-supervised methods need to be carefully designed in order to avoid overfitting.
An example of such a method is co-training [17], which partitions the attribute set into two subsets,
on which classifier models are independently constructed. The top label predictions of one classifier
are used to augment the training data of the other, and vice-versa. Specifically, the steps of co-
training are as follows:
1. Partition the feature set into two disjoint subsets f1 and f2.
2. Train two independent classifier models M1 and M2, which use the disjoint feature sets f1
and f2, respectively.
3. Add the unlabeled instance with the most confidently predicted label from M1 to the training
data for M2 and vice-versa.
Since the two classifiers are independently constructed on different feature sets, such an approach
avoids overfitting. The partitioning of the feature set into f1 and f2 can be performed in a variety
of ways. While it is possible to perform random partitioning of features, it is generally advisable
to leverage redundancy in the feature set to construct f1 and f2 . Specifically, each feature set fi
should be picked so that the features in fj (for j ≠ i) are redundant with respect to it. Therefore,
each feature set represents a different view of the data, which is sufficient for classification. This
ensures that the “confident” labels assigned to the other classifier are of high quality. At the same
time, overfitting is avoided to at least some degree, because of the disjoint nature of the feature
set used by the two classifiers. Typically, an erroneously assigned class label will be more easily
detected by the disjoint feature set of the other classifier, which was not used to assign the erroneous
label. For a test instance, each of the classifiers is used to make a prediction, and the combination
score from the two classifiers may be used. For example, if the naive Bayes method is used as the
base classifier, then the product of the two classifier scores may be used.
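The co-training loop can be sketched as follows. The shared labeled pool, the naive Bayes base models, and the promotion of a single most confident example per view per round are simplifications for illustration and do not reproduce the exact procedure of [17].

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X_lab, y_lab, X_unlab, split, rounds=20):
    """Co-training with two naive Bayes models on disjoint feature views;
    `split` is the column index separating view f1 (X[:, :split]) from f2."""
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        if len(X_u) == 0:
            break
        m1 = MultinomialNB().fit(X_l[:, :split], y_l)
        m2 = MultinomialNB().fit(X_l[:, split:], y_l)
        p1 = m1.predict_proba(X_u[:, :split])
        p2 = m2.predict_proba(X_u[:, split:])
        # Each view promotes its single most confident unlabeled example.
        picks = {int(p1.max(axis=1).argmax()), int(p2.max(axis=1).argmax())}
        for i in sorted(picks, reverse=True):          # delete larger indices first
            better = p1[i] if p1[i].max() >= p2[i].max() else p2[i]
            label = m1.classes_[better.argmax()]
            X_l = np.vstack([X_l, X_u[i:i + 1]])
            y_l = np.append(y_l, label)
            X_u = np.delete(X_u, i, axis=0)
    m1 = MultinomialNB().fit(X_l[:, :split], y_l)
    m2 = MultinomialNB().fit(X_l[:, split:], y_l)
    return m1, m2
```

At prediction time, the class probabilities of the two returned models can be multiplied, as suggested above for naive Bayes base classifiers.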
The aforementioned methods are generic meta-algorithms for semi-supervised learning. It is also
possible to design variations of existing classification algorithms such as the EM-method, or trans-
ductive SVM classifiers. EM-based methods [117] are very popular for text data. These methods
attempt to model the joint probability distributions of the features and the labels with the use of par-
tially supervised clustering methods. This allows the estimation of the conditional probabilities in
the Bayes classifier to be treated as missing data, for which the EM-algorithm is very effective. This
approach shows a connection between the partially supervised clustering and partially supervised
classification problems. The results show that partially supervised classification is most effective
when the clusters in the data correspond to the different classes. In transductive SVMs, the labels
of the unlabeled examples are also treated as integer decision variables. The SVM formulation is
modified in order to determine the maximum margin SVM, with the best possible label assignment
of unlabeled examples. The SVM classifier has also been shown to be useful in large scale scenarios
in which a large amount of unlabeled data and a small amount of labeled data is available [139].
This is essentially a semi-supervised approach because of its use of unlabeled data in the classi-
fication process. This technique is also quite scalable because of its use of a number of modified
quasi-Newton techniques, which tend to be efficient in practice. Surveys on semi-supervised meth-
ods may be found in [27, 163].
In the text domain, transfer of knowledge is typically performed in one of two settings:
1. Crosslingual Learning: In this case, the documents from one language are transferred to the
other. An example of such an approach is discussed in [11].
2. Crossdomain learning: In this case, knowledge from the text domain is typically transferred
to multimedia, and vice-versa. An example of such an approach, which works between the
text and image domain, is discussed in [119, 121].
Broadly speaking, transfer learning methods fall into one of the following four categories:
1. Instance-based Transfer: In this case, the feature spaces of the two domains are highly over-
lapping; even the class labels may be the same. Therefore, it is possible to transfer knowledge
from one domain to the other by simply re-weighting the features.
2. Feature-based Transfer: In this case, there may be some overlaps among the features, but
a significant portion of the feature space may be different. Often, the goal is to perform a
transformation of each feature set into a new low dimensional space, which can be shared
across related tasks.
3. Parameter-Based Transfer: In this case, the motivation is that a good training model has
typically learned a lot of structure. Therefore, if two tasks are related, then the structure can
be transferred to learn the target task.
4. Relational-Transfer Learning: The idea here is that if two domains are related, they may share
some similarity relations among objects. These similarity relations can be used for transfer
learning across domains.
The major challenge in such transfer learning methods is that negative transfer can occur in
some cases, when the side information used is very noisy or irrelevant to the learning process. There-
fore, it is critical to use the transfer learning process in a careful and judicious way in order to truly
improve the quality of the results. A survey on transfer learning methods may be found in [105],
and a detailed discussion on this topic may be found in Chapter 21.
Text documents can be viewed as discrete, set-valued records drawn from a lexicon. The domains
of these sets are rather large, since they comprise the entire lexicon. Therefore, text min-
ing techniques need to be designed to effectively manage large numbers of elements with varying
frequencies. Almost all the known techniques for classification such as decision trees, rules, Bayes
methods, nearest neighbor classifiers, SVM classifiers, and neural networks have been extended to
the case of text data. Recently, a considerable amount of emphasis has been placed on linear clas-
sifiers such as neural networks and SVM classifiers, with the latter being particularly suited to the
characteristics of text data. In recent years, the advancement of Web and social network technolo-
gies has led to a tremendous interest in the classification of text documents containing links or
other meta-information. Recent research has shown that the incorporation of linkage information
into the classification process can significantly improve the quality of the underlying results.
Bibliography
[1] C. C. Aggarwal, C. Zhai. A survey of text classification algorithms, In Mining Text Data, pages
163–222, Springer, 2012.
[2] C. C. Aggarwal, S. C. Gates, and P. S. Yu. On using partial supervision for text categorization,
IEEE Transactions on Knowledge and Data Engineering, 16(2):245–255, 2004.
[3] C. C. Aggarwal and N. Li. On node classification in dynamic content-based networks, SDM
Conference, 2011.
[4] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras, and C. Spyropoulos. An evalua-
tion of naive Bayesian anti-spam filtering. Proceedings of the Workshop on Machine Learning
in the New Information Age, in conjunction with ECML Conference, 2000.
http://arxiv.org/PS_cache/cs/pdf/0006/0006013v1.pdf
[5] R. Angelova and G. Weikum. Graph-based text classification: Learn from your neighbors,
ACM SIGIR Conference, pages 485–492, 2006.
[6] C. Apte, F. Damerau, and S. Weiss. Automated learning of decision rules for text categoriza-
tion, ACM Transactions on Information Systems, 12(3):233–251, 1994.
[7] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential func-
tion method in pattern recognition learning, Automation and Remote Control, 25: 821–837,
1964.
[8] L. Baker and A. McCallum. Distributional clustering of words for text classification, ACM
SIGIR Conference, pages 96–103, 1998.
[9] R. Bekkerman, R. El-Yaniv, Y. Winter, and N. Tishby. On feature distributional clustering for
text categorization, ACM SIGIR Conference, pages 146–153, 2001.
[10] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding, ICML Con-
ference, pages 27–34, 2002.
[11] N. Bel, C. Koster, and M. Villegas. Cross-lingual text categorization. In Research and ad-
vanced technology for digital libraries, Springer, Berlin Heidelberg, pages 126–139, 2003.
[12] P. Bennett, S. Dumais, and E. Horvitz. Probabilistic combination of text classifiers using reli-
ability indicators: Models and results, ACM SIGIR Conference, pages 207–214, 2002.
[13] P. Bennett and N. Nguyen. Refined experts: Improving classification in large taxonomies. Pro-
ceedings of the 32nd ACM SIGIR Conference, pages 11–18, 2009.
[14] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks, In
Social Network Data Analytics, Ed. Charu Aggarwal, Springer, 2011.
[15] C. Bishop. Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[16] D. M. Blei and J. D. McAuliffe. Supervised topic models, NIPS 2007.
[17] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training, COLT, pages
92–100, 1998.
[18] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds, Machine Learn-
ing, 56:209–239, 2004.
[19] D. Boley, M. Gini, R. Gross, E.-H. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and
J. Moore. Partitioning-based clustering for web document categorization, Decision Support
Systems, 27(3):329–341, 1999.
[20] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees, CRC
Press, Boca Raton, FL, 1984.
[21] L. Breiman. Bagging predictors, Machine Learning, 24(2):123–140, 1996.
[22] L. Cai, T. Hofmann. Text categorization by boosting automatically extracted concepts, ACM
SIGIR Conference, pages 182–189, 2003.
[23] S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple
linear discriminant projections, VLDB Journal, 12(2):172–185, 2003.
[24] S. Chakrabarti, B. Dom. R. Agrawal, and P. Raghavan. Using taxonomy, discriminants and
signatures for navigating in text databases, VLDB Conference, pages 446–455, 1997.
[25] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks,
ACM SIGMOD Conference, pages 307–318, 1998.
[26] S. Chakraborti, R. Mukras, R. Lothian, N. Wiratunga, S. Watt, and D. Harper. Supervised
latent semantic indexing using adaptive sprinkling, IJCAI, pages 1582–1587, 2007.
[27] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning, Vol. 2, MIT Press, Cam-
bridge, MA, 2006.
[28] D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach for learning Bayesian net-
works with local structure, Thirteenth Conference on Uncertainty in Artificial Intelligence,
pages 80–89, 1997.
[29] V. R. de Carvalho and W. Cohen. On the collective classification of email “speech acts”, ACM
SIGIR Conference, pages 345–352, 2005.
[30] V. Castelli and T. M. Cover. On the exponential value of labeled samples, Pattern Recognition
Letters, 16(1):105–111, 1995.
[31] W. Cohen and H. Hirsh. Joins that generalize: text classification using WHIRL, ACM KDD
Conference, pages 169–173, 1998.
[32] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization, ACM
Transactions on Information Systems, 17(2):141–173, 1999.
[33] W. Cohen. Learning rules that classify e-mail, AAAI Conference, pages 18–25, 1996.
[34] W. Cohen. Learning trees and rules with set-valued features, AAAI Conference, pages 709–
716, 1996.
[35] W. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval, ACM
Transactions on Information Systems, 13(1):100–111, 1995.
[36] C. Cortes and V. Vapnik. Support-vector networks, Machine Learning, 20(3):273–297, 1995.
[37] T. M. Cover and J. A. Thomas. Elements of Information Theory, New York: John Wiley and
Sons, 1991.
[38] M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better mod-
els for hypertext. Machine Learning, 43(1-2):97–119, 2001.
[39] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery.
Learning to extract symbolic knowledge from the worldwide web, AAAI Conference, pages
509–516, 1998.
[40] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods, Cambridge University Press, 2000.
[41] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization, Proceedings
of EMNLP, pages 55–63, 1997.
[42] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across
different feature spaces, Proceedings of Advances in Neural Information Processing Systems,
2008.
[43] A. Dayanik, D. Lewis, D. Madigan, V. Menkov, and A. Genkin. Constructing informative prior
distributions from domain knowledge in text classification, ACM SIGIR Conference, pages
493–500, 2006.
[44] A. P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via
the em algorithm, Journal of the Royal Statistical Society, Series B, 39(1): pp. 1–38, 1977.
[45] F. Denis and A. Laurent. Text Classification and Co-Training from Positive and Unlabeled
Examples, ICML 2003 Workshop: The Continuum from Labeled to Unlabeled Data. http:
//www.grappa.univ-lille3.fr/ftp/reports/icmlws03.pdf.
[46] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent se-
mantic analysis, JASIS, 41(6):391–407, 1990.
[47] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under
zero-one loss, Machine Learning, 29(2–3), 103–130, 1997.
[48] P. Domingos. MetaCost: A general method for making classifiers cost-sensitive, ACM KDD
Conference, pages 155–164, 1999.
[49] H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization, IEEE
Transactions on Neural Networks, 10(5):1048–1054, 1999.
[50] R. Duda, P. Hart, and W. Stork. Pattern Classification, Wiley Interscience, 2000.
[51] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and repre-
sentations for text categorization, CIKM Conference, pages 148–155, 1998.
[52] S. Dumais and H. Chen. Hierarchical classification of web content, ACM SIGIR Conference,
pages 256–263, 2000.
[53] C. Elkan. The foundations of cost-sensitive learning, IJCAI Conference, pages 973–978, 2001.
[54] R. Fisher. The use of multiple measurements in taxonomic problems, Annals of Eugenics,
7(2):179–188, 1936.
[55] R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-
supervised learning, NIPS Conference, pages 121–132, 2002.
[56] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. In Proceedings of Second European Conference on Computational
Learning Theory, pages 23–37, 1995.
[57] Y. Freund, R. Schapire, Y. Singer, and M. Warmuth. Using and combining predictors that
specialize, Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages
334–343, 1997.
[58] S. Gao, W. Wu, C.-H. Lee, and T.-S. Chua. A maximal figure-of-merit learning approach to
text categorization, SIGIR Conference, pages 190–218, 2003.
[59] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection – theory and
algorithms, ICML Conference, pages 43–50, 2004.
[60] S. Gopal and Y. Yang. Multilabel classification with meta-level features, ACM SIGIR Confer-
ence, pages 315–322, 2010.
[61] L. Guthrie and E. Walker. Document classification by machine: Theory and practice, COLING,
pages 1059–1063, 1994.
[62] L. Hamel. Knowledge Discovery with Support Vector Machines, Wiley, 2009.
[63] E.-H. Han, G. Karypis, and V. Kumar. Text categorization using weighted-adjusted k-nearest
neighbor classification, PAKDD Conference, pages 53–65, 2001.
[64] E.-H. Han and G. Karypis. Centroid-based document classification: Analysis and experimental
results, PKDD Conference, pages 424–431, 2000.
[65] D. Hardin, I. Tsamardinos, and C. Aliferis. A theoretical characterization of linear SVM-based
feature selection, ICML Conference, 2004.
[66] S. Haykin. Neural Networks and Learning Machines, Prentice Hall, 2008.
[67] T. Hofmann. Probabilistic latent semantic indexing, ACM SIGIR Conference, pages 50–57,
1999.
[68] P. Howland, M. Jeon, and H. Park. Structure preserving dimension reduction for clustered text
data based on the generalized singular value decomposition, SIAM Journal of Matrix Analysis
and Applications, 25(1):165–179, 2003.
[69] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singu-
lar value decomposition, IEEE Transactions on Pattern Analysis and Machine Intelligence,
26(8):995–1006, 2004.
[70] D. Hull, J. Pedersen, and H. Schutze. Method combination for document filtering, ACM SIGIR
Conference, pages 279–287, 1996.
[71] R. Iyer, D. Lewis, R. Schapire, Y. Singer, and A. Singhal. Boosting for document routing,
CIKM Conference, pages 70–77, 2000.
[72] M. James. Classification Algorithms, Wiley Interscience, 1985.
[73] D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classifi-
cation, ACM KDD Conference, pages 593–598, 2004.
[74] T. Joachims. A probabilistic analysis of the rocchio algorithm with TFIDF for text categoriza-
tion, ICML Conference, pages 143–151, 1997.
[75] T. Joachims. Text categorization with support vector machines: Learning with many relevant
features, ECML Conference, pages 137–142, 1998.
[76] T. Joachims. Transductive inference for text classification using support vector machines,
ICML Conference, pages 200–209, 1999.
[77] T. Joachims. A statistical learning model of text classification for support vector machines,
ACM SIGIR Conference, pages 128–136, 2001.
[78] T. Joachims. Making large scale SVMs practical, In Advances in Kernel Methods, Support
Vector Learning, pages 169–184, MIT Press, Cambridge, MA, 1998.
[79] T. Joachims. Training linear SVMs in linear time, KDD, pages 217–226, 2006.
[80] D. Johnson, F. Oles, T. Zhang and T. Goetz. A decision tree-based symbolic rule induction
system for text categorization, IBM Systems Journal, 41(3):428–437, 2002.
[83] G. Karypis and E.-H. Han. Fast supervised dimensionality reduction with applications to doc-
ument categorization and retrieval, ACM CIKM Conference, pages 12–19, 2000.
[84] T. Kawatani. Topic difference factor extraction between two document sets and its application
to text categorization, ACM SIGIR Conference, pages 137–144, 2002.
[85] Y.-H. Kim, S.-Y. Hahn, and B.-T. Zhang. Text filtering by boosting naive Bayes classifiers,
ACM SIGIR Conference, pages 168–175, 2000.
[86] D. Koller and M. Sahami. Hierarchically classifying documents using very few words, ICML
Conference, pages 170–178, 2007.
[87] S. Lam and D. Lee. Feature reduction for neural network based text categorization, DASFAA
Conference, pages 195–202, 1999.
[88] W. Lam and C. Y. Ho. Using a generalized instance set for automatic text categorization, ACM
SIGIR Conference, pages 81–89, 1998.
[89] W. Lam and K.-Y. Lai. A meta-learning approach for text categorization, ACM SIGIR Confer-
ence, pages 303–309, 2001.
[90] K. Lang. Newsweeder: Learning to filter netnews, ICML Conference, pages 331–339, 1995.
[91] L. S. Larkey and W. B. Croft. Combining classifiers in text categorization, ACM SIGIR Con-
ference, pages 289–297, 1996.
[92] D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning, ICML
Conference, pages 148–156, 1994.
[93] D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization,
SDAIR, pages 83–91, 1994.
[94] D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval,
ECML Conference, pages 4–15, 1998.
[95] D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task,
ACM SIGIR Conference, pages 37–50, 1992.
[96] D. Lewis and W. Gale. A sequential algorithm for training text classifiers, SIGIR Conference,
pages 3–12, 1994.
[97] D. Lewis and K. Knowles. Threading electronic mail: A preliminary study, Information Pro-
cessing and Management, 33(2):209–217, 1997.
[98] H. Li and K. Yamanishi. Document classification using a finite mixture model, Annual Meeting
of the Association for Computational Linguistics, pages 39–47, 1997.
[99] Y. Li and A. Jain. Classification of text documents, The Computer Journal, 41(8):537–546,
1998.
[100] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining, ACM KDD
Conference, pages 80–86, 1998.
[101] B. Liu and L. Zhang. A survey of opinion mining and sentiment analysis. In Mining Text
Data, Ed. C. Aggarwal, C. Zhai, Springer, 2011.
[102] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning, 285–318, 1988.
[103] P. Long and R. Servedio. Random classification noise defeats all convex potential boosters,
ICML Conference, pages 287–304, 2008.
[104] Y. Lu, Q. Mei, and C. Zhai. Investigating task performance of probabilistic topic models: an
empirical study of PLSA and LDA, Information Retrieval, 14(2):178-203.
[105] S. J. Pan and Q. Yang. A survey on transfer learning, IEEE Transactons on Knowledge and
Data Engineering, 22(10):1345–1359, 2010.
[106] S. A. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate
case study, Journal of Machine Learning Research, 8(May):935–983, 2007.
[107] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification
and clustering, http://www.cs.cmu.edu/~mccallum/bow, 1996.
[108] A. McCallum, K. Nigam. A comparison of event models for naive Bayes text classification,
AAAI Workshop on Learning for Text Categorization, 1998.
[109] A. McCallum, R. Rosenfeld, and T. Mitchell, A. Ng. Improving text classification by shrink-
age in a hierarchy of classes, ICML Conference, pages 359–367, 1998.
[110] A.K. McCallum. “MALLET: A Machine Learning for Language Toolkit,” http://mallet.
cs.umass.edu, 2002.
[111] T. M. Mitchell. Machine Learning, WCB/McGraw-Hill, 1997.
[112] T. M. Mitchell. The role of unlabeled data in supervised learning, Proceedings of the Sixth
International Colloquium on Cognitive Science, 1999.
[113] D. Mladenic, J. Brank, M. Grobelnik, and N. Milic-Frayling. Feature selection using linear
classifier weights: Interaction with classification models, ACM SIGIR Conference, pages 234–
241, 2004.
[114] K. Myers, M. Kearns, S. Singh, and M. Walker. A boosting approach to topic spotting on
subdialogues, ICML Conference, 2000.
[115] H. T. Ng, W. Goh, and K. Low. Feature selection, perceptron learning, and a usability case
study for text categorization, ACM SIGIR Conference, pages 67–73, 1997.
[116] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of
logistic regression and naive Bayes, NIPS. pages 841–848, 2001.
[117] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled
and unlabeled documents, AAAI Conference, pages 792–799, 1998.
[118] H.-J. Oh, S.-H. Myaeng, and M.-H. Lee. A practical hypertext categorization method using
links and incrementally available class information, pages 264–271, ACM SIGIR Conference,
2000.
[119] G. Qi, C. Aggarwal, and T. Huang. Towards semantic knowledge propagation from text cor-
pus to web images, WWW Conference, pages 297–306, 2011.
[120] G. Qi, C. Aggarwal, Y. Rui, Q. Tian, S. Chang, and T. Huang. Towards cross-category knowl-
edge propagation for learning visual concepts, CVPR Conference, pages 897–904, 2011.
[121] X. Qi and B. Davison. Classifiers without borders: incorporating fielded text from neighbor-
ing web pages, ACM SIGIR Conference, pages 643–650, 2008.
[124] S. E. Robertson and K. Sparck-Jones. Relevance weighting of search terms. Journal of the
American Society for Information Science, 27(3):129–146, 1976.
[125] J. Rocchio. Relevance feedback information retrieval. In The Smart Retrieval System- Exper-
iments in Automatic Document Processing, G. Salton, Ed., Prentice Hall, Englewood Cliffs,
NJ, pages 313–323, 1971.
[126] M. Ruiz and P. Srinivasan. Hierarchical neural networks for text categorization, ACM SIGIR
Conference, pages 281–282, 1999.
[127] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys,
34(1):1–47, 2002.
[128] M. Sahami. Learning limited dependence Bayesian classifiers, ACM KDD Conference, pages
335–338, 1996.
[132] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regulariza-
tion, Optimization, and Beyond, Cambridge University Press, 2001.
[133] I. Steinwart and A. Christmann. Support Vector Machines, Springer, 2008.
[134] R. Schapire and Y. Singer. BOOSTEXTER: A Boosting-based system for text categorization,
Machine Learning, 39(2/3):135–168, 2000.
[135] H. Schutze, D. Hull, and J. Pedersen. A comparison of classifiers and document representa-
tions for the routing problem, ACM SIGIR Conference, pages 229–237, 1995.
[138] J. Shavlik and T. Eliassi-Rad. Intelligent agents for web-based tasks: An advice-taking ap-
proach, AAAI-98 Workshop on Learning for Text Categorization. Tech. Rep. WS-98-05, AAAI
Press, 1998. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html
[139] V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs, ACM SIGIR Con-
ference, pages 477–484, 2006.
[140] N. Slonim and N. Tishby. The power of word clusters for text classification, European Col-
loquium on Information Retrieval Research (ECIR), 2001.
[141] N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequen-
tial information maximization, ACM SIGIR Conference, pages 129–136, 2002.
[142] J.-T. Sun, Z. Chen, H.-J. Zeng, Y. Lu, C.-Y. Shi, and W.-Y. Ma. Supervised latent semantic
indexing for document categorization, ICDM Conference, pages 535–538, 2004.
[143] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2005.
[146] A. Weigend, E. Wiener, and J. Pedersen. Exploiting hierarchy in text categorization. Infor-
mation Retrieval, 1(3):193–216, 1999.
[147] S. M. Weiss, C. Apte, F. Damerau, D. Johnson, F. Oles, T. Goetz, and T. Hampp. Maximizing
text-mining performance, IEEE Intelligent Systems, 14(4):63–69, 1999.
[148] S. M. Weiss and N. Indurkhya. Optimized rule induction, IEEE Expert, 8(6):61–69, 1993.
[149] E. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting,
SDAIR, pages 317–332, 1995.
[150] J. Xu and B. W. Croft, Improving the effectiveness of information retrieval with local context
analysis, ACM Transactions on Information Systems, 18(1), Jan, 2000. pp. 79–112.
[151] G.-R. Xue, D. Xing, Q. Yang, Y. Yu. Deep classification in large-scale text hierarchies, ACM
SIGIR Conference, 2008.
[152] J. Yan, N. Liu, B. Zhang, S. Yan, Z. Chen, Q. Cheng, W. Fan, W.-Y. Ma. OCFS: optimal
orthogonal centroid feature selection for text categorization, ACM SIGIR Conference, 2005.
[153] Y. Yang and X. Liu. A re-examination of text categorization methods, ACM SIGIR Conference,
1999.
[154] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization, ACM
SIGIR Conference, 1995.
[155] Y. Yang, C.G. Chute. An example-based mapping method for text categorization and re-
trieval, ACM Transactions on Information Systems, 12(3), 1994.
[156] Y. Yang. Noise Reduction in a Statistical Approach to Text Categorization, ACM SIGIR Con-
ference, 1995.
[157] Y. Yang. A Study on Thresholding Strategies for Text Categorization, ACM SIGIR Confer-
ence, 2001.
[158] Y. Yang, T. Ault, T. Pierce. Combining multiple learning strategies for effective cross-
validation, ICML Conference, 2000.
[159] J. Zhang, Y. Yang. Robustness of regularized linear classification methods in text categoriza-
tion, ACM SIGIR Conference, 2003.
[160] T. Zhang, A. Popescul, B. Dom. Linear prediction models with graph regularization for web-
page categorization, ACM KDD Conference, 2006.
[161] C. Zhai, Statistical Language Models for Information Retrieval (Synthesis Lectures on Hu-
man Language Technologies), Morgan and Claypool Publishers, 2008.
[162] S. Zhu, K. Yu, Y. Chi, Y. Gong. Combining content and link for classification using matrix
factorization, ACM SIGIR Conference, 2007.
[163] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning, Morgan and Claypool,
2009.
Chapter 12
Multimedia Classification
Shiyu Chang
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Wei Han
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Xianming Liu
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Ning Xu
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Pooya Khorrami
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Thomas S. Huang
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
12.1 Introduction
Many machine classification problems resemble human decision making tasks, which are multi-
modal in nature. Humans have the capability to combine various types of sensory data and associate
them with natural language entities. Complex decision tasks such as person identification often
heavily depend on such synergy or fusion of different modalities. For that reason, much effort on
machine classification methods goes into exploiting the underlying relationships among modalities
and constructing an effective fusion algorithm. This is a fundamental step in the advancement of arti-
ficial intelligence, because the scope of learning algorithms is not limited to one type of sensory data.
This chapter is about multimedia classification, a decision making task involving several types
of data such as audio, image, video (time sequence of images), and text. Although other modali-
ties, e.g., haptic data, may fall into the category of multimedia, we only discuss audio, visual, and
text because they represent a broad spectrum of both human decision tasks and machine learning
applications.
12.1.1 Overview
Advances in modern digital media technologies have made generating and storing large amounts
of multimedia data more feasible than ever before. The World Wide Web has become a huge, di-
verse, dynamic, and interactive medium for obtaining knowledge and sharing information. It is a
rich and gigantic repository of multimedia data, collecting a tremendous number of images, mu-
sic, videos, etc. The emerging need to understand large, complex, and information-rich multimedia
contents on the Web is crucial to many data-intensive applications in the field of business, science,
medical analysis, and engineering.
The Web (or a multimedia database in general) stores a large amount of data whose modality
is divided into multiple categories including audio, video, image, graphics, speech, text, document,
and hypertext, which contains text markups and linkages. Machine classification on the Web is the
entire process of extracting and discovering knowledge from multimedia documents using computer-
based methodologies. Figure 12.1 illustrates the general procedure for multimedia data learning.
Multimedia classification and predictive modeling are the most fundamental problems that re-
semble human decision making tasks. Unlike learning algorithms that focus within a single domain,
FIGURE 12.1 (See color insert.): Flowchart of general multimedia learning process.
many questions are open in information fusion. Multimedia classification studies how to fuse de-
pendent sources, modalities, or samples efficiently to achieve a lower classification error. Most of
the learning algorithms can be applied to the multimedia domain in a straightforward manner. However,
the specificity of an individual model of data is still very significant in most real-world applications
and multimedia system designs. To shed more light on the problems of multimedia classification,
we will first introduce the three fundamental types of data including audio, visual, and text.
Audio and visual are essential sensory channels for humans to acquire information about the
world. In multimedia classification, the fusion of audio and visual information has been extensively studied, and
has achieved different levels of improvement in tasks including event detection, person identification,
and speech recognition, when compared to methods based on only one modality. In this chapter we
survey several important related topics and various existing audio-visual fusion methods.
Text is quite different from any sensory modality. It is in the form of natural language, a special
medium for human to exchange knowledge. There is an unproven hypothesis in neuroscience that
the human brain associates patterns observed in different sensory data to entities that are tied to
basic units in natural language [43]. This hypothesis coincides with the recent efforts on ontology,
a graphical knowledge representation that consists of abstract entities with semantically meaning-
ful links, and learning algorithms to build the connection among natural language, ontology, and
sensory data (mostly images).
In this chapter, we mainly focus on decision making from multiple modalities of data in an application-oriented manner, rather than on illustrating different learning algorithms. The remainder of this chapter is organized as follows. In Section 12.2, we illustrate commonly used features for the three fundamental data types. Audio-visual fusion methods and potential applications are introduced in Section 12.3. Section 12.4 explains how to use ontological structure to enhance multimedia machine classification accuracy. We then discuss geographical classification using multimedia data in Section 12.5. Finally, we conclude the chapter and discuss future research directions in the field of multimedia in Section 12.6.
Euclidean space. Although in the modern digital world data can be naturally represented as vectors, finding discriminative features in a meaningful feature space is extremely important. We refer to the process of studying meaningful data representations and discovering knowledge from the data as feature extraction.
FIGURE 12.2 (See color insert.): Time-domain waveform of the same speech from different per-
sons [80].
FIGURE 12.3 (See color insert.): Frequency response of the same speech in Figure 12.2 [80].
In general, good features can make the learning problem trivial; conversely, bad representations can make the problem unsolvable in the extreme case. A simple example in the context of speech recognition is illustrated in Figure 12.2. Both waveforms correspond to exactly the same sentence spoken by different people. One can easily observe large deviations between the two, which make it difficult to determine visually whether the two signals represent the same speech without the explicit help of professionals in the field. However, if we transform them into the frequency domain using a short-time Fourier transform, the harmonics shown in Figure 12.3 at different frequencies over time are almost identical. This simple example demonstrates that raw signal data is quite limited in its ability to model the content relevant to machine classification algorithms, and is therefore rarely used directly as input. We now briefly introduce some common features that have been widely used in the multimedia community.
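To make the frequency-domain transform concrete, the following is a minimal sketch that computes a short-time Fourier magnitude spectrogram with NumPy. The frame length, hop size, and the synthetic two-"speaker" signals are illustrative assumptions, not the data or settings behind Figures 12.2 and 12.3 or [80].

import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Short-time Fourier magnitude spectrogram of a 1-D signal.

    Each column is the magnitude spectrum of one Hann-windowed frame,
    so harmonics appear as ridges that persist over time.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))

# Toy usage: two "speakers" producing the same vowel-like tone differ in the
# time domain (different amplitude envelopes) but share the same harmonics.
t = np.linspace(0, 1.0, 16000, endpoint=False)
speaker_a = 0.8 * np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
speaker_b = 0.5 * (1 + 0.5 * np.sin(2 * np.pi * 3 * t)) * speaker_a
print(magnitude_spectrogram(speaker_a).shape, magnitude_spectrogram(speaker_b).shape)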
Unlike images and audio, text documents do not have a natural feature representation. To alleviate this problem, the most popular and simplest feature representation is the bag-of-words, which vectorizes text documents. Bag-of-words feature extraction uses the relative frequency of occurrence of a set of keywords as a quantitative measure to convert each text document into a vector. The reason for using relative frequencies instead of absolute word counts is to reduce the effect of length differences among documents.
Furthermore, feature extraction usually involves data preprocessing steps such as data cleaning and normalization. By removing unwanted details and "normalizing" the data, learning performance can be significantly improved. For instance, in text documents, punctuation symbols and non-alphanumeric characters are usually filtered out because they do not contribute to feature discrimination. Moreover, words that occur too frequently, such as "the," "a," "is," and "at," are also discarded because they appear in nearly every document and carry little discriminative information. The most common way to remove such common English words is to apply a "stop list" at the preprocessing stage. Another common data preprocessing trick is variant unification. A variant is a different form of the same word, for example, "go," "goes," "went," "gone," and "going." This can be handled by stemming, which replaces these variants with a standard form. More sophisticated methods can not only group words with the same semantic meaning, but can also improve discrimination by considering word co-occurrence. The interested reader can refer to Chapter 10 of this book.
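The bag-of-words pipeline just described (stop-word filtering, stemming, relative term frequencies) can be sketched as follows. The tiny stop list, the crude suffix-stripping stemmer, and the fixed vocabulary are simplifying assumptions for illustration, not the methods detailed in Chapter 10.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "at", "of", "and", "to"}   # illustrative stop list

def crude_stem(word):
    """Very rough stemmer: strips a few common suffixes ("goes" -> "go", "going" -> "go")."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def bag_of_words(document, vocabulary):
    """Relative-frequency vector over `vocabulary` for one document."""
    tokens = re.findall(r"[a-z]+", document.lower())        # drop punctuation and digits
    tokens = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1                       # guard against empty documents
    return [counts[w] / total for w in vocabulary]

vocab = ["go", "classify", "multimedia"]
print(bag_of_words("The classifier is going to classify multimedia data.", vocab))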
While humans can easily recognize a person from a photograph, this rather simple task is far more difficult for computer-based algorithms. The images of the same person in a database might have been taken under different conditions, including viewpoint, lighting, image quality, occlusion, and resolution. Figure 12.4 illustrates the huge differences among images of a former US President.
FIGURE 12.4 (See color insert.): Example images of President George W. Bush showing huge
intra-person variations.
Many sophisticated feature extraction and machine classification methods have been proposed in
the image domain in recent years. In this subsection, we will mainly focus on the four fundamentals:
color, texture, shape, and spatial relation.
1. Color: Color features are among the most widely used features in image classification, capturing the spatial color distribution of a given image. Colors are defined on color spaces; commonly used color spaces include RGB, HSV, and YCrCb, which closely model human perception. A detailed description of the different color spaces can be found in [74]. Color covariance, color histograms, color moments, and color coherence are commonly used color features defined on these color spaces.
2. Texture: Texture features are designed to capture specific patterns appearing in an image. For objects containing characteristic textures on their surfaces (e.g., fruit skin, clouds, trees, bricks), the extracted features provide important information for image classification. Some important texture features include Gabor filtering [100], the wavelet transform [98], and other local statistical features [90], [50]. Some easily computable textural features based on graytone spatial dependencies can be found in [31]. Their applications have been demonstrated in category identification of photo-micrographs, panchromatic aerial photographs, and multi-spectral imagery.
3. Shape: Shape is a well-defined concept for extracting information about desired objects in images. Shape features include Fourier descriptors, aspect ratio, circularity, moment invariants, consecutive boundary segments, etc. [65]. Despite the importance of shape features for classifying different objects, shape is less practical than color and texture features due to the difficulty of obtaining an accurate segmentation.
4. Spatial Relation: Besides the aforementioned image features, spatial location is also useful in region classification, since an object's spatial layout information can support effective classification. Most of the existing work simply defines spatial location as "upper," "bottom," or "top," according to where the object appears in an image [34], [68]. If an image contains more than one object, the relative spatial relationship is more important than the absolute spatial location in deriving semantic features. The most common structure used to represent directional relationships between objects, such as "left/right" and "up/down," is the 2D-string [17].
In recent years, the most effective image features have included HOG (Histogram of Oriented Gradients) [22], SIFT (Scale Invariant Feature Transform) [54], and their variations [6, 10, 67]. These are local features, which characterize the visual content within a local range instead of over the whole image. They are widely used in object detection, recognition, and retrieval. We briefly introduce these two popular features here.
1. HOG: HOG is a descriptor that characterizes object appearance and shape by evaluating the distribution of intensity gradients or edge directions. An image is first divided into small spatial regions (called "cells"). For each cell, a 1-D histogram of gradient directions or edge orientations over the pixels of the cell is calculated [22]. The descriptor is then constructed by combining these histograms, which encodes the local spatial information. To further improve accuracy, the local histograms can be contrast-normalized by accumulating a measure of the intensity (referred to as "energy") over a larger spatial region (called a "block") of the image. In implementations, the cells and blocks on dense grids are generated by sliding windows at different scales.
The key idea behind HOG is to find a "template" of the average shape for a certain type of target (e.g., pedestrians) by "viewing" and aggregating over all available examples. The HOG descriptor works well in various tasks due to its high computational efficiency, its ability to capture local shapes, and its tolerance to translation and contrast changes. It is widely used in human detection, face detection, and object recognition. A minimal gradient-histogram sketch is given after the SIFT discussion below.
2. SIFT: The Scale Invariant Feature Transform [54] is the most widely used image feature in image classification, object detection, image matching, etc. It detects key points (referred to as "interest points") in an image and constructs a scale- and rotation-invariant descriptor for each interest point.
SIFT comprises a detector and a descriptor. First, in the detector, interest points (also referred to as "key points") are obtained by searching for the points at which the DoG (difference of Gaussians) values attain extrema with respect to both the spatial coordinates in the image domain and the scale level in scale space, inspired by the scale selection framework of [48]. At each detected interest point, a position-dependent descriptor is calculated similarly to HOG, but in a more sophisticated way.
To achieve scale invariance, the size of the local neighborhood (corresponding to the "block" size in HOG) is normalized according to the scale. To achieve rotation invariance, the principal orientation is determined by accumulating the orientations of all gradients within the neighborhood, and this orientation is used to orient the descriptor grid. Finally, the descriptor is calculated with respect to the principal orientation. Figure 12.5 shows an example of the SIFT detector and the matching between two different images.
One of the most important extensions of SIFT is the dense version of SIFT features [10]. Instead of detecting interest points in a first step, it samples points densely (e.g., on a uniform grid) from the image and assigns a scale to each point instead. It has been reported that in some situations Dense SIFT outperforms the original SIFT feature.
FIGURE 12.5 (See color insert.): Samples of SIFT detector from [51]: interest points detected
from two images of a similar scene. For each point, the circle indicates the scale. It also shows the
matching between the similar SIFT points in the two images.
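As referenced in the HOG discussion above, the following is a minimal sketch of the cell-wise gradient-histogram construction for a grayscale image array. The cell size, bin count, and the omission of block normalization are simplifying assumptions relative to the full descriptor in [22].

import numpy as np

def hog_cells(image, cell=8, n_bins=9):
    """Per-cell histograms of gradient orientations (unsigned, 0-180 degrees).

    Returns an array of shape (rows//cell, cols//cell, n_bins); the full HOG
    descriptor in [22] would additionally group cells into blocks and
    contrast-normalize them.
    """
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)                        # simple finite-difference gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, n_bins))
    bin_width = 180.0 / n_bins
    for r in range(rows):
        for c in range(cols):
            mag = magnitude[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            ori = orientation[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            bins = np.minimum((ori // bin_width).astype(int), n_bins - 1)
            for b in range(n_bins):                  # magnitude-weighted votes per bin
                hist[r, c, b] = mag[bins == b].sum()
    return hist

print(hog_cells(np.random.rand(64, 64)).shape)       # (8, 8, 9)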
1. Time-domain features:
(a) Short-time energy: The energy of the audio waveform within a frame (a small sketch of this feature and the zero crossing rate appears after this list).
(b) Energy statistics: The energy statistics of an audio waveform usually include the mean, standard deviation, dynamic range, etc.
(c) Silence ratio: The percentage of low-energy audio frames.
(d) Zero crossing rate (ZCR): ZCR is a commonly used temporal feature. It counts the number of times the audio waveform crosses the zero axis within a frame.
(e) Pause rate: The rate of pauses in speech due to the separation of sentences.
2. Frequency-domain features:
(a) Pitch: It measures the fundamental frequency of audio signals.
(b) Subband energy ratio: It is a histogram-like energy distribution over different frequen-
cies.
(c) Spectral statistics: The most important spectral statistics are the frequency centroid (FC) and the bandwidth (BW). FC is the weighted average of all frequency components of an audio frame. For a particular frame, BW is the weighted average of the squared differences between each frequency and the frequency centroid.
3. Psycho-acoustic features:
(a) Four-Hz modulation energy: This feature measures the energy of the 4-Hz modulation; speech usually contains a characteristic energy modulation peak around the 4-Hz syllabic rate.
(b) Spectral roll-off point: The spectral roll-off point is used to distinguish voiced from unvoiced speech by computing the frequency below which 95 percent of the power spectrum is concentrated.
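A minimal sketch of two of the time-domain features listed above, short-time energy and zero crossing rate; the frame length and the use of non-overlapping frames are assumptions made for illustration.

import numpy as np

def frame_signal(signal, frame_len=400):
    """Split a 1-D signal into non-overlapping frames (tail samples dropped)."""
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

def short_time_energy(signal, frame_len=400):
    """Mean squared amplitude of each frame."""
    frames = frame_signal(signal, frame_len)
    return np.mean(frames ** 2, axis=1)

def zero_crossing_rate(signal, frame_len=400):
    """Fraction of consecutive sample pairs whose signs differ, per frame."""
    frames = frame_signal(signal, frame_len)
    signs = np.sign(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

audio = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))   # toy 440 Hz tone
print(short_time_energy(audio)[:3], zero_crossing_rate(audio)[:3])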
Audio and visual data can be combined coherently in a unified learning framework, and utilizing different modalities of data can potentially enhance classification results. This is also related to video classification, since video itself contains both a visual and an audio component. The detailed methods will be discussed later in this chapter.
Normalization
Most pose normalization methods in AVSR extract the lips from the image. The algorithms for lip localization fall mainly into two categories [76]. The first category makes use of prior information about the whole face. Two representative methods are the active shape model (ASM) and the active appearance model (AAM) [19]. While the ASM represents the face as a set of landmarks, the AAM further includes the textures inside the face boundary. These methods can be extended to 3D [24] to further normalize out-of-plane head rotations. The other category uses only the local appearance around the lips [1]. The informative aspects of the lips are the shapes of the outer and inner contours and the image appearance inside the contours. Existing AVSR front ends can be shape-based [56], appearance-based (pixel-based) [61], or a combination of the two.
Robust pose normalization and part detection are, in general, difficult tasks. This is espe-
cially true in uncontrolled environments with varying illumination and background. Most of the current AVSR research simplifies the normalization stage by using controlled environments and human labeling. For example, many audio visual speech corpora are recorded with limited pose variation and constant lighting [66]. The cropping window for mouth regions is usually provided, while other regions of the face can be ignored in AVSR. This simplification of facial image processing helps AVSR researchers concentrate on feature design and audio visual fusion, and enables meaningful benchmarking of different methods.
FIGURE 12.7: Graphical model illustration of audio visual models: (a) HMM, (b) MSHMM, (c) FHMM, (d) CHMM.
Features The raw features after normalization can be classified as shape-based or appearance-based. Frequently used shape features are the geometric properties of the lip contour and the model parameters of a parametric lip model. Appearance-based features take raw pixels from the region of interest of the normalized face images.
As the typical dimension of the image pixels is too large for the statistical modeling in a traditional speech recognizer, feature transforms are adopted to reduce the dimension. Some popular approaches are Principal Component Analysis (PCA) [26], the Discrete Cosine Transform (DCT) [71], wavelets [77], Linear Discriminant Analysis (LDA) [61], etc. Although dimensionality reduction is not necessary for the contour-based approaches, PCA is often applied to obtain more compact features. The combination of shape and appearance features can be viewed as another multimodal fusion task: each feature carries different information about the same source. In AVSR systems, direct concatenation is the commonly used approach [16].
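As a small illustration of the dimensionality reduction step mentioned above, the following sketch applies PCA to flattened region-of-interest images. The image size, number of samples, and number of components are arbitrary choices for illustration, not values from any cited AVSR system.

import numpy as np

def pca_project(X, n_components=32):
    """Project row vectors in X onto the top principal components.

    X has one flattened region-of-interest image per row; returns the
    low-dimensional features plus the mean and basis for re-use on new data.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components].T
    return Xc @ basis, mean, basis

# Toy usage: 200 "mouth images" of 32x16 pixels, flattened to 512-D vectors.
rois = np.random.rand(200, 32 * 16)
features, mean, basis = pca_project(rois, n_components=32)
print(features.shape)        # (200, 32)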
In audio visual fusion, the state synchrony assumption of the Hidden Markov Model is usually loosened to allow state asynchrony. The various types of HMMs proposed for audio visual fusion have been summarized in [70] using the notion of a graphical model, as shown in Figure 12.7.
Theoretically, the audio visual HMM is a special case of a Dynamic Bayesian Network (DBN) with two observation streams. The various decision fusion methods differ in the statistical dependencies among the observation and hidden state streams, i.e., in the edges of the DBN graph. The audio visual product HMM [70] assumes one hidden state stream and two conditionally independent observation streams. Therefore, the states are synchronous, and the likelihoods are multiplied together. The factorial HMM [29] uses one hidden state stream for each modality, but the two streams are assumed to be independent. The coupled HMM further imposes dependencies between the audio and visual hidden states. Therefore, it has the capability to incorporate audio visual asynchrony, at least in theory.
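The decision-fusion idea behind these models, multiplying per-stream likelihoods (often with a reliability exponent per stream), can be sketched as a weighted sum of per-class log-likelihoods. The class scores and the weight lambda below are placeholders rather than outputs of an actual audio-visual HMM.

import numpy as np

def fuse_decisions(audio_loglik, visual_loglik, lam=0.7):
    """Weighted product-rule fusion of per-class stream log-likelihoods.

    Multiplying stream likelihoods with exponents lam and (1 - lam) is the
    same as adding weighted log-likelihoods; lam reflects how much the
    (usually more reliable) audio stream is trusted.
    """
    fused = lam * np.asarray(audio_loglik) + (1.0 - lam) * np.asarray(visual_loglik)
    return int(np.argmax(fused)), fused

# Placeholder per-class log-likelihoods for three word classes.
audio_scores = [-12.3, -9.8, -15.1]
visual_scores = [-7.4, -11.0, -6.9]
print(fuse_decisions(audio_scores, visual_scores, lam=0.6))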
An ontology provides a common, shared terminology for knowledge description. A hierarchy is the simplest form of ontology, in which each node represents one concept and nodes are linked by directed edges representing the Subclass relation in a tree structure. Ontologies have already shown their usefulness in many application domains, such as the Art and Architecture Thesaurus (AAT) for art and the Gene Ontology [4] for health care.
Knowledge can be modeled by ontologies. For example, a car is a wheeled vehicle and a motorcycle is also a wheeled vehicle; thus, both incorporate wheels. This means that one can determine semantic relationships between the class labels that are assigned to observed visual object instances during visual object recognition. In fact, humans use this knowledge when learning
the visual appearance of the objects. For instance, when one encounters a new car model, it is not
sensible to learn all the appearance details. It is enough to remember that it looks like a car, as well
as the discriminative details. This can help in learning the visual appearance of new object types
and speed up the recognition process. Both advantages are very desirable in object recognition.
Motivated by these ideas, some initial research has been conducted on utilizing different aspects of ontologies for many applications. We first introduce some popular ontologies applied in multimedia. Then, we provide more technical details on using the ontological relations; more specifically, the Subclass relation and the Co-occurrence relation are described. We also include some algorithms that use the ontology inherently. Finally, we make some concluding remarks.
Ontological relations can help boost machine understanding, and also bridge the semantic gap between low-level and high-level concepts to some extent.
12.4.2.1 Definition
The most applicable and frequently used relations are the Subclass relation and the Co-occurrence relation. The definitions of the two relations are given as follows:
1. Subclass: also called hyperonymy, hyponymy, or “IS-A” relation. It links more general
synsets such as furniture to increasingly specific ones such as bed.
2. Co-occurrence: It is a linguistics term that refers to the frequent occurrence of two terms from
a text corpus alongside each other in a certain order. For example, indoor and furniture will
typically be correlated by their occurrence in a corpus.
Ontology-based algorithms utilize one of the two relations, or both of them, according to the specific application. The ways in which the relations are used are very different and have their own pros and cons. We introduce these algorithms below based on the type of relations they use.
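As a small illustration of how the Subclass relation can be used at prediction time, the sketch below propagates a predicted label to all of its ancestor concepts. The toy hierarchy is an assumption for illustration, not a fragment of any ontology cited above.

# Toy Subclass ("IS-A") hierarchy: child concept -> parent concept.
SUBCLASS = {
    "bed": "furniture",
    "chair": "furniture",
    "furniture": "indoor_object",
    "car": "wheeled_vehicle",
    "motorcycle": "wheeled_vehicle",
}

def expand_with_ancestors(label):
    """Return the predicted label together with every ancestor concept."""
    labels = [label]
    while label in SUBCLASS:
        label = SUBCLASS[label]
        labels.append(label)
    return labels

# A classifier that outputs "bed" implicitly also asserts "furniture", etc.
print(expand_with_ancestors("bed"))        # ['bed', 'furniture', 'indoor_object']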
FIGURE 12.8 (See color insert.): The challenges in landmark image recognition. (a) and (b) are photos of the Drum Tower of Xi'an and the Qianmen Gate of Beijing, respectively. Though at different locations, their visual appearances are similar for historical and stylistic reasons. (c) shows three disparate viewing angles of the Lincoln Memorial in Washington, DC.
1. GPS information: GPS coordinates allow the Earth's surface to be treated as a rectangular coordinate system, in which position is determined by latitude and longitude.
2. Text information: Typically, textual meta-data, such as the descriptions of Flickr images or articles in personal blogs, contain named entities of positions (e.g., "Sydney Opera House," "Eiffel Tower"). These are essential for indicating the geo-location of images. An example is [35], which extracted city names from MSN blogs and photo albums.
3. Visual features: Most problems deal with matching images to locations (e.g., landmark
names). Visual features are of high importance for landmark recognition. Popular visual fea-
tures include local points (e.g., SIFT [54]) and bag-of-words representations [84].
4. Collaborative features: With the popularity of social networks, collaborative features are also increasingly available, especially for social media. Examples of collaborative features include hashtags on Twitter labeled by users, and user comments or reposts. The problem of utilizing collaborative features efficiently remains relatively unexplored, since some of them are highly unstructured and noisy.
Geographical classification always needs to fuse different modalities of information, to accurately
predict geo-tags.
2. Geo-location ambiguity: Even the same geo-location or landmark may look different from different viewing angles. Though advanced scale- and rotation-invariant visual features such as SIFT [54] can handle in-plane rotation, current image descriptors cannot address out-of-plane rotation, i.e., changes in viewing angle.
Figure 12.8 shows two examples of these challenges. To deal with the visual ambiguity challenge, it is feasible to make use of multi-modality information. Classical approaches include fusing GPS position and visual features [37], or textual information and visual features [35].
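A minimal sketch of one simple way to combine such modalities: early fusion by concatenating L2-normalized per-modality feature vectors. The modality names, dimensions, and weights are illustrative assumptions, not the fusion schemes of [35] or [37].

import numpy as np

def early_fusion(feature_blocks, weights=None):
    """Concatenate per-modality feature vectors after L2-normalizing each block.

    `feature_blocks` maps a modality name (e.g., "visual", "text", "gps") to a
    1-D feature vector; optional per-modality weights scale each block.
    """
    weights = weights or {}
    parts = []
    for name, vec in feature_blocks.items():
        v = np.asarray(vec, dtype=float)
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm
        parts.append(weights.get(name, 1.0) * v)
    return np.concatenate(parts)

fused = early_fusion({
    "visual": np.random.rand(128),          # e.g., a bag-of-visual-words histogram
    "text": np.random.rand(50),             # e.g., a bag-of-words vector
    "gps": [40.7128, -74.0060],             # latitude/longitude
}, weights={"gps": 2.0})
print(fused.shape)                          # (180,)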
For geo-location ambiguity, learning viewing-angle-level models for each landmark is a widely adopted approach. This can be further integrated into a voting or boosting framework. In the following, we introduce two related topics, corresponding to geo-classification for Web images and videos, respectively.
FIGURE 12.9 (See color insert.): The same landmark will differ in appearance from different
viewing angles. Typical approaches use clustering to get the visual information for each viewing
angle. From Liu et al. [52]
To reduce the effect of ambiguities in landmark classification (see Figure 12.9), meta-data are
introduced and the classification is performed using multiple features. In [37, 38], the images are
first pre-processed according to their GPS information, to eliminate the geo-location ambiguity.
Typically, the images of the same landmark will be clustered into different viewing angles to tackle
the visual ambiguity. Cristani et al. [21] integrated both the visual and geographical information in
a statistical graph model via joint inference. Ji et al. [36] proposed building the vocabulary tree with consideration of context (e.g., textual information), and then applied it to the geo-location classification task. Crandall et al. [20] further proposed combining multiple features, such as visual, textual, and temporal features, to place photos on a worldwide geographical map.
12.5.3.1 Classifiers
The k-NN (k Nearest Neighbors) classifier [32, 41] and the Support Vector Machine (SVM) are commonly used. Moreover, researchers have also modeled image collections as a graph in which the nodes are connected according to their content or context similarities. PageRank [73] or HITS-like [42] algorithms are then used to propagate and classify the images' geographical information in a probabilistic manner. Cristani et al. [21] utilized a probabilistic topic model, the location-dependent pLSA (LD-pLSA), to jointly model visual and geographical information, aiming to coherently categorize images using dual-topic modeling. Recently, the structural SVM has also been utilized [47] for recognizing landmarks.
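As a small illustration of the k-NN approach mentioned above, the sketch below fits a k-NN landmark classifier with scikit-learn on placeholder visual features; the random vectors stand in for real bag-of-visual-words histograms, and the landmark ids are arbitrary.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: visual feature histograms for geotagged images
# of three landmarks (random vectors stand in for real features).
rng = np.random.default_rng(0)
X_train = rng.random((300, 500))
y_train = rng.integers(0, 3, size=300)      # landmark ids 0, 1, 2

# k-NN landmark classifier over the visual features.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.predict(rng.random((2, 500))))    # predicted landmark ids for 2 query images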
12.6 Conclusion
In this chapter we presented current advances in the field of multimedia classification, with details on feature selection, information fusion, and three important application domains: audio-visual fusion, ontology-based classification, and geographical classification. In particular, we emphasized the benefit of information fusion for multiple content forms. This chapter provided brief overviews of state-of-the-art methods, though a detailed exposition would be beyond the scope of a single chapter. Because of its great practical value, multimedia classification remains an active and fast-growing research area. It also serves as a good test-bed for new methods in audio/image processing and data science, including the general data classification methods discussed in the rest of this book.
Bibliography
[1] Waleed H Abdulla, Paul WT Yu, and Paul Calverly. Lips tracking biometrics for speaker
recognition. International Journal of Biometrics, 1(3):288–306, 2009.
[2] Ali Adjoudani and Christian Benoit. On the integration of auditory and visual parameters in
an HMM-based ASR. NATO ASI Series F Computer and Systems Sciences, 150:461–472,
1996.
[3] Sakire Arslan Ay, Lingyan Zhang, Seon Ho Kim, Ma He, and Roger Zimmermann. GRVS:
A georeferenced video search engine. In Proceedings of the 17th ACM International Confer-
ence on Multimedia, pages 977–978. ACM, 2009.
[4] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler,
J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al.
Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000.
[5] Sakire Arslan Ay, Roger Zimmermann, and Seon Ho Kim. Viewable scene modeling for
geospatial video search. In Proceedings of the 16th ACM International Conference on Mul-
timedia, pages 309–318. ACM, 2008.
[6] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In
Computer Vision–ECCV 2006, pages 404–417. Springer, 2006.
[7] Richard Bellman. Adaptive Control Processes: A Guided Tour, volume 4. Princeton Univer-
sity Press Princeton, NJ, 1961.
[8] Marco Bertini, Alberto Del Bimbo, and Carlo Torniai. Automatic video annotation using
ontologies extended with visual information. In Proceedings of the 13th Annual ACM Inter-
national Conference on Multimedia, pages 395–398. ACM, 2005.
[9] Stephan Bloehdorn, Kosmas Petridis, Carsten Saathoff, Nikos Simou, Vassilis Tzouvaras,
Yannis Avrithis, Siegfried Handschuh, Yiannis Kompatsiaris, Steffen Staab, and Michael G
Strintzis. Semantic annotation of images and videos for multimedia analysis. In The Semantic
Web: Research and Applications, pages 592–607. Springer, 2005.
[10] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Scene classification via pLSA. In
Computer Vision–ECCV 2006, pages 517–530. Springer, 2006.
[11] Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual
speech with audio. In Proceedings of the 24th Annual Conference on Computer Graph-
ics and Interactive Techniques, pages 353–360. ACM Press/Addison-Wesley Publishing Co.,
1997.
[12] Christoph Bregler and Yochai Konig. “Eigenlips” for robust speech recognition. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1994. ICASSP-94.,
volume 2, pages II–669. IEEE, 1994.
[13] Darin Brezeale and Diane J. Cook. Automatic video classification: A survey of the literature.
IEEE Transactions on Systems, Man, and Cybernetics, Part C, 38(3):416–430, 2008.
[14] Lijuan Cai and Thomas Hofmann. Hierarchical document categorization with support vector
machines. In Proceedings of the Thirteenth ACM International Conference on Information
and Knowledge Management, pages 78–87. ACM, 2004.
[15] Liangliang Cao, Jiebo Luo, and Thomas S Huang. Annotating photo collections by label
propagation according to multiple similarity cues. In Proceedings of the 16th ACM Interna-
tional Conference on Multimedia, pages 121–130. ACM, 2008.
[16] Michael T Chan. HMM-based audio-visual speech recognition integrating geometric and
appearance-based visual features. In 2001 IEEE Fourth Workshop on Multimedia Signal
Processing, pages 9–14. IEEE, 2001.
[17] S K Chang, Q Y Shi, and C W Yan. Iconic indexing by 2-d strings. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 9(3):413–428, May 1987.
[18] David M Chen, Georges Baatz, K Koser, Sam S Tsai, Ramakrishna Vedantham, Timo Pyl-
vanainen, Kimmo Roimela, Xin Chen, Jeff Bach, Marc Pollefeys, et al. City-scale landmark
identification on mobile devices. In IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2011, pages 737–744. IEEE, 2011.
[19] Timothy F Cootes, Gareth J Edwards, and Christopher J. Taylor. Active appearance models.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[20] David J Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the
world’s photos. In Proceedings of the 18th International Conference on World Wide Web,
pages 761–770. ACM, 2009.
[21] Marco Cristani, Alessandro Perina, Umberto Castellani, and Vittorio Murino. Geo-located
image analysis using latent representations. In IEEE Conference on Computer Vision and
Pattern Recognition, 2008. CVPR 2008, pages 1–8. IEEE, 2008.
[22] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
CVPR 2005, volume 1, pages 886–893. IEEE, 2005.
[23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern
Recognition, 2009. CVPR 2009, pages 248–255. IEEE, 2009.
[24] Fadi Dornaika and Jörgen Ahlberg. Fast and reliable active appearance model search for 3-D
face tracking. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
34(4):1838–1853, 2004.
[25] Susan Dumais and Hao Chen. Hierarchical classification of web content. In Proceedings
of the 23rd Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 256–263. ACM, 2000.
[26] Stéphane Dupont and Juergen Luettin. Audio-visual speech modeling for continuous speech
recognition. IEEE Transactions on Multimedia, 2(3):141–151, 2000.
[27] Christiane Fellbaum (Ed.). WordNet: An Electronic Lexical Database. Springer, 2010.
[28] Yul Gao and Jianping Fan. Incorporating concept ontology to enable probabilistic concept
reasoning for multi-level image annotation. In Proceedings of the 8th ACM International
Workshop on Multimedia Information Retrieval, pages 79–88. ACM, 2006.
[29] Zoubin Ghahramani and Michael I Jordan. Factorial hidden Markov models. Machine Learn-
ing, 29(2-3):245–273, 1997.
[30] Hervé Glotin, D Vergyr, Chalapathy Neti, Gerasimos Potamianos, and Juergen Luettin.
Weighting schemes for audio-visual fusion in speech recognition. In 2001 IEEE International
Conference on Acoustics, Speech, and Signal Processing, 2001. Proceedings, (ICASSP’01),
volume 1, pages 173–176. IEEE, 2001.
[31] Robert M Haralick, K Shanmugam, and Its’Hak Dinstein. Textural features for image classifi-
cation. IEEE Transactions on Systems, Man and Cybernetics, SMC-3(6):610–621, November
1973.
[32] James Hays and Alexei A Efros. Im2gps: estimating geographic information from a single
image. In IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008,
pages 1–8. IEEE, 2008.
[33] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence,
17(1):185–203, 1981.
[34] J Jeon, V Lavrenko, and R Manmatha. Automatic image annotation and retrieval using cross-
media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Con-
ference on Research and Development in Informaion Retrieval, SIGIR ’03, pages 119–126,
New York, NY, USA, 2003. ACM.
[35] Rongrong Ji, Xing Xie, Hongxun Yao, and Wei-Ying Ma. Mining city landmarks from blogs
by graph modeling. In Proceedings of the 17th ACM International Conference on Multime-
dia, pages 105–114. ACM, 2009.
[36] Rongrong Ji, Hongxun Yao, Qi Tian, Pengfei Xu, Xiaoshuai Sun, and Xianming Liu.
Context-aware semi-local feature detector. ACM Transactions on Intelligent Systems and
Technology (TIST), 3(3):44, 2012.
[37] Lyndon Kennedy, Mor Naaman, Shane Ahern, Rahul Nair, and Tye Rattenbury. How flickr
helps us make sense of the world: context and content in community-contributed media col-
lections. In Proceedings of the 15th International Conference on Multimedia, pages 631–640.
ACM, 2007.
[38] Lyndon S Kennedy and Mor Naaman. Generating diverse and representative image search
results for landmarks. In Proceedings of the 17th International Conference on World Wide
Web, pages 297–306. ACM, 2008.
[39] Einat Kidron, Yoav Y Schechner, and Michael Elad. Pixels that sound. In IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, volume 1,
pages 88–95. IEEE, 2005.
[40] Seon Ho Kim, Sakire Arslan Ay, Byunggu Yu, and Roger Zimmermann. Vector model in
support of versatile georeferenced video search. In Proceedings of the First Annual ACM
SIGMM Conference on Multimedia Systems, pages 235–246. ACM, 2010.
[41] Jim Kleban, Emily Moxley, Jiejun Xu, and BS Manjunath. Global annotation on georef-
erenced photographs. In Proceedings of the ACM International Conference on Image and
Video Retrieval, page 12. ACM, 2009.
[42] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM
(JACM), 46(5):604–632, 1999.
[43] Sydney M Lamb. Pathways of the brain: The neurocognitive basis of language, volume 170.
John Benjamins Publishing, 1999.
[44] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-
3):107–123, 2005.
[45] Martha Larson, Mohammad Soleymani, Pavel Serdyukov, Stevan Rudinac, Christian
Wartena, Vanessa Murdock, Gerald Friedland, Roeland Ordelman, and Gareth JF Jones. Au-
tomatic tagging and geotagging in video collections and communities. In Proceedings of the
1st ACM International Conference on Multimedia Retrieval, page 51. ACM, 2011.
[46] Li-Jia Li, Hao Su, Li Fei-Fei, and Eric P Xing. Object bank: A high-level image repre-
sentation for scene classification & semantic feature sparsification. In Advances in Neural
Information Processing Systems, pages 1378–1386, 2010.
[47] Yunpeng Li, David J Crandall, and Daniel P Huttenlocher. Landmark classification in large-
scale image collections. In IEEE 12th International Conference on Computer Vision 2009,
pages 1957–1964. IEEE, 2009.
[48] Tony Lindeberg. Feature detection with automatic scale selection. International Journal of
Computer Vision, 30(2):79–116, 1998.
[49] Tony Lindeberg. Scale invariant feature transform. Scholarpedia, 7(5):10491, 2012.
[50] Fang Liu and R W Picard. Periodicity, directionality, and randomness: Wold features for im-
age modeling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18(7):722–733, 1996.
[51] Xian-Ming Liu, Hongxun Yao, Rongrong Ji, Pengfei Xu, Xiaoshuai Sun, and Qi Tian. Learn-
ing heterogeneous data for hierarchical web video classification. In Proceedings of the 19th
ACM International Conference on Multimedia, pages 433–442. ACM, 2011.
[52] Xian-Ming Liu, Yue Gao, Rongrong Ji, Shiyu Chang, and Thomas Huang. Localizing web
videos from heterogeneous images. In Workshops at the Twenty-Seventh AAAI Conference
on Artificial Intelligence, 2013.
[53] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. A survey of content-based image
retrieval with high-level semantics. Pattern Recognition, 40(1):262–282, 2007.
[54] David G Lowe. Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 60(2):91–110, 2004.
[55] Simon Lucey. Audio-Visual Speech Processing. PhD thesis, Citeseer, 2002.
[56] Juergen Luettin, Neil A Thacker, and Steve W Beet. Active shape models for visual speech
feature extraction. NATO ASI Series F Computer and Systems Sciences, 150:383–390, 1996.
[57] Hangzai Luo and Jianping Fan. Building concept ontology for medical video annotation. In
Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 57–60.
ACM, 2006.
[58] Jiebo Luo, Dhiraj Joshi, Jie Yu, and Andrew Gallagher. Geotagging in multimedia and com-
puter vision – a survey. Multimedia Tools and Applications, 51(1):187–211, 2011.
[59] Marcin Marszalek and Cordelia Schmid. Semantic hierarchies for visual object recognition.
In IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07, pages
1–7. IEEE, 2007.
[60] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-baseline stereo from
maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
[61] Iain Matthews, Gerasimos Potamianos, Chalapathy Neti, and Juergen Luettin. A comparison
of model and transform-based visual features for audio-visual LVCSR. In Proceedings of the
International Conference and Multimedia Expo, pages 22–25, 2001.
[62] Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 264:746–748,
1976.
[63] Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, and Alexander C. Loui. Con-
sumer video understanding: A benchmark database and an evaluation of human and machine
performance. In Proceedings of ACM International Conference on Multimedia Retrieval
(ICMR), oral session, 2011.
[64] Yu-Gang Jiang, Subhabrata Bhattacharya, Shih-Fu Chang, and Mubarak Shah. High-level
event recognition in unconstrained videos. International Journal of Multimedia Information
Retrieval, 2(2):73–101, 2012.
[65] Rajiv Mehrotra and James E. Gary. Similar-shape retrieval in shape data management. Com-
puter, 28(9):57–62, 1995.
[66] Kieron Messer, Jiri Matas, Josef Kittler, Juergen Luettin, and Gilbert Maitre. XM2VTSDB:
The extended M2VTS database. In Second International Conference on Audio and Video-
Based Biometric Person Authentication, pages 72–77, 1999.
[67] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors.
International Journal of Computer Vision, 60(1):63–86, 2004.
[68] Aleksandra Mojsilovic, Jose Gomes, and Bernice Rogowitz. Isee: Perceptual features for
image library navigation, Proceedings of SPIE 4662, Human Vision and Electronic Imaging
VII, pages 266–277, 2002.
[69] Milind Naphade, John R Smith, Jelena Tesic, Shih-Fu Chang, Winston Hsu, Lyndon
Kennedy, Alexander Hauptmann, and Jon Curtis. Large-scale concept ontology for multi-
media. Multimedia, IEEE, 13(3):86–91, 2006.
[70] Ara V Nefian, Luhong Liang, Xiaobo Pi, Xiaoxing Liu, and Kevin Murphy. Dynamic
Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in
Signal Processing, 2002(11):1274–1288, 2002.
[71] Ara V Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, Crusoe Mao, and Kevin Murphy. A
coupled HMM for audio-visual speech recognition. In 2002 IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II–2013. IEEE,
2002.
[72] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng.
Multimodal deep learning. In Proceedings of the 28th International Conference on Machine
Learning (ICML-11), ICML, volume 11, 2011.
[73] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation
Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab, Palo Alto, CA,
1999.
[74] Konstantinos N Plataniotis and Anastasios N Venetsanopoulos. Color Image Processing and
Applications. New York: Springer-Verlag New York, Inc., 2000.
[75] Gerasimos Potamianos, Eric Cosatto, Hans Peter Graf, and David B Roe. Speaker indepen-
dent audio-visual database for bimodal ASR. In Audio-Visual Speech Processing: Computa-
tional & Cognitive Science Approaches, pages 65–68, 1997.
[76] Gerasimos Potamianos, Chalapathy Neti, Juergen Luettin, and Iain Matthews. Audio-visual
automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Pro-
cessing, 22:23, MIT Press, Cambridge, MA. 2004.
[77] Gerasimos Potamianos, Hans Peter Graf, and Eric Cosatto. An image transform approach for
HMM based automatic lipreading. In Proceedings, 1998 International Conference on Image
Processing, 1998. ICIP 98, pages 173–177. IEEE, 1998.
[78] Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang, Tao Mei, and Hong-Jiang Zhang. Cor-
relative multi-label video annotation. In Proceedings of the 15th International Conference
on Multimedia, pages 17–26. ACM, 2007.
[79] Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang, and Hong-Jiang Zhang. Two-
dimensional multilabel active learning with an efficient online adaptation model for im-
age classification. IEEE Transactions on Pattern Analysis and Machine Intelligence,
31(10):1880–1897, 2009.
[80] Paris Smaragdis. Machine learning for signal processing. Course slides for CS598PS, Fall 2013.
[81] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recognition. In
IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07, pages 1–7.
IEEE, 2007.
[82] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local
SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition,
2004. ICPR 2004, volume 3, pages 32–36. IEEE, 2004.
[83] Ishwar K Sethi, Ioana L Coman, and Daniela Stan. Mining association rules between low-
level image features and high-level concepts. In Proceedings of SPIE 4384, Data Mining and
Knowledge Discovery: Theory, Tools, and Technology III, pages 279–290, 2001.
[84] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object match-
ing in videos. In Proceedings of the Ninth IEEE International Conference on Computer
Vision 2003, pages 1470–1477. IEEE, 2003.
[85] Alan F Smeaton, Paul Over, and Wessel Kraaij. Evaluation campaigns and trecvid. In MIR
’06: Proceedings of the 8th ACM International Workshop on Multimedia Information Re-
trieval, pages 321–330, ACM Press, New York, 2006.
[86] John R Smith, Milind Naphade, and Apostol Natsev. Multimedia semantic indexing using
model vectors. In ICME’03, Proceedings, of the 2003 International Conference on Multime-
dia and Expo, 2003, volume 2, pages II–445. IEEE, 2003.
[87] Cees GM Snoek, Marcel Worring, Jan C Van Gemert, Jan-Mark Geusebroek, and
Arnold WM Smeulders. The challenge problem for automated detection of 101 semantic
concepts in multimedia. In Proceedings of the 14th Annual ACM International Conference
on Multimedia, pages 421–430. ACM, 2006.
[88] William H Sumby and Irwin Pollack. Visual contribution to speech intelligibility in noise.
The Journal of the Acoustical Society of America, 26:212, 1954.
[89] Richard Szeliski. Where am i? In Proceedings of ICCV Computer Vision Conference. IEEE,
2005.
[90] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Textural features corresponding to
visual perception. IEEE Transactions on System, Man and Cybernetic, 8(6):460–473, 1978.
[91] Min-Hsuan Tsai, Shen-Fu Tsai, and Thomas S Huang. Hierarchical image feature extraction
and classification. In Proceedings of the International Conference on Multimedia, pages
1007–1010. ACM, 2010.
[92] Shen-Fu Tsai, Liangliang Cao, Feng Tang, and Thomas S Huang. Compositional object pat-
tern: a new model for album event recognition. In Proceedings of the 19th ACM International
Conference on Multimedia, pages 1361–1364. ACM, 2011.
[93] Shen-Fu Tsai, Hao Tang, Feng Tang, and Thomas S Huang. Ontological inference frame-
work with joint ontology construction and learning for image understanding. In 2012 IEEE
International Conference on Multimedia and Expo (ICME), pages 426–431. IEEE, 2012.
[94] Chrisa Tsinaraki, Panagiotis Polydoros, and Stavros Christodoulakis. Interoperability support
for ontology-based video retrieval applications. In Image and Video Retrieval, pages 582–
591. Springer, 2004.
[95] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support
vector machine learning for interdependent and structured output spaces. In Proceedings of
the Twenty-First International Conference on Machine Learning, page 104. ACM, 2004.
[96] Pavan Turaga, Rama Chellappa, Venkatramana S Subrahmanian, and Octavian Udrea. Ma-
chine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems
for Video Technology, 18(11):1473–1488, 2008.
[97] Hualu Wang, Ajay Divakaran, Anthony Vetro, Shih-Fu Chang, and Huifang Sun. Survey of
compressed-domain features used in audio-visual indexing and analysis. Journal of Visual
Communication and Image Representation, 14(2):150–183, June 2003.
[98] James Z Wang, Jia Li, and Gio Wiederhold. Simplicity: Semantics-sensitive integrated
matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 23:947–963, 2001.
[99] Lijuan Wang, Yao Qian, Matthew R Scott, Gang Chen, and Frank K Soong. Computer-
assisted audiovisual language learning. Computer, 45(6):38–47, 2012.
[100] Wei Ying Ma and B S Manjunath. Netra: A toolbox for navigating large image databases. In
Multimedia Systems, pages 568–571, 1999.
[101] Zheng-Jun Zha, Tao Mei, Zengfu Wang, and Xian-Sheng Hua. Building a comprehensive
ontology to refine video concept detection. In Proceedings of the International Workshop on
Multimedia Information Retrieval, pages 227–236. ACM, 2007.
[102] Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, and Cordelia Schmid. Local features
and kernels for classification of texture and object categories: A comprehensive study. Inter-
national Journal of Computer Vision, 73(2):213–238, 2007.
[103] Wei Zhang and Jana Kosecka. Image based localization in urban environments. In Third
International Symposium on 3D Data Processing, Visualization, and Transmission, pages
33–40. IEEE, 2006.
[104] Yan-Tao Zheng, Ming Zhao, Yang Song, Hartwig Adam, Ulrich Buddemeier, Alessandro
Bissacco, Fernando Brucher, Tat-Seng Chua, and Hartmut Neven. Tour the world: Building a
web-scale landmark recognition engine. In IEEE Conference on Computer Vision and Pattern
Recognition, 2009. CVPR 2009, pages 1085–1092. IEEE, 2009.
[105] Xi Zhou, Na Cui, Zhen Li, Feng Liang, and Thomas S Huang. Hierarchical Gaussianization
for image classification. In 2009 IEEE 12th International Conference on Computer Vision,
pages 1971–1977. IEEE, 2009.
Chapter 13
Time Series Data Classification
Dimitrios Kotsakos
Dept. of Informatics and Telecommunications
University of Athens
Athens, Greece
[email protected]
Dimitrios Gunopulos
Dept. of Informatics and Telecommunications
University of Athens
Athens, Greece
[email protected]
13.1 Introduction
Time-series data is one of the most common forms of data encountered in a wide variety of
scenarios such as the stock markets, sensor data, fault monitoring, machine state monitoring, envi-
ronmental applications, and medical data. The problem of classification finds numerous applications in the time series domain, such as determining which of a set of predefined groups of entities is most similar to a time series entity whose group is still unknown. Time series classification has numerous applications in diverse problem domains:
• Financial Markets: In financial markets, the values of stocks represent time series that continually vary with time. The classification of a new time series of unknown group can provide numerous insights into decisions about this specific time series.
• Medical Data: Different kinds of medical data such as EEG readings are in the form of time
series. The classification of such time series, e.g., for a new patient, can provide insights into
similar treatment or aid the domain experts in the decisions that have to be made, as similar
behavior may indicate similar diseases.
• Machine State Monitoring: Numerous kinds of machines create sensor data, which provides a continuous view of the states of these objects. This data can be used to characterize the underlying behaviors and determine the group a time series belongs to.
• Spatio-temporal Data: Trajectory data can be considered a form of multi-variate time series
data, in which the X- and Y -coordinates of objects correspond to continuously varying series.
The behavior in these series can be used in order to determine the class a trajectory belongs
to, and then decide, e.g., if a specific trajectory belongs to a pedestrian or to a vehicle.
Time series data falls in the class of contextual data representations. Many kinds of data such as
time series data, discrete sequences, and spatial data fall in this class. Contextual data contain two
kinds of attributes:
• Contextual Attribute: For the case of time-series data, this corresponds to the time dimension.
The time dimension provides the reference points at which the behavioral values are mea-
sured. The timestamps could correspond to actual time values at which the data points are
measured, or they could correspond to indices at which these values are measured.
• Behavioral Attribute: This could correspond to any kind of behavior that is measured at the
reference point. Some examples include stock ticker values, sensor measurements such as
temperature, or other medical time series.
Given a time series object of unknown class, determining the class it belongs to is extremely challenging because of the difficulty of defining similarity across different time series, which may be scaled and translated differently on both the temporal and the behavioral dimension. The concept of similarity is therefore very important for time series data classification, and this chapter devotes a section to the problem of time series similarity measures.
In the classification problem the goal is to separate the classes that characterize the dataset by
a function that is induced from the available labeled examples. The ultimate goal is to produce a
classifier that classifies unseen examples, or examples of unknown class, with high precision and
recall, while being able to scale well. A significant difference between time series classification and the classification of objects in Euclidean space is that the time series to be assigned to pre-defined or pre-computed classes may not be of equal length to the time series already belonging to those classes. When all time series are of equal length, standard classification techniques can be applied by representing each time series as a vector and using a traditional Lp-norm distance. With such an approach, however, only similarity in time can be exploited, while similarity in shape and similarity in change are disregarded. In this study we split the classification process into two basic steps. The first is the choice of the similarity measure that will be employed, while the second concerns the classification algorithm that will be followed. Xing et al., in their sequence classification survey [42], argue that the high dimensionality of the feature space and the sequential nature of the data make sequence classification a challenging task. These observations apply to the time series classification task as well, since time series are essentially sequences.
The majority of the studies and approaches proposed in the literature for time series classification are application dependent; in most of them, the authors try to improve performance for a specific application. From this point of view, the more general time series classification approaches are of particular interest and importance.
13.3.1 $L_p$-Norms
The $L_p$-norm is a distance metric, since it satisfies the non-negativity, identity, symmetry, and triangle inequality conditions. An advantage of the $L_p$-norm is that it can be computed in time linear in the length of the time series under comparison; its time complexity is O(n), with n being the length of the time series. In order to use the $L_p$-norm, the two time series under comparison must be of the same length.
The Minkowski distance of order p, or the $L_p$-norm distance, is the generalization of the Euclidean distance and is defined as follows:
$$L_p\text{-norm}(T_1, T_2) = D_{M,p}(T_1, T_2) = \sqrt[p]{\sum_{i=1}^{n} (T_{1i} - T_{2i})^p}. \qquad (13.1)$$
The Euclidean distance between two one-dimensional time series $T_1$ and $T_2$ of length n is the special case of the $L_p$-norm for p = 2 and is defined as:
$$D_E(T_1, T_2) = L_2\text{-norm}(T_1, T_2) = \sqrt{\sum_{i=1}^{n} (T_{1i} - T_{2i})^2}. \qquad (13.2)$$
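A minimal sketch of Equations 13.1 and 13.2 for two equal-length time series; the absolute value in the summand is an implementation choice that keeps the formula well defined for odd p.

import numpy as np

def lp_norm_distance(t1, t2, p=2):
    """L_p-norm distance between two equal-length time series (Eqs. 13.1 and 13.2)."""
    t1, t2 = np.asarray(t1, dtype=float), np.asarray(t2, dtype=float)
    return np.sum(np.abs(t1 - t2) ** p) ** (1.0 / p)

print(lp_norm_distance([1, 1, 5], [2, 2, 5]))        # Euclidean distance ~ 1.414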
DTW stems from the speech processing community [33] and has been very popular in the literature on time series distance measures. Using dynamic programming, the DTW distance between two one-dimensional time series T1 and T2 of length m and n, respectively, can be computed as follows:
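A minimal sketch of the standard DTW dynamic program, using the absolute difference as the local cost (an illustrative choice; squared differences are also common):

import numpy as np

def dtw_distance(t1, t2):
    """Dynamic Time Warping distance via the standard dynamic program.

    cost[i][j] holds the minimum cumulative cost of aligning the first i
    elements of t1 with the first j elements of t2, using |x - y| as the
    local cost; cost[m][n] is the DTW distance.
    """
    m, n = len(t1), len(t2)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(t1[i - 1] - t2[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return cost[m, n]

# Time series of different lengths can be compared, unlike with L_p-norms.
print(dtw_distance([1, 1, 5], [2, 2, 2, 2, 2, 5]))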
Although ED for strings is proven to be a metric distance, the two ED-related time series distance
measures DTW and Longest Common Subsequence (LCSS) that will be described in the next sub-
section are proven not to follow the triangle inequality. Lei Chen [7] proposed two extensions to
EDIT distance, namely Edit distance with Real Penalty (ERP) to support local time shifting and
Edit Distance on Real sequence (EDR) to handle both local time shifting and noise in time series
and trajectories. Both extensions have a high computational cost, so Chen proposes various lower bounds, indexing, and pruning techniques to retrieve similar time series more efficiently. Both ERP and DTW can handle local time shifting and measure the distance between two out-of-phase time series effectively. An advantage that ERP has over DTW is that the former is a metric distance function, whereas the latter is not. DTW does not obey the triangle inequality, and therefore traditional index
methods cannot be used to improve efficiency in DTW-based applications. On the other hand, ERP
is proven to be a metric distance function [7], and therefore traditional access methods can be used.
EDR, as a distance function, proves to be more robust than the Euclidean distance, DTW, and ERP, and more accurate than LCSS, which will be described in the next subsection. EDR is not a metric; thus the author proposes three non-constraining pruning techniques (mean value Q-grams, near triangle inequality, and histograms) to improve retrieval efficiency.
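For reference, a standard recursive formulation of the LCSS similarity with a temporal threshold δ and a matching threshold ε (a sketch of the common definition, which may differ in details from the exact equation used here) is:
$$LCSS_{\delta,\varepsilon}(S_1, S_2) = \begin{cases} 0, & \text{if } S_1 \text{ or } S_2 \text{ is empty,} \\ 1 + LCSS_{\delta,\varepsilon}(HEAD(S_1), HEAD(S_2)), & \text{if } |S_{1,m} - S_{2,n}| < \varepsilon \text{ and } |m - n| \leq \delta, \\ \max\{LCSS_{\delta,\varepsilon}(HEAD(S_1), S_2),\, LCSS_{\delta,\varepsilon}(S_1, HEAD(S_2))\}, & \text{otherwise,} \end{cases}$$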
where $HEAD(S_1)$ is the subsequence $[S_{1,1}, S_{1,2}, \ldots, S_{1,m-1}]$, δ is an integer that controls the maximum distance in the time axis between two matched elements, and ε is a real number, 0 < ε < 1, that controls the maximum distance two elements are allowed to have in order to be considered matched. Apart from being used in time series classification, the LCSS distance is often used in domains like speech recognition and text pattern mining. Its main drawback is that it is often necessary to scale or transform one sequence to the other. A detailed study of LCSS variations and algorithms is presented in [5].
13.4 k-NN
The well-known and popular traditional k-NN classifier works no differently when it comes to time series classification. The most challenging part in this setting is again the choice of the distance or similarity measure used to compare two time series objects. The intuition is that a time series object is assigned the class label of the majority of its k nearest neighbors. k-NN has been very popular in the classification literature due to both its simplicity and the highly accurate results it produces.
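A minimal sketch of this voting scheme with a pluggable distance function; the Euclidean distance and the toy labeled set are illustrative, and a DTW implementation such as the sketch in the previous section could be passed in instead.

from collections import Counter
import math

def euclidean(t1, t2):
    """Equal-length Euclidean distance; a DTW function can be plugged in instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))

def knn_classify(query, labeled_series, k=3, distance=euclidean):
    """Assign `query` the majority label among its k nearest labeled time series."""
    neighbors = sorted(labeled_series, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled set: (time series, class label) pairs.
train = [([1, 1, 2, 1], "flat"), ([1, 2, 1, 2], "flat"),
         ([1, 4, 9, 16], "rising"), ([2, 5, 10, 18], "rising")]
print(knn_classify([1, 3, 8, 15], train, k=3))       # "rising"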
Prekopcsák and Lemire study the problem of choosing a suitable distance measure for near-
est neighbor time series classification [28]. They compare variations of the Mahalanobis distance
against Dynamic Time Warping (DTW), and find that the latter is superior in most cases, while
the Mahalanobis distance is one to two orders of magnitude faster. The authors recommend learning one Mahalanobis distance per class in order to achieve better performance. They experimented with a traditional 1-NN classifier and concluded that the class-specific covariance-shrinkage estimate and the class-specific diagonal Mahalanobis distances achieve the best results in terms of classification error and running time. Another interesting observation is that in 1-NN classification, DTW is at least two orders of magnitude slower than the Mahalanobis distance on all 17 datasets the authors experimented with.
Ding et al. [9] use the k-NN classifier with k = 1 on labeled data as a gold standard to evaluate the efficacy of a variety of distance measures in their attempt to extensively validate representation methods and similarity measures over many time series datasets. Specifically, the idea is to predict the correct class label of each time series with a 1-NN classifier as the label of the object that is most similar to the query object. With this approach, the authors directly evaluate the effectiveness of each distance measure, whose choice is very important for the k-NN classifier in all settings, let alone time series classification, where the choice of a suitable distance measure is not a trivial task.
In a more recent work, similar to k-NN classification, Chen et al. propose a generative latent
source model for online binary time series classification in a social network, with the goal of detect-
ing trends as early as possible [6]. The proposed classifier uses weighted majority voting to classify
the unlabeled time series, by assigning the time series to be classified to the class indicated by the
labeled data. The labeled time series vote in favor of the ground truth model they belong to. The
authors applied the described model to predicting the virality of news topics on the Twitter social
network; they do not rely on explicit definitions of virality or trend, using them only as black-box
ground truth. The experimental results indicated that weighted majority voting can often detect or
predict trends earlier than Twitter itself.
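The weighted majority voting idea can be illustrated with the following generic sketch; note that the actual weights in [6] are derived from the latent source model, whereas here, purely for illustration, each labeled series votes with a weight that decays exponentially with its distance to the query:

import numpy as np

def weighted_vote_classify(query, labeled_series, labels, dist, gamma=1.0):
    # Each labeled series votes for its own class; here the vote weight decays
    # exponentially with the distance to the query (a generic stand-in for the
    # model-derived weights of [6]).
    votes = {}
    for ts, label in zip(labeled_series, labels):
        votes[label] = votes.get(label, 0.0) + np.exp(-gamma * dist(query, ts))
    return max(votes, key=votes.get)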
A = 1, 1, 5;   B = 2, 2, 5;   C = 2, 2, 2, 2, 2, 5     (13.3)
the DTW computation yields:
A = 1, 1, 1, 1;   B = 2, 2, 2, 2;   C = 3, 3, 3, 3     (13.5)
with ε = 1, the LCSS computation yields:
As LCSS is not a distance function but a similarity measure, we can turn it into a distance:
D_LCSS(TS_i, TS_j) = 1 − LCSS(TS_i, TS_j) / max{length(TS_i), length(TS_j)}.     (13.7)
By applying the above formula on the time series of the example we obtain:
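A minimal sketch of how the LCSS similarity and the derived distance of Equation (13.7) can be computed is shown below (assuming univariate series and the ε matching threshold defined earlier; the δ warping constraint is omitted for brevity):

def lcss(ts_a, ts_b, eps):
    # Dynamic programming table; L[i][j] = LCSS of the first i and j elements.
    n, m = len(ts_a), len(ts_b)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(ts_a[i - 1] - ts_b[j - 1]) < eps:   # elements "match"
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def lcss_distance(ts_a, ts_b, eps):
    # Equation (13.7): turn the similarity into a distance in [0, 1].
    return 1.0 - lcss(ts_a, ts_b, eps) / max(len(ts_a), len(ts_b))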
The main contribution of [23] was the insertion of synthetic data points and the over-sampling of the
minority class in imbalanced data sets. Synthetic data points can be added in all distance spaces, even
in non-metric ones, so that elastic sequence matching algorithms like DTW can be used. Huerta et
al. extend SVMs to classify multidimensional time series by using concepts from nonlinear dynam-
ical systems theory [17], and more specifically the Rössler oscillator, which is a three-dimensional
dynamical system. The authors experimented on both real and synthetic benchmark datasets and
the proposed dynamical SVMs proved to achieve in several cases better results than many pro-
posed kernel-based approaches for multidimensional time series classification. Sasirekha and Ku-
mar applied SVM classification on real-life long-term ECG time series recordings and concluded
that SVMs can act as a good predictor of the heart rate signal [34]. In their work they propose
a variety of feature extraction techniques and compare the resulting accuracy values using the
corresponding Heart Rate Variability (HRV) features. The experimental results
show that SVM classification for heartbeat time series performs well in detecting different types of attacks,
and its results are comparable to those reported in the literature. Kampouraki et al. [20] also applied Gaussian
kernel-based SVM classification with a variety of statistical and wavelet HRV features to tackle
the heartbeat time series classification problem, arguing that the SVM classifier performs better in
this specific application than other proposed approaches based on neural networks. In their study,
SVMs proved to be more robust to cases where the signal-to-noise ratio was very low. In the exper-
imental evaluation, the authors compared SVM classification to learning vector quantization (LVQ)
neural network [22] and a Levenberg-Marquardt (LM) minimization back-propagation neural net-
work [15], and the SVMs achieved better precision and recall values, even when the signals to be
classified were noisy. Grabocka et al. propose a method that improves SVM time series classifi-
cation performance by transforming the training set by developing new training objects based on
the support vector instances [14]. In their experimental evaluation on many time series datasets, the
enhanced method outperforms both the DTW-based k-NN and the traditional SVM classifiers. The
authors propose a data-driven, localized, and selective new-instance generation method based on
Virtual Support Vectors (VSV), which have been used for image classification. In a brief outline,
the method uses a deformation algorithm to transform the time series, by splitting each time series
instance into many regions and transforming each of them separately. The experimental results show
an accuracy improvement between 5% and 11% over traditional SVM classification, while produc-
ing better mean error rates than the DTW-based k-NN classifier. However, the proposed approach
of boosting SVM accuracy has the trade-off of being worse in terms of time complexity, because of
the presence of more training examples.
As all works mentioned above underline, the main issue in applying SVMs to time series clas-
sification is the choice of the kernel function that leads to the minimum generalization error and at
the same time keeps the classification error on the time series training set low. Various kernel approaches have been proposed in the literature, some of which have been used to classify sequences
with characteristics similar to time series (e.g., [18, 41]) or for handwriting recognition [4].
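As an illustration of the role of the kernel, the sketch below (assuming scikit-learn is available and that dtw is some user-supplied DTW implementation) builds a Gaussian kernel from pairwise DTW distances and feeds it to an SVM as a precomputed Gram matrix; such a kernel is not guaranteed to be positive semi-definite, which is one of the issues the kernel approaches cited above try to address:

import numpy as np
from sklearn.svm import SVC

def dtw_gram_matrix(series_a, series_b, dtw, gamma=0.1):
    # Gaussian kernel over pairwise DTW distances.
    G = np.zeros((len(series_a), len(series_b)))
    for i, s in enumerate(series_a):
        for j, t in enumerate(series_b):
            G[i, j] = np.exp(-gamma * dtw(s, t) ** 2)
    return G

# Training: the Gram matrix between all pairs of training series is precomputed.
# K_train = dtw_gram_matrix(train_series, train_series, dtw)
# clf = SVC(kernel="precomputed").fit(K_train, train_labels)
# Prediction: rows correspond to test series, columns to training series.
# predictions = clf.predict(dtw_gram_matrix(test_series, train_series, dtw))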
The resulting tree is often too large, which leads to overfitting, so a pruning process reduces
the tree size in order to both improve classification accuracy and reduce classification complexity.
The most popular decision tree algorithms are ID3 (Iterative Dichotomiser 3) [29], C4.5, which
is an extension of ID3 [30], CART (Classification and Regression Tree) [27], and CHAID (CHi-
squared Automatic Interaction Detector) [35], which performs multi-level splits when computing
classification trees. Douzal-Chouakria and Amblard describe an extension to the traditional deci-
sion tree classifier, where the metric changes from one node to another and the split criterion in
each node extracts the most characteristic subsequences of the time series to be split [10]. Geurts
argues that many time series classification problems can be solved by identifying and exploiting
local patterns in the time series. In [13] the author performs time series segmentation to extract
prototypes that can be represented by sets of numerical values so that a traditional classifier can
be applied. More specifically, the proposed method extracts patterns from time series, which then
are combined in decision trees to produce classification rules. The author applies the proposed ap-
proach on a variety of synthetic and real datasets and proves that pattern extraction used this way
can improve classification accuracy and interpretability, as the decision trees’ rules can provide
a user with a broad overview of classes and datasets. With a similar goal in mind, namely inter-
pretability of classification results, Rodrı́guez and Alonso present a method that uses decision trees
for time series classification [32]. More specifically, two types of trees are presented: a) interval-
based trees, where each decision node evaluates a function in an interval, and b) DTW-based trees,
where each decision node has a reference example, which the time series to classify is compared
against, e.g., DTW (tsi ,tsre f ) < T hreshold. The decision trees can also be constructed by extracting
global and local features (e.g., global: mean, length, maximum; local: local maximum, slope, etc.)
or by extracting simple or complex patterns. A complex pattern is a sequence of simple patterns and
a simple pattern is a time series whose Euclidean Distance to the examined one is small enough.
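A DTW-based decision node of the kind used in [32] can be sketched as follows (a minimal illustration assuming a user-supplied dtw function; in practice the reference series and threshold would be chosen to maximize a purity criterion such as information gain):

def dtw_split(dataset, reference, threshold, dtw):
    # Split (series, label) pairs by comparing each series to a reference series,
    # i.e., the test DTW(ts_i, ts_ref) < Threshold used in a decision node.
    left, right = [], []
    for ts, label in dataset:
        if dtw(ts, reference) < threshold:
            left.append((ts, label))
        else:
            right.append((ts, label))
    return left, right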
Most decision tree algorithms are known to have high variance error [19]. In their work, Jović et
al. examined the performance of decision tree ensembles and experimented on biomedical datasets.
Decision tree ensembles included AdaBoost.M1 and Multiboost applied to the traditional decision
tree algorithm C4.5. The classification accuracy improved, and the authors argue that decision tree
ensembles are comparable to, if not better than, the SVM-based classifiers that have been commonly used
for time series classification in the biomedical domain, which is one of the most popular application areas
for time series classification.
As in all cases where classification trees are chosen as a mechanism to classify data of unknown
labels, in the time series domain the main issue is to find appropriate split criteria that separate the
input dataset and avoid overfitting. Some time-series-specific issues regarding the choice of a split
criterion are related to the many possible choices of distance between two time series objects. Time series
classification trees that use the same split criterion for all nodes may fail to identify unique charac-
teristics that distinguish time series classes in different parts of the tree. Moreover, distance-based
approaches should not consider only the cases where the distance is computed on the whole time
series, since there may be cases where subsequences exhibit certain separation characteristics.
However, most of the works that use decision trees for time series classification focus on com-
prehensibility of the classification results, rather than running time or accuracy, precision, or recall.
Thus, a general conclusion that can be drawn regarding decision trees for time series classification
is the following: classification trees for time series are a suitable choice when interpretability of
the results is the main concern. When classification accuracy matters most, other classifiers, like k-NN or
SVMs, achieve better results.
Moreover, there is a set of prior probabilities Pr = {p_j} that represents the probability of each
state being the first state of the model. HMMs in most proposed approaches assume a Gaussian
probability distribution OP_j for each state j. The most common approach for learning Hidden
Markov Models is the Expectation Maximization (EM) algorithm [26], which after a recursive
estimation converges to a local maximum of the likelihood. More specifically, the EM algorithm is
used by the popular Baum-Welch algorithm in order to find the unknown parameters of an HMM [31].
It relies on the assumption that the i-th hidden variable depends only on the (i − 1)-th hidden variable
and not on previous ones, and similarly that the current observation variable depends only on the
current hidden state. So, given a set of observed time series that get transformed into a set of observed
feature vectors, the Baum-Welch algorithm finds the maximum likelihood estimate of the parameters
of the corresponding HMM. More formally, if X_t is a hidden random variable at time t, the probability
P(X_t | X_{t−1}) does not depend on the value of t. A hidden Markov chain is described by
θ = (TP, OP, Pr). The Baum-Welch algorithm computes θ* = argmax_θ P(TS | θ), namely the
parameters that maximize the probability of the observation sequence TS.
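To make the notation concrete, the following sketch implements the Forward computation of P(TS | θ) for a discrete-observation HMM; the chapter's HMMs assume Gaussian emissions per state, but the recursion is identical once the emission term is replaced. The arrays TP, OP, and Pr correspond to θ = (TP, OP, Pr) above.

import numpy as np

def forward_likelihood(obs, TP, OP, Pr):
    # P(observation sequence | theta) for a discrete-emission HMM, where
    # TP[i, j] = P(state j at t+1 | state i at t),
    # OP[i, o] = P(emitting symbol o | state i),
    # Pr[i]    = P(the first state is i).
    alpha = Pr * OP[:, obs[0]]            # initialization
    for o in obs[1:]:
        alpha = (alpha @ TP) * OP[:, o]   # recursion
    return alpha.sum()                    # termination

# Classification: train one HMM per class (e.g., with Baum-Welch) and assign a
# test sequence to the class whose model yields the highest likelihood.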
In [24], Kotsifakos et al. evaluated and compared the accuracy achieved by search-by-model and
search-by-example methods in large time series databases. Search-by-example included DTW and
constrained-DTW methods. In the search-by-model approach, the authors used a two-step training
phase that consisted of (i) initialization and (ii) refinement. In the initialization step an HMM is
computed for each class in the dataset. In the refinement step, the Viterbi algorithm is applied on
each training time series per class in order to identify the best state sequence for each of them.
The transition probabilities used in this step are the ones computed at the initialization step of the
algorithm. After an HMM has been trained for each class in the training set, a probability distri-
bution is defined, which represents the probability of each time series in the testing set assuming
that it belongs to the corresponding class. Given a time series T S, the probability of T S belong-
ing to class Ci , P(T S|T S ∈ Ci ), i = 0, ..., |C| is computed with the dynamic programming Forward
algorithm [31]. In the experimental evaluation on 20 time series datasets, the model-based classifi-
cation method yielded more accurate results than the search-by-example technique when there were
a lot of training examples per model. However, when there were few training examples per class,
search-by-example proved to be more accurate.
In [3], Antonucci and De Rosa also use HMMs for time series classification, while coping with
the problems imposed by the short length of time series in some applications. The EM algorithm
calculates unreliable estimates when the amount of training data is small. To tackle this problem,
the authors describe an imprecise version of the learning algorithm and propose an HMM-based
classifier that returns multiple classes when the highest-likelihood intervals corresponding to the
returned classes overlap.
13.9 Conclusion
The time series data mining literature includes many works devoted to time series classification
tasks. As time series datasets and databases grow larger and applications evolve, the performance
and accuracy of time series classification algorithms become ever more important for real-life
applications. Although most time series classification approaches adopt and adapt techniques
from general data classification studies, some essential modifications to these techniques have to be
introduced in order to cope with the nature of time series distances and time series representations.
Time series classification has found applications in economics, health care, and multimedia, among
others. In this chapter, we highlighted the major issues and the
most influential corresponding techniques that deal with the time series classification task.
Acknowledgements
This work was supported by European Union and Greek National funds through the Operational
Program “Education and Lifelong Learning” of the National Strategic Reference Framework
(NSRF) — Research Funding Programs: Heraclitus II fellowship, THALIS — GeomComp,
THALIS — DISFER, ARISTEIA — MMD, ARISTEIA — INCEPTION, and the EU FP7 project
INSIGHT (www.insight-ict.eu).
Bibliography
[1] Ghazi Al-Naymat, Sanjay Chawla, and Javid Taheri. SparseDTW: A novel approach to speed up
dynamic time warping. In Proceedings of the Eighth Australasian Data Mining Conference -
Volume 101, pages 117–127. Australian Computer Society, Inc., 2009.
[2] Helmut Alt and Michael Godau. Computing the Fréchet distance between two polygonal
curves. International Journal of Computational Geometry & Applications, 5(01–02):75–91,
1995.
[3] Alessandro Antonucci and Rocco De Rosa. Time series classification by imprecise hidden
Markov models. In WIRN, pages 195–202, 2011.
[4] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition
with support vector machines—a kernel approach. In Proceedings of the Eighth International
Workshop on Frontiers in Handwriting Recognition, pages 49–54. IEEE, 2002.
[5] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence
algorithms. In Proceedings of the Seventh International Symposium on String Processing and
Information Retrieval (SPIRE 2000), pages 39–48. IEEE, 2000.
[6] George H Chen, Stanislav Nikolov, and Devavrat Shah. A latent source model for online time
series classification. arXiv preprint arXiv:1302.3639, 2013.
[7] Lei Chen, M Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving
object trajectories. In Proceedings of the 2005 ACM SIGMOD International Conference on
Management of Data, pages 491–502. ACM, 2005.
[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
20(3):273–297, 1995.
[9] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Query-
ing and mining of time series data: experimental comparison of representations and distance
measures. Proceedings of the VLDB Endowment, 1(2):1542–1552, 2008.
[10] Ahlame Douzal-Chouakria and Cécile Amblard. Classification trees for time series. Pattern
Recognition, 45(3):1076–1091, 2012.
[11] Anne Driemel, Sariel Har-Peled, and Carola Wenk. Approximating the Fréchet distance for
realistic curves in near linear time. Discrete & Computational Geometry, 48(1):94–127, 2012.
[12] Thomas Eiter and Heikki Mannila. Computing discrete Fréchet distance. Tech. Report CS-TR-
2008–0010, Christian Doppler Laboratory for Expert Systems, Vienna, Austria, 1994.
[13] Pierre Geurts. Pattern extraction for time series classification. In Principles of Data Mining
and Knowledge Discovery, pages 115–127. Springer, 2001.
[14] Josif Grabocka, Alexandros Nanopoulos, and Lars Schmidt-Thieme. Invariant time-series
classification. In Machine Learning and Knowledge Discovery in Databases, pages 725–740.
Springer, 2012.
[15] Martin T Hagan and Mohammad B Menhaj. Training feedforward networks with the mar-
quardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.
[16] Sariel Har-Peled et al. New similarity measures between polylines with applications to mor-
phing and polygon sweeping. Discrete & Computational Geometry, 28(4):535–569, 2002.
[17] Ramón Huerta, Shankar Vembu, Mehmet K Muezzinoglu, and Alexander Vergara. Dynamical
SVM for time series classification. In Pattern Recognition, pages 216–225. Springer, 2012.
[18] Tommi Jaakkola, David Haussler, et al. Exploiting generative models in discriminative classi-
fiers. Advances in Neural Information Processing Systems, pages 487–493, 1999.
[19] Alan Jović, Karla Brkić, and Nikola Bogunović. Decision tree ensembles in biomedical time-
series classification. In Pattern Recognition, pages 408–417. Springer, 2012.
[20] Argyro Kampouraki, George Manis, and Christophoros Nikou. Heartbeat time series clas-
sification with support vector machines. IEEE Transactions on Information Technology in
Biomedicine, 13(4):512–518, 2009.
[21] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact indexing of dynamic time warping.
Knowledge and Information Systems, 7(3):358–386, 2005.
[22] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
[23] Suzan Köknar-Tezel and Longin Jan Latecki. Improving SVM classification on imbalanced
time series data sets with ghost points. Knowledge and Information Systems, 28(1):1–23, 2011.
[24] Alexios Kotsifakos, Vassilis Athitsos, Panagiotis Papapetrou, Jaakko Hollmén, and Dimitrios
Gunopulos. Model-based search in large time series databases. In PETRA, page 36, 2011.
[25] Daniel Lemire. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern
Recognition, 42(9):2169–2180, 2009.
[26] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions, vol-
ume 382. John Wiley & Sons, 2007.
[27] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and
Regression Trees. Wadsworth International Group, 1984.
[28] Zoltán Prekopcsák and Daniel Lemire. Time series classification by class-specific Mahalanobis
distance measures. Advances in Data Analysis and Classification, 6(3):185–200, 2012.
[29] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[30] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann,
1993.
[31] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[32] Juan J Rodríguez and Carlos J Alonso. Interval and dynamic time warping-based decision
trees. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 548–552.
ACM, 2004.
[33] Nick Roussopoulos, Stephen Kelley, and Frédéric Vincent. Nearest neighbor queries. ACM
SIGMOD Record, 24(2):71–79, 1995.
[34] A Sasirekha and P Ganesh Kumar. Support vector machine for classification of heartbeat time
series data. International Journal of Emerging Science and Technology, 1(10):38–41, 2013.
[35] Merel van Diepen and Philip Hans Franses. Evaluating chi-squared automatic interaction
detection. Information Systems, 31(8):814–831, 2006.
[36] Michail Vlachos, Dimitrios Gunopulos, and Gautam Das. Rotation invariant distance mea-
sures for trajectories. In Proceedings of the Tenth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 707–712. ACM, 2004.
[37] Michail Vlachos, Dimitrios Gunopulos, and George Kollios. Robust similarity measures
for mobile object trajectories. In Proceedings of the 13th International Workshop on
Database and Expert Systems Applications, pages 721–726. IEEE, 2002.
[38] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn Keogh. Index-
ing multi-dimensional time-series with support for multiple distance measures. In Proceedings
of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Min-
ing, pages 216–225. ACM, 2003.
[39] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn Keogh. In-
dexing multidimensional time-series. The VLDB Journal—The International Journal on Very
Large Data Bases, 15(1):1–20, 2006.
[40] Michail Vlachos, George Kollios, and Dimitrios Gunopulos. Discovering similar multidi-
mensional trajectories. In Proceedings of the 18th International Conference on Data
Engineering, pages 673–684. IEEE, 2002.
[41] Chris Watkins. Dynamic alignment kernels. Advances in Neural Information Processing
Systems, pages 39–50, 1999.
[42] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification.
SIGKDD Explor. Newsl., 12(1):40–48, November 2010.
[43] Demetrios Zeinalipour-Yazti, Christos Laoudias, Costandinos Costa, Michail Vlachos, M An-
dreou, and Dimitrios Gunopulos. Crowdsourced trace similarity with smartphones. IEEE
Transactions on Knowledge and Data Engineering, 25(6):1240–1253, 2012.
[44] Demetrios Zeinalipour-Yazti, Song Lin, and Dimitrios Gunopulos. Distributed spatio-temporal
similarity search. In Proceedings of the 15th ACM International Conference on Information
and Knowledge Management, pages 14–23. ACM, 2006.
Chapter 14
Discrete Sequence Classification
Mohammad Al Hasan
Indiana University Purdue University
Indianapolis, IN
[email protected]
14.1 Introduction
Sequence classification is an important research task with numerous applications in various
domains, including bioinformatics, e-commerce, health informatics, computer security, and finance.
In the bioinformatics domain, classification of protein sequences is used to predict the structure and
function of newly discovered proteins [12]. In e-commerce, query sequences from a session are
used to distinguish committed shoppers from the window shoppers [42]. In the Web application
domain, a sequence of queries is used to distinguish Web robots from human users [49]. In health
informatics, sequences of events are used in longitudinal studies for predicting the risk factor of a
patient for a given disease [17]. In the security domain, the sequence of a user’s system activities
on a terminal is monitored for detecting abnormal activities [27]. In the financial world, sequences
of customer activities, such as credit card transactions, are monitored to identify potential identity
theft, sequences of bank transactions are used for classifying accounts that are used for money
laundering [33], and so on. See [36] for a survey on sequential pattern mining.
Because of its importance, sequence classification has received considerable attention from researchers
in various domains, including data mining, machine learning, and statistics. Earlier
works on this task come mostly from the field of statistics and are concerned with classifying
sequences of real numbers that vary with time. This gave rise to a series of works under the name
of time series classification [15], which has found numerous applications in real life. Noteworthy
examples include the classification of ECG signals to distinguish between various states of a
patient [20], and the classification of stock market time series [22]. However, in recent years, particularly
in data mining, the focus has shifted towards classifying discrete sequences [25, 28], where a se-
quence is considered as a sequence of events, and each event is composed of one or a set of items
from an alphabet. Examples can be a sequence of queries in a Web session, a sequence of events in a
manufacturing process, a sequence of financial transactions from an account, and so on. Unlike time
series, the events in a discrete sequence are not necessarily ordered based on their temporal position.
For instance, a DNA sequence is composed of four nucleotide bases A, C, G, T, and a DNA segment, such
as ACCGTTACG, is simply a string of bases without any temporal connotation attached to
their order in the sequence.
In this chapter, our discussion will be focused on the classification of discrete sequences only.
This is not limiting, because any continuous sequence can be easily discretized using appropriate
methods [13]. Excluding a brief survey [52], we have not found any other works in the existing
literature that summarize various sequence classification methods in great detail as we do in this
chapter.
The rest of the chapter is organized as follows. In Section 14.2, we discuss some background
materials. In Section 14.3, we introduce three different categories in which the majority of sequence
classification methods can be grouped. We then discuss sequence classification methods belonging
to these categories in each of the following three sections. In Section 14.7 we discuss some of the
methods that overlap across multiple categories based on the grouping defined in Section 14.3. Section
14.8 provides a brief overview of a set of sequence classification methods that use somewhat
non-typical problem definitions. Section 14.9 concludes this chapter.
14.2 Background
In this section, we will introduce some of the background materials that are important to un-
derstand the discussion presented in this chapter. While some of the background materials may
have been covered in earlier chapters of this book, we cover them anyway to make the chapter as
independent as possible.
14.2.1 Sequence
Let I = {i1, i2, ..., im} be a set of m distinct items comprising an alphabet Σ. An event is a non-empty
unordered collection of items, denoted as (i1 i2 ... ik), where each ij is an item. The number
of items in an event is its size. A sequence is an ordered list of events. A sequence α is denoted as
(α1 → α2 → ... → αq), where each αi is an event. For example, AB → E → ACG is a sequence, where AB
is an event and A is an item. The length of the sequence (α1 → α2 → ... → αq) is q and the width is
the maximum size of any αi, for 1 ≤ i ≤ q. For some sequences, each of the events contains exactly
one item, i.e., the width of the sequence is 1. In that case, the sequence is simply an ordered string
of items. In this chapter, we call such a sequence a string sequence or simply a string. Examples of
string sequences are DNA or protein sequences, query sequences, or sequences of words in natural
language text.
14.2.4 n-Grams
For string sequences, sometimes substrings, instead of subsequences, are used as features for
classification. These substrings are commonly known as n-grams. Formally, an n-gram is a contigu-
ous sequence of n items from a given sequence. An n-gram of size 1 is referred to as a “unigram,”
size 2 as a “bigram,” size 3 as a “trigram,” and so on.
A learning method uses these data instances for building a classification model that can be used
for predicting the class label of an unseen data instance. Since each data instance is represented as
a feature vector, such a classification method is called feature-based classification. Sequence data
does not have a native vector representation, so it is not straightforward to classify sequences using
a feature-based method. However, researchers have proposed methods that find features that capture
sequential patterns that are embedded in a sequence.
Sequential data that belong to different classes not only vary in the items that they contain, but
also vary in the order of the items as they occur in the sequence. To build a good classifier, a feature
in a sequence classifier must capture the item orders that are prevalent in a sequence, which can be
done by considering the frequencies of short subsequences in the sequences. For string sequences,
the 2-grams in a sequence consider all the ordered term-pairs, 3-grams in a sequence consider all
the ordered term-triples, and so on; for a string sequence of length l we can thus consider all the
n-grams that have length up to l. The frequencies of these sub-sequences can be used to model
all the item-orders that exist in the given sequence. For example, if CACAG is a sequence, we can
consider all the n-grams of length 2, 3, 4, and 5 as its feature; CA, AC, and AG are 2-grams, CAC,
ACA, and CAG are 3-grams, CACA, and ACAG are 4-grams, and the entire sequence is a 5-gram
feature. Except for CA, which has a feature value of 2, all the other features have a value of 1. In
case the sequences are not strings, then term order can be modeled by mining frequent sequential
patterns, and the count of those patterns in a given sequence can be used as the feature value.
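A minimal sketch of this n-gram feature construction for string sequences is given below; applied to CACAG with n ranging from 2 to the sequence length it produces exactly the feature values described above, e.g., a value of 2 for CA.

from collections import Counter

def ngram_features(seq, n_min=2, n_max=None):
    # Count all contiguous n-grams of a string sequence as features.
    n_max = len(seq) if n_max is None else n_max
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            feats[seq[i:i + n]] += 1
    return feats

# ngram_features("CACAG")  ->  CA: 2, AC: 1, AG: 1, CAC: 1, ACA: 1, CAG: 1, ...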
Feature-based classification has several challenges. One of these is that, for long sequences or for
a large alphabet, there are too many features to consider, and many of them are not good features
for classification. Complexity of many supervised classification algorithms depends on the feature
dimension. Further, for many classification methods, irrelevant features can degrade classification
performance. To overcome this limitation, feature selection methods are necessary for selecting a
subset of sequential features that are good candidates for classification. Another challenge is that
the length of various sequences in a dataset can vary considerably, and hence the variation of n-
gram or sequence pattern frequency values among different sequences could be due to their length
variation, instead of their class label. To negate the influence of length on feature values, various
normalizations (similar to tf-idf in information retrieval) can be applied to make the feature values
uniform across sequences of different lengths.
In machine learning, feature selection [35] is a well-studied task. Given a collection of features,
this task selects a subset of features that achieves the best performance for the classification task. For
sequence classification, if we exhaustively enumerate all the n-grams or if we mine a large collection
of sequential patterns using a small minimum support threshold, we should apply a feature selection
method for finding a concise set of good sequential features. Feature selection methods in machine
learning fall into two broad categories: filter-based and wrapper-based. In this
and the following section, we will discuss some methods that are similar to the filter-based approach.
In Section 14.4.3, we discuss a feature selection method that is built on the wrapper paradigm.
1. Support and Confidence: Support and Confidence are two measures that are used for ranking
association rules in itemset pattern mining [2]. These measures have also been adapted for
finding classification association rules (CAR) [32]. A CAR, as defined in [32], is simply a
rule P → y, where P is an itemset and y is a class label. However, in the context of sequential
feature selection, P is a sequential feature instead of an itemset, so we will refer to such a rule
as CR (classification rule) to avoid confusion with CAR, as the latter is specifically defined
for itemsets.
The support of a CR P → y, denoted as σ(P → y, D), is defined as the number of cases in
D that contain the pattern P and are also labeled with the class y. To make the support invariant
to the database size, we typically normalize it by |D|, i.e., support(P → y) = σ(P → y, D) / |D|.
The confidence of a CR P → y is defined as the fraction of cases in D that contain the pattern P
and have class label y with respect to the cases that contain the pattern P, i.e.,
confidence(P → y) = σ(P → y, D) / σ(P, D). If the dataset has only two classes (say, y1 and y2), and the
confidence of the CR (P → y1) is c, then the confidence of the CR (P → y2) is 1 − c.
Using the above definitions, both the support and the confidence of a CR are between 0 and 1.
The higher the support of a CR, the more likely that the rule will be applicable in classifying
unseen sequences; the higher the confidence of a rule, the more likely that the rule will predict
the class accurately. So a simple way to perform feature selection is to choose minimum
support and minimum confidence values and select a sequence C as a feature if, for some class y,
the CR C → y satisfies both constraints. For instance, if minsup = 0.2 and minconf = 0.60
are the minimum support and minimum confidence thresholds, and y is one of the class labels, then
the sequence C is selected as a feature if support(C → y) ≥ 0.2 and confidence(C → y) ≥ 0.6.
Support- and confidence-based feature selection is easy to understand and computationally inexpensive.
However, there are some drawbacks. First, it is not easy to select good values for the
minsup and minconf thresholds, and this choice may affect the classification quality. Another
limitation is that this feature selection method does not consider multiple occurrences of a
sequence feature in an input sequence, as the support and confidence measures add a weight
of 1 when a sequence feature exists in an input-sequence, ignoring the plurality of its occurrences
in that sequence. Another drawback is that this feature selection method considers each
feature in isolation, ignoring dependencies among different features. (A small computational sketch
of this and the following ranking measures is given after this list.)
2. Information gain: Information gain is a feature selection metric. It is popularly used in a
decision tree classifier for selecting the split feature that is used for partitioning the feature
space at a given node of the tree. However, this metric can generally be used for ranking
features based on their importance for a classification task. Consider a sequence pattern X
and a dataset of input-sequences D; based on whether or not X exists in an input-sequence in
D, we can partition D into two parts, DX=y and DX=n. For a good feature, the partitioned
datasets are more pure than the original dataset, i.e., the partitions consist of input-sequences
mostly from one of the classes only. This purity is quantified using the entropy, which we
define below.
In information theory, the entropy of a partition or region D is defined as
H(D) = − ∑i P(ci|D) lg P(ci|D), where the sum runs over the k classes and P(ci|D) is the
probability of class ci in D. If a dataset is pure, i.e., it consists of input sequences from only one
class, then the entropy of that dataset is 0. If the classes are mixed up, and each appears with
equal probability P(ci|D) = 1/k, then the entropy has the highest value, H(D) = lg k. If the
dataset is partitioned into DX=y and DX=n, then the resulting entropy of the partitioned dataset is
H(DX=y, DX=n) = (|DX=y| / |D|) H(DX=y) + (|DX=n| / |D|) H(DX=n). The information gain of
the feature X is then defined as Gain(X, D) = H(D) − H(DX=y, DX=n). The higher the gain, the better the feature
for sequence classification. For selecting a subset of features, we can simply rank the features
based on the information gain value, and select a desired number of top-ranked features.
3. Odds ratio: The odds ratio measures the odds of a sequence occurring in input-sequences labeled
with some class ci, normalized by the odds that it occurs in input-sequences labeled with some
class other than ci. If X is a sequence feature, P(X|ci) denotes the probability that the sequence
X occurs in class ci, and P(X|c̄i) denotes the probability that the sequence X occurs in any
class other than ci, then the odds ratio of X can be computed as
[P(X|ci) · (1 − P(X|c̄i))] / [P(X|c̄i) · (1 − P(X|ci))].
If the value of the odds ratio is near 1, then the sequence X is a poor feature, as it is equally likely
to occur in input-sequences of class ci and in input-sequences of classes other than ci. An
odds ratio greater (less) than 1 indicates that X is more (less) likely to occur in input sequences
of class ci than in input sequences of other classes; in that case X is a good feature. The odds
ratio is nonnegative whenever it is defined; it is undefined if the denominator is equal to 0.
Note that sometimes we take the absolute value of the logarithm of the odds ratio instead of the
odds ratio itself. Then, for a poor feature, the log-odds-ratio is a small number near 0, but for a
good feature, it is a large positive number. To obtain a small subset of sequential features, we can
simply rank the sequential features based on the log-odds-ratio, and select a desired number of
features according to the ranking.
Besides the above, there are other measures, such as Kullback-Leibler (K-L) divergence, Eu-
clidean distance, feature-class entropy, etc., that can also be used for feature ranking.
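To make the ranking measures of this list concrete, the following sketch computes support and confidence, information gain, and a log-odds-ratio score for a candidate pattern. It is only an illustration: string sequences and substring containment are assumed as the notion of "contains" (a subsequence test would be substituted for general sequential patterns), and the small smoothing constant in the odds ratio is an added assumption to avoid division by zero.

import math
from collections import Counter

# dataset: list of (sequence, class_label) pairs.

def support_confidence(pattern, label, dataset):
    matched_labels = [y for s, y in dataset if pattern in s]     # sigma(P, D)
    with_label = sum(1 for y in matched_labels if y == label)    # sigma(P -> y, D)
    support = with_label / len(dataset)
    confidence = with_label / len(matched_labels) if matched_labels else 0.0
    return support, confidence

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(pattern, dataset):
    labels = [y for _, y in dataset]
    yes = [y for s, y in dataset if pattern in s]                # D_{X=y}
    no = [y for s, y in dataset if pattern not in s]             # D_{X=n}
    n = len(dataset)
    split = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - split

def log_odds_ratio(pattern, target, dataset, smooth=1e-9):
    # Assumes both the target class and at least one other class appear in the dataset.
    in_cls = [s for s, y in dataset if y == target]
    out_cls = [s for s, y in dataset if y != target]
    p_in = sum(pattern in s for s in in_cls) / len(in_cls)
    p_out = sum(pattern in s for s in out_cls) / len(out_cls)
    ratio = (p_in * (1 - p_out) + smooth) / (p_out * (1 - p_in) + smooth)
    return abs(math.log(ratio))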
Winnow classifier; both classifiers perform significantly better with the features mined by
FeatureMine compared to a set of baseline features that have a length of one.
sequences may share only a few of those domains; in that case even if the overall similarity between
these two proteins is weak, if the matched functional domains are highly significant, these proteins
may belong to the same class. Similarly, when classifying genome sequences, the similarity metric
should consider only the coding part of the DNA and discard a significant part of the genome,
such as junk DNA and tandem repeats. In such scenarios, finding a suitable distance metric is a
pre-condition for achieving good classification performance.
One of the most common distance-based classifiers is k-NN (nearest neighbors) classifier, which
is lazy, i.e., it does not pre-compute a classification model. For an unseen sequence instance, u, k-
NN computes u’s distance to all the input-sequences in D, and finds the k nearest neighbors of u. It
then predicts the majority class label among these k nearest neighbors as the class label of u.
SVM (support vector machines) is another classification model, which can also be regarded as
a distance-based classifier. However, unlike k-NN, SVM computes the similarity instead of the distance
between a pair of sequences. SVM calls such a similarity function a kernel function, K(x, y), which
is simply a dot product between the feature representations of a pair of sequences x and y in a Hilbert
space [44]. Given a Gram matrix, which stores all the pair-wise similarities between the sequences in
the training set, the SVM learning method finds the maximum-margin hyperplane that separates
the two classes. Once a suitable kernel function is defined, the SVM method for sequence classification
is the same as for any other classification task. So the main challenge of using SVM for sequence
classification is to define a suitable kernel function, and to speed up the computation of the Gram
matrix.
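As one concrete instance of such a kernel, a k-spectrum kernel (the dot product of the k-gram count vectors of two strings, in the spirit of [31]) and the corresponding Gram matrix can be sketched as follows; the matrix can then be handed to any SVM implementation that accepts a precomputed kernel.

from collections import Counter

def spectrum_kernel(x, y, k=3):
    # Dot product of the k-gram count vectors of two strings.
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[g] * cy[g] for g in cx if g in cy)

def gram_matrix(seqs_a, seqs_b, k=3):
    # Pairwise kernel values; for training, seqs_a == seqs_b.
    return [[spectrum_kernel(a, b, k) for b in seqs_b] for a in seqs_a]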
Below, we discuss a collection of distance metrics that can be used for measuring the distance
between a pair of sequences. It is important to note that though we use the term “metric,” many of
the similarity measures may not be metrics in the strict mathematical sense. Also, note
that we sometimes refer to a measure as a similarity metric, instead of a distance metric, if it
computes the similarity between a pair of sequences.
find a set of sequences that are pair-wise similar. For a pair of similar sequences, BLAST also
returns the bit score (also known as the BLAST score), which represents the similarity strength that can
be used for sequence classification.
Various probabilistic tools can be used to build the generative model, Mci . The simplest is a
naive Bayes classifier, which assumes that the features are independent, so the likelihood proba-
bility is simply the multiplication of the likelihood of finding a feature in a sequence given that
the sequence belongs to the class ci . In [5], the authors show different variants of naive Bayes for
building generative models for protein sequence classification. The first variant treats each protein
sequence as if it were simply a bag of amino acids. The second variant (NB n-grams) applies the
naive Bayes classifier to a bag of n-grams (n > 1). Thus for the second variant, if we choose n = 1,
it becomes the first variant. Note that the second variant, NB n-grams, violates the naive Bayes as-
sumption of independence, because the neighboring n-grams overlap along the sequence and two
adjacent n-grams have n − 1 elements in common. The third variant overcomes the above problem
by constructing an undirected graphical probabilistic model for n-grams that uses a junction tree
algorithm. Thus, the third variant is similar to the second variant except that the likelihood of each
n-gram for some class is corrected so that the independence assumption of naive Bayes can be up-
held. Their experiments show that the third variant performs better than the others while classifying
protein sequences.
Another tool for building generative models for sequences is k-order Markov chains [54]. For
this, a sequence is modeled as a graph in which each sequence element is represented by a node,
and a direct dependency between two neighboring elements is represented by an edge in the graph.
More generally, Markov models of order k capture the dependency between the current element sk+1
and its k preceding elements [sk ... s1] in a sequence. The higher the value of k, the more complex
the model is. As shown in [53, 54], the joint probability distribution for a (k − 1)-order Markov
model (MM) follows directly from the junction tree theorem [11] and the definition of conditional
probability.
Like any other generative model, the probability P(s = si |si−1 · · · si−k+1 , c j ) can be estimated
from data using the counts of the subsequences si · · · si−k+1 , c j and si−1 · · · si−k+1 , c j . Once all the
model parameters (probabilities) are known, a new sequence s can be assigned to the most likely
class based on the generative models for each class.
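A sketch of this generative approach for string sequences, using a first-order Markov chain per class, is given below; the additive smoothing is an assumption added here so that unseen transitions do not zero out the likelihood.

import math
from collections import defaultdict

def train_markov(sequences, alphabet, smooth=1.0):
    # Estimate first-order transition probabilities from counts, with additive smoothing.
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smooth * len(alphabet)
        probs[a] = {b: (counts[a][b] + smooth) / total for b in alphabet}
    return probs

def log_likelihood(seq, probs):
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, class_models):
    # Assign the sequence to the class whose model gives the highest likelihood.
    return max(class_models, key=lambda c: log_likelihood(seq, class_models[c]))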
Yakhnenko et al. [53] train a (k − 1)-order Markov model discriminatively instead of in the
conventional generative setting, so that the classification power of the Markov model can be
increased. The difference between a generative model and a discriminative model is as follows: a
generative model, as we have discussed above, models the probability distribution of the process
generating the sequence from each class; then the classification is performed by examining the like-
lihood of each class producing the observed features in the data, and assigning the sequence to the
most likely class. On the other hand, a discriminative model directly computes the class membership
probabilities (or model class boundaries) without modeling the underlying class feature densities;
to achieve this, it finds the parameter values that maximize a conditional likelihood function [39].
Thus the training step uses the counts of various k-grams to iteratively update the parameter values
(conditional likelihood probabilities) until a locally optimal solution is reached under some terminating
condition. In [53], a gradient descent method is used for updating the parameters. In their work, Yakhnenko et
al. show that a discriminative Markov model performs better than a generative Markov model for
sequence classification.
The Hidden Markov Model (HMM) is also popular for sequence clustering and classification. The
predominant use of HMMs in sequence modeling is to build profile HMMs, which provide
probabilistic models of a sequence family. They are particularly used in bioinformatics for aligning a
protein sequence with a protein family, a task commonly known as multiple sequence alignment. A
profile HMM has three hidden states, match, insertion, and deletion; match and insertion are emit-
ting states, and deletion is a non-emitting state. In the context of protein sequences, the emission
probability of a match state is obtained from the distribution of residues in the corresponding col-
umn, and the emission probability of an insertion state is set to the background distribution of the
residues in the dataset. Transition probabilities among various states are decided based on various
gap penalty models [14]. To build a profile HMM from a set of sequences, we can first align them,
and then use the alignment to learn the transition and emission probabilities by counting. Profile
HMMs can also be built incrementally by aligning each new example with a profile HMM, and then
updating the HMM parameters from the alignment that includes the new example.
Once built, a profile HMM can be used for sequence classification [24, 48, 56]. In the classification
setting, a distinct profile HMM is built for each of the classes using the training examples
of the respective classes. For classifying an unseen sequence, it is independently aligned with the
profile HMM of each of the classes using dynamic programming; the sequence is then classified to
the class that achieves the highest alignment score. One can also estimate the log-likelihood ratio to
decide which class a sequence should belong to. For this, given an HMM Hi for class i, let
P(s|Hi) denote the probability of s under the HMM, and P(s|H0) the probability of s under a null
model. Then the log-likelihood ratio can be computed as LLR(s) = log(P(s|Hi) / P(s|H0)).
Jaakkola et al. [19] begin with an HMM trained from positive examples of class i to model a given protein
family. Then they use this HMM to map each new protein sequence (say, s) that they want to classify
into a fixed-length vector (called a score vector) and compute the kernel function on the basis
of the Euclidean distance between the score vector of s and the score vectors of known positive
and negative examples of the protein family i. The resulting discriminative function is given by
L(s) = ∑_{i: si ∈ Xi} λi K(s, si) − ∑_{i: si ∈ X0} λi K(s, si), where the λi are estimated from the positive
training examples (Xi) and negative training examples (X0).
In [43], the authors proposed a method that combines distance-based and feature-based methods.
It is a feature-based method that uses a distance-based method for super-structuring and abstraction.
It first finds all the k-grams of a sequence, and represents the sequences as a bag of k-grams. How-
ever, instead of using each of the k-grams as a unique feature, it partitions the k-grams into different
sets so that each set contains a group of the most similar features. Each such group can be considered
an abstract feature. To construct the abstract features, the method takes a sequence dataset
as a bag of k-grams, and clusters the k-grams using agglomerative hierarchical clustering until m
(user-defined) abstract features are obtained. In this clustering step, it computes a distance between
each pair of existing abstractions using an information gain criterion.
Blasiak and Rangwala [9] proposed another hybrid method that combines distance-based and
feature-based methods. They employ a scheme that uses an HMM variant to map a sequence to a
set of fixed-length description vectors. The parameters of the HMM variant are learned using various
inference algorithms, such as Baum-Welch, Gibbs sampling, and a variational method. From the
fixed-length description, any traditional feature-based classification method can be used to classify
a sequence into different classes.
In another work, Aggarwal [1] proposed a method that uses wavelet decomposition of a sequence
to map it into the wavelet space. The objective of this transformation is to exploit the multi-resolution
property of wavelets to create a scheme that considers sequence features capturing
similarities between two sequences at different levels of granularity. Although the distance is
computed in the wavelet space, the method uses a rule-based classifier for the classification.
uses bagged clustering of the full sequence dataset. These two kernels are used to modify a base
string kernel so that unlabeled data are considered in the kernel computation. The authors show that the
modified kernels greatly improve the performance of protein sequence classification.
Zhong et al. [57] propose an HMM-based semi-supervised classification for time series data
(sequence of numbers); however, the method is general and can be easily made to work for discrete
sequences as long as they can be modeled with an HMM. The method uses labeled data to train the
initial parameters of a first order HMM, and then uses unlabeled data to adjust the model in an EM
process. Wei et al. [50] adopt a one-nearest-neighbor classifier for semi-supervised time series
classification. The method handles a scenario where only a small number of positively labeled
instances are available.
14.9 Conclusions
Sequence classification is a well-developed research task with many effective solutions; however,
some challenges still remain. One of them is that in many domains such a classification task
suffers from class imbalance, i.e., the prior probability of the minority class can be a few orders of
magnitude smaller than that of the majority class. A common example of this scenario is a task that
distinguishes between normal and anomalous sequences of credit card transactions of a customer.
In a typical i.i.d. training dataset, the population of the anomalous class is rare, and most
classification methods, to some extent, suffer a performance loss due to this phenomenon. More
research is required to overcome this limitation. Another challenge is to obtain a scalable sequence
classification system, preferably on distributed systems (such as MapReduce), so that large-scale
sequence classification problems that appear in the Web and e-commerce domains can be solved
effectively.
To conclude, in this chapter we provided a survey of discrete sequence classification. We
grouped the methodologies of sequence classification into three major groups: feature-based,
distance-based, and model-based, and discussed various classification methods that fall into these
groups. We also discussed a few methods that overlap across multiple groups. Finally, we discussed
some variants of the sequence classification problem and a few methods that solve them.
Bibliography
[1] Charu C. Aggarwal. On effective classification of strings with wavelets. In Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’02, pages 163–172, 2002.
[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in
large databases. In Proceedings of the 20th International Conference on Very Large Data
Bases, VLDB ’94, pages 487–499, 1994.
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool.
Journal of Molecular Biology, 215:403–410, 1990.
[4] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector
machines. In Proceedings of International Conference on Machine Learning, ICML’03, 2003.
[5] Carson Andorf, Adrian Silvescu, Drena Dobbs, and Vasant Honavar. Learning classifiers for
assigning protein sequences to gene ontology functional families. In Fifth International Con-
ference on Knowledge Based Computer Systems (KBCS), page 256, 2004.
[6] C. Bahlmann and H. Burkhardt. Measuring HMM similarity with the Bayes probability of
error and its application to online handwriting recognition. In Proceedings of the Sixth In-
ternational Conference on Document Analysis and Recognition, ICDAR ’01, pages 406–411,
2001.
[7] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric
framework for learning from labeled and unlabeled examples. Journal of Machine Learning
Research, 7:2399–2434, 2006.
[8] Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time
series. In KDD Workshop’94, pages 359–370, 1994.
[9] Sam Blasiak and Huzefa Rangwala. A Hidden Markov Model variant for sequence classi-
fication. In Proceedings of the Twenty-Second International Joint Conference on Artificial
Intelligence - Volume Two, IJCAI’11, pages 1192–1197. AAAI Press, 2011.
[10] Rich Caruana and Alexandru Niculescu-Mizil. Data mining in metric space: An empiri-
cal analysis of supervised learning performance criteria. In Proceedings of the Tenth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04,
pages 69–78, 2004.
[11] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic Networks
and Expert Systems. Springer, 1999.
[12] Mukund Deshpande and George Karypis. Evaluation of techniques for classifying biological
sequences. In Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge
Discovery and Data Mining, PAKDD ’02, pages 417–431, 2002.
[13] Elena S. Dimitrova, Paola Vera-Licona, John McGee, and Reinhard C. Laubenbacher. Dis-
cretization of time series data. Journal of Computational Biology, 17(6):853–868, 2010.
[14] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press,
1998.
[15] Pierre Geurts. Pattern extraction for time series classification. In Proceedings of the 5th
European Conference on Principles of Data Mining and Knowledge Discovery, PKDD ’01,
pages 115–127, 2001.
[16] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist
temporal classification: Labelling unsegmented sequence data with recurrent neural networks.
In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages
369–376, 2006.
[17] M. Pamela Griffin and J. Randall Moorman. Toward the early diagnosis of neonatal sepsis and
sepsis-like illness using novel heart rate analysis. Pediatrics, 107(1):97–104, 2001.
[18] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computa-
tional Biology. Cambridge University Press, 1997.
[19] Tommi Jaakkola, Mark Diekhans, and David Haussler. Using the Fisher kernel method to
detect remote protein homologies. In Proceedings of the Seventh International Conference on
Intelligent Systems for Molecular Biology, pages 149–158, 1999.
[20] Wei Jiang, S.G. Kong, and G.D. Peterson. ECG signal classification using block-based neu-
ral networks. In Proceedings of the 2005 IEEE International Joint Conference on Neural
Networks, 2005. IJCNN ’05, volume 1, pages 326–331, 2005.
[21] Mohammed Waleed Kadous and Claude Sammut. Classification of multivariate time series
and structured data using constructive induction. Machine Learning, 58(2-3):179–216, 2005.
[22] Eamonn Keogh and Shruti Kasetty. On the need for time series data mining benchmarks:
A survey and empirical demonstration. In Proceedings of the Eighth ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 102–111,
2002.
[23] Minyoung Kim and Vladimir Pavlovic. Sequence classification via large margin Hidden
Markov Models. Data Mining and Knowledge Discovery, 23(2):322–344, 2011.
[24] Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjolander, and David Haussler. Hid-
den Markov Models in computational biology: Applications to protein modeling. Journal of
Molecular Biology, 235:1501–1531, 1994.
[25] Daniel Kudenko and Haym Hirsh. Feature generation for sequence categorization. In Pro-
ceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Ap-
plications of Artificial Intelligence, AAAI ’98/IAAI ’98, pages 733–738, 1998.
[26] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eigh-
teenth International Conference on Machine Learning, ICML ’01, pages 282–289, 2001.
[27] Terran Lane and Carla E. Brodley. Temporal sequence learning and data reduction for anomaly
detection. ACM Transactions on Information Systems Security, 2(3):295–331, 1999.
[28] Neal Lesh, Mohammed J Zaki, and Mitsunori Ogihara. Mining features for sequence classi-
fication. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 342–346. ACM, 1999.
[29] C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences.
Journal of Machine Learning Research, 5:1435–1455, 2004.
[30] C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for
discriminative protein classification. Bioinformatics, 20(4):467–476, 2004.
[31] Christina S. Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string
kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566–575,
2002.
[32] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proceed-
ings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD’98),
pages 80–86, 1998.
[33] Xuan Liu, Pengzhu Zhang, and Dajun Zeng. Sequence matching for suspicious activity de-
tection in anti-money laundering. In Proceedings of the IEEE ISI 2008 PAISI, PACCF, and
SOCO International Workshops on Intelligence and Security Informatics, PAISI, PACCF and
SOCO ’08, pages 50–61, 2008.
[34] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text
classification using string kernels. Journal of Machine Learning Research, 2:419–444, March
2002.
[35] Luis Carlos Molina, Lluı́s Belanche, and Àngela Nebot. Feature selection algorithms: A survey
and experimental evaluation. In Proceedings of the 2002 IEEE International Conference on
Data Mining, ICDM ’02, pages 306–313, 2002.
[36] Carl H. Mooney and John F. Roddick. Sequential pattern mining–Approaches and algorithms.
ACM Comput. Surv., 45(2):19:1–19:39, March 2013.
[37] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Pro-
ceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.
396 Data Classification: Algorithms and Applications
[38] L. R. Rabiner. Readings in speech recognition. Chapter A tutorial on hidden Markov models
and selected applications in speech recognition, pages 267–296. Morgan Kaufmann Publish-
ers, SanFrancisco, CA, 1990.
[39] D. Rubinstein and T. Hastie. Discriminative vs informative learning. In Proceedings of the
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
’97, 1997.
[40] B. Scholkopf, K. Tsuda, and J-P. Vert. Kernel Methods in Computational Biology, pages 171-
192. MIT Press, 2004.
[41] Rong She, Fei Chen, Ke Wang, Martin Ester, Jennifer L. Gardy, and Fiona S. L. Brinkman.
Frequent-subsequence-based prediction of outer membrane proteins. In Proceedings of the
Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’03, pages 436–445, 2003.
[42] Yelong Shen, Jun Yan, Shuicheng Yan, Lei Ji, Ning Liu, and Zheng Chen. Sparse hidden-
dynamics conditional random fields for user intent understanding. In Proceedings of the 20th
International Conference on World Wide Web, WWW ’11, pages 7–16, 2011.
[43] Adrian Silvescu, Cornelia Caragea, and Vasant Honavar. Combining super-structuring and
abstraction on sequence classification. In Proceedings of the 2009 Ninth IEEE International
Conference on Data Mining, ICDM ’09, pages 986–991, 2009.
[44] Alex J. Smola and Bernhard Schlkopf. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2001.
[45] S. Sonnenburg, G. Rätsch, and C. Schäfer. Learning interpretable SVMS for biological se-
quence classification. In Proceedings of the 9th Annual International Conference on Research
in Computational Molecular Biology, RECOMB’05, pages 389–407, 2005.
[46] Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf. Large scale genomic sequence
SVM classifiers. In Proceedings of the 22nd International Conference on Machine Learning,
ICML ’05, pages 848–855, 2005.
[47] Yann Soullard and Thierry Artieres. Iterative refinement of HMM and HCRF for sequence
classification. In Proceedings of the First IAPR TC3 Conference on Partially Supervised
Learning, PSL’11, pages 92–95, 2012.
[48] Prashant Srivastava, Dhwani Desai, Soumyadeep Nandi, and Andrew Lynn. HMM-mode -
improved classification using profile Hidden Markov Models by optimising the discrimina-
tion threshold and modifying emission probabilities with negative training sequences. BMC
Bioinformatics, 8(1):104, 2007.
[49] Pang-Ning Tan and Vipin Kumar. Discovery of web robot sessions based on their navigational
patterns. Data Mining and Knowledge Discovery, 6(1):9–35, 2002.
[50] Li Wei and Eamonn Keogh. Semi-supervised time series classification. In Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’06, pages 748–753, 2006.
[51] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and
William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioin-
formatics, 21(15):3241–3247, 2005.
Discrete Sequence Classification 397
[52] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification.
ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.
[53] Oksana Yakhnenko, Adrian Silvescu, and Vasant Honavar. Discriminatively trained Markov
model for sequence classification. In Proceedings of the Fifth IEEE International Conference
on Data Mining, pages 498–505, IEEE, 2005.
[54] Zheng Yuan. Prediction of protein subcellular locations using markov chain models. FEBS
Letters, pages 23–26, 1999.
[55] Mohammed J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine
Learning, 42(1-2):31–60, 2001.
[56] Yuan Zhang and Yanni Sun. HMM-frame: Accurate protein domain classification for metage-
nomic sequences containing frameshift errors. BMC Bioinformatics, 12(1):198, 2011.
[57] Shi Zhong. Semi-supervised sequence classification with HMMs. IJPRAI, 19(2):165–182,
2005.
[58] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and un-
labeled data on a directed graph. In Proceedings of the 22nd International Conference on
Machine Learning, ICML ’05, pages 1036–1043, 2005.
[59] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label
propagation. Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.
This page intentionally left blank
Chapter 15
Collective Classification of Network Data
Ben London
University of Maryland
College Park, MD 20742
Lise Getoor
University of Maryland
College Park, MD 20742
15.1 Introduction
Network data has become ubiquitous. Communication networks, social networks, and the World
Wide Web are becoming increasingly important to our day-to-day life. Moreover, networks can be
defined implicitly by certain structured data sources, such as images and text. We are often interested
in inferring hidden attributes (i.e., labels) about network data, such as whether a Facebook user will
adopt a product, or whether a pixel in an image is part of the foreground, background, or some
specific object. Intuitively, the network should help guide this process. For instance, observations
and inference about someone’s Facebook friends should play a role in determining their adoption
probability. This type of joint reasoning about label correlations in network data is often referred to
as collective classification.
Classic machine learning literature tends to study the supervised setting, in which a classifier is
learned from a fully-labeled training set; classification performance is measured by some form of
statistical accuracy, which is typically estimated from a held-out test set. It is commonly assumed
that data points (i.e., feature-label pairs) are generated independently and identically from an un-
derlying distribution over the domain, as illustrated in Figure 15.1(a). As a result, classification
is performed independently on each object, without taking into account any underlying network
between the objects. Classification of network data does not fit well into this setting. Domains
such as Webpages, citation networks, and social networks have naturally occurring relationships
between objects. Because of these connections (illustrated in Figure 15.1(b)), their features and la-
bels are likely to be correlated. Neighboring points may be more likely to share the same label (a
phenomenon sometimes referred to as social influence or contagion), or links may be more likely
between instances of the same class (referred to as homophily or assortativity). Models that classify
each object independently are ignoring a wealth of information, and may not perform well.
Classifying real network data is further complicated by heterogeneous networks, in which nodes
may not have uniform local features and degrees (as illustrated in Figure 15.1(c)). Because of this,
we cannot assume that nodes are identically distributed. Also, it is likely that there is not a clean
split between the training and test sets (as shown in Figure 15.1(d)), which is common in relational
datasets. Without independence between training and testing, it may be difficult to isolate training
accuracy from testing accuracy, so the statistical properties of the estimated model are not straight-
forward.
In this chapter, we provide an overview of existing approaches to collective classification. We
begin by formally defining the problem. We then examine several approaches to collective classifi-
cation: iterative wrappers for local predictors, graph-based regularization, and probabilistic graphical
models. To help ground these concepts in practice, we review some common feature engineering
techniques for real-world problems. Finally, we conclude with some interesting applications of col-
lective classification.
FIGURE 15.1 (See color insert.): (a) An illustration of the common i.i.d. supervised learning set-
ting. Here each instance is represented by a subgraph consisting of a label node (blue) and several
local feature nodes (purple). (b) The same problem, cast in the relational setting, with links connect-
ing instances within the training and testing sets, respectively. The instances are no longer independent.
(c) A relational learning problem in which each node has a varying number of local features and
relationships, implying that the nodes are neither independent nor identically distributed. (d) The
same problem, with relationships (links) between the training and test set.
node i, let Ni denote the set of indices corresponding to its (open) neighborhood; that is, the set of
nodes adjacent to i (but not including it).
nodes are simply the training set; in the inductive setting, this assumes that draws from the distri-
bution over network structures are partially labeled. However, inductive collective classification is
still possible even if no labels are given.
One could then clamp Yi to {±1} using its sign, sgn(Yi ). While we probably will not know all
of the labels of Ni , if we already had predictions for them, we could use these, then iterate until
the predictions converge. This is precisely the idea behind a method known as label propagation.
Though the algorithm was originally proposed by [48] for general transductive learning, it can easily
be applied to network data by constraining the similarities according to a graph. An example of this
is the modified adsorption algorithm [39].
Algorithm 15.1 provides pseudocode for a simple implementation of label propagation. The
algorithm assumes that all labels are k-valued, meaning |Yi | = k for all i = 1, . . . , n. It begins by
constructing an n × k label matrix $Y \in \mathbb{R}^{n \times k}$, where entry (i, j) corresponds to the probability that
$Y_i = j$. The label matrix is initialized as
$$Y_{i,j} = \begin{cases} 1 & \text{if } Y_i \in Y \text{ and } Y_i = j, \\ 0 & \text{if } Y_i \in Y \text{ and } Y_i \neq j, \\ 1/k & \text{if } Y_i \in Y^{u}. \end{cases} \qquad (15.2)$$
It also requires an n × n transition matrix T ∈ Rn×n ; semantically, this captures the probability that a
label propagates from node i to node j, but it is effectively just the normalized edge weight, defined
as
$$T_{i,j} = \begin{cases} \dfrac{w_{i,j}}{\sum_{j' \in N_i} w_{i,j'}} & \text{if } j \in N_i, \\[6pt] 0 & \text{if } j \notin N_i. \end{cases} \qquad (15.3)$$
The algorithm iteratively multiplies Y by T, thereby propagating label probabilities via a weighted
average. After the multiply step, the unknown rows of Y, corresponding to the unknown labels, must
be normalized, and the known rows must be clamped to their known values. This continues until the
values of Y have stabilized (i.e., converged to within some sufficiently small ε of change), or until a
maximum number of iterations has been reached.
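To make the procedure concrete, the following is a minimal sketch of this label-propagation loop, assuming dense NumPy arrays, a nonnegative weight matrix W, and integer class indices; the function and variable names are ours, not the chapter's.

```python
import numpy as np

def label_propagation(W, labels, k, max_iter=100, eps=1e-6):
    """Propagate labels over a weighted graph (sketch of Algorithm 15.1).

    W      : (n, n) nonnegative edge-weight matrix (0 where there is no edge).
    labels : length-n integer array; labels[i] in {0, ..., k-1} if node i is
             observed, or -1 if its label is unknown.
    k      : number of classes.
    Returns an (n, k) matrix of per-node label distributions.
    """
    labels = np.asarray(labels)
    n = len(labels)
    observed = labels >= 0

    # Row-normalized transition matrix T (cf. Equation 15.3).
    T = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

    # Label matrix Y (cf. Equation 15.2): one-hot rows for observed nodes,
    # uniform 1/k rows for unknown nodes.
    Y = np.full((n, k), 1.0 / k)
    Y[observed] = 0.0
    Y[observed, labels[observed]] = 1.0

    for _ in range(max_iter):
        Y_new = T @ Y                                   # propagate via weighted average
        rows = Y_new[~observed].sum(axis=1, keepdims=True)
        Y_new[~observed] /= np.maximum(rows, 1e-12)     # re-normalize unknown rows
        Y_new[observed] = Y[observed]                   # clamp known rows
        if np.abs(Y_new - Y).max() < eps:               # stop once Y has stabilized
            return Y_new
        Y = Y_new
    return Y
```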
One interesting property of this formulation of label propagation is that it is guaranteed to con-
verge to a unique solution. In fact, there is a closed-form solution, which we will describe in Sec-
tion 15.4.
Algorithm 15.2 depicts pseudo-code for a simple iterative classification algorithm. The algo-
rithm begins by initializing all unknown labels $Y^u$ using only the features $(X_i, X_{N_i})$ and the observed
neighbor labels (a subset of $Y_{N_i}$). (This may require a specialized initialization classifier.) This process
is sometimes referred to as bootstrapping. It then iteratively updates these values using the current
predictions as well as the observed features and labels. This process repeats until the predictions
have stabilized, or until a maximum number of iterations has been reached.
Clearly, the order in which nodes are updated affects the predictive accuracy and convergence
rate, though there is some evidence to suggest that iterative classification is fairly robust to a number
of simple ordering strategies—such as random ordering, ascending order of neighborhood diversity
and descending order of prediction confidences [11]. Another practical issue is when to incorpo-
rate the predicted labels from the previous round into the the current round of prediction. Some
researchers [28, 31] have proposed a “cautious” approach, in which only predicted labels are intro-
duced gradually. More specifically, at each iteration, only the top k most confident predicted labels
are used, thus ignoring less confident, potentially noisy predictions. At the start of the algorithm, k
is initialized to some small number; then, in subsequent iterations, the value of k is increased, so
that in the last iteration all predicted labels are used.
One benefit of iterative classification is that it can be used with any local classifier, making it
extremely flexible. Nonetheless, there are some practical challenges to incorporating certain clas-
sifiers. For instance, many classifiers are defined on a predetermined number of features, making
it difficult to accommodate arbitrarily-sized neighborhoods. A common workaround is to aggre-
gate the neighboring features and labels, such as using the proportion of neighbors with a given
label, or the most frequently occurring label. For classifiers that return a vector of scores (or condi-
tional probabilities) instead of a label, one typically uses the label that corresponds to the maximum
score. Classifiers that have been used include naïve Bayes [6, 31], logistic regression [24], decision
trees [14], and weighted majority [25].
Iterative classification prescribes a method of inference, but it does not instruct how to train the
local classifiers. Typically, this is performed using traditional, non-collective training.
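As a rough illustration of the iterative scheme, here is a sketch that wraps any scikit-learn-style local classifier (anything exposing fit/predict) and uses the proportion of neighbor labels as the aggregated relational feature; the helper names and the specific aggregation are our choices, and a real implementation would add the cautious-update and ordering strategies discussed above.

```python
import numpy as np

def neighbor_label_counts(adj, labels, k):
    """Relational feature: for each node, the proportion of neighbors with each label."""
    n = len(labels)
    feats = np.zeros((n, k))
    for i in range(n):
        for j in adj[i]:
            if labels[j] >= 0:
                feats[i, labels[j]] += 1.0
        if feats[i].sum() > 0:
            feats[i] /= feats[i].sum()
    return feats

def iterative_classification(clf, X, adj, labels, k, max_iter=10):
    """clf: any classifier with fit(X, y) / predict(X) (e.g., scikit-learn style).
    X: (n, d) local features; adj: list of neighbor-index lists;
    labels: length-n integer array, with -1 marking unknown nodes."""
    labels = np.asarray(labels).copy()
    known = labels >= 0

    # Train the local classifier on the known nodes, using their relational features.
    rel = neighbor_label_counts(adj, labels, k)
    clf.fit(np.hstack([X, rel])[known], labels[known])

    # Bootstrap the unknown labels from local features and observed neighbor labels.
    current = labels.copy()
    current[~known] = clf.predict(np.hstack([X, rel])[~known])

    # Iteratively re-predict until the labels stabilize (or max_iter is reached).
    for _ in range(max_iter):
        rel = neighbor_label_counts(adj, current, k)
        new = clf.predict(np.hstack([X, rel])[~known])
        if np.array_equal(new, current[~known]):
            break
        current[~known] = new
    return current
```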
Unconstrained graph-based regularization methods can be generalized using the following ab-
straction (due to [8]). Let $Q \in \mathbb{R}^{n \times n}$ denote a symmetric matrix, whose entries are determined based
on the structure of the graph G, the local attributes X (if available), and the observed labels Y. We
will give several explicit definitions for Q shortly; for the time being, it will suffice to think of Q as
a regularizer on h. Formulated as an unconstrained optimization, the learning objective is
$$h^{\top} Q\, h \;+\; (h - y)^{\top} C\, (h - y).$$
One can interpret the first term as penalizing certain label assignments, based on observed informa-
tion; the second term is simply the prediction error with respect to the training labels. Using vector
calculus, we obtain a closed-form solution to this optimization as
where I is the n × n identity matrix. This is fairly efficient to compute for moderate-sized networks;
the time complexity is dominated by $O(n^3)$ operations for the matrix inversion and multiplication.
For prediction, the “soft” values of h can be clamped to {±1} using the sign operator.
The effectiveness of this generic approach comes down to how one defines the regularizer, Q.
One of the first instances is due to [49]. In this formulation, Q is a graph Laplacian, constructed
as follows: for each edge $(i, j) \in E$, define a weight matrix $W \in \mathbb{R}^{n \times n}$, where each element $w_{i,j}$ is
defined using the radial basis function in (15.1); define a diagonal matrix $D \in \mathbb{R}^{n \times n}$ as
$$d_{i,i} \triangleq \sum_{j=1}^{n} w_{i,j};$$
and set
$$Q \triangleq I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}},$$
per [47]. [15] extended this method for heterogeneous networks—that is, graphs with multiple types
of nodes and edges. Another variant, due to [43], sets
$$Q \triangleq (I - A)^{\top} (I - A),$$
where A ∈ Rn×n is a row-normalized matrix capturing the local pairwise similarities. All of these
formulations impose a smoothness constraint on the predictions, that “similar” nodes—where simi-
larity can be defined by the Gaussian in (15.1) or some other kernel—should be assigned the same
label.
There is an interesting connection between graph-based regularization and label propagation.
Under the various parameterizations of Q, one can show that (15.4) provides a closed-form solution
to the label propagation algorithm in Section 15.3.1 [48]. This means that one can compute certain
formulations of label propagation without directly computing the iterative algorithm. Heavily op-
timized linear algebra solvers can be used to compute (15.4) quickly. Another appealing aspect of
these methods is their strong theoretical guarantees [8].
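The displayed closed form (15.4) is not reproduced above, so the sketch below assumes the common formulation of the objective given earlier, with Q the normalized graph Laplacian and C a diagonal matrix that is nonzero only on labeled nodes; setting the gradient to zero then gives h = (Q + C)⁻¹ C y. It is an illustration under those assumptions, not necessarily the book's exact expression.

```python
import numpy as np

def graph_regularized_labels(W, y, labeled_mask, c=1.0):
    """Closed-form graph-regularized labeling (assumed formulation).

    Minimizes  h'Qh + (h - y)'C(h - y)  with
      Q = I - D^{-1/2} W D^{-1/2}   (normalized Laplacian), and
      C diagonal with C_ii = c for labeled nodes and 0 otherwise,
    which yields  h = (Q + C)^{-1} C y.

    W: (n, n) symmetric nonnegative weights; y: length-n vector in {-1, 0, +1}
    (0 for unlabeled nodes); labeled_mask: boolean length-n array.
    """
    n = len(y)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    Q = np.eye(n) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    C = np.diag(np.where(labeled_mask, c, 0.0))
    # Small ridge added for numerical stability on components with no labels.
    h = np.linalg.solve(Q + C + 1e-9 * np.eye(n), C @ np.asarray(y, dtype=float))
    return np.sign(h)   # clamp the "soft" scores to {-1, +1} (0 if no evidence reaches a node)
```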
Definition 15.5.1 (Bayesian Network) A Bayesian network consists of a set of random variables
$Z \triangleq \{Z_1, \ldots, Z_n\}$, a directed, acyclic graph (DAG) $G \triangleq (V, E)$, and a set of conditional probability
distributions (CPDs), $\{P(Z_i \mid Z_{P_i})\}_{i=1}^{n}$, where $P_i$ denotes the indices corresponding to the causal par-
ents of $Z_i$. When multiplied, the CPDs describe the joint distribution of Z; i.e., $P(Z) = \prod_i P(Z_i \mid Z_{P_i})$.
BNs model causal relationships, which are captured by the directionalities of the edges; an edge
(i, j) ∈ E indicates that Zi influences Z j . For a more thorough review of BNs, see [18] or Chapter X
of this book.
Though BNs are very popular in machine learning and data mining, they can only be used for
models with fixed structure, making them inadequate for problems with arbitrary relational struc-
ture. Since collective classification is often applied to arbitrary data graphs—such as those found
in social and citation networks—some notion of templating is required. In short, templating defines
subgraph patterns that are instantiated (or, grounded) by the data graph; model parameters (CPDs)
are thus tied across different instantiations. This allows directed graphical models to be used on
complex relational structures.
One example of a templated model is probabilistic relational models (PRMs) [9, 12]. A PRM
is a directed graphical model defined by a relational database schema. Given an input database, the
schema is instantiated by the database records, thus creating a BN. This has been shown to work for
some collective classification problems [12, 13], and has the advantage that a full joint distribution
is defined.
To satisfy the requirement of acyclicity when the underlying data graph is undirected, one con-
structs a (templated) BN or PRM as follows. For each potential edge {i, j} in the data graph, define a
binary random variable Ei, j . Assume that a node’s features are determined by its label. If we further
assume that its label is determined by its neighbors’ labels (i.e., contagion), then we draw a directed
edge from each Ei, j to the corresponding Yi and Y j , as illustrated in Figure 15.2a. On the other hand,
if we believe that a node’s label determines who it connects to (i.e., homophily), then we draw an
edge from each Yi to all Ei,· , as shown in Figure 15.2b. The direction of causality is a modeling deci-
sion, which depends on one’s prior belief about the problem. Note that, in both cases, the resulting
graphical model is acyclic.
FIGURE 15.2 (See color insert.): Example BN for collective classification. Label nodes (green)
determine features (purple), which are represented by a single vector-valued variable. An edge vari-
able (yellow) is defined for all potential edges in the data graph. In (a), labels are determined by link
structure, representing contagion. In (b), links are functions of labels, representing homophily. Both
structures are acyclic.
distribution. RDN inference is therefore only approximate, but can be very fast. Learning RDNs is
also fast, because it reduces to independently learning a set of CPDs.
Definition 15.5.2 (Markov random field) A Markov random field (MRF) is defined by a set of ran-
dom variables $Z \triangleq \{Z_1, \ldots, Z_n\}$, a graph $G \triangleq (V, E)$, and a set of clique potentials $\{\phi_c : \mathrm{dom}(c) \rightarrow \mathbb{R}\}_{c \in C}$, where C is a set of predefined cliques and dom(c) is the domain of clique c. (To simplify
notation, assume that potential $\phi_c$ only operates on the set of variables contained in clique c.) The
potentials are often defined as a log-linear combination of features $f_c$ and weights $w_c$, such that
$\phi_c(z) \triangleq \exp(w_c \cdot f_c(z))$. An MRF defines a probability distribution P that factorizes as
$$P(Z = z) = \frac{1}{\Phi} \prod_{c \in C} \phi_c(z) = \frac{1}{\Phi} \exp\Big( \sum_{c \in C} w_c \cdot f_c(z) \Big),$$
where $\Phi \triangleq \sum_{z'} \prod_{c \in C} \phi_c(z')$ is a normalizing constant. This model is said to be Markovian because
any variable $Z_i$ is independent of any non-adjacent variables, conditioned on its neighborhood $Z_{N_i}$
(sometimes referred to as its Markov blanket).
For collective classification, one can define a conditional MRF (sometimes called a CRF), whose
conditional distribution is
$$P(Y^{u} = y^{u} \mid X = x, Y^{\ell} = y^{\ell}) = \frac{1}{\Phi} \prod_{c \in C} \phi_c(x, y) = \frac{1}{\Phi} \exp\Big( \sum_{c \in C} w_c \cdot f_c(x, y) \Big).$$
For relational tasks, such as collective classification, one typically defines the cliques via templat-
ing. Similar to a PRM (see Section 15.5.1), a clique template is just a subgraph pattern—although
in this case it is a fully connected, undirected subgraph. The types of templates used directly af-
fect model complexity, in that smaller templates correspond to a simpler model, which usually
generalizes better to unseen data. Thus, MRFs are commonly defined using low-order templates,
such as singletons, pairs, and sometimes triangles. Examples of templated MRFs include relational
MRFs [40], Markov logic networks [36], and hinge-loss Markov random fields [2].
To make this concrete, we consider the class of pairwise MRFs. A pairwise MRF has features
and weights for all singleton and pairwise cliques in the graph; thus, its distribution factorizes as
$$P(Z = z) = \frac{1}{\Phi} \prod_{i \in V} \phi_i(z) \prod_{\{i,j\} \in E} \phi_{i,j}(z) = \frac{1}{\Phi} \exp\Big( \sum_{i \in V} w_i \cdot f_i(z) + \sum_{\{i,j\} \in E} w_{i,j} \cdot f_{i,j}(z) \Big).$$
(Since it is straightforward to derive the posterior distribution for collective classification, we omit
it here.) If we assume that the domains of the variables are discrete, then it is common to define the
features as basis vectors indicating the state of the assignment. For example, if $|Z_i| = k$ for all i,
then $f_i(z)$ is a length-k binary vector, whose jth entry is equal to one if $z_i$ is in the jth state and zero
otherwise; similarly, $f_{i,j}(z)$ has length $k^2$ and the only nonzero entry corresponds to the joint state
of $(z_i, z_j)$. To make this MRF templated, we simply replace all $w_i$ with a single $w_{\mathrm{single}}$, and all $w_{i,j}$
with a single $w_{\mathrm{pair}}$.
It is important to note that the data graph does not necessarily correspond to the graph of the
MRF; there is some freedom in how one defines the relational features $\{f_{i,j}\}_{\{i,j\} \in E}$. However, when
using a pairwise MRF for collective classification, it is natural to define a relational feature for each
edge in the data graph. Defining fi, j as a function of (Yi ,Y j ) models the dependence between labels.
One may alternately model (Xi ,Yi , X j ,Y j ) to capture the pairwise interactions of both features and
labels.
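A minimal sketch of this templated pairwise scoring, assuming discrete states and indicator (basis-vector) features; it evaluates only the unnormalized log-score of a joint assignment and does not attempt inference or the computation of Φ. All names are ours.

```python
import numpy as np

def pairwise_mrf_log_score(z, edges, w_single, w_pair, k):
    """Unnormalized log-score  sum_i w_single . f_i(z) + sum_{ij} w_pair . f_ij(z)
    for a templated pairwise MRF with indicator features.

    z        : length-n array of states in {0, ..., k-1}
    edges    : list of (i, j) pairs (the relational structure)
    w_single : length-k weight vector shared across all singleton cliques
    w_pair   : length-k*k weight vector shared across all pairwise cliques
    """
    score = 0.0
    for zi in z:
        score += w_single[zi]               # f_i(z) is the basis vector for state zi
    for (i, j) in edges:
        score += w_pair[z[i] * k + z[j]]    # f_ij(z) indicates the joint state (zi, zj)
    return score

# Example: 3 nodes, 2 states, a chain 0-1-2, weights that reward agreeing neighbors.
w_single = np.zeros(2)
w_pair = np.array([1.0, 0.0, 0.0, 1.0])    # reward joint states (0,0) and (1,1)
print(pairwise_mrf_log_score(np.array([1, 1, 1]), [(0, 1), (1, 2)], w_single, w_pair, 2))
```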
problem to these representations can have a greater impact than the choice of model or inference
algorithm. This process, sometimes referred to as feature construction (or feature engineering), is
perhaps the most challenging aspect of data mining. In this section, we explore various techniques,
motivated by concrete examples.
Links can also be discovered as part of the inference problem [5, 26]. Indeed, collective clas-
sification and link prediction are complementary tasks, and it has been shown that coupling these
predictions can improve overall accuracy [29].
It is important to note that not all data graphs are unimodal—that is, they may involve multiple
types of nodes and edges. In citation networks, authors write papers; authors are affiliated with
institutions; papers cite other papers; and so on.
Another interesting application is activity detection in video data. Given a recorded video se-
quence (say, from a security camera) containing multiple actors, the goal is to label each actor as
performing a certain action (from a predefined set of actions). Assuming that bounding boxes and
tracking (i.e., identity maintenance) are given, one can bolster local reasoning with spatiotempo-
ral relational reasoning. For instance, it is often the case that certain actions associate with others:
if an actor is crossing the street, then other actors in the proximity are likely crossing the street;
similarly, if one actor is believed to be talking, then other actors in the proximity are likely either
talking or listening. One could also reason about action transitions: if an actor is walking at time t,
then it is very likely that they will be walking at time t + 1; however, there is a small probability
that they may transition to a related action, such as crossing the street or waiting. Incorporating this
high-level relational reasoning can be considered a form of collective classification. This approach
has been used in a number of publications [7, 16] to achieve current state-of-the-art performance on
benchmark datasets.
Collective classification is also used in computational biology. For example, researchers study-
ing protein-protein interaction networks often need to annotate proteins with their biological func-
tion. Discovering protein function experimentally is expensive. Yet, protein function is sometimes
correlated with interacting proteins; so, given a set of labeled proteins, one can reason about the
remaining labels using collective methods [23].
The final application we consider is viral marketing, which is interesting for its relationship to
active collective classification. Suppose a company is introducing a new product to a population.
Given the social network and the individuals’ (i.e., local) demographic features, the goal is to deter-
mine which customers will be interested in the product. The mapping to collective classification is
straightforward. The interesting subproblem is in how one acquires labeled training data. Customer
surveys can be expensive to conduct, so companies want to acquire the smallest set of user opinions
that will enable them to accurately predict the remaining user opinions. This can be viewed as an
active learning problem for collective classification [3].
15.8 Conclusion
Given the recent explosion of relational and network data, collective classification is quickly
becoming a mainstay of machine learning and data mining. Collective techniques leverage the idea
that connected (related) data objects are in some way correlated, performing joint reasoning over
a high-dimensional, structured output space. Models and inference algorithms range from simple
iterative frameworks to probabilistic graphical models. In this chapter, we have only discussed a
few; for greater detail on these methods, and others we did not cover, we refer the reader to [25]
and [37]. Many of the algorithms discussed herein have been implemented in NetKit-SRL,3 an
open-source toolkit for mining relational data.
Acknowledgements
This material is based on work supported by the National Science Foundation under Grant No.
0746930 and Grant No. IIS1218488.
3 http://netkit-srl.sourceforge.net
Bibliography
[1] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng. Discrim-
inative learning of Markov random fields for segmentation of 3D scan data. In International
Conference on Computer Vision and Pattern Recognition, pages 169–176, 2005.
[2] Stephen H. Bach, Bert Huang, Ben London, and Lise Getoor. Hinge-loss Markov random
fields: Convex inference for structured prediction. In Uncertainty in Artificial Intelligence,
2013.
[3] Mustafa Bilgic and Lise Getoor. Effective label acquisition for collective classification. In
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
43–51, 2008. Winner of the KDD’08 Best Student Paper Award.
[4] Mustafa Bilgic, Lilyana Mihalkova, and Lise Getoor. Active learning for networked data. In
Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[5] Mustafa Bilgic, Galileo Mark Namata, and Lise Getoor. Combining collective classification
and link prediction. In Workshop on Mining Graphs and Complex Structures in International
Conference of Data Mining, 2007.
[6] Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext categorization using
hyperlinks. In International Conference on Management of Data, pages 307–318, 1998.
[7] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification
using spatio-temporal relationship among people. In VS, 2009.
[8] Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability analysis and
learning bounds for transductive regression algorithms. CoRR, abs/0904.0814, 2009.
[9] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In
International Joint Conference on Artificial Intelligence, 1999.
[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restora-
tion of images. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1990.
[11] L. Getoor. Link-based classification. In Advanced Methods for Knowledge Discovery from
Complex Data, Springer, 2005.
[12] Lise Getoor, Nir Friedman, Daphne Koller, and Ben Taskar. Learning probabilistic models of
link structure. Journal of Machine Learning Research, 3:679–707, 2002.
[13] Lise Getoor, Eran Segal, Benjamin Taskar, and Daphne Koller. Probabilistic models of text
and link structure for hypertext classification. In IJCAI Workshop on Text Learning: Beyond
Supervision, 2001.
[14] D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classi-
fication. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 593–598, 2004.
[15] Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao. Graph regularized trans-
ductive classification on heterogeneous information networks. In Proceedings of the 2010
European Conference on Machine Learning and Knowledge Discovery in Databases: Part I,
ECML PKDD’10, pages 570–586, Berlin, Heidelberg, 2010. Springer-Verlag.
[16] Sameh Khamis, Vlad I. Morariu, and Larry S. Davis. Combining per-frame and per-track
cues for multi-person action recognition. In European Conference on Computer Vision, pages
116–129, 2012.
[17] A. Knobbe, M. deHaas, and A. Siebes. Propositionalisation and aggregates. In Proceedings
of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery,
pages 277–288, 2001.
[18] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT
Press, 2009.
[19] S. Kramer, N. Lavrac, and P. Flach. Propositionalization approaches to relational data mining.
In S. Dzeroski and N. Lavrac, editors, Relational Data Mining. Springer-Verlag, New York,
2001.
[20] M. Krogel, S. Rawles, F. Zeezny, P. Flach, N. Lavrac, and S. Wrobel. Comparative evalu-
ation of approaches to propositionalization. In International Conference on Inductive Logic
Programming, pages 197–214, 2003.
[21] Ankit Kuwadekar and Jennifer Neville. Relational active learning for joint collective classi-
fication models. In Proceedings of the 28th International Conference on Machine Learning,
2011.
[22] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Inter-
national Conference on Machine Learning, pages 282–289, 2001.
[23] Stanley Letovsky and Simon Kasif. Predicting protein function from protein/protein interac-
tion data: a probabilistic approach. Bioinformatics, 19:197–204, 2003.
[24] Qing Lu and Lise Getoor. Link based classification. In Proceedings of the International
Conference on Machine Learning, 2003.
[25] S. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case
study. Journal of Machine Learning Research, 8(May):935–983, 2007.
[26] Sofus A. Macskassy. Improving learning in networked data by combining explicit and mined
links. In Proceedings of the Twenty-Second Conference on Artificial Intelligence, 2007.
[27] Sofus A. Macskassy. Using graph-based metrics with empirical risk minimization to speed up
active learning on networked data. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2009.
[28] Luke K. Mcdowell, Kalyan M. Gupta, and David W. Aha. Cautious inference in collective
classification. In Proceedings of AAAI, 2007.
[29] Galileo Mark Namata, Stanley Kok, and Lise Getoor. Collective graph identification. In ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
[30] Galileo Mark Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active sur-
veying for collective classification. In Workshop on Mining and Learning with Graphs, 2012.
[31] Jennifer Neville and David Jensen. Iterative classification in relational data. In Workshop on
Statistical Relational Learning, AAAI, 2000.
[32] Jennifer Neville and David Jensen. Relational dependency networks. Journal of Machine
Learning Research, 8:653–692, 2007.
[33] C. Perlich and F. Provost. Aggregation-based feature invention and relational concept classes.
In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[34] C. Perlich and F. Provost. Distribution-based aggregation for relational learning with identifier
attributes. Machine Learning Journal, 62(1-2):65–105, 2006.
[35] Matthew J. Rattigan, Marc Maier, David Jensen, Bin Wu, Xin Pei, JianBin Tan, and Yi Wang.
Exploiting network structure for active inference in collective classification. In Proceedings of
the Seventh IEEE International Conference on Data Mining Workshops, ICDMW ’07, pages
429–434. IEEE Computer Society, 2007.
[36] Matt Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
[37] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina
Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
[38] Hossam Sharara, Lise Getoor, and Myra Norton. Active surveying: A probabilistic approach
for identifying key opinion leaders. In The 22nd International Joint Conference on Artificial
Intelligence (IJCAI ’11), 2011.
[39] Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive
learning. In ECML/PKDD (2), pages 442–457, 2009.
[40] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In
Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, 2002.
[41] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Pro-
ceedings of the International Conference on Machine Learning, 2004.
[42] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Neural Information
Processing Systems, 2003.
[43] Mingrui Wu and Bernhard Schölkopf. Transductive classification via local learning regular-
ization. Journal of Machine Learning Research - Proceedings Track, 2:628–635, 2007.
[44] Yiming Yang, S. Slattery, and R. Ghani. A study of approaches to hypertext categorization.
Journal of Intelligent Information Systems, 18(2-3):219–241, 2002.
[45] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and gen-
eralized belief propagation algorithms. In IEEE Transactions on Information Theory, pages
2282–2312, 2005.
[46] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Generalized belief propagation. In Neural Infor-
mation Processing Systems, 13:689–695, 2000.
[47] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard
Schölkopf. Learning with local and global consistency. In NIPS, pages 321–328, 2003.
[48] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label
propagation. Technical report, Carnegie Mellon University, 2002.
[49] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaus-
sian fields and harmonic functions. In Proceedings of the 20th International Conference on
Machine Learning (ICML-03), pages 912–919, 2003.
Chapter 16
Uncertain Data Classification
Reynold Cheng
The University of Hong Kong
Yixiang Fang
The University of Hong Kong
Matthias Renz
University of Munich
16.1 Introduction
In emerging applications such as location-based services (LBS), sensor networks, and biolog-
ical databases, the values stored in the databases are often uncertain [11, 18, 19, 30]. In an LBS,
for example, the location of a person or a vehicle sensed by imperfect GPS hardware may not be
exact. This measurement error also occurs in the temperature values obtained by a thermometer,
where 24% of measurements are off by more than 0.5°C, or about 36% of the normal tempera-
ture range [15]. Sometimes, uncertainty may be injected to the data by the application, in order to
provide a better protection of privacy. In demographic applications, partially aggregated data sets,
rather than personal data, are available [3]. In LBS, since the exact location of a person may be sen-
sitive, researchers have proposed to “anonymize” a location value by representing it as a region [20].
For these applications, data uncertainty needs to be carefully handled, or else wrong decisions can
be made. Recently, this important issue has attracted a lot of attention from the database and data
mining communities [3, 9–11, 18, 19, 30, 31, 36].
In this chapter, we investigate the issue of classifying data whose values are not certain. Similar
to other data mining solutions, most classification methods (e.g., decision trees [27, 28] and naive
Bayes classifiers [17]) assume exact data values. If these algorithms are applied on uncertain data
(e.g., temperature values), they simply ignore them (e.g., by only using the thermometer reading).
Unfortunately, this can severely affect the accuracy of mining results [3, 29]. On the other hand, the
use of uncertainty information (e.g., the deviation of the thermometer reading) may provide new
insight about the mining results (e.g., the probability that a class label is correct, or the chance that
an association rule is valid) [3]. It thus makes sense to consider uncertainty information during the
data mining process.
How to consider uncertainty in a classification algorithm, then? For ease of discussion, let us
consider a database of n objects, each of which has a d-dimensional feature vector. For each feature
vector, we assume that the value of each dimension (or “attribute”) is a number (e.g., temperature).
A common way to model uncertainty of an attribute is to treat it as a probability density function, or
pdf in short [11]. For example, the temperature of a thermometer follows a Gaussian pdf [15]. Given
a d-dimensional feature vector, a natural attempt to handle uncertainty is to replace the pdf of an
attribute by its mean. Once all the n objects are processed in this way, each attribute value becomes
a single number, and thus any classification algorithm can be used. We denote this method as AVG.
The problem of this simple method is that it ignores the variance of the pdf. More specifically, AVG
cannot distinguish between two pdfs with the same mean but a big difference in variances. As shown
experimentally in [1, 23, 29, 32, 33], this problem impacts the effectiveness of AVG.
Instead of representing the pdf by its mean value, a better way could be to consider all its
possible values during the classification process. For example, if an attribute’s pdf is distributed in
the range [30, 35], each real value in this range has a non-zero chance to be correct. Thus, every
single value in [30, 35] should be considered. For each of the n feature vectors, we then consider
all the possible instances — an enumeration of d attribute values based on their pdfs. Essentially,
a database is expanded to a number of possible worlds, each of which has a single value for each
attribute, and has a probability to be correct [11, 18, 19, 30]. Conceptually, a classification algorithm
(e.g., decision tree) can then be applied to each possible world, so that we can obtain a classification
result for each possible world. This method, which we called PW, provides information that is not
readily available for AVG. For example, we can compute the probability that each label is assigned
to each object; moreover, the label with the highest probability can be assigned to the object. It
was also pointed out in previous works (e.g., [3, 32, 33]) that compared with AVG, PW yields a better
classification result.
Unfortunately, PW is computationally very expensive. This is because each attribute has an in-
finitely large number of possible values, which results in an infinitely large number of possible
worlds. Even if we approximate a pdf with a set of points, the number of possible worlds can still be
exponentially large [11,18,19,30]. Moreover, the performance of PW does not scale with the database
size. In this chapter, we will investigate how to incorporate data uncertainty in several well-known
algorithms, namely decision tree, rule-based, associative, density-based, nearest-neighbor-based,
support vector, and naive Bayes classification. These algorithms are redesigned with the goals that
(1) their effectiveness mimic that of PW, and (2) they are computationally efficient.
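To make the AVG/PW contrast concrete, the sketch below implements AVG by collapsing each sampled pdf to its mean, and approximates the flavor of PW by Monte-Carlo sampling a limited number of possible worlds. The sampling shortcut is only an illustration of the idea and is not one of the redesigned algorithms discussed in this chapter; a scikit-learn-style classifier (with fit/predict) is assumed, and the pdf representation (per-attribute value/probability pairs) and names are ours.

```python
import numpy as np

def avg_predict(clf, train_pdfs, train_labels, test_pdfs):
    """AVG: replace each attribute pdf by its mean, then use any standard classifier.
    A pdf is represented by a per-attribute pair (values, probabilities)."""
    to_means = lambda pdfs: np.array(
        [[np.dot(v, p) for (v, p) in obj] for obj in pdfs])
    clf.fit(to_means(train_pdfs), train_labels)
    return clf.predict(to_means(test_pdfs))

def sampled_worlds_predict(clf, train_pdfs, train_labels, test_pdfs, n_worlds=50, seed=0):
    """A Monte-Carlo stand-in for PW: draw a few possible worlds, classify each,
    and report the per-label vote fractions for every test object."""
    rng = np.random.default_rng(seed)
    draw = lambda pdfs: np.array(
        [[rng.choice(v, p=p) for (v, p) in obj] for obj in pdfs])
    votes = []
    for _ in range(n_worlds):
        clf.fit(draw(train_pdfs), train_labels)    # one possible world of the training data
        votes.append(clf.predict(draw(test_pdfs))) # one possible world of the test data
    votes = np.array(votes)                        # shape (n_worlds, n_test)
    labels = np.unique(train_labels)
    return {c: (votes == c).mean(axis=0) for c in labels}
```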
The rest of this chapter is as follows. In Section 16.2, we will describe uncertainty models and
formulate the problem of classifying uncertain data. Section 16.3 describes the details of several
uncertain data classification algorithms. We conclude in Section 16.4.
16.2 Preliminaries
We now discuss some background information that is essential to the understanding of the clas-
sification algorithms. We first describe the data uncertainty models assumed by these algorithms, in
Section 16.2.1. Section 16.2.2 then explains the framework that is common to all the classification
algorithms we discussed here.
These two models about the uncertainty of an attribute are commonly used by uncertain data
classification algorithms. Next, let us give an overview of these algorithms.
In the second step, for each test object $y_j$ with feature vector $(y_j^1, y_j^2, \cdots, y_j^d)$ in Test D,
$M(y_j^1, y_j^2, \cdots, y_j^d)$ yields a probability distribution $P_j$ of class labels for $y_j$. We then assign $y_j$ the
class label, say, $c_0$, whose probability is the highest, i.e., $c_0 = \arg\max_{c \in C} \{P_j(c)\}$.
The two common measures of a classifier M are:
• Accuracy: M should be highly accurate in its prediction of labels for Train D. For example,
M should have a high percentage of test tuples predicted correctly on Train D, compared with
the ground truth. To achieve a high accuracy, a classifier should be designed to avoid underfitting
or overfitting problems.
• Performance: M should be efficient, i.e., the computational cost should be low. Its perfor-
mance should also scale with the database size.
As discussed before, although AVG is efficient, it has a low accuracy. On the other hand, PW has a
higher accuracy, but is extremely inefficient. We next examine several algorithms that are designed
to satisfy both of these requirements.
FIGURE 16.1: Six labeled training tuples, each with one uncertain attribute whose pdf is approximated by sampled values and their probabilities.

    id   class   mean  |        probability distribution
                       |   -10     -1.0    0.0     +1.0    +10
    1    A      +2.0   |           8/11                    3/11
    2    A      -2.0   |   1/9     8/9
    3    A      +2.0   |           5/8             1/8     2/8
    4    B      -2.0   |   5/19    1/19            13/19
    5    B      +2.0   |                   1/35    30/35   4/35
    6    B      -2.0   |   3/11                    8/11
data with uncertain numerical attributes. In this section, we focus on the Uncertain Decision tree
(or UDT) proposed in [32].
Input: A set of labeled objects whose uncertain attributes are associated with pdf models.
Algorithm: Let us first use an example to illustrate how decision trees can be used for classi-
fying uncertain data. Suppose that there are six tuples with given labels A or B (Figure 16.1). Each
tuple has a single uncertain attribute. After sampling its pdf a few times, its probability distribu-
tion is approximated by five values with probabilities, as shown in the figure. Notice that the mean
attribute values of tuples 1, 3, and 5 are the same, i.e., 2.0. However, their probability distributions
are different.
We first use these tuples as a training dataset, and consider the method AVG. Essentially, we
only use these tuples’ mean values to build a decision tree based on traditional algorithms such
as C4.5 [28] (Figure 16.2(a)). In this example, the split point is x = −2. Conceptually, this value
determines the branch of the tree used for classifying a given tuple. Each leaf node contains the
information about the probability of an object to be in each class. If we use this decision tree to
classify tuple 1, we will reach the right leaf node by traversing the tree, since its mean value is 2.0.
Let P(A) and P(B) be the probabilities of a test tuple for belonging to classes A and B, respectively.
Since P(A) = 2/3, which is greater than P(B) = 1/3, we assign tuple 1 the label “A.” Using a simi-
lar approach, we label tuples 1, 3, and 5 as “A,” and tuples 2, 4, and 6 as “B.” The number of tuples
classified correctly, i.e., tuples 1, 3, 4, and 6, is 4, so the accuracy of AVG in this example is 2/3 (4 out of 6).
FIGURE 16.2: Decision trees constructed by (a) AVG and (b) UDT.
The form of the decision tree trained by UDT is in fact the same as the one constructed through
AVG. In our previous example, the decision tree generated by UDT is shown in Figure 16.2(b). We
can see that the only difference from (a) is that the split point is x = −1, which is different from that
of AVG. Let us now explain (1) how to use UDT to train this decision tree; and (2) how to use it to
classify previously unseen tuples.
1. Training Phase. This follows the framework of C4.5 [28]. It builds decision trees from the
root to the leaf nodes in a top-down, recursive, and divide-and-conquer manner. Specifically, the
training set is recursively split into smaller subsets according to some splitting criterion, in order to
create subtrees. The splitting criterion is designed to determine the best way to partition the tuples
into individual classes. The criterion not only tells us which attribute to be tested, but also how to use
the selected attribute (by using the split point). We will explain this procedure in detail soon. Once
the split point of a given node has been chosen, we partition the dataset into subsets according to
its value. In Figure 16.2(a), for instance, we classify tuples with x > −2 to the right branch, and put
tuples with x ≤ −2 to the left branch. Then we create new sub-nodes with the subsets recursively.
The commonly used splitting criteria are based on indexes such as information gain (entropy), gain
ratio, and Gini index. Ideally, a good splitting criterion should make the resulting partitions at each
branch as “pure” as possible (i.e., the tuples that fall into each branch should belong to as few
distinct classes as possible).
Let us now explain how to choose an attribute to split, and how to split it. First, assume that
an attribute is chosen to split, and the goal is to find out how to choose the best split point for an
uncertain attribute. To do this, UDT samples points from the pdf of the attribute, and considers them
as candidate split points. It then selects the split point with the lowest entropy. The computation
of entropy H(q, A j ) for a given split point q over an attribute A j is done as follows: first split the
dataset into different subsets according to the split point. Then for each subset, we compute its
probabilities of belonging to different classes. Then its entropy H(q, A j ) can be computed based on
these probabilities. The optimal split point is the one that minimizes the entropy for A j . Notice that
this process can potentially generate more candidate split points than that of AVG, which is more
expensive but can be more accurate [32]. In the above example, to choose the best split point, we
only have to check 2 points in AVG since the attributes concerned only have two distinct values, 2
and -2. On the other hand, we have to check five distinct values in UDT.
How to choose the attribute for which a split point is then applied? This is done by first choosing
the optimal split points for all the attributes respectively, and then select the attribute whose entropy
based on its optimal split point is the minimum among those of other attributes. After an attribute
and a split point have been chosen for a particular node, we split the set S of objects into two subsets,
L and R. If the pdf of a tuple contains the split point x, we split it into two fractional tuples [28]
$t_L$ and $t_R$, and add them to L and R, respectively. For example, for a given tuple $x_i$, let the interval
of its j-th attribute $A_j$ be $[l_{i,j}, r_{i,j}]$, and let the split point p satisfy $l_{i,j} \leq p \leq r_{i,j}$. After splitting, we obtain two
fractional tuples, whose attributes have intervals $[l_{i,j}, p]$ and $(p, r_{i,j}]$. The pdf of attribute $A_j$ of a
fractional tuple $x_f$ is the same as the one defined for $x_i$, except that the pdf value is equal to zero
outside the interval of the fractional tuple’s attribute (e.g., $[l_{i,j}, p]$). After building the tree by the
above steps, some branches of a decision tree may reflect anomalies in the training data due to noise
or outliers. In this case, traditional pruning techniques, such as pre-pruning and post-pruning [28],
can be adopted to prune these nodes.
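A small sketch of evaluating candidate split points for a single uncertain attribute: the probability mass on each side of a candidate split is accumulated per class (forming the "fractional tuples"), and the weighted entropy of the two partitions is returned. The tuple representation follows the sampled pdfs of Figure 16.1; the exact entropy bookkeeping of UDT may differ in details, and the names are ours.

```python
import math

def split_entropy(tuples, q):
    """Entropy H(q, A_j) of splitting at q, for one uncertain attribute.

    tuples: list of (label, [(value, prob), ...]); each tuple's pdf is
            approximated by sampled points, as in Figure 16.1.
    Mass with value <= q goes to the left partition, the rest to the right.
    """
    left, right = {}, {}
    for label, points in tuples:
        for value, prob in points:
            side = left if value <= q else right
            side[label] = side.get(label, 0.0) + prob
    total = sum(left.values()) + sum(right.values())
    H = 0.0
    for side in (left, right):
        mass = sum(side.values())
        if mass == 0:
            continue
        ent = -sum((p / mass) * math.log2(p / mass) for p in side.values() if p > 0)
        H += (mass / total) * ent     # entropy of each partition, weighted by its mass
    return H

# Tuples 1-6 of Figure 16.1; candidate split points are the sampled values.
data = [
    ("A", [(-1.0, 8/11), (10.0, 3/11)]),
    ("A", [(-10.0, 1/9), (-1.0, 8/9)]),
    ("A", [(-1.0, 5/8), (1.0, 1/8), (10.0, 2/8)]),
    ("B", [(-10.0, 5/19), (-1.0, 1/19), (1.0, 13/19)]),
    ("B", [(0.0, 1/35), (1.0, 30/35), (10.0, 4/35)]),
    ("B", [(-10.0, 3/11), (1.0, 8/11)]),
]
candidates = [-10.0, -1.0, 0.0, 1.0, 10.0]
best = min(candidates, key=lambda q: split_entropy(data, q))
print(best)   # -1.0 for this data, matching the split chosen by UDT in Figure 16.2(b)
```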
2. Testing Phase. For a given test tuple with uncertainty, there may be more than one path for
traversing the decision tree in a top-down manner (from the root node). This is because a node may
only cover part of the tuple. When we visit an internal node, we may split the tuple into two parts at
its split point, and distribute each part recursively down the child nodes accordingly. When the leaf
nodes are reached, the probability distribution at each leaf node contributes to the final distribution
for predicting its class label. (In AVG, by contrast, only one path is tracked from the root to a leaf node.)
To illustrate this process, let us consider the decision tree in Figure 16.2(b) again, where the split
point is x = −1. If we use this decision tree to classify tuple 1 by its pdf, its value is either -1 with
probability 8/11, or 10 with probability 3/11. After traversing from the root of the tree, we will reach
the left leaf node with probability 8/11, and the right leaf node with probability 3/11. Since the left
leaf node has a probability 0.8 of belonging to class A, and the right leaf node has a probability 0.212
of belonging to class A, P(A) = (8/11) × 0.8 + (3/11) × 0.212 = 0.64. We can similarly compute
P(B) = (8/11) × 0.2 + (3/11) × 0.788 = 0.36. Since P(A) > P(B), tuple 1 is labeled as “A.”
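This class-probability computation for tuple 1 can be reproduced in a couple of lines, using the leaf distributions quoted above:

```python
# Tuple 1 reaches the left leaf (x <= -1) with probability 8/11 and the right
# leaf with probability 3/11; the leaves' P(A) values are 0.8 and 0.212.
p_left, p_right = 8 / 11, 3 / 11
p_A = p_left * 0.8 + p_right * 0.212     # ~0.64
p_B = p_left * 0.2 + p_right * 0.788     # ~0.36
print(round(p_A, 2), round(p_B, 2))      # 0.64 0.36 -> tuple 1 is labeled "A"
```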
Efficient split point computation. By considering pdf during the decision tree process, UDT
promises to build a more accurate decision tree than AVG. Its main problem is that it has to consider a
lot more split points than AVG, which reduces its efficiency. In order to accelerate the speed of finding
the best split point, [32] proposes several strategies to prune candidate split points. To explain,
suppose that for a given set S of tuples, the interval of the j-th attribute of the i-th tuple in S is
represented as $[l_{i,j}, r_{i,j}]$, associated with pdf $f_i^j(x)$. The set of end-points of objects in S on attribute
$A_j$ can be defined as $Q_j = \{q \mid (q = l_{i,j}) \vee (q = r_{i,j})\}$. Points that lie in $(l_{i,j}, r_{i,j})$ are called interior
points. We assume that there are v end-points, $q_1, q_2, \cdots, q_v$, sorted in ascending order. Our objective
is to find an optimal split point in $[q_1, q_v]$.
The end-points define v − 1 disjoint intervals: $(q_k, q_{k+1}]$, where 1 ≤ k ≤ v − 1. [32] defines three
types of interval: empty, homogeneous, and heterogeneous. An interval $(q_k, q_{k+1}]$ is called empty if the
probability on this interval $\int_{q_k}^{q_{k+1}} f_i^j(x)\,dx = 0$ for all $x_i \in S$. An interval $(q_k, q_{k+1}]$ is homogeneous if
there exists a class label c ∈ C such that the probability on this interval $\int_{q_k}^{q_{k+1}} f_i^j(x)\,dx \neq 0 \Rightarrow C_i = c$
for all $x_i \in S$. An interval $(q_k, q_{k+1}]$ is heterogeneous if it is neither empty nor homogeneous. We
next discuss three techniques for pruning candidate split points.
(i) Pruning empty and homogeneous intervals. [32] shows that empty and homogeneous in-
tervals do not need to be considered. In other words, the interior points of empty and homogeneous
intervals can be ignored.
(ii) Pruning by bounding. This strategy removes heterogeneous intervals through a bounding
technique. The main idea is to compute the entropy $H(q, A_j)$ for all end-points $q \in Q_j$. Let $H_j^*$
be the minimum such value. For each heterogeneous interval $(q_k, q_{k+1}]$, a lower bound, $L_j$, of $H(z, A_j)$ over
all candidate split points $z \in (q_k, q_{k+1}]$ is computed. An efficient method of finding $L_j$ without
considering all the candidate split points inside the interval is detailed in [33]. If $L_j \geq H_j^*$, none
of the candidate split points within the interval $(q_k, q_{k+1}]$ can give an entropy smaller than $H_j^*$;
therefore, the whole interval can be pruned. Since the number of end-points is much smaller than
the total number of candidate split points, many heterogeneous intervals are pruned in this manner,
which also reduces the number of entropy computations.
(iii) End-point sampling. Since many of the entropy calculations are due to the determina-
tion of entropy at the end-points, we can sample a portion (say, 10%) of the end-points, and use
their entropy values to derive a pruning threshold. This method also reduces the number of entropy
calculations, and improves the computational performance.
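A rough sketch of the first pruning step, assuming for simplicity that each tuple's pdf is uniform over its interval: it marks every end-point interval as empty, homogeneous, or heterogeneous, so that only the heterogeneous ones retain their interior candidate split points. The representation and names are ours.

```python
def classify_intervals(tuples, end_points):
    """tuples: list of (label, (l, r)) with a uniform pdf assumed on [l, r].
    end_points: the sorted, de-duplicated l's and r's.
    Returns, for each interval (q_k, q_{k+1}], the string 'empty', 'homogeneous',
    or 'heterogeneous'; per the pruning rule above, only heterogeneous intervals
    keep their interior candidate split points."""
    kinds = []
    for lo, hi in zip(end_points, end_points[1:]):
        classes = set()
        for label, (l, r) in tuples:
            overlap = max(0.0, min(hi, r) - max(lo, l))   # mass a uniform pdf puts in (lo, hi]
            if overlap > 0:
                classes.add(label)
        if not classes:
            kinds.append("empty")
        elif len(classes) == 1:
            kinds.append("homogeneous")
        else:
            kinds.append("heterogeneous")
    return kinds

# Example: two classes whose supports only overlap in the middle; only the
# interval (4.5, 5.0] is heterogeneous and needs further entropy evaluation.
data = [("A", (0.0, 4.0)), ("A", (1.0, 5.0)), ("B", (4.5, 8.0)), ("B", (6.0, 9.0))]
qs = sorted({e for _, (l, r) in data for e in (l, r)})
print(list(zip(zip(qs, qs[1:]), classify_intervals(data, qs))))
```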
Discussions: From Figures 16.1 and 16.2, we can see that the accuracy of UDT is higher than
that of AVG, even though its time complexity is higher than that of AVG. By considering the sampled
points of the pdfs, UDT can find better split points and derive a better decision tree. In fact, the ac-
curacy of UDT depends on the number of sampled points; the more points are sampled, the closer
the accuracy of UDT is to that of PW. In the testing phase, we have to track one or several paths in order
to determine its final label, because an uncertain attribute may be covered by several tree nodes.
Further experiments on ten real datasets taken from the UCI Machine Learning Repository [6] show
that UDT builds more accurate decision trees than AVG does. These results also show that the pruning
techniques discussed above are very effective. In particular, the strategy that combines the above
three techniques reduces the number of entropy calculations to 0.56%-28% (i.e., a pruning effec-
tiveness ranging from 72% to 99.44%). Finally, notice that UDT has also been extended to handle
pmf in [33]. This is not difficult, since UDT samples a pdf into a set of points. Hence UDT can easily
be used to support pmf.
We conclude this section by briefly describing two other decision tree algorithms designed for
uncertain data. In [23], the authors enhance the traditional decision tree algorithms and extend
measures, including entropy and information gain, by considering the uncertain data’s intervals and
pdf values. It can handle both numerical and categorical attributes. It also defines a “probabilistic
entropy metric” and uses it to choose the best splitting point. In [24], a decision-tree based clas-
sification system for uncertain data is presented. This tool defines new measures for constructing,
pruning, and optimizing a decision tree. These new measures are computed by considering pdfs.
Based on the new measures, the optimal splitting attributes and splitting values can be identified.
1. Rules Growing Phase. uRule starts with an initial rule: {} ⇒ Ck , where the left side is an
empty set and the right side contains the target class label Ck . New conjuncts will subsequently
be added to the left set. uRule uses the probabilistic information gain as a measure to identify the best conjunct to add to the rule antecedent (we discuss this measure in more detail below). It selects the conjunct with the highest probabilistic information gain and adds it to the left set as an antecedent of the rule. The tuples covered by this rule are then removed from the training dataset D_k. This process is repeated until the training data is empty or the conjuncts of the rule contain all the attributes. Before introducing the probabilistic information gain, uRule defines probability cardinalities for attributes with a pdf model as well as a pmf model.
(a) pdf model. For a given attribute associated with a pdf model, its values are represented by a pdf over an interval, so a conjunct related to such an attribute covers an interval. For example, a conjunct of a rule for the attribute income may be 1000 ≤ income < 2000.
The maximum and the minimum values of the interval are called end-points. Suppose that there
are N end-points for an attribute after eliminating duplication; then this attribute can be divided
into N+1 partitions. Since the leftmost and rightmost partitions do not contain data instances at all,
we only need to consider the remaining N-1 partitions. We let each candidate conjunct involve a
partition when choosing the best conjunct for an attribute, so there are N-1 candidate conjuncts for
this attribute.
For a given rule R extracted from Dk , we call the instances in Dk that are covered and classified
correctly by R positive instances, and call the instances in Dk that are covered, but are not classified
correctly by R, negative instances. We denote the set of positive instances of R w.r.t. Dk as Pk , and
denote the set of negative instances of R w.r.t. Dk as Nk .
For a given attribute A_j with a pdf model, the probability cardinality of all the positive instances over one of its intervals [a, b) can be computed as PC+([a, b)) = ∑_{l=1}^{|P_k|} P(x_l^j ∈ [a, b) ∧ c_l = C_k), and the probability cardinality of all the negative instances over the interval [a, b) can be computed as PC−([a, b)) = ∑_{l=1}^{|N_k|} P(x_l^j ∈ [a, b) ∧ c_l ≠ C_k).
(b) pmf model. For a given attribute associated with a pmf model, its values are represented by a set of discrete values with probabilities. Each value is called a candidate split point. A conjunct related to an attribute associated with a pmf model covers only one of its values. For example, a conjunct of a rule for the attribute profession may be profession = student. To choose the best split
point for an attribute, we have to consider all its candidate split points. Similar to the case of pdf
model, we have the following definitions.
For a given attribute A_j with a pmf model, the probability cardinality of all the positive instances over one of its values v can be computed as PC+(v) = ∑_{l=1}^{|P_k|} P(x_l^j = v ∧ c_l = C_k), and the probability cardinality of all the negative instances over value v can be computed as PC−(v) = ∑_{l=1}^{|N_k|} P(x_l^j = v ∧ c_l ≠ C_k).
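As a concrete illustration of the probability cardinalities, the following Python snippet computes PC+ and PC− for a conjunct on a pmf-modeled attribute. The dictionary-based data layout is an assumption made for the example, not the representation used in [25, 26].

def prob_cardinalities(instances, attr, value, target_class):
    """Probability cardinality of positive/negative instances for conjunct attr = value.

    Each instance is a dict like:
      {'label': 'buy', 'attrs': {'profession': {'student': 0.7, 'teacher': 0.3}}}
    """
    pc_pos = 0.0
    pc_neg = 0.0
    for inst in instances:
        p = inst['attrs'][attr].get(value, 0.0)   # P(x^j = v) from the pmf
        if inst['label'] == target_class:
            pc_pos += p                           # covered and correctly classified
        else:
            pc_neg += p                           # covered but wrongly classified
    return pc_pos, pc_neg

# Toy usage with two uncertain tuples.
data = [
    {'label': 'buy',  'attrs': {'profession': {'student': 0.7, 'teacher': 0.3}}},
    {'label': 'skip', 'attrs': {'profession': {'student': 0.4, 'teacher': 0.6}}},
]
print(prob_cardinalities(data, 'profession', 'student', 'buy'))  # (0.7, 0.4)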
Now let us discuss how to choose the best conjunct for a rule R in the case of a pdf model (the
method for pmf model is similar). We consider all the candidate conjuncts related to each attribute.
For each candidate conjunct, we do the following two steps:
• (1) We compute the probability cardinalities of the positive and negative instances of R over the covered interval of this conjunct, using the method shown in (a). Let us denote the probability cardinalities on R's positive and negative instances as PC+(R) and PC−(R).
• (2) We form a new rule R′ by adding this conjunct to R. Then we compute the probability cardinalities of R′'s positive and negative instances over the covered interval of this conjunct, respectively. Let us denote the former as PC+(R′) and the latter as PC−(R′). Then we use Definition 1 to compute the probabilistic information gain between R and R′.
Among all the candidate conjuncts of all the attributes, we choose the conjunct with the highest probabilistic information gain, as defined in Definition 1, as the best conjunct of R, and then add it to R.
Definition 1. Let R′ be the rule obtained by adding a conjunct to R. The probabilistic information gain between R and R′ over the interval or split point related to this conjunct is

ProbInfo(R, R′) = PC+(R′) · ( log2( PC+(R′) / (PC+(R′) + PC−(R′)) ) − log2( PC+(R) / (PC+(R) + PC−(R)) ) ).   (16.1)
From Definition 1, we can see that the probabilistic information gain is proportional both to PC+(R′) and to the accuracy PC+(R′)/(PC+(R′) + PC−(R′)), so it prefers rules that have high accuracy.
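The measure of Equation (16.1) is simple to compute once the cardinalities are available. The short sketch below takes the four probability cardinalities as plain numbers and evaluates the probabilistic information gain; it only illustrates the arithmetic of the definition, under the assumption that the cardinalities have already been computed as above.

import math

def prob_info_gain(pc_pos_new, pc_neg_new, pc_pos_old, pc_neg_old):
    """Probabilistic information gain of Equation (16.1):
    ProbInfo(R, R') = PC+(R') * (log2(acc(R')) - log2(acc(R)))
    where acc(.) = PC+ / (PC+ + PC-)."""
    acc_new = pc_pos_new / (pc_pos_new + pc_neg_new)
    acc_old = pc_pos_old / (pc_pos_old + pc_neg_old)
    return pc_pos_new * (math.log2(acc_new) - math.log2(acc_old))

# A candidate conjunct that raises the rule's accuracy from 0.5 to 0.8.
print(prob_info_gain(4.0, 1.0, 5.0, 5.0))   # positive gain, so the conjunct is attractive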
2. Rule Pruning Phase. After extracting a rule from grow Dataset, we have to prune it, because a rule that performs well on grow Dataset may not perform well in the test phase. We therefore prune some of its conjuncts based on its performance on the validation set prune Dataset.
For a given rule R in the pruning phase, uRule starts with its most recently added conjunct. For each conjunct of R's condition, we consider a new rule R′ obtained by removing that conjunct from R. Then we use R and R′ to classify prune Dataset and obtain the sets of positive and negative instances of each. If the proportion of positive instances of R′ is larger than that of R, the accuracy can be improved by removing this conjunct, so we remove it from R and repeat the same pruning steps on the other conjuncts of R; otherwise we keep the conjunct and stop the pruning.
3. Classification Phase. Once the rules are learnt from the training dataset, we can use them
to classify unlabeled instances. Traditional rule-based classifiers such as RIPPER [12] often use the first rule that covers an instance to predict its label. However, an uncertain data object may be covered by several rules, so we have to consider the weight of an object covered by different rules (conceptually, the weight is the probability that the instance is covered by a rule; for details, please refer to [25]), and then assign to the instance the class label of the rule with the highest weight.
Discussions: The algorithm uRule follows the basic framework of the traditional rule-
based classification algorithm RIPPER [12]. It extracts rules from one class at a time by the
Learn One Rule procedure. In the Learn One Rule procedure, it splits the training dataset into
grow Dataset, and prune Dataset, generates rules from grow Dataset and prunes rules based on
prune Dataset. To choose the best conjunct when generating rules, uRule proposes the concept of
probability cardinality, and uses a measure called probabilistic information gain, which is based
on probability cardinality, to choose the best conjunct for a rule. In the test phase, since a test object may be covered by one or several rules, uRule takes the weight of coverage into account for the final classification. The experimental results show that this method achieves relatively stable accuracy even when the extent of uncertainty reaches 20%, so it is robust for data with large ranges of uncertainty.
The basic algorithm for the detection of such rules works as follows: Given a collection of
attribute-value pairs, in the first step the algorithm identifies the complete set of frequent patterns
from the training dataset, given the user-specified minimum support threshold and/or discriminative
measurements like the minimum confidence threshold. Then, in the second step, a set of rules is selected based on several covering paradigms and discrimination heuristics. Finally, the extracted classification rules can be used to classify novel database entries or to train other classifiers. In many studies, associative classification has been found to be more accurate than
some traditional classification algorithms, such as decision tree, which only considers one attribute
at a time, because it explores highly confident associations among multiple attributes.
In the case of uncertain data, specialized associative classification solutions are required. The HARMONY algorithm proposed in [35] has been extended to uncertain data classification in [16]; the extension is called uHARMONY. It efficiently identifies discriminative patterns directly from uncertain data and uses the resulting classification features/rules to help train either SVM or rule-based classifiers for the classification procedure.
Input: A set of labeled objects whose uncertain attributes are associated with pmf models.
Algorithm: uHARMONY follows the basic framework of algorithm HARMONY [35]. The
first step is to search for frequent patterns. Traditionally, an itemset y is said to be supported by a tuple x_i if y ⊆ x_i. Consider a certain database D consisting of n tuples x_1, x_2, ..., x_n, where the label of x_i is denoted by c_i. |{x_i | x_i ∈ D ∧ y ⊆ x_i}| is called the support of itemset y with respect to D, denoted by sup_y. The support of itemset y under class c is defined as |{x_i | c_i = c ∧ x_i ∈ D ∧ y ⊆ x_i}|, denoted by sup_y^c. y is said to be frequent if sup_y ≥ sup_min, where sup_min is a user-specified minimum support threshold. The confidence of y under class c is defined as conf_y^c = sup_y^c / sup_y.
However, for a given uncertain database, there exists a probability of y ⊆ xi when y contains at
least one item of uncertain attribute, and the support of y is no longer a single value but a probability
distribution function instead. uHARMONY defines pattern frequentness by means of the expected
support and uses the expected confidence to represent the discrimination of found patterns.
1. Expected Confidence. For a given itemset y on the training dataset, its expected support E(sup_y) is defined as E(sup_y) = ∑_{i=1}^{n} P(y ⊆ x_i). Its expected support on class c is defined as E(sup_y^c) = ∑_{i=1}^{n} P(y ⊆ x_i ∧ c_i = c). Its expected confidence is defined in Definition 1.
Definition 1. Given a set of objects and the set of possible worlds W with respect to it, the expected confidence of an itemset y on class c is

E(conf_y^c) = ∑_{w_i∈W} conf_{y,w_i}^c × P(w_i) = ∑_{w_i∈W} ( sup_{y,w_i}^c / sup_{y,w_i} ) × P(w_i)   (16.2)

where P(w_i) stands for the probability of world w_i; conf_{y,w_i}^c is the confidence of y on class c in world w_i, while sup_{y,w_i} (sup_{y,w_i}^c) is the support of y (on class c) in world w_i.
For a given possible world w_i, conf_{y,w_i}^c, sup_{y,w_i}^c, and sup_{y,w_i} can be computed using the traditional method introduced in the previous paragraph. Although we have conf_y^c = sup_y^c / sup_y on a certain database, the expected confidence E(conf_y^c) = E(sup_y^c / sup_y) of itemset y on class c is not equal to E(sup_y^c)/E(sup_y) on an uncertain database. For example, Figure 16.3 shows an uncertain
database about computer purchase evaluation with a certain attribute on looking and an uncertain
attribute on quality. The column named Evaluation represents the class labels. For the example of
tuple 1, the probabilities of Bad, Medium, and Good qualities are 0.8, 0.1 and 0.1 respectively.
Let us consider the class c=Unacceptable, and the itemset y={Looking = −, Quality = −},
which means the values of the attributes Looking and Quality are Bad in the itemset y. In order to
compute the expected confidence E(conf_y^c), we have to consider the two cases that contain y (notice that we do not consider cases that do not contain y, since their expected support and expected confidence are 0). Let case1 denote the first case, in which the first two tuples contain y and the other two tuples do not; the probability of case1 is P(case1) = 0.8 × 0.1. Since only one tuple in case1, i.e., tuple 1, has the label c = Unacceptable, sup_{y,case1}^c = 1 and sup_{y,case1} = 2. Let case2 denote the second case, in which only the first tuple contains y and the other three tuples do not; the probability of case2 is P(case2) = 0.8 × 0.9, and sup_{y,case2}^c = 1 and sup_{y,case2} = 1. So we can calculate E(conf_y^c) = 1.0 × (0.8 × 0.9) + 0.5 × (0.8 × 0.1) = 0.76, while E(sup_y^c)/E(sup_y) = 0.8/(0.8 + 0.1) ≈ 0.89. Therefore, the calculation of the expected confidence is non-trivial and requires careful computation.
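The numbers in this example can be reproduced by brute-force enumeration of the possible worlds. The short sketch below assumes that the only uncertainty is whether each tuple contains y, with the containment probabilities stated above; it is a check of the example, not the general uHARMONY computation.

from itertools import product

# P(y ⊆ x_i) for the four tuples, and whether each tuple has class c = Unacceptable.
contain_prob = [0.8, 0.1, 0.0, 0.0]
has_class_c  = [True, False, False, False]

expected_conf = 0.0
for world in product([0, 1], repeat=len(contain_prob)):
    # Probability of this possible world.
    p_world = 1.0
    for bit, p in zip(world, contain_prob):
        p_world *= p if bit else (1.0 - p)
    sup = sum(world)
    if p_world == 0.0 or sup == 0:
        continue                       # worlds without y contribute 0 confidence
    sup_c = sum(b for b, is_c in zip(world, has_class_c) if is_c)
    expected_conf += (sup_c / sup) * p_world

print(round(expected_conf, 2))         # 0.76, matching the text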
Unlike probability-cardinality-based measures such as the probabilistic information gain discussed in Section 16.3.2, which may not be precise and lack theoretical explanation and statistical meaning, the expected confidence is statistically meaningful while providing a relatively accurate measure of discrimination. However, according to Definition 1, computing the expected confidence of a given itemset y directly is impractical, because the number of possible worlds |W| is extremely large (in fact exponential). The work [16] proposes some theorems to compute the expected confidence efficiently.
Lemma 1. Since 0 ≤ sup_y^c ≤ sup_y ≤ n, we have:

E(conf_y^c) = ∑_{w_i∈W} conf_{y,w_i}^c × P(w_i) = ∑_{i=0}^{n} ∑_{j=0}^{i} (j/i) × P(sup_y^c = j ∧ sup_y = i) = ∑_{i=0}^{n} E_i(sup_y^c)/i = ∑_{i=0}^{n} E_i(conf_y^c),

where E_i(sup_y^c) and E_i(conf_y^c) denote the parts of the expected support and the expected confidence of itemset y on class c when sup_y = i.
Given 0 ≤ j ≤ n, let E_j(sup_y^c) = ∑_{i=0}^{n} E_{i,j}(sup_y^c) be the expected support of y on class c on the first j objects of Train D, and E_{i,j}(sup_y^c) be the part of the expected support of y on class c with support i on the first j objects of Train D.
Theorem 1. Let p_i denote P(y ⊆ x_i) for each object x_i ∈ Train D, and P_{i,j} be the probability of y having support i on the first j objects of Train D. We have E_{i,j}(sup_y^c) = p_j × E_{i−1,j−1}(sup_y^c) + (1 − p_j) × E_{i,j−1}(sup_y^c) when c_j ≠ c, and E_{i,j}(sup_y^c) = p_j × E_{i−1,j−1}(sup_y^c + 1) + (1 − p_j) × E_{i,j−1}(sup_y^c) when c_j = c, where 0 ≤ i ≤ j ≤ n. E_{i,j}(sup_y^c) = 0 for all j when i = 0, and whenever j < i.
(For the details of the proof, please refer to [16].)
Notice that P_{i,j} can be computed in a way similar to that shown in Theorem 1 (for details, please refer to [16]). Based on these results, we can compute the expected confidence with O(n²) time complexity and O(n) space complexity. In addition, the work [16] also proposes some strategies to compute an upper bound of the expected confidence, which makes the computation of the expected confidence more efficient.
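The following Python sketch implements a dynamic program in the spirit of Lemma 1 and Theorem 1: it maintains, after each training object, the distribution of sup_y together with E[sup_y^c · 1{sup_y = i}], from which E(conf_y^c) follows. The state layout is a plausible realization with the stated O(n²) time and O(n) space, not necessarily the exact recursion of [16].

def expected_confidence(contain_prob, is_class_c):
    """Expected confidence E(conf_y^c) via an O(n^2) dynamic program.

    contain_prob[j] = P(y ⊆ x_j); is_class_c[j] = True if c_j = c.
    P[i] tracks P(sup_y = i) over the objects seen so far, and
    E[i] tracks E[sup_y^c * 1{sup_y = i}], so that
    E(conf_y^c) = sum_i E[i] / i, as in Lemma 1.
    """
    P = [1.0]          # before any object, the support is 0 with probability 1
    E = [0.0]
    for p, in_c in zip(contain_prob, is_class_c):
        newP = [0.0] * (len(P) + 1)
        newE = [0.0] * (len(E) + 1)
        for i, (pi, ei) in enumerate(zip(P, E)):
            # object does not contain y: support stays at i
            newP[i] += (1.0 - p) * pi
            newE[i] += (1.0 - p) * ei
            # object contains y: support moves to i + 1,
            # and sup_y^c grows by 1 only if the object has class c
            newP[i + 1] += p * pi
            newE[i + 1] += p * (ei + (pi if in_c else 0.0))
        P, E = newP, newE
    return sum(e / i for i, e in enumerate(E) if i > 0)

print(round(expected_confidence([0.8, 0.1, 0.0, 0.0], [True, False, False, False]), 2))  # 0.76

On the computer-purchase example above it returns 0.76, matching the possible-worlds calculation.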
2. Rules Mining. The steps for mining frequent itemsets in uHARMONY are very similar to those of HARMONY [35]. Before running the itemset mining algorithm, it sorts the attributes to place all the certain attributes before the uncertain ones; the uncertain attributes are handled last when extending items by traversing the attributes. During the mining process, we start with all the candidate itemsets, each of which contains only one attribute-value pair, namely a conjunct. For each candidate itemset y, we first compute its expected confidence E(conf_y^c) under the different classes, and then assign the class label c* with the maximum confidence as its class label. We denote this candidate rule as y ⇒ c* and count the number of instances covered by this rule. If the number of instances covered by this itemset is not zero, we output a final rule y ⇒ c* and remove y from the set of candidate itemsets. Based on this mined rule, we compute a set of new candidate itemsets by
adding all the possible conjuncts related to other attributes to y. For each new candidate itemset, we
repeat the above process to generate all the rules.
After obtaining a set of itemsets, we can use them as rules for classification directly. For each
test instance, by summing up the product of the confidence of each itemset on each class and the
probability of the instance containing the itemset, we can assign the class label with the largest value
to it. In addition, we can convert the mined patterns to training features of other classifiers such as
SVM for classification [16].
Discussions: The algorithm uHARMONY follows the basic framework of HARMONY to mine
discriminative patterns directly and effectively from uncertain data. It proposes to use expected con-
fidence of a discovered itemset to measure its discrimination. Computing the expected confidence according to its original definition, which considers all possible combinations of the training objects, i.e., the possible worlds, takes exponential time and is therefore impractical. uHARMONY thus proposes theorems to compute the expected confidence efficiently: the overall time and space complexity of computing the expected confidence of a given itemset y are at most O(n²) and O(n), respectively, which makes it practical in real applications. The mined patterns can be used as classification features/rules to classify new unlabeled objects or to help train other classifiers (e.g., SVM). The experimental results show that uHARMONY outperforms other uncertain data classification algorithms, such as decision tree classification [23] and rule-based classification [25], with 4% to 10% improvement on average in accuracy on 30 categorical datasets.
point in the data set. Each kernel function is associated with a kernel width h, which determines the
level of smoothing created by the function. The kernel estimation is defined as follows:
f(x) = (1/n) ∑_{i=1}^{n} K_h(x − x_i)   (16.3)

K_h(x − x_i) = (1/(√(2π) · h)) · e^{−(x−x_i)²/(2h²)}   (16.4)
The overall effect of kernel density estimation is to replace each discrete point x_i by a continuous function K_h(x − x_i), which peaks at x_i and has a variance that is determined by the smoothing parameter h.
The presence of attribute-specific errors may change the density estimates. In order to take these
errors into account, we need to define error-based kernels. The error-based kernel can be defined
as:
Q_h(x − x_i, ϕ(x_i)) = (1/(√(2π) · (h + ϕ(x_i)))) · e^{−(x−x_i)²/(2·(h² + ϕ(x_i)²))},   (16.5)
where ϕ(xi ) denotes the estimated error associated with data point xi .
As in the previous case, the error-based density at a given data point is defined as the sum of the error-based kernels over the different data points. Consequently, the error-based density f^Q at point x is defined as:

f^Q(x, ϕ(x_i)) = (1/N) · ∑_{i=1}^{N} Q_h(x − x_i, ϕ(x_i)).   (16.6)
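A one-dimensional version of this error-based estimator is easy to write down. The sketch below evaluates Equations (16.5) and (16.6) directly for a list of points and their estimated errors; the toy data at the end is only for illustration.

import math

def error_based_density(x, points, errors, h):
    """Error-based kernel density f^Q(x) of Equation (16.6) in one dimension.

    points[i] is a data value and errors[i] its estimated error phi(x_i);
    each point contributes an error-based kernel Q_h as in Equation (16.5).
    """
    total = 0.0
    for xi, phi in zip(points, errors):
        width = h + phi
        var = 2.0 * (h * h + phi * phi)
        total += math.exp(-((x - xi) ** 2) / var) / (math.sqrt(2.0 * math.pi) * width)
    return total / len(points)

# Points with larger errors are smoothed more heavily.
pts, errs = [0.0, 1.0, 5.0], [0.1, 0.1, 2.0]
print(round(error_based_density(1.0, pts, errs, h=0.5), 4))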
This method can be generalized for very large data sets and data streams by condensing the
data into micro-clusters; specifically, here we need to adapt the concept of micro-cluster by in-
corporating the error information. In the following we will define an error-based micro-cluster as
follows:
Definition 1: An error-based micro-cluster for a set of d-dimensional points MC = {x_1, x_2, ..., x_m}
with order 1, 2, · · · , m is defined as a (3 · d + 1) tuple (CF2x (MC), EF2x (MC),CF1x (MC), n(MC)),
wherein CF2x (MC), EF2x (MC), and CF1x (MC) each correspond to a vector of d entries. The
definition of each of these entries is as follows.
• For each dimension, the sum of the squares of the data values is maintained in CF2^x(MC). Thus, CF2^x(MC) contains d values. The p-th entry of CF2^x(MC) is equal to ∑_{i=1}^{m} (x_i^p)².
• For each dimension, the sum of the squares of the errors in the data values is maintained in EF2^x(MC). Thus, EF2^x(MC) contains d values. The p-th entry of EF2^x(MC) is equal to ∑_{i=1}^{m} (ϕ_p(x_i))².
• For each dimension, the sum of the data values is maintained in CF1^x(MC). Thus, CF1^x(MC) contains d values. The p-th entry of CF1^x(MC) is equal to ∑_{i=1}^{m} x_i^p.
• The number of points in the data is maintained in n(MC).
In order to compress large data sets into micro-clusters, we can create and maintain the clusters using a single pass over the data as follows. Supposing the number of expected micro-clusters for a given dataset is q, we first randomly choose q centroids for the micro-clusters, which are empty in the initial stage. Each incoming data point is assigned to its closest micro-cluster centroid using the nearest-neighbor criterion. Note that we never create a new micro-cluster, which is different from [2]. Similarly, micro-clusters are never discarded in this process, since the aim of the micro-clustering process is to compress the data points so that the resulting statistics can be held in main memory for density estimation. Therefore, the number of micro-clusters q is determined by the amount of main memory available.
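The following sketch maintains such error-based micro-clusters in a single pass. It keeps the q centroids fixed at their randomly chosen initial positions, as the description above suggests, and omits any centroid refinement; the class layout is an assumption made for the example.

class ErrorMicroCluster:
    """Error-based micro-cluster statistics (CF2, EF2, CF1, n) as in Definition 1."""

    def __init__(self, centroid):
        d = len(centroid)
        self.centroid = list(centroid)       # used only for nearest-centroid assignment
        self.cf2 = [0.0] * d                 # per-dimension sum of squared values
        self.ef2 = [0.0] * d                 # per-dimension sum of squared errors
        self.cf1 = [0.0] * d                 # per-dimension sum of values
        self.n = 0

    def add(self, x, err):
        for p in range(len(x)):
            self.cf2[p] += x[p] * x[p]
            self.ef2[p] += err[p] * err[p]
            self.cf1[p] += x[p]
        self.n += 1


def assign_stream(points, errors, clusters):
    """One-pass assignment of each point to its closest micro-cluster centroid."""
    for x, e in zip(points, errors):
        best = min(clusters,
                   key=lambda mc: sum((a - b) ** 2 for a, b in zip(x, mc.centroid)))
        best.add(x, e)
    return clusters

# Two fixed centroids (q = 2), four 2-D points with per-dimension errors.
clusters = [ErrorMicroCluster([0.0, 0.0]), ErrorMicroCluster([5.0, 5.0])]
pts = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.2], [5.1, 4.9]]
errs = [[0.05, 0.05], [0.10, 0.05], [0.20, 0.10], [0.15, 0.15]]
assign_stream(pts, errs, clusters)
print([mc.n for mc in clusters])             # [2, 2]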
In [1] it is shown that for a given micro-cluster MC={x1 , x2 , · · · xm }, its true error Δ(MC) can be
computed efficiently. Its value along the j-th dimension can be defined as follows:
If the training dataset contains the micro-clusters MC1 , MC2 , · · · MCm where m is the number
of micro-clusters, then we can define the density estimate at x as follows:
f^Q(x, Δ(MC)) = (1/N) · ∑_{i=1}^{m} n(MC_i) · Q_h(x − c(MC_i), Δ(MC_i)).   (16.9)
This estimate can be used for a variety of data mining purposes. In the following, it is shown
how these error-based densities can be exploited for classification by introducing an error-based
generalization of a multi-variate classifier.
2. Density-Based Classification. The density-based classification designs a density-based adaptation of rule-based classifiers. To classify a test instance, we need to find relevant classification rules for it. The challenge is to use the density-based approach to find the particular subsets of dimensions that are most discriminatory for the test instance. So we need to find those subsets of dimensions, namely subspaces, in which the instance-specific local density of the data for a particular class is significantly higher than its density in the overall data. Let D_i denote the objects that belong to class i. For a particular test instance x, a given subspace S, and a given training dataset D, let g(x, S, D) denote the density of x in subspace S with respect to D. Computing the density over a given subspace S is the same as computing the density over the full-dimensional space while incorporating only the dimensions in S.
Before the test phase, in a preprocessing step, we compute the micro-clusters together with the corresponding statistics of each data set D_1, ..., D_L, where L is the number of classes. Then, the compressed micro-clusters for the different classes can be used to generate the accuracy density over different subspaces by using Equations (16.8) and (16.9). To construct the final set of rules, we can
use an iterative approach to find the most relevant subspaces for classification. For the relevance of
a particular subspace S, we define the density-based local accuracy A(x, S, C_i) as follows:

A(x, S, C_i) = ( |D_i| · g(x, S, D_i) ) / ( |Train D| · g(x, S, Train D) ).   (16.10)

The dominant class dom(x, S) for point x in subspace S is defined as the class with the highest accuracy. That is,

dom(x, S) = arg max_{1≤i≤L} A(x, S, C_i).   (16.11)
To determine all the highly relevant subspaces, we follow a bottom-up methodology to find those subspaces S that yield a high accuracy of the dominant class dom(x, S) for a given test instance x. Let T_j (1 ≤ j ≤ d) denote the set of all j-dimensional subspaces. We compute the relevance between x and each subspace in T_j (1 ≤ j ≤ d) iteratively, starting with T_1 and then proceeding with T_2, T_3, ..., T_d. The subspaces with high relevance can be used as rules for the final classification, by assigning to a test instance x the label of the class that is the most frequent dominant class over all highly relevant subspaces of x. Notice that, to make this process efficient, we can impose some constraints: in order for a subspace to be considered in the set of (j+1)-dimensional subspaces, at least one of its subsets needs to satisfy the accuracy threshold requirement. Let L_j (L_j ⊆ T_j) denote the subspaces that have sufficient discriminatory power for the instance, namely those whose accuracy is above a threshold α. To find such subspaces, we can use Equations (16.10) and (16.11). We iterate over increasing values of j, and join the candidate set L_j with the set L_1 in order to determine T_{j+1}. Then, we construct L_{j+1} by choosing the subspaces from T_{j+1} whose accuracy is above α. This process is repeated until the set T_{j+1} is empty. Finally, the set of subspaces L_all = ∪_j L_j can be used to predict the class label of the test instance x.
To summarize the density-based classification procedure, the steps for classifying a test instance are as follows: (1) compute the micro-clusters for each data set D_1, ..., D_L only once, in the preprocessing step; (2) compute the highly relevant subspaces for the test instance following the bottom-up methodology, obtaining the set of subspaces L_all (a sketch of this search is given below); (3) assign the label of the majority class in L_all to the test instance.
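The bottom-up subspace search of step (2) can be sketched as follows. The accuracy function A(x, S, C_i) is treated as a black box here (in the actual method it would be derived from the micro-cluster densities via Equations (16.8)–(16.10)), and the toy accuracy at the end is purely illustrative.

def relevant_subspaces(x, dims, accuracy, labels, alpha):
    """Bottom-up search for highly relevant subspaces for test instance x.

    `accuracy(x, S, c)` is assumed to return the density-based local accuracy
    A(x, S, c) of Equation (16.10). Returns all subspaces whose dominant class
    reaches the threshold alpha.
    """
    def dominant(S):
        scores = {c: accuracy(x, S, c) for c in labels}
        c_star = max(scores, key=scores.get)
        return c_star, scores[c_star]

    L1 = [frozenset([d]) for d in dims if dominant(frozenset([d]))[1] >= alpha]
    L_all, L_prev = list(L1), list(L1)
    while L_prev:
        # Join the previous level with L_1 to form the next candidate level T_{j+1}.
        candidates = {S | S1 for S in L_prev for S1 in L1 if not S1 <= S}
        L_next = [S for S in candidates if dominant(S)[1] >= alpha]
        L_all.extend(L_next)
        L_prev = L_next
    return L_all

def classify(x, dims, accuracy, labels, alpha):
    """Assign the most frequent dominant class over all highly relevant subspaces."""
    votes = {}
    for S in relevant_subspaces(x, dims, accuracy, labels, alpha):
        c = max(labels, key=lambda lab: accuracy(x, S, lab))
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Toy accuracy: subspaces containing dimension 0 strongly favor class 'a'.
def toy_accuracy(x, S, c):
    return 0.9 if (0 in S) == (c == 'a') else 0.2

print(classify([1.0, 2.0, 3.0], [0, 1, 2], toy_accuracy, ['a', 'b'], alpha=0.5))  # 'a'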
Discussions: Unlike other uncertain data classification algorithms [8, 29, 32], the density-based
classification introduced in this section considers not only the distribution of the data, but also the errors along different dimensions when estimating the kernel density. It proposes a method for error-based density estimation, which can be generalized to large data sets. In the test phase, it designs a density-based adaptation of rule-based classifiers: to classify a test instance, it finds all the highly relevant subspaces and then assigns the label of the majority class of these subspaces to the test instance. In addition, the error-based density estimation discussed in this algorithm can also be used to construct accurate solutions for other problems, and thus it can be applied to a variety of data management and mining applications where the data distribution is very uncertain or contains many errors. An experimental evaluation demonstrates that the error-based classification method is not only fast, but also achieves higher accuracy than methods that do not consider errors at all, such as the nearest-neighbor-based classification method introduced next.
FIGURE 16.5: Example of a comparison between the nearest neighbor and class.
is as follows: Given a set of training instances, to classify an unlabeled instance q, we have to search
for the nearest-neighbor NNTrain D (q) of q in the training set, i.e., the training instance having the
smallest distance to q, and assign the class label of NN_{Train D}(q) to q. Nearest neighbor classification can easily be generalized to take into account the k nearest neighbors, which is called k nearest neighbor classification (KNN, for short). In this case, the classification rule assigns to the unclassified instance q the class containing most of its k nearest neighbors. KNN classification is more robust against noise than the simple NN classification variant. The work [4]
proposes a novel nearest neighbor classification approach that is able to handle uncertain data called
Uncertain Nearest Neighbor (UNN, for short) classification, which extends the traditional nearest
neighbor classification to the case of data with uncertainty. It proposes a concept of nearest-neighbor
class probability, which is proved to be a much more powerful uncertain object classification method
than the straight-forward consideration of nearest-neighbor object probabilities.
Input: A set of labeled objects whose uncertain attributes are associated with pdf models.
Algorithm: To extend the nearest neighbor method for uncertain data classification, it is
straight-forward to follow the basic framework of the traditional nearest neighbor method and re-
place the distance metric method with distance between means (i.e., the expected location of the
uncertain instances) or the expected distance between uncertain instances. However, there is no
guarantee on the quality of the classification yielded by this naive approach. The example shown in Figure 16.5 demonstrates that the naive approach is defective because it may misclassify. Suppose we have one certain unclassified object q, three uncertain training objects whose supports are delimited by ellipses belonging to class "red", and another uncertain training object whose support is delimited by a circle belonging to class "blue". Suppose the "blue" object is a normally distributed uncertain object whose central point is (0,4), while the "red" objects are bimodal uncertain objects. Following the naive approach, one can easily see that q will be assigned to the "blue" class. However, the probability that a "red"-class object is closer to q than the "blue"-class one is 1 − 0.5³ = 0.875. This also means that the probability that the nearest neighbor of q is a "red"-class object is 87.5%. Thus, the performance of the naive approach is quite poor, because the distribution information (such as the variances) of the uncertain objects is ignored.
1. Nearest Neighbor Rule for Uncertain Data. To avoid the flaws of the naive approach fol-
lowing the above observations, the approach in [4] proposes the concept of most probable class.
It simultaneously considers the distance distributions between q and all uncertain training objects
in Train D. The probability Pr(NNTrain D (q) = c) that the object q will be assigned to class c by
where function Ic (·) outputs 1 if its argument is c, and 0 otherwise, NNID (q) represents the label
of the nearest neighbor of q in set ID , Pr(ID ) represents the occurrence probability of set ID , and
D represents all the possible combinations of training instances in Train D. It is easy to see that
the probability Pr(NNTrain D (q) = c) is the summation of the occurrence probabilities of all the
outcomes ID of the training set Train D for which the nearest neighbor object of q in ID has class c.
Note that if the domain involved in the training dataset does not coincide with the real number field, we can transform the integral into summations and obtain Equation (16.13).
2. Efficient Strategies. To classify a test object, the computational cost will be very large if
we compute the probabilities using the above equations directly, so [4] introduces some efficient
strategies, such as the uncertain nearest neighbor (UNN) rule. Let D_c denote the subset of Train D composed of the objects having class label c, D_k(q, c) denote the distance between object q and class c (k is the number of nearest neighbors), and p_i(R) = Pr(d(q, x_i) ≤ R) denote the cumulative density function representing the relative likelihood that the distance between q and training set object x_i assumes a value less than or equal to R. Then we have

p_i(R) = Pr(d(q, x_i) ≤ R) = ∫_{B_R(q)} f^{x_i}(v) dv   (16.15)

where f^{x_i}(v) denotes the probability for x_i to assume value v, and B_R(q) denotes the set of values {v : d(v, q) ≤ R}.
Then, the cumulative density function associated with Dk (q, c) can be computed as Equation
(16.16), which is one minus the probability that no object of the class c lies within distance R from
q.
Pr(D_k(q, c) ≤ R) = 1 − ∏_{x_i ∈ D_c} (1 − p_i(R))   (16.16)
Let us consider the case of a binary classification problem for uncertain data. Suppose that there are two classes, c and c′, and a test object q. We define the nearest neighbor class of the object q as c if

Pr(D_k(q, c) ≤ D_k(q, c′)) ≥ 0.5   (16.17)

holds, and c′ otherwise. This means that class c is the nearest neighbor class of q if the probability that the nearest neighbor of q comes from class c is greater than the probability that it comes from class c′. In particular, this probability can be computed by means of the following one-dimensional integral:

Pr(D_k(q, c) ≤ D_k(q, c′)) = ∫_0^{+∞} Pr(D_k(q, c) = R) · Pr(D_k(q, c′) > R) dR.   (16.18)
The Uncertain Nearest Neighbor Rule (UNN): Given a training dataset, it assigns the label of its nearest neighbor class to the test object q. In the UNN rule, the distribution of the closest class tends to overshadow noisy objects, so it is more robust than traditional nearest neighbor classification. The UNN rule can be generalized to handle more complex cases, such as multiclass classification problems. Moreover, [4] proves a theorem stating that the uncertain nearest neighbor rule outputs the most probable class of the test object (for the proof, please refer to [4]). This indicates that the UNN rule is equivalent to assigning the most probable class.
In the test phase, [4] also proposes some efficient strategies. Suppose each uncertain object x is associated with a finite region SUP(x) containing the support of x, namely a region such that Pr(x ∉ SUP(x)) = 0 holds. If the region SUP(x) is a hypersphere, then SUP(x) can be identified by means of its center c(x) and its radius r(x), where c(x) is an instance and r(x) is a positive real number. Then, the minimum distance mindist(x_i, x_j) and the maximum distance maxdist(x_i, x_j) between x_i and x_j can be defined as follows:
mindist(x_i, x_j) = min{d(v, w) : v ∈ SUP(x_i) ∧ w ∈ SUP(x_j)} = max{0, d(c(x_i), c(x_j)) − r(x_i) − r(x_j)}.   (16.19)

maxdist(x_i, x_j) = max{d(v, w) : v ∈ SUP(x_i) ∧ w ∈ SUP(x_j)} = d(c(x_i), c(x_j)) + r(x_i) + r(x_j).   (16.20)
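Equations (16.19) and (16.20) reduce to simple center-and-radius arithmetic when the supports are hyperspheres, as the following small Python sketch shows; the example values are arbitrary.

import math

def mindist(c1, r1, c2, r2):
    """Minimum distance between two hyperspherical supports (Equation 16.19)."""
    return max(0.0, math.dist(c1, c2) - r1 - r2)

def maxdist(c1, r1, c2, r2):
    """Maximum distance between two hyperspherical supports (Equation 16.20)."""
    return math.dist(c1, c2) + r1 + r2

print(mindist([0, 0], 1.0, [5, 0], 1.5))   # 2.5
print(maxdist([0, 0], 1.0, [5, 0], 1.5))   # 7.5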
Let innerD_c(q, R) denote the set {x ∈ D_c : maxdist(q, x) ≤ R}, that is, the subset of D_c composed of all the objects x whose maximum distance from q is not greater than R. Let R_c^q denote the positive real number R_c^q = min{R ≥ 0 : |innerD_c(q, R)| ≥ k}, representing the smallest radius R for which there exist at least k objects of class c having maximum distance from q not greater than R. Moreover, we can define R_max^q and R_min^q as follows:

R_max^q = min{R_c^q, R_{c′}^q},   R_min^q = min_{x ∈ D} mindist(q, x).   (16.21)
To compute the probability in Equation (16.18) for binary classification, a specific finite domain can be considered instead of an infinite one, so that the probability can be computed efficiently as follows:

Pr(D_k(q, c) < D_k(q, c′)) = ∫_{R_min^q}^{R_max^q} Pr(D_k(q, c) = R) · Pr(D_k(q, c′) > R) dR.   (16.22)
In addition, some other efficient strategies for computing the above integral have also been proposed,
such as the Histogram technique (for details, please refer to [4]).
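Numerically, the integral of Equation (16.22) can be approximated by discretizing the distance range, which is the flavor of a histogram-style computation. The sketch below assumes that the cumulative functions Pr(D_k(q, ·) ≤ R) are available, e.g., from Equation (16.16); the uniform toy distance distributions at the end are purely illustrative, and this is not the exact procedure of [4].

def nn_class_probability(cdf_c, cdf_c2, r_min, r_max, bins=1000):
    """Approximate Pr(D_k(q,c) < D_k(q,c')) of Equation (16.22).

    cdf_c(R) and cdf_c2(R) are the cumulative functions Pr(D_k(q,.) <= R);
    the integral is discretized over [r_min, r_max].
    """
    prob = 0.0
    step = (r_max - r_min) / bins
    prev = cdf_c(r_min)
    for b in range(1, bins + 1):
        r = r_min + b * step
        cur = cdf_c(r)
        # mass of D_k(q,c) falling in (r - step, r], times Pr(D_k(q,c') > r)
        prob += (cur - prev) * (1.0 - cdf_c2(r))
        prev = cur
    return prob

# Toy CDFs: distances to class c spread over [0, 2], to class c' over [1, 3].
cdf_c  = lambda R: min(1.0, max(0.0, R / 2.0))
cdf_c2 = lambda R: min(1.0, max(0.0, (R - 1.0) / 2.0))
print(round(nn_class_probability(cdf_c, cdf_c2, 0.0, 3.0, bins=3000), 3))  # about 0.875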
3. Algorithm Steps. The overall steps of the uncertain nearest neighbor classification algorithm can be summarized as follows, for a given training dataset Train D with two classes c and c′ and an integer k > 0. To classify a certain test object q, we proceed as follows: (1) determine the value R_max^q = min{R_c^q, R_{c′}^q}; (2) determine the set D_q composed of the training set objects x_i such that mindist(q, x_i) ≤ R_max^q; (3) if D_q contains fewer than k objects of the class c (c′, resp.), then return c′ (c, resp.) with associated probability 1 and exit; (4) determine the value R_min^q = min_i(mindist(q, x_i)) by considering only the objects x_i belonging to D_q; (5) determine the nearest neighbor probability p of class c w.r.t. class c′. If p ≥ 0.5, return c with associated probability p; otherwise, return c′ with associated probability 1 − p.
Discussions: As the simple example in Figure 16.5 shows, the naive approach that follows the basic framework of traditional nearest neighbor classification, although easy to implement, cannot guarantee the quality of classification in the uncertain data case. The approach proposed in [4] extends nearest neighbor-based classification to uncertain data. The main contribution of this algorithm is to introduce a novel classification rule for the uncertain setting, i.e., the Uncertain Nearest Neighbor (UNN) rule. UNN relies on the concept of the nearest neighbor class, rather than that of the nearest neighbor object. For a given test object, UNN assigns the label of its nearest neighbor class to it. This probabilistic class assignment is much more robust and achieves better results than the straightforward approach of assigning to q the class of the most probable nearest neighbor of a test object q. To classify a test object efficiently, some novel strategies are proposed. Further experiments on real datasets demonstrate that the accuracy of nearest neighbor-based methods can outperform some other methods such as the density-based method [1] and decision trees [32], among others.
The integration over the unknown true input x̄_i might become quite difficult, but a more tractable approach can be used to obtain the maximum-likelihood estimate. Equation (16.25) is an approximation that is often used in practical applications:

max_{θ,θ̄} ∑_i ln sup_{x̄_i} [ p(x̄_i, c_i | θ) · p(x_i | θ̄, σ_i, x̄_i) ]   (16.25)

where x_i denotes the observed (noisy) input and x̄_i the corresponding unobserved true input.
For the prediction problem, there are two types of statistical models: generative models and discriminative (conditional) models. In [8] the discriminative model is used, assuming that p(x̄_i, c_i | θ) = p(x̄_i) p(c_i | θ, x̄_i). Now, let us consider regression problems with Gaussian noise as an example:

p(x̄_i, c_i | θ) ∼ p(x̄_i) exp( −(θ^T x̄_i − c_i)² / (2σ²) ),   p(x_i | θ̄, σ_i, x̄_i) ∼ exp( −‖x_i − x̄_i‖² / (2σ_i²) )   (16.26)
For binary classification, where c_i ∈ {±1}, we consider the logistic conditional probability model for c_i, while still assuming Gaussian noise in the input data. Then we obtain the joint probabilities and the estimate of θ:

p(x̄_i, c_i | θ) ∼ p(x̄_i) · 1/(1 + exp(−θ^T x̄_i c_i)),   p(x_i | θ̄, σ_i, x̄_i) ∼ exp( −‖x_i − x̄_i‖² / (2σ_i²) )

θ = arg min_θ ∑_i inf_{x̄_i} [ ln(1 + e^{−θ^T x̄_i c_i}) + ‖x_i − x̄_i‖² / (2σ_i²) ]   (16.28)
2. Support Vector Classification. Based on the above statistical model, [8] proposes a total support vector classification (TSVC) algorithm for classifying input data with uncertainty, which follows the basic framework of traditional SVM algorithms. It assumes that the data points are affected by additive noise, i.e., x̄_i = x_i + Δx_i, where the noise Δx_i satisfies a simple bounded uncertainty model ‖Δx_i‖ ≤ δ_i with uniform priors.
The traditional SVM classifier construction is based on computing a separating hyperplane
wT x + b = 0, where w = (w1 , w2 , · · · wn ) is a weight vector and b is a scalar, often referred to as
a bias. As in the traditional case, when the input data is uncertain, we also have to distinguish between linearly separable and non-linearly separable input data. Consequently, TSVC addresses the following two problems separately.
(1) Separable case: To search for the maximum marginal hyperplane (MMH) we have to re-
place the parameter θ in Equation (16.26) and (16.28) with a weight vector w and a bias b. Then,
the problem of searching for MMH can be formulated as follows.
min_{w,b,Δx_i, i=1,···,n}  (1/2)‖w‖²   (16.29)
subject to  c_i(w^T(x_i + Δx_i) + b) ≥ 1,  ‖Δx_i‖ ≤ δ_i,  i = 1, ···, n.
(2) Non-separable case: We replace the square loss in Equation (16.26) or the logistic loss in Equation (16.28) with the margin-based hinge loss ξ_i = max{0, 1 − c_i(w^T x_i + b)}, which is often used in the standard SVM algorithm, and we get

min_{w,b,ξ_i,Δx_i, i=1,···,n}  C ∑_{i=1}^{n} ξ_i + (1/2)‖w‖²   (16.30)
subject to  c_i(w^T(x_i + Δx_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  ‖Δx_i‖ ≤ δ_i,  i = 1, ···, n.
To solve the above problems, TSVC proposes an iterative approach based on an alternating
optimization method [7] and its steps are as follows.
Initialize Δx_i = 0, and repeat the following two steps until a termination criterion is met:
1. Fix Δx_i, i = 1, ···, n, to their current values, and solve the problem in Equation (16.30) for w, b, and ξ, where ξ = (ξ_1, ξ_2, ···, ξ_n).
2. Fix w and b to their current values, and solve the problem in Equation (16.30) for Δx_i, i = 1, ···, n, and ξ.
The first step is very similar to standard SVM, except for the adjustments to account for in-
put data uncertainty. Similar to how traditional SVMs are usually optimized, we can compute w
and b [34]. The second step is to compute the Δx_i. When only linear functions or kernel functions need to be considered, TSVC proposes more efficient algorithms to handle these cases, respectively (for details, please refer to [8]).
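A much-simplified sketch of this alternating scheme is given below. The sub-gradient trainer stands in for the exact SVM solver of step 1, and the update in step 2 moves each point by its full noise budget δ_i in the direction c_i·w/‖w‖, which is the natural minimizer of the hinge loss under the norm bound in the linear case; neither piece should be read as the exact procedure of [8].

import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.001, epochs=2000):
    """Plain sub-gradient training of a soft-margin linear SVM (a stand-in
    for the exact solver used in step 1 of the alternating scheme)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def tsvc_sketch(X, y, delta, C=1.0, iters=5):
    """Alternating-optimization sketch in the spirit of Equation (16.30).

    Step 1 refits the SVM on the corrected inputs X + dX.
    Step 2, with w and b fixed, pushes each point by delta_i in the direction
    c_i * w / ||w||, the hinge-loss minimizer under the norm bound (linear case).
    """
    dX = np.zeros_like(X)
    for _ in range(iters):
        w, b = fit_linear_svm(X + dX, y, C)        # step 1: fix dX, solve for w, b
        unit = w / (np.linalg.norm(w) + 1e-12)
        dX = delta[:, None] * y[:, None] * unit    # step 2: fix w, b, update dX
    return w, b

# Noisy separable toy data with per-point uncertainty radii.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
delta = np.full(40, 0.3)
w, b = tsvc_sketch(X, y, delta)
print(np.mean(np.sign(X @ w + b) == y))            # training accuracy of the sketch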
Discussions: The authors developed a new algorithm (TSVC) that extends traditional support vector classification to input data that is assumed to be corrupted by noise. They devise a statistical formulation in which the unobserved true data is modeled as a hidden mixture component, so the data uncertainty can be well incorporated in the model. TSVC attempts to recover the original classifier from the corrupted training data. The experimental results on synthetic and real datasets show that TSVC consistently achieves lower error rates than the standard SVC algorithm. This demonstrates that the proposed approach, which allows us to incorporate the uncertainty of the input data, obtains more accurate predictors than the standard SVM for problems where the input data is affected by noise.
The kernel density estimation, a non-parametric way of estimating the probability density func-
tion of a random variable, is often used to estimate the class conditional density. For the set of
2 The input format of naive Bayes classification is a little different from other algorithms. This is because we have to
estimate the prior and conditional probabilities of different classes separately.
instances with class label Ck , we can estimate the probability density function of their j-th dimen-
sion by using kernel density estimation as follows:
f̂_{h_k^j}(x^j | C_k) = (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} K( (x^j − x_{n_k}^j) / h_k^j )   (16.33)

where the function K is some kernel with bandwidth h_k^j. A common choice of K is the standard Gaussian function with zero mean and unit variance, i.e., K(x) = (1/√(2π)) e^{−x²/2}.
To classify x = (x^1, x^2, ..., x^d) using a naive Bayes model with Equation (16.32), we need to estimate the class conditional density P(x^j | C_k). We use f̂_{h_k^j}(x^j) as the estimate. From Equations (16.32) and (16.33), we get:

P(x | C_k) = ∏_{j=1}^{d} { (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} K( (x^j − x_{n_k}^j) / h_k^j ) }.   (16.34)

We can compute P(C_k | x) using Equations (16.31) and (16.34) and predict the label of x as y = arg max_{C_k ∈ C} P(C_k | x).
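For the certain-data case just described, a compact implementation of naive Bayes with kernel-estimated class-conditional densities looks as follows; the single shared bandwidth and the toy training data are assumptions made for the example.

import math

def kde(value, samples, h):
    """Gaussian kernel density estimate f_h(value) from 1-D samples (Equation 16.33)."""
    g = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(g((value - s) / h) for s in samples) / (len(samples) * h)

def nb_kde_classify(x, train, h=1.0):
    """Naive Bayes with kernel-estimated class-conditional densities
    (in the spirit of Equations 16.31, 16.32, 16.34). `train` maps class -> points."""
    n_total = sum(len(v) for v in train.values())
    best, best_score = None, -1.0
    for ck, points in train.items():
        prior = len(points) / n_total
        likelihood = 1.0
        for j in range(len(x)):
            dim_samples = [p[j] for p in points]
            likelihood *= kde(x[j], dim_samples, h)
        score = prior * likelihood            # proportional to P(C_k | x)
        if score > best_score:
            best, best_score = ck, score
    return best

train = {'a': [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
         'b': [[3.0, 3.2], [2.8, 3.1], [3.1, 2.9]]}
print(nb_kde_classify([0.1, 0.0], train))     # 'a'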
2. Uncertain Data Case. In this case, each dimension x_{n_k}^j (1 ≤ j ≤ d) of a tuple in the training dataset is uncertain, i.e., it is represented by a probability density function p_{n_k}^j. To classify an unlabeled uncertain data object x = (x^1, x^2, ..., x^d), where each attribute is modeled by a pdf p^j (j = 1, 2, ..., d), [29] proposes a Distribution-based naive Bayes classification for uncertain data. It follows the basic framework of the above algorithm and extends some parts of it to handle uncertain data.
The key step of the Distribution-based method is the estimation of class conditional density
on uncertain data. It learns the class conditional density from uncertain data objects represented by
probability distributions. They also use kernel density estimation to estimate the class conditional
density. However, we cannot estimate P(x^j | C_k) using f̂_{h_k^j}(x^j | C_k) without some redesign, because the value of x^j is modeled by a pdf p^j and the kernel function K is defined for scalar-valued arguments only. So, we need to extend Equation (16.33) to create a kernel density estimate for x^j. Since x^j is a probability distribution, it is natural to replace K in Equation (16.33) by its expected value, and we get:
f̂_{h_k^j}(x^j) = (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} E[ K( (x^j − x_{n_k}^j) / h_k^j ) ]
            = (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} ∫∫ K( (x^j − x_{n_k}^j) / h_k^j ) p^j(x^j) p_{n_k}^j(x_{n_k}^j) dx^j dx_{n_k}^j.   (16.35)
Using the above equation to estimate P(x^j | C_k) in Equation (16.32) gives:

P(x | C_k) = ∏_{j=1}^{d} { (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} ∫∫ K( (x^j − x_{n_k}^j) / h_k^j ) p^j(x^j) p_{n_k}^j(x_{n_k}^j) dx^j dx_{n_k}^j }.   (16.36)
The Distribution-based method presents two possible methods, i.e., formula-based method and
sample-based method, to compute the double integral in Equation (16.36).
(1). Formula-Based Method. In this method, we first derive the formula for kernel estimation
for uncertain data objects. With this formula, we can then compute the kernel density and run the
naive Bayes method to perform the classification. Suppose x and x_{n_k} are uncertain data objects with multivariate Gaussian distributions, i.e., x ∼ N(µ, Σ) and x_{n_k} ∼ N(µ_{n_k}, Σ_{n_k}). Here, µ = (µ^1, µ^2, ..., µ^d) and µ_{n_k} = (µ_{n_k}^1, µ_{n_k}^2, ..., µ_{n_k}^d) are the means of x and x_{n_k}, while Σ and Σ_{n_k} are their covariance matrices, respectively. Because of the independence assumption, Σ and Σ_{n_k} are diagonal matrices. Let σ^j and σ_{n_k}^j be the standard deviations of the j-th dimension of x and x_{n_k}, respectively. Then, x^j ∼ N(µ^j, σ^j · σ^j) and x_{n_k}^j ∼ N(µ_{n_k}^j, σ_{n_k}^j · σ_{n_k}^j). To classify x using the naive Bayes model, we compute all the class conditional densities P(x | C_k) based on Equation (16.36). Since x_{n_k}^j follows a Gaussian distribution, we have:

p_{n_k}^j(x_{n_k}^j) = (1/(σ_{n_k}^j √(2π))) exp( −(1/2) ( (x_{n_k}^j − µ_{n_k}^j) / σ_{n_k}^j )² )   (16.37)

and similarly for x^j (by omitting all the n_k subscripts in Equation (16.37)).
Based on formulae (16.36) and (16.37), we have:

P(x | C_k) = ∏_{j=1}^{d} { ∑_{n_k=1}^{N_k} exp( −(1/2) ( (µ^j − µ_{n_k}^j) / v_{k,n_k}^j )² ) / ( N_k · v_{k,n_k}^j · √(2π) ) }   (16.38)

where (v_{k,n_k}^j)² = h_k^j · h_k^j + σ^j · σ^j + σ_{n_k}^j · σ_{n_k}^j. It is easy to see that the time complexity of computing P(x | C_k) is O(N_k d), so the time complexity of the formula-based method is ∑_{k=1}^{K} O(N_k d) = O(nd).
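The closed form of Equation (16.38) is straightforward to evaluate. The sketch below computes the class-conditional density for one class from Gaussian-modeled test and training objects, using (v_{k,n_k}^j)² = (h_k^j)² + (σ^j)² + (σ_{n_k}^j)² as reconstructed above; the data at the end is illustrative only.

import math

def gaussian_uncertain_likelihood(mu_x, sigma_x, class_objects, h):
    """Class-conditional density P(x | C_k) of Equation (16.38) for one class.

    mu_x, sigma_x  : per-dimension mean and std of the (Gaussian) test object.
    class_objects  : list of (mu, sigma) pairs, one per training object of C_k,
                     each a per-dimension list.
    h              : per-dimension kernel bandwidths h_k^j.
    """
    d, n_k = len(mu_x), len(class_objects)
    result = 1.0
    for j in range(d):
        acc = 0.0
        for mu_n, sigma_n in class_objects:
            v2 = h[j] ** 2 + sigma_x[j] ** 2 + sigma_n[j] ** 2   # (v_{k,n_k}^j)^2
            acc += math.exp(-0.5 * (mu_x[j] - mu_n[j]) ** 2 / v2) / \
                   (n_k * math.sqrt(2.0 * math.pi * v2))
        result *= acc
    return result

# One uncertain test object against two uncertain training objects of a class.
train_c = [([0.0, 1.0], [0.2, 0.3]), ([0.5, 1.2], [0.1, 0.2])]
print(gaussian_uncertain_likelihood([0.2, 1.1], [0.2, 0.2], train_c, h=[0.5, 0.5]))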
(2). Sample-based method. In this method, every training and testing uncertain data object is
represented by sample points based on their own distributions. When using kernel density estimation
for a data object, every sample point contributes to the density estimation. The integral of density
can be transformed into the summation of the data points’ contribution with their probability as
weights. Equation (16.36) is thus replaced by:

P(x | C_k) = ∏_{j=1}^{d} { (1/(N_k h_k^j)) ∑_{n_k=1}^{N_k} ∑_{c=1}^{s} ∑_{d=1}^{s} K( (x_c^j − x_{n_k,d}^j) / h_k^j ) P(x_c^j) P(x_{n_k,d}^j) }   (16.39)
where x_c^j represents the c-th sample point of the uncertain test data object x along the j-th dimension, and x_{n_k,d}^j represents the d-th sample point of the uncertain training data object x_{n_k} along the j-th dimension. P(x_c^j) and P(x_{n_k,d}^j) are the probabilities of these sample points according to the distributions of x and x_{n_k}, respectively, and s is the number of samples used for each of x^j and x_{n_k}^j along the j-th dimension. After computing P(x | C_k) for x with each class C_k, we can compute the posterior probability P(C_k | x) based on Equation (16.31), and x can be assigned to the class with maximum P(C_k | x). It is easy to see that the time complexity of computing P(x | C_k) is O(N_k d s²), so the time complexity of the sample-based method is ∑_{k=1}^{K} O(N_k d s²) = O(n d s²).
Discussions: In this section, we have discussed a naive Bayes method, i.e., distribution-based
naive Bayes classification, for uncertain data classification. The critical step of naive Bayes clas-
sification is to estimate the class conditional density, and kernel density estimation is a common
way for that. The distribution-based method extends the kernel density estimation method to han-
dle uncertain data by replacing the kernel function with its expected value. Compared with AVG, the distribution-based method considers the whole distributions, rather than only the mean values of the instances, so more accurate classifiers can be learned, which means better accuracy can be achieved. Besides, it reduces the extended kernel density estimation to the evaluation of double integrals, and two methods, i.e., the formula-based method and the sample-based method, are proposed to solve it. Although the distribution-based naive Bayes classification may achieve better accuracy, it is slower, because it has to compute two summations over s sample points each, while AVG uses a single value in place of this double summation. Further experiments on real datasets in [29] show that the distribution-based method consistently achieves higher accuracy than AVG. Besides, the formula-based method generally gives higher accuracy than the sample-based method.
16.4 Conclusions
As one of the most essential tasks in data mining and machine learning, classification has been
studied for decades. However, the solutions developed for classification often assume that data
values are exact. This may not be adequate for emerging applications (e.g., LBS and biological
databases), where the uncertainty of their databases cannot be ignored. In this chapter, we survey
several recently-developed classification algorithms that consider uncertain data as a “first-class citi-
zen.” In particular, they consider the entire uncertainty information that is available, i.e., every single
possible value in the pdf or pmf of the attributes involved, in order to achieve a higher effectiveness
than AVG. They are also more efficient than PW, since they do not have to consider all the possible
worlds.
We believe that there are plenty of interesting issues to consider in this area, including:
• A systematic comparison among the algorithms that we have studied here is lacking. It would
be interesting to derive and compare their computational complexities. More experiments
should also be conducted on these solutions, so that we know which algorithm performs
better under which situations. Also, most of the datasets used in the experiments are synthetic
(e.g., the pdf is assumed to be uniform). It would be important to test these solutions on data
obtained from real applications or systems.
• Some classification algorithms apply to pmf only, while others (e.g., naive Bayes classifier)
can be used for pdf with particular distributions (e.g., Gaussian). It would be interesting to
examine a general and systematic method to modify a classification algorithm, so that it can
use both pdf and pmf models.
• Most of the classification algorithms consider the uncertainty of attributes, represented by pdf
or pmf. As a matter of fact, in the uncertain database literature, there are a few other well-
studied uncertainty models, including tuple uncertainty [19], logical model [30], c-tables [18],
and Bayesian models. These models differ in the complexity of representing uncertainty;
e.g., some models assume entities (e.g., pdf and pmf) are independent, while others (e.g.,
Bayesian) handle correlation among entities. It would be interesting to consider how classifi-
cation algorithms should be designed to handle these uncertainty models.
• All the algorithms studied here work on a single database snapshot. However, if the data changes (e.g., in LBS, an object is often moving, and the price of a stock in a stock-market system fluctuates), it would be important to develop efficient algorithms that can incrementally update the classification results for a database whose values keep changing.
• It would be interesting to examine how the uncertainty handling techniques developed for
classification can also be applied to other mining tasks, e.g., clustering, frequent pattern dis-
covery, and association rule mining.
Bibliography
[1] C. C. Aggarwal. On density based transforms for uncertain data mining. In ICDE Conference
Proceedings. IEEE, pages 866–875, 2007.
[2] C. C. Aggarwal, J. Han, J. Wang, and P Yu. A framework for clustering evolving data streams.
In VLDB Conference Proceedings, 2003.
[3] C. C. Aggarwal and Philip S. Yu. A survey of uncertain data algorithms and applications. In
IEEE Transactions on Knowledge and Data Engineering, 21(5):609–623, 2009.
[4] F. Angiulli and F. Fassetti. Nearest neighbor-based classification of uncertain data. In ACM
Transactions on Knowledge Discovery from Data, 7(1):1, 2013.
[7] J. Bezdek and R. Hathaway. Convergence of alternating optimization. In Neural, Parallel & Scientific Computations, 11:351–368, 2003.
[8] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In NIPS, 2004.
[9] J. Chen and R. Cheng. Efficient evaluation of imprecise location-dependent queries. In Inter-
national Conference on Data Engineering, pages 586–595, Istanbul, Turkey, April 2007.
[10] R. Cheng, J. Chen, and X. Xie. Cleaning uncertain data with quality guarantees. In Proceed-
ings of the VLDB Endowment, 1(1):722–735, August 2008.
[11] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise
data. In Proceedings of the ACM Special Interest Group on Management of Data, pages 551–
562, June 2003.
[12] W. W. Cohen. Fast effective rule induction. In Proceedings of 12th International Conference
on Machine Learning, pages 115–125, 1995.
[13] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases with noise. In Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231, 1996.
[14] J. Fürnkranz and G. Widmer. Incremental reduced error pruning. In Machine Learning:
Proceedings of 12th Annual Conference, 1994.
[15] Gary L. Freed and J. Kennard Fraley. 25% “error rate” in ear temperature sensing device.
Pediatrics, 87(3):414–415, March 2009.
[16] C. Gao and J. Wang. Direct mining of discriminative patterns for classifying uncertain data.
In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 861–870, 2010.
[17] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publish-
ers Inc., San Francisco, CA, 2005.
[18] T. Imielinski and W. Lipski Jr. Incomplete information in relational databases. In ICDE, 2008.
[19] B. Kanagal and A. Deshpande. Lineage processing over correlated probabilistic database. In
SIGMOD, 2010.
[20] D. Lin, E. Bertino, R. Cheng, and S. Prabhakar. Position transformation: A location privacy
protection method for moving objects. In Transactions on Data Privacy: Foundations and
Technologies (TDP), 2(1): 21–46, April 2009.
[21] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proceed-
ings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 80–86, 1998.
[22] W. Li, J. Han and J. Pei. CMAR: Accurate and efficient classification based on multiple class-
association rules. In Proceedings of the 2001 IEEE International Conference on Data Mining,
pages 369–376, 2001.
[23] B. Qin, Y. Xia, and F. Li. DTU: A decision tree for uncertain data. In Proceedings of the 13th
Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 4–15, 2009.
[24] B. Qin, Y. Xia, R. Sathyesh, J. Ge, and S. Prabhakar. Classifying uncertain data with decision
tree. In DASFAA, pages 454–457, 2011.
[25] B. Qin, Y. Xia, S. Prabhakar, and Y. Tu. A rule-based classification algorithm for uncertain data.
In The Workshop on Management and Mining of Uncertain Data in ICDE, pages 1633–1640,
2009.
[26] B. Qin, Y. Xia, and S. Prabhakar. Rule induction for uncertain data. In Knowledge and
Information Systems. 29(1):103–130, 2011.
[27] J. Ross Quinlan. Induction of decision trees. Machine Learning,1(1):81–106, 1986.
[28] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[29] J. Ren, S. D. Lee, X. Chen, B. Kao, R. Cheng, and D. Cheung. Naive Bayes classification of
uncertain data. In ICDM Conference Proceedings, pages 944–949, 2009.
[30] A. D. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In
ICDE, 2008.
[31] L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining uncertain data with probabilistic
guarantees. In the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
pages 273–282, Washington D.C., USA, July 2010.
[32] S. Tsang, B. Kao, K. Y. Yip, W. S. Ho, and S. D. Lee. Decision trees for uncertain data. In
ICDE, IEEE, pages 441–444, 2009.
[33] S. Tsang, B. Kao, K. Y. Yip, W. S. Ho, and S. D. Lee. Decision trees for uncertain data. In
IEEE Transactions on Knowledge and Data Engineering, 23(1):64–78, 2011.
[34] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., New York, 1988.
[35] J. Wang and G. Karypis. On mining instance-centric classification rules. In IEEE Transactions
on Knowledge and Data Engineering, 18(11):1497–1511, 2006.
[36] L. Wang, D. W. Cheung, R. Cheng, and S. D. Lee. Efficient mining of frequent itemsets on
large uncertain databases. In the IEEE Transactions on Knowledge and Data Engineering,
24(12):2170–2183, Dec 2012.
Chapter 17
Rare Class Learning
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
[email protected]
17.1 Introduction
The problem of rare class detection is closely related to outlier analysis [2]. In unsupervised
outlier analysis, no supervision is used for the anomaly detection process. In such scenarios, many
of the anomalies found correspond to noise, and may not be of any interest to an analyst. It has
been observed [35,42,61] in diverse applications such as system anomaly detection, financial fraud,
and Web robot detection that the nature of the anomalies is often highly specific to particular kinds
of abnormal activity in the underlying application. In such cases, unsupervised outlier detection
methods may often discover noise, which may not be specific to that activity, and therefore may
also not be of any interest to an analyst. The goal of supervised outlier detection and rare class
detection is to incorporate application-specific knowledge into the outlier analysis process, so as
to obtain more meaningful anomalies with the use of learning methods. Therefore, the rare class
detection problem may be considered the closest connection between the problems of classification
and outlier detection. In fact, while classification may be considered the supervised analogue of
the clustering problem, the rare class version of the classification problem may be considered the
supervised analogue of the outlier detection problem. This is not surprising, since the outliers may
be considered rare “unsupervised groups” for a clustering application.
In most real data domains, some examples of normal or abnormal data may be available. This
is referred to as training data, and can be used to create a classification model, which distinguishes
between normal and anomalous instances. Because of the rare nature of anomalies, such data is
often limited, and it is hard to create robust and generalized models on this basis. The problem
of classification has been widely studied in its own right, and numerous algorithms are available
in the literature [15] for creating supervised models from training data. In many cases, different
kinds of abnormal instances may be available, in which case the classification model may be able
to distinguish between them. For example, in an intrusion scenario, different kinds of intrusion
anomalies are possible, and it may be desirable to distinguish among them.
This problem may be considered a very difficult special case (or variation) of the classification
problem, depending upon the following possibilities, which may be present either in isolation or in
combination.
• Class Imbalance: The distribution between the normal and rare class will be very skewed.
From a practical perspective, this implies that the optimization of classification accuracy may
not be meaningful, especially since the misclassification of positive (outlier) instances is less
desirable than the misclassification of negative (normal) instances. In other words, false pos-
itives are more acceptable than false negatives. This leads to cost-sensitive variations of the
classification problem, in which the objective function for classification is changed.
• Contaminated Normal Class Examples (Positive-Unlabeled Class Problem): In many real
scenarios, the data may originally be present in unlabeled form, and manual labeling is per-
formed for annotation purposes. In such cases, only the positive class is labeled, and the
remaining “normal” data contains some abnormalities. This is natural in large scale applica-
tions such as the Web and social networks, in which the sheer volume of the underlying data
makes contamination of the normal class more likely. For example, consider a social network-
ing application, in which it is desirable to determine spam in the social network feed. A small
percentage of the documents may be spam. In such cases, it may be possible to recognize and
label some of the documents as spam, but many spam documents may remain in the examples
of the normal class. Therefore, the “normal” class may also be considered an unlabeled class.
In practice, however, the unlabeled class is predominantly the normal class, and the anomalies
in it may be treated as contaminants. The classification models need to be built to account for
this. Technically, this case can be considered a form of partial supervision [40], though it can
also be treated as a difficult special case of full supervision, in which the normal class is more
noisy and contaminated. Standard classifiers can be used on the positive-unlabeled version of
the classification problem, as long as the relative frequency of contaminants is not extreme. In
cases where the unlabeled class does not properly reflect the distribution in the test instances,
the use of such unlabeled classes can actually harm classification accuracy [37].
A different flavor of incomplete supervision refers to missing training data about an entire
class, rather than imperfect or noisy labels. This case is discussed below.
Both these methodologies will be discussed in this section. For the case of the cost-sensitive prob-
lem, it will also be discussed how classification techniques can be heuristically modified in order to
approximately reflect costs. A working knowledge of classification methods is assumed in order to
understand the material in this section. The reader is also referred to [15] for a description of the
different types of classifiers.
For the discussion in this section, it is assumed that the training data set is denoted by D, and the labels are denoted by L = {1, . . . , k}. Without loss of generality, it can be assumed that the normal class is indexed by 1. The ith record is denoted by X_i, and its label l_i is drawn from L. The number of records belonging to the ith class is denoted by N_i, and Σ_{i=1}^{k} N_i = N. The class imbalance assumption implies that N_1 ≫ N − N_1. While imbalances may exist between the other anomalous classes too, the major imbalance occurs between the normal and the anomalous classes.
of the rare class in the locality of the instance. A vanilla 20-nearest neighbor classifier will virtually
always1 classify this instance to a normal class in a large bootstrapped sample. This situation is
not specific to the nearest neighbor classifier, and is likely to occur in many classifiers, when the
class distribution is very skewed. For example, an unmodified Bayes classifier will usually assign a
lower probability to the rare class, because of its much lower a-priori probability, which is factored
into the classification. Consider a situation where a hypothetically perfect Bayes classifier has a
prior probability of 1% and a posterior probability of 30% for the correct classification of a rare
class instance. Such a classifier will typically assign far fewer than 30% of the votes to the rare
class in a bagged prediction, especially2 when large bootstrap samples are used. In such cases, the
normal class will win every time in the bagging because of the prior skew. This means that the
bagged classification probabilities can sometimes be close to 1 for the normal class in a skewed
class distribution.
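The effect can be illustrated with a small simulation (the numbers below are hypothetical and only mirror the 30% posterior example above): under majority-vote bagging, virtually no bootstrapped model votes for the rare class, whereas averaging the posterior estimates preserves the 30% figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bags = 1000          # number of bootstrapped models
posterior_rare = 0.30  # hypothetical true posterior of the rare class for this test instance

# With large bootstrap samples, each bagged model's posterior estimate for the
# rare class concentrates tightly around 0.30.
estimates = rng.normal(loc=posterior_rare, scale=0.02, size=n_bags)

# Majority-vote bagging: a model votes for the rare class only if its posterior
# exceeds 0.5, so virtually no model ever votes for the rare class.
vote_fraction = np.mean(estimates > 0.5)

# Averaging the posterior estimates instead preserves the 30% figure.
avg_posterior = estimates.mean()

print(f"bagged vote fraction for the rare class: {vote_fraction:.3f}")  # ~0.000
print(f"averaged posterior for the rare class:   {avg_posterior:.3f}")  # ~0.300
```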
This suggests that the effect of cost weighting can sometimes be overwhelmed by the erroneous
skews in the probability estimation attained by bagging. In this particular example, even with a cost
ratio of 100 : 1, the rare class instance will be wrongly relabeled to a normal class. This moves the
classification boundaries in the opposite direction of what is desired. In fact, in cases where the
unmodified classifier degrades to a trivial classifier of always classifying to the normal class, the
expected misclassification cost criterion of [14] will result in relabeling all rare class instances to
the normal class, rather than the intended goal of selective relabeling in the other direction. In other
words, relabeling may result in a further magnification of the errors arising from class skew. This
leads to degradation of classification accuracy, even from a cost-weighted perspective.
In the previous example, if the fraction of the 20-nearest neighbors belonging to a class are used
as its probability estimate for relabeling, then much more robust results can be obtained with Meta-
Cost. Therefore, the effectiveness of MetaCost depends on the quality of the probability estimate
used for re-labeling. Of course, if good probability estimates are directly available from the training
model in the first place, then a test instance may be directly predicted using the expected misclas-
sification cost, rather than using the indirect approach of trying to “correct” the training data by
re-labeling. This is the idea behind weighting methods, which will be discussed in the next section.
and all other terms within the Bayes estimation remain the same. Therefore, this is equivalent to
multiplying the Bayes probability in the unweighted case with the cost, and picking the largest one.
Note that this is the same criterion that is used in MetaCost, though the latter uses this criterion for
relabeling training instances, rather than predicting test instances. When good probability estimates
are available from the Bayes classifier, the test instance can be directly predicted in a cost-sensitive
way.
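A minimal sketch of this cost-weighted decision rule is shown below; the function name and the 1:100 cost ratio are illustrative, and the class probabilities are assumed to come from any classifier that produces reasonable probability estimates.

```python
import numpy as np

def cost_weighted_predict(class_probs, misclassification_costs):
    """Pick, for each row, the class with the largest cost-weighted probability.

    class_probs: array of shape (n_samples, n_classes) with estimates of P(class | x).
    misclassification_costs: array of shape (n_classes,), the cost of
        misclassifying an instance whose true class is that class.
    """
    weighted = class_probs * misclassification_costs  # broadcast costs over rows
    return np.argmax(weighted, axis=1)

# Class 0 = normal, class 1 = rare, with an illustrative 1:100 cost ratio.
probs = np.array([[0.95, 0.05],
                  [0.70, 0.30]])
costs = np.array([1.0, 100.0])
print(cost_weighted_predict(probs, costs))  # [1 1] -> both test instances flagged as rare
```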
[Figure: A two-dimensional scatter plot showing a small rare-class region and a much larger normal-class region.]
• The model construction phase for a smaller training data set requires much less time.
• The normal class is less important for modeling purposes, and most of the rare class is in-
cluded for modeling. Therefore, the discarded instances do not take away too much from the
modeling effectiveness.
In the case of SVM classifiers, it is possible to create a two-class distribution by using the origin
as one of the classes [59]. Typically, a kernel function is used in order to transform the data into a
new space in which the dot product corresponds to the value of the kernel function. In such a case,
an SVM classifier will naturally create a hyperplane that separates out the combination of features
which describe the one class in the data. However, the strategy of using the origin as the second class
in combination with a feature transformation is not necessarily generic and may not work well in all
data domains. This differential behavior across different data sets has already been observed in the
literature. In some cases, the performance of vanilla one-class SVM methods is quite poor, without
careful changes to the model [55]. Other one-class methods for SVM classification are discussed
in [31, 43, 55, 62].
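As a concrete illustration, scikit-learn provides a one-class SVM implementation in this spirit; the sketch below (with arbitrary synthetic data and a guessed value of nu) trains on normal-class examples only and flags points that fall outside the learned boundary.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # (mostly) normal-class examples

# nu upper-bounds the fraction of training points treated as outliers; the RBF
# kernel provides the feature transformation discussed in the text.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([[0.1, -0.2],   # close to the training mass
                   [6.0, 6.0]])   # far away from it
print(clf.predict(X_test))           # +1 = fits the normal class, -1 = anomaly
print(clf.decision_function(X_test)) # signed distance from the learned boundary
```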
case of costly instances, it is desirable to increase their weights more than those of less costly instances upon misclassification. On the other hand, in cases of correct classification, it is desirable to reduce the weights of more costly instances by a smaller amount. In either case, the adjustment is such that costly instances get relatively higher weight in later iterations. Therefore β−(c_i) is a non-decreasing function of the cost, whereas β+(c_i) is a non-increasing function of the cost. A different way to perform the adjustment would be to use the same exponential factor for weight updates as the original AdaBoost algorithm, but to further multiply this weight by the cost c_i [20], or by some other non-decreasing function of the cost. Such an approach would also provide higher weights to instances with larger costs. The use of boosting in weight updates has been shown to significantly improve the effectiveness of imbalanced classification algorithms.
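The following sketch illustrates the second scheme, in which the usual AdaBoost exponential factor is further multiplied by the instance cost; the function name, the toy labels, and the 10:1 cost vector are illustrative rather than a faithful reproduction of AdaCost [20].

```python
import numpy as np

def cost_sensitive_weight_update(weights, y_true, y_pred, alpha, costs):
    """One round of cost-sensitive boosting weight updates.

    The usual AdaBoost exponential factor is further multiplied by the cost of
    each instance, so costly instances gain relatively more weight whether they
    were misclassified (weight grows more) or classified correctly (weight
    shrinks less after renormalization)."""
    sign = np.where(y_true != y_pred, 1.0, -1.0)   # +1 if misclassified, -1 otherwise
    new_weights = weights * costs * np.exp(alpha * sign)
    return new_weights / new_weights.sum()         # renormalize to a distribution

# Toy example: instance 2 belongs to the rare class and is ten times as costly.
w = np.full(3, 1.0 / 3)
y_true = np.array([0, 0, 1])
y_pred = np.array([0, 1, 0])        # instances 1 and 2 are misclassified
costs = np.array([1.0, 1.0, 10.0])
print(cost_sensitive_weight_update(w, y_true, y_pred, alpha=0.5, costs=costs))
```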
Boosting methods can also be combined with synthetic oversampling techniques. An example of this is the SMOTEBoost algorithm, which combines SMOTE-style synthetic oversampling with a boosting approach. A number of interesting comparisons of boosting algorithms are presented in [27, 29]. In particular, an interesting observation in [29] is that the effectiveness of the boosting strategy depends upon the quality of the base learner that it works with. When the boosting algorithm starts off with a weaker base algorithm, the final (boosted) results are not as good as those derived by boosting a stronger algorithm.
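For reference, a minimal sketch of the SMOTE interpolation idea is given below (in SMOTEBoost, such synthetic minority samples would be generated in each boosting round); the function name and parameters are illustrative, and production use would typically rely on a library implementation.

```python
import numpy as np

def smote_oversample(X_minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between each chosen
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X_minority.shape
    # Pairwise distances within the minority class only.
    dist = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]        # k nearest minority neighbors

    synthetic = np.empty((n_synthetic, d))
    for s in range(n_synthetic):
        i = rng.integers(n)                            # pick a minority seed point
        j = neighbors[i, rng.integers(k)]              # and one of its neighbors
        gap = rng.random()                             # interpolation factor in [0, 1]
        synthetic[s] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

rare = np.random.default_rng(0).normal(size=(20, 3))
print(smote_oversample(rare, n_synthetic=40, k=5).shape)   # (40, 3)
```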
A number of methods have been proposed in the literature for this variant of the classification
problem, which can address the aforementioned issues.
While some methods in the literature treat this as a new problem, which is distinct from the
fully supervised classification problem [40], other methods [18] recognize this problem as a noisy
variant of the classification problem, to which traditional classifiers can be applied with some mod-
ifications. An interesting and fundamental result shown in [18] is that the class probability estimates of a classifier trained in this scenario differ only by a constant factor from the true conditional probabilities of being positive. The underlying assumption is that the labeled examples in the positive class are
picked randomly from the positive examples in the combination of the two classes. These results
provide strong support for the view that learning from positive and unlabeled examples is essentially
equivalent to learning from positive and negative examples.
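A simple sketch of this idea, assuming logistic regression as the probabilistic classifier and a held-out set of labeled positives for estimating the constant factor (the function name and parameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_adjusted_probabilities(X, s, X_test, holdout_frac=0.2, seed=0):
    """Positive-unlabeled probability estimation in the spirit of [18].

    s = 1 for labeled positive examples and s = 0 for unlabeled examples.
    A classifier trained to predict s yields p(s=1|x) = c * p(y=1|x); the
    constant c is estimated as the average score on held-out labeled positives,
    and dividing by it approximately recovers p(y=1|x)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(s == 1)
    holdout = rng.choice(pos_idx, size=max(1, int(holdout_frac * len(pos_idx))),
                         replace=False)
    train_mask = np.ones(len(s), dtype=bool)
    train_mask[holdout] = False

    clf = LogisticRegression(max_iter=1000).fit(X[train_mask], s[train_mask])
    c = clf.predict_proba(X[holdout])[:, 1].mean()     # estimate of p(s=1 | y=1)
    return np.clip(clf.predict_proba(X_test)[:, 1] / c, 0.0, 1.0)
```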
There are two broad classes of methods that can be used in order to address this problem. In the
first class of methods, heuristics are used in order to identify training examples that are negative.
Subsequently, a classifier is trained on the positive examples, together with the examples that have already been identified as negative. A less common approach is to assign weights to the unlabeled training examples [36, 40]. The first approach can be viewed as a special case of the second, in which each weight is chosen to be binary. It has been shown in the literature [41] that the second approach is superior. An SVM approach is used in order to learn the weights. The work in [71] uses the weight
vector in order to provide robust classification estimates.
of supervised and unsupervised techniques. It is also important to distinguish this problem from
one-class classification, in which instances of the positive class are available. In the one-class clas-
sification problem, it is desirable to determine other examples that are as similar as possible to the training data, whereas in the novel class problem, it is desirable to determine examples that are as different as possible from the training data.
In cases where only examples of the normal class are available, the only difference from the
unsupervised scenario is that the training data is guaranteed to be free of outliers. The specification
of normal portions of the data makes the determination of further outliers easier, because this data
can be used in order to construct a model of what the normal data looks like. Another distinction
between unsupervised outlier detection and one-class novelty detection is that novelties are often
defined in a temporal context, and eventually become a normal part of the data.
in [7] attempts to determine a linear or non-linear decision surface, which wraps around the surfaces
of the normal class. Points that lie outside this decision surface are anomalies. It is important to note
that this model essentially uses an indirect approach such as SVM to model the dense regions in the
data. Virtually all unsupervised outlier detection methods attempt to model the normal behavior of
the data, and can be used for novel class detection, especially when the only class in the training data
is the normal class. Therefore, the distinction between normal-class-only variations of the novel class detection problem and the unsupervised version of the problem is limited and artificial, especially
when other labeled anomalous classes do not form a part of the training data. Numerous analogues
of unsupervised methods have also been developed for novelty detection, such as extreme value
methods [56], direct density ratio estimation [23], and kernel-based PCA methods [24]. This is
not surprising, given that the two problems are different only at a rather superficial level. In spite
of this, the semi-supervised version of the (normal-class only) problem seems to have a distinct
literature of its own. This is somewhat unnecessary, since any of the unsupervised algorithms can
be applied to this case. The main difference is that the training and test data are distinguished from
one another, and the outlier score is computed for a test instance with respect to the training data.
Novelty detection can be better distinguished from the unsupervised case in temporal scenarios,
where novelties are defined continuously based on the past behavior of the data. This is discussed in
Chapter 9 on streaming classification in this book.
1. Is the test point a natural fit for a model of the training data? This model also includes the
currently occurring rare classes. A variety of unsupervised models such as clustering can be
used for this purpose. If not, it is immediately flagged as an outlier, or a novelty.
2. If the test point is a fit for the training data, then a classifier model is used to determine
whether it belongs to one of the rare classes. Any cost-sensitive model (or an ensemble of
them) can be used for this purpose.
Thus, this model requires a combination of unsupervised and supervised methods in order to deter-
mine the outliers in the data. This situation arises more commonly in online and streaming scenarios,
which will be discussed in the next section.
since a single snapshot of training data is assumed. Many applications such as intrusion detection
are naturally focussed on a streaming scenario. In such cases, novel classes may appear at any point
in the data stream, and it may be desirable to distinguish different kinds of novel classes from one
another [5, 46–48]. Furthermore, when new classes are discovered, these kinds of anomalies may
recur over time, albeit quite rarely. In such cases, the effectiveness of the model can be improved
by keeping a memory of the rarely recurring classes. This case is particularly challenging because
aside from the temporal aspects of modeling, it is desirable to perform the training and testing in an
online manner, in which only one pass is allowed over the incoming data stream. This scenario is a
true amalgamation of supervised and unsupervised methods for anomaly detection, and is discussed
in detail in Chapter 9 on streaming classification.
In the streaming scenario containing only unlabeled data, unsupervised clustering methods [3,4]
can be used in order to identify significant novelties in the stream. In these methods, novelties occur
as emerging clusters in the data, which eventually become a part of the normal clustering structure
of the data. Both the methods in [3, 4] have statistical tests to identify, when a newly incoming
instance in the stream should be considered a novelty. Thus, the output of these methods provides an
understanding of the natural complementary relationship between the clusters (normal unsupervised
models) and novelties (temporal abnormalities) in the underlying data.
[Figure: The active learning cycle: start with a random set of records, query experts for the labels of selected important records, build classification models, and apply the models to the data.]
uncertainty or ambiguity that should be presented to the user in order to gain the greatest knowledge about the decision boundaries between the different classes. It is expected that the selected examples should lie on the decision boundaries, in order to maximize the learning of the contours separating different classes with the least amount of expert supervision, which can be expensive in many scenarios.
A common approach to achieving this goal in active learning is the principle of query by com-
mittee [57]. In these methods, an ensemble of classifiers is learned, and the greatest disagreement
among them is used to select data points that lie on the decision boundary. A variety of such criteria
based on ensemble learning are discussed in [49]. It is also possible to use the model characteris-
tics directly in order to select such points. For example, two primary criteria that can be used for
selection are as follows [52]:
• Low Likelihood: These are data points that have low fit to the model describing the data. For
example, if an EM algorithm is used for modeling, then these are points that have low fit to
the underlying model.
• High Uncertainty: These are points that have the greatest uncertainty in terms of the compo-
nent of the model to which they belong. In other words, in an EM model, such a data point
would show relatively similar soft probabilities for different components of the mixture.
All data points are ranked on the basis of the two aforementioned criteria. The lists are merged by
alternating between them, and adding the next point in the list, which has not already been added to
the merged list. Details of other relevant methods such as interleaving are discussed in [52].
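A minimal sketch of this selection scheme, assuming a Gaussian mixture fitted by EM as the underlying model (the function name, component count, and query budget are illustrative):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.mixture import GaussianMixture

def select_query_points(X, n_components=3, budget=10, seed=0):
    """Rank unlabeled points by the two criteria above and merge the two ranked
    lists by alternating between them, skipping points already selected."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)

    low_likelihood = np.argsort(gmm.score_samples(X))      # poorest model fit first
    membership_entropy = entropy(gmm.predict_proba(X).T)   # entropy of soft memberships
    high_uncertainty = np.argsort(-membership_entropy)     # most ambiguous first

    selected, seen = [], set()
    for a, b in zip(low_likelihood, high_uncertainty):     # alternate between the lists
        for idx in (int(a), int(b)):
            if idx not in seen:
                selected.append(idx)
                seen.add(idx)
            if len(selected) == budget:
                return selected
    return selected
```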
ods need to be used for the detection process. In cases where examples of a single normal class are
available, the scenario becomes almost equivalent to the unsupervised version of the problem.
Supervised methods are closely related to active learning in which human experts may intervene
in order to add more knowledge to the outlier detection process. Such combinations of automated
filtering with human interaction can often provide particularly insightful results, because the human is involved in the entire process of label acquisition and final anomaly detection.
Bibliography
[1] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by active learning, ACM KDD Con-
ference, pages 504–509, 2006.
[2] C. C. Aggarwal, Outlier Analysis, Springer, 2013.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for clustering evolving data streams,
VLDB Conference, pages 81–92, 2003.
[4] C. C. Aggarwal and P. Yu. On Clustering massive text and categorical data streams, Knowledge
and Information Systems, 24(2):171–196, 2010.
[5] T. Al-Khateeb, M. Masud, L. Khan, C. Aggarwal, J. Han, and B. Thuraisingham. Recurring
and novel class detection using class-based ensemble, ICDM Conference, 2012.
[6] L. Breiman. Bagging predictors, Machine Learning, 24:123–140, 1996.
[7] C. Campbell and K. P. Bennett. A linear-programming approach to novel class detection, NIPS
Conference, 2000.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minor-
ity over-sampling technique, Journal of Artificial Intelligence Research (JAIR), 16:321–356,
2002.
[9] N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the
minority class in boosting, PKDD, pages 107–119, 2003.
[10] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: Special issue on learning from imbal-
anced data sets, ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.
[11] N. V. Chawla, D. A. Cieslak, L. O. Hall, and A. Joshi. Automatically countering imbalance
and its empirical relationship to cost. Data Mining and Knowledge Discovery, 17(2):225–252,
2008.
[12] P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distri-
butions: A case study in credit card fraud detection. KDD Conference, pages 164–168, 1998.
[13] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning, Machine
Learning, 15(2):201–221, 1994.
[14] P. Domingos. MetaCost: A general framework for making classifiers cost-sensitive, ACM KDD
Conference, pages 165–174, 1999.
[15] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley, 2001.
[16] C. Drummond and R. Holte. C4.5, class imbalance, and cost sensitivity: Why undersampling
beats oversampling. ICML Workshop on Learning from Imbalanced Data Sets, pages 1–8,
2003.
[17] C. Drummond and R. Holte. Explicitly representing expected cost: An alternative to ROC
representation. ACM KDD Conference, pages 198–207, 2001.
[18] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data, ACM KDD
Conference, pages 213–220, 2008.
[19] C. Elkan. The foundations of cost-sensitive learning, IJCAI, pages 973–978, 2001.
[20] W. Fan, S. Stolfo, J. Zhang, and P. Chan. AdaCost: Misclassification cost sensitive boosting,
ICML Conference, pages 97–105, 1999.
[21] T. Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers, Technical Re-
port HPL-2003-4, Palo Alto, CA: HP Laboratories, 2003.
[22] J. He and J. Carbonell. Nearest-Neighbor-Based Active Learning for Rare Category Detec-
tion. CMU Computer Science Department, Paper 281, 2007. http://repository.cmu.
edu/compsci/281
[23] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection
using direct density ratio estimation. Knowledge and Information Systems, 26(2):309–336,
2011.
[24] H. Hoffmann. Kernel PCA for novelty detection, Pattern Recognition, 40(3):863–874, 2007.
[25] G. Lanckriet, L. Ghaoui, and M. Jordan. Robust novelty detection with single class MPM,
NIPS, 2002.
[26] M. Joshi, R. Agarwal, and V. Kumar. Mining needles in a haystack: Classifying rare classes
via two-phase rule induction, ACM SIGMOD Conference, pages 91–102, 2001.
[27] M. Joshi, V. Kumar, and R. Agarwal. Evaluating boosting algorithms to classify rare classes:
Comparison and improvements. ICDM Conference, pages 257–264, 2001.
[28] M. Joshi and R. Agarwal. PNRule: A new framework for learning classifier models in data
mining (A case study in network intrusion detection), SDM Conference, 2001.
[29] M. Joshi, R. Agarwal, and V. Kumar. Predicting rare classes: Can boosting make any weak
learner strong? ACM KDD Conference, pages 297–306, 2002.
[30] M. Joshi. On evaluating performance of classifiers for rare classes, ICDM Conference, pages
641–644, 2003.
[31] P. Juszczak and R. P. W. Duin. Uncertainty sampling methods for one-class classifiers. ICML
Workshop on Learning from Imbalanced Data Sets, 6(1), 2003.
[32] G. Lanckriet, L. Ghaoui, and M. Jordan. Robust novelty detection with single class MPM,
NIPS, 2002.
[33] G. Karakoulas and J. Shawe-Taylor. Optimising classifiers for imbalanced training sets, NIPS,
pages 253–259, 1998.
[34] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One sided selec-
tion. ICML Conference, pages 179–186, 1997.
[35] T. Lane and C. Brodley. Temporal sequence learning and data reduction for anomaly detection,
ACM Transactions on Information and Security, 2(3):295–331, 1999.
[36] W. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic
regression. ICML Conference, 2003.
[37] X. Li, B. Liu, and S. Ng. Negative training data can be harmful to text classification, EMNLP,
pages 218–228, 2010.
[38] L. Liu, and X. Fern. Constructing training sets for outlier detection, SDM Conference, 2012.
[39] X.-Y. Liu, J. Wu, and Z.-H. Zhou. Exploratory undersampling for class-imbalance learning.
IEEE Transactions on Systems, Man and Cybernetics – Part B, Cybernetics, 39(2):539–550,
April 2009.
[40] B. Liu, W. S. Lee, P. Yu, and X. Li. Partially supervised classification of text, ICML Confer-
ence, pages 387–394, 2002.
[41] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. Yu. Building text classifiers using positive and unlabeled
examples. ICDM Conference, pages 179–186, 2003.
[42] X. Liu, P. Zhang, and D. Zeng. Sequence matching for suspicious activity detection in anti-
money laundering. Lecture Notes in Computer Science, 5075:50–61, 2008.
[43] L. M. Manevitz and M. Yousef. One-class SVMs for document classification, Journal of Ma-
chine Learning Research, 2:139–154, 2001.
[44] M. Markou and S. Singh. Novelty detection: A review, Part 1: Statistical approaches, Signal
Processing, 83(12):2481–2497, 2003.
[45] M. Markou and S. Singh. Novelty detection: A review, Part 2: Neural network-based ap-
proaches, Signal Processing, 83(12):2499–2521, 2003.
[46] M. Masud, Q. Chen, L. Khan, C. Aggarwal, J. Gao, J. Han, and B. Thuraisingham. Addressing
concept-evolution in concept-drifting data streams. ICDM Conference, 2010.
[47] M. Masud, T. Al-Khateeb, L. Khan, C. Aggarwal, J. Gao, J. Han, and B. Thuraisingham.
Detecting recurring and novel classes in concept-drifting data streams. ICDM Conference,
2011.
[48] M. Masud, Q. Chen, L. Khan, C. Aggarwal, J. Gao, J. Han, A. Srivastava, and N. Oza. Classifi-
cation and adaptive novel class detection of feature-evolving data streams, IEEE Transactions
on Knowledge and Data Engineering, 25(7):1484–1497, 2013.
[49] P. Melville and R. Mooney. Diverse ensembles for active learning, ICML Conference, pages
584–591, 2004.
[50] D. Mladenic and M. Grobelnik. Feature selection for unbalanced class distribution and naive
bayes. ICML Conference, pages 258-267, 1999.
[51] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection
using the local correlation integral. ICDE Conference, pages 315-326, 2003.
[52] D. Pelleg and A. Moore. Active learning for anomaly and rare category detection, NIPS Con-
ference, 2004.
[53] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison
under imprecise class and cost distributions, ACM KDD Conference, pages 43–48, 1997.
[54] F. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing
induction algorithms, ICML Conference, pages 445-453, 1998.
[55] B. Raskutti and A. Kowalczyk. Extreme rebalancing for SVMs: A case study. SIGKDD Ex-
plorations, 6(1):60–69, 2004.
[56] S. Roberts. Novelty detection using extreme value statistics, IEEE Proceedings on Vision,
Image and Signal Processing, 146(3):124–129, 1999.
[57] H. Seung, M. Opper, and H. Sompolinsky. Query by committee. ACM Workshop on Compu-
tational Learning Theory, pages 287–294, 1992.
[58] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions.
Annual Conference on Computational Learning Theory, 37(3):297–336, 1998.
[62] D. Tax. One Class Classification: Concept-learning in the Absence of Counter-examples, Doc-
toral Dissertation, University of Delft, Netherlands, 2001.
[63] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser. SVMs Modeling for highly imbalanced
classification, IEEE Transactions on Systems, Man and Cybernetics — Part B: Cybernetics,
39(1):281–288, 2009.
[64] K. M. Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions
on Knowledge and Data Engineering, 14(3):659–665, 2002.
[65] G. Weiss and F. Provost. Learning when training data are costly: The effect of class distribution
on tree induction, Journal of Artificial Intelligence Research, 19(1):315–354, 2003.
[66] G. Wu and E. Y. Chang. Class-boundary alignment for imbalanced dataset learning. Proceed-
ings of the ICML Workshop on Learning from Imbalanced Data Sets, pages 49–56, 2003.
[67] R. Yan, Y. Liu, R. Jin, and A. Hauptmann. On predicting rare classes with SVM ensembles in
scene classification. IEEE International Conference on Acoustics, Speech and Signal Process-
ing, 3:21–24, 2003.
[68] H. Yu, J. Han, and K. C.-C. Chang. PEBL: Web page classification without negative examples.
IEEE Transactions on Knowledge and Data Engineering, 16(1):70–81, 2004.
[70] B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are
unknown, KDD Conference, pages 204–213, 2001.
[71] D. Zhang and W. S. Lee. A simple probabilistic approach to learning from positive and unla-
beled examples. Annual UK Workshop on Computational Intelligence, pages 83–87, 2005.
[72] J. Zhang and I. Mani. KNN Approach to unbalanced data distributions: A case study involving
information extraction. Proceedings of the ICML Workshop on Learning from Imbalanced
Datasets, 2003.
[73] Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data.
SIGKDD Explorations, 6(1):80–89, 2004.
[74] C. Zhu, H. Kitagawa, S. Papadimitriou, and C. Faloutsos. OBE: Outlier by example, PAKDD
Conference, 36(2):217–247, 2004.
[75] C. Zhu, H. Kitagawa, and C. Faloutsos. Example-based robust outlier detection in high dimen-
sional datasets, ICDM Conference, 2005.
[76] http://www.itl.nist.gov/iad/mig/tests/tdt/tasks/fsd.html
Chapter 18
Distance Metric Learning for Data Classification
Fei Wang
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
18.1 Introduction
Distance metric learning is a fundamental problem in data mining and knowledge discovery, and
it is of key importance for many real world applications. For example, information retrieval utilizes
the learned distance metric to measure the relevance between the candidate data and the query;
clinical decision support uses the learned distance metric to measure pairwise patient similarity [19,
23, 24]; pattern recognition can use the learned distance metric to match most similar patterns. In
a broader sense, distance metric learning lies at the heart of many data classification problems.
As long as a proper distance metric is learned, we can always adopt k-Nearest Neighbor (kNN)
classifier [4] to classify the data. In recent years, many studies have demonstrated [12, 27, 29],
either theoretically or empirically, that learning a good distance metric can greatly improve the
performance of data classification tasks.
TABLE 18.1: The Meanings of Various Symbols That Will be Used Throughout This Chapter
Symbol    Meaning
n         number of data points
d         data dimensionality
x_i       the i-th data vector
X         data matrix
M         precision matrix of the generalized Mahalanobis distance
w_i       the i-th projection vector
W         projection matrix
N_i       the neighborhood of x_i
φ(·)      nonlinear mapping used in kernel methods
K         kernel matrix
L         Laplacian matrix
In this survey, we will give an overview of the existing supervised distance metric learning
approaches and point out their strengths and limitations, as well as present challenges and future re-
search directions. We focus on supervised algorithms because they are under the same setting as data
classification. We will categorize those algorithms from the aspect of linear/nonlinear, local/global,
transductive/inductive, and also the computational technology involved.
In the rest of this chapter, we will first introduce the definition of distance metric in Section 18.2.
Then we will overview existing supervised metric learning algorithms in Section 18.3, followed by
discussions and conclusions in Section 18.5. Table 18.1 summarizes the notations and symbols that
will be used throughout the paper.
where S is the inverse of the data covariance matrix (also referred to as the precision matrix).3
Most of the recent distance metric learning algorithms can be viewed as learning a generalized
Mahalanobis distance defined as below:
Generalized Mahalanobis Distance (GMD). A GMD measures the distance between data vectors x and y by
D(x, y) = (x − y)^T M (x − y)   (18.3)
where M is some arbitrary Symmetric Positive Semi-Definite (SPSD) matrix.
The major goal of learning a GMD is to learn a proper M. As M is SPSD, we can decompose M as M = UΛU^T with eigenvalue decomposition, where U is a matrix collecting all eigenvectors of M, and Λ is a diagonal matrix with all eigenvalues of M on its diagonal line. Let W = UΛ^{1/2}; then we have
D(x, y) = (x − y)^T WW^T (x − y) = (W^T(x − y))^T (W^T(x − y)) = (x̃ − ỹ)^T (x̃ − ỹ)   (18.4)
where x̃ = W^T x and ỹ = W^T y.
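The equivalence in Equation (18.4) can be checked numerically with a short sketch (the matrix M below is an arbitrary SPSD example):

```python
import numpy as np

def generalized_mahalanobis(x, y, M):
    """Squared generalized Mahalanobis distance (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

def embedding_from_metric(M):
    """Recover W with M = W W^T via eigen-decomposition (W = U Lambda^{1/2})."""
    eigvals, U = np.linalg.eigh(M)            # M is symmetric PSD
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    return U @ np.diag(np.sqrt(eigvals))

M = np.array([[2.0, 0.5],
              [0.5, 1.0]])                    # an arbitrary SPSD matrix
x, y = np.array([1.0, 2.0]), np.array([0.0, 0.0])
W = embedding_from_metric(M)
d_metric = generalized_mahalanobis(x, y, M)
d_euclid = float(np.sum((W.T @ x - W.T @ y) ** 2))
print(np.isclose(d_metric, d_euclid))         # True: the two distances coincide
```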
By expanding the numerator and denominator of the above expression, we can observe that the
numerator corresponds to the sum of distances between each data point to its class center after
projection, and the denominator represents the sum of distances between every class center to the
entire data center after projection. Therefore, minimizing the objective will maximize the between-
class scatterness while minimizing the within-class scatterness after projection. Solving problem (18.7) directly is hard; some researchers [8, 10] have studied this problem. LDA is a linear and global method. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j.
Kernelization of LDA: Similar to the case of PCA, we can extend LDA to the nonlinear case via
the kernel trick, which is called Kernel Discriminant Analysis (KDA) [14]. After mapping the data into the feature space using φ, we can compute the compactness and scatterness matrices as
Σ_C^φ = (1/C) Σ_c (1/n_c) Σ_{x_i ∈ c} (φ(x_i) − φ̄_c)(φ(x_i) − φ̄_c)^T   (18.8)
Σ_S^φ = (1/C) Σ_c (φ̄_c − φ̄)(φ̄_c − φ̄)^T.   (18.9)
Suppose the projection matrix we want to get is W^φ in the feature space; then, by the representer theorem,
W^φ = Φα   (18.10)
where Φ = [φ(x_1), φ(x_2), · · · , φ(x_n)] and α is the coefficient vector over all φ(x_i) for 1 ≤ i ≤ n. We define K = Φ^T Φ as the kernel matrix. Then
(W^φ)^T Σ_C^φ W^φ = α^T [ (1/C) Σ_{c=1}^{C} (1/n_c) Σ_{x_i ∈ c} Φ^T (φ(x_i) − φ̄_c)(φ(x_i) − φ̄_c)^T Φ ] α
                  = α^T [ (1/C) Σ_{c=1}^{C} (1/n_c) Σ_{x_i ∈ c} (K_{·i} − K̄_{·c})(K_{·i} − K̄_{·c})^T ] α
                  = α^T M_C α   (18.11)
w^T x − b = 1 or w^T x − b = −1.   (18.15)
The distance between the two parallel hyperplanes is 2/‖w‖. Then, if the data from the two classes are clearly separated, the goal of SVM is to solve the following optimization problem to find the hyperplane that maximizes the margin between the two classes:
min_{w,b} (1/2)‖w‖^2   (18.16)
s.t. l_i(w^T x_i − b) ≥ 1 (∀i = 1, 2, · · · , n).
However, in reality the two classes may not be perfectly separable, i.e., there might be some overlapping between them. Then we need the soft margin SVM, which aims at solving
min_{w,b,ξ} (1/2)‖w‖^2 + C Σ_{i=1}^{n} ξ_i   (18.17)
s.t. l_i(w^T x_i − b) ≥ 1 − ξ_i (∀i = 1, 2, · · · , n)
where {ξ_i ≥ 0} are slack variables used to penalize the margin on the overlapping region.
MMDA aims to solve for more than one projection direction, via the following optimization problem:
min_{W,b,ξ^r≥0} (1/2) Σ_{r=1}^{d} ‖w_r‖^2 + (C/n) Σ_{r=1}^{d} Σ_{i=1}^{n} ξ_i^r   (18.18)
s.t. l_i((w_r)^T x_i + b) ≥ 1 − ξ_i^r, ∀i = 1, . . . , n, r = 1, . . . , d
W^T W = I.
Therefore MMDA is a global and linear approach. One can also apply the kernel trick to make it nonlinear; the details can be found in [11]. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j.
This is a quadratic optimization problem and [28] proposed an iterative projected gradient ascent
method to solve it. As M is positive semi-definite, we can always factorize it as M = WW^T. Thus
LSI is a global and linear approach. The learned distance formulation is exactly the general Maha-
lanobis distance with precision matrix M.
not correlated with the specific task [18]. We also define small clusters called chunklets, which are
connected components derived by all the must-links. The specific steps involved in RCA include:
• Construct chunklets according to equivalence (must-link) constraints, such that the data points in each chunklet are pairwise connected by must-link constraints.
• Assume a total of p points in k chunklets, where chunklet j consists of points {x_{ji}}_{i=1}^{n_j} and its mean is m̄_j. RCA computes the following weighted within-chunklet covariance matrix:
C = (1/p) Σ_{j=1}^{k} Σ_{i=1}^{n_j} (x_{ji} − m̄_j)(x_{ji} − m̄_j)^T.   (18.20)
• Compute the whitening transformation W = C^{−1/2}, and apply it to the original data points: x̃ = Wx. Alternatively, use the inverse of C as the precision matrix of a generalized Mahalanobis distance.
Therefore, RCA is a global, linear approach.
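A minimal sketch of the RCA steps above (the chunklets below are synthetic and the function name is illustrative):

```python
import numpy as np

def rca_whitening(chunklets):
    """Relevant Component Analysis on a list of chunklets (arrays of points that
    must share a label).  Returns the whitening transform W = C^{-1/2} of the
    weighted within-chunklet covariance matrix C."""
    p = sum(len(ch) for ch in chunklets)
    d = chunklets[0].shape[1]
    C = np.zeros((d, d))
    for ch in chunklets:
        centered = ch - ch.mean(axis=0)            # subtract the chunklet mean
        C += centered.T @ centered                 # accumulate (x - m)(x - m)^T
    C /= p
    vals, vecs = np.linalg.eigh(C)                 # C is symmetric positive definite
    return vecs @ np.diag(vals ** -0.5) @ vecs.T   # C^{-1/2}

rng = np.random.default_rng(1)
chunklets = [rng.normal(size=(10, 3)) + shift for shift in (0.0, 2.0, -1.0)]
W = rca_whitening(chunklets)
x_tilde = chunklets[0] @ W.T                       # apply x_tilde = W x to each row
print(x_tilde.shape)                               # (10, 3)
```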
where L_i = { j | l_i = l_j } is the set of points in the same class as point i.
The objective NCA maximizes is the expected number of points correctly classified under this scheme:
J(W) = Σ_i p_i = Σ_i Σ_{j ∈ L_i} p_{ij}   (18.25)
[7] proposed a truncated gradient descent approach to optimize J(W). NCA is a local and linear approach. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j.
where | · | represents the cardinality of a set. This margin measures the difference between the aver-
age distance from xi to the data points in its heterogeneous neighborhood and the average distance
from it to the data points in its homogeneous neighborhood. The maximization of such a margin can
push the data points whose labels are different from xi away from xi while pulling the data points
having the same class label with xi towards xi .
Therefore, the total average neighborhood margin can be defined as
γ = Σ_i γ_i = Σ_i ( Σ_{k: x_k ∈ N_i^e} ‖y_i − y_k‖^2 / |N_i^e| − Σ_{j: x_j ∈ N_i^o} ‖y_i − y_j‖^2 / |N_i^o| )   (18.26)
and the ANMM criterion is to maximize γ. By replacing y_i = W^T x_i, [25] obtains the optimal W by performing eigenvalue decomposition of some discriminability matrix. Thus ANMM is a local and linear approach. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j. The authors in [25] also proposed a kernelized version of ANMM to handle nonlinear data, called KANMM; thus KANMM is a local and nonlinear approach.
tensors are considered to be similar if the Frobenius norm of their difference tensor is small.
deploys a different margin formulation. Specifically, LMNN defines the pull energy term as
ε_pull = Σ_{x_j ∈ N_i^o} ‖W^T(x_i − x_j)‖^2   (18.27)
which is the sum of pairwise distances between a data point x_i and the data points in x_i's homogeneous neighborhood after projection. LMNN defines the push energy as
ε_push = Σ_i Σ_{x_j ∈ N_i^o} Σ_l (1 − δ_il) [1 + ‖W^T(x_i − x_j)‖^2 − ‖W^T(x_i − x_l)‖^2]_+   (18.28)
where δ_il = 1 if the labels of x_i and x_l are the same, and δ_il = 0 otherwise. The intuition is to require that the data points from different classes should be separated from it by a distance of at least 1. This formulation is very similar to the margin formulation in multiclass SVM [2]; LMNN also pushes the data with different labels to at least distance 1 from the homogeneous neighborhood. The goal of LMNN is to minimize
ε = µ ε_pull + (1 − µ) ε_push.   (18.29)
The authors in [26] proposed a semi-definite programming technique to solve for M = WW^T. Thus LMNN is a local and linear approach. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j.
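The pull and push energies in Equations (18.27)-(18.29) can be evaluated directly for a fixed projection W; the sketch below (with an illustrative toy data set and target-neighbor lists) is meant only to make the objective concrete, not to reproduce the semi-definite programming solver of [26].

```python
import numpy as np

def lmnn_energy(W, X, labels, target_neighbors, mu=0.5):
    """Evaluate the LMNN objective for a fixed projection W.

    target_neighbors[i] lists indices of x_i's homogeneous (same-label)
    neighbors; the push term penalizes differently labeled points that come
    within margin 1 of those target neighbors."""
    Z = X @ W                                       # rows of Z are W^T x_i
    pull, push = 0.0, 0.0
    for i, targets in enumerate(target_neighbors):
        for j in targets:
            d_ij = np.sum((Z[i] - Z[j]) ** 2)
            pull += d_ij
            for l in range(len(X)):
                if labels[l] != labels[i]:          # the (1 - delta_il) factor
                    d_il = np.sum((Z[i] - Z[l]) ** 2)
                    push += max(0.0, 1.0 + d_ij - d_il)   # hinge [.]_+
    return mu * pull + (1.0 - mu) * push

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
labels = np.array([0, 0, 1, 1])
neighbors = [[1], [0], [3], [2]]
print(lmnn_energy(np.eye(2), X, labels, neighbors))
```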
where M = WW^T. The supervised terms consist of a compactness term and a scatterness term defined from the pairwise constraints, while the unsupervised term t_4 = tr(W^T X Σ X^T W) is a PCA-like term. Note that before we apply CMM, all data points need to be centered, i.e., their mean should be subtracted from the data matrix. The intuition of CMM is to maximally unfold the data points in the projection space while at the same time satisfying those pairwise constraints. The authors in [22] showed that the optimal W can be obtained by eigenvalue decomposition of some matrix. Therefore CMM is a global and linear approach. Wang et al. [22] also showed how to derive its kernelized version for handling nonlinear data. The learned distance between x_i and x_j is the Euclidean distance between W^T x_i and W^T x_j.
iteratively. At each iteration, only one or a small batch of data are involved. Another scenario where
the online learning strategy can be naturally applied is to learn distance metrics for streaming data,
where the data are coming in a streaming fashion so that the distance metric needs to be updated
iteratively. Next we present two examples of online distance metric learning approaches.
POLA operates in an iterative way: First, POLA initializes M as a zero matrix; then, at each step, it randomly picks one data pair from the constraint sets (either M or C), and does projections onto C_{ij}^1 and C^2 alternately. By projecting M and b onto C_{ij}^1, POLA gets the updating rules for M and b as
M̂ = M − l_ij α_ij u_ij u_ij^T   (18.41)
b̂ = b + α_ij l_ij   (18.42)
where
u_ij = x_i − x_j   (18.43)
α_ij = J_ij(M, b) / (‖u_ij‖_2^4 + 1).   (18.44)
By projecting onto C^2, POLA then obtains
M̂ = M − λµµ^T   (18.45)
where λ = min{λ̃, 0} and (λ̃, µ) are the smallest eigenvalue-eigenvector pair of M̂. Therefore, POLA incorporates the data in the constraint sets in a sequential manner.
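A sketch of one POLA iteration is given below; the hinge form of the pairwise loss J_ij is an assumption of this sketch (its exact definition is not shown in this excerpt), and the function name is illustrative.

```python
import numpy as np

def pola_step(M, b, x_i, x_j, l_ij):
    """One POLA iteration: project onto the pair constraint C^1_ij
    (Equations (18.41)-(18.44)) and then onto the PSD cone C^2 (Equation (18.45)).

    The hinge form of the pairwise loss J_ij used here is an assumption of
    this sketch."""
    u = x_i - x_j
    d = u @ M @ u                                     # current metric distance
    J = max(0.0, l_ij * (d - b) + 1.0)                # assumed hinge loss of the constraint
    alpha = J / (np.dot(u, u) ** 2 + 1.0)             # ||u||_2^4 + 1 in the denominator
    M = M - l_ij * alpha * np.outer(u, u)             # Equation (18.41)
    b = b + alpha * l_ij                              # Equation (18.42)

    # Projection onto the PSD cone: remove the negative eigen-component.
    vals, vecs = np.linalg.eigh(M)
    lam, mu_vec = vals[0], vecs[:, 0]                 # smallest eigenvalue-eigenvector pair
    M = M - min(lam, 0.0) * np.outer(mu_vec, mu_vec)  # Equation (18.45)
    return M, b
```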
M_{t+1} = argmin_M  R(M, M_t) + η_t ℓ(D_t, D̂_t)   (18.46)
where D̂_t = (x_i − x_j)^T M (x_i − x_j) and R(M, M_t) = d_logdet(M, M_t) is the LogDet divergence. [3] showed that M_{t+1} can be updated in closed form (Equation (18.47)), where
η_t = η_0, if D̂_t − D_t ≤ 0;
η_t = min{ η_0, 1 / ( 2(D̂_t − D_t) (x_i − x_j)^T (I + (M_{t−1} − I)^{−1}) (x_i − x_j) ) }, otherwise.   (18.48)
POLA and OITML are only two examples of online distance metric learning.
• Large scale distance metric learning. Most of the existing distance metric learning approaches
involve computationally expensive procedures. How can we make distance metric learning
efficient and practical on large-scale data? Promising solutions include online learning or
distributed learning. We have introduced the most recent works on online distance metric
learning in Section 4.1. For parallel/distributed distance metric algorithms, as the major com-
putational techniques involved are eigenvalue decomposition and quadratic programming, we
can adopt parallel matrix computation/optimization algorithms [1,15] to make distance metric
learning more scalable and efficient.
• Empirical evaluations. Although a lot of distance metric learning algorithms have been pro-
posed, there is still a lack of systematic comparison and proof points on the utility of many
distance metric learning algorithms in real world applications. Such empirical discussion will
be helpful to showcase the practical value of distance metric learning algorithms. Some re-
cent works have started developing and applying distance metric learning on healthcare for
measuring similarity among patients [19, 23, 24].
• General formulation. As can be seen from this survey, most of the existing distance metric
learning algorithms suppose the learned distance metric is Euclidean in the projected space.
Such an assumption may not be sufficient for real world applications as there is no guarantee
that Euclidean distance is the most appropriate way to describe the pairwise data relationships. There are already some initial efforts in this direction [5, 13], and it is definitely worth exploring.
Bibliography
[1] Yair Censor. Parallel Optimization: Theory, algorithms, and applications. Oxford University
Press, 1997.
[2] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-
based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[3] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-
theoretic metric learning. In Proceedings of International Conference on Machine Learning,
pages 209–216, 2007.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition).
Wiley-Interscience, 2nd edition, 2001.
[5] Charles Elkan. Bilinear models of affinity. Personal note, 2011.
[6] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition, Second Edition (Com-
puter Science and Scientific Computing Series). Academic Press, 1990.
[7] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighborhood com-
ponent analysis. In Advances in Neural Information Processing Systems, 17:513–520, 2004.
[8] Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu. A generalized Foley-Sammon transform based
on generalized Fisher discriminant criterion and its application to face recognition. Pattern
Recognition Letters, 24(1-3):147–158, 2003.
[9] Steven C. H. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning
for collaborative image retrieval. In Proceedings of IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 1–7, 2008.
[10] Yangqing Jia, Feiping Nie, and Changshui Zhang. Trace ratio problem revisited. IEEE Trans-
actions on Neural Networks, 20(4):729–735, 2009.
[11] András Kocsor, Kornél Kovács, and Csaba Szepesvári. Margin maximizing discriminant anal-
ysis. In Proceedings of European Conference on Machine Learning, volume 3201 of Lecture
Notes in Computer Science, pages 227–238. Springer, 2004.
[12] Brian Kulis. Metric learning. Tutorial at International Conference on Machine Learning,
2010.
[13] Zhen Li, Liangliang Cao, Shiyu Chang, John R. Smith, and Thomas S. Huang. Beyond ma-
halanobis distance: Learning second-order discriminant function for people verification. In
Prcoeedings of Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE
Computer Society Conference on, pages 45–50, 2012.
[14] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis
with kernels. Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE
Signal Processing Society Workshop, pages 41–48, 1999.
[15] Jagdish J. Modi. Parallel algorithms and matrix computation. Oxford University Press, 1989.
[16] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis,
The Hebrew University of Jerusalem, July 2007.
[17] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-
metrics. In Proceedings of International Conference on Machine Learning, pages 94–101,
2004.
[18] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component
analysis. In Proceedings of European Conference on Computer Vision, pages 776–790, 2002.
[19] Jimeng Sun, Daby Sow, Jianying Hu, and Shahram Ebadollahi. Localized supervised metric
learning on temporal physiological data. In ICPR, pages 4149–4152, 2010.
[20] Ivor W. Tsang, Pak-Ming Cheung, and James T. Kwok. Kernel relevant component analysis for
distance metric learning. In IEEE International Joint Conference on Neural Networks (IJCNN),
pages 954–959, 2005.
[21] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer New York Inc., New
York, NY, 1995.
[22] Fei Wang, Shouchun Chen, Changshui Zhang, and Tao Li. Semi-supervised metric learning
by maximizing constraint margin, CIKM Conference, pages 1457–1458, 2008.
[23] Fei Wang, Jimeng Sun, and Shahram Ebadollahi. Integrating distance metrics learned from
multiple experts and its application in patient similarity assessment. In SDM, 2011.
[24] Fei Wang, Jimeng Sun, Jianying Hu, and Shahram Ebadollahi. iMet: Interactive metric learning
in healthcare applications. In SDM, 2011.
[25] Fei Wang and Changshui Zhang. Feature extraction by maximizing the average neighbor-
hood margin. In Proceedings of IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, 2007.
[26] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large
margin nearest neighbor classification. In Advances in Neural Information Processing Systems,
2005.
[27] Michael Werman, Ofir Pele, and Brian Kulis. Distance functions and metric learning. Tutorial
at European Conference on Computer Vision, 2010.
[28] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learn-
ing, with application to clustering with side-information. In Advances in Neural Information
Processing Systems 15, 15:505–512, 2002.
[29] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report,
Department of Computer Science and Engineering, Michigan State University, 2006.
Chapter 19
Ensemble Learning
Yaliang Li
State University of New York at Buffalo
Buffalo, NY
[email protected]
Jing Gao
State University of New York at Buffalo
Buffalo, NY
[email protected]
Qi Li
State University of New York at Buffalo
Buffalo, NY
[email protected]
Wei Fan
Huawei Noah’s Ark Lab
Hong Kong
[email protected]
19.1 Introduction
People have known the benefits of combining information from multiple sources for a long
time. From the old story of “blind men and an elephant,” we can learn the importance of cap-
turing a big picture instead of focusing on one single perspective. In the story, a group of blind men
touch different parts of an elephant to learn what it is like. Each of them makes a judgment about
the elephant based on his own observation. For example, the man who touched the ear said it was
like a fan while the man who grabbed the tail said it was like a rope. Clearly, each of them just got a
partial view and did not arrive at an accurate description of the elephant. However, as they captured
different aspects about the elephant, we can learn the big picture by integrating the knowledge about
all the parts together.
Ensemble learning can be regarded as applying this “crowd wisdom” to the task of classification.
Classification, or supervised learning, tries to infer a function that maps feature values into class la-
bels from training data, and apply the function to data with unknown class labels. The function is
called a model or classifier. We can regard a collection of some possible models as a hypothesis
space H , and each single point h ∈ H in this space corresponds to a specific model. A classifi-
cation algorithm usually makes certain assumptions about the hypothesis space and thus defines a
hypothesis space to search for the correct model. The algorithm also defines a certain criterion to
measure the quality of the model so that the model that has the best measure in the hypothesis space
will be returned. A variety of classification algorithms have been developed [8], including Support
Vector Machines [20,22], logistic regression [47], Naive Bayes, decision trees [15,64,65], k-nearest
neighbor algorithm [21], and Neural Networks [67, 88]. They differ in the hypothesis space, model
quality criteria and search strategies.
In general, the goal of classification is to find the model that achieves good performance when
predicting the labels of future unseen data. To improve the generalization ability of a classification
model, it should not overfit the training data; instead, it should be general enough to cover unseen
cases. Ensemble approaches can be regarded as a family of classification algorithms, which are
developed to improve the generalization abilities of classifiers. It is hard to get a single classification
model with good generalization ability, which is called a strong classifier, but ensemble learning
can transform a set of weak classifiers into a strong one by their combination. Formally, we learn
T classifiers from a training set D: {h1 (x), . . . , hT (x)}, each of which maps feature values x into a
class label y. We then combine them into an ensemble classifier H(x) with the hope that it achieves
better performance.
There are two major factors that contribute to the success of ensemble learning. First, theoretical
analysis and real practice have shown that the expected error of an ensemble model is smaller
than that of a single model. Intuitively, if we know in advance that h1 (x) has the best prediction
performance on future data, then without any doubt, we should discard the other classifiers and
choose h1 (x) for future predictions. However, we do not know the true labels of future data, and
thus we are unable to know in advance which classifier performs the best. Therefore, our best bet
should be the prediction obtained by combining multiple models. This simulates what we do in real
life—When making investments, we will seldom rely on one single option, but rather distribute the
money across multiple stocks, plans, and accounts. This will usually lead to lower risk and higher
gain. In many other scenarios, we will combine independent and diverse opinions to make better
decisions.
A simple example can be used to further illustrate how the ensemble model achieves better
performance [74]. Consider five completely independent classifiers and suppose each classifier has a
prediction accuracy of 70% on future data. If we build an ensemble classifier by combining these five
classifiers using majority voting, i.e., predict the label as the one receiving the highest votes among
classifiers, then we can compute the probability of making a correct classification as

10 × 0.7^3 × 0.3^2 + 5 × 0.7^4 × 0.3 + 0.7^5 = 83.7%

(the sum of the probabilities of having 3, 4, and 5 classifiers predicting correctly). If we now have 101 such classifiers, following the same principle, we can derive that majority voting over the 101 classifiers reaches an accuracy of 99.9%. Clearly, the ensemble approach can successfully cancel out independent errors made by individual classifiers and thus improve the classification accuracy.
FIGURE 19.1: Two linear models (Model 1 and Model 2) learned from the training set, compared with the true boundary on the test set.
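As a quick check of this calculation, the majority-voting accuracy of T independent base classifiers that are each correct with probability p can be computed directly from the binomial distribution. The short Python sketch below (the function name is ours, not from the chapter) reproduces the 83.7% and 99.9% figures:

# Majority-voting accuracy of T independent classifiers, each correct with probability p.
from math import comb

def majority_vote_accuracy(T: int, p: float) -> float:
    """Probability that more than half of T independent classifiers are correct."""
    k_min = T // 2 + 1  # smallest number of correct votes that wins a majority
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(k_min, T + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.99999, consistent with the 99.9% figure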
Another advantage of ensemble learning is its ability to overcome the limitation of hypothesis
space assumption made by single models. As discussed, a single-model classifier usually searches
for the best model within a hypothesis space that is assumed by the specific learning algorithm.
It is very likely that the true model does not reside in the hypothesis space, and then it is impos-
sible to obtain the true model by the learning algorithm. Figure 19.1 shows a simple example of
binary classification. The rightmost plot shows that the true decision boundary is V-shaped, but if
we search for a classifier within a hypothesis space consisting of all the linear classifiers, for exam-
ple, models 1 and 2, we are unable to recover the true boundary. However, by combining multiple
classifiers, ensemble approaches can successfully simulate the true boundary. The reason is that en-
semble learning methods combine different hypotheses and the final hypothesis is not necessarily
contained in the original hypothesis space. Therefore, ensemble methods have more flexibility in the hypotheses they can represent.
Due to these advantages, many ensemble approaches [4, 23, 52, 63, 74, 75, 90] have been devel-
oped to combine the complementary predictive power of multiple models. Ensemble learning has been demonstrated to be useful in many data mining competitions (e.g., the Netflix contest,1 the KDD Cup,2 the ICDM con-
test3 ) and real-world applications. There are two critical components in ensemble learning: Training
base models and learning their combinations, which are discussed as follows.
From the majority voting example we mentioned, it is obvious that the base classifiers should be
accurate and independent to obtain a good ensemble. In general, we do not require the base models
to be highly accurate: as long as we have a sufficient number of base classifiers, the weak classifiers can be boosted to a strong classifier by combination. However, in the extreme case where each classifier is terribly wrong, the combination of these classifiers will give even worse results. Therefore, at the very least, the base classifiers should be better than random guessing. Independence among classifiers is another important property we want in the collection of base classifiers. If the base classifiers are highly correlated and make very similar predictions, their combination will not bring further improvement.
In contrast, when base classifiers are independent and make diverse predictions, the independent
errors have better chances to be canceled out. Typically, the following techniques have been used to
generate a good set of base classifiers:
• Obtain different bootstrap samples from the training set and train a classifier on each bootstrap
sample;
1 http://www.netflixprize.com/
2 http://www.kddcup-orange.com/
3 http://www.cs.uu.nl/groups/ADA/icdm08cup/
• Extract different subsets of examples or subsets of features and train a classifier on each
subset;
• Apply different learning algorithms on the training set;
• Incorporate randomness into the process of a particular learning algorithm or use different
parametrization to obtain different prediction results.
To further improve the accuracy and diversity of base classifiers, researchers have explored various ways to prune and select base classifiers. More discussion of this topic can be found in Section 19.6.
Once the base classifiers are obtained, the important question is how to combine them. The
combination strategies used by ensemble learning methods roughly fall into two categories: Un-
weighted and weighted. Majority voting is a typical un-weighted combination strategy, in which
we count the number of votes for each predicted label among the base classifiers and choose the one with the most votes as the final predicted label. This approach treats each base classifier as equally accurate and thus does not differentiate among them in the combination. On the other hand, weighted combination usually assigns a weight to each classifier, with the hope that higher weights are given to more accurate classifiers so that the final prediction is biased towards the more accurate
classifiers. The weights can be inferred from the performance of base classifiers or the combined
classifier on the training set.
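For illustration, the two combination strategies can be sketched in a few lines of Python; the label values and weights below are made up for the example and are not from the chapter:

# Un-weighted majority voting vs. weighted voting over base classifier outputs.
from collections import Counter

def majority_vote(predictions):
    """Un-weighted combination: pick the label with the most votes."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted combination: sum the weights of classifiers voting for each label."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

preds = [+1, -1, +1, +1, -1]          # outputs of five base classifiers on one example
weights = [0.9, 0.4, 0.8, 0.7, 0.3]   # e.g., derived from training performance
print(majority_vote(preds))            # +1
print(weighted_vote(preds, weights))   # +1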
In this book chapter, we will provide an overview of the classical ensemble learning methods, discussing their base classifier generation and combination strategies. We will start with the Bayes optimal classifier, which considers all the possible hypotheses in the whole hypothesis space and combines
them. As it cannot be practically implemented, two approximation methods, i.e., Bayesian model
averaging and Bayesian model combination, have been developed (Section 19.2). In Section 19.3,
we discuss the general idea of bagging, which combines classifiers trained on bootstrap samples
using majority voting. Random forest is then introduced as a variant of bagging. In Section 19.4,
we discuss the boosting method, which gradually adjusts the weights of training examples so that
weak classifiers can be boosted to learn accurately on difficult examples. AdaBoost will be intro-
duced as a representative of the boosting method. Stacking is a successful technique to combine
multiple classifiers, and its usage in top performers of the Netflix competition has attracted much
attention to this approach. We discuss its basic idea in Section 19.5. After introducing these classical
approaches, we will give a brief overview of recent advances in ensemble learning, including new
ensemble learning techniques, ensemble pruning and selection, and ensemble learning in various
challenging learning scenarios. Finally, we conclude the book chapter by discussing possible future
directions in ensemble learning. The notations used throughout the book chapter are summarized in
Table 19.1.
The Bayes optimal classifier combines all hypotheses in the hypothesis space, weighted by their posterior probabilities, to obtain an ensemble classifier that assigns a label to unseen data x in the following way:

y = arg max_{y∈Y} ∑_{h∈H} p(y|x, h) p(h|D).    (19.1)
In this equation, y is the predicted label of x, Y is the set of all possible labels, H is the hypothesis
space that contains all possible hypotheses h, and p(·) denotes the relevant probability functions. We can see that
the Bayes optimal classifier combines all possible base classifiers, i.e., the hypotheses, by summing
up their weighted votes. The posterior probability p(h|D ) is adopted to be their corresponding
weights.
We assume that training examples are drawn independently. By Bayes’ theorem, the posterior probability p(h|D) is given by:

p(h|D) = p(D|h) p(h) / p(D).    (19.2)
In this equation, p(h) is the prior probability reflecting the degree of our belief that h is the “correct” model prior to seeing any data, and p(D|h) = ∏_{i=1}^{m} p(xi|h) is the likelihood, i.e., how likely the given training set is to be generated under the model h. The data prior p(D) is the same for all hypotheses, so it can be ignored when making label predictions.
To sum up, the Bayes optimal classifier makes a prediction for x as follows:

y = arg max_{y∈Y} ∑_{h∈H} p(y|x, h) p(D|h) p(h).    (19.3)
By this approach, all the models are combined by considering both prior knowledge and data like-
lihood. In other words, the Bayes optimal classifier combines all the hypotheses in the hypothesis
space, and each hypothesis is given a weight reflecting the probability that the training data would
be sampled under the hypothesis if that hypothesis were true. In [61], it is pointed out that the Bayes optimal classifier achieves optimal classification results, and that, on average, no other ensemble method can outperform it.
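To make Equation (19.3) concrete, the following toy Python sketch enumerates a small, finite hypothesis space and combines the hypotheses by their prior-weighted likelihoods; all priors, likelihoods, and threshold rules here are invented for illustration only:

# Toy illustration of Equation (19.3) for a small, finite hypothesis space.
def bayes_optimal_predict(x, hypotheses, labels=(+1, -1)):
    # hypotheses: list of dicts with keys "prior", "likelihood", "predict",
    # where predict(x) returns a dict {label: p(y|x,h)}.
    scores = {y: 0.0 for y in labels}
    for h in hypotheses:
        vote = h["predict"](x)
        for y in labels:
            scores[y] += vote.get(y, 0.0) * h["likelihood"] * h["prior"]
    return max(scores, key=scores.get)

hypotheses = [
    {"prior": 0.5, "likelihood": 0.02, "predict": lambda x: {+1: 1.0} if x > 3 else {-1: 1.0}},
    {"prior": 0.5, "likelihood": 0.08, "predict": lambda x: {+1: 1.0} if x > 6 else {-1: 1.0}},
]
print(bayes_optimal_predict(5, hypotheses))  # the higher-likelihood hypothesis dominates: -1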
Although Bayes optimal classifier is the ideal ensemble method, unfortunately, it cannot be
practically implemented. The prior knowledge about each model p(h) is usually unknown. For
most tasks, the hypothesis space is too large to iterate over and many hypotheses only output a pre-
dicted label rather than a probability. Even though in some classification algorithms, it is possible
to estimate the probability from training data, computing an unbiased estimate of the probability of
the training data given a hypothesis, i.e., p(D |h), is difficult. The reason is that when we calculate
p(xi |h), we usually have some assumptions that could be biased. Therefore, in practice, some al-
gorithms have been developed to approximate Bayes optimal classifier. Next, we will discuss two
popular algorithms, i.e., Bayesian model averaging and Bayesian model combination, which effec-
tively implement the basic principles of Bayes optimal classifier.
The Bayesian model averaging algorithm, which repeatedly draws a random subset of the training data of size m′, learns a hypothesis on it, and weights the hypothesis by its likelihood, is shown in Algorithm 19.1 [1]. In [42], it was proved
that when the representative hypotheses are drawn and averaged using the Bayes theorem, Bayesian
model averaging has an expected error that is bounded to be at most twice the expected error of the
Bayes optimal classifier.
Despite its theoretical correctness, Bayesian model averaging may encounter over-fitting prob-
lems. Bayesian model averaging prefers the hypothesis that by chance has the lowest error on train-
ing data rather than the hypothesis that truly has the lowest error [26]. In fact, Bayesian model
averaging conducts a selection of classifiers instead of combining them. As the uniform distribution
is used to set the prior probability, the weight of each base classifier is equivalent to its likelihood
on training data. Typically, none of these base classifiers can fully characterize the training data, so
most of them receive a small value for the likelihood term. This may not be a problem if base clas-
sifiers are exhaustively sampled from the hypothesis space, but this is rarely possible when we have
limited training examples. Due to the normalization, the base classifier that captures data distribu-
tion the best will receive the largest weight and has the highest impact on the ensemble classifier.
In most cases, this classifier will dominate the output and thus Bayesian model averaging conducts
model selection in this sense.
Example. We demonstrate Bayesian model averaging step by step using the following example.
Suppose we have a training set shown in Table 19.2, which contains 10 examples with 1-dimensional
feature values and corresponding class labels (+1 or −1).
We use a set of simple classifiers that only use a threshold to make class predictions: If x is
above a threshold, it is classified into one class; if it is on or below the threshold, it is classified
into the other class. By learning from the training data, the classifier will decide the optimal θ (the
decision boundary) and the class labels to be assigned to the two sides so that the error rate will be
minimized.
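For concreteness, a minimal Python sketch of such a threshold classifier (a decision stump) is given below. The helper name learn_stump and the toy labels are our own illustration, not taken from Table 19.2, which is not reproduced here; the stump scans candidate thresholds and both label orientations and keeps the pair with the lowest (optionally weighted) error:

# A 1-D threshold ("decision stump") base classifier for labels in {+1, -1}.
import numpy as np

def learn_stump(x, y, w=None):
    x, y = np.asarray(x, float), np.asarray(y, int)
    w = np.ones(len(x)) / len(x) if w is None else np.asarray(w, float)
    xs = np.sort(np.unique(x))
    # candidate thresholds halfway between sorted feature values (plus the two ends)
    thresholds = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    best = None
    for theta in thresholds:
        for sign in (+1, -1):                         # which side of theta gets +1
            pred = np.where(x > theta, sign, -sign)
            err = np.sum(w * (pred != y))             # (weighted) training error
            if best is None or err < best[0]:
                best = (err, theta, sign)
    err, theta, sign = best
    def predict(z):
        return np.where(np.asarray(z, float) > theta, sign, -sign)
    return predict, err

# Hypothetical 1-D data in the spirit of Table 19.2 (not the actual table values).
x = np.arange(1, 11)
y = np.array([+1, +1, +1, -1, -1, -1, -1, +1, +1, +1])
stump, err = learn_stump(x, y)
print(err, stump(x))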
We run the algorithm for five rounds using m′ = 5. Altogether we construct five training sets from the original training set, each containing five randomly selected examples. We first learn a base classifier ht on each sampled training set, and then calculate its training error rate ε(ht), the likelihood p(Dt|ht), and its weight(ht) using the formulas described in Algorithm 19.1. Table
19.3 shows the training data for each round and the predicted labels on the selected examples made
by the base classifiers.
Based on the model output shown in Table 19.3, an ensemble classifier can be constructed by
following Bayesian model averaging and its prediction results are shown in Table 19.4.
The highlighted points are incorrectly classified. Thus, the ensemble classifier has an error rate
of 0.3. It is in fact the same classifier we obtained from Rounds 2 and 5. The weights of these two
base classifiers are so high that the influence of the others is ignored in the final combination. Some
techniques have been proposed to solve this dominating weight problem, for example, the Bayesian
model combination method that will be discussed next.
It is clear that Bayesian model combination incurs more computation compared with Bayesian
model averaging. Instead of sampling each hypothesis individually, Bayesian model combination
samples from the space of possible ensemble hypotheses. It needs to compute an ensemble classifier
and update several sets of parameters during one iteration. Weight_final shows the normalized Weight3 that we will use to calculate the final classifier. It is indeed a weighted combination of all sets of TempWeight. Since the last TempWeight set gives the best results with the highest likelihood, the final weight Weight_final is closer to it. The prediction results made by the ensemble classifier of Bayesian model combination are shown in Table 19.8. This classifier classifies all the points correctly. Compared with Bayesian model averaging, Bayesian model combination gives better results.
19.3 Bagging
In this section, we introduce a very popular ensemble approach, i.e., Bagging, and then discuss
Random Forest, which adapts the idea of Bagging to build an ensemble of decision trees.
In Bagging, we sample data from D with replacement to form a new data set D′. The size of D′ will be kept
the same as that of D. Some of the examples appear more than once in D′, while some examples in D do not appear in D′ at all. For a particular example xi, the probability that it appears k times in D′ follows a Poisson distribution with λ = 1 [11]. By setting k = 0 and λ = 1, we can see that xi does not appear in D′ with probability 1/e, so xi appears in D′ with probability 1 − 1/e ≈ 0.632. D′ is thus expected to contain about 63.2% of the unique examples of D, while the rest are duplicates. After sampling T data sets using bootstrap sampling, we train a classifier on each of the sampled data sets and combine their outputs by majority voting. For each example xi, its final prediction by Bagging is the class label that receives the highest number of votes from the base classifiers.
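A minimal sketch of this procedure is given below. It assumes scikit-learn is available and uses shallow decision trees as the (unstable) base learner; the toy data are illustrative and only in the spirit of Table 19.2:

# Bagging: T bootstrap samples, one base classifier per sample, majority voting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=5, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample, |D'| = |D|
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, +1, -1)           # majority vote over {+1, -1} labels

X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([+1, +1, +1, -1, -1, -1, -1, +1, +1, +1])
models = bagging_fit(X, y, T=5)
print(np.mean(bagging_predict(models, X) != y))   # training error of the bagged ensemble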
We need to be careful in selecting learning algorithms to train base classifiers in Bagging. As D′ has roughly 63.2% overlap with the original data set D, if the learning algorithm is insensitive to changes in the training data, all the base classifiers will output similar predictions, and the combination of these base classifiers cannot improve the performance of the ensemble. To ensure high diversity of base classifiers, Bagging prefers unstable learning algorithms, such as Neural Networks and Decision Trees, over stable learning algorithms such as k-Nearest Neighbor.
Typically, Bagging adopts majority voting to combine base classifiers. However, if the base
classifiers output prediction confidence, weighted voting or other combination methods are also
possible. If the size of the data set is relatively small, it is not easy to obtain base classifiers with high diversity, because the diversity of base classifiers mainly comes from manipulating the data samples. In such cases, we could consider introducing more randomness, for example by using different learning algorithms for the base models.
Example. We demonstrate Bagging step by step using the toy example shown in Table 19.2. In this
experiment, we set T = 5 and |D′| = |D| = 10. Therefore, we construct five training sets and each one
of them contains ten random examples. Since we draw samples with replacement, it is possible to
have some examples repeating in the training set. We then learn base classifiers from these training
sets. Table 19.9 shows the training data for each round and the prediction result of the base classifier
applied on the corresponding training set.
Based on the base model output shown in Table 19.9, we can calculate the final result by majority
voting. For example, for x = 1, there are three classifiers that predict its label to be +1 and two
classifiers that predict its label as -1, so its final predicted label is +1. The label predictions made by
Bagging on this example dataset are shown in Table 19.10.
We can see that only one point (x = 4) is incorrectly classified (highlighted in the table). The
ensemble classifier has an error rate of only 0.1, so Bagging achieves better performance than the
base models.
• Obtain a bootstrap sample D′ from D by sampling with replacement.
• Build a decision tree on D′ by applying the LearnDecisionTree function with the following parameters: LearnDecisionTree(data = D′, iteration = 0, ParentNode = root).
LearnDecisionTree is a recursive function that takes the dataset, iteration step, and parent
node index as input and returns a partition of the current dataset.
Specifically, LearnDecisionTree function conducts the following steps at each node:
• Check whether the stopping criterion is satisfied. Usually we can adopt one or more of the following criteria: a) all the examples at the current node have the same class
label; b) the impurity of current node is below a certain threshold (impurity can be represented
by entropy, Gini index, misclassification rate or other measures; more details will be discussed
later); c) there are no more features available to split the data at the current node to improve
the impurity; or d) the height of the tree is greater than a pre-defined number. If the stopping
criterion is satisfied, then we stop growing the tree, otherwise, continue as follows.
• Randomly sample a subset of n′ features from the whole feature space R^n, so that each example becomes {x′i, yi}, where x′i ∈ R^{n′} and n′ ≤ n. We denote the dataset in the subspace as D̂_current.
• Find the best feature q∗ to split the current dataset, i.e., the one that achieves the biggest gain in impurity. Specifically, suppose i(node) denotes the impurity of a node, and LeftChildNode
and RightChildNode are the child nodes of the current node. We are trying to find a fea-
ture q∗ that maximizes i(node) − PL · i(LeftChildNode) − PR · i(RightChildNode), where PL and PR are the fractions of the data that go to the corresponding child node if the split feature q∗ is applied. We can use one of the following impurity measures: a) entropy: i(node) = − ∑_{yi∈Y} p(yi) log p(yi); b) Gini index: i(node) = 1 − ∑_{yi∈Y} p(yi)^2; c) misclassification rate: i(node) = 1 − max_{yi∈Y} p(yi).
• After we select the splitting feature, we split the data at the current node v into two parts and assign them to its child nodes LeftChildNode and RightChildNode. We denote these two datasets as DL and DR. Node v is labeled as the parent node of LeftChildNode and RightChildNode under the split feature q∗.
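The node-splitting step above can be sketched as follows. Entropy is used as the impurity measure (the Gini index or misclassification rate could be plugged in instead); the function names and toy data are ours, not from the chapter:

# Random-subspace node split: sample n' features, pick the split maximizing impurity decrease.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y, n_sub, rng):
    n = X.shape[1]
    features = rng.choice(n, size=min(n_sub, n), replace=False)   # random feature subset
    parent = entropy(y)
    best = (None, None, -np.inf)                                  # (feature, threshold, gain)
    for q in features:
        for theta in np.unique(X[:, q])[:-1]:                     # candidate thresholds
            left = X[:, q] <= theta
            p_l = left.mean()
            gain = parent - p_l * entropy(y[left]) - (1 - p_l) * entropy(y[~left])
            if gain > best[2]:
                best = (q, theta, gain)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = (X[:, 2] > 0).astype(int)
# If the informative feature (index 2) is in the sampled subset, it is usually selected.
print(best_split(X, y, n_sub=3, rng=rng))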
19.4 Boosting
Boosting is a widely used ensemble approach, which can effectively boost a set of weak clas-
sifiers to a strong classifier by iteratively adjusting the importance of examples in the training set
and learning base classifiers based on the weight distribution. In this section, we review the general
procedure of Boosting and its representative approach: AdaBoost.
As summarized in Algorithm 19.5, Boosting approaches learn different weak classifiers iter-
atively by adjusting the weight of each training example. During each round, the weight of mis-
classified data will be increased and base classifiers will focus on those misclassified ones more
and more. Under this general framework, many Boosting approaches have been developed, includ-
ing AdaBoost [34], LPBoost [6], BrownBoost [32] and LogitBoost [71]. In the following, we will
discuss the most widely used Boosting approach, AdaBoost, in more detail.
19.4.2 AdaBoost
We will first describe AdaBoost in the context of binary classification. Given a training set
D = {xi, yi}_{i=1}^{m} (xi ∈ R^n, yi ∈ {+1, −1}), our goal is to learn a classifier that can classify unseen
data with high accuracy. AdaBoost [34, 66] derives an ensemble model H by combining different
weak classifiers. During the training process, the weight of each training example is adjusted based
on the learning performance in the previous round, and then the adjusted weight distribution will be
fed into the next round. This is equivalent to inferring classifiers from training data that are sampled
from the original data set based on the weight distribution.
The detailed procedure of AdaBoost is discussed as follows. At first, the weight distribution W
is initialized as W1(i) = 1/m, i.e., we initialize W1 with the uniform distribution when no prior knowledge
is provided. We learn a base classifier ht at each iteration t, given the training data D and weight
distribution Wt . Among all possible h ∈ H , we choose the best one that has the lowest classification
error. In binary classification (Y = {+1, −1}), if a classifier h has worse performance than random
guessing (error rate ε(h) ≥ 0.5), by simply flipping the output, we can turn h into a good classifier
with a training error 1 − ε(h). In this scenario, choosing the best classifier is equivalent to picking
either the best or worst one by considering all h ∈ H . Without loss of generality, we assume that
all the base classifiers have better performance compared with random guessing, i.e., ε(h) ≤ 0.5. At
each round, we pick the best model that minimizes the training error.
AdaBoost algorithm adopts weighted voting strategy to combine base classifiers. We derive a
base classifier ht at iteration t and calculate its weight in the combination as

αt = (1/2) ln( (1 − ε(ht)) / ε(ht) ),

which is computed according to its training error. This weight assignment has the following properties: 1) if ε(h1) ≤ ε(h2), then αh1 ≥ αh2, i.e., a higher weight is assigned to the classifier with the smaller training error; and 2) as ε(ht) ≤ 0.5, αt is always non-negative.
An important step in each round of AdaBoost is to update the weight distribution of training
examples based on the current base classifier according to the update equation

Wt+1(i) = Wt(i) exp{−αt · ht(xi) yi} / Zt,

where Zt = ∑_{i′=1}^{m} Wt(i′) exp{−αt · ht(xi′) yi′} is a normalization term that ensures that the sum of Wt+1(i)
is 1. From this equation, we can see that if a training example is misclassified, its weight will
be increased and then in the next iteration the classifier will pay more attention to this example.
Instead, if an example is correctly classified, its corresponding weight will decrease. Specifically, if
xi is wrongly classified, ht (xi ) · yi is −1 and αt ≥ 0 so that −αt · ht (xi )yi is positive. As exp{−αt ·
ht (xi )yi } > 1, the new weight Wt+1 (i) > Wt (i). Similarly, we can see that if xi is correctly classified,
Wt+1 (i) < Wt (i), i.e., the new weight decreases.
The above procedure constitutes one iteration in AdaBoost. We repeat these steps until some
stopping criterion is satisfied. We can set the number of iterations T as the stopping criterion, or simply
stop when we cannot find a base classifier that is better than random guessing. The whole algorithm
is summarized in Algorithm 19.6.
1: Initialize the weight distribution: W1(i) = 1/m for i = 1, . . . , m
2: for t ← 1 to T do
3:    Learn a base classifier ht = arg min_h ε(h), where ε(h) = ∑_{i=1}^{m} Wt(i) · 1(h(xi) ≠ yi)
4:    Calculate the weight of ht: αt = (1/2) ln( (1 − ε(ht)) / ε(ht) )
5:    Update the weight distribution of the training examples: Wt+1(i) = Wt(i) exp{−αt · ht(xi) yi} / ∑_{i′=1}^{m} Wt(i′) exp{−αt · ht(xi′) yi′}
6: end for
7: return H = ∑_{t=1}^{T} αt · ht
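A compact Python sketch of Algorithm 19.6 is given below for binary labels in {+1, −1}. It assumes scikit-learn is available and uses weight-aware decision stumps as base classifiers (any weak learner that accepts example weights would do); the final prediction takes the sign of the weighted sum, as in the weighted-voting combination described above:

# AdaBoost sketch for labels in {+1, -1} with decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    m = len(X)
    W = np.full(m, 1.0 / m)                       # step 1: uniform initial weights
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
        pred = h.predict(X)
        eps = np.sum(W * (pred != y))             # weighted training error
        if eps >= 0.5:                            # no weak classifier better than chance
            break
        eps = max(eps, 1e-12)                     # guard against a perfect classifier
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 4: classifier weight
        W = W * np.exp(-alpha * y * pred)         # step 5: re-weight the examples
        W /= W.sum()
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(models, alphas))
    return np.where(score >= 0, +1, -1)

X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = np.array([+1, +1, +1, -1, -1, -1, -1, +1, +1, +1])
models, alphas = adaboost_fit(X, y, T=3)
print(adaboost_predict(models, alphas, X))        # typically fits this toy data in a few rounds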
As misclassified data receive more attention during the learning process, AdaBoost can be sensitive to noise and outliers. When applied to noisy data, the performance of AdaBoost may not be satisfactory. In practice, we may alleviate this problem by stopping early (setting T to a small number) or by reducing the weight increase on misclassified data. Several follow-up approaches have been
developed to address this issue. For example, MadaBoost [24] improves AdaBoost by depressing
large weights, and FilterBoost [10] adopts log loss functions instead of exponential loss functions.
Now let us see how to extend AdaBoost to multi-class classification in which Y = {1, 2, . . ., K}.
We can generalize the above algorithm [33] by changing the way of updating weight distribution:
If ht(xi) = yi, then Wt+1(i) = Wt(i) · exp{−αt}/Zt; otherwise, Wt+1(i) = Wt(i) · exp{αt}/Zt. With this modification, we
can directly apply the other steps in AdaBoost on binary classification to the multi-class scenario
if the base classifiers can satisfy ε(h) < 0.5. However, when there are multiple classes, it may not
be easy to get weak classifiers with ε(h) < 0.5. To overcome this difficulty, an alternative way is
to convert multi-class problem with K classes into K binary classification problems [72], each of
which determines whether x belongs to the k-th class or not.
Note that AdaBoost requires that the learning algorithm used to infer base classifiers be able to learn from weighted training examples. If the learning algorithm is unable to learn
from weighted data, an alternative way is re-sampling, which samples training data according to the
weight distribution and then applies the learning algorithm. Empirical results show that there is no
clear performance difference between learning with weight distribution and re-sampling training
data.
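The re-sampling alternative can be sketched in a few lines; the helper name is ours:

# If a learner cannot handle example weights, draw a new training set according to W_t.
import numpy as np

def resample_by_weight(X, y, W, size=None, seed=0):
    rng = np.random.default_rng(seed)
    size = len(X) if size is None else size
    idx = rng.choice(len(X), size=size, replace=True, p=W)  # weight-proportional sampling
    return X[idx], y[idx]

# Usage (hypothetical): Xs, ys = resample_by_weight(X, y, W); h = learner.fit(Xs, ys)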
There are many theoretical explanations of AdaBoost and Boosting. In [71], a margin-based explanation of AdaBoost was introduced, which has a nice geometric intuition. It was also shown that the AdaBoost algorithm can be interpreted as a stagewise estimation procedure for fitting an additive logistic regression model [71]. As for Boosting, a population theory was proposed [14], and Boosting was interpreted as Gauss-Southwell minimization of a loss function [7].
Example. We demonstrate AdaBoost algorithm using the data set in Table 19.2. In this example, we
set T = 3. At each round, we learn a base classifier on the original data under the current weight distribution, and then calculate αt
based on weighted error rate. Next, we update weight distribution Wt+1 (i) as described in Algorithm
19.6. We repeat this procedure for T times. Table 19.11 shows the training data for each round and
the results of the base classifiers.
At the first round, the base classifier makes errors on points x = 8, 9, 10, so these examples’
weights increase accordingly. Then at Round 2, the base classifier pays more attention to these points
and classifies them correctly. Since points x = 4, 5, 6, 7 are correctly predicted by both classifiers,
they are considered “easy” and have lower weights, i.e., lower penalty if they are misclassified.
From Table 19.11, we can construct an ensemble classifier as shown in Table 19.12, which makes
correct predictions on all the examples.
19.5 Stacking
In this section, we introduce Stacked Generalization (Stacking), which learns an ensemble clas-
sifier based on the output of multiple base classifiers.
• Step 1: Learn first-level (base) classifiers from the training data. There are several choices to learn base classifiers: 1) we can apply the Bootstrap sampling technique to learn inde-
pendent classifiers; 2) we can adopt the strategy used in Boosting, i.e., adaptively learn base
classifiers based on data with a weight distribution; 3) we can tune parameters in a learn-
ing algorithm to generate diverse base classifiers (homogeneous classifiers); 4) we can apply
different classification methods and/or sampling methods to generate base classifiers (hetero-
geneous classifiers).
• Step 2: Construct a new data set based on the output of base classifiers. Here, the out-
put predicted labels of the first-level classifiers are regarded as new features, and the orig-
inal class labels are kept as the labels in the new data set. Assume that each example in D is {xi, yi}. We construct a corresponding example {x′i, yi} in the new data set, where x′i = (h1(xi), h2(xi), . . . , hT(xi)).
• Step 3: Learn a second-level classifier based on the newly constructed data set. Any learning
method could be applied to learn the second-level classifier.
Once the second-level classifier is generated, it can be used to combine the first-level classifiers.
For an unseen example x, its predicted class label under Stacking is h′(h1(x), h2(x), . . . , hT(x)), where {h1, h2, . . . , hT} are the first-level classifiers and h′ is the second-level classifier.
We can see that Stacking is a general framework. We can plug in different learning approaches
or even ensemble approaches to generate first or second level classifiers. Compared with Bagging
and Boosting, Stacking “learns” how to combine the base classifiers instead of voting.
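A minimal sketch of the three steps is shown below, assuming scikit-learn is available. It uses out-of-fold predictions of the first-level classifiers to build the second-level training set, a common practical refinement that is not required by the description above; the choice of base and second-level learners is illustrative only:

# Stacking: base predictions become features for a second-level classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
base = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]

# Step 1: learn first-level classifiers; Step 2: build the new data set x'_i = (h_1(x_i), ..., h_T(x_i)).
meta_features = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in base])
for h in base:
    h.fit(X, y)                                   # refit on the full data for later use

# Step 3: learn the second-level classifier h' on the newly constructed data set.
second_level = LogisticRegression().fit(meta_features, y)

# Prediction for unseen x: h'(h_1(x), ..., h_T(x)).
new_features = np.column_stack([h.predict(X) for h in base])
print(second_level.score(new_features, y))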
Example. We show the basic procedure of Stacking using the data set in Table 19.2. We set T = 2
at the first step of Algorithm 19.7 and show the results of the two base classifiers trained on original
data in Table 19.13.
Then we can construct a new data set based on the output of base classifiers. Since there are two
base classifiers, our new x′i has two dimensions: x′i = (h1(xi), h2(xi)), where h1(xi) is xi’s predicted label from the first classifier and h2(xi) is the predicted label from the second classifier. The new data set is shown
in Table 19.14.
Note that in the illustrations of various ensemble algorithms on this toy example shown in the
previous sections, we always use one-dimensional data on which we simply apply a threshold-
based classifier. In this Stacking example, we can easily extend the threshold-based classifier to two
dimensions:
h′(x′) = −1 if both components of x′ are −1, and h′(x′) = +1 otherwise.    (19.4)
The final results on the toy example obtained by this second-level classifier are shown in Table
19.15. We can see that the ensemble classifier classifies all the points correctly.
19.5.3 Discussions
Empirical results [19] show that Stacking has robust performance and often outperforms
Bayesian model averaging. Since its introduction [89], Stacking has been successfully applied to
a wide variety of problems, such as regression [12], density estimation [79] and spam filtering [68].
Recently, Stacking has shown great success in the Netflix competition, which was an open compe-
tition on using users’ historical ratings to predict new ratings of films. Many of the top teams employed ensemble techniques, and Stacking was one of the most popular approaches for combining classifiers
among teams. In particular, the winning team [5] adopted Stacking (i.e., blending) to combine hun-
dreds of different models, which achieved the best performance.
There are several important practical issues in Stacking. It is important to consider what types of features to create for the second-level classifier’s training set, and what type of learning method to use in building the second-level classifier [89]. Besides using the predicted class labels of the first-level classifiers, we can consider using class probabilities as features [82]. The advantage of using conditional probabilities as features is that the training set of the second-level classifier will include not only the predictions but also the prediction confidence of the first-level classifiers. In [82], the authors further
suggest using multi-response linear regression, a variant of least-square linear regression, as the
second-level learning algorithm. There are other choices of second-level classification algorithms.
For example, Feature-Weighted Linear Stacking [76] combines classifiers using a linear combina-
tion of meta features.
Ensemble Pruning. Instead of combining all available base classifiers, one can select a subset of them such that the combination of the selected classifiers can achieve comparable or even better performance
compared with the ensemble using all the classifiers. These approaches are usually referred to as
ensemble pruning, which can be roughly categorized into the following groups [90]: 1) Ordering-
based pruning, which orders the models according to a certain criterion and then selects the top
models, 2) clustering-based pruning, which clusters similar models and prunes within each cluster
to maintain a set of diverse models, and 3) optimization-based pruning, which tries to find a subset
of classifiers by optimizing a certain objective function with regard to the generalization ability of
the ensemble built on selected classifiers.
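As a hedged illustration of the first category, ordering-based pruning can be as simple as ranking the base classifiers by a validation criterion and keeping the top k; real pruning criteria are usually more sophisticated (e.g., also accounting for diversity), and the helper below is our own sketch:

# Ordering-based ensemble pruning: keep the k base classifiers ranked best by a criterion.
import numpy as np

def prune_by_ordering(models, X_val, y_val, k):
    scores = [np.mean(m.predict(X_val) == y_val) for m in models]   # validation accuracy
    order = np.argsort(scores)[::-1]                                 # best classifiers first
    return [models[i] for i in order[:k]]

# Usage (hypothetical): pruned = prune_by_ordering(models, X_val, y_val, k=10)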
Ensemble Learning in Different Scenarios. Ensemble learning has also been studied in the
context of challenging learning scenarios. In a stream environment where data continuously arrive,
some ensemble approaches [38,39,50,73,86] have been developed to handle concept drifts, i.e., the
fact that data distributions change over time. Many classification tasks encounter the class imbal-
ance problem in which examples from one class dominate the training set. Classifiers learnt from
this training set will perform well on the majority class but poorly on the other classes [87]. One
effective technique to handle class imbalance is to create a new training set by over-sampling mi-
nority examples or under-sampling majority examples [3, 16]. Naturally such sampling techniques
can be combined with ensemble approaches to improve the performance of the classifier on mi-
nority examples [17, 39, 44]. Ensemble methods have also been adapted to handle cost-sensitive
learning [25, 30, 85], in which misclassification on different classes incurs different costs and the
goal is to minimize the combined cost. Moreover, ensemble learning has been shown to be useful
in not only classification, but also many other learning tasks and applications. One very important
lesson learnt from the widely known Netflix competition is the effectiveness of ensemble techniques in collaborative filtering [5, 51]. The top teams’ solutions all involve, to some extent, blending the predictions of a variety of models that provide complementary predictive power. In unsupervised learning tasks
such as clustering, many methods have been developed to integrate various clustering solutions
into one clustering solution by reaching consensus among base models [41, 80]. Recently, researchers have begun to explore ensemble learning techniques for combining both supervised and unsupervised
models [2, 40, 58]. The objective of these approaches is to reach consensus among base models
considering both predictions made by supervised models and clustering constraints given by un-
supervised models. Another relevant topic is multi-view learning [9, 18, 77, 78], which assumes
different views share similar target functions and learns classifiers from multiple views of the same
objects by minimizing classification error as well as inconsistency among classifiers.
19.7 Conclusions
In this chapter, we gave an overview of ensemble learning techniques with a focus on the most
popular methods including Bayesian model averaging, bagging, boosting, and stacking. We dis-
cussed various base model generation and model combination strategies used in these methods. We
also discussed advanced topics in ensemble learning, including alternative techniques, performance analysis, ensemble pruning and diversity, and the use of ensembles in other learning tasks.
Although the topic of ensemble learning has been studied for decades, it still enjoys great at-
tention in many different fields and applications due to its many advantages in various learning
tasks. Nowadays, rapidly growing information and emerging applications continuously pose new
challenges for ensemble learning. In particular, the following directions are worth investigating in
the future:
• Big data brings big challenges to data analytics. Facing the daunting scale of big data, it
is important to adapt ensemble techniques to fit the needs of large-scale data processing.
For example, ensemble models developed on parallel processing and streaming platforms are
needed.
• Another important issue in ensemble learning is to enhance the comprehensibility of the
model. As data mining is used to solve problems in different fields, it is essential to have
the results interpretable and accessible to users with limited background knowledge in data
mining.
• Although numerous efforts have been made on the theoretical analysis of ensemble learning, there is still a large unexplored space on the way to fully understanding the mechanisms of ensemble models. In particular, the analysis so far has focused on traditional approaches such as bagging and boosting, but it is also important to explain ensemble approaches from theoretical perspectives in emerging problems and applications.
New methodologies, principles, and technologies will be developed to address these challenges
in the future. We believe that ensemble learning will continue to benefit real-world applications
in providing an effective tool to extract highly accurate and robust models from gigantic, noisy,
dynamically evolving, and heterogeneous data.
Bibliography
[1] Wikipedia. http://en.wikipedia.org/wiki/Ensemble_learning.
[2] A. Acharya, E. R. Hruschka, J. Ghosh, and S. Acharyya. C3E: A framework for combin-
ing ensembles of classifiers and clusterers. In Proceedings of the International Workshop on
Multiple Classifier Systems (MCS’11), pages 269–278, 2011.
[3] G. Batista, R. Prati, and M. C. Monard. A study of the behavior of several methods for bal-
ancing machine learning training data. SIGKDD Explorations Newsletter, 6:20–29, 2004.
[4] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bag-
ging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
[5] R. M. Bell and Y. Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations
Newsletter, 9(2):75–79, 2007.
[6] K. P. Bennett. Linear programming boosting via column generation. In Machine Learning,
pages 225–254, 2002.
[7] P. J. Bickel, Y. Ritov, and A. Zakai. Some theory for generalized boosting algorithms. The
Journal of Machine Learning Research, 7:705–732, 2006.
[10] J. K. Bradley and R. Schapire. Filterboost: Regression and classification on large datasets.
Advances in Neural Information Processing Systems, 20:185–192, 2008.
[14] L. Breiman. Population theory for boosting ensembles. The Annals of Statistics, 32(1):1–11,
2004.
[15] L. Breiman, J. H. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression
Trees. Chapman and Hall/CRC, 1984.
[19] B. Clarke. Comparing Bayes model averaging and stacking when model approximation error
cannot be ignored. The Journal of Machine Learning Research, 4:683–712, 2003.
[20] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[21] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Informa-
tion Theory, 13:21–27, 1967.
[22] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines: And Other
Kernel-Based Learning Methods. Cambridge University Press, 2000.
[23] T. Dietterich. Ensemble methods in machine learning. In Proceedings of the International
Workshop on Multiple Classifier Systems (MCS’00), pages 1–15, 2000.
[26] P. Domingos. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of
the International Conference on Machine Learning (ICML’00), pages 223–230, 2000.
[28] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London,
1993.
[29] T. Evgeniou, M. Pontil, and A. Elisseeff. Leave one out error, stability, and generalization of
voting combinations of classifiers. Machine Learning, 55(1):71–97, 2004.
[30] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. Adacost: Misclassification cost-sensitive boost-
ing. In Proceedings of the International Conference on Machine Learning (ICML’99), pages
97–105, 1999.
[31] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? on its accuracy and efficiency.
In Proc. of the IEEE International Conference on Data Mining (ICDM’03), pages 51–58,
2003.
[32] Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning,
43(3):293–318, 2001.
[33] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of
the International Conference on Machine Learning (ICML’96), pages 148–156, 1996.
[36] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles. Technical report,
Statistics, Stanford University, 2003.
[37] J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. Annals of Applied
Statistics, 3(2):916–954, 2008.
[38] J. Gao, W. Fan, and J. Han. On appropriate assumptions to mine data streams: Analysis and
practice. In Proceedings of the IEEE International Conference on Data Mining (ICDM’07),
pages 143–152, 2007.
[39] J. Gao, W. Fan, J. Han, and P. S. Yu. A general framework for mining concept-drifting data
streams with skewed distributions. In Proceedings of the SIAM International Conference on
Data Mining (SDM’07), pages 3–14, 2007.
[40] J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. Graph-based consensus maximization among
multiple supervised and unsupervised models. In Advances in Neural Information Processing
Systems, pages 585–593, 2009.
[41] J. Ghosh and A. Acharya. Cluster ensembles. WIREs Data Mining and Knowledge Discovery,
1:305–315, 2011.
[42] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian
learning using information theory and the VC dimension. In Machine Learning, pages 61–74,
1992.
[43] D. Hernández-Lobato, G. Martínez-Muñoz, and I. Partalas. Advanced topics in ensemble learning. ECML/PKDD 2012 tutorial. https://sites.google.com/site/ecml2012ensemble/.
[44] S. Hido, H. Kashima, and Y. Takahashi. Roughly balanced bagging for imbalanced data.
Statistical Analysis and Data Mining, 2(5-6):412–426, 2009.
[45] T. K. Ho. Random decision forests. In Proceedings of the International Conference on Docu-
ment Analysis and Recognition, pages 278–282, 1995.
[46] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky. Bayesian model averaging: a tutorial.
Statistical Science, 14(4):382–417, 1999.
[50] J. Kolter and M. Maloof. Using additive expert ensembles to cope with concept drift. In
Proceedings of the International Conference on Machine Learning (ICML’05), pages 449–
456, 2005.
[51] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems.
Computer, 42(8):30–37, 2009.
[52] L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons,
2004.
[53] L. I. Kuncheva. Diversity in multiple classifier systems. Information Fusion, 6(1):3–4, 2005.
[54] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their
relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
[55] F. T. Liu, K. M. Ting, and W. Fan. Maximizing tree diversity by building complete-random
decision trees. In Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge
Discovery and Data Mining (PAKDD’05), pages 605–610, 2005.
[56] F. T. Liu, K. M. Ting, Y. Yu, and Z.-H. Zhou. Spectrum of variable-random trees. Journal of
Artificial Intelligence Research, 32:355–384, 2008.
[57] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the IEEE Interna-
tional Conference on Data Mining (ICDM’08), pages 413–422, 2008.
[58] X. Ma, P. Luo, F. Zhuang, Q. He, Z. Shi, and Z. Shen. Combining supervised and unsupervised
models via unconstrained probabilistic embedding. In Proceedings of the International Joint
Conference on Artifical Intelligence (IJCAI’11), pages 1396–1401, 2011.
[59] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization
of margins. Machine Learning, 38(3):243–255, 2000.
[60] P. Melville and R. J. Mooney. Creating diversity in ensembles using artificial data. Information
Fusion, 6:99–111, 2006.
[61] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[62] K. Monteith, J. L. Carroll, K. D. Seppi, and T. R. Martinez. Turning Bayesian model averaging
into Bayesian model combination. In Proceedings of the International Joint Conference on
Neural Networks, pages 2657–2663, 2011.
[63] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine,
6(3):21–45, 2006.
[66] R. Rojas. Adaboost and the super bowl of classifiers: A tutorial introduction to adaptive boost-
ing, 2009. http://www.inf.fu-berlin.de/inst/ag-ki/adaboost4.pdf.
[67] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors. Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press,
1986.
[68] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, and P. Stam-
atopoulos. Stacking classifiers for anti-spam filtering of e-mail. arXiv preprint cs/0106040,
2001.
[69] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[70] R. E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison,
M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classifi-
cation. Springer, 2003.
[71] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation
for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[72] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predic-
tions. In Machine Learning, pages 80–91, 1999.
[73] M. Scholz and R. Klinkenberg. An ensemble classifier for drifting concepts. In Proceedings
of the ECML/PKDD Workshop on Knowledge Discovery in Data Streams, pages 53–64, 2005.
[74] G. Seni and J. Elder. Ensemble methods in data mining: Improving accuracy through combin-
ing predictions. Morgan & Claypool, 2010.
[75] M. Sewell. Ensemble learning research note. Technical report, University College London,
2011.
[76] J. Sill, G. Takács, L. Mackey, and D. Lin. Feature-weighted linear stacking. arXiv:0911.0460,
2009.
[77] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization approach to semi-supervised
learning with multiple views. In Proc. of the ICML’05 workshop on Learning with Multiple
Views, 2005.
[78] V. Sindhwani and D. S. Rosenberg. An RKHS for multi-view learning and manifold
co-regularization. In Proceedings of the International Conference on Machine Learning
(ICML’08), pages 976–983, 2008.
[79] P. Smyth and D. Wolpert. Linearly combining density estimators via stacking. Machine Learn-
ing, 36(1-2):59–83, 1999.
[80] A. Strehl and J. Ghosh. Cluster ensembles — a knowledge reuse framework for combining
multiple partitions. Journal of Machine Learning Research, 3:583–617, 2003.
[81] E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures. Machine Learn-
ing, 65(1):247–271, 2006.
[82] K. M. Ting and I. H. Witten. Issues in stacked generalization. arXiv:1105.5466, 2011.
[83] K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classi-
fiers. Pattern Recognition, 29(2):341–348, 1996.
[84] G. Valentini and T. G. Dietterich. Bias-variance analysis of support vector machines for the
development of svm-based ensemble methods. Journal of Machine Learning Research, 5:725–
775, 2004.
[85] P. Viola and M. Jones. Fast and robust classification using asymmetric adaboost and a detector
cascade. In Advances in Neural Information Processing System 14, pages 1311–1318, 2001.
[86] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble
classifiers. In Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD’03), pages 226–235, 2003.
[87] G. Weiss and F. Provost. The effect of class distribution on classifier learning. Technical
Report ML-TR-43, Rutgers University, 2001.
[88] P. J. Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sci-
ences. PhD thesis, Harvard University, 1974.
[89] D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
[90] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Machine
Learning & Pattern Recognition Series, 2012.
Chapter 20
Semi-Supervised Learning
Kaushik Sinha
Wichita State University
Wichita, KS
[email protected]
20.1 Introduction
Consider an input space X and an output space Y , where one would like to see an example from
input space X and automatically predict its output. In traditional supervised learning, a learning
algorithm is typically given a training set of the form {(xi , yi )}li=1 , where each pair (xi , yi ) ∈ X × Y
is drawn independently at random according to an unknown joint probability distribution PX ×Y . In
the case of a supervised classification problem, Y is a finite set of class labels and the goal of the
learning algorithm is to construct a function g : X → Y that predicts the label y given x. For exam-
ple, consider the problem of predicting if an email is a spam email. This is a binary classification
problem where Y = {+1, −1} and the label “ + 1” corresponds to a spam email and the label “ − 1”
corresponds to a non-spam email. Given a training set of email-label pairs, the learning algorithm
needs to learn a function that given a new email, would predict if it is a spam email.
The criterion for choosing this function g is the low probability of error, i.e., the algorithm must
choose g in such a way that when an unseen pair (x, y) ∈ X × Y is chosen according to PX ×Y
and only x is presented to the algorithm, the probability that g(x) ≠ y is minimized over a class of
functions g ∈ G . In this case, the best function that one can hope for is based on the conditional
distribution P(Y|X) and is given by η(x) = sign[E(Y|X = x)], which is known as the so-called
Bayes optimal classifier. Note that, when a learning algorithm constructs a function g based on a
training set of size l, the function g is a random quantity and it depends on the training set size
l. So it is a better idea to represent the function g as gl to emphasize its dependence on training
set size l. A natural question, then, is what properties should one expect from gl as the training
set size l increases? A learning algorithm is called consistent if gl converges to η (in appropriate
convergence mode) as the training set size l tends to infinity. This is the best one can hope for, as it
guarantees that as the training set size l increases, gl converges to the “right” function. Of course,
“convergence rate” is also important, which specifies how fast gl converges to the right function.
Thus, one would always prefer a consistent learning algorithm. It is clear that performance of such
an algorithm improves as the training set size increases, or in other words, as we have capacity to
label more and more examples.
Unfortunately, in many real world classification tasks, it is often a significant challenge to ac-
quire enough labeled examples due to the prohibitive cost of labeling every single example in a
given data set. This is due to the fact that labeling often requires domain specific knowledge and
in many cases, a domain expert has to actually look at the data to manually label it! In many real
life situations, this can be extremely expensive and time consuming. On the other hand, in many
practical situations, a large pool of unlabeled examples are quite easy to obtain at no extra cost as
no labeling overhead is involved. Consider the email classification problem again. To figure out if
an email is a spam email, someone has to read its content to make the judgment, which may be time consuming (and expensive if someone needs to be hired to do this job), while unlabeled emails are quite
easy to obtain: just take all the emails from one’s email account.
Traditional supervised classification algorithms only make use of the labeled data, and prove insufficient in situations where the number of labeled examples is limited, since, as pointed
out before, performance of a consistent supervised algorithm improves as the number of labeled
examples (training set size) increases. Therefore, if only a limited amount of labeled examples
is available, performance of supervised classification algorithms may be quite unsatisfactory. As
an alternative, semi-supervised learning, i.e., learning from both labeled and unlabeled data, has received considerable attention in recent years due to its potential for reducing the need for expensive labeled data. The hope is that the power of unlabeled examples drawn i.i.d. according to the marginal distribution PX may somehow be exploited to compensate for the scarcity of labeled examples
and design better classification algorithms. This can be justified in many situations, for example, as
shown in the following two scenarios.
Consider the first scenario. Suppose in a binary classification problem, only two labeled exam-
ples are given, one from each class, as shown in the left panel of Figure 20.1. A natural choice to
induce a classifier on the basis of this would be the linear separator as shown by a dotted line in
the left panel of Figure 20.1. As pointed out in [8], a variety of theoretical formalisms, including Bayesian paradigms, regularization, minimum description length, and structural risk minimization principles, have been constructed to rationalize such a choice (based on a prior notion of simplicity), on the grounds that the linear separator is the simplest structure consistent with the data. Now,
consider the situation where, in addition to the two labeled examples, one is given additional unla-
beled examples as shown in the right panel of Figure 20.1. It is quite evident that in light of this new
set of unlabeled examples, one must re-evaluate one’s prior notion of simplicity, as the linear separator is clearly no longer an ideal choice. The particular geometric structure of the marginal distribu-
tion in this case suggests that the most natural classifier is a non-linear one and hence when labeled
examples are limited, unlabeled examples can indeed provide useful guidance towards learning a
better classifier.
FIGURE 20.1 (See color insert.): Unlabeled examples and prior belief.
Now, consider the second scenario where the focus is again on binary classification. Here sup-
pose the examples are generated according to two Gaussian distributions, one per class, as shown
in red and green, respectively, in Figure 20.2. The corresponding Bayes-optimal decision boundary,
which classifies examples into the two classes and provides minimum possible error, is shown by
the dotted line. The Bayes optimal decision boundary can be obtained from the Bayes rule, if the
Gaussian mixture distribution parameters (i.e., the mean and variance of each Gaussian, and the
mixing parameter between them) are known. It is well known that unlabeled examples alone, when
generated from a mixture of two Gaussians, are sufficient to recover the original mixture compo-
nents ( [41, 42]). However, unlabeled examples alone cannot assign examples to classes. Labeled
examples are needed for this purpose. In particular, when the decision regions are already known,
only a few labeled examples would be enough to assign examples to classes. Therefore, when an
infinite amount of unlabeled examples along with a few labeled examples are available, learning
proceeds in two steps: (a) identify the mixture parameters and hence the decision regions from un-
labeled examples, (b) use the labeled examples to assign class labels to the learned decision regions
in step (a). In practice, a sufficiently large set of unlabeled examples often suffices to estimate the
mixture parameters to reasonable accuracy.
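A minimal sketch of this two-step procedure, assuming scikit-learn is available, is shown below: a two-component Gaussian mixture is fit to the unlabeled data, and a handful of labeled examples are then used to attach a class label to each learned component. The data and labels are synthetic and purely illustrative:

# Semi-supervised learning with a two-component Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
unlabeled = np.concatenate([rng.normal(-2, 1, 500), rng.normal(+2, 1, 500)]).reshape(-1, 1)
x_lab = np.array([[-2.1], [2.3]])     # a few labeled examples, one per class
y_lab = np.array([-1, +1])

gmm = GaussianMixture(n_components=2, random_state=0).fit(unlabeled)   # step (a)

# Step (b): each mixture component takes the label of the labeled examples it claims.
# This assumes the two labeled points fall in different components (true for this
# well-separated toy data).
comp_of_lab = gmm.predict(x_lab)
comp_to_class = {c: y_lab[comp_of_lab == c][0] for c in np.unique(comp_of_lab)}

def classify(x):
    return np.array([comp_to_class[c] for c in gmm.predict(np.asarray(x).reshape(-1, 1))])

print(classify([[-3.0], [0.5], [2.5]]))   # expected: [-1, +1, +1] under this toy setup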
The ubiquity and easy availability of unlabeled examples, together with the increased computa-
tional power of modern computers, make the paradigm of semi-supervised learning quite attractive
in various applications, while connections to natural learning make it also conceptually intriguing.
Vigorous research over the last decade in this domain has resulted in several models, numerous as-
sociated semi-supervised learning algorithms, and theoretical justifications. To some extent, these
efforts have been documented in various dedicated books and survey papers on semi-supervised
learning (see e.g., [19, 52, 72, 74, 75]).
What is missing in those documents, however, is a comprehensive treatment of various semi-
supervised models, associated algorithms, and their theoretical justifications, in a unified manner.
This vast gap is quite conceivable since the field of semi-supervised learning is relatively new and
most of the theoretical results in this field are known only recently. The goal of this survey chapter
is to close this gap as much as possible. While our understanding of semi-supervised learning is not
yet complete, various recent theoretical results provide significant insight towards this direction and
it is worthwhile to study these results along with the models and algorithms in a unified manner.
This survey chapter does not, by any means, aim to cover every single semi-supervised learning
algorithm developed in the past. Instead, the modest goal is to cover various popular semi-supervised
FIGURE 20.2: In the Gaussian mixture setting (components with means µ1 and µ2), the mixture
components can be fully recovered using only unlabeled examples, while labeled examples are
used to assign labels to the individual components.
learning models, some representative algorithms within each of these models, and various theoreti-
cal justifications for using these models.
Both of the above assumptions are based on strong intuition and empirical evidence obtained from
real life high dimensional data, and have led to the development of broad families of semi-supervised
learning algorithms by introducing various data dependent regularizations. For example, various
graph-based algorithms, in some way or the other, try to capture the manifold structure by con-
structing a neighborhood graph whose nodes are labeled and unlabeled examples.
While the majority of the semi-supervised learning algorithms are based on either the man-
ifold assumption or the cluster assumption, there are two other principles that give rise to various
other semi-supervised learning algorithms as well. These are:
1. Co-Training
In the co-training model ( [10]), the instance space has two different yet redundant views,
where each view in itself is sufficient for correct classification. In addition to that, if the
underlying distribution is “compatible” in the sense that it assigns zero probability mass to
the examples that differ in prediction according to the two consistent classifiers in those two
views, then one can hope that unlabeled examples might help in learning.
2. Generative Model
In the generative model, labeled examples are generated in two steps: first, by randomly se-
lecting a class label, and then in the second step, by selecting an example from this class. The
class conditional distributions belong to a parametric family. Given unlabeled examples from
the marginal (mixture) distribution, one can hope to learn the parameters using maximum
likelihood estimation as long as the mixture is identifiable, and then use a maximum a posteriori
estimate to infer the label of an unlabeled example.
In the subsequent sections, we discuss in detail various semi-supervised learning algorithms
based on manifold/cluster assumption, co-training, and generative models and also provide theoret-
ical justification and relevant results for each of these models.
• once a class label is chosen, an instance from this class is generated according to the class
conditional probability p(x|y).
Generative models assume that marginal distribution p(x) = ∑y p(y)p(x|y) is an identifiable mix-
ture model and the class conditional distributions p(x|y) are parametric. In particular, the class
conditional distributions p(x|y, θ) are parameterized by parameter vector θ. The parameterized joint
distribution then can be written as p(x, y|θ) = p(y|θ)p(x|y, θ). A simple application of Bayes rule
suggests that p(y|x, θ) = p(y|θ)p(x|y, θ)/p(x|θ). For any θ, the label of any instance x is given by the
maximum a posteriori (MAP) estimate arg max_y p(y|x, θ). Since θ is not known in advance, the goal of
the generative semi-supervised learning is to make an estimate θ̂ of the parameter vector θ from
data and then use ŷ = arg max_y p(y|x, θ̂).
20.2.1 Algorithms
A practical semi-supervised learning algorithm under the generative model is proposed in
[19, 45] for text classification, where each individual mixture component p(x|y, θ) is a multino-
mial distribution over the words in a vocabulary. In order to learn a classifier in this framework, the
first step is to estimate the parametric model from data. This is done by maximizing the observ-
able likelihood incorporating the l labeled and u unlabeled examples, arg max_θ p(θ)p(x, y|θ), which
is equivalent to maximizing the log likelihood:

l(θ) = log(p(θ)) + ∑_{i=1}^{l} log( p(y_i|θ) · p(x_i|y_i, θ) ) + ∑_{i=l+1}^{l+u} log ∑_{y} ( p(y|θ) · p(x_i|y, θ) ),    (20.1)
where the prior distribution is formed with a product of Dirichlet distributions, one for each of
the class multinomial distributions and one for the overall class probabilities. Notice that the equa-
tion above contains a log of sums for the unlabeled data, which makes maximization by partial
derivatives computationally intractable. As an alternative, the Expectation-Maximization (EM) algo-
rithm ( [28]) is used to maximize Equation 20.1. EM is an iterative hill climbing procedure that
finds a local maximum of Equation 20.1 by performing alternating E and M steps until the log
likelihood converges. The procedure starts with an initial guess of the parameter vector θ. At any
iteration, in the E step, the algorithm estimates the expectation of the missing values (unlabeled
class information) given the model parameter estimates in the previous iteration. The M step, on the
other hand, maximizes the likelihood of the model parameters using the expectation of the missing
values found in the E step. The same algorithm was also used by Baluja ( [5]) on a face orientation
discrimination task yielding good performance.
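The EM loop just described can be sketched as follows for a multinomial naive Bayes mixture; the dense count-matrix inputs, the fixed number of iterations, and the variable names are assumptions made for illustration, not the implementation of [19, 45].

```python
# Illustrative EM loop for semi-supervised multinomial naive Bayes:
# E step: estimate class posteriors for the unlabeled documents;
# M step: refit the model on labeled counts plus soft counts from the E step.
# Assumes scikit-learn; X_l, y_l, X_u are dense term-count matrices here.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semisupervised_em_nb(X_l, y_l, X_u, n_iter=20, alpha=1.0):
    classes = np.unique(y_l)
    model = MultinomialNB(alpha=alpha).fit(X_l, y_l)   # initial guess from labeled data only
    for _ in range(n_iter):
        # E step: soft class memberships for the unlabeled documents.
        resp = model.predict_proba(X_u)                # shape (u, n_classes), columns follow classes
        # M step: refit using labeled data (hard labels) and unlabeled data
        # weighted by the responsibilities, one weighted copy per class.
        X_all = np.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(X_u.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_l.shape[0])] +
                               [resp[:, k] for k in range(len(classes))])
        model = MultinomialNB(alpha=alpha).fit(X_all, y_all, sample_weight=w_all)
    return model
```

A convergence check on the log likelihood could replace the fixed iteration count; the fixed count is kept here only to keep the sketch short.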
If data are indeed generated from an identifiable mixture model and if there is a one-to-one
correspondence between mixture components and classes, maximizing the observable likelihood seems
to be a reasonable approach to finding model parameters, and the resulting classifier performs better
than one trained only from labeled data. However, when these assumptions
do not hold, the effect of unlabeled data is less clear and in many cases it has been reported that
unlabeled data do not help or even degrade performance ( [24]).
is, to figure out which component corresponds to which class and identify the decision regions).
In fact, this is the exact motivation to understand and quantify the effect of unlabeled examples in
classification problems under generative model assumption in many research endeavors ( [14–16,
49, 50]).
Castelli and Cover ( [15,16]) investigate the usefulness of unlabeled examples in the asymptotic
sense, with the assumption that the number of unlabeled examples goes to infinity at a rate faster than
that of the labeled examples. Under the identifiable mixture model assumption and various mild
assumptions, they argue that the marginal distribution can be estimated from unlabeled examples alone
and that labeled examples are merely needed for identifying the decision regions. Under this setting, they
prove that classification error decreases exponentially fast towards the Bayes optimal solution with
the number of labeled examples, and only polynomially with the number of unlabeled examples,
thereby justifying the usefulness of unlabeled examples. In a similar but independent effort, Ratsaby
and Venkatesh ( [49], [50]) prove similar results when individual class conditional densities are
Gaussian distributions.
However, both of the above results inherently assume that estimators can replicate the correct un-
derlying distribution that generates the data. In many real life learning problems, this may
be a difficult assumption to satisfy. Consider the situation where the true underlying marginal distri-
bution is f, and since f is unknown, it is assumed (maybe for computational convenience) that the
marginal distribution belongs to the class of probability distributions G. Now, if unlabeled examples
are used to estimate the marginal distribution, then as more unlabeled examples are added one can
get better and better estimates and eventually the best possible estimate g∗ within class G, which,
however, may be quite different from the true underlying distribution f. The fact that g∗ is the best pos-
sible estimator doesn’t mean that g∗ leads to the best possible classification boundary. This fact is
nicely demonstrated by a toy example in [19, 24], where the true underlying marginal distribution is
a mixture of beta distributions but is modeled using a mixture of Gaussian distributions, which
leads to an incorrect classification boundary and degraded classification performance. Cozman and Co-
hen ( [24]) use this informal reasoning to justify the observation that in certain cases unlabeled data
actually causes a degraded classification performance, as reported in the literature [5, 45, 54].
The informal reasoning of Cozman and Cohen ( [24]) is, to some extent, formalized in a recent
work by Sinha and Belkin ( [57]), who investigate the situation when the data
come from a probability distribution, which can be modeled, but not perfectly, by an identifiable
mixture distribution. This seems applicable to many practical situations, when, for example, a mix-
ture of Gaussians is used to model the data without knowing what the underlying true distribution
is. The main finding of this work is that when the underlying model is different from the assumed
model and unlabeled examples are used to estimate the marginal distribution under the assumed
model, putting all of one's effort into finding the best estimator under the assumed model is not
worthwhile, since that estimator is still different from the true underlying model and hence does not
guarantee identifying the best decision boundary. Instead, a "reasonable" estimate, whose estimation
error is on the order of the difference between the assumed and true models, is good enough, because beyond
that point unlabeled examples do not provide much useful information. Depending on the size of
perturbation or difference between the true underlying model and the assumed model, this work
shows that the data space can be partitioned into different regimes where labeled and unlabeled ex-
amples play different roles in reducing the classification error and labeled examples are not always
exponentially more valuable as compared to the unlabeled examples as mentioned in the literature
( [14–16, 49, 50]).
In a recent work, Dillon et al. ( [29]) study the asymptotic analysis of generative semi-supervised
learning. This study allows the number of labeled examples l and number of unlabeled examples
u to grow simultaneously but at a constant ratio λ = l/(l + u). The authors consider a stochastic
setting for maximizing the joint likelihood of the data under the generative model, where the estimate is

θ̂_n = arg max_θ ∑_{i=1}^{n} ( Z_i log p(x_i, y_i|θ) + (1 − Z_i) log p(x_i|θ) ),

where Z_i ∼ Ber(λ) are Bernoulli random variables with parameter λ. Under mild assumptions, the
authors prove that the empirical estimate θ̂_n converges to the true model parameter θ_0 with probability 1,
as n → ∞. In addition, they prove that θ̂_n is asymptotically normal, i.e., √n(θ̂_n − θ_0) converges
in distribution to N(0, Σ^{−1}), where Σ = λ Var_{θ_0}(∇_θ log p(x, y|θ_0)) + (1 − λ) Var_{θ_0}(∇_θ log p(x|θ_0)),
and that it may be used to characterize the accuracy of θ̂_n as a function of n, θ_0, λ. This result
suggests that asymptotic variance is a good proxy for finite sample measures of error rates and
empirical mean squared error, which the authors also verify empirically.
20.3 Co-Training
Blum and Mitchell introduce co-training ( [10]) as a general term for rule based bootstrapping,
in which examples are believed to have two distinct yet sufficient feature sets (views) and each rule
is based entirely on either the first view or the second view. This method enjoys reasonable success
in scenarios where examples can naturally be thought of as having two views. One prototypical
example for application of the co-training method is the task of Web page classification. Web pages
contain text that describes contents of the Web page (the first view) and have hyperlinks pointing
to them (the second view). The existence of two different and somewhat redundant sources of in-
formation (two views) and the availability of unlabeled data suggest the following learning strategy
( [10]) for classifying, say, faculty Web pages. First, using a small set of labeled examples, find weak
predictors based on each kind of information; e.g., the phrase "research interest" on a Web page
may be a weak indicator that the page is a faculty home page, and the phrase "my advisor" on a link
may be an indicator that the page being pointed to is a faculty Web page. In the second stage, an attempt
is made to bootstrap from these weak predictors using unlabeled data.
In the co-training model, the instance space is X = X1 × X2 , where X1 and X2 correspond to two
different views of an example, in other words, each example x is written as a pair (x1 , x2 ), where
x1 ∈ X1 and x2 ∈ X2 . The label space is Y . There are distinct concept classes C1 and C2 defined
over X1 and X2 , respectively, i.e., C1 consists of functions predicting label Y from X1 and C2 con-
sists of functions predicting labels Y from X2 . The central assumption here is that each view in
itself is sufficient for correct classification. This boils down to the existence of c1 ∈ C1 and c2 ∈ C2
such that c1 (x) = c2 (x) = y for any example x = (x1 , x2 ) ∈ X1 × X2 observed with label y ∈ Y .
This means that the probability distribution PX over X assigns zero mass to any example (x1 , x2 )
such that c1 (x1 ) ≠ c2 (x2 ). Co-training makes a crucial assumption that the two views x1 and x2 are
conditionally independent given the label y, i.e., P(x1 |y, x2 ) = P(x1 |y) and P(x2 |y, x1 ) = P(x2 |y).
The reason why one can expect that unlabeled data might help in this context is as follows. For a
given distribution PX over X the target function c = (c1 , c2 ) is compatible with PX if it assigns zero
mass to any example (x1 , x2 ) such that c1 (x1 ) ≠ c2 (x2 ). As a result, even though C1 and C2 may be
large concept classes with high complexity, say in terms of high VC dimension measure, for a given
distribution PX , the set of compatible target concepts may be much simpler and smaller. Thus, one
might hope to be able to use unlabeled examples to get a better understanding of which functions
in C = C1 × C2 are compatible, yielding useful information that might reduce the number of labeled
examples needed by a learning algorithm.
20.3.1 Algorithms
Co-training is an iterative procedure. Given a set of l labeled examples and u unlabeled ex-
amples, the algorithm first creates a smaller pool of u′ (u′ < u) unlabeled examples and then it
iterates the following procedure. First, the l labeled examples are used to train two distinct classifiers
h1 : X1 → Y and h2 : X2 → Y , where h1 is trained based on the x1 view of the instances and h2 is
trained based on the x2 view of the instances. Typically, a naive Bayes classifier is used for the choice
of h1 and h2 . After the two classifiers are trained, each of the classifiers h1 and h2 individually
chooses, from the pool of u′ unlabeled examples, some positive and negative examples that it is
confident about. Typically, in the implementation in [10], the number of positive examples is set to
one and the number of negative examples is set to three. The examples selected in this way, along
with the labels predicted by h1 and h2 , respectively, are added to the set of labeled examples, and
then h1 and h2 are retrained on this larger set of labeled examples. The pool of u′ unlabeled examples
is then replenished from the set of u unlabeled examples, and the procedure iterates for a certain
number of iterations.
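A compact sketch of this iterative procedure is shown below, with a naive Bayes classifier on each view as in the description above; the Gaussian (rather than multinomial) naive Bayes choice, the pool size, the binary 0/1 labels, and the data-handling details are illustrative assumptions rather than the exact implementation of [10].

```python
# Sketch of the co-training loop described above: two naive Bayes classifiers,
# one per view, repeatedly label their most confident examples from a small
# pool of unlabeled data and feed them to the shared labeled set.
# Assumes NumPy and scikit-learn; X1_*/X2_* are the two views of the same examples,
# and y_l contains binary labels {0, 1} with both classes present.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def cotrain(X1_l, X2_l, y_l, X1_u, X2_u,
            n_iter=30, pool_size=75, n_pos=1, n_neg=3, seed=0):
    rng = np.random.RandomState(seed)
    X1_l, X2_l, y_l = X1_l.copy(), X2_l.copy(), y_l.copy()
    unlabeled = list(rng.permutation(len(X1_u)))
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_iter):
        h1.fit(X1_l, y_l)
        h2.fit(X2_l, y_l)
        if not pool:
            break
        newly_labeled = []
        for h, X_view in ((h1, X1_u), (h2, X2_u)):
            proba = h.predict_proba(X_view[pool])          # confidence on the pool
            for cls, n_take in ((1, n_pos), (0, n_neg)):
                col = list(h.classes_).index(cls)
                best = np.argsort(proba[:, col])[::-1][:n_take]
                newly_labeled += [(pool[i], cls) for i in best]
        for idx, cls in newly_labeled:                     # add confident examples
            if idx in pool:
                pool.remove(idx)
                X1_l = np.vstack([X1_l, X1_u[idx:idx + 1]])
                X2_l = np.vstack([X2_l, X2_u[idx:idx + 1]])
                y_l = np.append(y_l, cls)
        while unlabeled and len(pool) < pool_size:         # replenish the pool
            pool.append(unlabeled.pop())
    return h1, h2
```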
If the conditional independence assumption holds, then co-training works reasonably well as
is reported by Nigam et al. ( [46]), who perform extensive empirical experiments to compare co-
training and generative mixture models. Collins and Singer ( [23]) suggest a refinement of the co-
training algorithm in which one explicitly optimizes an objective function that measures the degree
of agreement between the predictions based on x1 view and those based on x2 view. The objective
function is then optimized using boosting-based methods suggested in [23]. Unlike the co-training procedure suggested
in [10], Goldman and Zhou ( [31]) make no assumption about the existence of two redundant views
but insist that their co-training strategy places the following requirement on each supervised learning
algorithm: the hypothesis of the supervised learning algorithm must partition the example space into a
set of equivalence classes. In particular, they use two learners of different types that require the
whole feature set (as opposed to individual views) for training and essentially use one learner’s high
confidence examples, identified by a set of statistical tests, from the unlabeled set to teach the other
learner and vice versa. Balcan and Blum ( [4]) demonstrate that co-training can be quite effective,
in the sense that in some extreme cases, a single labeled example is all that is needed to learn the
classifier. Over the years, many other modifications of co-training algorithms have been suggested
by various researchers. See for example [20,35,36,66,68,69] for many of these methods and various
empirical studies.
For co-training to actually do anything useful, there should be at least some examples for which the
hypothesis based on the first view should be confident about their labels but not the other hypothesis
based on the second view, and vice versa.
Unfortunately, in order to provide theoretical justification for co-training, often, strong assump-
tions about the second type are made. The original work of Blum and Mitchell ( [10]) makes the
conditional independence assumption and provides an intuitive explanation regarding why co-training
works in terms of maximizing agreement on unlabeled examples between classifiers based on dif-
ferent views of the data. However, the conditional independence assumption in that work is quite
strong and no generalization error bound for co-training or justification for the intuitive account in
terms of classifier agreement on unlabeled examples is provided. Abney ( [1]) shows that the strong
independence assumption is often violated in data and in fact a weaker assumption in the form of
“weak rule dependence” actually suffices.
Even though Collins and Singer ( [23]) introduce the degree of agreement between the classifiers
based on two views in the objective function, no formal justification was provided until the work
of Dasgupta et al. ( [26]), who provide a generalization bound for co-training
under partial classification rules. A partial classification rule either outputs a class label or outputs
a special symbol ⊥ indicating no opinion. The error of a partial rule is the probability that the rule
is incorrect given that it has an opinion. In [26], a bound on the generalization error of any two
partial classification rules h1 and h2 on two views, respectively, is given, in terms of the empirical
agreement rate between h1 and h2 . This bound formally justifies both the use of agreement in the
objective function (as was done in [23]) and the use of partial rules. The bound shows the potential
power of unlabeled data, in the sense that low generalization error can be achieved for complex rules
with a sufficient number of unlabeled examples.
All the theoretical justifications mentioned above in [1, 10, 26] require either the assumption of
conditional independence given the labels or the assumption of weak rule dependence. Balcan et al.
( [3]) substantially relax the strength of the conditional independence assumption to just a form of
“expansion” of the underlying distribution (a natural analog of the graph theoretic notion of expan-
sion and conductance) and show that in some sense, this is a necessary condition for co-training to
succeed. However, the price that needs to be paid for this relaxation is a fairly strong assumption
on the learning algorithm, in the form that the hypotheses the learning algorithm produces are never
“confident but wrong,” i.e., they are correct whenever they are confident, which formally translates
to the condition that the algorithm is able to learn from positive examples only. However, only a
heuristic analysis is provided in [3] when this condition does not hold.
f^T L f = f^T (D − W) f = (1/2) ∑_{i,j=1}^{l+u} w_{ij} (f_i − f_j)^2    (20.3)
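As a quick numerical check of this identity (not from the chapter; the random weight matrix below is an illustrative assumption):

```python
# Numerical check of Equation 20.3: f^T L f = f^T (D - W) f equals one half
# of the weighted sum of squared differences over all pairs of nodes.
import numpy as np

rng = np.random.RandomState(0)
n = 6
W = rng.rand(n, n)
W = (W + W.T) / 2.0                 # symmetric weights w_ij
np.fill_diagonal(W, 0.0)            # no self-loops
D = np.diag(W.sum(axis=1))
L = D - W                           # graph Laplacian
f = rng.randn(n)

lhs = f.dot(L).dot(f)
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))
print(np.isclose(lhs, rhs))         # True
```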
20.4.1 Algorithms
Various graph-based techniques fall under four broad categories: graph cut-based methods,
graph-based random walk methods, graph transduction methods, and manifold regularization meth-
ods. Sometimes the boundaries between these categories are somewhat blurred. Nevertheless, some rep-
resentative algorithms from each of these subgroups are given below. In recent years, large scale
graph-based learning has also emerged as an important area ( [40]). Some representative work along
this line is also provided below.
20.4.1.1 Graph Cut
Various graph-based semi-supervised learning methods are based on graph cuts ( [11,12,34,37]).
Blum and Chawla ( [11]) formulate the semi-supervised problem as a graph mincut problem, where
in the binary case positive labels act as sources and negative labels act as sinks. The objective here is
to find a minimum set of edges that blocks the flow from sources to sinks. Subsequently, the nodes
connected to sources are labeled as positive examples and the nodes connected to the sinks are
labeled as negative examples. As observed in [32], the graph mincut problem is equivalent to solving
for the lowest energy configuration in the Markov Random Field, where the energy is expressed as
E(f) = (1/2) ∑_{i,j} w_{ij} |f_i − f_j| = (1/4) ∑_{i,j} w_{ij} (f_i − f_j)^2
where f_i ∈ {+1, −1} are binary labels. Solving for the lowest energy configuration in this Markov
random field produces a partition of the entire (labeled and unlabeled) dataset that maximally optimizes
self-consistency, subject to the constraint that the configuration must agree with the labeled examples.
This technique is extended in [12] by adding randomness to the graph structure to address some of
the shortcomings of [11] as well as to provide better theoretical justification from both the Markov
random field perspective and from sample complexity considerations.
Kveton et al. ( [37]) propose a different way of performing semi-supervised learning based on
graph cut by first computing a harmonic function solution on the data adjacency graph (similar
to [73]), then by learning a maximum margin discriminator conditioned on the labels induced by
this solution.
The solution of the above optimization problem has a harmonic property, meaning that the value of f at
each unlabeled example is the weighted average of f at the neighboring examples:
f_j = (1/d_j) ∑_{i∼j} w_{ij} f_i    for j = l + 1, . . . , l + u
where the notation i ∼ j is used to represent the fact that node i and node j are neighbors.
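This harmonic property suggests a simple fixed-point iteration that clamps the labeled values and repeatedly averages over neighbors; the sketch below is illustrative (dense symmetric W, labeled examples listed first, every node with at least one neighbor are all assumptions), not the exact procedure of [73].

```python
# Sketch of computing the harmonic solution by repeated neighborhood averaging:
# labeled values are clamped, and each unlabeled value is replaced by the
# weighted average of its neighbors until convergence.
# Assumes W is a symmetric (l+u) x (l+u) weight matrix, labeled nodes first,
# and that every node has nonzero degree.
import numpy as np

def harmonic_solution(W, y_labeled, n_labeled, n_iter=1000, tol=1e-6):
    n = W.shape[0]
    d = W.sum(axis=1)                       # degrees d_j
    f = np.zeros(n)
    f[:n_labeled] = y_labeled               # clamp labeled values (e.g. +1 / -1)
    for _ in range(n_iter):
        f_new = W.dot(f) / d                # f_j <- (1/d_j) * sum_i w_ij * f_i
        f_new[:n_labeled] = y_labeled       # keep labeled values fixed
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f
```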
Belkin et al. ( [6]) propose the following method using Tikhonov regularization, where the
regularization parameter λ controls the effect of data fitting term and smoothness term.
arg min_{f ∈ R^{l+u}}  ∑_{i=1}^{l} (f_i − y_i)^2 + λ ∑_{i,j=1}^{l+u} w_{ij} (f_i − f_j)^2    (20.5)
Zhou et al. ( [65]) propose an iterative method that essentially solves the following optimization
problem:

arg min_{f ∈ R^{l+u}}  ∑_{i=1}^{l+u} (f_i − y_i)^2 + λ ∑_{i,j=1}^{l+u} w_{ij} ( f_i/√d_i − f_j/√d_j )^2
v1 , . . . , vk ∈ Rl+u are the first k eigenvectors of the graph Laplacian. Then, in [7], essentially the
following least squares problem is solved:
α̂ = arg min_{α}  ∑_{i=1}^{l} ( y_i − ∑_{j=1}^{k} α_j v_j(x_i) )^2    (20.6)
where the notation v j (xi ) means the value of the jth eigenvector on example xi .
Note that in all the above cases, it is straightforward to get a closed form solution of the above
optimization problems.
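As an illustration of such a closed form, consider Equation 20.5: writing its smoothness term as 2λ f^T L f (by Equation 20.3) and setting the gradient to zero gives the linear system (J + 2λL) f = Jy, where J = diag(1, . . . , 1, 0, . . . , 0) selects the labeled coordinates. The sketch below is an illustrative transcription of this observation (dense matrices, labeled nodes first), not code from [6].

```python
# Closed-form solution of Equation 20.5: since the smoothness term equals
# 2*lambda * f^T L f (see Equation 20.3), setting the gradient to zero gives
# (J + 2*lambda*L) f = J y, with J selecting the l labeled coordinates.
# Assumes a connected graph and at least one labeled node (nonsingular system).
import numpy as np

def tikhonov_graph_regression(W, y_labeled, n_labeled, lam=1.0):
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian L = D - W
    J = np.zeros((n, n))
    J[:n_labeled, :n_labeled] = np.eye(n_labeled)   # data-fit only on labeled nodes
    y = np.zeros(n)
    y[:n_labeled] = y_labeled
    return np.linalg.solve(J + 2.0 * lam * L, J.dot(y))
```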
Zhou et al. ( [70]) propose an algorithm where instead of the widely used regularizer f T L f ,
an iterated higher order graph Laplacian regularizer in the form of f T Lm f , which corresponds to
Sobolev semi-norm of order m, is used. The optimization problem in this framework takes the form
arg min_{f ∈ R^{l+u}}  ∑_{i=1}^{l} (f_i − y_i)^2 + λ f^T L^m f
where again, λ controls the effect of data fitting term and smoothness term.
where H_K is the reproducing kernel Hilbert space (RKHS), ‖f‖_K is the RKHS norm of f, and ∇_M f
is the gradient of f along the manifold M, where the integral is taken over the marginal distribution.
This additional term is a smoothness penalty corresponding to the probability distribution and may
be approximated on the basis of labeled and unlabeled examples using a graph Laplacian associated
to the data. The regularization parameter γA controls the complexity of the function in the ambient
space while γI controls the complexity of the function in the intrinsic geometry of PX . Given finite
labeled and unlabeled examples, the above problem takes the form
f* = arg min_{f ∈ H_K}  (1/l) ∑_{i=1}^{l} V(x_i, y_i, f) + γ_A ‖f‖_K^2 + (γ_I/(u + l)^2) ∑_{i,j=1}^{l+u} (f(x_i) − f(x_j))^2 w_{ij}

   = arg min_{f ∈ H_K}  (1/l) ∑_{i=1}^{l} V(x_i, y_i, f) + γ_A ‖f‖_K^2 + (γ_I/(u + l)^2) f^T L f.    (20.7)
Belkin et al. ( [8]) show that the minimizer of the optimization problem (20.7) admits an expansion
of the form
f*(x) = ∑_{i=1}^{l+u} α_i K(x_i, x)
where K(·, ·) is a kernel function, and as a result it is an inductive procedure and has a natural
out-of-sample extension unlike many other semi-supervised learning algorithms.
In [55], Sindhwani et al. describe a technique to turn transductive and standard supervised learn-
ing algorithms into inductive semi-supervised learning algorithms. They give a data dependent ker-
nel that adapts to the geometry of data distribution. Starting with a base kernel K defined over the
whole input space, the proposed method warps the RKHS specified by K by keeping the same func-
tion space but altering the norm by adding a data dependent “point-cloud-norm.” This results in a
new RKHS space with a corresponding new kernel that deforms the original space along a finite
dimensional subspace defined by data. Using the new kernel, standard supervised kernel algorithms
trained on labeled examples only can perform inductive semi-supervised learning.
Finally, the class label for node k is chosen as the one that maximizes the posterior, i.e.,
arg max_c P_post(y = c|k). The unknown parameters P(y|i) are estimated either by the maximum like-
lihood method with EM or by maximum margin methods subject to constraints.
Azran ( [2]) proposes a semi-supervised learning method in which each example is associated
with a particle that moves between the examples (nodes of a graph) according to the transition
probability matrix. Labeled examples are set to be absorbing states of the Markov random walk and
the probability of each particle to be absorbed by the different labeled examples, as the number of
steps increases, is used to derive a distribution over associated missing labels. This algorithm is in
the spirit of [59] but there is a considerable difference. In [59], the random walk carries the given
labels and propagates them on the graph amongst the unlabeled examples, which results in the need
to find a good number of steps t for the walk to take. In [2], however, this parameter is set to infinity.
Note that the algorithm in [65] can also be interpreted from the random walk point of view, as
was done in [67].
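The absorbing-random-walk idea described above can be sketched with the standard absorption-probability computation for a Markov chain: treating labeled nodes as absorbing states and letting the number of steps go to infinity reduces to a single linear solve. The dense matrices and the node ordering (labeled nodes first) are illustrative assumptions, not the exact algorithm of [2].

```python
# Sketch of label inference via an absorbing random walk on the graph:
# labeled nodes are absorbing states; the probability that a walk started at an
# unlabeled node is absorbed at each labeled node (in the limit of infinitely
# many steps) defines a distribution over labels for that node.
# Assumes a weight matrix W with labeled nodes first and every unlabeled node
# able to reach some labeled node.
import numpy as np

def absorption_label_distribution(W, y_labeled, n_labeled):
    P = W / W.sum(axis=1, keepdims=True)       # transition probabilities P = D^{-1} W
    P_uu = P[n_labeled:, n_labeled:]           # unlabeled -> unlabeled block
    P_ul = P[n_labeled:, :n_labeled]           # unlabeled -> labeled block
    n_u = P_uu.shape[0]
    # Absorption probabilities: B = (I - P_uu)^{-1} P_ul, one row per unlabeled node.
    B = np.linalg.solve(np.eye(n_u) - P_uu, P_ul)
    classes = np.unique(y_labeled)
    # Aggregate absorption mass per class and pick the most probable label.
    class_mass = np.stack([B[:, y_labeled == c].sum(axis=1) for c in classes], axis=1)
    return classes[np.argmax(class_mass, axis=1)], class_mass
```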
Two algorithms are derived from this framework ( [8]): the Laplacian Regularized Least Squares
(LapRLS) problem, where the optimization problem is of the form

arg min_{f ∈ H_K}  (1/l) ∑_{i=1}^{l} (y_i − f(x_i))^2 + γ_A ‖f‖_K^2 + (γ_I/(u + l)^2) f^T L f
and the Laplacian Support Vector Machine problem (LapSVM) where the optimization problem is
of the form
arg min_{f ∈ H_K}  (1/l) ∑_{i=1}^{l} (1 − y_i f(x_i))_+ + γ_A ‖f‖_K^2 + (γ_I/(u + l)^2) f^T L f.
The only difference in the two expressions above is due to the choice of different loss functions.
In the first case it is the square loss function and in the second case it is the hinge loss function,
where (1 − y_i f(x_i))_+ is interpreted as max(0, 1 − y_i f(x_i)). It turns out that the Representer Theorem
can be used to show that the solution in both cases is an expansion of kernel functions over both the
labeled and the unlabeled examples and takes the form f*(x) = ∑_{i=1}^{l+u} α_i^* K(x, x_i), where the
coefficients α_i^* are obtained differently for LapRLS and LapSVM.
In the case of LapRLS, the solution α* = [α*_1, . . . , α*_{l+u}] is given by the following equation:

α* = ( JK + γ_A l I + (γ_I l/(u + l)^2) LK )^{−1} Y    (20.8)
where Y is an (l + u) dimensional label vector given by Y = [y1 , . . . , yl , 0, . . . , 0] and J is an (l + u) ×
(l + u) diagonal matrix given by J = diag(1, . . . , 1, 0, . . . , 0), with the first l diagonal entries as 1 and
the rest 0.
In the case of LapSVM, the solution is given by the following equation:

α* = ( 2γ_A I + 2(γ_I/(u + l)^2) LK )^{−1} J^T Y β*    (20.9)

where J = [I 0] is an l × (l + u) matrix with I the l × l identity matrix, Y = diag(y_1, . . . , y_l), and
β* is obtained by solving the following SVM dual problem:

max_{β ∈ R^l}  ∑_{i=1}^{l} β_i − (1/2) β^T Q β    (20.10)
subject to:  ∑_{i=1}^{l} β_i y_i = 0,
             0 ≤ β_i ≤ 1/l,  i = 1, . . . , l.
Details of the algorithm are given in Algorithm 20.3.
2. Choose a kernel function K(x, y). Compute the Gram matrix K_ij = K(x_i, x_j).
3. Compute the Graph Laplacian matrix: L = D −W where D is a diagonal matrix
given by D_ii = ∑_{j=1}^{l+u} W_ij.
4. Choose γA and γI .
5. Compute α∗ using Equation 20.8 for squared loss (Laplacian RLS) or using
Equation 20.9 and 20.10 for hinge loss (Laplacian SVM).
6. Output f*(x) = ∑_{i=1}^{l+u} α_i^* K(x_i, x).
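For concreteness, a direct transcription of the Laplacian RLS branch of Algorithm 20.3 (Equation 20.8) might look as follows; the RBF kernel, the reuse of the same kernel for the graph weights, and the parameter values are illustrative assumptions.

```python
# Sketch of Laplacian RLS following Equation 20.8:
# alpha* = ( J K + gamma_A * l * I + (gamma_I * l / (u+l)^2) * L K )^{-1} Y.
# Assumes labeled points come first in X; the RBF kernel and the fully connected
# weight matrix are illustrative choices.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def laplacian_rls(X, y_labeled, n_labeled, gamma_A=1e-2, gamma_I=1e-2, sigma=1.0):
    n = X.shape[0]
    l, u = n_labeled, n - n_labeled
    g = 1.0 / (2 * sigma ** 2)
    K = rbf_kernel(X, X, gamma=g)                            # Gram matrix K_ij
    W = rbf_kernel(X, X, gamma=g)                            # graph weights (same kernel here)
    L = np.diag(W.sum(axis=1)) - W                           # graph Laplacian
    J = np.diag(np.concatenate([np.ones(l), np.zeros(u)]))   # diag(1,...,1,0,...,0)
    Y = np.concatenate([y_labeled, np.zeros(u)])             # labels padded with zeros
    A = J.dot(K) + gamma_A * l * np.eye(n) + (gamma_I * l / (u + l) ** 2) * L.dot(K)
    alpha = np.linalg.solve(A, Y)
    # Out-of-sample evaluation f*(x) = sum_i alpha_i K(x_i, x).
    return lambda X_new: rbf_kernel(X_new, X, gamma=g).dot(alpha)
```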
where V(·, ·) is a convex loss function and K ∈ R^{(l+u)×(l+u)} is the kernel gram matrix whose (i, j)th
entry is the kernel function k(x_i, x_j). Some of the graph Laplacian-based algorithms fall under this
category when K^{−1} is replaced by the graph Laplacian L. Note that the kernel gram matrix is always
positive semi-definite but it may be singular. In that case the correct interpretation of f^T K^{−1} f is
lim_{µ→0+} f^T (K + µI_{(l+u)×(l+u)})^{−1} f, where I_{(l+u)×(l+u)} is the (l + u) × (l + u) identity matrix. Zhang et
al. ( [63]) show that the estimator f̂ in Equation 20.11 converges to its limit almost surely as
the number of unlabeled examples u → ∞. However, it is unclear what this particular limit estimator
is, and if it is quite different from the true underlying function then the analysis may be
uninformative. Zhang et al. ( [63]) also study the generalization behavior of Equation 20.11 in the
following form and provide the average predictive performance when a set of l examples xi1 , . . . , xil
are randomly labeled from l + u examples x1 , . . . , x(l+u) :
E_{x_{i_1},...,x_{i_l}} [ (1/u) ∑_{x_i ∉ {x_{i_1},...,x_{i_l}}} V( f̂_i(x_{i_1}, . . . , x_{i_l}), y_i ) ]
≤ inf_{f ∈ R^{l+u}} [ (1/(l + u)) ∑_{i=1}^{l+u} V(f_i, y_i) + λ f^T K^{−1} f + γ^2 trace(K)/(2λl(l + u)) ]
≤ inf_{f ∈ R^{l+u}} [ (1/(l + u)) ∑_{i=1}^{l+u} V(f_i, y_i) + (√2 γ/√l) √( trace(K/(l + u)) f^T K^{−1} f ) ],

where |∂V(p, y)/∂p| ≤ γ and the last inequality is due to the optimal value of λ. The generalization bound
suggests that design of transductive kernel in some sense controls the predictive performance and
various kernels used in [63] are closely related to the graph Laplacian (see [63] for detailed discus-
sion).
Zhou et al. ( [71]) study the error rate and sample complexity of the Laplacian eigenmap method ( [7])
shown in Equation 20.6 in the limit of infinite unlabeled examples. The analysis studied here gives
guidance to the choice of the number of eigenvectors k in Equation 20.6. In particular, it shows that
when the data lies on a d-dimensional domain, the optimal choice of k is O( (l/log(l))^{d/(d+2)} ),
yielding asymptotic error rate O( (l/log(l))^{−2/(d+2)} ). This result is based on integrated mean
squared error. By replacing integrated mean squared error with conditional mean squared error (MSE)
on the labeled examples, the optimal choice of k becomes O( l^{d/(d+2)} ) and the corresponding mean
squared error is O( l^{−2/(d+2)} ). Note that this is the optimal error rate of nonparametric local polynomial regression
on the d-dimensional unknown manifold ( [9]) and it suggests that unless specific assumptions are
made, unlabeled examples do not help.
20.5.1 Algorithms
Various algorithms have been developed based on the cluster assumption. Chapelle et al. ( [17]) pro-
pose a framework for constructing kernels that implements the cluster assumption. This is achieved
by first starting with an RBF kernel and then modifying the eigenspectrum of the kernel matrix by
introducing different transfer functions. The induced distance depends on whether two points are in
the same cluster or not.
Transductive support vector machine or TSVM ( [33]) builds the connection between marginal
density and the discriminative decision boundary by not putting the decision boundary in high den-
sity regions. This is in agreement with the cluster assumption; however, the cluster assumption was not
explicitly used in [33]. Chapelle et al. ( [18]) propose a gradient descent on the primal formulation
of TSVM that directly optimizes the slightly modified TSVM objective according to the cluster
assumption: to find a decision boundary that avoids the high density regions. They also propose
a graph-based distance that reflects cluster assumption by shrinking distances between the exam-
ples from the same clusters. Used with an SVM, this new distance performs better than standard
Euclidean distance.
Szummer et al. ( [60]) use information theory to explicitly constrain the conditional density
p(y|x) on the basis of marginal density p(x) in a regularization framework. The regularizer is a
function of both p(y|x) and p(x) and penalizes any changes in p(y|x) more in the regions with high
p(x).
Bousquet et al. ( [13]) propose a regularization-based framework that penalizes variations of
function more in high density regions and less in low density regions.
Sinha and Belkin ( [58]) propose a semi-supervised learning algorithm that exploits the cluster as-
sumption. They show that when data is clustered, i.e., the high density regions are sufficiently
separated by low density valleys, each high density area corresponds to a unique representative
eigenvector of a kernel gram matrix. Linear combinations of such eigenvectors (or, more precisely,
of their Nyström extensions) provide good candidates for classification functions when the
cluster assumption holds. First choosing an appropriate basis of these eigenvectors from unlabeled
data, and then using labeled data with Lasso to select a classifier in the span of these eigenvec-
tors, yields a classifier that has a very sparse representation in this basis. Importantly, the sparsity
corresponds naturally to the cluster assumption.
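A rough sketch of this two-stage strategy (an eigenvector basis computed from labeled plus unlabeled data, followed by a sparse Lasso fit on the labeled examples) is given below; the RBF kernel, the number of eigenvectors, the ±1 labels, and the omission of the Nyström out-of-sample extension are simplifying assumptions, so this illustrates the strategy rather than reproducing the algorithm of [58].

```python
# Rough sketch of the two-stage strategy described above: (1) compute the top
# eigenvectors of a kernel Gram matrix built from labeled + unlabeled data,
# (2) fit a sparse linear combination of them with Lasso using only the
# labeled examples. The Nystrom out-of-sample extension is omitted, so the
# predictions here are transductive (over the given points only).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Lasso

def sparse_eigenbasis_classifier(X, y_labeled, n_labeled, k=20, reg=1e-2, sigma=1.0):
    K = rbf_kernel(X, X, gamma=1.0 / (2 * sigma ** 2))
    eigvals, eigvecs = np.linalg.eigh(K)                     # ascending eigenvalues
    V = eigvecs[:, -k:]                                      # top-k eigenvectors as basis
    lasso = Lasso(alpha=reg).fit(V[:n_labeled], y_labeled)   # sparse combination
    scores = V.dot(lasso.coef_) + lasso.intercept_
    return np.sign(scores)                                   # transductive +1/-1 predictions
```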
3. Given a transfer function φ, let λ̃_i = φ(λ_i), where the λ_i are the eigenvalues of L, and construct L̃ = U Λ̃ U^T.
4. Let D̃ be a diagonal matrix with D̃_ii = 1/L̃_ii and compute the cluster kernel K̃ = D̃^{1/2} L̃ D̃^{1/2}.
observed on the whole excess risk, and one needs to consider cluster excess risk, which corresponds
to a part of the excess risk that is interesting for this problem (within clusters). This is exactly what
is done in [51] under the strong cluster assumption, which informally specifies clusters as level sets
under mild conditions (for example, clusters are separated, etc.). The estimation procedure studied
in this setting ( [51]) proceeds in three steps,
2. Identify homogeneous regions from the clusters found in step 1, that are well connected and
have certain minimum Lebesgue measure.
3. Assign a single label to each estimated homogeneous region by a majority vote on labeled
examples.
Under mild conditions, the excess cluster risk is shown to be

Õ( ( u/(1 − θ) )^{−α} + e^{−l(θδ)^2/2} ),

where α > 0, δ > 0, and θ ∈ (0, 1). The result shows that unlabeled examples are helpful given the
cluster assumption. Note that the results are very similar to the ones obtained under parametric
settings in the generative model case (see, for example, [14–16, 49, 50]).
In a separate work, Singh et al. ( [56]) study the performance gains that are possible with
semi-supervised learning under cluster assumption. The theoretical characterization presented in
this work explains why in certain situations unlabeled examples can help in learning, while in other
situations they may not. Informally, the idea is that if the margin γ is fixed, then given enough labeled
examples, a supervised learner can achieve optimal performance and thus there is no improvement
due to unlabeled examples in terms of the error rate convergence for a fixed collection of distribu-
tions. However, if the component density sets are discernible from a finite sample size u of unlabeled
examples but not from a finite sample of l labeled examples (l < u), then semi-supervised learning
can provide better performance than purely supervised learning. In fact, there are certain plausible
situations in which semi-supervised learning yields convergence rates that cannot be achieved by
any supervised learner. In fact, under mild conditions, Singh et al. ( [56]) show that there exists a
semi-supervised learner fˆl,u (obtained using l labeled and u unlabeled examples) whose expected
excess risk E(E ( fˆl,u )) (risk of this learner minus infimum risk over all possible learners) is bounded
from above as follows:
E(E(f̂_{l,u})) ≤ ε2(l) + O( 1/u + l ( u/(log u)^2 )^{−1/d} )
provided |γ| > C0 (m/(log m)2 )−1/2 . The quantity ε2 (l) in the above expression is the finite sample
upper bound of expected excess risk for any supervised learner using l labeled examples. The above
result suggests that if the decision sets are discernible using unlabeled examples (i.e., the margin is
large enough compared to average spacing between unlabeled examples), then there exists a semi-
supervised learner that can perform as well as a supervised learner with complete knowledge of the
decision sets, provided u ≫ l so that (l/ε2(l))^d = O( u/(log u)^2 ), implying that the additional term
in the above bound is negligible compared to ε2(l). Comparing the supervised learner lower bound and the
semi-supervised learner upper bound, Singh et al. ( [56]) provide different margin ranges where
semi-supervised learning can potentially be beneficial using unlabeled examples.
structure prediction. Each of these subfields is quite mature and one can find various surveys for
each of them.
Even though semi-supervised learning techniques are widely used, our theoretical understand-
ing and analysis of various semi-supervised learning methods are still far from complete. On the
one hand, we need asymptotic results of various methods saying that in the limit the estimator un-
der consideration converges to the true underlying classifier, which will ensure that the procedure
does the “right” thing given enough labeled and unlabeled examples. On the other hand, we need
to have finite sample bounds that will tell us specifically how fast the estimator converges to the
true underlying classifier as a function of the number of labeled and unlabeled examples. Results of
this type are largely missing. For example, various graph-based algorithms are quite popular and
widely used, and yet we hardly have any theoretical finite sample results! Another area of concern
is that there are many semi-supervised learning algorithms that can be very different from each other
under the same assumption, as well as very similar algorithms that differ only in minor details; in
both cases, the performance of the algorithms can be totally different. This suggests that
analysis of individual semi-supervised learning algorithms that perform well in practice is also
very important. This kind of analysis will not only help us understand the behavior of the existing
semi-supervised learning algorithms but also help us design better semi-supervised algorithms in
the future. In the coming years, researchers from the semi-supervised learning community need to
put significant effort towards resolving these issues.
Bibliography
[1] S. Abney. Bootstrapping. In Fortieth Annual Meeting of the Association for Computational
Linguistics, pages 360–367, 2002.
[2] A. Azran. The rendezvous algorithm: Multi-class semi-supervised learning with Markov ran-
dom walks. In Twenty-fourth International Conference on Machine Learning (ICML), pages
47–56, 2007.
[3] M.-F. Balcan, A. Blum and K. Yang. Co-training and expansion: Towards bridging theory and
practice. In Nineteenth Annual Conference on Neural Information Processing Systems (NIPS),
2005.
[4] M.-F. Balcan and A. Blum. An augmented PAC model for learning from labeled and unlabeled
data. In O. Chapelle, B. Schölkopf and A. Zien, editors. Semi-Supervised Learning, MIT Press,
Cambridge, MA, 2006.
[5] S. Baluja. Probabilistic modeling for face orientation discrimination: Learning from labeled
and unlabeled data. In Twelfth Annual Conference on Neural Information Processing Systems
(NIPS), pages 854–860, 1998.
[6] M. Belkin, I. Matveeva and P. Niyogi. Regularization and semi-supervised learning on large
graphs. In Seventeenth Annual Conference on Learning Theory (COLT), pages 624–638, 2004.
[7] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learn-
ing, 56:209–239, 2004.
[8] M. Belkin, P. Niyogi and V. Sindhwani. Manifold regularization: A geometric framework for
learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:
2399–2434, 2006.
[9] P. J. Bickel and B. Li. Local polynomial regression on unknown manifolds. Complex Datasets
and Inverse Problems: Tomography, Networks and Beyond, IMS Lecture Notes Monographs
Series, 54: 177–186, 2007.
[10] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Eleventh
Annual Conference on Learning Theory (COLT), pages 92–100, 1998.
[11] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In
Eighteenth International Conference on Machine Learning (ICML), pages 19–26, 2001.
[12] A. Blum, J. D. Lafferty, M. R. Rwebangira and R. Reddy. Semi-supervised learning using
randomized mincuts. In Twenty-first International Conference on Machine Learning (ICML),
2004.
[13] O. Bousquet, O. Chapelle and M. Hein. Measure based regularization. In Seventeenth Annual
Conference on Neural Information Processing Systems (NIPS), 2003.
[14] V. Castelli. The relative value of labeled and unlabeled samples in pattern recognition. PhD
Thesis, Stanford University, 1994.
[15] V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition
Letters, 16(1):105–111, 1995.
[16] V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern
recognition with an unknown mixing parameter. IEEE Transactions on Information Theory,
42(6):2102–2117, 1996.
[17] O. Chapelle, J. Weston and B. Scholkopf. Cluster kernels for semi-supervised learning. In
Sixteenth Annual Conference on Neural Information Processing Systems (NIPS), 2002.
[18] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Tenth
International Workshop on Artificial Intelligence and Statistics (AISTATS), 2005.
[19] O. Chapelle, B. Schölkopf and A. Zien, editors. Semi-Supervised Learning, MIT Press, Cam-
bridge, MA, 2006. http://www.kyb.tuebingen.mpg.de/ssl-book
[20] N. V. Chawla and G. Karakoulas. Learning from labeled and unlabeled data: An empirical
study across techniques and domains. Journal of Artificial Intelligence Research, 23:331–366,
2005.
[21] D. Cohn, L. Atlas and R. Ladner. Improving generalization with active learning. Machine
Learning, 5(2):201–221, 1994.
[22] D. Cohn, Z. Ghahramani and M. Jordan. Active learning with statistical models. Journal of
Artificial Intelligence Research, 4:129–145, 1996.
[23] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Joint
SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large
Corpora, pages 100–110, 1999.
[24] F. B. Cozman and I. Cohen. Unlabeled data can degrade classification performance of gener-
ative classifiers. In Fifteenth International Florida Artificial Intelligence Society Conference,
pages 327–331, 2002.
[25] W. Dai, Y. Chen, G.-R. Xue, Q. Yang and Y. Yu. Translated learning: Transfer learning across
different feature spaces. In Twenty-first Annual Conference on Neural Information Processing
Systems (NIPS), 2007.
[26] S. Dasgupta, M. L. Littman and D. McAllester. PAC generalization bounds for co-training. In
Fifteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 375–
382, 2001.
[27] O. Delalleau, Y. Bengio and N. L. Roux. Efficient non-parametric function induction in semi-
supervised learning. In Tenth International Workshop on Artificial Intelligence and Statistics
(AISTATS), pages 96–103, 2005.
[28] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–39, 1977.
[29] J. V. Dillon, K. Balasubramanian and G. Lebanon. Asymptotic analysis of generative semi-
supervised learning. In Twenty-seventh International Conference on Machine Learning
(ICML), pages 295–302, 2010.
[30] R. Fergus, Y. Weiss and A. Torralba. Semi-supervised learning in gigantic image collections.
In Twenty-fourth Annual Conference on Neural Information Processing Systems (NIPS), 2009.
[31] S. Goldman and Y. Zhou. Enhancing supervised learning with unlabeled data. In Seventeenth
International Conference on Machine Learning (ICML), pages 327–334, 2000.
[32] D. Greig, B. Porteous and A. Seheult. Exact maximum a posteriori estimation for binary
images. Journal of the Royal Statistical Society, Series B, 51:271–279, 1989.
[33] T. Joachims. Transductive inference for text classification using support vector machines. In
Sixteenth International Conference on Machine Learning (ICML), pages 200–209, 1999.
[34] T. Joachims. Transductive learning via spectral graph partitioning. In Twentieth International
Conference on Machine Learning (ICML), pages 290–293, 2003.
[35] R. Johnson and T. Zhang. Two-view feature generation model for semi-supervised learning.
In Twenty-fourth International Conference on Machine Learning (ICML), pages 25–32, 2007.
[36] R. Jones. Learning to extract entities from labeled and unlabeled text. In PhD Thesis, Carnegie
Mellon University, 2005.
[37] B. Kveton, M. Valko, A. Rahimi and L. Huang. Semi-supervised learning with max-margin
graph cuts. In Thirteenth International Conference on Artificial Intelligence and Statistics
(AISTATS), pages 421–428, 2010.
[38] D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In
Eleventh International Conference on Machine Learning (ICML), pages 148–156, 1994.
[39] W. Liu, J. He and S.-F. Chang. Large graph construction for scalable semi-supervised learning.
In Twenty-seventh International Conference on Machine Learning (ICML), 2010.
[40] W. Liu, J. Wang and S.-F. Chang. Robust and scalable graph-based semi-supervised learning.
Proceedings of the IEEE, 100(9):2624–2638, 2012.
[41] G. J. McLachlan and K. Basford. Mixture Models. Marcel Dekker, New York, 1988.
[42] G. J. McLachlan and T. Krishnan. The EM Algorithms and Extensions. John Wiley & Sons,
New York, 1997.
[43] B. Nadler, N. Srebro and X. Zhou. Semi-supervised learning with the graph Laplacian: The
limit of infinite unlabeled data. In Twenty-third Annual Conference on Neural Information
Processing Systems (NIPS), 2009.
[46] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. Ninth
International Conference on Information and Knowledge Management, pages 86–93, 2000.
[47] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359, 2010.
[48] G. Qi, C. Aggarwal and T. Huang. Towards cross-category knowledge propagation for learning
visual concepts. Twenty-fourth International Conference on Computer Vision and Pattern
Recognition, pages 897–904, 2011.
[49] J. Ratsaby. Complexity of learning from a mixture of labeled and unlabeled examples. PhD
Thesis, University of Pennsylvania, 1994.
[50] J. Ratsaby and S. S. Venkatesh. Learning from a mixture of labeled and unlabeled examples
with parametric side information. In Eighth Annual Conference on Learning Theory, pages
412–417, 1995.
[51] P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster
assumption. Journal of Machine Learning Research, 8: 1369–1392, 2007.
[52] M. Seeger. Learning with labeled and unlabeled data. Technical report, 2000.
http://infoscience.epfl.ch/record/161327/files/review.pdf
[53] H. Seung, M. Opper and H. Sompolinsky. Query by committee. In Fifth Annual Workshop on
Learning Theory (COLT), pages 287–294, 1992.
[54] B. M. Shahshahani and D. A. Landgrebe. The effect of unlabeled samples in reducing the
sample size problem and mitigating Hughes phenomenon. IEEE Transactions on Geoscience
and Remote Sensing, 32(5):1087–1095, 1994.
[55] V. Sindhwani, P. Niyogi and M. Belkin. Beyond the point cloud: from transductive to
semi-supervised learning. In Twenty-second International Conference on Machine Learning
(ICML), pages 824–831, 2005.
[56] A. Singh, R. Nowak and X. Zhu. Unlabeled data: Now it helps, now it doesn’t. In Twenty-
second Annual Conference on Neural Information Processing Systems (NIPS), 2008.
[57] K. Sinha and M. Belkin. The value of labeled and unlabeled examples when the model is im-
perfect. In Twenty-first Annual Conference on Neural Information Processing Systems (NIPS),
pages 1361–1368, 2007.
[58] K. Sinha and M. Belkin. Semi-supervised learning using sparse eigenfunction bases. In Twenty-
third Annual Conference on Neural Information Processing Systems (NIPS), pages 1687–1695,
2009.
[59] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In
Sixteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 945–
952, 2001.
[60] M. Szummer and T. Jaakkola. Information regularization with partially labeled data. In Sev-
enteenth Annual Conference on Neural Information Processing Systems (NIPS), pages 1025–
1032, 2002.
[61] I. W. Tsang and J. T. Kwok. Large scale sparsified manifold regularization. In Twenty-first An-
nual Conference on Neural Information Processing Systems (NIPS), pages 1401–1408, 2006.
[62] J. Wang. Semi-supervised learning for scalable and robust visual search. PhD Thesis,
Columbia University, 2011.
[63] T. Zhang and R. K. Ando. Analysis of spectral kernel design based semi-supervised learning.
In Nineteenth Annual Conference on Neural Information Processing Systems (NIPS), 2005.
[64] K. Zhang, J. T. Kwok and B. Parvin. Prototype vector machine for large scale semi-supervised
learning. In Twenty-sixth International Conference on Machine Learning (ICML), pages 1233–
1240, 2009.
[65] D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Scholkopf. Learning with local and global
consistency. In Eighteenth Annual Conference on Neural Information Processing Systems
(NIPS), pages 321–328, 2003.
[66] Y. Zhou and S. Goldman. Democratic co-learning. In Sixteenth IEEE International Conference
on Tools with Artificial Intelligence, pages 594–602, 2004.
[67] D. Zhou and B. Schölkopf. Learning from labeled and unlabeled data using random walks. In
Twenty-sixth DAGM Symposium, pages 237–244, 2004.
[68] Z. H. Zhou and M. Li. Tri-training: exploiting unlabeled data using three classifiers. IEEE
Transactions on Knowledge and Data Engineering, 17:1529–1541, 2005.
[69] Z. H. Zhou, D. C. Zhan and Q. Yang. Semi-supervised learning with very few labeled exam-
ples. In Twenty-second AAAI Conference on Artificial Intelligence, pages 675–680, 2007.
[70] X. Zhou and M. Belkin. Semi-supervised learning by higher order regularization. In Four-
teenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 892–
900, 2011.
[71] X. Zhou and N. Srebro. Error analysis of Laplacian eigenmaps for semi-supervised learning. In
Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages
901–908, 2011.
[72] X. Zhou, A. Saha and V. Sindhwani. Semi-supervised learning: Some recent advances. In
Cost-Sensitive Machine Learning, Editors: B. Krishnapuram, S. Yu, B. Rao, Chapman &
Hall/CRC Machine Learning & Pattern Recognition, 2011.
[73] X. Zhu, Z. Ghahramani and J. Lafferty. Semi-supervised learning using Gaussian fields and
harmonic functions. In Twentieth International Conference on Machine Learning (ICML),
pages 912–919, 2003.
[74] X. Zhu. Semi-Supervised Learning Literature Survey, Tech. Report, University of Wisconsin
Madison, 2005.
[75] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised Learning (Synthesis Lectures on
Artificial Intelligence and Machine Learning), Morgan and Claypool Publishers, 2009.
[76] Y. Zhu, S. J. Pan, Y. Chen, G.-R. Xue, Q. Yang and Y. Yu. Heterogeneous transfer learning for
image classification. In Special Track on AI and the Web, associated with Twenty-fourth AAAI
Conference on Artificial Intelligence, 2010.
Chapter 21
Transfer Learning
21.1 Introduction
Supervised machine learning techniques have already been widely studied and applied to vari-
ous real-world applications. However, most existing supervised algorithms work well only under a
common assumption: the training and test data are represented by the same features and drawn from
the same distribution. Furthermore, the performance of these algorithms relies heavily on collecting
high quality and sufficient labeled training data to train a statistical or computational model to make
predictions on the future data [57, 86, 132]. However, in many real-world scenarios, labeled training
data are in short supply or can only be obtained at great cost. This problem has become a
major bottleneck to making machine learning methods more applicable in practice.
In the last decade, semi-supervised learning [20, 27, 63, 89, 167] techniques have been proposed
to address the labeled data sparsity problem by making use of a large amount of unlabeled data to
discover an intrinsic data structure to effectively propagate label information. Nevertheless, most
semi-supervised methods require that the training data, including labeled and unlabeled data, and
the test data are both from the same domain of interest, which implicitly assumes the training and
test data are still represented in the same feature space and drawn from the same data distribution.
Instead of exploring unlabeled data to train a precise model, active learning, which is another
branch in machine learning for reducing annotation effort of supervised learning, tries to design
an active learner to pose queries, usually in the form of unlabeled data instances to be labeled by
an oracle (e.g., a human annotator). The key idea behind active learning is that a machine learning
algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data
from which it learns [71,121]. However, most active learning methods assume that there is a budget
for the active learner to pose queries in the domain of interest. In some real-world applications, the
budget may be quite limited, which means that the labeled data queried by active learning may not
be sufficient to learn an accurate classifier in the domain of interest.
Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and
testing to be different. The main idea behind transfer learning is to borrow labeled data or extract
knowledge from some related domains to help a machine learning algorithm to achieve greater per-
formance in the domain of interest [97,130]. Thus, transfer learning can be referred to as a different
strategy for learning models with minimal human supervision, compared to semi-supervised and
active learning. In the real world, we can observe many examples of transfer learning. For example,
we may find that learning to recognize apples might help us to recognize pears. Similarly, learning
to play the electronic organ may help facilitate learning the piano. Furthermore, in many engineer-
ing applications, it is expensive or impossible to collect sufficient training data to train models for
use in each domain of interest. It would be more practical if one could reuse the training data that
have been collected in some related domains/tasks or the knowledge that is already extracted from
some related domains/tasks to learn a precise model for use in the domain of interest. In such cases,
knowledge transfer or transfer learning between tasks or domains becomes more desirable and crucial.
Many diverse examples in knowledge engineering can be found where transfer learning can truly
be beneficial. One example is sentiment classification, where our goal is to automatically classify
reviews on a product, such as a brand of camera, into polarity categories (e.g., positive, negative, or neutral). In the literature, supervised learning methods [100] have proven to be promising and are widely used in sentiment classification. However, these methods are domain dependent, which means that
a model built on one domain (e.g., reviews on a specific product with annotated polarity categories)
by using these methods may perform poorly on another domain (e.g., reviews on another specific
product without polarity categories). The reason is that one may use different domain-specific words
to express opinions in different domains. Table 21.1 shows several review sentences of two domains:
Electronics and Video Games. In the Electronics domain, one may use words like “compact” and
TABLE 21.1: Cross-Domain Sentiment Classification Examples: Reviews of Electronics and Video Games.

Electronics:
+ Compact; easy to operate; very good picture quality; looks sharp!
+ I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
- It is also quite blurry in very dark settings. I will never buy HP again.

Video Games:
+ A very good game! It is action packed and full of excitement. I am very much hooked on this game.
+ Very realistic shooting action and good plots. We played this and were hooked.
- The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

Note: Boldfaced words are domain-specific words that occur much more frequently in one domain than in the other. "+" and "-" denote positive and negative sentiment, respectively.
“sharp” to express positive sentiment and use “blurry” to express negative sentiment, while in the Video Games domain, words like “hooked” and “realistic” indicate positive opinions and the word “boring” indicates a negative opinion. Due to the mismatch of domain-specific words between domains, a sentiment classifier trained on one domain may not work well when directly applied to other domains. Therefore, cross-domain sentiment classification algorithms are highly desirable to reduce domain dependency and manual labeling cost by transferring knowledge from related domains to the domain of interest [18, 51, 92].
The need for transfer learning may also arise in applications of wireless sensor networks, where
wireless data can easily become outdated over time or be received very differently by different devices. In these cases, the labeled data obtained in one time period or on one device may not follow the same distribution in a later time period or on another device. For example, in indoor WiFi-based localization, which aims to detect a mobile device's current location based on previously collected WiFi data, it is
very expensive to calibrate WiFi data for building a localization model in a large-scale environment
because a user needs to label a large collection of WiFi signal data at each location. However, the
values of WiFi signal strength may be a function of time, device, or other dynamic factors. As shown
in Figure 21.1, values of received signal strength (RSS) may differ across different time periods and
mobile devices. As a result, a model trained in one time period or on one device may estimate lo-
cations poorly in another time period or on another device. To reduce the re-calibration effort, we
might wish to adapt the localization model trained in one time period (the source domain) for a
new time period (the target domain), or to adapt the localization model trained on a mobile device
(the source domain) for a new mobile device (the target domain) with little or no additional calibration [91, 98, 152, 165].
As a third example, transfer learning has been shown to be promising for defect prediction in
the area of software engineering, where the goal is to build a prediction model from data sets mined
from software repositories, and the model is used to identify software defects. In the past few years,
numerous effective software defect prediction approaches based on supervised machine learning
techniques have been proposed and received a tremendous amount of attention [66, 83]. In practice,
cross-project defect prediction is necessary: new projects often do not have enough defect data to build a prediction model. This cold-start issue, well known from recommender systems [116], can be addressed by cross-project defect prediction, which builds a prediction model using data from other projects and then applies the model to new projects. However, as reported by some
researchers, cross-project defect prediction often yields poor performance [170]. One of the main
reasons for the poor cross-project prediction performance is the difference between the data dis-
tributions of the source and target projects. To improve the cross-project prediction performance
[Figure 21.1: four contour panels — (a) WiFi RSS received by device A in T1; (b) WiFi RSS received by device A in T2; (c) WiFi RSS received by device B in T1; (d) WiFi RSS received by device B in T2.]
FIGURE 21.1 (See color insert.): Contours of RSS values over a 2-dimensional environment collected from the same access point in different time periods and received by different mobile devices. Different colors denote different values of signal strength.
with little additional human supervision, transfer learning techniques are again desirable, and have
proven to be promising [87].
Generally speaking, transfer learning can be classified into two different fields: 1) transfer learn-
ing for classification, regression, and clustering problems [97], and 2) transfer learning for reinforce-
ment learning tasks [128]. In this chapter, we focus on transfer learning in data classification and its
real-world applications. Furthermore, as first introduced in a survey article [97], there are three main
research issues in transfer learning: 1) What to transfer, 2) How to transfer, and 3) When to trans-
fer. Specifically, “What to transfer” asks which part of knowledge can be extracted and transferred
across domains or tasks. Some knowledge is domain- or task-specific, which may not be observed in
other domains or tasks, while some knowledge is commonly shared by different domains or tasks,
which can be treated as a bridge for knowledge transfer across domains or tasks. After discover-
ing which knowledge can be transferred, learning algorithms need to be developed to transfer the
knowledge, which corresponds to the “how to transfer” issue. Different knowledge-transfer strate-
gies lead to specific transfer learning approaches. “When to transfer” asks in which situations it
is appropriate to use transfer learning. Likewise, we are interested in knowing in which situations
knowledge should not be transferred. In some situations, when the source domain and target domain
are not related to each other, brute-force transfer may be unsuccessful. In the worst case, it may even
hurt the performance of learning in the target domain, a situation that is often referred to as negative
transfer. Therefore, the goal of “When to transfer” is to avoid negative transfer and then ensure
positive transfer.
The rest of this chapter is organized as follows. In Section 21.2, we start by giving an overview
of transfer learning including its brief history, definitions, and different learning settings. In Sec-
tions 21.3–21.4, we summarize approaches into different categories based on two transfer learn-
ing settings, namely homogenous transfer learning and heterogeneous transfer learning. In Sec-
tions 21.5–21.6, we discuss the negative transfer issue and other research issues of transfer learning.
After that, we show diverse real-world applications of transfer learning in Section 21.7. Finally, we
give concluding remarks in Section 21.8.
21.2.1 Background
The study of transfer learning is motivated by the fact that people can intelligently apply knowl-
edge learned previously to solve new problems faster or with better solutions [47]. For example, if one is good at programming in C++, he/she may learn the Java programming language quickly, because both C++ and Java are object-oriented programming (OOP) languages and share similar programming principles. As another example, if one is good at playing table tennis, he/she may learn to play tennis quickly because the skill sets of the two sports overlap. Formally, from a psychological point of view, transfer learning, or the transfer of learning, is defined as the study of the dependency of human conduct, learning, or performance on prior experience. More than 100 years ago, researchers had already explored how individuals transfer what they have learned in one context to another context that shares similar characteristics [129].
The fundamental motivation for transfer learning in the field of machine learning is the need
for lifelong machine learning methods that retain and reuse previously learned knowledge such
that intelligent agents can adapt to new environments or novel tasks effectively and efficiently with little human supervision. Informally, transfer learning in the field of machine learning is defined as the ability of a system to recognize and apply knowledge and skills learned in previous domains or tasks to new domains or novel tasks that share some commonality.
P(y|x). In classification, labels can be binary, i.e., Y = {−1, +1}, or discrete values, i.e., multiple
classes.
For simplicity, we only consider the case where there is one source domain D_S and one target domain D_T, as this is by far the most popular setting in the research literature. The issue of knowledge transfer from multiple source domains will be discussed in Section 21.6. More specifically, we denote by D_S = {(x_{S_i}, y_{S_i})}_{i=1}^{n_S} the source domain data, where x_{S_i} ∈ X_S is a data instance and y_{S_i} ∈ Y_S is the corresponding class label. Similarly, we denote by D_T = {(x_{T_i}, y_{T_i})}_{i=1}^{n_T} the target domain data, where the input x_{T_i} is in X_T and y_{T_i} ∈ Y_T is the corresponding output. In most cases, 0 ≤ n_T ≪ n_S. Based on the notation defined above, transfer learning can be defined
as follows [97]:
Definition 1 Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T.
In the above definition, a domain is a pair D = {X, P(x)}. Thus the condition D_S ≠ D_T implies that either X_S ≠ X_T or P(x_S) ≠ P(x_T). Similarly, a task is defined as a pair T = {Y, P(y|x)}. Thus the condition T_S ≠ T_T implies that either Y_S ≠ Y_T or P(y_S|x_S) ≠ P(y_T|x_T). When the target and source domains are the same, i.e., D_S = D_T, and their learning tasks are the same, i.e., T_S = T_T, the learning
problem becomes a traditional machine learning problem. Based on whether the feature spaces or
label spaces are identical or not, we can further categorize transfer learning into two settings: 1) homogeneous transfer learning, and 2) heterogeneous transfer learning. In the following two sections, we give the definitions of these two settings and review their representative methods, respectively.
Definition 2 Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, homogeneous transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where X_S ∩ X_T ≠ ∅ and Y_S = Y_T, but P(x_S) ≠ P(x_T) or P(y_S|x_S) ≠ P(y_T|x_T).
Based on the above definition, in homogeneous transfer learning, the feature spaces of the domains overlap, and the label spaces of the tasks are identical. The difference between domains or tasks is caused by differing marginal or predictive (conditional) distributions. Approaches to homogeneous
transfer learning can be summarized into four categories: 1) instance-based approach, 2) feature-
representation-based approach, 3) model-parameter-based approach, and 4) relational-information-
based approach. In the following sections, we describe the motivations of these approaches and
introduce some representative methods of each approach.
can be further categorized into two contexts: 1) no target labeled data are available, and 2) a few
target labeled data are available.
where l(x, y, θ) is a loss function that depends on the parameter θ. Since no labeled data are assumed
to be available in the target domain, it is impossible to optimize (21.1) over target domain labeled
data. It can be proved that the optimization problem (21.1) can be rewritten as follows:
\theta^{*} = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x,y) \sim P_S}\!\left[\frac{P_T(x,y)}{P_S(x,y)}\, l(x,y,\theta)\right],    (21.2)

which aims to learn the optimal parameter θ* by minimizing the weighted expected risk over source domain labeled data. Because it is assumed that P_S(y|x) = P_T(y|x), by decomposing the joint distribution P(x,y) = P(y|x)P(x), we obtain P_T(x,y)/P_S(x,y) = P_T(x)/P_S(x). Hence, (21.2) can be further rewritten as

\theta^{*} = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{(x,y) \sim P_S}\!\left[\frac{P_T(x)}{P_S(x)}\, l(x,y,\theta)\right],    (21.3)
where the weight of a source domain instance x is the ratio of the target and source domain marginal distributions at the point x. Given a sample of source domain labeled data {(x_{S_i}, y_{S_i})}_{i=1}^{n_S}, and denoting β(x) = P_T(x)/P_S(x), a regularized empirical objective corresponding to (21.3) can be formulated as

\theta^{*} = \arg\min_{\theta \in \Theta} \; \sum_{i=1}^{n_S} \beta(x_{S_i})\, l(x_{S_i}, y_{S_i}, \theta) + \lambda\, \Omega(\theta),    (21.4)

where Ω(θ) is a regularization term to avoid overfitting on the training sample. Therefore, a key research issue in applying the ERM framework to transfer learning is how to estimate the weights {β(x_{S_i})}. Intuitively, a simple solution is to first estimate P_T(x) and P_S(x) separately, and then compute the ratio P_T(x)/P_S(x) for each source domain instance x_{S_i}. However, density estimation of P_T(x) and P_S(x) is difficult, especially when the data are high-dimensional and the sample size is small. An alternative solution is to estimate the ratio P_T(x)/P_S(x) directly.
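To make the weighted empirical risk in (21.4) concrete, the following is a minimal sketch of weighted training with an off-the-shelf regularized linear classifier, assuming the importance weights β(x_{S_i}) have already been estimated by one of the methods discussed below; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_source_model(X_src, y_src, beta, C=1.0):
    """Minimize the weighted empirical risk of Eq. (21.4):
    sum_i beta(x_Si) * loss(x_Si, y_Si, theta) + regularization,
    using the per-instance weights beta as sklearn's sample_weight."""
    clf = LogisticRegression(C=C)          # L2 regularization plays the role of Omega(theta)
    clf.fit(X_src, y_src, sample_weight=beta)
    return clf

# Illustrative usage with random data (beta would come from a density-ratio estimator).
rng = np.random.RandomState(0)
X_src = rng.randn(100, 5)
y_src = (X_src[:, 0] > 0).astype(int)
beta = rng.uniform(0.2, 2.0, size=100)     # placeholder importance weights
model = fit_weighted_source_model(X_src, y_src, beta)
```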
In the literature, there exist various ways to estimate P_T(x)/P_S(x) directly. Here we introduce three representative methods. For more information on this topic, readers may refer to [104].
Zadrozny [158] assumed that the difference in data distributions is caused by the data generation
process. Specifically, the source domain data are assumed to be sampled from the target domain
data following a rejection sampling process. Let s ∈ {0, 1} be a selector variable denoting whether an instance in the target domain is selected to generate the source domain data, i.e., s = 1 denotes that the instance is selected and s = 0 that it is not. In this way, the distribution of the selector
variable maps the target distribution onto the source distribution as follows:
where B is a parameter that limits the discrepancy between P_S(x) and P_T(x), and ε is a nonnegative parameter ensuring that the reweighted P_S(x) remains close to a probability distribution. It can be shown that the optimization problem (21.5) can be transformed into a quadratic programming (QP) problem, and that the optimal solutions {β(x_{S_i})} of (21.5) are equivalent to the ratio values P_T(x_{S_i})/P_S(x_{S_i}) of (21.3) that are to be estimated.
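Although (21.5) itself is not reproduced here, the following is a minimal sketch of a kernel-mean-matching style estimate of the weights β that is consistent with the bound B and tolerance ε described above. It assumes an RBF kernel and uses a generic nonlinear solver rather than a dedicated QP package; all names and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(X_src, X_tgt, gamma=1.0, B=10.0, eps=0.1):
    """Estimate beta by matching the kernel mean of the reweighted source
    sample to the kernel mean of the target sample (a KMM-style QP)."""
    n_s, n_t = len(X_src), len(X_tgt)
    K = rbf_kernel(X_src, X_src, gamma=gamma)                          # n_s x n_s
    kappa = (n_s / n_t) * rbf_kernel(X_src, X_tgt, gamma=gamma).sum(axis=1)

    def objective(beta):                                               # 0.5 b'Kb - kappa'b
        return 0.5 * beta @ K @ beta - kappa @ beta

    cons = [  # |sum(beta) - n_s| <= n_s * eps, written as two inequalities >= 0
        {"type": "ineq", "fun": lambda b: n_s * (1 + eps) - b.sum()},
        {"type": "ineq", "fun": lambda b: b.sum() - n_s * (1 - eps)},
    ]
    res = minimize(objective, x0=np.ones(n_s), bounds=[(0.0, B)] * n_s,
                   constraints=cons, method="SLSQP")
    return res.x
```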
As a third method, Sugiyama et al. [127] assumed that the ratio β(x) can be approximated by the following linear model,

\hat{\beta}(x) = \sum_{\ell=1}^{b} \alpha_\ell\, \psi_\ell(x),

where {ψ_ℓ(x)}_{ℓ=1}^{b} are predefined basis functions and {α_ℓ}_{ℓ=1}^{b} are the parameters to be estimated. In this way, the problem of estimating β(x) is transformed into the problem of estimating the parameters {α_ℓ}_{ℓ=1}^{b}. By denoting \hat{P}_T(x) = \hat{\beta}(x) P_S(x), the parameters can be learned by solving an optimization problem of the form

\{\hat{\alpha}_\ell\}_{\ell=1}^{b} = \arg\min_{\{\alpha_\ell\}} \; l\big(\hat{P}_T(x),\, P_T(x)\big),

where l(·) is a loss function measuring the discrepancy between the estimated target distribution \hat{P}_T(x) and the ground-truth target distribution P_T(x). Different loss functions lead to different specific algorithms. For instance, Sugiyama et al. [127] proposed to use the Kullback-Leibler divergence as the loss function, while Kanamori et al. [65] proposed to use the least-squares loss. Note that the ground-truth densities P_S(x) and P_T(x) are unknown; however, as shown in [65, 127], P_S(x) and P_T(x) can be eliminated when optimizing the parameters {α_ℓ}_{ℓ=1}^{b}.
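As one concrete instance of this direct-estimation idea, the following is a minimal sketch of the least-squares approach of Kanamori et al. [65] (unconstrained least-squares importance fitting), assuming Gaussian basis functions centered at a subset of target domain points; parameter names and default values are illustrative.

```python
import numpy as np

def gaussian_basis(X, centers, sigma):
    """psi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for each center c_l."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lsif_weights(X_src, X_tgt, n_basis=50, sigma=1.0, lam=1e-3, seed=0):
    """Least-squares fit of beta_hat(x) = sum_l alpha_l psi_l(x):
    alpha = (H + lam*I)^{-1} h, where H is the second moment of the basis on
    the source sample and h its mean on the target sample."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X_tgt), size=min(n_basis, len(X_tgt)), replace=False)
    centers = X_tgt[idx]
    Psi_s = gaussian_basis(X_src, centers, sigma)    # n_s x b
    Psi_t = gaussian_basis(X_tgt, centers, sigma)    # n_t x b
    H = Psi_s.T @ Psi_s / len(X_src)
    h = Psi_t.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return np.maximum(Psi_s @ alpha, 0.0)            # beta_hat(x_Si), clipped at zero
```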
where nTl is the number of target domain labeled data, w is the model parameter, ξS and ξT are the
slack variables to absorb errors on the source and target domain data, respectively, λS and λT are the
tradeoff parameters to balance the impact of different terms in the objective, and γi is the weight on
the source domain instance xSi . There are various ways to set the values of {γi }’s. In [142], Wu and
Dietterich proposed to simply set γi = 1 for each data point in the source domain. Jiang and Zhai [62]
proposed a heuristic method to remove the “misleading” instances from the source domain, which is
equivalent to setting γi = 0 for all “misleading” source domain instances and γi = 1 for the remaining
instances. Note that the basic classifier used in [62] is a probabilistic model instead of SVM, but the
idea is similar.
Dai et al. [38] proposed a boosting algorithm, known as TrAdaBoost, for transfer learning.
TrAdaBoost is an extension of the AdaBoost algorithm [49]. The basic idea of TrAdaBoost is to iteratively re-weight the source domain data so as to reduce the effect of the “bad” source data while encouraging the “good” source data to contribute more to the target domain. Specifically, in each
round of boosting, TrAdaBoost uses the same strategy as AdaBoost to update weights of the target
domain labeled data, while proposing a new mechanism to decrease the weights of misclassified
source domain data.
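The following is a simplified sketch of one boosting round of this re-weighting scheme (binary labels in {0, 1}); it follows the commonly described update rules rather than reproducing the exact algorithm of [38], and all names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_round(Xs, ys, Xt, yt, w_s, w_t, n_rounds):
    """One TrAdaBoost-style round: fit a weak learner on the weighted union of
    source and target data, then down-weight misclassified source instances and
    up-weight misclassified target instances."""
    X = np.vstack([Xs, Xt]); y = np.concatenate([ys, yt])
    w = np.concatenate([w_s, w_t]); w = w / w.sum()
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

    err_s = np.abs(h.predict(Xs) - ys)                # 0/1 errors on source data
    err_t = np.abs(h.predict(Xt) - yt)                # 0/1 errors on target data
    eps_t = np.sum(w_t / w_t.sum() * err_t)           # weighted error on target data
    eps_t = min(max(eps_t, 1e-10), 0.499)             # keep the update well defined

    beta_t = eps_t / (1.0 - eps_t)                    # AdaBoost-style factor (target)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(len(Xs)) / n_rounds))  # source decay

    w_s_new = w_s * np.power(beta, err_s)             # shrink weights of "bad" source data
    w_t_new = w_t * np.power(beta_t, -err_t)          # boost weights of hard target data
    return h, w_s_new, w_t_new
```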
by a bag of words with c(vi , x j ) to denote the frequency of word vi in x j . Without loss of generality,
we use a unified vocabulary W for all domains, and assume |W | = m.
For each sentiment data x j , there is a corresponding label y j , where y j = +1 if the overall sen-
timent expressed in x j is positive, and y j = −1 if the overall sentiment expressed in x j is negative.
A pair of sentiment text and its corresponding sentiment polarity {x j , y j } is called the labeled sen-
timent data. If x j has no polarity assigned, it is unlabeled sentiment data. Note that besides positive
and negative sentiment, there are also neutral and mixed sentiment data in practical applications.
Mixed polarity means that user sentiment is positive in some aspects but negative in others. Neutral polarity means that no sentiment is expressed. In this chapter, we only focus on
positive and negative sentiment data.
For simplicity, we assume that a sentiment classifier f is a linear function of the form

f(x) = \mathrm{sgn}(w^{\top} x),

where x ∈ R^{m×1}, sgn(w^⊤x) = +1 if w^⊤x ≥ 0 and −1 otherwise, and w is the weight vector of the classifier, which can be learned from a set of training data (i.e., pairs of sentiment data and their corresponding polarity labels).
Consider the example shown in Table 21.1 as a motivating example. We use the standard bag-of-words representation to represent sentiment data of the Electronics and Video Games domains. From Table 21.2, we observe that the difference between domains is caused by the frequency of the domain-specific words. On one hand, the domain-specific words of the Electronics domain, such as compact, sharp, and blurry, are not observed in the Video Games domain. On the other hand, the domain-specific words of the Video Games domain, such as hooked, realistic, and boring, are not observed in the Electronics domain. Suppose that the Electronics domain is the source domain and the Video Games domain is the target domain. Based on the three training sentences in the Electronics domain, the weights of the features compact and sharp are positive, the weight of the feature blurry is negative, and the weights of the features hooked, realistic, and boring can be arbitrary, or zero if an ℓ1-norm regularization term is imposed on w during model training. However, an ideal weight vector for the Video Games domain is supposed to have positive weights on the features hooked and realistic and a negative weight on the feature boring. Therefore, a classifier learned from the Electronics domain may predict poorly or randomly on the Video Games domain data.
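As a small illustration of the point just made, the toy sketch below trains such a linear bag-of-words classifier on made-up Electronics-style reviews and applies it to made-up Video Games-style reviews; the review strings and variable names are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up reviews: Electronics as the source domain, Video Games as the target domain.
src_docs = ["compact and very sharp picture", "really nice and sharp",
            "quite blurry in dark settings, never buy again"]
src_y = [1, 1, -1]
tgt_docs = ["hooked on this realistic game", "very realistic, we were hooked",
            "so boring, never buy again"]

vec = CountVectorizer().fit(src_docs + tgt_docs)   # unified vocabulary W over both domains
X_src, X_tgt = vec.transform(src_docs), vec.transform(tgt_docs)

clf = LinearSVC().fit(X_src, src_y)                # learns w with f(x) = sgn(w^T x)
print(clf.predict(X_tgt))                          # unreliable: "hooked", "realistic", "boring"
                                                   # never appear in the source training data
```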
Generally speaking, in sentiment classification, features can be classified into three types: 1) source domain (i.e., the Electronics domain) specific features, such as compact, sharp, and blurry; 2) target domain (i.e., the Video Games domain) specific features, such as hooked, realistic, and boring; and 3) domain-independent features, or pivot features, such as good, excited, nice, and never buy. Based on these observations, an intuitive idea of feature learning is to align the source and target domain specific features to generate cluster- or group-based features, using the domain-independent features as a bridge, such that the difference between the source and target domain data under the new feature representation is reduced. For instance, if the domain-specific features shown in Table 21.2 can be aligned in the way presented in Table 21.3, where the feature alignments are
used as new features to represent the data, then apparently a linear model learned from the source domain (i.e., the Electronics domain) can be used to make precise predictions on the target domain data (i.e., the Video Games domain).
TABLE 21.3: Using Feature Alignments as New Features to Represent Cross-Domain Data.

                Label   sharp/hooked   compact/realistic   blurry/boring
Electronics      +1          1                1                  0
                 +1          1                0                  0
                 -1          0                0                  1
Video Games      +1          1                0                  0
                 +1          1                1                  0
                 -1          0                0                  1
Therefore, there are two research issues to be addressed. A first issue is how to identify domain
independent or pivot features. A second issue is how to utilize the domain independent features and
domain knowledge to align domain specific features from the source and target domains to generate
new features. Here the domain knowledge is that if two sentiment words co-occur frequently in one
sentence or document, then their sentiment polarities tend to be the same with a high probability.
For identifying domain-independent or pivot features, researchers have proposed several heuristic approaches [18, 92]. For instance, Blitzer et al. [18] proposed to select pivot features based on their term frequency in both the source and target domains and on the mutual dependence between the features and the labels in the source domain. The idea is that a pivot feature should be discriminative for the source domain data and appear frequently in both the source and target domains. Pan et al. [92] proposed to select domain-independent features based on the mutual dependence between features and domains. Specifically, by assigning the label 1 to all instances in the source domain and the label 0 to all instances in the target domain, mutual information can be used to measure the dependence between each feature and the constructed domain labels. The motivation is that if a feature has high mutual dependence with the domains, then it is domain specific; otherwise, it is domain independent.
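The following is a minimal sketch of the selection heuristic of [92] as just described: source instances are labeled 1, target instances 0, and features with the lowest mutual information with this domain label are treated as candidate domain-independent (pivot) features; names and thresholds are illustrative.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_selection import mutual_info_classif

def select_domain_independent_features(X_src, X_tgt, n_pivots=100):
    """Rank features by mutual information with the source-vs-target label;
    low MI suggests a domain-independent (pivot) feature."""
    X = vstack([X_src, X_tgt])                         # sparse bag-of-words counts
    domain_label = np.r_[np.ones(X_src.shape[0]), np.zeros(X_tgt.shape[0])]
    mi = mutual_info_classif(X, domain_label, discrete_features=True, random_state=0)
    return np.argsort(mi)[:n_pivots]                   # indices of the lowest-MI features
```

In practice, one would typically combine this ranking with a frequency criterion so that the selected features also occur often in both domains, in the spirit of [18].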
For aligning domain-specific features from the source and target domains to generate cross-domain features, Blitzer et al. [19] proposed the structural correspondence learning (SCL) method. SCL is motivated by a multi-task learning algorithm, alternating structure optimization (ASO) [4], which aims to learn common features underlying multiple tasks. Specifically, SCL first identifies a set of pivot features of size m, and then treats each pivot feature as a new output to construct a pseudo task with the non-pivot features as inputs. After that, SCL learns m linear classifiers to model the relationships between the non-pivot features and the constructed outputs as follows:

\hat{y}_j = \mathrm{sgn}(w_j^{\top} x_{np}), \quad j = 1, \ldots, m,

where \hat{y}_j is the output constructed from the corresponding pivot feature, and x_{np} is the vector of non-pivot features. Finally, SCL performs a singular value decomposition (SVD) on the weight matrix W = [w_1 w_2 ... w_m] ∈ R^{q×m}, where q is the number of non-pivot features, such that W = U D V^⊤, where U_{q×r} and V^⊤_{r×m} are the matrices of the left and right singular vectors. The matrix D_{r×r} is a diagonal matrix of non-negative singular values, ranked in non-increasing order. The matrix U^⊤_{[1:h,:]}, where h is the number of features to be learned, is then used as a transformation to align domain-specific features and generate new features.
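The sketch below illustrates the SCL pipeline just described: predict each pivot feature's presence from the non-pivot features, stack the learned weight vectors, and take an SVD to obtain a low-dimensional projection used to augment the original representation. It uses plain ridge regression for the pivot predictors as a simplification (the original work uses a modified Huber loss), and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def scl_projection(X_np, X_pivot, h=50, alpha=1.0):
    """X_np: (n, q) non-pivot features; X_pivot: (n, m) pivot feature indicators.
    Returns an (h, q) projection built from the left singular vectors of the
    stacked pivot-predictor weights, as in structural correspondence learning."""
    q = X_np.shape[1]
    m = X_pivot.shape[1]
    W = np.zeros((q, m))
    for j in range(m):                                   # one pseudo task per pivot feature
        target = (X_pivot[:, j] > 0).astype(float)
        W[:, j] = Ridge(alpha=alpha).fit(X_np, target).coef_
    U, _, _ = np.linalg.svd(W, full_matrices=False)      # W = U D V^T
    theta = U[:, :h].T                                   # top-h left singular directions, (h, q)
    return theta

# New features: concatenate the original features with the projected ones,
# e.g. X_aug = np.hstack([X_np, X_np @ theta.T]).
```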
Pan et al. [92] proposed the Spectral Feature Alignment (SFA) method for aligning domain
specific features, which shares a similar high-level motivation with SCL. Instead of constructing
pseudo tasks to use model parameters to capture the correlations between domain-specific and
domain-independent features, SFA aims to model the feature correlations using a bipartite graph.
Specifically, in the bipartite graph, one set of nodes corresponds to the domain-independent features, and the other set of nodes corresponds to the domain-specific features of either the source or the target domain. An edge connects a domain-specific feature and a domain-independent feature if they co-occur in the same document or within a predefined window. The weight associated with an edge is the total number of co-occurrences of the corresponding domain-specific and domain-independent features in the source and target domains. The motivation for using a bipartite graph
to model the feature correlations is that if two domain specific features have connections to more
common domain independent features in the graph, they tend to be aligned or clustered together
with a higher probability. Meanwhile, if two domain independent features have connections to more
common domain specific features in the graph, they tend to be aligned together with a higher prob-
ability. After the bipartite graph is constructed, the spectral clustering algorithm [88] is applied on
the graph to cluster domain specific features. In this way, the clusters can be treated as new features
to represent cross-domain data.
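The following is a minimal sketch of the clustering step just described: build the weighted bipartite co-occurrence graph between domain-independent and domain-specific features, and run spectral clustering on it so that domain-specific features sharing many domain-independent neighbors fall into the same cluster. It is a sketch under these assumptions rather than the exact SFA algorithm of [92]; names are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def align_domain_specific_features(M, n_clusters=10):
    """M: (n_di, n_ds) co-occurrence counts between domain-independent features
    (rows) and domain-specific features (columns). Returns a cluster id for each
    domain-specific feature; each cluster acts as one new aligned feature."""
    n_di, n_ds = M.shape
    # Symmetric affinity matrix of the bipartite graph: [[0, M], [M^T, 0]].
    A = np.zeros((n_di + n_ds, n_di + n_ds))
    A[:n_di, n_di:] = M
    A[n_di:, :n_di] = M.T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(A)
    return labels[n_di:]          # cluster assignments of the domain-specific features
```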
where ϕ is the mapping to be learned, which maps the original data to a low-dimensional space.
The first term in the objective of (21.7) aims to minimize the distance in distributions between the
source and target domain data, Ω(ϕ) is a regularization term on the mapping ϕ, and the constraints
are to ensure original data properties are preserved.
Note that, in general, the optimization problem (21.7) is computationally intractable. To make it
computationally solvable, Pan et al. [90] proposed to transform the optimization problem (21.7) into a kernel matrix learning problem, which results in solving a semidefinite program (SDP). The proposed method is known as Maximum Mean Discrepancy Embedding (MMDE), and is based on the non-parametric measure MMD introduced in Section 21.3.1.1. MMDE has proven to be effective in learning features for transfer learning. However, it has two major limitations: 1) since it requires solving an SDP, it is computationally expensive; and 2) since it formulates the kernel matrix learning problem in a transductive setting, it cannot generalize to out-of-sample instances. To address these limitations, Pan et al. [95, 96] further relaxed the feature learning problem of MMDE to a generalized eigen-decomposition problem, which is very efficient and generalizes easily to out-of-sample instances. Similarly, motivated by the idea of MMDE, Si et al. [125]
proposed to use the Bregman divergence as the distance measure between sample distributions to
minimize the distance between the source and target domain data in a latent space.
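Because these methods all rely on the MMD criterion mentioned above (introduced earlier in the chapter), a minimal numpy sketch of the standard empirical estimate of the squared MMD between a source and a target sample with an RBF kernel may be helpful; the bandwidth choice is illustrative.

```python
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(X_src, X_tgt, gamma=1.0):
    """Biased empirical estimate of squared MMD between two samples:
    mean k(S,S) - 2 * mean k(S,T) + mean k(T,T)."""
    k_ss = rbf_kernel(X_src, X_src, gamma=gamma).mean()
    k_st = rbf_kernel(X_src, X_tgt, gamma=gamma).mean()
    k_tt = rbf_kernel(X_tgt, X_tgt, gamma=gamma).mean()
    return k_ss - 2.0 * k_st + k_tt
```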
where θ ∈ Rk×1 is the individual model parameter to be learned, and U ∈ Rm×k is the transformation
shared by all task data, which maps original data to a k-dimensional feature space, and needs to be
learned as well. Note that the setting of multi-task learning is different from that of transfer learning,
where a lot of labeled training data are assumed to be available in a source domain, and the focus is
to learn a more precise model for the target domain. However, the idea of common feature learning
under different tasks can still be borrowed for learning features for transfer learning by assuming
that a few labeled training data in the target domain are available. The high-level objective of feature
learning based on multi-task learning can be formulated as follows:
\min_{U, \theta_S, \theta_T} \; \sum_{t \in \{S,T\}} \sum_{i=1}^{n_t} l(U^{\top} x_{t_i}, y_{t_i}, \theta_t) + \lambda\, \Omega(\Theta, U) \quad \text{s.t. constraints on } U,    (21.8)
where Θ = [θS θT ] ∈ Rk×2 and Ω(Θ,U) is a regularization term on Θ and U. Based on different
forms of Ω(Θ,U) and different constraints on U, approaches to learning features based on multi-
task learning can be generally classified into two categories. In a first category of approaches, U
is assumed to be full rank, which means that m = k, and Θ is sparse. A motivation behind this is
that the full-rank U is only to transform the data from original space to another space of the same
dimensionality, where a few good features underlying different tasks can be found potentially, and
the sparsity assumption on Θ is to select such good features and ignore those that are not helpful
for the source and target tasks. One of the representative approaches in this category was proposed by Argyriou et al. [6], where the ℓ_{2,1}-norm is used to regularize the matrix of model parameters Θ,^1 and U is assumed to be orthogonal, i.e., U^⊤U = UU^⊤ = I. As shown in [6], the optimization problem can be transformed into a convex formulation and solved efficiently. In a follow-up work, Argyriou et al. [8] proposed a new spectral regularization function on Θ for multi-task feature learning.
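As a small illustration of this kind of regularizer, the sketch below computes the ℓ_{2,1}-norm of a parameter matrix and the row-wise soft-thresholding (proximal) step that is one standard way such penalties are handled in practice; this is a generic sketch, not the specific optimization procedure of [6].

```python
import numpy as np

def l21_norm(Theta):
    """||Theta||_{2,1} = sum over rows of the l2 norm of each row."""
    return np.sqrt((Theta ** 2).sum(axis=1)).sum()

def l21_prox(Theta, tau):
    """Row-wise soft thresholding: shrinks each row toward zero and zeroes out
    rows whose l2 norm is below tau, which induces row (feature) sparsity."""
    row_norms = np.sqrt((Theta ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return Theta * scale
```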
In a second category of approaches, U is assumed to be low rank, which means that k < m, or k ≪ m in practice, and there are no sparsity assumptions on Θ. In this way, U transforms the original data to good common feature representations directly. Representative approaches in this category
include the Alternating Structure Optimization (ASO) method, which has been mentioned in Sec-
tion 21.3.2.1. As described, in ASO, the SVD is performed on the matrix of the source and target
model-parameters to recover a low-dimensional predictive space as a common feature space. The
ASO method has been applied successfully to several applications [5, 18]. However, the proposed
optimization problem is non-convex and thus a global optimum is not guaranteed to be achieved.
Chen et al. [30] presented an improved formulation, called iASO, by proposing a novel regulariza-
tion term on U and Θ. Furthermore, in order to convert the new formulation into a convex formula-
tion, in [30], Chen et al. proposed a convex alternating structure optimization (cASO) algorithm to
solve the optimization problem.
^1 The ℓ_{2,1}-norm of Θ is defined as ‖Θ‖_{2,1} = ∑_{i=1}^{m} ‖Θ^i‖_2, where Θ^i is the i-th row of Θ.
where xS and xT are original feature vectors of the source and target domains, respectively, and 0
is a vector of zeros, whose length is equivalent to that of the original feature vector. The idea is to
reduce the difference between domains while ensuring the similarity between data within domains
is larger than that across different domains. In a follow-up work, Daumé III [40] extended the feature augmentation method to a semi-supervised learning setting. Dai et al. [36] proposed a co-clustering
based algorithm to discover common feature clusters, such that label information can be propagated
across different domains by using the common clusters as a bridge. Xue et al. [147] proposed a
cross-domain text classification algorithm that extends the traditional probabilistic latent semantic
analysis (PLSA) [58] algorithm to extract common topics underlying the source and target domain
text data for transfer learning.
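Returning to the feature augmentation idea described at the start of this paragraph, a minimal sketch of the commonly described augmented mappings (usually attributed to [39]) is shown below: source instances are mapped to (x, x, 0) and target instances to (x, 0, x), so that a standard classifier trained on the augmented space can share weights across domains through the first block. This is a sketch of the general recipe, not the exact formulation omitted above.

```python
import numpy as np

def augment_source(X_src):
    """Map each source instance x to (x, x, 0)."""
    zeros = np.zeros_like(X_src)
    return np.hstack([X_src, X_src, zeros])

def augment_target(X_tgt):
    """Map each target instance x to (x, 0, x)."""
    zeros = np.zeros_like(X_tgt)
    return np.hstack([X_tgt, zeros, X_tgt])

# A single classifier trained on np.vstack([augment_source(X_src), augment_target(X_tgt)])
# learns shared weights in the first block and domain-specific corrections in the others.
```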
way, the transferred knowledge is encoded into the model parameters. In the rest of this section, we
first introduce a simple method to show how to transfer knowledge across tasks or domains through
model parameters, and then describe a general framework of the model-parameter-based approach.
Without loss of generality, we assume that the classifier to be learned is linear and can be written
as follows,
f(x) = \langle \theta, x \rangle = \theta^{\top} x = \sum_{i=1}^{m} \theta_i x_i.
Given a lot of labeled training data in the source domain and a few labeled training data in the target domain, we further assume that the source model parameter θ_S is well trained, and our goal is to exploit the structure captured by θ_S to learn a more precise model parameter θ_T from the target domain training data.
Evgeniou and Pontil [48] proposed that the model parameter can be decomposed into two parts;
one is referred to as a task specific parameter, and the other is referred to as a common parameter.
In this way, the source and target model parameters θS and θT can be decomposed as
θS = θ0 + vS ,
θT = θ0 + vT ,
where θ_0 is the common parameter shared by the source and target classifiers, and v_S and v_T are the specific parameters of the source and target classifiers, respectively. Evgeniou and Pontil further proposed to learn the common and specific parameters by solving the following optimization problem,

\arg\min_{\theta_S, \theta_T} \; \sum_{t \in \{S,T\}} \sum_{i=1}^{n_t} l(x_{t_i}, y_{t_i}, \theta_t) + \lambda\, \Omega(\theta_0, v_S, v_T),    (21.9)
where Ω(θ0 , vS , vT ) is the regularization term on θ0 , vS and vT , and λ > 0 is the corresponding
trade-off parameter. The simple idea presented in (21.9) can be generalized to a framework of the
model-parameter-based approach as follows,
\arg\min_{\Theta} \; \sum_{t \in \{S,T\}} \sum_{i=1}^{n_t} l(x_{t_i}, y_{t_i}, \theta_t) + \lambda_1\, \mathrm{tr}(\Theta^{\top} \Theta) + \lambda_2\, f(\Theta),    (21.10)
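To make the decomposition θ_t = θ_0 + v_t in (21.9) concrete, the following is a minimal gradient-descent sketch for a squared loss with ridge penalties on θ_0, v_S, and v_T; it is an illustrative implementation of the general idea rather than the exact formulation of [48], and all parameter names and values are assumptions.

```python
import numpy as np

def fit_shared_plus_specific(Xs, ys, Xt, yt, lam0=0.1, lam_v=1.0, lr=0.01, n_iter=2000):
    """Learn theta_S = theta_0 + v_S and theta_T = theta_0 + v_T by minimizing
    squared loss on both domains plus ridge penalties on theta_0, v_S, v_T."""
    d = Xs.shape[1]
    theta0, vS, vT = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(n_iter):
        rS = Xs @ (theta0 + vS) - ys          # residuals on the source data
        rT = Xt @ (theta0 + vT) - yt          # residuals on the target data
        gS = Xs.T @ rS / len(ys)
        gT = Xt.T @ rT / len(yt)
        theta0 -= lr * (gS + gT + lam0 * theta0)
        vS -= lr * (gS + lam_v * vS)          # a larger lam_v keeps v_S and v_T small,
        vT -= lr * (gT + lam_v * vT)          # pushing most of the structure into theta_0
    return theta0 + vS, theta0 + vT
```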
Besides using (21.11), Zhang and Yeung [161] proposed the following form to model the correla-
tions between the source and target parameters,
f(\Theta) = \mathrm{tr}(\Theta^{\top} \Omega^{-1} \Theta),    (21.12)

where Ω is a covariance matrix modeling the relationships between the source and target domains, which is unknown and needs to be learned under the constraints Ω ⪰ 0 and tr(Ω) = 1. Agarwal et al. [1] proposed to use a manifold of parameters to regularize the source and target parameters as
follows:
f(\Theta) = \sum_{t \in \{S,T\}} \big\| \theta_t - \hat{\theta}_t^{M} \big\|^{2},    (21.13)

where \hat{\theta}_S^{M} and \hat{\theta}_T^{M} are the projections of the source parameter θ_S and the target parameter θ_T onto the manifold of parameters, respectively.
Besides the framework introduced in (21.10), there are a number of methods that are based on
non-parametric Bayesian modeling. For instance, Lawrence and Platt [69] proposed an efficient al-
gorithm for transfer learning based on Gaussian Processes (GP) [108]. The proposed model tries
to discover common parameters over different tasks, and an informative vector machine was intro-
duced to solve large-scale problems. Bonilla et al. [21] also investigated multi-task learning in the context of GPs, proposing a free-form covariance matrix over tasks to model inter-task dependencies, where a GP prior is used to induce correlations between tasks. Schwaighofer
et al. [118] proposed to use a hierarchical Bayesian framework (HB) together with GP for transfer
learning.
Note that the relational-information-based approach introduced in this section aims to explore
and exploit relationships between instances instead of the instances themselves for knowledge trans-
fer. Therefore, the relational-information-based approach can also be applied to heterogeneous
transfer learning problems that will be introduced in the following section, where the source and
target feature or label spaces are different.
Definition 3 Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, heterogeneous transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where X_S ∩ X_T = ∅ or Y_S ≠ Y_T.
Based on the definition, heterogeneous transfer learning can be further categorized into two con-
texts: 1) approaches to transferring knowledge across heterogeneous feature spaces, and 2) ap-
proaches to transferring knowledge across different label spaces.
translators to construct corresponding features across domains. However, in general, such trans-
lators for corresponding features are not available and difficult to construct in many real-world
applications. Kulis et al. [67] proposed an Asymmetric Regularized Cross-domain transformation (ARC-t) method to learn an asymmetric transformation across domains based on metric learning. Similar
to DAMA, ARC-t also utilizes the label information to construct similarity and dissimilarity con-
straints between instances from the source and target domains, respectively. The formulated metric
learning problem can be solved by an alternating optimization algorithm.
negative transfer is a very important issue, relatively little research has addressed it in the past. Rosenstein et al. [114] empirically showed that if two tasks are very dissimilar, then brute-force transfer may hurt the performance of the target task.
Recently, some works have explored analyzing the relatedness among tasks using task clustering techniques, such as [11, 15], which may help provide guidance on how to avoid
negative transfer automatically. Bakker and Heskes [11] adopted a Bayesian approach in which
some of the model parameters are shared for all tasks and others are more loosely connected through
a joint prior distribution that can be learned from the data. Thus, the data are clustered based on the
task parameters, where tasks in the same cluster are supposed to be related to each other. Hassan
Mahmud and Ray [80] analyzed the case of transfer learning using Kolmogorov complexity, where
some theoretical bounds are proved. In particular, they used conditional Kolmogorov complexity to
measure relatedness between tasks and transfer the “right” amount of information in a sequential
transfer learning task under a Bayesian framework. Eaton et al. [46] proposed a novel graph-based
method for knowledge transfer, where the relationships between source tasks are modeled by a
graph using transferability as the metric. To transfer knowledge to a new task, one needs to map
the target task to the graph and learn a target model on the graph by automatically determining the
parameters to transfer to the new learning task.
More recently, Argyriou et al. [7] considered situations in which the learning tasks can be di-
vided into groups. Tasks within each group are related by sharing a low-dimensional representation,
which differs among different groups. As a result, tasks within a group can find it easier to transfer
useful knowledge. Jacob et al. [60] presented a convex approach to clustered multi-task learning by designing a new spectral norm that penalizes a set of weight vectors, each of which is associated with a task. Bonilla et al. [21] proposed a multi-task learning method based on Gaussian Processes (GP),
which provides a global approach to model and learn task relatedness in the form of a task covari-
ance matrix. However, the optimization procedure introduced in [21] is non-convex and its results
may be sensitive to parameter initialization. Motivated by [21], Zhang and Yeung [161] proposed
an improved regularization framework to model the negative and positive correlation between tasks,
where the resultant optimization procedure is convex.
The above works [7, 11, 15, 21, 60, 161] on modeling task correlations are from the context
of multi-task learning. However, in transfer learning, one may be particularly interested in trans-
ferring knowledge from one or more source tasks to a target task rather than learning these tasks
simultaneously. The main concern of transfer learning is the learning performance in the target task
only. Thus, we need to answer the question of whether, given a source task and a target task, transfer learning techniques should be applied at all. Cao et al. [23] proposed an Adaptive
Transfer learning algorithm based on GP (AT-GP), which aims to adapt transfer learning schemes
by automatically estimating the similarity between the source and target tasks. In AT-GP, a new
semi-parametric kernel is designed to model correlations between tasks, and the learning procedure
targets improving performance of the target task only. Seah et al. [120] empirically studied the neg-
ative transfer problem by proposing a predictive distribution matching classifier based on SVMs to
identify the regions of relevant source domain data where the predictive distributions maximally
align with that of the target domain data, and thus avoid negative transfer.
Bibliography
[1] Arvind Agarwal, Hal Daumé III, and Samuel Gerber. Learning multiple tasks using manifold
regularization. In Advances in Neural Information Processing Systems 23, pages 46–54.
2010.
[2] Eneko Agirre and Oier Lopez de Lacalle. On robustness and domain adaptation using SVD
for word sense disambiguation. In Proceedings of the 22nd International Conference on
Computational Linguistics, pages 17–24. ACL, June 2008.
[3] Morteza Alamgir, Moritz Grosse-Wentrup, and Yasemin Altun. Multitask learning for brain-
computer interfaces. In Proceedings of the 13th International Conference on Artificial Intel-
ligence and Statistics, volume 9, pages 17–24. JMLR W&CP, May 2010.
[4] Rie K. Ando and Tong Zhang. A framework for learning predictive structures from multiple
tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[5] Rie Kubota Ando and Tong Zhang. A high-performance semi-supervised learning method for
text chunking. In Proceedings of the 43rd Annual Meeting on Association for Computational
Linguistics, pages 1–9. ACL, June 2005.
[6] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learn-
ing. In Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press,
2007.
[7] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer
learning in a heterogeneous environment. In Proceedings of the 2008 European Conference
on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer
Science, pages 71–85. Springer, September 2008.
[8] Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. A spectral
regularization framework for multi-task structure learning. In Advances in Neural Informa-
tion Processing Systems 20, pages 25–32. MIT Press, 2008.
[9] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods
for transductive transfer learning. In Workshops conducted in association with the 7th IEEE
International Conference on Data Mining, pages 77–82. IEEE Computer Society, 2007.
[10] Jing Bai, Ke Zhou, Guirong Xue, Hongyuan Zha, Gordon Sun, Belle Tseng, Zhaohui Zheng,
and Yi Chang. Multi-task learning for learning to rank in web search. In Proceedings of
the 18th ACM Conference on Information and Knowledge Management, pages 1549–1552.
ACM, 2009.
[11] Bart Bakker and Tom Heskes. Task clustering and gating for Bayesian multitask learning.
Journal of Machine Learning Research, 4:83–99, 2003.
[12] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn
Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–
175, 2010.
[13] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of represen-
tations for domain adaptation. In Advances in Neural Information Processing Systems 19,
pages 137–144. MIT Press, 2007.
[14] Shai Ben-David, Tyler Lu, Teresa Luu, and David Pal. Impossibility theorems for domain
adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence
and Statistics, volume 9, pages 129–136. JMLR W&CP, May 2010.
[15] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning.
In Proceedings of the 16th Annual Conference on Learning Theory, pages 825–830. Morgan
Kaufmann Publishers Inc., August 2003.
[16] Steffen Bickel and Tobias Scheffer. Dirichlet-enhanced spam filtering based on biased sam-
ples. In Advances in Neural Information Processing Systems 19, pages 161–168. MIT Press,
2006.
[17] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning
bounds for domain adaptation. In Advances in Neural Information Processing Systems 20,
pages 129–136. MIT Press, 2008.
[18] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and
blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics, pages 432–439. ACL, June 2007.
[19] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural
correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in
Natural Language, pages 120–128. ACL, July 2006.
[20] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In
Proceedings of the 11th Annual Conference on Learning Theory, pages 92–100, July 1998.
[21] Edwin Bonilla, Kian Ming Chai, and Chris Williams. Multi-task Gaussian process prediction.
In Advances in Neural Information Processing Systems 20, pages 153–160. MIT Press, 2008.
[22] Bin Cao, Nathan N. Liu, and Qiang Yang. Transfer learning for collective link prediction
in multiple heterogenous domains. In Proceedings of the 27th International Conference on
Machine Learning, pages 159–166. Omnipress, June 2010.
[23] Bin Cao, Sinno Jialin Pan, Yu Zhang, Dit-Yan Yeung, and Qiang Yang. Adaptive transfer
learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, pages 407–
412, AAAI Press, July 2010.
[25] Kian Ming A. Chai, Christopher K. I. Williams, Stefan Klanke, and Sethu Vijayakumar.
Multi-task gaussian process learning of robot inverse dynamics. In Advances in Neural In-
formation Processing Systems 21, pages 265–272. 2009.
[26] Yee Seng Chan and Hwee Tou Ng. Domain adaptation with active learning for word sense
disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computa-
tional Linguistics, pages 49–56. ACL, June 2007.
[27] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT
Press, Cambridge, MA, 2006.
[28] Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang,
and Belle Tseng. Multi-task learning for boosting with application to web search ranking. In
Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 1189–1198. ACM, July 2010.
[29] Bo Chen, Wai Lam, Ivor W. Tsang, and Tak-Lam Wong. Extracting discriminative concepts
for domain adaptation in text mining. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 179–188. ACM, June 2009.
[30] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A convex formulation for learning shared
structures from multiple tasks. In Proceedings of the 26th Annual International Conference
on Machine Learning, pages 137–144. ACM, June 2009.
[31] Lin Chen, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Tag-based web photo retrieval improved
by batch mode re-tagging. In Proceedings of the 23rd IEEE Conference on Computer Vision
and Pattern Recognition, pages 3440–3446, IEEE, June 2010.
[32] Tianqi Chen, Jun Yan, Gui-Rong Xue, and Zheng Chen. Transfer learning for behavioral
targeting. In Proceedings of the 19th International Conference on World Wide Web, pages
1077–1078. ACM, April 2010.
[33] Yuqiang Chen, Ou Jin, Gui-Rong Xue, Jia Chen, and Qiang Yang. Visual contextual advertis-
ing: Bringing textual advertisements to images. In Proceedings of the 24th AAAI Conference
on Artificial Intelligence. AAAI Press, July 2010.
[34] Wenyuan Dai, Yuqiang Chen, Gui-Rong Xue, Qiang Yang, and Yong Yu. Translated learn-
ing: Transfer learning across different feature spaces. In Advances in Neural Information
Processing Systems 21, pages 353–360. 2009.
[35] Wenyuan Dai, Ou Jin, Gui-Rong Xue, Qiang Yang, and Yong Yu. Eigentransfer: a unified
framework for transfer learning. In Proceedings of the 26th Annual International Conference
on Machine Learning, pages 25–31. ACM, June 2009.
[36] Wenyuan Dai, Guirong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification
for out-of-domain documents. In Proceedings of the 13th ACM International Conference on
Knowledge Discovery and Data Mining, pages 210–219. ACM, August 2007.
[37] Wenyuan Dai, Guirong Xue, Qiang Yang, and Yong Yu. Transferring naive Bayes classifiers
for text classification. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence,
pages 540–545. AAAI Press, July 2007.
[38] Wenyuan Dai, Qiang Yang, Guirong Xue, and Yong Yu. Boosting for transfer learning. In
Proceedings of the 24th International Conference on Machine Learning, pages 193–200.
ACM, June 2007.
[39] Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics, pages 256–263. ACL, June 2007.
[40] Hal Daumé III, Abhishek Kumar, and Avishek Saha. Co-regularization based semi-
supervised domain adaptation. In Advances in Neural Information Processing Systems 23,
pages 478–486. 2010.
[41] Jesse Davis and Pedro Domingos. Deep transfer via second-order markov logic. In Pro-
ceedings of the 26th Annual International Conference on Machine Learning, pages 217–224.
ACM, June 2009.
[42] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple
sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference
on Machine Learning, pages 289–296. ACM, June 2009.
[43] Lixin Duan, Ivor W. Tsang, Dong Xu, and Stephen J. Maybank. Domain transfer SVM for
video concept detection. In Proceedings of the 22nd IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 1375–1381. IEEE, June 2009.
[44] Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for heteroge-
neous domain adaptation. In Proceedings of the 29th International Conference on Machine
Learning. icml.cc/Omnipress, June 2012.
[45] Lixin Duan, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Visual event recognition in videos by
learning from web data. In Proceedings of the 23rd IEEE Conference on Computer Vision
and Pattern Recognition. IEEE, June 2010.
[46] Eric Eaton, Marie desJardins, and Terran Lane. Modeling transfer relationships between
learning tasks for improved inductive transfer. In Proceedings of the 2008 European Confer-
ence on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Com-
puter Science, pages 317–332. Springer, September 2008.
[47] Henry C. Ellis. The Transfer of Learning. The Macmillan Company, New York, 1965.
[48] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceed-
ings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 109–117. ACM, August 2004.
[49] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning
and an application to boosting. In Proceedings of the 2nd European Conference on Compu-
tational Learning Theory, pages 23–37. Springer-Verlag, 1995.
[50] Wei Gao, Peng Cai, Kam-Fai Wong, and Aoying Zhou. Learning to rank only using training
data from related domain. In Proceedings of the 33rd International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 162–169. ACM, July 2010.
[51] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sen-
timent classification: A deep learning approach. In Proceedings of the 28th International
Conference on Machine Learning, pages 513–520. Omnipress, 2011.
[52] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for
the two-sample problem. In Advances in Neural Information Processing Systems 19, pages
513–520. MIT Press, 2007.
[53] Hong Lei Guo, Li Zhang, and Zhong Su. Empirical study on the performance stability of
named entity recognition model across domains. In Proceedings of the 2006 Conference on
Empirical Methods in Natural Language Processing, pages 509–516. ACL, July 2006.
[54] Rakesh Gupta and Lev Ratinov. Text categorization with knowledge transfer from heteroge-
neous data sources. In Proceedings of the 23rd National Conference on Artificial Intelligence,
pages 842–847. AAAI Press, July 2008.
[55] Sunil Kumar Gupta, Dinh Phung, Brett Adams, Truyen Tran, and Svetha Venkatesh. Nonneg-
ative shared subspace learning and its application to social media retrieval. In Proceedings of
the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 1169–1178. ACM, July 2010.
[56] Abhay Harpale and Yiming Yang. Active learning for multi-task adaptive filtering. In Pro-
ceedings of the 27th International Conference on Machine Learning, pages 431–438. ACM,
June 2010.
[57] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference and Prediction. 2nd edition, Springer, 2009.
[58] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd An-
nual International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 50–57. ACM, 1999.
[59] Jiayuan Huang, Alex Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard
Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Infor-
mation Processing Systems 19, pages 601–608. MIT Press, 2007.
[60] Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex
formulation. In Advances in Neural Information Processing Systems 21, pages 745–752.
2009.
[61] Jing Jiang. Multi-task transfer learning for weakly-supervised relation extraction. In Pro-
ceedings of the 47th Annual Meeting of the Association for Computational Linguistics and
the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages
1012–1020. ACL, August 2009.
[62] Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in NLP. In
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics,
pages 264–271. ACL, June 2007.
[63] Thorsten Joachims. Transductive inference for text classification using support vector ma-
chines. In Proceedings of the 16th International Conference on Machine Learning, pages
200–209. Morgan Kaufmann Publishers Inc., June 1999.
[64] Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Towards semantic knowledge prop-
agation from text corpus to web images. In Proceedings of the 20th International Conference
on World Wide Web, pages 297–306. ACM, March 2011.
[65] Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct
importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
[66] Sunghun Kim, E. James Whitehead, Jr., and Yi Zhang. Classifying software changes: Clean
or buggy? IEEE Transactions on Software Engineering, 34:181–196, March 2008.
[67] Brian Kulis, Kate Saenko, and Trevor Darrell. What you saw is not what you get: Domain
adaptation using asymmetric kernel transforms. In Proceedings of the 24th IEEE Conference
on Computer Vision and Pattern Recognition, pages 1785–1792. IEEE, June 2011.
[68] Christoph H. Lampert and Oliver Krömer. Weakly-paired maximum covariance analysis
for multimodal dimensionality reduction and transfer learning. In Proceedings of the 11th
European Conference on Computer Vision, pages 566–579, September 2010.
[69] Neil D. Lawrence and John C. Platt. Learning to learn with the informative vector machine.
In Proceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[70] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding al-
gorithms. In Advances in Neural Information Processing Systems 19, pages 801–808. MIT
Press, 2007.
[71] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers.
In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pages 3–12. ACM/Springer, July 1994.
[72] Bin Li, Qiang Yang, and Xiangyang Xue. Can movies and books collaborate?: Cross-domain
collaborative filtering for sparsity reduction. In Proceedings of the 21st International Joint
Conference on Artificial Intelligence, pages 2052–2057. Morgan Kaufmann Publishers Inc.,
July 2009.
[73] Bin Li, Qiang Yang, and Xiangyang Xue. Transfer learning for collaborative filtering via a
rating-matrix generative model. In Proceedings of the 26th Annual International Conference
on Machine Learning, pages 617–624. ACM, June 2009.
[74] Fangtao Li, Sinno Jialin Pan, Ou Jin, Qiang Yang, and Xiaoyan Zhu. Cross-domain co-
extraction of sentiment and topic lexicons. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics, pages 410–419. ACL, July 2012.
[75] Lianghao Li, Xiaoming Jin, Sinno Jialin Pan, and Jian-Tao Sun. Multi-domain active learning
for text classification. In Proceedings of the 18th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 1086–1094. ACM, August 2012.
[76] Xuejun Liao, Ya Xue, and Lawrence Carin. Logistic regression with an auxiliary data source.
In Proceedings of the 22nd International Conference on Machine Learning, pages 505–512.
ACM, August 2005.
[77] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can Chinese
web pages be classified with English data source? In Proceedings of the 17th International
Conference on World Wide Web, pages 969–978. ACM, April 2008.
[78] Yiming Liu, Dong Xu, Ivor W. Tsang, and Jiebo Luo. Using large-scale web data to fa-
cilitate textual query based retrieval of consumer photos. In Proceedings of the 17th ACM
International Conference on Multimedia, pages 55–64. ACM, October 2009.
[79] Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. Transfer learning from
multiple source domains via consensus regularization. In Proceedings of the 17th ACM Con-
ference on Information and Knowledge Management, pages 103–112. ACM, October 2008.
[80] M. M. Hassan Mahmud and Sylvian R. Ray. Transfer learning using Kolmogorov complex-
ity: Basic theory and empirical evaluations. In Advances in Neural Information Processing
Systems 20, pages 985–992. MIT Press, 2008.
[81] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with mul-
tiple sources. In Advances in Neural Information Processing Systems 21, pages 1041–1048.
2009.
[82] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and
the Rényi divergence. In Proceedings of the 25th Conference on Uncertainty in Artificial
Intelligence, pages 367–374. AUAI Press, June 2009.
[83] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code attributes to learn
defect predictors. IEEE Transactions on Software Engineering, 33:2–13, January 2007.
[84] Lilyana Mihalkova, Tuyen Huynh, and Raymond J. Mooney. Mapping and revising Markov
logic networks for transfer learning. In Proceedings of the 22nd AAAI Conference on Artifi-
cial Intelligence, pages 608–614. AAAI Press, July 2007.
[85] Lilyana Mihalkova and Raymond J. Mooney. Transfer learning by mapping with minimal
target data. In Proceedings of the AAAI-2008 Workshop on Transfer Learning for Complex
Tasks, July 2008.
[86] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[87] Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. Transfer defect learning. In Proceedings
of the 35th International Conference on Software Engineering, pages 382–391. IEEE/ACM,
May 2013.
[88] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an
algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856. MIT
Press, 2001.
[89] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text clas-
sification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–
134, 2000.
[90] Sinno Jialin Pan, James T. Kwok, and Qiang Yang. Transfer learning via dimensionality
reduction. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 677–
682. AAAI Press, July 2008.
[91] Sinno Jialin Pan, James T. Kwok, Qiang Yang, and Jeffrey J. Pan. Adaptive localization in a
dynamic WiFi environment through multi-view learning. In Proceedings of the 22nd AAAI
Conference on Artificial Intelligence, pages 1108–1113. AAAI Press, July 2007.
[92] Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Chen Zheng. Cross-domain
sentiment classification via spectral feature alignment. In Proceedings of the 19th Interna-
tional Conference on World Wide Web, pages 751–760. ACM, April 2010.
[93] Sinno Jialin Pan, Dou Shen, Qiang Yang, and James T. Kwok. Transferring localization
models across space. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence,
pages 1383–1388. AAAI Press, July 2008.
[94] Sinno Jialin Pan, Zhiqiang Toh, and Jian Su. Transfer joint embedding for cross-domain
named entity recognition. ACM Transactions on Information Systems, 31(2):7:1–7:27, May
2013.
[95] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via
transfer component analysis. In Proceedings of the 21st International Joint Conference on
Artificial Intelligence, pages 1187–1192, July 2009.
[96] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via
transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
[97] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on
Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[98] Sinno Jialin Pan, Vincent W. Zheng, Qiang Yang, and Derek H. Hu. Transfer learning for
WiFi-based indoor localization. In Proceedings of the Workshop on Transfer Learning for
Complex Tasks of the 23rd AAAI Conference on Artificial Intelligence, July 2008.
[99] Weike Pan, Evan W. Xiang, Nathan N. Liu, and Qiang Yang. Transfer learning in collabora-
tive filtering for sparsity reduction. In Proceedings of the 24th AAAI Conference on Artificial
Intelligence. AAAI Press, July 2010.
[100] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification
using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing, pages 79–86. ACL, July 2002.
[101] Peter Prettenhofer and Benno Stein. Cross-language text classification using structural cor-
respondence learning. In Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 1118–1127. ACL, July 2010.
[102] Guo-Jun Qi, Charu C. Aggarwal, Yong Rui, Qi Tian, Shiyu Chang, and Thomas S. Huang.
Towards cross-category knowledge propagation for learning visual concepts. In Proceedings
of the 24th IEEE Conference on Computer Vision and Pattern Recognition, pages 897–904.
IEEE, June 2011.
[103] Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, S.V.N. Vishwanathan, and James Petter-
son. Multitask learning without label correspondences. In Advances in Neural Information
Processing Systems 23, pages 1957–1965. Curran Associates, Inc., December 2010.
[104] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence.
Dataset Shift in Machine Learning. MIT Press, 2009.
[105] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught
learning: Transfer learning from unlabeled data. In Proceedings of the 24th International
Conference on Machine Learning, pages 759–766. ACM, June 2007.
[106] Rajat Raina, Andrew Y. Ng, and Daphne Koller. Constructing informative priors using trans-
fer learning. In Proceedings of the 23rd International Conference on Machine Learning,
pages 713–720. ACM, June 2006.
[107] Parisa Rashidi and Diane J. Cook. Activity recognition based on home to home transfer
learning. In Proceedings of the Workshop on Plan, Activity, and Intent Recognition of the
24th AAAI Conference on Artificial Intelligence. AAAI Press, July 2010.
[108] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine
Learning. The MIT Press, 2005.
[109] Vikas C. Raykar, Balaji Krishnapuram, Jinbo Bi, Murat Dundar, and R. Bharat Rao. Bayesian
multiple instance learning: Automatic feature selection and inductive transfer. In Proceedings
of the 25th International Conference on Machine Learning, pages 808–815. ACM, July 2008.
[110] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning
Journal, 62(1-2):107–136, 2006.
[111] Alexander E. Richman and Patrick Schone. Mining Wiki resources for multilingual named
entity recognition. In Proceedings of 46th Annual Meeting of the Association of Computa-
tional Linguistics, pages 1–9. ACL, June 2008.
[112] Stephen Robertson and Ian Soboroff. The TREC 2002 filtering track report. In Text REtrieval
Conference, 2001.
[113] Marcus Rohrbach, Michael Stark, György Szarvas, Iryna Gurevych, and Bernt Schiele. What
helps where—and why? Semantic relatedness for knowledge transfer. In Proceedings of the
23rd IEEE Conference on Computer Vision and Pattern Recognition, pages 910–917. IEEE,
June 2010.
[114] Michael T. Rosenstein, Zvika Marx, and Leslie Pack Kaelbling. To transfer or not to transfer.
In NIPS-05 Workshop on Inductive Transfer: 10 Years Later, December 2005.
[115] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models
to new domains. In Proceedings of the 11th European Conference on Computer Vision, pages
213–226. Springer, September 2010.
[116] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods
and metrics for cold-start recommendations. In Proceedings of the 25th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253–
260. ACM, August 2002.
[117] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Ma-
chines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[118] Anton Schwaighofer, Volker Tresp, and Kai Yu. Learning Gaussian process kernels via hier-
archical Bayes. In Advances in Neural Information Processing Systems 17, pages 1209–1216.
MIT Press, 2005.
[119] Gabriele Schweikert, Christian Widmer, Bernhard Schölkopf, and Gunnar Rätsch. An empir-
ical analysis of domain adaptation algorithms for genomic sequence analysis. In Advances
in Neural Information Processing Systems 21, pages 1433–1440. 2009.
[120] Chun-Wei Seah, Yew-Soon Ong, Ivor W. Tsang, and Kee-Khoon Lee. Predictive distribution
matching SVM for multi-domain learning. In Proceedings of 2010 European Conference
on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer
Science, pages 231–247. Springer, September 2010.
[121] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648,
University of Wisconsin–Madison, 2009.
[122] Xiaoxiao Shi, Wei Fan, and Jiangtao Ren. Actively transfer domain knowledge. In Proceed-
ings of the 2008 European Conference on Machine Learning and Knowledge Discovery in
Databases, Lecture Notes in Computer Science, pages 342–357. Springer, September 2008.
[123] Xiaoxiao Shi, Wei Fan, Qiang Yang, and Jiangtao Ren. Relaxed transfer of different classes
via spectral partition. In Proceedings of the 2009 European Conference on Machine Learning
and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 366–381.
Springer, September 2009.
[124] Xiaoxiao Shi, Qi Liu, Wei Fan, Philip S. Yu, and Ruixin Zhu. Transfer learning on heteroge-
nous feature spaces via spectral transformation. In Proceedings of the 10th IEEE Interna-
tional Conference on Data Mining, pages 1049–1054. IEEE Computer Society, December
2010.
[125] Si Si, Dacheng Tao, and Bo Geng. Bregman divergence-based regularization for transfer
subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942,
2010.
[126] Michael Stark, Michael Goesele, and Bernt Schiele. A shape-based object class model for
knowledge transfer. In Proceedings of 12th IEEE International Conference on Computer
Vision, pages 373–380. IEEE, September 2009.
[127] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Von Buenau, and Motoaki
Kawanabe. Direct importance estimation with model selection and its application to covariate
shift adaptation. In Advances in Neural Information Processing Systems 20, pages 1433–
1440. MIT Press, 2008.
[128] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A
survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.
[129] Edward Lee Thorndike and Robert Sessions Woodworth. The influence of improvement in
one mental function upon the efficiency of the other functions. Psychological Review, 8:247–
261, 1901.
[130] Sebastian Thrun and Lorien Pratt, editors. Learning to Learn. Kluwer Academic Publishers,
Norwell, MA, 1998.
[131] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning cat-
egories from few examples with multi model knowledge transfer. In Proceedings of the 23rd
IEEE Conference on Computer Vision and Pattern Recognition, pages 3081–3088. IEEE,
June 2010.
[132] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[133] Alexander Vezhnevets and Joachim Buhmann. Towards weakly supervised semantic seg-
mentation by means of multiple instance and multitask learning. In Proceedings of the 23rd
IEEE Conference on Computer Vision and Pattern Recognition, pages 3249–3256. IEEE,
June 2010.
[134] Bo Wang, Jie Tang, Wei Fan, Songcan Chen, Zi Yang, and Yanzhu Liu. Heterogeneous cross
domain ranking in latent space. In Proceedings of the 18th ACM Conference on Information
and Knowledge Management, pages 987–996. ACM, November 2009.
[135] Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold
alignment. In Proceedings of the 22nd International Joint Conference on Artificial Intelli-
gence, pages 1541–1546. IJCAI/AAAI, July 2011.
[136] Hua-Yan Wang, Vincent W. Zheng, Junhui Zhao, and Qiang Yang. Indoor localization in
multi-floor environments with reduced effort. In Proceedings of the 8th Annual IEEE Inter-
national Conference on Pervasive Computing and Communications, pages 244–252. IEEE
Computer Society, March 2010.
[137] Pu Wang, Carlotta Domeniconi, and Jian Hu. Using Wikipedia for co-clustering based cross-
domain text classification. In Proceedings of the Eighth IEEE International Conference on
Data Mining, pages 1085–1090. IEEE Computer Society, 2008.
[138] Xiaogang Wang, Cha Zhang, and Zhengyou Zhang. Boosted multi-task learning for face ver-
ification with applications to web image and video search. In Proceedings of the 22nd IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pages 142–149.
IEEE, June 2009.
[139] Christian Widmer, Jose Leiva, Yasemin Altun, and Gunnar Rätsch. Leveraging sequence
classification by taxonomy-based multitask learning. In Proceedings of 14th Annual Inter-
national Conference on Research in Computational Molecular Biology, Lecture Notes in
Computer Science, pages 522–534. Springer, April 2010.
[140] Tak-Lam Wong, Wai Lam, and Bo Chen. Mining employment market via text block detection
and adaptive cross-domain information extraction. In Proceedings of the 32nd International
ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283–
290. ACM, July 2009.
[141] Dan Wu, Wee Sun Lee, Nan Ye, and Hai Leong Chieu. Domain adaptive bootstrapping for
named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing, pages 1523–1532. ACL, August 2009.
[142] Pengcheng Wu and Thomas G. Dietterich. Improving SVM accuracy by training on auxiliary
data sources. In Proceedings of the 21st International Conference on Machine Learning.
ACM, July 2004.
[143] Evan W. Xiang, Bin Cao, Derek H. Hu, and Qiang Yang. Bridging domains using world wide
knowledge for transfer learning. IEEE Transactions on Knowledge and Data Engineering,
22:770–783, 2010.
[144] Evan W. Xiang, Sinno Jialin Pan, Weike Pan, Jian Su, and Qiang Yang. Source-selection-
free transfer learning. In Proceedings of 22nd International Joint Conference on Artificial
Intelligence, pages 2355–2360. IJCAI/AAAI, July 2011.
[145] Sihong Xie, Wei Fan, Jing Peng, Olivier Verscheure, and Jiangtao Ren. Latent space domain
transfer between high dimensional overlapping distributions. In 18th International World
Wide Web Conference, pages 91–100. ACM, April 2009.
[146] Qian Xu, Sinno Jialin Pan, Hannah Hong Xue, and Qiang Yang. Multitask learning for
protein subcellular location prediction. IEEE/ACM Transactions on Computational Biology
and Bioinformatics, 8(3):748–759, 2011.
[147] Gui-Rong Xue, Wenyuan Dai, Qiang Yang, and Yong Yu. Topic-bridged PLSA for cross-
domain text classification. In Proceedings of the 31st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 627–634. ACM,
July 2008.
[148] Rong Yan and Jian Zhang. Transfer learning using task-level features with application to
information retrieval. In Proceedings of the 21st International Joint Conference on Artificial
Intelligence, pages 1315–1320. Morgan Kaufmann Publishers Inc., July 2009.
[149] Jian-Bo Yang, Qi Mao, Qiaoliang Xiang, Ivor Wai-Hung Tsang, Kian Ming Adam Chai,
and Hai Leong Chieu. Domain adaptation for coreference resolution: An adaptive ensemble
approach. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, pages 744–753. ACL,
July 2012.
[150] Jun Yang, Rong Yan, and Alexander G. Hauptmann. Cross-domain video concept detection
using adaptive SVMs. In Proceedings of the 15th International Conference on Multimedia,
pages 188–197. ACM, September 2007.
[151] Qiang Yang. Activity recognition: Linking low-level sensors to high-level intelligence. In
Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 20–
25. Morgan Kaufmann Publishers Inc., July 2009.
[152] Qiang Yang, Sinno Jialin Pan, and Vincent W. Zheng. Estimating location using Wi-Fi. IEEE
Intelligent Systems, 23(1):8–13, 2008.
[153] Tianbao Yang, Rong Jin, Anil K. Jain, Yang Zhou, and Wei Tong. Unsupervised transfer clas-
sification: Application to text categorization. In Proceedings of the 16th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining, pages 1159–1168. ACM,
July 2010.
[154] Xiaolin Yang, Seyoung Kim, and Eric Xing. Heterogeneous multitask learning with joint
sparsity constraints. In Advances in Neural Information Processing Systems 22, pages 2151–
2159. 2009.
[155] Yi Yao and Gianfranco Doretto. Boosting for transfer learning with multiple sources. In Pro-
ceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition, pages
1855–1862. IEEE, June 2010.
[156] Jie Yin, Qiang Yang, and L.M. Ni. Learning adaptive temporal radio maps for signal-strength-
based location estimation. IEEE Transactions on Mobile Computing, 7(7):869–883, July
2008.
[157] Xiao-Tong Yuan and Shuicheng Yan. Visual classification with multi-task joint sparse rep-
resentation. In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern
Recognition, pages 3493–3500. IEEE, June 2010.
[158] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Pro-
ceedings of the 21st International Conference on Machine Learning. ACM, July 2004.
[159] Kai Zhang, Joe W. Gray, and Bahram Parvin. Sparse multitask regression for identifying
common mechanism of response to therapeutic targets. Bioinformatics, 26(12):i97–i105,
2010.
[160] Yu Zhang, Bin Cao, and Dit-Yan Yeung. Multi-domain collaborative filtering. In Proceedings
of the 26th Conference on Uncertainty in Artificial Intelligence, pages 725–732. AUAI Press,
July 2010.
[161] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-
task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence,
pages 733–742. AUAI Press, July 2010.
[162] Yu Zhang and Dit-Yan Yeung. Multi-task warped Gaussian process for personalized age
estimation. In Proceedings of the 23rd IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pages 2622–2629. IEEE, June 2010.
[163] Lili Zhao, Sinno Jialin Pan, Evan W. Xiang, Erheng Zhong, Zhongqi Lu, and Qiang Yang.
Active transfer learning for cross-system recommendation. In Proceedings of the 27th AAAI
Conference on Artificial Intelligence. AAAI Press, July 2013.
[164] Vincent W. Zheng, Derek H. Hu, and Qiang Yang. Cross-domain activity recognition. In
Proceedings of the 11th International Conference on Ubiquitous Computing, pages 61–70.
ACM, September 2009.
[165] Vincent W. Zheng, Sinno Jialin Pan, Qiang Yang, and Jeffrey J. Pan. Transferring multi-
device localization models using latent multi-task learning. In Proceedings of the 23rd AAAI
Conference on Artificial Intelligence, pages 1427–1432. AAAI Press, July 2008.
[166] Vincent W. Zheng, Qiang Yang, Evan W. Xiang, and Dou Shen. Transferring localization
models over time. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence,
pages 1421–1426. AAAI Press, July 2008.
[167] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer
Sciences, University of Wisconsin-Madison, 2005.
[168] Yin Zhu, Yuqiang Chen, Zhongqi Lu, Sinno Jialin Pan, Gui-Rong Xue, Yong Yu, and Qiang
Yang. Heterogeneous transfer learning for image classification. In Proceedings of the 25th
AAAI Conference on Artificial Intelligence. AAAI Press, August 2011.
[169] Hankui Zhuo, Qiang Yang, Derek H. Hu, and Lei Li. Transferring knowledge from another
domain for learning action models. In Proceedings of 10th Pacific Rim International Confer-
ence on Artificial Intelligence, pages 1110–1115. Springer-Verlag, December 2008.
[170] Thomas Zimmermann, Nachiappan Nagappan, Harald Gall, Emanuel Giger, and Brendan
Murphy. Cross-project defect prediction: A large scale experiment on data vs. domain vs.
process. In Proceedings of the 7th Joint Meeting of the European Software Engineering Con-
ference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages
91–100. ACM, August 2009.
Chapter 22
Active Learning: A Survey
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
Xiangnan Kong
University of Illinois at Chicago
Chicago, IL
[email protected]
Quanquan Gu
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL
[email protected]
Philip S. Yu
University of Illinois at Chicago
Chicago, IL
[email protected]
22.1 Introduction
One of the great challenges in a wide variety of learning problems is the ability to obtain suffi-
cient labeled data for modeling purposes. Labeled data is often expensive to obtain, and frequently
requires laborious human effort. In many domains, unlabeled data is copious, though labels can be
attached to such data at a specific cost in the labeling process. Some examples of such data are as
follows:
• Document Collections: Large amounts of document data may be available on the Web, which
are usually unlabeled. In such cases, it is desirable to attach labels to documents in order to
create a learning model. A common approach is to manually label a subset of the documents
in order to create the training data, a process that is slow, painstaking, and laborious.
• Privacy-Constrained Data Sets: In many scenarios, the labels on records may be sensitive
information, which may be acquired at a significant query cost (e.g., obtaining permission
from the relevant entity).
• Social Networks: In social networks, it may be desirable to identify nodes with specific prop-
erties. For example, an advertising company may desire to identify nodes in the social network
that are interested in “cosmetics.” However, it is rare that labeled nodes will be available in
the network that have interests in a specific area. Identification of relevant nodes may only
occur through either manual examination of social network posts, or through user surveys.
Both processes are time-consuming and costly.
In all these cases, labels can be obtained, but only at a significant cost to the end user. An impor-
tant observation is that all records are not equally important from the perspective of labeling. For
example, some records may be noisy and contain no useful features that are relevant to classifica-
tion. Similarly, records that cleanly belong to one class or another may be helpful, but less so than
records that lie closer to the separation boundaries between the different classes.
An additional advantage of active learning methods is that they can often help in the removal of
noisy instances from the data, which can be beneficial from an accuracy perspective. In fact, some
studies [104] have shown that a carefully designed active learning method can sometimes provide
better accuracy than is available from the base data.
Clearly, given the differential value of different records, an important question that arises in ac-
tive learning is as follows:
How do we select instances from the underlying data to label, so as to achieve the most effec-
tive training for a given level of effort?
Different performance criteria may be used to quantify and fine-tune the tradeoffs between accuracy
and cost, but the broader goal of all the criteria is to maximize the “bang for the buck,” that is, to
spend the minimum labeling effort while improving the accuracy as much as possible. An
excellent survey on active learning may be found in [105].
Every active learning system has two primary components, one of which is already given:
• Oracle: This provides the responses to the underlying query. The oracle may be a human
labeler, a cost-driven data acquisition system, or any other methodology. It is important to
note that the oracle algorithm is part of the input, though the user may play a role in its
design. For example, in a multimedia application, the user may look at an image and provide
a label, but this comes at an expense of human labor [115]. However, for most of the active
learning algorithms, the oracle is really treated as a black box that is used directly.
• Query System: The job of the query system is to pose queries to the oracle for labels of specific
records. It is here that most of the challenges of active learning systems are found.
Numerous strategies are possible for different active learning scenarios. At the highest level, these
strategies differ in the broader framework within which the queries are posed to the learner.
• Membership Query Synthesis: In this case, the learner actively synthesizes instances from the
entire space, and does not necessarily sample from some underlying distribution [3]. The key
here is that the learner may actually construct instances from the underlying space, which
may not be a part of any actual pre-existing data. However, this may lead to challenges in
the sense that the constructed examples may not be meaningful. For example, a synthesized
image from a group of pixels will very rarely be meaningful. On the other hand, arbitrarily
chosen spatial coordinates in a sea surface temperature prediction system will almost always
be meaningful. Therefore, the usability of the approach clearly depends upon the underlying
scenario.
• Selective or Sequential Sampling: In this case, the samples are drawn from the underlying data
distribution, and the learner decides whether or not each of them should be labeled [21]. Because
each query corresponds to an actual instance drawn from the underlying data distribution, it is
guaranteed to make sense. The instances are sampled one by one, and a decision about whether
to query each one must be made in real time. Such an approach is therefore synonymous with the
streaming scenario. This terminology is, however, overloaded, since many works such as those
in [104] use the term “selective sampling” to refer to another strategy described below.
• Pool-based Sampling: As indicated by its name, it suggests the availability of a base “pool”
of instances from which to query the labels of records [74]. The task of the learner is to
therefore determine instances from this pool (typically one by one), which are as informative
as possible for the active learning process. This situation is encountered very commonly in
practical scenarios, and also allows for relatively clean models for active learning.
The vast majority of the strategies in the literature use pool-based sampling, and in fact some
works such as [104] refer to pool-based sampling as selective sampling. Therefore, this chapter will
mostly focus on pool-based strategies, since these form the core of most active learning methods.
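To make the pool-based setting concrete, the following is a minimal sketch of a generic pool-based active learning loop in Python. The names query_oracle and select_query are hypothetical placeholders for the oracle and the query system discussed above, and any classifier object with scikit-learn-style fit and predict_proba methods could be plugged in; this only illustrates the control flow, not any specific algorithm from the literature.

```python
import numpy as np

def pool_based_active_learning(model, X_pool, query_oracle, select_query,
                               budget, n_seed=5, seed=0):
    """Generic pool-based loop: label a few seed points, then repeatedly ask the
    query strategy for the index of the most informative unlabeled instance."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_seed, replace=False)]
    y_labeled = [query_oracle(i) for i in labeled]            # seed labels from the oracle
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

    for _ in range(budget):
        model.fit(X_pool[labeled], np.array(y_labeled))        # retrain on current labels
        i = select_query(model, X_pool, unlabeled)             # query system picks one index
        labeled.append(i)
        y_labeled.append(query_oracle(i))                      # oracle provides the label
        unlabeled.remove(i)

    model.fit(X_pool[labeled], np.array(y_labeled))
    return model, labeled
```

The select_query argument is where the strategies discussed later in this chapter (uncertainty sampling, query-by-committee, and so on) would be plugged in.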
Beyond these strategies, a number of other basic scenarios are possible. For example, in batch
learning, an entire set of examples needs to be labeled at a given time for the active learning process.
An example of this is the Amazon Mechanical Turk, in which a whole batch of examples is made
available for labeling at once. This is different from the methods common in the literature,
in which samples are labeled one by one, so that the learner has a chance to adjust the model, before
selecting the next example. In such cases, it is usually desirable to incorporate diversity in the batch
of labeled instances, in order to ensure that there is not too much redundancy within a particular
batch of labeled instances [16, 55, 57, 119].
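As an illustration of the batch setting, here is a small sketch of a greedy batch selector that trades an instance's individual informativeness score against its redundancy with instances already placed in the batch. The cosine-based redundancy penalty and the weight lam are illustrative choices for this sketch, not a reconstruction of any specific method from [16, 55, 57, 119].

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def greedy_diverse_batch(scores, X_unlabeled, batch_size, lam=0.5):
    """Greedily pick a batch with high scores and low similarity to already-chosen points.

    scores: 1-D array of informativeness scores (e.g., uncertainty), aligned with X_unlabeled.
    """
    scores = np.asarray(scores, dtype=float)
    sims = cosine_similarity(X_unlabeled)                      # pairwise similarities
    chosen = [int(np.argmax(scores))]
    for _ in range(batch_size - 1):
        redundancy = sims[:, chosen].max(axis=1)               # similarity to the batch so far
        utility = scores - lam * redundancy                    # penalize redundant picks
        utility[chosen] = -np.inf                              # never re-pick a chosen point
        chosen.append(int(np.argmax(utility)))
    return chosen
```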
Active learning has numerous challenges, in that it does not always improve the accuracy of
classification. While some of these issues may be related to algorithmic aspects such as sample
selection bias [12], other cases are inherent to the nature of the underlying data. However, in many
special cases, it has been shown [30] that the number of labels needed to learn actively can be logarithmic
in the usual sample complexity of passive learning.
This chapter will discuss the different methods that are used for active learning. First, we will
provide a motivational example of how the selective sampling approach can describe the contours
of the different class boundaries with far fewer examples. Then, we will provide a discussion of the
different strategies that are commonly used for active learning. We will see that a number of different
scenarios are possible for the active learning process in terms of how the samples are selected.
This chapter is organized as follows. Section 22.2 provides an example of how active learning
provides advantages for the learning process. We will also discuss its relationship to other methods
in the literature such as semi-supervised learning. Section 22.3 discusses query strategy frameworks
for active learning. Section 22.4 studies models for theoretical active learning. Section 22.5 dis-
cusses the methodologies for handling complex data types such as sequences and graphs. Section
22.6 discusses advanced topics for active learning, such as streaming data, feature learning, and
class-based querying. Section 22.7 discusses the conclusions and summary.
FIGURE 22.1: (a) Class separation (classes A and B); (b) random sample with SVM classifier; (c) active sample with SVM classifier.
of samples, an SVM classifier will be unable to accurately divide the data space. This is shown in
Figure 22.1(b), where a portion of the data space is incorrectly classified, because of the error of
modeling the SVM classifier. In Figure 22.1(c), we have shown an example of a well chosen set of
seven instances along the decision boundary of the two classes. In this case, the SVM classifier is
able to accurately model the decision regions between the two classes. This is because of the careful
choice of the instances chosen by the active learning process. An important point to note is that it is
particularly useful to sample instances that provide a distinct view of how the different classes are
separated in the data. As will be discussed later, this general principle is used quite frequently in a
variety of active learning methods, where regions of greater uncertainty are often sampled in order
to obtain the relevant decision boundaries [75].
Since active learning is an approach that often uses human feedback in order to account for the
lack of obtaining training examples, it is related to a number of other methods that either use human
feedback or augment it with other kinds of training information. These two classes of methods are
as follows:
• Human Feedback: Human feedback is often used to improve the accuracy of the learning pro-
cess. For example, decision trees and other classifiers may be built with the active intervention
of the user [4, 113]. In this case, the model itself is constructed with user intervention, rather
than through the choice of data examples.
• Semi-supervised and Transfer Learning: In this case, other kinds of data (e.g., unlabeled data
or labeled data from other domains) are used in order to overcome the lack of training exam-
ples [4, 9, 29, 94, 97, 134].
Since both of these forms of supervision share some common principles with active learning, they
will be discussed in some detail below.
• Human Feedback: In this case, human feedback is used to guide the classification process, typically at the level of labels [4, 113]. A detailed discussion on the use of human feedback is provided in the
chapter on visual classification in this book.
• Semi-supervised Learning: In this case, unlabeled data is used in order to learn the base
distribution of the data in the underlying space [9, 94]. Once the base distribution of the data
has been learned, it is combined with the labeled data in order to learn the class contours
more effectively. The idea here is that the data is often either aligned along low dimensional
manifolds [9], or is clustered in specific regions [94], and this can be learned effectively
from the training data. This additional information helps in learning, even when the amount
of labeled data available is small. For example, when the data is clustered [94], each of the
dense regions typically belongs to a particular class, and a very small number of labels can
map the different dense regions to the different classes.
• Transfer Learning: Transfer learning also uses additional labeled data from a different source,
and may sometimes even be drawn from a different domain. For example, consider the case
where it is desirable to classify Chinese documents with a small training collection. While
it may be harder to obtain labeled Chinese documents, it is much easier to obtain labeled
English documents. At the same time, data that provides correspondence between Chinese
and English documents may be available. These different kinds of data may be combined
together in order to provide a more effective training model for Chinese documents. Thus,
the core idea here is to “transfer” knowledge from one domain to the other. However, transfer
learning does not actively acquire labels in order to augment the sparse training data.
Thus, human feedback methods, transfer learning methods, and semi-supervised learning meth-
ods are all designed to handle the problem of paucity of training data. This is their shared character-
istic with active learning methods. However, at the detailed level, the strategies are quite different.
Both semi-supervised learning and transfer learning have been discussed in detail in different chap-
ters of this book.
of unrepresentative outliers. This situation is especially likely to occur in data sets that are very
noisy. In order to address such issues, some models focus directly on the error itself, or try to find
samples that are representative of the underlying data. Therefore, we broadly classify the querying
strategies into one of three categories:
• Heterogeneity-based models: These models attempt to sample from regions of the space that
are either more heterogeneous, or dissimilar to what has already been seen so far. Examples of
such models include uncertainty sampling, query-by-committee, and expected model change.
All these methods are based on sampling either uncertain regions of the data, or those that
are dissimilar to what has been queried so far. These models only look at the heterogeneity
behavior of the queried instance, rather than the effect of its addition on the performance of a
classifier on the remaining unlabeled instances.
• Performance-based models: These models attempt to directly optimize the performance of
the classifier in terms of measures such as error or variance reduction. One characteristic of
these methods is that they look at the effect of adding the queried instance on the performance
of the classifier on the remaining unlabeled instances.
• Representativeness-based models: These models attempt to create data that is as representa-
tive as possible of the underlying population of training instances. Density-based models are an
example of such scenarios. In these cases, a product of a heterogeneity crite-
rion and a representativeness criterion is used in order to model the desirability of querying a
particular instance. Thus, these methods try to balance the representativeness criteria with the
uncertainty properties of the queried instance.
Clearly, there is significant diversity in the strategies that one may use in the active learning
process. These different strategies have different tradeoffs and work differently, depending upon the
underlying application, analyst goal, and data distribution. This section will provide an overview of
these different strategies for active learning. Throughout the following discussion, it will be assumed
that there are a total of k classes, though some cases will also be analyzed in the binary scenario
when k is set to 2.
A pair of criteria that are especially relevant for k-ary classification are the entropy measure and the gini-
index. If the predicted probabilities of the k classes are p1 . . . pk , respectively, based on the current
set of labeled instances, then the entropy measure En(X) is defined as follows:
En(X) = -\sum_{i=1}^{k} p_i \cdot \log(p_i).
Larger values of the entropy indicate greater uncertainty. Therefore, this objective function needs
to be maximized. Note that an equal proportion of labels across the different classes results in the
highest possible entropy. A second measure is the gini-index G.
G(X) = 1 - \sum_{i=1}^{k} p_i^2.
As in the case of entropy, higher values of the gini-index indicate greater uncertainty. It should be
pointed out that some of these measures may not work in the case of imbalanced data, where the
classes are not evenly distributed. In such cases, the classes may be associated with costs, where the
cost of misclassifying class i is denoted by wi. Each probability pi is then replaced by a value
proportional to pi · wi, with the constant of proportionality determined by the requirement that the
probability values sum to 1.
Numerous active sampling techniques have been developed in the literature on the basis of these
principles, and extensive comparisons have also been performed between these different techniques.
The interested reader is referred to [27, 60, 74, 75, 102, 106] for the different techniques, and to
[68, 103, 106] for the comparison of these measures. It should also be pointed out that it is not
necessary to use a Bayes model that explicitly predicts probabilities. In practice, it is sufficient to
use any model that provides a prediction confidence for each class label. This can be converted into
a pseudo-probability for each class, and used heuristically for the instance-selection process.
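As a concrete illustration of these measures, the following sketch scores unlabeled instances by entropy or the gini-index, assuming a classifier with a scikit-learn-style predict_proba method; the optional costs argument implements the cost-weighted rescaling of the probabilities described above.

```python
import numpy as np

def uncertainty_scores(model, X_unlabeled, costs=None, measure="entropy"):
    """Return one uncertainty score per instance; larger values mean greater uncertainty."""
    P = model.predict_proba(X_unlabeled)                 # shape (n_instances, k)
    if costs is not None:                                # replace p_i by a value proportional to p_i * w_i
        P = P * np.asarray(costs)[None, :]
        P = P / P.sum(axis=1, keepdims=True)
    if measure == "entropy":
        return -np.sum(P * np.log(P + 1e-12), axis=1)    # En(X) = -sum_i p_i log p_i
    return 1.0 - np.sum(P ** 2, axis=1)                  # G(X) = 1 - sum_i p_i^2

def select_most_uncertain(model, X_pool, unlabeled_idx):
    """Query strategy: return the index of the most uncertain unlabeled instance."""
    scores = uncertainty_scores(model, X_pool[unlabeled_idx])
    return unlabeled_idx[int(np.argmax(scores))]
```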
22.3.1.2 Query-by-Committee
This approach [109] uses a committee of different classifiers, which are trained on the current
set of labeled instances. These classifiers are then used to predict the class label of each unlabeled
instance. The instance for which the classifiers disagree the most is selected as the relevant one in
this scenario. At an intuitive level, the query-by-committee method achieves similar heterogeneity
goals as the uncertainty sampling method, except that it does so by measuring the differences in the
predictions of different classifiers, rather than the uncertainty of labeling a particular instance. Note
that an instance that is classified to different classes with almost equal probability (as in uncertainty
sampling) is more likely to be predicted in different classes by different classifiers. Thus, there is
significant similarity between these methods at an intuitive level, though they are generally treated
as very different methods in the literature. Interestingly, the method for measuring the disagreement
is also quite similar between the two classes of methods. For example, by replacing the prediction
probability pi of each class i with the fraction of votes received for each class i, it is possible to
obtain similar measures for the entropy and the gini-index. In addition, other probabilistic measures
such as the KL-divergence have been proposed in [84] for this purpose.
The construction of the committee can be achieved by either varying the model parameters of a
particular classifier (through sampling) [28, 84], or by using a bag of different classifiers [1]. It has
been shown that the use of a small number of classifiers is generally sufficient [109], and the use of
diverse classifiers in the committee is generally beneficial [89].
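A minimal sketch of query-by-committee is shown below, with a committee of logistic regression models trained on bootstrap resamples of the current labeled set and disagreement measured by the vote entropy; the committee size, the base learner, and the assumption that class labels are 0, ..., k−1 are illustrative choices of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_committee(X_lab, y_lab, n_members=5, seed=0):
    """Train committee members on bootstrap resamples of the labeled set
    (the sketch assumes every resample contains at least two classes)."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)
        committee.append(LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx]))
    return committee

def vote_entropy(committee, X_unlabeled, n_classes):
    """Disagreement score: entropy of the fraction of committee votes per class."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee])   # (members, n_instances)
    scores = np.zeros(X_unlabeled.shape[0])
    for c in range(n_classes):
        frac = np.mean(votes == c, axis=0)                          # fraction voting for class c
        scores -= frac * np.log(frac + 1e-12)                       # entropy over vote fractions
    return scores                                                   # query the argmax
```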
In this approach, the instance with the greatest expected change of the gradient of the objective function with respect to the model parameters is used. The intuition of such an approach
is to use an instance that is most different from the current model that is already known. Thus,
this is also a heterogeneity-based approach, as is the case with uncertainty sampling, and query-by-
committee. Such an approach is only applicable to models where gradient-based training is used,
such as discriminative probabilistic models. Let δgi (X) be the change in the gradient with respect to
the model parameters, if the training label of the candidate instance X (with unknown label) is i. Let
pi be the posterior probability of class i for the instance X, with respect to the current labeled set in the training
data. Then, the expected model change C(X) with respect to the instance X is defined as follows:
C(X) = \sum_{i=1}^{k} p_i \cdot \delta g_i(X).
The instance X with the largest value of C(X) is queried for the label. Numerous querying techniques
that have been proposed using this approach may be found in [25, 47, 106, 107].
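For a concrete instance of this criterion, the sketch below computes the expected model change for binary logistic regression, where the gradient of the log-loss for a single example (x, y) with respect to the weight vector is (p1 − y)·x, so the criterion reduces to a posterior-weighted gradient norm. The closed-form gradient is specific to logistic regression, and the use of the gradient norm is one common convention; both are assumptions of this sketch rather than something prescribed by the references above.

```python
import numpy as np

def expected_model_change(model, X_unlabeled):
    """C(X) = sum_i p_i * ||delta g_i(X)|| for a binary logistic regression model.

    If the hypothesized label is y in {0, 1}, the per-example gradient w.r.t. the
    weights is (p1 - y) * x, so its norm is |p1 - y| * ||x||.
    """
    P = model.predict_proba(X_unlabeled)             # columns: [p0, p1]; classes assumed {0, 1}
    p0, p1 = P[:, 0], P[:, 1]
    norms = np.linalg.norm(X_unlabeled, axis=1)
    c = p1 * np.abs(p1 - 1.0) * norms + p0 * np.abs(p1 - 0.0) * norms
    return c                                          # query the instance with the largest C(X)
```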
The value of E(X,V ) is maximized rather than minimized (as in the case of uncertainty-based
models). Furthermore, the error objective is a function of both the queried instance and the set
of unlabeled instances V . This result can easily be extended to the case of k-way models by us-
ing the entropy criterion, as was discussed in the case of uncertainty-based models. In that case,
the expression above is modified to replace ||P_j^{(X,i)}(Z) - 0.5|| with the class-specific entropy term
-P_j^{(X,i)}(Z) \cdot \log(P_j^{(X,i)}(Z)). Furthermore, this criterion needs to be minimized. In this context, the
minimization of the expression can be viewed as the minimization of the expected loss function.
This general framework has been used in a variety of different contexts in [53, 93, 100, 135].
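A sketch of this expected error (loss) reduction idea in its entropy form is shown below: for each candidate instance and each hypothesized label i, the classifier is retrained with the candidate added, the class-specific entropy over the remaining unlabeled pool is accumulated, and the candidate minimizing the p_i-weighted total is queried. The naive retrain-per-candidate loop only exposes the criterion; practical implementations in the cited works use much cheaper updates, and the uniform reliance on scikit-learn-style fit/predict_proba/classes_ is an assumption of the sketch.

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_lab, y_lab, X_pool, candidate_idx, pool_idx):
    """Return the candidate index whose labeling minimizes the expected pool entropy."""
    best_idx, best_val = None, np.inf
    for j in candidate_idx:
        P_j = model.predict_proba(X_pool[[j]])[0]             # current posterior p_i for x_j
        expected_entropy = 0.0
        for p_i, label in zip(P_j, model.classes_):           # hypothesize each label i
            m = clone(model).fit(np.vstack([X_lab, X_pool[j]]),
                                 np.append(y_lab, label))     # retrain with (x_j, i) added
            Q = m.predict_proba(X_pool[pool_idx])             # posteriors on remaining pool V
            H = -np.sum(Q * np.log(Q + 1e-12), axis=1)        # class-specific entropy terms
            expected_entropy += p_i * H.sum()
        if expected_entropy < best_val:
            best_idx, best_val = j, expected_entropy
    return best_idx
```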
The value of H(X) (assumed to be a maximization function) can be any of the heterogeneity criteria
(transformed appropriately for maximization) such as the entropy criterion En(X) from uncertainty
sampling, or the expected model change criterion C(X). The representativeness criterion R(X,V )
is simply a measure of the density of X with respect to the instances in V . A simple version of
this density is the average similarity of X to the instances in V [106], though it is possible to use
more sophisticated methods such as kernel-density estimation to achieve the same goal. Note that
such an approach is likely to ensure that the instance X is in a dense region, and is therefore not
likely to be an outlier. Numerous variations of this approach have been proposed, such as those
in [42, 84, 96, 106].
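A sketch of such a density-weighted strategy appears below; it multiplies the entropy score H(X) by a representativeness term R(X, V) taken to be the average cosine similarity of X to the rest of the unlabeled pool, with an optional exponent beta on the density term. The cosine similarity and the exponent are illustrative choices for this sketch, not the only possibilities mentioned in the cited works.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def density_weighted_scores(model, X_unlabeled, beta=1.0):
    """Score(X) = H(X) * R(X, V)^beta: prefer instances that are uncertain and lie in dense regions."""
    P = model.predict_proba(X_unlabeled)
    H = -np.sum(P * np.log(P + 1e-12), axis=1)               # heterogeneity criterion (entropy)
    S = cosine_similarity(X_unlabeled)                        # pairwise similarity within the pool V
    R = (S.sum(axis=1) - 1.0) / (X_unlabeled.shape[0] - 1)    # average similarity to the other instances
    return H * np.power(np.clip(R, 0.0, None), beta)          # query the instance with the largest score
```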
1) Informative: the queried instance should be close to the decision boundary of the learning model in
terms of criteria such as uncertainty, or it should be far away from existing labeled instances so that it
brings new knowledge about the feature space. 2) Representative: the queried instance should be unlikely
to be an outlier, and should be representative of a group of other unlabeled instances. For example, in the
work in [59], the query selection is based upon both the informativeness and the representativeness of the
unlabeled instances. A min-max framework of active learning is used to compute scores for both criteria.
22.4.3 Preliminaries
Let X be the instance space and Y = {±1} be the set of possible labels. Let H be the hypothesis
class, i.e., a set of mapping functions from X to Y . We assume that there is a distribution D over
all instances in X. For simplicity, we assume that H is finite (|H| < ∞), but does not completely
agree on any single x ∈ X, i.e., ∀x ∈ X, ∃h1, h2 ∈ H such that h1(x) ≠ h2(x). Note that |H| can be replaced by
the VC dimension for an infinite hypothesis space. The algorithm is evaluated with respect to a given
loss function ℓ : Y × Y → [0, ∞). The most common loss is the 0−1 loss, i.e., ℓ(z, y) = 1(y ≠ z). The
other losses include squared loss ℓ(z, y) = (y − z)^2, hinge loss ℓ(z, y) = (1 − yz)_+, and logistic loss
ℓ(z, y) = log(1 + exp(−yz)). The loss of a hypothesis h ∈ H with respect to a distribution P over
X × Y is defined as

L(h) = \mathbb{E}_{(x,y)\sim P}\, \ell(h(x), y). \qquad (22.2)

Let h^* = \arg\min_{h \in H} L(h) be a hypothesis of minimum error in H, and let L^* = L(h^*). We have
L∗ = 0 in the realizable case, and L∗ > 0 in the non-realizable case. The goal of active learning is to
return a hypothesis h ∈ H with an error L(h) that is not much more than L(h∗ ), using as few label
queries as possible.
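For concreteness, the sketch below implements the loss functions just defined and a plug-in estimate of L(h) on a finite sample drawn from P, assuming labels in {−1, +1} and a hypothesis h that returns a real-valued score (the 0−1 loss thresholds that score at zero, which is an assumption of the sketch rather than part of the definition above).

```python
import numpy as np

def zero_one_loss(z, y):  return float(np.sign(z) != y)           # l(z, y) = 1(y != z), via sign(z)
def squared_loss(z, y):   return float((y - z) ** 2)              # (y - z)^2
def hinge_loss(z, y):     return float(max(0.0, 1.0 - y * z))     # (1 - yz)_+
def logistic_loss(z, y):  return float(np.log1p(np.exp(-y * z)))  # log(1 + exp(-yz))

def empirical_loss(h, X, y, loss=hinge_loss):
    """Plug-in estimate of L(h) = E_{(x,y)~P} l(h(x), y) from a sample of (x, y) pairs."""
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))
```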
22.4.4.1 Algorithm
In the importance weighted active learning (IWAL) framework [12], an active learner looks
at the unlabeled data x1 , x2 , . . . , xt one by one. After each new point xt , the learner determines a
probability pt ∈ [0, 1]. Then a coin with the bias pt is tossed, and the label yt is queried if and only
if the coin comes up heads. The query probability pt depends on all previous unlabeled examples
x1:t−1, any previously queried labels, and the current unlabeled example xt.
The algorithm maintains a set of labeled examples seen so far, where each example is assigned
an importance value of 1/pt. The key of IWAL is a subroutine that returns the probability pt of
requesting yt , given xt and the previous history xi , yi : 1 ≤ i ≤ t − 1. Specifically, let Qt ∈ {0, 1} be a
random variable that determines whether to query yt or not. That is, Qt = 1 indicates that the label
yt is queried, and otherwise Qt = 0. Then Qt ∈ {0, 1} is conditionally independent of the current
label yt , i.e.,
Q_t \perp Y_t \mid X_{1:t}, Y_{1:t-1}, Q_{1:t-1} \qquad (22.3)
and with conditional expectation
\mathbb{E}[Q_t \mid X_{1:t}, Y_{1:t-1}] = p_t. \qquad (22.4)
If yt is queried, IWAL adds (xt , yt , 1/pt ) to the set, where 1/pt is the importance of predicting yt
on xt . The key of IWAL is how to specify a rejection threshold function to determine the query
probability pt . The works in [12] and [13] discuss different rejection threshold functions.
The importance weighted empirical loss of a hypothesis h is defined as
L_T(h) = \frac{1}{T} \sum_{t=1}^{T} \frac{Q_t}{p_t}\, \ell(h(x_t), y_t) \qquad (22.5)
A basic property of the above estimator is unbiasedness. It is easy to verify that E[LT (h)] = L(h),
where the expectation is taken over all the random variables involved.
To summarize, we show the IWAL algorithm in Algorithm 22.1.
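Since Algorithm 22.1 is not reproduced here, the following is a minimal sketch of the IWAL loop described above, together with the importance-weighted empirical loss of Equation 22.5. The rejection_threshold argument is a hypothetical placeholder for the specific threshold functions discussed in [12, 13], and the stream is modeled as (x_t, oracle) pairs where calling oracle() corresponds to querying y_t.

```python
import numpy as np

def iwal(stream, rejection_threshold, seed=0):
    """Importance weighted active learning (sketch of the querying loop).

    Returns the queried examples as (x_t, y_t, 1/p_t) triples and the stream length T.
    """
    rng = np.random.default_rng(seed)
    labeled, history = [], []
    for x_t, oracle in stream:
        p_t = rejection_threshold(x_t, history)        # query probability p_t in (0, 1]
        if rng.random() < p_t:                         # toss a coin with bias p_t
            y_t = oracle()                             # query the label y_t
            labeled.append((x_t, y_t, 1.0 / p_t))      # store with importance 1/p_t
        history.append(x_t)
    return labeled, len(history)

def importance_weighted_loss(h, labeled, T, loss):
    """Unbiased estimate (22.5): (1/T) * sum_t (Q_t / p_t) * l(h(x_t), y_t); Q_t = 0 terms vanish."""
    return sum(w * loss(h(x), y) for (x, y, w) in labeled) / T
```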
22.4.4.2 Consistency
A desirable property of any learning algorithm is consistency. The following theorem shows
that IWAL is consistent, as long as pt is bounded away from 0. Furthermore, its sample complexity
is within a constant factor of supervised learning in the worst case.
Theorem 3 [12] For all distributions D, for all finite hypothesis classes H, if there is a constant
pmin > 0 such that pt > pmin for all 1 ≤ t ≤ T, then with a probability that is at least 1 − δ, we have

L_T(h) \leq L(h) + \frac{2}{p_{\min}}\sqrt{\frac{\log|H| + \log\frac{1}{\delta}}{T}}. \qquad (22.6)
Recall that a typical supervised learning algorithm has the following bound for sample com-
plexity
Theorem 4 For all distributions D, for all finite hypothesis classes H, with a probability that is at
least 1 − δ, we have
L_T(h) \leq L(h) + \sqrt{\frac{\log|H| + \log\frac{1}{\delta}}{T}}. \qquad (22.7)
By comparing with the above results, we can see that the sample complexity of IWAL is at most
2/p_min^2 times the sample complexity of a typical supervised learning algorithm.
In fact, a fairly strong and large deviation bound can be given for each ht output by IWAL as
follows:
Theorem 5 [12] For all distributions D, for all finite hypothesis classes H, with a probability that
is at least 1 − δ, the hypothesis output by IWAL satisfies
L(h_T) \leq L(h) + 2\sqrt{\frac{8}{t}\log\frac{t(t+1)|H_t|^2}{\delta}}. \qquad (22.8)
for h1, h2 ∈ H. Let B(h, r) = {h′ ∈ H : ρ(h, h′) ≤ r} be the ball centered around h of radius r.
Definition 22.4.2 [12] (Disagreement Coefficient) The disagreement coefficient is the infimum value
of θ such that for all r,

\sup_{h \in B(h^*, r)} \rho(h^*, h) \leq \theta r. \qquad (22.10)
Theorem 6 [12] For all distributions D, for all finite hypothesis classes H, if the loss function
has a slope asymmetry C_l, and the learning problem has a disagreement coefficient θ, the expected
number of labels queried by IWAL after seeing T examples is at most

4\theta C_l L^* T + O\left(\sqrt{T \log\frac{|H|T}{\delta}}\right). \qquad (22.12)
“The bomb blast in Cairo <place> left Mubarak <person> wounded.”
In this case, the token “Cairo” is associated with the label place, whereas the token “Mubarak”
is associated with the label person. Features may also be associated with tokens of the instance,
which provide useful information for classification purposes. The labels associated with the indi-
vidual tokens in a sequence are not independent of one another, since the adjacent tokens (and
labels) in a given sequence directly impact the label of any given token. A common approach for
sequence classification is to use Hidden Markov Models (HMM), because they are naturally de-
signed to address the dependencies between the different data instances. Another class of methods
that are frequently used for active learning in sequences are Conditional Random Fields (CRF). A
detailed discussion of these methods is beyond the scope of this chapter. The reader is referred to
the work in [8, 28, 60, 102, 106], which provide a number of different probabilistic algorithms for
active learning from sequences. The work in [106] is particularly notable, because it presents an
experimental comparison of a large number of algorithms.
FIGURE 22.2 (See color insert.): Motivation of active learning on small graphs [70].
Active learning on graph data has been studied in two different scenarios, such as the classification of many small graphs (e.g., chemical compounds) [70], or the classi-
fication of nodes in a single large graph (e.g., a social or information network) [10, 18, 49–52, 61].
Most of the specialized active learning methods in the literature are designed for the latter scenario,
because the former scenario can often be addressed with straightforward adaptations of standard ac-
tive learning methods for multidimensional data. However, in some cases, specific aspects of feature
selection in graphs can be combined with active learning methods. Such methods will be discussed
below.
Conventional active learning methods usually assume that the features of the instances are given
beforehand. One issue with graph data is that the features are not readily available. Many classifica-
tion approaches for small graphs require a feature extraction step to extract a set of subgraphs that
are discriminative for the classification task [63,69,114,121,123]. The number of possible subgraph
features that can be extracted is rather large, and the features extracted are highly dependent on the
labeled instances. In the active learning settings, we can only afford to query a small number of
graphs and obtain their labels. The performance of the feature extraction process depends heavily
on the quality of the queried graphs in the active learning process. Meanwhile, in active learning,
we need to evaluate the importance of each instance. The performance of active learning also de-
pends heavily on the quality of the feature extraction process. Thus the active learning and subgraph
feature extraction steps are mutually beneficial.
For example, in Figure 22.2, we are given two labeled graphs (G1 and G2 ) within which we
have only a small number of the useful subgraph features (F1 and F2 ). If we query the graph G3 for
the label, we are not only improving the classification performance, due to the fact that G3 is both
representative and informative among the unlabeled graphs, but we are also likely to improve the
performance of feature extraction, because G3 contains new features like F3 .
The process of query selection can assist the process of feature selection in finding useful sub-
graph features. In other words, the better the graph object we select, the more effectively we can
discover the useful subgraph features. Therefore, the work in [70] couples the active sampling
problem and the subgraph feature selection problem, where the two processes are considered simultaneously.
The idea is that the two processes influence each other since sample selection should affect feature
selection and vice-versa. A method called gActive was proposed to maximize the dependency be-
tween subgraph features and graph labels using an active learning framework. A branch-and-bound
algorithm is used to search for the optimal query graph and optimal features simultaneously.
then leveraged to provide an improved algorithm for the use of submodular function maximization
techniques. The active learning method first clusters the graph and then randomly chooses a node
in each cluster. The work in [61] proposes a variance minimization approach to active learning in
graphs. Specifically, the work in [61] proposes to select the most informative nodes, by proposing
to minimize the prediction variance of the Gaussian field and harmonic function, which was used
in [133] for semi-supervised learning in graphs. The work in [49] approaches the problem by con-
sidering the generalization error of a specific classifier on graphs. The Learning with Local and
Global Consistency (LLGC) method proposed in [135] was chosen because of its greater effective-
ness. A data-dependent generalization error bound for LLGC was proposed in [49] using the tool of
transductive Rademacher Complexity [35]. This tool measures the richness of a class of real-valued
functions with respect to a probability distribution. It was shown in [49] that the empirical transduc-
tive Rademacher complexity is a good surrogate for active learning on graphs. The work therefore
selects the nodes by minimizing the empirical transductive Rademacher complexity of LLGC on
a graph. The resulting active learning method is a combinatorial optimization problem, which is
optimized using a sequential optimization algorithm.
Another line of research [10, 11, 18, 50] has considered adaptive active learning, where the la-
bels for the nodes of a graph are queried and predicted in an iterative way with the use of a trained
classifier from the previous labels. The work in [11] made the observation that label propagation
in graphs is prone to “accumulating” errors owing to correlations among the node labels. In other
words, once an error is made in the classification, this error propagates throughout the network to
the other nodes. Therefore, an acquisition method is proposed, which learns where a given collective
classification method makes mistakes, and then suggests acquisitions in order to correct those mis-
takes. It should be pointed out that such a strategy is unique to active learning in graphs, because of
the edge-based relationships between nodes, which result in homophily-based correlations among
node labels. The work in [11] also proposes two other methods, one of which greedily improves
the objective function value, and the other adapts a viral marketing model to the label acquisition
process. It was shown in [11] that the corrective model to label acquisition generally provides the
best results.
The work in [10] addresses scenarios where both content and structure are available for the nodes. It is assumed that the labels are acquired sequentially in batches of size k. The algorithm uses two learners called CO (content-only) and CC (collective classifier), and combines their predictions in order to make decisions about labeling nodes. In particular, nodes for which these two classifiers differ in their predictions are considered good candidates for labeling, because the labels of such nodes are the most informative for determining the relative importance of the different aspects (content and structure) over different parts of the data. The proposed algorithm ALFNET proceeds by first clustering the nodes using the network structure of the data. It then scores each cluster according to two criteria based on these predictions. An overall score is computed on the basis of the two criteria; for a batch size of k, the top-k clusters are selected, and one node is selected from each of these clusters for labeling.
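The exact cluster scoring used by ALFNET is more involved than what is shown here; the following is only a minimal sketch of the underlying disagreement-based idea, assuming that the predictions of the two classifiers and the cluster assignments are already available. All names and the toy data are illustrative.

```python
import numpy as np

def select_batch(co_pred, cc_pred, cluster_ids, k):
    """Pick one node from each of the k clusters in which the content-only (CO)
    and collective (CC) classifiers disagree the most."""
    disagree = (co_pred != cc_pred).astype(float)
    scores = {}
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        scores[c] = disagree[members].mean()      # fraction of disagreement in the cluster
    top_clusters = sorted(scores, key=scores.get, reverse=True)[:k]
    batch = []
    for c in top_clusters:
        members = np.where(cluster_ids == c)[0]
        disagreeing = members[disagree[members] > 0]
        batch.append(int((disagreeing if len(disagreeing) else members)[0]))
    return batch

# Toy example: 8 nodes, 4 clusters, batch size k = 2.
co = np.array([0, 0, 1, 1, 0, 1, 0, 1])
cc = np.array([0, 1, 1, 0, 0, 1, 1, 1])
clusters = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(select_batch(co, cc, clusters, k=2))  # nodes drawn from the two most conflicted clusters
```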
The work in [18] proposes an active learning method that is based on the results of [51, 52] in
order to design optimal query placement methods. The work in [18] proposes a method for active
learning in the special case of trees, and shows that the optimal number of mistakes on the non-
queried nodes is made by a simple mincut classifier. A simple modification of this algorithm also
achieves optimality (within a constant factor) on the trade-off between the number of mistakes, and
the number of non-queried nodes. By using spanning trees, the method can also be generalized to ar-
bitrary graphs, though the problem of finding an optimal solution in this general case remains open.
The work in [50] proposes selective sampling for graphs, which combines the ideas of online
learning and active learning. In this case, the selective sampling algorithm observes the examples in a sequential manner and predicts the label of each example after observing it. The algorithm can, however, decide whether or not to request feedback indicating whether its prediction is correct. If the label can already be predicted with high confidence, a significant amount of labeling effort can be saved by not requesting feedback for such examples. The work in [50] uses the LLGC framework [133] for the prediction process.
Feature-based active learning is similar in spirit to instance-based active learning, but the method-
ologies used for querying are quite different.
One observation about missing feature values is that many of them can often be imputed, at least partially, from the correlations among the features in the existing data. Therefore, it is not useful to query feature values that can be reliably imputed. Based on this principle, a technique was proposed in [132] in which the missing feature values are first imputed, and the imputed values with the greatest uncertainty are then queried. A second approach proposed in this work uses a classifier to determine those instances that are misclassified; the missing feature values of the misclassified instances are then queried. In incremental feature-value acquisition, the feature values of a small set of misclassified instances are acquired [88], or the acquisitions are chosen by maximizing a utility function [87, 101].
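A minimal sketch of the "impute first, query the most uncertain imputations" idea is given below. The nearest-neighbor imputation and its variance-based uncertainty estimate are illustrative assumptions, not the specific estimator used in [132].

```python
import numpy as np

def rank_missing_cells(X, n_neighbors=3):
    """Impute each missing cell from its nearest rows (on the observed columns) and
    rank the cells by the variance of the donor values: high variance = query first."""
    ranked = []
    for i, j in zip(*np.where(np.isnan(X))):
        observed_cols = ~np.isnan(X[i])
        donors = [r for r in range(len(X))
                  if r != i and not np.isnan(X[r, j])
                  and not np.isnan(X[r][observed_cols]).any()]
        if not donors:
            continue
        d = np.linalg.norm(X[donors][:, observed_cols] - X[i, observed_cols], axis=1)
        nearest = np.array(donors)[np.argsort(d)[:n_neighbors]]
        values = X[nearest, j]
        ranked.append((float(values.var()), float(values.mean()), (int(i), int(j))))
    return sorted(ranked, reverse=True)        # most uncertain imputations first

X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.1],
              [5.0, 9.0, 8.0],
              [0.9, 2.1, np.nan],
              [5.2, 8.8, 8.1]])
for var, guess, cell in rank_missing_cells(X):
    print(cell, "imputed as", round(guess, 2), "with variance", round(var, 3))
```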
Active learning for feature values is also relevant in the context of scenarios where the feature
values are acquired for test instances rather than training instances. Many of the methodologies
used for training instances cannot be used in this scenario, since class labels are not available in
order to supervise the training process. This particular scenario was proposed in [48]. Typically, standard classifiers such as naive Bayes and decision trees are modified in order to perform such cost-sensitive classification [19, 37, 79, 110]. Other, more sophisticated methods model the approach as a sequence
of decisions. For such sequential decision making, a natural approach is to use Hidden Markov
Models [62].
Another scenario for active learning of features is to actively select instances for feature selection tasks. Conventional active learning methods are mainly designed for classification tasks, whereas the works in [80–82] studied how to selectively sample instances for feature selection. The idea is to partition the instances in the feature space using a KD-tree, so that instances with very different feature values can be sampled efficiently.
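The following is a minimal sketch of the KD-tree-style idea: recursively split the feature space at the median and keep one representative per cell, so that the sampled instances have very different feature values. The splitting rule, leaf size, and choice of representative are illustrative assumptions rather than the exact procedure of [80–82].

```python
import numpy as np

def kd_sample(X, idx=None, depth=0, max_leaf=4):
    """Recursively median-split the feature space (a simple KD partition) and return
    one representative index per leaf cell, so the sample covers very different regions."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= max_leaf:
        return [int(idx[0])] if len(idx) else []
    axis = depth % X.shape[1]                       # cycle through the feature axes
    order = idx[np.argsort(X[idx, axis])]
    mid = len(order) // 2
    return (kd_sample(X, order[:mid], depth + 1, max_leaf) +
            kd_sample(X, order[mid:], depth + 1, max_leaf))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sample = kd_sample(X, max_leaf=25)
print(len(sample), "representatives selected:", sample)
```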
Instance pairs with the smallest absolute kernel values |Ki,j| are more likely to be must-link pairs, i.e., pairs in which both instances are in the same class. It is also crucial to query instance pairs that are more likely to be cannot-link pairs, where the two instances are in different classes.
The work in [56] proposed an approach based on a min-max principle, i.e., the assumption is that the most informative instance pairs are those that result in a large classification margin regardless of their assigned class labels.
If the model is refreshed too frequently, the result is unnecessary computational effort without significant benefit. Therefore, the approach uses a mathematical model in order to determine when the stream patterns have changed significantly enough for an update to be warranted. This is done by estimating the error of the model on the newly incoming data stream, without knowing the true labels. Thus, this approach is not, strictly speaking, a traditional active learning method that queries for the labels of individual instances in the stream.
The traditional problem of active learning from data streams, where a decision is made on whether to query each incoming instance, is also referred to as selective sampling, which was discussed in the introduction section of this chapter. The term “selective sampling” is generally used by the machine learning community, whereas the term “data stream active learning” is often used by data mining researchers outside the core machine learning community. Most of the standard techniques for sample selection that depend on the properties of the unlabeled instance (e.g., query-by-committee) can be easily extended to the streaming scenario [43, 44]. Methods that depend on the aggregate per-
formance of the classifier are somewhat inefficient in this context. This problem has been studied in
several different contexts such as part-of-speech tagging [28], sensor scheduling [71], information
retrieval [127], and word-sense disambiguation [42].
Such batch-selection methods have also been discussed in the context of active learning for node classification in networks, where they are very popular. Even when labeled data is available, it is generally not advisable to simply pick the k individually best instances, because these instances may all be quite similar and may not provide information that is proportional to the batch size. Therefore, much of the work in the literature [16, 55, 57, 119] focuses on incorporating diversity among the instances in the selection process.
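A minimal sketch of a greedy batch-selection rule that trades off informativeness against diversity is shown below; it only illustrates the general idea in [16, 55, 57, 119], and the trade-off weight, distance measure, and toy data are assumptions.

```python
import numpy as np

def select_diverse_batch(uncertainty, X, k, lam=0.5):
    """Greedily build a batch: each pick maximizes  lam * uncertainty
    + (1 - lam) * distance to the closest instance already in the batch."""
    batch = [int(np.argmax(uncertainty))]            # start with the most uncertain point
    while len(batch) < k:
        d_to_batch = np.min(
            np.linalg.norm(X[:, None, :] - X[batch][None, :, :], axis=2), axis=1)
        score = lam * uncertainty + (1 - lam) * d_to_batch
        score[batch] = -np.inf                       # never re-pick a selected instance
        batch.append(int(np.argmax(score)))
    return batch

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
uncertainty = rng.uniform(size=100)                  # e.g., 1 - max class probability
print(select_diverse_batch(uncertainty, X, k=5))
```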
If an entity is predicted as the mayor of a city, the class label of the entity should be “politician”
or “person”, instead of other labels like “water.” In such scenarios, the outputs of multiple tasks
are coupled with certain constraints. These constraints could provide additional knowledge for the
active learning process. The work in [128] studied the problem of active learning for multi-task learning with output constraints. The idea is to exploit not only the uncertainty of the prediction in each task, but also the inconsistency among the outputs of multiple tasks. If the class labels of two different tasks are mutually exclusive while the model predicts positive labels in both tasks, then the model must have made an error on at least one of the two tasks. Such information is provided by the output constraints and can potentially help active learners to evaluate the importance of each unlabeled instance more accurately.
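A minimal sketch of how output constraints can be folded into an instance-scoring rule is given below; the specific combination of per-task uncertainty and a constraint-violation penalty, as well as the weights and task names, are illustrative assumptions rather than the criterion of [128].

```python
import numpy as np

def constraint_aware_score(task_probs, exclusive_pairs, alpha=1.0):
    """task_probs: dict task -> array of positive-class probabilities per instance.
    Score = summed per-task uncertainty + penalty for predicting positive
    on two mutually exclusive tasks at once."""
    tasks = list(task_probs)
    n = len(next(iter(task_probs.values())))
    uncertainty = sum(1.0 - np.abs(task_probs[t] - 0.5) * 2 for t in tasks)
    inconsistency = np.zeros(n)
    for t1, t2 in exclusive_pairs:
        both_positive = (task_probs[t1] > 0.5) & (task_probs[t2] > 0.5)
        inconsistency += both_positive.astype(float)
    return uncertainty + alpha * inconsistency

probs = {
    "is_politician": np.array([0.9, 0.55, 0.2]),
    "is_location":   np.array([0.8, 0.10, 0.6]),
}
scores = constraint_aware_score(probs, exclusive_pairs=[("is_politician", "is_location")])
print(np.argsort(-scores))   # instance 0 violates the constraint, so it ranks first
```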
In many settings, multiple oracles may disagree with each other on the same instance. The answer from an individual oracle may not be accurate enough, so combining the answers from multiple oracles becomes necessary. For example, in medical imaging, different doctors may give different diagnoses for the same patient's image, while it is often impossible to perform a biopsy on the patient in order to establish the ground truth.
In multi-oracle active learning, we select not only which instance to query for a label, but also which oracle(s) or annotator(s) to query with that instance. In other words, multi-oracle active learning is an instance-oracle pair selection problem, rather than an instance selection problem. This paradigm brings new challenges to the active learning scenario: in order to benefit the learning model the most, we need to estimate both the importance of each unlabeled instance and the expertise of each annotator.
The work in [122] designed an approach for multi-oracle active learning. The model first estimates the importance of each instance, as in conventional active learning models. It then estimates which oracle is most confident about its label annotation for that instance. These two steps are performed iteratively. The high-level idea of the approach is to first decide which instance we most need to query, and then to decide who has the most confident answer for that query.
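The following is a minimal sketch of this two-step idea, assuming that estimates of instance importance and per-instance oracle confidence are already available (in [122] these estimates are themselves learned and refined iteratively); all names and values are illustrative.

```python
import numpy as np

def pick_instance_and_oracle(instance_uncertainty, oracle_confidence):
    """instance_uncertainty: shape (n_instances,).
    oracle_confidence: shape (n_oracles, n_instances), estimated per-instance reliability.
    Step 1: choose the instance we most need a label for.
    Step 2: choose the oracle most confident about that particular instance."""
    i = int(np.argmax(instance_uncertainty))
    o = int(np.argmax(oracle_confidence[:, i]))
    return i, o

uncertainty = np.array([0.1, 0.8, 0.4])
confidence = np.array([[0.9, 0.30, 0.7],    # oracle 0
                       [0.6, 0.95, 0.5]])   # oracle 1
print(pick_instance_and_oracle(uncertainty, confidence))   # (1, 1)
```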
The work in [111] studied both the uncertainty in the model and the noise in the oracles. The idea is that low-quality labelers are usually inexpensive in crowdsourcing applications, and one can repeatedly query different labelers for the same instance to improve the labeling quality at additional cost. Thus, the work in [111] studied the problem of selective repeated labeling, which can reduce the labeling cost while improving the overall labeling quality, although it focused on the case in which all oracles are equally and consistently noisy, which may not hold in many real-world settings. The work in [32] extended the problem setting by allowing labelers to have different noise levels, though the noise levels are still assumed to be consistent over time. It shows that if the qualities of individual oracles can be properly estimated, then active learning can be performed more effectively by querying only the most reliable oracles in the selective labeling process. Another line of work extended the setting of multi-oracle active learning so that the skill sets of the oracles are not assumed to be static, but can change as the oracles actively teach each other. The work in [39] explored this problem setting and proposed a self-taught active learning method for crowdsourcing. In addition to selecting the most important instance-oracle pair, we also need to determine which pair of oracles is best suited for self-teaching, i.e., one oracle has expertise in a skill in which the other oracle most needs to improve. In order to find the most effective oracle pair, we need to estimate not only how proficient each oracle is in a given type of skill, but also which oracle needs the most help in improving that skill. In this way, the active learning method can help improve the overall skill sets of the oracles and serve the labeling task with better performance.
For example, one model M1 may outperform another model M2 on one objective, while M2 may have better performance on “throughput” than M1. In this case, we call a model a non-dominated or “optimal” solution if no other solution achieves better performance on all objectives. The task of multi-objective optimization is thus to obtain a set of such “optimal” models/solutions, in which no model is dominated by any other solution and no model dominates another model in the set. Such a model set is called a “Pareto-optimal set”; there may be a potentially infinite number of “optimal” solutions in the model space, and it is usually infeasible to obtain the Pareto-optimal set by exhaustive search.
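For a finite set of already-evaluated designs, Pareto dominance and the non-dominated set can be computed directly, as the following minimal sketch shows (all objectives are assumed to be maximized, and the scores are invented); the point of the PAL method discussed below is precisely to avoid evaluating every design exhaustively.

```python
import numpy as np

def dominates(a, b):
    """a dominates b if it is at least as good in every objective
    and strictly better in at least one (objectives are maximized)."""
    return np.all(a >= b) and np.any(a > b)

def pareto_front(scores):
    """Return the indices of the non-dominated designs (the Pareto-optimal set)."""
    return [i for i, a in enumerate(scores)
            if not any(dominates(b, a) for j, b in enumerate(scores) if j != i)]

# Toy designs scored on (accuracy, throughput).
scores = np.array([[0.90, 10.0],
                   [0.85, 25.0],
                   [0.80, 20.0],    # dominated by design 1
                   [0.95,  5.0]])
print(pareto_front(scores))         # [0, 1, 3]
```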
Moreover, in some real-world applications, the evaluation of the performance of a
model/solution with reference to multiple criteria is not free, but can be rather expensive. One ex-
ample is the hardware design problem. If we would like to test the performance of a model design
in hardware, the synthesis of a single design can take hours or even days. The major cost of the learning process is then in the model evaluation step, rather than in the label acquisition step. This is where active learning comes into play: we can perform active learning in the model/design space instead of the data instance space. The work in [138] studied the problem of selectively sampling the model/design space to predict the Pareto-optimal set, and proposed a method called Pareto Active Learning (PAL). The performance on the multiple objectives is modeled with Gaussian processes, and designs are selected iteratively so as to reduce the uncertainty in identifying the Pareto-optimal set.
The general idea is to select unlabeled instances from the source domain and query the oracle with them; the queried instances should benefit the target domain in the transfer learning process. The work in [20] integrated active learning and transfer learning into one unified framework based on a single convex optimization problem. The work in [125] provided a theoretical analysis of active transfer learning.
The work in [40] extended the active transfer learning problem to the multi-oracle setting, where there can be multiple oracles in the system. Each oracle can have different expertise, which can be hard to estimate because the number of instances labeled by all of the participating oracles is rather small. The work in [131] extended the active transfer learning framework to recommendation problems, where knowledge can be actively transferred from one recommender system to another.
22.7 Conclusions
This chapter discusses the methodology of active learning, which is relevant in settings where it is expensive to acquire labels. While the approach has a similar motivation to many other methods, such as semi-supervised learning and transfer learning, the way it deals with the paucity of labels is quite different. In active learning, the instances to be labeled for training are chosen with a careful sampling approach, which maximizes the accuracy of classification while reducing the cost of label acquisition. Active learning has also been generalized
to structured data with dependencies such as sequences and graphs. The problem is studied in the
context of a wide variety of different scenarios such as feature-based acquisition and data streams.
With the increasing amounts of data available in the big-data world, and the tools now available to us through crowdsourcing methodologies such as Amazon Mechanical Turk, batch processing of such data sets has become an imperative. While some methodologies already exist for these scenarios, significant scope remains for the further development of technologies for large-scale active learning.
Bibliography
[1] N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. ICML Con-
ference, pages 1–9, 1998.
[22] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[23] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of
Artificial Intelligence Research, 4:129–145, 1996.
[24] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. Vol. 2, MIT Press, Cam-
bridge, MA, 2006.
[25] S.F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Trans-
actions on Speech and Audio Processing, 8(1):37–50, 2000.
[26] W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng. Unbiased online active learning in
data streams. ACM KDD Conference, pages 195–203, 2011.
[27] A. Culotta and A. McCallum. Reducing labeling effort for structured prediction tasks, AAAI
Conference, pages 746–751, 2005.
[28] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers.
ICML Conference, pages 150–157, 1995.
[29] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across
different feature spaces. Proceedings of Advances in Neural Information Processing Systems,
pages 353–360, 2008.
[31] P. Donmez, J. G. Carbonell, and P. N. Bennett. Dual strategy active learning. ECML confer-
ence, pages 116–127, 2007.
[32] P. Donmez, J. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources
for selective sampling. KDD Conference, pages 259–268, 2009.
[33] T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with
axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
[34] G. Druck, B. Settles, and A. McCallum. Active learning by labeling features. EMNLP, pages
81–90, 2009.
[35] R. El-Yaniv and D. Pechyony. Transductive rademacher complexity and its applications. Jour-
nal of Artificial Intelligence Research, 35:193–234, 2009.
[36] A. Epshteyn, A. Vogel, G. DeJong. Active reinforcement learning. ICML Conference, pages
296–303, 2008.
[37] S. Esmeir and S. Markovitch. Anytime induction of cost-sensitive trees. Advances in Neural
Information Processing Systems (NIPS), pages 425–432, 2008.
[38] A. Esuli and F. Sebastiani. Active learning strategies for multi-label text classification. ECIR
Conference, pages 102–113, 2009.
[39] M. Fang, X. Zhu, B. Li, W. Ding, and X. Wu. Self-taught active learning from crowds. ICDM Conference, pages 273–288, 2012.
[40] M. Fang, J. Yin, and X. Zhu. Knowledge transfer for multi-labeler active learning. ECML-PKDD Conference, 2013.
[41] W. Fan, Y. A. Huang, H. Wang, and P. S. Yu. Active mining of data streams. SIAM Conference
on Data Mining, 2004.
[42] A. Fujii, T. Tokunaga, K. Inui, and H. Tanaka. Selective sampling for example-based word
sense disambiguation. Computational Linguistics, 24(4):573–597, 1998.
[43] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[44] Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.
[45] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma.
Neural Computation, 4:1–58, 1992.
[46] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In NIPS, 2005.
[47] J. Goodman. Exponential priors for maximum entropy models. Human Language Technology
and the North American Association for Computational Linguistics, pages 305–312, 2004.
[48] R. Greiner, A. Grove, and D. Roth. Learning cost-sensitive active classifiers. Artificial Intelli-
gence, 139, pages 137–174, 2002.
[49] Q. Gu and J. Han. Towards active learning on graphs: An error bound minimization approach.
IEEE International Conference on Data Mining, pages 882–887, 2012.
[50] Q. Gu, C. Aggarwal, J. Liu, and J. Han. Selective sampling on graphs for classification. ACM
KDD Conference, pages 131–139, 2013.
[51] A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. UAI,
274–282, 2011.
[52] A. Guillory and J. A. Bilmes. Label selection on graphs. NIPS, pages 691–699, 2009.
[53] Y. Guo and R. Greiner. Optimistic active learning using mutual information. International
Joint Conference on Artificial Intelligence, pages 823–829, 2007.
[54] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages
353–360, 2007.
[55] S. C. H. Hoi, R. Jin, and M.R. Lyu. Large-scale text categorization by batch mode active
learning. World Wide Web Conference, pages 633–642, 2006.
[58] S. C. H. Hoi, R. Jin, J. Zhu, and M.R. Lyu. Semi-supervised SVM batch mode active learning
for image retrieval. IEEE Conference on CVPR, pages 1–3, 2008.
[59] S. Huang, R. Jin, and Z. Zhou. Active learning by querying informative and representative
examples. NIPS Conference, pages 892–900, 2011.
[60] R. Hwa. Sample selection for statistical parsing. Computational Linguistics, 30:73–77, 2004.
[61] M. Ji and J. Han. A variance minimization criterion to active learning on graphs. AISTATS,
pages 556–564, 2012.
[62] S. Ji and L. Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition,
40:1474–1485, 2007.
[63] N. Jin, C. Young, and W. Wang. GAIA: graph classification using evolutionary computation.
In SIGMOD, pages 879–890, 2010.
[64] A. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image classifica-
tion. In CVPR Conference, 2009.
[65] K. Judah, A. Fern, and T. Dietterich. Active imitation learning via reduction to I.I.D. active
learning. In UAI Conference, 2012.
[66] A. Kapoor, E. Horvitz, and S. Basu. Selective supervision: Guiding supervised learning with
decision theoretic active learning. Proceedings of International Joint Conference on Artificial
Intelligence (IJCAI), pages 877–882, 2007.
[67] R.D. King, K.E. Whelan, F.M. Jones, P.G. Reiser, C.H. Bryant, S.H. Muggleton, D.B. Kell,
and S.G. Oliver. Functional genomic hypothesis generation and experimentation by a robot
scientist. Nature, 427 (6971):247–252, 2004.
[68] C. Korner and S. Wrobel. Multi-class ensemble-based active learning. European Conference
on Machine Learning, pages 687–694, 2006.
[69] X. Kong and P. S. Yu. Semi-supervised feature selection for graph classification. In KDD,
pages 793–802, 2010.
[70] X. Kong, W. Fan, and P. Yu. Dual Active feature and sample selection for graph classification.
ACM KDD Conference, pages 654–662, 2011.
[71] V. Krishnamurthy. Algorithms for optimal scheduling and management of hidden Markov
model sensors. IEEE Transactions on Signal Processing, 50(6):1382–1397, 2002.
[72] B. Krishnapuram, D. Williams, Y. Xue, L. Carin, M. Figueiredo, and A. Hartemink. Active
learning of features and labels. ICML Conference, 2005.
[73] A. Kuwadekar and J. Neville, Relational active learning for joint collective classification mod-
els. ICML Conference, pages 385–392, 2011.
[74] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. ACM SIGIR Con-
ference, pages 3–12, 1994.
[75] D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. ICML
Conference, pages 148–156, 1994.
[76] X. Li and Y. Guo. Active learning with multi-label SVM classification. IJCAI Conference,
pages 14–25, 2013.
[77] X. Li, D. Kuang, and C.X. Ling. Active learning for hierarchical text classification. PAKDD
Conference, pages 14–25, 2012.
[78] X. Li, D. Kuang and C.X. Ling. Multilabel SVM active learning for image classification. ICIP
Conference, pages 2207–2210, 2004.
[79] C. X. Ling, Q. Yang, J. Wang, and S. Zhang. Decision trees with minimal costs. ICML Con-
ference, pages 483–486, 2004.
[80] H. Liu, H. Motoda, and L. Yu. A selective sampling approach to active feature selection. Artificial Intelligence, 159(1-2):49–74, 2004.
[81] H. Liu, H. Motoda, and L. Yu. Feature selection with selective sampling. ICML Conference, pages 395–402, 2002.
[82] H. Liu, L. Yu, M. Dash, and H. Motoda. Active feature selection using classes. PAKDD Conference, pages 474–485, 2003.
[83] R. Lomasky, C. Brodley, M. Aernecke, D. Walt, and M. Friedl. Active class selection. Pro-
ceedings of the European Conference on Machine Learning, pages 640–647, 2007.
[84] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classifica-
tion. ICML Conference, pages 359–367, 1998.
[85] D. MacKay. Information-based objective functions for active data selection. Neural Computa-
tion, 4(4):590–604, 1992.
[86] D. Margineantu. Active cost-sensitive learning. International Joint Conference on Artificial
Intelligence (IJCAI), pages 1622–1623, 2005.
[87] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. An expected utility approach to
active feature-value acquisition. IEEE ICDM Conference, 2005.
[88] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. Active feature-value acquisition
for classifier induction. IEEE International Conference on Data Mining (ICDM), pages 483–
486, 2004.
[89] P. Melville and R. Mooney. Diverse ensembles for active learning. ICML Conference, pages
584–591, 2004.
[90] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view
learning. ICML Conference, pages 435-442, 2002.
[91] I. Muslea, S. Minton, and C. Knoblock. Active learning with multiple views. Journal of Artifi-
cial Intelligence Research, 27(1):203–233, 2006.
[92] I. Muslea, S. Minton, and C. Knoblock. Selective sampling with redundant views. AAAI Con-
ference, pages 621–626, 2000.
[93] R. Moskovitch, N. Nissim, D. Stopel, C. Feher, R. Englert, and Y. Elovici. Improving the de-
tection of unknown computer worms activity using active learning. Proceedings of the German
Conference on AI, pages 489–493, 2007.
[94] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unla-
beled documents using EM. Machine Learning, 39(2–3):103–134, 2000.
[95] A. McCallum and K. Nigam. Employing EM pool-based active learning for text classification.
ICML Conference, pages 350–358, 1998.
[96] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. ICML Conference,
pages 79–86, 2004.
[97] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data
Engineering, 22(10):1345–1359, 2010.
[98] G. Qi, X. Hua, Y. Rui, J. Tang, and H. Zhang. Two-dimensional active learning for image
classification. CVPR Conference, 2008.
[99] R. Reichart, K. Tomanek, U. Hahn, and A. Rappoport. Multi-task active learning for linguistic
annotations. ACL Conference, 2008.
[100] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of
error reduction. ICML Conference, pages 441–448, 2001.
[102] T. Scheffer, C. Decomain, and S. Wrobel. Active hidden Markov models for information
extraction. International Conference on Advances in Intelligent Data Analysis (CAIDA), pages
309–318, 2001.
[103] A. I. Schein and L.H. Ungar. Active learning for logistic regression: An evaluation. Machine
Learning, 68(3):253–265, 2007.
[104] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. ICML
Conference, pages 839–846, 2000.
[105] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learn-
ing, 6(1):1–114, 2012.
[106] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling
tasks. Proceedings of the Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 1069–1078, 2008.
[107] B. Settles, M. Craven, and S. Ray. Multiple-instance active learning. Advances in Neural
Information Processing Systems (NIPS), pages 1289–1296, 2008.
[108] B. Settles, M. Craven, and L. Friedland. Active learning with real annotation costs. NIPS
Workshop on Cost-Sensitive Learning, 2008.
[109] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. ACM Workshop on Com-
putational Learning Theory, pages 287–294, 1992.
[110] V. S. Sheng and C. X. Ling. Feature value acquisition in testing: A sequential batch test
algorithm. ICML Conference, pages 809–816, 2006.
[111] V. S. Sheng, F. Provost, and P.G. Ipeirotis. Get another label? Improving data quality and data
mining using multiple, noisy labelers. ACM KDD Conference, pages 614–622, 2008
[112] M. Singh, E. Curran, and P. Cunningham. Active learning for multi-label image annotation.
Technical report, University College Dublin, 2009
[113] T. Soukop and I. Davidson. Visual Data Mining: Techniques and Tools for Data Visualization,
Wiley, 2002.
[114] M. Thoma, H. Cheng, A. Gretton, J. Han, H. Kriegel, A. Smola, L. Song, P. Yu, X. Yan, and
K. Borgwardt. Near-optimal supervised feature selection among frequent subgraphs. In SDM,
pages 1075–1086, 2009.
[115] M. Wang and X. S. Hua. Active learning in multimedia annotation and retrieval: A survey.
ACM Transactions on Intelligent Systems and Technology (TIST), 2(2):10, 2011.
[116] W. Wang and Z. Zhou. On multi-view active learning and the combination with semi-
supervised learning. ICML Conference, pages 1152–1159, 2008.
[117] W. Wang and Z. Zhou. Multi-view active learning in the non-realizable case. NIPS Confer-
ence, pages 2388–2396, 2010.
[118] X. Shi, W. Fan, and J. Ren. Actively transfer domain knowledge. ECML Conference, pages
342–357, 2008
[119] Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active learning for
relevance feedback. European Conference on IR Research (ECIR), pages 246–257, 2007.
[120] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification
using support vector machines. ECIR Conference, pages 393–407, 2003.
[121] X. Yan, H. Cheng, J. Han, and P. Yu. Mining significant graph patterns by leap search. In
SIGMOD, pages 433–444, 2008.
[122] Y. Yan, R. Rosales, G. Fung, and J. Dy. Active learning from crowds. In ICML, 2011.
[123] X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent structure anal-
ysis. ACM Transactions on Database Systems, 30(4):960–993, 2005.
[124] B. Yang, J. Sun, T. Wang, and Z. Chen. Effective multi-label active learning for text classifica-
tion. ACM KDD Conference, pages 917–926, 2009.
[125] L. Yang, S. Hanneke, and J. Carbonell. A theory of transfer learning with applications to
active learning. Machine Learning, 90(2):161–189, 2013.
[126] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. ICML Con-
ference, pages 1081–1088, 2006.
[127] H. Yu. SVM selective sampling for ranking with application to data retrieval. ACM KDD
Conference, pages 354–363, 2005.
[128] Y. Zhang. Multi-task active learning with output constraints. AAAI, 2010
[129] T. Zhang and F.J. Oles. A probability analysis on the value of unlabeled data for classification
problems. ICML Conference, pages 1191–1198, 2000.
[130] Q. Zhang and S. Sun. Multiple-view multiple-learner active learning. Pattern Recognition,
43(9):3113–3119, 2010.
[131] L. Zhao, S. Pan, E. Xiang, E. Zhong, Z. Lu, and Q. Yang. Active transfer learning for cross-system recommendation. AAAI Conference, 2013.
[132] Z. Zheng and B. Padmanabhan. On active learning for data acquisition. IEEE International
Conference on Data Mining, pages 562–569, 2002.
[133] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global
consistency. NIPS Conference, pages 321–328, 2003.
[134] Y. Zhu, S. J. Pan, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Heterogeneous transfer learning
for image classification. Special Track on AI and the Web, associated with The Twenty-Fourth
AAAI Conference on Artificial Intelligence, 2010.
[135] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semisupervised learn-
ing using Gaussian fields and harmonic functions. Proceedings of the ICML Workshop on the
Continuum from Labeled to Unlabeled Data, pages 58–65, 2003.
[136] X. Zhu, P. Zhang, X. Lin, and Y. Shi. Active learning from data streams. ICDM Conference,
pages 757–762, 2007.
[137] Z. Zhu, X. Zhu, Y. Ye, Y. Guo, and X. Xue. Transfer active learning. CIKM Conference, pages
2169–2172, 2011.
[138] M. Zuluaga, G. Sergent, A. Krause, M. Puschel, and Y. Shi. Active Learning for Multi-
Objective Optimization. JMLR, 28(1):462–470, 2013.
Chapter 23
Visual Classification
23.1 Introduction
Extracting meaningful knowledge from very large datasets is a challenging task that requires
the application of machine learning methods. This task is called data mining, the aim of which is to
retrieve, explore, predict, and derive new information from a given dataset. Given the complexity of
the task and the size of the dataset, users should be involved in this process because, by providing
adequate data and knowledge visualizations, the pattern recognition capabilities of the human can be
used to drive the learning algorithm [6]. This is the goal of Visual Data Mining [78, 85]: to present
the data in some visual form, allowing the human to get insight into the data, draw conclusions, and
directly interact with the data [18]. In [75], the authors define visual data mining as “the process
of interaction and analytical reasoning with one or more visual representations of abstract data that leads to the visual discovery of robust patterns in these data that form the information and
knowledge utilised in informed decision making.”
Visual data mining techniques have proven to be of high value in exploratory data analysis and
they also have a high potential for exploring large databases [31]. This is particularly important in
a context where an expert user could make use of domain knowledge to either confirm or correct a
dubious classification result. An example of this interactive process is presented in [83], where the
graphical interactive approaches to machine learning make the learning process explicit by visualiz-
ing the data and letting the user ‘draw’ the decision boundaries. In this work, parameter and model selection are no longer required because the user controls every step of the inductive process.
By means of visualization techniques, researchers can focus on and analyze patterns of data from
datasets that are too complex to be handled by automated data analysis methods. The essential
idea is to help researchers examine the massive information stream at the right level of abstraction
through appropriate visual representations and to take effective actions in real-time [46]. Interactive
visual data mining has powerful implications in leveraging the intuitive abilities of the human for
data mining problems. This may lead to solutions that can model data mining problems in a more
intuitive and unrestricted way. Moreover, by using such techniques, the user also has much better
understanding of the output of the system even in the case of single test instances [1, 3].
The research field of Visual Data Mining has witnessed constant growth and interest. In 1999, in a Guest Editor's Introduction to the Computer Graphics and Applications journal [85], Wong writes:
All signs indicate that the field of visual data mining will continue to grow at an even
faster pace in the future. In universities and research labs, visual data mining will play
a major role in physical and information sciences in the study of even larger and more
complex scientific data sets. It will also play an active role in nontechnical disciplines
to establish knowledge domains to search for answers and truths.
More than ten years later, Keim presents new challenges and applications [44]:
Nearly all grand challenge problems of the 21st century, such as climate change, the
energy crisis, the financial crisis, the health crisis and the security crisis, require the
analysis of very large and complex datasets, which can be done neither by the com-
puter nor the human alone. Visual analytics is a young active science field that comes
with a mission of empowering people to find solutions for complex problems from
large complex datasets. By tightly integrating human intelligence and intuition with the
storage and processing power of computers, many recently developed visual analyt-
ics solutions successfully help people in analyzing large complex datasets in different
application domains.
In this chapter, we focus on one particular task of visual data mining, namely visual classifica-
tion. The classification of objects based on previously classified training data is an important area
within data mining and has many real-world applications (see Section 23.3). The chapter is orga-
nized as follows: in this introduction, we present the requirements for Visual Classification (Sec-
tion 23.1.1), a set of challenges (Section 23.1.3), and a brief overview of some of the approaches
organized by visualization metaphors (Section 23.1.2); in Section 23.2, we present the main visu-
alization approaches for visual classification. For each approach, we introduce at least one of the
seminal works and one application. In Section 23.3, we present some of the most recent visual clas-
sification systems that have been applied to real-world problems. In Section 23.4, we give our final
remarks.
1. to quickly grasp the primary factors influencing the classification with very little knowledge
of statistics;
2. to see the whole model and understand how it applies to records, rather than the visualization
being specific to every record;
5. to infer record counts and confidence in the shown probabilities so that the reliability of the
classifier’s prediction for specific values can be assessed quickly from the graphics;
7. the system should handle many attributes without creating an incomprehensible visualization
or a scene that is impractical to manipulate.
Several systems visualize the decisions made by classifiers and the evidence for those decisions. The Class Radial Visualization [69] is an integrated
visualization system that provides interactive mechanisms for a deep analysis of classification re-
sults and procedures. In this system, class items are displayed as squares and equally distributed
around the perimeter of a circle. Objects to be classified are displayed as colored points in the cir-
cle and the distance between the point and the squares represents the uncertainty of assigning that
object to the class. In [65], the author presents two interactive methods to improve the results of
a classification task: the first one is an interactive decision tree construction algorithm with a help
mechanism based on Support Vector Machines (SVM); the second one is a visualization method
used to try to explain SVM results. In particular, it uses a histogram of the data distribution accord-
ing to the distance to the boundary and, linked to it, a set of scatter-plot matrices or parallel coordinates. This method can also be used to help the user in the parameter-tuning step of the SVM algorithm, and can significantly reduce the time needed for the classification.
In [18], an overview of the available techniques in the light of different categorizations is presented. The role of interaction techniques is discussed, as well as the important question of how
to select an appropriate visualization technique for a task.
The problem of identifying adequate visual representation is also discussed in [56]. The authors
classify the visual techniques in two classes: technical and interactive techniques. For each approach
they discuss advantages and disadvantages in visualizing data to be mined.
The work in [11] presents how to integrate visualization and data mining techniques for knowledge discovery. In particular, this work examines the strengths and weaknesses of information visualization techniques and data mining techniques.
In [25], the authors present a model for hierarchical aggregation in information visualization
for the purpose of improving overview and scalability of large scale visualization. A set of standard
visualization techniques is presented and a discussion of how they can be extended with hierarchical
aggregation functionality is given.
23.2 Approaches
In this section, we present an overview of many of the most important approaches used in data
visualization that have been applied to visual classification. This survey is specifically designed to
present only visual classification approaches. For each approach, we added a reference to at least one
of the seminal works and one example of an application for the specific classification task. We did
not enter into discussions on the appropriateness, advantages, and disadvantages of each technique,
which can be found in other surveys presented in Section 23.1.4. We present the approaches in
alphabetical order: nomograms, parallel coordinates, radial visualizations, scatter plots, topological
maps, and trees. All the figures in this Section were produced with R,1 and the code to reproduce
these plots can be downloaded.2
23.2.1 Nomograms
A nomogram 3 is any graphical representation of a numerical relationship. Invented by the French mathematician Maurice d’Ocagne in 1891, a nomogram has as its primary purpose to enable the user to graphically compute the outcome of an equation without needing a computer or calculator.
Today, nomograms are often used in medicine to predict illness based on some evidence. For exam-
ple, [57] shows the utility of such a tool to estimate the probability of diagnosis of acute myocardial
infarction. In this case, the nomogram is designed in such a way that it can be printed on paper and
easily used by physicians to obtain the probability of diagnosis without using any calculator or com-
puter. There are a number of nomograms used in daily clinical practice for prognosis of outcomes
of different treatments, especially in the field of oncology [43, 81]. In Figure 23.1, an example of a
nomogram for predicting the probability of survival given factors like age and cholesterol level is
shown.4
The main benefit of this approach is simple and clear visualization of the complete model and
the quantitative information it contains. The visualization can be used for exploratory analysis and
classification, as well as for comparing different probabilistic models.
1 http://www.r-project.org/
2 http://www.purl.org/visualclassification
3 http://archive.org/details/firstcourseinnom00broduoft
4 http://cran.r-project.org/web/packages/rms/index.html
FIGURE 23.1: Nomograms. Given the age of a person and the level of cholesterol of the patient, by
drawing a straight line that connects these two points on the graph, it is possible to read much infor-
mation about survival probabilities (Y >= 1, Y >= 2, Y = 3) according to different combinations
of features.
In parallel coordinates, starting from the y-axis, n copies of the real line are placed parallel to (and equidistant from) the y-axis. Each line is labeled from x1 to xn. A point c with coordinates (c1, c2, ..., cn) is represented by a polygonal line whose n vertices are at (i − 1, ci) for i = 1, ..., n.
Since points that belong to the same class are usually close in the n-dimensional space, objects
of the same class have similar polygonal lines. Therefore, one can immediately see groups of lines
that correspond to points of the same class. Axes ordering, spacing, and filtering can significantly
increase the effectiveness of this visualization, but these processes are complex for high dimen-
sional datasets [86]. In [78], the authors present an approach to measure the quality of the parallel
coordinates-view according to some ranking functions.
In Figure 23.2, an example of parallel coordinates used to classify the Iris Dataset5 is shown. Each four-dimensional object has been projected onto four parallel coordinates. Flowers of the same kind show similar polygonal patterns; however, edge cluttering is already a problem even with this small number of objects. 6
FIGURE 23.2 (See color insert.): Parallel coordinates. In this example, each object has four di-
mensions and represents the characteristics of a species of iris flower (petal and sepal width and
length in logarithmic scale). The three types of lines represent the three kinds of iris. With paral-
lel coordinates, it is easy to see common patterns among flowers of the same kind; however, edge
cluttering is already visible even with a small dataset.
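The book's figures were produced with R; as a rough Python analogue of the parallel-coordinates view of the Iris data in Figure 23.2, the following sketch assumes that pandas, matplotlib, and scikit-learn are installed.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

# One polygonal line per flower; lines of the same species share a color,
# so flowers of the same kind show similar patterns across the four axes.
parallel_coordinates(df, class_column="species", alpha=0.4)
plt.title("Iris data on parallel coordinates")
plt.tight_layout()
plt.show()
```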
FIGURE 23.3: Star coordinates. Three five-dimensional objects are mapped on this star plot. Each
coordinate is one of the axis radiating from the center. In this example, the three objects are cars
described by features like: number of cylinders, horse power, and miles per gallon. The number at
the end of each axis represents the maximum value for that dimension. Cars with similar features
have similar polygons, too.
In this way, users not only have an intuitive visual evaluation, but also obtain a precise evaluation of the consistency of the cluster structure by calculating geometrical information from the data distributions.
A matrix of scatter plots of the Iris data (Figure 23.4) makes it possible to study which pairs of features allow for a better separation. Even with this relatively small number of items, the problem of overlapping points is already visible.
In [45], the authors discuss the issue of the high degree of overlap in scatter plots in exploring
large data sets. They propose a generalization of scatter plots where the analyst can control the de-
gree of overlap, allowing the analyst to generate many different views for revealing patterns and
relations from the data. In [78], an alternative solution to this problem is given by presenting a way
to measure the quality of the scatter plots view according to some ranking functions. For example,
a projection into a two-dimensional space may need to satisfy a certain optimality criterion that
attempts to preserve the distances between the class means. In [20], a kind of projection that is similar to Fisher's linear discriminants, but faster to compute, is proposed. In [7], a similarity-dissimilarity plot, which projects points onto a two-dimensional plane, is discussed. This plot provides information about the quality of the features in the feature space, and classification accuracy can be predicted from the assessment of the features on this plot. This approach has been studied on synthetic and real-life datasets to demonstrate the usefulness of visualizing high-dimensional data in biomedical pattern classification.
23.2.4.1 Clustering
In [19], the authors compare two approaches for projecting multidimensional data onto a two-
dimensional space: Principal Component Analysis (PCA) and random projection. They investigate
which of these approaches best fits nearest neighbor classification when dealing with two types
of high-dimensional data: images and micro arrays. The result of this investigation is that PCA
is more effective for severe dimensionality reduction, while random projection is more suitable
when keeping a high number of dimensions. By using one of the two approaches, the accuracy
of the classifier is greatly improved. This shows that the use of PCA and random projection may
lead to more efficient and more effective nearest neighbour classification. In [70], an interactive
visualization tool for high-speed power system frequency data streams is presented. A k-median
approach for clustering is used to identify anomaly events in the data streams. The objective of this
work is to visualize the deluge of expected data streams for global situational awareness, as well
as the ability to detect disruptive events and classify them. The work in [2] discusses an interactive approach for
nearest neighbor search in order to choose projections of the data in which the patterns of the data
containing the query point are well distinguished from the entire data set. The repeated feedback
of the user over multiple iterations is used to determine a set of neighbors that are statistically
significant and meaningful.
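A minimal sketch of the kind of comparison performed in [19] is shown below, using scikit-learn on a small image dataset; the dataset, the choice of dimensionalities, and the use of 5-nearest-neighbor classification are illustrative assumptions rather than the exact experimental setup.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)          # 64-dimensional image data
knn = KNeighborsClassifier(n_neighbors=5)

for name, reducer in [("PCA (severe reduction)", PCA(n_components=2)),
                      ("Random projection (mild reduction)",
                       GaussianRandomProjection(n_components=30, random_state=0))]:
    Xr = reducer.fit_transform(X)
    acc = cross_val_score(knn, Xr, y, cv=5).mean()   # 5-NN accuracy in the reduced space
    print(f"{name:40s} 5-NN accuracy: {acc:.3f}")
```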
FIGURE 23.4: Scatter plots. In the Iris Dataset, flowers are represented by four-dimensional vec-
tors. In this figure, the matrix of scatterplots presents all the possible two-dimensional combinations
of features. The shade of grey of each point represents the kind of iris. Some combinations allow for a better separation between classes; nevertheless, even with this relatively small number of items, the problem of overlapping points is already visible.
The projection of the audio data results in the transformation of diffuse, nebulous classes in high-
dimensional space into compact clusters in the low-dimensional space that can be easily separated
by simple clustering mechanisms. In this space, decision boundaries for optimal classification can
be more easily identified using simple clustering criteria.
In [21], a similar approach is used as a visualization tool to understand the relationships be-
tween categories of textual documents, and to help users to visually audit the classifier and identify
suspicious training data. When plotted on the Cartesian plane according to this formulation, the
documents that belong to one category have specific shifts along the x-axis and the y-axis. This
approach is very useful for comparing the effects of different probabilistic models such as the Bernoulli, multinomial, or Poisson models. The same approach can be applied to the problem of parameter optimization for probabilistic text classifiers, as discussed in [62].
FIGURE 23.5: Self Organizing Maps. A 5 × 5 hexagonal SOM has been trained on a dataset of
wines. Each point (triangle, circle, or cross) represents a wine that originally is described by a 13-
dimensional vector. The shape of the point represents the category of the wine; the shade of grey of
each activation unit (the big circles of the grid) is the predicted category. Wines that are similar in
the 13-dimensional space are close to each other on this grid.
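The SOM in Figure 23.5 was produced with R; the following is a rough Python sketch of the same idea, assuming that the third-party minisom package and scikit-learn are installed. The grid size and training parameters are illustrative.

```python
import numpy as np
from minisom import MiniSom                      # third-party package: pip install minisom
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)                # 13-dimensional wine measurements
X = StandardScaler().fit_transform(X)

som = MiniSom(5, 5, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)

# Map each wine onto its best-matching unit; similar wines land on nearby units.
for cls in np.unique(y):
    units = {som.winner(x) for x in X[y == cls]}
    print(f"class {cls} occupies units: {sorted(units)}")
```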
The Generative Topographic Mapping (GTM) is a probabilistic counterpart of the SOM, in which the mapping is defined by a generative model with latent variables. As a consequence, there is a well-defined objective function given by the log likelihood,
and convergence to a local maximum of the objective function is guaranteed by the use of the
Expectation Maximization algorithm.
In [4], GTM is used to cluster motor unit action potentials for the analysis of the behavior of the neuromuscular system. The aim of the analysis is to reveal how many motor units are active during a muscle contraction. This work compares the strengths and weaknesses of GTM and principal component analysis (PCA), an alternative multidimensional projection technique. The advantage of PCA is that the method allows the visualization of objects in a Euclidean space where the perception of distance is easy to understand. On the other hand, the main advantage of GTM is that each unit may be considered as an individual cluster, and access to these micro-clusters may be very useful for the selection or elimination of wanted or unwanted information.
23.2.6 Trees
During the 1980s, the appeal of graphical user interfaces encouraged many developers to create
node-link diagrams. By the early 1990s, several research groups developed innovative methods of
tree browsing that offered different overview and browsing strategies. For a history of the develop-
ment of visualization tools based on trees, refer to [73]. In this section, we present four variants of
visualization of trees: decision trees, tree maps, hyperbolic trees, and phylogenetic trees.
FIGURE 23.6: Decision trees. Each node of the tree predicts the average car mileage given the
price, the country, the reliability, and the car type according to the data from the April 1990 issue of
Consumer Reports. In this example, given the price and the type of the car, we are able to classify the car into different categories of gas consumption.
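The regression tree in Figure 23.6 was built in R; as a rough Python analogue, the following sketch fits and draws a small classification tree with scikit-learn, so that the whole model can be inspected visually. The dataset and depth limit are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each node shows the split test, the number of samples reaching it, and the class
# distribution, so the whole model can be read directly from the drawing.
plt.figure(figsize=(10, 6))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()
```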
FIGURE 23.7 (See color insert.): Treemaps. This plot represents a 2010 dataset of population size and gross national income for each country. The size of each node of the treemap is
proportional to the size of the population, while the shade of blue of each box represents the gross
national income of that country. The countries of a continent are grouped together into a rectangular
area.
Another system takes a different interactive approach, where the user interactively edits projections of multi-dimensional data and “paints” regions to build a decision tree [80]. The visual interaction of this system combines Parallel Coordinates and Star Coordinates by showing this “dual” projection of the data.
23.2.6.2 Treemap
The Treemap visualization technique [72] makes use of the area available on the display, map-
ping hierarchies onto a rectangular region in a space-filling manner. This efficient use of space al-
lows large hierarchies to be displayed and facilitates the presentation of semantic information. Each
node of a tree map has a weight that is used to determine the size of the node's bounding box. The
weight may represent a single domain property, or a combination of domain properties. A node’s
weight determines its display size and can be thought of as a measure of importance or degree of
interest [39].
In Figure 23.7, a tree map shows the gross national income per country. Each box (the node of
the tree) represents a country, and the size of the box is proportional to the size of the population of
that country. The shade of grey of the box reflects the gross national income of the year 2010.10
10 http://cran.r-project.org/web/packages/treemap/index.html
Treemaps can also be displayed in 3D [30]. For example, patent classification systems intel-
lectually organize the huge number of patents into pre-defined technology classes. To visualize the
distribution of one or more patent portfolios, an interactive 3D treemap can be generated, in which
the third dimension represents the number of patents associated with a category.
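As a rough Python analogue of the treemap in Figure 23.7, the following sketch draws a simple one-level treemap, assuming that the third-party squarify package is installed; the country names and figures in the example are invented purely for illustration.

```python
import matplotlib.pyplot as plt
import squarify                       # third-party package: pip install squarify

# Hypothetical (population, income) pairs in the spirit of Figure 23.7:
# rectangle area encodes population, shade encodes income.
countries = ["A", "B", "C", "D", "E"]
population = [1390, 1360, 330, 270, 210]          # millions (illustrative)
income = [0.2, 0.15, 0.9, 0.35, 0.5]              # normalized income (illustrative)

colors = plt.cm.Blues([0.3 + 0.7 * v for v in income])
squarify.plot(sizes=population, label=countries, color=colors, alpha=0.9)
plt.axis("off")
plt.title("Treemap: area = population, shade = income (illustrative data)")
plt.show()
```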
23.3 Systems
One of the most important characteristics of a visual classification system is that users should gain insights about the data [15]: for example, how much the data within each class varies, which classes are close to or distinct from each other, which features in the data play an important role in discriminating one class from another, and so on. In addition, the analysis of misclassified data should provide a better understanding of which types of classes are difficult to classify. Such insight can then be fed back to the classification process in both the training and the test phases.
In this section, we present a short but meaningful list of visual classification systems that have
been published in the last five years and that fulfill most of the previous characteristics. The aim
of this list is to address how visual classification systems support automated classification for real-
world problems.
Such tools are aimed at researchers who could benefit greatly from the ability to express user preferences about how a classifier should work.
EnsembleMatrix [77] allows users to create an ensemble classification system by discovering
appropriate combination strategies. This system supplies a visual summary that spans multiple classifiers and helps users understand the models' various complementary properties. EnsembleMatrix provides two basic mechanisms for exploring combinations of classifiers: (i) partitioning, which divides the class space into multiple partitions; and (ii) arbitrary linear combinations of the classifiers for each of these partitions.
The ManiMatrix (Manipulable Matrix) system is an interactive system that enables researchers
to intuitively refine the behavior of classification systems [41]. ManiMatrix focuses on the manual refinement of the sets of thresholds that are used to translate the probabilistic output of classifiers into classification decisions. By appropriately setting such parameters as the costs of misclassifying items, it is possible to modify the behavior of the algorithm so that it is best aligned with the desired performance of the system. ManiMatrix enables its users to directly interact with a confusion matrix and to view the implications of incremental changes to the matrix via a real-time interactive cycle of reclassification and visualization.
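ManiMatrix itself is interactive, but the underlying step of turning probabilistic outputs into decisions under a user-specified misclassification-cost matrix can be sketched as follows; the probabilities and cost values below are illustrative.

```python
import numpy as np

def cost_sensitive_decisions(probs, cost):
    """probs: (n_instances, n_classes) posterior probabilities.
    cost[i, j]: cost of predicting class j when the true class is i.
    For each instance, pick the class with minimum expected cost."""
    expected_cost = probs @ cost          # (n_instances, n_classes)
    return np.argmin(expected_cost, axis=1)

probs = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
# Misclassifying a true class-1 item as class 0 is five times worse than the reverse.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
print(cost_sensitive_decisions(probs, cost))   # [1, 1]: both items pushed toward class 1
```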
23.3.3 iVisClassifier
The iVisClassifier system [15] allows users to explore and classify data based on Linear Discriminant Analysis (LDA), a supervised dimension reduction method. Given a high-dimensional dataset with
cluster labels, LDA projects the points onto a reduced-dimensional representation. This low-dimensional space provides a visual overview of the cluster structure. LDA enables users to understand
each of the reduced dimensions and how they influence the data by reconstructing the basis vectors
in the original data domain.
In particular, iVisClassifier interacts with all the reduced dimensions obtained by LDA through
parallel coordinates and a scatter plot. By using heat maps, iVisClassifier gives an overview of the
clusters' relationships both in the original space and in the reduced-dimensional space. A case study
of facial recognition shows that iVisClassifier facilitates the interpretability of the computational
model. The experiments showed that iVisClassifier can efficiently support a user-driven classifica-
tion process by reducing human search space, e.g., recomputing LDA with a user-selected subset of
data and mutual filtering in parallel coordinates and the scatter plot.
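As a rough illustration of the kind of LDA-based overview that iVisClassifier builds on (not the system itself), the following sketch uses scikit-learn's LinearDiscriminantAnalysis on the Iris data; both the library and the dataset are assumptions made purely for the example:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA projects labeled data onto at most (number of classes - 1) dimensions
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)

# Scatter plot of the reduced space: cluster structure is visible at a glance
plt.scatter(Z[:, 0], Z[:, 1], c=y)
plt.xlabel("discriminant 1")
plt.ylabel("discriminant 2")
plt.title("LDA projection of a labeled dataset")
plt.show()

# lda.scalings_ holds the basis vectors of the projection; inspecting them in
# the original feature space helps interpret each reduced dimension.
print(lda.scalings_[:, :2])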
23.3.4 ParallelTopics
When analyzing large text corpora, questions pertaining to the relationships between topics
and documents are difficult to answer with existing visualization tools. For example, what are the
characteristics of the documents based on their topical distribution, and which documents contain
multiple topics at once? ParallelTopics [23] is a visual analytics system that integrates interactive
visualization with probabilistic topic models for the analysis of document collections.
ParallelTopics makes use of the Parallel Coordinate metaphor to present the probabilistic dis-
tribution of a document across topics. This representation can show how many topics a document
is related to and also the importance of each topic to the document of interest. ParallelTopics also
supports other tasks, which are also essential to understanding a document collection, such as sum-
marizing the document collection into major topics, and presenting how the topics evolve over time.
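A minimal sketch of this parallel-coordinates view of per-document topic distributions (the topic proportions below are made up for illustration; pandas and matplotlib are assumed to be available):

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical document-topic probabilities (each row sums to 1), one axis per topic
docs = pd.DataFrame(
    [[0.70, 0.10, 0.15, 0.05],
     [0.05, 0.60, 0.05, 0.30],
     [0.25, 0.25, 0.25, 0.25],
     [0.10, 0.05, 0.80, 0.05]],
    columns=["topic 1", "topic 2", "topic 3", "topic 4"],
)
docs["doc"] = ["d1", "d2", "d3", "d4"]

# Each polyline is one document: a flat line indicates a mixed-topic document,
# a single peak indicates a document dominated by one topic.
parallel_coordinates(docs, class_column="doc")
plt.ylabel("topic probability")
plt.show()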
23.3.5 VisBricks
The VisBricks visualization approach provides a new visual representation in the form of a
highly configurable framework that is able to incorporate any existing visualization as a building
block [54]. This method carries forward the idea of breaking up the inhomogeneous data into groups
to form more homogeneous subsets, which can be visualized independently and thus differently.
The visualization technique embedded in each block can be tailored to different analysis tasks.
This flexible representation supports many explorative and comparative tasks. In VisBricks, there
are two levels of analysis: the total impression of all VisBricks together gives a comprehensive high-level overview of the different groups of data, while each VisBrick independently shows the details
of the group of data it represents.
23.3.6 WHIDE
The Web-based Hyperbolic Image Data Explorer (WHIDE) system is a Web-based visual data mining tool for the analysis of multivariate bioimages [50]. This kind of analysis covers both the spatial
structure of the sample (i.e., its morphology) and molecular colocation or interaction.
WHIDE utilizes hierarchical hyperbolic self-organizing maps (H2SOM), a variant of the SOM, in
combination with Web browser technology.
WHIDE has been applied to a set of bioimages recorded from fields of view in tissue sections
from a colon cancer study, to compare tissue from normal colon with tissue classified as tumor. The
result of the use of WHIDE in this particular context has shown that this system efficiently reduces
the complexity of the data by mapping each of the pixels to a cluster, and provides a structural basis
for a sophisticated multimodal visualization, which combines topology preserving pseudo-coloring
with information visualization.
of the classifier in the context of the labeled documents, as well as for judging the quality of the
classifier in iterative feedback loops.
Bibliography
[1] Charu C. Aggarwal. Towards effective and interpretable data mining by visual interaction.
SIGKDD Explor. Newsl., 3(2):11–22, January 2002.
[2] Charu C. Aggarwal. On the use of human-computer interaction for projected nearest neighbor
search. Data Min. Knowl. Discov., 13(1):89–117, July 2006.
[3] Charu C. Aggarwal. Toward exploratory test-instance-centered diagnosis in high-dimensional
classification. IEEE Trans. on Knowl. and Data Eng., 19(8):1001–1015, August 2007.
[4] Adriano O. Andrade, Slawomir Nasuto, Peter Kyberd, and Catherine M. Sweeney-Reed. Gen-
erative topographic mapping applied to clustering and visualization of motor unit action po-
tentials. Biosystems, 82(3):273–284, 2005.
[5] Mihael Ankerst, Christian Elsen, Martin Ester, and Hans-Peter Kriegel. Visual classifica-
tion: an interactive approach to decision tree construction. In Proceedings of the Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’99,
ACM, pages 392–396, New York, NY, USA, 1999.
[6] Mihael Ankerst, Martin Ester, and Hans-Peter Kriegel. Towards an effective cooperation of
the user and the computer for classification. In KDD’00, ACM, pages 179–188, Boston, MA,
USA, 2000.
[7] Muhammad Arif. Similarity-dissimilarity plot for visualization of high dimensional data in
biomedical pattern classification. J. Med. Syst., 36(3):1173–1181, June 2012.
[8] Barry Becker, Ron Kohavi, and Dan Sommerfield. Visualizing the simple Bayesian classifier. In
Usama Fayyad, Georges G. Grinstein, and Andreas Wierse, editors, Information Visualization
in Data Mining and Knowledge Discovery, pages 237–249. Morgan Kaufmann Publishers Inc.,
San Francisco, 2002.
[9] Barry G. Becker. Using mineset for knowledge discovery. IEEE Computer Graphics and
Applications, 17(4):75–78, 1997.
[10] Andreas Becks, Stefan Sklorz, and Matthias Jarke. Exploring the semantic structure of tech-
nical document collections: A cooperative systems approach. In Opher Etzion and Peter
Scheuermann, editors, CoopIS, volume 1901 of Lecture Notes in Computer Science, pages
120–125. Springer, 2000.
[11] Enrico Bertini and Denis Lalanne. Surveying the complementary role of automatic data analy-
sis and visualization in knowledge discovery. In Proceedings of the ACM SIGKDD Workshop
on Visual Analytics and Knowledge Discovery: Integrating Automated Analysis with Interac-
tive Exploration, VAKD ’09, ACM, pages 12–20, New York, NY, USA, 2009.
[12] Christopher M. Bishop, Markus Svensén, and Christopher K. I. Williams. GTM: The genera-
tive topographic mapping. Neural Computation, 10(1):215–234, 1998.
[13] Clifford Brunk, James Kelly, and Ron Kohavi. Mineset: An integrated system for data mining.
In Daryl Pregibon, David Heckerman, Heikki Mannila, editors, KDD-97, AAAI Press, pages
135–138, Newport Beach, CA, August 14-17 1997.
[14] Matthew Chalmers and Paul Chitson. BEAD: Explorations in information visualization. In
Nicholas J. Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, SIGIR, pages 330–
337. ACM, 1992.
[15] Jaegul Choo, Hanseung Lee, Jaeyeon Kihm, and Haesun Park. iVisClassifier: An interactive
visual analytics system for classification based on supervised dimension reduction. In 2010
IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 27–34, 2010.
[16] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Informa-
tion Theory, 13(1):21–27, 1967.
[17] A.M. Cuadros, F.V. Paulovich, R. Minghim, and G.P. Telles. Point placement by phylogenetic
trees and its application to visual analysis of document collections. In IEEE Symposium on
Visual Analytics Science and Technology, 2007, VAST 2007, pages 99–106, 2007.
[18] Maria Cristina Ferreira de Oliveira and Haim Levkowitz. From visual data exploration to
visual data mining: A survey. IEEE Trans. Vis. Comput. Graph., 9(3):378–394, 2003.
[19] S. Deegalla and H. Bostrom. Reducing high-dimensional data by principal component analysis
vs. random projection for nearest neighbor classification. In 5th International Conference on
Machine Learning and Applications, 2006, ICMLA ’06, pages 245–250, 2006.
[20] Inderjit S. Dhillon, Dharmendra S. Modha, and W.Scott Spangler. Class visualization of high-
dimensional data with applications. Computational Statistics & Data Analysis, 41(1):59–90,
2002.
[21] Giorgio Maria Di Nunzio. Using scatterplots to understand and improve probabilistic models
for text categorization and retrieval. Int. J. Approx. Reasoning, 50(7):945–956, 2009.
[22] Stephan Diehl, Fabian Beck, and Michael Burch. Uncovering strengths and weaknesses of
radial visualizations—an empirical approach. IEEE Transactions on Visualization and Com-
puter Graphics, 16(6):935–942, November 2010.
[23] Wenwen Dou, Xiaoyu Wang, R. Chang, and W. Ribarsky. Paralleltopics: A probabilistic ap-
proach to exploring document collections. In 2011 IEEE Conference on Visual Analytics
Science and Technology (VAST), pages 231–240, 2011.
[24] G. Draper, Y. Livnat, and R.F. Riesenfeld. A survey of radial methods for information visual-
ization. IEEE Transactions on Visualization and Computer Graphics, 15(5):759–776, 2009.
[25] N. Elmqvist and J. Fekete. Hierarchical aggregation for information visualization: Overview,
techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graph-
ics, 16(3):439–454, 2010.
[26] Daniel Engel, Lars Hüttenberger, and Bernd Hamann. A survey of dimension reduction meth-
ods for high-dimensional data analysis and visualization. In Christoph Garth, Ariane Middel,
and Hans Hagen, editors, VLUDS, volume 27 of OASICS, pages 135–149. Schloss Dagstuhl -
Leibniz-Zentrum fuer Informatik, Germany, 2011.
[27] Katia Romero Felizardo, Elisa Yumi Nakagawa, Daniel Feitosa, Rosane Minghim, and
José Carlos Maldonado. An approach based on visual text mining to support categorization
and classification in the systematic mapping. In Proceedings of the 14th International Con-
ference on Evaluation and Assessment in Software Engineering, EASE’10, British Computer
Society, Swinton, UK, pages 34–43, 2010.
[28] José Roberto M. Garcia, Antônio Miguel V. Monteiro, and Rafael D. C. Santos. Visual data
mining for identification of patterns and outliers in weather stations’ data. In Proceedings of
the 13th International Conference on Intelligent Data Engineering and Automated Learning,
IDEAL’12, pages 245–252, Springer-Verlag Berlin, Heidelberg, 2012.
[29] Zhao Geng, ZhenMin Peng, R.S. Laramee, J.C. Roberts, and R. Walker. Angular histograms:
Frequency-based visualizations for large, high dimensional data. IEEE Transactions on Visu-
alization and Computer Graphics, 17(12):2572–2580, 2011.
[30] M. Giereth, H. Bosch, and T. Ertl. A 3d treemap approach for analyzing the classificatory dis-
tribution in patent portfolios. In IEEE Symposium on Visual Analytics Science and Technology,
VAST ’08, pages 189–190, 2008.
[31] Charles D. Hansen and Chris R. Johnson. Visualization Handbook. Academic Press, 1st
edition, December 2004.
[32] F. Heimerl, S. Koch, H. Bosch, and T. Ertl. Visual classifier training for text document retrieval.
IEEE Transactions on Visualization and Computer Graphics, 18(12):2839–2848, 2012.
[33] William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors. The 35th In-
ternational ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’12, Portland, OR, USA, ACM, August 12-16, 2012.
[34] Patrick Hoffman, Georges Grinstein, Kenneth Marx, Ivo Grosse, and Eugene Stanley. DNA
visual and analytic data mining. In Proceedings of the 8th Conference on Visualization ’97,
VIS ’97, pages 437–ff., IEEE Computer Society Press Los Alamitos, CA, 1997.
[35] Patrick Hoffman, Georges Grinstein, and David Pinkney. Dimensional anchors: A graphic
primitive for multidimensional multivariate information visualizations. In Proceedings of the
1999 Workshop on New Paradigms in Information Visualization and Manipulation in conjunc-
tion with the Eighth ACM International Conference on Information and Knowledge Manage-
ment, NPIVM ’99, pages 9–16, ACM, New York, NY, 1999.
[36] Hailong Hou, Yan Chen, R. Beyah, and Yan-Qing Zhang. Filtering spam by using factors
hyperbolic tree. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008.
IEEE, pages 1–5, 2008.
[37] A. Inselberg and Bernard Dimsdale. Parallel coordinates: A tool for visualizing multi-
dimensional geometry. In Visualization ’90, Proceedings of the First IEEE Conference on
Visualization, pages 361–378, 1990.
[38] Alfred Inselberg. The plane with parallel coordinates. The Visual Computer, 1(2):69–91, 1985.
[41] Ashish Kapoor, Bongshin Lee, Desney Tan, and Eric Horvitz. Interactive optimization for
steering machine classification. In Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, CHI ’10, pages 1343–1352, ACM, New York, NY, 2010.
[42] G. V. Kass. An exploratory technique for investigating large quantities of categorical data.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(2):119–127, 1980.
[43] M. W. Kattan, J. A. Eastham, A. M. Stapleton, T. M. Wheeler, and P. T. Scardino. A preoper-
ative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J
Natl Cancer Inst, 90(10):766–71, 1998.
[44] Daniel Keim and Leishi Zhang. Solving problems with visual analytics: challenges and appli-
cations. In Proceedings of the 11th International Conference on Knowledge Management and
Knowledge Technologies, i-KNOW ’11, pages 1:1–1:4, ACM, New York, NY, 2011.
[45] Daniel A. Keim, Ming C. Hao, Umeshwar Dayal, Halldor Janetzko, and Peter Bak. General-
ized scatter plots. Information Visualization, 9(4):301–311, December 2010.
[46] Daniel A. Keim, Joern Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. Master-
ing The Information Age—Solving Problems with Visual Analytics. Eurographics, November
2010.
[47] Daniel A. Keim, Fabrice Rossi, Thomas Seidl, Michel Verleysen, and Stefan Wrobel. In-
formation visualization, visual data mining and machine learning (Dagstuhl Seminar 12081).
Dagstuhl Reports, 2(2):58–83, 2012.
[48] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. In James A.
Anderson and Edward Rosenfeld, editors, Neurocomputing: Foundations of research, pages
511–521. MIT Press, Cambridge, MA, 1982.
[49] Teuvo Kohonen. Self-Organizing Maps. Springer Series in Information Retrieval. Springer,
second edition, March 1995.
[50] Jan Kölling, Daniel Langenkämper, Sylvie Abouna, Michael Khan, and Tim W. Nattkem-
per. Whide—a web tool for visual data mining colocation patterns in multivariate bioimages.
Bioinformatics, 28(8):1143–1150, April 2012.
[51] John Lamping, Ramana Rao, and Peter Pirolli. A focus+context technique based on hyper-
bolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, CHI ’95, pages 401–408, ACM Press/Addison-Wesley
Publishing Co. New York, NY, 1995.
[52] Gregor Leban, Blaz Zupan, Gaj Vidmar, and Ivan Bratko. Vizrank: Data visualization guided
by machine learning. Data Min. Knowl. Discov., 13(2):119–136, 2006.
[53] Ioannis Leftheriotis. Scalable interaction design for collaborative visual exploration of big
data. In Proceedings of the 2012 ACM International Conference on Interactive Tabletops and
Surfaces, ITS ’12, pages 271–276, ACM, New York, NY, USA, 2012.
[54] A. Lex, H. Schulz, M. Streit, C. Partl, and D. Schmalstieg. Visbricks: Multiform visualization
of large, inhomogeneous data. IEEE Transactions on Visualization and Computer Graphics,
17(12):2291–2300, 2011.
[55] Yan Liu and Gavriel Salvendy. Design and evaluation of visualization support to facilitate
decision trees classification. International Journal of Human-Computer Studies, 65(2):95–
110, 2007.
[56] H. Ltifi, M. Ben Ayed, A.M. Alimi, and S. Lepreux. Survey of information visualization
techniques for exploitation in KDD. In IEEE/ACS International Conference on Computer
Systems and Applications, 2009, AICCSA 2009, pages 218–225, 2009.
[57] J. Lubsen, J. Pool, and E. van der Does. A practical device for the application of a diagnostic
or prognostic function. Methods of Information in Medicine, 17(2):127–129, April 1978.
[58] Christian Martin, Naryttza N. Diaz, Jörg Ontrup, and Tim W. Nattkemper. Hyperbolic SOM-
based clustering of DNA fragment features for taxonomic visualization and classification.
Bioinformatics, 24(14):1568–1574, July 2008.
[59] Dieter Merkl. Text classification with self-organizing maps: Some lessons learned. Neuro-
computing, 21(1–3):61–77, 1998.
[60] Martin Mozina, Janez Demsar, Michael W. Kattan, and Blaz Zupan. Nomograms for visu-
alization of naive Bayesian classifier. In Jean-François Boulicaut, Floriana Esposito, Fosca
Giannotti, and Dino Pedreschi, editors, PKDD, volume 3202 of Lecture Notes in Computer
Science, pages 337–348. Springer, 2004.
[61] Emmanuel Müller, Ira Assent, Ralph Krieger, Timm Jansen, and Thomas Seidl. Morpheus:
Interactive exploration of subspace clustering. In Proceedings of the 14th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 1089–
1092, ACM, New York, NY, USA, 2008.
[62] Giorgio Maria Di Nunzio and Alessandro Sordoni. A visual tool for Bayesian data analysis:
The impact of smoothing on naive Bayes text classifiers. In William R. Hersh, Jamie Callan,
Yoelle Maarek, and Mark Sanderson, editors, The 35th International ACM SIGIR Conference
on Research and Development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August
12-16, 2012. ACM, 2012.
[63] P. Oesterling, G. Scheuermann, S. Teresniak, G. Heyer, S. Koch, T. Ertl, and G.H. Weber.
Two-stage framework for a topology-based projection and visualization of classified document
collections. In 2010 IEEE Symposium on Visual Analytics Science and Technology (VAST),
pages 91–98, 2010.
[64] J.G. Paiva, L. Florian, H. Pedrini, G.P. Telles, and R. Minghim. Improved similarity trees
and their application to visual data classification. IEEE Transactions on Visualization and
Computer Graphics, 17(12):2459–2468, 2011.
[65] François Poulet. Towards effective visual data mining with cooperative approaches. In Simeon
J. Simoff et al. (ed.) Visual Data Mining—Theory, Techniques and Tools for Visual Analytics,
pages 389–406, 2008.
[66] Brett Poulin, Roman Eisner, Duane Szafron, Paul Lu, Russell Greiner, David S. Wishart, Alona
Fyshe, Brandon Pearcy, Cam Macdonell, and John Anvik. Visual explanation of evidence with
additive classifiers. In AAAI, pages 1822–1829. AAAI Press, 2006.
[67] Bhiksha Raj and Rita Singh. Classifier-based non-linear projection for adaptive endpointing
of continuous speech. Computer Speech & Language, 17(1):5–26, 2003.
[68] Randall M. Rohrer, John L. Sibert, and David S. Ebert. A shape-based visual interface for text
retrieval. IEEE Computer Graphics and Applications, 19(5):40–46, 1999.
[69] Christin Seifert and Elisabeth Lex. A novel visualization approach for data-mining-related
classification. In Ebad Banissi, Liz J. Stuart, Theodor G. Wyeld, Mikael Jern, Gennady L. An-
drienko, Nasrullah Memon, Reda Alhajj, Remo Aslak Burkhard, Georges G. Grinstein, Den-
nis P. Groth, Anna Ursyn, Jimmy Johansson, Camilla Forsell, Urska Cvek, Marjan Trutschl,
Francis T. Marchese, Carsten Maple, Andrew J. Cowell, and Andrew Vande Moere, editors,
Information Visualization Conference, pages 490–495. IEEE Computer Society, 2009.
[70] B. Shneiderman. Direct manipulation: A step beyond programming languages. Computer,
16(8):57–69, 1983.
[71] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualiza-
tions. In 1996. Proceedings of IEEE Symposium on Visual Languages, pages 336–343, 1996.
[72] Ben Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. ACM Trans.
Graph., 11(1):92–99, January 1992.
[73] Ben Shneiderman, Cody Dunne, Puneet Sharma, and Ping Wang. Innovation trajectories for
information visualizations: Comparing treemaps, cone trees, and hyperbolic trees. Information
Visualization, 11(2):87–105, 2012.
[74] Simeon J. Simoff, Michael H. Böhlen, and Arturas Mazeika, editors. Visual Data Mining—
Theory, Techniques and Tools for Visual Analytics, volume 4404 of Lecture Notes in Computer
Science. Springer, 2008.
[75] Simeon J. Simoff, Michael H. Böhlen, and Arturas Mazeika, editors. Visual Data Mining—
Theory, Techniques and Tools for Visual Analytics, volume 4404 of Lecture Notes in Computer
Science. Springer, 2008.
[76] Rita Singh and Bhiksha Raj. Classification in likelihood spaces. Technometrics, 46(3):318–
329, 2004.
[77] Justin Talbot, Bongshin Lee, Ashish Kapoor, and Desney S. Tan. Ensemblematrix: Interactive
visualization to support machine learning with multiple classifiers. In Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, CHI ’09, pages 1283–1292,
ACM, New York, NY, 2009.
[78] A. Tatu, G. Albuquerque, M. Eisemann, P. Bak, H. Theisel, M. Magnor, and D. Keim. Au-
tomated analytical methods to support visual exploration of high-dimensional data. IEEE
Transactions on Visualization and Computer Graphics, 17(5):584–597, 2011.
[79] Soon T. Teoh and Kwan-Liu Ma. StarClass: Interactive visual classification using star coordinates. In Proceedings of the 3rd SIAM International Conference on Data Mining, pages 178–185,
2003.
[80] Soon Tee Teoh and Kwan-Liu Ma. Paintingclass: Interactive construction, visualization and
exploration of decision trees. In Proceedings of the Ninth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, KDD ’03, pages 667–672, ACM, New
York, NY, 2003.
[81] T. G. Clark, M. E. Stewart, D. G. Altman, H. Gabra, and J. F. Smyth. A prognostic model for
ovarian cancer. British Journal of Cancer, 85(7):944–952, October 2001.
[82] Michail Vlachos, Carlotta Domeniconi, Dimitrios Gunopulos, George Kollios, and Nick
Koudas. Non-linear dimensionality reduction techniques for classification and visualization.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining, pages 645–651, ACM, New York, NY, 2002.
[83] Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H. Witten. Interactive
machine learning: Letting users build classifiers. Int. J. Hum. Comput. Stud., 56(3):281–292,
March 2002.
[84] James A. Wise, James J. Thomas, Kelly Pennock, D. Lantrip, M. Pottier, Anne Schur, and
V. Crow. Visualizing the non-visual: Spatial analysis and interaction with information from
text documents. In Nahum D. Gershon and Stephen G. Eick, editors, INFOVIS, pages 51–58.
IEEE Computer Society, 1995.
[85] Pak Chung Wong. Guest editor’s introduction: Visual data mining. IEEE Computer Graphics
and Applications, 19(5):20–21, 1999.
[86] Jing Yang, Wei Peng, Matthew O. Ward, and Elke A. Rundensteiner. Interactive hierarchical
dimension ordering, spacing and filtering for exploration of high dimensional datasets. In
Proceedings of the Ninth annual IEEE Conference on Information Visualization, INFOVIS’03,
pages 105–112, IEEE Computer Society, Washington, DC, 2003.
[87] Ke-Bing Zhang, M.A. Orgun, R. Shankaran, and Du Zhang. Interactive visual classification
of multivariate data. In 11th International Conference on Machine Learning and Applications
(ICMLA), 2012, volume 2, pages 246–251, 2012.
[88] Ke-Bing Zhang, M.A. Orgun, and Kang Zhang. A visual approach for external cluster valida-
tion. In CIDM 2007. IEEE Symposium on Computational Intelligence and Data Mining, 2007,
pages 576–582, 2007.
[89] Hong Zhou, Xiaoru Yuan, Huamin Qu, Weiwei Cui, and Baoquan Chen. Visual clustering in
parallel coordinates. Computer Graphics Forum, 27(3):1047–1054, 2008.
Chapter 24
Evaluation of Classification Methods
Nele Verbiest
Ghent University
Belgium
[email protected]
Karel Vermeulen
Ghent University
Belgium
[email protected]
Ankur Teredesai
University of Washington, Tacoma
Tacoma, WA
[email protected]
24.1 Introduction
The evaluation of the quality of the classification model significantly depends on the eventual
utility of the classifier. In this chapter we provide a comprehensive treatment of various classification
evaluation techniques, and more importantly, provide recommendations on when to use a particular
technique to maximize the utility of the classification model.
First we discuss some generic validation schemes for evaluating classification models. Given a
dataset and a classification model, we describe how to set up the data to use it effectively for training
and then validation or testing. Different schemes are studied, among which are cross validation
schemes and bootstrap models. Our aim is to elucidate how the choice of a scheme should take into
account the bias, variance, and time complexity of the model.
Once the train and test datasets are available, the classifier model can be constructed and the
test instances can be classified to solve the underlying problem. In Section 24.3 we list evaluation
measures to evaluate this process. The most important evaluation measures are related to accuracy;
these measures evaluate how well the classifier is able to recover the true class of the instances.
We distinguish between discrete classifiers, which return a class, and probabilistic
classifiers, which return the probability that an instance belongs to each class. In the first case, typical
evaluation measures like accuracy, recall, precision, and others are discussed, and guidelines on
which measure to use in which cases are given. For probabilistic classifiers we focus on ROC curve
analysis and its extension for multi-class problems. We also look into evaluation measures that are
not related to accuracy, such as time complexity, storage requirements, and some special cases.
We conclude this chapter in Section 24.4 with statistical tests for comparing classifiers. After
the evaluation measures are calculated for each classifier and each dataset, the question arises which
classifiers are better. We emphasize the importance of using statistical tests to compare classifiers, as
too often authors posit that their new classifiers are better than others based on average performances
only. However, it might happen that one classifier has a better average performance but that it only
outperforms the other classifiers for a few datasets. We discuss parametric statistical tests briefly,
as the assumptions for parametric tests are mostly not satisfied in practice. Next, we rigorously describe non-
parametric statistical tests and give recommendations for using them correctly.
In K-fold cross validation (K-CV), the data is divided into K parts; each part is used in turn as the test set while the classifier is trained on the remaining K − 1 parts. The main advantage of this technique is that each data point is evaluated
exactly once.
The choice of K is a trade-off between bias and variance [30]. For low values of K, the sizes of
the training sets in the K-CV procedure are smaller, and the classifications are more biased depend-
ing on how the performance of the classifier changes with the instances included in the train set and
with sample size. For instance, when K = 2, the two training sets are completely different and only
cover one half of the original data, so the quality of the predictions can differ drastically for the two
train sets. When K = 5, all train sets have 60 percent of the data in common, so the bias will be
lower. For high values of K, the procedure becomes more variable due to the stronger dependence
on the training data, as all training sets are very similar to one another.
Typical good values for K are 5 or 10. Another widely used related procedure is 5 × 2 CV [8], in
which the 2-CV procedure is repeated five times. When K equals the data size, K-CV is referred to
as Leave-One-Out-CV (LOOCV, [51]). In this case, each instance is classified by building a model
on all remaining instances and applying the resulting model on the instance. Generalized-CV (GCV)
is a simplified version of LOOCV that is less computationally expensive.
Other than splitting the data in K folds, one can also consider all subsets of a given size P, which
is referred to as leave-P-out-CV or also delete-P-CV [3]. The advantage of this method is that many
more combinations of training instances are used, but this method comes with a high computational
cost.
Another type of methods that use more combinations of training instances are repeated learning-
testing methods, also called Monte-Carlo-CV (MCCV, [54]) methods. They randomly select a frac-
tion of the data as train data and use the remaining data as test data, and repeat this process multiple
times.
Bootstrap validation methods build the classification model on several training sets that are drawn
from the original data by sampling with replacement and then apply the model to the original data. As each
bootstrap sample has observations in common with the original training sample that is used as the
test set, a factor 0.632 is applied to correct the optimistic resulting performance. For this reason, the
method is also referred to as 0.632-bootstrap [13].
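A minimal sketch of the 0.632 correction in the form it is usually implemented (combining the optimistic resubstitution accuracy with the accuracy on the out-of-bag instances; the dataset, classifier, and number of repetitions are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(y)

estimates = []
for _ in range(50):                                   # 50 bootstrap repetitions
    idx = rng.integers(0, n, size=n)                  # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)             # instances not drawn (out-of-bag)
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    acc_resub = clf.score(X[idx], y[idx])             # optimistic: evaluated on the training sample
    acc_oob = clf.score(X[oob], y[oob])               # pessimistic: evaluated on unseen instances
    estimates.append(0.632 * acc_oob + 0.368 * acc_resub)

print(np.mean(estimates))                             # 0.632-bootstrap accuracy estimate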
We note that when using CV methods, stratification of the data is an important aspect. One
should make sure that the data is divided such that the class distribution of the whole data set is also
reflected in the separate folds. Otherwise, a so-called data shift [39] can occur. This is especially
important when working with imbalanced problems. The easiest way is to split each class in K parts
and to assign a part to each fold. A more involved approach described in [7] attempts to direct similar
instances to different folds. This so-called Unsupervised-CV (UCV) method is deterministic, that
is, each run will return the same results.
A data shift can occur not only in the class distribution; one should also make sure that the input
features follow the same distribution across the folds, to prevent a covariate data shift [39, 40, 49]. One solution
is to use Distribution-Balanced-Stratified-CV (DB-SCV, [56]), which assigns nearby instances to
different folds. An improved variant is Distribution-Optimally-Balanced-Stratified-CV (DOB-SCV,
[49]).
We conclude this section with recommendations on the use of validation schemes. When choos-
ing a validation scheme, three aspects should be kept in mind: variance, bias, and computational
cost. In most situations, K-CV schemes can be used safely. The choice of K depends on the goal of
the evaluation. For model selection, low variance is important, so a LOOCV scheme can be used. To
assess the quality of a classifier, the bias is more important, so K = 5 or K = 10 is a good choice. For
imbalanced datasets or small datasets, K = 5 is often recommended. K-CV schemes with higher K
values are computationally more expensive, especially LOOCV, which is mostly too complex to use
for large datasets. The advantage of MCCV over K-CV is that more combinations of instances are
used, but of course this comes with a high computational cost. For small datasets, 0.632-bootstrap
is highly recommended.
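A minimal sketch of a stratified 10-fold cross validation run, under the assumption that scikit-learn is available (the dataset and classifier are stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds preserve the overall class distribution in every fold,
# which guards against the class-distribution shift discussed above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores.mean(), scores.std())   # mean accuracy over the folds and its spread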
Based on the confusion matrix, many metrics can be defined. The most well-known one is
classification accuracy, denoted by acc here. It is defined as the ratio of correctly classified instances,
which can also be expressed as the sum of the diagonal elements in the confusion matrix:
acc(c) = \sum_{i=1}^{k} M_{ii}. \qquad (24.1)
It is a general measure that gives an idea of the overall performance of the classifier: in the example
in Table 24.1, acc(c) = 0.5, while for the classifier in the example in Table 24.2, acc(c) = 0.7.
Other well-known evaluation metrics are only defined for binary classifiers. For instance, the
recall (also referred to as true positive rate or sensitivity) is the true positives compared to the
number of truly positive instances:
recall(c) = \frac{TP}{TP + FN}. \qquad (24.2)
The precision is the true positives compared to the number of instances predicted positive:
precision(c) = \frac{TP}{TP + FP}. \qquad (24.3)
Another metric is the specificity (also called true negative rate), defined as the number of correctly
classified negative instances divided by the number of truly negative instances:
specificity(c) = \frac{TN}{FP + TN}. \qquad (24.4)
Finally, the false alarm (written as falarm here, also known as the false positive rate) is the false
positives compared to the number of negative instances:
falarm(c) = \frac{FP}{TN + FP}. \qquad (24.5)
In the example in Table 24.2, recall(c) = 0.83, precision(c) = 0.71, specificity(c) = 0.5, and
falarm(c) = 0.5.
Based on precision and recall, the F-measure (F1 -score) can be used, which is the harmonic
mean of precision and recall:
F_1(c) = 2\,\frac{precision \cdot recall}{precision + recall}. \qquad (24.6)
TABLE 24.3: Confusion Matrix Corresponding to the Classifier Represented in Table 24.1
A B C
A 2 1 1
B 1 3 1
C 1 0 0
TABLE 24.4: Confusion Matrix Corresponding to the Classifier Represented in Table 24.2
P N
P 5 (TP) 1 (FN)
N 2 (FP) 2 (TN)
The F1 measure is thus a weighted (harmonic) average of precision and recall. It is a special case of the more general F_β measure,
defined as:

F_\beta(c) = (1 + \beta^2)\,\frac{precision \cdot recall}{\beta^2 \cdot precision + recall}. \qquad (24.7)

The higher β, the more emphasis is put on recall; the lower β, the more influence precision has.
For instance, in the example in Table 24.2, it holds that F_1(c) = 0.77, while F_2(c) = 0.80, which is
closer to the recall, and F_{0.5}(c) = 0.74, which is closer to the precision.
All metrics above are defined for binary classification problems, but they can easily be used for
multi-class problems. A common practice is to calculate the measure for each class separately and
then to average the metrics over all classes (one vs. all). In the case of the three-class classification
problem in Table 24.1, it means that first the binary classification problem that has A as first class
and B or C as second class is considered, and that the corresponding metric is calculated. This is
repeated for class B vs. class A and C and for class C vs. class A and B. At the end, the three measures
are averaged. For example, in the example in Table 24.1, the recall is recall(c) = (0.5 + 0.6 + 0)/3 ≈ 0.37.
Another metric that can handle multi-class problems is Cohen’s kappa [1], which is an agreement
measure that compensates for classifications that may be due to chance, defined as follows:
\kappa(c) = \frac{n \sum_{i=1}^{k} M_{ii} - \sum_{i=1}^{k} M_{i\cdot} M_{\cdot i}}{n^2 - \sum_{i=1}^{k} M_{i\cdot} M_{\cdot i}} \qquad (24.8)

where M_{\cdot i} is the sum of the elements in the ith column of M and M_{i\cdot} the sum of the elements in the
ith row of M.
There is no simple answer to the question of which evaluation metric to use, and in general there
is no classifier that is optimal for each evaluation metric. When evaluating general classification
problems, the accuracy is mostly sufficient, together with an analysis of Cohen’s kappa. Of course,
when problems are imbalanced, one should also take into account the F-measures to check if there
is a good balance between recall and precision. When considering real-world problems, one should
be careful when selecting appropriate evaluation metrics. For instance, when there is a high cost
related to classifying instances to the negative class, a high false alarm is problematic. When it is
more important not to overlook instances that are actually positive, a high recall is important. In some
cases it is recommended to use multiple evaluation metrics, and a balance between them should be
aimed for.
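To make the formulas concrete, the following sketch recomputes the example values from Tables 24.3 and 24.4 directly from the raw counts (a plain illustration of the measures above, not a library API):

import numpy as np

# Binary confusion matrix of Table 24.4 (rows: true class, columns: predicted class)
TP, FN, FP, TN = 5, 1, 2, 2

accuracy = (TP + TN) / (TP + FN + FP + TN)           # 0.7
recall = TP / (TP + FN)                              # 0.83
precision = TP / (TP + FP)                           # 0.71
specificity = TN / (FP + TN)                         # 0.5
falarm = FP / (TN + FP)                              # 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.77

# Cohen's kappa for the three-class counts of Table 24.3
M = np.array([[2, 1, 1],
              [1, 3, 1],
              [1, 0, 0]])
n = M.sum()
chance = (M.sum(axis=1) * M.sum(axis=0)).sum()       # sum over i of M_i. * M_.i
kappa = (n * np.trace(M) - chance) / (n ** 2 - chance)

print(accuracy, round(recall, 2), round(precision, 2), specificity, falarm,
      round(f1, 2), round(kappa, 2))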
FIGURE 24.1: ROC-space with two classifiers c and d. The arrow indicates the direction of overall
improvement of classifiers.
in Table 24.5. These probabilities are generated by the classifiers themselves and contain more
information than the output of discrete classifiers. As such, information would be lost if one
ignored the probabilities and discretized them to obtain a discrete classifier.
We first discuss binary classification problems. The most important evaluation metrics for prob-
abilistic classifiers are related to Receiver Operating Characteristics (ROC, [38]) analysis. These
techniques place classifiers in the ROC space, which is a two-dimensional space with the false pos-
itive rate on the horizontal axis and the true positive rate on the vertical axis (Figure 24.1). A point
with coordinates (x, y) in the ROC space represents a classifier with false positive rate x and true
positive rate y. Some special points are (1, 0), which is the worst possible classifier, and (0, 1), which
is a perfect classifier. A classifier is better than another one if it is situated north-west of it in the
ROC space. For instance, in Figure 24.1, classifier c is better than classifier d as it has a lower false
positive rate and a higher true positive rate.
Probabilistic classifiers need a threshold to make a final decision for each class. For each possible
threshold, a different discrete classifier is obtained, with a different TP rate and FP rate. By consid-
ering all possible thresholds, putting their corresponding classifiers in the ROC space, and drawing a
line through them, a so-called ROC curve is obtained. In Figure 24.2, an example of such a stepwise
function is given. In this simple example, the curve is constructed by considering all probabilities
as thresholds, calculating the corresponding TP rate and FP rate and putting them in the plot. How-
ever, when a large dataset needs to be evaluated, more efficient algorithms are required. In that case,
Algorithm 24.1 can be used, which only requires O(n log n) operations. It makes use of the observation that instances classified as positive for a certain threshold will also be classified as positive for all lower
thresholds. The algorithm sorts the instances in decreasing order of the outputs p(x) of the
classifier, and processes the instances one at a time to calculate the corresponding TP rate and FP rate.
ROC curves are a great tool to visualize and analyze the performance of classifiers. The most
important advantage is that they are independent of the class distribution, which makes them inter-
esting to evaluate imbalanced problems. In order to compare two probabilistic classifiers with each
FIGURE 24.3: ROC curves of classifiers A and B (left) and of classifiers C and D (right); each panel plots the TP rate against the FP rate.
other, the two ROC curves can be plotted in the same ROC space, as illustrated in Figure 24.3. On
the left-hand side, it is easy to see that classifier A outperforms B, as its ROC curve lies north-west
of the ROC curve of B: for each threshold, the TP rate of A is higher than that of B, while the FP
rate of B is higher. On the right-hand side, the situation
is less obvious and it is hard to say if C is better than D or the other way around. However, some
interesting information can be extracted from the graph, that is, we can conclude that C is more
precise than D for high thresholds, which means that it is the best choice if not many false positives
are allowed. D on the other hand is more appropriate when a high sensitivity is required.
Although this visualization is a handy tool that is frequently used, many researchers prefer a
single value to assess the performance of a classifier. The Area-Under the Curve (AUC) is a measure
that is often used to that goal. It is the surface between the ROC curve and the horizontal axis, and
is a good measure to reflect the quality, as ROC curves that are more to the north west are better
and have a bigger surface. AUC values are often used as a single evaluation metric in imbalanced
classification problems. In real-world problems, however, one should keep in mind that ROC curves
reveal much more information than the single AUC value.
An important point when evaluating classifiers using ROC curves is that they do not measure
the absolute performance of the classifier but the relative ranking of the probabilities. For instance,
when the probabilities of a classifier are all higher than 0.5 and the threshold is also 0.5, no instance
will be classified as negative. However, when all probabilities of the positive instances are higher
than the probabilities of the negative instances, the classifier will have a perfect ROC curve and the
AUC will be 1. This example shows that determining an appropriate threshold for the final classifier
should be done appropriately.
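The listing of Algorithm 24.1 is not reproduced here, but the sorting idea it relies on can be sketched as follows (a minimal illustration; the label and score arrays are hypothetical):

def roc_points(labels, scores):
    """ROC points obtained by sweeping the threshold over the sorted scores."""
    P = sum(labels)                  # number of positive instances (labels are 0/1)
    N = len(labels) - P              # number of negative instances
    points, tp, fp, prev = [], 0, 0, None
    for y, s in sorted(zip(labels, scores), key=lambda t: -t[1]):
        if s != prev:                # emit a point every time the threshold changes
            points.append((fp / N, tp / P))
            prev = s
        if y == 1:
            tp += 1
        else:
            fp += 1
    points.append((fp / N, tp / P))  # final point (1, 1)
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and classifier probabilities
pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.55, 0.3])
print(pts, auc(pts))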
When evaluating probabilistic classifiers for multi-class problems [38], a straightforward option
is to decompose the multi-class problem into multiple two-class problems and to carry out the ROC
analysis for each two-class problem. Understanding and interpreting these multiple visualizations is
challenging; therefore, specific techniques for multi-class ROC analysis have been developed.
The first approach is discussed in [41], where ROC analysis is extended for three-class classifica-
tion problems. Instead of an ROC curve, an ROC surface is plotted, and the Volume Under the Sur-
face (VUS) is calculated analogously to the AUC. ROC surfaces can be used to compare classifiers
on three-class classification problems by analyzing the maximum information gain on each of them.
ROC analysis for multi-class problems is receiving more and more attention. In [10], the multi-class
problem is decomposed into partial two-class problems, where each problem retains the labels of the in-
stances in one specific class, and the instances in the remaining classes get another label. The AUCs
are calculated for each partial problem, and the final AUC is obtained by computing the weighted
average of these AUCs, where the weight is determined based on the frequency of the corresponding class in the data. Another generalization is given in [19], where the AUCs of all combinations
of two different classes are computed, and the sum of these intermediate AUCs is divided by the
number of all possible misclassifications. The advantage of this approach over the previous one is
that this approach is less sensitive to class distributions. In [14], the extension from ROC curve to
ROC surface is further extended to ROC hyperpolyhedrons, which are multi-dimensional geometric
objects. The volume of these geometric objects is the analogue of the VUS or AUC. A more recent
approach can be found in [20], where a graphical visualization of the performance is developed.
To conclude, we note that computational cost is an important issue for multi-class ROC analysis,
as typically all pairs of classes need to be considered. Many researchers [31–34] are working on faster
approaches to circumvent this problem.
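A minimal sketch of the prevalence-weighted one-vs-rest averaging attributed to [10] above (scikit-learn's roc_auc_score is assumed to be available; the array contents are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_ovr_auc(y_true, proba, classes):
    """Average of one-vs-rest AUCs, each weighted by the frequency of its class."""
    y_true = np.asarray(y_true)
    total = 0.0
    for j, c in enumerate(classes):
        y_bin = (y_true == c).astype(int)   # one class vs. the rest
        weight = y_bin.mean()               # class frequency in the data
        total += weight * roc_auc_score(y_bin, proba[:, j])
    return total

# Hypothetical three-class example: each row of proba sums to 1
proba = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.1, 0.7, 0.2]])
print(weighted_ovr_auc(["A", "B", "C", "A", "C", "B"], proba, classes=["A", "B", "C"]))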
Throughout this section, we denote the number of classifiers by k, the number of cases (also referred to as instances, datasets, or examples) by n, and the significance level by α. The
calculated evaluation measure of case i based on classifier j is denoted Y_{ij}.
If the null hypothesis is rejected, the only information provided by the within-subjects ANOVA
is that there are significant differences between the classifiers' performances; no information is
provided about which classifier outperforms another. It is tempting to use multiple pairwise com-
parisons to get more information. However, this will lead to an accumulation of the Type I error
coming from the combination of pairwise comparisons, also referred to as the Family Wise Er-
ror Rate (FWER, [42]), which is the probability of making at least one false discovery among the
different hypotheses.
In order to make more detailed conclusions after an ANOVA test rejects the null hypothesis, i.e.,
significant differences are found, post-hoc procedures are needed that correct for multiple testing to
avoid inflation of Type I errors. Most of the post-hoc procedures we discuss here are explained for
multiple comparisons with a control method and thus we perform k − 1 post-hoc tests. However, the
extension to an arbitrary number of post-hoc tests is straightforward.
The simplest procedure is the Bonferroni procedure [11], which uses paired t-tests for the
pairwise comparisons but controls the FWER by dividing the significance level by the number of
comparisons made, which here corresponds to an adjusted significance level of α/(k − 1). Equivalently, one can multiply the obtained p-values by the number of comparisons and compare them to the
original significance level α. However, the Bonferroni correction is too conservative when many
comparisons are performed. Another approach is the Dunn-Šidák [50] correction, which adjusts the
p-values to 1 − (1 − p)^{k−1} (equivalently, the original p-values can be compared to the adjusted significance level 1 − (1 − α)^{1/(k−1)}).
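A minimal sketch of the two adjustments just described, applied to the raw p-values of k − 1 comparisons with a control (the p-values below are made up):

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of comparisons, cap at 1."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]

def dunn_sidak(pvalues):
    """Dunn-Sidak-adjusted p-values: 1 - (1 - p)^m for m comparisons."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]

raw = [0.01, 0.04, 0.20]            # hypothetical p-values for k - 1 = 3 comparisons
print(bonferroni(raw))              # approximately [0.03, 0.12, 0.60]
print(dunn_sidak(raw))              # slightly less conservative than Bonferroni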
A more reliable test when interested in all paired comparisons is Tukey’s Honest Significant
Difference (HSD) test [2]. This test controls the FWER so it will not exceed α. To be more specific,
suppose we are comparing classifier j with l. The Tukey’s HSD test is based on the studentized
range statistic

q_{jl} = \frac{|\bar{Y}_{\cdot j} - \bar{Y}_{\cdot l}|}{\sqrt{MS_{res}/n}}.
It corrects for multiple testing by changing the critical value used to reject the null hypothesis. Under
the assumptions of the within-subjects ANOVA, q_{jl} follows a studentized range distribution with
k and (n − 1)(k − 1) degrees of freedom, and the null hypothesis that the mean performance of
classifier j is the same as that of classifier l is rejected if q_{jl} > q_{k,(n−1)(k−1),α}, where q_{k,(n−1)(k−1),α} is the
1 − α percentile of the q_{k,(n−1)(k−1)} distribution.
A similar post-hoc procedure is the Newman-Keuls [28] procedure, which is more powerful than
Tukey's HSD test but does not guarantee that the FWER will not exceed the prespecified significance
level α. The Newman-Keuls procedure uses a systematic stepwise approach to carry out many comparisons. It first orders the sample means \bar{Y}_{\cdot j}. For notational convenience, suppose the index j labels
these ordered sample means in ascending order: \bar{Y}_1 < … < \bar{Y}_k. In the first step, the test verifies whether
the largest and smallest sample means differ significantly from each other, using the test statistic q_{k1}
and the critical value q_{k,(n−1)(k−1),α}, since these sample means are k steps away from each other. If
this null hypothesis is retained, the null hypotheses of all comparisons are retained.
If it is rejected, all comparisons of sample means that are k − 1 steps from one
another are performed, and q_{k2} and q_{k−1,1} are compared to the critical value q_{k−1,(n−1)(k−1),α}. In
a stepwise manner, the range between the means is lowered, until no null hypotheses are rejected
anymore.
A last important post-hoc test is Scheffé's test [46]. This test also controls the FWER, but now
regardless of the number of post-hoc comparisons. It is considered to be very conservative, and hence
Tukey's HSD test may be preferred. The procedure calculates, for each comparison of interest, the
difference between the means of the corresponding classifiers and compares it to the critical
value

\sqrt{(k − 1)\,F_{k−1,(n−1)(k−1),α}\,\frac{2\,MS_{res}}{n}}
binomial distribution. That is, when the number of instances is sufficiently large, under the null
hypothesis, n_1 approximately follows a normal distribution with mean n/2 and variance n/4. The
null hypothesis is rejected if |n_1 − n/2|/(\sqrt{n}/2) > z_{α/2}, where z_{α/2} is the 1 − α/2 percentile of the
standard normal distribution.
An alternative paired test is Wilcoxon signed-ranks test [52]. The Wilcoxon signed-ranks test
uses the differences Yi1 −Yi2 . Under the null hypothesis, the distribution of these differences is sym-
metric around the median and hence we must have that the distribution of the positive differences
is the same as the distribution of the negative differences. The Wilcoxon signed rank test then aims
to detect a deviation from this to reject the null hypothesis. The procedure assigns a rank to each
difference according to the absolute value of these differences, where the mean of ranks is assigned
to cases with ties. For instance, when the differences are 0.03, 0.06, −0.03, 0.01, −0.04, and 0.02, the
respective ranks are 3.5, 6, 3.5, 1, 5, and 2. Next, the ranks of the positive and negative differences
are summed separately, in our case R+ = 12.5 and R− = 8.5. When few instances are available, to
reject the null hypothesis, min(R+ , R− ) should be less than or equal to a critical value depending
on the significance level and the number of instances; see [48] for tables. For example, when α = 0.1,
the critical value is 2, meaning that in our case the null hypothesis is not rejected at the 0.1 signif-
icance level. When a sufficient number of instances are available, one can rely on the asymptotic
approximation of the distribution of R+ or R− . Let T denote either R+ or R− . Under the null hypoth-
esis, they both have the same approximate normal distribution with mean n(n + 1)/4 and variance
n(n + 1)(2n + 1)/24. The Wilcoxon signed rank test then rejects the null hypothesis when
\frac{|T − n(n + 1)/4|}{\sqrt{n(n + 1)(2n + 1)/24}} > z_{α/2}
where zα/2 is the 1 − α/2 percentile of the standard normal distribution. An important practical
aspect that should be kept in mind is that the Wilcoxon test uses the continuous evaluation measures,
so one should not round the values to one or two decimals, as this would decrease the power of the
test in case of a high number of ties.
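A minimal sketch that reproduces the rank sums of the worked example above (R+ = 12.5, R− = 8.5); scipy.stats.wilcoxon could be used instead for the full test:

def signed_rank_sums(diffs):
    """Rank the absolute differences (average ranks for ties) and return (R+, R-)."""
    absd = [abs(d) for d in diffs]
    order = sorted(range(len(diffs)), key=lambda i: absd[i])
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and absd[order[j + 1]] == absd[order[i]]:
            j += 1                                   # extend the group of tied values
        average_rank = (i + j) / 2.0 + 1.0           # mean of the (1-based) tied positions
        for idx in order[i:j + 1]:
            ranks[idx] = average_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus

print(signed_rank_sums([0.03, 0.06, -0.03, 0.01, -0.04, 0.02]))   # (12.5, 8.5)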
For a sufficient number of instances and classifiers (as a rule of thumb, n > 10 and k > 5), χ²
approximately follows a chi-square distribution with k − 1 degrees of freedom. If χ² exceeds the
critical value χ²_{k−1,α}, where χ²_{k−1,α} is the 1 − α percentile of the chi-square distribution with k − 1
degrees of freedom, the null hypothesis is rejected. For a small number of data sets and classifiers,
exact critical values have been computed [48, 55]. The test statistic for Iman and Davenport's
test, which is less conservative than Friedman’s test and should hence be preferred, is given by
F = \frac{(n − 1)χ²}{n(k − 1) − χ²}

and (assuming a sufficient number of instances and classifiers) it approximately follows an F-distribution with k − 1 and (n − 1)(k − 1) degrees of freedom. If F > F_{k−1,(n−1)(k−1),α}, with
F_{k−1,(n−1)(k−1),α} the 1 − α percentile of an F_{k−1,(n−1)(k−1)} distribution, the null hypothesis is re-
jected. In both cases, if the test statistic exceeds the corresponding critical value, it means that there
are significant differences between the methods, but no other conclusion whatsoever can be made.
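A minimal sketch that runs the Friedman test with SciPy and applies the Iman-Davenport correction given above (the performance matrix is randomly generated purely for illustration):

import numpy as np
from scipy.stats import f as f_dist, friedmanchisquare

# Y[i, j]: evaluation measure of classifier j on data set i (hypothetical values)
rng = np.random.default_rng(0)
Y = rng.uniform(0.6, 0.9, size=(12, 3))
n, k = Y.shape

chi2, p_friedman = friedmanchisquare(*Y.T)            # Friedman chi-square statistic
F = (n - 1) * chi2 / (n * (k - 1) - chi2)             # Iman-Davenport correction
p_iman = 1 - f_dist.cdf(F, k - 1, (n - 1) * (k - 1))  # compare against F_{k-1,(n-1)(k-1)}

print(chi2, p_friedman, F, p_iman)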
Next we discuss two more advanced non-parametric tests that in certain circumstances may improve
upon the Friedman test, especially when the number of classifiers is low. The Friedman Aligned
Rank [22] test calculates the ranks differently. For each data set, the average or median performance
of all classifiers on this data set is calculated and this value is subtracted from each performance
value of the different classifiers to obtain the aligned observations. Next, all kn aligned observa-
tions are assigned a rank, the aligned rank R_{ij}, with i referring to the data set and j referring to the
classifier. The Friedman Aligned Ranks test statistic is given by

T = \frac{(k − 1)\left[\sum_{j=1}^{k} \hat{R}_{\cdot j}^{\,2} − (kn^2/4)(kn + 1)^2\right]}{kn(kn + 1)(2kn + 1)/6 − \frac{1}{k}\sum_{i=1}^{n} \hat{R}_{i\cdot}^{\,2}}

where \hat{R}_{\cdot j} = \sum_{i=1}^{n} R_{ij} equals the rank total of the jth classifier and \hat{R}_{i\cdot} = \sum_{j=1}^{k} R_{ij} equals the rank
total of the ith data set. For a sufficient number of data sets, T approximately follows a χ² distribution
with k − 1 degrees of freedom. If T > χ²_{k−1,α}, the null hypothesis is rejected, with χ²_{k−1,α} the 1 − α
percentile of a χ²_{k−1} distribution.
The Quade test [43] is an improvement upon the Friedman Aligned Ranks test by incorporating
the fact that not all data sets are equally important. That is, some data sets are more difficult to
classify than others and methods that are able to classify these difficult data sets correctly should
be favored. The Quade test computes weighted ranks based on the range of the performances of
j
different classifiers on each data set. To be specific, first calculate the ranks ri as for the Friedman
test. Next, calculate for each data set i the range of the performances of the different classifiers:
max j Yi j − min j Yi j , and rank them with the smallest rank (1) given to the data set with the smallest
range, and so on where average ranks are used in case of ties. Denote the obtained rank for data
set i by Q " i . The weighted # average adjusted rank for data set i with classifier j is then computed as
j
Si j = Qi ri − (k + 1)/2 with (k + 1)/2 the average rank within data sets. The Quade test statistic
is then given by
(n − 1) ∑kj=1 S2j /n
T3 =
n(n + 1)(2n + 1)k(k + 1)(k − 1)/72 − ∑kj=1 S2j /n
where S j = ∑ni=1 Si j is the sum of the weighted ranks for each classifier. T3 approximately fol-
lows an F-distribution with k − 1 and (n − 1)(k − 1) degrees of freedom. If T3 > Fk−1,(n−1)(k−1),α
with Fk−1,(n−1)(k−1),α the 1 − α percentile of an Fk−1,(n−1)(k−1)-distribution, the null hypothesis is
rejected.
When the null hypothesis (stating that all classifiers perform equivalently) is rejected, the av-
erage ranks calculated by these four methods themselves can be used to get a meaningful ranking
of which methods perform best. However, as was the case for parametric multiple comparisons,
post-hoc procedures are still needed to evaluate which pairwise differences are significant. The test
statistic to compare algorithm j with algorithm l for the Friedman test and the Iman Davenport’s
test is given by
R j − Rl
Z jl =
k(k + 1)/6n
with R_j and R_l the average ranks computed in the Friedman and Iman-Davenport procedures. For
the Friedman Aligned Ranks procedure, the test statistic to compare algorithm j with algorithm l is
given by

Z_{jl} = \frac{\hat{R}_j − \hat{R}_l}{\sqrt{k(n + 1)/6}}
with \hat{R}_j = \hat{R}_{\cdot j}/n and \hat{R}_l = \hat{R}_{\cdot l}/n the average aligned ranks. Finally, for the Quade procedure, the test
statistic to compare algorithm j with algorithm l is given by

Z_{jl} = \frac{T_j − T_l}{\sqrt{[k(k + 1)(2n + 1)(k − 1)]/[18n(n + 1)]}}

with T_j = \sum_{i=1}^{n} Q_i r_i^j / [n(n + 1)/2] the average weighted rank without the average adjustment described
in the Quade procedure. All three test statistics Z_{jl} approximately follow a standard normal distribution,
from which an appropriate p-value can be calculated, namely the probability that a standard normally
distributed variable exceeds the absolute value of the observed test statistic.
A post-hoc procedure involves multiple pairwise comparisons based on the test statistics Z jl ,
and as already mentioned for parametric multiple comparisons, these post-hoc tests cannot be used
without caution as the FWER is not controlled, leading to inflated Type I errors. Therefore we
consider post-hoc procedures based on adjusted p-values of the pairwise comparisons to control the
FWER. Recall that the p-value returned by a statistical test is the probability that a more extreme
observation than the current one is observed, assuming the null hypothesis holds. This simple p-
value reflects this probability of one comparison, but does not take into account the remaining
comparisons. Adjusted p-values (APV) deal with this problem and after the adjustment, these APVs
can be compared with the nominal significance level α. The post-hoc procedures that we discuss first
are all designed for multiple comparisons with a control method, that is, we compare one algorithm
against the k − 1 remaining ones. In the following, p j denotes the p-value obtained for the jth null
hypothesis, stating that the control method and the jth method are performing equally well. The p-
values are ordered from smallest to largest: p1 ≤ . . . ≤ pk−1 , and the corresponding null hypotheses
are rewritten accordingly as H1 , . . . , Hk−1 . Below we discuss several procedures to obtain adjusted
p-values: one-step, two-step, step-down, and step-up.
1: One-step. The Bonferroni-Dunn [11] procedure is a simple one-step procedure that divides
the nominal significance level α by the number of comparisons (k − 1) and the usual p-values can
be compared with this level of significance. Equivalently, the adjusted value in this case is min{(k −
1)pi , 1}. Although simple, this procedure may be too conservative for practical use when k is not
small.
2: Two-step. The two-step Li [36] procedure rejects all null hypotheses if the largest p-value
p_{k−1} ≤ α. Otherwise, the null hypothesis related to p_{k−1} is accepted and the remaining null hypotheses H_i with p_i ≤ [(1 − p_{k−1})/(1 − α)]α are rejected. The adjusted p-values are p_i/(p_i + 1 − p_{k−1}).
3: Step-down. We discuss three more advanced step-down methods. The Holm [24] procedure is
the most popular one and starts with the lowest p-value. If p1 ≤ α/(k − 1), the first null hypothesis
is rejected and the next comparison is made. If p2 ≤ α/(k − 2), the second null hypothesis H2 is
also rejected and the next null hypothesis is verified. This process continues until a null hypothesis
cannot be rejected anymore. In that case, all remaining null hypotheses are retained as well. The
adjusted p-values for the Holm procedure are min[max{(k − j)p_j : 1 ≤ j ≤ i}, 1]. Next, the Holland
[23] procedure is similar to the Holm procedure. It rejects all null hypotheses H_1 to H_{i−1} if i is
the smallest integer such that p_i > 1 − (1 − α)^{k−i}; the adjusted p-values are
min[max{1 − (1 − p_j)^{k−j} : 1 ≤ j ≤ i}, 1]. Finally, the Finner [15] procedure, also similar, rejects all null hypotheses
H_1 to H_{i−1} if i is the smallest integer such that p_i > 1 − (1 − α)^{(k−1)/i}. The adjusted p-values are
min[max{1 − (1 − p_j)^{(k−1)/j} : 1 ≤ j ≤ i}, 1].
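The three step-down adjustments can be computed directly from the formulas above. The following is a minimal Python sketch (the function name is our own); it assumes the p-values of the k − 1 comparisons with the control method are given.

def step_down_apvs(pvals):
    # Returns the Holm, Holland, and Finner adjusted p-values, in the order of the sorted p-values.
    p = sorted(pvals)                                # p_1 <= ... <= p_{k-1}
    m = len(p)                                       # m = k - 1 comparisons
    k = m + 1
    holm, holland, finner = [], [], []
    for i in range(1, m + 1):
        holm.append(min(max((k - j) * p[j - 1] for j in range(1, i + 1)), 1.0))
        holland.append(min(max(1 - (1 - p[j - 1]) ** (k - j) for j in range(1, i + 1)), 1.0))
        finner.append(min(max(1 - (1 - p[j - 1]) ** ((k - 1) / j) for j in range(1, i + 1)), 1.0))
    return holm, holland, finner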
4: Step-up. Step-up procedures include the Hochberg [21] procedure, which starts off with
comparing the largest p-value pk−1 with α, then the second largest p-value pk−2 is compared to
α/2 and so on, until a null hypothesis that can be rejected is found. The adjusted p-values are
max{(k − j)p j : (k − 1) ≥ j ≥ i}. Another more involved procedure is the Hommel [25] procedure.
First, it finds the largest j for which p_{n−j+ℓ} > ℓα/j for all ℓ = 1, . . . , j. If no such j exists, all null
hypotheses are rejected. Otherwise, the null hypotheses for which pi ≤ α/ j are rejected. In contrast
to the other procedures, calculating the adjusted p-values cannot be done using a simple formula.
Instead, the procedure listed in Algorithm 24.2 should be used. Finally, Rom's [45] procedure was
developed to increase Hochberg's power. It works in exactly the same way, except that the α values are
now calculated as
α_{k−i} = [ ∑_{j=1}^{i−1} α^j − ∑_{j=1}^{i−2} C(i, j) α_{k−1−j}^{i−j} ] / i
with α_{k−1} = α and α_{k−2} = α/2, where C(i, j) denotes the binomial coefficient. Adjusted p-values
could also be obtained using the formula for α_{k−i}, but no closed-form formula is available [18].
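As an illustration of this recursion, the following is a minimal Python sketch that computes the α values for Rom's step-up procedure (the function name is our own); the first two values reproduce α and α/2, and the remaining ones follow the formula above. For α = 0.05 it yields approximately 0.05, 0.025, 0.0169, 0.0127, 0.0102, . . .

from math import comb

def rom_alphas(num_comparisons, alpha=0.05):
    # a[i] stands for alpha_{k-i}; a[1] = alpha and a[2] = alpha/2.
    a = {1: alpha, 2: alpha / 2.0}
    for i in range(3, num_comparisons + 1):
        s1 = sum(alpha ** j for j in range(1, i))                            # sum_{j=1}^{i-1} alpha^j
        s2 = sum(comb(i, j) * a[j + 1] ** (i - j) for j in range(1, i - 1))  # sum_{j=1}^{i-2} C(i,j) alpha_{k-1-j}^{i-j}
        a[i] = (s1 - s2) / i
    return [a[i] for i in range(1, num_comparisons + 1)]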
Algorithm 24.2 Calculation of the adjusted p-values using Hommel's post-hoc procedure.
INPUT: p-values p_1 ≤ p_2 ≤ · · · ≤ p_{k−1}
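As a concrete illustration, the following is a minimal Python sketch of the standard Hommel adjustment, mirroring the algorithm used in R's p.adjust with method "hommel"; it is offered as an independent sketch rather than a transcription of Algorithm 24.2, and the function name is our own.

import numpy as np

def hommel_adjust(pvals):
    # Hommel-adjusted p-values for an arbitrary (unsorted) vector of p-values.
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    if n <= 1:
        return p.copy()
    order = np.argsort(p)
    ps = p[order]                                    # sorted p-values
    idx = np.arange(1, n + 1)
    q = np.full(n, np.min(n * ps / idx))
    pa = q.copy()
    for m in range(n - 1, 1, -1):
        i1 = np.arange(0, n - m + 1)                 # first n-m+1 positions
        i2 = np.arange(n - m + 1, n)                 # last m-1 positions
        q1 = np.min(m * ps[i2] / np.arange(2, m + 1))
        q[i1] = np.minimum(m * ps[i1], q1)
        q[i2] = q[n - m]
        pa = np.maximum(pa, q)
    adjusted = np.maximum(pa, ps)
    out = np.empty(n)
    out[order] = adjusted                            # restore the original order
    return out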
In some cases, researchers are interested in all pairwise differences in a multiple comparison and
do not simply want to find significant differences with one control method. Some of the procedures
above can be easily extended to this case.
The Nemenyi procedure, which can be seen as a Bonferroni correction, adjusts α in a single
step by dividing it by the number of comparisons performed (k(k − 1)/2).
A more involved method is Shaffer’s procedure [47], which is based on the observation that
the hypotheses are interrelated. For instance, if method 1 is significantly better than method 2 and
method 2 is significantly better than method 3, method 1 will be significantly better than method 3.
Shaffer's method follows a step-down approach, and at step j, it rejects H_j if p_j ≤ α/t_j , where t_j is
the maximum number of hypotheses that can be true given that hypotheses H_1 , . . . , H_{j−1} are false.
This value t_j can be found in the table in [47].
Finally, the Bergmann-Hommel [26] procedure calls an index set I ⊆ {1, . . . , M}, with
M the number of hypotheses, exhaustive if and only if it is possible that all null hypotheses
H_j , j ∈ I, are true while all hypotheses H_j , j ∉ I, are false. Next, the acceptance set A is defined as
∪{I : I exhaustive, min(p_i : i ∈ I) > α/|I|}, and the procedure rejects all H_j with j ∉ A.
We now discuss one test that can be used for multiple comparisons with a control algorithm
without a further post-hoc analysis needed, the multiple sign test [44], which is an extension of
the standard sign test. The multiple sign test proceeds as follows: it counts for each classifier the
number of times it outperforms the control classifier and the number of times it is outperformed
by the control classifier. Let R_j be the minimum of those two values for the jth classifier. The null
hypothesis is now different from the previous ones, since it involves median performances. The null
hypothesis states that the median performances are the same, and it is rejected if R_j does not exceed a critical
value depending on the nominal significance level α, the number of instances n, and the number
k − 1 of alternative classifiers. These critical values can be found in the tables in [44].
All previous methods are based on hypothesis tests, and the only information that can be obtained is whether
algorithms are significantly better or worse than others. Contrast estimation based on medians [9] is
a method that can quantify these differences and reflects the magnitudes of the differences between
the classifiers for each data set. It can be used after significant differences are found. For each pair of
classifiers, the differences between them are calculated for all problems. For each pair, the median of
these differences over all data sets is taken. These medians can now be used to make comparisons
between the different methods, but if one wants to know how a control method performs compared
to all other methods, the mean of all medians can be taken.
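The procedure is straightforward to implement. The following is a minimal Python sketch (the function name is our own) that computes the matrix of median differences and, for each classifier, the mean of its medians against all others.

import numpy as np

def contrast_estimation(scores):
    # scores: (n_datasets, k) matrix of performances.
    scores = np.asarray(scores, dtype=float)
    diffs = scores[:, :, None] - scores[:, None, :]   # diffs[i, j, l] = performance of j minus performance of l on data set i
    M = np.median(diffs, axis=0)                      # M[j, l] = median difference over all data sets
    m = M.mean(axis=1)                                # mean of medians per classifier (zero diagonal included)
    return M, m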
Finally, permutation-based approaches can also be used: if we determine a single constant c
and use it as the critical value to reject a certain null hypothesis, we automatically
control the FWER at the α level. To be more specific, we reject those null hypotheses stating
that classifier j performs equally well as classifier l if the corresponding observed value of |Z_jl|
exceeds c. This constant can easily be found by taking the 1 − α percentile of the permutation null
distribution of max_{j,l} |Z_jl|.
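The permutation null distribution of max_{j,l} |Z_jl| can be approximated by Monte Carlo sampling. Below is a minimal Python sketch under the assumption that a Friedman-type pairwise statistic Z_jl = (R_j − R_l)/√(k(k + 1)/(6n)) is computed on the within-data-set ranks; the function name and the choice of statistic are ours, and ties in the scores are ignored for simplicity.

import numpy as np

def permutation_critical_value(scores, alpha=0.05, n_perm=5000, seed=None):
    # scores: (n_datasets, k) performance matrix. Under the global null the algorithm
    # labels are exchangeable within each data set, so each row is permuted independently.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    scale = np.sqrt(k * (k + 1) / (6.0 * n))          # std. dev. of R_j - R_l under the null

    def max_abs_z(mat):
        ranks = mat.argsort(axis=1).argsort(axis=1) + 1.0   # within-data-set ranks (no tie handling)
        avg = ranks.mean(axis=0)                            # average rank per algorithm
        return np.max(np.abs(avg[:, None] - avg[None, :])) / scale

    null_max = np.empty(n_perm)
    for b in range(n_perm):
        permuted = np.array([rng.permutation(row) for row in scores])
        null_max[b] = max_abs_z(permuted)
    return np.quantile(null_max, 1.0 - alpha)         # the constant c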
Bibliography
[1] A. Ben-David. Comparison of classification accuracy using Cohen’s weighted kappa. Expert
Systems with Applications, 34(2):825–832, 2008.
[2] A. M. Brown. A new software for carrying out one-way ANOVA post hoc tests. Computer
Methods and Programs in Biomedicine, 79(1):89–95, 2005.
[3] A. Celisse and S. Robin. Nonparametric density estimation by exact leave-out cross-validation.
Computational Statistics and Data Analysis, 52(5):2350–2368, 2008.
[4] W. W. Daniel. Applied nonparametric statistics, 2nd ed. Boston: PWS-Kent Publishing Com-
pany, 1990.
[5] J. Derrac, S. Garcı́a, D. Molina, and F. Herrera. A practical tutorial on the use of nonpara-
metric statistical tests as a methodology for comparing evolutionary and swarm intelligence
algorithms. Swarm and Evolutionary Computation, 1(1):3–18, 2011.
[6] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules.
IEEE Transactions on Information Theory, 25(5):601–604, 1979.
[7] N. A. Diamantidis, D. Karlis, and E. A. Giakoumakis. Unsupervised stratification of cross-
validation for accuracy estimation. Artificial Intelligence, 116(1-2):1–16, 2000.
[8] T. Dietterich. Approximate statistical tests for comparing supervised classification learning
algorithms. Neural Computation, 10:1895–1923, 1998.
[9] K. Doksum. Robust procedures for some linear models with one observation per cell. Annals
of Mathematical Statistics, 38:878–883, 1967.
[10] P. Domingos and F. Provost. Well-trained PETs: Improving probability estimation trees.
CeDER Working Paper, Stern School of Business, New York University, NY, 2000.
[11] O. Dunn. Multiple comparisons among means. Journal of the American Statistical Associa-
tion, 56:52–64, 1961.
[12] E. S. Edgington. Randomization tests, 3rd ed. New York: Marcel-Dekker, 1995.
[13] B. Efron and R. Tibshirani. Improvements on cross-validation: The .632+ bootstrap method.
Journal of the American Statistical Association, 92(438):548–560, 1997.
[14] C. Ferri, J. Hernández-Orallo, and M.A. Salido. Volume under the ROC surface for multi-class
problems. In Proc. of 14th European Conference on Machine Learning, pages 108–120, 2003.
[15] H. Finner. On a monotonicity problem in step-down multiple test procedures. Journal of the
American Statistical Association, 88:920–923, 1993.
[16] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis
of variance. Journal of the American Statistical Association, 32:674–701, 1937.
[17] M. Friedman. A comparison of alternative tests of significance for the problem of m rankings.
Annals of Mathematical Statistics, 11:86–92, 1940.
[18] S. Garcı́a, A. Fernández, J. Luengo, and F. Herrera. Advanced nonparametric tests for multi-
ple comparisons in the design of experiments in computational intelligence and data mining:
Experimental analysis of power. Information Sciences, 180:2044–2064, 2010.
[19] D. J. Hand and R. J. Till. A simple generalisation of the area under the ROC curve for multiple
class classification problems. Machine Learning, 45(2):171–186, 2001.
[21] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika,
75:800–803, 1988.
[22] J. Hodges and E. Lehmann. Rank methods for combination of independent experiments in
analysis of variance. Annals of Mathematical Statistics, 33:482–497, 1962.
[23] B. S. Holland and M. D. Copenhaver. An improved sequentially rejective Bonferroni test
procedure. Biometrics, 43:417–423, 1987.
[24] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics, 6:65–70, 1979.
[25] G. Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni
test. Biometrika, 75:383–386, 1988.
[26] G. Hommel and G. Bernhard. A rapid algorithm and a computer program for multiple
test procedures using logical structures of hypotheses. Computer Methods and Programs in
Biomedicine, 43(3-4):213–216, 1994.
[27] R. Iman and J. Davenport. Approximations of the critical region of the Friedman statistic.
Communications in Statistics, 9:571–595, 1980.
[28] M. Keuls. The use of the studentized range in connection with an analysis of variance. Eu-
phytica, 1:112–122, 1952.
[29] R. E. Kirk. Experimental design: Procedures for the Behavioral Sciences, 3rd ed. Pacific
Grove, CA: Brooks/Cole Publishing Company, 1995.
[30] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model se-
lection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence—
Volume 2, IJCAI’95, pages 1137–1143, 1995.
[31] T. C. W. Landgrebe and R. P. W. Duin. A simplified extension of the area under the ROC to
the multiclass domain. In 17th Annual Symposium of the Pattern Recognition Association of
South Africa, 2006.
[32] T. C. W. Landgrebe and R. P. W. Duin. Approximating the multiclass ROC by pairwise anal-
ysis. Pattern Recognition Letters, 28(13):1747–1758, 2007.
[33] T. C. W. Landgrebe and R. P. W. Duin. Efficient multiclass ROC approximation by decompo-
sition via confusion matrix perturbation analysis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(5):810–822, 2008.
[34] T. C. W. Landgrebe and P. Paclik. The ROC skeleton for multiclass ROC estimation. Pattern
Recognition Letters, 31(9):949–958, 2010.
[35] S. C. Larson. The shrinkage of the coefficient of multiple correlation. Journal of Educational
Psychology, 22(1):45–55, 1931.
[36] J. Li. A two-step rejection procedure for testing multiple hypotheses. Journal of Statistical
Planning and Inference, 138:1521–1527, 2008.
[37] J. Luengo, S. Garcı́a, and F. Herrera. A study on the use of statistical tests for experimentation
with neural networks: Analysis of parametric test conditions and non-parametric tests. Expert
Systems with Applications, 36:7798–7808, 2009.
[38] M. Majnik and Z. Bosnić. ROC analysis of classifiers in machine learning: A survey. Intelli-
gent Data Analysis, 17(3):531–558, 2011.
[39] J. G. Moreno-Torres, T. Raeder, R. A. Rodriguez, N. V. Chawla, and F. Herrera. A unifying
view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.
[40] J. G. Moreno-Torres, J. A. Sáez, and F. Herrera. Study on the impact of partition-induced
dataset shift on k-fold cross-validation. IEEE Transactions on Neural Networks and Learning
Systems, 23(8):1304–1312, 2012.
[41] D. Mossman. Three-way ROCs. Medical Decision Making, 19:78–89, 1999.
[42] T. Nichols and S. Hayasaka. Controlling the familywise error rate in functional neuroimaging:
A comparative review. Statistical Methods in Medical Research, 12:419–446, 2003.
[43] D. Quade. Using weighted rankings in the analysis of complete blocks with additive block
effects. Journal of the American Statistical Association, 74:680–683, 1979.
[44] A. Rhyne and R. Steel. Tables for a treatments versus control multiple comparisons sign test.
Technometrics, 7:293–306, 1965.
[45] D. Rom. A sequentially rejective test procedure based on a modified Bonferroni inequality.
Biometrika, 77:663–665, 1990.
[46] H. Scheffé. A method for judging all contrasts in the analysis of variance. Biometrika, 40:87–
104, 1953.
[47] J. Shaffer. Modified sequentially rejective multiple test procedures. Journal of American
Statistical Association, 81:826–831, 1986.
[48] D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures, 4th ed.,
Boca Raton, FL: Chapman and Hall/CRC, 2006.
[49] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-
likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[50] Z. Sidak. Rectangular confidence regions for the means of multivariate normal distributions.
Journal of the American Statistical Association, 62(318):626–633, 1967.
[51] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society, 36:111–147, 1974.
[52] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[53] D. H. Wolpert. The supervised learning no-free-lunch theorems. In Proceedings of the Sixth
Online World Conference on Soft Computing in Industrial Applications, 2001.
[54] Q. S. Xu and Y. Z. Liang. Monte Carlo cross validation. Chemometrics and Intelligent Labo-
ratory Systems, 56(1):1–11, 2001.
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
25.1 Introduction
This chapter will summarize the key resources available in the literature on data classification.
While this book provides a basic understanding of the important aspects of data classification, these
resources will provide the researcher with more in-depth perspectives on each individual topic. In
general, the resources on data classification can be divided into the categories of (i) books, (ii) survey
articles, and (iii) software.
Numerous books exist on data classification, both for general topics in classification and for
more specific subject areas. In addition, numerous software packages exist in the area of data
classification; in fact, software is perhaps the richest category of resources for data classification.
This chapter will list the resources currently available from both commercial and non-commercial
sources for data classification.
This chapter is organized as follows. Section 25.2 presents educational resources on data classi-
fication. This section is itself divided into two subsections. The key books are discussed in Section
25.2.1, whereas the survey papers on data classification will be discussed in Section 25.2.2. Finally,
the software on data classification will be discussed in Section 25.3. Section 25.4 discusses the
conclusions and summary.
Tiberius [92] provides tools based on SVM, neural networks, decision trees, and a variety of other
data modeling methods. It also supports regression modeling, and three-dimensional data visualiza-
tion. Treparel [94] provides high-performance tools based on the SVM classifiers for massive data
sets. The software NeuroXL can be used from within Microsoft Excel for data classification [99].
This is particularly helpful, since it means that the software can be used directly from spreadsheets,
which are commonly used for data manipulation within corporate environments.
Software has also been designed for evaluation and validation of classification algorithms. The
most popular among them is Analyse-it [104], which provides a comprehensive platform for analyz-
ing classification software. It is capable of constructing numerous metrics such as ROC curves for
analytical purposes. The platform integrates well with Excel, which makes it particularly convenient
to use in a wide variety of scenarios.
Examples include EMBL [58] for nucleic acid sequences, and the Protein Information Resource (PIR) [56]
and UniProt [55] for protein sequences.
In the context of image applications, researchers in machine learning and computer vision com-
munities have explored the problem extensively. ImageCLEF [77] and ImageNet [78] are two widely
used image data repositories that are used to demonstrate the performance of image data retrieval
and learning tasks. Vision and Autonomous Systems Center’s Image Database [75] from Carnegie
Mellon University and the Berkeley Segmentation dataset [76] can be used to test the performance
of classification for image segmentation problems. An extensive list of Web sites that provide image
databases is given in [79] and [84].
25.4 Summary
This chapter presents a summary of the key resources for data classification in terms of books,
surveys, and commercial and non-commercial software packages. It is expected that many of these
resources will evolve over time. Therefore, the reader is advised to use this chapter as a general
guideline on which to base their search, rather than treating it as a comprehensive compendium.
Since data classification is a rather broad area, much of the available software and practical imple-
mentations have not kept up with the large number of recent advances in the field. This has also
been true of the more general books in the field, which discuss the basic methods, but not the recent
advances such as big data, uncertain data, or network classification. This book is an attempt to
bridge the gap in this rather vast field by covering the different areas of data classification in detail.
Bibliography
[1] C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.
[3] C. Aggarwal. Social Network Data Analytics, Springer, Chapter 5, 2011.
[4] C. Aggarwal, H. Wang. Managing and Mining Graph Data, Springer, 2010.
[5] C. Aggarwal, C. Zhai. Mining Text Data, Springer, 2012.
[6] C. Aggarwal, C. Zhai. A survey of text classification algorithms, In Mining Text Data, pages
163–222, Springer, 2012.
[7] C. Aggarwal. Towards effective and interpretable data mining by visual interaction, ACM
SIGKDD Explorations, 3(2):11–22, 2002.
[8] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms, Machine Learning,
6(1):37–66, 1991.
[9] D. Aha. Lazy learning: Special issue editorial. Artificial Intelligence Review, 11(1–5): 7–10,
1997.
[10] C. Bishop. Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[33] S. J. Pan, Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data
Engineering, 22(10):1345–1359, 2010.
[34] J. R. Quinlan, Induction of decision trees, Machine Learning, 1(1):81–106, 1986.
[35] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys,
34(1):1–47, 2002.
[36] B. Settles. Active Learning, Morgan and Claypool, 2012.
[37] T. Soukup, I. Davidson. Visual Data Mining: Techniques and Tools for Data Visualization,
Wiley, 2002.
[38] B. Scholkopf, A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization,
Optimization, and Beyond. Cambridge University Press, 2001.
[39] B. Scholkopf and A. J. Smola. Learning with Kernels. Cambridge, MA, MIT Press, 2002.
[40] I. Steinwart and A. Christmann. Support Vector Machines, Springer, 2008.
[41] V. Vapnik. The Nature of Statistical Learning Theory, Springer, 2000.
[42] T. White. Hadoop: The Definitive Guide. Yahoo! Press, 2011.
[43] D. Wettschereck, D. Aha, T. Mohri. A review and empirical evaluation of feature weighting
methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–5):273–
314, 1997.
[44] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. SIGKDD Explo-
rations, 12(1):40–48, 2010.
[45] L. Yang. Distance Metric Learning: A Comprehensive Survey, 2006. http://www.cs.cmu.
edu/~liuy/frame_survey_v2.pdf
[46] X. Zhu, and A. Goldberg. Introduction to Semi-Supervised Learning, Morgan and Claypool,
2009
[47] http://mallet.cs.umass.edu/
[48] http://www.cs.ucr.edu/~eamonn/time_series_data/
[49] http://www.kdnuggets.com/datasets/
[50] http://www.cs.waikato.ac.nz/ml/weka/
[51] http://www.kdnuggets.com/software/classification.html
[52] http://www-01.ibm.com/software/analytics/spss/products/modeler/
[53] http://www.sas.com/technologies/analytics/datamining/miner/index.html
[54] http://www.mathworks.com/
[55] http://www.ebi.ac.uk/uniprot/
[56] http://www-nbrf.georgetown.edu/pirwww/
[57] http://www.ncbi.nlm.nih.gov/genbank/
[58] http://www.ebi.ac.uk/embl/
[59] http://www.ebi.ac.uk/Databases/
[60] http://mips.helmholtz-muenchen.de/proj/ppi/
[61] http://string.embl.de/
[62] http://dip.doe-mbi.ucla.edu/dip/Main.cgi
[63] http://thebiogrid.org/
[64] http://www.ncbi.nlm.nih.gov/geo/
[65] http://www.gems-system.org/
[66] http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
[67] http://www.statoo.com/en/resources/anthill/Datamining/Data/
[68] http://www.csse.monash.edu.au/~dld/datalinks.html
[69] http://www.sigkdd.org/kddcup/
[70] http://lib.stat.cmu.edu/datasets/
[71] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[72] http://qwone.com/~jason/20Newsgroups/
[73] http://people.cs.umass.edu/~mccallum/data.html
[74] http://snap.stanford.edu/data/
[75] http://vasc.ri.cmu.edu/idb/
[76] http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
[77] http://www.imageclef.org/
[78] http://www.image-net.org/
[79] http://www.imageprocessingplace.com/root_files_V3/image_databases.htm
[80] http://datamarket.com/data/list/?q=provider:tsdl
[81] http://www.cs.cmu.edu/~mccallum/bow/
[82] http://trec.nist.gov/data.html
[83] http://www.kdnuggets.com/competitions/index.html
[84] http://www.cs.cmu.edu/~cil/v-images.html
[85] http://www.salford-systems.com/
[86] http://www.rulequest.com/Personal/
[87] http://www.stat.berkeley.edu/users/breiman/RandomForests/
[88] http://www.comp.nus.edu.sg/~dm2/
[89] http://www.cs.cmu.edu/~wcohen/#sw
[90] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[91] http://www.kxen.com/
[92] http://www.tiberius.biz/
[93] http://www.esat.kuleuven.be/sista/lssvmlab/
[94] http://treparel.com/
[95] http://svmlight.joachims.org/
[96] http://www.kernel-machines.org/
[97] http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html
[98] ftp://ftp.sas.com/pub/neural/FAQ.html
[99] http://www.neuroxl.com/
[100] http://www.mathworks.com/products/neural-network/index.html
[101] http://www-01.ibm.com/software/analytics/spss/
[102] http://www.kdnuggets.com/software/classification-other.html
[103] http://datamarket.com
[104] http://analyse-it.com/products/method-evaluation/
[105] http://www.oracle.com/us/products/database/options/advanced-analytics/
overview/index.html
FIGURE 12.2: Time domain waveform of the same speech from different persons [80]. (Two panels; horizontal axis: Time (sec).)
FIGURE 12.3: Frequency response of the same speech in Figure 12.2 [80]. (Two panels; axes: Frequency (Hz) versus Time (sec).)
FIGURE 12.4: Example images of President George W. Bush showing huge intra-person variations.
FIGURE 12.5: Samples of SIFT detector from [51]: interest points detected from two images of a
similar scene. For each point, the circle indicates the scale. It also shows the matching between the
SIFT points in the two images.
(a) The Drum Tower of Xi’an (b) Qianmen Gate of Beijing (c) Different viewing angle of “Lincoln Memorial”
FIGURE 12.8: The challenges of landmark recognition: (a) and (b) are photos of Drum Tower of
Xi’an and Qianmen Gate of Beijing, respectively. Though of different locations, the visual appear-
ances are similar due to historical and style reasons. (c) shows three disparate viewing angles of the
Lincoln Memorial in Washington, DC.
FIGURE 12.9: The same landmark will differ in appearance from different viewing angles. Typical
approaches use clustering to get the visual information for each viewing angle. (From Liu et al. [52].)
FIGURE 15.1: (a) An illustration of the common i.i.d. supervised learning setting. Here each in-
stance is represented by a subgraph consisting of a label node (blue) and several local feature nodes
(purple). (b) The same problem, cast in the relational setting, with links connecting instances in the
training and test sets, respectively. The instances are no longer independent. (c) A relational
learning problem in which each node has a varying number of local features and relationships, im-
plying that the nodes are neither independent nor identically distributed. (d) The same problem, with
relationships (links) between the training and test set.
FIGURE 15.2: Example BN for collective classification. Label nodes (green) determine features
(purple), which are represented by a single vector-valued variable. An edge variable (yellow) is
defined for all potential edges in the data graph. In (a), labels are determined by link structure,
representing contagion. In (b), links are functions of labels, representing homophily. Both structures
are acyclic.
FIGURE 20.1: Unlabeled examples and prior belief.
(a) WiFi RSS received by device A in T1. (b) WiFi RSS received by device A in T2.
(c) WiFi RSS received by device B in T1. (d) WiFi RSS received by device B in T2.
FIGURE 21.1: Contours of RSS values over a 2-dimensional environment collected from the same
access point in different time periods and received by different mobile devices. Different colors
denote different values of signal strength.
FIGURE 23.2: Parallel coordinates. In this example, each object has four dimensions and represents
the characteristics of a species of iris flower (petal and sepal width and length in logarithmic scale).
The three types of lines represent the three kinds of iris. With parallel coordinates, it is easy to see
common patterns among flowers of the same kind; however, edge cluttering is already visible even
with a small dataset.
FIGURE 23.7: Treemaps. This plot represents a 2010 dataset of population size and gross
national income for each country. The size of each node of the treemap is proportional to the size
of the population, while the shade of blue of each box represents the gross national income of that
country. The countries of a continent are grouped together into a rectangular area.